# Summary of the Code

This code implements a **Gradient Boosting Classifier** to classify emoticon sequences. It loads the dataset and separates it into features (`X`, the emoticon strings) and labels (`Y`). Each unique character in the emoticon strings is assigned an integer, each string is converted into a list of these integers, and the lists are zero-padded to a common length. Because the mapping in this cell starts at 0, the padding value shares its index with one real character.
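The mapping and padding steps can be sketched in plain Python. This is a minimal illustration rather than the notebook's exact code: the helper names `build_vocab` and `encode_and_pad` and the sample strings are hypothetical, the characters are sorted so the mapping is the same on every run, and index 0 is reserved for padding so it cannot collide with a real character.

```python
def build_vocab(strings):
    """Map each unique character to an integer starting at 1 (0 is kept for padding)."""
    chars = sorted(set("".join(strings)))  # sorted -> reproducible ordering
    return {ch: idx + 1 for idx, ch in enumerate(chars)}


def encode_and_pad(strings, vocab, pad_value=0):
    """Convert strings to integer lists and right-pad them to the longest length."""
    encoded = [[vocab[ch] for ch in s] for s in strings]
    max_len = max(len(seq) for seq in encoded)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in encoded]


# Hypothetical stand-ins for the emoticon strings in the dataset
samples = ["abc", "ca", "b"]
vocab = build_vocab(samples)
padded = encode_and_pad(samples, vocab)

print(vocab)
print(padded)
```

If every string in the dataset already has the same length, the padding step leaves the sequences unchanged.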

The padded data is split into training and test sets (80% and 20%, with a fixed `random_state` of 42). The Gradient Boosting Classifier is trained on the training set and used to predict labels for the test set. Performance is measured with the accuracy score; the recorded run prints 0.6158.
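For reference, accuracy is simply the fraction of predictions that match the true labels. The sketch below computes it by hand on made-up labels; the function name `accuracy` and the sample values are illustrative assumptions, not taken from the notebook.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels and predictions: 3 of 5 match
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
score = accuracy(y_true, y_pred)
print(f"Accuracy: {score:.4f}")
```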



In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Load the dataset
emoticon_data = pd.read_csv("train_emoticon.csv")

# Split data into features (X) and target (Y)
X = emoticon_data['input_emoticon']
Y = emoticon_data['label']

# Step 1: Create a unique mapping for each emoticon
unique_emoticons = set(''.join(X))  # Create a set of all unique emoticons
emoticon_to_num = {emoticon: idx for idx, emoticon in enumerate(unique_emoticons)}

# Step 2: Convert each emoticon string to a list of numbers
X_numeric = [[emoticon_to_num[emoticon] for emoticon in emoticon_string] for emoticon_string in X]

# Step 3: Pad sequences manually using NumPy
max_len = max(len(seq) for seq in X_numeric)  # Get the maximum sequence length
X_padded = np.array([seq + [0] * (max_len - len(seq)) for seq in X_numeric])  # Pad with zeros

# Step 4: Split the data into training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X_padded, Y, test_size=0.2, random_state=42)

# Step 5: Initialize and train the Gradient Boosting model
gb_model = GradientBoostingClassifier(random_state=42)
gb_model.fit(X_train, Y_train)

# Step 6: Predict on the test set
Y_pred = gb_model.predict(X_test)

# Step 7: Evaluate the model
accuracy = accuracy_score(Y_test, Y_pred)
print(f"Gradient Boosting Accuracy: {accuracy:.4f}")

Gradient Boosting Accuracy: 0.6158


In [None]:
!pip install tensorflow



# Summary of the Code

This code implements an **LSTM (Long Short-Term Memory) neural network** to classify emoticon sequences.

- **Data Preparation**: The dataset is loaded, and a unique mapping of each emoticon to a numeric value is created. The sequences are then padded to ensure uniform length using Keras' `pad_sequences`. The target variable (`Y`) is label-encoded and converted to one-hot encoding for multi-class classification.

- **Model Architecture**: The neural network uses an Embedding layer to transform the numeric values of the emoticons into dense vectors, followed by an LSTM layer to capture sequence dependencies. A Dropout layer is added to prevent overfitting, and the final Dense layer uses softmax for multi-class classification.

- **Training and Evaluation**: The model is compiled with the Adam optimizer and categorical cross-entropy loss, then trained for 10 epochs with a batch size of 32. The test split also serves as the validation set during training. In the recorded run, validation accuracy peaks at 95.41% (epoch 8) and is 94.84% at epoch 9, the last epoch shown in the log.
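To make the softmax output and the categorical cross-entropy loss concrete, here is a small pure-Python sketch. The function names and the `eps` guard are illustrative assumptions and do not reproduce Keras internals. The example uses a hypothetical two-class case with equal logits, where the loss is ln 2 ≈ 0.6931, the value an untrained two-class model would be expected to start near.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot, probs, eps=1e-7):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))


# Hypothetical two-class example: equal logits give a uniform prediction
probs = softmax([0.0, 0.0])
loss = categorical_crossentropy([1, 0], probs)

print(probs)
print(f"{loss:.4f}")
```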



In [3]:
import pandas as pd
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

# Load the dataset
emoticon_data = pd.read_csv("train_emoticon.csv")

# Split data into features (X) and target (Y)
X = emoticon_data['input_emoticon']
Y = emoticon_data['label']

# Step 1: Create a unique mapping for each emoticon
unique_emoticons = set(''.join(X))  # Create a set of all unique emoticons
emoticon_to_num = {emoticon: idx + 1 for idx, emoticon in enumerate(unique_emoticons)}  # Map emoticons to numbers, +1 for padding index

# Convert each emoticon string to a list of numbers
X_numeric = [[emoticon_to_num[emoticon] for emoticon in emoticon_string] for emoticon_string in X]

# Step 2: Pad sequences using Keras' pad_sequences
max_len = max(len(seq) for seq in X_numeric)  # Get the maximum sequence length
X_padded = pad_sequences(X_numeric, maxlen=max_len, padding='post', value=0)  # Pad with zeros

# Step 3: Label encode the target variable (Y)
label_encoder = LabelEncoder()
Y_encoded = label_encoder.fit_transform(Y)
Y_categorical = to_categorical(Y_encoded)  # Convert to one-hot encoding if multi-class classification

# Step 4: Split the data into training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X_padded, Y_categorical, test_size=0.2, random_state=42)

# Step 5: Define the LSTM model
vocab_size = len(emoticon_to_num) + 1  # +1 for padding
embedding_dim = 64
lstm_units = 128

model = Sequential()

# Add Embedding layer (converts each emoticon number to a dense vector of given size)
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_len))

# Add LSTM layer
model.add(LSTM(lstm_units, return_sequences=False))

# Add Dropout to prevent overfitting
model.add(Dropout(0.3))

# Add a Dense output layer with softmax for classification
model.add(Dense(Y_categorical.shape[1], activation='softmax'))  # Assuming multi-class classification

# Step 6: Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Step 7: Train the model
batch_size = 32
epochs = 10

history = model.fit(X_train, Y_train, validation_data=(X_test, Y_test), batch_size=batch_size, epochs=epochs, verbose=2)

# Step 8: Evaluate the model
test_loss, test_accuracy = model.evaluate(X_test, Y_test)
print(f"Neural Network Test Accuracy: {test_accuracy:.4f}")


Epoch 1/10
177/177 - 6s - 32ms/step - accuracy: 0.7209 - loss: 0.5223 - val_accuracy: 0.8715 - val_loss: 0.3139
Epoch 2/10
177/177 - 5s - 26ms/step - accuracy: 0.8895 - loss: 0.2525 - val_accuracy: 0.9075 - val_loss: 0.2299
Epoch 3/10
177/177 - 3s - 17ms/step - accuracy: 0.9190 - loss: 0.1905 - val_accuracy: 0.9336 - val_loss: 0.1731
Epoch 4/10
177/177 - 5s - 29ms/step - accuracy: 0.9340 - loss: 0.1569 - val_accuracy: 0.9463 - val_loss: 0.1356
Epoch 5/10
177/177 - 6s - 32ms/step - accuracy: 0.9499 - loss: 0.1219 - val_accuracy: 0.9400 - val_loss: 0.1439
Epoch 6/10
177/177 - 4s - 25ms/step - accuracy: 0.9522 - loss: 0.1151 - val_accuracy: 0.9435 - val_loss: 0.1389
Epoch 7/10
177/177 - 7s - 38ms/step - accuracy: 0.9605 - loss: 0.0960 - val_accuracy: 0.9477 - val_loss: 0.1189
Epoch 8/10
177/177 - 4s - 20ms/step - accuracy: 0.9684 - loss: 0.0810 - val_accuracy: 0.9541 - val_loss: 0.1030
Epoch 9/10
177/177 - 5s - 29ms/step - accuracy: 0.9723 - loss: 0.0714 - val_accuracy: 0.9484 - val_loss: 0.1310
Epo

# Summary of the Code

This code implements an LSTM-based neural network for emoticon classification using a dataset of emoticons and their corresponding labels. The process starts by loading the dataset and converting each emoticon string into a numeric sequence. A unique numeric mapping is created for each emoticon, and the sequences are padded to ensure uniform length. The target labels are converted into a one-hot encoded format for multi-class classification.
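The one-hot step can be illustrated without Keras. In the sketch below, the helper names `label_encode` and `one_hot` and the string labels are hypothetical. It shows why labels that are not already integers starting at 0 need an encoding step before they can be turned into one-hot rows.

```python
def label_encode(labels):
    """Map each distinct label to an integer index (sorted for reproducibility)."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels], classes


def one_hot(indices, num_classes):
    """Build one row per sample with a 1.0 in the column of its class."""
    return [[1.0 if i == k else 0.0 for k in range(num_classes)] for i in indices]


# Hypothetical labels; the notebook's real labels come from the 'label' column
labels = ["happy", "sad", "happy", "angry"]
encoded, classes = label_encode(labels)
matrix = one_hot(encoded, len(classes))

print(classes)
print(encoded)
print(matrix)
```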

Only 40% of the data is used in this experiment. The code first holds out 20% for testing, then keeps a quarter of the remaining 80% (20% of the total) for training and discards the other 60%. The model is built using a sequential architecture with an embedding layer that converts the integer-encoded emoticons into dense vectors. This is followed by an LSTM layer to capture sequential patterns, a dropout layer to reduce overfitting, and a final dense layer with softmax activation to predict the class of each input sequence.
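The arithmetic behind the two-stage split can be checked with a few lines of Python. The dataset size of 7080 rows is an assumption, chosen because it is consistent with the 177 and 45 steps per epoch shown in the training logs at batch size 32. The helper `split_sizes` is hypothetical and rounds the held-out part up, which is how I understand scikit-learn to size the test set.

```python
def ceil_div(a, b):
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def split_sizes(n_samples, held_out_percent):
    """Return (kept, held_out) sizes for a percentage split."""
    held_out = ceil_div(n_samples * held_out_percent, 100)
    return n_samples - held_out, held_out


n_total = 7080   # assumed dataset size, inferred from the logged step counts
batch_size = 32

# Stage 1: hold out 20% of all rows for testing
n_train_val, n_test = split_sizes(n_total, 20)
# Stage 2: set aside 75% of the remainder, keeping 25% for training
n_train, n_discarded = split_sizes(n_train_val, 75)

print(n_train, n_test, n_discarded)
print(n_train / n_total, n_discarded / n_total)
print(ceil_div(n_train, batch_size))      # steps per epoch on the reduced set
print(ceil_div(n_train_val, batch_size))  # steps per epoch on the full 80%
```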

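The model is compiled with the Adam optimizer and categorical cross-entropy loss, and is trained for 10 epochs with a batch size of 32. After training, the model achieves a test accuracy of about 86%, a reasonable result for a model trained on only a fifth of the data. In the log, validation accuracy peaks at 87.57% in epoch 7, and validation loss rises from epoch 7 onward while training accuracy climbs above 95%, which suggests overfitting on the smaller training set.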
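One way to read the training log is to extract the validation metrics and find the best epoch. The sketch below does this with a regular expression over the nine complete epoch lines recorded for this run; the variable names are illustrative, and the tenth epoch is left out because its line is cut off in the output.

```python
import re

# Metric lines for epochs 1-9, copied from the recorded training output
log = """\
45/45 - 3s - 77ms/step - accuracy: 0.5388 - loss: 0.6913 - val_accuracy: 0.5078 - val_loss: 0.6883
45/45 - 2s - 50ms/step - accuracy: 0.6716 - loss: 0.6019 - val_accuracy: 0.7754 - val_loss: 0.4875
45/45 - 1s - 28ms/step - accuracy: 0.8404 - loss: 0.3713 - val_accuracy: 0.8319 - val_loss: 0.3721
45/45 - 1s - 28ms/step - accuracy: 0.8905 - loss: 0.2650 - val_accuracy: 0.8496 - val_loss: 0.3626
45/45 - 2s - 46ms/step - accuracy: 0.9202 - loss: 0.2288 - val_accuracy: 0.8545 - val_loss: 0.3256
45/45 - 2s - 40ms/step - accuracy: 0.9371 - loss: 0.1843 - val_accuracy: 0.8637 - val_loss: 0.3247
45/45 - 1s - 26ms/step - accuracy: 0.9308 - loss: 0.1778 - val_accuracy: 0.8757 - val_loss: 0.3328
45/45 - 1s - 28ms/step - accuracy: 0.9484 - loss: 0.1349 - val_accuracy: 0.8672 - val_loss: 0.3713
45/45 - 1s - 22ms/step - accuracy: 0.9527 - loss: 0.1314 - val_accuracy: 0.8517 - val_loss: 0.3857
"""

pattern = re.compile(r"val_accuracy: ([\d.]+) - val_loss: ([\d.]+)")
# One (val_accuracy, val_loss) pair per epoch, in order
metrics = [(float(acc), float(loss)) for acc, loss in pattern.findall(log)]

# Epoch numbers are 1-based
best_acc_epoch = max(range(len(metrics)), key=lambda i: metrics[i][0]) + 1
best_loss_epoch = min(range(len(metrics)), key=lambda i: metrics[i][1]) + 1

print(f"Highest val_accuracy: epoch {best_acc_epoch}")
print(f"Lowest val_loss: epoch {best_loss_epoch}")
```

Stopping near the epoch with the lowest validation loss is a common way to limit overfitting, although a separate validation split would be needed to keep the test set untouched.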

This pipeline showcases how LSTM networks can be effectively applied to sequence-based classification problems like emoticon recognition.


In [4]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding, Dropout
from tensorflow.keras.utils import to_categorical

# Load the dataset
emoticon_data = pd.read_csv("train_emoticon.csv")

# Split data into features (X) and target (Y)
X = emoticon_data['input_emoticon']
Y = emoticon_data['label']

# Step 1: Create a unique mapping for each emoticon
unique_emoticons = set(''.join(X))  # Create a set of all unique emoticons
emoticon_to_num = {emoticon: idx + 1 for idx, emoticon in enumerate(unique_emoticons)}  # +1 for padding

# Step 2: Convert each emoticon string to a list of numbers
X_numeric = [[emoticon_to_num[emoticon] for emoticon in emoticon_string] for emoticon_string in X]

# Step 3: Pad sequences manually using NumPy
max_len = max(len(seq) for seq in X_numeric)  # Get the maximum sequence length
X_padded = np.array([seq + [0] * (max_len - len(seq)) for seq in X_numeric])  # Pad with zeros

# Step 4: Convert labels to categorical format
Y_categorical = to_categorical(Y)

# Step 5: Split the data into 20% training and 20% testing
# First, split off the 20% test set from the full data
X_train_val, X_test, Y_train_val, Y_test = train_test_split(X_padded, Y_categorical, test_size=0.2, random_state=42)

# Then, split off 25% of the remaining 80% for training (which is 20% of the total)
X_train, _, Y_train, _ = train_test_split(X_train_val, Y_train_val, test_size=0.75, random_state=42)

# Step 6: Define the LSTM model
vocab_size = len(emoticon_to_num) + 1  # +1 for padding
embedding_dim = 64
lstm_units = 128

model = Sequential()

# Add Embedding layer (converts each emoticon number to a dense vector of given size)
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_len))

# Add LSTM layer
model.add(LSTM(lstm_units, return_sequences=False))

# Add Dropout to prevent overfitting
model.add(Dropout(0.3))

# Add a Dense output layer with softmax for classification
model.add(Dense(Y_categorical.shape[1], activation='softmax'))  # Assuming multi-class classification

# Step 7: Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Step 8: Train the model on 20% training data
batch_size = 32
epochs = 10

history = model.fit(X_train, Y_train, validation_data=(X_test, Y_test), batch_size=batch_size, epochs=epochs, verbose=2)

# Step 9: Evaluate the model
test_loss, test_accuracy = model.evaluate(X_test, Y_test)
print(f"Neural Network Test Accuracy: {test_accuracy:.4f}")


Epoch 1/10
45/45 - 3s - 77ms/step - accuracy: 0.5388 - loss: 0.6913 - val_accuracy: 0.5078 - val_loss: 0.6883
Epoch 2/10
45/45 - 2s - 50ms/step - accuracy: 0.6716 - loss: 0.6019 - val_accuracy: 0.7754 - val_loss: 0.4875
Epoch 3/10
45/45 - 1s - 28ms/step - accuracy: 0.8404 - loss: 0.3713 - val_accuracy: 0.8319 - val_loss: 0.3721
Epoch 4/10
45/45 - 1s - 28ms/step - accuracy: 0.8905 - loss: 0.2650 - val_accuracy: 0.8496 - val_loss: 0.3626
Epoch 5/10
45/45 - 2s - 46ms/step - accuracy: 0.9202 - loss: 0.2288 - val_accuracy: 0.8545 - val_loss: 0.3256
Epoch 6/10
45/45 - 2s - 40ms/step - accuracy: 0.9371 - loss: 0.1843 - val_accuracy: 0.8637 - val_loss: 0.3247
Epoch 7/10
45/45 - 1s - 26ms/step - accuracy: 0.9308 - loss: 0.1778 - val_accuracy: 0.8757 - val_loss: 0.3328
Epoch 8/10
45/45 - 1s - 28ms/step - accuracy: 0.9484 - loss: 0.1349 - val_accuracy: 0.8672 - val_loss: 0.3713
Epoch 9/10
45/45 - 1s - 22ms/step - accuracy: 0.9527 - loss: 0.1314 - val_accuracy: 0.8517 - val_loss: 0.3857
Epoch 10/10
45/45 - 1