# LSTM Training for Sentiment Classification (Google Colab)

This notebook was run in **Google Colab** to take advantage of GPU acceleration during training.
It loads tweet embeddings previously generated with Word2Vec and uses them to train an LSTM model for sentiment classification.

In [14]:
!pip install tensorflow



In [13]:
import ast

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
# Import everything from tensorflow.keras rather than mixing it with the standalone keras package
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.layers import LSTM, BatchNormalization, Dense, Dropout
from tensorflow.keras.models import Sequential

## Mounting Google Drive

We mount Google Drive to access the dataset (`sentiment140_vectors.csv`).

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Loading the Dataset with Tweet Embeddings

We load the CSV file containing tweet vectors and sentiment labels.


In [5]:
data_path = "/content/drive/MyDrive/sentiment140-project/data/sentiment140_vectors.csv"
df = pd.read_csv(data_path)

# Convert string representation of vectors into actual lists
df["vector"] = df["vector"].apply(ast.literal_eval)
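
As a small illustration of what `ast.literal_eval` does in this step, the sketch below parses two made-up vector strings (three dimensions instead of 100, values hypothetical) and stacks them into a `float32` matrix. It assumes the `vector` column stores each embedding as a Python-style list literal such as `"[0.1, -0.2, 0.3]"`.

```python
import ast

import numpy as np

# Hypothetical rows, shortened to 3 dimensions for readability.
raw_vectors = ["[0.1, -0.2, 0.3]", "[0.0, 0.5, -1.0]"]

# literal_eval evaluates Python literals only (no function calls),
# turning each string into a list of floats.
parsed = [ast.literal_eval(text) for text in raw_vectors]

# Stack the per-tweet lists into one (n_tweets, vector_size) array.
matrix = np.vstack(parsed).astype(np.float32)
print(matrix.shape, matrix.dtype)
```

Parsing every row this way is slow on a dataset of this size, so storing the vectors in a binary format could be worth considering; that would be a change to the pipeline, not something this notebook does.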

## Preparing Input and Labels

Keras LSTM layers expect input of shape `(samples, timesteps, features)`.
Each tweet is represented by a single vector, so we add a timestep axis of length 1.


In [6]:
X = np.vstack(df["vector"].values).astype(np.float32)
y = df["sentiment"].values

# Reshape to (samples, 1 timestep, vector_size)
X = X.reshape((X.shape[0], 1, X.shape[1]))


In [7]:
print(X.shape)
print(y.shape)

(1581466, 1, 100)
(1581466,)
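
To make the added axis concrete, the following sketch applies the same reshape to a small made-up array (4 samples, 5 features; the sizes are illustrative only). With a single timestep, each tweet reaches the LSTM as a one-step sequence, so the layer sees no word order.

```python
import numpy as np

# Toy stand-in for the tweet matrix: 4 samples, 5 features each.
X_toy = np.arange(20, dtype=np.float32).reshape(4, 5)

# Same reshape as in the notebook: insert a timestep axis of length 1.
X_seq = X_toy.reshape((X_toy.shape[0], 1, X_toy.shape[1]))

# Equivalent spelling with explicit axis insertion.
X_alt = np.expand_dims(X_toy, axis=1)

print(X_seq.shape)
print(np.array_equal(X_seq, X_alt))
```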


## Splitting Data into Training and Validation Sets


In [8]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)


## Building the LSTM Model

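We define a sequential model with two stacked LSTM layers (128 and 64 units).
Each LSTM layer is followed by BatchNormalization and Dropout (rate 0.3) to stabilize training and reduce overfitting.
The final Dense layer uses a sigmoid activation to output a single probability for binary classification.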


In [15]:
model = Sequential([
    LSTM(128, return_sequences=True, input_shape=(X.shape[1], X.shape[2])),
    BatchNormalization(),
    Dropout(0.3),

    LSTM(64),
    BatchNormalization(),
    Dropout(0.3),

    Dense(1, activation="sigmoid")
])
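
For a sense of the model's size, the weights can be counted by hand. The sketch below uses the standard parameter formulas for LSTM, BatchNormalization, and Dense layers, assuming Keras defaults (biases enabled, four weight vectors per normalized feature) and the 100-dimensional input shown earlier. The helper names are illustrative, and the total is expected to agree with what `model.summary()` reports.

```python
def lstm_params(input_dim, units):
    # Four gates, each with an input kernel, a recurrent kernel, and a bias.
    return 4 * (units * (input_dim + units) + units)


def batchnorm_params(features):
    # gamma and beta are trainable; moving mean and variance are not.
    return 4 * features


def dense_params(input_dim, units):
    # One weight per input-output pair plus one bias per output.
    return input_dim * units + units


layers = [
    ("LSTM(128)", lstm_params(100, 128)),
    ("BatchNormalization", batchnorm_params(128)),
    ("LSTM(64)", lstm_params(128, 64)),
    ("BatchNormalization", batchnorm_params(64)),
    ("Dense(1)", dense_params(64, 1)),
]

for name, count in layers:
    print(f"{name:<20}{count:>8,}")

total = sum(count for _, count in layers)
print(f"{'Total':<20}{total:>8,}")
```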



## Compiling the Model

We compile the model using binary crossentropy loss and the Adam optimizer.  
Accuracy will be used as the main evaluation metric.


In [16]:
model.compile(
    loss="binary_crossentropy",
    optimizer="adam",
    metrics=["accuracy"]
)
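
To show what the loss and metric measure, here is a minimal pure-Python sketch of binary cross-entropy and thresholded accuracy. The labels and probabilities are made up, the function names are illustrative, and the clipping constant is an assumption. This is not the Keras implementation. Both quantities assume labels encoded as 0 and 1, which is what a single sigmoid output models.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over a batch of 0/1 labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


# Hypothetical labels and sigmoid outputs for four tweets.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]

loss = binary_crossentropy(y_true, y_prob)
acc = binary_accuracy(y_true, y_prob)
print(f"loss={loss:.4f} accuracy={acc:.2f}")
```

The loss penalizes confident wrong predictions more heavily than accuracy does, which is one reason to monitor `val_loss` and not only accuracy.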


## Setting up Callbacks

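We use EarlyStopping to stop training once `val_loss` has not improved for 3 consecutive epochs, restoring the weights from the best epoch. ReduceLROnPlateau halves the learning rate after 2 epochs without improvement in `val_loss`.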


In [17]:
callbacks = [
    EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True),
    ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=2)
]
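
The interaction between the two patience values can be traced with a simplified pure-Python imitation of the callbacks' bookkeeping. This is a sketch, not the Keras code: it ignores cooldown and minimum learning rate, and it assumes a small improvement threshold (`min_delta`) for the learning-rate rule and none for early stopping. The `val_loss` sequence is hypothetical.

```python
def simulate_callbacks(val_losses, lr=1e-3, es_patience=3,
                       lr_patience=2, factor=0.5, min_delta=1e-4):
    """Return (stopped_epoch, best_epoch, final_lr) using 0-based epochs."""
    es_best, es_wait, best_epoch = float("inf"), 0, None
    lr_best, lr_wait = float("inf"), 0

    for epoch, loss in enumerate(val_losses):
        # Learning-rate rule: cut the rate after lr_patience stale epochs.
        if loss < lr_best - min_delta:
            lr_best, lr_wait = loss, 0
        else:
            lr_wait += 1
            if lr_wait >= lr_patience:
                lr *= factor
                lr_wait = 0

        # Early-stopping rule: stop after es_patience stale epochs.
        if loss < es_best:
            es_best, es_wait, best_epoch = loss, 0, epoch
        else:
            es_wait += 1
            if es_wait >= es_patience:
                return epoch, best_epoch, lr

    return None, best_epoch, lr


# Hypothetical validation losses: improvement stalls after the third epoch.
losses = [0.50, 0.48, 0.47, 0.475, 0.472, 0.471, 0.473]
stopped, best, final_lr = simulate_callbacks(losses)
print(stopped, best, final_lr)
```

Because the learning-rate patience is shorter than the early-stopping patience, the rate is reduced once before training stops, giving the model one more chance to improve at the lower rate.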

## Training the Model

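We train on the 80% training split and validate on the remaining 20%, for up to 10 epochs with a batch size of 64.
In the run recorded here, each epoch took roughly 200 to 260 seconds, so a full run takes around half an hour, depending on dataset size and available resources.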


In [19]:
history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=10,
    batch_size=64,
    callbacks=callbacks
)
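
The number of steps Keras reports per epoch follows from the split and the batch size. The sketch below reproduces that arithmetic. It assumes scikit-learn rounds the validation size up when `test_size` is given as a fraction, and the result matches the `19769` steps shown in the training log.

```python
import math

n_samples = 1_581_466  # rows in X, from the shape printed earlier
test_size = 0.2
batch_size = 64

n_val = math.ceil(test_size * n_samples)  # validation rows
n_train = n_samples - n_val               # training rows

# The last batch of each epoch is smaller, so round up.
steps_per_epoch = math.ceil(n_train / batch_size)

print(n_train, n_val, steps_per_epoch)
```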


Epoch 1/10
19769/19769 ━━━━━━━━━━━━━━━━━━━━ 205s 10ms/step - accuracy: 0.7457 - loss: 0.5128 - val_accuracy: 0.7626 - val_loss: 0.4862 - learning_rate: 0.0010
Epoch 2/10
19769/19769 ━━━━━━━━━━━━━━━━━━━━ 262s 10ms/step - accuracy: 0.7565 - loss: 0.4974 - val_accuracy: 0.7658 - val_loss: 0.4814 - learning_rate: 0.0010
Epoch 3/10
19769/19769 ━━━━━━━━━━━━━━━━━━━━ 199s 10ms/step - accuracy: 0.7593 - loss: 0.4927 - val_accuracy: 0.7685 - val_loss: 0.4769 - learning_rate: 0.0010
Epoch 4/10
19769/19769 ━━━━━━━━━━━━━━━━━━━━ 204s 10ms/step - accuracy: 0.7615 - loss: 0.4899 - val_accuracy: 0.7700 - val_loss: 0.4743 - learning_rate: 0.0010
Epoch 5/10
19769/19769 ━━━━━━━━━━━━━━━━━━━━ 205s 10ms/step - accuracy: 0.7625 - loss: 0.4875 - val_accuracy: 0.7714 - val_loss: 0.4726 - learning_rate: 0.0010
Epoch 6/10
19769/19769 [output truncated]

## Saving the Trained Model

We save the trained LSTM model to Google Drive.
The final model file has also been uploaded to the project's `models/` directory for organization and versioning.

In [20]:
model_path = "/content/drive/MyDrive/sentiment140-project/models/lstm_sentiment140.h5"
model.save(model_path)
print(f"Model saved to {model_path}")




Model saved to /content/drive/MyDrive/sentiment140-project/models/lstm_sentiment140.h5
