<a href="https://colab.research.google.com/github/Nour-Yasser/Tweet-Sentiment-Analysis/blob/main/TweetSentimentAnalysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
import re
import time
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, LSTM, Dense
from tensorflow.keras.optimizers import Adam

Importing necessary libraries:
This cell imports essential libraries for data handling (pandas), machine learning with TensorFlow (tensorflow and Keras modules), dataset splitting (train_test_split), regular expressions (re), and measuring training time (time). These tools are used throughout the project to build, train, and evaluate the RNN and LSTM models for sentiment analysis.

In [None]:
file_path = 'training.csv'
df = pd.read_csv(file_path, header = None, encoding = 'latin-1')

Loading the dataset:
This cell reads the tweet sentiment dataset from a CSV file named 'training.csv' into a pandas DataFrame. The file has no header row, so header=None is used. The 'latin-1' encoding maps every possible byte to a character, so the file loads without decoding errors even if some tweets contain bytes that are not valid UTF-8.
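The effect of the encoding choice can be checked with the standard library alone. The snippet below is a small sketch using made-up byte strings (not data from the CSV): it shows that a byte that is invalid as UTF-8 still decodes under latin-1, and that text that really was UTF-8 decodes without error but comes out garbled.

```python
# Made-up sample bytes, not taken from training.csv.
raw = b"caf\xe9"  # "café" encoded as latin-1

# Decoding as UTF-8 fails because \xe9 is not a complete UTF-8 sequence.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# latin-1 assigns a character to each of the 256 byte values, so it never fails.
latin1_text = raw.decode("latin-1")
every_byte = bytes(range(256)).decode("latin-1")

# The flip side: genuine UTF-8 bytes read as latin-1 load fine but look wrong.
garbled = "é".encode("utf-8").decode("latin-1")

print(utf8_ok)          # False
print(latin1_text)      # café
print(len(every_byte))  # 256
print(garbled)          # Ã©
```

So latin-1 guarantees that loading succeeds, not that every character is the one the tweet's author typed.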


In [None]:
df = df[[0, 5]].copy()  # explicit copy avoids pandas' SettingWithCopyWarning
df.columns = ['label', 'text']
df['label'] = df['label'].replace(4, 1)  # 0 = negative, 1 = positive
df.head()


   label                                               text
0      0  @switchfoot http://twitpic.com/2y1zl - Awww, t...
1      0  is upset that he can't update his Facebook by ...
2      0  @Kenichan I dived many times for the ball. Man...
3      0     my whole body feels itchy and like its on fire
4      0  @nationwideclass no, it's not behaving at all....


Select only the columns for sentiment label and tweet text

Rename columns to 'label' and 'text' for clarity

Convert positive labels from 4 to 1 for binary classification (0 = negative, 1 = positive)

Display the first few rows to verify the data looks correct
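The selection and relabeling steps can be tried on a toy frame. This is a sketch with invented rows (the three-column `raw` frame is hypothetical, not the real dataset); it takes an explicit `.copy()` of the selected columns, which is one common way to keep pandas from emitting SettingWithCopyWarning on the later assignment.

```python
import pandas as pd

# Hypothetical miniature of a headerless file: integer column names 0, 1, 5.
raw = pd.DataFrame({
    0: [0, 4, 4],
    1: ["id1", "id2", "id3"],
    5: ["bad day", "great day", "love it"],
})

toy = raw[[0, 5]].copy()                    # independent copy of the two columns
toy.columns = ["label", "text"]             # readable names
toy["label"] = toy["label"].replace(4, 1)   # 4 (positive) becomes 1

print(toy["label"].tolist())  # [0, 1, 1]
print(raw[0].tolist())        # [0, 4, 4]  (the original frame is unchanged)
```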

In [None]:
def preprocess_text(text):
    text = str(text).lower()  # ensure it's string + lowercase
    text = re.sub(r'@\w+', '', text)  # remove @mentions
    text = re.sub(r'http\S+|www\S+', '', text)  # remove URLs
    text = re.sub(r'#', '', text)  # remove hashtag symbols
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)  # remove punctuation
    text = re.sub(r'\s+', ' ', text).strip()  # trim extra spaces
    return text

df['text'] = df['text'].apply(preprocess_text)


Define a function to clean and preprocess tweet text:

Convert text to lowercase

Remove Twitter usernames (mentions starting with @)

Remove URLs starting with http or www

Remove hashtag symbols (#) but keep the words

Remove punctuation and special characters

Remove extra spaces and trim leading/trailing spaces

Apply this function to every tweet in the dataset to clean the text before modeling
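To see what the cleaning steps do to a single tweet, the function can be run on a made-up example (the sample strings below are invented for illustration). Note that stripping punctuation also removes apostrophes, so "can't" becomes "cant".

```python
import re

def preprocess_text(text):
    text = str(text).lower()                    # ensure string + lowercase
    text = re.sub(r'@\w+', '', text)            # remove @mentions
    text = re.sub(r'http\S+|www\S+', '', text)  # remove URLs
    text = re.sub(r'#', '', text)               # remove hashtag symbols
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)  # remove punctuation
    text = re.sub(r'\s+', ' ', text).strip()    # collapse and trim spaces
    return text

sample = "@user Check https://example.com NOW!!! #Happy   day :)"
cleaned = preprocess_text(sample)
contraction = preprocess_text("I can't wait")

print(cleaned)      # check now happy day
print(contraction)  # i cant wait
```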


In [None]:
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=10000, oov_token = "<OOV>")
tokenizer.fit_on_texts(df['text'])
sequence = tokenizer.texts_to_sequences(df['text'])
padded = tf.keras.preprocessing.sequence.pad_sequences(sequence, maxlen=50, padding='post')


Create a tokenizer to convert words into integer indexes, limiting vocabulary to the top 10,000 words

Include an out-of-vocabulary (<OOV>) token that stands in for any word outside the 10,000-word vocabulary, including words that were never seen when the tokenizer was fitted

Fit the tokenizer on the cleaned tweet texts to build the word index

Convert each tweet into a sequence of integers based on the tokenizer’s vocabulary

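Pad all sequences to a fixed length of 50 tokens: shorter tweets get zeros appended at the end (padding='post'), and longer ones are cut to 50 tokens (by default pad_sequences drops tokens from the start), so every input has the same size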
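The tokenizing and padding steps can be sketched in plain Python. This is a simplified illustration of the idea, not the Keras implementation: the helper names (`build_word_index`, `texts_to_sequences`, `pad_post`) and the toy sentences are made up, and real Keras tokenization also applies its own text filters.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    # Index 1 is reserved for the OOV token; words follow by descending frequency.
    counts = Counter(word for text in texts for word in text.split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def texts_to_sequences(texts, word_index, num_words, oov_token="<OOV>"):
    # Words that are unknown or ranked at num_words or beyond map to the OOV index.
    oov = word_index[oov_token]
    sequences = []
    for text in texts:
        ids = [word_index.get(word, oov) for word in text.split()]
        sequences.append([i if i < num_words else oov for i in ids])
    return sequences

def pad_post(sequences, maxlen):
    # Keep the last maxlen tokens, then append zeros up to maxlen.
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]
        padded.append(seq + [0] * (maxlen - len(seq)))
    return padded

texts = ["good day", "good good movie", "bad movie today"]
word_index = build_word_index(texts)
seqs = texts_to_sequences(texts, word_index, num_words=5)

print(seqs)                      # [[2, 4], [2, 2, 3], [1, 3, 1]]
print(pad_post(seqs, maxlen=4))  # [[2, 4, 0, 0], [2, 2, 3, 0], [1, 3, 1, 0]]
print(pad_post(seqs, maxlen=2))  # [[2, 4], [2, 3], [3, 1]]
```

In the toy run, "bad" and "today" rank below the `num_words` cutoff and are replaced by index 1, while index 0 is used only for padding.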

In [None]:
# Get features and labels
X = padded
y = df['label'].values

# Split: 80% train, 10% val, 10% test
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)


Assign the padded sequences as features (X) and the sentiment labels as targets (y)

Split the dataset into training (80%) and temporary sets (20%)

Further split the temporary set equally into validation (10%) and testing (10%) sets

Use a fixed random seed (random_state=42) for reproducible splits
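The two-stage split can be illustrated without scikit-learn. The sketch below uses only the standard library and a hypothetical helper, `split_80_10_10`; it follows the same logic (hold out 20%, then halve it) but does not reproduce the exact shuffle that train_test_split produces for random_state=42.

```python
import random

def split_80_10_10(n_samples, seed=42):
    """Return disjoint train/val/test index lists in an 80/10/10 ratio."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)   # seeded, so the split is repeatable

    n_train = n_samples * 8 // 10          # stage 1: 80% train, 20% held out
    train, temp = indices[:n_train], indices[n_train:]

    n_val = len(temp) // 2                 # stage 2: halve the held-out part
    val, test = temp[:n_val], temp[n_val:]
    return train, val, test

train_idx, val_idx, test_idx = split_80_10_10(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 800 100 100
```

Calling the helper again with the same seed returns the same three index lists, which is the property random_state=42 provides in the notebook.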

In [None]:
rnn_model = Sequential([
    # Sequence length (50, set by pad_sequences) is inferred from the input
    Embedding(input_dim=10000, output_dim=64),
    SimpleRNN(64),
    Dense(1, activation='sigmoid')
])


rnn_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

Build a simple RNN model using Keras Sequential API:

Add an Embedding layer to convert word indices into 64-dimensional vectors

Add a SimpleRNN layer with 64 units to process the sequential data

Add a Dense output layer with a sigmoid activation for binary classification

Compile the model using binary crossentropy loss, Adam optimizer, and track accuracy as a metric
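The sigmoid output and the binary crossentropy loss can be worked through by hand for a single prediction. This is a minimal sketch in plain Python; the clipping constant `eps` is an assumption added to avoid log(0), and the input values are made up.

```python
import math

def sigmoid(z):
    # Squashes any real-valued score into a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, p, eps=1e-7):
    # Loss for one example; eps (assumed value) keeps log() away from 0.
    p = min(max(p, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

p_neutral = sigmoid(0.0)                       # score 0 means "no idea"
loss_neutral = binary_crossentropy(1, p_neutral)
loss_confident_right = binary_crossentropy(1, sigmoid(3.0))
loss_confident_wrong = binary_crossentropy(0, sigmoid(3.0))

print(p_neutral)                # 0.5
print(round(loss_neutral, 4))   # 0.6931 (ln 2)
print(loss_confident_right < loss_neutral < loss_confident_wrong)  # True
```

A loss of about 0.693 therefore corresponds to guessing 0.5 for every tweet, which gives a baseline for reading the loss values in the training logs.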


In [None]:
# Train RNN
start_rnn = time.time()
rnn_history = rnn_model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=2, batch_size=128)
end_rnn = time.time()


Epoch 1/2
10000/10000 ━━━━━━━━━━━━━━━━━━━━ 64s 6ms/step - accuracy: 0.7628 - loss: 0.4945 - val_accuracy: 0.7943 - val_loss: 0.4423
Epoch 2/2
10000/10000 ━━━━━━━━━━━━━━━━━━━━ 60s 6ms/step - accuracy: 0.7992 - loss: 0.4370 - val_accuracy: 0.8011 - val_loss: 0.4365



Start a timer to measure training duration

Train the RNN model on the training data (X_train, y_train)

Use the validation set (X_val, y_val) to monitor performance during training

Train for 2 epochs with a batch size of 128 samples

Stop the timer after training completes to calculate total training time
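The step counts shown in the progress bars follow from the sample count and the batch size. The arithmetic below is a sketch that assumes the dataset has 1,600,000 rows; that figure is not stated in the notebook, but it is consistent with 10000 steps per epoch at batch size 128 and with 5000 evaluation steps at Keras's default evaluation batch size of 32.

```python
import math

def steps_per_epoch(n_samples, batch_size):
    # The last batch may be smaller, hence the ceiling.
    return math.ceil(n_samples / batch_size)

n_total = 1_600_000               # assumed dataset size
n_train = n_total * 80 // 100     # 80% training split
n_test = n_total * 10 // 100      # 10% test split

train_steps = steps_per_epoch(n_train, batch_size=128)
test_steps = steps_per_epoch(n_test, batch_size=32)

print(n_train, train_steps)  # 1280000 10000
print(n_test, test_steps)    # 160000 5000
```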


In [None]:
# Evaluate on test set
rnn_loss, rnn_accuracy = rnn_model.evaluate(X_test, y_test)
rnn_time = end_rnn - start_rnn

5000/5000 ━━━━━━━━━━━━━━━━━━━━ 16s 3ms/step - accuracy: 0.8016 - loss: 0.4353


Evaluate the trained RNN model’s performance on the unseen test set (X_test, y_test)

Calculate and store the test loss and accuracy metrics

Compute the total training time by subtracting start time from end time
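The start/stop timing pattern can be wrapped in a small helper so each model is timed the same way. This is an optional sketch, not what the notebook does: `timed` is a hypothetical helper, and it swaps in time.perf_counter(), a monotonic clock intended for measuring intervals, in place of time.time().

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(results, key):
    # Stores the elapsed wall-clock seconds of the with-block under results[key].
    start = time.perf_counter()
    try:
        yield
    finally:
        results[key] = time.perf_counter() - start

timings = {}
with timed(timings, "demo"):
    total = sum(range(100_000))   # stand-in for a call such as model.fit(...)

print(f"demo took {timings['demo']:.4f} s")
```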

In [None]:
# Build LSTM model
lstm_model = Sequential([
    # Sequence length (50, set by pad_sequences) is inferred from the input
    Embedding(input_dim=10000, output_dim=64),
    LSTM(64),
    Dense(1, activation='sigmoid')
])

lstm_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

Create an LSTM model using Keras Sequential API:

Add an Embedding layer to convert words into 64-dimensional vectors

Add an LSTM layer with 64 units to capture long-term dependencies in the text

Add a Dense output layer with sigmoid activation for binary sentiment classification

Compile the model with binary crossentropy loss, Adam optimizer, and accuracy as the evaluation metric
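The two architectures differ only in the recurrent layer, so their sizes can be compared with a little arithmetic. The sketch below uses the standard parameter-count formulas for these layer types with biases enabled (the Keras default); the helper function names are made up for illustration.

```python
def embedding_params(vocab_size, embed_dim):
    return vocab_size * embed_dim              # one vector per vocabulary index

def simple_rnn_params(input_dim, units):
    return units * (input_dim + units + 1)     # input kernel + recurrent kernel + bias

def lstm_params(input_dim, units):
    # Three gates plus the candidate cell state, each shaped like a SimpleRNN layer.
    return 4 * units * (input_dim + units + 1)

def dense_params(input_dim, units):
    return units * (input_dim + 1)             # weights + bias

VOCAB, EMBED, UNITS = 10000, 64, 64
shared = embedding_params(VOCAB, EMBED) + dense_params(UNITS, 1)

rnn_total = shared + simple_rnn_params(EMBED, UNITS)
lstm_total = shared + lstm_params(EMBED, UNITS)

print(simple_rnn_params(EMBED, UNITS), rnn_total)  # 8256 648321
print(lstm_params(EMBED, UNITS), lstm_total)       # 33024 673089
```

The LSTM layer has four times the recurrent weights of the SimpleRNN layer, although the embedding table dominates the total in both models.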

In [None]:
# Train LSTM
start_lstm = time.time()
lstm_history = lstm_model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=2, batch_size=128)
end_lstm = time.time()

Epoch 1/2
10000/10000 ━━━━━━━━━━━━━━━━━━━━ 62s 6ms/step - accuracy: 0.7694 - loss: 0.4738 - val_accuracy: 0.8151 - val_loss: 0.4052
Epoch 2/2
10000/10000 ━━━━━━━━━━━━━━━━━━━━ 84s 6ms/step - accuracy: 0.8222 - loss: 0.3923 - val_accuracy: 0.8215 - val_loss: 0.3915


Start a timer to track training duration

Train the LSTM model on the training data (X_train, y_train)

Validate the model’s performance using the validation set (X_val, y_val) during training

Train for 2 epochs with a batch size of 128 samples

Stop the timer after training completes to measure total training time

In [None]:
# Evaluate on test set
lstm_loss, lstm_accuracy = lstm_model.evaluate(X_test, y_test)
lstm_time = end_lstm - start_lstm

5000/5000 ━━━━━━━━━━━━━━━━━━━━ 22s 4ms/step - accuracy: 0.8232 - loss: 0.3900


Evaluate the trained LSTM model on the test dataset (X_test, y_test)

Calculate and save the test loss and accuracy metrics

Determine the total training time by subtracting the start time from the end time

In [None]:
import pandas as pd

# For RNN history
rnn_metrics_df = pd.DataFrame({
    'epoch': list(range(1, len(rnn_history.history['loss']) + 1)),
    'train_loss_rnn': rnn_history.history['loss'],
    'train_acc_rnn': rnn_history.history['accuracy'],
    'val_loss_rnn': rnn_history.history['val_loss'],
    'val_acc_rnn': rnn_history.history['val_accuracy']
})

# For LSTM history
lstm_metrics_df = pd.DataFrame({
    'epoch': list(range(1, len(lstm_history.history['loss']) + 1)),
    'train_loss_lstm': lstm_history.history['loss'],
    'train_acc_lstm': lstm_history.history['accuracy'],
    'val_loss_lstm': lstm_history.history['val_loss'],
    'val_acc_lstm': lstm_history.history['val_accuracy']
})

# Optionally merge these two on 'epoch' for easy comparison
metrics_comparison_df = pd.merge(rnn_metrics_df, lstm_metrics_df, on='epoch')
print(metrics_comparison_df)


   epoch  train_loss_rnn  train_acc_rnn  val_loss_rnn  val_acc_rnn  \
0      1        0.464015       0.783286      0.442296     0.794275   
1      2        0.436882       0.799580      0.436501     0.801138   

   train_loss_lstm  train_acc_lstm  val_loss_lstm  val_acc_lstm  
0         0.433035        0.798212       0.405184      0.815106  
1         0.390978        0.822405       0.391529      0.821494  


In [None]:
results = pd.DataFrame({
    "Model": ["RNN", "LSTM"],
    "Training Time (s)": [round(rnn_time, 2), round(lstm_time, 2)],
    "Test Loss": [round(rnn_loss, 4), round(lstm_loss, 4)],
    "Test Accuracy": [round(rnn_accuracy, 4), round(lstm_accuracy, 4)]
})

print(results)


  Model  Training Time (s)  Test Loss  Test Accuracy
0   RNN             124.47     0.4372         0.8004
1  LSTM             165.58     0.3918         0.8224
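
The trade-off in the table can be expressed as two ratios. The snippet below simply re-enters the printed numbers by hand and compares them; it reflects this single run of 2 epochs, so the figures should be read as indicative only.

```python
# Values copied from the results table above.
rnn = {"time_s": 124.47, "test_loss": 0.4372, "test_acc": 0.8004}
lstm = {"time_s": 165.58, "test_loss": 0.3918, "test_acc": 0.8224}

time_ratio = lstm["time_s"] / rnn["time_s"]
acc_gain_points = (lstm["test_acc"] - rnn["test_acc"]) * 100
loss_drop = rnn["test_loss"] - lstm["test_loss"]

print(f"LSTM training time: {time_ratio:.2f}x the RNN's")           # 1.33x
print(f"Test accuracy gain: {acc_gain_points:.1f} percentage points")  # 2.2
print(f"Test loss reduction: {loss_drop:.4f}")                       # 0.0454
```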
