# **ML Project: Tweet Sentiment Analysis**

**Workflow and Requirements for Submission of the Reference Model**

This notebook contains the full workflow leading to the submission of our reference model, which uses traditional supervised ML techniques (**SVM**, **Logistic Regression**).

---

### Required Files and Structure

The following files are **required** and must be in the **current folder** from which this Jupyter Notebook is run:

1. **Dataset Folder**:  
   - The folder must be named **`twitter-datasets`**.  
   - It should contain the following files:
     - **test_data.txt**: The test dataset.
     - **train_neg.txt**: Negative training samples.
     - **train_neg_full.txt**: Full negative training samples.
     - **train_pos.txt**: Positive training samples.
     - **train_pos_full.txt**: Full positive training samples.

2. **Vector Embeddings**:  
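   - The archive `glove.twitter.27B.zip` contains the GloVe Twitter embeddings; this notebook loads the 200-dimensional file **`glove.twitter.27B.200d.txt`**, which must be extracted into the current folder.  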
   - It can be downloaded from the following link:  
     [https://nlp.stanford.edu/data/glove.twitter.27B.zip](https://nlp.stanford.edu/data/glove.twitter.27B.zip)  

3. **SymSpell Dictionary**:  
   - The dictionary file used for spell checking is **`en-80k.txt`**.  
   - It can be downloaded from this link:  
     [https://github.com/wolfgarbe/SymSpell/blob/master/SymSpell.FrequencyDictionary/en-80k.txt](https://github.com/wolfgarbe/SymSpell/blob/master/SymSpell.FrequencyDictionary/en-80k.txt)

4. **Helpers File**:  
   - The file **`helpers.py`** must be included in the current directory.  
   - It contains utility functions used throughout the workflow, such as:
     - Tokenization and preprocessing.
     - Spell checking using SymSpell.
     - Handling embeddings and batch processing.

---

### Important Notes:
- The **names of the files should not be changed**.  
- To understand the reasoning behind certain parameters and thresholds used in this notebook, refer to the **`HP_Tuning.ipynb`** file, which should be available in this repository.
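
The file requirements above can be verified before running any cell. The following is a minimal sketch, assuming that `en-80k.txt` and the extracted `glove.twitter.27B.200d.txt` sit directly in the notebook folder; the function name `find_missing_files` is hypothetical and not part of `helpers.py`.

```python
from pathlib import Path

# File names taken from the list above; the GloVe entry uses the
# 200-dimensional file name that this notebook loads later on.
REQUIRED_FILES = [
    "twitter-datasets/test_data.txt",
    "twitter-datasets/train_neg.txt",
    "twitter-datasets/train_neg_full.txt",
    "twitter-datasets/train_pos.txt",
    "twitter-datasets/train_pos_full.txt",
    "glove.twitter.27B.200d.txt",
    "en-80k.txt",
    "helpers.py",
]


def find_missing_files(base_dir="."):
    """Return the required relative paths that do not exist under base_dir."""
    base = Path(base_dir)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]


missing = find_missing_files()
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All required files are present.")
```

Running this check first avoids discovering a missing file only after the long pre-processing steps have finished.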

---


### **Data Pre-processing**

**Import Libraries**

In [21]:
# Importing Libraries
import re
import random
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle

In [22]:
# Importing NLTK resources for text preprocessing
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download necessary NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

# Initialize stopwords
STOPWORDS = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/selim_sherif/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/selim_sherif/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /home/selim_sherif/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [23]:
# Importing Custom Helpers
import importlib
import helpers
from helpers import *
importlib.reload(helpers)


<module 'helpers' from '/home/selim_sherif/Documents/EPFL/MA1/CS-433_Machine-Learning/projects/project2/CS-433-ML-Project2/helpers.py'>

**Import Dataset**

In [24]:
pos_file = "twitter-datasets/train_pos_full.txt"
neg_file = "twitter-datasets/train_neg_full.txt"
test_file = "twitter-datasets/test_data.txt"
output_file = "vocab_tweets.txt"  # For cleaned tweets without labels

with open(pos_file, "r", encoding="utf-8") as f:
    pos_tweets = f.readlines()

with open(neg_file, "r", encoding="utf-8") as f:
    neg_tweets = f.readlines()

with open(test_file, "r", encoding="utf-8") as f:
    test_tweets = f.readlines()

print(f"Loaded {len(pos_tweets)} positive tweets.")
print(f"Loaded {len(neg_tweets)} negative tweets.")
print(f"Loaded {len(test_tweets)} test tweets.")


Loaded 1250000 positive tweets.
Loaded 1250000 negative tweets.
Loaded 10000 test tweets.


**Remove Duplicate and Empty Tweets**


In [25]:
# Clean tweets and process duplicates and empty tweets
pos_tweets = remove_duplicates_and_empty(pos_tweets)
neg_tweets = remove_duplicates_and_empty(neg_tweets)

# Get the lengths
length_pos = len(pos_tweets)
length_neg = len(neg_tweets)

# Output the lengths
print(f"Number of processed positive tweets: {length_pos}")
print(f"Number of processed negative tweets: {length_neg}")


Number of processed positive tweets: 1142838
Number of processed negative tweets: 1142838
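
The implementation of `remove_duplicates_and_empty` lives in `helpers.py` and is not shown here. As a rough illustration of the idea, the sketch below assumes that tweets are compared after stripping surrounding whitespace and that the first occurrence of each tweet is kept; the `_sketch` name is hypothetical.

```python
def remove_duplicates_and_empty_sketch(tweets):
    """Strip whitespace, drop empty lines, and keep the first copy of each tweet."""
    stripped = (tweet.strip() for tweet in tweets)
    # dict.fromkeys preserves insertion order, so the original ordering survives
    return list(dict.fromkeys(tweet for tweet in stripped if tweet))


sample = ["good day\n", "good day\n", "\n", "  ", "bad day\n"]
print(remove_duplicates_and_empty_sketch(sample))  # ['good day', 'bad day']
```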


**Clean Useless Characters and Words**

In [26]:
c_pos_tweets = clean_tweets(pos_tweets)
c_neg_tweets = clean_tweets(neg_tweets)
c_test_tweets = clean_tweets(test_tweets)

**Fix Contraction, Tokenize and Spell Check the Tweets**

The spell checker used here is **SymSpell**, chosen because it is robust yet fast enough to run on limited computational resources.


In [27]:
c2_pos_tweets = tokenize_and_correct_tweets(c_pos_tweets)
c2_neg_tweets = tokenize_and_correct_tweets(c_neg_tweets)
c2_test_tweets = tokenize_and_correct_tweets(c_test_tweets)
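
The contraction and tokenization part of this step can be sketched with the standard library alone. This is only an illustration under assumptions: the contraction table and the token pattern below are hypothetical, the actual rules in `helpers.py` may differ, and the SymSpell correction step is deliberately left out.

```python
import re

# Small illustrative contraction table; the real mapping may be larger.
CONTRACTIONS = {
    "won't": "will not",
    "can't": "can not",
    "isn't": "is not",
    "i'm": "i am",
    "it's": "it is",
}

# A token is either a run of word characters or a single non-space symbol.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def expand_and_tokenize(tweet):
    """Lowercase, expand known contractions, then split into tokens."""
    words = [CONTRACTIONS.get(word, word) for word in tweet.lower().split()]
    return TOKEN_PATTERN.findall(" ".join(words))


print(expand_and_tokenize("She won't tell me what it is !"))
# ['she', 'will', 'not', 'tell', 'me', 'what', 'it', 'is', '!']
```

Because the lookup works on whitespace-separated words, a contraction glued to punctuation (for example `won't,`) would not be expanded by this sketch.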

**Lemmatize the Tweets**

The lemmatizer used here is the standard `nltk` lemmatizer (`WordNetLemmatizer`), which is good enough for this baseline model.


In [28]:
c3_pos_tweets = lemmatize_tweets(c2_pos_tweets)
c3_neg_tweets = lemmatize_tweets(c2_neg_tweets)
c3_test_tweets = lemmatize_tweets(c2_test_tweets)

**Merge the Tweets and Save the Files (saving is optional)**

In [9]:
# Combine the positive tweets with their labels (as tuples)
l_c3_pos_tweets = [(1, tweet) for tweet in c3_pos_tweets ]  # (label, tweet) for positive tweets
# Combine the negative tweets with their labels (as tuples)
l_c3_neg_tweets = [(0, tweet) for tweet in c3_neg_tweets ]  # (label, tweet) for negative tweets

# Combine and shuffle the data
all_tweets = c3_pos_tweets + c3_neg_tweets  # Only tweets (no labels)
l_all_tweets = l_c3_pos_tweets + l_c3_neg_tweets

# Shuffle the tweets
random.shuffle(all_tweets)
random.shuffle(l_all_tweets)


In [10]:
# Write only the tweets (without labels) to the output file
with open("all_tweets.txt", "w", encoding="utf-8") as f:
    for tweet in all_tweets:
        f.write(f"{tweet}\n")

# Write the labeled tweets, one (label, tokens) tuple per line
with open("l_all_tweets.txt", "w", encoding="utf-8") as f:
    for tweet in l_all_tweets:
        f.write(f"{tweet}\n")


#### **Build The Vocabulary .pkl Dict**

In [11]:
vocab_file = "vocab.pkl"

print("Building vocabulary...")
vocab, word_counts = build_vocab(all_tweets)  # Get both vocab and frequencies
with open(vocab_file, "wb") as f:
    # Save the original vocab to the .pkl file
    pickle.dump(vocab, f, protocol=pickle.HIGHEST_PROTOCOL)

print(f"Vocabulary size: {len(vocab)}")

Building vocabulary...
Vocabulary size: 5434
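
`build_vocab` returns both a vocabulary and the word frequencies. A minimal sketch of such a function is given below, assuming that tweets are lists of tokens, that the vocabulary maps each token to a row index, and that rare tokens are filtered with a hypothetical `min_count` threshold; the actual logic in `helpers.py` may differ.

```python
from collections import Counter


def build_vocab_sketch(tokenized_tweets, min_count=1):
    """Return (vocab, word_counts); vocab maps each kept token to an index."""
    word_counts = Counter(token for tweet in tokenized_tweets for token in tweet)
    # most_common() orders tokens by descending frequency
    kept = [word for word, count in word_counts.most_common() if count >= min_count]
    vocab = {word: index for index, word in enumerate(kept)}
    return vocab, word_counts


tweets = [["i", "love", "this"], ["i", "hate", "this"], ["i", "agree"]]
demo_vocab, demo_counts = build_vocab_sketch(tweets, min_count=2)
print(demo_vocab)  # {'i': 0, 'this': 1}
```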


In [12]:
# Define the GloVe file and corresponding dimension
glove_file = "glove.twitter.27B.200d.txt"
embedding_dim = 200

# Process only the 200-dimensional GloVe embeddings
print(f"Processing GloVe embeddings for {embedding_dim} dimensions...")

# Load GloVe embeddings
glove_embeddings = load_glove_embeddings(glove_file)

# Map vocabulary to GloVe embeddings
embedding_matrix_200 = map_vocab_to_glove(vocab, glove_embeddings, embedding_dim)
print(f"Embedding matrix shape for {embedding_dim} dimensions: {embedding_matrix_200.shape}")

# Save embedding_matrix to file
output_embedding_file = f"embedding_matrix_glove_{embedding_dim}.npy"
np.save(output_embedding_file, embedding_matrix_200)
print(f"Saved embedding matrix to {output_embedding_file}")

print("Variable available for use: embedding_matrix_200")


Processing GloVe embeddings for 200 dimensions...
Loaded 1193514 word vectors from GloVe.
Embedding matrix shape for 200 dimensions: (26245, 200)
Saved embedding matrix to embedding_matrix_glove_200.npy
Variable available for use: embedding_matrix_200
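
The two helper calls above load the GloVe text file and align it with the vocabulary. The sketch below illustrates one way to do this, assuming the usual GloVe text layout (a word followed by space-separated floats on each line) and assuming that words missing from GloVe receive an all-zero row; both function names are hypothetical stand-ins for the helpers in `helpers.py`.

```python
import io

import numpy as np


def load_glove_sketch(lines):
    """Parse GloVe text lines ("word v1 v2 ...") into a {word: vector} dict."""
    embeddings = {}
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        parts = line.rstrip().split(" ")
        embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings


def map_vocab_sketch(vocab, embeddings, dim):
    """Build a (len(vocab), dim) matrix; words missing from GloVe stay all-zero."""
    matrix = np.zeros((len(vocab), dim), dtype=np.float32)
    for word, index in vocab.items():
        vector = embeddings.get(word)
        if vector is not None:
            matrix[index] = vector
    return matrix


# Tiny in-memory stand-in for the real GloVe file
fake_glove = io.StringIO("good 0.1 0.2 0.3\nbad -0.1 -0.2 -0.3\n")
demo_embeddings = load_glove_sketch(fake_glove)
demo_matrix = map_vocab_sketch({"good": 0, "unknownword": 1, "bad": 2}, demo_embeddings, 3)
print(demo_matrix.shape)  # (3, 3)
```

With the real file, the same loader would be fed an open file handle, for example `open("glove.twitter.27B.200d.txt", encoding="utf-8")`.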


### **Make Sure Tweet Tokens Are Part of the Vocabulary**

In [13]:
print("Processing tweets...")
final_tweets = process_tweets(l_all_tweets, vocab )
print(f"Processed {len(final_tweets)} labeled tweets.")

Processing tweets...
Skipping invalid tweet with label 1: ['tweetsformysweetbarbieforteza']
Skipping invalid tweet with label 0: ['meany']
Skipping invalid tweet with label 1: ['cheerleader', 'ifindthatattracrive']
Skipping invalid tweet with label 1: ['jujurakusakithati']
Skipping invalid tweet with label 1: ['messier']
Skipping invalid tweet with label 1: ['thankyoou']
Skipping invalid tweet with label 0: ['helpmeplease']
Skipping invalid tweet with label 1: ['youknowyouliveinthesuburbswhenthatsthehighlightofyourday']
Skipping invalid tweet with label 0: ['askplease']
Skipping invalid tweet with label 1: ['goodmorningg']
Skipping invalid tweet with label 0: ['rolando', 'bola', 'thaayzita']
Skipping invalid tweet with label 1: ['followingback']
Skipping invalid tweet with label 0: ['youonlytweetmewhenyourebored']
Skipping invalid tweet with label 1: ['prick', 'youreallyareasessionpussy']
Skipping invalid tweet with label 1: ['minister']
Skipping invalid tweet with label 1: []
Skipping
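
The log above suggests that tweets left without any usable token are skipped. A minimal sketch of that filtering is shown below, assuming that out-of-vocabulary tokens are dropped and that a tweet is discarded only when no token remains; the exact validity rule of `process_tweets` is defined in `helpers.py` and may differ.

```python
def process_tweets_sketch(labeled_tweets, vocab):
    """Keep in-vocabulary tokens; drop tweets that end up with no tokens."""
    processed = []
    for label, tokens in labeled_tweets:
        kept = [token for token in tokens if token in vocab]
        if kept:
            processed.append((label, kept))
    return processed


demo_vocab = {"good": 0, "morning": 1}
demo_tweets = [(1, ["good", "morning", "goodmorningg"]), (0, ["helpmeplease"]), (1, [])]
print(process_tweets_sketch(demo_tweets, demo_vocab))  # [(1, ['good', 'morning'])]
```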

The next cells reload the data from the saved files (provided they were all saved) in case the kernel crashes.

In [14]:
# # Open the file and process line by line
# final_tweets = []
# with open("l_all_tweets.txt", "r", encoding="utf-8") as file:
#     for line in file:
#         # Evaluate each line as a Python tuple (safe only if you trust the file)
#         try:
#             record = eval(line.strip())  # Convert string to tuple
#             final_tweets.append(record)
#         except Exception as e:
#             print(f"Error parsing line: {line} - {e}")
# # Load the GloVe embedding matrix
# embedding_matrix_200 = np.load("embedding_matrix_glove_200.npy")
# # Load the vocabulary from the .pkl file
# with open("vocab.pkl", "rb") as file:
#     vocab = pickle.load(file)


(1, ['data', 'but', 'i', 'thought'])
(1, ['i', 'did', 'at', 'first', 'but', 'she', 'is', 'going', 'on', 'me', '.', 'kim', 'is', 'my', 'far'])
(0, ['awe', 'is', 'not', 'visible'])
(0, ['so', 'my', 'mom', 'got', 'me', 'a', 'surprise', 'that', 'wa', '$', 'and', 'she', 'will', 'not', 'tell', 'me', 'what', 'it', 'is', 'bo'])
(1, ['downplaying', '-', 'no', 'one', 'doe', 'it', 'better', '(', 'acoustic', ')', '-', 'you', 'me', 'at', 'six', 'heart'])
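
Each saved line is the text form of a `(label, tokens)` tuple, as in the sample above. When reloading, `ast.literal_eval` from the standard library is a safer alternative to `eval`, because it only accepts Python literals and never executes code. A minimal sketch, with the hypothetical helper name `parse_labeled_line`:

```python
import ast


def parse_labeled_line(line):
    """Parse one "(label, [tokens])" line without executing arbitrary code."""
    label, tokens = ast.literal_eval(line.strip())
    return label, tokens


print(parse_labeled_line("(0, ['awe', 'is', 'not', 'visible'])"))
# (0, ['awe', 'is', 'not', 'visible'])
```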


### **Model Training**

In [15]:
# Split into train and test sets
test_size = 20000  # Adjust as needed
test_data = final_tweets[:test_size]  # First `test_size` entries for testing
train_data = final_tweets[test_size:]  # Remaining data for training


In [None]:
# Use the 200-dimensional embedding matrix
embedding_matrix = embedding_matrix_200


# Prepare training data
train_labels, train_tokenized_tweets = zip(*train_data)
X_train = get_batch_embeddings(train_tokenized_tweets, vocab, embedding_matrix)
y_train = np.array(train_labels)

# Normalize training data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
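
`get_batch_embeddings` turns each tokenized tweet into a fixed-length feature vector. Its implementation is in `helpers.py`; the sketch below shows one common choice, averaging the embeddings of the in-vocabulary tokens, and should be read as an assumption about the approach, not as the actual helper.

```python
import numpy as np


def average_embeddings_sketch(tokenized_tweets, vocab, embedding_matrix):
    """Return one row per tweet: the mean of its in-vocabulary token vectors."""
    dim = embedding_matrix.shape[1]
    features = np.zeros((len(tokenized_tweets), dim), dtype=np.float32)
    for row, tokens in enumerate(tokenized_tweets):
        indices = [vocab[token] for token in tokens if token in vocab]
        if indices:  # tweets without known tokens keep an all-zero row
            features[row] = embedding_matrix[indices].mean(axis=0)
    return features


demo_matrix = np.array([[1.0, 1.0], [3.0, 5.0]], dtype=np.float32)
demo_vocab = {"good": 0, "day": 1}
demo_features = average_embeddings_sketch([["good", "day"], ["zzz"]], demo_vocab, demo_matrix)
print(demo_features)
```

Averaging discards word order, which is acceptable for a linear baseline: every tweet maps to a vector of the same dimension as the embeddings.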

In [19]:

# Training parameters
alpha = 0.01  # Regularization parameter for SGD
max_iter = 10000
eta0 = 0.001
penalty = "l2"
loss = "hinge"
tol = 1e-4

print("\nTraining and evaluating model with 200-dimensional embeddings...")

# Initialize and train classifier
clf = SGDClassifier(loss=loss, max_iter=max_iter, alpha=alpha, tol=tol, penalty=penalty,
                    learning_rate="constant", eta0=eta0)
print("Training model...")
clf.fit(X_train, y_train)  # Fit the model on the full training split
print("Training with 200-dimensional embeddings complete.")

# Prepare test data
test_labels, test_tokenized_tweets = zip(*test_data)
X_test = get_batch_embeddings(test_tokenized_tweets, vocab, embedding_matrix)
y_test = np.array(test_labels)

# Normalize test data
X_test = scaler.transform(X_test)

# Predict and evaluate
y_pred = clf.predict(X_test)
print("\nClassification Report for 200-dimensional embeddings:")
print(classification_report(y_test, y_pred))



Training and evaluating model with 200-dimensional embeddings...
Training model...
Training with 200-dimensional embeddings complete.

Classification Report for 200-dimensional embeddings:
              precision    recall  f1-score   support

           0       0.77      0.73      0.75     10086
           1       0.74      0.78      0.76      9914

    accuracy                           0.76     20000
   macro avg       0.76      0.76      0.76     20000
weighted avg       0.76      0.76      0.76     20000
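
For reference, with `loss="hinge"` and `penalty="l2"` the `SGDClassifier` fits a linear SVM. Following the formulation in the scikit-learn documentation, and with labels encoded as $y_i \in \{-1, +1\}$, the objective minimized over the $n$ training tweets can be written as:

$$
E(w, b) = \frac{1}{n} \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i (w^\top x_i + b)\bigr) + \frac{\alpha}{2} \lVert w \rVert_2^2
$$

With `learning_rate="constant"`, each stochastic step on a sample $(x_i, y_i)$ uses the fixed step size $\eta_0$:

$$
w \leftarrow w - \eta_0 \left( \alpha\, w + \frac{\partial}{\partial w} \max\bigl(0,\; 1 - y_i (w^\top x_i + b)\bigr) \right)
$$

Here $\alpha = 0.01$ and $\eta_0 = 0.001$ as set in the cell above. Replacing the hinge term with the logistic loss $\log\bigl(1 + e^{-y_i (w^\top x_i + b)}\bigr)$ yields logistic regression under the same regularization.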



The next cell produces the crowd AI submission, which yielded an accuracy of **0.77**.

In [32]:
import numpy as np
import pandas as pd

# Prepare test data
print("\nPreparing test data for submission...")
X_submission = get_batch_embeddings(c3_test_tweets, vocab, embedding_matrix)

# Normalize test data using the same scaler used during training
X_submission = scaler.transform(X_submission)

# Predict test data (model generates 0 and 1)
print("Predicting submission data...")
y_submission = clf.predict(X_submission)

# Convert 0 to -1 in predictions
y_submission_adjusted = np.where(y_submission == 0, -1, y_submission)

# Save adjusted predictions to a file (submission.csv)
submission_filename = "submission.csv"
print(f"Saving adjusted predictions to {submission_filename}...")

# Create a DataFrame with Id starting at 1
submission_df = pd.DataFrame({
    "Id": range(1, len(y_submission_adjusted) + 1),  # IDs start at 1
    "Prediction": y_submission_adjusted
})

# Save to CSV
submission_df.to_csv(submission_filename, index=False)

print("Submission file created successfully with 'Id' starting at 1.")



Preparing test data for submission...
Predicting submission data...
Saving adjusted predictions to submission.csv...
Submission file created successfully with 'Id' starting at 1.
