# Twitter NLP Sentiment Classifier

This notebook builds a simple NLP model to classify tweets as positive or negative using the [Sentiment140 dataset](https://www.kaggle.com/datasets/kazanova/sentiment140).



In [1]:
#import basic core libraries 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

---
### 1. Data Preprocessing 🔨

#### Steps 📄:
- Load and inspect Sentiment140 dataset

- Clean raw tweet text

- Normalize case and remove stopwords

- Split into training, validation, and testing sets

We'll begin by loading our dataframe and exploring the dataset to get an understanding of its structure and the distribution of sentiment labels.

In [2]:
df = pd.read_csv("../data/Sentiment140.csv", encoding="latin-1", header=None)

print(df.head())
df[0].value_counts()

   0           1                             2         3                4  \
0  0  1467810369  Mon Apr 06 22:19:45 PDT 2009  NO_QUERY  _TheSpecialOne_   
1  0  1467810672  Mon Apr 06 22:19:49 PDT 2009  NO_QUERY    scotthamilton   
2  0  1467810917  Mon Apr 06 22:19:53 PDT 2009  NO_QUERY         mattycus   
3  0  1467811184  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY          ElleCTF   
4  0  1467811193  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY           Karoli   

                                                   5  
0  @switchfoot http://twitpic.com/2y1zl - Awww, t...  
1  is upset that he can't update his Facebook by ...  
2  @Kenichan I dived many times for the ball. Man...  
3    my whole body feels itchy and like its on fire   
4  @nationwideclass no, it's not behaving at all....  


0
0    800000
4    800000
Name: count, dtype: int64

We are only going to use the sentiment and the text, which are found in columns 0 and 5 respectively.<br/>
The sentiment is distributed evenly between 0 (negative) and 4 (positive).<br/>
It is also clear that the text will need some cleaning (tags, URLs, special characters, etc.).

Let's label our columns and grab the ones we want. <br/>
We'll also map all 4s to 1 for binary simplicity.

In [3]:
df.columns = ["sentiment", "id", "date", "flag", "user", "text"]
df = df[["sentiment", "text"]].copy()  # copy so the assignment below does not write to a slice

df["sentiment"] = df["sentiment"].replace({4:1})

print(df.head())
print(df["sentiment"].value_counts())

   sentiment                                               text
0          0  @switchfoot http://twitpic.com/2y1zl - Awww, t...
1          0  is upset that he can't update his Facebook by ...
2          0  @Kenichan I dived many times for the ball. Man...
3          0    my whole body feels itchy and like its on fire 
4          0  @nationwideclass no, it's not behaving at all....
sentiment
0    800000
1    800000
Name: count, dtype: int64


Next we'll use Python's `re` (regular expressions) module and the `NLTK` library to clean the tweets.

In [4]:
import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

# We remove URLs, mentions (@), hashtags, and special characters from the tweets
# We also remove stopwords, words that contribute very little to the sentiment of the tweets (e.g. "the", "and", or "is")
def clean_tweets(text):
    text = re.sub(r"http\S+|www\S+|https\S+", "", text)
    text = re.sub(r"@\w+", "", text)
    text = re.sub(r"#\w+", "", text)
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    text = text.lower().strip().split()
    text = [word for word in text if word not in stop_words]
    return " ".join(text)

df["text"] = df["text"].apply(clean_tweets)
print (df["text"].head())

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\theli\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


0        thats bummer shoulda got david carr third day
1    upset cant update facebook texting might cry r...
2    dived many times ball managed save rest go bounds
3                     whole body feels itchy like fire
4                             behaving im mad cant see
Name: text, dtype: object
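
To see what each substitution in `clean_tweets` contributes, here is a minimal sketch that runs the same steps on one made-up tweet. The sample text and the tiny `demo_stop_words` set are assumptions for illustration only (they stand in for the NLTK list so the snippet runs without any downloads).

```python
import re

# Hypothetical mini stopword list so the example runs without NLTK data.
demo_stop_words = {"the", "is", "a", "to", "and", "my"}

def clean_demo(text):
    text = re.sub(r"http\S+|www\S+", "", text)   # drop URLs
    text = re.sub(r"@\w+", "", text)             # drop @mentions
    text = re.sub(r"#\w+", "", text)             # drop hashtags, including the tag word
    text = re.sub(r"[^a-zA-Z\s]", "", text)      # drop digits and punctuation
    words = text.lower().split()                 # lowercase and split on whitespace
    return " ".join(w for w in words if w not in demo_stop_words)

sample = "@bob The new phone is AMAZING!!! http://t.co/xyz #blessed can't wait"
cleaned = clean_demo(sample)
print(cleaned)  # new phone amazing cant wait
```

Note that the apostrophe in "can't" is stripped along with other punctuation, which is why contractions show up as "cant" and "im" in the cleaned tweets.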


In [5]:
def tokenize(text):
    # Keep only letters, apostrophes, and whitespace
    text = re.sub(r"[^a-zA-Z'\s]", "", text)  # Allow apostrophes
    words = text.lower().strip().split()
    # Custom stop words - preserve sentiment carriers
    custom_stopwords = set(stopwords.words('english')) - {'not', 'no', 'never', 'very'}
    return [w for w in words if w not in custom_stopwords]
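
As a small illustration of why the custom list keeps negation and intensity words, the sketch below filters one phrase with two stopword sets. Both sets are tiny hypothetical stand-ins (not the NLTK list), chosen so the example runs on its own.

```python
# Hypothetical stand-in stopword sets, for illustration only.
full_stopwords = {"the", "is", "was", "not", "no", "very", "a"}
custom_stopwords = full_stopwords - {"not", "no", "never", "very"}

def filter_words(text, stop_set):
    # Lowercase, split on whitespace, and drop anything in the stop set
    return [w for w in text.lower().split() if w not in stop_set]

phrase = "the movie was not very good"
with_full = filter_words(phrase, full_stopwords)
with_custom = filter_words(phrase, custom_stopwords)
print(with_full)    # ['movie', 'good']
print(with_custom)  # ['movie', 'not', 'very', 'good']
```

With the full set, a negative phrase collapses to the same tokens as a positive one, which is the information the custom list tries to preserve.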

Now we can drop duplicate tweets and split the data into train, validation, and test sets, using stratified sampling so that each split keeps the same class balance.

In [6]:
from sklearn.model_selection import train_test_split

print (df["sentiment"].value_counts())
#print (df["text"].value_counts())
df = df.drop_duplicates(subset="text")
print (df["sentiment"].value_counts())
#print (df["text"].value_counts())

df = df.sample(frac=0.1, random_state=1)  # use only 10% of the full dataset (temporary)

# Split into train&val (90%) and the test set (10%)
df_trainandval, df_test = train_test_split(df, test_size=0.1, stratify=df["sentiment"], random_state=1)

# Now split train&val into train (90% of 90% --> 81%) and val (10% of 90% --> 9%)
df_train, df_val = train_test_split(df_trainandval, test_size=0.1, stratify=df_trainandval["sentiment"], random_state=1)

print(f"train split: {df_train['sentiment'].value_counts(normalize=True)}")
print(f"validation split: {df_val['sentiment'].value_counts(normalize=True)}")
print(f"test split: {df_test['sentiment'].value_counts(normalize=True)}")

sentiment
0    800000
1    800000
Name: count, dtype: int64
sentiment
0    748482
1    731249
Name: count, dtype: int64
train split: sentiment
0    0.504877
1    0.495123
Name: proportion, dtype: float64
validation split: sentiment
0    0.504881
1    0.495119
Name: proportion, dtype: float64
test split: sentiment
0    0.504866
1    0.495134
Name: proportion, dtype: float64


We will vectorize text at the beginning of our ML approach and tokenize at the start of our DL approach.

---
### 2. Classical ML Approach 💻

#### Steps 📄:
- Extract features with TF-IDF

- Train logistic regression model

- Evaluate model on validation and test set

- Analyze performance

We will start our ML approach by vectorizing our text data and fitting our model to the train data.

In [95]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=15000)

# These will be the features for our ML model, we fit on our training features
ml_X_train = vectorizer.fit_transform(df_train["text"])
ml_X_val = vectorizer.transform(df_val["text"])
ml_X_test = vectorizer.transform(df_test["text"])

# The labels for our ML model
ml_y_train = df_train["sentiment"]
ml_y_val = df_val["sentiment"]
ml_y_test = df_test["sentiment"]

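Now we can take a look at 10 words from the learned vocabulary and the first 10 IDF scores. Note that `vocabulary_` is a dict listed in insertion order while `idf_` is ordered by feature index, so the two lists printed below are not matched pair by pair.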

In [96]:
print (f"First 10 words: {list(vectorizer.vocabulary_.keys())[:10]}")
print (f"IDF Scores: {vectorizer.idf_[:10]}")

First 10 words: ['studying', 'dates', 'suck', 'got', 'all', 'the', 'titles', 'and', 'majority', 'of']
IDF Scores: [ 8.2504272   8.20786759 10.37432051 10.28730913 10.82630563 10.13315845
  9.8277768  10.06416558  9.55333995 10.13315845]
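
As a sketch of where IDF values like these come from, the snippet below computes smoothed IDF by hand on three made-up documents. It uses the formula scikit-learn documents for its default `smooth_idf=True` setting, idf(t) = ln((1 + n) / (1 + df(t))) + 1; the toy documents are an assumption for illustration.

```python
import math

docs = ["good day", "bad day", "good good movie"]
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
doc_freq = {}
for doc in docs:
    for term in set(doc.split()):
        doc_freq[term] = doc_freq.get(term, 0) + 1

# Smoothed IDF: rarer terms get larger weights
idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
for term in sorted(idf):
    print(term, round(idf[term], 3))
```

Terms that appear in fewer documents ("bad", "movie") receive a higher IDF than terms that appear in most of them ("good", "day"), so large values such as those above indicate rare words.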


Next we'll initialize and train a logistic regression model on our data.

In [97]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=1000)
clf.fit(ml_X_train, ml_y_train)

Now let's get a baseline by checking how our model performs on its training data.

In [55]:
from sklearn.metrics import accuracy_score

ml_train_preds = clf.predict(ml_X_train)

print (f"Training Accuracy: {accuracy_score(ml_y_train, ml_train_preds)}")

Training Accuracy: 0.8028395061728395


Our training accuracy is good: 75-80% is a reasonable range for this model on this dataset. We're only using this model as a baseline to compare against our DL model, so we don't need to fine-tune the hyperparameters (unless we see an issue). Before we evaluate on the test set, though, let's check the validation accuracy to make sure the model isn't overfitting.

In [56]:
from sklearn.metrics import classification_report

ml_val_preds = clf.predict(ml_X_val)

print (f"Validation Accuracy: {accuracy_score(ml_y_val, ml_val_preds)}")
print (classification_report(ml_y_val, ml_val_preds))

Validation Accuracy: 0.7693055555555556
              precision    recall  f1-score   support

           0       0.78      0.75      0.76      7185
           1       0.76      0.79      0.77      7215

    accuracy                           0.77     14400
   macro avg       0.77      0.77      0.77     14400
weighted avg       0.77      0.77      0.77     14400
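
For readers unfamiliar with the columns in `classification_report`, here is a minimal sketch of how accuracy, precision, recall, and F1 are computed for one class. The labels and predictions are made up for illustration, and `binary_report` is a hypothetical helper rather than part of scikit-learn.

```python
def binary_report(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
accuracy, precision, recall, f1 = binary_report(y_true, y_pred)
print(accuracy, precision, recall, f1)  # 0.75 0.75 0.75 0.75
```

Precision asks how many predicted positives were correct, while recall asks how many actual positives were found; the report above lists both for each class.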



Nice, the validation accuracy (about 77%) is only about 3 points below the training accuracy (about 80%), so the model isn't overfitting much and generalizes well to our validation set.

Now let's evaluate the model's performance on the test set.

In [57]:
ml_test_preds = clf.predict(ml_X_test)

print (f"Test Accuracy: {accuracy_score(ml_y_test, ml_test_preds)}")
print (classification_report(ml_y_test, ml_test_preds))

Test Accuracy: 0.7698125
              precision    recall  f1-score   support

           0       0.78      0.76      0.77      7984
           1       0.76      0.78      0.77      8016

    accuracy                           0.77     16000
   macro avg       0.77      0.77      0.77     16000
weighted avg       0.77      0.77      0.77     16000



Great, 77% accuracy is a nice baseline to compare against moving forward. Now we can move on to our DL approach.

---
### 3. Deep Learning Approach 🧠

#### Steps 📄:
- Tokenize and vectorize features

- Create dataset and dataloaders

- Build and train LSTM model

- Test on validation and test sets

Let's start by building our vocabulary, which handles tokenization and numericalization.

In [7]:
import torch
from collections import Counter

class Vocabulary:
    def __init__(self, min_freq=5):
        self.min_freq = min_freq
        self.word2idx = {"<pad>": 0, "<unk>": 1}
        self.idx2word = {0: "<pad>", 1: "<unk>"}
        self.idx = 2
    
    # Build vocabulary, updating word2idx and idx2word with words that appear >= min_freq
    def build_vocab(self, tweets):
        counter = Counter()
        for tweet in tweets:
            counter.update(tweet.lower().split())

        for word, count in counter.items():
            if count>= self.min_freq and word not in self.word2idx:
                self.word2idx[word] = self.idx
                self.idx2word[self.idx] = word
                self.idx+= 1
    
    # Convert a given tweet to indexes
    def numericalize(self, tweet):
        tokens = tweet.lower().split()
        return [self.word2idx.get(token, self.word2idx["<unk>"]) for token in tokens]

# Build vocab
vocab = Vocabulary()
#vocab.build_vocab(df_train["text"])
train_texts, test_texts, train_labels, test_labels = train_test_split(df["text"], df["sentiment"], test_size=0.2, random_state=1)
vocab.build_vocab(train_texts)

# Check
print (f"Vocab size: {len(vocab.word2idx)}")
print (f"First 3 words: {vocab.idx2word[2]}, {vocab.idx2word[3]}, {vocab.idx2word[4]}")

Vocab size: 11747
First 3 words: sorry, delayed, response
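
The sketch below walks through the same idea on a few made-up tweets: words seen at least `min_freq` times get their own index, and everything else maps to `<unk>`. The toy tweets and the threshold of 2 are assumptions for illustration (the class above defaults to 5).

```python
from collections import Counter

tweets = ["good day", "good movie", "bad day", "good day today"]
min_freq = 2  # assumed threshold for this toy example

# Count every whitespace-separated token across all tweets
counts = Counter(w for t in tweets for w in t.split())

# Reserve 0 for padding and 1 for unknown words, then index frequent words
word2idx = {"<pad>": 0, "<unk>": 1}
for word, count in counts.items():
    if count >= min_freq:
        word2idx[word] = len(word2idx)

encoded = [word2idx.get(w, word2idx["<unk>"]) for w in "good weather today".split()]
print(word2idx)  # {'<pad>': 0, '<unk>': 1, 'good': 2, 'day': 3}
print(encoded)   # [2, 1, 1]
```

Here "weather" was never seen and "today" appeared only once, so both fall back to the `<unk>` index.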


Great, now let's create our `Dataset` class.

In [8]:
from torch.utils.data import Dataset

# PyTorch Dataset for our DataLoaders
class TweetDataset(Dataset):
    def __init__(self, tweets, sentiments, vocab, max_len=100):
        self.tweets = tweets
        self.sentiments = sentiments
        self.vocab = vocab
        self.max_len = max_len
        
    def __len__(self):
        return len(self.tweets)
    
    # Convert to numerical tokens and return feature and label as tensor pair
    def __getitem__(self, index):
        text = self.tweets[index]
        label = self.sentiments[index]

        # Convert text to numericalized tokens
        numericalized = self.vocab.numericalize(text)
        
        return (
            torch.tensor(numericalized, dtype=torch.long),
            torch.tensor(label, dtype=torch.float)
        )

Next let's make the collate function for our DataLoaders to pad our texts and batch our features and labels.

In [9]:
from torch.nn.utils.rnn import pad_sequence

def collate(batch):
    texts, labels = zip(*batch)

    # pads tweets to same length and returns as a stacked tensor (batch size, longest tweet)
    padded_texts = pad_sequence(texts, batch_first=True, padding_value=0)

    # stacks labels into tensor
    labels = torch.stack(labels)

    return padded_texts, labels
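
To make the padding step concrete, here is a plain-Python sketch of what padding a batch to its longest sequence looks like. `pad_batch` is a hypothetical helper that mimics the effect of `pad_sequence(..., batch_first=True, padding_value=0)` on lists; it is not the PyTorch implementation.

```python
def pad_batch(sequences, pad_value=0):
    # Pad every sequence on the right up to the longest one in this batch
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (longest - len(seq)) for seq in sequences]

batch = [[2, 3, 4], [5], [6, 7]]
padded = pad_batch(batch)
print(padded)  # [[2, 3, 4], [5, 0, 0], [6, 7, 0]]
```

Because padding is done per batch, the second dimension of the text tensor varies from batch to batch with the longest tweet it contains.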

Now let's make the `DataLoaders` for our model.

In [10]:
from torch.utils.data import DataLoader

max_len = 100
batch_size = 64

# Creates datasets
#train_dataset = TweetDataset(df_train["text"].tolist(), df_train["sentiment"].tolist(), vocab, max_len)
#val_dataset = TweetDataset(df_val["text"].tolist(), df_val["sentiment"].tolist(), vocab, max_len)
test_dataset = TweetDataset(df_test["text"].tolist(), df_test["sentiment"].tolist(), vocab, max_len)
train_dataset = TweetDataset(train_texts.tolist(), train_labels.tolist(), vocab, max_len)
val_dataset = TweetDataset(test_texts.tolist(), test_labels.tolist(), vocab, max_len)
    

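train_loader = DataLoader(train_dataset, batch_size=batch_size, 
                             shuffle=True, collate_fn=collate)
# NOTE: val_loader wraps test_dataset (built from df_test), not val_dataset.
# train_texts was re-split from the full df, so df_test rows can also appear in
# the training data, and the validation metrics may be optimistic as a result.
val_loader = DataLoader(test_dataset, batch_size=batch_size, 
                            shuffle=False, collate_fn=collate)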
# Wraps datasets in DataLoaders and shuffles training dataset
#train_loader = DataLoader(train_dataset, batch_size, shuffle=True, collate_fn=collate)
#val_loader = DataLoader(val_dataset, batch_size, shuffle=False, collate_fn=collate)
test_loader = DataLoader(test_dataset, batch_size, shuffle=False, collate_fn=collate)

Before we move to building the model, let's verify our data.

In [11]:
# Check our dataset lengths
print (f"Training samples: {len(train_dataset)}")
print (f"Validation samples: {len(val_dataset)}")
print (f"Test samples: {len(test_dataset)}")

# Check the first training sample
text, sentiment = train_dataset[0]
print (f"Raw sample text: {df_train['text'].iloc[0]}")
print (f"Processed sample text: {text}")
print (f"Sample label: {sentiment}")

# Check batch shapes
batch_texts, batch_sentiments = next(iter(train_loader))
print (f"Texts tensor shape: {batch_texts.shape}")
print (f"Labels tensor shape: {batch_sentiments.shape}")

#Check batch content
first_text = [vocab.idx2word[i] for i in (batch_texts[0].tolist()) if i != vocab.word2idx["<pad>"]]
print (f"First text in batch: {' '.join(first_text)}")
print (f"First label in batch: {batch_sentiments[0].item()}")

Training samples: 118378
Validation samples: 29595
Test samples: 14798
Raw sample text: tweet lost
Processed sample text: tensor([ 2,  3,  4,  5,  1,  6,  7,  8,  9, 10, 11])
Sample label: 0.0
Texts tensor shape: torch.Size([64, 15])
Labels tensor shape: torch.Size([64])
First text in batch: wait til hear rob pattinson sing uhhh <unk> serious <unk> alright <unk>
First label in batch: 0.0


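Great, our datasets look good and our DataLoaders appear to be working. <br/>
Feature and label tensors are the right shape, and padding looks to be working. <br/>
Note that the raw sample text is read from `df_train`, while `train_dataset` is built from `train_texts`, so it does not correspond to the processed sample shown.

Now let's move forward with making our LSTM model.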

In [12]:
import torch.nn as nn
      
class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, pad_index, dropout_rate):
        super().__init__()
        # Embedding layer: converts tweets (stored as vocab indexes) to dense embedded vectors
        self.embedding = nn.Embedding(vocab_size, embedding_dim, pad_index)
        nn.init.uniform_(self.embedding.weight, -0.05, 0.05)
        # LSTM layer: processes embedded input, learning sequential patterns
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        # Dropout layer: drops out percentage of neurons for regularization
        self.dropout = nn.Dropout(dropout_rate)
        # Fully connected layer: takes final hidden state from LSTM and converts to output classes
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        embedded = self.embedding(x)
        lstm_out, _ = self.lstm(embedded)
        
        last_state = self.dropout(lstm_out[:, -1, :])
        return self.fc(last_state)

With our layers and forward pass defined, let's move on to defining our hyperparameters.

In [13]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = LSTMClassifier(
    vocab_size=len(vocab.word2idx),
    embedding_dim = 100,
    hidden_dim = 256,
    output_dim = 1,
    pad_index = 0,
    dropout_rate = 0.5
).to(device)

# BCEWithLogitsLoss applies a sigmoid (not softmax) to the single logit from the FC layer
# and computes binary cross-entropy against the 0/1 labels
# Adam is a good, simple optimizer for NLP tasks, especially when using LSTMs like this model
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=.001)
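
For a rough sense of model size, the trainable parameter count implied by these hyperparameters can be worked out by hand. This sketch assumes the vocabulary size of 11747 printed earlier and the layout of a single-layer, unidirectional `nn.LSTM` (four gates, each with input weights, hidden weights, and two bias vectors).

```python
vocab_size, embedding_dim, hidden_dim, output_dim = 11747, 100, 256, 1

# Embedding table: one vector of size embedding_dim per vocabulary entry
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates x (input-to-hidden weights + hidden-to-hidden weights + two biases)
lstm_params = 4 * (hidden_dim * embedding_dim + hidden_dim * hidden_dim + 2 * hidden_dim)

# Fully connected layer: weights plus bias
fc_params = hidden_dim * output_dim + output_dim

total_params = embedding_params + lstm_params + fc_params
print(embedding_params, lstm_params, fc_params, total_params)
# 1174700 366592 257 1541549
```

Most of the parameters sit in the embedding table, so the vocabulary's `min_freq` cutoff has a direct effect on model size.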

With our model defined, let's build the training loop.

In [14]:
def train_loop(model, iterator):
    epoch_loss = 0
    epoch_acc = 0
    batches = len(iterator)
    model.train()
    
    for batch_idx, batch in enumerate(iterator):
        optimizer.zero_grad()
        texts, labels = batch
        texts, labels = texts.to(device), labels.to(device)
        
        predictions = model(texts).squeeze(1)
        loss = criterion(predictions, labels)
        
        # Calculate accuracy
        rounded_preds = torch.round(torch.sigmoid(predictions))
        correct = (rounded_preds == labels).float()
        acc = correct.sum() / len(correct)
        
        loss.backward()
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        print(f"Processing batch {batch_idx + 1}/{batches}", end='\r')
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)
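
The loss and accuracy computed inside the loop can be reproduced by hand on a few numbers. The sketch below is a plain-Python illustration of what sigmoid, rounding, and mean binary cross-entropy do to raw logits; the logits and targets are made up, and `bce_with_logits` is a hypothetical helper, not the PyTorch implementation (which uses a more numerically stable formulation).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_with_logits(logits, targets):
    # Mean binary cross-entropy on sigmoid(logit) against 0/1 targets
    losses = [
        -(y * math.log(sigmoid(z)) + (1 - y) * math.log(1 - sigmoid(z)))
        for z, y in zip(logits, targets)
    ]
    return sum(losses) / len(losses)

logits = [2.0, -1.0, 0.5, -3.0]
targets = [1.0, 0.0, 0.0, 1.0]

# Rounding the sigmoid output is the same as thresholding the logit at 0
preds = [float(round(sigmoid(z))) for z in logits]
batch_acc = sum(p == y for p, y in zip(preds, targets)) / len(targets)
loss = bce_with_logits(logits, targets)
print(preds, batch_acc, round(loss, 4))
```

Confidently wrong predictions (such as the logit of -3.0 for a positive label) dominate the loss, even though each mistake counts the same toward accuracy.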

Next, let's make our evaluation function.

In [15]:
def evaluate(model, iterator, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.eval()
    
    with torch.no_grad():
        for batch in iterator:
            texts, labels = batch
            texts, labels = texts.to(device), labels.to(device)
            
            predictions = model(texts).squeeze(1)
            loss = criterion(predictions, labels)
            
            rounded_preds = torch.round(torch.sigmoid(predictions))
            correct = (rounded_preds == labels).float()
            acc = correct.sum() / len(correct)
            
            epoch_loss += loss.item()
            epoch_acc += acc.item()
            
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [16]:
print (torch.__version__)
print (torch.cuda.is_available())
print (torch.version.cuda)

overlap = set(df_train['text']) & set(df_val['text'])
print(f'Overlap: {len(overlap)}')


for epoch in range(10):
    train_loss, train_acc = train_loop(model, train_loader)
    val_loss, val_acc = evaluate(model, val_loader, criterion=criterion)
    print (f"Epoch {epoch+1}: Training Acc = {train_acc} | Val Acc = {val_acc} | Training Loss = {train_loss}| Val Loss = {val_loss}")

2.0.0+cu118
True
11.8
Overlap: 0
Epoch 1: Training Acc = 0.726281772851944 | Val Acc = 0.7934690580285829 | Training Loss = 0.5347615146636963| Val Loss = 0.4461599515172942
Epoch 2: Training Acc = 0.7868194980879087 | Val Acc = 0.8077470752699621 | Training Loss = 0.45774676300383904| Val Loss = 0.4201616828554663
Epoch 3: Training Acc = 0.8032251447922475 | Val Acc = 0.8201008315744072 | Training Loss = 0.42355592462662106| Val Loss = 0.4009446560822684
Epoch 4: Training Acc = 0.8192507239612373 | Val Acc = 0.8308766936433727 | Training Loss = 0.39191481665985006| Val Loss = 0.3822268681151086
Epoch 5: Training Acc = 0.8357380148204597 | Val Acc = 0.8423260470916485 | Training Loss = 0.3589323350303882| Val Loss = 0.3594742101328126
Epoch 6: Training Acc = 0.8546400418474868 | Val Acc = 0.8546509393330278 | Training Loss = 0.3243649388084541| Val Loss = 0.34300552604013473
Epoch 7: Training Acc = 0.8715866312787339 | Val Acc = 0.8636757238157864 | Training Loss = 0.28851353252658973|

### 4. Comparisons and Conclusions 💭