# Twitter NLP Sentiment Classifier

This notebook builds a simple NLP model to classify tweets as positive or negative using the [Sentiment140 dataset](https://www.kaggle.com/datasets/kazanova/sentiment140).



In [37]:
# Import core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

---
### 1. Data Preprocessing 🔨

#### Steps 📄:
- Load and inspect Sentiment140 dataset

- Clean raw tweet text

- Normalize case and remove stopwords

- Split into training, validation, and testing sets

We'll begin by loading our dataframe and exploring the dataset to get an understanding of its structure and the distribution of sentiment labels.

In [38]:
df = pd.read_csv("../data/Sentiment140.csv", encoding="latin-1", header=None)

print(df.head())
df[0].value_counts()

   0           1                             2         3                4  \
0  0  1467810369  Mon Apr 06 22:19:45 PDT 2009  NO_QUERY  _TheSpecialOne_   
1  0  1467810672  Mon Apr 06 22:19:49 PDT 2009  NO_QUERY    scotthamilton   
2  0  1467810917  Mon Apr 06 22:19:53 PDT 2009  NO_QUERY         mattycus   
3  0  1467811184  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY          ElleCTF   
4  0  1467811193  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY           Karoli   

                                                   5  
0  @switchfoot http://twitpic.com/2y1zl - Awww, t...  
1  is upset that he can't update his Facebook by ...  
2  @Kenichan I dived many times for the ball. Man...  
3    my whole body feels itchy and like its on fire   
4  @nationwideclass no, it's not behaving at all....  


0
0    800000
4    800000
Name: count, dtype: int64

We are only going to use the sentiment and the text, which are found in columns 0 and 5 respectively.<br/>
The sentiment is distributed evenly between 0 (negative) and 4 (positive), with 800,000 tweets each.<br/>
It is also clear that the text will need some cleaning (mentions, URLs, special characters, etc.).

Let's label our columns and keep the ones we want. <br/>
We'll also map every 4 to 1 so the labels are a simple binary 0/1.

In [39]:
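df.columns = ["sentiment", "id", "date", "flag", "user", "text"]
# Take an explicit copy so the assignment below modifies a new frame, not a view of the original
df = df[["sentiment", "text"]].copy()

# Remap the positive label from 4 to 1 so the target is binary 0/1
df["sentiment"] = df["sentiment"].replace({4: 1})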

print(df.head())
print(df["sentiment"].value_counts())

   sentiment                                               text
0          0  @switchfoot http://twitpic.com/2y1zl - Awww, t...
1          0  is upset that he can't update his Facebook by ...
2          0  @Kenichan I dived many times for the ball. Man...
3          0    my whole body feels itchy and like its on fire 
4          0  @nationwideclass no, it's not behaving at all....
sentiment
0    800000
1    800000
Name: count, dtype: int64


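Next we'll use Python's `re` module and NLTK's English stopword list to clean the tweets: strip URLs, @mentions, hashtags, and non-letter characters, then lowercase the text and drop stopwords.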

In [40]:
import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

def clean_tweets(text):
    text = re.sub(r"http\S+|www\S+", "", text)  # URLs (http\S+ also covers https)
    text = re.sub(r"@\w+", "", text)             # @mentions
    text = re.sub(r"#\w+", "", text)             # hashtags, including the tag word
    text = re.sub(r"[^a-zA-Z\s]", "", text)      # digits and punctuation; "can't" becomes "cant"
    tokens = text.lower().split()
    tokens = [word for word in tokens if word not in stop_words]
    return " ".join(tokens)

df["text"] = df["text"].apply(clean_tweets)
print(df["text"].head())

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\theli\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


0        thats bummer shoulda got david carr third day
1    upset cant update facebook texting might cry r...
2    dived many times ball managed save rest go bounds
3                     whole body feels itchy like fire
4                             behaving im mad cant see
Name: text, dtype: object
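
To make the cleaning steps concrete, here is a minimal sketch that applies the same regular expressions one at a time to a single made-up tweet and prints the text after each step. The `STOP_WORDS` set and the `trace_clean` helper are hypothetical stand-ins for illustration; the notebook itself uses NLTK's full English list.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list (the real list is much longer).
STOP_WORDS = {"no", "not", "its", "at", "all", "is", "a", "the"}

# Same patterns, in the same order, as clean_tweets.
STEPS = [
    ("urls", r"http\S+|www\S+"),
    ("mentions", r"@\w+"),
    ("hashtags", r"#\w+"),
    ("non-letters", r"[^a-zA-Z\s]"),
]

def trace_clean(text):
    """Apply the cleaning steps one at a time, printing the text after each."""
    for name, pattern in STEPS:
        text = re.sub(pattern, "", text)
        print(f"after removing {name:<12}: {text!r}")
    tokens = text.lower().split()
    kept = [t for t in tokens if t not in STOP_WORDS]
    print(f"tokens: {tokens}")
    print(f"kept:   {kept}")
    return " ".join(kept)

sample = "@nationwideclass no, it's not behaving at all. http://example.com #mad"
cleaned = trace_clean(sample)
print(cleaned)
```

Note that the negation words "no" and "not" are on NLTK's stopword list too, so "not behaving" and "behaving" clean to the same text. That may be worth revisiting if negation turns out to matter for sentiment.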


Now we can split the data into train, validation, and test sets. Stratifying on the label keeps the 50/50 class balance in every split, and a fixed `random_state` makes the split reproducible.

In [42]:
from sklearn.model_selection import train_test_split

df_trainandval, df_test = train_test_split(df, test_size=0.1, stratify=df["sentiment"], random_state=1)

df_train, df_val = train_test_split(df_trainandval, test_size=0.1, stratify=df_trainandval["sentiment"], random_state=1)

print(f"train split: {df_train['sentiment'].value_counts(normalize=True)}")
print(f"validation split: {df_val['sentiment'].value_counts(normalize=True)}")
print(f"test split: {df_test['sentiment'].value_counts(normalize=True)}")

train split: sentiment
1    0.5
0    0.5
Name: proportion, dtype: float64
validation split: sentiment
0    0.5
1    0.5
Name: proportion, dtype: float64
test split: sentiment
0    0.5
1    0.5
Name: proportion, dtype: float64
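
Because the second `train_test_split` takes 10% of what is left after the first one, the overall proportions are not 80/10/10. A quick arithmetic sketch of the sizes implied by the two calls, assuming the 1,600,000 rows counted earlier:

```python
# Split sizes implied by two successive 10% hold-outs on 1,600,000 tweets.
n_total = 1_600_000

n_test = n_total * 10 // 100       # first split: 10% of everything
n_trainval = n_total - n_test
n_val = n_trainval * 10 // 100     # second split: 10% of the remainder
n_train = n_trainval - n_val

for name, n in [("train", n_train), ("validation", n_val), ("test", n_test)]:
    print(f"{name:<10} {n:>9,}  ({n / n_total:.0%})")
```

This works out to roughly an 81/9/10 split, and the validation and test sizes match the `support` totals in the classification reports further down.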


We'll vectorize the text at the start of our ML approach and tokenize it at the start of our DL approach, since each approach needs its own input representation.

---
### 2. Classical ML Approach 💻

#### Steps 📄:
- Extract features with TF-IDF

- Train logistic regression model

- Evaluate model on validation and test set

- Analyze performance

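We'll start our ML approach by vectorizing the text. The vectorizer is fit on the training set only and then applied to the validation and test sets, so no vocabulary or IDF statistics leak in from held-out data.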

In [56]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=15000)

ml_X_train = vectorizer.fit_transform(df_train["text"])
ml_X_val = vectorizer.transform(df_val["text"])
ml_X_test = vectorizer.transform(df_test["text"])

ml_y_train = df_train["sentiment"]
ml_y_val = df_val["sentiment"]
ml_y_test = df_test["sentiment"]

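Now we can peek at 10 words from the learned vocabulary and at the first 10 IDF scores. Note that the two lists are not aligned: `vocabulary_` is a dict whose keys come out in the order the words were first seen, while `idf_` is ordered by feature index, so the scores printed here do not belong to the words printed above them.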

In [57]:
print(f"First 10 words: {list(vectorizer.vocabulary_.keys())[:10]}")
print(f"IDF Scores: {vectorizer.idf_[:10]}")

First 10 words: ['want', 'write', 'something', 'site', 'please', 'post', 'behalf', 'details', 'author', 'im']
IDF Scores: [ 9.89301038 10.29567043 10.94765954 11.41123228 11.22464633 11.64080672
 10.44006494 11.20359292  9.62805656 11.33712431]


Next we'll initialize and train a logistic regression model on our data.

In [58]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=1000)
clf.fit(ml_X_train, ml_y_train)

LogisticRegression(max_iter=1000)
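
For intuition about what the fitted model does with a TF-IDF vector, here is a minimal sketch of a binary logistic regression prediction: a weighted sum of the features plus a bias, passed through a sigmoid. The weights, bias, and helper names below are hypothetical values chosen for illustration; the real ones live in `clf.coef_` and `clf.intercept_`.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights for illustration only; real values come from clf.coef_ and clf.intercept_.
weights = {"love": 2.0, "great": 1.5, "sad": -1.8, "hate": -2.2}
bias = 0.1

def predict_proba_positive(tfidf_features):
    """tfidf_features: dict mapping term -> TF-IDF value for one tweet."""
    z = bias + sum(weights.get(term, 0.0) * value for term, value in tfidf_features.items())
    return sigmoid(z)

p = predict_proba_positive({"love": 0.7, "great": 0.7})
label = int(p > 0.5)  # class 1 (positive) when the probability is above 0.5
print(f"P(positive) = {p:.3f}, predicted label = {label}")
```

Terms with large positive weights push a tweet toward the positive class and terms with large negative weights push it toward the negative class, which is what makes a linear baseline like this easy to inspect.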


Now let's see how the model performs on its own training data, which gives us a reference point for the validation check.

In [59]:
from sklearn.metrics import accuracy_score

ml_train_preds = clf.predict(ml_X_train)

print(f"Training Accuracy: {accuracy_score(ml_y_train, ml_train_preds)}")

Training Accuracy: 0.7867854938271605


Our training accuracy is good: 75-80% is a reasonable range for this model on this dataset. We're only using this model as a baseline to compare against our DL model, so we don't need to fine-tune the hyperparameters unless we see an issue. Before we evaluate on the test set, though, let's check validation accuracy to make sure the model isn't overfitting.

In [60]:
from sklearn.metrics import classification_report

ml_val_preds = clf.predict(ml_X_val)

print(f"Validation Accuracy: {accuracy_score(ml_y_val, ml_val_preds)}")
print(classification_report(ml_y_val, ml_val_preds))

Validation Accuracy: 0.7804722222222222
              precision    recall  f1-score   support

           0       0.79      0.76      0.78     72000
           1       0.77      0.80      0.78     72000

    accuracy                           0.78    144000
   macro avg       0.78      0.78      0.78    144000
weighted avg       0.78      0.78      0.78    144000
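
As a reminder of what the columns in this report mean, here is a minimal sketch that computes precision, recall, F1, and support by hand. The eight labels and the `report_for` helper are made up for illustration and are not taken from the notebook's predictions.

```python
# Toy labels (hypothetical): 8 examples, 4 per class.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]

def report_for(label):
    """Return (precision, recall, f1, support) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted as label
    fp = sum(t != label and p == label for t, p in pairs)  # wrongly predicted as label
    fn = sum(t == label and p != label for t, p in pairs)  # label missed by the model
    precision = tp / (tp + fp)   # of everything predicted as label, how much was right
    recall = tp / (tp + fn)      # of everything that is label, how much was found
    f1 = 2 * precision * recall / (precision + recall)
    support = tp + fn            # number of true examples of label
    return precision, recall, f1, support

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

for label in (0, 1):
    precision, recall, f1, support = report_for(label)
    print(f"{label}  precision={precision:.2f}  recall={recall:.2f}  f1={f1:.2f}  support={support}")
print(f"accuracy={accuracy:.2f}")
```

Read this way, the validation report says the model finds a somewhat larger share of the positive tweets (recall 0.80) than of the negative ones (recall 0.76).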



Nice, validation accuracy (78.0%) is within a point of training accuracy (78.7%), so the model doesn't appear to be overfitting and generalizes well to our validation set.

Now let's evaluate the model's performance on the test set.

In [61]:
ml_test_preds = clf.predict(ml_X_test)

print(f"Test Accuracy: {accuracy_score(ml_y_test, ml_test_preds)}")
print(classification_report(ml_y_test, ml_test_preds))

Test Accuracy: 0.782
              precision    recall  f1-score   support

           0       0.79      0.76      0.78     80000
           1       0.77      0.80      0.79     80000

    accuracy                           0.78    160000
   macro avg       0.78      0.78      0.78    160000
weighted avg       0.78      0.78      0.78    160000



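Great, 78.2% test accuracy is in line with validation and well above the 50% chance level for this balanced binary task, so it gives us a solid baseline to compare against moving forward. Now we can move on to our DL approach.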

---
### 3. Deep Learning Approach 🧠

### 4. Comparisons and Conclusions 💭