# Twitter NLP Sentiment Classifier

This notebook builds a simple NLP model to classify tweets as positive or negative using the [Sentiment140 dataset](https://www.kaggle.com/datasets/kazanova/sentiment140).



In [37]:
# Import core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

---
### 1. Data Preprocessing 🔨

#### Steps 📄:
- Load and inspect Sentiment140 dataset

- Clean raw tweet text

- Normalize case and remove stopwords

- Split into training, validation, and testing sets

We'll begin by loading the CSV into a DataFrame and exploring it to understand its structure and the distribution of sentiment labels.

In [38]:
df = pd.read_csv("../data/Sentiment140.csv", encoding="latin-1", header=None)

print(df.head())
df[0].value_counts()

   0           1                             2         3                4  \
0  0  1467810369  Mon Apr 06 22:19:45 PDT 2009  NO_QUERY  _TheSpecialOne_   
1  0  1467810672  Mon Apr 06 22:19:49 PDT 2009  NO_QUERY    scotthamilton   
2  0  1467810917  Mon Apr 06 22:19:53 PDT 2009  NO_QUERY         mattycus   
3  0  1467811184  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY          ElleCTF   
4  0  1467811193  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY           Karoli   

                                                   5  
0  @switchfoot http://twitpic.com/2y1zl - Awww, t...  
1  is upset that he can't update his Facebook by ...  
2  @Kenichan I dived many times for the ball. Man...  
3    my whole body feels itchy and like its on fire   
4  @nationwideclass no, it's not behaving at all....  


0
0    800000
4    800000
Name: count, dtype: int64

We are only going to use the sentiment and the text, which are found in columns 0 and 5 respectively.<br/>
The sentiment is distributed evenly between 0 (negative) and 4 (positive), with 800,000 tweets each.<br/>
It is also clear that the text will need some cleaning (mentions, URLs, special characters, etc.).

Let's label the columns and keep the two we need.<br/>
We'll also map the label 4 to 1 so the target is a plain binary 0/1.

In [39]:
df.columns = ["sentiment", "id", "date", "flag", "user", "text"]
df = df[["sentiment", "text"]].copy()  # copy so the assignment below does not write to a view

df["sentiment"] = df["sentiment"].replace({4:1})

print(df.head())
print(df["sentiment"].value_counts())

   sentiment                                               text
0          0  @switchfoot http://twitpic.com/2y1zl - Awww, t...
1          0  is upset that he can't update his Facebook by ...
2          0  @Kenichan I dived many times for the ball. Man...
3          0    my whole body feels itchy and like its on fire 
4          0  @nationwideclass no, it's not behaving at all....
sentiment
0    800000
1    800000
Name: count, dtype: int64
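The labelling, column selection, and remapping can also be done while loading. The sketch below is one possible way to do it, run on two made-up rows in the same six-column layout rather than on the real file; the row contents and the `demo` and `sample_csv` names are illustrative assumptions.

```python
import io

import pandas as pd

# Two made-up rows in the Sentiment140 column layout (illustrative data only).
sample_csv = io.StringIO(
    "0,1,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,user_a,so tired today\n"
    "4,2,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,user_b,what a great day\n"
)

columns = ["sentiment", "id", "date", "flag", "user", "text"]

# names= assigns the labels, usecols= keeps only the two columns we need.
demo = pd.read_csv(
    sample_csv,
    header=None,
    names=columns,
    usecols=["sentiment", "text"],
)

# Map the original labels to a binary target: 0 stays 0, 4 becomes 1.
demo["sentiment"] = demo["sentiment"].map({0: 0, 4: 1})
print(demo)
```

Because `demo` comes straight from `read_csv`, it is a fresh DataFrame and the column assignment does not touch a slice of another frame.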


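Next we'll use Python's `re` module and NLTK's English stop-word list to clean the tweets: strip URLs, mentions, and hashtags, drop non-letter characters, lowercase the text, and remove stop words.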

In [40]:
import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

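def clean_tweets(text):
    """Strip tweet noise and return lowercase, stop-word-free text."""
    text = re.sub(r"http\S+|www\S+", "", text)  # URLs (http\S+ also covers https)
    text = re.sub(r"@\w+", "", text)            # @mentions
    text = re.sub(r"#\w+", "", text)            # hashtags, including the tag word
    text = re.sub(r"[^a-zA-Z\s]", "", text)     # anything that is not an ASCII letter or whitespace
    tokens = text.lower().split()               # split() also discards extra whitespace
    tokens = [word for word in tokens if word not in stop_words]
    return " ".join(tokens)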

df["text"] = df["text"].apply(clean_tweets)
print(df["text"].head())

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\theli\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


0        thats bummer shoulda got david carr third day
1    upset cant update facebook texting might cry r...
2    dived many times ball managed save rest go bounds
3                     whole body feels itchy like fire
4                             behaving im mad cant see
Name: text, dtype: object
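To see what each stage of the cleaning contributes, the sketch below applies the same sequence of substitutions to a single made-up tweet and records the intermediate strings. The `clean_steps` helper, the sample tweet, and the tiny `DEMO_STOP_WORDS` set are assumptions for illustration; the notebook itself uses NLTK's full English list.

```python
import re

# Tiny stand-in for NLTK's English stop-word list (an assumption for this demo).
DEMO_STOP_WORDS = {"i", "the", "is", "at", "my", "so", "this"}


def clean_steps(text):
    """Return the intermediate result after each cleaning stage."""
    steps = {}
    text = re.sub(r"http\S+|www\S+", "", text)
    steps["no_urls"] = text
    text = re.sub(r"@\w+", "", text)
    steps["no_mentions"] = text
    text = re.sub(r"#\w+", "", text)
    steps["no_hashtags"] = text
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    steps["letters_only"] = text
    tokens = text.lower().split()
    steps["final"] = " ".join(t for t in tokens if t not in DEMO_STOP_WORDS)
    return steps


# Hypothetical tweet, not taken from the dataset.
sample = "@friend I can't believe this! http://example.com #fail"
for stage, value in clean_steps(sample).items():
    print(f"{stage:>12}: {value!r}")
```

Two side effects are worth noticing. The apostrophe is removed before the stop-word filter runs, so `can't` becomes `cant` and is kept, which matches the `cant` and `im` tokens in the output above. The hashtag pattern also drops the tag word itself, so any sentiment carried by a tag such as `#fail` is lost.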


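Now we can split the data into train, validation, and test sets. Two successive 90/10 splits give roughly 81% train, 9% validation, and 10% test, and stratifying on the label keeps the 50/50 class balance in every split.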

In [42]:
from sklearn.model_selection import train_test_split

df_trainandval, df_test = train_test_split(df, test_size=0.1, stratify=df["sentiment"], random_state=1)

df_train, df_val = train_test_split(df_trainandval, test_size=0.1, stratify=df_trainandval["sentiment"], random_state=1)

print(f"train split: {df_train['sentiment'].value_counts(normalize=True)}")
print(f"validation split: {df_val['sentiment'].value_counts(normalize=True)}")
print(f"test split: {df_test['sentiment'].value_counts(normalize=True)}")

train split: sentiment
1    0.5
0    0.5
Name: proportion, dtype: float64
validation split: sentiment
0    0.5
1    0.5
Name: proportion, dtype: float64
test split: sentiment
0    0.5
1    0.5
Name: proportion, dtype: float64
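The proportions printed above confirm the class balance but not the size of each split. The arithmetic below is a rough sketch of the sizes implied by the two successive `test_size=0.1` calls, assuming the 1,600,000 rows counted earlier and exact 10% cuts; the counts produced by `train_test_split` could differ by a row because of rounding.

```python
from fractions import Fraction

total_rows = 1_600_000        # 800,000 per class, from the value counts earlier
test_frac = Fraction(1, 10)   # test_size=0.1 in the first split
val_frac = Fraction(1, 10)    # test_size=0.1 in the second split

# First split: hold out the test set.
test_rows = int(total_rows * test_frac)
remaining = total_rows - test_rows

# Second split: hold out the validation set from what is left.
val_rows = int(remaining * val_frac)
train_rows = remaining - val_rows

for name, rows in [("train", train_rows), ("validation", val_rows), ("test", test_rows)]:
    share = float(Fraction(rows, total_rows))
    print(f"{name}: {rows:,} rows ({share:.0%} of the data)")
```

Under these assumptions the validation set is 9% of the full dataset, not 10%, because the second split is taken from the 90% that remains after the test set is removed.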


### 2. Classical ML Approach 💻

### 3. Deep Learning Approach 🧠

### 4. Comparisons and Conclusions 💭