# Tweet Depression Sentiment Analysis Using NLTK and TF-IDF

### About Dataset
Context:
Detecting signs of depression from the words a person uses on social media could help them get support earlier.

Can you predict depression?
Sentiment analysis can help flag possible depression early, before someone gets into serious trouble.

Steps:

1. Preprocessing and cleaning
2. Train-test split (so the model is evaluated only on data it has not seen during training)
3. TF-IDF (fitted on the training split only, to prevent data leakage)
4. Train our Models
5. Testing the Models

In [1]:
# Loading the dataset
import pandas as pd
df = pd.read_csv("sentiment_tweets.csv")
df.drop(columns='Index',inplace=True)       # remove index column
df.rename(columns={'message to examine': 'text','label (depression result)': 'label'},inplace=True)
df

Unnamed: 0,text,label
0,just had a real good moment. i missssssssss hi...,0
1,is reading manga http://plurk.com/p/mzp1e,0
2,@comeagainjen http://twitpic.com/2y2lx - http:...,0
3,@lapcat Need to send 'em to my accountant tomo...,0
4,ADD ME ON MYSPACE!!! myspace.com/LookThunder,0
...,...,...
10309,No Depression by G Herbo is my mood from now o...,1
10310,What do you do when depression succumbs the br...,1
10311,Ketamine Nasal Spray Shows Promise Against Dep...,1
10312,dont mistake a bad day with depression! everyo...,1


In [2]:
df['label'].value_counts()      # Imbalanced dataset

label
0    8000
1    2314
Name: count, dtype: int64

In [3]:
# Data Cleaning and Preprocessing
df.duplicated().sum()

np.int64(31)

In [4]:
# Dropping duplicates
df.drop_duplicates(inplace=True)
df.shape

(10283, 2)

In [5]:
df.isnull().sum()   #No missing values

text     0
label    0
dtype: int64

In [6]:
# Importing necessary libraries
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()

In [7]:
df['text'].to_list()

['just had a real good moment. i missssssssss him so much, ',
 'is reading manga  http://plurk.com/p/mzp1e',
 '@comeagainjen http://twitpic.com/2y2lx - http://www.youtube.com/watch?v=zoGfqvh2ME8 ',
 "@lapcat Need to send 'em to my accountant tomorrow. Oddly, I wasn't even referring to my taxes. Those are supporting evidence, though. ",
 'ADD ME ON MYSPACE!!!  myspace.com/LookThunder',
 'so sleepy. good times tonight though ',
 '@SilkCharm re: #nbn as someone already said, does fiber to the home mean we will all at least be regular now ',
 '23 or 24ï¿½C possible today. Nice ',
 'nite twitterville  workout in the am  -ciao',
 "@daNanner Night, darlin'!  Sweet dreams to you ",
 'Good morning everybody! ',
 "Finally! I just created my WordPress Blog. There's already a blog up on the Seattle Coffee Community  ... http://tinyurl.com/c5uufd",
 'kisha they cnt get over u til they get out frm under u just remember ur on top ',
 '@nicolerichie Yes i remember that band, It was Awesome, Will you p

In [8]:
from bs4 import BeautifulSoup

In [9]:
# Lowering the case
df['text'] = df['text'].str.lower()
df.head()

Unnamed: 0,text,label
0,just had a real good moment. i missssssssss hi...,0
1,is reading manga http://plurk.com/p/mzp1e,0
2,@comeagainjen http://twitpic.com/2y2lx - http:...,0
3,@lapcat need to send 'em to my accountant tomo...,0
4,add me on myspace!!! myspace.com/lookthunder,0


In [None]:
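## Text Preprocessing and Cleaning (install nltk, bs4, and lxml if not already installed;
## the NLTK 'stopwords' and 'wordnet' corpora must also be downloaded once via nltk.download)

# Build the stopword set once; calling stopwords.words('english') for every word is very slow
stop_words = set(stopwords.words('english'))

# Removing special characters (keeps letters, digits, spaces and hyphens)
# Note: because this runs first, '://', '.', '<' and '>' are already gone when the
# URL and HTML steps below run, so those two steps no longer find anything to remove
df['text'] = df['text'].apply(lambda x: re.sub('[^a-z A-Z 0-9-]+','',x))

# Removing stopwords
df['text'] = df['text'].apply(lambda x: " ".join([y for y in x.split() if y not in stop_words]))

# Remove URLs
df['text'] = df['text'].apply(lambda x: re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '' , str(x)))

## Remove HTML tags
df['text'] = df['text'].apply(lambda x: BeautifulSoup(x, 'lxml').get_text())

## Remove any additional spaces
df['text'] = df['text'].apply(lambda x: " ".join(x.split()))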
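
In the cell above, special characters are stripped before the URL and HTML steps run, which is why the cleaned output further down still contains tokens such as `httpplurkcompmzp1e` and `quotnokla`. The following is a minimal sketch of an alternative ordering, using only the standard library. The function name `clean_tweet`, the simplified URL and tag patterns, and the tiny `STOP_WORDS` set are all hypothetical stand-ins (for the notebook's regex, BeautifulSoup, and NLTK's stopword list), and the second input is an illustrative guess at what the raw tweet looked like. This is not the code that produced the results in this notebook.

```python
import html
import re

# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
STOP_WORDS = {"is", "a", "the", "to", "my", "i", "so", "had", "just", "him"}

URL_RE = re.compile(r"(?:https?|ftp|ssh)://\S+")  # simplified URL pattern (assumption)
TAG_RE = re.compile(r"<[^>]+>")                   # crude HTML tag matcher (assumption)
NON_ALNUM_RE = re.compile(r"[^a-z0-9 -]+")        # same character policy as the notebook


def clean_tweet(text: str) -> str:
    text = text.lower()
    text = URL_RE.sub(" ", text)       # 1. URLs first, while "://" and "." still exist
    text = html.unescape(text)         # 2. decode entities such as &quot;
    text = TAG_RE.sub(" ", text)       # 3. drop HTML tags
    text = NON_ALNUM_RE.sub("", text)  # 4. only now strip special characters
    words = [w for w in text.split() if w not in STOP_WORDS]  # 5. remove stopwords
    return " ".join(words)


print(clean_tweet("is reading manga  http://plurk.com/p/mzp1e"))
# prints: reading manga
print(clean_tweet("@shipovalov &quot;NOKLA connecting people&quot;"))
# prints: shipovalov nokla connecting people
```

Switching to this ordering would change the vocabulary TF-IDF learns, so the metrics reported later would need to be recomputed.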

In [11]:
# Lemmatizing
def lemmatize_words(text):
    return " ".join([wnl.lemmatize(word) for word in text.split()])

df['text'] = df['text'].apply(lemmatize_words)

In [12]:
df['text'].tolist()

['real good moment missssssssss much',
 'reading manga httpplurkcompmzp1e',
 'comeagainjen httptwitpiccom2y2lx - httpwwwyoutubecomwatchvzogfqvh2me8',
 'lapcat need send em accountant tomorrow oddly wasnt even referring tax supporting evidence though',
 'add myspace myspacecomlookthunder',
 'sleepy good time tonight though',
 'silkcharm nbn someone already said fiber home mean least regular',
 '23 24c possible today nice',
 'nite twitterville workout -ciao',
 'dananner night darlin sweet dream',
 'good morning everybody',
 'finally created wordpress blog there already blog seattle coffee community httptinyurlcomc5uufd',
 'kisha cnt get u til get frm u remember ur top',
 'nicolerichie yes remember band awesome please reply',
 'really love reflection shadow',
 'blueaero ooo fantasy like fantasy novel check',
 'rokchic28 probs sell nothing blog httpsnedwancom ill get listen band itunes',
 'shipovalov quotnokla connecting peoplequot',
 'stayed late start early good thing like job',
 'kalpen

In [13]:
df.shape

(10283, 2)

## Create TF-IDF

In [14]:
## Input and Output Features
X = df['text']
y = df['label']

In [15]:
# Train-test split
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,stratify=y,random_state=42)
len(X_train),len(X_test)

(8226, 2057)

In [16]:
# Create TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=2500,ngram_range=(1,2))

In [17]:
X_train = tfidf.fit_transform(X_train).toarray()
X_test = tfidf.transform(X_test).toarray()

In [18]:
X_train.shape, X_test.shape

((8226, 2500), (2057, 2500))
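
To make the fit-on-train, transform-on-test distinction concrete, here is a pure-Python sketch of the smoothed TF-IDF weighting that `TfidfVectorizer` applies with its default settings: idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each document vector. It handles unigrams only and ignores `max_features`. The helper names `fit_idf` and `transform` and the toy documents are assumptions made for illustration.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Learn the vocabulary and smoothed IDF from the training documents only."""
    n = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.split()))  # count each term once per document
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in doc_freq.items()}


def transform(doc, idf):
    """Weight a document with the learned IDF; terms unseen in training are ignored."""
    tf = Counter(t for t in doc.split() if t in idf)
    weights = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


# Toy training documents (made up for this sketch)
train = ["good morning everybody", "depression anxiety worst", "good time tonight"]
idf = fit_idf(train)

# A "test" document: "unseenword" never appeared in training, so it gets no weight
vec = transform("good depression unseenword", idf)
print(vec)
```

Because the IDF table is built from `train` alone, nothing about the test documents (their vocabulary or term frequencies) can influence the features, which is the leakage the split-then-fit order guards against.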

In [19]:
# Check the distribution of the classes before resampling
y_train.value_counts()

label
0    6396
1    1830
Name: count, dtype: int64

In [20]:
# Over-sampling the minority class
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
X_train, y_train = sm.fit_resample(X_train, y_train)
X_train.shape, y_train.shape

((12792, 2500), (12792,))

In [21]:
# Check the distribution of the classes after resampling
y_train.value_counts()

label
0    6396
1    6396
Name: count, dtype: int64
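
The training set grows from 8226 to 12792 rows because SMOTE adds 6396 - 1830 = 4566 synthetic minority samples. Each synthetic sample is created by interpolating between a minority point and one of its nearest minority-class neighbours. The sketch below shows only that interpolation step in pure Python. The function name `synthesize` and the toy vectors are hypothetical, and the neighbour is passed in directly, whereas imblearn's `SMOTE` also performs the k-nearest-neighbour search and decides how many samples to generate.

```python
import random


def synthesize(x, neighbor, rng):
    """Return a random point on the line segment between x and neighbor."""
    lam = rng.random()  # interpolation factor, uniform in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]


rng = random.Random(42)
x = [0.0, 0.413, 0.0]         # toy minority-class TF-IDF vector (made up)
neighbor = [0.2, 0.113, 0.0]  # toy nearest minority neighbour (made up)

new_point = synthesize(x, neighbor, rng)
print(new_point)
```

Since resampling is applied to `X_train` only, the test set keeps its original class distribution and contains no synthetic rows.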

In [22]:
# Changing output display
import numpy as np
np.set_printoptions(edgeitems=30, linewidth=100000, formatter = dict(float=lambda x: "%.3g" % x))

In [23]:
X_train       # TF-IDF weights are real-valued (e.g. 0.413), not just 0 and 1

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.413, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 

In [24]:
tfidf.vocabulary_    

{'heyy': np.int64(1056),
 'aww': np.int64(166),
 'thanks': np.int64(2175),
 'much': np.int64(1502),
 'hun': np.int64(1106),
 'rock': np.int64(1856),
 'xoxo': np.int64(2463),
 'thanks much': np.int64(2178),
 'gonna': np.int64(928),
 'go': np.int64(909),
 'eat': np.int64(666),
 'lunch': np.int64(1373),
 'yummy': np.int64(2498),
 'soup': np.int64(2023),
 'gonna go': np.int64(929),
 'recording': np.int64(1814),
 'music': np.int64(1508),
 'hit': np.int64(1066),
 'back': np.int64(172),
 'sometime': np.int64(2004),
 'hope': np.int64(1080),
 'talk': np.int64(2145),
 'soon': np.int64(2010),
 'best': np.int64(220),
 'hi': np.int64(1057),
 'hello': np.int64(1045),
 'depression': np.int64(522),
 'anxiety': np.int64(114),
 'worst': np.int64(2445),
 'hell': np.int64(1043),
 'emoji': np.int64(676),
 'upside': np.int64(2321),
 'down': np.int64(634),
 'face': np.int64(750),
 'hello depression': np.int64(1046),
 'depression anxiety': np.int64(527),
 'emoji upside': np.int64(682),
 'upside down': np.int6

In [25]:
from sklearn.linear_model import LogisticRegression
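# The classes are already balanced by SMOTE (6396 each), so class_weight='balanced'
# gives both classes a weight of 1 and has no additional effect here
sentiment_tfidf_model = LogisticRegression(class_weight='balanced').fit(X_train,y_train)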
sentiment_tfidf_model

In [26]:
y_pred = sentiment_tfidf_model.predict(X_test)

# Performance Metrics
from sklearn.metrics import accuracy_score,f1_score, precision_score, recall_score, confusion_matrix
print('Confusion Matrix: \n',confusion_matrix(y_test,y_pred))
print('\nAccuracy: ',accuracy_score(y_test,y_pred))
print('\nPrecision: \n',precision_score(y_test,y_pred))
print('\nRecall: \n',recall_score(y_test,y_pred))
print('\nF1-Score: \n',f1_score(y_test,y_pred))

Confusion Matrix: 
 [[1591    9]
 [  18  439]]

Accuracy:  0.9868740884783666

Precision: 
 0.9799107142857143

Recall: 
 0.9606126914660832

F1-Score: 
 0.9701657458563536


In [27]:
from sklearn.metrics import classification_report
print('\nClassification Report: \n',classification_report(y_test,y_pred))


Classification Report: 
               precision    recall  f1-score   support

           0       0.99      0.99      0.99      1600
           1       0.98      0.96      0.97       457

    accuracy                           0.99      2057
   macro avg       0.98      0.98      0.98      2057
weighted avg       0.99      0.99      0.99      2057



The model performs strongly on every metric despite the class imbalance. It catches about 96% of the minority (depressed) class, 439 of 457 test tweets, which is what matters most in imbalanced problems.
The F1 score of 0.97 shows that precision and recall are well balanced. False positives are rare but not absent: 9 of the 1600 non-depressed tweets were flagged as depressed, and 18 depressed tweets were missed.
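
As a quick check, the reported scores can be reproduced by hand from the confusion matrix printed above. This is a small sketch that assumes scikit-learn's layout, with rows as true classes and columns as predicted classes, and class 1 (depressed) as the positive class.

```python
# Confusion matrix from the output above: [[TN, FP], [FN, TP]]
tn, fp = 1591, 9
fn, tp = 18, 439

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all test tweets classified correctly
precision = tp / (tp + fp)                  # of tweets flagged as depressed, how many were
recall = tp / (tp + fn)                     # of depressed tweets, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
# prints: accuracy=0.9869 precision=0.9799 recall=0.9606 f1=0.9702
```

The 9 false positives and 18 false negatives are the only errors among the 2057 test tweets.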

In [28]:
# Saving the model and the vectorizer
import joblib
joblib.dump(sentiment_tfidf_model, 'model.pkl')
joblib.dump(tfidf, 'vectorizer.pkl')