# 📝 Twitter Sentiment Analysis using NLP & Machine Learning

This project aims to classify tweets as **positive** or **negative** using Natural Language Processing (NLP) techniques and machine learning models.

We will:
- Clean and preprocess text data
- Perform Exploratory Data Analysis (EDA)
- Vectorize text using TF-IDF
- Train and evaluate multiple ML models
- Tune hyperparameters for best performance


## 📂 Dataset Description

**Source:** [Sentiment140 dataset](https://www.kaggle.com/datasets/kazanova/sentiment140)

**Columns:**
- `target`: Sentiment label (0 = Negative, 4 = Positive in the raw file; 4 is remapped to 1 during preprocessing)
- `id`: Tweet ID (removed in preprocessing)
- `date`: Date of the tweet (not used in modeling)
- `flag`: Query flag (not used in modeling)
- `user`: Username (removed in preprocessing)
- `text`: Tweet content

**Dataset Size:** 1.6 million tweets (balanced: 800k positive, 800k negative)


In [1]:
import numpy as np
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


In [4]:
nltk.download('stopwords')

# Print the English stopword list that is removed during preprocessing
print(stopwords.words('english'))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/aloknegi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [5]:
column_names = ['target', 'id', 'date', 'flag', 'user', 'text']
df = pd.read_csv('/Users/aloknegi/Downloads/twitter_data.csv',names=column_names,encoding='ISO-8859-1')

# EDA

In [7]:
df.shape

(1600000, 6)

In [8]:
df.head()

Unnamed: 0,target,id,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [9]:
df.info()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 6 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   target  1600000 non-null  int64 
 1   id      1600000 non-null  int64 
 2   date    1600000 non-null  object
 3   flag    1600000 non-null  object
 4   user    1600000 non-null  object
 5   text    1600000 non-null  object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB


In [10]:
df.isnull().sum()

target    0
id        0
date      0
flag      0
user      0
text      0
dtype: int64

## 🧹 Data Cleaning & Preprocessing

Steps:
1. Removed unnecessary columns (`id`, `flag`, `user`)
2. Remapped the positive label from 4 to 1 so the target is binary (0/1)
3. Removed punctuation, special characters, and numbers (kept letters only)
4. Lowercased text
5. Tokenized text by splitting on whitespace
6. Removed stopwords
7. Lemmatized words
8. Created a new column `lemmatized_text` for processed tweets


In [11]:
df.drop(columns=['id', 'flag', 'user'], inplace=True)

In [67]:

df.duplicated().sum()


397

In [13]:
df['target'].value_counts()

target
0    800000
4    800000
Name: count, dtype: int64

In [14]:
df['target'] = df['target'].replace(4, 1)

In [15]:
df['target'].value_counts()

target
0    800000
1    800000
Name: count, dtype: int64

In [16]:
df

Unnamed: 0,target,date,text
0,0,Mon Apr 06 22:19:45 PDT 2009,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,Mon Apr 06 22:19:49 PDT 2009,is upset that he can't update his Facebook by ...
2,0,Mon Apr 06 22:19:53 PDT 2009,@Kenichan I dived many times for the ball. Man...
3,0,Mon Apr 06 22:19:57 PDT 2009,my whole body feels itchy and like its on fire
4,0,Mon Apr 06 22:19:57 PDT 2009,"@nationwideclass no, it's not behaving at all...."
...,...,...,...
1599995,1,Tue Jun 16 08:40:49 PDT 2009,Just woke up. Having no school is the best fee...
1599996,1,Tue Jun 16 08:40:49 PDT 2009,TheWDB.com - Very cool to hear old Walt interv...
1599997,1,Tue Jun 16 08:40:49 PDT 2009,Are you ready for your MoJo Makeover? Ask me f...
1599998,1,Tue Jun 16 08:40:49 PDT 2009,Happy 38th Birthday to my boo of alll time!!! ...


In [17]:
!pip install swifter > /dev/null  
import swifter
swifter.set_defaults(progress_bar=False)
from nltk.stem import WordNetLemmatizer

In [18]:
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/aloknegi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/aloknegi/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/aloknegi/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [19]:
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

In [20]:
def lemmatize_tweet(content):
    # Keep only letters
    content = re.sub('[^a-zA-Z]', ' ', content)
    # Lowercase
    content = content.lower()
    # Tokenize
    words = content.split()
    # Remove stopwords and lemmatize
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    return ' '.join(words)

In [21]:
df['lemmatized_text'] = df['text'].swifter.apply(lemmatize_tweet)


In [22]:
df

Unnamed: 0,target,date,text,lemmatized_text
0,0,Mon Apr 06 22:19:45 PDT 2009,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",switchfoot http twitpic com zl awww bummer sho...
1,0,Mon Apr 06 22:19:49 PDT 2009,is upset that he can't update his Facebook by ...,upset update facebook texting might cry result...
2,0,Mon Apr 06 22:19:53 PDT 2009,@Kenichan I dived many times for the ball. Man...,kenichan dived many time ball managed save res...
3,0,Mon Apr 06 22:19:57 PDT 2009,my whole body feels itchy and like its on fire,whole body feel itchy like fire
4,0,Mon Apr 06 22:19:57 PDT 2009,"@nationwideclass no, it's not behaving at all....",nationwideclass behaving mad see
...,...,...,...,...
1599995,1,Tue Jun 16 08:40:49 PDT 2009,Just woke up. Having no school is the best fee...,woke school best feeling ever
1599996,1,Tue Jun 16 08:40:49 PDT 2009,TheWDB.com - Very cool to hear old Walt interv...,thewdb com cool hear old walt interview http b...
1599997,1,Tue Jun 16 08:40:49 PDT 2009,Are you ready for your MoJo Makeover? Ask me f...,ready mojo makeover ask detail
1599998,1,Tue Jun 16 08:40:49 PDT 2009,Happy 38th Birthday to my boo of alll time!!! ...,happy th birthday boo alll time tupac amaru sh...
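
The processed output above shows that URL fragments and user handles (for example `http twitpic com` and `switchfoot`) survive as tokens. The following is a minimal sketch of one possible extra step, assuming it is acceptable to drop URLs and @mentions before the letters-only filter. The patterns and the helpers `strip_urls_and_mentions` and `basic_clean` are hypothetical and not part of the notebook; the sample tweet is made up.

```python
import re

# Hypothetical patterns (assumptions, not taken from the notebook)
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')
MENTION_PATTERN = re.compile(r'@\w+')

def strip_urls_and_mentions(text):
    """Replace URLs and @mentions with spaces so they do not become tokens."""
    text = URL_PATTERN.sub(' ', text)
    text = MENTION_PATTERN.sub(' ', text)
    return text

def basic_clean(text):
    """Letters-only filter, lowercasing, and whitespace split, as in lemmatize_tweet.

    Stopword removal and lemmatization are left out here because they need NLTK data.
    """
    text = strip_urls_and_mentions(text)
    text = re.sub('[^a-zA-Z]', ' ', text)
    return text.lower().split()

# Made-up sample tweet for illustration
sample = "@some_user http://example.com/abc - Great game today!!"
tokens = basic_clean(sample)
print(tokens)
```

Whether this helps accuracy is untested here; it only keeps handle and URL fragments out of the TF-IDF vocabulary.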


# 📊 Feature & Target Definition + Data Splitting
We separated:
- **X (Features):** Preprocessed tweet text
- **y (Target):** Sentiment labels (0 = negative, 1 = positive)

Then, we split the dataset into:
- **Training set:** Used to train the machine learning model
- **Test set:** Used to evaluate model performance on unseen data

Why split?
Holding out a test set lets us detect overfitting: the model is evaluated on data it has not seen during training, which gives a more reliable estimate of real-world performance.
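
To illustrate what a stratified split does (the notebook requests one with `stratify=y`), here is a simplified pure-Python sketch. It is not scikit-learn's implementation; `stratified_split` is a hypothetical helper and the labels are toy data.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=2):
    """Return (train_idx, test_idx) so each class keeps its share in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # shuffle within each class
        n_test = round(len(indices) * test_size)   # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy balanced labels standing in for the 0/1 sentiment targets
labels = [0] * 50 + [1] * 50
train_idx, test_idx = stratified_split(labels)

test_positive_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_positive_share)
```

Because the quota is taken per class, the test set keeps the same 50/50 balance as the full data.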


In [23]:
x=df['lemmatized_text'].values
y=df['target'].values

In [24]:
x

array(['switchfoot http twitpic com zl awww bummer shoulda got david carr third day',
       'upset update facebook texting might cry result school today also blah',
       'kenichan dived many time ball managed save rest go bound', ...,
       'ready mojo makeover ask detail',
       'happy th birthday boo alll time tupac amaru shakur',
       'happy charitytuesday thenspcc sparkscharity speakinguph h'],
      dtype=object)

In [25]:
y

array([0, 0, 0, ..., 1, 1, 1])

In [26]:
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.2, stratify=y, random_state=2)

## 🔤 Text Vectorization

We converted text into numerical features using:
- **TF-IDF Vectorization** (Term Frequency - Inverse Document Frequency)

**Why TF-IDF?**
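It down-weights words that appear in many tweets and up-weights words that are frequent within a tweet but rare across the corpus. This often, though not always, gives better classification accuracy than raw Bag-of-Words counts. The vocabulary is capped at the 10,000 most frequent terms (`max_features=10000`).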
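
The weighting can be worked through by hand on a toy corpus. The sketch below assumes the smoothed IDF formula and L2 row normalisation that `TfidfVectorizer` uses by default, and a plain whitespace tokenizer; the `tfidf` helper and the three documents are illustrative only.

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute L2-normalised TF-IDF weights per document.

    Uses smoothed IDF: idf(t) = ln((1 + n) / (1 + df(t))) + 1.
    """
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

docs = ["good day", "bad day", "good good movie"]  # toy corpus
rows = tfidf(docs)
for doc, row in zip(docs, rows):
    print(doc, {t: round(w, 3) for t, w in row.items()})
```

In the second document, `bad` (found in one document) receives a higher weight than `day` (found in two), which is the down-weighting effect described above.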


In [28]:
vectorizer = TfidfVectorizer(max_features=10000)

X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

## 🤖 Model Training & Evaluation

Models Tested:
- Logistic Regression
- LinearSVC (linear Support Vector Classifier)


Metrics:
- Accuracy Score
- Classification Report
- Confusion Matrix
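
The notebook imports `confusion_matrix`, but no call to it appears in the cells shown. As a hedged illustration of how the three metrics relate, the sketch below counts the four confusion-matrix cells by hand on made-up labels; `confusion_counts` is a hypothetical helper. For binary 0/1 labels, `confusion_matrix(Y_test, y_pred)` from scikit-learn returns the same counts laid out as `[[tn, fp], [fn, tp]]`.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 is the positive class."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

# Made-up labels for illustration only
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
print(tn, fp, fn, tp, accuracy, precision, recall)
```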


In [30]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix

model =LogisticRegression()

In [31]:
model.fit(X_train, Y_train)
y_pred = model.predict(X_test)
print(f'Accuracy of {model}: {accuracy_score(Y_test, y_pred):.4f}')
print(classification_report(Y_test, y_pred))

Accuracy of LogisticRegression(): 0.7746
              precision    recall  f1-score   support

           0       0.79      0.75      0.77    160000
           1       0.76      0.80      0.78    160000

    accuracy                           0.77    320000
   macro avg       0.78      0.77      0.77    320000
weighted avg       0.78      0.77      0.77    320000



In [32]:
from sklearn.svm import LinearSVC
model_SVC = LinearSVC()

model_SVC.fit(X_train, Y_train)
y_pred = model_SVC.predict(X_test)
print(f'Accuracy of {model_SVC}: {accuracy_score(Y_test, y_pred):.4f}')
print(classification_report(Y_test, y_pred))

Accuracy of LinearSVC(): 0.7741
              precision    recall  f1-score   support

           0       0.79      0.75      0.77    160000
           1       0.76      0.80      0.78    160000

    accuracy                           0.77    320000
   macro avg       0.77      0.77      0.77    320000
weighted avg       0.77      0.77      0.77    320000



# Hyperparameter Tuning of Logistic Regression

In [76]:
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
import numpy as np

# --- Step 1: Take a much smaller sample ---
X_sample, _, y_sample, _ = train_test_split(
    X_train, Y_train, 
    train_size=5000,  # just 5k rows for tuning
    stratify=Y_train, 
    random_state=42
)

# --- Step 2: Parameter distribution ---
param_dist = {
    'C': np.logspace(-3, 2, 6),   # smaller range
    'penalty': ['l1', 'l2'],
    'solver': ['saga']  # supports both penalties, good for sparse
}

# --- Step 3: Randomized search ---
log_reg = LogisticRegression(max_iter=500, random_state=42)
rand_search = RandomizedSearchCV(
    estimator=log_reg,
    param_distributions=param_dist,
    n_iter=6,  # only 6 combos
    scoring='accuracy',
    cv=3,
    verbose=1,
    n_jobs=-1
)

# --- Step 4: Fit on the small sample ---
rand_search.fit(X_sample, y_sample)

print("Best Params:", rand_search.best_params_)

# --- Step 5: Retrain on full train data ---
best_model = LogisticRegression(max_iter=500, random_state=42, **rand_search.best_params_)
best_model.fit(X_train, Y_train)


Fitting 3 folds for each of 6 candidates, totalling 18 fits
Best Params: {'solver': 'saga', 'penalty': 'l2', 'C': 0.01}
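
To put the search above in context, the sketch below enumerates the full parameter grid in plain Python and compares it with what the randomized search actually tried. The values mirror `param_dist`; the variable names are illustrative.

```python
from itertools import product

# Same values as np.logspace(-3, 2, 6): 0.001, 0.01, 0.1, 1, 10, 100
c_values = [10.0 ** exponent for exponent in range(-3, 3)]
penalties = ['l1', 'l2']

# Every combination a full grid search would evaluate
grid = [
    {'C': c, 'penalty': penalty, 'solver': 'saga'}
    for c, penalty in product(c_values, penalties)
]

n_iter = 6   # candidates sampled by RandomizedSearchCV in the cell above
cv = 3       # folds per candidate

coverage = n_iter / len(grid)
total_fits = n_iter * cv
print(len(grid), coverage, total_fits)
```

Only half of the 12 combinations are sampled, and each is scored on a 5,000-row subset, so the reported best parameters are a rough estimate rather than a guaranteed optimum for the full training set.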


In [64]:
model_h=LogisticRegression(
    C=0.01,            # Lower C = stronger regularization (reduces overfitting, but can underfit)
    penalty='l2',      # Ridge regularization (works well with high-dimensional sparse data like TF-IDF)
    solver='saga',     # Efficient solver for large sparse datasets, supports l1 and l2
    max_iter=500,      # Maximum number of solver iterations allowed for convergence
    random_state=42    # For reproducibility
)

model_h.fit(X_train, Y_train)
y_pred = model_h.predict(X_test)
print(f'Accuracy of {model_h}: {accuracy_score(Y_test, y_pred):.4f}')
print(classification_report(Y_test, y_pred))


Accuracy of LogisticRegression(C=0.01, max_iter=500, random_state=42, solver='saga'): 0.7594
              precision    recall  f1-score   support

           0       0.77      0.74      0.75    160000
           1       0.75      0.78      0.76    160000

    accuracy                           0.76    320000
   macro avg       0.76      0.76      0.76    320000
weighted avg       0.76      0.76      0.76    320000



# Hyperparameter Tuning of LinearSVC

In [62]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split, RandomizedSearchCV
import numpy as np

# Sample a smaller subset (no scaling)
X_sample, _, y_sample, _ = train_test_split(
    X_train, Y_train,
    train_size=2000,
    stratify=Y_train,
    random_state=42
)

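param_dist = {
    'C': np.logspace(-3, 2, 6),
    'penalty': ['l2'],
    # Caution: loss='hinge' is not supported together with penalty='l2' and dual=False,
    # so every candidate that samples 'hinge' fails to fit and is scored as nan.
    'loss': ['hinge', 'squared_hinge'],
    'max_iter': [20000, 50000, 100000],  # Higher iterations
    'dual': [False]
}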

svc = LinearSVC(random_state=42)

rand_search = RandomizedSearchCV(
    estimator=svc,
    param_distributions=param_dist,
    n_iter=5,
    scoring='accuracy',
    cv=2,
    verbose=1,
    n_jobs=1,
    random_state=42
)

rand_search.fit(X_sample, y_sample)
print("Best Params:", rand_search.best_params_)

best_params = rand_search.best_params_
svc_final = LinearSVC(**best_params, random_state=42)
svc_final.fit(X_train, Y_train)

# Predict on test set as usual (no scaling)
# y_pred = svc_final.predict(X_test)


Fitting 2 folds for each of 5 candidates, totalling 10 fits
Best Params: {'penalty': 'l2', 'max_iter': 50000, 'loss': 'squared_hinge', 'dual': False, 'C': 0.1}


6 fits failed out of a total of 10.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
6 fits failed with the following error:
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.12/site-packages/sklearn/model_selection/_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/anaconda3/lib/python3.12/site-packages/sklearn/base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/site-packages/sklearn/svm/_classes.py", line 325, in fit
    self.coef_, self.intercept_, n_iter_ = _fit_liblinear(
                                           ^^^^^^^^^^^^^^^
  File "/opt
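
The traceback above is cut off before the error message, so the exact cause is not shown. A likely explanation is that the sampled candidates with `loss='hinge'` were rejected, because `LinearSVC` does not accept `penalty='l2'` with `loss='hinge'` when `dual=False`. The sketch below encodes the combinations described in the scikit-learn documentation so that unsupported candidates can be filtered out before a search; `is_supported` is a hypothetical helper and only handles boolean `dual` values.

```python
def is_supported(penalty, loss, dual):
    """Whether LinearSVC accepts this penalty/loss/dual combination.

    Assumes dual is True or False (not 'auto').
    """
    if penalty == 'l2' and loss == 'squared_hinge':
        return True              # works in both primal and dual form
    if penalty == 'l2' and loss == 'hinge':
        return dual is True      # only the dual form is available
    if penalty == 'l1' and loss == 'squared_hinge':
        return dual is False     # only the primal form is available
    return False                 # l1 + hinge is not supported

# The loss options from param_dist, with penalty and dual fixed as in the notebook
candidates = [
    {'penalty': 'l2', 'loss': loss, 'dual': False}
    for loss in ('hinge', 'squared_hinge')
]
valid = [c for c in candidates if is_supported(**c)]
print(valid)
```

With `dual=False` fixed, only `squared_hinge` remains valid, which is consistent with the best parameters reported above.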

In [59]:
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report

# Initialize LinearSVC with the best parameters found
svc_best = LinearSVC(random_state=42, **best_params)

# Train on full training data
svc_best.fit(X_train, Y_train)

# Predict on test set
y_pred = svc_best.predict(X_test)

# Evaluate performance
print(f'Accuracy of LinearSVC: {accuracy_score(Y_test, y_pred):.4f}')
print(classification_report(Y_test, y_pred))


Accuracy of LinearSVC: 0.7740
              precision    recall  f1-score   support

           0       0.79      0.75      0.77    160000
           1       0.76      0.80      0.78    160000

    accuracy                           0.77    320000
   macro avg       0.77      0.77      0.77    320000
weighted avg       0.77      0.77      0.77    320000



# 📌 Conclusion
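Both Logistic Regression and LinearSVC reached about 77% accuracy on the test set (0.7746 and 0.7741 with default settings), with nearly identical precision and recall.

- Hyperparameter tuning did not help: tuned Logistic Regression dropped to 0.7594, while tuned LinearSVC stayed at 0.7740. The searches ran on small samples (5,000 and 2,000 rows) with few candidates, so the selected parameters may not be optimal for the full training set.

- Choice of model can depend on needs: Logistic Regression provides class probabilities via `predict_proba`, which LinearSVC does not, while LinearSVC is often fast to train on large sparse text datasets. Both are linear models, so their per-term coefficients can be inspected.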