<a href="https://colab.research.google.com/github/Larinwa/Flexisaf_Project/blob/main/Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Supervised Sentiment Analysis Using Natural Language Processing (NLP)

This project focuses on building a supervised sentiment analysis system using Natural Language Processing (NLP) techniques to automatically determine the emotional tone of tweets. The dataset used is the Sentiment140 dataset, obtained from Kaggle, which contains 1.6 million tweets collected via the Twitter API. The labeling scheme assigns each tweet a sentiment score: 0 for negative, 2 for neutral, and 4 for positive. The file used here contains only the 0 and 4 classes (800,000 tweets each), so the task is binary. Although the dataset contains six columns (target, ids, date, flag, user, text), only the text and target columns are used for this analysis.

The main goal of the project is to apply a range of text preprocessing tools. This involves essential NLP preprocessing steps such as lowercasing, removal of URLs, user mentions, punctuation, and special characters, stopword removal, and lemmatization in order to clean and standardize the text data.

To complete the pipeline, a machine learning model is trained on the preprocessed data. The cleaned tweets are converted into numerical features using TF-IDF vectorization, and a Logistic Regression classifier is trained to learn sentiment patterns from them. For a fair evaluation, the dataset is split into training and testing sets, so the model is trained on one part and evaluated on tweets it has not seen. Finally, the trained model is applied to the entire dataset to generate sentiment predictions, which are compared with the original labels. Because that comparison includes the training tweets, the held-out test score remains the unbiased measure of performance.

This approach provides a complete and structured workflow for building an effective sentiment analysis system using classical NLP and machine learning techniques.

### STEP 1: Import Required Libraries

Necessary Python libraries are imported for data manipulation, text preprocessing, NLP operations, and machine learning modeling. Required NLTK resources such as stopwords and wordnet are also downloaded to support text cleaning and lemmatization.


In [1]:
import numpy as np
import pandas as pd
import re
import nltk
import spacy
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download required NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

### STEP 2: Download and Load Dataset

The Sentiment140 dataset is downloaded from Kaggle, stored locally, and loaded into a Pandas DataFrame. This allows for efficient data handling, exploration, and preprocessing.

In [2]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("kazanova/sentiment140")

print("Path to dataset files:", path)

Using Colab cache for faster access to the 'sentiment140' dataset.
Path to dataset files: /kaggle/input/sentiment140


In [4]:
import os
dataset = '/kaggle/input/sentiment140'
print(os.listdir(dataset))

['training.1600000.processed.noemoticon.csv']


In [5]:
import pandas as pd

data = '/kaggle/input/sentiment140/training.1600000.processed.noemoticon.csv'

df = pd.read_csv(data, encoding='latin-1', header=None)

df.head()


Unnamed: 0,0,1,2,3,4,5
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [6]:
df.columns = ['target', 'id', 'date', 'flag', 'user', 'text']

### STEP 3: Inspect Dataset and Extract Useful Columns

The dataset is inspected to understand its structure, features, and data types. Since only the tweet text and sentiment label are relevant for this task, the text and target columns are extracted for further analysis. This reduces noise and improves computational efficiency.

In [7]:
df.head()

Unnamed: 0,target,id,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [8]:
df.isna().sum()

Unnamed: 0,0
target,0
id,0
date,0
flag,0
user,0
text,0


In [9]:
df['target'].value_counts()

target,count
0,800000
4,800000


In [10]:
df['text']

Unnamed: 0,text
0,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,is upset that he can't update his Facebook by ...
2,@Kenichan I dived many times for the ball. Man...
3,my whole body feels itchy and like its on fire
4,"@nationwideclass no, it's not behaving at all...."
...,...
1599995,Just woke up. Having no school is the best fee...
1599996,TheWDB.com - Very cool to hear old Walt interv...
1599997,Are you ready for your MoJo Makeover? Ask me f...
1599998,Happy 38th Birthday to my boo of alll time!!! ...


In [11]:
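# .copy() makes new_df an independent DataFrame, so adding columns later
# does not trigger pandas' SettingWithCopyWarning
new_df = df[['target', 'text']].copy()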

In [12]:
new_df.head(5)

Unnamed: 0,target,text
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,is upset that he can't update his Facebook by ...
2,0,@Kenichan I dived many times for the ball. Man...
3,0,my whole body feels itchy and like its on fire
4,0,"@nationwideclass no, it's not behaving at all...."


### STEP 4: Split Dataset Before Preprocessing

The dataset is split into training (80%) and testing (20%) sets before preprocessing, with stratification so that both sets keep the 50/50 class balance. Splitting first helps prevent data leakage: the TF-IDF vocabulary and weights are later fit on the training tweets only, so the model gains no knowledge of the test data during training. The reserved test set is later used for unbiased evaluation of model performance.

In [13]:
X = new_df['text']
y = new_df['target']

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
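
To illustrate what `stratify=y` contributes, here is a minimal pure-Python sketch of a stratified split. It is not scikit-learn's implementation; the helper `stratified_split` and the toy labels are assumptions made for this example only.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) so each label keeps the same share in both parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)                      # shuffle within each class
        n_test = round(len(indices) * test_size)  # same fraction from every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels: 10 negative (0) and 10 positive (4), mirroring the balanced dataset
labels = [0] * 10 + [4] * 10
train_idx, test_idx = stratified_split(labels)

test_labels = [labels[i] for i in test_idx]
print(len(train_idx), len(test_idx))               # 16 4
print(test_labels.count(0), test_labels.count(4))  # 2 2
```

Each class contributes the same fraction to the test set, which is why the test report later shows equal support for both labels.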

### STEP 5: Text Preprocessing

A comprehensive preprocessing pipeline is applied to clean and standardize the tweet text. The following steps are performed:

* Convert text to lowercase

* Remove URLs

* Remove user mentions and hashtags

* Remove punctuation, digits, and special characters

* Tokenize text into words

* Remove stopwords

* Apply lemmatization to reduce words to their base form

This process reduces noise and standardizes the vocabulary before feature extraction, which generally helps a bag-of-words model.

In [14]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    if pd.isna(text):
        return ""

    # Lowercase
    text = text.lower()

    # Remove URLs
    # (`http\S+` already matches https links, so no separate pattern is needed)
    text = re.sub(r'http\S+|\bwww\.\S+', '', text)

    # Remove mentions & hashtags
    text = re.sub(r'@\w+|#\w+', '', text)

    # Remove punctuation & numbers
    text = re.sub(r'[^a-z\s]', '', text)

    # Split into words (tokenize)
    words = text.split()

    # Remove stopwords & lemmatize
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]

    # Rejoin into cleaned string
    return " ".join(words)
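
The regex steps can be traced on a single tweet with a small self-contained sketch. It uses a tiny hypothetical stopword set (`DEMO_STOPWORDS`) in place of NLTK's list and skips lemmatization so that it runs without any downloads; the sample tweets are made up for illustration.

```python
import re

# Tiny stand-in for NLTK's English stopword list (an assumption for this sketch)
DEMO_STOPWORDS = {"i", "it", "the", "a"}

def clean_tweet_demo(text):
    text = text.lower()                              # lowercase
    text = re.sub(r'http\S+|\bwww\.\S+', '', text)   # remove URLs
    text = re.sub(r'@\w+|#\w+', '', text)            # remove mentions and hashtags
    text = re.sub(r'[^a-z\s]', '', text)             # keep only letters and whitespace
    return " ".join(w for w in text.split() if w not in DEMO_STOPWORDS)

print(clean_tweet_demo("@user Check https://example.com #fun I LOVED it!!! 100%"))
# check loved
print(clean_tweet_demo("I can't wait"))
# cant wait
```

The second call shows why forms such as "cant" and "wont" appear in the cleaned output: the apostrophe is deleted rather than the contraction being expanded.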


In [15]:
X_train_cleaned= X_train.apply(preprocess_text)
X_test_cleaned= X_test.apply(preprocess_text)

In [16]:
X_train_cleaned.head()

Unnamed: 0,text
1036873,lol get idea far advance even june yet need th...
287781,worst headache ever
333391,sad wont see miss already yeah thats perfect c...
1484559,doesnt know spell conked
562778,quotso stand one know u wont get used wont get...


### STEP 6: Feature Extraction Using TF-IDF Vectorization

The cleaned text is transformed into numerical features using TF-IDF (Term Frequency–Inverse Document Frequency) vectorization. This technique weights each term by how often it appears in a tweet and down-weights terms that appear in many tweets, so distinctive words contribute more than common ones. Unigrams and bigrams are used, with a feature limit of 5000 to balance performance and computational efficiency.

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    max_features=5000,
    ngram_range=(1,2)
)

X_train_tfidf = vectorizer.fit_transform(X_train_cleaned)
X_test_tfidf  = vectorizer.transform(X_test_cleaned)
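
To make the weighting concrete, the sketch below computes TF-IDF by hand for unigrams and bigrams. It assumes scikit-learn's documented defaults (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization per document) and ignores `max_features`. The helpers `ngrams` and `tfidf` and the second toy document are made up for this example.

```python
import math
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """Return all unigrams and bigrams of a token list as space-joined strings."""
    out = []
    for n in range(n_min, n_max + 1):
        out += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return out

def tfidf(docs):
    term_lists = [ngrams(d.split()) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs
    df = Counter(t for terms in term_lists for t in set(terms))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    rows = []
    for terms in term_lists:
        weights = {t: count * idf[t] for t, count in Counter(terms).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})  # L2-normalize
    return rows

docs = ["worst headache ever", "best day ever"]
rows = tfidf(docs)
for term, weight in sorted(rows[0].items()):
    print(f"{term:15s} {weight:.3f}")
```

In the first document, "ever" occurs in both tweets and therefore receives a lower weight than "worst" or the bigram "worst headache", which occur in only one.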


### STEP 7: Model Training Using Logistic Regression

A Logistic Regression classifier is trained using the TF-IDF transformed training data. Logistic Regression is chosen for its efficiency, scalability, and strong performance in text classification tasks, particularly for large and sparse datasets.

In [18]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(X_train_tfidf, y_train)
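
Conceptually, a fitted binary Logistic Regression scores a tweet by taking a weighted sum of its TF-IDF features and passing it through a sigmoid. The sketch below shows that calculation with hypothetical weights and feature values; none of the numbers come from the trained model.

```python
import math

def predict_proba_positive(features, weights, bias):
    """Probability of the positive class: sigmoid(bias + sum of weight * feature)."""
    z = bias + sum(weights.get(term, 0.0) * value for term, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical learned coefficients (negative values push toward class 0)
weights = {"worst": -2.0, "headache": -1.5, "best": 2.2, "ever": 0.1}
bias = 0.0

# Hypothetical TF-IDF values for the cleaned tweet "worst headache ever"
features = {"worst": 0.6, "headache": 0.6, "ever": 0.5}

p_positive = predict_proba_positive(features, weights, bias)
label = 4 if p_positive >= 0.5 else 0
print(f"P(positive) = {p_positive:.3f}, predicted label = {label}")
```

Because each feature contributes independently through its weight, the model stays cheap to train and evaluate even with thousands of sparse features.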


### STEP 8: Model Evaluation on Test Dataset

The trained model is evaluated using the reserved 20% test dataset. Performance metrics including accuracy, precision, recall, and F1-score are computed to assess the model's predictive capability and generalization performance.

Achieved Accuracy: ~77.7%

Because the two classes are balanced, chance level is 50%, so the model performs well above chance on unseen data.

In [19]:
from sklearn.metrics import accuracy_score, classification_report

y_pred = model.predict(X_test_tfidf)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))


Accuracy: 0.776875
              precision    recall  f1-score   support

           0       0.79      0.75      0.77    160000
           4       0.76      0.80      0.78    160000

    accuracy                           0.78    320000
   macro avg       0.78      0.78      0.78    320000
weighted avg       0.78      0.78      0.78    320000
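
The per-class figures in the report follow from counts of true positives, false positives, and false negatives. The sketch below recomputes them on a small hypothetical set of labels; the helper `prf` and the toy data are assumptions, not the notebook's predictions.

```python
def prf(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels using the dataset's encoding (0 = negative, 4 = positive)
y_true = [0, 0, 0, 0, 4, 4, 4, 4]
y_pred = [0, 0, 4, 4, 4, 4, 4, 0]

for cls in (0, 4):
    p, r, f = prf(y_true, y_pred, cls)
    print(f"class {cls}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
# class 0: precision=0.67 recall=0.50 f1=0.57
# class 4: precision=0.60 recall=0.75 f1=0.67
```

The toy data over-predicts the positive class, which raises recall for class 4 and lowers its precision. The report above shows the same pattern in a milder form.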



### STEP 9: Predict Sentiments for Entire Dataset

The trained model is applied to the entire dataset to generate predicted sentiment labels for every tweet. Note that 80% of these tweets were used for training, so accuracy on the full dataset is not an independent estimate of generalization.

In [20]:
new_df['text_cleaned'] = new_df['text'].apply(preprocess_text)
X_all_tfidf = vectorizer.transform(new_df['text_cleaned'])
new_df['predicted'] = model.predict(X_all_tfidf)



### STEP 10: Convert Predictions to Readable Sentiment Labels

The numerical predictions are mapped to human-readable sentiment labels:

* 0 = Negative

* 4 = Positive

This improves interpretability and allows for clearer comparison between predicted and actual sentiments. An emoji is appended to each label.

In [21]:
new_df['predicted_sentiment'] = new_df['predicted'].apply(lambda x: "Positive 😊" if x == 4 else "Negative 😠")
new_df['actual_sentiment'] = new_df['target'].apply(lambda x: "Positive 😊" if x == 4 else "Negative 😠")




### STEP 11: Performance Comparison and Final Evaluation

The predicted sentiments are compared against the original dataset labels to determine overall correctness. Key metrics such as total correct predictions, incorrect predictions, and overall accuracy are computed.

Final Results:

Correct Predictions: 1,245,721

Incorrect Predictions: 354,279

Overall Accuracy: 77.86%

The full-dataset accuracy is close to the held-out test accuracy (~77.7%), which suggests the model is not noticeably overfitting the training data.

In [22]:
# Correct prediction
new_df['correct'] = new_df['predicted'] == new_df['target']

# Accuracy
accuracy = new_df['correct'].mean()
print("Overall Accuracy:", accuracy)

# Detailed report
from sklearn.metrics import classification_report
print(classification_report(new_df['actual_sentiment'], new_df['predicted_sentiment']))


Overall Accuracy: 0.778575625
              precision    recall  f1-score   support

  Negative 😠       0.79      0.75      0.77    800000
  Positive 😊       0.77      0.80      0.78    800000

    accuracy                           0.78   1600000
   macro avg       0.78      0.78      0.78   1600000
weighted avg       0.78      0.78      0.78   1600000
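
Since the full-dataset figure mixes training and held-out tweets, it can help to back out the accuracy on the training portion alone. The arithmetic below uses only numbers reported in this notebook and assumes the same fitted model and the same 80/20 split produced both results.

```python
total_rows = 1_600_000
total_correct = 1_245_721      # `correct == True` over the entire dataset
test_rows = 320_000            # 20% hold-out from Step 4
test_accuracy = 0.776875       # accuracy reported in Step 8

test_correct = round(test_accuracy * test_rows)
train_rows = total_rows - test_rows
train_correct = total_correct - test_correct
train_accuracy = train_correct / train_rows

print(test_correct)             # 248600
print(train_correct)            # 997121
print(f"{train_accuracy:.6f}")  # 0.779001
```

The implied training accuracy (about 77.9%) is only slightly above the test accuracy (about 77.7%), so the gap between seen and unseen tweets is small.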



In [23]:
new_df.head()

Unnamed: 0,target,text,text_cleaned,predicted,predicted_sentiment,actual_sentiment,correct
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",awww thats bummer shoulda got david carr third...,0,Negative 😠,Negative 😠,True
1,0,is upset that he can't update his Facebook by ...,upset cant update facebook texting might cry r...,0,Negative 😠,Negative 😠,True
2,0,@Kenichan I dived many times for the ball. Man...,dived many time ball managed save rest go bound,4,Positive 😊,Negative 😠,False
3,0,my whole body feels itchy and like its on fire,whole body feel itchy like fire,0,Negative 😠,Negative 😠,True
4,0,"@nationwideclass no, it's not behaving at all....",behaving im mad cant see,0,Negative 😠,Negative 😠,True


In [24]:
new_df["predicted_sentiment"].value_counts()

predicted_sentiment,count
Positive 😊,838067
Negative 😠,761933


In [25]:
new_df["actual_sentiment"].value_counts()

actual_sentiment,count
Negative 😠,800000
Positive 😊,800000


In [26]:
new_df["correct"].value_counts()

correct,count
True,1245721
False,354279


### STEP 12: Removal of Emojis

As a final normalization step, the emojis that Step 10 appended to the predicted and actual sentiment labels are removed using regular expressions. Although emojis can carry emotional meaning, removing them here leaves plain, standardized labels (Positive, Negative), which are easier to use in evaluation, reporting, and visualization.

In [27]:
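def remove_emoji(text):
    """Strip emoji characters from a label and trim leftover whitespace."""
    # The parameter is named `text` to avoid shadowing the imported `string` module
    emoji_pattern = re.compile("["
                           "\U0001F600-\U0001F64F"  # emoticons
                           "\U0001F300-\U0001F5FF"  # symbols & pictographs
                           "\U0001F680-\U0001F6FF"  # transport & map symbols
                           "\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "\U00002702-\U000027B0"  # dingbats
                           "\U000024C2-\U0001F251"  # enclosed characters (broad range)
                           "]+", flags=re.UNICODE)
    # "Positive 😊" becomes "Positive " once the emoji is gone, so strip the space
    return emoji_pattern.sub('', text).strip()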
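
One detail is worth checking when cleaning these labels: they were built as a word, a space, and an emoji, so deleting only the emoji leaves a trailing space. The short sketch below shows the effect with a narrower pattern (`EMOTICONS`, an assumption that covers just the emoticon block used by the two label emojis).

```python
import re

# Narrow pattern covering only the emoticon block (enough for the two label emojis)
EMOTICONS = re.compile("[\U0001F600-\U0001F64F]+")

label = "Positive \U0001F60A"   # "Positive 😊", as built in Step 10
without_emoji = EMOTICONS.sub("", label)

print(repr(without_emoji))          # 'Positive '
print(repr(without_emoji.strip()))  # 'Positive'
```

Stripping whitespace after the substitution keeps the labels identical to the plain strings "Positive" and "Negative", which matters for any later equality checks or group-bys.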

In [28]:
new_df["actual_sentiment"] = new_df["actual_sentiment"].apply(remove_emoji)
new_df["predicted_sentiment"] = new_df["predicted_sentiment"].apply(remove_emoji)
new_df.head()

Unnamed: 0,target,text,text_cleaned,predicted,predicted_sentiment,actual_sentiment,correct
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",awww thats bummer shoulda got david carr third...,0,Negative,Negative,True
1,0,is upset that he can't update his Facebook by ...,upset cant update facebook texting might cry r...,0,Negative,Negative,True
2,0,@Kenichan I dived many times for the ball. Man...,dived many time ball managed save rest go bound,4,Positive,Negative,False
3,0,my whole body feels itchy and like its on fire,whole body feel itchy like fire,0,Negative,Negative,True
4,0,"@nationwideclass no, it's not behaving at all....",behaving im mad cant see,0,Negative,Negative,True


### Conclusion

This project demonstrates a complete supervised sentiment analysis pipeline, combining text preprocessing, feature engineering, machine learning modeling, and performance evaluation. With TF-IDF features and a Logistic Regression classifier, the system reaches about 77.7% accuracy on the held-out test set of 320,000 tweets.

The workflow can be extended to real-time tweet analysis, customer feedback evaluation, social media monitoring, and opinion mining applications. It shows how classical NLP techniques combined with machine learning models provide a practical baseline for sentiment analysis.