# 🎯 Sentiment Analysis - Twitter Dataset
_A project to classify the sentiment of tweets using Naive Bayes and TF-IDF._

## 📂 Table of Contents
1. [Introduction](#introduction)
2. [Project Objective](#project-objective)
3. [Data Preparation](#data-preparation)
4. [Pipeline Development](#pipeline-development)
5. [Model Development](#model-development)
6. [Validation Data Testing](#validation-testing)
7. [Single Input Prediction](#single-input-prediction)
8. [Conclusion and Next Steps](#conclusion)


## 📖 Introduction <a name="introduction"></a>
In this project, we build a sentiment analysis model to classify tweets as Positive or Negative using the Twitter dataset.

We use natural language processing (NLP) techniques to clean, vectorize, and classify tweets. We evaluate the model on a held-out split of the training file and on a separate validation file.


## 🎯 Project Objective <a name="project-objective"></a>
> - Build a text classification model using Naive Bayes.  
> - Preprocess tweets using a custom pipeline.  
> - Evaluate the model using real validation data and test on single inputs.


## 🛠️ Data Preparation <a name="data-preparation"></a>

In [1]:
# 📥 Import Libraries
import nltk
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

# Fetch the NLTK stopword list if it is not already available locally
nltk.download('stopwords', quiet=True)

In [2]:
# 📂 Load Dataset
data = pd.read_csv("twitter_training.csv", header=None)
data.columns = ['Tweet_ID', 'Entity', 'Sentiment', 'Review']

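Drop rows with missing values and rows whose sentiment is `Irrelevant` or `Neutral`, then remove the `Tweet_ID` and `Entity` columns. This leaves a binary Positive/Negative classification task.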

In [3]:
# 🧹 Clean Dataset
data.dropna(inplace=True)
data = data[data['Sentiment'] != 'Irrelevant']
data = data[data['Sentiment'] != 'Neutral']
data.drop(['Tweet_ID', 'Entity'], axis=1, inplace=True)

# ✅ Check Cleaned Data
data.head()

|   | Sentiment | Review |
|---|-----------|--------|
| 0 | Positive | im getting on borderlands and i will murder yo... |
| 1 | Positive | I am coming to the borders and I will kill you... |
| 2 | Positive | im getting on borderlands and i will kill you ... |
| 3 | Positive | im coming on borderlands and i will murder you... |
| 4 | Positive | im getting on borderlands 2 and i will murder ... |


## 🏗️ Pipeline Development <a name="pipeline-development"></a>

The `TextPreprocessor` transformer tokenizes each tweet, removes common English stopwords, and returns the remaining words joined back into a single string. The pipeline applies `TextPreprocessor` first and then converts the cleaned text into numerical features with `TfidfVectorizer`.

In [4]:
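# Custom transformer that strips English stopwords from each text
class TextPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self):
        # Tokenizer keeps runs of word characters and drops punctuation
        self.tokenizer = RegexpTokenizer(r'\w+')
        self.stopwords = set(stopwords.words('english'))

    def fit(self, X, y=None):
        # Nothing to learn; present for scikit-learn API compatibility
        return self

    def transform(self, X):
        # X is expected to be a pandas Series of strings
        return X.apply(self._remove_stopwords)

    def _remove_stopwords(self, sentence):
        tokens = self.tokenizer.tokenize(sentence)
        # Compare in lowercase, but keep each surviving token's original casing
        filtered = [word for word in tokens if word.lower() not in self.stopwords]
        return ' '.join(filtered)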

# Build the pipeline
text_pipeline = Pipeline([
    ('preprocess', TextPreprocessor()),
    ('tfidf', TfidfVectorizer())
])
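
To make the two pipeline stages concrete, here is a minimal pure-Python sketch of stopword removal followed by TF-IDF weighting. It is an illustration under stated assumptions, not the pipeline's actual implementation: `STOPWORDS` is a tiny hypothetical list standing in for NLTK's English list, `clean` and `tfidf` are names introduced here, and the weighting follows the smoothed IDF and L2 normalization that scikit-learn documents as `TfidfVectorizer` defaults while skipping its other details (such as its minimum token length).

```python
import math
import re
from collections import Counter

# Hypothetical miniature stopword list; the notebook uses NLTK's full English list.
STOPWORDS = {"i", "am", "the", "and", "will", "on", "to", "you", "is", "a"}


def clean(sentence):
    """Tokenize on runs of word characters and drop stopwords."""
    tokens = re.findall(r"\w+", sentence)
    return [t for t in tokens if t.lower() not in STOPWORDS]


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [[t.lower() for t in clean(d)] for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / norm for t, w in raw.items()} if norm else {})
    return vectors


docs = [
    "I am coming to the borders",
    "I will kill the borders boss",
    "coming soon",
]
vectors = tfidf(docs)
for doc, vec in zip(docs, vectors):
    print(doc, "->", {t: round(w, 3) for t, w in vec.items()})
```

In the second document, `kill` and `boss` appear in only one document each, so they receive a larger weight than `borders`, which appears in two. That is the effect TF-IDF is meant to have: terms that are rare across the corpus count for more.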

## 🤖 Model Development <a name="model-development"></a>

In [5]:
# 🚀 Feature Extraction
X = text_pipeline.fit_transform(data['Review'])
y = data['Sentiment']

# 🔀 Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 🧠 Model Training
model = MultinomialNB()
model.fit(X_train, y_train)

# 📊 Evaluate on Test Set
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

    Negative       0.88      0.92      0.90      4467
    Positive       0.91      0.86      0.88      4136

    accuracy                           0.89      8603
   macro avg       0.89      0.89      0.89      8603
weighted avg       0.89      0.89      0.89      8603
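
One caveat about the scores above: the cell fits `TfidfVectorizer` on every row before calling `train_test_split`, so the vocabulary and IDF weights are learned partly from the held-out rows. A minimal sketch of an alternative ordering is shown below, assuming a split on the raw text followed by a single pipeline that contains both the vectorizer and the classifier. The `texts` and `labels` lists are hypothetical stand-ins for `data['Review']` and `data['Sentiment']`, and scikit-learn's built-in `stop_words="english"` option replaces the custom `TextPreprocessor` only to keep the example self-contained.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data; in the notebook these would be the cleaned columns.
texts = [
    "what an amazing game, loved every minute",
    "great update, the new map is fantastic",
    "best experience I have had in years",
    "really happy with the support team",
    "terrible servers, constant lag tonight",
    "worst patch ever, the game keeps crashing",
    "awful experience, I want a refund",
    "really angry about the broken matchmaking",
]
labels = ["Positive"] * 4 + ["Negative"] * 4

# Split the raw text first so the vectorizer never sees the held-out rows.
train_text, test_text, train_y, test_y = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# Vectorizer and classifier live in one pipeline and are fitted on train only.
sentiment_clf = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("nb", MultinomialNB()),
])
sentiment_clf.fit(train_text, train_y)

test_pred = sentiment_clf.predict(test_text)
print(classification_report(test_y, test_pred, zero_division=0))
```

With eight toy rows the printed report carries no meaning; the point is the order of operations. Separately, the first five rows shown by `data.head()` are near-duplicates of one another, so a random row-level split may place variants of the same tweet on both sides, which is worth keeping in mind when reading the test-set scores.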



## 🧪 Validation Data Testing <a name="validation-testing"></a>

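We load the validation data and apply the same cleaning steps used for the training data. The already fitted pipeline then transforms the tweets (`transform`, not `fit_transform`), and the trained model predicts the sentiment of each tweet in the validation set.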

In [6]:
# 📂 Load Validation Dataset
val_data = pd.read_csv("twitter_validation.csv", header=None)

# 🧹 Preprocess Validation Data
def preprocess_data(df):
    df = df.copy()
    df.columns = ['Tweet_ID', 'Entity', 'Sentiment', 'Review']
    df.dropna(inplace=True)
    df = df[df['Sentiment'] != 'Irrelevant']
    df = df[df['Sentiment'] != 'Neutral']
    df.drop(['Tweet_ID', 'Entity'], axis=1, inplace=True)
    return df

val_data = preprocess_data(val_data)

# 🔄 Transform Validation Data
val_X = text_pipeline.transform(val_data['Review'])
val_y = val_data['Sentiment']

# 🔮 Predict and Evaluate
val_y_pred = model.predict(val_X)
print(classification_report(val_y, val_y_pred))


              precision    recall  f1-score   support

    Negative       0.93      0.95      0.94       266
    Positive       0.95      0.93      0.94       277

    accuracy                           0.94       543
   macro avg       0.94      0.94      0.94       543
weighted avg       0.94      0.94      0.94       543
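
The `f1-score` column in both reports is the harmonic mean of precision and recall. As a quick arithmetic check, the sketch below recomputes it from the rounded precision and recall values printed in this document. Because those inputs are already rounded to two decimals, agreement is only expected at two decimals. `f1_from` is a helper name introduced here for illustration.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the two classification reports.
reported = {
    "test / Negative": (0.88, 0.92),
    "test / Positive": (0.91, 0.86),
    "validation / Negative": (0.93, 0.95),
    "validation / Positive": (0.95, 0.93),
}

for name, (p, r) in reported.items():
    print(f"{name}: F1 = {f1_from(p, r):.2f}")

# Prints 0.90, 0.88, 0.94 and 0.94, matching the f1-score columns.
```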



## 📝 Single Input Prediction <a name="single-input-prediction"></a>

In [8]:
# 🎤 Predict Sentiment on New Input
user_review = pd.Series([input("Type a review to predict its sentiment: ")])

new_review = text_pipeline.transform(user_review)
print("Review:", user_review.values[0])
print("Predicted Sentiment:", model.predict(new_review))


Review: It was an amazing experience all round.
Predicted Sentiment: ['Positive']
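
To make a single prediction like the one above less of a black box, here is a minimal pure-Python sketch of how multinomial Naive Bayes scores one input: a log prior per class plus the sum of smoothed log likelihoods of the input's words. It is a simplification under stated assumptions: it works on raw word counts rather than TF-IDF weights, the training examples are hypothetical, and `train_nb` and `predict_nb` are names introduced here. The smoothing uses `alpha=1.0`, which is also the default for `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Fit class priors and additively smoothed word likelihoods."""
    vocab = {word for doc in docs for word in doc}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    model = {}
    for label, n_label_docs in class_docs.items():
        total_words = sum(word_counts[label].values())
        denom = total_words + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_label_docs / len(labels)),
            "log_lik": {
                w: math.log((word_counts[label][w] + alpha) / denom)
                for w in vocab
            },
        }
    return model


def predict_nb(model, doc):
    """Return the label with the highest log posterior; unseen words are skipped."""
    scores = {
        label: params["log_prior"]
        + sum(params["log_lik"][w] for w in doc if w in params["log_lik"])
        for label, params in model.items()
    }
    return max(scores, key=scores.get)


# Hypothetical, already tokenized training examples.
train_docs = [
    ["amazing", "experience"],
    ["love", "game"],
    ["terrible", "lag"],
    ["hate", "bugs"],
]
train_labels = ["Positive", "Positive", "Negative", "Negative"]

toy_model = train_nb(train_docs, train_labels)
prediction = predict_nb(toy_model, ["amazing", "experience", "round"])
print(prediction)
```

The word `round` never appears in the toy training data, so it is ignored, in the same way the fitted `TfidfVectorizer` drops words outside its vocabulary when transforming new input.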


## ✅ Conclusion and Next Steps <a name="conclusion"></a>
> - Successfully built a sentiment analysis model using Naive Bayes and TF-IDF.  
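> - Evaluated the model on a held-out split of the training data (0.89 accuracy) and on the separate validation set (0.94 accuracy).  
> - Enabled real-time single input prediction.  
> - Future improvements: try deep learning models, handle sarcasm, or integrate with the Twitter API for real-time predictions.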
