# Twitter Sentiment Analysis

## Dataset Description

The dataset consists of preprocessed tweets (column `clean_text`), each paired with a sentiment label (column `category`). The label takes one of three values:

- **-1.0** → Negative Sentiment  
- **0.0** → Neutral Sentiment  
- **1.0** → Positive Sentiment  
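
When reading reports it can help to turn the numeric codes into names. A minimal sketch, assuming only the three codes listed above; `LABEL_NAMES` and `label_name` are illustrative names and not part of the dataset or the notebook:

```python
# Hypothetical helper: map the dataset's numeric sentiment codes to names.
LABEL_NAMES = {-1.0: "negative", 0.0: "neutral", 1.0: "positive"}

def label_name(code):
    """Return the sentiment name for a numeric category code."""
    return LABEL_NAMES[float(code)]

print([label_name(c) for c in (-1.0, 0.0, 1.0)])
# ['negative', 'neutral', 'positive']
```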

### Sample Data

| clean_text | category |
|------------|----------|
| when modi promised “minimum government maximum... | -1.0 |
| talk all the nonsense and continue all the dra... | 0.0 |
| what did just say vote for modi welcome bjp t... | 1.0 |
| asking his supporters prefix chowkidar their n... | 1.0 |
| answer who among these the most powerful world... | 1.0 |



In [1]:
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

In [2]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [3]:
df=pd.read_csv('Twitter_Data.csv')
df.head()

|   | clean_text | category |
|---|------------|----------|
| 0 | when modi promised “minimum government maximum... | -1.0 |
| 1 | talk all the nonsense and continue all the dra... | 0.0 |
| 2 | what did just say vote for modi welcome bjp t... | 1.0 |
| 3 | asking his supporters prefix chowkidar their n... | 1.0 |
| 4 | answer who among these the most powerful world... | 1.0 |


In [4]:
df.shape

(162980, 2)

In [5]:
df['category'].value_counts()

| category | count |
|----------|-------|
| 1.0 | 72250 |
| 0.0 | 55213 |
| -1.0 | 35510 |
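
The counts show that the classes are imbalanced: roughly 44% positive, 34% neutral and 22% negative. The short sketch below recomputes those shares from the counts printed above (the variable names are illustrative):

```python
# Class counts copied from the value_counts() output above.
counts = {1.0: 72250, 0.0: 55213, -1.0: 35510}

total = sum(counts.values())  # labelled rows only (rows with a missing category are excluded)
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label:>4}: {share:.1%}")
```

With this much imbalance, the per-class precision and recall are more informative than accuracy alone.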


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 162980 entries, 0 to 162979
Data columns (total 2 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   clean_text  162976 non-null  object 
 1   category    162973 non-null  float64
dtypes: float64(1), object(1)
memory usage: 2.5+ MB


## Dataset Preprocessing

In [7]:
df.isnull().sum()

| column | missing values |
|--------|----------------|
| clean_text | 4 |
| category | 7 |


In [8]:
df=df.dropna()

In [9]:
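# Build the stop-word set once; calling stopwords.words() for every token is very slow.
stop_words=set(stopwords.words('english'))
corpus=[]
for review in df['clean_text']:
    review=re.sub('[^a-zA-Z]',' ',review)  # keep letters only
    review=review.lower().split()          # lowercase and tokenize on whitespace
    review=[word for word in review if word not in stop_words]
    review=' '.join(review)
    corpus.append(review)
corpus[-1]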

'ever listen like gurukul discipline maintained even narendra modi rss maintaining culture indian attack politics someone attack hinduism rss take action proud'
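
To make the cleaning steps concrete, here is a small self-contained sketch of the same three operations (strip non-letters, lowercase, drop stop words) applied to one made-up sentence. The tiny `STOP_WORDS` set and the `clean` function are assumptions for illustration; the notebook itself uses NLTK's English stop-word list.

```python
import re

# Tiny illustrative stop-word set; the notebook uses NLTK's English list instead.
STOP_WORDS = {"the", "is", "a", "for", "and"}

def clean(text, stop_words=STOP_WORDS):
    """Keep letters only, lowercase, split on whitespace, drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return " ".join(t for t in tokens if t not in stop_words)

print(clean("Vote for MODI in 2019!!! The BJP is #1"))
# vote modi in bjp
```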

In [10]:
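# Features are taken from the clean_text column; the stop-word-filtered corpus built above is not used here.
x=df['clean_text']
y=df['category']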

In [11]:
tfidf=TfidfVectorizer()
x=tfidf.fit_transform(x)

In [12]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=42)
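
The cells above fit the vectorizer on the full corpus before splitting, so the vocabulary and IDF weights also reflect the test tweets. A common alternative is to split the raw text first and fit TF-IDF on the training portion only. The sketch below shows that ordering on toy data; the texts, labels and variable names are invented for illustration and are not the notebook's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy stand-ins for df['clean_text'] and df['category'] (invented for illustration).
texts = ["good policy", "bad policy", "great speech", "terrible speech",
         "nice work", "poor work", "good work", "bad speech"]
labels = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

# Split the raw text first ...
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42)

# ... then learn the vocabulary and IDF weights from the training text only.
tfidf = TfidfVectorizer()
x_train = tfidf.fit_transform(text_train)
x_test = tfidf.transform(text_test)  # reuses the training vocabulary

print(x_train.shape, x_test.shape)
```

Words that occur only in the test text are ignored by `transform`, which mirrors what happens when the model later sees unseen tweets.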

### Logistic Regression

In [13]:
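lr_model=LogisticRegression(max_iter=500)
lr_model.fit(x_train,y_train)
lr_pred=lr_model.predict(x_test)
# Report accuracy on both splits to expose any train/test gap.
print("training accuracy:",lr_model.score(x_train,y_train))
print("Testing accuracy:",accuracy_score(y_test,lr_pred))
print(classification_report(y_test,lr_pred))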


training accuracy: 0.9506883988494726
Testing accuracy: 0.9193103025096644
              precision    recall  f1-score   support

        -1.0       0.92      0.80      0.86      7152
         0.0       0.90      0.98      0.94     11067
         1.0       0.94      0.93      0.93     14375

    accuracy                           0.92     32594
   macro avg       0.92      0.90      0.91     32594
weighted avg       0.92      0.92      0.92     32594
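
The `macro avg` row is the plain mean of the three per-class values, while `weighted avg` weights each class by its support. The sketch below recomputes both rows from the per-class numbers in the report above. Because the report values are already rounded to two decimals, the results are approximate; the helper names `macro` and `weighted` are illustrative.

```python
# Per-class values copied from the classification report above:
# label -> (precision, recall, f1-score, support)
report = {
    -1.0: (0.92, 0.80, 0.86, 7152),
    0.0: (0.90, 0.98, 0.94, 11067),
    1.0: (0.94, 0.93, 0.93, 14375),
}
total = sum(row[3] for row in report.values())

def macro(i):
    """Unweighted mean of column i across the classes."""
    return sum(row[i] for row in report.values()) / len(report)

def weighted(i):
    """Support-weighted mean of column i across the classes."""
    return sum(row[i] * row[3] for row in report.values()) / total

for name, i in (("precision", 0), ("recall", 1), ("f1-score", 2)):
    print(f"{name}: macro={macro(i):.2f} weighted={weighted(i):.2f}")
```

The lower macro recall (0.90 versus 0.92 weighted) reflects the weaker recall on the smallest class, -1.0.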



### Decision Tree

In [14]:
dt_model=DecisionTreeClassifier()
dt_model.fit(x_train,y_train)
dt_pred=dt_model.predict(x_test)
# Evaluate the tree on its own predictions (dt_pred), not on lr_pred.
print("training accuracy:",dt_model.score(x_train,y_train))
print("Testing accuracy:",accuracy_score(y_test,dt_pred))
print(classification_report(y_test,dt_pred))

training accuracy: 0.9999923298178331

Test metrics for the decision tree were not recorded in this run: the test accuracy and classification report originally printed here were computed from `lr_pred` and repeat the Logistic Regression results above.
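
The accuracy of 0.9999923298178331 recorded for the decision tree is consistent with a training-set score in which exactly one training row is misclassified, which is typical of an unpruned tree memorising its training data. The sketch below checks that arithmetic. It assumes that the 4 missing texts and 7 missing categories fall in 11 distinct rows, which matches the test support of 32594 in the reports.

```python
import math

# Assumption: 162980 rows minus 11 rows that contain a missing value.
rows_after_dropna = 162980 - 4 - 7

# train_test_split rounds the test size up and gives the rest to training.
n_test = math.ceil(0.2 * rows_after_dropna)
n_train = rows_after_dropna - n_test

print(n_test, n_train)
print((n_train - 1) / n_train)  # accuracy with exactly one training error
```

A near-perfect training score says little about generalisation, so it should always be read next to the test metrics.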



### Random Forest

In [None]:
rf_model=RandomForestClassifier()
rf_model.fit(x_train,y_train)
rf_pred=rf_model.predict(x_test)
print(accuracy_score(y_test,rf_pred))
print(classification_report(y_test,rf_pred))
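
Because every model cell repeats the same fit, predict and report steps, one way to avoid scoring a model with another model's predictions is to wrap those steps in a single function. The sketch below is one possible shape for such a helper. The `evaluate` function and the toy data are assumptions for illustration, not part of the original notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.tree import DecisionTreeClassifier

def evaluate(model, x_train, y_train, x_test, y_test):
    """Fit the model, then compute all metrics from its own predictions."""
    model.fit(x_train, y_train)
    pred = model.predict(x_test)
    return {
        "train_accuracy": model.score(x_train, y_train),
        "test_accuracy": accuracy_score(y_test, pred),
        "report": classification_report(y_test, pred, zero_division=0),
    }

# Toy data, invented for illustration.
train_text = ["good policy", "bad policy", "great speech", "terrible speech"]
test_text = ["good speech", "bad speech"]
vec = TfidfVectorizer()
x_tr, x_te = vec.fit_transform(train_text), vec.transform(test_text)
y_tr, y_te = [1.0, -1.0, 1.0, -1.0], [1.0, -1.0]

for name, model in [("lr", LogisticRegression(max_iter=500)),
                    ("dt", DecisionTreeClassifier(random_state=0))]:
    result = evaluate(model, x_tr, y_tr, x_te, y_te)
    print(name, result["train_accuracy"], result["test_accuracy"])
```

The same call works for `RandomForestClassifier` or any other scikit-learn classifier, since the helper only relies on `fit`, `predict` and `score`.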