## **P3_TWITTER_SENTIMENT_ANALYSIS USING MACHINE LEARNING**

#### **Project Description**:
The primary goal is to develop a sentiment analysis model that can accurately classify the
sentiment of text data, providing valuable insights into public opinion, customer feedback, and
social media trends.

##### APPROACH OF THE PROJECT: Step-wise
   1. **Sentiment Analysis**: Analyzing text data to determine the emotional tone, whether positive,
    negative, or neutral.
   2. **Natural Language Processing (NLP)**: Utilizing algorithms and models to understand and
    process human language.
   3. **Machine Learning Algorithms**: Implementing models for sentiment classification, such as
    Support Vector Machines, Naive Bayes, or deep learning architectures.
   4. **Feature Engineering**: Identifying and extracting relevant features from text data to enhance
    model performance.
   5. **Data Visualization**: Presenting sentiment analysis results through effective visualizations for
    clear interpretation.

This is Project 4, Proposal Level-1.
**DATE**: 21 August 2024

DATASET: https://www.kaggle.com/datasets/saurabhshahane/twitter-sentiment-dataset

In [2]:
#------------------------IMPORT NECESSARY LIBRARIES----------------------------------
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# render plots inline, directly below the cell that creates them
%matplotlib inline

In [3]:
# The warnings module controls how warnings are handled within a Python script
import warnings

In [4]:
warnings.filterwarnings('ignore')  # suppress all warnings

LOADING THE DATA

In [5]:
#--------------------------------LOADING DATA-----------------------------------------
Twitter_Data= pd.read_csv('Twitter_Data.csv')

DATA INSPECTION

In [6]:
#displays top 5 row values
Twitter_Data.head()

Unnamed: 0,clean_text,category
0,when modi promised “minimum government maximum...,-1.0
1,talk all the nonsense and continue all the dra...,0.0
2,what did just say vote for modi welcome bjp t...,1.0
3,asking his supporters prefix chowkidar their n...,1.0
4,answer who among these the most powerful world...,1.0


In [7]:
#displays last 5 row values
Twitter_Data.tail()

Unnamed: 0,clean_text,category
162975,why these 456 crores paid neerav modi not reco...,-1.0
162976,dear rss terrorist payal gawar what about modi...,-1.0
162977,did you cover her interaction forum where she ...,0.0
162978,there big project came into india modi dream p...,0.0
162979,have you ever listen about like gurukul where ...,1.0


In [8]:
# shape of the data as (rows, columns); shape is an attribute, not a method
Twitter_Data.shape

(162980, 2)

In [9]:
#printing the no. of rows and columns
print("Number of Rows are",Twitter_Data.shape[0])
print("Number of Columns are",Twitter_Data.shape[1])

Number of Rows are 162980
Number of Columns are 2


In [10]:
# Summary of the dataset: number of rows and columns,
# the dtype of each column, and the memory usage
Twitter_Data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 162980 entries, 0 to 162979
Data columns (total 2 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   clean_text  162976 non-null  object 
 1   category    162973 non-null  float64
dtypes: float64(1), object(1)
memory usage: 2.5+ MB


In [11]:
# summary statistics for the numeric columns
Twitter_Data.describe()

Unnamed: 0,category
count,162973.0
mean,0.225436
std,0.781279
min,-1.0
25%,0.0
50%,0.0
75%,1.0
max,1.0


CHECKING NULL VALUES

In [12]:
#Check Null Values In The Dataset
Twitter_Data.isnull()

Unnamed: 0,clean_text,category
0,False,False
1,False,False
2,False,False
3,False,False
4,False,False
...,...,...
162975,False,False
162976,False,False
162977,False,False
162978,False,False


In [13]:
#Check the sum of Null Values In The Dataset
Twitter_Data.isnull().sum()

clean_text    4
category      7
dtype: int64

In [14]:
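# Fill missing values in each column with that column's mode.
# Note: str() stores the category mode as a string, so 'category' becomes an object column; it is cast back to float in a later cell.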
Twitter_Data['clean_text'].fillna(str(Twitter_Data['clean_text'].mode().values[0]),inplace=True)
Twitter_Data['category'].fillna(str(Twitter_Data['category'].mode().values[0]),inplace=True)
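Because wrapping the mode in `str()` puts a string into an otherwise numeric column, it may be worth comparing two alternatives that keep `category` as a float. The following is only a sketch on a toy frame (the rows are invented for illustration); it shows filling with the numeric mode and, alternatively, dropping the incomplete rows.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the tweet data (values invented for illustration).
toy = pd.DataFrame({
    "clean_text": ["good day", None, "bad day", "fine"],
    "category": [1.0, 1.0, np.nan, -1.0],
})

# Option 1: fill the label with the numeric mode, so the dtype stays float64.
filled = toy.copy()
filled["category"] = filled["category"].fillna(filled["category"].mode().iloc[0])
print(filled["category"].dtype)

# Option 2: drop rows with a missing tweet or a missing label.
dropped = toy.dropna(subset=["clean_text", "category"])
print(len(dropped))
```

With only a handful of missing values in roughly 163k rows, either option should have a negligible effect on the class balance.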

In [15]:
# AGAIN Check the sum of Null Values In The Dataset
Twitter_Data.isnull().sum()

clean_text    0
category      0
dtype: int64

DUPLICATE CHECKING

In [16]:
#to check duplicate values in dataset
Twitter_Data.duplicated().any()

True

In [17]:
# drop duplicate rows, keeping the first occurrence of each
Twitter_Data = Twitter_Data.drop_duplicates()

In [18]:
#AGAIN check duplicate values in dataset
Twitter_Data.duplicated().any()

False

In [19]:
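# Note: this compares against the string '0.0', so rows whose category is the float 0.0 (neutral) are kept, as the output below shows.
Twitter_Data= Twitter_Data[Twitter_Data['category']!= '0.0']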

In [20]:
Twitter_Data

Unnamed: 0,clean_text,category
0,when modi promised “minimum government maximum...,-1.0
1,talk all the nonsense and continue all the dra...,0.0
2,what did just say vote for modi welcome bjp t...,1.0
3,asking his supporters prefix chowkidar their n...,1.0
4,answer who among these the most powerful world...,1.0
...,...,...
162975,why these 456 crores paid neerav modi not reco...,-1.0
162976,dear rss terrorist payal gawar what about modi...,-1.0
162977,did you cover her interaction forum where she ...,0.0
162978,there big project came into india modi dream p...,0.0


**DATA PREPROCESSING**

In [21]:
from sklearn.model_selection import train_test_split
train,test= train_test_split(Twitter_Data,test_size=0.1)#90% to the training data

In [22]:
train

Unnamed: 0,clean_text,category
115521,correct but irrelevant modi losing whatever th...,-1.0
24792,modi insulted veteran leaders like advani not ...,0.0
94883,how can jnu students opposing mba course which...,-1.0
148875,modi represents shift from the macaulay versio...,1.0
101234,fake stories one believe you and your party ou...,-1.0
...,...,...
67341,pakistan pakistanlovers hugging each other whi...,-1.0
30077,most probably after elections they fear modi m...,1.0
36004,people who fled will brought justice rahul inv...,0.0
85396,live cong wanted kartarpur would have been ind...,1.0


In [23]:
test

Unnamed: 0,clean_text,category
41636,this indicates ganga has become cleaner modi h...,0.0
12174,think about this modi jibig curse india\n,0.0
49264,congress will say modi wasting money missiles ...,-1.0
138301,the time 2014 elections modi promised many thi...,1.0
54136,the role modis defining leadership and the par...,-1.0
...,...,...
40308,key suggestions show both positives modi govt ...,1.0
132078,modi has done privatisation profits and social...,0.0
144373,india dreams india led happy now and,1.0
40261,use the same oil burn them verge discovering b...,0.0


In [24]:
train.head()

Unnamed: 0,clean_text,category
115521,correct but irrelevant modi losing whatever th...,-1.0
24792,modi insulted veteran leaders like advani not ...,0.0
94883,how can jnu students opposing mba course which...,-1.0
148875,modi represents shift from the macaulay versio...,1.0
101234,fake stories one believe you and your party ou...,-1.0


In [25]:
# Regex matching hashtags, retweet prefixes, URLs, and @mentions, all of which are removed from the tweets
pattern = r"(#\w+)|(RT\s@\w+:)|(http.*)|(@\w+)"
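To see what this pattern removes, here is a small sketch using only the standard `re` module. The sample strings are invented, and `strip_noise` is a hypothetical helper name, not part of the notebook.

```python
import re

pattern = r"(#\w+)|(RT\s@\w+:)|(http.*)|(@\w+)"

def strip_noise(text):
    """Remove every pattern match, then collapse leftover whitespace."""
    return " ".join(re.sub(pattern, "", text).split())

sample = "RT @user: loving the #weather today @friend http://t.co/abc"
print(strip_noise(sample))  # loving the today

# 'http.*' runs to the end of the line, so text after a URL is dropped too.
print(strip_noise("see http://t.co/abc for details"))  # see
```

The second call shows a side effect worth knowing about: any words that follow a URL on the same line are deleted along with it.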

In [26]:
# print every training tweet (very long output for a frame of this size)
for val in train['clean_text']:
    print(val)

correct but irrelevant modi losing whatever the function parameters
modi insulted veteran leaders like advani not letting them contest lok sabha polls arvind kejriwal 
how can jnu students opposing mba course which charges marketlinked fees modi wasnt obsessed with his petty antinational stuff would have actual taught these guys something about the free ride they have been getting 
modi represents shift from the macaulay version hinduism propagated nehru and leftists the real india bharat which cradle dharmic ideology not merely the nehruvian idea india where mughals are known all and lalitaditya nowhere mentioned
fake stories one believe you and your party our vote only for modi 
agree kcr quoting masood azhar unacceptable but what wrong people question the figure 300 300 was not quoted airforce defense ministry amit shah referenced but not part gov its not official
iii modi also didnt refer 300
thanx lot sir for your prompt response was expected modi mumkin jai hind
this has happened




In [27]:
#===========================================OTHER NECESSARY LIBRARIES=====================================================
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
import re

In [28]:
Twitter_Data.columns

Index(['clean_text', 'category'], dtype='object')

In [29]:
ps= PorterStemmer()
lemmatizer = WordNetLemmatizer()

In [30]:
def Clean_text(Twitter_Data):
    """Strip pattern matches, lowercase, drop English stopwords, and lemmatize each tweet."""
    stop_words = set(stopwords.words('english'))  # build once; set membership tests are fast
    tweets = []
    sentiments = []
    for index, row in Twitter_Data.iterrows():  # row-by-row loop: simple, but slow on large frames
        sentence = re.sub(pattern, '', row.clean_text)  # delete every match of the pattern
        words = [e.lower() for e in sentence.split()]
        words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
        words = ' '.join(words)
        tweets.append(words)
        sentiments.append(row.category)
    return tweets, sentiments

In [31]:
train_tweets,train_sentiments = Clean_text(train)

In [32]:
final_data = {'clean_text':train_tweets,'category':train_sentiments}

In [33]:
processed_data = pd.DataFrame(final_data)

In [34]:
processed_data

Unnamed: 0,clean_text,category
0,correct irrelevant modi losing whatever functi...,-1.0
1,modi insulted veteran leader like advani letti...,0.0
2,jnu student opposing mba course charge marketl...,-1.0
3,modi represents shift macaulay version hinduis...,1.0
4,fake story one believe party vote modi,-1.0
...,...,...
146675,pakistan pakistanlovers hugging modi smiling b...,-1.0
146676,probably election fear modi much opposition,1.0
146677,people fled brought justice rahul involved nir...,0.0
146678,live cong wanted kartarpur would india latest ...,1.0


### Converting Words into Vectors

In [35]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(1,3))
cv.fit(processed_data['clean_text'])

In [36]:
X_train = cv.transform(processed_data['clean_text'])

In [37]:
print(X_train.shape)

(146680, 2538909)
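The feature count is this large because `ngram_range=(1,3)` adds every distinct bigram and trigram to the vocabulary on top of the single words. Below is a rough, standard-library sketch of that counting. It uses plain whitespace tokenization, which differs in detail from `CountVectorizer`'s own tokenizer, and `ngram_vocab` is a hypothetical helper.

```python
def ngram_vocab(docs, n_min=1, n_max=3):
    """Collect the distinct word n-grams with n_min <= n <= n_max."""
    vocab = set()
    for doc in docs:
        tokens = doc.split()
        for n in range(n_min, n_max + 1):
            for i in range(len(tokens) - n + 1):
                vocab.add(" ".join(tokens[i:i + n]))
    return vocab

docs = ["modi vote party", "vote party today"]
print(len(ngram_vocab(docs, 1, 1)))  # 4 unigrams
print(len(ngram_vocab(docs, 1, 3)))  # 9 = 4 unigrams + 3 bigrams + 2 trigrams
```

Even in this two-document toy corpus the vocabulary more than doubles once bigrams and trigrams are included, which is the same effect at a much smaller scale.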


In [38]:
from sklearn.decomposition import PCA
pca=PCA(n_components=100)
reduced_data=pca.fit_transform(X_train)

In [39]:
print(reduced_data.shape)

(146680, 100)


In [40]:
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler(with_mean=False)
reduced_data=scaler.fit_transform(reduced_data)

In [41]:
reduced_data

array([[-0.08531971, -0.46188267, -0.05852396, ...,  0.00232431,
        -0.1313505 , -0.07184987],
       [-0.0296877 , -0.48287175, -0.067775  , ...,  1.02687591,
        -2.03163351,  1.35808562],
       [-0.06010901, -0.47075112, -0.161701  , ..., -0.07780511,
        -0.03654966, -0.10249727],
       ...,
       [ 0.17464727, -0.44784385, -0.49962579, ...,  0.26286916,
         0.3400652 ,  0.06251505],
       [-1.44879111,  1.85230429, -0.2984801 , ..., -1.67974466,
         1.29738377, -1.89077505],
       [ 0.12360982, -0.4646182 , -0.60404176, ..., -0.69381669,
         1.26724261, -0.6959688 ]])

In [42]:
target = processed_data['category'].values

In [43]:
target

array([-1.0, 0.0, -1.0, ..., 0.0, 1.0, 0.0], dtype=object)

#### ===================**Sentiment Analysis (Model Building)**==========================

In [44]:
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()

In [45]:
print(reduced_data.shape)
print(target.shape)

(146680, 100)
(146680,)


In [46]:
target=target.reshape(-1)

In [47]:
print(target.shape)

(146680,)


In [48]:
print(reduced_data.dtype)
print(target.dtype)

float64
object


In [49]:
target=target.astype(np.float32)

In [50]:
print(reduced_data.dtype)
print(target.dtype)

float64
float32


In [51]:
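from sklearn.preprocessing import MinMaxScaler
# MultinomialNB requires non-negative features, so rescale each component to [0, 1].
# Note: this rebinds `scaler`, so the StandardScaler fitted earlier is no longer reachable under that name.
scaler=MinMaxScaler()
reduced_data_scaled= scaler.fit_transform(reduced_data)
classifier.fit(reduced_data_scaled,target)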

In [52]:
classifier.fit(reduced_data_scaled,target)

In [53]:
test_tweets,test_sentiments = Clean_text(test)

In [54]:
data_test = {'clean_text':test_tweets,'category':test_sentiments}
final_test_data = pd.DataFrame(data_test)

In [55]:
X_test = cv.transform(final_test_data['clean_text'])

In [56]:
X_test.shape

(16298, 2538909)

In [57]:
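# Note: this predicts on the scaled training features (reduced_data_scaled), not on X_test,
# so y_pred holds training-set predictions.
y_pred = classifier.predict(reduced_data_scaled)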
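To score unseen tweets, the test text has to pass through the same fitted vectorizer, reduction, and scaling steps as the training text. One way to keep those steps together is a scikit-learn `Pipeline`. The sketch below uses a tiny invented corpus and swaps in `TruncatedSVD` (which accepts sparse input directly) for the PCA step, so it illustrates the idea rather than reproducing the notebook's exact method.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Tiny invented corpus; labels follow the dataset's -1 / 0 / 1 convention.
train_texts = ["great work good policy", "good decision great leader",
               "bad policy terrible result", "terrible decision bad leader",
               "meeting held today", "rally held today"]
train_labels = [1, 1, -1, -1, 0, 0]
test_texts = ["great leader good work", "bad result terrible policy"]
test_labels = [1, -1]

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 3)),
    TruncatedSVD(n_components=3, random_state=0),
    MinMaxScaler(clip=True),  # keep transformed test values inside [0, 1]
    MultinomialNB(),
)
model.fit(train_texts, train_labels)  # every step is fitted on training text only

y_pred = model.predict(test_texts)    # the test text reuses the fitted steps
print(y_pred)
print(accuracy_score(test_labels, y_pred))
```

Because the pipeline owns each fitted transformer, there is no risk of overwriting one scaler with another or of predicting on the wrong matrix.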

In [None]:
final_test_data

In [None]:
actual_values = final_test_data['category'].values

In [None]:
actual_values

In [None]:
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
final_test_data['category'] = le.fit_transform(final_test_data['category'])
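`LabelEncoder` sorts the distinct labels and maps them to the integers 0 to n-1, so the dataset's -1 / 0 / 1 categories become 0 / 1 / 2. A minimal sketch on an invented label list:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
encoded = le.fit_transform([-1.0, 0.0, 1.0, 1.0, -1.0])

print(le.classes_)                    # sorted original labels
print(encoded)                        # integer codes, in input order
print(le.inverse_transform(encoded))  # back to the original labels
```

If predictions are to be compared with encoded labels, the model's targets would need the same encoding. A common pattern is to fit the encoder on the training labels and reuse it on the test labels.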

In [None]:
final_test_data

In [None]:
X_test.toarray()[0]

================================================THANK YOU===============================================

For queries, mail: ranisoni6298@gmail.com