# Sentiment Analysis using Multinomial Naive Bayes

This notebook applies Multinomial Naive Bayes to tweet sentiment analysis. The data comes from Sentiment140, a dataset of 1.6 million tweets annotated with sentiment; the cells below load a small sample of it from `sample_dataset.csv`.


## Dataset Overview

The Sentiment140 dataset includes the following columns:

- **target**: the polarity of the tweet (0 = negative, 4 = positive)
- **ids**: the id of the tweet (e.g. 2087)
- **date**: the date of the tweet (e.g. Sat May 16 23:58:44 UTC 2009)
- **flag**: the query (e.g. lyx); if there is no query, this value is NO_QUERY
- **user**: the user who tweeted (e.g. robotickilldozr)
- **text**: the text of the tweet (e.g. Lyx is cool)

The objective is to predict the sentiment of the tweets as positive or negative using the text data.

In [2]:
# Import libraries

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


In [3]:
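# Load the dataset.
# NOTE: sample_dataset.csv already contains a header row and an index column.
# Passing names= without header=0 makes pandas treat that header row as data,
# which is why a row (and later a class label) called 'target' appears in the outputs.
df = pd.read_csv('sample_dataset.csv', names=['target', 'ids', 'date', 'flag', 'user', 'text'])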

In [4]:
# Display the first few rows of the dataframe
df.head()

Unnamed: 0,target,ids,date,flag,user,text
,target,ids,date,flag,user,text
734779.0,0,2264631528,Sun Jun 21 04:28:40 PDT 2009,NO_QUERY,andychong9,dad having fever again.. not looking too good
632647.0,0,2232741622,Thu Jun 18 20:19:48 PDT 2009,NO_QUERY,SLeepdepD,@judahgabriel i wish i had that much to say
337706.0,0,2014269923,Wed Jun 03 00:59:10 PDT 2009,NO_QUERY,BlackCat_Saya,@rohan_01 you know..it's really sad that u kno...
465228.0,0,2175302784,Mon Jun 15 00:36:56 PDT 2009,NO_QUERY,sawarahh,@cathicks i don't get it.
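
The first row of the output above repeats the column names, which suggests the file already has a header row and an unnamed index column. The sketch below reproduces that layout in memory and compares two ways of reading it. The `csv_text` string is a hypothetical stand-in for `sample_dataset.csv`, built from two rows of the output above; `df_with_header_row` and `df_clean` are illustrative names.

```python
import io

import pandas as pd

# Hypothetical stand-in for sample_dataset.csv: a header row plus an unnamed
# index column. The two data rows are copied from the df.head() output above.
csv_text = (
    ",target,ids,date,flag,user,text\n"
    "734779,0,2264631528,Sun Jun 21 04:28:40 PDT 2009,NO_QUERY,andychong9,"
    "dad having fever again.. not looking too good\n"
    "465228,0,2175302784,Mon Jun 15 00:36:56 PDT 2009,NO_QUERY,sawarahh,"
    "@cathicks i don't get it.\n"
)

cols = ['target', 'ids', 'date', 'flag', 'user', 'text']

# names= with no header argument: the header line is parsed as a data row
df_with_header_row = pd.read_csv(io.StringIO(csv_text), names=cols)

# index_col=0 with default header handling: the header line supplies the names
df_clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df_with_header_row['target'].tolist())  # includes the string 'target'
print(df_clean['target'].tolist())            # only the numeric labels
```

Reading the file with `index_col=0` also keeps `target` numeric, so the classifier sees only the labels 0 and 4.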


In [5]:
# Join every tweet into one string, separated by spaces
all_text = ' '.join(df['text'].values)
all_text

"text dad having fever again.. not looking too good  @judahgabriel i wish i had that much to say  @rohan_01 you know..it's really sad that u know that ur classmate don't care about you..        sad memories t.t @cathicks i don't get it.  @dougsky i will have a look when i get home!  @mikegentile hey stud. go figure you come to the land of whores when i am gone in the bahamas  @alyxayer  and @richbello ... hey it was a player from akron!   i hope you guys thought of me, haha, it's okay i know you didn't! @joejgirl2009 oh... cool... standing out...  eaten way too much junk food. feel as if i'm about to explode! not good  needs help from a wordpress theme expert  http://plurk.com/p/ywsk3 a great hard training weekend is over.  a couple days of rest and lets do it again!  lots of computer time to put in now  @faketragedycom i know  @teufl0302 @stevebrunton sweeties, i gotta go... having dinner and eating lots of cake with my mom  school over in 4 more days no more teachers!! grad on wednes

In [6]:
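# Split the joined text on '. ' to get sentence-like fragments.
# These fragments do not line up with individual tweets.
sentences = all_text.split('. ')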

In [7]:
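# Compute TF-IDF weights for each fragment and view them as a DataFrame
# (rows are fragments, columns are vocabulary terms)
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(sentences)
tf_idf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())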

In [8]:
tf_idf_df

Unnamed: 0,19,4211c,62,about,ache,across,actually,advent,after,again,...,xps,xxandip,yeah,yes,yo,you,your,ywsk3,¹à¹,à¹
0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.410152,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000
1,0.0,0.0,0.0,0.192474,0.000000,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.266583,0.0,0.0,0.000000,0.000000
2,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000
3,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000
4,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.166741,0.0,0.0,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
62,0.0,0.0,0.0,0.000000,0.070691,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.070691,0.141382
63,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000
64,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.194067,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000
65,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000


## Feature Extraction

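We use `CountVectorizer` to convert the tweet text into a sparse matrix of token counts, lowercasing the text and dropping English stop words, and then hold out 20% of the rows as a test set.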

In [9]:
# Initializing CountVectorizer
vectorizer = CountVectorizer(stop_words='english', lowercase=True)

# Fitting and transforming the text data
X = vectorizer.fit_transform(df['text'])
y = df['target']

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Applying Multinomial Naive Bayes

Next, we train a Multinomial Naive Bayes classifier on the training split and evaluate its predictions on the held-out test set.
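
For intuition, Multinomial Naive Bayes scores each class as its log prior plus the count-weighted sum of smoothed log term likelihoods, and predicts the class with the highest score. The sketch below recomputes those quantities by hand on a toy count matrix and checks them against scikit-learn. The data and all `*_toy` names are illustrative assumptions, and `alpha = 1.0` is the `MultinomialNB` default.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Illustrative count matrix: rows are documents, columns are vocabulary terms
X_toy = np.array([[2, 1, 0],
                  [1, 0, 1],
                  [0, 2, 1],
                  [0, 1, 2]])
y_toy = np.array([0, 0, 4, 4])  # same label encoding as the dataset

alpha = 1.0  # additive (Laplace) smoothing
classes = np.unique(y_toy)

# Log prior: fraction of training documents that belong to each class
log_prior = np.log(np.array([(y_toy == c).mean() for c in classes]))

# Smoothed term counts per class, normalized into per-class term probabilities
term_counts = np.array([X_toy[y_toy == c].sum(axis=0) for c in classes]) + alpha
log_likelihood = np.log(term_counts / term_counts.sum(axis=1, keepdims=True))

# Joint log score of a new document under each class
x_new = np.array([[1, 0, 2]])
joint = log_prior + x_new @ log_likelihood.T
manual_pred = classes[np.argmax(joint, axis=1)]

# The fitted scikit-learn model stores the same quantities
clf_toy = MultinomialNB(alpha=alpha).fit(X_toy, y_toy)
print(np.allclose(clf_toy.class_log_prior_, log_prior))
print(np.allclose(clf_toy.feature_log_prob_, log_likelihood))
print(manual_pred, clf_toy.predict(x_new))
```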

In [10]:
# Initializing the Multinomial Naive Bayes classifier
clf = MultinomialNB()

# Training the model
clf.fit(X_train, y_train)

# Predicting on the test set
y_pred = clf.predict(X_test)

# Evaluating the model
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
print(classification_report(y_test, y_pred))

Accuracy: 0.6190476190476191
              precision    recall  f1-score   support

           0       0.64      0.82      0.72        11
           4       0.57      0.44      0.50         9
      target       0.00      0.00      0.00         1

    accuracy                           0.62        21
   macro avg       0.40      0.42      0.41        21
weighted avg       0.58      0.62      0.59        21



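The `target` row in the report is not a sentiment class. It is the CSV header row, which was read as a data row when the file was loaded and ended up in the test split (support 1). Because the classifier never predicts this label, its precision is undefined and reported as 0.00. The three `_warn_prf` lines that scikit-learn printed here are the source-line echo of the resulting `UndefinedMetricWarning`. The accuracy and both averages include this spurious class, so they understate performance on the two real classes.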
