# **SENTIMENT ANALYSIS ON CHAT DATASET**
**BUILDING A SENTIMENT ANALYSIS MODEL USING NAIVE BAYES CLASSIFIER TRAINED WITH VECTORS MADE USING BOTH COUNT VECTORIZER AND TFIDF VECTORIZER.**

In [83]:
# Importing all Necessary libraries
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB,BernoulliNB
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
# Loading the dataset
df = pd.read_csv('chat_dataset.csv')

In [3]:
df.head()

Unnamed: 0,message,sentiment
0,I really enjoyed the movie,positive
1,The food was terrible,negative
2,I'm not sure how I feel about this,neutral
3,The service was excellent,positive
4,I had a bad experience,negative


In [4]:
df.tail()

Unnamed: 0,message,sentiment
579,I have to cancel my vacation plans because I c...,negative
580,My computer crashed and I lost all my importan...,negative
581,I got into a car accident and my car is totale...,negative
582,I have a cold and can't stop coughing. it's re...,negative
583,I just found out my ex is dating someone new. ...,negative


In [23]:
# Defining a function to remove emoji (non-ASCII) characters, punctuation, numbers and stop-words
# Building the stop-word set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(text):
    # Removing emoji and other non-ASCII characters
    text = re.sub(r'[^\x00-\x7F]+', '', text)
    # Converting to lowercase
    text = text.lower()
    # Removing punctuation (this also strips apostrophes, so "i'm" becomes "im")
    text = re.sub(r'[^\w\s]', '', text)
    # Removing numbers
    text = re.sub(r'\d+', '', text)
    # Removing stop-words (the text is already lowercase at this point)
    words = text.split()
    filtered_words = [word for word in words if word not in STOP_WORDS]
    return ' '.join(filtered_words)
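
The effect of these cleaning steps on negation is easiest to see on a single sentence. The sketch below repeats the same regular expressions but, so that it runs without the NLTK download, uses a tiny hypothetical stop-word set (`STOP_WORDS_DEMO` and `clean` are illustrative names, not part of the notebook). NLTK's English list also contains "not", which is why "I'm not sure how I feel about this" becomes "im sure feel" in the dataframe.

```python
import re

# Hypothetical stand-in for NLTK's English stop-word list (assumption:
# a small subset chosen for illustration; the real list is much longer).
STOP_WORDS_DEMO = {"i", "the", "was", "not", "at", "all", "about", "this", "how"}

def clean(text, stop_words=STOP_WORDS_DEMO):
    text = re.sub(r'[^\x00-\x7F]+', '', text)  # drop non-ASCII characters such as emojis
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)        # drop punctuation, including apostrophes
    text = re.sub(r'\d+', '', text)            # drop digits
    return ' '.join(w for w in text.split() if w not in stop_words)

print(clean("I'm not happy at all 😞"))  # im happy
print(clean("The food was not good!"))   # food good
```

Both sentences lose their negation, so a bag-of-words model sees them as positive-looking text. Keeping negation words out of the stop-word set is one option worth testing.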


In [42]:
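# Applying the preprocessing function to the text data
df['messsage'] = df['message'].apply(preprocess_text)
# Later cells read the cleaned text from 'newMesssage', so it is stored under that name as well
df['newMesssage'] = df['message'].apply(preprocess_text)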

In [43]:
#Verifying and comparing the changes made in the dataframe
df.sample(10)

Unnamed: 0,message,sentiment,messsage,newMesssage
493,I'm so disappointed in myself for making that ...,negative,im disappointed making mistake,im disappointed making mistake
168,The scenery here is unremarkable,neutral,scenery unremarkable,scenery unremarkable
40,The service at this hotel was terrible,negative,service hotel terrible,service hotel terrible
492,I had a terrible day at work today 😞,negative,terrible day work today,terrible day work today
311,The coffee was excellent,positive,coffee excellent,coffee excellent
443,I'm feeling so content with my life right now 😊,positive,im feeling content life right,im feeling content life right
253,The game was average,neutral,game average,game average
526,I hate feeling like I'm not good enough 😔,negative,hate feeling like im good enough,hate feeling like im good enough
345,I'm not feeling very productive today 📉,neutral,im feeling productive today,im feeling productive today
119,The traffic is heavy,negative,traffic heavy,traffic heavy


In [44]:
df.head()

Unnamed: 0,message,sentiment,messsage,newMesssage
0,I really enjoyed the movie,positive,really enjoyed movie,really enjoyed movie
1,The food was terrible,negative,food terrible,food terrible
2,I'm not sure how I feel about this,neutral,im sure feel,im sure feel
3,The service was excellent,positive,service excellent,service excellent
4,I had a bad experience,negative,bad experience,bad experience


In [26]:
# Creating the TF-IDF vectorizer (it is fitted on the cleaned text in a later cell)
vectorizer = TfidfVectorizer()

In [50]:
df=df.drop('messsage',axis=1)

In [51]:
df.head()

Unnamed: 0,message,sentiment,newMesssage
0,I really enjoyed the movie,positive,really enjoyed movie
1,The food was terrible,negative,food terrible
2,I'm not sure how I feel about this,neutral,im sure feel
3,The service was excellent,positive,service excellent
4,I had a bad experience,negative,bad experience


In [55]:
# Target labels
y = df['sentiment']

In [56]:
# Fitting and transforming the text data
X = vectorizer.fit_transform(df['newMesssage'])

In [57]:
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

In [58]:
# Initializing the Naive Bayes classifier
nb = MultinomialNB()

In [59]:
# Hyperparameter tuning using GridSearchCV
param_grid = {
    'alpha': [0.1, 0.5, 1.0, 1.5, 2.0]
}
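
For context on what `alpha` controls: scikit-learn's `MultinomialNB` estimates each per-class feature probability with additive (Laplace/Lidstone) smoothing. The formula below is the standard smoothed estimate; the numbers in the worked lines are hypothetical and only illustrate how a larger `alpha` raises the probability given to a word never seen in a class.

```latex
\hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_y + \alpha n}
% N_{yi}: total weight of feature i in training documents of class y
% N_y:    total weight of all features in class y
% n:      number of features (vocabulary size)

% Hypothetical values: N_{yi} = 0,\; N_y = 50,\; n = 1000
\alpha = 0.1:\quad \hat{\theta}_{yi} = \frac{0 + 0.1}{50 + 0.1 \cdot 1000} = \frac{0.1}{150} \approx 6.7 \times 10^{-4}
\alpha = 2.0:\quad \hat{\theta}_{yi} = \frac{0 + 2.0}{50 + 2.0 \cdot 1000} = \frac{2}{2050} \approx 9.8 \times 10^{-4}
```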


In [60]:
grid_search = GridSearchCV(nb, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
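
One caveat with the cells above: the vectorizer is fitted on the whole dataframe before the split, so the vocabulary and IDF weights have already seen the test messages. A possible alternative, sketched here on a toy list of sentences (an assumption, not the chat dataset), is to put the vectorizer and the classifier in a `Pipeline` so that each cross-validation fold fits the vectorizer on its own training part only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-in data (assumption: illustrative only, not chat_dataset.csv)
texts = [
    "really enjoyed the movie", "the service was excellent",
    "great coffee and friendly staff", "what a wonderful day",
    "the hotel was lovely", "excellent food and great view",
    "the food was terrible", "i had a bad experience",
    "awful traffic this morning", "the service was horrible",
    "terrible day at work", "bad coffee and rude staff",
]
labels = ["positive"] * 6 + ["negative"] * 6

# Split the raw text first, so the test messages stay unseen
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=1, stratify=labels
)

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])
param_grid = {"nb__alpha": [0.1, 0.5, 1.0, 1.5, 2.0]}

# cv=2 only because the toy data is tiny; the notebook uses cv=5
search = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print(search.best_params_, test_accuracy)
```

With the real data, `texts` would be `df['newMesssage']` and `labels` would be `df['sentiment']`.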

In [61]:
# Best model
best_nb = grid_search.best_estimator_
print(best_nb)


MultinomialNB(alpha=0.1)


In [62]:
# Predict on the test set
y_pred = best_nb.predict(X_test)

In [63]:
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

Accuracy: 0.815068493150685


Using the TF-IDF vectorizer with a Multinomial Naive Bayes model, we achieve a test accuracy of 81.5% after removing stop words, emojis, numbers and punctuation.
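
Accuracy alone does not show which of the three sentiment classes is being confused with which. A confusion matrix and per-class report can help here. The sketch below uses short hand-written label lists (an assumption for illustration, not the notebook's predictions); with the notebook's variables the calls would take `y_test` and `y_pred` instead.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hand-written example labels (assumption: not the real predictions)
y_true_demo = ["positive", "negative", "neutral", "positive",
               "negative", "neutral", "positive", "negative"]
y_pred_demo = ["positive", "negative", "positive", "positive",
               "neutral", "neutral", "positive", "negative"]

class_names = ["negative", "neutral", "positive"]

# Rows are true classes, columns are predicted classes, in class_names order
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=class_names)
print(cm)
# [[2 1 0]
#  [0 1 1]
#  [0 0 3]]

print(accuracy_score(y_true_demo, y_pred_demo))  # 0.75
print(classification_report(y_true_demo, y_pred_demo, labels=class_names))
```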

Using the appearance-based (binary=True) CountVectorizer and creating a Bernoulli Naive Bayes model.

In [80]:
vectorizer3=CountVectorizer(binary=True)
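
To make the difference between appearance-based and frequency-based vectors concrete, the short sketch below vectorizes two made-up sentences (an assumption, not rows from the dataset) both ways. With `binary=True` a repeated word is recorded as 1 rather than as its count.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents; "good" appears twice in the first one
docs = ["good good movie", "bad movie"]

freq_vec = CountVectorizer(binary=False)
counts = freq_vec.fit_transform(docs).toarray()

bin_vec = CountVectorizer(binary=True)
presence = bin_vec.fit_transform(docs).toarray()

# Columns follow the alphabetically sorted vocabulary
print(sorted(freq_vec.vocabulary_))  # ['bad', 'good', 'movie']
print(counts)    # [[0 2 1]
                 #  [1 0 1]]
print(presence)  # [[0 1 1]
                 #  [1 0 1]]
```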

In [81]:
X3=vectorizer3.fit_transform(df['newMesssage'])

In [82]:
X_train3, X_test3, y_train3, y_test3 = train_test_split(X3, y, test_size=0.25, random_state=1)

In [89]:
nb3 = BernoulliNB()

In [86]:
param_grid = {
    'alpha': [0.1, 0.5, 1.0, 1.5, 2.0]
}

In [91]:
grid_search = GridSearchCV(nb3, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train3, y_train3)

In [92]:
best_nb = grid_search.best_estimator_
print(best_nb)

BernoulliNB(alpha=0.1)


In [93]:
y_pred3 = best_nb.predict(X_test3)

In [95]:
accuracy = accuracy_score(y_test3, y_pred3)
print(f'Accuracy: {accuracy}')

Accuracy: 0.8424657534246576


The Bernoulli Naive Bayes model trained on appearance-based (binary) CountVectorizer vectors achieves a test accuracy of 84.25%. This is the highest test accuracy of the three configurations tried in this notebook.

Using the frequency-based (binary=False) CountVectorizer and creating a new Multinomial Naive Bayes model.

In [96]:
vectorizer2=CountVectorizer(binary=False)

In [97]:
X2=vectorizer2.fit_transform(df['newMesssage'])

In [98]:
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y, test_size=0.25, random_state=1)

In [99]:
nb2 = MultinomialNB()

In [100]:
param_grid = {
    'alpha': [0.1, 0.5, 1.0, 1.5, 2.0]
}


In [101]:
grid_search = GridSearchCV(nb2, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train2, y_train2)

In [102]:
best_nb = grid_search.best_estimator_
print(best_nb)

MultinomialNB(alpha=0.5)


In [103]:
y_pred2 = best_nb.predict(X_test2)

In [104]:
# Scoring this model's own predictions (y_pred2), not the TF-IDF model's y_pred
accuracy = accuracy_score(y_test2, y_pred2)
print(f'Accuracy: {accuracy}')

Accuracy: 0.8287671232876712


The Multinomial Naive Bayes model trained on frequency-based CountVectorizer vectors achieves a test accuracy of 82.88%.
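
The three reported accuracies come from a single 25% test split. With 584 messages that is 146 test rows, so the scores correspond to 119, 123 and 121 correct predictions, and the gaps between models are only a few messages. A cross-validated comparison on shared folds may give a steadier picture. The sketch below shows one way to set that up on toy sentences (an assumption, not the chat dataset); the `candidates` dictionary and its names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in data (assumption: illustrative only)
texts = [
    "really enjoyed the movie", "the service was excellent",
    "great coffee and friendly staff", "what a wonderful day",
    "the hotel was lovely", "excellent food and great view",
    "the food was terrible", "i had a bad experience",
    "awful traffic this morning", "the service was horrible",
    "terrible day at work", "bad coffee and rude staff",
]
labels = ["positive"] * 6 + ["negative"] * 6

# The three vectorizer/classifier pairings used in this notebook
candidates = {
    "tfidf + MultinomialNB": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "binary counts + BernoulliNB": make_pipeline(CountVectorizer(binary=True), BernoulliNB()),
    "counts + MultinomialNB": make_pipeline(CountVectorizer(), MultinomialNB()),
}

# Same folds for every candidate, so the scores are comparable
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
results = {
    name: cross_val_score(model, texts, labels, cv=cv, scoring="accuracy")
    for name, model in candidates.items()
}

for name, scores in results.items():
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```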