# Naive Bayes Binary Text Classification
Naive Bayes is a simple, fast probabilistic classifier that is widely used for text classification. It estimates the probability that a given text belongs to each class and assigns the text to the class with the highest probability. Because training and prediction are cheap, it scales well to large volumes of text, which makes it a reasonable baseline for binary tasks such as the negative-versus-positive sentiment problem in this notebook.

By:
<h3><a href='https://www.linkedin.com/in/sirqasim/'>Muhammad Qasim</a></h3>

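[Source: Kaggle](https://www.kaggle.com/competitions/sentiment-analysis-on-movie-reviews/data)

### Classes
* 0 - negative
* 1 - somewhat negative
* 2 - neutral
* 3 - somewhat positive
* 4 - positive

The data file loaded below contains only classes 0 (negative) and 4 (positive), so the task is binary.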
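
The decision rule described in the introduction (score every class, pick the most probable one) can be sketched by hand. The snippet below is a minimal illustration on a made-up four-document training set, not the notebook's data. It uses the multinomial word-count variant with Laplace smoothing, whereas the notebook later trains scikit-learn's `BernoulliNB`. The names `train_docs`, `log_posterior` and `predict` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy training set: (tokens, label); 0 = negative, 4 = positive
train_docs = [
    (["plodding", "mess"], 0),
    (["hard", "time", "sit"], 0),
    (["honest", "sensitive", "story"], 4),
    (["sensitive", "funny", "story"], 4),
]

vocab = sorted({tok for toks, _ in train_docs for tok in toks})
labels = sorted({label for _, label in train_docs})

# Documents per class (for the priors) and word counts per class (for the likelihoods)
class_counts = Counter(label for _, label in train_docs)
word_counts = {c: Counter() for c in labels}
for toks, label in train_docs:
    word_counts[label].update(toks)


def log_posterior(tokens, c, alpha=1.0):
    """Unnormalized log P(c | tokens) with Laplace smoothing."""
    total = sum(word_counts[c].values())
    score = math.log(class_counts[c] / len(train_docs))  # log prior
    for tok in tokens:
        if tok in vocab:  # words never seen in training are ignored
            score += math.log((word_counts[c][tok] + alpha) / (total + alpha * len(vocab)))
    return score


def predict(tokens):
    """Return the class with the highest posterior score."""
    return max(labels, key=lambda c: log_posterior(tokens, c))


print(predict(["sensitive", "story"]))
print(predict(["mess"]))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.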

In [2]:
import pandas as pd

In [3]:
df1 = pd.read_feather("https://github.com/EnggQasim/PGD_Batch2_Machine_Learning/blob/main/class14_nlp_navey_bayes_text_classification/data.feather?raw=true")
df1.head()

Unnamed: 0,Phrase,Sentiment,input_data
0,would have a hard time sitting through this one,0,"['would, hard, time, sit, one, ']"
1,have a hard time sitting through this one,0,"['have, hard, time, sit, one, ']"
2,Aggressive self-glorification and a manipulati...,0,"['aggressive, self, clarification, manipulativ..."
3,self-glorification and a manipulative whitewash,0,"['self, clarification, manipulative, whitewash..."
4,Trouble Every Day is a plodding mess .,0,"['trouble, every, day, pad, mess, ., ']"


In [17]:
import pandas as pd
import numpy as np
import re
import time

In [18]:
df1['input_data1'] = df1.input_data.apply(lambda x:" ".join(x))
df1.head()

Unnamed: 0,Phrase,Sentiment,input_data,input_data1
0,would have a hard time sitting through this one,0,"['would, hard, time, sit, one, ']",'would hard time sit one '
1,have a hard time sitting through this one,0,"['have, hard, time, sit, one, ']",'have hard time sit one '
2,Aggressive self-glorification and a manipulati...,0,"['aggressive, self, clarification, manipulativ...",'aggressive self clarification manipulative wh...
3,self-glorification and a manipulative whitewash,0,"['self, clarification, manipulative, whitewash...",'self clarification manipulative whitewash '
4,Trouble Every Day is a plodding mess .,0,"['trouble, every, day, pad, mess, ., ']",'trouble every day pad mess . '


In [19]:
df1.Sentiment.value_counts(normalize=True)*100

4    56.554859
0    43.445141
Name: Sentiment, dtype: float64

In [20]:
print("Both classes, share of samples (%):\n",(df1.Sentiment.value_counts(normalize=True)*100))
print()
print("Both classes, ratio out of 10:\n",(df1.Sentiment.value_counts(normalize=True)*100)*10/100)
print()
print("Both classes, 20% sample count:\n",(df1.Sentiment.value_counts())*20/100)

Both classes, share of samples (%):
 4    56.554859
0    43.445141
Name: Sentiment, dtype: float64

Both classes, ratio out of 10:
 4    5.655486
0    4.344514
Name: Sentiment, dtype: float64

Both classes, 20% sample count:
 4    1841.2
0    1414.4
Name: Sentiment, dtype: float64


In [21]:
test = pd.concat([df1[df1.Sentiment==4].sample(1841),
                    df1[df1.Sentiment==0].sample(1414)])[["input_data1","Sentiment"]]
print(len(test))
display(test.head(1))

train = df1.loc[~df1.index.isin(test.index.values)][["input_data1","Sentiment"]]
print(len(train))
display(train.head(1))

3255


Unnamed: 0,input_data1,Sentiment
9423,"'an honest , sensitive story vietnamese point ...",4


13023


Unnamed: 0,input_data1,Sentiment
1,'have hard time sit one ',0
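
The manual split above samples 20% of each class separately so that the test set keeps the class balance. A similar stratified split can be sketched with scikit-learn's `train_test_split`; this is an alternative, not what the notebook ran. The `toy` DataFrame is a hypothetical stand-in for `df1`, and `random_state=42` is an arbitrary choice that makes the split repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for df1 with the two columns used in the split
toy = pd.DataFrame({
    "input_data1": [f"text {i}" for i in range(100)],
    "Sentiment": [4] * 60 + [0] * 40,
})

# stratify keeps the 60/40 class ratio in both train and test
train_toy, test_toy = train_test_split(
    toy[["input_data1", "Sentiment"]],
    test_size=0.2,
    stratify=toy["Sentiment"],
    random_state=42,
)

print(len(train_toy), len(test_toy))
print(test_toy.Sentiment.value_counts())
```

Fixing the seed matters for comparisons: `DataFrame.sample` without `random_state` draws a different test set on every run, so accuracies from separate runs are not directly comparable.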


In [22]:
X = df1[["input_data1"]]  # all data (train and test rows) used as the corpus
print(X.head())

                                         input_data1
0                         'would hard time sit one '
1                          'have hard time sit one '
2  'aggressive self clarification manipulative wh...
3       'self clarification manipulative whitewash '
4                    'trouble every day pad mess . '


In [55]:
from sklearn.feature_extraction.text import TfidfVectorizer
# list of text documents
text = X.input_data1.values
# create the transform
vectorizer = TfidfVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# summarize
print(vectorizer.vocabulary_)

# encode document
# vector = vectorizer.transform(text)

{'would': 8436, 'hard': 3468, 'time': 7667, 'sit': 6864, 'one': 5239, 'have': 3501, 'aggressive': 260, 'self': 6651, 'clarification': 1320, 'manipulative': 4602, 'whitewash': 8327, 'trouble': 7823, 'every': 2621, 'day': 1852, 'pad': 5375, 'mess': 4759, 'is': 4071, 'padding': 5376, 'could': 1664, 'hate': 3496, 'reason': 6080, 'oedekerk': 5208, 'realization': 6076, 'childhood': 1261, 'dream': 2274, 'martial': 4646, 'arts': 504, 'flick': 2969, 'prove': 5907, 'sometimes': 7000, 'youth': 8484, 'remain': 6188, 'baseball': 687, 'movies': 4947, 'try': 7834, 'mythic': 4995, 'hampered': 3444, 'paralyze': 5416, 'insurgent': 3981, 'script': 6607, 'aim': 272, 'poetry': 5679, 'end': 2475, 'sound': 7029, 'like': 4401, 'satire': 6524, 'little': 4443, 'sense': 6666, 'go': 3273, 'avoid': 602, 'almost': 314, 'feel': 2852, 'movie': 4940, 'interest': 4001, 'entertain': 2526, 'amuse': 361, 'us': 8052, 'progression': 5877, 'ramble': 6019, 'coherence': 1401, 'give': 3247, 'new': 5083, 'mean': 4697, 'phrase': 
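
The vectorizer above is fitted on the full corpus, so its vocabulary and IDF weights also reflect the test rows. A common alternative is to fit on the training texts only and reuse the fitted vectorizer on the test texts. The sketch below shows that pattern on made-up sentences; `train_texts` and `test_texts` are hypothetical and are not the notebook's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy texts standing in for train.input_data1 / test.input_data1
train_texts = ["hard time sit one", "honest sensitive story", "plodding mess"]
test_texts = ["sensitive story unseen"]

vec = TfidfVectorizer()
train_mat = vec.fit_transform(train_texts)  # learn vocabulary and IDF from train only
test_mat = vec.transform(test_texts)        # reuse them; unseen words are dropped

print(train_mat.shape, test_mat.shape)
print("unseen" in vec.vocabulary_)
```

Both matrices share the same columns, so a model trained on `train_mat` can score `test_mat` directly.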

In [56]:
train_x = vectorizer.transform(train.input_data1.values).toarray()
train_y = train.Sentiment.values
test_x = vectorizer.transform(test.input_data1.values).toarray()
test_y = test.Sentiment.values
print(len(train_x),len(train_y))
print(len(test_x),len(test_y))

13023 13023
3255 3255


# Train Naive Bayes ML Algorithm

In [57]:
from sklearn.naive_bayes import BernoulliNB
clf = BernoulliNB()
clf.fit(train_x,train_y)

BernoulliNB()
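
One detail worth keeping in mind: `BernoulliNB` binarizes its input (the default threshold is `binarize=0.0`), so it only sees whether a word is present, not its TF-IDF weight. The sketch below, on made-up sentences, checks that a model fitted on TF-IDF features learns the same parameters as one fitted on binary presence features. The toy texts and variable names are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy data: 0 = negative, 4 = positive
texts = ["hard time sit one", "plodding mess",
         "honest sensitive story", "funny sensitive story"]
y = np.array([0, 0, 4, 4])

tfidf_x = TfidfVectorizer().fit_transform(texts)               # weighted features
binary_x = CountVectorizer(binary=True).fit_transform(texts)   # 0/1 presence features

nb_tfidf = BernoulliNB().fit(tfidf_x, y)
nb_binary = BernoulliNB().fit(binary_x, y)

# Compare the learned per-class word log-probabilities
same_params = np.allclose(nb_tfidf.feature_log_prob_, nb_binary.feature_log_prob_)
print(same_params)
```

If the TF-IDF weights should influence the model, `MultinomialNB` is one estimator that uses the feature values themselves rather than thresholding them.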

# Test Data

In [58]:
y_predict = clf.predict(test_x)
y_predict

array([4, 4, 4, ..., 0, 0, 0], dtype=int64)

## Confusion Matrix

In [59]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

cm = confusion_matrix(test_y,y_predict)
cm

array([[1238,  176],
       [  93, 1748]], dtype=int64)

<img src='https://miro.medium.com/max/712/1*Z54JgbS4DUwWSknhDCvNTQ.png'>
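
Per-class precision and recall can be read off the confusion matrix printed above. In scikit-learn's layout the rows are true labels and the columns are predicted labels, with labels in sorted order (here 0, then 4). The sketch below recomputes the metrics from the matrix values; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the Naive Bayes run above (rows: true 0, 4; columns: predicted 0, 4)
cm = np.array([[1238, 176],
               [93, 1748]])

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)     # per true class (row totals)
precision = np.diag(cm) / cm.sum(axis=0)  # per predicted class (column totals)

for label, p, r in zip([0, 4], precision, recall):
    print(f"class {label}: precision={p:.3f} recall={r:.3f}")
print(f"accuracy={accuracy:.4f}")
```

The accuracy computed this way matches the `accuracy_score` value reported in the next cell.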

In [60]:
accuracy_score(test_y, y_predict)

0.917357910906298

In [None]:
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(train_x,train_y)

In [37]:
y_predict = clf.predict(test_x)
y_predict

array([4, 4, 4, ..., 0, 0, 0], dtype=int64)

In [40]:
accuracy_score(test_y, y_predict)  #Decision TREE

0.8866359447004608

In [46]:
from sklearn.ensemble import RandomForestClassifier  
clf = RandomForestClassifier()  
clf.fit(train_x,train_y)

RandomForestClassifier()

In [47]:
y_predict = clf.predict(test_x)
y_predict

array([4, 4, 4, ..., 0, 0, 0], dtype=int64)

In [49]:
accuracy_score(test_y, y_predict)  #Random Forest

0.9127496159754225

In [79]:
from sklearn import svm
clf = svm.SVC()
clf.fit(train_x,train_y)

SVC()

In [80]:
y_predict = clf.predict(test_x)
y_predict

array([4, 4, 4, ..., 0, 0, 0], dtype=int64)

In [81]:
accuracy_score(test_y, y_predict)  #SVM

0.9453149001536099

In [82]:
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(train_x,train_y)

KNeighborsClassifier(n_neighbors=1)

In [83]:
y_predict = clf.predict(test_x)
y_predict


array([4, 4, 4, ..., 0, 0, 0], dtype=int64)

In [84]:
accuracy_score(test_y, y_predict)  #KNeighborsClassifier

0.9299539170506912
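
On this particular split, the recorded accuracies were about 0.945 for SVM, 0.930 for 1-nearest-neighbour, 0.917 for Naive Bayes, 0.913 for Random Forest and 0.887 for the Decision Tree. The repeated fit, predict and score cells can be folded into one loop, sketched below on made-up toy data rather than the notebook's data. It also keeps the TF-IDF matrices sparse instead of calling `.toarray()`, which these estimators accept. The `random_state=0` values and all `toy_*` names are assumptions for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for the train/test frames above
toy_train_text = ["plodding mess", "hard time sit", "dull mess", "boring hard sit",
                  "honest sensitive story", "funny sensitive story",
                  "honest funny film", "sensitive film"]
toy_train_y = [0, 0, 0, 0, 4, 4, 4, 4]
toy_test_text = ["dull plodding mess", "honest funny story"]
toy_test_y = [0, 4]

vec = TfidfVectorizer()
toy_train_x = vec.fit_transform(toy_train_text)  # sparse matrix, no .toarray()
toy_test_x = vec.transform(toy_test_text)

models = {
    "BernoulliNB": BernoulliNB(),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(random_state=0),
    "SVC": SVC(),
    "KNN (k=1)": KNeighborsClassifier(n_neighbors=1),
}

# Fit each model on the same features and record its test accuracy
results = {}
for name, model in models.items():
    model.fit(toy_train_x, toy_train_y)
    results[name] = accuracy_score(toy_test_y, model.predict(toy_test_x))

for name, acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```

Because every model sees identical features and the same split, differences in the printed scores reflect the models rather than the data preparation.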