# Explore here

In [1]:
import pandas as pd

total_data = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/naive-bayes-project-tutorial/main/playstore_reviews.csv")

total_data.head()

Unnamed: 0,package_name,review,polarity
0,com.facebook.katana,privacy at least put some option appear offli...,0
1,com.facebook.katana,"messenger issues ever since the last update, ...",0
2,com.facebook.katana,profile any time my wife or anybody has more ...,0
3,com.facebook.katana,the new features suck for those of us who don...,0
4,com.facebook.katana,forced reload on uploading pic on replying co...,0


In [2]:
total_data

Unnamed: 0,package_name,review,polarity
0,com.facebook.katana,privacy at least put some option appear offli...,0
1,com.facebook.katana,"messenger issues ever since the last update, ...",0
2,com.facebook.katana,profile any time my wife or anybody has more ...,0
3,com.facebook.katana,the new features suck for those of us who don...,0
4,com.facebook.katana,forced reload on uploading pic on replying co...,0
...,...,...,...
886,com.rovio.angrybirds,loved it i loooooooooooooovvved it because it...,1
887,com.rovio.angrybirds,all time legendary game the birthday party le...,1
888,com.rovio.angrybirds,ads are way to heavy listen to the bad review...,0
889,com.rovio.angrybirds,fun works perfectly well. ads aren't as annoy...,1


2.1 Stripping leading and trailing whitespace and converting the text to lowercase:

In [3]:
total_data["review"] = total_data["review"].str.strip().str.lower()
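To see what this cleaning step does, here is a minimal sketch on a few made-up reviews (the sample strings are illustrative, not rows from the dataset). Note that `str.strip()` only trims whitespace at the two ends of each string; spaces inside the text are kept.

```python
import pandas as pd

# Made-up reviews for illustration only (not taken from playstore_reviews.csv)
sample = pd.Series(["  Great App!  ", "\tKeeps CRASHING after update\n", "ok"])

# Same chain as in the cell above: trim both ends, then lowercase
cleaned = sample.str.strip().str.lower()

print(cleaned.tolist())  # ['great app!', 'keeps crashing after update', 'ok']
```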

In [4]:
from sklearn.model_selection import train_test_split

X = total_data["review"]
y = total_data["polarity"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

X_train.head()

331    just did the latest update on viber and yet ag...
733    keeps crashing it only works well in extreme d...
382    the fail boat has arrived the 6.0 version is t...
704    superfast, just as i remember it ! opera mini ...
813    installed and immediately deleted this crap i ...
Name: review, dtype: object

In [5]:
y_train.head()

331    0
733    0
382    0
704    1
813    1
Name: polarity, dtype: int64

2.2 Transform the text into a word count matrix, which gives us numerical features from the text. We fit the transformer on the training set only and then apply it to the test set:

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

vec_model = CountVectorizer(stop_words = "english")
X_train = vec_model.fit_transform(X_train).toarray()
X_test = vec_model.transform(X_test).toarray()

vec_model = CountVectorizer(stop_words = "english")

Setting the stop_words parameter to "english" instructs the CountVectorizer to remove common English stop words (such as "the", "is", and "and") from the input text. Stop words are frequent words that typically carry little information for natural language processing tasks.
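As a small illustration of the stop-word filter, the sketch below compares the tokens kept with and without `stop_words="english"`. It uses `build_analyzer()`, which returns the function the vectorizer applies to each document; the example sentence is made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the preprocessing + tokenizing + stop-word-filtering
# function that the vectorizer applies to each document; no fitting is needed.
with_stop_words = CountVectorizer().build_analyzer()
without_stop_words = CountVectorizer(stop_words="english").build_analyzer()

text = "the app is great and the update is fast"  # made-up example sentence

kept_all = with_stop_words(text)
kept_filtered = without_stop_words(text)

print(kept_all)       # every token of two or more characters
print(kept_filtered)  # ['app', 'great', 'update', 'fast']
```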

X_train = vec_model.fit_transform(X_train).toarray()

This line fits the CountVectorizer to the training data (X_train) and transforms the training data into a document-term matrix. The fit_transform method first learns the vocabulary of the training data (i.e., it assigns a unique integer index to each unique word in the training data), and then it transforms the training data into a sparse matrix representation where each row corresponds to a document in the training data, and each column corresponds to a unique word in the vocabulary. The .toarray() method converts this sparse matrix representation into a dense NumPy array.

X_test = vec_model.transform(X_test).toarray()

This line transforms the test data (X_test) using the same CountVectorizer object that was fitted to the training data. It ensures that the test data is represented using the same vocabulary that was learned from the training data. Again, the .toarray() method converts the sparse matrix representation into a dense NumPy array.
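The difference between `fit_transform` and `transform` is easiest to see on a toy corpus. The sketch below uses made-up documents (not the Play Store reviews): the vocabulary is learned from the training documents only, so a word that appears only in the test document gets no column.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents for illustration only
train_docs = ["the app is great and the app is fast", "the update is slow"]
test_docs = ["the app is slow and buggy"]

vec = CountVectorizer(stop_words="english")

# fit_transform learns the vocabulary from the training documents only
train_counts = vec.fit_transform(train_docs).toarray()
# transform reuses that vocabulary; "buggy" was never seen, so it is dropped
test_counts = vec.transform(test_docs).toarray()

# Vocabulary terms ordered by their column index
vocabulary = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

print(vocabulary)             # ['app', 'fast', 'great', 'slow', 'update']
print(train_counts.tolist())  # [[2, 1, 1, 0, 0], [0, 0, 0, 1, 1]]
print(test_counts.tolist())   # [[1, 0, 0, 1, 0]]
```

Each row is one document and each column is one vocabulary word, so both matrices have the same number of columns even though the test document contains a word the training set never saw.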

In [8]:
X_train

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

<b>Step 3: Build a Naive Bayes model</b>

Start by implementing a model. You will have to choose which of the three implementations to use, <b>GaussianNB</b>, <b>MultinomialNB</b> or <b>BernoulliNB</b>, based on what we have studied in the module. Then train the other two implementations as well and confirm whether the model you chose was the right one.
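One compact way to run this comparison is to loop over the three classifiers. The sketch below does so on a tiny made-up corpus, so its scores say nothing about the real dataset; the comments summarize what each variant assumes about the features. Word counts are non-negative integers, which is the kind of input MultinomialNB is designed for.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Tiny made-up corpus for illustration only (1 = positive, 0 = negative)
docs = ["love this app", "great and fast", "works great love it",
        "keeps crashing", "slow and buggy", "crashing after update"]
labels = [1, 1, 1, 0, 0, 0]

# GaussianNB needs a dense array, so convert once and reuse it for all three
X = CountVectorizer().fit_transform(docs).toarray()

candidates = {
    "GaussianNB": GaussianNB(),        # treats each feature as a continuous Gaussian
    "MultinomialNB": MultinomialNB(),  # models word counts
    "BernoulliNB": BernoulliNB(),      # models word presence/absence (binarizes counts)
}

train_accuracy = {}
for name, model in candidates.items():
    model.fit(X, labels)
    # Accuracy on the toy training data itself, only to show the loop works
    train_accuracy[name] = model.score(X, labels)

print(train_accuracy)
```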

In [9]:
from sklearn.naive_bayes import GaussianNB

GNB_model = GaussianNB()
GNB_model.fit(X_train, y_train)

In [10]:
y_pred_GNB = GNB_model.predict(X_test)
y_pred_GNB

array([0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
       1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1,
       0, 0, 0])

In [11]:
from sklearn.metrics import accuracy_score

accuracy_score_GNB = accuracy_score(y_test, y_pred_GNB)
accuracy_score_GNB

0.8044692737430168

In [12]:
from pickle import dump

# Use a context manager so the file is closed after writing
with open("naive_bayes_GaussianNB.sav", "wb") as model_file:
    dump(GNB_model, model_file)

In [13]:
from sklearn.naive_bayes import MultinomialNB

MNB_model = MultinomialNB()
MNB_model.fit(X_train, y_train)

In [14]:
y_pred_MNB = MNB_model.predict(X_test)
y_pred_MNB

array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 0, 0])

In [15]:
accuracy_score_MNB = accuracy_score(y_test, y_pred_MNB)
accuracy_score_MNB

0.8156424581005587

In [16]:
# Use a context manager so the file is closed after writing
with open("naive_bayes_MultinomialNB.sav", "wb") as model_file:
    dump(MNB_model, model_file)

In [17]:
from sklearn.naive_bayes import BernoulliNB

BNB_model = BernoulliNB()
BNB_model.fit(X_train, y_train)

In [18]:
y_pred_BNB = BNB_model.predict(X_test)
y_pred_BNB

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,
       0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0])

In [19]:
accuracy_score_BNB = accuracy_score(y_test, y_pred_BNB)
accuracy_score_BNB

0.770949720670391

In [20]:
# Use a context manager so the file is closed after writing
with open("naive_bayes_BernoulliNB.sav", "wb") as model_file:
    dump(BNB_model, model_file)
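A saved classifier can only score new reviews if the text is vectorized with the same fitted CountVectorizer, so it may be worth storing the vectorizer as well. The following is a minimal sketch of a save-and-reload round trip under that assumption; the toy data and the file name are hypothetical, and the notebook itself saves only the models.

```python
import os
import tempfile
from pickle import dump, load

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up training set for illustration only
docs = ["love this app", "great and fast", "keeps crashing", "slow and buggy"]
labels = [1, 1, 0, 0]

vec = CountVectorizer()
model = MultinomialNB().fit(vec.fit_transform(docs), labels)

# Hypothetical file name inside a temporary directory
path = os.path.join(tempfile.mkdtemp(), "vectorizer_and_model.sav")
with open(path, "wb") as f:
    dump((vec, model), f)  # store both objects together

with open(path, "rb") as f:
    loaded_vec, loaded_model = load(f)

# The reloaded pair should reproduce the original predictions
new_reviews = ["great app", "buggy and slow"]
original_pred = model.predict(vec.transform(new_reviews))
restored_pred = loaded_model.predict(loaded_vec.transform(new_reviews))

print(list(original_pred) == list(restored_pred))
```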

In [33]:
print("GaussianNB Accuracy Score")
print(accuracy_score_GNB )
print("*"*20)
print("MultinomialNB Accuracy Score")
print(accuracy_score_MNB )
print("*"*20)
print("BernoulliNB Accuracy Score")
print(accuracy_score_BNB )

most_accurate = accuracy_score_GNB
most_accurate_name = 'GaussianNB'
if accuracy_score_MNB > most_accurate:
    most_accurate = accuracy_score_MNB
    most_accurate_name = 'MultinomialNB'
if accuracy_score_BNB > most_accurate:
    most_accurate = accuracy_score_BNB 
    most_accurate_name = 'BernoulliNB'

print("*"*20)
print("*"*20)
print("The Naive Bayes model with the highest accuracy score is", most_accurate_name, "with an accuracy score of", most_accurate)


GaussianNB Accuracy Score
0.8044692737430168
********************
MultinomialNB Accuracy Score
0.8156424581005587
********************
BernoulliNB Accuracy Score
0.770949720670391
********************
********************
The Naive Bayes model with the highest accuracy score is MultinomialNB with an accuracy score of 0.8156424581005587
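The chain of `if` statements above can also be written with a dictionary and `max`. This is only an alternative sketch; the scores are copied from the output above, and the variable names are illustrative.

```python
# Accuracy scores copied from the output above
scores = {
    "GaussianNB": 0.8044692737430168,
    "MultinomialNB": 0.8156424581005587,
    "BernoulliNB": 0.770949720670391,
}

# max() with key=scores.get returns the name whose accuracy is largest
best_name = max(scores, key=scores.get)
best_score = scores[best_name]

print("Best model:", best_name, "with accuracy", round(best_score, 4))
# Best model: MultinomialNB with accuracy 0.8156
```

On a tie, `max` returns the first entry it encounters, which matches the behaviour of the strict `>` comparisons in the cell above.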


<b>Step 6: Explore other alternatives</b>

Which of the other models we have studied could you use to try to improve on the Naive Bayes results? Justify your choice and train the model.

In [25]:
from sklearn.linear_model import LogisticRegression

LR_model = LogisticRegression()
LR_model.fit(X_train, y_train)

In [26]:
y_pred_LR = LR_model.predict(X_test)
y_pred_LR

array([0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0,
       1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0,
       0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0,
       0, 0, 0])

In [27]:
accuracy_score_LR = accuracy_score(y_test, y_pred_LR)
accuracy_score_LR

0.8324022346368715

In [44]:
most_accurate2 = accuracy_score_GNB
most_accurate_name2 = 'GaussianNB'
if accuracy_score_MNB > most_accurate2:
    most_accurate2 = accuracy_score_MNB
    most_accurate_name2 = 'MultinomialNB'
if accuracy_score_BNB > most_accurate2:
    most_accurate2 = accuracy_score_BNB 
    most_accurate_name2 = 'BernoulliNB'
if accuracy_score_LR > most_accurate2:
    most_accurate2 = accuracy_score_LR 
    most_accurate_name2 = 'Logistic Regression'


print("The model with the highest accuracy score is", most_accurate_name2, "with an accuracy score of", most_accurate2)
print("*"*40)
print("The", most_accurate_name2, "model is approximately", round((most_accurate2-most_accurate)*100, 4), "percentage points more accurate than", most_accurate_name)

The model with the highest accuracy score is Logistic Regression with an accuracy score of 0.8324022346368715
****************************************
The Logistic Regression model is approximately 1.676 percentage points more accurate than MultinomialNB
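The 1.676 figure is a difference between two accuracies, so it is measured in percentage points. The relative improvement over the best Naive Bayes model is a different number. The sketch below computes both from the scores reported above; the variable names are illustrative.

```python
# Accuracy scores copied from the outputs above
acc_mnb = 0.8156424581005587  # best Naive Bayes variant (MultinomialNB)
acc_lr = 0.8324022346368715   # Logistic Regression

# Absolute gap between the two accuracies, in percentage points
gap_points = (acc_lr - acc_mnb) * 100
# Relative improvement over the MultinomialNB accuracy, in percent
relative_gain = (acc_lr - acc_mnb) / acc_mnb * 100

print(round(gap_points, 3), "percentage points")  # 1.676
print(round(relative_gain, 2), "% relative")      # 2.05
```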
