# Naive Bayes Project 

## Importing libraries

In [None]:
import os

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB, BernoulliNB, GaussianNB


## Loading the dataset

> We will proceed with loading the dataset, inspecting its shape and printing some of the first rows to check that everything was loaded correctly. 

In [3]:
url = "https://raw.githubusercontent.com/4GeeksAcademy/naive-bayes-project-tutorial/main/playstore_reviews.csv"
df = pd.read_csv(url)

print("Dataset shape:", df.shape)
print(df.head())

Dataset shape: (891, 3)
          package_name                                             review  \
0  com.facebook.katana   privacy at least put some option appear offli...   
1  com.facebook.katana   messenger issues ever since the last update, ...   
2  com.facebook.katana   profile any time my wife or anybody has more ...   
3  com.facebook.katana   the new features suck for those of us who don...   
4  com.facebook.katana   forced reload on uploading pic on replying co...   

   polarity  
0         0  
1         0  
2         0  
3         0  
4         0  


> We can see that the dataset was loaded successfully. It contains 891 rows and 3 columns (`package_name`, `review` and `polarity`), and the first rows show the review text and its polarity label as expected.
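
> The load check can be extended with a quick look at missing values and class balance. The sketch below uses a tiny hypothetical DataFrame (`toy_df`, with made-up rows, not the real reviews) that has the same column names; on the real data the same calls would be run on `df`.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset (same column names, made-up rows)
toy_df = pd.DataFrame({
    "package_name": ["com.example.app"] * 4,
    "review": ["great app", "keeps crashing", "love it", "too many ads"],
    "polarity": [1, 0, 1, 0],
})

# Number of missing values per column
missing_per_column = toy_df.isna().sum()
print(missing_per_column)

# Share of each polarity class (fractions that sum to 1)
class_share = toy_df["polarity"].value_counts(normalize=True)
print(class_share)
```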

## Preprocessing data

> As the project instructions state, we'll drop the ***package_name*** column, since it isn't needed for this analysis.

In [4]:
df = df.drop('package_name', axis=1)

# Checking that everything works:
print("Columns after dropping 'package_name':", df.columns.tolist())

Columns after dropping 'package_name': ['review', 'polarity']


> With the column dropped, we'll standardize the review text by removing leading and trailing spaces and converting everything to lowercase. This keeps the vocabulary consistent and makes the analysis easier.

In [5]:
df['review'] = df['review'].str.strip().str.lower()

# Checking that everything works
print("First 5 reviews after cleaning:")
for i, text in enumerate(df['review'].head()):
    print(f"Review {i+1}: '{text}'")
    assert text == text.strip()   # Should have no leading/trailing spaces
    assert text == text.lower()   # Should be all lowercase
print("\nAll text is cleaned (lowercase, stripped).")

First 5 reviews after cleaning:
Review 1: 'privacy at least put some option appear offline. i mean for some people like me it's a big pressure to be seen online like you need to response on every message or else you be called seenzone only. if only i wanna do on facebook is to read on my newsfeed and just wanna response on message i want to. pls reconsidered my review. i tried to turn off chat but still can see me as online.'
Review 2: 'messenger issues ever since the last update, initial received messages don't get pushed to the messenger app and you don't get notification in the facebook app or messenger app. you open the facebook app and happen to see you have a message. you have to click the icon and it opens messenger. subsequent messages go through messenger app, unless you close the chat head... then you start over with no notification and having to go through the facebook app.'
Review 3: 'profile any time my wife or anybody has more than one post and i view them it would take m

> Next we define the predictor and target variables: the review text is the predictor and the polarity is the target.

In [6]:
X = df['review']
y = df['polarity']

> After that, we'll split the dataset into training (80%) and test (20%) sets, stratified by polarity, as per project instructions.

In [7]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Checking that everything works
print("Train set size:", X_train.shape[0])
print("Test set size:", X_test.shape[0])

Train set size: 712
Test set size: 179
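
> The `stratify=y` argument keeps the class proportions the same in both splits. A minimal sketch on hypothetical labels (70 zeros and 30 ones, not the real data):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical labels: 70% class 0, 30% class 1
toy_X = list(range(100))
toy_y = [0] * 70 + [1] * 30

_, _, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# Both splits keep the 70/30 ratio of the full label list
train_counts = Counter(toy_y_train)
test_counts = Counter(toy_y_test)
print("Train:", dict(train_counts))
print("Test:", dict(test_counts))
```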


> Next, using Bag-of-Words, we transform the text into a count matrix. The vectorizer is fitted on the training set only and then applied to both the training and test sets.

In [8]:
vec_model = CountVectorizer(stop_words="english")
X_train_vec = vec_model.fit_transform(X_train).toarray()
X_test_vec = vec_model.transform(X_test).toarray()

# Checking that everything works
print("Train vectorized shape:", X_train_vec.shape)
print("Test vectorized shape:", X_test_vec.shape)


Train vectorized shape: (712, 3272)
Test vectorized shape: (179, 3272)
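
> To make the fit/transform distinction concrete, here is a minimal sketch on three hypothetical sentences (not taken from the dataset). The vocabulary is learned from the training texts only, so words that appear only in the test text are ignored. `CountVectorizer` also lowercases text by default.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus
toy_train = ["The app is great", "The app crashes"]
toy_test = ["The new update crashes"]

toy_vec = CountVectorizer(stop_words="english")
toy_train_counts = toy_vec.fit_transform(toy_train).toarray()  # learns the vocabulary
toy_test_counts = toy_vec.transform(toy_test).toarray()        # reuses that vocabulary

# Vocabulary terms listed in column order
toy_vocab = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)
print(toy_vocab)
print(toy_train_counts)
print(toy_test_counts)
```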


## Training and evaluating Naive Bayes models

> For this project, I'll start with MultinomialNB, since it is designed for count features such as the word counts we are using here.

In [11]:
# MultinomialNB
nb_m = MultinomialNB()
nb_m.fit(X_train_vec, y_train)
y_pred_m = nb_m.predict(X_test_vec)
print("MultinomialNB accuracy:", accuracy_score(y_test, y_pred_m))
print('\n')
print(classification_report(y_test, y_pred_m))

MultinomialNB accuracy: 0.8547486033519553


              precision    recall  f1-score   support

           0       0.84      0.96      0.90       117
           1       0.89      0.66      0.76        62

    accuracy                           0.85       179
   macro avg       0.87      0.81      0.83       179
weighted avg       0.86      0.85      0.85       179



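> After running it, we can see that MultinomialNB gives us an accuracy of ***0.8547486033519553*** (about 85.5%). Recall for class 1 is only 0.66, though, so the model misses roughly a third of those reviews.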
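
> The MultinomialNB report can also be read in terms of raw counts. The counts below are not printed by the notebook: they are reconstructed from the rounded recall values and the accuracy (153 correct out of 179), so treat them as an inference. The sketch recomputes precision, recall and F1 for class 1 from them.

```python
# Counts inferred from the MultinomialNB report (an assumption, not notebook output)
tn, fp = 112, 5    # true class 0: predicted 0 / predicted 1
fn, tp = 21, 41    # true class 1: predicted 0 / predicted 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision_1 = tp / (tp + fp)   # of reviews predicted as 1, how many really are 1
recall_1 = tp / (tp + fn)      # of reviews that are 1, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy={accuracy:.4f} precision={precision_1:.2f} "
      f"recall={recall_1:.2f} f1={f1_1:.2f}")
```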

> This looks pretty good, but we'll test the other two (BernoulliNB & GaussianNB) to see if we get better results.

In [None]:
# BernoulliNB
nb_b = BernoulliNB()
nb_b.fit(X_train_vec, y_train)
y_pred_b = nb_b.predict(X_test_vec)
print("BernoulliNB accuracy:", accuracy_score(y_test, y_pred_b))
print('\n')
print(classification_report(y_test, y_pred_b))

BernoulliNB accuracy: 0.7821229050279329


              precision    recall  f1-score   support

           0       0.76      0.97      0.85       117
           1       0.87      0.44      0.58        62

    accuracy                           0.78       179
   macro avg       0.82      0.70      0.72       179
weighted avg       0.80      0.78      0.76       179



In [14]:
# GaussianNB
nb_g = GaussianNB()
nb_g.fit(X_train_vec, y_train)
y_pred_g = nb_g.predict(X_test_vec)
print("GaussianNB accuracy:", accuracy_score(y_test, y_pred_g))
print('\n')
print(classification_report(y_test, y_pred_g))


GaussianNB accuracy: 0.8156424581005587


              precision    recall  f1-score   support

           0       0.84      0.89      0.86       117
           1       0.76      0.68      0.72        62

    accuracy                           0.82       179
   macro avg       0.80      0.78      0.79       179
weighted avg       0.81      0.82      0.81       179



> As expected, neither of the other models is more accurate than the first one (BernoulliNB: 0.782, GaussianNB: 0.816), so MultinomialNB is the Naive Bayes variant that best suits this case study.
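
> One way to see why the variants differ: `BernoulliNB` only looks at whether a word is present, while `MultinomialNB` uses how often it occurs. With its default `binarize=0.0`, `BernoulliNB` turns every count above zero into 1 before fitting. A minimal sketch on a hypothetical count matrix:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical word counts: 4 documents x 3 words
toy_counts = np.array([[3, 0, 1],
                       [2, 0, 0],
                       [0, 4, 1],
                       [0, 1, 0]])
toy_labels = np.array([1, 1, 0, 0])

# Same matrix reduced to presence (1) / absence (0)
toy_presence = (toy_counts > 0).astype(int)

bnb_counts = BernoulliNB().fit(toy_counts, toy_labels)
bnb_presence = BernoulliNB().fit(toy_presence, toy_labels)

# Both models give the same probabilities: the counts themselves are discarded
same = np.allclose(bnb_counts.predict_proba(toy_counts),
                   bnb_presence.predict_proba(toy_presence))
print("Identical probabilities:", same)
```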

## Random Forest as an alternative

> As requested, we'll try a Random Forest model to see if we get better results than MultinomialNB, which has an accuracy of 0.8547486033519553.

In [15]:
rf = RandomForestClassifier(random_state=42, n_jobs=-1)
rf.fit(X_train_vec, y_train)
y_pred_rf = rf.predict(X_test_vec)
print("Random Forest accuracy:", accuracy_score(y_test, y_pred_rf))
print('\n')
print(classification_report(y_test, y_pred_rf))

Random Forest accuracy: 0.8212290502793296


              precision    recall  f1-score   support

           0       0.83      0.91      0.87       117
           1       0.79      0.66      0.72        62

    accuracy                           0.82       179
   macro avg       0.81      0.78      0.79       179
weighted avg       0.82      0.82      0.82       179



> We can see that the Random Forest accuracy is ***0.8212290502793296***, which makes it the second-best model so far. It has ***NOT*** improved on the previous best, so we will stick with MultinomialNB.

## Saving the models

> One of the last steps is saving the models. Pretty straightforward.

In [None]:
os.makedirs('models', exist_ok=True)
joblib.dump(nb_m, 'models/best_naive_bayes.pkl')
joblib.dump(rf, 'models/random_forest_vec.pkl')
print("Models saved to 'models/' directory.")
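
> Note that the saved classifiers expect already-vectorized input, so reusing them later also requires the fitted `CountVectorizer`. One option (an assumption, not part of the original project) is to bundle both steps in a scikit-learn `Pipeline` and save that instead. A minimal sketch with hypothetical training texts and a temporary directory:

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training data, only to make the sketch runnable
toy_texts = ["great app love it", "love this app", "keeps crashing", "crashing and slow"]
toy_labels = [1, 1, 0, 0]

# Vectorizer and classifier travel together in one object
toy_pipeline = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
toy_pipeline.fit(toy_texts, toy_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_pipeline.pkl")  # hypothetical file name
    joblib.dump(toy_pipeline, toy_path)
    reloaded = joblib.load(toy_path)

# The reloaded pipeline accepts raw text directly
new_texts = ["love it", "slow and crashing"]
print(reloaded.predict(new_texts))
```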

## Trying Logistic Regression

> As the last step of this project, we are asked to try one more model: ***Logistic Regression***.

> So let's do it and see what the results are.

In [16]:
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train_vec, y_train)
y_pred_lr = lr.predict(X_test_vec)
print("Logistic Regression accuracy:", accuracy_score(y_test, y_pred_lr))
print(classification_report(y_test, y_pred_lr))

Logistic Regression accuracy: 0.8324022346368715
              precision    recall  f1-score   support

           0       0.86      0.89      0.87       117
           1       0.78      0.73      0.75        62

    accuracy                           0.83       179
   macro avg       0.82      0.81      0.81       179
weighted avg       0.83      0.83      0.83       179



> FINALLY! Our results came in! We can see that the accuracy of this model is ***0.8324022346368715***, which makes it the SECOND best performing model, pushing Random Forest down to third place. MultinomialNB remains the best.
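
> To put the five accuracies side by side, the sketch below ranks them and converts each one back into the number of correctly classified test reviews (out of 179). The values are copied from the cell outputs in this notebook; the dictionary name is just for illustration.

```python
# Test accuracies reported by the cells in this notebook
accuracies = {
    "MultinomialNB": 0.8547486033519553,
    "BernoulliNB": 0.7821229050279329,
    "GaussianNB": 0.8156424581005587,
    "RandomForest": 0.8212290502793296,
    "LogisticRegression": 0.8324022346368715,
}
test_size = 179  # number of reviews in the test set

# Sort from best to worst and show how many reviews each model got right
ranking = sorted(accuracies.items(), key=lambda item: item[1], reverse=True)
for name, acc in ranking:
    correct = round(acc * test_size)
    print(f"{name:<20} {acc:.4f}  ({correct}/{test_size} correct)")
```

> The gap between second and third place is only two reviews (149 vs. 147), so that ordering could easily change with a different split.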

> With this we conclude this project.