# **Google Play Store Reviews**

**Problem Statement:**

Sentiment analysis is a key task in Natural Language Processing, where the goal is to determine whether a piece of text expresses a positive or negative opinion. Naive Bayes models are well suited to this task because they are simple and efficient, and their assumption that words occur independently given the class works well in practice with word count features.

In this project, we aim to build a review classifier for Google Play Store applications. Using user comments as input, the model will predict whether the sentiment is positive (1) or negative (0). This will help demonstrate how Naive Bayes can be applied to real-world data for text classification.

###  **Importing Libraries**

In [14]:
import pandas as pd
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report



###  **Problem statement and data collection**

In [15]:
url = "https://raw.githubusercontent.com/4GeeksAcademy/naive-bayes-project-tutorial/main/playstore_reviews.csv"
total_data = pd.read_csv(url)

total_data.head(3)

|   | package_name | review | polarity |
|---|---|---|---|
| 0 | com.facebook.katana | privacy at least put some option appear offli... | 0 |
| 1 | com.facebook.katana | messenger issues ever since the last update, ... | 0 |
| 2 | com.facebook.katana | profile any time my wife or anybody has more ... | 0 |


### **Exploration and data cleaning**

**Understanding the features**

- **package_name** → name of the app.

- **review** → comment text.

- **polarity** → label: 0 (negative), 1 (positive).

In [16]:
total_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   package_name  891 non-null    object
 1   review        891 non-null    object
 2   polarity      891 non-null    int64 
dtypes: int64(1), object(2)
memory usage: 21.0+ KB


Dataset Overview:

- Rows: 891 reviews.

- Columns: 3 — package_name, review, polarity.

- Nulls: We don’t need to impute any data, as all columns are complete.

Notes: The review column needs preprocessing (lowercasing, stripping surrounding whitespace, and removing punctuation and stopwords) before it is converted into numeric features for Naive Bayes.
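As a minimal sketch of what these cleaning steps do, the snippet below applies the same pandas string operations to three hypothetical reviews (the sample strings are made up for illustration and are not taken from the dataset).

```python
import pandas as pd

# Hypothetical sample reviews, not taken from the dataset
sample = pd.Series(["  Great app!!! ", "Crashes... every time?", "great app"])

cleaned = (
    sample.str.strip()                              # drop leading/trailing whitespace
          .str.lower()                              # normalize case
          .str.replace(r"[^\w\s]", "", regex=True)  # remove punctuation
)

print(cleaned.tolist())
# Reviews that differed only in case or punctuation are now identical
print(cleaned.drop_duplicates().tolist())
```

Note that the first and third samples only become duplicates after cleaning. If near-duplicate reviews matter, dropping duplicates after the cleaning step rather than before it may be worth considering.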

**Pre-processing information**

In [17]:
# Remove duplicate reviews, then strip surrounding whitespace, lowercase, and remove punctuation
total_data = total_data.drop_duplicates(subset="review")
total_data["review"] = total_data["review"].str.strip().str.lower().str.replace(r'[^\w\s]', '', regex=True)

# Split into train and test sets
X = total_data["review"]
y = total_data["polarity"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Transform the text into a word count matrix; the vocabulary is learned from the training set only
vec_model = CountVectorizer(stop_words="english")
X_train = vec_model.fit_transform(X_train).toarray()  # dense array, since GaussianNB does not accept sparse input
X_test = vec_model.transform(X_test).toarray()

### **Naive Bayes Model**

In [18]:
# MultinomialNB Model
mnb = MultinomialNB()
mnb.fit(X_train, y_train)

y_pred_mnb = mnb.predict(X_test)
print("MultinomialNB model accuracy:", accuracy_score(y_test, y_pred_mnb))


MultinomialNB model accuracy: 0.7988826815642458


In [19]:
# Test the other two Naive Bayes implementations

# GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred_gnb = gnb.predict(X_test)
print("GaussianNB Accuracy:", accuracy_score(y_test, y_pred_gnb))

# BernoulliNB
bnb = BernoulliNB()
bnb.fit(X_train, y_train)
y_pred_bnb = bnb.predict(X_test)
print("BernoulliNB Accuracy:", accuracy_score(y_test, y_pred_bnb))

GaussianNB Accuracy: 0.7988826815642458
BernoulliNB Accuracy: 0.770949720670391


Model Selection:

I chose the MultinomialNB model because it is designed for count features such as word counts, which makes it a natural fit for text classification. To check this choice, I also trained and tested the other Naive Bayes implementations (GaussianNB and BernoulliNB) on the same split. GaussianNB matched MultinomialNB exactly (0.799 accuracy), while BernoulliNB scored lower (0.771). Since accuracy alone does not separate the two best models, I kept MultinomialNB because it models word counts directly, whereas GaussianNB assumes continuous, normally distributed features.
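A single train/test split can make models look closer or further apart than they really are. One way to get a steadier comparison is cross-validation; the sketch below is an illustration under stated assumptions, using a hypothetical toy dataset and a scikit-learn pipeline, not the notebook's actual data. GaussianNB is left out here because it needs dense input, which would require an extra conversion step inside the pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy reviews (1 = positive, 0 = negative), for illustration only
texts = [
    "love this app", "great and very useful", "works perfectly every day",
    "best app ever", "really helpful and fast", "excellent design love it",
    "terrible keeps crashing", "awful waste of time", "worst update ever",
    "does not work at all", "too many ads very slow", "bad app crashes constantly",
]
labels = [1] * 6 + [0] * 6

results = {}
for name, clf in {"MultinomialNB": MultinomialNB(), "BernoulliNB": BernoulliNB()}.items():
    # The vectorizer sits inside the pipeline, so it is refitted on each training fold
    pipe = make_pipeline(CountVectorizer(), clf)
    results[name] = cross_val_score(pipe, texts, labels, cv=3, scoring="accuracy")
    print(name, "mean accuracy:", results[name].mean())
```

With the real data, the same loop could be run on the cleaned `review` and `polarity` columns with a larger number of folds.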

**Random Forest Model**

In [20]:
# Test Random Forest model for optimization
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

y_pred_rf = rf.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))

Random Forest Accuracy: 0.8044692737430168


Model Optimization:

After training the MultinomialNB model, I tested a Random Forest classifier to see whether it could improve performance. The Random Forest reached an accuracy of 0.804, marginally above the MultinomialNB accuracy of 0.799; on 179 test reviews, that is a difference of a single review. Because the gain is negligible and MultinomialNB is simpler and faster to train, I kept MultinomialNB as the model for this text classification task.

**Save the model**



In [21]:
# Save the trained model
with open("../models/naive-bayes-model.sav", "wb") as f:
    pickle.dump(mnb, f)
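The saved file contains only the classifier, so predicting on new raw text later also requires the fitted `CountVectorizer`. One possible approach, sketched below on hypothetical toy data, is to bundle both steps in a scikit-learn pipeline and pickle that single object. The round trip is done in memory here, so no file path is assumed.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy reviews (1 = positive, 0 = negative), for illustration only
texts = ["love this app", "great and useful", "terrible crashes constantly", "awful waste of time"]
labels = [1, 1, 0, 0]

# Vectorizer and classifier travel together as one object
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
pipe.fit(texts, labels)

blob = pickle.dumps(pipe)      # serialize to bytes (pickle.dump would write to a file)
restored = pickle.loads(blob)  # load it back

new_reviews = ["love it, great app", "terrible, crashes"]
print(restored.predict(new_reviews))  # raw text in, labels out
```

As with any pickle file, a saved model should only be loaded from a trusted source and with a compatible scikit-learn version.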

### **Explore other alternatives**

**Logistic Regression Model**

In [22]:
# Logistic Regression
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)

y_pred_lr = lr.predict(X_test)
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_lr))

Logistic Regression Accuracy: 0.8212290502793296


Exploring Alternatives and Model Selection:

I decided to test Logistic Regression as an alternative to Naive Bayes to see if it could improve performance. Logistic Regression is well-suited for binary classification tasks and text data, as it can handle high-dimensional word count features effectively.

After training and evaluating the model, Logistic Regression achieved an accuracy of 0.821, which is higher than the MultinomialNB accuracy of 0.799 (147 versus 143 correct predictions out of 179 test reviews).

Logistic Regression was the most accurate model on this split, but the margin is small for a test set of this size. I kept MultinomialNB as the final model for its simplicity and efficiency, and because the goal of the project is to demonstrate Naive Bayes on text. The results show that Naive Bayes is competitive for text classification with word count features.

### **Model Evaluation and Prediction Results**

In [23]:
# Predict with the trained model (MultinomialNB)
y_pred_test = mnb.predict(X_test)

# Accuracy and classification report
test_accuracy = accuracy_score(y_test, y_pred_test)
classification_rep = classification_report(y_test, y_pred_test)

# Confusion matrix
cm = confusion_matrix(y_test, y_pred_test)
tn, fp, fn, tp = cm.ravel()

print("\nModel Prediction Results:")
print(f"True Negatives (TN): {tn} reviews correctly predicted as NEGATIVE")
print(f"True Positives (TP): {tp} reviews correctly predicted as POSITIVE")
print(f"False Positives (FP): {fp} reviews incorrectly predicted as POSITIVE when they are NEGATIVE")
print(f"False Negatives (FN): {fn} reviews incorrectly predicted as NEGATIVE when they are POSITIVE\n")
print(f"Accuracy: {test_accuracy*100:.2f}%")
print("Classification Report:\n", classification_rep)



Model Prediction Results:
True Negatives (TN): 112 reviews correctly predicted as NEGATIVE
True Positives (TP): 31 reviews correctly predicted as POSITIVE
False Positives (FP): 14 reviews incorrectly predicted as POSITIVE when they are NEGATIVE
False Negatives (FN): 22 reviews incorrectly predicted as NEGATIVE when they are POSITIVE

Accuracy: 79.89%
Classification Report:
               precision    recall  f1-score   support

           0       0.84      0.89      0.86       126
           1       0.69      0.58      0.63        53

    accuracy                           0.80       179
   macro avg       0.76      0.74      0.75       179
weighted avg       0.79      0.80      0.79       179



The trained MultinomialNB model achieved an overall accuracy of 79.89% on the test set. It performed better at identifying negative reviews (precision 0.84, recall 0.89) than positive reviews (precision 0.69, recall 0.58), as reflected in the confusion matrix and classification report. These results show that the model is reasonably effective for sentiment classification of app reviews, though there is room to improve on the minority class: positive reviews make up 53 of the 179 test samples, and 22 of them were misclassified as negative.
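The figures in the classification report can be recomputed by hand from the four confusion matrix counts. The short sketch below does that arithmetic in plain Python; the helper name `prf` is an illustrative choice, and the counts are copied from the results printed above.

```python
# Counts copied from the confusion matrix reported above
tn, fp, fn, tp = 112, 14, 22, 31
total = tn + fp + fn + tp

def prf(true_pos, false_pos, false_neg):
    """Return precision, recall, and F1 for one class."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

accuracy = (tp + tn) / total
pos = prf(tp, fp, fn)  # class 1 (positive reviews)
neg = prf(tn, fn, fp)  # class 0: the roles of the error counts swap

print(f"accuracy: {accuracy:.4f}")
print("class 1 precision/recall/f1:", [round(v, 2) for v in pos])
print("class 0 precision/recall/f1:", [round(v, 2) for v in neg])
```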