# Explore here

In [1]:

pip install scikit-learn

Defaulting to user installation because normal site-packages is not writeable

[notice] A new release of pip is available: 23.1.2 -> 25.1.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier # For optimization/alternatives
from sklearn.linear_model import LogisticRegression # For exploring other alternatives
import joblib # For saving the model

In [4]:
# --- Step 1: Loading the dataset ---
# The dataset is available directly from the provided URL
print("--- Step 1: Loading the dataset ---")
try:
    df = pd.read_csv('https://raw.githubusercontent.com/4GeeksAcademy/naive-bayes-project-tutorial/main/playstore_reviews.csv')
    print("Dataset loaded successfully.")
    print("DataFrame head:")
    print(df.head())
    print("\nDataFrame info:")
    df.info()
    print("\nPolarity value counts (original):")
    print(df['polarity'].value_counts())
except Exception as e:
    print(f"Error loading dataset: {e}")
    print("Please ensure the URL is correct or the file is in the project folder.")

--- Step 1: Loading the dataset ---
Dataset loaded successfully.
DataFrame head:
          package_name                                             review  \
0  com.facebook.katana   privacy at least put some option appear offli...   
1  com.facebook.katana   messenger issues ever since the last update, ...   
2  com.facebook.katana   profile any time my wife or anybody has more ...   
3  com.facebook.katana   the new features suck for those of us who don...   
4  com.facebook.katana   forced reload on uploading pic on replying co...   

   polarity  
0         0  
1         0  
2         0  
3         0  
4         0  

DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   package_name  891 non-null    object
 1   review        891 non-null    object
 2   polarity      891 non-null    int64 
dtypes: int64(1), object(2)
memory usage:

In [5]:
# --- Step 2: Study of variables and their content ---
print("\n--- Step 2: Study of variables and their content ---")

# Remove the 'package_name' variable as it's not relevant for sentiment analysis
if 'package_name' in df.columns:
    df = df.drop('package_name', axis=1)
    print("Removed 'package_name' column.")
else:
    print("'package_name' column not found, skipping removal.")

# Process the 'review' text: remove leading/trailing spaces and convert to lowercase
# This ensures consistency and reduces the vocabulary size for the vectorizer
df["review"] = df["review"].str.strip().str.lower()
print("Processed 'review' column (stripped and lowercased).")
print("DataFrame head after text processing:")
print(df.head())

# Separate features (X) and target (y)
X = df['review']
y = df['polarity']
print(f"\nShape of X: {X.shape}")
print(f"Shape of y: {y.shape}")

# Divide the dataset into train and test sets (e.g., 80/20 split)
# stratify=y is important here to maintain the proportion of positive/negative reviews
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")

# Transform the text into a word count matrix using CountVectorizer
# stop_words='english' removes common English words (like 'the', 'is', 'a') that don't add much meaning
vec_model = CountVectorizer(stop_words='english')

# Fit the vectorizer on the training data and transform it
X_train_vectorized = vec_model.fit_transform(X_train).toarray()
# Use the *same* fitted vectorizer to transform the test data
X_test_vectorized = vec_model.transform(X_test).toarray()

print(f"\nShape of X_train_vectorized: {X_train_vectorized.shape}")
print(f"Shape of X_test_vectorized: {X_test_vectorized.shape}")
print("Text data transformed into word count matrices.")


--- Step 2: Study of variables and their content ---
Removed 'package_name' column.
Processed 'review' column (stripped and lowercased).
DataFrame head after text processing:
                                              review  polarity
0  privacy at least put some option appear offlin...         0
1  messenger issues ever since the last update, i...         0
2  profile any time my wife or anybody has more t...         0
3  the new features suck for those of us who don'...         0
4  forced reload on uploading pic on replying com...         0

Shape of X: (891,)
Shape of y: (891,)
Shape of X_train: (712,)
Shape of X_test: (179,)
Shape of y_train: (712,)
Shape of y_test: (179,)

Shape of X_train_vectorized: (712, 3272)
Shape of X_test_vectorized: (179, 3272)
Text data transformed into word count matrices.
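
The cell above converts both word count matrices to dense arrays with `.toarray()`. The following is a minimal sketch, assuming a hypothetical four-review toy corpus instead of the Play Store data, of how `MultinomialNB` can also be trained on the sparse matrix that `CountVectorizer` returns. `GaussianNB` in Step 3 needs dense input, which is presumably why the conversion is there.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus (not the Play Store reviews); 1 = positive, 0 = negative
toy_reviews = [
    "great app love it",
    "love this great game",
    "terrible app crashes",
    "crashes constantly terrible update",
]
toy_labels = [1, 1, 0, 0]

vec = CountVectorizer(stop_words="english")
X_sparse = vec.fit_transform(toy_reviews)  # SciPy sparse matrix of word counts
X_dense = X_sparse.toarray()               # same counts as a dense NumPy array

# MultinomialNB accepts either representation
sparse_preds = MultinomialNB().fit(X_sparse, toy_labels).predict(X_sparse)
dense_preds = MultinomialNB().fit(X_dense, toy_labels).predict(X_dense)

print("Sparse input:", sparse.issparse(X_sparse), "shape:", X_sparse.shape)
print("Same predictions:", (sparse_preds == dense_preds).all())
```

On a vocabulary of a few thousand words the dense form is still manageable, but the sparse form avoids storing the zeros that make up most of the matrix.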


In [6]:
# --- Step 3: Build a Naive Bayes model ---
print("\n--- Step 3: Build a Naive Bayes model ---")
# Choosing the right Naive Bayes implementation:
# - GaussianNB: Assumes features follow a Gaussian (normal) distribution.
#               Less suitable for discrete word counts.
# - MultinomialNB: Suitable for discrete counts (e.g., word counts in text).
#                  Assumes features are counts from a multinomial distribution.
# - BernoulliNB: Suitable for binary features (e.g., presence/absence of a word).
#                Assumes features are binary from a Bernoulli distribution.

# For text classification with word counts, MultinomialNB is generally the most appropriate choice.

# --- Multinomial Naive Bayes ---
print("\nTraining Multinomial Naive Bayes...")
mnb = MultinomialNB()
mnb.fit(X_train_vectorized, y_train)
y_pred_mnb = mnb.predict(X_test_vectorized)

print("MultinomialNB Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_mnb):.4f}")
print("Classification Report:\n", classification_report(y_test, y_pred_mnb))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_mnb))

# --- Gaussian Naive Bayes (for comparison) ---
# Note: GaussianNB expects continuous data, so it might not perform as well on count data.
print("\nTraining Gaussian Naive Bayes (for comparison)...")
gnb = GaussianNB()
gnb.fit(X_train_vectorized, y_train)
y_pred_gnb = gnb.predict(X_test_vectorized)

print("GaussianNB Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_gnb):.4f}")
print("Classification Report:\n", classification_report(y_test, y_pred_gnb))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_gnb))

# --- Bernoulli Naive Bayes (for comparison) ---
# Note: BernoulliNB expects binary features (0 or 1, presence/absence).
# CountVectorizer outputs counts; to produce explicit binary features, use CountVectorizer(binary=True).
# Applied directly to counts, BernoulliNB binarizes them itself (default binarize=0.0), so any non-zero count becomes 1.
print("\nTraining Bernoulli Naive Bayes (for comparison)...")
bnb = BernoulliNB()
bnb.fit(X_train_vectorized, y_train)
y_pred_bnb = bnb.predict(X_test_vectorized)

print("BernoulliNB Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_bnb):.4f}")
print("Classification Report:\n", classification_report(y_test, y_pred_bnb))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_bnb))

# Based on the results, MultinomialNB is expected to be the best choice for this text data.
best_nb_model = mnb # Assign the best performing NB model for further steps


--- Step 3: Build a Naive Bayes model ---

Training Multinomial Naive Bayes...
MultinomialNB Performance:
Accuracy: 0.8547
Classification Report:
               precision    recall  f1-score   support

           0       0.84      0.96      0.90       117
           1       0.89      0.66      0.76        62

    accuracy                           0.85       179
   macro avg       0.87      0.81      0.83       179
weighted avg       0.86      0.85      0.85       179

Confusion Matrix:
 [[112   5]
 [ 21  41]]

Training Gaussian Naive Bayes (for comparison)...
GaussianNB Performance:
Accuracy: 0.8156
Classification Report:
               precision    recall  f1-score   support

           0       0.84      0.89      0.86       117
           1       0.76      0.68      0.72        62

    accuracy                           0.82       179
   macro avg       0.80      0.78      0.79       179
weighted avg       0.81      0.82      0.81       179

Confusion Matrix:
 [[104  13]
 [ 20  42]]

In [7]:
# --- Step 4: Optimize the previous model (and explore Random Forest) ---
print("\n--- Step 4: Optimize the previous model (and explore Random Forest) ---")
# Naive Bayes models have few hyperparameters to optimize (e.g., alpha for smoothing).
# For MultinomialNB, 'alpha' can be tuned. A GridSearch might be used for more complex tuning.
# For simplicity in this tutorial, we'll focus on comparing with Random Forest as an "optimization" alternative.

print("\nExploring Random Forest Classifier as an alternative/optimization...")
# Random Forest is an ensemble method that can often outperform Naive Bayes
# for more complex relationships, though it might be slower on high-dimensional sparse text data.
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1) # n_jobs=-1 uses all available cores
rf_classifier.fit(X_train_vectorized, y_train)
y_pred_rf = rf_classifier.predict(X_test_vectorized)

print("Random Forest Classifier Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_rf):.4f}")
print("Classification Report:\n", classification_report(y_test, y_pred_rf))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_rf))

# Decide which model is "best" for saving based on your evaluation (e.g., highest F1-score for positive class, or overall accuracy)
# For this project, let's assume MultinomialNB is our chosen "best" Naive Bayes model as per the prompt's focus.
# If Random Forest performs significantly better, you might consider it as your final model.
final_model_to_save = best_nb_model # Sticking with the best Naive Bayes model for "saving the model" step.


--- Step 4: Optimize the previous model (and explore Random Forest) ---

Exploring Random Forest Classifier as an alternative/optimization...
Random Forest Classifier Performance:
Accuracy: 0.8212
Classification Report:
               precision    recall  f1-score   support

           0       0.83      0.91      0.87       117
           1       0.79      0.66      0.72        62

    accuracy                           0.82       179
   macro avg       0.81      0.78      0.79       179
weighted avg       0.82      0.82      0.82       179

Confusion Matrix:
 [[106  11]
 [ 21  41]]
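
The Step 4 comments mention that `alpha` can be tuned with a grid search but do not show it. A minimal sketch of what that could look like, assuming a hypothetical twelve-review toy corpus and an illustrative grid of `alpha` values (neither comes from the notebook); with the real data, `X_train` and `y_train` would take the place of the toy lists.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; 1 = positive, 0 = negative
toy_reviews = [
    "love this app works great", "great game very fun",
    "excellent update love it", "fun and easy to use",
    "works great after update", "easy excellent app",
    "app crashes all the time", "terrible update full of bugs",
    "crashes and freezes constantly", "bugs everywhere terrible app",
    "freezes after every update", "slow buggy and annoying",
]
toy_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Putting the vectorizer inside the pipeline refits it on each training fold,
# so the validation fold never influences the vocabulary.
pipe = Pipeline([
    ("vec", CountVectorizer(stop_words="english")),
    ("nb", MultinomialNB()),
])

# Illustrative grid of smoothing values (an assumption, not a recommendation)
param_grid = {"nb__alpha": [0.01, 0.1, 0.5, 1.0, 2.0]}
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

search = GridSearchCV(pipe, param_grid, scoring="f1_macro", cv=cv)
search.fit(toy_reviews, toy_labels)

print("Best alpha:", search.best_params_["nb__alpha"])
print(f"Best cross-validated macro F1: {search.best_score_:.4f}")
```

Scoring on macro F1 rather than accuracy is a design choice here: it weights both classes equally, which matters when one class is harder to recall than the other.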


In [8]:
# --- Step 5: Save the model ---
print("\n--- Step 5: Save the model ---")
# It's good practice to save both the vectorizer and the trained model
# so you can use them later for new predictions without retraining.
# The 'models/' directory is the appropriate place.

model_filename = 'models/multinomial_nb_sentiment_model.joblib'
vectorizer_filename = 'models/count_vectorizer.joblib'

try:
    # Ensure the 'models' directory exists
    import os
    os.makedirs('models', exist_ok=True)

    joblib.dump(final_model_to_save, model_filename)
    joblib.dump(vec_model, vectorizer_filename) # Save the fitted vectorizer too
    print(f"Model saved to {model_filename}")
    print(f"Vectorizer saved to {vectorizer_filename}")
except Exception as e:
    print(f"Error saving model: {e}")
    print("Please ensure the 'models/' directory exists or check file permissions.")

# Example of loading the model back (for demonstration)
# loaded_model = joblib.load(model_filename)
# loaded_vectorizer = joblib.load(vectorizer_filename)
# print(f"\nModel and vectorizer loaded successfully for verification.")


--- Step 5: Save the model ---
Model saved to models/multinomial_nb_sentiment_model.joblib
Vectorizer saved to models/count_vectorizer.joblib
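
The commented-out lines at the end of the cell hint at loading the artifacts back. The sketch below shows the full round trip on a hypothetical toy model saved to a temporary directory (the file names and reviews are illustrative): the loaded vectorizer must be used with `transform`, not refitted, and new text should get the same strip/lowercase preprocessing as the training data.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data; 1 = positive, 0 = negative
toy_reviews = [
    "great app love it",
    "love this great game",
    "terrible app crashes",
    "crashes constantly terrible update",
]
toy_labels = [1, 1, 0, 0]

vec = CountVectorizer(stop_words="english")
clf = MultinomialNB().fit(vec.fit_transform(toy_reviews), toy_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.joblib")
    vec_path = os.path.join(tmp_dir, "toy_vectorizer.joblib")
    joblib.dump(clf, model_path)
    joblib.dump(vec, vec_path)

    # Load both artifacts back, as a separate prediction script would
    loaded_model = joblib.load(model_path)
    loaded_vec = joblib.load(vec_path)

new_reviews = ["  Love this GREAT app ", "Terrible update, crashes constantly"]
# Same preprocessing as in Step 2: strip whitespace and lowercase
cleaned = [review.strip().lower() for review in new_reviews]

# transform() reuses the saved vocabulary; fit_transform() would rebuild it
predictions = loaded_model.predict(loaded_vec.transform(cleaned))
print(predictions)
```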


In [9]:
# --- Step 6: Explore other alternatives ---
print("\n--- Step 6: Explore other alternatives ---")
print("Which other models could you use to try to overcome the results of a Naive Bayes? Argue this and train the model.")

# Argument:
print("\nArgument for other models:")
print("While Naive Bayes models are efficient and work well for text classification, especially with large datasets and sparse features, they assume feature independence (which is rarely true for words in a sentence).")
print("Other models that can capture more complex relationships and dependencies between features (words) often perform better:")
print("1. Logistic Regression: A strong baseline for text classification. It's a linear model but can handle high-dimensional sparse data efficiently and provides probabilistic outputs.")
print("2. Support Vector Machines (SVMs): Particularly with a linear kernel, SVMs are highly effective for text classification. They find an optimal hyperplane that maximizes the margin between classes.")
print("3. Deep Learning Models (e.g., LSTMs, Transformers): For more advanced sentiment analysis, neural networks can learn intricate patterns and context from text, often outperforming traditional ML models, especially with very large datasets. However, they require more computational resources and data.")

# Training an alternative model: Logistic Regression
print("\nTraining Logistic Regression as an alternative...")
lr_classifier = LogisticRegression(max_iter=1000, random_state=42, n_jobs=-1) # Increased max_iter for convergence
lr_classifier.fit(X_train_vectorized, y_train)
y_pred_lr = lr_classifier.predict(X_test_vectorized)

print("Logistic Regression Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_lr):.4f}")
print("Classification Report:\n", classification_report(y_test, y_pred_lr))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_lr))

# Compare results: look at the accuracy, precision, recall, and F1-scores
# of Logistic Regression and Random Forest against the best Naive Bayes model.
# Logistic Regression or SVMs often improve on Naive Bayes for text classification, but not always:
# on this split, Logistic Regression (0.8324) and Random Forest (0.8212) both trail MultinomialNB (0.8547).


--- Step 6: Explore other alternatives ---
Which other models could you use to try to overcome the results of a Naive Bayes? Argue this and train the model.

Argument for other models:
While Naive Bayes models are efficient and work well for text classification, especially with large datasets and sparse features, they assume feature independence (which is rarely true for words in a sentence).
Other models that can capture more complex relationships and dependencies between features (words) often perform better:
1. Logistic Regression: A strong baseline for text classification. It's a linear model but can handle high-dimensional sparse data efficiently and provides probabilistic outputs.
2. Support Vector Machines (SVMs): Particularly with a linear kernel, SVMs are highly effective for text classification. They find an optimal hyperplane that maximizes the margin between classes.
3. Deep Learning Models (e.g., LSTMs, Transformers): For more advanced sentiment analysis, neural networks 

Logistic Regression Performance:
Accuracy: 0.8324
Classification Report:
               precision    recall  f1-score   support

           0       0.86      0.89      0.87       117
           1       0.78      0.73      0.75        62

    accuracy                           0.83       179
   macro avg       0.82      0.81      0.81       179
weighted avg       0.83      0.83      0.83       179

Confusion Matrix:
 [[104  13]
 [ 17  45]]
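
The argument above lists linear SVMs as a second alternative, but only Logistic Regression is trained. A minimal sketch of the SVM option, assuming a hypothetical toy corpus and default-like settings for `LinearSVC` (the `C` value is illustrative, not tuned); with the real data, the pipeline would be fitted on `X_train` and `y_train` and scored on the test split like the other models.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus; 1 = positive, 0 = negative
toy_reviews = [
    "great app love it",
    "love this great game",
    "terrible app crashes",
    "crashes constantly terrible update",
]
toy_labels = [1, 1, 0, 0]

svm_pipeline = Pipeline([
    ("vec", CountVectorizer(stop_words="english")),
    ("svm", LinearSVC(C=1.0, random_state=42)),  # C=1.0 is an untuned assumption
])
svm_pipeline.fit(toy_reviews, toy_labels)

new_reviews = ["love this great app", "terrible update crashes constantly"]
svm_preds = svm_pipeline.predict(new_reviews)
print(svm_preds)
```

`LinearSVC` works directly on the sparse output of `CountVectorizer`, so no `.toarray()` call is needed.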


Multinomial Naive Bayes Performance:
Accuracy: 0.8547 (85.47%)

Classification Report:

Class 0 (Negative Sentiment):

precision: 0.84: When the model predicts a review is negative, it's correct 84% of the time.

recall: 0.96: The model correctly identifies 96% of all actual negative reviews. This is very high, meaning it's good at catching negative reviews.

f1-score: 0.90: A strong F1-score, indicating a good balance for this class.

support: 117: Number of actual negative reviews in the test set.

Class 1 (Positive Sentiment):

precision: 0.89: When the model predicts a review is positive, it's correct 89% of the time.

recall: 0.66: The model correctly identifies 66% of all actual positive reviews. This is well below the recall for class 0: it misses 21 of the 62 positive reviews.

f1-score: 0.76: A reasonable F1-score, but lower than class 0 because of the weaker recall.

support: 62: Number of actual positive reviews in the test set.

Confusion Matrix: [[112 5] [21 41]]

True Negatives (TN): 112 (negative reviews correctly predicted as negative)

False Positives (FP): 5 (negative reviews incorrectly predicted as positive)

False Negatives (FN): 21 (positive reviews incorrectly predicted as negative)

True Positives (TP): 41 (positive reviews correctly predicted as positive)
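
The four counts above are enough to reproduce every number in the MultinomialNB classification report. A short worked check, using only the confusion matrix values from the output (variable names are illustrative):

```python
import numpy as np

# Confusion matrix from the MultinomialNB output: rows = actual, columns = predicted
cm = np.array([[112, 5],
               [21, 41]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()

# Class 1 (positive)
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

# Class 0 (negative): the roles of the two classes are swapped
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)
f1_neg = 2 * precision_neg * recall_neg / (precision_neg + recall_neg)

print(f"accuracy: {accuracy:.4f}")
print(f"class 0: precision={precision_neg:.2f} recall={recall_neg:.2f} f1={f1_neg:.2f}")
print(f"class 1: precision={precision_pos:.2f} recall={recall_pos:.2f} f1={f1_pos:.2f}")
```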

Analysis: Multinomial Naive Bayes has the highest accuracy of the models tried and is strongest on the negative class (recall 0.96). Its weak point is recall on the positive class (0.66): 21 of the 62 positive reviews are labeled negative. The overall result is expected given the model's suitability for discrete count data such as word frequencies.

Gaussian Naive Bayes Performance:
Accuracy: 0.8156 (81.56%)

Classification Report & Confusion Matrix: Class 0 has precision 0.84, recall 0.89, and F1-score 0.86; class 1 has precision 0.76, recall 0.68, and F1-score 0.72. Confusion matrix: [[104 13] [20 42]].

Analysis: Gaussian Naive Bayes is about 4 percentage points less accurate than MultinomialNB. This is expected, as GaussianNB assumes each feature follows a continuous Gaussian distribution, which is a poor fit for discrete word counts. It still works, but usually less well than MultinomialNB on this type of input.

Bernoulli Naive Bayes Performance:
Accuracy: 0.8324 (83.24%)

Classification Report & Confusion Matrix: (Output truncated, but the accuracy is visible)

Analysis: Bernoulli Naive Bayes performs better than GaussianNB but still slightly below MultinomialNB. BernoulliNB is designed for binary features (presence or absence of a word), and while CountVectorizer provides counts, BernoulliNB essentially binarizes these counts (any non-zero count becomes 1). MultinomialNB, which uses the actual word counts, is generally more effective when the frequency of words matters.
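
The binarization described above can be checked directly. In this sketch, on a hypothetical toy corpus with repeated words, `BernoulliNB` with its default `binarize=0.0` is fitted once on raw counts and once on counts that were manually converted to 0/1; the two models should produce the same probabilities.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy corpus with repeated words, so some counts exceed 1
toy_reviews = [
    "good good good app",
    "good game",
    "bad bad bad app",
    "bad update",
]
toy_labels = [1, 1, 0, 0]

X_counts = CountVectorizer().fit_transform(toy_reviews).toarray()
X_binary = (X_counts > 0).astype(int)  # manual presence/absence encoding

# Default binarize=0.0 maps every value greater than 0 to 1 before fitting
bnb_on_counts = BernoulliNB().fit(X_counts, toy_labels)
bnb_on_binary = BernoulliNB().fit(X_binary, toy_labels)

same_probabilities = np.allclose(
    bnb_on_counts.predict_proba(X_counts),
    bnb_on_binary.predict_proba(X_binary),
)
print("Max count in matrix:", X_counts.max())
print("Identical probabilities:", same_probabilities)
```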

Conclusion:
These results are consistent with the nature of the dataset and the theoretical assumptions of each Naive Bayes variant.

MultinomialNB is the most suitable and best-performing of the three Naive Bayes models for this text classification problem, where the features are word counts. Its accuracy of 85.47% is also higher than that of Logistic Regression (83.24%) and Random Forest (82.12%) on the same test split.

GaussianNB and BernoulliNB perform slightly worse, which aligns with their design for different types of feature distributions.