# Explore here - Problem Statement | Background

**Sentiment analysis**

The dataset has only three variables: two predictors and a binary label. Of the two predictors, only the review text matters here, because whether a comment is positive or negative depends on its content, not on the application it was written about. The `package_name` variable is therefore removed.

When working with text like this, a conventional EDA adds little, because the only variable of interest is the one that contains the text. An EDA does make sense when the text is one part of a larger dataset with other numeric predictors and a different prediction objective.


- **package_name**. Name of the mobile application (categorical).
- **review**. Comment about the mobile application (free text).
- **polarity**. Class label: 0 for a negative comment, 1 for a positive one (numeric).


### Import Libraries


In [2]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report
from pickle import dump



### Read CSV

In [3]:
# Load the CSV file
tot_data = pd.read_csv('https://raw.githubusercontent.com/4GeeksAcademy/naive-bayes-project-tutorial/main/playstore_reviews.csv')

# Set display options to show all columns (None means unlimited)
pd.set_option('display.max_columns', None)

# Display the first rows
tot_data.head(3)

Unnamed: 0,package_name,review,polarity
0,com.facebook.katana,privacy at least put some option appear offli...,0
1,com.facebook.katana,"messenger issues ever since the last update, ...",0
2,com.facebook.katana,profile any time my wife or anybody has more ...,0


In [4]:
# display shape
tot_data.shape

(891, 3)

### Remove 'package_name' and normalize the review text

In [5]:
# List of keywords to filter out
keywords = ['package_name']

# Finding columns that contain any of the keywords
columns_to_drop = [col for col in tot_data.columns if any(keyword in col for keyword in keywords)]

# Trim surrounding whitespace and lowercase the review text
tot_data["review"] = tot_data["review"].str.strip().str.lower()

# Dropping these columns from the DataFrame
tot_data = tot_data.drop(columns=columns_to_drop)

In [25]:
# Step 1: Text Preprocessing
# Basic preprocessing can include lowercasing, removing punctuation, etc.
tot_data['review_cleaned'] = tot_data['review'].str.lower()

# Step 2: Feature Extraction
tfidf = TfidfVectorizer(max_features=1000)  # Limit number of features to 1000
features = tfidf.fit_transform(tot_data['review_cleaned'])

# Step 3: Label Encoding
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(tot_data['polarity'])

# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)
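
To make the feature-extraction step less of a black box, here is a minimal sketch of how TF-IDF weights can be computed by hand. It assumes the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row, which is our understanding of the `TfidfVectorizer` defaults. The whitespace tokenizer, the helper `tfidf_rows`, and the toy reviews are hypothetical simplifications, not what the cell above runs.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocabulary, rows) using smoothed IDF and L2-normalised rows."""
    # Assumption: plain whitespace tokenisation, simpler than a real vectorizer.
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenised for tok in doc})
    n_docs = len(tokenised)
    # Document frequency: number of documents in which each term appears.
    df = Counter(tok for doc in tokenised for tok in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenised:
        tf = Counter(doc)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


# Hypothetical toy reviews, not taken from the dataset.
toy_reviews = ["great app love it", "app crashes a lot", "love love this app"]
vocab, rows = tfidf_rows(toy_reviews)

i_app, i_crashes = vocab.index("app"), vocab.index("crashes")
# "app" occurs in every review, so it is weighted below the rarer "crashes".
print(rows[1][i_app] < rows[1][i_crashes])  # True
```

A term that appears in every document carries the minimum IDF, which is why common words contribute less to each row than distinctive ones.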

In [26]:
print(X_train.toarray()[:5])  # to print the first 5 rows


[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [27]:
# Training the Naive Bayes Classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)
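
For intuition about what the classifier learns, the following is a hand-rolled sketch of multinomial Naive Bayes with additive (Laplace) smoothing. It works on raw token counts rather than TF-IDF weights, and the functions `train_multinomial_nb` and `predict`, as well as the toy documents, are hypothetical illustrations rather than scikit-learn's implementation.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class token log likelihoods."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc.split())
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_lik = {}
    for c in class_docs:
        # alpha is added to every token count so unseen tokens keep a
        # non-zero probability.
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {
            t: math.log((token_counts[c][t] + alpha) / total) for t in vocab
        }
    return log_prior, log_lik


def predict(model, doc):
    """Return the class with the highest posterior log score."""
    log_prior, log_lik = model
    scores = {}
    for c in log_prior:
        # Tokens outside the training vocabulary are ignored.
        scores[c] = log_prior[c] + sum(
            log_lik[c][t] for t in doc.split() if t in log_lik[c]
        )
    return max(scores, key=scores.get)


# Hypothetical toy data: 1 = positive, 0 = negative.
toy_docs = ["love this app", "great app", "app crashes", "hate the crashes"]
toy_labels = [1, 1, 0, 0]
model = train_multinomial_nb(toy_docs, toy_labels)

prediction_pos = predict(model, "love great app")
prediction_neg = predict(model, "hate crashes")
print(prediction_pos, prediction_neg)  # 1 0
```

The `alpha` argument plays the same role as the `nb__alpha` value that is tuned later in the grid search: larger values pull the per-class token probabilities toward a uniform distribution.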

#### Model Prediction and Evaluation


In [28]:
# Predicting and Evaluating
y_pred = nb_classifier.predict(X_test)
print("Accuracy:", round(accuracy_score(y_test, y_pred),5))
#print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.83799
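
Accuracy alone does not show which class the errors fall on, which is what the commented-out `classification_report` call would reveal. As a rough sketch of the underlying arithmetic, the block below computes accuracy, precision, and recall from the four confusion counts. The helper `binary_metrics` and the label vectors are hypothetical, not the notebook's actual predictions.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision and recall for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when a class is never predicted
        # or never present.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical labels and predictions, for illustration only.
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(toy_true, toy_pred)
print(metrics)  # {'accuracy': 0.75, 'precision': 0.75, 'recall': 0.75}
```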


In [40]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline

# Creating a pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(lowercase=True)),
    ('nb', MultinomialNB())
])

# Parameters for Grid Search
parameters = {
    'tfidf__max_features': (1000, 2000),
    'nb__alpha': (1, 2, 3),
    'nb__fit_prior': (True, False),  # Whether to learn class prior probabilities or not 
}

# Grid Search with Cross-Validation
grid_search = GridSearchCV(pipeline, parameters, cv=10, n_jobs=-1, verbose=2)
grid_search.fit(tot_data['review_cleaned'], labels)

# Best Parameters and Scores
print("Best Parameters:", grid_search.best_params_)
print("Best Score:", round(grid_search.best_score_, 5))

# You can now use grid_search.best_estimator_ to make predictions or further analysis

Fitting 10 folds for each of 12 candidates, totalling 120 fits
[CV] END nb__alpha=1, nb__fit_prior=True, tfidf__max_features=1000; total time=   0.0s
[CV] END nb__alpha=1, nb__fit_prior=True, tfidf__max_features=1000; total time=   0.1s
[CV] END nb__alpha=1, nb__fit_prior=True, tfidf__max_features=1000; total time=   0.0s
[CV] END nb__alpha=1, nb__fit_prior=True, tfidf__max_features=1000; total time=   0.0s
[CV] END nb__alpha=1, nb__fit_prior=True, tfidf__max_features=1000; total time=   0.0s
[CV] END nb__alpha=1, nb__fit_prior=True, tfidf__max_features=1000; total time=   0.0s
[CV] END nb__alpha=1, nb__fit_prior=True, tfidf__max_features=1000; total time=   0.0s
[CV] END nb__alpha=1, nb__fit_prior=True, tfidf__max_features=1000; total time=   0.0s
[CV] END nb__alpha=1, nb__fit_prior=True, tfidf__max_features=1000; total time=   0.0s
[CV] END nb__alpha=1, nb__fit_prior=True, tfidf__max_features=2000; total time=   0.1s
[CV] END nb__alpha=1, nb__fit_prior=True, tfidf__max_features=1000;
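
The first line of the log reports 12 candidates and 120 fits. A small sketch shows where those numbers come from: the grid is the Cartesian product of the parameter values, and each candidate is fitted once per fold. The enumeration below uses `itertools.product` for illustration and does not necessarily follow the order in which `GridSearchCV` evaluates candidates.

```python
from itertools import product

# Same grid as in the cell above.
parameters = {
    "tfidf__max_features": (1000, 2000),
    "nb__alpha": (1, 2, 3),
    "nb__fit_prior": (True, False),
}
cv_folds = 10

names = sorted(parameters)
candidates = [
    dict(zip(names, combo))
    for combo in product(*(parameters[name] for name in names))
]
total_fits = len(candidates) * cv_folds

print(len(candidates), total_fits)  # 12 120
```

Adding one more value to any parameter multiplies the number of candidates, so the cost of the search grows quickly as the grid widens.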

### Add Other Models: BernoulliNB and XGBoost

In [41]:
# Import the BernoulliNB classifier
from sklearn.naive_bayes import BernoulliNB

# Training the Naive Bayes Classifier with BernoulliNB
nb_classifier = BernoulliNB()
nb_classifier.fit(X_train, y_train)

In [42]:
# Predicting and Evaluating
y_pred = nb_classifier.predict(X_test)
print("Accuracy:", round(accuracy_score(y_test, y_pred),5))
#print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.80447
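
One possible reason the Bernoulli variant behaves differently on these features: it models each term as present or absent, so the TF-IDF weights are reduced to 0/1 before fitting. The sketch below assumes a threshold of 0.0, which we believe is the default of the `binarize` parameter. The helper and the two rows are hypothetical.

```python
def binarize(rows, threshold=0.0):
    """Map every value above the threshold to 1 and everything else to 0."""
    return [[1 if value > threshold else 0 for value in row] for row in rows]


# Two hypothetical TF-IDF rows that weight the same two terms very differently.
tfidf_rows_example = [
    [0.95, 0.31, 0.0],
    [0.31, 0.95, 0.0],
]
binary_rows = binarize(tfidf_rows_example)
print(binary_rows)  # [[1, 1, 0], [1, 1, 0]]
```

After binarization the two rows are indistinguishable, so any information carried by the relative weights is lost to the Bernoulli model.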


In [43]:
# Train a gradient-boosted model with XGBoost
from xgboost import XGBClassifier

xgb_model = XGBClassifier(random_state=42)
xgb_model.fit(X_train, y_train)

In [44]:
# Predicting and Evaluating
y_pred = xgb_model.predict(X_test)
print("Accuracy:", round(accuracy_score(y_test, y_pred),5))
#print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.78212


## Text Preprocessing with CountVectorizer


In [6]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Step 1: Text Preprocessing
# Basic preprocessing can include lowercasing, removing punctuation, etc.
tot_data['review_cleaned'] = tot_data['review'].str.lower()

# Step 2: Feature Extraction
count_vec = CountVectorizer(max_features=1000, stop_words='english')  # Limit number of features to 1000
features = count_vec.fit_transform(tot_data['review_cleaned'])

# Step 3: Label Encoding
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(tot_data['polarity'])

# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)
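
To see how many rows end up in each set, here is a sketch of a seeded shuffle split over row indices. It assumes the test size is rounded up, which is our understanding of how `train_test_split` treats a fractional `test_size`. The helper `split_indices` is hypothetical and will not reproduce scikit-learn's exact shuffle, only the set sizes.

```python
import math
import random


def split_indices(n_samples, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed and cut off the test portion."""
    n_test = math.ceil(test_size * n_samples)  # assumption: round up
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]


# 891 is the number of rows reported by tot_data.shape.
train_idx, test_idx = split_indices(891)
print(len(train_idx), len(test_idx))  # 712 179
```

Under that assumption the test set holds 179 reviews, so a single misclassified review changes the accuracy by roughly 0.006.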

In [7]:
print(X_train.toarray()[:5])  # to print the first 5 rows

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [8]:
# Training the Naive Bayes Classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)

In [9]:
# Predicting and Evaluating
y_pred = nb_classifier.predict(X_test)
print("Accuracy:", round(accuracy_score(y_test, y_pred),5))
#print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.80447


#### Save Model

In [46]:
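# Persist the fitted classifier; the context manager closes the file handle
with open("nb_classifier_default_42.sav", "wb") as model_file:
    dump(nb_classifier, model_file)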
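
A saved classifier only accepts already-vectorized input, so to score raw text later the fitted vectorizer that produced the training features would need to be saved as well. The round trip below is a minimal sketch of that idea using `pickle` and a temporary directory. The `bundle` dictionary is a hypothetical stand-in for the real fitted objects, and the file name `model_bundle.sav` is made up for the example.

```python
import os
import pickle
import tempfile

# Hypothetical stand-ins for the fitted vectorizer and classifier.
bundle = {
    "vectorizer": {"vocabulary": ["app", "crashes", "love"]},
    "classifier": {"alpha": 1.0, "classes": [0, 1]},
}

path = os.path.join(tempfile.mkdtemp(), "model_bundle.sav")

# Write and read inside context managers so the file is always closed.
with open(path, "wb") as fh:
    pickle.dump(bundle, fh)

with open(path, "rb") as fh:
    restored = pickle.load(fh)

print(restored == bundle)  # True
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.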