In [88]:
import pandas as pd
import utils
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB, ComplementNB
from sklearn.metrics import accuracy_score, recall_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from math import ceil
from pickle import dump

# Review classifier

In this file we will explore the creation of a model that classifies app reviews as positive or negative.

In [10]:
raw_df = utils.load_reviews_db()

Database successfully loaded


In [11]:
raw_df

index,package_name,review,polarity
0,com.facebook.katana,privacy at least put some option appear offli...,0
1,com.facebook.katana,"messenger issues ever since the last update, ...",0
2,com.facebook.katana,profile any time my wife or anybody has more ...,0
3,com.facebook.katana,the new features suck for those of us who don...,0
4,com.facebook.katana,forced reload on uploading pic on replying co...,0
...,...,...,...
886,com.rovio.angrybirds,loved it i loooooooooooooovvved it because it...,1
887,com.rovio.angrybirds,all time legendary game the birthday party le...,1
888,com.rovio.angrybirds,ads are way to heavy listen to the bad review...,0
889,com.rovio.angrybirds,fun works perfectly well. ads aren't as annoy...,1


In [26]:
raw_df.polarity.value_counts()

polarity
0    584
1    307
Name: count, dtype: int64

# Data overview

- Since we want our predictions to be based only on the reviews, we only need to work with the `review` variable.
- The dataset is moderately imbalanced (584 negative vs. 307 positive reviews), so it is useful to evaluate the quality of the models with both the accuracy and the recall metrics.
- The text in each review may contain things like trailing spaces and upper case letters, so it is recommended to format it before processing it.
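
To illustrate why accuracy alone can be misleading with these class counts, here is a minimal sketch of a hypothetical baseline that labels every review as negative. The metrics are computed by hand to make their definitions explicit; the names `y_true_demo` and `y_pred_demo` are illustrative and not part of the notebook.

```python
# Class counts taken from raw_df.polarity.value_counts() above.
n_negative, n_positive = 584, 307

# Hypothetical baseline: a "classifier" that labels every review as negative (0).
y_true_demo = [0] * n_negative + [1] * n_positive
y_pred_demo = [0] * len(y_true_demo)

correct = sum(t == p for t, p in zip(y_true_demo, y_pred_demo))
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true_demo, y_pred_demo))

accuracy = correct / len(y_true_demo)   # share of all reviews labelled correctly
recall = true_positives / n_positive    # share of positive reviews that were found

print(f"accuracy: {accuracy:.3f}")
print(f"recall:   {recall:.3f}")
```

The baseline is right about two thirds of the time without finding a single positive review, which is why recall is tracked next to accuracy in the rest of the notebook.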

# Data preprocessing

## Data formatting

In [13]:
df = raw_df.copy()
df["review"] = raw_df["review"].str.strip().str.lower()

## Text vectorization and train-test split

In [76]:
vec_model = CountVectorizer(stop_words="english")

x = df.review
y = df.polarity

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Learn the vocabulary from the training reviews only, then reuse it for the test reviews
x_train = vec_model.fit_transform(x_train).toarray()
x_test = vec_model.transform(x_test).toarray()

x_train

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], shape=(712, 3310))

# Model creation

Our main focus is going to be creating a Naive Bayes model for classification.
Since we vectorized the reviews with CountVectorizer, our training data consists of discrete word counts, and since the dataset is also imbalanced, the appropriate Naive Bayes variant for this scenario is Complement Naive Bayes (`ComplementNB`).

## Initial model

In [68]:
model = ComplementNB()
model.fit(x_train, y_train)

y_pred = model.predict(x_test)

acc_score = accuracy_score(y_test, y_pred)
rec_score = recall_score(y_test, y_pred)

print(f'accuracy:   {acc_score}')
print(f'recall:     {rec_score}')

accuracy:   0.8044692737430168
recall:     0.660377358490566


## Is it the right model?

In the next code cell we will check the accuracy and recall of the other Naive Bayes implementations to confirm whether ComplementNB is indeed the best one.

In [63]:
models = {
    'GaussianNB': GaussianNB(),
    'BernoulliNB': BernoulliNB(),
    'MultinomialNB': MultinomialNB()
}

for model_name, model in models.items():
    model.fit(x_train, y_train)

    y_pred = model.predict(x_test)

    acc_score = accuracy_score(y_test, y_pred)
    rec_score = recall_score(y_test, y_pred)

    print(f'{model_name:15} accuracy: {acc_score}')
    print(f'{model_name:15} recall:   {rec_score}\n')

GaussianNB      accuracy: 0.8044692737430168
GaussianNB      recall:   0.6226415094339622

BernoulliNB     accuracy: 0.770949720670391
BernoulliNB     recall:   0.39622641509433965

MultinomialNB   accuracy: 0.8156424581005587
MultinomialNB   recall:   0.6037735849056604



If we take into account only the accuracy value, the best model would be MultinomialNB (0.816). Nonetheless, ComplementNB has a slightly lower accuracy (0.804) but a noticeably higher recall (0.660 vs. 0.604). This means that ComplementNB is probably the best choice for creating our model.

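This result is consistent with how ComplementNB works: it estimates the parameters of each class from the samples of all the other classes, which makes it less sensitive to class imbalance than the MultinomialNB implementation.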
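
For reference, the difference between the two implementations can be written down explicitly. The following is a sketch of the complement estimate as described in the scikit-learn documentation for `ComplementNB`; the symbols are introduced here and do not appear elsewhere in this notebook.

$$
\hat{\theta}_{ci} = \frac{\alpha_i + \sum_{j : y_j \neq c} d_{ij}}{\alpha + \sum_{j : y_j \neq c} \sum_{k} d_{kj}},
\qquad
w_{ci} = \log \hat{\theta}_{ci},
\qquad
\hat{c} = \arg\min_{c} \sum_{i} t_i \, w_{ci}
$$

Here $d_{ij}$ is the count of word $i$ in training review $j$, $\alpha_i$ is the smoothing value (the `alpha` parameter) with $\alpha = \sum_i \alpha_i$, and $t_i$ is the count of word $i$ in the review being classified. The sums run over the reviews that do *not* belong to class $c$, so the minority class is described using the larger pool of majority-class reviews. With `norm=True` the weights are additionally divided by $\sum_j |w_{cj}|$.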

## Parameter tuning

In [70]:
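param_grid = {
    'alpha': [0.001, 0.4, 0.5, 0.6, 3],
    'fit_prior': [True, False],
    'norm': [True, False]
}

# Pass a fresh ComplementNB explicitly: `model` was last bound to MultinomialNB in the loop above
grid = GridSearchCV(ComplementNB(), param_grid, scoring='balanced_accuracy')
grid.fit(x_train, y_train)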

y_pred = grid.best_estimator_.predict(x_test)
params = grid.best_params_

acc_score = accuracy_score(y_test, y_pred)
rec_score = recall_score(y_test, y_pred)

print(params)
print(f'accuracy:   {acc_score}')
print(f'recall:     {rec_score}')

{'alpha': 0.5, 'fit_prior': True, 'norm': False}
accuracy:   0.8044692737430168
recall:     0.660377358490566
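
The grid search above ranks parameter combinations by `balanced_accuracy`, which is the average of the recall obtained on each class. A minimal sketch on invented labels (the lists below are assumptions for illustration, not results from the dataset) shows the relationship:

```python
from sklearn.metrics import balanced_accuracy_score, recall_score

# Toy labels, invented for illustration: 6 negatives and 4 positives.
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

recall_negative = recall_score(y_true_demo, y_pred_demo, pos_label=0)  # 5 of 6 negatives found
recall_positive = recall_score(y_true_demo, y_pred_demo, pos_label=1)  # 2 of 4 positives found
balanced = balanced_accuracy_score(y_true_demo, y_pred_demo)

print(f"recall (negative): {recall_negative:.3f}")
print(f"recall (positive): {recall_positive:.3f}")
print(f"balanced accuracy: {balanced:.3f}")
```

Because both classes weigh the same in this score, it is a reasonable selection criterion for an imbalanced dataset, even though the notebook reports plain accuracy and positive-class recall afterwards.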


## Saving the model

In [71]:
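import os.path as path

# Build a file name that encodes the best parameters,
# e.g. ComplementNB_alpha_0.5_fitprior_True_norm_False.sav
model_name = 'ComplementNB'
param_string_list = ['_' + param.replace('_', '') + '_' + str(value)
                     for param, value in params.items()]
model_name += ''.join(param_string_list) + '.sav'

model_path = path.join('..', 'models', model_name)
# Use a context manager so the file is closed after writing
with open(model_path, 'wb') as model_file:
    dump(grid.best_estimator_, model_file)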
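
A saved classifier can only score new reviews if they are vectorized with the same fitted vocabulary. One possible way to keep both pieces together, shown here as a sketch and not as what this notebook does, is to bundle the vectorizer and the classifier in a scikit-learn pipeline and pickle that. The toy data, the temporary directory, and the file name `demo_pipeline.sav` are assumptions for illustration.

```python
import os
import tempfile
from pickle import dump, load

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline

# Toy data, invented for illustration only.
demo_reviews = ["loved this game", "great fun", "crashes all the time", "too many ads"]
demo_polarity = [1, 1, 0, 0]

# Bundle vectorizer and classifier so both end up in the same file.
demo_pipeline = make_pipeline(CountVectorizer(stop_words="english"), ComplementNB(alpha=0.5))
demo_pipeline.fit(demo_reviews, demo_polarity)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_pipeline.sav")  # hypothetical file name
    with open(demo_path, "wb") as file:
        dump(demo_pipeline, file)
    with open(demo_path, "rb") as file:
        restored = load(file)

# The restored pipeline accepts raw text directly.
new_reviews = ["loved the game", "ads and crashes"]
print(restored.predict(new_reviews))
```

Pickle files should only be loaded from trusted sources, since loading one can execute arbitrary code.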



# Alternative models

Since our data is now vectorized, we can apply pretty much any classification model to it.

## Decision Tree

In [52]:
tree_model = DecisionTreeClassifier(random_state=42)
tree_model.fit(x_train,y_train)
y_pred = tree_model.predict(x_test)

acc_score = accuracy_score(y_test, y_pred)
rec_score = recall_score(y_test, y_pred)

print(f'leaves:{tree_model.get_n_leaves()}')
print(f'depth:{tree_model.get_depth()}')
print(f'accuracy:   {acc_score}')
print(f'recall:     {rec_score}')



leaves:127
depth:72
accuracy:   0.7150837988826816
recall:     0.6037735849056604


As we can see, the code above created a pretty big decision tree (127 leaves and a depth of 72). This suggests that the tree is overfitting our data, which would explain why the accuracy and recall drop with respect to the Naive Bayes models.
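
A common way to check for overfitting is to compare the accuracy on the training data with the accuracy on the test data, and to see what happens when the tree is constrained. The sketch below does this on synthetic data generated with `make_classification`, used here as a stand-in for the vectorized reviews; the `max_depth=3` value is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the vectorized reviews (not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=50, n_informative=5, flip_y=0.1, random_state=42
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

full_tree = DecisionTreeClassifier(random_state=42).fit(Xd_train, yd_train)
small_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(Xd_train, yd_train)

for name, tree in [("unrestricted", full_tree), ("max_depth=3", small_tree)]:
    train_acc = tree.score(Xd_train, yd_train)
    test_acc = tree.score(Xd_test, yd_test)
    print(f"{name:13} depth={tree.get_depth():2d} train={train_acc:.3f} test={test_acc:.3f}")
```

An unrestricted tree typically memorizes the training set, so a large gap between its training and test accuracy is the sign to look for. In this notebook the equivalent check would be comparing `tree_model.score(x_train, y_train)` with the test accuracy printed above.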

## Random forest

In [89]:
RFC_model = RandomForestClassifier(random_state=42)
RFC_model.fit(x_train,y_train)
y_pred = RFC_model.predict(x_test)

acc_score = accuracy_score(y_test, y_pred)
rec_score = recall_score(y_test, y_pred)

print(f'accuracy:   {acc_score}')
print(f'recall:     {rec_score}')

accuracy:   0.7988826815642458
recall:     0.7358490566037735


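The random forest mitigates the overfitting of the single decision tree: combining many trees raises the accuracy from 0.715 to 0.799 and the recall from 0.604 to 0.736.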

## Logistic regression

**Important note**: We do not need to scale the input features of the logistic regression model in this case; when tested, scaling them made the model's predictions worse.
This happens because all of our features already share the same scale (it is not like comparing kilometers and centimeters; every feature is a word count within a review). Rescaling them would only distort the data and change how the L2 regularization that sklearn applies by default to the logistic regression model penalizes each coefficient.

In [85]:
log_model = LogisticRegression(random_state=42)
log_model.fit(x_train,y_train)
y_pred = log_model.predict(x_test)

acc_score = accuracy_score(y_test, y_pred)
rec_score = recall_score(y_test, y_pred)

print(f'accuracy:   {acc_score}')
print(f'recall:     {rec_score}')

accuracy:   0.8324022346368715
recall:     0.8113207547169812


This is a surprising result. Given my superficial knowledge of the logistic regression model, I couldn't come up with a strong explanation of why it performs better than Naive Bayes, but I formulated some hypotheses anyway:
- For this particular dataset and task, the independence assumption between features does not hold, which worsens the predictions of the Naive Bayes models. In our case that means that the presence and count of some words affect the presence and count of others. This is plausible because the reviews contain long sentences, which may involve more complex grammatical structures.
- The count of the words affects the odds of the review being positive in an exponential way, making the data fit the assumptions of the logistic regression model very well. This would happen, for instance, if the people giving negative reviews were more prone to just saying what they don't like with phrases like "this is crap" or "it is broke and full of bugs", while the people who leave positive reviews were prone to emphasize how much they liked the application using long descriptions and epizeuxis (a repetition of a word or phrase in quick succession) in sentences like "this is very very fun to play".
- It happened by chance because the dataset is fairly small, and if the dataset were bigger then the Naive Bayes model would outperform the logistic regression.
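
The "by chance" hypothesis could be probed by scoring both models on several train/test splits instead of a single one. The following is a minimal sketch using stratified 5-fold cross-validation; it runs on synthetic Poisson counts that stand in for the vectorized reviews, so the numbers it prints say nothing about the real dataset. In the notebook, the same call could be made on the vectorized reviews and `y`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import ComplementNB

# Synthetic stand-in for word counts: positive samples have slightly higher counts.
rng = np.random.default_rng(42)
y_demo = rng.integers(0, 2, size=200)
X_demo = rng.poisson(lam=1.0 + 0.5 * y_demo[:, None], size=(200, 30))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
candidates = {
    "ComplementNB": ComplementNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000, random_state=42),
}

cv_scores = {}
for name, estimator in candidates.items():
    # One recall value per fold; the spread shows how much a single split can vary.
    cv_scores[name] = cross_val_score(estimator, X_demo, y_demo, cv=cv, scoring="recall")
    print(f"{name:20} recall: {cv_scores[name].mean():.3f} +/- {cv_scores[name].std():.3f}")
```

If the gap between the two models is larger than the fold-to-fold spread, chance becomes a less likely explanation.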

# Conclusions

- On accuracy, the models reliant on probabilistic classification (Naive Bayes and logistic regression) beat the tree-based ones, although the random forest comes close to ComplementNB (0.799 vs. 0.804) and has a higher recall (0.736 vs. 0.660). This may be due to the dataset containing few data clusters.
- Even though Naive Bayes is a common first choice for text classification, in this case logistic regression is clearly better (especially in the recall score: 0.811 vs. 0.660).
- More tests and a deeper analysis are needed to understand why the logistic regression model worked better than the Naive Bayes ones.
- Although doing a complete EDA was not necessary, it would be useful for text classification tasks like this one. It would have to be performed after the text vectorization, since a multivariate analysis would help identify whether the independence of the features holds for our dataset.