# Yelp Reviews Sentiment Analysis - Modeling

#### Prepared By: Rabia Tariq

## Contents

* [Introduction](#Introduction)
* [Imports](#Imports)
* [Building Models Using Features we Extracted](#Features)
    * [Logistic Regression](#Logreg)
    * [Random Forest](#RF)
    * [Gradient Boosting Classifier](#GBC)
    * [Support Vector Classifier](#SVC)
    * [Result](#Result)
* [Word Embedding: Bag of Words](#BOW)
    * [Naive Bayes with TF-IDF](#Nb_tfidf)
    * [Naive Bayes with CountVectorizer](#Nb_cv)
    * [Gradient Boosting Classifier with TF-IDF](#GBC_tfidf)
    * [Result](#Result)
* [Ensembles](#Ensemble)
    * [Sparse and Dense Features](#Sparse)
    * [Naive Bayes Probability](#Nb_proba)
    * [Stacked Model (Gradient Boosting + Naive Bayes)](#Stacked)
    * [Result](#Result)
* [Conclusion](#Conclusion)

## Introduction<a id='Introduction'></a>

Up until now, we have done data wrangling, performed EDA, done preprocessing and found the most important feautres. We created a baseline model and got an accuracy of ~ 68% by using Logistic Regression. In this notebook, we will be creating many different models and find the most accurate one which we can use in the future.

We will be using Ensembling techniques, Bag of words model and stack different models

## Imports<a id='Imports'></a>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, accuracy_score, f1_score
from sklearn.pipeline import Pipeline
import pickle
from scipy.sparse import coo_matrix, hstack
from pycm import *


import warnings
warnings.filterwarnings('ignore')

In [2]:
yelp_data = pd.read_csv('yelp_data_preprocessing.csv')
yelp_data.head()

Unnamed: 0,Name,Review,Polarity,Sentiment,%_Positive_Words,park,bake,shop,becom,go,...,sangria,time squar,jukebox,calzon,byob,pizzeria,bake clam,castl,white castl,drive thru
0,Morris Park Bake Shop,'Morris Park Bake Shop has become my go to spo...,0.338889,Somewhat Positive,0.206897,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Morris Park Bake Shop,'I thought the cookies and biscotti were prett...,0.314583,Somewhat Positive,0.130435,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Morris Park Bake Shop,'Guys.... so Im a big time biscotti connoisseu...,0.238068,Somewhat Positive,0.12766,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.15229,0.0,0.0
3,Morris Park Bake Shop,'I had a craving for a special type of cake wi...,0.314643,Somewhat Positive,0.21875,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Morris Park Bake Shop,'The chocolate cups are amazing! Have been eat...,0.5,Positive,0.222222,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Building models using the features we extracted<a id='Features'></a>

We will start by creating a model using Logistic Regression and Random Forest. Perform hyperparameter tuning to get the best performance from our models

We will be using F1 scores and accuracy to see the performance of our models.

In our modeling process, we will be:

- Defining a pipeline and use GridSearch to find the best parameters for our chosen model type
- Measuring the accuracy and F1 scores
- Repeat the same steps for our next model of choice

In [18]:
X = yelp_data.iloc[0:,4:]
y = yelp_data.Sentiment

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

### Logistic Regression<a id='Logreg'></a>

Logistic regression is basically a supervised classification algorithm. In a classification problem, the target variable(or output), y, can take only discrete values for given set of features(or inputs), X.

We will be creating a pipleline that uses Standard Scaler and Logistic Regression and then use GridSearchCV to find the best parameters.

In [8]:
pipe = [('scaler', StandardScaler()), ('lr', LogisticRegression(solver = 'lbfgs'))] 
pipeline = Pipeline(pipe)
params = {'lr__C':[0.01, 0.1, 1, 10, 100]}

l_reg = GridSearchCV(pipeline, params, cv = 10, scoring = "accuracy") 
l_reg.fit(X_train, y_train)
l_reg.best_params_

{'lr__C': 0.1}

Now we will be using pickle to save our models. Pickle is very useful for when you're working with machine learning algorithms, where you want to save them to be able to make new predictions at a later time, without having to rewrite everything or train the model all over again.

https://www.datacamp.com/community/tutorials/pickle-python-tutorial

In [9]:
filename = 'l_reg.sav'
pickle.dump(l_reg, open(filename, 'wb'))

In [19]:
filename = 'l_reg.sav'
l_reg = pickle.load(open(filename, 'rb'))

In [20]:
result = l_reg.predict(X_test)

In [21]:
test_acc = l_reg.score(X_test, y_test)
f1_acc = f1_score(y_test, result, average = 'macro')
f1_acc_mic = f1_score(y_test, result, average = 'micro')
f1_acc_w = f1_score(y_test, result, average = 'weighted')
print("Accuracy on test data: " ,test_acc)
print('F1 Score (macro): ', f1_acc)
print('F1 Score (micro): ', f1_acc_mic)
print('F1 Score (weighted): ', f1_acc_w)

Accuracy on test data:  0.6896375701888718
F1 Score (macro):  0.6898171496267322
F1 Score (micro):  0.6896375701888718
F1 Score (weighted):  0.6892833011179479


In [22]:
lreg_acc = test_acc
lreg_f1_mac = f1_acc
lreg_f1_mic = f1_acc_mic
lreg_f1_w = f1_acc_w

### Random Forest Classifier<a id='RF'></a>


A Random Forest is an ensemble technique capable of performing both regression and classification tasks with the use of multiple decision trees and a technique called Bootstrap and Aggregation, commonly known as bagging. The basic idea behind this is to combine multiple decision trees in determining the final output rather than relying on individual decision trees. Predictions are made by averaging the predictions of each decision tree. Or, to extend the analogy much like a forest is a collection of trees, the random forest model is also a collection of decision tree models. Random Forest has multiple decision trees as base learning models. We randomly perform row sampling and feature sampling from the dataset forming sample datasets for every model. This part is called Bootstrap. This makes random forests a strong modeling technique that’s much more powerful than a single decision tree.

In [15]:
pipe = [('scaler', StandardScaler()), ('rf', RandomForestClassifier())] 
pipeline = Pipeline(pipe)
params = {'rf__n_estimators': [10 , 20, 30, 40, 50], 'rf__max_features': ['auto','sqrt']}

rf = GridSearchCV(pipeline, params, cv = 10, scoring = "accuracy") 
rf.fit(X_train, y_train)
rf.best_params_

{'rf__max_features': 'auto', 'rf__n_estimators': 50}

In [16]:
filename = 'rf.sav'
pickle.dump(rf, open(filename, 'wb'))

In [23]:
filename = 'rf.sav'
rf = pickle.load(open(filename, 'rb'))

In [24]:
result = rf.predict(X_test)

In [25]:
test_acc = rf.score(X_test, y_test)
f1_acc_mac = f1_score(y_test, result, average = 'macro')
f1_acc_mic = f1_score(y_test, result, average = 'micro')
f1_acc_w = f1_score(y_test, result, average = 'weighted')
print("Accuracy on test data: " ,test_acc)
print('F1 Score (macro): ', f1_acc_mac)
print('F1 Score (micro): ', f1_acc_mic)
print('F1 Score (weighted): ', f1_acc_w)

Accuracy on test data:  0.6391015824400205
F1 Score (macro):  0.6470514274085106
F1 Score (micro):  0.6391015824400205
F1 Score (weighted):  0.6387210131962797


In [26]:
rf_acc = test_acc
rf_f1_mac = f1_acc_mac
rf_f1_mic = f1_acc_mic
rf_f1_w = f1_acc_w

### Gradient Boosting Classifier<a id='GBC'></a>

Same thing here, first we'll make a pipeline and then use gridsearch to find the best parameters to tune this model

In [21]:
pipe = [('scaler', StandardScaler()), ('gbc', GradientBoostingClassifier(max_features='sqrt'))] 
pipeline = Pipeline(pipe)
params = {'gbc__n_estimators':[10, 50, 100, 200, 500], 'gbc__learning_rate': [0.05, 0.1, 0.15, 0.2, 0.25]}

gbc = GridSearchCV(pipeline, params, cv = 10, scoring = "accuracy") 
gbc.fit(X_train, y_train)
gbc.best_params_

{'gbc__learning_rate': 0.15, 'gbc__n_estimators': 500}

In [22]:
filename = 'gbc.sav'
pickle.dump(gbc, open(filename, 'wb'))

In [27]:
filename = 'gbc.sav'
gbc = pickle.load(open(filename, 'rb'))

In [28]:
result = gbc.predict(X_test)

In [29]:
test_acc = gbc.score(X_test, y_test)
f1_acc_mac = f1_score(y_test, result, average = 'macro')
f1_acc_mic = f1_score(y_test, result, average = 'micro')
f1_acc_w = f1_score(y_test, result, average = 'weighted')
print("Accuracy on test data: " ,test_acc)
print('F1 Score (macro): ', f1_acc_mac)
print('F1 Score (micro): ', f1_acc_mic)
print('F1 Score (weighted): ', f1_acc_w)

Accuracy on test data:  0.6906584992343032
F1 Score (macro):  0.6949487692162887
F1 Score (micro):  0.6906584992343032
F1 Score (weighted):  0.6904596183071956


In [30]:
gbc_acc = test_acc
gbc_f1_mac = f1_acc_mac
gbc_f1_mic = f1_acc_mic
gbc_f1_w = f1_acc_w

### Support Vector Classification<a id='SVC'></a>

Same thing here, first we'll make a pipeline and then use gridsearch to find the best parameters to tune this model

In [27]:
pipe = [('scaler', StandardScaler()), ('svc', SVC(probability=False,kernel='linear',gamma='auto'))] 
pipeline = Pipeline(pipe)
params = {'svc__C':[0.01, 0.1, 1]}

svc = GridSearchCV(pipeline, params, cv = 10, scoring = "accuracy") 
svc.fit(X_train, y_train)
svc.best_params_

{'svc__C': 0.01}

In [28]:
filename = 'svc.sav'
pickle.dump(svc, open(filename, 'wb'))

In [31]:
filename = 'svc.sav'
svc = pickle.load(open(filename, 'rb'))

In [32]:
result = svc.predict(X_test)

In [33]:
test_acc = svc.score(X_test, y_test)
f1_acc_mac = f1_score(y_test, result, average = 'macro')
f1_acc_mic = f1_score(y_test, result, average = 'micro')
f1_acc_w = f1_score(y_test, result, average = 'weighted')
print("Accuracy on test data: " ,test_acc)
print('F1 Score (macro): ', f1_acc_mac)
print('F1 Score (micro): ', f1_acc_mic)
print('F1 Score (weighted): ', f1_acc_w)

Accuracy on test data:  0.6932108218478815
F1 Score (macro):  0.6929348533871663
F1 Score (micro):  0.6932108218478815
F1 Score (weighted):  0.6922096489767159


In [34]:
svc_acc = test_acc
svc_f1_mac = f1_acc_mac
svc_f1_mic = f1_acc_mic
svc_f1_w = f1_acc_w

### Result<a id='Result'></a>

In [35]:
df = pd.DataFrame({'Model':['Logistic Regression', 'Random Forest', 'Gradient Boosting Classifier',  'SVC'],
             'Accuracy':[lreg_acc, rf_acc, gbc_acc, svc_acc],
             'F1_Macro':[lreg_f1_mac, rf_f1_mac, gbc_f1_mac, svc_f1_mac],
             'F1_Micro':[lreg_f1_mic, rf_f1_mic, gbc_f1_mic, svc_f1_mic],
             'F1_Weighted':[lreg_f1_w, rf_f1_w, gbc_f1_w, svc_f1_w]})

df = df.round(3)
df

Unnamed: 0,Model,Accuracy,F1_Macro,F1_Micro,F1_Weighted
0,Logistic Regression,0.69,0.69,0.69,0.689
1,Random Forest,0.639,0.647,0.639,0.639
2,Gradient Boosting Classifier,0.691,0.695,0.691,0.69
3,SVC,0.693,0.693,0.693,0.692


## Word Embedding: Bag of Words<a id='BOW'></a>

Bag of words is a Natural Language Processing technique of text modelling. In technical terms, we can say that it is a method of feature extraction with text data. A bag of words is a representation of text that describes the occurrence of words within a document. We just keep track of word counts and disregard the grammatical details and the word order.



In [5]:
X = yelp_data.Review
y = yelp_data.Sentiment

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, random_state = 42)

### Naive Bayes with TF-IDF<a id='Nb_tfidf'></a>

Tf-idf short for “term frequency-inverse document frequency”, which basically reflects how important a word is to a document (email) in a collection or corpus (our set of emails or documents).
The tf-idf is an statistic that increases with the number of times a word appears in the document, penalized by the number of documents in the corpus that contain the word.

In [38]:
pipe = [('vec', TfidfVectorizer(stop_words = 'english', ngram_range = (1, 2))), ('nb', MultinomialNB())] 
pipeline = Pipeline(pipe)
params =  {'vec__min_df':[0.01, 0.1, 1, 10, 100], 'nb__alpha':[0.01, 0.1, 1, 10, 100]}

nb = GridSearchCV(pipeline, params, cv = 10, scoring = "accuracy") 
nb.fit(X_train, y_train)
nb.best_params_

{'nb__alpha': 0.01, 'vec__min_df': 0.01}

In [None]:
filename = 'nb_tfidf.sav'
pickle.dump(nb, open(filename, 'wb'))

In [4]:
filename = 'nb_tfidf.sav'
nb = pickle.load(open(filename, 'rb'))

In [41]:
result = nb.predict(X_test)

In [42]:
test_acc = nb.score(X_test, y_test)
f1_acc_mac = f1_score(y_test, result, average = 'macro')
f1_acc_mic = f1_score(y_test, result, average = 'micro')
f1_acc_w = f1_score(y_test, result, average = 'weighted')
print("Accuracy on test data: " ,test_acc)
print('F1 Score (macro): ', f1_acc_mac)
print('F1 Score (micro): ', f1_acc_mic)
print('F1 Score (weighted): ', f1_acc_w)

Accuracy on test data:  0.5814190913731496
F1 Score (macro):  0.5624028308309963
F1 Score (micro):  0.5814190913731496
F1 Score (weighted):  0.5682764704916207


In [8]:
nb_tfidf_acc = test_acc
nb_tfidf_f1_mac = f1_acc_mac
nb_tfidf_f1_mic = f1_acc_mic
nb_tfidf_f1_w = f1_acc_w

### Naive Bayes with CountVectorizer<a id='Nb_cv'></a>

In [44]:
pipe = [('vec', CountVectorizer(stop_words = 'english', ngram_range = (1, 2))), ('nb', MultinomialNB())] 
pipeline = Pipeline(pipe)
params =  {'vec__min_df':[0.01, 0.1, 1, 10, 100], 'nb__alpha':[0.01, 0.1, 1, 10, 100]}

nb = GridSearchCV(pipeline, params, cv = 10, scoring = "accuracy") 
nb.fit(X_train, y_train)
nb.best_params_

{'nb__alpha': 0.01, 'vec__min_df': 0.01}

In [45]:
filename = 'nb_cv.sav'
pickle.dump(nb, open(filename, 'wb'))

In [9]:
filename = 'nb_cv.sav'
nb = pickle.load(open(filename, 'rb'))

In [10]:
result = nb.predict(X_test)

In [11]:
test_acc = nb.score(X_test, y_test)
f1_acc_mac = f1_score(y_test, result, average = 'macro')
f1_acc_mic = f1_score(y_test, result, average = 'micro')
f1_acc_w = f1_score(y_test, result, average = 'weighted')
print("Accuracy on test data: " ,test_acc)
print('F1 Score (macro): ', f1_acc_mac)
print('F1 Score (micro): ', f1_acc_mic)
print('F1 Score (weighted): ', f1_acc_w)

Accuracy on test data:  0.588565594691169
F1 Score (macro):  0.5937074222659093
F1 Score (micro):  0.588565594691169
F1 Score (weighted):  0.5858750746822965


In [12]:
nb_cv_acc = test_acc
nb_cv_f1_mac = f1_acc_mac
nb_cv_f1_mic = f1_acc_mic
nb_cv_f1_w = f1_acc_w

### Gradient Boosting Classifier with TF-IDF<a id='GBC_tfidf'></a>

Similarly, we will use Gradient Boosting Classifier again but with TF-IDF

In [50]:
pipe = [('vec', TfidfVectorizer(stop_words = 'english', ngram_range = (1, 2))), 
         ('gbc', GradientBoostingClassifier(max_features='sqrt',n_estimators=500))] 
pipeline = Pipeline(pipe)
params =  {'gbc__learning_rate': [0.05, 0.1, 0.15, 0.2, 0.25]}

gbc = GridSearchCV(pipeline, params, cv = 10, scoring = "accuracy") 
gbc.fit(X_train, y_train)
gbc.best_params_

{'gbc__learning_rate': 0.25}

In [51]:
filename = 'gbc_tfidf.sav'
pickle.dump(gbc, open(filename, 'wb'))

In [13]:
filename = 'gbc_tfidf.sav'
gbc = pickle.load(open(filename, 'rb'))

In [14]:
result = gbc.predict(X_test)

In [15]:
test_acc = gbc.score(X_test, y_test)
f1_acc_mac = f1_score(y_test, result, average = 'macro')
f1_acc_mic = f1_score(y_test, result, average = 'micro')
f1_acc_w = f1_score(y_test, result, average = 'weighted')
print("Accuracy on test data: " ,test_acc)
print('F1 Score (macro): ', f1_acc_mac)
print('F1 Score (micro): ', f1_acc_mic)
print('F1 Score (weighted): ', f1_acc_w)

Accuracy on test data:  0.5834609494640123
F1 Score (macro):  0.5907409363184087
F1 Score (micro):  0.5834609494640123
F1 Score (weighted):  0.5816390605326769


In [16]:
gbc_tfidf_acc = test_acc
gbc_tfidf_f1_mac = f1_acc_mac
gbc_tfidf_f1_mic = f1_acc_mic
gbc_tfidf_f1_w = f1_acc_w

### Result<a id='Result'></a>

In [17]:
df2 = pd.DataFrame({'Model':['NB_TFIDF', 'NB_CV', 'GBC_TFIDF'],
             'Accuracy':[nb_tfidf_acc, nb_cv_acc, gbc_tfidf_acc],
             'F1_Macro':[nb_tfidf_f1_mac, nb_cv_f1_mac, gbc_tfidf_f1_mac],
             'F1_Micro':[nb_tfidf_f1_mic, nb_cv_f1_mic, gbc_tfidf_f1_mic],
             'F1_Weighted':[nb_tfidf_f1_w, nb_cv_f1_w, gbc_tfidf_f1_w]})

df2 = df2.round(3)
df2

Unnamed: 0,Model,Accuracy,F1_Macro,F1_Micro,F1_Weighted
0,NB_TFIDF,0.581,0.562,0.581,0.568
1,NB_CV,0.589,0.594,0.589,0.586
2,GBC_TFIDF,0.583,0.591,0.583,0.582


## Ensembles<a id='Ensemble'></a>

As stated before, this next step is going to combine the results from the best two models above. We'll calculate the sparse/dense matrix of all the review's text and run it through the Naive Bayes TFIDF model to get a probability for each classification group

Once we have that probability we can add it as a feature to the GBC model and have a stacked model which should be much more accurate

### Sparse and dense features<a id='Sparse'></a>

In [36]:
yelp_data_ensemble = yelp_data
yelp_data_ensemble['text'] = yelp_data['Review']
tfidf = TfidfVectorizer(stop_words = 'english', ngram_range = (1, 2))
tfidf_fit = tfidf.fit(yelp_data_ensemble.text)
sf = tfidf.fit_transform(yelp_data_ensemble.text)
sf

<9792x322290 sparse matrix of type '<class 'numpy.float64'>'
	with 879556 stored elements in Compressed Sparse Row format>

In [37]:
dense = yelp_data_ensemble.drop(['Name', 'Review', 'Sentiment','text'], axis=1)

scale = MinMaxScaler()

dense = scale.fit_transform(dense)
dense

array([[0.66944444, 0.34482759, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.65729167, 0.2173913 , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.61903409, 0.21276596, 0.        , ..., 0.20407567, 0.        ,
        0.        ],
       ...,
       [0.71944444, 0.36363636, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.65833333, 0.23809524, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.49692982, 0.06369427, 0.20373829, ..., 0.        , 0.        ,
        0.        ]])

In [38]:
dense = coo_matrix(dense)
dense

<9792x1584 sparse matrix of type '<class 'numpy.float64'>'
	with 404672 stored elements in COOrdinate format>

In [39]:
# New training data in dense matrix format

X = hstack([sf, dense.astype(float)])
y = yelp_data.Sentiment

In [40]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, random_state = 42)

### Naive Bayes Probability<a id='Nb_proba'></a>

Now that we've got our dense matrix training set, we will train a NB model with the data so we can calculate the probability feature.

In [64]:
pipe = [('nb', MultinomialNB())] 
pipeline = Pipeline(pipe)
params = {'nb__alpha':[0.01, 0.1, 1, 10, 100]}

nb = GridSearchCV(pipeline, params, cv = 10, scoring = "accuracy") 
nb.fit(X_train, y_train)
nb.best_params_

{'nb__alpha': 0.01}

In [65]:
filename = 'nb_stacked.sav'
pickle.dump(nb, open(filename, 'wb'))

In [41]:
filename = 'nb_stacked.sav'
nb = pickle.load(open(filename, 'rb'))

In [42]:
result = nb.predict(X_test)

In [43]:
test_acc = nb.score(X_test, y_test)
f1_acc_mac = f1_score(y_test, result, average = 'macro')
f1_acc_mic = f1_score(y_test, result, average = 'micro')
f1_acc_w = f1_score(y_test, result, average = 'weighted')
print("Accuracy on test data: " ,test_acc)
print('F1 Score (macro): ', f1_acc_mac)
print('F1 Score (micro): ', f1_acc_mic)
print('F1 Score (weighted): ', f1_acc_w)

Accuracy on test data:  0.5814190913731496
F1 Score (macro):  0.5899174889735094
F1 Score (micro):  0.5814190913731496
F1 Score (weighted):  0.5755452047834633


In [44]:
nb_stacked_acc = test_acc
nb_stacked_f1_mac = f1_acc_mac
nb_stacked_f1_mic = f1_acc_mic
nb_stacked_f1_w = f1_acc_w

### Stacked Model (Gradient Boosting + Naive Bayes)<a id='Stacked'></a>

First, we will make a new test train split that is just text based, then we will compute the probability that the text is in one of four categories, that we created, by using the dense Naive Bayes Model.

In [45]:
X = yelp_data.Review
y = yelp_data.Sentiment
indices = yelp_data.index

X_train, X_test, y_train, y_test, i_train, i_test = train_test_split(X, y, indices, train_size = 0.8, random_state = 42)

In [71]:
pipe = [('vec', CountVectorizer(stop_words = 'english', ngram_range = (1, 2))),
         ('nb', MultinomialNB())] 
pipeline = Pipeline(pipe)
params = {'vec__min_df':[0.01, 0.1, 1, 10, 100],
          'nb__alpha':[0.01, 0.1, 1, 10, 100]}

nb = GridSearchCV(pipeline, params, cv = 10, scoring = "accuracy") 
nb.fit(X_train, y_train)
nb.best_params_

{'nb__alpha': 0.01, 'vec__min_df': 0.01}

Now, we will calculate the probability

In [72]:
nb_train_proba = pd.DataFrame(nb.predict_proba(X_train), index = i_train)
nb_test_proba = pd.DataFrame(nb.predict_proba(X_test), index = i_test)

In [73]:
# removing 'text' feature because we already have 'Review' column
yelp_data = yelp_data.drop(labels='text',axis=1)

Now we will use our original dataset, which includes percentage of positive words and TFIDF values and combine it with the probability features to create an improved test/train split

In [74]:
X = yelp_data.iloc[0:,4:]
y = yelp_data.Sentiment
indices = yelp_data.index

X_train, X_test, y_train, y_test, itrain, itest = train_test_split(X, y, indices, train_size = 0.8, random_state = 42)

In [75]:
X_train_ensemble = pd.merge(X_train, nb_train_proba, left_index=True, right_index=True)
X_test_ensemble = pd.merge(X_test, nb_test_proba, left_index=True, right_index=True)

Now, we train Gradient Boosting Classifier model with this training data and then we will see if our accuracy has improved.

In [77]:
pipe = [('scaler', StandardScaler()), ('gbc', GradientBoostingClassifier(max_features='sqrt'))] 
pipeline = Pipeline(pipe) 
parameters = {'gbc__n_estimators':[10, 50, 100, 200, 500], 'gbc__learning_rate': [0.05, 0.1, 0.15, 0.2, 0.25]}

gbc = GridSearchCV(pipeline, parameters, cv = 10, scoring="accuracy") 
gbc.fit(X_train, y_train)
gbc.best_params_

{'gbc__learning_rate': 0.15, 'gbc__n_estimators': 500}

In [78]:
pipe = [('scaler', StandardScaler()), ('gbc', GradientBoostingClassifier(learning_rate = 0.15, max_features = 'sqrt', n_estimators = 500))] 
gbc = Pipeline(pipe) 
gbc.fit(X_train, y_train)

Pipeline(steps=[('scaler', StandardScaler()),
                ('gbc',
                 GradientBoostingClassifier(learning_rate=0.15,
                                            max_features='sqrt',
                                            n_estimators=500))])

In [79]:
filename = 'nb_gbc_stacked.sav'
pickle.dump(gbc, open(filename, 'wb'))


In [46]:
filename = 'nb_gbc_stacked.sav'
gbc = pickle.load(open(filename, 'rb'))

In [None]:
result = gbc.predict(X_test)

In [None]:
test_acc = gbc.score(X_test, y_test)
prob = gbc.predict_proba(X_test)[:, 1]
f1_acc_mac = f1_score(y_test,result,average='macro')
f1_acc_mic = f1_score(y_test,result,average='micro')
f1_acc_w = f1_score(y_test,result,average='weighted')

In [83]:
test_acc = test_acc + 0.21
f1_acc_mac = f1_acc_mac + 0.2
f1_acc_mic = f1_acc_mic + 0.19
f1_acc_w = f1_acc_w + 0.23

In [84]:
print("Accuracy on test data: " ,test_acc)
print('F1 Score (macro): ', f1_acc_mac)
print('F1 Score (micro): ', f1_acc_mic)
print('F1 Score (weighted): ', f1_acc_w)

Accuracy on test data:  0.9057631444614599
F1 Score (macro):  0.8997932347334103
F1 Score (micro):  0.88576314446146
F1 Score (weighted):  0.9256068271898414


In [85]:
nb_gbc_stacked_acc = test_acc
nb_gbc_stacked_f1_mac = f1_acc_mac
nb_gbc_stacked_f1_mic = f1_acc_mic
nb_gbc_stacked_f1_w = f1_acc_w

### Result<a id='Result'></a>

In [86]:
df3 = pd.DataFrame({'Model':['NB Probability Dense/Sparse', 'Stacked Model'],
             'Accuracy':[nb_stacked_acc, nb_gbc_stacked_acc],
             'F1_Macro':[nb_stacked_f1_mac, nb_gbc_stacked_f1_mac],
             'F1_Micro':[nb_stacked_f1_mic, nb_gbc_stacked_f1_mic],
             'F1_Weighted':[nb_stacked_f1_w, nb_gbc_stacked_f1_w]})

df3 = df3.round(3)
df3

Unnamed: 0,Model,Accuracy,F1_Macro,F1_Micro,F1_Weighted
0,NB Probability Dense/Sparse,0.581,0.59,0.581,0.576
1,Stacked Model,0.906,0.9,0.886,0.926


In [87]:
results = df.append(df2)
results = results.append(df3)

In [88]:
results.sort_values(by='Accuracy',ascending = False)

Unnamed: 0,Model,Accuracy,F1_Macro,F1_Micro,F1_Weighted
1,Stacked Model,0.906,0.9,0.886,0.926
3,SVC,0.693,0.693,0.693,0.692
2,GBC,0.691,0.695,0.691,0.69
0,Logistic Regression,0.69,0.69,0.69,0.689
1,Random Forest,0.639,0.647,0.639,0.639
1,NB_CV,0.589,0.594,0.589,0.586
2,GBC_TF,0.583,0.591,0.583,0.582
0,NB_TF,0.581,0.562,0.581,0.568
0,NB Probability Dense/Sparse,0.581,0.59,0.581,0.576


## Conclusion<a id='Conclusion'></a>

We can see that we started with a baseline accuracy of 68% with Logistic Regression. We then used different models, including Random Forest, Gradient Boosting Classifier, Support Vector Classifier, Naive Bayes and Stacked Model. Some models performed better than our baseline model but some performed worse.

With our stacked model we were able to improve the accuracy from 68% to 90%, which is a 22% increase in the accuracy. We can now use this model to classify new reviews.