We start by importing the libraries we will be using in our task.

In [1]:
# Importing the libraries
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import GridSearchCV, StratifiedKFold

First, we read the dataset. We then use the Natural Language Toolkit (nltk) to clean the text and scikit-learn's `CountVectorizer` to build the bag-of-words model.

In [2]:
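# Importing the dataset (tab-separated file)
# quoting=3 is csv.QUOTE_NONE, so quote characters inside reviews are kept as plain text
dataset = pd.read_csv('restaurant.tsv', delimiter='\t', quoting=3)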

In [3]:
# Cleaning the texts
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
# Build the stemmer and the stop-word set once, outside the loop
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))
corpus = []
for text in dataset['Review']:
    # Keep letters only, lowercase, and split into words
    review = re.sub('[^a-zA-Z]', ' ', text).lower().split()
    # Drop stop words and stem the remaining words
    review = [ps.stem(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\chaouki\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
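
To make the cleaning step concrete, the sketch below applies the same regular expression, lowercasing, and stop-word filtering to a single made-up review. It is only an illustration: the review text and the small `STOPWORDS` set are hypothetical stand-ins for the dataset and for NLTK's English stop-word list, and stemming is left out so that the snippet runs without NLTK.

```python
import re

# Hypothetical, tiny stand-in for NLTK's English stop-word list.
STOPWORDS = {"the", "was", "and", "this", "is", "a", "i"}

def clean_review(text):
    """Keep letters only, lowercase, split, and drop stop words (no stemming)."""
    letters_only = re.sub('[^a-zA-Z]', ' ', text)
    tokens = letters_only.lower().split()
    return ' '.join(t for t in tokens if t not in STOPWORDS)

# Made-up review, not taken from restaurant.tsv.
cleaned = clean_review("Wow... The crust was NOT good, and I waited 30 minutes!")
print(cleaned)
```

Punctuation and digits are replaced by spaces before splitting, so "30" disappears along with the stop words.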


In [4]:
# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 2000)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values
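
As a rough picture of what the bag-of-words matrix `X` contains, the sketch below builds a word-count matrix by hand for three made-up cleaned reviews. The names `toy_corpus`, `vocabulary`, and `to_counts` are hypothetical, and this is a simplification rather than `CountVectorizer` itself, which by default also ignores single-character tokens and, with `max_features`, keeps only the most frequent terms.

```python
from collections import Counter

# Hypothetical cleaned reviews, standing in for `corpus`.
toy_corpus = ["love place", "crust not good", "love crust love"]

# Vocabulary: every distinct token, sorted alphabetically.
vocabulary = sorted({token for doc in toy_corpus for token in doc.split()})

def to_counts(doc):
    """Return one row of the document-term matrix for `doc`."""
    counts = Counter(doc.split())
    return [counts[token] for token in vocabulary]

toy_X = [to_counts(doc) for doc in toy_corpus]
print(vocabulary)
for row in toy_X:
    print(row)
```

Each row is one review and each column is one vocabulary word, so word order within a review is discarded and only the counts remain.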

To evaluate the models, we use a 5-fold stratified cross-validator. In this notebook, we fit two classification models: *(i)* a logistic regression and *(ii)* a random forest. The evaluation metric is the ROC-AUC score. A grid search is used to find the parameters of each model that maximize the area under the ROC curve.
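
For intuition about the metric, the ROC-AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way on a few made-up labels and scores. The function `roc_auc` and the toy data are hypothetical and written for clarity, not speed; it is not how scikit-learn computes the score internally.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied pairs count as half a correct ranking.
    """
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted scores.
toy_labels = [1, 0, 1, 0, 1]
toy_scores = [0.9, 0.4, 0.35, 0.2, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
print(toy_auc)
```

Here five of the six positive-negative pairs are ordered correctly, so the score is 5/6. A value of 0.5 corresponds to random ranking and 1.0 to a perfect one.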

In [7]:
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
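
The idea behind stratification is that every fold keeps roughly the same class proportions as the full dataset. The sketch below shows one simple way to get that property, by shuffling the indices of each class and dealing them round-robin into folds. The function `stratified_folds` and the toy labels are hypothetical, and this is not scikit-learn's exact algorithm, only an illustration of the goal.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_splits=5, seed=0):
    """Assign each sample index to a fold so that every fold keeps the class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across folds.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds

# Made-up balanced labels: 10 positive and 10 negative reviews.
toy_y = [1] * 10 + [0] * 10
folds = stratified_folds(toy_y)
for fold in folds:
    print(sorted(fold), sum(toy_y[i] for i in fold))
```

With 10 positives and 10 negatives split five ways, each fold ends up with 2 of each class.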

In [15]:
# Fitting a logistic regression
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr_params = {'C': np.logspace(-8, 8, 17)}
log_grid = GridSearchCV(lr, param_grid=lr_params, scoring='roc_auc', cv=skf, n_jobs=-1, verbose=1)
log_grid.fit(X, y)
# Best score and best parameter
print('Best score:', log_grid.best_score_)
print('Best parameter:', log_grid.best_params_)
# Mean cross-validated ROC-AUC for each value of C, in grid order
print(log_grid.cv_results_['mean_test_score'])

Fitting 5 folds for each of 17 candidates, totalling 85 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    2.6s


Best score: 0.85715
Best parameter: {'C': 1.0}
[0.80625 0.80708 0.80586 0.80773 0.80773 0.80789 0.81425 0.83885 0.85715
 0.85239 0.84135 0.82757 0.81225 0.79149 0.77703 0.77391 0.77351]


[Parallel(n_jobs=-1)]: Done  85 out of  85 | elapsed:    4.4s finished
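
The output can be cross-checked by hand. The sketch below assumes that the 17-value array printed above is the mean cross-validated ROC-AUC of each candidate, in the same order as the grid of `C` values. It rebuilds that grid as plain powers of ten, which is what `np.logspace(-8, 8, 17)` produces, and looks up the candidate with the highest mean score. The names `c_values`, `mean_scores`, `best_index`, and `best_c` are hypothetical.

```python
# Candidate values of C: the same powers of ten as np.logspace(-8, 8, 17).
c_values = [10.0 ** k for k in range(-8, 9)]

# Mean ROC-AUC per candidate, copied from the cell output above.
mean_scores = [0.80625, 0.80708, 0.80586, 0.80773, 0.80773, 0.80789,
               0.81425, 0.83885, 0.85715, 0.85239, 0.84135, 0.82757,
               0.81225, 0.79149, 0.77703, 0.77391, 0.77351]

# Index of the highest mean score, and the C value at that position.
best_index = max(range(len(mean_scores)), key=mean_scores.__getitem__)
best_c = c_values[best_index]
print(best_index, best_c, mean_scores[best_index])
```

The maximum sits in the middle of the grid, at `C = 1.0`, and the scores fall off on both sides. In scikit-learn's `LogisticRegression`, `C` is the inverse of the regularization strength, so smaller values mean stronger regularization.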


In [22]:
# Fitting a random forest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=0)
n_estimators = [10, 50, 100]
max_depth_values = [5, 6, 7, 8, 9]
max_features_values = [4, 5, 6, 7]
tree_params = {'n_estimators' : n_estimators,
               'max_depth': max_depth_values,
               'max_features': max_features_values}
rf_grid = GridSearchCV(estimator=rf, param_grid=tree_params,
                       scoring='roc_auc', n_jobs=-1, cv=skf, verbose=1)
rf_grid.fit(X, y)
# Best score and best parameters
print('Best score:', rf_grid.best_score_)
print('Best parameters:', rf_grid.best_params_)

Fitting 5 folds for each of 60 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    5.2s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   15.1s
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:   22.6s finished


Best score: 0.83188
Best parameters: {'max_depth': 9, 'max_features': 7, 'n_estimators': 100}
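
The counts in the log follow directly from the grid. The sketch below enumerates the random forest candidates with `itertools.product`, multiplies by the 5 folds, and then checks whether the reported best values lie on the upper edge of their ranges. The names `candidates`, `n_fits`, `best`, and `on_upper_edge` are hypothetical helpers for this check only.

```python
from itertools import product

n_estimators = [10, 50, 100]
max_depth_values = [5, 6, 7, 8, 9]
max_features_values = [4, 5, 6, 7]

# Every combination of the three parameter lists is one candidate.
candidates = list(product(n_estimators, max_depth_values, max_features_values))
n_fits = len(candidates) * 5  # 5 folds per candidate
print(len(candidates), n_fits)

# Best values as reported in the output above.
best = {'max_depth': 9, 'max_features': 7, 'n_estimators': 100}
on_upper_edge = {
    'max_depth': best['max_depth'] == max(max_depth_values),
    'max_features': best['max_features'] == max(max_features_values),
    'n_estimators': best['n_estimators'] == max(n_estimators),
}
print(on_upper_edge)
```

All three best values are the largest ones tried, which suggests that a wider grid might be worth exploring. Whether that would close the gap to the logistic regression score is not something these results show.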
