# ***Tanguy Dabadie NLP Project - Basic Model***

In [25]:
import numpy as np
import pandas as pd
import plotly.express as px
from collections import Counter

# Different vectorizers
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# Different model architectures
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

# Importing the preprocessing functions from our python file
from preprocessing_function import preprocess_text, preprocess_text_without_top_ten

# Importing the Model class, which makes it easier to iterate over Machine Learning models
from mlModel import Model

Let's start from the DataFrame we built in the Exploratory Data Analysis notebook :

In [26]:
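# read_csv already returns a DataFrame, so no pd.DataFrame() wrapper is needed
df = pd.read_csv('Restaurant reviews.csv')
# Keep only the review text and its rating
df = df.drop(['Restaurant','Reviewer','Metadata','Time','Pictures','7514'], axis=1)
# Drop rows with a missing rating, then keep only the integer ratings 1 to 5
df['Rating'] = df['Rating'].replace('nan', np.nan)
df = df.dropna(subset=['Rating'])
df = df[~df['Rating'].isin(['Like', '1.5', '2.5', '3.5', '4.5'])]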

print(f" The rating unique values are : {df['Rating'].unique()}")
df

 The rating unique values are : ['5' '4' '1' '3' '2']


| | Review | Rating |
|---|---|---|
| 0 | The ambience was good, food was quite good . h... | 5 |
| 1 | Ambience is too good for a pleasant evening. S... | 5 |
| 2 | A must try.. great food great ambience. Thnx f... | 5 |
| 3 | Soumen das and Arun was a great guy. Only beca... | 5 |
| 4 | Food is good.we ordered Kodi drumsticks and ba... | 5 |
| ... | ... | ... |
| 9991 | I was never a fan of Chinese food until I visi... | 5 |
| 9992 | I visited this restaurant with friends and was... | 5 |
| 9993 | Im going to cut to the chase, The food is exce... | 5 |
| 9995 | Madhumathi Mahajan Well to start with nice cou... | 3 |


## ***1. Preprocessing the data***

Let's preprocess the data using the preprocessing function we created in the preprocessing_function.py file :

In [27]:
processed = df['Review'].iloc[:200].apply(lambda x: preprocess_text_without_top_ten(str(x)) if not isinstance(x, float) else "")
processed

0      quite saturday lunch cost effective sate brunc...
1      pleasant evening prompt experience soumen da k...
2      must try great great thnx pradeep subroto pers...
3      soumen da arun great guy behavior sincerety co...
4      ordered kodi drumstick basket mutton thanks pr...
                             ...                        
195    chiken musallam really dat keema also flavour ...
196    experienced best haleem murg musalam taste awe...
197    nyc restaurant come taste haleem even veg sooo...
198    tasty health haleem atmosphere also loved bett...
199    tasty specially satisfied suggest come try gre...
Name: Review, Length: 200, dtype: object

As indicated in the file, our preprocessing function tokenizes the text, removes punctuation, converts it to lowercase, removes stopwords and lemmatizes it, before removing the 10 most common words : 'good', 'food', 'service', 'place', 'biryani', 'ambience', 'nice', 'visit', 'time', 'staff'. The next section shows how such word counts can be retrieved.
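For readers without preprocessing_function.py at hand, the steps listed above can be approximated with the standard library alone. This is a simplified stand-in, not the author's function: the name `sketch_preprocess`, the tiny `STOPWORDS` set and the regex-based tokenization are assumptions made for illustration, and lemmatization is left out entirely.

```python
import re

# Tiny illustrative stopword list (assumption); the real function presumably uses a full one.
STOPWORDS = {"the", "was", "is", "a", "an", "and", "for", "to", "of", "in", "it", "too", "very"}

# The ten most common words listed in the notebook.
TOP_TEN = {"good", "food", "service", "place", "biryani",
           "ambience", "nice", "visit", "time", "staff"}


def sketch_preprocess(text, remove_top_ten=True):
    """Lowercase, strip punctuation, tokenize, drop stopwords and (optionally) the top-ten words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # replace punctuation and digits with spaces
    tokens = text.split()                  # whitespace tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]
    if remove_top_ten:
        tokens = [t for t in tokens if t not in TOP_TEN]
    return " ".join(tokens)


print(sketch_preprocess("The ambience was good, food was quite good."))
```

Because the function takes a string and returns a string, it has the shape expected by the `preprocessor` argument of `CountVectorizer` and `TfidfVectorizer`, which is how the notebook plugs in its own preprocessing functions.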

## Most and least common words

Let's look at the most and least common words in our text after removing the 10 most common ones :

In [37]:
# Join all the preprocessed reviews into a single string
joined_text = " ".join(processed)

# Tokenize the text on whitespace using split()
split = joined_text.split()

# Count the occurrences of each token with the Counter class
word_counter = Counter(split)

# most_common() returns (word, count) pairs sorted from most to least frequent
most_common = word_counter.most_common()

most_common[:10]

[('really', 51),
 ('restaurant', 50),
 ('chicken', 50),
 ('also', 48),
 ('great', 48),
 ('experience', 30),
 ('best', 30),
 ('taste', 30),
 ('tasty', 29),
 ('well', 28)]

In [29]:
# Now the least common ones
most_common[-10:]

[('murg', 1),
 ('musalam', 1),
 ('respectable', 1),
 ('nyc', 1),
 ('sooo', 1),
 ('imporvement', 1),
 ('health', 1),
 ('atmosphere', 1),
 ('sense', 1),
 ('humour', 1)]

A possible future change, which may or may not improve our model, would be to have the preprocessing function remove the words that occur only once in our text.
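As a rough illustration of that idea, the sketch below drops every word whose total count over the corpus is below a threshold. The helper name `drop_rare_words` and the toy documents are hypothetical; nothing like this exists in preprocessing_function.py as far as the notebook shows.

```python
from collections import Counter


def drop_rare_words(documents, min_count=2):
    """Remove every word whose total count across all documents is below min_count."""
    counts = Counter(word for doc in documents for word in doc.split())
    return [
        " ".join(word for word in doc.split() if counts[word] >= min_count)
        for doc in documents
    ]


docs = ["tasty haleem great taste", "great haleem sooo tasty", "nyc restaurant great"]
print(drop_rare_words(docs))
```

A related option is the `min_df` parameter of `CountVectorizer` and `TfidfVectorizer`, which ignores terms that appear in fewer than the given number of documents. That is a document-frequency cutoff, so it is close to, but not exactly the same as, the total-count cutoff sketched here.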

## ***2. Train a baseline model***

The goal here is simply to obtain a baseline model that we'll use as a reference for future experiments. Let's train a first machine learning model without any particular parameter tuning or feature engineering.

## Model definition without top 10 most common words

Let's use the class detailed in the mlModel.py file to run some Machine Learning models on our data :

In [33]:
# Set the variables of our model
X = df['Review']
y = df['Rating']
class_labels = ['1', '2', '3', '4', '5']

I encountered an issue : some values of the Review column in the X_train data are NaN. This is handled inside the Model class in the mlModel.py file :

In [34]:
baseline_model = Model(X, y, MultinomialNB(), CountVectorizer(preprocessor=preprocess_text_without_top_ten))
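Since mlModel.py is not reproduced in this notebook, here is a minimal sketch of what such a wrapper could look like. Everything in it is an assumption: the class name `SketchModel`, the 80/20 split, the `fillna("")` handling of missing reviews and the pipeline step names (borrowed from the keys used later in the GridSearchCV parameters) are guesses, not the author's actual code.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


class SketchModel:
    """Hypothetical stand-in for mlModel.Model; the real class may differ."""

    def __init__(self, X, y, estimator, vectorizer, test_size=0.2, random_state=42):
        # Replace missing reviews with empty strings so the vectorizer never sees NaN.
        X = pd.Series(X).fillna("").astype(str)
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
            X, y, test_size=test_size, random_state=random_state
        )
        # Step names are assumed from the parameter keys used in the grid search.
        self.pipeline = Pipeline([
            ("Vectorizer", vectorizer),
            ("Model_Architecture", estimator),
        ])

    def fit(self):
        self.pipeline.fit(self.X_train, self.y_train)
        return self

    def class_report(self, class_labels):
        y_pred = self.pipeline.predict(self.X_test)
        print(classification_report(self.y_test, y_pred,
                                    labels=class_labels, zero_division=0))


# Toy data with one missing review, to exercise the NaN handling.
reviews = ["great food", "awful service", None, "great place", "awful taste", "great staff"]
ratings = ["5", "1", "1", "5", "1", "5"]
model = SketchModel(reviews, ratings, MultinomialNB(), CountVectorizer()).fit()
model.class_report(["1", "5"])
```

Wrapping the vectorizer and the classifier in a single `Pipeline` keeps the vocabulary fitted on the training split only, which avoids leaking test-set words into training.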

## Fitting and prediction

In [35]:
baseline_model.fit()

## Report

The report is composed of a classification report, with several metrics that describe the model's performance, and a confusion matrix :

In [36]:
baseline_model.class_report(class_labels)

              precision    recall  f1-score   support

           1       0.69      0.85      0.76       352
           2       0.00      0.00      0.00       138
           3       0.39      0.16      0.22       224
           4       0.46      0.57      0.51       473
           5       0.74      0.80      0.77       775

    accuracy                           0.62      1962
   macro avg       0.46      0.48      0.45      1962
weighted avg       0.57      0.62      0.59      1962
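To make the columns of this report concrete, the sketch below recomputes a few of its figures from the rounded values printed above. The dictionary is simply copied from the report; the helper name `f1_score_from` is made up for the example. Because the inputs are already rounded to two decimals, only some figures can be reproduced exactly this way.

```python
# (precision, recall, f1-score, support) per class, copied from the report above.
report = {
    "1": (0.69, 0.85, 0.76, 352),
    "2": (0.00, 0.00, 0.00, 138),
    "3": (0.39, 0.16, 0.22, 224),
    "4": (0.46, 0.57, 0.51, 473),
    "5": (0.74, 0.80, 0.77, 775),
}


def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


total_support = sum(row[3] for row in report.values())

# Macro average: plain mean over classes, every class counts the same.
macro_f1 = sum(row[2] for row in report.values()) / len(report)

# Weighted average: mean over classes, weighted by their support.
weighted_f1 = sum(row[2] * row[3] for row in report.values()) / total_support

print(round(f1_score_from(0.69, 0.85), 2))        # f1-score of class 1
print(round(macro_f1, 2), round(weighted_f1, 2))  # macro and weighted f1
```

The gap between the macro average (0.45) and the weighted average (0.59) comes from class 2: its f1-score of 0.00 pulls the macro average down as much as any other class would, while its small support gives it little weight in the weighted average.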



## Model definition with top 10 most common words

Let's see if we get a better accuracy score when we keep the 10 most common words :

In [13]:
baseline_model = Model(X, y, MultinomialNB(), CountVectorizer(preprocessor=preprocess_text))

## Fitting and prediction

In [14]:
baseline_model.fit()

## Report

In [15]:
baseline_model.class_report(class_labels)

              precision    recall  f1-score   support

           1       0.72      0.85      0.78       352
           2       0.00      0.00      0.00       138
           3       0.40      0.16      0.23       224
           4       0.46      0.57      0.51       473
           5       0.73      0.81      0.77       775

    accuracy                           0.63      1962
   macro avg       0.46      0.48      0.46      1962
weighted avg       0.58      0.63      0.59      1962



Given the accuracy scores (0.63 versus 0.62), our baseline model will use the preprocessing function that does not remove the ten most common words.

### What we can see from that first model :

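First, we obtain an overall accuracy score of 0.63, which is not very good. A first explanation is visible in the support column : classes 2 and 3 have few samples (138 and 224 out of 1962 in the test set), which gives the model little to learn from, and class 2 is never predicted correctly.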
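The imbalance can be quantified directly from the support column of the reports. The short sketch below uses only those printed numbers; the comparison with an "always predict the most frequent rating" classifier is added here as a reference point and is not part of the original analysis.

```python
# Test-set support per rating, copied from the classification reports.
support = {"1": 352, "2": 138, "3": 224, "4": 473, "5": 775}
total = sum(support.values())

# Share of each rating in the test set.
for label, count in support.items():
    print(f"rating {label}: {count / total:.1%}")

# Accuracy of a trivial classifier that always predicts the most frequent rating.
majority_label = max(support, key=support.get)
majority_accuracy = support[majority_label] / total
print(f"always predicting {majority_label}: accuracy {majority_accuracy:.3f}")
```

Ratings 2 and 3 together make up less than a fifth of the test set, and an accuracy of 0.63 is better read against the roughly 0.40 of the trivial classifier than against zero.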

## ***3. Improve on the baseline***

### First analysis on the Vectorizer/Model Architecture combination

Now that we have set the baseline model, let's dig deeper and try to improve it. The first thing I want to do is find the best possible settings for the baseline, starting with the combination of vectorizer and model architecture that gives the best accuracy score.

To do this, let's use sklearn.model_selection.GridSearchCV, which finds the best combination among a given set of candidates :

#### Parameters of our GridSearchCV

To keep the running time reasonable, several model architectures are commented out; they have been tested and give the same result :

In [16]:
parameters = {
    'Vectorizer': [CountVectorizer(preprocessor=preprocess_text), TfidfVectorizer(preprocessor=preprocess_text)],
    'Model_Architecture': [MultinomialNB(), SVC(random_state=42)]
}
# OTHER MODEL ARCHITECTURES TESTED :
# RandomForestClassifier(random_state=42), KNeighborsClassifier(), XGBClassifier(random_state=42), LogisticRegression(), GradientBoostingClassifier(random_state=42)

To run this search through our class, I created a Model.fit_with_grid_search() method in the mlModel.py file :

In [17]:
# Instantiate the model
baseline_model = Model(X, y, MultinomialNB(), CountVectorizer(preprocessor=preprocess_text))

# Fit with GridSearchCV
baseline_model.fit_with_grid_search(parameters)

# Get the best estimator
best_estimator = baseline_model.grid_search.best_estimator_

# Get the best mean cross-validated score
accuracy_score = baseline_model.grid_search.best_score_

# Print the best parameters and accuracy score
print("Best parameters:", baseline_model.grid_search.best_params_)
print("Best accuracy score:", accuracy_score)

Best parameters: {'Model_Architecture': SVC(random_state=42), 'Vectorizer': TfidfVectorizer(preprocessor=<function preprocess_text at 0x0000022F6E382340>)}
Best accuracy score: 0.6252554874905109
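The parameter keys 'Vectorizer' and 'Model_Architecture' suggest that the class wraps a scikit-learn `Pipeline` whose steps carry those names, since GridSearchCV can swap whole pipeline steps when a step name is used as a key. The sketch below shows that mechanism on toy data; it is an assumption about how fit_with_grid_search might work, not the code from mlModel.py, and the toy reviews, `cv=2` and `scoring="accuracy"` are chosen only to keep the example small.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Pipeline with the step names implied by the parameter keys (assumption).
pipeline = Pipeline([
    ("Vectorizer", CountVectorizer()),
    ("Model_Architecture", MultinomialNB()),
])

# Using a step name as a key replaces the whole step with each candidate.
parameters = {
    "Vectorizer": [CountVectorizer(), TfidfVectorizer()],
    "Model_Architecture": [MultinomialNB(), SVC(random_state=42)],
}

# Toy reviews and ratings, for illustration only.
texts = ["great food", "great service", "lovely place", "tasty great biryani",
         "awful food", "awful service", "terrible place", "cold awful biryani"]
labels = ["5", "5", "5", "5", "1", "1", "1", "1"]

search = GridSearchCV(pipeline, parameters, cv=2, scoring="accuracy")
search.fit(texts, labels)

print("Best parameters:", search.best_params_)
print("Best mean cross-validated accuracy:", search.best_score_)
```

Note that `best_score_` is the mean score over the cross-validation folds of the training data, so it is not directly comparable with the accuracy printed by a classification report on a held-out test set.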


As mentioned earlier, testing these other model architectures did not improve the accuracy score much, and the best combination remains an SVC with a TF-IDF vectorizer :

In [18]:
# Model
baseline_model2 = Model(X, y, SVC(), TfidfVectorizer(preprocessor=preprocess_text))

# Fitting the model
baseline_model2.fit()

# Display the report
baseline_model2.class_report(class_labels)

              precision    recall  f1-score   support

           1       0.71      0.87      0.78       352
           2       0.22      0.01      0.03       138
           3       0.46      0.18      0.26       224
           4       0.49      0.53      0.51       473
           5       0.72      0.85      0.78       775

    accuracy                           0.64      1962
   macro avg       0.52      0.49      0.47      1962
weighted avg       0.60      0.64      0.60      1962



After this search, the best accuracy score we get so far is 0.64, only marginally better than the 0.63 of my first baseline model. Let's move on to another idea for optimising the model.

### Second analysis on the hyperparameters

Now that the model architecture and the vectorizer are chosen, let's try to optimize the hyperparameters of our SVC model :

#### Kernel hyperparameter

In [19]:
# Instantiate the Model class with necessary parameters
model = Model(X, y, SVC(), TfidfVectorizer(preprocessor=preprocess_text))
model.fit()
model.grid_search_kernel(['linear', 'poly', 'rbf', 'sigmoid'])
model.class_report(class_labels)

Best kernel:  rbf
              precision    recall  f1-score   support

           1       0.71      0.87      0.78       352
           2       0.22      0.01      0.03       138
           3       0.46      0.18      0.26       224
           4       0.49      0.53      0.51       473
           5       0.72      0.85      0.78       775

    accuracy                           0.64      1962
   macro avg       0.52      0.49      0.47      1962
weighted avg       0.60      0.64      0.60      1962



Unfortunately, even after searching for the best kernel, the results are unchanged. This is expected : rbf is already the default kernel of SVC, so the selected model is the same as before. Let's proceed with a second hyperparameter : C

#### C hyperparameter

As with the model's parameters and the kernel hyperparameter, the GridSearchCV is implemented in another method of our class, called grid_search_C (in the mlModel.py file) :

In [20]:
# Instantiate the Model class with necessary parameters
model = Model(X, y, SVC(kernel='rbf'), TfidfVectorizer(preprocessor=preprocess_text))
model.fit()
model.grid_search_C([0.1, 1, 10, 100, 1000])
model.class_report(class_labels)

Best C:  10
              precision    recall  f1-score   support

           1       0.71      0.87      0.78       352
           2       0.22      0.01      0.03       138
           3       0.46      0.18      0.26       224
           4       0.49      0.53      0.51       473
           5       0.72      0.85      0.78       775

    accuracy                           0.64      1962
   macro avg       0.52      0.49      0.47      1962
weighted avg       0.60      0.64      0.60      1962



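Again, tuning the C hyperparameter does not change the model's performance. The report is identical to the previous ones down to the last digit even though the selected C (10) differs from SVC's default (1), so it may be worth checking in mlModel.py that class_report evaluates the estimator refit by the grid search rather than the one trained by fit().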
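For comparison, here is a sketch of how the kernel and C could be searched jointly and how the refit best estimator can be evaluated on a held-out split. It does not reproduce grid_search_kernel or grid_search_C from mlModel.py; the pipeline step names, the toy data, the reduced grids and `cv=2` are all assumptions made to keep the example small and runnable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipeline = Pipeline([
    ("Vectorizer", TfidfVectorizer()),
    ("Model_Architecture", SVC()),
])

# "<step name>__<parameter>" addresses a hyperparameter of one pipeline step.
param_grid = {
    "Model_Architecture__kernel": ["linear", "rbf"],
    "Model_Architecture__C": [0.1, 1, 10],
}

# Toy reviews and ratings, for illustration only.
texts = ["great food", "great service", "lovely place", "tasty great biryani",
         "great taste", "lovely great staff",
         "awful food", "awful service", "terrible place", "cold awful biryani",
         "awful taste", "terrible rude staff"]
labels = ["5"] * 6 + ["1"] * 6

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(X_train, y_train)

# With the default refit=True, predict() uses best_estimator_,
# i.e. the pipeline retrained on the whole training split with the best parameters.
y_pred = search.predict(X_test)
print("Best parameters:", search.best_params_)
print("Held-out accuracy:", accuracy_score(y_test, y_pred))
```

Predicting through the search object (or through `search.best_estimator_`) guarantees that the reported metrics belong to the tuned model rather than to an earlier fit.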