# Machine Learning A-Z: Section 29 Natural Language Processing - Bag of Words Model

Natural Language Processing (NLP) is the area of Machine Learning that deals with analyzing written and spoken language to understand its meaning. It is a huge area of ongoing research, and many of the previously covered algorithms, along with many others, are used to teach computers what a chunk of language means.

A common model in NLP is the Bag of Words model, which simply looks at all the words in a passage and measures the presence or absence of known key words.

In this case we are going to be looking at restaurant reviews and trying to determine whether the review has a positive or negative sentiment.

## Step 1 Import and Prepare the data.

In [1]:
import numpy as np # Libraries for fast linear algebra and array manipulation
import pandas as pd # Import and manage datasets
from plotly import __version__ as py__version__
import plotly.express as px # Libraries for plotting data
import plotly.graph_objects as go # Libraries for plotting data
import re # Library for using regular expressions. Will help us clean the text
import nltk # Library for doing common NLP operations
from nltk.corpus import stopwords # Library for removing irrelevant words
from nltk.stem.porter import PorterStemmer # Library condensing words down to the root word
from sklearn import __version__ as skl__version__
from sklearn.model_selection import train_test_split # Library to split data into training and test sets.
from sklearn.naive_bayes import GaussianNB # Library to do Naive Bayes classification
from sklearn.ensemble import RandomForestClassifier # Library to do Random Forest classification
from sklearn.metrics import confusion_matrix #Function for computing the confusion matrix
from sklearn.feature_extraction.text import CountVectorizer # Library to create the 'one-hot' encoded word matrix

Library versions used in this code:

In [2]:
print('Numpy: ' + np.__version__)
print('Pandas: ' + pd.__version__)
print('Plotly: ' + py__version__)
print('NLTK: ' + nltk.__version__)
print('Scikit-learn: ' + skl__version__)

Numpy: 1.16.4
Pandas: 0.25.1
Plotly: 4.0.0
NLTK: 3.4.4
Scikit-learn: 0.21.2


In [3]:
def LoadData():
    dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)
    return dataset

dataset = LoadData()
print(dataset.head(3))
print()
print(dataset.info())

                                      Review  Liked
0                   Wow... Loved this place.      1
1                         Crust is not good.      0
2  Not tasty and the texture was just nasty.      0

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
Review    1000 non-null object
Liked     1000 non-null int64
dtypes: int64(1), object(1)
memory usage: 15.8+ KB
None


We've now imported the text. In NLP, the choice of delimiter can be important. In this case a tab was used since a tab character is unlikely to appear in our text; however, this can be very application dependent. Another good delimiter can be the pipe '|' character, but when choosing your delimiter you always need to consider what may or may not appear in the text you want to delimit.
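
As a small illustration of why the delimiter and quoting settings matter, the sketch below parses a made-up two-row sample with Python's standard `csv` module rather than pandas. The sample text is invented for illustration. `csv.QUOTE_NONE` is the constant with value 3, which is what `quoting = 3` selects in the `read_csv` call above.

```python
import csv
import io

# Invented sample: the first review contains double quotes, a comma, and a pipe
sample = 'Review\tLiked\nThe "special" was great, really | loved it.\t1\nCrust is not good.\t0\n'

# Split on tabs only and treat quote characters as ordinary text
reader = csv.reader(io.StringIO(sample), delimiter='\t', quoting=csv.QUOTE_NONE)
header, *records = list(reader)

for review, liked in records:
    print(repr(review), liked)
```

Because the reviews never contain a tab, each row still splits into exactly two fields, and the quotes, comma, and pipe inside the first review are kept as plain text.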

Since language data is almost always very messy (how many typos do you make on a daily basis?), we will need to clean and prepare the data before we can run it through our model. Also, most languages have many variations on the same word (love, loves, loved, etc.), which we want to treat as a single word.

The steps in the cleaning process are:
1. Remove everything that is not a letter (replace with spaces)
1. Make everything lowercase
1. Split each review into a list of words
1. Remove non-interesting words (ex. a, the, an, and, this, etc.)
1. Stem (i.e. keep only the root of similar words)
1. Join the cleaned words back together separated by a space
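
To make the steps above concrete, here is one review from the dataset traced through them by hand. This is only a sketch: the stop-word list is a tiny hypothetical stand-in for NLTK's English list, and the stemming step is left out so the example runs with the standard library alone.

```python
import re

# Hypothetical mini stop-word list; the article uses NLTK's much longer English list
stop_words = {'a', 'an', 'and', 'the', 'this'}

review = 'Wow... Loved this place.'
letters_only = re.sub('[^a-zA-Z]', ' ', review)   # step 1: non-letters become spaces
lowered = letters_only.lower()                     # step 2: lowercase
words = lowered.split()                            # step 3: list of words
kept = [w for w in words if w not in stop_words]   # step 4: drop stop words
# Step 5 (stemming) is skipped here so the sketch runs without NLTK
cleaned = ' '.join(kept)                           # step 6: back to one string

print(words)    # ['wow', 'loved', 'this', 'place']
print(cleaned)  # wow loved place
```

With stemming applied as well, a word such as "loved" would additionally be reduced toward its root.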

In [4]:
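ps = PorterStemmer() # Create the stemmer once and reuse it for every review
stop_words = set(stopwords.words('english')) # Build the stop-word set once instead of once per word
corpus = []
for review in dataset['Review']:
    review = re.sub('[^a-zA-Z]', ' ', review) # Replace everything that is not a letter with a space
    review = review.lower().split() # Lowercase, then split into a list of words
    review = [ps.stem(word) for word in review if word not in stop_words] # Drop stop words and stem the rest
    corpus.append(' '.join(review)) # Rejoin the cleaned words into a single string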
    

The bag of words model works by essentially one-hot encoding each review, except that each cell holds a count rather than a 0 or 1. A new data structure is created with a column for each word that appears in any review. Each row represents a review, and the matrix is filled with the count of each word in that review. This gives us a very sparse matrix, which can be difficult for algorithms to deal with. Cleaning the input data already reduced the number of words (columns) in this matrix and hence the sparsity. To reduce it further we will keep only the most commonly occurring words (the 1,500 most frequent, set by `max_features`).
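
The idea can be sketched in plain Python on a toy corpus. This is a simplified stand-in for what `CountVectorizer` does, not its actual implementation: the four cleaned reviews are invented, and `max_features = 3` is chosen only to keep the matrix small.

```python
from collections import Counter

# Invented, already-cleaned reviews (hypothetical data for illustration)
corpus = ['wow love place', 'love food love place', 'crust bad', 'love place food']
max_features = 3

# Count how often each word occurs across the whole corpus
totals = Counter(word for review in corpus for word in review.split())

# Keep only the most frequent words, sorted alphabetically to fix the column order
vocabulary = sorted(word for word, _ in totals.most_common(max_features))

# One row per review, one column per vocabulary word, each cell a count
matrix = [[review.split().count(word) for word in vocabulary] for review in corpus]

print(vocabulary)  # ['food', 'love', 'place']
for row in matrix:
    print(row)
```

The third review contains none of the kept words, so its row is all zeros. Rows that are mostly or entirely zeros are what make the matrix sparse.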

In [5]:
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:,-1]

We are now ready to train our model. We are going to use a Naive Bayes classifier (as discussed earlier) to classify each review as positive or negative.

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.20, random_state = 42)

classifier = GaussianNB()
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)

confusionMatrix_NB = confusion_matrix(y_test, y_pred)

print(confusionMatrix_NB)

[[48 48]
 [18 86]]


We can also try other classification models on our bag of words, such as a Random Forest (here using the entropy criterion).

In [7]:
classifier = RandomForestClassifier(n_estimators = 100, criterion = 'entropy', random_state = 42)
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)

confusionMatrix_RF = confusion_matrix(y_test, y_pred)

print(confusionMatrix_RF)

[[82 14]
 [42 62]]


Or a Random Forest of CART trees (using the Gini criterion).

In [8]:
classifier = RandomForestClassifier(n_estimators = 100, criterion = 'gini', random_state = 42)
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)

confusionMatrix_CART = confusion_matrix(y_test, y_pred)

print(confusionMatrix_CART)

[[83 13]
 [47 57]]


We can also evaluate our models based on Accuracy, Precision, Recall, and the F1 Score.

As a quick refresher:
* Accuracy is the fraction of all predictions that are correct
* Precision is a measure of how many predicted positives are actually positive (How useful are the results?)
* Recall is a measure of how many actual positives were correctly identified (How complete are the results?)
* F1 is the harmonic mean of Precision and Recall (useful & complete) and ranges from 1 (best) to 0 (worst)

More info can be found in the relevant Wikipedia articles:
* [Accuracy](https://en.wikipedia.org/wiki/Accuracy_and_precision)
* [Precision & Recall](https://en.wikipedia.org/wiki/Precision_and_recall)
* [F1 Score](https://en.wikipedia.org/wiki/F1_score)
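
The definitions above can be written as one small helper. This is a minimal sketch: the function name `classification_metrics` is made up for illustration, and it assumes the scikit-learn layout for a binary confusion matrix, where rows are actual labels and columns are predicted labels. It is applied here to the Naive Bayes confusion matrix printed earlier.

```python
def classification_metrics(cm):
    """Compute accuracy, precision, recall, and F1 from a 2x2 confusion matrix.

    cm is laid out as [[TN, FP], [FN, TP]] (rows: actual, columns: predicted).
    """
    (tn, fp), (fn, tp) = cm
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # of everything predicted positive, how much was right
    recall = tp / (tp + fn)      # of everything actually positive, how much was found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Naive Bayes confusion matrix from the output above
accuracy, precision, recall, f1 = classification_metrics([[48, 48], [18, 86]])
print(f'{accuracy:0.3f} {precision:0.3f} {recall:0.3f} {f1:0.3f}')
```

Naming the four cells before using them makes it harder to mix up false positives (top right) and false negatives (bottom left), which is the only difference between the precision and recall formulas.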

In [9]:
accuracy_NB = (confusionMatrix_NB[1,1]+confusionMatrix_NB[0,0])/len(y_pred)
# Rows of the confusion matrix are actual labels, columns are predicted labels:
# [0,1] holds the false positives and [1,0] holds the false negatives
precision_NB = confusionMatrix_NB[1,1]/(confusionMatrix_NB[1,1] + confusionMatrix_NB[0,1])
recall_NB = confusionMatrix_NB[1,1]/(confusionMatrix_NB[1,1] + confusionMatrix_NB[1,0])
f1_NB = (2 * precision_NB * recall_NB) / (precision_NB + recall_NB)
print(f'Naive Bayes Accuracy: {accuracy_NB*100}%')
print(f'Naive Bayes Precision: {precision_NB*100:0.1f}%')
print(f'Naive Bayes Recall: {recall_NB*100:0.1f}%')
print(f'Naive Bayes F1: {f1_NB:0.3f}')

Naive Bayes Accuracy: 67.0%
Naive Bayes Precision: 64.2%
Naive Bayes Recall: 82.7%
Naive Bayes F1: 0.723


In [10]:
accuracy_RF = (confusionMatrix_RF[1,1]+confusionMatrix_RF[0,0])/len(y_pred)
# Rows of the confusion matrix are actual labels, columns are predicted labels:
# [0,1] holds the false positives and [1,0] holds the false negatives
precision_RF = confusionMatrix_RF[1,1]/(confusionMatrix_RF[1,1] + confusionMatrix_RF[0,1])
recall_RF = confusionMatrix_RF[1,1]/(confusionMatrix_RF[1,1] + confusionMatrix_RF[1,0])
f1_RF = (2 * precision_RF * recall_RF) / (precision_RF + recall_RF)
print(f'Random Forest Accuracy: {accuracy_RF*100}%')
print(f'Random Forest Precision: {precision_RF*100:0.1f}%')
print(f'Random Forest Recall: {recall_RF*100:0.1f}%')
print(f'Random Forest F1: {f1_RF:0.3f}')

Random Forest Accuracy: 72.0%
Random Forest Precision: 81.6%
Random Forest Recall: 59.6%
Random Forest F1: 0.689


In [11]:
accuracy_CART = (confusionMatrix_CART[1,1]+confusionMatrix_CART[0,0])/len(y_pred)
# Rows of the confusion matrix are actual labels, columns are predicted labels:
# [0,1] holds the false positives and [1,0] holds the false negatives
precision_CART = confusionMatrix_CART[1,1]/(confusionMatrix_CART[1,1] + confusionMatrix_CART[0,1])
recall_CART = confusionMatrix_CART[1,1]/(confusionMatrix_CART[1,1] + confusionMatrix_CART[1,0])
f1_CART = (2 * precision_CART * recall_CART) / (precision_CART + recall_CART)
print(f'CART Accuracy: {accuracy_CART*100}%')
print(f'CART Precision: {precision_CART*100:0.1f}%')
print(f'CART Recall: {recall_CART*100:0.1f}%')
print(f'CART F1: {f1_CART:0.3f}')

CART Accuracy: 70.0%
CART Precision: 81.4%
CART Recall: 54.8%
CART F1: 0.655


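From the above we can see that Naive Bayes has the best F1 score (0.723) on this dataset, even though both Random Forest models have higher accuracy (72.0% and 70.0% versus 67.0%). Naive Bayes does relatively well given the limited size of the dataset and the sparsity of the matrix.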

Ways to reduce the sparsity of the matrix by reducing the number of features (words, in this case) will be covered in future articles.

P.S. If you are having trouble downloading the required NLTK packages, the code below can help. [Source](https://stackoverflow.com/a/50406704)
```
import nltk
import ssl

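# Work around SSL certificate errors by disabling certificate verification
# for HTTPS downloads in this session (only do this on a network you trust)
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    # This Python build has no unverified-context helper, so leave the default in place
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download() # Opens the interactive downloader; nltk.download('stopwords') fetches only the stop-word list used here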
```