# Bag of Words (BoW)
Bag of Words is an NLP model that represents a text (such as a sentence or document) as a bag (multiset) of its words. It disregards grammar and word order but keeps how often each word occurs.

### Example of BoW
Sentence 1: ```John likes to watch movies. Mary likes movies too.```  
Sentence 2: ```John also likes to watch football games.```

Put each sentence into its own bag. In Python we can use a dictionary to represent a bag.
```python
BoW1 = {"John": 1, "likes": 2, "to": 1, "watch": 1, "movies": 2, "Mary": 1, "too": 1}
BoW2 = {"John": 1,"also": 1,"likes": 1,"to": 1, "watch": 1,"football": 1,"games": 1}
```
The key represents the word, and the value is the frequency of the word.
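
The two dictionaries above can also be built programmatically. The following is a minimal sketch, assuming that a "word" is any run of letters and that case is preserved; the helper name `bag_of_words` is ours and not part of any library.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count how often each word occurs in `text` (case-sensitive)."""
    # treat every run of letters as one word; punctuation is dropped
    words = re.findall(r"[A-Za-z]+", text)
    return Counter(words)

bow1 = bag_of_words("John likes to watch movies. Mary likes movies too.")
bow2 = bag_of_words("John also likes to watch football games.")

print(dict(bow1))
print(dict(bow2))
```

`Counter` is a dictionary subclass, so the result can be used exactly like the hand-written bags.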

Next we "normalize" the bags. One step is to remove common words that carry little meaning (stop words), such as "to", "as", and "the", from each bag. Another is stemming, which strips suffixes so that related word forms share one stem: "loved" becomes "love" and "plays" becomes "play".
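
As a rough illustration of the first step, the sketch below filters a bag against a stop-word list. The tiny `STOP_WORDS` set and the helper `remove_stop_words` are made up for this example; the notebook itself uses NLTK's much longer English list further down, together with a real stemmer.

```python
# illustrative stop-word list only; real lists (e.g. NLTK's) are much longer
STOP_WORDS = {"to", "as", "the", "too"}

def remove_stop_words(bag):
    """Return a copy of `bag` without the words listed in STOP_WORDS."""
    return {word: count for word, count in bag.items()
            if word.lower() not in STOP_WORDS}

bow1 = {"John": 1, "likes": 2, "to": 1, "watch": 1, "movies": 2, "Mary": 1, "too": 1}
filtered_bow1 = remove_stop_words(bow1)

print(filtered_bow1)
```

Only "to" and "too" are dropped here; the counts of the remaining words are unchanged.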

Then we can compare the words and their frequencies in each sentence and see if the two sentences are alike. This would work best on a large data set of sentences.
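
The text does not say which comparison to use. One common choice, assumed here, is cosine similarity: treat each bag as a vector of word counts and measure the angle between the two vectors. The helper name `cosine_similarity` is ours.

```python
import math

def cosine_similarity(bag_a, bag_b):
    """Cosine of the angle between two word-count dictionaries (0.0 to 1.0)."""
    # only words present in both bags contribute to the dot product
    shared_words = set(bag_a) & set(bag_b)
    dot = sum(bag_a[word] * bag_b[word] for word in shared_words)
    norm_a = math.sqrt(sum(count * count for count in bag_a.values()))
    norm_b = math.sqrt(sum(count * count for count in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty bag is not similar to anything
    return dot / (norm_a * norm_b)

bow1 = {"John": 1, "likes": 2, "to": 1, "watch": 1, "movies": 2, "Mary": 1, "too": 1}
bow2 = {"John": 1, "also": 1, "likes": 1, "to": 1, "watch": 1, "football": 1, "games": 1}

similarity = cosine_similarity(bow1, bow2)
print(round(similarity, 3))
```

A value near 1 means the two bags use the same words in similar proportions, and 0 means they share no words.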

In [4]:
# importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [5]:
"""
Import the data set. The file is a TSV, so use a tab delimiter.
- quoting = 3 (csv.QUOTE_NONE) treats double quotes as ordinary characters
"""
ignore_quotes = 3
reviews = pd.read_csv("datasets/restaurant_reviews.tsv", delimiter="\t", quoting=ignore_quotes)

reviews.head()

   Review                                             Liked
0  Wow... Loved this place.                           1
1  Crust is not good.                                 0
2  Not tasty and the texture was just nasty.          0
3  Stopped by during the late May bank holiday of...  1
4  The selection on the menu was great and so wer...  1
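
The value 3 passed as `quoting` corresponds to the standard library constant `csv.QUOTE_NONE`. The sketch below shows the effect with Python's own `csv` module on two made-up rows (they are not taken from the reviews file): double quotes inside a review stay in the text instead of being treated as field delimiters.

```python
import csv
import io

# two made-up rows in the same tab-separated layout as the reviews file
sample_tsv = (
    'Review\tLiked\n'
    '"Best" pizza in town\t1\n'
    'The "fresh" fish was not fresh\t0\n'
)

# QUOTE_NONE: quote characters get no special treatment while parsing
rows = list(csv.reader(io.StringIO(sample_tsv), delimiter="\t", quoting=csv.QUOTE_NONE))

for row in rows:
    print(row)
```

Free-text reviews often contain stray or unbalanced quotes, which is a plausible reason for disabling quote handling when loading them.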


# Pre-Processing The Texts
We need to clean the texts before fitting and predicting on the Bag of Words model.

In [6]:
# import regular expressions and NLTK
import re
import nltk

In [7]:
# download the "stop words" list (a list of irrelevant words)
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /home/pravat/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [8]:
# import the stopwords
from nltk.corpus import stopwords

# import the porter stemmer, an algorithm to strip suffixes and receive only the word stem
from nltk.stem.porter import PorterStemmer

In [9]:
# build these once instead of per word: a set gives O(1) membership tests,
# and a single stemmer can be reused for every review
stopwords_set = set(stopwords.words("english"))
ps = PorterStemmer()

# return a cleaned (pre-processed) review text
def get_clean_review(review):
    # replace every non-alphabetical character with a space
    review = re.sub("[^a-zA-Z]", " ", review)

    # lower-case the review and split it into words
    words = review.lower().split()

    # drop stopwords (such as prepositions) and reduce the remaining words to their stems
    clean_words = [ps.stem(word) for word in words if word not in stopwords_set]

    # join the stems back into a single space-separated string
    return " ".join(clean_words)

In [10]:
# a corpus is the NLP term for a collection of texts (documents, HTML pages, etc.)
# here it holds one cleaned string per row of the reviews DataFrame
corpus = [get_clean_review(review) for review in reviews["Review"]]

# Sparse Matrix
Now that the texts in the reviews DataFrame are pre-processed, we can build the bag-of-words sparse matrix.

The matrix is structured as follows: each row is a review, and each word in the corpus has its own column.

Each cell therefore holds how often a word occurs in a review, and a classification model can use these frequencies to classify each review as positive or negative.
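
To make the layout concrete, here is a small pure-Python sketch that builds such a matrix by hand. The three reviews in `sample_corpus` are made up, and the variable names are ours; the notebook itself delegates this step to scikit-learn's `CountVectorizer`.

```python
from collections import Counter

# made-up, already pre-processed reviews (not taken from the data set)
sample_corpus = ["love place love food", "crust good", "food good place"]

# one column per distinct word, in alphabetical order
vocabulary = sorted({word for review in sample_corpus for word in review.split()})

# one row per review: how often each vocabulary word occurs in it
matrix = []
for review in sample_corpus:
    counts = Counter(review.split())
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Most cells are zero because each review uses only a few of the words, which is why the matrix is called sparse.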

In [11]:
# import the word counter
from sklearn.feature_extraction.text import CountVectorizer

"""
Create a word counter object.
- max_features = 1500 keeps only the 1500 most frequent words (columns)

This class has built-in text pre-processing parameters (such as lowercase, stop_words, etc.).
We leave them at their defaults because we already programmed the pre-processing manually above,
which gives us more control.
"""
max_top_words = 1500
cv = CountVectorizer(max_features=max_top_words)

In [12]:
# x is the word-count matrix: one row per review, one column per word, each value a word frequency
# toarray() converts the sparse matrix into a dense NumPy array
x = cv.fit_transform(corpus).toarray()

x

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [13]:
# y is the Liked column from the original reviews data set, selected by name
y = reviews["Liked"]

y.head()

0    1
1    0
2    0
3    1
4    1
Name: Liked, dtype: int64

# Create Bag of Words Model
Now that we have our sparse matrix, we can fit a classification model on it and use the model to predict the labels of a testing set.

We will use a Naive Bayes (probabilistic) classification model.

In [14]:
# split the data set into training and testing data sets
from sklearn.model_selection import train_test_split 
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=0)

In [15]:
# import the gaussian naive bayes class
from sklearn.naive_bayes import GaussianNB

In [16]:
# create a naive bayes classifier, then fit to the training set
classifier = GaussianNB()
classifier.fit(x_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

In [17]:
# predict the test set results
y_pred = classifier.predict(x_test)

y_pred

array([1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0,
       0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0,
       1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0,
       0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1,
       1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1,
       0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1,
       1, 1])

# Confusion Matrix

In [18]:
# import the confusion matrix function
from sklearn.metrics import confusion_matrix

In [19]:
# create a confusion matrix that compares the y_test (actual) to the y_pred (prediction)
cm = confusion_matrix(y_test, y_pred)

"""
Read the confusion matrix by its diagonals (rows are actual labels, columns are predictions):
main diagonal: 55 + 91 = 146 correct predictions
off-diagonal:  42 + 12 = 54 incorrect predictions
"""
cm

array([[55, 42],
       [12, 91]])
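
The counts above can be turned into summary metrics. The sketch below copies the matrix values by hand and assumes scikit-learn's layout (rows are actual labels, columns are predicted labels, in the order 0 then 1); the variable names are ours.

```python
# confusion matrix values copied from the output above
cm = [[55, 42],
      [12, 91]]

# with labels ordered 0, 1: first row is actual 0, second row is actual 1
(true_neg, false_pos), (false_neg, true_pos) = cm

total = true_neg + false_pos + false_neg + true_pos
accuracy = (true_neg + true_pos) / total        # share of all predictions that are correct
precision = true_pos / (true_pos + false_pos)   # share of predicted 1s that are actually 1
recall = true_pos / (true_pos + false_neg)      # share of actual 1s that were predicted as 1

print(f"accuracy  = {accuracy:.2f}")
print(f"precision = {precision:.2f}")
print(f"recall    = {recall:.2f}")
```

The accuracy works out to 146 / 200 = 0.73. The 42 false positives show that the classifier leans toward predicting 1, which matches the many 1s in `y_pred`.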