### Assignment: Natural Language Processing

In this assignment, you will work with a data set of restaurant reviews. You will use a Naive Bayes model to classify each review as positive or negative based on the words it contains. The main objective is to gauge the performance of a Naive Bayes model using a confusion matrix. To assess the model's efficacy, you will first train it on a portion (i.e. 70%) of the data set and then test it against the remainder. Before you can train the model, you will have to go through a sequence of steps to prepare the data.

Steps you may need to perform:

**1)** Read in the list of restaurant reviews

**2)** Convert the reviews into a list of tokens

**3)** You will most likely have to eliminate stop words

**4)** You may have to use stemming or lemmatization to determine the base form of the words

**5)** You will have to vectorize the data (i.e. construct a document-term matrix) in which selected words from the reviews form the columns and the individual reviews form the rows

**6)** Create 'Train' and 'Test' data sets (i.e. 70% of the underlying data set will constitute the training set and 30% will constitute the test set)

**7)** Train a Naive Bayes model on the training set and test it against the test set

**8)** Construct a confusion matrix to gauge the performance of the model

**Dataset**: https://www.dropbox.com/s/yl5r7kx9nq15gmi/Restaurant_Reviews.tsv?raw=1




In [0]:
!pip install -U nltk
!pip install regex

import nltk

nltk.download('all')

# Inaugural is one of the data packages included within NLTK

# Import the "inaugural" data package
from nltk.corpus import inaugural


In [0]:
import string
import regex as re
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from nltk import PorterStemmer, LancasterStemmer, word_tokenize

# read_table parses tab-separated files and already returns a DataFrame
df = pd.read_table('https://www.dropbox.com/s/yl5r7kx9nq15gmi/Restaurant_Reviews.tsv?raw=1')

# Remove punctuation and convert to lowercase
df['clean'] = df['Review'].apply(lambda x: re.sub(r'[^\w\s]','',x).lower())

# Build the stop word list, keeping negations because they carry sentiment.
# Note: punctuation is stripped first, so contractions such as "don't" become
# "dont" and are not matched by the NLTK stop word list.
en_stopwords = list(set(nltk.corpus.stopwords.words('english')))
keep = ['not', 'no']
en_stopwords = [i for i in en_stopwords if i not in keep]

# Tokenize the sentence
df['tokens']  = df['clean'].apply(lambda x: word_tokenize(x))

# Tokens without stopwords, joined back into a single string per review
df['nontokens'] = df['tokens'].apply(lambda x:' '.join([i for i in x if i not in en_stopwords]))
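
Step 4 of the assignment mentions stemming, and the cell above imports `PorterStemmer` without using it. The following is a minimal sketch of how stemming could be applied to a token list. The helper `stem_tokens` and the example tokens are hypothetical and are not part of the original notebook.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_tokens(tokens):
    """Return the Porter stem of every token in a list (hypothetical helper)."""
    return [stemmer.stem(tok) for tok in tokens]

# Hypothetical token list standing in for one row of df['tokens']
example_tokens = ["loved", "waiting", "places"]
stemmed = stem_tokens(example_tokens)
print(stemmed)
```

Under the same assumptions, something like `df['tokens'].apply(stem_tokens)` would stem every review. Whether stemming helps the classifier on this data set is an open question that the notebook does not test.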

In [0]:
vectorizer = CountVectorizer()
vfit = vectorizer.fit_transform(df['nontokens'])
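# Convert the sparse matrix to a dense array (the .A shorthand is not available in recent SciPy releases)
vfit = vfit.toarray()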
print(vfit.shape)

(1000, 1958)


In [0]:
# !pip install gensim
import gensim

from gensim.models.fasttext import FastText
model = FastText(df['tokens'], min_count=1)
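# Look up the vector through the KeyedVectors interface; indexing the model directly is deprecated
model.wv['good']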



array([-0.17617202, -0.02403634,  0.3577154 , -0.0887966 , -0.00076755,
       -0.09830803, -0.21637231, -0.09890354, -0.04458106,  0.22272782,
       -0.10851035, -0.04727236,  0.2505561 , -0.25612572,  0.21001679,
        0.07326671, -0.16497612, -0.09847432,  0.16445106, -0.02900513,
       -0.12236547,  0.09080641,  0.1500903 , -0.27190936, -0.11710636,
       -0.42419073, -0.17777662, -0.04790103, -0.06882516,  0.07657215,
       -0.174973  , -0.21405645, -0.09907134,  0.0373577 ,  0.05012957,
       -0.03536616, -0.17542887, -0.25021127,  0.03585653,  0.2362842 ,
       -0.02786737, -0.06347536, -0.27179083,  0.17428143, -0.04533366,
       -0.07112062, -0.21571207,  0.11335783, -0.08055928,  0.17356868,
       -0.00181954,  0.03059168, -0.0147906 , -0.09463813,  0.13151124,
       -0.09264892,  0.04354956,  0.14521304,  0.45219973, -0.27504757,
       -0.09550574,  0.00325185, -0.09354495, -0.01502891,  0.1675005 ,
        0.00526447, -0.06332537, -0.36587086, -0.21194302, -0.11

In [0]:
import operator
from operator import itemgetter
from collections import Counter

# Count how many times each word appears
count = Counter(" ".join(df['nontokens']).split(" ")).items()
sorted_count = sorted(count, key=itemgetter(1))
sorted_count.reverse()
print(sorted_count)



In [0]:
# Select 100 most frequent words
top100 = [i[0] for i in sorted_count[:100]]
print(top100)

['food', 'not', 'place', 'good', 'service', 'great', 'back', 'like', 'go', 'time', 'really', 'best', 'dont', 'ever', 'would', 'also', 'one', 'friendly', 'never', 'nice', 'restaurant', 'no', 'delicious', 'amazing', 'im', 'vegas', 'experience', 'ive', 'came', 'wont', 'disappointed', 'even', 'love', 'minutes', 'us', 'eat', 'get', 'staff', 'pretty', 'going', 'well', 'got', 'definitely', 'much', 'bad', 'first', 'chicken', 'made', 'better', 'think', 'say', 'could', 'pizza', 'always', 'stars', 'salad', 'menu', 'steak', 'wait', 'way', 'ordered', 'worst', 'fresh', 'flavor', 'wasnt', 'sushi', 'times', 'server', 'quality', 'taste', 'didnt', 'want', 'awesome', 'night', 'fantastic', 'went', 'enough', 'burger', 'recommend', 'order', 'come', 'know', 'next', 'meal', 'bland', 'cant', 'buffet', 'feel', 'slow', 'still', 'tasty', 'perfect', 'terrible', 'probably', 'waited', 'excellent', 'atmosphere', 'another', 'everything', 'coming']


In [0]:
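# Create a matrix with reviews as rows and the top 100 words as columns, where each cell is 1 if the word appears in the review and 0 otherwise.
# Each review is split into tokens first so that matching is on whole words ('no' must not match 'not' or 'know').
m = []
for review in df['nontokens']:
    tokens = set(review.split())
    m.append([1 if word in tokens else 0 for word in top100])
print(np.array(m))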

[[0 0 1 ... 0 0 0]
 [0 1 0 ... 0 0 0]
 [0 1 0 ... 0 0 0]
 ...
 [0 1 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
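
A similar presence/absence matrix over the most frequent words can probably be built directly with `CountVectorizer`, using its `max_features` and `binary` parameters. The sketch below uses a hypothetical three-review corpus in place of `df['nontokens']` and keeps 3 words instead of 100. It is an alternative illustration, not the approach used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus standing in for df['nontokens']
docs = ["crust not good", "food good service good", "not tasty food"]

# max_features keeps the terms with the highest corpus frequency;
# binary=True records presence (1) or absence (0) instead of counts
vec = CountVectorizer(max_features=3, binary=True)
X = vec.fit_transform(docs).toarray()

# Column order follows the alphabetically sorted vocabulary
features = sorted(vec.vocabulary_)
print(features)
print(X)
```

Note that the default tokenizer of `CountVectorizer` ignores single-character tokens, and ties at the frequency cutoff may be resolved differently than in the manual sort above, so the selected columns are not guaranteed to match `top100` exactly.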


In [0]:
mdf = pd.DataFrame(m, columns = top100, index = df['nontokens'])
mdf['Liked'] = df['Liked'].values
print(mdf.head())

                                                    food  not  place  good  \
nontokens                                                                    
wow loved place                                        0    0      1     0   
crust not good                                         0    1      0     1   
not tasty texture nasty                                0    1      0     0   
stopped late may bank holiday rick steve recomm...     0    0      0     0   
selection menu great prices                            0    0      0     0   

                                                    service  great  back  \
nontokens                                                                  
wow loved place                                           0      0     0   
crust not good                                            0      0     0   
not tasty texture nasty                                   0      0     0   
stopped late may bank holiday rick steve recomm...        0      0     0 

In [0]:
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

train, test = train_test_split(mdf, test_size = 0.3)
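
The split above draws a different random sample on every run, so the accuracy and confusion matrix reported further down will vary between runs. A reproducible, class-balanced split could look like the sketch below. The `toy` frame is a hypothetical stand-in for `mdf`, and the seed value 42 is arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for mdf: 10 rows with balanced labels
toy = pd.DataFrame({"food": [1, 0] * 5, "Liked": [1] * 5 + [0] * 5})

# random_state fixes the shuffle; stratify keeps the label ratio
# roughly equal in the training and test sets
train_toy, test_toy = train_test_split(
    toy, test_size=0.3, random_state=42, stratify=toy["Liked"]
)
print(train_toy.shape, test_toy.shape)
```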

In [0]:
from sklearn.naive_bayes import MultinomialNB

cols = train.columns[:-1]
gnb = MultinomialNB()
gnb.fit(train[cols], train['Liked'])
y_pred = gnb.predict(test[cols])

print("Number of mislabeled points out of a total {} points : {}, performance {:05.2f}%"
      .format(
          test.shape[0],
          (test["Liked"] != y_pred).sum(),
          100*(1-(test["Liked"] != y_pred).sum()/test.shape[0])
))

Number of mislabeled points out of a total 300 points : 73, performance 75.67%


In [0]:
confusion_matrix(test['Liked'], y_pred)

array([[112,  31],
       [ 42, 115]])
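
To read the matrix: scikit-learn places the true labels on the rows and the predicted labels on the columns, with labels in ascending order. Assuming `Liked = 1` is the positive class, the sketch below derives accuracy, precision, and recall from the values printed above. The numbers are copied from that output, not recomputed from the model.

```python
import numpy as np

# Values copied from the confusion matrix output above
cm = np.array([[112, 31],
               [42, 115]])

# Row = true label, column = predicted label, labels ordered [0, 1]
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all reviews classified correctly
precision = tp / (tp + fp)        # share of predicted positives that are truly positive
recall = tp / (tp + fn)           # share of true positives that were found

print("accuracy={:.4f} precision={:.4f} recall={:.4f}".format(accuracy, precision, recall))
```

The accuracy works out to 227/300, which agrees with the 75.67% reported by the previous cell.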