## Bag of Words Model - NLP

### Introduction

## Importing the Libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

## Importing the Dataset

The dataset is a tab-separated values (TSV) file, which has the same layout as a CSV file except that the fields are separated by tabs. TSV is used here because the review texts themselves contain commas, which would be misread as field separators in a comma-separated file.

The reviews also contain double quotation marks (""). To stop pandas from treating them as quoting characters, we set the quoting argument of the read_csv function to 3 (equivalent to csv.QUOTE_NONE), which tells the parser to read quotes as ordinary text.

Finally, we set the delimiter argument of read_csv to a tab (\t).

In [2]:
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)
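
As a small illustration of what the quoting argument does, the sketch below reads a two-row, in-memory TSV instead of the real file. The sample rows and the `sample_tsv` name are made up for this example; the value 3 corresponds to `csv.QUOTE_NONE` in Python's standard library.

```python
import csv
import io

import pandas as pd

# Hypothetical two-row sample standing in for Restaurant_Reviews.tsv
sample_tsv = 'Review\tLiked\n"Great" food, loved it\t1\nCrust is not good.\t0\n'

# quoting=csv.QUOTE_NONE (the integer 3): double quotes are read as ordinary characters
sample = pd.read_csv(io.StringIO(sample_tsv), delimiter='\t', quoting=csv.QUOTE_NONE)

print(csv.QUOTE_NONE)            # 3
print(sample['Review'][0])       # the quotes and the comma stay in the text
print(sample['Liked'].tolist())  # labels are parsed as integers
```

Because the tab is the only separator, the comma inside the first review does not split it into two fields.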

In [3]:
print(dataset)

                                                Review  Liked
0                             Wow... Loved this place.      1
1                                   Crust is not good.      0
2            Not tasty and the texture was just nasty.      0
3    Stopped by during the late May bank holiday of...      1
4    The selection on the menu was great and so wer...      1
5       Now I am getting angry and I want my damn pho.      0
6                Honeslty it didn't taste THAT fresh.)      0
7    The potatoes were like rubber and you could te...      0
8                            The fries were great too.      1
9                                       A great touch.      1
10                            Service was very prompt.      1
11                                  Would not go back.      0
12   The cashier had no care what so ever on what I...      0
13   I tried the Cape Cod ravoli, chicken, with cra...      1
14   I was disgusted because I was pretty sure that...      0
15   I w

## Cleaning the Dataset

First, we download the stop words list from the NLTK library.
Stop words are common words that carry little or no sentiment (for example: is, am, are).

In [4]:
import nltk
nltk.download('stopwords')

[nltk_data] Error loading stopwords: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>


False

Now we import the stop words from the NLTK corpus downloaded in the previous step.

We then loop through every entry (review) in the dataset with a for loop and, for each review, do the following:

1. Replace every character other than a to z and A to Z with a space. For this we use the sub function of the re library, which takes three arguments: a regular expression describing the characters to match, the replacement string, and the text to process. The text is the review, accessed by the column name and row index of the dataset.
2. Convert the review to lower case using the lower method.
3. Split the review into a list of words using the split method.
4. Iterate through the words of the review, drop the ones that are stop words, and apply stemming to the remaining words using the stem method.
5. Join the stemmed words back into a single string separated by spaces.
6. Append the processed review to the corpus list.

In [7]:
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

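corpus = []
ps = PorterStemmer()                          # create the stemmer once
stop_words = set(stopwords.words('english'))  # build the stop word set once

for i in range(len(dataset)):
    # keep letters only, then lower-case and split into words
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
    review = review.lower()
    review = review.split()
    # stem every word that is not a stop word
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)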
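
To see what these steps do to a single review, the sketch below traces the first review of the dataset through the same pipeline. It assumes a small hand-written stop word set (`sample_stop_words`, a hypothetical name) instead of the NLTK list, so it runs without downloading any NLTK data.

```python
import re
from nltk.stem.porter import PorterStemmer

# Hypothetical, hand-picked stop words; the notebook uses NLTK's English list
sample_stop_words = {'this', 'is', 'the', 'a'}

text = 'Wow... Loved this place.'
letters_only = re.sub('[^a-zA-Z]', ' ', text)  # punctuation becomes spaces
words = letters_only.lower().split()           # ['wow', 'loved', 'this', 'place']

ps = PorterStemmer()
stemmed = [ps.stem(word) for word in words if word not in sample_stop_words]
cleaned = ' '.join(stemmed)
print(cleaned)                                 # wow love place
```

Stemming maps "loved" to "love", so different forms of the same word end up in one column of the bag of words matrix.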

## Creating the Bag of Words Model

Now we convert the words of our corpus into tokens using the CountVectorizer class from the feature_extraction.text module of the sklearn library. Its fit_transform method returns a sparse matrix in which each row is a review and each column counts the occurrences of one word.

However, there are still some words in the corpus that carry no sentiment and were not in the stop words list, so they were not removed in the previous steps.

To remove such words we pass the max_features argument to the CountVectorizer object. It sets how many words to keep, and the words kept are the most frequent ones in the corpus.

To choose a value for max_features, we first run the code below without this argument and check the total number of words with len(X[0]). We then pass the argument with a value chosen from that number (a value below it drops the least frequent words).

Now we fit the vectorizer on the corpus with the fit_transform method and convert the output to a 2D array.

After that we take the dependent variable, which is the last column of the dataset.

In [9]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, -1].values

In [10]:
len(X[0])

1500
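
The effect of CountVectorizer and max_features is easier to see on a tiny corpus. The three strings below are made up for illustration (the real corpus holds the cleaned reviews), and all variable names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical three-review corpus, already cleaned and stemmed
toy_corpus = ['wow love place', 'crust good', 'love crust']

# Without max_features every distinct word becomes a column
cv_all = CountVectorizer()
X_all = cv_all.fit_transform(toy_corpus).toarray()
vocab_all = sorted(cv_all.vocabulary_)  # columns follow alphabetical order
print(vocab_all)                        # ['crust', 'good', 'love', 'place', 'wow']
print(X_all)                            # one row per review, one count per word

# With max_features=2 only the two most frequent words are kept
cv_top = CountVectorizer(max_features=2)
X_top = cv_top.fit_transform(toy_corpus).toarray()
vocab_top = sorted(cv_top.vocabulary_)
print(vocab_top)                        # ['crust', 'love']
```

Here "crust" and "love" each appear twice and every other word once, so they are the two words that survive.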

## Splitting the Dataset in Training and Test Set

In [12]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
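
As a quick check of what test_size = 0.20 means, the sketch below splits placeholder arrays with the same number of rows as the dataset (1000). The arrays `X_demo` and `y_demo` are stand-ins, not the real features and labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 1000 rows, matching the number of reviews processed earlier
X_demo = np.arange(1000).reshape(1000, 1)
y_demo = np.arange(1000) % 2

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20, random_state=0)
print(len(X_tr), len(X_te))          # 800 200

# random_state fixes the shuffle, so the same call gives the same split
X_tr2, X_te2, _, _ = train_test_split(X_demo, y_demo, test_size=0.20, random_state=0)
print(np.array_equal(X_te, X_te2))   # True
```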

## Training the Naive Bayes model on the Training set

In [13]:
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

GaussianNB()

## Predicting the Test set results

In [15]:
y_pred = classifier.predict(X_test)

In [16]:
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

[[1 0]
 [1 0]
 [1 0]
 [0 0]
 [0 0]
 [1 0]
 [1 1]
 [1 0]
 [1 0]
 [1 1]
 [1 1]
 [1 1]
 [1 0]
 [1 1]
 [1 1]
 [1 1]
 [0 0]
 [0 0]
 [0 0]
 [1 1]
 [0 0]
 [0 1]
 [1 1]
 [1 0]
 [1 0]
 [0 1]
 [1 1]
 [1 1]
 [1 1]
 [0 0]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [0 0]
 [1 0]
 [0 0]
 [1 0]
 [1 1]
 [1 1]
 [1 0]
 [1 1]
 [0 0]
 [0 0]
 [0 0]
 [1 0]
 [1 0]
 [0 0]
 [0 0]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [1 0]
 [0 0]
 [1 1]
 [1 1]
 [0 0]
 [1 1]
 [1 0]
 [0 0]
 [1 0]
 [1 0]
 [1 1]
 [0 0]
 [1 1]
 [1 1]
 [1 1]
 [1 0]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [0 0]
 [1 0]
 [1 1]
 [0 1]
 [0 0]
 [1 1]
 [0 0]
 [1 1]
 [1 1]
 [0 0]
 [1 1]
 [1 1]
 [1 0]
 [0 0]
 [1 1]
 [1 0]
 [0 0]
 [1 1]
 [0 0]
 [0 0]
 [1 0]
 [1 1]
 [1 0]
 [1 1]
 [1 1]
 [1 0]
 [0 1]
 [1 1]
 [1 1]
 [1 0]
 [0 1]
 [1 0]
 [1 1]
 [1 1]
 [0 0]
 [0 1]
 [0 1]
 [1 1]
 [0 0]
 [1 0]
 [1 1]
 [0 0]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [0 0]
 [1 1]
 [1 0]
 [0 0]
 [0 0]
 [1 1]
 [1 0]
 [0 0]
 [1 1]
 [1 0]
 [1 1]
 [0 0]
 [0 0]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [1 0]
 [0 1]
 [1 1]
 [1 1]

## Making the Confusion Matrix

In [19]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)

In [20]:
print(cm)

[[55 42]
 [12 91]]


In [21]:
accuracy_score(y_test, y_pred)

0.73
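
The accuracy can be recomputed by hand from the confusion matrix printed above. In scikit-learn's confusion_matrix, rows are the actual classes and columns are the predicted classes, so the diagonal holds the correct predictions. The precision and recall lines are an addition for illustration and are not computed in the notebook.

```python
# Values copied from the printed confusion matrix
cm_values = [[55, 42],
             [12, 91]]

tn, fp = cm_values[0]  # actual 0: predicted 0, predicted 1
fn, tp = cm_values[1]  # actual 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
print(total, accuracy)                        # 200 0.73

# Extra metrics for the positive class (not part of the original notebook)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(round(precision, 3), round(recall, 3))  # 0.684 0.883
```

Most of the errors are the 42 negative reviews that were predicted as positive.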