## Natural Language Processing - Classification

We have a set of restaurant reviews that we would like to classify as positive or negative. The dataset is a tab-separated file in which the first column is the review text and the second column is an integer (0 or 1), where 1 indicates the reviewer liked the restaurant and 0 otherwise.

In [1]:
# Importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
# Importing the dataset
# We specify the delimiter because this is a .tsv file, and set quoting=3 (csv.QUOTE_NONE)
# so that double quotes inside reviews are treated as ordinary characters
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter='\t', quoting=3)
rows, columns = dataset.shape
dataset.head(5)

Unnamed: 0,Review,Liked
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1
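
The value `quoting=3` is the numeric value of `csv.QUOTE_NONE` in Python's standard `csv` module. As a rough illustration of why it matters, the sketch below parses a made-up review line (not taken from the dataset) with the standard-library reader rather than pandas, once with `QUOTE_NONE` and once with the default quoting mode.

```python
import csv

# A made-up tab-separated line whose review starts with a double quote.
line = '"Wow" is all I can say\t1'

# quoting=3 in the read_csv call is the same constant as csv.QUOTE_NONE.
assert csv.QUOTE_NONE == 3

# With QUOTE_NONE, double quotes are kept as ordinary characters.
raw_row = next(csv.reader([line], delimiter='\t', quoting=csv.QUOTE_NONE))

# With the default mode, the leading quote is parsed as a quoting character.
default_row = next(csv.reader([line], delimiter='\t'))

print(raw_row)
print(default_row)
```

With `QUOTE_NONE` the review text keeps its quotes exactly as written, which is the safer choice for free text that may contain unbalanced quotes.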


Now that we have imported our data, we need to do some cleaning so that it can be used by our algorithms. To help the classification process we will apply the following preprocessing steps.
    * Remove words like 'the', 'and', 'so', etc. 
    * Remove the punctuation and numbers. 
    * Perform stemming, which reduces different forms of the same word, such as 'loved', 'loves' and 'loving', to a common stem such as 'love'. 
    * Convert all the words to lower case.
    * Tokenize the text. This splits each sentence into separate words.

In [3]:
# Cleaning the texts
# Import library to help in cleaning text 
import re
# Import library to help in nlp 
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

[nltk_data] Downloading package stopwords to /Users/Kiran/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


We will have a look at preprocessing just 1 review and later run all the preprocessing steps in a for loop to clean all the reviews.

In [4]:
review = dataset['Review'][0]
print(review)

Wow... Loved this place.


In [5]:
# Removing all characters which are not letters
review = re.sub('[^a-zA-Z]',' ', review)
print(review)

Wow    Loved this place 


In [6]:
# Convert all letters to lower case
review = review.lower()
print(review)

wow    loved this place 


In [7]:
# Splitting the sentence into words and removing irrelevant words like 'this', 'the', 'that', etc.
review = review.split()
print("Before: ",review)
review = [word for word in review if word not in set(stopwords.words('english'))]
print("After: ",review)

Before:  ['wow', 'loved', 'this', 'place']
After:  ['wow', 'loved', 'place']


In [8]:
# Stemming the words
ps = PorterStemmer()
review = [ps.stem(word) for word in review]
print(review)

['wow', 'love', 'place']


In [9]:
review = ' '.join(review)
print(review)

wow love place


Now we can apply the same steps to every review in our dataset.

In [10]:
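corpus = []
# Build the stop-word set once instead of rebuilding it for every word
stop_words = set(stopwords.words('english'))
for i in range(rows):
    # Keep letters only, lower-case the text and split it into words
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
    review = review.lower().split()
    # Drop stop words and stem what remains
    review = [ps.stem(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))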

print(corpus[1:10])

['crust good', 'tasti textur nasti', 'stop late may bank holiday rick steve recommend love', 'select menu great price', 'get angri want damn pho', 'honeslti tast fresh', 'potato like rubber could tell made ahead time kept warmer', 'fri great', 'great touch']
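
Note that the negative review 'Crust is not good.' has become 'crust good': NLTK's English stop-word list includes 'not', so the negation is removed along with the other stop words. One possible adjustment is to take negation words out of the stop-word set before filtering. The sketch below shows the idea with a tiny hand-written stop-word set (hypothetical, and far shorter than NLTK's) and without the stemming step; the `clean` helper and the `negations` set are illustrative names, not part of the pipeline above.

```python
import re

# Hypothetical miniature stop-word list; NLTK's English list is much longer.
stop_words = {'is', 'the', 'this', 'and', 'was', 'not', 'no'}
# Negation words we choose to keep (an illustrative choice).
negations = {'not', 'no'}

def clean(text, stop_words):
    """Keep letters only, lower-case, split, and drop stop words."""
    words = re.sub('[^a-zA-Z]', ' ', text).lower().split()
    return ' '.join(word for word in words if word not in stop_words)

print(clean('Crust is not good.', stop_words))              # crust good
print(clean('Crust is not good.', stop_words - negations))  # crust not good
```

Whether keeping negations improves accuracy on this dataset is not something the notebook measures, so treat this as an option to experiment with.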


###### Bag of words model
In the bag-of-words model we collect every distinct word in our corpus and give each word its own column. Each row is still one review, and each cell holds the number of times that column's word appears in that review. Most cells will be 0, which is why this kind of matrix is called a sparse matrix.
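
To make the idea concrete, here is a minimal pure-Python sketch that builds such a matrix for three short cleaned reviews. The first two strings come from the corpus output above; the third is made up so that one cell holds a count of 2. Cells hold occurrence counts, which is what `CountVectorizer` stores by default. This only illustrates the concept and is not how `CountVectorizer` is implemented.

```python
# Three short, already-cleaned reviews (the third one is made up).
docs = ['crust good', 'fri great', 'great touch great']

# Vocabulary: every distinct word, sorted so each word gets a fixed column.
vocabulary = sorted({word for doc in docs for word in doc.split()})

# One row per review, one column per word; each cell counts occurrences.
matrix = [[doc.split().count(word) for word in vocabulary] for doc in docs]

print(vocabulary)
for row in matrix:
    print(row)
```

Even in this tiny example most cells are 0, and the proportion of zeros grows quickly as the vocabulary gets larger.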

In [11]:
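from sklearn.feature_extraction.text import CountVectorizer
# Keep only the 1500 most frequent words as columns
cv = CountVectorizer(max_features = 1500)
# fit_transform returns a SciPy sparse matrix; toarray() converts it to a dense NumPy array
X = cv.fit_transform(corpus).toarray()
# The second column (Liked) is the label
y = dataset.iloc[:,1].values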

Now that we have the bag-of-words matrix, each word column can be treated as an independent variable, and the dataset's last column (Liked) as the dependent variable. This is enough data to fit a classification model. We are going to fit a Naive Bayes classifier on the dataset and determine its classification accuracy.

###### Applying Naive Bayes Classification

In [12]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20,
                                                    random_state=0)
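
Conceptually, the split shuffles the row indices with a fixed seed and sets aside 20% of them for testing. The sketch below shows that idea in pure Python. The helper name `split_indices` is hypothetical, the row count of 1000 is an assumption inferred from the 200 test predictions shown later, and the shuffle will not reproduce the exact rows that `train_test_split` selects.

```python
import random

def split_indices(n_rows, test_size=0.20, seed=0):
    """Shuffle row indices and return (train_indices, test_indices)."""
    indices = list(range(n_rows))
    # A seeded generator makes the shuffle repeatable, like random_state.
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]

# Assumed dataset size of 1000 rows.
train_idx, test_idx = split_indices(1000)
print(len(train_idx), len(test_idx))  # 800 200
```

Fixing the seed means the same rows land in the test set on every run, so accuracy figures stay comparable between experiments.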

In [13]:
# Fitting classifier to the Training Set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

GaussianNB(priors=None)
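
To give a feel for what the fit above estimates, the following is a minimal pure-Python sketch of Gaussian Naive Bayes: for each class it stores a prior plus a mean and variance per feature, and it predicts the class with the highest log posterior under the assumption that features are independent. The function names, the absolute `smoothing` constant, and the toy data are all assumptions made for illustration; this is not scikit-learn's implementation.

```python
import math

def fit_gaussian_nb(X, y, smoothing=1e-9):
    """Estimate per-class priors, feature means and feature variances."""
    model = {}
    for label in set(y):
        rows = [x for x, target in zip(X, y) if target == label]
        n_features = len(rows[0])
        means = [sum(r[j] for r in rows) / len(rows) for j in range(n_features)]
        # The small constant keeps variances non-zero (illustrative value).
        variances = [
            sum((r[j] - means[j]) ** 2 for r in rows) / len(rows) + smoothing
            for j in range(n_features)
        ]
        model[label] = (len(rows) / len(X), means, variances)
    return model

def predict_gaussian_nb(model, x):
    """Return the label with the highest log posterior for one sample."""
    def log_posterior(label):
        prior, means, variances = model[label]
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
            for value, mean, var in zip(x, means, variances)
        )
        return math.log(prior) + log_likelihood
    return max(model, key=log_posterior)

# Toy word-count rows with columns ['good', 'bad'] (made-up data).
X_toy = [[2, 0], [1, 0], [3, 1], [0, 2], [0, 1], [1, 3]]
y_toy = [1, 1, 1, 0, 0, 0]
model = fit_gaussian_nb(X_toy, y_toy)
print(predict_gaussian_nb(model, [2, 0]), predict_gaussian_nb(model, [0, 2]))  # 1 0
```

Gaussian Naive Bayes models each feature as a continuous, normally distributed value, which is only a rough fit for word counts that are mostly 0.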

In [14]:
# Predicting the Test Set results
y_pred = classifier.predict(X_test)
print("y_test Values:", y_test)                            
print("y_pred Values:", y_pred)

y_test Values: [0 0 0 0 0 0 1 0 0 1 1 1 0 1 1 1 0 0 0 1 0 1 1 0 0 1 1 1 1 0 1 1 1 1 1 0 0
 0 0 1 1 0 1 0 0 0 0 0 0 0 1 1 1 1 0 0 1 1 0 1 0 0 0 0 1 0 1 1 1 0 1 1 1 1
 0 0 1 1 0 1 0 1 1 0 1 1 0 0 1 0 0 1 0 0 0 1 0 1 1 0 1 1 1 0 1 0 1 1 0 1 1
 1 0 0 1 0 1 1 1 1 1 0 1 0 0 0 1 0 0 1 0 1 0 0 1 1 1 1 1 0 1 1 1 0 0 0 0 1
 1 1 1 1 1 1 0 0 1 1 1 0 0 0 1 1 0 0 0 0 0 1 0 1 1 0 0 1 0 1 0 1 1 0 0 0 0
 1 0 1 0 1 1 0 0 0 1 0 1 1 0 1]
y_pred Values: [1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 0 1 1 1 0 1 1 1 1 1 0 1
 0 1 1 1 1 1 0 0 0 1 1 0 0 1 1 1 1 1 0 1 1 0 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1
 0 1 1 0 0 1 0 1 1 0 1 1 1 0 1 1 0 1 0 0 1 1 1 1 1 1 0 1 1 1 0 1 1 1 0 0 0
 1 0 1 1 0 1 1 1 1 1 0 1 1 0 0 1 1 0 1 1 1 0 0 1 1 1 1 1 1 0 1 1 0 1 0 1 1
 1 1 1 0 1 1 1 0 1 1 1 1 1 0 0 1 0 0 1 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 0 1 0
 1 0 1 0 1 1 0 1 1 1 0 1 1 1 1]


In [15]:
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[55 42]
 [12 91]]


In [16]:
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
accuracy = (tn+tp)/(tn + fp + fn + tp) * 100
print("Accuracy: ",accuracy,"%")

Accuracy:  73.0 %
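
Accuracy alone hides how the errors are distributed. Using the counts from the confusion matrix printed above (rows are true labels, columns are predicted labels), the sketch below also derives precision, recall and the F1 score for the positive class. The variable names are chosen for illustration.

```python
# Counts read off the confusion matrix printed above.
tn, fp, fn, tp = 55, 42, 12, 91

accuracy = (tn + tp) / (tn + fp + fn + tp)
precision = tp / (tp + fp)   # share of predicted positives that were truly positive
recall = tp / (tp + fn)      # share of true positives that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.730 precision=0.684 recall=0.883 f1=0.771
```

The 42 false positives against 12 false negatives show that this classifier leans towards predicting "liked", which is why recall is noticeably higher than precision.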
