# NLP Application 2

### This notebook builds and tests a text classification model on the restaurant reviews dataset. The reviews are cleaned, bag-of-words features are extracted, a Naive Bayes model is trained, and predictions are made on the test data. A confusion matrix is used to evaluate the model's performance.

In [1]:
import numpy as np
import pandas as pd
import re
import nltk

### File path and data reading

In [2]:
Reviews = pd.read_csv(r'C:\Users\Arif Furkan\OneDrive\Belgeler\Python_kullanirken\Restaurant_Reviews.csv')
print(Reviews)

                                                Review  Liked
0                            Wow... Loved this place.       1
1                                  Crust is not good.       0
2           Not tasty and the texture was just nasty.       0
3    Stopped by during the late May bank holiday of...      1
4    The selection on the menu was great and so wer...      1
..                                                 ...    ...
992  I think food should have flavor and texture an...      0
993                          Appetite instantly gone.       0
994  Overall I was not impressed and would not go b...      0
995  The whole experience was underwhelming  and I ...      0
996  Then  as if I hadn't wasted enough of my life ...      0

[997 rows x 2 columns]


### Data Preprocessing

In [3]:
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

nltk.download('stopwords')

ps = PorterStemmer()
# Build the stopword set once instead of on every loop iteration
english_stopwords = set(stopwords.words('english'))

collection = []
for i in range(len(Reviews)):
    # Replace every non-letter character with a space
    Comment = re.sub('[^a-zA-Z]', ' ', Reviews['Review'][i])
    # Lowercase and split into words
    Comment = Comment.lower().split()
    # Drop stopwords and reduce the remaining words to their stems
    Comment = [ps.stem(Word) for Word in Comment if Word not in english_stopwords]
    collection.append(' '.join(Comment))

[nltk_data] Downloading package stopwords to C:\Users\Arif
[nltk_data]     Furkan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
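
To make the cleaning steps concrete, the following minimal sketch applies the same regex, lowercasing, and stopword filtering to one review from the dataset. It is illustrative only: `SAMPLE_STOPWORDS` is a tiny hand-picked stand-in for NLTK's English stopword list, the helper name `clean_review` is hypothetical, and stemming is left out so the example runs without NLTK.

```python
import re

# Hypothetical, tiny stand-in for NLTK's English stopword list (assumption)
SAMPLE_STOPWORDS = {"not", "and", "the", "was", "is", "this", "just"}

def clean_review(text, stopwords=SAMPLE_STOPWORDS):
    """Replace non-letters with spaces, lowercase, split, drop stopwords."""
    letters_only = re.sub('[^a-zA-Z]', ' ', text)
    words = letters_only.lower().split()
    return ' '.join(w for w in words if w not in stopwords)

print(clean_review("Not tasty and the texture was just nasty."))
# tasty texture nasty
```

Note that the negation disappears because "not" is in the sample set, as it is in NLTK's English list, so the cleaned text no longer distinguishes "not tasty" from "tasty".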


### Bag of Words (BOW)

In [5]:
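from sklearn.feature_extraction.text import CountVectorizer
# Keep at most the 1996 most frequent terms as features
cv = CountVectorizer(max_features=1996)
# Rows are reviews, columns are term counts
X = cv.fit_transform(collection).toarray()
# Labels: the 'Liked' column (1 = liked, 0 = not liked)
y = Reviews.iloc[:, 1].values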
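
The idea behind the count matrix can be sketched without scikit-learn. The snippet below illustrates the bag-of-words principle on three made-up, already-cleaned strings. It is not a reproduction of `CountVectorizer`, which also applies its own tokenization rules and the `max_features` cut-off; the function `bag_of_words` and the sample documents are hypothetical.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts vocabulary[j] in docs[i]."""
    vocabulary = sorted({word for doc in docs for word in doc.split()})
    matrix = []
    for doc in docs:
        counts = Counter(doc.split())
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

# Made-up sample documents, for illustration only
docs = ["love place", "crust good", "love love crust"]
vocabulary, matrix = bag_of_words(docs)

print(vocabulary)  # ['crust', 'good', 'love', 'place']
for row in matrix:
    print(row)
# [0, 0, 1, 1]
# [1, 1, 0, 0]
# [1, 0, 2, 0]
```

Each row is one document and each column one vocabulary term, which is the same layout as `X` in the cell above.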

### Splitting the Dataset into Training and Test Sets

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
print(X_train)
print(y_train)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
[1 0 1 1 1 0 1 1 1 0 1 1 1 0 0 1 1 1 1 0 0 1 1 1 0 0 1 0 1 0 0 1 1 1 0 0 1
 1 0 0 1 0 0 0 1 0 0 1 0 1 1 1 0 1 0 0 0 0 0 0 0 1 0 0 1 1 1 0 1 0 1 0 0 1
 1 1 1 0 1 1 1 0 1 0 1 0 0 0 1 0 0 0 1 1 0 1 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0
 1 0 0 1 0 1 1 0 1 1 1 1 1 0 0 1 1 0 1 1 0 0 0 1 0 0 0 0 1 0 0 0 1 1 0 0 1
 0 1 1 1 1 1 1 0 0 1 0 0 0 0 1 0 0 1 1 1 1 1 0 0 1 0 0 1 1 1 0 0 1 0 1 0 1
 1 0 0 1 0 0 1 0 0 1 0 0 1 1 1 1 0 0 1 1 1 1 0 1 1 1 1 0 1 1 0 1 1 1 0 0 1
 0 1 0 0 1 0 0 1 1 1 0 1 1 1 0 1 1 0 1 0 0 0 1 1 1 0 0 1 0 1 0 0 0 0 1 1 0
 0 1 1 1 1 1 0 1 0 1 1 1 1 1 1 0 0 1 1 0 0 1 0 1 1 1 1 1 1 0 0 1 0 1 1 0 0
 0 1 1 0 1 1 1 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 1 1 0 1 1 1 0 1 0 1
 1 1 1 1 1 0 0 1 0 1 0 0 1 1 0 0 1 1 1 1 1 1 0 0 1 0 0 0 1 0 0 0 1 1 0 1 1
 1 1 0 1 0 0 0 0 0 1 0 0 0 0 1 1 1 0 0 1 0 1 0 0 0 1 0 1 1 0 0 0 0 0 0 1 1
 0 1 1 0 0 0 0 1 1 0 1 0 1 1 0 1 1 1 0 0 1 0 1 0 1 0 1 
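
As a quick sanity check on the split, the sizes can be worked out by hand. The sketch below assumes the test set size is the requested fraction rounded up to a whole number of samples, which is consistent with the 200 test predictions counted in the confusion matrix at the end of this notebook; the variable names are illustrative.

```python
import math

n_samples = 997      # rows reported by the DataFrame printout
test_size = 0.20     # value passed to train_test_split

# Assumption: the test count is the fraction rounded up to a whole sample
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 797 200
```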

### Naive Bayes Model Training and Prediction

In [7]:
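from sklearn.naive_bayes import GaussianNB
# Fit a Gaussian Naive Bayes classifier on the training features
gnb = GaussianNB()
gnb.fit(X_train, y_train)
# Predict labels for the held-out test reviews
y_pred = gnb.predict(X_test)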

### Confusion Matrix and Performance Evaluation

In [8]:
from sklearn.metrics import confusion_matrix
# Rows are true labels, columns are predicted labels (order: 0, 1)
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[48 47]
 [18 87]]
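
The matrix can be condensed into the usual summary metrics. The sketch below hard-codes the values printed above and assumes scikit-learn's layout, where rows are true labels and columns are predicted labels in the order 0, 1; the variable names are illustrative.

```python
# Values copied from the printed confusion matrix
cm = [[48, 47],
      [18, 87]]

tn, fp = cm[0]   # true label 0: predicted 0, predicted 1
fn, tp = cm[1]   # true label 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # share of predicted "liked" that were truly liked
recall = tp / (tp + fn)      # share of truly liked reviews that were found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# accuracy=0.675 precision=0.649 recall=0.829
```

Here 135 of the 200 test reviews are classified correctly (67.5%). The 47 false positives against 18 false negatives suggest the model leans toward predicting "liked".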
