# Natural Language Processing

<p>We use a TSV (tab-separated) file rather than a CSV file because CSV uses commas to separate fields, and people routinely write commas inside sentences, so a review's own text would be mistaken for a field boundary.</p>
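
To see the problem concretely, here is a minimal sketch that parses the same review once as comma-separated and once as tab-separated, using only Python's built-in `csv` module. The sample review is invented for illustration and is not a row from the dataset.

```python
import csv
import io

# Hypothetical review containing a comma, followed by its label.
comma_line = "The food was cold, and the service was slow,0\n"
tab_line = "The food was cold, and the service was slow\t0\n"

# Parsed as CSV, the comma inside the sentence splits the review in two.
csv_fields = next(csv.reader(io.StringIO(comma_line)))

# Parsed as TSV, the review stays in a single field.
tsv_fields = next(csv.reader(io.StringIO(tab_line), delimiter="\t"))

print(len(csv_fields), csv_fields)
print(len(tsv_fields), tsv_fields)
```

The comma-separated parse yields three fields instead of two, which is the misalignment the tab delimiter avoids.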

## Get Data

In [0]:
import requests
import pandas as pd

In [0]:
url = 'https://raw.githubusercontent.com/MarkIChen/NLP-restaurants/master/Restaurant_Reviews.tsv'
dataset = pd.read_csv(url, delimiter='\t', quoting=3)
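
The `quoting=3` argument is the same constant as `csv.QUOTE_NONE` in the standard library: it asks the parser to treat double quotes as ordinary characters instead of field wrappers. A minimal sketch with the built-in `csv` module, using an invented line:

```python
import csv
import io

# quoting=3 passed to pd.read_csv is the same value as csv.QUOTE_NONE.
print(csv.QUOTE_NONE)

# Hypothetical review that contains quotation marks in its text.
line = 'They call it "fresh" pasta\t0\n'

# With QUOTE_NONE the quotes are kept as literal characters.
fields = next(csv.reader(io.StringIO(line), delimiter="\t", quoting=csv.QUOTE_NONE))
print(fields)
```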


In [73]:
dataset

| | Review | Liked |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |
| ... | ... | ... |
| 995 | I think food should have flavor and texture an... | 0 |
| 996 | Appetite instantly gone. | 0 |
| 997 | Overall I was not impressed and would not go b... | 0 |
| 998 | The whole experience was underwhelming, and I ... | 0 |


## Create a bag of words

### 1. Cleaning one review

<p>
Normalize past-tense forms, -ing forms, and capitalized words so that variants of the same word are counted together.</p>

In [0]:
import re
import nltk

In [75]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [76]:
dataset['Review'][0]

'Wow... Loved this place.'

#### a. Remove numbers and punctuation


In [0]:
review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][0])

In [78]:
review

'Wow    Loved this place '
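
The pattern `[^a-zA-Z]` matches any single character that is not an ASCII letter, so digits, punctuation, and existing spaces are each replaced by one space. A small sketch on an invented sentence:

```python
import re

# Hypothetical review text, not taken from the dataset.
text = "Great pizza!!! 10/10, would visit again?"

# Replace every character that is not an ASCII letter with a space.
letters_only = re.sub('[^a-zA-Z]', ' ', text)
print(repr(letters_only))

# split() with no arguments collapses the runs of spaces left behind.
print(letters_only.split())
```

Because the pattern lists only ASCII ranges, accented letters such as é are replaced as well.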

#### b. Convert to lower case

In [0]:
review = review.lower()

In [80]:
review

'wow    loved this place '

#### c. Split into tokens and remove stop words

In [0]:
review = review.split()

In [82]:
review

['wow', 'loved', 'this', 'place']

In [0]:
from nltk.corpus import stopwords

In [0]:
stop_words = set(stopwords.words('english'))  # build the set once for fast lookups
review = [word for word in review if word not in stop_words]

In [85]:
review

['wow', 'loved', 'place']
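
The filtering step is a plain membership test against a set. The sketch below uses a small hand-picked stop word set as a stand-in for NLTK's English list (an assumption for illustration, since the real list is much longer) so that it runs without any downloads.

```python
# Small hand-picked stand-in for NLTK's English stop word list
# (an assumption for illustration; the real list is much longer).
stop_words = {"is", "this", "the", "and", "was", "not"}

tokens = ["crust", "is", "not", "good"]

# Keep only the tokens that are not stop words.
kept = [word for word in tokens if word not in stop_words]
print(kept)
```

NLTK's English list does include "not", so a negative review such as "Crust is not good." loses its negation at this step. That is worth keeping in mind when reading the classifier's errors.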

#### d. Stemming

In [0]:
from nltk.stem.porter import PorterStemmer

In [0]:
ps = PorterStemmer()

In [0]:
review = [ps.stem(word) for word in review]

In [89]:
review

['wow', 'love', 'place']

#### e. Joining

In [0]:
review = ' '.join(review)

In [91]:
review

'wow love place'

### 2. Cleaning all reviews

In [0]:
corpus = []

In [0]:
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build once instead of once per word
for review in dataset['Review']:
  review = re.sub('[^a-zA-Z]', ' ', review)  # keep letters only
  review = review.lower().split()
  review = [ps.stem(word) for word in review if word not in stop_words]
  corpus.append(' '.join(review))
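
The same steps can also be packaged as a function. The sketch below is one possible arrangement, not part of the original notebook: `clean_review`, its parameters, and `demo_stop_words` are hypothetical names. The stop words and the stemmer are passed in so the example runs without NLTK.

```python
import re

def clean_review(text, stop_words=frozenset(), stem=lambda word: word):
    """Clean one review: letters only, lower case, drop stop words, stem, rejoin."""
    words = re.sub('[^a-zA-Z]', ' ', text).lower().split()
    return ' '.join(stem(word) for word in words if word not in stop_words)

# Stand-in stop words; in the notebook this would be set(stopwords.words('english')).
demo_stop_words = {"this", "is", "the"}

# No stemmer is passed here, so words are left unstemmed.
print(clean_review("Wow... Loved this place.", demo_stop_words))
```

With NLTK available, passing the NLTK stop word set and `PorterStemmer().stem` should reproduce the loop's behavior, for example `[clean_review(r, stop_words, ps.stem) for r in dataset['Review']]`.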

### 3. Create the Bag of Words Model

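Note: A bag-of-words matrix is sparse (mostly zeros), because each review uses only a few of the words in the vocabulary. To limit this, we keep only the most frequent words by setting max_features.

We train a classification model on the existing reviews.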

In [0]:
from sklearn.feature_extraction.text import CountVectorizer

In [0]:
cv = CountVectorizer(max_features=1500)

In [0]:
X = cv.fit_transform(corpus).toarray()
# fit_transform returns a sparse count matrix; toarray() converts it to a dense NumPy array

In [106]:
X.shape

(1000, 1500)
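
To make the shape of `X` less abstract, here is a pure-Python sketch of the bag-of-words idea on a tiny invented corpus: one row per review, one column per kept word, each entry a count. It illustrates the concept only and is not scikit-learn's implementation. Tokenization and tie-breaking between equally frequent words may differ.

```python
from collections import Counter

# Tiny cleaned corpus in the same form as `corpus` (invented examples).
docs = ["wow love place", "crust good", "love crust love"]

# Count how often each word occurs across all documents.
totals = Counter(word for doc in docs for word in doc.split())

# Keep the most frequent words (the role max_features plays), sorted for stable columns.
max_features = 2
vocab = sorted(word for word, _ in totals.most_common(max_features))

# One row per document, one column per vocabulary word, entries are counts.
matrix = [[doc.split().count(word) for word in vocab] for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```

Here the matrix has 3 rows and 2 columns. In the notebook the same construction gives 1000 reviews by 1500 kept words.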

In [0]:
y = dataset['Liked'].values  # labels: 1 = liked, 0 = not liked

### 4. Train the model

In [0]:
from sklearn.model_selection import train_test_split

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [0]:
from sklearn.naive_bayes import GaussianNB

In [0]:
classifier = GaussianNB()

In [122]:
classifier.fit(X_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

In [0]:
y_pred = classifier.predict(X_test)

In [0]:
from sklearn.metrics import confusion_matrix

In [0]:
cm = confusion_matrix(y_test, y_pred)

In [127]:
cm

array([[55, 42],
       [12, 91]])
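
The confusion matrix can be summarized with a few standard metrics. The sketch below hard-codes the values printed above and relies on scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
# Confusion matrix from the output above. With labels 0 and 1 the layout
# is [[TN, FP], [FN, TP]] (rows = true labels, columns = predictions).
cm = [[55, 42],
      [12, 91]]
(tn, fp), (fn, tp) = cm

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision = tp / (tp + fp)   # of reviews predicted positive, how many were positive
recall = tp / (tp + fn)      # of truly positive reviews, how many were found

print(f"total     = {total}")
print(f"accuracy  = {accuracy:.2f}")
print(f"precision = {precision:.2f}")
print(f"recall    = {recall:.2f}")
```

The total of 200 matches the 20% test split of the 1000 reviews. The model classifies 146 of them correctly (73%), and most of its errors are the 42 negative reviews predicted as positive.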