***Below are some major NLP libraries:***

NLTK, spaCy, Stanford NLP, OpenNLP

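BOW (bag-of-words) model - An NLP representation used to preprocess text before fitting a classification algorithm on the observations containing that text. Each document is reduced to the counts of the words it contains, ignoring word order.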


***Goal***
1. Clean the texts to prepare them for ML models
2. Create the BOW model
3. Apply ML models to the BOW model

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter='\t', quoting=3)  # quoting=3 is csv.QUOTE_NONE: double quotes are read as ordinary characters
dataset.head()

Unnamed: 0,Review,Liked
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1
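
To see what `quoting = 3` changes, here is a minimal sketch on an in-memory TSV. The two sample rows are made up for illustration and are not taken from `Restaurant_Reviews.tsv`; `3` is the integer value of the standard-library constant `csv.QUOTE_NONE`.

```python
import csv
import io

import pandas as pd

# Hypothetical two-row TSV; the first review contains literal double quotes.
raw = 'Review\tLiked\n"Great" food\t1\nToo salty\t0\n'

# csv.QUOTE_NONE has the integer value 3, so quoting=3 and
# quoting=csv.QUOTE_NONE are the same setting.
sample = pd.read_csv(io.StringIO(raw), delimiter='\t', quoting=csv.QUOTE_NONE)

# The double quotes are kept as ordinary characters in the review text.
first_review = sample['Review'][0]
print(first_review)
```

This matters for review text, where quotation marks are part of what the customer wrote rather than field delimiters.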


In [3]:
dataset['Review'][0]

'Wow... Loved this place.'

**Remove Punctuation** - Replace every non-letter character (punctuation, digits) with a space and lowercase the text.

In [25]:
import re
import nltk
from nltk.corpus import stopwords
review = (re.sub('[^a-zA-Z]',' ',dataset['Review'][0])).lower()
review

'wow    loved this place '
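
The pattern `[^a-zA-Z]` matches any single character that is not an ASCII letter, so digits and accented letters are replaced along with punctuation. A small sketch on a made-up sentence (not from the dataset) shows the effect:

```python
import re

text = "Wow... 5 stars!"  # made-up example, not from the dataset

# Every non-letter character becomes one space; letters are untouched.
cleaned = re.sub('[^a-zA-Z]', ' ', text).lower()

# split() with no arguments collapses the runs of spaces left behind.
tokens = cleaned.split()
print(tokens)
```

The rating "5" disappears here, which is fine for this exercise but worth remembering if numbers carry meaning in your own data.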

**Remove Stopwords**

In [26]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Nitin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [27]:
review  = review.split()
review

['wow', 'loved', 'this', 'place']

In [28]:
# stopwords.words('english') returns a list; convert it to a set once so each lookup is fast
stop_words = set(stopwords.words('english'))
review = [word for word in review if word not in stop_words]
review

['wow', 'loved', 'place']
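
One side effect worth knowing: NLTK's English stop word list includes the negation "not", so a review such as "Crust is not good." loses the word that carries its sentiment. The sketch below illustrates this and one possible workaround. It uses a tiny hand-written set, `demo_stop_words`, as a stand-in for the NLTK list so that it runs without the download; that set and the variable names are assumptions made for illustration.

```python
# Hand-written stand-in for set(stopwords.words('english'));
# the real list is much longer.
demo_stop_words = {'is', 'not', 'this', 'the', 'was'}

tokens = 'crust is not good'.split()

# Default behaviour: the negation is filtered out with the other stop words.
filtered = [word for word in tokens if word not in demo_stop_words]

# Variation: keep 'not' by removing it from the set before filtering.
keep_negation = demo_stop_words - {'not'}
filtered_with_not = [word for word in tokens if word not in keep_negation]

print(filtered)
print(filtered_with_not)
```

Whether keeping negations helps depends on the model; with a plain bag of words, "not" becomes one more column and is not tied to the word it negates.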

**Stemming** - Reduce each word to its stem so that variants such as "love" and "loved" map to the same token, which reduces the sparsity of the resulting matrix.

In [29]:
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
review = [ps.stem(word) for word in review]  # stop words were already removed in the previous cell
review

['wow', 'love', 'place']
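
A stem is not always a dictionary word. As a quick sketch, the same `PorterStemmer` applied to a few words from the first three reviews gives:

```python
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

# Words taken from the first three reviews in the dataset.
words = ['loved', 'tasty', 'texture', 'nasty']
stems = [ps.stem(word) for word in words]

print(stems)
```

The stems only need to be consistent, not readable: every occurrence of "tasty" maps to the same token, so it still ends up in a single column.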

In [30]:
review = ' '.join(review)

In [31]:
review

'wow love place'

In [11]:
corpus = []
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build the set once, outside the loop
for i in range(len(dataset)):
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i]).lower().split()
    review = [ps.stem(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))

In [12]:
corpus[0:5]

['wow love place',
 'crust good',
 'tasti textur nasti',
 'stop late may bank holiday rick steve recommend love',
 'select menu great price']

**Create Bag of Words Model**

Collect every unique word in the corpus and give each one its own column. Then, for every review (row), record how many times each word occurs. Most entries are zero, so the result is a sparse matrix. Splitting the text into words is called tokenization.

In [13]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()
X.shape

(1000, 1565)
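
With 1565 columns the matrix is hard to inspect, so here is a minimal sketch of the same step on a made-up three-document corpus. The names `toy_corpus`, `toy_cv`, and `toy_X` are illustrative only. The third document repeats a word to show that `CountVectorizer` stores counts, not just 0 or 1.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up mini corpus; the third document repeats a word on purpose.
toy_corpus = ['wow love place', 'crust good', 'love love crust']

toy_cv = CountVectorizer()
toy_X = toy_cv.fit_transform(toy_corpus).toarray()

# vocabulary_ maps each word to its column index; sort words by that index.
vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(vocab)
print(toy_X)
```

Each row is one document and each column one word, in alphabetical order. If only presence or absence is wanted, `CountVectorizer(binary=True)` caps every count at 1.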

Reduce sparsity by keeping only the 1500 most frequent words

In [14]:
cv = CountVectorizer(max_features=1500)
X = cv.fit_transform(corpus).toarray()
X.shape

(1000, 1500)
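
`max_features` ranks words by their total count across the corpus and keeps only the top ones. A small sketch on a made-up corpus, where the word frequencies are chosen so that the cut-off is unambiguous:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus: 'good' occurs 3 times, 'food' twice, the rest once.
toy_corpus = ['good food', 'good service', 'good food slow']

full_cv = CountVectorizer().fit(toy_corpus)
small_cv = CountVectorizer(max_features=2).fit(toy_corpus)

# vocabulary_ holds the words that survived as columns.
full_vocab = sorted(full_cv.vocabulary_)
small_vocab = sorted(small_cv.vocabulary_)
print(full_vocab)
print(small_vocab)
```

The words that occur only once are dropped, which is the same effect the 1500-word limit has on the rarest words in the review corpus.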

Getting the labels

In [15]:
y = dataset.iloc[:,1].values

In [16]:
y

array([1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1,
       0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1,
       1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1,
       1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0,
       0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1,
       0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1,
       0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1,