<center><img src="https://github.com/insaid2018/Term-1/blob/master/Images/INSAID_Full%20Logo.png?raw=true" width="240" height="100" /></center>

<center><h1> Implementing Bag of Words Model</h1></center>

---
# **Table of Contents**
---

**1.** [**Importing Libraries**](#Section1)<br>
**2.** [**Importing the Dataset**](#Section2)<br>
**3.** [**Cleaning the Texts**](#Section3)<br>
**4.** [**Creating the Bag of Words Model**](#Section4)<br>
**5.** [**Splitting the dataset into Training and Test sets**](#Section5)<br>
**6.** [**Fitting Naive Bayes Algorithm to the Training set**](#Section6)<br>
**7.** [**Predicting on the Test set**](#Section7)<br>
**8.** [**Model Evaluation**](#Section8)<br>
**9.** [**Conclusion**](#Section9)<br>

---
<a name = Section1></a>
# **1. Importing Libraries**
---

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

---
<a name = Section2></a>
# **2. Importing the Dataset**
---

In [None]:
dataset = pd.read_csv('https://raw.githubusercontent.com/insaid2018/DeepLearning/master/Data/Restaurant_Reviews.tsv', delimiter='\t', quoting=3)
dataset.head()

| | Review | Liked |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |


---
<a name = Section3></a>
# **3. Cleaning the Texts**
---

- **Downloading** the NLTK **stopwords** list needed to clean the **text column**.

In [None]:

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

- We will be **cleaning** our text reviews in the next cell.

- There are multiple **steps** being performed on **each review**.

- At the end, we will have a **corpus** of clean reviews.

In [None]:
stop_words = set(stopwords.words('english'))                  # Building the stopword set once, outside the loop.
ps = PorterStemmer()                                          # Creating a single stemmer to reuse for every review.
corpus = []
for text in dataset['Review']:
    review = re.sub('[^a-zA-Z]', ' ', text)                   # Replacing anything except letters with a space.
    review = review.lower().split()                           # Lower-casing the text and splitting it into tokens.
    review = [ps.stem(word) for word in review if word not in stop_words]    # Dropping the stopwords, and stemming the remaining words.
    corpus.append(' '.join(review))                           # Joining the tokens back into one string and adding it to the corpus.
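
To make the cleaning steps concrete, here is a minimal, dependency-free sketch applied to single reviews. The `STOP_WORDS` set and the `clean_review` helper are illustrative assumptions: the set is a tiny stand-in for NLTK's English stopword list, and stemming is left out so the example runs without NLTK.

```python
import re

# Hypothetical, tiny stand-in for NLTK's English stopword list (assumption).
STOP_WORDS = {"the", "is", "this", "was", "and", "a"}

def clean_review(text):
    """Keep letters only, lowercase, tokenize, and drop stopwords."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)   # punctuation and digits become spaces
    tokens = letters_only.lower().split()            # lowercase, then split on whitespace
    kept = [tok for tok in tokens if tok not in STOP_WORDS]
    return " ".join(kept)

cleaned = clean_review("Wow... Loved this place.")
print(cleaned)  # wow loved place
```

With the Porter stemmer applied as in the notebook cell, tokens such as `loved` would additionally be reduced to their stems.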

---
<a name = Section4></a>
# **4. Creating the Bag of Words Model**
---

- We are using the **`CountVectorizer`** imported earlier from **`sklearn.feature_extraction.text`**.

- Building a `CountVectorizer` that keeps only the **1500 most frequent** words in the corpus.

In [None]:
cv = CountVectorizer(max_features=1500)
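
The effect of capping the vocabulary can be sketched with the standard library alone. This is only an illustration of the idea of keeping the most frequent terms, not scikit-learn's implementation; `toy_corpus` and the cap of 2 are made-up values.

```python
from collections import Counter

# Toy corpus of already-cleaned reviews (illustrative, not from the dataset).
toy_corpus = ["love place", "crust not good", "love crust love"]
max_features = 2  # hypothetical small cap; the notebook uses 1500

# Count every token across the whole corpus.
term_counts = Counter(tok for doc in toy_corpus for tok in doc.split())

# Keep only the `max_features` most frequent terms.
top_terms = [term for term, _ in term_counts.most_common(max_features)]
print(sorted(top_terms))  # ['crust', 'love']
```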

- Fitting the vectorizer to our **corpus** and converting the resulting sparse matrix to a dense array using the **`toarray()`** method.

In [None]:
X = cv.fit_transform(corpus).toarray()
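
To show what the resulting matrix contains, the sketch below builds a small bag-of-words matrix by hand: one row per review, one column per vocabulary word, each cell holding a count. The corpus and the helper `to_count_vector` are illustrative assumptions, not part of the notebook.

```python
from collections import Counter

# Toy corpus of already-cleaned reviews (illustrative, not from the dataset).
toy_corpus = ["love place", "crust not good", "love crust love"]

# Vocabulary: every distinct token, sorted alphabetically.
vocabulary = sorted({tok for doc in toy_corpus for tok in doc.split()})

def to_count_vector(doc, vocab):
    """Return the count of each vocabulary word in `doc`, in vocabulary order."""
    counts = Counter(doc.split())
    return [counts[tok] for tok in vocab]

matrix = [to_count_vector(doc, vocabulary) for doc in toy_corpus]

print(vocabulary)  # ['crust', 'good', 'love', 'not', 'place']
for row in matrix:
    print(row)
# [0, 0, 1, 0, 1]
# [1, 1, 0, 1, 0]
# [1, 0, 2, 0, 0]
```

Word order within a review is discarded; only the counts survive, which is what gives the model its name.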

- Extracting the target labels from the second column of the dataset (**`Liked`**).

In [None]:
y = dataset.iloc[:, 1].values

---
<a name = Section5></a>
# **5. Splitting the dataset into Training and Test sets**
---

- Splitting our data into training (80%) and test (20%) sets using the **`train_test_split`** function from **`sklearn.model_selection`**.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
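
The idea behind a seeded hold-out split can be sketched with the standard library. The helper `split_train_test` is hypothetical and does not reproduce scikit-learn's exact shuffling, so the same seed will select different rows than `train_test_split` does.

```python
import random

def split_train_test(rows, labels, test_size=0.20, seed=0):
    """Shuffle indices with a fixed seed, then hold out `test_size` of them."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)       # fixed seed makes the split reproducible
    n_test = round(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(rows, train_idx), pick(rows, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

rows = [[i] for i in range(10)]        # ten toy feature rows
labels = [i % 2 for i in range(10)]    # toy labels aligned with the rows
X_tr, X_te, y_tr, y_te = split_train_test(rows, labels)
print(len(X_tr), len(X_te))  # 8 2
```

Fixing the seed matters for the same reason `random_state=0` is passed in the notebook: it keeps the reported accuracy reproducible between runs.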

---
<a name = Section6></a>
# **6. Fitting Naive Bayes Algorithm to the Training set**
---

- We are using the **`MultinomialNB`** from **`sklearn.naive_bayes`** to train our model.



In [None]:
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
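
For intuition about what the classifier learns from count vectors, here is a small from-scratch sketch of multinomial Naive Bayes with additive smoothing (`alpha=1.0`, matching the default shown in the output above). The function names and toy data are assumptions made for illustration; this is not scikit-learn's code.

```python
import math

def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate a log prior and smoothed log likelihoods for each class."""
    n_features = len(X[0])
    model = {}
    for label in sorted(set(y)):
        rows = [x for x, target in zip(X, y) if target == label]
        feature_totals = [sum(col) for col in zip(*rows)]    # word counts within the class
        denom = sum(feature_totals) + alpha * n_features     # smoothed normalizer
        model[label] = {
            "log_prior": math.log(len(rows) / len(X)),
            "log_likelihood": [math.log((c + alpha) / denom) for c in feature_totals],
        }
    return model

def predict_one(model, x):
    """Return the class with the highest log posterior score for count vector `x`."""
    def score(label):
        params = model[label]
        return params["log_prior"] + sum(
            count * ll for count, ll in zip(x, params["log_likelihood"]))
    return max(model, key=score)

# Toy count vectors over the vocabulary ['bad', 'good', 'love'] (illustrative).
X_toy = [[0, 2, 1], [0, 1, 2], [2, 0, 0], [1, 1, 0]]
y_toy = [1, 1, 0, 0]
toy_model = fit_multinomial_nb(X_toy, y_toy)
print(predict_one(toy_model, [0, 1, 1]))  # 1
print(predict_one(toy_model, [2, 0, 0]))  # 0
```

Smoothing keeps a word that never appeared in a class (such as `bad` in class 1 here) from driving that class's probability to zero.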

---
<a name = Section7></a>
# **7. Predicting on the Test set**
---

In [None]:
y_pred = classifier.predict(X_test)

---
<a name = Section8></a>
# **8. Model Evaluation**
---

In [None]:
accuracy_score(y_test, y_pred)

0.765

- The model reaches an accuracy of **76.5%** on the test set.
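
Accuracy is the fraction of test reviews whose predicted label matches the true label. The sketch below computes it by hand on made-up labels, together with the four confusion-matrix counts, which show where the errors fall; the helper names are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    return {
        "tp": sum(t == 1 and p == 1 for t, p in pairs),
        "tn": sum(t == 0 and p == 0 for t, p in pairs),
        "fp": sum(t == 0 and p == 1 for t, p in pairs),
        "fn": sum(t == 1 and p == 0 for t, p in pairs),
    }

# Made-up labels for illustration only.
toy_true = [1, 0, 0, 1, 1]
toy_pred = [1, 0, 1, 1, 0]
toy_accuracy = accuracy(toy_true, toy_pred)
toy_counts = confusion_counts(toy_true, toy_pred)
print(toy_accuracy)  # 0.6
print(toy_counts)    # {'tp': 2, 'tn': 1, 'fp': 1, 'fn': 1}
```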

---
<a name = Section9></a>
# **9. Conclusion**
---

- This **notebook** gives a basic idea of how to use the **Bag of Words** model on a real dataset.

- We learn about **cleaning** text reviews and building a **corpus** of all our documents (reviews).

- Then we **build** a Bag of Words model using the **CountVectorizer**.

- At last, we feed the **numeric representation** of the textual data into a **Machine Learning** model and make **predictions**.