# Natural Language Processing

### Steps to learn NLP: Bottom to Top Approach

![flows](images/flow.PNG)

### What is NLP

Natural Language Processing (NLP) applies Machine Learning models to text and speech. Its focus is teaching machines to understand spoken and written language.

### Use-Cases

You can use NLP on a text review to predict whether the review is positive or negative, on an article to predict which category it belongs to, or on a book to predict its genre. It can go further: you can use NLP to build a machine translator or a speech recognition system, and in that last example you use classification algorithms to classify language. Speaking of classification algorithms, many classical NLP algorithms are classification models. They include Logistic Regression, Naive Bayes, CART (a model based on decision trees), Maximum Entropy (closely related to Logistic Regression), and Hidden Markov Models (models based on Markov processes).

### Types

![types](images/types.PNG)

## Importing the libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Importing the dataset

In [2]:
#quoting = 3 (csv.QUOTE_NONE) makes pandas treat double quotes as ordinary characters:
#quotes inside the reviews then cannot break the parsing of the file
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)
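
To see what `quoting = 3` changes, here is a minimal sketch using only Python's standard `csv` module, whose `QUOTE_*` constants are what the pandas `quoting` parameter refers to. The two sample rows are made up for illustration and are not taken from the dataset.

```python
import csv

# A tiny two-line stand-in for the TSV file (made-up rows, not from the dataset).
sample = '"Loved this place"\t1\nCrust is not good.\t0'

# Default quoting: the reader interprets the double quotes and strips them.
default_rows = list(csv.reader(sample.splitlines(), delimiter="\t"))

# QUOTE_NONE: double quotes are kept as ordinary characters.
raw_rows = list(
    csv.reader(sample.splitlines(), delimiter="\t", quoting=csv.QUOTE_NONE)
)

print(csv.QUOTE_NONE)      # 3, the constant behind quoting = 3
print(default_rows[0][0])  # Loved this place
print(raw_rows[0][0])      # "Loved this place"
```

With `QUOTE_NONE` the review text arrives exactly as written, so quote characters are handled later by the text cleaning step and not by the file parser.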

In [None]:
dataset.head()

## Cleaning the texts

In [None]:
import re
import nltk #gives us access to the stop words list
nltk.download('stopwords') #downloads the stop words: words such as "the" or "an" that carry no sentiment
from nltk.corpus import stopwords #imports the stop words (from the nltk module) into our notebook
from nltk.stem.porter import PorterStemmer #reduces each word to its stem, e.g. loved -> love
#Stemming also reduces the number of columns in the sparse matrix we will create later

In [None]:
corpus = []
#Create the stemmer and the stop word set once, outside the loop
ps = PorterStemmer()
all_stopwords = stopwords.words('english')
all_stopwords.remove("not") #keep "not": it reverses the sentiment of a review
all_stopwords = set(all_stopwords)
for i in range(len(dataset)):
    review = re.sub("[^a-zA-Z]", " ", dataset['Review'][i]) #Replace every non-letter character with a space
    review = review.lower()
    review = review.split()
    #Apply stemming to each word in the list & remove stop words
    review = [ps.stem(word) for word in review if word not in all_stopwords]
    #Create string back from list
    review = ' '.join(review)
    corpus.append(review)

In [None]:
corpus
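
The cleaning steps can be tried on a single sentence without NLTK. The sketch below is only an illustration: `clean_review` and `STOP_WORDS` are hypothetical names, the stop word set is a hand-picked stand-in for NLTK's English list, and stemming is left out on purpose.

```python
import re

# Hypothetical, hand-picked stop words; the notebook uses NLTK's full English list.
STOP_WORDS = {"the", "this", "is", "a", "an", "was"}

def clean_review(text, stop_words=STOP_WORDS):
    """Keep letters only, lowercase, split into words, drop stop words."""
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    words = letters_only.lower().split()
    return " ".join(word for word in words if word not in stop_words)

cleaned = clean_review("Wow... This place is NOT good!")
print(cleaned)  # wow place not good
```

Note that "not" survives the filter, which is the reason the notebook removes it from the stop word list: without it, "not good" and "good" would look the same to the model.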

## Creating the Bag of Words model

Tokenization creates a sparse matrix with one row per review and one column per word found across all reviews. Each cell holds the number of times that word appears in that review (0 if it does not appear).

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
#fit will learn the vocabulary: all the words from all reviews in corpus
#transform will put them in different columns and count them for each review
X = cv.fit_transform(corpus).toarray()
Y = dataset.iloc[:,-1].values
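
As a rough picture of what the fit and transform steps produce, here is a hand-rolled bag of words in plain Python. It is a sketch of the idea, not scikit-learn's implementation, and `toy_corpus` is made-up data.

```python
from collections import Counter

# Made-up, already-cleaned reviews.
toy_corpus = ["love place love food", "not love food", "place not good"]

# "fit": collect every distinct word and give it a fixed column.
vocabulary = sorted({word for review in toy_corpus for word in review.split()})

# "transform": one row per review, one count per vocabulary word.
matrix = []
for review in toy_corpus:
    counts = Counter(review.split())
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['food', 'good', 'love', 'not', 'place']
for row in matrix:
    print(row)
```

The first row contains a 2 in the "love" column, which shows that the cells are counts and not just presence flags.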

Example of a review: "stop late may bank holiday rick steve recommend love"

Now we want to remove the words that are not relevant because they appear only once or twice, such as the word "steve".

We do this by passing the max_features parameter to CountVectorizer().

But we need to know the total number of words first, so we don't pass the parameter yet.

In [None]:
len(X[0])

We have 1566 words that were taken from the reviews.
Now we can make an estimate and keep only the most frequent words: maybe 1500?
The 66 least frequent words, such as rick & steve, will be removed.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
Y = dataset.iloc[:,-1].values

In [None]:
len(X[0])
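
The effect of a vocabulary cut-off can be sketched in plain Python: count every word across the corpus and keep only the most frequent ones. The data and the cut-off of 4 are made up, and the alphabetical tie-breaking is a choice of this sketch, not a statement about how scikit-learn resolves ties.

```python
from collections import Counter

# Made-up, already-cleaned reviews; "rick" and "good" appear only once.
toy_corpus = ["love place love food", "not love food", "place not good rick"]
max_features = 4  # hypothetical cut-off

# Total frequency of every word across the whole corpus.
totals = Counter(word for review in toy_corpus for word in review.split())

# Rank by descending frequency; ties are broken alphabetically in this sketch.
ranked = sorted(totals, key=lambda word: (-totals[word], word))
kept = sorted(ranked[:max_features])
dropped = sorted(ranked[max_features:])

print(kept)     # ['food', 'love', 'not', 'place']
print(dropped)  # ['good', 'rick']
```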

## Splitting the dataset into the Training set and Test set

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 1)
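
For intuition, the split can be pictured as a seeded shuffle followed by a cut. The helper below, `split_train_test`, is a hypothetical plain-Python sketch; it will not reproduce the exact rows that train_test_split selects, only the general behaviour of test_size and a fixed seed.

```python
import random

def split_train_test(rows, labels, test_size=0.2, seed=1):
    """Shuffle the indices with a fixed seed, then cut off the test share."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(data, idx):
        return [data[i] for i in idx]

    return (pick(rows, train_idx), pick(rows, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

# Made-up data: ten one-feature rows with alternating labels.
rows = [[i] for i in range(10)]
labels = [i % 2 for i in range(10)]
X_tr, X_te, y_tr, y_te = split_train_test(rows, labels)
print(len(X_tr), len(X_te))  # 8 2
```

Because the seed is fixed, running the split again returns the same partition, which is what makes results comparable between runs.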

## Training the Naive Bayes model on the Training set
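
The original leaves this section empty. As an illustration of the idea only, here is a small multinomial Naive Bayes with Laplace smoothing written in plain Python. The function names and the toy data are hypothetical, and this is not necessarily the variant the notebook intended; scikit-learn ships ready-made classifiers such as GaussianNB and MultinomialNB.

```python
import math

def train_naive_bayes(X, y):
    """Estimate log priors and Laplace-smoothed log word likelihoods per class."""
    model = {}
    n_words = len(X[0])
    for label in sorted(set(y)):
        rows = [row for row, target in zip(X, y) if target == label]
        word_totals = [sum(column) for column in zip(*rows)]
        denominator = sum(word_totals) + n_words  # +1 per word (Laplace smoothing)
        model[label] = {
            "log_prior": math.log(len(rows) / len(X)),
            "log_likelihood": [math.log((t + 1) / denominator) for t in word_totals],
        }
    return model

def predict_naive_bayes(model, row):
    """Return the class with the highest log posterior score for one count row."""
    def score(label):
        params = model[label]
        return params["log_prior"] + sum(
            count * ll for count, ll in zip(row, params["log_likelihood"])
        )
    return max(model, key=score)

# Columns: counts of the made-up words ["bad", "good", "love", "not"].
toy_X = [[0, 1, 1, 0], [0, 2, 0, 0], [1, 0, 0, 0], [1, 0, 0, 1]]
toy_Y = [1, 1, 0, 0]  # 1 = liked, 0 = disliked
model = train_naive_bayes(toy_X, toy_Y)
predictions = [predict_naive_bayes(model, row) for row in [[0, 1, 0, 0], [1, 0, 0, 0]]]
print(predictions)  # [1, 0]
```

Working with log probabilities avoids multiplying many small numbers together, and the smoothing keeps a word that never appeared in a class from forcing that class's score to minus infinity.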

## Predicting the Test set results

## Making the Confusion Matrix
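
The original leaves this section empty as well. A confusion matrix for two classes can be sketched in plain Python as below. The labels and predictions are made up, and the layout (rows for actual classes, columns for predicted classes) follows the convention of scikit-learn's confusion_matrix.

```python
def confusion_matrix_2x2(y_true, y_pred):
    """Rows are actual classes (0, 1); columns are predicted classes (0, 1)."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # made-up actual labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # made-up predictions
cm = confusion_matrix_2x2(y_true, y_pred)

# Correct predictions sit on the diagonal.
accuracy = (cm[0][0] + cm[1][1]) / len(y_true)
print(cm)        # [[3, 1], [1, 3]]
print(accuracy)  # 0.75
```

Here one negative review was predicted as positive and one positive review as negative, so 6 of the 8 predictions are correct.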