<h1><center> NATURAL LANGUAGE PROCESSING </center></h1>

Natural Language Processing (NLP) is the application of *Machine Learning models to text and language*. 
The focus of NLP is teaching machines to understand spoken and written language. Whenever you dictate something into your iPhone or Android device and it is converted to text, an NLP algorithm is at work.

You can also use NLP on a text review to predict whether the review is positive or negative. 
You can use NLP on an article to predict which category it belongs to. 
You can use NLP on a book to predict its genre. It can go further: you can use NLP to build a machine translator or a speech recognition system, and in the latter you use classification algorithms to classify language. Speaking of classification algorithms, most NLP algorithms are classification models, including Logistic Regression, Naive Bayes, CART (a model based on decision trees), Maximum Entropy (closely related to Logistic Regression), and Hidden Markov Models (models based on Markov processes).

A very well-known model in NLP is the Bag of Words model. It is used to preprocess the texts before fitting a classification algorithm on the observations that contain them.

Types of NLP:
- Natural Language Processing (NLP): classical techniques that do not rely on neural networks.
- Deep Learning: algorithms based on neural networks. 
- Deep Natural Language Processing (DNLP): the intersection of the two.
- Seq2Seq (sequence to sequence): a subset of DNLP and one of the most powerful tools for NLP. 

Theme: Classical vs. Deep Learning models. Some examples:
- If/else rules for a chatbot (NLP). The rules are the mechanism that picks the correct answer from a large set of known questions.
- Audio frequency component analysis for speech recognition (NLP). It is also a purely mathematical computation that does not use a neural network. 
- Bag-of-words classification (NLP). It can be used to classify a sentence as positive or negative.
- CNN for text recognition (classification) (DNLP).
- Seq2Seq (DNLP).
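
The if/else chatbot from the list above can be sketched in a few lines. This is only an illustration of the idea, not a production design: the `RULES` dictionary, the `FALLBACK` message and the `reply` function are hypothetical names invented for this example.

```python
# Hypothetical keyword rules: each keyword maps to a canned answer.
RULES = {
    "hours": "We are open from 9 am to 6 pm.",
    "price": "The basic plan costs 10 euros per month.",
    "refund": "Refunds are processed within 5 working days.",
}
FALLBACK = "Sorry, I did not understand the question."


def reply(question: str) -> str:
    """Return the answer of the first rule whose keyword appears in the question."""
    # Lower-case, split on whitespace and strip trailing punctuation from each word.
    words = [w.strip(".,!?") for w in question.lower().split()]
    for keyword, answer in RULES.items():
        if keyword in words:
            return answer
    # No rule matched.
    return FALLBACK


print(reply("What are your opening hours?"))
print(reply("Can I get a refund?"))
print(reply("Tell me a joke"))
```

No model is trained here: the behaviour is fixed by hand-written rules, which is why this approach counts as classical NLP rather than DNLP.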

We focus on the *bag-of-words model*. Gmail, for example, suggests possible replies to an email. Suppose we want a model that gives a yes/no response to a friend's email. We take as the bag of words the set of most commonly used English words and start from a vector of zeros with one position per word. Then, for each email, we store in each position the number of times the corresponding word appears (one or more if the word is present, zero otherwise). We process all past emails in this way to build the training data: the X variable is the email's vector and the Y variable is Yes/No, so this is a classification problem. In other words, we build a table in which each row is the filled bag-of-words vector of an email and the variable to predict is binary: a well-known classification setting. The vectors are sparse, because most words do not appear in any given email. This is an NLP classification technique. We can also feed the vectors to a neural network and let it learn the parameters (a DNLP technique). 
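
The vector construction described above can be illustrated with a tiny hand-made example. It is a sketch under stated assumptions: the eight-word `VOCABULARY`, the two emails and their Yes/No labels are all invented for illustration, and a real vocabulary would contain thousands of words.

```python
from collections import Counter

# Hypothetical vocabulary; each word owns one position in the vector.
VOCABULARY = ["yes", "no", "dinner", "tonight", "free", "busy", "sorry", "great"]


def bag_of_words(text, vocabulary=VOCABULARY):
    """Count how many times each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    # Counter returns 0 for words that never occur, so absent words stay at zero.
    return [counts[word] for word in vocabulary]


# Invented training emails with invented labels (1 = Yes, 0 = No).
emails = ["Are you free for dinner tonight", "Sorry sorry I am busy tonight"]
X = [bag_of_words(email) for email in emails]
Y = [1, 0]

for vector, label in zip(X, Y):
    print(vector, label)
```

Each row of `X` is one filled vector and `Y` holds the binary target, which is exactly the table a classifier is trained on. Most entries are zero even with this small vocabulary, which shows why the vectors are sparse.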

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, accuracy_score

def after_class(classifier, X_test, Y_test):
    # CONFUSION MATRIX: rows are the true classes, columns are the predicted classes,
    # so C00 is the number of true negatives.
    Y_pred = classifier.predict(X_test)  # predict once and reuse the result.
    CM = confusion_matrix(Y_test, Y_pred)
    ACC = accuracy_score(Y_test, Y_pred)
    print('\n Accuracy: ', ACC, '\n', 'Confusion matrix: \n', CM)

# We perform sentiment analysis. Chatbots and translators are covered in advanced NLP.

# the dataset collects the reviews of a restaurant and the binary variable.
dataset = pd.read_csv('Dataset/Restaurant_Reviews.tsv', 
                      sep = '\t',
                     quoting = 3 # ignore quotes.
                     )

# CLEAN THE TEXT.
import re 
import nltk # download the stop words (articles, prepositions etc.)
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer # stemming: it also reduces the final dimension of the bag of words.
# punctuation will be removed as well.

corpus = [] # it contains all reviews after the cleaning process. 

# build the stemmer and the stopword set once, outside the loop.
ps = PorterStemmer()
all_stopwords = stopwords.words('english')
all_stopwords.remove('not') # keep 'not' in the reviews: it carries sentiment.
all_stopwords = set(all_stopwords) # a set gives fast membership tests.

for i in range(0, dataset.shape[0]):
    # replace every character that is not a letter (punctuation, digits) with a white space.
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i]) 
    # lower case.
    review = review.lower()
    # tokenization: split the review into words (primitive bag of words).
    review = review.split()
    # stemming: keep the stem of each word that isn't an english stopword.
    review = [ps.stem(word) for word in review if word not in all_stopwords]
    # join the selected words with a space for each review.
    review = ' '.join(review)
    corpus.append(review) 

# CREATE THE BAG OF WORDS.
from sklearn.feature_extraction.text import CountVectorizer
# keep only the most frequent words in the reviews.
# the total number of words is 1655, but we want to drop some useless ones. 
cv = CountVectorizer(max_features = 1500)  # max_features sets the dimension of the bag of words.

# independent (X) and dependent (Y) variables.
X = cv.fit_transform(corpus).toarray()
Y = dataset.iloc[:,-1].values

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, # independent variables.
                                                    Y, # dependent variables.
                                                    test_size = 0.2, # size of test set.
                                                    random_state = 0) # fix the seed so the split is reproducible.

# Naive Bayes classification.
from sklearn.naive_bayes import GaussianNB
# the likelihood of each feature is assumed to follow a normal distribution. 
NB = GaussianNB()
NB.fit(X_train, Y_train)

# confusion matrix.
after_class(NB, X_test, Y_test)

# it's possible to improve this result. 

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Lucia\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!



 Accuracy:  0.73 
 Confusion matrix: 
 [[55 42]
 [12 91]]
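
The confusion matrix printed above can be unpacked by hand to see where the accuracy comes from and which kind of error dominates. The sketch below hard-codes the four counts from that output and assumes scikit-learn's convention (rows are true classes, columns are predicted classes, class 1 is the positive class). The variable names are chosen only for this illustration.

```python
# Counts copied from the confusion matrix in the output above.
cm = [[55, 42],
      [12, 91]]

# Rows are true classes, columns are predicted classes.
(tn, fp), (fn, tp) = cm
total = tn + fp + fn + tp

accuracy = (tn + tp) / total      # share of all reviews classified correctly
precision = tp / (tp + fp)        # share of predicted positives that are truly positive
recall = tp / (tp + fn)           # share of true positives that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy  = {accuracy:.2f}")
print(f"precision = {precision:.2f}")
print(f"recall    = {recall:.2f}")
print(f"f1        = {f1:.2f}")
```

The accuracy is (55 + 91) / 200 = 0.73, matching the printed value. The larger off-diagonal entry is the 42 negative reviews predicted as positive, so false positives are the main source of error for this classifier.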
