# Natural Language Processing

# Sentiment Analysis using Recurrent Neural Network & Logistic Regression

### Introduction

First of all, I would like to thank the Laboratoire d'Informatique en Image et Systèmes d'Information (LIRIS), in collaboration with the École Centrale de Lyon, for giving me the opportunity to carry out this work in an environment that helped me improve my knowledge and sharpen my skills.

### Objective

In this notebook we go through the basic knowledge needed to understand how both machine learning and deep learning approaches can be applied to sentiment analysis. We want to understand how each component works and what happens behind it. To do so, we first code everything from scratch; once we have a good understanding, we walk through concrete code examples and finish with a full TensorFlow (Keras) sentiment classifier.

- **Part 1** Words to vectors
- **Part 2** Recurrent neural network
- **Part 3** Pre-processing data
- **Part 4** Application using Keras
- **Part 5** Logistic Regression

## Part 1 : Words to vectors

One of the most important tasks in NLP is deciding how to feed our data to the model. Most of the data we work with in NLP is text: paragraphs or comments.
Yet machine learning algorithms, neural networks included, share a common requirement: the inputs need to be numeric. Let's see how we can achieve this using word vectors and a similarity measure.

<table>
    <td>
         <img src="images/S.png" style="width:250;height:300px;">
    </td>
</table>

Our goal is to encode every single word as an array and to feed the collection of arrays to the model as input, exactly as the figure below shows.

<table>
<td>
<img src = "images/S2.png" style="width:250;height:300px;">
</td>
</table>

There are two ways to encode text as vectors:
* The first method (one-hot encoding) relies on a vocabulary in which every word is indexed from 0 up to the last word. Each word is then given as an array whose elements are all 0 except at the index of the word in the vocabulary, where the value is 1.
* The second method is word embedding, such as word2vec. As the name implies, it turns words into dense vectors. The cells further down load pre-trained GloVe vectors for this purpose.

As a simple application of the first method, we one-hot encode the words of a short sentence.

In [1]:
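from data import train_data
import numpy as np

# Build the vocabulary: every distinct word found in the training sentences
vocabulaire = [w for text in train_data.keys() for w in text.split(' ')]
vocabulaire = list(set(vocabulaire))

print(vocabulaire)

# Lookup tables in both directions: word -> index and index -> word
word_to_id = { w: i for i, w in enumerate(vocabulaire) }
id_to_word = { i: w for i, w in enumerate(vocabulaire) }

# One-hot encode each word of the sentence
text = 'this is very good'.split(' ')
X = list()
for elm in text:
    x = np.zeros(len(vocabulaire))  # one slot per vocabulary word (18 here)
    x[word_to_id[elm]] = 1
    X.append(x)
print(*X, sep = '\n')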

['was', 'i', 'bad', 'this', 'is', 'and', 'at', 'now', 'or', 'good', 'sad', 'very', 'earlier', 'happy', 'all', 'right', 'am', 'not']
[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
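
The one-hot rows can be decoded back into words with the reverse dictionary. Below is a minimal sketch, assuming a small hand-written vocabulary: the `vocab` list and the `one_hot` helper are illustrative and are not taken from the notebook's `train_data`.

```python
import numpy as np

# Hypothetical toy vocabulary, sorted so that the indices are reproducible
vocab = sorted(['this', 'is', 'very', 'good', 'bad'])
word_to_id = {w: i for i, w in enumerate(vocab)}
id_to_word = {i: w for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[word_to_id[word]] = 1
    return vec

# Encode a sentence as a (number of words, vocabulary size) matrix
encoded = np.array([one_hot(w) for w in 'this is very good'.split(' ')])

# argmax finds the position of the 1 in each row, which identifies the word
decoded = [id_to_word[int(i)] for i in encoded.argmax(axis=1)]
print(encoded.shape)
print(decoded)
```

Sorting the vocabulary is a choice made here so that the indices do not change between runs; a plain `set` gives no such guarantee.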


In [2]:
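def read_glove_vecs(glove_file):
    '''Read a GloVe text file and return the set of words and a word -> vector dictionary.'''
    with open(glove_file, 'r', encoding='utf8') as f:
        words = set()
        word_to_vec_map = {}
        for line in f:
            # Each line holds a word followed by the components of its vector
            line = line.strip().split()
            curr_word = line[0]
            words.add(curr_word)
            word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float64)
    
    return words, word_to_vec_map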

In [3]:
words, word_to_vec_map = read_glove_vecs('glove.6B.50d.txt')

In [4]:
love = word_to_vec_map['love']
adore = word_to_vec_map['adore']

<table>
<td>
<img src = "images/s4.png" style="width:250;height:300px;">

</td>

</table>

In [5]:
# u . v = cos(theta) * |u| * |v|, where theta is the angle between u and v
# so the cosine similarity is cos(theta) = (u . v) / (|u| * |v|)
def similarity(U, V):
    UV = np.dot(U,V)
    norme_U = np.sqrt(np.sum(U*U)) 
    norme_V = np.sqrt(np.sum(V*V))
    return UV/(norme_U * norme_V)

In [6]:
similarity(love, adore)

0.42786951433899845
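
To get a feel for the scale of this score, here is a small sketch with hand-made 2-D vectors. The values are illustrative and are not GloVe embeddings, and `cosine_similarity` is a name chosen for this example. The score is close to 1 for vectors pointing the same way, 0 for orthogonal vectors and -1 for opposite vectors, whatever their lengths.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 2.0])

same_direction = cosine_similarity(a, 3 * a)               # longer, same direction
orthogonal = cosine_similarity(a, np.array([-2.0, 1.0]))   # dot product is 0
opposite = cosine_similarity(a, -a)                        # reversed direction

print(same_direction, orthogonal, opposite)
```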

## Part 2 : Recurrent neural network

A recurrent neural network (RNN) is a type of neural network that is more complicated than a simple neural network in how forward propagation and backward propagation are carried out. We use an RNN because of the dependencies inside a sequence: in the language we are going to process, each word depends on the words that come before it.

<table>
<td>
<img src="images/RNN.png" style="width:250;height:300px;">
</td>
</table>

The main difference between a simple neural network (NN) and a recurrent neural network (RNN) is that an RNN can take inputs of different lengths and processes them one element at a time. For example, the texts **"hello, world"** and **"recurrent neural network"** have lengths 2 and 3 respectively.

Let's simulate how an RNN processes the input x = "hello, world". The RNN processes it in 2 steps, because the input has length 2: x(t=0) = 'hello', then x(t=1) = 'world', with each word first converted into numerical values. Another important characteristic is that an RNN uses what it has already seen to process the next element. In our case it uses the information from x(t=0), carried by a hidden state, to process x(t=1), and can thereby learn that 'hello' tends to be followed by 'world'.
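
This walk-through can be sketched in NumPy. It is only an illustration: it assumes the common vanilla RNN update h_t = tanh(W_xh x_t + W_hh h_(t-1) + b_h), one-hot inputs and randomly initialised weights. The names `W_xh`, `W_hh`, `b_h` and the hidden size of 4 are choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

words = ['hello', 'world']
word_to_id = {w: i for i, w in enumerate(words)}
vocab_size, hidden_size = len(words), 4

# Assumed parameters of a vanilla RNN cell (random, untrained)
W_xh = rng.normal(scale=0.1, size=(hidden_size, vocab_size))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden
b_h = np.zeros(hidden_size)

h = np.zeros(hidden_size)   # hidden state before the first word
hidden_states = []
for t, word in enumerate(words):
    x_t = np.zeros(vocab_size)
    x_t[word_to_id[word]] = 1                  # one-hot encoding of the word
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)   # new state depends on the previous one
    hidden_states.append(h)
    print(f't={t}, word={word!r}, h={np.round(h, 3)}')
```

The state computed at t=1 depends on the state left by 'hello' at t=0, which is how information from earlier words reaches later steps.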

<table>
<td>
<img src="images/RNN_equations.png" style="width:250;height:300px;">
</td>
</table>

Using the formulas above we can implement forward propagation. The only remaining step is to update the weights, which are the connections between the layers, using gradient descent in order to get accurate results.
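
The update itself follows the usual gradient descent rule: each weight moves a small step against the gradient of the loss. The sketch below applies that rule to a deliberately tiny problem, a single weight with a squared-error loss, both made up for illustration. In a full RNN the gradients would come from backpropagation through time instead.

```python
learning_rate = 0.1
w = 0.0                     # the single weight we want to learn
x, target = 2.0, 4.0        # we want w * x == target, so the ideal w is 2

losses = []
for step in range(50):
    prediction = w * x
    loss = (prediction - target) ** 2
    grad = 2 * (prediction - target) * x    # dL/dw for the squared error
    w -= learning_rate * grad               # gradient descent step
    losses.append(loss)

print(round(w, 4), losses[0], losses[-1])
```

The loss shrinks at every step here because the learning rate is small enough for this particular problem; a rate that is too large would make the updates diverge.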

<table>
<td>
<img src="images/backpro.png" style="width:250;height:300px;">
</td>
</table>

## Part 3 : Pre-processing data

In [137]:
import pandas as pd
import nltk
import re
import string
from keras.preprocessing.text import Tokenizer
import warnings
warnings.filterwarnings('ignore')

In [138]:
df = pd.read_csv('IMDB Dataset.csv')

In [139]:
df.head()

|   | review | sentiment |
|---|--------|-----------|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [140]:
# encode the sentiment as 1 if positive and 0 if negative
df['sentiment'] = [1 if x == 'positive' else 0 for x in df['sentiment']]

In [141]:
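# English stop words (the NLTK 'stopwords' corpus must be downloaded once with nltk.download('stopwords'))
stopwords = set(nltk.corpus.stopwords.words('english'))
# Tokenizer that keeps only runs of word characters
tokenizer_regexp = nltk.tokenize.RegexpTokenizer(r'\w+')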

In [142]:
def clean_text_1(text):
    '''Remove curly quotes, ellipses, newlines and HTML <br /> line-break tags.'''
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub('\n', '', text)
    # Strip the <br /> tag itself; removing the bare string 'br' would also damage words such as "brother"
    text = re.sub(r'<br\s*/?>', ' ', text)
    return text

In [143]:
def clean_text_2(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.'''
    text = text.lower()
    # Raw strings keep the backslashes intact for the regex engine
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

In [144]:
clean1 = lambda x : clean_text_1(x)
clean2 = lambda x : clean_text_2(x)

In [145]:
data_cleaned  = pd.DataFrame(df.review.apply(clean1))
data_cleaned  = pd.DataFrame(data_cleaned.review.apply(clean2))

In [146]:
df['review'] = data_cleaned['review']

In [147]:
training_dataset = df[:30000]
testing_dataset  = df[30000:]
print(' training dataset shape ',training_dataset.shape)
print(' testing dataset shape ',testing_dataset.shape)

 training dataset shape  (30000, 2)
 testing dataset shape  (20000, 2)


In [148]:
def split_text_dataFrame(data_frame, column_name):
    '''Split each text into a list of words and drop the stop words.'''
    # Work on a copy: assigning through data_frame[column].iloc[i] is chained indexing and may not update the frame
    data_frame = data_frame.copy()
    data_frame[column_name] = data_frame[column_name].apply(
        lambda text: [word for word in text.split(' ') if word not in stopwords])
    return data_frame

In [149]:
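def clean_split_text(data_frame, column_name):
    '''Remove empty tokens and the token 'oz' from each list of words.'''
    unwanted = {'', ' ', 'oz'}
    data_frame = data_frame.copy()
    # Build a new list: removing items from a list while iterating over it skips elements
    data_frame[column_name] = data_frame[column_name].apply(
        lambda words: [word for word in words if word not in unwanted])
    return data_frame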

In [150]:
training_dataset = split_text_dataFrame(training_dataset, 'review')
testing_dataset = split_text_dataFrame(testing_dataset, 'review')

In [151]:
training_dataset = clean_split_text(training_dataset, 'review')
testing_dataset = clean_split_text(testing_dataset, 'review')

In [152]:
# size of the dictionary: only the most frequent words are kept (num_words=5000)
tokenizer = Tokenizer(num_words=5000)

tokenizer.fit_on_texts(data_cleaned['review'])

X_train = tokenizer.texts_to_sequences(training_dataset['review'])
X_test = tokenizer.texts_to_sequences(testing_dataset['review'])
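
To make the tokenizer step less opaque, here is a pure-Python sketch of the idea: rank the words by frequency, give the most frequent word the index 1, and skip the words that fall outside the `num_words` budget when converting a text into a sequence. It approximates the behaviour of the Keras `Tokenizer` and is not its implementation; `build_word_index` and `to_sequence` are hypothetical helpers written for this example.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to a rank: 1 for the most frequent word, 2 for the next, ..."""
    counts = Counter(word for text in texts for word in text.split())
    # most_common sorts by decreasing count; index 0 is left free for padding
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index, num_words):
    """Replace words by their index, skipping unknown words and ranks >= num_words."""
    return [word_index[w] for w in text.split()
            if w in word_index and word_index[w] < num_words]

texts = ['good movie', 'good good plot', 'bad movie']
word_index = build_word_index(texts)

print(word_index)
print(to_sequence('good bad plot', word_index, num_words=3))
```

With `num_words=3` only the ranks 1 and 2 survive, so rare words simply disappear from the sequences instead of being mapped to a special token.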

In [155]:
del data_cleaned, df

In [161]:
from keras.preprocessing import sequence
max_words = 500

X_train = sequence.pad_sequences(X_train, maxlen=max_words)
X_test = sequence.pad_sequences(X_test, maxlen=max_words)
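
Reviews have different lengths, while the network expects inputs of a fixed size, hence the padding. The sketch below mimics what `pad_sequences` does with its default `padding='pre'` and `truncating='pre'` settings: zeros are added at the front of short sequences and long sequences lose their first items. `pad_pre` is a hypothetical helper written for this example, not part of Keras.

```python
import numpy as np

def pad_pre(seq, maxlen, value=0):
    """Left-pad or left-truncate one sequence so that it has exactly maxlen items."""
    seq = list(seq)[-maxlen:]                    # keep the last maxlen items
    return [value] * (maxlen - len(seq)) + seq   # fill the front with `value`

short = pad_pre([7, 8], maxlen=5)
long_ = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)

# Once every row has the same length, the batch fits in a regular array
batch = np.array([short, long_])
print(batch)
```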

In [162]:
X_train = np.array(X_train)
X_test  = np.array(X_test)

print(' X_train shape :',X_train.shape)
print(' X_test shape :',X_test.shape)

 X_train shape : (30000, 500)
 X_test shape : (20000, 500)


In [163]:
Y_train = training_dataset['sentiment']
Y_test  = testing_dataset['sentiment']

print(' Y_train shape :',Y_train.shape)
print(' Y_test shape :',Y_test.shape)

 Y_train shape : (30000,)
 Y_test shape : (20000,)


## Part 4 : Application using Keras

In [164]:
from keras import Sequential
from keras.layers import Embedding, LSTM, Dense

vocabulary_size = 5000
embedding_size=32

model=Sequential()
model.add(Embedding(vocabulary_size, embedding_size, input_length=max_words))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))

In [165]:
print(model.summary())

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 500, 32)           160000    
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101       
Total params: 213,301
Trainable params: 213,301
Non-trainable params: 0
_________________________________________________________________
None
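
The parameter counts in the summary can be checked by hand. The short sketch below redoes the arithmetic, assuming the standard LSTM layout of four gates, each with input weights, recurrent weights and a bias. The variable names are chosen for this example.

```python
vocabulary_size, embedding_size, lstm_units = 5000, 32, 100

# Embedding: one vector of size 32 per word of the vocabulary
embedding_params = vocabulary_size * embedding_size

# LSTM: 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * ((embedding_size + lstm_units) * lstm_units + lstm_units)

# Dense: one weight per LSTM unit plus one bias
dense_params = lstm_units * 1 + 1

print(embedding_params, lstm_params, dense_params)       # 160000 53200 101
print(embedding_params + lstm_params + dense_params)     # 213301
```

These values match the `Param #` column and the total reported by `model.summary()`.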


In [166]:
model.compile(loss='binary_crossentropy', 
             optimizer='adam', 
             metrics=['accuracy'])


In [167]:
batch_size = 1000
num_epochs = 3
# Hold out the first 1000 training samples (the same number as batch_size) for validation
X_valid, y_valid = X_train[:batch_size], Y_train[:batch_size]
# Train on the remaining 29000 samples
X_train2, y_train2 = X_train[batch_size:], Y_train[batch_size:]
model.fit(X_train2, y_train2, validation_data=(X_valid, y_valid), batch_size=batch_size, epochs=num_epochs)


Train on 29000 samples, validate on 1000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.callbacks.History at 0x182416b7648>

In [None]:
scores = model.evaluate(X_test, Y_test, verbose=0)
print('Test accuracy:', scores[1])

## Part 5 : Logistic Regression

In [None]:
class Logistic_Regression():
    def __init__(self, epochs = 10000, learningRate = 1):
        self.weight = None
        self.LearningRate = learningRate
        self.epochs = epochs
        self.sigmoid = lambda x : 1/(1.0 + np.exp(-x))
    
    def fit(self, x_train, y_train):
        # One weight per feature, initialised to zero (there is no separate bias term)
        self.weight = np.zeros(x_train.shape[1])
        
        for i in range(self.epochs):
            # Forward pass: predicted probabilities
            output = self.sigmoid(np.dot(x_train, self.weight))
            
            # Gradient of the mean cross-entropy loss with respect to the weights
            error = output - y_train
            gradient = (np.dot(x_train.T, error))*(1/len(y_train))
            
            # Gradient descent step
            self.weight -= self.LearningRate * gradient
    
    def predict_proba(self, x_test):
        return self.sigmoid(np.dot(x_test, self.weight))
    
    def predict(self, x_test):
        # Threshold the probabilities (not the raw scores) at 0.5
        values = self.predict_proba(x_test)
        prediction = [ 1 if elm > 0.5 else 0 for elm in values]
        return prediction
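
A quick way to sanity-check this kind of model is to train it on a toy problem whose answer is obvious. The sketch below is self-contained and uses a compact functional version of the same batch gradient descent. The data, the constant column standing in for a bias, and the names `fit_logistic` and `sigmoid` are all assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1.0 + np.exp(-z))

def fit_logistic(x_train, y_train, epochs=1000, learning_rate=1.0):
    """Batch gradient descent on the logistic loss, without regularisation."""
    weight = np.zeros(x_train.shape[1])
    for _ in range(epochs):
        error = sigmoid(x_train @ weight) - y_train
        weight -= learning_rate * (x_train.T @ error) / len(y_train)
    return weight

# Toy data: the first column is a constant 1 acting as the bias input,
# the second column is the only real feature (negative -> class 0, positive -> class 1)
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])

weight = fit_logistic(X, y)
probabilities = sigmoid(X @ weight)
predictions = (probabilities > 0.5).astype(int)   # threshold the probabilities
print(predictions)  # [0 0 1 1]
```

The threshold of 0.5 is applied to the output of the sigmoid. Applied to the raw score `X @ weight`, the equivalent threshold would be 0.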

***Bibliography***

* https://github.com/Kulbear/deep-learning-coursera/blob/master/Sequence%20Models/Operations%20on%20word%20vectors%20-%20v2.ipynb
* https://gist.github.com/aparrish/2f562e3737544cf29aaf1af30362f469
* https://github.com/Kulbear/deep-learning-coursera/blob/master/Sequence%20Models/Dinosaurus%20Island%20--%20Character%20level%20language%20model%20final%20-%20v3.ipynb
* https://github.com/adeshpande3/LSTM-Sentiment-Analysis/blob/master/Oriole%20LSTM.ipynb