# Natural Language Processing

# Sentiment Analysis with Recurrent Neural Network & Logistic Regression

### Introduction

First of all, I would like to thank the Laboratoire d'Informatique en Image et Systèmes d'Information (LIRIS), in collaboration with the Ecole Centrale de Lyon, for giving me the opportunity to carry out this work in an excellent environment that helped me improve my knowledge and sharpen my skills.

#### Objective
In this notebook we go through the basic knowledge needed to understand how both machine learning and deep learning approaches can be applied to sentiment analysis. We want to understand how each building block works and what happens behind the scenes, so we code the core pieces from scratch. Once we have a good understanding, we walk through concrete code examples and finish with a full TensorFlow sentiment classifier.

- **Part 1** Recurrent neural network
- **Part 2** Words to vectors
- **Part 3** Pre-processing data
- **Part 4** Application using Keras
- **Part 5** Logistic Regression

## Part 1 : Recurrent neural network

A recurrent neural network (RNN) is a type of neural network that is more complicated than a simple neural network in terms of how forward propagation and backward propagation are carried out. We use an RNN because language data such as speech is sequential: each element depends on the ones that come before it.

<table>
<td>
<img src="images/RNN.png" style="width:250;height:300px;">

</td>

</table>

The main difference between a simple neural network (NN) and a recurrent neural network (RNN) is that an RNN can process inputs of different lengths. For example, the texts **"hello, world"** and **"recurrent neural network"** have lengths 2 and 3 respectively, counted in words.

Let's simulate how an RNN processes the input x = "hello, world". Because the input has length 2, the RNN processes it in two steps: x(t=0) = 'hello', then x(t=1) = 'world', after a transformation that converts each word into numerical values. Another important characteristic is that an RNN uses the previous data to process the next one. In our case it uses the information from x(t=0), carried by a hidden state, to process x(t=1), and this is how it can learn that 'world' tends to follow 'hello'.

<table>
<td>
<img src="images/RNN_equations.png" style="width:250;height:300px;">

</td>

</table>

Using the formulas above we can implement forward propagation. The only thing left is then to update the weights, which are the connections between the layers, using gradient descent so that the network produces accurate results.
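To make the forward pass concrete, here is a minimal NumPy sketch. Since the exact notation of the figure is not reproduced in the text, it assumes the common formulation h_t = tanh(W_xh x_t + W_hh h_(t-1) + b_h), with the output read from the last hidden state. The function name `rnn_forward`, the layer sizes and the word indices are assumptions chosen for illustration.

```python
import numpy as np

def rnn_forward(inputs, Wxh, Whh, Why, bh, by):
    """Run a vanilla RNN over a sequence; return the output and the last hidden state."""
    h = np.zeros((Whh.shape[0], 1))            # initial hidden state is all zeros
    for x in inputs:                           # one step per word of the input
        h = np.tanh(Wxh @ x + Whh @ h + bh)    # combine current word with previous state
    y = Why @ h + by                           # output computed from the last hidden state
    return y, h

# Illustrative sizes (assumptions): 18-word vocabulary, 8 hidden units, 2 outputs
rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 18, 8, 2
Wxh = rng.normal(scale=0.1, size=(hidden_size, input_size))
Whh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
Why = rng.normal(scale=0.1, size=(output_size, hidden_size))
bh = np.zeros((hidden_size, 1))
by = np.zeros((output_size, 1))

# "hello, world" as two one-hot column vectors (indices 0 and 1 are arbitrary)
hello = np.zeros((input_size, 1))
hello[0] = 1
world = np.zeros((input_size, 1))
world[1] = 1

y, h = rnn_forward([hello, world], Wxh, Whh, Why, bh, by)
print(y.shape, h.shape)
```

The same weights are reused at every step, which is why the loop works for a sequence of any length.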

<table>
<td>
<img src="images/backpro.png" style="width:250;height:300px;">

</td>

</table>
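Backpropagation through time is too long to reproduce here, but the weight update that gradient descent applies at the end is short. The sketch below shows it on a toy loss with a single weight. The helper name `gradient_descent` and the toy loss are assumptions for illustration, not the RNN loss itself.

```python
def gradient_descent(w, grad_fn, learning_rate=0.1, steps=100):
    """Repeatedly move w against the gradient returned by grad_fn."""
    for _ in range(steps):
        w = w - learning_rate * grad_fn(w)   # the gradient descent update rule
    return w

# Toy loss L(w) = (w - 3)^2, whose gradient is 2 * (w - 3) and whose minimum is at w = 3
w_final = gradient_descent(0.0, lambda w: 2.0 * (w - 3.0))
print(w_final)
```

In an RNN the same rule is applied to every weight matrix, with the gradients supplied by backpropagation.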

## Part 2 : Words to vectors

To understand how natural language is used in machine learning, let's take a look at other kinds of models: convolutional neural networks use arrays of pixel values, logistic regression uses quantifiable features, and reinforcement learning models use reward signals. The common theme is that the inputs need to be numerical. Let's see how we can achieve this for text using word vectors and a similarity measure.

Our goal is to give a text as input to our RNN and get back a sentiment, which is either positive or negative.

<table>
<td>
<img src="images/S.png" style="width:250;height:300px;">

</td>

</table>

To do so, we need a way to encode every single word as an array and to give the sequence of arrays as input, as the figure below shows.


<table>
<td>
<img src = "images/S2.png" style="width:250;height:300px;">

</td>

</table>


There are two ways to encode text as vectors:
* The first method, one-hot encoding, starts from a vocabulary in which every word has an index, from 0 up to the last word. Each word is given as an array where all the elements are 0 except at the index of the word in the vocabulary, where the value is 1.
* The second method is named word2vec and, as the name implies, it turns words into vectors.

As a simple application of the first method, we one-hot encode the words of a short sentence.

In [1]:
from data import train_data
import numpy as np

# Build the vocabulary: every unique word found in the training texts
vocabulaire = [w for text in train_data.keys() for w in text.split(' ')]
vocabulaire = list(set(vocabulaire))

print(vocabulaire)

print("We have", len(vocabulaire), "unique words")

# Map each word to an index and each index back to its word
word_to_id = { w: i for i, w in enumerate(vocabulaire) }
id_to_word = { i: w for i, w in enumerate(vocabulaire) }

# One-hot encode each word of the sentence
text = 'this is very good'.split(' ')
X = list()
for elm in text:
    x = np.zeros(len(vocabulaire))
    x[word_to_id[elm]] = 1
    X.append(x)
print(X)

['is', 'bad', 'this', 'not', 'and', 'earlier', 'at', 'or', 'all', 'now', 'good', 'i', 'right', 'was', 'happy', 'very', 'am', 'sad']
We have 18 unique words
[array([0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0.]), array([1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0.]), array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
       0.]), array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
       0.])]



<table>
<td>
<img src = "images/S3.png" style="width:250;height:300px;">

</td>

</table>


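For the second method we load pre-trained GloVe word vectors from a text file, where each line holds a word followed by the components of its vector.

In [2]:
import numpy as np
def read_glove_vecs(glove_file):
    '''Read a GloVe text file and return the set of words and a word -> vector map.'''
    with open(glove_file, 'r', encoding='utf8') as f:
        words = set()
        word_to_vec_map = {}
        for line in f:
            # First field is the word, the remaining fields are the vector components
            line = line.strip().split()
            curr_word = line[0]
            words.add(curr_word)
            word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float64)
    
    return words, word_to_vec_map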

In [3]:
words, word_to_vec_map = read_glove_vecs('glove.6B.50d.txt')

In [4]:
love = word_to_vec_map['love']

In [5]:
adore = word_to_vec_map['adore']

Now we can measure how similar two words are using cosine similarity. This method computes the cosine of the angle between two vectors. Let's see how we can do this.

<table>
<td>
<img src = "images/s4.png" style="width:250;height:300px;">

</td>

</table>

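In [6]:
# u . v = cos(theta) * |u| * |v|, where theta is the angle between u and v
# so the similarity is the cosine: cos(theta) = u . v / (|u| * |v|)
import numpy as np
def smilarity(U, V):
    UV = np.dot(U,V)                  # dot product u . v
    norme_U = np.sqrt(np.sum(U*U))    # Euclidean norm |u|
    norme_V = np.sqrt(np.sum(V*V))    # Euclidean norm |v|
    return UV/(norme_U * norme_V)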

***Now let's see how similar the two words LOVE & ADORE are.***

In [7]:
smilarity(love, adore)

0.42786951433899845

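As we can see, the function `smilarity` gives us the similarity between two words. The value lies between -1 and 1, and the closer it is to 1, the more similar the two words are.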
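To get a feeling for the values this measure returns, the sketch below evaluates it on small hand-made vectors instead of word embeddings. The name `cosine_similarity` and the example vectors are assumptions for illustration; the computation is the same as in the function above, written with `np.linalg.norm`.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 2.0, 3.0])
same_direction = 2.0 * a                   # scaling a vector does not change its direction
opposite = -a                              # points the other way
orthogonal = np.array([3.0, 0.0, -1.0])    # dot product with a is 3 + 0 - 3 = 0

print(cosine_similarity(a, same_direction))
print(cosine_similarity(a, opposite))
print(cosine_similarity(a, orthogonal))
```

Vectors pointing in the same direction score close to 1, opposite vectors close to -1, and perpendicular vectors close to 0, whatever their lengths.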


## Part 3 : Pre-processing data


In [8]:
import pandas as pd
import nltk
import re
import string
from keras.preprocessing.text import Tokenizer
import warnings
warnings.filterwarnings('ignore')

Using TensorFlow backend.


In [9]:
df = pd.read_csv('IMDB Dataset.csv')

In [10]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [11]:
# assign the sentiment as 1 if positive and 0 if negative
Sentiments = [1 if x == 'positive' else 0 for x in df['sentiment']]

In [12]:
Sentiments[:5]

[1, 1, 1, 0, 1]

In [13]:
stopwords = set(nltk.corpus.stopwords.words('english'))
tokinzer = nltk.tokenize.RegexpTokenizer(r'\w+')

In [14]:
def clean_text_round2(text):
    '''Get rid of some additional punctuation and non-sensical text that was missed the first time around.'''
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub('\n', '', text)
    return text

In [15]:
def clean_text_round1(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.'''
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

In [16]:
clean1 = lambda x : clean_text_round1(x)
clean2 = lambda x : clean_text_round2(x)

In [17]:
# Apply the two cleaning rounds one after the other
data_clean  = pd.DataFrame(df.review.apply(clean1))
data_clean  = pd.DataFrame(data_clean.review.apply(clean2))

In [18]:
data_clean['Sentiments'] =  pd.DataFrame(Sentiments)

In [19]:
data = data_clean

In [20]:
del data_clean, df

In [21]:
# Tokenize every review and drop the English stopwords
x_train = list()
for par in data['review'].values:
    tmp = list()
    sentences = nltk.sent_tokenize(par)
    for sent in sentences:
        token = tokinzer.tokenize(sent)
        words = [w for w in token if w not in stopwords]
        tmp.extend(words)
    x_train.append(tmp)

In [22]:
x_train[0]

['one',
 'reviewers',
 'mentioned',
 'watching',
 'oz',
 'episode',
 'youll',
 'hooked',
 'right',
 'exactly',
 'happened',
 'mebr',
 'br',
 'first',
 'thing',
 'struck',
 'oz',
 'brutality',
 'unflinching',
 'scenes',
 'violence',
 'set',
 'right',
 'word',
 'go',
 'trust',
 'show',
 'faint',
 'hearted',
 'timid',
 'show',
 'pulls',
 'punches',
 'regards',
 'drugs',
 'sex',
 'violence',
 'hardcore',
 'classic',
 'use',
 'wordbr',
 'br',
 'called',
 'oz',
 'nickname',
 'given',
 'oswald',
 'maximum',
 'security',
 'state',
 'penitentary',
 'focuses',
 'mainly',
 'emerald',
 'city',
 'experimental',
 'section',
 'prison',
 'cells',
 'glass',
 'fronts',
 'face',
 'inwards',
 'privacy',
 'high',
 'agenda',
 'em',
 'city',
 'home',
 'manyaryans',
 'muslims',
 'gangstas',
 'latinos',
 'christians',
 'italians',
 'irish',
 'moreso',
 'scuffles',
 'death',
 'stares',
 'dodgy',
 'dealings',
 'shady',
 'agreements',
 'never',
 'far',
 'awaybr',
 'br',
 'would',
 'say',
 'main',
 'appeal',
 'sho

In [23]:
# Limit the vocabulary to 5000 indices; rarer words map to the out-of-vocabulary token
tokenizer = Tokenizer(5000,  oov_token = "<OOV>")
tokenizer.fit_on_texts(x_train)
x_train = tokenizer.texts_to_sequences(x_train)

In [24]:
x_train[0]

[5,
 1800,
 937,
 58,
 3148,
 286,
 351,
 3026,
 109,
 486,
 475,
 1994,
 2,
 21,
 59,
 3086,
 3148,
 1,
 1,
 52,
 466,
 181,
 109,
 552,
 54,
 1569,
 43,
 1,
 1,
 1,
 43,
 2356,
 1,
 1,
 1335,
 275,
 466,
 3238,
 247,
 236,
 1,
 2,
 359,
 3148,
 1,
 233,
 1,
 1,
 2390,
 935,
 1,
 2482,
 1240,
 1,
 421,
 4529,
 2367,
 1077,
 1,
 2833,
 1,
 300,
 1,
 1,
 214,
 4894,
 3526,
 421,
 239,
 1,
 1,
 1,
 1,
 4964,
 1,
 2313,
 1,
 1,
 224,
 1,
 1,
 1,
 1,
 1,
 36,
 128,
 1,
 2,
 9,
 48,
 169,
 1171,
 43,
 550,
 94,
 162,
 157,
 433,
 2842,
 698,
 86,
 1138,
 4160,
 2347,
 970,
 698,
 1278,
 698,
 1,
 60,
 853,
 89,
 21,
 286,
 45,
 105,
 3086,
 1441,
 2069,
 289,
 48,
 1414,
 177,
 1329,
 1118,
 3148,
 91,
 1,
 214,
 1948,
 1956,
 466,
 466,
 1,
 1,
 4772,
 1,
 2811,
 1,
 1,
 1,
 381,
 504,
 16,
 143,
 15,
 1,
 634,
 695,
 1,
 542,
 1077,
 1,
 550,
 434,
 807,
 1854,
 1077,
 443,
 58,
 3148,
 102,
 303,
 3601,
 3107,
 1,
 16,
 1080,
 3834,
 388]
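The sequence above contains many 1s. As far as we understand the Keras `Tokenizer`, index 1 is reserved for the out-of-vocabulary token, the other words are numbered by decreasing frequency, and any word whose index is not below the vocabulary limit is replaced by 1. The pure-Python sketch below imitates this behaviour on a tiny example. The helpers `build_word_index` and `texts_to_ids` are hypothetical names for illustration, not part of Keras.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Give index 1 to the OOV token, then number words by decreasing frequency."""
    counts = Counter(w for text in texts for w in text)
    vocab = [oov_token] + [w for w, _ in counts.most_common()]
    return {w: i for i, w in enumerate(vocab, start=1)}

def texts_to_ids(texts, word_index, num_words, oov_token="<OOV>"):
    """Replace each word by its index, or by the OOV index if it is too rare or unknown."""
    oov_id = word_index[oov_token]
    return [
        [word_index[w] if word_index.get(w, num_words) < num_words else oov_id
         for w in text]
        for text in texts
    ]

docs = [["good", "film", "good", "cast"], ["good", "film", "bad", "plot"]]
word_index = build_word_index(docs)
ids = texts_to_ids(docs, word_index, num_words=4)
print(ids)
```

With a limit of 4, only "good" and "film" keep their own index, and the three rarer words all collapse to 1, just like the rare words of the review above.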

In [25]:
from keras.preprocessing import sequence
max_words = 500
X_train = sequence.pad_sequences(x_train, maxlen=max_words)

In [26]:
X_train.shape

(50000, 500)
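Every review now has exactly 500 entries. With its default settings, `pad_sequences` adds zeros at the front of short sequences and keeps only the last `maxlen` entries of long ones. The NumPy sketch below reproduces that behaviour on small lists. The helper name `pad_pre` is an assumption for illustration.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with `value`; keep the last `maxlen` items of long ones."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int64)
    for i, seq in enumerate(sequences):
        kept = list(seq)[-maxlen:]          # truncate from the front if too long
        if kept:                            # an empty sequence stays all padding
            out[i, -len(kept):] = kept      # right-align the kept items
    return out

padded = pad_pre([[1, 2, 3], [4, 5, 6, 7, 8, 9], []], maxlen=4)
print(padded)
```

Right-aligning the words means the last steps the recurrent layer sees are always real words and not padding.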

In [27]:
from sklearn.model_selection import train_test_split

In [28]:
# train_test_split returns the splits in the order (X_train, X_test, y_train, y_test)
x, x_t, y, y_t = train_test_split(X_train, np.array(Sentiments), train_size = 0.2)

In [30]:
x.shape

(10000, 500)

In [31]:
y.shape

(10000,)

## Part 4 : Application using Keras

In [35]:
embedding_size = 32
vocabulary_size = 5000

from keras import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout

# Embedding -> LSTM -> sigmoid output for binary sentiment classification
model=Sequential()
model.add(Embedding(vocabulary_size, embedding_size, input_length=max_words))
model.add(LSTM(1000))
model.add(Dense(1, activation='sigmoid'))
model.summary()


Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 500, 32)           160000    
_________________________________________________________________
lstm_2 (LSTM)                (None, 1000)              4132000   
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 1001      
Total params: 4,293,001
Trainable params: 4,293,001
Non-trainable params: 0
_________________________________________________________________


In [36]:
model.compile(loss='binary_crossentropy', 
             optimizer='adam', 
             metrics=['accuracy'])

In [37]:
# Train on the padded review sequences x and their labels y
model.fit(x, y, batch_size=1)

## Part 5 : Logistic Regression

In [None]:
class Logitic_Regression():
    def __init__(self, epochs = 1000, learningRate = 1):
        self.weight = None
        self.LearningRate = learningRate
        self.epochs = epochs
        self.sigmoid = lambda x : 1/(1.0 + np.exp(-x))
    
    def fit(self, x_train, y_train):
        # Start from zero weights, one per feature
        self.weight = np.zeros(x_train.shape[1])
        
        for i in range(self.epochs):
            # Predicted probabilities for the current weights
            output = self.sigmoid(np.dot(x_train, self.weight))
            
            # Gradient of the mean cross-entropy loss with respect to the weights
            error = output - y_train
            gradient = (np.dot(x_train.T, error))*(1/len(y_train))
            
            # Gradient descent step
            self.weight -= self.LearningRate * gradient
    
    def predict_proba(self, x_test):
        return self.sigmoid(np.dot(x_test, self.weight))
    
    def predict(self, x_test):
        # Threshold the probabilities (not the raw scores) at 0.5
        values = self.predict_proba(x_test)
        prediction = [ 1 if elm > 0.5 else 0 for elm in values]
        return prediction
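To check the training loop, the sketch below applies the same update rule as the class above to a tiny hand-made dataset. It is written as standalone functions so that it runs on its own. The names `fit_logistic` and `sigmoid`, the toy data, and the constant column used as a bias term are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, epochs=1000, learning_rate=1.0):
    """Batch gradient descent on the mean cross-entropy loss."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        error = sigmoid(X @ w) - y                     # predicted probability minus label
        w -= learning_rate * (X.T @ error) / len(y)    # gradient descent step
    return w

# Toy data: one feature plus a constant column acting as a bias term.
# The label is 1 when the feature is positive and 0 when it is negative.
X = np.array([[-2.0, 1.0], [-1.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([0, 0, 1, 1])

w = fit_logistic(X, y)
proba = sigmoid(X @ w)
pred = (proba > 0.5).astype(int)
print(pred)
```

On real text data the rows of `X` would be numerical features extracted from the reviews, for example word counts.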

***bibliography***

* https://github.com/Kulbear/deep-learning-coursera/blob/master/Sequence%20Models/Operations%20on%20word%20vectors%20-%20v2.ipynb
* https://gist.github.com/aparrish/2f562e3737544cf29aaf1af30362f469
* https://github.com/Kulbear/deep-learning-coursera/blob/master/Sequence%20Models/Dinosaurus%20Island%20--%20Character%20level%20language%20model%20final%20-%20v3.ipynb
* https://github.com/adeshpande3/LSTM-Sentiment-Analysis/blob/master/Oriole%20LSTM.ipynb