### Movie Review Sentiment Analysis with an RNN
In this notebook, we'll implement a recurrent neural network that performs sentiment analysis. Using an RNN rather than a feedforward network can be more accurate because the network can use information about the order of the words. Here we'll use a dataset of movie reviews, accompanied by sentiment labels.

### Data Set (data used to train the model is from Kaggle)
The labeled data set consists of 50,000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment of reviews is binary: an IMDB rating < 5 results in a sentiment score of 0, and a rating >= 7 results in a sentiment score of 1. No individual movie has more than 30 reviews. The 25,000-review labeled training set does not include any of the same movies as the 25,000-review test set.
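The labeling rule described above can be written as a small function. This is only a sketch: `rating_to_sentiment` is a hypothetical helper, not part of the dataset tooling, and returning `None` for ratings of 5 and 6 is an assumption based on the gap between the two thresholds.

```python
def rating_to_sentiment(rating):
    """Map an IMDB rating to the binary label described above."""
    if rating < 5:
        return 0
    if rating >= 7:
        return 1
    # Assumption: ratings of 5 and 6 fall between the two thresholds and get no label.
    return None


labels = [rating_to_sentiment(r) for r in (1, 4, 5, 6, 7, 10)]
print(labels)
```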

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
tsv_file='labeledTrainData.tsv'
csv_table=pd.read_table(tsv_file,sep='\t')
csv_table.to_csv('train.csv',index=False)

### Train Data

In [None]:
train_data=pd.read_csv('train.csv')
print(train_data.shape)

In [None]:
train_data.head()

### Test Data

In [None]:
tsv_file='testData.tsv'
csv_table=pd.read_table(tsv_file,sep='\t')
csv_table.to_csv('test.csv',index=False)

In [None]:
test_data=pd.read_csv('test.csv')
print(test_data.shape)

In [None]:
test_data.head()

### Data Preprocessing
The first step when building a neural network model is getting your data into the proper form to feed into the network. Since we're using embedding layers, we'll need to encode each word with an integer. We'll also want to clean it up a bit.

You can see an example of the review data above. We'll want to get rid of those periods. Also, you might notice that the reviews are delimited with newlines \n. To deal with those, I'm going to split the text into individual reviews using \n as the delimiter. Then I can combine all the reviews back together into one big string.

First, let's remove all punctuation. Then get all the text without the newlines and split it into individual words.

In [None]:
# Import packages required during data processing.
import re
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words("english"))

In [None]:
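def clean_text(text):
  """Strip punctuation from each review and convert it to lowercase."""
  clean_review=[]
  for review in text:
    # Pass flags by keyword: the fourth positional argument of re.sub is count, not flags.
    cleaned = re.sub(r'[^\w\s]','',review, flags=re.UNICODE)
    cleaned = cleaned.lower()
    clean_review.append(cleaned)
  return clean_review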
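To see what the punctuation pattern does, here is a minimal sketch on a made-up sentence. `strip_punctuation` is a hypothetical name used only for this illustration. It also shows why `flags` should be passed by keyword: the fourth positional argument of `re.sub` is `count`, which limits how many matches are replaced.

```python
import re


def strip_punctuation(text):
    # [^\w\s] matches any character that is neither a word character nor whitespace.
    return re.sub(r"[^\w\s]", "", text, flags=re.UNICODE).lower()


sample = "This movie was GREAT!!! <br />Loved it, truly."
cleaned = strip_punctuation(sample)
print(cleaned)

# With count=2 only the first two punctuation characters are removed.
partially_cleaned = re.sub(r"[^\w\s]", "", sample, count=2)
print(partially_cleaned)
```

Note that an HTML fragment such as `<br />` leaves a stray `br` token behind, so raw reviews that contain markup may need an extra cleaning step.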

In [None]:
#train 
d1=train_data['review']
data=clean_text(d1)

In [None]:
# test
d2=test_data['review']
data_test=clean_text(d2)

In [None]:
# train
train_data['review']=pd.Series(data)
train_data.head()
# Here we can see that the reviews are cleaner.

In [None]:
# test
test_data['review']=pd.Series(data_test)
test_data.head()

### Tokenization
Now we will tokenize all the cleaned reviews in our dataset. Tokens are individual terms or words, and tokenization is the process of splitting a string of text into tokens.

In [None]:
# train
# First, let's remove all words with 3 or fewer characters.
train_data['review'] = train_data['review'].apply(lambda x: ' '.join([w for w in x.split() if len(w)>3]))

In [None]:
tokenized_review = train_data['review'].apply(lambda x: x.split())
tokenized_review.head()

In [None]:
# test
# First, let's remove all words with 3 or fewer characters.
test_data['review'] = test_data['review'].apply(lambda x: ' '.join([w for w in x.split() if len(w)>3]))

In [None]:
tokenized_review_test = test_data['review'].apply(lambda x: x.split())
tokenized_review_test.head()
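As a quick check of what the length filter and `split()` do together, here is a minimal sketch on a made-up sentence (the sentence and variable names are mine, not from the dataset).

```python
sample = "not a bad film but far too long"

# Keep only words longer than 3 characters, mirroring the len(w) > 3 condition above.
kept = [w for w in sample.split() if len(w) > 3]
print(kept)
```

One thing worth noticing is that short but sentiment-bearing words such as "not" and "bad" are dropped by this filter, so the threshold is a trade-off between a smaller vocabulary and lost signal.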

### Stemming
1. Stemming is a rule-based process of stripping suffixes (“ing”, “ly”, “es”, “s”, etc.) from a word. For example, “play”, “player”, “played”, “plays” and “playing” are all variations of the word “play”.

2. Stemming and lemmatization are text normalization (sometimes called word normalization) techniques in natural language processing, used to prepare text, words, and documents for further processing.
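To make the idea of rule-based suffix stripping concrete, here is a toy sketch. It is deliberately much simpler than the Porter algorithm: `toy_stem` and its suffix list are hypothetical, and real stemmers apply many more rules.

```python
SUFFIXES = ("ing", "ed", "ly", "es", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["play", "played", "plays", "playing", "movies"]
stems = [toy_stem(w) for w in words]
print(stems)
```

Even this toy version shows the main risk of stemming: "movies" becomes "movi", which is no longer a real word.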

In [None]:
# train
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

stem_review = tokenized_review.apply(lambda x: [stemmer.stem(i) for i in x]) # stemming
stem_review.head()

In [None]:
# test
stem_review_test = tokenized_review_test.apply(lambda x: [stemmer.stem(i) for i in x]) # stemming
stem_review_test.head()

### Removing the stop words


In [None]:
def remove_stop(review):
  """Drop stop words from each tokenized review and join the rest into one string."""
  final_review=[]
  for i in review:
    text = [word for word in i if word not in stop_words]
    text = " ".join(text)
    final_review.append(text)
  return final_review
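Here is a minimal sketch of the same filtering step on a single tokenized review. The stop-word set below is a tiny hand-picked stand-in for the NLTK list, used only so the example runs on its own.

```python
# Assumption: a small illustrative stop-word set, not the full NLTK English list.
toy_stop_words = {"this", "that", "with", "have", "very"}

tokens = ["this", "film", "have", "very", "good", "acting"]

# Keep every token that is not a stop word, then join back into one string.
filtered = " ".join(word for word in tokens if word not in toy_stop_words)
print(filtered)
```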

In [None]:
# train
final_review=remove_stop(stem_review)

In [None]:
# test
final_review_test=remove_stop(stem_review_test)

Now let's put it back into our data frame and compare it with the earlier version.


In [None]:
# train
train_data['stem_review']=final_review
train_data.head()

In [None]:
# test
test_data['stem_review_test']=final_review_test
test_data.head()

As we can see here, stemming has a negative impact on the words; some words even lose their meaning. Let's try again without stemming.

### Text without stemming.

In [None]:
# Same stop-word filter as before, redefined here so this cell runs on its own.
def remove_stop(review):
  unstemmed=[]
  for i in review:
    text = [word for word in i if word not in stop_words]
    text = " ".join(text)
    unstemmed.append(text)
  return unstemmed

In [None]:
# train
unstemmed=remove_stop(tokenized_review)

train_data['final_review']=pd.Series(unstemmed)

In [None]:
# test

unstemmed_test=remove_stop(tokenized_review_test)

test_data['final_review_test']=pd.Series(unstemmed_test)

### Final Review

In [None]:
train_data.head()

In [None]:
test_data.head()

Out of all three versions of the review text (cleaned, stemmed, and unstemmed with stop words removed), we can see that the final review is the most suitable one.

## Using an RNN (Recurrent Neural Network) to train the model
To train the model on the given reviews, I am going to use an advanced version of the RNN called an LSTM.

### LSTM RNN Networks
Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies. They work tremendously well on a large variety of problems, and are now widely used.
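For reference, the usual textbook formulation of a single LSTM step is sketched below. The symbols are assumptions of this sketch rather than anything defined in this notebook: $x_t$ is the input at step $t$, $h_t$ the hidden state, $c_t$ the cell state, $\sigma$ the logistic sigmoid, and $\odot$ element-wise multiplication.

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The additive update of $c_t$ is what lets information persist over many steps: when the forget gate $f_t$ stays close to 1, the previous cell state passes through largely unchanged.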

### Library under use
To train this RNN, I am going to use the Keras library, which is an open-source neural-network library written in Python.

### Keras to train RNN
So what exactly is Keras? Put simply, it makes programming machine learning algorithms much easier. It runs on top of TensorFlow or Theano, cutting down on the amount of code you have to write. In more technical terms, Keras is a high-level neural network API written in Python.



### Importing Essential Libraries

In [None]:
# Importing RNN dependencies
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense , Input , LSTM , Embedding, Dropout , Activation, GRU, Flatten
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.models import Model, Sequential
from keras.layers import Convolution1D
from keras import initializers, regularizers, constraints, optimizers, layers


### Formatting
Our tools are ready! We can now format our data, i.e. convert it into a form the network can train on.

In [None]:
max_features = 6000
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(train_data['final_review'])
list_tokenized_train = tokenizer.texts_to_sequences(train_data['final_review'])

maxlen = 130
X = pad_sequences(list_tokenized_train, maxlen=maxlen)
y = train_data['sentiment']
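To show what the formatting step produces, here is a pure-Python sketch of the same idea: rank words by frequency, replace each word with its rank, then pad or truncate at the front. The helpers `build_vocab`, `encode`, and `pad_pre` are hypothetical stand-ins written for this illustration. They are meant to mirror the default behaviour of the Keras `Tokenizer` and `pad_sequences` as I understand it, not to reproduce them exactly.

```python
from collections import Counter


def build_vocab(texts, max_features):
    """Rank words by frequency; index 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = [word for word, _ in counts.most_common(max_features - 1)]
    return {word: index for index, word in enumerate(ranked, start=1)}


def encode(text, vocab):
    """Replace each known word with its integer index and drop unknown words."""
    return [vocab[word] for word in text.split() if word in vocab]


def pad_pre(sequence, maxlen):
    """Keep the last maxlen items and left-pad with zeros up to maxlen."""
    sequence = sequence[-maxlen:]
    return [0] * (maxlen - len(sequence)) + sequence


toy_reviews = ["great film great cast", "great film", "boring"]
vocab = build_vocab(toy_reviews, max_features=3)
encoded = encode(toy_reviews[0], vocab)
padded = pad_pre(encoded, maxlen=5)
truncated = pad_pre(encoded, maxlen=2)
print(vocab, encoded, padded, truncated)
```

With `maxlen = 130`, this means short reviews get leading zeros and long reviews keep only their last 130 tokens.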

In [None]:
X.shape

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,shuffle=True,test_size=0.2)

In [None]:
X_train.shape

### Building the RNN model
That's the data formatting and representation part finished! We can now start building our RNN model.

In [None]:
embed_size = 128
model = Sequential()
model.add(Embedding(max_features, embed_size))
model.add(Bidirectional(LSTM(32, return_sequences = True)))
model.add(GlobalMaxPool1D())
model.add(Dense(20, activation="relu"))
model.add(Dropout(0.05))
model.add(Dense(1, activation="sigmoid"))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
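To get a feel for the size of this model, the number of trainable parameters can be worked out by hand. The sketch below assumes default Keras settings (bias terms in every layer and the two LSTM directions concatenated), so treat the total as an estimate to compare against `model.summary()`.

```python
max_features = 6000
embed_size = 128
lstm_units = 32
dense_units = 20

# Embedding: one vector of size embed_size per vocabulary index.
embedding_params = max_features * embed_size

# One LSTM direction: 4 gates, each with input weights, recurrent weights, and a bias.
lstm_one_direction = 4 * ((embed_size + lstm_units) * lstm_units + lstm_units)
bilstm_params = 2 * lstm_one_direction

# After concatenation and max pooling, the hidden Dense layer sees 2 * lstm_units features.
dense_hidden_params = (2 * lstm_units) * dense_units + dense_units
dense_output_params = dense_units * 1 + 1

total_params = (embedding_params + bilstm_params
                + dense_hidden_params + dense_output_params)
print(embedding_params, bilstm_params, dense_hidden_params,
      dense_output_params, total_params)
```

Most of the parameters sit in the embedding layer, which is one reason the model can overfit quickly on a dataset of this size.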

In [None]:
batch_size = 100
epochs = 3
model.fit(X_train,y_train, batch_size=batch_size, epochs=epochs, validation_split=0.2)

Here I use only 3 epochs because when I previously trained for 10 epochs, the training accuracy kept increasing after the 3rd epoch while the validation accuracy decreased, which indicates that the model starts overfitting after the 3rd epoch.
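The check described above can be sketched as a small helper that picks the epoch with the highest validation accuracy. `best_epoch` is a hypothetical function, and the accuracy values are invented purely to exercise it; they are not results from this notebook.

```python
def best_epoch(val_accuracy):
    """Return the 1-based epoch with the highest validation accuracy."""
    best_index = max(range(len(val_accuracy)), key=lambda i: val_accuracy[i])
    return best_index + 1


# Invented numbers for illustration only; NOT measurements from this model.
illustrative_val_accuracy = [0.80, 0.84, 0.86, 0.85, 0.83]
chosen = best_epoch(illustrative_val_accuracy)
print(chosen)
```

Keras also provides an `EarlyStopping` callback that can automate this kind of check during training.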

### Model Testing

In [None]:
prediction = model.predict(X_test)

In [None]:
y_pred = (prediction > 0.5)

In [None]:
from sklearn.metrics import f1_score, confusion_matrix

# scikit-learn metrics expect the true labels first, then the predictions.
print('F1-score: {0}'.format(f1_score(y_test, y_pred)))

In [None]:
print('Confusion matrix:')
# Rows are true labels and columns are predicted labels.
confusion_matrix(y_test, y_pred)
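To make the two metrics less of a black box, here is a pure-Python sketch of the 0.5 threshold, the confusion-matrix counts, and the F1 formula. The probabilities and labels are made up for illustration, and `confusion_counts` and `f1_from_counts` are hypothetical helpers, not scikit-learn functions.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def f1_from_counts(fp, fn, tp):
    """F1 = 2*TP / (2*TP + FP + FN), defined as 0.0 when there are no positives."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Made-up probabilities and labels, for illustration only.
toy_probabilities = [0.9, 0.2, 0.6, 0.4, 0.8, 0.1]
toy_true = [1, 0, 0, 1, 1, 0]

toy_pred = [1 if p > 0.5 else 0 for p in toy_probabilities]
tn, fp, fn, tp = confusion_counts(toy_true, toy_pred)
f1 = f1_from_counts(fp, fn, tp)
print([[tn, fp], [fn, tp]], f1)
```

The nested list printed at the end follows the layout scikit-learn uses for binary problems: true labels on the rows and predicted labels on the columns.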