# Sentiment Analysis Part 2

_Natural Language Processing Nanodegree Program_

---



## Step 5: Switching gears - RNNs

We just saw how the task of sentiment analysis can be solved via a traditional machine learning approach: BoW + a nonlinear classifier. We now switch gears and use Recurrent Neural Networks, and in particular LSTMs, to perform sentiment analysis in Keras. Conveniently, Keras has a built-in [IMDb movie reviews dataset](https://keras.io/datasets/#imdb-movie-reviews-sentiment-classification) that we can use, with the same vocabulary size.

In [1]:
from keras.datasets import imdb  # import the built-in imdb dataset in Keras

# Set the vocabulary size
vocabulary_size = 5000

# Load in training and test data (note the difference in convention compared to scikit-learn)
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=vocabulary_size)
print("Loaded dataset with {} training samples, {} test samples".format(len(X_train), len(X_test)))

Using TensorFlow backend.


Downloading data from https://s3.amazonaws.com/text-datasets/imdb.npz
Loaded dataset with 25000 training samples, 25000 test samples


In [2]:
# Inspect a sample review and its label
print("--- Review ---")
print(X_train[7])
print("--- Label ---")
print(y_train[7])

--- Review ---
[1, 4, 698, 1071, 396, 1510, 122, 24, 2, 4402, 19, 4, 326, 7, 2, 4, 3023, 4132, 466, 15, 1063, 5, 2, 2, 12, 16, 2429, 8, 2242, 2, 5, 2, 3286, 2, 5, 1248, 349, 8, 521, 4, 698, 2, 134, 84, 71, 220, 1097, 2, 39, 4, 655, 4132, 54, 4, 1071, 2, 69, 2, 8, 123, 68, 290, 4, 2, 7, 4, 2, 71, 1412, 8, 98, 247, 74, 30, 2, 33, 344, 34, 2, 5, 2, 17, 69, 2453, 77, 4, 420, 36, 2, 8, 2, 4, 2, 2, 513, 38, 2, 5, 2, 877, 572, 1063, 19, 15, 707, 7, 4069, 4, 4311, 2159, 7, 1071, 2, 562, 68, 2, 39, 2358, 180, 4, 1031, 2407, 827, 2115, 382, 4, 91, 804, 1071, 396, 2, 126, 16, 4, 492, 7, 6, 704, 2, 24, 6, 2, 4875, 1346, 11, 522, 4575, 1831, 6, 704, 11, 4, 2, 7, 1208, 2, 3237, 5, 2, 4, 236, 7, 94, 2, 8, 94, 1138, 2, 17, 946, 17, 2, 2613, 100, 2, 125, 27, 2, 2, 2, 2, 48, 16, 2, 19, 2, 1805, 34, 4, 2, 4132, 5, 3419, 2, 34, 316, 334, 12, 215, 30, 2032, 15, 4, 38, 446, 1506, 7, 119, 16, 1477, 34, 4, 2, 2634, 6, 701, 1494, 15, 317, 6, 171, 2, 11, 1316, 19, 2, 1828, 5, 4, 1206, 590, 2, 19, 31, 42, 107, 1

Notice that the label is an integer (0 for negative, 1 for positive), and the review itself is stored as a sequence of integers. These are word IDs that have been preassigned to individual words. To map them back to the original words, you can use the dictionary returned by `imdb.get_word_index()`.

In [3]:
# Map word IDs back to words
word2id = imdb.get_word_index()
id2word = {i: word for word, i in word2id.items()}
print("--- Review (with words) ---")
print([id2word.get(i, " ") for i in X_train[7]])
print("--- Label ---")
print(y_train[7])

Downloading data from https://s3.amazonaws.com/text-datasets/imdb_word_index.json
--- Review (with words) ---
['the', 'of', 'non', 'tension', 'doing', 'wall', 'off', 'his', 'and', 'treats', 'film', 'of', 'less', 'br', 'and', 'of', 'bridge', 'props', 'throughout', 'for', 'members', 'to', 'and', 'and', 'that', 'with', 'qualities', 'in', 'dry', 'and', 'to', 'and', 'stunts', 'and', 'to', 'aspect', 'budget', 'in', 'actress', 'of', 'non', 'and', 'while', 'great', 'than', 'family', 'bored', 'and', 'or', 'of', 'husband', 'props', 'no', 'of', 'tension', 'and', 'me', 'and', 'in', 'ever', 'were', 'main', 'of', 'and', 'br', 'of', 'and', 'than', 'cry', 'in', 'any', 'girl', 'been', 'at', 'and', 'they', 'line', 'who', 'and', 'to', 'and', 'movie', 'me', 'deaths', 'will', 'of', 'liked', 'from', 'and', 'in', 'and', 'of', 'and', 'and', 'kill', 'her', 'and', 'to', 'and', 'killing', 'happened', 'members', 'film', 'for', 'silly', 'br', 'unintentional', 'of', 'hal', 'pair', 'br', 'tension', 'and', 'strong', 
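The decoded review above reads like word salad. A likely reason is that `imdb.load_data()`, with its default arguments, reserves IDs 0, 1 and 2 for padding, start-of-sequence and out-of-vocabulary markers and shifts every real word ID up by 3 (`index_from=3`), while `imdb.get_word_index()` returns the unshifted frequency ranks. The sketch below shows a decoder that accounts for this shift. The helper name `decode_review`, the placeholder strings and the tiny `toy_word2id` dictionary are illustrative assumptions, not part of Keras.

```python
def decode_review(ids, word2id, index_from=3):
    """Map a sequence of Keras IMDb word IDs back to words."""
    # Reserved IDs used by imdb.load_data() with its default arguments
    special = {0: "<PAD>", 1: "<START>", 2: "<UNK>"}
    # Shift the raw ranks so that they line up with the IDs stored in X_train
    id2word = {rank + index_from: word for word, rank in word2id.items()}
    return [special[i] if i in special else id2word.get(i, "<UNK>") for i in ids]


# Toy stand-in for imdb.get_word_index(), which maps word -> frequency rank
toy_word2id = {"the": 1, "and": 2, "a": 3, "of": 4}
decoded = decode_review([1, 4, 7, 2, 0], toy_word2id)
print(decoded)  # ['<START>', 'the', 'of', '<UNK>', '<PAD>']
```

With the real data, the equivalent call would be `decode_review(X_train[7], imdb.get_word_index())`.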

In [83]:
word2id.get('unknown')

1856

Unlike our Bag-of-Words approach, where we simply summarized the counts of each word in a document, this representation retains the entire sequence of words (minus punctuation, and with words outside the vocabulary replaced by a placeholder ID). This is critical for RNNs to function. But it also means that the features can now be of different lengths!

#### Question: Variable length reviews

What is the maximum review length (in terms of number of words) in the training set? What is the minimum?

#### Answer:

My guess is that the maximum is 500 words per review and that there is no fixed minimum. Both values are guesses and were not measured on the data.
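Rather than guessing, the lengths can be measured directly, as long as this is done before the padding step makes every review the same length. Below is a minimal sketch; `review_length_stats` and the `toy_reviews` list are hypothetical names used only for illustration.

```python
def review_length_stats(sequences):
    """Return (shortest, longest) length over a collection of word-ID sequences."""
    lengths = [len(seq) for seq in sequences]
    return min(lengths), max(lengths)


# Toy stand-in for X_train: each review is a list of integer word IDs
toy_reviews = [[1, 14, 22], [1, 4, 698, 1071, 396], [1, 2]]
shortest, longest = review_length_stats(toy_reviews)
print("min:", shortest, "max:", longest)  # min: 2 max: 5
```

On the real data, the call would be `review_length_stats(X_train)`, run right after `imdb.load_data()`.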

### TODO: Pad sequences

In order to feed this data into your RNN, all input documents must have the same length. Let's limit the maximum review length to `max_words` by truncating longer reviews and padding shorter reviews with a null value (0). You can accomplish this easily using the [`pad_sequences()`](https://keras.io/preprocessing/sequence/#pad_sequences) function in Keras. For now, set `max_words` to 500.

In [6]:
from keras.preprocessing import sequence

# Set the maximum number of words per document (for both training and testing)
max_words = 500

# TODO: Pad sequences in X_train and X_test
X_train = sequence.pad_sequences(X_train, maxlen=max_words, value=0)
X_test = sequence.pad_sequences(X_test, maxlen=max_words, value=0)



In [9]:
len(X_train[8])

500
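By default, `pad_sequences()` pads and truncates at the front of each sequence (`padding='pre'`, `truncating='pre'`), so the end of a long review is what survives. The sketch below mimics that default behaviour on toy data to make the effect visible; `pad_or_truncate` is a hypothetical helper, not a Keras function.

```python
import numpy as np


def pad_or_truncate(seq, maxlen, value=0):
    """Mimic the pad_sequences() defaults: pad and truncate at the front."""
    seq = list(seq)[-maxlen:]                # keep only the last maxlen IDs
    padding = [value] * (maxlen - len(seq))  # padding goes before the review
    return np.array(padding + seq, dtype="int32")


short_padded = pad_or_truncate([5, 6, 7], maxlen=5)
long_cut = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short_padded)  # [0 0 5 6 7]
print(long_cut)      # [3 4 5 6 7]
```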

### TODO: Design an RNN model for sentiment analysis

Build your model architecture in the code cell below. We have imported some layers from Keras that you might need but feel free to use any other layers / transformations you like.

Remember that your input is a sequence of words (technically, integer word IDs) of maximum length = `max_words`, and your output is a binary sentiment label (0 or 1).

In [270]:
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout

# TODO: Design your model
model = Sequential()
model.add(Embedding(input_dim=vocabulary_size,output_dim=64,input_length=max_words))
model.add(LSTM(64))
model.add(Dense(1,activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy',
              metrics=['acc'])

print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_6 (Embedding)      (None, 500, 64)           320000    
_________________________________________________________________
lstm_6 (LSTM)                (None, 64)                33024     
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 65        
Total params: 353,089
Trainable params: 353,089
Non-trainable params: 0
_________________________________________________________________
None


#### Question: Architecture and parameters

Briefly describe your neural net architecture. How many model parameters does it have that need to be trained?

#### Answer:

...
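The parameter counts printed by `model.summary()` can be reproduced by hand, which is a useful check when describing the architecture. The sketch below uses the standard formulas for an embedding layer, an LSTM layer with biases, and a dense layer; the function names are hypothetical.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per word ID
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias vector
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units


counts = {
    "embedding": embedding_params(5000, 64),
    "lstm": lstm_params(64, 64),
    "dense": dense_params(64, 1),
}
total = sum(counts.values())
print(counts, total)  # {'embedding': 320000, 'lstm': 33024, 'dense': 65} 353089
```

These values match the `Param #` column and the `Total params` line in the summary above.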

### TODO: Train and evaluate your model

Now you are ready to train your model. In Keras, you first need to _compile_ your model by specifying the loss function and optimizer you want to use while training, as well as any evaluation metrics you'd like to measure. Specify the appropriate parameters, including at least one metric, `'accuracy'`.

In [271]:
# Train the model (the loss, optimizer, and metrics were already set by model.compile() in the model-design cell)
model.fit(X_train, y_train, epochs=7, batch_size=64, validation_split=0.2)

Train on 20000 samples, validate on 5000 samples
Epoch 1/7
Epoch 2/7
Epoch 3/7
Epoch 4/7
Epoch 5/7
Epoch 6/7
Epoch 7/7


<keras.callbacks.History at 0x7f9c1b973198>

Once compiled, you can kick off the training process. There are two important training parameters that you have to specify - **batch size** and **number of training epochs**, which together with your model architecture determine the total training time.
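To get a feel for how batch size and epochs translate into work, it can help to count weight updates: one per batch, repeated every epoch. The sketch below is simple arithmetic, not a Keras API. It uses the 20000 training samples, batch size of 64 and 7 epochs from the run above, and `total_updates` is a hypothetical helper.

```python
import math


def total_updates(num_train_samples, batch_size, epochs):
    """Return (batches per epoch, total weight updates over all epochs)."""
    steps_per_epoch = math.ceil(num_train_samples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


steps, updates = total_updates(num_train_samples=20000, batch_size=64, epochs=7)
print(steps, updates)  # 313 2191
```

Halving the batch size roughly doubles the number of updates per epoch, while each update processes fewer reviews.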

Training may take a while, so grab a cup of coffee, or better, go for a hike! If possible, consider using a GPU, as a single training run can take several hours on a CPU.

> **Tip**: You can split off a small portion of the training set to be used for validation during training. This will help monitor the training process and identify potential overfitting. You can supply a validation set to `model.fit()` using its `validation_data` parameter, or just specify `validation_split` - a fraction of the training data for Keras to set aside for this purpose (typically 5-10%). Validation metrics are evaluated once at the end of each epoch.
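When `validation_split` is used, Keras takes the validation samples from the end of the arrays passed to `fit()`, before any shuffling. The sketch below reproduces that split with NumPy so the result can be passed explicitly through `validation_data`; `holdout_split` and the toy arrays are hypothetical.

```python
import numpy as np


def holdout_split(X, y, validation_split):
    """Split off the last fraction of samples, as validation_split does."""
    split_at = int(len(X) * (1.0 - validation_split))
    return (X[:split_at], y[:split_at]), (X[split_at:], y[split_at:])


# Toy data: 10 samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)
(X_tr, y_tr), (X_val, y_val) = holdout_split(X_toy, y_toy, validation_split=0.2)
print(len(X_tr), len(X_val))  # 8 2

# With the real data, this would be used as:
# model.fit(X_tr, y_tr, validation_data=(X_val, y_val), epochs=7, batch_size=64)
```

Because the held-out samples always come from the end of the arrays, data that is sorted by label should be shuffled before splitting.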

In [272]:
import os

# Save your model, so that you can quickly load it in future (and perhaps resume training)
model_file = "rnn_model.h5"  # HDF5 file
cache_dir='./'
model.save(os.path.join(cache_dir, model_file))

# Later you can load it using keras.models.load_model()
#from keras.models import load_model
#model = load_model(os.path.join(cache_dir, model_file))

Once you have trained your model, it's time to see how well it performs on unseen test data.

In [273]:
# Evaluate your model on the test set
scores = model.evaluate(X_test, y_test, verbose=0)  # returns loss and other metrics specified in model.compile()
print("Test accuracy:", scores[1])  # scores[1] should correspond to accuracy if you passed in metrics=['accuracy']



Test accuracy: 0.87148


In [70]:
# RegEx for removing non-letter characters
import re

# NLTK library for the remaining steps
import nltk
nltk.download("stopwords")   # download list of stopwords (only once; need not run it again)
from nltk.corpus import stopwords # import stopwords
from nltk.stem.porter import PorterStemmer  # explicit import instead of a wildcard
stemmer = PorterStemmer()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [151]:
word2id.get("unknown", "")

1856

In [168]:

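# predict_classes() is available on Sequential models in older Keras releases;
# in newer releases use (model.predict(X_test) > 0.5).astype("int32") instead
output = model.predict_classes(X_test)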





In [274]:

import pandas as pd

def visualize_prediction(output):
    # predict_classes() returns an array of shape (n, 1); flatten it to a plain list
    outputs = [int(value[0]) for value in output]
    df = pd.DataFrame(data={'Label': list(y_test), 'Prediction': outputs})

    # Keras IMDb convention: 0 = negative, 1 = positive
    names = {0: 'Negative', 1: 'Positive'}
    df['Label'] = df['Label'].map(names)
    df['Prediction'] = df['Prediction'].map(names)
    return df

visualize_prediction(output)

Unnamed: 0,Label,Prediction
0,Positive,Positive
1,Positive,Positive
2,Positive,Negative
3,Negative,Negative
4,Positive,Positive
5,Positive,Negative
6,Negative,Negative
7,Positive,Negative
8,Positive,Positive
9,Negative,Negative

In [232]:
# Later you can load it using keras.models.load_model()
from keras.models import load_model
model = load_model(os.path.join(cache_dir, model_file))

In [233]:
model.fit(X_train, y_train, epochs=3, batch_size=64, validation_split=0.5)

Train on 12500 samples, validate on 12500 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7f9c1c7406d8>

In [234]:
model.fit(X_train, y_train, epochs=1, batch_size=32, validation_split=0.4)

Train on 15000 samples, validate on 10000 samples
Epoch 1/1


<keras.callbacks.History at 0x7f9c1c7401d0>

In [261]:
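import numpy as np

msg = "Such a loss of time "

def review_to_array(text):
    """Convert a raw review string into a padded array of word IDs."""
    # Keep only letters and digits, then lowercase and split on whitespace
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)
    words = text.lower().split()

    # Look up each word's ID, skipping words that are missing from the index
    review_array = np.array([word2id[w] for w in words if w in word2id])

    # Left-pad with zeros up to max_words, matching pad_sequences()
    test_seq = np.pad(review_array, (max_words - len(review_array), 0),
                      'constant', constant_values=0)

    # Add a batch dimension: shape (1, max_words)
    return test_seq.reshape(-1, max_words)


def get_predicion(msg):
    """Return 'Positive' or 'Negative' for a raw review string."""
    msg = review_to_array(msg)
    pred = model.predict_classes(msg)
    return 'Positive' if pred[0] == 1 else 'Negative'


pred = get_predicion(msg)
print(pred)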

Positive


In [259]:
from ipywidgets import widgets 
from IPython.display import display

print('Please enter your movie review.')
text = widgets.Text()

display(text)
def handle_submit(sender):
    review = text.value
    try:
        res = get_predicion(review)
        printed = 'You liked the movie.' if res == 'Positive' else "You didn't like that movie."
        print(printed)
    except Exception:  # e.g. the review could not be converted into model input
        print('Please enter a more descriptive review for the movie/series')
    

  
text.on_submit(handle_submit)

Please enter your movie review.


gdf
You liked the movie.
I hate it
You liked the movie.
I Love it but 
You didn't like that movie.
Amazing
You liked the movie.
terrible
You liked the movie.
Terrible e
You liked the movie.
Terrible 
You liked the movie.
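The widget output above labels reviews such as "I hate it" and "terrible" as positive. One likely contributor is the encoding of new text: `imdb.get_word_index()` returns raw frequency ranks, whereas the training data from `imdb.load_data()` starts each review with ID 1, shifts every word ID up by 3, and replaces IDs at or above `num_words` with the out-of-vocabulary ID 2. The sketch below encodes text under those same default conventions. `encode_review` and the ranks in `toy_word2id` are illustrative assumptions, and this has not been run against the trained model.

```python
import re


def encode_review(text, word2id, vocabulary_size, max_words, index_from=3):
    """Encode raw text the way imdb.load_data() encodes reviews by default."""
    start_id, oov_id = 1, 2  # load_data() defaults for start and unknown words
    words = re.sub(r"[^a-zA-Z0-9]", " ", text).lower().split()
    ids = [start_id]
    for word in words:
        rank = word2id.get(word)
        shifted = rank + index_from if rank is not None else oov_id
        # IDs outside the training vocabulary become the OOV ID
        ids.append(shifted if shifted < vocabulary_size else oov_id)
    ids = ids[-max_words:]                     # truncate at the front
    return [0] * (max_words - len(ids)) + ids  # left-pad with zeros


# Toy stand-in for imdb.get_word_index(); the ranks here are made up
toy_word2id = {"the": 1, "movie": 17, "terrible": 9999}
encoded = encode_review("The movie: terrible!", toy_word2id,
                        vocabulary_size=5000, max_words=6)
print(encoded)  # [0, 0, 1, 4, 20, 2]
```

With the real objects, the result could be wrapped as `np.array([encoded])` and passed to `model.predict()`.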


#### Question: Comparing RNNs and Traditional Methods

How well does your RNN model perform compared to the BoW + Gradient-Boosted Decision Trees?

#### Answer:

...

## Extensions

There are several ways in which you can build upon this notebook. Each comes with its set of challenges, but can be a rewarding experience.

- The first thing is to try and improve the accuracy of your model by experimenting with different architectures, layers and parameters. How good can you get without taking prohibitively long to train? How do you prevent overfitting?

- Then, you may want to deploy your model as a mobile app or web service. What do you need to do in order to package your model for such deployment? How would you accept a new review, convert it into a form suitable for your model, and perform the actual prediction? (Note that the same environment you used during training may not be available.)

- One simplification we made in this notebook is to limit the task to binary classification. The dataset actually includes a more fine-grained review rating that is indicated in each review's filename (which is of the form `<[id]_[rating].txt>` where `[id]` is a unique identifier and `[rating]` is on a scale of 1-10; note that neutral reviews, with ratings > 4 and < 7, have been excluded). How would you modify the notebook to perform regression on the review ratings? In what situations is regression more useful than classification, and vice-versa?
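As a starting point for the regression extension, the rating has to be recovered from each filename and turned into a numeric target. The sketch below parses the `[id]_[rating].txt` pattern described above and rescales the 1-10 rating to the range [0, 1], which is one option if the model keeps a sigmoid output. The helper names and the example filename are hypothetical.

```python
import re

# Matches names of the form [id]_[rating].txt, e.g. "123_8.txt"
RATING_PATTERN = re.compile(r"^(\d+)_(\d+)\.txt$")


def rating_from_filename(filename):
    """Extract the integer rating from a review filename."""
    match = RATING_PATTERN.match(filename)
    if match is None:
        raise ValueError("unexpected filename: {}".format(filename))
    return int(match.group(2))


def scale_rating(rating, low=1, high=10):
    """Rescale a rating from [low, high] to [0, 1]."""
    return (rating - low) / (high - low)


rating = rating_from_filename("123_8.txt")  # hypothetical filename
target = scale_rating(rating)
print(rating, round(target, 3))  # 8 0.778
```

A regression model would then be trained on these targets with a loss such as mean squared error instead of binary cross-entropy.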

Whatever direction you take, make sure to share your results and learnings with your peers, through blogs, discussions and participating in online competitions. This is also a great way to become more visible to potential employers!