<a href="https://colab.research.google.com/github/Boadzie/Jupyter-Notebooks/blob/master/Sentimental_Analysis_with_LSTM_and_IMDB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Business Problem
---
**Predict Sentiment in Text**

### Step 1: Get the Data

In [0]:
# load keras libraries, layers and the imdb dataset
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding
from keras.layers import LSTM
from keras.datasets import imdb


# decide maximum number of words to load in dataset
max_features = 20000


# decide maximum words in a sentence and batch size
maxlen = 50
batch_size = 32

Using TensorFlow backend.


In [0]:
# !pip install numpy==1.16.1
import numpy as np

# load the data
print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

Loading data...


In [0]:
# pad the sequences
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

x_train shape: (25000, 50)
x_test shape: (25000, 50)


### Explore the Text Dataset

In [0]:
# show a data sample
print("Sample of x_train array = ", x_train[0])
print("Sample of y_train array = ", y_train[0])


Sample of x_train array =  [2071   56   26  141    6  194 7486   18    4  226   22   21  134  476
   26  480    5  144   30 5535   18   51   36   28  224   92   25  104
    4  226   65   16   38 1334   88   12   16  283    5   16 4472  113
  103   32   15   16 5345   19  178   32]
Sample of y_train array =  1


In [0]:
# get the vocabulary used for converting words to numbers
imdb_vocab = imdb.get_word_index()

In [0]:
# create a small vocabulary with only top 20 items and print it
# this is just to understand how the vocabulary looks
small_vocab = { key:value for key, value in imdb_vocab.items() if value < 20 }
print("Vocabulary = ", small_vocab)

Vocabulary =  {'with': 16, 'i': 10, 'as': 14, 'it': 9, 'is': 6, 'in': 8, 'but': 18, 'of': 4, 'this': 11, 'a': 3, 'for': 15, 'br': 7, 'the': 1, 'was': 13, 'and': 2, 'to': 5, 'film': 19, 'movie': 17, 'that': 12}


In [0]:
# function to get the sentence from integer array
# reverse look-up words in vocabulary
def get_original_text(int_arr):
  word_to_id = {k:(v+3) for k,v in imdb_vocab.items()}
  word_to_id["<PAD>"] = 0
  word_to_id["<START>"] = 1
  word_to_id["<UNK>"] = 2
  id_to_word = {value:key for key,value in word_to_id.items()}
  return ' '.join(id_to_word[id] for id in int_arr)

In [0]:
# define sentiments array
sentiment_labels = ['Negative', 'Positive']

In [0]:
print("-------------------------")
print("SOME SENTENCE AND SENTIMENT SAMPLES")

# print some of the training data
for i in range(5):
  print("Training Sentence = ", get_original_text(x_train[i]))
  print("Sentiment = ", sentiment_labels[y_train[i]])
  print("-----------------------")

-------------------------
SOME SENTENCE AND SENTIMENT SAMPLES
Training Sentence =  grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all
Sentiment =  Positive
-----------------------
Training Sentence =  boobs and <UNK> taking away bodies and the gym still doesn't close for <UNK> all joking aside this is a truly bad film whose only charm is to look back on the disaster that was the 80's and have a good old laugh at how bad everything was back then
Sentiment =  Negative
-----------------------
Training Sentence =  must have looked like a great idea on paper but on film it looks like no one in the film has a clue what is going on crap acting crap costumes i can't get across how <UNK> this is to watch save yourself an hour a bit of your life
Sentiment =  Negative
-----------------------
Trai

### Build and Train the LSTM Model

In [0]:
# build the model
model = Sequential()
model.add(Embedding(max_features, 128))
model.add(LSTM(128, dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

In [0]:
# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [0]:
# train the model
model.fit(x_train, y_train, batch_size=batch_size, epochs=2, validation_data=(x_test, y_test))

Instructions for updating:
Use tf.cast instead.
Train on 25000 samples, validate on 25000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7ffb73b8aa20>

In [0]:
score, acc = model.evaluate(x_test, y_test, batch_size=batch_size)

print('Test score:', score)
print('Test accuracy:', acc)

Test score: 0.4212919333076477
Test accuracy: 0.8124


###  Make Predictions on Our Sentences

In [0]:
from keras.preprocessing.text import text_to_word_sequence

# first we will save the model
model.save('imdb_nlp.h5')

In [0]:
# get the word index from imdb dataset
word_index = imdb.get_word_index()

In [0]:
# define the documents
my_sentence1 = 'really bad experience. amazingly bad.'
my_sentence2 = 'pretty awesome to see. very good work.'

In [0]:
# define function to predict sentiment using model
def predict_sentiment(my_test):
  # tokenize the sentence
  
  word_sequence = text_to_word_sequence(my_test)
  # create a blank sequence of integers
  int_sequence = []
  
  # for each word in the sentence
  for w in word_sequence:
    # get the integer from word_index (vocabulary) and add to list
    int_sequence.append(word_index[w])
    
  # pad the sequence of numbers to input size expected by model
  sent_test = sequence.pad_sequences([int_sequence], maxlen=maxlen)
    
    # make a prediction using our Model
  y_pred = model.predict(sent_test)
  return y_pred[0][0]

In [0]:
# show results for sentences
print ('SENTENCE : ', my_sentence1, ' : ', predict_sentiment(my_sentence1), ' : SENTIMENT : ', sentiment_labels[int(round(predict_sentiment(my_sentence1)))] )

print ('SENTENCE : ', my_sentence2, ' : ', predict_sentiment(my_sentence2), ' : SENTIMENT : ', sentiment_labels[int(round(predict_sentiment(my_sentence2)))] )

SENTENCE :  really bad experience. amazingly bad.  :  0.8836059  : SENTIMENT :  Positive
SENTENCE :  pretty awesome to see. very good work.  :  0.17481722  : SENTIMENT :  Negative
