In this example we shall train a sentiment analysis model capable of classifying the attitudes expressed
in a film review. The dataset is a sample of the **IMDb** dataset that contains `50,000`
reviews (split equally into `25,000` train and `25,000` test sets) of movies accompanied by a
label expressing the sentiment of the review (`0 = negative`, `1 = positive`). **IMDb**
(https://www.imdb.com/) is a large online database containing information
about films, TV series, and video games.

In [1]:
import warnings
warnings.filterwarnings('ignore')

# Import required modules and packages

In [2]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
print(tf.__version__)

2.0.0-beta1


# Load and explore the dataset

The **IMDb** text data offered by Keras has been cleaned of punctuation, normalized to lowercase, and
converted into integers. Each word is encoded as a number representing
its rank by frequency.
For the *vocabulary size*, we keep only the `10,000` most frequent words; rarer words are replaced by a placeholder code for unknown words.
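
To make the encoding concrete, here is a minimal sketch on a made-up three-sentence corpus. The corpus and the helper names `build_rank_index` and `encode` are illustrative and not part of Keras. The sketch ranks words by frequency, reserves the ids 0, 1, and 2 for padding, start-of-sequence, and unknown words (the defaults of the Keras IMDb loader), and maps every word whose id falls outside the vocabulary limit to the unknown id.

```python
from collections import Counter

PAD_ID, START_ID, UNK_ID = 0, 1, 2   # reserved ids, mirroring the Keras IMDb defaults
INDEX_FROM = 3                       # real words start at rank + 3

def build_rank_index(texts):
    """Map each word to its frequency rank (1 = most frequent)."""
    counts = Counter(word for text in texts for word in text.split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ordered, start=1)}

def encode(text, rank_index, num_words):
    """Encode a text; words whose id is not below num_words become UNK_ID."""
    ids = [START_ID]
    for word in text.split():
        rank = rank_index.get(word)
        word_id = rank + INDEX_FROM if rank is not None else UNK_ID
        ids.append(word_id if word_id < num_words else UNK_ID)
    return ids

# Hypothetical toy corpus, only for illustration.
toy_corpus = ["the film was good", "the film was bad", "the plot was thin"]
rank_index = build_rank_index(toy_corpus)
encoded = encode("the film was thin", rank_index, num_words=7)
print(encoded)  # [1, 4, 6, 5, 2]
```

The rare word `thin` ends up as `2`, which is how rare words appear in the real sequences once `num_words` is set.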

In [3]:
top_words = 10000
((x_train, y_train), (x_test, y_test)) = keras.datasets.imdb.load_data(num_words = top_words, seed = 21)

In [5]:
print("Size Training Examples: %i" % len(x_train))
print("Size of Training labels: %i\n" % len(y_train))
print("Size Test Examples: %i" % len(x_test))
print("Size of Test labels: %i" % len(y_test))

Size Training Examples: 25000
Size of Training labels: 25000

Size Test Examples: 25000
Size of Test labels: 25000


In [6]:
print('Train set:', np.unique(y_train, return_counts = True), '\n')
print('Test set:', np.unique(y_test, return_counts = True))

Train set: (array([0, 1], dtype=int64), array([12500, 12500], dtype=int64)) 

Test set: (array([0, 1], dtype=int64), array([12500, 12500], dtype=int64))


This demonstration dataset is *balanced*, with an equal number
of positive and negative sentiment examples.

Next we create Python
dictionaries that convert between the integer codes used in the dataset and the actual words of the reviews.

In [7]:
word_to_id = {w: i + 3 for w, i in keras.datasets.imdb.get_word_index().items()}
id_to_word = {0: '<PAD>', 1: '<START>', 2: '<UNK>'}
id_to_word.update({i + 3: w for w, i in keras.datasets.imdb.get_word_index().items()})

In [8]:
def convert_to_text(sequence):
    return ' '.join([id_to_word[s] for s in sequence if s >= 3])

# Let's convert some sample sequences to text and get their scores

In [9]:
print(convert_to_text(x_train[8]))

this movie was like a bad train wreck as horrible as it was you still had to continue to watch my boyfriend and i rented it and wasted two hours of our day now don't get me wrong the acting is good just the movie as a whole just both of us there wasn't anything positive or good about this scenario after this movie i had to go rent something else that was a little lighter jennifer is as usual a very dramatic actress her character seems manic and not all there hannah though over played she does a wonderful job playing out the situation she is in more than once i found myself yelling at the tv telling her to fight back or to get violent all in all very violent movie not for the faint of heart


In [10]:
print('Sentiment score:', y_train[8])

Sentiment score: 0


In [13]:
print(convert_to_text(x_test[100]))

as good as list was i found this movie much more powerful as it is a documentary and based on real life it details the story of the frank family and anne in particular although it is a bit slow moving at first their family life before the war it becomes very powerful br br due to some of the footage and photos of the camps i would not recommend it for children but for adults it illustrates the horror of the holocaust through one young girl highly recommended


In [14]:
print('Sentiment score:', y_test[100])

Sentiment score: 1


The wording of the texts makes it clear that the first is a **negative** review (sentiment `score = 0`), with words like `bad`, `wreck`, and `horrible` used to describe the film. The second is a **positive** review (sentiment `score = 1`), with words like `good`, `powerful`, and `highly recommended` used to describe the film.

# Prepare dataset

Sometimes the first few words in a review reveal the sentiment. Sometimes the sentiment is hidden in the middle or right at the end of the review. In building the model we limit the number of words analysed per review to `200`. With its default settings, `pad_sequences` keeps the last `200` words of longer reviews, and reviews with fewer than 200 words are **padded** with zeros at the front.
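
With its default arguments (`padding='pre'`, `truncating='pre'`), `pad_sequences` works at the front of each sequence: short sequences get zeros prepended, and long ones lose their beginning. The sketch below reproduces that behaviour in NumPy on two toy sequences. The helper name `pad_pre` is hypothetical, and it is an illustration of the idea rather than the Keras implementation.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep the last `maxlen` items of long ones."""
    padded = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue                          # an empty sequence stays all padding
        trunc = seq[-maxlen:]                 # 'pre' truncation drops the beginning
        padded[row, -len(trunc):] = trunc     # 'pre' padding leaves zeros at the front
    return padded

toy = [[5, 6, 7], [1, 2, 3, 4, 5, 6, 7]]
result = pad_pre(toy, maxlen=5)
print(result)
# [[0 0 5 6 7]
#  [3 4 5 6 7]]
```

To keep the beginning of each review instead, `pad_sequences` accepts `truncating='post'`.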

In [15]:
print('Before padding: \n')
print(x_train[0])

Before padding: 

[1, 13, 119, 78, 3310, 102, 13, 66, 81, 13, 462, 7273, 33, 98, 5, 4, 6196, 1308, 16, 260, 6, 9337, 7, 98, 9992, 11, 4, 2, 7, 68, 162, 204, 431, 8870, 3310, 9159, 448, 23, 4, 5596, 12, 610, 40, 12, 16, 170, 8, 30, 545, 1139, 2027, 6, 1034, 7, 2, 1664, 66, 12, 16, 2, 34, 6, 800, 7, 3310, 1274, 342, 2, 63, 9, 3310, 20, 5868, 33, 94, 118, 13, 16, 11, 4, 1310, 13, 16, 1623, 8, 140, 721, 12, 23, 8870, 1168, 1656, 132, 449, 558, 16, 15, 20, 355, 10, 10, 355, 355, 355, 10, 10, 1195, 2509, 4621, 56, 10, 10, 14, 9, 2, 2, 33, 94, 55, 249, 61, 369, 54, 6, 8527, 46, 250, 9, 839, 46, 7, 9429, 748, 5, 2, 8, 6, 2702, 1930, 41, 419, 125, 88, 4, 3310, 406, 6762, 2, 4, 427, 2140, 1656, 4042, 2, 11, 41, 2, 494, 46, 1954, 4712, 198, 51, 13, 683, 1193, 10, 10, 198, 66, 89, 4, 114, 495, 7303, 197, 4, 1168, 1656, 61, 492, 1131, 7, 5388, 21, 13, 839, 90, 145, 8, 113, 34, 8253, 27, 2, 19, 15, 7, 6, 8870, 3310, 88, 8222, 92, 2, 8, 5388, 5, 1037, 2, 2, 2864, 2, 449, 168, 6, 404, 2, 112, 207, 107

In [16]:
max_pad = 200
x_train = keras.preprocessing.sequence.pad_sequences(x_train, maxlen = max_pad)
x_test = keras.preprocessing.sequence.pad_sequences(x_test, maxlen = max_pad)

In [17]:
print('After padding: \n')
print(x_train[0])

After padding: 

[  88    4 3310  406 6762    2    4  427 2140 1656 4042    2   11   41
    2  494   46 1954 4712  198   51   13  683 1193   10   10  198   66
   89    4  114  495 7303  197    4 1168 1656   61  492 1131    7 5388
   21   13  839   90  145    8  113   34 8253   27    2   19   15    7
    6 8870 3310   88 8222   92    2    8 5388    5 1037    2    2 2864
    2  449  168    6  404    2  112  207 1075    4  375 5986    7    4
  406 1522   13  124  903   97   90    2   21    2   48   32  148 3310
    2    2   93   61  492    2  305    7    2    4  893 8016   13  401
 5679   83   27  117 2687 5419   29  941 1889   90   21  808   14   46
  793    4 1526   84   37   28   34   96    7   49    2  114 1009 1054
   56   23   61 2301 1111    9    4  255    8  937   61  492   16 3953
  159   29 1131   13 2134 3872   81   41   32   14  832   56    8   35
  576 1301    5 5348 3134  255  335  170    8    2   72 1168 1656   57
   29    9    2    2 3310  415   11 5215   89 1047   10   10

# Build the Deep Learning Model

In this example the **Embedding** layer learns its word vectors from scratch during training; it does not load a pretrained word embedding such as **Word2vec** or **GloVe**, although one could be substituted.
The model then applies a **Bidirectional** wrapper around an **LSTM** layer of 64 cells.
**Bidirectional** transforms a normal **LSTM** layer by doubling it: one copy
processes the input sequence in its normal order, and the other processes the
reversed sequence. We use this approach because the words that signal sentiment
can appear in different orders, and reading the sequence in both directions gives
each position context from both the preceding and the following words.
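
The idea of the wrapper can be sketched in NumPy. For brevity this sketch uses a plain tanh recurrent cell instead of an LSTM, and random toy weights; the function names `simple_rnn` and `bidirectional` are illustrative. It shows only the mechanics: one pass over the sequence, one pass over the reversed sequence (flipped back so time steps line up), and a concatenation that doubles the feature size.

```python
import numpy as np

def simple_rnn(inputs, w_in, w_rec):
    """Run a plain tanh recurrent cell over time; return one state per step."""
    state = np.zeros(w_rec.shape[0])
    states = []
    for x_t in inputs:
        state = np.tanh(x_t @ w_in + state @ w_rec)
        states.append(state)
    return np.stack(states)

def bidirectional(inputs, fwd_weights, bwd_weights):
    """Concatenate a forward pass with a backward pass re-aligned in time."""
    forward = simple_rnn(inputs, *fwd_weights)
    # Process the reversed sequence, then flip the outputs back.
    backward = simple_rnn(inputs[::-1], *bwd_weights)[::-1]
    return np.concatenate([forward, backward], axis=-1)

rng = np.random.default_rng(0)
timesteps, features, units = 5, 3, 4          # toy sizes, not the model's
sequence = rng.normal(size=(timesteps, features))
fwd = (rng.normal(size=(features, units)), rng.normal(size=(units, units)))
bwd = (rng.normal(size=(features, units)), rng.normal(size=(units, units)))
output = bidirectional(sequence, fwd, bwd)
print(output.shape)  # (5, 8): twice the number of units per time step
```

The same doubling explains why the 64-cell LSTM in the model produces 128 features per time step.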

In [None]:
embedding_vector_length = 32

model = keras.models.Sequential() 
model.add(keras.layers.Embedding(top_words, embedding_vector_length, input_length = max_pad))
model.add(keras.layers.Bidirectional(keras.layers.LSTM(64, return_sequences = True)))
model.add(keras.layers.GlobalMaxPool1D())
model.add(keras.layers.Dense(16, activation = "relu"))
model.add(keras.layers.Dense(1, activation = "sigmoid"))
model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])

In [21]:
print(model.summary())

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 200, 32)           320000    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 200, 128)          49664     
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 128)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 16)                2064      
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 17        
Total params: 371,745
Trainable params: 371,745
Non-trainable params: 0
_________________________________________________________________
None
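
The parameter counts in the summary can be checked by hand. The short calculation below is a sketch that assumes the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias; the variable names are illustrative.

```python
vocab_size, embed_dim, lstm_units, dense_units = 10000, 32, 64, 16

# Embedding: one vector of size embed_dim per vocabulary entry.
embedding_params = vocab_size * embed_dim

# One LSTM: 4 gates x (input weights + recurrent weights + bias).
lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
# Bidirectional holds two independent LSTMs.
bidirectional_params = 2 * lstm_params

# Dense layers: weights plus one bias per output unit.
dense_hidden_params = (2 * lstm_units) * dense_units + dense_units
dense_output_params = dense_units * 1 + 1

total_params = (embedding_params + bidirectional_params
                + dense_hidden_params + dense_output_params)
print(embedding_params, bidirectional_params,
      dense_hidden_params, dense_output_params, total_params)
# 320000 49664 2064 17 371745
```

`GlobalMaxPool1D` has no weights, so it contributes nothing to the total.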


# Train the model

In [23]:
num_epochs = 10
batch_size = 256

In [None]:
history = model.fit(x_train, y_train, batch_size = batch_size, epochs = num_epochs)

# Model Evaluation

In [26]:
train_loss, train_acc = model.evaluate(x_train, y_train)



In [27]:
print('Training Accuracy: {}%'.format(round(float(train_acc) * 100, 2)))

Training Accuracy: 99.63%


In [28]:
test_loss, test_acc = model.evaluate(x_test, y_test)



In [29]:
print('Test Accuracy: {}%'.format(round(float(test_acc) * 100, 2)))

Test Accuracy: 85.04%


We have just trained a sentiment analyzer that guesses the sentiment
expressed in a movie review correctly about `85%` of the time on the test set. A `15%` chance of misjudging a review does not make this a very impressive model, and the gap between the training accuracy (`99.63%`) and the test accuracy (`85.04%`) suggests that it overfits the training data. However, with
more training data and a correspondingly more complex neural architecture, we can get results that are even more impressive.

# The Embeddings

Now that the model is trained, we can extract the *embeddings* it has learned through the model's *layers* attribute. Each embedding is a vector of size `32`, and there are `10,000` embeddings, since the vocabulary size was set to `10,000`.

In [30]:
embeddings = model.layers[0]
embeddings.weights

[<tf.Variable 'embedding_2/embeddings:0' shape=(10000, 32) dtype=float32, numpy=
 array([[ 0.01176346, -0.02211049,  0.01999186, ..., -0.00767972,
          0.00915258,  0.02567264],
        [ 0.03514829, -0.00280277, -0.02752534, ...,  0.01723774,
          0.03138287, -0.02131016],
        [ 0.00956264, -0.00240009, -0.03587588, ..., -0.00217557,
          0.07052197,  0.03453688],
        ...,
        [-0.07937256,  0.10807124, -0.01995642, ...,  0.0930188 ,
          0.09878337, -0.06895982],
        [-0.0682204 ,  0.00144474,  0.09145122, ...,  0.07640875,
          0.03292344,  0.01669248],
        [ 0.06022037, -0.05517315, -0.03429912, ..., -0.09712823,
         -0.0608963 ,  0.00122278]], dtype=float32)>]

In [32]:
weights = embeddings.get_weights()[0]
print('Weights shape:', weights.shape)

Weights shape: (10000, 32)


In [36]:
weights[1]

array([ 3.5148285e-02, -2.8027729e-03, -2.7525343e-02, -6.6123626e-05,
       -1.8618425e-02, -1.9484749e-02, -5.7897922e-02, -6.1920997e-02,
       -5.7299681e-02,  2.9555941e-02, -1.9974085e-03, -5.9607994e-02,
        2.9678049e-02,  3.8764924e-02, -7.4368590e-03,  3.2156970e-02,
       -1.8604781e-02,  4.0184289e-02,  1.3975586e-02,  6.7665675e-03,
       -6.0407475e-02, -4.2519130e-02, -2.9203249e-02, -1.3698910e-04,
        1.5641183e-02, -2.7326887e-02,  8.7939892e-03,  1.0132346e-03,
       -3.4886673e-02,  1.7237742e-02,  3.1382874e-02, -2.1310160e-02],
      dtype=float32)

# Visualizing the Embeddings

To visualize the embeddings in 3D space, we need a mapping from each integer index back to its word, so that every embedding vector can be labeled with the word it represents. To do this, we invert the `word_to_id` dictionary and define a small helper function.

In [33]:
index_based_embedding = dict([(value, key) for (key, value) in word_to_id.items()])

In [34]:
def decode_review(text):
    return ' '.join([index_based_embedding.get(i, '?') for i in text])

In [39]:
index_based_embedding[4]

'the'

In the final part of this exercise, we write the embedding vectors to one `.tsv` file and the corresponding words to a second `.tsv` file, one word per line in the same order as the vectors.

In [40]:
import io

In [41]:
vec = io.open('./data/embedding_vectors_new.tsv', 'w', encoding = 'utf-8')
meta = io.open('./data/metadata_new.tsv', 'w', encoding = 'utf-8')

In [43]:
vocab_size = 10000
for i in range(1, vocab_size):
    if i in index_based_embedding.keys():
        word = index_based_embedding[i]
        embedding_vec_values = weights[i]
        meta.write(word + "\n")
        vec.write('\t'.join([str(x) for x in embedding_vec_values]) + "\n")

In [44]:
meta.close()
vec.close()
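
As an alternative to opening and closing the files in separate cells, the same export can be written with context managers, which close the files even if an error occurs, and with the output directory created on demand. This is a sketch on toy data; the function name `write_projector_files`, the file names, and the temporary directory are assumptions made for illustration.

```python
import os
import tempfile
import numpy as np

def write_projector_files(words, vectors, out_dir):
    """Write one tab-separated vector per line and one word per line."""
    os.makedirs(out_dir, exist_ok=True)       # create the folder if it is missing
    vec_path = os.path.join(out_dir, "vectors.tsv")
    meta_path = os.path.join(out_dir, "metadata.tsv")
    with open(vec_path, "w", encoding="utf-8") as vec, \
         open(meta_path, "w", encoding="utf-8") as meta:
        for word, vector in zip(words, vectors):
            meta.write(word + "\n")
            vec.write("\t".join(str(x) for x in vector) + "\n")
    return vec_path, meta_path

# Toy stand-ins for the real words and the `weights` matrix.
toy_words = ["good", "bad"]
toy_vectors = np.array([[0.1, 0.2], [-0.3, 0.4]])
out_dir = tempfile.mkdtemp()
vec_path, meta_path = write_projector_files(toy_words, toy_vectors, out_dir)

with open(vec_path, encoding="utf-8") as f:
    vec_lines = f.read().splitlines()
with open(meta_path, encoding="utf-8") as f:
    meta_lines = f.read().splitlines()
print(len(vec_lines), meta_lines)
```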

# Visualize the embeddings in 3D space using TensorFlow Projector

To view the embeddings, go to https://projector.tensorflow.org/ and load `embedding_vectors_new.tsv` and `metadata_new.tsv`. Try a positive word such as *recommended* and a negative word like *boring* and see how related words are distributed around them.
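
The same kind of neighbourhood lookup can be done in code with cosine similarity. The sketch below uses a made-up two-dimensional embedding table so that it runs on its own; the words, the vectors, and the helper name `nearest_words` are illustrative. With the trained model, the rows of `weights` and the matching words would take their place.

```python
import numpy as np

def nearest_words(query, words, vectors, k=2):
    """Return the k words whose vectors are most cosine-similar to the query."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[words.index(query)]    # cosine similarity to every word
    order = np.argsort(-sims)                 # most similar first
    return [words[i] for i in order if words[i] != query][:k]

# Toy embedding table, invented for this example.
toy_words = ["recommended", "wonderful", "great", "boring", "dull"]
toy_vectors = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.7, 0.3],
    [-0.8, 0.1],
    [-0.9, 0.2],
])
neighbours = nearest_words("recommended", toy_words, toy_vectors)
print(neighbours)  # ['wonderful', 'great']
```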

# Save the model

In [45]:
model_path = './model/sentiment_model.h5'
model.save(model_path)

# Load model for prediction

In [46]:
sentiment_prediction = keras.models.load_model(model_path)

# Sample Review Text preparation

In [47]:
import re

# Helper function to clean text

In [48]:
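def clean_reviews(text):
    # Replace every character that is not an ASCII letter (digits, punctuation) with a space
    text = re.sub(r"[^a-zA-Z]", " ", str(text))
    # Digits are already removed by the step above, so this pattern is only a safeguard;
    # the raw string avoids invalid-escape warnings for \d and \s
    return re.sub(r"^\d+\s|\s\d+\s|\s\d+$", " ", text)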
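
A quick check on a short made-up sentence shows what this cleaning does in practice. The sketch repeats only the first substitution under the illustrative name `letters_only`. Note that the typographic apostrophe in a word like "wasn’t" is replaced by a space, so the word splits into two tokens, and numbers disappear entirely.

```python
import re

def letters_only(text):
    """Replace every character that is not an ASCII letter with a space."""
    return re.sub(r"[^a-zA-Z]", " ", str(text))

sample = "I wasn\u2019t sure, 30 minutes in!"   # \u2019 is a typographic apostrophe
cleaned = letters_only(sample)
tokens = cleaned.lower().split()
print(tokens)  # ['i', 'wasn', 't', 'sure', 'minutes', 'in']
```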

In [75]:
review_text = ["Loved the film. I wasn’t sure at the start but it was lovely "+ \
               "to see all the characters from the small screen arrive in the "+ \
               "cinema as old friends, and I laughed and cried. This is a great "+ \
               "film and I really hope they make a sequel. "+ \
               "To the person who gave this film one star you should have reviewed "+ \
               "the film, not the taxi driver and as you didn’t see the first 30 minutes "+ \
               "you aren’t in a position to comment on the entire film anyway.", 
               
               "I have no doubt that Uptown fans will support this film. We have every "+ \
               "episode on DVD, so it is with something of a heavy heart to give this film "+ \
               "such a low rating. As a stand-alone film (or if you have never seen the TV series), "+ \
               "all you get are lavish scenery and costumes. However, the characters appear shallow "+ \
               "and the plot flimsy. As a Uptown fan, yes – the pleasant and familiar characters are "+ \
               "there for you to enjoy in their familiar costumes. However, that is not enough. "+ \
               "Soon into the film, we found that the depth of our characters was not there. "+ \
               "I can only allude to metaphors. It was like watching a Formula One race run at "+ \
               "20 mph – where was the excitement? It was like watching a Weakenhand ruby game of "+ \
               "touch rugby – all spectacle but no impact. It was like being forced to lie in a bubble "+ \
               "bath of lukewarm water for too long. Even a couple of people around gave up and stated "+ \
              "playing with their iPhone, with mutterings as we left at it was far too long. "+ \
               "Maybe it was our local cinema’s projection but even the film quality was nothing "+ \
               "like my Blu-ray at home, let along 4K. So, all in all, this is best seen as a light touch "+ \
               "homage to the TV series. Bearing in mind the trouble taken to assemble the actors in one "+ \
               "place at one time to make this movie, this was a wasted opportunity to create a real Uptown epic. "+ \
               "I hope they do not make a sequel."
               ] 

In [76]:
review_text_clean = [] 
for i in range(len(review_text)):
    review_text_clean.append(clean_reviews(review_text[i]))

In [77]:
len(review_text_clean)

2

In [78]:
unique_words = set(word.lower() for phrase in review_text_clean for word in phrase.split(" "))
print(f"There are {len(unique_words)} unique words in the review text.")

There are 183 unique words in the review text.


In [81]:
vocabulary_size = len(unique_words) + 1
tokenizer = keras.preprocessing.text.Tokenizer(num_words = vocabulary_size, oov_token = 'xxxxxxx')

In [82]:
tokenizer.fit_on_texts(review_text_clean)
review_dict = tokenizer.word_index
print('Length of X_dict:', len(review_dict))

Length of X_dict: 183


In [83]:
review_dict.items()

dict_items([('xxxxxxx', 1), ('the', 2), ('a', 3), ('to', 4), ('film', 5), ('was', 6), ('and', 7), ('it', 8), ('in', 9), ('this', 10), ('i', 11), ('as', 12), ('you', 13), ('at', 14), ('all', 15), ('of', 16), ('characters', 17), ('is', 18), ('one', 19), ('have', 20), ('not', 21), ('like', 22), ('t', 23), ('but', 24), ('make', 25), ('that', 26), ('uptown', 27), ('we', 28), ('with', 29), ('see', 30), ('cinema', 31), ('hope', 32), ('they', 33), ('sequel', 34), ('gave', 35), ('on', 36), ('no', 37), ('so', 38), ('seen', 39), ('tv', 40), ('series', 41), ('are', 42), ('costumes', 43), ('however', 44), ('familiar', 45), ('there', 46), ('for', 47), ('their', 48), ('our', 49), ('watching', 50), ('touch', 51), ('too', 52), ('long', 53), ('even', 54), ('loved', 55), ('wasn', 56), ('sure', 57), ('start', 58), ('lovely', 59), ('from', 60), ('small', 61), ('screen', 62), ('arrive', 63), ('old', 64), ('friends', 65), ('laughed', 66), ('cried', 67), ('great', 68), ('really', 69), ('person', 70), ('who', 

In [88]:
review_seq = tokenizer.texts_to_sequences(review_text_clean)
#review_seq[:1]

In [104]:
print('Before padding: \n')
print(review_seq[0])

Before padding: 

[55, 2, 5, 11, 56, 23, 57, 14, 2, 58, 24, 8, 6, 59, 4, 30, 15, 2, 17, 60, 2, 61, 62, 63, 9, 2, 31, 12, 64, 65, 7, 11, 66, 7, 67, 10, 18, 3, 68, 5, 7, 11, 69, 32, 33, 25, 3, 34, 4, 2, 70, 71, 35, 10, 5, 19, 72, 13, 73, 20, 74, 2, 5, 21, 2, 75, 76, 7, 12, 13, 77, 23, 30, 2, 78, 79, 13, 80, 23, 9, 3, 81, 4, 82, 36, 2, 83, 5, 84]


In [90]:
review_padded_seq = keras.preprocessing.sequence.pad_sequences(review_seq, maxlen = max_pad)

In [91]:
print('After padding: \n')
print(review_padded_seq[0])

After padding: 

[ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 55  2  5 11 56 23 57 14  2
 58 24  8  6 59  4 30 15  2 17 60  2 61 62 63  9  2 31 12 64 65  7 11 66
  7 67 10 18  3 68  5  7 11 69 32 33 25  3 34  4  2 70 71 35 10  5 19 72
 13 73 20 74  2  5 21  2 75 76  7 12 13 77 23 30  2 78 79 13 80 23  9  3
 81  4 82 36  2 83  5 84]


In [92]:
review_padded_seq.shape

(2, 200)
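
One caveat about the encoding above: a `Tokenizer` fitted on the two sample reviews assigns its own ids (here `film` becomes `5`), whereas the model was trained on ids derived from `keras.datasets.imdb.get_word_index()`. Predictions are only meaningful when new text is encoded with the training vocabulary. The sketch below shows one possible way to do that, assuming a dictionary shaped like the `word_to_id` built earlier. It uses a toy dictionary with illustrative ids so that it runs on its own, and the helper name `encode_for_imdb_model` is hypothetical.

```python
def encode_for_imdb_model(text, word_to_id, num_words, maxlen):
    """Encode cleaned text with the training vocabulary, then pre-pad to maxlen."""
    ids = [1]                                   # 1 = <START>
    for word in text.lower().split():
        word_id = word_to_id.get(word, 2)       # 2 = <UNK> for unseen words
        # Words outside the num_words limit are also treated as unknown.
        ids.append(word_id if word_id < num_words else 2)
    ids = ids[-maxlen:]                         # keep the last maxlen tokens
    return [0] * (maxlen - len(ids)) + ids      # 0 = <PAD> at the front

# Toy stand-in for the real `word_to_id`; the ids are illustrative.
toy_word_to_id = {"the": 4, "film": 22, "great": 87, "uptown": 50000}
encoded = encode_for_imdb_model("The film was great Uptown", toy_word_to_id,
                                num_words=10000, maxlen=8)
print(encoded)  # [0, 0, 1, 4, 22, 2, 87, 2]
```

With the real dictionary, the call would pass `word_to_id`, `top_words`, and `max_pad`, and the result could be wrapped in a NumPy array before calling `predict`.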

# Predictions

In [98]:
review_pred0 = sentiment_prediction.predict(review_padded_seq[0:1])
review_pred0

array([[0.20508002]], dtype=float32)

In [99]:
review_pred1 = sentiment_prediction.predict(review_padded_seq[1:2])
review_pred1

array([[0.01894105]], dtype=float32)

In [100]:
review_pred = sentiment_prediction.predict(review_padded_seq)
review_pred

array([[0.2050801 ],
       [0.01894104]], dtype=float32)

In [106]:
review_pred_prob = sentiment_prediction.predict_proba(review_padded_seq)
review_pred_prob

array([[0.2050801 ],
       [0.01894104]], dtype=float32)
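
The sigmoid output is a score between 0 and 1, so a class label still has to be derived from it. A minimal sketch, assuming the common convention of a `0.5` threshold (the threshold is an assumption, not something fixed by the model), applied to the two scores printed above:

```python
import numpy as np

# Scores copied from the prediction output above.
review_pred = np.array([[0.2050801], [0.01894104]], dtype="float32")

threshold = 0.5                                    # assumed decision threshold
labels = (review_pred.ravel() >= threshold).astype(int)
names = ["positive" if label == 1 else "negative" for label in labels]
print(labels, names)  # [0 0] ['negative', 'negative']
```

Both sample reviews fall below the threshold, including the first one, which reads as positive.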