### TensorFlow Tutorial 2: Text Classification with movie reviews

This notebook classifies movie reviews as positive or negative based on the text of the review, which makes it an example of binary classification. It uses the built-in IMDB dataset, which contains 50000 movie reviews taken from the Internet Movie Database (IMDB). The set is split into training and test sets of 25000 reviews each, and each of them contains an equal number of positive and negative reviews. The original text of this tutorial can be found here: <a href="https://www.tensorflow.org/tutorials/keras/basic_text_classification"> TensorFlow - Getting Started </a>

In [1]:
# import tensorflow and tensorflow's very own tf.keras, as well as numpy first:
import tensorflow as tf
from tensorflow import keras
import numpy as np

In [2]:
# next, we reference the IMDB dataset (a module in keras.datasets); the download itself happens when load_data() is called:
imdb = keras.datasets.imdb

In order to load the data, we unpack the result of load_data() into two pairs of arrays, each holding a data set together with its labels (i.e. the "correct" answers):

In [3]:
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

Please note that the argument num_words=10000 above keeps only the 10000 most frequently occurring words in the data sets. Rarer words are discarded and replaced by a placeholder for unknown words.
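To make the effect of num_words more concrete, here is a minimal sketch in plain Python. The helper limit_vocabulary and the toy review are made up for illustration and are not part of Keras; the sketch only assumes that rare words end up as index 2, the same index the decoding cell further down labels as unknown.

```python
# Hypothetical helper: mimics the effect of num_words on an encoded review.
OOV_INDEX = 2  # index used for out-of-vocabulary ("unknown") words


def limit_vocabulary(encoded_review, num_words, oov_index=OOV_INDEX):
    """Replace every index that is >= num_words with the unknown-word index."""
    return [i if i < num_words else oov_index for i in encoded_review]


# Made-up review: 12000 and 25000 stand for two rare words.
toy_review = [1, 14, 22, 12000, 43, 25000, 9999]
limited = limit_vocabulary(toy_review, num_words=10000)
print(limited)
```

The two rare entries collapse into the same index, so the network can no longer tell them apart, while all frequent words keep their own integer.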

Let us look at the data types of some of the sets we created above!

In [4]:
type(train_data)

numpy.ndarray

In [5]:
type(train_labels)

numpy.ndarray

In [6]:
len(train_data[0])

218

In [7]:
train_labels[0]

1

In [8]:
print("Training entries: {}, labels: {}".format(len(train_data), len(train_labels)))
# fancy print statement using .format() :-)

Training entries: 25000, labels: 25000


Now, what do these arrays actually contain? Both label sets are fairly straightforward: they contain integers that classify the respective entries in the data sets as a positive (1) or negative (0) review. (The tutorial does not specify at that point which integer corresponds to a positive or a negative review, but the conversion a couple of cells below does!)

It is important to note that movie reviews differ in length, i.e. in the number of words used. As shown above, the first entry in the training data set is 218 words long. Every word has been converted to an integer by means of a dictionary that ships with the keras.datasets.imdb dataset. With num_words=10000, only the most common words keep an integer of their own.

By using the below function taken from the tutorial code, the given reviews can be converted back to understandable words:

In [9]:
# A dictionary mapping words to an integer index
word_index = imdb.get_word_index()

# The first indices are reserved
word_index = {k:(v+3) for k,v in word_index.items()} 
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])

Please note: this tells us that the integers 0-3 are reserved for special tokens that do not correspond to actual words. For example, 0 is used for PAD, which can later be used to standardize the length of the training data while keeping the original review text intact.
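The index shift can be tried out on a tiny scale. The following sketch repeats the steps of the cell above with a made-up three-word dictionary standing in for imdb.get_word_index(); the words and their indices are assumptions chosen for illustration only.

```python
# Toy stand-in for imdb.get_word_index(); the real dictionary is much larger
# and these word/index pairs are made up.
toy_word_index = {"the": 1, "and": 2, "film": 3}

# Shift every index by 3 to make room for the reserved tokens 0-3.
shifted = {word: index + 3 for word, index in toy_word_index.items()}
shifted["<PAD>"] = 0
shifted["<START>"] = 1
shifted["<UNK>"] = 2
shifted["<UNUSED>"] = 3

# Invert the mapping so that integers can be looked up.
reverse = {index: word for word, index in shifted.items()}


def decode(encoded):
    """Turn a list of integers back into a space-separated string."""
    return " ".join(reverse.get(i, "?") for i in encoded)


decoded = decode([1, 4, 6, 2, 5, 0, 0])
print(decoded)
```

After the shift, "the" sits at index 4 instead of 1, so the integers 1 and 2 in an encoded review are free to mean "start of review" and "unknown word".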

We can now use our freshly defined function to "translate" any review from the data set back into a string of words. However, some words show up only as the UNK placeholder, since the data set discarded all but the 10000 most frequently occurring words:

In [10]:
decode_review(train_data[0])

"<START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for <UNK> and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also <UNK> to the two little boy's that played the <UNK> of norman and paul they were just brilliant children are often left out of the <UNK> list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for wh

In [11]:
decode_review(train_data[1])

"<START> big hair big boobs bad music and a giant safety pin these are the words to best describe this terrible movie i love cheesy horror movies and i've seen hundreds but this had got to be on of the worst ever made the plot is paper thin and ridiculous the acting is an abomination the script is completely laughable the best is the end showdown with the cop and how he worked out who the killer is it's just so damn terribly written the clothes are sickening and funny in equal <UNK> the hair is big lots of boobs <UNK> men wear those cut <UNK> shirts that show off their <UNK> sickening that men actually wore them and the music is just <UNK> trash that plays over and over again in almost every scene there is trashy music boobs and <UNK> taking away bodies and the gym still doesn't close for <UNK> all joking aside this is a truly bad film whose only charm is to look back on the disaster that was the 80's and have a good old laugh at how bad everything was back then"

At this point, the tutorial becomes rather technical and uses terminology that has not yet been explained in the documentation. The first confusing term might be **tensor**. Basically, a tensor can be thought of as a multi-dimensional array, for example an array of shape 5x5x5. It is, of course, no coincidence that **Tensor**Flow has been named the way it has.
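A quick way to get a feeling for this is to look at NumPy arrays of different dimensionality. This is only an illustration using NumPy, not TensorFlow's own tensor type, and the variable names are chosen freely:

```python
import numpy as np

scalar = np.array(7)          # 0 dimensions: a single number
vector = np.array([1, 2, 3])  # 1 dimension
matrix = np.zeros((2, 3))     # 2 dimensions: rows and columns
cube = np.zeros((5, 5, 5))    # 3 dimensions: the 5x5x5 example from the text

for name, array in [("scalar", scalar), ("vector", vector),
                    ("matrix", matrix), ("cube", cube)]:
    # ndim is the number of dimensions, shape the size along each of them
    print(name, array.ndim, array.shape)
```

The padded review data prepared later in this notebook is a two-dimensional array of this kind: one row per review, one column per word position.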

### Preparing the data sets

The next step in training our network-to-be is to prepare our data so that it can be fed to the network, which will be working with tensors. One of the major difficulties here is that the reviews in the data sets vary in length. However, as stated above, some integers in the data do not correspond to any word in our dictionary. If you look at the actual array for the second review in the training set, you can see a single "1" marking the beginning of the review, a number of "2"s marking unknown words (i.e. words not in the dictionary), no "3" and no "0" (PAD).

In [12]:
print(train_data[1])

[1, 194, 1153, 194, 8255, 78, 228, 5, 6, 1463, 4369, 5012, 134, 26, 4, 715, 8, 118, 1634, 14, 394, 20, 13, 119, 954, 189, 102, 5, 207, 110, 3103, 21, 14, 69, 188, 8, 30, 23, 7, 4, 249, 126, 93, 4, 114, 9, 2300, 1523, 5, 647, 4, 116, 9, 35, 8163, 4, 229, 9, 340, 1322, 4, 118, 9, 4, 130, 4901, 19, 4, 1002, 5, 89, 29, 952, 46, 37, 4, 455, 9, 45, 43, 38, 1543, 1905, 398, 4, 1649, 26, 6853, 5, 163, 11, 3215, 2, 4, 1153, 9, 194, 775, 7, 8255, 2, 349, 2637, 148, 605, 2, 8003, 15, 123, 125, 68, 2, 6853, 15, 349, 165, 4362, 98, 5, 4, 228, 9, 43, 2, 1157, 15, 299, 120, 5, 120, 174, 11, 220, 175, 136, 50, 9, 4373, 228, 8255, 5, 2, 656, 245, 2350, 5, 4, 9837, 131, 152, 491, 18, 2, 32, 7464, 1212, 14, 9, 6, 371, 78, 22, 625, 64, 1382, 9, 8, 168, 145, 23, 4, 1690, 15, 16, 4, 1355, 5, 28, 6, 52, 154, 462, 33, 89, 78, 285, 16, 145, 95]


The following cell adds 0s at the end of each review in order to pad every item in the data sets to a standardized length of 256 integers. Reviews that are longer than 256 integers are cut down to that length. The function keras.preprocessing.sequence.pad_sequences is applied to both the training and the test data sets:

In [13]:
train_data = keras.preprocessing.sequence.pad_sequences(train_data,
                                                        value=word_index["<PAD>"],
                                                        padding='post',  # pad at the end!
                                                        maxlen=256)      # max length of input array

test_data = keras.preprocessing.sequence.pad_sequences(test_data,
                                                       value=word_index["<PAD>"],
                                                       padding='post',
                                                       maxlen=256)
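To see what this padding step does without running Keras, here is a minimal NumPy sketch. The helper pad_post is hypothetical and not part of Keras; it assumes padding='post' as used above and mimics the Keras default truncating='pre', under which reviews that are too long lose entries from the front.

```python
import numpy as np


def pad_post(sequences, maxlen, value=0):
    """Bring every sequence to exactly maxlen entries (maxlen >= 1).

    Short sequences are filled up with `value` at the end; long sequences
    keep only their last maxlen entries.
    """
    padded = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        kept = seq[-maxlen:]            # drop entries from the front if too long
        padded[row, :len(kept)] = kept  # everything after that stays `value`
    return padded


# Made-up encoded reviews of length 3, 2 and 6.
toy_reviews = [[1, 14, 22], [1, 5], [1, 7, 8, 9, 10, 11]]
result = pad_post(toy_reviews, maxlen=4)
print(result)
```

The result is a single rectangular array with one row per review, which is the form the network needs as input.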

In [14]:
# now, the second review looks a little different at the end:
print(train_data[1])

[   1  194 1153  194 8255   78  228    5    6 1463 4369 5012  134   26
    4  715    8  118 1634   14  394   20   13  119  954  189  102    5
  207  110 3103   21   14   69  188    8   30   23    7    4  249  126
   93    4  114    9 2300 1523    5  647    4  116    9   35 8163    4
  229    9  340 1322    4  118    9    4  130 4901   19    4 1002    5
   89   29  952   46   37    4  455    9   45   43   38 1543 1905  398
    4 1649   26 6853    5  163   11 3215    2    4 1153    9  194  775
    7 8255    2  349 2637  148  605    2 8003   15  123  125   68    2
 6853   15  349  165 4362   98    5    4  228    9   43    2 1157   15
  299  120    5  120  174   11  220  175  136   50    9 4373  228 8255
    5    2  656  245 2350    5    4 9837  131  152  491   18    2   32
 7464 1212   14    9    6  371   78   22  625   64 1382    9    8  168
  145   23    4 1690   15   16    4 1355    5   28    6   52  154  462
   33   89   78  285   16  145   95    0    0    0    0    0    0    0
    0 

In [15]:
# interestingly enough, if we now decode the review:
decode_review(train_data[1])

"<START> big hair big boobs bad music and a giant safety pin these are the words to best describe this terrible movie i love cheesy horror movies and i've seen hundreds but this had got to be on of the worst ever made the plot is paper thin and ridiculous the acting is an abomination the script is completely laughable the best is the end showdown with the cop and how he worked out who the killer is it's just so damn terribly written the clothes are sickening and funny in equal <UNK> the hair is big lots of boobs <UNK> men wear those cut <UNK> shirts that show off their <UNK> sickening that men actually wore them and the music is just <UNK> trash that plays over and over again in almost every scene there is trashy music boobs and <UNK> taking away bodies and the gym still doesn't close for <UNK> all joking aside this is a truly bad film whose only charm is to look back on the disaster that was the 80's and have a good old laugh at how bad everything was back then <PAD> <PAD> <PAD> <PAD>

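The decoded review now ends in a run of PAD tokens. If only the readable text is wanted, the padding can be removed before decoding. The helper strip_padding below is a hypothetical addition, not part of the tutorial; it assumes that padding was added at the end and that 0 is the padding index.

```python
PAD_INDEX = 0  # index reserved for padding


def strip_padding(encoded, pad_index=PAD_INDEX):
    """Return the review as a list without its trailing padding entries."""
    end = len(encoded)
    while end > 0 and encoded[end - 1] == pad_index:
        end -= 1
    return list(encoded[:end])


# Made-up padded review.
padded_review = [1, 194, 1153, 95, 0, 0, 0]
stripped = strip_padding(padded_review)
print(stripped)
```

Under these assumptions, calling decode_review(strip_padding(train_data[1])) would give the review text without the trailing PAD tokens.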
### Building our network (to be continued)