In [2]:
import tensorflow as tf
from tensorflow import keras 
import numpy as np 

In [32]:
data = keras.datasets.imdb # Importing the dataset from Keras

<code>num_words = 10000</code> keeps only the 10,000 most frequent words in the dataset; any rarer word is replaced by the unknown-word index instead of keeping its own entry. The length of each review is not affected.

In [31]:
# Each review in train_data is a list of integers, where every integer
# represents a word; it has to be mapped back to actual text to be readable
(train_data, train_labels), (test_data, test_labels) = data.load_data(num_words = 10000)
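To make the effect of <code>num_words</code> concrete, here is a minimal pure-Python sketch of the filtering idea. It assumes the Keras default in which index 2 marks an out-of-vocabulary word. The helper `cap_vocabulary` and the sample review are hypothetical and only illustrate the behaviour; the real work happens inside `load_data`.

```python
OOV_INDEX = 2  # assumption: Keras' default out-of-vocabulary index (oov_char=2)

def cap_vocabulary(review, num_words, oov_index=OOV_INDEX):
    """Replace every word index >= num_words with the out-of-vocabulary index."""
    return [i if i < num_words else oov_index for i in review]

# Hypothetical encoded review: larger numbers stand for rarer words
sample_review = [1, 14, 22, 12500, 43, 99999]
capped = cap_vocabulary(sample_review, num_words=10000)
print(capped)  # [1, 14, 22, 2, 43, 2]
```

The review keeps its original length; only the rare words are swapped for the unknown marker.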

### Preprocess the integer numbers into actual words annotations

The first step is to create a dictionary; fortunately, TensorFlow already provides one.

<code>data.get_word_index()</code> returns a dictionary that maps each word (a string key ***k***) to its integer index ***v***. Reversing it later makes it possible to map each integer value back to its word.

In [34]:
word_index = data.get_word_index()
word_index = {k:(v + 3) for k, v in word_index.items()} # Shift every index by 3 because the
                                                        # values 0-3 are reserved for special tokens

word_index["<PAD>"] = 0  # Padding
word_index["<START>"] = 1 # Start
word_index["<UNK>"] = 2 # Unknown
word_index["<UNUSED>"] =  3 # Unused

# Swap keys and values: word_index maps word -> integer, so the reversed
# dictionary maps integer -> word, which is what decoding a review needs
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
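The shift-and-reverse step can be tried on a tiny vocabulary without downloading anything. In this sketch `toy_word_index` is a made-up stand-in for the dictionary returned by `data.get_word_index()`; the remaining steps mirror the cell above.

```python
# Hypothetical miniature vocabulary (the real one is far larger)
toy_word_index = {"the": 1, "and": 2, "movie": 3}

# Shift every index by 3 to free the values 0-3 for the special tokens
shifted = {word: index + 3 for word, index in toy_word_index.items()}
shifted["<PAD>"] = 0
shifted["<START>"] = 1
shifted["<UNK>"] = 2
shifted["<UNUSED>"] = 3

# Invert the mapping: integer -> word
toy_reverse = {index: word for word, index in shifted.items()}

print(toy_reverse[4])  # the
print(toy_reverse[0])  # <PAD>
```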

### Preprocessing data annotations

The reviews have different lengths, but the size and shape of the inputs must be known and fixed before modeling the NN.

For that reason, the padding tag <code>"PAD"</code> is used to give all the data one definitive length, that is, ***each review is allowed a fixed number of words***. Shorter reviews are filled with the padding value and longer ones are cut. In this case the number is 250 words.

In [38]:
# Pad or trim every review to exactly 250 integers
train_data = keras.preprocessing.sequence.pad_sequences(train_data, value = word_index["<PAD>"], padding = "post", maxlen = 250)
# Bug fix: the test set must be built from test_data, not train_data
test_data = keras.preprocessing.sequence.pad_sequences(test_data, value = word_index["<PAD>"], padding = "post", maxlen = 250)
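To see what the padding call does to sequences of different lengths, here is a rough NumPy re-implementation for illustration only. `pad_post` is a hypothetical helper, not part of Keras. It assumes post-padding with zeros and, like the `pad_sequences` default (`truncating="pre"`), keeps the last `maxlen` items of a sequence that is too long.

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    """Illustrative stand-in for pad_sequences(..., padding="post")."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        kept = seq[-maxlen:]          # too long: keep the last maxlen items
        out[row, :len(kept)] = kept   # too short: remaining slots stay as `value`
    return out

padded = pad_post([[5, 6], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(padded.tolist())  # [[5, 6, 0, 0], [3, 4, 5, 6]]
```

Every row ends up with the same length, which is what lets the reviews be stacked into a single array.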

In [36]:
def decode_review(text):
    '''
    Function that decodes the text of a movie review and returns
    readable words.
    
    Inputs: A sequence of integer values, each corresponding to a word
    
    Outputs: A string with the words (keys) that correspond to the integer values
    '''
    # If an integer is not in the dictionary, a question mark (?)
    # is shown in its place
    return " ".join([reverse_word_index.get(i, "?") for i in text])
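As a quick check of the decoding logic, the sketch below runs the same `join`/`get` pattern on a small hand-written mapping. `toy_reverse_index` and `decode_with` are hypothetical names chosen so the example runs on its own; in the notebook, `decode_review(train_data[0])` would be the equivalent call.

```python
# Hypothetical reverse index: integer -> word
toy_reverse_index = {0: "<PAD>", 1: "<START>", 2: "<UNK>", 4: "great", 5: "movie"}

def decode_with(reverse_index, encoded):
    """Same logic as decode_review, with the mapping passed in explicitly."""
    return " ".join(reverse_index.get(i, "?") for i in encoded)

# 99 is not in the mapping, so it is rendered as "?"
decoded = decode_with(toy_reverse_index, [1, 4, 5, 99, 0, 0])
print(decoded)  # <START> great movie ? <PAD> <PAD>
```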


### Model architecture annotations

The purpose of the model is to predict whether a review is positive (***GOOD***) or negative (***BAD***).

There are two ways of building the model for a NN: the first defines it directly, as in [Crash Course TensorFlow](http://localhost:8888/notebooks/Crash%20Course%20Tensorflow%20.ipynb#reference1), and the other uses the <code>.add()</code> method, which takes as its parameter the layer (with the desired configuration) to append to the NN.

In [42]:
model = keras.Sequential() # Creating the model 
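model.add(keras.layers.Embedding(10000, 16)) # Input layer: maps each of the 10000 word indices to a 16-value vector
model.add(keras.layers.GlobalAveragePooling1D()) # Averages the word vectors of a review into one 16-value vector
model.add(keras.layers.Dense(16, activation = "relu")) # Hidden layer
model.add(keras.layers.Dense(1, activation = "sigmoid")) # Output layer: a single value between 0 and 1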
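The pooling layer is the least obvious of the four, so here is a NumPy sketch of the operation it performs, assuming no masking. The embedding output of shape (reviews, words, 16) is averaged over the word axis, which leaves one 16-value vector per review. The array below is random stand-in data, not real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the Embedding output: 2 reviews, 250 words each, 16 values per word
embedded = rng.normal(size=(2, 250, 16))

# Average over the word axis (axis 1), as GlobalAveragePooling1D does without a mask
pooled = embedded.mean(axis=1)
print(pooled.shape)  # (2, 16)
```

Because the average collapses the word axis, the layers that follow see the same input size regardless of how many words a review holds.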

In [43]:
model.summary() # Useful to visualize how the NN is composed

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, None, 16)          160000    
_________________________________________________________________
global_average_pooling1d_1 ( (None, 16)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 16)                272       
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 17        
=================================================================
Total params: 160,289
Trainable params: 160,289
Non-trainable params: 0
_________________________________________________________________
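The parameter counts in the summary can be reproduced by hand. An embedding stores one vector per vocabulary entry, and a dense layer has one weight per input-output pair plus one bias per output. The variable names below are only for illustration.

```python
vocab_size, embedding_dim = 10000, 16

embedding_params = vocab_size * embedding_dim   # one 16-value vector per word
pooling_params = 0                              # averaging has nothing to learn
hidden_params = embedding_dim * 16 + 16         # 16x16 weights + 16 biases
output_params = 16 * 1 + 1                      # 16 weights + 1 bias

total = embedding_params + pooling_params + hidden_params + output_params
print(embedding_params, hidden_params, output_params, total)
# 160000 272 17 160289
```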
