## Text Classification With Movie Reviews

#### Given a dataset of movie reviews the objective is to train and test the dataset so the model can predict wheter the review is good (0) or bad (1)

In [1]:
import tensorflow as tf
from tensorflow import keras 
import numpy as np 

In [2]:
data = keras.datasets.imdb # Importing the dataset from Keras

<code>num_words = 10000</code> variable takes the reviews only 10000 words and refuses the other that have different length

In [3]:
# In the variable train_data is located an integer number that 
# represents a word, however it needs to be represented in actual text 
(train_data, train_labels), (test_data, test_labels) = data.load_data(num_words = 88000)

### Preprocess the integer numbers into actual words annotations

The first is to create a dictionary, fortunately it is already done by Tensorflow.

<code>data.get_word_index()</code> function returns tuples of type string so that it can be put on a dictionary to eventually map them and put to each integer value ***v*** a key *(word)* ***k***  

In [4]:
word_index = data.get_word_index()
word_index = {k:(v + 3) for k, v in word_index.items()} # The reason for starting +3 values
                                                        # after is because exist some special characters

word_index["<PAD>"] = 0  # Padding
word_index["<START>"] = 1 # Start
word_index["<UNK>"] = 2 # Unknown
word_index["<UNUSED>"] =  3 # Unused

# Swap all the values in the keys, now the key (word) is first and then the value  
# this is basically so the the values of the word (the integers) point to a word (key)
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

### Preprocessing data annotations

Given that we don't know the size and shape of the inputs it is neccesary to know them to start modeling the NN.

So that, the padding tag <code> "PAD" </code> is used to define a definitive length of all the data, this is, ***it is going to be a allowed a ceratain ammount of words in each review***. In this case the number will be 250 word

In [5]:
# Trimming the data
train_data = keras.preprocessing.sequence.pad_sequences(train_data, value = word_index["<PAD>"], padding = "post", maxlen = 250)
test_data = keras.preprocessing.sequence.pad_sequences(train_data, value = word_index["<PAD>"], padding = "post", maxlen = 250)

In [6]:
def decode_review(text):
    '''
    Function that decodes the text of the movie reviews and returns a
    readable words.
    
    Inputs: The integer values corresponding to a word
    
    Outputs: The key (word) that corresponds to the integer values    
    '''
    # If there is a word that is not in the dictionary then it puts
    # a question mark (?)
    return " ".join([reverse_word_index.get(i, "?") for i in text])


### Model architecture annotations

The purpose of the model is to find out whether the review is ***GOOD*** or ***BAD*** 

There are two ways of doing the model for a NN, the first is directly as in [Crash Course TensorFlow](http://localhost:8888/notebooks/Crash%20Course%20Tensorflow%20.ipynb#reference1) and the other is by using the method <code>.add()</code> in which as parameter will get the desired configuration that we want to give to the NN.

In [7]:
model = keras.Sequential() # Creating the model 
model.add(keras.layers.Embedding(88000, 16)) # Hidden layer
model.add(keras.layers.GlobalAveragePooling1D()) # Hidden layer
model.add(keras.layers.Dense(16, activation = "relu")) # Hidden layer
model.add(keras.layers.Dense(1, activation = "sigmoid")) # Output layer

### Layers explanation

It first starts by taking the input which are the value words and pass it to the *Embedding Layer*.

**Embedding Layer:** Takes the input data and turns into many vectors.
It also groups the *similar* words 

**Global Average Pooling:** Takes the vectors and averages them out, kind of shrinks them.

**Dense_1:** 16 neurons *(arbitrary networks)* in this case receive the averaged results 

**Dense_2:** Finally pass through a single neuron that decides whether it is ***GOOD*** (0) or ***BAD*** (1)

In [8]:
model.summary() # Useful to visualize how the NN is composed

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 16)          1408000   
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0         
_________________________________________________________________
dense (Dense)                (None, 16)                272       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
Total params: 1,408,289
Trainable params: 1,408,289
Non-trainable params: 0
_________________________________________________________________


In [9]:
model.compile(optimizer = "adam", loss = "binary_crossentropy", metrics = ["accuracy"])

x_val = train_data[:10000] # Taking as validation data
x_train = train_data[10000:]

y_val = train_labels[:10000] # Taking as validation data
y_train = train_labels[10000:]

fitModel = model.fit(x_train, y_train, epochs = 40, batch_size = 512, validation_data = (x_val, y_val), verbose = 1)
    
results = model.evaluate(test_data, test_labels)

Train on 15000 samples, validate on 10000 samples
Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40


In [10]:
number_of_review = 25
test_review = test_data[number_of_review]
predict = model.predict([test_review])
print("Review: \n", decode_review(test_review))
print("Prediction:" + str(predict[number_of_review]))
if test_labels[number_of_review] == 0:
    print("Good Review: " + str(test_labels[number_of_review]))
else: 
    print("Bad Review: " + str(test_labels[number_of_review]))
print(results)

Review: 
 <START> this is a very light headed comedy about a wonderful family that has a son called pecker because he use to peck at his food pecker loves to take all kinds of pictures of the people in a small suburb of baltimore md and manages to get the attention of a group of photo art lovers from new york city pecker has a cute sister who goes simply nuts over sugar and is actually an addict taking spoonfuls of sugar from a bag there are scenes of men showing off the lumps in their jockey's with grinding movements and gals doing pretty much the same it is rather hard to keep your mind out of the gutter with this film but who cares it is only a film to give you a few laughs at a simple picture made in 1998 <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>

## Saving & Loading Models 
### Saving a model

The function to save a model is the model's name and the method is:

<code>
    model.save("name.h5")
</code>

In order to save models it is necessary to save it as a <code>* .h5</code> file, which will save in binary

In [11]:
model.save("model.h5")

### Loading a model

In order to save a model the function: 

<code>
    model = keras.models.load_model("model.h5")
</code>

is useful 

In [12]:
 model = keras.models.load_model("model.h5")

In [13]:
def review_encode(string):
    encoded = [1]
    
    for word in string: 
        if word in word_index:
            encoded.append(word_index[word.lower()])
        else:
            encoded.append(2)
    return encoded

### Testing the model with a new review

In [14]:
with open("test.txt") as f:
          for line in f.readlines():
              nline = line.replace(",","").replace(".","").replace("(","").replace(")","").replace(":","").replace("/","").replace("!","").strip().split(" ")
              encode = review_encode(nline)
              encode = keras.preprocessing.sequence.pad_sequences([encode], value = word_index["<PAD>"], padding = "post", maxlen = 250)
              predict = model.predict(encode)
              print(line)
              print(encode)
              print(predict[0])

Seriously, I know that sounds stupid as this is an animated movie, but why can't we have a favorite movie that's animated? If you ask me this is Disney's best animated film of all time. Why do you ask? The animation is just beautiful, the story is powerful and moving, the characters are terrific, the villain is one of Disney's most monstrous and the songs are out of this world incredible! Not only is it my favorite animated movie, this is one of my favorite soundtracks, with the strong power of Elton John The Lion King is absolutely beautiful and a pleasure to watch still to this day. I have to say also that this story is just so beautiful, it's still one of the movies that will always bring a tear to my eye. When I was 17 years old, I begged my dad to take me to the re-release of The Lion King, him and I being the only adults besides the parents who took their children to see the movie, Simba's father dies and I started balling! My dad looks around and said "Krissy, stop it!" and I ju

In [15]:
model.compile

<bound method Model.compile of <tensorflow.python.keras.engine.sequential.Sequential object at 0x7f4b5637a490>>