# Text Classification
One of the major applications of neural networks is text classification. We will classify movie reviews as either positive or negative.

# Loading Data
The dataset we use is the IMDB movie review dataset that ships with Keras. We load it already split into training and test sets, keeping only the 10,000 most frequent words.

In [17]:
import tensorflow as tf
from tensorflow import keras
import numpy

In [18]:
imdb = keras.datasets.imdb
(train_data, train_labels),(test_data, test_labels) = imdb.load_data(num_words=10000)

# Integer Encoded Data
Looking at our data, we'll notice that the reviews are integer encoded. This means each word in a review is represented by a positive integer, and each integer stands for a specific word. This is necessary because we cannot pass strings to our neural network. However, if we (as humans) want to read the reviews and see what they look like, we need a way to turn the integer-encoded reviews back into strings. The following code does this for us:

In [19]:
# A dictionary mapping words to an integer index
_word_index = imdb.get_word_index()

word_index = {k:(v+3) for k,v in _word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2 #unknown
word_index["<UNUSED>"] = 3

# Invert the mapping so each integer points back to its word
reverse_word_index = {value: key for key, value in word_index.items()}

def decode_review(text):
    # Return the decoded (human-readable) review
    return " ".join([reverse_word_index.get(i, "?") for i in text])

We start by getting a dictionary that maps every word to an integer. We shift each index up by 3 to make room for a few special keys such as `<PAD>`, `<START>` and `<UNK>`, then build the reverse dictionary so that integers are the keys and words are the values. The function we defined takes an integer-encoded review as a list and returns the human-readable version.
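
To see the index shift and the reverse lookup in isolation, here is a minimal sketch that needs no download. The tiny vocabulary and the names `toy_raw_index`, `toy_word_index`, `toy_reverse_index` and `toy_decode` are made up for illustration; the real dictionary comes from `imdb.get_word_index()`.

```python
# Toy vocabulary standing in for imdb.get_word_index(); words and ranks are made up.
toy_raw_index = {"the": 1, "movie": 2, "was": 3, "great": 4}

# Shift every index up by 3 to free slots 0-3 for the special tokens.
toy_word_index = {word: rank + 3 for word, rank in toy_raw_index.items()}
toy_word_index["<PAD>"] = 0
toy_word_index["<START>"] = 1
toy_word_index["<UNK>"] = 2
toy_word_index["<UNUSED>"] = 3

# Invert the mapping so integers can be used as keys.
toy_reverse_index = {idx: word for word, idx in toy_word_index.items()}

def toy_decode(encoded):
    """Turn a list of integers back into a space-separated string."""
    return " ".join(toy_reverse_index.get(i, "?") for i in encoded)

encoded_review = [1, 4, 5, 6, 7, 0, 0]
decoded_review = toy_decode(encoded_review)
print(decoded_review)
```

Any integer that is missing from the reverse dictionary is decoded as "?", which is the same fallback `decode_review` uses.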

# Preprocessing Data
If we look at some of the reviews we loaded, we'll notice that they have different lengths. This is an issue, because we cannot pass inputs of different lengths into our neural network. Therefore we must make every review the same length. To do this we will follow the procedure below:

- If the review is longer than 250 words, trim off the extra words.
- If the review is shorter than 250 words, add as many `<PAD>` tokens as needed to bring it to 250.

Luckily for us, Keras has a function that does this:

In [20]:
train_data = keras.preprocessing.sequence.pad_sequences(train_data, value=word_index["<PAD>"], padding="post",maxlen=250)
test_data = keras.preprocessing.sequence.pad_sequences(test_data, value=word_index["<PAD>"], padding="post", maxlen=250)
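
As a rough picture of what the call above does to a single review, here is a pure-Python sketch. The helper name `pad_or_trim` is hypothetical and not part of Keras. It assumes the settings used above (`padding="post"`) together with the default `truncating="pre"` of `pad_sequences`, which means an over-long review loses words from its start rather than its end.

```python
def pad_or_trim(sequence, maxlen, pad_value=0):
    """Mimic pad_sequences(padding="post") with its default truncating="pre"."""
    if len(sequence) > maxlen:
        # Default truncation drops items from the front, keeping the last maxlen.
        return sequence[-maxlen:]
    # Post-padding appends the pad value at the end.
    return sequence + [pad_value] * (maxlen - len(sequence))

short_review = [1, 14, 22]
long_review = [1, 14, 22, 16, 43, 530, 973]

padded = pad_or_trim(short_review, maxlen=5)
trimmed = pad_or_trim(long_review, maxlen=5)
print(padded)
print(trimmed)
```

If we would rather keep the beginning of long reviews, `pad_sequences` also accepts `truncating="post"`.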

# Defining the Model
Finally we will define our model! This model is a little bit different and will be discussed in depth later on.

In [21]:
model = keras.Sequential()
model.add(keras.layers.Embedding(88000,16))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(16,activation="relu"))
model.add(keras.layers.Dense(1,activation="sigmoid"))

model.summary() #prints a summary of the model

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, None, 16)          1408000   
_________________________________________________________________
global_average_pooling1d_2 ( (None, 16)                0         
_________________________________________________________________
dense_4 (Dense)              (None, 16)                272       
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 17        
Total params: 1,408,289
Trainable params: 1,408,289
Non-trainable params: 0
_________________________________________________________________
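
The parameter counts in the summary can be checked by hand. The short calculation below is only a sanity check; the variable names are ours, and the numbers come from the arguments passed to the layers above.

```python
vocab_size = 88000      # first argument to Embedding
embedding_dim = 16      # second argument to Embedding
hidden_units = 16       # units in the first Dense layer

# One 16-dimensional vector per vocabulary entry.
embedding_params = vocab_size * embedding_dim
# Averaging has nothing to learn.
pooling_params = 0
# Weights (inputs x units) plus one bias per unit.
hidden_params = embedding_dim * hidden_units + hidden_units
output_params = hidden_units * 1 + 1

total_params = embedding_params + pooling_params + hidden_params + output_params
print(embedding_params, hidden_params, output_params, total_params)
```

Almost all of the parameters sit in the embedding layer, so the vocabulary size is what drives the size of this model.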


# Understanding the Model Architecture
Here is the sequence of layers we defined for our text classification model:

- Word Embedding Layer
- GlobalAveragePooling1D
- Dense
- Dense

# What is Word Embedding
To understand what a word embedding layer is and why it is so important, we will compare two very simple sentences, first in human-readable form and then in integer-encoded form:

Human Readable:

Have a great day

Have a good day

Integer encoded:

[0,1,2,3]

[0,1,4,3]

Mappings: {0:"Have",1:"a",2:"great",3:"day",4:"good"}

Looking at the sentences above, we as humans know that the two sentences are very similar and mean much the same thing. However, when we look at the integer-encoded versions, all we can tell is that the words at index 2 (position 3) are different. We have no idea how different they are.

This is where a word embedding layer comes in. We want a way to determine not only the contents of a sentence but also its context. A word embedding layer attempts to capture the meaning of each word in the sentence by mapping each word to a position in vector space. If you don't care about the math or don't understand it, think of it as grouping similar words together.

An example of something we'd hope an embedding layer would do for us:

Maybe "good", "great", "fantastic" and "awesome" are placed close to each other, and words like "bad", "horrible" and "sucks" are placed close together. We'd also hope that these two groups end up far apart from each other, reflecting their very different meanings.
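
To make "close together" and "far apart" concrete, here is a small sketch that measures distances between word vectors. The 2-D vectors are hand-picked for illustration and are not taken from a trained model; the embedding layer in this tutorial learns 16-dimensional vectors during training. The names `toy_embeddings` and `distance` are ours.

```python
import math

# Hand-picked 2-D vectors, purely illustrative.
toy_embeddings = {
    "good": [0.9, 0.8],
    "great": [1.0, 0.9],
    "bad": [-0.9, -0.7],
    "horrible": [-1.0, -0.9],
}

def distance(word_a, word_b):
    """Euclidean distance between the vectors of two words."""
    vec_a, vec_b = toy_embeddings[word_a], toy_embeddings[word_b]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_a, vec_b)))

print(distance("good", "great"))   # small: similar meaning
print(distance("good", "bad"))     # large: opposite meaning
```

With integer encoding alone, "great" (2) and "good" (4) are just different numbers; with vectors like these, the model can tell that they are near each other.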

# GlobalAveragePooling1D Layer
This layer is nothing special: it reduces the dimensionality of our data so the later layers have less work to do. The embedding layer outputs one vector per word, so each review arrives as a grid of 250 words by 16 dimensions. GlobalAveragePooling1D averages over the word axis, leaving a single 16-dimensional vector per review, which matches the (None, 16) output shape in the model summary.
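
The averaging itself is simple enough to write out by hand. The sketch below is not the Keras implementation, just the same arithmetic on a made-up review of three words with 2-D vectors; the function name `global_average_pool` is ours.

```python
def global_average_pool(word_vectors):
    """Average a list of equal-length word vectors into a single vector."""
    count = len(word_vectors)
    # zip(*...) groups the values dimension by dimension.
    return [sum(column) / count for column in zip(*word_vectors)]

# Three words, each embedded as a 2-D vector (made-up values).
review_vectors = [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]
pooled = global_average_pool(review_vectors)
print(pooled)
```

However many words go in, one vector comes out, so the dense layers that follow always receive an input of the same size.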

# Dense Layers
The last two layers in our network are dense (fully connected) layers. The output layer is a single neuron that uses the sigmoid function to produce a value between 0 and 1, which represents how likely the review is to be positive (values near 1) rather than negative (values near 0). The layer before it contains 16 neurons with a relu activation function and is meant to find patterns between the different words present in the review.
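
Here is a minimal sketch of what the single output neuron computes. The inputs, weights and bias are made-up numbers, and the helper names are ours; in the real model these values are learned during training. We assume, as in the IMDB labels, that 1 means positive and 0 means negative.

```python
import math

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def dense_single_output(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through a sigmoid."""
    weighted_sum = sum(i * w for i, w in zip(inputs, weights)) + bias
    return sigmoid(weighted_sum)

hidden_activations = [0.5, 0.0, 1.2]   # made-up outputs of the previous layer
weights = [0.8, -0.3, 0.4]             # made-up learned weights
bias = -0.1

probability = dense_single_output(hidden_activations, weights, bias)
label = "positive" if probability >= 0.5 else "negative"
print(probability, label)
```

A threshold of 0.5 is the usual way to turn the sigmoid output into a class.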

# Validation Data
For this specific model we will introduce a new idea: validation data. Previously we measured the model's accuracy after each epoch on the training data itself, data the model had already seen. This can be problematic, because it is quite possible for a model to simply memorize the input data and its related output, and the accuracy affects how the model is modified as it trains. To avoid this issue we will separate our training data into two sections, training and validation. The model will use the validation data to check accuracy after learning from the training data. This will hopefully keep us from having false confidence in our model.
We can split our training data into training and validation sets like so:

In [22]:
x_val = train_data[:10000]
x_train = train_data[10000:]

y_val = train_labels[:10000]
y_train = train_labels[10000:]
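
The same slicing pattern on a toy list shows that the two parts do not overlap. The names `samples`, `labels` and `val_size` are ours, and 25 items stand in for the 25,000 training reviews.

```python
samples = list(range(25))             # stand-in for the training reviews
labels = [i % 2 for i in samples]     # stand-in for the 0/1 labels
val_size = 10

# The first val_size items become validation data, the rest stay for training.
x_val_toy, x_train_toy = samples[:val_size], samples[val_size:]
y_val_toy, y_train_toy = labels[:val_size], labels[val_size:]

print(len(x_train_toy), len(x_val_toy))
```

In the real split, 10,000 reviews are held out and 15,000 remain for training.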

# Training the Model

In [24]:
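# Compile and train the model. The output layer already applies a sigmoid,
# so the loss receives probabilities rather than logits (from_logits=False).
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
              metrics=['accuracy'])
fitModel = model.fit(x_train, y_train, epochs=40, batch_size=512, validation_data=(x_val, y_val), verbose=1)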

Train on 15000 samples, validate on 10000 samples
Epoch 1/40
...
Epoch 40/40


# Testing the Model
Let's look at how accurate our model is on the test data. The call below returns the loss followed by the accuracy:

In [25]:
results = model.evaluate(test_data,test_labels)
print(results)



[0.5680697588539123, 0.87232]


# Saving the Model

In [26]:
model.save("model.h5")

# Loading the Model
Now that we have saved the trained model, we never need to retrain it. We can simply load the saved model using the following:

In [27]:
model = keras.models.load_model("model.h5")

# Making Predictions
Now it is time to use our saved model to make predictions. This is a little bit harder than it looks, because we need to consider the following:
- our model accepts integer encoded data
- our model needs reviews that are of length 250 words

This means we can't just pass any string of text into our model. It will need to be reshaped and reformed to meet the criteria above.

# Transforming our Data
The data we will use is a review of the movie The Lion King, stored in a text file called test.txt.

To start we will need to integer encode the data. We will do the following:

In [28]:
def review_encode(s):
    encoded = [1]  # every review starts with the <START> token
    for word in s:
        if word.lower() in word_index:
            encoded.append(word_index[word.lower()])
        else:
            encoded.append(2)  # <UNK> for words missing from the vocabulary
    return encoded

Next we will open our text file, read in each of the reviews (in this case just one), and use the model to predict whether it is positive or negative.

In [29]:
with open("test.txt", encoding="utf-8") as f:
    for line in f.readlines():
        # Strip basic punctuation and split the line into words
        nline = line.replace(",", "").replace(".", "").replace("(", "").replace(")", "").replace(":", "").replace("\"","").strip().split(" ")
        encode = review_encode(nline)
        encode = keras.preprocessing.sequence.pad_sequences([encode], value=word_index["<PAD>"], padding="post", maxlen=250) # make the data 250 words long
        predict = model.predict(encode)
        print(line)
        print(encode)
        print(predict[0])
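
The chained `replace` calls above only strip a handful of punctuation marks, so characters such as `!` or `?` stay attached to their words and those words end up encoded as unknown. One possible alternative, sketched below, is to extract the words with a regular expression. This swaps in a different tokenization from the one used above; the names `tokenize`, `encode_tokens` and `toy_vocab` are hypothetical, and the tiny vocabulary stands in for `word_index`.

```python
import re

def tokenize(text):
    """Lowercase the text and return runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def encode_tokens(tokens, vocab, start_id=1, unk_id=2):
    """Prefix the <START> id and map each token to its id, or to <UNK>."""
    return [start_id] + [vocab.get(token, unk_id) for token in tokens]

# Made-up vocabulary standing in for word_index.
toy_vocab = {"the": 4, "lion": 5, "king": 6, "is": 7, "great": 8}

tokens = tokenize('The Lion King: is "great"!')
encoded_line = encode_tokens(tokens, toy_vocab)
print(tokens)
print(encoded_line)
```

The encoded list can then be padded to 250 entries in the same way as before.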