# Machine Learning

## L3T20 - Capstone Project 3

### Importing libraries & loading datasets

In [1]:
# Import Libraries
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import matplotlib.pyplot as plt

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Conv1D
from tensorflow.keras.layers import MaxPooling1D
from tensorflow.keras.layers import LSTM
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences


In [2]:
# Load the IMDB dataset and its word index

(train_x, train_y), (test_x, test_y) = imdb.load_data(num_words = 10000)
word_dict = imdb.get_word_index()


In [3]:
# Dict for review generation

offset=3   # word index offset
word_dict = {k:(v + offset) for k, v in word_dict.items()}
word_dict["<PAD>"] = 0
word_dict["<START>"] = 1
word_dict["<UNK>"] = 2
word_dict["<UNUSED>"] = 3
id_dict = {value:key for key, value in word_dict.items()}

# Create review print function

def print_review(review_data, sample_id):
    # sample_id is passed in explicitly instead of relying on a global loop variable
    print('\n=================================================')
    print(f'Sample = {sample_id} | Length = {len(review_data)}')
    print('=================================================')
    print(' '.join(id_dict[token_id] for token_id in review_data))


In [4]:
# Print random reviews using a for loop
review_num = 2
for i in np.random.randint(0, len(train_x), review_num):
    print_review(train_x[i], i)


Sample = 5326 | Length = 116
<START> i realize this review will get me <UNK> by the expert film critics <UNK> this site but i will defend this film br br the dentist is actually a really good film the acting isn't always top notch but the thrills are good and the story's good plus you see linda <UNK> <UNK> not that i'm an expert in this field but the direction seems good and the plot makes sense corbin makes a great creepy dentist it does to <UNK> what jaws does to sharks ish it obviously had a fairly limited budget but they did well with it what they could and developed the characters well those that count br br the end

Sample = 100 | Length = 158
<START> i am a great fan of david lynch and have everything that he's made on dvd except for hotel room the 2 hour twin peaks movie so when i found out about this i immediately grabbed it and and what is this it's a bunch of <UNK> drawn black and white cartoons that are loud and foul mouthed and unfunny maybe i don't know what's good but m
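
The index shift used for decoding can be illustrated on a toy vocabulary. This is a minimal sketch: `toy_word_index` and `decode` are made-up names for illustration, and the real index comes from `imdb.get_word_index()`. The shift of 3 mirrors the defaults of `imdb.load_data`, which offsets word ranks by 3 and reserves 1 for the start marker and 2 for out-of-vocabulary words.

```python
# Toy word index (hypothetical); the real one comes from imdb.get_word_index().
toy_word_index = {"the": 1, "movie": 2, "good": 3}

OFFSET = 3  # same shift as in the notebook

# Shift every rank by OFFSET, then use the freed low ids for special tokens.
toy_word_dict = {word: rank + OFFSET for word, rank in toy_word_index.items()}
toy_word_dict.update({"<PAD>": 0, "<START>": 1, "<UNK>": 2, "<UNUSED>": 3})

# Invert the mapping so integer sequences can be decoded back to text.
toy_id_dict = {idx: word for word, idx in toy_word_dict.items()}

def decode(ids):
    """Turn a list of integer ids into a space-separated string."""
    return " ".join(toy_id_dict.get(i, "<UNK>") for i in ids)

encoded = [1, 4, 5, 2, 6]
decoded = decode(encoded)
print(decoded)
```

Without the shift, the most frequent words would collide with the ids reserved for the special tokens.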

### Processing Data before using it in the NN

Before the reviews can be fed to the neural network, they must be brought to a common length. Reviews longer than 500 words are truncated and shorter reviews are padded with zeros, so that every input sequence has exactly 500 entries.

In [5]:
# Preprocessing data
train_x_padding = pad_sequences(train_x, maxlen = 500, value = 0)
test_x_padding  = pad_sequences(test_x, maxlen = 500, value = 0)
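
What the padding step does to an individual review can be sketched in plain NumPy. The helper `pad_or_truncate` below is hypothetical and only mimics the default behaviour of `pad_sequences`, which pads and truncates at the start of a sequence (`padding='pre'`, `truncating='pre'`); it is not the Keras implementation.

```python
import numpy as np

def pad_or_truncate(seq, maxlen, value=0):
    """Hypothetical helper mimicking pad_sequences' default 'pre' behaviour."""
    seq = list(seq)[-maxlen:]                 # keep only the last maxlen tokens
    padding = [value] * (maxlen - len(seq))   # fill the front with the pad value
    return np.array(padding + seq, dtype="int32")

# A short sequence is padded on the left, a long one loses its first tokens.
short = pad_or_truncate([1, 14, 22], maxlen=5)
long = pad_or_truncate([1, 14, 22, 16, 43, 530, 973], maxlen=5)
print(short)
print(long)
```

With these defaults, a review longer than 500 tokens keeps its ending rather than its beginning.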

### Model Development

In [6]:
# Build the model: embedding, two stacked LSTM layers, dropout and a sigmoid output
np.random.seed(1984)
model = Sequential()
model.add(Embedding(10000, 20, input_length = 500))
model.add(LSTM(32, return_sequences = True))
model.add(LSTM(16))
model.add(Dropout(0.20))
model.add(Dense(1, activation = 'sigmoid'))

Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor


The first layer of the neural network is the Embedding layer, which maps each integer word index to a dense 20-dimensional vector so that the network can operate on text encoded as integers. The two LSTM layers provide the recurrence required in an RNN; the first returns its full output sequence so that the second can process it. Dropout is applied to help prevent overfitting.

The input to the model is a vector of 500 integers, as produced by the data processing step. The output is a number between 0 and 1, where values near 0 indicate negative sentiment and values near 1 indicate positive sentiment.

### Model Training

In [7]:
# Compiles the model
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

# Provides a summary
model.summary()
np.random.seed(1984)
# Model training
epochist = model.fit(train_x_padding , train_y, epochs = 10
                     , verbose = True
                     , validation_data = (test_x_padding, test_y)
                     , batch_size = 32)

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 500, 20)           200000    
_________________________________________________________________
lstm (LSTM)                  (None, 500, 32)           6784      
_________________________________________________________________
lstm_1 (LSTM)                (None, 16)                3136      
_________________________________________________________________
dropout (Dropout)            (None, 16)                0         
_________________________________________________________________
dense (Dense)                (None, 1)                 17        
Total params: 209,937
Trainable params: 209,937
Non-trainable params: 0
_________________________________________________________________
Train on 25000 samples
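
The parameter counts in the summary above can be reproduced by hand from the layer sizes. The sketch below uses the standard formulas for embedding, LSTM and dense layers; the function names are illustrative and not part of Keras.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry.
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * ((input_dim + units + 1) * units)

def dense_params(input_dim, units):
    # One weight per input plus one bias, per output unit.
    return (input_dim + 1) * units

counts = {
    "embedding": embedding_params(10000, 20),
    "lstm": lstm_params(20, 32),
    "lstm_1": lstm_params(32, 16),
    "dropout": 0,  # dropout has no trainable parameters
    "dense": dense_params(16, 1),
}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name:<10} {n:>8,}")
print(f"{'total':<10} {total:>8,}")
```

The embedding layer accounts for 200,000 of the 209,937 parameters, so the vocabulary size and embedding dimension dominate the model size.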

### Model performance

In [8]:
results = model.evaluate(test_x_padding, test_y)
Loss = np.round(results[0], 3)
Acc = np.round(100*results[1], 3)
print(f'Test results: \n ==> Loss: {Loss}\n ==> Accuracy: {Acc}%')

Test results: 
 ==> Loss: 0.377
 ==> Accuracy: 86.012%


In [10]:
# Create a function that provides the model output for a given review
def m_predict(review):
    # Split review into lowercase words
    split = review.lower().split()
    
    # Convert words to integers using word_dict;
    # unknown words and ids outside the 10000-word vocabulary map to <UNK> (2)
    ids = []
    for k in split:
        word_id = word_dict.get(k)
        if word_id is None or word_id >= 10000:
            ids.append(2)
        else:
            ids.append(word_id)
            
    # Pad and truncate to input length
    ids_p = pad_sequences([ids], maxlen = 500, value = 0)
    
    # Use model
    score = model.predict(ids_p)
    
    return score


In [16]:
# Test model with self-defined examples

sample = ['This movie is good'
            , 'This movie is not good'
            , 'She liked the movie '
            , 'She did not like the movie'
            , 'Beautiful and exciting movie'
            , 'Wonderful and interesting'
            , 'Depressing and sad'
            , 'Boring and long'
            , 'This movie is great'
            , 'This movie is not great'
    
]
bench = 0.55

for s in sample:
    score = m_predict(s)
    
    if score > bench:
        sent = 'Good'
    else:
        sent = 'Bad'
        
    print(f'{s} <==> {score} <==> {sent} Sentiment')

This movie is good <==> [[0.7944853]] <==> Good Sentiment
This movie is not good <==> [[0.64577883]] <==> Good Sentiment
She liked the movie  <==> [[0.90852463]] <==> Good Sentiment
She did not like the movie <==> [[0.53184646]] <==> Bad Sentiment
Beautiful and exciting movie <==> [[0.925625]] <==> Good Sentiment
Wonderful and interesting <==> [[0.88919955]] <==> Good Sentiment
Depressing and sad <==> [[0.19502357]] <==> Bad Sentiment
Boring and long <==> [[0.08598903]] <==> Bad Sentiment
This movie is great <==> [[0.91242594]] <==> Good Sentiment
This movie is not great <==> [[0.839977]] <==> Good Sentiment


Various sample reviews were passed through the model. The main weakness of its predictions is negation: 'This movie is not good' (0.646) and 'This movie is not great' (0.840) were both classified as good sentiment when they are in fact negative.
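
How much the labels depend on the cut-off can be checked directly on the scores printed above. This is a small sketch: the scores are copied from the output of the previous cell, 0.55 is the `bench` value used in the notebook, and 0.50 is an assumed alternative added only for comparison.

```python
# Scores copied from the notebook output above.
scores = {
    "This movie is good": 0.7944853,
    "This movie is not good": 0.64577883,
    "She liked the movie": 0.90852463,
    "She did not like the movie": 0.53184646,
    "Beautiful and exciting movie": 0.925625,
    "Wonderful and interesting": 0.88919955,
    "Depressing and sad": 0.19502357,
    "Boring and long": 0.08598903,
    "This movie is great": 0.91242594,
    "This movie is not great": 0.839977,
}

def label(score, threshold):
    """Same rule as the notebook: strictly above the threshold means 'Good'."""
    return "Good" if score > threshold else "Bad"

# Reviews whose label changes between the assumed 0.50 and the notebook's 0.55.
flipped = [text for text, s in scores.items()
           if label(s, 0.50) != label(s, 0.55)]
print(flipped)

# Negated reviews that stay 'Good' even at the stricter threshold.
still_good = [text for text, s in scores.items()
              if " not " in text and label(s, 0.55) == "Good"]
print(still_good)
```

Only one review sits close enough to the cut-off to change label, which suggests that adjusting the threshold alone would not fix the 'not good' and 'not great' cases.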