# Machine Learning

## L3T20 - Capstone Project 3

### Importing libraries & loading datasets

In [40]:
# Import Libraries
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import matplotlib.pyplot as plt

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Conv1D
from tensorflow.keras.layers import MaxPooling1D
from tensorflow.keras.layers import LSTM
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences


In [16]:
# Load the dataset and the word index from the IMDB set

(trainX, trainY), (testX, testY) = imdb.load_data(num_words = 10000)
word_dict = imdb.get_word_index()


In [17]:
# Key for review generation

offset=3   # word index offset
word_dict = {k:(v + offset) for k, v in word_dict.items()}
word_dict["<PAD>"] = 0
word_dict["<START>"] = 1
word_dict["<UNK>"] = 2
word_dict["<UNUSED>"] = 3
id_dict = {value:key for key, value in word_dict.items()}

# Create review print function

def print_rev(rev_data, sample_idx):
    # The sample index is passed in explicitly instead of being read from a global loop variable
    print('\n=================================================')
    print(f'Sample = {sample_idx} | Length = {len(rev_data)}')
    print('=================================================')
    print(' '.join(id_dict[word_id] for word_id in rev_data))



In [27]:
# Print random reviews
rev_n = 2
for i in np.random.randint(0, len(trainX), rev_n):
    print_rev(trainX[i], i)


Sample = 6759 | Length = 248
<START> trick or treat <UNK> review this zany romp of a film revolves around the 80's culture of heavy metal and horror movies two things which i love <UNK> so as you can imagine this movie <UNK> to me pretty easily plus for no apparent reason <UNK> <UNK> plays a preacher br br this film is about an <UNK> high school youth who like all us losers ended up <UNK> in a world of evil heavy metal his favorite <UNK> dies and of course is miraculously <UNK> by playing his latest <UNK> album backwards this allows the <UNK> singer to go around killing people with demons and sh t helping out br br okay it's pretty cheesy at times but you know what it's got a surprising number of good qualities decent acting including gene simmons as a radio dj pretty good special effects very brief nudity decent atmosphere all in all it's actually a decent horror film but what really sucks is the music ironic huh well this uber evil metal guy is one of the most obnoxious high pitched

### Processing Data before using it in the NN

Before the reviews can be used in the neural network, they must be brought to a common length. Reviews longer than 500 words are truncated and shorter reviews are padded with zeros, so that every input sequence has exactly 500 entries.

In [29]:
# Preprocessing data
trainX_p = pad_sequences(trainX, maxlen = 500, value = 0)
testX_p = pad_sequences(testX, maxlen = 500, value = 0)
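
To make the padding and truncation concrete, here is a minimal NumPy sketch of what `pad_sequences` does with its default settings, which pad and truncate at the front of the sequence. The helper name `pad_pre` and the toy sequences are hypothetical and are not part of the notebook.

```python
import numpy as np

def pad_pre(seq, maxlen, value=0):
    """Hypothetical helper that mimics the default 'pre' padding and truncation."""
    seq = list(seq)[-maxlen:]                 # keep only the last maxlen tokens
    padding = [value] * (maxlen - len(seq))   # fill the front up to maxlen
    return np.array(padding + seq)

short_review = pad_pre([11, 12, 13], maxlen=5)
long_review = pad_pre([1, 2, 3, 4, 5, 6], maxlen=4)

print(short_review)   # zeros are added in front of the review
print(long_review)    # the beginning of the review is cut off
```

With these defaults, a review longer than 500 words keeps its last 500 words, not its first 500.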

### Model Development

In [31]:
# Build the model: Embedding -> two stacked LSTM layers -> Dropout -> sigmoid output
np.random.seed(1984)
model = Sequential()
model.add(Embedding(10000, 20, input_length = 500))
model.add(LSTM(32, return_sequences = True))
model.add(LSTM(16))
model.add(Dropout(0.20))
model.add(Dense(1, activation = 'sigmoid'))

Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor


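The first layer of the neural network is the Embedding layer. It maps each integer word id to a dense 20-dimensional vector, which lets the network operate on text that has been converted to integers. The two LSTM layers provide the recurrence required in the RNN; the first returns its full output sequence so that the second can consume it. Dropout (rate 0.20) is applied before the output layer to help prevent overfitting.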

The input to the model is a vector of 500 integers, as produced by the data processing step. The output is a number between 0 and 1, where values near 0 indicate negative sentiment and values near 1 indicate positive sentiment.
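
The shapes involved can be illustrated with a small NumPy sketch. It is only an illustration under stated assumptions: the embedding weights are random stand-ins for the learned ones, the word ids are arbitrary, and the names `embedding_matrix`, `review_ids` and `sigmoid` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, seq_len = 10000, 20, 500

# Random stand-in for the embedding weights (learned during training in the real model)
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))

# One padded review: 500 integer word ids, with 0 used for padding
review_ids = np.zeros(seq_len, dtype=int)
review_ids[-4:] = [1, 14, 20, 9]            # arbitrary ids, for illustration only

# The embedding step is a row lookup: each id selects one 20-dimensional vector
embedded = embedding_matrix[review_ids]
print(embedded.shape)                       # one vector per position in the review

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# The final Dense layer applies a sigmoid, so its output always lies between 0 and 1
print(sigmoid(-5.0), sigmoid(0.0), sigmoid(5.0))
```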

### Model Training

In [32]:
# Compiles the model
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

# Provides a summary
model.summary()

# Model training
epochist = model.fit(trainX_p, trainY, epochs = 3
                     , verbose = True
                     , validation_data = (testX_p, testY)
                     , batch_size = 32)

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 500, 20)           200000    
_________________________________________________________________
lstm (LSTM)                  (None, 500, 32)           6784      
_________________________________________________________________
lstm_1 (LSTM)                (None, 16)                3136      
_________________________________________________________________
dropout (Dropout)            (None, 16)                0         
_________________________________________________________________
dense (Dense)                (None, 1)                 17        
Total params: 209,937
Trainable params: 209,937
Non-trainable params: 0
_________________________________________________________________
Train on 25000 sampl
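
The parameter counts in the summary can be checked by hand. The sketch below uses the standard counting formulas for these layer types; the helper function names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per word id
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # One weight per input plus one bias, for each output unit
    return (input_dim + 1) * units

counts = {
    'embedding': embedding_params(10000, 20),
    'lstm': lstm_params(20, 32),
    'lstm_1': lstm_params(32, 16),
    'dropout': 0,                      # dropout has no trainable weights
    'dense': dense_params(16, 1),
}
total = sum(counts.values())

for name, n in counts.items():
    print(f'{name:10s} {n}')
print(f'Total      {total}')
```

Almost all of the parameters sit in the Embedding layer, so the vocabulary size and the embedding dimension dominate the size of the model.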

### Model performance

In [34]:
results = model.evaluate(testX_p, testY)
Loss = np.round(results[0], 3)
Acc = np.round(100*results[1], 3)
print(f'Test results: \n ==> Loss: {Loss}\n ==> Accuracy: {Acc}%')

Test results: 
 ==> Loss: 0.322
 ==> Accuracy: 87.548%
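
For reference, the two reported numbers can be reproduced conceptually with a few lines of NumPy. This is a sketch on made-up toy values, not on the notebook's data; the function names are hypothetical, and the 0.5 threshold is the usual convention for binary accuracy.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(np.mean(losses))

def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Share of predictions that fall on the correct side of the threshold."""
    predicted_positive = np.asarray(y_pred) > threshold
    actual_positive = np.asarray(y_true) == 1
    return float(np.mean(predicted_positive == actual_positive))

# Toy labels and scores, chosen only for illustration
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.2, 0.4, 0.1]

toy_loss = binary_crossentropy(toy_labels, toy_scores)
toy_acc = binary_accuracy(toy_labels, toy_scores)
print(f'Loss: {toy_loss:.3f} | Accuracy: {100 * toy_acc}%')
```

The loss penalises confident wrong scores more heavily than hesitant ones, whereas accuracy only counts which side of the threshold each score falls on.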


In [35]:
# Create a function that provides the model output for a given review
def m_predict(review):
    # Split review into lowercase words
    words = review.lower().split()

    # Convert words to integers using word_dict; unknown words and words
    # outside the 10000-word vocabulary are mapped to <UNK> (id 2)
    ids = []
    for word in words:
        word_id = word_dict.get(word)
        if word_id is None or word_id >= 10000:
            ids.append(2)
        else:
            ids.append(word_id)

    # Pad and truncate to input length
    ids_p = pad_sequences([ids], maxlen = 500, value = 0)

    # Use model
    score = model.predict(ids_p)

    return score


In [37]:
# Test model with self-defined examples

sample = ['This movie is good'
            ,'This movie is not good'
            ,'This movie was good'
            ,'This movie was not good'
            , 'This movie is bad'
            , 'This movie is not bad'
            , 'This movie was bad'
            , 'This movie was not bad'
            , 'An interesting study in death by boredom'
            , 'Fascinating and beautiful'
            , 'The movie is great'
            , 'The movie is not great'
]
bench = 0.4   # Decision threshold: scores above this are labelled 'Good'
for s in sample:
    score = m_predict(s)
    
    if score > bench:
        sent = 'Good'
    else:
        sent = 'Bad'
        
    print(f'{s} <==> {score} <==> {sent} Sentiment')

This movie is good <==> [[0.47562233]] <==> Good Sentiment
This movie is not good <==> [[0.39087367]] <==> Bad Sentiment
This movie was good <==> [[0.40771827]] <==> Good Sentiment
This movie was not good <==> [[0.33134705]] <==> Bad Sentiment
This movie is bad <==> [[0.22538441]] <==> Bad Sentiment
This movie is not bad <==> [[0.20132266]] <==> Bad Sentiment
This movie was bad <==> [[0.20304897]] <==> Bad Sentiment
This movie was not bad <==> [[0.17836869]] <==> Bad Sentiment
An interesting study in death by boredom <==> [[0.19303657]] <==> Bad Sentiment
Fascinating and beautiful <==> [[0.81232333]] <==> Good Sentiment
The movie is great <==> [[0.6603089]] <==> Good Sentiment
The movie is not great <==> [[0.56914]] <==> Good Sentiment


Various reviews were passed through the model. The main issue with the model's predictions is that negation is not handled consistently: 'not good' is correctly identified as bad, but 'not bad' is also identified as bad and 'not great' is identified as good.
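
The effect of negation can be quantified directly from the scores printed above. The sketch below copies three of the sentence pairs and their scores from that output and compares each sentence with its negated version; the variable names are hypothetical.

```python
# Scores copied from the printed output above
scores = {
    'This movie is good': 0.47562233,
    'This movie is not good': 0.39087367,
    'This movie is bad': 0.22538441,
    'This movie is not bad': 0.20132266,
    'The movie is great': 0.6603089,
    'The movie is not great': 0.56914,
}
bench = 0.4   # same threshold as in the test cell

pairs = [
    ('This movie is good', 'This movie is not good'),
    ('This movie is bad', 'This movie is not bad'),
    ('The movie is great', 'The movie is not great'),
]

shifts = []
labels = {}
for plain, negated in pairs:
    shift = scores[negated] - scores[plain]      # change caused by adding 'not'
    shifts.append(shift)
    labels[negated] = 'Good' if scores[negated] > bench else 'Bad'
    print(f'{negated}: shift = {shift:+.3f}, label = {labels[negated]}')
```

In these recorded outputs, adding 'not' lowers the score by less than 0.1 in every pair, whatever the polarity of the following word. The label therefore only flips when the original score was already close to the threshold, which is consistent with the inconsistent handling of negation described above.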