### RNN - Sentiment Predictions
##### We'll be looking at sequential data in the form of movie reviews from the Internet Movie Database (IMDB) website. We'll train our models on this data and use them to predict the sentiment (either positive or negative) of the reviews.
##### The data contains 50,000 movie reviews (evenly split between train and test), where each word in a review has been converted to an integer as part of the pre-processing. Both the train and test datasets are balanced.
-------------------------------------------------------------------

## LSTM (Long Short-Term Memory)
   * Gated units (nodes) that control what information should be retained and what should be forgotten.
   * LSTM has three gates: input, output, and forget.
        1. Forget gate : what information from the network's memory so far should be forgotten.
        2. Input gate : which parts of the incoming new information should be stored in memory.
        3. Output gate : what information should be passed along to the next step, taking into account both the memory and the new information.
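
To make the three gates concrete, here is a minimal NumPy sketch of a single LSTM time step. It is an illustration under assumptions, not the Keras implementation: the function name `lstm_step`, the weight names `W`, `U`, `b`, the block ordering, and the random weights are all hypothetical. The sizes (32 input features, 3 units, 200 time steps) are chosen to match the model defined in the next cell.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step (illustrative sketch).

    W: (input_dim, 4 * units), U: (units, 4 * units), b: (4 * units,)
    The four blocks are ordered: input gate, forget gate, candidate, output gate.
    """
    units = h_prev.shape[-1]
    z = x_t @ W + h_prev @ U + b
    i = sigmoid(z[..., 0 * units:1 * units])   # input gate: what new information to store
    f = sigmoid(z[..., 1 * units:2 * units])   # forget gate: what old memory to drop
    g = np.tanh(z[..., 2 * units:3 * units])   # candidate values for the memory
    o = sigmoid(z[..., 3 * units:4 * units])   # output gate: what to pass along
    c_t = f * c_prev + i * g                   # updated memory (cell state)
    h_t = o * np.tanh(c_t)                     # output passed to the next step
    return h_t, c_t

# Hypothetical random weights, sized like LSTM(3) on 32-dimensional inputs
rng = np.random.default_rng(0)
input_dim, units = 32, 3
W = rng.normal(scale=0.1, size=(input_dim, 4 * units))
U = rng.normal(scale=0.1, size=(units, 4 * units))
b = np.zeros(4 * units)

h = np.zeros(units)
c = np.zeros(units)
sequence = rng.normal(size=(200, input_dim))   # one fake review: 200 steps of 32 features
for x_t in sequence:
    h, c = lstm_step(x_t, h, c, W, U, b)

print(h.shape)  # (3,)
```

Only the final `h` is kept here, which mirrors how a recurrent layer that returns a single vector per sequence behaves.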
    
    

In [1]:
import tensorflow as tf
from tensorflow.keras import layers
import pandas as pd
import numpy as np

In [3]:
# Define the model
model = tf.keras.Sequential()
model.add(layers.LSTM(3, input_shape=(200, 32)))
model.add(layers.Dense(3, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='sgd', loss='binary_crossentropy')
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm (LSTM)                 (None, 3)                 432       
                                                                 
 dense (Dense)               (None, 3)                 12        
                                                                 
 dense_1 (Dense)             (None, 1)                 4         
                                                                 
Total params: 448
Trainable params: 448
Non-trainable params: 0
_________________________________________________________________
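
The parameter counts in the summary above can be checked by hand. The sketch below assumes the usual layout: an LSTM layer holds four weight blocks (three gates plus the candidate memory), each with input weights, recurrent weights, and a bias. The helper names are hypothetical.

```python
def dense_param_count(inputs, units):
    # One weight per input-unit pair, plus one bias per unit
    return inputs * units + units

def lstm_param_count(input_dim, units):
    # Four blocks (input, forget, output gates + candidate memory), each with
    # input weights, recurrent weights, and a bias vector
    return 4 * (units * (input_dim + units) + units)

print(lstm_param_count(32, 3))   # 432
print(dense_param_count(3, 3))   # 12
print(dense_param_count(3, 1))   # 4
```

The three values sum to 448, the total reported by `model.summary()`.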


-----------------------------------------------------------------------

## GRU (Gated Recurrent Units)
   * Gated units (nodes) that control what information should be retained and what should be forgotten.
   * GRU has only 2 gates: reset and update.
        1. Reset gate : what old information is dropped from memory.
        2. Update gate : what new information is passed along and/or stored in memory.
   * The gates in the GRU are for controlling what old information gets dropped from memory as new information comes in (reset gate) and what new information gets passed along and/or stored in memory (update gate). In general, because GRUs have a simpler architecture (fewer gates) than LSTM, they use less memory and tend to run faster. That being said, LSTM may be more accurate when used on longer sequences. [--dataquest.io]
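
As a rough illustration of the two gates, here is a NumPy sketch of a single GRU time step. It follows one common formulation, in which the update gate blends the previous state with a candidate state; other formulations swap the roles of `z` and `1 - z`. The function name `gru_step`, the parameter dictionary `p`, and the random weights are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU time step (illustrative sketch, one common formulation)."""
    r = sigmoid(x_t @ p["W_r"] + h_prev @ p["U_r"] + p["b_r"])   # reset gate
    z = sigmoid(x_t @ p["W_z"] + h_prev @ p["U_z"] + p["b_z"])   # update gate
    # The reset gate scales how much of the old state feeds the candidate
    h_cand = np.tanh(x_t @ p["W_h"] + (r * h_prev) @ p["U_h"] + p["b_h"])
    # The update gate mixes the old state with the candidate
    return z * h_prev + (1.0 - z) * h_cand

# Hypothetical random weights, sized like GRU(3) on 32-dimensional inputs
rng = np.random.default_rng(0)
input_dim, units = 32, 3
p = {}
for gate in ("r", "z", "h"):
    p[f"W_{gate}"] = rng.normal(scale=0.1, size=(input_dim, units))
    p[f"U_{gate}"] = rng.normal(scale=0.1, size=(units, units))
    p[f"b_{gate}"] = np.zeros(units)

h = np.zeros(units)
for x_t in rng.normal(size=(200, input_dim)):   # one fake review of 200 steps
    h = gru_step(x_t, h, p)

print(h.shape)  # (3,)
```

Unlike the LSTM, there is no separate cell state: the single vector `h` serves as both memory and output.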

    

In [5]:
# Define model 
model = tf.keras.Sequential()
model.add(layers.GRU(3, input_shape=(200,32)))
model.add(layers.Dense(3, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='sgd', loss='binary_crossentropy')
model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 gru_1 (GRU)                 (None, 3)                 333       
                                                                 
 dense_2 (Dense)             (None, 3)                 12        
                                                                 
 dense_3 (Dense)             (None, 1)                 4         
                                                                 
Total params: 349
Trainable params: 349
Non-trainable params: 0
_________________________________________________________________
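
The 333 parameters reported for the GRU layer can be reproduced with a short calculation. This sketch assumes the TensorFlow 2 Keras default `reset_after=True`, under which each of the three weight blocks carries two bias vectors (one for the input side, one for the recurrent side). The helper name is hypothetical.

```python
def gru_param_count(input_dim, units, reset_after=True):
    # Three blocks: reset gate, update gate, candidate state.
    # With reset_after=True, each block has separate input and recurrent biases.
    biases_per_block = 2 * units if reset_after else units
    return 3 * (units * (input_dim + units) + biases_per_block)

print(gru_param_count(32, 3))                     # 333
print(gru_param_count(32, 3, reset_after=False))  # 324
```

Compared with the 432 parameters of the equivalent LSTM layer, the GRU is smaller because it has three weight blocks rather than four.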


## Preprocessing
    --Embedding Layer:
   * Represents the words in our data as fixed-length n-dimensional vectors that can capture meaning and context from our text.
   * Learns to map integers (which represent words) to dense vectors (called embeddings) such that the semantic relationships between words are captured in the geometric relationships between the vectors.
   * The embedding layer is able to place related words closer together in a multi-dimensional space, which enables the model to extract meaning and context from a sequence of words.
   * Most commonly used with text data. It is usually the first layer in the model.
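
Mechanically, an embedding layer behaves like a lookup table: one row of weights per word index. The NumPy sketch below illustrates this with a randomly initialised matrix (in a real model the rows are learned during training). The variable names and the random batch of word indices are hypothetical; the sizes match the model used in this notebook.

```python
import numpy as np

vocab_size, embedding_dim, sequence_length = 25000, 32, 200

rng = np.random.default_rng(0)
# One 32-dimensional vector per word index (random here, learned in practice)
embedding_matrix = rng.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim))

# A hypothetical batch of 4 reviews, each a row of 200 integer word indices
batch = rng.integers(0, vocab_size, size=(4, sequence_length))

# The lookup: every integer is replaced by its row of the matrix
embedded = embedding_matrix[batch]

print(embedded.shape)          # (4, 200, 32)
print(embedding_matrix.size)   # 800000
```

The matrix size, 25000 × 32 = 800,000, is the parameter count that `model.summary()` reports for the embedding layer.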
   

In [2]:
# Read the data
X_train_prep = np.loadtxt("imdb_X_train_prep_200.csv", delimiter=',')
X_test_prep = np.loadtxt("imdb_X_test_prep_200.csv", delimiter=',')
y_train_prep = np.loadtxt("imdb_y_train_prep_200.csv", delimiter=',')
y_test_prep = np.loadtxt("imdb_y_test_prep_200.csv", delimiter=',')

X_train_prep

array([[0.000e+00, 0.000e+00, 0.000e+00, ..., 8.740e+02, 1.450e+02,
        1.000e+01],
       [0.000e+00, 0.000e+00, 0.000e+00, ..., 3.200e+01, 3.100e+01,
        4.700e+01],
       [0.000e+00, 0.000e+00, 0.000e+00, ..., 9.000e+00, 6.176e+03,
        4.700e+01],
       ...,
       [7.629e+03, 3.700e+01, 1.100e+01, ..., 1.563e+03, 1.467e+03,
        5.600e+01],
       [0.000e+00, 0.000e+00, 0.000e+00, ..., 1.321e+03, 2.300e+01,
        4.700e+01],
       [3.875e+03, 5.000e+00, 3.100e+01, ..., 1.000e+01, 1.295e+03,
        4.300e+01]])

In [5]:
X_train_prep.shape

(25000, 200)

In [9]:
# Instantiate initial model
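model = tf.keras.Sequential()
# input_dim is the vocabulary size (it must exceed the largest word index); output_dim is the embedding dimension
model.add(layers.Embedding(input_dim=25000, output_dim=32, input_length=200))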
model.add(layers.LSTM(3))
model.add(layers.Dense(3, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='sgd', loss='binary_crossentropy')
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 200, 32)           800000    
                                                                 
 lstm (LSTM)                 (None, 3)                 432       
                                                                 
 dense (Dense)               (None, 3)                 12        
                                                                 
 dense_1 (Dense)             (None, 1)                 4         
                                                                 
Total params: 800,448
Trainable params: 800,448
Non-trainable params: 0
_________________________________________________________________


##### The Output Shape of the embedding layer in the model summary, (None, 200, 32), is the input shape for the next layer (the LSTM).

In [10]:
model.fit(X_train_prep, y_train_prep)
y_pred = model.predict(X_test_prep)

from sklearn.metrics import accuracy_score 
print(f"Accuracy Score on the test set: {accuracy_score(y_test_prep, np.round(y_pred))}")

Accuracy Score on the test set: 0.5176


In [11]:
y_pred

array([[0.49974284],
       [0.50161827],
       [0.503109  ],
       ...,
       [0.5031221 ],
       [0.50764626],
       [0.4996824 ]], dtype=float32)
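
The predictions above all sit very close to 0.5, which is consistent with an accuracy near chance on a balanced dataset. The sketch below shows how sigmoid outputs can be turned into class labels with an explicit threshold. The first three probabilities are taken from the output above; the fourth value (exactly 0.5) and the labels in `y_true` are made up for illustration.

```python
import numpy as np

# Column of sigmoid outputs, shaped like model.predict() output
y_prob = np.array([[0.49974284], [0.50161827], [0.503109], [0.5]])
y_true = np.array([0, 1, 0, 1])   # hypothetical labels

# Explicit threshold: probabilities >= 0.5 are labelled positive
y_label = (y_prob.ravel() >= 0.5).astype(int)
accuracy = np.mean(y_label == y_true)

print(y_label)         # [0 1 1 1]
print(accuracy)        # 0.75
print(np.round(0.5))   # 0.0
```

Note that `np.round` rounds halves to the nearest even number, so a prediction of exactly 0.5 would be labelled 0 by `np.round` but 1 by the threshold above. In practice this rarely matters, but the explicit comparison makes the decision rule visible.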

### Optimizing an LSTM Model -- by adjusting "embedding dimension (output_dim) used in the Embedding layer"
There are several settings we can experiment with, in addition to the embedding dimension, to try to improve the performance of our model. Some of these include:

* Number of hidden layers
* Number of nodes per hidden layer
* Activation function used (or not) at each layer
* Optimizer
* Loss function

In [3]:
optimize_dict = {'sgd': 0.5176,
                 'adam': 0.81208,
                 'RMSprop': 0.81552,
                 'output_dim-1': 0.5,
                 'output_dim-2': 0.80616,
                 'output_dim-4': 0.81192,
                 'output_dim-8': 0.73832,
                 'output_dim-16': 0.83192,
                 'output_dim-32': 0.84692,
                 'output_dim-64': 0.8236,
                 'output_dim-128': 0.82832,
                 'number_of_hidden-layers-3': 0.8358,
                 'number_of_hidden-layers-4': 0.78588,
                 'number_of_hidden-layers-5': 0.59844,
                 'number_of_nodes_2_hidden_layers-4': 0.8484,
                 'number_of_nodes_3_hidden_layers-4': 0.85452,
                 'number_of_nodes_3_hidden_layers-5': 0.82284,
                 'number_of_nodes_2_hidden_layers-5': 0.84496,
                 'number_of_nodes_2_hidden_layers-6': 0.82944,
                 'number_of_nodes_3_hidden_layers-6': 0.84072,
                 'number_of_nodes_2_hidden_layers-32': 0.85864,
                 'number_of_nodes_2_hidden_layers-adam_32': 0.86888}

In [6]:
# Instantiate initial model
optimizer_ = 'adam'
output_dim_ = 2**5
# hidden_layers = 5
nodes_hidden_layers = 64

model = tf.keras.Sequential()
model.add(layers.Embedding(input_dim=25000, output_dim=output_dim_, input_length=200))
model.add(layers.LSTM(32))
model.add(layers.Dense(nodes_hidden_layers, activation='relu'))

model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer=optimizer_, loss='binary_crossentropy')
model.summary()
model.fit(X_train_prep, y_train_prep)
y_pred = model.predict(X_test_prep)

from sklearn.metrics import accuracy_score 
print(f"Accuracy Score on the test set: {accuracy_score(y_test_prep, np.round(y_pred))}")

optimize_formula = f'number_of_nodes_2_hidden_layers-adam_{nodes_hidden_layers}'
accuracy = accuracy_score(y_test_prep, np.round(y_pred))
optimize_dict[optimize_formula] = accuracy

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 200, 32)           800000    
                                                                 
 lstm_1 (LSTM)               (None, 32)                8320      
                                                                 
 dense_2 (Dense)             (None, 64)                2112      
                                                                 
 dense_3 (Dense)             (None, 1)                 65        
                                                                 
Total params: 810,497
Trainable params: 810,497
Non-trainable params: 0
_________________________________________________________________
Accuracy Score on the test set: 0.86768


In [5]:
optimize_dict

{'sgd': 0.5176,
 'adam': 0.81208,
 'RMSprop': 0.81552,
 'output_dim-1': 0.5,
 'output_dim-2': 0.80616,
 'output_dim-4': 0.81192,
 'output_dim-8': 0.73832,
 'output_dim-16': 0.83192,
 'output_dim-32': 0.84692,
 'output_dim-64': 0.8236,
 'output_dim-128': 0.82832,
 'number_of_hidden-layers-3': 0.8358,
 'number_of_hidden-layers-4': 0.78588,
 'number_of_hidden-layers-5': 0.59844,
 'number_of_nodes_2_hidden_layers-4': 0.8484,
 'number_of_nodes_3_hidden_layers-4': 0.85452,
 'number_of_nodes_3_hidden_layers-5': 0.82284,
 'number_of_nodes_2_hidden_layers-5': 0.84496,
 'number_of_nodes_2_hidden_layers-6': 0.82944,
 'number_of_nodes_3_hidden_layers-6': 0.84072,
 'number_of_nodes_2_hidden_layers-32': 0.85864,
 'number_of_nodes_2_hidden_layers-adam_32': 0.86888,
 'number_of_nodes_2_hidden_layers-adam_64': 0.86732}
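
With this many entries, sorting the dictionary makes it easier to see which configuration did best. A minimal sketch, using a subset of the results recorded above (the variable names `results` and `ranked` are hypothetical):

```python
# Subset of the accuracies stored in optimize_dict above
results = {
    'sgd': 0.5176,
    'adam': 0.81208,
    'RMSprop': 0.81552,
    'output_dim-32': 0.84692,
    'number_of_nodes_2_hidden_layers-32': 0.85864,
    'number_of_nodes_2_hidden_layers-adam_32': 0.86888,
    'number_of_nodes_2_hidden_layers-adam_64': 0.86732,
}

# Sort configurations from highest to lowest test accuracy
ranked = sorted(results.items(), key=lambda kv: kv[1], reverse=True)

for name, acc in ranked[:3]:
    print(f"{acc:.5f}  {name}")

best_name, best_acc = ranked[0]
```

Because each configuration was trained once, small differences between the top entries may be due to random initialisation rather than the setting itself.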

In [30]:
optimize_dict['number_of_nodes_3_hidden_layers-6'] = 0.84072

# Memory Adjustment

In [7]:
sequence_lengths = [1, 120, 200, 500]

# create 4 dicts that store the 4 datasets (X_train, y_train, X_test, y_test) for each sequence length
X_train_prep_dict = {}
X_test_prep_dict = {}
y_train_prep_dict = {}
y_test_prep_dict = {}
results_sequence_lengths = {}

for sequence_length in sequence_lengths:  
    X_train_prep_dict[sequence_length] = np.loadtxt(f"imdb_X_train_prep_{sequence_length}.csv", delimiter=",")
    X_test_prep_dict[sequence_length] = np.loadtxt(f"imdb_X_test_prep_{sequence_length}.csv", delimiter=",")
    y_train_prep_dict[sequence_length] = np.loadtxt(f"imdb_y_train_prep_{sequence_length}.csv", delimiter=",")
    y_test_prep_dict[sequence_length] = np.loadtxt(f"imdb_y_test_prep_{sequence_length}.csv", delimiter=",")
    

    model = tf.keras.Sequential()   # model initialization
    model.add(layers.Embedding(input_dim=25000, output_dim=output_dim_, input_length=sequence_length))   # embedding layer
    model.add(layers.LSTM(32))   # hidden layer
    model.add(layers.Dense(64, activation='relu'))   # hidden layer
    model.add(layers.Dense(1, activation='sigmoid'))   # output layer

    model.compile(optimizer=optimizer_, loss='binary_crossentropy')
    model.summary()
    model.fit(X_train_prep_dict[sequence_length], y_train_prep_dict[sequence_length])
    y_pred = model.predict(X_test_prep_dict[sequence_length])

    from sklearn.metrics import accuracy_score
    # Score against the labels that belong to this sequence length
    accuracy = accuracy_score(y_test_prep_dict[sequence_length], np.round(y_pred))
    print(f"Accuracy Score on the test set: {accuracy}")

    results_sequence_lengths[f'sequence_lengths_{sequence_length}'] = accuracy

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 1, 32)             800000    
                                                                 
 lstm_2 (LSTM)               (None, 32)                8320      
                                                                 
 dense_4 (Dense)             (None, 64)                2112      
                                                                 
 dense_5 (Dense)             (None, 1)                 65        
                                                                 
Total params: 810,497
Trainable params: 810,497
Non-trainable params: 0
_________________________________________________________________
Accuracy Score on the test set: 0.49968
Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape           
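
The sequence length controls how many word indices of each review the model sees. The leading zeros in `X_train_prep` suggest that shorter reviews were left-padded with zeros. The sketch below shows one way such fixed-length rows could be produced. It is an assumption about the pre-processing, not the code that generated the CSV files: the helper `pad_or_truncate` is hypothetical, and keeping the last tokens when truncating is a guess.

```python
import numpy as np

def pad_or_truncate(token_ids, length):
    """Left-pad with zeros, or keep only the last `length` tokens (assumed behaviour)."""
    token_ids = list(token_ids)[-length:]          # truncate from the front if too long
    padding = [0] * (length - len(token_ids))      # zeros go in front if too short
    return np.array(padding + token_ids)

# Hypothetical short review made of word indices
review = [7629, 37, 11, 1563, 1467, 56]

print(pad_or_truncate(review, 1).tolist())   # [56]
print(pad_or_truncate(review, 8).tolist())   # [0, 0, 7629, 37, 11, 1563, 1467, 56]
```

With a sequence length of 1, the model sees a single word per review, which is consistent with the near-chance accuracy (0.49968) reported above for that setting.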

# Comparison - SimpleRNN & LSTM

In [6]:
# Instantiate initial model
optimizer_ = 'adam'
output_dim_ = 2**5
# hidden_layers = 5
nodes_hidden_layers = 64

model = tf.keras.Sequential()
model.add(layers.Embedding(input_dim=25000, output_dim=output_dim_, input_length=200))
model.add(layers.SimpleRNN(32))
model.add(layers.Dense(64, activation='relu'))

model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer=optimizer_, loss='binary_crossentropy')
model.summary()
model.fit(X_train_prep, y_train_prep)
y_pred = model.predict(X_test_prep)

from sklearn.metrics import accuracy_score 
print(f"Accuracy Score on the test set: {accuracy_score(y_test_prep, np.round(y_pred))}")

optimize_formula = f'number_of_nodes_2_SimpleRNN_hidden_layers-adam_{nodes_hidden_layers}'
accuracy = accuracy_score(y_test_prep, np.round(y_pred))
optimize_dict[optimize_formula] = accuracy

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 200, 32)           800000    
                                                                 
 simple_rnn (SimpleRNN)      (None, 32)                2080      
                                                                 
 dense_6 (Dense)             (None, 64)                2112      
                                                                 
 dense_7 (Dense)             (None, 1)                 65        
                                                                 
Total params: 804,257
Trainable params: 804,257
Non-trainable params: 0
_________________________________________________________________
Accuracy Score on the test set: 0.80812
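
The summaries show that the SimpleRNN layer has 2,080 parameters where the LSTM layer of the same width had 8,320. A short calculation, with hypothetical helper names, shows where the factor of four comes from: a simple recurrent layer has a single block of input weights, recurrent weights, and biases, whereas an LSTM has four such blocks.

```python
def simple_rnn_param_count(input_dim, units):
    # One block: input weights, recurrent weights, and a bias vector
    return units * (input_dim + units) + units

def lstm_param_count(input_dim, units):
    # Four blocks of the same shape (three gates plus the candidate memory)
    return 4 * simple_rnn_param_count(input_dim, units)

print(simple_rnn_param_count(32, 32))   # 2080
print(lstm_param_count(32, 32))         # 8320
```

In this notebook the extra parameters coincide with higher test accuracy (0.80812 for SimpleRNN versus roughly 0.867 for the LSTM with the same settings), although each result comes from a single training run.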


In [9]:
optimize_dict

{'sgd': 0.5176,
 'adam': 0.81208,
 'RMSprop': 0.81552,
 'output_dim-1': 0.5,
 'output_dim-2': 0.80616,
 'output_dim-4': 0.81192,
 'output_dim-8': 0.73832,
 'output_dim-16': 0.83192,
 'output_dim-32': 0.84692,
 'output_dim-64': 0.8236,
 'output_dim-128': 0.82832,
 'number_of_hidden-layers-3': 0.8358,
 'number_of_hidden-layers-4': 0.78588,
 'number_of_hidden-layers-5': 0.59844,
 'number_of_nodes_2_hidden_layers-4': 0.8484,
 'number_of_nodes_3_hidden_layers-4': 0.85452,
 'number_of_nodes_3_hidden_layers-5': 0.82284,
 'number_of_nodes_2_hidden_layers-5': 0.84496,
 'number_of_nodes_2_hidden_layers-6': 0.82944,
 'number_of_nodes_3_hidden_layers-6': 0.84072,
 'number_of_nodes_2_hidden_layers-32': 0.85864,
 'number_of_nodes_2_hidden_layers-adam_32': 0.86888,
 'number_of_nodes_2_SimpleRNN_hidden_layers-adam_64': 0.80812}