# Exercise - IMDB

1. Use the IMDB movie review data (positive/negative movie reviews) to build a model that is able to predict the sentiment (positive/negative) from movie reviews. Your initial model should use one embedding layer, one recurrent layer (up to you which type), and a final fully connected layer to perform the classification.
1. Try to improve your model by doing (at least) the following: add an additional recurrent layer and/or use a bidirectional recurrent layer (**note**: If you have a good 1-layer model this may be difficult - just try your best).
1. In my preprocessing of the data, I kept the top 1000 words and made all reviews 100 words long. Consider changing one or both of these to try to improve your best model (**hint**: the limit of only 100 words is very severe - try doubling it to 200, which will likely improve your performance).

**Note**: You may want to use:
1. https://www.tensorflow.org/api_docs/python/tf/keras/layers/SimpleRNN
1. https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM
1. https://www.tensorflow.org/api_docs/python/tf/keras/layers/GRU
1. https://www.tensorflow.org/api_docs/python/tf/keras/layers/Bidirectional
1. https://www.tensorflow.org/api_docs/python/tf/keras/datasets/imdb/load_data

**See slides for more details!**

# Setup

In [None]:
NB_WORDS_KEEP = 1000 # keep top 1000 words (most frequent) -> all others are mapped to "unknown"
SENTENCE_LEN = 100 # ensure sentences are exactly 100 words long. If longer, truncate. If shorter, pad

In [None]:
import tensorflow as tf
import numpy as np

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data()

print(x_train.shape, y_train.shape, x_test.shape, y_test.shape)

In [None]:
print(f'First sentence has length = {len(x_train[0])}')
print(f'Second sentence has length = {len(x_train[1])}')

The data above is difficult to work with: there are too many unique words, and the reviews do not have a uniform length. Let's change that.
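
To make the two problems concrete, here is a minimal sketch that measures review lengths and the number of distinct word indices. The `toy_reviews` list and the helper `summarize_reviews` are made up for illustration; they are not part of the dataset or of Keras.

```python
import numpy as np

# Toy stand-in for x_train: each review is a list of integer word indices.
toy_reviews = [
    [1, 14, 22, 16, 43],
    [1, 194, 1153, 194],
    [1, 14, 47, 8, 30, 31, 7],
]

def summarize_reviews(reviews):
    """Return (min length, max length, mean length, number of distinct word indices)."""
    lengths = np.array([len(review) for review in reviews])
    vocab = {word for review in reviews for word in review}
    return int(lengths.min()), int(lengths.max()), float(lengths.mean()), len(vocab)

print(summarize_reviews(toy_reviews))
```

Calling `summarize_reviews(x_train)` on the real data should show the same two issues at a much larger scale.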

In [None]:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(
    num_words=NB_WORDS_KEEP, # only keep 1000 most used words
    start_char=1, # use 1 to indicate start of sentence
    oov_char=2, # use 2 to indicate any word not in the top 1000
    index_from=3 # start indexing words from 3
)

print(x_train.shape, y_train.shape, x_test.shape, y_test.shape)
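
The reserved indices can be easier to follow with a small decoding sketch. It assumes the convention used in this notebook: 0 is padding, 1 marks the start, 2 marks an unknown word, and a word with frequency rank `r` is stored as `r + index_from`. The `toy_word_rank` dictionary and the `decode_review` helper are hypothetical; the real ranks come with the dataset.

```python
# Hypothetical word ranks (1 = most frequent); the real mapping ships with the dataset.
toy_word_rank = {"the": 1, "movie": 2, "was": 3, "great": 4}

INDEX_FROM = 3  # same value as index_from in the load_data call
SPECIAL = {0: "<pad>", 1: "<start>", 2: "<unk>"}

def decode_review(encoded, word_rank, index_from=INDEX_FROM):
    """Map encoded integers back to tokens, undoing the index_from offset."""
    index_to_word = {rank + index_from: word for word, rank in word_rank.items()}
    return [SPECIAL.get(i, index_to_word.get(i, "<unk>")) for i in encoded]

toy_encoded = [1, 4, 5, 6, 2, 7, 0, 0]
print(decode_review(toy_encoded, toy_word_rank))
# ['<start>', 'the', 'movie', 'was', '<unk>', 'great', '<pad>', '<pad>']
```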

In [None]:
def preprocess_sample(sample, target_len):
    if len(sample) > target_len:
        return sample[:target_len] # if too long, shorten
    if len(sample) < target_len: # if too short, pad
        return sample + [0] * (target_len - len(sample)) # zero for these cases, i.e. padding
    return sample

def preprocess_imdb(x, target_len):
    # apply preprocess_sample to every review and stack the results into one 2-D array
    return np.array([preprocess_sample(sample, target_len) for sample in x])
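
The truncate-or-pad rule above can be checked on a toy input. The sketch below is an alternative NumPy formulation with the same behaviour (cut at the end, zero-pad at the end); the name `pad_or_truncate` and the toy data are made up for illustration.

```python
import numpy as np

def pad_or_truncate(samples, target_len):
    """Return a (len(samples), target_len) int array: long rows are cut, short rows zero-padded at the end."""
    out = np.zeros((len(samples), target_len), dtype=np.int64)
    for row, sample in enumerate(samples):
        clipped = sample[:target_len]      # keep at most target_len leading tokens
        out[row, :len(clipped)] = clipped  # remaining positions stay 0 (padding)
    return out

toy = [[1, 5, 9], [1, 7, 8, 4, 6, 3]]
print(pad_or_truncate(toy, 4))
# [[1 5 9 0]
#  [1 7 8 4]]
```

Keras also provides `tf.keras.utils.pad_sequences`. Its defaults pad and truncate at the front, so `padding='post'` and `truncating='post'` would be needed to match the behaviour here.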

In [None]:
# NOTE: Use these modified sentences as the data for training and testing!
z_train = preprocess_imdb(x_train, SENTENCE_LEN)
z_test = preprocess_imdb(x_test, SENTENCE_LEN)

In [None]:
print(f'First sentence has length = {len(x_train[0])}')
print(f'Second sentence has length = {len(x_train[1])}')

print(f'First modified sentence has length = {len(z_train[0])}')
print(f'Second modified sentence has length = {len(z_train[1])}')

# Exercise 1

Use the IMDB movie review data (positive/negative movie reviews) to build a model that is able to predict the sentiment (positive/negative) from movie reviews. Your initial model should use one embedding layer, one recurrent layer (up to you which type), and a final fully connected layer to perform the classification.
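
To see how data flows through the three layer types, the following NumPy walk-through tracks the array shapes. It uses untrained random weights and a plain tanh recurrence; it is not a Keras model, and all sizes are hypothetical toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB_SIZE, EMBED_DIM, UNITS = 50, 8, 4  # hypothetical toy sizes
BATCH, SEQ_LEN = 2, 6

# A batch of integer-encoded reviews.
tokens = rng.integers(0, VOCAB_SIZE, size=(BATCH, SEQ_LEN))

# Embedding: a lookup table with one row (vector) per word index.
embedding_table = rng.normal(size=(VOCAB_SIZE, EMBED_DIM))
embedded = embedding_table[tokens]  # (BATCH, SEQ_LEN, EMBED_DIM)

# Simple recurrent layer: update one hidden state per review, one time step at a time.
w_in = rng.normal(size=(EMBED_DIM, UNITS))
w_rec = rng.normal(size=(UNITS, UNITS))
hidden = np.zeros((BATCH, UNITS))
for t in range(SEQ_LEN):
    hidden = np.tanh(embedded[:, t, :] @ w_in + hidden @ w_rec)

# Fully connected layer with a sigmoid: one probability per review.
w_out = rng.normal(size=(UNITS, 1))
probability = 1.0 / (1.0 + np.exp(-(hidden @ w_out)))

print(embedded.shape, hidden.shape, probability.shape)
# (2, 6, 8) (2, 4) (2, 1)
```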

In [None]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(input_dim=??, output_dim=??),
    tf.keras.layers.??(??),
    tf.keras.layers.Dense(??, activation=??),
])

model.summary()

model.compile(
    loss=??,
    optimizer=??,
    metrics=['accuracy'],
)

In [None]:
model.fit(??)

# Exercise 2

Try to improve your model by doing (at least) the following: add an additional recurrent layer and/or use a bidirectional recurrent layer (**note**: If you have a good 1-layer model this may be difficult - just try your best).
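
Two shape details matter here. A recurrent layer that feeds another recurrent layer has to output the whole sequence, not just its last state. A bidirectional layer, with concatenation as the merge mode, doubles the output width. The NumPy sketch below illustrates both with a hypothetical `simple_rnn` helper and random weights; it is not the Keras implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
BATCH, SEQ_LEN, FEATURES, UNITS = 2, 5, 3, 4  # hypothetical toy sizes

def simple_rnn(inputs, w_in, w_rec, return_sequences=False):
    """Run a tanh recurrence over axis 1 of inputs with shape (batch, time, features)."""
    hidden = np.zeros((inputs.shape[0], w_rec.shape[0]))
    states = []
    for t in range(inputs.shape[1]):
        hidden = np.tanh(inputs[:, t, :] @ w_in + hidden @ w_rec)
        states.append(hidden)
    # Either every hidden state (batch, time, units) or only the last one (batch, units).
    return np.stack(states, axis=1) if return_sequences else hidden

def make_weights(n_in, n_units):
    """Random input and recurrent weight matrices for one layer."""
    return rng.normal(size=(n_in, n_units)), rng.normal(size=(n_units, n_units))

x = rng.normal(size=(BATCH, SEQ_LEN, FEATURES))

# Stacked: the first layer hands its full sequence of states to the second layer.
first = simple_rnn(x, *make_weights(FEATURES, UNITS), return_sequences=True)
second = simple_rnn(first, *make_weights(UNITS, UNITS))

# Bidirectional: one pass over the sequence as given, one over a time-reversed copy.
forward = simple_rnn(x, *make_weights(FEATURES, UNITS))
backward = simple_rnn(x[:, ::-1, :], *make_weights(FEATURES, UNITS))
both = np.concatenate([forward, backward], axis=-1)

print(first.shape, second.shape, both.shape)
# (2, 5, 4) (2, 4) (2, 8)
```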

In [None]:
model_deep = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(input_dim=??, output_dim=??),
    tf.keras.layers.LSTM(??, return_sequences=??),
    tf.keras.layers.LSTM(??),
    tf.keras.layers.Dense(??, activation=??),
])

model_deep.summary()

model_deep.compile(
    ??
)

In [None]:
model_deep.fit(??)

In [None]:
model_bidirectional = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(input_dim=??, output_dim=??),
    tf.keras.layers.Bidirectional(??),
    tf.keras.layers.Dense(??, activation=??),
])

model_bidirectional.summary()

model_bidirectional.compile(
    ??
)

In [None]:
model_bidirectional.fit(??)

# Exercise 3

In my preprocessing of the data, I kept the top 1000 words and made all reviews 100 words long. Consider changing one or both of these to try to improve your best model (**hint**: the limit of only 100 words is very severe - try doubling it to 200, which will likely improve your performance).
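
One way to judge how severe a length limit is: compute the share of reviews that lose words at that limit. The sketch below uses a hypothetical helper `fraction_truncated` and made-up lengths, not values measured from the dataset.

```python
import numpy as np

def fraction_truncated(lengths, target_len):
    """Share of reviews longer than target_len, i.e. reviews that lose words."""
    lengths = np.asarray(lengths)
    return float(np.mean(lengths > target_len))

# Hypothetical review lengths, not measured from the dataset.
toy_lengths = [80, 120, 150, 210, 95, 300, 180, 60]

for target_len in (100, 200):
    print(target_len, fraction_truncated(toy_lengths, target_len))
# 100 0.625
# 200 0.25
```

On the real data, the lengths could be taken as `[len(review) for review in x_train]`.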

In [None]:
# On your own here