# Sentiment Analysis with RNNs

Based on the [Chapter 16 notebook](https://github.com/ageron/handson-ml3/blob/main/16_nlp_with_rnns_and_attention.ipynb) from the book "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" by Aurélien Géron.

## Download and prepare the IMDB dataset
The IMDB dataset is kind of a "hello world" of NLP. It contains 50,000 movie reviews, each labeled as positive or negative.

Note: this took forever on my non-GPU laptop, so I'd recommend running this in Colab.

In [None]:
import tensorflow_datasets as tfds
import tensorflow as tf

raw_train_set, raw_valid_set, raw_test_set = tfds.load(
    name="imdb_reviews",
    split=["train[:90%]", "train[90%:]", "test"],
    as_supervised=True
)
tf.random.set_seed(42)
train_set = raw_train_set.shuffle(5000, seed=42).batch(32).prefetch(1)
valid_set = raw_valid_set.batch(32).prefetch(1)
test_set = raw_test_set.batch(32).prefetch(1)

In [None]:
for review, label in raw_train_set.take(4):
    print(review.numpy().decode("utf-8")[:200], "...")
    print("Label:", label.numpy())

In [None]:
vocab_size = 1000
text_vec_layer = tf.keras.layers.TextVectorization(max_tokens=vocab_size)
text_vec_layer.adapt(train_set.map(lambda reviews, labels: reviews))
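
For intuition about what the `TextVectorization` layer produces, here is a rough, framework-free sketch of its default behavior: lowercase the text, strip punctuation, split on whitespace, then map each word to an integer ID, with 0 reserved for padding and 1 for unknown words. The `toy_*` helpers and the tiny corpus are made up for illustration; this only approximates the Keras layer and is not its actual implementation.

```python
import string
from collections import Counter

PAD_ID, UNK_ID = 0, 1  # IDs reserved for padding and for out-of-vocabulary words

def toy_standardize(text):
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()

def toy_adapt(texts, max_tokens):
    """Build a word -> ID map from the most frequent words in `texts`."""
    counts = Counter(word for text in texts for word in toy_standardize(text))
    # Two of the max_tokens slots are taken by the padding and unknown IDs.
    most_common = counts.most_common(max_tokens - 2)
    return {word: i + 2 for i, (word, _) in enumerate(most_common)}

def toy_vectorize(texts, vocab):
    """Encode each text as IDs and pad every row to the longest one."""
    rows = [[vocab.get(w, UNK_ID) for w in toy_standardize(t)] for t in texts]
    width = max(len(r) for r in rows)
    return [r + [PAD_ID] * (width - len(r)) for r in rows]

corpus = ["great great movie", "a great movie"]  # hypothetical training texts
vocab = toy_adapt(corpus, max_tokens=5)
batch = toy_vectorize(["Great movie!", "This is a great, great movie"], vocab)
for row in batch:
    print(row)
```

Note how the short review ends in a run of zeros: every row is padded to the length of the longest review in the batch.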

**Warning**: the following cell will take a few minutes to run and the model will probably not learn anything because we didn't deal with the padding tokens.

In [None]:
embed_size = 128
tf.random.set_seed(42)
model = tf.keras.Sequential([
    text_vec_layer,
    tf.keras.layers.Embedding(vocab_size, embed_size),
    tf.keras.layers.GRU(128),
    tf.keras.layers.Dense(1, activation="sigmoid")
])
model.compile(loss="binary_crossentropy", optimizer="nadam",
              metrics=["accuracy"])
history = model.fit(train_set, validation_data=valid_set, epochs=2)

That's not great accuracy: we're basically guessing at random! Let's see if we can figure out why.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

def plot_stuff(model):
    # valid_set is not shuffled, so predictions stay aligned with the labels below
    predicted = model.predict(valid_set)
    bin_pred = (predicted > 0.5).squeeze().astype(int)

    lengths = []
    truth = []
    for reviews, labels in valid_set:
        # length of each raw review string, in bytes
        lengths += [len(text) for text in reviews.numpy()]
        truth += list(labels.numpy())

    lengths = np.array(lengths)
    truth = np.array(truth)
    correct = bin_pred == truth

    # histogram of review lengths, split by whether the prediction was right
    plt.hist(lengths[correct], bins=50, alpha=0.5, label='correct')
    plt.hist(lengths[~correct], bins=50, alpha=0.5, label='incorrect')
    plt.xlabel('Review Length')
    plt.ylabel('Count')
    plt.legend()
    plt.show()

    # show the confusion matrix
    cm = confusion_matrix(truth, bin_pred)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["Negative Review", "Positive Review"])
    disp.plot()

plot_stuff(model)
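
To get a feel for how much of a batch can end up as padding, here is a small sketch. The review lengths are hypothetical and `padding_fraction` is a helper made up for this illustration; it assumes every review in a batch is padded to the length of the longest one.

```python
def padding_fraction(lengths):
    """Fraction of positions that are padding when rows are padded to the longest one."""
    width = max(lengths)
    total = width * len(lengths)
    return (total - sum(lengths)) / total

batch_lengths = [50, 120, 900]  # hypothetical token counts for three reviews
frac = padding_fraction(batch_lengths)
print(f"{frac:.0%} of the positions in this batch are padding")
```

A single long review forces every other review in the batch to carry a long tail of padding tokens, and the GRU has to step through all of them after the real text has ended.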

## Handling variable length inputs
We have a few choices for dealing with variable-length inputs, particularly the ones that end up with a lot of null-valued padding tokens. The original source notebook describes a few approaches to **masking** (basically telling the model to ignore those tokens). Here we try [Ragged Tensors](https://www.tensorflow.org/guide/ragged_tensor) instead, a fairly new feature of TensorFlow and friends.
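
As a rough illustration of what masking buys us, the sketch below runs a tiny tanh RNN in NumPy over a short review, once unpadded, once padded, and once padded with the padding steps skipped. Everything here (`run_toy_rnn`, the random weights, the token IDs) is made up for the example; it mirrors the idea of masking, not the Keras implementation.

```python
import numpy as np

def run_toy_rnn(token_ids, embeddings, w_in, w_rec, use_mask):
    """Return the final hidden state of a tiny tanh RNN."""
    state = np.zeros(w_rec.shape[0])
    for token_id in token_ids:
        if use_mask and token_id == 0:
            continue  # masked step: keep the previous state unchanged
        state = np.tanh(embeddings[token_id] @ w_in + state @ w_rec)
    return state

rng = np.random.default_rng(42)
embeddings = rng.normal(size=(10, 4))  # 10 token IDs, 4-dimensional embeddings
w_in = rng.normal(size=(4, 3))         # input-to-hidden weights
w_rec = rng.normal(size=(3, 3))        # hidden-to-hidden weights

review = [5, 2, 7]                     # hypothetical token IDs
padded = review + [0] * 20             # the same review followed by padding (ID 0)

reference = run_toy_rnn(review, embeddings, w_in, w_rec, use_mask=False)
unmasked = run_toy_rnn(padded, embeddings, w_in, w_rec, use_mask=False)
masked = run_toy_rnn(padded, embeddings, w_in, w_rec, use_mask=True)

print("masked matches unpadded:  ", np.allclose(masked, reference))
print("unmasked matches unpadded:", np.allclose(unmasked, reference))
```

Without the mask, the 20 padding steps keep updating the hidden state, so what reaches the output layer has drifted away from what the review itself produced.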

**Warning**: the following cell will take a while to run (possibly 30 minutes if you are not using a GPU).

In [None]:
text_vec_layer_ragged = tf.keras.layers.TextVectorization(
    max_tokens=vocab_size, ragged=True)
text_vec_layer_ragged.adapt(train_set.map(lambda reviews, labels: reviews))
text_vec_layer_ragged(["Great movie!", "This is DiCaprio's best role."])

In [None]:
text_vec_layer(["Great movie!", "This is DiCaprio's best role."])
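
The difference between the two outputs above is easier to see on a small hand-written example. This sketch uses made-up token IDs and assumes TensorFlow 2.x; it converts a ragged tensor to a dense, zero-padded one and back.

```python
import tensorflow as tf

# Two "reviews" of different lengths, stored without any padding
ragged = tf.ragged.constant([[2, 3], [1, 1, 4, 2, 2, 3]])

# to_tensor() pads the short rows with 0 so that all rows have the same length
dense = ragged.to_tensor()

# from_tensor(..., padding=0) strips the trailing zeros again
restored = tf.RaggedTensor.from_tensor(dense, padding=0)

print("row lengths:", ragged.row_lengths().numpy())
print("dense shape:", dense.shape)
print("restored:   ", restored.to_list())
```

The ragged version keeps each row at its own length, so a layer that accepts ragged input never sees the padding tokens at all.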

**Warning**: the following cell will take a while to run (possibly 30 minutes if you are not using a GPU).

In [None]:
embed_size = 128
tf.random.set_seed(42)
ragged_model = tf.keras.Sequential([
    text_vec_layer_ragged,
    tf.keras.layers.Embedding(vocab_size, embed_size),
    tf.keras.layers.GRU(128),
    tf.keras.layers.Dense(1, activation="sigmoid")
])
ragged_model.compile(loss="binary_crossentropy", optimizer="nadam",
                     metrics=["accuracy"])
history = ragged_model.fit(train_set, validation_data=valid_set, epochs=5)

In [None]:
plot_stuff(ragged_model)