# Laboratory 4

## Description of the `IMDB` dataset
The `IMDB` dataset is a set of 50,000 highly polarized reviews from the Internet Movie Database. They’re split into 25,000 reviews for training and 25,000 reviews for testing, each set consisting of 50% negative and 50% positive reviews. The reviews (sequences of words) have been preprocessed: each has been turned into a sequence of integers, where each integer stands for a specific word in a dictionary.

In [None]:
from tensorflow.keras.datasets import imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(
    num_words=10000)

The argument `num_words=10000` means we’ll keep only the 10,000 most frequently occurring words in the training data. Rarer words are not kept as distinct tokens: by default, Keras replaces them with a reserved out-of-vocabulary index, so no word index exceeds 9,999. The variables `train_data` and `test_data` are lists of reviews; each review is a list of word indices (encoding a sequence of words). `train_labels` and `test_labels` are lists of 0s and 1s, where 0 stands for negative and 1 stands for positive. For instance, the first review consists of 218 words and is positive:

In [None]:
import numpy as np
np.array(train_data[0]),len(train_data[0]),train_labels[0]

We can easily decode any of these reviews back to English words:

In [None]:
word_index = imdb.get_word_index()
# Invert the mapping: integer index -> word
reverse_word_index = {value: key for (key, value) in word_index.items()}

In [None]:
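def decoded_review(number_of_review):
    # Indices are offset by 3 because 0, 1, and 2 are reserved for
    # "padding", "start of sequence", and "unknown"
    return " ".join([reverse_word_index.get(i - 3, "?") for i in train_data[number_of_review]])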

In [None]:
number_of_review = 1000
decoded_review(number_of_review)
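
The `i - 3` offset in `decoded_review` can be checked on a small example without downloading anything. The sketch below assumes the default arguments of `imdb.load_data`, under which indices 0, 1, and 2 are reserved (padding, start of sequence, out-of-vocabulary) and every real word is stored as its frequency rank plus 3. The vocabulary `toy_word_index` and the helper `decode` are made up for illustration.

```python
# Toy vocabulary (hypothetical): word -> frequency rank, starting at 1,
# in the same shape as the dict returned by imdb.get_word_index().
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}
toy_reverse_index = {rank: word for word, rank in toy_word_index.items()}

def decode(encoded_review, reverse_index):
    # Encoded indices are shifted by 3: 0 (padding), 1 (start), 2 (unknown)
    # are reserved, so they fall outside the vocabulary and decode to "?".
    return " ".join(reverse_index.get(i - 3, "?") for i in encoded_review)

# 1 = start marker, 2 = out-of-vocabulary word, the rest are rank + 3.
toy_review = [1, 4, 5, 6, 2, 7]
print(decode(toy_review, toy_reverse_index))  # ? the movie was ? great
```

The two `?` characters correspond to the start marker and to a word that fell outside the vocabulary, which is why decoded reviews from the real dataset usually begin with `?`.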

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras import models, layers, optimizers
from tensorflow.keras.datasets import imdb
import matplotlib.pyplot as plt

## Task 1
Prepare the data:
- Multi-hot encode the lists in `train_data` and `test_data` to turn them into vectors of 0s and 1s. This means, for instance, turning the sequence [8, 5] into a 10,000-dimensional vector that is all 0s except for indices 8 and 5, which are 1s. You can then use a `Dense` layer, capable of handling floating-point vector data, as the first layer in your model.
- Change the data type of `train_labels` and `test_labels` from `int64` to `float32`.

In [None]:
# Load the data (keep top 10,000 words only)
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

In [None]:
# Multi-hot encode the input sequences
def multi_hot_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension), dtype="float32")
    for i, seq in enumerate(sequences):
        results[i, seq] = 1.0
    return results

x_train = multi_hot_sequences(train_data)
x_test = multi_hot_sequences(test_data)
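
A quick way to sanity-check the encoding is to run the same logic on a tiny input. The sketch below uses a 10-dimensional vector instead of 10,000 so the result can be read directly; the helper name `multi_hot` and the toy sequences are illustrative only.

```python
import numpy as np

def multi_hot(sequences, dimension):
    # One row per sequence; set 1.0 at every index that occurs in it.
    results = np.zeros((len(sequences), dimension), dtype="float32")
    for i, seq in enumerate(sequences):
        results[i, seq] = 1.0
    return results

# The second sequence repeats the same words in a different quantity.
toy = multi_hot([[8, 5], [8, 5, 8, 5]], dimension=10)
print(toy[0])                          # 1s at positions 5 and 8 only
print(np.array_equal(toy[0], toy[1]))  # True
```

Both rows come out identical, which shows what this representation discards: word order and word counts. Only the presence or absence of each word is kept.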

In [None]:
# Ensure labels are float32 (for compatibility)
y_train = np.array(train_labels).astype("float32")
y_test = np.array(test_labels).astype("float32")

## Task 2
Build your model. The input data are vectors and the labels are scalars (1s and 0s); a type of model that performs well on such a problem is a plain stack of densely connected (`Dense`) layers with `relu` activations. Think about:
- How many layers to use?
- How many units to choose for each layer?

Compile your model choosing a proper optimizer, loss function, and metrics.
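
One way to reason about the number of layers and units is to count the trainable parameters each choice implies. The sketch below is a plain-Python estimate, assuming every layer is fully connected with a bias term and the output is a single unit; `dense_param_count` is a hypothetical helper, not a Keras function.

```python
def dense_param_count(input_dim, hidden_units, output_units=1):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    previous = input_dim
    for units in list(hidden_units) + [output_units]:
        total += previous * units + units  # weight matrix + bias vector
        previous = units
    return total

for hidden in ([16], [16, 16], [64, 64]):
    print(hidden, dense_param_count(10000, hidden))
```

With a 10,000-dimensional input, almost all parameters sit in the first layer, so widening the layers grows the model far more than stacking additional small layers does.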

In [None]:
def build_model(units=(16, 16), activation='relu', loss='binary_crossentropy'):
    model = models.Sequential()
    model.add(layers.Input(shape=(10000,)))
    for u in units:
        model.add(layers.Dense(u, activation=activation))
    model.add(layers.Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam',
                  loss=loss,
                  metrics=['accuracy'])
    return model

## Task 3
Validate your model: 
- Create a validation set by setting apart 10,000 samples from the original training data.
- Train the model for 20 epochs in mini-batches of 512 samples from training data.
- Monitor loss and accuracy on the 10,000 samples from the validation set.
- Make a plot of the training and validation loss.
<br/><img src="2.png"/>
- Make a plot of the training and validation accuracy. 
<br/><img src="3.png"/>
- Choose a suitable number of epochs for training the model on the entire training data so that it does not overfit, and then evaluate it on the test data.
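
Choosing the number of epochs can be done by eye from the plots, or by picking the epoch where the validation loss is lowest. The sketch below shows the second approach; `best_epoch` is a hypothetical helper, and the loss values are invented for illustration and are not results from this model.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__) + 1

# Made-up validation losses for illustration only (not real results).
toy_val_loss = [0.52, 0.38, 0.30, 0.28, 0.29, 0.33, 0.37, 0.42]
print(best_epoch(toy_val_loss))  # 4
```

With a real training run, the same function could be applied to `history.history['val_loss']`. Validation loss that rises after this point while training loss keeps falling is the usual sign of overfitting.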

In [None]:
# Create a validation set
x_val = x_train[:10000]
partial_x_train = x_train[10000:]

y_val = y_train[:10000]
partial_y_train = y_train[10000:]

In [None]:
model = build_model()

history = model.fit(
    partial_x_train, partial_y_train,
    epochs=20,
    batch_size=512,
    validation_data=(x_val, y_val),
    verbose=2
)

In [None]:
# Plot training and validation loss (left) and accuracy (right)
def plot_history(history):
    history_dict = history.history
    acc = history_dict['accuracy']
    val_acc = history_dict['val_accuracy']
    loss = history_dict['loss']
    val_loss = history_dict['val_loss']
    epochs = range(1, len(loss) + 1)

    plt.figure(figsize=(12, 5))

    plt.subplot(1, 2, 1)
    plt.plot(epochs, loss, 'bo', label='Training loss')
    plt.plot(epochs, val_loss, 'b', label='Validation loss')
    plt.title('Training and validation loss')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.legend()

    plt.subplot(1, 2, 2)
    plt.plot(epochs, acc, 'bo', label='Training acc')
    plt.plot(epochs, val_acc, 'b', label='Validation acc')
    plt.title('Training and validation accuracy')
    plt.xlabel('Epochs')
    plt.ylabel('Accuracy')
    plt.legend()

    plt.tight_layout()
    plt.show()

plot_history(history)

In [None]:
# Retrain a fresh model on the full training set for the number of
# epochs chosen from the validation curves (e.g. 4)
final_model = build_model()
final_model.fit(x_train, y_train, epochs=4, batch_size=512, verbose=2)

In [None]:
# Evaluate on test set
results = final_model.evaluate(x_test, y_test, verbose=2)
print(f"\nTest Loss: {results[0]:.4f}, Test Accuracy: {results[1]:.4f}")

## Task 4
Do the following experiments:
- Try using fewer or more representation layers, and see how doing so affects validation and test accuracy.
- Try using layers with more units or fewer units.
- Try using the `mse` loss function instead of `binary_crossentropy`.
- Try using the `tanh` activation instead of `relu`.
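
Before running the loss experiment, it can help to compare the two losses on a single prediction. The sketch below computes both by hand for one positive sample, assuming the model output `p` is a probability strictly between 0 and 1; the function names are illustrative and are not the Keras implementations.

```python
import math

def binary_crossentropy(y_true, p):
    # Loss for a single sample; p is the predicted probability of class 1.
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def squared_error(y_true, p):
    # Squared difference between the label and the predicted probability.
    return (y_true - p) ** 2

# A positive review (label 1) predicted with decreasing confidence.
for p in (0.9, 0.5, 0.01):
    print(f"p={p}: bce={binary_crossentropy(1, p):.3f}, "
          f"mse={squared_error(1, p):.3f}")
```

For probabilities, the squared error can never exceed 1, whereas cross-entropy grows without bound as a wrong prediction becomes more confident. This difference in how strongly mistakes are penalized is one thing to keep in mind when comparing the two runs.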

In [None]:
from tensorflow.keras import backend as K
import gc

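def experiment(description, **kwargs):
    """Train a model variant for 4 epochs and report validation and test accuracy."""
    print(f"\n=== {description} ===")
    model = build_model(**kwargs)
    history = model.fit(partial_x_train, partial_y_train,
                        epochs=4,
                        batch_size=512,
                        validation_data=(x_val, y_val),
                        verbose=0)
    # Validation accuracy after the last epoch
    val_acc = history.history['val_accuracy'][-1]
    results = model.evaluate(x_test, y_test, verbose=0)
    print(f"Validation Accuracy: {val_acc:.4f}, Test Accuracy: {results[1]:.4f}")

    # Clean up to avoid memory issues across experiments
    K.clear_session()
    del model
    gc.collect()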

In [None]:
# Experiment 1: Fewer layers
experiment("1 layer with 16 units", units=[16])

In [None]:
# Experiment 2: More layers
experiment("3 layers with 16 units", units=[16, 16, 16])

In [None]:
# Experiment 3: Larger layers
experiment("2 layers with 64 units", units=[64, 64])

In [None]:
# Experiment 4: Smaller layers
experiment("2 layers with 4 units", units=[4, 4])

In [None]:
# Experiment 5: MSE loss instead of binary_crossentropy
experiment("MSE loss function", loss='mse')

In [None]:
# Experiment 6: tanh instead of relu
experiment("tanh activation", activation='tanh')