<a href="https://colab.research.google.com/github/FranciscoBPereira/AnaliseDados_2425_MEI_ISEC/blob/main/AD2425_P8.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Setup, Version check and Common imports

# Python ≥3.8 is required
import sys
assert sys.version_info >= (3, 8)


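# TensorFlow ≥2.0 is required
import tensorflow as tf
# Compare the major version numerically: a plain string comparison would rank "10.0" below "2.0"
assert int(tf.__version__.split(".")[0]) >= 2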

# Common imports
import numpy as np
import os

from tensorflow import keras
from tensorflow.keras import layers

# to make this notebook's output stable across runs
np.random.seed(42)

import matplotlib.pyplot as plt

plt.rc('font', size=14)
plt.rc('axes', labelsize=14, titlesize=14)
plt.rc('legend', fontsize=14)
plt.rc('xtick', labelsize=10)
plt.rc('ytick', labelsize=10)

print('Python version: ', sys.version_info)
print('TF version: ', tf.__version__)
print('Keras version: ', keras.__version__)
print('GPU is', 'available' if tf.config.list_physical_devices('GPU') else 'NOT AVAILABLE')

In [None]:
# Download IMDB dataset from Keras: https://keras.io/api/datasets/imdb/
# Reviews are already preprocessed and ready to use

# The raw dataset is available here: https://ai.stanford.edu/~amaas/data/sentiment/

tf.random.set_seed(42)

# Check the documentation of the load_data function

max_features = 10000
common_words = 10

# Load train and test datasets
(x_train, y_train), (x_test, y_test) = keras.datasets.imdb.load_data(num_words=max_features, skip_top=common_words)

# Load a dictionary that will be used to decode reviews
word_index = keras.datasets.imdb.get_word_index()

**Quiz 1**

Consult the documentation to answer the following questions:

1. What preprocessing operations have been done to the original reviews?

2. The call to load_data() above sets two parameters, num_words and skip_top. Why are they important for preparing the dataset?
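
As a rough illustration of what these two parameters do to the encoded reviews, the sketch below mimics the filtering step on a made-up sequence. It assumes that ids outside the range [skip_top, num_words) are replaced by the out-of-vocabulary id 2, as described in the Keras documentation; filter_ids and toy_review are hypothetical names and not part of Keras.

```python
# Toy re-implementation of the id filtering step, for illustration only.
OOV_ID = 2  # id reserved for out-of-vocabulary words


def filter_ids(sequence, num_words, skip_top, oov_id=OOV_ID):
    """Replace ids outside [skip_top, num_words) with the out-of-vocabulary id."""
    return [w if skip_top <= w < num_words else oov_id for w in sequence]


# Made-up encoded review: small ids are very frequent words, large ids are rare ones
toy_review = [1, 14, 22, 9, 4, 12000, 530, 7]
filtered = filter_ids(toy_review, num_words=10000, skip_top=10)
print(filtered)  # [2, 14, 22, 2, 2, 2, 530, 2]
```

Both the very frequent ids (below 10) and the rare id (12000) collapse into the same placeholder, so the model only sees the mid-frequency vocabulary.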


**1.1. Consult Some Reviews**

In [None]:
# Visualizing some reviews

# By relying on the retrieved dictionary, it is also possible to visualize the decoded review

# Labels: 0(Bad), 1(Good)

# Choose review
review = 0

print("Word count: " ,len(x_train[review]))
print(x_train[review])

tam = len(x_train[review])
print('Label ', y_train[review])


id_to_word = {id_ + 3: word for word, id_ in word_index.items()}
for id_, token in enumerate(("<pad>", "<sos>", "<unk>")):
    id_to_word[id_] = token
" ".join([id_to_word[id_] for id_ in x_train[review][:tam]])

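The decoding logic above is repeated later in the notebook, so it may be convenient to wrap it in a helper. The following is a minimal sketch on a made-up dictionary; build_id_to_word, decode_review and toy_word_index are hypothetical names, and the offset of 3 is the same one used in the cell above.

```python
def build_id_to_word(word_index, index_from=3):
    """Invert a word->rank dictionary, shifting ranks by index_from."""
    id_to_word = {idx + index_from: word for word, idx in word_index.items()}
    # The first three ids are reserved for special tokens
    for idx, token in enumerate(("<pad>", "<sos>", "<unk>")):
        id_to_word[idx] = token
    return id_to_word


def decode_review(ids, id_to_word):
    """Turn a sequence of ids back into text; unknown ids become <unk>."""
    return " ".join(id_to_word.get(i, "<unk>") for i in ids)


# Made-up stand-in for keras.datasets.imdb.get_word_index()
toy_word_index = {"the": 1, "movie": 2, "great": 3}
toy_id_to_word = build_id_to_word(toy_word_index)
decoded = decode_review([1, 4, 5, 2, 6, 0], toy_id_to_word)
print(decoded)  # <sos> the movie <unk> great <pad>
```
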
**1.2. Trim Reviews**

In [None]:
# Trim reviews and keep just the last maxlen words

# The classifier will perform sentiment analysis considering only these last words of each review
# https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences

maxlen = 20

x_trainP = keras.preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_testP = keras.preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)
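
By default, pad_sequences both pads and truncates at the beginning of each sequence ('pre'), which is why only the last maxlen words survive. The sketch below is a plain-Python re-implementation of that behavior for a single sequence, meant only to make the two options visible; trim_or_pad is a hypothetical helper, not the Keras function.

```python
def trim_or_pad(sequence, maxlen, truncating="pre", padding="pre", value=0):
    """Return a list of exactly maxlen ids, mimicking the 'pre'/'post' options."""
    if truncating == "pre":
        kept = sequence[-maxlen:]   # drop ids from the beginning, keep the end
    else:
        kept = sequence[:maxlen]    # drop ids from the end, keep the beginning
    pad = [value] * (maxlen - len(kept))
    return pad + kept if padding == "pre" else kept + pad


long_review = [11, 12, 13, 14, 15]
short_review = [21, 22]

print(trim_or_pad(long_review, 3))                     # [13, 14, 15]
print(trim_or_pad(long_review, 3, truncating="post"))  # [11, 12, 13]
print(trim_or_pad(short_review, 4))                    # [0, 0, 21, 22]
```
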




In [None]:
# Visualizing reviews after trimming

# Choose the review
review = 10

print("Word count:", len(x_trainP[review]))

print('Label ', y_train[review])
id_to_word = {id_ + 3: word for word, id_ in word_index.items()}
for id_, token in enumerate(("<pad>", "<sos>", "<unk>")):
    id_to_word[id_] = token
# Trimmed reviews all have maxlen tokens, so no slicing is needed
" ".join([id_to_word[id_] for id_ in x_trainP[review]])


**2. Classifiers for Sentiment Analysis**

**2.1 Model A - Multilayer Perceptron (MLP)**

In [None]:

# Baseline Feed-Forward Neural Network

# Complete the model with two hidden layers, each with 20 nodes (default parameters)
# The last layer should have 1 node with sigmoid activation

keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)

inputs = keras.Input(shape=[maxlen, 1])
x = keras.layers.Flatten()(inputs)

### Complete the Missing layers ###

output = keras.layers.Dense(1, activation='sigmoid')(x)

modelA = keras.Model(inputs, output)

In [None]:

modelA.summary()

In [None]:
modelA.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

history = modelA.fit(x_trainP, y_train, epochs=10, validation_split=0.2)

In [None]:
# Evaluate ModelA on the test set

modelA.evaluate(x_testP, y_test)

**Quiz:**
How do you evaluate the performance of Model A?
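
One possible way to approach this question is to compare training and validation accuracy epoch by epoch, since a growing gap between them suggests overfitting. The sketch below works on a dictionary shaped like history.history; the numbers in toy_history are invented purely for illustration and are not results from this notebook, and summarize_history is a hypothetical helper.

```python
def summarize_history(history_dict):
    """Report the best validation epoch and the final train/validation gap."""
    val_acc = history_dict["val_accuracy"]
    best = max(range(len(val_acc)), key=val_acc.__getitem__)
    return {
        "best_epoch": best + 1,  # epochs are reported 1-based
        "best_val_accuracy": val_acc[best],
        "final_gap": history_dict["accuracy"][-1] - val_acc[-1],
    }


# Invented values, only to show the shape of the data
toy_history = {
    "accuracy": [0.60, 0.70, 0.80, 0.90],
    "val_accuracy": [0.58, 0.66, 0.69, 0.67],
}
summary = summarize_history(toy_history)
print(summary)
```

With a real run, the same function could be called as summarize_history(history.history).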

**2.2. MultiLayer Perceptron with Word Embedding**

In [None]:

# Add a trainable embedding layer. The embedding should have 8 dimensions
# The remaining layers should be identical to Model A

# https://www.tensorflow.org/tutorials/text/word_embeddings
# https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding

keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)

output_emb = 8

inputs = keras.Input(shape=[maxlen])
emb = keras.layers.Embedding(max_features,  output_emb)(inputs)
x = keras.layers.Flatten()(emb)

### Complete the Missing layers ###

output = keras.layers.Dense(1, activation='sigmoid')(x)

modelB = keras.Model(inputs, output)


In [None]:
modelB.summary()

In [None]:
modelB.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

history = modelB.fit(x_trainP, y_train,
                      epochs=10,
                      validation_split=0.2)

In [None]:
# Evaluate Model B on the test set

modelB.evaluate(x_testP, y_test)

**Quiz:**
How do you evaluate the performance of Model B?

**2.3. Recurrent Neural Network (RNN) with Word Embedding**

In [None]:

# The feed-forward hidden layers are replaced by recurrent GRU layers
# Everything else is identical to ModelB

keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)

output_emb = 8

inputs = keras.Input(shape=[maxlen])
emb = keras.layers.Embedding(max_features,  output_emb)(inputs)

x = keras.layers.GRU(20, return_sequences=True)(emb)
x = keras.layers.GRU(20, return_sequences=False)(x)

output = keras.layers.Dense(1, activation='sigmoid')(x)

modelC = keras.Model(inputs, output)

In [None]:
modelC.summary()

In [None]:
modelC.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

history = modelC.fit(x_trainP, y_train,
                      epochs=10,
                      validation_split=0.2)

In [None]:
# Evaluate Model C on the test set

modelC.evaluate(x_testP, y_test)

**Quiz:**
How do you evaluate the performance of Model C?

**2.4. Recurrent Neural Network with Pretrained Embedding**

In [None]:
# Using a pretrained embedding

# Two options to obtain the embedding

# Option 1: Direct download from Stanford

# This example uses the 50-dimensional GloVe vectors: https://nlp.stanford.edu/projects/glove/

#!wget https://nlp.stanford.edu/data/glove.6B.zip
#!unzip -q glove.6B.zip

#!rm glove.6B.100d.txt
#!rm glove.6B.200d.txt
#!rm glove.6B.300d.txt
#!rm glove.6B.zip


# Option 2: Upload the file emb.zip to the working directory

!unzip -q emb.zip

!rm emb.zip

In [None]:

# Parse the GloVe file: each line holds a word followed by its vector components
embeddings_index = {}
with open('glove.6B.50d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))
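
To make the file format concrete, the sketch below runs the same parsing logic on two made-up lines held in memory instead of the real GloVe file. load_glove and the sample vectors are hypothetical and only three dimensions long; the real file has 50 values per word.

```python
import io

import numpy as np


def load_glove(lines):
    """Map each word to its vector, given lines of 'word v1 v2 ... vn'."""
    index = {}
    for line in lines:
        values = line.split()
        if not values:  # skip empty lines
            continue
        index[values[0]] = np.asarray(values[1:], dtype="float32")
    return index


# Made-up sample in the same layout as glove.6B.50d.txt
sample = io.StringIO("the 0.1 0.2 0.3\nmovie -0.5 0.0 0.25\n")
toy_index = load_glove(sample)
print(len(toy_index), toy_index["movie"].shape)  # 2 (3,)
```
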

In [None]:
# Create the embedding matrix with max_words rows and 50 columns (the embedding dimension)

embedding_dim = 50
max_words = max_features
embedding_matrix = np.zeros((max_words, embedding_dim)) # zero-initialized matrix

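# Fill the matrix with the values of the pretrained embedding

# load_data() shifts every word id by 3 (0=<pad>, 1=<sos>, 2=<unk>),
# so the row for a word is its word_index value plus 3
index_from = 3
for word, i in word_index.items():
    row = i + index_from
    if row < max_words:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            # Words not found in the embedding index stay all-zeros.
            embedding_matrix[row] = embedding_vector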
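
A quick sanity check after filling the matrix is to measure how many rows actually received a pretrained vector, since rows left at zero carry no information for the model. The sketch below shows the idea on a small made-up matrix; embedding_coverage is a hypothetical helper, and it assumes that no real pretrained vector is exactly all zeros.

```python
import numpy as np


def embedding_coverage(matrix):
    """Fraction of rows that contain at least one non-zero value."""
    filled_rows = np.any(matrix != 0, axis=1)
    return float(filled_rows.mean())


# Made-up 4-word vocabulary with 3-dimensional vectors; rows 0 and 2 were not found
toy_matrix = np.zeros((4, 3))
toy_matrix[1] = [0.1, -0.2, 0.3]
toy_matrix[3] = [0.5, 0.0, -0.4]
print(embedding_coverage(toy_matrix))  # 0.5
```

In the notebook, the same check could be applied as embedding_coverage(embedding_matrix).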

In [None]:
# Create model D, which is identical to model C with the exception of the embedding dimension

keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)

inputs = keras.Input(shape=[maxlen])
emb = keras.layers.Embedding(max_words, embedding_dim)(inputs)

### Complete the Missing layers ###

output = keras.layers.Dense(1, activation='sigmoid')(x)

modelD = keras.Model(inputs, output)




In [None]:
# Store the pretrained embedding values in the embedding layer
# Freeze the weights, so that they do not change during training

modelD.layers[1].set_weights([embedding_matrix])
modelD.layers[1].trainable = False

In [None]:
modelD.summary()

In [None]:
modelD.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])


history = modelD.fit(x_trainP, y_train,
                      epochs=10,
                      validation_split=0.2)

In [None]:
# Evaluate Model D on the test set

modelD.evaluate(x_testP, y_test)


**Quiz**

1. How do you evaluate the performance of Model D?

2. Present justifications for the differences in accuracy between the models.

**3. Text Preprocessing / Hyperparameter Analysis**

Several preprocessing operations and hyperparameters are used in this example, and they can have a relevant impact on the performance of the models. Create a new experiment by changing one of the following options and analyze the impact on performance.

**3.1. Reviews Preprocessing**

a) Do not remove frequent words from the dataset.

b) Is the relative order of the words relevant for classifying a review? (For this task, it might be useful to consult the permuted method of NumPy's random Generator.)

c) Change the number of words kept in each review and check how it impacts results.

d) Truncate the reviews at the end instead of at the beginning (i.e., keep the first words rather than the last) and check how it impacts results.

e) Vary the number of frequent words that are dropped and check how it impacts results.
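
For item b), one way to destroy word order while keeping the words themselves is to shuffle each review independently. The sketch below does this on a made-up batch with NumPy's Generator.permuted (available in NumPy 1.20 and later); toy_batch stands in for an array such as x_trainP, and the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, for repeatability

# Made-up batch of two trimmed reviews with four word ids each
toy_batch = np.array([[1, 2, 3, 4],
                      [5, 6, 7, 8]])

# axis=1 shuffles the ids inside each row independently of the other rows
shuffled = rng.permuted(toy_batch, axis=1)
print(shuffled)

# Every review still holds the same ids, only their order may differ
same_words = np.array_equal(np.sort(shuffled, axis=1), np.sort(toy_batch, axis=1))
print(same_words)  # True
```

Training the same model on the shuffled array and comparing accuracies gives a rough indication of how much each architecture relies on word order.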

**3.2. Hyperparameters**

a) Test different values for the hyperparameters (e.g., the embedding dimension) and check how they impact results.
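
Before running such experiments, it can help to estimate how the model size changes with the embedding dimension. The sketch below counts trainable parameters for an embedding followed by Flatten, dense hidden layers and a single sigmoid output. It assumes two hidden layers of 20 units, as described for Model A; mlp_embedding_params is a hypothetical helper, and the counts follow the usual weights-plus-biases formula rather than being read from modelB.summary().

```python
def mlp_embedding_params(vocab_size, emb_dim, seq_len, hidden_units=(20, 20)):
    """Count trainable parameters: embedding, dense hidden layers, sigmoid output."""
    counts = {"embedding": vocab_size * emb_dim}
    fan_in = seq_len * emb_dim  # size of the flattened embedding output
    for i, units in enumerate(hidden_units, start=1):
        counts[f"dense_{i}"] = fan_in * units + units  # weights + biases
        fan_in = units
    counts["output"] = fan_in + 1  # one weight per input plus one bias
    counts["total"] = sum(counts.values())
    return counts


# Values used in this notebook: 10000 words, reviews trimmed to 20 ids
for dim in (8, 16, 50):
    print(dim, mlp_embedding_params(10000, dim, 20)["total"])
```

For an embedding dimension of 8 this gives 80000 + 3220 + 420 + 21 = 83661 parameters, most of them in the embedding layer.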




In [None]:

## CODE GOES HERE  ##
