<a href="https://colab.research.google.com/github/FM11pp3/RNN_Practicise_exercises-solution/blob/main/RNN_ipnyb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RNN Exercises




# Ex1 Sentiment Analysis

Sentiment analysis is a computational technique used to identify and extract subjective information from text, determining the writer's attitude or opinion towards a topic. The objective of this study is to classify movie reviews from the IMDb dataset as either positive (label 1) or negative (label 0) using a Recurrent Neural Network (RNN), specifically a Bidirectional Long Short-Term Memory (LSTM) network.

The dataset consists of 50,000 movie reviews, equally divided into training (25,000) and testing (25,000) sets, each labeled with a binary sentiment. The dataset has already undergone initial preprocessing, including text cleaning, lowercasing, and tokenization into integer sequences ordered by word frequency.
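
To make the integer encoding concrete, the sketch below decodes a sequence of indices back into words. It assumes the defaults of `imdb.load_data` (index 0 for padding, 1 for the start of a review, 2 for out-of-vocabulary words, and word ranks shifted by `index_from=3`). The tiny `word_index` dictionary and the `decode_review` helper are hypothetical stand-ins, used here so the example runs without downloading the real mapping from `imdb.get_word_index()`.

```python
# Hypothetical miniature of the mapping returned by imdb.get_word_index():
# word -> frequency rank (1 = most frequent word in the corpus).
word_index = {"the": 1, "and": 2, "a": 3, "movie": 17, "great": 84}

# Defaults assumed from imdb.load_data: 0 = padding, 1 = start, 2 = unknown,
# and every word rank is shifted by index_from = 3.
INDEX_FROM = 3
SPECIAL_TOKENS = {0: "<pad>", 1: "<start>", 2: "<unk>"}


def decode_review(encoded, word_index):
    """Turn a list of integer word indices back into a readable string."""
    rank_to_word = {rank + INDEX_FROM: word for word, rank in word_index.items()}
    words = [SPECIAL_TOKENS.get(i) or rank_to_word.get(i, "<unk>") for i in encoded]
    return " ".join(words)


encoded_review = [1, 6, 87, 20, 2]
decoded = decode_review(encoded_review, word_index)
print(decoded)
```

Because indices follow frequency rank, restricting the vocabulary to the most frequent words simply maps every rarer word to the out-of-vocabulary index.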

The application of RNNs, particularly LSTMs, is highly relevant to this task as they are capable of capturing sequential dependencies and contextual information within the text, providing a deeper understanding of sentiment compared to methods based solely on word counts.
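
A small illustration of why word order matters: the two made-up reviews below contain exactly the same words, so a model based only on word counts cannot tell them apart, whereas a sequence model sees two different inputs. The sentences are invented for this sketch and are not taken from the dataset.

```python
from collections import Counter

review_a = "good acting not bad at all"
review_b = "bad acting not good at all"

# A bag-of-words representation only keeps how often each word occurs.
bag_a = Counter(review_a.split())
bag_b = Counter(review_b.split())

same_counts = bag_a == bag_b                        # identical word counts
same_order = review_a.split() == review_b.split()   # different word order

print(f"Same word counts: {same_counts}")
print(f"Same word order:  {same_order}")
```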

The dataset is publicly available through the Keras API: https://keras.io/api/datasets/imdb/


This exercise is adapted from:
* Deep learning for dummies. (Wiley). John Wiley & Sons. [Chapter 14]
* https://keras.io/examples/nlp/bidirectional_lstm_imdb/

## Pre-Requisites:

To complete this exercise, we assume that you already know the basics of RNNs and text mining.
If you are not comfortable with these topics, the following references are a good place to start:

* Keras Cheat Sheet (DataCamp): https://media.datacamp.com/legacy/image/upload/v1660903348/Keras_Cheat_Sheet_gssmi8.pdf
* Recurrent Neural Networks Cheat Sheet: https://github.com/BharathKumarNLP/Deep-Learning-Cheat-Sheets/blob/master/cheatsheet-recurrent-neural-networks.pdf
* Fundamentals of Predictive Text Mining: https://github.com/DSC-SPIDAL/harpgbdt/blob/master/doc/meeting/0821-DistributedGBT/fig/Fundamentals%20of%20Predictive%20Text%20Mining%20.pdf
* Deep learning for dummies. (Wiley). John Wiley & Sons. [Chapter 14]: https://moodle2526.up.pt/pluginfile.php/136412/mod_folder/content/0/Deep%20Learning%20for%20Dummies.pdf?forcedownload=1



## Objectives:

* Classify movie reviews as positive or negative using an LSTM.
* Apply preprocessing techniques for recurrent models.
* Build and train a bidirectional LSTM model for sentiment analysis.
* Evaluate the performance of the trained model.

## 1. Import Libraries and Load Dataset

We first import the core libraries: Keras for the dataset and model utilities, plus pandas for quick inspection. The IMDb dataset arrives pre-split into training and test sets, with each review encoded as a sequence of integer word indices. Indices are ranked by word frequency, so smaller indices denote more frequent words. Limiting the vocabulary to the 10,000 most frequent tokens keeps the problem manageable while preserving the most informative words.

Because reviews naturally vary in length, we use `pad_sequences` to bring every sequence to a fixed length. With the default settings, reviews longer than `max_pad` tokens (set to 50 in the code of Section 2) are truncated from the start, and shorter reviews are left-padded with zeros. This uniform shape is essential for feeding data into the LSTM layers, which expect a consistent number of timesteps within a batch.

In [None]:
# --- Import core libraries ---

from keras.datasets import imdb       # IMDb dataset from Keras
import pandas as pd                   # To handle tabular data for exploration

# --- Load the IMDb dataset ---

# Keep only the 10,000 most frequent words (to limit vocabulary size)
top_words = 10000

# The dataset is already split into training and testing sets.
# Each review is represented as a sequence of integer word indices.
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=top_words, seed=21)

# --- Convert to DataFrames for a quick look ---

df_x_train = pd.DataFrame(x_train)
df_y_train = pd.DataFrame(y_train)
df_x_test = pd.DataFrame(x_test)
df_y_test = pd.DataFrame(y_test)

# Print the number of samples in the test set
print(f"Number of test samples: {len(x_test)}")

# Show first few reviews
df_x_train.head(5)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
17464789/17464789 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step
Number of test samples: 25000


|   | 0 |
|---|---|
| 0 | [1, 13, 119, 78, 3310, 102, 13, 66, 81, 13, 46... |
| 1 | [1, 4, 86, 173, 3007, 11, 4195, 9, 44, 15, 416... |
| 2 | [1, 449, 2, 50, 26, 38, 111, 85, 108, 13, 181,... |
| 3 | [1, 25, 3525, 119, 4, 954, 364, 352, 102, 14, ... |
| 4 | [1, 13, 81, 79, 7937, 19, 682, 5111, 7, 2282, ... |


## 2. Preprocessing — Sequence Padding

In [None]:
from keras.preprocessing.sequence import pad_sequences

# Define the maximum review length (in tokens)
max_pad = 50

# Pad or truncate all sequences to the same length (max_pad tokens)
x_train = pad_sequences(x_train, maxlen=max_pad)
x_test = pad_sequences(x_test, maxlen=max_pad)

# Check one padded review
print("Example of a padded sequence:")
print(x_train[0])

Example of a padded sequence:
[  61  492   16 3953  159   29 1131   13 2134 3872   81   41   32   14
  832   56    8   35  576 1301    5 5348 3134  255  335  170    8    2
   72 1168 1656   57   29    9    2    2 3310  415   11 5215   89 1047
   10   10   81   24  106   14   20  126]
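
To make the default behaviour of `pad_sequences` concrete (`padding="pre"`, `truncating="pre"`, `value=0`), here is a minimal pure-Python sketch of the same logic for a single sequence. The helper `pad_pre` is hypothetical and written only for illustration; it is not part of Keras.

```python
def pad_pre(sequence, maxlen, value=0):
    """Mimic the pad_sequences defaults for one sequence: pad and truncate at the start."""
    truncated = sequence[-maxlen:]                 # keep only the last `maxlen` tokens
    padding = [value] * (maxlen - len(truncated))  # fill values go in front
    return padding + truncated


short_review = [1, 13, 119]
long_review = [1, 4, 86, 173, 3007, 11, 4195]

padded_short = pad_pre(short_review, maxlen=5)
padded_long = pad_pre(long_review, maxlen=5)

print(padded_short)  # [0, 0, 1, 13, 119]
print(padded_long)   # [86, 173, 3007, 11, 4195]
```

This is why the padded review printed above does not begin with the start index 1: the original review was longer than the limit, so its beginning was cut off.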


## 3. Model Architecture — Bidirectional LSTM

The model couples an embedding layer with a bidirectional LSTM, allowing it to read each review from both directions and capture context that might otherwise be lost. Stacking dense layers on top prepares a rich representation that culminates in a single sigmoid neuron for binary sentiment predictions.

In [None]:
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional, GlobalMaxPool1D

# --- Define hyperparameters ---
embedding_vector_length = 32  # size of the word embeddings

# --- Build the model ---
model = Sequential()

# 1. Embedding layer: converts each word ID into a dense 32-dimensional vector
model.add(Embedding(input_dim=top_words, output_dim=embedding_vector_length))

# 2. Bidirectional LSTM: processes the text both forward and backward
model.add(Bidirectional(LSTM(64, return_sequences=True)))

# 3. Global max pooling: reduces the sequence output into a single feature vector
model.add(GlobalMaxPool1D())

# 4. Dense layer with ReLU: learns non-linear combinations of the extracted features
model.add(Dense(32, activation='relu'))

# 5. Dropout: randomly disables neurons (50%) to prevent overfitting
model.add(Dropout(0.5))

# 6. Output layer: single neuron with sigmoid for binary sentiment classification
model.add(Dense(1, activation='sigmoid'))

## 4. Summary


In [None]:
# Build the model (optional but ensures summary displays input/output shapes)
# The sequence length must match the padding length used in preprocessing.
model.build((None, max_pad))

# Summary of the model architecture
model.summary()
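
The summary output is not reproduced here, so the sketch below recomputes the trainable parameter counts by hand, assuming the standard formulas for Embedding, LSTM, and Dense layers. The names `lstm_units` and `dense_units` are introduced only for this illustration; their values mirror the layer definitions in Section 3.

```python
top_words = 10000             # vocabulary size of the Embedding layer
embedding_vector_length = 32  # embedding dimension
lstm_units = 64               # units per LSTM direction
dense_units = 32              # units in the hidden Dense layer

# Embedding: one vector per vocabulary entry.
embedding_params = top_words * embedding_vector_length

# One LSTM has 4 gates; each gate has input weights, recurrent weights and a bias.
lstm_params = 4 * (lstm_units * (embedding_vector_length + lstm_units) + lstm_units)
bilstm_params = 2 * lstm_params        # forward and backward copies

# The default 'concat' merge mode doubles the feature size seen by the Dense layer.
pooled_features = 2 * lstm_units
dense_params = pooled_features * dense_units + dense_units
output_params = dense_units * 1 + 1

# GlobalMaxPool1D and Dropout have no trainable parameters.
total_params = embedding_params + bilstm_params + dense_params + output_params

print(f"Embedding:          {embedding_params}")
print(f"Bidirectional LSTM: {bilstm_params}")
print(f"Dense(32):          {dense_params}")
print(f"Dense(1):           {output_params}")
print(f"Total:              {total_params}")
```

Most of the parameters sit in the embedding table, which is one reason the vocabulary is capped at 10,000 words.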

## 5. Model Compilation and Training

We compile the network with `binary_crossentropy`, the standard loss for binary (positive/negative) classification, and the adaptive Adam optimizer. Training runs for three epochs while tracking validation accuracy, which helps us spot overfitting or underfitting quickly. Note that the code below passes the test set as validation data, so the final evaluation on the test set repeats the last validation measurement rather than providing a fully independent estimate of how well the model generalises.

In [None]:
# Compile the model
model.compile(
    loss='binary_crossentropy',   # Binary sentiment (0 or 1)
    optimizer='adam',             # Adaptive gradient optimizer
    metrics=['accuracy']          # Evaluate model performance
)

# Train the model
history = model.fit(
    x_train, y_train,
    epochs=3,                     # Number of full passes through the data
    batch_size=64,               # Number of samples per gradient update
    validation_data=(x_test, y_test),  # Evaluate on test data each epoch
    verbose=1
)

Epoch 1/3
260/391 ━━━━━━━━━━━━━━━━━━━━ 10s 83ms/step - accuracy: 0.5995 - loss: 0.6397

KeyboardInterrupt: 

In [None]:
# Evaluate model performance on the test set
loss, accuracy = model.evaluate(x_test, y_test, verbose=0)
print(f"Test Loss: {loss:.4f}")
print(f"Test Accuracy: {accuracy:.4f}")
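
To give the reported loss some meaning, the following sketch computes the mean binary cross-entropy by hand for a few made-up predictions. The function name, the sample values, and the clipping constant `eps` are assumptions chosen for illustration; this is not the Keras implementation itself.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]   # mostly correct, fairly sure
unsure = [0.5, 0.5, 0.5, 0.5]      # always predicts 0.5

loss_confident = binary_crossentropy(labels, confident)
loss_unsure = binary_crossentropy(labels, unsure)

print(f"Confident predictions: {loss_confident:.4f}")
print(f"Always 0.5:            {loss_unsure:.4f}")
```

A model that always outputs 0.5 scores ln 2 (about 0.693), which is a useful baseline when reading the training log: a loss close to that value means the network has barely started to separate the two classes.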


## Final Notes

Accuracy in the mid-80% range is typical for this setup, and there is room to experiment: try larger embeddings, longer padded sequences, or alternative sequence layers such as GRUs or Conv1D blocks.

In [None]:
help(Bidirectional)

Help on class Bidirectional in module keras.src.layers.rnn.bidirectional:

class Bidirectional(keras.src.layers.layer.Layer)
 |  Bidirectional(layer, merge_mode='concat', weights=None, backward_layer=None, **kwargs)
 |
 |  Bidirectional wrapper for RNNs.
 |
 |  Args:
 |      layer: `keras.layers.RNN` instance, such as
 |          `keras.layers.LSTM` or `keras.layers.GRU`.
 |          It could also be a `keras.layers.Layer` instance
 |          that meets the following criteria:
 |          1. Be a sequence-processing layer (accepts 3D+ inputs).
 |          2. Have a `go_backwards`, `return_sequences` and `return_state`
 |          attribute (with the same semantics as for the `RNN` class).
 |          3. Have an `input_spec` attribute.
 |          4. Implement serialization via `get_config()` and `from_config()`.
 |          Note that the recommended way to create new RNN layers is to write a
 |          custom RNN cell and use it with `keras.layers.RNN`, instead of
 |          subclas

In [None]:
import keras  # the top-level package must be imported before using keras.layers
help(keras.layers.RNN)

In [None]:
from keras.layers import LSTM
help(LSTM)

Help on class LSTM in module keras.src.layers.rnn.lstm:

class LSTM(keras.src.layers.rnn.rnn.RNN)
 |  LSTM(units, activation='tanh', recurrent_activation='sigmoid', use_bias=True, kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal', bias_initializer='zeros', unit_forget_bias=True, kernel_regularizer=None, recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, recurrent_constraint=None, bias_constraint=None, dropout=0.0, recurrent_dropout=0.0, seed=None, return_sequences=False, return_state=False, go_backwards=False, stateful=False, unroll=False, use_cudnn='auto', **kwargs)
 |
 |  Long Short-Term Memory layer - Hochreiter 1997.
 |
 |  Based on available runtime hardware and constraints, this layer
 |  will choose different implementations (cuDNN-based or backend-native)
 |  to maximize the performance. If a GPU is available and all
 |  the arguments to the layer meet the requirement of the cuDNN kernel
 |  (see below for

In [None]:
from keras.layers import GRU
help(GRU)

# Ex2 Prediction

In [2]:
import pandas as pd
import os
import matplotlib.pyplot as plt

In [4]:
# Folder that holds RossmannStoreSales.csv (local path; adjust to your environment).
# Forward slashes are used throughout so that no backslash is read as an escape sequence.
data_dir = "G:/My Drive/Profissional/FEUP/Ensino/Ciência_Dados/2025/3.TimeSeriesLABDatasets"
df = pd.read_csv(os.path.join(data_dir, "RossmannStoreSales.csv"))
df.info()

In [3]:
import pandas as pd

# Filter store 1 and sort by date
store1 = df[df['Store'] == 1].sort_values('Date')

# Select only the relevant columns (copy so the assignment below does not touch a view of df)
data = store1[['Date', 'Sales', 'Customers', 'Open', 'Promo', 'SchoolHoliday', 'StateHoliday']].copy()
data['Date'] = pd.to_datetime(data['Date'])

In [5]:
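# Chronological split: first 70% for training, next 15% for validation, last 15% for testing
n = len(data)
train_end = int(n * 0.7)
val_end = int(n * 0.85)

# The rows are already sorted by date, so slicing keeps the time order (no shuffling)
train = data.iloc[:train_end]
val = data.iloc[train_end:val_end]
test = data.iloc[val_end:]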
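
The split above can be tried out without the Rossmann file. The sketch below applies the same 70/15/15 slicing to a plain list of 100 day numbers; the helper `chronological_split` and the toy data are hypothetical and serve only to show that every validation point comes after the training data and every test point after the validation data.

```python
def chronological_split(rows, train_end_frac=0.7, val_end_frac=0.85):
    """Split an already time-ordered sequence into train/val/test without shuffling."""
    n = len(rows)
    train_end = int(n * train_end_frac)
    val_end = int(n * val_end_frac)
    return rows[:train_end], rows[train_end:val_end], rows[val_end:]


days = list(range(1, 101))  # stand-in for 100 consecutive dates
train_days, val_days, test_days = chronological_split(days)

print(len(train_days), len(val_days), len(test_days))
print(train_days[-1], val_days[0], val_days[-1], test_days[0])
```

Keeping the order intact matters for time series: a random split would let the model see days that lie after the ones it is later evaluated on.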

# Ex3 Text Generator