# RNN and Transformers


In this lab lesson we will see how and when to use Recurrent Neural Networks and how to exploit pre-trained Transformer models such as BERT.


## Task description



In this exercise we will classify sentences according to their subjectivity.

We will use a collection of 103 Italian newspaper articles labeled as Objective or Subjective. Each article is divided into sentences, which are in turn labeled as either Subjective or Objective.

You can find the data along with a more detailed description [here](https://github.com/francescoantici/SubjectivITA).

Our goal is to build a model that predicts whether a sentence contains subjectivity or is fully objective.



## Import libraries

In [1]:
import tensorflow as tf
import pandas as pd
import numpy as np
from sklearn.metrics import classification_report
from sklearn.utils.class_weight import compute_class_weight

## Data loading

Please download the train, val and test files from the [repository](https://github.com/francescoantici/SubjectivITA/tree/main/datasets/sentences).

After having uploaded the three files to the notebook, use this utility function to load the data.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [5]:
!cp /content/drive/MyDrive/Università/DL/Laboratories/L4/*.csv ./

In [3]:
def get_data_split(split):
  """
  Args:
    - split: the split of the data you want to load.
  Returns:
    - X, y data, where X is the array containing the sentences and y is the labels vector.

  """
  df = pd.read_csv(f"sentences{split.capitalize()}.csv")
  return df['FRASE'].values, df['TAG_FRASE'].values

In [6]:
sentences_train, labels_train = get_data_split(split = 'train')
sentences_val, labels_val = get_data_split(split = 'val')
sentences_test, labels_test = get_data_split(split = 'test')

### Data Inspection

In [7]:
sentences_train

array(['Prova estrema su TikTok:',
       'bambina di 10 anni in coma a Palermo, dichiarata la morte cerebrale.',
       'Inchiesta per istigazione al suicidio.', ...,
       'è quanto ha detto Guido Bertolaso, nuovo consulente della Lombardia per la campagna vaccinale regionale, nel corso di una conferenza stampa con il presidente Attilio Fontana e la vicepresidente Letizia Moratti.',
       'Non voglio soldi, faccio il volontario e mi sono abbassato lo stipendio: da un euro zero,',
       'ha aggiunto.'], dtype=object)

In [8]:
labels_train

array(['OGG', 'OGG', 'OGG', ..., 'OGG', 'SOG', 'OGG'], dtype=object)

## RNN

Recurrent neural networks (RNN) are a class of neural networks that is powerful for
modeling sequence data such as time series or natural language.

Schematically, a RNN layer uses a `for` loop to iterate over the timesteps of a
sequence, while maintaining an internal state that encodes information about the
timesteps it has seen so far.
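
The loop described above can be sketched in a few lines of NumPy. This is only an illustration of the idea, assuming a plain `tanh` cell; the function name `simple_rnn_forward` and the weight names are hypothetical, and Keras layers such as `LSTM` use more elaborate cells.

```python
import numpy as np

def simple_rnn_forward(inputs, w_x, w_h, b):
    """Run a tanh RNN cell over a sequence of shape (timesteps, features)."""
    hidden_size = w_h.shape[0]
    state = np.zeros(hidden_size)  # internal state, starts at zero
    outputs = []
    for x_t in inputs:  # the `for` loop over the timesteps
        # The new state depends on the current input and on the previous state.
        state = np.tanh(x_t @ w_x + state @ w_h + b)
        outputs.append(state)
    return np.stack(outputs), state

rng = np.random.default_rng(0)
timesteps, features, hidden_size = 5, 3, 4
inputs = rng.normal(size=(timesteps, features))
w_x = rng.normal(size=(features, hidden_size))
w_h = rng.normal(size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

all_states, last_state = simple_rnn_forward(inputs, w_x, w_h, b)
print(all_states.shape)  # (5, 4): one state per timestep
print(last_state.shape)  # (4,): the state after the last timestep
```

Returning every state corresponds to `return_sequences=True` in Keras, while returning only the last state corresponds to the default behaviour.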

The Keras RNN API is designed with a focus on:

- **Ease of use**: the built-in `keras.layers.RNN`, `keras.layers.LSTM`,
`keras.layers.GRU` layers enable you to quickly build recurrent models without
having to make difficult configuration choices.

- **Ease of customization**: You can also define your own RNN cell layer (the inner
part of the `for` loop) with custom behavior, and use it with the generic
`keras.layers.RNN` layer (the `for` loop itself). This allows you to quickly
prototype different research ideas in a flexible way with minimal code.

There are three built-in RNN layers in Keras:

1. `keras.layers.SimpleRNN`, a fully-connected RNN where the output of the previous
timestep is fed to the next timestep.

2. `keras.layers.GRU`, first proposed in
[Cho et al., 2014](https://arxiv.org/abs/1406.1078).

3. `keras.layers.LSTM`, first proposed in
[Hochreiter & Schmidhuber, 1997](https://www.bioinf.jku.at/publications/older/2604.pdf).

### Data pre-processing

We will use the [Tokenizer](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer) class provided by Keras to map each token to an integer, so that the model is able to interpret it.

In [9]:
from tensorflow.keras.preprocessing.text import Tokenizer


lbl_to_idx_dict = {"OGG":0, "SOG":1}

label_to_idx_f = np.vectorize(lbl_to_idx_dict.get)

vocabulary_dim = 10000

def get_tokenizer(x_train):
  tokenizer = Tokenizer(num_words = vocabulary_dim)
  tokenizer.fit_on_texts(x_train)
  return tokenizer

tokenizer = get_tokenizer(sentences_train)

In [10]:
from tensorflow.keras.utils import pad_sequences

maxSentenceLen = 20

generate_x = lambda x: pad_sequences(tokenizer.texts_to_sequences(x), maxlen = maxSentenceLen, padding = "post")

x_train = generate_x(sentences_train)
x_test = generate_x(sentences_test)
x_val = generate_x(sentences_val)

y_train = label_to_idx_f(labels_train)
y_test = label_to_idx_f(labels_test)
y_val = label_to_idx_f(labels_val)
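
To make the two preprocessing steps concrete, here is a simplified pure-Python sketch of what tokenization and padding do. It is not the Keras implementation: the helper names `fit_vocabulary` and `encode_and_pad` are hypothetical, words are split on whitespace only, and the vocabulary size limit is ignored. It mirrors the padding settings used above: zeros are appended at the end, and sentences that are too long lose their first tokens.

```python
from collections import Counter

def fit_vocabulary(texts):
    """Map each word to an integer index, most frequent word first (index 1)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Index 0 is kept free so that it can be used as the padding value.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode_and_pad(texts, vocabulary, max_len):
    """Turn texts into fixed-length integer lists."""
    encoded = []
    for text in texts:
        ids = [vocabulary[w] for w in text.lower().split() if w in vocabulary]
        ids = ids[-max_len:]  # too long: keep only the last max_len tokens
        encoded.append(ids + [0] * (max_len - len(ids)))  # too short: pad at the end
    return encoded

toy_sentences = ["il gatto dorme", "il cane"]
vocabulary = fit_vocabulary(toy_sentences)
print(vocabulary)  # {'il': 1, 'gatto': 2, 'dorme': 3, 'cane': 4}
print(encode_and_pad(toy_sentences, vocabulary, max_len=4))  # [[1, 2, 3, 0], [1, 4, 0, 0]]
```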

Let's build an RNN baseline based on LSTM layers.

In [11]:
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.optimizers import Adam

def get_rnn_model(input_shape, out_dim, vocabulary_dim):
  input = Input(shape=input_shape)

  embedding_layer = Embedding(input_dim=vocabulary_dim, output_dim=64)(input)

  lstm_1 = LSTM(128, return_sequences=True, recurrent_dropout = 0.2)(embedding_layer)

  lstm_2 = LSTM(64, dropout = 0.2)(lstm_1)

  output = Dense(out_dim)(lstm_2)

  model = Model(input, output)

  model.compile(loss=SparseCategoricalCrossentropy(from_logits=True), optimizer = Adam(1e-3), metrics = ['accuracy'])
  
  model.summary()
  
  return model

In [12]:
callback = tf.keras.callbacks.EarlyStopping(monitor = 'val_accuracy', mode = 'max', patience = 5, restore_best_weights = True)

In [13]:
model = get_rnn_model((20,), 2, vocabulary_dim)



Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 20)]              0         
                                                                 
 embedding (Embedding)       (None, 20, 64)            640000    
                                                                 
 lstm (LSTM)                 (None, 20, 128)           98816     
                                                                 
 lstm_1 (LSTM)               (None, 64)                49408     
                                                                 
 dense (Dense)               (None, 2)                 130       
                                                                 
Total params: 788,354
Trainable params: 788,354
Non-trainable params: 0
_________________________________________________________________
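
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers; the helper names are hypothetical and exist only for this check.

```python
def embedding_params(vocabulary_size, embedding_dim):
    # One embedding vector per vocabulary entry.
    return vocabulary_size * embedding_dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias vector.
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units

counts = {
    "embedding": embedding_params(10000, 64),
    "lstm": lstm_params(64, 128),
    "lstm_1": lstm_params(128, 64),
    "dense": dense_params(64, 2),
}
for name, value in counts.items():
    print(f"{name}: {value}")
print("total:", sum(counts.values()))
```

The values match the `Param #` column: most of the parameters sit in the embedding layer, which is worth keeping in mind when choosing `vocabulary_dim`.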


In [14]:
history = model.fit(x_train, y_train, epochs=10, validation_data = (x_val, y_val), callbacks = [callback])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10


### Model evaluation

For the model evaluation we will use the [classification report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) function provided by scikit-learn, which reports the performance of the model on several metrics (precision, recall and F1-score) for each class.

In [15]:
toLabels = np.vectorize(lambda e: "OGG" if e == 0 else "SOG")

def evaluate_model(model, x_test, y_test):
  """
  Args:
    - model: the model to use to make the prediction.
    - x_test: the sentences to label.
    - y_test: the actual labels.
  Returns:
    - The results of the evaluation of the model.

  """
  y_pred = np.argmax(model.predict(x_test), axis = -1)
  print(classification_report(toLabels(y_test), y_pred = toLabels(y_pred)))

In [16]:
evaluate_model(model, x_test, y_test)

              precision    recall  f1-score   support

         OGG       0.77      0.77      0.77       152
         SOG       0.54      0.55      0.54        75

    accuracy                           0.70       227
   macro avg       0.66      0.66      0.66       227
weighted avg       0.70      0.70      0.70       227
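
To read the report, it helps to see how the per-class columns are computed. The following sketch recomputes precision, recall and F1 from raw counts on a small made-up set of labels (not the dataset results above); `per_class_scores` is a hypothetical helper, not part of scikit-learn.

```python
def per_class_scores(y_true, y_pred, label):
    """Precision, recall and F1 for one class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted as label
    fp = sum(t != label and p == label for t, p in pairs)  # wrongly predicted as label
    fn = sum(t == label and p != label for t, p in pairs)  # missed instances of label
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels, invented for illustration only.
y_true = ["OGG", "OGG", "OGG", "SOG", "SOG", "OGG"]
y_pred = ["OGG", "SOG", "OGG", "SOG", "OGG", "OGG"]

for label in ("OGG", "SOG"):
    precision, recall, f1 = per_class_scores(y_true, y_pred, label)
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The `support` column is simply the number of true instances of each class, and the `macro avg` row is the unweighted mean of the per-class values.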



#### Bidirectional RNNs

For sequences other than time series (e.g. text), it is often the case that a RNN model
can perform better if it not only processes sequence from start to end, but also
backwards. For example, to predict the next word in a sentence, it is often useful to
have the context around the word, not only just the words that come before it.

Keras provides an easy API for you to build such **bidirectional RNNs**: the
`keras.layers.Bidirectional` wrapper.
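
Conceptually, a bidirectional layer runs one RNN from start to end, a second one from end to start, and merges the two outputs. Below is a minimal NumPy sketch of that idea, assuming a plain `tanh` cell and concatenation as the merge step (the default `merge_mode` of the Keras wrapper); the function names are hypothetical.

```python
import numpy as np

def rnn_states(inputs, w_x, w_h):
    """Return the state of a tanh RNN at every timestep."""
    state = np.zeros(w_h.shape[0])
    states = []
    for x_t in inputs:
        state = np.tanh(x_t @ w_x + state @ w_h)
        states.append(state)
    return np.stack(states)

def bidirectional_states(inputs, forward_weights, backward_weights):
    forward = rnn_states(inputs, *forward_weights)
    # Process the reversed sequence, then flip the result back so that
    # row t of both outputs refers to the same timestep.
    backward = rnn_states(inputs[::-1], *backward_weights)[::-1]
    return np.concatenate([forward, backward], axis=-1)

rng = np.random.default_rng(0)
timesteps, features, units = 5, 3, 4
inputs = rng.normal(size=(timesteps, features))
forward_weights = (rng.normal(size=(features, units)), rng.normal(size=(units, units)))
backward_weights = (rng.normal(size=(features, units)), rng.normal(size=(units, units)))

outputs = bidirectional_states(inputs, forward_weights, backward_weights)
print(outputs.shape)  # (5, 8): twice the number of units
```

The two directions have separate weights, which is why wrapping a layer in `Bidirectional` doubles both its output size and its parameter count.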

In [17]:
from tensorflow.keras.layers import Bidirectional, TimeDistributed

def get_rnn_model_bd(input_shape, out_dim, vocabulary_length):
  input = Input(shape=input_shape)

  embedding_layer = Embedding(input_dim=vocabulary_length, output_dim=64)(input)

  lstm_1 = Bidirectional(LSTM(128, return_sequences=True, recurrent_dropout = 0.2))(embedding_layer)

  lstm_2 = Bidirectional(LSTM(64, dropout = 0.2))(lstm_1)

  output = Dense(out_dim)(lstm_2)

  model = Model(input, output)

  model.compile(loss=SparseCategoricalCrossentropy(from_logits=True), optimizer = Adam(1e-3), metrics = ['accuracy'])
  
  model.summary()
  
  return model

In [18]:
model_bd = get_rnn_model_bd((20,), 2, vocabulary_dim)



Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 20)]              0         
                                                                 
 embedding_1 (Embedding)     (None, 20, 64)            640000    
                                                                 
 bidirectional (Bidirectiona  (None, 20, 256)          197632    
 l)                                                              
                                                                 
 bidirectional_1 (Bidirectio  (None, 128)              164352    
 nal)                                                            
                                                                 
 dense_1 (Dense)             (None, 2)                 258       
                                                                 
Total params: 1,002,242
Trainable params: 1,002,242
Non-tra

In [19]:
history = model_bd.fit(x_train, y_train, epochs=10, validation_data = (x_val, y_val), callbacks = [callback])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10


In [20]:
evaluate_model(model_bd, x_test, y_test)

              precision    recall  f1-score   support

         OGG       0.82      0.74      0.78       152
         SOG       0.56      0.68      0.61        75

    accuracy                           0.72       227
   macro avg       0.69      0.71      0.70       227
weighted avg       0.74      0.72      0.72       227



## Transformers

Transformers are deep neural networks that in recent years have achieved state-of-the-art performance on several tasks.

Transformers replace CNNs and RNNs with [self-attention](https://developers.google.com/machine-learning/glossary#self-attention). Self-attention allows Transformers to easily transmit information across the input sequence.
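
As a rough illustration, single-head scaled dot-product self-attention can be written in a few lines of NumPy. This is a sketch under simplifying assumptions (one head, no masking, random weights); the names `self_attention`, `w_q`, `w_k` and `w_v` are chosen here for illustration and real Transformer layers add multiple heads, residual connections and normalization.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over x of shape (seq_len, d_model)."""
    queries, keys, values = x @ w_q, x @ w_k, x @ w_v
    # Row i holds how much position i attends to every position of the sequence.
    weights = softmax(queries @ keys.T / np.sqrt(keys.shape[-1]))
    # Each output row is a weighted mix of the value vectors of all positions.
    return weights @ values, weights

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

attended, weights = self_attention(x, w_q, w_k, w_v)
print(attended.shape)        # (4, 8)
print(weights.sum(axis=-1))  # each row sums to 1
```

Unlike the RNN loop, every position is connected to every other position in a single step, so no information has to travel through a chain of intermediate states.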

For the transformer model implementation we will rely on a Python library called `transformers`, which provides an API interface to several pre-trained models for fine-tuning or transfer-learning purposes.

In [None]:
!pip3 install transformers 
from transformers import AutoTokenizer, TFBertModel

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transformers-4.28.1


### Pre-trained model

For this task we will use a pre-trained language model called [AlBERTo](https://github.com/marcopoli/AlBERTo-it). AlBERTo is a BERT model trained for the Italian language, with a focus on the language used in social networks, specifically Twitter. Given the language and the type of data in the dataset, AlBERTo is a good fit for this kind of task.

You can find the pre-trained model on [Hugging Face](https://huggingface.co/bert-base-multilingual-cased?text=mi+piace+il+%5BMASK%5D), an open repository of pre-trained architectures that are available in PyTorch, TensorFlow or both (depending on the developers). In the code below the model is loaded with the identifier `m-polignano-uniba/bert_uncased_L-12_H-768_A-12_italian_alb3rt0`.

### Data pre-processing

Transformers must receive their input in a standard format, split into `input_ids`, `token_type_ids` and `attention_mask`.
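
The sketch below shows what the three fields contain for a single padded sentence. It is a hand-written illustration, not the tokenizer's code: `build_bert_inputs` is a hypothetical helper, and the special token ids are placeholder values, since the real ids depend on the vocabulary of the tokenizer.

```python
# Placeholder ids for illustration only: the real values depend on the tokenizer.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

def build_bert_inputs(token_ids, max_len):
    """Wrap token ids in [CLS] ... [SEP] and pad everything to max_len."""
    ids = [CLS_ID] + token_ids[: max_len - 2] + [SEP_ID]
    n_pad = max_len - len(ids)
    return {
        "input_ids": ids + [PAD_ID] * n_pad,             # token ids plus padding
        "token_type_ids": [0] * max_len,                 # single sentence: all segment 0
        "attention_mask": [1] * len(ids) + [0] * n_pad,  # 1 = real token, 0 = padding
    }

example = build_bert_inputs([7, 8, 9], max_len=8)
for field, values in example.items():
    print(field, values)
```

The attention mask tells the model which positions to ignore, so that padding does not influence the representation of the sentence.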

In [None]:
def prepare_data_bert(x, y, maxSentenceLen = maxSentenceLen):
  """
  Args:
    - x: the sentences to label.
    - y: the actual labels.
    - maxSentenceLen: The maximum length of the sentences, it is used as a truncation length
  Returns:
    - A tuple with the input to feed into a transformers model, namely ([input_ids, token_type_ids, attention_mask], labels).

  """
  pad = tf.keras.preprocessing.sequence.pad_sequences
  tokenizer = AutoTokenizer.from_pretrained("m-polignano-uniba/bert_uncased_L-12_H-768_A-12_italian_alb3rt0")
  dataFields = {
          "input_ids": [],
          "token_type_ids": [],
          "attention_mask": [],
          "subjectivity": []
      }
  lbls = {
      'SOG' : 1.0,
      'OGG' : 0.0
  }
  for i in range(len(x)):
      data = tokenizer(x[i])
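      # Truncate at the end ('post') so that the leading [CLS] token is never cut off.
      padded = pad([data['input_ids'], data['attention_mask'], data['token_type_ids']], padding = 'post', truncating = 'post', maxlen = maxSentenceLen)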
      dataFields['input_ids'].append(padded[0])
      dataFields['attention_mask'].append(padded[1])
      dataFields['token_type_ids'].append(padded[-1])
      dataFields['subjectivity'].append(lbls[y[i]])
  
  for key in dataFields:
      dataFields[key] = np.array(dataFields[key])
  
  return [dataFields["input_ids"], dataFields["token_type_ids"], dataFields["attention_mask"]], dataFields["subjectivity"]

In [None]:
x_train_bert, y_train_bert = prepare_data_bert(sentences_train, labels_train)
x_val_bert, y_val_bert = prepare_data_bert(sentences_val, labels_val)
x_test_bert, y_test_bert = prepare_data_bert(sentences_test, labels_test)

### Model

#### Try it yourself! 

Try to implement a model that fine-tunes AlBERTo.


The BERT model takes as input three tensors of type `tf.int32` and shape `(maxSentenceLen,)`.
It returns two outputs: the sequence output, `modelOutput[0]`, and the pooled output, `modelOutput[-1]`.
You should work with the pooled output, so when you call the BERT model as a layer in your network, use it as follows:

`bertModel = TFBertModel.from_pretrained("m-polignano-uniba/bert_uncased_L-12_H-768_A-12_italian_alb3rt0")(inputs)[-1]`

In [None]:
from tensorflow.keras.layers import Concatenate

def create_transformers_model(input_shape, out_dim):
  """
  It should return the model instance to fine-tune the transformer model
  """
  
  input_ids = Input(shape=input_shape, name="input_ids", dtype=tf.int32)
  token_type_ids = Input(shape=input_shape, name="token_type_ids", dtype=tf.int32)
  attention_mask = Input(shape=input_shape, name="attention_mask", dtype=tf.int32)
  # inputs = Concatenate()([input_ids, token_type_ids, attention_mask])

  bertModel = TFBertModel.from_pretrained("m-polignano-uniba/bert_uncased_L-12_H-768_A-12_italian_alb3rt0")(input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask)[-1]
  
  out = Dense(out_dim)(bertModel)

  model = Model(inputs=[input_ids, token_type_ids, attention_mask], outputs=out)
  # Low learning rate, as recommended for BERT fine-tuning.
  model.compile(loss=SparseCategoricalCrossentropy(from_logits=True), optimizer = Adam(1e-5), metrics = ['accuracy'])
  model.summary()
  return model

In [None]:
bert_model = create_transformers_model((maxSentenceLen,), 2)

All model checkpoint layers were used when initializing TFBertModel.

All the layers of TFBertModel were initialized from the model checkpoint at m-polignano-uniba/bert_uncased_L-12_H-768_A-12_italian_alb3rt0.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


Model: "model_3"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_ids (InputLayer)         [(None, 20)]         0           []                               
                                                                                                  
 attention_mask (InputLayer)    [(None, 20)]         0           []                               
                                                                                                  
 token_type_ids (InputLayer)    [(None, 20)]         0           []                               
                                                                                                  
 tf_bert_model_6 (TFBertModel)  TFBaseModelOutputWi  184345344   ['input_ids[0][0]',              
                                thPoolingAndCrossAt               'attention_mask[0][0]',   

Normally, pre-trained models should not be fine-tuned for long: doing so can overwrite the prior knowledge of the model and lead to poor results. Also, given the size of the model, it is easy to overfit on the new data.
For BERT fine-tuning it is recommended to use a low learning rate (e.g. `1e-5`) and to train for no more than 4 epochs.

In [None]:
history = bert_model.fit(x_train_bert, 
                         y_train_bert, 
                         epochs=4, 
                         validation_data = (x_val_bert, y_val_bert), 
                         batch_size = 16, 
                         callbacks = [callback]
                         )

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


### Model evaluation

In [None]:
evaluate_model(bert_model, x_test_bert, y_test_bert)