## Run this notebook on Colab or your local system
<a href="https://colab.research.google.com/drive/14dea3PuU6bS4SG1T-k18zCzgQ_OAsx4E?usp=sharing" target="_blank">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Lab 8: BERT

In this lab, we will work with [DistilBERT](https://huggingface.co/distilbert-base-uncased), a model derived from BERT through knowledge distillation. It has fewer parameters than BERT, so it is lighter and easier to run.

You will use the pre-trained model to perform an NLP task such as sentiment analysis. The datasets are the same ones we used in previous labs.

You will use the "[_transformers_](https://huggingface.co/transformers/installation.html)" library from Hugging Face.

First, import the tokenizer. You can learn more about the arguments in this [link](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer).

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
import tensorflow as tf
from transformers import  TFDistilBertModel
from transformers import DistilBertTokenizer
from sklearn.model_selection import train_test_split

In [2]:
# Import the pretrained DistilBertTokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Now test the tokenizer with a few sentences. The input is a list of sentences without any preprocessing.

The tokenizer is applied to each sentence and returns a dictionary with two keys:
- _input_ids_: Contains the token ids. By default, the padding token has id 0.
- _attention_mask_: An integer mask with 1 for real tokens and 0 for padding, so the model can ignore the padded positions.
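To make the relation between the two keys concrete, here is a minimal NumPy sketch of right-padding and mask construction. The helper `pad_batch` and the token ids are hypothetical and only mimic what the tokenizer does with `padding=True`; this is not the library's implementation.

```python
import numpy as np

def pad_batch(sequences, pad_id=0):
    """Right-pad token-id lists to the longest one and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = np.full((len(sequences), max_len), pad_id, dtype=np.int32)
    attention_mask = np.zeros((len(sequences), max_len), dtype=np.int32)
    for row, seq in enumerate(sequences):
        input_ids[row, :len(seq)] = seq      # copy the real tokens
        attention_mask[row, :len(seq)] = 1   # mark them as real
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids; 101 and 102 play the role of [CLS] and [SEP]
batch = pad_batch([[101, 2009, 102], [101, 2009, 2003, 2307, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The shorter sequence is filled with the padding id, and its mask is 0 exactly at those filled positions.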

In [3]:
# Test sentence to extract the representations
text = ["WordPiece is the subword tokenization algorithm used for BERT", "It relies on the same base as BPE, which is to initialize the vocabulary"]

# Extract the encoding of the test sentence from the tokenizer
encoding = tokenizer(text, padding=True, return_tensors='tf')
encoding

{'input_ids': <tf.Tensor: shape=(2, 19), dtype=int32, numpy=
array([[  101,  2773, 11198,  2003,  1996,  4942, 18351, 19204,  3989,
         9896,  2109,  2005, 14324,   102,     0,     0,     0,     0,
            0],
       [  101,  2009, 16803,  2006,  1996,  2168,  2918,  2004, 17531,
         2063,  1010,  2029,  2003,  2000,  3988,  4697,  1996, 16188,
          102]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(2, 19), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
      dtype=int32)>}

Now explore the representation of the sentences. *tokenizer.batch_decode* will decode a set of sentences. *tokenizer.decode* will decode one sentence.

In [4]:
# Using tokenizer.batch_decode visualize the representation of the sentences
tokenizer.batch_decode(encoding['input_ids'])

['[CLS] wordpiece is the subword tokenization algorithm used for bert [SEP] [PAD] [PAD] [PAD] [PAD] [PAD]',
 '[CLS] it relies on the same base as bpe, which is to initialize the vocabulary [SEP]']

Note that the [CLS], [SEP] and [PAD] tokens are present, and that the text has been lowercased.

This tokenizer is ready to use with BERT models.

Now, load the model itself.

In [17]:
# Initialize the TFDistilBertModel model with pre-trained weights
model = TFDistilBertModel.from_pretrained('distilbert-base-uncased')


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertModel: ['vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight']
- This IS expected if you are initializing TFDistilBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFDistilBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.


**QUESTION: Print the model summary to view the number of parameters.
Compare the number with BigBert. What fraction of the parameters was saved?**

#### Enter your answer here


In [6]:
# View the summary of the model
model.summary()

Model: "tf_distil_bert_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMa  multiple                  66362880  
 inLayer)                                                        
                                                                 
Total params: 66362880 (253.15 MB)
Trainable params: 66362880 (253.15 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
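One way to turn the count above into a fraction is sketched below. The DistilBERT count comes from the summary; the reference count is an assumption (roughly 110 million is the commonly quoted size of BERT-base), so substitute the count of whichever BERT variant you are comparing against.

```python
def fraction_saved(small_params, reference_params):
    """Fraction of the reference model's parameters that the smaller model avoids."""
    return 1.0 - small_params / reference_params

distilbert_params = 66_362_880   # from model.summary() above
reference_params = 110_000_000   # ASSUMPTION: approximate size of BERT-base

saved = fraction_saved(distilbert_params, reference_params)
print(f"Fraction of parameters saved: {saved:.1%}")
```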


Now, explore the model output.

In [7]:
# Get the output of the model for the test sentence
output = model(encoding)

# Take a look at the dimension of the output of the model
output[0].shape

TensorShape([2, 19, 768])

It has shape [2, 19, 768]. As always, the first dimension is the batch size, the second is the (padded) sentence length in tokens, and the third is the embedding size.
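The indexing below is a small sketch of how to read such a tensor. It uses a random NumPy array as a stand-in for the model output (an assumption, so that it runs without downloading the model); only the shapes matter.

```python
import numpy as np

# Stand-in for output[0]: random values with the same shape as above
hidden_states = np.random.default_rng(0).normal(size=(2, 19, 768))

cls_vectors = hidden_states[:, 0, :]   # vector of the first token ([CLS]) of every sentence
second_sentence = hidden_states[1]     # all token vectors of the second sentence

print(cls_vectors.shape)       # one 768-dim vector per sentence
print(second_sentence.shape)   # one 768-dim vector per token position
```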

<div class='exercise'><b>Ex 1: Tokenization </b></div></br>

Instead of a vocabulary made only of the most frequent whole words, BERT uses WordPiece, which represents unseen words as a combination of subword tokens. </br>
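The splitting step can be sketched as a greedy longest-match-first search over the vocabulary. The function and the tiny vocabulary below are hypothetical and for illustration only; the real tokenizer uses the vocabulary shipped with the model and also handles casing, punctuation and special tokens.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                          # try a shorter candidate
        if piece is None:
            return [unk_token]                # nothing matches: the word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary (hypothetical)
toy_vocab = {"word", "##piece", "token", "##ization", "sub", "##word"}
print(wordpiece_tokenize("wordpiece", toy_vocab))
print(wordpiece_tokenize("tokenization", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

A word that is not in the vocabulary as a whole is still representable as long as it can be covered by known pieces; only when no piece fits does it fall back to the unknown token.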

Load the IMDB dataset and use the tokenizer.

Truncate the sentences with more than 200 tokens.</br>

In [8]:
#pip install datasets

In [37]:
# Read the IMDB Dataset
from datasets import load_dataset
dataset = load_dataset("imdb")
train_data = dataset['train']['text']
train_label = np.array(dataset['train']['label'])

test_data =  dataset['test']['text']
test_label =  np.array(dataset['test']['label'])

In [10]:
# Clean the IMDB data (separate sentences based on delimiters, remove html tags and any non-alphanumeric characters).
# Do this for both the train and test data
import re

def clean_text(text):
    # Separate sentences based on delimiters
    text = re.sub(r'([.!?])', r' \1 ', text)
    # Remove HTML tags
    text = re.sub(r'<[^>]+>', '', text)
    # Remove non-alphanumeric characters
    text = re.sub(r'[^a-zA-Z0-9.!? ]', '', text)
    return text

# clean the train data
train_data = [clean_text(text) for text in train_data]

# clean the test data
test_data = [clean_text(text) for text in test_data]
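It can help to check the cleaning function on a single string before trusting it on 25,000 reviews. The sketch below repeats the same three substitutions so that it runs on its own; the sample review is hypothetical and not taken from the dataset.

```python
import re

def clean_text(text):
    # Same three steps as the cleaning cell above
    text = re.sub(r'([.!?])', r' \1 ', text)      # pad sentence delimiters with spaces
    text = re.sub(r'<[^>]+>', '', text)           # drop HTML tags
    text = re.sub(r'[^a-zA-Z0-9.!? ]', '', text)  # drop other non-alphanumeric characters
    return text

# Hypothetical review snippet
sample = "Great movie!<br /><br />Don't miss it."
cleaned = clean_text(sample)
print(repr(cleaned))
```

Note that the last step also removes apostrophes, so `Don't` becomes `Dont`, and that the cleaned string can end with a trailing space.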

In [11]:
# Tokenize the train data and truncate the sentences with more than 200 tokens
# set padding as true and return_tensors as np
train_tokenized = tokenizer(train_data, truncation=True, padding=True, max_length=200, return_tensors='np')


In [12]:
# Tokenize the test data and truncate the sentences with more than 200 tokens
# set padding as true and return_tensors as np
test_tokenized = tokenizer(test_data, truncation=True, padding=True, max_length=200, return_tensors='np')


<div class='exercise'><b>Ex 2: Model </b></div></br>
To construct your model, use DistilBERT as a normal layer, which receives inputs and outputs a tensor. The only difference is that you have to set it up as a non-trainable layer using the following code.</br>

This freezes the DistilBERT parameters, so they are not updated during the fine-tuning phase.</br>

In [18]:
# Freeze the model parameters
model.trainable = False

In [22]:
tf.keras.backend.clear_session()

# Define a dictionary whose values are the input layers.
# The keys should be 'input_ids' and 'attention_mask'.
inputs = {
    'input_ids': tf.keras.Input(shape=(200,), dtype=tf.int32, name='input_ids'),
    'attention_mask': tf.keras.Input(shape=(200,), dtype=tf.int32, name='attention_mask')
}


In [None]:
model.summary()

In [32]:
# Get the sentence embedding of the test sentence by passing the input dictionary
test_embedding = model(inputs)

# Get the embedding of the CLS token of the test sentence
cls_embedding = test_embedding[0][:,0,:]

# Define the output layer of the model with sigmoid activation
# Pass the embedding of the CLS token
output = tf.keras.layers.Dense(1, activation='sigmoid')(cls_embedding)

# Bring the model together with the input dictionary as input
# and the dense layer defined above as output
new_model = tf.keras.Model(inputs=inputs, outputs=output)

# Compile the model with loss as binary_crossentropy and adam optimizer
new_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
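A quick back-of-the-envelope count shows what freezing buys you. The sketch below assumes the setup of this lab: a single `Dense(1)` head on the 768-dimensional [CLS] embedding, and the DistilBERT parameter count reported by `model.summary()`. The helper `dense_param_count` is hypothetical.

```python
def dense_param_count(input_dim, units):
    """Weights plus biases of a fully connected layer."""
    return input_dim * units + units

distilbert_params = 66_362_880            # from model.summary() earlier
head_params = dense_param_count(768, 1)   # Dense(1) on the 768-dim [CLS] vector

trainable_frozen = head_params                        # DistilBERT frozen
trainable_unfrozen = distilbert_params + head_params  # DistilBERT trainable

print(trainable_frozen, trainable_unfrozen)
```

You can compare these numbers with the trainable and non-trainable counts reported by `new_model.summary()`.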

In [36]:
print(train_tokenized.data)

{'input_ids': array([[  101,  1045, 12524, ..., 10036,  2135,   102],
       [  101,  1045,  2572, ...,  2053,  8991,   102],
       [  101,  2065,  2069, ...,     0,     0,     0],
       ...,
       [  101,  2023,  2143, ...,     0,     0,     0],
       [  101,  1996,  7357, ..., 15468,  2008,   102],
       [  101,  1996,  2466, ...,     0,     0,     0]]), 'attention_mask': array([[1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 0, 0, 0],
       ...,
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 0, 0, 0]])}


In [31]:
print(train_label)

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

The object _train_tokenized_ is more than just a dictionary: it is a `BatchEncoding`. _train_tokenized.data_ gives you the plain dictionary it wraps, which is what we pass to Keras.
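The tokenizer's return type builds on Python's `collections.UserDict`, which keeps its contents in a plain `dict` attribute called `data`. The sketch below illustrates that pattern with a hypothetical stand-in class, `ToyEncoding`, rather than the real tokenizer output.

```python
from collections import UserDict

class ToyEncoding(UserDict):
    """Hypothetical stand-in for the tokenizer's return type."""

enc = ToyEncoding({"input_ids": [[101, 102]], "attention_mask": [[1, 1]]})

print(isinstance(enc, dict))       # the wrapper itself is not a dict
print(isinstance(enc.data, dict))  # its .data attribute is a plain dict
print(list(enc.data.keys()))
```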

In [38]:
# Fit the model on the tokenized train data and use the tokenized
# test data for validation
# Set the number of epochs and batch size
new_model.fit(train_tokenized.data, train_label, epochs=2, batch_size=32, validation_data=(test_tokenized.data, test_label))


Epoch 1/2
Epoch 2/2


<tf_keras.src.callbacks.History at 0x788124493550>

Now try training the model without freezing DistilBERT, that is, with its parameters left trainable. It will probably crash, even with DistilBERT's reduced number of parameters.

In [None]:
# Your code here

In [None]:
# Fit the model on the tokenized train data and use the tokenized
# test data for validation
# Set the number of epochs and batch size
___