<a href="https://colab.research.google.com/github/telsayed/IR-in-Arabic/blob/master/Summer2021/labs/day8/IR_in_Arabic_Lab8_Transformers_%26_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



# **IR in Arabic** - Summer 2021 lab **Day 8**


This is one of a series of Colab notebooks created for the **IR in Arabic** course. This lab aims to build basic practical knowledge of how to use the BERT transformer model.

The **learning outcomes** of this notebook are:
1. Understand what Hugging Face is and how to use it.
2. Learn how to tokenize a sentence and convert it to the input format BERT requires (special tokens, sentence length, and attention mask).
3. Perform tokenization on a given dataset.
4. Inspect the architecture of each layer of BERT in practice.
5. Get to know the output of BERT.
6. Use BERT embeddings to compute cosine similarity.




## Utilize the GPU of Colab
In this session, we will run experiments that require a GPU. To run them on the GPU provided by Colab, do the following:

1. Go to Menu > Runtime > Change runtime type.

2. Change the hardware accelerator to GPU.

Then run the following cell to confirm that the GPU is detected.

In [None]:
import tensorflow as tf
import torch
# Get the GPU device name.
device_name = tf.test.gpu_device_name()

# The device name should look like the following:
if device_name == '/device:GPU:0':
    print('Found GPU at: {}'.format(device_name))
else:
    raise SystemError('GPU device not found')

# Choose GPU as device to run the experiments on 
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

## **Hugging Face** 
[Hugging Face](https://huggingface.co/) is an NLP-focused startup with a large open-source community, in particular around the Transformers library. 🤗 Transformers is a Python library that exposes an API for many well-known transformer architectures, such as BERT, RoBERTa, GPT-2, and DistilBERT, which obtain state-of-the-art results on a variety of NLP tasks like text classification, information extraction, question answering, and text generation. These architectures come pre-trained with several sets of weights. Getting started with Transformers only requires installing the pip package:

In [None]:
#install the transformer library
!pip install transformers

In [None]:
# install needed libraries
#install the Pyterrier framework
!pip install python-terrier
#install the Arabic stop words library
!pip install Arabic-Stopwords
#we need to import the following libraries.
import pandas as pd
#to display the full text on the notebook without truncation
pd.set_option('display.max_colwidth', 150)
import re
from snowballstemmer import stemmer
from tqdm import tqdm
import arabicstopwords.arabicstopwords as stp

### Transformer Components in Hugging face
Transformers is based around the concept of pre-trained transformer models. These transformer models come in different shapes, sizes, and architectures and have their own ways of accepting input data: via tokenization.

The library builds on three main classes:
1. **The configuration class:** hosts relevant information concerning the model we will be using, such as the number of layers and the number of attention heads. Below is an example of a BERT configuration file, for the pre-trained weights bert-base-cased. The configuration classes host these attributes with various I/O methods and standardized name properties.

{
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 28996
}

2. **The tokenizer class:** the tokenizer class converts Python strings into arrays or tensors of integers, which are indices in the model vocabulary. It has many handy features revolving around splitting a string into tokens. Tokenization varies from model to model, so each model has its own tokenizer.
3. **The model class:** the model class holds the neural network modeling logic itself.
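
As a rough illustration of how the configuration values shown above relate to one another, the sketch below parses a subset of that JSON with Python's standard `json` module (it does not use the Transformers configuration class) and derives two quantities from it. The variable names `head_dim` and `ffn_ratio` are chosen for this example only.

```python
import json

# A subset of the bert-base-cased configuration shown above, as a JSON string.
config_json = """
{
  "hidden_size": 768,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12
}
"""
config = json.loads(config_json)

# Each attention head works on an equal slice of the hidden vector.
head_dim = config["hidden_size"] // config["num_attention_heads"]

# The feed-forward sub-layer expands the hidden vector by this factor.
ffn_ratio = config["intermediate_size"] // config["hidden_size"]

print("Per-head dimension:", head_dim)        # 768 // 12
print("Feed-forward expansion:", ffn_ratio)   # 3072 // 768
```

The `max_position_embeddings` value is where the 512-token input limit mentioned later in this lab comes from.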

## Loading an Arabic BERT model


In the following example, we show how to load AraBERT, an Arabic BERT model. AraBERT is one of the best-known models in the literature and performs well on many Arabic tasks.

In [None]:
from transformers import AutoTokenizer, AutoModel

model_name = "aubmindlab/bert-base-arabertv01"

arabert_tokenizer = AutoTokenizer.from_pretrained(model_name)
arabert_model = AutoModel.from_pretrained(model_name)


## **Perform tokenization**

To feed our text to BERT, it must be split into tokens, and then these tokens must be mapped to their indices in the tokenizer vocabulary.
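
To make the two steps concrete, here is a minimal pure-Python sketch of the greedy longest-match-first sub-word splitting that BERT-style WordPiece tokenizers are generally described as using, followed by the token-to-id lookup. The vocabulary, its ids, and the function name `wordpiece` are made up for illustration; a real tokenizer has its own vocabulary and additional pre-processing.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into sub-word pieces, always taking the longest match."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary (token -> id).
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "token": 4, "##ize": 5, "##r": 6, "bert": 7}

# Step 1: split the text into tokens.
tokens = [p for w in "bert tokenizer".split() for p in wordpiece(w, vocab)]
# Step 2: map each token to its index in the vocabulary.
ids = [vocab[t] for t in tokens]

print(tokens)
print(ids)
```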

Tokenization must be performed by the tokenizer that ships with the model. Here, we use the pretrained AraBERT tokenizer to tokenize a given sentence. We just need to provide the sentence as an input string to the loaded tokenizer.

In [None]:
## Example 1 of tokenization
text1 = "هذا هو اليوم التاسع من دورة استرجاع المعلومات باللغة العربية. درس اليوم مهم جدا في مجال معالجة اللغات الطبيعية"
tokenized_text1 = arabert_tokenizer.tokenize(text1)
text1_token_ids = arabert_tokenizer.convert_tokens_to_ids(tokenized_text1)

# Print the original sentence.
print('Original text1: ', text1)

# Print the sentence split into tokens.
print('Tokenized text1 : ', tokenized_text1)

# Print the sentence mapped to token ids.
print('Token IDs of text1: ', text1_token_ids)



In [None]:
# Example 2 of tokenization 
# This example should be completed by students 
text2 = "سنتعلم اليوم كيف نستخدم نموذج برت بشكل عملي ونجري بعض التجارب"

# TODO (student): replace each `None` with the right tokenizer call,
# following Example 1, and share your answer in the chat.
tokenized_text2 = None  # tokenize text2
text2_token_ids = None  # map the tokens of text2 to their ids

# Print the original sentence.
print('Original text2: ', text2)

# Print the sentence split into tokens.
print('Tokenized text2 : ', tokenized_text2)

# Print the sentence mapped to token ids.
print('Token IDs of text2: ', text2_token_ids)

## **BERT Required Formatting**



The input to a BERT model has to be in a specific format. We need to:
1. Add special tokens to the start and end of each sentence.
2. Pad and truncate all sentences to a single constant length. The maximum allowed length is 512 tokens.
3. Explicitly differentiate real tokens from padding tokens with the "attention mask".


<!-- **`[SEP]`** -->

At the end of every sentence, we need to append the special **`[SEP]`** token. 

This token is an artifact of two-sentence tasks, where BERT is given two separate sentences and asked to do some task on them (e.g., can the answer to the question in sentence A be found in sentence B?). 
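
For two-sentence tasks, the standard BERT layout is `[CLS] A [SEP] B [SEP]`, with a segment id per token marking which sentence it belongs to. The sketch below shows that layout on plain Python lists; the function name `build_pair_input` and the example tokens are made up for illustration and are not part of the Transformers API.

```python
def build_pair_input(tokens_a, tokens_b):
    """Lay out two tokenized sentences the way BERT expects a sentence pair."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A, and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

question = ["who", "wrote", "it"]
passage = ["she", "wrote", "it", "yesterday"]
pair_tokens, pair_segments = build_pair_input(question, passage)

print(pair_tokens)
print(pair_segments)
```

The two segment values correspond to the `type_vocab_size` of 2 in the configuration shown earlier.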

In [None]:
sep_token =arabert_tokenizer.sep_token

# print sep token of the tokenizer
print("Sep token : ", sep_token)

# print the token id of sep token
print('Token ID of sep token : ',  arabert_tokenizer.convert_tokens_to_ids(sep_token))

<!-- **`[CLS]`** -->

>  "The first token of every sequence is always a special classification token (**`[CLS]`**). The final hidden state
corresponding to this token is used as the aggregate sequence representation for classification
tasks." ([BERT paper](https://arxiv.org/pdf/1810.04805.pdf))


This token has special significance. BERT-base consists of 12 Transformer layers. Each layer takes in a list of token embeddings and produces the same number of embeddings as output.

At the output of the final (12th) layer, *only* the first embedding (corresponding to the [CLS] token) is used by the classifier.

![Illustration of CLS token purpose](https://drive.google.com/uc?export=view&id=1ck4mvGkznVJfW3hv6GUqcdGepVTOx7HE)


In [None]:
cls_token =arabert_tokenizer.cls_token

# print cls token of the tokenizer
print("Cls token : ", cls_token)

# print the token id of cls token
print('Token ID of cls token : ',  arabert_tokenizer.convert_tokens_to_ids(cls_token))

### **Sentence Length & Attention Mask**

BERT has two constraints:
1. All sentences in a batch must be padded or truncated to a single, fixed length.
2. The maximum sentence length is 512 tokens.

Padding is done with a special **`[PAD]`** token, which is at index 0 in the BERT vocabulary (for any loaded tokenizer, the index is available as `pad_token_id`).

The **"Attention Mask"** is simply an array of 1s and 0s indicating which tokens are padding and which aren't. This mask tells the "Self-Attention" mechanism in BERT not to incorporate these PAD tokens into its interpretation of the sentence.

The illustration below demonstrates padding out to a "MAX_LEN" of 8 tokens.
<img src="https://drive.google.com/uc?export=view&id=1cb5xeqLu_5vPOgs3eRnail2Y00Fl2pCo" width="600">
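
The same padding logic can be sketched in plain Python. This is only an illustration of the idea, not the tokenizer's actual implementation: the function name `encode_ids` and the `[CLS]`/`[SEP]` ids (2 and 3) are assumptions for this example, while the `[PAD]` id of 0 follows the text above.

```python
def encode_ids(content_ids, max_len, cls_id=2, sep_id=3, pad_id=0):
    """Add special tokens, then truncate or pad to exactly max_len ids."""
    # Leave room for [CLS] and [SEP] before truncating the content.
    kept = content_ids[: max_len - 2]
    input_ids = [cls_id] + kept + [sep_id]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [1] * len(input_ids)
    n_pad = max_len - len(input_ids)
    return input_ids + [pad_id] * n_pad, attention_mask + [0] * n_pad

# A short sentence (4 content tokens) is padded up to MAX_LEN = 8.
short_ids, short_mask = encode_ids([11, 12, 13, 14], max_len=8)
print(short_ids)
print(short_mask)

# A long sentence (10 content tokens) is truncated down to MAX_LEN = 8.
long_ids, long_mask = encode_ids(list(range(10, 20)), max_len=8)
print(long_ids)
print(long_mask)
```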



The maximum length does impact training and evaluation speed. 
For example, with a Tesla K80:

`MAX_LEN = 128  -->  Training epochs take ~5:28 each`

`MAX_LEN = 64   -->  Training epochs take ~2:57 each`

### Perform tokenization on a sentence

In [None]:
# `encode_plus` will:
#   (1) Tokenize the sentence.
#   (2) Prepend the `[CLS]` token to the start.
#   (3) Append the `[SEP]` token to the end.
#   (4) Map tokens to their IDs.
#   (5) Pad or truncate the sentence to `max_length`
#   (6) Create attention masks for [PAD] tokens.

text = "سنتعلم اليوم كيف نستخدم نموذج برت بشكل عملي ونجري بعض التجارب"
encoding= arabert_tokenizer.encode_plus(
                  text,                      # Sentence to encode.
                  add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                  truncation=True,
                  max_length = 32,           # Pad & truncate all sentences.
                  padding="max_length",
                  return_attention_mask = True,   # Construct attention mask
                  return_tensors = 'pt',     # Return pytorch tensors.
              )

    
# Print the input ids and attention mask of the encoded sentence
print("Original text: ", text)
print("Input ids: ", encoding["input_ids"].flatten())
print("Attention mask: ", encoding["attention_mask"].flatten())
# Note in the output of the next line that the cls, sep, and pad tokens were added automatically
print("Tokenized text: ", arabert_tokenizer.convert_ids_to_tokens(encoding["input_ids"].flatten()))

## **Dataset tokenization**

Let's tokenize the EveTAR dataset.
First, we need to load the dataset.




In [None]:
dataset_links=["https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-01.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-02.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-03.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-04.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-05.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-06.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-07.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-08.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-09.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-10.txt"]

full_data=pd.DataFrame()
for i in tqdm(range(len(dataset_links))):
    tweets=pd.read_csv(dataset_links[i], sep='\t')
    full_data=pd.concat([full_data,tweets],ignore_index=True)
full_data.reset_index(inplace=True,drop=True)
full_data  

Now, we need to encode each tweet in this dataset.
Encoding the dataset is your job now.

In [None]:
def encode(text, max_length=32):
    return arabert_tokenizer.encode_plus(
                  text,                      # Sentence to encode.
                  add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                  truncation=True,
                  max_length = max_length,           # Pad & truncate all sentences.
                  padding="max_length",
                  return_attention_mask = True,   # Construct attention mask
                  return_tensors = 'pt',     # Return pytorch tensors.
    )

tokenized_tweets = []
for tweet in tqdm(full_data["tweetText"].values, desc="Tokenizing ..."):
    tokenized_tweets.append(encode(tweet, max_length=32))

In [None]:
tokenized_tweets[0]


In [None]:
len(tokenized_tweets)

## BERT layers

Let's look at the layers that make up the BERT model. Printing the model shows every layer in it.

In [None]:
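# Move the model to the selected device (GPU if available, otherwise CPU);
# the returned model is displayed with all of its layers.
arabert_model.to(device)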


## BERT output 

Let's see what BERT outputs for a given sentence.

In [None]:
input_ids = tokenized_tweets[0]["input_ids"].to(device)
attention_mask = tokenized_tweets[0]["attention_mask"].to(device)
# Inference only, so skip gradient tracking to save memory.
with torch.no_grad():
    output = arabert_model(input_ids=input_ids, attention_mask=attention_mask)

In [None]:
output[0].shape # batch_size x sequence_length x embedding_dimension

Let's look at the embedding vectors of the tokens. In this case, we have 32 tokens, and each token has an embedding vector of length 768.

In [None]:
# print the embedding of all input tokens.
all_embeddings = output[0][0]
print(all_embeddings.shape)
print(all_embeddings)

In [None]:
# print the cls embedding
cls_embedding = output[0][0][0]
print(cls_embedding.shape)
print(cls_embedding)

In [None]:
# print the embedding of the first real token (index 0 is [CLS], so it sits at index 1)
first_token_embedding = output[0][0][1]
print(first_token_embedding.shape)
print(first_token_embedding)
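
The three cells above index the same tensor at different depths. As a rough aid, the sketch below reproduces that indexing on a tiny nested Python list standing in for `output[0]`; the sizes (1 x 4 x 3 instead of 1 x 32 x 768) and all numbers are made up for illustration.

```python
# Stand-in for output[0]: batch_size x sequence_length x embedding_dimension.
last_hidden_state = [
    [
        [0.1, 0.2, 0.3],   # position 0: [CLS]
        [0.4, 0.5, 0.6],   # position 1: first real token
        [0.7, 0.8, 0.9],   # position 2: second real token
        [0.0, 0.1, 0.0],   # position 3: [SEP]
    ]
]

sentence = last_hidden_state[0]    # like output[0][0]: all token vectors
cls_vector = sentence[0]           # like output[0][0][0]: the [CLS] vector
first_token_vector = sentence[1]   # like output[0][0][1]: first real token

shape = (len(last_hidden_state), len(sentence), len(cls_vector))
print(shape)
print(cls_vector)
print(first_token_vector)
```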

## **Exercise 1**
Given the following sentences:

sent_1 = "قام فريق بحثي في جامعة قطر باجراء دورة استرجاع معلومات باللغة العربية"

sent_2 = "قام الطلاب احتراما لقدوم استاذهم"

1. Compute the embeddings of these sentences
2. Check whether the embeddings of the word "قام" are equal by computing the cosine similarity.
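
To help interpret the score, the sketch below computes cosine similarity on small hand-made vectors using only the standard library. The function name `cosine` and the vectors are chosen for this example; it is independent of the NumPy helper used in the exercise. Vectors pointing the same way score close to 1 regardless of their length, orthogonal vectors score 0, and opposite vectors score -1.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length lists of numbers."""
    dot_product = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot_product / (norm_a * norm_b)

# Same direction, different length.
same_direction = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Perpendicular vectors.
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])
# Opposite directions.
opposite = cosine([1.0, 0.0], [-1.0, 0.0])

print(same_direction, orthogonal, opposite)
```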

In [None]:
from numpy import dot
from numpy.linalg import norm

# helper function 
def compute_cosine_similarity(a, b):
    cos_sim = dot(a, b) / (norm(a) * norm(b))
    return cos_sim

sent_1 = "قام فريق بحثي في جامعة قطر باجراء دورة استرجاع معلومات باللغة العربية"
sent_2 = "قام الطلاب احتراما لقدوم استاذهم"

# Before passing the embeddings to compute_cosine_similarity,
# convert them from tensors to NumPy arrays (moving them to the CPU first,
# since tensors on the GPU cannot be converted directly), as follows:
# input1 = token1_embedding.detach().cpu().numpy()
# input2 = token2_embedding.detach().cpu().numpy()
# cosine_score = compute_cosine_similarity(input1, input2)


## Exercise 2

Given a list of query-document pairs, compute the cosine similarity for each pair using AraBERT and store the values as a new column.

In [None]:
selected_queries = ['E14','E48', 'E36', 'E58', 'E19', 'E63', 'E30', 'E27', 'E39', 'E21']

data_path = "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/labs/day8/bm25_top_1000_with_text.txt"
df_data = pd.read_csv(data_path,  sep='\t',encoding="utf-8")
df_data= df_data[0:10]
df_data

In [None]:
# write your solution

### **References**


* [Hugging Face: State-of-the-Art Natural Language Processing in ten lines of TensorFlow 2.0](https://blog.tensorflow.org/2019/11/hugging-face-state-of-art-natural.html)
* [BERT Fine-Tuning Tutorial with PyTorch](https://mccormickml.com/2019/07/22/BERT-fine-tuning/)

