<a href="https://colab.research.google.com/github/Bosy-Ayman/IR/blob/main/Week(13)_IR_Transformers_%26_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



# **DSAI 201 - Information Retrieval - Zewail City**


This lab aims to build basic practical knowledge of how to use the BERT transformer model.

The **learning outcomes** of this notebook are:
1. Understand what Hugging Face is and how to use it.
2. Learn how to tokenize a sentence and convert it to the input format BERT requires (special tokens, sentence length, and attention mask).
3. Perform tokenization on a given dataset.
4. Inspect the architecture of each layer in BERT in practice.
5. Get to know the output of BERT.
6. Use BERT embeddings to compute cosine similarity.




## Utilize the GPU of Colab
In this session, we will run experiments that require a GPU. To run them on the GPU provided by Colab, do the following:

1. Go to Menu > Runtime > Change runtime type.

2. Change the hardware accelerator to GPU.

Then run the following cell to confirm that the GPU is detected.

In [None]:
import torch

# Choose GPU as device to run the experiments on
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

cuda:0


## **Hugging Face**
[Hugging Face](https://huggingface.co/) is an NLP-focused startup with a large open-source community, in particular around the Transformers library. 🤗 Transformers is a Python library that exposes an API for many well-known transformer architectures, such as BERT, RoBERTa, GPT-2, and DistilBERT, which obtain state-of-the-art results on a variety of NLP tasks like text classification, information extraction, question answering, and text generation. These architectures come pre-trained with several sets of weights. Getting started with Transformers only requires installing the pip package:

In [None]:
# install the transformers library
!pip install transformers



In [None]:
#we need to import the following libraries.
import pandas as pd
import re
from tqdm import tqdm

#to display the full text on the notebook without truncation
pd.set_option('display.max_colwidth', 150)

### Transformer Components in Hugging face
Transformers is based around the concept of pre-trained transformer models. These transformer models come in different shapes, sizes, and architectures and have their own ways of accepting input data: via tokenization.

The library builds on three main classes:
1. **The configuration class:** hosts relevant information concerning the model we will be using, such as the number of layers and the number of attention heads. Below is an example of a BERT configuration file, for the pre-trained weights bert-base-cased. The configuration classes host these attributes with various I/O methods and standardized name properties.

{
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 28996
}

2. **The tokenizer class:** the tokenizer class takes care of converting Python strings into arrays or tensors of integers, which are indices in the model vocabulary. It has many handy features revolving around the tokenization of a string into tokens. Tokenization varies from model to model, so each model has its own tokenizer.
3. **The model class:** the model class holds the neural network modeling logic itself.
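
To make the configuration values above more concrete, the following sketch parses the same JSON with Python's standard `json` module and derives a few quantities from it. The names `config_json`, `head_dim`, `word_embedding_params`, and `ffn_expansion` are illustrative and not part of the Transformers API; in practice the library's configuration class loads these values for you.

```python
import json

# The bert-base-cased configuration shown above, as a plain JSON string.
config_json = """
{
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 28996
}
"""

config = json.loads(config_json)

# Each attention head works on an equal slice of the hidden vector.
head_dim = config["hidden_size"] // config["num_attention_heads"]

# The word embedding table has one row of size hidden_size per vocabulary entry.
word_embedding_params = config["vocab_size"] * config["hidden_size"]

# The feed-forward layer expands the hidden size by this factor.
ffn_expansion = config["intermediate_size"] // config["hidden_size"]

print("Dimension per attention head:", head_dim)
print("Parameters in the word embedding table:", word_embedding_params)
print("Feed-forward expansion factor:", ffn_expansion)
```

The `max_position_embeddings` value of 512 is also where the maximum input length discussed later in this notebook comes from.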

## Loading BERT model


In the following example, we show how to load the BERT model. BERT is one of the best-known models in the literature and performs well on many language tasks.

In [None]:
from transformers import AutoTokenizer, AutoModel

model_name = "bert-base-uncased"

bert_tokenizer = AutoTokenizer.from_pretrained(model_name)
bert_model = AutoModel.from_pretrained(model_name)

## **Perform tokenization**

To feed our text to BERT, it must be split into tokens, and these tokens must then be mapped to their indices in the tokenizer vocabulary.

Tokenization must be performed by the tokenizer that ships with the BERT model. Here we use the pretrained BERT tokenizer loaded above to tokenize a given sentence. We just need to provide the sentence as an input string to the tokenizer.
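
BERT's tokenizer is a WordPiece tokenizer: words missing from the vocabulary are split into smaller pieces, with continuation pieces marked by a `##` prefix. The sketch below illustrates the greedy longest-match-first idea on a tiny, made-up vocabulary. `toy_vocab`, its IDs, and `wordpiece_tokenize` are assumptions for illustration only; the real tokenizer also handles lowercasing and punctuation splitting, and uses a much larger vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches: the whole word becomes the unknown token.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary; the position in the list serves as the token ID.
toy_vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "token", "##ization", "retrieval"]
token_to_id = {token: index for index, token in enumerate(toy_vocab)}

pieces = wordpiece_tokenize("tokenization", token_to_id)
piece_ids = [token_to_id[p] for p in pieces]
print(pieces, piece_ids)
```

With this toy vocabulary, a word that matches a full entry (such as `retrieval`) stays in one piece, while a word with no matching pieces maps to `[UNK]`.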

In [None]:
## Example 1 of tokenization
text1 = "This is week 8 of the Information Retrieval course. Today's lesson is very important in the field of natural language processing."
tokenized_text1 = bert_tokenizer.tokenize(text1)
text1_token_ids = bert_tokenizer.convert_tokens_to_ids(tokenized_text1)

# Print the original sentence.
print('Original text1: ', text1)

# Print the sentence split into tokens.
print('Tokenized text1 : ', tokenized_text1)

# Print the sentence mapped to token ids.
print('Token IDs of text1: ', text1_token_ids)

Original text1:  This is week 8 of the Information Retrieval course. Today's lesson is very important in the field of natural language processing.
Tokenized text1 :  ['this', 'is', 'week', '8', 'of', 'the', 'information', 'retrieval', 'course', '.', 'today', "'", 's', 'lesson', 'is', 'very', 'important', 'in', 'the', 'field', 'of', 'natural', 'language', 'processing', '.']
Token IDs of text1:  [2023, 2003, 2733, 1022, 1997, 1996, 2592, 26384, 2607, 1012, 2651, 1005, 1055, 10800, 2003, 2200, 2590, 1999, 1996, 2492, 1997, 3019, 2653, 6364, 1012]


In [None]:
## Example 1 of tokenization, this time with the special tokens added
text1 = "This is week 8 of the Information Retrieval course. Today's lesson is very important in the field of natural language processing."
tokenized_text1 = bert_tokenizer.tokenize(text1, add_special_tokens=True)
text1_token_ids = bert_tokenizer.convert_tokens_to_ids(tokenized_text1)

# Print the original sentence.
print('Original text1: ', text1)

# Print the sentence split into tokens.
print('Tokenized text1 : ', tokenized_text1)

# Print the sentence mapped to token ids.
print('Token IDs of text1: ', text1_token_ids)

Original text1:  This is week 8 of the Information Retrieval course. Today's lesson is very important in the field of natural language processing.
Tokenized text1 :  ['[CLS]', 'this', 'is', 'week', '8', 'of', 'the', 'information', 'retrieval', 'course', '.', 'today', "'", 's', 'lesson', 'is', 'very', 'important', 'in', 'the', 'field', 'of', 'natural', 'language', 'processing', '.', '[SEP]']
Token IDs of text1:  [101, 2023, 2003, 2733, 1022, 1997, 1996, 2592, 26384, 2607, 1012, 2651, 1005, 1055, 10800, 2003, 2200, 2590, 1999, 1996, 2492, 1997, 3019, 2653, 6364, 1012, 102]


## **BERT Required Formatting**



The input to the BERT model has to be in a specific format. We need to:
1. Add special tokens to the start and end of each sentence.
2. Pad & truncate all sentences to a single constant length. Maximum allowed length is 512.
3. Explicitly differentiate real tokens from padding tokens with the "attention mask".


<!-- **`[SEP]`** -->

At the end of every sentence, we need to append the special **`[SEP]`** token.

This token is an artifact of two-sentence tasks, where BERT is given two separate sentences and asked to do some task on them (e.g., can the answer to the question in sentence A be found in sentence B?).
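
As a minimal sketch of the two-sentence layout described above, the following pure-Python function assembles `[CLS] A [SEP] B [SEP]` together with segment IDs (the `token_type_ids` that also appear in the tokenizer output later in this notebook). The function name `build_pair_input` and the example tokens are hypothetical; the Hugging Face tokenizer does this for you when you pass it two texts.

```python
def build_pair_input(tokens_a, tokens_b=None):
    """Assemble BERT-style input tokens and segment IDs for one or two sentences."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    token_type_ids = [0] * len(tokens)  # segment 0: [CLS], sentence A, first [SEP]
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        token_type_ids += [1] * (len(tokens_b) + 1)  # segment 1: sentence B, final [SEP]
    return tokens, token_type_ids


# Hypothetical, already-tokenized question and passage.
question = ["who", "wrote", "it", "?"]
passage = ["she", "did", "."]

pair_tokens, pair_segments = build_pair_input(question, passage)
print(pair_tokens)
print(pair_segments)

# A single sentence still ends with [SEP] and uses segment 0 throughout.
single_tokens, single_segments = build_pair_input(question)
print(single_tokens)
print(single_segments)
```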

In [None]:
sep_token = bert_tokenizer.sep_token

# print sep token of the tokenizer
print("Sep token : ", sep_token)

# print the token id of sep token
print('Token ID of sep token : ',  bert_tokenizer.convert_tokens_to_ids(sep_token))

Sep token :  [SEP]
Token ID of sep token :  102


<!-- **`[CLS]`** -->

>  "The first token of every sequence is always a special classification token (**`[CLS]`**). The final hidden state
corresponding to this token is used as the aggregate sequence representation for classification
tasks." ([BERT paper](https://arxiv.org/pdf/1810.04805.pdf))


This token has special significance. BERT-base consists of 12 Transformer layers. Each layer takes in a list of token embeddings and produces the same number of embeddings as output.

At the output of the final (12th) layer, *only* the first embedding (corresponding to the [CLS] token) is used by the classifier.

![Illustration of CLS token purpose](https://drive.google.com/uc?export=view&id=1ck4mvGkznVJfW3hv6GUqcdGepVTOx7HE)


In [None]:
cls_token = bert_tokenizer.cls_token

# print cls token of the tokenizer
print("Cls token : ", cls_token)

# print the token id of cls token
print('Token ID of cls token : ',  bert_tokenizer.convert_tokens_to_ids(cls_token))

Cls token :  [CLS]
Token ID of cls token :  101


### **Sentence Length & Attention Mask**

BERT has two constraints:
1. All sentences in a batch must be padded or truncated to a single, fixed length.
2. The maximum sentence length is 512 tokens.

Padding is done with a special **`[PAD]`** token, which is at index 0 in the BERT vocabulary.

The **"Attention Mask"** is simply an array of 1s and 0s indicating which tokens are padding and which aren't. This mask tells the "Self-Attention" mechanism in BERT not to incorporate these PAD tokens into its interpretation of the sentence.
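
To see why masked positions do not influence the result, here is a minimal pure-Python sketch of a softmax over attention scores in which padded positions are set to negative infinity before normalization. This is a common way to apply such a mask, shown here under simplifying assumptions (one query, one head, made-up scores, at least one real token); `masked_softmax` is a hypothetical helper, not a library function.

```python
import math


def masked_softmax(scores, attention_mask):
    """Softmax over scores where positions with mask 0 receive zero weight."""
    # Padded positions get -inf so that exp() turns them into exactly 0.
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, attention_mask)]
    max_score = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up attention scores for 4 positions; the last position is a [PAD] token.
scores = [2.0, 1.0, 0.5, 3.0]
attention_mask = [1, 1, 1, 0]

weights = masked_softmax(scores, attention_mask)
print(weights)
```

Even though the padded position has the highest raw score, its weight is zero, and the remaining weights still sum to one.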

The illustration below demonstrates padding out to a `MAX_LEN` of 8 tokens.
<img src="https://drive.google.com/uc?export=view&id=1cb5xeqLu_5vPOgs3eRnail2Y00Fl2pCo" width="600">
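
The padding and masking steps can be sketched in a few lines of plain Python. The helper `pad_and_mask` below is hypothetical and simplified (the tokenizer does this work for you); it uses the token IDs printed elsewhere in this notebook for `[CLS]` (101), `[SEP]` (102), and `[PAD]` (0).

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def pad_and_mask(token_ids, max_len):
    """Add special tokens, then truncate or pad to max_len and build the mask."""
    # Reserve two positions for [CLS] and [SEP], truncating the text if needed.
    body = token_ids[: max_len - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)  # 1 marks a real token

    # Fill the remaining positions with [PAD]; 0 marks padding in the mask.
    n_pad = max_len - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad
    return input_ids, attention_mask


# Three word-piece IDs padded out to a MAX_LEN of 8.
short_ids, short_mask = pad_and_mask([2651, 2057, 2097], max_len=8)
print(short_ids)
print(short_mask)

# Ten IDs truncated to a MAX_LEN of 8 (six text tokens plus [CLS] and [SEP]).
long_ids, long_mask = pad_and_mask(list(range(1000, 1010)), max_len=8)
print(long_ids)
print(long_mask)
```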



The maximum length does impact training and evaluation speed.
For example, with a Tesla K80:

`MAX_LEN = 128  -->  Training epochs take ~5:28 each`

`MAX_LEN = 64   -->  Training epochs take ~2:57 each`

### Perform tokenization on a sentence

In [None]:
# `encode_plus` will:
#   (1) Tokenize the sentence.
#   (2) Prepend the `[CLS]` token to the start.
#   (3) Append the `[SEP]` token to the end.
#   (4) Map tokens to their IDs.
#   (5) Pad or truncate the sentence to `max_length`
#   (6) Create attention masks for [PAD] tokens.

text = "Today we will learn how to use the BERT model in practice and conduct some experiments."
encoding= bert_tokenizer.encode_plus(
                  text,                      # Sentence to encode.
                  add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                  truncation=True,
                  max_length = 32,           # Pad & truncate all sentences.
                  padding="max_length",
                  return_attention_mask = True,   # Construct attention mask
                  return_tensors = 'pt',     # Return pytorch tensors.
              )


# Print the input ids and attention mask of the encoded sentence
print("Original text: ", text)
print("Input ids: ", encoding["input_ids"].flatten())
print("Attention mask: ", encoding["attention_mask"].flatten())
# Note in the output of the next line that the [CLS], [SEP], and [PAD] tokens were added automatically
print("Tokenized text: ", bert_tokenizer.convert_ids_to_tokens(encoding["input_ids"].flatten()))

Original text:  Today we will learn how to use the BERT model in practice and conduct some experiments.
Input ids:  tensor([  101,  2651,  2057,  2097,  4553,  2129,  2000,  2224,  1996, 14324,
         2944,  1999,  3218,  1998,  6204,  2070,  7885,  1012,   102,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0])
Attention mask:  tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])
Tokenized text:  ['[CLS]', 'today', 'we', 'will', 'learn', 'how', 'to', 'use', 'the', 'bert', 'model', 'in', 'practice', 'and', 'conduct', 'some', 'experiments', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']


## **Dataset tokenization**

Let's tokenize the movie descriptions in the IMDB movie dataset.
First, we need to load the dataset.




In [None]:
imdb_dataset_url = "https://raw.githubusercontent.com/LearnDataSci/articles/master/Python%20Pandas%20Tutorial%20A%20Complete%20Introduction%20for%20Beginners/IMDB-Movie-Data.csv"

reviews = pd.read_csv(imdb_dataset_url)
reviews.head()


Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced to work together to stop a fanatical warrior from taking control of the universe.,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe Saldana",2014,121,8.1,757074,333.13,76.0
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a team finds a structure on a distant moon, but they soon realize they are not alone.",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fassbender, Charlize Theron",2012,124,7.0,485820,126.46,65.0
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diagnosed 23 distinct personalities. They must try to escape before the apparent emergence of a frightfu...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richardson, Jessica Sula",2016,117,7.3,157606,138.12,62.0
3,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling theater impresario's attempt to save his theater with a singing competition becomes grander than he anti...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth MacFarlane, Scarlett Johansson",2016,108,7.2,60545,270.32,59.0
4,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of the most dangerous incarcerated super-villains to form a defensive task force. Their first mission: sa...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola Davis",2016,123,6.2,393727,325.02,40.0


Now, we need to encode each movie description in this dataset.
Encoding the dataset is your job now.

In [None]:
def encode(text, max_length=32):
    return bert_tokenizer.encode_plus(
                  text,                      # Sentence to encode.
                  add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                  truncation=True,
                  max_length = max_length,           # Pad & truncate all sentences.
                  padding="max_length",
                  return_attention_mask = True,   # Construct attention mask
                  return_tensors = 'pt',     # Return pytorch tensors.
    )

tokenized_reviews = []
for description in tqdm(reviews["Description"].values, desc="Tokenizing ..."):
    tokenized_reviews.append(encode(description, max_length=32))

Tokenizing ...: 100%|██████████| 1000/1000 [00:00<00:00, 2265.40it/s]


In [None]:
tokenized_reviews[0]


{'input_ids': tensor([[  101,  1037,  2177,  1997,  6970,  9692, 28804, 12290,  2024,  3140,
          2000,  2147,  2362,  2000,  2644,  1037,  5470, 12070,  2389,  6750,
          2013,  2635,  2491,  1997,  1996,  5304,  1012,   102,     0,     0,
             0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 0, 0, 0, 0]])}

In [None]:
len(tokenized_reviews)

1000

## BERT layers

Let's see in practice the layers that make up the BERT model. Displaying the model lists every layer it contains.

In [None]:
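# Move the model to the selected device (GPU if available, otherwise CPU) and display its layers
bert_model.to(device)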


BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
  

## BERT output

Let's see what BERT outputs for a given sentence.

In [None]:
input_ids = tokenized_reviews[0]["input_ids"].to(device)
attention_mask = tokenized_reviews[0]["attention_mask"].to(device)
output = bert_model(input_ids=input_ids, attention_mask=attention_mask)

In [None]:
output[0].shape # batch_size x sequence_length x embedding_dimension

torch.Size([1, 32, 768])

Let's look at the embedding vectors of the tokens. In this case, we have 32 tokens, and each token has an embedding vector of length 768.

In [None]:
# print the embedding of all input tokens.
all_embeddings = output[0][0]
print(all_embeddings.shape)
print(all_embeddings)

torch.Size([32, 768])
tensor([[-0.5374, -0.0045, -0.1489,  ..., -0.1589,  0.2519,  0.3107],
        [-0.7558,  0.0240, -0.7778,  ..., -0.0435,  0.4008, -0.0553],
        [ 0.1333,  0.0783, -0.0168,  ..., -0.4033,  0.1526, -0.1878],
        ...,
        [ 0.0496, -0.2305,  0.0917,  ...,  0.0366,  0.1320, -0.0457],
        [ 0.2256, -0.2471,  0.1439,  ...,  0.0176,  0.0063, -0.2181],
        [ 0.1711, -0.2524,  0.0136,  ...,  0.0411,  0.0254, -0.1964]],
       device='cuda:0', grad_fn=<SelectBackward0>)


In [None]:
# print the cls embedding
cls_embedding = output[0][0][0]
print(cls_embedding.shape)
print(cls_embedding)

torch.Size([768])
tensor([-5.3743e-01, -4.5285e-03, -1.4889e-01,  2.6006e-01, -5.4038e-01,
        -1.0482e-01,  5.7612e-02,  3.7673e-01,  2.9624e-01, -4.7522e-01,
        -1.3860e-01, -1.2560e-01,  9.2107e-02,  1.1082e+00,  6.0340e-01,
         2.4000e-01, -2.3380e-01,  1.6385e-01,  5.4327e-01, -1.8306e-01,
         1.4615e-02, -6.7008e-01, -1.5706e-01,  4.8035e-01,  8.6288e-02,
        -1.9906e-01, -3.1754e-01,  2.3249e-01,  4.4113e-01,  3.2665e-01,
        -3.6641e-01,  4.9974e-02, -3.6367e-01, -7.1475e-01,  3.9699e-01,
         2.8407e-01, -1.6459e-02,  2.3711e-01, -2.6904e-01,  3.8026e-01,
        -4.8049e-01,  5.8538e-03,  2.1590e-01,  4.6976e-02, -8.9185e-02,
        -8.1980e-01, -2.4482e+00, -3.5167e-01, -1.7509e-01,  4.3415e-01,
         3.2940e-01, -2.5368e-01,  6.2703e-01, -1.2208e-01, -7.2259e-02,
         6.8791e-01, -1.0008e-01,  4.4677e-01, -1.0553e-01,  4.7799e-01,
         5.0411e-01,  3.2362e-01, -2.5777e-01, -1.8169e-01, -2.7259e-01,
         1.5589e-01, -1.0433e-01,

In [None]:
# print the embedding of the first real token (index 1, right after [CLS])
first_token_embedding = output[0][0][1]
print(first_token_embedding.shape)
print(first_token_embedding)

torch.Size([768])
tensor([-7.5579e-01,  2.3958e-02, -7.7775e-01, -2.9636e-02,  1.3142e-01,
         3.4829e-01, -7.1905e-01,  2.2947e-01,  5.7406e-01, -5.2165e-01,
        -6.3331e-01, -3.9139e-01,  6.1251e-01,  7.6919e-01,  2.4809e-01,
         4.1138e-03,  3.1326e-01, -2.1136e-01,  3.4086e-01,  3.7136e-01,
        -2.0355e-01, -2.5799e-01, -9.1278e-01,  1.2476e+00,  6.0668e-01,
        -3.5893e-01,  2.8126e-01,  9.9478e-01,  7.2137e-01,  8.1404e-01,
        -4.0528e-02, -2.9032e-01, -1.1082e-01, -4.8372e-01, -2.4480e-01,
         2.6875e-01,  2.1412e-01,  2.4016e-01, -4.4493e-01,  5.2279e-01,
         8.4595e-02, -4.5354e-01,  1.4433e-01,  2.5544e-01,  2.4262e-01,
        -3.2311e-01,  4.1547e-01, -5.2544e-01,  6.2329e-02,  3.1465e-01,
        -3.1489e-01, -1.1175e-01, -6.6004e-02, -3.6600e-01, -1.4372e-01,
         7.6647e-01, -1.8431e-01,  9.4430e-01, -4.3552e-01,  4.6231e-01,
         1.5902e-01,  9.4874e-01,  9.6713e-01, -5.6813e-01, -2.2341e-01,
         6.0240e-01, -3.8815e-01,

## Cosine Similarity & Embeddings

In [None]:
text_king = "king"
text_man = "man"
text_apple = "apple"

# Function to convert text to embeddings
def get_embeddings(text):
  # Encode text and compute input_ids and attention mask
  tokens = encode(text)
  input_ids = tokens["input_ids"].to(device)
  attention_mask = tokens["attention_mask"].to(device)
  # Pass input_ids and attention mask to model
  output = bert_model(input_ids=input_ids, attention_mask=attention_mask)
  return output

embed_king = get_embeddings(text_king)
embed_man = get_embeddings(text_man)
embed_apple = get_embeddings(text_apple)

In [None]:
embed_king

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[-0.1318,  0.1997,  0.0593,  ..., -0.1656, -0.0187, -0.0151],
         [-0.1006, -0.0524,  0.0772,  ..., -0.2972, -0.1973, -0.6630],
         [ 0.8228,  0.2570, -0.2939,  ..., -0.0014, -0.8282, -0.2063],
         ...,
         [-0.4145, -0.0221,  0.6602,  ..., -0.0110,  0.0749, -0.2281],
         [-0.2734,  0.0148,  0.7546,  ..., -0.0276,  0.0481, -0.2719],
         [-0.6719, -0.0510,  0.0872,  ...,  0.0179,  0.0716, -0.2357]]],
       device='cuda:0', grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-0.7962, -0.2311,  0.4424,  0.6373, -0.3603, -0.1046,  0.8782,  0.1833,
          0.2154, -0.9996,  0.1175,  0.2204,  0.9610, -0.2259,  0.8968, -0.3678,
         -0.1455, -0.4299,  0.3994, -0.7417,  0.5365,  0.8699,  0.6029,  0.1351,
          0.3400,  0.2738, -0.5180,  0.8717,  0.9329,  0.6474, -0.6485,  0.1758,
         -0.9658, -0.2114,  0.3730, -0.9662,  0.1246, -0.7125, -0.0458,  0.0131,
         -0.808

In [None]:
# Embedding of the word "king" at index 1 (since index 0 is reserved for [CLS] token)
embed_king[0][0][1]

tensor([-1.0063e-01, -5.2394e-02,  7.7210e-02, -3.3986e-01, -1.0123e-01,
        -3.0228e-02,  1.7702e-01, -1.2432e-01, -4.0533e-01, -1.0918e+00,
        -8.1277e-01,  1.5461e-01,  1.9359e-01,  4.6548e-01,  1.2872e-01,
        -5.5175e-02, -1.7732e-02, -6.9906e-02,  3.8718e-01,  1.1769e-01,
        -3.8392e-01,  1.5189e-01, -7.4889e-03, -9.0575e-02,  2.6472e-01,
         5.8122e-01, -1.8173e-01,  1.4358e-01,  8.3955e-02,  4.5901e-01,
         3.5349e-01, -6.7306e-01,  1.6623e-01,  2.6586e-01, -5.9197e-01,
        -4.3626e-01,  5.0111e-01, -2.3916e-01, -6.2190e-01, -1.3644e-01,
         4.4553e-01, -5.7102e-01,  7.1198e-01, -4.4230e-01,  2.3134e-01,
         2.2516e-01,  1.0018e-01,  1.6920e-01,  2.4917e-01,  2.6961e-01,
        -8.8481e-02,  2.2383e-01, -6.7094e-01,  2.2819e-01,  1.8186e-01,
         6.0945e-02,  5.2540e-01,  3.3595e-01, -5.2457e-01, -2.2025e-01,
         4.0992e-01, -8.3914e-01, -3.9756e-02, -5.4600e-01,  9.7653e-02,
         6.9954e-01,  1.1494e+00,  2.9574e-01, -1.0

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
# helper function
def compute_cosine_similarity(a, b):
    # Move array to CPU memory, detach from computational graph which requires_grad, convert to numpy & reshape
    a = a.cpu().detach().numpy().reshape(1, -1)
    b = b.cpu().detach().numpy().reshape(1, -1)
    cos_sim = cosine_similarity(a, b)
    return cos_sim
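
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following pure-Python sketch spells that formula out on small made-up vectors; the function name `cosine` is hypothetical, and the helper above relies on scikit-learn for the same computation.

```python
import math


def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])       # vectors at a right angle
opposite = cosine([1.0, 0.0], [-1.0, 0.0])        # vectors pointing in opposite directions

print(same_direction, orthogonal, opposite)
```

The score depends only on the angle between the vectors, not on their lengths, which is why it is a convenient way to compare embeddings.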

In [None]:
# Compare cosine similarity between (king, man, apple) embeddings

king_apple_similarity = compute_cosine_similarity(embed_king[0][0][1], embed_apple[0][0][1])
print(f"Similarity between King & Apple is {king_apple_similarity * 100}")

king_man_similarity = compute_cosine_similarity(embed_king[0][0][1], embed_man[0][0][1])
print(f"Similarity between King & Man is {king_man_similarity * 100}")

man_apple_similarity = compute_cosine_similarity(embed_man[0][0][1], embed_apple[0][0][1])
print(f"Similarity between Man & Apple is {man_apple_similarity * 100}")

Similarity between King & Apple is [[66.72247]]
Similarity between King & Man is [[74.478874]]
Similarity between Man & Apple is [[65.9001]]


In [None]:
king_queen_similarity = compute_cosine_similarity(get_embeddings("queen")[0][0][1], embed_king[0][0][1])
print(f"Similarity between King & Queen is {king_queen_similarity * 100}")

Similarity between King & Queen is [[86.43858]]


## **Exercise 1**
Given the following sentences:

sent_1 = "A research team at Zewail City conducted an information retrieval course in English"

sent_2 = "Natural Language Processing is one of the most popular research topics in 2024"

1. Compute the embeddings of these sentences.
2. Check whether the embeddings of the word "research" in the two sentences are equal by computing their cosine similarity.
3. Compute the cosine similarity between the two sentence embeddings.

In [None]:

sent_1 = "A research team at Zewail City conducted an information retrieval course in English"
sent_2 = "Natural Language Processing is one of the most popular research topics in 2024"

# The `compute_cosine_similarity` helper defined above already moves each
# tensor to the CPU, detaches it, and converts it to a NumPy array,
# so pass the embedding tensors to it directly:
# cosine_score = compute_cosine_similarity(token1_embedding, token2_embedding)

### **References**


* [Hugging Face: State-of-the-Art Natural Language Processing in ten lines of TensorFlow 2.0](https://blog.tensorflow.org/2019/11/hugging-face-state-of-art-natural.html)
* [BERT Fine-Tuning Tutorial with PyTorch](https://mccormickml.com/2019/07/22/BERT-fine-tuning/)

