## Classify text with BERT

Let’s walk through the code to see how to build BERT with PyTorch.

We will break the entire program into 4 sections:

1. Preprocessing
2. Building model
3. Loss and Optimization
4. Training


## First Test

In [7]:
from transformers import pipeline
unmasker = pipeline('fill-mask', model='bert-base-uncased')
unmasker("The [MASK] went to the store")


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.05779837816953659,
  'token': 3337,
  'token_str': 'boys',
  'sequence': 'the boys went to the store'},
 {'score': 0.04151124134659767,
  'token': 3057,
  'token_str': 'girls',
  'sequence': 'the girls went to the store'},
 {'score': 0.03639920428395271,
  'token': 2500,
  'token_str': 'others',
  'sequence': 'the others went to the store'},
 {'score': 0.03291618078947067,
  'token': 2273,
  'token_str': 'men',
  'sequence': 'the men went to the store'},
 {'score': 0.03129786252975464,
  'token': 2048,
  'token_str': 'two',
  'sequence': 'the two went to the store'}]

In [50]:
from transformers import pipeline

# Load a text-generation pipeline with GPT-2
generator = pipeline('text-generation', model='gpt2')

# Ask a question
question = "What is the capital of France?"

# Generate an answer
response = generator(question, max_length=50, num_return_sequences=1)

# Print the generated answer
print(response[0]['generated_text'])


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


What is the capital of France?

The capital of France is Paris. This is the largest part of France; it's also the busiest in Europe. By the way, the capital was originally named after France's founder, Napoleon Bonaparte


In [1]:
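# Note: `generator` must already exist (run the text-generation cell above first);
# in a fresh kernel this cell raises the NameError shown below.
question = "What is the capital of Egypt?"

# Generate an answer
response = generator(question, max_length=50, num_return_sequences=1)

# Print the generated answer
print(response[0]['generated_text'])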

NameError: name 'generator' is not defined

In [56]:
question1= 'Who is Victor Hugo'

reponse1 = generator(question1, max_length = 50, num_return_sequences = 1)

print(reponse1[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Who is Victor Hugo?

(From: The Amazing World of Harry Potter Fan Fic Archive)

Virgil: The Man Who Dived Down [The Man Who Saved the World]

(From: Harry Potter Book


## Second Test

### Preprocessing

In [8]:
text = (
    'Hello, how are you? I am Romeo.\n'
    'Hello, Romeo My name is Juliet. Nice to meet you.\n'
    'Nice meet you too. How are you today?\n'
    'Great. My baseball team won the competition.\n'
    'Oh Congratulations, Juliet\n'
    'Thanks you Romeo'
)

Then we clean the data by:

- Converting the sentences to lower case and stripping punctuation.
- Creating the vocabulary, i.e. the list of unique words in the document.


In [11]:
import re

In [14]:
# Remove '.', ',', '!', '?' and '-', lower-case the text, then split it into one sentence per line
sentences = re.sub("[.,!?-]", '', text.lower()).split('\n')
# Unique words across all sentences
word_list = list(set(" ".join(sentences).split()))

In [18]:
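word_dict = {'[PAD]': 0, '[CLS]': 1, '[SEP]': 2, '[MASK]': 3}
for i, w in enumerate(word_list):
    word_dict[w] = i + 4  # ids 0-3 are reserved for the special tokens
# Build the reverse mapping and the vocabulary size once, after the loop
number_dict = {i: w for i, w in enumerate(word_dict)}
vocab_size = len(word_dict)
# Each sentence as a list of token ids (used by make_batch below)
token_list = [[word_dict[w] for w in sentence.split()] for sentence in sentences]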

Once that is taken care of, we need to create a function that formats the input sequences for three types of embeddings: token embedding, segment embedding, and position embedding.

What is token embedding?

A token embedding represents each token of the formatted input sequence. For instance, if the input is “The cat is walking. The dog is barking”, then the function should create the sequence “[CLS] the cat is walking [SEP] the dog is barking [SEP]”. 

After that, we convert every token to its index in the word dictionary. So the previous sequence would look something like “[1, 5, 7, 9, 10, 2, 5, 6, 9, 11, 2]”. Keep in mind that 1 and 2 are [CLS] and [SEP] respectively. 
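
The formatting and indexing steps can be sketched as follows. This is a minimal illustration, not the tutorial's own function: the `vocab` dictionary and the `format_pair` helper are hypothetical, with indices chosen only to reproduce the example above. The sketch also appends a closing [SEP] after the second sentence, as `make_batch` does later in this section.

```python
# Toy vocabulary (hypothetical); indices are chosen only to match the example above.
vocab = {
    '[PAD]': 0, '[CLS]': 1, '[SEP]': 2, '[MASK]': 3,
    'the': 5, 'dog': 6, 'cat': 7, 'is': 9, 'walking': 10, 'barking': 11,
}

def format_pair(sentence_a, sentence_b, vocab):
    """Return the token ids for '[CLS] a [SEP] b [SEP]'."""
    tokens_a = [vocab[w] for w in sentence_a.lower().split()]
    tokens_b = [vocab[w] for w in sentence_b.lower().split()]
    return [vocab['[CLS]']] + tokens_a + [vocab['[SEP]']] + tokens_b + [vocab['[SEP]']]

input_ids = format_pair("The cat is walking", "The dog is barking", vocab)
print(input_ids)  # [1, 5, 7, 9, 10, 2, 5, 6, 9, 11, 2]
```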

What is segment embedding?

A segment embedding tells the model which of the two sentences each token belongs to; the segments are generally labeled 0 and 1. 
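
As a rough sketch, the segment ids for a sentence pair can be built from the two sentence lengths. The helper name `make_segment_ids` is hypothetical; the layout (segment 0 covers [CLS], the first sentence and its [SEP], segment 1 covers the second sentence and the closing [SEP]) follows the `segment_ids` line in `make_batch` later in this section.

```python
def make_segment_ids(len_a, len_b):
    """Segment 0: [CLS] + sentence A + [SEP]; segment 1: sentence B + [SEP]."""
    return [0] * (1 + len_a + 1) + [1] * (len_b + 1)

# Two sentences of four tokens each, e.g. "the cat is walking" / "the dog is barking"
segment_ids = make_segment_ids(4, 4)
print(segment_ids)  # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```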

What is position embedding?

A position embedding encodes the position of each token in the sequence. 

We will create a function for position embedding later. 

![embeddings](assets/embeddings.jpg)


Now the next step will be to create masking. 

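As described in the original paper, BERT randomly selects 15% of the tokens in each sequence for prediction. Of the selected tokens, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged. But keep in mind that you don’t assign masks to the special tokens. For that, we will use conditional statements.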
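
The masking rule can be sketched in isolation as below. This is an illustrative sketch under stated assumptions, not the tutorial's function: `mask_tokens`, `SPECIAL_IDS` and `MASK_ID` are hypothetical names, ids 0 to 3 are assumed to be the special tokens, and random replacements are drawn only from non-special ids (the `make_batch` function later in this section draws from the whole vocabulary instead). It follows the paper's split of the selected positions: 80% become [MASK], 10% become a random token, 10% stay unchanged.

```python
import random

SPECIAL_IDS = {0, 1, 2, 3}  # assumed ids of [PAD], [CLS], [SEP], [MASK]
MASK_ID = 3

def mask_tokens(input_ids, vocab_size, mask_prob=0.15, max_pred=5, rng=random):
    """Return (masked ids, masked positions, original tokens at those positions)."""
    ids = list(input_ids)
    # Special tokens are never candidates for masking
    candidates = [i for i, tok in enumerate(ids) if tok not in SPECIAL_IDS]
    rng.shuffle(candidates)
    n_pred = min(max_pred, max(1, round(len(ids) * mask_prob)))
    masked_pos, masked_tokens = [], []
    for pos in candidates[:n_pred]:
        masked_pos.append(pos)
        masked_tokens.append(ids[pos])
        r = rng.random()
        if r < 0.8:                                   # 80%: replace with [MASK]
            ids[pos] = MASK_ID
        elif r < 0.9:                                 # 10%: replace with a random token
            ids[pos] = rng.randrange(4, vocab_size)
        # remaining 10%: keep the original token
    return ids, masked_pos, masked_tokens

rng = random.Random(0)  # seeded only to make the sketch repeatable
ids, masked_pos, masked_tokens = mask_tokens(
    [1, 5, 7, 9, 10, 2, 5, 6, 9, 11, 2], vocab_size=12, rng=rng)
print(ids, masked_pos, masked_tokens)
```

Returning the original tokens alongside their positions is what lets the loss later compare the model's predictions against the true words.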

Once the selected tokens are masked, we add padding. Padding makes sure that all the sequences are of equal length. For instance, if we take the two sentences:

 “The cat is walking” and “The dog is barking at the tree”

then with padding, they will look like this: 

“[CLS] the cat is walking [PAD] [PAD] [PAD]” and “[CLS] the dog is barking at the tree” 

The length of the first sequence is now equal to the length of the second. In the code below, every sequence is padded with 0, the index of [PAD], up to a fixed maxlen. 
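
A minimal sketch of the padding step on its own. The helper `pad_to` and the token ids are hypothetical; the only thing taken from this tutorial is that [PAD] has index 0.

```python
PAD_ID = 0  # index of [PAD] in the word dictionary

def pad_to(ids, maxlen, pad_id=PAD_ID):
    """Right-pad a list of token ids with pad_id up to maxlen."""
    if len(ids) > maxlen:
        raise ValueError("sequence is longer than maxlen")
    return ids + [pad_id] * (maxlen - len(ids))

short_seq = [1, 5, 7, 9, 10]               # [CLS] the cat is walking
long_seq = [1, 5, 6, 9, 11, 12, 5, 13]     # [CLS] the dog is barking at the tree
maxlen = max(len(short_seq), len(long_seq))
padded = [pad_to(seq, maxlen) for seq in (short_seq, long_seq)]
print(padded[0])  # [1, 5, 7, 9, 10, 0, 0, 0]
```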

In [None]:
import random

# Assumes sentences, token_list, word_dict, number_dict and vocab_size from the cells above,
# plus the hyperparameters batch_size, max_pred and maxlen, are already defined.
def make_batch():
    batch = []
    positive = negative = 0
    # Collect an equal number of IsNext and NotNext pairs
    while positive != batch_size/2 or negative != batch_size/2:
        tokens_a_index, tokens_b_index = random.randrange(len(sentences)), random.randrange(len(sentences))

        tokens_a, tokens_b = token_list[tokens_a_index], token_list[tokens_b_index]

        input_ids = [word_dict['[CLS]']] + tokens_a + [word_dict['[SEP]']] + tokens_b + [word_dict['[SEP]']]
        segment_ids = [0] * (1 + len(tokens_a) + 1) + [1] * (len(tokens_b) + 1)

        # MASK LM
        n_pred = min(max_pred, max(1, int(round(len(input_ids) * 0.15))))  # 15% of the tokens in the sequence
        # Special tokens are never candidates for masking
        cand_maked_pos = [i for i, token in enumerate(input_ids)
                          if token != word_dict['[CLS]'] and token != word_dict['[SEP]']]
        random.shuffle(cand_maked_pos)
        masked_tokens, masked_pos = [], []
        for pos in cand_maked_pos[:n_pred]:
            masked_pos.append(pos)
            masked_tokens.append(input_ids[pos])
            if random.random() < 0.8:  # 80%: replace with [MASK]
                input_ids[pos] = word_dict['[MASK]']
            elif random.random() < 0.5:  # 10% (half of the remaining 20%): replace with a random token
                index = random.randint(0, vocab_size - 1)  # random index in vocabulary
                input_ids[pos] = word_dict[number_dict[index]]
            # remaining 10%: keep the original token

        # Zero-pad the input and segment ids up to maxlen
        n_pad = maxlen - len(input_ids)
        input_ids.extend([0] * n_pad)
        segment_ids.extend([0] * n_pad)

        # Zero-pad masked_tokens and masked_pos up to max_pred entries
        if max_pred > n_pred:
            n_pad = max_pred - n_pred
            masked_tokens.extend([0] * n_pad)
            masked_pos.extend([0] * n_pad)

        if tokens_a_index + 1 == tokens_b_index and positive < batch_size/2:
            batch.append([input_ids, segment_ids, masked_tokens, masked_pos, True])  # IsNext
            positive += 1
        elif tokens_a_index + 1 != tokens_b_index and negative < batch_size/2:
            batch.append([input_ids, segment_ids, masked_tokens, masked_pos, False])  # NotNext
            negative += 1
    return batch

Since we are also dealing with next-sentence prediction, we have to create a label that says whether the second sentence follows the first, i.e. IsNext or NotNext. So we assign True to every pair in which the second sentence directly follows the first, and we use a conditional statement to do that. 

For instance, two sentences in a document usually follow each other if they are in context. So if the first sentence is at position A, then the next sentence should be at position A+1. In code, if tokens_a_index + 1 == tokens_b_index, i.e. the second sentence immediately follows the first in the same context, then we set the label for this input to True. 

If the above condition is not met, i.e. if tokens_a_index + 1 != tokens_b_index, then we set the label for this input to False. 
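
The labeling rule can be isolated in a few lines. The helper `is_next` and the sample index pairs are hypothetical and only illustrate the condition described above.

```python
def is_next(index_a, index_b):
    """True when sentence B directly follows sentence A in the document."""
    return index_a + 1 == index_b

# (tokens_a_index, tokens_b_index) pairs drawn from a four-sentence document
pairs = [(0, 1), (1, 0), (2, 3), (0, 3)]
labels = [is_next(a, b) for a, b in pairs]
print(labels)  # [True, False, True, False]
```

Note that a pair in reverse order, such as (1, 0), counts as NotNext: only the sentence that comes immediately after is a positive example.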

### Building Model

BERT is a complex model, and if you try to take it in all at once you lose track of the logic. So it makes sense to explain it component by component, along with the function of each.

BERT has the following components:

1. Embedding layers
2. Attention Mask
3. Encoder layer
    - Multi-head attention
    - Scaled dot product attention
    - Position-wise feed-forward network
4. BERT (assembling all the components)


Embedding layer

The embedding is the first layer in BERT: it is a lookup table that maps each input index to a vector. The parameters of the embedding layers are learnable, which means that by the end of training the embeddings of similar words tend to cluster together. 

The embeddings also capture different relationships between words, such as semantic, syntactic, and linear ones, and since BERT is bidirectional, the encoder layers that follow can build contextual relationships on top of them as well. 

In the case of BERT, it creates three embeddings: 

- Token
- Segment
- Position
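
To make the lookup-and-sum idea concrete, here is a small sketch in plain Python. It is not the tutorial's model: the table sizes, the random initial values and the names `make_table` and `embed` are all assumptions for illustration, and in a real PyTorch model each table would be a learnable `torch.nn.Embedding` layer. The point is only that each token's vector is the sum of three lookups: by token id, by segment id, and by position.

```python
import random

rng = random.Random(0)  # seeded only to make the sketch repeatable
vocab_size, n_segments, maxlen, d_model = 12, 2, 16, 4  # illustrative sizes

def make_table(n_rows, dim):
    """A lookup table: one vector of length dim per row."""
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_rows)]

tok_table = make_table(vocab_size, d_model)   # one vector per vocabulary entry
seg_table = make_table(n_segments, d_model)   # one vector per segment (0 or 1)
pos_table = make_table(maxlen, d_model)       # one vector per position

def embed(input_ids, segment_ids):
    """Sum the token, segment and position vectors for every token."""
    out = []
    for pos, (tok, seg) in enumerate(zip(input_ids, segment_ids)):
        out.append([t + s + p for t, s, p in
                    zip(tok_table[tok], seg_table[seg], pos_table[pos])])
    return out

input_ids = [1, 5, 7, 2, 5, 6, 2]
segment_ids = [0, 0, 0, 0, 1, 1, 1]
vectors = embed(input_ids, segment_ids)
print(len(vectors), len(vectors[0]))  # 7 4
```

Token 5 appears twice in `input_ids`, but its two vectors differ because the segment and position lookups differ.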

If you recall, we have not yet created a function that formats the input for the position embedding, although the formatting for tokens and segments is done. So we will take the input and create a position index for each token in the sequence. It looks something like this:

In [None]:
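import torch
# expand_as() needs a tensor argument (passing an int raises a TypeError);
# a zero tensor of shape (batch, seq_len) stands in for a real batch of input ids here.
input_ids = torch.zeros(2, 30, dtype=torch.long)
print(torch.arange(30, dtype=torch.long).expand_as(input_ids))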