# <p style='text-align: center;'> Introduction to BERT Model </p>

## Why was BERT needed?
One of the biggest challenges in NLP is the lack of enough training data. Overall there is an enormous amount of text data available, but if we want to create task-specific datasets, we need to split that pile into many diverse fields. When we do this, we end up with only a few thousand or a few hundred thousand human-labeled training examples. Unfortunately, in order to perform well, deep learning based NLP models require much larger amounts of data; they see major improvements when trained on millions, or billions, of annotated training examples.

To help bridge this gap in data, researchers have developed various techniques for training general purpose language representation models using the enormous piles of unannotated text on the web (this is known as pre-training). These general purpose pre-trained models can then be fine-tuned on smaller task-specific datasets, e.g., when working with problems like question answering and sentiment analysis. This approach results in great accuracy improvements compared to training on the smaller task-specific datasets from scratch. BERT is a recent addition to these techniques for NLP pre-training; it caused a stir in the deep learning community because it presented state-of-the-art results in a wide variety of NLP tasks, like question answering.

The best part about BERT is that it can be downloaded and used for free: we can either use the BERT models to extract high quality language features from our text data, or we can fine-tune these models on a specific task, like sentiment analysis or question answering, with our own data to produce state-of-the-art predictions.

## What is the core idea behind it?
What is language modeling really about? Which problem are language models trying to solve? Basically, their task is to "**fill in the blank**" based on context. For example, given

"**The woman went to the store and bought a _____ of shoes**."

a language model might complete this sentence by saying that the word "**cart**" would fill the blank **20%** of the time and the word "**pair**" **80%** of the time.

In the **pre-BERT** world, a language model would have looked at this text sequence during training from either **left-to-right or combined left-to-right and right-to-left**. This **one-directional approach** works well for generating sentences: we can predict the next word, append it to the sequence, then predict the word after that, and so on until we have a complete sentence.

Now enters **BERT**, a language model which is **bidirectionally** trained (this is also its key technical innovation). This means we can now have a deeper sense of language context and flow compared to the single-direction language models.

## What is BERT?
**BERT** stands for **Bidirectional Encoder Representations from Transformers**. It is designed to **pre-train** deep **bidirectional** representations from unlabeled text by jointly conditioning on both left and right context. As a result, the pre-trained BERT model can be **fine-tuned** with just one additional output layer to create **state-of-the-art** models for a wide range of NLP tasks.

That sounds way too complex as a starting point. But it does summarize what BERT does pretty well so let’s break it down.

First, it’s easy to get that BERT stands for Bidirectional Encoder Representations from Transformers. Each word here has a meaning to it and we will encounter that one by one. For now, the key takeaway from this line is – **BERT is based on the Transformer architecture**.

Second, **BERT** is **pre-trained** on a large corpus of unlabelled text including the entire English **Wikipedia (that’s 2,500 million words!) and BooksCorpus (800 million words)**.

This **pre-training** step is half the magic behind BERT’s success. As we train a model on a large text corpus, it starts to pick up a deeper, more intimate understanding of how the language works. This knowledge is the **Swiss Army knife** that is useful for almost any NLP task.

Third, BERT is a "**deeply bidirectional**" model. Bidirectional means that BERT learns information from both the left and the right side of a token’s context during the training phase.

The bidirectionality of a model is important for truly understanding the meaning of a language. Let’s see an example to illustrate this. There are two sentences in this example and both of them involve the word "**bank**":

![image.png](attachment:image.png)

If we try to predict the nature of the word "**bank**" by only taking either the left or the right context, then we will be making an error in at least one of the two given examples.

One way to deal with this is to consider both the left and the right context before making a prediction. That’s exactly what BERT does! We will see later in the article how this is achieved.

And finally, the most impressive aspect of BERT: we can fine-tune it by adding just one additional output layer to create state-of-the-art models for a variety of NLP tasks.

## Basic Idea Of Bert

**BERT** stands for **Bidirectional Encoder Representations from Transformers**. It has been a major breakthrough in the field of NLP, improving results on many NLP tasks such as **question answering, named entity recognition, sentence classification** and many more. One of the major reasons for its success is that it is a **context-based** embedding model, unlike popular embedding models such as word2vec, which are **context-free**.

First, let’s understand the **difference between context-free and context-based models**. Consider the following sentences:

**Sentence A: He got bit by a Python.**

**Sentence B: Python is my favorite programming language.**

Reading both sentences, we can see that the word "**Python**" means something different in each. In **sentence A** the word "**Python**" refers to a **snake**, while in **sentence B** it refers to a **programming language**.

Now, if we get the embedding of the word "**Python**" from an embedding model like **word2vec**, we will get the same embedding in both sentences, so the word is treated as if it meant the same thing in both. This is because word2vec is a **context-free model**: it ignores the context and gives the **same embedding** for the word "**Python**" irrespective of the context.

BERT, on the other hand, is a **context-based model**. It takes the context into account and then generates the embedding for the word based on that context. So, for the preceding two sentences, it will give **different embeddings** for the word "**Python**".
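
The difference can be illustrated with a toy sketch. It is neither word2vec nor BERT: the lookup table is filled with random numbers, and the hypothetical `context_based_embedding` simply mixes a word's vector with the average of its sentence. It only shows why a fixed lookup cannot separate the two meanings while anything that looks at the surrounding words can.

```python
import random

random.seed(0)

sentence_a = ["he", "got", "bit", "by", "a", "python"]
sentence_b = ["python", "is", "my", "favorite", "programming", "language"]

# Hypothetical static table: one fixed 4-dimensional vector per word.
vocab = sorted(set(sentence_a + sentence_b))
static_table = {word: [random.gauss(0, 1) for _ in range(4)] for word in vocab}

def context_free_embedding(tokens, index):
    """Look the word up in the table; the rest of the sentence is ignored."""
    return static_table[tokens[index]]

def context_based_embedding(tokens, index):
    """Toy contextual embedding: mix the word vector with the sentence average."""
    word_vec = static_table[tokens[index]]
    dim = len(word_vec)
    sentence_vec = [sum(static_table[t][d] for t in tokens) / len(tokens) for d in range(dim)]
    return [0.5 * w + 0.5 * s for w, s in zip(word_vec, sentence_vec)]

free_a = context_free_embedding(sentence_a, 5)
free_b = context_free_embedding(sentence_b, 0)
ctx_a = context_based_embedding(sentence_a, 5)
ctx_b = context_based_embedding(sentence_b, 0)

print("context-free vectors identical: ", free_a == free_b)
print("context-based vectors identical:", ctx_a == ctx_b)
```

The context-free lookup returns the same vector for "python" in both sentences, while the context-dependent version returns two different vectors.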

## But how does this work? How does BERT understand context?
Let’s take **sentence A**. In this case BERT relates each word in the sentence to all the words in the sentence to get the contextual meaning of every word. By doing this, BERT can understand that the word "**Python**" denotes the **snake**. Similarly, in **sentence B** BERT understands that the word "**Python**" denotes a **programming language**.

![image.png](attachment:image.png)

**Now the question is: how exactly does BERT work? How does it understand the context?**

## Working Of Bert
BERT relies on a Transformer (the attention mechanism that learns contextual relationships between words in a text). A basic Transformer consists of an encoder to read the text input and a decoder to produce a prediction for the task. Since BERT’s goal is to generate a language representation model, it only needs the encoder part. The input to the encoder for BERT is a sequence of tokens, which are first converted into vectors and then processed in the neural network.

In a Transformer, we feed the sentence as input to the encoder, and it returns a representation of each word in the sentence as output. That is exactly what BERT is: **Encoder Representations from Transformers, and bidirectional because the Transformer’s encoder is bidirectional**.

**Once we feed the sentence as input to the encoder, the encoder understands the context using the multi-head attention mechanism.**

![image.png](attachment:image.png)
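
To make "relating each word to all the words in the sentence" concrete, here is a minimal sketch of single-head scaled dot-product self-attention. It is a simplification under stated assumptions: the token vectors are made-up numbers, and the learned query, key and value projections and the multiple heads of a real encoder layer are left out.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Single-head attention where queries, keys and values are the inputs themselves."""
    dim = len(vectors[0])
    outputs, weights = [], []
    for query in vectors:
        # Score the current token against every token in the sentence.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in vectors]
        attn = softmax(scores)
        weights.append(attn)
        # The new representation is the attention-weighted sum of all token vectors.
        outputs.append([sum(a * v[d] for a, v in zip(attn, vectors)) for d in range(dim)])
    return outputs, weights

tokens = ["he", "got", "bit", "by", "a", "python"]
# Hypothetical 2-dimensional token vectors, chosen only for illustration.
toy_vectors = [[1.0, 0.0], [0.5, 0.5], [0.9, 0.1], [0.2, 0.8], [0.0, 1.0], [1.0, 0.2]]

outputs, weights = self_attention(toy_vectors)
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

Each printed row shows how much one token attends to every token in the sentence, including tokens on both its left and its right.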


### Configuration of Bert

**The researchers presented BERT in two main configurations:**

- **BERT-base**

- **BERT-large**

**BERT-base** has **12 encoder layers** stacked one on top of the other, **12 attention heads**, and **768 hidden units**. The total number of parameters in BERT-base is **110 million**.

**BERT-large** has **24 encoder layers** stacked one on top of the other, **16 attention heads**, and **1024 hidden units**. The total number of parameters in BERT-large is **340 million**.

There are other configurations of BERT apart from the two standard ones, such as BERT-mini, BERT-tiny, and BERT-medium.

We can use the smaller configurations of BERT in settings where computational resources are limited. However, the standard configurations give more accurate results and are the most widely used.
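
As a rough sanity check on these sizes, the parameter count can be estimated from the layer count and hidden size. The sketch below rests on assumptions that are not stated in this article: a vocabulary of 30,522 tokens, 512 positions, 2 segment types, a feed-forward size of four times the hidden size, and a pooling layer on top. The number of attention heads does not change the total, because the heads split the hidden size between them.

```python
def bert_parameter_count(num_layers, hidden_size, vocab_size=30522,
                         max_positions=512, num_segments=2):
    """Estimate the number of weights in a BERT-style encoder (assumed layout)."""
    ffn_size = 4 * hidden_size
    layer_norm = 2 * hidden_size  # scale and bias

    # Token, position and segment tables, followed by a layer norm.
    embeddings = (vocab_size + max_positions + num_segments) * hidden_size + layer_norm

    # Query, key, value and output projections (weights plus biases), then a layer norm.
    attention = 4 * (hidden_size * hidden_size + hidden_size) + layer_norm

    # Two dense layers (hidden -> ffn -> hidden), then a layer norm.
    feed_forward = (hidden_size * ffn_size + ffn_size) + (ffn_size * hidden_size + hidden_size) + layer_norm

    # Dense pooling layer applied to the first token's output.
    pooler = hidden_size * hidden_size + hidden_size

    return embeddings + num_layers * (attention + feed_forward) + pooler

base = bert_parameter_count(num_layers=12, hidden_size=768)
large = bert_parameter_count(num_layers=24, hidden_size=1024)
print(f"BERT-base:  about {base / 1e6:.0f} million parameters")
print(f"BERT-large: about {large / 1e6:.0f} million parameters")
```

Under these assumptions the BERT-base estimate lands close to the 110 million figure quoted above; exact totals depend on which parts of the model are counted.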

## Pre-training the Bert model

**Pre-training** a model means training it on a huge dataset for a particular task and saving the trained model. For a new task, instead of initializing a new model with random weights, we initialize it with the weights of the already trained (pre-trained) model. Because that model has already learned from a huge dataset, we do not train from scratch for the new task; we take the pre-trained model and adjust (fine-tune) its weights for the new task. This is a type of transfer learning.

The BERT model is pre-trained on a huge corpus using two interesting tasks called **masked language modelling (MLM)** and **next sentence prediction (NSP)**. For a new task, say question answering, we take the pre-trained BERT and fine-tune its weights.

### Input data representation

Before feeding the input to BERT, we convert it into embeddings using three embedding layers:

1. Token embedding

2. Segment embedding

3. Position embedding

**1. Token embeddings:**

A **[CLS]** token is added to the input word tokens at the beginning of the first sentence and a **[SEP]** token is inserted at the end of each sentence.
    
Let’s understand this with an example. Consider the following two sentences:

**Sentence A: Paris is a beautiful city.**

**Sentence B: I love Paris.**

First we tokenize both sentences, and our output will be as follows:

**tokens = [Paris, is, a, beautiful, city, I, love, Paris]**

Then we add a new token called **[CLS]** at the beginning of the token list:

**tokens = [[CLS], Paris, is, a, beautiful, city, I, love, Paris]**

Then we add a **[SEP]** token at the end of every sentence:

**tokens = [[CLS], Paris, is, a, beautiful, city, [SEP], I, love, Paris, [SEP]]**

The **[CLS]** token is used for **classification** tasks, whereas **[SEP]** is used to indicate the **end of every sentence**. Before feeding the tokens to BERT, we convert them into embeddings using an embedding layer called the **token embedding layer**. Note that the values of the embeddings are learned during training.
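
The steps above can be sketched in a few lines. The helper `build_input_tokens` is a hypothetical name, and it splits on whitespace rather than using BERT's real tokenizer, so treat it as an illustration of where the special tokens go and nothing more.

```python
def build_input_tokens(sentence_a, sentence_b):
    """Split two sentences on whitespace and add the [CLS] and [SEP] tokens."""
    tokens_a = sentence_a.rstrip(".").split()
    tokens_b = sentence_b.rstrip(".").split()
    # [CLS] opens the input; [SEP] closes each sentence.
    return ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]

tokens = build_input_tokens("Paris is a beautiful city.", "I love Paris.")
print(tokens)
```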
    
**2. Segment embedding:**

Segment embedding is used to distinguish between the two given sentences.

Let’s consider our previous example again.

**tokens = [[CLS], Paris, is, a, beautiful, city, [SEP], I, love, Paris, [SEP]]**

Apart from [SEP], we have to give our model some sort of indicator to distinguish between the two sentences. To do this, we feed the input tokens to the segment embedding layer.

The segment embedding layer returns only one of two embeddings, EA (embedding of sentence A) or EB (embedding of sentence B): if the input token belongs to sentence A it is mapped to EA, otherwise to EB.
    
![image.png](attachment:image.png)
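
In code, the choice between EA and EB is usually driven by a list of segment IDs, one per token. A minimal sketch, assuming the convention that ID 0 selects EA and ID 1 selects EB, and that the first [SEP] still belongs to sentence A:

```python
def segment_ids(tokens):
    """Return 0 for every token up to and including the first [SEP], then 1."""
    ids, current = [], 0
    for token in tokens:
        ids.append(current)
        if token == "[SEP]":
            current = 1  # everything after the first separator belongs to sentence B
    return ids

tokens = ["[CLS]", "Paris", "is", "a", "beautiful", "city", "[SEP]",
          "I", "love", "Paris", "[SEP]"]
ids = segment_ids(tokens)
print(list(zip(tokens, ids)))
```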
    
    
**3. Positional embeddings:**

A positional embedding is added to each token to indicate its position in the sentence.

Since the Transformer does not use any recurrence mechanism and processes all the words in parallel, we need to provide some information about word order, so we use positional embeddings.

BERT is essentially the Transformer’s encoder, so we need to give it information about the position of the words in our sentence before feeding the input to BERT.

**Final Representation**

Now let’s look at the final representation of the input data:
    
![image-2.png](attachment:image-2.png)
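
The final input representation is the element-wise sum of the token, segment and position embeddings for each token. The sketch below uses small random tables in place of learned ones; the hidden size of 4 and the helper `random_table` are assumptions made only to keep the example readable.

```python
import random

random.seed(0)
HIDDEN = 4  # toy hidden size; BERT-base uses 768

def random_table(rows):
    """Stand-in for a learned embedding table with `rows` entries."""
    return [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(rows)]

tokens = ["[CLS]", "Paris", "is", "a", "beautiful", "city", "[SEP]",
          "I", "love", "Paris", "[SEP]"]
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}

token_table = random_table(len(vocab))
segment_table = random_table(2)               # row 0 plays the role of EA, row 1 of EB
position_table = random_table(len(tokens))

token_ids = [vocab[t] for t in tokens]
seg_ids = [0] * 7 + [1] * 4                   # sentence A (with [CLS] and first [SEP]), then B
position_ids = list(range(len(tokens)))

# One vector per token: token + segment + position, added element by element.
input_embeddings = [
    [t + s + p for t, s, p in zip(token_table[ti], segment_table[si], position_table[pi])]
    for ti, si, pi in zip(token_ids, seg_ids, position_ids)
]

print(len(input_embeddings), "vectors of size", len(input_embeddings[0]))
```

Note that the two occurrences of "Paris" share the same token embedding but end up with different input vectors, because their segment and position embeddings differ.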
    
    
**WordPiece Tokenizer:**

BERT uses a special type of tokenizer called the **WordPiece tokenizer**, which follows the **subword** tokenization scheme. To understand the WordPiece tokenizer, consider the sentence

**"Let us start pretraining the model"**

If we tokenize the sentence using WordPiece, we obtain

**tokens = [let, us, start, pre, ##train, ##ing, the, model]**

While tokenizing the sentence, the word pretraining is split into three parts. This happens because the WordPiece tokenizer first checks whether the word is present in the vocabulary. If it is, the word is used as a token; if not, the word is split into subwords repeatedly until the subwords are found in the vocabulary. The `##` prefix marks a subword that continues the preceding piece. This process is effective in handling **out-of-vocabulary words**.
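
A minimal sketch of this splitting, using a greedy longest-match-first search and a tiny hand-made vocabulary. The vocabulary, the function name `wordpiece_tokenize` and the `[UNK]` fallback are assumptions for illustration; a real BERT vocabulary holds tens of thousands of entries and marks continuation pieces with `##`.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest vocabulary pieces, scanning left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the same word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"let", "us", "start", "pre", "##train", "##ing", "the", "model"}
sentence = "Let us start pretraining the model"
tokens = [piece for word in sentence.lower().split()
          for piece in wordpiece_tokenize(word, toy_vocab)]
print(tokens)
```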
    
Essentially, the Transformer stacks a layer that maps sequences to sequences, so the output is also a sequence of vectors with a 1:1 correspondence between input and output tokens at the same index. And as we learnt earlier, BERT does not try to predict the next word in the sentence. Training makes use of the following two strategies:
    
### Pre Training Strategies

The BERT model is pre-trained on the following two tasks:

1. Masked language modeling (MLM)
    
    
2. Next Sentence Prediction (NSP)
    
**Before diving directly into these two tasks, let’s first understand language modeling.**

### Language Modeling

In the **language modeling** task we train the model to predict the next word given a sequence of words. We can categorize language modeling into two types:

1. Auto-regressive language modeling
    
    
2. Auto-encoding language modeling
    
**1. Auto-regressive language modeling:**

We can categorize auto-regressive language modeling as follows:

- forward (left-to-right) prediction

- backward (right-to-left) prediction

Now consider our previous example "**Paris is a beautiful city. I love Paris**". Let’s remove the word city and add a blank; our model now has to predict the blank. If we use forward prediction, our model reads all the words from left to right up to the blank in order to make a prediction.

**Paris is a beautiful __.**

If we use backward prediction, our model reads all the words from right to left up to the blank in order to make a prediction.

**__ . I love Paris.**

Thus auto-regressive models are **unidirectional**, meaning they read the sentence in only one direction.
    
    
**2. Auto-encoding language modeling:**

**Auto-encoding** language modeling takes advantage of both **forward and backward prediction**, so we can say that auto-encoding models are **bidirectional** in nature. Reading the sentence in both directions gives a clearer picture of the sentence and hence better results. **BERT is an auto-encoding language model**.
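
The three views of the same sentence can be written down directly. The function names below are hypothetical; the sketch only shows which tokens each kind of model is allowed to see when predicting the blank.

```python
def forward_context(tokens, blank):
    """Left-to-right model: only the tokens before the blank are visible."""
    return tokens[:blank]

def backward_context(tokens, blank):
    """Right-to-left model: the tokens after the blank, read from the end."""
    return tokens[blank + 1:][::-1]

def bidirectional_context(tokens, blank):
    """Auto-encoding model: both sides of the blank are visible at once."""
    return tokens[:blank], tokens[blank + 1:]

tokens = ["Paris", "is", "a", "beautiful", "__", ".", "I", "love", "Paris", "."]
blank = tokens.index("__")

print("forward:      ", forward_context(tokens, blank))
print("backward:     ", backward_context(tokens, blank))
print("bidirectional:", bidirectional_context(tokens, blank))
```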
    
**Now let’s dive into the pre-training strategies.**

### 1. Masked language modeling (MLM)
In the masked language modeling task, for a given input, we randomly mask 15% of the words and train the network to predict the masked words. To predict the masked words, our model reads the sentence in both directions.

Let’s understand how masked language modeling works.

**tokens = [[CLS], Paris, is, a, beautiful, [MASK], [SEP], I, love, Paris, [SEP]]**

In our previous example we have replaced the word city with the **[MASK]** token.

Masking tokens in this way creates a discrepancy between pre-training and fine-tuning. We train BERT by predicting the [MASK] token, and after training we fine-tune the pre-trained BERT for a downstream task such as sentiment analysis. But during fine-tuning there is no [MASK] token in the input, which causes a mismatch between the way BERT is pre-trained and the way it is used for fine-tuning.

To overcome this issue we apply the **80–10–10% rule**. We learned that we randomly mask 15% of the tokens in the sentence; for these 15% we do the following:

- 80% of the time, we replace the token with the [MASK] token.

- 10% of the time, we replace the token with a random token (here city is replaced by love), so that our input will be as follows:

**tokens = [[CLS], Paris, is, a, beautiful, love, [SEP], I, love, Paris, [SEP]]**

- 10% of the time, we don’t make any changes.
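
The rule can be sketched as a small function. This is a simplified illustration under stated assumptions: the name `mask_tokens` is hypothetical, at least one token is always selected so that short examples still contain a target, and random replacements are drawn from a tiny toy vocabulary.

```python
import random

SPECIAL_TOKENS = {"[CLS]", "[SEP]"}

def mask_tokens(tokens, vocab, rng, mask_rate=0.15):
    """Select about 15% of the tokens and apply the 80-10-10 rule to each of them."""
    candidates = [i for i, tok in enumerate(tokens) if tok not in SPECIAL_TOKENS]
    num_to_mask = max(1, round(len(candidates) * mask_rate))
    chosen = rng.sample(candidates, num_to_mask)

    masked = list(tokens)
    labels = {}  # position -> original token the model must predict
    for i in chosen:
        labels[i] = tokens[i]
        roll = rng.random()
        if roll < 0.8:
            masked[i] = "[MASK]"           # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return masked, labels

tokens = ["[CLS]", "Paris", "is", "a", "beautiful", "city", "[SEP]",
          "I", "love", "Paris", "[SEP]"]
vocab = ["Paris", "is", "a", "beautiful", "city", "I", "love"]

masked, labels = mask_tokens(tokens, vocab, random.Random(0))
print(masked)
print(labels)
```

Whatever happens to a selected position, the model is still asked to predict the original token there, which is why the labels are recorded before any replacement.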

Following the tokenization and masking , we feed the input tokens to the token, segment and position embedding layers and get the input embeddings.

Now we feed our input embeddings to BERT. BERT takes the input and returns a representation of each token as output.

![image.png](attachment:image.png)

To predict the masked token, we feed the representation of the masked token, R[masked], returned by BERT to a feedforward network with a softmax activation function. The feedforward network takes R[masked] as input and returns, for each word in our vocabulary, the probability that it is the masked word.

![image-2.png](attachment:image-2.png)
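
A minimal sketch of this prediction step, assuming a single linear layer followed by softmax. The five-word vocabulary, the hidden size of 4 and the random weights are placeholders; in a trained model R[masked] comes from BERT and the weights are learned.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mlm_head(representation, weights, bias):
    """One linear layer over the masked token's vector, followed by softmax."""
    logits = [sum(w * x for w, x in zip(row, representation)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)

random.seed(0)
vocab = ["city", "country", "river", "love", "paris"]
hidden = 4

r_masked = [random.uniform(-1, 1) for _ in range(hidden)]                   # stand-in for R[masked]
weights = [[random.uniform(-1, 1) for _ in range(hidden)] for _ in vocab]   # one row per vocabulary word
bias = [0.0] * len(vocab)

probs = mlm_head(r_masked, weights, bias)
prediction = vocab[max(range(len(vocab)), key=probs.__getitem__)]

for word, p in zip(vocab, probs):
    print(f"{word:8s} {p:.3f}")
print("predicted word:", prediction)
```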

The masked language modeling task is also known as a **cloze task**. While masking input tokens we can also use slightly different method known as **whole word masking**.

#### Whole Word Masking:

Consider the sentence **"Let us start pretraining the model"**. After using the WordPiece tokenizer we get

**tokens = [let, us, start, pre, ##train, ##ing, the, model]**

Next we add the **[CLS]** token and mask 15% of the tokens:

**tokens = [[CLS], [MASK], us, start, pre, [MASK], ##ing, the, model]**

As we can see, we have masked the subword ##train, which is only one part of the word pretraining. In whole word masking, if a subword is masked, then we mask all the subwords belonging to the same word, while keeping our masking rate at 15%. To stay at that rate, we can drop the mask on another word, here let:

**tokens = [[CLS], let, us, start, [MASK], [MASK], [MASK], the, model]**
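
One way to sketch whole word masking is to group the subword positions by word first and then mask whole groups. The function names are hypothetical, continuation pieces are assumed to start with `##`, and this simplified version may exceed the 15% budget slightly when the last selected word has several pieces.

```python
import random

def group_into_words(tokens):
    """Group token positions so that each '##' piece joins the word it continues."""
    groups = []
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]"):
            continue  # special tokens are never masked
        if tok.startswith("##") and groups:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups

def whole_word_mask(tokens, rng, mask_rate=0.15):
    """Mask randomly chosen words, always covering every piece of a chosen word."""
    groups = group_into_words(tokens)
    budget = max(1, round(mask_rate * sum(len(g) for g in groups)))
    rng.shuffle(groups)

    masked = list(tokens)
    used = 0
    for group in groups:
        if used >= budget:
            break
        for i in group:
            masked[i] = "[MASK]"
        used += len(group)
    return masked

tokens = ["[CLS]", "let", "us", "start", "pre", "##train", "##ing", "the", "model"]
masked = whole_word_mask(tokens, random.Random(0))
print(masked)
```

In the output, the three pieces of pretraining are either all replaced by [MASK] or all left untouched; a partially masked word cannot occur.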

### 2. Next Sentence Prediction (NSP)
In order to understand relationship between two sentences, BERT training process also uses next sentence prediction. A pre-trained model with this kind of understanding is relevant for tasks like question answering. During training the model gets as input pairs of sentences and it learns to predict if the second sentence is the next sentence in the original text as well.

As we have seen earlier, BERT separates sentences with a special [SEP] token. During training the model is fed with two input sentences at a time such that:

- 50% of the time the second sentence comes after the first one.


- 50% of the time it is a random sentence from the full corpus.

BERT is then required to predict whether the second sentence is random or not, with the assumption that the random sentence will be disconnected from the first sentence:

![image.png](attachment:image.png)
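
Building such training pairs can be sketched as follows. The function name `make_nsp_pairs`, the two toy documents and the label strings are assumptions for illustration; the random sentence is drawn from the whole toy corpus, excluding the first sentence and its true successor.

```python
import random

def make_nsp_pairs(documents, rng):
    """Create (first, second, label) triples from documents given as ordered sentence lists."""
    all_sentences = [s for doc in documents for s in doc]
    pairs = []
    for doc in documents:
        for first, actual_next in zip(doc, doc[1:]):
            if rng.random() < 0.5:
                # Keep the sentence that really follows.
                pairs.append((first, actual_next, "IsNext"))
            else:
                # Swap in a random sentence from the corpus.
                candidates = [s for s in all_sentences if s not in (first, actual_next)]
                pairs.append((first, rng.choice(candidates), "NotNext"))
    return pairs

documents = [
    ["Paris is a beautiful city.", "I love Paris.", "It has many museums."],
    ["Python is a programming language.", "It is easy to learn.", "Many libraries are written in it."],
]

pairs = make_nsp_pairs(documents, random.Random(0))
for first, second, label in pairs:
    print(f"[CLS] {first} [SEP] {second} [SEP] -> {label}")
```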

To predict if the second sentence is connected to the first one or not, basically the complete input sequence goes through the Transformer based model, the output of the [CLS] token is transformed into a 2×1 shaped vector using a simple classification layer, and the IsNext-Label is assigned using softmax.

The model is trained with both Masked LM and Next Sentence Prediction together. This is to minimize the combined loss function of the two strategies — "together is better".

## How can we 'fine-tune' for a specific task?
Now, how can we fine-tune it for a specific task? BERT can be used for a wide variety of language tasks. If we want to fine-tune the original model based on our own dataset, we can do so by just adding a single layer on top of the core model.

For example, say we are creating a **question answering application**. In essence question answering is just a prediction task — on receiving a question as input, the goal of the application is to identify the right answer from some corpus. So, given a question and a context paragraph, the model predicts a start and an end token from the paragraph that most likely answers the question. This means that using BERT a model for our application can be trained by learning two extra vectors that mark the beginning and the end of the answer.

![image.png](attachment:image.png)

Just like sentence pair tasks, the question becomes the first sentence and paragraph the second sentence in the input sequence. However, this time there are two new parameters learned during fine-tuning: a start vector and an end vector.
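
A minimal sketch of how a start vector and an end vector can pick out an answer span. The token vectors and both learned vectors are replaced by random numbers here, and the function name `predict_answer_span` is hypothetical; the point is the mechanics of scoring every paragraph position as a start and as an end, then choosing the best valid pair.

```python
import math
import random

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def predict_answer_span(token_vectors, start_vector, end_vector, paragraph_start, paragraph_end):
    """Return (start, end) inside the paragraph with the highest start * end probability."""
    start_probs = softmax([dot(v, start_vector) for v in token_vectors])
    end_probs = softmax([dot(v, end_vector) for v in token_vectors])
    best_span, best_score = None, -1.0
    for i in range(paragraph_start, paragraph_end):
        for j in range(i, paragraph_end):  # the end may not come before the start
            score = start_probs[i] * end_probs[j]
            if score > best_score:
                best_span, best_score = (i, j), score
    return best_span

random.seed(0)
tokens = ["[CLS]", "what", "is", "bert", "[SEP]",
          "bert", "is", "a", "language", "model", "[SEP]"]
hidden = 4
token_vectors = [[random.uniform(-1, 1) for _ in range(hidden)] for _ in tokens]  # stand-in for BERT outputs
start_vector = [random.uniform(-1, 1) for _ in range(hidden)]                     # learned during fine-tuning
end_vector = [random.uniform(-1, 1) for _ in range(hidden)]                       # learned during fine-tuning

start, end = predict_answer_span(token_vectors, start_vector, end_vector,
                                 paragraph_start=5, paragraph_end=10)
print("answer span:", tokens[start:end + 1])
```

With random vectors the selected span is arbitrary; after fine-tuning, the two vectors are what make the highest-scoring span line up with the actual answer.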

In the fine-tuning training, most hyper-parameters stay the same as in BERT training; the paper gives specific guidance on the hyper-parameters that require tuning.

Note that in case we want to do fine-tuning, we need to transform our input into the specific format that was used for pre-training the core BERT models, e.g., we would need to add special tokens to mark the beginning ([CLS]) and separation/end of sentences ([SEP]) and segment IDs used to distinguish different sentences — convert the data into features that BERT uses.

## Advantages of BERT
1. High accuracy for many NLP tasks.


2. Fine-tuning requires less training time than training a model from scratch.


3. Memory requirements can be kept low by using the smaller configurations.


4. Pre-trained models available in many languages.


5. Supports multilingual input.


6. Handles short input sequences well.


7. Cost-effective, as the pre-trained models are free to download.


8. Easy to fine-tune for specific tasks.


9. Good for classification tasks.


10. Easy to deploy on production systems.


## Disadvantages of BERT

1. Context understanding is limited to what fits in a single fixed-length input sequence.


2. Text generation capabilities are not ideal, so the quality of generated text is limited.


3. Speed can be slow for long sequences, which it cannot handle efficiently.


4. Cannot directly handle more than two input segments (a single sentence or a sentence pair).


5. Poor performance on tasks that require long-term memory.


6. Support for non-English languages is more limited than for English.


7. Pre-training requires large amounts of training data.


8. Fine-tuning can be time-consuming.