# BERT Tutorial: BERT 101 - State Of The Art NLP Model Explained

#### _**<ins>Objective:</ins>**_

Get a high-level understanding of the BERT model.


This notebook follows the Hugging Face blog post at https://huggingface.co/blog/bert-101

#### **<ins>Introduction:</ins>**

We will walk through what BERT is, how it is trained, and what makes it special, all at a very high level.

#### **<ins>What is BERT?</ins>**

BERT stands for _Bidirectional Encoder Representations from Transformers._ Developed in 2018 by Google AI Language, the model was considered state-of-the-art on release and a real game changer in the world of natural language processing (NLP). Typically, solving a problem in NLP required a model built specifically for that problem. BERT instead provides a single pre-trained model that can be adapted to many different language tasks.

A key challenge in NLP is the ability for computers to understand the language context. For example, consider the following two sentences:

<center>
    I went to the bank <br>
    I went to the river bank
</center>

To humans, the word *bank* has rather different meanings in these two sentences: in the first it most likely refers to the bank where you deposit your money, while in the second it refers to the edge of a river. For computers, much work has gone into distinguishing such meanings based on the surrounding context words.

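BERT was benchmarked against other models on GLUE (General Language Understanding Evaluation), a collection of language understanding tasks, and at the time of its release it outperformed the previous best models.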


#### **<ins>How Does BERT Work?</ins>**

##### <ins>1. Large Amounts of Training Data</ins>

BERT was trained on English Wikipedia and the BooksCorpus dataset, which together amount to about 3.3 billion words. Training the largest model took 4 days on 64 TPU (Tensor Processing Unit) chips, that is, 16 Cloud TPU devices with 4 chips each.

Of course, the full BERT model is very large. Smaller BERT models of varying sizes are available for cases where compute or memory is limited. You may check out these smaller models at https://github.com/google-research/bert and https://huggingface.co/docs/transformers/model_doc/distilbert. 


##### <ins>2. Using Masked Language Modeling</ins>

Masked Language Modeling (MLM) enforces bidirectional learning from text by masking (hiding) a word in a sentence and forcing BERT to use the words on both sides of the masked word to predict what it is.
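
The masking step can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the function name `mask_tokens`, the whitespace tokenization, and the fixed seed are hypothetical choices made here, and the 15% masking rate is taken from the original BERT paper rather than from this notebook. The real pre-training recipe also sometimes swaps a selected token for a random word or leaves it unchanged, which is omitted.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens and keep the originals as labels."""
    if not tokens:
        return [], []
    rng = random.Random(seed)
    # Mask at least one token so short sentences still yield a training target.
    num_to_mask = max(1, round(len(tokens) * mask_prob))
    masked_positions = set(rng.sample(range(len(tokens)), num_to_mask))

    masked, labels = [], []
    for position, token in enumerate(tokens):
        if position in masked_positions:
            masked.append(MASK_TOKEN)
            labels.append(token)   # the model must recover this token
        else:
            masked.append(token)
            labels.append(None)    # nothing to predict at this position
    return masked, labels

# Simplified tokenization: split on whitespace (BERT itself uses WordPiece).
sentence = "I went to the river bank to watch the sunset".split()
masked, labels = mask_tokens(sentence)
print(" ".join(masked))
print(labels)
```

During training, the model sees only the masked sequence and is scored on how well it recovers the entries in `labels` that are not `None`, using context from both the left and the right of each mask.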


##### <ins>3. Next Sentence Prediction</ins>

Next Sentence Prediction (NSP) helps BERT learn about relationships between sentences by predicting whether a given sentence follows the previous sentence or not.
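
A rough sketch of how training pairs for this task could be built is shown below. The helper names `make_nsp_pairs` and `format_pair` are hypothetical, and the 50/50 split between true and random successors as well as the `[CLS]`/`[SEP]` layout follow the original BERT paper, not this notebook. For simplicity the negative sentence is drawn from the same toy document, whereas the original setup samples it from elsewhere in the corpus.

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """Build (sentence_a, sentence_b, is_next) examples from ordered sentences."""
    if len(sentences) < 3:
        raise ValueError("need at least 3 sentences to sample a negative example")
    rng = random.Random(seed)
    pairs = []
    for index in range(len(sentences) - 1):
        sentence_a = sentences[index]
        if rng.random() < 0.5:
            # Positive example: the sentence that really comes next.
            pairs.append((sentence_a, sentences[index + 1], True))
        else:
            # Negative example: any sentence except A itself and its true successor.
            candidates = [s for i, s in enumerate(sentences)
                          if i not in (index, index + 1)]
            pairs.append((sentence_a, rng.choice(candidates), False))
    return pairs

def format_pair(sentence_a, sentence_b):
    """Join two sentences with BERT-style special tokens."""
    return f"[CLS] {sentence_a} [SEP] {sentence_b} [SEP]"

document = [
    "Paul went shopping.",
    "He bought a new shirt.",
    "The shirt was blue.",
    "Radio waves travel at the speed of light.",
]
for sentence_a, sentence_b, is_next in make_nsp_pairs(document):
    print(format_pair(sentence_a, sentence_b), "->", "IsNext" if is_next else "NotNext")
```

The model then learns a binary classifier over such pairs, which pushes it to represent how sentences relate to each other and not only how words relate within a sentence.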


##### <ins>4. Transformers</ins>

The distinctive aspect of Transformers is the use of an attention mechanism to observe relationships between words. The Transformer architecture was introduced in 2017, and it revolutionized the domain of NLP. You may check out the paper at https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf. 

We can roughly outline how this attention layer works. This is not too different from how we humans process information: we tend to forget details about insignificant or trivial events and focus on the important ones. Similarly, the attention mechanism allows machine learning models to pay attention to significant details and ignore irrelevant information. Transformers assign different weights to the words, indicating which words matter most for further processing.

A transformer accomplishes this by successively processing an input through a stack of transformer layers, called the encoder. If required, an additional stack of layers called a decoder can be used to predict a target output. However, BERT does not use a decoder.
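
The word weighting described above can be illustrated with a minimal scaled dot-product attention in plain Python. This is a sketch under stated assumptions, not BERT's actual implementation: the 2-dimensional embeddings are made up for illustration, and the learned query/key/value projections and multiple heads of a real Transformer layer are left out.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    largest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for query in queries:
        # One weight per word: how much this query attends to each key.
        weights = softmax([dot(query, key) / scale for key in keys])
        # Each output is a weighted average of the value vectors.
        output = [sum(w * value[d] for w, value in zip(weights, values))
                  for d in range(len(values[0]))]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical toy embeddings, chosen only for illustration.
words = ["river", "bank", "money"]
embeddings = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
outputs, weights = attention(embeddings, embeddings, embeddings)
for word, row in zip(words, weights):
    print(word, [round(w, 2) for w in row])
```

Each printed row shows how one word distributes its attention over all words in the sequence. An encoder stacks several such layers, so every layer refines the representations produced by the one before it.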


#### **<ins>BERT Model Size and Architecture</ins>**

The two original BERT models are called BERTbase and BERTlarge:


|           | Transformer Layers | Hidden Size | Attention Heads | Parameters  | Training Hardware | Training Time |
| :-------: | :----------------: | :---------: | :-------------: | :---------: | :---------------: | :-----------: |
| BERTbase  |         12         |     768     |       12        | 110 Million |      4 TPUs       |    4 days     |
| BERTlarge |         24         |    1024     |       16        | 340 Million |      16 TPUs      |    4 days     |
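
As a sanity check, the parameter counts in the table can be roughly reproduced from the layer count and hidden size. The sketch below makes several assumptions that are not stated in this notebook: a vocabulary of 30,522 tokens (as in the released uncased English checkpoints), 512 position embeddings, 2 segment types, a feed-forward width of 4 times the hidden size, and a pooler layer on top. The function name `bert_parameter_count` is hypothetical.

```python
def bert_parameter_count(num_layers, hidden_size, vocab_size=30522,
                         max_positions=512, type_vocab_size=2):
    """Estimate the parameter count of a BERT encoder without a task head."""
    intermediate_size = 4 * hidden_size  # assumed feed-forward width
    layer_norm = 2 * hidden_size         # one scale and one bias vector

    # Token, position, and segment embeddings, followed by a layer norm.
    embeddings = ((vocab_size + max_positions + type_vocab_size) * hidden_size
                  + layer_norm)

    # Self-attention: query, key, value, and output projections with biases.
    attention = 4 * (hidden_size * hidden_size + hidden_size) + layer_norm

    # Feed-forward block: expand to intermediate_size, then project back.
    feed_forward = (hidden_size * intermediate_size + intermediate_size
                    + intermediate_size * hidden_size + hidden_size
                    + layer_norm)

    # Pooler: one dense layer applied to the first token's output.
    pooler = hidden_size * hidden_size + hidden_size

    return embeddings + num_layers * (attention + feed_forward) + pooler

for name, layers, hidden, heads in [("BERTbase", 12, 768, 12),
                                    ("BERTlarge", 24, 1024, 16)]:
    total = bert_parameter_count(layers, hidden)
    print(f"{name}: {total / 1e6:.1f}M parameters, "
          f"{hidden // heads} dimensions per attention head")
```

Under these assumptions the estimates come out at roughly 109 million and 335 million parameters, close to the rounded figures in the table. The number of attention heads does not change the count, since the heads simply split the hidden size into equal slices of 64 dimensions in both models.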


#### **<ins>What Makes BERT Special?</ins>**
- BERT was first pre-trained on massive amounts of unlabeled data (no human annotation) in an unsupervised fashion.
- BERT was then fine-tuned on small amounts of human-annotated data, starting from the pre-trained model, resulting in state-of-the-art performance.
