# Introduction

In this notebook, I am going to interactively implement **BERT** (Bidirectional Encoder Representations from Transformers), a language representation model designed to pretrain deep bidirectional representations from unlabelled text. It does this by jointly conditioning on both left and right context in all of its layers. As a result, a pretrained BERT model can be fine-tuned with just one additional output layer, meaning only minimal architectural modifications are needed to adapt the model to a specific task.

Getting straight into the details, there are two major steps in this framework:
* **Pre-training** - The model is trained on unlabelled data over different pretraining tasks.
* **Fine-tuning** - The model is first initialized with the pretrained parameters, and all of them are then fine-tuned using labelled data from a downstream task. Thus, each task ends up with its own fine-tuned model, even though all of these models were initialized with the same pretrained parameters.
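
The two steps above can be sketched in a few lines. This is only a toy illustration: the `pretrained` dictionary, the `init_for_task` helper and the task names are hypothetical stand-ins, not anything taken from the paper. The point is that every downstream model starts from a copy of the same pretrained parameters, gains its own small output layer, and then changes independently during fine-tuning.

```python
import copy

# Hypothetical stand-in for the parameters learned during pre-training.
pretrained = {"encoder.weight": [0.1, -0.2, 0.3]}

def init_for_task(pretrained_params, num_labels):
    """Start a task-specific model from a copy of the pretrained parameters."""
    params = copy.deepcopy(pretrained_params)      # same starting point for every task
    params["output.weight"] = [0.0] * num_labels   # new, task-specific output layer
    return params

# Two downstream tasks (the names are illustrative only).
sentiment_model = init_for_task(pretrained, num_labels=2)
tagging_model = init_for_task(pretrained, num_labels=9)

# Fine-tuning updates all parameters of a task model; here we fake one update.
sentiment_model["encoder.weight"][0] += 0.05

# Only the sentiment model has moved away from the shared initialization.
print(sentiment_model["encoder.weight"] == pretrained["encoder.weight"])
print(tagging_model["encoder.weight"] == pretrained["encoder.weight"])
```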

# Model Architecture

The architecture is a multi-layer bidirectional Transformer encoder based on the model described in the paper [Attention is All You Need](https://arxiv.org/abs/1706.03762), so we won't discuss it at length, but we'll see as we go how the two correspond.

**Model Parameters** - The paper used the following hyperparameters for its base model:
* $L = 12$ - The number of layers, i.e. the number of stacked Transformer encoder blocks, each of which has two sublayers: a multi-head self-attention mechanism and a feed-forward neural network, as we already know from previous work.
* $H = 768$ - The hidden size, meaning that after embedding, each token has a hidden representation of size 768.
* $A = 12$ - The number of self-attention heads in each layer, meaning we'll have $12$ sets of $Q, K, V$ projection matrices for computing multi-head self-attention.
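
As a small sketch, the three values above can be collected in a configuration object. The class name `BertBaseConfig` and its field names are my own (hypothetical), and the per-head size assumes the usual Transformer convention of splitting the hidden vector evenly across heads.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BertBaseConfig:
    """Hyperparameters of the base model (hypothetical container class)."""
    num_layers: int = 12    # L: stacked encoder blocks
    hidden_size: int = 768  # H: size of each token's hidden representation
    num_heads: int = 12     # A: self-attention heads per layer

    @property
    def head_dim(self) -> int:
        # Assumption: each head works on an equal slice of the hidden vector.
        assert self.hidden_size % self.num_heads == 0
        return self.hidden_size // self.num_heads

config = BertBaseConfig()
print(config.head_dim)  # 64
```

With $H = 768$ and $A = 12$, each head therefore operates on vectors of size $768 / 12 = 64$.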

## Inputs
The model needs to accept a special token $\text{<CLS>}$ (the first token of every sequence) followed by a list of word tokens from the user. Each of these tokens is then converted into a final input embedding by summing the following embeddings:
* **Token embeddings $(TE)$** In order for our model to make something out of words it hasn't seen during training, e.g. novel words, misspellings etc., the paper uses **WordPiece** embeddings, which split such words into known sub-word pieces, to create the token embedding.
* **Segment embeddings $(SE)$** A vector encoding of which sentence a token belongs to. This is a way to differentiate between sentence A and sentence B, e.g. during next sentence prediction (NSP), which we'll discuss shortly.
* **Position embeddings $(PE)$** A vector encoding of the position of a token in the input sequence.
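
To make the WordPiece idea from the token-embedding bullet concrete, here is a minimal sketch of greedy longest-match-first splitting, the lookup strategy commonly used by WordPiece tokenizers at inference time. Learning the vocabulary itself is not shown. The function name, the toy vocabulary and the `<UNK>` token are assumptions made for illustration; the `##` prefix marks a piece that continues a word.

```python
def wordpiece_tokenize(word, vocab, unk_token="<UNK>"):
    """Greedy longest-match-first split of a single word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first and shrink until it is in the vocab.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a "##" prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: treat the whole word as unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary; a real one is far larger.
toy_vocab = {"play", "##ing", "##ed", "un", "##play", "##able"}

print(wordpiece_tokenize("playing", toy_vocab))     # ['play', '##ing']
print(wordpiece_tokenize("unplayable", toy_vocab))  # ['un', '##play', '##able']
print(wordpiece_tokenize("xyz", toy_vocab))         # ['<UNK>']
```

Even though `unplayable` is not in the vocabulary as a whole word, it is still represented by three known pieces, each of which has its own token embedding.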

The position embeddings thus capture the order of tokens within the input, while the segment embeddings indicate which of the two sentences each token belongs to.
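
The sum $TE + SE + PE$ can be sketched with plain Python lists. Everything here is a toy assumption: the vocabulary, the sentence pair, the hidden size of 4 and the randomly initialised tables (in a real model these tables are learned). The `<SEP>` separator token is written in the same style as `<CLS>` and is likewise an assumption of this sketch.

```python
import random

random.seed(0)
HIDDEN = 4  # toy size; the base model uses H = 768

def make_table(num_rows, dim):
    """Randomly initialised lookup table: one vector of length `dim` per row."""
    return [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(num_rows)]

# Hypothetical toy vocabulary and sentence pair.
vocab = {"<CLS>": 0, "<SEP>": 1, "the": 2, "cat": 3, "sat": 4, "it": 5, "purred": 6}
tokens = ["<CLS>", "the", "cat", "sat", "<SEP>", "it", "purred", "<SEP>"]

token_ids = [vocab[t] for t in tokens]
segment_ids = [0, 0, 0, 0, 0, 1, 1, 1]    # sentence A = 0, sentence B = 1
position_ids = list(range(len(tokens)))   # 0, 1, 2, ...

token_table = make_table(len(vocab), HIDDEN)      # TE
segment_table = make_table(2, HIDDEN)             # SE
position_table = make_table(len(tokens), HIDDEN)  # PE

def embed(token_ids, segment_ids, position_ids):
    """Final input embedding = TE + SE + PE, added element-wise per token."""
    return [
        [te + se + pe
         for te, se, pe in zip(token_table[t], segment_table[s], position_table[p])]
        for t, s, p in zip(token_ids, segment_ids, position_ids)
    ]

inputs = embed(token_ids, segment_ids, position_ids)
print(len(inputs), len(inputs[0]))  # 8 4
```

The two `<SEP>` tokens share the same token embedding, yet their final vectors differ because their segment and position embeddings differ.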