# ___BERT - Bidirectional Encoder Representations from Transformers___

_Since 2018, our conceptual understanding of how best to represent words and sentences in a way that captures underlying meanings and relationships has been evolving rapidly. The NLP community has also been putting forward incredibly powerful components that you can freely download and use in your own models and pipelines. One of the milestones in this development is the release of __BERT__, an event described as marking the beginning of a new era in NLP. BERT is a model that broke several records for how well models can handle language-based tasks._

## ___What is BERT?___
_BERT is a model that knows how to represent text. You give it a sequence as input, it looks left and right several times, and it produces a vector representation for each word as output. At the end of 2018, researchers at Google AI Language open-sourced a new technique for Natural Language Processing (NLP) called __BERT (Bidirectional Encoder Representations from Transformers)__. It was a major breakthrough that took the deep learning community by storm because of its strong performance._

___“BERT stands for Bidirectional Encoder Representations from Transformers. It is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of NLP tasks.”___

<img src='https://cdn.analyticsvidhya.com/wp-content/uploads/2019/09/sent_context.png' width=400/>
<center style='font-size:10px'><i>BERT captures both the left and right context</i></center>

 
### ___Why was BERT needed?___
_One of the biggest challenges in NLP is the lack of enough training data. Overall there is an enormous amount of text data available, but if we want to create task-specific datasets, we need to split that pile into very many diverse fields. And when we do this, we end up with only a few thousand or a few hundred thousand human-labeled training examples. Unfortunately, in order to perform well, deep learning-based NLP models require much larger amounts of data — they see major improvements when trained on millions, or billions, of annotated training examples._

_To help bridge this gap in data, researchers have developed various techniques for training general-purpose language representation models using the enormous piles of unannotated text on the web (this is known as __pre-training__). These general-purpose __pre-trained models can then be fine-tuned on smaller task-specific datasets__, e.g., when working with problems like question answering and sentiment analysis._

_This approach results in great accuracy improvements compared to training on the smaller task-specific datasets from scratch. BERT is a recent addition to these techniques for NLP pre-training; it caused a stir in the deep learning community because it presented state-of-the-art results in a wide variety of NLP tasks, like question answering._

_It has been __pre-trained on Wikipedia__ and __BooksCorpus__._

<img src='http://jalammar.github.io/images/bert-transfer-learning.png' width=600/>
<center style='font-size:10px'><i>The two steps of how BERT is developed. You can download the model pre-trained in step 1 (trained on un-annotated data), and only worry about fine-tuning it for step 2.</i></center>

_So, with BERT the new approach to solving NLP tasks became a 2-step process:_

* _Train a language model on a large unlabelled text corpus (unsupervised or semi-supervised)_
* _Fine-tune this large model to specific NLP tasks to utilize the large repository of knowledge this model has gained (supervised)_

## ___Core Idea Behind BERT___

_What is language modeling really about? Which problem are language models trying to solve? Basically, their task is to “fill in the blank” based on context. For example, given_

“The woman went to the store and bought a _____ of shoes.”

_a language model might complete this sentence by saying that the word “cart” would fill the blank 20% of the time and the word “pair” 80% of the time._

_In the pre-BERT world, a language model would have looked at this text sequence during training from either left-to-right or combined left-to-right and right-to-left. This one-directional approach works well for generating sentences — we can predict the next word, append that to the sequence, then predict the next to next word until we have a complete sentence._

_Now enters BERT, a language model which is bidirectionally trained (this is also its key technical innovation). This means we can now have a deeper sense of language context and flow compared to the single-direction language models._

_Instead of predicting the next word in a sequence, BERT makes use of a novel technique called __Masked LM (MLM): it randomly masks words in the sentence and then it tries to predict them.__ Masking means that the model looks in both directions and it uses the full context of the sentence, both left and right surroundings, in order to predict the masked word. Unlike the previous language models, it takes both the previous and next tokens into account at the same time. The existing combined left-to-right and right-to-left LSTM based models were missing this “same-time part”. (It might be more accurate to say that BERT is non-directional though.)_

___But why is this non-directional approach so powerful?___

_Pre-trained language representations can either be context-free or context-based. __Context-based representations__ can then be __unidirectional__ or __bidirectional__._

_Context-free models like word2vec generate a single word embedding representation (a vector of numbers) for each word in the vocabulary. For example, the word “bank” would have the same context-free representation in “bank account” and “bank of the river.”_

_On the other hand, context-based models generate a representation of each word that is based on the other words in the sentence. For example, in the sentence “I accessed the bank account,” a unidirectional contextual model would represent “bank” based on “I accessed the” but not “account.” However, BERT represents “bank” using both its previous and next context — “I accessed the … account” — starting from the very bottom of a deep neural network, making it deeply bidirectional._

<img src='https://miro.medium.com/max/700/0*G8oaGEpkm1nEALmA'/>

_Moreover, BERT is based on the Transformer model architecture, instead of LSTMs._

_A Transformer works by performing a small, constant number of steps. In each step, it applies an attention mechanism to understand relationships between all words in a sentence, regardless of their respective position. For example, given the sentence, “I arrived at the bank after crossing the river”, to determine that the word “bank” refers to the shore of a river and not a financial institution, the Transformer can learn to immediately pay attention to the word “river” and make this decision in just one step._
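_To make the attention step more concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is a simplification: a real Transformer first applies learned query, key and value projections and uses several heads, all of which are left out here, and the two-dimensional toy vectors for "the", "bank" and "river" are made up for illustration._

```python
import math


def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention without learned projections."""
    dim = len(vectors[0])
    outputs, weights = [], []
    for query in vectors:
        # Compare this token with every token in the sentence, in one step.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        row = softmax(scores)
        weights.append(row)
        # The new representation is a weighted average of all token vectors.
        outputs.append([sum(w * vec[c] for w, vec in zip(row, vectors))
                        for c in range(dim)])
    return outputs, weights


# Hypothetical toy embeddings: "bank" and "river" point in a similar direction.
words = ["the", "bank", "river"]
toy_vectors = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]]
_, attention = self_attention(toy_vectors)
for word, weight in zip(words, attention[1]):
    print(f"bank -> {word}: {weight:.2f}")
```

_In this toy setup "bank" puts more weight on "river" than on "the", which is the kind of direct, position-independent link the paragraph above describes._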

_BERT was built upon recent work and clever ideas in pre-training contextual representations including __Semi-supervised Sequence Learning, Generative Pre-Training, ELMo, the OpenAI Transformer, ULMFit and the Transformer__. Although these models are all unidirectional or shallowly bidirectional, BERT is fully bidirectional._

## ___BERT’s Architecture___

_The BERT architecture builds on top of Transformer. We currently have two variants available:_

<img src='http://jalammar.github.io/images/bert-base-bert-large.png' width=500/>

* ___BERT Base___ _: 12 layers (transformer blocks), 12 attention heads, and 110 million parameters_
* ___BERT Large___ _: 24 layers (transformer blocks), 16 attention heads, and 340 million parameters_

_BERT is basically a trained __Transformer Encoder stack__._

<img src='http://jalammar.github.io/images/bert-base-bert-large-encoders.png' width=500/>
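_As a rough sanity check on the parameter counts above, the sketch below adds up the weights of an encoder stack. The hidden sizes (768 and 1024) match the model list later in this document; the vocabulary size of 30,522, the feed-forward size of four times the hidden size, and the extra pooling layer on top of [CLS] are assumptions based on the commonly distributed uncased English checkpoints, not on the text above._

```python
def bert_parameter_count(layers, hidden, vocab_size=30522,
                         max_positions=512, segment_types=2):
    """Approximate number of trainable parameters in a BERT encoder."""
    feed_forward = 4 * hidden  # assumed intermediate size
    layer_norm = 2 * hidden    # one scale and one bias per hidden unit

    # Token, position and segment embedding tables, then a layer norm.
    embeddings = (vocab_size + max_positions + segment_types) * hidden + layer_norm

    # Query, key, value and output projections (weights plus biases).
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers of the position-wise feed-forward network.
    ffn = (hidden * feed_forward + feed_forward) + (feed_forward * hidden + hidden)
    per_layer = attention + layer_norm + ffn + layer_norm

    pooler = hidden * hidden + hidden  # assumed dense layer applied to [CLS]
    return embeddings + layers * per_layer + pooler


base = bert_parameter_count(layers=12, hidden=768)
large = bert_parameter_count(layers=24, hidden=1024)
print(f"BERT Base : {base / 1e6:.1f}M parameters")
print(f"BERT Large: {large / 1e6:.1f}M parameters")
```

_Under these assumptions the totals come out at roughly 109M and 335M, in line with the rounded 110 million and 340 million figures quoted above._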

___Fun Fact___ _: BERT-Base was trained on 4 cloud TPUs for 4 days and BERT-Large was trained on 16 TPUs for 4 days._

## ___Information Flow of a Word in BERT___

<img src='https://yashuseth.files.wordpress.com/2019/06/fig9.png' width=300/>

_A word starts with its embedding representation from the embedding layer. Every layer performs multi-headed attention over the word representations of the previous layer to create a new intermediate representation. All these intermediate representations have the same size. In the figure above, E1 is the embedding representation, T1 is the final output, and Trm denotes the intermediate representations of the same token. In a 12-layer BERT model, a token will have 12 intermediate representations._

## ___How Does BERT Work?___

_BERT relies on a Transformer (the attention mechanism that learns contextual relationships between words in a text). A basic Transformer consists of an encoder to read the text input and a decoder to produce a prediction for the task. Since __BERT’s goal is to generate a language representation model__, it only needs the encoder part. The input to the encoder for BERT is a sequence of tokens, which are first converted into vectors and then processed in the neural network. But before processing can start, BERT needs the input to be massaged and decorated with some __extra metadata__:_

* ___Token Embeddings___ _: The token embeddings are numerical representations of the tokens in the input sentence. BERT also uses sub-word tokenization, which first breaks larger or complex words down into smaller pieces and then converts them into tokens. For example, in the diagram below, the word ‘playing’ is broken into ‘play’ and ‘##ing’ before the token embeddings are generated. This tweak in tokenization works well because it uses the sub-word context of a complex word instead of treating it as an entirely new word._


* ___Segment Embeddings___ _: The segment embeddings are used to help BERT distinguish between the different sentences in a single input. The elements of this embedding vector are all the same for the words from the same sentence and the value changes if the sentence is different._

    _Let’s consider an example: Suppose we want to pass the two sentences “I have a pen” and “The pen is red” to BERT. The tokenizer will first tokenize these sentences as:_
    [‘[CLS]’, ‘I’, ‘have’, ‘a’, ‘pen’, ‘[SEP]’, ‘the’, ‘pen’, ‘is’, ‘red’, ‘[SEP]’]

    _And the segment embeddings for these will look like:_
    [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

    _Notice how all the elements corresponding to tokens in the first sentence (including [CLS] and the first [SEP]) have the value 0, whereas all the elements corresponding to tokens in the second sentence have the value 1._

    _BERT accepts input sequences of up to 512 tokens. Suppose we pad every input to this maximum length and our input has only 10 tokens. To bring it up to the full input length, we add padding of size 512 - 10 = 502 at the end. Along with the padding, we generate an attention mask of length 512 in which the positions of the real tokens hold 1s and the positions of the padding hold 0s._
    

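* ___Position Embeddings___ _: Finally, there are position embeddings, which are generated internally in BERT and give the input data a sense of order. They serve the same purpose as the positional encodings discussed for Transformers, although BERT learns its position embeddings during training instead of using fixed sinusoidal functions._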

<img src='https://miro.medium.com/max/700/0*XET3A5BmwES3qxgF'/>
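_The input preparation described above can be sketched in a few lines of Python. This is only an illustration, not BERT's actual preprocessing code: the function name `build_bert_inputs` and the `[PAD]` token are assumptions, and `max_len` is set to 16 instead of 512 so that the output stays readable. The mask marks real tokens with 1 and padding with 0._

```python
def build_bert_inputs(tokens_a, tokens_b, max_len=512):
    """Pack two tokenized sentences into BERT-style model inputs."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    if len(tokens) > max_len:
        raise ValueError("input is longer than max_len tokens")

    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    attention_mask = [1] * len(tokens)

    # Pad everything to the fixed length; padding positions get mask value 0.
    padding = max_len - len(tokens)
    tokens = tokens + ["[PAD]"] * padding  # "[PAD]" is an assumed padding token
    segment_ids = segment_ids + [0] * padding
    attention_mask = attention_mask + [0] * padding

    # Position ids simply count the slots; they index the position embeddings.
    position_ids = list(range(max_len))
    return tokens, segment_ids, attention_mask, position_ids


tokens, segment_ids, attention_mask, position_ids = build_bert_inputs(
    ["I", "have", "a", "pen"], ["the", "pen", "is", "red"], max_len=16)
print(tokens)
print(segment_ids)
print(attention_mask)
```

_Inside the model, each of the three id sequences is looked up in its own embedding table and the three resulting vectors are added together for every position._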

_Essentially, the Transformer stacks a layer that maps sequences to sequences, so the output is also a sequence of vectors with a 1:1 correspondence between input and output tokens at the same index. And as we learnt earlier, BERT does not try to predict the next word in the sentence. Training makes use of the following two strategies:_

___1. Masked LM (MLM)___

_The idea here is “simple”: Randomly mask out 15% of the words in the input — replacing them with a [MASK] token — run the entire sequence through the BERT attention based encoder and then predict only the masked words, based on the context provided by the other non-masked words in the sequence. However, there is a problem with this naive masking approach — the model only tries to predict when the [MASK] token is present in the input, while we want the model to try to predict the correct tokens regardless of what token is present in the input. To deal with this issue, out of the 15% of the tokens selected for masking:_

* _80% of the time the token is replaced with the [MASK] token._
* _10% of the time the token is replaced with a random token._
* _10% of the time the token is left unchanged._


<img src='http://jalammar.github.io/images/BERT-language-modeling-masked-lm.png' width = 500/>

___During training, the BERT loss function considers only the predictions for the masked tokens and ignores the predictions for the non-masked ones. As a result, the model converges more slowly than left-to-right or right-to-left models.___
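_The masking recipe above can be sketched as follows. This is a simplified illustration rather than the original data pipeline: the function name `mask_for_mlm`, the tiny replacement vocabulary and the use of `None` as the "ignore this position in the loss" label are all assumptions made for the example._

```python
import random

SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def mask_for_mlm(tokens, vocab, mask_prob=0.15, rng=None):
    """Select about 15% of the tokens and apply the 80/10/10 rule."""
    rng = rng or random.Random()
    masked = list(tokens)
    labels = [None] * len(tokens)  # None: position is ignored by the loss

    candidates = [i for i, tok in enumerate(tokens) if tok not in SPECIAL_TOKENS]
    if not candidates:
        return masked, labels

    n_select = max(1, round(len(candidates) * mask_prob))
    for i in rng.sample(candidates, n_select):
        labels[i] = tokens[i]  # the model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = "[MASK]"           # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return masked, labels


sentence = "[CLS] the woman went to the store and bought a pair of shoes [SEP]".split()
toy_vocab = ["cart", "pair", "store", "shoes", "woman"]  # hypothetical vocabulary
masked, labels = mask_for_mlm(sentence, toy_vocab, rng=random.Random(0))
print(masked)
print(labels)
```

_Only the positions with a non-empty label contribute to the loss, which is why each training sequence provides a learning signal for just a small share of its tokens._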

___2. Next Sentence Prediction (NSP)___

_In order to understand the relationship between two sentences, the BERT training process also uses __next sentence prediction__. A pre-trained model with this kind of understanding is relevant for tasks like question answering. During training, the model receives pairs of sentences as input and learns to predict whether the second sentence is also the next sentence in the original text.
As we have seen earlier, BERT separates sentences with a special [SEP] token. During training, the model is fed two input sentences at a time such that:_

* _50% of the time the second sentence comes after the first one._
* _50% of the time it is a random sentence from the full corpus._

_BERT is then required to predict whether the second sentence is random or not, with the assumption that the random sentence will be disconnected from the first sentence:_

<img src='http://jalammar.github.io/images/bert-next-sentence-prediction.png' width=500/>

_To predict if the second sentence is connected to the first one or not, basically the complete input sequence goes through the Transformer based model, the output of the [CLS] token is transformed into a 2×1 shaped vector using a simple classification layer, and the IsNext-Label is assigned using softmax._
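_The way such training pairs are built can be sketched as below. This is a minimal illustration under stated assumptions: the function name `make_nsp_examples` and the label strings are chosen for the example, each "document" is just a list of sentences, and the corpus is assumed to contain at least one sentence other than the true next one._

```python
import random


def make_nsp_examples(document, corpus, rng=None):
    """Build (first, second, label) triples for next sentence prediction."""
    rng = rng or random.Random()
    examples = []
    for i in range(len(document) - 1):
        first, true_next = document[i], document[i + 1]
        if rng.random() < 0.5:
            # About half of the time: keep the sentence that really follows.
            examples.append((first, true_next, "IsNext"))
        else:
            # Otherwise: draw a random sentence from the corpus instead.
            distractors = [s for s in corpus if s != true_next]
            examples.append((first, rng.choice(distractors), "NotNext"))
    return examples


document = ["I have a pen.", "The pen is red.", "I write with it every day."]
corpus = document + ["Penguins cannot fly.", "The river was frozen."]
for first, second, label in make_nsp_examples(document, corpus, rng=random.Random(1)):
    print(f"[CLS] {first} [SEP] {second} [SEP] -> {label}")
```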

___The model is trained with both Masked LM and Next Sentence Prediction together. This is to minimize the combined loss function of the two strategies — “together is better”.___

## ___Fine Tuning BERT___

_BERT outperformed the state-of-the-art across a wide variety of tasks under general language understanding like natural language inference, sentiment analysis, question answering, paraphrase detection and linguistic acceptability._

_Now, how can we fine-tune it for a specific task? BERT can be used for a wide variety of language tasks. If we want to fine-tune the original model based on our own dataset, we can do so by just adding a single layer on top of the core model._

### ___Procedure for Sequence Classification Tasks___

_The final hidden state of the [CLS] token is taken as the fixed-dimensional pooled representation of the input sequence. This is fed to the classification layer. The classification layer is the only new parameter added and has a dimension of K x H, where K is the number of classifier labels and H is the size of the hidden state. The label probabilities are computed with a standard softmax._

<img src='https://yashuseth.files.wordpress.com/2019/06/fig1-1.png' width = 300/>
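_The classification head described above is small enough to write out by hand. The sketch below uses a toy hidden size of H = 4 and K = 2 labels; the numbers in the [CLS] vector and in the weight matrix are made up, whereas in practice both come from the model and from fine-tuning._

```python
import math


def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(cls_vector, weights):
    """Apply a K x H classification layer to the final [CLS] hidden state."""
    logits = [sum(w * x for w, x in zip(row, cls_vector)) for row in weights]
    return softmax(logits)


# Hypothetical values: H = 4 hidden units, K = 2 labels.
cls_vector = [0.5, -1.0, 0.25, 2.0]
weights = [
    [0.1, 0.2, 0.3, 0.4],    # row for label 0
    [-0.4, 0.3, -0.2, 0.1],  # row for label 1
]
probabilities = classify(cls_vector, weights)
print([round(p, 3) for p in probabilities])
```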

### ___Procedure for Sentence Pair Classification Tasks___

_This procedure is essentially the same as for the single-sequence classification task. The only difference is in the input representation, where the two sentences are concatenated together._

<img src='https://yashuseth.files.wordpress.com/2019/06/fig2-1.png' width = 300/>

### ___Procedure for Question Answering Task___

_Question answering is a prediction task. Given a question and a context paragraph, the model predicts a start and an end token from the paragraph that most likely answers the question._

<img src='https://yashuseth.files.wordpress.com/2019/06/fig6.png' width = 300/>

_Just like in sentence pair tasks, the question becomes the first sentence and the paragraph the second sentence in the input sequence. Only two new parameters are learned during fine-tuning: a start vector and an end vector, each with size equal to the hidden size. The probability of token i being the start of the answer span is computed as softmax(S . K), where S is the start vector, K is the final transformer output of token i, and the softmax is taken over all tokens in the paragraph. The same applies to the end token, using the end vector._

<img src='https://yashuseth.files.wordpress.com/2019/06/fig3.png' width = 300/>
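_A small sketch of the span prediction step follows. The token outputs and the start and end vectors are invented two-dimensional values, and picking the span with the highest product of start and end probability (with the end not before the start) is one common decoding rule, assumed here for illustration._

```python
import math


def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def predict_span(token_outputs, start_vector, end_vector):
    """Return the most likely (start, end) token indices of the answer."""
    start_probs = softmax([dot(start_vector, t) for t in token_outputs])
    end_probs = softmax([dot(end_vector, t) for t in token_outputs])

    best_score, best_start, best_end = -1.0, 0, 0
    for i, p_start in enumerate(start_probs):
        for j in range(i, len(end_probs)):  # the end may not precede the start
            score = p_start * end_probs[j]
            if score > best_score:
                best_score, best_start, best_end = score, i, j
    return best_start, best_end


# Hypothetical final transformer outputs for four paragraph tokens.
token_outputs = [[0.1, 0.0], [2.0, 0.1], [0.2, 1.5], [0.0, 0.1]]
start_vector = [1.0, 0.0]  # made-up learned start vector
end_vector = [0.0, 1.0]    # made-up learned end vector
print(predict_span(token_outputs, start_vector, end_vector))
```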

### ___Procedure for Single Sentence Tagging___

_In single sentence tagging tasks such as named entity recognition, a tag must be predicted for every word in the input. The final hidden state (the transformer output) of every input token is fed to the classification layer to get a prediction for every token. Since the WordPiece tokenizer breaks some words into sub-words, only the prediction for the first token of a word is considered._

<img src='https://yashuseth.files.wordpress.com/2019/06/fig4.png' width = 300/>
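_The "first sub-token only" rule can be sketched as a simple filter over the per-token predictions. The token sequence and tags below are made-up example values, and the placeholder tag "X" for continuation pieces is an assumption of this sketch._

```python
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def word_level_tags(tokens, token_tags):
    """Keep only the prediction made for the first sub-token of each word."""
    tags = []
    for token, tag in zip(tokens, token_tags):
        if token in SPECIAL_TOKENS:
            continue  # special tokens are not words
        if token.startswith("##"):
            continue  # continuation piece: its prediction is ignored
        tags.append(tag)
    return tags


tokens = ["[CLS]", "jim", "hen", "##son", "was", "a", "puppet", "##eer", "[SEP]"]
token_tags = ["O", "B-PER", "I-PER", "X", "O", "O", "O", "X", "O"]
print(word_level_tags(tokens, token_tags))
```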

## ___BERT for Feature Extraction - A New Age of Embedding___

_The fine-tuning approach isn’t the only way to use BERT. Just like ELMo, you can use the pre-trained BERT to create contextualized word embeddings._

_These new developments carry with them a new shift in how words are encoded. Up until now, word embeddings have been a major force in how leading NLP models deal with language. Methods like Word2Vec and GloVe have been widely used for such tasks._

### ___Word Embedding - GloVe, Word2Vec___

_For words to be processed by machine learning models, they need some form of numeric representation that models can use in their calculation. Word2Vec showed that we can use a vector (a list of numbers) to properly represent words in a way that captures semantic or meaning-related relationships (e.g. the ability to tell if words are similar, or opposites, or that a pair of words like “Stockholm” and “Sweden” have the same relationship between them as “Cairo” and “Egypt” have between them) as well as syntactic, or grammar-based, relationships (e.g. the relationship between “had” and “has” is the same as that between “was” and “is”)._

_The field quickly realized it’s a great idea to use embeddings that were pre-trained on vast amounts of text data instead of training them alongside the model on what was frequently a small dataset. So it became possible to download a list of words and their embeddings generated by pre-training with Word2Vec or GloVe. This is an example of the GloVe embedding of the word “stick” (with an embedding vector size of 200):_

<img src='http://jalammar.github.io/images/glove-embedding.png'/>
<center style='font-size:10px'><i>The GloVe word embedding of the word "stick" - a vector of 200 floats (rounded to two decimals). It goes on for two hundred values.</i></center>

### ___Embeddings from Language Models (ELMo)___

_If we’re using this GloVe representation, then the word “stick” would be represented by this vector no matter what the context was. “Wait a minute,” said a number of NLP researchers (Peters et al., 2017, McCann et al., 2017, and yet again Peters et al., 2018 in the ELMo paper), “‘stick’ has multiple meanings depending on where it’s used. Why not give it an embedding based on the context it’s used in, to capture both the word meaning in that context and other contextual information?” And so, contextualized word embeddings were born._

_ELMo is a novel way to represent words as vectors or embeddings. These word embeddings are helpful in achieving state-of-the-art (SOTA) results in several NLP tasks. ELMo differs from embeddings like GloVe and Word2Vec because it gives a word an embedding based on its context, i.e. a contextualized word embedding. To generate the embedding of a word, ELMo looks at the entire sentence instead of using a fixed embedding for that word._

_ELMo uses a bidirectional LSTM trained on a specific task to be able to create those embeddings. This model is trained on a massive dataset in the language of our dataset, and then we can use it as a component in other architectures that are required to perform specific language tasks._

<img src='https://cdn.analyticsvidhya.com/wp-content/uploads/2019/03/output_YyJc8E.gif' width=600/>

_ELMo gained its language understanding from being trained to predict the next word in a sequence of words, a task called Language Modeling. This is convenient because we have vast amounts of text data that such a model can learn from without needing labels._

### ___Universal Language Model Fine-tuning (ULM-FiT)___

_ULM-FiT introduced methods to effectively utilize a lot of what the model learns during pre-training: more than just embeddings, and more than contextualized embeddings. ULM-FiT introduced a language model and a process to effectively fine-tune that language model for various tasks. This enables NLP architectures to perform transfer learning on a pre-trained model, similar to what is done in many computer vision tasks._

### ___OpenAI Transformer___

_The release of the Transformer paper and code, and the results it achieved on tasks such as machine translation, started to make some in the field think of Transformers as a replacement for LSTMs. This was compounded by the fact that Transformers deal with long-term dependencies better than LSTMs._

_The Encoder-Decoder structure of the transformer made it perfect for machine translation. But how would you use it for sentence classification? How would you use it to pre-train a language model that can be fine-tuned for other tasks (downstream tasks is what the field calls those supervised-learning tasks that utilize a pre-trained model or component)?_

_It turns out we don’t need an entire Transformer to adopt transfer learning and a fine-tunable language model for NLP tasks. We can do with just the decoder of the transformer. The decoder is a good choice because it’s a natural choice for language modeling (predicting the next word) since it’s built to mask future tokens – a valuable feature when it’s generating a translation word by word._

_The model stacked twelve decoder layers. Since there is no encoder in this setup, these decoder layers would not have the encoder-decoder attention sublayer that vanilla transformer decoder layers have. They would still have the self-attention layer, however (masked so it doesn’t peek at future tokens)._

_With this structure, we can proceed to train the model on the same language modeling task: predict the next word using massive (unlabeled) datasets._
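_The difference between this masked, decoder-style self-attention and the unrestricted attention of an encoder such as BERT comes down to which positions each token is allowed to look at. The sketch below builds both visibility patterns for a four-token input; the function name `attention_pattern` is chosen for the example._

```python
def attention_pattern(n_tokens, causal):
    """Row i lists which positions token i may attend to (1 = visible)."""
    return [[1 if (not causal or j <= i) else 0 for j in range(n_tokens)]
            for i in range(n_tokens)]


print("Decoder-style (causal) attention, as in the OpenAI Transformer:")
for row in attention_pattern(4, causal=True):
    print(row)

print("Encoder-style (bidirectional) attention, as in BERT:")
for row in attention_pattern(4, causal=False):
    print(row)
```

_In the causal pattern every token sees only itself and the tokens to its left, which is what makes next-word prediction a fair task; in the bidirectional pattern every token sees the whole sequence._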

<img src='http://jalammar.github.io/images/openai-transformer-sentence-classification.png' width=500/>

_The OpenAI paper outlines a number of input transformations to handle the inputs for different types of tasks. The following image from the paper shows the structures of the models and input transformations to carry out different tasks._

<img src='http://jalammar.github.io/images/openai-input%20transformations.png' width=500/>

### ___BERT___

_Coming back to BERT: as noted at the start of this section, you can use the pre-trained model to create contextualized word embeddings, just like with ELMo. You can then feed these embeddings to your existing model, a process that the paper shows yields results not far behind fine-tuning BERT on a task such as named-entity recognition._

<img src='http://jalammar.github.io/images/bert-contexualized-embeddings.png' width=500/>

_Which vector works best as a contextualized embedding? I would think it depends on the task. The paper examines six choices (Compared to the fine-tuned model which achieved a score of 96.4):_

<img src='http://jalammar.github.io/images/bert-feature-extraction-contextualized-embeddings.png' width=500/>
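_Two ways of turning the per-layer outputs of a single token into one feature vector, summing the last four layers and concatenating them, can be sketched as follows. The layer outputs here are dummy three-dimensional vectors; with a real model they would be the hidden states of one token taken from each of the 12 layers._

```python
def sum_last_four(layer_outputs):
    """Element-wise sum of the token's vectors from the last four layers."""
    return [sum(values) for values in zip(*layer_outputs[-4:])]


def concat_last_four(layer_outputs):
    """Concatenate the token's vectors from the last four layers."""
    return [value for layer in layer_outputs[-4:] for value in layer]


# Dummy outputs: layer k produces the vector [k, k, k] for this token.
layer_outputs = [[float(k)] * 3 for k in range(1, 13)]

summed = sum_last_four(layer_outputs)
concatenated = concat_last_four(layer_outputs)
print(len(summed), summed)
print(len(concatenated), concatenated)
```

_Summing keeps the feature size equal to the hidden size, while concatenation makes it four times larger, so the choice also affects the size of the downstream model._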

## ___BERT Benchmarks___

_BERT provides fine-tuned results for 11 NLP tasks. Here, we discuss some of those results on benchmark NLP tasks._

### ___GLUE___
_The General Language Understanding Evaluation (GLUE) benchmark is a collection of different Natural Language Understanding tasks. These include __MNLI (Multi-Genre Natural Language Inference), QQP (Quora Question Pairs), QNLI (Question Natural Language Inference), SST-2 (The Stanford Sentiment Treebank), CoLA (Corpus of Linguistic Acceptability)__ etc. Both BERT-Base and BERT-Large outperform previous models by a good margin (4.5% and 7% respectively). Below are the results of BERT-Base and BERT-Large as compared to other models:_

<img src='https://media.geeksforgeeks.org/wp-content/uploads/20200406230011/BERT-GL.png'/>

### ___SQuAD v1.1 Dataset___
___Stanford Question Answering Dataset___ _is a collection of 100k crowdsourced question-answer pairs. A data point contains a question and a passage from Wikipedia which contains the answer. The task is to predict the answer text span from the passage._

_The best performing BERT system (with ensembling and TriviaQA) outperforms the top leaderboard system by 1.5 F1 as an ensemble and by 1.3 F1 as a single system. In fact, the paper reports that a single BERT model outperforms the top ensemble system in terms of F1 score._

### ___SWAG (Situations With Adversarial Generations)___
_The SWAG dataset contains 113k sentence completion examples that evaluate grounded commonsense inference. Given a sentence, the task is to choose the most plausible continuation among four choices._

_BERT-Large outperforms OpenAI GPT by 8.3%. It even performs better than an expert human._

_The results on the SWAG dataset are given below:_

<img src='https://media.geeksforgeeks.org/wp-content/uploads/20200406234910/bert-swag.png'/>

## ___A Few Questions___

___1. How is BERT different from other embedding-generating algorithms like Word2Vec or GloVe?___

_The main differences between BERT and W2V or GloVe are:_

* _W2V and GloVe word embeddings are context-independent. These models output just one vector (embedding) for each word, combining all the different senses of the word into one vector. For example, consider the sentence:_
“the game will lead to a tie if both the guys tie their final tie at the same time.”


* _Word2Vec or GloVe will fail to capture that the 3 occurrences of the word “tie” in the sentence have different meanings and would simply return the same embedding for all 3 of them._


* _BERT, in contrast, is context-dependent, which means each of the 3 occurrences would get a different embedding, because BERT pays attention to the neighboring words before generating the embeddings._


* _Because W2V and GloVe are context-independent, we do not need the model that was used to train the vectors every time we want to generate embeddings. We can simply train the vectors on a corpus once and then build a table or database holding the words and their respective trained vectors._


* _Whereas in the case of BERT, since it is context-dependent, we need the pre-trained model every time while generating the embeddings or performing any NLP task._

___2. Which tokenization strategy is used by BERT?___

_BERT uses WordPiece tokenization. The vocabulary is initialized with all the individual characters in the language, and then the most frequent/likely combinations of the existing symbols in the vocabulary are iteratively added._

___3. How does BERT handle OOV words?___

_Any word that does not occur in the vocabulary is broken down into sub-words greedily. For example, if play, ##ing, and ##ed are present in the vocabulary but playing and played are OOV words then they will be broken down into play + ##ing and play + ##ed respectively. (## is used to represent sub-words)._
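_The greedy splitting described above can be sketched as a longest-match-first loop. This is a minimal illustration with a tiny made-up vocabulary; the function name `wordpiece_split` and the `[UNK]` fallback for words that cannot be split at all are assumptions of the sketch._

```python
def wordpiece_split(word, vocab, unknown="[UNK]"):
    """Greedily split a word into the longest sub-words found in the vocabulary."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark pieces that continue a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unknown]  # no piece matched: give up on the whole word
        pieces.append(piece)
        start = end
    return pieces


toy_vocab = {"play", "##ing", "##ed"}  # hypothetical vocabulary
print(wordpiece_split("playing", toy_vocab))
print(wordpiece_split("played", toy_vocab))
print(wordpiece_split("xyz", toy_vocab))
```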

___4. What is the maximum sequence length of the input?___

_512 Tokens_

___5. How many layers are frozen in the fine-tuning step?___

_No layers are frozen during fine-tuning. All the pre-trained layers along with the task-specific parameters are trained simultaneously._

___Is discriminative fine-tuning used?___

_No. All the parameters are tuned with the same learning rate._

___6. What are the optimal values of the hyperparameters used in fine-tuning?___

_The optimal hyperparameter values are task-specific. But, the authors found that the following range of values works well across all tasks:_

* _Dropout – 0.1_
* _Batch Size – 16, 32_
* _Learning Rate (Adam) – 5e-5, 3e-5, 2e-5_
* _Number of epochs – 3, 4_

_The authors also observed that large datasets (> 100k labeled samples) are less sensitive to hyperparameter choice than smaller datasets._
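_Since the recommended ranges are small, trying every combination is feasible for many tasks. The sketch below only enumerates the configurations; the dictionary keys and the function name `fine_tuning_grid` are chosen for the example, and the training call itself is left out._

```python
from itertools import product

# Value ranges taken from the list above.
BATCH_SIZES = [16, 32]
LEARNING_RATES = [5e-5, 3e-5, 2e-5]
EPOCHS = [3, 4]
DROPOUT = 0.1


def fine_tuning_grid():
    """Enumerate every combination of the suggested fine-tuning settings."""
    return [
        {"batch_size": batch, "learning_rate": rate, "epochs": epochs, "dropout": DROPOUT}
        for batch, rate, epochs in product(BATCH_SIZES, LEARNING_RATES, EPOCHS)
    ]


grid = fine_tuning_grid()
print(len(grid), "configurations")
for config in grid[:3]:
    print(config)
```

_Each configuration would be fine-tuned separately and the one with the best score on a held-out development set kept._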

___7. Is BERT available in languages other than English?___

_Yes, there is a multilingual BERT model available as well._

___8. Is BERT available on domain specific pre-trained corpus?___

_Yes. I have come across ClinicalBERT (BERT pre-trained on a corpus of clinical notes) and SciBERT (pre-trained contextualized embeddings for scientific text)._

___BERT-As-Service___

_BERT-As-Service works in a simple way. It creates a BERT server which we can access using the Python code in our notebook._

<table>
<tbody>
<tr>
<td><a href="https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip" target="_blank" rel="noopener noreferrer"><i>BERT-Base, Uncased</i></a></td>
<td><i>12-layer, 768-hidden, 12-heads, 110M parameters</i></td>
</tr>
<tr>
<td><a href="https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-24_H-1024_A-16.zip" target="_blank" rel="noopener noreferrer"><i>BERT-Large, Uncased</i></a></td>
<td><i>24-layer, 1024-hidden, 16-heads, 340M parameters</i></td>
</tr>
<tr>
<td><a href="https://storage.googleapis.com/bert_models/2018_10_18/cased_L-12_H-768_A-12.zip" target="_blank" rel="noopener noreferrer"><i>BERT-Base, Cased</i></a></td>
<td><i>12-layer, 768-hidden, 12-heads, 110M parameters</i></td>
</tr>
<tr>
<td><a href="https://storage.googleapis.com/bert_models/2018_10_18/cased_L-24_H-1024_A-16.zip" target="_blank" rel="noopener noreferrer"><i>BERT-Large, Cased</i></a></td>
<td><i>24-layer, 1024-hidden, 16-heads, 340M parameters</i></td>
</tr>
<tr>
<td><a href="https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip" target="_blank" rel="noopener noreferrer"><i>BERT-Base, Multilingual Cased (New)</i></a></td>
<td><i>104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters</i></td>
</tr>
<tr>
<td><a href="https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H-768_A-12.zip" target="_blank" rel="noopener noreferrer"><i>BERT-Base, Multilingual Cased (Old)</i></a></td>
<td><i>102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters</i></td>
</tr>
<tr>
<td><a href="https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip" target="_blank" rel="noopener noreferrer"><i>BERT-Base, Chinese</i></a></td>
<td><i>Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters</i></td>
</tr>
</tbody>
</table>

## ___Pre-Trained NLP Models___

* ___OpenAI’s GPT-3___

    _The successor to GPT and GPT-2, GPT-3 is one of the most controversial pre-trained models by OpenAI. This large-scale transformer-based language model has 175 billion parameters, which is ten times more than any previous non-sparse language model. The model achieves strong performance on many NLP datasets, including tasks like translation and question answering, as well as several tasks that require on-the-fly reasoning, such as unscrambling words. It has even been used to write news articles and to generate code that helps developers build ML applications. At the time of its release, GPT-3 was the largest model of its kind, and its capabilities positioned it ahead of other text prediction models. In June 2020, the company released an API that gives users access to its new AI models._


* ___Google’s BERT___

    _Bidirectional Encoder Representations from Transformers (BERT) is a pre-trained NLP model developed by Google in 2018. With this, anyone in the world can train their own question answering models in about 30 minutes on a single Cloud TPU, or in a few hours using a single GPU. With the release, the company showcased its performance on 11 NLP tasks, including the very competitive Stanford Question Answering Dataset (SQuAD). BERT was pre-trained on 2,500 million words of Wikipedia and 800 million words of BooksCorpus. According to the researchers, BERT reached an F1 score of 93.2 on SQuAD v1.1, surpassing previous results._


* ___Microsoft’s CodeBERT___

    _Microsoft’s CodeBERT, whose ‘BERT’ suffix refers to Google’s BERT framework for NLP, is built on a bidirectional multi-layer Transformer architecture. By learning the connection between natural language (NL) and programming language (PL), the model supports tasks such as code search and code documentation generation. When fine-tuned on NL-PL tasks, CodeBERT achieved excellent performance on both natural language code search and code documentation generation. The model was trained on a large dataset from GitHub code repositories in six programming languages, comprising 2.1 million bimodal data points and 6.4 million unimodal code samples._


* ___ELMo___

    _ELMo (Embeddings from Language Models) is a deep contextualised word representation that models the syntax and semantics of words as well as how their use varies across linguistic contexts. The model was developed at the Allen Institute for AI (AllenNLP); its word vectors are learned functions of the internal states of a deep bidirectional language model (biLM) pre-trained on a large text corpus. ELMo can easily be added to existing models and significantly improves results across a wide range of NLP problems, including question answering, textual entailment and sentiment analysis._


* ___XLNet___

    _XLNet, from researchers at Carnegie Mellon University and Google, builds on the Transformer-XL model and is pre-trained with an autoregressive method that learns from bidirectional contexts. It handles NLP tasks such as text classification, sentiment analysis and question answering, and it has outperformed BERT on many of them, including the GLUE benchmark for English. According to the researchers, XLNet surpassed BERT on 20 tasks, including SQuAD, GLUE and RACE. Because it does not rely on masked input, XLNet also avoids the pretrain-finetune discrepancy that BERT suffers from, and it does not need BERT’s assumption that masked tokens are independent of each other._

    _To improve training, XLNet introduces permutation language modeling, where all tokens are predicted but in random order. This contrasts with BERT’s masked language model, where only the masked tokens (15%) are predicted, and with traditional language models, where all tokens are predicted in sequential order. Predicting in random order helps the model learn bidirectional relationships and therefore handle dependencies and relations between words better. In addition, Transformer-XL was used as the base architecture, which showed good performance even in the absence of permutation-based training._

    _XLNet was trained on over 130 GB of textual data with 512 TPU chips running for 2.5 days, both of which are much larger than what BERT used._
    

* ___Google’s ALBERT___

    _Google’s ALBERT is a deep-learning NLP model, an upgrade of BERT, which has advanced the state of the art on 12 NLP tasks, including the competitive SQuAD v2.0 and the SAT-style reading comprehension benchmark RACE. It has been released as an open-source implementation on the TensorFlow framework and includes many ready-to-use pre-trained language representation models. ALBERT base uses 89% fewer parameters than the comparable BERT model (only 12M) with only a small loss of accuracy, averaging 80.1% across evaluation tasks. Two optimisations reduce the model size: factorisation of the embedding layer and parameter sharing across the hidden layers of the network._


* ___ULMFiT___

    _ULMFiT (Universal Language Model Fine-tuning) is an effective transfer learning method that can be applied to any NLP task. It significantly outperformed the state of the art on six text classification tasks, reducing the error by 18-24% on the majority of datasets. Furthermore, with only 100 labelled examples it matches the performance of training from scratch on 100x more data. The creators have open-sourced their pre-trained models and code for developers to use._
    

* ___Facebook’s RoBERTa___

    _Facebook’s RoBERTa is an optimised method for pre-training a self-supervised NLP system, built on BERT’s language masking strategy. The model is trained to predict intentionally hidden sections of text within otherwise unannotated language examples. RoBERTa modifies key hyperparameters in BERT, which allows it to improve on the masked language modelling objective and leads to better downstream task performance. The researchers also trained RoBERTa on more data than BERT and for a longer time, using existing unannotated NLP datasets as well as public news articles._
    
    _Introduced at Facebook, RoBERTa (Robustly optimized BERT approach) is a retraining of BERT with an improved training methodology, 1000% more data and more compute power._

    _To improve the training procedure, __RoBERTa removes the Next Sentence Prediction (NSP) task from BERT’s pre-training and introduces dynamic masking, so that the masked tokens change across training epochs.__ Larger training batch sizes were also found to be more useful in the training procedure._

    _Importantly, RoBERTa uses 160 GB of text for pre-training, including the 16 GB of BooksCorpus and English Wikipedia used in BERT. The additional data includes the CommonCrawl News dataset (63 million articles, 76 GB), a Web text corpus (38 GB) and Stories from Common Crawl (31 GB). RoBERTa was pre-trained on this data using a whopping 1024 Tesla V100 GPUs running for a day.
    As a result, RoBERTa outperforms both BERT and XLNet on GLUE benchmark results:_

    <img src='https://miro.medium.com/max/700/1*tHwpOWmEyrD-boE5qBrkHw.png' width=500/>
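
_To make the difference between static and dynamic masking concrete, here is a minimal Python sketch. It is an illustration under simplifying assumptions, not RoBERTa's actual implementation: the function names (`mask_tokens`, `static_masking`, `dynamic_masking`) are hypothetical, tokens are plain strings, and every selected token is simply replaced by `[MASK]`._

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Return a copy of `tokens` with about `mask_prob` of them replaced by [MASK].

    Simplification (assumption): real masked language modelling sometimes keeps
    the original token or swaps in a random one; that detail is omitted here.
    """
    if rng is None:
        rng = random.Random()
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]


def static_masking(tokens, epochs, seed=0):
    """Mask once during preprocessing and reuse the same pattern every epoch."""
    masked = mask_tokens(tokens, rng=random.Random(seed))
    return [list(masked) for _ in range(epochs)]


def dynamic_masking(tokens, epochs, seed=0):
    """Draw a fresh mask every time the sequence is fed to the model."""
    rng = random.Random(seed)
    return [mask_tokens(tokens, rng=rng) for _ in range(epochs)]


if __name__ == "__main__":
    sentence = "the quick brown fox jumps over the lazy dog again and again".split()
    for epoch, seq in enumerate(dynamic_masking(sentence, epochs=3)):
        print(epoch, " ".join(seq))
```

_With static masking the model sees the same hidden positions in every epoch, whereas dynamic masking lets it see different positions hidden over the course of training._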

## ___DistilBERT___

___How should we put these monsters in production? How can we use such large models under low latency constraints? Do we need (costly) GPU servers to serve at scale?___

_There are many techniques available to address these questions. The most common tools include __quantization (approximating the weights of a network with lower precision)__ and __weight pruning (removing some connections in the network)__._
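
_As a rough illustration of these two ideas, the sketch below applies them to a plain list of weights. It is a toy example under stated assumptions, not how any particular framework implements them: the helper names are hypothetical, the quantization is a simple symmetric 8-bit scheme, and pruning is done by magnitude._

```python
def quantize_int8(weights):
    """Symmetric linear quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the stored integers back to approximate float weights."""
    return [q * scale for q in quantized]


def prune_by_magnitude(weights, sparsity=0.5):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    n_prune = int(len(weights) * sparsity)
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(by_magnitude[:n_prune])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


if __name__ == "__main__":
    weights = [0.5, -1.27, 0.01, 0.9, -0.3, 0.02]
    q, scale = quantize_int8(weights)
    print("quantized:", q)
    print("restored: ", dequantize(q, scale))
    print("pruned:   ", prune_by_magnitude(weights, sparsity=0.5))
```

_Quantization keeps every connection but stores each weight less precisely, while pruning keeps the precision but removes the connections that contribute least._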

_Then came __Distillation: a technique you can use to compress a large model, called the teacher, into a smaller model, called the student__._

___Knowledge distillation (sometimes also referred to as teacher-student learning) is a compression technique in which a small model is trained to reproduce the behavior of a larger model (or an ensemble of models).___

_DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than bert-base-uncased and runs 60% faster, while preserving over 95% of BERT’s performance as measured on the GLUE language understanding benchmark._

<img src='https://4.bp.blogspot.com/-v0xrp7eJRfM/Xr77DD85ObI/AAAAAAAADDY/KjIlWlFZExQA84VRDrMEMrB534euKAzlgCLcBGAsYHQ/s1600/NLP%2Bmodels.png' width=900/>

_Specifically, __it has no token-type embeddings or pooler, and it retains only half of the layers of Google’s BERT. DistilBERT uses a technique called distillation, which approximates Google’s BERT (the large neural network) with a smaller one__. The idea is that once a large neural network has been trained, its full output distributions can be approximated using a smaller network. This is in some sense similar to posterior approximation. One of the key objective functions used for posterior approximation in Bayesian statistics is the Kullback-Leibler divergence, and it is naturally used here as well._

_Note: In Bayesian statistics, we are approximating the true posterior (from the data), whereas with distillation we are just approximating the posterior learned by the larger network._
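
_The idea of matching the teacher's output distribution can be sketched in a few lines of Python. This is a minimal illustration, assuming raw logits from a hypothetical teacher and student over the same classes and a softmax temperature of 2.0 chosen arbitrarily; it shows only the Kullback-Leibler term and is not DistilBERT's full training objective._

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """How far the student's softened distribution is from the teacher's."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return kl_divergence(teacher, student)


if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.1]        # hypothetical teacher logits
    good_student = [1.9, 1.1, 0.0]   # close to the teacher
    poor_student = [0.1, 1.0, 2.0]   # far from the teacher
    print(distillation_loss(teacher, good_student))
    print(distillation_loss(teacher, poor_student))
```

_The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two distributions diverge, which is what pushes the student toward the teacher's behaviour during training._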

### ___So which one to use?___

<img src='https://miro.medium.com/max/700/1*bSUO_Qib4te1xQmBlQjWaw.png' width=600/>

_If you really need a __faster inference speed__ and can give up a few percent on prediction metrics, __DistilBERT__ is a reasonable starting choice. However, if you are looking for the __best prediction metrics__, you’ll be better off with __Facebook’s RoBERTa__._

_Theoretically, __XLNet’s__ permutation-based training should __handle dependencies well, and might work better in the longer run__._

_However, Google’s __BERT__ does serve as a __good baseline__ to work with, and if you don't have any of the above critical needs, you can keep your systems running with BERT._

### ___How to Use BERT?___

* _[TensorFlow](https://github.com/google-research/bert)_
* _[PyTorch - Hugging Face](https://github.com/huggingface/transformers)_
* _[fastai](http://mlexplained.com/2019/05/13/a-tutorial-to-fine-tuning-bert-with-fast-ai/)_
* _[Keras](https://github.com/Separius/BERT-keras)_
* _[ktrain](https://towardsdatascience.com/bert-text-classification-in-3-lines-of-code-using-keras-264db7e7a358)_