# <center> Code Implementation of BERT </center>
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://drive.google.com/file/d/1HWQVIOjmRjaDjDorUUz6qa5JZQobEhYM/view?usp=sharing)


# Bidirectional Encoder Representations from Transformers (BERT)
- The paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) describes BERT in full detail. The model was developed by Google AI.
    - "BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers." - arXiv:1810.04805v2
- BERT is an innovative architecture that led to substantial improvements in <b> natural language inference, question answering, sentiment analysis, text summarization, and so many other tasks! </b>

# BERT Pros and Its Succession to LSTM
- You can think of BERT as the successor to LSTM; [Check out my video on LSTM!](https://www.youtube.com/watch?v=rmxogwIjOhE)
    - Drawbacks of LSTMs
        - Slow to train
            - Words are processed in sequential order (not in parallel)
        - Not "really" bidirectional: a bidirectional LSTM runs separate left-to-right and right-to-left passes and combines them, so some context is lost
- The different BERT model sizes from the paper are available in the [Google Research GitHub repository](https://github.com/google-research/bert)

### <center> LSTM Architecture </center>
![LSTM](LSTM.png)

# BERT Architecture
- The BERT architecture is a multi-layer bidirectional transformer encoder; Here is a [video](https://www.youtube.com/watch?v=X0tB-J8_TS4) and [github link](https://github.com/SpencerPao/Natural-Language-Processing/tree/main/Transformers) on Transformers!
    - So it is <b> literally </b> taking the transformer encoder and stacking the encoders on top of each other!
        - The BERT base architecture has 12 encoder blocks, 12 attention heads, and 110 million parameters
        - The BERT large architecture has 24 encoder blocks, 16 attention heads, and 340 million parameters
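
As a rough sanity check on those parameter counts, here is a minimal sketch that adds up the weights of a stacked encoder. The hidden sizes (768 and 1024), the vocabulary size (30522), the 512 positions, and the 4x feed-forward width are assumptions taken from the released BERT configs, not from this notebook, and `bert_encoder_params` is a hypothetical helper name.

```python
def bert_encoder_params(layers, hidden, vocab_size=30522, max_positions=512,
                        segment_types=2, ffn_multiplier=4):
    """Rough parameter count for a BERT-style stack of encoder blocks.

    All default values are assumptions based on the released BERT configs.
    """
    layer_norm = 2 * hidden  # one scale and one bias per hidden unit

    # Token, position and segment embedding tables, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + segment_types) * hidden + layer_norm

    # Self-attention: query, key, value and output projections (weights + biases).
    # The number of heads only splits these matrices, so it does not change the count.
    attention = 4 * (hidden * hidden + hidden)

    # Position-wise feed-forward network: hidden -> inner -> hidden.
    inner = ffn_multiplier * hidden
    feed_forward = (hidden * inner + inner) + (inner * hidden + hidden)

    # Each encoder block has attention + LayerNorm and feed-forward + LayerNorm.
    encoder_block = attention + layer_norm + feed_forward + layer_norm

    # Pooler: one dense layer applied to the first ([CLS]) output vector.
    pooler = hidden * hidden + hidden

    return embeddings + layers * encoder_block + pooler


base = bert_encoder_params(layers=12, hidden=768)
large = bert_encoder_params(layers=24, hidden=1024)
print(f"BERT base:  ~{base / 1e6:.0f}M parameters")
print(f"BERT large: ~{large / 1e6:.0f}M parameters")
```

Under these assumptions the totals land close to the rounded 110 million and 340 million figures quoted above; the small gap comes from rounding in the paper.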

Images are from: <b> [Attention Is All You Need](https://arxiv.org/abs/1706.03762) </b>

Encoder Block             |  Attention Framework
:-------------------------:|:-------------------------:
<img src="Encoder.png" width="300" height="450"> |  <img src="Attention.png" width="600" height="900">



# How is BERT Trained?

There are two steps in this framework: pre-training and fine-tuning.

### Pre-training
The BERT model is trained on unlabeled data over different pre-training tasks. This is how the BERT architecture learns the language and its context.
It accomplishes this in two ways:
- Masked Language Model (MLM)
    - This masks some of the words and attempts to predict which word fits in each masked position.
    - Original Sentence: "Make sure to like and subscribe!"
    - Masking: "Make sure to <b>[MASK1]</b> and <b>[MASK2]</b>"
    - The model then attempts to predict what <b>[MASK1]</b> and <b>[MASK2]</b> are by plugging in terms.
- Next Sentence Prediction (NSP)
    - Similarly to the MLM process, the NSP process takes a pair of sentences and attempts to predict whether the second sentence actually follows the first.
    - Original Sentences: 
        - <b> Prior Sentence </b>: "Make sure to like and subscribe!"
        - <b> Post Sentence </b>: "I just hit the like button with notifications and subscribe button!"
    - Does the post sentence follow the prior sentence?
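
The two pre-training tasks above can be sketched in a few lines of plain Python. This is a simplified illustration, not the paper's exact procedure: tokens are split on whitespace instead of WordPiece, every selected token is replaced by `[MASK]` (the paper also sometimes keeps the token or swaps in a random one), and the third sentence is a made-up negative example. `mask_tokens` and `make_nsp_example` are hypothetical helper names.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly replace tokens with [MASK]; return the masked list and the targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            targets[position] = token  # the word the model must recover
        else:
            masked.append(token)
    return masked, targets


def make_nsp_example(sentences, index, rng):
    """Build one NSP example: (sentence_a, sentence_b, is_next)."""
    sentence_a = sentences[index]
    true_next = sentences[index + 1]
    if rng.random() < 0.5:
        return sentence_a, true_next, True
    # Otherwise pick a sentence that is NOT the real next sentence.
    others = [s for i, s in enumerate(sentences) if i not in (index, index + 1)]
    return sentence_a, rng.choice(others), False


tokens = "make sure to like and subscribe".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked, targets)

sentences = [
    "Make sure to like and subscribe!",
    "I just hit the like button with notifications and subscribe button!",
    "The weather was cold yesterday.",  # hypothetical unrelated sentence
]
print(make_nsp_example(sentences, 0, random.Random(0)))
```

The `targets` dictionary plays the role of the MLM labels, and the boolean in the NSP tuple is the binary label the model is trained to predict.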
    
In industry, you will typically use an existing BERT model that has already been pre-trained on a corpus with its own vocabulary, and either use the model out of the box or go straight to the fine-tuning phase with your training data.

# Overview for pre-training:
- Train BERT using NSP and MLM
    - Every token in the input sentence is mapped to a token embedding
    - Add the segment and positional embeddings to account for which sentence each token belongs to and the ordering of the inputs
    - Pass into BERT
    - Outputs word vectors for MLM and a binary value for NSP
    - Word vectors are passed into a Softmax layer with X neurons, where X = number of possible words in the vocab
    - The predicted word distribution is compared with the true word using Cross Entropy Loss, and that loss is what trains the model
<img src="Token_Embeddings.png" width="600" height="900">
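
To make the steps in this overview concrete, here is a minimal sketch of the two numeric pieces: summing the three embeddings for one token, and scoring a masked position with softmax and cross entropy. The six-word vocabulary, the embedding values, and the logits are all made up for illustration; a real model learns them.

```python
import math


def add_embeddings(token_emb, segment_emb, position_emb):
    """BERT's input vector for one token is the element-wise sum of three embeddings."""
    return [t + s + p for t, s, p in zip(token_emb, segment_emb, position_emb)]


def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for numerical stability)."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Loss is the negative log-probability assigned to the true word."""
    return -math.log(probs[target_index])


# Toy 2-dimensional embeddings for a single token (made-up numbers).
input_vector = add_embeddings([1.0, 2.0], [0.5, 0.5], [0.1, 0.2])

# Toy vocabulary and made-up output scores for one [MASK] position.
vocab = ["make", "sure", "to", "like", "and", "subscribe"]
logits = [0.1, 0.2, 0.0, 2.5, 0.3, 1.0]

probs = softmax(logits)
predicted = vocab[probs.index(max(probs))]
loss = cross_entropy(probs, vocab.index("like"))
print(input_vector, predicted, round(loss, 3))
```

The softmax layer here has one output per vocabulary word, which matches the "X neurons" description above; lowering the loss pushes probability toward the true masked word.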


### Fine-Tuning
The BERT model is initialized with the pre-trained parameters and then <b> all </b> the parameters are fine-tuned with labeled data. This is where the model utilizes the underlying understanding of language and context to attempt to solve a problem.
- Replace the output layer of BERT with a fully connected layer sized for the task: for example, one neuron per class for classification problems, or start and end scores over the input tokens for QA type problems.
- This process is relatively fast to train since only the output layer parameters are learned from scratch
    - The other parameters (encoder blocks) start from their pre-trained values and won't change as dramatically
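
A new output layer of this kind can be sketched as one linear layer plus a softmax applied to the encoder's output vector. Everything numeric here is a stand-in: the 8-dimensional vector replaces a real encoder output (BERT base would produce 768 values), and the weights are freshly initialized rather than trained, so the printed prediction is meaningless until fine-tuning.

```python
import math
import random


def linear(x, weights, bias):
    """Fully connected layer: one weighted sum (plus bias) per output neuron."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Convert the output scores into class probabilities."""
    highest = max(logits)
    exps = [math.exp(v - highest) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
hidden_size = 8  # toy size for illustration only
labels = ["NEGATIVE", "POSITIVE"]

# Stand-in for the vector the pre-trained encoder would output for the input.
encoder_output = [rng.uniform(-1, 1) for _ in range(hidden_size)]

# The new task-specific layer: one neuron per class, small random initial weights.
weights = [[rng.gauss(0, 0.02) for _ in range(hidden_size)] for _ in labels]
bias = [0.0] * len(labels)

probs = softmax(linear(encoder_output, weights, bias))
prediction = labels[probs.index(max(probs))]
print(dict(zip(labels, probs)), prediction)
```

Changing `labels` is all it takes to resize the layer for a different classification problem, which is why the number of neurons varies with the task.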
 


### BERT Structure for various NLP tasks
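- Tok - a token (a word or word piece from the input)
- E - Embeddings (pretrained embeddings from the pre-training step) -- These are vectors of the same size
- C - the final output vector for the [CLS] token, which is used to predict the class label
- [CLS] - the special classification token added at the start of every input; its output drives the classification (this can be a binary output for example)
- T - represents the contextual representation of a token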
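
To show how the `Tok` inputs in the figure are laid out, here is a small sketch that wraps one or two tokenized sentences with the special `[CLS]` and `[SEP]` tokens and builds the matching segment and position ids. Whitespace tokenization is a simplification (BERT uses WordPiece), and `build_bert_input` is a hypothetical helper name.

```python
def build_bert_input(tokens_a, tokens_b=None):
    """Return (tokens, segment_ids, position_ids) for a single sentence or a pair."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)  # first sentence belongs to segment 0

    if tokens_b:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # second sentence and its [SEP]

    position_ids = list(range(len(tokens)))  # 0, 1, 2, ... in input order
    return tokens, segment_ids, position_ids


# Single sentence (for example, sentiment classification).
single = build_bert_input("make sure to like and subscribe".split())
print(single)

# Sentence pair (for example, NSP or question answering).
pair = build_bert_input(
    "make sure to like and subscribe".split(),
    "i just hit the like button".split(),
)
print(pair)
```

Each of the three lists has one entry per token, and they index the token, segment, and positional embeddings that get summed before the first encoder block.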

Images are from: [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805), which also has more information on how to approach your specific problem.
<img src="Fine_Tuned_Tasks.png" width="600" height="900">

# Overview for fine-tuning
- Provide a labeled (supervised) dataset, add a task-specific output layer, and fine-tune the model on it

# Cool. Now, let's start applying BERT!
There are of course sooo many applications for BERT:
- Determine if a movie’s reviews are positive or negative
- Help chatbots answer questions
- Help predict text when writing an email
- Quickly summarize long legal contracts

Let's keep it simple and see how it can be applied with Sentiment Analysis! Now as you are probably aware, I have done a Sentiment Analysis Video [here](https://www.youtube.com/watch?v=CzRrD76pnVY) but with an LSTM. So, let's do the same with BERT!

We want to predict if the text has a POSITIVE or NEGATIVE sentiment associated.
- We are going to be conducting Sentiment Analysis.
- Please see <b> figure (b), single sentence classification </b>, in the section: BERT Structure for various NLP tasks for the architecture.

# [Clone Repository](https://github.com/google-research/bert) from Google Research
- It has pretty much everything you need to get started on training and utilizing BERT
