# Understanding BERT

- BERT, short for Bidirectional Encoder Representations from Transformers, is a machine learning (ML) model for natural language processing (NLP).
- It was developed in 2018 by researchers at Google AI Language.

### 1. What is BERT used for?
- Can determine how positive or negative a movie’s reviews are. ([Sentiment Analysis](https://huggingface.co/blog/sentiment-analysis-python))
- Helps chatbots answer your questions. ([Question answering](https://huggingface.co/tasks/question-answering))
- Predicts your text when writing an email (Gmail). ([Text prediction](https://huggingface.co/tasks/fill-mask))
- Can write an article about any topic with just a few sentence inputs. ([Text generation](https://huggingface.co/tasks/text-generation))
- Can quickly summarize long legal contracts. ([Summarization](https://huggingface.co/tasks/summarization))
- Can differentiate words that have multiple meanings (like ‘bank’) based on the surrounding text. (Polysemy resolution)

### 2. How does BERT work?

#### 2.1 Large amounts of training data
- A massive dataset of 3.3 billion words has contributed to BERT’s continued success.
- BERT was trained on English Wikipedia (2.5B words) and BooksCorpus (800M words). These large informational datasets contributed to BERT’s deep knowledge not only of the English language but also of our world! 🚀
- Training on a dataset this large takes a long time. BERT’s training was made possible by the Transformer architecture and sped up by TPUs (Tensor Processing Units, Google’s custom chips built specifically for large ML workloads). The larger BERT model was trained on 64 TPU chips over the course of 4 days.
- DistilBERT is a lighter version of BERT; it runs 60% faster while maintaining over 95% of BERT’s performance.

#### 2.2 Masked Language Model (MLM)
- MLM enforces bidirectional learning by masking (hiding) a word in a sentence, so that BERT has to use the words on both sides of the hidden word to predict it.
- Examples:
    1. “Yesterday I was walking through the park and a friendly squirrel [blank] up to me.”
    2. “As the sun [blank] over the horizon, we knew it was going to be a beautiful day.”
    3. “On her first day at the new job, she felt [blank] but also excited about the opportunities.”
    4. “The chef added a secret ingredient that made the soup [blank] better than usual.”
       
- During training, a random 15% of the tokens in the input are hidden, and BERT’s job is to correctly predict them.
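
The masking step above can be sketched in a few lines of Python. This is a simplified illustration, not BERT's actual preprocessing: the helper name `mask_tokens` is hypothetical, a whitespace split stands in for BERT's WordPiece tokenizer, and the word "ran" is an assumed fill for the blank in the first example. The original procedure also sometimes swaps a chosen token for a random one or leaves it unchanged instead of always using `[MASK]`; that detail is left out here.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide roughly mask_prob of the tokens; return (masked tokens, labels)."""
    rng = random.Random(seed)
    # Mask at least one position so short sentences still get a target.
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    masked = [MASK_TOKEN if i in positions else tok for i, tok in enumerate(tokens)]
    # labels maps each hidden position to the original token the model must predict.
    labels = {i: tokens[i] for i in positions}
    return masked, labels

sentence = "yesterday i was walking through the park and a friendly squirrel ran up to me"
tokens = sentence.split()  # whitespace split: an assumption, not WordPiece
masked, labels = mask_tokens(tokens)
print(" ".join(masked))
print(labels)
```

The model only sees `masked`; the training loss is computed on the positions stored in `labels`.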

#### 2.3 Next Sentence Prediction (NSP)
- NSP helps BERT learn relationships between sentences by predicting whether a given sentence actually follows the previous one.
- Examples of correct sentence pairs:
    1. The cat climbed up the tree. It was stuck there for hours. (correct sentence pair)
    2. They planned a vacation to Hawaii. They booked flights and hotels yesterday. (correct sentence pair)
    3. The museum had an exhibition on ancient Egypt. It featured artifacts from the Pharaoh's tomb. (correct sentence pair)
- Examples of incorrect sentence pairs:
    1. The library was very quiet. The race car zoomed around the track. (incorrect sentence pair)
    2. It was raining heavily outside. The recipe calls for two cups of flour. (incorrect sentence pair)
    3. She's an avid reader of science fiction. Basketball is a popular sport worldwide. (incorrect sentence pair)

- In training, half of the sentence pairs are correct (the second sentence really follows the first) and half pair the first sentence with a random one, which teaches BERT to tell the two cases apart.
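
A minimal sketch of how such a 50/50 training set could be assembled is shown below. The function name `make_nsp_examples` is hypothetical, and alternating positive and negative examples is a simplification chosen to keep the split exact; the original setup picks at random and draws the replacement sentence from elsewhere in the corpus. The fourth sentence pair is invented for illustration.

```python
import random

def make_nsp_examples(true_pairs, seed=0):
    """Turn correct (first, second) pairs into labelled NSP examples."""
    rng = random.Random(seed)
    examples = []
    for i, (first, second) in enumerate(true_pairs):
        if i % 2 == 0:
            # Positive example: keep the sentence that really follows.
            examples.append((first, second, True))
        else:
            # Negative example: borrow the second sentence of another pair.
            others = [b for j, (_, b) in enumerate(true_pairs) if j != i]
            examples.append((first, rng.choice(others), False))
    return examples

true_pairs = [
    ("The cat climbed up the tree.", "It was stuck there for hours."),
    ("They planned a vacation to Hawaii.", "They booked flights and hotels yesterday."),
    ("The museum had an exhibition on ancient Egypt.", "It featured artifacts from the Pharaoh's tomb."),
    # Invented pair, added only to make the split even.
    ("It was raining heavily outside.", "We stayed in and watched a movie."),
]

examples = make_nsp_examples(true_pairs)
for first, second, is_next in examples:
    print(is_next, "|", first, second)
```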

#### 2.4 Transformers
- This requires a new section of its own!

# Understanding Transformers

- Transformers have revolutionized the field of natural language processing (NLP).
- Challenges in NLP before transformers:
    1. **Sequential Processing**: Traditional NLP models processed text sequentially, either left-to-right or right-to-left. This meant they could not fully grasp the context of words in a sentence, especially in long sentences.
    2. **Long-Term Dependencies**: Capturing dependencies between words that are far apart in a text was challenging. Models like RNNs and LSTMs struggled with this due to issues like vanishing gradients.
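
To get a feel for why vanishing gradients hurt long-range dependencies, consider a deliberately simplified model in which the gradient is multiplied by the same factor at every time step. The function `gradient_scale` and the factor 0.9 are assumptions for illustration; in a real RNN the factor is a matrix that changes from step to step.

```python
def gradient_scale(per_step_factor, num_steps):
    """Size of a gradient after being multiplied by the same factor at every step."""
    return per_step_factor ** num_steps

# The further back in the sequence a word is, the weaker the signal that reaches it.
for steps in (10, 50, 100):
    print(steps, gradient_scale(0.9, steps))
```

With any factor below 1, the signal shrinks exponentially with distance, so words far apart barely influence each other's updates.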


- Transformers, introduced in the 2017 paper “Attention Is All You Need” by Vaswani et al., brought a significant shift in this approach.
    1. **Parallel Processing**: Unlike their predecessors, transformers process all words in a sentence simultaneously rather than one at a time. This speeds up training and lets every word draw on context from the whole sentence.
    2. **Attention Mechanism**: The key innovation in transformers is the attention mechanism. It allows the model to focus on different parts of the input sequence when predicting a word, giving importance to words based on their relevance.
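
The attention mechanism described above can be sketched as scaled dot-product attention in plain Python. This is a toy version under stated assumptions: the input vectors are made-up numbers rather than learned embeddings, and the learned query/key/value projections and multiple heads used in real transformers are omitted. The function names are illustrative.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Relevance of every key to this query, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The output is the weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy 2-dimensional vectors for three tokens (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x)  # self-attention: queries, keys and values all come from x
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` shows how much one token attends to every token in the sequence, and all rows are computed independently, which is what makes parallel processing possible.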

![image.png](attachment:6631a96c-4c03-40b2-a3c2-414d1100bffd.png)

# Next Word Prediction 

https://colab.research.google.com/drive/10bBuh4D_eY2F_FFw6cKV3_ytt2dq6QMg?authuser=1#scrollTo=tSkENh9DRaOa