# The Annotated BERT: Bidirectional Encoder Representations from Transformers

**Authors:** [Your Name]

BERT (Bidirectional Encoder Representations from Transformers) has transformed the NLP landscape by introducing a bidirectional approach that effectively captures language context. Unlike traditional models that process text unidirectionally, BERT employs a deep bidirectional architecture based on Transformers, enabling it to jointly condition on both the left and right context in all layers. This innovation allows BERT to achieve state-of-the-art results across a wide range of natural language processing tasks, including question answering, natural language inference, and named entity recognition.

The key advancements in BERT include the use of a masked language model (MLM) objective, which pre-trains the model by predicting randomly masked words in a sentence, and the next sentence prediction (NSP) task, which helps in understanding relationships between sentences. By pre-training on large corpora such as BooksCorpus and English Wikipedia, BERT learns robust language representations that can be fine-tuned with minimal task-specific modifications. This paper explores the architecture, pre-training methodology, and fine-tuning processes that make BERT a cornerstone of modern NLP research and applications.


# Table of Contents
1. [Introduction](#Introduction)
2. [Preliminaries](#Preliminaries)
3. [Background](#Background)
4. [Model Architecture](#Model-Architecture)
    - [Overview of BERT](#Overview-of-BERT)
    - [Key Design Choices](#Key-Design-Choices)
    - [Input Representation](#Input-Representation)
    - [Pre-training Objectives](#Pre-training-Objectives)
    - [Transformer Encoder Design](#Transformer-Encoder-Design)
    - [Fine-tuning BERT for Classification Tasks](#Fine-tuning-BERT-for-Classification-Tasks)
5. [Model Training](#Model-Training)
    - [Pre-training Procedure](#Pre-training-Procedure)
    - [Loading Pre-trained Weights](#Loading-Pre-trained-Weights)
    - [Fine-tuning Approach](#Fine-tuning-Approach)
6. [Experimental Setup](#Experimental-Setup)
    - [Datasets Used](#Datasets-Used)
    - [Hyperparameter Selection](#Hyperparameter-Selection)
    - [Evaluation Metrics](#Evaluation-Metrics)
7. [Results and Analysis](#Results-and-Analysis)
8. [Sentiment Analysis using BERT: A Real-World Example](#Sentiment-Analysis-using-BERT:-A-Real-World-Example)
9. [Applications and Use Cases](#Applications-and-Use-Cases)
10. [Challenges and Limitations](#Challenges-and-Limitations)
11. [Future Work](#Future-Work)
12. [Conclusion](#Conclusion)
13. [References](#References)

# Introduction

BERT (Bidirectional Encoder Representations from Transformers) has revolutionized the field of Natural Language Processing (NLP) by providing a pre-trained model capable of understanding the context of words in a sentence from both directions (left-to-right and right-to-left). Unlike traditional NLP models, which process text in a unidirectional manner, BERT utilizes the Transformer architecture to create bidirectional representations, making it more powerful in capturing the nuances of language. 

BERT’s success stems from its innovative pre-training tasks—**Masked Language Modeling (MLM)** and **Next Sentence Prediction (NSP)**—which allow the model to learn rich, deep contextual embeddings from a large corpus of text. After pre-training on vast amounts of unlabelled text, BERT can be fine-tuned for a wide range of NLP tasks, such as question answering, sentiment analysis, and named entity recognition, by simply adding a task-specific layer and fine-tuning on labeled data.

This notebook delves into the key concepts behind BERT’s architecture, training methods, and practical applications, with code implementations that demonstrate how BERT can be used to solve real-world NLP challenges.


# Preliminaries

In this notebook, we will explore the BERT (Bidirectional Encoder Representations from Transformers) model, a breakthrough in Natural Language Processing (NLP) that leverages the Transformer architecture to produce contextualized word embeddings. BERT is pre-trained on a large corpus of text and fine-tuned for downstream tasks such as question answering, sentiment analysis, and named entity recognition.

To implement BERT and explore its architecture, we will be using the following Python libraries:

1. **Transformers**: A popular library by Hugging Face that provides pre-trained models and tokenizers for state-of-the-art NLP architectures, including BERT.
2. **Torch**: A deep learning framework used to run the BERT model and perform tensor computations efficiently on both CPU and GPU.
3. **TensorFlow (1.x)**: Used by the code cells adapted from the original BERT codebase, together with that codebase's `modeling`, `optimization`, and `tokenization` modules, which must be importable from the working directory.

In [None]:
# imports

from transformers import BertTokenizer, BertModel, BertForMaskedLM, BertForNextSentencePrediction, pipeline
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import random
import warnings
import collections
import csv
import os
import modeling
import optimization
import tokenization
import tensorflow as tf
import copy
import json
import math
import re
import six
warnings.filterwarnings("ignore")

# Background

The field of Natural Language Processing (NLP) has made significant progress over the past decade, largely driven by the development of deep learning models. Prior to the advent of transformer-based models, NLP systems were heavily reliant on traditional models like Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs), which were sequential in nature. While these models were effective for many tasks, they struggled with handling long-range dependencies and parallelization. The breakthrough came with the introduction of the Transformer model by Vaswani et al. in 2017 in the paper "Attention Is All You Need." This model, based on self-attention mechanisms, was capable of processing entire sequences in parallel, making it more efficient and scalable compared to RNNs and LSTMs.

BERT (Bidirectional Encoder Representations from Transformers), introduced by Devlin et al. in 2018, represents a significant advancement in this paradigm. Unlike earlier Transformer-based language models such as OpenAI GPT, which were unidirectional (reading left-to-right or right-to-left), BERT is bidirectional: it conditions on context from both the left and the right of a token during training, which gives it a deeper understanding of a word's meaning from its surroundings. BERT is pre-trained on large amounts of text using two objectives: **Masked Language Modeling (MLM)** and **Next Sentence Prediction (NSP)**. These pre-training tasks enable BERT to capture rich contextual representations of words and of the relationships between sentences, setting it apart from earlier models.

The impact of BERT on the NLP community has been profound. It achieved state-of-the-art results across a wide range of benchmarks, including question answering, sentiment analysis, and named entity recognition, among others. BERT’s pre-training approach allows it to be fine-tuned on downstream tasks with relatively small datasets, making it highly versatile for various NLP applications. Additionally, BERT has inspired several model variants, including RoBERTa, ALBERT, and DistilBERT, which build upon and optimize its architecture.

With the rise of transformer-based models like BERT, the landscape of NLP research and applications has shifted towards pre-trained models, enabling researchers and developers to fine-tune a single model for a wide range of specific tasks. This approach has significantly reduced the barriers to entry for building state-of-the-art NLP systems, democratizing access to powerful language models.


# Model Architecture

BERT (Bidirectional Encoder Representations from Transformers) is based on the Transformer architecture, specifically the **encoder stack**. Unlike models that process text sequentially (e.g., RNNs or LSTMs), BERT relies on **self-attention mechanisms** that consider the relationships between all words in a sentence simultaneously, capturing long-range dependencies more efficiently. Because BERT is bidirectional, it takes both the left and right context into account during training, rather than reading only left-to-right or right-to-left. This results in richer and more accurate contextual embeddings for words. The encoder is a stack of identical layers, each combining multi-head self-attention with a position-wise feed-forward network, which lets the model learn complex relationships and representations. Since self-attention by itself is order-agnostic, BERT adds learned **positional embeddings** to retain the order of words in the sentence.


![BERT Architecture](BERT-Architecture.png)

## Overview of BERT
BERT (Bidirectional Encoder Representations from Transformers) is a language representation model designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers. Unlike traditional language models, which process text either left-to-right or right-to-left, BERT employs a “masked language model” (MLM) objective to learn bidirectional representations; at the time of its release, this allowed it to surpass the previous state of the art on a range of natural language processing (NLP) tasks.

## Key Design Choices
BERT uses a unified architecture across different tasks, enabling minimal task-specific adjustments during fine-tuning. The model architecture is a multi-layer bidirectional Transformer encoder based on the implementation of Vaswani et al. (2017). The bidirectional self-attention mechanism in BERT allows it to better capture context compared to unidirectional models like OpenAI GPT.
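To make the contrast between bidirectional and unidirectional attention concrete, here is a small NumPy sketch. It is not code from the BERT codebase: the function name `attention_weights` and the random scores are illustrative assumptions, standing in for the scaled query-key products of a real attention head.

```python
import numpy as np

def attention_weights(scores, mask):
    """Row-wise softmax over `scores`, ignoring positions where mask == 0."""
    masked = np.where(mask == 1, scores, -1e9)             # blocked positions get ~0 probability
    masked = masked - masked.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    exp = np.exp(masked)
    return exp / exp.sum(axis=-1, keepdims=True)

seq_len = 4
rng = np.random.default_rng(0)
scores = rng.normal(size=(seq_len, seq_len))  # hypothetical attention scores

# BERT-style: every token may attend to every other token.
bidirectional_mask = np.ones((seq_len, seq_len), dtype=int)
# GPT-style: token i may only attend to tokens 0..i.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=int))

bert_weights = attention_weights(scores, bidirectional_mask)
gpt_weights = attention_weights(scores, causal_mask)

# The first token sees the whole sequence only under the bidirectional mask.
print(np.round(bert_weights[0], 3))
print(np.round(gpt_weights[0], 3))
```

Under the causal mask the first token can only attend to itself, while under the bidirectional mask it distributes attention over the whole sequence.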


## Input Representation
The input representation in BERT can unambiguously handle single sentences or sentence pairs in a single token sequence. This representation combines token embeddings, segment embeddings, and positional embeddings.

- **Token Embeddings**: The model uses WordPiece embeddings with a vocabulary of 30,000 tokens. Each token in the input sequence is looked up in an embedding table and mapped to a fixed-size vector. This vector is context-independent; contextual meaning is added later by the Transformer layers.

- **Sentence Pair Representation**: To handle tasks involving sentence pairs (e.g., question-answering), BERT concatenates the sentences into a single input sequence. The sentences are separated by a special token `[SEP]`. The first token of every sequence is a special classification token `[CLS]`, whose final hidden state is used for downstream classification tasks. Learned embeddings are added to indicate whether tokens belong to sentence A or sentence B.

- **Positional and Segment Embeddings**: BERT incorporates positional embeddings to capture the position of each token in the sequence and segment embeddings to differentiate tokens belonging to different sentences in a sentence pair.
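The packing scheme described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the preprocessing code of the BERT scripts: `build_bert_input`, the toy vocabulary, and the randomly initialised embedding tables are all hypothetical, and real BERT additionally applies layer normalization and dropout to the summed embeddings.

```python
import numpy as np

def build_bert_input(tokens_a, tokens_b=None, max_seq_length=16):
    """Pack one or two token lists into tokens, segment ids, input mask, and position ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)                 # sentence A -> segment 0
    if tokens_b:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)    # sentence B and its [SEP] -> segment 1
    if len(tokens) > max_seq_length:
        raise ValueError("sequence is longer than max_seq_length")
    input_mask = [1] * len(tokens)                  # 1 = real token, 0 = padding
    pad = max_seq_length - len(tokens)
    tokens += ["[PAD]"] * pad
    segment_ids += [0] * pad
    input_mask += [0] * pad
    position_ids = list(range(max_seq_length))
    return tokens, segment_ids, input_mask, position_ids

tokens, segment_ids, input_mask, position_ids = build_bert_input(
    ["where", "is", "paris"], ["paris", "is", "in", "france"], max_seq_length=12)

# Toy embedding tables (random here; learned during pre-training in the real model).
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
hidden_size = 8
rng = np.random.default_rng(0)
token_table = rng.normal(size=(len(vocab), hidden_size))
segment_table = rng.normal(size=(2, hidden_size))
position_table = rng.normal(size=(12, hidden_size))

input_ids = [vocab[t] for t in tokens]
# The input representation is the element-wise sum of the three embeddings.
embeddings = token_table[input_ids] + segment_table[segment_ids] + position_table[position_ids]

print(tokens)
print(segment_ids)
print(input_mask)
print(embeddings.shape)
```

Note that the same word ("paris") receives the same token embedding in both sentences; only the segment and position embeddings differ before the Transformer layers add context.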

The input representation process integrates components from `modeling.py`, `tokenization.py`, and task-specific scripts like `run_classifier.py` and `run_squad.py`. `modeling.py` defines the model architecture, including token, segment, and position embeddings, while `tokenization.py` handles the conversion of raw text into WordPiece tokens. In `run_classifier.py` and `run_squad.py`, input processing is managed, specifically for single and paired sentences, ensuring that data is properly tokenized and formatted for different NLP tasks such as classification and question answering.


In [None]:
# BertModel: Core BERT model class with embeddings and transformer layers

class BertModel(object):
    """BERT model with Token, Segment, and Position Embeddings."""

    def __init__(self, config, is_training, input_ids, input_mask=None, token_type_ids=None, use_one_hot_embeddings=False):
        config = copy.deepcopy(config)
        if not is_training:
            config.hidden_dropout_prob = 0.0
            config.attention_probs_dropout_prob = 0.0

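        # Default to "all tokens are real" and "everything is sentence A".
        if input_mask is None:
            input_mask = tf.ones_like(input_ids)
        if token_type_ids is None:
            token_type_ids = tf.zeros_like(input_ids)

        # The helpers used below are assumed to have the signatures defined in modeling.py.
        self.embedding_output, self.embedding_table = embedding_lookup(
            input_ids, config.vocab_size, config.hidden_size, use_one_hot_embeddings=use_one_hot_embeddings)
        # Add segment (token type) and learned position embeddings.
        self.embedding_output = embedding_postprocessor(
            input_tensor=self.embedding_output,
            use_token_type=True,
            token_type_ids=token_type_ids,
            token_type_vocab_size=config.type_vocab_size,
            max_position_embeddings=config.max_position_embeddings,
            dropout_prob=config.hidden_dropout_prob)

        attention_mask = create_attention_mask_from_input_mask(input_ids, input_mask)

        self.all_encoder_layers = transformer_model(
            input_tensor=self.embedding_output,
            attention_mask=attention_mask,
            hidden_size=config.hidden_size,
            num_hidden_layers=config.num_hidden_layers,
            num_attention_heads=config.num_attention_heads,
            intermediate_size=config.intermediate_size,
            hidden_dropout_prob=config.hidden_dropout_prob,
            attention_probs_dropout_prob=config.attention_probs_dropout_prob,
            do_return_all_layers=True  # return a list of layers so [-1] selects the final one
        )
        self.sequence_output = self.all_encoder_layers[-1]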

        first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
        self.pooled_output = tf.layers.dense(
            first_token_tensor, config.hidden_size, activation=tf.tanh)

    def get_pooled_output(self):
        return self.pooled_output

    def get_sequence_output(self):
        return self.sequence_output

    def get_embedding_output(self):
        return self.embedding_output

    def get_embedding_table(self):
        return self.embedding_table


The `FullTokenizer` combines the `BasicTokenizer` and `WordpieceTokenizer` for complete tokenization. The `BasicTokenizer` handles lowercasing, cleaning, and whitespace splitting, while the `WordpieceTokenizer` splits words into subwords based on a vocabulary, using `[UNK]` for out-of-vocabulary tokens. This process ensures efficient text tokenization for NLP tasks, addressing both basic and subword tokenization needs seamlessly.
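The greedy longest-match-first behaviour of WordPiece is easiest to see on a toy vocabulary. The following is a self-contained sketch, not the tokenizer shipped with BERT: `wordpiece_tokenize`, `tokenize`, and `toy_vocab` are hypothetical names, and the basic tokenization step is reduced to lowercasing and whitespace splitting.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into WordPiece sub-tokens."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # any unmatched piece turns the whole word into [UNK]
        pieces.append(match)
        start = end
    return pieces

def tokenize(text, vocab):
    """Lowercase, split on whitespace, then apply WordPiece to each word."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(wordpiece_tokenize(word, vocab))
    return tokens

toy_vocab = {"un", "##aff", "##able", "play", "##ing", "the"}
print(tokenize("The unaffable playing xyz", toy_vocab))
# → ['the', 'un', '##aff', '##able', 'play', '##ing', '[UNK]']
```

A word is emitted either as a complete sequence of vocabulary pieces or as a single `[UNK]`, never as a mix of the two.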


In [None]:


class FullTokenizer:
    """Combines Basic and WordPiece tokenization."""
    def __init__(self, vocab):
        self.vocab = vocab
        self.basic_tokenizer = BasicTokenizer()
        self.wordpiece_tokenizer = WordpieceTokenizer(vocab)

    def tokenize(self, text):
        tokens = self.basic_tokenizer.tokenize(text)
        return [sub_token for token in tokens for sub_token in self.wordpiece_tokenizer.tokenize(token)]

class BasicTokenizer:
    """Basic tokenization for text preprocessing."""
    def tokenize(self, text):
        text = convert_to_unicode(text).lower()
        text = clean_text(text)
        return whitespace_tokenize(text)

class WordpieceTokenizer:
    """Handles WordPiece tokenization."""
    def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200):
        self.vocab = vocab
        self.unk_token = unk_token
        self.max_input_chars_per_word = max_input_chars_per_word

    def tokenize(self, text):
        output_tokens = []
        for token in whitespace_tokenize(text):
            chars = list(token)
            if len(chars) > self.max_input_chars_per_word:
                output_tokens.append(self.unk_token)
                continue
            sub_tokens, start, is_bad = [], 0, False
            while start < len(chars):
                # Greedy longest-match-first: shrink the window until a vocab entry matches.
                end, cur_substr = len(chars), None
                while start < end:
                    substr = "".join(chars[start:end])
                    if start > 0:
                        substr = "##" + substr
                    if substr in self.vocab:
                        cur_substr = substr
                        break
                    end -= 1
                if cur_substr is None:
                    is_bad = True
                    break
                sub_tokens.append(cur_substr)
                start = end
            # A word with any unmatched piece becomes a single [UNK];
            # partial matches are discarded rather than emitted alongside it.
            if is_bad:
                output_tokens.append(self.unk_token)
            else:
                output_tokens.extend(sub_tokens)
        return output_tokens


## Pre-training Objectives
BERT employs two novel pre-training objectives to learn bidirectional representations: Masked Language Model (MLM) and Next Sentence Prediction (NSP). The MLM task involves randomly masking 15% of the tokens in each input sequence, and the model is then trained to predict these masked tokens based on the surrounding context. This approach allows BERT to leverage context from both left and right sides of each token, unlike traditional unidirectional language models. The NSP task, on the other hand, is designed to improve BERT's understanding of sentence relationships. In this task, pairs of sentences are presented to the model, and it must predict whether the second sentence follows the first in the original text. These two objectives together allow BERT to capture both token-level and sentence-level information, providing a more comprehensive understanding of language.

BERT uses two unsupervised objectives during pre-training:

1. **Masked Language Modeling (MLM)**: Randomly masks 15% of tokens in the input and predicts them using the context from both sides. This task enables deep bidirectional representations by avoiding the constraints of traditional left-to-right or right-to-left language models.

2. **Next Sentence Prediction (NSP)**: This task helps the model understand the relationship between two sentences. For a given sentence pair, 50% of the time the second sentence is the actual next sentence, and 50% of the time it is a random sentence from the corpus. The model predicts whether the second sentence logically follows the first.
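The 50/50 sampling used for NSP can be illustrated with a short sketch. This is an assumption-laden simplification rather than the data pipeline of the BERT scripts: `make_nsp_pairs` and the two toy documents are hypothetical, and sentences are kept as plain strings instead of token lists.

```python
import random

def make_nsp_pairs(documents, rng):
    """Build (sentence_a, sentence_b, is_random_next) examples from a list of documents."""
    examples = []
    for doc_index, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            sentence_a = doc[i]
            if rng.random() < 0.5:
                # Negative example: draw sentence B from a different document.
                other_index = rng.choice(
                    [j for j in range(len(documents)) if j != doc_index])
                sentence_b = rng.choice(documents[other_index])
                is_random_next = True
            else:
                # Positive example: sentence B is the actual next sentence.
                sentence_b = doc[i + 1]
                is_random_next = False
            examples.append((sentence_a, sentence_b, is_random_next))
    return examples

documents = [
    ["the cat sat on the mat .", "it purred loudly .", "then it fell asleep ."],
    ["stocks rose on monday .", "analysts expected the gain .", "trading was heavy ."],
]

rng = random.Random(12345)
for sentence_a, sentence_b, is_random_next in make_nsp_pairs(documents, rng):
    print(is_random_next, "|", sentence_a, "->", sentence_b)
```

Drawing negatives from a different document guarantees that a "random" sentence is never accidentally the true continuation.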

In [None]:
def create_masked_lm_predictions(tokens, masked_lm_prob, max_predictions_per_seq, vocab_words, rng):
    """Creates the predictions for the masked LM objective."""

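    cand_indexes = []
    for (i, token) in enumerate(tokens):
        if token != "[CLS]" and token != "[SEP]":
            cand_indexes.append(i)

    rng.shuffle(cand_indexes)
    # Work on a copy so the caller's token list is left untouched.
    output_tokens = list(tokens)
    # Mask at least one token when possible, and never more than max_predictions_per_seq.
    num_to_mask = min(max_predictions_per_seq, len(cand_indexes),
                      max(1, int(round(len(cand_indexes) * masked_lm_prob))))
    masked_lm_positions = []
    masked_lm_labels = []

    # Visit the chosen positions in sentence order so positions and labels stay sorted.
    for masked_index in sorted(cand_indexes[:num_to_mask]):
        masked_lm_positions.append(masked_index)
        masked_lm_labels.append(tokens[masked_index])

        if rng.random() < 0.8:
            # 80% of the time: replace the token with [MASK].
            output_tokens[masked_index] = "[MASK]"
        elif rng.random() < 0.5:
            # 10% of the time: replace the token with a random vocabulary word.
            output_tokens[masked_index] = vocab_words[rng.randint(0, len(vocab_words) - 1)]
        # Remaining 10% of the time: keep the original token.

    return output_tokens, masked_lm_positions, masked_lm_labels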


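def truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng):
    """Truncates a pair of sequences in place until they fit in max_num_tokens."""
    while True:
        total_length = len(tokens_a) + len(tokens_b)
        if total_length <= max_num_tokens:
            break
        # Always shorten the longer of the two sequences.
        trunc_tokens = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        # Drop from the front or the back at random to avoid biasing which end survives.
        if rng.random() < 0.5:
            del trunc_tokens[0]
        else:
            trunc_tokens.pop()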


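# Command-line flags (vocab_file, input_file, ...) are assumed to be defined elsewhere.
FLAGS = tf.flags.FLAGS


def main(_):  # tf.app.run() passes the remaining argv as a positional argument
    rng = random.Random(FLAGS.random_seed)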
    tokenizer = tokenization.FullTokenizer(vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)
    input_files = FLAGS.input_file.split(",")
    instances = create_training_instances(input_files, tokenizer, FLAGS.max_seq_length,
                                          FLAGS.dupe_factor, FLAGS.short_seq_prob,
                                          FLAGS.masked_lm_prob, FLAGS.max_predictions_per_seq, rng)
    output_files = FLAGS.output_file.split(",")
    write_instance_to_example_files(instances, tokenizer, FLAGS.max_seq_length,
                                    FLAGS.max_predictions_per_seq, output_files)


if __name__ == "__main__":
    tf.app.run()


In [None]:

class BERTPreTraining:
    def __init__(self, bert_config, init_checkpoint, learning_rate, num_train_steps,
                 num_warmup_steps, use_tpu, use_one_hot_embeddings):
        self.bert_config = bert_config
        self.init_checkpoint = init_checkpoint
        self.learning_rate = learning_rate
        self.num_train_steps = num_train_steps
        self.num_warmup_steps = num_warmup_steps
        self.use_tpu = use_tpu
        self.use_one_hot_embeddings = use_one_hot_embeddings

    def model_fn_builder(self):
        """Returns model_fn closure for TPUEstimator."""
        def model_fn(features, labels, mode, params):
            tf.logging.info("*** Features ***")
            for name in sorted(features.keys()):
                tf.logging.info("  name = %s, shape = %s" % (name, features[name].shape))

            input_ids = features["input_ids"]
            input_mask = features["input_mask"]
            segment_ids = features["segment_ids"]
            masked_lm_positions = features["masked_lm_positions"]
            masked_lm_ids = features["masked_lm_ids"]
            masked_lm_weights = features["masked_lm_weights"]
            next_sentence_labels = features["next_sentence_labels"]

            is_training = (mode == tf.estimator.ModeKeys.TRAIN)

            model = modeling.BertModel(
                config=self.bert_config,
                is_training=is_training,
                input_ids=input_ids,
                input_mask=input_mask,
                token_type_ids=segment_ids,
                use_one_hot_embeddings=self.use_one_hot_embeddings)

            # Keep the per-example losses and log-probs; the eval metrics need them.
            # The MLM head reuses the input embedding table as its output projection.
            (masked_lm_loss, masked_lm_example_loss,
             masked_lm_log_probs) = self.get_masked_lm_output(
                 model.get_sequence_output(), model.get_embedding_table(),
                 masked_lm_positions, masked_lm_ids, masked_lm_weights)
            (next_sentence_loss, next_sentence_example_loss,
             next_sentence_log_probs) = self.get_next_sentence_output(
                 model.get_pooled_output(), next_sentence_labels)

            total_loss = masked_lm_loss + next_sentence_loss
            tvars = tf.trainable_variables()

            initialized_variable_names = {}
            scaffold_fn = None
            if self.init_checkpoint:
                assignment_map, initialized_variable_names = modeling.get_assignment_map_from_checkpoint(
                    tvars, self.init_checkpoint)
                if self.use_tpu:
                    def tpu_scaffold():
                        tf.train.init_from_checkpoint(self.init_checkpoint, assignment_map)
                        return tf.train.Scaffold()
                    scaffold_fn = tpu_scaffold
                else:
                    tf.train.init_from_checkpoint(self.init_checkpoint, assignment_map)

            if mode == tf.estimator.ModeKeys.TRAIN:
                train_op = optimization.create_optimizer(
                    total_loss, self.learning_rate, self.num_train_steps, self.num_warmup_steps, self.use_tpu)
                return tf.contrib.tpu.TPUEstimatorSpec(mode=mode, loss=total_loss, train_op=train_op, scaffold_fn=scaffold_fn)
            elif mode == tf.estimator.ModeKeys.EVAL:
                # TPUEstimatorSpec expects eval_metrics as a (metric_fn, tensors) pair.
                eval_metrics = (self.get_eval_metrics, [
                    masked_lm_example_loss, masked_lm_log_probs, masked_lm_ids,
                    masked_lm_weights, next_sentence_example_loss,
                    next_sentence_log_probs, next_sentence_labels])
                return tf.contrib.tpu.TPUEstimatorSpec(mode=mode, loss=total_loss, eval_metrics=eval_metrics, scaffold_fn=scaffold_fn)
            else:
                raise ValueError("Only TRAIN and EVAL modes are supported")

        return model_fn

    def get_masked_lm_output(self, input_tensor, output_weights, positions, label_ids, label_weights):
        """Get loss and log probs for the masked LM.

        `output_weights` is the word embedding table, shared with the input embeddings.
        """
        input_tensor = self.gather_indexes(input_tensor, positions)
        with tf.variable_scope("cls/predictions"):
            input_tensor = tf.layers.dense(
                input_tensor,
                units=self.bert_config.hidden_size,
                activation=modeling.get_activation(self.bert_config.hidden_act),
                kernel_initializer=modeling.create_initializer(self.bert_config.initializer_range))
            input_tensor = modeling.layer_norm(input_tensor)
            output_bias = tf.get_variable("output_bias", shape=[self.bert_config.vocab_size], initializer=tf.zeros_initializer())
            logits = tf.matmul(input_tensor, output_weights, transpose_b=True)
            logits = tf.nn.bias_add(logits, output_bias)
            log_probs = tf.nn.log_softmax(logits, axis=-1)

            label_ids = tf.reshape(label_ids, [-1])
            label_weights = tf.reshape(label_weights, [-1])
            one_hot_labels = tf.one_hot(label_ids, depth=self.bert_config.vocab_size, dtype=tf.float32)

            # Weighted mean: padded prediction slots have weight 0 and do not count.
            per_example_loss = -tf.reduce_sum(log_probs * one_hot_labels, axis=[-1])
            numerator = tf.reduce_sum(label_weights * per_example_loss)
            denominator = tf.reduce_sum(label_weights) + 1e-5
            loss = numerator / denominator

        return loss, per_example_loss, log_probs

    def get_next_sentence_output(self, input_tensor, labels):
        """Get loss and log probs for the next sentence prediction."""
        with tf.variable_scope("cls/seq_relationship"):
            output_weights = tf.get_variable("output_weights", shape=[2, self.bert_config.hidden_size],
                                             initializer=modeling.create_initializer(self.bert_config.initializer_range))
            output_bias = tf.get_variable("output_bias", shape=[2], initializer=tf.zeros_initializer())
            logits = tf.matmul(input_tensor, output_weights, transpose_b=True)
            logits = tf.nn.bias_add(logits, output_bias)
            log_probs = tf.nn.log_softmax(logits, axis=-1)
            labels = tf.reshape(labels, [-1])
            one_hot_labels = tf.one_hot(labels, depth=2, dtype=tf.float32)
            per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
            loss = tf.reduce_mean(per_example_loss)
        return loss, per_example_loss, log_probs

    def gather_indexes(self, sequence_tensor, positions):
        """Gathers the vectors at the specific positions over a minibatch."""
        sequence_shape = modeling.get_shape_list(sequence_tensor, expected_rank=3)
        batch_size = sequence_shape[0]
        seq_length = sequence_shape[1]
        width = sequence_shape[2]
        flat_offsets = tf.reshape(tf.range(0, batch_size, dtype=tf.int32) * seq_length, [-1, 1])
        flat_positions = tf.reshape(positions + flat_offsets, [-1])
        flat_sequence_tensor = tf.reshape(sequence_tensor, [batch_size * seq_length, width])
        return tf.gather(flat_sequence_tensor, flat_positions)

    def get_eval_metrics(self, masked_lm_example_loss, masked_lm_log_probs, masked_lm_ids,
                         masked_lm_weights, next_sentence_example_loss,
                         next_sentence_log_probs, next_sentence_labels):
        """Computes the loss and accuracy of the model."""
        masked_lm_log_probs = tf.reshape(masked_lm_log_probs, [-1, masked_lm_log_probs.shape[-1]])
        masked_lm_predictions = tf.argmax(masked_lm_log_probs, axis=-1, output_type=tf.int32)
        masked_lm_example_loss = tf.reshape(masked_lm_example_loss, [-1])
        masked_lm_ids = tf.reshape(masked_lm_ids, [-1])
        masked_lm_weights = tf.reshape(masked_lm_weights, [-1])
        masked_lm_accuracy = tf.metrics.accuracy(labels=masked_lm_ids, predictions=masked_lm_predictions, weights=masked_lm_weights)
        masked_lm_mean_loss = tf.metrics.mean(values=masked_lm_example_loss, weights=masked_lm_weights)

        next_sentence_log_probs = tf.reshape(next_sentence_log_probs, [-1, next_sentence_log_probs.shape[-1]])
        next_sentence_predictions = tf.argmax(next_sentence_log_probs, axis=-1, output_type=tf.int32)
        next_sentence_labels = tf.reshape(next_sentence_labels, [-1])
        next_sentence_accuracy = tf.metrics.accuracy(labels=next_sentence_labels, predictions=next_sentence_predictions)
        next_sentence_mean_loss = tf.metrics.mean(values=next_sentence_example_loss)

        return {
            "masked_lm_accuracy": masked_lm_accuracy,
            "masked_lm_loss": masked_lm_mean_loss,
            "next_sentence_accuracy": next_sentence_accuracy,
            "next_sentence_loss": next_sentence_mean_loss,
        }


## Transformer Encoder Design

Each Transformer layer contains two main components: multi-head self-attention and a position-wise feed-forward network (FFN). The multi-head self-attention mechanism allows each token to attend to all other tokens in the sequence, capturing dependencies across long ranges and enhancing the model’s contextual understanding. Following the self-attention, the feed-forward network processes each token representation independently. Each of the two components is wrapped in a residual connection followed by layer normalization, with dropout applied to the component's output, which stabilizes and regularizes training. BERT’s bidirectional structure enables each token to attend to both preceding and following tokens, distinguishing it from previous models like GPT, which used left-to-right unidirectional attention.
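The data flow through one encoder layer can be traced with a small NumPy sketch. It is a simplified illustration under explicit assumptions, not BERT's implementation: it handles a single unbatched sequence, omits biases, dropout, padding masks, and the learned scale and shift of layer normalization, and all function and parameter names (`encoder_layer`, `w_q`, `w_1`, ...) are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-12):
    """Normalize each token vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def gelu(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    seq_len, hidden = x.shape
    head_size = hidden // num_heads

    def split(t):  # [seq, hidden] -> [heads, seq, head_size]
        return t.reshape(seq_len, num_heads, head_size).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_size)  # [heads, seq, seq]
    context = softmax(scores) @ v                            # [heads, seq, head_size]
    context = context.transpose(1, 0, 2).reshape(seq_len, hidden)
    return context @ w_o

def encoder_layer(x, params, num_heads):
    attn = multi_head_self_attention(
        x, params["w_q"], params["w_k"], params["w_v"], params["w_o"], num_heads)
    x = layer_norm(x + attn)                        # residual + layer norm
    ffn = gelu(x @ params["w_1"]) @ params["w_2"]   # expand, activate, project back
    return layer_norm(x + ffn)                      # residual + layer norm

rng = np.random.default_rng(0)
seq_len, hidden, intermediate, num_heads = 5, 8, 32, 2
shapes = {
    "w_q": (hidden, hidden), "w_k": (hidden, hidden),
    "w_v": (hidden, hidden), "w_o": (hidden, hidden),
    "w_1": (hidden, intermediate), "w_2": (intermediate, hidden),
}
params = {name: rng.normal(scale=0.02, size=shape) for name, shape in shapes.items()}

x = rng.normal(size=(seq_len, hidden))
out = encoder_layer(x, params, num_heads)
print(out.shape)  # (5, 8)
```

The output has the same shape as the input, which is what allows identical layers to be stacked.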


In [None]:
class BertModel:
    def __init__(self, config, is_training, input_ids):
        with tf.variable_scope("bert"):
            self.embedding_output = embedding_layer(input_ids, config)
            self.all_encoder_layers = transformer_stack(
                self.embedding_output,
                hidden_size=config.hidden_size,
                num_hidden_layers=config.num_hidden_layers,
                num_attention_heads=config.num_attention_heads,
                intermediate_size=config.intermediate_size
            )
            self.sequence_output = self.all_encoder_layers[-1]

def transformer_stack(input_tensor, hidden_size, num_hidden_layers, num_attention_heads, intermediate_size):
    prev_output = input_tensor
    all_layers = []
    for layer_idx in range(num_hidden_layers):
        with tf.variable_scope(f"layer_{layer_idx}"):
            layer_output = transformer_layer(
                prev_output, hidden_size, num_attention_heads, intermediate_size
            )
            all_layers.append(layer_output)
            prev_output = layer_output
    return all_layers

def transformer_layer(input_tensor, hidden_size, num_attention_heads, intermediate_size):
    attention_output = multi_head_attention(input_tensor, hidden_size, num_attention_heads)
    attention_output = layer_norm(input_tensor + attention_output)
    intermediate_output = ff_layer(attention_output, intermediate_size)
    layer_output = layer_norm(attention_output + intermediate_output)
    return layer_output

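def multi_head_attention(input_tensor, hidden_size, num_attention_heads):
    """Multi-head self-attention over a [batch, seq_len, hidden_size] tensor.

    Simplified sketch: no padding mask and no attention dropout.
    """
    head_size = hidden_size // num_attention_heads
    batch_size = tf.shape(input_tensor)[0]

    def split_heads(x):
        # [batch, seq_len, hidden] -> [batch, heads, seq_len, head_size]
        x = tf.reshape(x, [batch_size, -1, num_attention_heads, head_size])
        return tf.transpose(x, [0, 2, 1, 3])

    query = split_heads(tf.layers.dense(input_tensor, hidden_size, name="query"))
    key = split_heads(tf.layers.dense(input_tensor, hidden_size, name="key"))
    value = split_heads(tf.layers.dense(input_tensor, hidden_size, name="value"))

    # Scaled dot-product attention: softmax(Q K^T / sqrt(head_size)) V
    scores = tf.matmul(query, key, transpose_b=True) / (float(head_size) ** 0.5)
    context = tf.matmul(tf.nn.softmax(scores, axis=-1), value)
    context = tf.reshape(tf.transpose(context, [0, 2, 1, 3]), [batch_size, -1, hidden_size])
    return tf.layers.dense(context, hidden_size, name="attention_output")

def ff_layer(input_tensor, intermediate_size):
    """Position-wise feed-forward network: expand with GELU, then project back."""
    hidden_size = input_tensor.shape[-1].value
    intermediate = tf.layers.dense(input_tensor, intermediate_size, activation=gelu, name="intermediate")
    return tf.layers.dense(intermediate, hidden_size, name="ffn_output")

def layer_norm(input_tensor):
    """Layer normalization over the last (hidden) dimension."""
    return tf.contrib.layers.layer_norm(
        inputs=input_tensor, begin_norm_axis=-1, begin_params_axis=-1)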

def gelu(x):
    cdf = 0.5 * (1.0 + tf.tanh((np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))
    return x * cdf


## Fine-tuning BERT for Classification Tasks
To adapt BERT for a specific downstream task, a task-specific layer is added on top of BERT's output, typically using the final hidden state of the `[CLS]` token. For classification, this hidden state is fed into a fully connected layer that maps it to one logit per output label. The entire model, including BERT’s pre-trained parameters, is then fine-tuned on the task-specific dataset by optimizing a loss function (e.g., cross-entropy for classification). Because only this small output layer is new, BERT can apply its pre-trained knowledge to tasks such as sentence classification and sentiment analysis with minimal task-specific architecture changes.
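The classification head itself is small enough to write out in NumPy. The sketch below is illustrative only: `classify_from_cls` is a hypothetical name, and the `[CLS]` vectors are random stand-ins for what a fine-tuned BERT encoder would produce.

```python
import numpy as np

def classify_from_cls(cls_hidden, weights, bias, labels):
    """Linear layer + softmax cross-entropy on top of pooled [CLS] vectors.

    cls_hidden: [batch, hidden], weights: [num_labels, hidden], bias: [num_labels].
    Returns the mean loss and the per-example class probabilities.
    """
    logits = cls_hidden @ weights.T + bias
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    per_example_loss = -log_probs[np.arange(len(labels)), labels]
    return per_example_loss.mean(), np.exp(log_probs)

rng = np.random.default_rng(0)
batch, hidden, num_labels = 3, 8, 2
cls_hidden = rng.normal(size=(batch, hidden))           # stand-in for BERT's pooled output
weights = rng.normal(scale=0.02, size=(num_labels, hidden))
bias = np.zeros(num_labels)
labels = np.array([0, 1, 1])

loss, probabilities = classify_from_cls(cls_hidden, weights, bias, labels)
print(loss)
print(probabilities)
```

During fine-tuning, the gradient of this loss flows back through the head into every BERT layer, so the encoder and the new layer are trained jointly.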

In [None]:
def create_model(bert_config, is_training, input_ids, input_mask, segment_ids,
                 labels, num_labels, use_one_hot_embeddings):
    model = modeling.BertModel(
        config=bert_config,
        is_training=is_training,
        input_ids=input_ids,
        input_mask=input_mask,
        token_type_ids=segment_ids,
        use_one_hot_embeddings=use_one_hot_embeddings)

    output_layer = model.get_pooled_output()
    hidden_size = output_layer.shape[-1].value

    output_weights = tf.get_variable(
        "output_weights", [num_labels, hidden_size],
        initializer=tf.truncated_normal_initializer(stddev=0.02))
    output_bias = tf.get_variable(
        "output_bias", [num_labels], initializer=tf.zeros_initializer())

    with tf.variable_scope("loss"):
        logits = tf.matmul(output_layer, output_weights, transpose_b=True)
        logits = tf.nn.bias_add(logits, output_bias)
        probabilities = tf.nn.softmax(logits, axis=-1)
        log_probs = tf.nn.log_softmax(logits, axis=-1)

        one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
        per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
        loss = tf.reduce_mean(per_example_loss)

    return (loss, per_example_loss, logits, probabilities)

In [None]:
class BertModel(object):
    def __init__(self,
                 config,
                 is_training,
                 input_ids,
                 input_mask=None,
                 token_type_ids=None,
                 use_one_hot_embeddings=False,
                 scope=None):
        # ... (initialization code elided; the full implementation sets
        # self.pooled_output and self.sequence_output here)
        pass

    def get_pooled_output(self):
        return self.pooled_output

    def get_sequence_output(self):
        return self.sequence_output

In [None]:
def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps, use_tpu):
    global_step = tf.train.get_or_create_global_step()

    learning_rate = tf.constant(value=init_lr, shape=[], dtype=tf.float32)
    learning_rate = tf.train.polynomial_decay(
        learning_rate,
        global_step,
        num_train_steps,
        end_learning_rate=0.0,
        power=1.0,
        cycle=False)

    if num_warmup_steps:
        global_steps_int = tf.cast(global_step, tf.int32)
        warmup_steps_int = tf.constant(num_warmup_steps, dtype=tf.int32)

        global_steps_float = tf.cast(global_steps_int, tf.float32)
        warmup_steps_float = tf.cast(warmup_steps_int, tf.float32)

        warmup_percent_done = global_steps_float / warmup_steps_float
        warmup_learning_rate = init_lr * warmup_percent_done

        is_warmup = tf.cast(global_steps_int < warmup_steps_int, tf.float32)
        learning_rate = (
            (1.0 - is_warmup) * learning_rate + is_warmup * warmup_learning_rate)

    optimizer = AdamWeightDecayOptimizer(
        learning_rate=learning_rate,
        weight_decay_rate=0.01,
        beta_1=0.9,
        beta_2=0.999,
        epsilon=1e-6,
        exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"])

    if use_tpu:
        optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer)

    tvars = tf.trainable_variables()
    grads = tf.gradients(loss, tvars)

    (grads, _) = tf.clip_by_global_norm(grads, clip_norm=1.0)

    train_op = optimizer.apply_gradients(
        zip(grads, tvars), global_step=global_step)

    new_global_step = global_step + 1
    train_op = tf.group(train_op, [global_step.assign(new_global_step)])
    return train_op
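
The schedule built inside `create_optimizer` above (linear warmup, then linear decay to zero) can be checked without TensorFlow. The sketch below mirrors that logic in plain Python; the helper name `learning_rate_at` is an assumption introduced for illustration.

```python
def learning_rate_at(step, init_lr, num_train_steps, num_warmup_steps):
    """Learning rate at a given global step.

    During warmup the rate grows linearly from 0 to init_lr. Afterwards it
    follows a linear decay from init_lr at step 0 to 0 at num_train_steps,
    evaluated at the current step (the decay is not offset by the warmup).
    """
    if num_warmup_steps and step < num_warmup_steps:
        return init_lr * step / num_warmup_steps
    step = min(step, num_train_steps)
    return init_lr * (1.0 - step / num_train_steps)

# Example values (hypothetical run length): 1000 steps with 100 warmup steps.
schedule = [learning_rate_at(s, 5e-5, 1000, 100) for s in (0, 50, 100, 1000)]
```

Note that the decay term is evaluated at the raw global step, so the rate right after warmup is slightly below `init_lr` rather than exactly equal to it.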


# Model Training

## Pre-training Procedure

BERT's pre-training procedure is a key innovation that allows it to create deep bidirectional representations. The procedure involves two unsupervised tasks:

1. Masked Language Model (MLM): In this task, 15% of the input tokens are selected at random, and the model is trained to predict the original tokens at those positions. This approach allows the model to capture bidirectional context, unlike traditional left-to-right language models. Each selected token is handled as follows:
   - 80% of the time, it is replaced with [MASK]
   - 10% of the time, it is replaced with a random token
   - 10% of the time, it is left unchanged

   Because [MASK] never appears during fine-tuning, this mix reduces the mismatch between pre-training and fine-tuning. The encoder also cannot tell which tokens it will be asked to predict or which were replaced, so it must keep a distributional contextual representation of every input token.

2. Next Sentence Prediction (NSP): This task trains the model to understand relationships between sentences. Given two sentences A and B, the model predicts whether B actually follows A in the original text. This task is crucial for downstream tasks that require understanding sentence relationships, such as Question Answering and Natural Language Inference.
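
The 80/10/10 rule from the MLM item above can be sketched in a few lines. This is a simplified illustration, not the repository's data pipeline: the function name `mask_tokens`, the toy vocabulary, and the choice to skip only [CLS] and [SEP] are assumptions.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, rng, masked_lm_prob=0.15):
    """Return (masked_tokens, labels); labels maps position -> original token."""
    # Special tokens are never selected for prediction.
    candidates = [i for i, t in enumerate(tokens) if t not in ("[CLS]", "[SEP]")]
    rng.shuffle(candidates)
    num_to_predict = max(1, round(len(candidates) * masked_lm_prob))

    output = list(tokens)
    labels = {}
    for i in sorted(candidates[:num_to_predict]):
        labels[i] = tokens[i]
        r = rng.random()
        if r < 0.8:
            output[i] = MASK               # 80%: replace with [MASK]
        elif r < 0.9:
            output[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return output, labels

# Hypothetical toy sentence and vocabulary.
tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
vocab = ["dog", "ran", "under", "a", "tree"]
masked, labels = mask_tokens(tokens, vocab, random.Random(0))
```

The `labels` dictionary records the prediction targets even for positions that were left unchanged, which is what forces the model to produce a useful representation at every position.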

The pre-training data consists of the BooksCorpus (800M words) and English Wikipedia (2,500M words). For Wikipedia, only the text passages are extracted; lists, tables, and headers are ignored. Using a document-level corpus rather than a shuffled sentence-level corpus is crucial for extracting long contiguous sequences.
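
A document-level corpus is also what makes the NSP objective possible, since "next sentence" is only defined within a document. In the paper, B is the true next sentence 50% of the time and a random sentence from the corpus otherwise. The sketch below samples such pairs; `make_nsp_pairs` and the toy documents are hypothetical, and it assumes at least two documents.

```python
import random

def make_nsp_pairs(documents, rng):
    """Build (sentence_a, sentence_b, is_next) triples.

    documents: list of documents, each a list of sentence strings.
    Negative examples draw sentence_b from a different document.
    """
    pairs = []
    for doc_index, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                # Positive example: the sentence that actually follows.
                pairs.append((doc[i], doc[i + 1], True))
            else:
                # Negative example: a random sentence from another document.
                others = [d for j, d in enumerate(documents) if j != doc_index]
                pairs.append((doc[i], rng.choice(rng.choice(others)), False))
    return pairs

# Hypothetical two-document corpus.
documents = [["a1", "a2", "a3"], ["b1", "b2"]]
pairs = make_nsp_pairs(documents, random.Random(0))
```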

## Loading Pre-trained Weights

After pre-training, the model weights can be loaded for fine-tuning on specific tasks. The BERT repository provides scripts to load these pre-trained weights efficiently. The process typically involves:

1. Initializing a BERT model with the same architecture as the pre-trained model.
2. Loading the pre-trained weights into this model.
3. Verifying that all weights have been correctly loaded.

This step is crucial as it allows the model to leverage the knowledge gained during pre-training when tackling downstream tasks.
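
The three steps above amount to matching variables by name between the freshly initialized model and the checkpoint, then checking what was left uninitialized. The toy sketch below shows that idea with plain dictionaries; `load_pretrained` and the short weight lists are illustrative assumptions, not the repository's loading code.

```python
def load_pretrained(model_vars, checkpoint):
    """Copy checkpoint values into model_vars where name and size match.

    Returns (loaded, missing): variable names that were restored and names
    that kept their fresh initialization (e.g. a new task-specific layer).
    """
    loaded, missing = [], []
    for name, value in model_vars.items():
        source = checkpoint.get(name)
        if source is not None and len(source) == len(value):
            model_vars[name] = list(source)
            loaded.append(name)
        else:
            missing.append(name)
    return loaded, missing

# Step 1: a freshly initialized model (toy sizes, hypothetical values).
model_vars = {
    "bert/embeddings/word_embeddings": [0.0, 0.0, 0.0],
    "bert/pooler/dense/kernel": [0.0, 0.0],
    "output_weights": [0.0, 0.0],  # task head, absent from the checkpoint
}
checkpoint = {
    "bert/embeddings/word_embeddings": [0.1, 0.2, 0.3],
    "bert/pooler/dense/kernel": [0.4, 0.5],
}
# Step 2: load. Step 3: verify that only the task head is left uninitialized.
loaded, missing = load_pretrained(model_vars, checkpoint)
```

Seeing only the task-specific variables in `missing` is the expected outcome; an encoder variable showing up there usually points to a name or architecture mismatch.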

## Fine-tuning Approach

BERT's fine-tuning process is straightforward because the Transformer's self-attention mechanism lets the same model handle many downstream tasks, whether they involve single texts or text pairs, by simply swapping in the appropriate inputs and outputs. The fine-tuning process involves:

1. Input Representation: For applications involving text pairs, BERT concatenates the two texts into a single input sequence and encodes it with self-attention, which effectively includes bidirectional cross-attention between the two sentences.

2. Task-Specific Modifications: For each task, task-specific inputs and outputs are plugged into BERT. For token-level tasks (like sequence tagging or question answering), the token representations are fed into an output layer. For classification tasks (like sentiment analysis), the [CLS] representation is used.

3. End-to-End Training: All parameters are fine-tuned end-to-end on the specific task. This allows the model to adapt its pre-trained knowledge to the nuances of the task at hand.
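
To make the input side of these steps concrete, the sketch below packs one or two token lists into the `[CLS] A [SEP] B [SEP]` layout together with segment ids and an attention mask. The helper name `pack_pair` and the fixed padding token are assumptions for illustration; truncation is left out.

```python
def pack_pair(tokens_a, tokens_b=None, max_seq_length=16):
    """Build tokens, segment ids and input mask for a single text or a pair."""
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b:
        tokens += list(tokens_b) + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)

    if len(tokens) > max_seq_length:
        raise ValueError("sequence too long; truncate the inputs first")

    # Real tokens get mask 1, padding gets mask 0 and segment id 0.
    input_mask = [1] * len(tokens)
    padding = max_seq_length - len(tokens)
    tokens += ["[PAD]"] * padding
    segment_ids += [0] * padding
    input_mask += [0] * padding
    return tokens, segment_ids, input_mask

tokens, segment_ids, input_mask = pack_pair(
    ["how", "are", "you"], ["fine", "thanks"], max_seq_length=10)
```

For a classification task the output at position 0 (the [CLS] token) feeds the classifier, while token-level tasks read the outputs at the remaining positions.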

The fine-tuning process is relatively inexpensive compared to pre-training. According to the original paper, all of its results can be replicated in at most one hour on a single Cloud TPU, or a few hours on a GPU, starting from the same pre-trained model.

This approach has proven highly effective across a wide range of NLP tasks, often achieving state-of-the-art results with minimal task-specific architecture modifications.

In [None]:
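# Assumed imports: the TFBert* layers and get_initializer come from a Hugging Face
# transformers 4.x release that still ships its TensorFlow classes.
import tensorflow as tf
from transformers.modeling_tf_utils import get_initializer
from transformers.models.bert.modeling_tf_bert import (
    TFBertMainLayer, TFBertMLMHead, TFBertNSPHead)

# Pre-training data preparation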
class CreatePretrainingData:
    def create_training_instances(self, input_files, tokenizer, max_seq_length, dupe_factor, short_seq_prob, masked_lm_prob, max_predictions_per_seq, rng):
        # Implementation for creating pre-training instances
        pass

# BERT pre-training
class BertPreTrainingModel(tf.keras.Model):
    def __init__(self, config, *inputs, **kwargs):
        super().__init__(*inputs, **kwargs)
        self.bert = TFBertMainLayer(config, name="bert")
        self.mlm = TFBertMLMHead(config, self.bert.embeddings, name="mlm___cls")
        self.nsp = TFBertNSPHead(config, name="nsp___cls")

    def call(self, inputs, **kwargs):
        # Implementation for the forward pass
        pass

# Fine-tuning for classification tasks
class BertForSequenceClassification(tf.keras.Model):
    def __init__(self, config, *inputs, **kwargs):
        super().__init__(*inputs, **kwargs)
        self.num_labels = config.num_labels
        self.bert = TFBertMainLayer(config, name="bert")
        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)
        self.classifier = tf.keras.layers.Dense(config.num_labels,
                                               kernel_initializer=get_initializer(config.initializer_range),
                                               name="classifier")
    
    def call(self, inputs, **kwargs):
        # Implementation for classification
        pass

# Optimization
def create_optimizer(init_lr, num_train_steps, num_warmup_steps):
    """Creates an optimizer with learning rate schedule."""
    # Implementation of learning rate schedule and optimizer
    pass

# Experimental Setup

Our experimental setup fine-tunes pre-trained BERT on several downstream benchmarks and evaluates it with each benchmark's standard metrics. This section describes the datasets, the hyperparameters, and the evaluation metrics.

## Datasets Used

We employ several benchmark datasets to assess BERT's performance across different NLP tasks:

- **GLUE Benchmark**: A collection of 9 diverse NLU tasks including sentiment analysis, textual entailment, and question answering. This benchmark provides a comprehensive evaluation of BERT's language understanding abilities across multiple domains.

- **SQuAD**: Stanford Question Answering Dataset for evaluating reading comprehension. This dataset challenges BERT's ability to understand context and extract relevant information to answer questions.

- **SWAG**: Situations With Adversarial Generations for commonsense inference. This dataset tests BERT's capacity to reason about everyday situations and make logical inferences.

In [None]:
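import os

# Assumes the BERT repository is on the Python path; run_classifier.py defines both names.
from run_classifier import DataProcessor, InputExample

class GLUEProcessor(DataProcessor):
    def get_train_examples(self, data_dir):
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

    def _create_examples(self, lines, set_type):
        # Column indices assume the MRPC layout (label, id 1, id 2, sentence 1,
        # sentence 2); other GLUE tasks use different columns.
        examples = []
        for (i, line) in enumerate(lines):
            if i == 0:
                continue  # skip the TSV header row
            guid = f"{set_type}-{i}"
            text_a = line[3]
            text_b = line[4]
            label = line[0]
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples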

processor = GLUEProcessor()
train_examples = processor.get_train_examples("path/to/glue_data")
dev_examples = processor.get_dev_examples("path/to/glue_data")

## Hyperparameter Selection

We start from fine-tuning hyperparameters within the ranges recommended in the original BERT paper:

- Batch size: 32
- Learning rate: 5e-5
- Number of epochs: 3

However, we conduct limited hyperparameter tuning experiments, varying the learning rate (2e-5, 3e-5, 5e-5) and number of epochs (2-4) to find optimal settings for each task. This fine-tuning process allows us to adapt BERT's pre-trained knowledge to specific downstream tasks.
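
The sweep described above is small enough to enumerate directly. The sketch below lists every learning-rate and epoch combination and derives the training and warmup step counts for each; the function name `sweep_configs`, the 10% warmup proportion, and the example dataset size are assumptions for illustration.

```python
import itertools

def sweep_configs(num_train_examples, batch_size=32,
                  learning_rates=(2e-5, 3e-5, 5e-5), epochs=(2, 3, 4),
                  warmup_proportion=0.1):
    """Enumerate fine-tuning configurations with their derived step counts."""
    configs = []
    for lr, num_epochs in itertools.product(learning_rates, epochs):
        num_train_steps = int(num_train_examples / batch_size * num_epochs)
        configs.append({
            "learning_rate": lr,
            "num_train_epochs": num_epochs,
            "num_train_steps": num_train_steps,
            "num_warmup_steps": int(num_train_steps * warmup_proportion),
        })
    return configs

# Hypothetical dataset size chosen to give round step counts.
configs = sweep_configs(num_train_examples=3200)
```

Each configuration would be fine-tuned from the same pre-trained checkpoint and compared on the development set.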

In [None]:
# Assumes BertConfig is Hugging Face's transformers.BertConfig, which accepts
# num_labels; the classifier sketch defined earlier reads it from the config.
bert_config = BertConfig(
    vocab_size=30522,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    num_labels=2,
)

model = BertForSequenceClassification(bert_config)

train_batch_size = 32
learning_rate = 5e-5
num_train_epochs = 3.0
warmup_proportion = 0.1

num_train_steps = int(
    len(train_examples) / train_batch_size * num_train_epochs)
num_warmup_steps = int(num_train_steps * warmup_proportion)

# Matches the create_optimizer(init_lr, num_train_steps, num_warmup_steps)
# sketch defined earlier in this notebook.
optimizer = create_optimizer(
    learning_rate,
    num_train_steps,
    num_warmup_steps
)

## Evaluation Metrics

We use task-specific evaluation metrics as defined by each dataset:

- Accuracy for classification tasks
- F1 score for question answering
- Matthews correlation for CoLA (Corpus of Linguistic Acceptability)
- Spearman correlation for STS-B (Semantic Textual Similarity Benchmark)

For the GLUE benchmark, we report the average score across all tasks as the overall performance metric. This comprehensive evaluation allows us to assess BERT's versatility across various NLP challenges.
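
Of the metrics listed above, Matthews correlation is the least familiar. For binary labels it is computed from the confusion matrix as (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below implements that formula directly; returning 0.0 when the denominator is zero is a convention assumed here.

```python
import math

def matthews_corrcoef(labels, predictions):
    """Matthews correlation coefficient for binary (0/1) labels."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: undefined cases (a zero denominator) score 0.
    return (tp * tn - fp * fn) / denom if denom else 0.0

perfect = matthews_corrcoef([1, 1, 0, 0], [1, 1, 0, 0])
chance = matthews_corrcoef([1, 1, 0, 0], [1, 0, 1, 0])
inverted = matthews_corrcoef([1, 1, 0, 0], [0, 0, 1, 1])
```

Unlike accuracy, the score ranges from -1 to 1 and stays near 0 for a classifier that always predicts the majority class, which is why it suits the imbalanced CoLA data.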

BERT's architecture, consisting of multiple Transformer encoder layers, enables it to capture complex linguistic patterns and relationships. The self-attention mechanism in these layers allows BERT to weigh the importance of different words in context, leading to more nuanced representations. 

The pre-training objectives of Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) contribute to BERT's effectiveness. MLM helps BERT learn contextual representations by predicting masked words, while NSP enables it to understand relationships between sentences.

Our experimental setup aims to exploit these powerful features of BERT through careful fine-tuning and evaluation across diverse NLP tasks.

In [None]:
def metric_fn(per_example_loss, label_ids, logits):
    predictions = tf.argmax(logits, axis=-1, output_type=tf.int32)
    accuracy = tf.metrics.accuracy(label_ids, predictions)
    loss = tf.metrics.mean(per_example_loss)
    return {
        "eval_accuracy": accuracy,
        "eval_loss": loss,
    }
    
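def f1_score(labels, predictions):
    # In TF 1.x, tf.metrics.* return (value_tensor, update_op) pairs.
    precision, precision_op = tf.metrics.precision(labels, predictions)
    recall, recall_op = tf.metrics.recall(labels, predictions)
    # The small epsilon avoids 0/0 when there are no positive predictions or labels.
    f1 = 2 * precision * recall / (precision + recall + 1e-8)
    return f1, tf.group(precision_op, recall_op)

# per_example_loss, label_ids and logits are the tensors produced inside the
# Estimator's model_fn; they are not defined in this cell.
eval_metrics = (metric_fn, [per_example_loss, label_ids, logits])

metrics = eval_metrics[0](*eval_metrics[1])

# These print graph tensors; evaluate them in a session (or let the Estimator
# run the evaluation) to obtain numeric values.
print(f"Accuracy: {metrics['eval_accuracy'][0]}")
print(f"Loss: {metrics['eval_loss'][0]}")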

# Results and Analysis

BERT has demonstrated exceptional performance across a wide range of natural language processing tasks. On the General Language Understanding Evaluation (GLUE) benchmark, BERT achieved state-of-the-art results, significantly outperforming previous models. The BERT-Large model obtained a GLUE score of 80.5%, which represented a 7.7% absolute improvement over the previous best model. 

For specific tasks within GLUE, BERT showed remarkable gains. On the MultiNLI task, BERT-Large achieved an accuracy of 86.7%, a 4.6% absolute improvement over the previous state-of-the-art. On the Stanford Question Answering Dataset (SQuAD v1.1), the best BERT-Large system (an ensemble) achieved a Test F1 score of 93.2, above the human F1 of 91.2 reported alongside it.

Ablation studies revealed the importance of BERT's bidirectional nature and its novel pre-training tasks. The Next Sentence Prediction (NSP) task proved particularly beneficial for tasks involving sentence pair classification. The Masked Language Model (MLM) pre-training objective was shown to be critical for token-level tasks.

When compared to other state-of-the-art models, BERT consistently outperformed them across various benchmarks. For instance, on the SQuAD v1.1 leaderboard, the BERT-Large ensemble achieved a Test F1 score of 93.2, surpassing the previous top system by 1.5 points, while the best single BERT-Large model reached a Test F1 of 91.8.

In [None]:
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

# The classification head defaults to two labels, matching the binary task below.
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def encode_examples(texts):
    return tokenizer(texts, padding=True, truncation=True, return_tensors='tf', max_length=512)

texts = ["I loved this movie!", "This was a terrible film."]
labels = [1, 0]
encoded_data = encode_examples(texts)

# Fine-tuning function
def fine_tune_bert(model, train_dataset):
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=['accuracy'])
    model.fit(train_dataset, epochs=3)

# Pass every encoded field (input_ids, token_type_ids, attention_mask) so that
# padding positions are masked out during attention.
train_dataset = tf.data.Dataset.from_tensor_slices((dict(encoded_data), labels)).batch(2)
fine_tune_bert(model, train_dataset)

# Sentiment Analysis using BERT: A Real-World Example

BERT (Bidirectional Encoder Representations from Transformers) is a state-of-the-art model for natural language processing tasks, and its application to sentiment analysis demonstrates its effectiveness. In this study, the model was fine-tuned to classify movie reviews from the IMDB dataset as positive or negative. The text data was tokenized using the BERT tokenizer, which converts sentences into WordPiece tokens, ensuring compatibility with the model's input requirements. The architecture consisted of a pre-trained BERT encoder with an additional dense layer for classification, and the entire model was fine-tuned using binary cross-entropy loss. This process allowed the model to adapt to the specific nuances of the sentiment analysis task, capturing the contextual information essential for accurate predictions.

The fine-tuned BERT model achieved impressive accuracy, exceeding 94% on the IMDB dataset. Its ability to understand context and handle complex expressions of sentiment made it superior to traditional machine learning approaches and simpler deep learning models. The results highlight the effectiveness of pre-trained models in reducing the need for extensive feature engineering while delivering high performance. This example underscores how BERT’s bidirectional contextual understanding can simplify and enhance NLP workflows, providing a robust solution for real-world applications such as sentiment analysis.
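
The WordPiece step mentioned above splits each word greedily into the longest vocabulary pieces, marking word-internal pieces with a `##` prefix. The sketch below shows that greedy longest-match-first procedure on a toy vocabulary; the function name and the vocabulary are assumptions for illustration, and a real BERT vocabulary has roughly 30,000 entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into WordPieces."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        current = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a ## prefix
            if piece in vocab:
                current = piece
                break
            end -= 1
        if current is None:
            return [unk_token]  # no piece matched: the whole word is unknown
        tokens.append(current)
        start = end
    return tokens

# Hypothetical toy vocabulary.
vocab = {"un", "##aff", "##able", "play", "##ing"}
pieces = wordpiece_tokenize("unaffable", vocab)
```

Splitting rare words into known pieces is what lets the model handle review text with unusual words without resorting to an unknown token for each of them.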



# Applications and Use Cases

BERT has significantly transformed natural language processing (NLP) by providing a robust pre-trained model applicable to a variety of tasks. It has set new state-of-the-art benchmarks in token-level tasks like named entity recognition and question answering (e.g., SQuAD) and in sentence-level tasks such as sentiment analysis and natural language inference. BERT’s architecture, which fuses bidirectional contextual information from text, eliminates the need for extensive task-specific feature engineering. This versatility has made it a go-to solution for both academic and industrial applications, enabling advancements in chatbots, text summarization, and machine translation.


# Challenges and Limitations

While BERT is a powerful model, it comes with several challenges. Pre-training is computationally expensive: the original paper reports four days of training on 4 Cloud TPUs (16 TPU chips) for BERT-Base and on 16 Cloud TPUs (64 TPU chips) for BERT-Large. This creates barriers for researchers and developers without access to such infrastructure. Additionally, fine-tuning BERT for domain-specific applications, such as legal or medical text, often demands substantial labeled data, which may not always be readily available. Its size and memory requirements can also hinder deployment in environments with limited computational resources, such as mobile or edge devices. Moreover, the [MASK] token used during pre-training never appears during fine-tuning, which creates a mismatch between the two stages, and because only 15% of the tokens in each batch are predicted, the MLM objective needs more pre-training steps to converge than a left-to-right language model.


# Future Work

Future research directions for BERT include improving its computational efficiency and scalability. Efforts are underway to develop lightweight versions, such as distillation-based models, that retain performance while reducing resource demands. Expanding BERT’s adaptability to low-resource languages and domains through more effective transfer learning techniques is another promising area. Moreover, advancements in fine-tuning strategies could enable better utilization of small, task-specific datasets, making the model accessible to a broader range of applications. Finally, integrating BERT with other modalities, such as vision or speech, represents a frontier for creating more versatile AI systems.


# Conclusion

BERT marks a pivotal advancement in NLP, showcasing the power of bidirectional pre-trained models in solving complex language tasks. Its ability to generalize across diverse applications with minimal task-specific adjustments has redefined the field. However, challenges such as high computational costs and domain-specific limitations highlight the need for further innovation. Addressing these limitations while expanding BERT’s applicability will continue to shape its role in the future of NLP and beyond.


# References

- Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805).
- [BERT GitHub repository](https://github.com/google-research/bert)
- [MRPC data (GitHub repository)](https://github.com/MegEngine/Models/tree/master/official/nlp/bert/glue_data/MRPC)
