# BERT Sequence Classification

BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained Transformer encoder that can be fine-tuned for natural language processing tasks, including sequence classification. It uses self-attention over the whole input, so the representation of each token depends on both its left and right context.

## Input Representation

![BERT Input Representation](https://miro.medium.com/v2/resize:fit:800/format:webp/1*tLYLk2ZGu2XYmc0n4Hn8Mg.png)

BERT takes a sequence of tokens as input. The input is built from the following parts:
- `[CLS]` token: A special token added at the beginning of the sequence. Its final hidden state is used for classification tasks.
- Token embeddings: Each token in the input sequence is converted into a dense vector representation.
- Segment embeddings: Used to differentiate between segments (e.g., sentence A and sentence B) in the input sequence.
- Position embeddings: Capture the position of each token in the sequence.

These embeddings are summed element-wise to obtain the final input representation.
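
The element-wise sum can be sketched in PyTorch as follows. This is a minimal illustration, not BERT's actual code: the class name `InputEmbeddings` and all sizes are assumptions chosen to keep the example small, and the layer normalization and dropout that BERT applies after the sum are left out.

```python
import torch
import torch.nn as nn


class InputEmbeddings(nn.Module):
    """Sum of token, segment, and position embeddings (illustrative sizes)."""

    def __init__(self, vocab_size=100, hidden_size=16, max_positions=32, num_segments=2):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden_size)
        self.segment = nn.Embedding(num_segments, hidden_size)
        self.position = nn.Embedding(max_positions, hidden_size)

    def forward(self, token_ids, segment_ids):
        # token_ids and segment_ids are integer tensors of shape (batch, seq_len)
        seq_len = token_ids.size(1)
        position_ids = torch.arange(seq_len, device=token_ids.device).unsqueeze(0)
        # Broadcasting adds the same position vectors to every sequence in the batch
        return self.token(token_ids) + self.segment(segment_ids) + self.position(position_ids)


torch.manual_seed(0)
embeddings = InputEmbeddings()
token_ids = torch.randint(0, 100, (2, 5))          # 2 sequences of 5 token ids
segment_ids = torch.tensor([[0, 0, 0, 1, 1]] * 2)  # 3 tokens in segment A, 2 in segment B
summed = embeddings(token_ids, segment_ids)
print(summed.shape)  # torch.Size([2, 5, 16])
```

All three lookups return tensors with the same hidden size, which is what makes the element-wise sum possible.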

## BERT Encoder

![BERT Encoder Architecture](https://miro.medium.com/v2/resize:fit:560/format:webp/1*VtfbRAAiQhb0IUi7fSKTaQ.png)

The BERT encoder consists of multiple layers of Transformer blocks. Each Transformer block contains:
- Multi-head self-attention: Allows the model to attend to different positions of the input sequence, capturing relationships between tokens.
- Feed-forward neural network: Applies non-linear transformations to the output of the self-attention layer.

The encoder processes the input sequence and generates contextualized representations for each token.

### Multi-Head Self-Attention

Multi-head self-attention runs several attention heads in parallel. Each head has its own learned projections, so different heads can capture different relationships between tokens.

Each attention head projects the input into query (Q), key (K), and value (V) matrices and computes attention scores between all pairs of tokens in the sequence:

```
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V
```

Where `d_k` is the dimension of the key vectors.
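
The formula translates almost line for line into PyTorch. The sketch below assumes a single head and random inputs; the function name `scaled_dot_product_attention` and the tensor sizes are illustrative.

```python
import math

import torch


def scaled_dot_product_attention(q, k, v):
    # q and k have shape (batch, seq_len, d_k); v has shape (batch, seq_len, d_v)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (batch, seq_len, seq_len)
    weights = torch.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ v, weights


torch.manual_seed(0)
q = torch.rand(2, 4, 8)
k = torch.rand(2, 4, 8)
v = torch.rand(2, 4, 8)
attn_out, attn_weights = scaled_dot_product_attention(q, k, v)
print(attn_out.shape, attn_weights.shape)
```

Dividing by `sqrt(d_k)` keeps the scores in a range where the softmax does not saturate as the key dimension grows.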

The outputs of all attention heads are concatenated and passed through a linear transformation to obtain the final self-attention output.
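
One way to sketch the split into heads, the concatenation, and the final linear transformation is shown below. The class name `MultiHeadSelfAttention`, the layer names, and the sizes are assumptions for illustration; attention masking and dropout are omitted.

```python
import math

import torch
import torch.nn as nn


class MultiHeadSelfAttention(nn.Module):
    def __init__(self, hidden_size=16, num_heads=4):
        super().__init__()
        assert hidden_size % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.q_proj = nn.Linear(hidden_size, hidden_size)
        self.k_proj = nn.Linear(hidden_size, hidden_size)
        self.v_proj = nn.Linear(hidden_size, hidden_size)
        self.out_proj = nn.Linear(hidden_size, hidden_size)

    def _split_heads(self, x):
        # (batch, seq_len, hidden) -> (batch, heads, seq_len, head_dim)
        b, t, _ = x.shape
        return x.reshape(b, t, self.num_heads, self.head_dim).transpose(1, 2)

    def forward(self, x):
        b, t, h = x.shape
        q = self._split_heads(self.q_proj(x))
        k = self._split_heads(self.k_proj(x))
        v = self._split_heads(self.v_proj(x))
        # Every head attends independently over the full sequence
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        weights = torch.softmax(scores, dim=-1)
        per_head = weights @ v  # (batch, heads, seq_len, head_dim)
        # Concatenate the heads back into one vector per token
        concatenated = per_head.transpose(1, 2).reshape(b, t, h)
        return self.out_proj(concatenated)


torch.manual_seed(0)
attention = MultiHeadSelfAttention(hidden_size=16, num_heads=4)
x = torch.rand(2, 5, 16)
out = attention(x)
print(out.shape)  # torch.Size([2, 5, 16])
```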

### Position-wise Feed-Forward Neural Network

After the self-attention sub-layer, a position-wise feed-forward neural network is applied to each position separately and identically. It consists of two linear transformations with a non-linear activation in between. The original Transformer uses ReLU here, while BERT uses GELU:

```
FFN(x) = GELU(xW_1 + b_1)W_2 + b_2
```

Where `W_1`, `b_1`, `W_2`, and `b_2` are learnable parameters.
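
A minimal sketch of this sub-layer follows. The class name and sizes are illustrative assumptions. It uses `nn.GELU()`, the activation BERT uses; swapping in `nn.ReLU()` gives the original Transformer variant. The check at the end illustrates what "position-wise" means: a token gets the same result whether it is processed alone or as part of the full sequence.

```python
import torch
import torch.nn as nn


class PositionwiseFeedForward(nn.Module):
    def __init__(self, hidden_size=16, intermediate_size=64):
        super().__init__()
        self.w1 = nn.Linear(hidden_size, intermediate_size)  # W_1, b_1
        self.w2 = nn.Linear(intermediate_size, hidden_size)  # W_2, b_2
        self.activation = nn.GELU()

    def forward(self, x):
        # nn.Linear acts on the last dimension, so every position is transformed independently
        return self.w2(self.activation(self.w1(x)))


torch.manual_seed(0)
ffn = PositionwiseFeedForward()
x = torch.rand(2, 5, 16)
y = ffn(x)

# Run position 2 on its own and compare with its slice of the full output
single = ffn(x[:, 2:3, :])
print(torch.allclose(y[:, 2:3, :], single, atol=1e-5))
```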

## Layer Normalization and Residual Connections

Each sub-layer in the Transformer block is wrapped in a residual connection followed by layer normalization, so the output of a sub-layer is `LayerNorm(x + Sublayer(x))`. The residual connection carries information from lower layers to higher layers and eases gradient flow during training.
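
The arrangement used in BERT, `LayerNorm(x + Dropout(Sublayer(x)))`, can be sketched as a small wrapper. The class name `PostNormSublayer` is hypothetical, and a plain linear layer stands in for the attention or feed-forward sub-layer.

```python
import torch
import torch.nn as nn


class PostNormSublayer(nn.Module):
    """Wraps a sub-layer with dropout, a residual connection, and layer normalization."""

    def __init__(self, sublayer, hidden_size=16, dropout=0.1):
        super().__init__()
        self.sublayer = sublayer
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, x):
        # Residual add first, then normalize across the feature dimension
        return self.norm(x + self.dropout(self.sublayer(x)))


torch.manual_seed(0)
block = PostNormSublayer(nn.Linear(16, 16))  # stand-in sub-layer
block.eval()                                 # disables dropout
x = torch.rand(2, 5, 16)
out = block(x)
print(out.shape)
print(out.mean(dim=-1).abs().max())  # close to 0: each token vector is normalized
```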

## Output

The output of the BERT encoder is a sequence of contextualized token representations. These representations can be used for various downstream tasks, such as sequence classification, token classification, question answering, and more.

For sequence classification tasks, the `[CLS]` token representation from the final layer of the BERT encoder is typically used as the aggregate representation of the input sequence.

## Sequence Classification

Sequence classification builds directly on the final-layer `[CLS]` representation. Because `[CLS]` attends to every other token in each layer, its final hidden state can summarize the whole input sequence once the model has been fine-tuned.

The `[CLS]` token representation is passed through a linear layer followed by a softmax activation to obtain the class probabilities:

```
class_probabilities = softmax(linear(cls_representation))
```

The model is trained by minimizing the cross-entropy loss between the predicted class probabilities and the true labels.
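
The classification head and its loss can be sketched as below. Random vectors stand in for real `[CLS]` representations, and the sizes are illustrative assumptions. Note that `F.cross_entropy` takes the raw linear outputs (logits) and applies log-softmax internally, so the explicit softmax is only needed when probabilities are wanted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
hidden_size, num_classes = 16, 3
cls_representation = torch.rand(4, hidden_size)  # stand-in for 4 [CLS] vectors
labels = torch.tensor([0, 2, 1, 2])

linear = nn.Linear(hidden_size, num_classes)
logits = linear(cls_representation)
class_probabilities = F.softmax(logits, dim=-1)

# Cross-entropy computed by the library, directly from the logits
loss = F.cross_entropy(logits, labels)

# The same quantity by hand: mean negative log-probability of the true class
manual_loss = -torch.log(class_probabilities[torch.arange(4), labels]).mean()
print(torch.allclose(loss, manual_loss, atol=1e-6))
```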

## Fine-tuning

BERT is pre-trained on large unlabeled text corpora using two self-supervised objectives: masked language modeling and next sentence prediction. For sequence classification, a task-specific classification layer is added on top of the pre-trained encoder, and the model is then fine-tuned on a labeled dataset for the task.

During fine-tuning, the pre-trained BERT encoder weights are updated, and the additional classification layer is trained from scratch. This allows the model to adapt to the specific classification task while leveraging the pre-trained knowledge.
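
A single fine-tuning step might look like the sketch below. To stay self-contained it does not load real BERT weights: `encoder` is a hypothetical stand-in for the pre-trained encoder, and the learning rate and sizes are illustrative. The point is that one optimizer updates both the encoder and the new classification layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
hidden_size, num_classes = 16, 2

# Hypothetical stand-in for a pre-trained encoder: (batch, seq, hidden) -> (batch, seq, hidden)
encoder = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.Tanh())
# New classification layer, randomly initialized
classifier = nn.Linear(hidden_size, num_classes)

# Both parameter sets go into the optimizer, so both are updated
params = list(encoder.parameters()) + list(classifier.parameters())
optimizer = torch.optim.AdamW(params, lr=2e-5)

inputs = torch.rand(8, 5, hidden_size)
labels = torch.randint(0, num_classes, (8,))

encoder_before = encoder[0].weight.detach().clone()
classifier_before = classifier.weight.detach().clone()

logits = classifier(encoder(inputs)[:, 0, :])  # position 0 plays the role of [CLS]
loss = F.cross_entropy(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()

encoder_changed = not torch.equal(encoder_before, encoder[0].weight)
classifier_changed = not torch.equal(classifier_before, classifier.weight)
print(encoder_changed, classifier_changed)
```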

## Inference

At inference time, the input sequence is passed through the fine-tuned BERT model, and the class probabilities are obtained from the `[CLS]` token representation. The class with the highest probability is predicted as the output.
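
The prediction step can be sketched as follows, again with a random stand-in for the `[CLS]` representation and illustrative sizes. Gradients are disabled because no training happens at inference time.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
classifier = nn.Linear(16, 3)           # stand-in for the fine-tuned classification layer
classifier.eval()
cls_representation = torch.rand(4, 16)  # stand-in for 4 [CLS] vectors

with torch.no_grad():
    logits = classifier(cls_representation)
    probabilities = torch.softmax(logits, dim=-1)
    predicted_class = probabilities.argmax(dim=-1)

print(predicted_class.shape)  # torch.Size([4])
# softmax preserves ordering, so argmax over the logits selects the same class
print(torch.equal(predicted_class, logits.argmax(dim=-1)))
```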

When it was released, BERT set state-of-the-art results on benchmarks such as GLUE and SQuAD, and fine-tuning a pre-trained encoder has since become a common approach for many natural language processing tasks.



-----

# THE BERT ARCHITECTURE

![BERT ARCHITECTURE](assets/bert.png)

## Bert Embeddings

The input to the BERT model consists of three types of embeddings:
1. Word embeddings: Each token in the input sequence is converted into a dense vector representation.
2. Position embeddings: Capture the positional information of each token in the sequence.
3. Token type embeddings: Used to differentiate between different segments (e.g., sentences) in the input sequence.

These embeddings are summed element-wise to obtain the final input representation.

## 12 x Bert Encoder

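The BERT encoder consists of 12 identical layers (in the base version). Each layer contains an attention block followed by a feed-forward block. The module names below follow the Hugging Face `transformers` implementation, which the labels in the diagram appear to be taken from.

### Bert Attention

The Bert Attention module is based on the self-attention mechanism. The input is projected into three matrices: Query, Key, and Value. The attention mechanism computes the relationship between each token and all other tokens in the sequence.

Dropout is applied to the attention probabilities to reduce overfitting.

### Bert Self Output

The Bert Self Output module takes the output of self-attention and passes it through the following components:
1. Linear layer: Applies a linear transformation to the attention output.
2. Dropout: Applies dropout regularization.
3. Layer Normalization: Adds the input of the Bert Attention module through a residual connection, then normalizes the activations across the features.

### Bert Intermediate

The feed-forward block is applied inside each of the 12 layers, not once after the stack. It starts with the Bert Intermediate module, which consists of:
1. Linear layer: Expands each token representation from the hidden size to the intermediate size (768 to 3072 in the base version).
2. GELU activation: Applies the Gaussian Error Linear Unit (GELU) activation function.

### Bert Output

The Bert Output module takes the output from the Bert Intermediate module and passes it through the following components:
1. Linear layer: Projects each token representation back to the hidden size.
2. Dropout: Applies dropout regularization.
3. Layer Normalization: Adds the input of the Bert Intermediate module through a residual connection, then normalizes the activations across the features.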

## Bert Pooler

The Bert Pooler module is used to obtain a fixed-size representation of the input sequence. It consists of:
1. Linear layer: Applies a linear transformation to the output of the first token (usually the `[CLS]` token) from the last Bert Encoder layer.
2. Tanh activation: Applies the hyperbolic tangent activation function.
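
The two pooler steps can be sketched as below. The class name `Pooler` and the sizes are illustrative, and random values stand in for the output of the last encoder layer.

```python
import torch
import torch.nn as nn


class Pooler(nn.Module):
    def __init__(self, hidden_size=16):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        # hidden_states has shape (batch, seq_len, hidden); keep only the first token
        first_token = hidden_states[:, 0]
        return self.activation(self.dense(first_token))


torch.manual_seed(0)
pooler = Pooler()
hidden_states = torch.rand(2, 5, 16)  # stand-in for the last encoder layer's output
pooled = pooler(hidden_states)
print(pooled.shape)  # torch.Size([2, 16])
```

Whatever the sequence length, the result is one vector of the hidden size per sequence, with every value between -1 and 1 because of the tanh.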

## Classifier

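Finally, the output of the Bert Pooler module is passed through a dropout layer and then fed into a linear classifier. The classifier produces one logit per class, and a softmax turns the logits into the final output probabilities for the task at hand (e.g., sentiment classification or topic classification). Token-level tasks such as named entity recognition do not use the pooled output; they classify the final hidden state of each token instead.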

This diagram provides a detailed overview of the BERT architecture, showcasing the flow of data through the various components. The Bert Encoder layers form the core of the model, capturing the contextual information through self-attention mechanisms. The Bert Pooler and Classifier modules are task-specific and can be adapted based on the downstream task requirements.

# [Custom Simple Transformer](https://peterbloem.nl/blog/transformers)

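- Self-attention is the fundamental operation in a transformer.
- It is a sequence-to-sequence operation: a sequence of vectors goes in and a sequence of vectors comes out.
- To produce each output vector, it takes a weighted average over all input vectors.
- Self-attention propagates information between vectors directly. This is in contrast to RNNs, which carry information forward through a hidden state.
- In this basic form, self-attention ignores the order of the input sequence.
- The weights come from the dot product between pairs of input vectors, normalized with a softmax.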

```python
import numpy as np
import pandas as pd
```

```python
import torch

# A batch of 10 sequences, each with 20 vectors of dimension 30
RANDOM_TEST_INPUTS = torch.rand(
    (10, 20, 30),
    dtype=torch.float32,
)
```

```python
RANDOM_TEST_INPUTS
```

```
tensor([[[0.8597, 0.5104, 0.4714,  ..., 0.8060, 0.4534, 0.6248],
         [0.0593, 0.9803, 0.7217,  ..., 0.6627, 0.7519, 0.3200],
         [0.5618, 0.6923, 0.7853,  ..., 0.0755, 0.8452, 0.4088],
         ...,
         [0.8629, 0.4481, 0.6432,  ..., 0.2817, 0.6685, 0.9825],
         [0.3931, 0.7048, 0.5925,  ..., 0.4335, 0.3156, 0.8388],
         [0.6964, 0.0372, 0.7412,  ..., 0.1967, 0.6473, 0.7235]],

        [[0.8246, 0.3814, 0.1305,  ..., 0.5030, 0.9597, 0.8616],
         [0.4541, 0.8536, 0.4117,  ..., 0.9847, 0.6372, 0.5771],
         [0.9693, 0.8254, 0.8352,  ..., 0.5811, 0.0567, 0.9208],
         ...,
         [0.7159, 0.9516, 0.6920,  ..., 0.9895, 0.3224, 0.4988],
         [0.5322, 0.9626, 0.4402,  ..., 0.4515, 0.1656, 0.7058],
         [0.9095, 0.7434, 0.9173,  ..., 0.8247, 0.3174, 0.5650]],

        ...
```

```python
RANDOM_TEST_INPUTS.shape
```

```
torch.Size([10, 20, 30])
```

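```python
# First step of self-attention: the raw (unnormalized) attention weights
def self_attention(X):
    # X has shape (b, t, k):
    #   b = batch size (torch.bmm expects 3-D tensors, so this dimension is required)
    #   t = number of vectors in each sequence
    #   k = dimension of each vector
    # Batched matrix product: (b, t, k) x (b, k, t) -> (b, t, t)
    raw_weights = torch.bmm(X, X.transpose(1, 2))
    return raw_weights
```

```python
# Dot product of every input vector with every vector in the same sequence
raw_weights = self_attention(RANDOM_TEST_INPUTS)
```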

```python
raw_weights
```

```
tensor([[[ 9.0988,  7.8299,  7.0427,  ...,  8.5879,  8.2779,  6.5358],
         [ 7.8299, 10.8021,  7.5168,  ...,  8.2890,  8.7921,  6.9596],
         [ 7.0427,  7.5168,  9.5582,  ...,  7.8614,  6.8952,  7.3989],
         ...,
         [ 8.5879,  8.2890,  7.8614,  ..., 11.5310,  8.7146,  8.0790],
         [ 8.2779,  8.7921,  6.8952,  ...,  8.7146, 10.1629,  6.6800],
         [ 6.5358,  6.9596,  7.3989,  ...,  8.0790,  6.6800, 10.2436]],

        [[11.5251,  7.8385, 10.4682,  ...,  8.7468,  8.9144,  9.1780],
         [ 7.8385,  8.4331,  8.6612,  ...,  7.5152,  7.5817,  8.1269],
         [10.4682,  8.6612, 14.0336,  ...,  9.8568, 10.8949, 10.3171],
         ...,
         [ 8.7468,  7.5152,  9.8568,  ..., 11.8285,  9.4093,  9.5136],
         [ 8.9144,  7.5817, 10.8949,  ...,  9.4093, 11.3014,  9.0996],
         [ 9.1780,  8.1269, 10.3171,  ...,  9.5136,  9.0996, 11.4740]],

        ...
```

```python
raw_weights.shape
```

```
torch.Size([10, 20, 20])
```

```python
import torch.nn.functional as F

# Normalize each row of raw weights so that it sums to 1
weights = F.softmax(raw_weights, dim=2)
```

```python
weights
```

```
tensor([[[1.0287e-01, 2.8923e-02, 1.3164e-02,  ..., 6.1719e-02,
          4.5271e-02, 7.9287e-03],
         [1.9442e-02, 3.7978e-01, 1.4215e-02,  ..., 3.0769e-02,
          5.0888e-02, 8.1428e-03],
         [2.2017e-02, 3.5368e-02, 2.7240e-01,  ..., 4.9923e-02,
          1.8996e-02, 3.1436e-02],
         ...,
         [2.0995e-02, 1.5571e-02, 1.0154e-02,  ..., 3.9837e-01,
          2.3832e-02, 1.2622e-02],
         [2.5069e-02, 4.1921e-02, 6.2895e-03,  ..., 3.8795e-02,
          1.6510e-01, 5.0718e-03],
         [9.0323e-03, 1.3800e-02, 2.1412e-02,  ..., 4.2269e-02,
          1.0434e-02, 3.6819e-01]],

        ...
```

```python
weights.shape
```

```
torch.Size([10, 20, 20])
```

```python
# Each output vector is a weighted average of all input vectors in its sequence
output_sequence = torch.bmm(weights, RANDOM_TEST_INPUTS)
```

```python
output_sequence
```

```
tensor([[[0.6922, 0.7066, 0.5789,  ..., 0.4506, 0.4700, 0.5819],
         [0.4439, 0.8395, 0.6489,  ..., 0.5022, 0.5572, 0.4680],
         [0.6210, 0.6672, 0.6339,  ..., 0.3594, 0.6064, 0.5485],
         ...,
         [0.7432, 0.6152, 0.6244,  ..., 0.3558, 0.5556, 0.7556],
         [0.6237, 0.8011, 0.6614,  ..., 0.3246, 0.4508, 0.6265],
         [0.6720, 0.4526, 0.6516,  ..., 0.3291, 0.5666, 0.6684]],

        [[0.6949, 0.5452, 0.4591,  ..., 0.5432, 0.6218, 0.7620],
         [0.5772, 0.6464, 0.6497,  ..., 0.6371, 0.4543, 0.6558],
         [0.8432, 0.7747, 0.8057,  ..., 0.5824, 0.1668, 0.8490],
         ...,
         [0.6470, 0.8238, 0.7088,  ..., 0.7895, 0.3462, 0.5973],
         [0.5883, 0.7557, 0.6930,  ..., 0.5634, 0.3135, 0.7053],
         [0.7324, 0.7234, 0.7913,  ..., 0.6906, 0.3455, 0.6480]],

        ...
```

```python
output_sequence.shape
```

```
torch.Size([10, 20, 30])
```
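
The three steps above (raw dot-product weights, softmax, weighted average) can be collected into one function. This is a sketch of the basic, parameter-free form only; the name `basic_self_attention` is an assumption, and fresh random inputs are used so the block runs on its own. The last check illustrates the point that this form ignores order: shuffling the input vectors shuffles the outputs in the same way and changes nothing else.

```python
import torch
import torch.nn.functional as F


def basic_self_attention(x):
    # x has shape (b, t, k)
    raw_weights = torch.bmm(x, x.transpose(1, 2))  # (b, t, t) pairwise dot products
    weights = F.softmax(raw_weights, dim=2)        # each row sums to 1
    return torch.bmm(weights, x), weights          # outputs have shape (b, t, k)


torch.manual_seed(0)
x = torch.rand(10, 20, 30)
y, w = basic_self_attention(x)
print(y.shape)  # torch.Size([10, 20, 30])

# Every row of weights is a probability distribution over the 20 input vectors
rows_sum_to_one = torch.allclose(w.sum(dim=2), torch.ones(10, 20), atol=1e-5)

# Permuting the sequence permutes the outputs identically
perm = torch.randperm(20)
y_perm, _ = basic_self_attention(x[:, perm])
order_ignored = torch.allclose(y_perm, y[:, perm], atol=1e-5)
print(rows_sum_to_one, order_ignored)
```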