# INTRODUCTION

In this notebook, I am going to interactively implement **BERT** (Bidirectional Encoder Representations from Transformers), a language representation model designed to pretrain deep bidirectional representations from unlabelled text. It does this by jointly conditioning on both left and right context in all of its layers. As a result, a pretrained BERT model can be fine-tuned with just one additional output layer, meaning only minimal architectural modifications are needed to adapt the model to a specific task.

Getting straight into the details, there are two major steps in this framework:
* **Pre-training** - This is where the model is pretrained on unlabelled data over different pretraining tasks.
* **Fine-tuning** - The model is first initialized with the pretrained parameters, and all of them are then fine-tuned using labelled data from the downstream task. Each task therefore has its own fine-tuned model, even though all of them start from the same pretrained parameters.
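
To make the "one additional output layer" idea concrete, here is a minimal sketch of a task-specific head sitting on top of an encoder. The class name `ClassificationHead` and the random tensor standing in for the encoder output are assumptions for illustration only; no pretrained weights are involved.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Hypothetical task-specific output layer placed on top of a pretrained encoder."""
    def __init__(self, hidden_size=768, num_labels=2):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states):
        # hidden_states: [batch, seq_len, hidden_size] as produced by the encoder
        first_token_state = hidden_states[:, 0]  # representation of the first token
        return self.classifier(first_token_state)

# Stand-in for encoder output: 4 sequences, 16 tokens each, hidden size 768
hidden_states = torch.randn(4, 16, 768)
head = ClassificationHead()
logits = head(hidden_states)
print(logits.shape)  # torch.Size([4, 2])
```

During fine-tuning, the parameters of both the encoder and this head would be updated together; only the head is new for each task.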

# Model Architecture

The architecture is a multi-layer bidirectional Transformer encoder, a variant of the model described in the paper [Attention is All You Need](https://arxiv.org/abs/1706.03762), so we won't discuss it at length, but we'll see how the two correspond as we go.

**Model Parameters** - The paper used the following parameters for its base model:
* $L = 12$ - The number of layers, i.e. the number of stacked Transformer encoder blocks, each of which has two sublayers, a multi-head self-attention mechanism and a feed-forward neural network, as we already know from previous work.
* $H = 768$ - The hidden size, meaning that after embedding, each token has a hidden representation of size 768.
* $A = 12$ - The number of self-attention heads in each layer, meaning we'll have $12$ sets of $Q, K, V$ projection matrices for computing multi-head self-attention.
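
As a quick sanity check on these numbers, the arithmetic below works out the per-head dimension and a rough parameter count for the encoder stack. It is a sketch under stated assumptions: a feed-forward size of $4H$ (as in the paper), bias terms on every linear layer, and two layer norms per block. Embedding tables are not counted.

```python
L, H, A = 12, 768, 12

head_dim = H // A   # size of each attention head's Q, K, V vectors
ffn_dim = 4 * H     # assumed feed-forward (intermediate) size

# Q, K, V and output projections, each H x H plus a bias
attn_params = 4 * (H * H + H)
# Two linear layers: H -> ffn_dim and ffn_dim -> H, each with a bias
ffn_params = (H * ffn_dim + ffn_dim) + (ffn_dim * H + H)
# Two layer norms per block, each with a scale and a shift vector
norm_params = 2 * (2 * H)

layer_params = attn_params + ffn_params + norm_params

print(head_dim)          # 64
print(layer_params)      # 7087872
print(L * layer_params)  # 85054464
```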

In [45]:
import nltk
nltk.download('reuters')
from nltk.corpus import reuters

import pprint

import torch
import torch.nn as nn
from transformers import BertTokenizer, BertModel

import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package reuters to
[nltk_data]     C:\Users\kmahl\AppData\Roaming\nltk_data...
[nltk_data]   Package reuters is already up-to-date!


# Loading Data

Let's load some data that we'll use for debugging, not necessarily pretraining, just testing.

In [46]:
START_TOKEN = '<START>'
END_TOKEN = '<END>'

# Not liking where the data is stored and too lazy to fix it
def read_corpus(category="gold"):
    """ Read files from the specified Reuters category.
        Params:
            category (string): category name
        Return:
            list of lists, with words from each of the processed files
    """
    files = reuters.fileids(category)
    return [[START_TOKEN] + [w.lower() for w in list(reuters.words(f))] + [END_TOKEN] for f in files]

In [47]:
def distinct_words(corpus):
    """ Determine a list of distinct words for the corpus.
        Params:
            corpus (list of list of strings): corpus of documents
        Return:
            corpus_words (list of strings): sorted list of distinct words across the corpus
            n_corpus_words (integer): number of distinct words across the corpus
    """
    # Flatten the corpus into a set of unique words, then sort for a stable order
    corpus_words = sorted({y for x in corpus for y in x})
    n_corpus_words = len(corpus_words)

    return corpus_words, n_corpus_words

In [48]:
reuters_corpus = read_corpus()
corpus_words, n_corpus_words = distinct_words(reuters_corpus)
max_sent_length = max(len(sentence) for sentence in reuters_corpus)

In [49]:
print(f"There are {n_corpus_words} unique words in the corpus."
      f" The corpus has {len(reuters_corpus)} sentences,"
      f" where the longest one is of length {max_sent_length}")

There are 2830 unique words in the corpus. The corpus has 124 sentences, where the longest one is of length 836


## Inputs
The model needs to accept a special token $\text{[CLS]}$ (the first token of every sequence) and a list of word tokens from the user. Each of these tokens is then converted into a final input embedding by summing the following embeddings:
* **Token embeddings $(TE)$** So that the model can make sense of words it hasn't seen during training (e.g. novel words, misspellings, etc.), the paper uses `WordPiece` tokenization, which splits each word into subword units from a fixed vocabulary; each unit then gets its own token embedding.
* **Segment embeddings $(SE)$** A learned vector encoding which sentence a token belongs to. This is how the model differentiates between sentence A and sentence B, e.g. during next sentence prediction (NSP), which we'll discuss shortly.
* **Position embeddings $(PE)$** A vector encoding of the position of a token in the input sequence.

The position embeddings thus capture the order of tokens within the input, while the segment embeddings tell the model which sentence each token belongs to.
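
To see what the three kinds of ids look like before any embedding lookup, here is a small pure-Python sketch that lays out a sentence pair. The helper name `build_inputs` is hypothetical, the words are made up, and the markers `[CLS]` and `[SEP]` follow the paper's notation for the sequence-start and separator tokens.

```python
def build_inputs(tokens_a, tokens_b=None):
    """Lay out tokens, segment ids and position ids for one or two sentences."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)          # sentence A, including [CLS] and its [SEP]
    if tokens_b:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # sentence B and its [SEP]
    position_ids = list(range(len(tokens)))  # 0, 1, 2, ... over the whole sequence
    return tokens, segment_ids, position_ids

tokens, segment_ids, position_ids = build_inputs(
    ["gold", "prices", "rose"], ["miners", "gained"]
)
for row in zip(tokens, segment_ids, position_ids):
    print(row)
```

Each of the three lists indexes into its own embedding table, and the three looked-up vectors are summed per token.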

### Implementation details
I am not sure how the `WordPiece` algorithm works (and I don't think I have to at this point), so I've found a pretrained model on `Hugging Face` called [`bert-base-uncased`](https://huggingface.co/google-bert/bert-base-uncased), which is the uncased base model of what I'm trying to build, and I'll use its tokenizer to encode the sentences in my corpus.

In [50]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_model = BertModel.from_pretrained("bert-base-uncased")

In [51]:
encoded_sentences = []

for sentence in reuters_corpus:
    encoded_sentence = tokenizer(
        sentence,
        return_tensors='pt',
        padding='max_length',
        truncation=True,
        max_length=max_sent_length
    )
    encoded_sentences.append(encoded_sentence['input_ids'])

In [52]:
len(encoded_sentences)

124

In [53]:
print(f"{encoded_sentence['input_ids'].shape}\n{encoded_sentence['token_type_ids'].shape}\n{encoded_sentence['attention_mask'].shape}")

torch.Size([344, 836])
torch.Size([344, 836])
torch.Size([344, 836])


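So each sentence is encoded as 344 rows, each of length 836. The 836 makes sense, because that is the padded length of each sequence. The 344 is most likely the number of words in the last document: each `sentence` here is a list of words, and the tokenizer treats a list of strings as a batch of separate sequences, so every word was encoded and padded to length 836 on its own.

This means I'm not quite on the right track yet. Passing `is_split_into_words=True` should tell the tokenizer that each list is a single pre-tokenized sequence. Note also that `bert-base-uncased` only has position embeddings for 512 tokens, so a `max_length` of 836 is longer than the model itself can accept.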
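
A minimal sketch of encoding whole documents rather than individual words, assuming each document is a list of word strings as returned by `read_corpus`. The helper name `encode_documents` is hypothetical, and the default `max_length` of 512 is an assumption chosen to match the position limit of `bert-base-uncased`.

```python
def encode_documents(tokenizer, corpus, max_length=512):
    """Encode pre-split documents as one batch, one row per document."""
    return tokenizer(
        corpus,
        is_split_into_words=True,  # each inner list is ONE sequence, not a batch of words
        return_tensors='pt',
        padding='max_length',
        truncation=True,
        max_length=max_length,
    )

# Usage (loads the pretrained vocabulary, so it needs network access or a local cache):
# tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# batch = encode_documents(tokenizer, reuters_corpus)
# batch['input_ids'] should then have one row per document
```

The tokenizer adds its own `[CLS]` and `[SEP]` markers, so the `<START>` and `<END>` strings in the corpus are treated as ordinary text and may be split into several word pieces.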

In [None]:
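class Embedding(nn.Module):
    def __init__(self, vocab_size, maxlen, d_model):
        super().__init__()
        # vocab_size is unused: the pretrained tokenizer fixes the vocabulary
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.bert_model = BertModel.from_pretrained("bert-base-uncased")
        self.maxlen = maxlen
        # d_model must equal the pretrained hidden size (768) for the sum in forward()
        self.d_model = d_model
        self.pos_embed = nn.Embedding(maxlen, d_model)
        self.seg_embed = nn.Embedding(2, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # x: a string, or a list of strings with one string per sequence
        tokens = self.tokenizer(
            x,
            return_tensors='pt',
            padding='max_length',
            truncation=True,
            max_length=self.maxlen  # sequence length, not the hidden size
        )['input_ids']

        seq_len = tokens.size(1)
        pos = torch.arange(seq_len, dtype=torch.long, device=tokens.device)
        pos = pos.unsqueeze(0).expand_as(tokens)
        # Every token is assigned to segment 0 (sentence A) for now
        segments = torch.zeros_like(tokens)
        # Look up only the pretrained WordPiece embedding table, rather than
        # running the full BERT encoder, whose output already includes positions
        token_embeddings = self.bert_model.get_input_embeddings()(tokens)
        embedding = token_embeddings + self.pos_embed(pos) + self.seg_embed(segments)
        return self.norm(embedding)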
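
To check the shapes of the three-way sum without downloading anything, here is a toy variant that swaps the pretrained token embeddings for a randomly initialized `nn.Embedding`. The class name `ToyEmbedding`, the small sizes, and the random ids are all assumptions for illustration; it takes ids directly instead of raw text.

```python
import torch
import torch.nn as nn

class ToyEmbedding(nn.Module):
    """Token + position + segment embeddings, all randomly initialized."""
    def __init__(self, vocab_size, maxlen, d_model):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(maxlen, d_model)
        self.seg_embed = nn.Embedding(2, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, input_ids, segment_ids):
        seq_len = input_ids.size(1)
        # Positions 0..seq_len-1, repeated for every sequence in the batch
        pos = torch.arange(seq_len, device=input_ids.device)
        pos = pos.unsqueeze(0).expand_as(input_ids)
        summed = self.tok_embed(input_ids) + self.pos_embed(pos) + self.seg_embed(segment_ids)
        return self.norm(summed)

# Two sequences of 8 token ids; the first 5 tokens are sentence A, the last 3 sentence B
input_ids = torch.randint(0, 100, (2, 8))
segment_ids = torch.tensor([[0] * 5 + [1] * 3] * 2)
toy = ToyEmbedding(vocab_size=100, maxlen=16, d_model=32)
out = toy(input_ids, segment_ids)
print(out.shape)  # torch.Size([2, 8, 32])
```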
