<a href="https://www.kaggle.com/code/ayushs9020/inventing-bert-from-scratch?scriptVersionId=132484419" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# 2023 Kaggle AI Report

<img src = "https://i.chzbgr.com/full/9274249984/hC9529C78/grew-up-sarcastic-above-a-pic-of-bert-saying-ernie-how-do-i-look-ernie-replies-with-your-eyes-bert" width = 400px>

The $2023$ $Kaggle$ $AI$ $Report$ is an `analytics competition` that invites participants to `write essays on the state of machine learning in 2023`. The essays should describe `what the community has learned over the past 2 years of working and experimenting` with one of the following seven topics:

* $Text$
* $Image$
* $Video$ 
* $Data$
* $Tabular$
* $Time$ $Series$
* $Kaggle$ $Competitions$
* $Generative$ $AI$
* $AI$ $Ethics$

The essays `should be well-written and informative`, and they `should provide a comprehensive overview of the state of machine learning` in $2023$. The top essays will be `published in the 2023 Kaggle AI Report`, which will be a `valuable resource for anyone who is interested in learning more about the state of machine learning`.

Here are some additional details about the competition:

|_____|_____|
|---|---
|Prizes| 
||$$$10,000$$
||$$$5,000$$
||$$$2,500$$
|Submission deadline|The deadline for submissions is $June$ $1,$ $2023$.
|Submission format|Essays should be submitted as a `PDF file`.
|Length|Essays should be no more than $2,500$ words in length.
|Judging criteria|Clarity
||Organization
||Accuracy
||Completeness
||Originality 
||Creativity

## BERT 

$BERT$ stands for $Bidirectional$ $Encoder$ $Representations$ from $Transformers$. It is a `language model` that was developed by $Google$ $AI$ in $2018$. $BERT$ is `trained on a massive dataset of text and code`. It can be used for a variety of natural language processing tasks, such as 
* $Question$ $Answering$ 
* $Sentiment$ $Analysis$
* $Natural$ $Language$ $Inference$.

# 1 | Data 🚀

Lets get our data into working 

In [1]:
import pandas as pd 
import numpy as np

import re
import tqdm

from pathlib import Path

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/2023-kaggle-ai-report/sample_submission.csv
/kaggle/input/2023-kaggle-ai-report/arxiv_metadata_20230510.json
/kaggle/input/2023-kaggle-ai-report/kaggle_writeups_20230510.csv


In [2]:
data = pd.read_csv("/kaggle/input/2023-kaggle-ai-report/kaggle_writeups_20230510.csv")
data

Unnamed: 0,Competition Launch Date,Title of Competition,Competition URL,Date of Writeup,Title of Writeup,Writeup,Writeup URL
0,08/03/2010 00:00:00,Chess ratings - Elo versus the Rest of the World,https://www.kaggle.com/c/2447,11/18/2010 00:06:46,Released: my Source Code and Analysis,<p>I had a lot of fun with this competition an...,https://www.kaggle.com/c/2447/discussion/185
1,08/03/2010 00:00:00,Chess ratings - Elo versus the Rest of the World,https://www.kaggle.com/c/2447,11/20/2010 04:38:53,6th place(UriB) by Uri Blass,<P>I calculated rating for every player in mon...,https://www.kaggle.com/c/2447/discussion/192
2,08/03/2010 00:00:00,Chess ratings - Elo versus the Rest of the World,https://www.kaggle.com/c/2447,11/23/2010 10:38:23,7th place - littlefish,I'm a little surprised I ended up in the top-1...,https://www.kaggle.com/c/2447/discussion/194
3,08/03/2010 00:00:00,Chess ratings - Elo versus the Rest of the World,https://www.kaggle.com/c/2447,11/20/2010 11:27:17,3rd place: Chessmetrics - Variant,"<p><span id=""post_text_content_1230""><div dir=...",https://www.kaggle.com/c/2447/discussion/193
4,08/03/2010 00:00:00,Chess ratings - Elo versus the Rest of the World,https://www.kaggle.com/c/2447,11/18/2010 02:44:10,2nd place: TrueSkill Through Time,"Wow, this is a surprise! I looked at this comp...",https://www.kaggle.com/c/2447/discussion/186
...,...,...,...,...,...,...,...
3122,02/23/2023 17:25:32,Google - Isolated Sign Language Recognition,https://www.kaggle.com/c/46105,05/02/2023 09:45:01,49th place silver solution,<p>Thank you Kaggle and Pop sign for hosting t...,https://www.kaggle.com/c/46105/discussion/406426
3123,02/23/2023 17:25:32,Google - Isolated Sign Language Recognition,https://www.kaggle.com/c/46105,05/02/2023 10:13:31,10th place solution,"<blockquote>\n <p>First, I would like to than...",https://www.kaggle.com/c/46105/discussion/406434
3124,02/23/2023 17:25:32,Google - Isolated Sign Language Recognition,https://www.kaggle.com/c/46105,05/02/2023 03:24:28,Solution - Single transformer without val dataset,<p>Thanks to the organisers of the PopSign Gam...,https://www.kaggle.com/c/46105/discussion/406346
3125,02/23/2023 17:25:32,Google - Isolated Sign Language Recognition,https://www.kaggle.com/c/46105,05/02/2023 04:01:15,Top 8% Bronze Medal Solution,<blockquote>\n <p><strong>Many congratulation...,https://www.kaggle.com/c/46105/discussion/406354


At this point we will only focus on the `Writeup` column, we will try to access/process more information in the upcoming versions

Our data is distributed in a `CSV File`. We need to extract our data in a `txt File` as a large corpus of data 

In [3]:
data["Writeup"]

0       <p>I had a lot of fun with this competition an...
1       <P>I calculated rating for every player in mon...
2       I'm a little surprised I ended up in the top-1...
3       <p><span id="post_text_content_1230"><div dir=...
4       Wow, this is a surprise! I looked at this comp...
                              ...                        
3122    <p>Thank you Kaggle and Pop sign for hosting t...
3123    <blockquote>\n  <p>First, I would like to than...
3124    <p>Thanks to the organisers of the PopSign Gam...
3125    <blockquote>\n  <p><strong>Many congratulation...
3126    <p>Thank you Kaggle, Kagglers, PopSign, and Pa...
Name: Writeup, Length: 3127, dtype: object

I think that `for understanding BERT`. It would be great if we choose the way of `Question Answering`. This data, by default is not made for `Question Answering`. But somwhow we will make the data as we want 

In [4]:
print(data["Writeup"][0])

<p>I had a lot of fun with this competition and learned a lot about ratings systems.</p>
<div>Sadly, I only came 18th :)</div>
<div>If you're interested, you can download all of my code and&nbsp;analysis&nbsp;from my github repo:&nbsp;https://github.com/jbrownlee/ChessML</div>
<div>There are implementations of a few rating systems (elo, glicko, chessmetrics, etc) and many attempts at improving them (a nice little experimentation framework).</div>
<div>Thanks all. Looking forward to the next big comp!</div>
<div>jasonb</div>


You can note that there are many of the `HTML tags` and other links provided in the data. We do not need these links, So it would be great if we juse remove all of this 

In [5]:
print(
    re.sub(
        ':' , " " , 
        re.sub(
            ';' , ' ' , 
            re.sub(
                '&nbsp' , "" , 
                (
                    re.sub(
                        r'http\S+', ' ', 
                        (
                            re.compile(r'<.*?>').sub(
                                "" , 
                                data["Writeup"][0]
                            )
                        )
                    )
                )
            )
        )
    )
)

I had a lot of fun with this competition and learned a lot about ratings systems.
Sadly, I only came 18th  )
If you're interested, you can download all of my code and analysis from my github repo   
There are implementations of a few rating systems (elo, glicko, chessmetrics, etc) and many attempts at improving them (a nice little experimentation framework).
Thanks all. Looking forward to the next big comp!
jasonb


In [6]:
emoj = re.compile("["
    u"\U0001F600-\U0001F64F"  # emoticons
    u"\U0001F300-\U0001F5FF"  # symbols & pictographs
    u"\U0001F680-\U0001F6FF"  # transport & map symbols
    u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
    u"\U00002500-\U00002BEF"  # chinese char
    u"\U00002702-\U000027B0"
    u"\U00002702-\U000027B0"
    u"\U000024C2-\U0001F251"
    u"\U0001f926-\U0001f937"
    u"\U00010000-\U0010ffff"
    u"\u2640-\u2642" 
    u"\u2600-\u2B55"
    u"\u200d"
    u"\u23cf"
    u"\u23e9"
    u"\u231a"
    u"\ufe0f"  # dingbats
    u"\u3030"
    u"\u2028"
    "\x08"
    u"\u200a"
    u"\u200b"
                  "]+", re.UNICODE)

In [7]:
def preprocess(text): 
    k = re.sub(
        ':' , " " , 
        re.sub(
            ';' , ' ' , 
            re.sub(
                '&nbsp' , '' , 
                (
                    re.sub(
                        r'http\S+', ' ', 
                        (
                            re.compile(r'<.*?>').sub(
                                "" , str(text)
                            )
                        )
                    )
                )
            )
        )
    )
    k = emoj.sub(r'' , k)
    
    return k

In [8]:
data["Writeup"] = data["Writeup"].map(preprocess)

In [9]:
print(data["Writeup"][0])

I had a lot of fun with this competition and learned a lot about ratings systems.
Sadly, I only came 18th  )
If you're interested, you can download all of my code and analysis from my github repo   
There are implementations of a few rating systems (elo, glicko, chessmetrics, etc) and many attempts at improving them (a nice little experimentation framework).
Thanks all. Looking forward to the next big comp!
jasonb


And this is in good format

In [10]:
que_splitter = lambda text: text[:len(text)//2]
ans_splitter = lambda text: text[len(text)//2:]

In [11]:
data["Qeustion"] = data["Writeup"].map(que_splitter)
data["Answer"] = data["Writeup"].map(ans_splitter)

In [12]:
data

Unnamed: 0,Competition Launch Date,Title of Competition,Competition URL,Date of Writeup,Title of Writeup,Writeup,Writeup URL,Qeustion,Answer
0,08/03/2010 00:00:00,Chess ratings - Elo versus the Rest of the World,https://www.kaggle.com/c/2447,11/18/2010 00:06:46,Released: my Source Code and Analysis,I had a lot of fun with this competition and l...,https://www.kaggle.com/c/2447/discussion/185,I had a lot of fun with this competition and l...,"implementations of a few rating systems (elo,..."
1,08/03/2010 00:00:00,Chess ratings - Elo versus the Rest of the World,https://www.kaggle.com/c/2447,11/20/2010 04:38:53,6th place(UriB) by Uri Blass,I calculated rating for every player in months...,https://www.kaggle.com/c/2447/discussion/192,I calculated rating for every player in months...,formed practically better than his real score ...
2,08/03/2010 00:00:00,Chess ratings - Elo versus the Rest of the World,https://www.kaggle.com/c/2447,11/23/2010 10:38:23,7th place - littlefish,I'm a little surprised I ended up in the top-1...,https://www.kaggle.com/c/2447/discussion/194,I'm a little surprised I ended up in the top-1...,eighted with the square root of the number of ...
3,08/03/2010 00:00:00,Chess ratings - Elo versus the Rest of the World,https://www.kaggle.com/c/2447,11/20/2010 11:27:17,3rd place: Chessmetrics - Variant,"Dear all,it was a great competition, thanks a ...",https://www.kaggle.com/c/2447/discussion/193,"Dear all,it was a great competition, thanks a ...",ating_level = \r\n2.5game_weight = (1 / (1 + ...
4,08/03/2010 00:00:00,Chess ratings - Elo versus the Rest of the World,https://www.kaggle.com/c/2447,11/18/2010 02:44:10,2nd place: TrueSkill Through Time,"Wow, this is a surprise! I looked at this comp...",https://www.kaggle.com/c/2447/discussion/186,"Wow, this is a surprise! I looked at this comp...",hrowing away valuable information. I switched ...
...,...,...,...,...,...,...,...,...,...
3122,02/23/2023 17:25:32,Google - Isolated Sign Language Recognition,https://www.kaggle.com/c/46105,05/02/2023 09:45:01,49th place silver solution,Thank you Kaggle and Pop sign for hosting this...,https://www.kaggle.com/c/46105/discussion/406426,Thank you Kaggle and Pop sign for hosting this...,for the remaining\nFP16 quantization reduced m...
3123,02/23/2023 17:25:32,Google - Isolated Sign Language Recognition,https://www.kaggle.com/c/46105,05/02/2023 10:13:31,10th place solution,"\n First, I would like to thank the Armed For...",https://www.kaggle.com/c/46105/discussion/406434,"\n First, I would like to thank the Armed For...","more), Most of the boost i get from \n\nMixup..."
3124,02/23/2023 17:25:32,Google - Isolated Sign Language Recognition,https://www.kaggle.com/c/46105,05/02/2023 03:24:28,Solution - Single transformer without val dataset,Thanks to the organisers of the PopSign Games ...,https://www.kaggle.com/c/46105/discussion/406346,Thanks to the organisers of the PopSign Games ...,16] not working for me.\nHow to do rotation co...
3125,02/23/2023 17:25:32,Google - Isolated Sign Language Recognition,https://www.kaggle.com/c/46105,05/02/2023 04:01:15,Top 8% Bronze Medal Solution,\n Many congratulations to all the winners in...,https://www.kaggle.com/c/46105/discussion/406354,\n Many congratulations to all the winners in...,tion. \nDataset Only Competition Data\nModel ...


Now our data looks like this 

Lets focus on our major columns only 

In [13]:
data.drop(["Competition Launch Date" , "Title of Competition" , 
          "Competition URL" , "Date of Writeup" , "Title of Writeup" , 
          "Writeup URL"], axis = 1 , inplace = True)

In [14]:
data

Unnamed: 0,Writeup,Qeustion,Answer
0,I had a lot of fun with this competition and l...,I had a lot of fun with this competition and l...,"implementations of a few rating systems (elo,..."
1,I calculated rating for every player in months...,I calculated rating for every player in months...,formed practically better than his real score ...
2,I'm a little surprised I ended up in the top-1...,I'm a little surprised I ended up in the top-1...,eighted with the square root of the number of ...
3,"Dear all,it was a great competition, thanks a ...","Dear all,it was a great competition, thanks a ...",ating_level = \r\n2.5game_weight = (1 / (1 + ...
4,"Wow, this is a surprise! I looked at this comp...","Wow, this is a surprise! I looked at this comp...",hrowing away valuable information. I switched ...
...,...,...,...
3122,Thank you Kaggle and Pop sign for hosting this...,Thank you Kaggle and Pop sign for hosting this...,for the remaining\nFP16 quantization reduced m...
3123,"\n First, I would like to thank the Armed For...","\n First, I would like to thank the Armed For...","more), Most of the boost i get from \n\nMixup..."
3124,Thanks to the organisers of the PopSign Games ...,Thanks to the organisers of the PopSign Games ...,16] not working for me.\nHow to do rotation co...
3125,\n Many congratulations to all the winners in...,\n Many congratulations to all the winners in...,tion. \nDataset Only Competition Data\nModel ...


Now we will make a list continaing list of `Questions` and their correspodning `Answers`

In [15]:
pairs = []

for i in tqdm.tqdm(range(data.shape[0]) , total = data.shape[0]):
    
    sample_list = []
    
    sample_list.append(data["Qeustion"][i])
    sample_list.append(data["Answer"][i])
    
    pairs.append(sample_list)

100%|██████████| 3127/3127 [00:00<00:00, 41020.44it/s]


In [16]:
x = np.array(pairs)
x.shape

(3127, 2)

So now we have our data in the correct format 

Lets save it somewhere

In [17]:
text_data = []
file_count = 0

os.mkdir("/kaggle/working/data")

In [18]:
for sample in tqdm.tqdm([x[0] for x in pairs]):
    
    text_data.append(sample)

    if len(text_data) == 10000:
        
        with open(f'/kaggle/working/data/text_{file_count}.txt', 'w', encoding='utf-8') as fp: 
            fp.write('\n'.join(text_data))
        
        text_data = []
        file_count += 1

100%|██████████| 3127/3127 [00:00<00:00, 763288.63it/s]


In [19]:
paths = [str(x) for x in Path('./data').glob('**/*.txt')]

# 2 | Embeddings/Tokenizing 🔢

Okay, so just hear me out. First thing to notice, we cannot just put letters into a model and expect it to undertand everything. No, We need to somehow make this characters into numbers, somehow, we really dont know how, but somehow, we will do that. 

Okay so we know what characters we have, like we know, all the characters will fall in the `English Alphabet`, maybe we find some extra characters like, `"," , "." , etc`. So what if we number them like that only. 

Lets assume we have a letter like `Optimus Prime`, we know that in the `English Alphabet` ,  `O`  comes at `15`, so we can number `O` as `15` and like this only the whole sequence becomes something like this 

|_____|_____
|---|---
|O|15
|p|42
|t|46
|i|35
|m|39
|u|47
|s|45
| |0
|P|16
|r|44
|i|35
|m|39
|e|31

We call this numerical representation of a `str`, **Embedding/Tokenizing**

What we did here was `character encoding` , means we were taking every character to be distinct of each other, or be independent to each character. 

There are other types of possible encoding available, such as `Bag Of Words , TF-IDF , Word2Vec , Glove`. There are more available like **[Sentence Piece by Google](https://github.com/google/sentencepiece)**  or **[Tik Token by OpenAI](https://github.com/openai/tiktoken)**
 , which you can try 

Here we will be using **[BertWordPeiceTokenizer](https://huggingface.co/learn/nlp-course/chapter6/6?fw=pt)**

In [20]:
from tokenizers import BertWordPieceTokenizer
from transformers import BertTokenizer

In [21]:
tokenizer = BertWordPieceTokenizer(clean_text = True ,
                                   handle_chinese_chars = False , 
                                   strip_accents = False , 
                                   lowercase = True)

tokenizer

Tokenizer(vocabulary_size=0, model=BertWordPiece, unk_token=[UNK], sep_token=[SEP], cls_token=[CLS], pad_token=[PAD], mask_token=[MASK], clean_text=True, handle_chinese_chars=False, strip_accents=False, lowercase=True, wordpieces_prefix=##)

In [22]:
tokenizer.train( files = paths , vocab_size = 30_000 ,  min_frequency = 5 ,
                limit_alphabet = 1000 , wordpieces_prefix = '##' ,
                special_tokens=['[PAD]', '[CLS]', '[SEP]', '[MASK]', '[UNK]'])






In [23]:
os.mkdir('/kaggle/working/bert-it-1')
tokenizer.save_model('/kaggle/working/bert-it-1', 'bert-it')
tokenizer = BertTokenizer.from_pretrained('/kaggle/working/bert-it-1/bert-it-vocab.txt', local_files_only=True)



Lets assume we have the line `A person who eats neighbourhood children`

In [24]:
tokenizer("A person who eats neighbourhood children")

{'input_ids': [1, 4, 4, 4, 4, 4, 4, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

Then this will be the corresponding embedding 

# 3 | BERTDataset 📚

The `BERTDataset` class is a `PyTorch dataset class` that can be used to `train a BERT model`. 
* We take
* * A Data Pair
* * Tokenizer
* * A Sequence Length as inputs. 
* We then `randomly` select a `sentence from the data pair` and use the `tokenizer to convert it into a sequence of tokens`. 
* We then `randomly replace` some of the tokens with `[MASK] tokens`, and some of the `tokens with randomly generated tokens`. 
* We then create a `segment label` for `each token`, which indicates `whether the token belongs` to the `first sentence` or the `second sentence`. 
* Finally, we `pad` the sequence to the specified sequence length.

In [25]:
from torch.utils.data import Dataset

In [26]:
class BERTDataset(Dataset):pass

Lets add some initializers to our class 

In [27]:
class BERTDataset(Dataset):

    def __init__(self , data_pair , tokenizer , seq_len = 64):

        self.tokenizer = tokenizer
        self.seq_len = seq_len
        
        self.lines = data_pair
        self.corpus_lines - len(data_pair)

Now leta add some `getters and setters` 

In [28]:
class BERTDataset(Dataset):
    def __init__(self, data_pair, tokenizer, seq_len=64):

        self.tokenizer = tokenizer
        self.seq_len = seq_len
        self.corpus_lines = len(data_pair)
        self.lines = data_pair

    __len__ = lambda self : self.corpus_lines

    get_random_line = lambda self : self.lines[random.randrange(len(self.lines))][1]
    
    def get_corpus(self, item) : return self.lines[item][0], self.lines[item][1]

    def get_sent(self, index):

        t1, t2 = self.get_corpus_line(index)

        if random.random() > 0.5 : 
            
            return t1 , t2 , 1
        
        else : 
            
            return t1, self.get_random_line() , 0

Now lets assume we are on this point of the data

In [29]:
pairs[0][0]

"I had a lot of fun with this competition and learned a lot about ratings systems.\r\nSadly, I only came 18th  )\r\nIf you're interested, you can download all of my code and analysis from my github repo   \r\nThere are"

If we want to get this as a list

In [30]:
pairs[0][0].split()

['I',
 'had',
 'a',
 'lot',
 'of',
 'fun',
 'with',
 'this',
 'competition',
 'and',
 'learned',
 'a',
 'lot',
 'about',
 'ratings',
 'systems.',
 'Sadly,',
 'I',
 'only',
 'came',
 '18th',
 ')',
 'If',
 "you're",
 'interested,',
 'you',
 'can',
 'download',
 'all',
 'of',
 'my',
 'code',
 'and',
 'analysis',
 'from',
 'my',
 'github',
 'repo',
 'There',
 'are']

This is a little bit to overview

In [31]:
pairs[0][0].split()[0]

'I'

If we send this into our tokenizer 

In [32]:
tokenizer(pairs[0][0].split()[0])

{'input_ids': [1, 4, 2], 'token_type_ids': [0, 0, 0], 'attention_mask': [1, 1, 1]}

We focus on the `input_ids`

In [33]:
tokenizer(pairs[0][0].split()[0])["input_ids"]

[1, 4, 2]

If we want the middle element.

In [34]:
tokenizer(pairs[0][0].split()[0])["input_ids"][1:-1]

[4]

What we do is we randomly 
* Either mask the token
* or replace it with some other random token
* or do nothing

$80$% of the time we mask the otken. $10$% of the time we replace it and $10$% of the time we do nothing 

In [35]:
class BERTDataset(Dataset):
    def __init__(self, data_pair, tokenizer, seq_len=64):

        self.tokenizer = tokenizer
        self.seq_len = seq_len
        self.corpus_lines = len(data_pair)
        self.lines = data_pair

    __len__ = lambda self : self.corpus_lines

    get_random_line = lambda self : self.lines[random.randrange(len(self.lines))][1]
    
    def get_corpus(self, item) : return self.lines[item][0], self.lines[item][1]

    def get_sent(self, index):

        t1, t2 = self.get_corpus_line(index)

        if random.random() > 0.5 : return t1 , t2 , 1
        else : return t1, self.get_random_line() , 0

    def random_word(self, sentence):
        
        tokens = sentence.split()
        output_label = []
        output = []

        for i, token in tqdm.tqdm(enumerate(tokens) , total = len(tokens)):
            
            prob = random.random()

            token_id = self.tokenizer(token)['input_ids'][1:-1]

            if prob < 0.15:
                
                prob /= 0.15

                if prob < 0.8:
                    
                    for i in range(len(token_id)):
                        
                        output.append(self.tokenizer.vocab['[MASK]'])

                elif prob < 0.9:
                    
                    for i in range(len(token_id)):
                        
                        output.append(random.randrange(len(self.tokenizer.vocab)))


                else:
                    
                    output.append(token_id)

                output_label.append(token_id)

            else:
                
                output.append(token_id)
                
                for i in range(len(token_id)):
                    
                    output_label.append(0)

        output = list(itertools.chain(*[[x] if not isinstance(x, list) else x for x in output]))
        output_label = list(itertools.chain(*[[x] if not isinstance(x, list) else x for x in output_label]))
        
        assert len(output) == len(output_label)
        
        return output, output_label

Now we just get the `padded tokens` out 

In [36]:
class BERTDataset(Dataset):
    def __init__(self, data_pair, tokenizer, seq_len=64):

        self.tokenizer = tokenizer
        self.seq_len = seq_len
        self.corpus_lines = len(data_pair)
        self.lines = data_pair

    __len__ = lambda self : self.corpus_lines

    get_random_line = lambda self : self.lines[random.randrange(len(self.lines))][1]
    
    def get_corpus(self, item) : return self.lines[item][0], self.lines[item][1]

    def get_sent(self, index):

        t1, t2 = self.get_corpus(index)

        if random.random() > 0.5 : return t1 , t2 , 1
        else : return t1, self.get_random_line() , 0

    def random_word(self, sentence):
        
        tokens = sentence.split()
        output_label = []
        output = []

        for i, token in enumerate(tokens):
            
            prob = random.random()

            token_id = self.tokenizer(token)['input_ids'][1:-1]

            if prob < 0.15:
                
                prob /= 0.15

                if prob < 0.8:
                    
                    for i in range(len(token_id)):
                        
                        output.append(self.tokenizer.vocab['[MASK]'])

                elif prob < 0.9:
                    
                    for i in range(len(token_id)):
                        
                        output.append(random.randrange(len(self.tokenizer.vocab)))

                else:
                    
                    output.append(token_id)

                output_label.append(token_id)

            else:
                
                output.append(token_id)
                
                for i in range(len(token_id)):
                    
                    output_label.append(0)

        output = list(itertools.chain(*[[x] if not isinstance(x, list) else x for x in output]))
        output_label = list(itertools.chain(*[[x] if not isinstance(x, list) else x for x in output_label]))
        
        assert len(output) == len(output_label)
        
        return output, output_label

    def __getitem__(self, item):

        t1, t2, is_next_label = self.get_sent(item)

        t1_random, t1_label = self.random_word(t1)
        t2_random, t2_label = self.random_word(t2)


        t1 = [self.tokenizer.vocab['[CLS]']] + t1_random + [self.tokenizer.vocab['[SEP]']]
        t1_label = [self.tokenizer.vocab['[PAD]']] + t1_label + [self.tokenizer.vocab['[PAD]']]
        
        t2 = t2_random + [self.tokenizer.vocab['[SEP]']]
        t2_label = t2_label + [self.tokenizer.vocab['[PAD]']]

        segment_label = (
            [
                1 for _ in range(len(t1))
            ] + 
            [
                2 for _ in range(len(t2))
            ]
        )[:self.seq_len]
        
        bert_input = (t1 + t2)[:self.seq_len]
        bert_label = (t1_label + t2_label)[:self.seq_len]
        
        padding = [
            self.tokenizer.vocab['[PAD]'] 
            for _ in range(self.seq_len - len(bert_input))
        ]
        
        bert_input.extend(padding), bert_label.extend(padding), segment_label.extend(padding)

        output = {
            "bert_input": bert_input,
            "bert_label": bert_label,
            "segment_label": segment_label,
            "is_next": is_next_label
        }

        return {
            key: torch.tensor(value) 
            for key, value in output.items()
        }

# 4 | Positional Embedding 💪

So this is a sine curve, with just differnet varations

In [37]:
from IPython.display import IFrame
import torch

In [38]:
IFrame("https://www.desmos.com/calculator/hiipopla5u" , 1000 , 200)

|_____|_____
|---|---
|$$sin(x)$$|Red
|$$sin(\frac{x}{2})$$|Blue
|$$sin(2x)$$|Green

One thisng that this shows is that the `squeeshing` of the `sine curve` is `directly dependent` on the value of $x$. 

So we can say that a point projected on $sin(69)$ will be closer to the same point if projected on $sin(88)$ and would be comparetivery farther to if the same point was projected on $sin(6969)$.

Now lets talk with respect to our taks. We have converted a sentence to a list of numbers. What if we project those numbers on a $sin()$ curve. Words having similar number will be closer and numbers having different numbers will be farther. 

Suprisingly we have a similar curve like the $sin()$ which is $cosine()/cos()$. 

So we have $2$ facotrs for determining the positional embeddings of words in a sentence. 

So now we have the 
* words as numbers
* Their positions in the form of $sin()/cosine()$ curves

We use the formula $$sin(\frac{position}{1000^{\frac{2i}{dimension}}})$$
and $$cos(\frac{position}{1000^{\frac{2(i + 1)}{dimension}}})$$


The `PositionalEmbedding` class in `PyTorch` is a module that adds `positional information` to a sequence of tokens. 
* We first create a `zero tensor of size` 
```
(max_len , d_model)
```
where `max_len is the maximum length` of a sequence and `d_model is the dimension of the embedding space`. 
* We then assign `each position` a `unique value` using a `sine and cosine function`. 
* We then `add the positional embedding` to the `embedding of each token` in the sequence.

In [39]:
class PositionalEmbedding(torch.nn.Module):

    def __init__(self, d_model, max_len=128):
        super().__init__()

        pe = torch.zeros(max_len, d_model).float()
        pe.require_grad = False

        for pos in range(max_len):   

            for i in range(0, d_model, 2):   
                
                pe[pos, i] = math.sin(pos / (10000 ** ((2 * i)/d_model)))
                pe[pos, i + 1] = math.cos(pos / (10000 ** ((2 * (i + 1))/d_model)))


        self.pe = pe.unsqueeze(0)   

    def forward(self, x): return self.pe

# 5 | BERTEmbedding 🌐

Embedding is like a mapping of a object into a numerical value. 

Assume we have the character `King`. And a factor `wealth`. If we see intuitevily, a connecting number between `King` and `Wealth` might be $10$. 

This is like defining a `non-numerical object` with $1$ factor. Lets say we have $2$ facotors, `wealth and poverty`. Again if we see intuitively, the connection might be like $[10 , 0]$. 

Similarly we can do this for a number of words, like might be `Advisor`, `Queen`. Making it a matrix like 

```
[[10 , 9 , 8] , 
 [1 , 0 , 3]] 
```

Where $1$ column represents $1$ particular entity, and $1$ row represents $1$ attribute.

Similary now we can increase the nummber of rows and number of column, according to our need.

The same concept is applied to `Embedding Tables`. We can even say them `EMbedding Matrices`. The only difference is that they do not use partuclar defined facotrs like `wealth/poverty`. Rather a bunch of neural networks to create the `Matrix`

Now lets try to understand this for a `larger set` of data of text. At the time of `intializing` we `do not know any relations` between any of the words, as we go through the corpus of data and `find the words that are used in a particular combination frequently`, we slowly change the `vector representation` of those words `to be close to each other`.


In [40]:
torch.nn.Embedding(3 , 2)

Embedding(3, 2)


The `BERTEmbedding class` is a `PyTorch` class that implements the `Embedding Layer` for the `BERT model`. It takes three arguments in its constructor: 
* Size of the Vocablary `vocab_size` 
* Size of the Embedding `embed_size` 
* Droput Rate `dropout`. 

The `forward method` of the BERTEmbedding class takes two arguments: 
* Sequences `sequence` 
* Segements `segment_label`. 
The forward method  
* We first compute the `token embedding` , the `positional embedding` , and the `segment embedding`. 
* We then sum these three embeddings and passe the result through a dropout layer. The output of the dropout layer is the embedding vector for the current token.

In [41]:
class BERTEmbedding(torch.nn.Module):

    def __init__(self, vocab_size, embed_size, seq_len=64, dropout=0.1):

        super().__init__()

        self.embed_size = embed_size

        self.token = torch.nn.Embedding(vocab_size, embed_size, padding_idx=0)
        self.segment = torch.nn.Embedding(3, embed_size, padding_idx=0)

        self.position = PositionalEmbedding(d_model=embed_size, max_len=seq_len)
        self.dropout = torch.nn.Dropout(p=dropout)
       
    def forward(self, sequence, segment_label):

        x = self.token(sequence) + self.position(sequence) + self.segment(segment_label)

        return self.dropout(x)

# 6 | MultiHeadedAttention 🔱

<img src = "https://miro.medium.com/v2/resize:fit:856/1*ZCFSvkKtppgew3cc7BIaug.png" width = 400>

Now we will talk in the terms of this image(it is easier). Focus on the right part of the diagram

We have made our input the way we wanted, Now we need to apply `Masked Multi-Head Attention`

What we do is we try to pass these embedding through $3$ different `networks/ Linear Layers` named as 

|Name|Symbol
|---|---
|$Key$|$K$
|$Query$|$Q$
|$Value$|$V$

an then apply the formula 

$$f(x) = softmax(\frac{QK^T}{\sqrt{d_k}})$$

This is what we call the `Multi Head Attention Mechanism`

The `MultiHeadedAttention class` is a `PyTorch` module that `implements` the `multi-head attention mechanism`. It takes four arguments in its constructor: 
* Number of Heads `heads`
* Size of the Heads `d_model`
* Dropout Rate `dropout` 
* Masked Areas `mask`.
The `forward method` of the MultiHeadedAttention class takes four arguments: 
* Query `query`
* Key `key`
* Value `value`
* Mask `mask`

* We first compute the attention weights 
* Then compute the attention output
* And finally passe the attention output through a linear layer. The output of the linear layer is the attention-weighted sum of the value vectors.

In [42]:
class MultiHeadedAttention(torch.nn.Module):
    
    def __init__(self, heads, d_model, dropout=0.1):
        
        super(MultiHeadedAttention, self).__init__()
        
        assert d_model % heads == 0
        
        self.d_k = d_model // heads
        
        self.heads = heads
        self.dropout = torch.nn.Dropout(dropout)

        self.query = torch.nn.Linear(d_model, d_model)
        self.key = torch.nn.Linear(d_model, d_model)
        self.value = torch.nn.Linear(d_model, d_model)
        
        self.output_linear = torch.nn.Linear(d_model, d_model)
        
    def forward(self, query, key, value, mask):

        query = self.query(query)
        key = self.key(key)        
        value = self.value(value)   

        query = query.view(query.shape[0], -1, self.heads, self.d_k).permute(0, 2, 1, 3)   
        key = key.view(key.shape[0], -1, self.heads, self.d_k).permute(0, 2, 1, 3)  
        value = value.view(value.shape[0], -1, self.heads, self.d_k).permute(0, 2, 1, 3)  

        scores = torch.matmul(query, key.permute(0, 1, 3, 2)) / math.sqrt(query.size(-1))

        scores = scores.masked_fill(mask == 0, -1e9)    

        weights = F.softmax(scores, dim=-1)           
        weights = self.dropout(weights)

        context = torch.matmul(weights, value)

        context = context.permute(0, 2, 1, 3).contiguous().view(context.shape[0], -1, self.heads * self.d_k)

        return self.output_linear(context)

This not the $BERT$. THis is just the starting of `Multi Head Attention`. We will cover up `BERT` in upcoming days 

# 7 | Ending

**THAT IT FOR TODAY GUYS**

**WE WILL IMPROVE THIS IN UPCOMING VERSIONS**

**PLEASE COMMENT YOUR THOUGHTS, HIHGLY APPRICIATED**

**DONT FORGET TO MAKE AN UPVOTE, IF YOU LIKED MY WORK**

<img src = "https://i.imgflip.com/19aadg.jpg">

**PEACE OUT $:)$**