In [2]:
# BERT stands for Bidirectional Encoder Representations from Transformers. Google has used BERT in Search to better
# understand queries, and question answering is one direct application of Transformer models such as BERT.

# BERT was pretrained on two objectives: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).
# Both objectives derive their labels from the text itself, so pretraining needs only raw, unlabeled text.
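
To make the MLM objective concrete, here is a minimal sketch of how training examples can be built from raw tokens. The 15% selection rate and the 80/10/10 replacement split follow the BERT paper; the function name `mask_tokens` and the toy token list are illustrative assumptions, not part of any library.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) for a toy masked-language-model example.

    labels[i] holds the original token where a prediction is required, else None.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)                     # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK_TOKEN)         # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(token)              # 10%: keep the token unchanged
        else:
            labels.append(None)                      # no prediction needed here
            corrupted.append(token)
    return corrupted, labels

tokens = ["i", "like", "nl", "##p"]
# mask_prob is raised to 0.5 only so that this tiny example is likely to mask something.
corrupted, labels = mask_tokens(tokens, vocab=tokens, mask_prob=0.5, seed=1)
print(corrupted)
print(labels)
```

The labels come from the original text, which is why no human annotation is needed for pretraining.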

In [6]:
# Transfer learning has two components: pretraining and fine-tuning.

# Model training typically passes through 3 stages:
# 1) Model architecture with random weights. (No knowledge of language)
# 2) Pretrained model. (Very good general understanding of language)
# 3) Fine-tuned model. (With BERT, fine-tuning can target text classification, named entity recognition, or question answering)

# BERT already gives us the pretrained model; we just need to fine-tune it for our use case.
# The fine-tuning step involves training the model on labeled data.

# Transfer learning is attractive because we only have to fine-tune the model: this is faster than training from scratch,
# needs far less labeled data, and usually gives strong results.
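
As a rough sketch of the starting point for fine-tuning, the function below loads the pretrained weights and attaches a new classification head using the `transformers` Auto classes. The helper name `load_for_classification` and the choice of two labels are assumptions for illustration; the actual training loop on labeled data is not shown.

```python
def load_for_classification(checkpoint="bert-base-uncased", num_labels=2):
    """Load pretrained BERT weights and attach a freshly initialized classification head."""
    # Imported lazily so that defining this helper does not require the library.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=num_labels
    )
    return tokenizer, model

# Usage (downloads the checkpoint on first call):
# tokenizer, model = load_for_classification(num_labels=2)
# batch = tokenizer(["I like NLP."], return_tensors="pt")
# logits = model(**batch).logits  # shape (1, num_labels); the head is untrained until fine-tuned
```

Only the small classification head starts from random weights, which is one reason fine-tuning needs much less labeled data than training from scratch.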

In [8]:
'''
Transformer Architecture:

The original Transformer is made up of an encoder and a decoder. For translation, we feed an English sentence into the encoder
and get the German translation from the decoder.

1. Encoder-decoder models are used for generative tasks conditioned on an input, such as translation or summarization. Eg: BART, T5
2. Encoder-only models are used when we need to understand the input, as in sentence classification and named entity recognition. Eg: BERT
3. Decoder-only models are used for open-ended text generation. Eg: GPT

BERT has no decoder, so it is not suited to text-generation tasks such as translation or summarization.
'''
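
One way to see the three families side by side is through the `transformers` Auto classes that typically load them. This is only a sketch: the mapping name `ARCHITECTURES`, the helper `load_model`, and the example checkpoints (`t5-small`, `gpt2`) are my own illustrative choices, and the encoder-only entry uses a classification head as just one possible task head.

```python
# Family -> (transformers Auto class name, example checkpoint on the Hugging Face Hub).
ARCHITECTURES = {
    "encoder-decoder": ("AutoModelForSeq2SeqLM", "t5-small"),
    "encoder-only": ("AutoModelForSequenceClassification", "bert-base-uncased"),
    "decoder-only": ("AutoModelForCausalLM", "gpt2"),
}

def load_model(family):
    """Load an example model for the given architecture family."""
    # Imported lazily so that the mapping above can be inspected without the library.
    import transformers

    class_name, checkpoint = ARCHITECTURES[family]
    auto_class = getattr(transformers, class_name)
    return auto_class.from_pretrained(checkpoint)

for family, (class_name, checkpoint) in ARCHITECTURES.items():
    print(f"{family}: {class_name} ({checkpoint})")

# Usage (downloads the checkpoint on first call):
# model = load_model("encoder-only")
```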

In [9]:
'''
Why tokenizers?
Models operate on numbers, not raw text. A tokenizer splits words into sub-words and maps each sub-word to a numerical ID,
turning text input into numerical data the model can consume.
'''
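
BERT's tokenizer uses WordPiece, which splits each word greedily into the longest sub-words found in the vocabulary. The sketch below is a simplified version of that idea with a toy vocabulary; the real tokenizer also lowercases, splits on punctuation, and caps word length, none of which is shown here. The names `wordpiece` and `toy_vocab` are illustrative.

```python
UNK = "[UNK]"

def wordpiece(word, vocab):
    """Greedy longest-match-first split of one lowercased word into sub-words."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [UNK]  # if any part cannot be matched, the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"i", "like", "nl", "##p"}
for word in "i like nlp".split():
    print(word, "->", wordpiece(word, toy_vocab))
print(wordpiece("nlp😀", toy_vocab))
```

With this toy vocabulary, `nlp` splits into `nl` and `##p`, while a word containing a character outside the vocabulary collapses to a single `[UNK]`.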

In [11]:
%%capture
!pip install transformers[sentencepiece] 

In [12]:
from transformers import AutoTokenizer
checkpoint = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [24]:
# print(tokenizer.vocab)
print(f'The vocabulary size is {len(tokenizer.vocab)}')

The vocabulary size is 30522


In [15]:
sentence = 'I like NLP'
print(sentence)
tokens = tokenizer.tokenize(sentence)
print(tokens)
ids = tokenizer.encode(sentence)
print(ids)
print(tokenizer.decode(ids))

I like NLP
['i', 'like', 'nl', '##p']
[101, 1045, 2066, 17953, 2361, 102]
[CLS] i like nlp [SEP]


In [16]:
print(f'{tokenizer.cls_token} -> {tokenizer.cls_token_id}')
print(f'{tokenizer.sep_token} -> {tokenizer.sep_token_id}')

[CLS] -> 101
[SEP] -> 102
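
The outputs above suggest what `encode` does for a single sentence: map each sub-word to its ID and wrap the result in `[CLS]` and `[SEP]`. Here is a toy re-creation using only the IDs printed in this notebook; the helper names `encode` and `decode` are illustrative and much simpler than the real tokenizer methods.

```python
# IDs copied from the notebook outputs for bert-base-uncased.
toy_vocab = {"[CLS]": 101, "[SEP]": 102, "i": 1045, "like": 2066, "nl": 17953, "##p": 2361}

def encode(tokens, vocab):
    """Map sub-word tokens to IDs and add the [CLS] and [SEP] special tokens."""
    body = [vocab[token] for token in tokens]
    return [vocab["[CLS]"]] + body + [vocab["[SEP]"]]

def decode(ids, vocab):
    """Map IDs back to tokens and glue ## continuation pieces onto the previous piece."""
    id_to_token = {token_id: token for token, token_id in vocab.items()}
    text = " ".join(id_to_token[token_id] for token_id in ids)
    return text.replace(" ##", "")

ids = encode(["i", "like", "nl", "##p"], toy_vocab)
print(ids)                     # [101, 1045, 2066, 17953, 2361, 102]
print(decode(ids, toy_vocab))  # [CLS] i like nlp [SEP]
```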


In [17]:
'😀' in tokenizer.vocab

False

In [18]:
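# The emoji is not in the vocabulary, so the whole word 'NLP😀' cannot be split into known sub-words and becomes [UNK].
sentence = 'I like NLP😀'
tokenizer.tokenize(sentence)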

['i', 'like', '[UNK]']

In [19]:
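# Passing two sentences as separate arguments encodes them as one pair: [CLS] first [SEP] second [SEP].
first_sentence = 'I like NLP.'
second_sentence = 'What about you?'
input = tokenizer(first_sentence, second_sentence, return_tensors='pt')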
input

{'input_ids': tensor([[  101,  1045,  2066, 17953,  2361,  1012,   102,  2054,  2055,  2017,
          1029,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [20]:
input['input_ids']

tensor([[  101,  1045,  2066, 17953,  2361,  1012,   102,  2054,  2055,  2017,
          1029,   102]])

In [21]:
input['token_type_ids']

tensor([[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]])
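
The `token_type_ids` tell the model which sentence of the pair each position belongs to. A minimal sketch of how the pair layout seen above can be assembled by hand, assuming the IDs printed in this notebook (the helper name `encode_pair` is illustrative):

```python
CLS_ID, SEP_ID = 101, 102  # special-token IDs printed earlier in the notebook

def encode_pair(first_ids, second_ids):
    """Build [CLS] first [SEP] second [SEP] plus the matching segment IDs."""
    input_ids = [CLS_ID] + first_ids + [SEP_ID] + second_ids + [SEP_ID]
    # Segment 0 covers [CLS], the first sentence and its [SEP];
    # segment 1 covers the second sentence and the final [SEP].
    token_type_ids = [0] * (len(first_ids) + 2) + [1] * (len(second_ids) + 1)
    return input_ids, token_type_ids

first = [1045, 2066, 17953, 2361, 1012]  # 'I like NLP.'
second = [2054, 2055, 2017, 1029]        # 'What about you?'
input_ids, token_type_ids = encode_pair(first, second)
print(input_ids)
print(token_type_ids)  # [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```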

In [22]:
input['attention_mask']

tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])

In [23]:
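# Passing a list encodes the sentences as a batch; padding=True pads the shorter one to the length of the longest.
first_sentence = 'I like NLP.'
second_sentence = 'What are your thoughts on the subject?'
input = tokenizer([first_sentence, second_sentence], padding=True, return_tensors='pt')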
input['attention_mask']

tensor([[1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
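
The zeros in the first row mark padding positions that the model should ignore. The sketch below reproduces that mask by hand; `pad_batch` is an illustrative helper, and the second sequence uses placeholder IDs of the right length rather than the real encoding.

```python
PAD_ID = 0  # [PAD] has ID 0 in bert-base-uncased

def pad_batch(batch, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build the attention mask."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return input_ids, attention_mask

first = [101, 1045, 2066, 17953, 2361, 1012, 102]  # 'I like NLP.' (7 tokens)
second = [101] + [999] * 8 + [102]                 # placeholder IDs, 10 tokens long
padded_ids, attention_mask = pad_batch([first, second])
print(padded_ids[0])
print(attention_mask)
```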