**INITIALIZATION:**
- I put these three lines of code at the top of each of my notebooks. The first two reload imported modules automatically, which helps prevent problems when rerunning the same project, and the third renders visualizations inside the notebook.

In [3]:
#@ INITIALIZATION: 
%reload_ext autoreload
%autoreload 2
%matplotlib inline

**LIBRARIES AND DEPENDENCIES:**
- I import all the libraries and dependencies required for the project in a single cell.

In [2]:
#@ INSTALLING DEPENDENCIES: UNCOMMENT BELOW: 
# !pip install transformers[sentencepiece]

In [4]:
#@ IMPORTING LIBRARIES AND DEPENDENCIES:
import torch
import transformers
from transformers import BertConfig, BertModel
from transformers import BertTokenizer
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification

**TRANSFORMER**
- I will create the BERT model here. 

In [5]:
#@ INITIALIZING BERT MODEL: RANDOM: UNCOMMENT BELOW:
# config = BertConfig()                               # Building BERT Config.
# model = BertModel(config)                           # Building BERT Model. 
# print(config)                                       # Inspecting Configurations. 

In [8]:
#@ INITIALIZING PRETRAINED MODEL:
model = BertModel.from_pretrained("bert-base-cased")

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [9]:
#@ SAVING THE PRETRAINED BERT MODEL:
model.save_pretrained("./model")

**TOKENIZERS:**
- The first type of tokenizer that comes to mind is the `word-based` tokenizer. It is generally easy to set up and use with only a few rules, and it often yields decent results.

In [10]:
#@ EXAMPLE OF WORD BASED TOKENIZATION:
tokenized_text = "I am Thinam Tamang".split()           # Initializing Tokenization. 
print(tokenized_text)                                   # Inspecting Tokens. 

['I', 'am', 'Thinam', 'Tamang']
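
The split above only produces strings. To feed a model, each word also needs an ID, and a word-based vocabulary has no entry for words it has never seen. The following is a minimal sketch of that limitation; the corpus, the vocabulary and the `[UNK]` ID are made up for illustration and do not come from any pretrained tokenizer.

```python
# Toy word-based tokenizer: the corpus, vocabulary and "[UNK]" id are illustrative assumptions.
corpus = ["I am Thinam Tamang", "I am learning"]

# Build a vocabulary from whitespace-separated words, reserving id 0 for unknown words.
vocab = {"[UNK]": 0}
for sentence in corpus:
    for word in sentence.split():
        vocab.setdefault(word, len(vocab))


def encode_words(text):
    """Map each whitespace-separated word to its id, or to the [UNK] id if unseen."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.split()]


known_ids = encode_words("I am Thinam Tamang")   # Every word is in the vocabulary.
unseen_ids = encode_words("I am tokenizing")     # "tokenizing" was never seen.
print(known_ids)
print(unseen_ids)
```

Every unseen word collapses to the same unknown ID, so the model cannot tell such words apart.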


**SUBWORD TOKENIZATION:**
- Subword tokenization relies on the principle that frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords.
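
To make the principle concrete, here is a simplified sketch of a greedy, longest-match-first split in the WordPiece style. The vocabulary `toy_vocab` and the helper `wordpiece` are hypothetical and written only for this illustration; a pretrained BERT vocabulary has tens of thousands of entries and the library's tokenizer handles many more cases (casing, punctuation, accents).

```python
# Hypothetical toy vocabulary, chosen so that "Transformer" is not a whole-word entry.
toy_vocab = {"Using", "a", "Trans", "##former", "network", "is", "simple", "!"}


def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # Mark pieces that continue a word.
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:                      # No piece fits: give up on the whole word.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


sentence = "Using a Transformer network is simple !"
subword_tokens = [piece for word in sentence.split() for piece in wordpiece(word, toy_vocab)]
print(subword_tokens)
```

Frequent words stay whole, while `Transformer` is decomposed into `Trans` and `##former`.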

In [14]:
#@ INITIALIZING TOKENIZATION:
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")            # Initializing Tokenizer. 
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")            # Initializing Tokenizer. 
tokenizer("Using a Transformer network is simple !")                    # Implementing Tokenizer. 

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 106, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [15]:
#@ SAVING THE TOKENIZER:
tokenizer.save_pretrained("./tokenizer");

**ENCODING:**
- Translating text to numbers is known as `encoding`. Encoding is a two-step process: tokenization, followed by conversion of the tokens to input IDs.

In [16]:
#@ IMPLEMENTATION OF TOKENIZATION:
sequence = "Using a Transformer network is simple !"                    # Initializing Sequence. 
tokens = tokenizer.tokenize(sequence)                                   # Initializing Tokenization. 
print(tokens)                                                           # Inspecting Tokens. 

['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple', '!']


In [17]:
#@ CONVERSION TO INPUT IDs:
ids = tokenizer.convert_tokens_to_ids(tokens)                           # Conversion.
print(ids)

[7993, 170, 13809, 23763, 2443, 1110, 3014, 106]
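
These IDs differ from the `input_ids` returned earlier by calling the tokenizer directly on the same sentence: the direct call also adds the special tokens the model expects. A small sketch of that difference, using only the numbers printed in this notebook and assuming the `bert-base-cased` IDs 101 for `[CLS]` and 102 for `[SEP]`:

```python
# IDs printed by convert_tokens_to_ids above (no special tokens).
ids = [7993, 170, 13809, 23763, 2443, 1110, 3014, 106]

# Assumed special-token ids of the bert-base-cased vocabulary.
CLS_ID, SEP_ID = 101, 102

# Wrapping the sequence reproduces the input_ids from the direct tokenizer call.
with_special_tokens = [CLS_ID] + ids + [SEP_ID]
print(with_special_tokens)
```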


In [18]:
#@ IMPLEMENTATION OF TOKENIZATION: 
sequence = ["I've been waiting for a HuggingFace course my whole life.",
            "I hate this so much!"]                                     # Initializing Sequences. 
tokens = tokenizer.tokenize(sequence)                                   # Tokens of Both Sentences in One Flat List. 
ids = tokenizer.convert_tokens_to_ids(tokens)                           # Conversion.
print(ids)

[146, 112, 1396, 1151, 2613, 1111, 170, 20164, 10932, 2271, 7954, 1736, 1139, 2006, 1297, 119, 146, 4819, 1142, 1177, 1277, 106]


**Batching** is the act of sending multiple sentences through the model, all at once. If we have only one sentence, we can just build a batch with a single sequence. 

In [20]:
#@ IMPLEMENTATION OF TOKENIZATION: UNCOMMENT BELOW: 
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"          # Initialization. 
tokenizer = AutoTokenizer.from_pretrained(checkpoint)                   # Initializing Tokenizer. 
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)  # Initializing Model.
sequence = "I've been waiting for a Spiderman movie my whole life."     # Initialization.
tokens = tokenizer.tokenize(sequence)                                   # Encoding. 
ids = tokenizer.convert_tokens_to_ids(tokens)                           # Converting into Token IDs. 
input_ids = torch.tensor(ids)                                           # Converting into Tensors. 
# model(input_ids)                                                      # Fails Without a Batch Dimension. 

In [21]:
#@ IMPLEMENTATION OF TOKENIZATION: ADDING DIMENSION: 
sequence = "I've been waiting for a Spiderman movie my whole life."     # Initialization.
tokens = tokenizer.tokenize(sequence)                                   # Encoding. 
ids = tokenizer.convert_tokens_to_ids(tokens)                           # Converting into Token IDs. 
input_ids = torch.tensor([ids])                                         # Converting into Tensors. 
print("Input IDs:", input_ids)
output = model(input_ids)                                               # Implementation of Model. 
print("Logits:", output.logits)                                         # Inspection. 

Input IDs: tensor([[1045, 1005, 2310, 2042, 3403, 2005, 1037, 6804, 2386, 3185, 2026, 2878,
         2166, 1012]])
Logits: tensor([[-0.4018,  0.7059]], grad_fn=<AddmmBackward0>)


In [22]:
#@ IMPLEMENTATION OF TOKENIZATION: ADDING DIMENSION: 
sequence = "I've been waiting for a Spiderman movie my whole life."     # Initialization.
tokens = tokenizer.tokenize(sequence)                                   # Encoding. 
ids = tokenizer.convert_tokens_to_ids(tokens)                           # Converting into Token IDs. 
input_ids = torch.tensor([ids, ids])                                    # Converting into Tensors. 
print("Input IDs:", input_ids)
output = model(input_ids)                                               # Implementation of Model. 
print("Logits:", output.logits)                                         # Inspection. 

Input IDs: tensor([[1045, 1005, 2310, 2042, 3403, 2005, 1037, 6804, 2386, 3185, 2026, 2878,
         2166, 1012],
        [1045, 1005, 2310, 2042, 3403, 2005, 1037, 6804, 2386, 3185, 2026, 2878,
         2166, 1012]])
Logits: tensor([[-0.4018,  0.7059],
        [-0.4018,  0.7059]], grad_fn=<AddmmBackward0>)
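
The logits are raw, unnormalized scores. A softmax turns them into probabilities, as in the sketch below. It is written in plain Python so that it runs without loading the model; with tensors, `torch.softmax(output.logits, dim=-1)` would be the usual call. The label order is an assumption for this checkpoint and can be checked with `model.config.id2label`.

```python
import math

# Logits printed above for one copy of the sentence.
logits = [-0.4018, 0.7059]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax(logits)
labels = ["NEGATIVE", "POSITIVE"]   # Assumed label order; confirm with model.config.id2label.
predicted = labels[probabilities.index(max(probabilities))]
print(probabilities, predicted)
```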


**PADDING THE INPUTS:**
- **Padding** makes sure all our sentences have the same length by adding a special word called the `padding token` to the shorter sentences.

In [23]:
#@ IMPLEMENTATION OF PADDING:
sequence1_ids = [[200, 200, 200]]                               # Initializing Sequence IDs.  
sequence2_ids = [[200, 200]]                                    # Initializing Sequence IDs.  
batched_ids = [[200, 200, 200], 
               [200, 200, tokenizer.pad_token_id]]              # Implementation of Padding. 
print(model(torch.tensor(sequence1_ids)).logits)                # Inspecting Logits. 
print(model(torch.tensor(sequence2_ids)).logits)                # Inspecting Logits.
print(model(torch.tensor(batched_ids)).logits)                  # Inspecting Logits.

tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward0>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
tensor([[ 1.5694, -1.3895],
        [ 1.3374, -1.2163]], grad_fn=<AddmmBackward0>)
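
Note that the second row of the batched output does not match the logits of the second sequence on its own: without an attention mask, the padding token is attended to like any other token. The padding step itself can be sketched as a small helper. `pad_batch` is a hypothetical name, and the pad ID of 0 is an assumption that matches BERT-style vocabularies; in the notebook it comes from `tokenizer.pad_token_id`.

```python
def pad_batch(sequences, pad_id):
    """Right-pad every sequence with pad_id up to the length of the longest one."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (longest - len(seq)) for seq in sequences]


PAD_ID = 0   # Assumed pad id; use tokenizer.pad_token_id with a real tokenizer.
padded = pad_batch([[200, 200, 200], [200, 200]], PAD_ID)
print(padded)
```

The result is rectangular, so it can be converted into a single tensor.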


**ATTENTION MASKS:**
- **Attention masks** are tensors with the same shape as the input IDs tensor, filled with 0s and 1s: 1s indicate that the corresponding tokens should be attended to, and 0s indicate that the corresponding tokens should not be attended to, i.e., they should be ignored by the attention layers of the model.

In [24]:
#@ IMPLEMENTATION OF PADDING AND ATTENTION MASKS: 
batched_ids = [[200, 200, 200], 
               [200, 200, tokenizer.pad_token_id]]              # Implementation of Padding. 
attention_mask = [[1, 1, 1],
                  [1, 1, 0]]                                    # Initialization. 
outputs = model(torch.tensor(batched_ids),
                attention_mask=torch.tensor(attention_mask))    # Implementation of Attention Masks. 
print(outputs.logits)                                           # Inspection. 

tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
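
With the mask, the second row now matches the logits of the unpadded two-token sequence. The sketch below illustrates why, using a simplified masked softmax over made-up attention scores: positions with a 0 in the mask receive zero weight, so the remaining weights are the same as if the padded position were absent. This is an illustration of the idea, not the model's actual attention code, and it assumes at least one position is unmasked.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over scores where positions with mask == 0 get zero weight."""
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    top = max(masked)                          # Assumes at least one unmasked position.
    exps = [math.exp(s - top) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 3.0]                       # Made-up scores; the last position is padding.
weights_padded = masked_softmax(scores, [1, 1, 0])
weights_alone = masked_softmax(scores[:2], [1, 1])
print(weights_padded)
print(weights_alone)
```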


**CONCLUSION:**

In [25]:
#@ IMPLEMENTATION OF TOKENIZER:
sequence = "I've been waiting for a HuggingFace course my whole life."      # Text Example. 
model_inputs = tokenizer(sequence)                                          # Initializing Model Inputs. 
model_inputs = tokenizer(sequence, padding="longest")                       # Padding Up to the Longest Sequence in the Batch.
model_inputs = tokenizer(sequence, padding="max_length")                    # Padding Up to the Model Max Length. 
model_inputs = tokenizer(sequence, padding="max_length", max_length=8)      # Padding Up to the Specified Length. 
model_inputs = tokenizer(sequence, truncation=True)                         # Truncating Long Sequence. 
model_inputs = tokenizer(sequence, max_length=8, truncation=True)           # Truncating Long Sequence. 
model_inputs = tokenizer(sequence, padding=True, return_tensors="pt")       # Return PyTorch Tensors.
model_inputs = tokenizer(sequence, padding=True, return_tensors="tf")       # Return TensorFlow Tensors (Requires TensorFlow). 
model_inputs = tokenizer(sequence, padding=True, return_tensors="np")       # Return NumPy Arrays. 
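
Each call above overwrites `model_inputs`, so only the last one is kept. As a rough sketch of what combining `padding="max_length"`, `truncation=True` and a `max_length` does to a list of IDs, consider the hypothetical helper `fit_to_length` below. It ignores special tokens, which the real tokenizer preserves when truncating, and it assumes a pad ID of 0.

```python
def fit_to_length(ids, max_length, pad_id=0):
    """Truncate ids to max_length, then right-pad to max_length and build the mask."""
    clipped = ids[:max_length]
    padding = max_length - len(clipped)
    return {
        "input_ids": clipped + [pad_id] * padding,
        "attention_mask": [1] * len(clipped) + [0] * padding,
    }


short = fit_to_length([200, 200], max_length=4)    # Shorter than max_length: padded.
long = fit_to_length([200] * 6, max_length=4)      # Longer than max_length: truncated.
print(short)
print(long)
```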

In [29]:
#@ IMPLEMENTATION OF TOKENIZER: SPECIAL TOKENS: 
sequence = "I've been waiting for a HuggingFace course my whole life."      # Text Example. 
model_inputs = tokenizer(sequence)                                          # Tokenization. 
tokens = tokenizer.tokenize(sequence)                                       # Initializing Tokens
ids = tokenizer.convert_tokens_to_ids(tokens)                               # Initializing Input IDs. 
print(ids)

[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]


In [30]:
#@ IMPLEMENTATION OF TOKENIZER: DECODING TOKENS: 
print(tokenizer.decode(model_inputs["input_ids"]))
print(tokenizer.decode(ids))

[CLS] i've been waiting for a huggingface course my whole life. [SEP]
i've been waiting for a huggingface course my whole life.


In [31]:
#@ TOKENIZER TO MODEL:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)          # Initializing Model.
sequences = ["I've been waiting for a HuggingFace course my whole life.","So"]  # Text Example. 
tokens = tokenizer(sequences, padding=True, truncation=True, 
                   return_tensors="pt")                                         # Initializing Tokenization. 
output = model(**tokens)                                                        # Implementation of Model. 