<a href="https://colab.research.google.com/github/Rami-RK/HugingFace_Transformers/blob/main/Hf_Tokenizer_Models_and_Ouputs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Hugging Face Tokenizer, Models and Model Outputs**

### Objectives:

At the end of this experiment you will understand:

1. The tokenizer classes in Hugging Face (HF)
2. The model classes in HF
3. The model output and how to interpret it

### **Tokenizer**
A tokenizer is in charge of preparing the inputs for a model. The Transformers library contains tokenizers for all of its models; tokenization rules are model-specific and differ from model to model.

However, there is a universal interface so that you don't have to worry about picking the right class. Specifically, there is a class called `AutoTokenizer`, to which you can pass a model checkpoint, just as with pipelines. It automatically gives you back the correct tokenizer object.

For example, if your model checkpoint is based on BERT, you get back a tokenizer object that has all the components the BERT model requires.

In [None]:
!pip install transformers

In [None]:
from transformers import AutoTokenizer

In [None]:
checkpoint='bert-base-cased' # Different Bert models are there.
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


In [None]:
tokenizer # Notice the output from tokenizer object

BertTokenizerFast(name_or_path='bert-base-cased', vocab_size=28996, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True)

#### Tokenizing any text:

Note the result of the code below: it is a dictionary with several keys. The key `input_ids` holds the integer IDs of the tokens obtained by tokenizing the input. `input_ids` is always present, because you always want to convert your text into token IDs. The other keys are `attention_mask` and `token_type_ids`.

Note that these keys can be specific to the type of model. For instance, `token_type_ids` shows up for BERT but not for DistilBERT.

In [None]:
tokenizer("Hello Dost")

{'input_ids': [101, 8667, 2091, 2050, 102], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1]}
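
For a single sentence every `token_type_ids` entry is 0. The key becomes informative when BERT receives a sentence pair, which is laid out as `[CLS] A [SEP] B [SEP]`. The following is a minimal pure-Python sketch of that layout, not the library's implementation: `build_pair_inputs` is a hypothetical helper, and the IDs are copied from the outputs shown in this notebook.

```python
def build_pair_inputs(ids_a, ids_b, cls_id=101, sep_id=102):
    """Lay out two already-tokenized segments the way BERT expects a pair."""
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment A (with [CLS] and its [SEP]) is type 0; segment B (with its [SEP]) is type 1.
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    attention_mask = [1] * len(input_ids)  # no padding here, so every token is attended to
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


hello_dost = [8667, 2091, 2050]           # 'Hello', 'Do', '##st'
going_to_home = [11099, 1106, 1313, 119]  # IDs of "Going to home." without [CLS]/[SEP]
pair = build_pair_inputs(hello_dost, going_to_home)
print(pair["token_type_ids"])  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

With the real tokenizer, passing two texts as a pair should give the same kind of layout.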

#### Using tokenizer object

In [None]:
tokens= tokenizer.tokenize("Hello Dost")
tokens

['Hello', 'Do', '##st']
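
The split of `Dost` into `Do` and `##st` comes from WordPiece, the subword scheme BERT uses: each word is matched greedily against the vocabulary, longest piece first, and continuation pieces carry a `##` prefix. Below is a rough sketch of that idea under stated assumptions: `toy_vocab` is a made-up five-entry vocabulary (the real one has 28,996 entries), and `wordpiece_tokenize` is a hypothetical helper, not the library's code.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark pieces that continue a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


toy_vocab = {"Hello", "Do", "##st", "##s", "##t"}  # hypothetical vocabulary
tokens = [piece for word in "Hello Dost".split()
          for piece in wordpiece_tokenize(word, toy_vocab)]
print(tokens)  # ['Hello', 'Do', '##st']
```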

#### Converting tokens into token IDs

In [None]:
ids = tokenizer.convert_tokens_to_ids(tokens)
ids

[8667, 2091, 2050]

#### The two steps above can be done in one go with `encode`

In [None]:
ids_n = tokenizer.encode("Hello Dost")
ids_n

[101, 8667, 2091, 2050, 102]

**Note:** there are 5 IDs above because the input has been converted to `[CLS]` `Hello` `Do` `##st` `[SEP]`.

In [None]:
tokenizer.convert_ids_to_tokens(ids_n)

['[CLS]', 'Hello', 'Do', '##st', '[SEP]']

In [None]:
tokenizer.decode(ids_n) # Gives the single string with tokens joined back together

'[CLS] Hello Dost [SEP]'
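
The round trip above can be pictured as a plain dictionary lookup plus two special tokens. This is only an illustrative sketch: the five-entry `vocab` holds just the IDs seen in the outputs above, and `encode` and `decode` here are hypothetical helpers that mimic, rather than reproduce, the tokenizer methods of the same name.

```python
vocab = {"[CLS]": 101, "[SEP]": 102, "Hello": 8667, "Do": 2091, "##st": 2050}
id_to_token = {token_id: token for token, token_id in vocab.items()}


def encode(tokens):
    """Map tokens to IDs and wrap them in [CLS] ... [SEP]."""
    return [vocab["[CLS]"]] + [vocab[t] for t in tokens] + [vocab["[SEP]"]]


def decode(ids):
    """Map IDs back to tokens and glue '##' pieces onto the previous token."""
    text = ""
    for token in (id_to_token[i] for i in ids):
        if token.startswith("##"):
            text += token[2:]      # continuation piece: no space before it
        elif text:
            text += " " + token
        else:
            text = token
    return text


ids = encode(["Hello", "Do", "##st"])
print(ids)          # [101, 8667, 2091, 2050, 102]
print(decode(ids))  # [CLS] Hello Dost [SEP]
```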

#### Getting tensor as an output

The tokenizer output we just saw is a dictionary whose values are all Python lists. PyTorch models do not accept lists as input.

Instead, PyTorch models process torch tensors. To get the values back as tensors, we set the argument `return_tensors` to the string `pt`.

For TensorFlow, set the string to `tf`; if you just want NumPy arrays, pass `np`.



In [None]:
tokenizer("Hello Dost",return_tensors='pt') # tf, np

{'input_ids': tensor([[ 101, 8667, 2091, 2050,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}

#### Multiple inputs: add two more parameters, `padding` and `truncation`

In [None]:
data = [ "Hellow where are you going?","Going to home."]

In [None]:
model_inputs=tokenizer(data,padding=True,truncation=True,return_tensors='pt')
model_inputs

{'input_ids': tensor([[  101,  8667,  2246,  1187,  1132,  1128,  1280,   136,   102],
        [  101, 11099,  1106,  1313,   119,   102,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0, 0, 0]])}

Purpose of the attention mask: it tells the model which tokens it should pay attention to. Wherever the attention mask is zero (the padding positions), the model ignores those tokens and they do not contribute to the model output, which is as intended.
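
One common way to make a model "ignore" masked positions is to push their attention scores to a very large negative value before the softmax, so their weights become zero. The sketch below illustrates that idea in plain Python and is not taken from the BERT source: `masked_softmax` is a hypothetical helper, the `scores` are made up, and it assumes at least one position is unmasked.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over scores, giving zero weight wherever mask is 0."""
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


scores = [0.2, 1.5, 0.3, 0.9, 0.1, 0.4, 2.0, 2.0, 2.0]  # made-up attention scores
mask = [1, 1, 1, 1, 1, 1, 0, 0, 0]                       # second row of the mask above
weights = masked_softmax(scores, mask)
print(weights[6:])  # [0.0, 0.0, 0.0]
```

Even though the padded positions have the highest raw scores, they receive no weight.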

In summary, to ensure that a batch of data can be fed into a PyTorch model, we need to specify the `padding`, `truncation`, and `return_tensors` arguments. Also note that PyTorch is the default framework for Hugging Face and currently the most flexible.
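
To make the padding and truncation steps concrete, here is a minimal sketch of what they do to a batch of ID lists. It is a simplification under stated assumptions: `pad_and_truncate` is a hypothetical helper, the pad ID 0 is read off the output above, and the truncation rule (keep the last token in place) is only an approximation of what the tokenizer does.

```python
def pad_and_truncate(batch_ids, pad_id=0, max_length=None):
    """Pad every sequence to the longest one, optionally truncating first."""
    if max_length is not None:
        # Simplified truncation: cut the middle but keep the final token (e.g. [SEP]).
        batch_ids = [
            ids if len(ids) <= max_length else ids[: max_length - 1] + ids[-1:]
            for ids in batch_ids
        ]
    longest = max(len(ids) for ids in batch_ids)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch_ids]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [
    [101, 8667, 2246, 1187, 1132, 1128, 1280, 136, 102],
    [101, 11099, 1106, 1313, 119, 102],
]
padded = pad_and_truncate(batch)
print(padded["input_ids"][1])       # [101, 11099, 1106, 1313, 119, 102, 0, 0, 0]
print(padded["attention_mask"][1])  # [1, 1, 1, 1, 1, 1, 0, 0, 0]
```

After padding, every row has the same length, which is what allows the batch to be stacked into a single rectangular tensor.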

### **Model**

We are using a BERT model for text classification through the `AutoModelForSequenceClassification` class, since it is more flexible than a BERT-specific class (although we could also create a BERT-specific model). To load a pre-trained BERT model, we simply call `from_pretrained`, just as we did with the tokenizer.

Note that the checkpoint we pass in must match the checkpoint we passed in for the tokenizer, so that we get the right tokenization for the model.

In [None]:
from transformers import AutoModelForSequenceClassification

In [None]:
checkpoint = 'bert-base-cased'  # must match the tokenizer checkpoint
model= AutoModelForSequenceClassification.from_pretrained(checkpoint)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
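
The warning above names `classifier.weight` and `classifier.bias`: the classification head is a linear layer placed on top of the pre-trained encoder, and it starts from random values. The sketch below illustrates why its logits carry no meaning before fine-tuning. Everything in it is an assumption for illustration: the hidden size of 768 matches BERT base, but the normal initialization with standard deviation 0.02, the zero bias, and the random `pooled` vector standing in for the encoder output are not read from the library.

```python
import random


def init_linear_head(hidden_size, num_labels, seed):
    """Randomly initialize a (num_labels x hidden_size) weight matrix and a zero bias."""
    rng = random.Random(seed)
    weight = [[rng.gauss(0.0, 0.02) for _ in range(hidden_size)]
              for _ in range(num_labels)]
    bias = [0.0] * num_labels
    return weight, bias


def apply_head(pooled, weight, bias):
    """Compute one logit per label: weight @ pooled + bias."""
    return [sum(w * x for w, x in zip(row, pooled)) + b
            for row, b in zip(weight, bias)]


hidden_size, num_labels = 768, 2
rng = random.Random(0)
pooled = [rng.uniform(-1.0, 1.0) for _ in range(hidden_size)]  # stand-in encoder output

logits_a = apply_head(pooled, *init_linear_head(hidden_size, num_labels, seed=1))
logits_b = apply_head(pooled, *init_linear_head(hidden_size, num_labels, seed=2))
print(len(logits_a), logits_a != logits_b)  # 2 True
```

The same input yields different logits for different random initializations, which is why the head has to be trained on a downstream task before its predictions can be used.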


#### Making Predictions

In [None]:
model_inputs = tokenizer(data, padding = True, return_tensors = 'pt')
model_inputs


{'input_ids': tensor([[  101,  8667,  2246,  1187,  1132,  1128,  1280,   136,   102],
        [  101, 11099,  1106,  1313,   119,   102,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0, 0, 0]])}

In [None]:
model_inputs['input_ids']

tensor([[  101,  8667,  2246,  1187,  1132,  1128,  1280,   136,   102],
        [  101, 11099,  1106,  1313,   119,   102,     0,     0,     0]])

In [None]:
model_inputs['attention_mask']

tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0, 0, 0]])

In [None]:
outputs = model(**model_inputs)  # ** unpacks the dict into keyword arguments (input_ids=..., attention_mask=..., ...)

In [None]:
outputs  # These logits are not meaningful yet: the classification head is untrained, and by default the model assumes two labels

SequenceClassifierOutput(loss=None, logits=tensor([[-0.3489,  0.0317],
        [-0.3528,  0.0035]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

### **Model Outputs**
* If you pass in N documents, you get back an N×K output, where K is the number of classes.
* If you pass in a single document, you get back a 1×K output, i.e. K logits.
* The outputs are logits, i.e. the values before softmax is applied.
* To get the class prediction, take the argmax over the K logits.
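
The last two points can be sketched in plain Python using the logits printed earlier. `softmax` and `argmax` here are small hypothetical helpers written for illustration; with PyTorch tensors you would normally use the framework's own functions instead.

```python
import math


def softmax(logits):
    """Turn K logits into K probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


logits = [[-0.3489, 0.0317], [-0.3528, 0.0035]]  # N=2 documents, K=2 classes
probabilities = [softmax(row) for row in logits]
predictions = [argmax(row) for row in logits]
print(predictions)  # [1, 1]
```

Because softmax preserves the ordering of the logits, taking the argmax of the logits gives the same class as taking the argmax of the probabilities.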


In [None]:
model_n= AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)
model_n

In [None]:
outputs_n=model_n(**model_inputs)
outputs_n

SequenceClassifierOutput(loss=None, logits=tensor([[-0.2118,  0.2101,  0.2063],
        [-0.0814,  0.3426,  0.2371]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [None]:
outputs.logits # Method1

tensor([[-0.3489,  0.0317],
        [-0.3528,  0.0035]], grad_fn=<AddmmBackward0>)

In [None]:
outputs['logits'] # Method2

tensor([[-0.3489,  0.0317],
        [-0.3528,  0.0035]], grad_fn=<AddmmBackward0>)

In [None]:
outputs[0] # Method3: access by index --> Not recommended

tensor([[-0.3489,  0.0317],
        [-0.3528,  0.0035]], grad_fn=<AddmmBackward0>)
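
Index access is discouraged because the position of a field depends on which fields are set: fields that are `None` are skipped, so when a loss is present `outputs[0]` is the loss rather than the logits. `ToyOutput` below is a hypothetical stand-in that mimics this behavior in a few lines; it is not the library's class, and the loss value 0.69 is made up.

```python
class ToyOutput:
    """Tiny stand-in supporting attribute, key, and index access."""

    def __init__(self, loss=None, logits=None):
        self.loss = loss
        self.logits = logits

    def to_tuple(self):
        # Only fields that are not None take part in positional access.
        return tuple(v for v in (self.loss, self.logits) if v is not None)

    def __getitem__(self, key):
        if isinstance(key, str):
            return getattr(self, key)  # outputs['logits']
        return self.to_tuple()[key]    # outputs[0]


logits = [[-0.3489, 0.0317], [-0.3528, 0.0035]]
without_loss = ToyOutput(logits=logits)
with_loss = ToyOutput(loss=0.69, logits=logits)

print(without_loss[0] is logits)      # True
print(with_loss[0])                   # 0.69
print(with_loss["logits"] is logits)  # True
```

Accessing the field by name (`outputs.logits` or `outputs['logits']`) returns the same thing in both cases, which is why it is the safer habit.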

In [None]:
outputs.logits.detach().cpu().numpy() # convert into numpy array

array([[-0.34888095,  0.03173423],
       [-0.35283846,  0.00347681]], dtype=float32)