-----------
**Author**: Gunnvant

**Description**: Examines the Models and Tokenizers API

-----------


## Tokenizers
Let's look at how the tokenizer for `distilbert-base-uncased-finetuned-sst-2-english` behaves. This checkpoint is a sequence classification model.

In [1]:
from transformers import AutoTokenizer



In [2]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [3]:
raw_input = ["This is a sample sentence."]
inputs = tokenizer(raw_input, padding=True, truncation=True, return_tensors="pt")
print(inputs)

{'input_ids': tensor([[ 101, 2023, 2003, 1037, 7099, 6251, 1012,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}


In [4]:
tokenizer.convert_ids_to_tokens(inputs['input_ids'].tolist()[0])

['[CLS]', 'this', 'is', 'a', 'sample', 'sentence', '.', '[SEP]']
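With a single sentence there is nothing to pad, so the attention mask above is all ones. As a rough sketch of what `padding=True` does for a batch of unequal lengths, here is a pure-Python illustration. The helper `pad_batch` is hypothetical (it is not part of `transformers`), and a pad id of 0 is assumed; the token ids are reused from the output above.

```python
def pad_batch(batch, pad_id=0):
    """Pad every id list to the longest one and build a matching attention mask.

    Hypothetical helper for illustration only; pad_id=0 is an assumption.
    """
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [
    [101, 2023, 2003, 1037, 7099, 6251, 1012, 102],  # "this is a sample sentence."
    [101, 2023, 2003, 1037, 6251, 1012, 102],        # same ids minus "sample"
]
padded = pad_batch(batch)
print(padded["input_ids"][1])
print(padded["attention_mask"][1])
```

The shorter sequence gets one trailing pad id, and its attention mask ends in a 0 so the model can ignore that position.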

Now try the tokenizer of a different model, for example a token classifier, and see whether the behaviour is any different.

In [5]:
checkpoint = 'dslim/bert-base-NER'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)



In [6]:
inputs = tokenizer(raw_input, padding=True, truncation=True, return_tensors='pt')
print(inputs)

{'input_ids': tensor([[ 101, 1188, 1110,  170, 6876, 5650,  119,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}


In [7]:
tokenizer.convert_ids_to_tokens(inputs['input_ids'].tolist()[0])

['[CLS]', 'This', 'is', 'a', 'sample', 'sentence', '.', '[SEP]']

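Observe that the two tokenizers assign different token ids to the same sentence. The second tokenizer is also cased (it keeps `This` capitalized) and returns an extra `token_type_ids` field. Hence one must always use the tokenizer that corresponds to the specific model one is trying to use.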
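To see why mixing tokenizers goes wrong, here is a small pure-Python sketch. The two lookup tables are tiny excerpts copied from the two printed outputs above, not the real vocabularies, and `decode` is a hypothetical helper written for this illustration.

```python
# Token -> id excerpts taken from the printed outputs of the two tokenizers.
uncased_vocab = {"[CLS]": 101, "this": 2023, "is": 2003, "a": 1037,
                 "sample": 7099, "sentence": 6251, ".": 1012, "[SEP]": 102}
cased_vocab = {"[CLS]": 101, "This": 1188, "is": 1110, "a": 170,
               "sample": 6876, "sentence": 5650, ".": 119, "[SEP]": 102}


def decode(ids, vocab):
    """Map ids back to tokens; ids missing from the excerpt become '[?]'."""
    id_to_token = {idx: tok for tok, idx in vocab.items()}
    return [id_to_token.get(idx, "[?]") for idx in ids]


uncased_ids = [101, 2023, 2003, 1037, 7099, 6251, 1012, 102]

# Decoding with the matching table recovers the sentence.
matched = decode(uncased_ids, uncased_vocab)
# Decoding with the other table does not: only [CLS] and [SEP] share ids here.
# (In the full vocabulary these ids would point at unrelated tokens rather
# than at nothing; the excerpt simply does not contain them.)
mismatched = decode(uncased_ids, cased_vocab)

print(matched)
print(mismatched)
```

A model only ever sees the ids, so feeding it ids produced by another model's tokenizer amounts to feeding it a different, meaningless sentence.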

## Models

Let's now use the different model classes to test both models, with and without a head.

- Let's first test the `distilbert-base-uncased-finetuned-sst-2-english` model and see how it varies with and without a head.
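Before loading the real checkpoints, the head vs no-head distinction can be sketched in plain Python. This is a toy illustration with random weights, not the library's implementation: the hidden size is shrunk from 768 to 4, the helpers `random_matrix` and `linear` are hypothetical, and the head is reduced to a single linear layer (a real classification head may contain more layers).

```python
import random

random.seed(0)

HIDDEN = 4      # stand-in for the model's 768-dimensional hidden size
SEQ_LEN = 8     # eight tokens, as in the tokenized sample sentence
NUM_LABELS = 2  # a 2-class sentiment task


def random_matrix(rows, cols):
    """Build a rows x cols matrix of random floats (placeholder weights)."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def linear(vec, weight, bias):
    """Apply y = W x + b to a single vector."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


# No head: the model returns one hidden vector per input token.
hidden_states = random_matrix(SEQ_LEN, HIDDEN)

# With a head: a small classifier reads the first token's vector ([CLS])
# and maps it to one score (logit) per label.
head_weight = random_matrix(NUM_LABELS, HIDDEN)
head_bias = [0.0] * NUM_LABELS
logits = linear(hidden_states[0], head_weight, head_bias)

print(len(hidden_states), len(hidden_states[0]))  # sequence length, hidden size
print(len(logits))                                # number of labels
```

The point is only the shapes: without a head you get a vector per token, and with a sequence classification head you get one score per class for the whole sentence.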

In [8]:
from transformers import AutoModel
checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
model = AutoModel.from_pretrained(checkpoint)

In [9]:
model ## no-head

DistilBertModel(
  (embeddings): Embeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer): Transformer(
    (layer): ModuleList(
      (0-5): 6 x TransformerBlock(
        (attention): MultiHeadSelfAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn): FFN(
          (dropout): Dropout(p=0.1, inplace=False)
          (lin1): Linear(in_features=768, out_features=3072, bias=True)
          (lin2): Li

In [10]:
## Now let's load the same checkpoint with AutoModelForSequenceClassification
from transformers import AutoModelForSequenceClassification

In [11]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

In [12]:
model ## classification head with 2 class output

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 