# Intro

Experiments with the DistilBERT model (`distilbert-base-uncased`).

In [1]:
import torch
import transformers


In [2]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cpu')

In [3]:
model_ckpt = "distilbert-base-uncased"

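First, the pipeline way of doing things:

- Create a pipeline for text classification
- Pass some text into the pipeline
- Print the output (the scores carry no meaning here, because the classification head on top of the base checkpoint is randomly initialized; see the warning in the output)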

In [28]:
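from transformers import pipeline

classifier = pipeline(
    task="text-classification",
    model=model_ckpt,
    dtype=torch.float16,
    device=device,  # reuse the device selected above instead of hard-coding GPU 0
    return_all_scores=True,  # deprecated in recent releases in favor of top_k=None
)

# Returns one score per label for the input sentence.
result = classifier("I love using Hugging Face Transformers!")
print(result)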

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


[[{'label': 'LABEL_0', 'score': 0.4764806926250458}, {'label': 'LABEL_1', 'score': 0.5235193371772766}]]
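
A quick sanity check on these numbers, as a minimal sketch: it assumes the pipeline applied a softmax over the two labels (its default for a two-label, single-label head). Under that assumption the scores sum to 1, and the log-odds give the gap between the two logits. A gap close to 0 is what an untrained head would be expected to produce.

```python
import math

# Scores copied from the pipeline output above.
scores = {"LABEL_0": 0.4764806926250458, "LABEL_1": 0.5235193371772766}

# Softmax probabilities sum to 1 (up to float rounding).
total = sum(scores.values())

# For a two-class softmax, the log-odds equal the difference of the logits.
logit_gap = math.log(scores["LABEL_1"] / scores["LABEL_0"])

print(f"sum of scores: {total:.6f}")
print(f"logit gap (LABEL_1 - LABEL_0): {logit_gap:.4f}")
```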


Now let's do it the PyTorch way:

- Load the model
- Feed some tokens into the model
- Compute the output
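
The three steps above can be tried out without downloading anything. This is a minimal sketch, assuming a tiny, randomly initialized DistilBERT built from a `DistilBertConfig` rather than the real checkpoint. The sizes and the names `tiny_model`, `toy_ids`, `toy_mask`, and `toy_out` are hypothetical, chosen only to keep the example fast.

```python
import torch
from transformers import DistilBertConfig, DistilBertModel

# Hypothetical toy sizes; distilbert-base-uncased is much larger.
config = DistilBertConfig(
    vocab_size=100,
    max_position_embeddings=16,
    n_layers=2,
    n_heads=4,
    dim=32,
    hidden_dim=64,
)

# 1. Build the model (random weights here, instead of from_pretrained).
tiny_model = DistilBertModel(config).eval()

# 2. Make up some token ids: a batch of one sequence with 7 tokens.
toy_ids = torch.randint(low=1, high=config.vocab_size, size=(1, 7))
toy_mask = torch.ones_like(toy_ids)

# 3. Compute the output without tracking gradients.
with torch.no_grad():
    toy_out = tiny_model(input_ids=toy_ids, attention_mask=toy_mask)

print(toy_out.last_hidden_state.shape)
```

With the real checkpoint, step 1 becomes `from_pretrained(model_ckpt)` and step 2 is done by the tokenizer.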

In [4]:
from transformers import AutoModel

model = (AutoModel
         .from_pretrained(model_ckpt)
         .to(device))
model

DistilBertModel(
  (embeddings): Embeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer): Transformer(
    (layer): ModuleList(
      (0-5): 6 x TransformerBlock(
        (attention): DistilBertSdpaAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn): FFN(
          (dropout): Dropout(p=0.1, inplace=False)
          (lin1): Linear(in_features=768, out_features=3072, bias=True)
          (lin2): L

In [44]:
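tokenizer = transformers.AutoTokenizer.from_pretrained(model_ckpt)  # assumed; the defining cell is not shown
text = "some funny text for testing"  # assumed input, inferred from the ids below
tks = tokenizer(text, return_tensors="pt")
tks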

{'input_ids': tensor([[ 101, 2070, 6057, 3793, 2005, 5604,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}

In [52]:
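with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**tks.to(device))  # inputs must live on the same device as the model
outputs.last_hidden_state.size()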

torch.Size([1, 7, 768])

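The shape reads as [batch, tokens, hidden size]: one hidden-state vector of size 768 for each of the 7 tokens of the single input sentence. The 7 tokens include the special tokens `[CLS]` (id 101) and `[SEP]` (id 102) that the tokenizer adds.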
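
To get a single vector per sentence out of such a tensor, two common options are taking the vector of the first token or averaging over the tokens. The following is a minimal sketch that uses random stand-in tensors with the same shapes as above, not the real model outputs. The names `hidden`, `mask`, `cls_vec`, and `mean_vec` are hypothetical.

```python
import torch

# Stand-in tensors with the same shapes as the real outputs above.
hidden = torch.randn(1, 7, 768)  # [batch, tokens, hidden size]
mask = torch.ones(1, 7)          # attention mask, 1 = real token

# Vector of the first token ([CLS], id 101 in the input above).
cls_vec = hidden[:, 0]

# Mean over tokens, ignoring any masked (padding) positions.
weights = mask.unsqueeze(-1)  # [batch, tokens, 1]
mean_vec = (hidden * weights).sum(dim=1) / weights.sum(dim=1)

print(cls_vec.shape, mean_vec.shape)
```

Both results have shape [1, 768]. Which one works better depends on the downstream task.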

In [6]:
torch.save(model.state_dict(), 'model.pt')

Save the weights as above, then open `model.pt` in https://netron.app/ to visualize them.
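
A state dict holds only the tensors, keyed by parameter name, so loading it back needs a model of the same architecture. The round trip can be sketched as follows, assuming a small `nn.Linear` as a hypothetical stand-in for the DistilBERT model and a temporary file instead of `model.pt`.

```python
import os
import tempfile

import torch
from torch import nn

# Hypothetical stand-in for the DistilBERT model, to keep the example small.
toy = nn.Linear(4, 2)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.pt")
    torch.save(toy.state_dict(), path)
    # weights_only=True restricts unpickling to tensors and plain containers.
    state = torch.load(path, weights_only=True)

# Rebuild the same architecture, then copy the saved tensors into it.
restored = nn.Linear(4, 2)
restored.load_state_dict(state)

print(list(state.keys()))
```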