<a href="https://colab.research.google.com/github/Matonice/30-Days-of-Transformer/blob/main/Using_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%%capture
!pip install transformers
!pip install torch
!pip install sentencepiece

**Preprocessing with a tokenizer**

In [None]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

input_text = "I really love working with transformers models"
inputs = tokenizer(input_text, padding=True, truncation=True, return_tensors="pt")
inputs

{'input_ids': tensor([[  101,  1045,  2428,  2293,  2551,  2007, 19081,  4275,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])}
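
With a single sentence, `padding=True` has nothing to pad, which is why the attention mask above is all ones. The following is a minimal pure-Python sketch of what padding a batch amounts to; it is not the library's implementation. `pad_batch` is a hypothetical helper, the second sequence is made up from token ids that appear in the output above, and the pad id of 0 is an assumption based on the BERT uncased vocabulary.

```python
# Illustrative sketch only: mimics the effect of padding=True on a batch.
PAD_ID = 0  # assumed [PAD] id for the BERT uncased vocabulary


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([
    [101, 1045, 2428, 2293, 2551, 2007, 19081, 4275, 102],  # ids from the output above
    [101, 1045, 2293, 4275, 102],  # hypothetical shorter sequence ("i love models")
])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```

The shorter sequence is extended with pad ids, and its attention mask is 0 at exactly those positions.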

**Passing the inputs into a model**

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.logits.shape)

torch.Size([1, 2])


**Postprocessing the output**

In [None]:
print(outputs.logits)

tensor([[-3.1534,  3.2794]], grad_fn=<AddmmBackward0>)


In [None]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)
model.config.id2label

tensor([[0.0016, 0.9984]], grad_fn=<SoftmaxBackward0>)


{0: 'NEGATIVE', 1: 'POSITIVE'}
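
To make the softmax step less of a black box, here is a small pure-Python sketch that recomputes the probabilities from the logits printed above and looks up the winning label. The `softmax` helper is written for illustration and is not the PyTorch implementation; the logits and the label mapping are copied from the cell outputs.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-3.1534, 3.2794]                 # values printed above
id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # mapping printed above

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
print([round(p, 4) for p in probs])
print(id2label[best])
```

The rounded probabilities agree with the tensor printed by `torch.nn.functional.softmax`, and the larger one sits at index 1, which `id2label` maps to `POSITIVE`.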

**Loading a pretrained Transformer model**

In [None]:
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased")

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


**Saving a model**

In [None]:
model.save_pretrained("my_first_bert_model")

**Loading and saving a tokenizer**

In [None]:
from transformers import BertTokenizer

# Load the tokenizer (not the model) that matches the checkpoint
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")


In [None]:
tokenizer.save_pretrained("my_first_bert_tokenizer")

**What happens inside the tokenizer**

In [None]:
# 1) Tokenization: split the text into subword tokens
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokens = tokenizer.tokenize(input_text)
tokens

['i', 'really', 'love', 'working', 'with', 'transformers', 'models']

In [None]:
# 2) Tokens to ids: look up each token in the vocabulary
ids = tokenizer.convert_tokens_to_ids(tokens)
ids

[1045, 2428, 2293, 2551, 2007, 19081, 4275]

In [None]:
# 3) Decoding: map the ids back to text
decoded_string = tokenizer.decode(ids)
decoded_string

'i really love working with transformers models'
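
The ids from step 2 are shorter than the `input_ids` returned in the first cell: the leading 101 and trailing 102 are missing. Those are the `[CLS]` and `[SEP]` special tokens, which are added when the tokenizer is called directly but not by `tokenize` or `convert_tokens_to_ids`. A minimal pure-Python sketch of that last step follows; `add_special_tokens` is a hypothetical helper written for illustration, not a library function.

```python
# Illustrative sketch: how the ids from step 2 relate to the `input_ids`
# returned by calling the tokenizer directly.
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] ids, as seen in the first output

ids = [1045, 2428, 2293, 2551, 2007, 19081, 4275]  # output of step 2


def add_special_tokens(token_ids):
    """Wrap a single sequence as [CLS] ... [SEP] (hypothetical helper)."""
    return [CLS_ID] + token_ids + [SEP_ID]


input_ids = add_special_tokens(ids)
attention_mask = [1] * len(input_ids)  # no padding, so every position is attended to
print(input_ids)
print(attention_mask)
```

The result has the same nine ids and all-ones mask as the tensors produced in the preprocessing cell.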

**Putting it all together**

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

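# Tokenize, run the model, and turn the logits into a labelled prediction
inputs = tokenizer(input_text, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():  # inference only, so skip gradient tracking
    outputs = model(**inputs)

probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
label = model.config.id2label[probs.argmax(dim=-1).item()]
print(label, probs.max().item())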