<a href="https://colab.research.google.com/github/Biswajitjuee/HuggingFace_Fastai_course/blob/main/fastai%2Bhuggingface_session_2_Using_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Behind the pipeline (PyTorch)

Install the Transformers and Datasets libraries to run this notebook.

In [3]:
! pip install -qq datasets transformers[sentencepiece]


In [4]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier([
    "I've been waiting for a HuggingFace course my whole life.", 
    "I hate this so much!",
])
print(classifier.model.name_or_path)

distilbert-base-uncased-finetuned-sst-2-english


### Step 1: Tokenize

In [5]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

In [6]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [7]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.", 
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
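
The output above shows two effects of `padding=True`: the shorter sentence is filled with `0` up to the length of the longer one, and `attention_mask` marks real tokens with `1` and padding with `0`. As a rough illustration only (plain Python, not how the tokenizer is implemented), the same batch can be rebuilt from the unpadded token IDs. The helper name `pad_batch` is hypothetical, and the pad ID `0` is simply read off the printed tensors.

```python
# Illustrative sketch: rebuild the padded batch and attention mask shown above.
# `pad_batch` is a hypothetical helper, not part of the Transformers API.

def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the longest one and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

# Unpadded token IDs copied from the tokenizer output above
sentence_1 = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
              2607, 2026, 2878, 2166, 1012, 102]
sentence_2 = [101, 1045, 5223, 2023, 2061, 2172, 999, 102]

input_ids, attention_mask = pad_batch([sentence_1, sentence_2])
print(input_ids[1])
print(attention_mask[1])
```

The mask matters because the model should not attend to the padded positions; passing `**inputs` to the model forwards both tensors together.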


### Step 2: Run inputs through model

In [8]:
from transformers import AutoModel

model = AutoModel.from_pretrained(checkpoint)

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [9]:
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([2, 16, 768])
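
The shape reads as (batch size, sequence length, hidden size): 2 sentences, 16 token positions each, and one 768-dimensional vector per token. A classification head has to reduce this to one score per label for each sentence. The toy sketch below shows that reduction in plain Python, assuming the head keeps only the first token's vector and passes it through the `pre_classifier` and `classifier` layers named in the warning above. All sizes and weights here are made up for readability, and dropout is left out.

```python
# Toy sketch of a sequence-classification head (all sizes and weights are made up).
# Real DistilBERT uses hidden size 768; here the hidden size is 3 to keep it readable.

def linear(x, weight, bias):
    """Compute weight @ x + bias for one vector x (weight has shape [out][in])."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

# Fake "last hidden state": 2 sentences x 4 tokens x hidden size 3
hidden = [
    [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5], [0.0, 0.0, 0.0], [0.2, 0.1, 0.0]],
    [[-1.0, 2.0, 0.0], [0.1, 0.1, 0.1], [0.3, 0.0, 0.3], [0.0, 0.0, 0.0]],
]

# Hypothetical weights: pre_classifier (3 -> 3) and classifier (3 -> 2)
pre_w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pre_b = [0.0, 0.0, 0.0]
cls_w = [[1.0, -1.0, 0.0], [-1.0, 1.0, 0.0]]
cls_b = [0.0, 0.0]

logits = []
for sentence in hidden:
    pooled = sentence[0]                         # keep only the first token's vector
    pooled = relu(linear(pooled, pre_w, pre_b))  # pre_classifier followed by ReLU
    logits.append(linear(pooled, cls_w, cls_b))  # one raw score per label

print(len(logits), len(logits[0]))  # batch size, number of labels
print(logits)
```

The result has one row per sentence and one column per label, which is why a classification model returns logits of shape `[2, 2]` for this two-sentence batch.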


In [10]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)

In [11]:
print(model)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
       

In [12]:
print(outputs.logits.shape)
print(outputs.logits)

torch.Size([2, 2])
tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward>)


### Step 3: Process outputs

In [13]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward>)


In [14]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}
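
Putting Step 3 together: softmax turns each row of logits into probabilities that sum to 1, and `id2label` names the index with the highest probability. The sketch below redoes this in plain Python with the logits and label mapping printed above, so it runs without downloading the model. The `label`/`score` dictionary layout is only an assumption chosen to resemble what a sentiment pipeline returns.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits and label mapping copied from the outputs above
logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

results = []
for row in logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)  # index of the highest probability
    results.append({"label": id2label[best], "score": round(probs[best], 4)})

print(results)
```

The first sentence comes out as `POSITIVE` and the second as `NEGATIVE`, in line with the probabilities printed by `torch.nn.functional.softmax`.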

## Behind the pipeline (Blurr)

### Step 1: Prepare inputs

In [23]:
!pip install -qq fastai
!pip install -qq ohmeow-blurr



In [24]:
from fastai.text.all import * 
from blurr.utils import *
from blurr.data.core import *
from blurr.modeling.core import *

In [20]:
from fastai import *

In [18]:
!pip install fastai --upgrade

Collecting fastai
  Downloading fastai-2.5.2-py3-none-any.whl (186 kB)
Collecting fastcore<1.4,>=1.3.8
  Downloading fastcore-1.3.26-py3-none-any.whl (56 kB)
Collecting fastdownload<2,>=0.0.5
  Downloading fastdownload-0.0.5-py3-none-any.whl (13 kB)
Installing collected packages: fastcore, fastdownload, fastai
  Attempting uninstall: fastai
    Found existing installation: fastai 1.0.61
    Uninstalling fastai-1.0.61:
      Successfully uninstalled fastai-1.0.61
Successfully installed fastai-2.5.2 fastcore-1.3.26 fastdownload-0.0.5


In [25]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

In [27]:
path = untar_data(URLs.IMDB_SAMPLE)
path.ls()
imdb_df = pd.read_csv(path/'texts.csv')

imdb_df.head()

Unnamed: 0,label,text,is_valid
0,negative,"Un-bleeping-believable! Meg Ryan doesn't even look her usual pert lovable self in this, which normally makes me forgive her shallow ticky acting schtick. Hard to believe she was the producer on this dog. Plus Kevin Kline: what kind of suicide trip has his career been on? Whoosh... Banzai!!! Finally this was directed by the guy who did Big Chill? Must be a replay of Jonestown - hollywood style. Wooofff!",False
1,positive,"This is a extremely well-made film. The acting, script and camera-work are all first-rate. The music is good, too, though it is mostly early in the film, when things are still relatively cheery. There are no really superstars in the cast, though several faces will be familiar. The entire cast does an excellent job with the script.<br /><br />But it is hard to watch, because there is no good end to a situation like the one presented. It is now fashionable to blame the British for setting Hindus and Muslims against each other, and then cruelly separating them into two countries. There is som...",False
2,negative,"Every once in a long while a movie will come along that will be so awful that I feel compelled to warn people. If I labor all my days and I can save but one soul from watching this movie, how great will be my joy.<br /><br />Where to begin my discussion of pain. For starters, there was a musical montage every five minutes. There was no character development. Every character was a stereotype. We had swearing guy, fat guy who eats donuts, goofy foreign guy, etc. The script felt as if it were being written as the movie was being shot. The production value was so incredibly low that it felt li...",False
3,positive,"Name just says it all. I watched this movie with my dad when it came out and having served in Korea he had great admiration for the man. The disappointing thing about this film is that it only concentrate on a short period of the man's life - interestingly enough the man's entire life would have made such an epic bio-pic that it is staggering to imagine the cost for production.<br /><br />Some posters elude to the flawed characteristics about the man, which are cheap shots. The theme of the movie ""Duty, Honor, Country"" are not just mere words blathered from the lips of a high-brassed offic...",False
4,negative,"This movie succeeds at being one of the most unique movies you've seen. However this comes from the fact that you can't make heads or tails of this mess. It almost seems as a series of challenges set up to determine whether or not you are willing to walk out of the movie and give up the money you just paid. If you don't want to feel slighted you'll sit through this horrible film and develop a real sense of pity for the actors involved, they've all seen better days, but then you realize they actually got paid quite a bit of money to do this and you'll lose pity for them just like you've alr...",False


In [None]:
hf_arch, hf_config, hf_tokenizer, hf_model = BLURR.get_hf_objects(checkpoint, model_cls=AutoModelForSequenceClassification)

print(hf_arch)
print(type(hf_config))
print(type(hf_tokenizer))
print(type(hf_model))

distilbert
<class 'transformers.models.distilbert.configuration_distilbert.DistilBertConfig'>
<class 'transformers.models.distilbert.tokenization_distilbert_fast.DistilBertTokenizerFast'>
<class 'transformers.models.distilbert.modeling_distilbert.DistilBertForSequenceClassification'>


In [None]:
# single input
blocks = (HF_TextBlock(hf_arch, hf_config, hf_tokenizer, hf_model, max_length=128, padding=True, truncation=True), CategoryBlock)
dblock = DataBlock(blocks=blocks, get_x=ColReader('text'), get_y=ColReader('label'), splitter=ColSplitter())

In [None]:
dls = dblock.dataloaders(imdb_df, bs=4)

In [None]:
dls.show_batch(dataloaders=dls, max_n=2)

Unnamed: 0,text,category
0,"raising victor vargas : a review < br / > < br / > you know, raising victor vargas is like sticking your hands into a big, steaming bowl of oatmeal. it's warm and gooey, but you're not sure if it feels right. try as i might, no matter how warm and gooey raising victor vargas became i was always aware that something didn't quite feel right. victor vargas suffers from a certain overconfidence on the director's part. apparently, the director thought that the ethnic backdrop of a latino family on the lower east side, and an idyllic",negative
1,"i had read many good things about this adaptation of my favorite novel... so invariably my expectations were crushed. but they were crushed more than should be expected. the movie would have been a decent movie if i had not read the novel beforehand, which perhaps ruined it for me. < br / > < br / > in any event, for some reason they changed the labor camp at toulon to a ship full of galley slaves. the scene at bishop myriel's was fine. in fact, other than the galleys, things survived up until the dismissal of fantine. because we don't want to have bad",negative


In [None]:
xb, yb = dls.one_batch()

In [None]:
xb

{'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          

In [None]:
len(xb), xb['input_ids'].shape, xb['attention_mask'].shape, len(xb['input_ids']), yb.shape

(2, torch.Size([4, 128]), torch.Size([4, 128]), 4, torch.Size([4]))

### Step 2: Run inputs through model

In [None]:
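# The batch from `dls` is on the GPU here (see device='cuda:0' in the output below),
# so move the model to the same device before the forward pass.
hf_model.cuda()
outputs = hf_model(**xb)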

In [None]:
print(outputs.logits.shape)
print(outputs.logits)

torch.Size([4, 2])
tensor([[-1.0525,  1.2515],
        [ 3.9893, -3.3104],
        [ 0.1730, -0.0150],
        [-1.4008,  1.4509]], device='cuda:0', grad_fn=<AddmmBackward>)


### Step 3: Process outputs

In [None]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[9.0788e-02, 9.0921e-01],
        [9.9932e-01, 6.7533e-04],
        [5.4686e-01, 4.5314e-01],
        [5.4596e-02, 9.4540e-01]], device='cuda:0', grad_fn=<SoftmaxBackward>)


In [None]:
# Use the Blurr-loaded model here; `model` belongs to the PyTorch section above
hf_model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

### Bonus: Using a Blurr Learner to inspect validation results and run inference

In [None]:
model = HF_BaseModelWrapper(hf_model)

learn = Learner(dls, 
                model,
                opt_func=partial(OptimWrapper, opt=torch.optim.Adam),
                loss_func=CrossEntropyLossFlat(),
                metrics=[accuracy],
                cbs=[HF_BaseModelCallback],
                splitter=hf_splitter)

learn.freeze()

In [None]:
learn.show_results(learner=learn, max_n=2, trunc_at=500)

Unnamed: 0,text,category,target
0,"the trouble with the book, "" memoirs of a geisha "" is that it had japanese surfaces but underneath the surfaces it was all an american man's way of thinking. reading the book is like watching a magnificent ballet with great music, sets, and costumes yet performed by barnyard animals dressed in those costumesso far from japanese ways of thinking were the characters. < br / > < br / > the movie isn't about japan or real geisha. it is a story about a few american men's mistaken ideas about japan an",negative,negative
1,"< br / > < br / > i'm sure things didn't exactly go the same way in the real life of homer hickam as they did in the film adaptation of his book, rocket boys, but the movie "" october sky "" ( an anagram of the book's title ) is good enough to stand alone. i have not read hickam's memoirs, but i am still able to enjoy and understand their film adaptation. the film, directed by joe johnston and written by lewis colick, records the story of teenager homer hickam ( jake gyllenhaal ), beginning in oct",positive,positive


In [None]:
learn.blurr_predict([
    "I've been waiting for a HuggingFace course my whole life.", 
    "I hate this so much!",
])

[(('positive',), (#1) [tensor(1)], (#1) [tensor([0.0402, 0.9598])]),
 (('negative',), (#1) [tensor(0)], (#1) [tensor([9.9946e-01, 5.4418e-04])])]

## Models

In [None]:
!mkdir -p 'my_model'

In [None]:
learn.model.hf_model.save_pretrained('my_model')
hf_tokenizer.save_pretrained('my_model')

('my_model/tokenizer_config.json',
 'my_model/special_tokens_map.json',
 'my_model/vocab.txt',
 'my_model/added_tokens.json',
 'my_model/tokenizer.json')

In [None]:
hf_model is learn.model.hf_model

True

In [None]:
!ls -lsha 'my_model'

total 257M
4.0K drwxr-xr-x 2 root root 4.0K Jul 18 15:44 .
4.0K drwxr-xr-x 1 root root 4.0K Jul 18 15:44 ..
4.0K -rw-r--r-- 1 root root  734 Jul 18 18:03 config.json
256M -rw-r--r-- 1 root root 256M Jul 18 18:03 pytorch_model.bin
4.0K -rw-r--r-- 1 root root  112 Jul 18 18:03 special_tokens_map.json
4.0K -rw-r--r-- 1 root root  405 Jul 18 18:03 tokenizer_config.json
456K -rw-r--r-- 1 root root 456K Jul 18 18:03 tokenizer.json
228K -rw-r--r-- 1 root root 227K Jul 18 18:03 vocab.txt


In [None]:
hf_arch2, hf_config2, hf_tokenizer2, hf_model2 = BLURR.get_hf_objects('my_model', model_cls=AutoModelForSequenceClassification)

# Inspect the objects reloaded from 'my_model', not the originals
print(hf_arch2)
print(type(hf_config2))
print(type(hf_tokenizer2))
print(type(hf_model2))

distilbert
<class 'transformers.models.distilbert.configuration_distilbert.DistilBertConfig'>
<class 'transformers.models.distilbert.tokenization_distilbert_fast.DistilBertTokenizerFast'>
<class 'transformers.models.distilbert.modeling_distilbert.DistilBertForSequenceClassification'>
