# Hugging Face

Your go-to tool for working with pretrained models.

## import

In [1]:
import transformers
transformers.__version__

'4.38.2'

In [2]:
# !pip3 install evaluate

In [3]:
import evaluate # metrics 
evaluate.__version__

'0.4.1'

In [4]:
import datasets
datasets.__version__

'2.16.1'

In [5]:
# !pip3 install accelerate

In [6]:
import accelerate
accelerate.__version__

'0.27.2'

## 1. Pipeline

The pipeline is the simplest entry point in Hugging Face: you pick a task and a pretrained model, then call it directly for inference.

### sentiment analysis

In [7]:
from transformers import pipeline

clf = pipeline("sentiment-analysis", model = "distilbert-base-uncased-finetuned-sst-2-english")
clf("I love hugging face so much")

[{'label': 'POSITIVE', 'score': 0.9998350143432617}]

### zero shot classification

In [9]:
clf = pipeline("zero-shot-classification", model = "facebook/bart-large-mnli")
# mnli = Multi-Genre Natural Language Inference (the dataset used for fine-tuning) # bart >> encoder+decoder
clf("This is NLP course on Huggingface", candidate_labels = ['education', 'tech', 'sports'])

{'sequence': 'This is NLP course on Huggingface',
 'labels': ['education', 'tech', 'sports'],
 'scores': [0.6124476194381714, 0.3685862720012665, 0.018966125324368477]}

The label 'education' has the highest score (0.61), so it is the predicted label.
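
As a small sketch of how to read this result, the snippet below copies the dictionary from the output above and picks the label with the highest score. It also sums the scores: with the pipeline's default single-label setting, the scores are normalized across the candidate labels, so they should add up to roughly 1.

```python
# Result dictionary copied from the zero-shot output above.
result = {
    "sequence": "This is NLP course on Huggingface",
    "labels": ["education", "tech", "sports"],
    "scores": [0.6124476194381714, 0.3685862720012665, 0.018966125324368477],
}

# Pair each label with its score and keep the pair with the highest score.
best_label, best_score = max(
    zip(result["labels"], result["scores"]), key=lambda pair: pair[1]
)

# The scores are normalized over the candidate labels.
total = sum(result["scores"])

print(best_label, round(best_score, 2))
print(round(total, 4))
```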

### text generation

In [10]:
gen = pipeline('text-generation', model = "distilgpt2")
# gpt2 >> from OpenAI
# distil >> a smaller student model trained to mimic gpt2's outputs >> this process is called (knowledge) distillation
gen("AI is transforming our everyday lives", max_length = 100, num_return_sequences = 2)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'AI is transforming our everyday lives and creating a new kind of entertainment. Our children are learning and learning to love, care for and care for others, support, and support our family.\n\n\n\n\n\n\n\n\nAll is one for all.\n'},
 {'generated_text': "AI is transforming our everyday lives, both economic, physical and life, as we are able to do because of our unique skill sets. It's time that we start to work smarter and more effectively at work.\n\n\n\n\n[email protected]\n\nFor more information about PAMPED in the UK please visit www.pamPAPED.co.uk. Your views are welcome.\nImage: Image from Flickr."}]

### fill mask

In [12]:
mlm = pipeline('fill-mask',  model = 'distilroberta-base')
# roberta - same architecture as BERT, pretrained longer on more data with masked language modeling only (dynamic masking, no next-sentence prediction)
# distil - distilled version of roberta
mlm('Chaky loves to teach deep <mask>', top_k = 3)

Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.14332418143749237,
  'token': 2239,
  'token_str': ' learning',
  'sequence': 'Chaky loves to teach deep learning'},
 {'score': 0.0965745821595192,
  'token': 9589,
  'token_str': ' breathing',
  'sequence': 'Chaky loves to teach deep breathing'},
 {'score': 0.0719672366976738,
  'token': 41711,
  'token_str': ' breaths',
  'sequence': 'Chaky loves to teach deep breaths'}]

### question answering

In [13]:
qa = pipeline('question-answering', model = 'distilbert-base-cased-distilled-squad')
# squad = Stanford Question Answering Dataset, a standard benchmark for extractive QA
qa(question = "Where do Chaky work?", context = "My name is Chaky and I love to teach at AIT")

{'score': 0.9753952622413635, 'start': 40, 'end': 43, 'answer': 'AIT'}
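
The `start` and `end` values in this output are character offsets into the context string, so the answer can be recovered by slicing. A minimal sketch, with the context and result copied from the cell above:

```python
# Context and result copied from the question-answering cell above.
context = "My name is Chaky and I love to teach at AIT"
result = {"score": 0.9753952622413635, "start": 40, "end": 43, "answer": "AIT"}

# Slice the context with the returned character offsets.
span = context[result["start"]:result["end"]]

print(span)
```

This is what makes the model extractive: it selects a span of the context rather than generating new text.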

### gender bias

In [15]:
mlm = pipeline("fill-mask", model = 'distilroberta-base')
result = mlm("This man works as a <mask>")
result

[{'score': 0.08593502640724182,
  'token': 37171,
  'token_str': ' courier',
  'sequence': 'This man works as a courier'},
 {'score': 0.08269880712032318,
  'token': 28894,
  'token_str': ' translator',
  'sequence': 'This man works as a translator'},
 {'score': 0.05225023254752159,
  'token': 38233,
  'token_str': ' waiter',
  'sequence': 'This man works as a waiter'},
 {'score': 0.051457736641168594,
  'token': 8298,
  'token_str': ' consultant',
  'sequence': 'This man works as a consultant'},
 {'score': 0.037423163652420044,
  'token': 33080,
  'token_str': ' bartender',
  'sequence': 'This man works as a bartender'}]

In [16]:
result = mlm("This woman works as a <mask>")
result

[{'score': 0.10744372010231018,
  'token': 35698,
  'token_str': ' waitress',
  'sequence': 'This woman works as a waitress'},
 {'score': 0.08695955574512482,
  'token': 28894,
  'token_str': ' translator',
  'sequence': 'This woman works as a translator'},
 {'score': 0.06901882588863373,
  'token': 9008,
  'token_str': ' nurse',
  'sequence': 'This woman works as a nurse'},
 {'score': 0.06353699415922165,
  'token': 36289,
  'token_str': ' prostitute',
  'sequence': 'This woman works as a prostitute'},
 {'score': 0.04852951318025589,
  'token': 33080,
  'token_str': ' bartender',
  'sequence': 'This woman works as a bartender'}]
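
To make the difference between the two outputs easier to see, the sketch below compares the top-5 predictions for each prompt. The word lists are copied by hand from the outputs above (leading spaces stripped); the variable names are only illustrative.

```python
# Top-5 predictions copied from the two fill-mask outputs above.
man_top5 = ["courier", "translator", "waiter", "consultant", "bartender"]
woman_top5 = ["waitress", "translator", "nurse", "prostitute", "bartender"]

# Occupations predicted for both prompts, and those predicted for only one.
shared = [w for w in man_top5 if w in woman_top5]
only_man = [w for w in man_top5 if w not in woman_top5]
only_woman = [w for w in woman_top5 if w not in man_top5]

print("shared:    ", shared)
print("only man:  ", only_man)
print("only woman:", only_woman)
```

Only two of the five predictions are shared; the rest change with a single word in the prompt, which reflects associations the model picked up from its training data.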

That's the idea of a pipeline.
On the Hugging Face Hub:
- search for a pipeline task and the models that support it

## 2. Tokenization

The first component of the pipeline.

In [17]:
from transformers import AutoTokenizer 

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english" # the model we want: DistilBERT (uncased) fine-tuned on SST-2, an English sentiment dataset
tokenizer  = AutoTokenizer.from_pretrained(checkpoint) # load the same tokenizer that the pretrained model (checkpoint) was trained with

In [18]:
raw_inputs = ['Chaky has been waiting in queue for sushi',
              'Huggingface can do lots of stuffs so make sure you try everything']

In [20]:
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors ='pt') # 'pt' = PyTorch tensors ('tf' = TensorFlow, 'np' = NumPy)
inputs

{'input_ids': tensor([[  101, 15775,  4801,  2038,  2042,  3403,  1999, 24240,  2005, 10514,
          6182,   102,     0,     0,     0,     0],
        [  101, 17662, 12172,  2064,  2079,  7167,  1997,  4933,  2015,  2061,
          2191,  2469,  2017,  3046,  2673,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [22]:
tokenizer.decode([  101, 15775,  4801,  2038,  2042,  3403,  1999, 24240,  2005, 10514,
          6182,   102,     0,     0,     0,     0])

'[CLS] chaky has been waiting in queue for sushi [SEP] [PAD] [PAD] [PAD] [PAD]'

In [23]:
tokenizer.decode([  101, 17662, 12172,  2064,  2079,  7167,  1997,  4933,  2015,  2061,
          2191,  2469,  2017,  3046,  2673,   102])

'[CLS] huggingface can do lots of stuffs so make sure you try everything [SEP]'

Note: 

'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
                          [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])

In the first sequence, the trailing 0, 0, 0, 0 mark padding positions, so the model does not need to attend to them.
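
The following is a minimal pure-Python sketch of what `padding=True` does conceptually: pad every sequence to the length of the longest one and mark real tokens with 1 and padding with 0. The helper `pad_batch` is hypothetical and written only for illustration; it is not part of the tokenizer API. The token ids are copied from the output above.

```python
# Token ids copied from the tokenizer output above, with the padding removed.
sequences = [
    [101, 15775, 4801, 2038, 2042, 3403, 1999, 24240, 2005, 10514, 6182, 102],
    [101, 17662, 12172, 2064, 2079, 7167, 1997, 4933, 2015, 2061,
     2191, 2469, 2017, 3046, 2673, 102],
]
PAD_ID = 0  # the id that decodes to [PAD] in the example above


def pad_batch(seqs, pad_id=PAD_ID):
    """Hypothetical helper: pad to the longest sequence and build the mask."""
    max_len = max(len(s) for s in seqs)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in seqs]
    # 1 = real token (attend to it), 0 = padding (ignore it)
    attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in seqs]
    return input_ids, attention_mask


input_ids, attention_mask = pad_batch(sequences)
for ids, mask in zip(input_ids, attention_mask):
    print(len(ids), mask)
```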

## 3. Model

The second component of the pipeline (after the tokenizer).

In [25]:
from transformers import AutoModel

checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
model = AutoModel.from_pretrained(checkpoint)

In [26]:
inputs

{'input_ids': tensor([[  101, 15775,  4801,  2038,  2042,  3403,  1999, 24240,  2005, 10514,
          6182,   102,     0,     0,     0,     0],
        [  101, 17662, 12172,  2064,  2079,  7167,  1997,  4933,  2015,  2061,
          2191,  2469,  2017,  3046,  2673,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [27]:
# send input to the model

outputs = model(**inputs) # ** unpacks the dictionary into keyword arguments: input_ids=..., attention_mask=...
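
As a quick illustration of the `**` syntax, the sketch below uses a hypothetical stand-in function, `fake_model`, in place of a real model, and a plain dictionary in place of the tokenizer output (which is a dict-like object). It shows that unpacking with `**` is equivalent to passing each key as a keyword argument.

```python
def fake_model(input_ids=None, attention_mask=None):
    """Hypothetical stand-in for a model call; it only reports what it received."""
    return {"got_input_ids": input_ids, "got_attention_mask": attention_mask}


# A plain dict standing in for the tokenizer output.
inputs = {"input_ids": [[101, 102]], "attention_mask": [[1, 1]]}

# These two calls pass exactly the same arguments.
unpacked = fake_model(**inputs)
explicit = fake_model(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
)

print(unpacked == explicit)
```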

In [28]:
outputs

BaseModelOutput(last_hidden_state=tensor([[[-0.0646,  0.2758, -0.3721,  ...,  0.0342, -0.4803,  0.3824],
         [-0.1979,  0.3366,  0.4101,  ...,  0.2438,  0.1842,  0.1367],
         [-0.6114,  0.0397,  0.6715,  ...,  0.1152,  0.1026, -0.0419],
         ...,
         [ 0.0606,  0.0074,  0.0046,  ...,  0.2112, -0.4586,  0.3855],
         [-0.2745,  0.3977, -0.0904,  ...,  0.1042, -0.3791,  0.2757],
         [ 0.0499,  0.0522,  0.0142,  ...,  0.2213, -0.4748,  0.3702]],

        [[ 0.5644,  0.3998,  0.4761,  ...,  0.3903,  0.7795, -0.3462],
         [ 0.1141,  0.4574,  1.0063,  ...,  0.3859,  0.6240,  0.0430],
         [ 0.3260,  0.3940,  1.1281,  ...,  0.3320,  0.6197, -0.2190],
         ...,
         [ 0.3110,  0.5700,  0.3175,  ...,  0.3519,  1.0604, -0.9748],
         [ 0.4691,  0.2198,  0.2427,  ...,  0.3746,  0.9758, -0.6325],
         [ 0.8492,  0.2319,  0.1885,  ...,  0.6911,  0.5335, -0.6505]]],
       grad_fn=<NativeLayerNormBackward0>), hidden_states=None, attentions=None)

In [29]:
outputs.last_hidden_state

tensor([[[-0.0646,  0.2758, -0.3721,  ...,  0.0342, -0.4803,  0.3824],
         [-0.1979,  0.3366,  0.4101,  ...,  0.2438,  0.1842,  0.1367],
         [-0.6114,  0.0397,  0.6715,  ...,  0.1152,  0.1026, -0.0419],
         ...,
         [ 0.0606,  0.0074,  0.0046,  ...,  0.2112, -0.4586,  0.3855],
         [-0.2745,  0.3977, -0.0904,  ...,  0.1042, -0.3791,  0.2757],
         [ 0.0499,  0.0522,  0.0142,  ...,  0.2213, -0.4748,  0.3702]],

        [[ 0.5644,  0.3998,  0.4761,  ...,  0.3903,  0.7795, -0.3462],
         [ 0.1141,  0.4574,  1.0063,  ...,  0.3859,  0.6240,  0.0430],
         [ 0.3260,  0.3940,  1.1281,  ...,  0.3320,  0.6197, -0.2190],
         ...,
         [ 0.3110,  0.5700,  0.3175,  ...,  0.3519,  1.0604, -0.9748],
         [ 0.4691,  0.2198,  0.2427,  ...,  0.3746,  0.9758, -0.6325],
         [ 0.8492,  0.2319,  0.1885,  ...,  0.6911,  0.5335, -0.6505]]],
       grad_fn=<NativeLayerNormBackward0>)

In [32]:
outputs.last_hidden_state.shape
# (batch_size, seq_len, hidden_size)

torch.Size([2, 16, 768])

## 4. Postprocessing

The last step of the pipeline (after the model).

In [33]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)

In [35]:
outputs

SequenceClassifierOutput(loss=None, logits=tensor([[ 2.5754, -2.1154],
        [-2.8534,  2.8917]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [34]:
outputs.logits

tensor([[ 2.5754, -2.1154],
        [-2.8534,  2.8917]], grad_fn=<AddmmBackward0>)

In [36]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

In [37]:
import torch 
predictions = torch.nn.functional.softmax(outputs.logits, dim = 1)

In [38]:
predictions

tensor([[0.9909, 0.0091],
        [0.0032, 0.9968]], grad_fn=<SoftmaxBackward0>)

Explanation:

{0: 'NEGATIVE', 1: 'POSITIVE'} 

our sentences: 

raw_inputs = ['Chaky has been waiting in queue for sushi',
              'Huggingface can do lots of stuffs so make sure you try everything']

result from the model: 

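tensor([[0.9909, 0.0091],
        [0.0032, 0.9968]])

Each row is one sentence, and each column is the probability of the class with that index.  
The first sentence is NEGATIVE, since 0.9909 (index 0) > 0.0091 (index 1).  
The second sentence is POSITIVE, since 0.0032 (index 0) < 0.9968 (index 1).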
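
To see where these probabilities come from, here is a minimal pure-Python sketch of the same postprocessing: a softmax over each row of logits, followed by a lookup of the most probable index in `id2label`. The logits and label mapping are copied from the outputs above; the `softmax` helper is written by hand for illustration and stands in for `torch.nn.functional.softmax`.

```python
import math

# Logits and label mapping copied from the outputs above.
logits = [[2.5754, -2.1154], [-2.8534, 2.8917]]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(row):
    """Turn one row of logits into probabilities that sum to 1."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


probs = [softmax(row) for row in logits]

# The predicted label is the index with the highest probability.
labels = [id2label[row.index(max(row))] for row in probs]

for p, label in zip(probs, labels):
    print([round(x, 4) for x in p], label)
```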