# **Seminar 5 - Development Tools**
*Naumov Anton (Any0019)*

*To contact me in telegram: @any0019*

## 1. HuggingFace

HuggingFace ( https://huggingface.co ) is one of your best friends as an ML practitioner.

It is a platform for machine learning.

On the platform you can find, upload, and host models, datasets, and APIs.

The platform also comes with a powerful Python library (more precisely, a whole family of libraries) for ML.

### 1.0 Python libraries

HuggingFace has a whole set of libraries for ML.

For working with models ( https://huggingface.co/docs/hub/models-libraries ), the most important ones are:
- transformers - for NLP
- diffusers - for diffusion models
- PEFT - Parameter-Efficient Fine-Tuning (e.g. LoRA)

For working with data ( https://huggingface.co/docs/hub/datasets-libraries ), the most important one is:
- datasets - datasets :)

This is nowhere near a complete list ( https://github.com/huggingface ):
- evaluate ( https://github.com/huggingface/evaluate ) - various metrics / benchmarks
- accelerate ( https://github.com/huggingface/accelerate ) - multi-GPU training
- optimum ( https://github.com/huggingface/optimum ) - inference optimization
- ...

It is the sklearn of the DL world :D

In [1]:
%pip install torch transformers datasets evaluate scikit-learn accelerate

Note: you may need to restart the kernel to use updated packages.


### 1.1 Transformers - pipeline

A pipeline bundles three things into a single object:
1. Pre-processing (tokenization, ...)
2. The model
3. Post-processing
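
To make the three stages concrete, here is a toy sketch in plain Python. It is not how `transformers` implements pipelines: the whitespace tokenizer, the word-count "model", and the names `preprocess`, `forward`, `postprocess`, and `toy_pipeline` are all made up for illustration.

```python
# Toy illustration of the three pipeline stages; every name here is hypothetical.
POSITIVE_WORDS = {"nice", "great", "good"}
NEGATIVE_WORDS = {"awful", "bad", "terrible"}


def preprocess(text):
    """Stage 1: turn raw text into model inputs (here: lowercase word tokens)."""
    return text.lower().replace("!", " ").replace(".", " ").split()


def forward(tokens):
    """Stage 2: the "model" maps inputs to raw scores (here: simple word counts)."""
    positive = sum(token in POSITIVE_WORDS for token in tokens)
    negative = sum(token in NEGATIVE_WORDS for token in tokens)
    return {"POSITIVE": positive, "NEGATIVE": negative}


def postprocess(scores):
    """Stage 3: turn raw scores into a human-friendly answer."""
    label = max(scores, key=scores.get)
    return {"label": label, "scores": scores}


def toy_pipeline(text):
    """Chain the three stages, the way a pipeline object does internally."""
    return postprocess(forward(preprocess(text)))


print(toy_pipeline("This model is nice!"))
```

A real pipeline swaps each stage for a tokenizer, a neural network, and a task-specific decoder, but the calling code still sees one callable.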

https://huggingface.co/docs/transformers/index

In [2]:
from transformers import pipeline

classifier = pipeline(
    task='sentiment-analysis',
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

Device set to use cuda:0


In [3]:
# print(classifier("This model is nice!"))
print(classifier("Hello"))

[{'label': 'POSITIVE', 'score': 0.9995185136795044}]


In [4]:
print(classifier(
    [
        "What an awful thing...",
        "It's great in what it was designed for, but kinda awful that everything is done for me",
    ]
))

[{'label': 'NEGATIVE', 'score': 0.9996784925460815}, {'label': 'NEGATIVE', 'score': 0.9985873699188232}]
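
The `score` values are class probabilities. A likely way to get them from the model's raw logits is a softmax followed by picking the best class; the sketch below shows that step in plain Python. The logits are invented and `ID2LABEL` is an assumed mapping, so treat it as an illustration rather than the pipeline's actual code.

```python
import math

# Assumed index-to-name mapping; the real one is stored in the model config.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(logits):
    """Convert raw logits to probabilities in a numerically stable way."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_postprocess(logits, top_k=1):
    """Return the top_k classes as label/score dictionaries, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [{"label": ID2LABEL[i], "score": probs[i]} for i in ranked[:top_k]]


print(toy_postprocess([-3.5, 4.1]))  # made-up logits for a clearly positive text
```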


In [5]:
classifier.model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


In [6]:
classifier.tokenizer

DistilBertTokenizerFast(name_or_path='distilbert-base-uncased-finetuned-sst-2-english', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)

In [7]:
?classifier.postprocess

Signature:
classifier.postprocess(
    model_outputs,
    function_to_apply=None,
    top_k=1,
    _legacy=True,
)
Docstring:
Postprocess will receive the raw outputs of the `_forward` method, generally tensors, and reformat them into
something more friendly. Generally it will output a list or a dict or results (containing just strings and
numbers).
File:      c:\users\nazmievairat\anaconda3\envs\python312\lib\site-packages\transformers\pipelines\text_classification.py
Type:      method

In [8]:
# mlm_model = pipeline('fill-mask', model="bert-base-uncased")  
mlm_model = pipeline(task='fill-mask', model="bert-base-cased")  # cased/uncased - with or without capital letters
MASK = mlm_model.tokenizer.mask_token

BertForMaskedLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architect

In [9]:
MASK

'[MASK]'

In [10]:
mlm_model(f"{MASK} Tolstoy was born in 1828")

[{'score': 0.9097098112106323,
  'token': 6344,
  'token_str': 'Leo',
  'sequence': 'Leo Tolstoy was born in 1828'},
 {'score': 0.012587308883666992,
  'token': 27257,
  'token_str': 'Lev',
  'sequence': 'Lev Tolstoy was born in 1828'},
 {'score': 0.00813402608036995,
  'token': 7062,
  'token_str': 'Ivan',
  'sequence': 'Ivan Tolstoy was born in 1828'},
 {'score': 0.007583901286125183,
  'token': 14374,
  'token_str': 'Nikolai',
  'sequence': 'Nikolai Tolstoy was born in 1828'},
 {'score': 0.006656601093709469,
  'token': 23378,
  'token_str': 'Alexei',
  'sequence': 'Alexei Tolstoy was born in 1828'}]

In [11]:
for hypo in mlm_model(f"Donald {MASK} is the president of the united states."):
  print(f"P={hypo['score']:.5f}", hypo['sequence'])

P=0.94878 Donald Trump is the president of the united states.
P=0.00416 Donald Byrd is the president of the united states.
P=0.00203 Donald Johnson is the president of the united states.
P=0.00124 Donald Kennedy is the president of the united states.
P=0.00118 Donald Cameron is the president of the united states.


In [12]:
?pipeline

Signature:
pipeline(
    task: str = None,
    model: Union[str, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel'), NoneType] = None,
    config: Union[str, transformers.configuration_utils.PretrainedConfig, NoneType] = None,
    tokenizer: Union[str, transformers.tokenization_utils.PreTrainedTokenizer, ForwardRef('PreTrainedTokeniz

In [13]:
del classifier, mlm_model

There are many models for all kinds of tasks; you can quickly find any of them at https://huggingface.co/models

In [14]:
import json

text = """Almost two-thirds of the 1.5 million people who viewed this liveblog had Googled to discover
 the latest on the Rosetta mission. They were treated to this detailed account by the Guardian’s science editor,
 Ian Sample, and astronomy writer Stuart Clark of the moment scientists landed a robotic spacecraft on a comet 
 for the first time in history, and the delirious reaction it provoked at their headquarters in Germany.
  “We are there. We are sitting on the surface. Philae is talking to us,” said one scientist.
"""

# Task: create a pipeline for Named Entity Recognition (NER); look for models on the Hub
#  - either by searching for "ner" in the model name
#  - or by the Token Classification task
ner_model = pipeline(
  task='token-classification',
  model='dslim/distilbert-NER'
)

named_entities = ner_model(text)
named_entities

Device set to use cuda:0


[{'entity': 'B-LOC',
  'score': np.float32(0.90131134),
  'index': 27,
  'word': 'Rose',
  'start': 112,
  'end': 116},
 {'entity': 'B-LOC',
  'score': np.float32(0.84262055),
  'index': 28,
  'word': '##tta',
  'start': 116,
  'end': 119},
 {'entity': 'B-ORG',
  'score': np.float32(0.9820916),
  'index': 40,
  'word': 'Guardian',
  'start': 179,
  'end': 187},
 {'entity': 'B-PER',
  'score': np.float32(0.99665266),
  'index': 46,
  'word': 'Ian',
  'start': 207,
  'end': 210},
 {'entity': 'I-PER',
  'score': np.float32(0.999138),
  'index': 47,
  'word': 'Sam',
  'start': 211,
  'end': 214},
 {'entity': 'I-PER',
  'score': np.float32(0.99891937),
  'index': 48,
  'word': '##ple',
  'start': 214,
  'end': 217},
 {'entity': 'B-PER',
  'score': np.float32(0.99740535),
  'index': 53,
  'word': 'Stuart',
  'start': 240,
  'end': 246},
 {'entity': 'I-PER',
  'score': np.float32(0.99763894),
  'index': 54,
  'word': 'Clark',
  'start': 247,
  'end': 252},
 {'entity': 'B-LOC',
  'score': np.f

In [15]:
word_to_entity = {item['word']: item['entity'] for item in named_entities}
assert 'org' in word_to_entity.get('Guardian').lower() and 'per' in word_to_entity.get('Stuart').lower()
print("All tests passed")

All tests passed
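
Note that the pipeline output is per word piece, so `Rosetta` arrives as `Rose` + `##tta` and `Sample` as `Sam` + `##ple`. The sketch below glues such pieces back together using the character offsets from the output above. The helper name `merge_wordpieces` is hypothetical, and the input is a hand-copied subset of the real result with the scores dropped.

```python
def merge_wordpieces(entities):
    """Glue '##'-prefixed continuation pieces onto the preceding piece."""
    merged = []
    for item in entities:
        word = item["word"]
        is_continuation = (
            word.startswith("##")
            and merged
            and merged[-1]["end"] == item["start"]  # pieces must be adjacent in the text
        )
        if is_continuation:
            merged[-1]["word"] += word[2:]
            merged[-1]["end"] = item["end"]
        else:
            merged.append(
                {
                    "word": word,
                    "entity": item["entity"],
                    "start": item["start"],
                    "end": item["end"],
                }
            )
    return merged


# Subset of the NER output shown above (scores omitted).
sample = [
    {"entity": "B-LOC", "word": "Rose", "start": 112, "end": 116},
    {"entity": "B-LOC", "word": "##tta", "start": 116, "end": 119},
    {"entity": "B-PER", "word": "Ian", "start": 207, "end": 210},
    {"entity": "I-PER", "word": "Sam", "start": 211, "end": 214},
    {"entity": "I-PER", "word": "##ple", "start": 214, "end": 217},
]

merged = merge_wordpieces(sample)
print([(m["word"], m["entity"]) for m in merged])
```

The token-classification pipeline also accepts an `aggregation_strategy` argument (for example `"simple"`) that groups pieces into whole entities for you; check the documentation of your installed version for the available strategies.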


### 1.2 Transformers - model and tokenizer

In [16]:
import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification

model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
#model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

In [17]:
model

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

In [18]:
lines = [
    "Luke, I am your father.",
    "Life is what happens when you're busy making other plans.",
]

# tokenize a batch of texts. "pt" - [p]y[t]orch tensors
tokens_info = tokenizer(lines, padding=True, truncation=True, return_tensors="pt")

for key in tokens_info:
    print(key, tokens_info[key].shape, tokens_info[key], sep="\n", end="\n\n")

input_ids
torch.Size([2, 15])
tensor([[ 101, 5355, 1010, 1045, 2572, 2115, 2269, 1012,  102,    0,    0,    0,
            0,    0,    0],
        [ 101, 2166, 2003, 2054, 6433, 2043, 2017, 1005, 2128, 5697, 2437, 2060,
         3488, 1012,  102]])

token_type_ids
torch.Size([2, 15])
tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

attention_mask
torch.Size([2, 15])
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
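
What `padding=True` does to this batch can be reproduced by hand: shorter sequences are filled with the `[PAD]` id (0 for this tokenizer) up to the longest sequence, and `attention_mask` marks real tokens with 1 and padding with 0. The helper `pad_batch` below is a hypothetical plain-Python sketch of that idea, not the tokenizer's implementation; the token ids are copied from the output above.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return input_ids, attention_mask


# Unpadded token ids of the two example sentences.
luke = [101, 5355, 1010, 1045, 2572, 2115, 2269, 1012, 102]
life = [101, 2166, 2003, 2054, 6433, 2043, 2017, 1005, 2128, 5697, 2437, 2060,
        3488, 1012, 102]

input_ids, attention_mask = pad_batch([luke, life])
print(input_ids[0])
print(attention_mask[0])
```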



In [19]:
for i in range(2):
    print(tokenizer.batch_decode(tokens_info['input_ids'][i]))
    
# [CLS] - bos, [SEP] - eos
print("Detokenized:")
for i in range(2):
    print(tokenizer.decode(tokens_info['input_ids'][i]))

['[CLS]', 'luke', ',', 'i', 'am', 'your', 'father', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
['[CLS]', 'life', 'is', 'what', 'happens', 'when', 'you', "'", 're', 'busy', 'making', 'other', 'plans', '.', '[SEP]']
Detokenized:
[CLS] luke, i am your father. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
[CLS] life is what happens when you ' re busy making other plans. [SEP]


In [20]:
text_for_analyse = "some random text for deeper analysis + weird word Rutherfordium"

In [21]:
for key, value in tokenizer(text_for_analyse).items():
    print(key, value, sep="\n", end="\n\n")

input_ids
[101, 2070, 6721, 3793, 2005, 6748, 4106, 1009, 6881, 2773, 18472, 5007, 102]

token_type_ids
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

attention_mask
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]



In [22]:
tokenizer.encode(text_for_analyse)

[101, 2070, 6721, 3793, 2005, 6748, 4106, 1009, 6881, 2773, 18472, 5007, 102]

In [23]:
tokenizer.decode(tokenizer.encode(text_for_analyse))

'[CLS] some random text for deeper analysis + weird word rutherfordium [SEP]'

In [24]:
tokenizer.batch_decode(tokenizer.encode(text_for_analyse))

['[CLS]',
 'some',
 'random',
 'text',
 'for',
 'deeper',
 'analysis',
 '+',
 'weird',
 'word',
 'rutherford',
 '##ium',
 '[SEP]']

In [25]:
tokenizer.tokenize(text_for_analyse)

['some',
 'random',
 'text',
 'for',
 'deeper',
 'analysis',
 '+',
 'weird',
 'word',
 'rutherford',
 '##ium']
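
The split of `rutherfordium` into `rutherford` + `##ium` is typical of WordPiece-style tokenizers: take the longest vocabulary entry that matches the start of the word, then continue with the remainder, marking continuation pieces with `##`. Below is a minimal sketch of that greedy longest-match idea. The vocabulary `toy_vocab` is invented, and the real tokenizer also handles normalization, punctuation, and length limits that are left out here.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter candidate
        if piece is None:
            return [unk]  # no vocabulary entry fits: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical miniature vocabulary, just enough for the example.
toy_vocab = {"rutherford", "##ium", "weird", "word"}

print(wordpiece("rutherfordium", toy_vocab))
print(wordpiece("weird", toy_vocab))
```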

In [26]:
(
    tokenizer.all_special_ids,
    tokenizer.all_special_tokens,
    tokenizer.all_special_tokens_extended,
    tokenizer.added_tokens_encoder,
    tokenizer.added_tokens_decoder,
)

([100, 102, 0, 101, 103],
 ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]'],
 ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]'],
 {'[PAD]': 0, '[UNK]': 100, '[CLS]': 101, '[SEP]': 102, '[MASK]': 103},
 {0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True)})

In [27]:
lines = [
    "Luke, I am your father.",
    "Life is what happens when you're busy making other plans.",
]

tokens_info = tokenizer(lines, padding=True, truncation=True, return_tensors="pt")

# forward pass through the model
with torch.no_grad():
    outputs = model(**tokens_info)

print(outputs)

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[-0.3502,  0.2246, -0.2345,  ..., -0.2232,  0.1730,  0.6747],
         [-0.6097,  0.6892, -0.5512,  ..., -0.4814,  0.5322,  1.3833],
         [ 0.1842,  0.4881,  0.2193,  ..., -0.2699,  0.2246,  0.7985],
         ...,
         [-0.4413,  0.2748, -0.0391,  ..., -0.0604, -0.4358,  0.1384],
         [-0.5414,  0.4633,  0.0678,  ..., -0.1871, -0.5046,  0.2752],
         [-0.3940,  0.6180,  0.2092,  ..., -0.2345, -0.4177,  0.3341]],

        [[ 0.1622, -0.1154, -0.3894,  ..., -0.4180,  0.0138,  0.7644],
         [ 0.6471,  0.3774, -0.4082,  ...,  0.0050,  0.5559,  0.4385],
         [ 0.3351, -0.3158, -0.1178,  ...,  0.1348, -0.3143,  1.4409],
         ...,
         [ 1.2932, -0.1743, -0.5613,  ..., -0.2718, -0.1367,  0.4217],
         [ 1.0305,  0.1708, -0.2985,  ...,  0.2097, -0.4627, -0.4277],
         [ 1.0854,  0.1760, -0.0377,  ...,  0.3152, -0.5979, -0.3465]]]), pooler_output=tensor([[-0.8854, -0.4722, -0.9392,  .

In [28]:
model

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

In [29]:
# from torchview import draw_graph
# draw_graph(model, tokens_info, depth=3, expand_nested=True).visual_graph

In [30]:
model.encoder.layer[-1].output

BertOutput(
  (dense): Linear(in_features=3072, out_features=768, bias=True)
  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
  (dropout): Dropout(p=0.1, inplace=False)
)

In [31]:
outputs.last_hidden_state.shape

torch.Size([2, 15, 768])

In [32]:
model.pooler

BertPooler(
  (dense): Linear(in_features=768, out_features=768, bias=True)
  (activation): Tanh()
)

In [33]:
outputs.pooler_output.shape

torch.Size([2, 768])

### 1.3 Datasets

https://huggingface.co/docs/datasets/index

In [34]:
from datasets import load_dataset

ds = load_dataset("fancyzhx/yelp_polarity")

In [35]:
ds

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 560000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 38000
    })
})

In [36]:
ds["train"][0]

{'text': "Unfortunately, the frustration of being Dr. Goldberg's patient is a repeat of the experience I've had with so many other doctors in NYC -- good doctor, terrible staff.  It seems that his staff simply never answers the phone.  It usually takes 2 hours of repeated calling to get an answer.  Who has time for that or wants to deal with it?  I have run into this problem with many other doctors and I just don't get it.  You have office workers, you have patients with medical needs, why isn't anyone answering the phone?  It's incomprehensible and not work the aggravation.  It's with regret that I feel that I have to give Dr. Goldberg 2 stars.",
 'label': 0}

In [37]:
ds["train"][0:5]["text"]

["Unfortunately, the frustration of being Dr. Goldberg's patient is a repeat of the experience I've had with so many other doctors in NYC -- good doctor, terrible staff.  It seems that his staff simply never answers the phone.  It usually takes 2 hours of repeated calling to get an answer.  Who has time for that or wants to deal with it?  I have run into this problem with many other doctors and I just don't get it.  You have office workers, you have patients with medical needs, why isn't anyone answering the phone?  It's incomprehensible and not work the aggravation.  It's with regret that I feel that I have to give Dr. Goldberg 2 stars.",
 "Been going to Dr. Goldberg for over 10 years. I think I was one of his 1st patients when he started at MHMG. He's been great over the years and is really all about the big picture. It is because of him, not my now former gyn Dr. Markoff, that I found out I have fibroids. He explores all options with you and is very patient and understanding. He doe

In [38]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenize_function(ds["train"][0:5])

{'input_ids': [[101, 6854, 1010, 1996, 9135, 1997, 2108, 2852, 1012, 18522, 1005, 1055, 5776, 2003, 1037, 9377, 1997, 1996, 3325, 1045, 1005, 2310, 2018, 2007, 2061, 2116, 2060, 7435, 1999, 16392, 1011, 1011, 2204, 3460, 1010, 6659, 3095, 1012, 2009, 3849, 2008, 2010, 3095, 3432, 2196, 6998, 1996, 3042, 1012, 2009, 2788, 3138, 1016, 2847, 1997, 5567, 4214, 2000, 2131, 2019, 3437, 1012, 2040, 2038, 2051, 2005, 2008, 2030, 4122, 2000, 3066, 2007, 2009, 1029, 1045, 2031, 2448, 2046, 2023, 3291, 2007, 2116, 2060, 7435, 1998, 1045, 2074, 2123, 1005, 1056, 2131, 2009, 1012, 2017, 2031, 2436, 3667, 1010, 2017, 2031, 5022, 2007, 2966, 3791, 1010, 2339, 3475, 1005, 1056, 3087, 10739, 1996, 3042, 1029, 2009, 1005, 1055, 4297, 25377, 2890, 10222, 19307, 1998, 2025, 2147, 1996, 12943, 17643, 21596, 1012, 2009, 1005, 1055, 2007, 9038, 2008, 1045, 2514, 2008, 1045, 2031, 2000, 2507, 2852, 1012, 18522, 1016, 3340, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

In [39]:
tokenized_datasets = ds.map(tokenize_function, batched=True, batch_size=1000)

In [40]:
tokenized_datasets["train"][0].keys()

dict_keys(['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'])
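
With `batched=True` the mapped function receives a dictionary of columns, each holding a list of up to `batch_size` values, and the columns it returns are added to the dataset. The sketch below imitates that contract on plain Python lists. `batched_map` and `add_length` are hypothetical helpers written for this illustration and do not reflect how `datasets` is implemented (the real library also handles caching, Arrow storage, and multiprocessing).

```python
def batched_map(columns, fn, batch_size):
    """Apply fn to slices of a column-oriented table and merge new columns in."""
    n_rows = len(next(iter(columns.values())))
    result = {name: list(values) for name, values in columns.items()}
    for start in range(0, n_rows, batch_size):
        stop = start + batch_size
        batch = {name: values[start:stop] for name, values in columns.items()}
        for name, new_values in fn(batch).items():
            column = result.setdefault(name, [None] * n_rows)
            column[start:stop] = new_values
    return result


def add_length(batch):
    """Example batched function: one output value per input text."""
    return {"n_words": [len(text.split()) for text in batch["text"]]}


table = {
    "text": ["good doctor", "terrible staff here", "ok"],
    "label": [1, 0, 1],
}
out = batched_map(table, add_length, batch_size=2)
print(out)
```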

In [41]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1024))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1024))

In [42]:
from torch.utils.data import DataLoader
small_train_dataset.set_format(type="torch", columns=["input_ids", "label", "attention_mask"])
dataloader = DataLoader(small_train_dataset, batch_size=4)
res = next(iter(dataloader))

for key, value in res.items():
    print(key, value.shape, value, sep="\n", end="\n-------\n")

label
torch.Size([4])
tensor([1, 0, 0, 1])
-------
input_ids
torch.Size([4, 512])
tensor([[  101, 11519,  2946,  ...,     0,     0,     0],
        [  101,  1045,  2031,  ...,     0,     0,     0],
        [  101,  2833,  2003,  ...,     0,     0,     0],
        [  101,  2190,  9270,  ...,     0,     0,     0]])
-------
attention_mask
torch.Size([4, 512])
tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])
-------


It can do a lot more:
```python
from datasets import Image  # the datasets feature type, not PIL.Image

ds.rename_column("text", "unsplit_text")  # rename columns
ds.cast_column("image", Image(mode="RGB"))  # cast individual columns to the required type
ds.with_transform(transforms)  # apply augmentations on the fly
...
```

### 1.4 Evaluate

https://huggingface.co/docs/evaluate/index

In [43]:
import evaluate

metric = evaluate.load("accuracy")

In [44]:
metric.compute(predictions=[1, 2, 3, 4], references=[1, 1, 1, 4])

{'accuracy': 0.5}

In [45]:
metric.compute(predictions=[1, 2, 3, 4], references=[4, 3, 2, 1])

{'accuracy': 0.0}

In [46]:
metric.compute(predictions=[1, 2, 3, 4], references=[1, 2, 3, 4])

{'accuracy': 1.0}

In [47]:
import numpy as np

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
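
For intuition, the same computation can be written without any libraries: take the index of the largest logit in each row, then count the share of matches with the labels. The names `argmax`, `accuracy`, and `toy_compute_metrics` and the logits below are made up for this sketch.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy(predictions, references):
    """Share of positions where prediction and reference agree."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def toy_compute_metrics(logits, labels):
    """Plain-Python counterpart of compute_metrics for a list of logit rows."""
    predictions = [argmax(row) for row in logits]
    return {"accuracy": accuracy(predictions, labels)}


logits = [[2.0, -1.0], [0.1, 0.3], [1.5, 1.4], [-0.2, 0.9]]  # invented values
labels = [0, 1, 1, 1]
print(toy_compute_metrics(logits, labels))
print(accuracy([1, 2, 3, 4], [1, 1, 1, 4]))
```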

### 1.5 Transformers - Trainer

In [48]:
from transformers import Trainer, TrainingArguments

?TrainingArguments

Init signature:
TrainingArguments(
    output_dir: str,
    overwrite_output_dir: bool = False,
    do_train: bool = False,
    do_eval: bool = False,
    do_predict: bool = False,
    eval_strategy: Union[transformers.trainer_utils.IntervalStrategy, str] = 'no',
    prediction_loss_only: bool = False,
    per_device_train_batch_size: int = 8

In [49]:
training_args = TrainingArguments(
    output_dir="./my_model",
    overwrite_output_dir=True,
    num_train_epochs=10,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    # lr_scheduler_kwargs={},
    # warmup_ratio=0.03125,
    # warmup_steps=10,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=1,
    log_level="error",
    # logging_dir="output_dir/runs/CURRENT_DATETIME_HOSTNAME"  # TensorBoard logs (default)
    logging_strategy="steps",
    logging_steps=1,
    save_strategy="epoch",
    # save_steps=1,
    save_total_limit=2,
    save_safetensors=True,  # safetensors instead of torch.save / torch.load
    save_only_model=False,  # also save the optimizer, scheduler, rng state, ...
    use_cpu=False,
    seed=42,
    # bf16=True,  # use bf16 instead of fp32
    eval_strategy="epoch",
    # eval_steps=32,
    disable_tqdm=False,
    load_best_model_at_end=False,
    label_smoothing_factor=0.,
    optim="adamw_torch",
    # optim_args=...,
    # resume_from_checkpoint=...,
    # auto_find_batch_size=...,
)
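
To get a feel for `lr_scheduler_type="cosine"`, the sketch below computes a half-cosine decay from `learning_rate=5e-5` down to zero. It is an approximation under stated assumptions (optional linear warmup, a single half cosine, one device), and `cosine_lr` is a hypothetical helper rather than the library's scheduler. The 320 total steps follow from the settings used in this notebook: 1024 training examples / batch size 32 = 32 steps per epoch, times 10 epochs.

```python
import math


def cosine_lr(step, total_steps, base_lr, warmup_steps=0):
    """Learning rate at a given step: linear warmup, then half-cosine decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 320  # 1024 examples / batch size 32 * 10 epochs (single device assumed)
for step in (0, 80, 160, 240, 320):
    print(step, f"{cosine_lr(step, total_steps, 5e-5):.2e}")
```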

In [50]:
# `model` from In [16] was loaded with AutoModel, i.e. a bare BertModel without a
# classification head. Passing it to Trainer fails with
# "TypeError: BertModel.forward() got an unexpected keyword argument 'labels'",
# so reload the checkpoint with a sequence-classification head (2 classes).
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()

In [None]:
texts = [
    "This was not a good movie!",
    "What an awesome place!",
    "ewww",
]

tokens_info = tokenizer(
    texts,
    padding=True,
    truncation=True,
    return_tensors="pt",
)

model.eval()
model.cpu()
with torch.no_grad():
    out = model(**tokens_info)
    probs = torch.nn.functional.softmax(out.logits, dim=-1)
    for text, prob in zip(texts, probs.tolist()):
        print(
            f"Text: `{text}`\nPrediction (prob): "
            f"negative={round(prob[0], 3)} ; "  # label 0 is the negative class in this dataset
            f"positive={round(prob[1], 3)}",
            end="\n\n"
        )

## 2. StreamLit

Streamlit is a simple library for building interactive web apps

In [None]:
%pip install streamlit

```bash
streamlit hello  # demo with source code from Streamlit itself
```

Streamlit apps are built line by line rather than from a layout.

Core principles:
1. Use Python scripts. Build and extend Streamlit apps line by line.
2. Treat widgets as variables. Widgets are input elements that let users interact with Streamlit apps: text inputs, checkboxes, sliders, and so on.
3. Reuse data and computation. Historically, data and computations were cached with the `@st.cache` decorator. This saves compute time when you change the app, which can happen hundreds of times while you are actively editing it! Version 0.89.0 introduced two new primitives (`st.experimental_memo` and `st.experimental_singleton`) that were considerably faster than `@st.cache`; current versions use `st.cache_data` and `st.cache_resource` instead (see section 2.5).

In [None]:
import streamlit as st

st.__version__

App workflow
1. Create and fill in an `app.py` file (the default name; you can use your own)
2. `streamlit run app.py`
3. Done!

### 2.1. Text

In [None]:
import streamlit as st

st.title("This is a title")
st.header("This is a header")
st.subheader("This is a subheader")
st.text("This is a text")
st.markdown("# This is a markdown header 1")
st.markdown("## This is a markdown header 2")
st.markdown("### This is a markdown header 3")
st.markdown("This is a markdown: **bold** *italic* `inline code` ~strikethrough~")
st.markdown("""This is a code block with syntax highlighting
```python
print("Hello world!")
```
""")
st.html(
    "image from url example with html: "
    "<img src='https://www.wallpaperflare.com/static/450/825/286/kitten-cute-animals-grass-5k-wallpaper.jpg' width=400px>",
)


st.write("Text with write")
st.write(range(10))

### 2.2. Logging

In [None]:
st.success("Success")
st.info("Information")
st.warning("Warning")
st.error("Error")
exp = ZeroDivisionError("Trying to divide by Zero")
st.exception(exp)

### 2.3. Objects

In [None]:
from urllib import request
request.urlretrieve(
    "http://craphound.com/images/1006884_2adf8fc7.jpg",
    "image_example.jpg",
)

from PIL import Image
img = Image.open("image_example.jpg")
img

In [None]:
# image (without html - from a variable)
st.image(img, width=200)

# checkbox
if st.checkbox("Show/Hide"):
    st.text("Showing the widget")
else:
    st.warning("Not showing what is inside")

# radio buttons for picking one option
status = st.radio("Select Gender: ", ('Male', 'Female'))
if status == 'Male':
    st.success("Male")
else:
    st.success("Female")

# dropdown for picking one option
hobby = st.selectbox(
    "Hobbies: ",
    ['Dancing', 'Reading', 'Sports'],
)
st.write("Your hobby is: ", hobby)

# picking several options
hobbies = st.multiselect(
    "Hobbies: ",
    ['Dancing', 'Reading', 'Sports'],
)
st.write("You selected", len(hobbies), 'hobbies')

# button with no functionality
st.button("Click me for no reason")

# button that shows text when clicked
if st.button("Click me"):
    st.text("You did it, you clicked me!!!")

# text input: label - the caption, value - the default content
name = st.text_input(label="Enter Your name", value="Type Here ...")
if st.button('Submit'):
    result = name.title()
    st.success(result)

# slider
level = st.slider("Select the level", 1, 5)
st.text('Selected: {}'.format(level))

### 2.4. Advanced actions

```python
# A variable shared across reruns - a way to pass information between script reruns
st.session_state  # kinda Dict[str, Any]

# Initialization
if 'key' not in st.session_state:
    st.session_state['key'] = 'value'

# You can also use attributes instead of keys
if 'key' not in st.session_state:
    st.session_state.key = 'value'
```
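
Why this matters: Streamlit reruns the whole script from top to bottom on every interaction, so ordinary variables start from scratch each time while `st.session_state` keeps its contents. The plain-Python sketch below imitates that behaviour without Streamlit; `session_state` and `run_script` are hypothetical stand-ins, and each call to `run_script` plays the role of one rerun.

```python
# Plain-Python stand-in for the rerun model; these names are not Streamlit objects.
session_state = {}  # survives between reruns, like st.session_state


def run_script(button_clicked):
    """One top-to-bottom execution of the "app" script."""
    local_clicks = 0  # ordinary variable: recreated on every rerun
    if "clicks" not in session_state:
        session_state["clicks"] = 0  # initialised only on the first run
    if button_clicked:
        local_clicks += 1
        session_state["clicks"] += 1
    return local_clicks, session_state["clicks"]


# Four reruns: the button is clicked on the 1st, 2nd, and 4th.
history = [run_script(clicked) for clicked in (True, True, False, True)]
print(history)
```

The local counter never gets past 1, while the value kept in the shared dictionary accumulates across reruns.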

In [None]:
# initialize variables
st.session_state.key1 = 'value1'     # Attribute API
st.session_state['key2'] = 'value2'  # Dictionary like API

# see what is in st.session_state
st.write(st.session_state)

# magic
st.session_state

# error if the key is missing
st.write(st.session_state['missing_key'])

In [None]:
# key - specifies which session_state field the widget value is stored in
st.text_input("Please input something", key="my input")
st.session_state

### 2.5. Caching

There are 2 decorators for caching

```python
@st.cache_data      # for data - serializes outputs, keyed by the inputs
@st.cache_resource  # for models / resources - non-serializable objects you don't want to load more than once
```
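
The difference between the two can be sketched in plain Python. The decorators below are simplified, hypothetical imitations (`toy_cache_data`, `toy_cache_resource`), not Streamlit's implementation: the first stores a pickled copy of the result and hands out a fresh copy on each call, the second stores the object itself and always returns that same object.

```python
import pickle


def toy_cache_data(fn):
    """Cache serialized results keyed by arguments; each call gets its own copy."""
    store = {}

    def wrapper(*args):
        if args not in store:
            store[args] = pickle.dumps(fn(*args))
        return pickle.loads(store[args])

    return wrapper


def toy_cache_resource(fn):
    """Cache the object itself; every call returns the same shared instance."""
    store = {}

    def wrapper(*args):
        if args not in store:
            store[args] = fn(*args)
        return store[args]

    return wrapper


calls = {"data": 0, "resource": 0}


@toy_cache_data
def load_rows(n):
    calls["data"] += 1
    return list(range(n))


@toy_cache_resource
def load_model(name):
    calls["resource"] += 1
    return {"name": name}


rows_a, rows_b = load_rows(3), load_rows(3)
model_a, model_b = load_model("demo"), load_model("demo")
print(calls, rows_a is rows_b, model_a is model_b)
```

Copies make cached data safe to mutate in one place without affecting other callers, while sharing a single instance avoids loading a heavy object such as a model more than once.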

In [None]:
import streamlit as st
import pandas as pd

@st.cache_data  # caching
def load_data(url):
    df = pd.read_csv(url)  # download the dataset
    return df

df = load_data("https://github.com/plotly/datasets/raw/master/uber-rides-data1.csv")
st.dataframe(df)

st.button("Rerun")

In [None]:
import streamlit as st
from transformers import pipeline

@st.cache_resource  # caching
def load_model():
    return pipeline("sentiment-analysis")  # download the model

model = load_model()

query = st.text_input("Your query", value="I love Streamlit! 🎈")
if query:
    result = model(query)[0]  # classify the query
    st.write(query)
    st.write(result)

## 3. HF + StreamLit

You can host a test Streamlit app right on Hugging Face

1. https://huggingface.co/
2. New Space - Streamlit
3. Create `app.py` and `requirements.txt`
4. A Docker image gets built and the app appears (publicly accessible)
5. \* a bit of mischief - you can even pull the iframe out of HF