# Using Transformers

## Pipeline examples
A pipeline connects a model with its necessary preprocessing and postprocessing steps, so we can pass in raw text directly and get an intelligible answer.

In [None]:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")



No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


In [None]:
classifier("I've been waiting for this sucking day my whole life.")

[{'label': 'NEGATIVE', 'score': 0.90556800365448}]

Zero-shot classification


In [None]:
classifier = pipeline("zero-shot-classification")

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


In [None]:
classifier(
    "This is a course about the business",
    candidate_labels=["education", "politics", "business"],
)

{'sequence': 'This is a course about the business',
 'labels': ['business', 'education', 'politics'],
 'scores': [0.9886767864227295, 0.007415545638650656, 0.003907651640474796]}

In [None]:
generator = pipeline("text-generation")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)


No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': "In this course, we will teach you how to build a simple and powerful Java application.\n\nLet's begin...\n\nIn this Java application, we will be using Java 9.3.\n\nThis time, we will be using Java 8.0.\n\nIn this tutorial, we will be using Java 7.0.\n\nIn this Java application, we will be using Java 8.0.\n\nThe following Java code is also available in this Java tutorial.\n\n\nThis Java application is written in Java 8.0.\n\nIn this Java application, we will be using Java 9.3.\n\nIn this Java application, we will be using Java 8.0.\n\nJava 9.3 is now available for download.\n\nJava 9.3 is now available for download.\n\nJava 9.3 is now available for download.\n\nIn this Java application, we will be using Java 8.0.\n\nIn this Java application, we will be using Java 8.0.\n\nIn this Java application, we will be using Java 8.0.\n\nJava 8.0 is now available for download.\n\n\nThis Java application is written in Java 7.0.\n\nIn this Java application, we will be using Java 

Use a specific model from the Model Hub

In [None]:
generator = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-360M")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)

Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': 'In this course, we will teach you how to use LaTeX to write mathematical expressions. LaTeX is a typesetting system used by professional mathematicians, physicists, and scientists.\n\nTo use LaTeX, you need to download a program called “LaTeX” and install it on your computer. Once you have LaTeX installed, you can use it to create beautiful mathematical expressions.\n\nTo get started, you need to download the LaTeX package. You can do this by typing:\n\n`sudo apt-get install latex-extra`\n\nOnce LaTeX is installed, you can open a text editor and type in some LaTeX code. LaTeX code is written in a special language called “LaTeX syntax.”\n\nYou can also use LaTeX to create mathematical expressions by typing:\n\n`\\begin{equation} \\sum_{i=1}^{n} a_i \\end{equation}`\n\nIn this code, “a” is a symbol for a mathematical expression. The “n” is a number that represents the size of the set, and “i” is a variable that represents the index of the set. The “\\sum” symbol repr

In [None]:
image_classifier = pipeline(
    task="image-classification", model="google/vit-base-patch16-224"
)
result = image_classifier(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
)
print(result)

Fast image processor class <class 'transformers.models.vit.image_processing_vit_fast.ViTImageProcessorFast'> is available for this model. Using slow image processor class. To use the fast image processor class set `use_fast=True`.
Device set to use cpu


[{'label': 'lynx, catamount', 'score': 0.43349990248680115}, {'label': 'cougar, puma, catamount, mountain lion, painter, panther, Felis concolor', 'score': 0.03479622304439545}, {'label': 'snow leopard, ounce, Panthera uncia', 'score': 0.032401926815509796}, {'label': 'Egyptian cat', 'score': 0.023944783955812454}, {'label': 'tiger cat', 'score': 0.02288925088942051}]


In [None]:
from transformers import pipeline

transcriber = pipeline(
    task="automatic-speech-recognition", model="openai/whisper-base.en"
)
transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac")
# Output: {'text': ' I have a dream that one day this nation will rise up and live out the true meaning of its creed.'}

Device set to use cpu


{'text': ' I have a dream that one day this nation will rise up and live out the true meaning of its creed.'}

## Behind the pipeline

Install three libraries in the environment: datasets, evaluate, and transformers[sentencepiece].

### 1) Preprocessing with a tokenizer

#### for general cases

DEFINITION

Convert the **text** inputs into **numbers** that the model can make sense of:
1. Split the input into words, subwords, or symbols ("tokens")
2. Map each token to an integer
3. Add any additional inputs the model needs
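
The three steps above can be sketched with a toy tokenizer. This is only an illustration: the vocabulary, the splitting rule, and the names `TOY_VOCAB`, `toy_tokenize`, and `toy_encode` are made up here and do not match any real checkpoint, which would typically use subword splitting and a much larger vocabulary.

```python
# Toy illustration only: this vocabulary and splitting rule are hypothetical,
# not those of any real checkpoint.
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "i": 4, "hate": 5, "this": 6, "so": 7, "much": 8, "!": 9,
}


def toy_tokenize(text):
    """Step 1: split text into lowercase word and punctuation tokens."""
    tokens, current = [], ""
    for ch in text.lower():
        if ch.isalnum():
            current += ch
        else:
            if current:
                tokens.append(current)
                current = ""
            if not ch.isspace():
                tokens.append(ch)
    if current:
        tokens.append(current)
    return tokens


def toy_encode(text):
    """Steps 2 and 3: map tokens to integers, then add the extra inputs."""
    tokens = ["[CLS]"] + toy_tokenize(text) + ["[SEP]"]
    # Unknown tokens fall back to the [UNK] id.
    input_ids = [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}


encoded = toy_encode("I hate this so much!")
print(encoded)
```

The special tokens and the attention mask play the role of the "additional inputs" in step 3.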


KEY POINTS

The tokenizer and the model should always come from the same checkpoint.

CODE EXAMPLE

Import the AutoTokenizer class from the transformers library.

In [None]:
from transformers import AutoTokenizer

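(A "checkpoint" provided on platforms such as Hugging Face is usually the final state of a trained model, or a key intermediate point, published so that others can download and use it directly.)

The from_pretrained() method automatically fetches the tokenizer data that goes with the given checkpoint and returns a tokenizer object (an instance of the concrete tokenizer class for that checkpoint).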

In [None]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Then pass our raw text to the tokenizer.
Here tokenizer is an object whose class defines a `__call__` method, so tokenizer(...) actually invokes that method.

**padding=True, truncation=True** (explained later)



In [None]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}


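**"input_ids"** contains two rows of integers (one for each sentence) that are the unique identifiers of the tokens in each sentence.

**"attention_mask"** has the same shape: 1 marks a real token and 0 marks padding that the model should ignore.

This completes the first step of pipeline(): preprocessing.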

#### for some special cases

##### batching

A single sentence must be converted into a two-dimensional (batched) input.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)
input_ids = torch.tensor([ids])
print(input_ids)

[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]
tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012]])


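Notes on input_ids = torch.tensor([ids]):

**1. Converting the Python list into a PyTorch tensor has several important benefits:**

* GPU acceleration: PyTorch tensors can run on a GPU, which greatly speeds up computation, especially for deep learning models.

* Batch processing: tensors represent batched data conveniently, so the model can process several input sequences at once.

* Automatic differentiation: PyTorch tensors support autograd, which is essential for training neural networks.

* Model compatibility: Hugging Face models expect tensor inputs; using Python lists directly requires an extra conversion.

* Efficient memory layout: tensors are stored more efficiently in memory, which suits numerical computation.

**2. An extra pair of [ ] is needed to make the input two-dimensional**, because the model expects a batch dimension.

3. Note the difference between the output above and the one below: the output above has no [CLS] (101) and [SEP] (102) tokens.

Simplified version of the code: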

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

# return_tensors="pt" asks for PyTorch tensors; tokenized_inputs is a dictionary-like object holding every input field the model needs
tokenized_inputs = tokenizer(sequence, return_tensors="pt")
print(tokenized_inputs["input_ids"])  # ["input_ids"] is a dictionary lookup
print(tokenized_inputs)  # prints every key: value pair, including "input_ids"

tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102]])
{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}


##### padding

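For several sentences of different lengths, pad the shorter ones to the same length, and set the attention mask to 0 at the padded positions so that the padding does not affect the result.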

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = ["I've been waiting for a HuggingFace course my whole life.", "life sucks,bro"]
tokenized_inputs = tokenizer(sequence, return_tensors="pt", padding=True, truncation=True)  # note: True must be capitalized
print(tokenized_inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  2166, 19237,  1010, 22953,   102,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}
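
What padding=True does to the ids and the mask can be sketched in plain Python. This is a simplified stand-in, not the library's implementation; `pad_batch` is a hypothetical helper, and the pad id 0 is taken from the outputs shown in this notebook.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every list of token ids to the longest one and build the mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 for real tokens, 0 for padding positions.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token ids copied from the tokenizer outputs shown earlier.
batch = pad_batch([
    [101, 2166, 19237, 1010, 22953, 102],
    [101, 1045, 5223, 2023, 2061, 2172, 999, 102],
])
print(batch)
```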


##### truncating

Used to handle sentences that are too long; for usage, see the truncation=True argument above.
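
As a rough sketch of the effect of truncation on a single sentence, the helper below cuts a list of ids to a maximum length while keeping the closing [SEP] token. `truncate_ids` is a hypothetical name and the logic is a simplification. In transformers the limit is set through the tokenizer's max_length argument together with truncation=True.

```python
def truncate_ids(ids, max_length, sep_id=102):
    """Cut a token-id list to max_length, keeping the closing [SEP] token."""
    if max_length < 2:
        raise ValueError("max_length must leave room for [CLS] and [SEP]")
    if len(ids) <= max_length:
        return list(ids)
    # Keep the first max_length - 1 ids (including [CLS]) and re-append [SEP].
    return ids[: max_length - 1] + [sep_id]


# Ids of the first example sentence, copied from the output shown earlier.
ids = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
       2607, 2026, 2878, 2166, 1012, 102]
short_ids = truncate_ids(ids, max_length=8)
print(short_ids)
```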

### 2) Passing the inputs through the model

"inputs" → "hidden states" (features): a high-dimensional vector representing the contextual understanding of that input by the Transformer model.

It generally has three dimensions:
1. Batch size: The number of sequences processed at a time (2 in our example).
2. Sequence length: The length of the numerical representation of the sequence (16 in our example).
3. Hidden size: The vector dimension of each model input (768 in our example).



In [None]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([2, 16, 768])


In [None]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.logits.shape)

torch.Size([2, 2])


### 3) Postprocessing

"logits" → "probabilities" through a softmax layer

In [None]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)


In [None]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}
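
The whole postprocessing step can be sketched without PyTorch. The label mapping is the one printed above; the logits are hypothetical values chosen for illustration, and `softmax` and `postprocess` are helper names introduced here.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # as in model.config.id2label


def postprocess(logits):
    """Pick the most probable class and report its label and score."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


example_logits = [-1.5, 1.5]  # hypothetical logits for one sentence
result = postprocess(example_logits)
print(result)
```

This is the shape of the answer that pipeline("sentiment-analysis") returned at the start: a label plus a score.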

# Fine-tuning a pretrained model

## OVERVIEW

In [None]:
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Same as before
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "This course is amazing!",
]
batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")

# This is new
batch["labels"] = torch.tensor([1, 1])

optimizer = AdamW(model.parameters())
loss = model(**batch).loss
loss.backward()
optimizer.step()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

## Inspecting the training data

We use the MRPC dataset.

Loading it gives a DatasetDict object that contains the training set, the validation set, and the test set.

In [None]:
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets


DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

We can access each pair of sentences in our raw_datasets object by indexing, as with a dictionary.

In [None]:
raw_train_dataset=raw_datasets["train"]
raw_train_dataset[0]

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0}

In [None]:
raw_train_dataset[3667]

{'sentence1': "The 30-year bond US30YT = RR rose 22 / 32 for a yield of 4.31 percent , versus 4.35 percent at Wednesday 's close .",
 'sentence2': 'The 30-year bond US30YT = RR grew 1-3 / 32 for a yield of 4.30 percent , down from 4.35 percent late Wednesday .',
 'label': 0,
 'idx': 4075}

To see which integer corresponds to which label, inspect the features:

In [None]:
raw_train_dataset.features

{'sentence1': Value('string'),
 'sentence2': Value('string'),
 'label': ClassLabel(names=['not_equivalent', 'equivalent']),
 'idx': Value('int32')}

## Preprocessing a dataset (tokenization)

Tokenize the dataset (part of the preprocessing).

In [None]:
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
raw_datasets["train"]["sentence1"][0]
inputs = tokenizer(
    raw_datasets["train"]["sentence1"][0],
    raw_datasets["train"]["sentence2"][0],
    padding=True,
    truncation=True,
)
inputs

{'input_ids': [101, 2572, 3217, 5831, 5496, 2010, 2567, 1010, 3183, 2002, 2170, 1000, 1996, 7409, 1000, 1010, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102, 7727, 2000, 2032, 2004, 2069, 1000, 1996, 7409, 1000, 1010, 2572, 3217, 5831, 5496, 2010, 2567, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

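However, Hugging Face Datasets stores its data on disk in the Apache Arrow format, so you can read just part of the data into memory as needed (which saves memory). If you tokenize all the data directly in one go, you lose this memory-saving advantage.

The Dataset.map() method applies the tokenization function to all our datasets at once.

Dataset.map() calls the function you define on every example in the dataset and returns a new Dataset, without loading the entire dataset into memory at once.

Benefits:

* Saves memory

* The result is still a Dataset, which makes later processing convenient (for example shuffle or split)

* Other preprocessing can be done at the same time (not just tokenization)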
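
With batched=True, the function passed to map() receives a dictionary that maps each column name to a list of values, and it returns new columns of the same length. The sketch below imitates that calling convention on a plain list of rows. It is not how the datasets library is implemented, and `toy_batched_map` and `add_lengths` are hypothetical helpers. The default batch_size of 1000 matches the datasets default.

```python
def toy_batched_map(rows, function, batch_size=1000):
    """Apply `function` to column-wise batches and merge the new columns in."""
    result = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # A batch is a dict mapping each column name to a list of values.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_columns = function(batch)
        for i, row in enumerate(chunk):
            merged = dict(row)
            for key, values in new_columns.items():
                merged[key] = values[i]
            result.append(merged)
    return result


def add_lengths(batch):
    # Stand-in for a tokenize function: one output value per input example.
    return {"n_words": [len(s.split()) for s in batch["sentence1"]]}


rows = [
    {"sentence1": "a b c", "label": 1},
    {"sentence1": "d e", "label": 0},
    {"sentence1": "f", "label": 1},
]
mapped = toy_batched_map(rows, add_lengths, batch_size=2)
print(mapped)
```

Only one chunk of rows is handed to the function at a time, which is why a batched function indexes columns such as "sentence1" and gets lists back.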

In [None]:
def tokenize_function(rawDatasets):
    return tokenizer(rawDatasets["sentence1"], rawDatasets["sentence2"], truncation=True)
# padding=True is left out here because it is inefficient:
# it’s better to pad the samples when we’re building a batch,
# as then we only need to pad to the maximum length in that batch, and not the maximum length in the entire dataset.
# This can save a lot of time and processing power when the inputs have very variable lengths!

In [None]:
tokenized_dataset = raw_datasets.map(tokenize_function, batched=True)
tokenized_dataset


DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})

You can also pass a num_proc argument to map() to apply your preprocessing function with multiple processes, which can speed up preprocessing.

### dynamic padding

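DataCollatorWithPadding automatically pads the sentences of different lengths in a batch to the same length so that the model can process them, and it pads only to the maximum length in that batch, which saves resources.

For the complete code, see the full data-preprocessing walkthrough in the PyCharm project.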
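
The idea of dynamic padding can be sketched in plain Python: each batch is padded only to its own longest example. This is a simplified stand-in for what a padding collator does, not the library code. `collate_with_padding` is a hypothetical helper and the token ids are made up. In transformers the real collator is created with DataCollatorWithPadding(tokenizer=tokenizer).

```python
def collate_with_padding(features, pad_id=0):
    """Pad the examples of ONE batch to that batch's longest example."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_id] * n_pad)
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
        batch["labels"].append(f["label"])
    return batch


# Hypothetical tokenized examples of lengths 4, 6 and 9 (ids are made up).
examples = [
    {"input_ids": [101] + [7] * 2 + [102], "attention_mask": [1] * 4, "label": 1},
    {"input_ids": [101] + [7] * 4 + [102], "attention_mask": [1] * 6, "label": 0},
    {"input_ids": [101] + [7] * 7 + [102], "attention_mask": [1] * 9, "label": 1},
]

# Two batches of at most two examples each.
batches = [collate_with_padding(examples[i:i + 2]) for i in range(0, len(examples), 2)]
padded_lengths = [len(b["input_ids"][0]) for b in batches]
print(padded_lengths)
```

Padding the whole dataset up front would have stretched every example to length 9; here the first batch only grows to length 6.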

## Fine-tuning a model with the trainer API

Compare the running time on a local GPU and on a Colab cloud GPU.

In [None]:
import torch
import time

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = torch.nn.Linear(1000, 1000).to(device)
data = torch.randn(10000, 1000).to(device)

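start = time.time()
for _ in range(100):
    model(data)
if device == 'cuda':
    torch.cuda.synchronize()  # CUDA kernels run asynchronously; wait for them before stopping the timer
print(f"Time: {time.time() - start:.2f}s")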

Time: 57.57s


Step 1: define the training arguments with TrainingArguments. This is the core of the configuration.

In [None]:
from transformers import TrainingArguments
training_args = TrainingArguments("test-trainer")

Step 2: load the model.

In [None]:
from transformers import AutoModelForSequenceClassification
checkpoint = "bert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Step 3: build the Trainer. This is the core wrapper of the whole fine-tuning process; the Trainer brings together all the pieces of model training.

In [None]:
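from transformers import DataCollatorWithPadding, Trainer

# data_collator is not defined earlier in this notebook; build the dynamic-padding collator here
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)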
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    processing_class=tokenizer
)

In [None]:
trainer.train()

Step 4: add evaluation.
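
One way to add evaluation is to give the Trainer a compute_metrics function. The sketch below computes accuracy and F1 by hand, the two metrics usually reported for MRPC, so that it runs without extra libraries. It assumes that eval_pred unpacks into (logits, labels) and that class 1 is the positive class; the example logits and labels are made up. The evaluate library installed earlier offers ready-made metrics as an alternative.

```python
def compute_metrics(eval_pred):
    """Accuracy and F1 for binary classification from (logits, labels)."""
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row.
    preds = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    labels = [int(y) for y in labels]
    pairs = list(zip(preds, labels))
    correct = sum(p == y for p, y in pairs)
    tp = sum(p == 1 and y == 1 for p, y in pairs)
    fp = sum(p == 1 and y == 0 for p, y in pairs)
    fn = sum(p == 0 and y == 1 for p, y in pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"accuracy": correct / len(labels), "f1": f1}


# Hypothetical logits and labels, only to exercise the function.
example_logits = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
example_labels = [1, 0, 0, 1]
metrics = compute_metrics((example_logits, example_labels))
print(metrics)
```

To use it, pass compute_metrics=compute_metrics when constructing the Trainer, then call trainer.evaluate() or set an evaluation strategy in TrainingArguments.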