# Using Transformers

## Pipeline examples
A pipeline connects a model with its necessary preprocessing and postprocessing steps, so we can pass in raw text directly and get an intelligible answer.

In [None]:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")

In [None]:
classifier("I've been waiting for this sucking day my whole life.")

Zero-shot classification


In [None]:
classifier = pipeline("zero-shot-classification")

In [None]:
classifier(
    "This is a course about the business",
    candidate_labels=["education", "politics", "business"],
)

In [None]:
generator = pipeline("text-generation")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)


Use a model from the Model Hub

In [None]:
generator = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-360M")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)

In [None]:
image_classifier = pipeline(
    task="image-classification", model="google/vit-base-patch16-224"
)
result = image_classifier(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
)
print(result)

In [None]:
from transformers import pipeline

transcriber = pipeline(
    task="automatic-speech-recognition", model="openai/whisper-base.en"
)
transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac")
# Output: {'text': ' I have a dream that one day this nation will rise up and live out the true meaning of its creed.'}

## Behind the pipeline

Install three libraries in the environment: datasets, evaluate, and transformers[sentencepiece].

### 1) Preprocessing with a tokenizer

#### For general cases

DEFINITION

Convert the **text** inputs into **numbers** that the model can make sense of:
1. inputs → words, subwords, and symbols ("tokens")
2. tokens → integers
3. adding additional inputs that may be useful to the model
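
The three steps above can be sketched with a toy tokenizer in plain Python. This is only an illustration: the vocabulary, the whitespace splitting, and the name toy_tokenize are hypothetical, and real tokenizers rely on learned subword vocabularies instead.

```python
# Toy illustration of the three tokenizer steps (not a real tokenizer).
# The vocabulary and ids below are made up for this sketch.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "i": 4, "hate": 5, "this": 6, "so": 7, "much": 8, "!": 9}


def toy_tokenize(text):
    # Step 1: split the input into tokens (here: lowercase words and "!").
    tokens = text.lower().replace("!", " !").split()
    # Step 2: map each token to an integer id, falling back to [UNK].
    ids = [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    # Step 3: add the additional inputs: special tokens and an attention mask.
    input_ids = [TOY_VOCAB["[CLS]"]] + ids + [TOY_VOCAB["[SEP]"]]
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}


encoded = toy_tokenize("I hate this so much!")
print(encoded["input_ids"])  # [2, 4, 5, 6, 7, 8, 9, 3]
```

A word that is missing from the toy vocabulary maps to the [UNK] id, which is roughly what a fixed-vocabulary tokenizer has to do when it cannot split a word any further.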


KEY POINTS

The tokenizer and the model should always come from the same checkpoint.

CODE EXAMPLE

Import the AutoTokenizer class from the transformers library.

In [None]:
from transformers import AutoTokenizer

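(A "checkpoint" published on platforms such as Hugging Face is usually the final state of a finished training run, or a key intermediate point, made available so that others can download and use it directly.)

The from_pretrained() method automatically fetches the tokenizer data that belongs to the given checkpoint and returns a tokenizer object (an instance of the concrete tokenizer class that matches the checkpoint).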

In [None]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Then pass the raw text to the tokenizer.
Here tokenizer is a tokenizer instance whose class defines a `__call__` method, so writing tokenizer(...) actually invokes that method.
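
How calling an instance works can be seen with a small stand-alone class. The class name EchoTokenizer and its behavior are hypothetical and unrelated to the transformers API; the sketch only shows that obj(...) is shorthand for obj.__call__(...).

```python
class EchoTokenizer:
    """Hypothetical stand-in used only to demonstrate __call__."""

    def __call__(self, text):
        # Runs whenever the instance is used like a function.
        return text.split()


toy = EchoTokenizer()
# The two calls below are equivalent.
via_call_syntax = toy("life is short")
via_dunder = toy.__call__("life is short")
print(via_call_syntax == via_dunder)  # True
```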

**padding=True, truncation=True** (explained later)



In [None]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

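**"input_ids"**: contains two rows of integers (one for each sentence) that are the unique identifiers of the tokens in each sentence.

**"attention_mask"**: marks which positions hold real tokens (1) and which hold padding (0), so the model can ignore the padding.

At this point the first step of pipeline(), preprocessing, is complete.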

#### For some special cases

##### Batching

A single sentence has to be converted into a two-dimensional (batched) input.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

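sequence = "I've been waiting for a HuggingFace course my whole life."

# Tokenize manually: text -> tokens -> ids (no special tokens are added here)
tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)
# Wrap ids in a list to add the batch dimension the model expects
input_ids = torch.tensor([ids])
print(input_ids)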

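Notes on input_ids = torch.tensor([ids]):

**1. Converting a Python list to a PyTorch tensor has several important benefits:**

* GPU acceleration: PyTorch tensors can run on the GPU, which greatly speeds up computation, especially for deep learning models.

* Batching: tensors conveniently represent batched data, so the model can process several input sequences at once.

* Automatic differentiation: PyTorch tensors support automatic differentiation, which is essential for training neural networks.

* Model compatibility: Hugging Face models expect tensor inputs, so plain Python lists need an extra conversion step.

* Optimized memory layout: tensors are stored more efficiently in memory, which suits numerical computation.

**2. Add an extra pair of [ ] to make the input two-dimensional**, because the model expects a batch of sequences even when there is only one.

3. Note how the printed result differs from the one below: the IDs above contain no [CLS] and [SEP] tokens, because tokenize() and convert_tokens_to_ids() do not add special tokens.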
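
The effect of the extra brackets can be checked without PyTorch by looking at nested lists. The ids below are placeholder values, and nested_shape is a hypothetical helper written only for this sketch.

```python
def nested_shape(value):
    # Walk down the first element of each nesting level to recover the shape.
    shape = []
    while isinstance(value, list):
        shape.append(len(value))
        value = value[0] if value else None
    return tuple(shape)


ids = [7, 8, 9, 10]  # placeholder ids, not from a real tokenizer
print(nested_shape(ids))    # (4,)   one dimension: a bare sequence
print(nested_shape([ids]))  # (1, 4) two dimensions: a batch holding one sequence
```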

Simplified version of the code

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

# return_tensors="pt" returns PyTorch tensors; tokenized_inputs is a dict-like
# object containing every input field the model needs
tokenized_inputs = tokenizer(sequence, return_tensors="pt")
print(tokenized_inputs["input_ids"])  # ["input_ids"] is a dictionary-style lookup
print(tokenized_inputs)  # prints every key-value pair, including "input_ids"

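##### Padding

For several sentences of different lengths, pad the shorter ones to a common length, and set the attention mask to 0 at the padded positions so that the padding does not affect the result.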
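
A minimal plain-Python sketch of padding with an attention mask follows. The helper pad_batch, the pad id of 0, and the right-side padding are assumptions made for the illustration; a real tokenizer takes these from its own configuration.

```python
def pad_batch(sequences, pad_id=0):
    # Pad every sequence to the length of the longest one in the batch.
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[5, 6, 7, 8], [9, 10]])
print(batch["input_ids"])       # [[5, 6, 7, 8], [9, 10, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```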

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = ["I've been waiting for a HuggingFace course my whole life.", "life sucks,bro"]
# Note: True must be capitalized in Python
tokenized_inputs = tokenizer(sequence, return_tensors="pt", padding=True, truncation=True)
print(tokenized_inputs)

##### Truncating

Used to handle sentences that are too long for the model; see the truncation=True argument in the call above.
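
As a rough illustration of what truncation does to a list of ids, the sketch below cuts a sequence down to a maximum length while keeping its final id, which stands in for an end marker such as [SEP]. The function name, the ids, and the keep-the-last-id rule are assumptions for this example and not the exact strategy of any real tokenizer.

```python
def truncate(ids, max_length):
    if max_length < 1:
        raise ValueError("max_length must be at least 1")
    if len(ids) <= max_length:
        return list(ids)
    # Keep the first max_length - 1 ids plus the final id (the end marker).
    return ids[:max_length - 1] + ids[-1:]


long_ids = [2, 10, 11, 12, 13, 3]  # made-up ids; 2 and 3 play the role of start/end markers
print(truncate(long_ids, 4))   # [2, 10, 11, 3]
print(truncate(long_ids, 10))  # [2, 10, 11, 12, 13, 3]
```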

### 2) Passing the inputs through the model

"inputs" → "hidden states" (features): high-dimensional vectors representing the Transformer model's contextual understanding of that input.

The output generally has three dimensions:
1. Batch size: The number of sequences processed at a time (2 in our example).
2. Sequence length: The length of the numerical representation of the sequence (16 in our example).
3. Hidden size: The vector dimension of each model input.



In [None]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

In [None]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.logits.shape)

### 3) Postprocessing

"logits" → "probabilities" through a softmax layer

In [None]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)
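
What the softmax call computes can be written out in plain Python. The logits below are made-up values for two classes, and the function is a sketch of the standard formula, not the PyTorch implementation.

```python
import math


def softmax(logits):
    # Subtract the max for numerical stability; the result is unchanged.
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([-1.5, 1.5])  # made-up logits for two classes
print(round(sum(probs), 6))  # 1.0
print(probs[1] > probs[0])   # True
```

The outputs are non-negative and sum to 1, which is why they can be read as probabilities while raw logits cannot.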

In [None]:
model.config.id2label
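
To turn probabilities into readable labels, take the index of the largest probability in each row and look it up in the id-to-label mapping. In the sketch below both the mapping and the probabilities are written by hand for illustration; the real mapping should be read from model.config.id2label.

```python
id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # hand-written for this sketch
predictions = [[0.04, 0.96], [0.99, 0.01]]  # made-up probabilities

labels = []
for row in predictions:
    best = max(range(len(row)), key=row.__getitem__)  # index of the largest value
    labels.append(id2label[best])
print(labels)  # ['POSITIVE', 'NEGATIVE']
```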

# Fine-tuning a pretrained model

## OVERVIEW

In [None]:
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Same as before
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "This course is amazing!",
]
batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")

# This is new
batch["labels"] = torch.tensor([1, 1])

optimizer = AdamW(model.parameters())
loss = model(**batch).loss
loss.backward()
optimizer.step()

## Inspect the training data

We use the MRPC dataset.

Loading it returns a DatasetDict object that contains the training set, the validation set, and the test set.

In [None]:
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets

We can access each pair of sentences in our raw_datasets object by indexing (like a dictionary).

In [None]:
raw_train_dataset = raw_datasets["train"]
raw_train_dataset[0]

In [None]:
raw_train_dataset[3667]

To see which integer corresponds to which label, inspect the features:

In [None]:
raw_train_dataset.features

## Preprocessing a dataset (tokenization)

Tokenize the dataset (part of the preprocessing).

In [None]:
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
first_sentence1 = raw_datasets["train"]["sentence1"][0]
first_sentence2 = raw_datasets["train"]["sentence2"][0]
# Passing two sentences tokenizes them together as one pair
inputs = tokenizer(first_sentence1, first_sentence2, padding=True, truncation=True)
inputs

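However, a key feature of Hugging Face Datasets is that the data is stored on disk in Apache Arrow format, so you can read only the part you need into memory (which saves memory). If you tokenize the entire dataset in one go, you lose this memory-saving advantage.

The Dataset.map() method applies the tokenization function to all our datasets at once.

Dataset.map() calls the function you define on every example in the dataset and returns a new Dataset, without loading the whole dataset into memory at once.

Benefits:

* Saves memory

* The result is still a Dataset, which is convenient for later processing (for example shuffle or split)

* Other preprocessing can be done in the same pass (not only tokenization)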
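
The idea behind a batched map can be sketched in plain Python: the function receives one slice of the columns at a time and returns new columns of the same length. toy_map and add_lengths are hypothetical names, and this is not how the datasets library is implemented (the sketch keeps everything in ordinary lists in memory); it only mirrors the calling convention.

```python
def toy_map(dataset, function, batch_size=2):
    # dataset is a dict of equal-length columns, e.g. {"sentence1": [...]}.
    n_rows = len(next(iter(dataset.values())))
    result = {name: list(column) for name, column in dataset.items()}
    new_columns = {}
    for start in range(0, n_rows, batch_size):
        # Only one slice of the data is handed to the function at a time.
        batch = {name: col[start:start + batch_size] for name, col in dataset.items()}
        for name, values in function(batch).items():
            new_columns.setdefault(name, []).extend(values)
    result.update(new_columns)
    return result


def add_lengths(batch):
    # Receives lists of values and returns a new column of the same length.
    return {"n_words": [len(s.split()) for s in batch["sentence1"]]}


data = {"sentence1": ["a b c", "d e", "f"]}
print(toy_map(data, add_lengths)["n_words"])  # [3, 2, 1]
```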

In [None]:
def tokenize_function(example):
    # Tokenize each sentence pair; truncate pairs that exceed the model's maximum length
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)
# padding=True is left out here because it is inefficient:
# it's better to pad the samples when we're building a batch,
# as then we only need to pad to the maximum length in that batch, and not the maximum length in the entire dataset.
# This can save a lot of time and processing power when the inputs have very variable lengths!

In [None]:
tokenized_dataset = raw_datasets.map(tokenize_function, batched=True)
tokenized_dataset

You can also pass a num_proc argument to map() to run the preprocessing function in several processes, which can speed up preprocessing.

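### Dynamic padding

DataCollatorWithPadding automatically pads the sentences of different lengths within a batch to the same length so the model can process them, and it pads only to the longest sequence in that batch, which saves resources.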
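
The saving from padding per batch can be estimated with a small calculation. The sketch below counts the token slots needed when every sequence is padded to the dataset-wide maximum versus to the maximum of its own batch; padded_size and the sequence lengths are made up for the illustration.

```python
def padded_size(lengths, batch_size, dynamic):
    # Count the total number of token slots after padding.
    global_max = max(lengths)
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        # Dynamic padding targets the longest sequence of this batch only.
        target = max(batch) if dynamic else global_max
        total += target * len(batch)
    return total


lengths = [5, 7, 6, 50]  # made-up sequence lengths
print(padded_size(lengths, batch_size=2, dynamic=False))  # 200
print(padded_size(lengths, batch_size=2, dynamic=True))   # 114
```

With one very long sequence in the data, padding everything to the global maximum nearly doubles the work in this toy case, while dynamic padding only pays for the long sequence in the batch that contains it.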

For the complete code, see the PyCharm project "full data preprocessing workflow".

## Fine-tuning a model with the Trainer API

Compare the run time on a local GPU with the run time on a Colab cloud GPU.

In [None]:
import torch
import time

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = torch.nn.Linear(1000, 1000).to(device)
data = torch.randn(10000, 1000).to(device)

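if device == 'cuda':
    torch.cuda.synchronize()  # finish pending GPU work before starting the timer
start = time.time()
with torch.no_grad():  # gradients are not needed for a timing benchmark
    for _ in range(100):
        model(data)
if device == 'cuda':
    torch.cuda.synchronize()  # CUDA kernels run asynchronously; wait before stopping the timer
print(f"Time: {time.time() - start:.2f}s")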

Step 1: define the training arguments with TrainingArguments, the core of the configuration.

In [None]:
from transformers import TrainingArguments
training_args = TrainingArguments("test-trainer")

Step 2: load the model.

In [None]:
from transformers import AutoModelForSequenceClassification
checkpoint = "bert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Step 3: build the Trainer. This is the core wrapper of the whole fine-tuning process; the Trainer ties all the parts of model training together.

In [None]:
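from transformers import DataCollatorWithPadding, Trainer

# data_collator is not defined earlier in this notebook, so build it from the tokenizer
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    processing_class=tokenizer,
)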

In [None]:
trainer.train()

Step 4: add evaluation.
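
Evaluation typically means turning the model's logits into predicted labels and comparing them with the reference labels. A minimal plain-Python sketch of that comparison is given here; the function name accuracy_and_f1 and the sample values are hypothetical, and in practice the evaluate library installed earlier can supply the metric computation.

```python
def accuracy_and_f1(logits, labels, positive=1):
    # Predicted class = index of the largest logit in each row.
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    pairs = list(zip(preds, labels))
    correct = sum(p == y for p, y in pairs)
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)
    # F1 = 2*TP / (2*TP + FP + FN); reported as 0.0 when that denominator is zero.
    denom = 2 * tp + fp + fn
    return {"accuracy": correct / len(labels), "f1": 2 * tp / denom if denom else 0.0}


logits = [[0.2, 1.3], [2.0, -1.0], [0.1, 0.4], [1.5, 0.3]]  # made-up values
labels = [1, 0, 0, 1]
print(accuracy_and_f1(logits, labels))  # {'accuracy': 0.5, 'f1': 0.5}
```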