## 2025/01/02 Reading Group
### Introduction to pipeline + decoding strategies (e.g., do_sample, max_new_tokens, …)

In [1]:
import torch
print(torch.cuda.is_available())  # True means a CUDA GPU is usable
print(torch.cuda.device_count())  # number of available GPUs
print(torch.cuda.get_device_name(0))  # name of GPU 0


True
2
NVIDIA GeForce RTX 3090


In [2]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


In [42]:
import torch
torch.cuda.empty_cache()


In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import numpy as np

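# 1. Load the dataset
dataset = load_dataset("imdb")  # IMDB sentiment-analysis dataset

# 2. Load the pretrained tokenizer
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 3. Preprocess the data
def preprocess_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

encoded_dataset = dataset.map(preprocess_function, batched=True)

# 4. Load the pretrained model
num_labels = 2  # two classes (positive, negative)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=num_labels,
    id2label={0: "NEGATIVE", 1: "POSITIVE"},  # custom label names
    label2id={"NEGATIVE": 0, "POSITIVE": 1},
)

# Note: assigning to the config after from_pretrained() does not change the
# Dropout layers that were already built; to actually train with 0.3, pass
# hidden_dropout_prob=0.3 to from_pretrained() instead.
model.config.hidden_dropout_prob = 0.3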


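# 5. Training arguments
training_args = TrainingArguments(
    output_dir="./results",           # where checkpoints are saved
    evaluation_strategy="epoch",      # evaluate after every epoch
    save_strategy="epoch",            # save a checkpoint after every epoch
    learning_rate=3e-5,               # learning rate
    lr_scheduler_type="linear",       # linear decay
    per_device_train_batch_size=16,   # training batch size per device
    per_device_eval_batch_size=16,    # evaluation batch size per device
    num_train_epochs=5,               # number of training epochs
    weight_decay=0.01,                # weight decay
    logging_dir='./logs',             # where logs are written
    logging_steps=10,                 # log every 10 steps
    load_best_model_at_end=True,      # reload the best checkpoint when training ends
    metric_for_best_model="accuracy", # pick the best checkpoint by accuracy
    fp16=True                         # enable mixed-precision training
)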

# 6. Metric computation function
def compute_metrics(eval_pred):
    from sklearn.metrics import f1_score, precision_score, recall_score
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    f1 = f1_score(labels, predictions, average='weighted')
    precision = precision_score(labels, predictions, average='weighted')
    recall = recall_score(labels, predictions, average='weighted')
    accuracy = accuracy_score(labels, predictions)
    return {"accuracy": accuracy, "f1": f1, "precision": precision, "recall": recall}


# 7. Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_dataset["train"].shuffle(seed=42).select(range(2000)),  # train on a subset to save time
    eval_dataset=encoded_dataset["test"].shuffle(seed=42).select(range(500)),     # evaluate on a subset to save time
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)


trainer.train()

trainer.save_model("./saved_model")


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.359 | 0.292282 | 0.892 | 0.891991 | 0.892697 | 0.892 |
| 2 | 0.1995 | 0.298173 | 0.886 | 0.885806 | 0.890071 | 0.886 |
| 3 | 0.1873 | 0.375941 | 0.876 | 0.875128 | 0.889475 | 0.876 |
| 4 | 0.0586 | 0.408484 | 0.904 | 0.904006 | 0.904128 | 0.904 |
| 5 | 0.0456 | 0.445646 | 0.9 | 0.900005 | 0.900038 | 0.9 |
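
In the results above the Recall column is identical to the Accuracy column. That follows from `average='weighted'` in `compute_metrics`: each class's score is weighted by its support, and support-weighted recall reduces to overall accuracy. A minimal pure-Python sketch of that averaging, assuming made-up labels (the helper `weighted_prf` and the toy data are hypothetical, not part of the notebook):

```python
from collections import Counter

def weighted_prf(labels, predictions):
    """Support-weighted precision, recall, and F1 (hypothetical helper)."""
    support = Counter(labels)          # number of true examples per class
    total = len(labels)
    precision = recall = f1 = 0.0
    for cls, n_cls in support.items():
        tp = sum(1 for y, p in zip(labels, predictions) if y == cls and p == cls)
        predicted = sum(1 for p in predictions if p == cls)
        p_cls = tp / predicted if predicted else 0.0
        r_cls = tp / n_cls
        f_cls = 2 * p_cls * r_cls / (p_cls + r_cls) if (p_cls + r_cls) else 0.0
        weight = n_cls / total         # class weight = share of true examples
        precision += weight * p_cls
        recall += weight * r_cls
        f1 += weight * f_cls
    return precision, recall, f1

# Toy data, invented for illustration only.
labels = [0, 0, 0, 1, 1, 1, 1, 1]
predictions = [0, 0, 1, 1, 1, 1, 0, 1]

precision, recall, f1 = weighted_prf(labels, predictions)
accuracy = sum(y == p for y, p in zip(labels, predictions)) / len(labels)
print(precision, recall, f1, accuracy)
```

Because weighted recall carries no information beyond accuracy, macro averaging (`average='macro'`) may be a more informative choice when classes are imbalanced.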




In [38]:
from transformers import pipeline
pipe = pipeline("text-classification", model="./saved_model", tokenizer=tokenizer, device=0)
print(pipe(["I love this movie, it was fantastic!", "I hate this movie, it was bad!"]))



[{'label': 'POSITIVE', 'score': 0.9969589710235596}, {'label': 'NEGATIVE', 'score': 0.9957774877548218}]
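
The `score` in each result is a probability derived from the model's output logits. For a single-label classifier with two or more labels, the text-classification pipeline applies a softmax and reports the top class. A rough sketch of that post-processing step, assuming invented logits (the values and the `softmax` helper below are illustrative, not taken from the trained model):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # same mapping as in the training cell
logits = [-2.9, 2.9]                       # made-up logits for one input

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the top class
result = {"label": id2label[best], "score": probs[best]}
print(result)
```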


In [35]:
pipe = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english", device=0)
result = pipe("This is a terrible product.")
print(result)


[{'label': 'NEGATIVE', 'score': 0.9996050000190735}]


In [36]:
pipe = pipeline("text-generation", device=0)
result = pipe("This is a terrible product.")
print(result)

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


[{'generated_text': 'This is a terrible product. It\'s just stupid. The packaging design is horrible. It\'s not a good product. And I really wish they would make sure they made sure this would happen."\n\nShe said they had no reason to stop with'}]


In [22]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")
model.generation_config

GenerationConfig {
  "bos_token_id": 50256,
  "eos_token_id": 50256
}

In [None]:
from transformers import pipeline

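# Load GPT-2 as the generator (the later cells call it `generator`)
generator = pipeline("text-generation", model="gpt2", device=0)  # run on GPU 0

result = generator(
    "Once upon a time,",
    truncation=True,  # truncate over-long inputs automatically
    max_length=10,    # maximum total length (prompt + generated tokens)
    do_sample=False,  # no random sampling (deterministic decoding)
    num_beams=1       # number of beams (1 = greedy search)
)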
print(result[0]["generated_text"])


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Once upon a time, the world was a place
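
With `do_sample=False` and `num_beams=1`, decoding is greedy: at each step the single most probable next token is appended. Note that `max_length` counts the prompt tokens as well, while `max_new_tokens` counts only the generated ones. A toy sketch of both ideas, assuming a hand-written next-token table (`TOY_MODEL` and `greedy_generate` are hypothetical stand-ins, not GPT-2 or the transformers API):

```python
# Hypothetical next-token distributions, keyed by the previous token.
TOY_MODEL = {
    "time,": {"the": 0.5, "I": 0.3, "a": 0.2},
    "the": {"world": 0.6, "king": 0.4},
    "world": {"was": 0.7, "is": 0.3},
    "was": {"a": 0.8, "the": 0.2},
    "a": {"place": 0.9, "time,": 0.1},
    "place": {"<eos>": 1.0},
}

def greedy_generate(prompt, max_length=20, max_new_tokens=None):
    tokens = list(prompt)
    # max_length bounds prompt + new tokens; max_new_tokens bounds new tokens only.
    if max_new_tokens is not None:
        limit = len(tokens) + max_new_tokens
    else:
        limit = max_length
    while len(tokens) < limit:
        dist = TOY_MODEL.get(tokens[-1])
        if dist is None:
            break
        next_token = max(dist, key=dist.get)  # greedy: take the argmax
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens

prompt = ["Once", "upon", "a", "time,"]
by_total = greedy_generate(prompt, max_length=6)     # 4 prompt + 2 new tokens
by_new = greedy_generate(prompt, max_new_tokens=6)   # stops early at <eos>
print(" ".join(by_total))
print(" ".join(by_new))
```

Setting `max_new_tokens` is usually less surprising for long prompts, since the budget does not shrink as the prompt grows.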


In [26]:
result = generator(
    "Once upon a time,", 
    max_length=50, 
    num_beams=5,          
    early_stopping=True,  # stop as soon as num_beams finished candidates exist
    length_penalty=1.2    # exponent on sequence length; > 0 favors longer sequences
)
print(result[0]["generated_text"])


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Once upon a time, I was given the opportunity to speak with a number of experts on the subject. One of them was Professor of Psychology at the University of California, San Diego, and the other was Professor of Psychology at the University of California,
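
Beam search keeps the `num_beams` best partial sequences at each step instead of committing to the single best token, so it can find a sequence with higher overall probability than greedy search. The sketch below is a simplified version under stated assumptions: the toy table `TOY_MODEL` and both functions are hypothetical, and finished hypotheses are scored as the summed log-probability divided by `length ** length_penalty`, which is an approximation of how transformers ranks beams rather than its exact implementation.

```python
import math

# Hypothetical next-token distributions, keyed by the previous token.
TOY_MODEL = {
    "<s>": {"A": 0.6, "B": 0.4},
    "A": {"x": 0.5, "y": 0.5},
    "B": {"z": 0.9, "w": 0.1},
    "x": {"<eos>": 1.0}, "y": {"<eos>": 1.0},
    "z": {"<eos>": 1.0}, "w": {"<eos>": 1.0},
}

def greedy(start, max_new_tokens=5):
    tokens = [start]
    for _ in range(max_new_tokens):
        dist = TOY_MODEL[tokens[-1]]
        tokens.append(max(dist, key=dist.get))
        if tokens[-1] == "<eos>":
            break
    return tokens

def beam_search(start, num_beams=2, max_new_tokens=5, length_penalty=1.0):
    def score(tokens, logp):
        generated = max(1, len(tokens) - 1)   # exclude the start token
        return logp / generated ** length_penalty

    beams = [([start], 0.0)]                  # (tokens, cumulative log-prob)
    finished = []
    for _ in range(max_new_tokens):
        candidates = [
            (tokens + [tok], logp + math.log(p))
            for tokens, logp in beams
            for tok, p in TOY_MODEL[tokens[-1]].items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, logp in candidates[:num_beams]:
            if tokens[-1] == "<eos>":
                finished.append((tokens, score(tokens, logp)))
            else:
                beams.append((tokens, logp))
        if not beams:
            break
    # Hypotheses still open at the length limit compete with finished ones.
    finished += [(tokens, score(tokens, logp)) for tokens, logp in beams]
    return max(finished, key=lambda c: c[1])[0]

greedy_result = greedy("<s>")
beam_result = beam_search("<s>", num_beams=2)
print(greedy_result, beam_result)
```

Here greedy search commits to `A` (probability 0.6) and ends with a sequence of probability 0.3, while the beam keeps `B` alive and finds the sequence with probability 0.36.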


In [49]:
result = generator(
    "Once upon a time,", 
    min_length=10, 
    do_sample=True,   
    temperature=0.8, 
    top_k=10,
    repetition_penalty=1.2,  # penalize tokens that have already been generated
    no_repeat_ngram_size=2   # never repeat the same bigram
)
print(result[0]["generated_text"])


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Once upon a time, I have always wanted to write about what makes the world tick when we see it.
I've been working in an art studio for many years now and my wife recently moved into our new home on Lake Michigan from another city
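
With `do_sample=True`, the next token is drawn at random from a reshaped distribution. The sketch below illustrates three of the knobs used above on invented logits: a repetition penalty that lowers the scores of already-generated tokens (dividing positive logits and multiplying negative ones by the penalty), temperature scaling, and top-k filtering. The vocabulary, logits, and helper names are hypothetical, and the order of operations is an assumption about typical usage rather than a statement of the library's internals.

```python
import math
import random

def apply_repetition_penalty(logits, previous_ids, penalty):
    """Lower the score of every token id that was already generated."""
    out = list(logits)
    for i in set(previous_ids):
        out[i] = out[i] * penalty if out[i] < 0 else out[i] / penalty
    return out

def top_k_temperature_probs(logits, temperature, top_k):
    """Scale by temperature, keep the top_k tokens, renormalize with softmax."""
    scaled = [x / temperature for x in logits]
    keep = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    m = max(scaled[i] for i in keep)
    exps = {i: math.exp(scaled[i] - m) for i in keep}
    total = sum(exps.values())
    # Tokens outside the top_k get probability 0.
    return [exps.get(i, 0.0) / total for i in range(len(logits))]

vocab = ["the", "a", "I", "world", "time"]   # toy vocabulary
logits = [2.0, 1.5, 1.0, 0.2, -1.0]          # made-up next-token logits

penalized = apply_repetition_penalty(logits, previous_ids=[0, 4], penalty=1.2)
probs = top_k_temperature_probs(penalized, temperature=0.8, top_k=3)

rng = random.Random(0)                        # fixed seed for reproducibility
sampled = rng.choices(vocab, weights=probs, k=1)[0]
print(probs, sampled)
```

A temperature below 1 sharpens the distribution toward the most likely tokens, while a small `top_k` removes the low-probability tail entirely; both reduce randomness in different ways.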
