# <b>Hugging Face</b>
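Hugging Face website: https://huggingface.co/

Hugging Face is an open platform for the AI community that hosts and shares models and datasets, much as GitHub hosts code. The Hub carries many models from leading research groups, covering NLP, computer vision, speech processing, and other areas. Users can access, share, and fine-tune models there, which speeds up collaboration and innovation in AI.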

In [None]:
!pip install datasets

Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.9.0-py3-none-any.whl (1

### <b>1 Data Preprocessing</b>


label_map = {0: "Negative", 1: "Neutral", 2: "Positive"}

In [None]:
from datasets import load_dataset

# Load the Financial PhraseBank dataset
'''
Arguments to load_dataset:
    - Dataset name
    - Subset (configuration) name:
        * sentences_50agree: Number of instances with >=50% annotator agreement: 4846
        * sentences_66agree: Number of instances with >=66% annotator agreement: 4217
        * sentences_75agree: Number of instances with >=75% annotator agreement: 3453
        * sentences_allagree: Number of instances with 100% annotator agreement: 2264
    - trust_remote_code: whether to allow running the dataset's loading script.
'''
dataset = load_dataset(
    "financial_phrasebank", # Dataset name; see https://huggingface.co/datasets/takala/financial_phrasebank
    "sentences_allagree", # Subset to load. "sentences_allagree" keeps only sentences on which all annotators agreed on the sentiment label.
    trust_remote_code=True # Allow the dataset's loading script (financial_phrasebank.py) from the Hub to run locally.
)

'''
class_label:
    names:
        '0': negative
        '1': neutral
        '2': positive
'''
# Look at one sample from the dataset
print(dataset["train"][0])

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



{'sentence': 'According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .', 'label': 1}


In [None]:
# Inspect the overall structure and size of the dataset
print(dataset)

# List the column names
print(dataset["train"].column_names)

# Show the first five training rows
print(dataset["train"][:5])

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label'],
        num_rows: 2264
    })
})
['sentence', 'label']
{'sentence': ['According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .', "For the last quarter of 2010 , Componenta 's net sales doubled to EUR131m from EUR76m for the same period a year earlier , while it moved to a zero pre-tax profit from a pre-tax loss of EUR7m .", 'In the third quarter of 2010 , net sales increased by 5.2 % to EUR 205.5 mn , and operating profit by 34.9 % to EUR 23.5 mn .', 'Operating profit rose to EUR 13.1 mn from EUR 8.7 mn in the corresponding period in 2007 representing 7.7 % of net sales .', 'Operating profit totalled EUR 21.1 mn , up from EUR 18.6 mn in 2007 , representing 9.7 % of net sales .'], 'label': [1, 2, 2, 2, 2]}


In [None]:
# See which keys (splits) the dataset contains
dataset.keys()

dict_keys(['train'])

In [None]:
from collections import Counter # Counts how often each element occurs in an iterable

# Count how many times each label occurs
label_counts = Counter(dataset["train"]["label"])
print(label_counts)

Counter({1: 1391, 2: 570, 0: 303})
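The counts above show a clear class imbalance. As a small sketch that reuses only the numbers printed by `Counter`, the snippet below turns them into shares and computes the accuracy of a trivial "always predict the most common class" baseline. The `label_names` dictionary and the `majority_baseline` variable are introduced here for illustration.

```python
from collections import Counter

# Label counts copied from the Counter output above.
label_counts = Counter({1: 1391, 2: 570, 0: 303})
label_names = {0: "negative", 1: "neutral", 2: "positive"}

total = sum(label_counts.values())
shares = {label_names[k]: v / total for k, v in label_counts.items()}

# Print the classes from most to least frequent.
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:<9}{share:.1%}")

# Accuracy of a trivial classifier that always predicts the most common class.
majority_baseline = max(label_counts.values()) / total
print(f"majority baseline accuracy: {majority_baseline:.3f}")
```

Any accuracy reported later is best read against this baseline, since a model that only ever answers "neutral" would already be right on most rows.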


In [None]:
import pandas as pd  # Library for working with tabular data

# Convert to a DataFrame
df = pd.DataFrame(dataset["train"])
print(df.head())

                                            sentence  label
0  According to Gran , the company has no plans t...      1
1  For the last quarter of 2010 , Componenta 's n...      2
2  In the third quarter of 2010 , net sales incre...      2
3  Operating profit rose to EUR 13.1 mn from EUR ...      2
4  Operating profit totalled EUR 21.1 mn , up fro...      2


In [None]:
# Check for missing values
print(df.isnull().sum())

# Drop rows with duplicate sentences
df.drop_duplicates(subset=["sentence"], inplace=True)
print(df.shape)

sentence    0
label       0
dtype: int64
(2259, 2)
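Note that the de-duplication above changes only the pandas `df`; the `dataset` object still holds all 2264 rows. A minimal sketch of how the same "keep the first occurrence" rule could be expressed as row indices is shown below. `first_occurrence_indices` is a hypothetical helper and the toy sentences are made up; applying the result through `Dataset.select` is shown only as a comment.

```python
def first_occurrence_indices(sentences):
    """Return the row indices whose sentence has not appeared earlier."""
    seen = set()
    keep = []
    for i, sentence in enumerate(sentences):
        if sentence not in seen:
            seen.add(sentence)
            keep.append(i)
    return keep

# Toy rows (hypothetical); rows 0 and 2 are duplicates.
toy = ["Sales rose .", "Profit fell .", "Sales rose .", "Costs were flat ."]
keep = first_occurrence_indices(toy)
print(keep)  # [0, 1, 3]

# With the real data this could be applied as, for example:
# deduped = dataset["train"].select(
#     first_occurrence_indices(dataset["train"]["sentence"]))
```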


In [None]:
# Split the dataset into training and test sets
train_test_split = dataset["train"].train_test_split(test_size=0.2)

# Confirm the sizes after the split
print(train_test_split)

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label'],
        num_rows: 1811
    })
    test: Dataset({
        features: ['sentence', 'label'],
        num_rows: 453
    })
})
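The split above is random and unseeded, so the rows that land in each part can differ between runs, and nothing forces the two parts to share the same label mix. `train_test_split` also accepts `seed` and `stratify_by_column` arguments for this. The pure-Python sketch below only illustrates the idea of a seeded, stratified 80/20 split. `stratified_split` is a hypothetical helper, and the label list is rebuilt from the class counts printed earlier rather than read from the dataset.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Split row indices so each label keeps roughly the same share in both parts."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_size)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Label list rebuilt from the class counts printed earlier (row order is illustrative).
labels = [0] * 303 + [1] * 1391 + [2] * 570
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Calling the function twice with the same seed returns the same indices, which makes later metrics comparable from run to run.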


In [None]:
from transformers import AutoTokenizer  # AutoTokenizer picks and loads the tokenizer that matches the given model name

# Tokenize with the BERT tokenizer
# For Chinese text, use the bert-base-chinese model instead
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenization function that converts sentences into tokens
def tokenize_function(example):
    '''
    Tokenize each sentence with the tokenizer.
        - padding="max_length" pads every sentence to the same fixed length
        - truncation=True cuts off sentences that exceed the maximum length
    '''
    return tokenizer(example["sentence"], padding="max_length", truncation=True)

# Tokenize the whole dataset
tokenized_datasets = train_test_split.map(tokenize_function, batched=True)

'''
sentence: the raw text, for example "Financial details were n't disclosed .". The tokenizer splits it into tokens and converts them to numbers.

label: the label of the sentence.

input_ids: the list of token IDs produced by tokenizing the sentence.
Each number (for example 101) stands for one token, encoded according to the model's vocabulary.
For bert-base-uncased, 101 is the special [CLS] token that marks the start of the sequence.
'''

# Look at the tokenized data
tokenized_datasets["train"][:1]

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.



{'sentence': ['Seven-month sales of Ragutis , which is controlled by the Finnish brewery Olvi , declined by 11.2 percent , to 15.41 million liters , and the company held 9.89 percent of the market .'],
 'label': [0],
 'input_ids': [[101,
   2698,
   1011,
   3204,
   4341,
   1997,
   17768,
   21823,
   2015,
   1010,
   2029,
   2003,
   4758,
   2011,
   1996,
   6983,
   12161,
   19330,
   5737,
   1010,
   6430,
   2011,
   2340,
   1012,
   1016,
   3867,
   1010,
   2000,
   2321,
   1012,
   4601,
   2454,
   23675,
   2015,
   1010,
   1998,
   1996,
   2194,
   2218,
   1023,
   1012,
   6486,
   3867,
   1997,
   1996,
   3006,
   1012,
   102,
   0,
   0,
   0,
   ...

In [None]:
# Confirm that tokenized_datasets contains the required columns
print(tokenized_datasets["train"].features)

{'sentence': Value(dtype='string', id=None), 'label': ClassLabel(names=['negative', 'neutral', 'positive'], id=None), 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), 'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}
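With `padding="max_length"` and no explicit `max_length`, every sentence is padded to the tokenizer's maximum (512 for `bert-base-uncased`), which is why the 48-token example above is followed by a long run of zeros. The sketch below compares the number of token positions processed under that fixed padding with padding each batch only to its own longest sentence (dynamic padding, which `transformers` offers through `DataCollatorWithPadding`). `padded_token_count` is a hypothetical helper, and all lengths except 48 are made up.

```python
def padded_token_count(lengths, batch_size, max_length=None):
    """Total token positions processed when each batch is padded.

    max_length=None pads each batch to its own longest sequence (dynamic padding);
    otherwise every sequence is padded to max_length.
    """
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        target = max_length if max_length is not None else max(batch)
        total += target * len(batch)
    return total

# Hypothetical token counts for eight sentences; 48 matches the example shown above.
lengths = [48, 22, 31, 17, 40, 25, 29, 36]

fixed = padded_token_count(lengths, batch_size=4, max_length=512)
dynamic = padded_token_count(lengths, batch_size=4)
print(fixed, dynamic)  # 4096 352
```

On short sentences like these, most of the positions under fixed padding are padding, so dynamic padding is one way to cut training time without changing the data.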


### <b>2 Building the Model</b>

In [None]:
import torch  # Deep learning framework
# Use the GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

In [None]:
from transformers import AutoModelForSequenceClassification  # Model class for sequence classification

# Load the model
# For Chinese text, use the bert-base-chinese model instead
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
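The model above is loaded with `num_labels=3` only, so its configuration does not record what each class index means. One option, sketched below, is to build the two mappings from the class names listed in the dataset's `ClassLabel` feature. The variable names `names`, `id2label`, and `label2id` are chosen here for illustration, and passing the mappings to `from_pretrained` is shown only as a comment and was not run as part of this notebook.

```python
# Class names in index order, as listed in the dataset's ClassLabel feature.
names = ["negative", "neutral", "positive"]

id2label = dict(enumerate(names))
label2id = {name: idx for idx, name in id2label.items()}

print(id2label)
print(label2id)

# Assumed usage (not run here): pass both mappings when loading the model, e.g.
# AutoModelForSequenceClassification.from_pretrained(
#     "bert-base-uncased", num_labels=3, id2label=id2label, label2id=label2id)
```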


In [None]:
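from transformers import TrainingArguments  # Bundles the training hyperparameters

training_args = TrainingArguments(
    output_dir="./results",          # Where training outputs such as model checkpoints are written.
    evaluation_strategy="epoch",     # Evaluate once at the end of every epoch (recent transformers releases name this argument eval_strategy).
    learning_rate=2e-5,              # Learning rate; a small value such as 2e-5 is usually a better fit for fine-tuning a pretrained model.
    per_device_train_batch_size=4,   # Training batch size per device (for example, per GPU).
    per_device_eval_batch_size=4,    # Evaluation batch size per device.
    num_train_epochs=3,              # Number of epochs; the model passes over the full training set three times.
    weight_decay=0.01,               # Weight decay (an L2-style regularizer) of 0.01 to reduce overfitting.
    logging_dir="./logs",            # Directory for logs, such as TensorBoard event files.
    logging_steps=10,                # Log training information every 10 steps to make progress easier to monitor.
    report_to="none"                 # Do not report logs to any integration (such as TensorBoard); set to "tensorboard" to enable it.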
)



In [None]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support # Accuracy, precision, recall, and F1 score
import numpy as np  # Numerical array library

def compute_metrics(pred):
    labels = pred.label_ids  # Ground-truth label IDs
    preds = np.argmax(pred.predictions, axis=1)  # Index of the highest-scoring class for each sample
    accuracy = accuracy_score(labels, preds)  # Accuracy
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="weighted")  # Support-weighted precision, recall, and F1
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
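To make `average="weighted"` concrete, here is a dependency-free sketch of the same idea: per-class precision, recall, and F1 are computed first and then averaged with each class weighted by its number of true samples. `weighted_scores` is a hypothetical helper written for illustration, not scikit-learn's implementation, and the toy labels are made up.

```python
from collections import Counter

def weighted_scores(labels, preds):
    """Accuracy plus support-weighted precision, recall, and F1."""
    n = len(labels)
    support = Counter(labels)  # number of true samples per class
    accuracy = sum(l == p for l, p in zip(labels, preds)) / n

    precision = recall = f1 = 0.0
    for cls, count in support.items():
        tp = sum(l == cls and p == cls for l, p in zip(labels, preds))
        predicted = sum(p == cls for p in preds)
        p_cls = tp / predicted if predicted else 0.0
        r_cls = tp / count
        f_cls = 2 * p_cls * r_cls / (p_cls + r_cls) if p_cls + r_cls else 0.0
        weight = count / n
        precision += weight * p_cls
        recall += weight * r_cls
        f1 += weight * f_cls
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels and predictions (hypothetical).
labels = [0, 1, 1, 1, 2, 2, 1, 0]
preds = [0, 1, 1, 2, 2, 1, 1, 0]
scores = weighted_scores(labels, preds)
print(scores)
```

One consequence of this weighting is that weighted recall always equals accuracy, because the per-class supports cancel out, so the two numbers carry the same information. Macro averaging (`average="macro"`) would instead treat the three classes equally regardless of size.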

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    compute_metrics=compute_metrics,
)

# Start training
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | 0.003 | 0.171333 | 0.97351 | 0.973445 | 0.97351 | 0.973416 |
| 2 | 0.0777 | 0.188348 | 0.969095 | 0.969095 | 0.969095 | 0.969095 |
| 3 | 0.0002 | 0.187735 | 0.96468 | 0.964891 | 0.96468 | 0.964764 |


TrainOutput(global_step=1359, training_loss=0.09131661944061852, metrics={'train_runtime': 676.5848, 'train_samples_per_second': 8.03, 'train_steps_per_second': 2.009, 'total_flos': 1429495198516224.0, 'train_loss': 0.09131661944061852, 'epoch': 3.0})
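The `global_step=1359` value in `TrainOutput` can be checked by hand from the split size and the training arguments. The short sketch below assumes a single device and no gradient accumulation, neither of which is set explicitly in this notebook.

```python
import math

train_rows = 1811        # size of the training split printed earlier
batch_size = 4           # per_device_train_batch_size
epochs = 3               # num_train_epochs
devices = 1              # assumption: one GPU
grad_accum = 1           # assumption: default gradient_accumulation_steps

# The last batch of each epoch is smaller, so the division is rounded up.
steps_per_epoch = math.ceil(train_rows / (batch_size * devices * grad_accum))
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 453 1359
```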

### <b>3 Evaluating the Results</b>

In [None]:
# Evaluate the model on the test set
eval_result = trainer.evaluate()
print(eval_result)

{'eval_loss': 0.18773537874221802, 'eval_accuracy': 0.9646799116997793, 'eval_precision': 0.9648911160762633, 'eval_recall': 0.9646799116997793, 'eval_f1': 0.9647644879694872, 'eval_runtime': 14.4974, 'eval_samples_per_second': 31.247, 'eval_steps_per_second': 7.863, 'epoch': 3.0}


In [None]:
# Define the test sentences
test_texts = [
    "The company's profit has increased significantly this quarter.",
    "The increase in costs negatively affected the revenue.",
    "The company's performance remained stable."
]

# Tokenize the test sentences
'''
Tokenize and encode the test texts with the tokenizer.
    - test_texts: the texts to tokenize and encode
    - truncation=True: truncate texts that exceed the model's maximum length
    - padding=True: pad all sentences to the length of the longest one so they can be processed as a batch
    - return_tensors="pt": return PyTorch tensors
    - .to(device): move the encoded tensors to the chosen device (GPU or CPU)
'''
test_encodings = tokenizer(test_texts, truncation=True, padding=True, return_tensors="pt").to(device)
'''
"**" is Python's dictionary-unpacking operator: it passes each key of a mapping as a keyword argument.

test_encodings = {
    "input_ids": tensor1,
    "attention_mask": tensor2
}

becomes

input_ids=tensor1, attention_mask=tensor2
'''
outputs = model(**test_encodings)

# Get the predictions
preds = torch.argmax(outputs.logits, dim=1).cpu().numpy()  # Move the result back to the CPU for further processing

# Map numeric labels to text labels
label_map = {0: "Negative", 1: "Neutral", 2: "Positive"}
predicted_labels = [label_map[pred] for pred in preds]
print(predicted_labels)

['Positive', 'Negative', 'Positive']
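`argmax` returns only the winning class and discards how confident the model was. A softmax over the logits turns them into probabilities; in PyTorch the tensor equivalent is `torch.softmax(outputs.logits, dim=1)`. The dependency-free sketch below shows the computation for a single sentence. The logits are made-up numbers, not values produced by the trained model.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

label_map = {0: "Negative", 1: "Neutral", 2: "Positive"}

# Hypothetical logits for one sentence (not taken from the trained model).
logits = [-1.2, 0.3, 2.5]
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(label_map[best], round(probs[best], 3))
```

Looking at the probability of the winning class helps flag borderline cases, such as a "remained stable" sentence that sits between neutral and positive.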


# Supplement: padding to max length




In [None]:
sentences = [
    "The weather is nice today.",
    "I went for a walk in the park because the weather was nice and sunny.",
    "It rained yesterday."
]

In [None]:
# Shorter sequences are padded with 0 up to the common length
input_ids:
tensor([[ 101, 1996, 4633, 2003, 3835, 2651, 1012,  102,    0,    0],
        [ 101, 1045, 2253, 2005, 1037, 3313, 1999, 1996, 2380, 102],
        [ 101, 2009, 4931, 7481, 1012,  102,    0,    0,    0,    0]])

attention_mask:
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])
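The padding and mask shown above can be reproduced with a few lines of plain Python. The sketch below is an illustration of the idea rather than the tokenizer's own code: `pad_batch` is a hypothetical helper, and the token IDs are copied from the rows above with their padding zeros stripped.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-ID lists to the longest one and build the attention mask."""
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = width - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

# Token IDs copied from the rows above, with the padding zeros stripped.
batch = [
    [101, 1996, 4633, 2003, 3835, 2651, 1012, 102],
    [101, 1045, 2253, 2005, 1037, 3313, 1999, 1996, 2380, 102],
    [101, 2009, 4931, 7481, 1012, 102],
]
input_ids, attention_mask = pad_batch(batch)
for row in attention_mask:
    print(row)
```

The attention mask is what lets the model tell real tokens from padding, so the zeros added to `input_ids` do not influence the prediction.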