<a href="https://colab.research.google.com/github/Diorkelly/LLM/blob/main/Finetune_Chinese_Weibo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tuning BERT on Chinese Weibo Sentiment Data


In [1]:
! pip install transformers datasets
! pip install evaluate


## Download the Weibo sentiment dataset

In [2]:
!wget https://github.com/shhuangmust/AI/raw/refs/heads/113-1/weibo_senti_100k.csv

--2025-04-14 10:08:49--  https://github.com/shhuangmust/AI/raw/refs/heads/113-1/weibo_senti_100k.csv
Resolving github.com (github.com)... 20.205.243.166
Connecting to github.com (github.com)|20.205.243.166|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/shhuangmust/AI/refs/heads/113-1/weibo_senti_100k.csv [following]
--2025-04-14 10:08:50--  https://raw.githubusercontent.com/shhuangmust/AI/refs/heads/113-1/weibo_senti_100k.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19699818 (19M) [application/octet-stream]
Saving to: ‘weibo_senti_100k.csv’


2025-04-14 10:08:51 (355 MB/s) - ‘weibo_senti_100k.csv’ saved [19699818/19699818]



## Load the Weibo dataset
- 119,988 records in total

In [3]:
from datasets import load_dataset, DatasetDict

ds = load_dataset("csv", data_files="weibo_senti_100k.csv")
print(ds)


DatasetDict({
    train: Dataset({
        features: ['label', 'review'],
        num_rows: 119988
    })
})


## Split the dataset
- 80% training (train) data
- 10% test (test) data
- 10% validation (valid) data


In [4]:
train_testvalid = ds['train'].train_test_split(test_size=0.2)
test_valid = train_testvalid['test'].train_test_split(test_size=0.5)
dataset = DatasetDict({
    'train': train_testvalid['train'],
    'test': test_valid['test'],
    'valid': test_valid['train']})
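The two-stage split above first holds out 20% of the rows, then halves that held-out part into test and validation sets. The following is a minimal sketch of the resulting sizes in plain Python. The helper `split_sizes` is hypothetical, and rounding the held-out part up is an assumption that happens to match the row counts reported in this notebook (95990 / 11999 / 11999).

```python
import math

def split_sizes(n_rows, test_fraction):
    """Return (n_kept, n_held_out) for a single train/test split.

    Assumption: the held-out part is rounded up to a whole row.
    """
    n_held_out = math.ceil(n_rows * test_fraction)
    return n_rows - n_held_out, n_held_out

n_total = 119988                                   # rows reported by print(ds)
n_train, n_holdout = split_sizes(n_total, 0.2)     # 80% train, 20% held out
n_valid, n_test = split_sizes(n_holdout, 0.5)      # held-out part split in half
print(n_train, n_valid, n_test)
```

Because `train_test_split` shuffles by default, passing a fixed `seed` to both calls is one way to make the split reproducible across runs.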


## Tokenization

In [5]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-chinese")

def tokenize_function(examples):
    return tokenizer(examples["review"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)


Map:   0%|          | 0/95990 [00:00<?, ? examples/s]

Map:   0%|          | 0/11999 [00:00<?, ? examples/s]

Map:   0%|          | 0/11999 [00:00<?, ? examples/s]
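With `padding="max_length"` and `truncation=True`, every review ends up as a fixed-length list of token ids, which is why the record printed later in this notebook ends in a run of `0`s after the `102` marker. The sketch below imitates that behavior in plain Python. It is an illustration only, not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, and the ids 101 (`[CLS]`), 102 (`[SEP]`), and 0 (`[PAD]`) follow the usual BERT vocabulary.

```python
def pad_or_truncate(input_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(input_ids)
    if len(ids) > max_length:
        # Cut the text but keep the final token (the [SEP] marker).
        ids = ids[: max_length - 1] + [ids[-1]]
    n_real = len(ids)
    n_pad = max_length - n_real
    # Real tokens get mask 1, padding gets mask 0.
    return ids + [pad_id] * n_pad, [1] * n_real + [0] * n_pad

short = [101, 1920, 8024, 102]   # [CLS] + two token ids from this notebook + [SEP]
ids, mask = pad_or_truncate(short, 6)
print(ids)    # [101, 1920, 8024, 102, 0, 0]
print(mask)   # [1, 1, 1, 1, 0, 0]
```

Padding every row to the maximum length is simple but spends compute on padding tokens. Padding each batch only to its longest member is a common alternative when training time matters.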

## To keep training fast, select 10,000 records each for training and evaluation

In [6]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(10000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(10000))
print(small_train_dataset)
print(small_eval_dataset)

Dataset({
    features: ['label', 'review', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 10000
})
Dataset({
    features: ['label', 'review', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 10000
})


## Print one record to inspect it

In [7]:
tokenized_datasets["train"][100]

{'label': 0,
 'review': '大V，别随便插。[偷笑]//@韩寒:为什么都在笑我，你们知道当时我有多努力么[泪] //@赵薇:哈哈哈@韩寒 [做鬼脸] //@姚晨:真对不住，我也笑了。[哈哈][偷笑] //@亭林镇工作室:为什么这么多嘲笑，这世界还会好么？',
 'input_ids': [101,
  1920,
  100,
  8024,
  1166,
  7390,
  912,
  2991,
  511,
  138,
  982,
  5010,
  140,
  120,
  120,
  137,
  7506,
  2170,
  131,
  711,
  784,
  720,
  6963,
  1762,
  5010,
  2769,
  8024,
  872,
  812,
  4761,
  6887,
  2496,
  3198,
  2769,
  3300,
  1914,
  1222,
  1213,
  720,
  138,
  3801,
  140,
  120,
  120,
  137,
  6627,
  5948,
  131,
  1506,
  1506,
  1506,
  137,
  7506,
  2170,
  138,
  976,
  7787,
  5567,
  140,
  120,
  120,
  137,
  2001,
  3247,
  131,
  4696,
  2190,
  679,
  857,
  8024,
  2769,
  738,
  5010,
  749,
  511,
  138,
  1506,
  1506,
  140,
  138,
  982,
  5010,
  140,
  120,
  120,
  137,
  777,
  3360,
  7252,
  2339,
  868,
  2147,
  131,
  711,
  784,
  720,
  6821,
  720,
  1914,
  1672,
  5010,
  8024,
  6821,
  686,
  4518,
  6820,
  833,
  1962,
  720,
  8043,
  102,
  0,
  0,
  0

## This fine-tuning task needs a positive/negative prediction, so we use AutoModelForSequenceClassification
- The output is either positive or negative, so num_labels=2

In [8]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-chinese", num_labels=2)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-chinese and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
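The warning above refers to the newly added classification head, which outputs one raw score (logit) per label. As a rough illustration of how two logits become a predicted label and a confidence score, here is a softmax in plain Python. The logit values are made up for the example, and this is a sketch of the general idea rather than the library's internal code.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [-3.2, 4.1]                           # made-up values, one per label (num_labels=2)
probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)   # index of the largest probability
print(predicted, probs)
```

Before fine-tuning, the head's weights are random, so these probabilities carry no meaning until the model has been trained.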


## Configure fine-tuning parameters with TrainingArguments

In [9]:
from transformers import TrainingArguments
import numpy as np
import evaluate

metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

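# Run an evaluation pass at the end of every epoch.
# Note: newer transformers releases name this argument `eval_strategy`.
training_args = TrainingArguments(output_dir="test_trainer_chinese", evaluation_strategy="epoch")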
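To make the job of `compute_metrics` concrete, the sketch below reproduces the same argmax-then-compare logic in plain Python on a tiny invented batch. The names `argmax`, `accuracy`, `toy_logits`, and `toy_labels` are hypothetical and exist only for this illustration. The notebook itself relies on `numpy` and the `evaluate` accuracy metric.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring label matches the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Invented toy batch: four examples, two logits each.
toy_logits = [[2.0, -1.0], [0.3, 0.9], [1.5, 1.1], [-0.2, 0.4]]
toy_labels = [0, 1, 1, 1]
print(accuracy(toy_logits, toy_labels))   # {'accuracy': 0.75}
```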





## Train with Trainer
- A wandb API key must be entered at this step

In [10]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)
trainer.train()

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.




| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 0.097         | 0.039757        | 0.9848   |
| 2     | 0.0816        | 0.094196        | 0.9823   |
| 3     | 0.0581        | 0.056904        | 0.9854   |


TrainOutput(global_step=3750, training_loss=0.09027227478027344, metrics={'train_runtime': 3818.4816, 'train_samples_per_second': 7.857, 'train_steps_per_second': 0.982, 'total_flos': 7893331660800000.0, 'train_loss': 0.09027227478027344, 'epoch': 3.0})
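The `global_step=3750` in the output can be checked with a little arithmetic. The sketch below assumes the `TrainingArguments` defaults that this notebook leaves untouched (a per-device batch size of 8, 3 epochs, a checkpoint every 500 steps) and a single device with no gradient accumulation. The helper `training_schedule` is hypothetical.

```python
import math

def training_schedule(n_samples, batch_size=8, epochs=3, save_every=500):
    """Return (steps per epoch, total steps, periodic checkpoint steps)."""
    steps_per_epoch = math.ceil(n_samples / batch_size)
    total = steps_per_epoch * epochs
    checkpoints = list(range(save_every, total + 1, save_every))
    return steps_per_epoch, total, checkpoints

per_epoch, total, checkpoints = training_schedule(10000)   # 10,000 training records
print(per_epoch, total)   # 1250 3750
print(checkpoints)        # [500, 1000, 1500, 2000, 2500, 3000, 3500]
```

Under these assumptions, `checkpoint-1500` falls a little after the end of the first epoch (step 1250), so later checkpoints have seen more training.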

## Test with pipeline
- LABEL_0: negative
- LABEL_1: positive

In [11]:
from transformers import pipeline
pipe = pipeline("sentiment-analysis", model='test_trainer_chinese/checkpoint-1500', tokenizer=tokenizer)

Device set to use cuda:0


In [12]:
pipe("我喜歡這個產品")  # "I like this product"

[{'label': 'LABEL_1', 'score': 0.9994695782661438}]
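Since the pipeline reports generic names such as `LABEL_1`, a small post-processing step can make the result easier to read. This is a minimal sketch that uses the label meanings listed in this section. The `LABEL_NAMES` mapping and the `readable` helper are hypothetical names introduced for illustration.

```python
# Mapping taken from the label list in this section.
LABEL_NAMES = {"LABEL_0": "negative", "LABEL_1": "positive"}

def readable(results):
    """Replace generic pipeline labels with human-readable names."""
    return [
        {"label": LABEL_NAMES.get(r["label"], r["label"]), "score": r["score"]}
        for r in results
    ]

raw = [{"label": "LABEL_1", "score": 0.9994695782661438}]   # the pipeline output shown above
print(readable(raw))   # [{'label': 'positive', 'score': 0.9994695782661438}]
```

An alternative is to store the names in the model configuration (`id2label` / `label2id`) when loading the model, so that the pipeline returns them directly.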