<a href="https://colab.research.google.com/github/Justin21523/edge-deid-studio/blob/feature%2Ftrain-ner-notebook/notebooks/02_train_ner.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🏷️ NER Fine-Tuning Demo (02_train_ner.ipynb)

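> This notebook shows how to download the data, tokenize it, align the labels, fine-tune an NER model with `Trainer`, and save the result to `models/ner/v1.0`.  
> **Note**: Do a dry run on a CPU runtime first; once everything checks out, switch to a GPU runtime and run the whole pipeline in one go.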

## 1️⃣ Environment Setup

In [None]:
# 1.1 Install dependencies (only needs to run once per runtime)
!pip install -q transformers datasets accelerate

# 1.2 Mount Google Drive and point the Hugging Face cache at it
from google.colab import drive
drive.mount('/content/drive')

import os
os.environ['HF_HOME'] = '/content/drive/MyDrive/hf_cache'
os.environ['TRANSFORMERS_CACHE'] = '/content/drive/MyDrive/hf_cache/transformers'
os.makedirs(os.environ['TRANSFORMERS_CACHE'], exist_ok=True)


In [None]:
# 1.3 Log in to Hugging Face (required the first time on each runtime)
from huggingface_hub import login
from getpass import getpass
hf_token = getpass("Paste your Hugging Face token: ")
login(token=hf_token)

In [None]:
# 1.4 Check for a GPU (run this after switching to a GPU runtime)
import torch
print("GPU available:", torch.cuda.is_available())

## 2️⃣ Load Packages and Dataset


In [None]:
# `load_metric` was unused here and is no longer available in recent `datasets` releases
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    TrainingArguments,
    Trainer
)

# 2.1 Download the CoNLL-2003 NER dataset
raw_datasets = load_dataset("conll2003")
label_list   = raw_datasets["train"].features["ner_tags"].feature.names
num_labels   = len(label_list)
print("Label list:", label_list)
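
The integer `ner_tags` are easier to read once they are mapped back to tag names. Below is a minimal sketch of that mapping. The tag names are hard-coded as an assumption so the snippet runs without downloading anything; in the notebook they come from `label_list`. The names `id2label`, `label2id`, and `example_tags` are introduced here for illustration only.

```python
# Assumed CoNLL-2003 tag names, hard-coded so the sketch runs offline;
# in the notebook, `label_list` is read from the dataset features instead.
label_list = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
              "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

# Integer id -> tag name, and the inverse lookup.
id2label = dict(enumerate(label_list))
label2id = {label: idx for idx, label in id2label.items()}

# Decode one sentence's integer tags back into readable tag names.
example_tags = [3, 0, 7, 0]  # hypothetical ner_tags for a 4-word sentence
decoded = [id2label[t] for t in example_tags]
print(decoded)
```

If desired, the same two dictionaries can likely be passed as `id2label=` and `label2id=` when loading the model, so that saved checkpoints report tag names instead of `LABEL_0`, `LABEL_1`, and so on.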


## 3️⃣ Tokenizer and Model Initialization


In [None]:
model_checkpoint = "bert-base-cased"
tokenizer        = AutoTokenizer.from_pretrained(model_checkpoint)
model            = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    num_labels=num_labels,
    ignore_mismatched_sizes=True
)


## 4️⃣ Tokenize and Align Labels


In [None]:
# 4.1 Define the alignment function
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"],
        truncation=True,
        is_split_into_words=True
    )
    all_labels     = examples["ner_tags"]
    aligned_labels = []
    for i, labels in enumerate(all_labels):
        word_ids          = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids         = []
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                label_ids.append(labels[word_idx])
            else:
                label_ids.append(labels[word_idx] if label_all_tokens else -100)
            previous_word_idx = word_idx
        aligned_labels.append(label_ids)
    tokenized_inputs["labels"] = aligned_labels
    return tokenized_inputs

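# 4.2 Map the function over the whole dataset
# (label_all_tokens is read as a global by tokenize_and_align_labels, so set it before calling map)
label_all_tokens   = False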
tokenized_datasets = raw_datasets.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=raw_datasets["train"].column_names
)
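
To see what the alignment does without loading a tokenizer, here is a minimal sketch of the same loop applied to a single hand-written example. The `word_ids` list and the sub-token split in the comment are hypothetical; a real tokenizer may split the words differently. `align_labels` is an illustrative helper, not part of the notebook.

```python
def align_labels(word_ids, word_labels, label_all_tokens=False):
    """Expand word-level labels to sub-token level, masking with -100."""
    aligned = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            # Special tokens such as [CLS] / [SEP] are ignored by the loss.
            aligned.append(-100)
        elif word_idx != previous_word_idx:
            # First sub-token of a word keeps the word's label.
            aligned.append(word_labels[word_idx])
        else:
            # Later sub-tokens are either labeled too or masked out.
            aligned.append(word_labels[word_idx] if label_all_tokens else -100)
        previous_word_idx = word_idx
    return aligned

# Hypothetical word_ids for "[CLS] Hu ##gging Face rocks [SEP]"
word_ids = [None, 0, 0, 1, 2, None]
word_labels = [3, 4, 0]  # one label id per word

first_only = align_labels(word_ids, word_labels)
all_tokens = align_labels(word_ids, word_labels, label_all_tokens=True)
print(first_only)  # [-100, 3, -100, 4, 0, -100]
print(all_tokens)  # [-100, 3, 3, 4, 0, -100]
```

With `label_all_tokens=False`, only the first sub-token of each word contributes to the loss, which keeps the number of scored positions equal to the number of words.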


## 5️⃣ DataCollator and Trainer Setup


In [None]:
# 5.1 Dynamic padding (each batch is padded to its own longest sequence)
data_collator = DataCollatorForTokenClassification(tokenizer)

# 5.2 Training arguments
training_args = TrainingArguments(
    output_dir="models/ner/v1.0",
    eval_strategy="epoch",  # named `evaluation_strategy` in transformers releases before 4.41
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=False,
)

# 5.3 Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)
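
The dynamic padding set up in step 5.1 can be pictured with a small pure-Python sketch. This is a simplified illustration under stated assumptions (right padding, a pad token id of 0, and a label pad value of -100), not the library's implementation. `pad_batch` and the token ids in `batch` are hypothetical.

```python
def pad_batch(batch, pad_token_id=0, label_pad_id=-100):
    """Pad every example to the longest sequence in this batch only."""
    max_len = max(len(ex["input_ids"]) for ex in batch)
    padded = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in batch:
        length = len(ex["input_ids"])
        gap = max_len - length
        padded["input_ids"].append(ex["input_ids"] + [pad_token_id] * gap)
        # 1 marks real tokens, 0 marks padding the model should not attend to.
        padded["attention_mask"].append([1] * length + [0] * gap)
        # Padded label positions use -100 so the loss skips them.
        padded["labels"].append(ex["labels"] + [label_pad_id] * gap)
    return padded

batch = [
    {"input_ids": [101, 7, 8, 102], "labels": [-100, 1, 2, -100]},
    {"input_ids": [101, 9, 102], "labels": [-100, 0, -100]},
]
out = pad_batch(batch)
print(out["input_ids"])
print(out["labels"])
```

Padding per batch instead of to a fixed global length means short batches stay short, which tends to reduce wasted computation.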


## 6️⃣ Start Training (run only after switching to a GPU runtime)


In [None]:
# trainer.train() and trainer.save_model() both write their output to models/ner/v1.0
# On a CPU runtime, leave these commented out; on a GPU runtime, uncomment them and run in one go:
# trainer.train()
# trainer.save_model()
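
Once training has run, predictions still contain the `-100` positions introduced during label alignment, and these have to be dropped before any scoring. The sketch below shows one way to do that on plain Python lists. It assumes the predicted ids have already been obtained (for example by taking the argmax over the logits returned by `trainer.predict` and calling `.tolist()`). `decode_predictions`, `token_accuracy`, and the shortened tag set are hypothetical. Token-level accuracy is used only as a simple stand-in; it is not the entity-level F1 usually reported for NER.

```python
def decode_predictions(pred_ids, label_ids, label_list, ignore_index=-100):
    """Drop masked positions and map label ids to tag names."""
    true_tags, pred_tags = [], []
    for pred_row, label_row in zip(pred_ids, label_ids):
        keep = [(p, l) for p, l in zip(pred_row, label_row) if l != ignore_index]
        pred_tags.append([label_list[p] for p, _ in keep])
        true_tags.append([label_list[l] for _, l in keep])
    return true_tags, pred_tags

def token_accuracy(true_tags, pred_tags):
    """Fraction of kept tokens whose predicted tag matches the gold tag."""
    pairs = [(t, p) for ts, ps in zip(true_tags, pred_tags) for t, p in zip(ts, ps)]
    return sum(t == p for t, p in pairs) / len(pairs)

label_list = ["O", "B-PER", "I-PER"]  # shortened, hypothetical tag set
pred_ids = [[0, 1, 1, 0, 0]]          # hypothetical argmax output, one sentence
label_ids = [[-100, 1, 2, 0, -100]]   # gold labels with masked positions

true_tags, pred_tags = decode_predictions(pred_ids, label_ids, label_list)
acc = token_accuracy(true_tags, pred_tags)
print(true_tags, pred_tags, acc)
```

The two lists of tag sequences are in the shape that sequence-labeling metrics typically expect, so they could be handed to an entity-level scorer in the evaluation notebook.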

---
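### 📌 Notebook Roadmap

- **02_train_ner.ipynb**: CoNLL-2003 NER fine-tuning demo
- **03_finetune_gpt2.ipynb**: autoregressive fine-tuning of GPT-2
- **04_inference.ipynb**: load the fine-tuned model and demonstrate inference
- **05_export_onnx.ipynb**: export the fine-tuned model to ONNX / edge_models
- **06_evaluate.ipynb**: model evaluation and metric visualization

All notebooks live in `notebooks/`. Each one gets a Colab badge, and the workflow is: open in Colab via the badge, edit, then use Save to GitHub to sync the notebook back into the `notebooks/` folder.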
