<a href="https://colab.research.google.com/github/Heng1222/Ohsumed_classification/blob/main/Model/task1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd
import numpy as np
import torch
import re
import requests
from tqdm import tqdm
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, f1_score, accuracy_score
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, AutoModel

# =================================================================
# [DATA CHANGE: read the dataset from a GitHub URL]
# =================================================================
url = "https://media.githubusercontent.com/media/Heng1222/Ohsumed_classification/refs/heads/main/classification_data/ohsumed_dataset.csv"

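def fetch_and_parse_data(url):
    """Download the raw CSV text and parse it into a DataFrame of title, abstract, label."""
    print("Downloading and parsing data from GitHub...")
    response = requests.get(url, timeout=60)
    response.raise_for_status()  # fail early on an HTTP error instead of parsing an error page
    response.encoding = 'utf-8'
    content = response.text

    # Record format: (title).,"(multi-line abstract)",(label)
    pattern = r'(.*?)\.,"([\s\S]*?)",(C\d+)'
    matches = re.findall(pattern, content)

    data_list = []
    for title, abstract, label in matches:
        data_list.append({
            'title': title.strip(),
            'abstract': abstract.replace('\n', ' ').strip(),  # flatten multi-line abstracts
            'label': label.strip()
        })
    return pd.DataFrame(data_list)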

df_all = fetch_and_parse_data(url)

# =================================================================
# [EXPERIMENT SETUP: choose the feature to test]
# [DATA CHANGE: to use another column (e.g. the abstract), change text_col]
# =================================================================
text_col = 'title'  # currently only the title is used for the baseline test
label_col = 'label'

print(f"Data loaded: {len(df_all)} records.")

# 1. Encode the labels
le = LabelEncoder()
y_encoded = le.fit_transform(df_all[label_col])

# 2. Stratified train/test split (80/20, as suggested by the paper and the data distribution)
X_train_text, X_test_text, y_train, y_test = train_test_split(
    df_all[text_col],
    y_encoded,
    test_size=0.2,
    random_state=42,
    stratify=y_encoded  # stratify so the imbalanced classes keep their proportions in both splits
)

# 3. Load the pretrained model as-is (no fine-tuning)
model_name = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()

# 4. Define the feature-extraction function (the core of the baseline: linear probing)
def get_embeddings(texts, batch_size=32):
    """Encode a pandas Series of strings into an array of <s>-token embeddings."""
    all_embeddings = []
    with torch.no_grad():
        for i in tqdm(range(0, len(texts), batch_size), desc="Extracting embeddings"):
            batch = texts.iloc[i : i + batch_size].tolist()
            inputs = tokenizer(batch, padding=True, truncation=True, max_length=128, return_tensors="pt").to(device)
            outputs = model(**inputs)
            # Use the <s> token (RoBERTa's counterpart of [CLS]) as the sequence representation
            emb = outputs.last_hidden_state[:, 0, :].cpu().numpy()
            all_embeddings.append(emb)
    return np.vstack(all_embeddings)

print("Extracting training-set embeddings...")
X_train = get_embeddings(X_train_text)
print("Extracting test-set embeddings...")
X_test = get_embeddings(X_test_text)

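# 5. Train logistic regression (following the paper's control-group design exactly)
print("Training the Logistic Regression classifier...")
# lbfgs already fits a multinomial (softmax) model for multiclass targets;
# the explicit multi_class argument is deprecated in recent scikit-learn releases.
clf = LogisticRegression(max_iter=1000, solver='lbfgs')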
clf.fit(X_train, y_train)

# 6. Produce the evaluation report
y_pred = clf.predict(X_test)
print("\n=== Reproduction of the senior labmate's RoBERTa baseline (Ohsumed dataset) ===")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Macro F1: {f1_score(y_test, y_pred, average='macro'):.4f}")
print("\nPer-class metrics:")
print(classification_report(y_test, y_pred, target_names=le.classes_))

Downloading and parsing data from GitHub...
Data loaded: 49161 records.
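
The record count above depends entirely on the regex in `fetch_and_parse_data`. As a rough illustration of what that pattern captures, here is a minimal sketch run on a made-up two-record string. The sample text is invented for this example and is not taken from the dataset.

```python
import re

# Same pattern as in fetch_and_parse_data: (title).,"(multi-line abstract)",(label)
pattern = r'(.*?)\.,"([\s\S]*?)",(C\d+)'

# Invented sample, shaped like the format the parser assumes
sample = (
    'Title one.,"Abstract line one\nline two",C04\n'
    'Title two.,"Another abstract",C14\n'
)

records = [
    {
        "title": title.strip(),
        "abstract": abstract.replace("\n", " ").strip(),  # flatten line breaks
        "label": label.strip(),
    }
    for title, abstract, label in re.findall(pattern, sample)
]

for record in records:
    print(record)
```

Text that does not fit this shape is not reported by `re.findall`, so comparing `len(df_all)` with the number of records expected in the file is a cheap sanity check.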


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
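
The warning above concerns the pooler head. `get_embeddings` reads `last_hidden_state[:, 0, :]` rather than `pooler_output`, so the newly initialized pooler weights do not feed into the extracted features. The sketch below shows that first-token slice on a stand-in NumPy array (random values, illustrative shapes), next to masked mean pooling, a common alternative that this notebook does not use.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for outputs.last_hidden_state: (batch, seq_len, hidden). Shapes are illustrative.
batch, seq_len, hidden = 2, 5, 8
last_hidden_state = rng.normal(size=(batch, seq_len, hidden))

# Stand-in attention mask: 1 for real tokens, 0 for padding
attention_mask = np.array([
    [1, 1, 1, 1, 1],
    [1, 1, 1, 0, 0],
])

# First-token (<s>) pooling, as in get_embeddings
first_token = last_hidden_state[:, 0, :]

# Masked mean pooling: average only over non-padding positions
mask = attention_mask[:, :, None]
mean_pooled = (last_hidden_state * mask).sum(axis=1) / mask.sum(axis=1)

print(first_token.shape, mean_pooled.shape)
```

Both results have one vector per input text, so either could be passed to the same downstream classifier.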


Extracting training-set embeddings...


Extracting embeddings: 100%|██████████| 1229/1229 [1:36:21<00:00,  4.70s/it]


Extracting test-set embeddings...


Extracting embeddings: 100%|██████████| 308/308 [24:01<00:00,  4.68s/it]
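
The two extraction passes above took roughly two hours in total, so it may be worth caching the arrays before experimenting with classifiers. A minimal sketch, assuming NumPy's `.npy` format; the helper `load_or_compute` and the file name are hypothetical, and a small random array stands in for `X_train`.

```python
import os
import tempfile

import numpy as np

# Stand-in for the real embedding matrix (roberta-base has hidden size 768)
X_train_demo = np.random.default_rng(0).normal(size=(4, 768)).astype(np.float32)

cache_dir = tempfile.mkdtemp()
cache_path = os.path.join(cache_dir, "X_train_title.npy")  # hypothetical file name


def load_or_compute(path, compute_fn):
    """Return the cached array if the file exists, otherwise compute and save it."""
    if os.path.exists(path):
        return np.load(path)
    arr = compute_fn()
    np.save(path, arr)
    return arr


# First call computes and writes the cache; the second call reads it back,
# so its compute function is never invoked.
first = load_or_compute(cache_path, lambda: X_train_demo)
second = load_or_compute(cache_path, lambda: np.zeros((4, 768), dtype=np.float32))

print(np.array_equal(first, second))
```

In the notebook this could wrap the existing call, for example `load_or_compute(path, lambda: get_embeddings(X_train_text))`, with one file per split and per `text_col` setting.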


Training the Logistic Regression classifier...


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
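
The lbfgs message above means the solver reached `max_iter=1000` before converging, and it suggests scaling the data as one remedy. As a hedged sketch of that suggestion, the snippet below standardizes features using training-set statistics only, which is roughly what scikit-learn's `StandardScaler` does. The helper name `standardize` and the random arrays are stand-ins, and whether scaling would change the scores reported here is untested.

```python
import numpy as np


def standardize(train, test, eps=1e-12):
    """Scale both splits with the mean and std computed on the training split only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std = np.where(std < eps, 1.0, std)  # leave constant features unscaled
    return (train - mean) / std, (test - mean) / std


rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=3.0, scale=5.0, size=(100, 4))  # stand-in for X_train
X_test_demo = rng.normal(loc=3.0, scale=5.0, size=(20, 4))    # stand-in for X_test

X_train_scaled, X_test_scaled = standardize(X_train_demo, X_test_demo)

print(X_train_scaled.mean(axis=0).round(6))
print(X_train_scaled.std(axis=0).round(6))
```

Fitting the statistics on the training split alone keeps test-set information out of the classifier's inputs.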



=== Reproduction of the senior labmate's RoBERTa baseline (Ohsumed dataset) ===
Accuracy: 0.3542
Macro F1: 0.2453
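
Macro F1 sits well below accuracy here. Macro averaging weights every class equally, so a gap like this is consistent with a few frequent classes being predicted well and rarer ones poorly. A small worked example with invented labels (not the notebook's predictions) illustrates the effect:

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for one class, computed as 2*TP / (2*TP + FP + FN)."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Invented, imbalanced labels: class 0 is frequent, classes 1 and 2 are rare
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]

labels = sorted(set(y_true))
f1_scores = [per_class_f1(y_true, y_pred, c) for c in labels]
macro_f1 = sum(f1_scores) / len(f1_scores)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"accuracy={accuracy:.4f}  macro_f1={macro_f1:.4f}")
# prints: accuracy=0.7000  macro_f1=0.4889
```

Class 2 is never predicted, so its F1 of 0 pulls the macro average down even though 7 of 10 samples are classified correctly.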

Per-class metrics:
              precision    recall  f1-score   support

         C01       0.35      0.33      0.34       436
         C02       0.28      0.13      0.18       193
         C03       0.40      0.03      0.05        71
         C04       0.44      0.64      0.52      1104
         C05       0.32      0.12      0.17       298
         C06       0.45      0.26      0.33       517
         C07       1.00      0.01      0.02        95
         C08       0.38      0.22      0.28       445
         C09       0.33      0.02      0.04       131
         C10       0.36      0.32      0.34       670
         C11       0.48      0.12      0.20       177
         C12       0.46      0.19      0.27       443
         C13       0.41      0.37      0.39       278
         C14       0.48      0.57      0.52      1025
         C15       0.33      0.07      0.11       220
         C16       0.15      0.01      0.02