<a href="https://colab.research.google.com/github/05050505050505/nlp-meta-learning/blob/main/NewsClassifier_ML_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Source code for graduation thesis**


# **Mounting Google Drive**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# **Inspecting the data**

In [None]:
path = "/content/drive/MyDrive/japanese_news.csv"

In [None]:
import pandas as pd

# Path to the file (change this to your own location)
path = "/content/drive/MyDrive/japanese_news.csv"

# Peek at the first few lines
!head -n 5 "$path"

source	date	title	author	text
kobe-np.co.jp	2005-07-01			会見した北口寛人市長は「刑事訴訟で被告となっている職員にはそれぞれ主張があるが、組織全体として判決を厳しく受け止めた」と述べた。原告団長の下村誠治さん（４６）＝神戸市垂水区＝も同席。「県警も判決を真摯（しんし）に受け止めて」と訴えた。
kobe-np.co.jp	2005-07-01			明石・歩道橋事故をめぐる民事訴訟で、神戸地裁から計五億六千八百万円の賠償を命じられた兵庫県（県警）、警備会社ニシカン（現エヌ・ケイ・セキュリティ）など三者のうち、明石市が三十日、控訴しないことを正式に表明した。
kobe-np.co.jp	2007-04-07			会見後、遺族代理人の渡部吉泰弁護士は「裁判長が『問うべき者を問わないのは正義に反する』とはっきり述べたのは、検察の起訴独占主義に警鐘を鳴らす画期的なことだ」と評価した。
kobe-np.co.jp	2007-04-07			遺族会は、雑踏警備本部長を務めた元明石署長ら二人の起訴を求め、活動を続けている。


In [None]:
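# NOTE: sep=r"\t+" collapses runs of tabs, so the empty title/author fields vanish
# and the article body ends up in the 'title' column (see df.head() below).
# The later cell that selects 'title' as the text relies on this behavior.
df = pd.read_csv(path, sep=r"\t+", engine="python", dtype=str)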
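The effect of the regex separator can be checked on a tiny in-memory sample. This is a minimal sketch: the two-line sample string is made up for illustration and only mimics the layout of the real file (empty `title` and `author` fields).

```python
import io
import pandas as pd

# Hypothetical sample: header plus one row whose title and author are empty.
sample = (
    "source\tdate\ttitle\tauthor\ttext\n"
    "kobe-np.co.jp\t2005-07-01\t\t\tbody text\n"
)

# Regex separator: the run of three tabs counts as ONE delimiter,
# so the row has only three fields and the body lands in 'title'.
collapsed = pd.read_csv(io.StringIO(sample), sep=r"\t+", engine="python", dtype=str)

# Single-tab separator: empty fields are preserved as NaN,
# so the body stays in 'text'.
strict = pd.read_csv(io.StringIO(sample), sep="\t", dtype=str)

print(collapsed.loc[0, "title"], "|", collapsed.loc[0, "text"])
print(strict.loc[0, "title"], "|", strict.loc[0, "text"])
```

If the file were read with `sep="\t"` instead, the downstream cell would need to select `text` rather than `title`.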

In [None]:
print("rows, columns:", df.shape)
print("column names:", df.columns.tolist())
df.head()

rows, columns: (312955, 5)
column names: ['source', 'date', 'title', 'author', 'text']


Unnamed: 0,source,date,title,author,text
0,kobe-np.co.jp,2005-07-01,会見した北口寛人市長は「刑事訴訟で被告となっている職員にはそれぞれ主張があるが、組織全体とし...,,
1,kobe-np.co.jp,2005-07-01,明石・歩道橋事故をめぐる民事訴訟で、神戸地裁から計五億六千八百万円の賠償を命じられた兵庫県（...,,
2,kobe-np.co.jp,2007-04-07,会見後、遺族代理人の渡部吉泰弁護士は「裁判長が『問うべき者を問わないのは正義に反する』とはっ...,,
3,kobe-np.co.jp,2007-04-07,遺族会は、雑踏警備本部長を務めた元明石署長ら二人の起訴を求め、活動を続けている。,,
4,kobe-np.co.jp,2007-04-07,五人の遺族が閉廷後に会見。二女の優衣菜ちゃん＝当時（８つ）＝を亡くした三木清さん（３８）＝姫...,,


In [None]:
df = df[['source', 'title']].rename(columns={'source': 'label', 'title': 'text'})
df = df.dropna().reset_index(drop=True)
print(df['label'].nunique(), "unique labels")
print(df.sample(3))

21 unique labels
                    label                                               text
271006        mainichi.jp  【ジャカルタ佐藤賢二郎、モスクワ田中洋之】インドネシア・バリ島で１１月１９日に開かれる東アジ...
57573    nikkansports.com  成田空港には、アルベルト・ザッケローニ監督（５７）が決勝のオーストラリア戦でＶ弾を決めたＦＷ...
286827  sankei.jp.msn.com  宮城農高は震災の津波で校舎が損壊し、農場も使用できなくなった。これまで県内３カ所の高校に間借...


In [None]:
df['label'].value_counts().head(10)

Unnamed: 0_level_0,count
label,Unnamed: 1_level_1
mainichi.jp,44657
sankei.jp.msn.com,35959
nikkei.com,29323
sanspo.com,26303
tomamin.co.jp,26054
nikkansports.com,25483
oita-press.co.jp,23645
yomiuri.co.jp,22472
nishinippon.co.jp,21311
asahi.com,19855


# **Data selection and normalization**

**Setup and helper functions**

In [None]:
import re, unicodedata, pandas as pd, numpy as np

# Thresholds (kept here so they are easy to tune later)
MIN_CHARS = 40
MAX_CHARS = 1200
MIN_JA_RATIO = 0.45
MIN_LABEL_COUNT = 200
HEAD_N = 80  # prefix length used as a lightweight near-duplicate key

def normalize_basic(s: str) -> str:
    s = unicodedata.normalize('NFKC', str(s))
    s = re.sub(r'https?://\S+|www\.\S+','', s)
    s = re.sub(r'\S+@\S+','', s)
    s = re.sub(r'\s+', ' ', s).strip()
    return s

def ja_ratio(s: str) -> float:
    if not s: return 0.0
    total = len(s)
    ja = sum(1 for ch in s if (
        '\u3040' <= ch <= '\u30ff' or  # hiragana / katakana
        '\u4e00' <= ch <= '\u9fff' or  # CJK unified ideographs
        '\u3400' <= ch <= '\u4dbf'     # CJK extension A
    ))
    return ja / max(total,1)
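A quick worked example makes the two helpers concrete. The sketch below repeats the functions so it runs on its own; the input strings are made up for illustration.

```python
import re
import unicodedata

def normalize_basic(s: str) -> str:
    s = unicodedata.normalize("NFKC", str(s))     # full-width ASCII -> half-width
    s = re.sub(r"https?://\S+|www\.\S+", "", s)   # drop URLs
    s = re.sub(r"\S+@\S+", "", s)                 # drop e-mail-like tokens
    return re.sub(r"\s+", " ", s).strip()         # collapse whitespace

def ja_ratio(s: str) -> float:
    if not s:
        return 0.0
    ja = sum(1 for ch in s if (
        "\u3040" <= ch <= "\u30ff"
        or "\u4e00" <= ch <= "\u9fff"
        or "\u3400" <= ch <= "\u4dbf"
    ))
    return ja / len(s)

raw = "ＡＢＣ　１２３ http://example.com  テスト"
clean = normalize_basic(raw)
print(clean)                 # ABC 123 テスト
print(ja_ratio("日本語abc"))  # 0.5
```

Note that Japanese punctuation such as 「」 and 。 lies below U+3040, so it counts toward the total length but not toward the Japanese share.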

**Normalize → length filter → Japanese-ratio filter**: checks that each text is genuinely Japanese.

In [None]:
df['text'] = df['text'].map(normalize_basic)
df['n_chars'] = df['text'].str.len()
before = len(df)

df = df[(df['n_chars'] >= MIN_CHARS) & (df['n_chars'] <= MAX_CHARS)].copy()
df['ja_ratio'] = df['text'].map(ja_ratio)
df = df[df['ja_ratio'] >= MIN_JA_RATIO].copy()

print(f"rows: {before} -> {len(df)} after length & ja_ratio")
display(df[['n_chars','ja_ratio']].describe(percentiles=[.5,.9,.95,.99]))

rows: 312955 -> 274369 after length & ja_ratio


Unnamed: 0,n_chars,ja_ratio
count,274369.0,274369.0
mean,113.618401,0.865782
std,48.049035,0.06778
min,40.0,0.45
50%,107.0,0.880952
90%,174.0,0.933333
95%,200.0,0.943396
99%,260.0,0.958333
max,1192.0,1.0


**Removing duplicates and near-duplicates (lightweight)**: avoids training on nearly identical texts more than once.

In [None]:
# Exact duplicates
before = len(df)
df = df.drop_duplicates(subset=['text']).copy()

# Lowercase, collapse whitespace, then check for duplicates again
df['text_norm'] = df['text'].str.lower().str.replace(r'\s+', ' ', regex=True)
df = df.drop_duplicates(subset=['text_norm']).copy()

# Near-duplicates (key = first HEAD_N characters)
df['head_key'] = df['text_norm'].str[:HEAD_N]
df = df.drop_duplicates(subset=['head_key']).copy()

after = len(df)
print(f"dedup: {before} -> {after}")

dedup: 274369 -> 273274
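The prefix-key step can be sketched in plain Python to show what it keeps and what it drops. `dedup_by_head` is a hypothetical helper written for this illustration, and the sample texts are invented.

```python
def dedup_by_head(texts, head_n):
    """Keep the first text seen for each distinct prefix of length head_n."""
    seen = set()
    kept = []
    for t in texts:
        key = t.lower()[:head_n]
        if key not in seen:
            seen.add(key)
            kept.append(t)
    return kept

texts = [
    "市は30日、控訴しないことを表明した。",
    "市は30日、控訴しないことを表明した。詳細は後日発表する。",  # same first 5 chars
    "県警は判決を受け止めると述べた。",
]
kept = dedup_by_head(texts, head_n=5)
print(len(kept))  # 2
```

The trade-off is that two different texts sharing the same opening are treated as duplicates, so a longer prefix is more conservative.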


**Label sanity filter & summary**: drops news sites that have too few examples.

In [None]:
# Minimum number of examples per label
vc = df['label'].value_counts()
valid = vc[vc >= MIN_LABEL_COUNT].index
df = df[df['label'].isin(valid)].drop(columns=['text_norm','head_key']).reset_index(drop=True)

print("remaining labels:", df['label'].nunique())
display(df['label'].value_counts().to_frame('count'))

# Quick summary
print(df.shape)
display(df.sample(5))

remaining labels: 18


Unnamed: 0_level_0,count
label,Unnamed: 1_level_1
mainichi.jp,33448
sankei.jp.msn.com,32033
nikkei.com,25849
tomamin.co.jp,25023
sanspo.com,24772
oita-press.co.jp,21590
nikkansports.com,20192
nishinippon.co.jp,19109
asahi.com,18573
yomiuri.co.jp,18501


(272990, 4)


Unnamed: 0,label,text,n_chars,ja_ratio
113199,asahi.com,再開の記念式典は、がれきとなった敷地内のアスファルト片を集めたステージ「がれき座」で開かれた...,149,0.865772
54737,nikkansports.com,中には「練習後にはシャワーを浴びること」と、選手を子供扱いにしていると感じさせるものまである...,166,0.885542
255534,yomiuri.co.jp,横浜、川崎、相模原の各市など七つの政令指定都市が31日、「大都市制度共同研究会」を設立し、大...,69,0.869565
101875,oita-press.co.jp,さいたま地検は22日、統一地方選の埼玉県深谷市議選で支持者を飲食接待し、票の取りまとめなどを...,116,0.844828
236508,mainichi.jp,「怪盗ロワイヤル」は、DeNAが運営するポータルサイト「Mobage(モバゲー)」のソーシャ...,290,0.762069


**Save the cleaned CSV with a date stamp**

In [None]:
from datetime import date
today = date.today().isoformat()
save_path = f"/content/drive/MyDrive/japanese_news_clean_{today}.csv"
df.to_csv(save_path, index=False, encoding="utf-8-sig")
print("saved:", save_path)

saved: /content/drive/MyDrive/japanese_news_clean_2025-11-10.csv


# **Checking that the normalized dataset loads**

In [None]:
import glob
import pandas as pd

# Find every "japanese_news_clean_*.csv" in the Drive folder
files = glob.glob("/content/drive/MyDrive/japanese_news_clean_*.csv")

# Guard against the case where no file is found
if not files:
    raise FileNotFoundError("No cleaned data file found")

# Pick the newest file (ISO dates in the file name sort chronologically)
latest = sorted(files)[-1]
print("latest file:", latest)

# Load it
df = pd.read_csv(latest)
print(df.shape)
print(df.columns.tolist())
display(df.head(3))

latest file: /content/drive/MyDrive/japanese_news_clean_2025-11-10.csv
(272990, 4)
['label', 'text', 'n_chars', 'ja_ratio']


Unnamed: 0,label,text,n_chars,ja_ratio
0,kobe-np.co.jp,会見した北口寛人市長は「刑事訴訟で被告となっている職員にはそれぞれ主張があるが、組織全体とし...,117,0.863248
1,kobe-np.co.jp,明石・歩道橋事故をめぐる民事訴訟で、神戸地裁から計五億六千八百万円の賠償を命じられた兵庫県(...,105,0.914286
2,kobe-np.co.jp,会見後、遺族代理人の渡部吉泰弁護士は「裁判長が『問うべき者を問わないのは正義に反する』とはっ...,84,0.916667
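Picking the last item of a sorted file list finds the newest file only because the names embed zero-padded ISO dates. A small sketch with made-up file names shows both the working case and the failure mode:

```python
# Zero-padded ISO dates: string order equals chronological order.
names = [
    "japanese_news_clean_2025-11-10.csv",
    "japanese_news_clean_2025-02-03.csv",
    "japanese_news_clean_2024-12-31.csv",
]
latest = sorted(names)[-1]
print(latest)  # japanese_news_clean_2025-11-10.csv

# Without zero padding the string order breaks: February sorts after November.
unpadded = ["news_2025-11-10.csv", "news_2025-2-3.csv"]
wrong = sorted(unpadded)[-1]
print(wrong)  # news_2025-2-3.csv
```

`date.today().isoformat()` always produces the zero-padded form, so the notebook's file names are safe as long as they keep being generated that way.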


# **Experiment name tag (mode, seed, K): run this first in each runtime**

In [None]:
# Experiment mode (this is the only place to switch it)
MODE = "baseline"  # "baseline" | "supft_cls" | "supft_mlm" | "meta_cls" | "meta_mlm"
SEED = 42
KSHOT = 10

In [None]:
def run_baseline(seed, kshot):
    print("[baseline] Tohoku BERT -> livedoor few-shot (no meta-learning)... (placeholder)")
    return {"macro_f1": None, "accuracy": None}

def run_supft_cls(seed, kshot):
    print("[supft_cls] supervised FT on 18-way newspaper classification -> livedoor few-shot... (placeholder)")
    return {"macro_f1": None, "accuracy": None}

def run_supft_mlm(seed, kshot):
    print("[supft_mlm] MLM re-training on the newspaper corpus -> livedoor few-shot... (placeholder)")
    return {"macro_f1": None, "accuracy": None}

def run_meta_cls(seed, kshot):
    print("[meta_cls] classification FT -> meta-learning -> livedoor few-shot... (placeholder)")
    return {"macro_f1": None, "accuracy": None}

def run_meta_mlm(seed, kshot):
    print("[meta_mlm] MLM -> meta-learning -> livedoor few-shot... (placeholder)")
    return {"macro_f1": None, "accuracy": None}

In [None]:
def run_experiment(mode, seed, kshot):
    print(f"[start] mode={mode}, seed={seed}, kshot={kshot}")
    if mode == "baseline":
        return run_baseline(seed, kshot)
    if mode == "supft_cls":
        return run_supft_cls(seed, kshot)
    if mode == "supft_mlm":
        return run_supft_mlm(seed, kshot)
    if mode == "meta_cls":
        return run_meta_cls(seed, kshot)
    if mode == "meta_mlm":
        return run_meta_mlm(seed, kshot)
    raise ValueError(f"unknown MODE: {mode}")

In [None]:
res = run_experiment(MODE, SEED, KSHOT)
print("[done]", res)

[start] mode=baseline, seed=42, kshot=10
[baseline] Tohoku BERT -> livedoor few-shot (no meta-learning)... (placeholder)
[done] {'macro_f1': None, 'accuracy': None}
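As more modes are added, the chain of `if` statements could be replaced by a lookup table. The following is only a sketch of that alternative: `RUNNERS` is a hypothetical name, and just two placeholder runners are shown.

```python
def run_baseline(seed, kshot):
    return {"macro_f1": None, "accuracy": None}

def run_supft_cls(seed, kshot):
    return {"macro_f1": None, "accuracy": None}

# Hypothetical registry: mode name -> runner function.
RUNNERS = {
    "baseline": run_baseline,
    "supft_cls": run_supft_cls,
}

def run_experiment(mode, seed, kshot):
    try:
        runner = RUNNERS[mode]
    except KeyError:
        raise ValueError(f"unknown MODE: {mode}") from None
    return runner(seed, kshot)

result = run_experiment("baseline", seed=42, kshot=10)
print(result)
```

With this layout, registering a new mode is a one-line change to the dictionary, and the list of valid modes is available as `RUNNERS.keys()`.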


# **Tokenization**




In [None]:
import os, glob, pandas as pd

LDCC_DIR = "/content/drive/MyDrive/ldcc_data/text"  # the path must point down to the "text" directory

def load_livedoor(ldcc_dir=LDCC_DIR):
    rows = []
    # Walk each category directory under text (e.g. dokujo-tsushin, it-life-hack, ...)
    for cat_dir in sorted(glob.glob(os.path.join(ldcc_dir, "*"))):
        if not os.path.isdir(cat_dir):
            continue
        label = os.path.basename(cat_dir)
        txt_files = glob.glob(os.path.join(cat_dir, "*.txt"))
        for fp in txt_files:
            with open(fp, "r", encoding="utf-8") as f:
                lines = f.read().splitlines()
            # livedoor format: line 1 = URL, line 2 = timestamp, line 3 = title, line 4 onward = body
            if len(lines) < 4:
                continue
            url, date, title, body = lines[0], lines[1], lines[2], "\n".join(lines[3:])
            text = f"{title}。{body}"
            rows.append({"label": label, "text": text})
    df = pd.DataFrame(rows)
    return df

df_ld = load_livedoor()
print(df_ld.shape)
print(df_ld['label'].value_counts())
df_ld.head(2)

(7376, 2)
label
sports-watch      901
dokujo-tsushin    871
it-life-hack      871
smax              871
movie-enter       871
kaden-channel     865
peachy            843
topic-news        771
livedoor-homme    512
Name: count, dtype: int64


Unnamed: 0,label,text
0,dokujo-tsushin,30代女子を魅力的に見せるものとは。あまりにも過酷なこの夏の猛暑。身の危険さえ感じる暑さにグ...
1,dokujo-tsushin,現役ホステスに聞く、一番人気のホステスとは？。女性が同性を見る目は厳しい。その視線を意地悪と...


In [None]:
import numpy as np
import pandas as pd

def kshot_split_baseline(df, k_train, k_val, k_test, seed):
    rng = np.random.RandomState(seed)
    trains, vals, tests = [], [], []
    for lb, sub in df.groupby("label"):
        sub = sub.sample(frac=1, random_state=rng.randint(0, 10**9)).reset_index(drop=True)
        need = k_train + k_val + k_test
        k = min(need, len(sub))
        if k < need:
            # If a class is too small, take as many as possible (retrying with a smaller K is another option)
            take_train = min(k_train, k)
            take_val   = min(k_val,   k - take_train)
            take_test  = min(k_test,  k - take_train - take_val)
        else:
            take_train, take_val, take_test = k_train, k_val, k_test
        i0, i1, i2 = take_train, take_train+take_val, take_train+take_val+take_test
        trains.append(sub.iloc[:i0])
        vals.append(sub.iloc[i0:i1])
        tests.append(sub.iloc[i1:i2])
    train = pd.concat(trains).reset_index(drop=True)
    val   = pd.concat(vals).reset_index(drop=True)
    test  = pd.concat(tests).reset_index(drop=True)
    return train, val, test
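The behavior of the per-class split, including what happens when a class runs short, can be seen with a simplified stand-in. This sketch is not the pandas function above: `kshot_split` is a hypothetical plain-Python version that works on `(label, text)` tuples and uses `random.Random` for shuffling.

```python
import random
from collections import defaultdict

def kshot_split(rows, k_train, k_val, k_test, seed):
    """rows: list of (label, text). Returns train, val, test lists."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for label, text in rows:
        by_label[label].append((label, text))
    train, val, test = [], [], []
    for label in sorted(by_label):
        items = by_label[label][:]
        rng.shuffle(items)
        # Slicing past the end is harmless, so short classes simply give fewer items.
        train += items[:k_train]
        val += items[k_train:k_train + k_val]
        test += items[k_train + k_val:k_train + k_val + k_test]
    return train, val, test

rows = [("a", f"a{i}") for i in range(12)] + [("b", f"b{i}") for i in range(5)]
train, val, test = kshot_split(rows, k_train=2, k_val=2, k_test=100, seed=42)
print(len(train), len(val), len(test))  # 4 4 9
```

Train and validation are filled first, so an oversized `k_test` just means "everything that is left" for each class.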

**Installing libraries**

In [None]:
!pip -q install transformers datasets fugashi ipadic unidic-lite
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-v3")



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



**Creating the tokenizer**

In [None]:
from transformers import AutoTokenizer

# Load the Japanese tokenizer for Tohoku BERT
tok = AutoTokenizer.from_pretrained(
    "cl-tohoku/bert-base-japanese-v3",
    mecab_kwargs={"mecab_dic": "unidic_lite"}  # use the dictionary installed above
)

# Sanity check
print("tokenizer OK:", tok.__class__.__name__)
print("example:", tok.tokenize("今日の夜ご飯はカレーです。")[:10])

tokenizer OK: BertJapaneseTokenizer
example: ['今日', 'の', '夜', '##ご', '##飯', 'は', 'カレー', 'です', '。']


In [None]:
# --- Map labels to integer ids ---
labels = sorted(df_ld['label'].unique())
label2id = {lb: i for i, lb in enumerate(labels)}
id2label = {i: lb for lb, i in label2id.items()}

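# k_test=6000 exceeds every class size, so each class contributes all of its remaining examples to the test set
train_df, val_df, test_df = kshot_split_baseline(df_ld, k_train=KSHOT, k_val=10, k_test=6000, seed=SEED)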

for d in (train_df, val_df, test_df):
    d["label_id"] = d["label"].map(label2id)

# --- Define the tokenization function only ---
MAX_LEN = 256
def encode_batch(examples):
    enc = tok(
        examples["text"],
        truncation=True,
        max_length=MAX_LEN
    )
    enc["labels"] = examples["label_id"]
    return enc

# --- Convert to Datasets ---
from datasets import Dataset

def to_dataset(df):
    ds = Dataset.from_pandas(df[['text', 'label_id']])
    return ds.map(encode_batch, batched=True, remove_columns=ds.column_names)

ds_train = to_dataset(train_df)
ds_val = to_dataset(val_df)
ds_test = to_dataset(test_df)



Map:   0%|          | 0/90 [00:00<?, ? examples/s]

Map:   0%|          | 0/90 [00:00<?, ? examples/s]

Map:   0%|          | 0/7196 [00:00<?, ? examples/s]

# **Training and evaluation with the baseline**

In [None]:
# === Logger (final version for baseline/supft) ============================
import os, time, pandas as pd
RESULTS_CSV = "/content/drive/MyDrive/fewshot_results.csv"

RESULT_COLS = [
    "ts","mode","seed","kshot",
    "val_loss","val_accuracy","val_macro_f1",
    "test_loss","test_accuracy","test_macro_f1",
    "note_epochs","note_model","note_max_len",'train_loss', 'train_accuracy', 'train_macro_f1', 'meta_inner_acc', 'meta_outer_acc', 'note_steps'
]

def log_result(mode, seed, kshot, metrics: dict, notes: dict=None):
    row = {c: None for c in RESULT_COLS}
    row["ts"] = time.strftime("%Y-%m-%d %H:%M:%S")
    row["mode"], row["seed"], row["kshot"] = mode, seed, kshot

    # Copy metrics into the row
    for k, v in metrics.items():
        if k in row:
            row[k] = v
        elif k.startswith("eval_"):
            alt = "val_" + k[5:]
            if alt in row:
                row[alt] = v

    # Copy notes into the row (notes outside RESULT_COLS are accepted too)
    if notes:
        for k, v in notes.items():
            row[f"note_{k}"] = v

    df = pd.DataFrame([row])

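    # Append to the results file, or create it on first use
    if os.path.exists(RESULTS_CSV):
        existing = pd.read_csv(RESULTS_CSV)
        # Rewrite the whole file so the header stays aligned with the rows,
        # even if RESULT_COLS gained columns after the file was first created
        cols = list(existing.columns) + [c for c in RESULT_COLS if c not in existing.columns]
        out = pd.concat([existing.reindex(columns=cols), df.reindex(columns=cols)], ignore_index=True)
        out.to_csv(RESULTS_CSV, index=False)
    else:
        df = df[RESULT_COLS]
        df.to_csv(RESULTS_CSV, index=False)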

    print("logged ->", RESULTS_CSV)
# ==================================================================

In [None]:
from transformers import (
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer,
    TrainerCallback,
)
from sklearn.metrics import accuracy_score, f1_score

num_labels = len(labels)
model = AutoModelForSequenceClassification.from_pretrained(
    "cl-tohoku/bert-base-japanese-v3",
    num_labels=num_labels,
    id2label=id2label,
    label2id=label2id
)

data_collator = DataCollatorWithPadding(tok)

def compute_metrics(eval_pred):
    logits, y_true = eval_pred
    y_pred = logits.argmax(axis=-1)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
    }

# --- Callback that prints losses live ---
class LiveLossPrinter(TrainerCallback):
    def __init__(self, every=10):
        self.every = every
    def on_log(self, args, state, control, logs=None, **kwargs):
        if not logs:
            return
        # Training loss
        if "loss" in logs and (state.global_step % self.every == 0):
            print(f"[train step {state.global_step:>5}] loss={logs['loss']:.4f}")
        # Loss at evaluation steps
        if "eval_loss" in logs:
            print(f"[eval @ epoch {state.epoch:.2f}] val_loss={logs['eval_loss']:.4f}")

args = TrainingArguments(
    output_dir="/content/outputs/baseline",
    num_train_epochs=10,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    logging_strategy="steps",              # log every logging_steps steps
    logging_steps=10,
    seed=SEED,
    report_to="none"
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ds_train,
    eval_dataset=ds_val,
    tokenizer=tok,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[LiveLossPrinter(every=10)]  # print losses during training
)

trainer.train()
val_metrics = trainer.evaluate()
print("VAL:", val_metrics)

# --- Final evaluation on the test data ---
pred = trainer.predict(ds_test)
test_metrics = compute_metrics((pred.predictions, pred.label_ids))
print("TEST:", test_metrics)
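For readers unfamiliar with the two metrics reported here, the following sketch computes accuracy and macro F1 by hand on a made-up set of labels. It is a plain-Python illustration of the definitions, not the scikit-learn implementation that the notebook actually calls.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 = 2*TP / (2*TP + FP + FN)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
acc = accuracy(y_true, y_pred)   # 5 of 6 correct
f1 = macro_f1(y_true, y_pred)    # mean of 1.0, 2/3 and 0.8
print(round(acc, 3), round(f1, 3))
```

Because macro F1 averages over classes without weighting, a small class such as `livedoor-homme` influences the score as much as a large one.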

In [None]:
# --- Save the log ---
flat_metrics = {}
for k, v in val_metrics.items():
    key = k.replace("eval_", "")
    flat_metrics[f"val_{key}"] = v
for k, v in test_metrics.items():
    flat_metrics[f"test_{k}"] = v

log_result(
    mode="baseline",
    seed=SEED,
    kshot=KSHOT,
    metrics=flat_metrics,
    notes={"epochs": args.num_train_epochs, "max_len": 256}
)

logged -> /content/drive/MyDrive/fewshot_results.csv
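The key renaming done before logging can be isolated into a small function. This is a sketch under assumptions: `flatten_metrics` and `KNOWN_COLS` are hypothetical names, and the metric values are invented.

```python
# Hypothetical subset of the result columns, for illustration only.
KNOWN_COLS = ["val_loss", "val_accuracy", "val_macro_f1",
              "test_accuracy", "test_macro_f1"]

def flatten_metrics(val_metrics, test_metrics):
    """Rename Trainer-style 'eval_*' keys to 'val_*' and prefix test keys with 'test_'."""
    flat = {}
    for k, v in val_metrics.items():
        name = k[len("eval_"):] if k.startswith("eval_") else k
        flat["val_" + name] = v
    for k, v in test_metrics.items():
        flat["test_" + k] = v
    return flat

flat = flatten_metrics(
    {"eval_loss": 1.05, "eval_accuracy": 0.73, "eval_runtime": 1.26},
    {"accuracy": 0.72},
)
logged = {k: v for k, v in flat.items() if k in KNOWN_COLS}
print(logged)
```

Keys such as `val_runtime` have no matching column, so they are silently skipped rather than written to the CSV.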


# **Fine-tuning on the Japanese newspaper dataset**

In [None]:
len(df_src), len(train_src), len(val_src)

NameError: name 'df_src' is not defined

In [None]:
# Find the latest cleaned CSV automatically, then load it
import glob, os, pandas as pd, numpy as np
src_candidates = sorted(glob.glob("/content/drive/MyDrive/japanese_news_clean_*.csv"))
assert len(src_candidates) > 0, "No cleaned CSV found; check the save location"
SRC_PATH = src_candidates[-1]
print("src:", SRC_PATH)

df_src = pd.read_csv(SRC_PATH)
# Use the newspaper labels as-is (18 remain after filtering; expects columns: label, text)
labels_src = sorted(df_src['label'].unique())
label2id_src = {lb:i for i,lb in enumerate(labels_src)}
id2label_src = {i:lb for lb,i in label2id_src.items()}
df_src["label_id"] = df_src["label"].map(label2id_src)

# Stratified 90/10 split (capping the number of examples per source is also an option)
from sklearn.model_selection import train_test_split
train_src, val_src = train_test_split(
    df_src[["text","label_id"]],
    test_size=0.1, random_state=SEED, stratify=df_src["label_id"]
)

from datasets import Dataset
MAX_LEN = 256
def enc_src(batch):
    enc = tok(batch["text"], truncation=True, max_length=MAX_LEN)
    enc["labels"] = batch["label_id"]
    return enc

ds_train_src = Dataset.from_pandas(train_src)
ds_val_src   = Dataset.from_pandas(val_src)
ds_train_src = ds_train_src.map(enc_src, batched=True, remove_columns=ds_train_src.column_names)
ds_val_src   = ds_val_src.map(enc_src,   batched=True, remove_columns=ds_val_src.column_names)

len(labels_src), len(ds_train_src), len(ds_val_src)

src: /content/drive/MyDrive/japanese_news_clean_2025-11-10.csv


Map:   0%|          | 0/245691 [00:00<?, ? examples/s]

Map:   0%|          | 0/27299 [00:00<?, ? examples/s]

(18, 245691, 27299)

In [None]:
!pip install -U transformers

In [None]:
from transformers import (
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer,
)
from sklearn.metrics import accuracy_score, f1_score

# Number of labels = number of newspaper sources (18 after filtering)
num_labels_src = len(labels_src)

model_src = AutoModelForSequenceClassification.from_pretrained(
    "cl-tohoku/bert-base-japanese-v3",
    num_labels=num_labels_src,
    id2label=id2label_src,
    label2id=label2id_src,
)

data_collator_src = DataCollatorWithPadding(tok)

def compute_src(eval_pred):
    logits, y_true = eval_pred
    y_pred = logits.argmax(axis=-1)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
    }

args_src = TrainingArguments(
    output_dir="/content/outputs/supft_cls_src",
    num_train_epochs=10,           # upper bound; an earlier note suggested starting with 2 epochs
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    eval_strategy="epoch",  # validate at the end of every epoch
    save_strategy="epoch",        # save a checkpoint at the end of every epoch
    logging_steps=50,
    seed=SEED,
    report_to="none",
)

trainer_src = Trainer(
    model=model_src,
    args=args_src,
    train_dataset=ds_train_src,
    eval_dataset=ds_val_src,
    tokenizer=tok,
    data_collator=data_collator_src,
    compute_metrics=compute_src,
)

trainer_src.train()
val_src_metrics = trainer_src.evaluate()
print("SRC VAL:", val_src_metrics)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at cl-tohoku/bert-base-japanese-v3 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  trainer_src = Trainer(


Epoch,Training Loss,Validation Loss


KeyboardInterrupt: 

In [None]:
import os, glob

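# Output directory of the supervised fine-tuning run
SRC_DIR = "/content/outputs/supft_cls_src"

# List checkpoints, ordered by step number (a plain string sort would put checkpoint-9999 after checkpoint-15356)
ckpts = sorted(glob.glob(os.path.join(SRC_DIR, "checkpoint-*")),
               key=lambda p: int(p.rsplit("-", 1)[-1]))
print("checkpoints:", ckpts)

# Use the most recent checkpoint for now (end of epoch 2)
SRC_CKPT = ckpts[-1] if ckpts else SRC_DIR
print("use checkpoint:", SRC_CKPT)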

checkpoints: ['/content/outputs/supft_cls_src/checkpoint-15356', '/content/outputs/supft_cls_src/checkpoint-30712']
use checkpoint: /content/outputs/supft_cls_src/checkpoint-30712
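Checkpoint directories are named by step count, so ordering them by the numeric suffix matters once step counts differ in digit length. A small sketch with made-up paths (`checkpoint_step` is a hypothetical helper):

```python
def checkpoint_step(path):
    """Extract the trailing step number from a 'checkpoint-<step>' path."""
    return int(path.rsplit("-", 1)[-1])

ckpts = ["out/checkpoint-9999", "out/checkpoint-15356", "out/checkpoint-30712"]

by_string = sorted(ckpts)[-1]
by_step = sorted(ckpts, key=checkpoint_step)[-1]
print(by_string)  # out/checkpoint-9999
print(by_step)    # out/checkpoint-30712
```

In the run above both checkpoints have five-digit step numbers, so either ordering selects `checkpoint-30712`.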


In [None]:
from transformers import AutoModelForSequenceClassification, DataCollatorWithPadding, TrainingArguments, Trainer
from sklearn.metrics import accuracy_score, f1_score

num_labels_tgt = len(labels)  # 9 livedoor classes
print("num_labels_tgt:", num_labels_tgt)

model_tgt = AutoModelForSequenceClassification.from_pretrained(
    SRC_CKPT,
    num_labels=num_labels_tgt,
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True  # re-initialize only the classification head, whose size differs (18 -> 9)
)

def compute_tgt(eval_pred):
    logits, y_true = eval_pred
    y_pred = logits.argmax(axis=-1)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
    }

args_tgt = TrainingArguments(
    output_dir="/content/outputs/supft_cls_tgt",
    num_train_epochs=10,               # kept light because this is few-shot
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    eval_strategy="epoch",
    save_strategy="no",
    logging_steps=10,
    seed=SEED,
    report_to="none",
)

trainer_tgt = Trainer(
    model=model_tgt,
    args=args_tgt,
    train_dataset=ds_train,   # livedoor few-shot train set built earlier
    eval_dataset=ds_val,      # livedoor validation set
    tokenizer=tok,
    data_collator=DataCollatorWithPadding(tok),
    compute_metrics=compute_tgt,
)

trainer_tgt.train()
val_tgt = trainer_tgt.evaluate()
print("VAL (livedoor supft):", val_tgt)

pred = trainer_tgt.predict(ds_test)
test_tgt = {
    "accuracy": accuracy_score(pred.label_ids, pred.predictions.argmax(-1)),
    "macro_f1": f1_score(pred.label_ids, pred.predictions.argmax(-1), average="macro"),
}
print("TEST (livedoor supft):", test_tgt)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at /content/outputs/supft_cls_src/checkpoint-30712 and are newly initialized because the shapes did not match:
- classifier.bias: found shape torch.Size([18]) in the checkpoint and torch.Size([9]) in the model instantiated
- classifier.weight: found shape torch.Size([18, 768]) in the checkpoint and torch.Size([9, 768]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


num_labels_tgt: 9


  trainer_tgt = Trainer(


Epoch,Training Loss,Validation Loss,Accuracy,Macro F1
1,2.2222,2.001513,0.288889,0.280388
2,1.8092,1.713414,0.6,0.576869
3,1.4074,1.513836,0.622222,0.595897
4,1.1688,1.344288,0.677778,0.661907
5,0.7032,1.253587,0.7,0.685054
6,0.5671,1.156201,0.722222,0.701411
7,0.5044,1.113641,0.711111,0.694766
8,0.3681,1.081076,0.711111,0.691659
9,0.3472,1.054598,0.722222,0.701217
10,0.2851,1.051625,0.733333,0.717972


VAL (livedoor supft): {'eval_loss': 1.0516250133514404, 'eval_accuracy': 0.7333333333333333, 'eval_macro_f1': 0.7179718298288915, 'eval_runtime': 1.2604, 'eval_samples_per_second': 71.403, 'eval_steps_per_second': 4.76, 'epoch': 10.0}
TEST (livedoor supft): {'accuracy': 0.7269316286826014, 'macro_f1': 0.6988643636123386}
