<a href="https://colab.research.google.com/github/SunnyORZ030/CMPE255-Modern-AI-with-unsloth.ai/blob/main/smolLM2_135M.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
# Remove packages that may conflict first (fine if some are not installed)
!pip uninstall -y unsloth unsloth-zoo transformers accelerate datasets trl bitsandbytes \
  pyarrow fsspec gcsfs cudf-cu12 pylibcudf-cu12 dask-cudf-cu12 cuml-cu12 \
  sentence-transformers torchtune || true

!pip install -U pip

# triton version compatible with Colab's torch 2.8.0+cu126
!pip install triton==3.4.0

# Core stack (pin transformers to 4.55.2 to fix the cached_property error)
!pip install transformers==4.55.2 accelerate==1.4.0 datasets==4.3.0

# datasets / cloud I/O dependencies (avoids bigframes/gcsfs complaints)
!pip install pyarrow==21.0.0 fsspec==2024.5.0 gcsfs==2024.5.0

# bitsandbytes version that is compatible with unsloth and not on its blocklist
!pip install bitsandbytes==0.47.0

# TRL (SFTTrainer) plus unsloth itself and unsloth-zoo
!pip install trl==0.23.0 unsloth==2025.11.2 unsloth-zoo==2025.11.3

# (Optional) avoid later noise from sentence-transformers
!pip install sentence-transformers==5.1.2

Found existing installation: transformers 4.57.1
Uninstalling transformers-4.57.1:
  Successfully uninstalled transformers-4.57.1
Found existing installation: accelerate 1.11.0
Uninstalling accelerate-1.11.0:
  Successfully uninstalled accelerate-1.11.0
Found existing installation: datasets 4.0.0
Uninstalling datasets-4.0.0:
  Successfully uninstalled datasets-4.0.0
Found existing installation: pyarrow 18.1.0
Uninstalling pyarrow-18.1.0:
  Successfully uninstalled pyarrow-18.1.0
Found existing installation: fsspec 2025.3.0
Uninstalling fsspec-2025.3.0:
  Successfully uninstalled fsspec-2025.3.0
Found existing installation: gcsfs 2025.3.0
Uninstalling gcsfs-2025.3.0:
  Successfully uninstalled gcsfs-2025.3.0
Found existing installation: cudf-cu12 25.6.0
Uninstalling cudf-cu12-25.6.0:
  Successfully uninstalled cudf-cu12-25.6.0
Found existing installation: pylibcudf-cu12 25.6.0
Uninstalling pylibcudf-cu12-25.6.0:
  Successfully uninstalled pylibcudf-cu12-25.6.0
Found existing ins

In [3]:
import unsloth  # Always import first to avoid optimization warnings and dependency-order issues
import torch, transformers, datasets, accelerate, triton
import bitsandbytes as bnb

print("CUDA available:", torch.cuda.is_available())
print("torch:", torch.__version__)
print("triton:", triton.__version__)
print("transformers:", transformers.__version__)
print("datasets:", datasets.__version__)
print("accelerate:", accelerate.__version__)
print("unsloth:", unsloth.__version__)
print("bitsandbytes:", bnb.__version__)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
CUDA available: True
torch: 2.8.0+cu126
triton: 3.4.0
transformers: 4.57.1
datasets: 4.3.0
accelerate: 1.4.0
unsloth: 2025.11.2
bitsandbytes: 0.47.0
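
The output above reports transformers 4.57.1 even though the install cell pins 4.55.2; the log does not show why, though a later install may have pulled in a newer release. A small check like the following sketch could flag such drift right after the imports. The helper name `find_version_drift` and the `installed` override are illustrative assumptions, not part of the notebook; only the standard-library `importlib.metadata` is used.

```python
from importlib import metadata


def find_version_drift(pins, installed=None):
    """Return {package: (pinned, found)} for packages whose version differs from the pin.

    `installed` is an optional dict of already-known versions (hypothetical
    convenience for testing); when omitted, versions are read from the
    current environment.
    """
    drift = {}
    for name, pinned in pins.items():
        if installed is not None:
            found = installed.get(name)
        else:
            try:
                found = metadata.version(name)
            except metadata.PackageNotFoundError:
                found = None  # package is not installed at all
        if found != pinned:
            drift[name] = (pinned, found)
    return drift


# Pins copied from the install cell.
pins = {
    "transformers": "4.55.2",
    "accelerate": "1.4.0",
    "datasets": "4.3.0",
    "triton": "3.4.0",
    "bitsandbytes": "0.47.0",
    "unsloth": "2025.11.2",
}
# Versions copied from the printed output above.
reported = dict(pins, transformers="4.57.1")

print(find_version_drift(pins, reported))
# {'transformers': ('4.55.2', '4.57.1')}
```

Calling `find_version_drift(pins)` without the second argument checks the live environment instead of the copied values.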


In [4]:
# Build a tiny chat training set (Alpaca-style instruction format) to confirm the pipeline runs end to end
import os, json
os.makedirs("/content/data", exist_ok=True)

samples = [
    {"instruction":"You are a helpful assistant. Answer briefly.",
     "input":"What is overfitting in machine learning?",
     "output":"Overfitting is when a model learns training noise and performs poorly on new data."},
    {"instruction":"You are a helpful assistant. Answer briefly.",
     "input":"Explain cross-validation in one sentence.",
     "output":"Cross-validation splits data into folds to estimate generalization performance reliably."},
    {"instruction":"You are a helpful assistant. Answer briefly.",
     "input":"What is the difference between classification and regression?",
     "output":"Classification predicts discrete labels; regression predicts continuous values."},
    {"instruction":"You are a helpful assistant. Answer briefly.",
     "input":"Give a short tip to avoid overfitting.",
     "output":"Use more data, regularization, or validation-based early stopping."},
    {"instruction":"You are a helpful assistant. Answer briefly.",
     "input":"Name one evaluation metric for binary classification.",
     "output":"F1-score."},
]
with open("/content/data/chat_train.jsonl", "w") as f:
    for ex in samples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")

# Correct way to load: the dict key is the split name (not the file name)
from datasets import load_dataset
ds_dict = load_dataset("json", data_files={"train": "/content/data/chat_train.jsonl"})
print(ds_dict)
print(ds_dict["train"][0])

Generating train split: 0 examples [00:00, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output'],
        num_rows: 5
    })
})
{'instruction': 'You are a helpful assistant. Answer briefly.', 'input': 'What is overfitting in machine learning?', 'output': 'Overfitting is when a model learns training noise and performs poorly on new data.'}
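
Before handing a JSONL file to `load_dataset`, it can help to confirm that every record carries the three fields the later cells rely on. The sketch below is one way to do that with the standard library only; `validate_jsonl_lines` and `REQUIRED_KEYS` are hypothetical names introduced here for illustration.

```python
import json

REQUIRED_KEYS = ("instruction", "input", "output")


def validate_jsonl_lines(lines, required=REQUIRED_KEYS):
    """Return a list of (line_number, problem) tuples; an empty list means no problems."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        # A field counts as missing when it is absent or only whitespace.
        missing = [key for key in required if not str(record.get(key, "")).strip()]
        if missing:
            problems.append((number, "missing or empty: " + ", ".join(missing)))
    return problems


good = json.dumps({"instruction": "Answer briefly.", "input": "Q?", "output": "A."})
bad = json.dumps({"instruction": "Answer briefly.", "input": "Q?"})
print(validate_jsonl_lines([good, bad]))
# [(2, 'missing or empty: output')]
```

For the file written above, the same function can be applied to an open file handle, for example `validate_jsonl_lines(open("/content/data/chat_train.jsonl"))`.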


In [5]:
import torch

BASE_MODEL = "HuggingFaceTB/SmolLM2-135M-Instruct"
USE_BF16 = torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] >= 8
TORCH_DTYPE = torch.bfloat16 if USE_BF16 else torch.float16

print("Using dtype:", TORCH_DTYPE)
print("CUDA available:", torch.cuda.is_available())

model = None; tokenizer = None
try:
    from unsloth import FastLanguageModel
    print("✓ Using unsloth.FastLanguageModel")
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = BASE_MODEL,
        dtype      = TORCH_DTYPE,
        load_in_4bit = False,          # Full finetuning: no 4-bit quantization
        trust_remote_code = True,
    )
    # Enable training (all parameters); the unsloth API may differ across versions, so handle both cases
    if hasattr(FastLanguageModel, "for_training"):
        FastLanguageModel.for_training(model, use_gradient_checkpointing=True)
    else:
        model.gradient_checkpointing_enable()

except Exception as e:
    print("[INFO] unsloth failed to load; falling back to transformers. Reason:", e)
    from transformers import AutoModelForCausalLM, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, use_fast=True, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        BASE_MODEL,
        torch_dtype=TORCH_DTYPE,
        trust_remote_code=True,
    )
    model.gradient_checkpointing_enable()

_ = model.to("cuda")
print("Model device:", next(model.parameters()).device)
print("Vocab size:", len(tokenizer))

Using dtype: torch.float16
CUDA available: True
✓ Using unsloth.FastLanguageModel
Are you certain you want to do remote code execution?
==((====))==  Unsloth 2025.11.2: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/269M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/132 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/655 [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

HuggingFaceTB/SmolLM2-135M-Instruct does not have a padding token! Will use pad_token = <|endoftext|>.
Model device: cuda:0
Vocab size: 49152
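
The log above shows a Tesla T4 with CUDA capability 7.5 and `Bfloat16 = FALSE`, which is why the cell prints `torch.float16`. The dtype rule from the cell can be isolated as a pure function, sketched below without importing torch so it runs anywhere. The name `pick_dtype_name` is an illustrative assumption; it returns a string rather than a `torch.dtype`.

```python
def pick_dtype_name(cuda_available, compute_capability):
    """Mirror the notebook's rule: bfloat16 on compute capability >= 8.x, else float16."""
    major, _minor = compute_capability
    return "bfloat16" if cuda_available and major >= 8 else "float16"


print(pick_dtype_name(True, (7, 5)))   # capability reported for the T4 in the log
# float16
print(pick_dtype_name(True, (8, 0)))   # a capability that passes the >= 8 check
# bfloat16
```

As in the original cell, the function falls back to float16 when CUDA is unavailable.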


In [6]:
from datasets import Dataset

train_ds = ds_dict["train"]

def to_text(example):
    instr = example.get("instruction","").strip()
    inp   = example.get("input","").strip()
    out   = example.get("output","").strip()
    text = (
        "### Instruction:\n"
        f"{instr}\n\n"
        "### Input:\n"
        f"{inp}\n\n"
        "### Response:\n"
        f"{out}"
    )
    return {"text": text}

train_ds = train_ds.map(to_text, remove_columns=train_ds.column_names)
print(train_ds[0]["text"])

Map:   0%|          | 0/5 [00:00<?, ? examples/s]

### Instruction:
You are a helpful assistant. Answer briefly.

### Input:
What is overfitting in machine learning?

### Response:
Overfitting is when a model learns training noise and performs poorly on new data.
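
The same prompt layout is needed again at inference time, so one option is to keep it in a single place and derive both the training text and the generation prompt from it. The sketch below assumes hypothetical helper names (`PROMPT_TEMPLATE`, `build_prompt`, `build_training_text`). The optional `eos_token` argument is an addition that is not in the notebook: appending an end-of-sequence token to training text is a common practice for teaching a model where to stop, and it is shown here only as an option.

```python
PROMPT_TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)


def build_prompt(instruction, user_input):
    """Prompt used at inference time: everything up to and including the response header."""
    return PROMPT_TEMPLATE.format(instruction=instruction.strip(), input=user_input.strip())


def build_training_text(instruction, user_input, output, eos_token=""):
    """Training text: the prompt followed by the target answer (and an optional EOS token)."""
    return build_prompt(instruction, user_input) + output.strip() + eos_token


print(build_training_text(
    "You are a helpful assistant. Answer briefly.",
    "Name one evaluation metric for binary classification.",
    "F1-score.",
))
```

With the default `eos_token=""`, the result matches the format produced by `to_text` above.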


In [11]:
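# Disable AMP entirely (avoids GradScaler issues). The weights keep the dtype they were loaded with (TORCH_DTYPE); call model.float() for true FP32 training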
import os
os.environ["ACCELERATE_MIXED_PRECISION"] = "no"
os.environ["DISABLE_MIXED_PRECISION"] = "1"

import torch
from torch.utils.data import DataLoader
from torch.optim import AdamW
from transformers.optimization import get_linear_schedule_with_warmup

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.train()

# 1) Tokenize the text (this time return plain Python lists, not tensors)
MAX_LEN = 512
def tokenize_fn(batch):
    enc = tokenizer(
        batch["text"],
        truncation=True,
        padding="max_length",
        max_length=MAX_LEN,
        return_tensors=None,     # Key point: do not convert to tensors here
    )
    # Autoregressive LM: labels = input_ids
    enc["labels"] = [ids[:] for ids in enc["input_ids"]]
    return enc

tok_ds = train_ds.map(tokenize_fn, batched=True, remove_columns=train_ds.column_names)

# 2) DataLoader: convert list -> torch.tensor in the collate function
def collate(batch):
    # batch: List[Dict[str, list]]
    keys = batch[0].keys()
    out = {}
    for k in keys:
        arr = [b[k] for b in batch]  # List[list[int]]
        # The fields here are lists (input_ids/labels/attention_mask)
        # Convert them all to LongTensor (common practice; the mask can be long too)
        out[k] = torch.tensor(arr, dtype=torch.long)
    return out

dl = DataLoader(tok_ds, batch_size=1, shuffle=True, collate_fn=collate, num_workers=0)

# 3) Optimizer and scheduler (no AMP)
optimizer = AdamW(model.parameters(), lr=2e-4, weight_decay=0.01)
num_epochs = 1
num_update_steps_per_epoch = max(1, len(dl))
max_train_steps = num_epochs * num_update_steps_per_epoch
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=max(1, int(0.03 * max_train_steps)),
    num_training_steps=max_train_steps,
)

# 4) Minimal training loop (no AMP, no GradScaler)
log_every = 5
step = 0
for epoch in range(num_epochs):
    for batch in dl:
        step += 1
        batch = {k: v.to(device) for k, v in batch.items()}

        outputs = model(
            input_ids=batch["input_ids"],
            attention_mask=batch.get("attention_mask", None),
            labels=batch["labels"],
        )
        loss = outputs.loss

        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()

        if step % log_every == 0:
            print(f"epoch {epoch+1} step {step}/{max_train_steps} | loss = {loss.item():.4f}")

print("Training complete ✅")

Map:   0%|          | 0/5 [00:00<?, ? examples/s]

epoch 1 step 5/5 | loss = 25.6127
Training complete ✅
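
The tokenization step pads every example to `MAX_LEN = 512` and copies `input_ids` into `labels`, so padded positions are also counted in the loss. A common remedy is to set the label to `-100` wherever `attention_mask` is 0, since `-100` is the index ignored by the loss in Hugging Face causal LM models. The sketch below works on plain Python lists, matching the list-based `tokenize_fn`; the helper name `mask_padding_labels` is hypothetical. Masking by attention mask rather than by token id matters here because the pad token was set to `<|endoftext|>`, which can also be a real token.

```python
IGNORE_INDEX = -100  # label value excluded from the loss


def mask_padding_labels(input_ids, attention_mask, ignore_index=IGNORE_INDEX):
    """Copy input_ids into labels, replacing padded positions with ignore_index."""
    return [
        [token if keep == 1 else ignore_index for token, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]


# Toy batch: one sequence of three real tokens followed by two pads.
enc = {"input_ids": [[11, 12, 13, 0, 0]], "attention_mask": [[1, 1, 1, 0, 0]]}
enc["labels"] = mask_padding_labels(enc["input_ids"], enc["attention_mask"])
print(enc["labels"])
# [[11, 12, 13, -100, -100]]
```

Inside `tokenize_fn`, this would replace the line that builds `enc["labels"]`; the `collate` function needs no change because `-100` fits in a `torch.long` tensor.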


In [12]:
import os
save_dir = "/content/smollm2_fullft_ckpt"
os.makedirs(save_dir, exist_ok=True)

# Save the weights (including config); safe_serialization=True writes .safetensors
model.save_pretrained(save_dir, safe_serialization=True)
tokenizer.save_pretrained(save_dir)

# List the saved files to confirm
import glob
print("Saved files:", glob.glob(save_dir + "/*"))

Saved files: ['/content/smollm2_fullft_ckpt/merges.txt', '/content/smollm2_fullft_ckpt/chat_template.jinja', '/content/smollm2_fullft_ckpt/config.json', '/content/smollm2_fullft_ckpt/generation_config.json', '/content/smollm2_fullft_ckpt/vocab.json', '/content/smollm2_fullft_ckpt/special_tokens_map.json', '/content/smollm2_fullft_ckpt/model.safetensors', '/content/smollm2_fullft_ckpt/tokenizer_config.json', '/content/smollm2_fullft_ckpt/tokenizer.json']
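
Instead of reading the printed list by eye, the saved directory can be checked programmatically. The following sketch looks for a few of the file names that appear in the listing above; which files count as required is an assumption made for illustration, as are the names `REQUIRED_FILES` and `missing_checkpoint_files`.

```python
import os
import tempfile

# Assumed minimum set, taken from the file names printed above.
REQUIRED_FILES = ("config.json", "model.safetensors", "tokenizer.json", "tokenizer_config.json")


def missing_checkpoint_files(save_dir, required=REQUIRED_FILES):
    """Return the required file names that are not present in save_dir."""
    present = set(os.listdir(save_dir)) if os.path.isdir(save_dir) else set()
    return [name for name in required if name not in present]


# Demonstration on a temporary directory that holds only two of the files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("config.json", "tokenizer.json"):
        open(os.path.join(tmp, name), "w").close()
    demo_missing = missing_checkpoint_files(tmp)

print(demo_missing)
# ['model.safetensors', 'tokenizer_config.json']
```

In the notebook, `missing_checkpoint_files(save_dir)` returning an empty list would indicate that all assumed files were written.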


In [13]:
import torch
model.eval()

def chat(prompt, max_new_tokens=96):
    # Use the simple chat template from Step 2
    text = f"### Instruction:\nYou are a helpful assistant. Answer briefly.\n\n### Input:\n{prompt}\n\n### Response:\n"
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.95,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )
    decoded = tokenizer.decode(out[0], skip_special_tokens=True)
    # Keep only the part after "### Response:"
    split_key = "### Response:"
    return decoded.split(split_key)[-1].strip()

print(chat("Explain cross-validation in one sentence."))
print("----")
print(chat("Give a short tip to avoid overfitting."))

Cross-validation is a technique used to evaluate the performance of a model on unseen data.
----
When training a model on a dataset, overfitting occurs when a model is too complex and fits the training data too closely. This results in the model being too good at making predictions, and fails to generalize well to new data.

Explanation:
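
The second answer above runs on past the brief reply into an "Explanation:" section, and `chat` returns everything after the last `### Response:` marker. A slightly stricter post-processing step is sketched below. Unlike the original `split(...)[-1]`, it keeps the text after the first marker, cuts at any following `### ` header, and can optionally keep only the first paragraph. The function name `extract_response` and the `first_paragraph_only` flag are hypothetical, and the sample string is made up for the demonstration.

```python
def extract_response(decoded, marker="### Response:", first_paragraph_only=False):
    """Return the model's answer from a decoded prompt+completion string."""
    _head, found, tail = decoded.partition(marker)
    if not found:
        return decoded.strip()  # marker absent: fall back to the whole text
    cut = tail.find("\n### ")   # stop at the next section header, if any
    if cut != -1:
        tail = tail[:cut]
    tail = tail.strip()
    if first_paragraph_only:
        tail = tail.split("\n\n", 1)[0].strip()
    return tail


sample = (
    "### Instruction:\nAnswer briefly.\n\n### Input:\nQ?\n\n"
    "### Response:\nShort answer.\n\nExplanation:\nextra text\n\n### Input:\nAnother question"
)
print(extract_response(sample, first_paragraph_only=True))
# Short answer.
```

Inside `chat`, the final line could call `extract_response(decoded)` in place of the `split`-based expression.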
