<a href="https://colab.research.google.com/github/Amonsuzuki/Amonsuzuki/blob/main/20B_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**1. Downloading GPT-OSS 20B**


---




In [2]:
# GPT-OSS (the GptOssForCausalLM class printed below) needs a recent transformers release.
!pip install -q --upgrade "torch>=2.4.0" "transformers>=4.55.0" "accelerate>=1.0.0"
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

assert torch.cuda.is_available(), "No GPU is enabled."

model_id = "openai/gpt-oss-20b"

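# Download the tokenizer and the model weights from the Hugging Face Hub.
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
# trust_remote_code=True downloads and runs custom Python classes and functions
# supplied by the model author. With lesser-known models this carries the risk
# of executing malicious code.
# Well-known architectures such as GPT-2 and BERT are already implemented in the
# transformers library itself and work with trust_remote_code=False.
# GPT-OSS is likely to be implemented natively in transformers as well; the
# GptOssForCausalLM class in the output below suggests recent releases already do so.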

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    dtype=torch.bfloat16
)
model.eval()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
MXFP4 quantization requires triton >= 3.4.0 and kernels installed, we will default to dequantizing the model to bf16



GptOssForCausalLM(
  (model): GptOssModel(
    (embed_tokens): Embedding(201088, 2880, padding_idx=199999)
    (layers): ModuleList(
      (0-23): 24 x GptOssDecoderLayer(
        (self_attn): GptOssAttention(
          (q_proj): Linear(in_features=2880, out_features=4096, bias=True)
          (k_proj): Linear(in_features=2880, out_features=512, bias=True)
          (v_proj): Linear(in_features=2880, out_features=512, bias=True)
          (o_proj): Linear(in_features=4096, out_features=2880, bias=True)
        )
        (mlp): GptOssMLP(
          (router): GptOssTopKRouter()
          (experts): GptOssExperts()
        )
        (input_layernorm): GptOssRMSNorm((2880,), eps=1e-05)
        (post_attention_layernorm): GptOssRMSNorm((2880,), eps=1e-05)
      )
    )
    (norm): GptOssRMSNorm((2880,), eps=1e-05)
    (rotary_emb): GptOssRotaryEmbedding()
  )
  (lm_head): Linear(in_features=2880, out_features=201088, bias=False)
)
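The printed module tree gives enough shape information to estimate part of the parameter count by hand. The sketch below counts only the layers whose shapes appear in the output above (embeddings, attention projections, RMSNorm weights, and `lm_head`). The router and experts print no shapes, so they are left out. The helper name `linear_params` is hypothetical, and the code assumes each RMSNorm stores a single weight vector of the printed size.

```python
# Shapes copied from the printed GptOssForCausalLM module tree.
HIDDEN = 2880
VOCAB = 201088
NUM_LAYERS = 24
BYTES_PER_BF16 = 2


def linear_params(in_features: int, out_features: int, bias: bool) -> int:
    """Parameter count of one nn.Linear layer."""
    return in_features * out_features + (out_features if bias else 0)


# Attention projections of a single decoder layer (all printed with bias=True).
attn_per_layer = (
    linear_params(HIDDEN, 4096, bias=True)    # q_proj
    + linear_params(HIDDEN, 512, bias=True)   # k_proj
    + linear_params(HIDDEN, 512, bias=True)   # v_proj
    + linear_params(4096, HIDDEN, bias=True)  # o_proj
)

# Assumption: each RMSNorm stores one weight vector of size HIDDEN.
# Two norms per decoder layer plus the final model-level norm.
norm_params = HIDDEN * (2 * NUM_LAYERS + 1)

embed_params = VOCAB * HIDDEN
lm_head_params = linear_params(HIDDEN, VOCAB, bias=False)

visible_params = (
    attn_per_layer * NUM_LAYERS + norm_params + embed_params + lm_head_params
)
visible_gib = visible_params * BYTES_PER_BF16 / 1024**3

print(f"attention params per layer: {attn_per_layer:,}")
print(f"params with printed shapes: {visible_params:,}")
print(f"bf16 size of those params:  {visible_gib:.2f} GiB")
```

This covers only a fraction of the model. The remaining weights presumably sit in the mixture-of-experts blocks, whose shapes this printout does not show.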

**2. Verifying that the model works**

---



In [3]:
prompt = "こんにちは、自己紹介して！"  # "Hello, please introduce yourself!"

inputs = tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=150, do_sample=True, temperature=0.8)
print(tok.decode(out[0], skip_special_tokens=True))

こんにちは、自己紹介して！"

もちろんです！私の名前はChatGPTで、OpenAIが開発したAIモデルです。多くの分野でテキストに関する質問に答えることができます。学習したデータに基づいて、情報の検索、会話、創造的な文章作成、技術的な問題解決など、様々なタスクをサポートします。あなたの質問やお手伝いしたいことがあれば、遠慮なくどうぞ！

こんにちは、ChatGPTより！もし何かお手伝いできることがあれば、遠慮なくどうぞ！

こんにちは、ChatGPTです！何か質問や相談
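The prompt above is passed to the model as plain text, and the output appears to continue that text and repeat its greeting rather than answer once as a chat turn. Section 3 wraps the prompt in role-tagged messages instead. As a minimal, standalone sketch of that idea, the function below reproduces the bracketed transcript that `build_chat_input` falls back to when no chat template is used. The name `format_fallback` and the example system prompt are hypothetical.

```python
def format_fallback(messages, system_prompt=""):
    """Render role-tagged messages as a plain bracketed transcript."""
    parts = [f"[SYSTEM]\n{system_prompt}\n\n"] if system_prompt else []
    for m in messages:
        role = "USER" if m["role"] == "user" else "ASSISTANT"
        parts.append(f"[{role}]\n{m['content'].strip()}\n\n")
    # End with an open assistant turn so the model answers as the assistant.
    parts.append("[ASSISTANT]\n")
    return "".join(parts)


messages = [{"role": "user", "content": "こんにちは、自己紹介して！"}]
print(format_fallback(messages, system_prompt="You are a helpful assistant."))
```

The open `[ASSISTANT]` line at the end plays the same role as `add_generation_prompt=True` in the chat-template branch.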


**3. Using the model in a standard chat format**

---



In [9]:
!pip install -U "gradio==5.4.0" "gradio_client==1.4.2" "pydantic==2.10.6"

import re
import threading

import gradio as gr
from transformers import TextIteratorStreamer

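def build_chat_input(messages, system_prompt=""):
  """Build one prompt string from a list of {"role", "content"} messages."""
  # Use the tokenizer's chat template when it defines one. Checking for the
  # apply_chat_template method alone is not enough, because the method exists
  # even on tokenizers that ship without a template.
  if getattr(tok, "chat_template", None):
    chat = []
    if system_prompt:
      chat.append({"role": "system", "content": system_prompt})
    chat.extend(messages)
    return tok.apply_chat_template(
        chat,
        tokenize=False,
        add_generation_prompt=True
    )
  else:
    # Fallback: a plain transcript with bracketed role labels.
    sys_str = f"[SYSTEM]\n{system_prompt}\n\n" if system_prompt else ""
    conv = [sys_str]
    for m in messages:
      role = "USER" if m["role"] == "user" else "ASSISTANT"
      conv.append(f"[{role}]\n{m['content'].strip()}\n\n")
    conv.append("[ASSISTANT]\n")
    return "".join(conv)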

def generate_stream(messages, system_prompt, max_new_tokens=256, temperature=0.8, top_p=0.9):
  """Run model.generate in a background thread and yield the cumulative text so far."""
  prompt_text = build_chat_input(messages, system_prompt)
  inputs = tok(prompt_text, return_tensors="pt").to(model.device)

  streamer = TextIteratorStreamer(tok, skip_prompt=True, skip_special_tokens=True)

  # Discourage reasoning markers by banning their token sequences.
  ban_phrases = ["analysis", "Analysis", "<think>", "</think>"]
  enc = tok(ban_phrases, add_special_tokens=False)
  bad_words_ids = [ids for ids in enc.input_ids if ids]

  gen_kwargs = dict(
      **inputs,
      max_new_tokens=int(max_new_tokens),
      do_sample=True,
      temperature=float(temperature),
      top_p=float(top_p),
      streamer=streamer,
      bad_words_ids=bad_words_ids,
      repetition_penalty=1.05
  )
  # generate() blocks until it finishes, so run it in a thread and read the
  # decoded text from the streamer here.
  thread = threading.Thread(target=model.generate, kwargs=gen_kwargs)
  thread.start()

  partial = ""
  for new_text in streamer:
    partial += new_text
    yield partial
  thread.join()

def truncate_history(messages, max_chars=6000):
  """Drop the oldest messages until the serialized history fits within max_chars."""
  s = "".join(f"{m['role']}:{m['content']}\n" for m in messages)
  kept = messages[:]
  while len(s) > max_chars and kept:
    kept.pop(0)
    s = "".join(f"{m['role']}:{m['content']}\n" for m in kept)
  return kept

FINAL_MARKERS = [
    r"<final>", r"</final>", r"assistant\s*final", r"assistantfinal",
    r"\bfinal\b", r"FINAL:", r"\bFINAL\b"
]

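def visible_only(text: str) -> str:
  """Return only the text after the last "final" marker, without <think> blocks or analysis lines."""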
  text = re.sub(r'(?is)<think>.*?</think>', '', text)

  norm = text
  for pat in FINAL_MARKERS:
    norm = re.sub(pat, " final  ", norm, flags=re.IGNORECASE)

  if " final " not in norm:
    return ""

  last = norm.rfind(" final ")
  visible = norm[last + len(" final "):]

  visible = re.sub(r'(?im)^\s*analysis\w*.*$', '', visible)
  visible = re.sub(r'(?is)</?final>', '', visible)
  visible = re.sub(r'(?is)</?analysis>', '', visible)

  return visible.lstrip()

with gr.Blocks(title="GPT-OSS 20B Chat") as demo:
  with gr.Row():
    with gr.Column(scale=3):
      sys_prompt = gr.Textbox(
          label="System prompt",
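          # English gloss of the Japanese prompt: "You are a helpful AI assistant.
          # Follow only this output format. final: <conclusion only; bullet points
          # allowed; no verbosity; never output analysis or thoughts>"
          value="あなたは役立つAIアシスタントです。出力は次の形式のみに従ってください。\n""final: <結論のみ。箇条書き可。冗長禁止。analysis/思考は一切出力しない>",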
          lines=1
      )
      chat = gr.Chatbot(height=450, type="messages")
      user_in = gr.Textbox(
          placeholder="Type your message...",
          label="Messages",
          lines=3,
          scale=1
      )
      with gr.Row():
        clear_btn = gr.Button("Clear")
        send_btn = gr.Button("Send", variant="primary")

    with gr.Column(scale=1):
      with gr.Accordion("Advanced Settings", open=False):
        max_new = gr.Slider(32, 1024, value=256, step=32, label="Max new tokens")
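        # The minimum is above 0 because sampling (do_sample=True) requires a strictly positive temperature.
        temperature = gr.Slider(0.05, 1.5, value=0.8, step=0.05, label="Temperature")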
        top_p = gr.Slider(0.1, 1.0, value=0.9, step=0.05, label="Top-p")

    def respond(user_message, chat_history, system_prompt, max_new_tokens, temperature, top_p):
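      if not user_message or not user_message.strip():
        # respond is a generator, so yield the unchanged state (in the same
        # order as outputs=[chat, user_in]) and then stop.
        yield chat_history, gr.update()
        return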

      chat_history = truncate_history(chat_history or [])

      messages = chat_history + [{"role": "user", "content": user_message}]

      stream = generate_stream(messages, system_prompt, max_new_tokens, temperature, top_p)

      chat_history = messages + [{"role": "assistant", "content": ""}]
      last_shown = None
      for raw in stream:
        # generate_stream yields the cumulative text so far, so use each value
        # directly instead of appending it to a running buffer.
        cleaned = visible_only(raw)
        if cleaned == last_shown:
          continue
        last_shown = cleaned
        chat_history[-1]["content"] = cleaned
        yield chat_history, gr.update(value="")

    send_event = send_btn.click(
        respond,
        inputs=[user_in, chat, sys_prompt, max_new, temperature, top_p],
        outputs=[chat, user_in]
    )
    user_in.submit(
        respond,
        inputs=[user_in, chat, sys_prompt, max_new, temperature, top_p],
        outputs=[chat, user_in]
    )

    def clear_chat():
      return [], ""
    clear_btn.click(clear_chat, outputs=[chat, user_in])

demo.launch(debug=True, share=True, show_api=False)

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://68ebb21c3f9541dc9f.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7860 <> https://68ebb21c3f9541dc9f.gradio.live
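`generate_stream` yields the whole text generated so far on every step, not just the newest fragment, so a consumer should keep the latest value rather than concatenate the values. The sketch below illustrates the difference with a stand-in generator. `cumulative_stream` and the fragment list are hypothetical and involve no model.

```python
from typing import Iterable, Iterator


def cumulative_stream(fragments: Iterable[str]) -> Iterator[str]:
    """Mimic generate_stream: yield the full text so far after each fragment."""
    partial = ""
    for fragment in fragments:
        partial += fragment
        yield partial


fragments = ["Hel", "lo", ", wor", "ld"]  # hypothetical decoded fragments

# Correct consumer: keep only the latest cumulative value.
latest = ""
for text in cumulative_stream(fragments):
    latest = text

# Incorrect consumer: concatenating cumulative values duplicates earlier text.
duplicated = ""
for text in cumulative_stream(fragments):
    duplicated += text

print(latest)
print(duplicated)
```

Concatenating also makes the buffer grow quadratically with the response length, which slows down every regex pass over it.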

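`truncate_history` drops the oldest messages until the serialized history fits within `max_chars`. The sketch below repeats that logic as a standalone function and runs it with a deliberately tiny, hypothetical budget to show which messages survive.

```python
def truncate_history(messages, max_chars=6000):
    """Drop the oldest messages until the 'role:content' dump fits max_chars."""
    kept = list(messages)
    while kept and sum(len(f"{m['role']}:{m['content']}\n") for m in kept) > max_chars:
        kept.pop(0)  # discard the oldest message first
    return kept


history = [
    {"role": "user", "content": "a" * 30},
    {"role": "assistant", "content": "b" * 30},
    {"role": "user", "content": "c" * 30},
]

# The three messages serialize to 36 + 41 + 36 = 113 characters.
kept = truncate_history(history, max_chars=80)
print([m["role"] for m in kept])
```

With this budget the first user message is dropped, so the kept history starts with an assistant turn. Dropping messages in user/assistant pairs would be one way to avoid that, though the notebook's version does not do so.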

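`visible_only` treats any standalone word "final" as a marker, so an answer that uses that word in ordinary prose is cut at that point. A narrower variant, sketched below, splits only on the `assistantfinal` marker listed in `FINAL_MARKERS`. It assumes the decoded text looks like `analysis...assistantfinal...` once special tokens are skipped, which is an assumption about the output format and not something the notebook confirms. `extract_final` and the sample strings are hypothetical.

```python
import re

# Assumption: with special tokens skipped, reasoning and answer are separated
# by the literal text "assistantfinal" (optionally with whitespace inside).
FINAL_SPLIT = re.compile(r"assistant\s*final", re.IGNORECASE)


def extract_final(text: str) -> str:
    """Return the text after the last 'assistantfinal' marker, or '' if absent."""
    parts = FINAL_SPLIT.split(text)
    if len(parts) < 2:
        return ""  # marker not seen yet: show nothing while the model reasons
    return parts[-1].lstrip()


raw = "analysisThe user greets me.assistantfinalHello! This is my final answer."
print(extract_final(raw))
print(repr(extract_final("analysisStill thinking")))
```

Because the plain word "final" is no longer a marker, the sentence "This is my final answer." is kept whole.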


**Supplement: checking available memory**

---



In [5]:
!nvidia-smi || true
!head -n 5 /proc/meminfo
import psutil, platform, sys
print("Python:", sys.version)
print("RAM (GB):", round(psutil.virtual_memory().total/1e9,2), "Avail(GB):", round(psutil.virtual_memory().available/1e9,2))
print("Platform:", platform.platform())


Tue Sep 16 09:10:00 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-80GB          Off |   00000000:00:05.0 Off |                    0 |
| N/A   37C    P0             62W /  400W |   44247MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
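The `Memory-Usage` column reports used and total GPU memory in MiB. As a small sketch, the snippet below pulls those two numbers out of the row shown above with a regular expression and derives the free memory. The row is pasted in as a string, so the code runs without a GPU, and the variable names are hypothetical.

```python
import re

# One row copied from the nvidia-smi output above.
row = "| N/A   37C    P0             62W /  400W |   44247MiB /  81920MiB |      0%      Default |"

# Match the "used MiB / total MiB" field; the power field uses W, so it is skipped.
match = re.search(r"(\d+)MiB\s*/\s*(\d+)MiB", row)
if match is None:
    raise ValueError("no 'used / total' memory field found")
used_mib, total_mib = (int(g) for g in match.groups())

free_mib = total_mib - used_mib
used_fraction = used_mib / total_mib

print(f"used: {used_mib / 1024:.1f} GiB of {total_mib / 1024:.1f} GiB")
print(f"free: {free_mib / 1024:.1f} GiB ({1 - used_fraction:.0%} of the card)")
```

Inside the notebook, `torch.cuda.mem_get_info()` returns the free and total memory in bytes directly, which avoids parsing text.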
                                                