# 💬 Chat with Unsloth GPT-OSS (Streaming in a Notebook)

This notebook demonstrates a **minimal chat loop** using the **Unsloth** `FastLanguageModel` with **streaming generation** via Hugging Face’s `TextIteratorStreamer`.

It supports both **full-precision** and **4-bit quantized** models, and shows how to set the `reasoning_effort` parameter in the chat template.

---

## ✅ Overview

**You’ll learn how to:**
- Load and prepare an Unsloth model & tokenizer  
- Stream responses token-by-token  
- Control sampling (temperature, top-p, etc.)  
- Run an interactive REPL chat loop  
- Adjust reasoning effort (`"low"`, `"medium"`, `"high"`)

---

## 🧰 Requirements

**Install dependencies:**

```bash
pip install unsloth transformers accelerate
# For 4-bit models:
pip install bitsandbytes
```

## 🧠 Model Options

**You can use either full-precision or 4-bit quantized models:**

| Type         | Model Name                                                                      |
| ------------ | ------------------------------------------------------------------------------- |
| Full / MXFP4 | `unsloth/gpt-oss-20b`, `unsloth/gpt-oss-120b`                                   |
| 4-bit (bnb)  | `unsloth/gpt-oss-20b-unsloth-bnb-4bit`, `unsloth/gpt-oss-120b-unsloth-bnb-4bit` |

To enable 4-bit quantization, pass `load_in_4bit = True` to `FastLanguageModel.from_pretrained`.

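The pairing of model name and `load_in_4bit` can be kept in one place. The sketch below is only an illustration: `MODEL_NAMES` and `pick_model` are hypothetical helpers (not part of Unsloth), and the names they return are the four listed in the table above.

```python
# Hypothetical lookup table built from the model names in the table above.
MODEL_NAMES = {
    ("20b", False): "unsloth/gpt-oss-20b",
    ("120b", False): "unsloth/gpt-oss-120b",
    ("20b", True): "unsloth/gpt-oss-20b-unsloth-bnb-4bit",
    ("120b", True): "unsloth/gpt-oss-120b-unsloth-bnb-4bit",
}


def pick_model(size: str = "20b", four_bit: bool = True) -> dict:
    """Return keyword arguments intended for FastLanguageModel.from_pretrained."""
    key = (size.lower(), four_bit)
    if key not in MODEL_NAMES:
        raise ValueError(f"unknown size {size!r}; expected '20b' or '120b'")
    return {"model_name": MODEL_NAMES[key], "load_in_4bit": four_bit}


print(pick_model("20b", four_bit=True))
```

The returned dictionary could then be unpacked into the loading call, for example `FastLanguageModel.from_pretrained(**pick_model("20b"), max_seq_length=4096)`.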


## ⚙️ Key Parameters

| Parameter                | Description                                                          |
| ------------------------ | -------------------------------------------------------------------- |
| `max_seq_length=4096`    | Maximum context length in tokens                                     |
| `temperature=0.7`        | Sampling temperature; higher values give more varied output          |
| `top_p=0.9`              | Nucleus sampling: cumulative-probability cutoff for candidate tokens |
| `reasoning_effort="low"` | Reasoning effort passed to the chat template (`"low"`, `"medium"`, `"high"`) |
| `max_new_tokens=256`     | Maximum number of new tokens to generate per reply                   |
| `do_sample=True`         | Enables stochastic sampling instead of greedy decoding               |

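To make `temperature` and `top_p` more concrete, here is a small pure-Python sketch of the usual textbook definitions. It is a simplified illustration, not the code path `transformers` actually runs, and the function names are made up for this example.

```python
import math
import random


def softmax_with_temperature(logits, temperature=0.7):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=0.9):
    """Keep the most likely tokens until their cumulative mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, temperature=0.7, top_p=0.9, rng=random):
    """Sample one token id from the temperature-scaled, nucleus-filtered distribution."""
    nucleus = top_p_filter(softmax_with_temperature(logits, temperature), top_p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]


print(sample_token([2.0, 1.0, 0.5, -1.0]))
```

With `do_sample=False` neither knob applies, because the most likely token is always chosen.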

## 1️⃣ Load the model

```python
import threading

import torch
from unsloth import FastLanguageModel
from transformers import TextIteratorStreamer

# 4-bit pre-quantized models supported for 4x faster downloading and fewer OOMs.
fourbit_models = [
    "unsloth/gpt-oss-20b-unsloth-bnb-4bit",  # 20B model using bitsandbytes 4-bit quantization
    "unsloth/gpt-oss-120b-unsloth-bnb-4bit",
    "unsloth/gpt-oss-20b",  # 20B model using MXFP4 format
    "unsloth/gpt-oss-120b",
]  # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gpt-oss-20b-unsloth-bnb-4bit",
    dtype = None,  # None for auto detection
    max_seq_length = 4096,  # Choose any for long context!
    load_in_4bit = False,  # Set to True for 4-bit quantization to reduce memory
    full_finetuning = False,  # [NEW!] We have full finetuning now!
    # token = "hf_...",  # use one if using gated models
)

# Switch to inference mode and remember which device the weights live on.
FastLanguageModel.for_inference(model)
device = next(model.parameters()).device

# Fall back to the EOS token for padding only when no pad token is defined.
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id
```




Output of the recorded run (Tesla T4), which ended with a `ValueError`:

```text
==((====))==  Unsloth 2025.9.11: Fast Gpt_Oss patching. Transformers: 4.56.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.564 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for gpt_oss won't work! Using float32.
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.

ValueError: GptOssForCausalLM does not support an attention implementation through torch.nn.functional.scaled_dot_product_attention yet. Please request the support for this architecture: https://github.com/huggingface/transformers/issues/28005. If you believe this error is a bug, please open an issue in Transformers GitHub repository and load your model with the argument `attn_implementation="eager"` meanwhile. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="eager")`
```

The error message itself suggests loading the model with `attn_implementation="eager"` as a workaround.

## 2️⃣ Initialize conversation

```python
# Conversation history; the system message sets the assistant's tone.
messages = [
    {"role": "system", "content": "You are a helpful assistant. Be concise and friendly."},
]
```

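Because every turn is appended to `messages`, a long session can eventually outgrow `max_seq_length`. One simple way to bound the history is sketched below. It is an assumption of this write-up, not something the notebook does: `trim_history` is a hypothetical helper, and it counts messages rather than tokens.

```python
def trim_history(messages, max_turns=8):
    """Keep all system messages plus the most recent `max_turns` other messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    recent = rest[-max_turns:] if max_turns > 0 else []
    return system + recent


history = [{"role": "system", "content": "You are a helpful assistant."}]
for i in range(6):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history, max_turns=4)
print(len(trimmed), trimmed[1]["content"])
```

A token-based limit would be more precise, but it needs the tokenizer; counting messages keeps the example self-contained.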


## 3️⃣ Chat function with streaming

```python
def chat_once(user_text: str) -> str:
    """Append the user turn, stream the reply to stdout, and record it in the history."""
    messages.append({"role": "user", "content": user_text})
    input_ids = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
        reasoning_effort="low",  # "low", "medium", or "high"
    ).to(device)

    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    gen_kwargs = dict(
        input_ids=input_ids,
        max_new_tokens=256,
        do_sample=True, top_p=0.9, temperature=0.7,
        eos_token_id=tokenizer.eos_token_id, pad_token_id=tokenizer.pad_token_id,
        use_cache=True, streamer=streamer,
    )
    # generate() blocks, so run it in a background thread and read the streamer here.
    thread = threading.Thread(target=model.generate, kwargs=gen_kwargs)
    thread.start()

    chunks = []
    print("Assistant: ", end="", flush=True)
    for text in streamer:
        print(text, end="", flush=True)
        chunks.append(text)
    print()
    thread.join()  # make sure generation has fully finished

    reply = "".join(chunks).strip()
    messages.append({"role": "assistant", "content": reply})
    return reply
```

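The thread-plus-streamer arrangement in `chat_once` is a producer/consumer pattern. The standard-library sketch below shows roughly the same idea without a model. `fake_generate` is a hypothetical stand-in for `model.generate`, and the queue with a sentinel only approximates what `TextIteratorStreamer` does internally.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of the stream


def fake_generate(prompt, out_queue):
    """Hypothetical stand-in for model.generate: push text pieces, then the sentinel."""
    for piece in ["Hello", ", ", "world", "!"]:
        out_queue.put(piece)
    out_queue.put(_DONE)


def stream_reply(prompt):
    """Yield pieces as soon as the worker thread produces them."""
    pieces = queue.Queue()
    worker = threading.Thread(target=fake_generate, args=(prompt, pieces))
    worker.start()
    while True:
        piece = pieces.get()  # blocks until the producer has something
        if piece is _DONE:
            break
        yield piece
    worker.join()


chunks = []
for piece in stream_reply("hi"):
    print(piece, end="", flush=True)
    chunks.append(piece)
print()
reply = "".join(chunks)
```

Running generation on the main thread instead would block until the whole reply is ready, so nothing could be printed incrementally.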

## 4️⃣ Interactive loop

```python
while True:
    s = input("\nYou: ").strip()
    if s.lower() in {"bye", "quit", "exit", "thanks", "thank you, the problem has been resolved"}:
        print("Assistant: Glad I could help. 👋")
        break
    if s:  # skip empty input instead of sending a blank turn
        chat_once(s)
```

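The loop above is hard to exercise without a keyboard and a loaded model. A variant with injectable input and response functions can be run in a script or a test. This is a sketch under assumptions: `run_chat`, `EXIT_WORDS`, and the `read`/`write` parameters are hypothetical names introduced here, and `respond` would be `chat_once` in the notebook.

```python
EXIT_WORDS = {"bye", "quit", "exit"}


def run_chat(respond, read=input, write=print):
    """Read lines until an exit word or end of input; return the number of turns handled."""
    turns = 0
    while True:
        try:
            text = read("\nYou: ").strip()
        except (EOFError, KeyboardInterrupt):
            break  # Ctrl-D or Ctrl-C ends the session quietly
        if not text:
            continue  # ignore blank lines
        if text.lower() in EXIT_WORDS:
            write("Assistant: Glad I could help.")
            break
        respond(text)
        turns += 1
    return turns


# Scripted session: feed canned inputs and record what would be sent to the model.
scripted = iter(["hello", "", "quit"])
seen = []
count = run_chat(seen.append, read=lambda prompt: next(scripted), write=lambda *args: None)
print(count, seen)
```

In the notebook, `run_chat(chat_once)` would behave like the original loop, with blank lines skipped.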

## 🧮 Example Modifications

- Shorter or longer outputs: adjust `max_new_tokens`
- Deterministic output: set `do_sample=False`
- Different assistant tone: modify the system message
- Lower GPU memory usage: switch to 4-bit with `load_in_4bit=True`

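The first two modifications can be gathered into one small helper that builds the generation arguments. `make_gen_kwargs` is a hypothetical name used only for this sketch. It omits `temperature` and `top_p` in deterministic mode, since they only matter when sampling.

```python
def make_gen_kwargs(deterministic=False, max_new_tokens=256, temperature=0.7, top_p=0.9):
    """Build keyword arguments for model.generate (sampling knobs only when sampling)."""
    kwargs = {"max_new_tokens": max_new_tokens, "do_sample": not deterministic}
    if not deterministic:
        kwargs.update(temperature=temperature, top_p=top_p)
    return kwargs


print(make_gen_kwargs())
print(make_gen_kwargs(deterministic=True, max_new_tokens=64))
```

Inside `chat_once`, the result could be merged into `gen_kwargs` with `gen_kwargs.update(make_gen_kwargs(...))`.
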
## 🚀 Troubleshooting

| Problem              | Fix                                                       |
| -------------------- | --------------------------------------------------------- |
| `CUDA out of memory` | Use 4-bit quantization or reduce `max_seq_length`         |
| Model loads on CPU   | Verify GPU with `torch.cuda.is_available()`               |
| No streaming output  | Ensure you’re using `TextIteratorStreamer`                |
| Template error       | Remove `reasoning_effort` if unsupported in your template |

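For the "Model loads on CPU" row, a quick check of what PyTorch can see is often the first step. The helper below is a sketch; `describe_device` is a hypothetical name, and the `torch` import is guarded so the snippet also runs where PyTorch is not installed.

```python
def describe_device():
    """Return a short description of the GPU PyTorch would use, if any."""
    try:
        import torch
    except ImportError:
        return "torch is not installed"
    if not torch.cuda.is_available():
        return "no CUDA GPU visible; the model would load on CPU"
    index = torch.cuda.current_device()
    name = torch.cuda.get_device_name(index)
    total_gb = torch.cuda.get_device_properties(index).total_memory / 1024**3
    return f"{name} ({total_gb:.1f} GB)"


print(describe_device())
```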

## 📜 Credits

- Models: Unsloth GPT-OSS family
- Libraries: `unsloth`, `transformers`, `accelerate`, `bitsandbytes`
- Author: Adapted for educational and research use 🧑‍💻