In [1]:
!pip install -q -U bitsandbytes accelerate transformers peft

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m41.4/41.4 kB[0m [31m2.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m60.1/60.1 MB[0m [31m13.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.0/12.0 MB[0m [31m101.9 MB/s[0m eta [36m0:00:00[0m
[?25h

In [6]:
# ==============================================================
# Inference with LoRA adapter from michsethowusu/ga_code_assistant
# Base model: unsloth/qwen3-4b-instruct-2507-unsloth-bnb-4bit
# ==============================================================

from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
from peft import PeftModel
import torch

# --------------------------------------------------------------
# Model setup
# --------------------------------------------------------------
base_model_name = "unsloth/qwen3-4b-instruct-2507-unsloth-bnb-4bit"
adapter_name = "michsethowusu/twi_code_assistant"

# Load base 4-bit model
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True
)

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, adapter_name)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

# --------------------------------------------------------------
# Chat-based inference with streaming
# --------------------------------------------------------------
messages = [
    {"role": "user", "content": "Mewɔ csv fael bi. Mepɛ sɛ mede python compress no. ma me code bi"}
]

# Apply the chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True  # Must be True for chat generation
)

# Stream the response as it’s generated
print("\n=== Model Response ===\n")
streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(
    **tokenizer(text, return_tensors="pt").to(model.device),
    max_new_tokens=1000,   # Adjust for longer responses if needed
    temperature=0.7,
    top_p=0.8,
    top_k=20,
    streamer=streamer,
)


adapter_config.json: 0.00B [00:00, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/264M [00:00<?, ?B/s]


=== Model Response ===

Wubetumi de Python nhomakorabea ahorow ahorow a ɛwɔ hɔ no adi dwuma de adi CSV fael ahorow ho dwuma. Nhwɛsoɔ koodu a ɛde CSV fael bi a wɔde ahyɛ mu no di dwuma nie:

```python
import csv
import gzip
import shutil

with open('file.csv', 'rb') as f_in:
    with gzip.open('file.csv.gz', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
```

Wɔ saa koodu yi mu no, yɛde nhomakorabea ahorow a ɛho hia, `csv`, `gzip`, ne `shutil` di dwuma. Afei, yɛde `with` asɛm no di dwuma de yɛ adwuma a ɛyɛ mmerɛw na ɛma yɛn memory ho kwan.

`open('file.csv', 'rb')` line no yɛ fael bi a wɔfrɛ no 'File.csv' no mu nhwɛso a ɛde mode 'rb' di dwuma, a ɛkyerɛ sɛ ɛsɛ sɛ yɛde binary mode di dwuma.

`gzip.open('file.csv.gz', 'wb')` line no yɛ fael bi a wɔfrɛ no 'File.csv.gz' no mu nhwɛso a ɛde mode 'wb' di dwuma, a ɛkyerɛ sɛ ɛsɛ sɛ yɛde binary mode di dwuma.

`shutil.copyfileobj(f_in, f_out)` line no yɛ adwuma a ɛma yɛtumi de fael no kɔ mu na yɛde yɛn adwuma no adi dwuma wɔ binary mode 