In [None]:
%%capture

!pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
!pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer
!pip install --no-deps unsloth

In [None]:
from google.colab import userdata
import os

os.environ["HF_TOKEN"] = userdata.get("HF_TOKEN") # Create a token at https://huggingface.co/settings/tokens
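
The cell above only works inside Colab, because `google.colab.userdata` is not importable elsewhere. A minimal sketch of a fallback follows, assuming the token may instead already be set as an environment variable. The helper name `resolve_hf_token` is hypothetical and not part of any library.

```python
import os


def resolve_hf_token(env_var="HF_TOKEN"):
    """Return a Hugging Face token from Colab secrets, else from the environment.

    Hypothetical helper: tries Colab's secret store first and falls back to
    an environment variable with the same name.
    """
    token = None
    try:
        # Only importable when running inside Google Colab.
        from google.colab import userdata
    except ImportError:
        userdata = None

    if userdata is not None:
        try:
            token = userdata.get(env_var)
        except Exception:
            # Secret missing or notebook access not granted; fall through.
            token = None

    token = token or os.environ.get(env_var)
    if not token:
        raise RuntimeError(
            f"No token found: add a Colab secret or set the {env_var} environment variable."
        )
    return token


# Usage (left commented out so the sketch does not fail when no token is set):
# os.environ["HF_TOKEN"] = resolve_hf_token()
```

Raising early with a clear message avoids a less obvious authentication error later, when the model or dataset download starts.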

In [None]:
import json
import torch
from unsloth import FastModel
from datasets import load_dataset

model, tokenizer = FastModel.from_pretrained(
    model_name="lukepramo221/gemma-3-4b-finetuned_choreo-qna_full_v0.1",
    max_seq_length=2048,
    load_in_4bit=True,
    token=os.environ["HF_TOKEN"]
)



🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.6.11: Fast Gemma3 patching. Transformers: 4.52.4.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for gemma3 won't work! Using float32.


Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.

In [None]:
dataset = load_dataset("lukepramo221/gemma-3-4b-finetuned_choreo-qna_full_v0.1", split="test", token=os.environ["HF_TOKEN"])
output_file_path = 'output.jsonl'
results = []

for data in dataset:
    question = data.get("question")

    if not question:
        continue

    messages = [{
        "role": "user",
        "content": [{"type" : "text", "text" : question}]
    }]

    prompt_text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

    inputs = tokenizer([prompt_text], return_tensors="pt").to("cuda")

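    outputs = model.generate(
        input_ids=inputs["input_ids"],
        # Pass the mask explicitly: pad and eos tokens are identical for this
        # model, so generate() cannot infer it from the input ids.
        attention_mask=inputs["attention_mask"],
        max_new_tokens=256,
        temperature=1.0,
        top_p=0.95,
        top_k=64,
        do_sample=True
    )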

    generated_text = tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)[0]
    print(generated_text + '\n')
    new_data = data.copy()
    new_data['finetuned_answer'] = generated_text.strip()

    results.append(new_data)

with open(output_file_path, 'w') as outfile:
    for entry in results:
        outfile.write(json.dumps(entry) + '\n')

print(f"Processing complete. Output written to {output_file_path}")


Generating test split:   0%|          | 0/166 [00:00<?, ? examples/s]

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
You have set `compile_config`, but we are unable to meet the criteria for compilation. Compilation will be skipped.


Choreo supports the standalone deployment of each part (component) of your application, allowing each to be deployed and made live individually.

Updates, including security patches and bug fixes, are managed for software components in a private data plane via the organization's central update management system.

To set up this pipeline, go to the 'Deploy' page of your component, choose a compatible environment from the 'Set Up' card, and then activate the 'Auto Build on Commit' option. For deployment, utilize the 'Deploy on Commit' option directly on the component's 'Deploy' page, or leverage the 'Deploy' card for scheduled deployment.

To this end, simply provide your custom Dockerfile name as a unique value in the 'BuildSettings' section of your Choreo project. This will instruct Choreo to build your container images using the specific steps outlined in that Dockerfile.

The recommended process is to manually trigger a promotion of the container image to the `production` environment
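
Before downloading the file, it can help to confirm that every record received a non-empty answer. The sketch below reads `output.jsonl` back and counts empty `finetuned_answer` fields. The helper names `load_jsonl` and `summarize_answers` are hypothetical and introduced only for illustration.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def summarize_answers(records, field="finetuned_answer"):
    """Return the record count and how many have a missing or blank answer."""
    empty = sum(1 for r in records if not str(r.get(field) or "").strip())
    return {"total": len(records), "empty": empty}


# Only runs when the generation cell has already produced the file.
if Path("output.jsonl").exists():
    print(summarize_answers(load_jsonl("output.jsonl")))
```

A non-zero `empty` count would suggest that some generations produced only special tokens or whitespace, which is worth inspecting before using the file downstream.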

In [None]:
from google.colab import files
files.download('output.jsonl')