<a href="https://colab.research.google.com/github/AlperYildirim1/gemma-pipeline/blob/main/makale_sft_gemma_3_4b_test_pipeline_open_ended.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To run this, press "*Runtime*" and then "*Run all*" on a **free** Tesla T4 Google Colab instance! (The outputs recorded below come from a run on an NVIDIA L4.)
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions [here](https://docs.unsloth.ai/get-started/installing-+-updating).

This notebook evaluates a fine-tuned Gemma 3 model: it builds a test set of questions not seen during SFT, generates answers, and scores them with an LLM judge.


### News

Unsloth now supports Text-to-Speech (TTS) models. Read our [guide here](https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning).

Read our **[Gemma 3N Guide](https://docs.unsloth.ai/basics/gemma-3n-how-to-run-and-fine-tune)** and check out our new **[Dynamic 2.0](https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs)** quants, which outperform other quantization methods!

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" huggingface_hub hf_transfer
    !pip install --no-deps unsloth

In [None]:
# ==============================================================================
# CELL 1: CONFIGURATION
# ==============================================================================
import os
import pandas as pd
import json
from tqdm.auto import tqdm
import torch
from datasets import load_dataset
from unsloth import FastModel
from openai import OpenAI

# --- Main Configuration ---
# Set this to the SFT or GRPO model you want to test
MODEL_NAME = "Yujivus/gemma-3_sft_grpo" # Or "Yujivus/gemma-3-4b-sft-model"

# OpenAI Configuration for the Judge LLM
# IMPORTANT: Add your OpenAI API Key here!
os.environ["OPENAI_API_KEY"] = "" # <--- PASTE YOUR KEY HERE
JUDGE_MODEL = "gpt-4.1-2025-04-14"

# Output paths for the Verifiable Problems benchmark results
OUTPUT_DIR = "/content/drive/MyDrive/gemma sft cevaplar"
FINAL_RESULTS_JSON = os.path.join(OUTPUT_DIR, "final_evaluation_results_verifiable_problems_grpo_model_4b.json")
INTERMEDIATE_CSV = os.path.join(OUTPUT_DIR, "intermediate_generated_answers_verifiable_problems_grpo_model_4b.csv")

# Maximum number of test samples to evaluate (set to None to use the full test set)
MAX_SAMPLES = 100

# Create the output directory if it doesn't exist
os.makedirs(OUTPUT_DIR, exist_ok=True)

print(f"Configuration loaded. Model: {MODEL_NAME}, Benchmark: Verifiable Problems")
print(f"Final output will be saved to: {FINAL_RESULTS_JSON}")

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
Configuration loaded. Model: Yujivus/gemma-3_sft_grpo, Benchmark: Verifiable Problems
Final output will be saved to: /content/drive/MyDrive/gemma sft cevaplar/final_evaluation_results_verifiable_problems_grpo_model_4b.json
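
Pasting the key directly into the configuration cell makes it easy to leak when the notebook is shared. Below is a minimal sketch of one alternative, assuming the key is either already in the environment or stored as a Colab secret named `OPENAI_API_KEY`. The helper name `load_openai_key` is hypothetical and not part of the original notebook.

```python
import os
from getpass import getpass


def load_openai_key(var_name="OPENAI_API_KEY"):
    """Return the API key from the environment, a Colab secret, or a prompt."""
    key = os.environ.get(var_name)
    if key:
        return key
    try:
        # Only importable inside Colab; assumes a secret with this name exists
        # and that notebook access to it has been granted.
        from google.colab import userdata
        key = userdata.get(var_name)
    except Exception:
        key = None
    if not key:
        # Last resort: ask interactively so the key never lands in the notebook.
        key = getpass(f"Enter {var_name}: ")
    os.environ[var_name] = key
    return key
```

Calling `load_openai_key()` once before creating the `OpenAI()` client would leave the key in `os.environ`, which is where the client looks for it by default.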


In [None]:
# ==============================================================================
# CELL 2: CREATE DATASET AND LOAD MODEL
# ==============================================================================

# --- Part 1: Build the unseen test set ('rl_dataset') ---
print("Loading source datasets to create the test set...")
verifiable_dataset = load_dataset("FreedomIntelligence/medical-o1-verifiable-problem", split="train")
sft_dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", 'en', split="train")

print("Creating a set of questions used in the SFT dataset...")
sft_questions_set = set(item['Question'] for item in tqdm(sft_dataset, desc="Processing SFT questions"))

print("Filtering verifiable problems to create the unseen RL/Test dataset...")
rl_dataset = []
for item in tqdm(verifiable_dataset, desc="Filtering for Test Set"):
    verifiable_question = item['Open-ended Verifiable Question']
    if verifiable_question not in sft_questions_set:
        rl_dataset.append({
            "question": verifiable_question,
            "ground_truth_answer": item['Ground-True Answer']
        })
print(f"Created a test set with {len(rl_dataset)} unseen problems.")

# --- Part 2: Preparing the final test_data variable ---
if MAX_SAMPLES is not None:
    print(f"Selecting the first {MAX_SAMPLES} samples for this run.")
    test_data = rl_dataset[:MAX_SAMPLES]
else:
    test_data = rl_dataset

print(f"✅ Dataset ready. Using {len(test_data)} samples for evaluation.")

Loading source datasets to create the test set...
Creating a set of questions used in the SFT dataset...


Processing SFT questions:   0%|          | 0/19704 [00:00<?, ?it/s]

Filtering verifiable problems to create the unseen RL/Test dataset...


Filtering for Test Set:   0%|          | 0/40644 [00:00<?, ?it/s]

Created a test set with 27007 unseen problems.
Selecting the first 100 samples for this run.
✅ Dataset ready. Using 100 samples for evaluation.
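
The filter above relies on exact string equality, so a verifiable question that differs from its SFT counterpart only in casing or whitespace would still be counted as unseen. The sketch below illustrates the difference on toy data. The helper `filter_unseen`, its `normalize` option, and the sample questions are all hypothetical and are not part of the original pipeline.

```python
def filter_unseen(candidates, seen_questions, normalize=False):
    """Keep candidates whose question does not appear in seen_questions."""
    def key(text):
        # Optional normalization: lowercase and collapse runs of whitespace.
        return " ".join(text.lower().split()) if normalize else text

    seen = {key(q) for q in seen_questions}
    return [c for c in candidates if key(c["question"]) not in seen]


# Toy data, invented for illustration only.
sft_questions = ["What causes scurvy?", "Name the longest bone."]
verifiable = [
    {"question": "What causes scurvy?", "ground_truth_answer": "Vitamin C deficiency"},
    {"question": "what causes  scurvy?", "ground_truth_answer": "Vitamin C deficiency"},
    {"question": "Which vitamin deficiency causes rickets?", "ground_truth_answer": "Vitamin D"},
]

exact = filter_unseen(verifiable, sft_questions)
loose = filter_unseen(verifiable, sft_questions, normalize=True)
print(len(exact), len(loose))  # prints "2 1"
```

Whether normalization is appropriate depends on how the two source datasets were built; it is shown here only to make the exact-match assumption visible.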


In [None]:


# --- Part 3: Loading the model in 16-bit precision (no 4-bit/8-bit quantization) ---
print("\nLoading model and tokenizer with Unsloth...")
model, tokenizer = FastModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=4096,
    load_in_4bit=False,
    load_in_8bit=False,
    full_finetuning=False,
)
print("✅ Model loaded successfully in full precision.")


Loading model and tokenizer with Unsloth...
==((====))==  Unsloth 2025.7.6: Fast Gemma3 patching. Transformers: 4.53.2.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.161 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


✅ Model loaded successfully in full precision.
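
With both `load_in_4bit` and `load_in_8bit` disabled, the log above reports a switch to 16-bit LoRA, so the weights are held in a 16-bit format. The back-of-the-envelope sketch below estimates the memory taken by the weights alone at different precisions. The parameter count is an assumption read off the "4b" in the model name, and the estimate ignores activations, the KV cache, and framework overhead.

```python
def weight_memory_gib(n_params, bytes_per_param):
    """Approximate memory for the model weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30


# Assumption: roughly 4 billion parameters, taken from the "4b" in the model name.
N_PARAMS = 4e9

for label, nbytes in [("float32", 4), ("bfloat16", 2), ("4-bit", 0.5)]:
    print(f"{label:>8}: {weight_memory_gib(N_PARAMS, nbytes):.1f} GiB")
```

Under these assumptions the 16-bit weights fit comfortably within the 22 GB reported for the L4 in the log above.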


In [None]:
# ==============================================================================
# CELL 3: PHASE 1 - GENERATE ANSWERS
# ==============================================================================
print("\n--- Starting Phase 1: Generating Answers ---")

generated_results = []

# Loop through the prepared test data
for idx, item in enumerate(tqdm(test_data, desc="Generating Answers")):
    question_text = item['question']
    ground_truth_answer = item['ground_truth_answer']

    # Simple prompt for an open-ended question
    user_prompt = f"Please answer the following question:\n\n{question_text}"

    messages = [{"role": "user", "content": user_prompt}]
    text_input = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer([text_input], return_tensors="pt").to("cuda")

    # Generate the model's response
    outputs = model.generate(
        **inputs,
        # Allow up to 2048 new tokens so the CoT model can finish its reasoning.
        max_new_tokens=2048,
        # Greedy decoding (do_sample=False) gives a deterministic, reproducible benchmark;
        # temperature is not used when sampling is disabled.
        do_sample=False
    )

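    # Decode only the newly generated tokens so the prompt and chat-template
    # role markers do not leak into the saved answer.
    prompt_length = inputs["input_ids"].shape[1]
    model_full_answer = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()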

    # Saving the relevant columns
    generated_results.append({
        "question_number": idx,
        "question": question_text,
        "ground_truth_answer": ground_truth_answer,
        "model_full_answer": model_full_answer,
    })

# Save intermediate results to a CSV file
df_answers = pd.DataFrame(generated_results)
df_answers.to_csv(INTERMEDIATE_CSV, index=False)

print(f"\n✅ Phase 1 Complete. {len(generated_results)} answers saved to {INTERMEDIATE_CSV}")


--- Starting Phase 1: Generating Answers ---


Generating Answers:   0%|          | 0/100 [00:00<?, ?it/s]


✅ Phase 1 Complete. 100 answers saved to /content/drive/MyDrive/gemma sft cevaplar/intermediate_generated_answers_verifiable_problems_grpo_model_4b.csv


In [None]:
# ==============================================================================
# CELL 4: PHASE 2 - EVALUATE WITH LLM-AS-JUDGE
# ==============================================================================
print("\n--- Starting Phase 2: Evaluating and Combining Results ---")

if not os.environ.get("OPENAI_API_KEY"):
    print("🚨 WARNING: OpenAI API key is not set. Skipping Phase 2.")
else:
    try:
        df_to_judge = pd.read_csv(INTERMEDIATE_CSV)
        client = OpenAI()
        final_results_list = []

        # Judge prompt for open-ended medical answers, allowing for synonyms.
        judge_system_prompt = """You are an expert medical evaluator. Your task is to evaluate a model's response to a medical question against a ground-truth answer. You will assess two distinct aspects: the correctness of the final conclusion and the soundness of the reasoning that led to it.

The question is open-ended.

**Evaluation Criteria:**

**1. Final Answer Correctness (for the `correctness` field):**
- Determine if the model's **final conclusion** is a correct and synonymous match for the ground-truth answer.
- Be lenient with synonyms. For example, 'Gastric ulcer' and 'Stomach ulcer' are correct matches.
- Focus on medical accuracy. If the model names a different condition, it is incorrect.

**2. Reasoning Soundness (for the `reasoning_correctness` field):**
- Evaluate the model's Chain-of-Thought (CoT) or reasoning steps.
- The reasoning is considered **correct** if it is medically sound, logical, and does not contain significant errors. Brief but accurate reasoning is acceptable.
- The reasoning is considered **incorrect** if it contains significant medical inaccuracies or logical fallacies, **even if the final answer happens to be correct.**

**Output Format:**

You MUST respond ONLY with a valid JSON object with the following structure:
{
  "correctness": boolean,
  "reasoning_correctness": boolean,
  "justification": "A brief explanation for your decisions on both correctness and reasoning_correctness. Note any synonyms used or specific flaws in the reasoning."
}
"""

        for _, row in tqdm(df_to_judge.iterrows(), total=len(df_to_judge), desc="Judging Answers"):
            final_record = row.to_dict()
            # Build the judge's user prompt from the question, ground truth, and model response
            judge_user_prompt = f"""
Please evaluate the following model output.

**Original Question:**
{row['question']}

**Ground Truth Answer:**
{row['ground_truth_answer']}

**Model's Full Response (including reasoning):**
{row['model_full_answer']}
"""
            try:
                response = client.chat.completions.create(
                    model=JUDGE_MODEL,
                    messages=[
                        {"role": "system", "content": judge_system_prompt},
                        {"role": "user", "content": judge_user_prompt}
                    ],
                    temperature=0.0,
                    response_format={"type": "json_object"}
                )
                judge_assessment = json.loads(response.choices[0].message.content)
                final_record['judge_evaluation'] = judge_assessment
            except Exception as e:
                print(f"Error judging question {row['question_number']}: {e}")
                final_record['judge_evaluation'] = {"correctness": "error", "reasoning_correctness": "error", "justification": str(e)}
            final_results_list.append(final_record)

        with open(FINAL_RESULTS_JSON, 'w') as f:
            json.dump(final_results_list, f, indent=4)
        print(f"\n✅ Phase 2 Complete. Final results saved to {FINAL_RESULTS_JSON}")

        correct_count = sum(1 for item in final_results_list if item.get('judge_evaluation', {}).get('correctness') is True)
        total_judged = len(final_results_list)

        if total_judged > 0:
            accuracy = (correct_count / total_judged) * 100
            print("\n--- Final Verifiable Problems Results ---")
            print(f"Answer Correctness: {accuracy:.2f}% ({correct_count}/{total_judged})")

    except FileNotFoundError:
        print(f"🚨 ERROR: The intermediate answers file was not found at {INTERMEDIATE_CSV}. Cannot run Phase 2.")
    except Exception as e:
        print(f"🚨 An unexpected error occurred during Phase 2: {e}")


--- Starting Phase 2: Evaluating and Combining Results ---


Judging Answers:   0%|          | 0/100 [00:00<?, ?it/s]


✅ Phase 2 Complete. Final results saved to /content/drive/MyDrive/gemma sft cevaplar/final_evaluation_results_verifiable_problems_grpo_model_4b.json

--- Final Verifiable Problems Results ---
Answer Correctness: 26.00% (26/100)
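
The judge is asked for a JSON object with two boolean fields, but `json.loads` only guarantees that the reply is valid JSON, not that it has this shape. A string such as `"true"` in place of a boolean would be silently counted as incorrect by the `is True` check in the summary. Below is a minimal sketch of a stricter parser, assuming the schema given in the judge prompt; the function name `parse_judge_response` is hypothetical.

```python
import json

REQUIRED_BOOLEAN_FIELDS = ("correctness", "reasoning_correctness")


def _error_record(message):
    # Same shape as the error records written by the evaluation loop.
    return {
        "correctness": "error",
        "reasoning_correctness": "error",
        "justification": message,
    }


def parse_judge_response(raw_text):
    """Parse the judge's reply and check it against the requested schema."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return _error_record(f"Invalid JSON: {exc}")
    if not isinstance(data, dict):
        return _error_record("Judge reply is not a JSON object")
    for field in REQUIRED_BOOLEAN_FIELDS:
        if not isinstance(data.get(field), bool):
            return _error_record(f"Field '{field}' is missing or not a boolean")
    data.setdefault("justification", "")
    return data
```

In the loop, `parse_judge_response(response.choices[0].message.content)` could stand in for the bare `json.loads` call, so malformed replies show up as explicit error records.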


In [None]:

# ==============================================================================
# CELL 5: PHASE 2 (RERUN) - EVALUATE WITH LLM-AS-JUDGE AND REPORT REASONING CORRECTNESS
# ==============================================================================
print("\n--- Starting Phase 2: Evaluating and Combining Results ---")

if not os.environ.get("OPENAI_API_KEY"):
    print("🚨 WARNING: OpenAI API key is not set. Skipping Phase 2.")
else:
    try:
        # Load the generated answers
        df_to_judge = pd.read_csv(INTERMEDIATE_CSV)
        client = OpenAI()

        # Holds the final combined records (generated answer + judge assessment)
        final_results_list = []

        judge_system_prompt = """You are an expert medical evaluator. Your task is to evaluate a model's response to a medical question against a ground-truth answer. You will assess two distinct aspects: the correctness of the final conclusion and the soundness of the reasoning that led to it.

The question is open-ended.

**Evaluation Criteria:**

**1. Final Answer Correctness (for the `correctness` field):**
- Determine if the model's **final conclusion** is a correct and synonymous match for the ground-truth answer.
- Be lenient with synonyms. For example, 'Gastric ulcer' and 'Stomach ulcer' are correct matches.
- Focus on medical accuracy. If the model names a different condition, it is incorrect.

**2. Reasoning Soundness (for the `reasoning_correctness` field):**
- Evaluate the model's Chain-of-Thought (CoT) or reasoning steps.
- The reasoning is considered **correct** if it is medically sound, logical, and does not contain significant errors. Brief but accurate reasoning is acceptable.
- The reasoning is considered **incorrect** if it contains significant medical inaccuracies or logical fallacies, **even if the final answer happens to be correct.**

**Output Format:**

You MUST respond ONLY with a valid JSON object with the following structure:
{
  "correctness": boolean,
  "reasoning_correctness": boolean,
  "justification": "A brief explanation for your decisions on both correctness and reasoning_correctness. Note any synonyms used or specific flaws in the reasoning."
}
"""

        for _, row in tqdm(df_to_judge.iterrows(), total=len(df_to_judge), desc="Judging Answers"):
            # Start from the generated-answer row; the judge assessment is attached below
            final_record = row.to_dict()

            judge_user_prompt = f"""
Please evaluate the following model output.

**Original Question:**
{row['question']}

**Ground Truth Answer:**
{row['ground_truth_answer']}

**Model's Full Response (including reasoning):**
{row['model_full_answer']}
"""
            try:
                response = client.chat.completions.create(
                    model=JUDGE_MODEL,
                    messages=[
                        {"role": "system", "content": judge_system_prompt},
                        {"role": "user", "content": judge_user_prompt}
                    ],
                    temperature=0.0, # Judge should be deterministic
                    response_format={"type": "json_object"} # Enforce JSON output
                )

                judge_assessment = json.loads(response.choices[0].message.content)
                # Attach the judge's assessment to the record
                final_record['judge_evaluation'] = judge_assessment

            except Exception as e:
                print(f"Error judging question {row['question_number']}: {e}")
                final_record['judge_evaluation'] = {
                    "correctness": "error",
                    "reasoning_correctness": "error",
                    "justification": str(e)
                }

            # Append the complete record to the final list
            final_results_list.append(final_record)

        # Save the combined list to a single JSON file
        with open(FINAL_RESULTS_JSON, 'w') as f:
            json.dump(final_results_list, f, indent=4)

        print(f"\n✅ Phase 2 Complete. Final combined results for {len(final_results_list)} items saved to {FINAL_RESULTS_JSON}")

        # Final Summary
        correct_count = sum(1 for item in final_results_list if item.get('judge_evaluation', {}).get('correctness') is True)
        reasoning_correct_count = sum(1 for item in final_results_list if item.get('judge_evaluation', {}).get('reasoning_correctness') is True)
        total_judged = len(final_results_list)

        if total_judged > 0:
            accuracy = (correct_count / total_judged) * 100
            reasoning_accuracy = (reasoning_correct_count / total_judged) * 100
            print("\n--- Final Results ---")
            print(f"Answer Correctness: {accuracy:.2f}% ({correct_count}/{total_judged})")
            print(f"Reasoning Correctness: {reasoning_accuracy:.2f}% ({reasoning_correct_count}/{total_judged})")

    except FileNotFoundError:
        print(f"🚨 ERROR: The intermediate answers file was not found at {INTERMEDIATE_CSV}. Cannot run Phase 2.")
    except Exception as e:
        print(f"🚨 An unexpected error occurred during Phase 2: {e}")



--- Starting Phase 2: Evaluating and Combining Results ---


Judging Answers:   0%|          | 0/100 [00:00<?, ?it/s]


✅ Phase 2 Complete. Final combined results for 100 items saved to /content/drive/MyDrive/gemma sft cevaplar/final_evaluation_results_verifiable_problems_grpo_model_4b.json

--- Final Results ---
Answer Correctness: 26.00% (26/100)
Reasoning Correctness: 31.00% (31/100)
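
The summary above divides by every record, so any item whose judge call failed is counted as incorrect. The sketch below shows one possible alternative that reports failed calls separately and also counts answers that were correct despite flawed reasoning. The function name `summarize` and the toy records are hypothetical; the field names follow the JSON structure saved by the evaluation cell.

```python
def summarize(records):
    """Aggregate judge verdicts, reporting failed judge calls separately."""
    evaluations = [r.get("judge_evaluation", {}) for r in records]
    # Only records with a real boolean verdict count as judged.
    judged = [e for e in evaluations if isinstance(e.get("correctness"), bool)]
    n = len(judged)
    answer_ok = sum(e["correctness"] for e in judged)
    reasoning_ok = sum(e.get("reasoning_correctness") is True for e in judged)
    flawed = sum(
        e["correctness"] and e.get("reasoning_correctness") is False
        for e in judged
    )
    return {
        "total": len(records),
        "errors": len(records) - n,
        "answer_accuracy": 100 * answer_ok / n if n else 0.0,
        "reasoning_accuracy": 100 * reasoning_ok / n if n else 0.0,
        "correct_answer_flawed_reasoning": flawed,
    }


# Toy records, invented for illustration only.
records = [
    {"judge_evaluation": {"correctness": True, "reasoning_correctness": True}},
    {"judge_evaluation": {"correctness": True, "reasoning_correctness": False}},
    {"judge_evaluation": {"correctness": False, "reasoning_correctness": False}},
    {"judge_evaluation": {"correctness": "error", "justification": "timeout"}},
]
summary = summarize(records)
print(summary)
```

Excluding failed calls from the denominator is a design choice that differs from the cell above; with no failures the two approaches give the same percentages.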


In [None]:
# ==============================================================================
# FINAL CELL: SHUTDOWN RUNTIME
# ==============================================================================

print("\n--- All tasks complete. Shutting down the Colab runtime to save resources. ---")
print("You will be disconnected shortly.")

# This is the official command to disconnect and terminate the Colab runtime.
from google.colab import runtime
runtime.unassign()


--- All tasks complete. Shutting down the Colab runtime to save resources. ---
You will be disconnected shortly.
