In [1]:
SYSTEM_PROMPT = """
You are in-house counsel and compliance officer for an SEC-registered investment adviser (RIA).

Your job:
- Spot material issues under the SEC Marketing Rule and firm advertising policies.
- Explain them briefly for an internal reviewer.
- Provide a clean, client-ready rewrite.

You must ALWAYS answer in EXACTLY this structure, with nothing before <think> and nothing after </fixed_copy>:

<think>...</think>
<critique>...</critique>
<fixed_copy>...</fixed_copy>

======================
PATTERN TO COPY (STYLE)
======================

<think>
Okay, the main issues here are unsubstantiated “premier” language and a single 12% performance figure with no time period or risk context. Under the Marketing Rule, that creates risks around implied superiority and an unbalanced presentation of results. I’ll reframe the reputation claim as clearly subjective, tie the 12% figure to a specific historical period, and add a brief reminder that results vary and loss is possible.
</think>

<critique>
• “One of the nation’s premier firms” reads like an objective ranking without support, raising unsubstantiated claim concerns.
• “Returned 12% net of fees” presents a point-in-time result with no time period or historical framing.
• “Continue to provide value to clients” could imply ongoing outperformance without balancing risks or variability.
</critique>

<fixed_copy>
I am pleased to introduce this month’s featured partner, [REDACTED:ORG], a firm I have worked with for more than 25 years. I believe [REDACTED:ORG] offers a thoughtful, research-driven approach to portfolio management.

Over the past 12 months, their flagship strategy delivered a 12% return net of fees. This result is historical and does not guarantee future performance. All investment strategies involve the risk of loss and results will vary.

Advisory services are provided by [REDACTED:ORG], L.L.C., a registered investment adviser where required by law. Registration does not imply a certain level of skill or training.
</fixed_copy>

======================
GUIDELINES (BRIEF)
======================

<think>
- 1 short paragraph, usually 2–4 sentences (about 40–80 words).
- It’s fine to sound like a quick internal monologue (e.g., “Okay, the main issues are…”), but stay concise and on-topic.
- Focus on: (a) what the copy is trying to do, (b) the main rule / risk themes, and (c) your plan for the rewrite.
- Do NOT mention “the user”, “the model”, “this task”, “system prompt”, or your own formatting.
- Avoid long step-by-step narration and do not re-quote large chunks of the original copy.
</think>

<critique>
- 3–5 bullet points, max ~120 words total.
- Each bullet: quote or closely paraphrase a phrase from the original copy AND name the concern (e.g., unsubstantiated claim, performance without period, unbalanced risk/benefit, testimonial/endorsement, weak disclosure).
- No process talk, no mention of prompts, tags, or being an AI.
</critique>

<fixed_copy>
- 1–3 short paragraphs, 80–200 words total.
- Preserve the legitimate business goal of the copy.
- Remove or soften non-compliant language.
- Make any performance or benefit statements clearly historical, time-bounded where relevant, and balanced with risks.
- Add or clarify disclosures (registration, no guarantee, risk of loss, limits of examples) where needed.
- Do NOT mention <think>, <critique>, “original text”, “edits”, or “compliance review”.
</fixed_copy>
""".strip()

PROMPT='''
•	Copy: 🧠Financial Myth Friday🧠

Myth: A Roth IRA conversion is always a good move.

Truth: One of the most common questions we receive from clients is, “Should I do a Roth IRA conversion?” While it may seem like an easy decision, there are multiple factors to consider before making a conversion.

It is crucial to understand your goals before determining if a Roth conversion is the right strategy for you. Once you’ve decided that a Roth conversion makes sense for you, you still need to figure out how much money to convert. While this can be daunting on your own, working with a Financial Planner to map out your options can help you understand both the potential benefits and repercussions of a Roth conversion.

Learn more about Roth IRA conversions here👉https://www.affiancefinancial.com/news/Roth-IRA-Conversion-Guide

#AffianceFinancial #FinancialMythFriday #financialplanning #RothIRA

'''
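The system prompt demands an exact three-section layout, so a small structural check is useful before any scoring. The following is a minimal sketch, not part of the original notebook: `STRUCTURE_RE` and `parse_structured_response` are hypothetical names, and the check assumes that whitespace between and around the sections is acceptable.

```python
import re

# Hypothetical helper: accepts only
# <think>...</think><critique>...</critique><fixed_copy>...</fixed_copy>
# with nothing but whitespace before, between, or after the sections.
STRUCTURE_RE = re.compile(
    r"\A\s*<think>(?P<think>.*?)</think>\s*"
    r"<critique>(?P<critique>.*?)</critique>\s*"
    r"<fixed_copy>(?P<fixed_copy>.*?)</fixed_copy>\s*\Z",
    flags=re.S,
)


def parse_structured_response(text: str):
    """Return the three sections as a dict, or None if the layout is violated."""
    # Each opening tag must appear exactly once; this rejects repeated sections
    # that the non-greedy regex alone would otherwise swallow.
    for tag in ("think", "critique", "fixed_copy"):
        if text.count(f"<{tag}>") != 1:
            return None
    m = STRUCTURE_RE.match(text)
    if m is None:
        return None
    return {name: body.strip() for name, body in m.groupdict().items()}


good = "<think>plan</think>\n<critique>- issue</critique>\n<fixed_copy>clean copy</fixed_copy>"
bad = "Sure! <think>plan</think><critique>- issue</critique><fixed_copy>copy</fixed_copy>"
print(parse_structured_response(good))
print(parse_structured_response(bad))
```

A `None` result flags outputs with preamble text, missing sections, or repeated tags, which are the failure modes the tag counts later in the notebook are meant to surface.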


In [2]:
MODEL_ID="Qwen/Qwen3-4B-Thinking-2507"

In [3]:
pip install -q -U transformers peft datasets bitsandbytes trl accelerate unsloth torch

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [5]:
from unsloth import FastLanguageModel
from peft import PeftModel
import shutil
import os
import torch
from pathlib import Path


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


In [6]:
def unzip_file(zip_file_path, output_dir):
    # Create the directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)

    # Unpack the archive
    shutil.unpack_archive(zip_file_path, output_dir)
    print(f"Unzipped '{zip_file_path}' to '{output_dir}'")

In [7]:
# Local directory where the DPO LoRA adapter archive is unpacked
OUTPUT_DIR = "/content/dpo_qwen3_4b_thinking"

unzip_file("/content/drive/MyDrive/dpo_fintuned.zip", OUTPUT_DIR)
# Other archive names noted for this adapter:
#   dpo_qwen3_4b_thinking_dpo.zip
#   dpo_qwen3_4b_thinking_dpo_1.zip
#   dpo_qwen3_4b_thinking_dpo_12.zip

Unzipped '/content/drive/MyDrive/dpo_fintuned.zip' to '/content/dpo_qwen3_4b_thinking'
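`PeftModel.from_pretrained` expects the directory it is given to contain `adapter_config.json`, and a zip archive sometimes unpacks into a nested folder. As a hedged sketch (the helper name `find_adapter_dir` is hypothetical and not part of the original notebook), the unpacked tree can be searched for the adapter directory before loading:

```python
from pathlib import Path
from typing import Optional


def find_adapter_dir(root: str) -> Optional[Path]:
    """Return the shallowest directory under `root` holding adapter_config.json."""
    root_path = Path(root)
    if not root_path.is_dir():
        return None
    # rglob also matches a config file sitting directly in `root`.
    candidates = sorted(
        (p.parent for p in root_path.rglob("adapter_config.json")),
        key=lambda d: len(d.parts),
    )
    return candidates[0] if candidates else None


# Prints None when the directory is missing or holds no adapter config.
print(find_adapter_dir("/content/dpo_qwen3_4b_thinking"))
```

If the result differs from `OUTPUT_DIR`, passing the returned path to the loader instead is one way to avoid a missing-config error.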


In [8]:
# Where to write eval files
EVAL_OUTPUT_DIR = Path("/content/drive/MyDrive") / "outputs_ara"
EVAL_OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

def load_model(with_lora: bool = True):
    """Load either the base model or the base+LoRA model."""
    print(f"🔍 Loading Unsloth base model (with_lora={with_lora}) from: {MODEL_ID}")

    if not torch.cuda.is_available():
        raise RuntimeError("CUDA GPU required for Unsloth fast path. Enable GPU in Colab.")

    major, _ = torch.cuda.get_device_capability(0)
    supports_bf16 = major >= 8  # compute capability 8.x or newer (A100, L4, H100, ...)
    dtype = torch.bfloat16 if supports_bf16 else None  # None lets Unsloth auto-detect

    # 1) Load base Qwen3 with Unsloth (4-bit QLoRA ready)
    base_model, tokenizer = FastLanguageModel.from_pretrained(
        model_name     = MODEL_ID,
        max_seq_length = 4096,
        dtype          = dtype,
        load_in_4bit   = True,
    )

    # 2) Optionally attach LoRA adapters
    if with_lora:
        print("🔧 Attaching DPO LoRA from:", OUTPUT_DIR)
        model = PeftModel.from_pretrained(base_model, OUTPUT_DIR)
    else:
        print("✅ Using pure base model (no LoRA)")
        model = base_model

    # 3) Prepare for inference
    tokenizer.pad_token = tokenizer.eos_token
    FastLanguageModel.for_inference(model)
    model.eval()

    return model, tokenizer


def generate_response(
    model,
    tokenizer,
    input_copy: str,
    max_new_tokens: int = 2000,
    strip_prompt: bool = True,
) -> str:
    """
    Run one model on the SEC review task.

    If strip_prompt=True (default), return only the newly generated text
    (everything after the prompt). If False, return the full decoded sequence.
    """
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user",   "content": f"Input Copy:\n{input_copy}"},
    ]

    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
        )

    # outputs[0] includes: [prompt tokens] + [generated tokens]
    if strip_prompt:
        prompt_len = inputs["input_ids"].shape[-1]
        generated_ids = outputs[0, prompt_len:]
        text = tokenizer.decode(generated_ids, skip_special_tokens=True)
    else:
        text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return text


def run_sec_review(input_copy: str):
    """Run only the fine-tuned model and print its output."""
    model, tokenizer = load_model(with_lora=True)
    text = generate_response(model, tokenizer, input_copy)

    print("\n========== FINETUNED MODEL OUTPUT ==========")
    print(text)
    print("============================================\n")


def run_sec_review_compare(input_copy: str):
    """Runs both base + finetuned and prints them for comparison."""
    # Base model (no LoRA)
    base_model, base_tok = load_model(with_lora=False)
    base_text = generate_response(base_model, base_tok, input_copy)

    # Finetuned model (with LoRA)
    ft_model, ft_tok = load_model(with_lora=True)
    ft_text = generate_response(ft_model, ft_tok, input_copy)

    print("\n========== BASE MODEL OUTPUT (no LoRA) ==========")
    print(base_text)
    print("=================================================\n")

    print("\n========== FINETUNED MODEL OUTPUT (DPO LoRA) ==========")
    print(ft_text)
    print("======================================================\n")


# -------------------------------------------------------------------
# Batch eval over a list of copies: N starts at 9, three files written per sample
# -------------------------------------------------------------------

def run_eval_batch(
    copies,
    start_n: int = 9,
    out_dir: Path = EVAL_OUTPUT_DIR,
    max_new_tokens: int = 2000,
):
    """
    For each input copy in `copies`, run base + finetuned models and save:
      eval_s{N}_input.txt  -> raw input
      eval_s{N}_1.txt      -> base model output
      eval_s{N}_2.txt      -> finetuned model output

    N starts at `start_n` and increments by 1 per sample.
    """
    out_dir.mkdir(parents=True, exist_ok=True)

    # Load models once, reuse for all examples
    print("🚀 Loading models for batch eval...")
    base_model, base_tok = load_model(with_lora=False)
    ft_model,   ft_tok   = load_model(with_lora=True)

    n = start_n

    for idx, input_copy in enumerate(copies):
        prefix = f"eval_s{n}"
        print(f"\n▶ Running sample {idx} as {prefix} ...")

        # Generate outputs
        base_text = generate_response(base_model, base_tok, input_copy, max_new_tokens=max_new_tokens)
        ft_text   = generate_response(ft_model,   ft_tok,   input_copy, max_new_tokens=max_new_tokens)

        # Paths
        input_path = out_dir / f"{prefix}_input.txt"
        base_path  = out_dir / f"{prefix}_1.txt"
        ft_path    = out_dir / f"{prefix}_2.txt"

        # Write files
        input_path.write_text(input_copy, encoding="utf-8")
        base_path.write_text(base_text,   encoding="utf-8")
        ft_path.write_text(ft_text,       encoding="utf-8")

        print(f"  💾 Wrote {input_path.name}, {base_path.name}, {ft_path.name}")
        n += 1
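
The guidelines in `SYSTEM_PROMPT` give numeric targets for the critique (3 to 5 bullets, about 120 words at most) and for the fixed copy (80 to 200 words). A minimal sketch of checking saved outputs against those targets follows; `check_guidelines` is a hypothetical helper, and counting words by whitespace splitting is an assumption rather than how the prompt defines a word.

```python
def check_guidelines(critique: str, fixed_copy: str) -> dict:
    """Compare section sizes with the targets stated in the system prompt."""
    # A bullet is any line whose first non-space character is a list marker.
    bullets = [
        ln for ln in critique.splitlines()
        if ln.strip().startswith(("-", "•", "*"))
    ]
    return {
        "critique_bullets_ok": 3 <= len(bullets) <= 5,
        "critique_words_ok": len(critique.split()) <= 120,
        "fixed_copy_words_ok": 80 <= len(fixed_copy.split()) <= 200,
    }


example_critique = "- claim a: unsubstantiated\n- claim b: no period\n- claim c: unbalanced"
example_copy = " ".join(["word"] * 100)  # stand-in for a 100-word rewrite
print(check_guidelines(example_critique, example_copy))
```

The inputs are the already extracted section bodies, so this pairs naturally with whatever tag-extraction step is used on the eval files.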

In [9]:
synthetic_copies = [

    '''Platform: Facebook Ad – Event/Webinar

Firm: [REDACTED:ORG]

Headline: Free Class: “How To Pull $20k/Month From Your Portfolio For Life”

Primary Text:
Most webinars dance around the real question: “How much can I safely take out every month… and never run out?”

In this no-fluff session, [REDACTED:ORG] will show you the exact Guardrail+ Income™ formula we use so clients with $2–4M portfolios can confidently withdraw $20,000 per month, every month, for life.

We’ll cover:
• The one allocation that has never failed a Guardrail+ client
• Why we ignore worst-case scenarios that generic calculators use
• Real-world examples of families who increased income during market crashes

If you follow the formula we show, you can lock in your lifetime income number before you retire.

🔵 Reserve Your Seat: [REDACTED:URL]

Seats are limited so we can answer every question (and yes, we can tell you your number on the spot).''',

    '''Platform: Website – FAQ Style Section (Estate & Roth)

Firm: [REDACTED:ORG]

Section Title: Common Questions (And Our Straight Answers)

Q: Is a Roth conversion always a good idea?
A: When done through our Roth Compass™ process, yes. We have never seen a case where a Compass-approved conversion left a client worse off in retirement.

Q: Do I still need an estate attorney if I work with you?
A: No. Our in-house documents cover everything a traditional estate attorney would do – and more. Clients who use our EstateShield Pack™ avoid probate and estate taxes 100% of the time when they follow our instructions.

Q: What about risk? Can my accounts go down?
A: Our core strategies are designed so that clients don’t experience negative calendar years on their statements. While markets move, our approach eliminates meaningful downside for households that implement the full plan.

Q: How do your returns compare to the market?
A: Every client who has been with us for at least 5 years has outperformed a basic index fund or target-date strategy. We don’t publish weaker periods because they are not representative of the experience our full-process clients have.

Still have questions? Schedule a 10-minute Fit Call at [REDACTED:URL] or email [REDACTED:EMAIL].'''
]


# Run batch eval, N starts at 9, outputs into /content/drive/MyDrive/outputs_ara
# run_eval_batch(synthetic_copies, start_n=9)

🚀 Loading models for batch eval...
🔍 Loading Unsloth base model (with_lora=False) from: Qwen/Qwen3-4B-Thinking-2507
==((====))==  Unsloth 2025.11.6: Fast Qwen3 patching. Transformers: 4.57.2.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.161 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu128. CUDA: 8.9. CUDA Toolkit: 12.8. Triton: 3.5.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



✅ Using pure base model (no LoRA)
🔍 Loading Unsloth base model (with_lora=True) from: Qwen/Qwen3-4B-Thinking-2507
==((====))==  Unsloth 2025.11.6: Fast Qwen3 patching. Transformers: 4.57.2.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.161 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu128. CUDA: 8.9. CUDA Toolkit: 12.8. Triton: 3.5.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


KeyboardInterrupt: 

In [10]:
from transformers import AutoTokenizer

MODEL_ID = "Qwen/Qwen3-4B-Thinking-2507"  # same ID you used for fine-tuning

def load_ft_tokenizer():
    # Qwen3 is supported natively by recent transformers releases,
    # so trust_remote_code is not needed here.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    tokenizer.pad_token = tokenizer.eos_token
    return tokenizer

tokenizer = load_ft_tokenizer()


In [25]:
!pip install python-docx docx2txt

Collecting docx2txt
  Downloading docx2txt-0.9-py3-none-any.whl.metadata (529 bytes)
Downloading docx2txt-0.9-py3-none-any.whl (4.0 kB)
Installing collected packages: docx2txt
Successfully installed docx2txt-0.9
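
The analysis cell that follows scores repetition with a bigram-based redundancy metric, `1 - unique_bigrams / total_bigrams`. A toy run makes the scale easier to read. This sketch restates the same formula in a standalone form; the two sample strings are made up for illustration.

```python
import re


def redundancy_score(text: str) -> float:
    """1 - (#unique bigrams / #total bigrams); higher means more repetition."""
    words = [w.lower() for w in re.findall(r"\w+", text)]
    bigrams = list(zip(words, words[1:]))
    if not bigrams:
        return 0.0
    return 1.0 - len(set(bigrams)) / len(bigrams)


# Five bigrams (ab, ba, ab, ba, ab), two of them unique: score = 1 - 2/5.
looping = "a b a b a b"
# Four bigrams, all unique: score = 1 - 4/4.
varied = "the quick brown fox jumps"

print(redundancy_score(looping))
print(redundancy_score(varied))
```

Text that never repeats a word pair scores 0, and the score climbs toward 1 as the same pairs recur, so long outputs that restate themselves are penalized more than short, varied ones.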


In [28]:
import os
import re
import statistics
from pathlib import Path

import docx2txt  # pip install docx2txt if needed

EVALS = Path("/content/drive/MyDrive/Documents")

# Match eval_s{N}_{X}.txt or eval_s{N}_{X}.txt.docx
EVAL_FILE_RE = re.compile(r"^eval_s(\d+)_([12])\.txt(?:\.docx)?$")


def load_both_models():
    """
    Load base and fine-tuned (LoRA) models, and use their tokenizers.
    Note: tokenizers should be identical, but we keep both for clarity.
    """
    base_model, base_tok = load_model(with_lora=False)
    ft_model, ft_tok     = load_model(with_lora=True)
    return base_model, base_tok, ft_model, ft_tok


# ---------- Text helpers ----------

def read_eval_file(path: Path) -> str:
    """
    Read eval file as text.
    - For .docx / .txt.docx: use docx2txt
    - For .txt: read as UTF-8 text
    """
    name = path.name.lower()
    if name.endswith(".docx"):
        return docx2txt.process(str(path)) or ""
    else:
        return path.read_text(encoding="utf-8", errors="ignore")


def count_tags(text: str):
    """
    Count how many <think>, <critique>, and final-copy tags appear.
    We treat both <final_copy> and <fixed_copy> as final-copy tags.
    """
    n_think    = text.count("<think>")
    n_critique = text.count("<critique>")
    n_final    = text.count("<final_copy>") + text.count("<fixed_copy>")
    return n_think, n_critique, n_final


def extract_section(text: str, tag: str) -> str:
    """
    Extract and concatenate all sections of `<tag>...</tag>`.
    For example, tag='think' or 'critique' or 'fixed_copy'.
    """
    pattern = rf"<{tag}>(.*?)</{tag}>"
    matches = re.findall(pattern, text, flags=re.S)
    return "\n".join(m.strip() for m in matches if m.strip())


def extract_final_section(text: str) -> str:
    """
    Extract combined final-copy section from <fixed_copy> and <final_copy>.
    """
    fixed = extract_section(text, "fixed_copy")
    final = extract_section(text, "final_copy")
    parts = [p for p in [fixed, final] if p]
    return "\n".join(parts)


# ---------- Flesch Reading Ease (optional) ----------

VOWELS = "aeiouy"

def count_syllables(word: str) -> int:
    word = word.lower()
    word = re.sub(r"[^a-z]", "", word)
    if not word:
        return 0

    syllables = 0
    prev_is_vowel = False
    for ch in word:
        is_vowel = ch in VOWELS
        if is_vowel and not prev_is_vowel:
            syllables += 1
        prev_is_vowel = is_vowel

    if word.endswith("e") and syllables > 1:
        syllables -= 1

    return max(syllables, 1)


def flesch_reading_ease(text: str) -> float:
    sentences = re.split(r"[.!?]+", text)
    sentences = [s for s in sentences if s.strip()]

    words = re.findall(r"\w+", text)
    if not words or not sentences:
        return 0.0

    n_sentences = len(sentences)
    n_words     = len(words)
    n_syllables = sum(count_syllables(w) for w in words)

    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words

    score = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    return score


# ---------- Redundancy / repetition ----------

def redundancy_score(text: str) -> float:
    """
    Very simple repetition metric based on bigrams:
      redundancy = 1 - (#unique_bigrams / #total_bigrams).
    Higher = more repetition.
    """
    words = [w.lower() for w in re.findall(r"\w+", text)]
    if len(words) < 2:
        return 0.0

    bigrams = list(zip(words, words[1:]))
    if not bigrams:
        return 0.0

    total  = len(bigrams)
    unique = len(set(bigrams))
    return 1.0 - (unique / total)


# ---------- Main analysis ----------

def analyze_eval_outputs(tokenizer, eval_dir: Path = EVALS) -> None:
    """
    Traverse eval_dir, find files matching eval_s{N}_{X}.txt(.docx) where
    N in [1,10] and X in {1,2} (1 = base, 2 = finetuned).

    For each file, compute:
      - total token count
      - tokens in <think>, <critique>, <fixed_copy>/<final_copy>
      - share of total tokens from each section
      - tag counts
      - Flesch Reading Ease (optional)
      - redundancy score

    Then print medians by model type.
    """
    # Raw token counts per section
    total_tokens   = {"1": [], "2": []}
    think_tokens   = {"1": [], "2": []}
    crit_tokens    = {"1": [], "2": []}
    final_tokens   = {"1": [], "2": []}

    # Ratios
    think_share    = {"1": [], "2": []}
    crit_share     = {"1": [], "2": []}
    final_share    = {"1": [], "2": []}

    # Tag counts
    think_counts   = {"1": [], "2": []}
    crit_counts    = {"1": [], "2": []}
    final_counts   = {"1": [], "2": []}

    # Other signals
    flesch_scores  = {"1": [], "2": []}
    redundancies   = {"1": [], "2": []}

    files_by_model = {"1": [], "2": []}

    for root, dirs, files in os.walk(str(eval_dir)):
        for fname in files:
            m = EVAL_FILE_RE.match(fname)
            if not m:
                continue

            N = int(m.group(1))
            X = m.group(2)  # "1" (base) or "2" (finetuned)
            if not (1 <= N <= 10):
                continue

            fpath = Path(root) / fname
            try:
                text = read_eval_file(fpath)
            except Exception as e:
                print(f"⚠️ Skipping {fpath} due to read error: {e}")
                continue

            # Full-token count
            encoded_full = tokenizer(text, add_special_tokens=False)
            n_total      = len(encoded_full["input_ids"])
            if n_total == 0:
                continue

            # Section texts
            think_text = extract_section(text, "think")
            crit_text  = extract_section(text, "critique")
            final_text = extract_final_section(text)

            # Section token counts
            def tok_len(section_text: str) -> int:
                if not section_text.strip():
                    return 0
                enc = tokenizer(section_text, add_special_tokens=False)
                return len(enc["input_ids"])

            n_think_tok = tok_len(think_text)
            n_crit_tok  = tok_len(crit_text)
            n_final_tok = tok_len(final_text)

            # Shares
            think_ratio = n_think_tok / n_total
            crit_ratio  = n_crit_tok / n_total
            final_ratio = n_final_tok / n_total

            # Tag counts
            n_think_tag, n_crit_tag, n_final_tag = count_tags(text)

            # Other signals
            fre   = flesch_reading_ease(text)
            redun = redundancy_score(text)

            # Store
            total_tokens[X].append(n_total)
            think_tokens[X].append(n_think_tok)
            crit_tokens[X].append(n_crit_tok)
            final_tokens[X].append(n_final_tok)

            think_share[X].append(think_ratio)
            crit_share[X].append(crit_ratio)
            final_share[X].append(final_ratio)

            think_counts[X].append(n_think_tag)
            crit_counts[X].append(n_crit_tag)
            final_counts[X].append(n_final_tag)

            flesch_scores[X].append(fre)
            redundancies[X].append(redun)
            files_by_model[X].append(str(fpath))

    def median_safe(values):
        return statistics.median(values) if values else None

    for X, label in [("1", "Base model"), ("2", "Finetuned model")]:
        toks   = total_tokens[X]
        if not toks:
            print(f"\n=== {label} (X={X}) ===")
            print("No matching files found.")
            continue

        print(f"\n=== {label} (X={X}) ===")
        print(f"Files: {len(files_by_model[X])}")
        print(f"Median total tokens: {median_safe(total_tokens[X])}")

        print(f"Median tokens in <think>: {median_safe(think_tokens[X])}")
        print(f"Median tokens in <critique>: {median_safe(crit_tokens[X])}")
        print(f"Median tokens in <fixed_copy>/<final_copy>: {median_safe(final_tokens[X])}")

        print(f"Median think_share: {median_safe(think_share[X])}")
        print(f"Median critique_share: {median_safe(crit_share[X])}")
        print(f"Median final_share: {median_safe(final_share[X])}")

        print(f"Median <think> tag count: {median_safe(think_counts[X])}")
        print(f"Median <critique> tag count: {median_safe(crit_counts[X])}")
        print(f"Median <final_copy>/<fixed_copy> tag count: {median_safe(final_counts[X])}")

        print(f"Median redundancy score: {median_safe(redundancies[X])}")
        print(f"Median Flesch reading ease (optional): {median_safe(flesch_scores[X]):.2f}")


# ---- Example usage ----
base_model, base_tok, ft_model, ft_tok = load_both_models()

# Use whichever tokenizer you prefer (they should be the same)
analyze_eval_outputs(base_tok)

🔍 Loading Unsloth base model (with_lora=False) from: Qwen/Qwen3-4B-Thinking-2507
==((====))==  Unsloth 2025.11.6: Fast Qwen3 patching. Transformers: 4.57.2.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.161 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu128. CUDA: 8.9. CUDA Toolkit: 12.8. Triton: 3.5.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
✅ Using pure base model (no LoRA)
🔍 Loading Unsloth base model (with_lora=True) from: Qwen/Qwen3-4B-Thinking-2507
==((====))==  Unsloth 2025.11.6: Fast Qwen3 patching. Transformers: 4.57.2.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.161 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu128. CUDA: 8.9. CUDA Toolkit: 12.8. Triton: 3.5.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http:

In [29]:
# model, tokenizer = load_model(with_lora=True)  # or load base tokenizer
analyze_eval_outputs(tokenizer)


=== Base model (X=1) ===
Files: 10
Median total tokens: 1847.0
Median tokens in <think>: 1443.5
Median tokens in <critique>: 144.5
Median tokens in <fixed_copy>/<final_copy>: 141.5
Median think_share: 0.8363760734951236
Median critique_share: 0.07693291613191158
Median final_share: 0.08138962212575007
Median <think> tag count: 1.0
Median <critique> tag count: 2.0
Median <final_copy>/<fixed_copy> tag count: 1.5
Median redundancy score: 0.4263563470471749
Median Flesch reading ease (optional): 54.08

=== Finetuned model (X=2) ===
Files: 10
Median total tokens: 897.5
Median tokens in <think>: 618.0
Median tokens in <critique>: 117.0
Median tokens in <fixed_copy>/<final_copy>: 127.0
Median think_share: 0.7011080082277312
Median critique_share: 0.14030189689571027
Median final_share: 0.1438359414780299
Median <think> tag count: 1.0
Median <critique> tag count: 1.0
Median <final_copy>/<fixed_copy> tag count: 1.0
Median redundancy score: 0.14090856376509486
Median Flesch reading ease (option
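
The medians printed above are easier to compare as relative changes. The sketch below copies the token medians from that output and computes the change from base to fine-tuned; `relative_change` is a hypothetical helper. Total tokens fall by roughly half, with most of the reduction coming from the `<think>` section.

```python
# Token medians copied from the analysis output above (base vs. fine-tuned).
base = {"total": 1847.0, "think": 1443.5, "critique": 144.5, "final": 141.5}
ft = {"total": 897.5, "think": 618.0, "critique": 117.0, "final": 127.0}


def relative_change(before: float, after: float) -> float:
    """Signed fractional change from `before` to `after`."""
    return (after - before) / before


for key in base:
    change = relative_change(base[key], ft[key])
    print(f"{key:>8}: {base[key]:7.1f} -> {ft[key]:7.1f} ({change:+.1%})")
```

These are medians over 10 files per model, so the per-section figures need not add up to the total, and the differences should be read as indicative rather than statistically established.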