<a href="https://colab.research.google.com/gist/ruvnet/11cfb552fb85585a1dcc4a783f072527/notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep Codestral: Fine-Tuning Codestral 25.01 with Unsloth

This notebook demonstrates how to fine-tune Mistral AI's **Codestral 25.01** (a coding model reported to generate code approximately 2× faster than its predecessor) for enhanced reasoning, which we call **architect mode**. In this configuration (Deep Codestral), the model is fine-tuned to produce chain-of-thought, step-by-step explanations for system design and architectural planning.

Created by rUv (because he could), Deep Codestral leverages LoRA (Low-Rank Adaptation) for efficient fine-tuning, with an optional GSPO (Generalized Structured Prompt Optimization) extension. GSPO refines prompt structures using agentic datasets in JSONL or Parquet formats.

The notebook covers:
1. Environment and GPU setup
2. Library installation
3. Preparing a reasoning dataset (with optional loading from Strawberry-Phi examples)
4. Loading Codestral 25.01 in 4-bit mode
5. Configuring LoRA and (optionally) GSPO
6. Fine-tuning with an evaluation framework
7. Saving and exporting the model


## 1. Environment Setup and GPU Verification

Ensure that the notebook is running in GPU mode. Go to **Runtime > Change runtime type > Hardware Accelerator > GPU**. This cell checks Python version, CUDA availability, and lists available GPUs.

In [None]:
!python --version
import torch
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
!nvidia-smi -L
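
Beyond confirming that a GPU is present, it can help to know how much memory it has before loading a large model. The following is a minimal sketch; the helper name `total_vram_gib` is illustrative and not part of the original notebook, and it returns `None` when PyTorch or CUDA is unavailable.

```python
def total_vram_gib(device_index=0):
    """Return total memory of a CUDA device in GiB, or None if unavailable."""
    try:
        import torch  # imported lazily so the helper also runs without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    props = torch.cuda.get_device_properties(device_index)
    return props.total_memory / 1024**3


vram = total_vram_gib()
if vram is None:
    print("No CUDA device detected.")
else:
    print(f"GPU 0 total memory: {vram:.1f} GiB")
```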

## 2. Install Required Libraries

We install the libraries needed for fine-tuning:
- **Unsloth**: Optimized library for LLM fine-tuning (supports 4-bit quantization and LoRA)
- **PEFT**: Parameter-Efficient Fine-Tuning for applying LoRA adapters
- **bitsandbytes**: For 4-bit model loading
- **TRL**: Provides `SFTTrainer` for supervised fine-tuning
- **Transformers, Datasets, Accelerate**: Hugging Face core libraries


In [None]:
%%bash
pip install -q unsloth peft bitsandbytes trl transformers accelerate datasets

In [None]:
!pip list | grep -E 'unsloth|peft|bitsandbytes|trl|transformers'

## 3. Prepare a Reasoning Dataset

We create a small sample reasoning dataset in JSONL format. Optionally, you can load the agentic training examples from rUv's Strawberry-Phi repository. Two sources are provided:
- Parquet: [strawberry-phi.parquet](https://github.com/ruvnet/strawberry-phi/blob/main/finetune/strawberry-phi.parquet)
- JSONL: [strawberry-phi.jsonl](https://github.com/ruvnet/strawberry-phi/blob/main/finetune/strawberry-phi.jsonl)

For demonstration, we create a small local JSONL file. To use the agentic examples, update the file path accordingly.

In [None]:
%%bash
cat > reasoning_dataset.jsonl <<'DATA'
{"prompt": "Question: What is 12 * 7? Let's think step-by-step.", "completion": "First, break down the problem: 12 * 7 means adding 12 seven times. 10 * 7 = 70 and 2 * 7 = 14, so 70 + 14 = 84. Therefore, the answer is 84."}
{"prompt": "Question: If Alice is taller than Bob, and Bob is taller than Charlie, who is the tallest? Explain your reasoning.", "completion": "Given Alice > Bob and Bob > Charlie, Alice is taller than both. Therefore, Alice is the tallest."}
{"prompt": "Question: John has 3 more apples than Sam. Sam has 2 apples. How many apples does John have? Show your reasoning.", "completion": "Sam has 2 apples. John has 3 more, so 2 + 3 = 5 apples."}
DATA
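
The records above store `prompt` and `completion` separately, while the trainer in Section 6 reads a single `text` field (`dataset_text_field="text"`). Here is a minimal sketch of one way to bridge the two, assuming a simple newline-joined template. The helpers `parse_jsonl` and `format_example` and the template itself are illustrative assumptions, not part of the original notebook.

```python
import json

REQUIRED_KEYS = ("prompt", "completion")


def parse_jsonl(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def format_example(record, eos_token=""):
    """Join prompt and completion into one training string (assumed template)."""
    missing = [key for key in REQUIRED_KEYS if key not in record]
    if missing:
        raise KeyError(f"record is missing fields: {missing}")
    return f"{record['prompt']}\n{record['completion']}{eos_token}"


sample = '{"prompt": "Question: What is 12 * 7?", "completion": "12 * 7 = 84."}\n'
records = parse_jsonl(sample)
texts = [format_example(record, eos_token="</s>") for record in records]
print(texts[0])
```

Appending the end-of-sequence token helps the model learn where a completion stops. In practice you would pass `tokenizer.eos_token` rather than a hard-coded string.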

## 4. Load Codestral 25.01 in 4-bit with Unsloth

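We load the model in 4-bit mode using Unsloth, which reduces memory usage significantly while largely retaining model performance. Note that the checkpoint ID used below, `unsloth/Mistral-Small-24B-Instruct-2501-unsloth-bnb-4bit`, names a pre-quantized Mistral Small 24B Instruct (2501) build rather than Codestral 25.01; substitute a Codestral checkpoint ID if you have access to one. A 24B-parameter model needs roughly 12 GB for its 4-bit weights alone, so treat the ≈12–16GB VRAM figure as a minimum and expect to need more headroom for training.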
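
As a rough sanity check on the VRAM requirement, the memory taken by the weights alone can be estimated from the parameter count and the bits per parameter. This is a back-of-the-envelope sketch: the 24-billion figure is assumed from the "24B" in the checkpoint name, and the estimate ignores quantization metadata, activations, LoRA gradients, and optimizer state.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


num_params = 24e9  # assumption: taken from the "24B" in the checkpoint name

mem_4bit = weight_memory_gib(num_params, 4)
mem_16bit = weight_memory_gib(num_params, 16)
print(f"4-bit weights : {mem_4bit:.1f} GiB")
print(f"16-bit weights: {mem_16bit:.1f} GiB")
```

Whatever the weights occupy, training adds further memory on top, so the estimate is a floor rather than a budget.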

In [None]:
from unsloth import FastLanguageModel
model_name = "unsloth/Mistral-Small-24B-Instruct-2501-unsloth-bnb-4bit"
max_seq_length = 2048  # set maximum sequence length

# Load model and tokenizer in 4-bit mode
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    load_in_4bit=True,
    max_seq_length=max_seq_length,
    dtype=None
)


## 5. Configure LoRA Fine-Tuning and Optional GSPO

**LoRA (Low-Rank Adaptation)** is applied so that only a small fraction of parameters are fine-tuned. This drastically reduces memory and computational requirements.
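
To make "a small fraction of parameters" concrete: for a frozen weight matrix of shape `d_out × d_in`, LoRA trains two low-rank factors with `r × d_in` and `d_out × r` entries, and scales their product by `lora_alpha / r`. The sketch below works through the arithmetic for one layer; the 4096 layer size is a hypothetical value chosen for illustration, not a dimension of the model used here.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    return r * (d_in + d_out)


d_in = d_out = 4096        # hypothetical layer size, for illustration only
r, lora_alpha = 16, 16     # the values used in this notebook's LoRA config

full = d_in * d_out
added = lora_param_count(d_in, d_out, r)
scaling = lora_alpha / r   # factor applied to the low-rank update

print(f"frozen weights : {full:,}")
print(f"LoRA parameters: {added:,} ({added / full:.2%} of the layer)")
print(f"update scaling : {scaling}")
```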

Optionally, we can integrate **GSPO (Generalized Structured Prompt Optimization)**. GSPO optimizes prompt segmentation and structure, using agentic datasets (JSONL or Parquet). To use GSPO, set the flag and load your agentic data file accordingly.

In this example, we proceed with LoRA fine-tuning only. The GSPO branch in the cell below is a placeholder: when `USE_GSPO` is `True`, it loads the agentic dataset and prints a message, and you would add your own optimization routine at that point.

In [None]:
# Attach LoRA adapters to the model
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
    max_seq_length=max_seq_length
)

# Optional: GSPO integration
USE_GSPO = False  # Set to True to enable GSPO optimization
if USE_GSPO:
    from datasets import load_dataset
    # Example: load agentic examples from a JSONL file (or Parquet if available)
    gspo_dataset = load_dataset("json", data_files="strawberry-phi.jsonl", split="train")
    # Alternatively, use Parquet:
    # gspo_dataset = load_dataset("parquet", data_files="strawberry-phi.parquet", split="train")
    print(f"GSPO dataset loaded with {len(gspo_dataset)} examples")

    # (Optional) Run your GSPO optimization routine here to refine prompt structure
    # For demonstration, we simply print a message
    print("GSPO optimization enabled: refining prompt structure...")


## 6. Fine-Tune the Model with LoRA (with Optional Evaluation Framework)

We fine-tune the model using Hugging Face's `TrainingArguments` and the TRL library's `SFTTrainer`. After training, a lightweight evaluation step generates completions for a few example prompts so you can inspect the output by hand; no validation dataset or metric is computed in this cell.

Key training settings:
- **Batch size**: 2 (with gradient accumulation steps to simulate an effective batch size of 8).
- **Mixed precision**: Enabled (FP16 or BF16, based on hardware support).
- **Training steps**: 50 (for demonstration; use more steps in practice).
- **Logging**: Every 5 steps.
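
The settings above interact with dataset size, which is worth checking before a run. The sketch below works through the arithmetic, assuming a single GPU and the three-record sample dataset from Section 3.

```python
per_device_batch = 2
grad_accum_steps = 4
max_steps = 50
num_devices = 1     # assumption: a single Colab GPU
dataset_size = 3    # the three sample records from Section 3

effective_batch = per_device_batch * grad_accum_steps * num_devices
sequences_seen = effective_batch * max_steps
epochs = sequences_seen / dataset_size

print(f"effective batch size: {effective_batch}")
print(f"sequences processed : {sequences_seen}")
print(f"passes over the data: {epochs:.1f}")
```

With only three records, 50 steps repeat each example well over a hundred times, so the demo run will memorize the samples; a realistic dataset is needed before the loss curve means anything.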

For a complete evaluation, you can extend this section to include validation metrics, loss curves, and custom evaluation functions.

In [None]:
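from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import is_bfloat16_supported

# Load the JSONL file from Section 3 and build the single "text" field the trainer reads.
# The newline-joined template below is an assumption; adapt it to your prompt format.
train_dataset = load_dataset("json", data_files="reasoning_dataset.jsonl", split="train")
train_dataset = train_dataset.map(
    lambda ex: {"text": f"{ex['prompt']}\n{ex['completion']}{tokenizer.eos_token}"}
)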

# Define training arguments
training_args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    max_steps=50,
    warmup_steps=5,
    logging_steps=5,
    fp16=not is_bfloat16_supported(),
    bf16=is_bfloat16_supported(),
    optim="adamw_8bit",  # 8-bit AdamW backed by bitsandbytes
    seed=42,
    report_to=[]  # disable external logging
)

# Initialize the SFT trainer for supervised fine-tuning
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_args,
)

# Start training (this may take time depending on GPU and dataset size)
trainer.train()

# Optional: Evaluation framework
def evaluate_model(model, tokenizer, prompts):
    model.eval()
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
        outputs = model.generate(**inputs, max_new_tokens=100)
        prompt_len = inputs["input_ids"].shape[1]
        generated_tokens = outputs[0][prompt_len:]
        completion = tokenizer.decode(generated_tokens, skip_special_tokens=True)
        print(f"Prompt: {prompt}\nCompletion: {completion}\n{'-'*80}")

# Run evaluation on example prompts
evaluation_prompts = [
    "Question: If Mary has 10 candies and gives 4 to John, how many candies remain? Explain your reasoning.",
    "Question: If A is larger than B and B is larger than C, who is the largest? Reason it out step-by-step."
]
evaluate_model(model, tokenizer, evaluation_prompts)
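
To go beyond reading completions by eye, a custom evaluation function can score them automatically. The sketch below is one simple heuristic for arithmetic prompts: it compares the last number in each completion against an expected answer. The helpers `extract_last_number` and `numeric_accuracy` and the sample strings are hypothetical; the samples are not real model output.

```python
import re


def extract_last_number(text):
    """Return the last number appearing in text as a float, or None."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text)
    return float(matches[-1]) if matches else None


def numeric_accuracy(completions, expected):
    """Fraction of completions whose last number equals the expected answer."""
    if not expected:
        return 0.0
    hits = sum(
        1 for completion, answer in zip(completions, expected)
        if extract_last_number(completion) == answer
    )
    return hits / len(expected)


# Hand-written sample completions, used only to exercise the scorer
sample_completions = [
    "Mary starts with 10 and gives away 4. 10 - 4 = 6, so 6 candies remain.",
    "The answer is 5.",
]
accuracy = numeric_accuracy(sample_completions, expected=[6, 7])
print(f"numeric accuracy: {accuracy:.2f}")
```

This heuristic only suits prompts with a single numeric answer; reasoning prompts such as the height comparison need a different check, for example string matching on the expected name.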


## 7. Save Fine-Tuned Model

We can save the fine-tuned model in two ways:

- **LoRA Adapter Weights:** Save only the LoRA adapter (the small fine-tuned weights). This is storage-efficient but requires the base model to be loaded later.
- **Merged Model:** Merge the LoRA weights into the base model, producing a standalone model (requires more memory during merge).

Below, we demonstrate saving the LoRA adapter and (optionally) merging the model for export.

In [None]:
# Save LoRA adapter weights
adapter_dir = "codestral_lora_adapter"
model.save_pretrained(adapter_dir)
tokenizer.save_pretrained(adapter_dir)
print(f"LoRA adapter saved to {adapter_dir}/ (contains adapter model weights).")
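
Because the adapter directory holds only the LoRA weights, using it later means loading the base model again and attaching the adapter. The following sketch shows one way to do that with Unsloth, assuming the adapter was saved as above and that Unsloth resolves the base model from the adapter's saved configuration. The function name is illustrative, and calling it needs a GPU with Unsloth installed.

```python
def load_adapter_for_inference(adapter_dir="codestral_lora_adapter", max_seq_length=2048):
    """Reload the base model plus the saved LoRA adapter (sketch, requires Unsloth)."""
    # Imported lazily so this definition can be read and parsed without Unsloth installed
    from unsloth import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=adapter_dir,        # directory written by model.save_pretrained(...)
        max_seq_length=max_seq_length,
        load_in_4bit=True,
    )
    FastLanguageModel.for_inference(model)  # switch Unsloth to its inference mode
    return model, tokenizer
```

A typical call would be `model, tokenizer = load_adapter_for_inference()`, after which the `evaluate_model` helper from Section 6 can be reused unchanged.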


In [None]:
# Merge LoRA into base model and save (optional)
try:
    from peft import PeftModel
    if isinstance(model, PeftModel):
        base_model = model.merge_and_unload()  # merge LoRA weights into the base model
    else:
        base_model = model
    base_model.save_pretrained("codestral_finetuned_full")
    tokenizer.save_pretrained("codestral_finetuned_full")
    print("Merged full model saved to 'codestral_finetuned_full/'")
except Exception as e:
    print(f"Merge failed or not enough memory: {e}")


## 8. Export the Notebook

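Finally, you can download this notebook as a `.ipynb` file. The code below uses Colab's `files.download` utility, which only works if the file exists in the runtime's filesystem; a notebook opened from Drive or a gist usually does not. If the file is not found, save the notebook manually via **File > Download > Download .ipynb**.

In [None]:
import os
from google.colab import files

notebook_path = "Deep_Codestral_FineTuning.ipynb"
if os.path.exists(notebook_path):
    files.download(notebook_path)
else:
    print(f"{notebook_path} not found in the runtime; use File > Download > Download .ipynb instead.")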
