# Fine-Tune Granite-3.1-3B-A800M-Instruct for Lexmancer Ability Generation

This notebook fine-tunes Granite-3.1-3B-A800M-Instruct on ability JSON generation using LoRA adapters.

**Requirements:**
- Google Colab with GPU (T4 works, A100 recommended)
- Training data in JSONL format (instruction/input/output)

**Output:**
- Fine-tuned model merged and quantized to GGUF format
- Ready to replace `Assets/LLM/granite-3.1-3b-a800m-instruct-Q4_K_M.gguf`


## Step 1: Setup Environment

In [None]:
# Check GPU availability
!nvidia-smi

In [None]:
%%capture
# Install Unsloth for fast fine-tuning (%%capture must be the first line of the cell)
!pip install unsloth
!pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

In [None]:
%%capture
# Install additional dependencies (%%capture must be the first line of the cell)
!pip install datasets trl peft transformers accelerate bitsandbytes

## Step 2: Load Base Model

In [None]:
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048  # Granite supports long context, but 2K is enough for abilities
dtype = None  # Auto-detect (Float16 for Tesla T4, V100, Bfloat16 for A100)
load_in_4bit = True  # Use 4bit quantization for memory efficiency

# Load Granite-3.1-3B-A800M-Instruct
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="ibm-granite/granite-3.1-3b-a800m-instruct",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
    # trust_remote_code=True,  # Uncomment if required by the model
)


## Step 3: Add LoRA Adapters

In [None]:
# Add LoRA adapters for parameter-efficient fine-tuning
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank (higher = more parameters, 8-64 typical)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    lora_alpha=16,  # LoRA scaling factor (usually same as r)
    lora_dropout=0,  # Dropout (0 = no dropout, Unsloth optimizes this)
    bias="none",
    use_gradient_checkpointing="unsloth",  # Memory efficient
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)

print("✅ LoRA adapters added!")
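
Module names differ between architectures, and mixture-of-experts models may name their expert layers differently from dense ones, so some entries in `target_modules` might match nothing. The sketch below is a hypothetical helper (`count_target_matches` is not part of Unsloth or PEFT) that counts how many module paths end in each target name. It assumes only that `model.named_modules()` yields dotted paths, as PyTorch modules do.

```python
def count_target_matches(module_names, target_modules):
    """Count how many dotted module paths end in each target name."""
    counts = {target: 0 for target in target_modules}
    for name in module_names:
        leaf = name.rsplit(".", 1)[-1]  # last path component, e.g. "q_proj"
        if leaf in counts:
            counts[leaf] += 1
    return counts

# In the notebook (assumes `model` from the cells above):
#   names = [name for name, _ in model.named_modules()]
#   print(count_target_matches(names, ["q_proj", "k_proj", "v_proj", "o_proj",
#                                      "gate_proj", "up_proj", "down_proj"]))
```

A count of zero means that name matched no module in the model, which is worth knowing before spending GPU time on training.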

## Step 4: Load Training Data

Upload your `ability_training_data.jsonl` file to Colab using the file browser on the left.
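
Before loading, it can help to confirm that every line parses as JSON and carries the `instruction`, `input`, and `output` keys the formatting step expects. The helper below is a minimal sketch; `validate_jsonl_lines` is an illustrative name, not part of the `datasets` library.

```python
import json

REQUIRED_KEYS = ("instruction", "input", "output")

def validate_jsonl_lines(lines):
    """Return a list of (line_number, problem) tuples for malformed records."""
    problems = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((line_number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((line_number, "record is not a JSON object"))
            continue
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            problems.append((line_number, f"missing keys: {missing}"))
    return problems

# In the notebook:
#   with open("ability_training_data.jsonl", encoding="utf-8") as f:
#       print(validate_jsonl_lines(f) or "All records look well-formed")
```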

In [None]:
from datasets import load_dataset

# Load JSONL training data
dataset = load_dataset("json", data_files="ability_training_data.jsonl", split="train")

print(f"📊 Loaded {len(dataset)} training examples")
print("\nExample:")
print(dataset[0])

In [None]:
# Format dataset for instruction fine-tuning (Granite chat template)
def format_prompts(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]

    texts = []
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # Append the optional input to the instruction, separated by a space
        user_text = f"{instruction} {input_text}" if input_text else instruction
        messages = [
            {"role": "system", "content": "You are a creative game designer for Lexmancer, an elemental combat game. Generate ability JSON exactly matching the schema."},
            {"role": "user", "content": user_text},
            {"role": "assistant", "content": output},
        ]
        text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
        texts.append(text)

    return {"text": texts}

# Apply formatting
dataset = dataset.map(format_prompts, batched=True)

print("✅ Dataset formatted!")
print("\nFormatted example:")
print(dataset[0]["text"][:500] + "...")
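
The system prompt appears both here and in the test-generation step, and the two should stay identical so that inference matches training. One way to keep them in sync is a small shared helper. `SYSTEM_PROMPT` and `build_messages` below are illustrative names, not part of the notebook's original code.

```python
SYSTEM_PROMPT = (
    "You are a creative game designer for Lexmancer, an elemental combat game. "
    "Generate ability JSON exactly matching the schema."
)

def build_messages(instruction, input_text="", output=None):
    """Build a chat message list; omit `output` to get an inference prompt."""
    user_text = f"{instruction} {input_text}" if input_text else instruction
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_text},
    ]
    if output is not None:
        # Training examples include the target completion as the assistant turn
        messages.append({"role": "assistant", "content": output})
    return messages
```

The returned list can be passed to `tokenizer.apply_chat_template` with `add_generation_prompt=False` for training text, or `True` when `output` is omitted for inference.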


## Step 5: Configure Training

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,  # Set to True to pack short examples together; can be up to ~5x faster on short sequences
    args=TrainingArguments(
        # Training hyperparameters
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # Effective batch size = 2*4 = 8
        warmup_steps=10,
        num_train_epochs=3,  # Adjust based on dataset size
        # max_steps=100,  # Alternative to num_train_epochs
        learning_rate=2e-4,
        
        # Optimization
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        
        # Logging & checkpointing
        logging_steps=1,
        output_dir="outputs",
        save_strategy="epoch",
        save_total_limit=2,
        
        # Other
        seed=3407,
    ),
)

print("✅ Trainer configured!")
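
To judge whether `warmup_steps=10` and `num_train_epochs=3` suit a given dataset, it helps to estimate the number of optimizer steps. The function below is a rough sketch: the exact count reported by the trainer can differ slightly depending on the library version and how the last partial batch is handled. `estimate_optimizer_steps` is an illustrative name.

```python
import math

def estimate_optimizer_steps(num_examples, per_device_batch_size=2,
                             gradient_accumulation_steps=4, num_epochs=3,
                             num_devices=1):
    """Return (effective_batch_size, approximate_total_optimizer_steps)."""
    effective_batch = per_device_batch_size * gradient_accumulation_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * num_epochs

# In the notebook:
#   print(estimate_optimizer_steps(len(dataset)))
```

For a hypothetical dataset of 200 examples with the settings above, this gives an effective batch of 8 and about 75 optimizer steps, so a 10-step warmup covers roughly 13% of training.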

## Step 6: Train!

This will take 15-60 minutes depending on:
- Dataset size
- Number of epochs
- GPU type (T4 vs A100)

In [None]:
# Show GPU memory before training
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU: {gpu_stats.name}")
print(f"Memory: {start_gpu_memory}GB / {max_memory}GB reserved before training\n")

# Start training
trainer_stats = trainer.train()

# Show final stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_training = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)

print(f"\n✅ Training complete!")
print(f"Peak memory: {used_memory}GB ({used_percentage}% of {max_memory}GB)")
print(f"Training used: {used_memory_for_training}GB")

## Step 7: Test Generation (Optional)

In [None]:
# Enable fast inference mode
FastLanguageModel.for_inference(model)

# Test prompt
messages = [
    {"role": "system", "content": "You are a creative game designer for Lexmancer, an elemental combat game. Generate ability JSON exactly matching the schema."},
    {"role": "user", "content": "Create an ability for combining Shadow and Lightning elements."},
]

test_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([test_prompt], return_tensors="pt").to("cuda")

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)

print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
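
Since the goal is valid ability JSON, a quick check is to parse the first JSON object out of the generated text. The sketch below uses only the standard library; `extract_first_json_object` is an illustrative name, and it checks only that the output parses, not that it matches the game's schema.

```python
import json

def extract_first_json_object(text):
    """Return the first parseable JSON object in `text`, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _ = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one
            start = text.find("{", start + 1)
    return None

# In the notebook, decode only the newly generated tokens before parsing:
#   new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
#   ability = extract_first_json_object(
#       tokenizer.decode(new_tokens, skip_special_tokens=True))
#   print("Valid JSON" if ability is not None else "No valid JSON object found")
```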


## Step 8: Save Model (LoRA Adapters and Merged)

In [None]:
# Save LoRA adapters only (small, ~50MB)
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")

print("✅ LoRA adapters saved to 'lora_model/'")

In [None]:
# Merge LoRA adapters with base model (creates full model)
model.save_pretrained_merged(
    "merged_model",
    tokenizer,
    save_method="merged_16bit",  # Options: merged_16bit, merged_4bit, lora
)

print("✅ Merged model saved to 'merged_model/'")
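
Merging folds each adapter into its base weight. In the standard LoRA formulation, which this sketch assumes (with `use_rslora=False`, the scale is `lora_alpha / r`), the merged weight is `W + (lora_alpha / r) * B @ A`. The pure-Python toy below illustrates that arithmetic on tiny matrices; it is not how Unsloth performs the merge internally.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A.

    Shapes: weight is (out, in), lora_a is (rank, in), lora_b is (out, rank).
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]

# Toy example with out=2, in=2, rank=1
W = [[1, 0], [0, 1]]
A = [[1, 2]]
B = [[3], [4]]
merged = merge_lora(W, A, B, alpha=2, rank=1)
```

After merging, a single matrix multiply reproduces the base-plus-adapter output, which is why the merged model no longer needs the adapter files at inference time.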

## Step 9: Convert to GGUF Format

This converts the model to GGUF format for use with LLamaSharp.

In [None]:
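%%capture
# Build llama.cpp's quantize tool (recent llama.cpp versions build with CMake rather than make)
!git clone https://github.com/ggerganov/llama.cpp
# LLAMA_CURL=OFF avoids needing libcurl headers, which conversion does not use
!cmake -S llama.cpp -B llama.cpp/build -DLLAMA_CURL=OFF
!cmake --build llama.cpp/build --config Release --target llama-quantize -j 2
# Copy the binary to the path used by the quantize step
!cp llama.cpp/build/bin/llama-quantize llama.cpp/llama-quantize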

In [None]:
%%capture
# Install dependencies for conversion (%%capture must be the first line of the cell)
!pip install gguf sentencepiece protobuf

In [None]:
# Convert to FP16 GGUF (unquantized)
!python llama.cpp/convert_hf_to_gguf.py \
    merged_model \
    --outfile granite-3.1-3b-a800m-instruct-fp16.gguf \
    --outtype f16

print("✅ FP16 GGUF created!")


In [None]:
# Quantize to Q4_K_M (same as your current model)
!./llama.cpp/llama-quantize \
    granite-3.1-3b-a800m-instruct-fp16.gguf \
    granite-3.1-3b-a800m-instruct-Q4_K_M.gguf \
    Q4_K_M

print("✅ Q4_K_M GGUF created!")
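
A rough size estimate for a GGUF file is parameters × bits per weight ÷ 8. The sketch below assumes about 3 billion parameters (taking the "3B" in the model name at face value) and an average of roughly 4.85 bits per weight for Q4_K_M. Both figures are approximations, and real files also contain metadata and some tensors kept at higher precision.

```python
def estimate_gguf_size_gb(num_params, bits_per_weight):
    """Approximate file size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

ASSUMED_PARAMS = 3e9        # assumption: "3B" from the model name
ASSUMED_Q4_K_M_BITS = 4.85  # assumption: approximate average for Q4_K_M

fp16_gb = estimate_gguf_size_gb(ASSUMED_PARAMS, 16)
q4_gb = estimate_gguf_size_gb(ASSUMED_PARAMS, ASSUMED_Q4_K_M_BITS)
print(f"FP16 estimate: {fp16_gb:.1f} GB, Q4_K_M estimate: {q4_gb:.1f} GB")
```

Comparing these estimates with the output of `ls -lh *.gguf` is a quick sanity check that conversion and quantization produced complete files.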


## Step 10: Download Your Fine-Tuned Model

Download `granite-3.1-3b-a800m-instruct-Q4_K_M.gguf` and replace:
```
Assets/LLM/granite-3.1-3b-a800m-instruct-Q4_K_M.gguf
```

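You may also want to keep the FP16 GGUF so you can re-quantize to other formats later. For further fine-tuning, keep the `lora_model/` adapters or the `merged_model/` directory instead, since training in this notebook starts from Hugging Face weights rather than GGUF.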


In [None]:
# Check file sizes
!ls -lh *.gguf

# Download files
from google.colab import files

print("\nDownloading Q4_K_M model (recommended for game)...")
files.download('granite-3.1-3b-a800m-instruct-Q4_K_M.gguf')

# Uncomment to also download FP16 (much larger: about 2 bytes per parameter, so several GB)
# print("\nDownloading FP16 model (for re-quantizing later)...")
# files.download('granite-3.1-3b-a800m-instruct-fp16.gguf')


## Alternative: Save to Google Drive

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Copy to Drive
!mkdir -p "/content/drive/MyDrive/Lexmancer/Models"
!cp granite-3.1-3b-a800m-instruct-Q4_K_M.gguf "/content/drive/MyDrive/Lexmancer/Models/"
!cp granite-3.1-3b-a800m-instruct-fp16.gguf "/content/drive/MyDrive/Lexmancer/Models/"

print("✅ Models saved to Google Drive!")


---

## Summary

**What we did:**
1. ✅ Loaded Granite-3.1-3B-A800M-Instruct base model
2. ✅ Added LoRA adapters for efficient fine-tuning
3. ✅ Fine-tuned on ability generation examples
4. ✅ Merged adapters back into base model
5. ✅ Converted to GGUF format
6. ✅ Quantized to Q4_K_M

**Next steps:**
1. Download `granite-3.1-3b-a800m-instruct-Q4_K_M.gguf`
2. Replace `Assets/LLM/granite-3.1-3b-a800m-instruct-Q4_K_M.gguf` in your project
3. Test in-game!

**Expected improvements:**
- Better JSON structure compliance
- More creative, element-appropriate abilities
