# Fine-Tuning Llama 3.2 for Python Code Generation

This notebook walks through the process of fine-tuning the `meta-llama/Llama-3.2-1B-Instruct` model on a dataset of Python code instructions. The key steps are:

1.  **Setup**: Installing dependencies and connecting to Google Drive.
2.  **Data Preparation**: Loading and formatting the `iamtarun/python_code_instructions_18k_alpaca` dataset.
3.  **Model Preparation**: Loading the base model with 4-bit quantization and configuring LoRA for efficient fine-tuning.
4.  **Training**: Running the fine-tuning process using the `SFTTrainer`.
5.  **Inference**: Loading the fine-tuned adapter and generating code from a sample prompt.
6.  **Evaluation**: Evaluating the model's performance on the HumanEval benchmark.

## 1. Setup

First, we install the necessary libraries from our `requirements.txt` file. We also mount Google Drive to save model checkpoints during training, which is crucial for long-running jobs.

In [None]:
!pip install -r requirements.txt

In [None]:
from google.colab import drive
drive.mount('/content/drive')

### Hugging Face Login

To download the Llama 3.2 model, you need to be logged into a Hugging Face account that has been granted access to it. We read the access token from a local `config.json` file so that it is not hard-coded in the notebook. Create this file from `config.json.template` and keep it out of version control.

In [None]:
from huggingface_hub import login
import json

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    access_token = config["HF_ACCESS_TOKEN"]

login(token=access_token)
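
If `config.json` is missing, the cell above fails with a `FileNotFoundError`. A minimal sketch of a more forgiving loader is shown below. The helper name `load_hf_token` and the fallback to the `HF_TOKEN` environment variable are illustrative assumptions, not part of the original notebook, and the token value in the demo is a placeholder.

```python
import json
import os
import tempfile

def load_hf_token(config_path="config.json", env_var="HF_TOKEN"):
    """Return the token from config_path, falling back to an environment variable."""
    if os.path.exists(config_path):
        with open(config_path, "r") as config_file:
            token = json.load(config_file).get("HF_ACCESS_TOKEN")
        if token:
            return token
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f"No token found in {config_path} or ${env_var}.")
    return token

# Demo with a throwaway config file and a placeholder token value.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "config.json")
    with open(demo_path, "w") as demo_file:
        json.dump({"HF_ACCESS_TOKEN": "hf_placeholder"}, demo_file)
    demo_token = load_hf_token(demo_path)

print(demo_token)
```

In the notebook, the result of `load_hf_token()` could be passed to `login(token=...)` in place of `access_token`.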

## 2. Data Preparation

We load the `iamtarun/python_code_instructions_18k_alpaca` dataset, which contains instructions and corresponding Python code. We then format the data into a prompt structure suitable for instruction fine-tuning.

In [None]:
from datasets import load_dataset
from datasets.arrow_dataset import Dataset

def format_sample(sample):
    """ Helper function to format a single input sample"""
    instruction=sample['instruction']
    input_text=sample['input']
    output_text=sample['output']

    if input_text is None or input_text=="":
        formatted_prompt=(
            f"<|start_header_id|>user<|end_header_id|>\n\n"
            f"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Response:\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
            f"{output_text}<|eot_id|>"
        )
    else:
        formatted_prompt=(
            f"<|start_header_id|>user<|end_header_id|>\n\n"
            f"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n"
            f"### Response:\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
            f"{output_text}<|eot_id|>"
        )
    # The adjacent f-strings inside the parentheses are already concatenated into one string.
    return formatted_prompt



def gen_train_input():
    """Yield the first 16,800 samples as formatted training prompts.

    Returns:
        A generator of {'text': prompt} dicts for the training split.
    """
    # Stream the dataset rather than downloading it in full.
    ds=load_dataset("iamtarun/python_code_instructions_18k_alpaca",streaming=True, split="train")
    # The dataset has 18.6k samples: 16.8k (90%) for training and the remaining 1.8k for validation.
    num_samples=16800
    counter=0
    for sample in iter(ds):
        if counter>=num_samples:
            break
        formatted_prompt=format_sample(sample)
        yield {'text': formatted_prompt}
        counter+=1


def gen_val_input():
    """Yield every sample after the first 16,800 as formatted validation prompts.

    Returns:
        A generator of {'text': prompt} dicts for the validation split.
    """
    # Stream the dataset rather than downloading it in full.
    ds=load_dataset("iamtarun/python_code_instructions_18k_alpaca",streaming=True, split="train")
    # The dataset has 18.6k samples: 16.8k (90%) for training and the remaining 1.8k for validation.
    num_samples=16800
    counter=0
    for sample in iter(ds):
        if counter<num_samples:
            counter+=1
            continue

        formatted_prompt=format_sample(sample)
        yield {'text': formatted_prompt}
        counter+=1

dataset_train = Dataset.from_generator(gen_train_input)
dataset_val=Dataset.from_generator(gen_val_input)

In [None]:
print(f"Train dataset size: {len(dataset_train)}")
print(f"Validation dataset size: {len(dataset_val)}")

print(f"Sample train:\n{dataset_train[0]}")
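
Both generators stream the dataset from the start and use a counter to select either the first 16,800 samples or everything after them. The same index-based split can be sketched with `itertools.islice`. Here `toy_stream` is a stand-in for the streamed dataset and `split_stream` is a hypothetical helper, so treat this as an illustration of the idea rather than a drop-in replacement.

```python
from itertools import islice

def split_stream(make_stream, num_train):
    """Return (train, val) iterators over the same ordered stream.

    make_stream is a zero-argument callable that returns a fresh iterator,
    mirroring how the notebook re-opens the streamed dataset for each split.
    """
    train = islice(make_stream(), num_train)      # first num_train items
    val = islice(make_stream(), num_train, None)  # everything after them
    return train, val

# Toy stand-in for the streamed dataset (assumption: 10 samples, 8 for training).
def toy_stream():
    return iter(range(10))

train, val = split_stream(toy_stream, num_train=8)
train_items, val_items = list(train), list(val)

print(train_items)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(val_items)    # [8, 9]
```

Because the split depends only on position, it is reproducible as long as the stream order does not change between the two passes.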


## 3. Model Preparation

We load the `meta-llama/Llama-3.2-1B-Instruct` model and its tokenizer. To make fine-tuning efficient, we apply two key techniques:
- **4-bit Quantization**: We use `BitsAndBytesConfig` to load the model in 4-bit precision, significantly reducing its memory footprint.
- **LoRA (Low-Rank Adaptation)**: We use `LoraConfig` from the PEFT library to inject trainable low-rank matrices into the model. This allows us to update only a small fraction of the model's parameters, making training much faster and less memory-intensive.
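
To make "a small fraction" concrete, the LoRA parameter count can be estimated by hand: a rank-`r` adapter on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` weights. The sketch below uses `r=16` and the seven target modules configured in this notebook. The layer dimensions are assumptions based on the published Llama 3.2 1B configuration and should be checked against `model.config`.

```python
# Assumed Llama 3.2 1B dimensions (verify against model.config before relying on them).
HIDDEN = 2048        # hidden_size
INTERMEDIATE = 8192  # intermediate_size of the MLP
KV_DIM = 8 * 64      # num_key_value_heads * head_dim (grouped-query attention)
NUM_LAYERS = 16
RANK = 16            # LoRA r, as set in LoraConfig

# (d_in, d_out) for each module listed in target_modules.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(d_in, d_out, r):
    """Weights added by one adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

per_layer = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS

print(f"LoRA parameters per layer: {per_layer:,}")  # 704,512
print(f"LoRA parameters in total:  {total:,}")      # 11,272,192
```

Under these assumptions the adapters add about 11.3M trainable weights, on the order of 1% of the base model's roughly 1.2 billion parameters.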

In [None]:
import torch
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import warnings

warnings.filterwarnings("ignore", message=".*padding_side` right should be used.*")

model_name = "meta-llama/Llama-3.2-1B-Instruct"

def create_and_prepare_model(hf_token=None):
    """Loads and prepares the quantized model and tokenizer for Colab GPU."""
    if not torch.cuda.is_available():
        raise SystemExit("GPU not found. This notebook requires a GPU.")
    print(f"GPU detected: {torch.cuda.get_device_name(0)}")

    compute_dtype = torch.bfloat16
    print(f"Using compute dtype: {compute_dtype}")

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=True,
    )

    print(f"Loading model: {model_name} with 4-bit quantization...")
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        torch_dtype=compute_dtype,
        device_map="auto",
        token=hf_token
    )
    print("Model loaded successfully.")

    peft_config = LoraConfig(
        lora_alpha=16,
        r=16,
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'],
    )
    print("LoRA config created.")

    tokenizer = AutoTokenizer.from_pretrained(model_name, token=hf_token)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"
    print("Tokenizer loaded and configured.")

    return model, peft_config, tokenizer

## 4. Training the Model

Now we define the training arguments and initialize the `SFTTrainer` from the TRL library. The trainer handles the entire training loop, including checkpointing, logging, and optimization.

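We configure the trainer to save a checkpoint to Google Drive every 15 steps and to keep only the three most recent ones (`save_total_limit=3`). This ensures that we lose at most 15 steps of progress if the Colab session disconnects, without filling up Drive storage.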

In [None]:
from transformers import TrainingArguments
from trl import SFTTrainer
import os

output_dir_gdrive = "/content/drive/MyDrive/colab_training/llama32-python-save15steps"
os.makedirs(output_dir_gdrive, exist_ok=True)

training_args = TrainingArguments(
    output_dir=output_dir_gdrive,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,
    optim="adamw_torch_fused",
    logging_steps=15,
    save_strategy="steps",
    save_steps=15,
    save_total_limit=3,
    learning_rate=2e-4,
    bf16=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    report_to="tensorboard",
    gradient_checkpointing_kwargs={"use_reentrant": False},
    remove_unused_columns=False,
)

model, peft_config, tokenizer = create_and_prepare_model(access_token)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset_train,
    eval_dataset=dataset_val,
    peft_config=peft_config,
)

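from transformers.trainer_utils import get_last_checkpoint

# Resume only if a checkpoint already exists. Passing resume_from_checkpoint=True
# raises an error on a first run, when output_dir does not contain a checkpoint yet.
last_checkpoint = get_last_checkpoint(output_dir_gdrive)

print("Starting training...")
trainer.train(resume_from_checkpoint=last_checkpoint)
print("Training complete.")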

final_save_path = os.path.join(output_dir_gdrive, "final_adapter")
trainer.save_model(final_save_path)
print(f"Final adapter model saved to: {final_save_path}")
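
The training arguments in this section imply a fixed schedule that is worth working out before launching a long run. The sketch below does the arithmetic, assuming a single GPU and the 16,800 training samples from section 2. The Trainer's own step accounting may differ slightly, for example in how it rounds the warmup.

```python
import math

num_samples = 16_800   # training split size from section 2
per_device_batch = 4   # per_device_train_batch_size
grad_accum = 8         # gradient_accumulation_steps
num_epochs = 3         # num_train_epochs
save_steps = 15
warmup_ratio = 0.03
num_gpus = 1           # assumption: one Colab GPU

# Samples consumed per optimizer step.
effective_batch = per_device_batch * grad_accum * num_gpus
steps_per_epoch = num_samples // effective_batch
total_steps = steps_per_epoch * num_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)
checkpoints_written = total_steps // save_steps

print(f"Effective batch size: {effective_batch}")      # 32
print(f"Optimizer steps per epoch: {steps_per_epoch}")  # 525
print(f"Total optimizer steps: {total_steps}")          # 1575
print(f"Approximate warmup steps: {warmup_steps}")      # 48
print(f"Checkpoints written: {checkpoints_written}")    # 105
```

With `save_total_limit=3`, only the last three of those checkpoints remain on Drive at any time.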

## 5. Inference

After training, we can load the fine-tuned model to perform inference. We load the base model in 4-bit and then apply the trained LoRA adapters on top.
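
Conceptually, applying an adapter "on top" means each targeted linear layer computes its frozen output plus a scaled low-rank correction, `W x + (alpha / r) * B (A x)`. The toy sketch below illustrates this with plain Python lists. The matrices are made-up values, and `lora_forward` is a hypothetical helper that only mirrors the math, not PEFT's implementation. With this notebook's `lora_alpha=16` and `r=16`, the scale factor is 1.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Frozen output plus the scaled low-rank update: W x + (alpha / r) * B (A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

# Toy example: a 2x2 frozen weight with a rank-1 adapter (values are illustrative).
W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (d_out x d_in)
A = [[1.0, 1.0]]              # down-projection (rank x d_in)
B = [[0.5], [0.0]]            # up-projection (d_out x rank)
x = [2.0, 3.0]

y = lora_forward(W, A, B, x, alpha=16, r=16)
print(y)  # [4.5, 3.0]
```

Only `A` and `B` are stored in the adapter directory, which is why the base model still has to be loaded first.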

In [None]:
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

def load_quantized_lora_model(base_model_id, adapter_directory, hf_token=None):
    """Loads the base model with 4-bit quantization and then applies the LoRA adapter."""
    print(f"Loading base model: {base_model_id}...")
    model_dtype = torch.bfloat16
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=model_dtype,
        bnb_4bit_use_double_quant=True,
    )
    
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_id,
        quantization_config=quantization_config,
        torch_dtype=model_dtype,
        device_map="auto",
        token=hf_token,
    )
    
    tokenizer = AutoTokenizer.from_pretrained(base_model_id, token=hf_token)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"
    
    print(f"Loading LoRA adapters from: {adapter_directory}")
    model_with_adapters = PeftModel.from_pretrained(base_model, adapter_directory)
    model_with_adapters.eval()
    return model_with_adapters, tokenizer

base_model_id = "meta-llama/Llama-3.2-1B-Instruct"
adapter_directory = "/content/drive/MyDrive/colab_training/llama32-python-save15steps/final_adapter"

model_ft, tokenizer = load_quantized_lora_model(base_model_id, adapter_directory, access_token)

if model_ft is not None:
    print("Fine-tuned model loaded successfully!")

### Generate Code with a Prompt

In [None]:
def generate_with_hf(model, tokenizer, prompt, max_new_tokens=256, temperature=0.6, top_k=50, top_p=0.9):
    model_device = next(model.parameters()).device
    messages = [{"role": "user", "content": prompt}]
    
    try:
        # return_dict=True requests a dict with input_ids and attention_mask. Depending on
        # the transformers version, the call may otherwise return only the token IDs.
        inputs_dict = tokenizer.apply_chat_template(
            messages, tokenize=True, add_generation_prompt=True,
            return_dict=True, return_tensors="pt"
        )
    except Exception:
        # Fallback: build the Llama 3 chat prompt by hand. The string already starts with
        # <|begin_of_text|>, so the tokenizer must not add a second BOS token.
        formatted_prompt = f"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
        inputs_dict = tokenizer(formatted_prompt, return_tensors="pt", add_special_tokens=False)

    input_ids = inputs_dict['input_ids'].to(model_device)
    attention_mask = inputs_dict.get('attention_mask')
    if attention_mask is not None:
        attention_mask = attention_mask.to(model_device)

    input_length = input_ids.shape[1]
    
    with torch.no_grad():
        generate_kwargs = {
            "input_ids": input_ids,
            "max_new_tokens": max_new_tokens,
            "eos_token_id": [tokenizer.eos_token_id, 128009],
            "do_sample": True,
            "temperature": temperature,
            "top_k": top_k,
            "top_p": top_p,
            "pad_token_id": tokenizer.pad_token_id
        }
        if attention_mask is not None:
            generate_kwargs["attention_mask"] = attention_mask

        outputs = model.generate(**generate_kwargs)
    
    generated_ids = outputs[0, input_length:]
    return tokenizer.decode(generated_ids, skip_special_tokens=True)

user_prompt = "Write a Python function to calculate the factorial of a number."
print(f"User Prompt:\n{user_prompt}\n")
response = generate_with_hf(model_ft, tokenizer, user_prompt, max_new_tokens=150, temperature=0.2)
print(f"Generated Response:\n-------------------\n{response}\n-------------------")

## 6. Evaluation on HumanEval

To quantitatively assess the model's performance, we evaluate it on the HumanEval dataset. This benchmark consists of 164 programming problems with unit tests. We generate 10 code samples for each problem and use the `code_eval` metric to run the unit tests against them, reporting `pass@1` and `pass@10`.
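
For reference, `pass@k` is usually computed with the unbiased estimator from the HumanEval paper: given `n` samples for a problem, of which `c` pass the tests, `pass@k = 1 - C(n - c, k) / C(n, k)`, averaged over problems. The sketch below implements that formula directly. To my understanding the `code_eval` metric computes the same quantity, but treat that as an assumption and consult the metric's documentation.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed the tests, k: budget of attempts.
    """
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples for one problem, 3 of which pass.
p1 = pass_at_k(n=10, c=3, k=1)
p10 = pass_at_k(n=10, c=3, k=10)

print(f"pass@1  = {p1:.2f}")   # 0.30
print(f"pass@10 = {p10:.2f}")  # 1.00
```

With `n=10`, `pass@10` for a single problem is simply 1 if any sample passed and 0 otherwise, while `pass@1` is the fraction of samples that passed.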

In [None]:
from datasets import load_dataset

human_eval_dataset = load_dataset("openai_humaneval")
print(human_eval_dataset['test'][0]['prompt'])

In [None]:
from tqdm import tqdm

generated_code_samples = []
for problem in tqdm(human_eval_dataset['test']):
    prompt = problem['prompt']
    inputs = tokenizer(prompt, return_tensors="pt").to(model_ft.device)
    prompt_length = inputs['input_ids'].shape[1]

    with torch.no_grad():
        outputs = model_ft.generate(
            **inputs,
            num_return_sequences=10,
            max_new_tokens=256,
            do_sample=True,
            temperature=0.7,
            top_p=0.95,
            pad_token_id=tokenizer.pad_token_id
        )

    # Drop the prompt by token position. String replacement is fragile because
    # decoding does not always reproduce the prompt text exactly.
    completions = tokenizer.batch_decode(outputs[:, prompt_length:], skip_special_tokens=True)
    # code_eval runs each prediction as a complete program, so prepend the prompt
    # (imports, signature, docstring) to the generated function body.
    generated_code_samples.append([prompt + completion for completion in completions])

print("HumanEval generation complete.")

In [None]:
import os
import evaluate

os.environ["HF_ALLOW_CODE_EVAL"] = "1"
code_eval = evaluate.load("code_eval")

# Each HumanEval test defines check(candidate) without calling it, so append the call.
test_cases = [
    problem["test"] + f"\ncheck({problem['entry_point']})"
    for problem in human_eval_dataset['test']
]

pass_at_k, results = code_eval.compute(
    references=test_cases,
    predictions=generated_code_samples,
    k=[1, 10]
)

print("\n--- Evaluation Complete ---")
print(pass_at_k)
print("----------------------------")
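
A detail that is easy to miss when scoring HumanEval: each problem's `prompt` holds the function signature and docstring, the model generates only the body, and the `test` field defines `check(candidate)` without calling it. A complete test program is therefore the prompt, the completion, the test, and a call to `check` with the problem's `entry_point`. The toy sketch below shows this assembly on a made-up problem. `build_program` and `passes` are hypothetical helpers, and the bare `exec` is for illustration only and is not safe for untrusted model output.

```python
def build_program(prompt, completion, test, entry_point):
    """Join the pieces of a HumanEval-style problem into one runnable program."""
    return f"{prompt}{completion}\n{test}\ncheck({entry_point})\n"

def passes(program):
    """Run the program in a fresh namespace; any exception counts as a failure."""
    try:
        exec(program, {})
        return True
    except Exception:
        return False

# Made-up problem in the HumanEval layout (not taken from the benchmark).
toy_prompt = 'def add(a, b):\n    """Return the sum of a and b."""\n'
toy_test = "def check(candidate):\n    assert candidate(1, 2) == 3\n"

good = build_program(toy_prompt, "    return a + b\n", toy_test, "add")
bad = build_program(toy_prompt, "    return a - b\n", toy_test, "add")

print(passes(good))  # True
print(passes(bad))   # False
```

If the `check(...)` call is left out, the program only defines functions and never runs an assertion, so even a wrong completion appears to pass.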