# Fine-Tuning a Small Language Model (TinyLlama) Using QLoRA for SQL and Code Generation

## Project Overview

This project presents an end-to-end pipeline for fine-tuning a Small Language Model (SLM) with
parameter-efficient fine-tuning (PEFT). The objective is to adapt a lightweight
instruction-tuned model to structured generation tasks, specifically SQL query
generation and Python code synthesis, while operating under limited GPU resources.

To achieve this, QLoRA (Quantized Low-Rank Adaptation) is employed: the frozen base weights are
stored in 4-bit precision while small LoRA adapters are trained on top, which keeps memory use low
without significantly sacrificing model performance.


## Motivation and Business Relevance

Fine-tuning large language models is often computationally expensive and impractical for many
organizations. This project demonstrates how modern PEFT techniques allow companies to:

- Adapt language models to domain-specific tasks
- Reduce infrastructure and memory requirements
- Deploy customized models on consumer-grade hardware

The resulting approach is highly relevant for real-world applications such as:
- Automated SQL query generation
- Code assistance tools
- Internal analytics and data engineering workflows



In [1]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer, BitsAndBytesConfig
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, TaskType, prepare_model_for_kbit_training

In [2]:
print(torch.version.cuda)           
print(torch.__version__)
print(torch.cuda.is_available())   
print(torch.cuda.get_device_name(0))

11.8
2.7.1+cu118
True
NVIDIA GeForce RTX 4070 SUPER


## Base Model Selection

The base model used in this project is **TinyLlama-1.1B-Chat**, a compact instruction-tuned
causal language model.

This model was selected due to:
- Its strong performance relative to its size
- Compatibility with instruction-based fine-tuning
- Suitability for resource-constrained environments

## Quantization and Memory Optimization

To enable efficient training, the model is loaded using 4-bit quantization via the
**BitsAndBytes** library. The following configuration is applied:

- 4-bit NF4 quantization
- Double quantization, which also quantizes the quantization constants to save additional memory
- FP16 computation for GPU acceleration

This setup significantly reduces VRAM usage, making it possible to fine-tune the model on a
single GPU without compromising training stability.
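
As a rough back-of-the-envelope check on the memory savings, the weight footprint can be estimated from the parameter count. The sketch below assumes that every weight is quantized (in practice some layers, such as embeddings and norms, typically stay in higher precision), uses the block sizes described in the QLoRA paper (64 weights per block, 256 constants per second-level block), and ignores activations, gradients, and optimizer state. The helper `weight_memory_gib` is a made-up name for illustration.

```python
GIB = 1024 ** 3

def weight_memory_gib(n_params: int, bits_per_param: float) -> float:
    """Estimate weight storage in GiB for a given average bit width."""
    return n_params * bits_per_param / 8 / GIB

# Base parameter count: "all params" minus "trainable params" as reported
# by print_trainable_parameters() in this notebook.
N_PARAMS = 1_101_174_784 - 1_126_400

# Assumed block sizes (QLoRA paper defaults), not read from the library.
BLOCK = 64          # weights sharing one scaling constant
NESTED_BLOCK = 256  # first-level constants sharing one second-level constant

# Overhead of the scaling constants, in bits per weight.
overhead_plain = 32 / BLOCK                                # one fp32 constant per block
overhead_double = 8 / BLOCK + 32 / (BLOCK * NESTED_BLOCK)  # 8-bit constants plus fp32 second level

fp16_gib = weight_memory_gib(N_PARAMS, 16)
nf4_gib = weight_memory_gib(N_PARAMS, 4 + overhead_plain)
nf4_dq_gib = weight_memory_gib(N_PARAMS, 4 + overhead_double)

print(f"FP16 weights:       {fp16_gib:.2f} GiB")
print(f"NF4:                {nf4_gib:.2f} GiB")
print(f"NF4 + double quant: {nf4_dq_gib:.2f} GiB")
```

These figures cover weights only, so actual VRAM use during training will be higher.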


In [3]:
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=bnb_config,
    trust_remote_code=False  # TinyLlama uses the built-in Llama architecture; no custom code needed
)

In [4]:
save_path = "./tinyllama_local"

tokenizer.save_pretrained(save_path)
model.save_pretrained(save_path)


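## Offline TinyLlama

The copy saved to `./tinyllama_local` can be reloaded from disk, so later runs do not need to
download the model again.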

In [None]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)

local_model_path = "./tinyllama_local"

tokenizer = AutoTokenizer.from_pretrained(local_model_path, use_fast=True)

model = AutoModelForCausalLM.from_pretrained(
    local_model_path,
    device_map="auto",
    quantization_config=bnb_config,
    trust_remote_code=False
)

## Parameter-Efficient Fine-Tuning (LoRA)

Instead of updating all model parameters, Low-Rank Adaptation (LoRA) is applied to the query and
value projection matrices (`q_proj`, `v_proj`) of each attention layer.

Key benefits of this approach include:
- Faster training
- Reduced risk of overfitting
- Minimal additional storage requirements

Only a small number of trainable parameters are introduced, while the base model weights remain frozen.


In [6]:
model = prepare_model_for_kbit_training(model)


lora_config = LoraConfig(
    r=8,                              
    lora_alpha=16,                    
    target_modules=["q_proj", "v_proj"],  
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM        
)


model = get_peft_model(model, lora_config)

model.print_trainable_parameters()

trainable params: 1,126,400 || all params: 1,101,174,784 || trainable%: 0.1023
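
The reported trainable-parameter count can be reproduced by hand. For a linear layer with `d_in` inputs and `d_out` outputs, LoRA adds two matrices of shapes `r × d_in` and `d_out × r`, which gives `r * (d_in + d_out)` parameters. The sketch below assumes the TinyLlama-1.1B architecture values from its published config (hidden size 2048, 22 layers, 32 attention heads, 4 key/value heads). These values are not shown in this notebook, so treat them as assumptions.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

# Assumed architecture values (from the model's published config).
HIDDEN = 2048
LAYERS = 22
HEADS = 32
KV_HEADS = 4
R = 8  # matches LoraConfig(r=8)

head_dim = HIDDEN // HEADS    # width of a single attention head
kv_dim = KV_HEADS * head_dim  # v_proj output width under grouped-query attention

# One adapter on q_proj and one on v_proj in every layer.
per_layer = lora_params(HIDDEN, HIDDEN, R) + lora_params(HIDDEN, kv_dim, R)
trainable = per_layer * LAYERS
all_params = 1_101_174_784  # "all params" from the output above

print(f"{trainable:,}")                       # 1,126,400
print(f"{100 * trainable / all_params:.4f}")  # 0.1023
```

Under these assumptions the arithmetic matches the `print_trainable_parameters()` output exactly.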


## Dataset Selection

Two complementary datasets are used to improve the model’s reasoning and generation capabilities:

### Spider Dataset
- Task: Natural Language → SQL Query generation
- Purpose: Improve structured reasoning and database query formulation

### MBPP Dataset
- Task: Python code generation
- Purpose: Enhance algorithmic reasoning and coding ability

By combining these datasets, the model is exposed to both structured query logic and general-purpose
programming patterns.


In [7]:
spider = load_dataset("spider")
print(spider["train"][0])

{'db_id': 'department_management', 'query': 'SELECT count(*) FROM head WHERE age  >  56', 'question': 'How many heads of the departments are older than 56 ?', 'query_toks': ['SELECT', 'count', '(', '*', ')', 'FROM', 'head', 'WHERE', 'age', '>', '56'], 'query_toks_no_value': ['select', 'count', '(', '*', ')', 'from', 'head', 'where', 'age', '>', 'value'], 'question_toks': ['How', 'many', 'heads', 'of', 'the', 'departments', 'are', 'older', 'than', '56', '?']}


In [8]:
mbpp = load_dataset("json", data_files=r".\sanitized-mbpp.json")
print(mbpp["train"][0])

{'source_file': 'Benchmark Questions Verification V2.ipynb', 'task_id': 2, 'prompt': 'Write a function to find the shared elements from the given two lists.', 'code': 'def similar_elements(test_tup1, test_tup2):\n  res = tuple(set(test_tup1) & set(test_tup2))\n  return (res) ', 'test_imports': [], 'test_list': ['assert set(similar_elements((3, 4, 5, 6),(5, 7, 4, 10))) == set((4, 5))', 'assert set(similar_elements((1, 2, 3, 4),(5, 4, 3, 7))) == set((3, 4))', 'assert set(similar_elements((11, 12, 14, 13),(17, 15, 14, 13))) == set((13, 14))']}


## Instruction Formatting

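All samples are converted into a simple conversational instruction format:

### Example:
`### Human: <instruction>`
`### Assistant: <expected output>`

This turn-based structure mirrors a chat exchange. It is a custom template rather than the
tokenizer's built-in chat template, so the same `### Human:` / `### Assistant:` prefixes must be
used at inference time to match what the model saw during training.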


In [9]:
def format_spider(example):
    prompt = f"### Human: {example['question']}\n### Assistant: "
    response = example['query']  
    return {"text": prompt + response}

def format_mbpp(example):
    prompt = f"### Human: {example['prompt']}\n### Assistant: "
    response = example['code']
    return {"text": prompt + response}


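# Keep only the formatted "text" column so both datasets share the same schema
spider_formatted = spider["train"].map(format_spider, remove_columns=spider["train"].column_names)
mbpp_formatted = mbpp["train"].map(format_mbpp, remove_columns=mbpp["train"].column_names)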

Map:   0%|          | 0/7000 [00:00<?, ? examples/s]

Map:   0%|          | 0/427 [00:00<?, ? examples/s]

In [10]:
from datasets import concatenate_datasets

combined_dataset = concatenate_datasets([spider_formatted, mbpp_formatted])

In [11]:
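def tokenize_fn(examples):
    # Append EOS so the model learns where each answer ends
    texts = [text + tokenizer.eos_token for text in examples["text"]]
    tokens = tokenizer(
        texts,
        truncation=True,
        padding="max_length",
        max_length=512
    )

    # Labels mirror input_ids, but padded positions are set to -100
    # so the loss ignores them
    tokens["labels"] = [
        [tok if keep else -100 for tok, keep in zip(ids, mask)]
        for ids, mask in zip(tokens["input_ids"], tokens["attention_mask"])
    ]
    return tokens

tokenized_dataset = combined_dataset.map(tokenize_fn, batched=True, remove_columns=["text"])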

Map:   0%|          | 0/7427 [00:00<?, ? examples/s]

## Training Configuration

The training process is managed using Hugging Face’s `Trainer` API with the following setup:

- Mixed precision training (FP16)
- A per-device batch size of 4 to fit within GPU memory
- Five epochs to allow task adaptation
- A checkpoint every 500 steps, keeping only the two most recent

This configuration balances training efficiency with model stability.
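
To get a feel for the length of the run, the number of optimizer steps follows from the dataset size and batch size. This sketch assumes a single GPU, no gradient accumulation (the `TrainingArguments` default), and the 7,427 combined examples shown in the tokenization output. `total_steps` is an illustrative helper, not part of the `Trainer` API.

```python
import math

def total_steps(n_examples: int, batch_size: int, epochs: int, grad_accum: int = 1) -> int:
    """Optimizer steps for a run in which the last partial batch is kept."""
    batches_per_epoch = math.ceil(n_examples / batch_size)
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return steps_per_epoch * epochs

steps = total_steps(n_examples=7427, batch_size=4, epochs=5)
checkpoints_written = steps // 500  # save_steps=500

print(steps)                # 9285
print(checkpoints_written)  # 18
```

With `save_total_limit=2`, only the two most recent of those checkpoints remain on disk at any time.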


In [12]:
training_args = TrainingArguments(
    output_dir="./tinyllama_finetuned",  
    overwrite_output_dir=True,        
    num_train_epochs=5,            
    per_device_train_batch_size=4,      
    save_steps=500,                 
    save_total_limit=2,     
    learning_rate=3e-4,         
    weight_decay=0.01,          
    logging_dir="./logs",      
    logging_steps=50,             
    fp16=True,             
    load_best_model_at_end=False      
)

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    tokenizer=tokenizer,
)

In [None]:
trainer.train()

In [15]:
trainer.save_model("./tinyllama_finetuned")
tokenizer.save_pretrained("./tinyllama_finetuned")

('./tinyllama_finetuned\\tokenizer_config.json',
 './tinyllama_finetuned\\special_tokens_map.json',
 './tinyllama_finetuned\\chat_template.jinja',
 './tinyllama_finetuned\\tokenizer.json')

## Model Evaluation and Inference

After training, the model is spot-checked qualitatively by running inference on a few prompts
outside the training data, covering:

- SQL query generation
- Python programming tasks
- Multilingual understanding (English and Spanish)

Sampling-based decoding (temperature 0.7, top-p 0.95, top-k 50) with a repetition penalty of 1.1
is used to keep responses varied while discouraging repetition.
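
To make the decoding settings concrete, the following is a simplified sketch of how temperature, top-k, and top-p (nucleus) filtering reshape a next-token distribution before a token is drawn. It is written in plain Python for readability and is not the implementation used by `model.generate`. The function name `filtered_probs` and the toy logits are illustrative assumptions, and the repetition penalty is left out.

```python
import math

def filtered_probs(logits, temperature=0.7, top_k=50, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # Temperature scaling: values below 1 sharpen the distribution
    scaled = [x / temperature for x in logits]

    # Softmax, subtracting the max for numerical stability
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep at most the k most likely tokens
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_mass = sum(probs[i] for i in ranked)

    # Top-p: keep the smallest prefix whose renormalized mass reaches top_p
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / k_mass
        if cumulative >= top_p:
            break

    # Renormalize the surviving tokens so they sum to 1
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}

# Toy vocabulary of four tokens
dist = filtered_probs([2.0, 1.0, 0.0, -5.0], temperature=1.0, top_k=3, top_p=0.8)
print(sorted(dist))  # [0, 1]
```

In this toy case the two most likely tokens already cover more than 80% of the probability mass, so the rest are discarded and sampling happens only between those two.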

## LoRA Merge and Deployment Readiness

Once fine-tuning is completed, the LoRA adapters are merged into the base model weights.
This step produces a standalone model suitable for deployment without requiring additional PEFT layers.

The final merged model can be:
- Loaded directly for inference
- Integrated into downstream applications
- Served via APIs or web interfaces


In [None]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)

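# "./tinyllama_finetuned" holds the LoRA adapter written by trainer.save_model().
# With peft installed, from_pretrained resolves the base model named in the
# adapter config and attaches the adapter on top of it.
model_path = "./tinyllama_finetuned"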

tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    quantization_config=bnb_config,
    trust_remote_code=False 
)

model.eval()

In [17]:
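def generar_respuesta(prompt, max_tokens=200, temperature=0.7):
    """Generate a sampled completion for `prompt` using the global model and tokenizer.

    Returns the decoded sequence, which includes the prompt itself.
    """
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            do_sample=True,           # sample instead of greedy decoding
            temperature=temperature,
            top_p=0.95,               # nucleus sampling threshold
            top_k=50,                 # consider only the 50 most likely tokens
            repetition_penalty=1.1,   # discourage repeated tokens
            pad_token_id=tokenizer.eos_token_id
        )
    respuesta = tokenizer.decode(output[0], skip_special_tokens=True)
    return respuesta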
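
Because `generar_respuesta` decodes the full sequence, its return value contains the prompt as well as the completion. A small post-processing step can isolate the answer. `extract_answer` below is a hypothetical helper, not part of the original notebook, and it assumes the `### Assistant:` tag used in the prompts.

```python
ASSISTANT_TAG = "### Assistant:"

def extract_answer(generated: str) -> str:
    """Return only the text after the last assistant tag, stripped of whitespace."""
    _, sep, answer = generated.rpartition(ASSISTANT_TAG)
    # If the tag is missing, fall back to the full text
    return answer.strip() if sep else generated.strip()

sample = "### Human: How many rows are in t?\n### Assistant: SELECT count(*) FROM t"
print(extract_answer(sample))  # SELECT count(*) FROM t
```

Splitting on the last occurrence of the tag keeps the helper usable even if the prompt itself happens to contain the tag text.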


In [18]:
prompt = "### Human: tell wich one is the best programming langue between SQL and Python for Data Science?\n### Assistant:"
respuesta = generar_respuesta(prompt)
print(respuesta)

### Human: tell wich one is the best programming langue between SQL and Python for Data Science?
### Assistant: SELECT ln FROM language WHERE lang  =  'SQL' INTERSECT SELECT ln FROM language WHERE lang  =  'Python'


In [19]:
prompt = "### Human: Eres capaz de entender espanol?\n### Assistant:"
respuesta = generar_respuesta(prompt)
print(respuesta)

### Human: Eres capaz de entender espanol?
### Assistant: Yes, I can understand Spanish.


In [None]:
from peft import PeftModel

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)

base_model = AutoModelForCausalLM.from_pretrained(
    "./tinyllama_local",
    device_map="auto",
    quantization_config=bnb_config,
    trust_remote_code=False
)

model = PeftModel.from_pretrained(base_model, "./tinyllama_finetuned")

merged_model = model.merge_and_unload()

merged_model.save_pretrained("./tinyllama_merged")

tokenizer = AutoTokenizer.from_pretrained("./tinyllama_finetuned", use_fast=True)
tokenizer.save_pretrained("./tinyllama_merged")

In [None]:
model_streamlit = AutoModelForCausalLM.from_pretrained(
    "./tinyllama_merged",
    device_map="auto",
    quantization_config=bnb_config,
    trust_remote_code=False
)

tokenizer = AutoTokenizer.from_pretrained("./tinyllama_merged", use_fast=True)
model_streamlit.eval()

In [21]:
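# generar_respuesta reads the global `model`, so point it at the reloaded merged model
model = model_streamlit
prompt = "### Human: How do I make a query in SQL where I select everyone with age above 41 years old?\n### Assistant:"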
respuesta = generar_respuesta(prompt)
print(respuesta)

### Human: How do I make a query in SQL where I select everyone with age above 41 years old?
### Assistant: SELECT * FROM employees WHERE age  >  41
