This notebook demonstrates fine-tuning of small language models using Unsloth. Most of the examples in the following repository showcase task-specific fine-tuning for small language models, which serve as perfect reference implementations for SLM fine-tuning:

ðŸ“š **Reference Repository**: [Task Specific FineTuning Examples](https://github.com/Abeshith/FineTuning_LanguageModels/tree/main/Task%20Specific%20FineTuning)

These examples cover various fine-tuning approaches for small language models optimized for specific tasks.

In [None]:
!pip install -U torch torchvision torchaudio xformers --index-url https://download.pytorch.org/whl/cu128
!pip install -U unsloth
!pip install -U transformers==4.56.2 datasets==4.3.0
!pip install -U --no-deps trl==0.22.2

In [1]:
from datasets import load_dataset
from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig

ðŸ¦¥ Unsloth: Will patch your computer to enable 2x faster free finetuning.
ðŸ¦¥ Unsloth Zoo will now patch everything to make training faster!


In [2]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name='unsloth/Phi-3-mini-4k-instruct-bnb-4bit',
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)

==((====))==  Unsloth 2026.2.1: Fast Mistral patching. Transformers: 4.56.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.563 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.10.0+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.6.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.34. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/2.26G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/194 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/293 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/458 [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"],
    lora_alpha=32,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing=False,
    random_state=3407,
)

Unsloth 2026.2.1 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


In [4]:
model.print_trainable_parameters()

trainable params: 59,768,832 || all params: 3,880,848,384 || trainable%: 1.5401


In [15]:
ds = load_dataset('tatsu-lab/alpaca', split='train')
ds = ds.shuffle(seed=3407)
ds = ds.select(range(4000))
print(ds)
print(ds.column_names)

Dataset({
    features: ['instruction', 'input', 'output', 'text'],
    num_rows: 4000
})
['instruction', 'input', 'output', 'text']


In [16]:
EOS = tokenizer.eos_token or ""

def format_alpaca(batch):
    prompt_template = """Below is an instruction.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}"""

    texts = []
    inputs = batch.get("input", [""] * len(batch["instruction"]))

    for ins, inp, out in zip(batch["instruction"], inputs, batch["output"]):
        text = prompt_template.format(
            instruction=ins or "",
            input=inp or "",
            output=out or ""
        ) + EOS
        texts.append(text)

    return {"text": texts}

ds = ds.map(format_alpaca, batched=True, remove_columns=ds.column_names)

print(ds["text"][0][:500])

Map:   0%|          | 0/4000 [00:00<?, ? examples/s]

Below is an instruction.

### Instruction:
Is a trapezoid a parallelogram?

### Input:


### Response:
No, a trapezoid is not a parallelogram, as it only has one pair of parallel sides. A parallelogram, on the other hand, has two pairs of parallel sides.<|endoftext|>


In [17]:
args=SFTConfig(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-5,
    logging_steps=10,
    max_steps=500,
    output_dir='phi3_alpaca_outputs',
    report_to="none",
)

In [18]:
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=ds,
)

Unsloth: Tokenizing ["text"] (num_proc=5):   0%|          | 0/4000 [00:00<?, ? examples/s]

In [19]:
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 4,000 | Num Epochs = 3 | Total steps = 1,500
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 2
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 2 x 1) = 8
 "-____-"     Trainable parameters = 59,768,832 of 3,880,848,384 (1.54% trained)


Step,Training Loss
1,1.7456
2,1.4889
3,1.666
4,1.5047
5,1.5164
6,1.511
7,1.7844
8,1.7398
9,1.5993
10,1.28


Step,Training Loss
1,1.7456
2,1.4889
3,1.666
4,1.5047
5,1.5164
6,1.511
7,1.7844
8,1.7398
9,1.5993
10,1.28


TrainOutput(global_step=1500, training_loss=0.9297918957471848, metrics={'train_runtime': 3939.2876, 'train_samples_per_second': 3.046, 'train_steps_per_second': 0.381, 'total_flos': 4.863851117921894e+16, 'train_loss': 0.9297918957471848, 'epoch': 3.0})

In [20]:
lora_path = "phi3_alpaca_lora"
model.save_pretrained(lora_path)
tokenizer.save_pretrained(lora_path)

('phi3_alpaca_lora/tokenizer_config.json',
 'phi3_alpaca_lora/special_tokens_map.json',
 'phi3_alpaca_lora/chat_template.jinja',
 'phi3_alpaca_lora/tokenizer.model',
 'phi3_alpaca_lora/added_tokens.json',
 'phi3_alpaca_lora/tokenizer.json')

In [21]:
from transformers import TextStreamer

FastLanguageModel.for_inference(model)

def generate_response(instruction, input_text=""):
    prompt = f"""Below is an instruction.

### Instruction:
{instruction}

### Input:
{input_text}

### Response:
"""

    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    streamer = TextStreamer(tokenizer, skip_prompt=True)

    _ = model.generate(
        **inputs,
        max_new_tokens=200,
        temperature=0.7,
        top_p=0.9,
        streamer=streamer,
    )

In [22]:
generate_response("Explain what machine learning is in simple terms")

Machine learning is a type of artificial intelligence that allows computers to learn from data and make decisions without being explicitly programmed. It involves using algorithms to analyze data and make predictions or decisions based on that data.<|endoftext|>


In [25]:
generate_response("What is RAG Method in AI?")

The RAG (Retrieval-Augmented Generation) method is an AI technique that combines natural language processing and retrieval-based models to generate high-quality text. It uses a retrieval model to find relevant documents and then uses natural language processing to generate text based on the retrieved documents. This allows the AI to generate more accurate and relevant text.<|endoftext|>
