## Completion fine-tuning with Unsloth

This notebook uses Unsloth to fine-tune a model for a completion task.
In this example we fine-tune the Llama 3.2 base model to generate ASCII art. I recommend Unsloth over the plain Hugging Face libraries because it needs less memory and trains faster.

Adapted from the Unsloth notebooks. If something is broken, check:
https://unsloth.ai/

In [1]:
%%capture
!pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3  peft trl triton
!pip install --no-deps cut_cross_entropy unsloth_zoo
!pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
!pip install --no-deps unsloth

### Load base model

In [2]:
from unsloth import FastLanguageModel
import torch
import os
from dotenv import load_dotenv
load_dotenv()


model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Llama-3.2-3B",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = False,
    token=os.getenv('HF_ACCESS_TOKEN')
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!




==((====))==  Unsloth 2026.2.1: Fast Llama patching. Transformers: 5.0.0.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.563 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.10.0+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.6.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading weights:   0%|          | 0/254 [00:00<?, ?it/s]

Unsloth: Will load unsloth/Llama-3.2-3B as a legacy tokenizer.


In [3]:
tokenizer.clean_up_tokenization_spaces = False

### Add LoRA adapters to the base model and patch with Unsloth

In [4]:
# More info about parameters: https://huggingface.co/docs/peft/v0.11.0/en/package_reference/lora#peft.LoraConfig
target_modules =  ["q_proj", "k_proj", "v_proj", "o_proj",
                   "gate_proj", "up_proj", "down_proj"]

# Set to True when adding special tokens, so that lm_head is adapted as well
train_embeddings = False

if train_embeddings:
  target_modules = target_modules + ["lm_head"]

model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Rank of the LoRA matrices; the LoRA paper reports little quality loss at low ranks
    target_modules = target_modules,  # Modules of the LLM that receive LoRA adapters
    lora_alpha = 16, # Scales the adapter update (higher = more influence relative to the base weights); 16 was recommended on Reddit
    lora_dropout = 0, # The tutorial default is 0.05, but Unsloth says 0 is better
    bias = "none",    # "none" is the optimized setting
    use_gradient_checkpointing = "unsloth", # "unsloth" lowers VRAM use and supports very long contexts
    random_state = 3407,
    use_rslora = False,  # Rank-stabilized LoRA: scales the update by lora_alpha / sqrt(r) instead of lora_alpha / r; Hugging Face says this works better
    loftq_config = None, # Optional LoftQ initialization; None disables it
)

Unsloth 2026.2.1 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
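
The size of the adapter can be checked by hand. The sketch below is only an illustration: the layer dimensions are assumptions taken from the published Llama-3.2-3B configuration (they are not stated in this notebook), and the helper name `lora_param_count` is made up for this example. Each adapted linear layer gets two small matrices, A (`r x in_features`) and B (`out_features x r`), so it adds `r * (in_features + out_features)` trainable parameters. The same block also compares the standard LoRA scaling factor with the rank-stabilized one that `use_rslora = True` would select.

```python
# Assumed dimensions from the published Llama-3.2-3B config (not stated in this notebook).
HIDDEN = 3072         # hidden_size
INTERMEDIATE = 8192   # intermediate_size (MLP width)
KV_DIM = 8 * 128      # num_key_value_heads * head_dim
NUM_LAYERS = 28       # matches "patched 28 layers" in the output above
RANK = 16             # r passed to get_peft_model
LORA_ALPHA = 16       # lora_alpha passed to get_peft_model

# (in_features, out_features) of every targeted linear layer in one decoder block.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(shapes, rank, num_layers):
    """Hypothetical helper: total LoRA parameters, r * (in + out) per adapted layer."""
    per_block = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_block * num_layers


trainable = lora_param_count(TARGET_SHAPES, RANK, NUM_LAYERS)
print(f"{trainable:,}")  # 24,313,856

# Scaling applied to the adapter update B @ A.
standard_scale = LORA_ALPHA / RANK        # classic LoRA
rslora_scale = LORA_ALPHA / RANK ** 0.5   # rank-stabilized LoRA
print(standard_scale, rslora_scale)       # 1.0 4.0
```

Under these assumptions the count agrees with the "Trainable parameters = 24,313,856" line that the trainer prints further down.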


In [5]:
empty_prompt = """
{ascii_art}
"""

EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func_no_prompt(examples):
  ascii_art_samples = examples["ascii"]
  training_prompts = []
  for ascii_art in ascii_art_samples:
      training_prompt = empty_prompt.format(ascii_art=ascii_art) + EOS_TOKEN
      training_prompts.append(training_prompt)
  return { "text" : training_prompts, }


from datasets import load_dataset
dataset = load_dataset("pookie3000/ascii-cats", split = "train")
dataset = dataset.map(formatting_prompts_func_no_prompt, batched = True)
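
To see what the mapping function produces without loading the model or the dataset, here is a small standalone sketch. The `EOS_TOKEN` string is a stand-in for `tokenizer.eos_token` (assumed to be `<|end_of_text|>`, as in the printed samples), and `format_batch` and the toy batch are made up for the illustration.

```python
# Same template as in the notebook cell above.
empty_prompt = """
{ascii_art}
"""

# Assumption: stand-in for tokenizer.eos_token so this runs without the model.
EOS_TOKEN = "<|end_of_text|>"


def format_batch(examples):
    """Hypothetical mirror of formatting_prompts_func_no_prompt for a batched dict."""
    texts = [empty_prompt.format(ascii_art=art) + EOS_TOKEN for art in examples["ascii"]]
    return {"text": texts}


# With batched=True, datasets passes a dict of columns, each a list of values.
toy_batch = {"ascii": ["|\\---/|\n| o_o |\n \\_^_/"]}
formatted = format_batch(toy_batch)["text"][0]
print(repr(formatted))
```

Every training text starts with a newline, contains the raw art, and ends with a newline followed by the EOS token. The EOS token is what teaches the model where a drawing ends.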

### Visualize dataset

In [6]:
for i, sample in enumerate(dataset):
    print(f"\n------ Sample {i + 1} ----")
    print(sample["text"])
    if i > 2:
      break


------ Sample 1 ----

    /\_/\           ___
   = o_o =_______    \ \ 
    __^      __(  \.__) )
(@)<_____>__(_____)____/
<|end_of_text|>

------ Sample 2 ----

|\---/|
| o_o |
 \_^_/
<|end_of_text|>

------ Sample 3 ----

 |\__/,|   (`\
 |_ _  |.--.) )
 ( T   )     /
(((^_(((/(((_/
<|end_of_text|>

------ Sample 4 ----

   |\---/|
   | ,_, |
    \_`_/-..----.
 ___/ `   ' ,""+ \  
(__...'   __\    |`.___.';
  (_,...'(_,.`__)/'.....+
<|end_of_text|>


In [7]:
from trl import SFTConfig, SFTTrainer
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    packing = False, # Setting this to True can make training up to 5x faster for short sequences.
    args = SFTConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 60,
        learning_rate = 2e-4,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use TrackIO/WandB etc
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=6):   0%|          | 0/201 [00:00<?, ? examples/s]

🦥 Unsloth: Padding-free auto-enabled, enabling faster training.


In [8]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 201 | Num Epochs = 3 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 24,313,856 of 3,237,063,680 (0.75% trained)


Step,Training Loss
1,3.786515
2,3.693588
3,4.692733
4,4.263572
5,3.72633
6,4.400105
7,4.03158
8,3.852281
9,3.6159
10,3.621052
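
The numbers in the training banner follow from the `SFTConfig` values. A rough sketch of the arithmetic, ignoring the partially filled last batch of each epoch (variable names are illustrative):

```python
import math

num_examples = 201     # "Num examples" in the banner
per_device_batch = 2   # per_device_train_batch_size
grad_accum = 4         # gradient_accumulation_steps
num_gpus = 1
max_steps = 60

# One optimizer step consumes this many examples.
effective_batch = per_device_batch * grad_accum * num_gpus
# Examples consumed over the whole run.
examples_seen = effective_batch * max_steps
# Approximate number of passes over the dataset.
passes = examples_seen / num_examples
# The run ends partway through this epoch.
epochs_touched = math.ceil(passes)

print(effective_batch, examples_seen, round(passes, 2), epochs_touched)  # 8 480 2.39 3
```

So 60 steps cover the 201 examples a little more than twice, which is why the banner reports 3 epochs. Setting `num_train_epochs` instead of `max_steps` would train for whole passes.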


### Inference

In [9]:
from transformers import TextStreamer

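def generate_ascii_art(model):
    FastLanguageModel.for_inference(model)  # switch Unsloth to its inference mode
    # Empty prompt: the model starts from the begin-of-text token only
    inputs = tokenizer("", return_tensors = "pt").to("cuda")
    text_streamer = TextStreamer(tokenizer)  # prints decoded text while tokens are generated
    # https://huggingface.co/docs/transformers/v4.49.0/en/main_classes/text_generation#transformers.GenerationMixin
    # https://huggingface.co/docs/transformers/v4.49.0/en/main_classes/text_generation#transformers.GenerationConfig
    # generate() returns a (batch, sequence) tensor; iterating yields one row of token ids per sequence
    for token_ids in model.generate(**inputs, streamer = text_streamer, max_new_tokens = 100):
        print(token_ids)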

In [10]:
for _ in range(3):
  generate_ascii_art(model)

<|begin_of_text|>
   |\__/,|    (`\
   |o o  |  _,) )
   / ^ ~ \|( / (
  /`-----'\_)) )
  `-----'  `--'`

<|end_of_text|>
tensor([128000,    198,    256,  64696,    565,  35645,     91,    262,  29754,
          5779,    256,    765,     78,    297,    220,    765,    220,   8523,
             8,   1763,    256,    611,   6440,   4056,   1144,  61116,    611,
          2456,    220,    611,     63,  15431,  16154,     62,    595,   1763,
           220,   1595,  15431,      6,    220,   1595,    313,      6,  19884,
        128001], device='cuda:0')
<|begin_of_text|>
       /\_/\
      / o o \
     |     |
    /       \
   |   ^     |    /\_/\  
   |  =^=   |  /`^`^\
   \   ^   /  =^Y^=
    \   ^  /  /`^`^\
    \   ^ /  =^Y^=
    \   ^/  =^Y^=
     \  ( /
      \  )
tensor([128000,    198,    996,  24445,  51395,   5779,    415,    611,    297,
           297,   3120,    257,    765,    257,   9432,    262,    611,    996,
          3120,    256,    765,    256,   6440,    257,    765,

## Saving

In [11]:
from huggingface_hub import login, whoami

# Option A: explicit login (clears any bad cache)
login(token=os.getenv("HF_ACCESS_TOKEN_WRITE"))   # or paste directly: login("hf_xxxxxxxxxxxxxxxx")

# Option B: or just check who you are
print(whoami(token=os.getenv("HF_ACCESS_TOKEN_WRITE")))

{'type': 'user', 'id': '68df5cdb44187d11a6066e73', 'name': 'Jim1892', 'fullname': 'Nayeem Hossen Jim', 'email': 'nayeemhossenjimdelta@gmail.com', 'emailVerified': True, 'canPay': False, 'billingMode': 'prepaid', 'periodEnd': 1772323200, 'isPro': False, 'avatarUrl': '/avatars/b8a202cf9b98f4662f8b4082e367dc3d.svg', 'orgs': [], 'auth': {'type': 'access_token', 'accessToken': {'displayName': 'Finetune', 'role': 'write', 'createdAt': '2026-02-21T06:44:08.268Z'}}}


### Save the LoRA adapter

The saved adapter is useful both for inference and for loading the model again later.

In [12]:
model.push_to_hub(
    "Jim1892/Llama-3.2-3B-ascii-cats-bd",
    token=os.getenv("HF_ACCESS_TOKEN_WRITE")
)

Processing Files (0 / 0)      : |          |  0.00B /  0.00B            

New Data Upload               : |          |  0.00B /  0.00B            

  ...adapter_model.safetensors:  17%|#7        | 16.7MB / 97.3MB            

Saved model to https://huggingface.co/Jim1892/Llama-3.2-3B-ascii-cats-bd


### Merge the model with the LoRA weights and save to GGUF

You can then run inference locally with Ollama or llama.cpp.

##### Popular quantization methods

- **q4_k_m**  
  4-bit quantization. Low memory. Many models pulled with Ollama use this quantization by default.
- **q8_0**  
  8-bit quantization. Medium memory.
- **f16**  
  16-bit floats. Many models are already stored in 16-bit, in which case no quantization happens.
- **not_quantized**  
  Often the same as f16.
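
For a rough sense of what these options mean for file size, the sketch below multiplies the parameter count by a nominal number of bits per weight. The bit widths are simplifying assumptions: real GGUF files are somewhat larger, because quantized blocks also store scale factors and some tensors are kept at higher precision. The parameter count is the total from the training banner minus the LoRA parameters.

```python
# Base-model parameters: banner total minus the trainable LoRA parameters.
NUM_PARAMS = 3_237_063_680 - 24_313_856

# Assumption: nominal bits per weight, ignoring block scales and mixed-precision tensors.
NOMINAL_BITS = {"q4_k_m": 4, "q8_0": 8, "f16": 16}


def approx_size_gib(num_params, bits_per_weight):
    """Lower-bound estimate of the weight storage in GiB."""
    return num_params * bits_per_weight / 8 / 1024**3


sizes = {name: approx_size_gib(NUM_PARAMS, bits) for name, bits in NOMINAL_BITS.items()}
for name, gib in sizes.items():
    print(f"{name:>7}: about {gib:.1f} GiB")
```

The estimate shows why 4-bit files are attractive for local inference: they need roughly a quarter of the memory of the 16-bit weights.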

In [None]:
model.push_to_hub_gguf(
    "Jim1892/Llama-3.2-3B-ascii-cats-bd-F32-GGUF",
    tokenizer,
    quantization_method="F32",
    token = os.getenv('HF_ACCESS_TOKEN_WRITE')
)

Unsloth: Converting model to GGUF format...
Unsloth: Merging model weights to 16-bit format...
Found HuggingFace hub cache directory: /root/.cache/huggingface/hub


Downloading (incomplete total...): 0.00B [00:00, ?B/s]

Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]

Checking cache directory for required files...


Unsloth: Copying 2 files from cache to `/tmp/unsloth_gguf_2133z3h3`: 100%|██████████| 2/2 [02:11<00:00, 65.83s/it]


Successfully copied all 2 files from cache to `/tmp/unsloth_gguf_2133z3h3`
Checking cache directory for required files...
Cache check failed: tokenizer.model not found in local cache.
Not all required files found in cache. Will proceed with downloading.


Unsloth: Preparing safetensor model files: 100%|██████████| 2/2 [00:00<00:00, 17296.10it/s]


### Load model and saved LoRA adapters

Use this if you want to continue fine-tuning or run inference with the model in safetensors format.

In [None]:
from transformers import TextStreamer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Jim1892/Llama-3.2-3B-ascii-cats-bd",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = False,
    token=os.getenv('HF_ACCESS_TOKEN_WRITE')
)


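def generate_ascii_art(model):
    FastLanguageModel.for_inference(model)  # switch Unsloth to its inference mode
    # Empty prompt: the model starts from the begin-of-text token only
    inputs = tokenizer("", return_tensors = "pt").to("cuda")
    text_streamer = TextStreamer(tokenizer)  # prints decoded text while tokens are generated
    # https://huggingface.co/docs/transformers/v4.49.0/en/main_classes/text_generation#transformers.GenerationMixin
    # https://huggingface.co/docs/transformers/v4.49.0/en/main_classes/text_generation#transformers.GenerationConfig
    # generate() returns a (batch, sequence) tensor; iterating yields one row of token ids per sequence
    for token_ids in model.generate(**inputs, streamer = text_streamer, max_new_tokens = 100):
        print(token_ids)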
