<a href="https://colab.research.google.com/github/Lakshya-Varshney/ML-practice/blob/main/conversation_finetuning_paul_graham.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Conversation models finetuning

In this notebook we will finetune an instruction/chat model to respond in the style of Paul Graham (https://www.paulgraham.com/).

Adapted from the Unsloth notebooks. If something is broken, check:
https://unsloth.ai/

### Installation

In [2]:
%%capture
!pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton
!pip install --no-deps cut_cross_entropy unsloth_zoo
!pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
!pip install --no-deps unsloth
# Pin datasets last; this overrides whatever version the earlier installs pulled in
!pip install datasets==4.3.0

### Load base model

In [None]:
import warnings
!pip install datasets==4.3.0

from unsloth import FastLanguageModel
import torch
from google.colab import userdata

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
    token=userdata.get('HF_ACCESS_TOKEN')
)

Collecting datasets==4.3.0
  Using cached datasets-4.3.0-py3-none-any.whl.metadata (18 kB)
Using cached datasets-4.3.0-py3-none-any.whl (506 kB)
Installing collected packages: datasets
  Attempting uninstall: datasets
    Found existing installation: datasets 4.4.1
    Uninstalling datasets-4.4.1:
      Successfully uninstalled datasets-4.4.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
unsloth 2025.11.3 requires tyro, which is not installed.
unsloth-zoo 2025.11.4 requires msgspec, which is not installed.
unsloth-zoo 2025.11.4 requires tyro, which is not installed.
unsloth 2025.11.3 requires trl!=0.19.0,<=0.23.0,>=0.18.2, but you have trl 0.25.1 which is incompatible.
unsloth-zoo 2025.11.4 requires torchao>=0.13.0, but you have torchao 0.10.0 which is incompatible.
unsloth-zoo 2025.11.4 requires trl!=0.19.0,<=0.24.0,>=0.18.2, but you have trl 0.25.1 whi

    PyTorch 2.6.0+cu124 with CUDA 1204 (you have 2.9.0+cu126)
    Python  3.12.9 (you have 3.12.12)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details


Switching to PyTorch attention since your Xformers is broken.

Unsloth: Xformers was not installed correctly.
Please install xformers separately first.
Then confirm if it's correctly installed by running:
python -m xformers.info

Longer error message:
xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.6.0+cu124 with CUDA 1204 (you have 2.9.0+cu126)
    Python  3.12.9 (you have 3.12.12)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.11.3: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.c

model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

In [None]:
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

formatted_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(formatted_text)


<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is the capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
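
To make the layout of the printed prompt easier to follow, here is a hand-rolled approximation of it in plain Python. It is only a sketch: `render_llama31_prompt` is a hypothetical helper, not part of `transformers` or Unsloth, and the date preamble is copied from the output above rather than computed. The real template lives in the tokenizer and handles more cases (tools, custom dates).

```python
def render_llama31_prompt(messages, add_generation_prompt=True):
    """Approximate the Llama 3.1 chat layout shown above (illustration only)."""
    # Preamble copied verbatim from the printed output; the real template generates it.
    preamble = "Cutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\n"

    def header(role):
        return f"<|start_header_id|>{role}<|end_header_id|>\n\n"

    turns = list(messages)
    system = ""
    if turns and turns[0]["role"] == "system":
        system = turns.pop(0)["content"]

    # A system block is always emitted, even when no system message is given.
    text = "<|begin_of_text|>" + header("system") + preamble + system + "<|eot_id|>"
    for message in turns:
        text += header(message["role"]) + message["content"] + "<|eot_id|>"
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        text += header("assistant")
    return text


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]
prompt = render_llama31_prompt(messages)
print(prompt)
```

With `add_generation_prompt=False` the string ends at the last `<|eot_id|>`, which is the form used for the training texts later in the notebook.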




### Add LoRA adapters to the base model and patch with Unsloth

In [None]:
# More info about parameters: https://huggingface.co/docs/peft/v0.11.0/en/package_reference/lora#peft.LoraConfig

target_modules =  ["q_proj", "k_proj", "v_proj", "o_proj",
                   "gate_proj", "up_proj", "down_proj"]

# Set to True only if you added new special tokens to the tokenizer
train_embeddings = False

if train_embeddings:
  # Training both lm_head and embed_tokens runs out of memory on Colab:
  # target_modules = target_modules + ["lm_head", "embed_tokens"]
  # so on Colab, after adding new tokens, train only lm_head instead
  target_modules = target_modules + ["lm_head"]


model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # rank of the LoRA matrices; per the LoRA paper, little is lost at a relatively low rank
    target_modules = target_modules, # modules of the LLM that receive LoRA weights
    lora_alpha = 16,  # scales the adapter output (higher = more influence on the base model); 16 was recommended on Reddit
    lora_dropout = 0, # the tutorial default is 0.05, but Unsloth says 0 is better
    bias = "none",   # "none" is optimized
    use_gradient_checkpointing = "unsloth", # "unsloth" lowers VRAM use for very long contexts
    random_state = 3407,
    use_rslora = False, # rank-stabilized LoRA: scales by lora_alpha/sqrt(r) instead of lora_alpha/r; Hugging Face says this works better
    loftq_config = None, # LoftQ initialization (not used here)
)

Unsloth 2025.11.3 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
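
As a sanity check on the LoRA settings, the number of trainable parameters can be worked out by hand. The sketch below assumes the published Llama 3.1 8B dimensions (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 layers); these values are not stated in this notebook, so treat them as assumptions. Each adapted linear layer gets two matrices, `A` of shape `r x in_features` and `B` of shape `out_features x r`.

```python
# Assumed Llama 3.1 8B dimensions (from the public model config, not from this notebook)
HIDDEN = 4096          # hidden_size
INTERMEDIATE = 14336   # MLP intermediate_size
KV_DIM = 8 * 128       # 8 key/value heads x head_dim 128 (grouped-query attention)
NUM_LAYERS = 32

R = 16           # same as r in get_peft_model
LORA_ALPHA = 16  # same as lora_alpha in get_peft_model

# (in_features, out_features) for every module listed in target_modules
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, r):
    # A has r * in_features entries, B has out_features * r entries
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in shapes.values())
trainable = per_layer * NUM_LAYERS
print(f"{trainable:,} trainable LoRA parameters")

# Scaling applied to the adapter output
standard_scale = LORA_ALPHA / R        # use_rslora = False
rslora_scale = LORA_ALPHA / R ** 0.5   # use_rslora = True
print(standard_scale, rslora_scale)
```

Under these assumptions the total comes to 41,943,040, which agrees with the "Trainable parameters" figure in the training banner further down in this notebook.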


### Create dataset

In [None]:
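def formatting_prompts_func(examples):
    # With batched = True, examples["conversations"] is a list of conversations
    convos = examples["conversations"]
    # The chat template already closes every turn with <|eot_id|>, so no EOS token is appended by hand
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }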

from datasets import load_dataset
dataset = load_dataset("pookie3000/pg_chat", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)

README.md:   0%|          | 0.00/24.0 [00:00<?, ?B/s]

pg_chat_combined.jsonl: 0.00B [00:00, ?B/s]

Generating train split:   0%|          | 0/484 [00:00<?, ? examples/s]

Map:   0%|          | 0/484 [00:00<?, ? examples/s]
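
The `batched = True` argument changes what the mapping function receives: a dict that maps each column name to a list of values, and it has to return a dict of lists of the same length. The sketch below imitates that contract without the `datasets` library. `render_chat` is a hypothetical stand-in for `tokenizer.apply_chat_template`, and the two conversations are made-up examples in the role/content format the dataset appears to use.

```python
def render_chat(convo):
    """Hypothetical, simplified stand-in for tokenizer.apply_chat_template."""
    return "".join(f"<{m['role']}>{m['content']}</{m['role']}>" for m in convo)


def formatting_prompts_func(examples):
    # `examples` is one batch: {"column_name": [value_for_row_0, value_for_row_1, ...]}
    return {"text": [render_chat(convo) for convo in examples["conversations"]]}


# A made-up batch of two rows, shaped the way a batched map would pass it in
batch = {
    "conversations": [
        [
            {"role": "user", "content": "What is your name?"},
            {"role": "assistant", "content": "My name is Paul Graham."},
        ],
        [
            {"role": "user", "content": "Hi"},
            {"role": "assistant", "content": "Hello."},
        ],
    ]
}

out = formatting_prompts_func(batch)
for text in out["text"]:
    print(text)
```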

In [None]:
for i, sample in enumerate(dataset):
    print(f"\n------ Sample {i + 1} ----")
    print(sample["text"])
    if i > 2:
      break



------ Sample 1 ----
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

What is your name?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Nice to meet you! My name is Paul Graham, and I'm delighted to make your acquaintance.<|eot_id|>

------ Sample 2 ----
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

What's your name?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Nice to meet you! My name is Paul Graham, and I'm delighted to make your acquaintance.<|eot_id|>

------ Sample 3 ----
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

What is your name?<|eot_id|><|start_header_id|>assistant<|end_h

### Train the model


In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 5,
        #max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=6):   0%|          | 0/484 [00:00<?, ? examples/s]

In [None]:
trainer_stats = trainer.train()

The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 484 | Num Epochs = 5 | Total steps = 305
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040 of 8,072,204,288 (0.52% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,2.0842
2,2.1333
3,2.3406
4,1.9343
5,1.8417
6,1.8039
7,1.6334
8,1.559
9,1.4728
10,1.4265
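
The "Total steps = 305" and "Total batch size = 8" figures in the banner follow from the trainer arguments. A small sketch of the arithmetic, assuming the last partial batch of each epoch still counts as a step (rounding details can differ between `transformers` versions):

```python
import math

num_examples = 484
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1
num_train_epochs = 5

# Examples consumed per optimizer step
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus

# Dataloader batches per epoch, then optimizer steps per epoch (both rounded up)
batches_per_epoch = math.ceil(num_examples / (per_device_train_batch_size * num_gpus))
steps_per_epoch = math.ceil(batches_per_epoch / gradient_accumulation_steps)
total_steps = steps_per_epoch * num_train_epochs

print(effective_batch, steps_per_epoch, total_steps)
```

With 484 examples this gives 61 optimizer steps per epoch and 305 in total, which is what the banner reports.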


### Inference


In [None]:
FastLanguageModel.for_inference(model)

messages = [
    {"role": "user", "content": "How to do a startup?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128, use_cache = True)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

How to do a startup?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"Start a startup? Well, it's not for everyone, that's for sure. But if you're dead serious about doing it, here's what you need to do: find a co-founder, and fast. It's like getting married, but without the divorce. Then, find a mentor, someone who's been around the block a few times. And for God's sake, don't do it alone. You need people to bounce ideas off of, to get feedback from. And don't worry if you don't have a brilliant idea at first. Just start, and let the ideas come to you. Just remember, startups


### Save lora adapter

The saved adapter is useful both for inference and for loading the model again later.

In [None]:
# model.push_to_hub(
#     "lakshyavarshney/Meta-Llama-3.1-8B-Instruct-Paul-Graham-QLORA",
#     tokenizer,
#     token = userdata.get('HF_ACCESS_TOKEN')
# )

README.md: 0.00B [00:00, ?B/s]

Processing Files (0 / 0)      : |          |  0.00B /  0.00B            

New Data Upload               : |          |  0.00B /  0.00B            

  ...adapter_model.safetensors:  10%|9         | 16.8MB /  168MB            

Saved model to https://huggingface.co/lakshyavarshney/Meta-Llama-3.1-8B-Instruct-Paul-Graham-QLORA


In [9]:
import sys
import torch
import torchvision

print("python:", sys.version.splitlines()[0])
print("torch:", torch.__version__)
print("torch.cuda.is_available():", torch.cuda.is_available())
print("torch.version.cuda:", torch.version.cuda)
print("torchvision:", torchvision.__version__)


AttributeError: partially initialized module 'torchvision' has no attribute 'extension' (most likely due to a circular import)

In [7]:
!pip uninstall -y torch torchvision torchaudio
!pip uninstall -y torch torchvision torchaudio
!rm -rf /usr/local/lib/python3.12/dist-packages/torch*
!rm -rf /usr/local/lib/python3.12/dist-packages/torchaudio*
!rm -rf /usr/local/lib/python3.12/dist-packages/torchvision*
!pip cache purge
# installs the latest available PyTorch build for CUDA 12.4
!pip install --index-url https://download.pytorch.org/whl/cu124 \
    torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu124


Found existing installation: torch 2.9.1+cpu
Uninstalling torch-2.9.1+cpu:
  Successfully uninstalled torch-2.9.1+cpu
Found existing installation: torchvision 0.24.1+cpu
Uninstalling torchvision-0.24.1+cpu:
  Successfully uninstalled torchvision-0.24.1+cpu
Found existing installation: torchaudio 2.9.1+cpu
Uninstalling torchaudio-2.9.1+cpu:
  Successfully uninstalled torchaudio-2.9.1+cpu
Files removed: 26 (194.2 MB)
Looking in indexes: https://download.pytorch.org/whl/cu124, https://download.pytorch.org/whl/cu124
Collecting torch
  Downloading https://download.pytorch.org/whl/cu124/torch-2.6.0%2Bcu124-cp312-cp312-linux_x86_64.whl.metadata (28 kB)
Collecting torchvision
  Downloading https://download.pytorch.org/whl/cu124/torchvision-0.21.0%2Bcu124-cp312-cp312-linux_x86_64.whl.metadata (6.1 kB)
Collecting torchaudio
  Downloading https://download.pytorch.org/whl/cu124/torchaudio-2.6.0%2Bcu124-cp312-cp312-linux_x86_64.whl.metadata (6.6 kB)
Collecting sympy==1.13.1 (from torch)
  Downl

In [1]:
# 1) show GPU + driver info
!nvidia-smi


Mon Nov 24 07:51:25 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   36C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [2]:
# 2) show python version (helps pick the right wheel)
!python --version


Python 3.12.12


In [3]:
import torch
print("torch.__version__:", torch.__version__)
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("cuda device count:", torch.cuda.device_count())
    print("current device:", torch.cuda.current_device())


torch.__version__: 2.6.0+cu124
cuda available: True
cuda device count: 1
current device: 0


In [4]:
from unsloth import FastLanguageModel
import torch
from google.colab import userdata

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "lakshyavarshney/Meta-Llama-3.1-8B-Instruct-Paul-Graham-QLORA",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
    token=userdata.get('HF_ACCESS_TOKEN')
)


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.11.3: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/168M [00:00<?, ?B/s]

Unsloth 2025.11.3 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


In [5]:
!echo "Checking free disk..."
!freespace_gb=$(df --output=avail -BG / | tail -1 | tr -dc '0-9'); echo "Free (GB): $freespace_gb"
!if [ "$freespace_gb" -ge 60 ]; then SWAP_SIZE="32G"; elif [ "$freespace_gb" -ge 40 ]; then SWAP_SIZE="24G"; elif [ "$freespace_gb" -ge 25 ]; then SWAP_SIZE="16G"; else SWAP_SIZE="8G"; fi; echo "Creating swapfile of size: $SWAP_SIZE"; \
sudo fallocate -l $SWAP_SIZE /swapfile || sudo dd if=/dev/zero of=/swapfile bs=1G count=${SWAP_SIZE%G}
!sudo chmod 600 /swapfile
!sudo mkswap /swapfile
!sudo swapon /swapfile
!echo "Swap enabled:"
!swapon --show
!free -h
!df -h /


Checking free disk...
Free (GB): 55
/bin/bash: line 1: [: : integer expression expected
/bin/bash: line 1: [: : integer expression expected
/bin/bash: line 1: [: : integer expression expected
Creating swapfile of size: 8G
Setting up swapspace version 1, size = 8 GiB (8589930496 bytes)
no label, UUID=bfad8117-ae20-4962-8e72-2237e4eea4fd
swapon: /swapfile: swapon failed: Invalid argument
Swap enabled:
               total        used        free      shared  buff/cache   available
Mem:            12Gi       4.3Gi       535Mi        69Mi       7.8Gi       8.0Gi
Swap:             0B          0B          0B
Filesystem      Size  Used Avail Use% Mounted on
overlay         113G   58G   55G  52% /
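
The "integer expression expected" lines come from how the cell is written: each `!` line runs in its own shell, so `freespace_gb` set on one line is empty on the next, and the size falls back to 8G even though 55 GB is free. One way around this is to do the size selection in Python and pass the result to the shell. The sketch below covers only the selection step; `pick_swap_size` is a hypothetical helper that mirrors the thresholds in the cell above, and it does not address the separate `swapon` failure.

```python
import shutil


def pick_swap_size(free_gb):
    """Mirror the thresholds used in the shell cell above."""
    if free_gb >= 60:
        return "32G"
    if free_gb >= 40:
        return "24G"
    if free_gb >= 25:
        return "16G"
    return "8G"


# Free space on the root filesystem in whole GiB (rounded down, so it can differ slightly from df)
free_gb = shutil.disk_usage("/").free // 2**30
swap_size = pick_swap_size(free_gb)
print(f"Free (GB): {free_gb} -> swapfile size: {swap_size}")
```

In a notebook, the chosen value can then be interpolated into a shell line, for example `!sudo fallocate -l {swap_size} /swapfile`.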


### Merge model with LoRA weights and save to GGUF

You can then run inference locally with Ollama or llama.cpp.

##### Popular quantization methods

- **q4_k_m**  
  4-bit quantization. Low memory. Many default Ollama model tags use this quantization.
- **q8_0**  
  8-bit quantization. Medium memory.
- **f16**  
  16-bit floating point. Many models are already stored in 16-bit, so no quantization happens.
- **not_quantized**  
  Often the same as f16.
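
To get a feel for what these methods mean on disk, file size can be estimated as parameters times bits per weight. This is a rough sketch, not a measurement: the base parameter count is derived from the training banner in this notebook (total minus LoRA parameters), the 4.85 bits per weight for `q4_k_m` is an assumed average because that format mixes block types, and real GGUF files also carry metadata.

```python
# Base model parameters: total reported in the training banner minus the LoRA parameters
PARAMS = 8_072_204_288 - 41_943_040

BITS_PER_WEIGHT = {
    "q4_k_m": 4.85,  # assumed average; mixes 4-bit and 6-bit blocks
    "q8_0": 8.5,     # 32 int8 weights plus one 16-bit scale per block
    "f16": 16.0,     # plain 16-bit floats
}


def estimated_size_gb(params, bits_per_weight):
    # bits -> bytes -> gigabytes (10^9 bytes)
    return params * bits_per_weight / 8 / 1e9


sizes = {name: estimated_size_gb(PARAMS, bpw) for name, bpw in BITS_PER_WEIGHT.items()}
for name, size in sizes.items():
    print(f"{name:>7}: ~{size:.1f} GB")
```

The estimate puts `q4_k_m` at roughly a third of the f16 size, which is why it is a common choice for running an 8B model locally.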

In [None]:
model.push_to_hub_gguf(
    "lakshyavarshney/Meta-Llama-3.1-8B-q4_k_m-paul-graham-guide-GGUF",
    tokenizer,
    quantization_method = "q4_k_m",
    token = userdata.get('HF_ACCESS_TOKEN')
  )

Unsloth: Converting model to GGUF format...
Unsloth: Merging model weights to 16-bit format...
Found HuggingFace hub cache directory: /root/.cache/huggingface/hub


model.safetensors.index.json: 0.00B [00:00, ?B/s]

Checking cache directory for required files...
Cache check failed: model-00001-of-00004.safetensors not found in local cache.
Not all required files found in cache. Will proceed with downloading.
Checking cache directory for required files...
Cache check failed: tokenizer.model not found in local cache.
Not all required files found in cache. Will proceed with downloading.


Unsloth: Preparing safetensor model files:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

Unsloth: Preparing safetensor model files:  25%|██▌       | 1/4 [01:07<03:22, 67.56s/it]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

Unsloth: Preparing safetensor model files:  50%|█████     | 2/4 [02:04<02:03, 61.59s/it]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

Unsloth: Preparing safetensor model files:  75%|███████▌  | 3/4 [03:24<01:09, 69.81s/it]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

Unsloth: Preparing safetensor model files: 100%|██████████| 4/4 [03:37<00:00, 54.48s/it]


Note: tokenizer.model not found (this is OK for non-SentencePiece models)


Unsloth: Merging weights into 16bit: 100%|██████████| 4/4 [04:32<00:00, 68.19s/it]


Unsloth: Merge process complete. Saved to `/tmp/unsloth_gguf_dgt3g_iu`
Unsloth: Converting to GGUF format...
==((====))==  Unsloth: Conversion from HF to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF f16 might take 3 minutes.
\        /    [2] Converting GGUF f16 to ['q4_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: llama.cpp folder exists but binaries not found - will rebuild
Unsloth: Updating system package directories
Unsloth: All required system packages already installed!
Unsloth: Install llama.cpp and building - please wait 1 to 3 minutes
Unsloth: Install GGUF and other packages


**Working method to save the GGUF locally and push it to Hugging Face**

In [None]:
# ---- Reduced-memory GGUF save and hub upload (Unsloth) ----
from unsloth import FastLanguageModel
import torch, gc
from huggingface_hub import upload_folder
from google.colab import userdata
import os

HF_TOKEN = userdata.get('HF_ACCESS_TOKEN')
model_name = "lakshyavarshney/Meta-Llama-3.1-8B-Instruct-Paul-Graham-QLORA"
out_dir = "/content/gguf_output"   # local dir (Colab disk)

# load (4-bit keeps GPU VRAM lower)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
    token=HF_TOKEN
)

# free some memory before save
torch.cuda.empty_cache()
gc.collect()

# Save locally to disk as GGUF with lower memory footprint
# reduction factor: try 0.5 (50% of peak). If still OOM, reduce to 0.4 or 0.3.
model.save_pretrained_gguf(
    out_dir,
    tokenizer,
    quantization_method="q4_k_m",
    maximum_memory_usage=0.5   # <-- key parameter to reduce peak usage
)

# Optional: inspect files
print("Saved to", out_dir, "Files:", os.listdir(out_dir)[:20])

# Upload folder to Hugging Face Hub (if you want it on HF)
# requires huggingface_hub>=0.16.0
upload_folder(
    folder_path=out_dir,
    repo_id="lakshyavarshney/Meta-Llama-3.1-8B-q4_k_m-paul-graham-guide-GGUF",
    repo_type="model",
    token=HF_TOKEN
)


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


    PyTorch 2.6.0+cu124 with CUDA 1204 (you have 2.9.0+cu126)
    Python  3.12.9 (you have 3.12.12)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details


Switching to PyTorch attention since your Xformers is broken.

Unsloth: Xformers was not installed correctly.
Please install xformers separately first.
Then confirm if it's correctly installed by running:
python -m xformers.info

Longer error message:
xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.6.0+cu124 with CUDA 1204 (you have 2.9.0+cu126)
    Python  3.12.9 (you have 3.12.12)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.11.3: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.c

model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/168M [00:00<?, ?B/s]

Unsloth 2025.11.3 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


Unsloth: Merging model weights to 16-bit format...


config.json:   0%|          | 0.00/956 [00:00<?, ?B/s]

Found HuggingFace hub cache directory: /root/.cache/huggingface/hub


model.safetensors.index.json: 0.00B [00:00, ?B/s]

Checking cache directory for required files...
Cache check failed: model-00001-of-00004.safetensors not found in local cache.
Not all required files found in cache. Will proceed with downloading.
Checking cache directory for required files...
Cache check failed: tokenizer.model not found in local cache.
Not all required files found in cache. Will proceed with downloading.


Unsloth: Preparing safetensor model files:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

Unsloth: Preparing safetensor model files:  25%|██▌       | 1/4 [01:27<04:22, 87.39s/it]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

Unsloth: Preparing safetensor model files:  50%|█████     | 2/4 [02:35<02:31, 75.82s/it]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

Unsloth: Preparing safetensor model files:  75%|███████▌  | 3/4 [03:53<01:16, 76.78s/it]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

Unsloth: Preparing safetensor model files: 100%|██████████| 4/4 [04:17<00:00, 64.25s/it]


Note: tokenizer.model not found (this is OK for non-SentencePiece models)


Unsloth: Merging weights into 16bit: 100%|██████████| 4/4 [04:57<00:00, 74.39s/it]


Unsloth: Merge process complete. Saved to `/content/gguf_output`
Unsloth: Converting to GGUF format...
==((====))==  Unsloth: Conversion from HF to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF f16 might take 3 minutes.
\        /    [2] Converting GGUF f16 to ['q4_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: Updating system package directories
Unsloth: All required system packages already installed!
Unsloth: Install llama.cpp and building - please wait 1 to 3 minutes
Unsloth: Cloning llama.cpp repository
Unsloth: Install GGUF and other packages


In [None]:
import os, sys, gc, time, shutil
from google.colab import userdata
HF_TOKEN = userdata.get('HF_ACCESS_TOKEN')
if not HF_TOKEN:
    raise RuntimeError("HF_ACCESS_TOKEN not found in Colab userdata. Add it in the Secrets panel (key icon in the left sidebar) and enable notebook access.")

from unsloth import FastLanguageModel
from huggingface_hub import HfApi, upload_folder
import torch

# ----- USER CONFIG -----
MODEL_ID = "lakshyavarshney/Meta-Llama-3.1-8B-Instruct-Paul-Graham-QLORA"
OUT_DIR = "/content/gguf_output"
HF_OUT_REPO = "lakshyavarshney/Meta-Llama-3.1-8B-q4_k_m-paul-graham-guide-GGUF"
MAX_SEQ = 2048
# ------------------------

# prepare out dir
if os.path.exists(OUT_DIR):
    shutil.rmtree(OUT_DIR)
os.makedirs(OUT_DIR, exist_ok=True)

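def show_state():
    import subprocess
    # os.system output does not show up in the notebook cell, so capture and print it
    for title, cmd in [("GPU", "nvidia-smi"), ("RAM & DISK", "free -h; df -h / | sed -n '1,2p'")]:
        print(f"\n=== {title} ===")
        print(subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout)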

show_state()

print("\nLoading model (4-bit). This may take a few minutes...")
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_ID,
    max_seq_length=MAX_SEQ,
    dtype=None,
    load_in_4bit=True,
    token=HF_TOKEN
)

print("Model loaded. Clearing caches...")
torch.cuda.empty_cache(); gc.collect(); time.sleep(2)
show_state()

# Try progressive memory settings (smaller -> less memory but slower)
mem_attempts = [0.3, 0.25, 0.2, 0.15]
saved = False
last_err = None

for mem in mem_attempts:
    try:
        print(f"\nAttempting save_pretrained_gguf with maximum_memory_usage={mem} ...")
        start = time.time()
        model.save_pretrained_gguf(
            OUT_DIR,
            tokenizer,
            quantization_method="q4_k_m",
            maximum_memory_usage=mem
        )
        elapsed = time.time() - start
        print(f"Save succeeded with maximum_memory_usage={mem} (elapsed {elapsed:.1f}s).")
        saved = True
        break
    except Exception as e:
        last_err = e
        print(f"Save failed for maximum_memory_usage={mem} -> {repr(e)}")
        torch.cuda.empty_cache(); gc.collect(); time.sleep(3)

if not saved:
    print("\nAll save attempts failed. Last error:")
    print(repr(last_err))
    show_state()
    raise RuntimeError("GGUF conversion failed on this runtime. Consider increasing swap size, using Colab Pro/Pro+, or a larger VM.")

print("\nSaved files (sample):", os.listdir(OUT_DIR)[:40])
show_state()

print("\nUploading folder to Hugging Face repository:", HF_OUT_REPO)
api = HfApi(token=HF_TOKEN)
try:
    api.create_repo(repo_id=HF_OUT_REPO, repo_type="model", exist_ok=True)
    print("Repo exists/created.")
except Exception as e:
    print("Repo create warning:", e)

upload_folder(
    folder_path=OUT_DIR,
    repo_id=HF_OUT_REPO,
    repo_type="model",
    token=HF_TOKEN,
    path_in_repo=""
)
print("Upload finished.")

# try to remove swap (best effort)
print("\nAttempting to disable and remove /swapfile (may require sudo)...")
os.system("sudo swapoff /swapfile || true")
os.system("sudo rm -f /swapfile || true")
print("Cleanup attempted. Done.")



=== GPU ===

=== RAM & DISK ===

Loading model (4-bit). This may take a few minutes...
==((====))==  Unsloth 2025.11.3: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Model loaded. Clearing caches...

=== GPU ===

=== RAM & DISK ===

Attempting save_pretrained_gguf with maximum_memory_usage=0.3 ...
Unsloth: Merging model weights to 16-bit format...


config.json:   0%|          | 0.00/956 [00:00<?, ?B/s]

Found HuggingFace hub cache directory: /root/.cache/huggingface/hub


model.safetensors.index.json: 0.00B [00:00, ?B/s]

Checking cache directory for required files...
Cache check failed: model-00001-of-00004.safetensors not found in local cache.
Not all required files found in cache. Will proceed with downloading.
Checking cache directory for required files...
Cache check failed: tokenizer.model not found in local cache.
Not all required files found in cache. Will proceed with downloading.


Unsloth: Preparing safetensor model files:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

Unsloth: Preparing safetensor model files:  25%|██▌       | 1/4 [01:45<05:16, 105.50s/it]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

Unsloth: Preparing safetensor model files:  50%|█████     | 2/4 [03:09<03:05, 92.59s/it]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

Unsloth: Preparing safetensor model files:  75%|███████▌  | 3/4 [04:33<01:29, 89.10s/it]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

Unsloth: Preparing safetensor model files: 100%|██████████| 4/4 [04:43<00:00, 70.86s/it]


Note: tokenizer.model not found (this is OK for non-SentencePiece models)


Unsloth: Merging weights into 16bit: 100%|██████████| 4/4 [03:48<00:00, 57.06s/it]


Unsloth: Merge process complete. Saved to `/content/gguf_output`
Unsloth: Converting to GGUF format...
==((====))==  Unsloth: Conversion from HF to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF f16 might take 3 minutes.
\        /    [2] Converting GGUF f16 to ['q4_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: Updating system package directories
Unsloth: All required system packages already installed!
Unsloth: Install llama.cpp and building - please wait 1 to 3 minutes
Unsloth: Cloning llama.cpp repository
Unsloth: Install GGUF and other packages


### Load model and saved lora adapters

Use this if you want to continue finetuning or run inference with the model in safetensors format.

In [None]:
# from unsloth import FastLanguageModel
# import torch
# from google.colab import userdata

# model, tokenizer = FastLanguageModel.from_pretrained(
#     model_name = "lakshyavarshney/Meta-Llama-3.1-8B-Instruct-Paul-Graham-QLORA",
#     max_seq_length = 2048,
#     dtype = None,
#     load_in_4bit = True,
#     token=userdata.get('HF_ACCESS_TOKEN')
# )

# FastLanguageModel.for_inference(model)

# messages = [
#     {"role": "user", "content": "What is your job?"},
# ]
# inputs = tokenizer.apply_chat_template(
#     messages,
#     tokenize = True,
#     add_generation_prompt = True,
#     return_tensors = "pt",
# ).to("cuda")

# from transformers import TextStreamer
# text_streamer = TextStreamer(tokenizer)
# _ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128, use_cache = True)