### Overview
* This notebook fine-tunes an 8-bit quantized `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B` on a small subset of the MATH dataset, using a single Colab T4 GPU.
* Because GPU RAM is limited (15.0GB), we apply LoRA together with ZeRO stage 3, which swaps model weights, optimizer states and activations to the CPU.
* Libraries used: `transformers`, `deepspeed`, `bitsandbytes`, `peft`, `datasets`, `evaluate`.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f'Running on {device}')

Running on cuda


* Installing packages on Colab

In [None]:
pip install transformers[deepspeed]



 * Restart the Colab runtime after the following installs so that `deepspeed` and `bitsandbytes` work.

In [None]:
pip install -U bitsandbytes



In [None]:
pip install datasets

Collecting datasets
  Downloading datasets-3.6.0-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2025.3.0,>=2023.1.0 (from fsspec[http]<=2025.3.0,>=2023.1.0->datasets)
  Downloading fsspec-2025.3.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.6.0-py3-none-any.whl (491 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2025.3.0-py3-none-any.whl 

In [None]:
pip install mpi4py

Collecting mpi4py
  Downloading mpi4py-4.0.3.tar.gz (466 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: mpi4py
  Building wheel for mpi4py (pyproject.toml) ... done
  Created wheel for mpi4py: filename=mpi4py-4.0.3-cp311-cp311-linux_x86_64.whl size=4438165 sha256=d5a865eded78fac9e7bc48dd9f83b7769fd892031b6d3c962ab8c5c2555c29f7
  Stored in directory: /root/.cache/pip/wheels/5c/56/17

In [None]:
pip install evaluate

Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.3


In [None]:
pip install git+https://github.com/hendrycks/math.git

Collecting git+https://github.com/hendrycks/math.git
  Cloning https://github.com/hendrycks/math.git to /tmp/pip-req-build-motkvtmf
  Running command git clone --filter=blob:none --quiet https://github.com/hendrycks/math.git /tmp/pip-req-build-motkvtmf
  Resolved https://github.com/hendrycks/math.git to commit 357963a7f5501a6c1708cf3f3fb0cdf525642761
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: math_equivalence
  Building wheel for math_equivalence (setup.py) ... done
  Created wheel for math_equivalence: filename=math_equivalence-0.0.0-py3-none-any.whl size=3501 sha256=be5f4eb7659bc5f49176b6f3c99883c6d18cc3ec7282f228a8a9575b02af7dbf
  Stored in directory: /tmp/pip-ephem-wheel-cache-mym0mn5h/wheels/b7/16/f0/4a69d4d9b720086e22842cbd2d896b66298e6424b8f289f37c
Successfully built math_equivalence
Installing collected packages: math_equivalence
Successfully installed math_equivalence-0.0.0


### Load models and apply quantization
*   `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B`: 1.78B parameters, stored in BF16.
*   A Google Colab T4 GPU has only 14.74GB of memory actually available.
    *   We apply 8-bit quantization and LoRA to free up more space.

In [None]:
model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

In [None]:
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_has_fp16_weight=False # Not keeping the FP16 copy
)

tokenizer = AutoTokenizer.from_pretrained(model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/3.07k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/7.03M [00:00<?, ?B/s]

*   The following cell loads the model and quantizes it.

*   Expected size of the quantized model:
    *   Each parameter: 8 bits = 1 byte
    *   All parameters: 1.78B * 1 byte = 1.78GB

*   While the model loads, about 1.8GB of GPU memory is occupied, which corresponds to the quantized model weights.
*   After loading finishes, about 2.3GB is allocated on the GPU.
    *   The additional space may be occupied by buffers or reserved for future use.
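One likely contributor to the gap between the 1.78GB estimate and the roughly 2.3GB measured below is that not every layer is stored in 8 bits: the `Linear8bitLt` layers inside the decoder blocks are quantized, while the token embedding and, by default, the output head stay in 16-bit. The following back-of-the-envelope sketch uses the layer shapes from the model printout later in this notebook. An untied `lm_head` with the same shape as the embedding is an assumption, and biases, norms and quantization statistics are ignored.

```python
# Rough GPU footprint of the 8-bit model (a sketch, not a measurement).
# Shapes come from the model printout in this notebook; the untied lm_head
# with the same shape as the embedding is an ASSUMPTION.
HIDDEN, KV_DIM, FFN, VOCAB, LAYERS = 1536, 256, 8960, 151936, 28

def linear_params_per_layer() -> int:
    """Weights of the seven quantized linear layers in one decoder block."""
    attn = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM  # q_proj, o_proj + k_proj, v_proj
    mlp = 3 * HIDDEN * FFN                            # gate_proj, up_proj, down_proj
    return attn + mlp

int8_bytes = LAYERS * linear_params_per_layer() * 1   # 1 byte per quantized weight
fp16_bytes = 2 * VOCAB * HIDDEN * 2                   # embedding + lm_head, 2 bytes each
estimate_gb = (int8_bytes + fp16_bytes) / 1e9

print(f"8-bit linear layers: {int8_bytes / 1e9:.2f} GB")
print(f"16-bit embedding and head: {fp16_bytes / 1e9:.2f} GB")
print(f"Estimated total: {estimate_gb:.2f} GB")
```

Under these assumptions the estimate comes out at about 2.24GB, close to the allocated memory reported by `check_GPU_space()`.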

In [None]:
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)

config.json:   0%|          | 0.00/679 [00:00<?, ?B/s]

[2025-05-11 23:40:46,214] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)


model.safetensors:   0%|          | 0.00/3.55G [00:00<?, ?B/s]

Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.


generation_config.json:   0%|          | 0.00/181 [00:00<?, ?B/s]

In [None]:
def check_GPU_space():
  print(f"Allocated space in GPU by PyTorch: {torch.cuda.memory_allocated() / 1e9} GB")
  print(f"Reserved space in GPU by PyTorch: {torch.cuda.memory_reserved() / 1e9} GB")

check_GPU_space()

Allocated space in GPU by PyTorch: 2.27670784 GB
Reserved space in GPU by PyTorch: 2.363490304 GB


### Load training data
  * `hendrycks/competition_math` is currently unavailable on HuggingFace.
    * [Link to the dataset on HuggingFace](https://huggingface.co/datasets/hendrycks/competition_math)
  * Let's use the copy `nlile/hendrycks-MATH-benchmark`.

In [None]:
from datasets import load_dataset

def collect_sft_data(
    dataset_name="nlile/hendrycks-MATH-benchmark",
    config=None,
    split="train",
    num_samples=None,
    shuffle_seed=42,
):
    """Load a split, shuffle it, and keep `num_samples` rows (a count or a fraction)."""
    ds = load_dataset(dataset_name, config, split=split)
    ds = ds.shuffle(seed=shuffle_seed)

    if num_samples is None:  # use all
        num_samples = len(ds)
    elif isinstance(num_samples, int):  # count mode
        assert num_samples >= 0
        num_samples = min(len(ds), num_samples)
        ds = ds.select(range(num_samples))
    else:  # fraction mode
        assert isinstance(num_samples, float)
        assert 0 <= num_samples <= 1.0
        num_samples = int(num_samples * len(ds))
        ds = ds.select(range(num_samples))

    print(f"{num_samples} samples of data are loaded.")

    return ds

In [None]:
train_dataset = collect_sft_data(num_samples=100)

README.md:   0%|          | 0.00/2.57k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/5.12M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/210k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/12000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/500 [00:00<?, ? examples/s]

100 samples of data are loaded.


In [None]:
train_dataset

Dataset({
    features: ['problem', 'solution', 'answer', 'subject', 'level', 'unique_id'],
    num_rows: 100
})

* The question-answer pairs are stored as strings, so we need to tokenize them.

In [None]:
print(train_dataset['problem'][0])

An elephant and a lion are currently 1 mile apart. The elephant runs directly away from the lion at 19 miles per hour, while the lion runs directly towards the elephant at 24 miles per hour.  How many minutes will it take for the lion to catch the elephant?


In [None]:
def preprocess_fn_batch(example, max_length=2048):
    # Build the prompt for each problem; the reference solution is the training target.
    question = [f"user: \n{problem}\nPlease reason step by step, and put your final answer within \\boxed.\nassistant:\n" for problem in example['problem']]
    solution = example["solution"]
    qa_pair = [q + s for q, s in zip(question, solution)]

    qa_pair_tokenized = tokenizer(qa_pair,
                     truncation=True,
                     max_length=max_length,
                     return_attention_mask=True)

    input_ids = qa_pair_tokenized["input_ids"]
    question_ids = tokenizer(question,
                  padding=False)["input_ids"]

    labels = []
    for i, ids in enumerate(input_ids):  # `ids` avoids shadowing the built-in `input`
      label = ids.copy()
      # Cap at len(label): if truncation cut the sequence inside the question,
      # an uncapped slice assignment would make the labels longer than the inputs.
      question_len = min(len(question_ids[i]), len(label))
      # ignore the question tokens in loss calculation
      label[:question_len] = [-100] * question_len
      labels.append(label)

    qa_pair_tokenized["labels"] = labels
    qa_pair_tokenized["answer"] = example["answer"]
    return qa_pair_tokenized
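
The prompt masking used in the preprocessing function can be illustrated without a tokenizer. The sketch below uses made-up token IDs and a hypothetical helper name, `mask_prompt`. It relies on `-100` being the default `ignore_index` of PyTorch's cross-entropy loss, so positions labeled `-100` do not contribute to the loss.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def mask_prompt(input_ids, prompt_len):
    """Copy input_ids into labels and ignore the first prompt_len positions."""
    n = min(prompt_len, len(input_ids))  # guard against truncated sequences
    return [IGNORE_INDEX] * n + list(input_ids[n:])

# Hypothetical token IDs: 3 prompt tokens followed by 2 solution tokens.
labels = mask_prompt([11, 12, 13, 21, 22], prompt_len=3)
print(labels)  # [-100, -100, -100, 21, 22]
```

Only the two solution tokens remain as targets, which is the behavior the sanity check below verifies on real samples.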

In [None]:
train_dataset = train_dataset.map(
    preprocess_fn_batch,
    batched=True,
    remove_columns=train_dataset.column_names,
)

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

In [None]:
train_dataset

Dataset({
    features: ['answer', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 100
})

* Sanity check on the data:

In [None]:
check_sample = train_dataset.select(range(2))
for i, example in enumerate(check_sample):
    print("\n===================")
    print(f"Sample {i}\n")
    answer = example["answer"]
    print(f"[answer] \n{answer}\n")
    qa_pair_text = tokenizer.decode(example["input_ids"], skip_special_tokens=True)
    print(f"[qa_pair_text] \n{qa_pair_text}\n")
    labels_ids = [t for t in example["labels"] if t != -100]
    labels_text = tokenizer.decode(labels_ids, skip_special_tokens=True)
    print(f"[labels_text] \n{labels_text}")



Sample 0

[answer] 
12

[qa_pair_text] 
user: 
An elephant and a lion are currently 1 mile apart. The elephant runs directly away from the lion at 19 miles per hour, while the lion runs directly towards the elephant at 24 miles per hour.  How many minutes will it take for the lion to catch the elephant?
Please reason step by step, and put your final answer within \boxed.
assistant:
Every hour, the lion runs 24 miles while the elephant runs 19.  Thus, the distance between the two animals closes at a rate of 5 miles every hour.  The lion catches the elephant after this distance has closed 1 mile, which takes $\frac{1}{5}$ hours to do, or $\frac{1}{5}\cdot 60 = \boxed{12}$ minutes.

[labels_text] 
Every hour, the lion runs 24 miles while the elephant runs 19.  Thus, the distance between the two animals closes at a rate of 5 miles every hour.  The lion catches the elephant after this distance has closed 1 mile, which takes $\frac{1}{5}$ hours to do, or $\frac{1}{5}\cdot 60 = \boxed{12}$ m

### Load validation dataset
* Evaluating with 500 samples can take a lot of time.
* In addition, CPU RAM is also limited (only 12GB is available on Colab).
* For this toy example, let's try 3 samples.

In [None]:
val_ds = collect_sft_data(dataset_name="HuggingFaceH4/MATH-500", split="test", num_samples=3)

README.md:   0%|          | 0.00/412 [00:00<?, ?B/s]

test.jsonl:   0%|          | 0.00/447k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/500 [00:00<?, ? examples/s]

3 samples of data are loaded.


In [None]:
val_ds

Dataset({
    features: ['problem', 'solution', 'answer', 'subject', 'level', 'unique_id'],
    num_rows: 3
})

In [None]:
def preprocess_fn_val_batch(example, max_length=2048):
    question = [f"user: \n{problem}\nPlease reason step by step, and put your final answer within \\boxed.\nassistant:\n" for problem in example['problem']]
    solution = example["solution"]

    question_tokenized = tokenizer(question,
                     truncation=True,
                     max_length=max_length,
                     return_attention_mask=True)

    question_tokenized["labels"] = question_tokenized['input_ids']
    question_tokenized["answer"] = example["answer"]
    return question_tokenized

In [None]:
val_ds = val_ds.map(
    preprocess_fn_val_batch,
    batched=True,
    remove_columns=val_ds.column_names,
)

Map:   0%|          | 0/3 [00:00<?, ? examples/s]

* MATH needs a special metric that treats mathematically equivalent expressions as equal.
  * It is available through the Hugging Face `evaluate` library: `evaluate-metric/competition_math`

In [None]:
import numpy as np
from evaluate import load
math_metric = load("competition_math")

def math_acc(eval_pred):
    logits, _ = eval_pred
    # Greedy next-token prediction at every position of the prompt
    # (teacher forcing), not autoregressive generation.
    pred_ids = np.argmax(logits, axis=-1)
    decoded_preds = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    # The whole decoded string is compared with the short reference answer.
    references = val_ds["answer"]
    results = math_metric.compute(predictions=decoded_preds,references=references)
    return {"math_acc": results["accuracy"]}

Downloading builder script:   0%|          | 0.00/3.24k [00:00<?, ?B/s]
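
The metric compares each prediction string with the short reference answer, so a full solution text usually has to be reduced to its final answer first. A minimal sketch of that step is shown below, assuming the final answer appears in the last `\boxed{...}` of the text. The helper name `extract_last_boxed` is hypothetical and is not part of `evaluate` or of this notebook's pipeline.

```python
def extract_last_boxed(text: str):
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1          # we are inside the opening brace of \boxed{
    content = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:  # matching close of \boxed{ found
                return "".join(content)
        content.append(ch)
    return None  # unbalanced braces, e.g. a truncated generation

solution = r"which takes $\frac{1}{5}$ hours to do, or $\frac{1}{5}\cdot 60 = \boxed{12}$ minutes."
print(extract_last_boxed(solution))  # 12
```

Brace counting is used instead of a regular expression so that nested answers such as `\boxed{\frac{1}{2}}` are returned whole.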

### Set up the ZeRO configuration and understand the DeepSpeed config
* Since we are using only 1 GPU, we are not using data or tensor parallelism.
* However, we can still use ZeRO for CPU swapping.

In [None]:
import json

*   `world_size`: number of devices.
*   `train_batch_size`: set to `auto` because the batch size is set later in `TrainingArguments`.
*   `zero_optimization`:
    *   `stage`:
        *   The ZeRO stage (1, 2, or 3) to use.
        *   We use ZeRO stage 3 because we swap model weights to the CPU.
    *   `overlap_comm`:
        *   Overlaps backprop computation with communication (set to `False` here).
        *   When enabled, gradients are reduce-scattered as soon as they are ready, instead of waiting for all gradients to be computed.
    *   `contiguous_gradients`:
        *   Reduces fragmentation by storing the scattered gradients in a contiguous buffer.
    *   `reduce_bucket_size`:
        *   Buffer size for reduce-scatter operations.
        *   On a single device, it is used to collect gradients and update weights.
    *   `allgather_bucket_size`:
        *   Buffer size for all-gather operations.
        *   On a single device, it is used for CPU swapping.

In [None]:
ds_config = {
    "world_size": 1,
    "train_batch_size": 'auto',
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param":   {"device": "cpu", "pin_memory": True},
        "overlap_comm": False,
        "contiguous_gradients": True,
        "reduce_bucket_size": 2e8,
        "allgather_bucket_size": 2e8
    },
    "activation_checkpointing": {
        "partition_activations": False,
        "cpu_checkpointing": True,
        "contiguous_memory_optimization": True
    },
}

*  `offload_optimizer`:
  *  Keeps the Adam optimizer states on the CPU.
*  `offload_param`:
  *  Keeps the model parameters on the CPU.
  *  Moves them to the GPU only for the forward/backward passes.
*  `activation_checkpointing`:
  *  Saves only some activations as checkpoints.
  *  Recomputes the others during backprop.
  *  `cpu_checkpointing`: the saved checkpoints are stored on the CPU instead of the GPU.
  *  `partition_activations`:
    *   Splits the activation checkpoints across ranks (set to `False` here, since there is only one device).

In [None]:
with open("./ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)

### Set up LoRA configurations
* Inspect the layers to which we are about to apply LoRA.

In [None]:
model

Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151936, 1536)
    (layers): ModuleList(
      (0-27): 28 x Qwen2DecoderLayer(
        (self_attn): Qwen2Attention(
          (q_proj): Linear8bitLt(in_features=1536, out_features=1536, bias=True)
          (k_proj): Linear8bitLt(in_features=1536, out_features=256, bias=True)
          (v_proj): Linear8bitLt(in_features=1536, out_features=256, bias=True)
          (o_proj): Linear8bitLt(in_features=1536, out_features=1536, bias=False)
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear8bitLt(in_features=1536, out_features=8960, bias=False)
          (up_proj): Linear8bitLt(in_features=1536, out_features=8960, bias=False)
          (down_proj): Linear8bitLt(in_features=8960, out_features=1536, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm((1536,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((1536,), eps=1e-06)
      )
    )
    (norm): Qwen2RMSNor

In [None]:
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model

model = prepare_model_for_kbit_training(model)
model.gradient_checkpointing_enable()
lora_cfg = LoraConfig(
    r=16, # rank
    lora_alpha=32,
    target_modules=["q_proj","v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_cfg)
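
As a rough check on how small the adapter is, the number of trainable LoRA parameters can be computed by hand. A rank-`r` adapter on a linear layer with `d_in` inputs and `d_out` outputs adds two matrices with `r * (d_in + d_out)` weights in total. The sketch below plugs in the `q_proj` and `v_proj` shapes from the model printout above; the constant and function names are illustrative only.

```python
# LoRA adds two matrices per adapted layer: A (r x d_in) and B (d_out x r).
R, LAYERS = 16, 28
TARGET_SHAPES = {"q_proj": (1536, 1536), "v_proj": (1536, 256)}  # (in_features, out_features)

def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Number of weights in one rank-r adapter for a d_in -> d_out linear layer."""
    return r * (d_in + d_out)

per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in TARGET_SHAPES.values())
total = per_layer * LAYERS
print(f"{total:,} trainable LoRA parameters")  # 2,179,072 trainable LoRA parameters
```

That is roughly 0.12% of the 1.78B base parameters. PEFT's `model.print_trainable_parameters()` should report a comparable count for the wrapped model.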

### Set up SFT configurations

In [None]:
import os
os.environ["WANDB_DISABLED"] = "true"
os.environ["MAX_JOBS"] = "1"

* Effective batch size per device = `per_device_train_batch_size` * `gradient_accumulation_steps`
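
The arithmetic behind this rule can be worked through for this notebook's settings. This is a sketch that assumes the 100-sample training subset, the values passed to `TrainingArguments` below, and a single device. The resulting step count is consistent with the `global_step=50` reported after training.

```python
import math

# Values taken from this notebook (100 training samples, one GPU).
num_train_samples = 100
per_device_train_batch_size = 2
gradient_accumulation_steps = 1
num_devices = 1
num_train_epochs = 1

# Samples consumed per optimizer step.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
# Optimizer steps needed to see every sample once.
steps_per_epoch = math.ceil(num_train_samples / effective_batch)
total_steps = steps_per_epoch * num_train_epochs

print(effective_batch, total_steps)  # 2 50
```

Raising `gradient_accumulation_steps` increases the effective batch size without increasing the per-step GPU memory, at the cost of fewer optimizer updates per epoch.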

In [None]:
from transformers import DataCollatorForSeq2Seq, TrainingArguments, Trainer

data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    pad_to_multiple_of=8,
    return_tensors="pt",
    padding=True,
)

In [None]:
training_args = TrainingArguments(
    output_dir="./DeepSeek-R1-Distill-Qwen-1.5B-MATH-SFT",
    num_train_epochs=1,
    learning_rate=2e-4,
    fp16=True,
    gradient_checkpointing=True,
    logging_steps=50,
    save_steps=50,
    save_total_limit=1,
    deepspeed="ds_config.json",
    optim="paged_adamw_8bit",

    per_device_train_batch_size=2,
    gradient_accumulation_steps=1,

    # Validation
    eval_strategy="steps",
    eval_steps=10,
    per_device_eval_batch_size=1,
    eval_accumulation_steps=1,

    load_best_model_at_end=True,
    metric_for_best_model="math_acc",
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_ds,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=math_acc
)

  trainer = Trainer(
No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


### Start training

In [None]:
trainer.train()

Installed CUDA version 12.5 does not match the version torch was compiled with 12.4 but since the APIs are compatible, accepting this combination


Using /root/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py311_cu124/cpu_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py311_cu124/cpu_adam/build.ninja...
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
Building extension module cpu_adam...
Using envvar MAX_JOBS (1) as the number of workers...
Loading extension module cpu_adam...


Time to load cpu_adam op: 82.1094172000885 seconds
Parameter Offload: Total persistent parameters: 2323968 in 253 params


`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


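| Step | Training Loss | Validation Loss | Math Acc |
|------|---------------|-----------------|----------|
| 10 | No log | 2.640888 | 0.0 |
| 20 | No log | 2.558287 | 0.0 |
| 30 | No log | 2.414763 | 0.0 |
| 40 | No log | 2.199883 | 0.0 |
| 50 | 0.914000 | 2.097744 | 0.0 |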


 stage3_gather_16bit_weights_on_model_save=false. Saving the full checkpoint instead, use zero_to_fp32.py to recover weights


TrainOutput(global_step=50, training_loss=0.913958740234375, metrics={'train_runtime': 144.293, 'train_samples_per_second': 0.693, 'train_steps_per_second': 0.347, 'total_flos': 470966075392.0, 'train_loss': 0.913958740234375, 'epoch': 1.0})

*   Peak GPU memory usage (reached during training): 12.8GB
*   Peak CPU memory usage (reached during evaluation): 11.4GB

### Materialize the adapter weights

In [None]:
cd DeepSeek-R1-Distill-Qwen-1.5B-MATH-SFT/checkpoint-50

/content/DeepSeek-R1-Distill-Qwen-1.5B-MATH-SFT/checkpoint-50


In [None]:
!python zero_to_fp32.py . DeepSeek-R1-Distill-Qwen-1.5B-MATH-SFT

[2025-05-11 23:45:32,129] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
2025-05-11 23:45:43.012247: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1747007143.225025    4078 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1747007143.281748    4078 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Processing zero checkpoint './global_step50'
Loading checkpoint shards: 100% 1/1 [00:00<00:00, 122.28it/s]
Detected checkpoint of type zero stage ZeroStageEnum.weights, world_size: 1
Parsing checkpoint created by deepspeed==0.16.7
Gathering sharded weights: 100% 112/112 [00:00<00:00, 380066.38it/s]
Reconstructed Tra