<a href="https://colab.research.google.com/github/Bryan-Az/Unsloth_LLM_Tools/blob/main/finetuning/unsloth_continued_finetuning_part_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Continued Finetuning using Unsloth's Training Checkpoints

This notebook demonstrates how to use Unsloth for continued finetuning. Finetuning (like pretraining) can take many hours when the dataset is large. If you don't have access to a cluster of GPUs that can parallelize training and reduce the computation time, checkpointing lets you save training progress and resume it later, spreading the work over several sessions.
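The checkpoint-and-resume idea can be sketched with a toy loop that uses only the standard library. This is an illustration of the concept, not Unsloth's or the Hugging Face trainer's implementation: the `train` function, the JSON checkpoint file, and the fake "weight" update are all hypothetical stand-ins.

```python
import json
import os
import tempfile


def train(total_steps, ckpt_path, save_every=5):
    """Toy training loop that resumes from ckpt_path if that file exists."""
    state = {"step": 0, "weight": 0.0}
    if os.path.exists(ckpt_path):
        # Resume: restore the step counter and the "model" state.
        with open(ckpt_path) as f:
            state = json.load(f)
    while state["step"] < total_steps:
        state["step"] += 1
        state["weight"] += 0.1 * state["step"]  # stand-in for a gradient update
        # Save periodically and always at the final step.
        if state["step"] % save_every == 0 or state["step"] == total_steps:
            with open(ckpt_path, "w") as f:
                json.dump(state, f)
    return state


ckpt = os.path.join(tempfile.mkdtemp(), "toy_checkpoint.json")
first = train(total_steps=30, ckpt_path=ckpt)    # phase 1: steps 1 to 30
second = train(total_steps=60, ckpt_path=ckpt)   # phase 2: resumes at step 30

# A run that was never interrupted, for comparison.
fresh = os.path.join(tempfile.mkdtemp(), "fresh.json")
uninterrupted = train(total_steps=60, ckpt_path=fresh)

print(first["step"], second["step"])
print(abs(second["weight"] - uninterrupted["weight"]) < 1e-9)
```

In this toy setting the resumed run ends in the same state as the uninterrupted one because everything the loop depends on is stored in the checkpoint. Real trainers also need to store optimizer, scheduler, and random-number state for the same property to hold.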

To load a previously finetuned model for further training, its LoRA adapters must be uploaded to HuggingFace, alongside the quantized model used for inference and deployment.

# Installs and Imports

In [1]:
%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install unsloth
# Get latest Unsloth
!pip install --upgrade --force-reinstall --no-deps "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"


In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


In [3]:
# used for training / fine-tuning
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

## Loading the LoRA Adapters of the Previously Finetuned C\# Coder Model from Part 2

In [5]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Alexis-Az/Qwen-2.5-Coder-7B-4bit-CSharp-LoRA", # Reminder we support ANY Hugging Face model!
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

==((====))==  Unsloth 2024.9.post3: Fast Qwen2 patching. Transformers = 4.45.1.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.55G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/117 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/4.87k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/632 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/616 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/7.03M [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/162M [00:00<?, ?B/s]

Unsloth 2024.9.post3 patched 28 layers with 0 QKV layers, 28 O layers and 28 MLP layers.


## Loading the Alpaca Chat Dataset

Because the previously checkpointed model was trained only on code, it does not handle conversational prompts. To let the model work with both human language and code, we continue finetuning it on the Alpaca conversational dataset in Spanish.

In [7]:
from datasets import load_dataset
alpaca_dataset = load_dataset("FreedomIntelligence/alpaca-gpt4-spanish", split = "train")

README.md:   0%|          | 0.00/124 [00:00<?, ?B/s]

Repo card metadata block was not found. Setting CardData to empty.


alpaca-gpt4-spanish.json:   0%|          | 0.00/52.4M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/49969 [00:00<?, ? examples/s]

In [8]:
_alpaca_prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{}

### Response:
{}"""
# Becomes:
alpaca_prompt = """Debajo se encuentra una instrucción que describe una tarea. Escribe una respuesta que completa la solicitud

### Instruccion:
{}

### Respuesta:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(conversations):
    # Each entry in "conversations" holds the instruction turn followed by the response turn.
    texts = []
    conversations = conversations["conversations"]
    for convo in conversations:
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(convo[0]["value"], convo[1]["value"]) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }

# batched = True passes a batch of rows per call, so the function returns a list of texts.
alpaca_dataset = alpaca_dataset.map(formatting_prompts_func, batched = True,)

Map:   0%|          | 0/49969 [00:00<?, ? examples/s]
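To see what the formatting step produces, here is a minimal standalone sketch that applies the same template to one made-up record. The record contents, the `"from"` keys, and the `<EOS>` placeholder are assumptions for illustration; the notebook itself uses `tokenizer.eos_token` and the real dataset rows.

```python
# Same Spanish template as in the notebook, copied verbatim.
ALPACA_PROMPT_ES = """Debajo se encuentra una instrucción que describe una tarea. Escribe una respuesta que completa la solicitud

### Instruccion:
{}

### Respuesta:
{}"""

EOS = "<EOS>"  # placeholder; the notebook uses tokenizer.eos_token instead


def format_batch(batch):
    """Turn a batch of two-turn conversations into training texts."""
    texts = []
    for convo in batch["conversations"]:
        instruction = convo[0]["value"]
        response = convo[1]["value"]
        texts.append(ALPACA_PROMPT_ES.format(instruction, response) + EOS)
    return {"text": texts}


# Hypothetical record shaped like one batched row of the dataset.
toy_batch = {
    "conversations": [
        [
            {"from": "human", "value": "Suma 2 y 3."},
            {"from": "gpt", "value": "5"},
        ]
    ]
}

formatted = format_batch(toy_batch)
print(formatted["text"][0])
```

The printed text ends with the response followed directly by the end-of-sequence marker, which is what tells the model where a completed answer stops.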

# Finetuning from Last Checkpoint

The training code below is slightly altered from the code used previously to finetune or pretrain models: it enables checkpointing, so training can be paused and the weights saved for a later session. To demonstrate that this works, we train for 30 steps in the first phase and 30 more steps in the second phase.
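The Hugging Face trainer writes each checkpoint into a `checkpoint-<step>` folder inside `output_dir`, and `resume_from_checkpoint=True` continues from the most recent one. The sketch below is a standard-library stand-in that shows how such a folder can be located; `find_latest_checkpoint` is a hypothetical helper written for this illustration (the `transformers` library ships its own `get_last_checkpoint` in `transformers.trainer_utils`).

```python
import os
import re
import tempfile

_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_latest_checkpoint(output_dir):
    """Return the checkpoint-<step> folder with the highest step, or None."""
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = _CHECKPOINT_RE.match(name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path):
            step = int(match.group(1))
            # Compare numerically so that checkpoint-100 beats checkpoint-31.
            if step > best_step:
                best_step, best_path = step, path
    return best_path


# Demo on a throwaway directory with made-up checkpoint folders.
demo_dir = tempfile.mkdtemp()
for step in (30, 31, 100):
    os.makedirs(os.path.join(demo_dir, f"checkpoint-{step}"))

latest = find_latest_checkpoint(demo_dir)
print(os.path.basename(latest))
```

Comparing the step numbers as integers matters: a plain string sort would rank `checkpoint-31` above `checkpoint-100`.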

## Finetuning Phase 1
Each phase runs the same number of steps, so each should take about the same amount of time. For 30 steps on this dataset, that is around 7 minutes in a Google Colab T4 compute environment.
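A rough back-of-the-envelope calculation shows how little of the dataset one 30-step phase covers, and why checkpointing matters for a full pass. The batch settings and dataset size are the ones used in this notebook; the per-epoch figures are estimates that assume a single GPU and a constant time per step.

```python
per_device_batch_size = 2
gradient_accumulation_steps = 8
num_examples = 49_969       # size of the Spanish Alpaca train split
steps_per_phase = 30
minutes_per_phase = 7       # approximate timing on a Colab T4

# Examples consumed per optimizer step on one GPU.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps

# Examples seen in one 30-step phase, and the share of the dataset that represents.
examples_per_phase = steps_per_phase * effective_batch_size
fraction_of_dataset = examples_per_phase / num_examples

# Extrapolate the observed speed to a full pass over the data (estimate only).
seconds_per_step = minutes_per_phase * 60 / steps_per_phase
approx_steps_per_epoch = num_examples // effective_batch_size
approx_hours_per_epoch = approx_steps_per_epoch * seconds_per_step / 3600

print(f"effective batch size: {effective_batch_size}")
print(f"examples per phase:   {examples_per_phase} ({fraction_of_dataset:.2%} of the data)")
print(f"estimated epoch time: {approx_hours_per_epoch:.1f} hours")
```

Under these assumptions a single phase sees about 1% of the data, and a full epoch would take on the order of 12 hours, which is longer than a typical free Colab session.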

In [9]:
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
from unsloth import UnslothTrainer, UnslothTrainingArguments

trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = alpaca_dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 8,

    args = UnslothTrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 8,

        max_steps = 30,
        warmup_steps = 10,
        save_strategy = "steps", ### NEW PARAMETER ADDED FOR CHECKPOINTING
        save_steps = 50, ### NEW PARAMETER ADDED FOR CHECKPOINTING


        learning_rate = 5e-5,
        embedding_learning_rate = 1e-5,

        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.00,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

Map (num_proc=8):   0%|          | 0/49969 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
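The combination of `learning_rate = 5e-5`, `warmup_steps = 10`, `lr_scheduler_type = "linear"`, and `max_steps = 30` describes a learning rate that ramps up linearly and then decays linearly to zero. The function below is a plain-Python sketch of that shape, written for illustration; `linear_schedule_lr` is a hypothetical name, and the trainer's actual scheduler may differ in detail.

```python
def linear_schedule_lr(step, base_lr=5e-5, warmup_steps=10, total_steps=30):
    """Linear warmup to base_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)


schedule = [linear_schedule_lr(step) for step in range(31)]

for step in (0, 5, 10, 20, 30):
    print(step, schedule[step])
```

Under this formula, any step beyond `total_steps` receives a learning rate of 0. If the trainer's scheduler behaves the same way, steps taken after the original `max_steps` would barely update the weights, so it may be worth raising `max_steps` before resuming.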


In [10]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 49,969 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 8
\        /    Total batch size = 16 | Total steps = 30
 "-____-"     Number of trainable parameters = 40,370,176


Step,Training Loss
1,2.0623
2,2.0207
3,1.7297
4,1.8081
5,1.6659
6,1.5847
7,1.606
8,1.5371
9,1.3898
10,1.4279
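The training banner above reports 40,370,176 trainable parameters. The sketch below shows one configuration that reproduces this count: LoRA adapters of rank 16 on all seven linear projections of each of the 28 layers. The rank and target modules are not stated in this notebook, and the layer dimensions are assumed from the published Qwen2.5-7B configuration, so treat this as a plausibility check rather than a statement of the actual settings.

```python
# Assumed model dimensions (Qwen2.5-7B configuration, not stated in this notebook).
hidden_size = 3584
intermediate_size = 18944
num_layers = 28
kv_dim = 4 * 128  # assumed: 4 key/value heads with head dimension 128

rank = 16  # assumed LoRA rank

# (in_features, out_features) of each adapted linear layer in one transformer block.
projection_shapes = {
    "q_proj": (hidden_size, hidden_size),
    "k_proj": (hidden_size, kv_dim),
    "v_proj": (hidden_size, kv_dim),
    "o_proj": (hidden_size, hidden_size),
    "gate_proj": (hidden_size, intermediate_size),
    "up_proj": (hidden_size, intermediate_size),
    "down_proj": (intermediate_size, hidden_size),
}


def lora_parameter_count(in_features, out_features, r):
    """A LoRA adapter adds an (r x in) matrix and an (out x r) matrix."""
    return r * (in_features + out_features)


per_layer = sum(
    lora_parameter_count(n_in, n_out, rank)
    for n_in, n_out in projection_shapes.values()
)
total_trainable = per_layer * num_layers

print(f"{per_layer:,} per layer, {total_trainable:,} in total")
```

Only these small adapter matrices are trained; the 4-bit base weights stay frozen, which is what keeps finetuning feasible on a single T4.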


## Finetuning Phase 2

In [13]:
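# Each call resumes from the most recent checkpoint in `output_dir`.
# In the output, every call logs one additional training step.
for i in range(30):
    trainer_stats = trainer.train(resume_from_checkpoint=True)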

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 49,969 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 8
\        /    Total batch size = 16 | Total steps = 30
 "-____-"     Number of trainable parameters = 40,370,176


Step,Training Loss
32,1.1018
33,1.1868
34,1.1969
35,1.1172
36,1.2318
37,1.0663
38,1.1795
39,1.0501
40,1.2352
41,1.1779
42,1.107
43,1.1527
44,1.2803
45,1.1829
46,0.9931
47,1.0552
48,1.092
49,1.162
50,1.0959
51,1.0756
52,1.08
53,1.1074
54,1.1533
55,1.0584
56,1.1212
57,1.0641
58,1.1091
59,1.0949
60,1.1671
61,1.0759


# Uploading the Quantized Model and the LoRA Adapter to HuggingFace

### Saving the Quantized Model for Inference in Ollama

In [15]:
from google.colab import userdata

In [16]:
# saving the model to huggingface for later inference and evaluation

model.push_to_hub_gguf("Alexis-Az/Qwen-2.5-Coder-7B-4bit-CSharp-Alpaca-Instruct", tokenizer, quantization_method = "q4_k_m", token = userdata.get('HF_TOKEN'))


Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which will take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 5.5G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 5.16 out of 12.67 RAM for saving.


 54%|█████▎    | 15/28 [00:01<00:01, 10.99it/s]We will save to Disk and not RAM now.
100%|██████████| 28/28 [01:43<00:00,  3.71s/it]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Unsloth: Saving Alexis-Az/Qwen-2.5-Coder-7B-4bit-CSharp-Alpaca-Instruct/pytorch_model-00001-of-00004.bin...
Unsloth: Saving Alexis-Az/Qwen-2.5-Coder-7B-4bit-CSharp-Alpaca-Instruct/pytorch_model-00002-of-00004.bin...
Unsloth: Saving Alexis-Az/Qwen-2.5-Coder-7B-4bit-CSharp-Alpaca-Instruct/pytorch_model-00003-of-00004.bin...
Unsloth: Saving Alexis-Az/Qwen-2.5-Coder-7B-4bit-CSharp-Alpaca-Instruct/pytorch_model-00004-of-00004.bin...
Done.


Unsloth: Converting qwen2 model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp will take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits will take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m'] will take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: [0] Installing llama.cpp. This will take 3 minutes...
Unsloth: [1] Converting model at Alexis-Az/Qwen-2.5-Coder-7B-4bit-CSharp-Alpaca-Instruct into f16 GGUF format.
The output location will be ./Alexis-Az/Qwen-2.5-Coder-7B-4bit-CSharp-Alpaca-Instruct/unsloth.F16.gguf
This will take 3 minutes...
INFO:hf-to-gguf:Loading model: Qwen-2.5-Coder-7B-4bit-CSharp-Alpaca-Instruct
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'pytorch_model.bin.index.json'
INFO:hf-to-gguf:gguf: loading model part 'pytorch_model-00001-of-00004.bin'
INFO:hf-to-g

  0%|          | 0/1 [00:00<?, ?it/s]

unsloth.F16.gguf:   0%|          | 0.00/15.2G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/Alexis-Az/Qwen-2.5-Coder-7B-4bit-CSharp-Alpaca-Instruct
Unsloth: Uploading GGUF to Huggingface Hub...


  0%|          | 0/1 [00:00<?, ?it/s]

unsloth.Q4_K_M.gguf:   0%|          | 0.00/4.68G [00:00<?, ?B/s]

No files have been modified since last commit. Skipping to prevent empty commit.


Saved GGUF to https://huggingface.co/Alexis-Az/Qwen-2.5-Coder-7B-4bit-CSharp-Alpaca-Instruct
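The upload logs above give the size of both GGUF files, which allows a quick estimate of what `q4_k_m` quantization saves. The calculation assumes that file size is roughly proportional to bits per weight and ignores metadata and any tensors kept at higher precision, so the result is approximate.

```python
# File sizes reported in the upload progress bars above, in gigabytes.
f16_size_gb = 15.2       # unsloth.F16.gguf
q4_k_m_size_gb = 4.68    # unsloth.Q4_K_M.gguf

# How many times smaller the quantized file is.
compression_ratio = f16_size_gb / q4_k_m_size_gb

# The F16 file stores 16 bits per weight, so scale that by the size ratio.
approx_bits_per_weight = 16 * q4_k_m_size_gb / f16_size_gb

print(f"compression ratio:      {compression_ratio:.2f}x")
print(f"approx bits per weight: {approx_bits_per_weight:.2f}")
```

The estimate comes out slightly above 4 bits per weight, which fits a 4-bit scheme that stores some extra scaling information alongside the quantized values.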


### Saving the LoRA Adapter Model for Continued Finetuning with Unsloth

In [17]:
# Saving the LoRA Adapter Model for Continued Finetuning with Unsloth

model.push_to_hub("Alexis-Az/Qwen-2.5-Coder-7B-4bit-CSharp-Alpaca-LoRA")

config.json:   0%|          | 0.00/1.19k [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/162M [00:00<?, ?B/s]

Saved model to https://huggingface.co/Alexis-Az/Qwen-2.5-Coder-7B-4bit-CSharp-Alpaca-LoRA
