<a href="https://colab.research.google.com/github/Bryan-Az/Unsloth_LLM_Tools/blob/main/continued_pretraining/unsloth_continued_pretraining.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Continued Pretraining with Unsloth

## Imports and Installs

In [1]:
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

In [8]:
from unsloth import FastLanguageModel
import torch

In [7]:
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
from unsloth import UnslothTrainer, UnslothTrainingArguments

## Downloading the Pretrained Model
I will also add the LoRA adapters here before we move on to pretraining.

In [2]:
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


In [3]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-bnb-4bit", # Choose ANY! eg teknium/OpenHermes-2.5-Mistral-7B
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

==((====))==  Unsloth 2024.9.post3: Fast Mistral patching. Transformers = 4.45.1.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/4.13G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/155 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.02k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/438 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

According to the Unsloth authors, it is important to include `embed_tokens` and `lm_head` in `target_modules` so that the model can learn from out-of-distribution data, such as text in a new language.

In [4]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 128, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",

                      "embed_tokens", "lm_head",], # Add for continual pretraining
    lora_alpha = 32,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = True,   # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth: Offloading input_embeddings to disk to save VRAM


  offloaded_W = torch.load(filename, map_location = "cpu", mmap = True)


Unsloth: Offloading output_embeddings to disk to save VRAM


Unsloth 2024.9.post3 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


Unsloth: Casting embed_tokens to float32
Unsloth: Casting lm_head to float32
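As a rough sanity check on this configuration, the sketch below recomputes the number of trainable parameters and the LoRA scaling factor. It assumes the standard Mistral-7B dimensions (hidden size 4096, MLP size 14336, 32 layers, key/value projection width 1024, vocabulary of 32,000), which are not stated in this notebook, and it assumes that `embed_tokens` and `lm_head` are trained in full rather than through low-rank adapters. The names `PROJECTIONS` and `lora_params` are illustrative only.

```python
import math

# Assumed Mistral-7B dimensions (not stated in the notebook).
HIDDEN, MLP, KV_DIM, VOCAB, LAYERS = 4096, 14336, 1024, 32000, 32
R, ALPHA = 128, 32  # r and lora_alpha from the get_peft_model call

# (in_features, out_features) of each adapted projection in one layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_params(d_in, d_out, r):
    """A LoRA adapter adds two matrices: A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)

adapter_total = LAYERS * sum(lora_params(i, o, R) for i, o in PROJECTIONS.values())
# Assumption: both embedding matrices are trained in full.
embedding_total = 2 * VOCAB * HIDDEN
trainable = adapter_total + embedding_total

standard_scale = ALPHA / R           # classic LoRA scaling, alpha / r
rslora_scale = ALPHA / math.sqrt(R)  # rank-stabilized scaling, alpha / sqrt(r)

print(f"{trainable:,}")
print(round(standard_scale, 3), round(rslora_scale, 3))
```

Under these assumptions the total comes to 597,688,320, the same figure the trainer reports in its training banner. With `use_rslora = True` the adapter update is scaled by roughly 2.83 instead of 0.25, which is worth keeping in mind when comparing `lora_alpha` values against runs that use standard LoRA.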


## Loading and Preparing the Data
The Colab notebook that Unsloth provides as its documentation example for continued pretraining uses the Korean subset of the Wikipedia dataset, so that the model both absorbs knowledge from Wikipedia and learns to write in Korean.

This trains the model on two problems at once: understanding knowledge from Wikipedia and understanding a new language. Since I also speak Spanish, I will use the Spanish subset instead so that the model learns Spanish.

In [5]:
# Wikipedia provides a title and an article text.
# Use https://translate.google.com!
_wikipedia_prompt = """Wikipedia Article
### Title: {}

### Article:
{}"""
# becomes:
wikipedia_prompt = """Artículo de Wikipedia
### Título: {}

### Artículo:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    titles = examples["title"]
    texts  = examples["text"]
    outputs = []
    for title, text in zip(titles, texts):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = wikipedia_prompt.format(title, text) + EOS_TOKEN
        outputs.append(text)
    return { "text" : outputs, }
pass
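To make the effect of the formatting function concrete, here is a minimal sketch that applies the same template to a hand-written batch. The `"</s>"` string is a stand-in for `tokenizer.eos_token`, and `toy_batch` is a hypothetical two-row batch shaped like what `dataset.map(..., batched = True)` passes in; neither comes from the notebook itself.

```python
# Stand-in for tokenizer.eos_token; "</s>" is an assumption made for this sketch.
EOS_TOKEN = "</s>"

wikipedia_prompt = """Artículo de Wikipedia
### Título: {}

### Artículo:
{}"""

def formatting_prompts_func(examples):
    # `examples` is a batch: a dict mapping column names to lists of values.
    outputs = []
    for title, text in zip(examples["title"], examples["text"]):
        outputs.append(wikipedia_prompt.format(title, text) + EOS_TOKEN)
    return {"text": outputs}

# Hypothetical batch with the same columns as the Wikipedia dataset.
toy_batch = {
    "title": ["Madrid", "Río Ebro"],
    "text": ["Madrid es la capital de España.", "El Ebro es un río de España."],
}
formatted = formatting_prompts_func(toy_batch)
print(formatted["text"][0])
```

Each output row is a single training string: the Spanish header, the title, the article body, and the end-of-sequence marker appended last.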

In [6]:
from datasets import load_dataset

dataset = load_dataset("wikimedia/wikipedia", "20231101.es", split = "train",)

# We select 1% of the data to make training faster!
dataset = dataset.train_test_split(train_size = 0.01)["train"]

dataset = dataset.map(formatting_prompts_func, batched = True,)

README.md:   0%|          | 0.00/131k [00:00<?, ?B/s]

train-00000-of-00013.parquet:   0%|          | 0.00/688M [00:00<?, ?B/s]

train-00001-of-00013.parquet:   0%|          | 0.00/376M [00:00<?, ?B/s]

train-00002-of-00013.parquet:   0%|          | 0.00/287M [00:00<?, ?B/s]

train-00003-of-00013.parquet:   0%|          | 0.00/245M [00:00<?, ?B/s]

train-00004-of-00013.parquet:   0%|          | 0.00/168M [00:00<?, ?B/s]

train-00005-of-00013.parquet:   0%|          | 0.00/178M [00:00<?, ?B/s]

train-00006-of-00013.parquet:   0%|          | 0.00/216M [00:00<?, ?B/s]

train-00007-of-00013.parquet:   0%|          | 0.00/241M [00:00<?, ?B/s]

train-00008-of-00013.parquet:   0%|          | 0.00/227M [00:00<?, ?B/s]

train-00009-of-00013.parquet:   0%|          | 0.00/223M [00:00<?, ?B/s]

train-00010-of-00013.parquet:   0%|          | 0.00/167M [00:00<?, ?B/s]

train-00011-of-00013.parquet:   0%|          | 0.00/254M [00:00<?, ?B/s]

train-00012-of-00013.parquet:   0%|          | 0.00/226M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1841155 [00:00<?, ? examples/s]

Map:   0%|          | 0/18411 [00:00<?, ? examples/s]

## Continued Pretraining
Following the instructions in the Unsloth notebook, I used Unsloth's trainer class to train the model. For a full run, Unsloth suggests setting `num_train_epochs = 1` and `max_steps = None`; their example notebook caps the number of steps so that it finishes quickly, which means it trains on less data. A full run would take about 18 hours even with LoRA and only 1% of the Wikipedia dataset, so I instead limit `max_steps`, as Unsloth suggests, to fit within the Colab compute environment. Each step takes about 1 minute on the Colab T4, so `max_steps = 60` takes about an hour.
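The time estimates can be reproduced with a little arithmetic. The sketch below uses the batch settings from the training cell and the example count from its log; the one-minute-per-step figure is the rough estimate quoted in the text, and the steps-per-epoch formula is a simplification of what the trainer computes internally.

```python
import math

num_examples = 18_411            # 1% sample of the Spanish Wikipedia split
per_device_batch_size = 2
gradient_accumulation_steps = 8
minutes_per_step = 1.0           # rough estimate on a Colab T4
max_steps = 60

# One optimizer step consumes batch_size * accumulation_steps examples.
effective_batch = per_device_batch_size * gradient_accumulation_steps
steps_per_epoch = math.ceil(num_examples / effective_batch)
full_run_hours = steps_per_epoch * minutes_per_step / 60

# How much of the subset a capped run actually sees.
examples_seen = max_steps * effective_batch
fraction_seen = examples_seen / num_examples

print(effective_batch, steps_per_epoch)
print(f"full epoch: ~{full_run_hours:.1f} h")
print(f"{max_steps} steps: {examples_seen} examples ({fraction_seen:.1%} of the subset)")
```

With these numbers a full epoch is about 1,151 steps, or roughly 19 hours, which is in line with the estimate of about 18 hours. The 60-step run sees 960 examples, around 5% of the sampled subset.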

In [14]:
trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,

    args = UnslothTrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 8,

        # Use warmup_ratio and num_train_epochs for longer runs!
        max_steps = 60,
        warmup_steps = 10,
        # warmup_ratio = 0.1,
        #num_train_epochs = 1,

        # Select a 2 to 10x smaller learning rate for the embedding matrices!
        learning_rate = 5e-5,
        embedding_learning_rate = 1e-5,

        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

max_steps is given, it will override any value given in num_train_epochs


In [15]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 18,411 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 8
\        /    Total batch size = 16 | Total steps = 60
 "-____-"     Number of trainable parameters = 597,688,320


Unsloth: Setting lr = 1.00e-05 instead of 5.00e-05 for embed_tokens.
Unsloth: Setting lr = 1.00e-05 instead of 5.00e-05 for lm_head.


Step,Training Loss
1,1.0616
2,1.1163
3,1.0224
4,1.2477
5,1.1934
6,1.1969
7,1.2705
8,1.3769
9,1.336
10,1.3762
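The training arguments combine `warmup_steps = 10`, `lr_scheduler_type = "linear"` and `max_steps = 60` with two base learning rates. The sketch below shows the usual linear-warmup, linear-decay rule applied to both rates. It is an illustration under the assumption that the embedding parameter group follows the same schedule shape as the rest of the model; `linear_schedule_lr` is a hypothetical helper, not part of Unsloth or Transformers.

```python
def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)

MAX_STEPS, WARMUP_STEPS = 60, 10
for step in (0, 5, 10, 35, 60):
    main_lr = linear_schedule_lr(step, 5e-5, WARMUP_STEPS, MAX_STEPS)
    embed_lr = linear_schedule_lr(step, 1e-5, WARMUP_STEPS, MAX_STEPS)
    print(f"step {step:2d}: lr = {main_lr:.2e}, embedding lr = {embed_lr:.2e}")
```

Under this rule the first ten logged steps all fall inside the warmup window, so the model is still ramping up to its peak learning rate during them.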


## Instruction Finetuning

The Unsloth notebook used the [Alpaca GPT-4 dataset translated into Korean](https://huggingface.co/datasets/FreedomIntelligence/alpaca-gpt4-korean). To finetune the model for conversation in Spanish using instruction tuning, we need the same dataset translated into Spanish. Luckily, the Unsloth team provided a link to the [MultilingualSIFT project](https://github.com/FreedomIntelligence/MultilingualSIFT), which offers translations of the Alpaca dataset in other languages.

In [16]:
from datasets import load_dataset
alpaca_dataset = load_dataset("FreedomIntelligence/alpaca-gpt4-spanish", split = "train")

README.md:   0%|          | 0.00/124 [00:00<?, ?B/s]

Repo card metadata block was not found. Setting CardData to empty.


alpaca-gpt4-spanish.json:   0%|          | 0.00/52.4M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/49969 [00:00<?, ? examples/s]

We print 1 example:

In [17]:
print(alpaca_dataset[0])

{'conversations': [{'from': 'human', 'value': 'Sugiera un eslogan para una campaña de reciclaje.\n'}, {'from': 'gpt', 'value': '1. "Reduce, reutiliza, recicla: juntos por un futuro más verde."\n2. "Recicla hoy, para un mañana mejor."\n3. "¡Convierte tu basura en tesoro - Recicla!"\n4. "Recicla por el ciclo de vida."\n5. "Ahorra recursos, recicla más."'}], 'id': '23712'}


In [18]:
_alpaca_prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{}

### Response:
{}"""
# Becomes:
alpaca_prompt = """Debajo se encuentra una instrucción que describe una tarea. Escribe una respuesta que completa la solicitud

### Instruccion:
{}

### Respuesta:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    texts = []
    # Each conversation is a list of turns: the first is read as the human
    # instruction and the second as the gpt response, as in the example printed above.
    for convo in examples["conversations"]:
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(convo[0]["value"], convo[1]["value"]) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }

alpaca_dataset = alpaca_dataset.map(formatting_prompts_func, batched = True,)

Map:   0%|          | 0/49969 [00:00<?, ? examples/s]
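The conversation records can also be formatted by speaker rather than by position. The sketch below is a variation on the notebook's function, not a copy of it: it looks up the `human` and `gpt` turns by their `from` field and strips surrounding whitespace, so the trailing newline in the instruction does not add an extra blank line to the prompt. The `"</s>"` string stands in for `tokenizer.eos_token`, `format_conversations` is a hypothetical name, and the response in `toy_batch` is shortened from the example printed earlier.

```python
EOS_TOKEN = "</s>"  # stand-in for tokenizer.eos_token (assumption for this sketch)

alpaca_prompt = """Debajo se encuentra una instrucción que describe una tarea. Escribe una respuesta que completa la solicitud

### Instruccion:
{}

### Respuesta:
{}"""

def format_conversations(batch):
    texts = []
    for convo in batch["conversations"]:
        # Select each turn by speaker instead of relying on list order.
        human = next(turn["value"] for turn in convo if turn["from"] == "human")
        gpt = next(turn["value"] for turn in convo if turn["from"] == "gpt")
        texts.append(alpaca_prompt.format(human.strip(), gpt.strip()) + EOS_TOKEN)
    return {"text": texts}

# Hypothetical one-row batch in the dataset's "conversations" layout.
toy_batch = {
    "conversations": [[
        {"from": "human", "value": "Sugiera un eslogan para una campaña de reciclaje.\n"},
        {"from": "gpt", "value": "Recicla hoy, para un mañana mejor."},
    ]]
}
formatted = format_conversations(toy_batch)
print(formatted["text"][0])
```

If a model has already been trained with the positional version, the same template and whitespace handling should be used at inference time so that prompts match what the model saw during training.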

We use `UnslothTrainer` again, this time to add conversational Spanish that is unrelated to Wikipedia to the model's knowledge, using the Spanish Alpaca dataset.

In [19]:
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
from unsloth import UnslothTrainer, UnslothTrainingArguments

trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = alpaca_dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 8,

    args = UnslothTrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 8,

        # Use num_train_epochs and warmup_ratio for longer runs!
        max_steps = 30,
        warmup_steps = 10,
        # warmup_ratio = 0.1,
        # num_train_epochs = 1,

        # Select a 2 to 10x smaller learning rate for the embedding matrices!
        learning_rate = 5e-5,
        embedding_learning_rate = 1e-5,

        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.00,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

Map (num_proc=8):   0%|          | 0/49969 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs


In [20]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 49,969 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 8
\        /    Total batch size = 16 | Total steps = 30
 "-____-"     Number of trainable parameters = 597,688,320


Unsloth: Setting lr = 1.00e-05 instead of 5.00e-05 for embed_tokens.
Unsloth: Setting lr = 1.00e-05 instead of 5.00e-05 for lm_head.


Step,Training Loss
1,1.7077
2,1.6105
3,1.2794
4,1.1445
5,0.9911
6,0.8576
7,0.9814
8,0.9717
9,0.9285
10,0.8552


In [21]:
from google.colab import userdata

In [22]:
model.push_to_hub_gguf(
    "Alexis-Az/mistral-7b-bnb-4bit-SpanishWiki_AlpacaFT",
    tokenizer,
    quantization_method = "q4_k_m",
    token = userdata.get('HF_TOKEN'),
)

Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which will take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 4.1G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 4.26 out of 12.67 RAM for saving.


 25%|██▌       | 8/32 [00:01<00:02,  9.14it/s]We will save to Disk and not RAM now.
100%|██████████| 32/32 [03:41<00:00,  6.92s/it]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Unsloth: Saving Alexis-Az/mistral-7b-bnb-4bit-SpanishWiki_AlpacaFT/pytorch_model-00001-of-00003.bin...
Unsloth: Saving Alexis-Az/mistral-7b-bnb-4bit-SpanishWiki_AlpacaFT/pytorch_model-00002-of-00003.bin...
Unsloth: Saving Alexis-Az/mistral-7b-bnb-4bit-SpanishWiki_AlpacaFT/pytorch_model-00003-of-00003.bin...
Done.


Unsloth: Converting mistral model. Can use fast conversion = True.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp will take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits will take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m'] will take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: [0] Installing llama.cpp. This will take 3 minutes...
Unsloth: [1] Converting model at Alexis-Az/mistral-7b-bnb-4bit-SpanishWiki_AlpacaFT into f16 GGUF format.
The output location will be ./Alexis-Az/mistral-7b-bnb-4bit-SpanishWiki_AlpacaFT/unsloth.F16.gguf
This will take 3 minutes...
INFO:hf-to-gguf:Loading model: mistral-7b-bnb-4bit-SpanishWiki_AlpacaFT
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'pytorch_model.bin.index.json'
INFO:hf-to-gguf:gguf: loading model part 'pytorch_model-00001-of-00003.bin'
INFO:hf-to-gguf:token_embd.

  0%|          | 0/1 [00:00<?, ?it/s]

unsloth.F16.gguf:   0%|          | 0.00/14.5G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/Alexis-Az/mistral-7b-bnb-4bit-SpanishWiki_AlpacaFT
Unsloth: Uploading GGUF to Huggingface Hub...


  0%|          | 0/1 [00:00<?, ?it/s]

unsloth.Q4_K_M.gguf:   0%|          | 0.00/4.37G [00:00<?, ?B/s]

No files have been modified since last commit. Skipping to prevent empty commit.


Saved GGUF to https://huggingface.co/Alexis-Az/mistral-7b-bnb-4bit-SpanishWiki_AlpacaFT
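The two file sizes in the upload log can be cross-checked against the model's parameter count. The sketch below assumes the standard Mistral-7B dimensions (not stated in this notebook), treats the sizes in the progress bars as decimal gigabytes, and ignores GGUF metadata, so the results are approximate.

```python
# Assumed Mistral-7B dimensions (not stated in the notebook).
HIDDEN, MLP, KV_DIM, VOCAB, LAYERS = 4096, 14336, 1024, 32000, 32

attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM  # q, o, k, v projections
mlp = 3 * HIDDEN * MLP                                 # gate, up, down projections
norms = 2 * HIDDEN                                     # two norm weights per layer
# All layers, plus input and output embeddings, plus the final norm.
total_params = LAYERS * (attention + mlp + norms) + 2 * VOCAB * HIDDEN + HIDDEN

f16_bytes = 2 * total_params   # 16-bit weights use two bytes each
q4_k_m_bytes = 4.37e9          # size shown in the upload progress bar
bits_per_weight = 8 * q4_k_m_bytes / total_params

print(f"{total_params:,} parameters")
print(f"F16: ~{f16_bytes / 1e9:.1f} GB")
print(f"Q4_K_M: ~{bits_per_weight:.2f} bits per weight")
```

The 16-bit estimate lands at about 14.5 GB, matching the `unsloth.F16.gguf` size in the log, and the quantized file works out to a little under 5 bits per weight on average.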
