# Continual Pretraining

> AKA Continued Finetuning. Unsloth lets you continually pretrain a model so it can learn a new language.

Continued or continual pretraining (CPT) is used to “steer” a language model toward new domains of knowledge, or out-of-distribution domains. Some models have not been trained well on other languages, or on text-specific domains such as law or medicine. In those cases CPT is needed for the language model to learn new tokens or datasets.

The [text completion notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing) is for continued pretraining/raw text. The [continued pretraining notebook](https://colab.research.google.com/drive/1tEd1FrOXWMnCU9UIvdYhs61tkxdMuKZu?usp=sharing) (this one) is for learning another language.

You can read more about continued pretraining and our release in our [blog post](https://unsloth.ai/blog/contpretraining).

In [None]:
from transformers import TrainingArguments
from datasets import load_dataset
from unsloth import FastLanguageModel
from unsloth import is_bfloat16_supported
from unsloth import UnslothTrainer, UnslothTrainingArguments

* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.

In [2]:
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
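
To get a rough feel for what `load_in_4bit` saves, the memory taken by the weights can be estimated from the parameter count alone. This is only a back-of-the-envelope sketch: the 7 billion figure is a rounded assumption for a 7B-class model, `weight_memory_gb` is a hypothetical helper, and the estimate ignores quantization metadata, activations, optimizer state and the LoRA adapters.

```python
# Rough weight-memory estimate; all numbers are illustrative assumptions.
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 7e9  # assumed round figure for a 7B-class model

mem_16bit = weight_memory_gb(NUM_PARAMS, 16)
mem_4bit = weight_memory_gb(NUM_PARAMS, 4)
print(f"16-bit weights: ~{mem_16bit:.1f} GB")
print(f" 4-bit weights: ~{mem_4bit:.1f} GB")
```

Under these assumptions the 4-bit weights need about a quarter of the memory of the 16-bit weights, which is why 4-bit loading is the default for smaller GPUs.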

In [None]:
model, tokenizer = FastLanguageModel.from_pretrained(
    # More models at https://huggingface.co/unsloth
    model_name = "unsloth/mistral-7b-v0.3", # Choose ANY! eg teknium/OpenHermes-2.5-Mistral-7B
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

## PEFT & Parameters for finetuning

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

To customize your finetune, you can edit the numbers below. We recommend keeping the defaults for now, but you can change them if you want to **experiment**. The goal is to change these numbers to increase accuracy while counteracting over-fitting. Over-fitting is when the language model memorizes a dataset and cannot answer novel questions. We want the final model to answer unseen questions, not to memorize.

- `r`: The rank of the finetuning process. Choose any number > 0! Suggested values are 8, 16, 32, 64, 128. A larger number uses more memory and is slower, but can increase accuracy on harder tasks. We normally suggest numbers from 8 (for fast finetunes) up to 128. Numbers that are too large can cause over-fitting, damaging your model's quality.
- `target_modules`: We select all modules to finetune. You can remove some to reduce memory usage and make training faster, but we strongly advise against this. Just train on all modules!
- `lora_alpha`: The scaling factor for finetuning. A larger number makes the finetune learn more about your dataset, but can promote over-fitting. We suggest setting this equal to the rank `r`, or double it.
- `lora_dropout`: Supports any, but = 0 is optimized. Leave this as 0 for faster training! Can reduce over-fitting, but not that much.
- `bias`: Supports any, but = "none" is optimized. Leave this as "none" for faster and less over-fit training!
- `use_gradient_checkpointing`: Options include True, False and "unsloth". Use True or "unsloth" for very long context. We suggest "unsloth" since it reduces memory usage by an extra 30% and supports extremely long context finetunes. You can read up [here](https://unsloth.ai/blog/long-context) for more details.
- `random_state`: The number to determine deterministic runs. Training and finetuning needs random numbers, so setting this number makes experiments reproducible.
- `use_rslora`: We support rank stabilized LoRA. Advanced feature that scales the adapter update by `lora_alpha / sqrt(r)` instead of `lora_alpha / r`. You can use this if you want!
- `loftq_config`: Advanced feature to initialize the LoRA matrices to the top r singular vectors of the weights. Can improve accuracy somewhat, but can make memory usage explode at the start.
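
To make the interaction between `r`, `lora_alpha` and `use_rslora` concrete, here is a minimal sketch of the adapter scaling factor. It follows the definitions from the LoRA and rank-stabilized LoRA papers (scale `lora_alpha / r` versus `lora_alpha / sqrt(r)`), not Unsloth's internal code, and `lora_scale` is a hypothetical helper written for illustration.

```python
import math

def lora_scale(lora_alpha: float, r: int, use_rslora: bool = False) -> float:
    """Multiplier applied to the low-rank update B @ A."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r

# Compare both scalings for a fixed lora_alpha as the rank grows.
for r in (8, 32, 128):
    standard = lora_scale(32, r)
    stabilized = lora_scale(32, r, use_rslora=True)
    print(f"r={r:>3}: alpha/r = {standard:.3f}, alpha/sqrt(r) = {stabilized:.3f}")
```

With standard scaling the multiplier shrinks quickly as the rank grows, while the rank-stabilized variant shrinks more slowly. That is one reason a large rank such as 128 is often paired with `use_rslora = True`.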

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 128, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
        "embed_tokens", "lm_head"  # Add for continual pretraining
    ],
    lora_alpha = 32,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = True,   # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)
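
As a sanity check on what these settings train, the number of trainable parameters can be estimated by hand. The sketch below assumes the published Mistral-7B-v0.3 dimensions (hidden size 4096, MLP size 14336, 32 layers, key/value projection width 1024, vocabulary size 32768) and assumes that `embed_tokens` and `lm_head` are trained in full rather than as low-rank adapters. Neither assumption is stated in this notebook, and `lora_params` is a hypothetical helper.

```python
# Assumed Mistral-7B-v0.3 dimensions (not stated in this notebook).
HIDDEN, MLP, KV_DIM, LAYERS, VOCAB = 4096, 14336, 1024, 32, 32768
R = 128  # LoRA rank used above

def lora_params(d_in: int, d_out: int, r: int) -> int:
    """A is (r x d_in) and B is (d_out x r), so LoRA adds r * (d_in + d_out)."""
    return r * (d_in + d_out)

# (input width, output width) of each adapted projection in one layer.
projections = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in projections.values())
adapter_total = per_layer * LAYERS

# Assumption: embed_tokens and lm_head are trained in full (VOCAB x HIDDEN each).
embedding_total = 2 * VOCAB * HIDDEN

total = adapter_total + embedding_total
print(f"LoRA adapters:   {adapter_total:,}")
print(f"Embeddings/head: {embedding_total:,}")
print(f"Total trainable: {total:,}")
```

Under these assumptions the total comes to 603,979,776, which matches the trainable parameter count printed in the training logs later in this notebook. A large share of it comes from the two embedding matrices added for continual pretraining.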

## Data Prep
We now use the Korean subset of the [Wikipedia dataset](https://huggingface.co/datasets/wikimedia/wikipedia) to first continually pretrain the model. You can use **any language** you like! Go to [Wikipedia's List of Languages](https://en.wikipedia.org/wiki/List_of_Wikipedias) to find your own language!

**[NOTE]** To train only on completions (ignoring the user's input) read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).

**[NOTE]** Remember to add the **EOS_TOKEN** to the tokenized output!! Otherwise you'll get infinite generations!

If you want to use the `ChatML` template for ShareGPT datasets, try our conversational [notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing).

For text completions like novel writing, try this [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing).

**[NOTE]** Use https://translate.google.com to translate from English to Korean!

In [5]:
# Wikipedia provides a title and an article text.
# Use https://translate.google.com!
_wikipedia_prompt = """Wikipedia Article
### Title: {}

### Article:
{}"""

# becomes:
wikipedia_prompt = """위키피디아 기사
### 제목: {}

### 기사:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    titles = examples["title"]
    texts  = examples["text"]
    outputs = []
    for title, text in zip(titles, texts):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = wikipedia_prompt.format(title, text) + EOS_TOKEN
        outputs.append(text)
    return { "text" : outputs }

We only use 1% of the dataset to speed things up! Use more for longer runs!

In [6]:
dataset = load_dataset("wikimedia/wikipedia", "20231101.ko", split = "train",)
# We select 1% of the data to make training faster!
dataset = dataset.train_test_split(train_size = 0.01)["train"]

In [7]:
dataset = dataset.map(formatting_prompts_func, batched = True,)

In [8]:
# Check out the first example
print(dataset[0]["text"])

위키피디아 기사
### 제목: 펑터우산 문화

### 기사:
펑터우산 문화(, 기원전 7500년경 ~ 기원전 6100년경)는 중화인민공화국의 장 강 중류, 후난성 북서부에 있는 신석기 시대 초기의 문화이다.

개요 

북쪽의 황하 유역에 번창한 페이리강 문화(裴李崗文化)와 거의 같은 시기의 문화이며 중국의 신석기시대 초기에 해당하며, 벼를 재배하고 있었던 것으로 추측된다. 표식 유적은 후난성 창더 시 리 현의 리양 평원에서 발견된 펑터우산 유적이며, 같은 현에서 발견된 80 여개의 유적이다.

펑터우산 유적은 1988년에 발굴되어 현재로서는 중국 유적 중 가장 초기의 취락의 흔적을 엿볼 수 있다. 다만, 연대를 확정하는 것이 곤란하고, 기원 전 9000년부터 기원 전 5500년경까지 모호하게 연대를 추측해볼 수 있다. 부장품으로는 새끼줄 모양이라고 하여 이름붙여진 색문토기(索文土器)가 출토되었다.

이 유적에서는 기원 전 7000년경의 쌀의 왕겨 등이 발견되었다. 이 쌀의 크기는 야생종의 쌀보다 크고, 중국에서 가장 오래된 재배종 벼가 실재했던 증거가 되고 있다. 다만 논을 경작하기 위한 도구 등이 펑터우산 유적에서는 발견되지 않았고, 펑터우산 문화의 후기의 유적에서는 발견되었다.

80여개의 유적에서는 취락을 굴로 둘러싼 자취가 발견되어, 가장 오래된 둘레군락이라고 추측된다. 또한 취락 중앙에는 제사를 목적으로 한 듯한 큰 건물이 발견되었다.

같이 보기 
 황하 문명
 장강 문명
 중국의 신석기 문화 목록

아시아의 신석기 문화
중국의 신석기 시대</s>


## Continued Pretraining

Now let's use Unsloth's `UnslothTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 20 steps to speed things up, but for a full run you can set `num_train_epochs=1` and turn off the step limit with `max_steps=None`.

Also set `embedding_learning_rate` to a learning rate 2x to 10x smaller than `learning_rate` to make continual pretraining work!
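
To see how the two learning rates evolve over a short run, here is a minimal sketch of a linear warmup followed by linear decay, using the values from the trainer configuration in this section (`warmup_steps = 10`, `max_steps = 20`, `lr_scheduler_type = "linear"`). `linear_schedule_lr` is a hypothetical helper, and the sketch assumes the embedding parameters follow the same schedule shape as the other parameters, which this notebook does not state.

```python
def linear_schedule_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Learning rate after `step` updates: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

LEARNING_RATE = 5e-5
EMBEDDING_LEARNING_RATE = 1e-5  # 5x smaller, inside the suggested 2x to 10x range

for step in (0, 5, 10, 15, 20):
    lr = linear_schedule_lr(step, LEARNING_RATE, warmup_steps=10, total_steps=20)
    emb_lr = linear_schedule_lr(step, EMBEDDING_LEARNING_RATE, warmup_steps=10, total_steps=20)
    print(f"step {step:>2}: lr = {lr:.2e}, embedding lr = {emb_lr:.2e}")
```

With only 20 steps, half of the run is spent warming up, which is why the comments in the configuration suggest `warmup_ratio` and `num_train_epochs` for longer runs.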

In [None]:
trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,

    args = UnslothTrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 8,

        # Use warmup_ratio and num_train_epochs for longer runs!
        max_steps = 20,
        warmup_steps = 10,
        # warmup_ratio = 0.1,
        # num_train_epochs = 1,

        # Select a 2 to 10x smaller learning rate for the embedding matrices!
        learning_rate = 5e-5,
        embedding_learning_rate = 1e-5,

        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "cpk-outputs",
        report_to = "none", # "none" disables logging integrations; set e.g. "wandb" to enable one
    ),
)

In [10]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 6,478 | Num Epochs = 1 | Total steps = 20
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 8 x 1) = 16
 "-____-"     Trainable parameters = 603,979,776/4,362,342,400 (13.85% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,1.6347
2,1.5705
3,1.5064
4,1.4106
5,1.5587
6,1.4932
7,1.2885
8,1.3828
9,1.4639
10,1.3799
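
The numbers in the log above can be combined to see how much of the sampled data a 20-step run actually covers. This is plain arithmetic on the logged values; the steps-per-epoch figure is approximate, since the trainer may round differently.

```python
import math

num_examples = 6_478              # "Num examples" from the log above
per_device_batch_size = 2
gradient_accumulation_steps = 8
num_gpus = 1
max_steps = 20

# Examples consumed per optimizer step.
effective_batch = per_device_batch_size * gradient_accumulation_steps * num_gpus
examples_seen = effective_batch * max_steps
steps_per_epoch = math.ceil(num_examples / effective_batch)

print(f"Effective batch size: {effective_batch}")
print(f"Examples seen in {max_steps} steps: {examples_seen} "
      f"({examples_seen / num_examples:.1%} of the sampled data)")
print(f"Approximate steps for one full epoch: {steps_per_epoch}")
```

So the quick demo run sees only a few percent of the 1% sample. A full pass over the sample would take roughly 405 optimizer steps.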


### Saving finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model.

In [11]:
model.save_pretrained("pretrained-korean-lora_model")  # Local saving
tokenizer.save_pretrained("pretrained-korean-lora_model")  # Local saving
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('pretrained-korean-lora_model/tokenizer_config.json',
 'pretrained-korean-lora_model/special_tokens_map.json',
 'pretrained-korean-lora_model/tokenizer.model',
 'pretrained-korean-lora_model/added_tokens.json',
 'pretrained-korean-lora_model/tokenizer.json')

## Instruction Finetuning

We now use the [Alpaca GPT4 dataset](https://huggingface.co/datasets/FreedomIntelligence/alpaca-gpt4-korean) translated into Korean!

Go to [vicgalle/alpaca-gpt4](https://huggingface.co/datasets/vicgalle/alpaca-gpt4) for the original GPT4 dataset for Alpaca, or to the [MultilingualSIFT project](https://github.com/FreedomIntelligence/MultilingualSIFT) for other translations of the Alpaca dataset.

### Load the dataset

In [15]:
alpaca_dataset = load_dataset("FreedomIntelligence/alpaca-gpt4-korean", split = "train")

In [18]:
# print out the first example
print(alpaca_dataset[0]["conversations"])

[{'from': 'human', 'value': '재활용 캠페인 슬로건을 제시하세요.\n'}, {'from': 'gpt', 'value': '1. "더욱 녹색 미래를 위해 함께 줄이고, 재사용하고, 재활용하세요."\n2. "더 나은 내일을 위해 오늘 바로 재활용하세요."\n3. "쓰레기를 보물로 만드는 법 - 재활용!"\n4. "인생의 순환을 위해 재활용하세요."\n5. "자원을 아끼고 더 많이 재활용하세요."'}]


In [17]:
_alpaca_prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{}

### Response:
{}"""

# We again use https://translate.google.com/ to translate the Alpaca format into Korean
# Becomes:
alpaca_prompt = """다음은 작업을 설명하는 명령입니다. 요청을 적절하게 완료하는 응답을 작성하세요.

### 지침:
{}

### 응답:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(conversations):
    texts = []
    conversations = conversations["conversations"]
    for convo in conversations:
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        # The conversations are a list of 2 dicts: 1 request from human and 1 response from gpt
        text = alpaca_prompt.format(convo[0]["value"], convo[1]["value"]) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts }
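
The formatter above relies on the human turn being at index 0 and the gpt turn at index 1, and it keeps the trailing newline that some instructions carry (which produces an extra blank line in the formatted text). As a hedged alternative, the sketch below picks turns by their `from` role and strips surrounding whitespace. `format_conversation` is a hypothetical helper, the English template is a shortened stand-in for the Korean one, and `EOS_TOKEN` is hard-coded here as an assumption instead of being read from the tokenizer.

```python
EOS_TOKEN = "</s>"  # assumed here; the notebook reads it from tokenizer.eos_token
PROMPT = "### Instruction:\n{}\n\n### Response:\n{}"  # shortened stand-in template

def format_conversation(convo):
    """Build one training string, picking turns by role instead of by position."""
    human = next(turn["value"] for turn in convo if turn["from"] == "human")
    gpt = next(turn["value"] for turn in convo if turn["from"] == "gpt")
    # strip() removes the trailing newline present on some instructions
    return PROMPT.format(human.strip(), gpt.strip()) + EOS_TOKEN

sample = [
    {"from": "human", "value": "Suggest a slogan for a recycling campaign.\n"},
    {"from": "gpt", "value": "Recycle today for a better tomorrow."},
]
text = format_conversation(sample)
print(text)
```

This variant raises `StopIteration` if a conversation lacks one of the two roles, which makes malformed rows visible early instead of silently mis-pairing turns.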

In [19]:
alpaca_dataset = alpaca_dataset.map(formatting_prompts_func, batched = True)

In [23]:
alpaca_dataset

Dataset({
    features: ['conversations', 'id', 'text'],
    num_rows: 49969
})

In [24]:
# drop "conversations" column
alpaca_dataset = alpaca_dataset.remove_columns("conversations")

In [21]:
# Check out the first example
print(alpaca_dataset[0]["text"])

다음은 작업을 설명하는 명령입니다. 요청을 적절하게 완료하는 응답을 작성하세요.

### 지침:
재활용 캠페인 슬로건을 제시하세요.


### 응답:
1. "더욱 녹색 미래를 위해 함께 줄이고, 재사용하고, 재활용하세요."
2. "더 나은 내일을 위해 오늘 바로 재활용하세요."
3. "쓰레기를 보물로 만드는 법 - 재활용!"
4. "인생의 순환을 위해 재활용하세요."
5. "자원을 아끼고 더 많이 재활용하세요."</s>


### Finetune

In [None]:
trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = alpaca_dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 8,

    args = UnslothTrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 8,

        # Use num_train_epochs and warmup_ratio for longer runs!
        max_steps = 20,
        warmup_steps = 10,
        # warmup_ratio = 0.1,
        # num_train_epochs = 1,

        # Select a 2 to 10x smaller learning rate for the embedding matrices!
        learning_rate = 5e-5,
        embedding_learning_rate = 1e-5,

        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.00,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "ftk-outputs",
        report_to = "none", # "none" disables logging integrations; set e.g. "wandb" to enable one
    ),
)

In [26]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 49,969 | Num Epochs = 1 | Total steps = 20
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 8 x 1) = 16
 "-____-"     Trainable parameters = 603,979,776/4,362,342,400 (13.85% trained)


Step,Training Loss
1,1.2283
2,1.2676
3,1.1293
4,0.9846
5,0.9641
6,0.8338
7,0.856
8,1.0134
9,0.8563
10,0.9342


### Saving finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model.

In [35]:
model.save_pretrained("finetuned-korean-lora_model")  # Local saving
tokenizer.save_pretrained("finetuned-korean-lora_model")  # Local saving
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('finetuned-korean-lora_model/tokenizer_config.json',
 'finetuned-korean-lora_model/special_tokens_map.json',
 'finetuned-korean-lora_model/tokenizer.model',
 'finetuned-korean-lora_model/added_tokens.json',
 'finetuned-korean-lora_model/tokenizer.json')

### Inference
Let's run the model! You can change the instruction; leave the response blank!

Remember to use https://translate.google.com/!

In [28]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
print("Model is now ready for inference!")

Model is now ready for inference!


In [34]:
# alpaca_prompt = Copied from above
inputs = tokenizer(
[
    alpaca_prompt.format(
        # "Continue the fibonacci sequence: 1, 1, 2, 3, 5, 8,", # instruction
        #"피보나치 수열을 계속하세요: 1, 1, 2, 3, 5, 8,", # instruction
        "1과 1의 빼기는 무엇입니까?", # instruction: "What is the subtraction of 1 and 1?"
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 128, use_cache = True, pad_token_id = tokenizer.eos_token_id)
preds = tokenizer.batch_decode(outputs)
print(preds[0])

<s> 다음은 작업을 설명하는 명령입니다. 요청을 적절하게 완료하는 응답을 작성하세요.

### 지침:
1과 1의 빼기는 무엇입니까?

### 응답:
1과 1의 빼기는 0입니다.</s>
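
The decoded output contains the whole prompt plus special tokens. If only the answer is wanted, a small post-processing step can cut it out. This is a minimal sketch: `extract_response` is a hypothetical helper, and the `<s>` / `</s>` strings are assumed from the output shown above rather than read from the tokenizer.

```python
RESPONSE_MARKER = "### 응답:"      # response header from the Korean Alpaca template
SPECIAL_TOKENS = ("<s>", "</s>")  # assumed BOS/EOS strings, as seen in the output above

def extract_response(decoded: str) -> str:
    """Return only the text generated after the last response marker."""
    response = decoded.rsplit(RESPONSE_MARKER, 1)[-1]
    for token in SPECIAL_TOKENS:
        response = response.replace(token, "")
    return response.strip()

decoded = "<s> ### 지침:\n1과 1의 빼기는 무엇입니까?\n\n### 응답:\n1과 1의 빼기는 0입니다.</s>"
print(extract_response(decoded))
```

Passing `skip_special_tokens = True` to `tokenizer.batch_decode` is another way to drop the special tokens, leaving only the marker split to do.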
