## Table of Contents

* **Agenda: Finetune Llama-3 8b in Kaggle/Colab without hitting the time limit**
* **Step-1: Install Unsloth and other dependencies**
* **Step-2: Login to Wandb and setup env variable**
* **Step-3: Load the unsloth/llama-3-8b-bnb-4bit model**
* **Step-4: Add LoRA Adapters (so we need to finetune 1-10% of the params)**
* **Step-5: Load the Alpaca Clean Dataset & Structure it according to prompt**
* **Step-6: Finetune the Model for 1st Iteration**
* **Step-7: Finetune the Model for <code>N<sup>th</sup></code> Iteration**
  * **Step-7.1: repeat `step 1-5` and skip `step-6` (skipping..)**
  * **Step-7.2: Downloading Checkpoint Artifact from wandb**
  * **Step-7.3: Structure the checkpoint so that training can be resumed**
  * **Step-7.4: Start finetuning for the n<sup>th</sup> iteration using `resume_from_checkpoint=True`**
* **Step-8: Saving the model Locally and to Huggingface Hub**
  * **Save only LoRA adapters (not the entire model)**
  * **Save 16bit or 4bit Quantized Model**
* **Step-9: Loading the model & Running Inference**
  * **Loading & Running Inference with Unsloth `FastLanguageModel`**
  * **Loading & Running Inference with Huggingface `AutoPeftModelForCausalLM` (only for LoRA Adapter model)**
  * **Loading & Running Inference with Huggingface `AutoModelForCausalLM` (for 4bit,16bit)**
  ---



# Agenda: Finetune Llama-3 8b in Kaggle/Colab without hitting the time limit

- The purpose of this notebook is to show how to use **incremental learning** inside a **Kaggle** or **Colab** notebook, so that you don't have to finetune the model in a single go.
- Instead, you finetune in sessions that each stay within the time limit of Kaggle (12 hrs) or Colab: set `max_steps` instead of `num_train_epochs`, and resume training from the previous checkpoint using `resume_from_checkpoint=True`.
- All checkpoints are saved to **Weights & Biases** so that they can be downloaded for the next run.
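
The session plan described above comes down to simple arithmetic. The snippet below is a minimal sketch, not part of the original notebook: `session_schedule`, `total_steps` and `steps_per_session` are hypothetical names, and the step counts are made-up example values.

```python
def session_schedule(total_steps, steps_per_session):
    """Return the cumulative `max_steps` value to use in each session."""
    if total_steps <= 0 or steps_per_session <= 0:
        raise ValueError("step counts must be positive")
    schedule = []
    done = 0
    while done < total_steps:
        # Each session trains up to `steps_per_session` more steps.
        done = min(done + steps_per_session, total_steps)
        schedule.append(done)
    return schedule


# Example: 2,500 total steps, at most 1,000 steps per Kaggle/Colab session.
for session, max_steps in enumerate(session_schedule(2500, 1000), start=1):
    resume = session > 1  # only later sessions resume from a checkpoint
    print(f"session {session}: max_steps={max_steps}, resume_from_checkpoint={resume}")
```

Each value in the returned list is the `max_steps` you would pass to `TrainingArguments` in that session.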

# Step-1: Install Unsloth and other dependencies

In [None]:
%%capture
!pip install -U "xformers<0.0.26" --index-url https://download.pytorch.org/whl/cu121
!pip install "unsloth[kaggle-new] @ git+https://github.com/unslothai/unsloth.git"

# Temporary fix for https://github.com/huggingface/datasets/issues/6753
!pip install datasets==2.16.0 fsspec==2023.10.0 gcsfs==2023.10.0

# Step-2: Login to Wandb and setup env variable

**If wandb is not installed, run `!pip install wandb` to install it.**

In [None]:
import os
import wandb

os.environ["WANDB_PROJECT"] = "PROJECT_NAME" # wandb project name; choose an appropriate one
os.environ["WANDB_LOG_MODEL"] = "checkpoint" # upload every saved checkpoint to wandb as an artifact
wandb.login(key = "YOUR_WANDB_API_KEY") # replace with your API key

# Step-3: Load the unsloth/llama-3-8b-bnb-4bit model

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

# Step-4: Add LoRA Adapters (so we need to finetune 1-10% of the params)

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

# Step-5: Load the Alpaca Clean Dataset & Structure it according to prompt
**Don't forget to append the `EOS` token. Otherwise the finetuned model won't learn to predict the EOS token and text generation won't stop.**
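
To see what appending the EOS token does, here is a small self-contained sketch. It uses a stand-in EOS string and a shortened template instead of the real tokenizer and prompt, so `format_batch`, `template` and the toy batch are illustrative assumptions only.

```python
# Stand-in for tokenizer.eos_token; the real value comes from the loaded tokenizer.
EOS_TOKEN = "<|end_of_text|>"

# Shortened version of the prompt template, for illustration only.
template = "### Instruction:\n{}\n\n### Input:\n{}\n\n### Response:\n{}"


def format_batch(examples, eos_token=EOS_TOKEN):
    """Format a batched dict of columns into prompt strings that end with EOS."""
    texts = [
        template.format(instruction, context, output) + eos_token
        for instruction, context, output in zip(
            examples["instruction"], examples["input"], examples["output"]
        )
    ]
    return {"text": texts}


# Toy batch in the same column layout as the dataset (values are made up).
batch = {"instruction": ["Translate to Bangla"], "input": ["Hello"], "output": ["হ্যালো"]}
formatted = format_batch(batch)

# Every training example should end with the EOS token.
assert all(text.endswith(EOS_TOKEN) for text in formatted["text"])
print(formatted["text"][0])
```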

In [None]:
alpaca_prompt = """Below is an instruction in bangla that describes a task, paired with an input also in bangla that provides further context. Write a response in bangla that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN


def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }


from datasets import load_dataset
dataset = load_dataset("iamshnoo/alpaca-cleaned-bengali", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)

# Step-6: Finetune the Model for 1st Iteration

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 1000,                #### Setting max_steps for 1000. (1-1000)
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        save_steps=500,                 ### A checkpoint is saved every 500 steps
        optim = "adamw_8bit",
        weight_decay = 0.01,
        report_to="wandb",  # reporting logs and checkpoint to wandb
        run_name="1stIteration_TillYourStepNumber", ### wandb run name, give appropriate run name according to your choice ###
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",   # Saving the checkpoints to outputs folder
    ),
)

- **Here `max_steps=1000` means finetuning runs from step 0 to step 1000.**
- **`report_to="wandb"` means the training loss is logged to wandb; together with `WANDB_LOG_MODEL="checkpoint"` from Step-2, checkpoints are uploaded there as artifacts.**
- **`save_steps=500` means a checkpoint of the `model`, together with its `trainer state`, is saved every 500 steps.**
- **Finally, `output_dir="outputs"` means all model checkpoints are saved locally in that folder.**
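
As a quick sanity check of these settings, the arithmetic can be worked out directly. This is a sketch under stated assumptions: a single GPU, and only the regular `save_steps` cadence (some library versions may also write a checkpoint at the final step).

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 1000
save_steps = 500
num_gpus = 1  # assumption: a single GPU

# Examples consumed per optimizer step and over the whole run.
examples_per_step = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
examples_seen = examples_per_step * max_steps

# Steps at which a checkpoint is written on the regular cadence.
checkpoint_steps = list(range(save_steps, max_steps + 1, save_steps))

print(examples_per_step)  # 8
print(examples_seen)      # 8000
print(checkpoint_steps)   # [500, 1000]
```

Choosing `max_steps` as a multiple of `save_steps` means the last checkpoint coincides with the end of the run, so no training progress is lost between sessions.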

In [None]:
# Now Start training
trainer_stats = trainer.train()

In [None]:
wandb.finish() # Finish Wandb Run

# Step-7: Finetune the Model for <code>N<sup>th</sup></code> Iteration

- **Step-7.1: repeat `step 1-5` and skip `step-6`**
- **Step-7.2: Download the last `checkpoint artifact` from wandb.**
- **Step-7.3: Structure the checkpoint so that training can be resumed**
- **Step-7.4: Start finetuning for the n<sup>th</sup> iteration using `resume_from_checkpoint=True`**


## Step-7.1: repeat `step 1-5` and skip `step-6` (skipping..)

## Step-7.2: Downloading Checkpoint Artifact from wandb

- **Go to the wandb website.**
- **Select the project (your Llama finetuning project).**
- **Go to Workspace.**
- **Click on the Artifacts tab (at the bottom left).**
- **Look for `checkpoint-yourPreviousWandbRunName` and click it.**
- **You'll see several artifact versions named `v0`, `v1`, ... `vn`. Select `vn`, the latest checkpoint.**
- **At the top you'll see tabs such as `Version`, `Metadata`, `Usage`; select the `Usage` tab.**
- **Copy the code and paste it into the cell below.**
- **Run the cell.**

In [None]:
## The code will look like this
run = wandb.init()
artifact = run.use_artifact('YOUR_ARTIFACT_URL', type='model')
artifact_dir = artifact.download()
run.finish()  ### Don't forget to add this line

**Remember to finish the run with `run.finish()`.**

## Step-7.3: Structure the checkpoint so that training can be resumed

- **The helper function below simply moves all files from the `artifact` folder to the `outputs` folder.**
- **Only change the `dst_dir` value at the bottom.**
- **For `Colab`, set it to `/content/outputs/checkpoint-PreviousCheckPointNumber`.**
- **For `Kaggle`, set it to `/kaggle/working/outputs/checkpoint-PreviousCheckPointNumber`.**
- **`PreviousCheckPointNumber` is the step number of the last checkpoint saved in the wandb run, which you downloaded in the previous cell.**
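
If you would rather not type the checkpoint number by hand, it can usually be read from the downloaded checkpoint itself. The sketch below assumes the artifact folder contains the Trainer's `trainer_state.json` with a `global_step` field; `checkpoint_dst_dir` is a hypothetical helper name, not part of the original notebook.

```python
import json
import os


def checkpoint_dst_dir(artifact_dir, outputs_root):
    """Build `<outputs_root>/checkpoint-<global_step>` from a downloaded checkpoint."""
    state_path = os.path.join(artifact_dir, "trainer_state.json")
    with open(state_path, encoding="utf-8") as f:
        global_step = json.load(f)["global_step"]
    return os.path.join(outputs_root, f"checkpoint-{global_step}")


# Usage (after the download cell has defined `artifact_dir`):
# dst_dir = checkpoint_dst_dir(artifact_dir, "/kaggle/working/outputs")
```

If the file is missing in your artifact, fall back to setting `dst_dir` manually as described above.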

In [None]:
def fileStructure(src_dir,dst_dir):
    import shutil
    import os
    # Create the destination directory if it doesn't exist
    os.makedirs(dst_dir, exist_ok=True)

    # Get a list of all file names in the source directory
    file_names = os.listdir(src_dir)

    if len(os.listdir(dst_dir))==0:
        # Move each file to the destination directory
        for file_name in file_names:
            shutil.move(os.path.join(src_dir, file_name), dst_dir)
    else:
        print("Files already been moved")


# Define source and destination directories
src_dir = artifact_dir
dst_dir = "/kaggle/working/outputs/checkpoint-PreviousCheckPointNumber"   ########## Change this path every time to use the proper checkpoint number ###############


# Run the function
fileStructure(src_dir,dst_dir)

In [None]:
import shutil

# Remove the artifact directory, whether or not it is empty
shutil.rmtree(src_dir, ignore_errors=True)

**The cell above is optional. It removes the artifact directory, which is no longer needed.**
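
With `resume_from_checkpoint=True`, the Trainer resumes from the most recent `checkpoint-<step>` folder in `output_dir`, so it is worth confirming that the moved checkpoint is where it should be. The function below is a plain-Python sketch of that lookup (`last_checkpoint` is a hypothetical name); `transformers.trainer_utils.get_last_checkpoint` offers a library version of the same idea.

```python
import os
import re


def last_checkpoint(output_dir):
    """Return the `checkpoint-<step>` subfolder with the highest step, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    best_step, best_path = -1, None
    if not os.path.isdir(output_dir):
        return None
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        path = os.path.join(output_dir, name)
        # Only folders named with a numeric step count as checkpoints.
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


# Prints the checkpoint folder that a resumed run would pick up, or None.
print(last_checkpoint("outputs"))
```

A folder still named with the literal placeholder `checkpoint-PreviousCheckPointNumber` does not match the pattern, which makes a forgotten rename easy to spot.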

## Step-7.4: Start finetuning for the n<sup>th</sup> iteration using `resume_from_checkpoint=True`

- Change `max_steps` to **(previous max steps + new n<sup>th</sup> max steps)**, e.g. if the previous max steps = 1000 and the new n<sup>th</sup> max steps = 1000, then `max_steps=2000`.
- Change `run_name` to an appropriate wandb run name.
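
One side effect of raising `max_steps` is worth keeping in mind: with `lr_scheduler_type = "linear"`, the learning rate decays toward zero at `max_steps`, so the same step sits at a different point of the schedule once `max_steps` grows. The sketch below uses the standard linear warmup and decay formula and assumes the resumed run builds its scheduler from the new `max_steps`; it is an illustration, not a statement about what your exact library versions do. `linear_lr` is a hypothetical helper.

```python
def linear_lr(step, max_steps, peak_lr=2e-4, warmup_steps=5):
    """Learning rate of a linear warmup + linear decay schedule at `step`."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    return peak_lr * max(0.0, (max_steps - step) / max(1, max_steps - warmup_steps))


# First run: the schedule reaches zero at its own final step.
print(linear_lr(1000, max_steps=1000))  # 0.0

# Resumed run with max_steps=2000: step 1000 is now roughly mid-schedule.
print(linear_lr(1000, max_steps=2000))
```

If the loss curve shows a jump at the start of a resumed session, this change in learning rate is one possible explanation to check in the wandb logs.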

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 2000,   ### example value: max_steps = prev_max_steps (1000) + new_nth_max_steps (1000); change for your run ###
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        save_steps=500,   ### Saving the checkpoints in every 500 steps ###
        optim = "adamw_8bit",
        weight_decay = 0.01,
        report_to="wandb",
        run_name="NthIteration_TillYourStepNumber",  ### Give appropriate run name according to your choice ###
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs", ### Saving the checkpoints to outputs folder ###
    ),
)

- **Change `max_steps` and `run_name`**

**Start training the trainer using `resume_from_checkpoint =True`**

In [None]:
trainer_stats = trainer.train(resume_from_checkpoint = True)

In [None]:
# finish wandb run
wandb.finish()

---

### Kaggle/Colab Training

- For the **first** run, execute **steps 1-6**.
- For every later run, execute **steps 1-5** and **step 7**.
- Always run the code on a **T4** GPU.
- Find the total `max_steps` required for 1 epoch. (**Run the trainer with `num_train_epochs=1` and start training; it will show the total number of steps needed to finish 1 epoch. Note that number.**)
- Set `max_steps` so that all the steps can complete before the time limit (e.g. **12 hrs for Kaggle**).
- Use Kaggle's **Save Version** to commit the code (so that it can run automatically):
  - Comment out all code that won't run in that version, i.e. for the **1st iteration** comment out steps **7-9**, and **after the 1st iteration** comment out steps **6** and **8-9**.
  - Choose an appropriate version name.
  - Under `version type`, select `Save & Run All (commit)`.
  - Open the advanced settings and select `Run with GPU for this session`.
  - Click `Save` at the bottom to commit the code.
  - Before committing, make sure the notebook was run, or is currently **running, on the T4 GPU.**

- Repeat this process until you have completed the total `max_steps` required for 1 or 2 epochs.
- In the last iteration, where you complete the `max_steps` required for 1 or 2 epochs, follow **step 8** immediately. For a Kaggle `Save Version` run, **add the Step 8 code** for saving the model at the end (**only for the last iteration**).
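
The number of steps per epoch can also be estimated up front. This is a rough sketch that assumes `packing = False` and a single GPU; the dataset size used here is a made-up example, and the number the Trainer prints when training starts remains the authoritative one. `steps_per_epoch` and `sessions_needed` are hypothetical helper names.

```python
import math


def steps_per_epoch(num_examples, per_device_batch_size=2, grad_accum_steps=4, num_gpus=1):
    """Approximate number of optimizer steps in one epoch."""
    batches = math.ceil(num_examples / (per_device_batch_size * num_gpus))
    return max(batches // grad_accum_steps, 1)


def sessions_needed(total_steps, steps_per_session):
    """Number of Kaggle/Colab sessions needed to cover `total_steps`."""
    return math.ceil(total_steps / steps_per_session)


num_examples = 50_000  # hypothetical dataset size, not the real one
total = steps_per_epoch(num_examples)
print(total)                         # 6250
print(sessions_needed(total, 1000))  # 7
```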

---

  

# Step-8: Saving the model Locally and to Huggingface Hub

Save the model and tokenizer after the last iteration. **❗ Remember to save the model immediately after the last iteration. A checkpoint can't be loaded as a model.**

## Save only LoRA adapters (not the entire model)

In [None]:
# save it locally
if False:
  model.save_pretrained("lora_model")
  tokenizer.save_pretrained("lora_model")

In [None]:
# Save it to Huggingface hub
if False:
  model.push_to_hub("your_name/lora_model", token = "your huggingface token with write permission")
  tokenizer.push_to_hub("your_name/lora_model", token = "your huggingface token with write permission")

## Save 16bit or 4bit Quantized Model

In [None]:
# Merge to 4bit

#Locally
if False: model.save_pretrained_merged("4bit_model", tokenizer, save_method = "merged_4bit",)

#HF hub
if False: model.push_to_hub_merged("your_hf_username/4bit_model", tokenizer, save_method = "merged_4bit_forced", token = "your huggingface token with write permission")

In [None]:
# Merge to 16bit

# Locally
if False: model.save_pretrained_merged("16bit_model", tokenizer, save_method = "merged_16bit",)

# HF Hub
if False: model.push_to_hub_merged("your_hf_username/16bit_model", tokenizer, save_method = "merged_16bit", token = "your huggingface token with write permission")

# Step-9: Loading the model & Running Inference

## Loading & Running Inference with Unsloth `FastLanguageModel`

In [None]:
if True:
    max_seq_length=2048
    dtype = None
    load_in_4bit = True
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "", # YOUR MODEL YOU USED FOR TRAINING either hf hub name or local folder name.
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

**Inference**

In [None]:
alpaca_prompt = """Below is an instruction in bangla that describes a task, paired with an input also in bangla that provides further context. Write a response in bangla that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

inputs = tokenizer(
[
    alpaca_prompt.format(
        "", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 1024, use_cache = True)
tokenizer.batch_decode(outputs)

**Inference using `TextStreamer`**

In [None]:
# alpaca_prompt = Copied from above
alpaca_prompt = """Below is an instruction in bangla that describes a task, paired with an input also in bangla that provides further context. Write a response in bangla that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 2048,eos_token_id=tokenizer.eos_token_id)

## Loading & Running Inference with Huggingface `AutoPeftModelForCausalLM` (only for LoRA Adapter model)

In [None]:
if True:
    # I highly do NOT suggest - use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    load_in_4bit = True
    model = AutoPeftModelForCausalLM.from_pretrained(
        "", # YOUR MODEL YOU USED FOR TRAINING either hf hub name or local folder name.
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("") # YOUR MODEL YOU USED FOR TRAINING either hf hub name or local folder name.

**Inference**

In [None]:
alpaca_prompt = """Below is an instruction in bangla that describes a task, paired with an input also in bangla that provides further context. Write a response in bangla that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

inputs = tokenizer(
[
    alpaca_prompt.format(
        "", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 1024, use_cache = True)
tokenizer.batch_decode(outputs)

## Loading & Running Inference with Huggingface `AutoModelForCausalLM` (for 4bit,16bit)

In [None]:
if False:
  from transformers import AutoTokenizer, AutoModelForCausalLM

  model_name = ""  # YOUR MODEL YOU USED FOR TRAINING either hf hub name or local folder name.
  tokenizer_name = model_name

  # Load tokenizer
  tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
  # Load model (device_map = "auto" places it on the GPU, where the inputs below are sent)
  model = AutoModelForCausalLM.from_pretrained(model_name, device_map = "auto")

**Inference**

In [None]:
# Text prompt to start generation
alpaca_prompt = """Below is an instruction in bangla that describes a task, paired with an input also in bangla that provides further context. Write a response in bangla that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Encode the prompt text
inputs = tokenizer(
[
    alpaca_prompt.format(
        "", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

# output
outputs = model.generate(**inputs, max_new_tokens = 1024, use_cache = True)
tokenizer.batch_decode(outputs)