  # Create LLM Model for RAG [Triwira Data]

  > **Note:** Due to limitations of GPU resources, this notebook uses `Google Colab T4 GPU` to fine-tune the LLM model.

In this notebook we build the LLM that serves as the chatbot of the RAG system. Given a user prompt, it generates the text that answers that prompt.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MarcoAlandAdinanda/AIC_TriwiraData/blob/main/notebooks/FineTune_LLM_RAG.ipynb)

## 0. Get setup
Let's start by installing all of the modules we'll need to fine-tune the LLM.

Downloading modules

In [1]:
%%capture
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes
!pip install transformers datasets

Importing modules

In [2]:
# Typing modules
from typing import List, Tuple, Dict, Optional

# Model builder and fine-tune
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
from transformers.tokenization_utils_fast import PreTrainedTokenizerFast

# Model output text streamer
from transformers import TextStreamer

# Dataset loader
from datasets import load_dataset
from datasets.arrow_dataset import Dataset

# Prompt template for LLM input
prompt_template = """Berikut adalah sebuah instruksi yang menjelaskan suatu tugas, disertai dengan sebuah masukan yang memberikan konteks lebih lanjut. Tulislah sebuah tanggapan yang sesuai untuk menyelesaikan permintaan tersebut dan dimulai dengan pendahuluan berupa penjelasan konteks terlebih dahulu.

### Instruksi:
{}

### Respons:
{}"""

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


## 1. Build LLM model
We'll define functions that load the base LLM and attach LoRA adapters to it using PEFT (Parameter-Efficient Fine-Tuning).

In [3]:
import torch  # needed only for the dtype annotation below

def set_model(model_name: str = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
              max_seq_length: int = 2048,
              load_in_4bit: bool = True,
              dtype: Optional[torch.dtype] = None) -> Tuple[FastLanguageModel, PreTrainedTokenizerFast]:
    """
    Load a model using unsloth's FastLanguageModel.

    Args:
        model_name (str): The name of the model to load. Default is "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit".
        max_seq_length (int): The maximum sequence length for the model. Default is 2048.
        load_in_4bit (bool): Whether to load the model in 4-bit mode. Default is True.
        dtype (Optional[torch.dtype]): The data type to load the model with. Default is None, which lets Unsloth pick one automatically.

    Returns:
        tuple: A tuple containing the loaded model and tokenizer.

    Example usage:
        model, tokenizer = set_model()
    """

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_name,
        max_seq_length=max_seq_length,
        load_in_4bit=load_in_4bit,
        dtype=dtype,
    )
    return model, tokenizer

def use_peft(model: FastLanguageModel,
             r_value: int = 16,
             lora_alpha: int = 16,
             lora_dropout: float = 0.0,
             use_rslora: bool = False,
             random_state: int = 3407,
             bias: str = "none",
             loftq_config: Optional[Dict] = None,
             use_gradient_checkpointing: str = "unsloth",
             target_modules: List[str] = ["q_proj", "k_proj", "v_proj", "o_proj",
                                          "gate_proj", "up_proj", "down_proj"]):
    """
    Attach LoRA adapters to the given model for Parameter-Efficient Fine-Tuning (PEFT).

    Args:
        model (FastLanguageModel): The base model to attach the adapters to.
        r_value (int): The rank of the LoRA layers. Default is 16.
        lora_alpha (int): The alpha value for the LoRA layers. Default is 16.
        lora_dropout (float): The dropout rate for the LoRA layers. Default is 0.0.
        use_rslora (bool): Whether to use RsLoRA. Default is False.
        random_state (int): The random seed for reproducibility. Default is 3407.
        bias (str): The bias setting for the LoRA layers. Default is "none".
        loftq_config (Optional[Dict]): The configuration for LoftQ. Default is None.
        use_gradient_checkpointing (str): The gradient checkpointing strategy. Default is "unsloth".
        target_modules (List[str]): The list of target modules for PEFT. Default is ["q_proj", "k_proj", "v_proj", "o_proj",
                                                                                    "gate_proj", "up_proj", "down_proj"].

    Returns:
        FastLanguageModel: The model wrapped with LoRA adapters, ready for fine-tuning.

    Example usage:
        model = use_peft(model)
    """

    model = FastLanguageModel.get_peft_model(
        model=model,
        r=r_value,
        lora_alpha=lora_alpha,
        lora_dropout=lora_dropout,
        use_rslora=use_rslora,
        random_state=random_state,
        bias=bias,
        loftq_config=loftq_config,
        use_gradient_checkpointing=use_gradient_checkpointing,
        target_modules=target_modules,
    )

    return model

def set_model_with_peft() -> Tuple[FastLanguageModel, PreTrainedTokenizerFast]:
    """
    Load the base unsloth.FastLanguageModel and attach LoRA adapters for Parameter-Efficient Fine-Tuning (PEFT).

    Returns:
        tuple: A tuple containing the PEFT-wrapped model and its tokenizer.

    Example usage:
        model, tokenizer = set_model_with_peft()
    """
    model, tokenizer = set_model()
    model = use_peft(model)
    return model, tokenizer

In [None]:
# Set peft model
model, tokenizer = set_model_with_peft()

==((====))==  Unsloth 2024.8: Fast Llama patching. Transformers = 4.44.0.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.26.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/55.4k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/340 [00:00<?, ?B/s]

Unsloth 2024.8 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
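
To get a feel for how small the LoRA adapters are, the sketch below counts the weights that `use_peft` adds with its default rank of 16. The layer dimensions are the published Llama 3.1 8B configuration values, which are an assumption here because this notebook does not state them; `lora_param_count` is a hypothetical helper, not part of Unsloth or PEFT.

```python
# Published Llama 3.1 8B dimensions (assumption: not stated in this notebook)
HIDDEN = 4096          # hidden_size
INTERMEDIATE = 14336   # intermediate_size of the MLP
KV_DIM = 8 * 128       # num_key_value_heads * head_dim (grouped-query attention)
NUM_LAYERS = 32        # matches the "patched 32 layers" log line
R = 16                 # LoRA rank, the r_value default in use_peft

# (in_features, out_features) of each target module in one decoder layer
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(shapes, rank, num_layers):
    """Count LoRA weights: each module gets A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers


trainable = lora_param_count(TARGET_SHAPES, R, NUM_LAYERS)
print(f"{trainable:,}")  # 41,943,040
```

Under these assumptions the result equals the "Number of trainable parameters" that Unsloth reports in the training log of section 3, which is well under 1% of the roughly 8 billion base weights.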


## 2. Get data
Because the RAG system operates in Indonesian, we fine-tune the LLM on an Indonesian instruction-response dataset.

In [4]:
def formatting_prompts_func(data, prompt_template: str, EOS_TOKEN):
    """
    Formats the input data into prompts using a given template and an end-of-sequence token.

    Args:
        data (dict): A batch of examples (a dict of lists) with "instruction" and "response" keys.
        prompt_template (str): The template string to format the prompts.
        EOS_TOKEN (str): The end-of-sequence token to append to each prompt.

    Returns:
        dict: A dictionary with the key "text" containing a list of formatted prompt strings.

    Example usage:
        formatted_data = formatting_prompts_func(data, "{0}: {1}", "<|endoftext|>")
    """

    instructions = data["instruction"]
    responses = data["response"]
    texts = []
    for instruction, response in zip(instructions, responses):
        # Append EOS_TOKEN so the model learns where a response ends;
        # without it, generation may never stop.
        text = prompt_template.format(instruction, response) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

def create_dataset(dataset_path: str,
                   tokenizer: PreTrainedTokenizerFast,
                   prompt_template: str,
                   test_size: float = 0.1) -> Tuple[Dataset, Dataset]:
    """
    Loads a dataset, keeps a small random subset, formats it with the prompt template, and splits it.

    Args:
        dataset_path (str): The Hugging Face Hub name or local path of the dataset.
        tokenizer (PreTrainedTokenizerFast): The tokenizer that supplies the EOS token.
        prompt_template (str): The template string to format the prompts.
        test_size (float): The fraction of the subset held out for evaluation. Default is 0.1.

    Returns:
        Tuple[Dataset, Dataset]: The training dataset and the evaluation dataset.

    Example usage:
        train_dataset, test_dataset = create_dataset("path/to/dataset", tokenizer, "{0}: {1}")
    """

    # End-of-sequence token
    EOS_TOKEN = tokenizer.eos_token
    dataset = load_dataset(dataset_path, split="train")

    # Keep only a random 0.5% of the data (the "test" side of this split)
    # to keep the fine-tuning run small
    split_dataset = dataset.train_test_split(test_size=0.005)
    used_dataset = split_dataset['test']

    # Map preprocessing
    preprocessed_dataset = used_dataset.map(
        formatting_prompts_func,
        batched=True,
        fn_kwargs={'prompt_template': prompt_template, 'EOS_TOKEN': EOS_TOKEN}
    )

    train_test_dataset = preprocessed_dataset.train_test_split(test_size=test_size)
    train_dataset = train_test_dataset["train"]
    test_dataset = train_test_dataset["test"]

    return train_dataset, test_dataset
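
To see what the formatting step produces, here is a minimal, dependency-free sketch of the same batched formatting on a toy batch. The short template, the `<EOS>` string, the sample rows, and the `format_batch` helper are all hypothetical stand-ins; the notebook itself uses `prompt_template`, `tokenizer.eos_token`, and `Dataset.map`.

```python
# Short stand-in with the same two slots as prompt_template (hypothetical)
toy_template = "### Instruksi:\n{}\n\n### Respons:\n{}"
# Hypothetical EOS string; the notebook reads the real one from tokenizer.eos_token
toy_eos = "<EOS>"


def format_batch(batch, template, eos_token):
    """Fill the template for every (instruction, response) pair in a batch."""
    pairs = zip(batch["instruction"], batch["response"])
    return {"text": [template.format(i, r) + eos_token for i, r in pairs]}


# A batch is a dict of equally long lists, which is what map(batched=True) passes in
toy_batch = {
    "instruction": ["Halo", "Sebutkan ibu kota Indonesia"],
    "response": ["Halo!", "Jakarta"],
}

formatted = format_batch(toy_batch, toy_template, toy_eos)
for text in formatted["text"]:
    print(repr(text))
```

Each training string therefore ends with the EOS marker, which is what teaches the model to stop once the response is complete.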


In [None]:
# Dataset preparation
train_dataset, test_dataset = create_dataset(dataset_path="manfredmichael/quac-lamini-instruction-indo-2.6M", tokenizer=tokenizer, prompt_template=prompt_template)

Downloading readme:   0%|          | 0.00/601 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/205M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/253M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/256M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/2585614 [00:00<?, ? examples/s]

Map:   0%|          | 0/12929 [00:00<?, ? examples/s]
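
The example counts in the progress bars follow from the two splits in `create_dataset`. The sketch below reproduces them, assuming that the test side of a fractional split is rounded up and the train side takes the remainder; that rounding rule is an assumption about `train_test_split`, and `split_sizes` is a hypothetical helper.

```python
import math


def split_sizes(n_samples, test_fraction):
    """Return (n_train, n_test), rounding the test side up (assumed rule)."""
    n_test = math.ceil(test_fraction * n_samples)
    return n_samples - n_test, n_test


total_examples = 2_585_614  # "Generating train split" in the output above

# First split: keep only the 0.5% "test" side as the working subset
_, subset_size = split_sizes(total_examples, 0.005)

# Second split: hold out 10% of that subset for evaluation
train_size, eval_size = split_sizes(subset_size, 0.1)

print(subset_size, train_size, eval_size)  # 12929 11636 1293
```

These values agree with the 12,929 examples in the `Map` bar and with the 11,636 training and 1,293 evaluation examples that the trainer reports in the next section.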

## 3. Fine-tuning the LLM model
We'll wrap the fine-tuning setup in a function to keep the process structured.


In [5]:
def train_llm_model(train_dataset: Dataset,
                    eval_dataset: Dataset,
                    model: FastLanguageModel,
                    tokenizer: PreTrainedTokenizerFast,
                    max_steps: int = 500,
                    num_train_epochs: int = 1,
                    output_dir: str = "outputs",
                    logging_steps: int = 100,
                    eval_steps: int = 100,
                    eval_strategy: str = "steps",
                    save_total_limit: int = 2,
                    optimizer: str = "adamw_8bit",
                    lr: float = 2e-4,
                    lr_scheduler_type: str = "linear",
                    weight_decay: float = 0.01,
                    per_device_train_batch_size: int = 2,
                    per_device_eval_batch_size: int = 2,
                    gradient_accumulation_steps: int = 4,
                    seed: int = 3407,
                    dataset_text_field: str = "text",
                    dataset_num_proc: int = 2,
                    packing: bool = False) -> SFTTrainer:
    """
    Build an SFTTrainer for the given datasets and training parameters.

    Args:
        train_dataset (Dataset): The training dataset.
        eval_dataset (Dataset): The evaluation dataset.
        model (FastLanguageModel): The model to be trained.
        tokenizer (PreTrainedTokenizerFast): The tokenizer to use for processing the dataset.
        max_steps (int, optional): The total number of training steps; when set, it overrides num_train_epochs. Default is 500.
        num_train_epochs (int, optional): The number of training epochs. Default is 1.
        output_dir (str, optional): The directory where the training outputs will be saved. Default is "outputs".
        logging_steps (int, optional): The number of steps between logging training metrics. Default is 100.
        eval_steps (int, optional): The number of steps between evaluations. Default is 100.
        eval_strategy (str, optional): When to run evaluation. Default is "steps".
        save_total_limit (int, optional): The maximum number of checkpoints to keep. Default is 2.
        optimizer (str, optional): The optimizer to use. Default is "adamw_8bit".
        lr (float, optional): The learning rate. Default is 2e-4.
        lr_scheduler_type (str, optional): The type of learning rate scheduler. Default is "linear".
        weight_decay (float, optional): The weight decay rate. Default is 0.01.
        per_device_train_batch_size (int, optional): The batch size per device during training. Default is 2.
        per_device_eval_batch_size (int, optional): The batch size per device during evaluation. Default is 2.
        gradient_accumulation_steps (int, optional): The number of gradient accumulation steps. Default is 4.
        seed (int, optional): The random seed for reproducibility. Default is 3407.
        dataset_text_field (str, optional): The text field in the dataset. Default is "text".
        dataset_num_proc (int, optional): The number of processes to use for data loading. Default is 2.
        packing (bool, optional): Whether to enable packing. Default is False.

    Returns:
        SFTTrainer: The trainer instance configured with the specified training arguments.

    Example usage:
        trainer = train_llm_model(train_dataset, test_dataset, model, tokenizer)
    """

    train_args = TrainingArguments(max_steps=max_steps,
                                   num_train_epochs=num_train_epochs,
                                   output_dir=output_dir,
                                   logging_steps=logging_steps,
                                   eval_steps=eval_steps,
                                   eval_strategy=eval_strategy,
                                   save_total_limit=save_total_limit,
                                   optim=optimizer,
                                   learning_rate=lr,
                                   lr_scheduler_type=lr_scheduler_type,
                                   weight_decay=weight_decay,
                                   per_device_train_batch_size=per_device_train_batch_size,
                                   per_device_eval_batch_size=per_device_eval_batch_size,
                                   gradient_accumulation_steps=gradient_accumulation_steps,
                                   seed=seed,
                                   fp16=not is_bfloat16_supported(),
                                   bf16=is_bfloat16_supported())

    trainer = SFTTrainer(model=model,
                         tokenizer=tokenizer,
                         train_dataset=train_dataset,
                         eval_dataset=eval_dataset,
                         dataset_text_field=dataset_text_field,
                         dataset_num_proc=dataset_num_proc,
                         packing=packing,
                         args=train_args)

    return trainer

In [None]:
# Start fine-tuning model
trainer = train_llm_model(train_dataset, test_dataset, model, tokenizer)
trainer_stats = trainer.train()



Map (num_proc=2):   0%|          | 0/11636 [00:00<?, ? examples/s]

Map (num_proc=2):   0%|          | 0/1293 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 11,636 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 500
 "-____-"     Number of trainable parameters = 41,943,040


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100  | 0.9897        | 0.898316        |
| 200  | 0.8833        | 0.864511        |
| 300  | 0.8709        | 0.842469        |
| 400  | 0.8736        | 0.82893         |
| 500  | 0.8279        | 0.821915        |
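
Two quick checks help when reading this log. The sketch below first works out how much of the training set 500 steps actually cover, then converts the validation losses to perplexity. It assumes the reported loss is the mean per-token cross-entropy in nats, so that perplexity is `exp(loss)`; all numbers are copied from the output above.

```python
import math

# Values taken from the training log above
per_device_batch = 2
grad_accum = 4
num_gpus = 1
max_steps = 500
num_train_examples = 11_636

# One optimizer step consumes batch * accumulation * GPUs examples
effective_batch = per_device_batch * grad_accum * num_gpus
examples_seen = effective_batch * max_steps
epoch_fraction = examples_seen / num_train_examples
print(f"effective batch = {effective_batch}, examples seen = {examples_seen}")
print(f"fraction of one epoch = {epoch_fraction:.2f}")

# Validation loss per logged step, from the table above
val_losses = {100: 0.898316, 200: 0.864511, 300: 0.842469, 400: 0.82893, 500: 0.821915}

# Perplexity = exp(mean cross-entropy), assuming the loss is in nats
perplexities = {step: math.exp(loss) for step, loss in val_losses.items()}
for step, ppl in perplexities.items():
    print(f"step {step}: perplexity = {ppl:.3f}")
```

Because `max_steps` overrides `num_train_epochs`, the run stops after roughly a third of one pass over the training subset. The validation loss is still falling at step 500, so a longer run might improve it further.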


## 4. Evaluate fine-tuned model
We evaluate the model manually: we give it an instruction and check the response ourselves. If the model responds as expected, we're good to go.

In [6]:
# If we want to evaluate separately by loading the model and tokenizer
model, tokenizer = set_model(model_name="MarcoAland/Indo-Llama-3.1-8B-Instruct-bnb-4bit")

==((====))==  Unsloth 2024.8: Fast Llama patching. Transformers = 4.44.0.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.26.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/55.4k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/340 [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/168M [00:00<?, ?B/s]

Unsloth 2024.8 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


Create a function to run inference with the model

In [7]:
def inference_model(model: FastLanguageModel,
                    tokenizer: PreTrainedTokenizerFast,
                    prompt_template: str) -> None:
    """
    Run inference on the model using the given tokenizer and prompt template.

    Args:
        model (FastLanguageModel): The model to use for inference.
        tokenizer (Tokenizer): The tokenizer to use for processing the input.
        prompt_template (str): The template string to format the prompts.

    Example usage:
        inference_model(model, tokenizer, "{0}: {1}")
    """

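    # Switch the model to Unsloth's inference mode once, before the loop
    FastLanguageModel.for_inference(model)

    while True:
        # Type 'stop' to leave the loop
        instruction = input("Masukkan instruksi: ")
        if instruction.lower() == 'stop':
            break

        # Fill the template with an empty response so the model completes it
        inputs = tokenizer([prompt_template.format(instruction, "")], return_tensors="pt").to("cuda")
        # Stream tokens to stdout as they are generated
        text_streamer = TextStreamer(tokenizer)
        _ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=2048)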


In [11]:
# Start inference mode
inference_model(model, tokenizer, prompt_template)

Masukkan instruksi: Halo
<|begin_of_text|>Berikut adalah sebuah instruksi yang menjelaskan suatu tugas, disertai dengan sebuah masukan yang memberikan konteks lebih lanjut. Tulislah sebuah tanggapan yang sesuai untuk menyelesaikan permintaan tersebut dan dimulai dengan pendahuluan berupa penjelasan konteks terlebih dahulu.

### Instruksi:
Halo

### Respons:
Halo!<|eot_id|>
Masukkan instruksi: Jelaskan apa itu deep learning
<|begin_of_text|>Berikut adalah sebuah instruksi yang menjelaskan suatu tugas, disertai dengan sebuah masukan yang memberikan konteks lebih lanjut. Tulislah sebuah tanggapan yang sesuai untuk menyelesaikan permintaan tersebut dan dimulai dengan pendahuluan berupa penjelasan konteks terlebih dahulu.

### Instruksi:
Jelaskan apa itu deep learning

### Respons:
Deep learning adalah subdisiplin dari machine learning yang melibatkan penggunaan jaringan saraf tiruan yang terdiri dari lapisan (atau "lapisan") yang saling berhubungan untuk mengolah dan mengklasifikasikan dat
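
The streamed output echoes the whole prompt before the answer. When only the answer is needed, for example to hand it back to the RAG pipeline, one option is to post-process the decoded text. The sketch below is a plain string-handling approach; `extract_response` is a hypothetical helper, and the sample string is a shortened version of the first exchange above.

```python
RESPONSE_MARKER = "### Respons:"
# Special tokens visible in the streamed output above
SPECIAL_TOKENS = ("<|begin_of_text|>", "<|eot_id|>")


def extract_response(generated_text: str) -> str:
    """Return the text after the last response marker, without special tokens."""
    # rpartition yields the whole text as the last element if the marker is absent
    _, _, response = generated_text.rpartition(RESPONSE_MARKER)
    for token in SPECIAL_TOKENS:
        response = response.replace(token, "")
    return response.strip()


# Shortened sample: the instruction block and the model's answer
sample = "<|begin_of_text|>### Instruksi:\nHalo\n\n### Respons:\nHalo!<|eot_id|>"
print(extract_response(sample))  # Halo!
```

Splitting on the last marker keeps the helper safe even if an instruction itself happens to contain the marker text.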

## 5. Save the model to Hugging Face
To make the LLM easy to load later, we push it to the Hugging Face Hub.

Set the Hugging Face token

In [None]:
from huggingface_hub import notebook_login
notebook_login()


Push the model to the Hugging Face Hub in Safetensors format

In [None]:
save_model_name = "Indo-Llama-3.1-8B-Instruct-bnb-4bit"
model.push_to_hub(save_model_name)
tokenizer.push_to_hub(save_model_name)

README.md:   0%|          | 0.00/609 [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/168M [00:00<?, ?B/s]

Saved model to https://huggingface.co/Indo-Llama-3.1-8B-Instruct-bnb-4bit-v2


Push the model to the Hugging Face Hub in GGUF format

In [None]:
model.push_to_hub_gguf("MarcoAland/Indo-Llama-3.1-8B-Instruct-bnb-4bit-GGUF",
                       tokenizer,
                       quantization_method = ["q4_k_m"]
                       )

Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which will take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 5.7G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 6.94 out of 12.67 RAM for saving.


 47%|████▋     | 15/32 [00:02<00:01,  9.72it/s]We will save to Disk and not RAM now.
100%|██████████| 32/32 [01:20<00:00,  2.53s/it]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Unsloth: Saving MarcoAland/Indo-Llama-3.1-8B-Instruct-bnb-4bit/pytorch_model-00001-of-00004.bin...
Unsloth: Saving MarcoAland/Indo-Llama-3.1-8B-Instruct-bnb-4bit/pytorch_model-00002-of-00004.bin...
Unsloth: Saving MarcoAland/Indo-Llama-3.1-8B-Instruct-bnb-4bit/pytorch_model-00003-of-00004.bin...
Unsloth: Saving MarcoAland/Indo-Llama-3.1-8B-Instruct-bnb-4bit/pytorch_model-00004-of-00004.bin...
Done.


Unsloth: Converting llama model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp will take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits will take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m'] will take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: [0] Installing llama.cpp. This will take 3 minutes...
Unsloth: [1] Converting model at MarcoAland/Indo-Llama-3.1-8B-Instruct-bnb-4bit into f16 GGUF format.
The output location will be ./MarcoAland/Indo-Llama-3.1-8B-Instruct-bnb-4bit/unsloth.F16.gguf
This will take 3 minutes...
INFO:hf-to-gguf:Loading model: Indo-Llama-3.1-8B-Instruct-bnb-4bit
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'pytorch_model.bin.index.json'
INFO:hf-to-gguf:gguf: loading model part 'pytorch_model-00001-of-00004.bin'
INFO:hf-to-gguf:token_embd.weight,      

  0%|          | 0/1 [00:00<?, ?it/s]

unsloth.F16.gguf:   0%|          | 0.00/16.1G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/MarcoAland/Indo-Llama-3.1-8B-Instruct-bnb-4bit
Unsloth: Uploading GGUF to Huggingface Hub...


  0%|          | 0/1 [00:00<?, ?it/s]

unsloth.Q4_K_M.gguf:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/MarcoAland/Indo-Llama-3.1-8B-Instruct-bnb-4bit
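
The two upload sizes give a rough idea of what `q4_k_m` quantization saves. The sketch below estimates the average bits per weight from the file sizes shown in the progress bars, assuming those sizes are decimal gigabytes, that the F16 file stores two bytes per weight, and that metadata overhead is negligible. It is a back-of-the-envelope estimate, not an exact figure.

```python
# File sizes read from the upload progress bars above (assumed decimal GB)
F16_BYTES = 16.1e9
Q4_K_M_BYTES = 4.92e9

# F16 stores 2 bytes per weight, so the file size gives an approximate weight count
approx_params = F16_BYTES / 2

# Average storage cost of one weight in the quantized file
bits_per_weight = Q4_K_M_BYTES * 8 / approx_params

# How many times smaller the quantized file is
compression = F16_BYTES / Q4_K_M_BYTES

print(f"approx. parameters: {approx_params / 1e9:.2f}B")
print(f"approx. bits per weight (q4_k_m): {bits_per_weight:.2f}")
print(f"size reduction vs F16: {compression:.2f}x")
```

The estimate lands a little under 5 bits per weight, so the quantized file is roughly a third of the F16 size.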
