# Create Embedding Model for RAG [Triwira Data]

> **Note:** Due to limited GPU resources, this notebook fine-tunes the embedding model on a Google Colab T4 GPU.

In this notebook we fine-tune the embedding model for the RAG system. The model plays a significant role: given a query, it retrieves the context that is inserted into the chatbot prompt.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MarcoAlandAdinanda/AIC_TriwiraData/blob/main/notebooks/FineTune_Embedding_RAG.ipynb)
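
To make the role of the embedding model concrete, here is a minimal sketch of the retrieval step in a RAG pipeline: the query and each document chunk are represented as vectors, and the chunks most similar to the query are returned as context. The vectors are hand-written toy values, and `cosine_similarity` and `retrieve_context` are hypothetical helper names that do not appear elsewhere in this notebook. In a real pipeline the vectors would come from the embedding model.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_context(query_vec: Sequence[float],
                     chunk_vecs: List[Sequence[float]],
                     chunks: List[str],
                     top_k: int = 2) -> List[Tuple[str, float]]:
    """Return the top_k chunks ranked by similarity to the query."""
    scored = [(chunk, cosine_similarity(query_vec, vec))
              for chunk, vec in zip(chunks, chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional "embeddings" (made-up values for illustration only)
chunks = ["chunk about pricing", "chunk about shipping", "chunk about returns"]
chunk_vecs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.0, 0.2, 0.9]]
query_vec = [0.8, 0.2, 0.1]  # stands in for an embedded pricing question

top = retrieve_context(query_vec, chunk_vecs, chunks, top_k=2)
context = "\n".join(chunk for chunk, _ in top)  # text that would go into the prompt
print(context)
```

Fine-tuning aims to move queries and their relevant chunks closer together in this vector space, so that the ranking step returns better context.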

## 0. Get setup
Let's start by installing and importing the modules we need to fine-tune the embedding model.

Downloading modules

In [None]:
%%capture
!pip install transformers[torch]
!pip install -U sentence-transformers
!pip install datasets

Importing modules

In [None]:
# Typing modules
from typing import List, Tuple, Dict, Optional

# Model builder and fine-tune
from sentence_transformers import SentenceTransformer
from sentence_transformers import SentenceTransformerTrainer, SentenceTransformerTrainingArguments
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers

# Model evaluation
from sentence_transformers.evaluation import TripletEvaluator

# Dataset loader
import requests
from datasets import load_dataset
from datasets.arrow_dataset import Dataset



## 1. Build Embedding Model
Load the base embedding model from Hugging Face.

In [None]:
model = SentenceTransformer("BAAI/bge-m3")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



## 2. Get Data
Because the RAG system works in Indonesian, we fine-tune the embedding model on an Indonesian triplet dataset (anchor, positive, negative).

In [None]:
def create_dataset(dataset_path: str,
                  test_size: float = 0.1,
                  source: str = "csv") -> Tuple[Dataset, Dataset]:
    """
    Prepares the dataset by loading and splitting it into training and testing sets.

    Args:
        dataset_path (str): The path to the dataset file.
        test_size (float, optional): The proportion of the dataset to include in the test split. Default is 0.1.
        source (str, optional): The source format of the dataset ("csv" or other supported formats). Default is "csv".

    Returns:
        tuple: A tuple containing the training dataset and the testing dataset.

    Example usage:
        train_dataset, test_dataset = create_dataset("data/dataset.csv", test_size=0.2)
    """

    if source == "csv":
        dataset = load_dataset(source, data_files=dataset_path, split="train")
        dataset = dataset.remove_columns(["anchor", "positive", "negative"])
        dataset = dataset.rename_column("Translated_anchor", "anchor")
        dataset = dataset.rename_column("Translated_positive", "positive")
        dataset = dataset.rename_column("Translated_negative", "negative")
    else:
        dataset = load_dataset(dataset_path, split="train")

    # Split once into train and test sets, honoring the requested test_size
    split = dataset.train_test_split(test_size=test_size)

    return split["train"], split["test"]
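
The CSV branch of `create_dataset` relies on the three `Translated_*` columns being present. As a minimal sketch, the header can be checked with the standard library before loading, so that a wrong file is caught early. `missing_columns` is a hypothetical helper, and the sample rows are made up.

```python
import csv
import io
from typing import List, TextIO

# Column names taken from create_dataset above
REQUIRED_COLUMNS = ["Translated_anchor", "Translated_positive", "Translated_negative"]


def missing_columns(csv_file: TextIO) -> List[str]:
    """Return the required columns that are absent from the CSV header."""
    header = next(csv.reader(csv_file), [])  # empty file -> empty header
    return [name for name in REQUIRED_COLUMNS if name not in header]


# In-memory stand-in for the real file (row content is invented)
sample = io.StringIO(
    "anchor,positive,negative,Translated_anchor,Translated_positive,Translated_negative\n"
    "a,p,n,ta,tp,tn\n"
)
print(missing_columns(sample))  # prints []
```

With the real file, the same function can be called on `open(path, newline="", encoding="utf-8")`.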

In [None]:
from pathlib import Path

DATA_PATH = Path("/content/translated_new_data_50000.csv")
DATA_URL = "https://raw.githubusercontent.com/MarcoAlandAdinanda/AIC_TriwiraData/main/data/translated_new_data_50000.csv"

# Download the dataset only if it is not already on disk
if not DATA_PATH.exists():
    print("[INFO] Did not find datasets, downloading...")
    response = requests.get(DATA_URL, timeout=60)
    response.raise_for_status()  # fail loudly instead of saving an error page
    DATA_PATH.write_bytes(response.content)

# Create train test dataset
train_dataset, test_dataset = create_dataset(str(DATA_PATH))
print("[INFO] train test dataset successfully created")

[INFO] train test dataset successfully created


## 3. Fine-tuning the Embedding model
We wrap the fine-tuning setup in a function to keep the process structured.


In [None]:
def train_embedding_model(model: SentenceTransformer,
                          train_dataset: Dataset,
                          test_dataset: Dataset,
                          outdir: str = "model",
                          batch_size: int = 4,
                          max_steps: int = 500,
                          eval_steps: int = 100,
                          logging_steps: int = 100) -> SentenceTransformerTrainer:
    """
    Build a trainer that fine-tunes an embedding model on the provided training and testing datasets.

    Args:
        model (SentenceTransformer): The model to be trained.
        train_dataset (Dataset): The training dataset.
        test_dataset (Dataset): The testing dataset.
        outdir (str, optional): The output directory where the model checkpoints will be saved. Default is "model".
        batch_size (int, optional): The batch size to use during training. Default is 4.
        max_steps (int, optional): The total number of training steps. Default is 500.
        eval_steps (int, optional): The number of steps between evaluations. Default is 100.
        logging_steps (int, optional): The number of steps between logging. Default is 100.

    Returns:
        SentenceTransformerTrainer: The trainer instance configured with the specified training arguments.

    Example usage:
        trainer = train_embedding_model(model, train_dataset, test_dataset, outdir="model_output", batch_size=8, max_steps=5)
    """

    loss = MultipleNegativesRankingLoss(model)

    args = SentenceTransformerTrainingArguments(
        output_dir=outdir,
        max_steps=max_steps,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        warmup_ratio=0.1,
        batch_sampler=BatchSamplers.NO_DUPLICATES,
        eval_strategy="steps",
        eval_steps=eval_steps,
        logging_steps=logging_steps,
    )

    trainer = SentenceTransformerTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
        loss=loss
    )

    return trainer
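
`MultipleNegativesRankingLoss` treats each anchor's own positive as the correct answer and every other candidate in the batch (the other positives and all negatives) as wrong answers, then applies cross-entropy to scaled similarity scores. This is also why `BatchSamplers.NO_DUPLICATES` helps: a duplicated text in a batch would be counted as a wrong answer for its own anchor. The block below is a simplified pure-Python illustration of that idea, not the library's implementation. `mnr_loss` is a hypothetical name, the vectors are toy values, and the scale of 20 is assumed to match the library default.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mnr_loss(anchors: List[Sequence[float]],
             positives: List[Sequence[float]],
             negatives: List[Sequence[float]],
             scale: float = 20.0) -> float:
    """Mean cross-entropy where anchor i must pick positive i among all candidates."""
    candidates = list(positives) + list(negatives)
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine_similarity(anchor, c) for c in candidates]
        # Numerically stable log-sum-exp over all candidates
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the matching positive (index i) as the target
        total += log_sum_exp - logits[i]
    return total / len(anchors)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned_positives = [[0.9, 0.1], [0.1, 0.9]]  # each positive points toward its anchor
swapped_positives = [[0.1, 0.9], [0.9, 0.1]]  # positives paired with the wrong anchor
negatives = [[-1.0, 0.0], [0.0, -1.0]]

loss_aligned = mnr_loss(anchors, aligned_positives, negatives)
loss_swapped = mnr_loss(anchors, swapped_positives, negatives)
print(f"aligned: {loss_aligned:.4f}, swapped: {loss_swapped:.4f}")
```

The loss is lower when each anchor is closest to its own positive, which is the geometry the fine-tuning run pushes the model toward.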

In [None]:
trainer = train_embedding_model(model, train_dataset, test_dataset)
train_stats = trainer.train()

max_steps is given, it will override any value given in num_train_epochs


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100  | 0.7797        | 0.692468        |
| 200  | 0.6337        | 0.601753        |
| 300  | 0.6129        | 0.573742        |
| 400  | 0.5982        | 0.511561        |
| 500  | 0.5504        | 0.471948        |
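
The default arguments of `train_embedding_model` imply a fairly small training budget. The arithmetic below is a rough sketch that assumes a single device and no gradient accumulation, and it assumes the warmup length is the step count times `warmup_ratio`, rounded up.

```python
import math

max_steps = 500     # default in train_embedding_model
batch_size = 4      # default in train_embedding_model
warmup_ratio = 0.1  # set in SentenceTransformerTrainingArguments

# Assumption: one device, no gradient accumulation
triplets_seen = max_steps * batch_size
# Assumption: warmup steps = ceil(total steps * warmup_ratio)
warmup_steps = math.ceil(max_steps * warmup_ratio)

print(f"triplets seen during training: {triplets_seen}")
print(f"warmup steps: {warmup_steps}")
```

Under these assumptions the run sees 2,000 triplets, so raising `max_steps` or `batch_size` is the direct way to train on more of the data if GPU memory and time allow.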



## 4. Evaluate fine-tuned model
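The model is evaluated with `TripletEvaluator` from Sentence Transformers, which reports the fraction of test triplets in which the anchor is closer to the positive than to the negative. An accuracy close to 1 means the model separates positives from negatives as expected.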

In [None]:
def evaluate_model(model: SentenceTransformer,
                   test_dataset: Dataset,
                   name: str = "model-evaluation"):
    """
    Evaluate the model using the provided test dataset.

    Args:
        model (SentenceTransformer): The model to be evaluated.
        test_dataset (Dataset): The testing dataset containing "anchor", "positive", and "negative" columns.
        name (str, optional): The name for the evaluation. Default is "model-evaluation".

    Returns:
        dict: Evaluation results.

    Example usage:
        results = evaluate_model(model, test_dataset, name="evaluation")
    """

    test_evaluator = TripletEvaluator(
        anchors=test_dataset["anchor"],
        positives=test_dataset["positive"],
        negatives=test_dataset["negative"],
        name=name,
    )

    return test_evaluator(model)
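
As a rough illustration of what a triplet accuracy measures, the sketch below counts the triplets in which the anchor is more similar to the positive than to the negative, using cosine similarity. It is a simplified stand-in with toy vectors, not the evaluator's actual code, and `triplet_accuracy` is a hypothetical name.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def triplet_accuracy(anchors: List[Sequence[float]],
                     positives: List[Sequence[float]],
                     negatives: List[Sequence[float]]) -> float:
    """Fraction of triplets where the anchor is more similar to the positive."""
    correct = sum(
        cosine_similarity(a, p) > cosine_similarity(a, n)
        for a, p, n in zip(anchors, positives, negatives)
    )
    return correct / len(anchors)


# Three toy triplets: the first two are ranked correctly, the third is not
anchors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
positives = [[0.9, 0.1], [0.2, 0.8], [-1.0, 0.5]]
negatives = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.9]]

accuracy = triplet_accuracy(anchors, positives, negatives)
print(f"cosine triplet accuracy: {accuracy:.3f}")
```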

In [None]:
# Evaluate the fine-tuned model
eval_stats = evaluate_model(model, test_dataset)
eval_stats

{'model-evaluation_cosine_accuracy': 0.9636322566071832,
 'model-evaluation_dot_accuracy': 0.03636774339281681,
 'model-evaluation_manhattan_accuracy': 0.9625028235825616,
 'model-evaluation_euclidean_accuracy': 0.9636322566071832,
 'model-evaluation_max_accuracy': 0.9636322566071832}

## 5. Save Model to Hugging Face
To make the embedding model easy to load elsewhere, we push it to the Hugging Face Hub.

In [None]:
from huggingface_hub import notebook_login

notebook_login()


In [None]:
model.push_to_hub("Indo-bge-m3")

'https://huggingface.co/MarcoAland/Indo-bge-m3/commit/b299bcc411100c7b2b071fb2a42739c760f8ccd1'
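
Once pushed, the model can be loaded by its repository id in another notebook or service. The sketch below assumes the repository is public and takes the id from the commit URL above. `load_finetuned_model` and `embed` are hypothetical helper names.

```python
from typing import List

MODEL_ID = "MarcoAland/Indo-bge-m3"  # repository id taken from the commit URL above


def load_finetuned_model(model_id: str = MODEL_ID):
    """Download (or load from cache) the fine-tuned model from the Hub."""
    # Imported here so that defining the helper does not load the library
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(model_id)


def embed(texts: List[str], model):
    """Encode texts into unit-length vectors, so dot product equals cosine similarity."""
    return model.encode(texts, normalize_embeddings=True)


# Example usage (downloads about 2.27 GB of weights on first call):
# model = load_finetuned_model()
# vectors = embed(["Apa itu RAG?"], model)
```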