# Create RAG System [Triwira Data]

  > **Note:** Because of limited GPU resources, this notebook runs on a Google Colab T4 GPU to load the LLM and embedding models.
  
  In this notebook, we build a RAG system by combining models that have already been trained.

## 0. Get set up
Let's start by installing all the modules needed to build the RAG system.

Installing modules

In [1]:
%%capture
!pip install transformers[torch]
!pip install -U sentence-transformers
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes
!pip install transformers datasets

In [2]:
%%capture
!pip install llama-index
!pip install llama-index-embeddings-huggingface
!pip install peft
!pip install auto-gptq
!pip install optimum
!pip install bitsandbytes==0.42.0

Importing modules

In [3]:
# Typing modules
from typing import List, Tuple, Dict, Optional

# LLM model builder
from unsloth import FastLanguageModel
from transformers.tokenization_utils_fast import PreTrainedTokenizerFast

# Embedding model builder
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor

# Model output text streamer
from transformers import TextStreamer

# Dataset module
import os
import requests
import zipfile
from pathlib import Path

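# Prompt template for LLM input (kept in Indonesian because the LLM is Indonesian-tuned).
# English gloss: "Below is an instruction that describes a task, paired with an input that
# provides further context. Write a response that appropriately completes the request."
# The three slots are, in order: retrieved context, instruction ("Instruksi"), response ("Respons").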
prompt_template = """Berikut adalah sebuah instruksi yang menjelaskan suatu tugas, disertai dengan sebuah masukan yang memberikan konteks lebih lanjut. Tulislah sebuah tanggapan yang sesuai untuk menyelesaikan permintaan tersebut.

{}

### Instruksi:
{}

### Respons:
{}"""

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
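
The template above has three positional `{}` slots. In the RAG class defined later, they are filled with the retrieved context, the user instruction, and an empty string that the model is expected to complete. The following is a minimal sketch of that filling step; `demo_template` and `build_prompt` are hypothetical names used only for illustration, and the template is shortened so the block runs on its own.

```python
# A shortened, hypothetical stand-in for the notebook's prompt_template.
# It keeps the same three positional slots: context, instruction, response.
demo_template = """{}

### Instruksi:
{}

### Respons:
{}"""


def build_prompt(template: str, context: str, instruction: str) -> str:
    """Fill the slots in order; the response slot stays empty so the model completes it."""
    return template.format(context, instruction, "")


prompt = build_prompt(
    demo_template,
    context="Context:\nKetua Dewan Direksi: Bapak Andi Wijaya\n",
    instruction="Siapa ketua dewan direksi?",
)
print(prompt)
```

Because the last slot is empty, the prompt ends right after the `### Respons:` header, which is where generation starts.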


Download dataset

In [4]:
def download_data(source: str, destination: str, remove_zip: bool = False):
    """
    Downloads data from a specified source URL and extracts it to the destination directory.

    Args:
        source (str): The URL of the data to download.
        destination (str): The directory where the data will be extracted.
        remove_zip (bool, optional): Whether to remove the ZIP file after extraction. Default is False.

    Returns:
        Path: The path to the destination directory.

    Example usage:
        data_path = download_data("https://example.com/data.zip", "data", remove_zip=True)
    """

    data_path = Path(destination)
    zip_name = "data_pdf.zip"
    print(data_path)

    if data_path.is_dir():
        print(f"[INFO] {data_path} already exists")
    else:
        data_path.mkdir(parents=True, exist_ok=True)
        print(f"[INFO] Downloading {source}...")
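        # Fetch first and fail fast on HTTP errors, so an error page is never saved as the ZIP
        response = requests.get(source, timeout=60)
        response.raise_for_status()
        with open(zip_name, "wb") as f:
            f.write(response.content)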

        with zipfile.ZipFile(zip_name, 'r') as zip_ref:
            print(f"[INFO] Unzipping {data_path} data...")
            zip_ref.extractall(destination)

        if remove_zip:
            os.remove(zip_name)

    return data_path


In [5]:
data_path = download_data(source="https://github.com/MarcoAlandAdinanda/AIC_TriwiraData/raw/main/data_pdf.zip",
                          destination="data/")

data
[INFO] Downloading https://github.com/MarcoAlandAdinanda/AIC_TriwiraData/raw/main/data_pdf.zip...
[INFO] Unzipping data data...
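
The download step writes whatever the server returns to disk and then unzips it, so a broken or non-ZIP download only surfaces as an extraction error. Below is a minimal sketch of validating an archive before extracting it, using only the standard library. `safe_extract` is a hypothetical helper, not part of the notebook, and the demo builds a throwaway archive in a temporary directory so that it needs no network access.

```python
import tempfile
import zipfile
from pathlib import Path


def safe_extract(zip_path: Path, destination: Path) -> list:
    """Extract zip_path into destination after basic integrity checks."""
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is not a valid ZIP archive")
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        bad_member = zip_ref.testzip()  # first corrupt member, or None
        if bad_member is not None:
            raise ValueError(f"Corrupt member in archive: {bad_member}")
        destination.mkdir(parents=True, exist_ok=True)
        zip_ref.extractall(destination)
        return sorted(zip_ref.namelist())


# Demo on a throwaway archive (file names are made up for illustration)
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    archive = tmp_dir / "demo.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("demo.txt", "demo content")
    extracted = safe_extract(archive, tmp_dir / "data")
    extracted_exists = (tmp_dir / "data" / "demo.txt").is_file()
    print(extracted)
```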


## 1. Set LLM and embedding model
First, we define functions that load the LLM and the embedding model; the RAG module will call them later.

In [6]:
def set_llm_model(model_name: str,
                  max_seq_length: int = 2048,
                  load_in_4bit: bool = True,
                  dtype: Optional["torch.dtype"] = None) -> Tuple[FastLanguageModel, PreTrainedTokenizerFast]:
    """
    Load a model using unsloth's FastLanguageModel.

    Args:
        model_name (str): The name of the model to load, e.g. "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit".
        max_seq_length (int): The maximum sequence length for the model. Default is 2048.
        load_in_4bit (bool): Whether to load the model in 4-bit mode. Default is True.
        dtype (Optional[torch.dtype]): The data type to load the model with. Default is None (auto-detect).

    Returns:
        tuple: A tuple containing the loaded model and tokenizer.

    Example usage:
        model, tokenizer = set_llm_model("unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit")
    """

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_name,
        max_seq_length=max_seq_length,
        load_in_4bit=load_in_4bit,
        dtype=dtype,
    )
    return model, tokenizer


def set_embed_model(model_name: str,
                    chunk_size: int = 256,
                    chunk_overlap: int = 25) -> None:
    """
    Set the embedding model configuration.

    Args:
        model_name (str): The name of the model to use for embedding.
        chunk_size (int, optional): The size of the chunks to be used for embedding. Default is 256.
        chunk_overlap (int, optional): The overlap size between chunks. Default is 25.

    Returns:
        None

    Example usage:
        set_embed_model("bert-base-uncased", chunk_size=128, chunk_overlap=20)
    """

    Settings.llm = None
    Settings.embed_model = HuggingFaceEmbedding(model_name=model_name)
    Settings.chunk_size = chunk_size
    Settings.chunk_overlap = chunk_overlap
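
`chunk_size` and `chunk_overlap` control how documents are split before they are embedded: each chunk holds at most `chunk_size` units, and consecutive chunks share `chunk_overlap` units so that text near a boundary is not lost. The sketch below illustrates the idea with whitespace-separated words. It is only an approximation, since LlamaIndex's default splitter counts tokens and tries to respect sentence boundaries; `chunk_words` is a hypothetical helper.

```python
from typing import List


def chunk_words(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into windows of chunk_size words that share chunk_overlap words."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(sample, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

With ten words, a chunk size of 4, and an overlap of 1, the window advances three words at a time, so the last word of each chunk reappears as the first word of the next.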

## 2. Build RAG module
Create the RAG module using the model functions defined above.

In [8]:
class TriwiraDataRAG:
    """
    A class for retrieving and generating responses using a retrieval-augmented generation (RAG) approach.

    Attributes:
        top_k (int): The number of top similar documents to retrieve.
        query_engine (RetrieverQueryEngine): The engine to retrieve documents based on query similarity.
        llm_model (FastLanguageModel): The large language model for generating responses.
        llm_tokenizer (Tokenizer): The tokenizer for the language model.
        prompt_template (str): The template string to format the prompts.

    Methods:
        format_context(response): Formats the retrieved documents into a context string.
        query(query: str): Retrieves and formats the context for a given query.
        main(instruction: str): Generates a response from the model based on the given instruction.
    """

    def __init__(self,
                 llm_model: str = "MarcoAland/Indo-Llama-3.1-8B-Instruct-bnb-4bit",
                 embedding_model: str = "MarcoAland/Indonesian-bge-m3",
                 prompt_template: str = prompt_template,
                 docs_path: str = "data",
                 top_k: int = 3,
                 similarity_cutoff: float = 0.3):
        """
        Initializes the TriwiraDataRAG class.

        Args:
            llm_model (str, optional): The name of the large language model. Default is "MarcoAland/Indo-Llama-3.1-8B-Instruct-bnb-4bit".
            embedding_model (str, optional): The name of the embedding model. Default is "MarcoAland/Indonesian-bge-m3".
            prompt_template (str, optional): The template string to format the prompts.
            docs_path (str, optional): The path to the documents for building the vector index. Default is "data".
            top_k (int, optional): The number of top similar documents to retrieve. Default is 3.
            similarity_cutoff (float, optional): The similarity cutoff for filtering retrieved documents. Default is 0.3.
        """

        # Define embedding model in Settings module
        set_embed_model(model_name=embedding_model)

        # Set vector DB
        documents = SimpleDirectoryReader(docs_path).load_data()
        index = VectorStoreIndex.from_documents(documents)
        retriever = VectorIndexRetriever(
            index=index,
            similarity_top_k=top_k,
        )

        self.top_k = top_k
        self.query_engine = RetrieverQueryEngine(
            retriever=retriever,
            node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=similarity_cutoff)]
        )

        # Define llm model as self.llm_model
        self.llm_model, self.llm_tokenizer = set_llm_model(model_name=llm_model)
        FastLanguageModel.for_inference(self.llm_model)

        self.prompt_template = prompt_template

    def format_context(self, response):
        """
        Formats the retrieved documents into a context string.

        Args:
            response: The response from the query engine containing source nodes.

        Returns:
            str: The formatted context string.
        """
        context = "Context:\n"
        # The similarity cutoff can leave fewer than top_k nodes, so iterate over what was returned
        for node in response.source_nodes[:self.top_k]:
            context += node.text + "\n\n"
        return context

    def query(self, query: str):
        """
        Retrieves and formats the context for a given query.

        Args:
            query (str): The query string.

        Returns:
            str: The formatted context string.
        """
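        try:
            response = self.query_engine.query(query)
            return self.format_context(response)
        except Exception as error:
            # Fall back to an empty context, but report the failure instead of hiding it
            print(f"[WARN] Retrieval failed: {error}")
            return ""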

    def main(self, instruction: str):
        """
        Generates a response from the model based on the given instruction.

        Args:
            instruction (str): The instruction string.

        Returns:
            str: The generated response.
        """
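        context = self.query(query=instruction)
        prompt = self.prompt_template.format(context, instruction, "")
        inputs = self.llm_tokenizer([prompt], return_tensors="pt").to("cuda")
        text_streamer = TextStreamer(self.llm_tokenizer)
        output = self.llm_model.generate(**inputs, streamer=text_streamer, max_new_tokens=2048)

        # Decode only the newly generated tokens (everything after the prompt)
        prompt_length = inputs["input_ids"].shape[1]
        return self.llm_tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True)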
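
The retriever in the class above returns the `top_k` most similar chunks, and the similarity postprocessor then drops any chunk whose score falls below `similarity_cutoff`, so fewer than `top_k` chunks may reach the prompt. The following toy sketch shows that two-stage selection with hand-made 2-D vectors and cosine similarity. It does not use the embedding model or LlamaIndex, and all names and vectors are made up for illustration.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def retrieve(query_vec: List[float],
             store: Dict[str, List[float]],
             top_k: int = 3,
             similarity_cutoff: float = 0.3) -> List[Tuple[str, float]]:
    """Rank chunks by similarity, keep the top_k, then drop scores below the cutoff."""
    scored = sorted(((text, cosine(query_vec, vec)) for text, vec in store.items()),
                    key=lambda item: item[1], reverse=True)
    return [(text, score) for text, score in scored[:top_k] if score >= similarity_cutoff]


# Toy 2-D "embeddings" (made up for illustration)
store = {
    "board of directors": [1.0, 0.0],
    "executive management": [0.8, 0.6],
    "office address": [0.0, 1.0],
}
hits = retrieve([1.0, 0.0], store, top_k=3, similarity_cutoff=0.3)
context = "Context:\n" + "".join(text + "\n\n" for text, _ in hits)
print(hits)
```

Here three chunks are requested but only two survive the cutoff, which is why the context should be built from the nodes that were actually returned and not from a fixed count.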

Instantiate the RAG module

In [9]:
TestRAG = TriwiraDataRAG()

LLM is explicitly disabled. Using MockLLM.


==((====))==  Unsloth 2024.8: Fast Llama patching. Transformers = 4.43.4.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.26.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2024.8 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


## 3. Evaluate the RAG system
We evaluate the system manually by giving the model an instruction. If the model responds as expected, we're good to go.
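
A manual check like this could be made repeatable with a small keyword test. The sketch below is one possible approach and not part of the original notebook: `check_response` is a hypothetical helper, and the response string is a hand-written placeholder, not real model output.

```python
from typing import Dict, List


def check_response(response: str, expected_keywords: List[str]) -> Dict[str, bool]:
    """Report, per keyword, whether it appears in the response (case-insensitive)."""
    lowered = response.lower()
    return {keyword: keyword.lower() in lowered for keyword in expected_keywords}


# Hand-written placeholder standing in for a model answer (not real output)
placeholder_response = "Ketua dewan direksi adalah Bapak Andi Wijaya."
report = check_response(placeholder_response, ["Andi Wijaya", "Joko Prasetyo"])
passed = report["Andi Wijaya"] and not report["Joko Prasetyo"]
print(report, passed)
```

In practice the placeholder would be replaced by the text generated for each test instruction, with one keyword list per question.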

In [10]:
TestRAG.main("Siapa ketua dewan direksi?")  # "Who is the chairman of the board of directors?"

<|begin_of_text|>Berikut adalah sebuah instruksi yang menjelaskan suatu tugas, disertai dengan sebuah masukan yang memberikan konteks lebih lanjut. Tulislah sebuah tanggapan yang sesuai untuk menyelesaikan permintaan tersebut.

Context:
STRUKTUR PERUSAHAAN  TRIWIRA DATA  
1. Dewan Direksi (Board of Directors)  
• Ketua Dewan Direksi (Chairman of the Board): Bapak Andi Wijaya  
• Anggota Dewan Direksi (Board Members): Ibu Siti Nurhaliza, Bapak Bambang Susilo, Ibu Maria 
Gunawan  
2.

Manajemen Eksekutif (Executive Management)  
• Direktur Utama (Chief Executive Officer/CEO): Bapak Joko Prasetyo  
• Direktur Keuangan (Chief Financial Officer/CFO): Ibu Rina Kartika  
• Direktur Operasional (Chief Operating Officer/COO): Bapak Agus Setiawan  
• Direktur Teknologi (Chief Technology Officer/CTO): Bapak Aditya Nugroho  
• Direktur Pemasaran (Chief Marketing Officer/CMO): Ibu Diana Sari  
• Direktur Sumber Daya Manusia (Chief Human Resources Officer/CHRO): Ibu Lestari Wulandari  
3. Departemen