<a href="https://colab.research.google.com/github/ChristinaManara/Hands-on-LLMs-NLP-Transformers-Training/blob/main/RAG%20with%20LlamIndex%20and%20TruLens/app.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### RAG implementation using the sentence-window-retrieval technique with LlamaIndex and TrueLens for evaluating the application.

In this method, we retrieve based on smaller sentences to get a better match for the retrieval context and then synthesize based on the expanded context window around the sentence.

Import all necessary libraries.


In [None]:
import os
from dotenv import load_dotenv

load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
model = "gpt-4o-mini"

In [17]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

docs = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()
response = query_engine.query('What is a transformer?')
print(response)

ValueError: Directory data does not exist.

Now, we are using another model Mistal-7B, which is an open-source model.

In [9]:
!pip install llama-index-llms-llama-cpp llama-index-embeddings-huggingface

Collecting llama-index-embeddings-huggingface
  Downloading llama_index_embeddings_huggingface-0.3.1-py3-none-any.whl.metadata (718 bytes)
Collecting sentence-transformers>=2.6.1 (from llama-index-embeddings-huggingface)
  Downloading sentence_transformers-3.0.1-py3-none-any.whl.metadata (10 kB)
Collecting minijinja>=1.0 (from huggingface-hub[inference]>=0.19.0->llama-index-embeddings-huggingface)
  Downloading minijinja-2.2.0-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.8 kB)
Downloading llama_index_embeddings_huggingface-0.3.1-py3-none-any.whl (8.6 kB)
Downloading sentence_transformers-3.0.1-py3-none-any.whl (227 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m227.1/227.1 kB[0m [31m6.1 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading minijinja-2.2.0-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (861 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m861.9/861.9 kB[0m [31m34.1 MB/s[0m eta [36m0:00:00[0m
[?25hIn

In [14]:
import torch

from llama_index.llms.llama_cpp import LlamaCPP
from llama_index.llms.llama_cpp.llama_utils import messages_to_prompt, completion_to_prompt

llm = LlamaCPP(
    model_url='https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q8_0.gguf',
    model_path=None,
    temperature=0.1,
    max_new_tokens=256,
    context_window=4096,
    generate_kwargs={},
    model_kwargs={"n_gpu_layers": 1},
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)

Downloading url https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q8_0.gguf to path /tmp/llama_index/models/mistral-7b-instruct-v0.2.Q8_0.gguf
total size (MB): 7695.86


7340it [01:39, 74.00it/s]                          
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from /tmp/llama_index/models/mistral-7b-instruct-v0.2.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_co

We read the document from the directory and concatenate each page of the document into a single object.

In [19]:
from llama_index.core.schema import Document
from llama_index.core import SimpleDirectoryReader

docs = SimpleDirectoryReader("data").load_data()
docs = Document(text="\n\n".join([doc.text for doc in docs]))

In [41]:
from llama_index.core.node_parser import SentenceWindowNodeParser, SentenceSplitter
from llama_index.core import load_index_from_storage, ServiceContext
from llama_index.core import Settings
import os

def get_build_index(documents,llm,embed_model="local:BAAI/bge-small-en-v1.5",sentence_window_size=3,save_dir="./vector_store/index"):

  node_parser = SentenceWindowNodeParser (
      window_size = sentence_window_size,
      window_metadata_key = "window",
      original_text_metadata_key = "original_text"
  )

  Settings.llm = llm
  Settings.embed_model = embed_model
  Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=20)
  Settings.num_output = 512
  Settings.context_window = 3900

  if not os.path.exists(save_dir):
        # create and load the index
        index = VectorStoreIndex.from_documents(
            [documents]
        )
        index.storage_context.persist(persist_dir=save_dir)
  else:
      # load the existing index
      index = load_index_from_storage(
          ServiceContext.from_defaults(persist_dir=save_dir)
      )

  return index

In [42]:
# Vector store index
vector_index = get_build_index(documents=docs, llm=llm, embed_model="local:BAAI/bge-small-en-v1.5", sentence_window_size=3, save_dir="./vector_store/index")

The above function creates the query engine for the newly created vector index. It is possible to query every document using that query engine.

In [43]:
from llama_index.core.postprocessor import MetadataReplacementPostProcessor, SentenceTransformerRerank

def get_query_engine(sentence_index, similarity_top_k=6, rerank_top_n=2):
    postproc = MetadataReplacementPostProcessor(target_metadata_key="window")
    rerank = SentenceTransformerRerank(
        top_n=rerank_top_n, model="BAAI/bge-reranker-base"
    )
    engine = sentence_index.as_query_engine(
            similarity_top_k=similarity_top_k, node_postprocessors=[postproc, rerank]
    )

    return engine

In [44]:
query_engine = get_query_engine(sentence_index=vector_index, similarity_top_k=6, rerank_top_n=2)

config.json:   0%|          | 0.00/799 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.11G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/443 [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.1M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/279 [00:00<?, ?B/s]

The RAG pipeline is ready to response.

In [46]:
while True:
    query = input("Enter your query: ")
    response = query_engine.query(query)
    print(response)
    print("\n")


Enter your query: what is the purpose of lora?


Llama.generate: 1199 prefix-match hit, remaining 15 prompt tokens to eval

llama_print_timings:        load time =  175066.00 ms
llama_print_timings:      sample time =       9.17 ms /   186 runs   (    0.05 ms per token, 20274.69 tokens per second)
llama_print_timings: prompt eval time =    6955.03 ms /    15 tokens (  463.67 ms per token,     2.16 tokens per second)
llama_print_timings:        eval time =  149821.86 ms /   185 runs   (  809.85 ms per token,     1.23 tokens per second)
llama_print_timings:       total time =  156964.90 ms /   200 tokens


 LoRA (Layer-wise Relevance Adaptation) is a method introduced to make training more efficient and lower the hardware barrier to entry by up to 3 times when using adaptive optimizers. It achieves this by optimizing much smaller low-rank matrices instead of calculating gradients or maintaining optimizer states for most parameters. Additionally, LoRA allows for merging the trainable matrices with the frozen weights when deployed, introducing no inference latency compared to a fully fine-tuned model. It is orthogonal to many prior methods and can be combined with them, such as prefix-tuning. LoRA adds trainable pairs of rank decomposition matrices in parallel to existing weight matrices, and the number of trainable parameters depends on the rank and shape of the original weights. In the context provided, LoRA is applied to Wq and Wv in most experiments for simplicity.


Enter your query: what is transformer?


Llama.generate: 69 prefix-match hit, remaining 1165 prompt tokens to eval

llama_print_timings:        load time =  175066.00 ms
llama_print_timings:      sample time =      12.61 ms /   256 runs   (    0.05 ms per token, 20307.79 tokens per second)
llama_print_timings: prompt eval time =  385461.14 ms /  1165 tokens (  330.87 ms per token,     3.02 tokens per second)
llama_print_timings:        eval time =  204113.59 ms /   255 runs   (  800.45 ms per token,     1.25 tokens per second)
llama_print_timings:       total time =  589855.51 ms /  1420 tokens


 The Transformer is a model architecture proposed in the given work that uses an attention mechanism instead of recurrence to draw global dependencies between input and output. It allows for more parallelization and has achieved state-of-the-art translation quality after being trained for a relatively short time on multiple GPUs. The Transformer uses self-attention, which is an attention mechanism relating different positions of a single sequence, and is the first transduction model to rely entirely on self-attention for computing input and output representations without using sequence-aligned RNNs or convolution. The model architecture consists of an encoder-decoder structure with stacked self-attention and point-wise, fully connected layers for both the encoder and decoder. The encoder and decoder each have identical stacks of six layers, with the decoder having an additional sub-layer for performing multi-head attention over the output of the encoder stack. The attention function ma

KeyboardInterrupt: Interrupted by user