# Local Llama2 Document Search

This example demonstrates a retrieval-augmented generation (RAG) workflow that uses locally stored documents to augment the response to an LLM query. The entire workflow runs on locally hosted models and open-source utilities, which is especially useful when the data is sensitive (medical, legal, financial) and you would rather not send it to a third-party service such as OpenAI or Cohere for processing.

## Set up the RAG workflow environment

In [1]:
from pprint import pprint
import random
import sys

from llama_index.core import ServiceContext, set_global_service_context, set_global_handler, SimpleDirectoryReader
from llama_index.core.text_splitter import SentenceSplitter

sys.path.append("..")
from utils.hosting_utils import RAGLLM
from utils.rag_utils import RAGEmbedding
from utils.storage_utils import RAGIndex

Set up some helper functions:

In [2]:
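def pretty_print_docs(docs):
    """Print each document's text, separated by a horizontal rule."""
    separator = f"\n{'-' * 100}\n"
    print(
        separator.join(
            # LlamaIndex documents and nodes expose their text via get_content()
            f"Document {i + 1}:\n\n{d.get_content()}" for i, d in enumerate(docs)
        )
    )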

## Bring up a locally-hosted Llama2 model

Define the RAG configuration. Try modifying the configuration values below to see how they affect the output.

In [3]:
rag_cfg = {
    # Node parser config
    "chunk_size": 256,
    "chunk_overlap": 0,

    # Embedding model config
    "embed_model_type": "hf",
    "embed_model_name": "BAAI/bge-base-en-v1.5",

    # LLM config
    "llm_type": "local",
    "llm_name": "Llama-2-7b-chat-hf",
    "max_new_tokens": 256,
    "temperature": 1.0,
    "top_p": 1.0,
    "top_k": 50,
    "do_sample": False,

    # Vector DB config
    "vector_db_type": "chromadb",
    "vector_db_name": "local_llama2",

    # Retriever and query config
    "retriever_type": "vector_index", # "vector_index", "bm25"
    "retriever_similarity_top_k": 3,
    "query_mode": "hybrid", # "default", "hybrid"
    "hybrid_search_alpha": 0.5, # float from 0.0 (sparse search - bm25) to 1.0 (vector search)
    "response_mode": "compact",
}
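
To make the `hybrid_search_alpha` setting concrete, here is a minimal sketch of one common way a hybrid retriever blends the two signals: a weighted sum of a normalized BM25 (sparse) score and a vector-similarity (dense) score. The `blend_scores` function and the example scores are hypothetical and only illustrate the 0.0-to-1.0 convention described in the config comment; the actual formula is determined by the vector store and may differ.

```python
def blend_scores(sparse_score: float, dense_score: float, alpha: float) -> float:
    """Blend a sparse (BM25) and a dense (vector) score, both assumed to lie in [0, 1].

    alpha = 0.0 -> sparse only, alpha = 1.0 -> dense only.
    Hypothetical helper for illustration; not part of the notebook's utils.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0.0 and 1.0")
    return alpha * dense_score + (1.0 - alpha) * sparse_score


# Made-up scores for a single chunk.
sparse, dense = 0.2, 0.8
for alpha in (0.0, 0.5, 1.0):
    print(f"alpha={alpha}: blended score = {blend_scores(sparse, dense, alpha):.2f}")
```

With `alpha` at 0.5, as in the configuration above, both signals contribute equally under this scheme.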

Load a local Llama2 LLM for generation:

In [4]:
# TODO: Can we check if the model is already loaded?
llm = RAGLLM(rag_cfg['llm_type'], rag_cfg['llm_name']).load_model(**rag_cfg)

Loading local LLM model ...
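
One possible answer to the TODO above is to cache loaded models so that re-running the cell does not load a second copy into GPU memory. The sketch below is hypothetical: `get_or_load` and `_MODEL_CACHE` are illustrative names, and the stand-in `fake_loader` takes the place of the `RAGLLM(...).load_model(...)` call, whose internals are not shown in this notebook.

```python
from typing import Any, Callable, Dict, Tuple

# Hypothetical cache keyed by (llm_type, llm_name).
_MODEL_CACHE: Dict[Tuple[str, str], Any] = {}


def get_or_load(llm_type: str, llm_name: str, loader: Callable[[], Any]) -> Any:
    """Return the cached model for this key, calling `loader` only on a cache miss."""
    key = (llm_type, llm_name)
    if key not in _MODEL_CACHE:
        _MODEL_CACHE[key] = loader()
    return _MODEL_CACHE[key]


# Stand-in loader so the sketch runs without a GPU; it records each call.
load_calls = []


def fake_loader() -> object:
    load_calls.append(1)
    return object()


first = get_or_load("local", "Llama-2-7b-chat-hf", fake_loader)
second = get_or_load("local", "Llama-2-7b-chat-hf", fake_loader)
print(first is second, len(load_calls))
```

In the notebook, the loader could be `lambda: RAGLLM(rag_cfg['llm_type'], rag_cfg['llm_name']).load_model(**rag_cfg)`. The cache would need to live in an imported module, or in a cell that is not re-run, because re-executing the line that creates the dictionary resets it.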


## Start with a basic generation request without RAG augmentation

Let's start by asking Llama2 a difficult, domain-specific question that we don't expect it to be able to answer. A simple question like "*What is the capital of France?*" is not a good choice here, because that is basic knowledge we expect the LLM to have.

A good example would be an obscure detail buried deep within a company's annual report. For example:

"*How many Vector scholarships in AI were awarded in 2022?*"

In [5]:
query = "How many Vector scholarships in AI were awarded in 2022?"

Now, send the generation request.

In [6]:
response = llm.complete(query)
print(response)



According to the Vector Institute's website, in 2022, the Vector Institute awarded 15 scholarships in AI.


Without additional information, our local Llama2 model is unable to answer the question correctly, and it states its wrong answer with confidence. **Vector in fact awarded 109 AI scholarships in 2022**. Fortunately, that information is in Vector's 2021-22 Annual Report, which is stored in the `source_documents` folder. Let's see how we can use RAG to augment our question with a document search and get the correct answer.

## Ingestion: Load and store the documents from source_documents

Start by reading in all the PDF files from `source_documents`, then break them up into smaller, digestible chunks and encode them as vector embeddings.

In [7]:
# Load the pdfs
pdf_folder_path = "./source_documents"
documents = SimpleDirectoryReader(pdf_folder_path).load_data()
print(f"Number of source materials: {len(documents)}\n")
print(f"Example first source material:\n {documents[0]}\n")

Number of source materials: 42

Example first source material:
 Doc ID: 002ba37c-d166-4cc0-b7f7-d6c08e84a5a6
Text: 2021–2022  ANNUAL  REPORT



Load node parser to split documents into smaller chunks

In [8]:
node_parser = SentenceSplitter(chunk_size=rag_cfg['chunk_size'], chunk_overlap=rag_cfg['chunk_overlap'])
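
To illustrate what `chunk_size` and `chunk_overlap` control, here is a simplified sketch that splits on whitespace-separated words. This is not how `SentenceSplitter` works internally (it counts tokens and tries to respect sentence boundaries); `split_words` is a hypothetical helper meant only to show how the window and the overlap interact.

```python
from typing import List


def split_words(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Split text into chunks of at most `chunk_size` words.

    Consecutive chunks share `chunk_overlap` words. Simplified illustration only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = "one two three four five six seven eight nine ten"
print(split_words(sample, chunk_size=4, chunk_overlap=0))
print(split_words(sample, chunk_size=4, chunk_overlap=2))
```

A larger overlap produces more chunks and repeats text across chunk boundaries, which can help when a relevant sentence would otherwise be cut in half, at the cost of a larger index.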

Load embedding model

In [9]:
embed_model = RAGEmbedding(
    model_type=rag_cfg['embed_model_type'], 
    model_name=rag_cfg['embed_model_name']
).load_model()

Loading hf embedding model ...
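
Once chunks are embedded, vector retrieval amounts to finding the chunk vectors closest to the query vector. The sketch below uses tiny hand-made 3-dimensional vectors in place of real embeddings (which have hundreds of dimensions) and ranks them by cosine similarity. `top_k_by_cosine` is a hypothetical helper; in this workflow the vector database performs the search, and its similarity metric may differ.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_by_cosine(
    query: Sequence[float], chunk_vectors: Sequence[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Return (chunk index, similarity) for the k most similar chunks, best first."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(chunk_vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy stand-ins for chunk embeddings and a query embedding.
chunk_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_vector = [1.0, 0.2, 0.0]

# k=3 mirrors "retriever_similarity_top_k" in the configuration.
for index, score in top_k_by_cosine(query_vector, chunk_vectors, k=3):
    print(f"chunk {index}: similarity {score:.3f}")
```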


Use service context to set the node parser, embedding model, LLM, etc.

In [10]:
service_context = ServiceContext.from_defaults(
    node_parser=node_parser,
    embed_model=embed_model,
    llm=llm,
)

# Set it globally to avoid passing it to every class; this also applies inside rag_utils.py
set_global_service_context(service_context)



Create index using the appropriate vector store

In [11]:
index = RAGIndex(db_type=rag_cfg['vector_db_type'], db_name=rag_cfg['vector_db_name']).create_index(documents)

Creating new index ...


OutOfMemoryError: CUDA out of memory. Tried to allocate 34.00 MiB. GPU 0 has a total capacity of 11.90 GiB of which 9.00 MiB is free. Including non-PyTorch memory, this process has 11.88 GiB memory in use. Of the allocated memory 11.52 GiB is allocated by PyTorch, and 192.71 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
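
The error above is reported on a GPU with 11.90 GiB of memory. A rough back-of-the-envelope estimate shows why a 7B-parameter model leaves little or no headroom: weight memory is approximately the parameter count times the bytes per parameter. The sketch below assumes a nominal 7 billion parameters and a few common precisions; it ignores activations, the KV cache, and the embedding model, all of which need additional memory. How `RAGLLM` actually loads the weights (precision, quantization, offloading) is not shown in this notebook.

```python
GIB = 1024 ** 3


def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return num_params * bytes_per_param / GIB


NUM_PARAMS = 7e9  # nominal size of a "7B" model (assumption)
GPU_CAPACITY_GIB = 11.90  # total capacity reported in the error message

for label, nbytes in [("float32", 4), ("float16", 2), ("int8", 1), ("4-bit", 0.5)]:
    needed = weight_memory_gib(NUM_PARAMS, nbytes)
    verdict = "fits" if needed < GPU_CAPACITY_GIB else "does not fit"
    print(f"{label:>8}: ~{needed:.1f} GiB of weights ({verdict} in {GPU_CAPACITY_GIB} GiB)")
```

The error message also suggests setting `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`, but that targets fragmentation when a large amount of memory is reserved and unallocated. Here only 192.71 MiB is in that state, so a smaller or quantized model, or a larger GPU, is the more likely fix.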

## Now perform the RAG-augmented generation

In [None]:
query_engine = index.as_query_engine()
response = query_engine.query(query)
print(response)
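
Conceptually, a query engine like the one above retrieves the most relevant chunks and places them in the prompt alongside the question before calling the LLM. The sketch below shows that idea with a hypothetical `build_rag_prompt` helper and a hand-written prompt template. LlamaIndex uses its own templates and response synthesizers, so treat this as an illustration rather than a description of what `as_query_engine()` does internally. The chunk strings are placeholders, not quotes from the annual report.

```python
from typing import List


def build_rag_prompt(question: str, retrieved_chunks: List[str]) -> str:
    """Combine retrieved text chunks and the user question into one prompt."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        "Using only the context above, answer the question.\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholders standing in for the text of the retrieved chunks.
placeholder_chunks = [
    "<text of the first retrieved chunk>",
    "<text of the second retrieved chunk>",
]
prompt = build_rag_prompt(
    "How many Vector scholarships in AI were awarded in 2022?",
    placeholder_chunks,
)
print(prompt)
```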

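The RAG query cell was not run in this session because index creation failed with the CUDA out-of-memory error shown earlier. Make sure to use a GPU with enough onboard memory to hold the LLM and the embedding model and to handle the generation!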