# Local RAG with PDF and Llama 3.1

## Steps

1. Load the PDF
2. Chunk the PDF (split it into smaller pieces)
3. Embed each chunk
4. Create the vector database (index)
5. Query: retrieve the most relevant chunks from the vector database and answer with a Llama 3.1 model

In [None]:
# source: https://docs.llamaindex.ai/en/stable/examples/cookbooks/llama3_cookbook/

In [1]:
import os
import getpass

def _set_env(var: str):
    if not os.environ.get(var):
        os.environ[var] = getpass.getpass(f"{var}: ")

_set_env("HUGGING_FACE_TOKEN")

In [3]:
from transformers import AutoTokenizer

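tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",  # same checkpoint as the LLM loaded below
    token=os.environ.get("HUGGING_FACE_TOKEN"),  # the repository is gated
)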

stopping_ids = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]
# generate_kwargs parameters are taken from https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct

import torch
from llama_index.llms.huggingface import HuggingFaceLLM

# Optional quantization to 4bit
# import torch
# from transformers import BitsAndBytesConfig

# quantization_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_compute_dtype=torch.float16,
#     bnb_4bit_quant_type="nf4",
#     bnb_4bit_use_double_quant=True,
# )
hf_token = os.environ.get("HUGGING_FACE_TOKEN")
llm = HuggingFaceLLM(
    model_name="meta-llama/Meta-Llama-3.1-8B-Instruct",
    model_kwargs={
        "token": hf_token,
        "torch_dtype": torch.bfloat16,  # comment this line and uncomment below to use 4bit
        # "quantization_config": quantization_config
    },
    generate_kwargs={
        "do_sample": True,
        "temperature": 0.6,
        "top_p": 0.9,
    },
    tokenizer_name="meta-llama/Meta-Llama-3.1-8B-Instruct",
    tokenizer_kwargs={"token": hf_token},
    stopping_ids=stopping_ids,
)
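
The `generate_kwargs` above turn on sampling with `temperature=0.6` and `top_p=0.9`. As a rough illustration of what those two settings do, here is a minimal pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering. The function name `sample_top_p` and the toy logits are made up for this example, and the real implementation in `transformers` differs in detail.

```python
import math
import random

def sample_top_p(logits, temperature=0.6, top_p=0.9, rng=None):
    """Sample a token index using temperature scaling and nucleus (top-p) filtering."""
    rng = rng or random.Random()
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of most likely tokens whose probability mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Sample among the kept tokens; random.choices renormalises the weights.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

toy_logits = [4.0, 3.5, 1.0, -2.0]  # hypothetical scores for a 4-token vocabulary
draws = [sample_top_p(toy_logits, rng=random.Random(seed)) for seed in range(200)]
print(sorted(set(draws)))  # indices that were ever sampled
```

With these toy logits the two most likely tokens already cover more than 90% of the probability mass, so the two unlikely tokens are never sampled. Lowering `top_p` or `temperature` makes the output more deterministic; raising them makes it more varied.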





In [4]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

In [5]:
from llama_index.core import Settings

# bge embedding model
Settings.embed_model = embed_model

# Llama-3.1-8B-Instruct model
Settings.llm = llm

In [6]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader(
    input_files=["./pdfs/lora-paper.pdf"]
).load_data()

In [8]:
documents[:3]

[Document(id_='2f1a7fa4-0f74-42dc-b57d-4b2b0d1ef50f', embedding=None, metadata={'page_label': '1', 'file_name': 'lora-paper.pdf', 'file_path': 'pdfs/lora-paper.pdf', 'file_type': 'application/pdf', 'file_size': 1609513, 'creation_date': '2024-12-02', 'last_modified_date': '2023-01-23'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text='LORA: L OW-RANK ADAPTATION OF LARGE LAN-\nGUAGE MODELS\nEdward Hu∗Yelong Shen∗Phillip Wallis Zeyuan Allen-Zhu\nYuanzhi Li Shean Wang Lu Wang Weizhu Chen\nMicrosoft Corporation\n{edwardhu, yeshe, phwallis, zeyuana,\nyuanzhil, swang, luw, wzchen }@microsoft.com\nyuanzhil@andrew.cmu.edu\n(Version 2)\nABSTRACT\nAn important paradigm of natural language processing consists of large-scale pre-\ntraining on general domain data and 

In [9]:
index = VectorStoreIndex.from_documents(
    documents,
)
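
`VectorStoreIndex.from_documents` covers steps 2 to 4 in one call: it splits each document into chunks (nodes), embeds them with `Settings.embed_model`, and stores the vectors. To make the chunking step concrete, here is a minimal sketch of fixed-size chunking with overlap. It splits on whitespace rather than on tokens or sentences, and `chunk_words`, `chunk_size=8`, and `overlap=2` are hypothetical choices for illustration, not LlamaIndex's splitter or its defaults.

```python
def chunk_words(text, chunk_size=8, overlap=2):
    """Split text into windows of chunk_size words; neighbours share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks

sample = " ".join(f"w{i}" for i in range(20))  # hypothetical 20-word "page"
for chunk in chunk_words(sample):
    print(chunk)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.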

In [10]:
query_engine = index.as_query_engine(similarity_top_k=3)
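
`similarity_top_k=3` tells the retriever to return the three chunks whose embeddings are most similar to the query embedding. The sketch below shows the idea with a toy bag-of-words "embedding" and cosine similarity. The helpers `embed`, `cosine`, and `top_k` and the sample strings are invented for illustration; the notebook itself uses dense vectors from `BAAI/bge-small-en-v1.5`.

```python
import heapq
import math
from collections import Counter

def embed(text):
    """Toy embedding: a sparse bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query, chunks, k=3):
    """Return the k chunks most similar to the query as (score, chunk) pairs."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])

toy_chunks = [  # made-up strings, not text extracted from the PDF
    "lora freezes the pretrained weights",
    "lora injects trainable low rank matrices",
    "the weather was pleasant that day",
    "adapters add inference latency",
]
for score, chunk in top_k("what does lora train", toy_chunks, k=2):
    print(f"{score:.2f}  {chunk}")
```

A larger `k` gives the LLM more context to draw on but also makes the prompt longer and lets in less relevant chunks.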

In [11]:
response = query_engine.query("What is Lora?")

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
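
Behind `query_engine.query`, the retrieved chunks are placed in a prompt together with the question, and that prompt is sent to the LLM. A minimal sketch of this "context stuffing" step follows. The template wording and the `build_prompt` helper are assumptions for illustration, not LlamaIndex's exact prompt.

```python
def build_prompt(question, contexts):
    """Combine retrieved chunks and the user question into one prompt string."""
    context_block = "\n\n".join(contexts)
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{context_block}\n"
        "---------------------\n"
        "Using only the context above, answer the question.\n"
        f"Question: {question}\n"
        "Answer: "
    )

retrieved = ["chunk one text", "chunk two text"]  # placeholders for retrieved chunks
prompt = build_prompt("What is Lora?", retrieved)
print(prompt)
```

Because the answer is generated from this prompt, the quality of the retrieved chunks bounds the quality of the answer.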


In [12]:
from IPython.display import Markdown

Markdown(response.response)

 LoRA is a method that adapts large pre-trained language models to a specific task without fine-tuning the entire model. It does this by learning a set of additional parameters that are applied to the pre-trained model's weights, rather than updating the weights themselves. This allows LoRA to scale up to large models like GPT-3 while still achieving good performance on a variety of tasks. LoRA is particularly effective in low-data regimes and can be combined with other adaptation methods, such as prefix-embedding and prefix-layer tuning, to further improve performance. LoRA uses a measure of subspace similarity between the pre-trained model's weights and the task-specific weights to determine which weights to adapt. This allows LoRA to efficiently adapt the model to the task without overfitting or underfitting. LoRA has been shown to perform better than or at least on-par with other adaptation methods, including fine-tuning and prefix-based approaches, on a variety of tasks and datasets.  LoRA is a simple yet effective method for adapting large pre-trained language models to specific tasks, and it has the potential to be used in a wide range of applications.  LoRA has been shown to be particularly effective in low-data regimes and can be used to adapt large models like G

In [14]:
response.source_nodes

[NodeWithScore(node=TextNode(id_='b43bd3df-a5a9-463d-822d-5cad4bbc2c9a', embedding=None, metadata={'page_label': '8', 'file_name': 'lora-paper.pdf', 'file_path': 'pdfs/lora-paper.pdf', 'file_type': 'application/pdf', 'file_size': 1609513, 'creation_date': '2024-12-02', 'last_modified_date': '2023-01-23'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='c353776d-4b81-47f7-89db-c6e4ad9c3a7d', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'page_label': '8', 'file_name': 'lora-paper.pdf', 'file_path': 'pdfs/lora-paper.pdf', 'file_type': 'application/pdf', 'file_size': 1609513, 'creation_date': '2024-12-02', 'last_modified_date': '2023-01-23'}, hash='f8543297500daf61f96fec54cdb246a5c6172461d8c2e5c2fc3ad4957e08
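
The raw `NodeWithScore` repr is hard to read. A small helper can reduce each source node to its page label and similarity score. This is a sketch that relies only on the `score`, `node`, and `metadata` attributes visible in the output above; `summarize_sources` is a hypothetical name, and the demo uses stand-in objects with made-up scores so it runs without the model.

```python
from types import SimpleNamespace

def summarize_sources(source_nodes):
    """Return (page_label, score) pairs, highest score first."""
    rows = []
    for item in source_nodes:
        score = item.score if item.score is not None else 0.0
        rows.append((item.node.metadata.get("page_label", "?"), round(score, 3)))
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Stand-in objects; in the notebook you would pass response.source_nodes instead.
fake_nodes = [
    SimpleNamespace(score=0.61, node=SimpleNamespace(metadata={"page_label": "8"})),
    SimpleNamespace(score=0.66, node=SimpleNamespace(metadata={"page_label": "1"})),
]
print(summarize_sources(fake_nodes))
```

Checking which pages were retrieved is a quick way to judge whether an answer is grounded in the relevant part of the paper.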