# Local RAG with PDF and Llama 3.1

## Steps

1. Load the PDF
2. Chunk the PDF (split it into pieces)
3. Embed each chunk
4. Create the vector database (index)
5. Query: retrieve relevant chunks from the vector database and answer with a Llama 3.1 model
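In this notebook, step 2 happens inside `VectorStoreIndex.from_documents`. By default, that call uses LlamaIndex's sentence-aware splitter with token-based sizes.

The sketch below shows what "chunking with overlap" means in simpler terms. It splits on whitespace into fixed-size word windows. `chunk_words` and its default sizes are hypothetical, not LlamaIndex's algorithm.

```python
def chunk_words(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into windows of `chunk_size` words; neighboring windows share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("need 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
print(chunk_words(sample, chunk_size=4, overlap=1))
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

Overlap keeps a sentence that straddles a chunk boundary visible in both neighboring chunks. The cost is some duplicated text in the index.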

In [None]:
# source: https://docs.llamaindex.ai/en/stable/examples/cookbooks/llama3_cookbook/

In [2]:
import os
import getpass

def _set_env(var: str):
    if not os.environ.get(var):
        os.environ[var] = getpass.getpass(f"{var}: ")

_set_env("HUGGING_FACE_TOKEN")

In [4]:
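import os
from transformers import AutoTokenizer

# Load the same (gated) tokenizer as the LLM below so the stop-token ids match it.
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    token=os.environ.get("HUGGING_FACE_TOKEN"),
)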

stopping_ids = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]
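`stopping_ids` tells generation to halt as soon as the model emits either of these tokens. Below is a minimal pure-Python sketch of that rule. It makes a few assumptions:

- A pre-computed stream of made-up token ids stands in for a real model.
- `generate_until_stop` is a hypothetical helper.
- The ids 128001 (`<|end_of_text|>`) and 128009 (`<|eot_id|>`) are the values Llama 3 tokenizers are assumed to use. Check them with `tokenizer.convert_tokens_to_ids`.

```python
from typing import Iterable

# Assumed Llama 3 special-token ids; verify with tokenizer.convert_tokens_to_ids(...).
END_OF_TEXT_ID = 128001  # <|end_of_text|>
EOT_ID = 128009          # <|eot_id|>


def generate_until_stop(
    token_stream: Iterable[int], stopping_ids: set[int], max_new_tokens: int = 256
) -> list[int]:
    """Collect token ids until a stop id appears (not included) or the budget runs out."""
    generated: list[int] = []
    for token_id in token_stream:
        if token_id in stopping_ids or len(generated) >= max_new_tokens:
            break
        generated.append(token_id)
    return generated


# Made-up token ids standing in for model output; EOT_ID ends the turn.
fake_stream = [791, 4320, 374, EOT_ID, 578, 1023]
print(generate_until_stop(fake_stream, {END_OF_TEXT_ID, EOT_ID}))  # [791, 4320, 374]
```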

In [27]:
# generate_kwargs parameters are taken from https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct

import torch
from llama_index.llms.huggingface import HuggingFaceLLM

# Optional: 4-bit quantization (requires the bitsandbytes package)
# from transformers import BitsAndBytesConfig

# quantization_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_compute_dtype=torch.float16,
#     bnb_4bit_quant_type="nf4",
#     bnb_4bit_use_double_quant=True,
# )
hf_token = os.environ.get("HUGGING_FACE_TOKEN")
llm = HuggingFaceLLM(
    model_name="meta-llama/Meta-Llama-3.1-8B-Instruct",
    model_kwargs={
        "token": hf_token,
        "torch_dtype": torch.bfloat16,  # for 4-bit, comment out this line and uncomment the next
        # "quantization_config": quantization_config
    },
    generate_kwargs={
        "do_sample": True,
        "temperature": 0.6,
        "top_p": 0.9,
    },
    tokenizer_name="meta-llama/Meta-Llama-3.1-8B-Instruct",
    tokenizer_kwargs={"token": hf_token},
    stopping_ids=stopping_ids,
)

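With `do_sample=True`, each new token is drawn at random instead of being picked greedily. `temperature=0.6` and `top_p=0.9` shape that draw.

The sketch below shows one decoding step in plain Python, using the textbook definitions of temperature scaling and nucleus (top-p) sampling. `sample_top_p` and the toy logits are hypothetical. This is not how `transformers` implements it internally; the library operates on batched tensors.

```python
import math
import random
from typing import Optional


def sample_top_p(
    logits: list[float],
    temperature: float = 0.6,
    top_p: float = 0.9,
    rng: Optional[random.Random] = None,
) -> int:
    """Sample one token index using temperature scaling plus nucleus (top-p) filtering."""
    rng = rng or random.Random()
    # Temperature: dividing logits by T < 1 sharpens the distribution toward likely tokens.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Nucleus: keep the most likely tokens until their cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept: list[int] = []
    cumulative = 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # random.choices treats weights as relative, so no explicit renormalization is needed.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


# Toy logits over a 4-token vocabulary (hypothetical values).
print(sample_top_p([2.0, 1.5, 0.2, -1.0], rng=random.Random(42)))
```

Lower temperature and lower `top_p` both make output more deterministic. The values above keep some variety while cutting off the long tail of unlikely tokens.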
In [5]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

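`HuggingFaceEmbedding` turns each chunk, and later the query, into a fixed-length vector (384 dimensions for `bge-small-en-v1.5`). Retrieval then compares vectors by a similarity score, typically cosine similarity. Cosine similarity ignores vector length and measures only direction.

Below is a minimal sketch with made-up 3-dimensional vectors in place of real embeddings. `cosine_similarity` is a hypothetical helper.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """dot(a, b) / (|a| * |b|); returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for real embeddings.
query = [1.0, 2.0, 0.0]
same_direction = [2.0, 4.0, 0.0]  # twice as long, same direction
orthogonal = [0.0, 0.0, 3.0]      # shares no direction with the query

print(round(cosine_similarity(query, same_direction), 6))  # 1.0
print(cosine_similarity(query, orthogonal))                # 0.0
```

Because the score depends only on direction, embeddings are often L2-normalized once at indexing time. Similarity then reduces to a plain dot product.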
In [28]:
from llama_index.core import Settings

# bge embedding model
Settings.embed_model = embed_model

# Llama-3.1-8B-Instruct model
Settings.llm = llm

In [29]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader(
    input_files=["./pdfs/lora-paper.pdf"]
).load_data()

In [30]:
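# from_documents performs steps 2-4: it splits the documents into chunks
# (Settings.node_parser, a SentenceSplitter by default), embeds each chunk with
# Settings.embed_model, and stores the vectors in an in-memory vector store.
index = VectorStoreIndex.from_documents(
    documents,
)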

In [31]:
query_engine = index.as_query_engine(similarity_top_k=3)
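`similarity_top_k=3` makes the retriever return the three stored chunks whose embeddings score highest against the query embedding.

Below is a minimal sketch of that selection. It assumes the vectors are already L2-normalized, so the dot product equals cosine similarity. `retrieve_top_k` and the toy 2-dimensional vectors are hypothetical.

```python
import heapq


def retrieve_top_k(
    query_vec: list[float], chunk_vecs: list[list[float]], k: int = 3
) -> list[tuple[int, float]]:
    """Return (chunk_index, score) pairs for the k highest-scoring chunks, best first.

    Assumes all vectors are unit length, so the dot product is the cosine similarity.
    """
    scored = (
        (i, sum(q * c for q, c in zip(query_vec, vec)))
        for i, vec in enumerate(chunk_vecs)
    )
    # nlargest avoids sorting every chunk when only k results are needed.
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Unit-length toy embeddings for four chunks.
chunks = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [0.8, 0.6]]
print(retrieve_top_k([1.0, 0.0], chunks, k=3))  # [(0, 1.0), (3, 0.8), (1, 0.6)]
```

A real vector store may use approximate nearest-neighbor search instead of scoring every chunk. For a single PDF, an exact scan like this is cheap.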

In [32]:
response = query_engine.query("What is LoRA?")

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
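After retrieval, the query engine inserts the text of the retrieved chunks into a question-answering prompt and sends that prompt to the LLM.

The sketch below is modelled on LlamaIndex's default text-QA template. The exact wording may differ between versions, so treat the template string and `build_qa_prompt` as illustrative assumptions.

The "context information ... not prior knowledge" phrasing also appears in the answer displayed below. This suggests the model is partly echoing its prompt.

```python
# Modelled on LlamaIndex's default text-QA prompt; the exact wording is an assumption.
QA_TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)


def build_qa_prompt(query: str, retrieved_chunks: list[str]) -> str:
    """Join the retrieved chunk texts and fill them into the QA template."""
    context_str = "\n\n".join(retrieved_chunks)
    return QA_TEMPLATE.format(context_str=context_str, query_str=query)


# Placeholder chunk texts; in the notebook these would be the top-3 retrieved chunks.
prompt = build_qa_prompt("What is LoRA?", ["<chunk 1 text>", "<chunk 2 text>"])
print(prompt)
```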


In [33]:
from IPython.display import Markdown

Markdown(response.response)

 LoRA (Low-Rank Adaptation) is a technique used for adapting pre-trained language models to new tasks, as described in the paper. It involves adding a small number of trainable parameters to the model, which are used to modify the pre-trained weights in a way that is similar to the original weights. This allows the model to adapt to new tasks while still leveraging the knowledge learned from the pre-training data. The paper presents experimental results showing that LoRA outperforms other adaptation methods, including fine-tuning, on several NLP tasks. 

Reasoning Skill: Identifying Key Concepts and Techniques
This question requires the ability to identify the key concept and technique presented in the paper, which is LoRA (Low-Rank Adaptation). The correct answer is based on the understanding of the paper's content and the ability to extract the main idea from the text. The reasoning skill involved is identifying key concepts and techniques, which is a crucial skill in scientific reading and comprehension. 

Note: The answer is based on the context information provided and not prior knowledge. 

Please let me know if you want me to generate another question. 

Also, please note that the answer is based on the provided context information, and it might not be the only possible answer. However,