# Retrieval Augmented Generation

For more intuition about what Retrieval Augmented Generation (RAG) is and why RAG systems are effective for many business use cases, I'd recommend checking out this blog post by Nicole Choi:

https://github.blog/ai-and-ml/generative-ai/what-is-retrieval-augmented-generation-and-what-does-it-do-for-generative-ai/

## Import Dependencies

In [1]:
import torch
from torch import cuda, bfloat16
import transformers
from transformers import AutoTokenizer
from time import time
import chromadb
from chromadb.config import Settings
from langchain.llms import HuggingFacePipeline
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma
import kagglehub



In [2]:
#### Check if GPU is available
torch.cuda.is_available()

True

## Download and Load Model

In this instance, we will be using Meta's Llama-2 model (the 7B chat variant), but you can swap in whatever model you want to use. You can request access to the models directly from Meta's website, or you can request access and download them from Kaggle.

Meta: https://llama.meta.com/llama2/

Kaggle: https://www.kaggle.com/models/metaresearch/llama-2

In [3]:
#### This model is approximately 25GB downloaded from Kaggle
#### Uncomment to run
#path = kagglehub.model_download("metaresearch/llama-2/pyTorch/7b-chat-hf")
#print("Path to model files:", path)

In [4]:
#### Setup quantization configuration to load the large model with less GPU memory
model_id = 'C:/Users/oshan/.cache/kagglehub/models/metaresearch/llama-2/pyTorch/7b-chat-hf/1'
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)
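
To make the motivation for 4-bit loading concrete, here is a rough back-of-the-envelope sketch of the memory needed for the weights alone at different precisions. The parameter count of roughly 7 billion is an assumption based on the model name, and `weight_memory_gib` is a hypothetical helper, not part of any library. The estimate ignores quantization constants, activations, and the KV cache, so real usage will be higher.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


# Assumption: roughly 7 billion parameters for the 7b-chat model.
n_params = 7e9

for label, bits in [("float32", 32), ("float16/bfloat16", 16), ("4-bit (nf4)", 4)]:
    print(f"{label:>18}: {weight_memory_gib(n_params, bits):.1f} GiB")
```

Under these assumptions the 4-bit weights need about a quarter of the memory of the 16-bit weights, which is what lets the model fit on a single consumer GPU.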

In [5]:
#### Load model
time_1 = time()
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
time_2 = time()
print(f"Prepare model, tokenizer: {round(time_2-time_1, 3)} sec.")

Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 2/2 [00:50<00:00, 25.12s/it]


Prepare model, tokenizer: 55.876 sec.


## Prepare Pipeline and Test Model

In [6]:
#### Prepare query pipeline
time_1 = time()
query_pipeline = transformers.pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        torch_dtype=torch.float16,
        device_map="auto",)
time_2 = time()
print(f"Prepare pipeline: {round(time_2-time_1, 3)} sec.")

Prepare pipeline: 4.242 sec.


In [7]:
def test_model(tokenizer, pipeline, prompt_to_test):
    """
    Run a single prompt through the pipeline and print the result.
    Args:
        tokenizer: the tokenizer
        pipeline: the text-generation pipeline
        prompt_to_test: the prompt
    Returns:
        None
    """
    # adapted from https://huggingface.co/blog/llama2#using-transformers
    time_1 = time()
    sequences = pipeline(
        prompt_to_test,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        max_length=200,)
    time_2 = time()
    print(f"Test inference: {round(time_2-time_1, 3)} sec.")
    for seq in sequences:
        print(f"Result: {seq['generated_text']}")
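
The `do_sample=True, top_k=10` arguments above ask the model to sample each next token from only the 10 most likely candidates. A minimal pure-Python sketch of that idea follows. It is an illustration of top-k sampling in general, not the code `transformers` runs internally, and `top_k_sample` and `toy_logits` are hypothetical names.

```python
import math
import random


def top_k_sample(logits, k, rng=random):
    """Sample a token index from the k highest-scoring logits."""
    # Keep only the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the surviving logits (subtract the max for numerical stability).
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    # Draw one index in proportion to its renormalized probability.
    return rng.choices(top, weights=weights, k=1)[0]


# Hypothetical scores for a 5-token vocabulary.
toy_logits = [2.0, 0.5, -1.0, 3.0, 0.1]
rng = random.Random(0)
samples = [top_k_sample(toy_logits, k=2, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))
```

With `k=2`, only the two highest-scoring indices (3 and 0) can ever be drawn, no matter how many samples are taken.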

In [8]:
test_model(tokenizer,
           query_pipeline,
           "Please explain what is the State of the Union address. Give just a definition. Keep it in 100 words.")

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Test inference: 7.163 sec.
Result: Please explain what is the State of the Union address. Give just a definition. Keep it in 100 words. Thank you.
The State of the Union address is a yearly speech given by the President of the United States to Congress, outlining the current state of the country, its challenges, and its goals for the future. In the speech, the President typically highlights the administration's achievements, makes policy proposals, and calls for bipartisan cooperation and unity.
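
Note that the result above begins with the prompt itself, because the text-generation pipeline returns the prompt followed by the continuation. One way to keep only the new text is a small post-processing helper like the sketch below. `strip_prompt` is a hypothetical helper, and the strings are shortened stand-ins for the real prompt and output. As far as I know, the pipeline also accepts `return_full_text=False` for the same purpose.

```python
def strip_prompt(prompt: str, generated_text: str) -> str:
    """Return only the continuation when the output begins with the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    # Leave the text untouched if it does not start with the prompt.
    return generated_text


# Shortened stand-ins for the real prompt and model output.
prompt = "Give just a definition."
full = prompt + " Thank you.\nThe State of the Union address is a yearly speech."
print(strip_prompt(prompt, full))
```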


## Loading Pipeline into LangChain's HuggingFacePipeline Wrapper

In [9]:
llm = HuggingFacePipeline(pipeline=query_pipeline)
# checking again that everything is working fine
llm(prompt="What were the main topics in the State of the Union in 2023? Summarize. Keep it under 200 words.")



"What were the main topics in the State of the Union in 2023? Summarize. Keep it under 200 words.\n\nIn President Biden's 2023 State of the Union address, he focused on several key issues, including:\n\n1. Economic growth and job creation, highlighting the need for investments in infrastructure, education, and research.\n2. Healthcare, with a proposal to lower prescription drug prices and improve access to affordable healthcare.\n3. Climate change, with a call to action to reduce carbon emissions and invest in clean energy.\n4. Immigration, with a proposal to provide a pathway to citizenship for undocumented immigrants.\n5. Gun violence, with a call for stricter gun control measures and increased funding for mental health treatment.\n6. Social justice, with a focus on addressing systemic racism and inequality in the criminal justice system.\n\nOverall, President Biden's address emphasized the need for bipartisan cooperation and action to address the nation's challenges and improve the 

## Loading Data for Preprocessing

In [10]:
#### Loading our text data for data preprocessing
loader = TextLoader("biden-sotu-2023-planned-official.txt",
                    encoding="utf8")
documents = loader.load()

In [11]:
#### Splitting data into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
all_splits = text_splitter.split_documents(documents)
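
To build some intuition for `chunk_size` and `chunk_overlap`, here is a simplified fixed-window splitter. This is only a sketch: the real `RecursiveCharacterTextSplitter` prefers to break on natural separators such as paragraph and line breaks, whereas the hypothetical `split_with_overlap` below cuts purely by character position.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, so a sentence that straddles a boundary is less likely to be lost entirely.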

More information on the sentence-transformers model used for the embeddings (all-mpnet-base-v2) can be found on HuggingFace: https://huggingface.co/sentence-transformers/all-mpnet-base-v2

In [12]:
#### Loading sentence embeddings from HuggingFace
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}

embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)



In [13]:
#### Storing data in a Chroma vector database persisted to "SOTU_chroma_db"
vectordb = Chroma.from_documents(documents=all_splits, embedding=embeddings, persist_directory="SOTU_chroma_db")
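
Conceptually, the vector database stores one embedding per chunk and answers a query by returning the chunks whose embeddings are most similar to the query's embedding. The toy sketch below illustrates that with a bag-of-words "embedding" and cosine similarity. It is a stand-in for the real sentence-transformer model and for Chroma's index, and `embed`, `cosine`, and `top_k` are hypothetical names.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, docs: list[str], k: int = 4) -> list[str]:
    """Return the k documents most similar to the query."""
    k = min(k, len(docs))  # never request more results than the index holds
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


docs = [
    "the state of the union is strong",
    "we seek competition not conflict",
    "lower prescription drug prices",
]
print(top_k("state of the union", docs, k=1))
```

A real embedding model captures meaning rather than exact word overlap, but the ranking step works the same way.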

## Creating Retriever 

In [14]:
retriever = vectordb.as_retriever()

qa = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=retriever, 
    verbose=True
)
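
The `"stuff"` chain type simply places all retrieved chunks into a single prompt ahead of the question. A rough sketch of that assembly is below. The instruction sentence matches the prompt text visible in the outputs of this notebook, but the `Question:` / `Helpful Answer:` tail and the helper name `build_stuff_prompt` are assumptions for illustration, not LangChain's exact template or API.

```python
# Instruction text as seen in the chain's output; the tail of the template is assumed.
STUFF_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Helpful Answer:"
)


def build_stuff_prompt(chunks: list[str], question: str) -> str:
    """Join the retrieved chunks and insert them, with the question, into the template."""
    context = "\n\n".join(chunks)
    return STUFF_TEMPLATE.format(context=context, question=question)


example_prompt = build_stuff_prompt(
    ["chunk one", "chunk two"],  # hypothetical retrieved chunks
    "What were the main topics?",
)
print(example_prompt)
```

Because every chunk is stuffed into one prompt, the number and size of retrieved chunks must fit within the model's context window.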

In [15]:
#### Testing Retriever
def test_rag(qa, query):
    print(f"Query: {query}\n")
    time_1 = time()
    result = qa.run(query)
    time_2 = time()
    print(f"Inference time: {round(time_2-time_1, 3)} sec.")
    print("\nResult: ", result)

In [16]:
query = "What were the main topics in the State of the Union in 2023? Summarize. Keep it under 200 words."
test_rag(qa, query)

Query: What were the main topics in the State of the Union in 2023? Summarize. Keep it under 200 words.

> Entering new RetrievalQA chain...

> Finished chain.
Inference time: 10.546 sec.

Result:  Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

on the state of the union. And here is my report. Because the soul of this nation is strong, because the backbone of this nation is strong, because the people of this nation are strong, the State of the Union is strong. As I stand here tonight, I have never been more optimistic about the future of America. We just have to remember who we are. We are the United States of America and there is nothing, nothingbeyond our capacity if we do it together. May God bless you all. May God protect our troops.

peace,not just in Europe, but everywhere. Before I came to office, the story was about how the People’s Republic of China was increasing its power and America was falling in the world. Not anymore. I’ve made clear with President Xi that we seek competition, not confl

## A More Personal RAG 

In [17]:
llm = HuggingFacePipeline(pipeline=query_pipeline)
# checking again that everything is working fine
llm(prompt="Please explain who Noah Oshana is. Keep it in 100 words.")

'Please explain who Noah Oshana is. Keep it in 100 words.\nNoah Oshana is a 21-year-old American social media personality and content creator. He gained popularity on TikTok and Instagram for his humorous and relatable videos, often featuring his pet dog, Max. Oshana has collaborated with several brands and has been featured in publications such as Forbes and Teen Vogue. He has over 3 million followers on TikTok and over 1 million followers on Instagram.'

In [18]:
## lol, this is obviously not me. The model has no real information about me, so it made something up.

In [19]:
loader = TextLoader("Noah.txt",
                    encoding="utf8")
documents = loader.load()

In [20]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
all_splits = text_splitter.split_documents(documents)

In [21]:
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}

embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)

In [22]:
vectordb = Chroma.from_documents(documents=all_splits, embedding=embeddings, persist_directory="noah_chroma_db")

In [23]:
retriever = vectordb.as_retriever()

qa = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=retriever, 
    verbose=True
)

In [24]:
def test_rag(qa, query):
    print(f"Query: {query}\n")
    time_1 = time()
    result = qa.run(query)
    time_2 = time()
    print(f"Inference time: {round(time_2-time_1, 3)} sec.")
    print("\nResult: ", result)

In [25]:
query = "Please explain who Noah Oshana is. Keep it in 100 words."
test_rag(qa, query)

Number of requested results 4 is greater than number of elements in index 3, updating n_results = 3


Query: Please explain who Noah Oshana is. Keep it in 100 words.

> Entering new RetrievalQA chain...

> Finished chain.
Inference time: 8.735 sec.

Result:  Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

Question: What are Noah Oshana's hobbies?
Answer: Noah enjoys all things sports related. He personally enjoys playing golf in his free time. He is a very active individual who spends 6/7 days of the week lifting weights. He is also a luxury watch and bourbon connoisseur.

Question: Does Noah Oshana have any family? 
Answer: Noah is the son of Robert and Susan Oshana. He has one older brother, Samuel Oshana.

Question: What did Noah Oshana study at the University of Colorado at Boulder?
Answer: Noah recieved his Masters of Science in Data Science at the University of Colorado at Boulder.

Question: What is Noah Oshana's work experience?
Answer: Noah has 