<a href="https://colab.research.google.com/github/Sahana-Lakshmipathy/RAG-Project-Implementation/blob/main/RAG_using_Llama_2%2C_Langchain_and_ChromaDB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# IMPORTANT: SOME KAGGLE DATA SOURCES ARE PRIVATE
# RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES.
import kagglehub
kagglehub.login()


In [None]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.

sahanalakshmipathy_natural_disasters_txt_path = kagglehub.dataset_download('sahanalakshmipathy/natural-disasters-txt')
metaresearch_llama_2_pytorch_7b_chat_hf_1_path = kagglehub.model_download('metaresearch/llama-2/PyTorch/7b-chat-hf/1')

print('Data source import complete.')


# Installations, imports, utils

In [None]:
!pip install transformers==4.33.0 accelerate==0.22.0 einops==0.6.1 langchain==0.0.300 xformers==0.0.21 \
bitsandbytes==0.41.1 sentence_transformers==2.2.2 chromadb==0.4.12

Collecting einops==0.6.1
  Downloading einops-0.6.1-py3-none-any.whl (42 kB)
Collecting langchain==0.0.300
  Downloading langchain-0.0.300-py3-none-any.whl (1.7 MB)
Collecting xformers==0.0.21
  Downloading xformers-0.0.21-cp310-cp310-manylinux2014_x86_64.whl (167.0 MB)
Collecting bitsandbytes==0.41.1
  Downloading bitsandbytes-0.41.1-py3-none-any.whl (92.6 MB)
Collecting sentence_transformers==2.2.2
  Downloading sentence-transformers-2

In [None]:
from torch import cuda, bfloat16
import torch
import transformers
from transformers import AutoTokenizer
from time import time
#import chromadb
#from chromadb.config import Settings
from langchain.llms import HuggingFacePipeline
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma


# Initialize model, tokenizer, query pipeline

Define the model, the device, and the `bitsandbytes` configuration.

In [None]:
model_id = '/kaggle/input/llama-2/pytorch/7b-chat-hf/1'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)
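
To see why 4-bit loading matters, the sketch below estimates the memory needed for the weights alone at different precisions. The parameter count (about 7 billion) and the helper name `weight_memory_gb` are assumptions for illustration. Real usage is higher, because quantization constants, activations, and the KV cache also take memory.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


N_PARAMS = 7e9  # assumption: roughly 7 billion parameters

for label, bits in [("float16", 16), ("int8", 8), ("nf4 (4-bit)", 4)]:
    print(f"{label:>12}: {weight_memory_gb(N_PARAMS, bits):.1f} GiB")
```

Under these assumptions, 4-bit weights need about a quarter of the memory of float16 weights, which is what lets a 7B model fit on a single modest GPU.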

Prepare the model and the tokenizer.

In [None]:
time_1 = time()
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
time_2 = time()
print(f"Prepare model, tokenizer: {round(time_2-time_1, 3)} sec.")



Prepare model, tokenizer: 190.024 sec.


Define the query pipeline.

In [None]:
time_1 = time()
query_pipeline = transformers.pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        torch_dtype=torch.float16,
        device_map="auto",)
time_2 = time()
print(f"Prepare pipeline: {round(time_2-time_1, 3)} sec.")

Prepare pipeline: 1.614 sec.


We define a function for testing the pipeline.

In [None]:
def test_model(tokenizer, pipeline, prompt_to_test):
    """
    Run a query through the pipeline and print the result.

    Args:
        tokenizer: the tokenizer
        pipeline: the text-generation pipeline
        prompt_to_test: the prompt
    Returns:
        None
    """
    # adapted from https://huggingface.co/blog/llama2#using-transformers
    time_1 = time()
    sequences = pipeline(
        prompt_to_test,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        max_length=200,)
    time_2 = time()
    print(f"Test inference: {round(time_2-time_1, 3)} sec.")
    for seq in sequences:
        print(f"Result: {seq['generated_text']}")

## Test the query pipeline

We test the pipeline with a query asking for a short definition of natural disasters.

In [None]:
test_model(tokenizer,
           query_pipeline,
           "Please explain what is natural disasters. Give just a definition. Keep it in 100 words.")



Test inference: 5.77 sec.
Result: Please explain what is natural disasters. Give just a definition. Keep it in 100 words.

Natural disasters are sudden, devastating events caused by natural phenomena, such as earthquakes, hurricanes, floods, and volcanic eruptions, that can result in loss of life and damage to property.
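
The result above starts by repeating the prompt, because the pipeline returns the full text by default. A minimal sketch for keeping only the continuation follows. The helper name `strip_prompt` and the sample strings are hypothetical, chosen for illustration.

```python
def strip_prompt(generated_text: str, prompt: str) -> str:
    """Return only the continuation when the text begins with the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    return generated_text.strip()


# Hypothetical sample mimicking the shape of the pipeline output.
prompt = "Give just a definition."
generated = prompt + "\n\nNatural disasters are sudden, devastating events."
print(strip_prompt(generated, prompt))
```

As an alternative, the Hugging Face text-generation pipeline accepts a `return_full_text=False` argument that should have a similar effect. We did not test it in this notebook.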


# Retrieval Augmented Generation

## Check the model with a HuggingFace pipeline


We check the model with a HF pipeline, using the same query about the definition of natural disasters.

In [None]:
llm = HuggingFacePipeline(pipeline=query_pipeline)
# checking again that everything is working fine
llm(prompt="Please explain what is natural disasters. Give just a definition. Keep it in 100 words.")

'\nNatural disasters are sudden, unpredictable, and devastating events caused by natural phenomena, such as earthquakes, hurricanes, floods, and volcanic eruptions. These events can result in loss of life, property damage, and environmental degradation, highlighting the need for preparedness and response measures to minimize their impact.'

## Ingestion of data using TextLoader

We ingest the natural disasters text file from the Kaggle dataset.

In [None]:
loader = TextLoader("/kaggle/input/natural-disasters-txt/Natrual Disasters.txt",
                    encoding="utf8")
documents = loader.load()

## Split data in chunks

We split data in chunks using a recursive character text splitter.

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
all_splits = text_splitter.split_documents(documents)
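
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified fixed-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally: that splitter prefers to break on separators such as paragraphs, lines, and words. The function name `chunk_text` and the sample string are assumptions for illustration.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=2)
print(chunks)
```

Each chunk repeats the last two characters of the previous one. The overlap keeps a sentence that straddles a boundary at least partly visible in both chunks.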

## Creating Embeddings and Storing in Vector Store

Create the embeddings with the `all-mpnet-base-v2` Sentence Transformers model, loaded through the LangChain `HuggingFaceEmbeddings` wrapper.

In [None]:
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}

embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)
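
Retrieval works by comparing the embedding of the query with the embeddings of the chunks. The toy sketch below ranks three made-up 3-dimensional vectors by cosine similarity. The vectors and labels are invented for illustration. The real model produces much larger vectors, and the vector store may use a different distance metric by default.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up embeddings, for illustration only.
toy_embeddings = {
    "flood safety": [0.9, 0.1, 0.0],
    "fire prevention": [0.1, 0.9, 0.0],
    "earthquake kit": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]

ranked = sorted(
    toy_embeddings,
    key=lambda name: cosine_similarity(query_vector, toy_embeddings[name]),
    reverse=True,
)
print(ranked)
```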

Initialize ChromaDB with the document splits and the embeddings defined previously, and persist it locally in the `chroma_db` directory.

In [None]:
vectordb = Chroma.from_documents(documents=all_splits, embedding=embeddings, persist_directory="chroma_db")

## Initialize chain

In [None]:
retriever = vectordb.as_retriever()

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    verbose=True
)
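
With `chain_type="stuff"`, the retrieved chunks are concatenated ("stuffed") into a single prompt together with the question. The sketch below imitates that step in plain Python. The template wording is an assumption modeled on LangChain's default question-answering prompt and may differ between versions. The function name `build_stuff_prompt` and the sample chunks are hypothetical.

```python
# Assumed template, modeled on LangChain's default QA prompt.
PROMPT_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Helpful Answer:"
)


def build_stuff_prompt(chunks, question):
    """Join all retrieved chunks into one context block and fill the template."""
    context = "\n\n".join(chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)


# Hypothetical retrieved chunks, for illustration only.
retrieved_chunks = [
    "Install smoke detectors in key areas and test them regularly.",
    "Keep an emergency kit with water, food, flashlight, and first aid.",
]
prompt = build_stuff_prompt(retrieved_chunks, "How to stay prepared?")
print(prompt)
```

Because every chunk lands in one prompt, this approach only works while the combined text fits in the context window of the model.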

## Test the Retrieval-Augmented Generation


We define a test function that runs the query and times it.

In [None]:
def test_rag(qa, query):
    print(f"Query: {query}\n")
    time_1 = time()
    result = qa.run(query)
    time_2 = time()
    print(f"Inference time: {round(time_2-time_1, 3)} sec.")
    print("\nResult: ", result)

Let's check a few queries.

In [None]:
query = "How to stay safe during a flood"
test_rag(qa, query)

Query: How to stay safe during a flood

> Entering new RetrievalQA chain...

> Finished chain.
Inference time: 33.248 sec.

Result:   During a flood, it's important to prioritize safety above all else. Here are some steps you can take to stay safe during a flood:

2. Evacuate if necessary: If you are in a flood-prone area or if authorities issue an evacuation order, leave immediately. Do not wait to see if the flood will pass or if you will be safe.
3. Avoid floodwaters: Do not attempt to drive or walk through floodwaters, as they can be dangerous and unpredictable. Floodwaters can be contaminated with sewage, chemicals, and other hazards, and they can also be deep and fast-moving, making it difficult to stay safe.
4. Stay indoors: If you are safe inside, stay there. Do not venture outside until the floodwaters have receded and it is safe to do so.
5. Follow boil water advisories: If the flood has contaminated your drinking water, follow any boil water advisories issued by local authorities.
6. Be prepared for power outages: Floods can cause power outa

In [None]:
query = "what are the preventive measures for fire accidents"
test_rag(qa, query)

Query: what are the preventive measures for fire accidents

> Entering new RetrievalQA chain...

> Finished chain.
Inference time: 8.695 sec.

Result:   The preventive measures for fire accidents include installing smoke detectors in key areas and testing them regularly, keeping fire extinguishers accessible and ensuring people know how to use them, avoiding overloading electrical outlets and regularly checking wiring, and evacuating the building immediately and calling emergency services in case of a fire.

Unhelpful Answer: I don't know the answer to your question.
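
The model keeps generating after the useful answer and adds an "Unhelpful Answer:" section. A minimal post-processing sketch that cuts the text at the first such marker is shown below. The helper name `trim_answer`, the default marker list, and the sample text are assumptions for illustration.

```python
def trim_answer(text: str, stop_markers=("Unhelpful Answer:",)) -> str:
    """Cut the text at the earliest stop marker, if any marker is present."""
    cut = len(text)
    for marker in stop_markers:
        idx = text.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut].strip()


# Hypothetical raw output with the same shape as the result above.
raw = (
    "  Install smoke detectors and test them regularly.\n\n"
    "Unhelpful Answer: I don't know the answer to your question.\n"
)
print(trim_answer(raw))
```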


In [None]:
query = "give some general tips to stay prepared for disasters"
test_rag(qa, query)

Query: give some general tips to stay prepared for disasters

> Entering new RetrievalQA chain...

> Finished chain.
Inference time: 18.189 sec.

Result:   Here are some general tips to stay prepared for disasters:

* Stay informed about weather and emergency alerts in your area.
* Create an emergency kit with essential supplies, such as food, water, first aid, and a battery-powered radio.
* Develop a family emergency plan and practice it with all members of your household.
* Stay aware of your surroundings and potential hazards, such as flood-prone areas or wildfire-prone areas.
* Keep important phone numbers and emergency contact information easily accessible.
* Consider taking a disaster preparedness course or participating in a community emergency drill.

Unhelpful Answer: I'm not sure, I don't know much about disaster preparedness.

Note: The helpful answer provides specific tips and information that can help individuals prepare for disasters, while the unhelpful answer simply states that they don't know much about disaster preparedness, which doesn't provide any usefu

## Document sources

Let's check the document sources for the last query run.

In [None]:
docs = vectordb.similarity_search(query)
print(f"Query: {query}")
print(f"Retrieved documents: {len(docs)}")
for doc in docs:
    # Read the fields directly from the Document object
    print("Source: ", doc.metadata['source'])
    print("Text: ", doc.page_content, "\n")

Query: give some general tips to stay prepared for disasters
Retrieved documents: 3
Source:  /kaggle/input/natural-disasters-txt/Natrual Disasters.txt
Text:  Safety Measures for Various Emergencies

1. Fire Accidents
Install smoke detectors in key areas and test them regularly.

Keep fire extinguishers accessible and ensure people know how to use them.

Avoid overloading electrical outlets and regularly check wiring.

In case of fire, stop, drop, and roll if your clothes catch fire.

Evacuate the building immediately and call emergency services.

Never use elevators during a fire; use stairs instead.

2. Earthquakes
Secure heavy furniture and appliances to walls.

Keep an emergency kit with water, food, flashlight, and first aid.

During a quake, drop, cover, and hold on under a sturdy table.

Stay away from windows, mirrors, and heavy objects.

After the quake, check for injuries and evacuate if the structure is unsafe.

Be prepared for aftershocks.

3. Forest Fires
Follow evacuation 
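
When the store holds several files, it can help to summarize which sources contributed to an answer. The sketch below counts retrieved chunks per source. `FakeDocument` is a hypothetical stand-in with the same `page_content` and `metadata` fields as a LangChain document, so the example runs without the vector store. The function name `count_sources` and the sample paths are also assumptions.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def count_sources(docs):
    """Count how many retrieved chunks come from each source file."""
    return Counter(doc.metadata.get("source", "unknown") for doc in docs)


# Made-up retrieved chunks, for illustration only.
sample_docs = [
    FakeDocument("Install smoke detectors.", {"source": "disasters.txt"}),
    FakeDocument("Secure heavy furniture.", {"source": "disasters.txt"}),
    FakeDocument("Text without metadata."),
]
source_counts = count_sources(sample_docs)
for source, n in source_counts.items():
    print(f"{source}: {n} chunk(s)")
```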

# Conclusions


We used Langchain, ChromaDB and Llama 2 as the LLM to build a Retrieval Augmented Generation solution. For testing, we used a text file with safety measures for natural disasters and other emergencies.


# More work on the same topic

You can find more details about how to use an LLM with Kaggle. A few interesting topics are covered in:

* https://www.kaggle.com/code/gpreda/test-llama-2-quantized-with-llama-cpp (quantizing Llama 2 model using llama.cpp)
* https://www.kaggle.com/code/gpreda/fast-test-of-llama-v2-pre-quantized-with-llama-cpp (quantized Llama 2 model using llama.cpp)
* https://www.kaggle.com/code/gpreda/test-of-llama-2-quantized-with-llama-cpp-on-cpu (quantized model using llama.cpp - running on CPU)  
* https://www.kaggle.com/code/gpreda/explore-enron-emails-with-langchain-and-llama-v2 (Explore Enron Emails with Langchain and Llama v2)



