# Advanced Retrieval-Augmented Generation (RAG) System Using Open-Source Llama 2 and LlamaIndex with Hugging Face

**The main idea behind this RAG system is to let users query their own data spread across many PDF documents. In this approach, we load the documents, index them, and query them using the Llama 2 model.**

## Install Libraries

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
!pip install -q pypdf

In [3]:
!pip install -q transformers einops accelerate langchain bitsandbytes  # bitsandbytes provides quantization (32/16-bit weights down to 8/4-bit) so the model fits on a limited GPU
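To make the quantization idea concrete, here is a minimal sketch of symmetric absmax quantization to 8-bit integers in plain Python. It is a simplification, not what bitsandbytes does internally: its LLM.int8() scheme quantizes vector-wise and keeps outlier features in 16-bit. The helper names are hypothetical.

```python
from typing import List, Tuple


def absmax_quantize(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats into the int8 range [-127, 127] using one shared scale (hypothetical helper)."""
    absmax = max(abs(w) for w in weights)
    scale = 127 / absmax if absmax else 1.0
    return [round(w * scale) for w in weights], scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Recover approximate floats; the error per value is at most half a quantization step."""
    return [v / scale for v in q]


weights = [0.5, -1.0, 0.25, 0.0]
q, scale = absmax_quantize(weights)
print(q)  # [64, -127, 32, 0]
print(dequantize(q, scale))
```

Each weight now needs 1 byte instead of 2 (float16) or 4 (float32), at the cost of a small rounding error.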

In [6]:
## For Embeddings
!pip install -q sentence_transformers

In [7]:
!pip install -q llama_index

In [None]:
!pip install -q llama-index-llms-huggingface

In [14]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, ServiceContext  # indexing, document loading, and LLM/embedding configuration
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.core import PromptTemplate

## Loading The Documents

**SimpleDirectoryReader creates documents out of every file in a given directory. It is built into LlamaIndex and can read a variety of formats, including Markdown, PDFs, Word documents, PowerPoint decks, images, audio, and video.**
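As a rough mental model of what `SimpleDirectoryReader` does, the sketch below walks a directory and creates one document object per file, recording the file name as metadata. It is deliberately simplified and hypothetical (`Doc` and `load_directory` are not LlamaIndex APIs): it only reads top-level plain-text files, whereas the real reader dispatches each file to a format-specific parser, for example for PDFs.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List, Tuple


@dataclass
class Doc:
    """Hypothetical stand-in for a LlamaIndex Document: text plus metadata."""
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def load_directory(path: str, suffixes: Tuple[str, ...] = (".txt", ".md")) -> List[Doc]:
    """Create one Doc per top-level file in `path` whose extension is in `suffixes`."""
    docs: List[Doc] = []
    for file in sorted(Path(path).iterdir()):
        if file.is_file() and file.suffix.lower() in suffixes:
            text = file.read_text(encoding="utf-8")
            docs.append(Doc(text=text, metadata={"file_name": file.name}))
    return docs


# Usage: a throwaway directory with two text files and one file that is skipped.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("hello", encoding="utf-8")
    Path(tmp, "b.md").write_text("# notes", encoding="utf-8")
    Path(tmp, "skip.bin").write_bytes(b"\x00")
    docs = load_directory(tmp)

print([d.metadata["file_name"] for d in docs])  # ['a.txt', 'b.md']
```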

In [49]:
documents = SimpleDirectoryReader("data").load_data()  # replace "data" with the path to your document directory

In [None]:
documents

## Building Llama2 Model

In [19]:
system_prompt = """
You are a Q&A Assistant. Your goal is to answer the questions
as accurately as possible based on the instructions and context provided.
"""
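# Query wrapper. Note: <|USER|>/<|ASSISTANT|> is the StableLM chat format; Llama 2 chat models are trained on [INST] ... [/INST]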
query_wrapper_prompt = PromptTemplate(
    "<|USER|>{query_str}<|ASSISTANT|>"
)
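For comparison with the wrapper above, the sketch below builds a single-turn prompt in the chat format Meta documents for Llama 2 chat models: the system prompt sits inside `<<SYS>>` tags and the whole turn inside `[INST] ... [/INST]`. The function name is hypothetical, and the `<s>` BOS token is omitted on the assumption that the tokenizer adds it.

```python
# Delimiters used by the Llama 2 chat format.
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"


def llama2_chat_prompt(system_prompt: str, user_message: str) -> str:
    """Format one Llama 2 chat turn with a system prompt (hypothetical helper, no BOS token)."""
    content = B_SYS + system_prompt.strip() + E_SYS + user_message.strip()
    return f"{B_INST} {content} {E_INST}"


prompt = llama2_chat_prompt("You are a Q&A Assistant.", "What is RAG?")
print(prompt)
```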

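**Hugging Face Login**

**The `meta-llama` checkpoints are gated, so log in with a token from an account that has been granted access to Llama 2.**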

In [20]:
!pip install huggingface_hub
from huggingface_hub import notebook_login
notebook_login()

In [21]:
import torch

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.0, "do_sample": False},
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="meta-llama/Llama-2-7b-chat-hf",
    model_name="meta-llama/Llama-2-7b-chat-hf",
    device_map="auto",
    # Requires a CUDA GPU: load weights in float16, then quantize them to 8-bit with bitsandbytes to save memory
    model_kwargs={"torch_dtype": torch.float16, "load_in_8bit": True},
)
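The 8-bit setting matters because weight memory scales with the number of bits per parameter. A back-of-the-envelope estimate, assuming roughly 7 billion parameters and ignoring activations, the KV cache, and quantization overhead:

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 7e9  # assumption: ~7B parameters for Llama-2-7b
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Halving the bits per parameter roughly halves the weight footprint, which is why 8-bit loading helps on a memory-limited GPU.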

In [22]:
!pip install -q llama-index-embeddings-langchain

**The `sentence-transformers/all-mpnet-base-v2` model maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.**
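Semantic search over these embeddings typically ranks chunks by cosine similarity to the query embedding. A minimal sketch with made-up 3-dimensional vectors standing in for the real 768-dimensional ones (the chunk names and values are hypothetical):

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query = [1.0, 0.0, 1.0]
chunks = {
    "chunk_cpp_vectors": [0.9, 0.1, 0.8],
    "chunk_bird_audio": [0.0, 1.0, 0.1],
}
# Rank chunk names from most to least similar to the query.
ranked = sorted(chunks, key=lambda name: cosine_similarity(query, chunks[name]), reverse=True)
print(ranked)  # ['chunk_cpp_vectors', 'chunk_bird_audio']
```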

In [23]:
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from llama_index.core.indices.service_context import ServiceContext
from llama_index.embeddings.langchain import LangchainEmbedding

embed_model=LangchainEmbedding(
    HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
    )

**`ServiceContext` is a utility container for LlamaIndex's index and query classes. It holds the objects commonly used to configure indexing and querying, such as the LLM, the `PromptHelper` (which manages the context window and chunk sizes), the `BaseEmbedding` (the embedding model), and more.**

**In other words, it bundles the resources a LlamaIndex pipeline uses during both indexing and querying. In LlamaIndex v0.10 and later, `ServiceContext` is deprecated in favor of the global `Settings` object.**

In [24]:
service_context = ServiceContext.from_defaults(
    chunk_size=1024,
    llm=llm,
    embed_model=embed_model
)

In [26]:
service_context

ServiceContext(llm_predictor=LLMPredictor(system_prompt=None, query_wrapper_prompt=None, pydantic_program_mode=<PydanticProgramMode.DEFAULT: 'default'>), prompt_helper=PromptHelper(context_window=4096, num_output=256, chunk_overlap_ratio=0.1, chunk_size_limit=None, separator=' '), embed_model=LangchainEmbedding(model_name='sentence-transformers/all-mpnet-base-v2', embed_batch_size=10, callback_manager=<llama_index.core.callbacks.base.CallbackManager object at 0x7887018e4370>), transformations=[SentenceSplitter(include_metadata=True, include_prev_next_rel=True, callback_manager=<llama_index.core.callbacks.base.CallbackManager object at 0x7887018e4370>, id_func=<function default_id_func at 0x78881b2be050>, chunk_size=1024, chunk_overlap=200, separator=' ', paragraph_separator='\n\n\n', secondary_chunking_regex='[^,.;。？！]+[,.;。？！]?')], llama_logger=<llama_index.core.service_context_elements.llama_logger.LlamaLogger object at 0x7887e2759930>, callback_manager=<llama_index.core.callbacks.ba
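The `chunk_size=1024` and `chunk_overlap=200` values in the output above control how documents are cut into overlapping windows before embedding. The sketch below shows the sliding-window idea on whitespace-separated words. The function is hypothetical; LlamaIndex's `SentenceSplitter` is more careful, counting tokens with a tokenizer and preferring sentence and paragraph boundaries.

```python
from typing import List


def sliding_window_chunks(words: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split `words` into windows of `chunk_size`, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):  # last window reached the end
            break
    return chunks


words = [f"w{i}" for i in range(10)]
chunks = sliding_window_chunks(words, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

The overlap keeps a sentence that straddles a boundary partly visible in both neighboring chunks, at the cost of storing some text twice.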

In [51]:
# Split the documents into chunks, embed each chunk, and store the vectors in an in-memory vector index
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

In [52]:
# Wrap the index in a query engine that retrieves relevant chunks and asks the LLM to answer from them
query_engine = index.as_query_engine()
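Conceptually, the query engine embeds the question, retrieves the most similar chunks (by default `as_query_engine()` retrieves the top 2), and places them in a prompt for the LLM. A simplified, hypothetical sketch of the prompt-assembly step, using precomputed similarity scores and a template modeled loosely on LlamaIndex's default question-answering prompt:

```python
from typing import List, Tuple

# Hypothetical template, loosely modeled on LlamaIndex's default QA prompt.
QA_TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)


def build_qa_prompt(query: str, scored_chunks: List[Tuple[float, str]], top_k: int = 2) -> str:
    """Keep the top_k highest-scoring chunks (best first) and insert them into the QA template."""
    top = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)[:top_k]
    context = "\n\n".join(text for _, text in top)
    return QA_TEMPLATE.format(context_str=context, query_str=query)


scored = [
    (0.12, "Birds sing at dawn."),
    (0.91, "std::vector grows with push_back."),
    (0.55, "Use v[i] to index."),
]
prompt = build_qa_prompt("How can I use vectors in C++?", scored)
print(prompt)
```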

In [31]:
# Ask a question about the indexed documents
response = query_engine.query("How can I use vectors in C++?")

In [32]:
print(response)


You can use vectors in C++ by declaring a vector with the vector keyword followed by the size of the vector in square brackets. For example, vector<int> v(5); declares a vector with 5 elements of type int. You can also use the emplace_back function to dynamically increase the size of the vector and insert an element at the end. Additionally, you can use the push_back function to insert an element at the end of the vector. You can also declare a vector of pair data type as well.

You can access elements in a vector using the dot operator. For example, to access the first element of a vector v, you can use v[0]. You can also use iterators to access elements in a vector.

Vectors are constant in size, meaning you cannot modify the size of a vector after it has been declared. However, you can use the resize function to modify the size of a vector.

In summary, vectors are a powerful tool in C++ for storing and manipulating data. They are dynamic in nature, meaning you can add or remove el

In [33]:
response = query_engine.query("What is acoustic species classification about?")

In [34]:
from IPython.display import Markdown, display

In [35]:
display(Markdown(f"<b>{response}</b>"))

<b>Acoustic species classification is about identifying and categorizing bird species based on their vocalizations. The project involves converting weakly labeled data into strongly labeled time-based data and developing a multi-species classification model to predict the bird species present in a 5-second segment of audio. The goal of the project is to contribute valuable insights to the scientific community while acknowledging the challenges posed by real-world environmental variability.</b>
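Wrapping the raw response in `<b>...</b>` works for plain prose, but an answer containing markup-like text, such as `vector<int>` in the C++ answer earlier, can be swallowed as an HTML tag when rendered. Escaping the response first is a small safeguard. This sketch uses Python's standard `html.escape`; the helper name is hypothetical.

```python
import html


def bold_html(text: object) -> str:
    """Escape <, >, & and quotes so the text renders literally inside <b> tags."""
    return f"<b>{html.escape(str(text))}</b>"


print(bold_html("vector<int> v(5);"))  # <b>vector&lt;int&gt; v(5);</b>
```

In the notebook this would be used as `display(Markdown(bold_html(response)))`.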

In [46]:
response = query_engine.query("What work did Raviteja T do in acoustic species classification?")

display(Markdown(f"<b>{response}</b>"))

<b>Raviteja T worked on developing an unsupervised learning architecture for acoustic species classification using deep learning techniques. Specifically, he used a VGG16 convolutional neural network to extract features from raw audio data and then applied clustering algorithms to group similar sounds together. The goal of this work was to improve the accuracy of acoustic species classification by leveraging the power of deep learning.</b>