<h1 align=center> Retrieval Augmented Generation (RAG) In Depth </h1>

A **RAG LLM** (Retrieval-Augmented Generation Language Model) is a type of AI model that combines the strengths of large language models (LLMs) with information retrieval techniques. This hybrid approach is designed to enhance the accuracy, relevance, and informativeness of the generated text.

![alt text](../Images/llm/rag.png)

### **Overview of RAG LLMs**

- **LLMs (Large Language Models):** These are deep learning models trained on vast amounts of text data to predict and generate text. They can generate human-like responses based on the context provided but are limited by the information they were trained on, which has a cutoff date.
- **Information Retrieval:** This involves searching for and retrieving relevant documents, passages, or data from a vast corpus (e.g., the web, databases) in response to a query.
- **RAG (Retrieval-Augmented Generation):** RAG models combine these two techniques by first retrieving relevant information from an external knowledge base and then using that information to generate more accurate and contextually relevant responses.

### **How RAG LLMs Work**

**Step 1: Query Formulation**

- The model takes the input (e.g., a user question) and formulates a query to search an external knowledge source. This query might be directly derived from the input or slightly transformed for better retrieval results.

**Step 2: Information Retrieval**

- Using the formulated query, the model retrieves the most relevant documents, passages, or data from the external source. These could be web documents, proprietary databases, or specialized datasets.

**Step 3: Generation with Augmentation**

- The retrieved information is fed into the language model as additional context. The model then generates a response that is informed by both the original input and the retrieved data, leading to a more accurate and up-to-date answer.
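The three steps above can be sketched in a few lines of plain Python. This is a toy illustration, not the implementation used later in this document: the knowledge base, the word-overlap ranking, and the function names (`formulate_query`, `retrieve`, `build_prompt`, `generate`) are all hypothetical, and `generate` is a placeholder where a real system would call an LLM.

```python
import re

# Hypothetical in-memory knowledge base; a real system would query a vector store.
KNOWLEDGE_BASE = [
    "Llama 2 is a family of large language models released by Meta.",
    "FAISS is a library for efficient similarity search over dense vectors.",
    "LangChain is a framework for building applications around language models.",
]

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def formulate_query(user_input):
    """Step 1: derive a search query from the input (here: just trim whitespace)."""
    return user_input.strip()

def retrieve(query, documents, k=2):
    """Step 2: rank documents by the number of words they share with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(question, passages):
    """Step 3a: place the retrieved passages in the prompt as extra context."""
    context = "\n".join(passages)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

def generate(prompt):
    """Step 3b: placeholder for the LLM call; it only reports the prompt size."""
    return f"[an LLM would answer here, given a {len(prompt)}-character prompt]"

question = "What is Llama 2?"
passages = retrieve(formulate_query(question), KNOWLEDGE_BASE, k=1)
prompt = build_prompt(question, passages)
print(generate(prompt))
```

Word overlap stands in for the embedding-based similarity search that production RAG systems typically use. The control flow (query, retrieve, augment, generate) is the same.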

### Practical Example
- This example implements a RAG-based system using the open-source Llama 2 model and LangChain.
- Create a folder named `data` and store your PDF files in it.

![alt text](../Images/llm/rag1.png)

In [1]:
 !pip install -q transformers einops accelerate langchain bitsandbytes


In [2]:
!pip install -q langchain-community  pypdf  faiss-cpu streamlit


In [None]:
!pip install sentence_transformers

In [5]:
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import HuggingFacePipeline
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
import torch

In [4]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) y
Token is valid (permission: read).
[1m[31mCannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your term

In [6]:
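def load_llama():
    # Official checkpoint (access must be requested on the Hugging Face Hub):
    # model_id = "meta-llama/Llama-2-7b-chat-hf"
    model_id = "daryl149/llama-2-7b-chat-hf"  # community-hosted copy of the chat model

    tokenizer = AutoTokenizer.from_pretrained(model_id, use_auth_token=True)
    # Load the weights in 8-bit (via bitsandbytes) and let accelerate place them
    # on the available devices.
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        torch_dtype=torch.float16,
        use_auth_token=True,
        load_in_8bit=True,
        # load_in_4bit=True,
    )
    # The model object is already loaded and placed on devices, so the pipeline
    # is not given a second dtype or device_map.
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=512,
        do_sample=True,
        top_k=30,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
    )
    # Wrap the transformers pipeline so LangChain chains can call it as an LLM.
    llm = HuggingFacePipeline(pipeline=pipe, model_kwargs={"temperature": 0})
    return llm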

In [7]:
# Load embeddings
def load_embeddings_model():
    embeddings = HuggingFaceEmbeddings()
    return embeddings

In [8]:
# Prompt template: the retrieved chunks are wrapped in <context> tags
prompt_template = """
Human: Use the following pieces of context to provide a
concise answer to the question at the end but use at least summarize with
50 words with detailed explanations. If you don't know the answer,
don't try to make up an answer.
<context>
{context}
</context>

Question: {question}

Assistant:"""

In [9]:
def get_documents():
    loader = PyPDFDirectoryLoader("data")
    documents = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=500)
    docs = text_splitter.split_documents(documents)
    return docs

def get_vector_store(docs, embeddings):
    vectorstore_faiss = FAISS.from_documents(docs, embeddings)
    vectorstore_faiss.save_local("faiss_index")

PROMPT = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
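The effect of `chunk_size` and `chunk_overlap` can be seen with a simplified fixed-width splitter. This is only a sketch: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and sentences, whereas the hypothetical `split_with_overlap` below cuts at fixed character offsets.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so neighbouring chunks share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=5)
for chunk in chunks:
    print(chunk)
```

With an overlap of half the chunk size, as in `get_documents`, most characters appear in two chunks. That roughly doubles the number of vectors stored but reduces the chance that a relevant sentence is split across a chunk boundary.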


In [10]:
def get_response_llm(llm, vectorstore_faiss, query):
    qa = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=vectorstore_faiss.as_retriever(
            search_type="similarity", search_kwargs={"k": 3}
        ),
        return_source_documents=True,
        chain_type_kwargs={"prompt": PROMPT}
    )
    answer = qa({"query": query})
    return answer['result']
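The retriever configured with `search_kwargs={"k": 3}` returns the three stored chunks closest to the query embedding. A minimal sketch of that idea follows, assuming cosine similarity and hand-written three-dimensional vectors. Real embeddings have hundreds of dimensions, and the distance metric the FAISS index actually uses may differ. The names `cosine_similarity` and `top_k` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, index, k=3):
    """Return the texts of the k entries most similar to query_vector."""
    ranked = sorted(
        index,
        key=lambda entry: cosine_similarity(query_vector, entry[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

# Hypothetical (text, embedding) pairs standing in for the vector store.
index = [
    ("chunk about Llama 2", [0.9, 0.1, 0.0]),
    ("chunk about FAISS", [0.1, 0.9, 0.0]),
    ("chunk about licensing", [0.7, 0.2, 0.1]),
    ("chunk about cooking", [0.0, 0.1, 0.9]),
]
query_vector = [1.0, 0.0, 0.0]
nearest = top_k(query_vector, index, k=3)
print(nearest)
```

This brute-force scan compares the query against every stored vector. Libraries such as FAISS exist to make the same lookup efficient on large collections.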

In [None]:
docs = get_documents()
embeddings = load_embeddings_model()
get_vector_store(docs, embeddings)

In [15]:
user_question = "What is llama2?"

In [None]:
embeddings = load_embeddings_model()
faiss_index = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)
llm = load_llama()
response = get_response_llm(llm, faiss_index, user_question)
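As the printed output in this walkthrough shows, the returned string can begin with the filled-in prompt rather than with the answer alone. One way to keep only the answer is to cut at the `Assistant:` marker that ends the template. The helper below is a heuristic sketch, not part of LangChain: `extract_answer` is a hypothetical name, and it will cut too early if the retrieved context itself contains the marker.

```python
def extract_answer(generated_text, marker="Assistant:"):
    """Return the text after the first marker, or the whole text if absent."""
    _, found, tail = generated_text.partition(marker)
    return tail.strip() if found else generated_text.strip()

# Hypothetical generated text that echoes the prompt before the answer.
echoed = (
    "Human: Use the following pieces of context...\n"
    "Question: What is llama2?\n\n"
    "Assistant: Llama 2 is a family of language models."
)
print(extract_answer(echoed))
```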

In [17]:
print(response)


Human: Use the following pieces of context to provide a
concise answer to the question at the end but use at least summarize with
50 words with detailed explanations. If you don't know the answer,
don't try to make up an answer.
<context>
found in the model README, or by opening an issue in the GitHub repository
(https://github.com/facebookresearch/llama/ ).
Intended Use
Intended Use Cases Llama 2 is intended for commercial and research use in English. Tuned models
are intended for assistant-like chat, whereas pretrained models can be adapted
for a variety of natural language generation tasks.
Out-of-Scope Uses Use in any manner that violates applicable laws or regulations (including trade
compliance laws). Use in languages other than English. Use in any other way
that is prohibited by the Acceptable Use Policy and Licensing Agreement for
Llama 2.
Hardware and Software (Section 2.2)
Training Factors We used custom training libraries, Meta’s Research Super Cluster, and production clusters for pretrainin