# RAG (Retrieval Augmented Generation)

**What's RAG? :-** RAG is an AI framework that lets a generative AI model (LLM) draw on external information that is not part of its training data or model parameters, so that it can give better responses to prompts.

**Why it's important? :-** LLMs are prone to hallucination, i.e. they present false information when they don't have the answer. They also often present outdated information or responses drawn from non-authoritative sources. RAG is one approach to addressing these problems.
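
The retrieve-then-generate idea can be sketched in plain Python before bringing in any framework. This is a toy illustration only: word overlap stands in for embedding similarity, `tokenize`, `retrieve` and `build_prompt` are hypothetical helpers, the two documents are made up, and no real LLM is called.

```python
# Toy illustration of the RAG flow: retrieve relevant text, then build an
# augmented prompt. Word overlap stands in for embedding similarity here.

def tokenize(text):
    """Lowercase the text and split it into a set of words without punctuation."""
    return {word.strip(".,?!").lower() for word in text.split()}

def retrieve(question, documents, k=1):
    """Return the k documents that share the most words with the question."""
    q_words = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q_words & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(question, context_docs):
    """Place the retrieved text in front of the question."""
    context = "\n".join(context_docs)
    return f"Context: {context}\nQuestion: {question}\nAnswer:"

# Made-up documents, used only for this illustration
documents = [
    "The library opens at nine in the morning.",
    "The cafeteria serves lunch from noon until two.",
]
question = "When does the library open?"
prompt = build_prompt(question, retrieve(question, documents))
print(prompt)
```

The prompt that reaches the model now carries the relevant document, so the model can answer from that text instead of relying only on what it memorized during training.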

**Let's see how to use it with *Open Source* models and frameworks...**

> #### Importing libraries and frameworks

In [1]:
from torch import cuda
from langchain.llms import LlamaCpp
from huggingface_hub import hf_hub_download
from langchain import PromptTemplate
from langchain.document_loaders import TextLoader, WebBaseLoader
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.llms import CTransformers
from langchain.memory import ConversationBufferMemory

> #### Creating Embeddings with an Open Source Model

In [2]:
embed_model_id = 'sentence-transformers/all-MiniLM-L6-v2'

device = 'cuda' if cuda.is_available() else 'cpu'

embed_model = HuggingFaceEmbeddings(
    model_name=embed_model_id,
    model_kwargs={'device': device},
    encode_kwargs={'device': device, 'batch_size':32}
)



> #### Demo of how the downloaded embedding model works

In [3]:
doc = ['first line of document', 'another line']

embeddings = embed_model.embed_documents(doc)

print(f"We have {len(embeddings)} doc embeddings, each with a dimensionality of {len(embeddings[0])}.")

We have 2 doc embeddings, each with a dimensionality of 384.
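
Embeddings are useful because texts with similar meaning map to nearby vectors, and closeness is commonly measured with a metric such as cosine similarity. A minimal sketch of that metric, using hand-made 3-dimensional vectors (made up for illustration, not produced by `all-MiniLM-L6-v2`):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

vec_a = [1.0, 0.0, 1.0]
vec_b = [2.0, 0.0, 2.0]   # points in the same direction as vec_a
vec_c = [0.0, 1.0, 0.0]   # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))  # close to 1: same direction
print(cosine_similarity(vec_a, vec_c))  # 0: nothing in common
```

The same function would work on the 384-dimensional vectors returned by `embed_documents`, since it only assumes two lists of floats of equal length.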


> #### Loading Data

Since I like WWE, I am picking some recent WWE data, which of course is not part of the training data of most LLMs.

In [12]:
url = 'https://www.cagesideseats.com/wwe/2024/1/5/24026239/wwe-smackdown-results-live-blog-jan-5-2024-new-years-revolution-orton-knight-styles-mystery-partner'

loader = WebBaseLoader(url)
document = loader.load()

# loader = TextLoader("chat_data.txt")
# documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=0)
texts = text_splitter.split_documents(documents=document)

In [13]:
texts[1]

Document(page_content='Skip to main content\n\n\n\nclock\nmenu\nmore-arrow\nno\nyes\nmobile\n\n\n\n\n\n\n\n\nCageside Seats homepage', metadata={'source': 'https://www.cagesideseats.com/wwe/2024/1/5/24026239/wwe-smackdown-results-live-blog-jan-5-2024-new-years-revolution-orton-knight-styles-mystery-partner', 'title': 'WWE SmackDown results, live blog (Jan. 5, 2024): New Year’s Revolution - Cageside Seats', 'description': 'Follow along with this week’s New Year’s Revolution episode of SmackDown on FOX, featuring Randy Orton vs. LA Knight vs. AJ Styles, appearances from Roman Reigns and Logan Paul, BUTCH has a mystery partner, IYO SKY defends the women’s championship, Kevin Owens and Santos Escobar fight in a tournament final, and more!', 'language': 'en'})
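
The `chunk_size` and `chunk_overlap` settings control how long each piece is and how many characters consecutive pieces share. The sketch below is a simplified fixed-width splitter meant only to show those two parameters. It is not how `RecursiveCharacterTextSplitter` works internally, since that class prefers to break on separators such as paragraphs and newlines. `split_fixed` is a hypothetical helper.

```python
def split_fixed(text, chunk_size, chunk_overlap=0):
    """Cut text into windows of chunk_size characters.

    Each window starts (chunk_size - chunk_overlap) characters after the
    previous one, so neighbouring chunks share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "abcdefghijklmnopqrstuvwxyz"

print(split_fixed(text, chunk_size=10))
# ['abcdefghij', 'klmnopqrst', 'uvwxyz']

print(split_fixed(text, chunk_size=10, chunk_overlap=3))
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

With an overlap of 0, as in the cell above, a sentence that straddles a chunk boundary is cut in two. A small overlap is one way to keep some shared context between neighbouring chunks.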

> #### Building the Vector Index using ChromaDB

We now need to use the embedding model to embed the text chunks and store the resulting vectors in a ChromaDB vector index.

In [19]:
db = Chroma.from_documents(documents=texts, embedding=embed_model)

retriever = db.as_retriever(search_type = "similarity", search_kwargs={'k':2})
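
Conceptually, a similarity retriever with `k=2` scores the query vector against every stored vector and returns the two best matches. A minimal in-memory sketch of that idea follows, assuming cosine similarity and hand-made 2-dimensional vectors. `TinyVectorIndex` is a hypothetical class, and a real vector store such as Chroma uses its own distance settings and optimized index structures.

```python
import heapq
import math

class TinyVectorIndex:
    """Brute-force vector index: stores (text, vector) pairs in a list."""

    def __init__(self):
        self._items = []

    def add(self, text, vector):
        self._items.append((text, vector))

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))

    def search(self, query_vector, k=2):
        """Return the texts of the k vectors most similar to the query."""
        scored = ((self._cosine(query_vector, vec), text) for text, vec in self._items)
        return [text for _, text in heapq.nlargest(k, scored)]

index = TinyVectorIndex()
# Made-up vectors; a real pipeline would get them from the embedding model
index.add("match results", [1.0, 0.1])
index.add("ticket prices", [0.1, 1.0])
index.add("championship winner", [0.9, 0.3])

print(index.search([1.0, 0.2], k=2))
# ['match results', 'championship winner']
```

Only the `k` highest-scoring chunks are passed on to the LLM, so `k` trades off how much context the model sees against how much of the prompt is spent on it.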

> #### Loading Large Language Model

Here I am going to pick the quantized Llama 2 13B Chat model.

In [20]:
model_name_or_path = "TheBloke/Llama-2-13B-chat-GGML"
model_basename = "llama-2-13b-chat.ggmlv3.q5_1.bin"

# Raw string so that the backslashes in the Windows path are not treated as escape sequences
model_path = hf_hub_download(repo_id=model_name_or_path, filename=model_basename, local_dir=r'E:\coding\PycharmProjects\medical-chatbot\model')

In [21]:
llm=CTransformers(model=model_path,
                  model_type="llama",
                  config={'max_new_tokens':512,
                          'temperature':0.8})
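
The `temperature` setting controls how random token sampling is: the model's logits are divided by the temperature before the softmax, so lower values concentrate probability on the top token and higher values flatten the distribution. A small sketch with made-up logits (not taken from Llama 2); `softmax_with_temperature` is a hypothetical helper:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after scaling them by 1/temperature."""
    scaled = [value / temperature for value in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens

for temperature in (0.2, 0.8, 1.5):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

For a question-answering setup where factual consistency matters, a lower temperature than 0.8 may be worth trying, since it makes the most likely token win more often.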

> #### Prompt Template

In [34]:
prompt_template = """

You are an assistant to a human, powered by a large language model.
As a language model, you are able to generate human-like text based on the input you receive, allowing you to engage in natural-sounding conversations and provide responses that are coherent and relevant to the topic at hand.
You are able to process and understand large amounts of text, and can use this knowledge to provide accurate and informative responses to a wide range of questions. 
You have access to some personalized information provided by the human in the Context section below. 


Context: {context}
Question: {question}

Return the helpful answer and nothing else. Make sure to double check your answer.
Don't make up the answer.
Helpful answer:
"""

PROMPT = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
chain_type_kwargs={"prompt": PROMPT}
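
At query time the retrieved chunks are joined into one string and substituted for `{context}`, while the user's query replaces `{question}`. The sketch below mimics that substitution with plain `str.format`. It assumes the chunks are joined with a blank line; the shortened template, the chunk texts and `fill_prompt` are made up for illustration.

```python
# Shortened stand-in for the prompt template, with the same two placeholders
template = (
    "Context: {context}\n"
    "Question: {question}\n"
    "Helpful answer:"
)

def fill_prompt(template, chunks, question):
    """Join the retrieved chunks and substitute both placeholders."""
    context = "\n\n".join(chunks)  # assumed separator between chunks
    return template.format(context=context, question=question)

chunks = ["Chunk one text.", "Chunk two text."]  # made-up retrieved chunks
filled = fill_prompt(template, chunks, "What does chunk two say?")
print(filled)
```

Because every retrieved chunk ends up in the same prompt, the combined length of the chunks, the instructions and the question has to fit within the model's context window.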

> #### Let's Chat...

In [35]:
rag_pipeline = RetrievalQA.from_chain_type(
    llm=llm,                              # LLM that generates the answer
    chain_type="stuff",                   # Put all retrieved chunks into a single prompt
    retriever=retriever,                  # Retriever backed by the Chroma index
    return_source_documents=False,        # Whether to also return the source chunks
    memory=ConversationBufferMemory(),    # LangChain conversation memory
    chain_type_kwargs=chain_type_kwargs   # Custom prompt
)

In [36]:
# Keep answering questions until the user types "exit"
while True:
    user_input = input("Input Prompt:")
    if user_input == "exit":
        break
    result = rag_pipeline({"query": user_input})
    print(f"Response : {result['result']}\n")

Input Prompt: How Kevin Owens won the match?


Response : Kevin Owens won the match with a pinfall victory over his opponent, using his patented pop-up powerbomb maneuver.



Input Prompt: Who was the winner - IYO SKY (c) vs. “Michin” Mia Yim


Response : The winner of the match between IYO SKY (c) and “Michin” Mia Yim was IYO SKY (c).



Input Prompt: exit


> #### This is how you can talk to your documents or create a chatbot based on your business's extensive Q&A data.


By - [Himanshu Goswami](https://www.linkedin.com/in/himgos/) 🧑‍💻