# RAG with Llama 2 and LangChain
Retrieval-Augmented Generation (RAG) combines a retriever with a generative language model to produce answers grounded in source material. The retriever pulls the passages most relevant to a query from a large corpus, and the language model generates a response from those passages. Here we use a quantized version of the Llama 2 13B large language model (LLM) together with LangChain for generative question answering (QA) on a document reporting Ericsson's Q4 2023 earnings. The notebook has been validated on Google Colab with a T4 GPU. Set the runtime type to T4 GPU before running it.

## Install Packages

In [None]:
!pip install "transformers>=4.32.0" "optimum>=1.12.0"
!pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
!pip install langchain
!pip install chromadb
!pip install sentence_transformers # ==2.2.2
!pip install unstructured
!pip install pdf2image
!pip install pdfminer.six
!pip install unstructured-pytesseract
!pip install unstructured-inference
!pip install faiss-gpu
!pip install pikepdf
!pip install pypdf

## Restart Runtime

## Load Llama 2
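We will use the GPTQ-quantized version of the Llama 2 13B chat model from Hugging Face for our RAG task. The quantized model needs much less GPU memory than the full-precision one and runs faster on this hardware.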

In [None]:
from langchain.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline

model_name = "TheBloke/Llama-2-13b-Chat-GPTQ"

model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             trust_remote_code=True)

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

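gen_cfg = GenerationConfig.from_pretrained(model_name)
gen_cfg.max_new_tokens = 512
# Near-zero temperature makes sampling effectively greedy; 0.0 itself is not a valid sampling temperature
gen_cfg.temperature = 0.0000001
gen_cfg.return_full_text = True
gen_cfg.do_sample = True
gen_cfg.repetition_penalty = 1.11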

pipe=pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    generation_config=gen_cfg
)

llm = HuggingFacePipeline(pipeline=pipe)

#### Test LLM with Llama 2 prompt structure and LangChain PromptTemplate

In [None]:
from textwrap import fill
from langchain.prompts import PromptTemplate

template = """
<s>[INST] <<SYS>>
You are an AI assistant. You are truthful, unbiased and honest in your response.

If you are unsure about an answer, truthfully say "I don't know"
<</SYS>>

{text} [/INST]
"""

prompt = PromptTemplate(
    input_variables=["text"],
    template=template,
)

text = "Explain artificial intelligence in a few lines"
result = llm.invoke(prompt.format(text=text))
print(fill(result.strip(), width=100))

In [None]:
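import locale
# Colab workaround: report UTF-8 as the preferred encoding so later shell commands do not fail on a locale check
locale.getpreferredencoding = lambda: "UTF-8"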

## RAG from web pages
### A. Create a vector store for the context/external data
In this process, we will generate embedding vectors for the unstructured data obtained from the source and then store them in a vector store.

#### Load the document

Depending on the type of the source data, we can use the appropriate data loader from LangChain to load the data.



In [None]:
from langchain.document_loaders import UnstructuredURLLoader
from langchain.vectorstores.utils import filter_complex_metadata # 'filter_complex_metadata' removes complex metadata that are not in str, int, float or bool format

web_loader = UnstructuredURLLoader(
    urls=["https://mb.cision.com/Main/15448/3913672/2555477.pdf"], mode="elements", strategy="fast",
    )
web_doc = web_loader.load()
updated_web_doc = filter_complex_metadata(web_doc)

#### Split the documents into chunks

Due to the limited size of the context window of an LLM, the data need to be divided into smaller chunks with a text splitter like ``CharacterTextSplitter`` or ``RecursiveCharacterTextSplitter``. In this way, the smaller chunks can be fed into the LLM.

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=128)
chunked_web_doc = text_splitter.split_documents(updated_web_doc)
len(chunked_web_doc)
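
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of overlapping chunking on plain characters. It is a simplified illustration, not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and sentences). The function name `chunk_text` and the sample string are hypothetical.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, so a sentence that straddles a chunk boundary is less likely to be cut off from its context.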

#### Create a vector database of the chunked documents with HuggingFace embeddings

In [None]:
from langchain.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings()

We can use either Chroma or FAISS to create the [Vector Store](https://python.langchain.com/docs/modules/data_connection/vectorstores.html).

In [None]:
%%time

# Create the vectorized db with FAISS
from langchain.vectorstores import FAISS
db_web = FAISS.from_documents(chunked_web_doc, embeddings)

# Create the vectorized db with Chroma
# from langchain.vectorstores import Chroma
# db_web = Chroma.from_documents(chunked_web_doc, embeddings)

### B. Use RetrievalQA chain
We initialize a RetrievalQA chain with LangChain, passing a retriever, an LLM, and a chain type. When the chain receives a query, the retriever fetches the relevant chunks from the vector store. With `chain_type="stuff"`, all retrieved chunks are combined into a single context that is sent to the language model in one call. The LLM then generates its response from the retrieved documents. More details on LangChain retrievers can be found [here](https://python.langchain.com/docs/use_cases/question_answering/vector_db_qa).
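
As a rough illustration of what "stuffing" means, the sketch below joins a list of retrieved chunks into one context string and fills a prompt template with it. This is an assumption-laden simplification rather than LangChain's implementation: the helper `stuff_prompt`, the toy template, and the blank-line separator are all hypothetical choices made for the example.

```python
def stuff_prompt(template: str, docs: list[str], question: str) -> str:
    """Join all retrieved chunks into one context block and fill the template."""
    context = "\n\n".join(docs)  # assumed separator: one blank line between chunks
    return template.format(context=context, question=question)


toy_template = "Context:\n{context}\n\nQuestion: {question}"
retrieved = ["Chunk one.", "Chunk two."]
stuffed = stuff_prompt(toy_template, retrieved, "What is in the chunks?")
print(stuffed)
```

Because everything goes into one prompt, this approach only works while the combined chunks fit in the model's context window.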

**LLM Prompt Structure**

We can also wrap the QA prompt in the recommended Llama 2 prompt structure. This lets us instruct the LLM to answer using only the supplied context. If the context contains nothing relevant to the query, the LLM should say that it cannot find the information rather than fabricate an answer.

In [None]:
%%time

from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA

# Use the recommended prompt style for the Llama 2 LLM
prompt_template = """
<s>[INST] <<SYS>>
Use the following context to Answer the question at the end. Do not use any other information. If you can't find the relevant information in the context, just say you don't have enough information to answer the question. Don't try to make up an answer.

<</SYS>>

{context}

Question: {question} [/INST]
"""

prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
Chain_web = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    # Similarity search is the default retrieval method; set search_type="mmr" to use MMR instead.
    # k sets how many documents are returned (default: 4).
    # score_threshold sets a minimum relevance for returned documents when search_type="similarity_score_threshold", e.g.:
    # retriever=db_web.as_retriever(search_type="similarity_score_threshold", search_kwargs={'k': 5, 'score_threshold': 0.8})
    # return_source_documents=True,  # optional: also return the source documents used to answer the question
    retriever=db_web.as_retriever(), # (search_kwargs={'k': 5, 'score_threshold': 0.8}),
    chain_type_kwargs={"prompt": prompt},
)
query = "What are our top 5 countries in sale this quarter?"
result = Chain_web.invoke(query)
result

In [None]:
print(fill(result['result'].strip(), width=100))
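
The retriever comments in the chain definition mention `k` and `score_threshold`. The sketch below shows the idea behind those two knobs on toy 2-D vectors, using cosine similarity in pure Python. It is only an illustration under stated assumptions: the functions `cosine` and `top_k` are hypothetical, and the vector store used in this notebook may compute and normalize its scores differently.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, doc_vecs, k=4, score_threshold=None):
    """Return indices of the k most similar documents, optionally dropping low scores."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    if score_threshold is not None:
        scored = [pair for pair in scored if pair[0] >= score_threshold]
    return [idx for _, idx in scored[:k]]


doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_vec = [1.0, 0.0]
print(top_k(query_vec, doc_vecs, k=2))                       # [0, 2]
print(top_k(query_vec, doc_vecs, k=2, score_threshold=0.8))  # [0]
```

A larger `k` gives the LLM more context at the cost of a longer prompt, while a threshold keeps weakly related chunks out of the prompt altogether.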

In [None]:
%%time

query = "What are the top 5 countries in sale 2023 quarter percentages"
result = Chain_web.invoke(query)
print(fill(result['result'].strip(), width=100))

### C. Hallucination Check
Hallucination in RAG refers to an LLM generating content that is not based on the retrieved knowledge.

Let's test our LLM with a query that is not relevant to the context. The model should respond that it does not have enough information to respond to this query.

In [None]:
%%time

query = "How does the transformer architecture work?"
result = Chain_web.invoke(query)
print(fill(result['result'].strip(), width=100))

The model responded as expected. The context provided to it does not contain any information on the transformer architecture, so it cannot answer this question.
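
When running several off-topic queries, reading every response by hand gets tedious. One possible way to automate the check is a simple phrase match on the answer, sketched below. The marker phrases and the helper `looks_like_refusal` are assumptions made for this example: they mirror the wording of the prompts used in this notebook, and a model may phrase its refusal differently.

```python
# Hypothetical marker phrases, based on the wording of the prompts in this notebook
REFUSAL_MARKERS = (
    "don't have enough information",
    "do not have enough information",
    "i don't know",
)


def looks_like_refusal(answer: str) -> bool:
    """Return True if the answer contains one of the known refusal phrases."""
    # Lowercase and normalize curly apostrophes so the match is less brittle
    lowered = answer.lower().replace("\u2019", "'")
    return any(marker in lowered for marker in REFUSAL_MARKERS)


print(looks_like_refusal("I don't have enough information to answer the question."))  # True
print(looks_like_refusal("Transformers use self-attention layers."))                  # False
```

A phrase match like this only flags explicit refusals. It cannot tell whether a confident answer is actually supported by the retrieved context.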

## RAG from PDF Files

Download the PDF file (the loader below expects it at `/content/Earnings.pdf`)

Load PDF Files

In [None]:
from langchain.document_loaders import UnstructuredPDFLoader
pdf_loader = UnstructuredPDFLoader("/content/Earnings.pdf")
pdf_doc = pdf_loader.load()
updated_pdf_doc = filter_complex_metadata(pdf_doc)

Split the document into chunks

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=128)
chunked_pdf_doc = text_splitter.split_documents(updated_pdf_doc)
len(chunked_pdf_doc)

Create the vector store

In [None]:
%%time
db_pdf = FAISS.from_documents(chunked_pdf_doc, embeddings)

### RAG with RetrievalQA

In [None]:
%%time

Chain_pdf = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db_pdf.as_retriever(),
    chain_type_kwargs={"prompt": prompt},
)
query = "What are our top 5 countries in sale this quarter?"
result = Chain_pdf.invoke(query)
print(fill(result['result'].strip(), width=100))

### Hallucination Check

In [None]:
%%time

query = "How does the transformer architecture work?"
result = Chain_pdf.invoke(query)
print(fill(result['result'].strip(), width=100))