# RAG with Llama 2 and LangChain
Retrieval-Augmented Generation (RAG) is a technique that combines a retriever with a generative language model to produce more accurate responses. It retrieves relevant information from a large corpus and then generates a contextually appropriate response to the query. Here we use the quantized version of the Llama 2 13B chat model with LangChain to perform generative QA with RAG. The notebook has been tested in Google Colab with a T4 GPU; change the runtime type to T4 GPU before running it.

In [8]:
from googletrans import Translator

In [9]:
translator= Translator()

In [6]:
# Open the text file chunks.txt for reading
with open('chunks.txt', 'r') as file:
    content = file.read()

# Print the contents of the text file
print(content)




In [7]:
len(content)

77683

## Install Packages

In [1]:
!pip install transformers==4.37.2 optimum==1.12.0 --quiet
!pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ --quiet
!pip install langchain==0.1.9 --quiet
# !pip install chromadb
!pip install sentence_transformers==2.4.0 --quiet
!pip install unstructured --quiet
!pip install pdf2image --quiet
!pip install pdfminer.six==20221105 --quiet
!pip install unstructured-inference --quiet
!pip install faiss-gpu==1.7.2 --quiet
!pip install pikepdf==8.13.0 --quiet
!pip install pypdf==4.0.2 --quiet
!pip install pillow_heif==0.15.0 --quiet

## Restart Runtime

## Load Llama 2
We will use the GPTQ-quantized version of the Llama 2 13B chat model from Hugging Face for our RAG task.

In [1]:
from langchain.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline

model_name = "TheBloke/Llama-2-13b-Chat-GPTQ"

model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             trust_remote_code=True)

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

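gen_cfg = GenerationConfig.from_pretrained(model_name)
gen_cfg.max_new_tokens = 512        # upper bound on generated tokens per call
gen_cfg.temperature = 0.0000001     # near-zero: sampling almost always picks the most likely token
gen_cfg.return_full_text = True
gen_cfg.do_sample = True
gen_cfg.repetition_penalty = 1.11   # values above 1 discourage repeated tokens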

pipe=pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    generation_config=gen_cfg
)

llm = HuggingFacePipeline(pipeline=pipe)

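The generation config above pairs `do_sample=True` with a temperature very close to zero. As a rough illustration of why that behaves almost like greedy decoding, here is a minimal sketch of temperature-scaled softmax in plain Python. The function name and the toy scores are assumptions made for this example; this is not how `transformers` implements sampling internally.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    # Subtract the largest score first so the exponentials cannot overflow.
    top = max(logits)
    scaled = [(x - top) / temperature for x in logits]
    weights = [math.exp(s) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

toy_logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
print(softmax_with_temperature(toy_logits, 1.0))    # probability spread over all three
print(softmax_with_temperature(toy_logits, 1e-7))   # nearly all mass on the top token
```

With a temperature this small, the highest-scoring token receives essentially all of the probability, so repeated runs should give near-identical answers.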

#### Test LLM with Llama 2 prompt structure and LangChain PromptTemplate

In [3]:
from textwrap import fill
from langchain.prompts import PromptTemplate

template = """
<s>[INST] <<SYS>>
You are an AI assistant. You are truthful, unbiased and honest in your response.

If you are unsure about an answer, truthfully say "I don't know"
<</SYS>>

{text} [/INST]
"""

prompt = PromptTemplate(
    input_variables=["text"],
    template=template,
)

text = "Explain artificial intelligence in a few lines"
result = llm.invoke(prompt.format(text=text))
print(fill(result.strip(), width=100))

<s>[INST] <<SYS>> You are an AI assistant. You are truthful, unbiased and honest in your response.
If you are unsure about an answer, truthfully say "I don't know" <</SYS>>  Explain artificial
intelligence in a few lines [/INST]   Sure! Here is my explanation of artificial intelligence:
Artificial intelligence (AI) refers to the development of computer systems that can perform tasks
that typically require human intelligence, such as learning, problem-solving, and decision-making.
These systems use algorithms and machine learning techniques to analyze data and make predictions or
take actions based on that data. Some examples of AI include natural language processing, image
recognition, and autonomous vehicles. Overall, the goal of AI is to create machines that can think
and act like humans, but with greater speed, accuracy, and consistency.
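The printed result above repeats the whole prompt before the model's answer. If only the answer is wanted, one option is to post-process the string. The sketch below is a hypothetical helper (`extract_answer` is not part of LangChain or `transformers`) that keeps whatever follows the last `[/INST]` marker.

```python
def extract_answer(generated_text, marker="[/INST]"):
    """Return the text after the last instruction marker, or the whole text if absent."""
    _, found, tail = generated_text.rpartition(marker)
    return tail.strip() if found else generated_text.strip()

sample = "<s>[INST] <<SYS>> Be honest. <</SYS>> Explain AI briefly [/INST]   Sure! AI is ..."
print(extract_answer(sample))
```

Using the last occurrence of the marker keeps the helper safe when the retrieved context or the question happens to contain `[/INST]` itself.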


In [4]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

## RAG from PDF Files
### A. Create a vector store for the context/external data
Here, we'll create embedding vectors of the unstructured data loaded from the source and store them in a vector store.

#### Download PDF files

In [5]:
!gdown "https://github.com/muntasirhsn/datasets/raw/main/Solar-System-Wikipedia.pdf" # this is just a pdf print of the Solar System page on Wikipedia!

Downloading...
From: https://github.com/muntasirhsn/datasets/raw/main/Solar-System-Wikipedia.pdf
To: /content/Solar-System-Wikipedia.pdf
  0% 0.00/4.49M [00:00<?, ?B/s]100% 4.49M/4.49M [00:00<00:00, 48.5MB/s]


#### Load PDF Files
Depending on the type of the source data, we can use the appropriate data loader from LangChain to load the data.

In [6]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
embeddings = HuggingFaceEmbeddings()

In [7]:
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.vectorstores.utils import filter_complex_metadata # 'filter_complex_metadata' removes complex metadata that are not in str, int, float or bool format

pdf_loader = UnstructuredPDFLoader("/content/dataset.vi.en.pdf")
pdf_doc = pdf_loader.load()
updated_pdf_doc = filter_complex_metadata(pdf_doc)


#### Split the document into chunks
Because an LLM's context window is limited, the data needs to be divided into smaller chunks with a text splitter such as CharacterTextSplitter or RecursiveCharacterTextSplitter. The smaller chunks can then be fed into the LLM.

In [8]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=128)
chunked_pdf_doc = text_splitter.split_documents(updated_pdf_doc)
len(chunked_pdf_doc)

33
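To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch in plain Python. It cuts text into fixed-size character windows that share a few characters with their neighbours. This is an assumption-laden stand-in: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraph and line boundaries, which this toy function (a hypothetical `split_with_overlap`) does not attempt.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size character windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

toy_text = "abcdefghijklmnopqrstuvwxyz"
for chunk in split_with_overlap(toy_text, chunk_size=10, chunk_overlap=3):
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk when the overlap is large enough.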

#### Create a vector database of the chunked documents with HuggingFace embeddings

In [9]:
from langchain.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings()

We can use either FAISS or Chroma to create the [Vector Store](https://python.langchain.com/docs/modules/data_connection/vectorstores.html).

In [10]:
%%time
# Create the vectorized db with FAISS
from langchain.vectorstores import FAISS
db_pdf = FAISS.from_documents(chunked_pdf_doc, embeddings)

# Create the vectorized db with Chroma
# from langchain.vectorstores import Chroma
# db_pdf = Chroma.from_documents(chunked_pdf_doc, embeddings)

CPU times: user 476 ms, sys: 14.5 ms, total: 490 ms
Wall time: 786 ms
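Conceptually, a vector store keeps one embedding vector per chunk and answers a query by returning the chunks whose vectors lie closest to the query vector. The sketch below shows that idea with a brute-force Euclidean scan in plain Python. The class name `ToyVectorStore` and the two-dimensional vectors are assumptions for illustration only; FAISS and Chroma use real embeddings and far more efficient index structures.

```python
import math

class ToyVectorStore:
    """Keeps (vector, text) pairs and returns the texts nearest to a query vector."""

    def __init__(self):
        self._items = []

    def add(self, vector, text):
        self._items.append((list(vector), text))

    def search(self, query_vector, k=4):
        # Brute-force scan: rank every stored vector by Euclidean distance.
        ranked = sorted(self._items, key=lambda item: math.dist(item[0], query_vector))
        return [text for _, text in ranked[:k]]

store = ToyVectorStore()
# Hypothetical 2-D "embeddings"; real embedding vectors have hundreds of dimensions.
store.add([0.9, 0.1], "chunk about planets")
store.add([0.1, 0.9], "chunk about student discipline")
store.add([0.8, 0.3], "chunk about moons")
print(store.search([1.0, 0.0], k=2))
```

Here `k` plays the same role as the `k` search argument mentioned for the retriever: it caps how many chunks come back.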


### B. Use RetrievalQA chain
We instantiate a RetrievalQA chain from LangChain, which takes a retriever, an LLM and a chain_type as input arguments. When the QA chain receives a query, the retriever retrieves information relevant to the query from the vector store. With `chain_type="stuff"`, the chain stuffs all the retrieved information into the context and makes a single call to the language model. The LLM then generates the response from the retrieved documents. [See information on LangChain retrievers](https://python.langchain.com/docs/use_cases/question_answering/vector_db_qa).
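The "stuff" strategy described above can be pictured as plain string assembly: join every retrieved chunk into one context block and substitute it, together with the question, into the prompt template. The following is a minimal sketch under that assumption. `stuff_prompt`, the toy template and the toy documents are hypothetical and only mirror the idea, not LangChain's actual implementation.

```python
def stuff_prompt(template, documents, question, separator="\n\n"):
    """Join all retrieved documents into one context string and fill the template."""
    context = separator.join(documents)
    return template.format(context=context, question=question)

toy_template = "Context:\n{context}\n\nQuestion: {question}"
toy_docs = ["The Sun is a star.", "Earth orbits the Sun."]  # stand-ins for retrieved chunks
print(stuff_prompt(toy_template, toy_docs, "What does Earth orbit?"))
```

Because everything goes into a single prompt, the combined size of the retrieved chunks has to fit within the model's context window, which is one reason the chunk size and the number of retrieved documents matter.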

**LLM prompt structure**

We can also pass in the recommended prompt structure for Llama 2 for the QA. In this way, we can instruct the LLM to use only the available context to answer the question. If it cannot find information relevant to the query in the context, it is told **NOT** to make up an answer but to say that it cannot find relevant information in the context.

In [11]:
%%time
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA

# use the recommended prompt style for the Llama 2 LLM
prompt_template = """
<s>[INST] <<SYS>>
Use the following context to Answer the question at the end. Do not use any other information. If you can't find the relevant information in the context, just say you don't have enough information to answer the question. Don't try to make up an answer.

<</SYS>>

{context}

Question: {question} [/INST]
"""

prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
Chain_pdf = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    # Similarity search is the default way to retrieve documents relevant to a query; set search_type="mmr" to use MMR instead.
    # k defines how many documents are returned; defaults to 4.
    # score_threshold sets a minimum relevance for returned documents when search_type="similarity_score_threshold", e.g.:
    # retriever=db_pdf.as_retriever(search_type="similarity_score_threshold", search_kwargs={'k': 5, 'score_threshold': 0.8})
    # return_source_documents=True,  # optional: also return the source documents used to answer the question
    retriever=db_pdf.as_retriever(),
    chain_type_kwargs={"prompt": prompt},
)
query = "Use of training results ?"
result = Chain_pdf.invoke(query)
print(fill(result['result'].strip(), width=100))

<s>[INST] <<SYS>> Use the following context to Answer the question at the end. Do not use any other
information. If you can't find the relevant information in the context, just say you don't have
enough information to answer the question. Don't try to make up an answer.  <</SYS>>  Article 27.
Use of training results 1. The results of student training assessment for each semester, school year
and entire course are stored  in the unit's student management records and used in scholarship,
reward and disciplinary  consideration. , considering dropping out of school, stopping studying,
considering priority for boarding in  dormitories, participating in international student exchange
activities.  2. The results of the assessment of students' entire course training are stored in the
student management  file, as a basis for considering the graduation exam and making the graduation
thesis. 3. The results of the student's entire course training assessment are saved in the student's
profile upon 

In [12]:
%%time
query = "Order, procedures and disciplinary records ?"
result = Chain_pdf.invoke(query)
print(fill(result['result'].strip(), width=100))

<s>[INST] <<SYS>> Use the following context to Answer the question at the end. Do not use any other
information. If you can't find the relevant information in the context, just say you don't have
enough information to answer the question. Don't try to make up an answer.  <</SYS>>  Article 32.
Order, procedures and disciplinary records  1. Disciplinary procedures:  a) Students who commit
violations must make a self-criticism and receive disciplinary action. In case a student does not
comply with the self-criticism, the Student Commendation and Discipline Council still meets to
handle the matter based on the evidence collected; b) The course class leader chairs a meeting with
the course class, analyzes and recommends disciplinary action to the faculty (for member training
units) or subject (for direct training units). under) or department/department in charge of Student
Affairs; c) Faculty (for member training units) or subject (for affiliated training units) or
department/department in 

In [13]:
%%time
query = "What are the planets of the solar system composed of? Give a detailed response."
result = Chain_pdf.invoke(query)
print(fill(result['result'].strip(), width=100))

<s>[INST] <<SYS>> Use the following context to Answer the question at the end. Do not use any other
information. If you can't find the relevant information in the context, just say you don't have
enough information to answer the question. Don't try to make up an answer.  <</SYS>>  d)  The
Student Discipline and Reward Council holds a meeting to consider discipline and success  The
section includes: members of the Council, representatives of the course class with the violating
student and  the student with the violating behavior. If a student violates discipline and is
invited but does not attend (if there is  no legitimate reason) and does not have a self-criticism
report, the Council will still conduct a meeting and further  consider the shortcomings of lack of
awareness of disciplinary organization.  e) Stop considering graduation for students if the validity
of the Student Disciplinary Decision is still  in effect. 2. Student disciplinary records: a) Self-
criticism (if any); b) Min

### C. Hallucination Check
Hallucination in RAG refers to the generation of content by an LLM that is not based on the retrieved knowledge.

Let's test our LLM with a query that is not relevant to the context. The model should respond that it does not have enough information to respond to this query.
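When many out-of-context queries are tried, reading every answer by hand becomes tedious. As a rough aid, the sketch below flags answers that contain a typical refusal phrase. The phrase list and the helper name `looks_like_refusal` are assumptions made for this example; a keyword match like this is only a crude heuristic and cannot detect a fluent but unsupported answer.

```python
REFUSAL_PHRASES = (  # hypothetical list; extend it to match the model's wording
    "don't have enough information",
    "do not have enough information",
    "not able to answer",
    "cannot provide an answer",
    "i don't know",
)

def looks_like_refusal(answer):
    """Return True if the answer contains one of the known refusal phrases."""
    lowered = answer.lower().replace("\u2019", "'")  # normalise curly apostrophes
    return any(phrase in lowered for phrase in REFUSAL_PHRASES)

print(looks_like_refusal("I'm not able to answer that question."))
print(looks_like_refusal("The Solar System formed 4.6 billion years ago."))
```

Apply the check only to the text after `[/INST]`: the system prompt itself contains the words "don't have enough information", so running it on the full echoed output would always report a refusal.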

In [14]:
%%time
query = "How does the tranformers architecture work?"
result = Chain_pdf.invoke(query)
print(fill(result['result'].strip(), width=100))

<s>[INST] <<SYS>> Use the following context to Answer the question at the end. Do not use any other
information. If you can't find the relevant information in the context, just say you don't have
enough information to answer the question. Don't try to make up an answer.  <</SYS>>  training
results when continuing to return to study as prescribed.  8. Students studying two training
programs simultaneously will have their training results evaluated at the first program management
unit and get comments from the second program management unit as a basis. for further evaluation. In
case the first program has been completed, the second program management unit will continue to
evaluate the students' training results.  9. The training results of the old training unit will be
preserved when studying at the new training unit and the training results will continue to be
evaluated in the following semesters.  Transfer students are approved by the leaders of both
training units  Article 26. Evaluat

The model responded as expected. The context provided to it does not contain any information on transformer architectures, so it cannot answer this question and does not hallucinate!

## RAG from web pages

#### Load the document



In [None]:
from langchain.document_loaders import UnstructuredURLLoader

web_loader = UnstructuredURLLoader(
    urls=["https://en.wikipedia.org/wiki/Solar_System"], mode="elements", strategy="fast",
    )
web_doc = web_loader.load()
updated_web_doc = filter_complex_metadata(web_doc)

#### Split the documents into chunks

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=128)
chunked_web_doc = text_splitter.split_documents(updated_web_doc)
len(chunked_web_doc)

942

#### Create a vector database of the chunked documents with HuggingFace embeddings

In [None]:
%%time
# Create the vectorized db with FAISS
db_web = FAISS.from_documents(chunked_web_doc, embeddings)

CPU times: user 4.41 s, sys: 18.3 ms, total: 4.43 s
Wall time: 4.42 s


#### RAG with RetrievalQA

In [None]:
%%time
Chain_web = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db_web.as_retriever(),
    chain_type_kwargs={"prompt": prompt},
)
query = "When was the solar system formed?"
result = Chain_web.invoke(query)
print(fill(result['result'].strip(), width=100))

<s>[INST] <<SYS>> Use the following context to Answer the question at the end. Do not use any other
information. If you can't find the relevant information in the context, just say you don't have
enough information to answer the question. Don't try to make up an answer.  <</SYS>>  Formation and
evolution of the Solar System  The Solar System formed 4.568 billion years ago from the
gravitational collapse of a region within a large molecular cloud.[a] This initial cloud was likely
several light-years across and probably birthed several stars.[11] As is typical of molecular
clouds, this one consisted mostly of hydrogen, with some helium, and small amounts of heavier
elements fused by previous generations of stars.[12]  The Solar System[b] is the gravitationally
bound system of the Sun and the objects that orbit it.[9] It was formed 4.6 billion years ago when a
dense region of a molecular cloud collapsed, forming the Sun and a protoplanetary disc. The Sun is
an ordinary main sequence star 

In [None]:
%%time
query = "Explain in detail how the solar system was formed."
result = Chain_web.invoke(query)
print(fill(result['result'].strip(), width=100))

<s>[INST] <<SYS>> Use the following context to Answer the question at the end. Do not use any other
information. If you can't find the relevant information in the context, just say you don't have
enough information to answer the question. Don't try to make up an answer.  <</SYS>>  Formation and
evolution of the Solar System  The Solar System formed 4.568 billion years ago from the
gravitational collapse of a region within a large molecular cloud.[a] This initial cloud was likely
several light-years across and probably birthed several stars.[11] As is typical of molecular
clouds, this one consisted mostly of hydrogen, with some helium, and small amounts of heavier
elements fused by previous generations of stars.[12]  The Solar System[b] is the gravitationally
bound system of the Sun and the objects that orbit it.[9] It was formed 4.6 billion years ago when a
dense region of a molecular cloud collapsed, forming the Sun and a protoplanetary disc. The Sun is
an ordinary main sequence star 

In [None]:
%%time
query = "What are the planets of the solar system composed of? Give a detailed response."
result = Chain_web.invoke(query)
print(fill(result['result'].strip(), width=100))

<s>[INST] <<SYS>> Use the following context to Answer the question at the end. Do not use any other
information. If you can't find the relevant information in the context, just say you don't have
enough information to answer the question. Don't try to make up an answer.  <</SYS>>  planets, moons
and dwarf planets. The  Solar System at Wikipedia's  The four terrestrial or inner planets have
dense, rocky compositions, few or no moons, and no ring systems. They are composed largely of
refractory minerals such as silicates—which form their crusts and mantles—and metals such as iron
and nickel which form their cores. Three of the four inner planets (Venus, Earth and Mars) have
atmospheres substantial enough to generate weather; all have impact craters and tectonic surface
features, such as rift valleys and volcanoes.[86]  Astronomers sometimes divide the Solar System
structure into searate regions. The inner Solar System includes the Mercury, Venus, Earth, Mars and
bodies in the asteroid be

#### Hallucination Check

In [None]:
%%time
query = "How does the tranformers architecture work?"
result = Chain_web.invoke(query)
print(fill(result['result'].strip(), width=100))

<s>[INST] <<SYS>> Use the following context to Answer the question at the end. Do not use any other
information. If you can't find the relevant information in the context, just say you don't have
enough information to answer the question. Don't try to make up an answer.  <</SYS>>  trans-
Neptunian objects.  Composition  Portals:  Simple English  Question: How does the tranformers
architecture work? [/INST]  I'm not able to answer that question as it is not related to the topic
of trans-Neptunian objects or their composition. The Transformers architecture is a machine learning
model and has no relation to astronomy or space exploration. Therefore, I cannot provide an answer
to your question.
CPU times: user 3.31 s, sys: 104 ms, total: 3.42 s
Wall time: 3.45 s


The model does not suffer from hallucination!