# RAG Implementation using Llama-2 model

This is a simple retrieval-augmented generation (RAG) implementation that uses the all-mpnet-base-v2 embedding model, the ChromaDB vector database, and the Llama-2 7B chat model.

In [1]:
import PyPDF2
from sentence_transformers import SentenceTransformer
import chromadb
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Chunking

In [2]:
# PyPDF2 is re-imported here so this cell can run on its own
import PyPDF2

# creating a pdf reader object
reader = PyPDF2.PdfReader('llama-2 paper.pdf')

# print the number of pages in pdf file
print(len(reader.pages))

# print the text of the first page
print(reader.pages[0].extract_text())

# Source: https://stackoverflow.com/questions/45795089/how-can-i-read-pdf-in-python

77
Llama 2 : Open Foundation and Fine-Tuned Chat Models
Hugo Touvron∗Louis Martin†Kevin Stone†
Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra
Prajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen
Guillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller
Cynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou
Hakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa Isabel Kloumann Artem Korenev
Punit Singh Koura Marie-Anne Lachaux Thibaut Lavril Jenya Lee Diana Liskovich
Yinghai Lu Yuning Mao Xavier Martinet Todor Mihaylov Pushkar Mishra
Igor Molybog Yixin Nie Andrew Poulton Jeremy Reizenstein Rashi Rungta Kalyan Saladi
Alan Schelten Ruan Silva Eric Michael Smith Ranjan Subramanian Xiaoqing Ellen Tan Binh Tang
Ross Taylor Adina Williams Jian Xiang Kuan Puxin Xu Zheng Yan Iliyan Zarov Yuchen Zhang
Angela Fan Melanie Kambadur Sharan Narang Aurelien Rodriguez Robert Stojnic
Sergey Edun

In [3]:
pages = [page.extract_text() for page in reader.pages]

In [4]:
document = '\n'.join(pages)

In [5]:
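def get_overlapped_chunks(textin, chunksize, overlapsize):
    # Slide a chunksize-character window forward by (chunksize - overlapsize) each step.
    assert 0 <= overlapsize < chunksize, "overlapsize must be smaller than chunksize"
    return [textin[a:a+chunksize] for a in range(0, len(textin), chunksize-overlapsize)]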

# Source: https://stackoverflow.com/questions/11636079/split-very-long-character-string-into-smaller-character-blocks-with-character-ov
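To see what the overlap does, here is a minimal sketch on a toy string. The function is repeated so the snippet runs on its own; `sample` and `demo_chunks` are illustrative names, not part of the notebook.

```python
def get_overlapped_chunks(textin, chunksize, overlapsize):
    # Same slicing logic as the cell above: each window starts
    # (chunksize - overlapsize) characters after the previous one.
    step = chunksize - overlapsize
    return [textin[a:a + chunksize] for a in range(0, len(textin), step)]

sample = "abcdefghijklmnopqrstuvwxyz"  # toy input, not taken from the paper
demo_chunks = get_overlapped_chunks(sample, chunksize=10, overlapsize=3)

for chunk in demo_chunks:
    print(repr(chunk))

# Every chunk starts with the last 3 characters of the previous chunk.
for left, right in zip(demo_chunks, demo_chunks[1:]):
    assert left[-3:] == right[:3]
```

Because the split is purely character-based, chunk boundaries can fall in the middle of a word, as `chunks[1]` starting with `'hen Zhang'` shows.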

In [6]:
chunks = get_overlapped_chunks(document, 1000, 100)

In [7]:
len(chunks)

281

In [8]:
chunks[0]

'Llama 2 : Open Foundation and Fine-Tuned Chat Models\nHugo Touvron∗Louis Martin†Kevin Stone†\nPeter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra\nPrajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen\nGuillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller\nCynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou\nHakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa Isabel Kloumann Artem Korenev\nPunit Singh Koura Marie-Anne Lachaux Thibaut Lavril Jenya Lee Diana Liskovich\nYinghai Lu Yuning Mao Xavier Martinet Todor Mihaylov Pushkar Mishra\nIgor Molybog Yixin Nie Andrew Poulton Jeremy Reizenstein Rashi Rungta Kalyan Saladi\nAlan Schelten Ruan Silva Eric Michael Smith Ranjan Subramanian Xiaoqing Ellen Tan Binh Tang\nRoss Taylor Adina Williams Jian Xiang Kuan Puxin Xu Zheng Yan Iliyan Zarov Yuchen Zhang\nAngela Fan Melanie Kambadur Sharan Narang Aurelien Rodriguez Robert Stojnic\n

In [9]:
chunks[1]

'hen Zhang\nAngela Fan Melanie Kambadur Sharan Narang Aurelien Rodriguez Robert Stojnic\nSergey Edunov Thomas Scialom∗\nGenAI, Meta\nAbstract\nIn this work, we develop and release Llama 2, a collection of pretrained and fine-tuned\nlarge language models (LLMs) ranging in scale from 7 billion to 70 billion parameters.\nOur fine-tuned LLMs, called Llama 2-Chat , are optimized for dialogue use cases. Our\nmodels outperform open-source chat models on most benchmarks we tested, and based on\nourhumanevaluationsforhelpfulnessandsafety,maybeasuitablesubstituteforclosed-\nsource models. We provide a detailed description of our approach to fine-tuning and safety\nimprovements of Llama 2-Chat in order to enable the community to build on our work and\ncontribute to the responsible development of LLMs.\n∗Equal contribution, corresponding authors: {tscialom, htouvron}@meta.com\n†Second author\nContributions for all the authors can be found in Section A.1.arXiv:2307.09288v2  [cs.CL]  19 Jul 2023\nCo

## Embedding

In [10]:
embedding_model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2', cache_folder = '/data/base_models')
chunk_embeddings = embedding_model.encode(chunks)
chunk_embeddings.shape

(281, 768)

## Indexing

In [11]:
chroma_client = chromadb.Client()

collection = chroma_client.create_collection(name="rag_llama2")

In [12]:
collection.add(
    embeddings = chunk_embeddings,
    documents=chunks,
    ids= [str(i) for i in range(len(chunks))]
)

## Retrieving

In [13]:
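def retrieve_vector_db(query, n_results=3):
    # Embed the query with the same model used for the chunks, then search the collection.
    results = collection.query(
        query_embeddings=embedding_model.encode(query).tolist(),
        n_results=n_results
    )
    # 'documents' holds one list of chunks per query, so callers index [0] for a single query.
    return results['documents']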
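Conceptually, the vector database ranks stored embeddings by their closeness to the query embedding. The sketch below shows a brute-force version of that idea with NumPy, using cosine similarity for illustration; the metric the collection above actually uses depends on how Chroma is configured and may differ. `top_k_cosine` and the toy vectors are hypothetical, not part of the notebook.

```python
import numpy as np

def top_k_cosine(query_vec, doc_vecs, k=3):
    """Return (indices, scores) of the k rows of doc_vecs most similar to query_vec."""
    # Normalise so that the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    order = np.argsort(-scores)[:k]  # highest similarity first
    return order.tolist(), scores[order].tolist()

# Toy 2-dimensional "embeddings" standing in for the (281, 768) matrix.
toy_docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_query = np.array([1.0, 0.1])

top_idx, top_scores = top_k_cosine(toy_query, toy_docs, k=2)
print(top_idx, top_scores)
```

A scan like this is fine for a few hundred chunks and can serve as a sanity check on what the database returns.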

In [14]:
query = "what is llama2 chat"
retrieved_results = retrieve_vector_db(query)

In [15]:
retrieved_results[0]

['ism or cybercrime. We have, however, made efforts to tune the models to avoid these topics and\ndiminish any capabilities they might have offered for those use cases.\nWhile we attempted to reasonably balance safety with helpfulness, in some instances, our safety tuning goes\ntoo far. Users of Llama 2-Chat may observe an overly cautious approach, with the model erring on the side\nof declining certain requests or responding with too many safety details.\nUsersofthepretrainedmodelsneedtobeparticularlycautious,andshouldtakeextrastepsintuningand\ndeployment as described in our Responsible Use Guide.§§\n5.3 Responsible Release Strategy\nReleaseDetails. Wemake Llama 2 availableforbothresearchandcommercialuseat https://ai.meta.\ncom/resources/models-and-libraries/llama/ . Thosewhouse Llama 2 mustcomplywiththetermsof\nthe provided license and our Acceptable Use Policy , which prohibit any uses that would violate applicable\npolicies, laws, rules, and regulations.\nWealsoprovidecodeexamplest

In [16]:
context = '\n\n'.join(retrieved_results[0])
print(context)

ism or cybercrime. We have, however, made efforts to tune the models to avoid these topics and
diminish any capabilities they might have offered for those use cases.
While we attempted to reasonably balance safety with helpfulness, in some instances, our safety tuning goes
too far. Users of Llama 2-Chat may observe an overly cautious approach, with the model erring on the side
of declining certain requests or responding with too many safety details.
Usersofthepretrainedmodelsneedtobeparticularlycautious,andshouldtakeextrastepsintuningand
deployment as described in our Responsible Use Guide.§§
5.3 Responsible Release Strategy
ReleaseDetails. Wemake Llama 2 availableforbothresearchandcommercialuseat https://ai.meta.
com/resources/models-and-libraries/llama/ . Thosewhouse Llama 2 mustcomplywiththetermsof
the provided license and our Acceptable Use Policy , which prohibit any uses that would violate applicable
policies, laws, rules, and regulations.
Wealsoprovidecodeexamplestohelpdeveloper

## Answer Generation

In [17]:
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    cache_dir="/data/ll/base_models",
    device_map='auto'
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf", 
                                          cache_dir="/data/llama2/base_models"
                                         )


In [18]:
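def get_llama2_chat_reponse(prompt, max_new_tokens=50):
    # With device_map='auto' the model's device may differ from `device`, so follow the model.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Greedy decoding for deterministic output (the same effect as a near-zero temperature).
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # The decoded text contains the prompt followed by the completion.
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response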
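Since the decoded output repeats the prompt before the completion, it can help to strip everything up to the closing `[/INST]` tag. The helper below is a minimal sketch; `extract_answer` is a hypothetical name, and it assumes the tag appears in the decoded text exactly as written in the prompt.

```python
def extract_answer(response, end_tag="[/INST]"):
    """Return the text after the last end_tag, or the whole response if the tag is absent."""
    _, sep, tail = response.rpartition(end_tag)
    return tail.strip() if sep else response.strip()

# Toy string standing in for a decoded model output.
echoed = "\n[INST]\nQuestion: example question\n[/INST]\nExample answer text."
print(extract_answer(echoed))
```

An alternative is to slice the generated token ids at the prompt length before decoding, which avoids relying on the tag text.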

In [19]:
prompt = f'''
[INST]
Give answer for the question strictly based on the context provided.

Question: {query}

Context : {context}
[/INST]
'''
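The same template is reused later with a different query and an extra instruction, so it can be wrapped in a small function. This is a sketch only; `build_prompt` and its `extra_instruction` parameter are hypothetical names introduced for illustration.

```python
def build_prompt(query, context, extra_instruction=""):
    """Fill the [INST] template used in this notebook with a query and its context."""
    instruction = "Give answer for the question strictly based on the context provided."
    if extra_instruction:
        instruction = f"{instruction} {extra_instruction}"
    return (
        "\n[INST]\n"
        f"{instruction}\n\n"
        f"Question: {query}\n\n"
        f"Context : {context}\n"
        "[/INST]\n"
    )

example_prompt = build_prompt("what is llama2 chat", "some retrieved text")
print(example_prompt)
```

Passing `extra_instruction="Keep answers short and to the point."` reproduces the wording used in the RAG section.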

In [20]:
prompt

'\n[INST]\nGive answer for the question strictly based on the context provided.\n\nQuestion: what is llama2 chat\n\nContext : ism or cybercrime. We have, however, made efforts to tune the models to avoid these topics and\ndiminish any capabilities they might have offered for those use cases.\nWhile we attempted to reasonably balance safety with helpfulness, in some instances, our safety tuning goes\ntoo far. Users of Llama 2-Chat may observe an overly cautious approach, with the model erring on the side\nof declining certain requests or responding with too many safety details.\nUsersofthepretrainedmodelsneedtobeparticularlycautious,andshouldtakeextrastepsintuningand\ndeployment as described in our Responsible Use Guide.§§\n5.3 Responsible Release Strategy\nReleaseDetails. Wemake Llama 2 availableforbothresearchandcommercialuseat https://ai.meta.\ncom/resources/models-and-libraries/llama/ . Thosewhouse Llama 2 mustcomplywiththetermsof\nthe provided license and our Acceptable Use Policy 

In [21]:
print(get_llama2_chat_reponse(prompt, max_new_tokens=500))


[INST]
Give answer for the question strictly based on the context provided.

Question: what is llama2 chat

Context : ism or cybercrime. We have, however, made efforts to tune the models to avoid these topics and
diminish any capabilities they might have offered for those use cases.
While we attempted to reasonably balance safety with helpfulness, in some instances, our safety tuning goes
too far. Users of Llama 2-Chat may observe an overly cautious approach, with the model erring on the side
of declining certain requests or responding with too many safety details.
Usersofthepretrainedmodelsneedtobeparticularlycautious,andshouldtakeextrastepsintuningand
deployment as described in our Responsible Use Guide.§§
5.3 Responsible Release Strategy
ReleaseDetails. Wemake Llama 2 availableforbothresearchandcommercialuseat https://ai.meta.
com/resources/models-and-libraries/llama/ . Thosewhouse Llama 2 mustcomplywiththetermsof
the provided license and our Acceptable Use Policy , which prohibit 

## RAG
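
The retrieve, prompt, and generate steps can also be combined into one function. The sketch below takes the retriever and generator as arguments so it runs without the vector database or the model; `rag_answer`, `fake_retrieve`, and `fake_generate` are hypothetical names used only for illustration.

```python
def rag_answer(query, retrieve, generate, n_results=5, max_new_tokens=800):
    """Retrieve context for `query`, build the prompt, and return the generator's output."""
    # `retrieve` is expected to return one list of chunks per query, like retrieve_vector_db.
    context = "\n\n".join(retrieve(query, n_results=n_results)[0])
    prompt = (
        "\n[INST]\n"
        "Give answer for the question strictly based on the context provided. "
        "Keep answers short and to the point.\n\n"
        f"Question: {query}\n\n"
        f"Context : {context}\n"
        "[/INST]\n"
    )
    return generate(prompt, max_new_tokens=max_new_tokens)

# Stand-in callables so the sketch is self-contained.
def fake_retrieve(query, n_results=5):
    return [[f"chunk {i} about {query}" for i in range(n_results)]]

def fake_generate(prompt, max_new_tokens=800):
    return prompt + "stub answer"

demo_output = rag_answer("what is RLHF", fake_retrieve, fake_generate, n_results=2)
print(demo_output)
```

In the notebook itself, the call would be `rag_answer(query, retrieve_vector_db, get_llama2_chat_reponse)`.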

In [22]:
# query = "what are different variants of llama2 model"
# query = "what is RLHF"
query = "how is RLHF used in llama2"

retrieved_results = retrieve_vector_db(query, n_results=5)
context = '\n\n'.join(retrieved_results[0])

prompt = f'''
[INST]
Give answer for the question strictly based on the context provided. Keep answers short and to the point.

Question: {query}

Context : {context}
[/INST]
'''

print(get_llama2_chat_reponse(prompt, max_new_tokens=800))


[INST]
Give answer for the question strictly based on the context provided. Keep answers short and to the point.

Question: how is RLHF used in llama2

Context : ieth and Polina Zvyagina, who
helped guide us through the release.
•Our partnerships team including Ash Jhaveri, Alex Boesenberg, Sy Choudhury, Mayumi Matsuno,
Ricardo Lopez-Barquilla, Marc Shedroff, Kelly Michelena, Allie Feinstein, Amit Sangani, Geeta
Chauhan,ChesterHu,CharltonGholson,AnjaKomlenovic,EissaJamil,BrandonSpence,Azadeh
Yazdan, Elisa Garcia Anzano, and Natascha Parks.
•ChrisMarra,ChayaNayak,JacquelinePan,GeorgeOrlin,EdwardDowling,EstebanArcaute,Philom-
ena Lobo, Eleonora Presani, and Logan Kerr, who provided helpful product and technical organiza-
tion support.
46
•Armand Joulin, Edouard Grave, Guillaume Lample, and Timothee Lacroix, members of the original
Llama team who helped get this work started.
•Drew Hamlin, Chantal Mora, and Aran Mun, who gave us some design input on the figures in the
paper.
•Vijai Mohan