# Retrieval-Augmented Generation with Llama-2 Tutorial

Because large language models are computationally intensive, this notebook requires a GPU runtime. In Google Colab, select Runtime -> Change runtime type -> T4 GPU from the top menu.

First, we install the necessary packages.

In [None]:
!pip install pymupdf langchain sentence-transformers torch transformers langchain-community bitsandbytes pinecone-client

# Functions for processing PDF folder/files

In [None]:
import pymupdf
from langchain.text_splitter import RecursiveCharacterTextSplitter
import os

pymupdf.TOOLS.mupdf_display_errors(False) # Prevents MuPDF library issues from printing

def chunk_doc(pdf_path):
  # Convert one PDF to plaintext and chunk that text
  try:
    doc = pymupdf.open(pdf_path) # Opened inside the try block so unreadable files are also caught
    # Get metadata for each PDF, which will later be stored in the vector index
    metadata = {"author": doc.metadata["author"], "title": doc.metadata["title"]}
    text = ""
    for page_num in range(len(doc)):
        page = doc.load_page(page_num)
        text += page.get_text()
    doc.close()
    # chunk_size and chunk_overlap are measured in characters
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=64)
    split_docs = text_splitter.split_text(text)
    return split_docs, metadata
  except Exception as e:
    print(f"error chunking file: {pdf_path} ({e})")
    return [], None

def load_folder(path):
  documents = [] # All chunks from each PDF
  metadata = [] # Metadata for each PDF
  for file in os.listdir(path):
    if file.endswith(".pdf"):
      pdf_path = os.path.join(path, file)
      print(f"loading file: {pdf_path}")
      # Get chunks and metadata for each PDF
      chunks, doc_metadata = chunk_doc(pdf_path)
      if chunks: # Check if any data was actually extracted
        documents.append(chunks)
        metadata.append(doc_metadata)
  return documents, metadata
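
The `chunk_size` and `chunk_overlap` arguments are easiest to understand with a stripped-down example. The sketch below is a plain fixed-width character chunker, not the algorithm `RecursiveCharacterTextSplitter` uses (which also tries to break on paragraph, line, and word boundaries). The function name `sliding_chunks` is made up for illustration.

```python
def sliding_chunks(text, chunk_size=1024, chunk_overlap=64):
    """Split text into fixed-width windows; consecutive windows share chunk_overlap characters.

    Hypothetical helper for illustration only, not part of LangChain.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_chunks(sample, chunk_size=10, chunk_overlap=3)
print(pieces)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last three characters of one piece reappear at the start of the next. The overlap reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.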

Example page after parsing PDFs

In [None]:
doc = pymupdf.open("/content/Angew Chem Int Ed - 2015 - Wu - Fast and Simple Preparation of Iron‐Based Thin Films as Highly Efficient Water‐Oxidation.pdf")
doc.load_page(0).get_text()

'German Edition:\nDOI: 10.1002/ange.201412389\nWater Splitting\nInternational Edition: DOI: 10.1002/anie.201412389\nFast and Simple Preparation of Iron-Based Thin Films as Highly\nEfficient Water-Oxidation Catalysts in Neutral Aqueous Solution**\nYizhen Wu, Mingxing Chen, Yongzhen Han, Hongxia Luo, Xiaojun Su, Ming-Tian Zhang,\nXiaohuan Lin, Junliang Sun, Lei Wang, Liang Deng, Wei Zhang, and Rui Cao*\nAbstract: Water oxidation is the key step in natural and\nartificial photosynthesis for solar-energy conversion. As this\nprocess is thermodynamically unfavorable and is challenging\nfrom a kinetic point of view, the development of highly efficient\ncatalysts with low energy cost is a subject of fundamental\nsignificance. Herein, we report on iron-based films as highly\nefficient water-oxidation catalysts. The films can be quickly\ndeposited onto electrodes from FeII ions in acetate buffer at\npH 7.0 by simple cyclic voltammetry. The extremely low iron\nloading on the electrodes is critic

We call these functions to extract the text of each PDF and split it into chunks

In [None]:
all_chunks, all_metadata = load_folder("/content") # Chunks every document and also returns metadata for each document

loading file: /content/ElectrodepositionOfOrganicSolutions_ofMetals.pdf
loading file: /content/Electrodeposition_of_metals_non-aqeous_solution.pdf
loading file: /content/Naor_2010_ECS_Trans._25_137.pdf
loading file: /content/Effect_of_pH_on_pyrole_electrodeposition.pdf
loading file: /content/Review-Metal_complexes_removal_from_water.pdf
loading file: /content/Electrocatalysis_development_forHydrogen_Evolution.pdf
loading file: /content/Electrodeposition_Fe_in_citrate_solutions_kinetics.pdf
loading file: /content/annurev-matsci-071312-121640.pdf
loading file: /content/Electrodeposition_of_metals_alloys_in_ILs.pdf
loading file: /content/Nano_Structured_Iron_Oxide_and_Hydroxide.pdf
loading file: /content/Advanced_oxidation_prcesses_inWater.pdf
loading file: /content/lipsztajn-osteryoung-2002-electrochemistry-in-neutral-ambient-temperature-ionic-liquids-1-studies-of-iron(iii).pdf
loading file: /content/Miller_2017_J._Electrochem._Soc._164_A796.pdf
loading file: /content/Deep_Euctectic_solv

Example metadata from PDF processing

In [None]:
print(all_metadata)

[{'author': 'Jeanne Roussel', 'title': 'deptekfm'}, {'author': 'Fariba Safizadeh', 'title': 'Electrocatalysis developments for hydrogen evolution reaction in alkaline solutions - A Review'}, {'author': '', 'title': '2234.tif'}, {'author': 'Marek Lipsztajn/Robert A. Osteryoung', 'title': 'Electrochemistry in neutral ambient-temperature ionic liquids. 1. Studies of iron(III), neodymium(III), and lithium(I)'}, {'author': '', 'title': 'PII: 0376-4583(81)90121-7'}, {'author': '', 'title': ''}, {'author': 'Gabriele Panzeri', 'title': 'Electrodeposition of high-purity nanostructured iron films from Fe(II) and Fe(III) non-aqueous solutions based on ethylene glycol'}, {'author': 'Karina Kołodziejczyk', 'title': 'Influence of constant magnetic field on electrodeposition of metals, alloys, conductive polymers, and organic reactions'}, {'author': '', 'title': ''}, {'author': '', 'title': 'doi:10.1016/j.electacta.2009.04.028'}, {'author': '', 'title': 'doi:10.1016/j.desal.2005.07.017'}, {'author': 

In [None]:
test = pymupdf.open("/content/Advanced_oxidation_prcesses_inWater.pdf")

Here we see the metadata that PyMuPDF returns for a single PDF. Other fields might improve search results; this tutorial uses only the title and author.

In [None]:
test.metadata # Example metadata. Only author and title are used here; other fields may be worth trying

{'format': 'PDF 1.7',
 'title': 'Advanced oxidation processes for the decontamination of heavy metal complexes in aquatic systems: A review',
 'author': 'Kosar Hikmat Hama Aziz',
 'subject': 'Case Studies in Chemical and Environmental Engineering, 9 (2024) 100567. doi:10.1016/j.cscee.2023.100567',
 'keywords': 'AOPs,Decomplexation,Heavy metal complexes,Water pollution,Organic ligands,Wastewater treatment',
 'creator': 'Elsevier',
 'producer': 'Acrobat Distiller 8.1.0 (Windows)',
 'creationDate': 'D:20231202193200Z',
 'modDate': 'D:20231202195842Z',
 'trapped': '',
 'encryption': None}
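
As the `all_metadata` output shows, many PDFs carry a blank or unhelpful title and author. One option, sketched below as an assumption rather than something this tutorial does, is to fall back to the file name when a field is blank. The helper name `clean_metadata` and the `"unknown"` placeholder are hypothetical.

```python
import os


def clean_metadata(raw, pdf_path):
    """Return title/author strings, substituting fallbacks for blank or missing values.

    Hypothetical helper; the fallback choices are assumptions, not part of the tutorial.
    """
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    title = (raw.get("title") or "").strip()
    author = (raw.get("author") or "").strip()
    return {
        "title": title or stem,         # fall back to the file name without its extension
        "author": author or "unknown",  # placeholder for a blank author field
    }


print(clean_metadata({"author": "", "title": ""}, "/content/Naor_2010_ECS_Trans._25_137.pdf"))
```

A helper like this could be applied to `doc.metadata` inside `chunk_doc` so that every stored chunk has at least a recognizable title.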

# Initialize Embedding Model


We load an embedding model from Hugging Face to convert each text chunk into a vector. The model used here, all-MiniLM-L6-v2, produces 384-dimensional embeddings.



In [None]:
from torch import cuda
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

embed_model_id = 'sentence-transformers/all-MiniLM-L6-v2'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

embed_model = HuggingFaceEmbeddings(
    model_name=embed_model_id,
    model_kwargs={'device': device},
    encode_kwargs={'device': device, 'batch_size': 32}
)

# Set up the vector store

We need a place to store the embeddings created for each text chunk. We will use Pinecone, a hosted vector database with a free tier. NOTE: An API key is required to use Pinecone. Go to https://www.pinecone.io/ to create a free account and generate an API key.

In [None]:
import os
from pinecone import Pinecone

# initialize connection to pinecone (get API key at app.pinecone.io)
api_key = "PASTE API KEY HERE"

# configure client
pc = Pinecone(api_key=api_key)
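
Pasting a key directly into a notebook makes it easy to leak when the notebook is shared. A minimal alternative, assuming the key has been stored in an environment variable, is to read it at run time and fail early with a clear message if it is missing. The variable name `PINECONE_API_KEY` and the helper `read_api_key` are choices made here for illustration; the tutorial itself does not set them up.

```python
import os


def read_api_key(var_name="PINECONE_API_KEY", environ=None):
    """Return the API key stored in an environment variable, or raise if it is unset.

    Hypothetical helper; the variable name is an assumed convention.
    """
    environ = os.environ if environ is None else environ
    key = environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable before running this cell")
    return key


# Demonstration with a stand-in mapping instead of the real environment
demo_key = read_api_key(environ={"PINECONE_API_KEY": "example-key"})
print(demo_key)
```

With the variable set, the client could then be configured as `Pinecone(api_key=read_api_key())`.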

In [None]:
from pinecone import ServerlessSpec

cloud = os.environ.get('PINECONE_CLOUD') or 'aws'
region = os.environ.get('PINECONE_REGION') or 'us-east-1'

spec = ServerlessSpec(cloud=cloud, region=region)
index_name = "jcsr-test"

We create an index where the vector embeddings will be stored. Its dimension (384) must match the output size of the embedding model. If the index already exists from a previous run with the same set of PDFs, the check below skips creation.

In [None]:
import time

# check if index already exists (it shouldn't if this is first time)
if index_name not in pc.list_indexes().names():
    # if does not exist, create index
    pc.create_index(
        index_name,
        dimension=384,
        metric='cosine',
        spec=spec
    )
    # wait for index to be initialized
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)

# connect to index
index = pc.Index(index_name)
# view index stats
index.describe_index_stats()
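
The `while` loop above polls until the index reports ready, and it would spin indefinitely if that never happened. A hedged sketch of a bounded wait follows. `wait_until` is a hypothetical helper, and the 60-second default is an arbitrary choice.

```python
import time


def wait_until(is_ready, timeout=60.0, interval=1.0):
    """Poll is_ready() until it returns True or timeout seconds pass.

    Returns True if the condition was met, False on timeout.
    Hypothetical helper, not part of the Pinecone client.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in readiness check that succeeds on the third poll
polls = {"count": 0}


def fake_ready():
    polls["count"] += 1
    return polls["count"] >= 3


print(wait_until(fake_ready, timeout=5.0, interval=0.01))
```

In this notebook the predicate would be `lambda: pc.describe_index(index_name).status['ready']`, the same expression the loop above evaluates.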

# Upsert vector embeddings to pinecone

With the chunks prepared, the embedding model configured, and the Pinecone index created, we can now embed the chunks and upsert them. Each record contains a unique ID, the vector embedding, and the metadata associated with that chunk (its text plus the title and author of the source PDF). The vector is used to find the most relevant chunks when a prompt is given, and the metadata is returned alongside them.

In [None]:
for i, doc_chunks in enumerate(all_chunks): # Iterate over one document at a time
  ids = []
  metadata = []
  for j, chunk in enumerate(doc_chunks): # Iterate over each chunk in the doc
    ids.append(f"{i}-{j}") # Unique ID: document index, then chunk index
    metadata.append({"text": chunk, "title": all_metadata[i]["title"], "author": all_metadata[i]["author"]})
  embeds = embed_model.embed_documents(doc_chunks) # One vector per chunk
  index.upsert(vectors=list(zip(ids, embeds, metadata)))
  print(f"Upserted document {i}")
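
The loop above sends every chunk of a document in a single request. For long PDFs that request can become large, so a common pattern is to upsert in fixed-size batches. The sketch below shows only the batching logic on stand-in data. The helper name `batched` and the batch size of 100 are assumptions; each batch would be passed to `index.upsert(vectors=batch)`.

```python
def batched(items, batch_size=100):
    """Yield consecutive slices of items, each with at most batch_size elements.

    Hypothetical helper; the default batch size is an assumed value.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Stand-in records shaped like the (id, vector, metadata) tuples built in the loop
records = [(f"0-{j}", [0.0] * 384, {"text": f"chunk {j}"}) for j in range(250)]
sizes = [len(batch) for batch in batched(records, batch_size=100)]
print(sizes)
```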

In [None]:
index.describe_index_stats()

Now we have a vector database prepared which will allow us to perform a similarity search between the embeddings and any user prompt
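
The index was created with `metric='cosine'`, so a query returns the stored vectors whose cosine similarity to the query embedding is highest. The toy example below reproduces that ranking in plain Python on made-up three-dimensional vectors (the real embeddings have 384 dimensions). `cosine` and `top_k` are illustrative names, and a production vector database does not use a brute-force loop like this one.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=2):
    """Return the ids of the k stored vectors most similar to query, best first."""
    ranked = sorted(vectors, key=lambda vid: cosine(query, vectors[vid]), reverse=True)
    return ranked[:k]


# Made-up vectors keyed by IDs in the same "document-chunk" style used for upserting
stored = {
    "0-0": [1.0, 0.0, 0.0],
    "0-1": [0.9, 0.1, 0.0],
    "1-0": [0.0, 1.0, 0.0],
}
print(top_k([1.0, 0.05, 0.0], stored, k=2))
```

Cosine similarity depends only on the angle between vectors, not their length, which is why it is a common choice for comparing sentence embeddings.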

# Create text generation pipeline

We now need to configure the large language model which will generate the answers. Here we use a model from the Llama-2 family on Hugging Face, but most large language models would work. We again need a credential, this time a Hugging Face access token. Create a free account at https://huggingface.co/, go to Settings -> Access Tokens -> Create New Token, and set the type to Read. The Llama-2 repositories are gated, so the account must also be granted access on the model page before the download will succeed.

In [None]:
from torch import cuda, bfloat16
import transformers

model_id = 'meta-llama/Llama-2-13b-chat-hf'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

# begin initializing HF items, need an access token for these
hf_auth = "PASTE HUGGING FACE ACCESS TOKEN HERE"
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    token=hf_auth  # `token` replaces the deprecated `use_auth_token` argument
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
    token=hf_auth
)
model.eval()
print(f"Model loaded on {device}")



Model loaded on cuda:0
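
A back-of-the-envelope calculation shows why the 4-bit setting in `bnb_config` matters. The estimate below counts weight storage only and treats the parameter count as exactly 13 billion. It ignores the quantization constants, activations, and the attention cache, so real usage is higher.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate weight storage in gigabytes (10**9 bytes), ignoring all overhead."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 13e9  # nominal parameter count for a 13B model (an approximation)
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gb(n_params, bits):.1f} GB")
```

At 16 bits the weights alone would not fit in the 16 GB of memory on a T4, while at 4 bits they do, with room left for activations.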


In [None]:
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    token=hf_auth  # `token` replaces the deprecated `use_auth_token` argument
)



We create the text generation pipeline containing the model and tokenizer, along with some model parameters

In [None]:
generate_text = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    return_full_text=False,  # return only the newly generated text, without the prompt
    task='text-generation',
    # we pass model parameters here too
    temperature=0.1,  # 'randomness' of sampled outputs; must be > 0, lower is more deterministic (only has an effect when sampling is enabled)
    max_new_tokens=512,  # max number of tokens to generate in the output
    repetition_penalty=1.1  # without this output begins repeating
)
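
To make the effect of `temperature` concrete: during sampling, the model's raw scores (logits) are divided by the temperature before being turned into probabilities. The sketch below applies that to three made-up logits. It is a standalone illustration, not code from `transformers`.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (1.0, 0.1):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a temperature of 0.1 nearly all of the probability mass moves to the highest-scoring token, which is why low values give close to deterministic output.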

In [None]:
from langchain.llms import HuggingFacePipeline

llm = HuggingFacePipeline(pipeline=generate_text)


# Create RAG pipeline

Now we put the large language model and embedding database together to create the RAG pipeline


In [None]:
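# Alias the import so it does not shadow the Pinecone client class imported earlier
from langchain.vectorstores import Pinecone as PineconeVectorStore

text_field = 'text'  # field in metadata that contains text content

vectorstore = PineconeVectorStore(
    index, embed_model.embed_query, text_field
)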


In [None]:
from langchain.chains import RetrievalQA

rag_pipeline = RetrievalQA.from_chain_type(
    llm=llm, chain_type='stuff',
    retriever=vectorstore.as_retriever(),
    return_source_documents=True # Include the retrieved source chunks in the answer dictionary
)
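
With `chain_type='stuff'`, the retrieved chunks are concatenated ("stuffed") into a single prompt together with the question. The sketch below mimics that idea with a hypothetical template. The wording is not LangChain's actual prompt, and `build_stuff_prompt` is an illustrative name.

```python
def build_stuff_prompt(question, chunks):
    """Join retrieved chunks into one context block and append the question.

    Hypothetical template for illustration; not the prompt LangChain uses.
    """
    context = "\n\n".join(chunk.strip() for chunk in chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up stand-ins for chunks returned by the retriever
retrieved = [
    "Electrodeposition deposits a metal film onto an electrode using an electric current.",
    "The deposition rate depends on the applied current density.",
]
print(build_stuff_prompt("Explain the process of electrodeposition", retrieved))
```

Because everything goes into one prompt, the combined length of the retrieved chunks and the question has to fit within the model's context window.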

# Example output

The pipeline answers questions by returning a dictionary containing the generated answer (`result`) and the chunks retrieved by the similarity search (`source_documents`)

In [None]:
answer = rag_pipeline("Explain the process of electrodeposition")


In [None]:
answer["result"]

' Electrodeposition is the process of depositing a metal or other conductive material onto an electrode surface using an electric current. It involves the transfer of ions from the electrolyte solution to the electrode surface, where they are reduced or oxidized to form a solid film. The process is typically carried out in an electrochemical cell, where the electrode is immersed in the electrolyte solution and an electric current is applied between the electrode and a counter electrode. The current causes ions to be drawn towards the electrode, where they are deposited as a thin film. The rate of electrodeposition is influenced by factors such as the type of electrode material, the composition of the electrolyte solution, and the applied current density.'

In [None]:
answer["source_documents"]

[Document(metadata={'author': '', 'title': 'nl5b00175 1..6'}, page_content='use a dose ≤0.3 electrons/Å2/s). As such, typical beam eﬀects\nsuch as the formation of bubbles and/or precipitates from the\nbreakdown of the electrolyte are completely avoided.\nTo quantify the electrochemistry that occurs in the operando\nstage, we need to fully understand the distribution of electric\nﬁelds at the electrodes. The in situ TEM liquid holder (Figure\n1a) used for this experiment was supplied by a commercial\nvendor (more details on the stage design are given in\nSupporting Information). To provide insight into the location\nof the deposition reactions that take place during cycling of the\nelectrochemical cell, an Ansoft Maxwell static three-dimen-\nsional (3D) electromagnetic ﬁnite element simulation was used\nto extract quantitative information about the electric ﬁeld\ndistribution along the Pt working electrode in the ec-microchip.\nThe simulation (illustrated in Figure 1b) shows the electr
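
The raw list is hard to scan. A small helper such as the one below can summarize each source. It assumes only that every item exposes `page_content` and, optionally, a `metadata` dictionary, and it uses `SimpleNamespace` objects with shortened text as stand-ins for the returned documents. The name `summarize_sources` is hypothetical.

```python
from types import SimpleNamespace


def summarize_sources(documents, snippet_chars=80):
    """Return one summary line per document: title, author, and the start of its text.

    Hypothetical helper, not part of LangChain.
    """
    lines = []
    for doc in documents:
        meta = getattr(doc, "metadata", None) or {}
        title = meta.get("title") or "(no title)"
        author = meta.get("author") or "(no author)"
        # Collapse newlines and repeated spaces before truncating
        snippet = " ".join(doc.page_content.split())[:snippet_chars]
        lines.append(f"{title} | {author} | {snippet}")
    return lines


# Stand-in objects shaped like the retrieved documents
docs = [
    SimpleNamespace(metadata={"author": "", "title": "nl5b00175 1..6"},
                    page_content="use a dose of electrons.\nAs such, typical beam effects"),
    SimpleNamespace(page_content="Kinetic studies\nIn order to better understand"),
]
for line in summarize_sources(docs):
    print(line)
```

In the notebook this would be called as `summarize_sources(answer["source_documents"])`.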

In [None]:
answer = rag_pipeline("Explain the kinetics of hydrogen evolution reactions in basic solutions")

In [None]:
answer["result"]

' The kinetics of hydrogen evolution reactions in basic solutions can be explained by considering the adsorption behavior and surface coverage of kinetically adsorbed intermediates such as hydrogen during hydrogen evolution. The reaction mechanism can be divided into three steps: electroreduction of water molecules with hydrogen adsorption, electrochemical hydrogen desorption, and chemical desorption. The rate constants for each step can be expressed as functions of the surface coverage of hydrogen and other parameters such as OH- concentration, H2O pressure, and temperature. Understanding the kinetics of hydrogen evolution reactions in basic solutions is important for optimizing the performance of electrolyzers used for hydrogen production.'

In [None]:
answer["source_documents"]

[Document(page_content='and high corrosion stability. However, the enhancement of\nelectrocatalytic activity for HER can be also inﬂuenced by the\ntype of alloying or the method of cathode preparation.\nKinetic studies\nIn order to better understand the kinetics of cathodic HER, it is\nnecessary to determine the mechanism of the reaction. This is\nusually done by characterizing the adsorption behavior and\nsurface coverage of kinetically adsorbed intermediates such\nas hydrogen during hydrogen evolution [9,10]. Many re-\nsearchers investigated the kinetics of the hydrogen evolution\nreaction on different materials [10e16].\nSome of the techniques, most frequently applied for the\nkinetic studies are as per the followings:\nNomenclature\nHER\nhydrogen evolution reaction\nh\noverpotential\ni\ncurrent density\nA\nsurface area\nb\nTafel slope\ni0\nexchange current density\nq1\nnecessary charge for deposition of a monolayer\nq\nfractional coverage\nCF\ndependence of the fractional surface c