<a href="https://colab.research.google.com/github/Nabizeus/TPC/blob/main/TPC_Barcelona_RAGTutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Retrieval Augmented Generation (RAG)

## Overview
*   Motivation for RAG
*   Idea behind RAG
*   Advantages and Disadvantages
*   Implementation to augment question + answer
*   Advanced applications


#### Imagine you went to live under a rock in August 2006. When you come out in 2024, you are asked how many planets revolve around the sun. What would you say?
![pluto](https://github.com/architvasan/LLMWorkshop/blob/main/rag_images/pluto_planets.jpeg?raw=1)

This is similar to the situation LLMs face: they are trained on data up to a certain point in time and are then asked questions about information they were never trained on. Understandably, an LLM will either be unable to answer or will hallucinate an answer that is probably wrong.

### What can be done?

Have the LLM go to the library using **Retrieval Augmented Generation (RAG)**!

RAG involves adding your own data (via a retrieval tool) to the prompt that you pass into a large language model.


![rag architecture](https://github.com/architvasan/LLMWorkshop/blob/main/rag_images/rag-overview.original.png?raw=1)
Image credit: https://scriv.ai/guides/retrieval-augmented-generation-overview/
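
The idea of adding retrieved data to the prompt can be sketched in a few lines of plain Python. This is only a toy illustration: the three-sentence "library", the word-overlap scoring rule, and the prompt wording are all assumptions made up for this sketch, and a real pipeline would use an embedding model and a vector store instead.

```python
# Toy illustration of retrieval-augmented prompting.
# The documents, scoring rule, and prompt wording are invented for this sketch.

def score(query: str, document: str) -> int:
    """Count how many distinct query words also appear in the document."""
    query_words = set(query.lower().split())
    document_words = set(document.lower().split())
    return len(query_words & document_words)


def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents that share the most words with the query."""
    ranked = sorted(documents, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:k]


def augment(query: str, documents: list[str], k: int = 1) -> str:
    """Place the retrieved documents in front of the question."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


library = [
    "In August 2006 the IAU reclassified Pluto as a dwarf planet.",
    "Eight planets revolve around the sun.",
    "FAISS is a library for similarity search.",
]

prompt = augment("How many planets revolve around the sun?", library)
print(prompt)
```

The language model never changes here; only the prompt does, which is why RAG can add knowledge without retraining.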

RAG has been shown to improve LLM prediction accuracy without needing to increase parameter size.

![rag architecture](https://github.com/architvasan/LLMWorkshop/blob/main/rag_images/rag_acc_v_size.png?raw=1)

*Image credit: Yu, Wenhao. "Retrieval-augmented generation across heterogeneous knowledge." Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop. 2022.*

RAG also increases explainability by giving the source for information.

![rag architecture](https://github.com/architvasan/LLMWorkshop/blob/main/rag_images/rag_source_locator.png?raw=1)

Image credit: https://ai.stanford.edu/blog/retrieval-based-NLP/
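
One way to picture source attribution: if every retrieved chunk carries a `source` field, the sources can be returned alongside the context that is sent to the model. The sketch below uses hypothetical chunk dictionaries and file names (`paper_a.pdf`, `paper_b.pdf`) purely for illustration.

```python
# Hypothetical retrieved chunks; in a real pipeline these come from the retriever.
retrieved = [
    {"text": "RFdiffusion generates new protein designs.", "source": "paper_a.pdf"},
    {"text": "SMILES is a text representation of a molecule.", "source": "paper_b.pdf"},
    {"text": "RFdiffusion was implemented on Google Colab.", "source": "paper_a.pdf"},
]


def build_context_with_sources(chunks):
    """Return the context string and the de-duplicated sources in retrieval order."""
    context = "\n".join(chunk["text"] for chunk in chunks)
    # dict.fromkeys keeps the first occurrence of each source and preserves order.
    sources = list(dict.fromkeys(chunk["source"] for chunk in chunks))
    return context, sources


context, sources = build_context_with_sources(retrieved)
print(sources)
```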

## Advantages and Disadvantages

### Advantages

*   Provides domain specific context
*   Improves predictive performance and reduces hallucinations
*   Does not increase model parameters
*   Less labor intensive than fine-tuning LLMs

### Disadvantages

*   May introduce latency since we are adding a relatively costly search step
*   If your dataset includes private information, you may inadvertently expose that information to other users.
*   The data you want to use needs to be curated, and you need to decide how it should be accessed. This adds time to the initial set-up.


# Implementation

### 1. Install + load relevant modules:
*   langchain
*   torch
*   transformers
*   sentence-transformers
*   datasets
*   faiss-cpu  
*   pypdf
*   unstructured[pdf]
*   huggingface_hub (requires a Hugging Face token)




In [None]:
!pip install langchain==0.1.5
!pip install --quiet langchain_experimental
!pip install torch
!pip install transformers
!pip install faiss-cpu
!pip install pypdf
!pip install sentence-transformers
!pip install unstructured==0.12.3
!pip install unstructured[pdf]==0.12.3
!pip install tiktoken
!pip install huggingface_hub




In [None]:
# Download supporting data from llm-workshop + MIT OpenCourseWare

!git clone https://github.com/argonne-lcf/llm-workshop.git
!wget https://ocw.mit.edu/courses/6-00-introduction-to-computer-science-and-programming-fall-2008/e1c8c4fcfc48f347033239c8a023403d_6-00F08-L10.pdf

Cloning into 'llm-workshop'...
remote: Enumerating objects: 273, done.
remote: Counting objects: 100% (103/103), done.
remote: Compressing objects: 100% (47/47), done.
remote: Total 273 (delta 64), reused 88 (delta 55), pack-reused 170
Receiving objects: 100% (273/273), 44.89 MiB | 30.32 MiB/s, done.
Resolving deltas: 100% (144/144), done.
--2024-06-18 21:56:58--  https://ocw.mit.edu/courses/6-00-introduction-to-computer-science-and-programming-fall-2008/e1c8c4fcfc48f347033239c8a023403d_6-00F08-L10.pdf
Resolving ocw.mit.edu (ocw.mit.edu)... 151.101.194.133, 151.101.130.133, 151.101.66.133, ...
Connecting to ocw.mit.edu (ocw.mit.edu)|151.101.194.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 135352 (132K) [application/pdf]
Saving to: ‘e1c8c4fcfc48f347033239c8a023403d_6-00F08-L10.pdf’


2024-06-18 21:56:59 (6.64 MB/s) - ‘e1c8c4fcfc48f347033239c8a023403d_6-00F08-L10.pdf’ saved [135352/135352]



In [None]:
# Create a HF token key from https://huggingface.co/settings/tokens so that you
# can login to HF from inside this notebook
from huggingface_hub import login

import os
from getpass import getpass

hf_token = getpass('Enter huggingfacehub api token: ')
login(token=hf_token, add_to_git_credential=True)

Enter huggingfacehub api token: ··········
Token is valid (permission: fineGrained).
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


### 2. Choose a dataset to use and then load it into your code
Many types of data can be loaded for RAG, though the most commonly used are PDFs and websites.

To load websites, we could use LangChain's `WebBaseLoader`.

In this example, we work with PDFs and load them using LangChain's `DirectoryLoader`.

The PDFs are stored in the directory `llm-workshop/tutorials/04-rag/PDFs`.



In [None]:
from langchain.document_loaders import DirectoryLoader
loader = DirectoryLoader('llm-workshop/tutorials/04-rag/PDFs', glob="**/*.pdf", show_progress=True)
papers = loader.load()

100%|██████████| 7/7 [00:28<00:00,  4.03s/it]


### 3. Now, we need to split our documents into chunks.
We want each chunk to be larger than a single word but much smaller than an entire page. This matters for the similarity search between the query and the documents: the query is compared against the embedded chunks in the dataset, and the chunks with the greatest similarity are added to the query as context.

It is essential to choose the chunking method according to your data type.
There are different ways to do this:

Fixed size
*   Token: Splits text on tokens. Can chunk tokens together
*   Character: Splits based on some user defined character.

Recursive
*  Recursively splits text. Useful for keeping related pieces of text next to each other.

Document based
*   HTML: Splits text based on HTML-specific characters.
*   Markdown: Splits on Markdown-specific characters
*   Code: Splits text based on characters specific to coding languages.

Semantic chunking
*   Uses embeddings to assess the semantic relationship between neighbouring pieces of text. Essentially, the text is split into sentences, the sentences are grouped into groups of 3, and groups that are similar in the embedding space are merged.
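
The semantic chunking idea can be sketched without any external library. The version below is a simplification and not LangChain's implementation: it compares single consecutive sentences rather than groups of 3, and it uses a bag-of-words count vector as a stand-in for a real embedding model. The function names, the example text, and the `threshold` value are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Stand-in embedding: a bag-of-words count vector (a real encoder would go here)."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(text: str, threshold: float = 0.2) -> list[str]:
    """Start a new chunk whenever consecutive sentences are not similar enough."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for previous, current in zip(sentences, sentences[1:]):
        if cosine(embed(previous), embed(current)) >= threshold:
            chunks[-1].append(current)   # similar: keep in the same chunk
        else:
            chunks.append([current])     # dissimilar: topic change, new chunk
    return [" ".join(chunk) for chunk in chunks]


text = (
    "Merge sort splits the list in half. "
    "Merge sort then merges the sorted halves. "
    "Pluto is a dwarf planet. "
    "Pluto is far from the sun."
)
chunks = semantic_chunks(text)
for chunk in chunks:
    print(repr(chunk))
```

Unlike fixed-size chunking, the chunk boundaries here depend on where the topic changes, not on a character count.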

Here we use the recursive method, which splits the dataset using an ordered list of separators. The default separators are `["\n\n", "\n", " ", ""]`. A large text is first split on `\n\n`. If a resulting piece is still larger than the chunk size, it is split on the next separator, `\n`, and so on until every piece fits within the chunk size. Note that the cell below passes only `"\n\n"`, so the text is split on paragraph breaks alone.
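
A minimal sketch of the recursive idea, assuming the default separator order described above. This is not the LangChain implementation: the real splitter also merges small neighbouring pieces back together up to the chunk size and supports overlap, both of which are left out here. The function name `recursive_split` and the example text are hypothetical.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the first separator; recurse with the next one on pieces that are still too long."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left to try: keep the oversized piece as it is.
        return [text]
    separator, remaining = separators[0], separators[1:]
    if separator == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    pieces = []
    for part in text.split(separator):
        if len(part) <= chunk_size:
            if part:
                pieces.append(part)
        else:
            pieces.extend(recursive_split(part, chunk_size, remaining))
    return pieces


text = "First paragraph.\n\nSecond paragraph is longer\nand has two lines.\n\nThird."
pieces = recursive_split(text, chunk_size=30)
print(pieces)
```

Only the second paragraph exceeds 30 characters, so it alone is split further on `\n`; the other paragraphs are kept whole.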


In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
character_chunker = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0, separators=["\n\n"])
char_chunks = character_chunker.split_documents(papers)

In [None]:
for i in char_chunks[0:5]:
  print(i, "\n")

page_content='7 1 0 2\n\ny a M 7 1\n\n]\n\nG L . s c [\n\n2 v 6 7 0 7 0 . 3 0 7 1 : v i X r a\n\nSMILES Enumeration as Data Augmentation for Neural Network Modeling of Molecules\n\nEsben Jannik Bjerrum*\n\nWildcard Pharmaceutical Consulting, Frødings Allé 41, 2860 Søborg, Denmark *) esben@wildcardconsulting.dk\n\nAbstract' metadata={'source': 'llm-workshop/tutorials/04-rag/PDFs/1703.07076.pdf'} 

page_content='\n\nSimpliﬁed Molecular Input Line Entry System (SMILES) is a single line text representation of a unique molecule. One molecule can however have multiple SMILES strings, which is a reason that canonical SMILES have been deﬁned, which ensures a one to one correspondence between SMILES string and molecule. Here the fact that multiple SMILES represent the same molecule is explored as a technique for data augmentation of a molecular QSAR dataset modeled by a long short term memory (LSTM) cell based neural network. The augmented dataset was 130 times bigger than the original. The net

In [None]:
print(f"{len(papers)} papers have been split into {len(char_chunks)} chunks.")

7 papers have been split into 361 chunks.


#### Example: Comparing Naive Chunking with Semantic Chunking

Using a lecture transcript from MIT OpenCourseWare, [Binary Trees: Fall 2008 Lecture 10](https://ocw.mit.edu/courses/6-00-introduction-to-computer-science-and-programming-fall-2008/resources/6-00f08-l10/), we can see the difference between naive chunking and semantic chunking.

In [None]:
from langchain.docstore.document import Document
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("e1c8c4fcfc48f347033239c8a023403d_6-00F08-L10.pdf")
pages = loader.load_and_split()

# The first page of the lecture is the license; ignore it and get the text of all other pages
lecture_text = "".join(elem.page_content for elem in pages[1:])

lecture =  Document(page_content=lecture_text, metadata={"source": "local"})


In [None]:
#Initialize the encoder model.
from langchain.embeddings import HuggingFaceEmbeddings
model_name = "sentence-transformers/msmarco-distilbert-dot-v5"
model_kwargs = {'device':'cuda'}
encode_kwargs = {'normalize_embeddings':False}

encoder = HuggingFaceEmbeddings(
  model_name = model_name,
  model_kwargs = model_kwargs,
  encode_kwargs=encode_kwargs
)

#Perform semantic chunking.
from langchain_experimental.text_splitter import SemanticChunker

#initializing the splitter.
semantic_chunker = SemanticChunker(encoder, buffer_size=5)

#list of grouped_sentences (buffers)
buffers = semantic_chunker.split_documents([lecture])
buffers = [buffer.page_content for buffer in buffers]
semantic_chunks = semantic_chunker.create_documents(buffers)



In [None]:
from langchain.text_splitter import CharacterTextSplitter
character_chunker = CharacterTextSplitter(chunk_size=500, chunk_overlap=150, separator=" ")
char_chunks = character_chunker.split_documents([lecture])
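
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sliding-window sketch over raw characters. It is an approximation for illustration only: `CharacterTextSplitter` first splits on the separator and then merges pieces, so its chunk boundaries fall on separators rather than at exact character offsets. The function name `fixed_size_chunks` is hypothetical.

```python
def fixed_size_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


windows = fixed_size_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(windows)
```

Each window repeats the last two characters of the previous one, so a sentence cut at a boundary still appears intact in at least one chunk when the overlap is large enough.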

In [None]:
import random
random.shuffle(semantic_chunks)
random.shuffle(char_chunks)

print("Semantic chunking")
for i in semantic_chunks[0:10]:
  print(i, "\n")

print("\n\n ----------- \n\n")

print("Fixed-size character chunking")
for i in char_chunks[0:10]:
  print(i, "\n")

Semantic chunking
page_content="We talked about things you could do to try make sure that happens. You could run \nthrough a little loop to say keep trying until you get one. But one of the ways I could \ndeal with it is what's shown here. And what's this little loop say to do? This little loop \nsays I'm going to write a function or procedures that takes in two messages. I'm going to run through a loop, and I'm going to request some input, which I'm \ngoing to read in with raw input. I'm goi ng to store that into val. And as you might \nexpect, I'm going to then try and see if I can convert that into a float. Oh wait a \nminute, that's a little different than what we did last time, right?" 

page_content="STUDENT: [UNINTELLIGIBLE] \nPROFESSOR: In both lists, right. So this is linear, order n and n is this sum of the \nelement, or sorry, the number of elements in each list. I said I was going to back my \nway into this. That gives me a way to merge things. So here's what merge sort \nw

In [None]:
print(f"Number of chunks produced by semantic chunking: {len(semantic_chunks)}")
print(f"Number of chunks produced by character chunking: {len(char_chunks)}")


Number of chunks produced by semantic chunking: 64
Number of chunks produced by character chunking: 119


#### ProTip: Semantic chunking is not suitable for poorly parsed PDF content.

In [None]:
#Initialize the biomedical domain-specific encoder model.
from langchain.embeddings import HuggingFaceEmbeddings
model_name = "pritamdeka/S-PubMedBert-MS-MARCO"
model_kwargs = {'device':'cuda'}
encode_kwargs = {'normalize_embeddings':False}

biomedical_encoder = HuggingFaceEmbeddings(
  model_name = model_name,
  model_kwargs = model_kwargs,
  encode_kwargs=encode_kwargs
)


In [None]:
#Perform semantic chunking on scientific papers.
from langchain_experimental.text_splitter import SemanticChunker

#initializing the splitter.
semantic_chunker = SemanticChunker(biomedical_encoder, buffer_size=5)

#list of grouped_sentences (buffers)
buffers = semantic_chunker.split_documents(papers)
buffers = [buffer.page_content for buffer in buffers]
semantic_chunks = semantic_chunker.create_documents(buffers)

for i in semantic_chunks[0:10]:
  print(i, "\n")

page_content='7 1 0 2\n\ny a M 7 1\n\n]\n\nG L . s c [\n\n2 v 6 7 0 7 0 . 3 0 7 1 : v i X r a\n\nSMILES Enumeration as Data Augmentation for Neural Network Modeling of Molecules\n\nEsben Jannik Bjerrum*\n\nWildcard Pharmaceutical Consulting, Frødings Allé 41, 2860 Søborg, Denmark *) esben@wildcardconsulting.dk\n\nAbstract\n\nSimpliﬁed Molecular Input Line Entry System (SMILES) is a single line text representation of a unique molecule. One molecule can however have multiple SMILES strings, which is a reason that canonical SMILES have been deﬁned, which ensures a one to one correspondence between SMILES string and molecule. Here the fact that multiple SMILES represent the same molecule is explored as a technique for data augmentation of a molecular QSAR dataset modeled by a long short term memory (LSTM) cell based neural network. The augmented dataset was 130 times bigger than the original. The network trained with the augmented dataset shows better performance on a test set when compare

### 4. Then we embed the chunked texts using a Transformer and create a FAISS vector database
Embedding the chunks makes the text searchable by similarity. Let's investigate the retrieved documents for a query.

Vector databases, also called vector storage, efficiently store and retrieve vector data, which are arrays of numerical values representing points in multi-dimensional space. They're useful for handling data like embeddings from deep learning models or numerical features. Unlike traditional relational databases, which aren't optimized for vectors, vector databases offer efficient storage, indexing, and querying for high-dimensional and variable-length vectors.

Commonly used vector stores include:
1. Chroma
2. FAISS
3. Pinecone
4. Weaviate
5. Qdrant

Here, we build the vector store with FAISS.
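
Conceptually, a vector store keeps one vector per chunk and returns the chunks whose vectors are closest to the query vector. The brute-force sketch below shows that core operation. It is not how FAISS works internally, and the class name `TinyVectorStore`, the 2-dimensional vectors, and the choice of Euclidean distance are assumptions made for illustration.

```python
import math


class TinyVectorStore:
    """Brute-force nearest-neighbour search; real libraries add indexing for speed."""

    def __init__(self):
        self.vectors = []
        self.payloads = []

    def add(self, vector, payload):
        """Store a vector together with the text (or ID) it represents."""
        self.vectors.append(vector)
        self.payloads.append(payload)

    def search(self, query, k=2):
        """Return the k (distance, payload) pairs closest to the query vector."""
        scored = [(math.dist(query, v), p) for v, p in zip(self.vectors, self.payloads)]
        scored.sort(key=lambda pair: pair[0])
        return scored[:k]


store = TinyVectorStore()
store.add([0.9, 0.1], "chunk about proteins")
store.add([0.1, 0.9], "chunk about sorting")
store.add([0.8, 0.3], "chunk about diffusion models")

results = store.search([1.0, 0.0], k=2)
for distance, payload in results:
    print(f"{distance:.3f}  {payload}")
```

This scan compares the query against every stored vector, which becomes slow for large collections; avoiding that full scan is the main job of a dedicated vector index.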

In [None]:
from langchain.vectorstores import FAISS
faiss_vector_db = FAISS.from_documents(semantic_chunks, biomedical_encoder)
question = "Do you have any information on publications about RFDiffusion?"
searchDocs = faiss_vector_db.similarity_search(question)

#investigate top-3 nearest (most relevant) documents for the query.
print(searchDocs[0].page_content)
print(searchDocs[1].page_content)
print(searchDocs[2].page_content)

experimentally characterized designs. J.W., A.L. and W.S. contributed additional code. S.O. implemented RFdiffusion on Google Colab.
9h). Over the binder alone, the experimental structure deviates from the RFdiffusion design by only 0.6 Å (Fig. 6h). These results demonstrate the ability of RFdiffusion to generate new proteins with atomic level accuracy, and to precisely target functionally relevant sites on therapeutically important proteins. Discussion RFdiffusion is a comprehensive improvement over current protein design methods. RFdiffusion readily generates diverse uncondi- tional designs up to 600 residues in length that are accurately pre- dicted by AF2, far exceeding the complexity and accuracy achieved by most previous methods (a recent Hallucination-based approach also achieved high unconditional performance53). Half of our tested unconditional designs express in a soluble way, and have circular dichroism spectra consistent with the design models and high ther- mostability. De

![vector_database](https://github.com/argonne-lcf/llm-workshop/blob/main/tutorials/04-rag/rag_images/vector_database.png?raw=1)

Image credit: https://blog.gopenai.com/primer-on-vector-databases-and-retrieval-augmented-generation-rag-using-langchain-pinecone-37a27fb10546

### 5. Initialize the LLM that will be used for question answering

Here, we use a pretrained model flan-t5-large as part of a HuggingFacePipeline. This will later be chained with the vector database for RAG.

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
from langchain import HuggingFacePipeline

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")
pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer)
llm = HuggingFacePipeline(
   pipeline = pipe,
   model_kwargs={"temperature": 0, "max_length": 2048, "max_new_tokens": 1024, "device":"cuda"},
)




### 6. Retrieve data and use it to answer a question

![rag_workflow](https://github.com/argonne-lcf/llm-workshop/blob/main/tutorials/04-rag/rag_images/rag_workflow.png?raw=1)

Image credit: https://blog.gopenai.com/retrieval-augmented-generation-101-de05e5dc21ef

Let's ask questions that the model could only answer if it actually read the texts!

In [None]:
from langchain.prompts import PromptTemplate

template = """You are an honest and helpful AI. You are always truthful and concise in your answers. Please answer the question with the provided context.
If you don't know the answer, please say I don't know.
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)
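
To see roughly what ends up in the prompt, a template like this one can be filled in by hand. The sketch below uses plain `str.format` with a shortened template and two hypothetical retrieved chunks. Joining the chunks with a blank line is an assumption of this sketch, and the helper name `stuff_prompt` is made up for illustration.

```python
# Shortened version of the template, kept as a plain Python string.
template = """Please answer the question with the provided context.
If you don't know the answer, please say I don't know.
{context}
Question: {question}
Helpful Answer:"""

# Hypothetical retrieved chunks standing in for the retriever's output.
retrieved_chunks = [
    "RFdiffusion is a comprehensive improvement over current protein design methods.",
    "S.O. implemented RFdiffusion on Google Colab.",
]


def stuff_prompt(template: str, chunks: list[str], question: str) -> str:
    """Put all retrieved chunks into the {context} slot of a single prompt."""
    context = "\n\n".join(chunks)
    return template.format(context=context, question=question)


filled_prompt = stuff_prompt(template, retrieved_chunks, "What is RFdiffusion?")
print(filled_prompt)
```

Because every retrieved chunk goes into one prompt, the prompt grows with the number and size of the chunks, so it can exceed the model's maximum input length.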

In [None]:
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
  llm=llm,
  chain_type="stuff",
  retriever=faiss_vector_db.as_retriever(),
  chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)
result = qa_chain({"query": "What technique proposed in 2023 can be used to predict protein folding?"})
print(result["result"])

Token indices sequence length is longer than the specified maximum sequence length for this model (2041 > 512). Running this sequence through the model will result in indexing errors


RoseTTAFold diffusion


Now let's ask the chain where to find the article related to RFdiffusion.

In [None]:
qa_chain({"query": "Where was the RFdiffusion paper published?"})



{'query': 'Where was the RFdiffusion paper published?', 'result': 'Nature'}

In [None]:
qa_chain({"query": "What can I use RFdiffusion model for?"})



{'query': 'What can I use RFdiffusion model for?', 'result': "I don't know"}