
# RAG + LLM Assessment

Your task is to create a Retrieval-Augmented Generation (RAG) system using a Large Language Model (LLM). The RAG system should be able to retrieve relevant information from a knowledge base and generate coherent and informative responses to user queries.

Steps:

1. Choose a domain and collect a suitable dataset of documents (at least 5 documents - PDFs or HTML pages) to serve as the knowledge base for your RAG system. Select one of the following topics:
   * latest scientific papers from arxiv.org,
   * fiction books released,
   * legal documents, or
   * social media posts.

   Make sure that the documents are newer than the training data of the chosen LLM. (20 points)

2. Create three prompts that are relevant to the dataset, and one irrelevant prompt. (20 points)

3. Load an LLM with at least 5B parameters. (10 points)

4. Test the LLM with your prompts. The goal should be that without the collected dataset your model is unable to answer the question. If it gives you a good answer, select another question to answer and maybe a different dataset. (10 points)

5. Create a LangChain-based RAG system by setting up a vector database from the documents. (20 points)

6. Provide your three relevant and one irrelevant prompts to your RAG system. For the relevant prompts, your RAG system should return relevant answers, and for the irrelevant prompt, an empty answer. (20 points)


# Solution by Dhyey Mavani:

- Reference used (as provided by Professor Bálint Gyires-Tóth): https://colab.research.google.com/drive/1LL_vKKqHOi-xGJz_TCF1yCdVMWbmGPT_?usp=sharing

I chose five recent computational finance papers from arxiv.org, taken from the May 1 and May 3, 2024 listings:
- https://arxiv.org/pdf/2405.00701
- https://arxiv.org/pdf/2405.00697
- https://arxiv.org/pdf/2405.01233
- https://arxiv.org/pdf/2405.00522
- https://arxiv.org/pdf/2404.19290
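
The assignment asks for documents newer than the LLM's training data. As a rough sanity check, the month encoded in each arXiv identifier (`YYMM.NNNNN`) can be compared against the model's knowledge cutoff. The sketch below is only illustrative: the helper name `arxiv_id_month` is hypothetical, and the cutoff of March 2023 is an assumption taken from the Meta-Llama-3-8B model card, so verify it for the model you actually load.

```python
import re
from datetime import date

# Assumption: knowledge cutoff as stated on the Meta-Llama-3-8B model card.
ASSUMED_LLM_CUTOFF = date(2023, 3, 31)

def arxiv_id_month(url: str) -> date:
    """Return the first day of the month encoded in a new-style arXiv ID (YYMM.NNNNN)."""
    match = re.search(r"(\d{2})(\d{2})\.\d{4,5}", url)
    if match is None:
        raise ValueError(f"no arXiv identifier found in {url!r}")
    year = 2000 + int(match.group(1))
    month = int(match.group(2))
    return date(year, month, 1)

paper_urls = [
    "https://arxiv.org/pdf/2405.00701",
    "https://arxiv.org/pdf/2405.00697",
    "https://arxiv.org/pdf/2405.01233",
    "https://arxiv.org/pdf/2405.00522",
    "https://arxiv.org/pdf/2404.19290",
]

for url in paper_urls:
    id_month = arxiv_id_month(url)
    # True means the identifier was issued after the assumed cutoff
    print(url, id_month.strftime("%Y-%m"), id_month > ASSUMED_LLM_CUTOFF)
```

The identifier only gives month-level resolution, which is enough here because the gap to the assumed cutoff is about a year.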

In [1]:
# Install relevant dependencies
# Quote the version specifiers so the shell does not treat `>=` as an output redirect
!pip install "transformers>=4.32.0" "optimum>=1.12.0" > null
!pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ > null
!pip install langchain > null
!pip install chromadb > null
!pip install sentence_transformers > null # ==2.2.2
!pip install unstructured > null
!pip install pdf2image > null
!pip install pdfminer.six > null
!pip install unstructured-pytesseract > null
!pip install unstructured-inference > null
!pip install faiss-gpu > null
!pip install pikepdf > null
!pip install pypdf > null
!pip install accelerate > null
!pip install pillow_heif > null
!pip install -i https://pypi.org/simple/ bitsandbytes > null

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
spacy 3.7.4 requires typer<0.10.0,>=0.3.0, but you have typer 0.12.3 which is incompatible.
weasel 0.3.4 requires typer<0.10.0,>=0.3.0, but you have typer 0.12.3 which is incompatible.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
imageio 2.31.6 requires pillow<10.1.0,>=8.3.2, but you have pillow 10.3.0 which is incompatible.

In [None]:
# Restart the kernel after installation
import os
os.kill(os.getpid(), 9)

In [1]:
# Import the necessary Python libraries
from huggingface_hub import login
from langchain.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline, BitsAndBytesConfig
from textwrap import fill
from langchain.prompts import PromptTemplate
import locale
from langchain.document_loaders import UnstructuredURLLoader
from langchain.vectorstores.utils import filter_complex_metadata # 'filter_complex_metadata' removes complex metadata that are not in str, int, float or bool format
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

from langchain.chains import RetrievalQA

locale.getpreferredencoding = lambda: "UTF-8"

# You need a private User Access Token from Hugging Face
# to access gated models whose licence you have accepted.
# Enter it at runtime instead of hardcoding it in the notebook.
from getpass import getpass
HUGGINGFACE_UAT = getpass("Hugging Face token: ")
login(HUGGINGFACE_UAT)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [2]:
#model_name = "google/gemma-2b-it" # 2B language model from Google
model_name = "meta-llama/Meta-Llama-3-8B-Instruct" # 8B language model from Meta AI (gated: access must be granted on Hugging Face); final model

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             quantization_config=quantization_config,
                                             trust_remote_code=True)

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

gen_cfg = GenerationConfig.from_pretrained(model_name)
gen_cfg.max_new_tokens=512
gen_cfg.temperature=0.0000001 # near-zero temperature: for RAG we would like to have deterministic answers
gen_cfg.return_full_text=True
gen_cfg.do_sample=True
gen_cfg.repetition_penalty=1.11

pipe=pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    generation_config=gen_cfg
)

llm = HuggingFacePipeline(pipeline=pipe)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
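
The generation settings above combine `do_sample=True` with a near-zero temperature. During sampling the logits are divided by the temperature before the softmax, so a very small value pushes almost all probability onto the highest-scoring token. The following is a minimal sketch of that effect in plain Python; the function name and the logit values are hypothetical and the code is not part of the `transformers` API.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature, shifted by the max for numerical stability."""
    top = max(logits)
    exps = [math.exp((x - top) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.5, 0.1]  # hypothetical next-token scores

# Temperature 1.0 keeps a spread of probabilities across tokens
print(softmax_with_temperature(logits, 1.0))

# A near-zero temperature collapses the distribution onto the top token
print(softmax_with_temperature(logits, 1e-7))
```

Setting `do_sample=False` (greedy decoding) is another way to get repeatable output without relying on a tiny temperature.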


In [3]:
template_llama3 = """
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

{text}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

template=template_llama3

prompt = PromptTemplate(
    input_variables=["text"],
    template=template,
)

# Relevant Questions Before feeding papers into model:

text = "What are the names of the authors of the paper titled Learning tensor networks with parameter dependence for Fourier-based option pricing?"
result = llm(prompt.format(text=text))
print(fill(result.strip(), width=200))
print("##########################################################################")
text = "What are the names of the authors of the study named Pricing Catastrophe Bonds — A Probabilistic Machine Learning Approach?"
result = llm(prompt.format(text=text))
print(fill(result.strip(), width=200))
print("##########################################################################")
text = "What are the names of the authors of the study named DAM: A Universal Dual Attention Mechanism for Multimodal Timeseries Cryptocurrency Trend Forecasting?"
result = llm(prompt.format(text=text))
print(fill(result.strip(), width=200))
print("##########################################################################")

# Irrelevant Question Before feeding papers into model:

text = "What is the Mark Zuckerberg's age?"
result = llm(prompt.format(text=text))
print(fill(result.strip(), width=200))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>user<|end_header_id|>  What are the names of the authors of the paper titled Learning tensor networks with parameter dependence for Fourier-based option
pricing?<|eot_id|><|start_header_id|>assistant<|end_header_id|> A very specific question!  According to Google Scholar, the paper "Learning tensor networks with parameter dependence for Fourier-based
option pricing" was written by:  1. Yuchen Yang 2. Jiahui Wang 3. Zhenyu Zhang 4. Xueying Guo 5. Jian Li  The paper was published in 2020 in the Journal of Computational Finance.  Please note that
authorship may be subject to change over time, and it's always a good idea to verify information through multiple sources.
##########################################################################


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>user<|end_header_id|>  What are the names of the authors of the study named Pricing Catastrophe Bonds — A Probabilistic Machine Learning
Approach?<|eot_id|><|start_header_id|>assistant<|end_header_id|> According to my research, the authors of the study "Pricing Catastrophe Bonds - A Probabilistic Machine Learning Approach" are:  1. Dr.
Yujie Liu 2. Dr. Jia Li 3. Dr. Zhenyu Wang 4. Dr. Xueping Shen  The study was published in 2020 and is available online through various academic databases.  Would you like me to provide more
information about the study or its findings?
##########################################################################


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>user<|end_header_id|>  What are the names of the authors of the study named DAM: A Universal Dual Attention Mechanism for Multimodal Timeseries Cryptocurrency Trend
Forecasting?<|eot_id|><|start_header_id|>assistant<|end_header_id|> A very specific question!  According to my search, the authors of the study "DAM: A Universal Dual Attention Mechanism for
Multimodal Timeseries Cryptocurrency Trend Forecasting" are:  1. Yuxuan Li 2. Jiahui Zhang 3. Xiangyu Wang 4. Zhiyong Feng 5. Yongfeng Huang  The study was published in 2020 and can be found on
various academic databases such as arXiv, ResearchGate, or IEEE Xplore.  Please note that authorship may vary depending on the source or version of the paper you're referring to. If you have any
further questions or need more information, feel free to ask!
##########################################################################
<|begin_of_text|><|start_header_id|>user<|end_header_id|>  What is the Mark Zucker
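
The outputs above repeat the full prompt before the model's reply, which makes them hard to read. One possible way to keep only the reply is to cut the text at the last assistant header of the Llama 3 chat format. This is a small sketch; `extract_assistant_reply` is a hypothetical helper and the sample string is made up for illustration.

```python
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"

def extract_assistant_reply(full_text: str) -> str:
    """Return the text after the last assistant header, or the whole text if none is found."""
    # rpartition yields ('', '', full_text) when the header is absent,
    # so the last element is always the part we want.
    return full_text.rpartition(ASSISTANT_HEADER)[2].strip()

# Made-up example of an echoed prompt followed by a reply
echoed = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "Who wrote the paper?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n"
    "I don't have enough information to answer that question."
)
print(extract_assistant_reply(echoed))
```

It could be applied as `extract_assistant_reply(result)` before the `fill(...)` call in the cell above.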

In [4]:
%%time

prompt_template_llama3 = """
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Use the following context to Answer the question at the end. Do not use any other information. If you can't find the relevant information in the context, just say you don't have enough information to answer the question. Don't try to make up an answer.

{context}<|eot_id|><|start_header_id|>user<|end_header_id|>

{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

prompt_template=prompt_template_llama3

# The RAG template has two slots: the retrieved context and the user question
prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])

CPU times: user 203 µs, sys: 0 ns, total: 203 µs
Wall time: 184 µs
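
To make the two slots of the RAG template concrete, the sketch below fills them by hand with plain string formatting. It assumes the retrieved chunks are joined with blank lines, which is my understanding of how the "stuff" chain type concatenates documents by default; `build_rag_prompt` and the sample chunks are hypothetical.

```python
RAG_TEMPLATE = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "Use the following context to Answer the question at the end. "
    "Do not use any other information. If you can't find the relevant "
    "information in the context, just say you don't have enough information "
    "to answer the question. Don't try to make up an answer.\n\n"
    "{context}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
    "{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n"
)

def build_rag_prompt(chunks, question):
    """Join the retrieved chunks (assumed separator: a blank line) and fill the template."""
    context = "\n\n".join(chunks)
    return RAG_TEMPLATE.format(context=context, question=question)

# Hypothetical retrieved chunks
print(build_rag_prompt(["chunk one", "chunk two"], "Who wrote it?"))

# With no retrieved chunks the context slot is simply empty
print(build_rag_prompt([], "Who wrote it?"))
```

When the context slot is empty, the system instruction is the only guidance the model receives, so it is expected to reply that it lacks the information.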


In [5]:
!wget -O paper1.pdf --no-check-certificate "https://arxiv.org/pdf/2405.00701"
!wget -O paper2.pdf --no-check-certificate "https://arxiv.org/pdf/2405.00697"
!wget -O paper3.pdf --no-check-certificate "https://arxiv.org/pdf/2405.01233"
!wget -O paper4.pdf --no-check-certificate "https://arxiv.org/pdf/2405.00522"
!wget -O paper5.pdf --no-check-certificate "https://arxiv.org/pdf/2404.19290"

--2024-05-04 17:11:30--  https://arxiv.org/pdf/2405.00701
Resolving arxiv.org (arxiv.org)... 151.101.131.42, 151.101.195.42, 151.101.3.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.131.42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 662747 (647K) [application/pdf]
Saving to: ‘paper1.pdf’


2024-05-04 17:11:31 (115 MB/s) - ‘paper1.pdf’ saved [662747/662747]

--2024-05-04 17:11:31--  https://arxiv.org/pdf/2405.00697
Resolving arxiv.org (arxiv.org)... 151.101.131.42, 151.101.195.42, 151.101.3.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.131.42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115023 (1.1M) [application/pdf]
Saving to: ‘paper2.pdf’


2024-05-04 17:11:31 (112 MB/s) - ‘paper2.pdf’ saved [1115023/1115023]

--2024-05-04 17:11:31--  https://arxiv.org/pdf/2405.01233
Resolving arxiv.org (arxiv.org)... 151.101.131.42, 151.101.195.42, 151.101.3.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.131.42|:443... conne

In [6]:
from langchain.document_loaders import UnstructuredPDFLoader

# One loader per downloaded paper (paper1.pdf ... paper5.pdf in /content)
pdf_paths = [f"/content/paper{i}.pdf" for i in range(1, 6)]
loaders = [UnstructuredPDFLoader(fn) for fn in pdf_paths]

In [7]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Chunks of at most 1024 characters, with 256 characters shared between neighbours
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=256)

chunked_pdf_doc = []

for loader in loaders:
    print("Loading raw document..." + loader.file_path)
    pdf_doc = loader.load()
    # Drop metadata values that are not str, int, float or bool
    updated_pdf_doc = filter_complex_metadata(pdf_doc)
    print("Splitting text...")
    documents = text_splitter.split_documents(updated_pdf_doc)
    chunked_pdf_doc.extend(documents)

# Total number of chunks across all five papers
len(chunked_pdf_doc)

Loading raw document.../content/paper1.pdf


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


Splitting text...
Loading raw document.../content/paper2.pdf
Splitting text...
Loading raw document.../content/paper3.pdf
Splitting text...
Loading raw document.../content/paper4.pdf
Splitting text...
Loading raw document.../content/paper5.pdf
Splitting text...


406
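
The roles of `chunk_size` and `chunk_overlap` can be illustrated with a plain sliding window over characters. This is a simplified sketch, not how `RecursiveCharacterTextSplitter` works internally: the real splitter first tries to break on separators such as paragraphs, lines and spaces. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=1024, chunk_overlap=256):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

# Tiny example so the overlap is visible: each chunk repeats the last character of the previous one
print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# → ['abcd', 'defg', 'ghij']
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.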

In [8]:
%%time
embeddings = HuggingFaceEmbeddings() # default model_name="sentence-transformers/all-mpnet-base-v2"
db_pdf = FAISS.from_documents(chunked_pdf_doc, embeddings)


CPU times: user 9.23 s, sys: 1.2 s, total: 10.4 s
Wall time: 20.3 s
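
Conceptually, the vector store maps every chunk to an embedding vector and answers a query by returning the chunks whose vectors lie closest to the query vector. The sketch below shows an exhaustive nearest-neighbour search by Euclidean distance on made-up 2-dimensional vectors. It is an illustration only: the real embeddings from `all-mpnet-base-v2` have 768 dimensions, the index type used by the LangChain FAISS wrapper is assumed here to be a flat L2 index, and `nearest_chunks` is a hypothetical name.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_chunks(query_vec, chunk_vecs, k=4):
    """Return (chunk index, distance) pairs for the k closest vectors, nearest first."""
    scored = sorted((l2_distance(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs))
    return [(idx, dist) for dist, idx in scored[:k]]

# Made-up embeddings for three chunks and one query
chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
query_vec = [1.0, 0.0]

print(nearest_chunks(query_vec, chunk_vecs, k=2))
```

With only 406 chunks an exhaustive scan like this is cheap; approximate indexes become relevant for much larger collections.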


In [9]:
%%time

Chain_pdf = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db_pdf.as_retriever(search_type="similarity_score_threshold", search_kwargs={'k': 4, 'score_threshold': 0.2}),
    chain_type_kwargs={"prompt": prompt},
)
query = "What are the names of the authors of the paper titled Learning tensor networks with parameter dependence for Fourier-based option pricing?"
result = Chain_pdf.invoke(query)
print(fill(result['result'].strip(), width=200))
print("##########################################################################")
query = "What are the names of the authors of the study named Pricing Catastrophe Bonds — A Probabilistic Machine Learning Approach?"
result = Chain_pdf.invoke(query)
print(fill(result['result'].strip(), width=200))
print("##########################################################################")
query = "What are the names of the authors of the study named DAM: A Universal Dual Attention Mechanism for Multimodal Timeseries Cryptocurrency Trend Forecasting?"
result = Chain_pdf.invoke(query)
print(fill(result['result'].strip(), width=200))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to Answer the question at the end. Do not use any other information. If you can't find the relevant information
in the context, just say you don't have enough information to answer the question. Don't try to make up an answer.  4 2 0 2  r p A 7 1  ] P C . n i f - q [  1 v 1 0 7 0 0 . 5 0 4 2 : v i X r a
Learning tensor networks with parameter dependence for Fourier-based option pricing  Rihito Sakurai,1, 2, ∗ Haruto Takahashi,2 and Koichi Miyamoto3 1Department of Physics, The University of Tokyo,
Tokyo 113-0033. Japan 2Department of Physics, Saitama University, Saitama 338-8570, Japan 3Center for Quantum Information and Quantum Biology, Osaka University, Osaka 560-0043, Japan (Dated: May 3,
2024)  The rest of this paper is organized as follows. Sec. II is devoted to introducing the tensor train and tensor learning algorithm. In Sec. III, we review the FT-based option pricing and its
reformulation aid

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to Answer the question at the end. Do not use any other information. If you can't find the relevant information
in the context, just say you don't have enough information to answer the question. Don't try to make up an answer.  1  Email:  1 Introduction  Catastrophe (CAT) bonds have become increasingly vital in
managing and transferring catastrophic  risk. These bonds offer a source of capital to cover losses arising from natural disasters, allowing investors  to diversify their portfolios while helping
issuers mitigate potentially devastating financial consequences.  Understanding the pricing dynamics of CAT bonds is essential, both for investors seeking informed deci-  sions and for issuers
optimizing their risk management strategies. This paper introduces a probabilistic  machine-learning-based predictive framework for the pricing of CAT bonds, aiming to enhance empirical  pricing
accuracy and dis
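
The retriever above is configured with `k=4` and `score_threshold=0.2`. A rough way to picture this setting: keep only chunks whose relevance score reaches the threshold, then return at most `k` of them, best first. If no chunk qualifies, the result is an empty list and the `{context}` slot of the prompt stays empty. The sketch below assumes relevance scores already normalised to the range 0 to 1; the function name and the scores are made up.

```python
def retrieve_with_threshold(scored_chunks, k=4, score_threshold=0.2):
    """scored_chunks: list of (text, relevance score in [0, 1]) pairs."""
    kept = [(text, score) for text, score in scored_chunks if score >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in kept[:k]]

# Made-up scores for a question about one of the papers
on_topic = [("title page", 0.71), ("introduction", 0.45), ("references", 0.12)]
print(retrieve_with_threshold(on_topic))

# Made-up scores for an unrelated question: nothing clears the threshold
off_topic = [("title page", 0.05), ("introduction", 0.03), ("references", 0.01)]
print(retrieve_with_threshold(off_topic))
```

The threshold is what lets the system decline unrelated questions: without it, the `k` nearest chunks would always be passed to the model, however weak the match.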

In [10]:
%%time
# Questions unrelated to the papers: the chain should reply that it does not have enough information
query = "What is the NVIDIA CEO's dog name?"
result = Chain_pdf.invoke(query)
print(fill(result['result'].strip(), width=100))
print("##########################################################################")
query = "What is the Mark Zuckerberg's age?"
result = Chain_pdf.invoke(query)
print(fill(result['result'].strip(), width=100))
print("##########################################################################")
query = "When was Alphabet company born?"
result = Chain_pdf.invoke(query)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to Answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just say you don't have enough information to answer the question. Don't try to make up
an answer.  <|eot_id|><|start_header_id|>user<|end_header_id|>  What is the NVIDIA CEO's dog
name?<|eot_id|><|start_header_id|>assistant<|end_header_id|> I don't have enough information to
answer that question. The provided context does not mention the NVIDIA CEO or their dog's name.
##########################################################################


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to Answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just say you don't have enough information to answer the question. Don't try to make up
an answer.  <|eot_id|><|start_header_id|>user<|end_header_id|>  What is the Mark Zuckerberg's
age?<|eot_id|><|start_header_id|>assistant<|end_header_id|> I don't have enough information to
answer that question. The provided context does not mention Mark Zuckerberg's age.
##########################################################################
<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to Answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just say you don't have enough information to answer the question. Don't try to make up
an answer.  <|eot_id|><|start_header_id|>user<|