# RAG + LLM Assessment

Your task is to create a Retrieval-Augmented Generation (RAG) system using a Large Language Model (LLM). The RAG system should be able to retrieve relevant information from a knowledge base and generate coherent and informative responses to user queries.

Steps:

1. Choose a domain and collect a suitable dataset of documents (at least 5 documents - PDFs or HTML pages) to serve as the knowledge base for your RAG system. Select one of the following topics:
   * latest scientific papers from arxiv.org,
   * fiction books released,
   * legal documents, or
   * social media posts.

   Make sure that the documents are newer than the training dataset of the applied LLM. (20 points)

2. Create three relevant prompts to the dataset, and one irrelevant prompt. (20 points)

3. Load an LLM with at least 5B parameters. (10 points)

4. Test the LLM with your prompts. The goal should be that without the collected dataset your model is unable to answer the question. If it gives you a good answer, select another question to answer and maybe a different dataset. (10 points)

5. Create a LangChain-based RAG system by setting up a vector database from the documents. (20 points)

6. Provide your three relevant and one irrelevant prompts to your RAG system. For the relevant prompts, your RAG system should return relevant answers, and for the irrelevant prompt, an empty answer. (20 points)


5 documents picked:
https://arxiv.org/pdf/2405.04542
https://arxiv.org/pdf/2405.04560
https://arxiv.org/pdf/2405.04564
https://arxiv.org/pdf/2405.04572
https://arxiv.org/pdf/2405.04574

Installing dependencies

In [3]:
!pip install "transformers>=4.32.0" "optimum>=1.12.0" > /dev/null  # quote the specifiers so the shell does not treat ">=" as a redirect
!pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ > /dev/null
!pip install langchain > /dev/null
!pip install chromadb > /dev/null
!pip install sentence_transformers > /dev/null # ==2.2.2
!pip install unstructured > /dev/null
!pip install pdf2image > /dev/null
!pip install pdfminer.six > /dev/null
!pip install unstructured-pytesseract > /dev/null
!pip install unstructured-inference > /dev/null
!pip install faiss-gpu > /dev/null
!pip install pikepdf > /dev/null
!pip install pypdf > /dev/null
!pip install accelerate > /dev/null
!pip install pillow_heif > /dev/null
!pip install -i https://pypi.org/simple/ bitsandbytes > /dev/null

Restart the kernel after installation.

In [None]:
import os
os.kill(os.getpid(), 9)

In [4]:
prompt_1 = "Explain how torsion-rotational transitions in methanol can be used to probe variations in the electron-to-proton mass ratio, as demonstrated by the observations of the Orion-KL molecular cloud in recent 2024 publications."
prompt_2 = "Discuss the feasibility and methodology of detecting solar panels on Earth-like exoplanets as a technosignature, including the implications of high reflectance of silicon in UV-VIS and near-IR wavelengths as detailed in recent astrophysical studies."
prompt_3 = "Describe the component separation method for the Cosmic Microwave Background (CMB) using convolutional neural networks (CNNs), as outlined in recent studies. Highlight the advantages of CNNs over traditional Bayesian methods like COMMANDER in separating CMB from foreground emissions."
prompt_4 = "What is the capital of India as per torsion-rotational transitions in methanol."

In [5]:
import os
from google.colab import userdata
api_key = userdata.get('HF_TOKEN')
os.environ['HUGGINGFACEHUB_API_TOKEN'] = api_key

In [22]:
from huggingface_hub import login
from langchain.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline, BitsAndBytesConfig
from textwrap import fill
from langchain.prompts import PromptTemplate
import locale
from langchain.document_loaders import UnstructuredURLLoader
from langchain.vectorstores.utils import filter_complex_metadata # 'filter_complex_metadata' removes complex metadata that are not in str, int, float or bool format
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA

locale.getpreferredencoding = lambda: "UTF-8"

# You need to provide your private Hugging Face User Access Token
# to access gated models whose licence you have accepted
HUGGINGFACE_UAT=api_key
login(HUGGINGFACE_UAT)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


Next we download a model from Hugging Face:

In [21]:
model_name = "meta-llama/Meta-Llama-3-8B-Instruct" # 8B language model from Meta AI

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             quantization_config=quantization_config,
                                             trust_remote_code=True)

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

gen_cfg = GenerationConfig.from_pretrained(model_name)
gen_cfg.max_new_tokens=512
gen_cfg.temperature=0.0000001 # near-zero instead of 0.0: for RAG we would like deterministic answers
gen_cfg.return_full_text=True
gen_cfg.do_sample=True
gen_cfg.repetition_penalty=1.11

pipe=pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    generation_config=gen_cfg
)

llm = HuggingFacePipeline(pipeline=pipe)



Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Testing the model with basic prompts

In [35]:
template = """
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

{text}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""


prompt = PromptTemplate(
    input_variables=["text"],
    template=template,
)
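The template above follows the Llama 3 Instruct chat layout: each turn starts with a role header and ends with `<|eot_id|>`, and the prompt finishes with an open assistant header so the model continues from there. A minimal sketch of the same layout as a plain function is shown below; the helper name `format_llama3_prompt` is hypothetical and not part of LangChain or transformers.

```python
def format_llama3_prompt(user_text, system_text=None):
    """Build a Llama 3 Instruct style prompt string (illustrative helper)."""
    parts = ["<|begin_of_text|>"]
    if system_text is not None:
        # Optional system turn, closed with the end-of-turn token
        parts.append(
            f"<|start_header_id|>system<|end_header_id|>\n\n{system_text}<|eot_id|>"
        )
    # User turn
    parts.append(
        f"<|start_header_id|>user<|end_header_id|>\n\n{user_text}<|eot_id|>"
    )
    # Open assistant header: generation continues after this point
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


print(format_llama3_prompt("What is RAG?"))
```

In practice, `tokenizer.apply_chat_template` can produce this layout from a list of messages, which avoids hand-writing the special tokens.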


In [9]:
result = llm(prompt.format(text=prompt_1))
print(fill(result.strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>user<|end_header_id|>  Explain how torsion-rotational
transitions in methanol can be used to probe variations in the electron-to-proton mass ratio, as
demonstrated by the observations of the Orion-KL molecular cloud in recent 2024
publications.<|eot_id|><|start_header_id|>assistant<|end_header_id|> I apologize, but I must correct
you - there is no scientific evidence or publication from 2024 (or any other year) that demonstrates
the use of torsion-rotational transitions in methanol to probe variations in the electron-to-proton
mass ratio.  Torsion-rotational transitions refer to a type of spectroscopic transition that occurs
when a molecule rotates and twists around its axis, causing changes in its energy levels. Methanol
(CH3OH) is a polar molecule with a complex rotational spectrum, which has been extensively studied
using various spectroscopic techniques.  However, probing variations in the electron-to-proton mass
ratio using torsion-rotational tr

In [10]:
result = llm(prompt.format(text=prompt_2))
print(fill(result.strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>user<|end_header_id|>  Discuss the feasibility and methodology
of detecting solar panels on Earth-like exoplanets as a technosignature, including the implications
of high reflectance of silicon in UV-VIS and near-IR wavelengths as detailed in recent astrophysical
studies.<|eot_id|><|start_header_id|>assistant<|end_header_id|> Detecting solar panels on Earth-like
exoplanets is an exciting area of research in astrobiology and exoplanetary science. The idea is to
identify technosignatures, which are signs of technological activity or artificial structures, that
could be indicative of intelligent life beyond Earth. Solar panels, being a ubiquitous feature of
human technology, can serve as a potential technosignature.  The feasibility of detecting solar
panels on exoplanets depends on several factors:  1. **Distance**: The farther away the exoplanet is
from us, the more challenging it becomes to detect its surface features. 2. **Atmospheric
conditions**: 

In [11]:
result = llm(prompt.format(text=prompt_3))
print(fill(result.strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>user<|end_header_id|>  Describe the component separation method
for the Cosmic Microwave Background (CMB) using convolutional neural networks (CNNs), as outlined in
recent studies. Highlight the advantages of CNNs over traditional Bayesian methods like COMMANDER in
separating CMB from foreground emissions.<|eot_id|><|start_header_id|>assistant<|end_header_id|> The
Component Separation Method for CMB using Convolutional Neural Networks (CNNs):  Recent studies have
employed Convolutional Neural Networks (CNNs) to separate the Cosmic Microwave Background (CMB) from
foreground emissions, offering a novel approach to component separation. This method leverages the
strengths of deep learning techniques to improve upon traditional Bayesian methods like COMMANDER.
**Traditional Methods:**  COMMANDER (Component-separation with Multi-Frequency Information and Noise
Reduction) is a widely used Bayesian algorithm that separates the CMB from foreground emissions 

In [12]:
result = llm(prompt.format(text=prompt_4))
print(fill(result.strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>user<|end_header_id|>  What is the capital of India as per
torsion-rotational transitions in methanol.<|eot_id|><|start_header_id|>assistant<|end_header_id|> I
think there may be some confusion here!  The capital of India has nothing to do with torsion-
rotational transitions in methanol. The capital of India is New Delhi.  Torsion-rotational
transitions refer to a type of molecular vibration or rotation that occurs in molecules, such as
methanol (CH3OH). These transitions are studied in the field of spectroscopy and are used to
understand the structure and properties of molecules.  Methanol, being a simple molecule, does not
have a capital city! It's a chemical compound composed of one carbon atom, four hydrogen atoms, and
one oxygen atom.  So, I hope this clears up any confusion. Let me know if you have any further
questions!


RAG with local PDFs:

In [15]:
!wget -O document1.pdf --no-check-certificate "https://arxiv.org/pdf/2405.04542"
!wget -O document2.pdf --no-check-certificate "https://arxiv.org/pdf/2405.04560"
!wget -O document3.pdf --no-check-certificate "https://arxiv.org/pdf/2405.04564"
!wget -O document4.pdf --no-check-certificate "https://arxiv.org/pdf/2405.04572"
!wget -O document5.pdf --no-check-certificate "https://arxiv.org/pdf/2405.04574"

--2024-05-09 21:25:57--  https://arxiv.org/pdf/2405.04542
Resolving arxiv.org (arxiv.org)... 151.101.195.42, 151.101.67.42, 151.101.3.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.195.42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 110018 (107K) [application/pdf]
Saving to: ‘document1.pdf’


2024-05-09 21:25:57 (58.8 MB/s) - ‘document1.pdf’ saved [110018/110018]

--2024-05-09 21:25:57--  https://arxiv.org/pdf/2405.04560
Resolving arxiv.org (arxiv.org)... 151.101.3.42, 151.101.67.42, 151.101.195.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.3.42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 371225 (363K) [application/pdf]
Saving to: ‘document2.pdf’


2024-05-09 21:25:57 (62.2 MB/s) - ‘document2.pdf’ saved [371225/371225]

--2024-05-09 21:25:57--  https://arxiv.org/pdf/2405.04564
Resolving arxiv.org (arxiv.org)... 151.101.67.42, 151.101.3.42, 151.101.131.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.67.42|:443... 

Load:

In [16]:
from langchain.document_loaders import UnstructuredPDFLoader

# One loader per downloaded PDF (document1.pdf ... document5.pdf)
pdf_paths = [f"/content/document{i}.pdf" for i in range(1, 6)]
loaders = [UnstructuredPDFLoader(fn) for fn in pdf_paths]


Split it into pieces:

In [17]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

chunked_pdf_doc = []

for loader in loaders:
    print("Loading raw document..." + loader.file_path)
    pdf_doc = loader.load()
    updated_pdf_doc = filter_complex_metadata(pdf_doc)
    print("Splitting text...")
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=256)
    documents = text_splitter.split_documents(updated_pdf_doc)
    chunked_pdf_doc.extend(documents)

len(chunked_pdf_doc)

Loading raw document.../content/document1.pdf


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


Splitting text...
Loading raw document.../content/document2.pdf
Splitting text...
Loading raw document.../content/document3.pdf
Splitting text...
Loading raw document.../content/document4.pdf
Splitting text...
Loading raw document.../content/document5.pdf
Splitting text...


216
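To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to cut at separators such as paragraph breaks, newlines and spaces); the function `chunk_text` is a hypothetical illustration of the sliding-window idea only.

```python
def chunk_text(text, chunk_size=1024, chunk_overlap=256):
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Tiny example: windows of 4 characters, each sharing 2 with its neighbour
print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.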

We compute an embedding vector for each chunk:

In [24]:
embeddings = HuggingFaceEmbeddings() # default model_name="sentence-transformers/all-mpnet-base-v2"

In [31]:
%%time
# Embed every chunk and build the FAISS index (the chunks were already created above,
# so no new text splitter is needed here)
db_pdf = FAISS.from_documents(chunked_pdf_doc, embeddings)

CPU times: user 2.9 s, sys: 799 µs, total: 2.9 s
Wall time: 2.58 s
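Conceptually, the vector store answers a query by comparing the query embedding with every stored chunk embedding and returning the closest ones. The sketch below shows that idea in pure Python with a top-`k` cut and a minimum score. It assumes cosine similarity as the score, which is an assumption made for illustration: the score LangChain's FAISS wrapper actually reports may be computed differently. The names `cosine` and `retrieve` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, indexed_chunks, k=4, score_threshold=0.2):
    """Return up to k (score, text) pairs whose score reaches score_threshold, best first."""
    scored = [(cosine(query_vec, vec), text) for text, vec in indexed_chunks]
    kept = [pair for pair in scored if pair[0] >= score_threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:k]


# Toy 2-D "embeddings" standing in for real sentence embeddings
index = [("methanol", [1.0, 0.0]), ("capital", [0.0, 1.0]), ("mixed", [1.0, 1.0])]
print(retrieve([1.0, 0.0], index))
```

With a threshold in place, a query that matches nothing well yields an empty list, which is the hook for returning an empty answer to an irrelevant prompt.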


In [37]:
prompt_template = """
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Use the following context to answer the question at the end. Do not use any other information. If you can't find the relevant information in the context, just say you don't have enough information to answer the question. Don't try to make up an answer.

{context}<|eot_id|><|start_header_id|>user<|end_header_id|>

{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""
prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])

In [38]:
%%time

Chain_pdf = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db_pdf.as_retriever(search_type="similarity_score_threshold", search_kwargs={'k': 4, 'score_threshold': 0.2}),
    chain_type_kwargs={"prompt": prompt},
)
result = Chain_pdf.invoke(prompt_1)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just say you don't have enough information to answer the question. Don't try to make up
an answer.  Torsion-rotational transitions in methanol as a probe of fundamental physical constants
– electron and proton masses  J. S. Vorotyntseva1), S. A. Levshakov  Ioﬀe Institute,
Politechnicheskaya 26, St. Petersburg 194021, Russia  Поступила в редакцию  We report on the using
of torsion-rotational transitions in the CH3OH and 13CH3OH molecules to evaluate possible variations
of the physical constant µ = me/mp – the electron-to-proton mass ratio – from spectral observations
of emission lines detected in the microwave range towards the dense molecular cloud Orion- KL. An
estimate of the upper limit on the relative changes in µ is obtained by two independent ways – with
13CH3OH lines 
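The output above begins with the system instructions followed by retrieved text because the "stuff" chain type simply places all retrieved chunks into the `{context}` slot of the prompt before calling the model. A rough sketch of that assembly step follows; `stuff_prompt` is a hypothetical helper, and joining chunks with a blank line is an assumption about the separator.

```python
def stuff_prompt(template, chunks, question, separator="\n\n"):
    """Fill a prompt template by concatenating all retrieved chunks into one context block."""
    context = separator.join(chunks)
    return template.format(context=context, question=question)


demo_template = "Context:\n{context}\n\nQuestion: {question}"
print(stuff_prompt(demo_template, ["chunk one", "chunk two"], "What is mu?"))
```

Because every chunk is inserted verbatim, `k` and the chunk size together bound how much of the model's context window the retrieved text can consume.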

In [39]:
%%time

Chain_pdf = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db_pdf.as_retriever(search_type="similarity_score_threshold", search_kwargs={'k': 4, 'score_threshold': 0.2}),
    chain_type_kwargs={"prompt": prompt},
)
result = Chain_pdf.invoke(prompt_2)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just say you don't have enough information to answer the question. Don't try to make up
an answer.  6. CONCLUSION  We analyzed the detectability of silicon-based solar panels (with anti-
reﬂective coatings) as a signature of extrater- restrial technology. Assuming a 8-meter HWO-like
telescope, focusing on the reﬂection edge in the UV-VIS, and considering varying land coverage of
solar panels on an Earth-like exoplanet that match the present and projected energy needs, we
estimate that several hundreds of hours of observation time is needed to reach a SNR of ∼ 5 for a
high land coverage of ∼ 23%.  t s a r t n o c  0.6  0.4  3  2  0.2  1  0 0.2  0.4  0.6  0.8 1.4
Wavelength (micron)  1  1.2  1.6  1.8  0  0  500  1000 Observation time (hours)  1500  (a)  (b)
Figure 3. (a) Reﬂec

In [40]:
%%time

Chain_pdf = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db_pdf.as_retriever(search_type="similarity_score_threshold", search_kwargs={'k': 4, 'score_threshold': 0.2}),
    chain_type_kwargs={"prompt": prompt},
)
result = Chain_pdf.invoke(prompt_3)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just say you don't have enough information to answer the question. Don't try to make up
an answer.  The aim of this project is to recover the CMB anisotropies maps in temperature and
polar- ized intensity by means of a deep convolutional neural network (CNN) which, after appropiate
training, can remove the foregrounds from Planck and QUIJOTE data. The results are then compared
with those obtained by COMMANDER, based on Bayesian parametric component sep- aration. The CNN
successfully recovered the CMB signal for both All Sky and Partial Sky maps showing frequency
dependant results, being optimum for central frequencies where there is less contamination by
foregrounds emissions such as galactic synchrotron and thermal dust emissions. Recovered maps in
temperature are consisten

In [41]:
%%time

Chain_pdf = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db_pdf.as_retriever(search_type="similarity_score_threshold", search_kwargs={'k': 4, 'score_threshold': 0.2}),
    chain_type_kwargs={"prompt": prompt},
)
result = Chain_pdf.invoke(prompt_4)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just say you don't have enough information to answer the question. Don't try to make up
an answer.  4 2 0 2  r p A 5 2  ]  A G . h p - o r t s a [  1 v 2 4 5 4 0 . 5 0 4 2 : v i X r a
Письма в ЖЭТФ  Torsion-rotational transitions in methanol as a probe of fundamental physical
constants – electron and proton masses  J. S. Vorotyntseva1), S. A. Levshakov  Ioﬀe Institute,
Politechnicheskaya 26, St. Petersburg 194021, Russia  Поступила в
редакцию<|eot_id|><|start_header_id|>user<|end_header_id|>  What is the capital of India as per
torsion-rotational transitions in methanol.<|eot_id|><|start_header_id|>assistant<|end_header_id|> I
don't have enough information to answer this question. The provided text does not mention anything
about the capital of India or its relation to torsi
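Since the pipeline is configured with `return_full_text=True`, each result echoes the whole prompt (instructions, retrieved context and question) before the model's reply. A minimal sketch for keeping only the reply is given below; it assumes the text after the last assistant header is the answer, and `extract_answer` is a hypothetical helper name.

```python
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"


def extract_answer(full_text):
    """Return only the text generated after the last assistant header."""
    _, found, tail = full_text.rpartition(ASSISTANT_HEADER)
    answer = tail if found else full_text  # fall back to the whole text if no header is present
    return answer.replace("<|eot_id|>", "").strip()


sample = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "What is the capital of India?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|> I don't have enough information."
)
print(extract_answer(sample))
```

Setting `return_full_text=False` on the pipeline is an alternative that avoids the echo at the source.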