<a href="https://colab.research.google.com/github/176567/deeplearning/blob/main/_ait_rag_assessment_ipynb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG + LLM Assessment

Your task is to create a Retrieval-Augmented Generation (RAG) system using a Large Language Model (LLM). The RAG system should be able to retrieve relevant information from a knowledge base and generate coherent and informative responses to user queries.

Steps:

1. Choose a domain and collect a suitable dataset of documents (at least 5 documents, PDFs or HTML pages) to serve as the knowledge base for your RAG system. Select one of the following topics:
   * the latest scientific papers from arxiv.org,
   * recently released fiction books,
   * legal documents, or
   * social media posts.

   Make sure that the documents are newer than the training dataset of the applied LLM. (20 points)

2. Create three prompts that are relevant to the dataset and one prompt that is irrelevant to it. (20 points)

3. Load an LLM with at least 5B parameters. (10 points)

4. Test the LLM with your prompts. The goal is that, without the collected dataset, your model is unable to answer the questions. If it gives a good answer, choose another question and possibly a different dataset. (10 points)

5. Create a LangChain-based RAG system by setting up a vector database from the documents. (20 points)

6. Provide your three relevant and one irrelevant prompts to your RAG system. For the relevant prompts, your RAG system should return relevant answers, and for the irrelevant prompt, an empty answer. (20 points)


In [1]:
!pip install "transformers>=4.32.0" "optimum>=1.12.0"
!pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
!pip install langchain
!pip install chromadb
!pip install sentence_transformers==2.2.2
!pip install unstructured
!pip install pdf2image
!pip install pdfminer.six
!pip install unstructured-pytesseract
!pip install unstructured-inference
!pip install faiss-gpu
!pip install pikepdf
!pip install pypdf
!pip install accelerate
!pip install pillow_heif
!pip install -i https://pypi.org/simple/ bitsandbytes
!pip install bitsandbytes==0.43.0

Looking in indexes: https://pypi.org/simple, https://huggingface.github.io/autogptq-index/whl/cu118/
Collecting faiss-gpu
  Using cached faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (85.5 MB)
Installing collected packages: faiss-gpu
Successfully installed faiss-gpu-1.7.2
Collecting pikepdf
  Using cached pikepdf-8.15.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.4 MB)
Collecting Pillow>=10.0.1 (from pikepdf)
  Downloading pillow-10.3.0-cp310-cp310-manylinux_2_28_x86_64.whl (4.5 MB)
Installing collected packages: Pillow, pikepdf
  Attempting uninstall: Pillow
    Found existing installation: Pillow 9.4.0
    Uninstalling Pillow-9.4.0:
      Successfully uninstalled Pillow-9.4.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the

Collecting pillow_heif
  Downloading pillow_heif-0.16.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.5 MB)
Installing collected packages: pillow_heif
Successfully installed pillow_heif-0.16.0
Looking in indexes: https://pypi.org/simple/
Collecting bitsandbytes
  Downloading bitsandbytes-0.43.1-py3-none-manylinux_2_24_x86_64.whl (119.8 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.43.1
Collecting bitsandbytes==0.43.0
  Downloading bitsandbytes-0.43.0-py3-none-manylinux_2_24_x86_64.whl (102.2 MB)
Installing collected packages: bitsandbytes
  Attempting uninstall: bitsandbyte

In [None]:
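import os

# Kill the current process so that Colab restarts the runtime and picks up the newly installed packages
os.kill(os.getpid(), 9)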

In [1]:
from huggingface_hub import login
from langchain.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline, BitsAndBytesConfig
from textwrap import fill
from langchain.prompts import PromptTemplate
import locale
from langchain.document_loaders import UnstructuredURLLoader
from langchain.vectorstores.utils import filter_complex_metadata # 'filter_complex_metadata' removes complex metadata that are not in str, int, float or bool format
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

from langchain.chains import RetrievalQA

locale.getpreferredencoding = lambda: "UTF-8"

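# Read your private Hugging Face User Access Token from the environment
# (assumed here to be exported as HF_TOKEN) instead of hard-coding it in the notebook.
# The token is needed to access models whose licence you have accepted.
import os
HUGGINGFACE_UAT = os.environ["HF_TOKEN"]
login(HUGGINGFACE_UAT)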

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [2]:
# model_name = "google/gemma-2b-it" # 2B language model from Google (below the 5B-parameter requirement)
model_name = "meta-llama/Meta-Llama-3-8B-Instruct" # 8B language model from Meta AI

#quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             #quantization_config=quantization_config,
                                             trust_remote_code=True)

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

gen_cfg = GenerationConfig.from_pretrained(model_name)
gen_cfg.max_new_tokens=512
gen_cfg.temperature=0.0000001 # effectively 0.0: for RAG we would like to have deterministic answers
gen_cfg.return_full_text=True
gen_cfg.do_sample=True
gen_cfg.repetition_penalty=1.11

pipe=pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    generation_config=gen_cfg
)

llm = HuggingFacePipeline(pipeline=pipe)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [3]:
!wget -O document1.pdf --no-check-certificate "https://arxiv.org/pdf/2405.02339"
!wget -O document2.pdf --no-check-certificate "https://arxiv.org/pdf/2405.03745"
!wget -O document3.pdf --no-check-certificate "https://arxiv.org/pdf/2405.02366"
!wget -O document4.pdf --no-check-certificate "https://arxiv.org/pdf/2405.02368"
!wget -O document5.pdf --no-check-certificate "https://arxiv.org/pdf/2405.02382"

--2024-05-08 22:57:09--  https://arxiv.org/pdf/2405.02339
Resolving arxiv.org (arxiv.org)... 151.101.3.42, 151.101.131.42, 151.101.67.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.3.42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1214140 (1.2M) [application/pdf]
Saving to: ‘document1.pdf’


2024-05-08 22:57:09 (16.9 MB/s) - ‘document1.pdf’ saved [1214140/1214140]

--2024-05-08 22:57:09--  https://arxiv.org/pdf/2405.03745
Resolving arxiv.org (arxiv.org)... 151.101.3.42, 151.101.131.42, 151.101.67.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.3.42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 583511 (570K) [application/pdf]
Saving to: ‘document2.pdf’


2024-05-08 22:57:09 (11.3 MB/s) - ‘document2.pdf’ saved [583511/583511]

--2024-05-08 22:57:09--  https://arxiv.org/pdf/2405.02366
Resolving arxiv.org (arxiv.org)... 151.101.3.42, 151.101.131.42, 151.101.67.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.3.42|:443... 

In [4]:
prompt_template_llama3 = """
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Use the following context to answer the question at the end. Do not use any other information. If you can't find the relevant information in the context, just say you don't have enough information to answer the question. Don't try to make up an answer.

{context}<|eot_id|><|start_header_id|>user<|end_header_id|>

{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""
prompt_template=prompt_template_llama3
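
To see what the model actually receives, the template can be filled in by hand. The sketch below uses plain `str.format`, which is assumed here to behave like `PromptTemplate` for simple `{context}` and `{question}` placeholders. The template is a shortened copy of the one above, and the context and question strings are made up for illustration.

```python
# Shortened copy of the Llama 3 chat template defined above.
template = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "Use the following context to answer the question at the end.\n\n"
    "{context}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
    "{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n"
)


def render_prompt(context: str, question: str) -> str:
    """Fill the two placeholders, as the 'stuff' chain is assumed to do."""
    return template.format(context=context, question=question)


# Hypothetical values, for illustration only.
rendered = render_prompt(
    context="(retrieved chunk 1)\n\n(retrieved chunk 2)",
    question="What is the Juno mission?",
)
print(rendered)
```

The retrieved chunks end up in the system turn and the user's question in the user turn, so the model starts generating right after the assistant header.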


In [5]:
from langchain.document_loaders import UnstructuredPDFLoader

# One loader per downloaded PDF (same order as before: 1, 3, 4, 5, 2)
pdf_files = [f"/content/document{i}.pdf" for i in (1, 3, 4, 5, 2)]
loaders = [UnstructuredPDFLoader(fn) for fn in pdf_files]



In [6]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Chunks of up to 1024 characters, with up to 256 characters shared between neighbouring chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=256)
chunked_pdf_doc = []

for loader in loaders:
    print("Loading raw document..." + loader.file_path)
    pdf_doc = loader.load()
    # Drop metadata values that are not str, int, float or bool
    updated_pdf_doc = filter_complex_metadata(pdf_doc)
    print("Splitting text...")
    documents = text_splitter.split_documents(updated_pdf_doc)
    chunked_pdf_doc.extend(documents)

len(chunked_pdf_doc)

Loading raw document.../content/document1.pdf


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


Splitting text...
Loading raw document.../content/document3.pdf
Splitting text...
Loading raw document.../content/document4.pdf
Splitting text...
Loading raw document.../content/document5.pdf
Splitting text...
Loading raw document.../content/document2.pdf
Splitting text...


264
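
To make the `chunk_size=1024, chunk_overlap=256` setting concrete, here is a simplified sketch of overlapping chunking. It is not what `RecursiveCharacterTextSplitter` does internally (that splitter prefers to cut at separators such as paragraph breaks). The function name `sliding_chunks` and the fixed-window strategy are assumptions made for illustration.

```python
def sliding_chunks(text: str, chunk_size: int = 1024, chunk_overlap: int = 256) -> list:
    """Cut text into fixed-size windows; neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Made-up 2000-character text, for illustration only.
text = "".join(chr(97 + i % 26) for i in range(2000))
chunks = sliding_chunks(text)
print(len(chunks), [len(c) for c in chunks])
```

The overlap means that a sentence cut by one chunk boundary is usually still intact in the neighbouring chunk, at the cost of storing some text twice.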

In [7]:
# No model_name is given, so the wrapper's default sentence-transformers model is used
embeddings = HuggingFaceEmbeddings()


In [8]:
%%time

db_pdf = FAISS.from_documents(chunked_pdf_doc, embeddings)

CPU times: user 5.19 s, sys: 88.1 ms, total: 5.28 s
Wall time: 7.92 s


In [9]:
%%time
prompt = PromptTemplate(template=prompt_template_llama3, input_variables=["context", "question"])

# Return at most 4 chunks, and only those whose relevance score is at least 0.2
retriever = db_pdf.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": 4, "score_threshold": 0.2},
)
Chain_pdf = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # put all retrieved chunks into a single prompt
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},
)


CPU times: user 118 ms, sys: 22.1 ms, total: 140 ms
Wall time: 261 ms
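
The retriever settings `k=4` and `score_threshold=0.2` can be read as "keep at most four chunks, and only those that are similar enough to the query". The sketch below illustrates that idea with plain cosine similarity on toy vectors. It is a stand-in: LangChain's FAISS store derives its relevance score from the index distance with its own formula, and the function names and 2-d "embeddings" here are made up.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, indexed, k=4, score_threshold=0.2):
    """Return up to k (score, text) pairs with score >= score_threshold, best first."""
    scored = [(cosine(query_vec, vec), text) for text, vec in indexed]
    kept = [pair for pair in scored if pair[0] >= score_threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:k]


# Toy 2-d "embeddings", made up for illustration.
indexed = [
    ("chunk about Jupiter", [1.0, 0.0]),
    ("chunk about pulsars", [0.8, 0.6]),
    ("unrelated chunk", [0.0, 1.0]),
]
hits = retrieve([1.0, 0.0], indexed)   # query close to the first two chunks
far = retrieve([-1.0, 0.0], indexed)   # query that matches nothing well
print(hits)
print(far)
```

With a threshold in place, a query that is unrelated to every chunk can come back with no context at all, which is what the irrelevant prompt is meant to exercise.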


In [10]:
query = "What is the Juno mission (Bolton et al., 2017), in orbit around Jupiter?"
result = Chain_pdf.invoke(query)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


OutOfMemoryError: CUDA out of memory. Tried to allocate 1.96 GiB. GPU 0 has a total capacity of 14.75 GiB of which 1.66 GiB is free. Process 37781 has 13.09 GiB memory in use. Of the allocated memory 12.81 GiB is allocated by PyTorch, and 157.49 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
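
The `OutOfMemoryError` above is consistent with the model being too large for this GPU at the precision it was loaded in. A rough estimate, counting only the weights (activations and the KV cache come on top) and assuming roughly 8 billion parameters as the "8B" in the model name suggests:

```python
GIB = 2**30


def weight_gib(n_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / GIB


N_PARAMS = 8e9          # "8B" from the model name; an approximation
GPU_TOTAL_GIB = 14.75   # total capacity reported in the error message

for label, bits in [("float32", 32), ("float16/bfloat16", 16), ("8-bit", 8), ("4-bit", 4)]:
    need = weight_gib(N_PARAMS, bits)
    verdict = "fits" if need < GPU_TOTAL_GIB else "does not fit"
    print(f"{label:>17}: {need:5.1f} GiB of weights -> {verdict}")
```

Under these assumptions, neither full nor half precision leaves room on a 14.75 GiB GPU. Re-enabling the commented-out `quantization_config = BitsAndBytesConfig(load_in_8bit=True)` in the model-loading cell would be one way to bring the weights under the limit. Whether the whole pipeline then fits was not tested here.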

In [11]:
%%time

query = "What do Rotation-powered Pulsars represent?"
result = Chain_pdf.invoke(query)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


OutOfMemoryError: CUDA out of memory. Tried to allocate 1.96 GiB. GPU 0 has a total capacity of 14.75 GiB of which 413.06 MiB is free. Process 37781 has 14.34 GiB memory in use. Of the allocated memory 14.05 GiB is allocated by PyTorch, and 168.42 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [12]:
%%time

query = "What do Central Compact Objects form?"
result = Chain_pdf.invoke(query)
print(fill(result['result'].strip(), width=100))


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 173.06 MiB is free. Process 37781 has 14.58 GiB memory in use. Of the allocated memory 14.19 GiB is allocated by PyTorch, and 270.96 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [13]:
%%time

query = "What are the possible job profiles at NVIDIA?"
result = Chain_pdf.invoke(query)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


OutOfMemoryError: CUDA out of memory. Tried to allocate 1.96 GiB. GPU 0 has a total capacity of 14.75 GiB of which 1.69 GiB is free. Process 37781 has 13.05 GiB memory in use. Of the allocated memory 12.69 GiB is allocated by PyTorch, and 244.51 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
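
Step 6 of the task asks for an empty answer to the irrelevant prompt. One way to get that is to skip generation whenever the thresholded retriever returns no chunks. The sketch below uses hypothetical stand-in functions (`fake_retrieve`, `fake_generate`) rather than the real retriever and chain, and `answer_or_empty` is a name invented for this example.

```python
from typing import Callable, List


def answer_or_empty(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
) -> str:
    """Return "" when retrieval finds nothing; otherwise generate from the chunks."""
    chunks = retrieve(question)
    if not chunks:
        return ""  # nothing relevant in the knowledge base: do not call the LLM
    return generate(question, chunks)


# Hypothetical stand-ins for the retriever and the LLM chain.
def fake_retrieve(question: str) -> List[str]:
    return ["Pulsar chunk"] if "pulsar" in question.lower() else []


def fake_generate(question: str, chunks: List[str]) -> str:
    return f"Answer based on {len(chunks)} chunk(s)."


relevant = answer_or_empty("What do Rotation-powered Pulsars represent?", fake_retrieve, fake_generate)
irrelevant = answer_or_empty("What are the possible job profiles at NVIDIA?", fake_retrieve, fake_generate)
print(repr(relevant))
print(repr(irrelevant))
```

With the real objects, `retrieve` could wrap the thresholded retriever and `generate` the QA chain. That wiring is left out here because it depends on the installed LangChain version.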