<a href="https://colab.research.google.com/github/JunnanShimizu/Deep-Learning/blob/main/Junnan_Shimizu_AIT_RAG_Assessment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG + LLM Assessment

Your task is to create a Retrieval-Augmented Generation (RAG) system using a Large Language Model (LLM). The RAG system should be able to retrieve relevant information from a knowledge base and generate coherent and informative responses to user queries.

Steps:

1. Choose a domain and collect a suitable dataset of documents (at least 5 documents, PDFs or HTML pages) to serve as the knowledge base for your RAG system. Select one of the following topics:
   * the latest scientific papers from arxiv.org,
   * recently released fiction books,
   * legal documents, or
   * social media posts.

   Make sure that the documents are newer than the training dataset of the applied LLM. (20 points)

2. Create three prompts that are relevant to the dataset, and one irrelevant prompt. (20 points)

3. Load an LLM with at least 5B parameters. (10 points)

4. Test the LLM with your prompts. The goal should be that without the collected dataset your model is unable to answer the question. If it gives you a good answer, select another question to answer and maybe a different dataset. (10 points)

5. Create a LangChain-based RAG system by setting up a vector database from the documents. (20 points)

6. Provide your three relevant and one irrelevant prompts to your RAG system. For the relevant prompts, your RAG system should return relevant answers, and for the irrelevant prompt, an empty answer. (20 points)
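
The pass criterion in step 6 can be checked mechanically. The sketch below is one hedged way to do it, not part of the assignment: `check_rag`, `fake_rag`, and the two example prompts are hypothetical stand-ins, and a real run would pass a function that calls the actual RAG chain.

```python
from typing import Callable, List, Tuple


def check_rag(answer_fn: Callable[[str], str],
              cases: List[Tuple[str, bool]]) -> List[bool]:
    """Return one pass/fail flag per (prompt, is_relevant) case.

    A relevant prompt passes when the answer is non-empty;
    an irrelevant prompt passes when the answer is empty.
    """
    results = []
    for prompt, is_relevant in cases:
        answer = answer_fn(prompt).strip()
        results.append(bool(answer) == is_relevant)
    return results


# Hypothetical stand-in for a real RAG chain: it only answers robotics questions.
def fake_rag(prompt: str) -> str:
    return "Some grounded answer." if "robot" in prompt.lower() else ""


cases = [
    ("How are robots used in ultrasound scanning?", True),
    ("What are the primary techniques used in French cooking?", False),
]
print(check_rag(fake_rag, cases))
```

Keeping the expected relevance next to each prompt makes it easy to see which of the four prompts fails when the retriever or the threshold changes.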


In [1]:
# Quote version specifiers so the shell does not treat ">=" as an output redirect
!pip install "transformers>=4.32.0" "optimum>=1.12.0" > /dev/null
!pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ > /dev/null
!pip install langchain > /dev/null
!pip install chromadb > /dev/null
!pip install sentence_transformers > /dev/null # ==2.2.2
!pip install unstructured > /dev/null
!pip install pdf2image > /dev/null
!pip install pdfminer.six > /dev/null
!pip install unstructured-pytesseract > /dev/null
!pip install unstructured-inference > /dev/null
!pip install faiss-gpu > /dev/null
!pip install pikepdf > /dev/null
!pip install pypdf > /dev/null
!pip install accelerate > /dev/null
!pip install pillow_heif > /dev/null
!pip install -i https://pypi.org/simple/ bitsandbytes > /dev/null

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
spacy 3.7.4 requires typer<0.10.0,>=0.3.0, but you have typer 0.12.3 which is incompatible.
weasel 0.3.4 requires typer<0.10.0,>=0.3.0, but you have typer 0.12.3 which is incompatible.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
imageio 2.31.6 requires pillow<10.1.0,>=8.3.2, but you have pillow 10.3.0 which is incompatible.

In [None]:
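import os

# Kill the current kernel process; Colab then restarts the runtime,
# which makes the freshly installed packages importable.
os.kill(os.getpid(), 9)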

In [2]:
import locale
import os
from getpass import getpass
from textwrap import fill

from huggingface_hub import login
from langchain.chains import RetrievalQA
from langchain.document_loaders import UnstructuredURLLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
# filter_complex_metadata drops metadata values that are not str, int, float, or bool
from langchain.vectorstores.utils import filter_complex_metadata
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline, BitsAndBytesConfig

# Force UTF-8 as the preferred encoding
locale.getpreferredencoding = lambda: "UTF-8"

# A private Hugging Face User Access Token is needed to access models
# whose licence you have accepted. Read it from the environment (or prompt
# for it) instead of hard-coding it in the notebook.
HUGGINGFACE_UAT = os.environ.get("HF_TOKEN") or getpass("Hugging Face token: ")
login(HUGGINGFACE_UAT)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from transformers import GenerationConfig, BitsAndBytesConfig

# Use the 8B language model from Meta AI
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"

# Configure model to load in a quantized 8-bit format for efficiency
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

# Load the model with automatic device mapping and quantization
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             quantization_config=quantization_config,
                                             trust_remote_code=True)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

# Configure generation settings
gen_cfg = GenerationConfig.from_pretrained(model_name)
gen_cfg.max_new_tokens = 512
gen_cfg.temperature = 0.0000001  # Near-zero temperature so sampling is effectively deterministic
gen_cfg.return_full_text = True
gen_cfg.do_sample = True
gen_cfg.repetition_penalty = 1.11

# Set up the pipeline
pipe = pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    generation_config=gen_cfg
)

# Wrap the pipeline in a LangChain compatible format
llm = HuggingFacePipeline(pipeline=pipe)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
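
The reason for loading the model in 8-bit can be made concrete with a rough estimate of the memory needed for the weights alone. This is a back-of-the-envelope sketch, not a measurement: `weight_memory_gib` is a hypothetical helper, the parameter count takes the "8B" in the model name at face value, and activations, the KV cache, and quantization overhead are ignored.

```python
def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


n_params = 8e9  # assumption: "8B" parameters, as suggested by the model name

for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gib(n_params, bits):.1f} GiB")
```

Under these assumptions, halving the bits per parameter halves the weight footprint, which is what allows an 8B model to fit on a smaller GPU.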


In [5]:
!wget -O document1.pdf --no-check-certificate "https://arxiv.org/pdf/2405.00600"
!wget -O document2.pdf --no-check-certificate "https://arxiv.org/pdf/2405.00461"
!wget -O document3.pdf --no-check-certificate "https://arxiv.org/pdf/2404.19448"
!wget -O document4.pdf --no-check-certificate "https://arxiv.org/pdf/2404.18926"
!wget -O document5.pdf --no-check-certificate "https://arxiv.org/pdf/2404.18713"

--2024-05-08 21:18:03--  https://arxiv.org/pdf/2405.00600
Resolving arxiv.org (arxiv.org)... 151.101.195.42, 151.101.67.42, 151.101.3.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.195.42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4010642 (3.8M) [application/pdf]
Saving to: ‘document1.pdf’


2024-05-08 21:18:03 (77.1 MB/s) - ‘document1.pdf’ saved [4010642/4010642]

--2024-05-08 21:18:03--  https://arxiv.org/pdf/2405.00461
Resolving arxiv.org (arxiv.org)... 151.101.195.42, 151.101.67.42, 151.101.3.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.195.42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 275056 (269K) [application/pdf]
Saving to: ‘document2.pdf’


2024-05-08 21:18:03 (63.7 MB/s) - ‘document2.pdf’ saved [275056/275056]

--2024-05-08 21:18:03--  https://arxiv.org/pdf/2404.19448
Resolving arxiv.org (arxiv.org)... 151.101.195.42, 151.101.67.42, 151.101.3.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.195.42|:4

In [11]:
from langchain.document_loaders import UnstructuredPDFLoader

# One loader per downloaded paper (document1.pdf to document5.pdf)
pdf_paths = [f"/content/document{i}.pdf" for i in range(1, 6)]
loaders = [UnstructuredPDFLoader(path) for path in pdf_paths]


In [12]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Chunks of up to 1024 characters; the 256-character overlap carries context across chunk boundaries
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=256)

chunked_pdf_doc = []

for loader in loaders:
    print("Loading raw document..." + loader.file_path)
    pdf_doc = loader.load()
    # Keep only simple metadata values (str, int, float, bool)
    updated_pdf_doc = filter_complex_metadata(pdf_doc)
    print("Splitting text...")
    documents = text_splitter.split_documents(updated_pdf_doc)
    chunked_pdf_doc.extend(documents)

# Total number of chunks across all five papers
len(chunked_pdf_doc)

Loading raw document.../content/document1.pdf


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


Splitting text...
Loading raw document.../content/document2.pdf
Splitting text...
Loading raw document.../content/document3.pdf
Splitting text...
Loading raw document.../content/document4.pdf
Splitting text...
Loading raw document.../content/document5.pdf
Splitting text...


262
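
To see what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified sketch that cuts a string into fixed-size character windows. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to break at separators such as paragraph and line breaks; `sliding_chunks` is a hypothetical helper and the alphabet string is toy data.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(text, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each window repeats the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.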

In [14]:
embeddings = HuggingFaceEmbeddings()


In [15]:
%%time
db_pdf = FAISS.from_documents(chunked_pdf_doc, embeddings)

CPU times: user 5.62 s, sys: 28.4 ms, total: 5.65 s
Wall time: 6.19 s
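
Conceptually, a flat vector index answers a query by comparing the query vector with every stored chunk vector and returning the closest ones. The sketch below shows that idea in plain Python, assuming Euclidean (L2) distance; `l2_distance`, `top_k`, and the two-dimensional toy vectors are hypothetical, and a real index works on the much higher-dimensional embeddings produced above.

```python
import math
from typing import List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query: Sequence[float],
          vectors: List[Sequence[float]],
          k: int) -> List[Tuple[int, float]]:
    """Return (index, distance) for the k stored vectors closest to the query."""
    scored = [(i, l2_distance(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Toy data: three "chunk" vectors and one query vector
vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
query = [0.0, 1.0]
nearest = top_k(query, vectors, k=2)
print(nearest)
```

The brute-force scan is linear in the number of chunks, which is fine for a few hundred chunks like the 262 produced here.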


In [16]:
prompts = [
     "How are robotics being integrated into healthcare practices based on the newest research, and what are the potential benefits and challenges highlighted?",
     "What are the latest advancements in robotics AI as discussed in recent academic papers, particularly in terms of autonomous decision-making algorithms?",
     "Can you summarize the ethical considerations related to the deployment of robots in public spaces as detailed in the latest scientific literature?",
     "What are the primary techniques used in French cooking, and how do they differ from Italian cooking techniques?"
]

In [53]:
responses_without_dataset = [pipe(prompt) for prompt in prompts]
for prompt, response in zip(prompts, responses_without_dataset):
    print(f"Prompt: {prompt}\nResponse: {fill(str(response), width=100)}\n")

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Prompt: How are robotics being integrated into healthcare practices based on the newest research, and what are the potential benefits and challenges highlighted?
Response: [{'generated_text': 'How are robotics being integrated into healthcare practices based on the newest
research, and what are the potential benefits and challenges highlighted? The integration of
robotics in healthcare is a rapidly growing field that has been gaining significant attention in
recent years. Robotics can be used to assist with various tasks such as surgery, patient care, and
rehabilitation.\n\nRecent studies have shown that robotic systems can improve the accuracy and
efficiency of surgical procedures, reduce recovery time for patients, and enhance the overall
quality of life for individuals with chronic conditions (1). For example, robotic-assisted
laparoscopic surgery has been found to reduce blood loss and postoperative complications compared to
traditional open surgery (2).\n\nIn addition to surgery, 

In [50]:
rag = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db_pdf.as_retriever(search_type="similarity_score_threshold", search_kwargs={'k': 4, 'score_threshold': 0.2}),
    chain_type_kwargs={}
)
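
The retriever settings `k` and `score_threshold` can be read as a two-stage filter: drop every chunk whose relevance score is below the threshold, then keep at most `k` of the rest. The sketch below illustrates this and adds one possible guard that returns an empty answer when nothing survives the filter. It assumes scores where higher means more relevant; `filter_by_score`, `answer_or_empty`, and the placeholder answer string are hypothetical and do not reproduce LangChain's internals.

```python
from typing import List, Tuple


def filter_by_score(scored_docs: List[Tuple[str, float]],
                    k: int,
                    score_threshold: float) -> List[str]:
    """Keep at most k documents whose score is at least score_threshold, best first."""
    kept = [(doc, score) for doc, score in scored_docs if score >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in kept[:k]]


def answer_or_empty(scored_docs: List[Tuple[str, float]],
                    k: int = 4,
                    score_threshold: float = 0.2) -> str:
    """Return an empty answer when no document passes the threshold."""
    docs = filter_by_score(scored_docs, k, score_threshold)
    if not docs:
        return ""  # nothing relevant retrieved: skip the LLM call entirely
    # Placeholder for the real LLM call on the retrieved context
    return f"<LLM answer grounded in {len(docs)} chunk(s)>"


print(repr(answer_or_empty([("chunk a", 0.10), ("chunk b", 0.05)])))
print(answer_or_empty([("chunk a", 0.90), ("chunk b", 0.10), ("chunk c", 0.50)]))
```

Short-circuiting before the LLM call is one way to guarantee an empty answer for an irrelevant prompt; without such a guard the model may fall back on its own knowledge when the context is empty.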

In [52]:
for prompt in prompts:
    print(f"Prompt: {prompt}")
    result = rag.invoke(prompt)
    response = result['result'].strip()
    if response:
        print(f"Response: {fill(response, width=100)}\n")
    else:
        print("No relevant documents found.\n")

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Prompt: How are robotics being integrated into healthcare practices based on the newest research, and what are the potential benefits and challenges highlighted?


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Response: Use the following pieces of context to answer the question at the end. If you don't know the answer,
just say that you don't know, don't try to make up an answer.  IV. CONCLUSION  In this work, we
propose a novel embodied intelligence system to improve ultrasound robotics by integrating the LLM
and domain knowledge. First, our ultrasound operation knowledge database enables the LLM to perform
precise motion planning, ensuring accurate and efficient scanning procedures. Then, our dynamic
ultrasound scanning strategy empowers the LLM to dynamically adjust motion planning strategies
during the scanning process, adapting to real-time observations and optimizing the scanning quality.
Our system addresses the limitations of current ultrasound robots by en- abling them to understand
human intentions and instructions, thereby facilitating autonomous ultrasound scanning.  REFERENCES
[1] V. Chan and A. Perlas, “Basics of ultrasound imaging,” Atlas of ultrasound-guided procedures in
int

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Response: Use the following pieces of context to answer the question at the end. If you don't know the answer,
just say that you don't know, don't try to make up an answer.  [38] F. Ebert, Y. Yang, K.
Schmeckpeper, B. Bucher, G. Georgakis, K. Dani- ilidis, C. Finn, and S. Levine, “Bridge data:
Boosting generalization of robotic skills with cross-domain datasets,” RSS, 2022. 2  [39] R. S.
Sutton, “Dyna, an integrated architecture for learning, planning,  and reacting,” ACM Sigart
Bulletin, 1991. 2  [40] T. Walsh, S. Goschin, and M. Littman, “Integrating sample-based planning and
model-based reinforcement learning,” in Proceedings of the AAAI Conference on Artificial
Intelligence, 2010. 2  [41] J. Pineau, G. Gordon, S. Thrun, et al., “Point-based value iteration: An
anytime algorithm for pomdps,” in Ijcai, 2003. 2  [42] K. Chua, R. Calandra, R. McAllister, and S.
Levine, “Deep rein- forcement learning in a handful of trials using probabilistic dynamics  models,”
in Proceedings of the 32n

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Response: Use the following pieces of context to answer the question at the end. If you don't know the answer,
just say that you don't know, don't try to make up an answer.    Question: Can you summarize the
ethical considerations related to the deployment of robots in public spaces as detailed in the
latest scientific literature? Helpful Answer: The deployment of robots in public spaces raises
several ethical concerns, including privacy and surveillance issues, potential biases in decision-
making algorithms, and the risk of accidents or harm caused by malfunctioning robots. Additionally,
there are concerns about the impact on human-robot interaction and the potential for social
isolation. Furthermore, there is a need for transparency and accountability in the development and
deployment of these technologies. It's essential to consider these ethical implications when
designing and implementing robot systems in public spaces.  Note: This response provides a concise
summary of the main 
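
The responses above begin with the prompt template itself because the pipeline returns the full text, prompt included. One hedged way to show only the model's answer is to cut everything up to the `Helpful Answer:` marker that appears in the output above. `extract_answer` is a hypothetical helper and the `echoed` string is shortened toy data.

```python
ANSWER_MARKER = "Helpful Answer:"


def extract_answer(full_text: str, marker: str = ANSWER_MARKER) -> str:
    """Return the text after the first marker, or the whole text if the marker is absent."""
    _, found, tail = full_text.partition(marker)
    return tail.strip() if found else full_text.strip()


# Toy example: a shortened echo of the prompt followed by the generated answer
echoed = (
    "Use the following pieces of context to answer the question at the end.\n\n"
    "Question: What concerns do robots in public spaces raise?\n"
    "Helpful Answer: Robots raise privacy concerns. "
)
print(extract_answer(echoed))
```

With the echoed prompt removed, the `if response:` check in the loop tests the generated answer rather than the template text.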