<a href="https://colab.research.google.com/github/MonRiller/AITDLAssigment3/blob/main/Copy_of_AIT_RAG_Assessment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG + LLM Assessment

Your task is to create a Retrieval-Augmented Generation (RAG) system using a Large Language Model (LLM). The RAG system should be able to retrieve relevant information from a knowledge base and generate coherent and informative responses to user queries.

Steps:

1. Choose a domain and collect a suitable dataset of documents (at least 5 documents - PDFs or HTML pages) to serve as the knowledge base for your RAG system. Select one of the following topics:
   * latest scientific papers from arxiv.org,
   * fiction books released,
   * legal documents, or
   * social media posts.

   Make sure that the documents are newer than the training dataset of the applied LLM. (20 points)

2. Create three relevant prompts to the dataset, and one irrelevant prompt. (20 points)

3. Load an LLM with at least 5B parameters. (10 points)

4. Test the LLM with your prompts. The goal should be that without the collected dataset your model is unable to answer the question. If it gives you a good answer, select another question to answer and maybe a different dataset. (10 points)

5. Create a LangChain-based RAG system by setting up a vector database from the documents. (20 points)

6. Provide your three relevant and one irrelevant prompts to your RAG system. For the relevant prompts, your RAG system should return relevant answers, and for the irrelevant prompt, an empty answer. (20 points)


In [None]:
!pip install "transformers>=4.32.0" "optimum>=1.12.0" > /dev/null
!pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ > /dev/null
!pip install langchain > /dev/null
!pip install chromadb > /dev/null
!pip install sentence_transformers > /dev/null # ==2.2.2
!pip install unstructured > /dev/null
!pip install pdf2image > /dev/null
!pip install pdfminer.six > /dev/null
!pip install unstructured-pytesseract > /dev/null
!pip install unstructured-inference > /dev/null
!pip install faiss-gpu > /dev/null
!pip install pikepdf > /dev/null
!pip install pypdf > /dev/null
!pip install accelerate > /dev/null
!pip install pillow_heif > /dev/null
!pip install -i https://pypi.org/simple/ bitsandbytes > /dev/null

In [None]:
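import os
# Kill the current kernel process so the runtime restarts and picks up the freshly installed packages
os.kill(os.getpid(), 9)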

In [None]:
from huggingface_hub import login
from langchain.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline, BitsAndBytesConfig
from textwrap import fill
from langchain.prompts import PromptTemplate
import locale
from langchain.document_loaders import UnstructuredURLLoader
from langchain.vectorstores.utils import filter_complex_metadata # 'filter_complex_metadata' removes complex metadata that are not in str, int, float or bool format
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

from langchain.chains import RetrievalQA

locale.getpreferredencoding = lambda: "UTF-8"
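import os
# Read the access token from the environment instead of hard-coding it (the variable name HF_TOKEN is an assumption)
HUGGINGFACE_UAT = os.environ["HF_TOKEN"]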
login(HUGGINGFACE_UAT)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
model_name = "google/gemma-1.1-7b-it"

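# Pass 4-bit quantization through BitsAndBytesConfig; the bare load_in_4bit argument is deprecated
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             quantization_config=BitsAndBytesConfig(load_in_4bit=True),
                                             trust_remote_code=True)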

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
gen_cfg = GenerationConfig.from_pretrained(model_name)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [67]:

gen_cfg.max_new_tokens=512
gen_cfg.temperature=0.00000001 # For RAG we would like to have deterministic answers
gen_cfg.return_full_text=True
gen_cfg.do_sample=True
gen_cfg.repetition_penalty=1.11

pipe=pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    generation_config=gen_cfg
)

llm = HuggingFacePipeline(pipeline=pipe)

template = """
<bos><start_of_turn>user
{text}<end_of_turn>
<start_of_turn>model
"""
prompt = PromptTemplate(
    input_variables=["text"],
    template=template,
)

prompt_template_gemma = """
<bos><start_of_turn>user
Use the following context to Answer the question at the end. Do not use any other information. If you can't find the relevant information in the context, just say you don't have enough information to answer the question. Don't try to make up an answer.

{context}

Question: {question}<end_of_turn>

<start_of_turn>model
"""

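The near-zero temperature set above is a way of approximating greedy decoding. As a minimal sketch in plain Python (not the Transformers sampling code; the toy logits and the helper name `softmax_with_temperature` are made up for illustration), dividing the logits by a very small temperature pushes essentially all probability mass onto the highest-scoring token:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature), computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    # Subtracting the peak keeps every exponent <= 0, so exp() cannot overflow
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

print(softmax_with_temperature(toy_logits, 1.0))   # mass spread over all three tokens
print(softmax_with_temperature(toy_logits, 1e-8))  # mass collapses onto the top token
```

Setting `do_sample=False` is the more direct way to request greedy decoding in Transformers; the tiny temperature gets close to the same behaviour while sampling stays enabled.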
In [75]:
# Baseline: ask the three relevant questions without any retrieved context
questions = [
    "Are large language models good at explaining themselves?",
    "What is POV learning?",
    "In NI Stratum, what happens if there is a conflicting NIF?",
]
for text in questions:
    result = llm(prompt.format(text=text))
    print(fill(result.strip(), width=100))

<bos><start_of_turn>user Are large language models good at explaining themselves?<end_of_turn>
<start_of_turn>model Large Language Models (LLMs) are still in their nascent stages, and **their
ability to explain itself is an ongoing area of research**.  **Strengths:**    * LLMS can generate
human-like text that explains concepts or answers questions.  This includes providing summaries,
paraphrases & explanations based on the input they receive.        * They have access vast amounts
information from which draw conclusions about relationships between words/concepts making it easier
for them understand complex topics better than humans with limited knowledge bases..
<strong>Weaknesses:</strong>            - Lacking transparency : Current LLM explanation methods
often rely black box techniques where inner workings remain opaque even trained experts struggle
interpret results without extensive training data analysis experience., leading users feeling
skeptical when asked "why" something was
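The printed output above echoes the whole prompt before the model's reply, which makes the answer hard to spot. One possible way to keep only the generated part is to cut the text at the last model-turn marker. This is a sketch under the assumption that the output always contains the Gemma turn markers used in the template; `extract_answer` is a hypothetical helper, not part of LangChain or Transformers.

```python
MODEL_TURN = "<start_of_turn>model"
END_TURN = "<end_of_turn>"

def extract_answer(full_text):
    """Return only the text generated after the last model-turn marker."""
    _, marker, answer = full_text.rpartition(MODEL_TURN)
    if not marker:
        # No marker found: fall back to the unmodified text
        return full_text.strip()
    return answer.replace(END_TURN, "").strip()

# Toy example with a made-up reply
sample = (
    "<bos><start_of_turn>user\n"
    "What is POV learning?<end_of_turn>\n"
    "<start_of_turn>model\n"
    "A toy answer."
)
print(extract_answer(sample))
```

In the notebook this could be applied to `result` before calling `fill`, so that only the reply is wrapped and printed.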

In [None]:
# PDF documents from arXiv: the 10 most recent AI papers, all posted within the last couple of days, so they should not be in the Gemma training set
!wget -O document0.pdf --no-check-certificate "https://arxiv.org/pdf/2405.04484"
!wget -O document1.pdf --no-check-certificate "https://arxiv.org/pdf/2405.04480"
!wget -O document2.pdf --no-check-certificate "https://arxiv.org/pdf/2405.04443"
!wget -O document3.pdf --no-check-certificate "https://arxiv.org/pdf/2405.04434"
!wget -O document4.pdf --no-check-certificate "https://arxiv.org/pdf/2405.04432"
!wget -O document5.pdf --no-check-certificate "https://arxiv.org/pdf/2405.04408"
!wget -O document6.pdf --no-check-certificate "https://arxiv.org/pdf/2405.04404"
!wget -O document7.pdf --no-check-certificate "https://arxiv.org/pdf/2405.04395"
!wget -O document8.pdf --no-check-certificate "https://arxiv.org/pdf/2405.04382"
!wget -O document9.pdf --no-check-certificate "https://arxiv.org/pdf/2405.04373"

--2024-05-08 19:33:50--  https://arxiv.org/pdf/2405.04484
Resolving arxiv.org (arxiv.org)... 151.101.3.42, 151.101.131.42, 151.101.195.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.3.42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2141368 (2.0M) [application/pdf]
Saving to: ‘document0.pdf’


2024-05-08 19:33:50 (94.2 MB/s) - ‘document0.pdf’ saved [2141368/2141368]

--2024-05-08 19:33:50--  https://arxiv.org/pdf/2405.04480
Resolving arxiv.org (arxiv.org)... 151.101.3.42, 151.101.131.42, 151.101.195.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.3.42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 744523 (727K) [application/pdf]
Saving to: ‘document1.pdf’


2024-05-08 19:33:50 (42.6 MB/s) - ‘document1.pdf’ saved [744523/744523]

--2024-05-08 19:33:50--  https://arxiv.org/pdf/2405.04443
Resolving arxiv.org (arxiv.org)... 151.101.3.42, 151.101.131.42, 151.101.195.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.3.42|:443.

In [None]:
from langchain.document_loaders import UnstructuredPDFLoader
import os
loaders = [UnstructuredPDFLoader("/content/document{}.pdf".format(i)) for i in range(10)]

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Build the splitter once and reuse it: 1024-character chunks with a 256-character overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=256)
chunked_pdf_doc = []

for loader in loaders:
    print("Loading raw document..." + loader.file_path)
    pdf_doc = loader.load()
    # Drop metadata values that are not str, int, float or bool
    updated_pdf_doc = filter_complex_metadata(pdf_doc)
    print("Splitting text...")
    documents = text_splitter.split_documents(updated_pdf_doc)
    chunked_pdf_doc.extend(documents)

len(chunked_pdf_doc)

Loading raw document.../content/document0.pdf


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


Splitting text...
Loading raw document.../content/document1.pdf
Splitting text...
Loading raw document.../content/document2.pdf
Splitting text...
Loading raw document.../content/document3.pdf
Splitting text...
Loading raw document.../content/document4.pdf
Splitting text...
Loading raw document.../content/document5.pdf
Splitting text...
Loading raw document.../content/document6.pdf
Splitting text...
Loading raw document.../content/document7.pdf
Splitting text...
Loading raw document.../content/document8.pdf
Splitting text...
Loading raw document.../content/document9.pdf
Splitting text...


984
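To make the `chunk_size=1024` and `chunk_overlap=256` settings concrete, here is a simplified sketch of overlapping character windows. It is an illustration only: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs, lines and spaces, so its real chunks are not fixed-width windows. The helper name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size=1024, chunk_overlap=256):
    """Cut text into fixed-size windows where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each new window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks

toy_text = "".join(chr(97 + i % 26) for i in range(2000))  # 2000 characters of filler
chunks = sliding_chunks(toy_text)
print([len(c) for c in chunks])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.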

In [None]:
embeddings = HuggingFaceEmbeddings()
db_pdf = FAISS.from_documents(chunked_pdf_doc, embeddings)
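Conceptually, the vector store maps every chunk to an embedding and answers a query by returning the chunks whose embeddings are closest to the query embedding. The brute-force sketch below shows that idea with two-dimensional toy vectors and cosine similarity. Both are assumptions for illustration: real embeddings have hundreds of dimensions, and the metric the FAISS index actually uses may differ. `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, indexed, k=4):
    """Return the k (score, text) pairs most similar to query_vec, best first."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Made-up chunk texts and embeddings
toy_index = [
    ("chunk about explanations", [1.0, 0.0]),
    ("chunk about snakes", [0.0, 1.0]),
    ("chunk about both", [1.0, 1.0]),
]
for score, text in top_k([1.0, 0.1], toy_index, k=2):
    print(round(score, 3), text)
```

A library such as FAISS performs the same kind of nearest-neighbour lookup, but with index structures that avoid comparing the query against every stored vector one by one in Python.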


In [59]:
prompt_template = """
<bos><start_of_turn>user
Use the following context to Answer the question at the end. Do not use any other information. If you can't find the relevant information in the context, just say you don't have enough information to answer the question. Don't try to make up an answer. If the context does have information about the question, then use that information to answer the question.

{context}

Question: {question}<end_of_turn>

<start_of_turn>model
"""
prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
Chain_pdf = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db_pdf.as_retriever(search_type="similarity_score_threshold", search_kwargs={'k': 4, 'score_threshold': 0.1}),
    chain_type_kwargs={"prompt": prompt},
)
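With `chain_type="stuff"`, the retrieved chunks are placed together into the `{context}` slot of the prompt before a single call to the LLM. The sketch below imitates that step in plain Python. It is not LangChain's implementation: the template is a shortened stand-in for the one above, and the blank-line separator and the name `build_stuff_prompt` are assumptions.

```python
STUFF_TEMPLATE = (
    "<start_of_turn>user\n"
    "Use the following context to answer the question at the end.\n\n"
    "{context}\n\n"
    "Question: {question}<end_of_turn>\n"
    "<start_of_turn>model\n"
)

def build_stuff_prompt(chunks, question, separator="\n\n"):
    """Join all retrieved chunks into one context block and fill the template."""
    context = separator.join(chunks)
    return STUFF_TEMPLATE.format(context=context, question=question)

# Made-up chunks standing in for retriever results
toy_prompt = build_stuff_prompt(
    ["chunk one", "chunk two"],
    "What is POV learning?",
)
print(toy_prompt)
```

Because every chunk lands in one prompt, `k` and `chunk_size` together bound how much context the model has to read in a single call.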

In [60]:
query = "Are large language models good at explaining themselves?"
result = Chain_pdf.invoke(query)
print(fill(result['result'].strip(), width=200))

<bos><start_of_turn>user Use the following context to Answer the question at the end. Do not use any other information. If you can't find the relevant information in the context, just say you don't
have enough information to answer the question. Don't try to make up an answer. If the context does have information about the question, then use that information to answer the question.  Large
Language Models Cannot Explain Themselves  Advait Sarkar Microsoft Research Cambridge, United Kingdom  University of Cambridge Cambridge, United Kingdom  University College London London, United
Kingdom  advait@microsoft.com  Abstract Large language models can be prompted to produce text. They can also be prompted to produce “explanations” of their output. But these are not really
explanations, because they do not accurately reﬂect the mechanical process underlying the prediction. The illusion that they reﬂect the reasoning process can result in signiﬁcant harms. These
“explanations” can be valuable, 

In [62]:
query = "What is POV learning?"
result = Chain_pdf.invoke(query)
print(fill(result['result'].strip(), width=200))

<bos><start_of_turn>user Use the following context to Answer the question at the end. Do not use any other information. If you can't find the relevant information in the context, just say you don't
have enough information to answer the question. Don't try to make up an answer. If the context does have information about the question, then use that information to answer the question.  These
results clearly confirm our hypothesis that the fixation sequence con- tains information that can be used as alignment signal in a POV learning task. Additionally, they show that the combination of
background and sequential information in an Ensemble model leads to the best results.  3  2 Related work  A variety of research areas are related to the work we present. On a technical level, the
field of multi-modal (in our case visual-linguistic) representation learn- ing provides the technical background for our experiments. The novel PCE task that is introduced in this work is related to
Visual Entailme

In [74]:
query = "In NI Stratum, what happens if there is a conflicting NIF?"
result = Chain_pdf.invoke(query)
print(fill(result['result'].strip(), width=200))

<bos><start_of_turn>user Use the following context to Answer the question at the end. Do not use any other information. If you can't find the relevant information in the context, just say you don't
have enough information to answer the question. Don't try to make up an answer. If the context does have information about the question, then use that information to answer the question.
Check_conflict(nis_descriptor)  NIS CSOI  PolicyIC  replyEntry Point  policy = build_policy(resolution)Apply_policy(nis_descriptor, policy)  ConflictResolution  resolution  Figure 11: Conflict
Resolution process flow.  the deletion (release) of resources for this NIF instance to the NIF-C Manager. For all the NIF-C, the NIF-C Manager deletes (releases) the resources and then acknowledges
the completion of resource deletion back to NIF Manager. This completes the deletion of the NIF. Once the NIS is terminated, the NIF Man- ager sends a confirmation to the NIO that the NIFs are termi-
nated.  5.1.5. Other Ope

In [65]:
query = "Is a mamba a dangerous snake?"
result = Chain_pdf.invoke(query)
print(fill(result['result'].strip(), width=200))



<bos><start_of_turn>user Use the following context to Answer the question at the end. Do not use any other information. If you can't find the relevant information in the context, just say you don't
have enough information to answer the question. Don't try to make up an answer. If the context does have information about the question, then use that information to answer the question.    Question:
Is a mamba a dangerous snake?<end_of_turn>  <start_of_turn>model The provided text is unavailable for reference so I am unable provide response regarding whether or no Mambas are considered as
Dangerous snakes
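For the irrelevant prompt the retriever returned no context, yet the model still produced a sentence rather than the empty answer the task asks for. One possible way to get a truly empty answer is to skip generation whenever no chunk passes the score threshold. The sketch below shows that gate with a stub in place of the LLM; `answer_or_empty` and `fake_generate` are hypothetical names, and the scores are made up.

```python
def answer_or_empty(scored_chunks, generate, score_threshold=0.1):
    """Call generate() on chunks that pass the threshold; return "" if none do."""
    relevant = [text for score, text in scored_chunks if score >= score_threshold]
    if not relevant:
        return ""  # nothing relevant retrieved: do not call the LLM at all
    return generate(relevant)

def fake_generate(chunks):
    """Stub standing in for the LLM call; it only reports how many chunks it received."""
    return "answer based on {} chunk(s)".format(len(chunks))

print(repr(answer_or_empty([(0.62, "relevant chunk"), (0.04, "noise")], fake_generate)))
print(repr(answer_or_empty([(0.03, "noise")], fake_generate)))
```

In the notebook, the same check could be made by querying the retriever first and only invoking `Chain_pdf` when it returns at least one document.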
