<a href="https://colab.research.google.com/github/MateoVB/AIT-Deep-Learning/blob/main/AIT_RAG_Assessment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG + LLM Assessment

Retrieval-Augmented Generation (RAG) is a method that combines the capabilities of large language models with external or proprietary data sources. It involves extracting relevant information from a large corpus and then generating context-appropriate responses to queries.

Your task is to create a Retrieval-Augmented Generation (RAG) system using a Large Language Model (LLM). The RAG system should be able to retrieve relevant information from a knowledge base and generate coherent and informative responses to user queries.

Steps:

1. Choose a domain and collect a suitable dataset of documents (at least 5 documents - PDFs or HTML pages) to serve as the knowledge base for your RAG system. Select one of the following topics:
   * latest scientific papers from arxiv.org,
   * fiction books released,
   * legal documents, or
   * social media posts.

   Make sure that the documents are newer than the training dataset of the applied LLM. (20 points)

2. Create three prompts that are relevant to the dataset, and one irrelevant prompt. (20 points)

3. Load an LLM with at least 5B parameters. (10 points)

4. Test the LLM with your prompts. The goal should be that without the collected dataset your model is unable to answer the question. If it gives you a good answer, select another question to answer and maybe a different dataset. (10 points)

5. Create a LangChain-based RAG system by setting up a vector database from the documents. (20 points)

6. Provide your three relevant and one irrelevant prompts to your RAG system. For the relevant prompts, your RAG system should return relevant answers, and for the irrelevant prompt, an empty answer. (20 points)
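
The retrieve-then-generate flow in steps 5 and 6 can be sketched in plain Python, without LangChain or a real model. This is a toy under stated assumptions: the bag-of-words `embed` function stands in for a real embedding model, the stopword list and the `min_score` threshold are arbitrary illustrative choices, the two chunks are made up, and `echo` is a placeholder for the LLM. It only shows the idea that a query with no sufficiently similar chunk yields an empty answer.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "the", "of", "to", "on", "in", "is", "are", "what", "who"}

def embed(text):
    """Toy 'embedding': bag-of-words counts, standing in for a real embedding model."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=2, min_score=0.2):
    """Return up to k chunks whose similarity to the query reaches min_score."""
    q = embed(query)
    scored = sorted(((cosine(q, embed(c)), c) for c in chunks), reverse=True)
    return [c for score, c in scored[:k] if score >= min_score]

def answer(query, chunks, generate):
    """Retrieve context, then call the generator; return "" when nothing relevant is found."""
    context = retrieve(query, chunks)
    if not context:
        return ""
    prompt = "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}\nAnswer:"
    return generate(prompt)

# Made-up knowledge base and an echo "model", for illustration only.
chunks = [
    "The lighthouse keeper Mara repaired the lamp every winter.",
    "The orchard on the hill belonged to the Voss family.",
]
echo = lambda prompt: prompt

print(answer("Who repaired the lighthouse lamp?", chunks, echo))
print(repr(answer("What are the latest stock market trends?", chunks, echo)))
```

In a real system the threshold would apply to embedding similarity scores from the vector database, and a suitable value would have to be found by experiment.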


# Installing Dependencies
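
Before running the rest of the notebook, it may help to check which packages are importable. The sketch below uses only the standard library. The package list is an assumption inferred from the imports used in this notebook and from the `accelerate` and `bitsandbytes` requirement reported in Step 3; adjust it to your environment.

```python
import importlib.util

# Assumed top-level module names, inferred from this notebook's imports and error output.
REQUIRED = [
    "huggingface_hub",
    "langchain",
    "transformers",
    "accelerate",
    "bitsandbytes",
    "faiss",
    "unstructured",
    "sentence_transformers",
]

def missing_packages(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

print("Missing packages:", missing_packages(REQUIRED))
```

Any name printed here would need to be installed (for example with `pip install`) before the corresponding import succeeds.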


# Imports

Here we import the necessary Python libraries and define the private Hugging Face user access token needed to access the models.

In [1]:
from huggingface_hub import login
from langchain.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline, BitsAndBytesConfig
from textwrap import fill
from langchain.prompts import PromptTemplate
import locale
from langchain.document_loaders import UnstructuredURLLoader
from langchain.vectorstores.utils import filter_complex_metadata # 'filter_complex_metadata' removes complex metadata that are not in str, int, float or bool format
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

from langchain.chains import RetrievalQA

locale.getpreferredencoding = lambda: "UTF-8"

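# Define your private User Access Token from Hugging Face
# to access models whose licence you have accepted.
# Read it from an environment variable (assumed name: HUGGINGFACE_UAT)
# rather than hard-coding the secret in the notebook.
import os
HUGGINGFACE_UAT = os.environ["HUGGINGFACE_UAT"]
login(HUGGINGFACE_UAT)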



Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /Users/mvelarde/.cache/huggingface/token
Login successful


# Step 1

To build the knowledge base for our RAG system, we need to choose a domain and a dataset of documents. Our domain is "fiction books released", and our dataset consists of five books saved as PDFs. Since the large language model (LLM) we load in Step 3 was trained between December 2022 and February 2023, the books are newer than its training data.

In [2]:
# The PDFs have already been collected and saved in the "books" folder.

# Define the document paths
document_paths = ["books/book1.pdf", "books/book2.pdf", "books/book3.pdf", "books/book4.pdf", "books/book5.pdf"]
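
Since the later steps depend on these files, a quick check that each path exists can catch a misplaced folder early. A minimal sketch using only the standard library; the helper name `split_existing` is hypothetical, and the paths are the ones defined above.

```python
from pathlib import Path

document_paths = ["books/book1.pdf", "books/book2.pdf", "books/book3.pdf",
                  "books/book4.pdf", "books/book5.pdf"]

def split_existing(paths):
    """Split paths into those that point to an existing file and those that do not."""
    found = [p for p in paths if Path(p).is_file()]
    missing = [p for p in paths if not Path(p).is_file()]
    return found, missing

found, missing = split_existing(document_paths)
print(f"{len(found)} of {len(document_paths)} documents found")
if missing:
    print("Missing:", missing)
```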

# Step 2

Now we will define three relevant prompts for the dataset of books and one irrelevant prompt.

In [3]:
# Relevant Prompts
prompt1 = "What is Rick Riordan's book The Chalice of the Gods about?"
prompt2 = "Who is seeking help in Adelle Waldman's book Help Wanted?"
prompt3 = "What is the storm referring to in Vanessa Chan's book The Storm We Made?"

# Irrelevant Prompt
irrelevant_prompt = "What are the latest stock market trends?"
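
Step 4 asks for the prompts to be tested against the model without the dataset. One way to organise that baseline run is sketched below. It assumes the model can be used as a callable that maps a prompt string to an answer string; `fake_llm` is a stand-in so the example runs without loading any model, and `run_baseline` is a hypothetical helper name.

```python
from textwrap import fill

prompts = {
    "prompt1": "What is Rick Riordan's book The Chalice of the Gods about?",
    "prompt2": "Who is seeking help in Adelle Waldman's book Help Wanted?",
    "prompt3": "What is the storm referring to in Vanessa Chan's book The Storm We Made?",
    "irrelevant_prompt": "What are the latest stock market trends?",
}

def run_baseline(llm, prompts):
    """Send each prompt to the model with no retrieved context and collect the answers."""
    return {name: llm(text) for name, text in prompts.items()}

# Stand-in model for illustration only; replace with the real model callable.
def fake_llm(prompt):
    return "I don't know."

for name, reply in run_baseline(fake_llm, prompts).items():
    print(f"{name}: {fill(reply, width=80)}")
```

Keeping the answers in a dictionary makes it easy to compare them later with the answers from the RAG system for the same prompts.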

# Step 3

Now we load the LLM referenced above: Falcon-40B, which has 40B parameters.

In [4]:
model_name = "tiiuae/falcon-40b"

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             quantization_config=quantization_config,
                                             trust_remote_code=True)

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

gen_cfg = GenerationConfig.from_pretrained(model_name)
gen_cfg.max_new_tokens=512
gen_cfg.temperature=0.0000001 # For RAG we want (near-)deterministic answers, so the temperature is set close to 0.0
gen_cfg.return_full_text=True
gen_cfg.do_sample=True
gen_cfg.repetition_penalty=1.11

pipe=pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    generation_config=gen_cfg
)

llm = HuggingFacePipeline(pipeline=pipe)





ImportError: Using `bitsandbytes` 8-bit quantization requires Accelerate: `pip install accelerate` and the latest version of bitsandbytes: `pip install -i https://pypi.org/simple/ bitsandbytes`
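
Apart from the missing dependencies, the size of the model is worth checking before retrying. The back-of-the-envelope estimate below covers the weights only, assuming the nominal 40 billion parameters and a fixed number of bytes per parameter. Activations, the KV cache, and quantization overhead are ignored, so real usage will be higher.

```python
def weight_memory_gib(n_params, bytes_per_param):
    """Approximate memory needed for the model weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30

N_PARAMS = 40e9  # nominal parameter count of Falcon-40B

for label, bytes_per_param in [("16-bit", 2), ("8-bit", 1), ("4-bit", 0.5)]:
    print(f"{label}: about {weight_memory_gib(N_PARAMS, bytes_per_param):.1f} GiB")
```

Even with `load_in_8bit=True`, the weights alone come to roughly 37 GiB, which is more than a single consumer GPU typically provides.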

# Step 1

To build the knowledge base for our RAG system, we need a domain and a dataset of documents. From Hugging Face, we will use the Falcon-40B large language model (LLM), which was trained between December 2022 and February 2023. For the dataset, we have selected five books, all published in 2024 and saved as PDFs.

In [5]:
import bitsandbytes as bnb
print(bnb.__version__)


'NoneType' object has no attribute 'cadam32bit_grad_fp32'
0.42.0


  warn("The installed version of bitsandbytes was compiled without GPU support. "
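
The warning above says that the installed `bitsandbytes` build has no GPU support. A quick way to confirm whether the environment exposes a CUDA device at all is sketched below. It assumes PyTorch is the backend in use and simply reports `False` if PyTorch is not installed; `cuda_available` is a hypothetical helper name.

```python
import importlib.util

def cuda_available():
    """Return True only if PyTorch is installed and reports a usable CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch
    return torch.cuda.is_available()

print("CUDA available:", cuda_available())
```

If this prints `False`, the 8-bit loading path in Step 3 is unlikely to work on this machine, and a GPU runtime such as the linked Colab environment may be a better place to run the notebook.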
