# Introduction

For this Quantitative Risk Management (QRM) assignment, we quantitatively assess how usable a Large Language Model (LLM), in this case Llama 3 by Meta, is when implemented in a company use case.

### The Case
The project case focuses on evaluating the risks and benefits of using LLMs to enhance web scrapers and AI-powered assistants, specifically by comparing their performance with and without Retrieval-Augmented Generation (RAG). The goal is to quantify how much RAG reduces risks such as hallucination and data inaccuracies, using metrics like Error Rate (ER), and to assess the trade-offs of this approach within a quantitative risk management framework.
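As a minimal sketch of the metric (an assumption on our part; the exact definition used for grading may differ), the Error Rate can be taken as the share of answers judged incorrect, and the benefit of RAG as the difference between the two rates. The function names and the example labels below are hypothetical and only illustrate the arithmetic.

```python
def error_rate(is_correct):
    """Share of answers judged incorrect (0.0 = all correct, 1.0 = all wrong)."""
    if not is_correct:
        raise ValueError("need at least one graded answer")
    return sum(1 for ok in is_correct if not ok) / len(is_correct)


def absolute_risk_reduction(er_without_rag, er_with_rag):
    """Difference in error rate between the plain model and the RAG setup."""
    return er_without_rag - er_with_rag


# Hypothetical grading of five questions (True = answer judged correct).
graded_without_rag = [False, False, True, False, False]
graded_with_rag = [True, True, True, False, True]

er_plain = error_rate(graded_without_rag)
er_rag = error_rate(graded_with_rag)
print(f"ER without RAG: {er_plain:.2f}")
print(f"ER with RAG:    {er_rag:.2f}")
print(f"Absolute reduction: {absolute_risk_reduction(er_plain, er_rag):.2f}")
```

The labels here are invented; in the project they would come from grading the model's answers against the text of the Act.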

### The Model
Llama 3 by Meta is a transformer-based Large Language Model designed for natural language understanding and generation. It has a data cut-off point of mid-2023, meaning it was trained on data available up until that period. Llama 3 aims for a balance between computational demands and performance, and it can be combined with techniques such as Retrieval-Augmented Generation (RAG) to work with up-to-date or domain-specific information beyond its training data.

### The Data
For our data, we will use the EU AI Act, published in 2024. The EU AI Act is particularly suitable for this project because it was published after Llama 3’s data cut-off point (mid-2023). This means the model cannot have been trained on the final published text, which gives us a good scenario for evaluating its ability to interpret and summarize new, unseen data.

Furthermore, the EU AI Act contains complex and structured information, making it a robust test for assessing the model’s performance in understanding and accurately representing legal and technical content. This setup allows for testing how Retrieval-Augmented Generation (RAG) can improve accuracy and reduce hallucinations when integrated. 

To make the document available to the LLM, we index it in a vector database that the RAG system can search.

## Installation and Imports

In [None]:
!pip install -r requirements.txt

In [None]:
import sys
from torch import cuda, bfloat16
import torch
import transformers
from transformers import AutoTokenizer
from time import time
#import chromadb
#from chromadb.config import Settings
from langchain.llms import HuggingFacePipeline
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma

## Define model, tokenizer, query pipeline

Define the model, the device, and the `bitsandbytes` configuration.

In [None]:
model_id = '/kaggle/input/llama-3/transformers/8b-chat-hf/1'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# using the `bitsandbytes` library
# we set quantization configuration to load large model with less GPU memory
# possibly not needed with larger setups
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

print(device)

Now we prepare the tokenizer and the model. To get a sense of how long this step takes, we time the loading code.

Be aware that running this model without a dedicated GPU won't be possible. To run this project I used UCloud machines that were allocated to me for this project.
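A rough back-of-the-envelope sketch shows why a GPU and quantization matter here. It assumes roughly 8 billion parameters for the 8B model and counts the weights only; activations, the KV cache, and quantization overhead come on top, so the real footprint is higher.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for the model weights alone, in gigabytes (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 8e9  # assumption: about 8 billion parameters for the 8B model

for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit (nf4)", 4)]:
    print(f"{label:>12}: ~{weight_memory_gb(N_PARAMS, bits):.0f} GB")
```

Under these assumptions, 4-bit loading cuts the weight footprint to about a quarter of the 16-bit footprint, which is what makes the model fit on smaller GPUs.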

In [None]:
time_start = time()

model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    trust_remote_code=True,
    max_new_tokens=1024
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

time_end = time()
print(f"Prepare model, tokenizer: {round(time_end-time_start, 3)} sec.")


We next define the query pipeline.

When setting up the HuggingFace pipeline, we need to specify `max_length` explicitly. Otherwise the pipeline falls back to the default length of 20 tokens, which is far too short for our case.
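Note that `max_length` counts the prompt tokens as well as the generated ones, whereas `max_new_tokens` counts only the generated ones. The sketch below illustrates the resulting budget with plain arithmetic; the token counts are made-up examples, and the helper is hypothetical rather than part of `transformers`.

```python
def new_token_budget(prompt_tokens, max_length):
    """Tokens left for the answer when `max_length` covers prompt plus answer."""
    return max(0, max_length - prompt_tokens)


# A short question leaves almost the whole budget for the answer.
print(new_token_budget(prompt_tokens=15, max_length=1000))

# A long prompt (for example, a question plus retrieved context) uses most of it.
print(new_token_budget(prompt_tokens=950, max_length=1000))
```

If long prompts leave too little room for the answer, setting `max_new_tokens` instead of `max_length` is one option worth considering.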

In [None]:
time_start = time()

query_pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    max_length=1000,
    device_map="auto",
)

time_end = time()
print(f"Prepare pipeline: {round(time_end-time_start, 3)} sec.")

We now define a function to test our pipeline.

In [None]:
def test_model(tokenizer, pipeline, message):
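    """
    Run a single query through the pipeline and format the result.

    Args:
        tokenizer: the tokenizer
        pipeline: the text-generation pipeline
        message: the prompt
    Returns:
        str: the question, the generated answer, and the elapsed time
    """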
    time_start = time()
    
    sequences = pipeline(
        message,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        max_length=200,)
    
    time_end = time()
    total_time = f"{round(time_end-time_start, 3)} sec."
    
    question = sequences[0]['generated_text'][:len(message)]
    answer = sequences[0]['generated_text'][len(message):]
    
    return f"Question: {question}\nAnswer: {answer}\nTotal time: {total_time}"

## Testing our pipeline and model

We test the pipeline using a few queries related to the European Union Artificial Intelligence Act (EU AI Act).

Additionally, we create a small UI utility function to display the LLM's output neatly. It colors the labels for the question, the answer, and the total time so that each part is easy to read. This will help later when we compare the output of the model with and without RAG.

In [None]:
from IPython.display import display, Markdown

def colorize_text(text):
    for word, color in zip(["Reasoning", "Question", "Answer", "Total time"], ["blue", "red", "green", "magenta"]):
        text = text.replace(f"{word}:", f"\n\n**<font color='{color}'>{word}:</font>**")
    return text

### Testing

Let's test the model with a few questions regarding the EU AI Act.

Later we will build a repository of all questions and save the responses in separate columns for RAG and non-RAG responses.
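A minimal sketch of such a question repository, using only the standard library. The two answer functions are hypothetical placeholders standing in for the plain model and the RAG chain, and the column names are assumptions as well.

```python
import csv
import io


def collect_responses(questions, answer_plain, answer_rag):
    """Run every question through both systems and return one row per question."""
    rows = []
    for question in questions:
        rows.append({
            "question": question,
            "response_no_rag": answer_plain(question),
            "response_rag": answer_rag(question),
        })
    return rows


def write_responses(rows, handle):
    """Write the rows as CSV to an open text handle, one column per system."""
    fieldnames = ["question", "response_no_rag", "response_rag"]
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# Placeholder callables; in the notebook these would wrap the model and the RAG chain.
def fake_plain(question):
    return f"[plain answer to: {question}]"


def fake_rag(question):
    return f"[RAG answer to: {question}]"


rows = collect_responses(
    ["What is the purpose of the EU AI regulation act?"],
    fake_plain,
    fake_rag,
)
buffer = io.StringIO()
write_responses(rows, buffer)
print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, pass a file opened with `open(path, "w", newline="", encoding="utf-8")`.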

As an example, we will ask `What is the purpose of the EU AI regulation act?`. Here we are looking for the following from page 1 of the act: 

<I>"The purpose of this Regulation is to improve the functioning of the internal market by laying down a uniform legal
framework in particular for the development, the placing on the market, the putting into service and the use of
artificial intelligence systems (AI systems) in the Union, in accordance with Union values, to promote the uptake of
human centric and trustworthy artificial intelligence (AI) while ensuring a high level of protection of health, safety,
fundamental rights as enshrined in the Charter of Fundamental Rights of the European Union (the ‘Charter’),
including democracy, the rule of law and environmental protection, to protect against the harmful effects of AI
systems in the Union, and to support innovation. This Regulation ensures the free movement, cross-border, of
AI-based goods and services, thus preventing Member States from imposing restrictions on the development,
marketing and use of AI systems, unless explicitly authorised by this Regulation."</I>

In [None]:
response = test_model(tokenizer,
                    query_pipeline,
                   "What is the purpose of the EU AI regulation act?")
display(Markdown(colorize_text(response)))

Next we will ask `How many recitals are included in the first section of the EU AI Act document? Please provide the last numbered recital for an exact count`. 

This answer is quite straightforward, since there are 180 recitals.
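For factual questions with a single numeric answer, a simple automatic check is possible. The sketch below is an assumption about how grading could be done, not the procedure used in this project: it counts a response as correct when the expected number appears in it as a standalone number.

```python
import re


def mentions_number(answer, expected):
    """True if `expected` appears in `answer` as a standalone integer."""
    numbers = re.findall(r"\d+", answer)
    return str(expected) in numbers


print(mentions_number("The Act contains 180 recitals.", 180))     # True
print(mentions_number("There are 1800 recitals in total.", 180))  # False
```

A response that lists several numbers would also pass this check, so it complements manual review rather than replacing it.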

In [None]:
response = test_model(tokenizer,
                    query_pipeline,
                   "How many recitals are included in the first section of the EU AI Act document? Please provide the last numbered recital for an exact count")
display(Markdown(colorize_text(response)))

As can be seen, the answers are not really helpful or accurate. The model is most likely hallucinating, since we are asking for information that it has not been trained on or "seen" before.

## Retrieval-Augmented Generation (RAG)

To improve the accuracy of our model, we build a RAG system that lets the model "read" our document before responding.
To create this RAG system we will use a `HuggingFacePipeline`. The overall steps are as follows:
- Test the model through a `HuggingFacePipeline`.
- Ingest the EU AI Act using `PyPDFLoader`.
- Split the AI Act into chunks of 1000 characters, with an overlap of 100 characters between consecutive chunks.
- Generate embeddings for the chunks and store them in the vector database.
- Build a `RetrievalQA` chain that combines the retrieval and generation steps.

### Test the model with HuggingFace Pipeline

We will test the model using a HuggingFace pipeline by asking what the EU AI Act is. Wrapping the pipeline in LangChain's `HuggingFacePipeline` lets us use the model directly in LangChain chains.

In [None]:
llm = HuggingFacePipeline(pipeline=query_pipeline)

# checking again that everything is working fine
time_start = time()
question = "Please explain what EU AI Act is."
response = llm(prompt=question)
time_end = time()
total_time = f"{round(time_end-time_start, 3)} sec."
full_response =  f"Question: {question}\nAnswer: {response}\nTotal time: {total_time}"
display(Markdown(colorize_text(full_response)))

We now ingest our data (the EU AI Act) using the `PyPDFLoader` from LangChain.

In [None]:
loader = PyPDFLoader("./EU_AI_Act.pdf")
documents = loader.load()

### Splitting the data

Using a recursive character text splitter, we will split our data into chunks.
- chunk_size: 1000 (the size of a chunk in characters)
- chunk_overlap: 100 (the number of characters shared by two consecutive chunks)

Chunk overlapping balances chunk size constraints and context preservation, enhancing the effectiveness and accuracy of LLMs in processing text.
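To make the effect of the two parameters concrete, here is a simplified sliding-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraph and line breaks); it only illustrates how chunk size and overlap interact.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split `text` into fixed-size windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in sliding_chunks(sample, chunk_size=10, chunk_overlap=3):
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, so a sentence cut at a chunk boundary still appears partly in both neighbours.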

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
all_splits = text_splitter.split_documents(documents)

### Creating Embeddings with Sentence Transformers and HuggingFace

We generate embeddings using **Sentence Transformer** models and **HuggingFace embeddings**. 

#### Handling Availability Issues
Occasionally, HuggingFace's sentence-transformer models may not be accessible online. To address this, we fall back to a locally stored Sentence Transformer model, so the notebook keeps working in offline or restricted environments. Note that the fallback is a different, smaller model than the primary one, so embeddings produced by the two models should not be mixed in the same vector database.


In [None]:
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}

# try to access the sentence transformers from HuggingFace: https://huggingface.co/api/models/sentence-transformers/all-mpnet-base-v2
try:
    embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)
except Exception as ex:
    print("Exception: ", ex)
    # alternatively, we will access the embeddings models locally
    local_model_path = "/kaggle/input/sentence-transformers/minilm-l6-v2/all-MiniLM-L6-v2"
    print(f"Use alternative (local) model: {local_model_path}\n")
    embeddings = HuggingFaceEmbeddings(model_name=local_model_path, model_kwargs=model_kwargs)

### Initializing ChromaDB with Document Splits and Embeddings

We initialize **ChromaDB** using the following:

- **Document Splits:** Pre-processed chunks of the original text.
- **Embeddings:** Generated embeddings from the previously defined models.

#### Enabling Persistence
To ensure data is stored and accessible for future use, we enable the **persistence** option for the vector database, allowing ChromaDB to save data locally.


In [None]:
vectordb = Chroma.from_documents(documents=all_splits, embedding=embeddings, persist_directory="chroma_db")

### Initializing the Chain

We initialize a **`RetrievalQA` chain** using LangChain utilities.

#### How It Works
1. **Querying the Vector Database:**  
   The chain begins by querying the vector database using **similarity search** with the provided prompt.

2. **Retrieving Context:**  
   The vector database retrieves relevant documents that match the query.

3. **Composing the Prompt:**  
   The query and the retrieved context are combined to create a prompt. This prompt instructs the LLM to answer the query.

4. **Generating the Response:**  
   The LLM uses the retrieved context to generate an accurate and context-aware response.

This process is known as **Retrieval-Augmented Generation (RAG)** because it combines **retrieval** of relevant data with **generation** of responses based on that data.
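The first three steps can be illustrated without any libraries. The sketch below uses made-up three-dimensional vectors in place of real embeddings and a hypothetical prompt template; the actual template used by LangChain's `stuff` chain is worded differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vector, store, k=2):
    """Return the texts of the k stored chunks most similar to the query."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item["vector"]),
        reverse=True,
    )
    return [item["text"] for item in ranked[:k]]


def compose_prompt(question, context_chunks):
    """Put the retrieved chunks and the question into a single prompt."""
    context = "\n\n".join(context_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )


# Toy vector store: the vectors are invented for illustration only.
store = [
    {"text": "Chunk about notified bodies.", "vector": [0.9, 0.1, 0.0]},
    {"text": "Chunk about real-world testing.", "vector": [0.1, 0.9, 0.1]},
    {"text": "Chunk about penalties.", "vector": [0.0, 0.2, 0.9]},
]

query_vector = [0.2, 0.8, 0.0]  # pretend embedding of the question
top_chunks = retrieve(query_vector, store, k=2)
print(compose_prompt("How are high-risk AI systems tested?", top_chunks))
```

The composed prompt is what the LLM receives in step 4; the quality of the answer therefore depends directly on which chunks the similarity search returns.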


In [None]:
retriever = vectordb.as_retriever()

qa = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=retriever, 
    verbose=True
)

## Test the RAG

We now define a new test function that will run our query and provide us with a RAG-based response.

In [None]:
def test_rag(qa, query):
    """
    Test the Retrieval Augmented Generation (RAG) system.
    
    Args:
        qa (RetrievalQA.from_chain_type): Langchain function to perform RAG
        query (str): query for the RAG system
    Returns:
        None
    """

    time_start = time()
    response = qa.run(query)
    time_end = time()
    total_time = f"{round(time_end-time_start, 3)} sec."

    full_response =  f"Question: {query}\nAnswer: {response}\nTotal time: {total_time}"
    display(Markdown(colorize_text(full_response)))

Now for the fun part - checking our queries:

In [None]:
query = "How is the testing of high-risk AI systems in real-world conditions performed?"
test_rag(qa, query)

In [None]:
query = "What are the operational obligations of notified bodies?"
test_rag(qa, query)

### Document Sources

To check the document sources for the last query, we follow these steps:

1. **Run a Similarity Search:**  
   Perform a similarity search in the vector database based on the query.

2. **Iterate Through Retrieved Documents:**  
   Loop through the documents returned by the similarity search.

3. **Extract and Display Metadata:**  
   For each document:
   - Print the document source from its metadata.
   - Display the page content of the document.

This process helps identify the origin and content of the documents retrieved during the query.


In [None]:
docs = vectordb.similarity_search(query)
print(f"Query: {query}")
print(f"Retrieved documents: {len(docs)}")
for doc in docs:
    # each result is a LangChain Document exposing `metadata` and `page_content`
    print("Source: ", doc.metadata["source"])
    print("Text: ", doc.page_content, "\n")

### Conclusions

We used **LangChain**, **ChromaDB**, and **Llama 3** as the LLM to build a **Retrieval-Augmented Generation (RAG)** solution. For testing purposes, we worked with the **EU AI Act (2024)**.

#### Key Findings
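- For the example queries above, the RAG pipeline provided answers grounded in the text of the EU AI Act, whereas the model without RAG produced unhelpful or inaccurate answers. A systematic Error Rate comparison over a larger set of questions is still needed to quantify the improvement.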

#### Areas for Improvement
To enhance the solution, we plan to:
1. **Optimize the Embeddings:** Improve the quality and relevance of the embeddings used in the retrieval process.
2. **Implement Advanced RAG Schemes:** Explore and apply more complex RAG architectures for better performance and flexibility.
