<a href="https://colab.research.google.com/github/Maziger/master-generative-ai-with-llm/blob/main/Notebooks/13_rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Appendix C. End-to-End Retrieval-Augmented Generation


This notebook is supplementary material for Appendix C of the book [Hands-On Generative AI with Transformers and Diffusion Models](https://learning.oreilly.com/library/view/hands-on-generative-ai/9781098149239/).

## Processing the Data


In [None]:
import urllib.request

# Define the file name and URL
file_name = "The-AI-Act.pdf"
url = "https://artificialintelligenceact.eu/wp-content/uploads/2021/08/The-AI-Act.pdf"

# Download the file
urllib.request.urlretrieve(url, file_name)
print(f"{file_name} downloaded successfully.")

The-AI-Act.pdf downloaded successfully.


In [None]:
# Install the PDF loading, text splitting, and embedding dependencies
%pip install langchain_community pypdf langchain-text-splitters sentence-transformers

In [None]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader(file_name)
docs = loader.load()
print(len(docs))

108


In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=100
)
chunks = text_splitter.split_documents(docs)
print(len(chunks))

854
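
To build intuition for `chunk_size` and `chunk_overlap`, here is a minimal sketch of a fixed-window character splitter. It is a simplification and not what `RecursiveCharacterTextSplitter` does internally (that class also tries to break on separators such as paragraph and line breaks). The function name `sliding_window_chunks` and the toy text are hypothetical, chosen only for illustration.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=100):
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only; not the LangChain algorithm.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    # Each new window starts this many characters after the previous one
    step = chunk_size - chunk_overlap

    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


toy_text = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = sliding_window_chunks(toy_text, chunk_size=10, chunk_overlap=4)
print(toy_chunks)
```

Each chunk repeats the last four characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk when it is shorter than the overlap.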


In [None]:
chunked_text = [chunk.page_content for chunk in chunks]
chunked_text[404]

'user or for own use on the Union market for its intended purpose;  \n(12) ‘intended purpose’ means the use for which an AI system is intended by the provider, \nincluding the specific context and conditions of use,  as specified in the information \nsupplied by the provider in the instructions for use, promotional or sales materials \nand statements, as well as in the technical documentation;  \n(13) ‘reasonably foreseeable misuse’ means the use of an AI system in a way tha t is not in'

### Embedding the Documents


In [None]:
from sentence_transformers import SentenceTransformer, util

sentences = ["I'm happy", "I'm full of happiness"]
model = SentenceTransformer("BAAI/bge-small-en-v1.5")

# Compute embedding for both sentences
embedding_1 = model.encode(sentences[0], convert_to_tensor=True)
embedding_2 = model.encode(sentences[1], convert_to_tensor=True)

In [None]:
embedding_1.shape

torch.Size([384])

In [None]:
util.pytorch_cos_sim(embedding_1, embedding_2)

tensor([[0.8367]], device='cuda:0')

In [None]:
embedding_1 @ embedding_2

tensor(0.8367, device='cuda:0')

In [None]:
import torch

torch.dot(embedding_1, embedding_2)

tensor(0.8367, device='cuda:0')
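
The plain dot product returns the same value as the cosine similarity, which is what one would expect if the model produces unit-length embeddings (an assumption here, consistent with the matching outputs above). The sketch below uses plain Python lists and hypothetical helper names to show why: cosine similarity is the dot product divided by both vector norms, so the two coincide once the vectors are normalized.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(a):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(a, a))
    return [x / norm for x in a]


# Toy vectors, not real embeddings
u = [3.0, 4.0]
v = [4.0, 3.0]
u_hat, v_hat = normalize(u), normalize(v)

print(dot(u, v))                # raw dot product depends on vector length
print(cosine_similarity(u, v))  # cosine similarity ignores vector length
print(dot(u_hat, v_hat))        # dot product of the normalized vectors
```

For the unnormalized toy vectors the dot product and the cosine similarity differ, while for the normalized versions they agree up to floating-point error.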

In [None]:
chunk_embeddings = model.encode(chunked_text, convert_to_tensor=True)

In [None]:
chunk_embeddings.shape

torch.Size([854, 384])

## Retrieval

In [None]:
def search_documents(query, top_k=5):
    # Encode the query into a vector
    query_embedding = model.encode(query, convert_to_tensor=True)

    # Calculate cosine similarity between the query and all document chunks
    similarities = util.pytorch_cos_sim(query_embedding, chunk_embeddings)

    # Get the top k most similar chunks
    top_k_indices = similarities[0].topk(top_k).indices

    # Retrieve the corresponding document chunks
    results = [chunked_text[i] for i in top_k_indices.tolist()]

    return results

In [None]:
search_documents("What are prohibited ai practices?", top_k=2)

['TITLE  II \nPROHIBITED  ARTIFICIAL  INTELLIGENCE  PRACTICES  \nArticle 5  \n1. The following artificial intelligence practices shall be prohibited:  \n(a) the placing on the market, putting into service or use of an A I system that \ndeploys subliminal techniques beyond a person’s consciousness in order to \nmaterially distort a person’s behaviour in a manner that causes or is likely to \ncause that person or another person physical or psychological harm;',
 'low or minimal risk. The list of prohibited practices in Title II comprises all those AI systems \nwhose use is considered unacceptable as contravening Unio n values, for instance by violating \nfundamental rights. The prohibitions covers practices that have a significant potential to \nmanipulate persons  through subliminal techniques beyond their consciousness or exploit']

* The retriever returns the chunks whose embeddings are most similar to the question: here, the opening of Article 5 and a summary passage on prohibited practices.
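
When inspecting retrieval results, it can help to see the similarity score next to each chunk. The sketch below shows the ranking step in isolation. `top_k_with_scores` is a hypothetical helper, and the scores and texts are made up for illustration rather than produced by the embedding model.

```python
import heapq


def top_k_with_scores(scores, texts, top_k=5):
    """Return the top_k (score, text) pairs, highest score first."""
    if len(scores) != len(texts):
        raise ValueError("scores and texts must have the same length")

    # Never ask for more results than there are chunks
    top_k = min(top_k, len(texts))

    # Rank by score only, so ties never fall back to comparing the texts
    return heapq.nlargest(top_k, zip(scores, texts), key=lambda pair: pair[0])


# Toy data standing in for one row of the similarity matrix
toy_scores = [0.12, 0.83, 0.45, 0.79]
toy_texts = ["chunk A", "chunk B", "chunk C", "chunk D"]

top_pairs = top_k_with_scores(toy_scores, toy_texts, top_k=2)
for score, text in top_pairs:
    print(f"{score:.2f}  {text}")
```

In the notebook itself, the object returned by `similarities[0].topk(top_k)` exposes `.values` alongside `.indices`, so the scores are available without extra computation.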

## Generation

In [None]:
from transformers import pipeline

from genaibook.core import get_device

device = get_device()
generator = pipeline(
    "text-generation", model="HuggingFaceTB/SmolLM-135M-Instruct", device=device
)

In [None]:
def generate_answer(query):
    # Retrieve relevant chunks
    context_chunks = search_documents(query, top_k=2)

    # Combine the chunks into a single context string
    context = "\n".join(context_chunks)

    # Build the user prompt from the retrieved context and the question
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

    # Define the system prompt that sets the assistant's behavior
    system_prompt = (
        "You are a friendly assistant that answers questions about the AI Act. "
        "If the user is not making a question, you can ask for clarification"
    )
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": prompt},
    ]

    response = generator(messages, max_new_tokens=300)
    # The pipeline returns the full conversation; the assistant reply is the last message
    return response[0]["generated_text"][-1]["content"]
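
Separating prompt construction from generation makes it easy to inspect exactly what the model receives. The sketch below mirrors the prompt format used in `generate_answer`. `build_messages` and `SYSTEM_PROMPT` are hypothetical names introduced for illustration, not part of the notebook or of any library, and the query and chunks are toy values.

```python
SYSTEM_PROMPT = (
    "You are a friendly assistant that answers questions about the AI Act. "
    "If the user is not making a question, you can ask for clarification"
)


def build_messages(query, context_chunks):
    """Assemble the chat messages from a question and its retrieved chunks."""
    # Join the retrieved chunks into one context string
    context = "\n".join(context_chunks)

    # Same layout as in generate_answer: context first, then the question
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": prompt},
    ]


# Toy inputs standing in for a real query and retrieved chunks
messages = build_messages(
    "What is an AI system?", ["First chunk.", "Second chunk."]
)
print(messages[1]["content"])
```

Printing the user message before calling the generator is a quick way to check that the retrieved chunks actually contain the information needed to answer the question.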

In [None]:
answer = generate_answer("What are prohibited ai practices in the EU act?")
print(answer)

The EU Act prohibits the use of artificial intelligence practices that are harmful to individuals, such as:

* The placing on the market, putting into service or use of an A I system that is subliminal, that is, it is not intended to be used for any purpose other than to deceive or manipulate individuals.
* The use of A I systems that are designed to deceive or manipulate individuals, such as those used in advertising, marketing, or customer service.
* The use of A I systems that are designed to manipulate individuals, such as those used in surveillance or monitoring.

The EU Act prohibits the use of A I systems that are designed to deceive or manipulate individuals, such as those used in advertising, marketing, or customer service.

The EU Act prohibits the use of A I systems that are designed to deceive or manipulate individuals, such as those used in advertising, marketing, or customer service.

The EU Act prohibits the use of A I systems that are designed to deceive or manipulate i