# Lesson 1: Advanced RAG Pipeline

In [1]:
import utils

import os
import openai
openai.api_key = utils.get_openai_api_key()

✅ In Answer Relevance, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Answer Relevance, input response will be set to __record__.main_output or `Select.RecordOutput` .
✅ In Context Relevance, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Context Relevance, input response will be set to __record__.app.query.rets.source_nodes[:].node.text .
✅ In Groundedness, input source will be set to __record__.app.query.rets.source_nodes[:].node.text .
✅ In Groundedness, input statement will be set to __record__.main_output or `Select.RecordOutput` .


In [8]:
from llama_index import SimpleDirectoryReader

documents = SimpleDirectoryReader(
    input_files=["./CELEX_32016R0679_EN_TXT.pdf"]
).load_data()

In [9]:
print(type(documents), "\n")
print(len(documents), "\n")
print(type(documents[0]))
print(documents[0])

<class 'list'> 

88 

<class 'llama_index.schema.Document'>
Doc ID: 5438ec01-0a2b-4b2d-9265-96e9a5162cbc
Text: I  (Legislativ e acts)  REGUL ATIONS  REGUL ATION (EU) 2016/679
OF THE EUR OPEAN PARLIAMENT AND OF THE COUNCIL  of 27 Apr il 2016  on
the protection of natural persons with regard to the processing of
personal data and on the free  movement of such data, and repealing
Directiv e 95/46/EC (General Data Protection Regulation)  (Text with
EEA relev...


## Basic RAG pipeline

In [10]:
from llama_index import Document

# Merge the text of all loaded Document objects into a single Document
document = Document(text="\n\n".join([doc.text for doc in documents]))

In [11]:
from llama_index import VectorStoreIndex
from llama_index import ServiceContext
from llama_index.llms import OpenAI

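# LLM that synthesizes the final answer; a low temperature keeps outputs stable
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
# Bundle the LLM with an embedding model that runs locally
service_context = ServiceContext.from_defaults(
    llm=llm, embed_model="local:BAAI/bge-small-en-v1.5"
)
# Chunk and embed the merged document, then build an in-memory vector index
index = VectorStoreIndex.from_documents([document],
                                        service_context=service_context)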

In [12]:
query_engine = index.as_query_engine()

In [13]:
response = query_engine.query(
    "How should an organization manage user authorizations to ensure data security under GDPR?"
)
print(str(response))

An organization should manage user authorizations by implementing measures to mitigate risks inherent in the processing, such as encryption. These measures should ensure an appropriate level of security, including confidentiality, taking into account the state of the art and the costs of implementation in relation to the risks and the nature of the personal data being protected.
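
The basic pipeline above follows the usual retrieve-then-answer pattern: the document is split into chunks, each chunk is embedded, and the chunks most similar to the query are handed to the LLM. The retrieval step can be sketched in plain Python. This is a toy illustration, not the llama_index implementation: word counts stand in for the `bge-small-en-v1.5` embeddings, and `chunk_words`, `embed`, `cosine`, and `top_k` are hypothetical helper names.

```python
import math
import re
from collections import Counter


def chunk_words(text, size=12):
    """Split text into consecutive chunks of `size` words (toy chunker)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text):
    """Toy 'embedding': lowercase word counts (assumption: stands in for a neural model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


# Hand-written chunks for the demo; a real index would chunk the PDF text.
chunks = [
    "Encryption protects personal data at rest and in transit.",
    "User authorizations should be limited to what each role needs.",
    "The supervisory authority handles complaints from data subjects.",
]
query = "How should user authorizations be managed?"
print(top_k(query, chunks, k=1))
```

In the real pipeline the retrieved chunks are then inserted into a prompt for `gpt-3.5-turbo`; that synthesis step is left out of this sketch.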


## Evaluation setup using TruLens

In [14]:
eval_questions = []
with open('eval_questions.txt', 'r') as file:
    for line in file:
        # Strip surrounding whitespace, including the trailing newline
        item = line.strip()
        print(item)
        eval_questions.append(item)

What are the essential security measures for effective teleworking under GDPR?
What organizational measures should be taken to ensure compliance with the GDPR in terms of data security?
What technical measures are essential for ensuring the security of personal data under GDPR?
What are the essential technical measures to secure equipment and workstations for protecting personal data?
How should an organization protect its premises to ensure the security of personal data?
How should an organization implement user authentication to protect personal data under GDPR requirements?
How should an organization manage user authorizations to ensure data security under GDPR?
What is pseudonymisation and how should it be implemented under GDPR?
How does encryption and hash functions contribute to GDPR compliance?
What does data anonymisation involve under GDPR, and how is it distinguished from pseudonymisation?


In [15]:
# You can try your own question (this example repeats the first question in the file):
new_question = "What are the essential security measures for effective teleworking under GDPR?"
eval_questions.append(new_question)

In [16]:
print(eval_questions)

['What are the essential security measures for effective teleworking under GDPR?', 'What organizational measures should be taken to ensure compliance with the GDPR in terms of data security?', 'What technical measures are essential for ensuring the security of personal data under GDPR?', 'What are the essential technical measures to secure equipment and workstations for protecting personal data?', 'How should an organization protect its premises to ensure the security of personal data?', 'How should an organization implement user authentication to protect personal data under GDPR requirements?', 'How should an organization manage user authorizations to ensure data security under GDPR?', 'What is pseudonymisation and how should it be implemented under GDPR?', 'How does encryption and hash functions contribute to GDPR compliance?', 'What does data anonymisation involve under GDPR, and how is it distinguished from pseudonymisation?', 'What are the essential security measures for effective
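
Because the appended question is identical to the first line of `eval_questions.txt`, the list now holds the teleworking question twice, and it will be evaluated twice. If that is not intended, a slightly more defensive loader can skip blank lines and exact duplicates. The sketch below is one possible variant; `dedupe_questions` and `add_question` are hypothetical names and not part of `utils.py`.

```python
def dedupe_questions(lines):
    """Return stripped, non-empty questions in first-seen order, without duplicates."""
    seen = set()
    questions = []
    for line in lines:
        item = line.strip()
        if item and item not in seen:
            seen.add(item)
            questions.append(item)
    return questions


def add_question(questions, new_question):
    """Append a question only if it is not already in the list."""
    if new_question not in questions:
        questions.append(new_question)
    return questions


# Any iterable of lines works, including an open file handle.
sample_lines = ["Q1?\n", "\n", "Q2?\n", "Q1?\n"]
questions = dedupe_questions(sample_lines)
add_question(questions, "Q1?")  # already present, list is unchanged
print(questions)
```

With the notebook's file this could be called as `dedupe_questions(open('eval_questions.txt'))`.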

In [17]:
from trulens_eval import Tru
tru = Tru()

tru.reset_database()

🦑 Tru initialized with db url sqlite:///default.sqlite .
🛑 Secret keys may be written to the database. See the `database_redact_keys` option of `Tru` to prevent this.


For the classroom, some of the code lives in helper functions in the `utils.py` file.
- To view `utils.py`, click the "Jupyter" logo at the top of the notebook to open the file directory.
- Later lessons work directly with the code currently wrapped in these helper functions, which gives you more options to customize your RAG pipeline.

In [18]:
from utils import get_prebuilt_trulens_recorder

tru_recorder = get_prebuilt_trulens_recorder(query_engine,
                                             app_id="Direct Query Engine")

In [19]:
with tru_recorder as recording:
    for question in eval_questions:
        response = query_engine.query(question)

In [20]:
records, feedback = tru.get_records_and_feedback(app_ids=[])

In [21]:
records.head()

Unnamed: 0,app_id,app_json,type,record_id,input,output,tags,record_json,cost_json,perf_json,ts,Answer Relevance,Context Relevance,Groundedness,Answer Relevance_calls,Context Relevance_calls,Groundedness_calls,latency,total_tokens,total_cost
0,Direct Query Engine,"{""app_id"": ""Direct Query Engine"", ""tags"": ""-"",...",RetrieverQueryEngine(llama_index.query_engine....,record_hash_afc5ad06d94e228038a6d5d9520f539b,"""What are the essential security measures for ...","""The essential security measures for effective...",-,"{""record_id"": ""record_hash_afc5ad06d94e228038a...","{""n_requests"": 1, ""n_successful_requests"": 1, ...","{""start_time"": ""2024-06-03T05:49:33.851506"", ""...",2024-06-03T05:49:35.809208,1.0,0.0,0.3,[{'args': {'prompt': 'What are the essential s...,[{'args': {'prompt': 'What are the essential s...,[{'args': {'source': 'Where such notification...,1,2097,0.003175
1,Direct Query Engine,"{""app_id"": ""Direct Query Engine"", ""tags"": ""-"",...",RetrieverQueryEngine(llama_index.query_engine....,record_hash_1ac5b4e7f410235350056322a040516a,"""What organizational measures should be taken ...","""Organizational measures such as evaluating ri...",-,"{""record_id"": ""record_hash_1ac5b4e7f4102353500...","{""n_requests"": 1, ""n_successful_requests"": 1, ...","{""start_time"": ""2024-06-03T05:49:35.973479"", ""...",2024-06-03T05:49:37.672573,1.0,1.0,0.35,[{'args': {'prompt': 'What organizational meas...,[{'args': {'prompt': 'What organizational meas...,[{'args': {'source': 'Those measures shall be ...,1,2182,0.003299
2,Direct Query Engine,"{""app_id"": ""Direct Query Engine"", ""tags"": ""-"",...",RetrieverQueryEngine(llama_index.query_engine....,record_hash_626f8c76d424f72ab2b046d647ed1d59,"""What technical measures are essential for ens...","""Implementing appropriate technical and organi...",-,"{""record_id"": ""record_hash_626f8c76d424f72ab2b...","{""n_requests"": 1, ""n_successful_requests"": 1, ...","{""start_time"": ""2024-06-03T05:49:37.827962"", ""...",2024-06-03T05:49:40.235077,1.0,0.5,1.0,[{'args': {'prompt': 'What technical measures ...,[{'args': {'prompt': 'What technical measures ...,[{'args': {'source': 'Those restr ictions shou...,2,2213,0.00336
3,Direct Query Engine,"{""app_id"": ""Direct Query Engine"", ""tags"": ""-"",...",RetrieverQueryEngine(llama_index.query_engine....,record_hash_3da2d84da259874b0af4e134761ca82a,"""What are the essential technical measures to ...","""The essential technical measures to secure eq...",-,"{""record_id"": ""record_hash_3da2d84da259874b0af...","{""n_requests"": 1, ""n_successful_requests"": 1, ...","{""start_time"": ""2024-06-03T05:49:40.393497"", ""...",2024-06-03T05:49:42.211371,1.0,0.6,1.0,[{'args': {'prompt': 'What are the essential t...,[{'args': {'prompt': 'What are the essential t...,"[{'args': {'source': 'When developing, designi...",1,2198,0.003326
4,Direct Query Engine,"{""app_id"": ""Direct Query Engine"", ""tags"": ""-"",...",RetrieverQueryEngine(llama_index.query_engine....,record_hash_9eb80507a3d7eb5e0054d8718a81cd05,"""How should an organization protect its premis...","""An organization should evaluate the risks inh...",-,"{""record_id"": ""record_hash_9eb80507a3d7eb5e005...","{""n_requests"": 1, ""n_successful_requests"": 1, ...","{""start_time"": ""2024-06-03T05:49:42.367805"", ""...",2024-06-03T05:49:44.618402,1.0,0.6,0.6,[{'args': {'prompt': 'How should an organizati...,[{'args': {'prompt': 'How should an organizati...,[{'args': {'source': 'The controller and proc...,2,2204,0.00334
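
The per-record table is easier to read once the three feedback columns are averaged and the weak records are singled out. The sketch below does this in plain Python on the five rows shown above, with the scores copied by hand; `column_means` and `weakest_rows` are hypothetical helpers, and the 0.5 cutoff is an arbitrary choice for illustration.

```python
from statistics import mean

# Feedback scores copied from the five rows of records.head() shown above.
rows = [
    {"Answer Relevance": 1.0, "Context Relevance": 0.0, "Groundedness": 0.3},
    {"Answer Relevance": 1.0, "Context Relevance": 1.0, "Groundedness": 0.35},
    {"Answer Relevance": 1.0, "Context Relevance": 0.5, "Groundedness": 1.0},
    {"Answer Relevance": 1.0, "Context Relevance": 0.6, "Groundedness": 1.0},
    {"Answer Relevance": 1.0, "Context Relevance": 0.6, "Groundedness": 0.6},
]


def column_means(rows):
    """Average every metric over all rows."""
    return {metric: mean(r[metric] for r in rows) for metric in rows[0]}


def weakest_rows(rows, metric, threshold=0.5):
    """Indices of rows that score strictly below `threshold` on `metric`."""
    return [i for i, r in enumerate(rows) if r[metric] < threshold]


means = column_means(rows)
for metric, value in means.items():
    print(f"{metric}: {value:.2f}")
print("Low groundedness rows:", weakest_rows(rows, "Groundedness"))
```

On the actual DataFrame, `records[["Answer Relevance", "Context Relevance", "Groundedness"]].mean()` gives the same averages over all records rather than just the first five.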


In [22]:
# launches on http://localhost:8501/
tru.run_dashboard()

Starting dashboard ...
Config file already exists. Skipping writing process.
Credentials file already exists. Skipping writing process.


Dashboard started at https://s172-31-15-23p44922.lab-aws-production.deeplearning.ai/ .


<Popen: returncode: None args: ['streamlit', 'run', '--server.headless=True'...>

## Advanced RAG pipeline

### 1. Sentence Window retrieval

In [23]:
from llama_index.llms import OpenAI

llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)

In [27]:
from utils import build_sentence_window_index

sentence_index = build_sentence_window_index(
    document,
    llm,
    embed_model="local:BAAI/bge-small-en-v1.5",
    save_dir="sentence_index"
)

In [25]:
from utils import get_sentence_window_query_engine

sentence_window_engine = get_sentence_window_query_engine(sentence_index)

In [26]:
window_response = sentence_window_engine.query(
    "How should an organization manage user authorizations to ensure data security under GDPR?"
)
print(str(window_response))

An organization should adopt internal policies and implement measures that meet the principles of data protection by design and data protection by default. This includes minimizing the processing of personal data, pseudonymizing personal data as soon as possible, ensuring transparency in data processing functions, enabling data subjects to monitor data processing, and creating and improving security features. Additionally, when developing applications, services, and products that involve processing personal data, organizations should consider data protection rights and ensure that controllers and processors can fulfill their data protection obligations. The principles of data protection by design and by default should also be considered in the context of public tenders.
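
The idea behind sentence window retrieval is to match the query against single sentences, which keeps the match precise, and then pass the LLM a wider window of neighboring sentences as context. The sketch below illustrates only that idea. It is not the code in `utils.py`: word overlap stands in for embedding similarity, `window_size=1` is an arbitrary demo value, and all function names are hypothetical.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_window_nodes(text, window_size=1):
    """Create one node per sentence, each carrying its surrounding window."""
    sentences = split_sentences(text)
    nodes = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        nodes.append({"sentence": sentence,
                      "window": " ".join(sentences[start:end])})
    return nodes


def overlap(query, sentence):
    """Toy relevance score: number of shared lowercase words."""
    def tokenize(s):
        return set(re.findall(r"[a-z]+", s.lower()))
    return len(tokenize(query) & tokenize(sentence))


def retrieve_window(query, nodes):
    """Score on the single sentence, but return the wider window as context."""
    best = max(nodes, key=lambda n: overlap(query, n["sentence"]))
    return best["window"]


text = ("Controllers must keep records. Access rights are granted per role. "
        "Unused accounts are removed promptly. Backups are encrypted.")
nodes = build_window_nodes(text, window_size=1)
print(retrieve_window("How are access rights granted?", nodes))
```

A larger window gives the LLM more context per match at the cost of more tokens, so the window size is a parameter worth evaluating rather than fixing up front.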


In [28]:
tru.reset_database()

tru_recorder_sentence_window = get_prebuilt_trulens_recorder(
    sentence_window_engine,
    app_id = "Sentence Window Query Engine"
)

In [29]:
for question in eval_questions:
    with tru_recorder_sentence_window as recording:
        response = sentence_window_engine.query(question)
        print(question)
        print(str(response))

What are the essential security measures for effective teleworking under GDPR?
The essential security measures for effective teleworking under GDPR include implementing appropriate technological protection and organizational measures to promptly detect any personal data breaches, informing the supervisory authority and the data subject without undue delay in case of a breach likely to result in a high risk to the rights and freedoms of the individual, and communicating the nature of the breach along with recommendations to mitigate potential adverse effects to the data subject as soon as reasonably feasible.
What organizational measures should be taken to ensure compliance with the GDPR in terms of data security?
Organizational measures that should be taken to ensure compliance with the GDPR in terms of data security include adopting internal policies, implementing measures that align with the principles of data protection by design and data protection by default, minimizing the proces

In [30]:
tru.get_leaderboard(app_ids=[])

app_id,Context Relevance,Answer Relevance,Groundedness,latency,total_cost
Sentence Window Query Engine,0.422727,1.0,0.834217,7.181818,0.001898


In [31]:
# launches on http://localhost:8501/
tru.run_dashboard()

Starting dashboard ...
Config file already exists. Skipping writing process.
Credentials file already exists. Skipping writing process.
Dashboard already running at path: https://s172-31-15-23p44922.lab-aws-production.deeplearning.ai/


<Popen: returncode: None args: ['streamlit', 'run', '--server.headless=True'...>

### 2. Auto-merging retrieval

In [32]:
from utils import build_automerging_index

automerging_index = build_automerging_index(
    documents,
    llm,
    embed_model="local:BAAI/bge-small-en-v1.5",
    save_dir="merging_index"
)

In [33]:
from utils import get_automerging_query_engine

automerging_query_engine = get_automerging_query_engine(
    automerging_index,
)

In [34]:
auto_merging_response = automerging_query_engine.query(
    "How should an organization manage user authorizations to ensure data security under GDPR?"
)
print(str(auto_merging_response))

An organization should adopt internal policies and implement measures that meet the principles of data protection by design and data protection by default in order to manage user authorizations and ensure data security under GDPR.
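
Auto-merging retrieval organizes chunks in a hierarchy: small leaf chunks are what gets matched, and when enough leaves under the same parent are retrieved, they are replaced by the larger parent chunk. The sketch below shows that merge rule for a single parent level. It is an illustration under stated assumptions, not the llama_index implementation: the `auto_merge` function, the dictionary layout, and the 0.5 threshold are all chosen for the demo.

```python
def auto_merge(retrieved_leaf_ids, parent_of, children_of, threshold=0.5):
    """Replace leaves by their parent when enough siblings were retrieved.

    retrieved_leaf_ids: leaf ids in ranked order
    parent_of: maps leaf id -> parent id
    children_of: maps parent id -> list of all its leaf ids
    """
    # Group the retrieved leaves by parent.
    hits = {}
    for leaf in retrieved_leaf_ids:
        hits.setdefault(parent_of[leaf], []).append(leaf)

    merged, emitted = [], set()
    for leaf in retrieved_leaf_ids:
        parent = parent_of[leaf]
        ratio = len(hits[parent]) / len(children_of[parent])
        # Emit the parent once if the retrieved share exceeds the threshold.
        node = parent if ratio > threshold else leaf
        if node not in emitted:
            emitted.add(node)
            merged.append(node)
    return merged


children_of = {"P1": ["a", "b", "c", "d"], "P2": ["e", "f", "g", "h"]}
parent_of = {leaf: p for p, leaves in children_of.items() for leaf in leaves}

# Three of P1's four leaves were retrieved, but only one of P2's.
print(auto_merge(["a", "e", "b", "c"], parent_of, children_of))
```

Here three of the four leaves under `P1` are hits, so they collapse into `P1`, while the lone hit under `P2` stays a leaf. A multi-level hierarchy would apply the same rule repeatedly, from the leaves upward.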


In [35]:
tru.reset_database()

tru_recorder_automerging = get_prebuilt_trulens_recorder(automerging_query_engine,
                                                         app_id="Automerging Query Engine")

In [36]:
for question in eval_questions:
    with tru_recorder_automerging as recording:
        response = automerging_query_engine.query(question)
        print(question)
        print(response)

What are the essential security measures for effective teleworking under GDPR?
The essential security measures for effective teleworking under GDPR include implementing technical and organizational measures to correct inaccuracies in personal data, minimize the risk of errors, secure personal data considering potential risks for the interests and rights of the data subject, and prevent discriminatory effects on individuals based on racial or ethnic origin, political opinion, religion, or beliefs.
What organizational measures should be taken to ensure compliance with the GDPR in terms of data security?
Organizational measures such as adopting internal policies and implementing measures that meet the principles of data protection by design and data protection by default should be taken to ensure compliance with the GDPR in terms of data security. Additionally, safeguards should be in place to ensure technical and organizational measures that support the principle of data minimization.
Wh

In [37]:
tru.get_leaderboard(app_ids=[])

app_id,Context Relevance,Answer Relevance,Groundedness,latency,total_cost
Automerging Query Engine,0.413636,0.981818,0.732424,6.181818,0.000626
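
Since `tru.reset_database()` is called before each engine is recorded, each leaderboard above shows only one app. To compare the two advanced engines, their rows can be placed side by side. The sketch below does this with the numbers copied by hand from the two leaderboard outputs; `best_app` is a hypothetical helper, and treating lower latency and cost as better is an assumption about what you want to optimize.

```python
# Leaderboard rows copied from the two tru.get_leaderboard() outputs above.
leaderboard = {
    "Sentence Window Query Engine": {
        "Context Relevance": 0.422727, "Answer Relevance": 1.0,
        "Groundedness": 0.834217, "latency": 7.181818, "total_cost": 0.001898,
    },
    "Automerging Query Engine": {
        "Context Relevance": 0.413636, "Answer Relevance": 0.981818,
        "Groundedness": 0.732424, "latency": 6.181818, "total_cost": 0.000626,
    },
}

# Quality scores should be high; latency and cost should be low.
HIGHER_IS_BETTER = {"Context Relevance", "Answer Relevance", "Groundedness"}


def best_app(leaderboard, metric):
    """Name of the app with the best value for `metric`."""
    pick = max if metric in HIGHER_IS_BETTER else min
    return pick(leaderboard, key=lambda app: leaderboard[app][metric])


for metric in ["Context Relevance", "Answer Relevance", "Groundedness",
               "latency", "total_cost"]:
    print(f"{metric}: {best_app(leaderboard, metric)}")
```

These are single runs over a small question set, so small gaps (for example in Context Relevance) may not be meaningful. Skipping the `tru.reset_database()` call between engines would keep all apps in one leaderboard instead.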


In [38]:
# launches on http://localhost:8501/
tru.run_dashboard()

Starting dashboard ...
Config file already exists. Skipping writing process.
Credentials file already exists. Skipping writing process.
Dashboard already running at path: https://s172-31-15-23p44922.lab-aws-production.deeplearning.ai/


<Popen: returncode: None args: ['streamlit', 'run', '--server.headless=True'...>