# Data Indexing, Data Retrieval and Generation

There are two central steps involved:


**Data Indexing:**
1. Documents are loaded and split into smaller text chunks.
2. Text chunks are converted into vector embeddings and stored in a vector database (Vector DB) next to their respective text chunks.

**Data Retrieval and Generation:**
1. A user query is embedded and used to retrieve relevant text chunks from the Vector DB.
2. Retrieved chunks are processed by a large language model (LLM) to generate a contextually relevant response.
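The two steps above can be sketched end to end with a toy example. This is only an illustration under simplifying assumptions: the `embed` function below is a bag-of-words counter standing in for a real embedding model, the "Vector DB" is a plain Python list, and the generation step (passing the retrieved chunks to an LLM) is left out.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Data indexing: store each text chunk together with its vector.
chunks = [
    "Regulation shapes public trust in artificial intelligence.",
    "Vector databases store embeddings next to text chunks.",
    "Chunk overlap preserves context across chunk boundaries.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(query, k=1):
    """Data retrieval: embed the query and return the k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


top = retrieve("How does regulation affect trust in AI?")
print(top)
```

In the notebook itself, the embedding and storage are handled by OpenAI embeddings and Chroma instead of these toy functions.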



***
**Coding sources**

I extend the code provided and explained in the following YouTube video:

- RAG Langchain Python Project: Easy AI/Chat For Your Docs: https://www.youtube.com/watch?v=tcqEUSNCn8I
    + GitHub: https://github.com/pixegami/langchain-rag-tutorial


## Problem: incompatibility between packages

In [None]:
## it may be necessary to downgrade a package:
# pip uninstall langchain-core
# pip install langchain-core==0.3.10
## or install older version of chromadb:
# pip install --upgrade chromadb==0.5.0
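Before downgrading anything, it can help to check which versions are currently installed. A minimal sketch using only the standard library; the helper name `installed_versions` is made up for this illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: installed version, or None if it is not installed}."""
    result = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result


versions = installed_versions(["langchain-core", "chromadb"])
for name, ver in versions.items():
    print(f"{name}: {ver or 'not installed'}")
```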

## Get API, local supabase server key(s)

In [1]:
import os
import sys

# Assuming 'src' sits two levels above the current working directory
path_to_src = os.path.join('../..','src')  # Moves two levels up, then into the 'src' folder

# Add the path to sys.path
sys.path.append(path_to_src)

# Now you can import your API_key module
import API_key as key
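The `API_key` module is not shown in this notebook. As an alternative to a key file, the key could be read from an environment variable. The sketch below assumes a variable named `OPENAI_API_KEY`; the helper `get_openai_key` is hypothetical and not part of the notebook's code.

```python
import os


def get_openai_key(env_var="OPENAI_API_KEY"):
    """Read the API key from an environment variable and fail early if it is missing."""
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Environment variable {env_var} is not set.")
    return api_key
```

The returned value could then be passed wherever the notebook uses `key.openAI_key`.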

## Include self-written functions

In [2]:
import src.forChromaApproach as di_drg

In [3]:
# Print the current working directory
print("Current working directory:", os.getcwd())

Current working directory: c:\DATEN\PHD\WORKSHOPS\introductory workshop in LLMs\4_summarizingLiterature\RAG


# Data Indexing
## Data Preparation: Documents are loaded and split into smaller text chunks

**load_pdfs_by_filename**: Loads and stores PDF pages by filename:

In [4]:
path_to_PDFs = os.path.join('PDFs/AIregulation')  # Folder with the PDFs, relative to the current working directory


pdf_pages = di_drg.load_pdfs_by_filename(path_to_PDFs, verbose=False)

# Optional: Print the loaded pages by filename
for filename, pages in pdf_pages.items():
    print(f"\nPDF: {filename}")
    print(f"Total Pages: {len(pages)}")
    # print(pages[0])

Ignoring wrong pointing object 110 0 (offset 0)
Ignoring wrong pointing object 244 0 (offset 0)
Ignoring wrong pointing object 12 0 (offset 0)
Ignoring wrong pointing object 1319 0 (offset 0)
Ignoring wrong pointing object 9 0 (offset 0)
Ignoring wrong pointing object 13 0 (offset 0)
Ignoring wrong pointing object 44 0 (offset 0)
Ignoring wrong pointing object 134 0 (offset 0)
Ignoring wrong pointing object 118 0 (offset 0)
Ignoring wrong pointing object 438 0 (offset 0)
Ignoring wrong pointing object 41 0 (offset 0)
Ignoring wrong pointing object 52 0 (offset 0)
Ignoring wrong pointing object 9 0 (offset 0)
Ignoring wrong pointing object 15 0 (offset 0)
Ignoring wrong pointing object 199 0 (offset 0)
Ignoring wrong pointing object 331 0 (offset 0)
Ignoring wrong pointing object 9 0 (offset 0)
Ignoring wrong pointing object 16 0 (offset 0)
Ignoring wrong pointing object 150 0 (offset 0)
Ignoring wrong pointing object 232 0 (offset 0)
Ignoring wrong pointing object 9 0 (offset 0)
Ignori


PDF: 10.1002_sd.2048.pdf
Total Pages: 14

PDF: 10.1007_s00146-023-01650-z.pdf
Total Pages: 8

PDF: 10.1007_s10506-017-9206-9.pdf
Total Pages: 15

PDF: 10.1007_s11077-022-09452-8.pdf
Total Pages: 23

PDF: 10.1007_s11569-024-00454-9.pdf
Total Pages: 29

PDF: 10.1007_s40804-020-00200-0.pdf
Total Pages: 27

PDF: 10.1017_err.2019.8.pdf
Total Pages: 19

PDF: 10.1017_err.2021.52.pdf
Total Pages: 25

PDF: 10.1017_err.2022.14.pdf
Total Pages: 16

PDF: 10.1017_err.2023.1.pdf
Total Pages: 19

PDF: 10.1080_13600834.2018.1488659.pdf
Total Pages: 19

PDF: 10.1080_13669877.2021.1957985.pdf
Total Pages: 14

PDF: 10.1111_bioe.13124.pdf
Total Pages: 9

PDF: 10.1111_rego.12512.pdf
Total Pages: 30

PDF: 10.1111_rego.12563.pdf
Total Pages: 18

PDF: 10.1111_rego.12568.pdf
Total Pages: 22

PDF: 10.1177_0266382120923962.pdf
Total Pages: 9

PDF: 10.1177_2053951719860542.pdf
Total Pages: 14

PDF: 10.1177_20539517211039493.pdf
Total Pages: 5

PDF: 10.14658_pupj-jelt-2021-2-2.pdf
Total Pages: 20

PDF: 10.2139_ss
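The implementation of `load_pdfs_by_filename` lives in `src/forChromaApproach.py` and is not shown here. The sketch below only illustrates the structure it returns (a dictionary mapping each filename to its list of pages). The function name `load_pages_by_filename` and the injectable `load_pages` parameter are assumptions made for this illustration; in practice the loader could be something like `PyPDFLoader(path).load()` from `langchain_community`.

```python
import os


def load_pages_by_filename(folder, load_pages):
    """Map each PDF filename in `folder` to the list of pages returned by `load_pages(path)`."""
    pages_by_file = {}
    for name in sorted(os.listdir(folder)):
        if name.lower().endswith(".pdf"):
            pages_by_file[name] = load_pages(os.path.join(folder, name))
    return pages_by_file
```

Passing the loader as a parameter keeps the sketch independent of any particular PDF library.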

In [5]:
# pdf_pages is the dictionary containing the pages of each PDF
first_key = list(pdf_pages.keys())[0]  # Get the first PDF filename
print("first PDF of folder:", first_key)
first_pdf_pages = pdf_pages[first_key]  # Get the pages of the first PDF


# Print the first page
print("First Page:", first_pdf_pages[0], "\n\n")

first PDF of folder: 10.1002_sd.2048.pdf
First Page: page_content='RESEARCH ARTICLE
Governing Artificial Intelligence to benefit the UN Sustainable
Development Goals
Jon Truby
Law & Development, College of Law, Qatar
University, Doha, Qatar
Correspondence
Jon Truby, Centre for Law & Development,
College of Law, Qatar University, PO BOX
2713 Doha, Qatar.
Email: jon.truby@qu.edu.qa
Funding information
Qatar National Research Fund, Grant/Award
Number: NPRP 11S-1119-170016Abstract
Big Tech's unregulated roll-out out of experimental AI poses risks to the achievement of
the UN Sustainable Development Goals (SDGs), w ith particular vulnerability for develop-
ing countries. The goal of financial inclusion is threatened by the imperfect and
ungoverned design and implementation of AI decision-making software making important
financial decisions affecting customers. Aut omated decision-makin ga l g o r i t h m sh a v ed i s -
played evidence of bias, lack ethical gover nance, and limit transparen

**split_pdf_pages_into_chunks**: Splits the PDF pages into chunks and stores them by filename:

In [6]:
pdf_chunks = di_drg.split_pdf_pages_into_chunks(pdf_pages, chunk_size=500, chunk_overlap=150, verbose=False)

# Optional: Print a summary of chunks created per PDF
for filename, chunks in pdf_chunks.items():
    print(f"\nPDF: {filename}")
    print(f"Total Chunks: {len(chunks)}")


PDF: 10.1002_sd.2048.pdf
Total Chunks: 254

PDF: 10.1007_s00146-023-01650-z.pdf
Total Chunks: 136

PDF: 10.1007_s10506-017-9206-9.pdf
Total Chunks: 120

PDF: 10.1007_s11077-022-09452-8.pdf
Total Chunks: 256

PDF: 10.1007_s11569-024-00454-9.pdf
Total Chunks: 405

PDF: 10.1007_s40804-020-00200-0.pdf
Total Chunks: 248

PDF: 10.1017_err.2019.8.pdf
Total Chunks: 181

PDF: 10.1017_err.2021.52.pdf
Total Chunks: 285

PDF: 10.1017_err.2022.14.pdf
Total Chunks: 182

PDF: 10.1017_err.2023.1.pdf
Total Chunks: 241

PDF: 10.1080_13600834.2018.1488659.pdf
Total Chunks: 184

PDF: 10.1080_13669877.2021.1957985.pdf
Total Chunks: 161

PDF: 10.1111_bioe.13124.pdf
Total Chunks: 162

PDF: 10.1111_rego.12512.pdf
Total Chunks: 416

PDF: 10.1111_rego.12563.pdf
Total Chunks: 290

PDF: 10.1111_rego.12568.pdf
Total Chunks: 337

PDF: 10.1177_0266382120923962.pdf
Total Chunks: 116

PDF: 10.1177_2053951719860542.pdf
Total Chunks: 195

PDF: 10.1177_20539517211039493.pdf
Total Chunks: 66

PDF: 10.14658_pupj-jelt-2021

In [7]:
# Assuming pdf_chunks is the dictionary containing chunks for each PDF
first_key = list(pdf_chunks.keys())[0]  # Get the first PDF filename
print("first PDF of folder:", first_key)
first_pdf_chunks = pdf_chunks[first_key]  # Get the chunks for the first PDF

# Access the first and second chunks
first_chunk = first_pdf_chunks[0]
second_chunk = first_pdf_chunks[1]

# Print the first two chunks
print("\nFirst Chunk:", first_chunk, "\n\n")
print("Second Chunk:", second_chunk)

first PDF of folder: 10.1002_sd.2048.pdf

First Chunk: page_content='RESEARCH ARTICLE
Governing Artificial Intelligence to benefit the UN Sustainable
Development Goals
Jon Truby
Law & Development, College of Law, Qatar
University, Doha, Qatar
Correspondence
Jon Truby, Centre for Law & Development,
College of Law, Qatar University, PO BOX
2713 Doha, Qatar.
Email: jon.truby@qu.edu.qa
Funding information
Qatar National Research Fund, Grant/Award
Number: NPRP 11S-1119-170016Abstract' metadata={'source': 'PDFs/AIregulation\\10.1002_sd.2048.pdf', 'page': 0} 


Second Chunk: page_content='2713 Doha, Qatar.
Email: jon.truby@qu.edu.qa
Funding information
Qatar National Research Fund, Grant/Award
Number: NPRP 11S-1119-170016Abstract
Big Tech's unregulated roll-out out of experimental AI poses risks to the achievement of
the UN Sustainable Development Goals (SDGs), w ith particular vulnerability for develop-
ing countries. The goal of financial inclusion is threatened by the imperfect and
ungover
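The second chunk above starts with text that also ends the first chunk; this is the effect of `chunk_overlap`. The idea can be sketched with a fixed-width sliding window. This is a simplification and not the splitter used inside `di_drg`, which is not shown here and may, for example, prefer to cut at line or sentence boundaries. The function name `split_with_overlap` is made up for this illustration.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=150):
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghij" * 100  # 1000 characters
pieces = split_with_overlap(sample, chunk_size=500, chunk_overlap=150)
print([len(p) for p in pieces])
```

The overlap means a sentence that straddles a chunk boundary is still fully contained in at least one chunk, at the cost of storing some text twice.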

In [8]:
print("page content:", first_chunk.page_content, "\n\n")
print("metadata:", first_chunk.metadata)

page content: RESEARCH ARTICLE
Governing Artificial Intelligence to benefit the UN Sustainable
Development Goals
Jon Truby
Law & Development, College of Law, Qatar
University, Doha, Qatar
Correspondence
Jon Truby, Centre for Law & Development,
College of Law, Qatar University, PO BOX
2713 Doha, Qatar.
Email: jon.truby@qu.edu.qa
Funding information
Qatar National Research Fund, Grant/Award
Number: NPRP 11S-1119-170016Abstract 


metadata: {'source': 'PDFs/AIregulation\\10.1002_sd.2048.pdf', 'page': 0}


## Data Storage: Text chunks are converted into vector embeddings and stored in a vector database (Vector DB) next to their respective text chunks.

In [9]:
path_to_Chroma = os.path.join('DB_Chroma')  # Folder of the persisted Chroma DB, relative to the current working directory

sources_DB = di_drg.inspect_chrom(CHROMA_PATH=path_to_Chroma, openAI_key=key.openAI_key)
print("Number of sources in DB:", len(sources_DB))
print("\nSources:\n", sources_DB)

# Remove the folder prefixes ('PDFs\\' and 'PDFs/AIregulation\\') from all entries
cleaned_sources_DB = [pdf.replace('PDFs\\', '').replace('PDFs/AIregulation\\', '') for pdf in sources_DB]

# Print the result
print("\nCleaned sources:\n", cleaned_sources_DB)

  db = Chroma(persist_directory=CHROMA_PATH, embedding_function=OpenAIEmbeddings(api_key=openAI_key))


Number of sources in DB: 3

Sources:
 ['PDFs/AIregulation\\10.1007_s10506-017-9206-9.pdf', 'PDFs/AIregulation\\10.1002_sd.2048.pdf', 'PDFs/AIregulation\\10.1007_s00146-023-01650-z.pdf']

Cleaned sources:
 ['10.1007_s10506-017-9206-9.pdf', '10.1002_sd.2048.pdf', '10.1007_s00146-023-01650-z.pdf']
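The chained `replace` calls above depend on the exact folder prefixes. A slightly more general sketch, assuming only that the filename is the last component after any `/` or `\` separator (the helper `basename_any_sep` is hypothetical):

```python
import re


def basename_any_sep(path):
    """Return the last path component, treating both '/' and '\\' as separators."""
    return re.split(r"[\\/]", path)[-1]


example_sources = [
    "PDFs/AIregulation\\10.1002_sd.2048.pdf",
    "PDFs\\10.1007_s00146-023-01650-z.pdf",
]
cleaned = [basename_any_sep(s) for s in example_sources]
print(cleaned)
```

Splitting on both separators avoids relying on `os.path.basename`, whose behavior for backslashes differs between Windows and other platforms.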


In [10]:
# To delete the DB, uncomment the following line:
## di_drg.remove_chrom(CHROMA_PATH=path_to_Chroma)

# pdf_chunks is a dictionary, so we can loop over its keys (the PDF filenames):
for pdf in pdf_chunks.keys():
    if pdf not in cleaned_sources_DB:
        print(f"The PDF '{pdf}' is not included in the DB, as such:")
        print("create DB for", pdf)
        di_drg.save_to_chrom(chunks=pdf_chunks[pdf], CHROMA_PATH=path_to_Chroma, openAI_key=key.openAI_key)

The PDF '10.1007_s11077-022-09452-8.pdf' is not included in the DB, as such:
create DB for 10.1007_s11077-022-09452-8.pdf


  db.persist()


Saved 256 chunks to DB_Chroma.
The PDF '10.1007_s11569-024-00454-9.pdf' is not included in the DB, as such:
create DB for 10.1007_s11569-024-00454-9.pdf
Saved 405 chunks to DB_Chroma.
The PDF '10.1007_s40804-020-00200-0.pdf' is not included in the DB, as such:
create DB for 10.1007_s40804-020-00200-0.pdf
Saved 248 chunks to DB_Chroma.
The PDF '10.1017_err.2019.8.pdf' is not included in the DB, as such:
create DB for 10.1017_err.2019.8.pdf
Saved 181 chunks to DB_Chroma.
The PDF '10.1017_err.2021.52.pdf' is not included in the DB, as such:
create DB for 10.1017_err.2021.52.pdf
Saved 285 chunks to DB_Chroma.
The PDF '10.1017_err.2022.14.pdf' is not included in the DB, as such:
create DB for 10.1017_err.2022.14.pdf
Saved 182 chunks to DB_Chroma.
The PDF '10.1017_err.2023.1.pdf' is not included in the DB, as such:
create DB for 10.1017_err.2023.1.pdf
Saved 241 chunks to DB_Chroma.
The PDF '10.1080_13600834.2018.1488659.pdf' is not included in the DB, as such:
create DB for 10.1080_13600834.

In [11]:
sources_DB = di_drg.inspect_chrom(CHROMA_PATH=path_to_Chroma, openAI_key=key.openAI_key)
print("Number of sources in DB:", len(sources_DB))
print("\nSources:\n", sources_DB)

Number of sources in DB: 29

Sources:
 ['PDFs/AIregulation\\10.1017_err.2021.52.pdf', 'PDFs/AIregulation\\10.2979_gls.2023.a886162.pdf', 'PDFs/AIregulation\\10.1177_20539517211039493.pdf', 'PDFs/AIregulation\\10.1111_rego.12563.pdf', 'PDFs/AIregulation\\10.1017_err.2019.8.pdf', 'PDFs/AIregulation\\10.2139_ssrn.2609777.pdf', 'PDFs/AIregulation\\10.24251_HICSS.2020.647.pdf', 'PDFs/AIregulation\\10.1007_s11077-022-09452-8.pdf', 'PDFs/AIregulation\\10.1007_s11569-024-00454-9.pdf', 'PDFs/AIregulation\\doi-10.1017_err.2022.38.pdf', 'PDFs/AIregulation\\10.1111_rego.12512.pdf', 'PDFs/AIregulation\\10.4324_9780429262081-19.pdf', 'PDFs/AIregulation\\10.2139_ssrn.3501410.pdf', 'PDFs/AIregulation\\10.1007_s00146-023-01650-z.pdf', 'PDFs/AIregulation\\10.1111_rego.12568.pdf', 'PDFs/AIregulation\\10.1080_13600834.2018.1488659.pdf', 'PDFs/AIregulation\\10.48550_arXiv.2305.02231.pdf', 'PDFs/AIregulation\\10.1017_err.2023.1.pdf', 'PDFs/AIregulation\\10.1111_bioe.13124.pdf', 'PDFs/AIregulation\\10.14658_

# Data Retrieval and Generation

Your prompt template (filled with the retrieved context and the question):

In [12]:
PROMPT_TEMPLATE = """
Answer the question based only on the following context:

{context}

---

Answer the question based on the above context: {question}
"""

Your question (user message, inserted into the template as `{question}`):

In [13]:
# Question to ask against your DB:
question = """
Why is it important to study laypersons' perceptions of AI regulation, especially focusing on their trust levels, perceived benefits and risks, and how these perceptions impact the successful integration of AI into society? Please discuss how understanding public perceptions can inform regulatory approaches, enhance public trust, and address potential risks and benefits for society.
"""

Retrieve data (see `source_page_pairs`, `filtered_hits` and `all_hits`) and generate the response (see `response`):

In [14]:
response, source_page_pairs, filtered_hits, all_hits = di_drg.retrieveGenerate(
    query_text=question,
    prompt_template=PROMPT_TEMPLATE,
    openAI_key=key.openAI_key,
    chroma_path=path_to_Chroma,
    docsReturn=10,
    thresholdSimilarity=0.8,
)

Number of possible relevant text chunks found with a threshold similarity of 0.8: 88
Query: Human: 
Answer the question based only on the following context:

impact the trust relationship between a human and an arti ﬁcial agent and are aspects that regulation can shape
in a direct manner. In democracies, the legal framework can be amended to address novel issues (e.g., proposed
AI Act, see Section 4) and the institutional setting changed according to the needs of society. Regulation thus lays
down the way in which interactions occur and determines the underlying characteristics of the environment in
which trust relationships emerge.

---

nology to the public sector (Aoki, 2021 , p. 1).
The ﬁrst reviewed study by Aoki concerns the introduction of and citizens ’initial public trust in AI chatbots
(Aoki, 2021 ). The paper hypothesizes that initial public trust in AI chatbots depends on the area of enquiry and
on the purposes communicated to the public for introducing the technology (Aoki
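The `Query:` output above shows the template filled with retrieved chunks, separated by `---`. How such a prompt could be assembled is sketched below with plain `str.format`; this is a stand-in for whatever `retrieveGenerate` does internally, which is not shown in this notebook, and the helper `build_prompt` is hypothetical.

```python
PROMPT_TEMPLATE = """
Answer the question based only on the following context:

{context}

---

Answer the question based on the above context: {question}
"""


def build_prompt(chunk_texts, question, template=PROMPT_TEMPLATE):
    """Join the retrieved chunk texts with a separator and fill the template."""
    context = "\n\n---\n\n".join(chunk_texts)
    return template.format(context=context, question=question)


prompt = build_prompt(
    ["First retrieved chunk.", "Second retrieved chunk."],
    "What do the chunks say?",
)
print(prompt)
```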

  response_text = model.predict(prompt)


In [15]:
print(response)
print(source_page_pairs)

Studying laypersons' perceptions of AI regulation is important because it provides insights into how the general public views AI technology, its potential benefits, and risks. Understanding these perceptions is crucial for the successful integration of AI into society because public trust plays a significant role in the adoption and acceptance of new technologies. 

By studying public perceptions, regulators can tailor their approaches to address concerns and build trust among citizens. This can help in designing regulations that are more effective, transparent, and responsive to the needs and expectations of the public. Additionally, understanding public perceptions can help in identifying potential risks and benefits associated with AI technology, allowing regulators to mitigate risks and maximize the benefits for society.

Overall, studying laypersons' perceptions of AI regulation can inform regulatory approaches, enhance public trust, and ensure that the integration of AI technolog

In [16]:
print(len(all_hits))
print(all_hits[0])

88
(Document(metadata={'page': 2, 'source': 'PDFs/AIregulation\\white house_AI.pdf'}, page_content="significantly on public trust and validation, the government's regulatory and non-regulatory \napproaches to AI should contribute to public trust in AI by promoting reliable, robust, and \ntrustworthy AI applications. For example, an appropriate regulatory approach that reduces \naccidents can increase public trust and thereby support the development of industries powered by \nAI. Regulatory approaches may also be needed to protect reasonable expectations of privacy on"), 0.8372320426012777)
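Each hit is a `(Document, score)` tuple, as the output above shows. The reported count of 88 matches `len(all_hits)`, which suggests that `all_hits` holds every chunk above the similarity threshold. The sketch below shows one way the two parameters could interact, assuming that `thresholdSimilarity` filters by score and `docsReturn` then keeps the best-scoring hits; the actual logic inside `retrieveGenerate` is not shown here and may differ. The name `filter_hits` is made up for this illustration.

```python
def filter_hits(hits, threshold=0.8, top_k=10):
    """Keep hits scoring at least `threshold`, best first, plus the top_k subset of them."""
    above = sorted(
        (hit for hit in hits if hit[1] >= threshold),
        key=lambda hit: hit[1],
        reverse=True,
    )
    return above, above[:top_k]


# Plain strings stand in for Document objects in this toy example.
example_hits = [("chunk a", 0.90), ("chunk b", 0.70), ("chunk c", 0.85), ("chunk d", 0.80)]
all_above, top_two = filter_hits(example_hits, threshold=0.8, top_k=2)
print(len(all_above), top_two)
```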


## It is possible to further investigate the results of a single run:

In [17]:
# Initialize a list to store all sources
sources = []

# Loop through each tuple and extract the source
for doc, score in all_hits:
    source = doc.metadata.get('source')  # Access the 'source' key from the metadata
    if source is not None:
        sources.append(source)

# Strip the folder prefixes, then print the list of sources
sources = [source.replace('PDFs\\', '').replace('PDFs/AIregulation\\', '') for source in sources]

print(sources)

['white house_AI.pdf', '10.1111_rego.12512.pdf', '10.1111_rego.12512.pdf', '10.1111_rego.12568.pdf', '10.1111_rego.12512.pdf', '10.1111_rego.12512.pdf', '10.1111_rego.12568.pdf', '10.1111_rego.12568.pdf', 'white house_AI.pdf', '10.1111_rego.12512.pdf', 'white house_AI.pdf', '10.1111_rego.12512.pdf', '10.1111_rego.12563.pdf', '10.1177_0266382120923962.pdf', '10.1111_rego.12568.pdf', '10.1111_rego.12512.pdf', '10.1111_rego.12512.pdf', '10.1111_rego.12512.pdf', '10.1111_rego.12512.pdf', '10.1111_rego.12512.pdf', '10.1080_13669877.2021.1957985.pdf', '10.1080_13669877.2021.1957985.pdf', 'doi-10.1017_err.2022.38.pdf', '10.1111_rego.12512.pdf', '10.24251_HICSS.2021.664.pdf', '10.2139_ssrn.3501410.pdf', '10.1080_13669877.2021.1957985.pdf', '10.1111_rego.12512.pdf', '10.2979_gls.2023.a886162.pdf', '10.24251_HICSS.2021.664.pdf', '10.1111_rego.12512.pdf', '10.1111_rego.12512.pdf', '10.1111_rego.12512.pdf', '10.24251_HICSS.2021.664.pdf', '10.1111_rego.12568.pdf', '10.2139_ssrn.3501410.pdf', '10.11

In [18]:
import pandas as pd
from collections import Counter

# Counting the frequency of each file
file_count = Counter(sources)

# Creating a DataFrame from the frequency count
frequency_df = pd.DataFrame(file_count.items(), columns=['File Name', 'Frequency'])

# Adding a new column for the total number of chunks per PDF
frequency_df["Chunk Length PDF"] = 0

# Loop through pdf_chunks and update the DataFrame
for filename, chunks in pdf_chunks.items():
    frequency_df.loc[frequency_df["File Name"] == filename, "Chunk Length PDF"] = len(chunks)

frequency_df["Chunk Percentage"] = ((frequency_df["Frequency"] / frequency_df["Chunk Length PDF"]) * 100).round(2)

frequency_df = frequency_df.sort_values(by="Chunk Percentage", ascending=False)
print(frequency_df)

# Saving the DataFrame to an Excel file
frequency_df.to_excel("outputs_ChromaApproach/file_frequency_table.xlsx", index=False)

                            File Name  Frequency  Chunk Length PDF  \
1              10.1111_rego.12512.pdf         37               416   
0                  white house_AI.pdf          6               135   
7         10.24251_HICSS.2021.664.pdf          6               147   
5   10.1080_13669877.2021.1957985.pdf          5               161   
8            10.2139_ssrn.3501410.pdf          8               282   
2              10.1111_rego.12568.pdf          9               337   
3              10.1111_rego.12563.pdf          4               290   
9        10.2979_gls.2023.a886162.pdf          2               195   
12      10.48550_arXiv.2305.02231.pdf          5               496   
4        10.1177_0266382120923962.pdf          1               116   
14        10.24251_HICSS.2020.647.pdf          1               135   
6         doi-10.1017_err.2022.38.pdf          1               154   
13       10.4324_9780429262081-19.pdf          1               163   
11            10.101
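The "Chunk Percentage" column is the share of a PDF's chunks that were retrieved for the question. In the cell above, a file that has no entry in `pdf_chunks` would keep a chunk length of 0, and the division would then produce `inf`. A dependency-free sketch of the same calculation with an explicit guard for that case (the function name `hit_share_per_file` is hypothetical):

```python
from collections import Counter


def hit_share_per_file(sources, chunks_per_file):
    """Percentage of each file's chunks among the hits; None if the total is unknown."""
    counts = Counter(sources)
    shares = {}
    for name, hits in counts.items():
        total = chunks_per_file.get(name, 0)
        shares[name] = round(hits / total * 100, 2) if total else None
    return shares


shares = hit_share_per_file(
    ["a.pdf", "a.pdf", "b.pdf", "c.pdf"],
    {"a.pdf": 8, "b.pdf": 4},  # c.pdf has no known chunk total
)
print(shares)
```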

## Loop through multiple runs

In [19]:
# Initialize an empty list to store the results
results = []

# Run the function multiple times and store the outputs
for _ in range(10):  # Replace 10 with the number of iterations you want to run
    response, source_page_pairs, filtered_hits, all_hits = di_drg.retrieveGenerate(
        query_text=question,
        prompt_template=PROMPT_TEMPLATE,
        openAI_key=key.openAI_key,
        chroma_path=path_to_Chroma,
        docsReturn=10,
        thresholdSimilarity=0.8,
    )
    results.append({"Response LLM": response, "Sources, Pages": source_page_pairs})

# Convert the results list into a DataFrame
df = pd.DataFrame(results)

# Display the DataFrame
print(df)

# Saving the DataFrame to an Excel file
df.to_excel("outputs_ChromaApproach/file_LLMs_calls2.xlsx", index=False)

Number of possible relevant text chunks found with a threshold similarity of 0.8: 88
Query: Human: 
Answer the question based only on the following context:

targeted approach to policymakers to debate how to streamline regulatory efforts for future AI governance.
Keywords: artiﬁcial intelligence, automation, human-automation trust, regulation of technology, trust, trustworthy AI.
1. Introduction
In May 2021, the EU Executive Vice-President, Margrethe Vestager, held a speech in which she asked: “Do
Europeans trust technology? ”After noting several large discrepancies within member states but an overall low

---

research by Kennedy et al., the primary object of trust is thus the AI technology. However, the study also tests
the variation in the degree of initial public trust in AI chatbots relative to the public trust in human administra-
tors (Aoki, 2021 , p. 4), thus providing insights into the changes in perceived trust once the technology is being
introduced to an institution. Aoki 
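Once several runs are collected, one simple check is how often each source appears across them. The sketch below assumes that each run's `source_page_pairs` is an iterable of hashable `(source, page)` tuples; that format is not shown in this notebook, so treat it as an assumption. The helper `tally_sources` is made up for this illustration.

```python
from collections import Counter


def tally_sources(runs):
    """Count in how many runs each (source, page) pair appears at least once."""
    tally = Counter()
    for pairs in runs:
        tally.update(set(pairs))  # set(): count a pair once per run
    return tally


example_runs = [
    [("a.pdf", 1), ("b.pdf", 2), ("a.pdf", 1)],
    [("a.pdf", 1)],
]
tally = tally_sources(example_runs)
print(tally.most_common())
```

A pair that appears in every run points to stable retrieval, while differences between the generated responses would then come from the LLM rather than from the retrieved context.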