# Data Indexing, Data Retrieval and Generation

There are two central steps involved:


**Data Indexing:**
1. Documents are loaded and split into smaller text chunks.
2. Text chunks are converted into vector embeddings and stored in a vector database (Vector DB) next to their respective text chunks.

**Data Retrieval and Generation:**
1. A user query is embedded and used to retrieve relevant text chunks from the Vector DB.
2. Retrieved chunks are processed by a large language model (LLM) to generate a contextually relevant response.
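
The two steps above can be pictured with a toy, dependency-free sketch. It is not the pipeline used in this notebook: a bag-of-words count vector stands in for the embedding model, a plain Python list stands in for the Vector DB, and the generation step (passing the retrieved chunks to an LLM) is left out. All function names are made up for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_index(chunks):
    """Data indexing: store each chunk next to its vector."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index, query, k=2):
    """Data retrieval: return the k chunks most similar to the query."""
    query_vector = embed(query)
    scored = [(cosine_similarity(query_vector, vector), chunk) for chunk, vector in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


toy_chunks = [
    "Public trust shapes how AI regulation is accepted.",
    "Chunk overlap keeps sentences from being cut apart.",
    "Surveys measure public perceptions of AI risks.",
]
toy_index = build_index(toy_chunks)
top_chunks = retrieve(toy_index, "How does the public perceive AI regulation?", k=2)
print(top_chunks)
```

In the real pipeline the retrieved chunks would then be inserted into a prompt and sent to the LLM.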



***
**Coding sources**

I extend the code provided and explained in the following YouTube Video: 

- RAG Langchain Python Project: Easy AI/Chat For Your Docs: https://www.youtube.com/watch?v=tcqEUSNCn8I
    + GitHub: https://github.com/pixegami/langchain-rag-tutorial


## If you are facing issues running your code:

The installed versions of chromadb and langchain-core may not be compatible with each other.

In [1]:
## run in your terminal; could be necessary to downgrade package:
# pip uninstall langchain-core
# pip install langchain-core==0.3.10

## or install older version of chromadb:
# pip install --upgrade chromadb==0.5.0
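
Before downgrading anything, it can help to check which versions are actually installed. A minimal sketch using only the standard library; the helper name `installed_versions` is made up for illustration, and the package names are the ones mentioned above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(["langchain-core", "chromadb"]).items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```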

## Get API, local supabase server key(s)

In [2]:
import os
import sys

# 'src' is located two levels above the current working directory
path_to_src = os.path.join('../..','src')  # Moves two levels up, then into the 'src' folder

# Add the path to sys.path
sys.path.append(path_to_src)

# Now you can import your API_key module
import API_key as key
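
A relative entry in `sys.path` depends on the working directory at import time. As a hedged alternative, the path can be resolved to an absolute one and added only once. The helper name `add_to_sys_path` is hypothetical; the relative location `../../src` is taken from the cell above.

```python
import sys
from pathlib import Path


def add_to_sys_path(relative_path):
    """Resolve relative_path against the working directory and add it to sys.path once."""
    resolved = str(Path(relative_path).resolve())
    if resolved not in sys.path:
        sys.path.append(resolved)
    return resolved


src_dir = add_to_sys_path("../../src")
print("Added to sys.path:", src_dir)
```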

## Include self-written functions

In [3]:
import src.forChromaApproach as di_drg

In [4]:
# Print the current working directory
print("Current working directory:", os.getcwd())

Current working directory: c:\DATEN\PHD\WORKSHOPS\introductory workshop in LLMs\4_summarizingLiterature\RAG


# Data Indexing
## Data Preparation: Documents are loaded and split into smaller text chunks

**load_pdfs_by_filename**: Loads and stores PDF pages by filename:

In [5]:
path_to_PDFs = os.path.join('PDFs/AIregulation_query2')  # Subfolder of the current working directory that holds the PDFs


pdf_pages = di_drg.load_pdfs_by_filename(path_to_PDFs, verbose=False)

# Optional: Print the loaded pages by filename
for filename, pages in pdf_pages.items():
    print(f"\nPDF: {filename}")
    print(f"Total Pages: {len(pages)}")
    # print(pages[0])

Ignoring wrong pointing object 17 0 (offset 0)



PDF: 10.1007s00146-023-01777-z_ 2023_Institutionalised di.pdf
Total Pages: 14

PDF: 10.1007s10506-022-09323-w_ 2023_Mapping the Issues o.pdf
Total Pages: 29

PDF: 10.1007s10551-022-05053-w_ 2022_Moral Judgments in t.pdf
Total Pages: 27

PDF: 10.1007s11948-020-00276-4_ 2020_Towards Transparency.pdf
Total Pages: 29

PDF: 10.1007s43681-023-00387-1_ 2023_Publics’ views on et.pdf
Total Pages: 29

PDF: 10.1016j.clsr.2022.105657_ 2022_Regulating AI. A lab.pdf
Total Pages: 16

PDF: 10.1016j.inffus.2023.101896_ 2023_Connecting the dots.pdf
Total Pages: 24

PDF: 10.1016j.patter.2021.100362_ 2021_Creating ethics guid.pdf
Total Pages: 15

PDF: 10.1016j.techsoc.2024.102471_ 2024_Trustworthy AI in th.pdf
Total Pages: 15

PDF: 10.108010510974.2020.1807380_ 2021_Impacts of Attitudes.pdf
Total Pages: 18

PDF: 10.108013600869.2022.2060471_ 2022_On the path to the f.pdf
Total Pages: 24

PDF: 10.1093scipolscac062_ 2023_Can innovation vouch.pdf
Total Pages: 16

PDF: 10.1109ACCESS.2024.3458893_ 2024_Ethica

In [6]:
# pdf_pages is the dictionary containing the loaded pages for each PDF
first_key = list(pdf_pages.keys())[0]  # Get the first PDF filename
print("first PDF of folder:", first_key)
first_pdf_pages = pdf_pages[first_key]  # Get the pages of the first PDF


# Print the first page
print("First Page:", first_pdf_pages[0], "\n\n")

first PDF of folder: 10.1007s00146-023-01777-z_ 2023_Institutionalised di.pdf
First Page: page_content='Vol.:(0123456789)1 3AI & SOCIETY 
https://doi.org/10.1007/s00146-023-01777-z
OPEN FORUM
Institutionalised distrust and human oversight of artificial intelligence: 
towards a democratic design of AI governance under the European 
Union AI Act
Johann Laux1 
Received: 12 April 2023 / Accepted: 5 September 2023 
© The Author(s) 2023
Abstr Act
Human oversight has become a key mechanism for the governance of artificial intelligence (“AI”). Human overseers are sup-
posed to increase the accuracy and safety of AI systems, uphold human values, and build trust in the technology. Empirical 
research suggests, however, that humans are not reliable in fulfilling their oversight tasks. They may be lacking in competence 
or be harmfully incentivised. This creates a challenge for human oversight to be effective. In addressing this challenge, this 
article aims to make three contributions. First, it 

**split_pdf_pages_into_chunks**: Splits and stores PDF pages into chunks by filename:

In [7]:
pdf_chunks = di_drg.split_pdf_pages_into_chunks(pdf_pages, chunk_size=500, chunk_overlap=150, verbose=False)

# Optional: Print a summary of chunks created per PDF
for filename, chunks in pdf_chunks.items():
    print(f"\nPDF: {filename}")
    print(f"Total Chunks: {len(chunks)}")


PDF: 10.1007s00146-023-01777-z_ 2023_Institutionalised di.pdf
Total Chunks: 225

PDF: 10.1007s10506-022-09323-w_ 2023_Mapping the Issues o.pdf
Total Chunks: 266

PDF: 10.1007s10551-022-05053-w_ 2022_Moral Judgments in t.pdf
Total Chunks: 345

PDF: 10.1007s11948-020-00276-4_ 2020_Towards Transparency.pdf
Total Chunks: 286

PDF: 10.1007s43681-023-00387-1_ 2023_Publics’ views on et.pdf
Total Chunks: 387

PDF: 10.1016j.clsr.2022.105657_ 2022_Regulating AI. A lab.pdf
Total Chunks: 296

PDF: 10.1016j.inffus.2023.101896_ 2023_Connecting the dots.pdf
Total Chunks: 504

PDF: 10.1016j.patter.2021.100362_ 2021_Creating ethics guid.pdf
Total Chunks: 231

PDF: 10.1016j.techsoc.2024.102471_ 2024_Trustworthy AI in th.pdf
Total Chunks: 364

PDF: 10.108010510974.2020.1807380_ 2021_Impacts of Attitudes.pdf
Total Chunks: 162

PDF: 10.108013600869.2022.2060471_ 2022_On the path to the f.pdf
Total Chunks: 241

PDF: 10.1093scipolscac062_ 2023_Can innovation vouch.pdf
Total Chunks: 295

PDF: 10.1109ACCESS.2

In [8]:
# Assuming pdf_chunks is the dictionary containing chunks for each PDF
first_key = list(pdf_chunks.keys())[0]  # Get the first PDF filename
print("first PDF of folder:", first_key)
first_pdf_chunks = pdf_chunks[first_key]  # Get the chunks for the first PDF

# Access the first and second chunks
first_chunk = first_pdf_chunks[0]
second_chunk = first_pdf_chunks[1]

# Print the first two chunks
print("\nFirst Chunk:", first_chunk, "\n\n")
print("Second Chunk:", second_chunk)

first PDF of folder: 10.1007s00146-023-01777-z_ 2023_Institutionalised di.pdf

First Chunk: page_content='Vol.:(0123456789)1 3AI & SOCIETY 
https://doi.org/10.1007/s00146-023-01777-z
OPEN FORUM
Institutionalised distrust and human oversight of artificial intelligence: 
towards a democratic design of AI governance under the European 
Union AI Act
Johann Laux1 
Received: 12 April 2023 / Accepted: 5 September 2023 
© The Author(s) 2023
Abstr Act
Human oversight has become a key mechanism for the governance of artificial intelligence (“AI”). Human overseers are sup-' metadata={'source': 'PDFs/AIregulation_query2\\10.1007s00146-023-01777-z_ 2023_Institutionalised di.pdf', 'page': 0} 


Second Chunk: page_content='Abstr Act
Human oversight has become a key mechanism for the governance of artificial intelligence (“AI”). Human overseers are sup-
posed to increase the accuracy and safety of AI systems, uphold human values, and build trust in the technology. Empirical 
research suggests, however

In [9]:
print("page content:", first_chunk.page_content, "\n\n")
print("metadata:", first_chunk.metadata)

page content: Vol.:(0123456789)1 3AI & SOCIETY 
https://doi.org/10.1007/s00146-023-01777-z
OPEN FORUM
Institutionalised distrust and human oversight of artificial intelligence: 
towards a democratic design of AI governance under the European 
Union AI Act
Johann Laux1 
Received: 12 April 2023 / Accepted: 5 September 2023 
© The Author(s) 2023
Abstr Act
Human oversight has become a key mechanism for the governance of artificial intelligence (“AI”). Human overseers are sup- 


metadata: {'source': 'PDFs/AIregulation_query2\\10.1007s00146-023-01777-z_ 2023_Institutionalised di.pdf', 'page': 0}
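
To make the roles of `chunk_size=500` and `chunk_overlap=150` concrete, here is a simplified sketch of overlapping chunking. It cuts fixed-width character windows; the internals of `split_pdf_pages_into_chunks` are not shown in this notebook, and the chunks printed above end at line breaks, which suggests a separator-aware splitter rather than this plain sliding window. The function name `split_with_overlap` is made up for illustration.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=150):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample_text = "".join(str(i % 10) for i in range(1200))
sample_chunks = split_with_overlap(sample_text, chunk_size=500, chunk_overlap=150)
print([len(chunk) for chunk in sample_chunks])
```

The overlap means that the end of one chunk is repeated at the start of the next, so a sentence cut at a chunk boundary still appears whole in at least one chunk.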


## Data Storage: Text chunks are converted into vector embeddings and stored in a vector database (Vector DB) next to their respective text chunks.

In [10]:
path_to_Chroma = os.path.join('DB_Chroma')  # Folder (relative to the working directory) where the Chroma DB is persisted

sources_DB = di_drg.inspect_chrom(CHROMA_PATH=path_to_Chroma, openAI_key=key.openAI_key)
print("Number of sources in DB:", len(sources_DB))
print("\nSources:\n", sources_DB)

# Remove the folder prefix ('PDFs\\' or 'PDFs/AIregulation_query2\\') from all entries;
# adjust the second prefix if your PDF folder has a different name
cleaned_sources_DB = [pdf.replace('PDFs\\', '').replace('PDFs/AIregulation_query2\\', '') for pdf in sources_DB]

# Print the result
print("\nCleaned sources:\n", cleaned_sources_DB)


Number of sources in DB: 37

Sources:
 ['PDFs/AIregulation_query2\\10.1007s43681-023-00387-1_ 2023_Publics’ views on et.pdf', 'PDFs/AIregulation_query2\\10.11453551624.3555294_ 2022_AI-Competent Individ.pdf', 'PDFs/AIregulation_query2\\10.2139ssrn.3021135_ 2017_Artificial Intellige.pdf', 'PDFs/AIregulation_query2\\10.219626552_ 2021_Investigating the Et.pdf', 'PDFs/AIregulation_query2\\2020_ADMINISTERING ARTIFI.pdf', 'PDFs/AIregulation_query2\\2023_THE ROLE OF ARTIFICI.pdf', 'PDFs/AIregulation_query2\\10.2139ssrn.4950727_ 2024_A Comprehensive Revi.pdf', 'PDFs/AIregulation_query2\\10.3389fcomp.2023.1113903_ 2023_What does the public.pdf', 'PDFs/AIregulation_query2\\10.108010510974.2020.1807380_ 2021_Impacts of Attitudes.pdf', 'PDFs/AIregulation_query2\\10.117700076503221080959_ 2022_An Eye for Artificia.pdf', 'PDFs/AIregulation_query2\\10.1093scipolscac062_ 2023_Can innovation vouch.pdf', 'PDFs/AIregulation_query2\\10.1016j.patter.2021.100362_ 2021_Creating ethics guid.pdf', 'PDFs/AIreg

In [11]:
# if you want to remove your DB:
## di_drg.remove_chrom(CHROMA_PATH=path_to_Chroma)

# pdf_chunks is a dictionary, so we can iterate over its keys (the PDF file names):
for pdf in pdf_chunks.keys():
    if pdf not in cleaned_sources_DB:
        print(f"The PDF '{pdf}' is not yet in the DB.")
        print("Adding chunks to the DB for", pdf)
        di_drg.save_to_chrom(chunks=pdf_chunks[pdf], CHROMA_PATH=path_to_Chroma, openAI_key=key.openAI_key)
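
The membership test above only works if the stored sources are reduced to bare file names. Instead of hard-coding folder prefixes, the file name can be extracted independently of the folder, as in this sketch. `PureWindowsPath` treats both `/` and `\` as separators, which matches the mixed paths stored in the DB. The helper names and example file names are made up for illustration.

```python
from pathlib import PureWindowsPath


def file_name_only(source):
    """Strip any folder prefix; PureWindowsPath accepts both '/' and '\\' as separators."""
    return PureWindowsPath(source).name


def missing_from_db(pdf_names, db_sources):
    """Return the PDF names that have no matching source in the database."""
    indexed = {file_name_only(source) for source in db_sources}
    return [name for name in pdf_names if name not in indexed]


example_sources = [
    "PDFs/AIregulation_query2\\paper_a.pdf",
    "PDFs\\paper_b.pdf",
]
print(missing_from_db(["paper_a.pdf", "paper_b.pdf", "paper_c.pdf"], example_sources))
```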

In [12]:
sources_DB = di_drg.inspect_chrom(CHROMA_PATH=path_to_Chroma, openAI_key=key.openAI_key)
print("Number of sources in DB:", len(sources_DB))
print("\nSources:\n", sources_DB)

Number of sources in DB: 37

Sources:
 ['PDFs/AIregulation_query2\\10.1007s43681-023-00387-1_ 2023_Publics’ views on et.pdf', 'PDFs/AIregulation_query2\\10.11453551624.3555294_ 2022_AI-Competent Individ.pdf', 'PDFs/AIregulation_query2\\10.2139ssrn.3021135_ 2017_Artificial Intellige.pdf', 'PDFs/AIregulation_query2\\10.219626552_ 2021_Investigating the Et.pdf', 'PDFs/AIregulation_query2\\2020_ADMINISTERING ARTIFI.pdf', 'PDFs/AIregulation_query2\\2023_THE ROLE OF ARTIFICI.pdf', 'PDFs/AIregulation_query2\\10.2139ssrn.4950727_ 2024_A Comprehensive Revi.pdf', 'PDFs/AIregulation_query2\\10.3389fcomp.2023.1113903_ 2023_What does the public.pdf', 'PDFs/AIregulation_query2\\10.108010510974.2020.1807380_ 2021_Impacts of Attitudes.pdf', 'PDFs/AIregulation_query2\\10.117700076503221080959_ 2022_An Eye for Artificia.pdf', 'PDFs/AIregulation_query2\\10.1093scipolscac062_ 2023_Can innovation vouch.pdf', 'PDFs/AIregulation_query2\\10.1016j.patter.2021.100362_ 2021_Creating ethics guid.pdf', 'PDFs/AIreg

# Data Retrieval and Generation

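Your prompt template, with placeholders for the retrieved context and the question (the logged query further below shows it rendered as a `Human:` message, not as a separate system message):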

In [13]:
PROMPT_TEMPLATE = """
Answer the question based only on the following context:

{context}

---

Answer the question based on the above context: {question}
"""
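
How the template might be filled can be sketched with plain string formatting. The `---` separator between chunks is inferred from the logged query in this notebook; the actual formatting inside `retrieveGenerate` is not shown, and `build_prompt` is a hypothetical helper.

```python
PROMPT_TEMPLATE = """
Answer the question based only on the following context:

{context}

---

Answer the question based on the above context: {question}
"""


def build_prompt(chunks, question, template=PROMPT_TEMPLATE):
    """Join the retrieved chunks with a '---' separator and fill the template."""
    context = "\n\n---\n\n".join(chunks)
    return template.format(context=context, question=question.strip())


example_prompt = build_prompt(
    ["First retrieved chunk.", "Second retrieved chunk."],
    "What do the chunks say?",
)
print(example_prompt)
```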

Your question (user message):

In [14]:
# Question query 1 (overwritten by query 2 below; comment out the query you do not want to run)
question = """
Why is it important to study laypersons' perceptions of AI regulation, especially focusing on their trust levels, perceived benefits and risks, and how these perceptions impact the successful integration of AI into society? Please discuss how understanding public perceptions can inform regulatory approaches, enhance public trust, and address potential risks and benefits for society.
"""

# Question query 2 (this assignment overwrites query 1 and is the one used below)
question = """
Why is it important to study laypersons' perceptions of AI regulation and how these perceptions impact the successful integration of AI into society? Please discuss how understanding public perceptions can inform regulatory approaches.
"""

Retrieve data (see "source_page_pairs", "filtered_hits" and "all_hits") and generate a response (see "response"):

In [15]:
response, source_page_pairs, filtered_hits, all_hits = di_drg.retrieveGenerate(
    query_text=question,
    prompt_template=PROMPT_TEMPLATE,
    openAI_key=key.openAI_key,
    chroma_path=path_to_Chroma,
    docsReturn=10,
    thresholdSimilarity=0.8,
)

Number of possible relevant text chunks found with a threshold similarity of 0.8: 108
Query: Human: 
Answer the question based only on the following context:

noted, there exists a noticeable void in the literature in 
regard to understanding how concrete research practices 
incorporate public perspectives and embrace multistake-
holder approaches, inclusion, and dialogue.
While several studies have delved into the framing of the 
publics’ role within AI governance in several instances (from 
Big Tech initiatives to hiring ethics teams and guidelines 
issued from multiple institutions to governments’ national

---

in impacting AI policy. It is thus vital to have a better understand-
ing of how the public thinks about AI and the governance of AI.
Such understanding is essential to crafting informed policy and
identifying opportunities to educate the public about AI’s character,
bene￿ts, and risks.
Using an original, large-scale survey ( N=2000), we studied how
the American public perce


In [16]:
print(response)
print(source_page_pairs)

Studying laypersons' perceptions of AI regulation is important because these perceptions can greatly impact the successful integration of AI into society. Public perceptions can influence the acceptance and adoption of AI systems, as well as shape policy outcomes related to AI governance. Understanding how the public thinks about AI and its governance is essential for crafting informed policy and identifying opportunities to educate the public about AI's benefits and risks.

By studying public perceptions, regulators can gain insights into the concerns and preferences of the general population regarding AI technology. This information can help regulators develop regulations that are more aligned with public values and expectations, ultimately increasing the likelihood of successful integration of AI into society. Additionally, understanding public perceptions can inform regulatory approaches by guiding regulators on how to incorporate public input into the regulatory process, ensuring 

In [17]:
print(len(all_hits))
print(all_hits[0])

108
(Document(metadata={'page': 0, 'source': 'PDFs/AIregulation_query2\\10.11453375627.3375827_ 2020_U.S. Public Opinion.pdf'}, page_content='in impacting AI policy. It is thus vital to have a better understand-\ning of how the public thinks about AI and the governance of AI.\nSuch understanding is essential to crafting informed policy and\nidentifying opportunities to educate the public about AI’s character,\nbene\uffffts, and risks.\nUsing an original, large-scale survey ( N=2000), we studied how\nthe American public perceives AI governance. The overwhelming\nmajority of Americans (82%) believe that AI and/or robots should\nPaper Presentation'), 0.8455482269417804)
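
Each hit is a `(document, score)` tuple, as the output above shows. One plausible way to combine a similarity threshold with a cap on the number of documents is sketched below. How `retrieveGenerate` actually uses `thresholdSimilarity` and `docsReturn` is not shown in this notebook, so this is an assumption, as is the convention that a higher score means a closer match. Plain dictionaries stand in for the document objects.

```python
def select_hits(scored_hits, threshold=0.8, max_docs=10):
    """Keep hits scoring at least threshold, best first, capped at max_docs."""
    above = [(doc, score) for doc, score in scored_hits if score >= threshold]
    above.sort(key=lambda hit: hit[1], reverse=True)
    return above[:max_docs]


example_hits = [
    ({"source": "paper_a.pdf", "page": 0}, 0.85),
    ({"source": "paper_b.pdf", "page": 3}, 0.79),
    ({"source": "paper_c.pdf", "page": 1}, 0.91),
]
selected = select_hits(example_hits, threshold=0.8, max_docs=2)
print([(doc["source"], score) for doc, score in selected])
```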


## It is possible to further investigate the results of a single run:

In [18]:
# Initialize a list to store all sources
sources = []

# Loop through each tuple and extract the source
for doc, score in all_hits:
    source = doc.metadata.get('source')  # Access the 'source' key from the metadata
    if source is not None:
        sources.append(source)

# Strip the folder prefix so only the file names remain (adjust the second prefix to your PDF folder)
sources = [source.replace('PDFs\\', '').replace('PDFs/AIregulation_query2\\', '') for source in sources]

# Print the list of sources
print(sources)

['10.11453375627.3375827_ 2020_U.S. Public Opinion.pdf', '10.1093scipolscac062_ 2023_Can innovation vouch.pdf', '10.1109ACCESS.2024.3458893_ 2024_Ethical Concerns Wit.pdf', '10.2139ssrn.4950727_ 2024_A Comprehensive Revi.pdf', '10.11453375627.3375827_ 2020_U.S. Public Opinion.pdf', '10.21203rs.3.rs-3765278v1_ 2024_Ethical concerns abo.pdf', '10.21203rs.3.rs-3765278v1_ 2024_Ethical concerns abo.pdf', '10.1093scipolscac062_ 2023_Can innovation vouch.pdf', '10.21203rs.3.rs-3765278v1_ 2024_Ethical concerns abo.pdf', '10.1007s43681-023-00387-1_ 2023_Publics’ views on et.pdf', '10.2139ssrn.4950727_ 2024_A Comprehensive Revi.pdf', '10.1109ACCESS.2024.3458893_ 2024_Ethical Concerns Wit.pdf', '10.1093scipolscac062_ 2023_Can innovation vouch.pdf', '10.11453551624.3555294_ 2022_AI-Competent Individ.pdf', '10.1109ACCESS.2024.3458893_ 2024_Ethical Concerns Wit.pdf', '10.21203rs.3.rs-3765278v1_ 2024_Ethical concerns abo.pdf', '10.21203rs.3.rs-3765278v1_ 2024_Ethical concerns abo.pdf', '10.1177205395

In [19]:
import pandas as pd
from collections import Counter

# Counting the frequency of each file
file_count = Counter(sources)

# Creating a DataFrame from the frequency count
frequency_df = pd.DataFrame(file_count.items(), columns=['File Name', 'Frequency'])

# Adding a new column for the total number of chunks per PDF
frequency_df["Chunk Length PDF"] = 0

# Loop through pdf_chunks and update the DataFrame
for filename, chunks in pdf_chunks.items():
    frequency_df.loc[frequency_df["File Name"] == filename, "Chunk Length PDF"] = len(chunks)

frequency_df["Chunk Percentage"] = ((frequency_df["Frequency"] / frequency_df["Chunk Length PDF"]) * 100).round(2)

frequency_df = frequency_df.sort_values(by="Chunk Percentage", ascending=False)
print(frequency_df)

# Saving the DataFrame to an Excel file
frequency_df.to_excel("outputs_ChromaApproach/file_frequency_table_query2.xlsx", index=False)

                                            File Name  Frequency  \
2   10.1109ACCESS.2024.3458893_ 2024_Ethical Conce...         17   
4   10.21203rs.3.rs-3765278v1_ 2024_Ethical concer...         15   
0   10.11453375627.3375827_ 2020_U.S. Public Opini...          9   
1   10.1093scipolscac062_ 2023_Can innovation vouc...         16   
9   10.3389fcomp.2023.1113903_ 2023_What does the ...          6   
8   10.108010510974.2020.1807380_ 2021_Impacts of ...          4   
5   10.1007s43681-023-00387-1_ 2023_Publics’ views...          9   
13  10.1186s12911-021-01586-8_ 2021_Exploring perc...          4   
6   10.11453551624.3555294_ 2022_AI-Competent Indi...          4   
3   10.2139ssrn.4950727_ 2024_A Comprehensive Revi...          9   
11  10.1371journal.pone.0288109_ 2023_Exploring th...          2   
7   10.117720539517221092956_ 2022_Artificial inte...          2   
15  10.1162daed_a_01920_ 2022_Artificially Intelli...          1   
14   10.1111rego.12512_ 2024_Trustworthy artific
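
The percentage column divides by the number of chunks per PDF, which is 0 for any file that is in the DB but not in `pdf_chunks`. A standard-library sketch of the same calculation with a guard for that case is shown below; the function name and example file names are made up for illustration.

```python
from collections import Counter


def retrieval_share(hit_sources, chunks_per_file):
    """Return (file, hits, percent of the file's chunks retrieved), sorted by percent."""
    rows = []
    for name, hits in Counter(hit_sources).items():
        total = chunks_per_file.get(name, 0)
        percent = round(hits / total * 100, 2) if total else None
        rows.append((name, hits, percent))
    # Files with an unknown chunk count (percent is None) are listed last.
    rows.sort(key=lambda row: (row[2] is None, -(row[2] or 0)))
    return rows


example_rows = retrieval_share(
    ["a.pdf", "b.pdf", "a.pdf", "c.pdf"],
    {"a.pdf": 200, "b.pdf": 50},
)
for row in example_rows:
    print(row)
```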

## Loop through: run retrieval and generation several times

In [20]:
# Initialize an empty list to store the results
results = []

# Run the function multiple times and store the outputs
for _ in range(10):  # Replace 10 with the number of iterations you want to run
    response, source_page_pairs, filtered_hits, all_hits = di_drg.retrieveGenerate(
        query_text=question,
        prompt_template=PROMPT_TEMPLATE,
        openAI_key=key.openAI_key,
        chroma_path=path_to_Chroma,
        docsReturn=10,
        thresholdSimilarity=0.8,
    )
    results.append({"Response LLM": response, "Sources, Pages": source_page_pairs})

# Convert the results list into a DataFrame
df = pd.DataFrame(results)

# Display the DataFrame
print(df)

# Saving the DataFrame to an Excel file
df.to_excel("outputs_ChromaApproach/file_LLMs_calls_query2.xlsx", index=False)

Number of possible relevant text chunks found with a threshold similarity of 0.8: 108
Query: Human: 
Answer the question based only on the following context:

The complex and subtle sociotechnical concepts inherent to AI 
make it challenging to design effective governance and science 
communication strategies that are informed by and respectful 
of diverse public views and values. In light of these challenges, 
this work evaluated underlying factors, values, and mech-
anisms that influence attitudes toward AI. We explored the role of sociodemographic variables; the impact of the cultural 
values of egalitarianism, individualism, techno-skepticism,

---

influence perceptions.
In addition to perceptions potentially depending, at 
least in part, on task context and characteristics, indi -
viduals may perceive certain risks or benefits associated 
with AI-enabled technologies in healthcare. For example, 
the potential for AI to improve the efficiency and accu -
racy of decisions may be ap
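
Once several runs are collected, one simple summary is how consistently each source is cited across runs. The sketch below assumes that each run's `source_page_pairs` is a list of `(source, page)` tuples; the notebook does not show its exact structure, so adapt the unpacking if it differs. The function name and example data are made up for illustration.

```python
from collections import Counter


def source_stability(runs):
    """For each source, return the fraction of runs in which it was cited at least once."""
    if not runs:
        return {}
    counts = Counter()
    for pairs in runs:
        # Use a set so a source cited on several pages counts once per run.
        counts.update({source for source, _page in pairs})
    return {source: n / len(runs) for source, n in counts.items()}


example_runs = [
    [("a.pdf", 0), ("b.pdf", 3)],
    [("a.pdf", 0), ("a.pdf", 5)],
]
print(source_stability(example_runs))
```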