# Data Indexing

There are two central steps involved:

1. Documents are stored, prepared and split into smaller text chunks.
2. Text chunks are converted into vector embeddings and stored in a vector database (Vector DB) next to their respective text chunks.


*** 
**Background information**

* All files, chunks and embeddings are stored on a local **Supabase** server (open source Firebase alternative; based on Postgres, which is a relational database management system), see: https://supabase.com/


***
**Coding sources**

I extend the code provided and explained in the following YouTube Video: 

- The missing pieces to your AI app (pgvector + RAG in prod): https://www.youtube.com/watch?v=ibzlEQmgPPY
    + GitHub: https://github.com/supabase-community/chatgpt-your-files


## How to use this code:


1. Describe your search, i.e. how you found the PDF files:
```
entry_search = {'topic': "AI regulation", 'subtopic': "trust, risk, benefit", 
                'search_query':'"artificial intelligence" AND (trust OR risk* OR benef*) AND "regulation"', 'search_plattform': "Google Scholar", 
                'comment':"only retrieved the first 30 entries (1 excluded, because not downloadable); N=29"}
```

2. Define the folder path to your PDFs:
```
path_to_PDF = os.path.join('PDFs/AIregulation/')  # Relative to the current working directory
```

3. Provide the additional arguments for the PDF entries in your DB:

The information in `args_Search` is stored alongside each entry in the DB, so that you can filter your DB later:
```
args_Search = {'topic': "AI regulation", 'subtopic': "trust, risk, benefit"}
```

4. Choose between GROBID (GeneRation Of BIbliographic Data) and simple recursive splitting:


Remark: GROBID is an open-source software tool designed for extracting, parsing, and structuring bibliographic data and scientific documents.
```
...
```


## If you are facing issues running your code:

It could be the case that chroma and langchain cause import issues, see: https://github.com/langchain-ai/langchain/issues/7509

In [1]:
## run in your terminal:
# pip install pydantic==1.10.8
# pip install chromadb==0.3.26
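
Before reinstalling, it can help to see which versions are actually installed. The following is a minimal sketch using only the standard library; the helper name `check_pins` is hypothetical, and the pinned versions are simply the ones from the cell above.

```python
from importlib import metadata

# Versions pinned in the cell above
PINNED = {"pydantic": "1.10.8", "chromadb": "0.3.26"}


def check_pins(pins):
    """Return {package: (expected, installed)} for every mismatch.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


# An empty dictionary means that all pins are satisfied
print(check_pins(PINNED))
```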

## Get API, local supabase server key(s)

In [2]:
import os
import sys

# 'src' is located two levels above the current working directory
path_to_src = os.path.join('../..','src')  # Moves two levels up, then into the 'src' folder

# Add the path to sys.path
sys.path.append(path_to_src)

# Now you can import your API_key module
import API_key as key
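
Because the path above is relative, the import only works when the notebook is started from the expected working directory. A minimal sketch of a slightly more defensive variant follows; the helper `add_to_sys_path` is hypothetical and not part of the original code.

```python
import sys
from pathlib import Path


def add_to_sys_path(folder):
    """Resolve `folder` to an absolute path and append it to sys.path only once."""
    resolved = str(Path(folder).resolve())
    if resolved not in sys.path:
        sys.path.append(resolved)
    return resolved


# Same location as in the cell above, but printed as an absolute path,
# which makes a wrong working directory easy to spot
src_dir = add_to_sys_path(Path("..") / ".." / "src")
print(src_dir)
```

Re-running the cell does not add the same folder to `sys.path` a second time.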

## include self-written functions

In [3]:
import src.forDataIndexing as di



In [4]:
# Print the current working directory
print("Current working directory:", os.getcwd())

Current working directory: c:\DATEN\PHD\WORKSHOPS\introductory workshop in LLMs\4_summarizingLiterature\RAG


# Connect to our local supabase server

In [5]:
from supabase import create_client

supabase = create_client(key.SUPABASE_URL, key.SUPABASE_KEY)

# Data Preparation: Documents are stored, prepared and split into smaller text chunks

## Describe your search (how you found the PDF files)

In [6]:
res = supabase.table("searches").select("topic").execute()
# The rows returned by the query are available as a list of dictionaries in res.data
topics = [entry['topic'] for entry in res.data]
print("search topics in your DB:\n", topics)

search topics in your DB:
 []


In [7]:
entry_search = {'topic': "AI regulation", 'subtopic': "trust, risk, benefit", 
                'search_query':'"artificial intelligence" AND (trust OR risk* OR benef*) AND "regulation"', 'search_plattform': "Google Scholar", 
                'comment':"only retrieved the first 30 entries (1 excluded, because not downloadable); N=29"}

# Check if the 'topic' in entry_search is in topics
if entry_search['topic'] in topics:
    # Update the existing entry where the topic matches
    data = supabase.table('searches').update(entry_search).eq('topic', entry_search['topic']).execute()
    print("search entry updated")
else:
    # Insert the new entry if the topic is not found in topics
    data = supabase.table('searches').insert(entry_search).execute()
    print("new search entry added")

new search entry added
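
The check-then-update-or-insert logic above could also be written as a single upsert. This is only a sketch: it assumes that the `topic` column of the `searches` table has a unique constraint (the table definition is not shown here), and the helper name `upsert_search` is hypothetical.

```python
def upsert_search(client, entry, conflict_column="topic"):
    """Insert `entry` into 'searches', or update the row with the same topic.

    Assumption: `conflict_column` has a UNIQUE constraint in the database,
    otherwise the upsert cannot detect the existing row.
    """
    return (
        client.table("searches")
        .upsert(entry, on_conflict=conflict_column)
        .execute()
    )


# Usage with the objects defined in this notebook:
# upsert_search(supabase, entry_search)
```

Compared to the two-step version, this avoids fetching all topics first, at the cost of requiring the constraint.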


## Upload your PDFs in the local DB

Define the folder path to your PDFs:

In [8]:
path_to_PDF = os.path.join('PDFs/AIregulation/')  # Relative to the current working directory

Upload the PDFs to storage **and** create an entry for each PDF in the DB:

The information in `args_Search` is stored alongside each entry in the DB, so that you can filter your DB later.

Remark: this function takes a while, because each PDF is temporarily loaded to determine its number of pages.

In [9]:
args_Search = {'topic': "AI regulation", 'subtopic': "trust, risk, benefit"}

di.upload_PDFs(folder_path=path_to_PDF, supabase_DB=supabase, args_Search=args_Search, verbose=False)

Ignoring wrong pointing object 110 0 (offset 0)
Ignoring wrong pointing object 244 0 (offset 0)
[... further warnings of the same form omitted ...]

The following file: "most relevant" is a) not a PDF or b) was already uploaded in the DB.
The following file: "unicode error" is a) not a PDF or b) was already uploaded in the DB.
In total 28 PDFs were sucessfully uploaded to your DB.
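
The messages above show that entries which are not PDFs are skipped. How `di.upload_PDFs` decides this is not shown here; one possible check, sketched below with hypothetical helper names, is to test the file suffix and the `%PDF-` signature at the start of the file.

```python
from pathlib import Path


def looks_like_pdf(path):
    """Return True if `path` is a file with a .pdf suffix that starts with '%PDF-'."""
    path = Path(path)
    if not path.is_file() or path.suffix.lower() != ".pdf":
        return False
    with path.open("rb") as fin:
        return fin.read(5) == b"%PDF-"


def list_pdfs(folder):
    """Return the sorted names of all entries in `folder` that pass looks_like_pdf."""
    return sorted(p.name for p in Path(folder).iterdir() if looks_like_pdf(p))


# Usage with the folder defined in this notebook:
# print(list_pdfs(path_to_PDF))
```

Subfolders fail the `is_file()` test and are therefore skipped as well.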


PDFs in your DB:

In [10]:
res = supabase.storage.from_('files').list()
# Extracting 'name' from each dictionary
file_names = [file['name'] for file in res]

# Output the list of file names
print(file_names)

# Number of PDFs in DB
print(len(file_names))

['10.1002_sd.2048.pdf', '10.1007_s00146-023-01650-z.pdf', '10.1007_s10506-017-9206-9.pdf', '10.1007_s11077-022-09452-8.pdf', '10.1007_s11569-024-00454-9.pdf', '10.1007_s40804-020-00200-0.pdf', '10.1017_err.2019.8.pdf', '10.1017_err.2021.52.pdf', '10.1017_err.2022.14.pdf', '10.1017_err.2023.1.pdf', '10.1080_13600834.2018.1488659.pdf', '10.1080_13669877.2021.1957985.pdf', '10.1111_bioe.13124.pdf', '10.1111_rego.12512.pdf', '10.1111_rego.12563.pdf', '10.1111_rego.12568.pdf', '10.1177_0266382120923962.pdf', '10.1177_2053951719860542.pdf', '10.1177_20539517211039493.pdf', '10.14658_pupj-jelt-2021-2-2.pdf', '10.2139_ssrn.3501410.pdf', '10.24251_HICSS.2020.647.pdf', '10.24251_HICSS.2021.664.pdf', '10.2979_gls.2023.a886162.pdf', '10.4324_9780429262081-19.pdf', '10.48550_arXiv.2305.02231.pdf', 'doi-10.1017_err.2022.38.pdf', 'white house_AI.pdf']
28


## PDFs in the local DB are prepared

set a folder path for the temporary download of your PDFs:

In [11]:
path_to_PDFs = os.path.join('tmp_downloads')  # Folder relative to the current working directory

**GROBID**

Start a local GROBID server with Docker (see: https://grobid.readthedocs.io/en/latest/Grobid-docker/#grobid-and-docker-containers):

docker run --gpus all --init --ulimit core=0 -p 8070:8070 grobid/grobid:0.8.1

Alternatively, add `--rm` so that the container is removed automatically once it stops:

docker run --rm --gpus all --init --ulimit core=0 -p 8070:8070 grobid/grobid:0.8.1

In [12]:
ERROR

NameError: name 'ERROR' is not defined

see: https://grobid.readthedocs.io/en/latest/Grobid-service/

In [None]:
import requests
from pathlib import Path

pdf_file = Path(path_to_PDFs) / "10.1007_s00146-023-01650-z.pdf"

# Send the PDF to the local GROBID service; the context manager closes the file afterwards
with pdf_file.open('rb') as fin:
    response = requests.post(
        'http://localhost:8070/api/processFulltextDocument',
        files={'input': fin},
    )

In [None]:
vars(response)

In [None]:
response.content
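
GROBID answers with a TEI XML document. A minimal sketch of how the title and the body paragraphs could be pulled out of it with the standard library follows; the helper names are hypothetical, and the sample string is a hand-written toy document, not real GROBID output.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI documents
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def tei_title(tei_xml):
    """Return the title found in the TEI header, or None if there is none."""
    root = ET.fromstring(tei_xml)
    node = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    if node is None:
        return None
    text = "".join(node.itertext()).strip()
    return text or None


def tei_paragraphs(tei_xml):
    """Return the text of all <p> elements inside the TEI body."""
    root = ET.fromstring(tei_xml)
    paragraphs = root.findall(".//tei:body//tei:p", TEI_NS)
    return ["".join(p.itertext()).strip() for p in paragraphs]


# Toy document for illustration only
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt>
    <title level="a" type="main">A toy title</title>
  </titleStmt></fileDesc></teiHeader>
  <text><body><div><head>Intro</head>
    <p>First paragraph.</p><p>Second <hi>one</hi>.</p>
  </div></body></text>
</TEI>"""

print(tei_title(sample))
print(tei_paragraphs(sample))
```

With a real response, `response.content` would be passed instead of `sample`.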

In [44]:
from grobid_client import Client

client = Client(base_url="https://cloud.science-miner.com/grobid/api")

In [None]:
client

In [None]:
from grobid_client.grobid_client import GrobidClient

In [33]:
from grobid_client import Client

client = Client(base_url="http://localhost:8070/")

In [46]:
from pathlib import Path
from grobid_client.api.pdf import process_fulltext_document
from grobid_client.models import Article, ProcessForm
from grobid_client.types import TEI, File
pdf_file = Path(path_to_PDFs) / "10.1007_s00146-023-01650-z.pdf"

with pdf_file.open("rb") as fin:
    form = ProcessForm(
        segment_sentences="1",
        input_=File(file_name=pdf_file.name, payload=fin, mime_type="application/pdf"),
    )
    r = process_fulltext_document.sync_detailed(client=client, multipart_data=form)
    if r.is_success:
        article: Article = TEI.parse(r.content, figures=False)
        assert article.title

In [None]:
r


Get the names of all PDFs that have not been processed yet (chunks + embeddings):

In [13]:
non_processed_PDFs = di.non_processed_PDFs(supabase_DB=supabase, verbose=False)
print("non_processed_PDFs:\n", non_processed_PDFs)

ids in your table documents_chunks:
 set()
non_processed_PDFs:
 ['10.1002_sd.2048.pdf', '10.1007_s00146-023-01650-z.pdf', '10.1007_s10506-017-9206-9.pdf', '10.1007_s11077-022-09452-8.pdf', '10.1007_s11569-024-00454-9.pdf', '10.1007_s40804-020-00200-0.pdf', '10.1017_err.2019.8.pdf', '10.1017_err.2021.52.pdf', '10.1017_err.2022.14.pdf', '10.1017_err.2023.1.pdf', '10.1080_13600834.2018.1488659.pdf', '10.1080_13669877.2021.1957985.pdf', '10.1111_bioe.13124.pdf', '10.1111_rego.12512.pdf', '10.1111_rego.12563.pdf', '10.1111_rego.12568.pdf', '10.1177_0266382120923962.pdf', '10.1177_2053951719860542.pdf', '10.1177_20539517211039493.pdf', '10.14658_pupj-jelt-2021-2-2.pdf', '10.2139_ssrn.3501410.pdf', '10.24251_HICSS.2020.647.pdf', '10.24251_HICSS.2021.664.pdf', '10.2979_gls.2023.a886162.pdf', '10.4324_9780429262081-19.pdf', '10.48550_arXiv.2305.02231.pdf', 'doi-10.1017_err.2022.38.pdf', 'white house_AI.pdf']
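
Conceptually, this step compares the file names in storage with the ids already present in `documents_chunks` (where, judging from the outputs in this notebook, the `id` column holds the file name). The internals of `di.non_processed_PDFs` are not shown, so the following is only a sketch with a hypothetical helper name.

```python
def unprocessed(file_names, processed_ids):
    """Return the file names that have no chunks yet, keeping the storage order."""
    done = set(processed_ids)
    return [name for name in file_names if name not in done]


# Toy example: one of three files has already been processed
files = ["a.pdf", "b.pdf", "c.pdf"]
print(unprocessed(files, {"b.pdf"}))
```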


In [14]:
args_Split = {'chunk_size': 800, 'chunk_overlap': 150}

di.load_split_embed(supabase_DB=supabase, path_to_PDFs=path_to_PDFs, args_Split=args_Split, LMM='all-MiniLM-L6-v2')
# delete all downloaded files

ids in your table documents_chunks:
 set()


Ignoring wrong pointing object 110 0 (offset 0)
Ignoring wrong pointing object 244 0 (offset 0)
[... further warnings of the same form omitted ...]


10.1002_sd.2048.pdf




10.1007_s00146-023-01650-z.pdf
10.1007_s10506-017-9206-9.pdf
10.1007_s11077-022-09452-8.pdf
10.1007_s11569-024-00454-9.pdf
10.1007_s40804-020-00200-0.pdf
10.1017_err.2019.8.pdf
10.1017_err.2021.52.pdf
10.1017_err.2022.14.pdf
10.1017_err.2023.1.pdf


In [16]:
response = (
    supabase.table("documents_chunks")
    .select("*")
    .eq("order_chunks", 0)
    .execute()
)

print(len(response.data))
print(response.data)

10
[{'document_id': 'cb4d0c47662351ca09bade84b9554256', 'id': '10.1002_sd.2048.pdf', 'order_chunks': 0, 'section': None, 'content': "RESEARCH ARTICLE\nGoverning Artificial Intelligence to benefit the UN Sustainable\nDevelopment Goals\nJon Truby\nLaw & Development, College of Law, Qatar\nUniversity, Doha, Qatar\nCorrespondence\nJon Truby, Centre for Law & Development,\nCollege of Law, Qatar University, PO BOX\n2713 Doha, Qatar.\nEmail: jon.truby@qu.edu.qa\nFunding information\nQatar National Research Fund, Grant/Award\nNumber: NPRP 11S-1119-170016Abstract\nBig Tech's unregulated roll-out out of experimental AI poses risks to the achievement of\nthe UN Sustainable Development Goals (SDGs), w ith particular vulnerability for develop-\ning countries. The goal of financial inclusion is threatened by the imperfect and\nungoverned design and implementation of AI decision-making software making important", 'embedding': '[-0.01656288,0.051273286,-0.012496664,0.058511723,0.027607173,-0.004153954

In [17]:
response = (
    supabase.table("documents_chunks")
    .select("*")
    .eq("id", "10.1017_err.2019.8.pdf")
    .execute()
)

print(len(response.data))
#print(response.data)

102


# Outcomes

In [None]:
data = supabase.rpc('hello_world3').execute()
print("Hello World:", data)

In [24]:
from collections import namedtuple

# Define the namedtuple
Document = namedtuple('Document', ['page_content'])

# Create an instance of Document
#doc = Document("How can AI Regulation be defined?")
doc = Document("What are central risks of AI Regulation?")

# Pass the object inside a list
embedding = di.create_embeddings([doc], LMM='all-MiniLM-L6-v2', verbose=False)
print("embedding:\n", embedding[0].tolist())
print("length embedding:\n", len(embedding[0].tolist()))

embedding:
 [0.015590731054544449, -0.0014040754176676273, -0.0017040095990523696, -0.03447266295552254, 0.09766996651887894, 0.10837697237730026, 0.01510717160999775, 0.054916199296712875, 0.004265499766916037, 0.06800765544176102, 0.01796419359743595, 0.025512173771858215, 0.013039699755609035, -0.005051808897405863, -0.03730042278766632, 0.022326044738292694, -0.013636360876262188, -0.02060031332075596, -0.12669864296913147, -0.037537068128585815, 0.049131546169519424, -0.04182096943259239, -0.04689469933509827, 0.010040856897830963, -0.06738842278718948, -0.001647068769671023, 0.016694990918040276, -0.08428200334310532, -0.010556918568909168, -0.04807668924331665, -0.051953498274087906, -0.049975279718637466, 0.08247032016515732, -0.04592299833893776, -0.007842435501515865, -0.022438373416662216, -0.007997633889317513, -0.028029626235365868, 0.06806881725788116, -0.03452880308032036, 0.013360774144530296, -0.07295720279216766, -0.03219093009829521, -0.040609512478113174, 0.06035649

In [25]:
match_threshold = 0.7  # Replace with your desired threshold

# Call the RPC function
data = supabase.rpc('match_documents_chunks', {
    'embedding': embedding[0].tolist(),
    'match_threshold': match_threshold,
    'match_count': 5
}).execute()

In [26]:
len(data.data)

5
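
The SQL behind `match_documents_chunks` is not shown here. Assuming it ranks chunks by cosine similarity and keeps those above `match_threshold`, the same logic can be sketched in plain Python as follows (function names and toy vectors are illustrative only).

```python
import math


def cosine(u, v):
    """Cosine similarity of two equally long vectors (0.0 if one has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_chunks(query, rows, match_threshold, match_count):
    """Return up to `match_count` (similarity, content) pairs above the threshold."""
    scored = [(cosine(query, row["embedding"]), row["content"]) for row in rows]
    kept = [pair for pair in scored if pair[0] > match_threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:match_count]


# Toy two-dimensional "embeddings"
rows = [
    {"content": "a", "embedding": [1.0, 0.0]},
    {"content": "b", "embedding": [0.0, 1.0]},
    {"content": "c", "embedding": [1.0, 1.0]},
]
print(match_chunks([1.0, 0.0], rows, match_threshold=0.7, match_count=5))
```

In the database the comparison runs inside Postgres, so only the best matches are sent back to the notebook.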

In [None]:
data.data

In [43]:
print(data.data[0]["content"])
print(data.data[0]["embedding"])

ence, philosophy, law, risk research and technology 
assessment.
After some light is shed on the characteristics of 
AI applications as objects of risk regulation (Sec -
tion “ Applications of artiﬁcial intelligence as objects 
of risk regulation ”), the European approach to AI 
regulation (the AI Act) is outlined in Section “ Risk 
regulation in the European Artiﬁcial Intelligence 
Act”. Section “ Operationalising risks: options and 
challenges ” discusses the need for the risks to fun -
damental rights and societal values to be interpreted 
and operationalised given that the AI Act is seen as 
a regulatory framework that needs to be ﬁlled with 
normative choices. The normative ambiguities in
[-0.004556655,0.0026064916,-0.06992638,-0.05402437,0.04051935,0.096363835,0.08717033,0.03598786,-0.027202839,0.037706353,0.036698196,-0.020172024,-0.0071080923,0.039583758,-0.023244157,0.02948049,-0.02397024,-0.0676223,-0.1172657,0.06490634,0.07891032,-0.023304693,-0.033173647,-0.027152205,-0.099

In [44]:
import numpy as np
numeric_vector = np.fromstring(data.data[0]["embedding"].strip('[]'), sep=',')
len(numeric_vector.tolist())

384
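
As the output above shows, the `embedding` column comes back as a string such as `'[-0.0045,0.0026,...]'`. Since this string is also valid JSON, it can be parsed with the standard library as an alternative to `np.fromstring`. A minimal sketch with a hypothetical helper name:

```python
import json


def parse_vector(text):
    """Parse a vector returned as text, e.g. '[0.1,-0.2,0.3]', into a list of floats."""
    return [float(value) for value in json.loads(text)]


print(parse_vector("[-0.004556655,0.0026064916,-0.06992638]"))
```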

In [45]:
from sentence_transformers import SentenceTransformer

# Load the pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

similarities = model.similarity(embedding[0].tolist(), numeric_vector.tolist())
print(similarities)



tensor([[0.7318]])


In [None]:
ERROR

**load_pdfs_by_filename**: Loads and stores PDF pages by filename:

In [None]:



pdf_pages = di.load_pdfs_by_filename(path_to_PDFs, verbose=False)

# Optional: Print the loaded pages by filename
for filename, pages in pdf_pages.items():
    print(f"\nPDF: {filename}")
    print(f"Total Pages: {len(pages)}")
    # print(pages[0])

In [None]:
# pdf_pages is the dictionary containing the pages of each PDF
first_key = list(pdf_pages.keys())[0]  # Get the first PDF filename
print("first PDF of folder:", first_key)
first_pdf_pages = pdf_pages[first_key]  # Get the pages of the first PDF


# Print the first page
print("First Page:", first_pdf_pages[0], "\n\n")

**split_pdf_pages_into_chunks**: Splits and stores PDF pages into chunks by filename:

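On average, one token corresponds to roughly 4 characters of English text, so a chunk of 800 characters is roughly 200 tokens.


**By default, `all-MiniLM-L6-v2` truncates inputs longer than 256 word pieces (the underlying architecture supports at most 512 positions), so chunks of about 200 tokens fit.**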
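
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch that cuts a text into fixed character windows. It is not the implementation of `di.split_pdf_pages_into_chunks` (which is not shown here); a recursive splitter additionally prefers to break at separators such as paragraphs or sentences. Both helper names are hypothetical.

```python
def split_with_overlap(text, chunk_size=800, chunk_overlap=150):
    """Cut `text` into windows of `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def estimate_tokens(chunk, chars_per_token=4):
    """Rough token estimate based on the 4-characters-per-token rule of thumb."""
    return len(chunk) / chars_per_token


# Toy text of 2000 characters
text = "".join(chr(97 + i % 26) for i in range(2000))
chunks = split_with_overlap(text)
print([len(chunk) for chunk in chunks])
print(estimate_tokens(chunks[0]))
```

The overlap ensures that a sentence cut at a window border still appears in full in one of the two neighbouring chunks, as long as it is shorter than the overlap.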

In [None]:
pdf_chunks = di.split_pdf_pages_into_chunks(pdf_pages, chunk_size=800, chunk_overlap=150, verbose=False)

# Optional: Print a summary of chunks created per PDF
for filename, chunks in pdf_chunks.items():
    print(f"\nPDF: {filename}")
    print(f"Total Chunks: {len(chunks)}")

In [None]:
# Assuming pdf_chunks is the dictionary containing chunks for each PDF
first_key = list(pdf_chunks.keys())[0]  # Get the first PDF filename
print("first PDF of folder:", first_key)
first_pdf_chunks = pdf_chunks[first_key]  # Get the chunks for the first PDF

# Access the first and second chunks
first_chunk = first_pdf_chunks[0]
second_chunk = first_pdf_chunks[1]

# Print the first two chunks
print("\nFirst Chunk:", first_chunk, "\n\n")
print("Second Chunk:", second_chunk)

In [None]:
print("page content:", first_chunk.page_content, "\n\n")
print("metadata:", first_chunk.metadata)

In [None]:
print(second_chunk.page_content)

In [None]:
len(first_pdf_chunks)
first_pdf_chunks[0].page_content

In [None]:
pdf_chunks.keys()

In [None]:
list(pdf_chunks.keys())

In [50]:



# Collect the text of every chunk of the first PDF in a plain list
tmp_chunks = [chunk.page_content for chunk in first_pdf_chunks]


In [None]:
len(tmp_chunks)

# Data Storage: Text chunks are converted into vector embeddings and stored in a vector database (Vector DB) next to their respective text chunks.

In [None]:
from sentence_transformers import SentenceTransformer
import pandas as pd


# Define sentences
sentences = tmp_chunks

# Load the pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Extract features
features = model.encode(sentences)

# Print the features as a pandas dataframe
pd.DataFrame(features)

In [None]:
len(features)
features[0]

In [None]:
similarities = model.similarity(features, features)
print(similarities)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform
from sentence_transformers import SentenceTransformer

# Initialize the model and encode the corpus
embedder = SentenceTransformer("all-MiniLM-L6-v2")
corpus = sentences
corpus_embeddings = embedder.encode(corpus)

# Calculate similarity matrix using cosine similarity
similarity_matrix = cosine_similarity(corpus_embeddings)

# Convert similarity matrix to a distance matrix (clip rounding errors below zero)
distance_matrix = np.clip(1 - similarity_matrix, 0, None)
np.fill_diagonal(distance_matrix, 0)

# Perform hierarchical clustering: linkage expects a condensed distance matrix,
# and average linkage is used because Ward's method assumes Euclidean distances
linked = linkage(squareform(distance_matrix, checks=False), 'average')

# Plot the dendrogram
plt.figure(figsize=(10, 8))
dendrogram(linked, labels=corpus, orientation='top', distance_sort='descending', show_leaf_counts=True)
plt.title("Dendrogram of Sentence Clustering")
plt.xlabel("Sentences")
plt.ylabel("Distance")
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()



In [24]:
import os
import json
#from dotenv import load_dotenv
from supabase import create_client, Client
from faker import Faker
import faker_commerce


def add_entries_to_vendor_table(supabase, vendor_count):
    fake = Faker()
    foreign_key_list = []
    fake.add_provider(faker_commerce.Provider)
    main_list = []
    for i in range(vendor_count):
        value = {'vendor_name': fake.company(), 'total_employees': fake.random_int(40, 169),
                 'vendor_location': fake.country()}

        main_list.append(value)
    data = supabase.table('vendor2').insert(main_list).execute()
    # The inserted rows are returned as a list of dictionaries in data.data
    for entry in data.data:
        foreign_key_list.append(int(entry['vendor_id']))
    return foreign_key_list


def add_entries_to_product_table(supabase, vendor_id):
    fake = Faker()
    fake.add_provider(faker_commerce.Provider)
    main_list = []
    iterator = fake.random_int(1, 15)
    for i in range(iterator):
        value = {'vendor_id': vendor_id, 'product_name': fake.ecommerce_name(),
                 'inventory_count': fake.random_int(1, 100), 'price': fake.random_int(45, 100)}
        main_list.append(value)
    data = supabase.table('Product').insert(main_list).execute()


def main():
    vendor_count = 10
    supabase: Client = create_client(key.SUPABASE_URL, key.SUPABASE_KEY)
    fk_list = add_entries_to_vendor_table(supabase, vendor_count)
    #for i in range(len(fk_list)):
    #    add_entries_to_product_table(supabase, fk_list[i])


main()


In [None]:
from supabase import create_client, Client

supabase: Client = create_client(key.SUPABASE_URL, key.SUPABASE_KEY)

data = supabase.rpc('hello_world').execute()
print("Hello World:", data)


data = supabase.rpc('get_vendors').gt('total_employees', 160).execute()
print("Vendors:", data)
vars(data)
data.data[0]