# Data Indexing

Data indexing involves two central steps:

1. Documents are loaded and split into smaller text chunks.
2. Text chunks are converted into vector embeddings and stored in a vector database (Vector DB) next to their respective text chunks.


*** 
**Background information**

* ...


***
**Coding sources**



## Get API and local Supabase server key(s)

In [1]:
import os
import sys

# Assuming 'src' is located two levels up from the current working directory
path_to_src = os.path.join('..', '..', 'src')  # Moves two levels up, then into the 'src' folder

# Add the path to sys.path
sys.path.append(path_to_src)

# Now you can import your API_key module
import API_key as key

## Include self-written functions

In [2]:
import src.forDataIndexing as di

In [3]:
# Print the current working directory
print("Current working directory:", os.getcwd())

Current working directory: c:\DATEN\PHD\WORKSHOPS\introductory workshop in LLMs\4_summarizingLiterature\RAG


# Data Preparation: Documents are loaded and split into smaller text chunks

**load_pdfs_by_filename**: Loads every PDF in a folder and stores its pages in a dictionary keyed by filename:

In [4]:
path_to_PDFs = os.path.join('PDFs')  # 'PDFs' folder inside the current working directory


pdf_pages = di.load_pdfs_by_filename(path_to_PDFs, verbose=False)

# Optional: Print the loaded pages by filename
for filename, pages in pdf_pages.items():
    print(f"\nPDF: {filename}")
    print(f"Total Pages: {len(pages)}")
    # print(pages[0])

Ignoring wrong pointing object 12 0 (offset 0)
Ignoring wrong pointing object 1319 0 (offset 0)



PDF: 10.1007_s00146-023-01650-z.pdf
Total Pages: 8

PDF: 10.1007_s10506-017-9206-9.pdf
Total Pages: 15


In [5]:
# pdf_pages is the dictionary containing the loaded pages for each PDF
first_key = list(pdf_pages.keys())[0]  # Get the first PDF filename
print("first PDF of folder:", first_key)
first_pdf_pages = pdf_pages[first_key]  # Get the pages of the first PDF


# Print the first page
print("First Page:", first_pdf_pages[0], "\n\n")

first PDF of folder: 10.1007_s00146-023-01650-z.pdf
First Page: page_content='Vol.:(0123456789)1 3AI & SOCIETY (2024) 39:1961–1968 
https://doi.org/10.1007/s00146-023-01650-z
OPEN FORUMThe regulation of artiﬁcial intelligence
Giusella Finocchiaro1 Received: 22 December 2022 / Accepted: 20 March 2023 / Published online: 3 April 2023  
© The Author(s) 2023
Abstract
Before embarking on a discussion of the regulation of artiﬁcial intelligence (AI), it is ﬁrst necessary to deﬁne the subject 
matter regulated. Deﬁning artiﬁcial intelligence is a diﬃcult endeavour, and many deﬁnitions have been proposed over the 
years. Although more than 70 years have passed since it was adopted, the most convincing deﬁnition is still nonetheless 
that proposed by Turing; in any case, it is important to be mindful of the risk of anthropomorphising artiﬁcial intelligence, 
which may arise in particular from its very deﬁnition. Once we have established the subject matter regulated, we must ask 
ourselves wheth

**split_pdf_pages_into_chunks**: Splits the pages of each PDF into overlapping text chunks and stores them by filename:

In [6]:
pdf_chunks = di.split_pdf_pages_into_chunks(pdf_pages, chunk_size=500, chunk_overlap=150, verbose=False)

# Optional: Print a summary of chunks created per PDF
for filename, chunks in pdf_chunks.items():
    print(f"\nPDF: {filename}")
    print(f"Total Chunks: {len(chunks)}")


PDF: 10.1007_s00146-023-01650-z.pdf
Total Chunks: 136

PDF: 10.1007_s10506-017-9206-9.pdf
Total Chunks: 120
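
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-width sliding window over characters. It is not the implementation of `di.split_pdf_pages_into_chunks`, which is not shown in this notebook and appears to break at natural boundaries such as line ends. The function name `split_text` and the sample string are assumptions made for illustration only.

```python
def split_text(text: str, chunk_size: int = 500, chunk_overlap: int = 150) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters (hypothetical helper).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Illustrative input: 1200 digits, not taken from the PDFs above
sample = "".join(str(i % 10) for i in range(1200))
chunks = split_text(sample, chunk_size=500, chunk_overlap=150)

print("Total Chunks:", len(chunks))
print("Shared text between chunk 1 and 2:", chunks[0][-150:] == chunks[1][:150])
```

A larger overlap lowers the chance that a sentence is cut off at a chunk border without context, at the cost of more chunks and therefore more embeddings to compute and store.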


In [7]:
# Assuming pdf_chunks is the dictionary containing chunks for each PDF
first_key = list(pdf_chunks.keys())[0]  # Get the first PDF filename
print("first PDF of folder:", first_key)
first_pdf_chunks = pdf_chunks[first_key]  # Get the chunks for the first PDF

# Access the first and second chunks
first_chunk = first_pdf_chunks[0]
second_chunk = first_pdf_chunks[1]

# Print the first two chunks
print("\nFirst Chunk:", first_chunk, "\n\n")
print("Second Chunk:", second_chunk)

first PDF of folder: 10.1007_s00146-023-01650-z.pdf

First Chunk: page_content='Vol.:(0123456789)1 3AI & SOCIETY (2024) 39:1961–1968 
https://doi.org/10.1007/s00146-023-01650-z
OPEN FORUMThe regulation of artiﬁcial intelligence
Giusella Finocchiaro1 Received: 22 December 2022 / Accepted: 20 March 2023 / Published online: 3 April 2023  
© The Author(s) 2023
Abstract
Before embarking on a discussion of the regulation of artiﬁcial intelligence (AI), it is ﬁrst necessary to deﬁne the subject' metadata={'source': 'PDFs\\10.1007_s00146-023-01650-z.pdf', 'page': 0} 


Second Chunk: page_content='Abstract
Before embarking on a discussion of the regulation of artiﬁcial intelligence (AI), it is ﬁrst necessary to deﬁne the subject 
matter regulated. Deﬁning artiﬁcial intelligence is a diﬃcult endeavour, and many deﬁnitions have been proposed over the 
years. Although more than 70 years have passed since it was adopted, the most convincing deﬁnition is still nonetheless' metadata={'source': 'PDFs\

In [8]:
print("page content:", first_chunk.page_content, "\n\n")
print("metadata:", first_chunk.metadata)

page content: Vol.:(0123456789)1 3AI & SOCIETY (2024) 39:1961–1968 
https://doi.org/10.1007/s00146-023-01650-z
OPEN FORUMThe regulation of artiﬁcial intelligence
Giusella Finocchiaro1 Received: 22 December 2022 / Accepted: 20 March 2023 / Published online: 3 April 2023  
© The Author(s) 2023
Abstract
Before embarking on a discussion of the regulation of artiﬁcial intelligence (AI), it is ﬁrst necessary to deﬁne the subject 


metadata: {'source': 'PDFs\\10.1007_s00146-023-01650-z.pdf', 'page': 0}


# Data Storage: Text chunks are converted into vector embeddings and stored in a vector database (Vector DB) next to their respective text chunks.
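
As a rough sketch of what "stored next to their respective text chunks" can mean, each chunk can be turned into one row that holds the text, its metadata, and its embedding. Everything below is an assumption for illustration: `toy_embedding` is a hash-based stand-in for a real embedding model, the column names `content`, `metadata` and `embedding` are hypothetical, and the example chunks are placeholders rather than data from the PDFs.

```python
import hashlib
import math


def toy_embedding(text: str, dim: int = 8) -> list[float]:
    """Hypothetical stand-in for an embedding model.

    Returns a deterministic unit-length vector derived from a hash;
    it carries no semantic meaning.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    raw = [byte - 127.5 for byte in digest[:dim]]  # center the byte values around 0
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]


def build_rows(chunks: list[dict]) -> list[dict]:
    """Pair every chunk with its embedding so both can be stored in one row."""
    rows = []
    for chunk in chunks:
        rows.append({
            "content": chunk["page_content"],
            "metadata": chunk["metadata"],
            "embedding": toy_embedding(chunk["page_content"]),
        })
    return rows


# Placeholder chunks that mimic the page_content / metadata layout shown above
example_chunks = [
    {"page_content": "first placeholder chunk", "metadata": {"source": "example.pdf", "page": 0}},
    {"page_content": "second placeholder chunk", "metadata": {"source": "example.pdf", "page": 0}},
]

rows = build_rows(example_chunks)
print(rows[0]["metadata"], len(rows[0]["embedding"]))
```

A list of dictionaries like `rows` has the same shape as the list passed to `.insert(...)` in the vendor example that follows, so the same insert pattern could in principle be used for chunk rows once a suitable table exists.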

In [24]:
from supabase import create_client, Client
from faker import Faker
import faker_commerce


def add_entries_to_vendor_table(supabase, vendor_count):
    """Insert `vendor_count` fake vendors and return their generated vendor_ids."""
    fake = Faker()
    fake.add_provider(faker_commerce.Provider)

    # Build one row of fake data per vendor
    main_list = []
    for _ in range(vendor_count):
        value = {'vendor_name': fake.company(), 'total_employees': fake.random_int(40, 169),
                 'vendor_location': fake.country()}
        main_list.append(value)

    # Insert all rows in a single request; the response holds the inserted rows
    data = supabase.table('vendor2').insert(main_list).execute()

    # Collect the generated primary keys so they can serve as foreign keys elsewhere
    foreign_key_list = [int(entry['vendor_id']) for entry in data.data]
    return foreign_key_list


def add_entries_to_product_table(supabase, vendor_id):
    """Insert between 1 and 15 fake products that reference `vendor_id`."""
    fake = Faker()
    fake.add_provider(faker_commerce.Provider)
    product_count = fake.random_int(1, 15)
    main_list = []
    for _ in range(product_count):
        value = {'vendor_id': vendor_id, 'product_name': fake.ecommerce_name(),
                 'inventory_count': fake.random_int(1, 100), 'price': fake.random_int(45, 100)}
        main_list.append(value)
    supabase.table('Product').insert(main_list).execute()


def main():
    vendor_count = 10
    supabase: Client = create_client(key.SUPABASE_URL, key.SUPABASE_KEY)
    fk_list = add_entries_to_vendor_table(supabase, vendor_count)
    #for i in range(len(fk_list)):
    #    add_entries_to_product_table(supabase, fk_list[i])


main()


In [33]:
from supabase import create_client, Client

supabase: Client = create_client(key.SUPABASE_URL, key.SUPABASE_KEY)

data = supabase.rpc('hello_world').execute()
print("Hello World:", data)


data = supabase.rpc('get_vendors').gt('total_employees', 160).execute()
print("Vendors:", data)
# Display the first returned row (a notebook cell only shows its last expression)
data.data[0]

Hello World: data='hello world' count=None
Vendors: data=[{'vendor_id': 17, 'vendor_name': 'Mcbride-Daniels', 'vendor_location': 'Saint Martin', 'total_employees': 164, 'created_at': '2024-10-09T12:38:53.571458+00:00'}, {'vendor_id': 20, 'vendor_name': 'Cervantes Group', 'vendor_location': 'Turkey', 'total_employees': 166, 'created_at': '2024-10-09T12:38:53.571458+00:00'}, {'vendor_id': 24, 'vendor_name': 'Lopez LLC', 'vendor_location': 'France', 'total_employees': 163, 'created_at': '2024-10-09T12:49:41.538304+00:00'}, {'vendor_id': 25, 'vendor_name': 'Hart, Gonzalez and Martin', 'vendor_location': 'Andorra', 'total_employees': 163, 'created_at': '2024-10-09T12:49:41.538304+00:00'}, {'vendor_id': 52, 'vendor_name': 'Gilbert-Smith', 'vendor_location': 'Cocos (Keeling) Islands', 'total_employees': 167, 'created_at': '2024-10-09T12:57:38.891265+00:00'}, {'vendor_id': 77, 'vendor_name': 'Arias PLC', 'vendor_location': 'El Salvador', 'total_employees': 161, 'created_at': '2024-10-09T13:04:

{'vendor_id': 17,
 'vendor_name': 'Mcbride-Daniels',
 'vendor_location': 'Saint Martin',
 'total_employees': 164,
 'created_at': '2024-10-09T12:38:53.571458+00:00'}