WORK IN PROGRESS!

# Data Indexing

There are two central steps involved:

1. Documents are stored, prepared and split into smaller text chunks.
2. Text chunks are converted into vector embeddings and stored in a vector database (Vector DB) next to their respective text chunks.


*** 
**Background information**

* All files, chunks and embeddings are stored on a local **Supabase** server (open source Firebase alternative; based on Postgres, which is a relational database management system), see: https://supabase.com/


***
**Coding sources**

I extend the code provided and explained in the following YouTube Video: 

- The missing pieces to your AI app (pgvector + RAG in prod): https://www.youtube.com/watch?v=ibzlEQmgPPY
    + GitHub: https://github.com/supabase-community/chatgpt-your-files


## How to use this code:


1. Enter your search, i.e. how you found the PDF files:
```
entry_search = {'topic': "AI regulation", 'subtopic': "trust, risk, benefit", 
                'search_query':'"artificial intelligence" AND (trust OR risk* OR benef*) AND "regulation"', 'search_plattform': "Google Scholar", 
                'comment':"only retrieved the first 30 entries (1 excluded, because not downloadable); N=29"}
```

2. Define the folder path to your PDFs:
```
path_to_PDF = os.path.join('PDFs/AIregulation/')  # 'PDFs/AIregulation/' relative to the current working directory
```

3. Provide the additional arguments for the PDF entries in your DB:

Alongside each DB entry, the information in `args_Search` is stored so that you can filter your DB later.
```
args_Search = {'topic': "AI regulation", 'subtopic': "trust, risk, benefit"}
```

4. Decide whether to use GROBID (GeneRation Of BIbliographic Data) or simple recursive splitting:


Remark: GROBID is an open-source software tool designed for extracting, parsing, and structuring bibliographic data and scientific documents.
```
...
```


## If you are facing issues running your code:

chroma and langchain may cause import issues, see: https://github.com/langchain-ai/langchain/issues/7509

In [1]:
## run in your terminal:
# pip install pydantic==1.10.8
# pip install chromadb==0.3.26

## Get API and local Supabase server key(s)

In [2]:
import os
import sys

# 'src' is located two levels above the current working directory
path_to_src = os.path.join('../..','src')  # resolves to '../../src'

# Add the path to sys.path
sys.path.append(path_to_src)

# Now you can import your API_key module
import API_key as key
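
The `API_key` module itself is not part of this notebook. A minimal sketch of what such a module could look like, assuming it only needs to expose `SUPABASE_URL` and `SUPABASE_KEY` and that the values are read from environment variables; the helper name `load_supabase_credentials` and the default URL (the usual address of a local Supabase stack) are assumptions, not taken from the original code:

```python
import os


def load_supabase_credentials(env=None):
    """Return (url, key) for the local Supabase server.

    Hypothetical helper: reads the values from environment variables
    instead of hard-coding them in a file that might be committed.
    """
    env = os.environ if env is None else env
    # Assumed default address of a locally started Supabase stack
    url = env.get("SUPABASE_URL", "http://localhost:54321")
    key = env.get("SUPABASE_KEY")
    if not key:
        raise RuntimeError("SUPABASE_KEY is not set")
    return url, key


# Example with an explicit mapping, so nothing depends on the real environment
SUPABASE_URL, SUPABASE_KEY = load_supabase_credentials(
    {"SUPABASE_URL": "http://localhost:54321", "SUPABASE_KEY": "dummy-key"}
)
print(SUPABASE_URL)
```

Keeping the key out of the source file makes it harder to publish it by accident together with the notebook.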

## include self-written functions

In [3]:
import src.forDataIndexing as di



In [4]:
# Print the current working directory
print("Current working directory:", os.getcwd())

Current working directory: /home/fenn/Desktop/Workshop LLMs/5_summarizingLiterature/RAG


# Connect to our local supabase server

In [5]:
from supabase import create_client

supabase = create_client(key.SUPABASE_URL, key.SUPABASE_KEY)

# Data Preparation: Documents are stored, prepared and split into smaller text chunks

## Enter your search, i.e. how you found the PDF files

In [6]:
res = supabase.table("searches").select("topic").execute()
# res.data holds the returned rows as a list of dictionaries
data_items = res.data or []
# Extract the topic of each row
topics = [entry['topic'] for entry in data_items]
print("search topics in your DB:\n", topics)

search topics in your DB:
 ['livmats']


In [7]:
entry_search = {'topic': "livmats", 'subtopic': "all articles", 
                'search_query':'https://www.livmats.uni-freiburg.de/en/publications', 'search_plattform': "livmats homepage", 
                'comment':"could not retrieve all articles; final N= 248"}


# entry_search = {'topic': "AI regulation", 'subtopic': "trust, risk, benefit", 
#                'search_query':'"artificial intelligence" AND (trust OR risk* OR benef*) AND "regulation"', 'search_plattform': "Google Scholar", 
#                'comment':"only retrieved the first 30 entries (1 excluded, because not downloadable); N=29"}

# Check if the 'topic' in entry_search is in topics
if entry_search['topic'] in topics:
    # Update the existing entry where the topic matches
    data = supabase.table('searches').update(entry_search).eq('topic', entry_search['topic']).execute()
    print("search entry updated")
else:
    # Insert the new entry if the topic is not found in topics
    data = supabase.table('searches').insert(entry_search).execute()
    print("new search entry added")

search entry updated
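
The cell above either updates the row whose `topic` matches or inserts a new row. The same decision can be sketched without a database, using a plain list of dictionaries as a stand-in for the `searches` table; the function name `upsert_by_topic` is hypothetical:

```python
def upsert_by_topic(rows, entry):
    """Update the row with the same 'topic', or append a new one.

    `rows` is a plain list of dicts standing in for the 'searches' table.
    Returns "updated" or "inserted".
    """
    for row in rows:
        if row["topic"] == entry["topic"]:
            row.update(entry)
            return "updated"
    rows.append(dict(entry))
    return "inserted"


searches = [{"topic": "livmats", "subtopic": "old"}]
first = upsert_by_topic(searches, {"topic": "livmats", "subtopic": "all articles"})
second = upsert_by_topic(searches, {"topic": "AI regulation", "subtopic": "trust, risk, benefit"})
print(first, second, len(searches))
```

If the `topic` column carried a unique constraint, the client's `upsert` method might replace both branches in one call; that variant is not tested here.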


## Upload your PDFs in the local DB

define folder path to your PDFs:

In [8]:
path_to_PDF = os.path.join('PDFs/livmats/')  # 'PDFs/livmats/' relative to the current working directory

Upload the PDFs to storage **and** create an entry for each in the DB:

Alongside each DB entry, the information in `args_Search` is stored so that you can filter your DB later.

Remark: this function takes a while, because each PDF is temporarily loaded to get its number of pages.

In [9]:
args_Search = {'topic': "livmats", 'subtopic': "all articles"}


di.upload_PDFs(folder_path=path_to_PDF, supabase_DB=supabase, args_Search=args_Search, verbose=False)

The following file: "10.1016j.est.2023.107775__ C. Diestel__2023__Determining the qual.pdf" is a) not a PDF or b) was already uploaded in the DB.
The following file: "10.1002ejic.202200136__ M. Daub__2022__Crystal Structures o.pdf" is a) not a PDF or b) was already uploaded in the DB.
The following file: "10.3390biomimetics4010026__ O. Speck__2019__An Overview of Bioin.pdf" is a) not a PDF or b) was already uploaded in the DB.
Failed to upload or process PDF '10.1021acsmaterialsau.2c00077__ A. Ramírez__2023__Rhenium-Based Electr.pdf': {'statusCode': 400, 'error': 'Duplicate', 'message': 'The resource already exists'}
Failed to upload or process PDF '10.1021acs.jpclett.2c03493__ M. Walter__2023__Mechanochemical Acti.pdf': {'statusCode': 400, 'error': 'Duplicate', 'message': 'The resource already exists'}
Failed to upload or process PDF '10.10881361-665Xac8dcd__ N. Sherkat__2022__Fabrication and simu.pdf': {'statusCode': 400, 'error': 'Duplicate', 'message': 'The resource already exists'

Advanced encoding [] not implemented yet
Advanced encoding [] not implemented yet
Advanced encoding [] not implemented yet
Advanced encoding [] not implemented yet
Advanced encoding [] not implemented yet
Advanced encoding [] not implemented yet
Advanced encoding [] not implemented yet


Error parsing PDF '10.1002chem.201904433__ D. Bezold__2020__Cyclotriphosphate A.pdf': cannot access local variable 'v' where it is not associated with a value
Error parsing PDF '10.3390ma13061313__ V. Slesarenko__2020__Planar Mechanical Me.pdf': cannot access local variable 'v' where it is not associated with a value
Error parsing PDF '10.1002ange.202110327__ P. Lohner__2021__Inside a Shell—Organ.pdf': cannot access local variable 'v' where it is not associated with a value
In total 239 PDFs were successfully uploaded to your DB.
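
The log above reports several files as already uploaded. How `di.upload_PDFs` detects this is not shown here. One possible approach, given purely as an assumption, is to identify each file by a hash of its content, so that a renamed copy is still recognised; the names `content_id` and `split_new_and_duplicates` are hypothetical:

```python
import hashlib
import tempfile
from pathlib import Path


def content_id(path):
    """Hex SHA-256 digest of a file's bytes, used as a content-based identifier."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def split_new_and_duplicates(paths, known_ids=()):
    """Separate files with unseen content from files whose content is already known."""
    seen = set(known_ids)
    new, duplicates = [], []
    for path in paths:
        digest = content_id(path)
        if digest in seen:
            duplicates.append(Path(path).name)
        else:
            seen.add(digest)
            new.append(Path(path).name)
    return new, duplicates


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp, "paper_a.pdf")
    renamed_copy = Path(tmp, "paper_a_copy.pdf")
    other = Path(tmp, "paper_b.pdf")
    original.write_bytes(b"%PDF-1.4 content A")
    renamed_copy.write_bytes(b"%PDF-1.4 content A")  # same bytes, different name
    other.write_bytes(b"%PDF-1.4 content B")
    new, duplicates = split_new_and_duplicates([original, renamed_copy, other])

print("new:", new)
print("duplicates:", duplicates)
```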


PDFs in your DB:

In [10]:
res = supabase.storage.from_('files').list()
# Extracting 'name' from each dictionary
file_names = [file['name'] for file in res]

# Output the list of file names
print(file_names)

# Number of PDFs in DB
print(len(file_names))

['10.1002adem.202001037___M._Specht__2021__Adaptive_Wettability.pdf', '10.1002adfm.202011210___M._Beetz__2021__UltraThin_Protectiv.pdf', '10.1002adfm.202206958___J._Buttner__2022__Are_HalidePerovskit.pdf', '10.1002adfm.202210512___B._Tengen__2023__Immobilizing_Poly_vi.pdf', '10.1002adfm.202401188___A._Rutzler__2024__Mesoporous_NDoped_C.pdf', '10.1002adma.201905111___A._Walther__2020__Viewpoint_From_Resp.pdf', '10.1002adma.202002629___J._Deng__2020__ATPResponsive_and_A.pdf', '10.1002adma.202006341___F._Kotz__2021__TwoPhoton_Polymeriz.pdf', '10.1002adma.202211902___M._Farhan__2023__Artificial_Tendrils.pdf', '10.1002adma.202313125___N._Ghavidelnia__2024__BioInspired_Pressur.pdf', '10.1002adma.202313906___I._Fiorello__2024__A_Biohybrid_SelfDis.pdf', '10.1002admi.202102359___N._Bentley__2022__Protein_Repellent__S.pdf', '10.1002admi.202201018___A._Rajak__2022__High_Hysteresis_Susp.pdf', '10.1002admi.202201581___M._Bahrami__2022__Lubrication_Mechanis.pdf', '10.1002admi.202300316___D._Rusitov_

## PDFs in the local DB are prepared

set a folder path for the temporary download of your PDFs:

In [14]:
path_to_PDFs = os.path.join('tmp_downloads')  # 'tmp_downloads' folder inside the current working directory

get the names of all PDFs, which have not been processed (chunks + embeddings):

In [15]:
non_processed_PDFs = di.non_processed_PDFs(supabase_DB=supabase, verbose=False)
print("non_processed_PDFs:\n", non_processed_PDFs)

ids in your table documents_chunks:
 set()
non_processed_PDFs:
 ['10.1002adem.202001037___M._Specht__2021__Adaptive_Wettability.pdf', '10.1002adfm.202011210___M._Beetz__2021__UltraThin_Protectiv.pdf', '10.1002adfm.202206958___J._Buttner__2022__Are_HalidePerovskit.pdf', '10.1002adfm.202210512___B._Tengen__2023__Immobilizing_Poly_vi.pdf', '10.1002adfm.202401188___A._Rutzler__2024__Mesoporous_NDoped_C.pdf', '10.1002adma.201905111___A._Walther__2020__Viewpoint_From_Resp.pdf', '10.1002adma.202002629___J._Deng__2020__ATPResponsive_and_A.pdf', '10.1002adma.202006341___F._Kotz__2021__TwoPhoton_Polymeriz.pdf', '10.1002adma.202211902___M._Farhan__2023__Artificial_Tendrils.pdf', '10.1002adma.202313125___N._Ghavidelnia__2024__BioInspired_Pressur.pdf', '10.1002adma.202313906___I._Fiorello__2024__A_Biohybrid_SelfDis.pdf', '10.1002admi.202102359___N._Bentley__2022__Protein_Repellent__S.pdf', '10.1002admi.202201018___A._Rajak__2022__High_Hysteresis_Susp.pdf', '10.1002admi.202201581___M._Bahrami__2022_
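
Judging from the output, the non-processed PDFs are the stored file names that do not yet appear as `id` in the table `documents_chunks`. A minimal sketch of that comparison, assuming plain Python lists instead of Supabase responses; `find_non_processed` is a hypothetical name and not the implementation of `di.non_processed_PDFs`:

```python
def find_non_processed(stored_names, chunk_rows):
    """Return stored file names that have no row in the chunk table yet."""
    processed_ids = {row["id"] for row in chunk_rows}
    return sorted(name for name in stored_names if name not in processed_ids)


stored = ["paper_a.pdf", "paper_b.pdf", "paper_c.pdf"]
chunk_rows = [
    {"id": "paper_b.pdf", "order_chunks": 0},
    {"id": "paper_b.pdf", "order_chunks": 1},
]
pending = find_non_processed(stored, chunk_rows)
print(pending)
```

Using a set for the processed ids keeps the lookup cheap even when a PDF contributes many chunk rows.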

In [16]:
args_Split = {'chunk_size': 800, 'chunk_overlap': 150}

di.load_split_embed(supabase_DB=supabase, path_to_PDFs=path_to_PDFs, args_Split=args_Split, LMM='all-MiniLM-L6-v2')
# TODO: delete all downloaded files afterwards,
# or process the PDFs file by file

ids in your table documents_chunks:
 set()
10.1002adfm.202206958___J._Buttner__2022__Are_HalidePerovskit.pdf
10.1002adma.202006341___F._Kotz__2021__TwoPhoton_Polymeriz.pdf
10.1002adfm.202210512___B._Tengen__2023__Immobilizing_Poly_vi.pdf
10.1002adma.202313125___N._Ghavidelnia__2024__BioInspired_Pressur.pdf
10.1002adem.202001037___M._Specht__2021__Adaptive_Wettability.pdf
10.1002adma.201905111___A._Walther__2020__Viewpoint_From_Resp.pdf
10.1002adma.202002629___J._Deng__2020__ATPResponsive_and_A.pdf
10.1002adma.202211902___M._Farhan__2023__Artificial_Tendrils.pdf
10.1002adfm.202401188___A._Rutzler__2024__Mesoporous_NDoped_C.pdf
10.1002adfm.202011210___M._Beetz__2021__UltraThin_Protectiv.pdf


## Show GROBID (GeneRation Of BIbliographic Data)


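Start the GROBID server with Docker (see https://grobid.readthedocs.io/en/latest/Grobid-docker/#grobid-and-docker-containers); `--gpus all` requires a GPU that Docker can access:


docker run --gpus all --init --ulimit core=0 -p 8070:8070 grobid/grobid:0.8.1


Alternatively, add `--rm` so that the container is removed automatically when it stops:

docker run --rm --gpus all --init --ulimit core=0 -p 8070:8070 grobid/grobid:0.8.1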

In [None]:
ERROR

see: https://grobid.readthedocs.io/en/latest/Grobid-service/

In [17]:
import requests
from pathlib import Path

pdf_file = Path(path_to_PDFs) / "10.1002adma.201905111___A._Walther__2020__Viewpoint_From_Resp.pdf"

# Send the PDF to the local GROBID service; the context manager closes the file afterwards
with open(pdf_file, 'rb') as pdf:
    response = requests.post('http://localhost:8070/api/processFulltextDocument', files={'input': pdf})

In [18]:
vars(response)

{'_content': b'<?xml version="1.0" encoding="UTF-8"?>\n<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" \nxmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" \nxsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"\n xmlns:xlink="http://www.w3.org/1999/xlink">\n\t<teiHeader xml:lang="en">\n\t\t<fileDesc>\n\t\t\t<titleStmt>\n\t\t\t\t<title level="a" type="main">From Responsive to Adaptive and Interactive Materials and Materials Systems: A Roadmap</title>\n\t\t\t\t<funder>\n\t\t\t\t\t<orgName type="full">Deutsche Forschungsgemeinschaft (DFG, German Research Foundation)</orgName>\n\t\t\t\t</funder>\n\t\t\t\t<funder ref="#_XX7zaKZ">\n\t\t\t\t\t<orgName type="full">Volkswagen Foundation</orgName>\n\t\t\t\t</funder>\n\t\t\t\t<funder ref="#_4qPpgwW">\n\t\t\t\t\t<orgName type="full">Germany</orgName>\n\t\t\t\t</funder>\n\t\t\t\t<funder ref="#_4GCppcs">\n\t\t\t\t\t<orgName type="full">European R

In [19]:
response.content

b'<?xml version="1.0" encoding="UTF-8"?>\n<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" \nxmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" \nxsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"\n xmlns:xlink="http://www.w3.org/1999/xlink">\n\t<teiHeader xml:lang="en">\n\t\t<fileDesc>\n\t\t\t<titleStmt>\n\t\t\t\t<title level="a" type="main">From Responsive to Adaptive and Interactive Materials and Materials Systems: A Roadmap</title>\n\t\t\t\t<funder>\n\t\t\t\t\t<orgName type="full">Deutsche Forschungsgemeinschaft (DFG, German Research Foundation)</orgName>\n\t\t\t\t</funder>\n\t\t\t\t<funder ref="#_XX7zaKZ">\n\t\t\t\t\t<orgName type="full">Volkswagen Foundation</orgName>\n\t\t\t\t</funder>\n\t\t\t\t<funder ref="#_4qPpgwW">\n\t\t\t\t\t<orgName type="full">Germany</orgName>\n\t\t\t\t</funder>\n\t\t\t\t<funder ref="#_4GCppcs">\n\t\t\t\t\t<orgName type="full">European Research Counc
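
The response is TEI XML in the namespace `http://www.tei-c.org/ns/1.0`. A hedged sketch of how the title and the section headings could be pulled out with the standard library follows. The `<teiHeader>` part of the sample mirrors the output above; the `<text>/<body>/<div>` part is a hand-written stand-in for GROBID's usual body layout, and `extract_title_and_sections` is a hypothetical helper:

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

sample_tei = b"""<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title level="a" type="main">From Responsive to Adaptive and Interactive Materials and Materials Systems: A Roadmap</title>
      </titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div><head>Introduction</head><p>First paragraph.</p><p>Second paragraph.</p></div>
    </body>
  </text>
</TEI>"""


def extract_title_and_sections(tei_bytes):
    """Return the main title and a list of (heading, paragraphs) tuples."""
    root = ET.fromstring(tei_bytes)
    title_el = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    title = title_el.text if title_el is not None else None
    sections = []
    for div in root.iterfind(".//tei:body/tei:div", TEI_NS):
        head = div.find("tei:head", TEI_NS)
        paragraphs = ["".join(p.itertext()).strip() for p in div.findall("tei:p", TEI_NS)]
        sections.append((head.text if head is not None else None, paragraphs))
    return title, sections


title, sections = extract_title_and_sections(sample_tei)
print(title)
print(sections)
```

Section headings obtained this way could be a candidate for the `section` column of `documents_chunks`, which is `None` for the chunks shown in this notebook.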


In [20]:
response = (
    supabase.table("documents_chunks")
    .select("*")
    .eq("order_chunks", 0)
    .execute()
)

print(len(response.data))
print(response.data)

10
[{'document_id': '71773765bc2b1eecfeeb8b148fff8ea1', 'id': '10.1002adfm.202206958___J._Buttner__2022__Are_HalidePerovskit.pdf', 'order_chunks': 0, 'section': None, 'content': 'www.afm-journal.de\n© 2022 The Authors. Advanced Functional Materials published by Wiley-VCH GmbH2206958 (1 of 10)\nAre Halide-Perovskites Suitable Materials for Battery and \nSolar-Battery Applications–Fundamental Reconsiderations \non Solubility, Lithium Intercalation, and Photo-Corrosion\nJan Büttner, Taisiia Berestok, Stephan Burger, Manuel Schmitt, Michael Daub, \nHarald Hillebrecht,* Ingo Krossing,* and Anna Fischer*\nIn recent years the development of autonomous photo-rechargeable batteries \nhas received growing attention. Especially highly integrated photobatteries based \non multifunctional materials able to harvest sunlight and store charge carriers \nare the holy grail amongst such devices. Recently 2-(1-cyclohexenyl)ethyl ammo-', 'embedding': '[-0.050163306,0.033727996,-0.06675852,0.045137636,0.03

In [21]:
response = (
    supabase.table("documents_chunks")
    .select("*")
    .eq("id", "10.1002adma.201905111___A._Walther__2020__Viewpoint_From_Resp.pdf")
    .execute()
)

print(len(response.data))
#print(response.data)

84


# Outcomes

In [None]:
data = supabase.rpc('hello_world3').execute()
print("Hello World:", data)

In [None]:
from collections import namedtuple

# Define the namedtuple
Document = namedtuple('Document', ['page_content'])

# Create an instance of Document
#doc = Document("How can AI Regulation be defined?")
doc = Document("What are the key characteristics of adaptive material systems?")

# Pass the object inside a list
embedding = di.create_embeddings([doc], LMM='all-MiniLM-L6-v2', verbose=False)
print("embedding:\n", embedding[0].tolist())
print("length embedding:\n", len(embedding[0].tolist()))

embedding:
 [-0.08592277765274048, -0.01132022961974144, -0.032426901161670685, 0.041541215032339096, 0.019651342183351517, -0.027189332991838455, -0.0204591304063797, 0.028330886736512184, -0.07722602039575577, 0.047132257372140884, -0.020578159019351006, -0.010470523498952389, 0.039294734597206116, 0.0009716683416627347, -0.06338505446910858, -0.05512506514787674, -0.031784988939762115, 0.04423924535512924, -0.014012041501700878, 0.03053467720746994, 0.004520518705248833, 0.010360722430050373, -0.03709712624549866, -0.0015223867958411574, -0.03458567336201668, 0.0380367785692215, -0.09402027726173401, -0.0069236368872225285, 0.12692289054393768, -0.11996332556009293, 0.01763284206390381, 0.08533108979463577, 7.017129246378317e-05, 0.0056784385815262794, -0.018646612763404846, 0.05234852060675621, -0.020984064787626266, -0.012818826362490654, -0.02868984080851078, 0.05313372611999512, -0.09504636377096176, -0.004218959249556065, -0.017376162111759186, -0.05295220762491226, 0.021955002

In [37]:
match_threshold = 0.7  # Replace with your desired threshold

# Call the RPC function
data = supabase.rpc('match_documents_chunks', {
    'embedding': embedding[0].tolist(),
    'match_threshold': match_threshold,
    'match_count': 5
}).execute()

In [38]:
len(data.data)

0

In [32]:
data.data

[{'document_id': '57f47d48d6852ce271052cf70a16dd2f',
  'id': '10.1002adma.201905111___A._Walther__2020__Viewpoint_From_Resp.pdf',
  'order_chunks': 8,
  'section': None,
  'content': 'www.advmat.dewww.advancedsciencenews.com\n1905111 (2 of 10) © 2019 The Author. Published by WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim\nimpact forefront materials research and consumer technologies \nfor sensors, photonics, biomaterials, and other applications. [1] \nThose typically contain molecular or (bio)macromolecular \nsegments that respond to a change in the environment (e.g., \npH, temperature, light, and electric or magnetic field) with a \nchange, e.g., in the conformation, connectivity, polarity, solu-\nbility or absorptivity that translates on a higher level in changes \nin the self-assembly behavior or a material property such as \nswelling/deswelling, contraction/expansion, mechanical stiff -\nening or softening, conductivity, color, fluorescence, perme-',
  'embedding': '[-0.060567517,-0.13

In [33]:
print(data.data[0]["content"])
print(data.data[0]["embedding"])

www.advmat.dewww.advancedsciencenews.com
1905111 (2 of 10) © 2019 The Author. Published by WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim
impact forefront materials research and consumer technologies 
for sensors, photonics, biomaterials, and other applications. [1] 
Those typically contain molecular or (bio)macromolecular 
segments that respond to a change in the environment (e.g., 
pH, temperature, light, and electric or magnetic field) with a 
change, e.g., in the conformation, connectivity, polarity, solu-
bility or absorptivity that translates on a higher level in changes 
in the self-assembly behavior or a material property such as 
swelling/deswelling, contraction/expansion, mechanical stiff -
ening or softening, conductivity, color, fluorescence, perme-
[-0.060567517,-0.13711841,-0.01292634,0.012175324,0.06424536,-0.053644404,0.041096885,0.08937734,0.020206388,0.049778476,0.029439522,-0.012786108,-0.06605415,0.09734147,0.05025657,0.04376865,-0.056487873,0.048283856,-0.028870923,0.0

In [34]:
import numpy as np
numeric_vector = np.fromstring(data.data[0]["embedding"].strip('[]'), sep=',')
len(numeric_vector.tolist())

384
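
The `embedding` column is returned as a string such as `'[-0.06,-0.13,...]'`. Besides `np.fromstring`, the standard library's `json` module can parse it, assuming the value is always a bracketed, comma-separated list of numbers; `parse_pgvector` is a hypothetical helper name:

```python
import json


def parse_pgvector(text):
    """Turn a vector literal such as '[0.1,-0.2]' into a list of floats."""
    return [float(value) for value in json.loads(text)]


vector = parse_pgvector("[-0.060567517,-0.13711841,-0.01292634]")
print(len(vector), vector)
```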

In [35]:
from sentence_transformers import SentenceTransformer

# Load the pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

similarities = model.similarity(embedding[0].tolist(), numeric_vector.tolist())
print(similarities)

tensor([[0.2385]])
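
The value 0.2385 lies well below the `match_threshold` of 0.7 used in the RPC call, which may be why that call returned no rows. The SQL behind `match_documents_chunks` is not shown here, so the following is only a sketch of threshold-based matching under the assumption that cosine similarity is compared against the threshold; the vectors are toy data:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def filter_matches(query, candidates, match_threshold, match_count):
    """Keep candidates above the threshold, best first, at most match_count of them."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in candidates.items()]
    kept = [item for item in scored if item[0] > match_threshold]
    kept.sort(reverse=True)
    return kept[:match_count]


query = [1.0, 0.0]
candidates = {"close": [0.9, 0.1], "far": [0.2, 1.0]}
strict = filter_matches(query, candidates, match_threshold=0.7, match_count=5)
loose = filter_matches(query, candidates, match_threshold=0.1, match_count=5)
print([name for _, name in strict])
print([name for _, name in loose])
```

Lowering the threshold admits weaker matches, so it is worth inspecting a few similarity values before fixing it.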


In [None]:
ERROR

**load_pdfs_by_filename**: Loads and stores PDF pages by filename:

In [None]:



pdf_pages = di.load_pdfs_by_filename(path_to_PDFs, verbose=False)

# Optional: Print the loaded pages by filename
for filename, pages in pdf_pages.items():
    print(f"\nPDF: {filename}")
    print(f"Total Pages: {len(pages)}")
    # print(pages[0])

In [None]:
# pdf_pages is the dictionary containing the pages of each PDF
first_key = list(pdf_pages.keys())[0]  # Get the first PDF filename
print("first PDF of folder:", first_key)
first_pdf_pages = pdf_pages[first_key]  # Get the pages of the first PDF


# Print the first page
print("First Page:", first_pdf_pages[0], "\n\n")

**split_pdf_pages_into_chunks**: Splits and stores PDF pages into chunks by filename:

On average, one token is roughly 4 characters in English text. So, each chunk of 800 characters roughly corresponds to 200 tokens.


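**The underlying model of `all-MiniLM-L6-v2` accepts at most 512 tokens; by default, sentence-transformers truncates input to this model after 256 word pieces, so chunks of roughly 200 tokens fit.**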
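
To make the numbers concrete, here is a sketch of a fixed-window splitter with overlap plus the 4-characters-per-token estimate. It is a simplified stand-in, not the recursive splitter used by `di.split_pdf_pages_into_chunks`, which also tries to break at separators; `split_with_overlap` and `estimate_tokens` are hypothetical names:

```python
def split_with_overlap(text, chunk_size=800, chunk_overlap=150):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def estimate_tokens(chunk, chars_per_token=4):
    """Rough token count based on the 4-characters-per-token rule of thumb."""
    return len(chunk) // chars_per_token


text = "".join(str(i % 10) for i in range(2000))  # 2000 characters of dummy text
chunks = split_with_overlap(text)
print([len(chunk) for chunk in chunks])
print(estimate_tokens(chunks[0]))
```

With these settings each new chunk starts 650 characters after the previous one, so neighbouring chunks share 150 characters of context.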

In [None]:
pdf_chunks = di.split_pdf_pages_into_chunks(pdf_pages, chunk_size=800, chunk_overlap=150, verbose=False)

# Optional: Print a summary of chunks created per PDF
for filename, chunks in pdf_chunks.items():
    print(f"\nPDF: {filename}")
    print(f"Total Chunks: {len(chunks)}")

In [None]:
# Assuming pdf_chunks is the dictionary containing chunks for each PDF
first_key = list(pdf_chunks.keys())[0]  # Get the first PDF filename
print("first PDF of folder:", first_key)
first_pdf_chunks = pdf_chunks[first_key]  # Get the chunks for the first PDF

# Access the first and second chunks
first_chunk = first_pdf_chunks[0]
second_chunk = first_pdf_chunks[1]

# Print the first two chunks
print("\nFirst Chunk:", first_chunk, "\n\n")
print("Second Chunk:", second_chunk)

In [None]:
print("page content:", first_chunk.page_content, "\n\n")
print("metadata:", first_chunk.metadata)

In [None]:
print(second_chunk.page_content)

In [None]:
len(first_pdf_chunks)
first_pdf_chunks[0].page_content

In [None]:
pdf_chunks.keys()

In [None]:
list(pdf_chunks.keys())

In [None]:



# first_pdf_chunks is a list of chunk objects; collect the text of each chunk
tmp_chunks = [chunk.page_content for chunk in first_pdf_chunks]

# tmp_chunks now contains the text of every chunk of the first PDF


In [None]:
len(tmp_chunks)

# Data Storage: Text chunks are converted into vector embeddings and stored in a vector database (Vector DB) next to their respective text chunks.

In [None]:
from sentence_transformers import SentenceTransformer
import pandas as pd


# Define sentences
sentences = tmp_chunks

# Load the pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Extract features
features = model.encode(sentences)

# Print the features as a pandas dataframe
pd.DataFrame(features)

In [None]:
len(features)
features[0]

In [None]:
similarities = model.similarity(features, features)
print(similarities)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics.pairwise import cosine_similarity
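from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform
from sentence_transformers import SentenceTransformer

# Initialize the model and encode the corpus
embedder = SentenceTransformer("all-MiniLM-L6-v2")
corpus = sentences
corpus_embeddings = embedder.encode(corpus)

# Calculate similarity matrix using cosine similarity
similarity_matrix = cosine_similarity(corpus_embeddings)

# Convert similarity matrix to a distance matrix (clip rounding noise, force a zero diagonal)
distance_matrix = np.clip(1 - similarity_matrix, 0, None)
np.fill_diagonal(distance_matrix, 0)

# Perform hierarchical clustering; linkage() expects a condensed distance matrix,
# and 'average' is used because 'ward' is only defined for Euclidean distances
linked = linkage(squareform(distance_matrix, checks=False), 'average')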

# Plot the dendrogram
plt.figure(figsize=(10, 8))
dendrogram(linked, labels=corpus, orientation='top', distance_sort='descending', show_leaf_counts=True)
plt.title("Dendrogram of Sentence Clustering")
plt.xlabel("Sentences")
plt.ylabel("Distance")
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()



In [None]:
import os
import json
#from dotenv import load_dotenv
from supabase import create_client, Client
from faker import Faker
import faker_commerce


def add_entries_to_vendor_table(supabase, vendor_count):
    fake = Faker()
    foreign_key_list = []
    fake.add_provider(faker_commerce.Provider)
    main_list = []
    for i in range(vendor_count):
        value = {'vendor_name': fake.company(), 'total_employees': fake.random_int(40, 169),
                 'vendor_location': fake.country()}

        main_list.append(value)
    data = supabase.table('vendor2').insert(main_list).execute()
    data_json = json.loads(data.json())
    data_entries = data_json['data']
    for i in range(len(data_entries)):
        foreign_key_list.append(int(data_entries[i]['vendor_id']))
    return foreign_key_list


def add_entries_to_product_table(supabase, vendor_id):
    fake = Faker()
    fake.add_provider(faker_commerce.Provider)
    main_list = []
    iterator = fake.random_int(1, 15)
    for i in range(iterator):
        value = {'vendor_id': vendor_id, 'product_name': fake.ecommerce_name(),
                 'inventory_count': fake.random_int(1, 100), 'price': fake.random_int(45, 100)}
        main_list.append(value)
    data = supabase.table('Product').insert(main_list).execute()


def main():
    vendor_count = 10
    supabase: Client = create_client(key.SUPABASE_URL, key.SUPABASE_KEY)
    fk_list = add_entries_to_vendor_table(supabase, vendor_count)
    #for i in range(len(fk_list)):
    #    add_entries_to_product_table(supabase, fk_list[i])


main()


In [None]:
from supabase import create_client, Client

supabase: Client = create_client(key.SUPABASE_URL, key.SUPABASE_KEY)

data = supabase.rpc('hello_world').execute()
print("Hello World:", data)


data = supabase.rpc('get_vendors').gt('total_employees', 160).execute()
print("Vendors:", data)
vars(data)
data.data[0]