# Database Demo

Sample functionality for creating tables, inserting data and running similarity search with OgbujiPT.

Notes:
- Run `pip install jupyter` if Jupyter is not already installed.

This notebook connects to a PostgreSQL database named `PGv` at `sofola:5432`, using the username `oori` and password `example`. If your setup differs, change the connection settings in the first cell.
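If you would rather not edit the notebook for each environment, one option is to read the connection settings from environment variables. The sketch below is only an illustration: the helper `load_db_settings` and the variable names `PG_DB_NAME`, `PG_HOST`, `PG_PORT`, `PG_USER` and `PG_PASSWORD` are assumptions, not part of OgbujiPT, and the defaults simply mirror the values in the first cell.

```python
import os

def load_db_settings(env=None):
    """Return connection settings, falling back to this notebook's defaults.

    The PG_* variable names are hypothetical; rename them to suit your setup.
    """
    env = os.environ if env is None else env
    return {
        'db_name': env.get('PG_DB_NAME', 'PGv'),
        'host': env.get('PG_HOST', 'sofola'),
        'port': int(env.get('PG_PORT', '5432')),  # Environment values are strings
        'user': env.get('PG_USER', 'oori'),
        'password': env.get('PG_PASSWORD', 'example'),
    }

settings = load_db_settings()
DB_NAME = settings['db_name']
HOST = settings['host']
PORT = settings['port']
USER = settings['user']
PASSWORD = settings['password']
```

With no variables set, this yields the same values as the first cell, so the rest of the notebook runs unchanged.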

## Initial setup and Imports

In [100]:
DB_NAME = 'PGv'
HOST = 'sofola'
PORT = 5432
USER = 'oori'
PASSWORD = 'example'

In [101]:
from ogbujipt.embedding_helper import PGvectorConnection

from sentence_transformers import SentenceTransformer

e_model = SentenceTransformer('all-MiniLM-L6-v2')  # Load the embedding model

pacer_copypasta = [  # Demo data
    'The FitnessGram™ Pacer Test is a multistage aerobic capacity test that progressively gets more difficult as it continues.', 
    'The 20 meter pacer test will begin in 30 seconds. Line up at the start.', 
    'The running speed starts slowly, but gets faster each minute after you hear this signal.', 
    '[beep] A single lap should be completed each time you hear this sound.', 
    '[ding] Remember to run in a straight line, and run as long as possible.', 
    'The second time you fail to complete a lap before the sound, your test is over.', 
    'The test will begin on the word start. On your mark, get ready, start.'
]

## Connecting to the database

In [102]:
print("Connecting to database...")
vDB = await PGvectorConnection.create(
    embedding_model=e_model,
    db_name=DB_NAME,
    host=HOST,
    port=int(PORT),
    user=USER,
    password=PASSWORD
)
print("Connected to database.")

Connecting to database...
Connected to database.


## Create Tables

In [103]:
# Ensure that the vector extension is installed
await vDB.conn.execute('''CREATE EXTENSION IF NOT EXISTS vector;''')
print("PGvector extension created and loaded.")

# Drop the table if one is found
await vDB.conn.execute('''DROP TABLE IF EXISTS embeddings;''')
print("Table dropped.")

# Create a new table
await vDB.create_doc_table(table_name='embeddings')
print("Table created.")

PGvector extension created and loaded.
Table dropped.
Table created.


## Inserting Data

In [104]:
for index, text in enumerate(pacer_copypasta):   # For each line in the copypasta
    await vDB.insert_doc_table(                  # Insert the line into the table
        table_name='embeddings',                 # The name of the table being inserted into
        content=text,                            # The text to be embedded
        permission='public',                     # Permission metadata for access control
        title=f'Pacer Copypasta line {index}',   # Title metadata
        page_numbers=[1, 2, 3],                  # Page number metadata
        tags=['fitness', 'pacer', 'copypasta'],  # Tag metadata
    )

## Similarity search

In [105]:
k = 3  # Setting number of rows to return when searching

### Searching the table with a perfect match

In [106]:
search_string = '[beep] A single lap should be completed each time you hear this sound.'
print(f'Semantic Searching data using search string: {search_string}')

sim_search = await vDB.search_doc_table(table_name='embeddings', query_string=search_string, limit=k)

print(f'RAW RETURN: {sim_search}')
print()
print(f'            RETURNED Title: {sim_search[0]["title"]}')
print(f'          RETURNED Content: {sim_search[0]["content"]}')
print(f'RETURNED Cosine Similarity: {sim_search[0]["cosine_similarity"]:.2f}')

Semantic Searching data using search string: [beep] A single lap should be completed each time you hear this sound.
RAW RETURN: [<Record cosine_similarity=1.0 title='Pacer Copypasta line 3' content='[beep] A single lap should be completed each time you hear this sound.'>, <Record cosine_similarity=0.685540756152295 title='Pacer Copypasta line 5' content='The second time you fail to complete a lap before the sound, your test is over.'>, <Record cosine_similarity=0.36591741151356405 title='Pacer Copypasta line 2' content='The running speed starts slowly, but gets faster each minute after you hear this signal.'>]
            RETURNED Title: Pacer Copypasta line 3
          RETURNED Content: [beep] A single lap should be completed each time you hear this sound.
RETURNED Cosine Similarity: 1.00
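
The score of 1.00 for the exact match follows from how cosine similarity is defined: the dot product of two vectors divided by the product of their lengths, so a vector compared with itself scores 1. The plain-Python sketch below illustrates that standard formula on toy vectors. It is not how OgbujiPT or PGvector computes the value internally, and the function name `cosine_similarity` here is just for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same = cosine_similarity([3.0, 4.0], [3.0, 4.0])       # Identical direction
unrelated = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # Orthogonal vectors

print(f'Identical vectors:  {same:.2f}')       # 1.00
print(f'Orthogonal vectors: {unrelated:.2f}')  # 0.00
```

Real embeddings from `all-MiniLM-L6-v2` have many more dimensions, but the same reasoning applies: the closer two texts are in meaning, the closer the score is to 1.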


### Searching the table with a partial match

In [107]:
search_string = 'Straight'
print(f'Semantic Searching data using search string: {search_string}')

sim_search = await vDB.search_doc_table(table_name='embeddings', query_string=search_string, limit=k)

print(f'RAW RETURN: {sim_search}')
print()
print(f'            RETURNED Title: {sim_search[0]["title"]}')
print(f'          RETURNED Content: {sim_search[0]["content"]}')
print(f'RETURNED Cosine Similarity: {sim_search[0]["cosine_similarity"]:.2f}')

Semantic Searching data using search string: Straight
RAW RETURN: [<Record cosine_similarity=0.28423854269729953 title='Pacer Copypasta line 4' content='[ding] Remember to run in a straight line, and run as long as possible.'>, <Record cosine_similarity=0.10402820694362547 title='Pacer Copypasta line 6' content='The test will begin on the word start. On your mark, get ready, start.'>, <Record cosine_similarity=0.07991296083513344 title='Pacer Copypasta line 2' content='The running speed starts slowly, but gets faster each minute after you hear this signal.'>]
            RETURNED Title: Pacer Copypasta line 4
          RETURNED Content: [ding] Remember to run in a straight line, and run as long as possible.
RETURNED Cosine Similarity: 0.28
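
Because the search always returns the top `k` rows, weak matches such as the 0.10 and 0.08 results above come back alongside the relevant one. A possible follow-up step is to drop rows below a minimum score. The sketch below assumes a hypothetical helper `filter_by_similarity` and an arbitrary cutoff of 0.25; neither comes from OgbujiPT. It uses plain dictionaries holding the scores from the output above, and relies only on key access such as `row['cosine_similarity']`, which the returned records also support in this notebook.

```python
def filter_by_similarity(rows, min_similarity):
    """Keep only rows whose cosine similarity meets the cutoff (hypothetical helper)."""
    return [row for row in rows if row['cosine_similarity'] >= min_similarity]

# Scores and titles copied from the partial-match output above
rows = [
    {'cosine_similarity': 0.28423854269729953, 'title': 'Pacer Copypasta line 4'},
    {'cosine_similarity': 0.10402820694362547, 'title': 'Pacer Copypasta line 6'},
    {'cosine_similarity': 0.07991296083513344, 'title': 'Pacer Copypasta line 2'},
]

kept = filter_by_similarity(rows, min_similarity=0.25)  # 0.25 is an arbitrary example cutoff
print([row['title'] for row in kept])  # ['Pacer Copypasta line 4']
```

A suitable cutoff depends on the embedding model and the data, so treat the value here as a placeholder to tune.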
