# Database Demo

This notebook demonstrates creating tables, inserting data, and running similarity searches with OgbujiPT's `ogbujipt.embedding.pgvector` module.

Notes:
- Run `pip install jupyter` if you cannot launch the notebook.

This notebook attempts to connect to a database named `PGv` at `sofola:5432`, with the username `oori` and the password `example`. If your setup differs, change the connection constants (`DB_NAME`, `HOST`, `PORT`, `USER`, `PASSWORD`) in the first code cell.
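If you would rather not edit the cell, one option is to read the settings from environment variables and fall back to the defaults above. This is only a sketch: the variable names (`PG_DB_NAME` and so on) and the `load_pg_settings` helper are hypothetical, not something OgbujiPT defines.

```python
import os

def load_pg_settings(env=os.environ):
    '''Read connection settings from env, falling back to this notebook's defaults.'''
    # The PG_* variable names are illustrative assumptions
    return {
        'db_name': env.get('PG_DB_NAME', 'PGv'),
        'host': env.get('PG_HOST', 'sofola'),
        'port': int(env.get('PG_PORT', '5432')),  # Env values are strings, so convert
        'user': env.get('PG_USER', 'oori'),
        'password': env.get('PG_PASSWORD', 'example'),
    }

settings = load_pg_settings()
# Show everything except the password
print({key: value for key, value in settings.items() if key != 'password'})
```

The keys match the keyword names passed to `from_conn_params` in the cells below, so the dict could be unpacked alongside `embedding_model` and `table_name`.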

### Initial setup and Imports

In [434]:
import uuid
from pprint import pprint

from sentence_transformers import SentenceTransformer

from ogbujipt.embedding.pgvector import docDB, chatlogDB

DB_NAME = 'PGv'
HOST = 'sofola'
PORT = 5432
USER = 'oori'
PASSWORD = 'example'

e_model = SentenceTransformer('all-MiniLM-L6-v2')  # Load the embedding model

# Document Embedding

In [435]:
pacer_copypasta = [  # Demo document
    'The FitnessGram™ Pacer Test is a multistage aerobic capacity test that progressively gets more difficult as it continues.', 
    'The 20 meter pacer test will begin in 30 seconds. Line up at the start.', 
    'The running speed starts slowly, but gets faster each minute after you hear this signal.', 
    '[beep] A single lap should be completed each time you hear this sound.', 
    '[ding] Remember to run in a straight line, and run as long as possible.', 
    'The second time you fail to complete a lap before the sound, your test is over.', 
    'The test will begin on the word start. On your mark, get ready, start.'
]

### Connecting to the database

In [436]:
pacerDB = await docDB.from_conn_params(
    embedding_model=e_model, 
    table_name='pacer',
    user=USER,
    password=PASSWORD,
    db_name=DB_NAME,
    host=HOST,
    port=int(PORT)
)

### Create Tables

In [437]:
await pacerDB.drop_table()        # Drop the table if one is found

await pacerDB.create_doc_table()  # Create a new table

### Inserting the document

In [438]:
for index, text in enumerate(pacer_copypasta):   # For each line in the copypasta
    await pacerDB.insert_doc(                    # Insert the line into the table
        content=text,                            # The text to be embedded
        permission='public',                     # Permission metadata for access control
        title=f'Pacer Copypasta line {index}',   # Title metadata
        page_numbers=[1, 2, 3],                  # Page number metadata
        tags=['fitness', 'pacer', 'copypasta'],  # Tag metadata
    )
print(f'Inserted {len(pacer_copypasta)} document chunks into the table')

Inserted 7 document chunks into the table


## Document similarity search

### Searching the document with a perfect match

In [439]:
search_string = '[beep] A single lap should be completed each time you hear this sound.'
print(f'Semantic Searching data using search string:\n"{search_string}"')

sim_search = await pacerDB.search_doc_table(
    query_string=search_string,  # string to search by
    limit=3                      # Number of results returned
)

Semantic Searching data using search string:
"[beep] A single lap should be completed each time you hear this sound."


In [440]:
print(f'RETURNED TITLE:\n"{sim_search[0]["title"]}"')                          # Print the title of the first result
print(f'RETURNED CONTENT:\n"{sim_search[0]["content"]}"')                        # Print the content of the first result
print(f'RETURNED COSINE SIMILARITY:\n{sim_search[0]["cosine_similarity"]:.2f}')  # Print the cosine similarity of the first result

RETURNED TITLE:
"Pacer Copypasta line 3"
RETURNED CONTENT:
"[beep] A single lap should be completed each time you hear this sound."
RETURNED COSINE SIMILARITY:
0.00


In [441]:
print('RAW RETURN:')
pprint(sim_search)

RAW RETURN:
[<Record cosine_similarity=0.0 title='Pacer Copypasta line 3' content='[beep] A single lap should be completed each time you hear this sound.' permission='public' page_numbers=[1, 2, 3] tags=['fitness', 'pacer', 'copypasta']>,
 <Record cosine_similarity=0.31445924384770496 title='Pacer Copypasta line 5' content='The second time you fail to complete a lap before the sound, your test is over.' permission='public' page_numbers=[1, 2, 3] tags=['fitness', 'pacer', 'copypasta']>,
 <Record cosine_similarity=0.634082588486436 title='Pacer Copypasta line 2' content='The running speed starts slowly, but gets faster each minute after you hear this signal.' permission='public' page_numbers=[1, 2, 3] tags=['fitness', 'pacer', 'copypasta']>]
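Note that the exact match above scores 0.00 and the weaker matches score higher. That ordering is what a cosine distance (1 minus cosine similarity) would give, which is also what pgvector's `<=>` operator computes. Whether the `cosine_similarity` field is produced that way is an assumption here, not something this notebook confirms. A minimal pure-Python sketch of the distance, using made-up vectors:

```python
import math

def cosine_distance(a, b):
    '''Return 1 - cos(angle between a and b); values near 0 mean same direction.'''
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

v = [1.0, 2.0, 3.0]
print(cosine_distance(v, v))                    # Identical vectors: approximately 0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # Orthogonal vectors: 1.0
```

Read this way, lower scores in the search output mean closer matches.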


### Searching the document with a partial match

In [442]:
search_string = 'straight'
print(f'Semantic Searching data using search string:\n"{search_string}"')

sim_search = await pacerDB.search_doc_table(
    query_string=search_string,  # string to search by
    limit=3                      # Number of results returned
)

Semantic Searching data using search string:
"straight"


In [443]:
print(f'RETURNED TITLE:\n"{sim_search[0]["title"]}"')                          # Print the title of the first result
print(f'RETURNED CONTENT:\n"{sim_search[0]["content"]}"')                        # Print the content of the first result
print(f'RETURNED COSINE SIMILARITY:\n{sim_search[0]["cosine_similarity"]:.2f}')  # Print the cosine similarity of the first result

RETURNED TITLE:
"Pacer Copypasta line 4"
RETURNED CONTENT:
"[ding] Remember to run in a straight line, and run as long as possible."
RETURNED COSINE SIMILARITY:
0.72


In [444]:
print('RAW RETURN:')
pprint(sim_search)

RAW RETURN:
[<Record cosine_similarity=0.7157614573027005 title='Pacer Copypasta line 4' content='[ding] Remember to run in a straight line, and run as long as possible.' permission='public' page_numbers=[1, 2, 3] tags=['fitness', 'pacer', 'copypasta']>,
 <Record cosine_similarity=0.8959717930563745 title='Pacer Copypasta line 6' content='The test will begin on the word start. On your mark, get ready, start.' permission='public' page_numbers=[1, 2, 3] tags=['fitness', 'pacer', 'copypasta']>,
 <Record cosine_similarity=0.9200870391648666 title='Pacer Copypasta line 2' content='The running speed starts slowly, but gets faster each minute after you hear this signal.' permission='public' page_numbers=[1, 2, 3] tags=['fitness', 'pacer', 'copypasta']>]
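Because `limit` only caps the number of rows, weak matches such as the 0.92 result above are returned too. One way to drop them is to filter on the score after the search. The helper name `keep_close_matches` and the 0.8 cutoff are illustrative assumptions, and the sketch treats lower scores as closer matches, as the outputs above suggest. Plain dicts stand in for the returned records.

```python
def keep_close_matches(results, max_score=0.8):
    '''Keep rows whose score is at or below max_score (lower = closer here).'''
    return [row for row in results if row['cosine_similarity'] <= max_score]

# Stand-in rows shaped like the search output above (scores rounded)
rows = [
    {'title': 'Pacer Copypasta line 4', 'cosine_similarity': 0.72},
    {'title': 'Pacer Copypasta line 6', 'cosine_similarity': 0.90},
    {'title': 'Pacer Copypasta line 2', 'cosine_similarity': 0.92},
]
print([row['title'] for row in keep_close_matches(rows)])
# prints ['Pacer Copypasta line 4']
```

A sensible cutoff depends on the embedding model and the data, so it is worth tuning against real queries.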


# Chatlog Embedding

In [445]:
abbott_and_costello = [  # Demo chatlog
    {'role': 'system', 'content': 'The user is considering becoming a ballplayer. The assistant wants to make sure they knows what they\'re getting into.'},
    {'role': 'assistant', 'content': 'Strange as it may seem, they give ball players nowadays very peculiar names.'},
    {'role': 'user', 'content': 'Funny names?'},
    {'role': 'assistant', 'content': 'Nicknames, nicknames. Now, on the St. Louis team we have Who\'s on first, What\'s on second, I Don\'t Know is on third--'},
    {'role': 'user', 'content': 'That\'s what I want to find out. I want you to tell me the names of the fellows on the St. Louis team.'},
    {'role': 'assistant', 'content': "I'm telling you. Who is on first. What's on second. I Don't Know's on third--"},
    {'role': 'user', 'content': "You know the fellows' names?"},
    {'role': 'assistant', 'content': 'Yes.'},
    {'role': 'user', 'content': "Well, then who's playing first?"},
    {'role': 'assistant', 'content': 'Yes.'},
    {'role': 'user', 'content': "I mean the fellow's name on first base."},
    {'role': 'assistant', 'content': 'Who.'},
    {'role': 'user', 'content': "The fellow playin' first base."},
    {'role': 'assistant', 'content': 'Who.'},
    {'role': 'user', 'content': "The guy on first base."},
    {'role': 'assistant', 'content': 'Who is on first.'},
    {'role': 'user', 'content': "Well, what are you askin' me for?"},
    {'role': 'assistant', 'content': "I'm not asking you--I'm telling you. Who is on first."},
    {'role': 'user', 'content': "I'm asking you--who's on first?"},
    {'role': 'assistant', 'content': 'That\'s the man\'s name.'},
    {'role': 'user', 'content': "That's who's name?"},
    {'role': 'assistant', 'content': 'Yes.'},
]

### Connecting to the database

In [446]:
baseballDB = await chatlogDB.from_conn_params(
    embedding_model=e_model, 
    table_name='baseball',
    user=USER,
    password=PASSWORD,
    db_name=DB_NAME,
    host=HOST,
    port=int(PORT)
)

### Create Tables

In [447]:
await baseballDB.drop_table()            # Drop the table if one is found

await baseballDB.create_chatlog_table()  # Create a new table

### Inserting the chatlog

In [448]:
history_key = uuid.uuid4()            # Generate a key for the chatlog
for line in abbott_and_costello:      # For each line of dialog in the script
    await baseballDB.insert_message(  # Insert the message into the table
        history_key=history_key,      # The key for the chatlog
        role=line['role'],
        content=line['content'],
        metadata={'genre': 'comedy', 'year': 1938}
    )
print(f'Inserted {len(abbott_and_costello)} lines of dialog into the table with history key "{history_key}".')

Inserted 22 lines of dialog into the table with history key "3dce2b2f-48ec-4bba-a069-373c19327308".


## Chatlog similarity search

### Searching the chatlog with a keyword

In [449]:
search_string = 'nickname'
print(f'Semantic Searching data using search string: "{search_string}"')

sim_search = await baseballDB.search_chatlog(
    history_key=history_key,
    query_string=search_string,
    limit=3                      # Number of results returned
)

Semantic Searching data using search string: "nickname"


In [450]:
print(f'RETURNED INDEX:\n{sim_search[0]["index"]}')                             # Print the index of the first result
print(f'RETURNED MESSAGE:\n{sim_search[0]["role"]}: {sim_search[0]["content"]}')  # Print the message of the first result
print(f'RETURNED COSINE SIMILARITY:\n{sim_search[0]["cosine_similarity"]:.2f}')   # Print the cosine similarity of the first result

RETURNED INDEX:
4
RETURNED MESSAGE:
assistant: Nicknames, nicknames. Now, on the St. Louis team we have Who's on first, What's on second, I Don't Know is on third--
RETURNED COSINE SIMILARITY:
0.49


In [451]:
print('RAW RETURN:')
pprint(sim_search)

RAW RETURN:
[{'content': "Nicknames, nicknames. Now, on the St. Louis team we have Who's "
             "on first, What's on second, I Don't Know is on third--",
  'cosine_similarity': 0.4910637432970839,
  'index': 4,
  'metadata': {'genre': 'comedy', 'year': 1938},
  'role': 'assistant'},
 {'content': 'Funny names?',
  'cosine_similarity': 0.561316881258301,
  'index': 3,
  'metadata': {'genre': 'comedy', 'year': 1938},
  'role': 'user'},
 {'content': "That's the man's name.",
  'cosine_similarity': 0.5729953094121991,
  'index': 20,
  'metadata': {'genre': 'comedy', 'year': 1938},
  'role': 'assistant'}]
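Search results come back ordered by score, not by position in the conversation. If the hits are going to be shown as context, sorting them by the `index` field restores the original order. A small sketch with plain dicts standing in for the rows above; `in_chat_order` is a hypothetical helper name.

```python
def in_chat_order(results):
    '''Return search hits sorted by their position in the chatlog.'''
    return sorted(results, key=lambda row: row['index'])

# Stand-in rows carrying the same index values as the output above
hits = [
    {'index': 4, 'role': 'assistant'},
    {'index': 3, 'role': 'user'},
    {'index': 20, 'role': 'assistant'},
]
print([row['index'] for row in in_chat_order(hits)])
# prints [3, 4, 20]
```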


### Searching the chatlog with a partial match

In [452]:
search_string = 'nickname'
print(f'Semantic Searching data using search string: "{search_string}"')

sim_search = await baseballDB.search_chatlog(
    history_key=history_key,
    query_string=search_string,
    limit=3                      # Number of results returned
)

Semantic Searching data using search string: "nickname"


In [453]:
print(f'RETURNED INDEX:\n{sim_search[0]["index"]}')                             # Print the index of the first result
print(f'RETURNED MESSAGE:\n{sim_search[0]["role"]}: {sim_search[0]["content"]}')  # Print the message of the first result
print(f'RETURNED COSINE SIMILARITY:\n{sim_search[0]["cosine_similarity"]:.2f}')   # Print the cosine similarity of the first result

RETURNED INDEX:
4
RETURNED MESSAGE:
assistant: Nicknames, nicknames. Now, on the St. Louis team we have Who's on first, What's on second, I Don't Know is on third--
RETURNED COSINE SIMILARITY:
0.49


In [454]:
print('RAW RETURN:')
pprint(sim_search)

RAW RETURN:
[{'content': "Nicknames, nicknames. Now, on the St. Louis team we have Who's "
             "on first, What's on second, I Don't Know is on third--",
  'cosine_similarity': 0.4910637432970839,
  'index': 4,
  'metadata': {'genre': 'comedy', 'year': 1938},
  'role': 'assistant'},
 {'content': 'Funny names?',
  'cosine_similarity': 0.561316881258301,
  'index': 3,
  'metadata': {'genre': 'comedy', 'year': 1938},
  'role': 'user'},
 {'content': "That's the man's name.",
  'cosine_similarity': 0.5729953094121991,
  'index': 20,
  'metadata': {'genre': 'comedy', 'year': 1938},
  'role': 'assistant'}]


### Retrieving the entire chatlog

In [455]:
print(f'Retrieving chatlog "{history_key}" from database')
script_from_PG = await baseballDB.get_chatlog(history_key=history_key)

Retrieving chatlog "3dce2b2f-48ec-4bba-a069-373c19327308" from database
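The cells below read each returned row through its `role` and `content` keys, which is the same shape that chat-style LLM APIs commonly expect for message lists. Assuming that shape, a retrieved chatlog could be reduced to a plain message list like this. `to_messages` is a hypothetical helper, and the sample rows are stand-ins, not real query results.

```python
def to_messages(chatlog):
    '''Keep only the role and content of each row, preserving order.'''
    return [{'role': row['role'], 'content': row['content']} for row in chatlog]

# Stand-in rows with the same keys as the retrieved chatlog
sample = [
    {'index': 1, 'role': 'assistant', 'content': 'Who is on first.',
     'metadata': {'genre': 'comedy', 'year': 1938}},
    {'index': 2, 'role': 'user', 'content': "Well, what are you askin' me for?",
     'metadata': {'genre': 'comedy', 'year': 1938}},
]
print(to_messages(sample))
```

Dropping `index` and `metadata` keeps the list small. Whether that is desirable depends on what consumes it.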


In [456]:
print('RETURNED CHATLOG:')
for message in script_from_PG:
    print(f'{message["role"]}: {message["content"]}')

RETURNED CHATLOG:
system: The user is considering becoming a ballplayer. The assistant wants to make sure they knows what they're getting into.
assistant: Strange as it may seem, they give ball players nowadays very peculiar names.
user: Funny names?
assistant: Nicknames, nicknames. Now, on the St. Louis team we have Who's on first, What's on second, I Don't Know is on third--
user: That's what I want to find out. I want you to tell me the names of the fellows on the St. Louis team.
assistant: I'm telling you. Who is on first. What's on second. I Don't Know's on third--
user: You know the fellows' names?
assistant: Yes.
user: Well, then who's playing first?
assistant: Yes.
user: I mean the fellow's name on first base.
assistant: Who.
user: The fellow playin' first base.
assistant: Who.
user: The guy on first base.
assistant: Who is on first.
user: Well, what are you askin' me for?
assistant: I'm not asking you--I'm telling you. Who is on first.
user: I'm asking you--who's on first?
ass

In [457]:
print('RAW RETURN:')
pprint(script_from_PG)

RAW RETURN:
[{'content': 'The user is considering becoming a ballplayer. The assistant '
             "wants to make sure they knows what they're getting into.",
  'index': 1,
  'metadata': {'genre': 'comedy', 'year': 1938},
  'role': 'system'},
 {'content': 'Strange as it may seem, they give ball players nowadays very '
             'peculiar names.',
  'index': 2,
  'metadata': {'genre': 'comedy', 'year': 1938},
  'role': 'assistant'},
 {'content': 'Funny names?',
  'index': 3,
  'metadata': {'genre': 'comedy', 'year': 1938},
  'role': 'user'},
 {'content': "Nicknames, nicknames. Now, on the St. Louis team we have Who's "
             "on first, What's on second, I Don't Know is on third--",
  'index': 4,
  'metadata': {'genre': 'comedy', 'year': 1938},
  'role': 'assistant'},
 {'content': "That's what I want to find out. I want you to tell me the names "
             'of the fellows on the St. Louis team.',
  'index': 5,
  'metadata': {'genre': 'comedy', 'year': 1938},
  'role': 'us