# Saving Schema Data to a Vector Database

My initial thought was to build a model that checks for similarity between the prompt and the schema information. After some research, though, it looks like this can be simplified and sped up by using a vector database through langchain. We can then query the tables (with metadata) based on the question and return the documents that are most closely related. That way, we can try to keep everything bundled in langchain. Woohoo!

Big shoutout to this great blog post, which provided some of the framework: https://canvasapp.com/blog/text-to-sql-in-production

In [1]:
import pandas as pd
import json
import os

from dotenv import load_dotenv

from langchain.docstore.document import Document
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings

## Create Functions

In [2]:
def get_json(path):
    """Return json file from specified filepath"""
    with open(path, "r") as f:
        json_file = json.load(f)
    
    return json_file

In [3]:
def prep_chroma_documents(json_path):
    """Take json file and work through it to prepare for load to Chroma.
    Instead of a list comprehension, establish a blank list and loop through the json.
    Each item is then saved to the list using the langchain docstore.document -> Document module.

    This works specifically with the content and metadata we want for this project"""
    docs = []
    for item in get_json(json_path):
        doc = Document(
            page_content=f"Schema Table: {item['schema_split']} {item['table_split']}",
            metadata={
                'schema': item['schema'],
                'table': item['table'],
                'columns': json.dumps([col['c_name'] for col in item['columns']])
            }
        )
        docs.append(doc)
    
    return docs

In [4]:
table_info = prep_chroma_documents('../data/interim/schema_info.json')

table_info[:3]

[Document(page_content='Schema Table: academic author', metadata={'schema': 'academic', 'table': 'author', 'columns': '["aid", "homepage", "name", "oid"]'}),
 Document(page_content='Schema Table: academic cite', metadata={'schema': 'academic', 'table': 'cite', 'columns': '["cited", "citing"]'}),
 Document(page_content='Schema Table: academic conference', metadata={'schema': 'academic', 'table': 'conference', 'columns': '["cid", "homepage", "name"]'})]

## Establish Vector Database

There are a few options for databases, but I'll go with Chroma because it is open source, makes local device use "easy", and has built-in connections with langchain.

A key component in running apps using langchain is the ability to store and work with embeddings, which is how AI models natively represent data of all kinds. Langchain will provide the application framework and Chroma will provide the vector store.

On top of that, we can use the database for information retrieval, finding the most relevant tables based on the user question. This will allow us to engineer a better prompt to feed to our GPT chatbot.
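
To make the "most relevant tables" idea a bit more concrete, here is a toy sketch of ranking documents by how close their embedding is to the question's embedding. Everything in it is an assumption for illustration: the three-dimensional vectors are made up (real embedding models produce hundreds of dimensions), and cosine similarity is just one common choice. The metric Chroma actually uses depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" keyed by document text (not real model output)
toy_table_vectors = {
    "department_management head": [0.9, 0.1, 0.2],
    "academic author": [0.1, 0.8, 0.3],
    "academic cite": [0.0, 0.7, 0.6],
}

# Hypothetical embedding of the user question
toy_question_vector = [0.8, 0.2, 0.1]

# Sort document names from most to least similar to the question
ranked = sorted(
    toy_table_vectors,
    key=lambda name: cosine_similarity(toy_question_vector, toy_table_vectors[name]),
    reverse=True,
)

print(ranked[0])
```

The vector store does the same kind of ranking for us, just with real embeddings and a proper index instead of a sorted dictionary.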

In [8]:
# set up embeddings using HuggingFace
embeddings = HuggingFaceEmbeddings()

In [6]:
# set up directory to store database on disk
persist_dir = '../data/processed/chromadb/schema-table-split'

In [21]:
vectordb = Chroma.from_documents(documents=table_info, embedding=embeddings, persist_directory=persist_dir)

#### Persist DB

In [22]:
vectordb.persist()  # I think I need to call this explicitly, mainly because I'm in an ipynb.
vectordb = None  # drop the reference so the next section reloads from disk

## Test Vector Database Query

I'm going to start very simple and follow this first prompt through to the end. So for now, I'll only take the top result and roll with it. In the future we'll need to build out some contingency that can test multiple schemas.

In [9]:
# load from disk
vectordb = Chroma(persist_directory=persist_dir, embedding_function=embeddings)

In [26]:
query = "How many heads of the departments are older than 56?" #one of the prompts from the training data

docs = vectordb.similarity_search(query, k=1)

In [28]:
print('Most Likely Schema and Table:\n')
# print the schema and table for each returned document
for doc in docs:
    print(doc.metadata['schema'] + ' - ' + doc.metadata['table'])

Most Likely Schema and Table:

department_management - head
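
For the multi-schema contingency mentioned earlier, one possible direction is to pull back several candidates with their scores instead of only the top hit. The sketch below is not something I've run against this database. The helper names `top_schema_candidates` and `distinct_schemas` are hypothetical, and it assumes the store exposes `similarity_search_with_score(query, k=...)` returning `(document, score)` pairs, which langchain's Chroma wrapper does. As far as I understand, the score from Chroma is a distance (lower means closer), but that is worth verifying for the installed version.

```python
def top_schema_candidates(store, question, k=3):
    """Return up to k (schema, table, score) tuples for a question.

    `store` is assumed to be a langchain vector store (e.g. the Chroma
    object loaded above) whose documents carry 'schema' and 'table'
    metadata, as prepared earlier in this notebook.
    """
    results = store.similarity_search_with_score(question, k=k)
    return [
        (doc.metadata["schema"], doc.metadata["table"], score)
        for doc, score in results
    ]


def distinct_schemas(candidates):
    """List the schemas seen in the candidates, keeping the ranking order."""
    seen = []
    for schema, _table, _score in candidates:
        if schema not in seen:
            seen.append(schema)
    return seen


# With the notebook's objects this would be called roughly as:
# candidates = top_schema_candidates(vectordb, query, k=3)
# schemas_to_try = distinct_schemas(candidates)
```

Each schema in `schemas_to_try` could then be tested in turn if the first one doesn't produce a working query.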


## Observations and Next Steps

This change appears to have worked on the first test example. On the first run, with no preprocessing and a document that contained just the table title and metadata, the search gave me the wrong answer. After some preprocessing (splitting any names that contained '_' into separate words and restructuring the document to include 'Schema Table'), the result looks to have improved a bit.
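
The preprocessing itself isn't shown in this notebook, since `schema_split` and `table_split` already exist in the JSON. Here's a minimal sketch of what the underscore splitting and document restructuring could look like. The helper names `split_name` and `build_page_content` are hypothetical and not the code that actually produced the file.

```python
def split_name(name):
    """Replace underscores with spaces so each word is embedded separately."""
    return " ".join(part for part in name.split("_") if part)


def build_page_content(schema, table):
    """Build the text that gets embedded for one table."""
    return f"Schema Table: {split_name(schema)} {split_name(table)}"


print(build_page_content("department_management", "head"))
# prints: Schema Table: department management head
```

Names without underscores pass through unchanged, so `build_page_content("academic", "author")` gives the same `Schema Table: academic author` text seen in the documents above.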

I'll continue on with next steps, push this to main, and come back when I'm ready to complicate things.