# Saving Schema Data to Vector Database

My initial thought was to build a model that checks for similarity between the prompt and the schema information. After some research, it sounds like this can be simplified and sped up by using a vector database through langchain. We can then query the tables (with metadata) based on the question and return the documents that are most closely related. With this, everything stays bundled in langchain. Woohoo!

Big shoutout to this great blogpost that provided some of the framework: https://canvasapp.com/blog/text-to-sql-in-production

In [1]:
import pandas as pd
import json
import os

from langchain.docstore.document import Document
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings

from sqlalchemy import exc

## Create Functions

In [2]:
def get_json(path):
    """Return json file from specified filepath"""
    with open(path, "r") as f:
        json_file = json.load(f)
    
    return json_file

In [3]:
def prep_chroma_documents(json_path):
    """Take json file and work through it to prepare for load to Chroma.
    Instead of a list comprehension, establish a blank list and loop through the json.
    Then save to the list using the langchain docstore.document -> Document module.

    This works specifically with the content and metadata we want for this project"""
    docs = []
    for item in get_json(json_path):
        doc = Document(
            page_content=f"Schema Table: {item['schema_split'] + ' ' + item['table_split']}",
            metadata={
                'schema': item['schema'],
                'table': item['table'],
                'columns': json.dumps([col['c_name'] for col in item['columns']])
            }
        )
        docs.append(doc)
    
    return docs

In [55]:
table_info = prep_chroma_documents('../data/interim/schema_info.json')

table_info[:3]

[Document(page_content='Schema Table: academic author', metadata={'schema': 'academic', 'table': 'author', 'columns': '["aid", "homepage", "name", "oid"]'}),
 Document(page_content='Schema Table: academic cite', metadata={'schema': 'academic', 'table': 'cite', 'columns': '["cited", "citing"]'}),
 Document(page_content='Schema Table: academic conference', metadata={'schema': 'academic', 'table': 'conference', 'columns': '["cid", "homepage", "name"]'})]
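
Note that `columns` is stored as a JSON string rather than a Python list. My guess (an assumption, not something tested here) is that the vector store prefers simple scalar metadata values. A minimal sketch of reading the list back, using a metadata dict copied from the first document in the output above:

```python
import json

# Metadata copied from the first document in the output above.
metadata = {
    "schema": "academic",
    "table": "author",
    "columns": '["aid", "homepage", "name", "oid"]',
}

# json.loads reverses the json.dumps call used when the document was built.
column_names = json.loads(metadata["columns"])
print(column_names)
```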

## Establish Vector Database

There are a few options for databases, but I'll go with Chroma because it is open source, makes local device use "easy", and has built-in connections with langchain.

A key component in running apps using langchain is the ability to store and work with embeddings, which is how AI models natively represent data of all kinds. Langchain will provide the application framework and Chroma will provide the vector store.

Within that we can also do some information retrieval from the database, finding the most relevant tables based on the user question. This will allow us to engineer a better prompt to feed to our gpt chatbot.
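
To make the retrieval idea concrete, here is a toy sketch of similarity search. It is not how Chroma or the HuggingFace embeddings work internally: the `embed` function below is a made-up stand-in that just counts words, real embeddings are dense vectors produced by a model, and the distance metric the vector store uses may not be cosine similarity. The helper names are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, documents, k=1):
    """Return the k documents whose toy vectors are closest to the query."""
    query_vec = embed(query)
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_vec, embed(doc)),
        reverse=True,
    )
    return ranked[:k]


documents = [
    "Schema Table: academic author",
    "Schema Table: academic conference",
    "Schema Table: department management head",
]
top = most_similar("which department has the oldest head", documents, k=1)
print(top)
```

The vector store does the same kind of ranking, only with learned embeddings, so it can match on meaning rather than on exact shared words.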

In [5]:
#setup embeddings using HuggingFace
embeddings  = HuggingFaceEmbeddings()

In [6]:
#setup directory to store database on disk
persist_dir = '../data/processed/chromadb/schema-table-split'

In [58]:
vectordb = Chroma.from_documents(documents=table_info, embedding=embeddings, persist_directory=persist_dir)

#### Persist DB

In [59]:
vectordb.persist() #think I need to call this mainly because I'm in an ipynb.
vectordb=None

## Test Vector Database Query

I'm going to start very simple and follow this first prompt through to the end. So for now, I'll only take the top result and roll with it. In the future we'll need to build out some contingency that can test multiple schemas.

In [7]:
# load from disk
vectordb = Chroma(persist_directory=persist_dir, embedding_function=embeddings)

In [8]:
query = "How many heads of the departments are older than 56?" #one of the prompts from the training data

docs = vectordb.similarity_search(query, k=1)

In [9]:
print('Most Likely Schema and Table:\n')
for doc in docs:
    print(doc.metadata['schema'] + ' - ' + doc.metadata['table'])

Most Likely Schema and Table:

department_management - head


### Observations and Next Steps

This change appears to have worked on the first test example. The first run, with no preprocessing and a document that contained only the table title and metadata, gave me the wrong answer. Some preprocessing (splitting any names that contained '_' into separate words and restructuring the document to include 'Schema Table') looks to have helped a bit.
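
For reference, the underscore splitting can be sketched roughly like this. The `schema_split` and `table_split` fields were produced upstream of this notebook, so the helper names and exact logic below are my assumption of what that step looks like, not the code that was actually run.

```python
def split_identifier(name):
    """Replace underscores with spaces so each word is embedded separately."""
    return " ".join(part for part in name.split("_") if part)


def build_page_content(schema, table):
    """Build the 'Schema Table: ...' text used as the document content."""
    return f"Schema Table: {split_identifier(schema)} {split_identifier(table)}"


print(build_page_content("department_management", "head"))
```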

I'll continue on with next steps, push this to main, and come back when I'm ready to complicate things.

## Fine-Tune Documents

The last method was only around 50% accurate at identifying the right database. I wonder if there is a way to train a model, but I also want to try giving the database more detailed metadata, specifically some sample data from the tables.
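
A number like that can be measured with a small hit-rate helper along these lines. This is a sketch, not the evaluation code used for the 50% figure: `top_k_hit_rate` and `fake_search` are hypothetical names, and the second example question and its expected schema are made up for illustration.

```python
from types import SimpleNamespace


def top_k_hit_rate(search, examples, k=1):
    """Fraction of questions whose expected schema appears in the top k results.

    search:   callable (question, k) -> list of objects with a .metadata dict
    examples: list of (question, expected_schema) pairs
    """
    if not examples:
        return 0.0
    hits = 0
    for question, expected_schema in examples:
        results = search(question, k)
        schemas = {doc.metadata["schema"] for doc in results}
        if expected_schema in schemas:
            hits += 1
    return hits / len(examples)


# Hypothetical stand-in for vectordb.similarity_search, for illustration only.
def fake_search(question, k):
    ranked = [
        SimpleNamespace(metadata={"schema": "department_management"}),
        SimpleNamespace(metadata={"schema": "hr_1"}),
    ]
    return ranked[:k]


examples = [
    ("How many heads of the departments are older than 56?", "department_management"),
    ("Made-up question about employees", "hr_1"),  # invented example
]
rate_at_1 = top_k_hit_rate(fake_search, examples, k=1)
rate_at_2 = top_k_hit_rate(fake_search, examples, k=2)
print(rate_at_1, rate_at_2)
```

With the real store, `search` could be something like `lambda q, k: vectordb.similarity_search(q, k=k)`.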

### Establish New Document Structure

Start by prepping the table info - I'll try using the full langchain get_table_info method.

In [10]:
#import langchain's SQLDatabase tools to get table information
from langchain import SQLDatabase

In [11]:
db_path = '../data/processed/db/'

In [12]:
def connect_db(db_path, target_schema):
    """
    Take in the identified schema and connect to the sqlite database with that name
    """
    db_filepath = db_path
    db_filename = target_schema + '.sqlite'

    #point to database
    base_dir = os.path.dirname(os.path.abspath(db_filepath+db_filename)) #get the full path within the device
    db_path = os.path.join(base_dir, db_filename) #combine with filename to get db_path
    db = SQLDatabase.from_uri("sqlite:///" + db_path) #connect via the langchain method

    return db

In [31]:
#define new document builder
def prep_chroma_documents_v2(json_path):
    """Take json file and work through it to prepare for load to Chroma.
    Instead of a list comprehension, establish a blank list and loop through the json.
    Then save to the list using the langchain docstore.document -> Document module.

    This version adds the table info to the metadata, using the langchain SQLDatabase (SQLAlchemy wrapper) to fetch it.
    Ideally we would connect to each schema once and then loop through its tables instead of reconnecting for every table, but this is easier for now, even if it's less efficient.

    This works specifically with the content and metadata we want for this project"""
    docs = []
    for item in get_json(json_path):
        #connect to database
        db = connect_db(db_path=db_path, target_schema=item['schema'])

        #create variables
        schema = item['schema']
        table = item['table']
        columns = json.dumps([col['c_name'] for col in item['columns']])
        try:
            # try-except here because there are some issues in the source sqlite databases. I want to call them out, but continue.
            table_info = db.get_table_info_no_throw(table_names=[table])
        except exc.SQLAlchemyError as e:
            print(schema + "-" + table + ": " + str(e))
            continue
        except TypeError as te:
            print(schema + "-" + table + ": " + str(te))
            continue

        #create document
        doc = Document(
            page_content=
                f"""Schema: {schema}
                Table: {table}
                Columns: {columns}
                DDL:
                    {table_info}
                """,
            metadata={
                'schema': schema,
                'table': table,
                'columns': columns,
                'table_info': table_info
            }
        )
        docs.append(doc)
    
    return docs

In [32]:
table_docs = prep_chroma_documents_v2('../data/interim/schema_info.json')

baseball_1-all_star: Could not initialize target column for ForeignKey 'player.team_id' on table 'fielding_postseason': table 'player' has no column named 'team_id'
baseball_1-appearances: Could not initialize target column for ForeignKey 'player.team_id' on table 'fielding_postseason': table 'player' has no column named 'team_id'
baseball_1-batting: Could not initialize target column for ForeignKey 'player.team_id' on table 'fielding_postseason': table 'player' has no column named 'team_id'
baseball_1-batting_postseason: Could not initialize target column for ForeignKey 'player.team_id' on table 'fielding_postseason': table 'player' has no column named 'team_id'
baseball_1-college: Could not initialize target column for ForeignKey 'player.team_id' on table 'fielding_postseason': table 'player' has no column named 'team_id'
baseball_1-fielding: Could not initialize target column for ForeignKey 'player.team_id' on table 'fielding_postseason': table 'player' has no column named 'team_id'

loan_1-bank: Could not initialize target column for ForeignKey 'customer.Cust_ID' on table 'loan': table 'customer' has no column named 'Cust_ID'
loan_1-customer: Could not initialize target column for ForeignKey 'customer.Cust_ID' on table 'loan': table 'customer' has no column named 'Cust_ID'
loan_1-loan: Could not initialize target column for ForeignKey 'customer.Cust_ID' on table 'loan': table 'customer' has no column named 'Cust_ID'
restaurants-GEOGRAPHIC: Could not initialize target column for ForeignKey 'RESTAURANT.RESTAURANT_ID' on table 'LOCATION': table 'RESTAURANT' has no column named 'RESTAURANT_ID'
restaurants-LOCATION: Could not initialize target column for ForeignKey 'RESTAURANT.RESTAURANT_ID' on table 'LOCATION': table 'RESTAURANT' has no column named 'RESTAURANT_ID'
restaurants-RESTAURANT: Could not initialize target column for ForeignKey 'RESTAURANT.RESTAURANT_ID' on table 'LOCATION': table 'RESTAURANT' has no column named 'RESTAURANT_ID'
sakila_1-film: (in table 'fil

store_product-district: Could not initialize target column for ForeignKey 'product.Product_ID' on table 'store_product': table 'product' has no column named 'Product_ID'
store_product-product: Could not initialize target column for ForeignKey 'product.Product_ID' on table 'store_product': table 'product' has no column named 'Product_ID'
store_product-store: Could not initialize target column for ForeignKey 'product.Product_ID' on table 'store_product': table 'product' has no column named 'Product_ID'
store_product-store_district: Could not initialize target column for ForeignKey 'product.Product_ID' on table 'store_product': table 'product' has no column named 'Product_ID'
store_product-store_product: Could not initialize target column for ForeignKey 'product.Product_ID' on table 'store_product': table 'product' has no column named 'Product_ID'
wta_1-matches: fromisoformat: argument must be str
wta_1-players: fromisoformat: argument must be str
wta_1-rankings: fromisoformat: argument m

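There were errors on a handful of the source schemas (for example baseball_1, loan_1, restaurants, store_product, and wta_1), so their tables were skipped, but otherwise this looks to have worked. I'll now get this stored in a database.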
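
The docstring of `prep_chroma_documents_v2` mentions wanting to connect to each schema only once. One way to get there without restructuring the loop is to cache connections by schema name. A minimal sketch, where `make_cached_connector` and `fake_connect` are hypothetical names and the fake connector only exists to count calls:

```python
def make_cached_connector(connect):
    """Wrap a connect(schema) function so each schema is opened only once."""
    cache = {}

    def cached_connect(schema):
        if schema not in cache:
            cache[schema] = connect(schema)
        return cache[schema]

    return cached_connect


# Hypothetical stand-in for connect_db, used here only to count calls.
calls = []


def fake_connect(schema):
    calls.append(schema)
    return f"connection to {schema}"


get_db = make_cached_connector(fake_connect)
for schema in ["academic", "academic", "hr_1", "academic"]:
    get_db(schema)

print(calls)
```

In the notebook this could wrap the existing function, e.g. `get_db = make_cached_connector(lambda s: connect_db(db_path=db_path, target_schema=s))`, and the loop would call `get_db(item['schema'])`.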

### Setup New Vector DB

In [15]:
#setup directory to store database on disk
persist_dir_new = '../data/processed/chromadb/schema-metadata'

In [34]:
vectordb_new = Chroma.from_documents(documents=table_docs, embedding=embeddings, persist_directory=persist_dir_new)

vectordb_new.persist() #think I need to call this mainly because I'm in an ipynb.
vectordb_new=None

### Test New DB with one of the questions that errored out on the original method

In [36]:
# load from disk
vectordb_new = Chroma(persist_directory=persist_dir_new, embedding_function=embeddings)

test = "What is the average number of employees of the departments whose rank is between 10 and 15?" #one of the prompts from the training data

test_docs = vectordb_new.similarity_search(test, k=3)

test_docs

[Document(page_content='Schema: hr_1\n                Table: departments\n                Columns: ["DEPARTMENT_ID", "DEPARTMENT_NAME", "MANAGER_ID", "LOCATION_ID"]\n                DDL:\n                    \nCREATE TABLE departments (\n\t"DEPARTMENT_ID" DECIMAL(4, 0) DEFAULT \'0\' NOT NULL, \n\t"DEPARTMENT_NAME" VARCHAR(30) NOT NULL, \n\t"MANAGER_ID" DECIMAL(6, 0) DEFAULT NULL, \n\t"LOCATION_ID" DECIMAL(4, 0) DEFAULT NULL, \n\tPRIMARY KEY ("DEPARTMENT_ID")\n)\n\n/*\n3 rows from departments table:\nDEPARTMENT_ID\tDEPARTMENT_NAME\tMANAGER_ID\tLOCATION_ID\n10\tAdministration\t200\t1700\n20\tMarketing\t201\t1800\n30\tPurchasing\t114\t1700\n*/\n                ', metadata={'schema': 'hr_1', 'table': 'departments', 'columns': '["DEPARTMENT_ID", "DEPARTMENT_NAME", "MANAGER_ID", "LOCATION_ID"]', 'table_info': '\nCREATE TABLE departments (\n\t"DEPARTMENT_ID" DECIMAL(4, 0) DEFAULT \'0\' NOT NULL, \n\t"DEPARTMENT_NAME" VARCHAR(30) NOT NULL, \n\t"MANAGER_ID" DECIMAL(6, 0) DEFAULT NULL, \n\t"LOCA

This didn't put it in the 1st spot, but it is in the top three. I'll switch over to our other notebook and run the full result.
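
Since the right answer can land anywhere in the top few results, a later step could try each candidate schema in rank order. A small sketch of pulling distinct schemas out of the search results; the helper name is hypothetical and the results below are made-up placeholders, not real output from the vector store:

```python
from types import SimpleNamespace


def candidate_schemas(results):
    """Return distinct schema names from search results, best match first."""
    seen = []
    for doc in results:
        schema = doc.metadata["schema"]
        if schema not in seen:
            seen.append(schema)
    return seen


# Made-up results for illustration only.
results = [
    SimpleNamespace(metadata={"schema": "schema_a", "table": "t1"}),
    SimpleNamespace(metadata={"schema": "schema_b", "table": "t2"}),
    SimpleNamespace(metadata={"schema": "schema_a", "table": "t3"}),
]
candidates = candidate_schemas(results)
print(candidates)
```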