# Saving Schema Data to Vector Database

My initial thought was to build a model that checks for similarity between the prompt and the schema information. But after some research, it sounds like this could be simplified and sped up by using a vector database through langchain. We could then query the tables (with metadata) based on the question and return the documents that are most closely related. With this, everything can stay bundled in langchain. Woohoo!

Big shoutout to this great blogpost that provided some of the framework: https://canvasapp.com/blog/text-to-sql-in-production

In [1]:
import pandas as pd
import json
import os

from dotenv import load_dotenv

from langchain.docstore.document import Document
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings

## Create Functions

In [2]:
def get_json(path):
    """Return json file from specified filepath"""
    with open(path, "r") as f:
        json_file = json.load(f)
    
    return json_file

In [3]:
def prep_chroma_documents(json_path):
    """Take json file and work through it to prepare for load to Chroma.
    Instead of a list comprehension, establish the blank list and loop through the json.
    Then save to the list using the langchain docstore.document -> Document module.

    This works specifically with the content and metadata we want for this project"""
    docs = []
    for item in get_json(json_path):
        doc = Document(
            page_content=f"Table: {item['table']}",
            metadata={
                'schema': item['schema'],
                'table': item['table'],
                'columns': json.dumps([col['c_name'] for col in item['columns']])
            }
        )
        docs.append(doc)
    
    return docs
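
For reference, here is a minimal sketch of the input shape `prep_chroma_documents` appears to assume, inferred from the keys it reads and from the `ACCOUNTS` output shown in the next cell. The helper name `to_content_and_metadata` is hypothetical, it skips langchain entirely, and real entries in `schema_info.json` may carry extra keys that are not shown here.

```python
import json

# Assumed shape of one entry in schema_info.json, inferred from the keys the
# function reads. Real entries may contain additional keys per column.
sample_entries = [
    {
        "schema": "small_bank_1",
        "table": "ACCOUNTS",
        "columns": [{"c_name": "custid"}, {"c_name": "name"}],
    }
]


def to_content_and_metadata(item):
    """Hypothetical helper: builds the same fields as the Document, without langchain."""
    page_content = f"Table: {item['table']}"
    metadata = {
        "schema": item["schema"],
        "table": item["table"],
        # The column list is stored as a JSON string, presumably because
        # Chroma metadata values need to be scalars rather than lists.
        "columns": json.dumps([col["c_name"] for col in item["columns"]]),
    }
    return page_content, metadata


content, metadata = to_content_and_metadata(sample_entries[0])
print(content)
print(metadata["columns"])
```

Reading the columns back later would need a `json.loads(metadata["columns"])` call to recover the list.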

In [5]:
table_info = prep_chroma_documents('../data/interim/schema_info.json')

table_info[:3]

[Document(page_content='Table: ACCOUNTS', metadata={'schema': 'small_bank_1', 'table': 'ACCOUNTS', 'columns': '["custid", "name"]'}),
 Document(page_content='Table: AREA_CODE_STATE', metadata={'schema': 'voter_1', 'table': 'AREA_CODE_STATE', 'columns': '["area_code", "state"]'}),
 Document(page_content='Table: Acceptance', metadata={'schema': 'workshop_paper', 'table': 'Acceptance', 'columns': '["Submission_ID", "Workshop_ID", "Result"]'})]

## Establish Vector Database

There are a few options for databases, but I'll go with Chroma because it is open source, makes local device use "easy", and has built-in connections with langchain.

A key component in running apps using langchain is the ability to store and work with embeddings, which is how AI models natively represent data of all kinds. Langchain will provide the application framework and Chroma will provide the vector store.

Within that we can also do some information retrieval from the database, finding the most relevant tables based on the user question. This will allow us to engineer a better prompt to feed to our gpt chatbot.
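
To make the retrieval idea concrete, here is a toy sketch of ranking documents by similarity to a query. The three-dimensional vectors are made up for illustration, and cosine similarity is an assumption chosen for readability: the real embeddings have hundreds of dimensions, and the vector store may use a different distance measure.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up "embeddings" for three documents and one query (illustration only).
toy_table_vectors = {
    "Table: Departments": [0.9, 0.1, 0.0],
    "Table: ACCOUNTS": [0.1, 0.8, 0.2],
    "Table: Acceptance": [0.0, 0.3, 0.9],
}
toy_query_vector = [0.8, 0.2, 0.1]

# Sort document names from most to least similar to the query vector.
ranked = sorted(
    toy_table_vectors,
    key=lambda name: cosine_similarity(toy_query_vector, toy_table_vectors[name]),
    reverse=True,
)
print(ranked)
```

The point is only that the closest vector wins, so whatever text goes into `page_content` determines which tables can be found.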

In [6]:
#setup embeddings using HuggingFace
embeddings  = HuggingFaceEmbeddings()

#setup directory to store database on disk
persist_dir = '../data/processed/chromadb/'

In [7]:
vectordb = Chroma.from_documents(documents=table_info, embedding=embeddings, persist_directory=persist_dir)

#### Persist DB

In [8]:
vectordb.persist() #think I need to call this mainly because I'm in an ipynb.
vectordb=None

## Test Vector Database Query

In [None]:
# load from disk
vectordb = Chroma(persist_directory=persist_dir, embedding_function=embeddings)

In [10]:
query = "How many heads of the departments are older than 56?" #one of the prompts from the training data

docs = vectordb.similarity_search(query)
print(docs[0].page_content)

Table: Departments


In [33]:
docs

[Document(page_content='Table: Departments', metadata={'schema': 'department_store', 'table': 'Departments', 'columns': '["department_id", "dept_store_id", "department_name"]'}),
 Document(page_content='Table: Departments', metadata={'schema': 'student_transcripts_tracking', 'table': 'Departments', 'columns': '["department_id", "department_name", "department_description", "other_details"]'}),
 Document(page_content='Table: departments', metadata={'schema': 'hr_1', 'table': 'departments', 'columns': '["DEPARTMENT_ID", "DEPARTMENT_NAME", "MANAGER_ID", "LOCATION_ID"]'}),
 Document(page_content='Table: Department', metadata={'schema': 'hospital_1', 'table': 'Department', 'columns': '["DepartmentID", "Name", "Head"]'})]

In [32]:
# enumerate gives a 1-based rank alongside each returned document
for i, doc in enumerate(docs[:10], start=1):
    print(f"{i} {doc.metadata['schema']}")

1 department_store
2 student_transcripts_tracking
3 hr_1
4 hospital_1


This appears to be working in principle, although it isn't returning the correct answer. For that query we need to pull from the 'head' table under the department_management schema. I have a feeling this vector search is prioritizing the table name, which makes sense because the table name is the only text in the document content and the schema lives only in the metadata. Maybe I could concatenate the schema and table together, or load them in as equal parts of the document, so the search weighs both equally when matching? I'll research that.
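
One way the concatenation idea could look is sketched below. The function name `build_page_content`, the `include_columns` flag, and the line-per-field layout are all hypothetical, and whether this actually improves retrieval is untested; the sample item reuses the `hospital_1` result shown above.

```python
def build_page_content(item, include_columns=False):
    """Hypothetical replacement for the table-only page_content string."""
    # Put schema and table into the embedded text so both influence similarity.
    lines = [f"Schema: {item['schema']}", f"Table: {item['table']}"]
    if include_columns:
        # Optional extra signal; an assumption beyond the schema + table idea.
        column_names = ", ".join(col["c_name"] for col in item["columns"])
        lines.append(f"Columns: {column_names}")
    return "\n".join(lines)


example_item = {
    "schema": "hospital_1",
    "table": "Department",
    "columns": [{"c_name": "DepartmentID"}, {"c_name": "Name"}, {"c_name": "Head"}],
}

print(build_page_content(example_item))
print(build_page_content(example_item, include_columns=True))
```

In `prep_chroma_documents`, this string would take the place of `f"Table: {item['table']}"`, with the metadata left unchanged so the schema can still be read back from the results.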