# Saving Schema Data to Vector Database

My initial thought was to build a model that checks for similarity between the prompt and the schema information. After some research, it sounds like this can be simplified and sped up by using a vector database through langchain. We could then query the tables (with metadata) based on the question and return the documents that are most closely related. With this, everything should end up bundled into langchain. Woohoo!

Big shoutout to this great blogpost that provided some of the framework: https://canvasapp.com/blog/text-to-sql-in-production

In [1]:
import pandas as pd
import json
import os

from dotenv import load_dotenv

from langchain.docstore.document import Document
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings

## Testing

I'll test the structure of the data, breaking it out into a form that can be loaded as "documents".

### Load Data

In [2]:
#load in json
path = '../data/interim/'

with open(path+'schema_info.json', "r") as f:
    schema_info = json.load(f)

In [3]:
schema_info[0]['columns'][0]['c_name']

'custid'

I want to create a "document" for each table, with metadata for the schema, the table, and a list of columns. I don't know whether the column type would be helpful to the model. It might be worth trying, but I also don't want to add useless details to storage. I'll leave the types out for now and keep the code that pulls in all the details commented out below.

In [4]:
for table in schema_info:
    doc = table['table']
    cols = table['columns']
    col_names = json.dumps([column['c_name'] for column in cols]) #add in json dumps as Chroma didn't take the list -> hoping this converts to a string that works with Chroma
    metadata={
        'schema': table['schema'],
        'table': table['table'],
        'columns': col_names,
    }

print(doc, metadata) #runs after the loop, so only the last table is shown


###code to return full column_info in metadata
#for table in schema_info:
#    doc = table['table']
#    metadata={
#        'schema': table['schema'],
#        'table': table['table'],
#        'columns': json.dumps(table['columns']),
#    }

#print(doc, metadata)

written_by {'schema': 'imdb', 'table': 'written_by', 'columns': '["id", "msid", "wid"]'}
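Since the column list is stored as a JSON string, it has to be decoded again whenever the actual list is needed. A minimal sketch of that round trip, using the same values as the output above (the variable names here are just for illustration):

```python
import json

# Metadata in the same shape as the printed output above.
metadata = {
    "schema": "imdb",
    "table": "written_by",
    "columns": json.dumps(["id", "msid", "wid"]),
}

# The stored value is a plain string rather than a list.
assert isinstance(metadata["columns"], str)

# json.loads restores the original list when the columns are needed again.
columns = json.loads(metadata["columns"])
print(columns)
```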


## Setup langchain Document

Documents are just pieces of text that you can optionally add metadata to. Straightforward, but they are what allow models and databases to be used through langchain. Langchain has a module for setting up documents: "Document"

In [5]:
documents = [
    Document(
        page_content=f"Table: {simple_meta['table']}",
        metadata={
            'schema': simple_meta['schema'],
            'table': simple_meta['table'],
            'columns': json.dumps([col['c_name'] for col in simple_meta['columns']])
        },
    )
    for simple_meta in schema_info
]

In [6]:
documents[:3]

[Document(page_content='Table: ACCOUNTS', metadata={'schema': 'small_bank_1', 'table': 'ACCOUNTS', 'columns': '["custid", "name"]'}),
 Document(page_content='Table: AREA_CODE_STATE', metadata={'schema': 'voter_1', 'table': 'AREA_CODE_STATE', 'columns': '["area_code", "state"]'}),
 Document(page_content='Table: Acceptance', metadata={'schema': 'workshop_paper', 'table': 'Acceptance', 'columns': '["Submission_ID", "Workshop_ID", "Result"]'})]

## Establish Vector Database

There are a few options for databases, but I'll go with Chroma because it is open source, makes local device use "easy", and has built-in connections with langchain.

A key component in running apps using langchain is the ability to store and work with embeddings, which is how AI models natively represent data of all kinds. Langchain will provide the application framework and Chroma will provide the vector store.

Within that, we can also do some information retrieval from the database, finding the most relevant tables based on the user's question. This will allow us to engineer a better prompt to feed to our GPT chatbot.
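To make the retrieval idea concrete without calling any API, here is a rough sketch of similarity ranking in plain Python. The `toy_embed` function is a hypothetical bag-of-words stand-in for a real embedding model, and `top_tables` is an illustrative helper, not part of langchain or Chroma. The real setup would use learned embeddings, but the ranking step follows the same idea.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_tables(question, page_contents, k=1):
    """Return the k page contents most similar to the question."""
    query = toy_embed(question)
    ranked = sorted(
        page_contents,
        key=lambda text: cosine(query, toy_embed(text)),
        reverse=True,
    )
    return ranked[:k]


# Page contents in the same "Table: <name>" form used for the documents.
page_contents = ["Table: ACCOUNTS", "Table: AREA_CODE_STATE", "Table: Acceptance"]

best = top_tables("Which state has area code 212?", page_contents, k=1)
print(best)
```

With word counts, a question only matches a table when it shares literal words with the table name, which is the limitation a real embedding model is meant to reduce.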

In [7]:
#load .env file to get API Key
load_dotenv()
openai_api_key=os.getenv("OPENAI_API_KEY")

#setup embeddings using OpenAIEmbeddings
embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)

#setup directory to store database on disk
persist_dir = '../data/processed/chromadb/'

In [None]:
help(OpenAIEmbeddings)

In [None]:
#build the vector store from the documents and persist it to disk
db = Chroma.from_documents(
    documents, embedding=embeddings,
    persist_directory=persist_dir
)

Currently getting an error here, I think because I haven't passed it a working API key yet. I'll keep researching it, because I don't want to run it without fully understanding it and rack up costs for no reason.
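One cheap way to rule the API key in or out, without making any paid call, is to check that the environment variable actually loaded. This is a small sketch; `require_env` is a hypothetical helper name, not something from langchain or dotenv.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; check the .env file before calling the API"
        )
    return value
```

Calling `require_env("OPENAI_API_KEY")` right after `load_dotenv()` would fail immediately with a readable message if the key is missing or empty. It does not confirm that the key is valid, only that one was found.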

## Create Functions

In [15]:
def get_json(path):
    """Return json file from specified filepath"""
    with open(path, "r") as f:
        json_file = json.load(f)
    
    return json_file

In [22]:
def prep_chroma_documents(json_path):
    """Take json file and work through it to prepare for load to Chroma.
    Instead of a list comprehension, establish a blank list and loop through the json.
    Each item is then saved to the list using the langchain docstore.document -> Document module.

    This works specifically with the content and metadata we want for this project"""
    docs = []
    for item in get_json(json_path):
        doc = Document(
            page_content=f"Table: {item['table']}",
            metadata={
                'schema': item['schema'],
                'table': item['table'],
                'columns': json.dumps([col['c_name'] for col in item['columns']])
            }
        )
        docs.append(doc)
    
    return docs

In [25]:
test = prep_chroma_documents('../data/interim/schema_info.json')

test[:3]

[Document(page_content='Table: ACCOUNTS', metadata={'schema': 'small_bank_1', 'table': 'ACCOUNTS', 'columns': '["custid", "name"]'}),
 Document(page_content='Table: AREA_CODE_STATE', metadata={'schema': 'voter_1', 'table': 'AREA_CODE_STATE', 'columns': '["area_code", "state"]'}),
 Document(page_content='Table: Acceptance', metadata={'schema': 'workshop_paper', 'table': 'Acceptance', 'columns': '["Submission_ID", "Workshop_ID", "Result"]'})]
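Looking ahead to the prompt engineering step, metadata like the above could be turned into a compact schema description for the prompt. This is only a sketch of one possible format; `format_schema_context` is a hypothetical helper, and it works on plain metadata dicts rather than langchain objects so it runs on its own.

```python
import json


def format_schema_context(metadata_list):
    """Render one 'schema.table(col, col)' line per metadata dict."""
    lines = []
    for meta in metadata_list:
        # Columns are stored as a JSON string, so decode them first.
        columns = ", ".join(json.loads(meta["columns"]))
        lines.append(f"{meta['schema']}.{meta['table']}({columns})")
    return "\n".join(lines)


# Metadata copied from the first two documents shown above.
retrieved = [
    {"schema": "small_bank_1", "table": "ACCOUNTS", "columns": '["custid", "name"]'},
    {"schema": "voter_1", "table": "AREA_CODE_STATE", "columns": '["area_code", "state"]'},
]

context = format_schema_context(retrieved)
print(context)
```

With the documents from `prep_chroma_documents`, the input would be something like `[doc.metadata for doc in test[:2]]`.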