# Saving Schema Data to Vector Database

My initial thought was to build a model that scores the similarity between the prompt and the schema information. After some research, it sounds like this can be simplified and sped up by using a vector database through langchain. We could then query the tables (with metadata) based on the question and return the documents that are most closely related. With this, everything can be bundled into langchain. Woohoo!
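To make the retrieval idea concrete, here is a toy sketch of "return the tables most closely related to the question". It is not what a vector database does internally: word counts and cosine similarity stand in for real embeddings, and the table descriptions, `bag_of_words`, `cosine`, and `rank_tables` are all hypothetical names made up for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphanumeric tokens (a crude stand-in for an embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical one-line descriptions of a few tables (assumed text, not the real stored documents).
tables = {
    "ACCOUNTS": "table accounts columns custid name",
    "AREA_CODE_STATE": "table area code state columns area code state",
    "written_by": "table written by columns id msid wid",
}


def rank_tables(question, tables, k=2):
    """Return the names of the k tables whose description is most similar to the question."""
    q = bag_of_words(question)
    scored = [(cosine(q, bag_of_words(text)), name) for name, text in tables.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


top = rank_tables("Which state has area code 212?", tables, k=1)
print(top)
```

A real vector store replaces the word counts with learned embeddings, so it can also match questions that share no literal words with the table description.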

In [61]:
import pandas as pd
import json

from langchain.docstore.document import Document
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings

## Testing

I'll test the structure of the data, breaking it out in a way that can be loaded into "documents".

### Load Data

In [32]:
#load in json
path = '../data/interim/'

with open(path+'schema_info.json', "r") as f:
    schema_info = json.load(f)

In [47]:
schema_info[0]['columns'][0]['c_name']

'custid'

I want to create a "document" for each table, with metadata holding the schema, the table, and a list of columns. I don't know whether the column type would help the model, and I don't want to add useless details to storage, so I'll leave the types out for now. The commented-out code below pulls in the full column details in case they turn out to be useful.

In [53]:
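for table in schema_info:
    doc = table['table']
    cols = table['columns']
    # keep only the column names; column types are left out for now
    col_names = [column['c_name'] for column in cols]
    metadata={
        'schema': table['schema'],
        'table': table['table'],
        'columns': col_names,
    }

# printed after the loop, so this shows only the last table
print(doc, metadata)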


###code to return full column_info in metadata
#for table in schema_info:
#    doc = table['table']
#    metadata={
#        'schema': table['schema'],
#        'table': table['table'],
#        'columns': json.dumps(table['columns']),
#    }

#print(doc, metadata)

written_by {'schema': 'imdb', 'table': 'written_by', 'columns': ['id', 'msid', 'wid']}


## Setup langchain Document

Documents are just a piece of text to which you can optionally add metadata. Straightforward, but they allow for using models and databases with langchain. There is a class for setting up documents: "Document"

In [58]:
documents = [
    Document(
        page_content=f"Table: {simple_meta['table']}",
        metadata={
            'schema': simple_meta['schema'],
            'table': simple_meta['table'],
            'columns': [col['c_name'] for col in simple_meta['columns']]
        },
    )
    for simple_meta in schema_info
]

In [60]:
documents[:3]

[Document(page_content='Table: ACCOUNTS', metadata={'schema': 'small_bank_1', 'table': 'ACCOUNTS', 'columns': ['custid', 'name']}),
 Document(page_content='Table: AREA_CODE_STATE', metadata={'schema': 'voter_1', 'table': 'AREA_CODE_STATE', 'columns': ['area_code', 'state']}),
 Document(page_content='Table: Acceptance', metadata={'schema': 'workshop_paper', 'table': 'Acceptance', 'columns': ['Submission_ID', 'Workshop_ID', 'Result']})]
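
One thing to check before saving these documents: the 'columns' metadata above is a list, and some vector stores only accept scalar metadata values (strings, numbers, booleans). I have not verified this against the Chroma version used here, so treat it as an assumption. If it applies, a minimal sketch of flattening the list into a string could look like the following; `flatten_metadata` is a hypothetical helper, not part of langchain.

```python
def flatten_metadata(metadata, sep=", "):
    """Return a copy of metadata with list/tuple values joined into a single string."""
    flat = {}
    for key, value in metadata.items():
        if isinstance(value, (list, tuple)):
            # join the items so the value becomes a plain string
            flat[key] = sep.join(str(item) for item in value)
        else:
            flat[key] = value
    return flat


# Metadata shaped like the first document above.
example = {"schema": "small_bank_1", "table": "ACCOUNTS", "columns": ["custid", "name"]}
flat = flatten_metadata(example)
print(flat)
```

The trade-off is that the column names can no longer be filtered as individual items, only matched as text.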