# Vector Database Validation

A key requirement of this application is correctly identifying where to look for the answer to a question. This can be fairly simple when the user's question is precise, like "pull from x schema", or contains distinctive keywords. But it can also be very tough. The first version uses a pure vector DB similarity search. I want to test how often it returns the correct schema as the top result and within the top 3. If the accuracy isn't satisfactory, I'll move on to testing other options or ways to supplement it.

## Setup Connection to Vector DB

### Imports

In [67]:
import os
import json
import pandas as pd
import numpy as np

from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings

### Point to DB

In [60]:
#setup embeddings using HuggingFace and the directory location
embeddings  = HuggingFaceEmbeddings()
persist_dir = '../data/processed/chromadb/schema-table-split'

# load from disk
vectordb = Chroma(persist_directory=persist_dir, embedding_function=embeddings)

## Pull in Training Data to Validate Against

In [61]:
#load json
path = '../data/raw/spider/'

with open(path+'train_spider.json', "r") as f:
    spi_train = json.load(f)

spi_train[0]['db_id']

'department_management'

In [62]:
training_df = pd.json_normalize(spi_train)[['question','db_id']]

training_df = training_df.rename(columns={'db_id': 'schema'})

training_df.head()

| | question | schema |
|---|---|---|
| 0 | How many heads of the departments are older th... | department_management |
| 1 | List the name, born state and age of the heads... | department_management |
| 2 | List the creation year, name and budget of eac... | department_management |
| 3 | What are the maximum and minimum budget of the... | department_management |
| 4 | What is the average number of employees of the... | department_management |


## Write Loop to Test Each Question Against the Vector DB

Running every training question through the vector DB could take a while, so test on a small sample first. :)
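One way to gauge how long "a while" is before committing to the full run is to time the search on a handful of questions and scale up. The sketch below is only an illustration: `estimate_total_seconds` and `fake_search` are hypothetical names, the stand-in function does no real retrieval, and the linear extrapolation assumes each call costs roughly the same.

```python
import time


def estimate_total_seconds(func, sample_inputs, total_count):
    """Time func on a small sample and scale linearly to total_count calls."""
    start = time.perf_counter()
    for item in sample_inputs:
        func(item)
    elapsed = time.perf_counter() - start
    per_call = elapsed / len(sample_inputs)
    return per_call * total_count


# Stand-in for the real search call (hypothetical; swap in the real function)
def fake_search(question):
    return sorted(question.split())


sample = ["how many departments are there", "list all employee names"]
# 1000 is an arbitrary placeholder for the number of rows to process
estimate = estimate_total_seconds(fake_search, sample, 1000)
print(f"Estimated seconds for 1000 questions: {estimate:.4f}")
```

In the notebook, the real search function and `len(training_df)` would take the place of the stand-in function and the placeholder count.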

In [63]:
#write function for similarity search to apply to each row of the dataframe
def sim_search(question, k=3):
    top_results = vectordb.similarity_search(question, k=k)
    tgt_schema = list(dict.fromkeys([doc.metadata['schema'] for doc in top_results]))

    return tgt_schema
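Because `sim_search` removes duplicates after retrieving `k` documents, it can return fewer than three schemas when several hits come from the same schema (some rows in the outputs further down list only one). One possible variation, sketched here on stand-in documents, is to fetch more documents and keep the first `n` distinct schemas. `first_unique_schemas` and the toy documents are hypothetical; the only assumption carried over from the notebook is that each result exposes `.metadata['schema']`.

```python
from types import SimpleNamespace


def first_unique_schemas(docs, n=3):
    """Return the first n distinct schema names, keeping ranking order."""
    # dict.fromkeys drops duplicates while preserving first-seen order
    unique = list(dict.fromkeys(doc.metadata["schema"] for doc in docs))
    return unique[:n]


# Stand-in for ranked search results (hypothetical data, not real output)
toy_docs = [
    SimpleNamespace(metadata={"schema": name})
    for name in ["hr_1", "hr_1", "geo", "hr_1", "world_1", "voter_1"]
]

print(first_unique_schemas(toy_docs[:3]))  # 3 documents yield only 2 schemas
print(first_unique_schemas(toy_docs))      # over-fetching yields 3 schemas
```

With the real store, this would mean calling `vectordb.similarity_search` with a larger `k` and passing the results through the helper. Whether that changes the top-3 numbers is untested here.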

In [64]:
#test function on a 10 row copy of the df (the copy avoids pandas' SettingWithCopyWarning when adding columns)
df_test = training_df[['question','schema']].head(10).copy()

df_test['top_three'] = df_test.apply(lambda x: sim_search(x['question'], k=3), axis=1)
df_test['top_schema'] = df_test.apply(lambda x: x['top_three'][0], axis=1)

df_test.head(3)

| | question | schema | top_three | top_schema |
|---|---|---|---|---|
| 0 | How many heads of the departments are older th... | department_management | [department_management, hr_1] | department_management |
| 1 | List the name, born state and age of the heads... | department_management | [department_management, local_govt_in_alabama] | department_management |
| 2 | List the creation year, name and budget of eac... | department_management | [department_management, e_government] | department_management |


In [65]:
#Use this function to create two new fields in our training dataframe: top result and top 3 results (unique)
training_df['top_three_unique'] = training_df.apply(lambda x: sim_search(x['question'], k=3), axis=1)
training_df['top_schema'] = training_df.apply(lambda x: x['top_three_unique'][0], axis=1)

In [66]:
training_df.head()

| | question | schema | top_three_unique | top_schema |
|---|---|---|---|---|
| 0 | How many heads of the departments are older th... | department_management | [department_management, hr_1] | department_management |
| 1 | List the name, born state and age of the heads... | department_management | [department_management, local_govt_in_alabama] | department_management |
| 2 | List the creation year, name and budget of eac... | department_management | [department_management, e_government] | department_management |
| 3 | What are the maximum and minimum budget of the... | department_management | [department_management, e_government] | department_management |
| 4 | What is the average number of employees of the... | department_management | [department_store, department_management, hr_1] | department_store |


In [68]:
#create a new column that flags (1/0) whether the true schema matches the top result
training_df['top_match'] = np.where(training_df['schema'] == training_df['top_schema'], 1, 0)

In [79]:
#now for the top 3 unique
#define a function that returns 1 if the true schema is among the unique top-3 schemas, else 0
def is_schema_in_top_three(row):
    return int(row['schema'] in row['top_three_unique'])

In [80]:
# Apply the function to each row using 'apply' and store the result in a new column 'is_in_top_three'
training_df['is_in_top_three'] = training_df.apply(is_schema_in_top_three, axis=1)

In [81]:
training_df.head(10)

| | question | schema | top_three_unique | top_schema | top_match | is_in_top_three |
|---|---|---|---|---|---|---|
| 0 | How many heads of the departments are older th... | department_management | [department_management, hr_1] | department_management | 1 | 1 |
| 1 | List the name, born state and age of the heads... | department_management | [department_management, local_govt_in_alabama] | department_management | 1 | 1 |
| 2 | List the creation year, name and budget of eac... | department_management | [department_management, e_government] | department_management | 1 | 1 |
| 3 | What are the maximum and minimum budget of the... | department_management | [department_management, e_government] | department_management | 1 | 1 |
| 4 | What is the average number of employees of the... | department_management | [department_store, department_management, hr_1] | department_store | 0 | 1 |
| 5 | What are the names of the heads who are born o... | department_management | [voter_1, election, party_people] | voter_1 | 0 | 0 |
| 6 | What are the distinct creation years of the de... | department_management | [local_govt_in_alabama] | local_govt_in_alabama | 0 | 0 |
| 7 | What are the names of the states where at leas... | department_management | [geo, world_1, voter_1] | geo | 0 | 0 |
| 8 | In which year were most departments established? | department_management | [department_management, hr_1] | department_management | 1 | 1 |
| 9 | Show the name and number of employees for the ... | department_management | [department_management] | department_management | 1 | 1 |
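As a quick sanity check on the two flags, the ten rows displayed above can be scored by hand. The snippet below is a small worked example, not part of the original run: it re-types those ten rows as plain tuples (true schema, ranked unique schemas) and counts top-1 and top-3 hits in pure Python.

```python
DM = "department_management"

# (true schema, ranked unique schemas) for the ten rows shown above
rows = [
    (DM, [DM, "hr_1"]),
    (DM, [DM, "local_govt_in_alabama"]),
    (DM, [DM, "e_government"]),
    (DM, [DM, "e_government"]),
    (DM, ["department_store", DM, "hr_1"]),
    (DM, ["voter_1", "election", "party_people"]),
    (DM, ["local_govt_in_alabama"]),
    (DM, ["geo", "world_1", "voter_1"]),
    (DM, [DM, "hr_1"]),
    (DM, [DM]),
]

# Top-1 hit: the true schema is ranked first
top1_hits = sum(truth == ranked[0] for truth, ranked in rows)
# Top-3 hit: the true schema appears anywhere in the unique list
top3_hits = sum(truth in ranked for truth, ranked in rows)

print(f"top-1: {top1_hits}/{len(rows)} = {top1_hits / len(rows):.0%}")  # 6/10 = 60%
print(f"top-3: {top3_hits}/{len(rows)} = {top3_hits / len(rows):.0%}")  # 7/10 = 70%
```

These counts agree with the `top_match` and `is_in_top_three` columns for the same rows, which sum to 6 and 7.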


### Evaluate Results

Look at pure accuracy: what percentage of all questions have the correct schema as the top result, and what percentage have it within the top 3 unique schemas?

In [84]:
#total rows
tot_q = training_df.shape[0]

#top 1 match
top_one_match = training_df['top_match'].sum()

#top 3 match
top_three_match = training_df['is_in_top_three'].sum()

In [90]:
print(f"1st result matches: {top_one_match/tot_q:.2%}")
print(f"Top 3 results match: {top_three_match/tot_q:.2%}")

1st result matches: 42.13%
Top 3 results match: 57.00%


I've got a lot of room for improvement!

I think I want to try loading more metadata about the schemas (their tables) and using some entity extraction on the question, then combining both steps.