# Vector Database Validation

A key requirement of this application is correctly identifying which schema holds the answer to a question. This can be fairly simple when the user's question is precise, like "pull from x schema", or contains distinctive keywords. But it can also be very tough. The first version uses a pure vector DB similarity search. I want to test how often it returns the correct schema as the top 1 result and within the top 3. If the accuracy isn't satisfactory, I'll move on to testing other options or ways to supplement it.

## Setup Connection to Vector DB

### Imports

In [1]:
import os
import json
import pandas as pd
import numpy as np

from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings

### Point to DB

In [2]:
#setup embeddings using HuggingFace and the directory location
embeddings  = HuggingFaceEmbeddings()
persist_dir = '../data/processed/chromadb/schema-table-split'

# load from disk
vectordb = Chroma(persist_directory=persist_dir, embedding_function=embeddings)

## Pull in Training Data to Validate Against

In [8]:
#load json
path = '../data/raw/spider/'

with open(path+'train_spider.json', "r") as f:
    spi_train = json.load(f)

spi_train[0]['db_id']

'department_management'

In [10]:
training_df = pd.json_normalize(spi_train)[['question','db_id']]

training_df = training_df.rename(columns={'db_id': 'target_schema'})

training_df.head()

Unnamed: 0,question,target_schema
0,How many heads of the departments are older th...,department_management
1,"List the name, born state and age of the heads...",department_management
2,"List the creation year, name and budget of eac...",department_management
3,What are the maximum and minimum budget of the...,department_management
4,What is the average number of employees of the...,department_management


## Write Loop to Test each of these questions against the vector db.

This could take a while. Make sure to do some tests. :)

In [3]:
#write function for similarity search to apply to each row of the dataframe
def sim_search(question, vector_db, k=3):
    top_results = vector_db.similarity_search(question, k=k)
    top_matches = list(dict.fromkeys([doc.metadata['schema'] for doc in top_results]))

    return top_matches
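
The `dict.fromkeys` step is what turns ranked documents into a ranked list of unique schemas, which is why some "top 3" lists in the outputs below have fewer than three entries. A minimal standalone sketch of that step, using `SimpleNamespace` objects as hypothetical stand-ins for the documents the vector store returns (only `.metadata` is assumed):

```python
from types import SimpleNamespace


def unique_schemas(docs):
    """Return schema names in ranked order, dropping repeats."""
    # dict keys keep insertion order, so the first (best-ranked) hit wins.
    return list(dict.fromkeys(doc.metadata["schema"] for doc in docs))


# Hypothetical search results: two hits from the same schema, one from another.
fake_results = [
    SimpleNamespace(metadata={"schema": "department_management"}),
    SimpleNamespace(metadata={"schema": "department_management"}),
    SimpleNamespace(metadata={"schema": "hr_1"}),
]

print(unique_schemas(fake_results))
```

Three documents collapse to two schemas here, so a question can "miss" in the top 3 even though only one or two distinct schemas were actually considered.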

In [97]:
#test function on 10 row df (copy so new columns don't trigger SettingWithCopyWarning)
df_test = training_df[['question','target_schema']].head(10).copy()

#sim_search needs the vector db passed in explicitly
df_test['top_three_match'] = df_test.apply(lambda x: sim_search(x['question'], vectordb, k=3), axis=1)
df_test['top_match'] = df_test.apply(lambda x: x['top_three_match'][0], axis=1)

df_test.head(3)

Unnamed: 0,question,target_schema,top_three_match,top_match
0,How many heads of the departments are older th...,department_management,"[department_management, hr_1]",department_management
1,"List the name, born state and age of the heads...",department_management,"[department_management, local_govt_in_alabama]",department_management
2,"List the creation year, name and budget of eac...",department_management,"[department_management, e_government]",department_management


In [98]:
#Use this function to create two new fields in our training dataframe: top result and top 3 results (unique)
training_df['top_three_match_unique'] = training_df.apply(lambda x: sim_search(x['question'], vectordb, k=3), axis=1)
training_df['top_one_match'] = training_df.apply(lambda x: x['top_three_match_unique'][0], axis=1)

In [99]:
training_df.head()

Unnamed: 0,question,target_schema,top_three_match_unique,top_one_match
0,How many heads of the departments are older th...,department_management,"[department_management, hr_1]",department_management
1,"List the name, born state and age of the heads...",department_management,"[department_management, local_govt_in_alabama]",department_management
2,"List the creation year, name and budget of eac...",department_management,"[department_management, e_government]",department_management
3,What are the maximum and minimum budget of the...,department_management,"[department_management, e_government]",department_management
4,What is the average number of employees of the...,department_management,"[department_store, department_management, hr_1]",department_store


In [100]:
#create new column that flags 1/0 whether the target schema matches the top 1 result
training_df['top_one_is_match'] = np.where(training_df['target_schema'] == training_df['top_one_match'], 1, 0)

In [4]:
#now for the top 3 unique
#define a function for this
def is_schema_in_top_three(row):
    """Return 1 if the target schema appears in the top 3 unique matches, else 0."""
    return int(row['target_schema'] in row['top_three_match_unique'])

In [105]:
# Apply the function to each row using 'apply' and store the result in a new column 'top_three_is_match'
training_df['top_three_is_match'] = training_df.apply(is_schema_in_top_three, axis=1)

In [106]:
training_df.head(10)

Unnamed: 0,question,target_schema,top_three_match_unique,top_one_match,top_one_is_match,top_three_is_match
0,How many heads of the departments are older th...,department_management,"[department_management, hr_1]",department_management,1,1
1,"List the name, born state and age of the heads...",department_management,"[department_management, local_govt_in_alabama]",department_management,1,1
2,"List the creation year, name and budget of eac...",department_management,"[department_management, e_government]",department_management,1,1
3,What are the maximum and minimum budget of the...,department_management,"[department_management, e_government]",department_management,1,1
4,What is the average number of employees of the...,department_management,"[department_store, department_management, hr_1]",department_store,0,1
5,What are the names of the heads who are born o...,department_management,"[voter_1, election, party_people]",voter_1,0,0
6,What are the distinct creation years of the de...,department_management,[local_govt_in_alabama],local_govt_in_alabama,0,0
7,What are the names of the states where at leas...,department_management,"[geo, world_1, voter_1]",geo,0,0
8,In which year were most departments established?,department_management,"[department_management, hr_1]",department_management,1,1
9,Show the name and number of employees for the ...,department_management,[department_management],department_management,1,1


In [112]:
training_df.iloc[4,0]

'What is the average number of employees of the departments whose rank is between 10 and 15?'

### Evaluate Results

Look at pure accuracy - what % of the total match for the top 1 and top 3 unique?

In [108]:
#total rows
tot_q = training_df.shape[0]

#top 1 match
top_one_match = training_df['top_one_is_match'].sum()

#top 3 match
top_three_match = training_df['top_three_is_match'].sum()

In [109]:
print(f"1st result matches: {top_one_match/tot_q:.2%}")
print(f"Top 3 results match: {top_three_match/tot_q:.2%}")

1st result matches: 42.13%
Top 3 results match: 57.00%
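
For reference, the two metrics can be computed without pandas. This is a minimal sketch on a handful of toy rows (the helper name `hit_rates` is hypothetical, and the rows are illustrative rather than the real 7,000 questions):

```python
def hit_rates(targets, ranked_predictions):
    """Return (top-1 rate, top-k rate) for ranked lists of predicted schemas."""
    n = len(targets)
    pairs = list(zip(targets, ranked_predictions))
    # Top-1: the first prediction must equal the target.
    top1 = sum(1 for target, preds in pairs if preds and preds[0] == target)
    # Top-k: the target may appear anywhere in the list.
    topk = sum(1 for target, preds in pairs if target in preds)
    return top1 / n, topk / n


# Toy rows; every target is the same schema to keep the example short.
targets = ["department_management"] * 4
predictions = [
    ["department_store", "department_management", "hr_1"],  # top-3 hit only
    ["voter_1", "election", "party_people"],                # miss
    ["department_management", "hr_1"],                      # top-1 hit
    ["department_management"],                              # top-1 hit
]

top1_rate, top3_rate = hit_rates(targets, predictions)
print(f"1st result matches: {top1_rate:.2%}")
print(f"Top 3 results match: {top3_rate:.2%}")
```

By construction the top-3 rate can never be lower than the top-1 rate, which is a quick sanity check on any run.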


I've got a lot of room for improvement!

I want to try loading more metadata about each schema's tables, and using some entity extraction on the question, then combining both steps.

## New VectorDB Test

I created a new vector database whose page contents again include the schema and table names, but now also include the columns and the table create statements.

I'll run the same process on this new DB.

In [5]:
#write updated function to give option for which search method
def sim_search(question, vector_db, method="cosine", k=3):
    if method == "cosine":
        top_results = vector_db.similarity_search(question, k=k)
    elif method == "mmr":
        #pass k through to the retriever instead of slicing a hard-coded 3
        retriever = vector_db.as_retriever(search_type="mmr", search_kwargs={"k": k})
        top_results = retriever.get_relevant_documents(question)
    else:
        #fail loudly rather than hitting an undefined top_matches below
        raise ValueError(f"Unknown search method: {method!r}")

    #dedupe schemas while keeping rank order
    top_matches = list(dict.fromkeys([doc.metadata['schema'] for doc in top_results]))

    return top_matches

In [6]:
def sim_search_test(json_file, vectordb, method="cosine"):
    """Create a quick and dirty function to test accuracy multiple times."""
    df = pd.json_normalize(json_file)[['question','db_id']]

    df = df.rename(columns={'db_id': 'target_schema'})

    #Use this function to create two new fields in our training dataframe: top result and top 3 results (unique)
    df['top_three_match_unique'] = df.apply(lambda x: sim_search(x['question'], vector_db=vectordb, method=method, k=3), axis=1)
    df['top_one_match'] = df.apply(lambda x: x['top_three_match_unique'][0], axis=1)

    df['top_one_is_match'] = np.where(df['target_schema'] == df['top_one_match'], 1, 0)

    df['top_three_is_match'] = df.apply(is_schema_in_top_three, axis=1)

    #total rows
    tot_q = df.shape[0]
    #top 1 match
    top_one_match = df['top_one_is_match'].sum()
    #top 3 match
    top_three_match = df['top_three_is_match'].sum()

    print(f"1st result matches: {top_one_match/tot_q:.2%}")
    print(f"Top 3 results match: {top_three_match/tot_q:.2%}")

### Test schema-metadata vector db

In [4]:
#setup embeddings using HuggingFace and the directory location
embeddings2  = HuggingFaceEmbeddings()
persist_dir2 = '../data/processed/chromadb/schema-metadata'

# load from disk
vectordb_2 = Chroma(persist_directory=persist_dir2, embedding_function=embeddings2)

In [25]:
#sim_search_test(json_file=spi_train, vectordb=vectordb_2)

1st result matches: 54.91%
Top 3 results match: 70.83%


In [26]:
#test with MMR Method
#sim_search_test(json_file=spi_train, vectordb=vectordb_2, method="mmr")

1st result matches: 57.04%
Top 3 results match: 68.44%


Interesting. Both beat the old version (42.13% top 1, 57.00% top 3). MMR is about 2 points better on the top 1 match (57.04% vs. 54.91%) but about 2 points worse on the top 3 match (68.44% vs. 70.83%).
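
To make the trade-off concrete, here is a minimal pure-Python sketch of maximal marginal relevance (MMR). It is not LangChain's implementation; the parameter name `lambda_mult` just mirrors the one LangChain exposes, and the vectors are toy values. Each step picks the candidate that is most relevant to the query after subtracting a penalty for being similar to something already picked:

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Greedy MMR: return indices of up to k documents, in selection order."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Penalty: similarity to the closest already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data: docs 0 and 1 are near-duplicates, doc 2 points in a different direction.
query = [1.0, 0.8, 0.0]
docs = [[1.0, 0.0, 0.0], [1.0, 0.05, 0.0], [0.0, 1.0, 0.0]]

print(mmr(query, docs, k=2))                   # balances relevance and diversity
print(mmr(query, docs, k=2, lambda_mult=1.0))  # pure relevance ranking
```

With the balanced weight the second pick skips the near-duplicate in favor of the less similar document, whereas `lambda_mult=1.0` reduces to plain similarity ranking. That diversity can help or hurt depending on whether the correct schema's tables look alike.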

I'll keep working on improving the score.

## LLM Method Test

I worry this one will be too slow, so I may run it on a sample of the questions. First I need to do my imports and create a function that repeatedly calls the LLM.

In [5]:
#imports
from dotenv import load_dotenv

from langchain import HuggingFaceHub
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

In [6]:
#create prompt contents
extract_prompt = """question: How many teams are there?
response: team

question: What user spent the most money in March?
response: user, money

question: What is the name of the instructor who advises the student with the greatest number of total credits?
response: instructor, advisor, student, credits
"""

instruction = """
For a given question, determine the keywords for determining what tables should be used to write a SQL query to answer the question.
"""

In [7]:
#get api key
load_dotenv()
hf_api_token = os.getenv('hf_token')

#add path to HF repo
repo_id = 'tiiuae/falcon-7b-instruct'

#establish llm model
llm = HuggingFaceHub(repo_id=repo_id, huggingfacehub_api_token=hf_api_token, model_kwargs={"temperature": .005, "max_length": 512})

In [8]:
#create function
def sim_search_llm(question, vector_db, method="cosine", k=3, instruction=instruction, extract_prompt=extract_prompt):

    #keep the question as a template variable so braces in a question can't break the template
    entity_extr_prompt = PromptTemplate(
        input_variables=["question"],
        template=(
            instruction
            + "\nHere are three question/response examples: "
            + extract_prompt
            + "\n\nquestion: {question}"
            + "\nresponse:"
        ),
    )

    #create chain
    chain = LLMChain(llm=llm, prompt=entity_extr_prompt, verbose=False)

    #predict
    results = chain.predict(question=question) #run llm

    #run simsearch on predicted keywords
    top_results = vector_db.similarity_search(results, k=k)
    top_matches = list(dict.fromkeys([doc.metadata['schema'] for doc in top_results]))

    return top_matches

In [40]:
training_df.shape

(7000, 2)

In [41]:
#pull sample of questions and save to pandas df => start with 500 samples
df_sample = training_df.sample(n=500, replace=False)
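
Before reading much into a 500-question sample, it helps to know how noisy the accuracy estimate will be. A rough sketch using the normal approximation for a proportion (the helper name is hypothetical, and it ignores the finite-population correction for sampling 500 of 7,000 without replacement):

```python
import math


def margin_of_error(p_hat, n, z=1.96):
    """Approximate 95% margin of error for an observed proportion p_hat over n samples."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)


# Worst case is p_hat = 0.5; accuracies in this notebook are in that neighborhood.
moe = margin_of_error(0.5, 500)
print(f"Margin of error at n=500: +/- {moe:.1%}")
```

Under these assumptions the estimate is only good to roughly plus or minus 4 points, so a 2-point difference between methods would not be distinguishable on a sample this size. Passing `random_state` to `sample` would also make the draw repeatable across runs.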

In [43]:
#define function for testing the sampled questions
def sim_search_llm_test(df, vectordb):
    """Create a quick and dirty function to test accuracy multiple times."""
    
    #Use this function to create two new fields in our training dataframe: top result and top 3 results (unique)
    df['top_three_match_unique'] = df.apply(lambda x: sim_search_llm(x['question'], vector_db=vectordb, k=3), axis=1)
    df['top_one_match'] = df.apply(lambda x: x['top_three_match_unique'][0], axis=1)

    df['top_one_is_match'] = np.where(df['target_schema'] == df['top_one_match'], 1, 0)

    df['top_three_is_match'] = df.apply(is_schema_in_top_three, axis=1)

    #total rows
    tot_q = df.shape[0]
    #top 1 match
    top_one_match = df['top_one_is_match'].sum()
    #top 3 match
    top_three_match = df['top_three_is_match'].sum()

    #print before returning; statements after a return never run
    print(f"1st result matches: {top_one_match/tot_q:.2%}")
    print(f"Top 3 results match: {top_three_match/tot_q:.2%}")

    return df

In [44]:
ss_test_df = sim_search_llm_test(df=df_sample, vectordb=vectordb_2)

ValueError: Error raised by inference API: Rate limit reached. You reached free usage limit (reset hourly). Please subscribe to a plan at https://huggingface.co/pricing to use the API at this rate
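
Because `df.apply` runs every row in one shot, a rate-limit error partway through throws away all the completed calls. One way around that is to cache results per question and resume later. This is a minimal sketch, assuming the error surfaces as a `ValueError` whose message contains "Rate limit" as in the traceback above; `run_with_checkpoint` and `flaky_search` are hypothetical names, and `flaky_search` stands in for `sim_search_llm`:

```python
def run_with_checkpoint(questions, search_fn, cache):
    """Fill cache[question] for each question; stop cleanly on a rate-limit error."""
    for question in questions:
        if question in cache:
            continue  # already answered in an earlier run
        try:
            cache[question] = search_fn(question)
        except ValueError as err:
            if "Rate limit" in str(err):
                print(f"Rate limited after {len(cache)} questions; rerun later to resume.")
                break
            raise  # anything else is a real bug
    return cache


# Stand-in search function that hits the rate limit on its third call.
calls = {"n": 0}

def flaky_search(question):
    calls["n"] += 1
    if calls["n"] == 3:
        raise ValueError("Error raised by inference API: Rate limit reached.")
    return [question.upper()]


cache = {}
questions = ["a", "b", "c", "d"]
run_with_checkpoint(questions, flaky_search, cache)  # stops at "c"
after_first = len(cache)
run_with_checkpoint(questions, flaky_search, cache)  # picks up where it left off
print(after_first, len(cache))
```

The cache could be written to disk between runs (for example with `json.dump`) so that an hourly limit only costs waiting time, not repeated API calls.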

In [46]:
ss_test_df.head(50)

Unnamed: 0,question,target_schema,top_three_match_unique,top_one_match,top_one_is_match,top_three_is_match
170,What are the ids of all stations that have a l...,bike_1,"[bike_1, station_weather, train_station]",bike_1,1,1
5404,What are names for top three branches with mos...,shop_membership,[shop_membership],shop_membership,1,1
4819,What is the average total number of passengers...,aircraft,"[flight_4, aircraft, flight_2]",flight_4,0,1
2221,What is the maximum fastest lap speed in race ...,formula_1,[formula_1],formula_1,1,1
2781,Sort the names of all counties in descending a...,election,[imdb],imdb,0,0
5892,What is the name and detail of each staff member?,cre_Theme_park,"[tracking_software_problems, tracking_grants_f...",tracking_software_problems,0,0
5380,List the names of all the distinct product nam...,tracking_software_problems,"[customers_campaigns_ecommerce, company_1, tra...",customers_campaigns_ecommerce,0,1
4740,What are the ids and names of department store...,department_store,[imdb],imdb,0,0
1829,"What is the maximum, minimum and average marke...",browser_web,"[phone_market, program_share, film_rank]",phone_market,0,0
6500,"What are the names of the scientists, and how ...",scientist_1,"[scientist_1, tracking_grants_for_research, cr...",scientist_1,1,1


## Final Test - No Table Info

In [7]:
#setup embeddings using HuggingFace and the directory location
embeddings  = HuggingFaceEmbeddings()
persist_dir3 = '../data/processed/chromadb/sch-tab-col'

In [9]:
# load from disk
vectordb_3 = Chroma(persist_directory=persist_dir3, embedding_function=embeddings)

In [10]:
sim_search_test(json_file=spi_train, vectordb=vectordb_3)

1st result matches: 53.13%
Top 3 results match: 68.23%


**Observations**
Including the table info gives a slight improvement (54.91% top 1 and 70.83% top 3, versus 53.13% and 68.23% without it). I'll go with that, probably changing some of the naming and building out the method in a .py file.

## Next Steps

Look at which databases perform the worst and strategize ways to improve. Possibly try automating the schema pull from the SQLite databases themselves instead of relying on the provided schema info, since consolidating that schema info would just be an extra step in the real world.
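
A per-schema breakdown is one way to find the worst performers. The sketch below is standalone, with toy `(target_schema, top_one_is_match)` pairs and a hypothetical helper name; the schema labels are placeholders, not real results:

```python
from collections import defaultdict


def accuracy_by_schema(rows):
    """Return (schema, accuracy) pairs sorted from worst to best."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for target_schema, is_match in rows:
        totals[target_schema] += 1
        hits[target_schema] += is_match
    rates = {schema: hits[schema] / totals[schema] for schema in totals}
    return sorted(rates.items(), key=lambda item: item[1])


# Toy rows: (target schema, 1 if the top result matched else 0).
toy_rows = [("a", 1), ("a", 0), ("b", 0), ("b", 0), ("c", 1)]

worst_first = accuracy_by_schema(toy_rows)
for schema, rate in worst_first:
    print(f"{schema}: {rate:.0%}")
```

Inside the notebook the same breakdown should be available from the existing dataframe with `training_df.groupby('target_schema')['top_one_is_match'].mean().sort_values()`.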