## SQLDatabaseSequentialChain

Using the simple chain, I got the process working, but the results aren't great. If I point the input variables at very specific places, it can get the right answer, but the standard prompt template from LangChain doesn't give correct results for our test queries. That prompt feeds in the table creation statements and the first few rows of each table, then relies on the model to pick out the useful information based on the input question. I think there is some further prompt engineering I could do, but first I want to explore the other SQL options from LangChain and iterate on whichever I feel offers the best long-term solution.

This is the sequential chain, which (1) determines which tables to use based on the query and (2) calls the normal SQL database chain with only those tables.

This sounds similar, but I'm curious whether splitting the work and asking the model to do it in two steps will improve performance.
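
To make the two steps concrete, here is a minimal sketch of the idea using only the standard library's `sqlite3`. It is not LangChain's implementation: `pick_tables` is a hypothetical stand-in for the LLM call that chooses tables (here it just matches table names against words in the question), and `schema_for` gathers the CREATE statements for only those tables so the second prompt stays small. The toy tables are made up for illustration.

```python
import re
import sqlite3


def list_tables(conn):
    """Return the names of all user tables in the SQLite database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def pick_tables(question, tables):
    """Step 1 (hypothetical stand-in for the LLM): keep tables named in the question."""
    words = set(re.findall(r"\w+", question.lower()))
    return [t for t in tables if t.lower() in words]


def schema_for(conn, tables):
    """Step 2 input: CREATE statements for the selected tables only."""
    statements = []
    for table in tables:
        row = conn.execute(
            "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = ?",
            (table,),
        ).fetchone()
        if row is not None:
            statements.append(row[0])
    return "\n\n".join(statements)


# Toy in-memory database; the table layout is an assumption for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE head (head_ID INTEGER, name TEXT)")
conn.execute("CREATE TABLE department (Department_ID INTEGER, Name TEXT)")

question = "Return the first row of the 'head' table."
selected = pick_tables(question, list_tables(conn))
prompt_schema = schema_for(conn, selected)
print(selected)
print(prompt_schema)
```

In the real chain, the model makes the table choice, so the quality of the second step depends entirely on how well the first one goes.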

In [1]:
from watermark import watermark
print(watermark())

Last updated: 2023-08-01T00:52:32.569591-07:00

Python implementation: CPython
Python version       : 3.11.4
IPython version      : 8.14.0

Compiler    : Clang 15.0.7 
OS          : Darwin
Release     : 22.5.0
Machine     : x86_64
Processor   : i386
CPU cores   : 8
Architecture: 64bit



### Imports

In [1]:
import pandas as pd
import os
from dotenv import load_dotenv

from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
from langchain import HuggingFaceHub

from langchain import SQLDatabase, SQLDatabaseChain
from langchain.chains import SQLDatabaseSequentialChain
from langchain.prompts.prompt import PromptTemplate
from langchain.output_parsers import CommaSeparatedListOutputParser

### Copy Functions - Point to Database

I could build out the whole automation, but I just want to test this chain, so I will point it to the right place and run a test on our simplest question.

In [2]:
def db_select(persist_dir, query):
    """
    Load the schema info vector database from disk and run the input question against it to return the most likely database we need to pull data from.
    """
    #setup embeddings using HuggingFace and the directory location
    embeddings  = HuggingFaceEmbeddings()
    per_dir = persist_dir

    # load from disk
    vectordb = Chroma(persist_directory=per_dir, embedding_function=embeddings)

    #run prompt as query and get most likely results
    result = vectordb.similarity_search(query, k=1)

    #save variables
    top_schema = result[0].metadata['schema']
    top_table = result[0].metadata['table']
    table_cols = result[0].metadata['columns']

    top_result = (top_schema, top_table, table_cols)

    return top_result
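
For intuition about what `db_select` is doing, here is a toy version of "embed the question, take the single nearest document, read its metadata". It is only a sketch: it swaps in word-count vectors and cosine similarity in place of the HuggingFace embeddings and Chroma used above, and the documents in `docs` are hypothetical, reusing only the `schema`, `table`, and `columns` metadata keys.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical schema documents; only the metadata keys mirror db_select.
docs = [
    {
        "text": "department_management head head_ID name",
        "metadata": {"schema": "department_management", "table": "head", "columns": "head_ID, name"},
    },
    {
        "text": "example_schema orders order_id total",
        "metadata": {"schema": "example_schema", "table": "orders", "columns": "order_id, total"},
    },
]


def top_match(query, docs):
    """Return (schema, table, columns) for the document most similar to the query."""
    q = embed(query)
    best = max(docs, key=lambda d: cosine(q, embed(d["text"])))
    meta = best["metadata"]
    return meta["schema"], meta["table"], meta["columns"]


result = top_match(
    "Return the first row of the 'head' table in the 'department_management' schema.",
    docs,
)
print(result)
```

With `k=1`, a single wrong nearest neighbour sends the whole pipeline to the wrong database, which is worth keeping in mind when the test questions get less explicit.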

In [3]:
def locate_and_connect_db(filepath, filename):
    """Locate the absolute path of the given SQLITE database and connect to it via the langchain SQLDatabase.from_uri method.
    filepath is the filepath within the repo, example '../data/db/data.db'
    filename is just the filename.filetype, example 'data.db'
    """
    base_dir = os.path.dirname(os.path.abspath(filepath+filename)) #get the full path within the device
    db_path = os.path.join(base_dir, filename) #combine with filename to get db_path
    db = SQLDatabase.from_uri("sqlite:///" + db_path) #connect via the langchain method

    return db
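
The path handling in `locate_and_connect_db` can be checked on its own, without connecting to anything. The sketch below uses a hypothetical helper, `sqlite_uri`, that builds the same kind of `sqlite:///` URI with `pathlib`. The file name is only an example.

```python
from pathlib import Path


def sqlite_uri(db_root, filename):
    """Build a SQLAlchemy-style SQLite URI from a root folder and a file name."""
    # resolve() makes the path absolute and collapses any ".." segments
    db_path = (Path(db_root) / filename).resolve()
    return "sqlite:///" + db_path.as_posix()


uri = sqlite_uri("../data/processed/db/", "department_management.sqlite")
print(uri)
```

One difference from the function above: joining with `Path` does not depend on `db_root` ending in a slash, whereas concatenating `filepath+filename` as strings does.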

### Load LLM

In [4]:
def load_llm_model(env_variable, repo_id='tiiuae/falcon-7b-instruct', temp=0.5, max_length=200):
    """
    Take in the target hugging face repo and the api key to setup the llm to use in our query chain
    env_variable is the name of the variable that stores your API key in your .env file
    """
    load_dotenv()
    hf_api_token = os.getenv(env_variable)
    llm = HuggingFaceHub(repo_id=repo_id, huggingfacehub_api_token=hf_api_token, model_kwargs={"temperature": temp, "max_length": max_length})

    return llm

### Create Chain

This time use the SQLDatabaseSequentialChain.

In [5]:
def create_sql_chain(llm, db, verbose=True):
    """Take in prompt template and input variables, along with the llm and db created from other functions to create our SQL Chain"""
    db_chain = SQLDatabaseSequentialChain.from_llm(llm, db, verbose=verbose, use_query_checker=True) #create our database chain
    
    return db_chain

### Link Full Process

In [10]:
def sql_analyst_seq(question, vector_db_path='../data/processed/chromadb/schema-table-split', db_root_path='../data/processed/db/', sql_dialect='sqlite', env_api_key_var='hf_token'):
    """take in user info on where the databases and the api key are located and answer their question using the sql chain."""
    
    #query our vector database with the question to get the schema most related to the question.
    top_result = db_select(persist_dir=vector_db_path, query=question)
    top_schema = top_result[0]
    top_table = top_result[1]

    #use this result to establish our database
    db = locate_and_connect_db(filepath=db_root_path, filename=top_schema+'.'+sql_dialect)

    #initialize our large language model - needs an API key - we'll use the standard variables
    llm = load_llm_model(env_variable=env_api_key_var)

    # #create prompt template using our preset template
    # prompt_template = create_prompt_template(prompt_template=template)

    #setup the sql chain using the standard variables
    sql_chain = create_sql_chain(db=db, llm=llm)

    #run sql_chain on question
    print(db)
    print(top_schema+'-'+top_table)
    sql_chain(question)

## Test

In [12]:
sql_analyst_seq("Return the first row of the 'head' table in the 'department_management' schema.")

<langchain.sql_database.SQLDatabase object at 0x18d894950>
department_management-head


> Entering new  chain...

Table names to use:
['head', 'management']

> Entering new  chain...
Return the first row of the 'head' table in the 'department_management' schema.
SQLQuery:
SQLResult: 
Answer:3 rows from 'head' table:
head_ID	name	born_
> Finished chain.

> Finished chain.


### Observations

This gives me similar issues with not pulling in the right tables. The difference here is that the first step identifies the tables for the model to review. If that step errors, it pulls in nothing, and the model appears to just wing it from that point. Even when it does pull in tables to use, it sometimes still has trouble with the question: in the test above it picked 'head' and 'management', but the SQLQuery and SQLResult fields came back empty and the answer looks like a fragment of the table preview rather than a row. It's also harder to test this because the prompt template for the first step in the chain isn't available to view in Chainlit, so I can't copy and paste it into ChatGPT for a reference point.
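
One way to guard against the "pulls in nothing" case would be to validate the first step's output before running the second. The sketch below is an assumption about how that could look, not something LangChain does for you: `parse_table_names` is a hypothetical helper that keeps only names that really exist in the database and falls back to all tables when nothing valid is left.

```python
def parse_table_names(raw_output, known_tables):
    """Parse a comma-separated table list and keep only names that exist.

    Falls back to every known table when nothing valid is left, so the
    second step never runs with an empty schema.
    """
    lookup = {name.lower(): name for name in known_tables}
    selected = []
    for piece in raw_output.split(","):
        # drop whitespace, quotes, and list brackets the model may add
        cleaned = piece.strip().strip("'\"[]` ").lower()
        if cleaned in lookup and lookup[cleaned] not in selected:
            selected.append(lookup[cleaned])
    return selected or list(known_tables)


# Hypothetical table list for illustration.
known = ["department", "head", "management"]
good = parse_table_names("head, management", known)
noisy = parse_table_names("['Head', 'employees']", known)
empty = parse_table_names("", known)
print(good, noisy, empty)
```

Falling back to every table gives up the benefit of the first step, but it at least puts the second step back on the same footing as the simple chain.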

[Chainlit](https://docs.chainlit.io/overview) is a really cool tool that lets you interact with your application in a chat interface and view the intermediate steps the LLM is performing. 

In this case, I think the step that identifies the right table(s) will be key in a production app, but this just wasn't working for me. So there is one more "out-of-the-box" LangChain feature to try.