# Prompt Templates

Now that I have my vector database set up, I want to pull down the most likely schema and table and save the metadata as variables. Then I think the best way to proceed is to create a prompt template that accepts the variables and provides a consistent entry point to the model.

## Imports

In [None]:
import pandas as pd
import os
from dotenv import load_dotenv

from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
from langchain import HuggingFaceHub

from langchain import SQLDatabase, SQLDatabaseChain
from langchain.chains import SQLDatabaseSequentialChain
from langchain.prompts.prompt import PromptTemplate

## Load Top Results from VectorDB

In [None]:
#setup embeddings using HuggingFace and the directory location
embeddings  = HuggingFaceEmbeddings()
persist_dir = '../data/processed/chromadb/schema-table-split'

# load from disk
vectordb = Chroma(persist_directory=persist_dir, embedding_function=embeddings)

In [None]:
#run prompt as query and get most likely results
query = "How many heads of the departments are older than 56?" #first prompt from the training data

result = vectordb.similarity_search(query, k=1)

In [None]:
result

## Save Metadata to Variables

In [None]:
top_schema = result[0].metadata['schema']
top_table = result[0].metadata['table']
table_cols = result[0].metadata['columns']

print(top_schema, top_table, table_cols)
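Indexing `result[0]` assumes the search returned at least one document and that its metadata carries all three keys. Here is a minimal sketch of a more defensive version. `FakeDoc` is a hypothetical stand-in for the document object (only its `metadata` dict is used), `extract_metadata` is a name I made up, and the sample column names are invented for illustration.

```python
from dataclasses import dataclass, field

REQUIRED_KEYS = ("schema", "table", "columns")


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a search hit; only `.metadata` matters here."""
    metadata: dict = field(default_factory=dict)


def extract_metadata(result):
    """Return (schema, table, columns) from the top hit, or raise a clear error."""
    if not result:
        raise ValueError("similarity search returned no documents")
    metadata = result[0].metadata
    missing = [key for key in REQUIRED_KEYS if key not in metadata]
    if missing:
        raise KeyError(f"top hit is missing metadata keys: {missing}")
    return tuple(metadata[key] for key in REQUIRED_KEYS)


# Illustrative values only; the real ones come from the vector database.
sample = [FakeDoc({"schema": "department_management", "table": "head", "columns": "head_ID, name, age"})]
top_schema, top_table, table_cols = extract_metadata(sample)
print(top_schema, top_table, table_cols)
```

With this in place, an empty search result fails with a readable message instead of an `IndexError`.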

## Build SQL Agent Through HF Hub

The docs on langchain are focused around either accessing the OpenAI API or a local model. I'm hoping I can mix both and access a Hugging Face model through an API without having to store it locally. I think I can do this through the Hugging Face Hub by combining the steps in different parts of the documentation:
- [Langchain - Hugging Face Hub](https://python.langchain.com/docs/modules/model_io/models/llms/integrations/huggingface_hub)
- [Langchain - SQL Agents](https://python.langchain.com/docs/modules/chains/popular/sqlite)

I do have the start of the locally stored method commented out below.

### Load API Token

In [None]:
load_dotenv()
hf_api_token = os.getenv('hf_token')
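`os.getenv` returns `None` when the variable is missing, so a typo in the `.env` file only shows up later as an authentication error. A small sketch of a fail-fast check follows. The helper name `require_env` and the demo variable `DEMO_HF_TOKEN` are hypothetical, not part of the notebook.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"environment variable {name!r} is not set; check your .env file")
    return value


# Demo with a throwaway variable (hypothetical, not the notebook's real token).
os.environ["DEMO_HF_TOKEN"] = "not-a-real-token"
demo_token = require_env("DEMO_HF_TOKEN")
```

In the notebook this could be called as `require_env('hf_token')` right after `load_dotenv()`.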

In [None]:
#add path to HF repo
repo_id = 'tiiuae/falcon-7b-instruct'

### Point to LLM Model

In [None]:
#establish llm model
llm = HuggingFaceHub(repo_id=repo_id, huggingfacehub_api_token=hf_api_token, model_kwargs={"temperature": 0.5, "max_length": 128})

### Setup Simple SQL Agent
Langchain has some awesome features, but I'll start with a simple chain that points to one table.

First, I need to use the output of my vector db query to point to the sqlite database we want to work with.

#### Point to sqlite db location

In [None]:
BASE_DIR = os.path.dirname(os.path.abspath('../data/processed/db/department_management.sqlite'))
db_path = os.path.join(BASE_DIR, "department_management.sqlite")

db_path

#### Establish db

In [None]:
db = SQLDatabase.from_uri("sqlite:///" + db_path)
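As a side note, splitting the path into a directory and re-joining the filename lands on the same string that `os.path.abspath` returns directly. The sketch below just compares the two, using the same relative path as above. It does not touch the file system beyond reading the current working directory.

```python
import os

relative_path = "../data/processed/db/department_management.sqlite"

# Two-step version: take the directory, then re-attach the filename.
base_dir = os.path.dirname(os.path.abspath(relative_path))
two_step = os.path.join(base_dir, "department_management.sqlite")

# One-step equivalent: abspath already returns the full normalized path.
one_step = os.path.abspath(relative_path)

# URI in the form passed to SQLDatabase.from_uri above.
db_uri = "sqlite:///" + one_step
print(db_uri)
```

On Linux or macOS the absolute path starts with `/`, so the URI ends up with four slashes, which is the usual form for an absolute SQLite path.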

#### Create Prompt Template
I'll even specify the single table here to really spoon-feed the model.

In [None]:
sql_agent_prompt_template = """You are an expert data analyst. Given an input question, first write a syntactically correct {dialect} query to run, then look at the results of the query and return and describe the answer.
Use the following format:

Question: Question here
SQLQuery: SQL Query to run
SQLResult: Result of the SQLQuery
Answer: Final answer here

Only use the following tables:
{table_info}

Question: {input}"""

In [None]:
sql_prompt = PromptTemplate(
    input_variables=["input", "table_info", "dialect"], template=sql_agent_prompt_template)
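To preview what the model receives, the template can be filled with plain `str.format`. This is a sketch that assumes the prompt template substitutes simple `{name}` placeholders the same way, which I believe is the default behavior but have not confirmed for every langchain version.

```python
template = """You are an expert data analyst. Given an input question, first write a syntactically correct {dialect} query to run, then look at the results of the query and return and describe the answer.
Use the following format:

Question: Question here
SQLQuery: SQL Query to run
SQLResult: Result of the SQLQuery
Answer: Final answer here

Only use the following tables:
{table_info}

Question: {input}"""

# Fill the three placeholders with the values used in this notebook.
preview = template.format(
    dialect="sqlite",
    table_info="head",
    input="How many heads of the departments are older than 56?",
)
print(preview)
```

One thing to keep in mind with this style of template: any literal curly braces in the text would need to be doubled (`{{` and `}}`) so they are not treated as placeholders.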

#### Setup db_chain

In [None]:
db_chain = SQLDatabaseChain.from_llm(llm, db, prompt=sql_prompt, verbose=True)

#### Test with our simple question

In [None]:
db_chain.run(sql_prompt.format(input="How many heads of the departments are older than 56?", table_info="head", dialect="sqlite"))

**Observations:**
This simple chain method is working, although not as cleanly as I would like. The chain is inconsistent in returning "5" as the answer vs describing it with something like "There are 5 department heads older than 56."

I'm also struggling with the prompt template. I suspect the langchain SQL chain expects some preset format, because any time I try to change anything structurally, it errors out. Unfortunately, this means the output usually includes "Question:" at the end. I'll need to check the documentation and online discussions a bit more to see if I can get around that.
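Until I find a cleaner fix, one workaround is to trim the leaked label after the fact. The sketch below is plain string handling, not a langchain feature. `clean_answer` is a hypothetical helper, and the sample string is an illustration of the kind of output described above.

```python
# Labels from the prompt format that sometimes leak into the model's output.
SCAFFOLD_LABELS = ("Question:", "SQLQuery:", "SQLResult:")


def clean_answer(raw):
    """Keep only the text before the first leaked scaffold label."""
    cut = len(raw)
    for label in SCAFFOLD_LABELS:
        position = raw.find(label)
        if position != -1:
            cut = min(cut, position)
    return raw[:cut].strip()


# Illustrative output with a trailing "Question:" label.
cleaned = clean_answer("5.\n\nThe query returns a count of 5 heads of departments older than 56\nQuestion:")
print(cleaned)
```

This does nothing about the "5" versus full-sentence inconsistency; it only removes the trailing scaffold text.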

## Create Functions
Even though it needs some cleanup, this technically works, which is great. But it is very segmented, so I want to pull the steps together so I don't have to manually change variables in each step.

In [None]:
def db_select(persist_dir, query):
    """
    Load the schema info vector database from disk and run the input question against it to return the most likely database we need to pull data from.
    """
    #setup embeddings using HuggingFace and the directory location
    embeddings  = HuggingFaceEmbeddings()
    per_dir = persist_dir

    # load from disk
    vectordb = Chroma(persist_directory=per_dir, embedding_function=embeddings)

    #run prompt as query and get most likely results
    result = vectordb.similarity_search(query, k=1)

    #save variables
    top_schema = result[0].metadata['schema']
    top_table = result[0].metadata['table']
    table_cols = result[0].metadata['columns']

    top_result = (top_schema, top_table, table_cols)

    return top_result

In [None]:
top_result = db_select(persist_dir='../data/processed/chromadb/schema-table-split', query='How many department heads are older than 56?')
top_result[0]
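Reading `top_result[0]` works, but the index says nothing about what it holds. A possible alternative is a `NamedTuple`, which still unpacks and indexes like the plain tuple. This is a sketch only: `TableMatch` is a hypothetical name, the values are illustrative, and I am assuming the columns metadata is stored as a string.

```python
from typing import NamedTuple


class TableMatch(NamedTuple):
    """Hypothetical container for the top vector-search hit."""
    schema: str
    table: str
    columns: str


# Illustrative values; db_select would build this from result[0].metadata.
match = TableMatch(schema="department_management", table="head", columns="head_ID, name, age")

by_name = match.schema          # readable access
by_index = match[0]             # old-style access still works
schema, table, columns = match  # so does unpacking
print(by_name, table)
```

Because index access still works, a function that returns a `TableMatch` would not break existing callers that use `top_result[0]`.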

In [None]:
def locate_and_connect_db(filepath, filename):
    """Locate the absolute path of the given SQLite database and connect to it via the langchain SQLDatabase.from_uri method.
    filepath is the directory that holds the database within the repo, example '../data/db/'
    filename is just the filename.filetype, example 'data.db'
    """
    base_dir = os.path.dirname(os.path.abspath(os.path.join(filepath, filename))) #get the full path within the device
    db_path = os.path.join(base_dir, filename) #combine with filename to get db_path
    db = SQLDatabase.from_uri("sqlite:///" + db_path) #connect via the langchain method

    return db
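One risk with this step: as far as I know, SQLite will happily create a new empty database if the file does not exist, so a wrong schema name from the vector search may only surface later as a "no such table" error. Here is a sketch of a check that fails earlier. `resolve_sqlite_uri` is a hypothetical helper, and the demo uses a temporary directory rather than the repo's data folder.

```python
import os
import tempfile


def resolve_sqlite_uri(db_root, filename):
    """Build a sqlite URI for db_root/filename, raising if the file does not exist."""
    db_path = os.path.abspath(os.path.join(db_root, filename))
    if not os.path.isfile(db_path):
        raise FileNotFoundError(f"no database file at {db_path}")
    return "sqlite:///" + db_path


# Demo against a temporary directory so the example does not depend on the repo layout.
with tempfile.TemporaryDirectory() as tmp_dir:
    open(os.path.join(tmp_dir, "demo.sqlite"), "w").close()
    found_uri = resolve_sqlite_uri(tmp_dir, "demo.sqlite")
    try:
        resolve_sqlite_uri(tmp_dir, "missing.sqlite")
        missing_raised = False
    except FileNotFoundError:
        missing_raised = True

print(found_uri, missing_raised)
```

The returned URI could then be passed to `SQLDatabase.from_uri` in place of the string built inside the function.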

In [None]:
def load_llm_model(env_variable, repo_id='tiiuae/falcon-7b-instruct', temp=0.5, max_length=64):
    """
    Take in the target hugging face repo and the api key to setup the llm to use in our query chain
    env_variable is the name of the variable that stores your API key in your .env file
    """
    load_dotenv()
    hf_api_token = os.getenv(env_variable)
    llm = HuggingFaceHub(repo_id=repo_id, huggingfacehub_api_token=hf_api_token, model_kwargs={"temperature": temp, "max_length": max_length})

    return llm

In [None]:
def create_sql_chain(db, llm, verbose=True, input_vars=['input', 'table_info', 'dialect']
                     , prompt_template=
                        """You are an expert data analyst. Given an input question, first write a syntactically correct {dialect} query to run, then look at the results of the query and return and describe the answer.
Use the following format:

Question: Question here
SQLQuery: SQL Query to run
SQLResult: Result of the SQLQuery
Answer: Final answer here

Only use the following tables:
{table_info}

Question: {input}"""
                    ):
    """Take in prompt template and input variables, along with the llm and db created from other functions to create our SQL Chain"""
    sql_prompt = PromptTemplate(input_variables=input_vars, template=prompt_template) #create our prompt template
    db_chain = SQLDatabaseChain.from_llm(llm, db, prompt=sql_prompt, verbose=verbose) #create our database chain
    
    return db_chain

In [None]:
def run_chain(db_chain, question, table, sql_dialect='sqlite'):
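    """Run the chain on the question, filling the chain's own prompt template with the question, table, and SQL dialect, and return the chain's answer."""
    prompt = db_chain.llm_chain.prompt #assumes the chain exposes its prompt here; avoids relying on a global sql_prompt
    return db_chain.run(prompt.format(input=question, table_info=table, dialect=sql_dialect))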

In [None]:
def sql_analyst(question, vector_db_path='../data/processed/chromadb/schema-table-split', db_root_path='../data/processed/db/', sql_dialect='sqlite', env_api_key_var='hf_token'):
    """Take in the user's question plus the locations of the databases and the API key, then answer the question using the SQL chain."""
    
    #query our vector database with the question to get the schema most related to the question.
    top_result = db_select(persist_dir=vector_db_path, query=question)
    top_schema = top_result[0]
    top_table = top_result[1]

    #use this result to establish our database
    db = locate_and_connect_db(filepath=db_root_path, filename=top_schema+'.'+sql_dialect)

    #initialize our large language model - needs an API key - we'll use the standard variables
    llm = load_llm_model(env_variable=env_api_key_var)

    #setup the sql chain using the standard variables
    sql_chain = create_sql_chain(db=db, llm=llm)

    #run sql_chain on question
    run_chain(db_chain=sql_chain, question=question, table=top_table, sql_dialect=sql_dialect)

#### Test

In [63]:
sql_analyst("How many heads of the departments are older than 56?")



> Entering new  chain...
You are an expert data analyst. Given an input question, first write a syntactically correct sqlite query to run, then look at the results of the query and return and describe the answer.
Use the following format:

Question: Question here
SQLQuery: SQL Query to run
SQLResult: Result of the SQLQuery
Answer: Final answer here

Only use the following tables:
head

Question: How many heads of the departments are older than 56?
SQLQuery:SELECT COUNT(*) FROM head WHERE age > 56.
SQLResult: [(5,)]
Answer:5.

The query returns a count of 5 heads of departments older than 56
> Finished chain.

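The generated query ends with a stray period after `56`. It ran here, but it seems safer to tidy the SQL before trusting it and to cross-check the count by running the query directly. The sketch below does both against an in-memory SQLite table. The table layout and rows are invented for the demo and are not the real `head` data.

```python
import sqlite3

# In-memory stand-in for the real database; these rows are invented for the demo.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE head (name TEXT, age REAL)")
connection.executemany(
    "INSERT INTO head (name, age) VALUES (?, ?)",
    [("A", 67.0), ("B", 52.0), ("C", 69.0), ("D", 43.0)],
)

generated_sql = "SELECT COUNT(*) FROM head WHERE age > 56."

# Drop the trailing period (and any trailing spaces) the model appended.
sanitized_sql = generated_sql.rstrip(". ")

(count,) = connection.execute(sanitized_sql).fetchone()
connection.close()
print(sanitized_sql, count)
```

Pointing `sqlite3.connect` at the real database file instead of `:memory:` would give an independent check on the chain's answer.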