In this notebook, we test the candidate generation module.

In [1]:
# IMPORTs
from utils.task import Task
import json
import os
from src.pipeline.candidate_generation import candidate_generation
from utils.database_utils.db_info import get_db_schema
from dotenv import load_dotenv
from utils.prompt import load_prompt
import tiktoken

load_dotenv()

True

In [2]:
# Function to load JSON data
def load_json_data(filepath):
    with open(filepath, 'r') as file:
        data = json.load(file)
    return data


# Function to create task object
def create_task(example):
    return Task(example)

In [3]:
# Load the task data (forward slashes avoid invalid escape sequences such as "\y" and "\B")
filepath = "C:/Users/yousf/Bureau/ConvergenceAI/CHESS_Impl/data/test/subsampled_test.json"
data = load_json_data(filepath)
# Load the retrieved entities
filepath_entities = "C:/Users/yousf/Bureau/ConvergenceAI/CHESS_Impl/data/test/retrieved_entities.json"
retrieved_entities = load_json_data(filepath_entities)
# Load the retrieved context
filepath_context = "C:/Users/yousf/Bureau/ConvergenceAI/CHESS_Impl/data/test/retrieved_context.json"
retrieved_context = load_json_data(filepath_context)
# Load the selected schema
filepath_schema = "C:/Users/yousf/Bureau/ConvergenceAI/CHESS_Impl/data/test/selected_schema.json"
tentative_schema = load_json_data(filepath_schema)

In [4]:
# Test the candidate generation module on a single sample (fixed index, not random)
index = 1
example = data[index]
task = create_task(example)

# Get the database schema
db_path = os.path.join(os.getenv("DB_ROOT_PATH"), task.db_id, f"{task.db_id}.sqlite")
schema = get_db_schema(db_path)
model = "llama-3"
ans = candidate_generation(task=task, retrieved_entities=retrieved_entities[index],
                           retrieved_context=retrieved_context[index], selected_schema=schema,
                           model=model, num_samples=1)
ans

You are a data science expert.
Below, you are presented with a database schema and a question.
Your task is to read the schema, understand the question, and generate a valid SQLite query to answer the question.
Before generating the final SQL query think step by step on how to write the query.

Database Schema
###
CREATE TABLE frpm
(
	CDSCode TEXT not null primary key,
	`Academic Year` TEXT null, --
	`County Code` TEXT null, -- examples: `02`
	`District Code` INTEGER null, --
	`School Code` TEXT null, -- description: School Code
	`County Name` TEXT null, -- examples: `Merced`
	`District Name` TEXT null, --
	`School Name` TEXT null, --
	`District Type` TEXT null, --
	`School Type` TEXT null, --
	`Educational Option Type` TEXT null, --
	`NSLP Provision Status` TEXT null, -- examples: `Provision 2`, `Lunch Provision 2`
	`Charter School (Y/N)` INTEGER null, --
	`Charter School Number` TEXT null, --
	`Charter Funding Type` TEXT null, --
	IRC INTEGER null, --
	`Low Grade` TEXT null, --
	`Hig

{'chain_of_thought_reasoning': "First, I identified the relevant columns from the question: CDSCode, County Code, School Type, Low Grade, High Grade, and County Name. Then, I determined that the question is asking for the city location of a high school level school with Lunch Provision 2, whose lowest grade is 9 and the highest grade is 12 in the county of Merced. I used the hint that high school can be represented as EILCode = 'HS'. I joined the frpm and schools tables based on the CDSCode, and then filtered the results to match the conditions specified in the question. Finally, I selected the City column from the schools table, which is the column that provides the city location.",
 'SQL': "SELECT City FROM schools WHERE EILCode = 'HS' AND County = 'Merced' AND LowGrade = '9' AND HighGrade = '12' AND NSLPProvisionStatus = 'Lunch Provision 2' AND CDSCode IN (SELECT CDSCode FROM frpm WHERE CountyCode = '02');"}
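
A generated candidate is only useful if SQLite can actually prepare it against the target database. The following is a minimal sketch of such a check, not part of the module under test: the helper name `compiles` and the toy `schools` table are hypothetical, and an in-memory database stands in for the real `.sqlite` file.

```python
import sqlite3


def compiles(conn, sql):
    """Return (True, None) if SQLite can prepare the query, else (False, error message)."""
    try:
        # EXPLAIN makes SQLite parse and plan the statement without running the query itself
        conn.execute("EXPLAIN " + sql)
        return True, None
    except sqlite3.Error as exc:
        return False, str(exc)


# Toy in-memory database standing in for the real .sqlite file (assumption)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schools (CDSCode TEXT PRIMARY KEY, City TEXT, County TEXT)")

ok, err = compiles(conn, "SELECT City FROM schools WHERE County = 'Merced';")
bad_ok, bad_err = compiles(conn, "SELECT City FROM schools WHERE NoSuchColumn = 1;")
```

In the notebook, the same helper could be pointed at a connection opened on the task's database file and given `ans['SQL']`, assuming the result keeps the dictionary shape shown above.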

## Cost Estimation per task

In [5]:

PROMPT_PATH = os.path.join(os.getenv("PROMPT_ROOT_PATH"), "candidate_generation.txt")
prompt = load_prompt(PROMPT_PATH)

In [6]:

def tokens_calc(example):
    encoding = tiktoken.get_encoding("cl100k_base")
    num_tokens = len(encoding.encode(example))
    return num_tokens

In [7]:
# Token count of the prompt template
tokens_calc(prompt)

631

The prompt template has 631 tokens in total, and it is filled with three other variables (Database_Schema, Question, and Hint).
A formatted prompt example contains about 2000 tokens, mostly because the database schema is long.
So let's assume 2000 input tokens per call.
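
The 2000-token figure can be checked by filling the template and counting the result. Below is a minimal sketch, assuming the template marks its variables as `{DATABASE_SCHEMA}`, `{QUESTION}`, and `{HINT}` (hypothetical placeholder names; the real template may use different ones). The token counter is passed in as a parameter, and a whitespace counter stands in for a real tokenizer in the toy example.

```python
def estimate_prompt_tokens(template, variables, count_tokens):
    """Fill the template and return (filled_prompt, token_count).

    `variables` maps placeholder names to values; the placeholder names
    used below are assumptions, not necessarily those of the real template.
    """
    filled = template
    for name, value in variables.items():
        # Plain replacement avoids str.format failing on literal braces in the prompt
        filled = filled.replace("{" + name + "}", value)
    return filled, count_tokens(filled)


# Toy example: a whitespace counter stands in for a real tokenizer
toy_template = "Schema: {DATABASE_SCHEMA} Question: {QUESTION} Hint: {HINT}"
toy_values = {
    "DATABASE_SCHEMA": "CREATE TABLE frpm (CDSCode TEXT)",
    "QUESTION": "Which city?",
    "HINT": "none",
}
filled_prompt, n_tokens = estimate_prompt_tokens(
    toy_template, toy_values, count_tokens=lambda text: len(text.split())
)
```

In this notebook, passing `count_tokens=tokens_calc` would give the tiktoken-based count instead of the whitespace approximation.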

In [8]:
# Estimate the output tokens from an example
output_example = """
{'chain_of_thought_reasoning': "First, I identified the relevant columns from the question: CDSCode, County Code, School Type, Low Grade, High Grade, and County Name. Then, I determined that the question is asking for the city location of a high school level school with Lunch Provision 2, whose lowest grade is 9 and the highest grade is 12 in the county of Merced. I used the hint that high school can be represented as EILCode = 'HS'. I joined the frpm and schools tables based on the CDSCode, and then filtered the results to match the conditions specified in the question. Finally, I selected the City column from the schools table, which is the column that provides the city location.",
 'SQL': "SELECT City FROM schools WHERE EILCode = 'HS' AND County = 'Merced' AND LowGrade = '9' AND HighGrade = '12' AND NSLPProvisionStatus = 'Lunch Provision 2' AND CDSCode IN (SELECT CDSCode FROM frpm WHERE CountyCode = '02');"}
"""
tokens_calc(output_example)

227

Rounding up from the 227 tokens measured above, let's assume 250 output tokens per call.

In [9]:
# Price calculation (GPT-4 only, because this module does not use GPT-3.5)
input_price_per_token_gpt4 = 0.01 / 1000
output_price_per_token_gpt4 = 0.03 / 1000
price_gpt4 = 2000 * input_price_per_token_gpt4 + 250 * output_price_per_token_gpt4
print("estimated price per Task (GPT-4):", price_gpt4, "$")

estimated price per Task (GPT-4): 0.0275 $


In [10]:
# This module may retry generation several times, so let's estimate the price for different numbers of retrials

num_retrials = [1, 2, 3, 4, 5]
for num_retrial in num_retrials:
    total_price = price_gpt4 * num_retrial
    print("The estimated price with number of retrials of "+str(num_retrial)+": ",total_price)

The estimated price with number of retrials of 1:  0.0275
The estimated price with number of retrials of 2:  0.055
The estimated price with number of retrials of 3:  0.0825
The estimated price with number of retrials of 4:  0.11
The estimated price with number of retrials of 5:  0.1375
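
The estimate above amounts to cost = n × (input_tokens × input_price + output_tokens × output_price), with prices expressed per token. A small sketch that wraps this in one function follows; `estimate_cost` and its parameter names are hypothetical, the default prices are the ones used in this notebook, and it assumes, as the loop above does, that every retrial sends a full-size prompt and returns a similar-length answer.

```python
def estimate_cost(input_tokens, output_tokens,
                  input_price_per_1k=0.01, output_price_per_1k=0.03,
                  num_calls=1):
    """Estimated price in dollars for `num_calls` calls of the same size."""
    per_call = (input_tokens * input_price_per_1k
                + output_tokens * output_price_per_1k) / 1000
    return per_call * num_calls


# Same scenario as above: 2000 input tokens, 250 output tokens, 1 to 5 calls
for n in range(1, 6):
    print(f"{n} call(s): ${estimate_cost(2000, 250, num_calls=n):.4f}")
```

Keeping the prices as parameters makes it easy to rerun the estimate when the pricing or the model changes.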
