In this notebook we test the column filtering module.

In [4]:
# IMPORTs
from utils.task import Task
import json
from src.pipeline.column_filtering import column_filtering
from utils.prompt import load_prompt
import os
import tiktoken

In [2]:
# Function to load JSON data
def load_json_data(filepath):
    with open(filepath, 'r') as file:
        data = json.load(file)
    return data


# Function to create task object
def create_task(example):
    return Task(example)


In [3]:
# Test the column filtering module on a single sample (index 1)
# Raw strings keep the Windows backslashes from being read as escape sequences
filepath = r"C:/Users\yousf\Bureau\ConvergenceAI\CHESS_Impl\data/test/subsampled_test.json"
data = load_json_data(filepath)
filepath_entities = r"C:/Users\yousf\Bureau\ConvergenceAI\CHESS_Impl\data/test/retrieved_entities.json"
retrieved_entities = load_json_data(filepath_entities)[1]
example = data[1]
task = create_task(example)
llm = "llama-3"
ans = column_filtering(task, retrieved_entities, llm)
ans

{'filtered_schema': {'frpm': ['CDSCode',
   'County Code',
   'County Name',
   'NSLP Provision Status'],
  'satscores': ['cds', 'cname'],
  'schools': ['CDSCode',
   'County',
   'City',
   'MailCity',
   'SOC',
   'EILCode',
   'EILName',
   'GSserved',
   'School']}}
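
As a quick sanity check on this output, the number of columns kept per table can be counted directly from the returned dictionary. The sketch below hard-codes the result shown above so that it runs without the pipeline; `summarize_schema` is an illustrative helper name, not part of the project.

```python
# Result copied from the cell output above, so the sketch runs without the pipeline.
ans = {
    "filtered_schema": {
        "frpm": ["CDSCode", "County Code", "County Name", "NSLP Provision Status"],
        "satscores": ["cds", "cname"],
        "schools": ["CDSCode", "County", "City", "MailCity", "SOC",
                    "EILCode", "EILName", "GSserved", "School"],
    }
}


def summarize_schema(result):
    """Return {table: number of kept columns} (hypothetical helper)."""
    return {table: len(cols) for table, cols in result["filtered_schema"].items()}


kept = summarize_schema(ans)
total_kept = sum(kept.values())
print(kept, "total:", total_kept)
```

Each kept column corresponds to one LLM call that answered "Yes", so this count is also the size of the schema passed on to the next stage.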

## Cost Estimation per task

In [3]:
# Build the prompt path portably; assumes PROMPT_ROOT_PATH is set in the environment
PROMPT_PATH = os.path.join(os.getenv("PROMPT_ROOT_PATH"), "column_filtering.txt")
prompt = load_prompt(PROMPT_PATH)

In [5]:
def tokens_calc(example):
    encoding = tiktoken.get_encoding("cl100k_base")
    num_tokens = len(encoding.encode(example))
    return num_tokens

In [6]:
# Number of tokens in the prompt template
tokens_calc(prompt)

3860

The prompt template alone contains 3860 tokens. Each call also fills in three variables (Column Profile, Question and Hint),
so we assume roughly 4000 input tokens per column.

In [7]:
## Output tokens estimation from an example 
output_example = """```json
{{
  "chain_of_thought_reasoning": "The question seeks to identify the best-selling app and its sentiments polarity, with the hint specifying the calculation for "best selling" as the maximum product of Price and Installs. The Price column is crucial for this computation as it provides the price at which each app is sold, which, when multiplied by the number of installs, helps determine the app's total revenue. This makes the Price column directly relevant to identifying the best-selling app according to the hint's criteria.",
  "is_column_information_relevant": "Yes"
}}
"""
tokens_calc(output_example)

120

We assume about 150 output tokens per column, which leaves some margin over the 120 tokens measured on this example.

In [16]:
## Price calculation per column
# Prices in USD per token, derived from the per-1,000-token prices
input_price_per_token_gpt4 = 0.01 / 1000
output_price_per_token_gpt4 = 0.03 / 1000
input_price_per_token_gpt3 = 0.0015 / 1000
output_price_per_token_gpt3 = 0.002 / 1000
INPUT_TOKENS = 4000  # assumed input tokens per column
OUTPUT_TOKENS = 150  # assumed output tokens per column
price_gpt4 = INPUT_TOKENS * input_price_per_token_gpt4 + OUTPUT_TOKENS * output_price_per_token_gpt4
price_gpt3 = INPUT_TOKENS * input_price_per_token_gpt3 + OUTPUT_TOKENS * output_price_per_token_gpt3
print("estimated price per Column(GPT-4):", price_gpt4, "$")
print("estimated price per Column(GPT-3.5):", price_gpt3, "$")

estimated price per Column(GPT-4): 0.0445 $
estimated price per Column(GPT-3.5): 0.0063 $
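
The same arithmetic can be wrapped in a small helper so that other token counts or price lists can be tried quickly. This is only a sketch: `cost_per_column` and its parameter names are illustrative, and the prices are the per-1,000-token figures assumed in the cell above.

```python
def cost_per_column(input_tokens, output_tokens, input_price_per_1k, output_price_per_1k):
    """Estimated cost in USD of one column-filtering call (hypothetical helper)."""
    total_per_1k = input_tokens * input_price_per_1k + output_tokens * output_price_per_1k
    return total_per_1k / 1000


# Token counts and prices taken from the assumptions above.
gpt4_cost = cost_per_column(4000, 150, 0.01, 0.03)
gpt35_cost = cost_per_column(4000, 150, 0.0015, 0.002)
print(f"GPT-4: {gpt4_cost:.4f} $ per column")
print(f"GPT-3.5: {gpt35_cost:.4f} $ per column")
```

With 4000 input tokens against 150 output tokens, the input side dominates the cost for both models, so shortening the prompt template would have the largest effect.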


This price covers a single column. For each task, the operation is repeated for every column in the
database, so let's estimate the price per task as a function of the total number of columns.

In [18]:
cols_estimation = [10, 20, 50, 100, 200]
for col in cols_estimation:
    # The cost grows linearly with the number of columns in the database
    total_price_gpt4 = price_gpt4 * col
    print(f"Estimated Price per task with a database of {col} columns in total(GPT-4):", total_price_gpt4, "$")
    total_price_gpt3 = price_gpt3 * col
    print(f"Estimated Price per task with a database of {col} columns in total(GPT-3.5):", total_price_gpt3, "$\n")

Estimated Price per task with a database of 10 columns in total(GPT-4): 0.44499999999999995 $
Estimated Price per task with a database of 10 columns in total(GPT-3.5): 0.063 $

Estimated Price per task with a database of 20 columns in total(GPT-4): 0.8899999999999999 $
Estimated Price per task with a database of 20 columns in total(GPT-3.5): 0.126 $

Estimated Price per task with a database of 50 columns in total(GPT-4): 2.225 $
Estimated Price per task with a database of 50 columns in total(GPT-3.5): 0.315 $

Estimated Price per task with a database of 100 columns in total(GPT-4): 4.45 $
Estimated Price per task with a database of 100 columns in total(GPT-3.5): 0.63 $

Estimated Price per task with a database of 200 columns in total(GPT-4): 8.9 $
Estimated Price per task with a database of 200 columns in total(GPT-3.5): 1.26 $


In most cases the database will have more than 50 columns, so the price per task will exceed 2 $ with GPT-4
(about 0.3 $ with GPT-3.5). That is very expensive, especially for what is the most inefficient module in the pipeline.
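
Read the other way around, the per-column prices give the largest database that fits a fixed budget per task. The sketch below assumes a budget of 1 $ per task, a hypothetical figure chosen only for illustration, and reuses the per-column estimates computed earlier; `max_columns` is an illustrative name.

```python
import math

# Per-column estimates from the price calculation above (USD).
PRICE_PER_COLUMN = {"GPT-4": 0.0445, "GPT-3.5": 0.0063}

BUDGET = 1.0  # hypothetical budget per task, in USD


def max_columns(budget, price_per_column):
    """Largest number of columns whose total cost stays within the budget."""
    return math.floor(budget / price_per_column)


limits = {model: max_columns(BUDGET, price) for model, price in PRICE_PER_COLUMN.items()}
for model, limit in limits.items():
    print(f"{model}: at most {limit} columns for {BUDGET} $ per task")
```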