# Table selection

In [1]:
%load_ext dotenv
%dotenv

In [2]:
from src.utils.table_selection.table_selector import *
from src.utils.table_selection.table_details import *
from src.utils.api_data_request.api_generator import *
from src.utils.data_analysis.data_analysis import *

In [3]:
table_descriptions = get_table_schemas()
print(table_descriptions, "\n")

table name: Consumer Price Index - CPI
table description: Table 'Consumer Price Index - CPI' has price index for products, encompassing over 200 expenditure categories grouped into major segments (food and beverages, housing, apparel, recreation, and other goods).
table dimensions: Year (periodicity of the data), Month and Year (periodicity of the data with the format YYYYMM (example March of 2015 is 201503)), Level 1.1 (most general level of products and services), Level 2.2, Level 3.3, Level 4.4, Level 5.5 (most detailed level of products and services)
table measures: Consumer Price Inde (consumer price index), Standard Error (standard error of CPI), Percent Change (percentage change)


table name: dot_faf
table description: Table 'dot_faf' has freight movement among states and major metropolitan areas by all modes of transportation. Shows which goods are shipped from one region of the US to another region, according to type of commodity, mode of shipment, value, and weight.
table di

In [4]:
get_table_names()

['Consumer Price Index - CPI', 'dot_faf', 'ed_defaults', 'pums_5']

In [5]:
query = 'what was the CPI of electronics in March of 2014?'

#### Getting table from LLM (OpenAI's API)

In [6]:
table_llm = get_relevant_tables_from_lm(query)


Messages:
 [{'role': 'user', 'content': 'What service presented the highest growth in the consumer price index during the last year?'}, {'role': 'assistant', 'content': '\n{\n    "tables": ["Consumer Price Index - Services"]\n}\n'}, {'role': 'user', 'content': 'What is the average monthly CPI volatility for all product groups combined?'}, {'role': 'assistant', 'content': '\n{\n    "tables": ["Consumer Price Index - Products"]\n}\n'}, {'role': 'user', 'content': 'Which states or metropolitan areas are the primary origins for specific products, and which are the primary destinations?'}, {'role': 'assistant', 'content': '\n{\n    "tables": ["Consumer Price Index - Products"]\n}\n'}, {'role': 'user', 'content': 'What is the trend in total freight movement volume across all transportation modes over the past five years?'}, {'role': 'assistant', 'content': '\n{\n    "tables": ["dot_faf", "us_location_data"]\n}\n'}, {'role': 'user', 'content': 'What is the trend in total freight movement volu

In [7]:
print(table_llm)


['Consumer Price Index - CPI']


#### Getting table from DB (with embeddings)

In [8]:
table_db = get_relevant_tables_from_database(query)

DF:                    table_name  similarity
0  Consumer Price Index - CPI    0.399441
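The `similarity` column above suggests that the query embedding is scored against one embedding per table. The internals of `get_relevant_tables_from_database` are not shown here, so the following is only a minimal sketch that assumes cosine similarity; the helper names and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_tables(query_vec, table_vecs, top_k=1):
    """Return the top_k (table_name, similarity) pairs, highest similarity first."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in table_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional vectors, made up for illustration; real embeddings are much larger.
toy_vectors = {
    "Consumer Price Index - CPI": [0.9, 0.1, 0.0],
    "dot_faf": [0.1, 0.9, 0.2],
    "pums_5": [0.2, 0.3, 0.9],
}
best = rank_tables([1.0, 0.2, 0.1], toy_vectors, top_k=1)
print(best[0][0])
```

With `top_k=1` this mirrors the single-row result of cell 8; a larger `top_k` gives the candidate list used by the hybrid approach.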


In [9]:
get_table_schemas(table_db)

"table name: Consumer Price Index - CPI\ntable description: Table 'Consumer Price Index - CPI' has price index for products, encompassing over 200 expenditure categories grouped into major segments (food and beverages, housing, apparel, recreation, and other goods).\ntable dimensions: Year (periodicity of the data), Month and Year (periodicity of the data with the format YYYYMM (example March of 2015 is 201503)), Level 1.1 (most general level of products and services), Level 2.2, Level 3.3, Level 4.4, Level 5.5 (most detailed level of products and services)\ntable measures: Consumer Price Inde (consumer price index), Standard Error (standard error of CPI), Percent Change (percentage change)\n"

#### Getting table with hybrid approach
- Retrieves the top-k matches from the DB by embedding similarity, then asks the LLM to choose the most relevant table among them

In [10]:
table_db_llm = request_tables_to_lm_from_db(query)

DF:                    table_name  similarity
0  Consumer Price Index - CPI    0.399441
1                      pums_5    0.231907
2                     dot_faf    0.204892
Messages:
 [{'role': 'user', 'content': 'What service presented the highest growth in the consumer price index during the last year?'}, {'role': 'assistant', 'content': '\n{\n    "tables": ["Consumer Price Index - Services"]\n}\n'}, {'role': 'user', 'content': 'What is the average monthly CPI volatility for all product groups combined?'}, {'role': 'assistant', 'content': '\n{\n    "tables": ["Consumer Price Index - Products"]\n}\n'}, {'role': 'user', 'content': 'Which states or metropolitan areas are the primary origins for specific products, and which are the primary destinations?'}, {'role': 'assistant', 'content': '\n{\n    "tables": ["Consumer Price Index - Products"]\n}\n'}, {'role': 'user', 'content': 'What is the trend in total freight movement volume across all transportation modes over the past five years?'}
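The two-stage flow of the hybrid approach can be sketched as follows. This is not the implementation of `request_tables_to_lm_from_db`; `hybrid_select` and `fake_llm` are hypothetical names, and the LLM call is replaced by a plain callable so the example runs offline. The guard against table names outside the candidate list is an assumption about what such a function might want, since a model can return names that do not exist.

```python
def hybrid_select(query, ranked_tables, choose_fn, top_k=3):
    """Pick tables in two stages: similarity shortlist, then an LLM-style chooser.

    ranked_tables: list of (table_name, similarity), sorted by similarity descending.
    choose_fn: callable(query, candidates) -> list of table names (stand-in for the LLM).
    """
    candidates = [name for name, _ in ranked_tables[:top_k]]
    chosen = choose_fn(query, candidates)
    # Discard any name the chooser invented that is not in the shortlist.
    valid = [name for name in chosen if name in candidates]
    # Fall back to the best similarity match if nothing valid was chosen.
    return valid or candidates[:1]


# Similarity scores copied from the output of cell 10.
ranked = [
    ("Consumer Price Index - CPI", 0.399441),
    ("pums_5", 0.231907),
    ("dot_faf", 0.204892),
]


def fake_llm(query, candidates):
    """Hypothetical stand-in that returns a table name missing from the shortlist."""
    return ["Consumer Price Index - Services"]


selected = hybrid_select("what was the CPI of electronics in March of 2014?", ranked, fake_llm)
print(selected)
```

Because the stand-in returns a name that is not among the candidates, the function falls back to the top similarity match.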

# API parameter generator

In [11]:
v, m, c = get_api_params_from_lm(query, table_db_llm[0], model = 'gpt-4-1106-preview')


ChatGPT response: - The chosen variables, measures and filters can answer the query because the query is asking for the Consumer Price Index (CPI) of electronics in March of 2014. The variables needed are the time (Year and Month) and the product category (Level 1.1 to Level 5.5). The measure needed is the Consumer Price Index. The filters needed are the specific time (March 2014) and the specific product category (electronics).

- The variable "Year" and "Month and Year" exist in the table under the parent dimension "Time". The variables "Level 1.1" to "Level 5.5" exist in the table under the parent dimension "Products or Services". The measure "Consumer Price Index" exists in the table. The filters "Year = 2014", "Month and Year = 201403" and "Level 1.1 = Electronics" exist in the table.

```

{
    "variables": ["Month and Year", "Level 1.1"],
    "measures": ["Consumer Price Index"],
    "filters": ["Month and Year = 201403", "Level 1.1 = Electronics"]
}

```

Parameters: {
    "v

In [12]:
print("Variables:\n\n", v, "\n", "\nMeasures:\n\n", m, "\n", "\nCuts:\n\n", c)

Variables:

 ['Month and Year', 'Level 1.1'] 
 
Measures:

 ['Consumer Price Index'] 
 
Cuts:

 ['Month and Year = 201403', 'Level 1.1 = Electronics']
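The response in cell 11 mixes free-text reasoning with a fenced JSON object, and the variables, measures and cuts printed above come from that object. How `get_api_params_from_lm` extracts them is not shown, so the sketch below is one possible approach using only the standard library. The function names are hypothetical and the sample text is an abbreviated version of the response.

```python
import json
import re


def extract_json_object(response_text):
    """Return the JSON object embedded in a free-text response.

    Takes everything from the first '{' to the last '}', which is enough
    when the response contains a single object.
    """
    match = re.search(r"\{.*\}", response_text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in response")
    return json.loads(match.group(0))


def parse_cut(cut):
    """Split a filter such as 'Month and Year = 201403' into (dimension, value)."""
    dimension, _, value = cut.partition("=")
    return dimension.strip(), value.strip()


fence = "`" * 3  # built at runtime to avoid a literal code fence inside this example
sample_response = (
    "- The chosen variables, measures and filters can answer the query.\n\n"
    + fence + "\n"
    + '{"variables": ["Month and Year", "Level 1.1"], '
    + '"measures": ["Consumer Price Index"], '
    + '"filters": ["Month and Year = 201403", "Level 1.1 = Electronics"]}\n'
    + fence
)

params = extract_json_object(sample_response)
cuts = [parse_cut(f) for f in params["filters"]]
print(params["variables"], params["measures"], cuts)
```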


In [13]:
api_url = api_build(table = table_db_llm[0], drilldowns = v, measures = m, cuts = c)

In [14]:
api_url

'https://api-ts-zircon.datausa.io/tesseract/data.jsonrecords?cube=Consumer Price Index - CPI&drilldowns=Month and Year,Level 4.4&measures=Consumer Price Inde&Month and Year=201403&Level 4.4=403051619'
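The URL above contains unencoded spaces in the cube, drilldown and measure names. Many HTTP clients encode these automatically, but the URL is safer to log and share when it is encoded explicitly. The sketch below assumes the query-string layout visible in the output (`cube`, `drilldowns`, `measures`, then one parameter per cut); `build_url` is a hypothetical helper, not the `api_build` function used in cell 13.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://api-ts-zircon.datausa.io/tesseract/data.jsonrecords"


def build_url(cube, drilldowns, measures, cuts):
    """Build a percent-encoded request URL.

    cuts: dict mapping a dimension name to the member ID to filter on.
    """
    params = {
        "cube": cube,
        "drilldowns": ",".join(drilldowns),
        "measures": ",".join(measures),
    }
    params.update(cuts)
    # quote_via=quote encodes spaces as %20; safe="," keeps list separators readable.
    return BASE_URL + "?" + urlencode(params, safe=",", quote_via=quote)


url = build_url(
    cube="Consumer Price Index - CPI",
    drilldowns=["Month and Year", "Level 4.4"],
    measures=["Consumer Price Inde"],
    cuts={"Month and Year": "201403", "Level 4.4": "403051619"},
)
print(url)
```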

# API Request

In [15]:
data, df = api_request(api_url)

In [16]:
data

[{'Month and Year ID': 201403,
  'Month and Year': 'March, 2014',
  'Level 4.4 ID': 403051619,
  'Level 4.4': 'Appliances',
  'Consumer Price Inde': -3.299999952316284}]

In [17]:
df

|   | Month and Year ID | Month and Year | Level 4.4 ID | Level 4.4 | Consumer Price Inde |
|---|---|---|---|---|---|
| 0 | 201403 | March, 2014 | 403051619 | Appliances | -3.3 |
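The cuts printed in cell 12 ask for `Level 1.1 = Electronics`, while the records returned in cell 16 are labeled `Appliances` at `Level 4.4`. A small check before the analysis step can surface this kind of mismatch early. The sketch below is a hypothetical helper, not part of the pipeline shown here, and it only compares text labels with a case-insensitive substring match.

```python
def find_label_mismatches(requested_cuts, records):
    """Return the requested cuts whose text label does not appear in any returned record."""
    returned = {str(value).lower() for record in records for value in record.values()}
    mismatches = []
    for cut in requested_cuts:
        _, _, wanted = cut.partition("=")
        wanted = wanted.strip().lower()
        if wanted.isdigit():
            # Numeric member IDs such as 201403 are not label text; skip them here.
            continue
        if not any(wanted in value for value in returned):
            mismatches.append(cut)
    return mismatches


# Values copied from the outputs of cells 12 and 16.
requested = ["Month and Year = 201403", "Level 1.1 = Electronics"]
records = [{
    "Month and Year ID": 201403,
    "Month and Year": "March, 2014",
    "Level 4.4 ID": 403051619,
    "Level 4.4": "Appliances",
    "Consumer Price Inde": -3.299999952316284,
}]

mismatched = find_label_mismatches(requested, records)
print(mismatched)
```

A non-empty result indicates that the data passed to the analysis step may not cover the category the user asked about.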


# Data Analysis

In [18]:
agent_answer(df, query)

Prompt: 
        You are an expert data analyst. You are given a question from a user and a pandas dataframe that contains the data to answer this question.

        The question is the following:
        ---------------

        what was the CPI of electronics in March of 2014?
        
---------------

        Explain your reasoning to reach the answer.
        Take into consideration the data type and formatting of the columns.
        Its possible that any product/service or other variables the user is referring to appears with a different name in the dataframe.
        Lets think it through step by step.
        


> Entering new chain...




[32;1m[1;3mThought: To answer the question, I need to find the Consumer Price Index (CPI) for electronics in March of 2014. I will need to look at the 'Month and Year' and 'Level 4.4' columns to filter the data accordingly. Since the user is asking for electronics, I need to check if the 'Level 4.4' column contains a category for electronics or a similar term. I will also need to verify that the 'Month and Year' column contains the date 'March, 2014'. If the data is present, I can then extract the CPI value for that category and time period.

Action: I will first check the unique values in the 'Level 4.4' column to see if there is a category for electronics or a similar term.
Action Input: df['Level 4.4'].unique()
Observation: I will first check the unique values in the 'Level 4.4' column to see if there is a category for electronics or a similar term. is not a valid tool, try another one.
Thought:[32;1m[1;3mI need to use the correct tool to execute the Python command to check 

"With the given data, it is not possible to definitively answer the question about the CPI of electronics in March of 2014, as the category 'Electronics' is not explicitly listed in the dataframe. The 'Appliances' category has a CPI of -3.3 for March 2014, but it is unclear if this includes electronics. Additional data or clarification on the categorization would be needed to provide a precise answer."

In [19]:
response = agent_answer(df, query)
print(response)

Prompt: 
        You are an expert data analyst. You are given a question from a user and a pandas dataframe that contains the data to answer this question.

        The question is the following:
        ---------------

        what was the CPI of electronics in March of 2014?
        
---------------

        Explain your reasoning to reach the answer.
        Take into consideration the data type and formatting of the columns.
        Its possible that any product/service or other variables the user is referring to appears with a different name in the dataframe.
        Lets think it through step by step.
        


> Entering new chain...




[32;1m[1;3mThought: To answer the question, I need to find the row in the dataframe `df` where the "Month and Year" column matches "March, 2014" and the "Level 4.4" column contains the term "electronics". However, the provided dataframe snippet only shows the "Appliances" category, not "electronics". I will need to check if there is a row that matches these criteria.

Action: I will first check if the "Level 4.4" column contains the term "electronics" by using the `unique()` method to list all unique values in that column. This will help me understand if "electronics" is a category in the dataframe and how it is formatted.

Action Input: df['Level 4.4'].unique()
Observation: I will first check if the "Level 4.4" column contains the term "electronics" by using the `unique()` method to list all unique values in that column. This will help me understand if "electronics" is a category in the dataframe and how it is formatted. is not a valid tool, try another one.
Thought:[32;1m[1;3

In [20]:
response

'Based on the available data, there is no record for the CPI of electronics in March of 2014 in the dataframe. It is possible that the data for electronics is categorized under a different name, or it may not be present in the dataset at all.'