* Changed temperature to 0.2 and top_p to 0.1 to get more stable results across runs

In [1]:
import pandas as pd
import openai
from io import StringIO
import re
from google.cloud import bigquery

In [2]:
with open("../api_key", 'r') as f:
    api_key = f.read().strip()

In [3]:
OPENAI_CLIENT = openai.OpenAI(api_key=api_key)
def gpt_request(prompt, temperature=0.2, openai_client=OPENAI_CLIENT):
    """Send the prompt to the chat model and return the query found in its reply, or None."""
    response = openai_client.chat.completions.create(
            model="gpt-4-0125-preview",  # alternatives: "gpt-4-turbo-preview", "gpt-3.5-turbo-0125"
            messages=[
                {
                    "role": "user",
                    "content": prompt
                },
            ],
            temperature=temperature,
            # max_tokens=256,
            top_p=0.1,
            frequency_penalty=0,
            presence_penalty=0
        )
    ans_string = response.choices[0].message.content

    # Take the text between the first pair of triple backticks; tolerate a leading "sql" tag
    match = re.search(r"```(?:sql)?(.*?)```", ans_string, re.DOTALL)

    if match:
        query_string = match.group(1)  # The SQL query text
        print("Got the query!")
        return query_string
    else:
        print("No query found in the string.")
        return None
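
The extraction step above can be tried in isolation. The following is a minimal sketch, not the notebook's own code: `extract_fenced_query` and `LANGUAGE_TAGS` are hypothetical names, and the set of tags is an assumption about what the model might put after the opening backticks.

```python
import re

# Assumed set of language tags a model may emit after the opening fence
LANGUAGE_TAGS = {"sql"}

def extract_fenced_query(reply):
    """Return the text inside the first triple-backtick block, or None."""
    match = re.search(r"```(.*?)```", reply, re.DOTALL)
    if match is None:
        return None
    body = match.group(1)
    first_line, newline, rest = body.partition("\n")
    # A first line such as "sql" is a language tag, not part of the query
    if newline and first_line.strip().lower() in LANGUAGE_TAGS:
        body = rest
    return body.strip()

print(extract_fenced_query("Here you go:\n```sql\nSELECT 1\n```"))
print(extract_fenced_query("The reply contains no query"))
```

Stripping the tag matters because a query that starts with the word `sql` would be rejected when it is sent to BigQuery.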
    

In [4]:
context_info = """
Table schemas:
* "mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_users"
    * user_id: user's id
    * user_name: user's name
    * created: a timestamp of the registration time
* "mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_items"
    * item_id: item's id
    * item_name: item's name
    * price: the item price in Japanese yen
    * category: the item's category
    * created: timestamp that the items are listed
* "mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_transactions"
    * transaction_id: the transaction's id
    * item_id: item's id
    * user_id: who bought the item
    * quantity: how many items the user purchased
    * sold_time: timestamp that the items are sold  
"""

In [5]:
# Construct the prompt
prompt_template = """
{context_info}

Output constraint will be:
1. a string of query 
2. This query should be wrapped in triple backticks (```) on both sides so that I can easily extract it from your result.

{question}
"""

# Print the prompt to verify
# print(prompt_template[:1000])
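
The zero-shot runs recorded in this notebook each spend two repair rounds on the same BigQuery restriction: `TIMESTAMP_SUB` rejects the YEAR and MONTH date parts for TIMESTAMP arguments. One option, which this notebook does not test, is to state such dialect rules in the prompt up front. A sketch, where `BQ_DIALECT_HINTS` and `build_prompt` are hypothetical names:

```python
# Hypothetical addition: dialect notes appended to the schema description
BQ_DIALECT_HINTS = """
SQL dialect notes (BigQuery):
* TIMESTAMP_SUB and TIMESTAMP_DIFF reject MONTH and YEAR parts for TIMESTAMP values; use DAY (for example INTERVAL 365 DAY).
* Wrap fully qualified table names in backticks.
"""

def build_prompt(template, context_info, question, hints=BQ_DIALECT_HINTS):
    """Fill a template that has {context_info} and {question} placeholders."""
    return template.format(context_info=context_info + hints, question=question)

# Stand-in template and inputs, only to show the assembled text
demo_template = "{context_info}\n\n{question}\n"
print(build_prompt(demo_template, "Table schemas: (as above)", "How many users are there?"))
```

Whether the hint removes the failed attempts would need to be checked by rerunning the trials.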

In [9]:
def get_bq_results(query_string, max_trial_number=10):
    """Run the query; on failure, ask the model to repair it and try again."""
    client = bigquery.Client()
    for i in range(max_trial_number):
        try:
            # Run the query; an invalid query raises here
            query_job = client.query(query_string)
            results = query_job.to_dataframe()
            print("Query validation successful!")
            return results, query_string
        except Exception as e:
            error_message = str(e)
            print(f"{i}: Query validation failed: {error_message}")
            revise_query_prompt = revise_query_prompt_template.format(query_string=query_string, error_message=error_message)
            # print(revise_query_prompt)
            # Keep the previous query if the model reply contains no fenced query
            query_string = gpt_request(revise_query_prompt) or query_string
    print(f"Didn't get a valid result after {max_trial_number} attempts")
    return None, query_string
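
The run-and-repair loop above can be exercised without BigQuery or the OpenAI API by passing the two calls in as functions. This is a sketch under that assumption; `run_with_repair`, `fake_run` and `fake_model` are hypothetical names used only for illustration.

```python
def run_with_repair(query, run_query, ask_model, make_prompt, max_trials=10):
    """Run `query`; on failure ask the model for a revision and retry.

    run_query(query)        -> result, raises on an invalid query
    ask_model(prompt)       -> revised query string, or None if none was found
    make_prompt(query, err) -> prompt text describing the failure
    Returns (result, final_query, number_of_failed_attempts).
    """
    for trial in range(max_trials):
        try:
            return run_query(query), query, trial
        except Exception as err:
            revised = ask_model(make_prompt(query, str(err)))
            # Keep the previous query if the model reply had no query in it
            if revised:
                query = revised
    return None, query, max_trials

def fake_run(query):
    # Stand-in for BigQuery: reject the YEAR date part, accept anything else
    if "YEAR" in query:
        raise ValueError("TIMESTAMP_SUB does not support the YEAR date part")
    return [("ok",)]

def fake_model(prompt):
    # Stand-in for the chat model: always proposes a DAY interval
    return "SELECT TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 365 DAY)"

result, final_query, failures = run_with_repair(
    "SELECT TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 YEAR)",
    fake_run,
    fake_model,
    lambda q, e: f"query: {q}\nerror: {e}",
)
print(failures, final_query)
```

Returning the number of failed attempts makes it easy to compare prompts by how many repair rounds they need.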
    

In [10]:
revise_query_prompt_template = """
The input query is "{query_string}"
The error message is "{error_message}"
Please revise the query so that it runs without this error.

Output constraint will be:
1. a string of query 
2. This query should be wrapped in triple backticks (```) on both sides so that I can easily extract it from your result.

"""

## Zero-shot prompting

In [11]:
question = """
I'd like to fetch users who didn't purchase category "Electronics" ever but purchased at least once last 1 year,
segment them based on the purchase recency: 1) less than 6 months, 2) between 6 months and 12 months
calculate the average purchase price for each segment
"""
prompt = prompt_template.format(context_info=context_info,
                             question=question)
result_list = []
for i in range(5):
    print(i)
    query_string = gpt_request(prompt)
    df_result , updated_query_string = get_bq_results(query_string)
    display(df_result)
    print("\n")
    result_list.append({
        'trial': i,
        'original_query': query_string,
        'updated_query': updated_query_string,
        'result': df_result
    })
    

0
Got the query!
0: Query validation failed: 400 TIMESTAMP_SUB does not support the YEAR date part when the argument is TIMESTAMP type at [15:29]; reason: invalidQuery, location: query, message: TIMESTAMP_SUB does not support the YEAR date part when the argument is TIMESTAMP type at [15:29]

Location: US
Job ID: 1432d9ce-487e-4865-9c45-c972466fc8dc

Got the query!
1: Query validation failed: 400 TIMESTAMP_SUB does not support the MONTH date part when the argument is TIMESTAMP type at [20:32]; reason: invalidQuery, location: query, message: TIMESTAMP_SUB does not support the MONTH date part when the argument is TIMESTAMP type at [20:32]

Location: US
Job ID: a2e270f0-283f-4ea7-8df3-c305fcc1f683

Got the query!
Query validation successful!


,purchase_recency,average_purchase_price
0,Less than 6 months,538.288225
1,Between 6 and 12 months,452.7535




1
Got the query!
0: Query validation failed: 400 TIMESTAMP_SUB does not support the YEAR date part when the argument is TIMESTAMP type at [15:29]; reason: invalidQuery, location: query, message: TIMESTAMP_SUB does not support the YEAR date part when the argument is TIMESTAMP type at [15:29]

Location: US
Job ID: 4a6db837-3e66-4c8a-b85e-69a22cd4e11a

Got the query!
1: Query validation failed: 400 TIMESTAMP_SUB does not support the MONTH date part when the argument is TIMESTAMP type at [20:28]; reason: invalidQuery, location: query, message: TIMESTAMP_SUB does not support the MONTH date part when the argument is TIMESTAMP type at [20:28]

Location: US
Job ID: 51bfa9f7-cf59-4a5b-8b7a-82a13db903ea

Got the query!
Query validation successful!


,user_id,purchase_recency,avg_price
0,462,Less than 6 months,769.51
1,300,Between 6 and 12 months,418.32
2,622,Less than 6 months,51.04
3,824,Less than 6 months,986.32
4,408,Less than 6 months,143.05
...,...,...,...
81,38,Between 6 and 12 months,100.33
82,737,Between 6 and 12 months,30.81
83,790,Less than 6 months,653.60
84,758,Between 6 and 12 months,676.20




2
Got the query!
0: Query validation failed: 400 TIMESTAMP_SUB does not support the YEAR date part when the argument is TIMESTAMP type at [15:29]; reason: invalidQuery, location: query, message: TIMESTAMP_SUB does not support the YEAR date part when the argument is TIMESTAMP type at [15:29]

Location: US
Job ID: c524ac89-702c-441d-90d0-8f7f48132a12

Got the query!
1: Query validation failed: 400 TIMESTAMP_SUB does not support the MONTH date part when the argument is TIMESTAMP type at [20:32]; reason: invalidQuery, location: query, message: TIMESTAMP_SUB does not support the MONTH date part when the argument is TIMESTAMP type at [20:32]

Location: US
Job ID: 7b44d573-7fbd-4ee0-b1e4-78eab0fc3544

Got the query!
Query validation successful!


,purchase_recency,average_purchase_price
0,Less than 6 months,538.288225
1,Between 6 and 12 months,452.7535




3
Got the query!
0: Query validation failed: 400 TIMESTAMP_SUB does not support the YEAR date part when the argument is TIMESTAMP type at [15:31]; reason: invalidQuery, location: query, message: TIMESTAMP_SUB does not support the YEAR date part when the argument is TIMESTAMP type at [15:31]

Location: US
Job ID: 8a77e373-74b7-47e4-97d6-893fc5de91b8

Got the query!
1: Query validation failed: 400 TIMESTAMP_SUB does not support the MONTH date part when the argument is TIMESTAMP type at [20:36]; reason: invalidQuery, location: query, message: TIMESTAMP_SUB does not support the MONTH date part when the argument is TIMESTAMP type at [20:36]

Location: US
Job ID: c3513c33-750a-45ac-a86f-a07cc4a4404c

Got the query!
Query validation successful!


,purchase_recency,average_purchase_price
0,Less than 6 months,538.288225
1,Between 6 and 12 months,452.7535




4
Got the query!
0: Query validation failed: 400 TIMESTAMP_SUB does not support the YEAR date part when the argument is TIMESTAMP type at [14:27]; reason: invalidQuery, location: query, message: TIMESTAMP_SUB does not support the YEAR date part when the argument is TIMESTAMP type at [14:27]

Location: US
Job ID: 6da5b114-b9e4-4911-8eed-e521216b3928

Got the query!
1: Query validation failed: 400 TIMESTAMP_SUB does not support the MONTH date part when the argument is TIMESTAMP type at [30:30]; reason: invalidQuery, location: query, message: TIMESTAMP_SUB does not support the MONTH date part when the argument is TIMESTAMP type at [30:30]

Location: US
Job ID: e1e32eeb-f335-44b4-b755-34593f3e3901

Got the query!
Query validation successful!


,purchase_recency,average_purchase_price
0,,507.917727
1,Less than 6 months,522.914894
2,Between 6 and 12 months,466.732609
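
The five trials above do not all return the same kind of table: most give one row per segment, one gives one row per user, and one has an extra unlabeled segment. A quick way to spot this is to group trials by the shape of their result. The sketch below assumes each entry looks like those in `result_list`; `FakeFrame` is a stand-in for a pandas DataFrame so the example runs on its own, and both function names are hypothetical.

```python
from collections import defaultdict

def result_shape(result):
    """Return (column names, row count), or None for a failed trial."""
    if result is None:
        return None
    return tuple(result.columns), len(result)

def group_trials_by_shape(trials):
    """Map each distinct result shape to the trial numbers that produced it."""
    groups = defaultdict(list)
    for trial in trials:
        groups[result_shape(trial["result"])].append(trial["trial"])
    return dict(groups)

class FakeFrame:
    """Stand-in for a pandas DataFrame: exposes .columns and len()."""
    def __init__(self, columns, n_rows):
        self.columns = columns
        self._n_rows = n_rows

    def __len__(self):
        return self._n_rows

# Shapes copied from three of the recorded trials
demo = [
    {"trial": 0, "result": FakeFrame(["purchase_recency", "average_purchase_price"], 2)},
    {"trial": 1, "result": FakeFrame(["user_id", "purchase_recency", "avg_price"], 85)},
    {"trial": 4, "result": FakeFrame(["purchase_recency", "average_purchase_price"], 3)},
]
groups = group_trials_by_shape(demo)
for shape, trial_numbers in groups.items():
    print(trial_numbers, shape)
```

With real DataFrames the call would be `group_trials_by_shape(result_list)`.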






### Queries with different logic

In [12]:
print(result_list[1]['updated_query'])


WITH ElectronicsPurchasers AS (
  SELECT DISTINCT t.user_id
  FROM `mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_transactions` t
  JOIN `mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_items` i ON t.item_id = i.item_id
  WHERE i.category = 'Electronics'
),
RecentPurchases AS (
  SELECT t.user_id, 
         MAX(t.sold_time) AS last_purchase_time,
         AVG(i.price) AS avg_price
  FROM `mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_transactions` t
  JOIN `mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_items` i ON t.item_id = i.item_id
  WHERE t.user_id NOT IN (SELECT user_id FROM ElectronicsPurchasers)
    AND t.sold_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 365 DAY) AND CURRENT_TIMESTAMP()
  GROUP BY t.user_id
)
SELECT 
  user_id,
  IF(last_purchase_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 180 DAY), 'Less than 6 months', 'Between 6 and 12 months') AS purchase_recency,
  avg_price
FROM RecentPurchases



In [13]:
print(result_list[0]['updated_query'])


WITH ElectronicsPurchasers AS (
  SELECT DISTINCT t.user_id
  FROM `mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_transactions` t
  JOIN `mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_items` i ON t.item_id = i.item_id
  WHERE i.category = 'Electronics'
),
RecentPurchases AS (
  SELECT t.user_id,
         MAX(t.sold_time) AS last_purchase_time,
         AVG(i.price) AS avg_price
  FROM `mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_transactions` t
  JOIN `mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_items` i ON t.item_id = i.item_id
  WHERE t.user_id NOT IN (SELECT user_id FROM ElectronicsPurchasers)
    AND t.sold_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 365 DAY) AND CURRENT_TIMESTAMP()
  GROUP BY t.user_id
)
SELECT 
  CASE 
    WHEN last_purchase_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 180 DAY) THEN 'Less than 6 months'
    ELSE 'Between 6 and 12 months'
  END AS purchase_recency,
  AVG(avg_price) AS average_purchase_price
FROM RecentP
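
The two queries above differ in what they average. The first returns one average price per user. The second takes those per-user averages and averages them again within each segment, which is not the same as averaging over all purchases in the segment. A small worked example with made-up prices (not data from this notebook) shows the gap:

```python
from statistics import mean

# Made-up prices for two users in the same segment
purchases = {
    "user_a": [100, 100, 100, 100],
    "user_b": [500],
}

# What AVG(avg_price) over per-user rows computes: every user counts once
mean_of_user_means = mean(mean(prices) for prices in purchases.values())

# Average over individual purchases: every purchase counts once
pooled_mean = mean(price for prices in purchases.values() for price in prices)

print(mean_of_user_means, pooled_mean)
```

Here the mean of user means is 300 while the pooled mean is 180, so the question "average purchase price for each segment" needs to say which of the two is wanted.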

## Chain-of-Thought prompting

In [12]:
COT_prompt_template = """
{context_info}  
Output constraint will be:
1. a string of query 
2. This query should be wrapped in triple backticks (```) on both sides so that I can easily extract it from your result.

{previous_info}

The follow-up question:
{follow_question}

"""

In [13]:
previous_info_template = """
Previous questions include:
"{previous_questions}"
the previous query string is like 
"{previous_query_string}"
these are validated and correct. Please further build the query based on the new question. 

"""

In [14]:
# question = """
# I'd like to fetch users who didn't purchase category "Electronics" ever.
# """
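# NOTE: there is no comma after the second string, so Python concatenates it with the
# triple-quoted string that follows; the list holds two steps (indices 0 and 1)
split_question_list = [
    "I'd like to fetch users who didn't purchase category 'Electronics' ever",
    "and did purchase in last 1 year. "
    """segment them based on the purchase recency: 1) less than 6 months, 2) between 6 months and 12 months,
    calculate the average purchase price for each segment"""
]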
split_res_list = []
for ite in range(5):
    query_string_list = []
    for i, question in enumerate(split_question_list):
        print(i)
        if i == 0:
            prompt = COT_prompt_template.format(context_info=context_info,
                                                previous_info="",
                                                 follow_question=question)
        else:
            prompt = COT_prompt_template.format(context_info=context_info,
                                                previous_info=previous_info_template.format(
                                                    previous_questions=split_question_list[:i],
                                                    previous_query_string = query_string_list[i-1]),
                                                follow_question=question)
        query_string = gpt_request(prompt)
        # print(query_string)
        query_string_list.append(query_string)
    df_result , updated_query_string = get_bq_results(query_string_list[-1])
    split_res_list.append({
        'trial': ite,
        'original_queries': query_string_list,
        'updated_query': updated_query_string,
        'result': df_result
    })
    print(df_result)


0
Got the query!
1
Got the query!
0: Query validation failed: 400 TIMESTAMP_DIFF does not support the MONTH date part when the argument is TIMESTAMP type at [20:11]; reason: invalidQuery, location: query, message: TIMESTAMP_DIFF does not support the MONTH date part when the argument is TIMESTAMP type at [20:11]

Location: US
Job ID: 07fb2902-dac7-4860-a56b-0f2ea02dbf9f

Got the query!
Query validation successful!
    user_id  user_name         purchase_recency  average_purchase_price
0        16    User_16       Less than 6 months                  367.62
1        17    User_17  Between 6 and 12 months                   50.68
2        24    User_24       Less than 6 months                  653.60
3        26    User_26       Less than 6 months                  837.47
4        38    User_38  Between 6 and 12 months                  100.33
..      ...        ...                      ...                     ...
88      921   User_921       Less than 6 months                  376.99
89     
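
In the loop above, `previous_questions` is filled with a Python list slice, so the prompt contains the list's printed form, brackets and quotes included. A possible alternative, sketched here with the hypothetical helper `format_previous_questions`, is to render the earlier questions as a numbered list before formatting the template:

```python
def format_previous_questions(questions):
    """Render earlier questions as a numbered list, one per line."""
    return "\n".join(
        f"{number}. {question.strip()}"
        for number, question in enumerate(questions, start=1)
    )

earlier = [
    "I'd like to fetch users who didn't purchase category 'Electronics' ever",
    "and did purchase in last 1 year. ",
]
print(format_previous_questions(earlier))
```

The result would be passed as `previous_questions=format_previous_questions(...)`. Whether the model responds differently to this layout is not something this notebook measures.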

## Interactive data interface

In [None]:

# question = """
# I'd like to fetch users who didn't purchase category "Electronics" ever.
# """
# split_question_list = [
#     "I'd like to fetch users who didn't purchase category 'Electronics' ever",
#     "and did purchase in last 1 year. "
#     """segment them based on the purchase recency: 1) less than 6 months, 2) between 6 months and 12 months,
#     calculate the average purchase price per user for each segment"""
# ]
split_res_list = []


COT_retry_prompt_template = """
{context_info} 

{previous_info}

On top of it, user gave the question last time:
{previous_question}

and got the query last time like this:
{previous_query}

The user is not satisfied and we need to retry to provide query with the refined question:
{follow_question}

Output constraint will be:
1. a string of query 
2. This query should be wrapped in triple backticks (```) on both sides so that I can easily extract it from your result.


"""

validation_prompt_template = """
{context_info} 

current_query is like:
{query_string}

the information the user would like to check:

{validate_info}

please generate a query which is able to provide the validation info
Output constraint will be:
1. a string of query 
2. This query should be wrapped in triple backticks (```) on both sides so that I can easily extract it from your result.


"""
question_string_list = []
is_retry = False
previous_question = ""
current_question = ""
current_query = ""
while True:
    previous_info = ""
    if len(question_string_list) != 0: 
        previous_info = previous_info_template.format(
                                                    previous_questions=question_string_list,
                                                    previous_query_string = current_query)
    
    if is_retry is True:
        print("is retrying")
        question_string_list.append(current_question)
        prompt = COT_retry_prompt_template.format(context_info=context_info,
                                                previous_info=previous_info,
                                                previous_question=previous_question,
                                                previous_query=current_query,
                                                follow_question=current_question)
    else:
        current_question = input("Please enter your query in natural language: ")
        if "done" in current_question:
            break
        question_string_list.append(current_question)
        
        prompt = COT_prompt_template.format(context_info=context_info,
                                            previous_info=previous_info,
                                             follow_question=current_question)
    print(prompt)
    current_query = gpt_request(prompt)
    df_result , current_query = get_bq_results(current_query)

    print("the current query is")
    print(current_query)
    print("the current result looks like:")
    display(df_result)
    print("does the query look good?")
    user_input = input("please enter yes or no: ")
    if "no" in user_input:
        previous_question = current_question
        current_question = input("please enter the refined requirement: ")
        question_string_list = question_string_list[:-1]
        is_retry = True 
        continue
    print("would you like to validate it?")
    user_input = input("please enter yes or no: ")
    if "yes" in user_input:
        yes_validation = True
        while yes_validation:
            validation_info = input("please enter the information you would like to check: ")
            validation_prompt = validation_prompt_template.format(context_info=context_info,
                                                    query_string=current_query,
                                                    validate_info=validation_info)
            print()
            validation_query = gpt_request(validation_prompt)
            print(validation_query)
            df_val_result , updated_query_string = get_bq_results(validation_query)
            display(df_val_result)
            print("would you like to try some other validations?")
            validation_input = input("please enter yes or no: ")
            # Keep validating only while the user answers yes
            yes_validation = "yes" in validation_input
        # NOTE: user_input still holds the "yes" entered above, so this branch runs
        # only if that same answer also contained "no"
        if "no" in user_input:
            previous_question = current_question
            current_question = input("please enter the refined requirement: ")
            is_retry = True
            continue
    is_retry = False
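
The loop above decides with substring checks such as `"no" in user_input` and `"done" in current_question`. Those also fire on longer inputs that merely contain the letters. A stricter parser is sketched below; `parse_yes_no` is a hypothetical helper, not part of the notebook.

```python
def parse_yes_no(answer):
    """Return True for yes, False for no, None when the answer is unclear."""
    token = answer.strip().lower()
    if token in {"y", "yes"}:
        return True
    if token in {"n", "no"}:
        return False
    return None

# Substring checks match more than intended
print("no" in "I do not know")                      # "not" contains "no"
print("done" in "users who abandoned their carts")  # "abandoned" contains "done"

# Exact matching leaves unclear answers unclassified
print(parse_yes_no(" Yes "), parse_yes_no("no"), parse_yes_no("I do not know"))
```

A `None` result can be used to ask the question again instead of guessing.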
    

Please enter your query in natural language:  I'd like to fetch users who didn't purchase category 'Electronics' ever




Table schemas:
* "mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_users"
    * user_id: user's id
    * user_name: user's name
    * created: a timestamp of the registration time
* "mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_items"
    * item_id: item's id
    * item_name: item's name
    * price: the item price in Japanese yen
    * category: the item's category
    * created: timestamp that the items are listed
* "mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_transactions"
    * transaction_id: the transaction's id
    * item_id: item's id
    * user_id: who bought the item
    * quantity: how many items the user purchased
    * sold_time: timestamp that the items are sold  
  
Output constraint will be:
1. a string of query 
2. This query should be wrapped in triple backticks (```) on both sides so that I can easily extract it from your result.



The follow-up question:
I'd like to fetch users who didn't purchase category 'Electronics' ever


Got the query!
Query validation succ

,user_id,user_name
0,88,User_88
1,180,User_180
2,477,User_477
3,486,User_486
4,530,User_530
...,...,...
920,391,User_391
921,578,User_578
922,877,User_877
923,299,User_299


does the query look good?


please enter yes or no:  yes


would you like to validate it?


please enter yes or no:  yes
please enter the information you would like to check:  I'd like to get the user count who purchase the category for each category and also the total user count from the user table



Got the query!

SELECT 
    i.category,
    COUNT(DISTINCT t.user_id) AS user_count
FROM `mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_transactions` t
JOIN `mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_items` i ON t.item_id = i.item_id
GROUP BY i.category
UNION ALL
SELECT 
    'Total' AS category,
    COUNT(DISTINCT user_id) AS user_count
FROM `mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_users`

Query validation successful!


,category,user_count
0,Toys,99
1,Home,104
2,Electronics,75
3,Fashion,98
4,Books,91
5,Total,1000
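
The validation result above can be checked against the earlier filtered-user table with simple arithmetic. The sketch assumes that every Electronics buyer in the transactions table also appears in the users table and that the displayed table uses a default 0-based index; the numbers are read off the outputs above.

```python
total_users = 1000        # 'Total' row of the validation result
electronics_buyers = 75   # 'Electronics' row of the validation result
last_index_shown = 923    # last row label of the filtered-user table

expected_non_buyers = total_users - electronics_buyers
rows_returned = last_index_shown + 1  # assumes a default 0-based index

print(expected_non_buyers, rows_returned, expected_non_buyers - rows_returned)
```

Under these assumptions the two counts differ by one row. The recorded output does not show whether that comes from the data or from the query logic, so it is a point worth a follow-up validation query.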


would you like to try some other validations?


please enter yes or no:  no
Please enter your query in natural language:  and did purchase in last 1 year.




Table schemas:
* "mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_users"
    * user_id: user's id
    * user_name: user's name
    * created: a timestamp of the registration time
* "mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_items"
    * item_id: item's id
    * item_name: item's name
    * price: the item price in Japanese yen
    * category: the item's category
    * created: timestamp that the items are listed
* "mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_transactions"
    * transaction_id: the transaction's id
    * item_id: item's id
    * user_id: who bought the item
    * quantity: how many items the user purchased
    * sold_time: timestamp that the items are sold  
  
Output constraint will be:
1. a string of query 
2. This query should be wrapped in triple backticks (```) on both sides so that I can easily extract it from your result.


Previous questions include:
"["I'd like to fetch users who didn't purchase category 'Electronics' ever"]"
the previous query string is

,user_id,user_name
0,149,User_149
1,581,User_581
2,275,User_275
3,681,User_681
4,383,User_383
...,...,...
82,52,User_52
83,233,User_233
84,488,User_488
85,405,User_405


does the query look good?


please enter yes or no:  yes


would you like to validate it?


please enter yes or no:  yes
please enter the information you would like to check:  from the filtered users, summarize what categories they have purchased ever into an array for each user



Got the query!

SELECT u.user_id, u.user_name, ARRAY_AGG(DISTINCT i.category) AS purchased_categories
FROM `mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_users` u
JOIN `mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_transactions` t ON u.user_id = t.user_id
JOIN `mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_items` i ON t.item_id = i.item_id
WHERE u.user_id IN (
    SELECT u.user_id
    FROM `mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_users` u
    LEFT JOIN (
        SELECT DISTINCT t.user_id
        FROM `mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_transactions` t
        JOIN `mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_items` i ON t.item_id = i.item_id
        WHERE i.category = 'Electronics'
    ) e ON u.user_id = e.user_id
    JOIN (
        SELECT DISTINCT t.user_id
        FROM `mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_transactions` t
        WHERE t.sold_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 365 DAY)
    ) p ON u.user_

,user_id,user_name,purchased_categories
0,100,User_100,[Toys]
1,462,User_462,"[Home, Fashion]"
2,921,User_921,[Toys]
3,300,User_300,[Fashion]
4,622,User_622,[Toys]
...,...,...,...
82,496,User_496,[Books]
83,267,User_267,[Toys]
84,38,User_38,[Toys]
85,737,User_737,[Books]


would you like to try some other validations?


please enter yes or no:  yes
please enter the information you would like to check:  return all filtered users' maximum purchase time with an order



Got the query!

SELECT u.user_id, u.user_name, MAX(t.sold_time) AS max_purchase_time
FROM `mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_users` u
LEFT JOIN (
    SELECT DISTINCT t.user_id
    FROM `mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_transactions` t
    JOIN `mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_items` i ON t.item_id = i.item_id
    WHERE i.category = 'Electronics'
) e ON u.user_id = e.user_id
JOIN `mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_transactions` t ON u.user_id = t.user_id
WHERE e.user_id IS NULL
AND t.sold_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 365 DAY)
GROUP BY u.user_id, u.user_name
ORDER BY max_purchase_time DESC

Query validation successful!


,user_id,user_name,max_purchase_time
0,106,User_106,2024-04-24 15:21:00+00:00
1,523,User_523,2024-04-21 12:37:00+00:00
2,921,User_921,2024-04-04 04:53:00+00:00
3,581,User_581,2024-03-30 00:37:00+00:00
4,824,User_824,2024-03-29 23:15:00+00:00
...,...,...,...
82,491,User_491,2023-05-12 22:38:00+00:00
83,601,User_601,2023-05-11 05:40:00+00:00
84,858,User_858,2023-05-04 15:01:00+00:00
85,737,User_737,2023-04-28 06:54:00+00:00


would you like to try some other validations?


please enter yes or no:  no
Please enter your query in natural language:  please filter the purchase which occurs only before today when selecting user did purchase in last 1 year. 




Table schemas:
* "mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_users"
    * user_id: user's id
    * user_name: user's name
    * created: a timestamp of the registration time
* "mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_items"
    * item_id: item's id
    * item_name: item's name
    * price: the item price in Japanese yen
    * category: the item's category
    * created: timestamp that the items are listed
* "mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_transactions"
    * transaction_id: the transaction's id
    * item_id: item's id
    * user_id: who bought the item
    * quantity: how many items the user purchased
    * sold_time: timestamp that the items are sold  
  
Output constraint will be:
1. a string of query 
2. This query should be wrapped in triple backticks (```) on both sides so that I can easily extract it from your result.


Previous questions include:
"["I'd like to fetch users who didn't purchase category 'Electronics' ever", 'and did purchase in last 1 y

,user_id,user_name
0,149,User_149
1,581,User_581
2,275,User_275
3,681,User_681
4,383,User_383
...,...,...
81,52,User_52
82,233,User_233
83,488,User_488
84,405,User_405


does the query look good?


please enter yes or no:  yes


would you like to validate it?


please enter yes or no:  yes
please enter the information you would like to check:  return all filtered users' maximum purchase time with an order



Got the query!

SELECT u.user_id, u.user_name, MAX(t.sold_time) AS max_purchase_time
FROM `mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_users` u
LEFT JOIN (
    SELECT DISTINCT t.user_id
    FROM `mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_transactions` t
    JOIN `mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_items` i ON t.item_id = i.item_id
    WHERE i.category = 'Electronics'
) e ON u.user_id = e.user_id
JOIN `mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_transactions` t ON u.user_id = t.user_id
WHERE e.user_id IS NULL
AND t.sold_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 365 DAY)
AND t.sold_time < CURRENT_TIMESTAMP()
GROUP BY u.user_id, u.user_name
ORDER BY max_purchase_time DESC

Query validation successful!


,user_id,user_name,max_purchase_time
0,921,User_921,2024-04-04 04:53:00+00:00
1,581,User_581,2024-03-30 00:37:00+00:00
2,824,User_824,2024-03-29 23:15:00+00:00
3,149,User_149,2024-03-27 14:09:00+00:00
4,446,User_446,2024-03-24 17:34:00+00:00
...,...,...,...
81,491,User_491,2023-05-12 22:38:00+00:00
82,601,User_601,2023-05-11 05:40:00+00:00
83,858,User_858,2023-05-04 15:01:00+00:00
84,737,User_737,2023-04-28 06:54:00+00:00


would you like to try some other validations?


please enter yes or no:  no
Please enter your query in natural language:  segment the users based on their purchase recency: 1) less than 6 months, 2) between 6 months and 12 months,




Table schemas:
* "mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_users"
    * user_id: user's id
    * user_name: user's name
    * created: a timestamp of the registration time
* "mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_items"
    * item_id: item's id
    * item_name: item's name
    * price: the item price in Japanese yen
    * category: the item's category
    * created: timestamp that the items are listed
* "mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_transactions"
    * transaction_id: the transaction's id
    * item_id: item's id
    * user_id: who bought the item
    * quantity: how many items the user purchased
    * sold_time: timestamp that the items are sold  
  
Output constraint will be:
1. a string of query 
2. This query should be wrapped in triple backticks (```) on both sides so that I can easily extract it from your result.


Previous questions include:
"["I'd like to fetch users who didn't purchase category 'Electronics' ever", 'and did purchase in last 1 y

,user_id,user_name,purchase_recency
0,149,User_149,Less than 6 months
1,581,User_581,Less than 6 months
2,275,User_275,Less than 6 months
3,681,User_681,Between 6 and 12 months
4,383,User_383,Less than 6 months
...,...,...,...
81,52,User_52,Between 6 and 12 months
82,233,User_233,Between 6 and 12 months
83,488,User_488,Between 6 and 12 months
84,405,User_405,Less than 6 months


does the query look good?


please enter yes or no:  yes


would you like to validate it?


please enter yes or no:  yes
please enter the information you would like to check:  provide each users's max purchase time and purchase_recency segment, please order it based on the max purchase time



Got the query!

SELECT 
  u.user_id, 
  u.user_name,
  MAX(t.sold_time) AS max_purchase_time,
  CASE 
    WHEN MAX(t.sold_time) >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 6 * 30 DAY) THEN 'Less than 6 months'
    WHEN MAX(t.sold_time) < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 6 * 30 DAY) 
         AND MAX(t.sold_time) >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 12 * 30 DAY) THEN 'Between 6 and 12 months'
    ELSE 'More than 12 months'
  END AS purchase_recency
FROM `mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_users` u
LEFT JOIN (
    SELECT DISTINCT t.user_id
    FROM `mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_transactions` t
    JOIN `mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_items` i ON t.item_id = i.item_id
    WHERE i.category = 'Electronics'
) e ON u.user_id = e.user_id
JOIN `mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_transactions` t ON u.user_id = t.user_id
WHERE e.user_id IS NULL
AND t.sold_time < CURRENT_TIMESTAMP()
AND t.sold_time >

,user_id,user_name,max_purchase_time,purchase_recency
0,921,User_921,2024-04-04 04:53:00+00:00,Less than 6 months
1,581,User_581,2024-03-30 00:37:00+00:00,Less than 6 months
2,824,User_824,2024-03-29 23:15:00+00:00,Less than 6 months
3,149,User_149,2024-03-27 14:09:00+00:00,Less than 6 months
4,446,User_446,2024-03-24 17:34:00+00:00,Less than 6 months
...,...,...,...,...
81,491,User_491,2023-05-12 22:38:00+00:00,Between 6 and 12 months
82,601,User_601,2023-05-11 05:40:00+00:00,Between 6 and 12 months
83,858,User_858,2023-05-04 15:01:00+00:00,Between 6 and 12 months
84,737,User_737,2023-04-28 06:54:00+00:00,Between 6 and 12 months
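
The generated queries approximate months with day counts, and not always with the same ones: the zero-shot queries use 180 and 365 days, while the validation query above uses 6 * 30 and 12 * 30 days. The sketch below mirrors the CASE expression above in Python to show how the choice moves a purchase between segments. `recency_segment` and the reference time are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

def recency_segment(last_purchase, now, six_month_days=180, twelve_month_days=365):
    """Label a purchase by its age in days, using day counts as month boundaries."""
    age = now - last_purchase
    if age <= timedelta(days=six_month_days):
        return "Less than 6 months"
    if age <= timedelta(days=twelve_month_days):
        return "Between 6 and 12 months"
    return "More than 12 months"

now = datetime(2024, 4, 5, tzinfo=timezone.utc)  # arbitrary reference time
old_purchase = now - timedelta(days=362)

print(recency_segment(old_purchase, now))                         # 365-day boundary
print(recency_segment(old_purchase, now, twelve_month_days=360))  # 12 * 30-day boundary
```

A purchase that is 362 days old falls in 'Between 6 and 12 months' with the 365-day boundary and in 'More than 12 months' with the 360-day one, so the boundary is worth stating explicitly in the question.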


would you like to try some other validations?


please enter yes or no:  no
Please enter your query in natural language:  calculate the average purchase price for each segment




Table schemas:
* "mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_users"
    * user_id: user's id
    * user_name: user's name
    * created: a timestamp of the registration time
* "mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_items"
    * item_id: item's id
    * item_name: item's name
    * price: the item price in Japanese yen
    * category: the item's category
    * created: timestamp that the items are listed
* "mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_transactions"
    * transaction_id: the transaction's id
    * item_id: item's id
    * user_id: who bought the item
    * quantity: how many itmes the user purchased
    * sold_time: timestamp that the items are sold  
  
Output constraint will be:
1. a string of query 
3. This query should be in three quotes (```) on both sides so that I can easily extract it from your result.


Previous questions includes:
"["I'd like to fetch users who didn't purchase category 'Electronics' ever", 'and did purchase in last 1 y

Unnamed: 0,user_id,user_name,purchase_recency,average_purchase_price
0,149,User_149,Less than 6 months,903.760
1,581,User_581,Less than 6 months,456.090
2,275,User_275,Less than 6 months,769.510
3,681,User_681,Between 6 and 12 months,445.090
4,383,User_383,Less than 6 months,458.880
...,...,...,...,...
81,52,User_52,Between 6 and 12 months,949.335
82,233,User_233,Between 6 and 12 months,660.120
83,488,User_488,Between 6 and 12 months,21.410
84,405,User_405,Less than 6 months,376.990


does the query look good?


please enter yes or no:  no
please enter the refined requirement:  I'd like to get aggregated result for each segment instead of each user


is retrying


 


Previous questions includes:
"["I'd like to fetch users who didn't purchase category 'Electronics' ever", 'and did purchase in last 1 year.', 'please filter the purchase which occurs only before today when selecting user did purchase in last 1 year. ', 'segment the users based on th

Unnamed: 0,purchase_recency,segment_average_purchase_price
0,Less than 6 months,538.288225
1,Between 6 and 12 months,452.7535
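
The truncated output does not show how the generated query rolls the per-user rows up to one row per segment. A segment-level average can be computed in at least two ways that give different numbers, so it is worth checking which one the query uses. The sketch below uses made-up data; the `purchase_count` column and all values are hypothetical.

```python
import pandas as pd

# Hypothetical per-user rows, shaped like the earlier per-user result.
per_user = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "purchase_recency": [
        "Less than 6 months", "Less than 6 months",
        "Between 6 and 12 months", "Between 6 and 12 months",
    ],
    "average_purchase_price": [900.0, 500.0, 400.0, 600.0],
    "purchase_count": [1, 3, 2, 2],
})

# Option A: average of the per-user averages (every user weighs the same).
mean_of_user_means = (
    per_user.groupby("purchase_recency")["average_purchase_price"].mean()
)

# Option B: average over all purchases (users with more purchases weigh more).
totals = (
    per_user.assign(
        total_price=per_user["average_purchase_price"] * per_user["purchase_count"]
    )
    .groupby("purchase_recency")[["total_price", "purchase_count"]]
    .sum()
)
mean_over_purchases = totals["total_price"] / totals["purchase_count"]

print(mean_of_user_means)
print(mean_over_purchases)
```

In this toy data the "Less than 6 months" segment averages 700 under option A and 600 under option B, which is why the aggregation level matters when reading `segment_average_purchase_price`.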


does the query look good?


please enter yes or no:  yes


would you like to validate it?


please enter yes or no:  yes
please enter the information you would like to check:  calculate the min purchase date and max purchase date for each segment



Got the query!

SELECT 
  purchase_recency,
  MIN(min_purchase_date) AS segment_min_purchase_date,
  MAX(max_purchase_date) AS segment_max_purchase_date
FROM (
  SELECT 
    u.user_id,
    CASE 
      WHEN MAX(t.sold_time) >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 6 * 30 DAY) THEN 'Less than 6 months'
      WHEN MAX(t.sold_time) < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 6 * 30 DAY) 
           AND MAX(t.sold_time) >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 12 * 30 DAY) THEN 'Between 6 and 12 months'
    END AS purchase_recency,
    MIN(t.sold_time) AS min_purchase_date,
    MAX(t.sold_time) AS max_purchase_date
  FROM `mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_users` u
  LEFT JOIN (
      SELECT DISTINCT t.user_id
      FROM `mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_transactions` t
      JOIN `mercari-ml-crm-jp-dev.z_yilin.llm_query_experiment_items` i ON t.item_id = i.item_id
      WHERE i.category = 'Electronics'
  ) e ON u.user_id = e.user_id
  JOIN 

Unnamed: 0,purchase_recency,segment_min_purchase_date,segment_max_purchase_date
0,Less than 6 months,2023-05-07 17:41:00+00:00,2024-04-04 04:53:00+00:00
1,Between 6 and 12 months,2023-04-25 15:35:00+00:00,2023-10-18 06:06:00+00:00


would you like to try some other validations?


please enter yes or no:  calculate the min users' purchase recency and max user's purchase recency for each segment
Please enter your query in natural language:  I'd like to define the date scope of the segment as: Less than 6 months as[2023-10-15, 2023-04-14], Between 6 and 12 months as [2023-04-15, 2023-10-14]




  
Output constraint will be:
1. a string of query 
3. This query should be in three quotes (```) on both sides so that I can easily extract it from your result.


Previous questions includes:
"["I'd like to fetch users who didn't purchase category 'Electronics' ever", 'and did purchase in last 1 y

Unnamed: 0,purchase_recency,segment_average_purchase_price
0,Between 6 and 12 months,461.767262


does the query look good?


please enter yes or no:  no
please enter the refined requirement:  "Less than 6 months" is between 2023-10-15 and 2024-04-14, "Between 6 and 12 months" is between 2023-04-15 and 2023-10-14.


is retrying


 


Previous questions includes:
"["I'd like to fetch users who didn't purchase category 'Electronics' ever", 'and did purchase in last 1 year.', 'please filter the purchase which occurs only before today when selecting user did purchase in last 1 year. ', 'segment the users based on th

Unnamed: 0,purchase_recency,segment_average_purchase_price
0,Less than 6 months,531.578688
1,Between 6 and 12 months,458.646154


does the query look good?


* question 1: I'd like to fetch users who didn't purchase category 'Electronics' ever
   * validation: I'd like to get the user count who purchase the category for each category and also the total user count from the user table
* question 2: and did purchase in last 1 year.
   * validation: from the filtered users, summarize what categories they have purchased ever into an array for each user
   * validation: return all filtered users' maximum purchase time with an order
* question 3: please filter the purchase which occurs only before today when selecting user did purchase in last 1 year. 
   * validation: return all filtered users' maximum purchase time with an order
* question 4: segment the users based on their purchase recency: 1) less than 6 months, 2) between 6 months and 12 months,
   * validation: provide each user's max purchase time and purchase_recency segment, please order it based on the max purchase time
* question 5: calculate the average purchase price for each segment
   * validation: calculate the min purchase date and max purchase date for each segment
* question 6: I'd like to define the date scope of the segment as: Less than 6 months as [2023-10-15, 2024-04-14], Between 6 and 12 months as [2023-04-15, 2023-10-14]
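
The fixed date scopes in the last question can be sketched in pandas as follows. This is an illustration under assumptions, not the query the model produced: the names `SEGMENT_SCOPES` and `segment_for` are hypothetical, both ends of each range are treated as inclusive, and dates are compared in the timestamp's own timezone (UTC in the outputs above).

```python
import pandas as pd

# Date scopes copied from the last question; both ends assumed inclusive.
SEGMENT_SCOPES = {
    "Less than 6 months": ("2023-10-15", "2024-04-14"),
    "Between 6 and 12 months": ("2023-04-15", "2023-10-14"),
}


def segment_for(max_purchase_time):
    """Return the segment label for a user's latest purchase, or None if out of scope."""
    day = pd.Timestamp(max_purchase_time).date()
    for label, (start, end) in SEGMENT_SCOPES.items():
        if pd.Timestamp(start).date() <= day <= pd.Timestamp(end).date():
            return label
    return None


print(segment_for("2023-10-18 06:06:00+00:00"))
print(segment_for("2023-05-12 22:38:00+00:00"))
print(segment_for("2023-01-01 00:00:00+00:00"))
```

With explicit ranges, a latest purchase on 2023-10-18 lands in "Less than 6 months", whereas the earlier validation output reported that timestamp as the maximum of the "Between 6 and 12 months" segment under the `6 * 30` day rule.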