# Data privacy

According to [OpenAI's official statement](https://help.openai.com/en/articles/5722486-how-your-data-is-used-to-improve-model-performance):

>OpenAI does not use data submitted by customers via our API to train OpenAI models or improve OpenAI’s service offering. In order to support the continuous improvement of our models, you can fill out this form to opt-in to share your data with us. 

According to [OpenAI's API data usage policies](https://openai.com/policies/api-data-usage-policies):

>1. OpenAI will not use data submitted by customers via our API to train or improve our models, unless you explicitly decide to share your data with us for this purpose. You can opt-in to share data.
>2. Any data sent through the API will be retained for abuse and misuse monitoring purposes for a maximum of 30 days, after which it will be deleted (unless otherwise required by law).

# Library import and client setup

In [1]:
import openai
from google.cloud import bigquery
import pandas as pd
from google.cloud import storage


In [2]:
client = bigquery.Client(project="setalab")



In [3]:
# Set up OpenAI API credentials.
# Never hardcode the key in the notebook; read it from the environment instead.
# (Assumes the key has been exported as the OPENAI_API_KEY environment variable.)
import os

openai.api_key = os.environ["OPENAI_API_KEY"]

# Select the chat model used to generate SQL
model_name = "gpt-3.5-turbo-0301"

# model_name = "gpt-4"
# GPT-4 is not available even to subscribed accounts; you need to join the waitlist: https://openai.com/waitlist/gpt-4-api

# Table info feed-in

In [4]:
def get_table_schema(table_name):
    table = client.get_table(table_name)
    schema = table.schema
    schema_info = ""
    for field in schema:
        schema_info += f"{field.name}  {field.field_type},\n"
    return schema_info

In [5]:
def table_info_init(table: str = "setalab.llm_explore.owid_covid_data") -> None:
    # Set the table name
    global TABLE_NAME
    TABLE_NAME = table
    # Get the schema of the table
    global SCHEMA
    SCHEMA = get_table_schema(table)
    print(SCHEMA)

In [6]:
#def table_info_init(table: str = "test.owid_covid_data") -> None:
    # Set the table name
# table = "setalab.llm_explore.owid_covid_data"
# global TABLE_NAME
# TABLE_NAME = table
# # Get the schema of the table
# global SCHEMA
# SCHEMA = get_table_schema(table)
# #print(schema)
# print(SCHEMA)

# Helper functions

In [7]:
def generate_sql_query(question, temperature: float = 0.2) -> str:
    print(f"Question:\n{question}")
    prompt_without_context = question
    # prompt_with_context = f"Given the table '{table_name}', please write an SQL query to {question}"

    messages = [
        {"role": "system", "content": f"Use the reference table name: {TABLE_NAME}"},
        {"role": "system", "content": f"Use the reference table schema: {SCHEMA}"},
        {"role": "system", "content": "Use the provided table and schema to compose SQL query in Google BigQuery dialect to help user answer their question."},
        {"role": "user", "content": prompt_without_context},
        # {"role": "assistant", "content": prompt_with_context},
        # Set of rules to get better query results
        {"role": "assistant", "content": "If the user asks for something by X, then it's a good idea to have a GROUP BY X clause in the SQL"},
        # {"role": "assistant", "content": "New column names in AS clause should not have any white spaces and do not put them inside quotes"},
        {"role": "assistant", "content": "Remember to make sure all the columns are either grouped or aggregated when you are using aggregation in SQL"},
        {"role": "assistant", "content": "If the user does not mention any time period, the query should take into account all the dates available in the table"},
        {"role": "assistant", "content": "When the user does not mention any specific requirements for the numeric values, default to average"},
        {"role": "system", "content": "Please provide the query ONLY. DO NOT provide any explanation"},
    ]

    response = openai.ChatCompletion.create(
        model=model_name,
        messages=messages,
        max_tokens=100,
        temperature=temperature,
        n=1,
        stop=None,
    )
    sql_query = response.choices[0].message['content']
    sql_query = sql_query.strip().strip("```")
    return sql_query
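
Note that `strip("```")` removes backtick characters from both ends of the reply, but it would leave a language tag such as `sql` behind if the model wraps its answer in a fenced block. A minimal sketch of a more tolerant cleanup follows; `extract_sql` is a hypothetical helper name, not part of the notebook or of the OpenAI library.

```python
import re


def extract_sql(response_text: str) -> str:
    """Return the SQL from a model reply, dropping a surrounding Markdown fence if present."""
    text = response_text.strip()
    # Match a fenced block; the language tag after the opening fence is optional
    match = re.search(r"```(?:\w+)?\s*\n(.*?)```", text, flags=re.DOTALL)
    if match:
        return match.group(1).strip()
    # No fenced block found: fall back to removing stray backticks
    return text.strip("`").strip()
```

It could replace the two `strip` calls in `generate_sql_query`, for example `sql_query = extract_sql(response.choices[0].message["content"])`.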



In [8]:
def execute_sql_query(sql_query):
    result = client.query(sql_query).to_dataframe()
    return result

In [9]:
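def ask_question(question: str = None) -> str:
    # Fall back to an interactive prompt when no question is passed in
    user_input = question if question else input("Input question:")

    sql_query = generate_sql_query(user_input)
    print("Generated SQL Query:\n", sql_query)
    return sql_query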

In [10]:
pip install db_dtypes

Note: you may need to restart the kernel to use updated packages.


In [11]:
import db_dtypes

In [12]:
def ask_question_and_validate(question: str = None, temperature: float=0.2) -> pd.DataFrame:
    user_input = input("Input question:") if not question else question

    sql_query = generate_sql_query(user_input, temperature)
    print("Generated SQL Query:\n", sql_query)

    # Execute the generated SQL query
    result = execute_sql_query(sql_query)
    # print("Result:", result)
    return result

# Further exploration with other questions

In [13]:
table_info_init()

iso_code  STRING,
continent  STRING,
location  STRING,
date  DATE,
total_cases  FLOAT,
new_cases  FLOAT,
new_cases_smoothed  FLOAT,
total_deaths  FLOAT,
new_deaths  FLOAT,
new_deaths_smoothed  FLOAT,
total_cases_per_million  FLOAT,
new_cases_per_million  FLOAT,
new_cases_smoothed_per_million  FLOAT,
total_deaths_per_million  FLOAT,
new_deaths_per_million  FLOAT,
new_deaths_smoothed_per_million  FLOAT,
reproduction_rate  FLOAT,
icu_patients  FLOAT,
icu_patients_per_million  FLOAT,
hosp_patients  STRING,
hosp_patients_per_million  STRING,
weekly_icu_admissions  STRING,
weekly_icu_admissions_per_million  STRING,
weekly_hosp_admissions  STRING,
weekly_hosp_admissions_per_million  STRING,
total_tests  FLOAT,
new_tests  FLOAT,
total_tests_per_thousand  FLOAT,
new_tests_per_thousand  FLOAT,
new_tests_smoothed  FLOAT,
new_tests_smoothed_per_thousand  FLOAT,
positive_rate  FLOAT,
tests_per_case  FLOAT,
tests_units  STRING,
total_vaccinations  FLOAT,
people_vaccinated  FLOAT,
people_fully_vaccin

In [14]:
df = ask_question_and_validate(
    "show top 10 countries with highest average percentage of Infection Rate compared to Population"
)
df

Question:
show top 10 countries with highest average percentage of Infection Rate compared to Population
Generated SQL Query:
 
SELECT location, AVG(new_cases/ population) * 100 AS infection_rate
FROM setalab.llm_explore.owid_covid_data
GROUP BY location
ORDER BY infection_rate DESC
LIMIT 10



| | location | infection_rate |
|---|---|---|
| 0 | Cyprus | 0.059241 |
| 1 | San Marino | 0.057899 |
| 2 | Brunei | 0.054799 |
| 3 | Austria | 0.054598 |
| 4 | Faeroe Islands | 0.052408 |
| 5 | Slovenia | 0.050928 |
| 6 | Gibraltar | 0.050513 |
| 7 | Martinique | 0.050262 |
| 8 | South Korea | 0.049215 |
| 9 | Andorra | 0.048303 |


In [15]:
df = ask_question_and_validate(
    "show average percentage of Infection Rate compared to Population by country, return top 10"
)
df

Question:
show average percentage of Infection Rate compared to Population by country, return top 10
Generated SQL Query:
 SELECT location, AVG(total_cases/population)*100 AS infection_rate_percentage
FROM setalab.llm_explore.owid_covid_data
GROUP BY location
ORDER BY infection_rate_percentage DESC
LIMIT 10


| | location | infection_rate_percentage |
|---|---|---|
| 0 | Cook Islands | 33.473507 |
| 1 | Saint Helena | 33.37853 |
| 2 | San Marino | 30.520538 |
| 3 | Nauru | 30.467643 |
| 4 | Andorra | 30.114239 |
| 5 | Gibraltar | 29.960316 |
| 6 | Cyprus | 29.164427 |
| 7 | Faeroe Islands | 27.886633 |
| 8 | Slovenia | 27.203686 |
| 9 | Austria | 25.39993 |


In [16]:
df = ask_question_and_validate(
    "show average percentage of death over population by country, order from highest to lowest"
)
df

Question:
show average percentage of death over population by country, order from highest to lowest
Generated SQL Query:
 SELECT location, AVG(total_deaths/population)*100 AS avg_death_percentage
FROM setalab.llm_explore.owid_covid_data
GROUP BY location
ORDER BY avg_death_percentage DESC


| | location | avg_death_percentage |
|---|---|---|
| 0 | Peru | 0.468791 |
| 1 | Bulgaria | 0.326416 |
| 2 | Bosnia and Herzegovina | 0.309773 |
| 3 | Hungary | 0.293098 |
| 4 | Gibraltar | 0.289355 |
| ... | ... | ... |
| 250 | Taiwan | |
| 251 | Macao | |
| 252 | North Korea | |
| 253 | Hong Kong | |


In [17]:
df = ask_question_and_validate(
    "show percentage of death over population by country, order from highest to lowest"
)
df

Question:
show percentage of death over population by country, order from highest to lowest
Generated SQL Query:
 SELECT 
  location, 
  AVG(total_deaths/population)*100 AS percentage_of_death_over_population
FROM 
  setalab.llm_explore.owid_covid_data
GROUP BY 
  location
ORDER BY 
  percentage_of_death_over_population DESC


| | location | percentage_of_death_over_population |
|---|---|---|
| 0 | Peru | 0.468791 |
| 1 | Bulgaria | 0.326416 |
| 2 | Bosnia and Herzegovina | 0.309773 |
| 3 | Hungary | 0.293098 |
| 4 | Gibraltar | 0.289355 |
| ... | ... | ... |
| 250 | Taiwan | |
| 251 | Macao | |
| 252 | North Korea | |
| 253 | Hong Kong | |


In [18]:
df = ask_question_and_validate(
    "show top 3 contintents with the highest average death count per population"
)
df

Question:
show top 3 contintents with the highest average death count per population
Generated SQL Query:
 SELECT continent, AVG(total_deaths/population) AS avg_death_count_per_population
FROM setalab.llm_explore.owid_covid_data
GROUP BY continent
ORDER BY avg_death_count_per_population DESC
LIMIT 3;


| | continent | avg_death_count_per_population |
|---|---|---|
| 0 | South America | 0.001658 |
| 1 | Europe | 0.001593 |
| 2 | North America | 0.000901 |


In [19]:
df = ask_question_and_validate(
    "Show average likelihood of dying by country"
)
df

Question:
Show average likelihood of dying by country
Generated SQL Query:
 SELECT location, AVG(total_deaths/total_cases) as avg_likelihood_of_dying
FROM setalab.llm_explore.owid_covid_data
GROUP BY location


| | location | avg_likelihood_of_dying |
|---|---|---|
| 0 | Andorra | 0.014026 |
| 1 | Bermuda | 0.020785 |
| 2 | British Virgin Islands | 0.023383 |
| 3 | Cayman Islands | 0.010402 |
| 4 | Greenland | 0.001601 |
| ... | ... | ... |
| 250 | Cameroon | 0.018174 |
| 251 | Bahamas | 0.031933 |
| 252 | Antigua and Barbuda | 0.029716 |
| 253 | Japan | 0.013081 |


In [20]:
df = ask_question_and_validate(
    "Show latest Percentage of Population that has recieved at least one dose of Covid Vaccine by country"
)
df

Question:
Show latest Percentage of Population that has recieved at least one dose of Covid Vaccine by country
Generated SQL Query:
 
SELECT 
  location, 
  AVG(people_vaccinated_per_hundred) AS percentage_population_vaccinated
FROM 
  setalab.llm_explore.owid_covid_data
WHERE 
  people_vaccinated_per_hundred IS NOT NULL
GROUP BY 
  location
ORDER BY 
  percentage_population_vaccinated DESC



| | location | percentage_population_vaccinated |
|---|---|---|
| 0 | Gibraltar | 103.330895 |
| 1 | Pitcairn | 100.000000 |
| 2 | United Arab Emirates | 93.267984 |
| 3 | China | 88.230000 |
| 4 | Cuba | 82.688979 |
| ... | ... | ... |
| 230 | Papua New Guinea | 2.464225 |
| 231 | Democratic Republic of Congo | 2.267681 |
| 232 | Yemen | 1.830909 |
| 233 | Haiti | 1.103333 |


# Remarks

- The results are volatile: the generated SQL is likely to differ each time we ask the model the same question.
- Since this is only a language model, it does not understand the implications of the queries it writes. The most common failure is a missing or incorrect `GROUP BY` clause.
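
To put a number on this volatility, one option is to ask the same question several times and count how many distinct queries come back. The sketch below is only an illustration: `normalize_sql` and `count_query_variants` are hypothetical helper names, and the normalization (collapsing whitespace, lowercasing, dropping a trailing semicolon) is a rough assumption that ignores purely cosmetic differences.

```python
from collections import Counter
from typing import Callable


def normalize_sql(sql: str) -> str:
    """Collapse whitespace, drop a trailing semicolon, and lowercase for comparison only."""
    return " ".join(sql.split()).rstrip(";").lower()


def count_query_variants(generate: Callable[[str], str], question: str, runs: int = 5) -> Counter:
    """Call `generate` with the same question `runs` times and count distinct normalized queries."""
    return Counter(normalize_sql(generate(question)) for _ in range(runs))
```

In this notebook it could be called as `count_query_variants(generate_sql_query, "Show average likelihood of dying by country")`; a counter with a single key would mean every run produced the same query.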

## How we phrase the question is very important

The phrasing strongly affects the resulting query, as the two examples below show.


### First example

This example was run without the following instruction:

```JSON
{
    "role": "assistant",
    "content": "If the user does not mention any time period, the query should take into account all the dates available in the table"
}
```


Take this question for example:

    "show percentage of death over population by country, order from highest to lowest"

The model returned the following SQL:

```SQL
SELECT location, (total_deaths/population)*100 AS death_percentage
FROM test.owid_covid_data
WHERE population IS NOT NULL
GROUP BY location, population
ORDER BY death_percentage DESC
```

This query throws an error because `total_deaths` is neither grouped nor aggregated:

`BadRequest: 400 SELECT list expression references column total_deaths which is neither grouped nor aggregated`

But if we rephrase the question as follows:

    "show average percentage of death over population by country, order from highest to lowest"

This is the SQL we received:

```SQL
 SELECT location, AVG(total_deaths_per_million/population)*100 AS avg_death_percentage
FROM test.owid_covid_data
GROUP BY location
ORDER BY avg_death_percentage DESC
```

This is more in line with what we want. The only difference is the word __"average"__ added to the question.

### Second example

This example was run with the full set of instructions shown above in the `generate_sql_query` function.

Question: __"show top 3 contintents with the highest death count per population"__

SQL:
```SQL
 SELECT continent, SUM(total_deaths)/SUM(population) AS death_per_population
FROM test.owid_covid_data
GROUP BY continent
ORDER BY death_per_population DESC
LIMIT 3;
```

Question: __"show top 3 contintents with the highest average death count per population"__

SQL:
```SQL
 SELECT continent, AVG(total_deaths/population) AS avg_death_per_pop
FROM test.owid_covid_data
GROUP BY continent
ORDER BY avg_death_per_pop DESC
LIMIT 3;
```
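
The two queries above are not interchangeable: `SUM(total_deaths)/SUM(population)` is a ratio of sums, while `AVG(total_deaths/population)` is an average of per-row ratios. A small sketch with made-up numbers (the figures are hypothetical, not taken from the table) shows how far apart the two can be:

```python
# Hypothetical (total_deaths, population) rows belonging to one continent
rows = [(100, 1_000), (10, 1_000_000)]

# SUM(total_deaths) / SUM(population): rows with a large population dominate
ratio_of_sums = sum(d for d, _ in rows) / sum(p for _, p in rows)

# AVG(total_deaths / population): every row carries the same weight
average_of_ratios = sum(d / p for d, p in rows) / len(rows)

print(f"ratio of sums:     {ratio_of_sums:.6f}")
print(f"average of ratios: {average_of_ratios:.6f}")
```

Which of the two is "right" depends on the question being asked, so it is worth checking that the word "average" in the prompt really expresses the intended calculation.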

## The temperature parameter

According to the [API documentation](https://platform.openai.com/docs/api-reference/chat/create):

```
temperature number  Optional    Defaults to 1

What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.
```

Reducing this parameter only makes the answers less random; it does not guarantee the same answer every time. It is therefore best to keep it low.
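
For intuition, the sketch below assumes the usual softmax-with-temperature sampling used by language models; the API documentation quoted above does not spell out the formula, so treat this as an illustration rather than a description of OpenAI's implementation. The scores in `logits` are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; temperature must be greater than 0."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Under this assumption, a low temperature concentrates almost all of the probability on the top-scoring token, but the alternatives keep a small non-zero probability, which is consistent with answers that are less random yet not guaranteed to repeat.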

## Fine-tuning

[More to explore]

### Detailed Instructions

One way to tune the model's behavior without retraining it is to give it some "rules" in the `messages` parameter. For example:

```python
messages = [
    ...
    {"role": "assistant", "content": "If the user asks for something by X, then the answer SQL should have a GROUP BY X clause"},
    ...
]
```

With this message automatically embedded in every request to the model, cases where the model "forgets" to aggregate a column dropped drastically.

So, to apply the "GPT approach", we can summarize some "common sense" rules and feed them to the model as guidance.
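
One way to keep such rules manageable is to hold them in a plain list and assemble the `messages` payload from it. The following is a minimal sketch, not the notebook's actual code: `RULES` and `build_messages` are hypothetical names, the rule texts are paraphrased from `generate_sql_query`, and the `rule_role` default of `"assistant"` simply mirrors the role used there.

```python
from typing import Dict, List

# Rules paraphrased from the notebook; extend the list as new failure modes appear
RULES = [
    "If the user asks for something by X, then the SQL should have a GROUP BY X clause",
    "If the user does not mention any time period, use all the dates available in the table",
]


def build_messages(table_name: str, schema: str, question: str,
                   rules: List[str], rule_role: str = "assistant") -> List[Dict[str, str]]:
    """Assemble chat messages: table context, the question, one message per rule, final instruction."""
    messages = [
        {"role": "system", "content": f"Use the reference table name: {table_name}"},
        {"role": "system", "content": f"Use the reference table schema: {schema}"},
        {"role": "user", "content": question},
    ]
    messages += [{"role": rule_role, "content": rule} for rule in rules]
    messages.append({"role": "system", "content": "Please provide the query ONLY. DO NOT provide any explanation"})
    return messages
```

With this layout, adding or removing a rule is a one-line change to `RULES`, which makes it easier to test the effect of each rule on the generated SQL.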

### Further fine tuning (TODO)

Refer to:

- https://platform.openai.com/docs/guides/fine-tuning
- https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset
- https://docs.wandb.ai/guides/integrations/openai?utm_source=wandb_docs&utm_medium=code&utm_campaign=OpenAI+API
