# Query Data with an LLM Using the OpenAI API

In this section, we will review different ways to query data with LLMs using the [OpenAI API](https://platform.openai.com/docs/api-reference/introduction). We will start with basic prompt methods to generate SQL queries and then move on to more advanced prompt techniques. We will then fully functionalize the process and build the following three components:
- **Create Prompt**: This function takes a user question and returns a prompt
- **API handler**: This function sends the prompt to the API, returns the response, and parses it into a SQL query
- **DB handler**: This function takes the query, runs it against the table, and returns the query result


<figure>
 <img src="images/architecture.png" width="100%" align="center"/>
<figcaption> The agent architecture </figcaption>
</figure>

<br>
<br />

To demonstrate the process, we will use the Chicago Crime dataset we pulled earlier from the API and load it into a Pandas dataframe. We will also use the `duckdb` library to query the data using SQL queries.

<figure>
 <img src="images/chicago_crime.png" width="100%" align="center"/>
<figcaption> The Chicago Crime dataset </figcaption>
</figure>

<br>
<br />

## Load the Libraries

In [1]:
import pandas as pd
import duckdb
import os
from openai import OpenAI

## Load the Data

We will load the data into a pandas dataframe and reformat some of the table columns:

In [2]:
chicago_crime = pd.read_csv("./data/chicago_crime_2023_2025.csv")
chicago_crime["updated_on"] = pd.to_datetime(chicago_crime["updated_on"])
chicago_crime["x_coordinate"] = chicago_crime["x_coordinate"].astype(float)
chicago_crime["y_coordinate"] = chicago_crime["y_coordinate"].astype(float)
chicago_crime["latitude"] = chicago_crime["latitude"].astype(float)
chicago_crime["longitude"] = chicago_crime["longitude"].astype(float)
chicago_crime["year"]= chicago_crime["year"].astype(int)

In [3]:
chicago_crime.head()

Unnamed: 0,id,case_number,datetime,block,iucr,primary_type,description,location_description,arrest,domestic,...,district,ward,community_area,fbi_code,x_coordinate,y_coordinate,year,updated_on,latitude,longitude
0,13140855,JG341458,2023-01-01 00:00:00,082XX S JEFFERY BLVD,1754,OFFENSE INVOLVING CHILDREN,AGGRAVATED SEXUAL ASSAULT OF CHILD BY FAMILY M...,APARTMENT,False,True,...,4,8.0,46.0,02,1190953.0,1850848.0,2023,2023-09-24 15:41:26,41.745739,-87.575883
1,13180096,JG387858,2023-01-01 00:00:00,075XX S WOLCOTT AVE,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,RESIDENCE,False,False,...,6,17.0,71.0,11,1164996.0,1854651.0,2023,2023-08-20 15:40:56,41.756762,-87.670887
2,13168471,JG374193,2023-01-01 00:00:00,013XX W HARRISON ST,460,BATTERY,SIMPLE,SCHOOL - PUBLIC GROUNDS,False,False,...,12,34.0,28.0,08B,1167465.0,1897475.0,2023,2023-08-19 15:40:26,41.874223,-87.66061
3,13078152,JG267031,2023-01-01 00:00:00,101XX S BEVERLY AVE,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,RESIDENCE,False,False,...,22,21.0,73.0,11,1168800.0,1837525.0,2023,2023-08-19 15:40:26,41.709685,-87.657439
4,13120699,JG314178,2023-01-01 00:00:00,063XX N FAIRFIELD AVE,1544,SEX OFFENSE,SEXUAL EXPLOITATION OF A CHILD,OTHER (SPECIFY),False,False,...,24,50.0,2.0,17,1156847.0,1941985.0,2023,2023-08-19 15:40:26,41.996584,-87.698384


In [4]:
print(chicago_crime.dtypes)

id                               int64
case_number                     object
datetime                        object
block                           object
iucr                            object
primary_type                    object
description                     object
location_description            object
arrest                            bool
domestic                          bool
beat                             int64
district                         int64
ward                           float64
community_area                 float64
fbi_code                        object
x_coordinate                   float64
y_coordinate                   float64
year                             int64
updated_on              datetime64[ns]
latitude                       float64
longitude                      float64
dtype: object


## The OpenAI API

Throughout this workshop, we will use the OpenAI API as the LLM backend for generating SQL queries. Let's start by reviewing the functionality of the API.

**Note:** The OpenAI API is not free, and you need a paid subscription to use it.

### Setting Up API Key

To use the OpenAI API, you will have to create an account, register for the API, and set up your own key. To create a key, go to the [OpenAI API main page](https://platform.openai.com/settings/organization/api-keys), select "API Keys" in the left menu (marked in yellow), and click the "Create new secret key" button in the top right corner (marked in purple):
<figure>
 <img src="images/openai_api.png" width="100%" align="center"/>
<figcaption> Creating a new API key </figcaption>
</figure>

<br>
<br />

At the bottom of this page, you can see the list of existing API keys (marked in green).

You should store the API key as an environment variable. On macOS and Linux, add the following line to your `.zshrc` or `.bashrc` file:

```bash
export OPENAI_API_KEY="your_api_key"
```
<br>
<br />

## Getting Started with the OpenAI API

Let's start by creating a client object and registering the API key. We will load the environment variable using the `os.environ.get()` method:

In [5]:
api_key = os.environ.get("OPENAI_API_KEY")
client = OpenAI(api_key= api_key)
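
`os.environ.get()` returns `None` when the variable is not set, so a missing key can go unnoticed until later in the notebook. The following is a minimal sketch of a fail-fast check. The helper name `get_api_key` is hypothetical and not part of the `openai` package:

```python
import os


def get_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in `env_var`, or raise a readable error."""
    key = os.environ.get(env_var)
    if not key:
        # Covers both a missing variable and an empty string
        raise RuntimeError(
            f"Environment variable {env_var} is not set. "
            "Export it in your shell profile and restart the notebook."
        )
    return key
```

With this helper, the client could be created as `OpenAI(api_key=get_api_key())`.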

You can use the `models.list()` method to get the list of available models on the API:

In [6]:
models_list = client.models.list()

In [7]:
pd.DataFrame(models_list.data, columns=["id","created", "object", "owned_by"])

Unnamed: 0,id,created,object,owned_by
0,"(id, gpt-4o-realtime-preview-2024-12-17)","(created, 1733945430)","(object, model)","(owned_by, system)"
1,"(id, gpt-4o-audio-preview-2024-12-17)","(created, 1734034239)","(object, model)","(owned_by, system)"
2,"(id, gpt-4-1106-preview)","(created, 1698957206)","(object, model)","(owned_by, system)"
3,"(id, dall-e-3)","(created, 1698785189)","(object, model)","(owned_by, system)"
4,"(id, dall-e-2)","(created, 1698798177)","(object, model)","(owned_by, system)"
...,...,...,...,...
67,"(id, whisper-1)","(created, 1677532384)","(object, model)","(owned_by, openai-internal)"
68,"(id, gpt-4.1-2025-04-14)","(created, 1744315746)","(object, model)","(owned_by, system)"
69,"(id, omni-moderation-latest)","(created, 1731689265)","(object, model)","(owned_by, system)"
70,"(id, o4-mini-2025-04-16)","(created, 1744133506)","(object, model)","(owned_by, system)"
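
In the output above, each cell holds a `(field, value)` pair rather than the value alone. One likely explanation is that the objects in `models_list.data` iterate as field-value pairs when pandas unpacks them. A possible workaround, sketched below, is to read the attributes explicitly before building the dataframe. The helper name `models_to_records` is hypothetical, and the stand-in objects only mimic the four attributes shown in the output:

```python
from types import SimpleNamespace


def models_to_records(models, fields=("id", "created", "object", "owned_by")):
    """Convert model objects into plain dictionaries, one per model."""
    return [{field: getattr(m, field) for field in fields} for m in models]


# Stand-in objects that mimic the attributes of the entries in models_list.data
demo_models = [
    SimpleNamespace(id="gpt-4.1-2025-04-14", created=1744315746,
                    object="model", owned_by="system"),
    SimpleNamespace(id="whisper-1", created=1677532384,
                    object="model", owned_by="openai-internal"),
]

records = models_to_records(demo_models)
print(records[0]["id"])
```

The resulting list of dictionaries can be passed to `pd.DataFrame(...)`, for example `pd.DataFrame(models_to_records(models_list.data))`.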


We will use the [chat.completions](https://platform.openai.com/docs/api-reference/chat/create) method to send prompts to the API. Here are the key arguments of the function:

* `model`: The name of the model to use for generating completions. We will use the `gpt-4.1` model for chat completions.
* `messages`: A list of messages that represent the conversation between the user and the model. Each message should have a role (such as "system", "developer", "user", or "assistant") and content.
* `temperature`: Controls the randomness of the generated text. Higher values result in more random outputs, while lower values result in more deterministic outputs.
* `max_tokens`: The maximum number of tokens to generate in the response.

Throughout this workshop, we will set the `temperature` argument to 0 to make the results as reproducible as possible.
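
The arguments listed above can be collected into a single dictionary before the call. This is only a sketch: `build_request` is a hypothetical helper, and its defaults simply mirror the values used in this notebook:

```python
def build_request(user, system=None, model="gpt-4.1", temperature=0, max_tokens=250):
    """Collect the chat completion arguments into one dictionary."""
    messages = []
    if system is not None:
        # The system message, when given, comes before the user message
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": user})
    return {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


request = build_request("Write a SQL query that counts the rows in a table.")
print(request["messages"])
```

The dictionary can then be unpacked into the call, as in `client.chat.completions.create(**request)`.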

We will start with a simple prompt that provides the model with some context and asks it to generate a query. The OpenAI API requires the `messages` argument to be a list of dictionaries with two keys: `"role"` and `"content"`. The `"role"` key should have a value such as `"system"`, `"developer"`, `"user"`, or `"assistant"`, while the `"content"` key should contain the actual content of the message. We will start by simply defining a user message:

In [8]:
user1 = """
I am working with a dataset that contains information about Chicago crime incidents. 
I want to create a SQL query to pull the total number of crimes that ended in arrest by a year
"""

In [9]:
response1 = client.chat.completions.create(
  model="gpt-4.1",
  messages=[
         {"role": "user", "content": user1}
  ],
  temperature= 0, 
  max_tokens=250
)

Let's print the response:

In [10]:
print(response1.choices[0].message.content)

Certainly! Assuming your table is named `chicago_crimes` and has at least the following columns:

- `arrest` (a boolean or string indicating if an arrest was made, e.g., `TRUE`/`FALSE` or `'Y'`/`'N'`)
- `date` (a date or datetime column indicating when the crime occurred)

Here’s a sample SQL query to get the total number of crimes that ended in arrest, grouped by year:

```sql
SELECT 
    EXTRACT(YEAR FROM date) AS year,
    COUNT(*) AS total_arrests
FROM 
    chicago_crimes
WHERE 
    arrest = TRUE  -- or 'Y' if it's a string
GROUP BY 
    year
ORDER BY 
    year;
```

**Notes:**
- If your `arrest` column is a string (`'Y'`/`'N'`), change `arrest = TRUE` to `arrest = 'Y'`.
- If your SQL dialect does not support `EXTRACT(YEAR FROM date)`, you can use `YEAR(date)` (MySQL, SQL Server) or `strftime('%Y', date)` (SQLite).

**Example for MySQL:
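
The response above stops mid-sentence, which is consistent with reaching the `max_tokens=250` limit. The Chat Completions API reports why generation stopped in the `finish_reason` field of each choice, with the value `"length"` when the token limit was reached. A minimal sketch of such a check follows. The helper name `was_truncated` is hypothetical, and the stand-in object only mimics the relevant part of a response:

```python
from types import SimpleNamespace


def was_truncated(response):
    """Return True if the first choice stopped because it hit the token limit."""
    return response.choices[0].finish_reason == "length"


# Stand-in for a response object, used only to illustrate the check
demo_response = SimpleNamespace(choices=[SimpleNamespace(finish_reason="length")])
print(was_truncated(demo_response))  # True
```

When the check returns `True`, raising `max_tokens` or asking for a shorter answer are two possible remedies.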


As you can see, the model's response is generic due to a lack of context: the model has to make assumptions about the table name, the column names, and their attributes. We can improve this by providing better context in the next example:

In [11]:
user2 = """I am working with a dataset that contains information about Chicago crime incidents. 
        The table name is 'chicago_crime'. I want to create a SQL query to pull the total number of crimes that ended in arrest by a year. 
        The columns are: id, case_number, date, block, primary_type, description, location_description, arrest, domestic, beat, district, ward, community_area, fbi_code, x_coordinate, y_coordinate, year, updated_on. Please provide the SQL query.
        """

In [12]:
response2 = client.chat.completions.create(
  model="gpt-4.1",
  messages=[
         {"role": "user", "content": user2}
  ],
  temperature= 0, 
  max_tokens=250
)

In [13]:
print(response2.choices[0].message.content)

Certainly! To get the total number of crimes that ended in arrest by year from your `chicago_crime` table, you can use the following SQL query:

```sql
SELECT year, COUNT(*) AS total_arrests
FROM chicago_crime
WHERE arrest = TRUE
GROUP BY year
ORDER BY year;
```

**Explanation:**
- `WHERE arrest = TRUE` filters only the rows where an arrest was made.
- `GROUP BY year` groups the results by the `year` column.
- `COUNT(*) AS total_arrests` counts the number of crimes with arrests for each year.
- `ORDER BY year` sorts the results by year.

If your database uses `'Y'`/`'N'` or `1`/`0` instead of `TRUE`/`FALSE` for the `arrest` column, adjust the `WHERE` clause accordingly:

- For `'Y'`/`'N'`: `WHERE arrest = 'Y'`
- For `1`/`0`: `WHERE arrest = 1`

Let me know if you need the query in a different format!


Let's copy the SQL code and test it:

In [14]:
sql = """
SELECT year, COUNT(*) AS total_arrests
FROM chicago_crime
WHERE arrest = TRUE
GROUP BY year
ORDER BY year;"""

In [15]:
duckdb.sql(sql).show()

┌───────┬───────────────┐
│ year  │ total_arrests │
│ int64 │     int64     │
├───────┼───────────────┤
│  2023 │         31990 │
│  2024 │         35376 │
│  2025 │         12555 │
└───────┴───────────────┘



## Setting Templates 

In this section, we will generalize the prompt by setting a prompt template. We start with a dedicated template for the Chicago Crime dataset:

In [16]:
prompt_template = """I am working with a dataset that contains information about Chicago crime incidents. 
                    The table name is 'chicago_crime'. I want to create a SQL query that answers the following question: "{}". 
                    The columns are: id, case_number, date, block, primary_type, description, location_description, arrest, domestic, beat, district, ward, community_area, fbi_code, x_coordinate, y_coordinate, year, updated_on. Please provide the SQL query.
                """

In [17]:
user_question = "How many cases ended up with arrest during 2024?"
user3 = prompt_template.format(user_question)

In [18]:
response3 =  client.chat.completions.create(
  model="gpt-4.1",
  messages=[
         {"role": "user", "content": user3}
  ],
  temperature= 0, 
  max_tokens=250
)

In [19]:
print(response3.choices[0].message.content)

Certainly! To find out **how many cases ended up with arrest during 2024**, you need to count the rows where `arrest` is `TRUE` (or `'true'`/`'Y'` depending on your data type) and `year` is 2024.

Assuming `arrest` is a boolean or a value like `'true'`/`'Y'`, here is the SQL query:

```sql
SELECT COUNT(*) AS arrest_cases_2024
FROM chicago_crime
WHERE arrest = TRUE
  AND year = 2024;
```

If `arrest` is stored as a string (e.g., `'true'`, `'Y'`, `'Yes'`), adjust accordingly:

```sql
SELECT COUNT(*) AS arrest_cases_2024
FROM chicago_crime
WHERE arrest = 'true'
  AND year = 2024;
```
or
```sql
SELECT COUNT(*) AS arrest_cases_2024
FROM chicago_crime
WHERE arrest = 'Y'
  AND year = 2024;
```

**Note:**  
- If you want to be extra sure, you can also filter by the `date` column if it contains the full


Let's evaluate the query with the `duckdb` `sql` method:

In [20]:
sql = """
SELECT COUNT(*) AS arrest_cases_2024
FROM chicago_crime
WHERE arrest = TRUE
  AND year = 2024;
  """

In [21]:
duckdb.sql(sql).show()

┌───────────────────┐
│ arrest_cases_2024 │
│       int64       │
├───────────────────┤
│             35376 │
└───────────────────┘



## Generalize the Template

Let's now generalize the template so that it can be applied to any table. We will split the template into two parts: one for the user and another for the system. The system part will be a prompt that describes the table's structure, while the user part will ask for a query. Here is the system template:

In [22]:
system_template = """
Given the following SQL table, your job is to write queries given a user’s request. \n
CREATE TABLE {} ({}) \n
"""

Here, the two placeholders hold the table name and the table schema, which we will define as a SQL `CREATE TABLE` statement. Likewise, we will define the user template as:

In [23]:
user_template = "Write a SQL query that returns - {}"

Next, we want to auto-generate the table schema by applying the SQL `DESCRIBE` command to the table, leveraging `duckdb` for the SQL execution:

In [24]:
table_name = "chicago_crime" 
tbl_description = duckdb.sql("DESCRIBE SELECT * FROM " + table_name +  ";")
tbl_description

┌──────────────────────┬──────────────┬─────────┬─────────┬─────────┬─────────┐
│     column_name      │ column_type  │  null   │   key   │ default │  extra  │
│       varchar        │   varchar    │ varchar │ varchar │ varchar │ varchar │
├──────────────────────┼──────────────┼─────────┼─────────┼─────────┼─────────┤
│ id                   │ BIGINT       │ YES     │ NULL    │ NULL    │ NULL    │
│ case_number          │ VARCHAR      │ YES     │ NULL    │ NULL    │ NULL    │
│ datetime             │ VARCHAR      │ YES     │ NULL    │ NULL    │ NULL    │
│ block                │ VARCHAR      │ YES     │ NULL    │ NULL    │ NULL    │
│ iucr                 │ VARCHAR      │ YES     │ NULL    │ NULL    │ NULL    │
│ primary_type         │ VARCHAR      │ YES     │ NULL    │ NULL    │ NULL    │
│ description          │ VARCHAR      │ YES     │ NULL    │ NULL    │ NULL    │
│ location_description │ VARCHAR      │ YES     │ NULL    │ NULL    │ NULL    │
│ arrest               │ BOOLEAN      │ 

We will parse the column names and their types from the table description:

In [25]:
col_attr = tbl_description.df()[["column_name", "column_type"]].copy()
# Build "name TYPE" pairs and join them into a comma-separated schema string
col_attr["column_joint"] = col_attr["column_name"] + " " + col_attr["column_type"]
col_names = ", ".join(col_attr["column_joint"])
col_names

'id BIGINT, case_number VARCHAR, datetime VARCHAR, block VARCHAR, iucr VARCHAR, primary_type VARCHAR, description VARCHAR, location_description VARCHAR, arrest BOOLEAN, domestic BOOLEAN, beat BIGINT, district BIGINT, ward DOUBLE, community_area DOUBLE, fbi_code VARCHAR, x_coordinate DOUBLE, y_coordinate DOUBLE, year BIGINT, updated_on TIMESTAMP_NS, latitude DOUBLE, longitude DOUBLE'

We can now use it to set the table schema on the system template:

In [26]:
system = system_template.format(table_name, col_names)
print(system)


Given the following SQL table, your job is to write queries given a user’s request. 

CREATE TABLE chicago_crime (id BIGINT, case_number VARCHAR, datetime VARCHAR, block VARCHAR, iucr VARCHAR, primary_type VARCHAR, description VARCHAR, location_description VARCHAR, arrest BOOLEAN, domestic BOOLEAN, beat BIGINT, district BIGINT, ward DOUBLE, community_area DOUBLE, fbi_code VARCHAR, x_coordinate DOUBLE, y_coordinate DOUBLE, year BIGINT, updated_on TIMESTAMP_NS, latitude DOUBLE, longitude DOUBLE) 




Likewise, we will set the user template by adding the question:

In [27]:
question = "How many cases ended up with arrest during 2024?"
user = user_template.format(question)
print(user)

Write a SQL query that returns - How many cases ended up with arrest during 2024?


Let's now set the `messages` argument by defining the system and user roles with the system and user templates:

In [28]:
response4 = client.chat.completions.create(
  model="gpt-4.1",
  messages=[
          {"role": "system", "content": system},
          {"role": "user", "content": user}
  ],
  temperature= 0, 
  max_tokens=150
)

In [29]:
print(response4.choices[0].message.content)

```sql
SELECT COUNT(*) AS arrest_cases_2024
FROM chicago_crime
WHERE arrest = TRUE
  AND year = 2024;
```
This query counts the number of cases in 2024 where an arrest was made.


Note that the returned SQL code comes in a markdown code chunk format (i.e., wrapped in a fenced `sql` block). Let's add the following sentence to the system prompt to ensure that the output is clean SQL code:
```
 Return just the SQL query as plain text, without additional text.
```

In [30]:
system_template = """
Given the following SQL table, your job is to write queries given a user’s request. Return just the SQL query as plain text, without additional text and don't use markdown format. \n
CREATE TABLE {} ({}) \n
"""
table_name = "chicago_crime"
system = system_template.format(table_name, col_names)

Let's rerun the same question:

In [31]:
response5 = client.chat.completions.create(
  model="gpt-4.1",
  messages=[
          {"role": "system", "content": system},
          {"role": "user", "content": user}
  ],
  temperature= 0, 
  max_tokens=150
)

In [32]:
print(response5.choices[0].message.content)

SELECT COUNT(*) 
FROM chicago_crime 
WHERE arrest = TRUE AND year = 2024;


As you can see, this time the model returned a clean SQL query, which we can pass directly to `duckdb`:

In [33]:
sql = response5.choices[0].message.content
duckdb.sql(sql).show()

┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│        35376 │
└──────────────┘



As we cannot ensure that the model will always return the code without the markdown code chunk format, we can use the following two helper functions. The first checks whether the text contains a markdown code chunk, and the second parses the chunk and returns the clean code:

In [34]:
import re

def is_markdown_code_chunk(text):
    """
    Checks if the given text is in Markdown code chunk format.

    Args:
        text (str): The text to check.

    Returns:
        bool: True if the text is in Markdown code chunk format, False otherwise.
    """
    pattern = r"```[^`]*```"
    return bool(re.search(pattern, text, re.DOTALL))

def extract_code_from_markdown(markdown_text):
    """
    Extracts code from a Markdown code chunk.

    Args:
        markdown_text (str): The Markdown text containing the code chunk.

    Returns:
        str: The extracted code.
    """
    pattern = r"```(.*?)\n(?P<code>.*?)\n```"
    match = re.search(pattern, markdown_text, re.DOTALL)
    if match:
        return match.group("code")
    else:
        return None
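
The two helpers above can also be folded into a single step that returns the SQL whether or not the model wrapped it in a code fence. The following is a sketch of that idea, not the notebook's own implementation. The function name `strip_code_fence` is hypothetical:

```python
import re


def strip_code_fence(text):
    """Return the code inside the first markdown fence, or the text itself if unfenced."""
    # Opening fence with an optional language tag, then the body, then the closing fence
    match = re.search(r"```(?:\w+)?\s*\n(.*?)\n```", text, re.DOTALL)
    if match:
        return match.group(1).strip()
    return text.strip()


fenced = "```sql\nSELECT COUNT(*)\nFROM chicago_crime;\n```\nThis query counts the rows."
print(strip_code_fence(fenced))
```

Falling back to the stripped input keeps the function safe to call on responses that are already plain SQL.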

## Functionalize the Process

The last stage is to wrap the process in functions that take a user question and return the corresponding data. We start with the `create_message` function, which sets the system and user messages (or prompts) from the templates:

<figure>
 <img src="images/architecture 1.png" width="100%" align="center"/>
<figcaption> The agent architecture</figcaption>
</figure>

<br>
<br />

In [35]:
def create_message(table_name, query):

    # Simple container for the prompts and the table's column metadata
    class message:
        def __init__(self, system, user, column_names, column_attr):
            self.system = system
            self.user = user
            self.column_names = column_names
            self.column_attr = column_attr

    
    system_template = """

    Given the following SQL table, your job is to write queries given a user’s request. Return just the SQL query as plain text, without additional text and don't use markdown format. \n

    CREATE TABLE {} ({}) \n
    """

    user_template = "Write a SQL query that returns - {}"
    
    # Describe the table with DuckDB and build "name TYPE" pairs for the schema
    tbl_describe = duckdb.sql("DESCRIBE SELECT * FROM " + table_name +  ";")
    col_attr = tbl_describe.df()[["column_name", "column_type"]].copy()
    col_attr["column_joint"] = col_attr["column_name"] + " " +  col_attr["column_type"]
    col_names = ", ".join(col_attr["column_joint"])

    system = system_template.format(table_name, col_names)
    user = user_template.format(query)

    m = message(system = system, user = user, column_names = col_attr["column_name"], column_attr = col_attr["column_type"])
    return m

Let's test it:

In [36]:
query = "How many cases ended up with arrest?"
msg = create_message(table_name = "chicago_crime", query = query)

In [37]:
msg.system

"\n\n    Given the following SQL table, your job is to write queries given a user’s request. Return just the SQL query as plain text, without additional text and don't use markdown format. \n\n\n    CREATE TABLE chicago_crime (id BIGINT, case_number VARCHAR, datetime VARCHAR, block VARCHAR, iucr VARCHAR, primary_type VARCHAR, description VARCHAR, location_description VARCHAR, arrest BOOLEAN, domestic BOOLEAN, beat BIGINT, district BIGINT, ward DOUBLE, community_area DOUBLE, fbi_code VARCHAR, x_coordinate DOUBLE, y_coordinate DOUBLE, year BIGINT, updated_on TIMESTAMP_NS, latitude DOUBLE, longitude DOUBLE) \n\n    "

In [38]:
message = [
    {
      "role": "system",
      "content": msg.system
    },
    {
      "role": "user",
      "content": msg.user
    }
    ]


In [39]:
response6 = client.chat.completions.create(
  model="gpt-4.1",
  messages=message,
  temperature= 0, 
  max_tokens=150
)

print(response6.choices[0].message.content)

SELECT COUNT(*) AS arrest_count FROM chicago_crime WHERE arrest = TRUE;


The next step is to set a function that sends the prompt to the API (i.e., the API handler). In this case, we will set the function to work with the OpenAI API:

<figure>
 <img src="images/architecture 2.png" width="100%" align="center"/>
<figcaption> The agent architecture</figcaption>
</figure>

<br>
<br />

In [40]:
def lang2sql_openai(api_key, table_name, query, model ="gpt-4.1", temperature = 0, max_tokens = 256, frequency_penalty = 0,presence_penalty= 0):
    class lang2sql:
        def __init__(output, message, response, sql):
            output.message = message
            output.response = response
            output.sql = sql      

    client = OpenAI(
    api_key=api_key,
    )

    m = create_message(table_name = table_name, query = query)

    message = [
    {
      "role": "system",
      "content": m.system
    },
    {
      "role": "user",
      "content": m.user
    }
    ]

    response = client.chat.completions.create(
        model=model,
        messages=message,
        temperature= temperature, 
        max_tokens=max_tokens,
        frequency_penalty = frequency_penalty,
        presence_penalty = presence_penalty
    )

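    sql = response.choices[0].message.content
    # If the model wrapped the query in a markdown code chunk anyway, extract the code
    if is_markdown_code_chunk(sql):
        sql = extract_code_from_markdown(sql) or sql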

    output = lang2sql(message = m, response = response, sql = sql)
    return output

The function returns a clean SQL query:

In [41]:
query = lang2sql_openai(api_key = api_key, 
                    table_name = "chicago_crime", 
                    query = "How many cases ended up with arrest?")

In [42]:
print(query.sql)

duckdb.sql(query.sql).show()

SELECT COUNT(*) AS arrest_count FROM chicago_crime WHERE arrest = TRUE;
┌──────────────┐
│ arrest_count │
│    int64     │
├──────────────┤
│        79921 │
└──────────────┘



Last but not least, we will set a wrapper function that generates the query and pulls the data from the table:
<figure>
 <img src="images/architecture 3.png" width="100%" align="center"/>
<figcaption> The agent architecture</figcaption>
</figure>

<br>
<br />

In [43]:
def lang2answer_openai(api_key, table_name, query, model ="gpt-4.1", temperature = 0, max_tokens = 256, frequency_penalty = 0,presence_penalty= 0, verbose=False):
    query = lang2sql_openai(api_key = api_key,
                    table_name = table_name,
                    query = query, 
                    model = model,
                    temperature = temperature,
                    max_tokens = max_tokens,
                    frequency_penalty = frequency_penalty,
                    presence_penalty = presence_penalty
                    )
    if verbose:
        print(query.sql)
    output = duckdb.sql(query.sql)
    return output
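
Because the wrapper executes whatever SQL the model returns, it may be worth checking the statement before running it. The sketch below is a coarse, assumption-laden guard that only accepts a single statement starting with `SELECT` or `WITH`. The function name `is_read_only_query` is hypothetical, and the check is not a substitute for proper database permissions:

```python
import re


def is_read_only_query(sql):
    """Return True if `sql` looks like a single SELECT (or WITH ... SELECT) statement."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:
        # More than one statement, or a semicolon inside a literal: reject to be safe
        return False
    return re.match(r"(?i)(select|with)\b", statement) is not None


print(is_read_only_query("SELECT COUNT(*) FROM chicago_crime WHERE arrest = TRUE;"))  # True
print(is_read_only_query("DROP TABLE chicago_crime"))  # False
```

One way to use it would be to call it on `query.sql` inside the wrapper and raise an error instead of executing when it returns `False`.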
    


Let's test it out!

In [44]:
lang2answer_openai(api_key = api_key, 
            table_name = "chicago_crime", 
            query = "How many cases ended up with arrest?",
            verbose=True)

SELECT COUNT(*) AS arrest_count FROM chicago_crime WHERE arrest = TRUE;


┌──────────────┐
│ arrest_count │
│    int64     │
├──────────────┤
│        79921 │
└──────────────┘

In [45]:
lang2answer_openai(api_key = api_key, 
            table_name = "chicago_crime", 
            query = "Summarize the cases by primary type",
            verbose=True)

SELECT primary_type, COUNT(*) AS case_count
FROM chicago_crime
GROUP BY primary_type
ORDER BY case_count DESC


┌───────────────────────────────────┬────────────┐
│           primary_type            │ case_count │
│              varchar              │   int64    │
├───────────────────────────────────┼────────────┤
│ THEFT                             │     134807 │
│ BATTERY                           │     103019 │
│ CRIMINAL DAMAGE                   │      66227 │
│ MOTOR VEHICLE THEFT               │      55945 │
│ ASSAULT                           │      52618 │
│ OTHER OFFENSE                     │      38416 │
│ DECEPTIVE PRACTICE                │      37709 │
│ ROBBERY                           │      21977 │
│ BURGLARY                          │      18314 │
│ WEAPONS VIOLATION                 │      18250 │
│         ·                         │         ·  │
│         ·                         │         ·  │
│         ·                         │         ·  │
│ CONCEALED CARRY LICENSE VIOLATION │        510 │
│ LIQUOR LAW VIOLATION              │        453 │
│ INTIMIDATION                 

In [46]:
lang2answer_openai(api_key = api_key, 
            table_name = "chicago_crime", 
            query = "How many cases is the type of robbery?",
            verbose=True)

SELECT COUNT(*) FROM chicago_crime WHERE primary_type = 'ROBBERY'


┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│        21977 │
└──────────────┘

In [47]:
lang2answer_openai(api_key = api_key, 
            table_name = "chicago_crime", 
            query = "Show me cases that are type of robbery",
            verbose=True)

SELECT * FROM chicago_crime WHERE primary_type = 'ROBBERY'


┌──────────┬─────────────┬─────────────────────┬────────────────────────┬─────────┬──────────────┬────────────────────────────────────┬──────────────────────┬─────────┬──────────┬───────┬──────────┬────────┬────────────────┬──────────┬──────────────┬──────────────┬───────┬─────────────────────┬──────────────┬───────────────┐
│    id    │ case_number │      datetime       │         block          │  iucr   │ primary_type │            description             │ location_description │ arrest  │ domestic │ beat  │ district │  ward  │ community_area │ fbi_code │ x_coordinate │ y_coordinate │ year  │     updated_on      │   latitude   │   longitude   │
│  int64   │   varchar   │       varchar       │        varchar         │ varchar │   varchar    │              varchar               │       varchar        │ boolean │ boolean  │ int64 │  int64   │ double │     double     │ varchar  │    double    │    double    │ int64 │    timestamp_ns     │    double    │    double     │
├──────────┼───────

In [48]:
lang2answer_openai(api_key = api_key, 
            table_name = "chicago_crime", 
            query = "which district had the highest number of arrests in 2024?",
            verbose=True)

SELECT district
FROM chicago_crime
WHERE arrest = TRUE AND year = 2024
GROUP BY district
ORDER BY COUNT(*) DESC
LIMIT 1;


┌──────────┐
│ district │
│  int64   │
├──────────┤
│       11 │
└──────────┘