# Convert `SQL` examples to `JSON` format

We have a set of examples in a [SQL script](examples/examples.sql). We want to convert these examples into JSON format to make them easier to load into LangChain.


## Read examples from file

First, let's load in the examples from the SQL script:

In [1]:
with open('examples/examples.sql', encoding='utf-8') as f:
    script_contents = f.read()

Let's take a look at part of the SQL script contents:

In [2]:
print(script_contents[:500])

/*
What is the most popular media type among all the tracks?
*/
SELECT 
    MediaType.Name AS media_type,
    COUNT(Track.TrackId) AS track_count
FROM Track
    INNER JOIN 
        MediaType ON MediaType.MediaTypeId = Track.MediaTypeId
GROUP BY Track.MediaTypeId
ORDER BY track_count DESC
LIMIT 3;

/*
What is the total price for the album "Big Ones"?
*/
SELECT 
    Album.Title AS album_title,
    SUM(Track.UnitPrice) AS album_price
FROM
    Track
    INNER JOIN
        Album ON Album.AlbumId = Tr


## Split up examples

The SQL examples are separated from each other by a blank line, that is, two newline characters `\n\n`. Start by splitting up the examples:

In [3]:
raw_examples = script_contents.split(sep='\n\n')

# print out the first 3 examples
raw_examples[:3]

['/*\nWhat is the most popular media type among all the tracks?\n*/\nSELECT \n    MediaType.Name AS media_type,\n    COUNT(Track.TrackId) AS track_count\nFROM Track\n    INNER JOIN \n        MediaType ON MediaType.MediaTypeId = Track.MediaTypeId\nGROUP BY Track.MediaTypeId\nORDER BY track_count DESC\nLIMIT 3;',
 '/*\nWhat is the total price for the album "Big Ones"?\n*/\nSELECT \n    Album.Title AS album_title,\n    SUM(Track.UnitPrice) AS album_price\nFROM\n    Track\n    INNER JOIN\n        Album ON Album.AlbumId = Track.AlbumId\nWHERE\n    Album.Title = \'Big Ones\'\nGROUP BY\n    Track.AlbumId;',
 '/*\nWhat is the best-selling track of all time?\n*/\nSELECT \n    Track.Name AS track_name, \n    SUM(\n        InvoiceLine.Quantity * InvoiceLine.UnitPrice\n    ) AS total_sales\nFROM\n    InvoiceLine\n    INNER JOIN\n        Track ON InvoiceLine.TrackId = Track.TrackId\nGROUP BY \n    InvoiceLine.TrackId\nORDER BY \n    total_sales DESC\nLIMIT 5;']
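The plain `split` above relies on exactly one blank line between examples and no blank line at the end of the file; a trailing blank line would leave an empty string in the list. Here is a minimal sketch of a more forgiving split, assuming examples are always separated by one or more blank lines. The helper name `split_examples` and the sample script are made up for illustration:

```python
import re


def split_examples(script: str) -> list[str]:
    """Split a SQL script into examples separated by one or more blank lines."""
    # Two or more consecutive newlines count as a single separator.
    chunks = re.split(r"\n{2,}", script.strip())
    # Drop chunks that are empty after trimming.
    return [chunk.strip() for chunk in chunks if chunk.strip()]


# Made-up script with an extra blank line in the middle and at the end.
sample_script = (
    "/*\nHow many employees are there?\n*/\nSELECT COUNT(*) FROM Employee;\n\n\n"
    "/*\nHow many albums are there?\n*/\nSELECT COUNT(*) FROM Album;\n\n"
)
sample_chunks = split_examples(sample_script)
print(len(sample_chunks))  # 2
```

Like the original split, this would still break a query that contains a blank line of its own.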

## Process one example

To make things simple and clear, let's first work with only one example:

In [18]:
an_example = raw_examples[2]

an_example

'/*\nWhat is the best-selling track of all time?\n*/\nSELECT \n    Track.Name AS track_name, \n    SUM(\n        InvoiceLine.Quantity * InvoiceLine.UnitPrice\n    ) AS total_sales\nFROM\n    InvoiceLine\n    INNER JOIN\n        Track ON InvoiceLine.TrackId = Track.TrackId\nGROUP BY \n    InvoiceLine.TrackId\nORDER BY \n    total_sales DESC\nLIMIT 5;'

### Split into comment and query parts

Given an SQL example like this:

```sqlite
/*
How many employees are there?
*/
SELECT COUNT(*) FROM Employee;
```

We want to create a JSON example like this:
```json
{
    "input": "How many employees are there?",
    "query": "SELECT COUNT(*) FROM Employee;"
}
```

Let's split an example into a `comment` part and a `query` part. We can split at the closing comment tag `*/`. The tag sits on its own line, so the separator is `\n*/\n`: the tag with a newline character on each side.

In [29]:
comment, query = an_example.split('\n*/\n')

print(
    f"Comment:\n{repr(comment)}\n"
)
print(
    f"Query:\n{repr(query)}"
)

Comment:
'/*\nWhat is the best-selling track of all time?'

Query:
'SELECT \n    Track.Name AS track_name, \n    SUM(\n        InvoiceLine.Quantity * InvoiceLine.UnitPrice\n    ) AS total_sales\nFROM\n    InvoiceLine\n    INNER JOIN\n        Track ON InvoiceLine.TrackId = Track.TrackId\nGROUP BY \n    InvoiceLine.TrackId\nORDER BY \n    total_sales DESC\nLIMIT 5;'
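Unpacking the result of `split` into two names raises a `ValueError` if the separator is missing or appears more than once. As a sketch, `str.partition` always returns three parts and splits only at the first match, assuming the first `\n*/\n` is the one that ends the comment. The name `split_comment_and_query` is hypothetical:

```python
def split_comment_and_query(example: str) -> tuple[str, str]:
    """Split an example at the first closing comment tag."""
    comment, separator, query = example.partition("\n*/\n")
    if not separator:
        # partition returns an empty separator when nothing matched.
        raise ValueError("example has no closing comment tag on its own line")
    return comment, query


sample = "/*\nHow many employees are there?\n*/\nSELECT COUNT(*) FROM Employee;"
sample_comment, sample_query = split_comment_and_query(sample)
print(repr(sample_comment))  # '/*\nHow many employees are there?'
print(repr(sample_query))    # 'SELECT COUNT(*) FROM Employee;'
```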


### Clean up comment

Clean up the `comment` part by removing the opening comment tag `/*` and the newline `\n` that follows it from the beginning of the string.

In [30]:
comment

'/*\nWhat is the best-selling track of all time?'

In [31]:
cleaned_comment = comment.removeprefix('/*\n')
cleaned_comment

'What is the best-selling track of all time?'
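`str.removeprefix` is only available from Python 3.9 onward. On older versions, a small stand-in along these lines should behave the same way. The name `strip_opening_tag` and the sample comment are made up for illustration. The sketch also shows why `lstrip` is not a substitute: it treats its argument as a set of characters rather than as a prefix.

```python
def strip_opening_tag(comment: str, prefix: str = "/*\n") -> str:
    """Remove `prefix` from the start of `comment` if it is there."""
    if comment.startswith(prefix):
        return comment[len(prefix):]
    return comment


tricky_comment = "/*\n*Which* track sells best?"
print(repr(strip_opening_tag(tricky_comment)))  # '*Which* track sells best?'
# lstrip keeps removing '/', '*' and newline characters, so it also eats the first '*'.
print(repr(tricky_comment.lstrip("/*\n")))      # 'Which* track sells best?'
```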

### Process the query

Now, let's clean up the `query` part. The SQL query is spread over multiple lines and indented to make it readable. However, this extra whitespace may end up confusing the LLM, so we'll compress the query into a single line.

Let's see how we can do this. Start by splitting the query into a list of lines:

In [32]:
query

'SELECT \n    Track.Name AS track_name, \n    SUM(\n        InvoiceLine.Quantity * InvoiceLine.UnitPrice\n    ) AS total_sales\nFROM\n    InvoiceLine\n    INNER JOIN\n        Track ON InvoiceLine.TrackId = Track.TrackId\nGROUP BY \n    InvoiceLine.TrackId\nORDER BY \n    total_sales DESC\nLIMIT 5;'

In [33]:
query_lines = query.splitlines()
query_lines

['SELECT ',
 '    Track.Name AS track_name, ',
 '    SUM(',
 '        InvoiceLine.Quantity * InvoiceLine.UnitPrice',
 '    ) AS total_sales',
 'FROM',
 '    InvoiceLine',
 '    INNER JOIN',
 '        Track ON InvoiceLine.TrackId = Track.TrackId',
 'GROUP BY ',
 '    InvoiceLine.TrackId',
 'ORDER BY ',
 '    total_sales DESC',
 'LIMIT 5;']

For each line in the query, strip out leading and trailing whitespace:

In [34]:
query_lines_stripped = [line.strip() for line in query_lines]
query_lines_stripped

['SELECT',
 'Track.Name AS track_name,',
 'SUM(',
 'InvoiceLine.Quantity * InvoiceLine.UnitPrice',
 ') AS total_sales',
 'FROM',
 'InvoiceLine',
 'INNER JOIN',
 'Track ON InvoiceLine.TrackId = Track.TrackId',
 'GROUP BY',
 'InvoiceLine.TrackId',
 'ORDER BY',
 'total_sales DESC',
 'LIMIT 5;']

Join the lines into a single string separated by a single space `' '`. Now the query is a single long line.

In [35]:
query_one_line = " ".join(query_lines_stripped)
query_one_line

'SELECT Track.Name AS track_name, SUM( InvoiceLine.Quantity * InvoiceLine.UnitPrice ) AS total_sales FROM InvoiceLine INNER JOIN Track ON InvoiceLine.TrackId = Track.TrackId GROUP BY InvoiceLine.TrackId ORDER BY total_sales DESC LIMIT 5;'
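The three steps can also be collapsed into a single expression. This is an alternative sketch, not the approach used in the rest of this notebook: `str.split()` with no arguments splits on any run of whitespace, so joining the pieces with one space flattens newlines and indentation in one go. The sample query is made up for illustration:

```python
sample_query = "SELECT \n    COUNT(*) AS employee_count\nFROM\n    Employee;"

# Line-by-line approach: split into lines, strip each line, join with spaces.
line_based = " ".join(line.strip() for line in sample_query.splitlines())

# Shortcut: split on any whitespace run, then join with single spaces.
token_based = " ".join(sample_query.split())

print(token_based)                # SELECT COUNT(*) AS employee_count FROM Employee;
print(line_based == token_based)  # True
```

One difference to keep in mind: the shortcut also collapses repeated spaces inside a line, including inside string literals, while the line-by-line version only touches the ends of each line.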

### Combine into a dictionary

Combine the cleaned comment and the one-line query into a Python `dict`, which maps directly onto the JSON object we are aiming for:

In [36]:
example_dict = {
    "input": cleaned_comment,
    "query": query_one_line,
}
example_dict

{'input': 'What is the best-selling track of all time?',
 'query': 'SELECT Track.Name AS track_name, SUM( InvoiceLine.Quantity * InvoiceLine.UnitPrice ) AS total_sales FROM InvoiceLine INNER JOIN Track ON InvoiceLine.TrackId = Track.TrackId GROUP BY InvoiceLine.TrackId ORDER BY total_sales DESC LIMIT 5;'}
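To see how such a dictionary looks once serialized, `json.dumps` turns it into a JSON string. A small sketch using the employee example from earlier in place of the real data:

```python
import json

sample_dict = {
    "input": "How many employees are there?",
    "query": "SELECT COUNT(*) FROM Employee;",
}

# indent=4 pretty-prints the object with one key per line.
sample_json = json.dumps(sample_dict, indent=4)
print(sample_json)
```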

## Helper Function

Let's take what we have learned above and wrap it in two helper functions to make it easier to reuse. `process_query` flattens a query into one line, and `process_example` processes a single example:

In [37]:
def process_query(query: str) -> str:
    # Split query by new line `\n`
    query_lines = query.splitlines()

    # Remove leading and trailing spaces
    query_lines_stripped = [
        line.strip() 
        for line in query_lines
    ]

    # Join by single space character into one line
    return " ".join(query_lines_stripped)

In [38]:
def process_example(
    an_example: str
) -> dict[str, str]:

    # Split example by closing comment tag `*/` into
    # the comment and query parts.
    comment, query = an_example.split('\n*/\n')

    # Remove the opening comment tag `/*`.
    comment = comment.removeprefix('/*\n')

    # Transform multi-line query into one-line query
    query = process_query(query)
    
    return {
        "input": comment,
        "query": query,
    }

Let's try out the function on one example and see what we get back:

In [39]:
process_example(an_example)

{'input': 'What is the best-selling track of all time?',
 'query': 'SELECT Track.Name AS track_name, SUM( InvoiceLine.Quantity * InvoiceLine.UnitPrice ) AS total_sales FROM InvoiceLine INNER JOIN Track ON InvoiceLine.TrackId = Track.TrackId GROUP BY InvoiceLine.TrackId ORDER BY total_sales DESC LIMIT 5;'}

Great! So we can see that it works. 😃
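Eyeballing the output is fine for one example, but an assertion gives a repeatable check. The sketch below uses the employee example from the section on splitting. `process_example_compact` is a hypothetical, condensed restatement of the helper, included only so that the snippet runs on its own:

```python
def process_example_compact(an_example: str) -> dict[str, str]:
    # Same steps as the helper above: split, strip the opening tag, flatten the query.
    comment, query = an_example.split("\n*/\n")
    one_line_query = " ".join(line.strip() for line in query.splitlines())
    return {"input": comment.removeprefix("/*\n"), "query": one_line_query}


employee_example = (
    "/*\nHow many employees are there?\n*/\nSELECT COUNT(*) FROM Employee;"
)
expected = {
    "input": "How many employees are there?",
    "query": "SELECT COUNT(*) FROM Employee;",
}

# Fails loudly if the helper ever stops producing the target shape.
assert process_example_compact(employee_example) == expected
```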


## Process multiple examples

Now that we have seen how it works on one example, we can process and clean up all of the raw examples. A list comprehension applies `process_example` to each one:


In [40]:
clean_examples = [process_example(e) for e in raw_examples]

# take a look at a few processed examples
clean_examples[:3]

[{'input': 'What is the most popular media type among all the tracks?',
  'query': 'SELECT MediaType.Name AS media_type, COUNT(Track.TrackId) AS track_count FROM Track INNER JOIN MediaType ON MediaType.MediaTypeId = Track.MediaTypeId GROUP BY Track.MediaTypeId ORDER BY track_count DESC LIMIT 3;'},
 {'input': 'What is the total price for the album "Big Ones"?',
  'query': "SELECT Album.Title AS album_title, SUM(Track.UnitPrice) AS album_price FROM Track INNER JOIN Album ON Album.AlbumId = Track.AlbumId WHERE Album.Title = 'Big Ones' GROUP BY Track.AlbumId;"},
 {'input': 'What is the best-selling track of all time?',
  'query': 'SELECT Track.Name AS track_name, SUM( InvoiceLine.Quantity * InvoiceLine.UnitPrice ) AS total_sales FROM InvoiceLine INNER JOIN Track ON InvoiceLine.TrackId = Track.TrackId GROUP BY InvoiceLine.TrackId ORDER BY total_sales DESC LIMIT 5;'}]

## Save to JSON file

Now that we have our processed examples, we can save them to a JSON file. This will make it easier for us to load the examples into LangChain later.

In [41]:
import json

def write_to_json(
    examples: list[
        dict[str, str]
    ]
) -> None:
    with open(
        file='examples/examples.json', 
        mode='w', 
        encoding='utf-8'
    ) as f:
        json.dump(examples, f, ensure_ascii=False, indent=4)

In [42]:
write_to_json(examples=clean_examples)
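A quick way to confirm that a file like this is valid JSON is to read it back and compare it with the data that was written. The sketch below does this round trip with a made-up one-item list and a temporary directory, so it does not touch `examples/examples.json`:

```python
import json
import tempfile
from pathlib import Path

sample_examples = [
    {
        "input": "How many employees are there?",
        "query": "SELECT COUNT(*) FROM Employee;",
    },
]

with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = Path(tmp_dir) / "examples.json"

    # Write with the same settings as write_to_json.
    with open(json_path, mode="w", encoding="utf-8") as f:
        json.dump(sample_examples, f, ensure_ascii=False, indent=4)

    # Read the file back and parse it.
    with open(json_path, encoding="utf-8") as f:
        loaded_examples = json.load(f)

print(loaded_examples == sample_examples)  # True
```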