# Customize LLM for Text to SQL Application

In this notebook, we find the best customized model for a text-to-SQL application by running multiple training experiments and picking the best model based on a customized evaluation system. Finally, you can easily deploy the customized model.

We use the [Synthetic Text2SQL 🤗](https://huggingface.co/datasets/gretelai/synthetic_text_to_sql) dataset, which contains pairs of natural language descriptions and corresponding SQL queries, to train our model. Additionally, we provide the model with the SQL context (table definitions and sample data) so that it can generate context-aware queries.

Please install the Python SDK:

In [None]:
!pip install leeroo-client

or install it from source:

In [None]:
!git clone https://github.com/Leeroo-AI/leeroo-client
%cd leeroo-client
!pip install -e .
%cd ..

Leeroo dager supports the following format for the training dataset:

```json
[
    {
        "query": QUERY,
        "response": RESPONSE
    },
    {
        ....
    }
]
```
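
Before uploading a file, it can help to check that it matches this shape. The following is a minimal sketch: `validate_seed_data` is a hypothetical helper, not part of the Leeroo SDK, and it assumes that both fields must be non-empty strings.

```python
import json


def validate_seed_data(records):
    """Return a list of problems found in `records`; an empty list means the data looks valid.

    Assumption: every record must be a dict whose `query` and `response`
    values are non-empty strings.
    """
    if not isinstance(records, list):
        return ["top-level JSON value must be a list"]
    problems = []
    for i, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append(f"record {i}: expected an object, got {type(record).__name__}")
            continue
        for field in ("query", "response"):
            value = record.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"record {i}: missing or empty '{field}'")
    return problems


# Two illustrative records; the second one has no `response` field.
sample = json.loads(
    '[{"query": "List all users.", "response": "SELECT * FROM users;"},'
    ' {"query": "Count rows."}]'
)
print(validate_seed_data(sample))
```

An empty result means every record has both fields; anything else lists the offending record indices.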

In [None]:
import os
import time
import json
import datasets
from tqdm import tqdm
from pprint import pprint
from leeroo_client.client import LeerooClient

In [None]:


dataset = datasets.load_dataset("gretelai/synthetic_text_to_sql")
n_seed_samples = 1000
data = []

for d in tqdm(dataset['train']):
    data.append(
        dict(
            query = f"## sql context :\n{d['sql_context']}\n\n## Query generation task:\n{d['sql_prompt']}\n\n",
            response = d['sql']
        )
    )
    if len(data) == n_seed_samples:
        break

# Use a context manager so the file is flushed and closed after writing
with open('texttosql_seed_data.json', 'w') as f:
    json.dump(data, f)
print(len(data))
pprint(data[-1])
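
The model is trained on prompts in the template used above, so new questions at inference time should probably be formatted the same way. A small sketch of a reusable formatter follows; `build_query` is a hypothetical helper name, and the example schema and question are made up for illustration.

```python
def build_query(sql_context: str, sql_prompt: str) -> str:
    """Format a schema and a natural-language request like the seed-data prompts."""
    return (
        f"## sql context :\n{sql_context}\n\n"
        f"## Query generation task:\n{sql_prompt}\n\n"
    )


# Illustrative inputs, not taken from the dataset.
example = build_query(
    "CREATE TABLE users (id INT, name TEXT);",
    "How many users are there?",
)
print(example)
```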

Create your API key [here](http://app.leeroo.com/dashboard) if you don't have one.

In [None]:
leeroo_api_key = "LEEROO_API_KEY"  # replace with your own API key
client = LeerooClient(
    leeroo_api_key,
)
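
To keep the key out of the notebook itself, one option is to read it from an environment variable. This is only a sketch: the variable name `LEEROO_API_KEY` and the helper `load_api_key` are assumptions for illustration, not something the SDK defines.

```python
import os


def load_api_key(env=os.environ, var_name="LEEROO_API_KEY"):
    """Read the API key from an environment mapping and fail early if it is missing."""
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable before creating the client.")
    return key


# Demonstration with an explicit mapping instead of the real environment.
print(load_api_key({"LEEROO_API_KEY": "example-key"}))
```

The returned string can then be passed to `LeerooClient` in place of the literal key.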

To design the experiment workflow, please provide:

- `evaluation_criteria` (optional): A short natural-language description of the factors that matter to you when scoring the LLM's responses.
- `workflow_name`: The name of this experiment. It is saved along with the workflow ID.
- `seed_data_path`: The path to the seed dataset, a JSON file whose records have `query` and `response` fields.

In [3]:
evaluation_criteria = \
"""
Extract SQL Context:
Review the SQL context given in the input, including table definitions and any sample data inserted into these tables.

Formulate Expected Query:
Based on the task description, determine the logical structure and components of the SQL query that should be generated. For instance, identify the relevant tables, columns, and conditions that should be included in the query.

Check Query Components:
Ensure the generated query includes the correct tables and columns specified in the SQL context.
Verify that the conditions and clauses in the query match the task description. For example, checking for conditions like InvestmentType = 'Bond' and State = 'TX'.

Syntax Validation:
Confirm that the generated query is syntactically correct according to SQL standards. It should be executable without syntax errors.

Logical Accuracy:
Ensure the logic of the query aligns with the task description. For instance, it should correctly aggregate the values as required by the task.

The output should contain only the SQL query, with no explanation.
"""

In [None]:
workflow_configs = client.initialize_workflow_configs(
    evaluation_criteria=evaluation_criteria,
    workflow_name="TextToSqlCheckSensitivityQATask",
    seed_data_path="texttosql_seed_data.json",
    budget=2  # each experiment needs at least 2 units of time; increase it to run more experiments
) 
workflow_configs
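
The "Syntax Validation" criterion in the evaluation criteria asks that a generated query be executable without syntax errors. As a rough local illustration of that idea (not how the Leeroo evaluator works), the sketch below runs the SQL context and then the query in an in-memory SQLite database. It assumes the SQL is SQLite-compatible, which may not hold for every dataset sample; `runs_without_error` and the `Investments` table are hypothetical.

```python
import sqlite3


def runs_without_error(sql_context: str, query: str):
    """Execute `sql_context`, then `query`, in an in-memory SQLite database.

    Returns (True, None) on success or (False, error_message) on failure.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(sql_context)  # create tables and insert sample rows
        conn.execute(query).fetchall()   # run the candidate query
        return True, None
    except (sqlite3.Error, sqlite3.Warning) as exc:
        return False, str(exc)
    finally:
        conn.close()


# Hypothetical schema, loosely following the example conditions in the criteria.
context = """
CREATE TABLE Investments (Id INT, InvestmentType TEXT, State TEXT, Amount REAL);
INSERT INTO Investments VALUES (1, 'Bond', 'TX', 100.0), (2, 'Stock', 'TX', 50.0);
"""
good = "SELECT SUM(Amount) FROM Investments WHERE InvestmentType = 'Bond' AND State = 'TX'"
bad = "SELEC Amount FROM Investments"

print(runs_without_error(context, good))
print(runs_without_error(context, bad))
```

A query that passes this check is only known to parse and run; whether it answers the task correctly still needs the logical checks described in the criteria.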

🚀 Once you're happy with the hyper-parameters, you can submit the training workflow. It will **automatically execute experiments, evaluate them, and pick the best model** based on your customized evaluation system!

In [10]:
# workflow_configs['experiment_config']['0']['training_args']['num_train_epochs'] = 1

In [None]:
# Submit workflow for execution
running_workflow_status = client.submit_workflow(
    workflow_configs=workflow_configs
)
print(" Workflow running state:", running_workflow_status)

You can retrieve all of your workflows with `client.all_workflows()`. The result contains:

- `running_workflows`: training workflows with `running` status.
- `finished_workflows`: workflows that have finished executing.

In [None]:
# Retrieve user's workflows
user_workflows = client.all_workflows()

print( f"Total finished workflows : {len(user_workflows['finished_workflows'])}")
print( f"Total running workflows : {len(user_workflows['running_workflows'])}")

user_workflows['running_workflows']

If you need further details on a specific workflow, query its status by workflow ID. The result includes:

- `status`: overall status of the workflow
- `workflow_node_status`: status of all nodes
- `workflow_name`: name of your workflow
- `workflow_running_state_id`: ID of your workflow

In [None]:
# Check status of the running workflow
workflow_status = client.get_workflow_status('1721586842')
workflow_status
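
Rather than re-running the status cell by hand, the check can be wrapped in a polling loop. The sketch below is generic and does not call the SDK: `wait_until` is a hypothetical helper that takes any status-fetching callable, and the status values in the demonstration (`"running"`, `"finished"`) are assumptions about what the service returns.

```python
import time


def wait_until(fetch_status, is_done, interval_s=30.0, timeout_s=3600.0, sleep=time.sleep):
    """Call `fetch_status()` repeatedly until `is_done(result)` is true or the timeout elapses."""
    waited = 0.0
    while True:
        result = fetch_status()
        if is_done(result):
            return result
        if waited >= timeout_s:
            raise TimeoutError(f"status did not reach a terminal state within {timeout_s} seconds")
        sleep(interval_s)
        waited += interval_s


# Demonstration with a fake status source; no network access is needed.
fake_responses = iter([{"status": "running"}, {"status": "running"}, {"status": "finished"}])
final = wait_until(
    fetch_status=lambda: next(fake_responses),
    is_done=lambda r: r["status"] != "running",
    interval_s=0.0,
)
print(final)
```

With the real client, `fetch_status` could be something like `lambda: client.get_workflow_status(workflow_id)`, with `is_done` adapted to the status values you actually observe.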

Once the workflow has finished executing, you can deploy it:

In [None]:
# Deploy the workflow
workflow_id = '1721586842'
deployment_status = client.deploy_workflow(
    workflow_id
)
print(deployment_status)

Get the status of the deployment:

In [None]:
deployment_details = client.get_workflow_deployment_status('DeploymentState-1721599750.281206')
deployment_details

In [None]:
# Get Model id
import requests
model_details = requests.get("http://3.80.255.142:9000/v1/models").json()
model_id = model_details['data'][0]['id']
model_details

In [None]:
# Inference
import json

with open('texttosql_seed_data.json') as f:
    sql_data = json.load(f)

url = "http://3.80.255.142:9000/v1/completions"

for d in sql_data[-5:]:
    data = {
        "model": model_id,
        "prompt": [d['query']],
        "max_tokens": 200,
        "temperature": 0.0
    }
    response = requests.post(url, json=data)
    print("Prompt :\n", d['query'], "\nLLM Response:")
    print(response.json()['choices'][0]['text'])
    print("\nOriginal Response:\n", d['response'])
    print("-----\n\n")
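
Reading generated and reference queries side by side does not scale past a few samples. As a quick local check, which is separate from the customized evaluation used during training, one could compute a normalized exact-match rate. The helpers below are hypothetical; note that the normalization lowercases everything, so string literals such as `'TX'` and `'tx'` are treated as equal, and semantically equivalent queries that are written differently still count as mismatches.

```python
import re


def normalize_sql(sql: str) -> str:
    """Lowercase, collapse whitespace, and drop a trailing semicolon for a rough comparison."""
    collapsed = re.sub(r"\s+", " ", sql.strip())
    return collapsed.rstrip(";").strip().lower()


def exact_match_rate(predictions, references):
    """Fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(normalize_sql(p) == normalize_sql(r) for p, r in zip(predictions, references))
    return hits / len(predictions)


# Illustrative strings, not real model output.
preds = ["SELECT  name FROM users;", "select id from users"]
refs = ["select name\nfrom users", "SELECT name FROM users"]
print(exact_match_rate(preds, refs))  # 0.5
```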

Kill the deployed model by running the following command. You can deploy it again later if needed.

In [None]:
client.kill_deployment(
    'DeploymentState-1721599750.281206'
)