## End-to-End NL-to-SQL Inference and Evaluation
This notebook demonstrates the end-to-end process for reproducing the data collection, synthesis, evaluation, and consolidation steps that produce the dataset used in our statistical experiments on the relationship between schema identifier naturalness and NL-to-SQL model performance.

This notebook is one of two created for this purpose. The second notebook is `end-to-end-prototype-analysis.ipynb`.

Copyright 2024 Kyle Luoma

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

#### Imports

In [1]:
from src import end_to_end_data_prep_and_prediction as pred
from itertools import product

  from .autonotebook import tqdm as notebook_tqdm
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


#### Database and model selections
Comment or uncomment databases, models, and naturalness levels in the lists below before running the main loop in the next code cell.

In [2]:
selected_spider_databases = [("spider", db) for db in [
        'battle_death',
        'car_1',
        'concert_singer',
        'course_teach',
        'cre_Doc_Template_Mgt',
        'dog_kennels',
        'employee_hire_evaluation',
        'flight_2',
        'museum_visit',
        'network_1',
        'orchestra',
        'pets_1',
        'poker_player',
        'real_estate_properties',
        'singer',
        'student_transcripts_tracking',
        'tvshow',
        'voter_1',
        'world_1',
        'wta_1'
        ]]

selected_snails_databases = [("snails", db) for db in [
        "ASIS_20161108_HerpInv_Database",
        "ATBI",
        "CratersWildlifeObservations",
        "KlamathInvasiveSpecies",
        "NorthernPlainsFireManagement",
        "NTSB",
        "NYSED_SRC2022",
        "PacificIslandLandbirds",
        "SBODemoUS-Banking",
        "SBODemoUS-Business Partners",
        "SBODemoUS-Finance",
        "SBODemoUS-General",
        "SBODemoUS-Human Resources",
        "SBODemoUS-Inventory and Production",
        "SBODemoUS-Reports",
        "SBODemoUS-Sales Opportunities",
        "SBODemoUS-Service"
        ]]

selected_models = [
        "gpt-4o", 
        "gpt-3.5-turbo",
        "DINSQL",
        "CodeS",
        # "Phind-CodeLlama-34B-v2" #Use only with bypass_nl_sql_inference=True in main call below
        ]

selected_naturalness = [
        "NATIVE", 
        "N1", 
        "N2", 
        "N3"
        ]
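
The main loop later in this notebook combines these selections with `itertools.product`. With the selections as written (17 SNAILS databases, 4 models, 4 naturalness levels, Spider list commented out of the loop) that yields 17 × 4 × 4 = 272 calls. The sketch below uses shortened stand-in lists (not the full selections) to show the shape of each tuple that `product` yields and how it can be unpacked; the variable names are illustrative only.

```python
from itertools import product

# Shortened stand-in lists for illustration; the notebook uses the full selections.
databases = [("snails", db) for db in ["ATBI", "NTSB"]]
models = ["gpt-4o", "CodeS"]
naturalness_levels = ["NATIVE", "N1"]

# Each element has the shape ((benchmark, database), model, naturalness).
combos = list(product(databases, models, naturalness_levels))
print(len(combos))  # 2 * 2 * 2 = 8

# Unpacking the nested tuple avoids index lookups such as combo[0][1].
for (benchmark, database), model, naturalness in combos[:2]:
    print(benchmark, database, model, naturalness)
```

Because `product` varies the rightmost list fastest, all naturalness levels for one database and model are processed before the loop moves to the next model.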

#### Run the main function in `pred`

Running `pred.main` with the above combinations of benchmark, database, model, and naturalness level reproduces the NL-to-SQL annotations used in our analysis.

NOTE: The Phind-CodeLlama model cited in our paper is no longer available on TogetherAI, so we cannot offer a simple way to reproduce its inference here. SQL inference output from this model is available in the ./queries/predicted directory.

##### Outputs
- Queries predicted by LLMs are stored in: ./db/queries/predicted 
- Excel files containing the analysis results are stored in: ./data/nl-to-sql_performance_annotations/pending_evaluation
- Individual query generation logs can be found in ./logs
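
To confirm that a run produced results files, the pending-evaluation directory can be listed before starting manual evaluation. This is a minimal sketch, not part of the project's code: the helper name `list_pending_workbooks` is hypothetical, and the `*.xlsx` pattern is an assumption based on the files being described as Excel files.

```python
from pathlib import Path

# Directory named in the Outputs list above.
PENDING_DIR = Path("./data/nl-to-sql_performance_annotations/pending_evaluation")


def list_pending_workbooks(pending_dir: Path = PENDING_DIR,
                           pattern: str = "*.xlsx") -> list[Path]:
    """Return the results files awaiting manual evaluation, sorted by name.

    Hypothetical helper; the '*.xlsx' pattern is an assumption.
    """
    if not pending_dir.is_dir():
        return []  # nothing has been generated yet
    return sorted(pending_dir.glob(pattern))


for workbook in list_pending_workbooks():
    print(workbook.name)
```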

##### Next Steps
Once NL-to-SQL inference and the follow-on evaluations are complete, run
```shell
python ./src/query_manual_evaluation.py
```
to manually evaluate the results files.
Load the files from the /pending_evaluation folder and, once you have manually scored the results, save them to ./data/nl-to-sql_performance_annotations

After manual validation, you can generate the results analysis as it appears in our report using the `reproducibility-SNAILS-NL-to-SQL-naturalness-analysis.ipynb` notebook.

In [3]:
# Each combination has the shape ((benchmark, database), model, naturalness).
for (benchmark, database), model, naturalness in product(
    # selected_spider_databases +
    selected_snails_databases,
    selected_models,
    selected_naturalness
):
    pred.main(
        model=model,
        service="openai",
        naturalness=naturalness,
        database=database,
        bypass_nl_sql_inference=True,  # True skips LLM NL-to-SQL inference and runs only the follow-on evaluation steps
        # Pick the database info file that matches the benchmark.
        db_list_file={
            "spider": ".local/spider_dbinfo.json",
            "snails": ".local/dbinfo.json"
            }[benchmark]
    )

### Data Loading ###
### SQL Inference ###
### Determine Schema and Query Naturalness ###
### Generate gold query statistics and naturalness scores ###
### Make a list of all aliases in all queries ###


FileNotFoundError: [Errno 2] No such file or directory: 'java -jar ./bin/SQLParserQueryAnalyzer_jar/SQLParserQueryAnalyzer.jar --schematagger "SELECT COUNT(*) TURTLECOUNT  FROM TBLFIELDDATATURTLEMEASUREMENTS WHERE AGE = \'5\' " "tsql" --dialect tsql'
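
The `FileNotFoundError` above reports the entire command line as the missing file. On POSIX systems this is the usual symptom of passing a single command string to `subprocess` without `shell=True`: the whole string is treated as the program name. On Windows the same call is parsed as a command line and can succeed. Whether the project invokes the analyzer this way is an assumption, since the traceback here is trimmed. A minimal sketch of one portable alternative is to split the string into an argument list first; the helper names `split_command` and `run_command` are hypothetical.

```python
import shlex
import subprocess


def split_command(command: str) -> list[str]:
    """Split a shell-style command string into an argument list."""
    return shlex.split(command)


def run_command(command: str) -> subprocess.CompletedProcess:
    """Run a command string as an argument list (no shell involved).

    If the executable (here `java`) is not on PATH, this still raises
    FileNotFoundError, but the message then names only the executable.
    """
    return subprocess.run(split_command(command),
                          capture_output=True, text=True, check=False)


# The command string exactly as it appears in the traceback above.
command = """java -jar ./bin/SQLParserQueryAnalyzer_jar/SQLParserQueryAnalyzer.jar --schematagger "SELECT COUNT(*) TURTLECOUNT  FROM TBLFIELDDATATURTLEMEASUREMENTS WHERE AGE = '5' " "tsql" --dialect tsql"""

args = split_command(command)
for arg in args:
    print(repr(arg))
```

The quoted SQL statement stays together as one argument, so the query reaches the JAR unchanged. Running the command still requires a Java runtime on the `PATH` and the JAR at the relative path shown.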