# EDA REZA

In [1]:
!pip install tf-keras
!pip install evaluate
!pip install matplotlib
!pip install seaborn
!pip install "accelerate>=0.26.0"

Defaulting to user installation because normal site-packages is not writeable


In [2]:
import pandas as pd
import sqlite3
import evaluate
import os
import json
import matplotlib.pyplot as plt
import seaborn
from transformers import (
    pipeline,
    T5Tokenizer,
    T5ForConditionalGeneration,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    Trainer,
    TrainingArguments,
    AutoTokenizer,
)
import torch
from torch.utils.data import Dataset, DataLoader


Examples of queries that show the need for fine-tuning:
- Query: which player has the longest name.
     - Output: SELECT player_name FROM Player ORDER BY height DESC LIMIT 1

- Query: how many players are over 6 feet tall
     - Output: SELECT COUNT(*) FROM Player WHERE height > 6
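
For comparison, the queries one would hope to get for these two prompts can be checked against a tiny in-memory table. This is only a sketch: the three rows are made up, and it assumes `height` is stored in centimetres (so 6 feet is about 182.88 cm), which may or may not match the real database.

```python
import sqlite3

# Hypothetical sample rows; the real Player table has more columns and rows.
rows = [("Al Short", 170), ("Bartholomew Longname", 185), ("Cy Tall", 195)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Player (player_name TEXT, height INTEGER)")
conn.executemany("INSERT INTO Player VALUES (?, ?)", rows)

# "Longest name" should sort by the length of the name, not by height.
longest_name = conn.execute(
    "SELECT player_name FROM Player ORDER BY LENGTH(player_name) DESC LIMIT 1"
).fetchone()[0]

# "Over 6 feet" needs a unit conversion if height is in centimetres (assumed).
SIX_FEET_CM = 6 * 30.48
over_six_feet = conn.execute(
    "SELECT COUNT(*) FROM Player WHERE height > ?", (SIX_FEET_CM,)
).fetchone()[0]

conn.close()
print(longest_name, over_six_feet)
```

Neither intended query is something the base model can infer from the column names alone, which is why these prompts are useful probes.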

In [3]:
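def execute_query(query, print_padding):
    """Run a query against the project database and print the rows."""
    conn = None
    try:
        # os.path.join keeps the path portable across operating systems
        conn = sqlite3.connect(os.path.join("data", "reza_data.db"))
        cursor = conn.cursor()
        cursor.execute(query)

        print(f"{'Query results:':<{print_padding}}{cursor.fetchall()}")
    except sqlite3.Error:
        print("Error executing sql")
    finally:
        # conn stays None if connect() itself failed
        if conn is not None:
            conn.close()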

# Load the tokenizer and model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('cssupport/t5-small-awesome-text-to-sql').to(device)

# Reference used for the fine-tuning done later in this notebook:
# https://medium.com/nlplanet/a-full-guide-to-finetuning-t5-for-text2text-and-building-a-demo-with-streamlit-c72009631887


model.eval()

def generate_sql(input_prompt):
    """Generate SQL query from natural language input."""
    inputs = tokenizer(input_prompt, padding=True, truncation=True, return_tensors="pt").to(device)

    with torch.no_grad():
        outputs = model.generate(**inputs, max_length=512)

    return tokenizer.decode(outputs[0], skip_special_tokens=True)

padding = 30
print("Good examples:\n")
# example 
natural_language_query = "How many players are there?"
input_prompt = f"""tables:
CREATE TABLE Player (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    player_api_id INTEGER UNIQUE,
    player_name TEXT,
    player_fifa_api_id INTEGER UNIQUE,
    birthday TEXT,
    height INTEGER,
    weight INTEGER
)
query for: {natural_language_query}"""

generated_sql = generate_sql(input_prompt)
print(f"{'Original query:':<{padding}}{natural_language_query}\n{'Generated SQL:':<{padding}}{generated_sql}")
execute_query(generated_sql, padding)

# example 
natural_language_query = "Who is the heaviest player?"
input_prompt = f"""tables:
CREATE TABLE Player (
    player_name TEXT,
    weight INTEGER
)
query for: {natural_language_query}"""

generated_sql = generate_sql(input_prompt)
print(f"\n{'Original query:':<{padding}}{natural_language_query}\n{'Generated SQL:':<{padding}}{generated_sql}")
execute_query(generated_sql, padding)

# example 
natural_language_query = "How old is the oldest player?"
input_prompt = f"""tables:
CREATE TABLE Player (
    birthday TEXT
)
query for: {natural_language_query}"""

generated_sql = generate_sql(input_prompt)
print(f"\n{'Original query:':<{padding}}{natural_language_query}\n{'Generated SQL:':<{padding}}{generated_sql}")
execute_query(generated_sql, padding)

print("\nExamples showing need for fine-tuning:")
# example 
natural_language_query = "Who has the shortest name?"
input_prompt = f"""tables:
CREATE TABLE Player (
    player_name TEXT
)
query for: {natural_language_query}"""

generated_sql = generate_sql(input_prompt)
print(f"\n{'Original query:':<{padding}}{natural_language_query}\n{'Generated SQL:':<{padding}}{generated_sql}")
execute_query(generated_sql, padding)

# example 
natural_language_query = "Does anybody have a birthday on January 1st?"
input_prompt = f"""tables:
CREATE TABLE Player (
    birthday TEXT
)
query for: {natural_language_query}"""

generated_sql = generate_sql(input_prompt)
print(f"\n{'Original query:':<{padding}}{natural_language_query}\n{'Generated SQL:':<{padding}}{generated_sql}")
execute_query(generated_sql, padding)

# example 
natural_language_query = "Who has the first birthday of the year?"
input_prompt = f"""tables:
CREATE TABLE Player (
    player_name TEXT,
    birthday TEXT
)
query for: {natural_language_query}"""

generated_sql = generate_sql(input_prompt)
print(f"\n{'Original query:':<{padding}}{natural_language_query}\n{'Generated SQL:':<{padding}}{generated_sql}")
execute_query(generated_sql, padding)



You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


Good examples:

Original query:               How many players are there?
Generated SQL:                SELECT COUNT(*) FROM Player
Query results:                [(11060,)]

Original query:               Who is the heaviest player?
Generated SQL:                SELECT player_name FROM Player WHERE weight = (SELECT MAX(weight) FROM Player)
Query results:                [('Kristof van Hout',), ('Tim Wiese',)]

Original query:               How old is the oldest player?
Generated SQL:                SELECT MIN(birthday) FROM Player
Query results:                [('1967-01-23 00:00:00',)]

Examples showing need for fine-tuning:

Original query:               Who has the shortest name?
Generated SQL:                SELECT player_name FROM Player ORDER BY player_name LIMIT 1
Query results:                [('Aaron Appindangoye',)]

Original query:               Does anybody have a birthday on January 1st?
Generated SQL:                SELECT DISTINCT birthday FROM Player WHERE birthday = "Jan
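
The January 1st prompt is hard partly because `birthday` is stored as text such as `1967-01-23 00:00:00`, so a correct query has to slice the string or parse it. A minimal sketch of both options on made-up rows (the names and dates are hypothetical; only the timestamp format is taken from the output above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Player (player_name TEXT, birthday TEXT)")
conn.executemany(
    "INSERT INTO Player VALUES (?, ?)",
    [
        ("A", "1990-01-01 00:00:00"),
        ("B", "1985-07-04 00:00:00"),
        ("C", "1992-01-01 00:00:00"),
    ],
)

# Option 1: slice characters 6..10 ("MM-DD") out of the text value.
by_substr = conn.execute(
    "SELECT player_name FROM Player "
    "WHERE SUBSTR(birthday, 6, 5) = '01-01' ORDER BY player_name"
).fetchall()

# Option 2: let SQLite parse the ISO-8601 text and format month and day.
by_strftime = conn.execute(
    "SELECT player_name FROM Player "
    "WHERE strftime('%m-%d', birthday) = '01-01' ORDER BY player_name"
).fetchall()

conn.close()
print(by_substr, by_strftime)
```

Both forms depend on the text staying in `YYYY-MM-DD ...` order; a different date format in the column would break either one.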

## Improve model with fine-tuning


Training data to fine-tune the model

In [4]:
# Example data to fine-tune the model
data = [
    {
        "input": "translate English to SQL: Who has the shortest name?",
        "target": "SELECT player_name FROM player ORDER BY LENGTH(player_name) LIMIT 1"
    },
    {
        "input": "translate English to SQL: Does anybody have a birthday on January 1st?",
        "target": "SELECT * FROM player WHERE SUBSTR(birthday, 6, 5) = '01-01'"
    },
    {
        "input": "translate English to SQL: List all players born in December.",
        "target": "SELECT * FROM player WHERE SUBSTR(birthday, 6, 2) = '12'"
    },
    {
        "input": "translate English to SQL: Who has the longest name?",
        "target": "SELECT player_name FROM player ORDER BY LENGTH(player_name) DESC LIMIT 1"
    },
    {
        "input": "translate English to SQL: Show me all players born in 1990.",
        "target": "SELECT * FROM player WHERE SUBSTR(birthday, 1, 4) = '1990'"
    },
    {
        "input": "translate English to SQL: Find players with 'John' in their name.",
        "target": "SELECT * FROM player WHERE player_name LIKE '%John%'"
    },
    {
        "input": "translate English to SQL: How many players are there?",
        "target": "SELECT COUNT(*) FROM player"
    },
    {
        "input": "translate English to SQL: Show players whose name starts with A.",
        "target": "SELECT * FROM player WHERE player_name LIKE 'A%'"
    },
    {
        "input": "translate English to SQL: List all players ordered by birthday.",
        "target": "SELECT * FROM player ORDER BY birthday"
    },
    {
        "input": "translate English to SQL: Get names and birthdays of all players.",
        "target": "SELECT player_name, birthday FROM player"
    },
    {
        "input": "translate English to SQL: Are there any players with a birthday on July 4th?",
        "target": "SELECT * FROM player WHERE SUBSTR(birthday, 6, 5) = '07-04'"
    },
    {
        "input": "translate English to SQL: Which player has the earliest birthday?",
        "target": "SELECT * FROM player ORDER BY birthday LIMIT 1"
    },
    {
        "input": "translate English to SQL: Show all players whose birthday is not in January.",
        "target": "SELECT * FROM player WHERE SUBSTR(birthday, 6, 2) != '01'"
    },
    {
        "input": "translate English to SQL: What are the names of players born after 2000?",
        "target": "SELECT player_name FROM player WHERE birthday > '2000-01-01'"
    },
    {
        "input": "translate English to SQL: List all players born before 1980.",
        "target": "SELECT * FROM player WHERE birthday < '1980-01-01'"
    },
    {
        "input": "translate English to SQL: What is the name of the tallest player?",
        "target": "select player_name from player order by height desc"
    },
    {
        "input": "translate English to SQL: What is the name of the lightest player?",
        "target": "select player_name from player order by weight asc"
    },
    {
        "input": "translate English to SQL: How many players weigh less than 140?",
        "target": "select count(*) from player where weight < 140"
    }
]

In [5]:
# Re-pull the model to make sure the notebook is running against the right one
# Use the pre-trained model to test its output on the training data before fine-tuning
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('cssupport/t5-small-awesome-text-to-sql').to(device)

for example in data:
    input_prompt = example["input"]
    expected_sql = example["target"]
    generated_sql = generate_sql(input_prompt)
    print(f"\nInput: {input_prompt}\nExpected: {expected_sql}\nGenerated: {generated_sql}\nMatch: {generated_sql.strip() == expected_sql.strip()}")


Input: translate English to SQL: Who has the shortest name?
Expected: SELECT player_name FROM player ORDER BY LENGTH(player_name) LIMIT 1
Generated: SELECT MIN(name) FROM SELECT MIN(name) FROM SELECT MIN(name) FROM SELECT MIN(name) FROM SELECT MIN(name) FROM SELECT MIN(name) FROM SELECT MIN(name) FROM SELECT MIN(name) FROM SELECT MIN(name) FROM SELECT MIN(name) FROM SELECT MIN(name) FROM SELECT MIN(name) FROM SELECT MIN(sql) FROM SELECT SELECT MIN(sql) FROM SELECT
Match: False

Input: translate English to SQL: Does anybody have a birthday on January 1st?
Expected: SELECT * FROM player WHERE SUBSTR(birthday, 6, 5) = '01-01'
Generated: SELECT DISTINCT name FROM swich_sql" AND January 1st = "January"
Match: False

Input: translate English to SQL: List all players born in December.
Expected: SELECT * FROM player WHERE SUBSTR(birthday, 6, 2) = '12'
Generated: SELECT * FROM players WHERE birth = "december"
Match: False

Input: translate English to SQL: Who has the longest name?
Expected: SE
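
String equality is a strict check: two queries that differ only in keyword case or spacing count as a mismatch even when they return the same rows. One alternative, sketched here, is to compare what the two queries return. The helper name `same_result` and the sample rows are hypothetical, and this is not the comparison the notebook uses.

```python
import sqlite3


def same_result(conn, expected_sql, generated_sql):
    """Return True if both queries run and return the same rows in the same order."""
    try:
        expected = conn.execute(expected_sql).fetchall()
        generated = conn.execute(generated_sql).fetchall()
    except sqlite3.Error:
        # Invalid SQL on either side counts as a mismatch.
        return False
    return expected == generated


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE player (player_name TEXT, weight INTEGER)")
conn.executemany("INSERT INTO player VALUES (?, ?)", [("A", 130), ("B", 150)])

# Same query, different keyword case: the strings differ, the results do not.
match_case = same_result(
    conn,
    "SELECT COUNT(*) FROM player WHERE weight < 140",
    "select count(*) from player where weight < 140",
)
# Degenerate output like the generations above fails to execute.
match_broken = same_result(
    conn, "SELECT COUNT(*) FROM player", "SELECT MIN(name) FROM SELECT"
)
conn.close()
print(match_case, match_broken)
```

A result-based check can still pass by coincidence on a small table, so it complements the string comparison rather than replacing it.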

In [6]:
class TextToSQLDataset(Dataset):
    def __init__(self, tokenizer, data, max_length=512):
        self.tokenizer = tokenizer
        self.data = data
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        input_text = self.data[idx]['input']
        target_text = self.data[idx]['target']
        
        # Tokenize inputs and targets
        inputs = self.tokenizer(input_text, max_length=self.max_length, padding='max_length', truncation=True, return_tensors="pt")
        targets = self.tokenizer(target_text, max_length=self.max_length, padding='max_length', truncation=True, return_tensors="pt")
        
        return {
            'input_ids': inputs.input_ids.squeeze(0),
            'attention_mask': inputs.attention_mask.squeeze(0),
            'labels': targets.input_ids.squeeze(0)
        }
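
One detail worth checking in a dataset like this: with `padding='max_length'`, the label sequence is filled out with the pad token, and unless those positions are replaced with `-100` the loss is also computed on padding. The sketch below shows the masking step on plain lists. `mask_pad_labels` is a hypothetical helper and the token ids are made up; it assumes pad id `0` (the T5 pad token id) and `-100` as the label value the loss ignores.

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss by default


def mask_pad_labels(label_ids, pad_token_id=0):
    """Replace padding positions in a label sequence with IGNORE_INDEX."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in label_ids]


# Made-up token ids padded to length 8 with pad id 0.
labels = [3, 23143, 14196, 1, 0, 0, 0, 0]
masked = mask_pad_labels(labels)
print(masked)
```

On a tensor, the equivalent is usually an in-place assignment such as `labels[labels == pad_token_id] = -100` before returning the item.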

In [7]:
# Re-pull the model to make sure the notebook is running against the right one
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('cssupport/t5-small-awesome-text-to-sql').to(device)

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=50, # Will be overfitting the model
    per_device_train_batch_size=1,
    learning_rate=5e-4,
    weight_decay=0.0,
    logging_dir='./logs',
    # logging_steps=1, # commented out for smaller export
    save_strategy="no",
    remove_unused_columns=False,
    report_to="none",
)

dataset = TextToSQLDataset(tokenizer, data, max_length=64)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=data_collator
)

# Fine-tune the model
trainer.train()

# Save the model
model.save_pretrained('./fine-tuned-model')
tokenizer.save_pretrained('./fine-tuned-model')



| Step | Training Loss |
|------|---------------|
| 500  | 0.0652        |


('./fine-tuned-model\\tokenizer_config.json',
 './fine-tuned-model\\special_tokens_map.json',
 './fine-tuned-model\\spiece.model',
 './fine-tuned-model\\added_tokens.json')

In [8]:
# Use the newly fine-tuned model to generate SQL queries
# I am using the same questions as what was provided in the fine-tuning data
# to see if the fine-tuned model can generate the correct SQL queries after the training.
fine_tuned_model_path = './fine-tuned-model'
tokenizer = T5Tokenizer.from_pretrained(fine_tuned_model_path)
model = T5ForConditionalGeneration.from_pretrained(fine_tuned_model_path).to(device)

def generate_sql_fine_tuned(input_prompt):
    """Generate SQL query from natural language input using the fine-tuned model."""
    inputs = tokenizer(input_prompt, return_tensors="pt", truncation=True, padding=True, max_length=64).to(device)
    
    with torch.no_grad():
        outputs = model.generate(**inputs, max_length=64)

    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Print the generated SQL queries and compare them with the expected queries from training
for example in data:
    input_prompt = example["input"]
    expected_sql = example["target"]
    generated_sql = generate_sql_fine_tuned(input_prompt)
    print(f"\nInput: {input_prompt}\nExpected: {expected_sql}\nGenerated: {generated_sql}\nMatch: {generated_sql.strip() == expected_sql.strip()}")


Input: translate English to SQL: Who has the shortest name?
Expected: SELECT player_name FROM player ORDER BY LENGTH(player_name) LIMIT 1
Generated: SELECT player_name FROM player ORDER BY LENGTH(player_name) LIMIT 1
Match: True

Input: translate English to SQL: Does anybody have a birthday on January 1st?
Expected: SELECT * FROM player WHERE SUBSTR(birthday, 6, 5) = '01-01'
Generated: SELECT * FROM player WHERE SUBSTR(birthday, 6, 5) = '01-01'
Match: True

Input: translate English to SQL: List all players born in December.
Expected: SELECT * FROM player WHERE SUBSTR(birthday, 6, 2) = '12'
Generated: SELECT * FROM player WHERE SUBSTR(birthday, 6, 2) = '12'
Match: True

Input: translate English to SQL: Who has the longest name?
Expected: SELECT player_name FROM player ORDER BY LENGTH(player_name) DESC LIMIT 1
Generated: SELECT player_name FROM player ORDER BY LENGTH(player_name) DESC LIMIT 1
Match: True

Input: translate English to SQL: Show me all players born in 1990.
Expected: SELEC
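
Because the questions above are the same ones used for training, a match mostly shows memorisation. A hedged sketch of a hold-out split, so that some examples are never seen during training, could look like this. `split_examples` is a hypothetical helper, and the 80/20 ratio and seed are arbitrary choices, not something this notebook uses.

```python
import random


def split_examples(examples, eval_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and split it into (train, eval) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


# Stand-in for the 18-item `data` list defined earlier.
examples = [{"input": f"question {i}", "target": f"sql {i}"} for i in range(18)]
train_examples, eval_examples = split_examples(examples)
print(len(train_examples), len(eval_examples))
```

With only 18 examples the held-out set is tiny, so its score would be noisy, but it would at least separate memorised prompts from unseen ones.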

In [9]:
# ************************************************************** #
# Queries that weren't part of the training data
natural_language_query = "How many players have a name with 'Christian' in it?"
input_prompt = f"translate English to SQL: {natural_language_query}"
generated_sql = generate_sql_fine_tuned(input_prompt)
print(f"Original query: {natural_language_query}\nGenerated SQL: {generated_sql}")
execute_query(generated_sql, padding)

natural_language_query = "How many players have a height greater than 190?"
input_prompt = f"translate English to SQL: {natural_language_query}"
generated_sql = generate_sql_fine_tuned(input_prompt)
print(f"Original query: {natural_language_query}\nGenerated SQL: {generated_sql}")
execute_query(generated_sql, padding)

natural_language_query = "How many players have a weight less than than 120?"
input_prompt = f"translate English to SQL: {natural_language_query}"
generated_sql = generate_sql_fine_tuned(input_prompt)
print(f"Original query: {natural_language_query}\nGenerated SQL: {generated_sql}")
execute_query(generated_sql, padding)

natural_language_query = "What player has the longest name?"
input_prompt = f"translate English to SQL: {natural_language_query}"
generated_sql = generate_sql_fine_tuned(input_prompt)
print(f"Original query: {natural_language_query}\nGenerated SQL: {generated_sql}")
execute_query(generated_sql, padding)

Original query: How many players have a name with 'Christian' in it?
Generated SQL: SELECT COUNT(*) FROM player WHERE player_name LIKE '%Christian%'
Query results:                [(55,)]
Original query: How many players have a height greater than 190?
Generated SQL: select count(*) from player where height > 190
Query results:                [(1333,)]
Original query: How many players have a weight less than than 120?
Generated SQL: select count(*) from player where weight  120
Error executing sql
Original query: What player has the longest name?
Generated SQL: SELECT player_name FROM player ORDER BY LENGTH(player_name) DESC LIMIT 1
Query results:                [('Domingos Alexandre Martins da Costa Alex',)]


#### Fine-tuning results

* After fine-tuning on more examples, the model translates more of the queries correctly.
* It truly failed only two of the queries from the training data, including this one:
```
Input: translate English to SQL: List all players born before 1980.
Expected: SELECT * FROM player WHERE birthday < '1980-01-01'
Generated: SELECT * FROM player WHERE birthday  '1980-01-01'
Match: False
```
* Queries against columns with little training coverage, such as `weight` and `height`, did not work well at first. Adding more training examples for those columns seemed to improve the results.
* I also added more training data to address this kind of issue, but I was not able to get the model to produce `less than`:
```
Original query: How many players have a weight less than than 120?
Generated SQL: select count(*) from player where weight  120
Error executing sql
```
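
The dropped `<` in both outputs points at the tokenizer rather than the model. A likely explanation, not verified in this notebook, is that the original T5 SentencePiece vocabulary has no piece for `<`, so the character is encoded as an unknown token and lost on decoding. Under that assumption, one possible workaround is to swap `<` for a placeholder the tokenizer can represent before training, then swap it back after generation. `LT_PLACEHOLDER`, `encode_target`, and `decode_output` are hypothetical names:

```python
LT_PLACEHOLDER = "lt;"  # hypothetical marker; any string the tokenizer keeps intact works


def encode_target(sql):
    """Replace '<' in a training target with the placeholder."""
    return sql.replace("<", LT_PLACEHOLDER)


def decode_output(generated):
    """Restore '<' in generated SQL before executing it."""
    return generated.replace(LT_PLACEHOLDER, "<")


target = "select count(*) from player where weight < 140"
encoded = encode_target(target)
restored = decode_output(encoded)
print(encoded)
print(restored)
```

Another option would be to add `<` to the vocabulary with `tokenizer.add_tokens` and resize the model embeddings with `model.resize_token_embeddings`; whether that trains well on so few examples is untested here.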

### Thoughts

* I am worried that I had to overfit the model to this data, mostly because the datatypes in the database differ from what would be best practice.
  * `birthday` is a `text` column rather than a `date` column, so I had to train the model to write awkward string-slicing SQL for it. I fear this makes the model very rigid.
* I have not yet worked out how to compute a BLEU score for the model. My earlier attempts failed with errors about a missing evaluation procedure, which I never resolved.
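
As a stopgap for the BLEU question, a score can be computed directly on the generated and expected strings without going through the trainer. The function below is a hypothetical, simplified version (whitespace tokens, a single reference, no smoothing); library implementations such as the `bleu` metric in `evaluate` tokenize differently and will give different numbers.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(prediction, reference, max_n=4):
    """Single-reference BLEU on whitespace tokens, without smoothing."""
    pred, ref = prediction.split(), reference.split()
    if not pred:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        pred_counts, ref_counts = ngrams(pred, n), ngrams(ref, n)
        # Clip each predicted n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in pred_counts.items())
        total = sum(pred_counts.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty punishes predictions shorter than the reference.
    bp = 1.0 if len(pred) > len(ref) else math.exp(1 - len(ref) / len(pred))
    return bp * math.exp(sum(log_precisions) / max_n)


expected = "SELECT * FROM player WHERE birthday < '1980-01-01'"
exact = simple_bleu(expected, expected)
missing_op = simple_bleu("SELECT * FROM player WHERE birthday '1980-01-01'", expected)
print(exact, missing_op)
```

Note that a query missing its comparison operator still scores well above zero here, which is a reminder that n-gram overlap says little about whether SQL is valid.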

# Using different pre-trained models

### https://huggingface.co/suriya7/t5-base-text-to-sql

In [10]:
suriya7_tokenizer = AutoTokenizer.from_pretrained("suriya7/t5-base-text-to-sql")
suriya7_model = AutoModelForSeq2SeqLM.from_pretrained("suriya7/t5-base-text-to-sql")

def translate_to_sql_select(english_query):
    """Translate an English question to SQL with the suriya7 model."""
    input_text = "translate English to SQL: " + english_query
    input_ids = suriya7_tokenizer.encode(input_text, return_tensors="pt")
    outputs = suriya7_model.generate(input_ids)
    sql_query = suriya7_tokenizer.decode(outputs[0], skip_special_tokens=True)
    return sql_query

# Example usage
english_query = "Who has the shortest name?"
sql_query = translate_to_sql_select(english_query)
print("SQL Query:", sql_query)


SQL Query: SELECT MIN(name) FROM table_name_94 WHERE name = "


### https://huggingface.co/gaussalgo/T5-LM-Large-text2sql-spider

In [14]:
model_path = 'gaussalgo/T5-LM-Large-text2sql-spider'
spider_model = AutoModelForSeq2SeqLM.from_pretrained(model_path)
spider_tokenizer = AutoTokenizer.from_pretrained(model_path)

schema = """
    "Player" "id" int , "player_api_id" int , "player_name" text , "player_fifa_api_id" int , "birthday" text , "height" int , "weight" int , primary key: "id"
"""

def generate_spider_sql(question):
    """Generate SQL for a question with the Spider model, using the schema above."""
    input_text = " ".join(["Question: ",question, "Schema:", schema])
    model_inputs = spider_tokenizer(input_text, return_tensors="pt")
    outputs = spider_model.generate(**model_inputs, max_length=512)
    return spider_tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

questions = [
    "Does anybody have a birthday on January 1st?",
    "Who has the shortest name?",
    "Who has the first birthday of the year?",
]

# Generate, print, and execute the SQL for each question in turn
for question in questions:
    generated = generate_spider_sql(question)
    print("SQL Query:")
    print(generated)
    print(question)
    execute_query(generated, padding)

SQL Query:
SELECT DISTINCT birthday FROM player WHERE birthday LIKE "%January%"
Does anybody have a birthday on January 1st?
Query results:                []
SQL Query:
SELECT player_name FROM player ORDER BY height LIMIT 1
Who has the shortest name?
Query results:                [('Juan Quero',)]
SQL Query:
SELECT player_name FROM player ORDER BY birthday DESC LIMIT 1
Who has the first birthday of the year?
Query results:                [('Jonathan Leko',)]


#### Thoughts on other models

I found two other text-to-SQL models on Hugging Face and tried them out above.

I liked that the `gaussalgo` model takes a schema as input, since the column datatypes can tell the model which clauses it can use reliably. If a column is a `date`, date operations are straightforward; if the dates are stored as `text`, the right query depends on the format of the text in the column. The SQL generated for the birthday question shows this: the model expected the full month name to be spelled out, which matches nothing in this data. The model also had a weaker contextual understanding of some questions than our selected model. Asked who has the shortest name, it saw `shortest` and put the clause on the `height` column. I decided this model would need more training than our original model and was not worth going forward with.

The other model, `suriya7`, is built on T5, much like our selected model, but I was not able to find a way to give it the schema or table structure we want to build queries for. Hence the generated query `SELECT MIN(name) FROM table_name_94 WHERE name = "` for the question `Who has the shortest name?`. This was an early sign that the model would be more work than it was worth: it had no context for the table and column names, and the query would have been invalid even with the correct column name `player_name`. The output also appears truncated, possibly because `generate` was called without `max_length`. I decided that this model was not worth pursuing further either.
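
For schema-aware models like `gaussalgo`, the schema string does not have to be typed by hand. The sketch below builds a similar one-line description from SQLite's own metadata. `describe_table` is a hypothetical helper, the three-column table is a cut-down stand-in for `Player`, and the output only approximates the format used above (for example it emits the declared type `integer` rather than `int`); whether the model is sensitive to that difference was not checked.

```python
import sqlite3


def describe_table(conn, table):
    """Build a one-line schema description from SQLite's table metadata."""
    # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
    columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    parts = [f'"{name}" {ctype.lower()}' for _, name, ctype, _, _, _ in columns]
    keys = [f'"{name}"' for _, name, _, _, _, pk in columns if pk]
    text = f'"{table}" ' + " , ".join(parts)
    if keys:
        text += " , primary key: " + " ".join(keys)
    return text


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Player (id INTEGER PRIMARY KEY, player_name TEXT, birthday TEXT)"
)
schema_text = describe_table(conn, "Player")
conn.close()
print(schema_text)
```

Generating the description from the database keeps the prompt in sync with the actual columns if the schema changes.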