# Dataset Generation - Part 2

Now that we have our join combinations, we need to combine them with some extra information to generate our dialogues. We will use LLMs to generate the dialogue dataset, and have a human expert check that each dialogue makes sense and that the SQL query generated as its label is correct.

Our prompt to the LLM will look something like this:
```python
"""
Hey friend, take these join combinations: {join_combinations}
These joins involve the following tables: {tables_involved}

Here is some information about these tables:
<large block of context containing the CREATE TABLE statements and example column values>

<Instructions>
<Example of dialogue in the output format>
"""
```

This works well because LLMs have little trouble writing valid SQL when they know the structure of each table (we provide it through the CREATE TABLE statements) and can see example column values, so they are less likely to hallucinate what a column might contain.
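To make the template above more concrete, here is a minimal sketch of how the pieces could be assembled into one prompt string. The names `PROMPT_TEMPLATE`, `build_table_context`, and `build_prompt` are hypothetical, and the toy inputs are made up for illustration. The real assembly happens inside `DialogueGenerator`, which is not shown in this notebook and may work differently.

```python
import json

# Hypothetical template mirroring the prompt outline above.
PROMPT_TEMPLATE = """\
Take these join combinations: {join_combinations}
These joins involve the following tables: {tables_involved}

Here is some information about these tables:
{table_context}

{instructions}
{example_dialogue}
"""


def build_table_context(tables, table_ddls, table_examples, max_values=5):
    """Concatenate the DDL and a few example values for each table."""
    blocks = []
    for table in tables:
        ddl = table_ddls.get(table, f"-- DDL not found for {table}")
        examples = {
            column: values[:max_values]
            for column, values in table_examples.get(table, {}).items()
        }
        blocks.append(f"{ddl}\nExample values: {json.dumps(examples, default=str)}")
    return "\n\n".join(blocks)


def build_prompt(combination, table_ddls, table_examples, instructions, example_dialogue):
    """Fill the template for a single join combination."""
    return PROMPT_TEMPLATE.format(
        join_combinations=combination["combination_str"],
        tables_involved=", ".join(combination["tables"]),
        table_context=build_table_context(combination["tables"], table_ddls, table_examples),
        instructions=instructions,
        example_dialogue=example_dialogue,
    )


# Toy inputs (illustrative only, not taken from the real schema file).
toy_combination = {
    "combination_id": "toy_1",
    "tables": ["MONDIAL_SEA", "MONDIAL_GEO_SEA"],
    "combination_str": "MONDIAL_GEO_SEA.SEA=MONDIAL_SEA.NAME",
}
toy_ddls = {"MONDIAL_SEA": "CREATE TABLE MONDIAL_SEA (\nNAME VARCHAR2(50)\n)"}
toy_examples = {"MONDIAL_SEA": {"NAME": ["Atlantic Ocean"]}}

prompt = build_prompt(
    toy_combination, toy_ddls, toy_examples,
    instructions="<Instructions>",
    example_dialogue="<Example of dialogue in the output format>",
)
print(prompt)
```

Truncating the example values per column (`max_values`) is one way to keep the context small; whether that is needed depends on the model's context window.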

We already created the join combinations in the previous notebook and saved them to a CSV file, but we still need to:
1. Extract the CREATE TABLE statement for each table in the schema.
2. Fetch example values from the database for each column.

So let's do it!

## Extracting table DDLs:

In [1]:
import re
from dotenv import load_dotenv

load_dotenv()

with open("mondial/schema.txt", "r") as f:
    schema = f.read()

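def extract_ddls_from_schema(schema_text):
    """Return a dict mapping each MONDIAL_* table name to its CREATE TABLE statement."""
    # Each table is expected to appear in the schema file as:
    #   -- auto-generated definition
    #   create table MONDIAL_<NAME> ( <column definitions> ) /
    pattern = r'-- auto-generated definition\s+create table (MONDIAL_\w+)\s*\(\s*([\s\S]+?)\)\s*/'
    matches = re.findall(pattern, schema_text)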
    
    ddl_dict = {}
    for table_name, ddl_content in matches:
        full_ddl = f"CREATE TABLE {table_name} (\n{ddl_content}\n)"
        ddl_dict[table_name] = full_ddl
    
    return ddl_dict

table_ddls = extract_ddls_from_schema(schema)

print(f"Total of tables found: {len(table_ddls)}")
print("\nExample of DDL for MONDIAL_COUNTRY:")
print(table_ddls.get("MONDIAL_COUNTRY", "Table not found"))

Total of tables found: 40

Example of DDL for MONDIAL_COUNTRY:
CREATE TABLE MONDIAL_COUNTRY (
NAME        VARCHAR2(50),
    CODE        VARCHAR2(4),
    CAPITAL     VARCHAR2(50),
    PROVINCE    VARCHAR2(50),
    AREA        NUMBER,
    POPULATION  NUMBER,
    META_REPCOL VARCHAR2(4000)

)
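Since `mondial/schema.txt` is not shown here, the following sketch runs the same pattern against a small sample to illustrate the layout it expects. The two-table excerpt is hypothetical: it is inferred from the regular expression, not copied from the real file.

```python
import re

# Same pattern as in the cell above.
PATTERN = r'-- auto-generated definition\s+create table (MONDIAL_\w+)\s*\(\s*([\s\S]+?)\)\s*/'

# Hypothetical excerpt in the layout the pattern expects.
sample_schema = """\
-- auto-generated definition
create table MONDIAL_SEA
(
    NAME  VARCHAR2(50),
    AREA  NUMBER
)
/

-- auto-generated definition
create table MONDIAL_GEO_SEA
(
    SEA      VARCHAR2(50),
    COUNTRY  VARCHAR2(4)
)
/
"""

matches = re.findall(PATTERN, sample_schema)
table_names = [name for name, _ in matches]
print(table_names)

# The pattern is case-sensitive: an upper-cased schema yields no matches
# unless re.IGNORECASE is passed.
upper_schema = sample_schema.upper()
strict = re.findall(PATTERN, upper_schema)
relaxed = re.findall(PATTERN, upper_schema, flags=re.IGNORECASE)
print(len(strict), len(relaxed))
```

If your schema export uses a different comment header or upper-case keywords, the pattern will silently find fewer tables, so it is worth comparing the reported count with the number of tables you expect.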


## Getting column values examples

### !! In my case, my company hosts MONDIAL on Oracle, so the next cell uses `oracledb`. You may need to adapt it to your own database !!

In [2]:
import json
import oracledb

# Load the connection settings from a local JSON file
with open("mondial/mondial_db_connection.json", "r") as file:
    db_config = json.load(file)

db_host = db_config["DB_HOST"]
db_port = db_config["DB_PORT"]
db_user = db_config["DB_USER_NAME"]
db_pass = db_config["DB_PASS"]
service_name = db_config["SERVICE_NAME"]

dsn = f"{db_host}:{db_port}/{service_name}"
connection = oracledb.connect(
    user=db_user,
    password=db_pass,
    dsn=dsn
)
cursor = connection.cursor()

# Get all table names
cursor.execute("SELECT TABLE_NAME FROM USER_TABLES WHERE TABLE_NAME LIKE 'MONDIAL_%'")
tables = [row[0] for row in cursor.fetchall()]

table_examples = {}  # Dictionary to store examples by table

# For each table, fetch its column names and example values
for table in tables:
    # Query to get column names (the table name is passed as a bind variable)
    cursor.execute("SELECT COLUMN_NAME FROM USER_TAB_COLUMNS WHERE TABLE_NAME = :table_name", {"table_name": table})
    columns = [row[0] for row in cursor.fetchall()]
    
    columns_examples = {}
    for column in columns:
        # Get up to 20 distinct values
        try:
            cursor.execute(f"SELECT DISTINCT {column} FROM {table} WHERE {column} IS NOT NULL FETCH FIRST 20 ROWS ONLY")
            examples = []
            for row in cursor.fetchall():
                value = row[0]
                # If the value is a LOB, read the content
                if isinstance(value, oracledb.LOB):
                    value = value.read()
                examples.append(value)
            
            columns_examples[column] = examples
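        except Exception:
            # The query can fail for some columns (for example, Oracle rejects DISTINCT on LOB columns),
            # so we store a placeholder and move on
            columns_examples[column] = ["<error to get examples>"]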

    table_examples[table] = columns_examples

cursor.close()
connection.close()

print(f"Examples obtained for {len(table_examples)} tables")
print("\nExample for MONDIAL_COUNTRY:")
print(json.dumps(table_examples.get("MONDIAL_COUNTRY", {}), indent=2, default=str)[:500] + "...")

Examples obtained for 59 tables

Example for MONDIAL_COUNTRY:
{
  "NAME": [
    "Austria",
    "Lithuania",
    "Poland",
    "Faroe Islands",
    "Finland",
    "Ireland",
    "Malta",
    "Georgia",
    "Bangladesh",
    "Myanmar",
    "Thailand",
    "Israel",
    "Indonesia",
    "Papua New Guinea",
    "Saudi Arabia",
    "Syria",
    "Lebanon",
    "Anguilla",
    "United States",
    "Haiti"
  ],
  "CODE": [
    "A",
    "AFG",
    "AG",
    "AL",
    "AMSA",
    "AND",
    "ANG",
    "ARM",
    "ARU",
    "AUS",
    "AXA",
    "AZ",
    "B",
    "B...

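If your copy of MONDIAL lives in a different database, the same idea carries over. Below is a minimal sketch using SQLite from the standard library, chosen only because it runs without a server. The function name `collect_column_examples` is hypothetical, and the in-memory table and its rows are made up for the demonstration.

```python
import sqlite3


def collect_column_examples(connection, max_values=20):
    """Return {table: {column: [up to max_values distinct non-null values]}}."""
    cursor = connection.cursor()
    cursor.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    tables = [row[0] for row in cursor.fetchall() if row[0].startswith("MONDIAL_")]

    table_examples = {}
    for table in tables:
        # PRAGMA table_info returns one row per column; index 1 is the column name.
        cursor.execute(f"PRAGMA table_info({table})")
        columns = [row[1] for row in cursor.fetchall()]

        columns_examples = {}
        for column in columns:
            # Identifiers cannot be bound as parameters, so they are interpolated;
            # they come from the database catalog, not from user input.
            cursor.execute(
                f'SELECT DISTINCT "{column}" FROM "{table}" '
                f'WHERE "{column}" IS NOT NULL LIMIT ?',
                (max_values,),
            )
            columns_examples[column] = [row[0] for row in cursor.fetchall()]
        table_examples[table] = columns_examples

    cursor.close()
    return table_examples


# Demonstration on an in-memory database with made-up rows.
demo_connection = sqlite3.connect(":memory:")
demo_connection.executescript("""
    CREATE TABLE MONDIAL_COUNTRY (NAME TEXT, CODE TEXT, POPULATION INTEGER);
    INSERT INTO MONDIAL_COUNTRY VALUES ('Austria', 'A', 100);
    INSERT INTO MONDIAL_COUNTRY VALUES ('Poland', 'PL', NULL);
    INSERT INTO MONDIAL_COUNTRY VALUES ('Poland', 'PL', NULL);
    CREATE TABLE OTHER_TABLE (ID INTEGER);
""")
sqlite_examples = collect_column_examples(demo_connection)
demo_connection.close()
print(sqlite_examples)
```

The only database-specific parts are the catalog queries (`sqlite_master` and `PRAGMA table_info` here, `USER_TABLES` and `USER_TAB_COLUMNS` on Oracle) and the row-limiting syntax (`LIMIT` versus `FETCH FIRST n ROWS ONLY`).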

## Now that we have the CREATE TABLE statements and example values, let's load the joins and get into prompting

In [3]:
import pandas as pd

# Load the file with the optimal join combinations
join_combinations_df = pd.read_csv("mondial/optimal_join_combinations.csv")

# Build one dictionary per join combination
combination_data = []
for _, row in join_combinations_df.iterrows():
    tipo = row['Type']
    combinacao = row['Combination']
    tabelas_str = row['Tables']
    condicoes_join = row['Join Conditions']
    
    # Extract list of tables from string
    tabelas_list = [t.strip() for t in tabelas_str.strip('"').split(',')]
    
    combination_data.append({
        "combination_id": f"{tipo}_{combinacao}",
        "tables": tabelas_list,
        "combination_str": condicoes_join
    })

print(f"Total of join combinations loaded: {len(combination_data)}")
print("\nExample of combination:")
print(json.dumps(combination_data[0], indent=2))

Total of join combinations loaded: 50

Example of combination:
{
  "combination_id": "2_1",
  "tables": [
    "MONDIAL_GEO_MOUNTAIN",
    "MONDIAL_MOUNTAIN",
    "MONDIAL_MOUNTAINONISLAND"
  ],
  "combination_str": "MONDIAL_GEO_MOUNTAIN.['MONDIAL_GEO_MOUNTAIN.MOUNTAIN=MONDIAL_MOUNTAIN.NAME']; MONDIAL_MOUNTAINONISLAND.['MONDIAL_MOUNTAINONISLAND.MOUNTAIN=MONDIAL_MOUNTAIN.NAME']"
}
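The two earlier cells reported different table counts (40 DDLs extracted from the schema file, examples for 59 tables from the database), so the two dictionaries do not cover exactly the same tables. Before prompting, it may be worth checking that every table referenced by a join combination has both a DDL and example values. The helper `find_missing_context` below is a hypothetical sketch, shown on toy data rather than on the real dictionaries.

```python
def find_missing_context(combination_data, table_ddls, table_examples):
    """Map each combination_id to the list of context problems found for its tables."""
    missing = {}
    for combination in combination_data:
        problems = []
        for table in combination["tables"]:
            if table not in table_ddls:
                problems.append(f"{table}: no DDL")
            if table not in table_examples:
                problems.append(f"{table}: no example values")
        if problems:
            missing[combination["combination_id"]] = problems
    return missing


# Toy data (illustrative only): one table is deliberately left without a DDL.
toy_combinations = [
    {
        "combination_id": "2_1",
        "tables": ["MONDIAL_GEO_MOUNTAIN", "MONDIAL_MOUNTAIN", "MONDIAL_MOUNTAINONISLAND"],
    }
]
toy_ddls = {
    "MONDIAL_GEO_MOUNTAIN": "CREATE TABLE MONDIAL_GEO_MOUNTAIN (...)",
    "MONDIAL_MOUNTAIN": "CREATE TABLE MONDIAL_MOUNTAIN (...)",
}
toy_examples = {
    "MONDIAL_GEO_MOUNTAIN": {},
    "MONDIAL_MOUNTAIN": {},
    "MONDIAL_MOUNTAINONISLAND": {},
}

missing_context = find_missing_context(toy_combinations, toy_ddls, toy_examples)
print(missing_context)
```

On the real data the call would be `find_missing_context(combination_data, table_ddls, table_examples)`; an empty result means every combination has full context.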


In [None]:
from llm_config import LLMConfig
from dialogue_generator import DialogueGenerator

dialogue_generator = DialogueGenerator(
    llm=LLMConfig(provider="azure").get_llm(model="o3-mini", reasoning_effort="high"),
    join_combination_data=combination_data,
    table_ddls_hashmap=table_ddls,
    table_column_values_hashmap=table_examples,
    output_file="mondial/dialogue_dataset.json",
    database_connection="mondial/mondial_db_connection.json"
)

dialogues = dialogue_generator.create_dialogue_dataset()

print(dialogues)

 - Generating dialogue for combination 1 of 50
[Using checkpoint]Dialogue for combination 1 already exists in dialogue_dataset.json. Skipping...
 - Generating dialogue for combination 2 of 50
[Generated dialogue] experiment_id='1' total_expected_interactions=3 interactions=[Interaction(interaction_id='1', speaker='User', utterance='Show me details about the Atlantic Ocean, including its area, depth, and the associated country and province.', intention='Retrieve detailed information (area, depth, country, province) about the Atlantic Ocean.', ground_truths=GroundTruth(tables_from_schema_linking=['MONDIAL_SEA', 'MONDIAL_GEO_SEA'], golden_sql="SELECT s.NAME, s.AREA, s.DEPTH, g.COUNTRY, g.PROVINCE FROM MONDIAL_SEA s JOIN MONDIAL_GEO_SEA g ON s.NAME = g.SEA WHERE s.NAME = 'Atlantic Ocean';")), Interaction(interaction_id='2', speaker='User', utterance='Which sea does the Atlantic Ocean merge with?', intention='Find out which sea merges with the Atlantic Ocean.', ground_truths=GroundTruth(tab
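
The printed output suggests each generated dialogue is a nested object. As a reading aid for the human review step, here is a sketch that mirrors the field names visible in that output with plain dataclasses and flattens a dialogue into rows an expert can check. The class name `Dialogue` and the helper `review_rows` are hypothetical, and the real classes in `dialogue_generator` may be defined differently.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class GroundTruth:
    tables_from_schema_linking: List[str]
    golden_sql: str


@dataclass
class Interaction:
    interaction_id: str
    speaker: str
    utterance: str
    intention: str
    ground_truths: GroundTruth


@dataclass
class Dialogue:  # hypothetical name; the output does not show the top-level class
    experiment_id: str
    total_expected_interactions: int
    interactions: List[Interaction] = field(default_factory=list)


def review_rows(dialogue: Dialogue) -> List[Tuple[str, str, str]]:
    """Flatten a dialogue into (interaction_id, utterance, golden_sql) rows for manual review."""
    return [
        (i.interaction_id, i.utterance, i.ground_truths.golden_sql)
        for i in dialogue.interactions
    ]


# First interaction copied from the output above.
example_dialogue = Dialogue(
    experiment_id="1",
    total_expected_interactions=3,
    interactions=[
        Interaction(
            interaction_id="1",
            speaker="User",
            utterance="Show me details about the Atlantic Ocean, including its area, depth, and the associated country and province.",
            intention="Retrieve detailed information (area, depth, country, province) about the Atlantic Ocean.",
            ground_truths=GroundTruth(
                tables_from_schema_linking=["MONDIAL_SEA", "MONDIAL_GEO_SEA"],
                golden_sql="SELECT s.NAME, s.AREA, s.DEPTH, g.COUNTRY, g.PROVINCE FROM MONDIAL_SEA s JOIN MONDIAL_GEO_SEA g ON s.NAME = g.SEA WHERE s.NAME = 'Atlantic Ocean';",
            ),
        )
    ],
)

rows = review_rows(example_dialogue)
for interaction_id, utterance, golden_sql in rows:
    print(interaction_id, "|", utterance, "|", golden_sql)
```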