# Query Pipeline for Advanced Text-to-SQL

In this guide we show you how to set up a text-to-SQL pipeline over your data with our [query pipeline](https://docs.llamaindex.ai/en/stable/module_guides/querying/pipeline/root.html) syntax.

This gives you flexibility to enhance text-to-SQL with additional techniques. We show these in the sections below:
1. **Query-Time Table Retrieval**: Dynamically retrieve relevant tables in the text-to-SQL prompt.
2. **Query-Time Sample Row Retrieval**: Embed/index each row, and dynamically retrieve example rows for each table in the text-to-SQL prompt.

Our out-of-the-box pipelines include `NLSQLTableQueryEngine` and `SQLTableRetrieverQueryEngine` (for our text-to-SQL guide using these modules, take a look [here](https://docs.llamaindex.ai/en/stable/examples/index_structs/struct_indices/SQLIndexDemo.html)). This guide implements an advanced version of those modules, giving you the flexibility to adapt it to your own setting.

## Load and Ingest Data


### Load Data
We use the [WikiTableQuestions dataset](https://ppasupat.github.io/WikiTableQuestions/) (Pasupat and Liang 2015) as our test dataset.

We go through all the CSV files in one folder and store each in a SQLite database (we will then build an object index over each table schema). Files that pandas cannot parse are reported and skipped.

In [1]:
import io
import os
import time
import re
import requests
import zipfile
import json
import json as pyjson

import pandas as pd
from pathlib import Path
from typing import List

from pydantic import BaseModel, Field

from llama_index.core import Settings
from llama_index.core.program import LLMTextCompletionProgram
from llama_index.llms.ollama import Ollama

# put data into sqlite db
from sqlalchemy import (
    create_engine,
    MetaData,
    Table,
    Column,
    String,
    Integer,
)

# setup Arize Phoenix for logging/observability
import phoenix as px
from llama_index.core import set_global_handler

from llama_index.core.objects import (
    SQLTableNodeMapping,
    ObjectIndex,
    SQLTableSchema,
)
from llama_index.core import SQLDatabase, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

from llama_index.core.retrievers import SQLRetriever

from llama_index.core.prompts.default_prompts import DEFAULT_TEXT_TO_SQL_PROMPT
from llama_index.core.prompts import PromptTemplate
from llama_index.core.tools import FunctionTool
from llama_index.core.llms import ChatResponse

from llama_index.core.workflow import Workflow, step, StartEvent, StopEvent
from llama_index.core.workflow.events import Event

# import networkx as nx
# from pyvis.network import Network

from llama_index.utils.workflow import (
    draw_all_possible_flows,
    draw_most_recent_execution,
)



In [None]:
URL = "https://github.com/ppasupat/WikiTableQuestions/releases/download/v1.0.2/WikiTableQuestions-1.0.2-compact.zip"

OUTPUT_DIR = "../data"
os.makedirs(OUTPUT_DIR, exist_ok=True)

print("Downloading...")
response = requests.get(URL)
response.raise_for_status()

print("Extracting...")
with zipfile.ZipFile(io.BytesIO(response.content)) as z:
    z.extractall(OUTPUT_DIR)

print("Done.")

Downloading...
Extracting...
Done.


In [53]:
DATA_DIR = Path("../data/WikiTableQuestions/csv/200-csv")
CSV_FILES = sorted([f for f in DATA_DIR.glob("*.csv")])
dfs = []

for csv_file in CSV_FILES:
    print(f"processing file: {csv_file}")
    try:
        df = pd.read_csv(csv_file)
        dfs.append(df)
    except Exception as e:
        print(f"Error parsing {csv_file}: {str(e)}")

processing file: ..\data\WikiTableQuestions\csv\200-csv\0.csv
processing file: ..\data\WikiTableQuestions\csv\200-csv\1.csv
processing file: ..\data\WikiTableQuestions\csv\200-csv\10.csv
processing file: ..\data\WikiTableQuestions\csv\200-csv\11.csv
processing file: ..\data\WikiTableQuestions\csv\200-csv\12.csv
processing file: ..\data\WikiTableQuestions\csv\200-csv\14.csv
processing file: ..\data\WikiTableQuestions\csv\200-csv\15.csv
Error parsing ..\data\WikiTableQuestions\csv\200-csv\15.csv: Error tokenizing data. C error: Expected 4 fields in line 16, saw 5

processing file: ..\data\WikiTableQuestions\csv\200-csv\17.csv
Error parsing ..\data\WikiTableQuestions\csv\200-csv\17.csv: Error tokenizing data. C error: Expected 6 fields in line 5, saw 7

processing file: ..\data\WikiTableQuestions\csv\200-csv\18.csv
processing file: ..\data\WikiTableQuestions\csv\200-csv\20.csv
processing file: ..\data\WikiTableQuestions\csv\200-csv\22.csv
processing file: ..\data\WikiTableQuestions\csv\20
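
The "Expected 4 fields in line 16, saw 5" errors above come from rows that are wider than the header. As a rough diagnostic sketch (not part of the original notebook), the standard-library `csv` module can list such rows before a file is handed to pandas. The helper name `find_ragged_rows` and the sample text are made up for illustration, and the helper assumes the input has at least a header line.

```python
import csv
import io


def find_ragged_rows(csv_text: str):
    """Return (line_number, field_count) for rows whose width differs from the header.

    Hypothetical helper for illustration; assumes a non-empty CSV with a header row.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    return [
        (reader.line_num, len(row))
        for row in reader
        if len(row) != len(header)
    ]


# Made-up sample: the third line has one field too many
sample = "a,b,c\n1,2,3\n4,5,6,7\n"
ragged = find_ragged_rows(sample)
print(ragged)
```

Whether to repair, skip, or drop such rows is a separate design choice; the loop above simply skips the whole file.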

### Extract Table Name and Summary from each Table

Here we use an LLM (in this notebook, `qwen3:0.6b` served through Ollama) to extract a table name (with underscores) and a summary from each table with our Pydantic program.

In [54]:
TABLEINFO_DIR = "../data/WikiTableQuestions_TableInfo"
os.makedirs(TABLEINFO_DIR, exist_ok=True)

In [55]:
class TableInfo(BaseModel):
    """Information regarding a structured table."""

    table_name: str = Field(
        ..., description="table name (must be underscores and NO spaces)"
    )
    table_summary: str = Field(
        ..., description="short, concise summary/caption of the table"
    )

PROMPT_STR = """\
    Return only a JSON object, with no explanation, no prose, no markdown, and no trailing text.
    You are to produce **only** a JSON object matching the following exact schema:

    {
        "table_name": "<short_name_in_snake_case_without_spaces>",
        "table_summary": "<short concise caption of the table>"
    }

    Example:
    {"table_name": "movie_info", "table_summary": "Summary of movie data"}

    Rules:
    - The table_name must be unique to the table, describe it clearly, and be in snake_case.
    - Do NOT output a generic table name (e.g., "table", "my_table").
    - Do NOT make the table name one of the following: {exclude_table_name_list}.
    - Do NOT include any keys other than "table_name" and "table_summary".
    - Do NOT include extra text before/after the JSON.
    - Do NOT include any other keys or text before/after the JSON.
    - Do NOT wrap in ```json.

    Table:
    {table_str}
"""

Settings.llm = Ollama(
    model="qwen3:0.6b", 
    request_timeout=240,
    format="json",
    # context_window=1000
)

program = LLMTextCompletionProgram.from_defaults(
    output_cls=TableInfo,
    prompt_template_str=PROMPT_STR,
    llm=Settings.llm,
    # verbose=True,
)

In [56]:
def extract_first_json_block(text: str):
    match = re.search(r"\{.*\}", text, re.S)  # grab first {...} block
    if not match:
        raise ValueError("No JSON object found in output")
    return pyjson.loads(match.group())


MAX_RETRIES = 3


def _get_tableinfo_with_index(idx: int) -> "TableInfo | None":
    """Load the cached TableInfo for a table index, or None if it is not cached."""
    results_list = list(Path(TABLEINFO_DIR).glob(f"{idx}_*"))

    if len(results_list) == 0:
        return None
    elif len(results_list) == 1:
        path = results_list[0]
        json_str = path.read_text(encoding="utf-8")
        return TableInfo.model_validate_json(json_str)
    else:
        # Report the materialized list; the glob generator would already be exhausted here
        raise ValueError(f"More than one file matching index: {results_list}")
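
The greedy pattern `\{.*\}` spans from the first `{` to the last `}`, so it can fail when the model emits trailing text that contains another brace. One possible alternative, shown here as a sketch rather than as what the notebook runs, is `json.JSONDecoder.raw_decode`, which decodes a single JSON value starting at a given position and ignores whatever follows. The helper name `extract_first_json_object` and the noisy sample string are assumptions made for illustration.

```python
import json


def extract_first_json_object(text: str) -> dict:
    """Return the first decodable JSON object in `text` (hypothetical helper)."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value at `start` and ignores trailing text
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = text.find("{", start + 1)
    raise ValueError("No JSON object found in output")


# Made-up model output with a preamble and a stray second object
noisy = 'Sure! {"table_name": "movie_info", "table_summary": "Movie data"} {"extra": 1}'
info = extract_first_json_object(noisy)
print(info["table_name"])
```

With the same input, the greedy regex would capture both objects as one string and `json.loads` would reject it.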

In [57]:
table_names = set()
table_infos = []

for idx, df in enumerate(dfs):
    table_info = _get_tableinfo_with_index(idx)
    if table_info:
        # Register cached names too, so later prompts exclude them
        table_names.add(table_info.table_name)
        table_infos.append(table_info)
        continue

    df_str = df.head(10).to_csv()

    for attempt in range(MAX_RETRIES):
        try:
            raw_output = program(
                table_str=df_str,
                exclude_table_name_list=str(list(table_names)),
            )

            if isinstance(raw_output, TableInfo):
                table_info = raw_output
            elif isinstance(raw_output, dict):
                table_info = TableInfo(**raw_output)
            elif isinstance(raw_output, str):
                parsed_dict = extract_first_json_block(raw_output)
                table_info = TableInfo(**parsed_dict)
            else:
                raise TypeError(f"Unexpected return type from program(): {type(raw_output)}")

            table_name = table_info.table_name
            print(f"Processed table: {table_name}")

            if table_name in table_names:
                print(f"Table name '{table_name}' already exists, skipping this table.")
                table_info = None  # don’t append duplicate
                break  # skip

            # save table info
            table_names.add(table_name)
            out_file = f"{TABLEINFO_DIR}/{idx}_{table_name}.json"
            # Use a context manager so the file handle is closed promptly
            with open(out_file, "w", encoding="utf-8") as f:
                json.dump(table_info.model_dump(), f)
            break  # move to next table

        except Exception as e:
            print(f"Error with attempt {attempt+1}: {e}")
            time.sleep(2)

    if table_info:
        table_infos.append(table_info)

To retry a single index (if needed):

In [None]:
# idx = 20
# df = dfs[idx]

# table_info = _get_tableinfo_with_index(idx)
# if table_info:
#     table_infos.append(table_info)
# else:
#     df_str = df.head(20).to_csv()

#     for attempt in range(MAX_RETRIES):
#         try:
#             raw_output = program(
#                 table_str=df_str,
#                 exclude_table_name_list=str(list(table_names)),
#             )

#             if isinstance(raw_output, TableInfo):
#                 table_info = raw_output
#             elif isinstance(raw_output, dict):
#                 table_info = TableInfo(**raw_output)
#             elif isinstance(raw_output, str):
#                 parsed_dict = extract_first_json_block(raw_output)
#                 table_info = TableInfo(**parsed_dict)
#             else:
#                 raise TypeError(f"Unexpected return type from program(): {type(raw_output)}")

#             table_name = table_info.table_name
#             print(f"Processed table: {table_name}")

#             if table_name in table_names:
#                 print(f"Table name '{table_name}' already exists, skipping this table.")
#                 table_info = None
#                 break

#             table_names.add(table_name)
#             out_file = f"{TABLEINFO_DIR}/{idx}_{table_name}.json"
#             json.dump(table_info.model_dump(), open(out_file, "w"))
#             break

#         except Exception as e:
#             print(f"Error with attempt {attempt+1}: {e}")
#             time.sleep(2)

#     if table_info:
#         table_infos.append(table_info)

Error with attempt 1: 1 validation error for TableInfo
  Invalid JSON: trailing characters at line 1 column 97 [type=json_invalid, input_value='{"table_name": "award_in...n the specified table"}', input_type=str]
    For further information visit https://errors.pydantic.dev/2.11/v/json_invalid
Processed table: award_nominations


### Put Data in SQL Database

We use `sqlalchemy`, a popular SQL database toolkit, to load all the tables.

In [58]:
# Function to create a sanitized column name
def sanitize_column_name(col_name):
    # Remove special characters and replace spaces with underscores
    return re.sub(r"\W+", "_", col_name)


# Function to create a table from a DataFrame using SQLAlchemy
def create_table_from_dataframe(
    df: pd.DataFrame, table_name: str, engine, metadata_obj
):
    # Sanitize column names
    sanitized_columns = {col: sanitize_column_name(col) for col in df.columns}
    df = df.rename(columns=sanitized_columns)

    # Dynamically create columns based on DataFrame columns and data types
    columns = [
        Column(col, String if dtype == "object" else Integer)
        for col, dtype in zip(df.columns, df.dtypes)
    ]

    # Create a table with the defined columns
    table = Table(table_name, metadata_obj, *columns)

    # Create the table in the database
    metadata_obj.create_all(engine)

    # Insert data from DataFrame into the table
    with engine.connect() as conn:
        for _, row in df.iterrows():
            insert_stmt = table.insert().values(**row.to_dict())
            conn.execute(insert_stmt)
        conn.commit()


engine = create_engine("sqlite:///:memory:")
metadata_obj = MetaData()
for idx, df in enumerate(dfs):
    tableinfo = _get_tableinfo_with_index(idx)
    if tableinfo is None:
        print(f"[ERROR] No TableInfo for index {idx}")
        continue  # skip this one or handle it differently
    print(f"Creating table: {tableinfo.table_name}")
    create_table_from_dataframe(df, tableinfo.table_name, engine, metadata_obj)

Creating table: movie_chart_positions
Creating table: movie_data
Creating table: death_accident_statistics
Creating table: award_data_1972
Creating table: award_data
Creating table: people_info
Creating table: broadcasting_info
Creating table: person_info
Creating table: chart_positions
Creating table: kodachrome_film_info
Creating table: bbc_radio_costs
Creating table: airport_locations
Creating table: party_voters
Creating table: club_performance
Creating table: horse_race_data
Creating table: grammy_awards
Creating table: boxing_matches
Creating table: sports_performance_data
Creating table: district_info
Creating table: party_data
Creating table: award_nominations
Creating table: government_ministers
Creating table: new_municipality_old_municipality_seat
Creating table: team_performance
Creating table: encoding_info
Creating table: temperature_data
Creating table: people_terms
Creating table: new_mexico_governorships
Creating table: weather_statistics
Creating table: drop_event_dat
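
One thing to watch with `sanitize_column_name` is that distinct headers can collapse to the same sanitized name, which would make the `Table(...)` definition ambiguous. The sketch below reuses the same regex and adds a hypothetical `dedupe_names` helper that is not part of the notebook; the column headers are invented for illustration.

```python
import re


def sanitize_column_name(col_name: str) -> str:
    # Collapse each run of non-word characters into a single underscore
    return re.sub(r"\W+", "_", col_name)


def dedupe_names(names):
    """Hypothetical helper: suffix repeated names so every column stays unique."""
    seen = {}
    unique = []
    for name in names:
        count = seen.get(name, 0)
        unique.append(name if count == 0 else f"{name}_{count}")
        seen[name] = count + 1
    return unique


# Invented headers: the first two differ only in special characters
raw_columns = ["Worldwide Gross ($)", "Worldwide Gross (%)", "Year"]
sanitized = [sanitize_column_name(c) for c in raw_columns]
deduped = dedupe_names(sanitized)
print(sanitized)
print(deduped)
```

This is only a sketch: a suffixed name could still clash with a column that already ends in `_1`, so a stricter scheme may be needed for messy data.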

In [59]:
px.launch_app()
set_global_handler("arize_phoenix")

Existing running Phoenix instance detected! Shutting it down and starting a new instance...
Attempting to instrument while already instrumented


🌍 To view the Phoenix app in your browser, visit http://localhost:6006/
📖 For more information on how to use Phoenix, check out https://arize.com/docs/phoenix


## Advanced Capability 1: Text-to-SQL with Query-Time Table Retrieval.

We now show you how to set up an end-to-end text-to-SQL pipeline with table retrieval.

### Define Modules

Here we define the core modules.
1. Object index + retriever to store table schemas
2. SQLDatabase object to connect to the above tables + SQLRetriever.
3. Text-to-SQL Prompt
4. Response synthesis Prompt
5. LLM

Object index, retriever, SQLDatabase

In [60]:
embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")

sql_database = SQLDatabase(engine)
table_node_mapping = SQLTableNodeMapping(sql_database)

table_schema_objs = [
    SQLTableSchema(table_name=t.table_name, context_str=t.table_summary)
    for t in table_infos
]  # add a SQLTableSchema for each table

obj_index = ObjectIndex.from_objects(
    table_schema_objs,
    table_node_mapping,
    VectorStoreIndex,
    embed_model=embed_model,
)
obj_retriever = obj_index.as_retriever(similarity_top_k=5)
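
To build intuition for what `similarity_top_k` does, here is a toy, dependency-free sketch of top-k table retrieval. It swaps the embedding model for a plain bag-of-words cosine similarity, so it only illustrates the ranking idea and is not how `ObjectIndex` works internally. The table names are taken from the output above, but the summaries are invented for illustration.

```python
import math
from collections import Counter


def bow(text: str) -> Counter:
    """Bag-of-words vector: a toy stand-in for a real embedding."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Invented summaries keyed by table name
summaries = {
    "grammy_awards": "grammy awards won by artists per year",
    "airport_locations": "airport names and their city locations",
    "boxing_matches": "boxing matches with opponents and results",
}


def top_k_tables(query: str, k: int = 2):
    """Rank tables by similarity between the query and each table summary."""
    q = bow(query)
    ranked = sorted(
        summaries, key=lambda name: cosine(q, bow(summaries[name])), reverse=True
    )
    return ranked[:k]


best = top_k_tables("which artists won grammy awards", k=1)
print(best)
```

Only the schemas of the retrieved tables are then placed in the text-to-SQL prompt, which keeps the prompt small when the database has many tables.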

SQLRetriever + Table Parser

In [61]:
sql_retriever = SQLRetriever(sql_database)


def get_table_context_str(table_schema_objs: List[SQLTableSchema]):
    """Get table context string."""
    context_strs = []
    for table_schema_obj in table_schema_objs:
        table_info = sql_database.get_single_table_info(
            table_schema_obj.table_name
        )
        if table_schema_obj.context_str:
            table_opt_context = " The table description is: "
            table_opt_context += table_schema_obj.context_str
            table_info += table_opt_context

        context_strs.append(table_info)
    return "\n\n".join(context_strs)


table_parser_component = get_table_context_str(table_schema_objs)

Text-to-SQL Prompt + Output Parser

In [None]:
def parse_response_to_sql(response: ChatResponse) -> str:
    """Parse response to SQL."""
    response = response.message.content
    sql_query_start = response.find("SQLQuery:")
    if sql_query_start != -1:
        response = response[sql_query_start:]
        
        if response.startswith("SQLQuery:"):
            response = response[len("SQLQuery:") :]
    sql_result_start = response.find("SQLResult:")
    if sql_result_start != -1:
        response = response[:sql_result_start]
    return response.strip().strip("```").strip()


sql_parser_component = FunctionTool.from_defaults(fn=parse_response_to_sql)

text2sql_prompt = DEFAULT_TEXT_TO_SQL_PROMPT.partial_format(
    dialect=engine.dialect.name
)
print(text2sql_prompt.template)

Given an input question, first create a syntactically correct {dialect} query to run, then look at the results of the query and return the answer. You can order the results by a relevant column to return the most interesting examples in the database.

Never query for all the columns from a specific table, only ask for a few relevant columns given the question.

Pay attention to use only the column names that you can see in the schema description. Be careful to not query for columns that do not exist. Pay attention to which column is in which table. Also, qualify column names with the table name when needed. You are required to use the following format, each taking one line:

Question: Question here
SQLQuery: SQL Query to run
SQLResult: Result of the SQLQuery
Answer: Final answer here

Only use tables listed below.
{schema}

Question: {query_str}
SQLQuery: 
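
To illustrate what the parser keeps from a completion that follows this format, the sketch below applies the same slicing logic to a stand-in object. `SimpleNamespace` is used only to mimic the `.message.content` attribute of a chat response; the function name `parse_sql_from_message`, the `artist` column, and the response text are assumptions made for the example.

```python
from types import SimpleNamespace


def parse_sql_from_message(response) -> str:
    """Same slicing idea as above, for any object exposing .message.content."""
    text = response.message.content
    start = text.find("SQLQuery:")
    if start != -1:
        # Keep only what follows the "SQLQuery:" label
        text = text[start + len("SQLQuery:"):]
    end = text.find("SQLResult:")
    if end != -1:
        # Drop the result and answer the model may have hallucinated
        text = text[:end]
    return text.strip().strip("`").strip()


# Made-up completion in the Question/SQLQuery/SQLResult/Answer format
fake_response = SimpleNamespace(
    message=SimpleNamespace(
        content=(
            "Question: Which artists are listed?\n"
            "SQLQuery: SELECT artist FROM grammy_awards LIMIT 5\n"
            "SQLResult: [('Example Artist',)]\n"
            "Answer: Example Artist"
        )
    )
)
sql = parse_sql_from_message(fake_response)
print(sql)
```

Everything after `SQLResult:` is discarded because the model has not actually run the query at this point; the real result comes from the SQL retriever.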


Response Synthesis Prompt

In [63]:
response_synthesis_prompt_str = (
    "Given an input question, synthesize a response from the query results.\n"
    "Query: {query_str}\n"
    "SQL: {sql_query}\n"
    "SQL Response: {context_str}\n"
    "Response: "
)
response_synthesis_prompt = PromptTemplate(
    response_synthesis_prompt_str,
)

### Define Workflow

Now that the components are in place, let's define the workflow that chains them together!

In [None]:
# # custom events
# class TableRetrievedEvent(Event):
#     tables: list
#     query_str: str

# class SchemaProcessedEvent(Event):
#     table_schema: str
#     query_str: str

# class SQLPromptReadyEvent(Event):
#     t2s_prompt: str
#     query_str: str
#     table_schema: str

# class SQLGeneratedEvent(Event):
#     sql_query: str
#     query_str: str
#     table_schema: str

# class SQLParsedEvent(Event):
#     sql_query: str
#     query_str: str
#     table_schema: str

# class SQLResultsEvent(Event):
#     context_str: str
#     sql_query: str
#     query_str: str

# class ResponsePromptReadyEvent(Event):
#     rs_prompt: str


# def extract_sql_from_response(llm_response: str) -> str:
#     """
#     Extract SQL query from LLM response that might contain reasoning or formatting.
#     """
#     response = llm_response.strip()
    
#     # First, remove <think> blocks entirely
#     response = re.sub(r'<think>.*?</think>', '', response, flags=re.DOTALL).strip()
    
#     # Method 1: Look for SQLQuery: pattern
#     sql_query_match = re.search(r'SQLQuery:\s*([^;]+;?)', response, re.IGNORECASE | re.DOTALL)
#     if sql_query_match:
#         sql = sql_query_match.group(1).strip()
#         return clean_sql_query(sql)
    
#     # Method 2: Look for SQL in code blocks
#     code_block_match = re.search(r'```sql\s*\n(.*?)\n```', response, re.IGNORECASE | re.DOTALL)
#     if code_block_match:
#         sql = code_block_match.group(1).strip()
#         return clean_sql_query(sql)
    
#     # Method 3: Look for standalone SQL statements (most common case)
#     sql_keywords = ['SELECT', 'INSERT', 'UPDATE', 'DELETE', 'WITH']
    
#     # Split by lines and look for SQL statements
#     lines = response.split('\n')
#     for line in lines:
#         line = line.strip()
#         if not line:
#             continue
            
#         # Check if line starts with SQL keyword
#         if any(line.upper().startswith(keyword.upper()) for keyword in sql_keywords):
#             return clean_sql_query(line)
    
#     # Method 4: Look for multi-line SQL statements
#     for keyword in sql_keywords:
#         pattern = rf'\b{keyword}\b.*?(?=\n\s*\n|\nSQLResult|\nAnswer|$)'
#         sql_match = re.search(pattern, response, re.IGNORECASE | re.DOTALL)
#         if sql_match:
#             sql = sql_match.group(0).strip()
#             return clean_sql_query(sql)
    
#     # Fallback: if nothing found, return empty string to avoid errors
#     print(f"Warning: Could not extract SQL from response: {response[:100]}...")
#     return "SELECT 1"  # Safe fallback query


# def clean_sql_query(sql: str) -> str:
#     """
#     Clean and standardize SQL query.
#     """
#     if not sql:
#         return "SELECT 1"
    
#     # Remove extra whitespace
#     sql = ' '.join(sql.split())
    
#     # Fix quote issues - convert double quotes to single quotes for string literals
#     # This is a simple approach - for more complex cases, you'd need a proper SQL parser
#     sql = re.sub(r'"([^"]*)"', r"'\1'", sql)
    
#     # Remove multiple semicolons
#     sql = re.sub(r';+', ';', sql)
    
#     # Remove trailing semicolon and add it back cleanly
#     sql = sql.rstrip(';').strip()
    
#     # Don't add semicolon for now since it might be causing issues
#     return sql


# class Text2SQLWorkflow(Workflow):
    
#     @step
#     async def input_step(self, ev: StartEvent) -> TableRetrievedEvent:
#         """Process the initial query and retrieve relevant tables"""
#         query = ev.query
        
#         # Retrieve table schemas (you'll need to define obj_retriever)
#         table_schema_objs = obj_retriever.retrieve(query)
        
#         return TableRetrievedEvent(
#             tables=table_schema_objs,
#             query_str=query
#         )
    
#     @step
#     async def table_output_parser_step(self, ev: TableRetrievedEvent) -> SchemaProcessedEvent:
#         """Parse table schemas into string format"""
#         # You'll need to define get_table_context_str function
#         schema_str = get_table_context_str(ev.tables)
        
#         return SchemaProcessedEvent(
#             table_schema=schema_str,
#             query_str=ev.query_str
#         )
    
#     @step
#     async def text2sql_prompt_step(self, ev: SchemaProcessedEvent) -> SQLPromptReadyEvent:
#         """Create the text-to-SQL prompt"""
#         # Enhanced prompt to ensure clean SQL output
#         ENHANCED_PROMPT = f"""
#             Given the following table schema and user question, generate a SQL query.

#             Table Schema:
#             {ev.table_schema}

#             User Question: {ev.query_str}

#             Instructions:
#             1. Generate ONLY a valid SQL query
#             2. Do not include any explanations, reasoning, or additional text
#             3. Do not include SQLQuery:, SQLResult:, or Answer: labels
#             4. Do not wrap in code blocks or other formatting
#             5. End the query with a semicolon

#             SQL Query:
#         """
        
#         # If you have a custom text2sql_prompt, use it instead
#         # prompt = text2sql_prompt.format(
#         #     query_str=ev.query_str,
#         #     table_schema=ev.table_schema
#         # )
        
#         return SQLPromptReadyEvent(
#             t2s_prompt=ENHANCED_PROMPT,
#             query_str=ev.query_str,
#             table_schema=ev.table_schema
#         )
    
#     @step
#     async def text2sql_llm_step(self, ev: SQLPromptReadyEvent) -> SQLGeneratedEvent:
#         """Generate SQL query using LLM"""
#         # You'll need to configure Settings.llm
#         sql_response = await Settings.llm.acomplete(ev.t2s_prompt)
        
#         return SQLGeneratedEvent(
#             sql_query=str(sql_response).strip(),
#             query_str=ev.query_str,
#             table_schema=ev.table_schema
#         )
    
#     @step
#     async def sql_output_parser_step(self, ev: SQLGeneratedEvent) -> SQLParsedEvent:
#         """Parse and clean the generated SQL query"""
#         # Extract clean SQL from the LLM response
#         clean_sql = extract_sql_from_response(ev.sql_query)
        
#         print(f"Original LLM Response: {ev.sql_query}")
#         print(f"Cleaned SQL Query: {clean_sql}")
        
#         # Validate that we have a reasonable SQL query
#         if not clean_sql or clean_sql == "SELECT 1":
#             print("Warning: Could not extract valid SQL, using fallback")
        
#         return SQLParsedEvent(
#             sql_query=clean_sql,
#             query_str=ev.query_str,
#             table_schema=ev.table_schema
#         )
    
#     @step
#     async def sql_retriever_step(self, ev: SQLParsedEvent) -> SQLResultsEvent:
#         """Execute SQL query and get results"""
#         try:
#             # You'll need to define sql_retriever
#             results = sql_retriever.retrieve(ev.sql_query)
            
#             return SQLResultsEvent(
#                 context_str=str(results),
#                 sql_query=ev.sql_query,
#                 query_str=ev.query_str
#             )
#         except Exception as e:
#             print(f"SQL Execution Error: {e}")
#             print(f"Failed SQL Query: {ev.sql_query}")
#             # Return error information for debugging
#             return SQLResultsEvent(
#                 context_str=f"SQL execution failed: {str(e)}",
#                 sql_query=ev.sql_query,
#                 query_str=ev.query_str
#             )
    
#     @step
#     async def response_synthesis_prompt_step(self, ev: SQLResultsEvent) -> ResponsePromptReadyEvent:
#         """Create the response synthesis prompt"""
#         # You'll need to define response_synthesis_prompt template
#         prompt = response_synthesis_prompt.format(
#             query_str=ev.query_str,
#             context_str=ev.context_str,
#             sql_query=ev.sql_query
#         )
        
#         return ResponsePromptReadyEvent(rs_prompt=prompt)
    
#     @step
#     async def response_synthesis_llm_step(self, ev: ResponsePromptReadyEvent) -> StopEvent:
#         """Generate final answer using LLM"""
#         answer = await Settings.llm.acomplete(ev.rs_prompt)
        
#         return StopEvent(result=str(answer))


# async def run_text2sql_workflow(query: str):
#     workflow = Text2SQLWorkflow(timeout=120)
#     result = await workflow.run(query=query)
#     return result

In [64]:
# custom events
class TableRetrievedEvent(Event):
    tables: list
    query_str: str

class SchemaProcessedEvent(Event):
    table_schema: str
    query_str: str

class SQLPromptReadyEvent(Event):
    t2s_prompt: str
    query_str: str
    table_schema: str
    retry_count: int = 0
    error_message: str = ""

class SQLGeneratedEvent(Event):
    sql_query: str
    query_str: str
    table_schema: str
    retry_count: int = 0
    error_message: str = ""

class SQLParsedEvent(Event):
    sql_query: str
    query_str: str
    table_schema: str
    retry_count: int = 0
    error_message: str = ""

class SQLResultsEvent(Event):
    context_str: str
    sql_query: str
    query_str: str
    success: bool = True

class ResponsePromptReadyEvent(Event):
    rs_prompt: str


def is_valid_sql_start(text: str) -> bool:
    """Check if text starts with valid SQL"""
    if not text:
        return False
    
    sql_keywords = ['SELECT', 'WITH', 'INSERT', 'UPDATE', 'DELETE']
    text_upper = text.upper().strip()
    return any(text_upper.startswith(keyword) for keyword in sql_keywords)

def extract_sql_from_response(llm_response: str) -> str:
    """
    Extract SQL query from LLM response that might contain reasoning or formatting.
    """
    response = llm_response.strip()
    
    # First, remove <think> blocks entirely
    response = re.sub(r'<think>.*?</think>', '', response, flags=re.DOTALL).strip()
    
    # Drop a leading prefix that directly precedes a SQL keyword
    # (note: [^S]* only matches a prefix that contains no "s"/"S")
    response = re.sub(r'^[^S]*(?=SELECT|WITH|INSERT|UPDATE|DELETE)', '', response, flags=re.IGNORECASE)
    
    # Method 1: Look for SQLQuery: pattern
    sql_query_match = re.search(r'SQLQuery:\s*([^;]+;?)', response, re.IGNORECASE | re.DOTALL)
    if sql_query_match:
        sql = sql_query_match.group(1).strip()
        return clean_sql_query(sql)
    
    # Method 2: Look for SQL in code blocks
    code_block_patterns = [
        r'```sql\s*\n(.*?)\n```',
        r'```\s*\n(.*?)\n```',
        r'`([^`]+)`'
    ]
    
    for pattern in code_block_patterns:
        match = re.search(pattern, response, re.IGNORECASE | re.DOTALL)
        if match:
            sql = match.group(1).strip()
            if is_valid_sql_start(sql):
                return clean_sql_query(sql)
    
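    # Method 3: Look for a statement that starts on its own line, then collect
    # continuation lines so multi-line queries are not cut off after line one
    sql_keywords = ['SELECT', 'INSERT', 'UPDATE', 'DELETE', 'WITH']
    
    lines = response.split('\n')
    for i, line in enumerate(lines):
        # Skip lines that do not start with a SQL keyword
        if not is_valid_sql_start(line):
            continue
        
        statement = []
        for candidate in lines[i:]:
            stripped = candidate.strip()
            # Stop at a blank line or at a result/answer label
            if not stripped or stripped.startswith(('SQLResult', 'Answer')):
                break
            statement.append(stripped)
        return clean_sql_query(' '.join(statement))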
    
    # Method 4: Look for multi-line SQL statements
    for keyword in sql_keywords:
        pattern = rf'\b{keyword}\b.*?(?=\n\s*\n|\nSQLResult|\nAnswer|$)'
        sql_match = re.search(pattern, response, re.IGNORECASE | re.DOTALL)
        if sql_match:
            sql = sql_match.group(0).strip()
            return clean_sql_query(sql)
    
    # Fallback: if nothing was found, return a harmless placeholder query to avoid errors
    print(f"Warning: Could not extract SQL from response: {response[:100]}...")
    return "SELECT 1"  # Safe fallback query


def clean_sql_query(sql: str) -> str:
    """
    Clean and standardize SQL query.
    """
    if not sql:
        return "SELECT 1"
    
    # Remove extra whitespace
    sql = ' '.join(sql.split())
    
    # Fix quote issues - convert double quotes to single quotes for string literals
    # This is a simple approach - for more complex cases, you'd need a proper SQL parser
    sql = re.sub(r'"([^"]*)"', r"'\1'", sql)
    
    # Remove multiple semicolons
    sql = re.sub(r';+', ';', sql)
    
    # Remove any trailing semicolon; it is intentionally not added back,
    # since a trailing semicolon might be causing issues
    sql = sql.rstrip(';').strip()
    
    return sql


def analyze_sql_error(error_message: str, sql_query: str, table_schema: str) -> str:
    """
    Analyze SQL error and provide suggestions for fixing the query.
    """
    error_lower = error_message.lower()
    
    if "no such column" in error_lower:
        # Extract the problematic column name
        column_match = re.search(r'no such column:\s*(\w+)', error_lower)
        if column_match:
            bad_column = column_match.group(1)
            
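            # Try to suggest correct column names from schema.
            # Assumption: the schema string lists columns as "name (TYPE)",
            # which is how SQLDatabase.get_single_table_info appears to format them.
            schema_lower = table_schema.lower()
            possible_columns = re.findall(r'(\w+)\s*\(', schema_lower)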
            
            suggestions = []
            for col in possible_columns:
                if bad_column.lower() in col.lower() or col.lower() in bad_column.lower():
                    suggestions.append(col)
            
            error_msg = f"Column '{bad_column}' does not exist."
            if suggestions:
                error_msg += f" Did you mean: {', '.join(suggestions[:3])}?"
            error_msg += f"\n\nAvailable columns from schema:\n{table_schema}"
            return error_msg
    
    elif "no such table" in error_lower:
        table_match = re.search(r'no such table:\s*([\w\s\[\]]+)', error_lower)
        if table_match:
            bad_table = table_match.group(1).strip()
            return f"Table '{bad_table}' does not exist. Available tables from schema:\n{table_schema}"
    
    elif "syntax error" in error_lower:
        return f"SQL syntax error. Please check:\n- Missing quotes around strings\n- Proper parentheses\n- Correct SQL keywords\n\nFailed query: {sql_query}"
    
    return f"SQL execution error: {error_message}\n\nFailed query: {sql_query}\n\nSchema: {table_schema}"

def create_enhanced_prompt(table_schema: str, query_str: str, retry_count: int = 0, error_message: str = ""):
    if retry_count == 0:
        # Initial attempt
        ENHANCED_PROMPT = f"""Given the table schema and user question below, generate ONLY a valid SQL query.

            Table Schema:
            {table_schema}

            User Question: {query_str}

            IMPORTANT RULES:
            1. Return ONLY the SQL query, nothing else
            2. Use single quotes for string literals, not double quotes
            3. Do not include any explanations, reasoning, or additional text
            4. Do not include labels like "SQLQuery:", "Answer:", etc.
            5. Do not wrap in code blocks or markdown formatting
            6. Do not include semicolons at the end
            7. Do not include any <think> tags or reasoning
            8. Only use column names that exist in the provided schema

            Example format:
            SELECT column_name FROM table_name WHERE condition

            Your SQL query:
        """
    else:
        # Retry attempt with error information
        ENHANCED_PROMPT = f"""The previous SQL query failed with an error. Please generate a corrected SQL query.

            Table Schema:
            {table_schema}

            User Question: {query_str}

            Previous Error: {error_message}

            IMPORTANT RULES:
            1. Return ONLY the corrected SQL query, nothing else
            2. Use single quotes for string literals, not double quotes
            3. Carefully check that all column names exist in the provided schema
            4. Do not include any explanations, reasoning, or additional text
            5. Do not include labels like "SQLQuery:", "Answer:", etc.
            6. Do not wrap in code blocks or markdown formatting
            7. Do not include semicolons at the end
            8. Only use column names that are explicitly listed in the schema above

            Your corrected SQL query:
        """
    
    return ENHANCED_PROMPT
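
To see the kind of message `analyze_sql_error` matches against, the sketch below triggers a "no such column" error in an in-memory SQLite database and extracts the offending name. It is an illustration under assumptions, not part of the workflow: the table is a stand-in, `bad_column_from_error` is a hypothetical helper, and the regex is a case-insensitive variant of the one above, so the column name keeps its original casing.

```python
import re
import sqlite3

# Stand-in table for illustration only (not the notebook's database)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people_info (Act TEXT, Year_signed INTEGER)")


def bad_column_from_error(sql: str):
    """Run sql and return the column named in a 'no such column' error, if any."""
    try:
        conn.execute(sql)
    except sqlite3.OperationalError as exc:
        # Match on the original message so the reported name keeps its casing
        match = re.search(r"no such column:\s*([\w.]+)", str(exc), re.IGNORECASE)
        return match.group(1) if match else None
    return None  # the query ran without error


print(bad_column_from_error("SELECT Year FROM people_info"))  # column does not exist
print(bad_column_from_error("SELECT Act FROM people_info"))   # valid query
```

Matching on the original message rather than a lowercased copy means any name echoed back to the LLM looks the same as it does in the schema.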

In [65]:
class Text2SQLWorkflow(Workflow):
    
    @step
    async def input_step(self, ev: StartEvent) -> TableRetrievedEvent:
        """Process the initial query and retrieve relevant tables"""
        query = ev.query
        
        # Retrieve table schemas (you'll need to define obj_retriever)
        table_schema_objs = obj_retriever.retrieve(query)
        
        return TableRetrievedEvent(
            tables=table_schema_objs,
            query_str=query
        )
    
    @step
    async def table_output_parser_step(self, ev: TableRetrievedEvent) -> SchemaProcessedEvent:
        """Parse table schemas into string format"""
        # You'll need to define get_table_context_str function
        schema_str = get_table_context_str(ev.tables)
        
        return SchemaProcessedEvent(
            table_schema=schema_str,
            query_str=ev.query_str
        )
    
    @step
    async def text2sql_prompt_step(self, ev: SchemaProcessedEvent | SQLResultsEvent) -> SQLPromptReadyEvent | None:
        """Create the text-to-SQL prompt with optional error correction"""
        
        # Handle both initial attempt and retry attempts
        if isinstance(ev, SchemaProcessedEvent):
            table_schema = ev.table_schema
            query_str = ev.query_str
            retry_count = 0
            error_message = ""
        else:  # SQLResultsEvent
            # Every step that accepts SQLResultsEvent receives it, so ignore
            # successful results and final failures (no retry_count attached)
            if ev.success or not hasattr(ev, 'retry_count'):
                return None
            table_schema = getattr(ev, 'table_schema', '')
            query_str = ev.query_str
            # sql_retriever_step has already incremented retry_count
            retry_count = ev.retry_count
            error_message = getattr(ev, 'error_message', '')
        
        prompt = create_enhanced_prompt(table_schema, query_str, retry_count, error_message)
        
        return SQLPromptReadyEvent(
            t2s_prompt=prompt,
            query_str=query_str,
            table_schema=table_schema,
            retry_count=retry_count,
            error_message=error_message
        )
    
    @step
    async def text2sql_llm_step(self, ev: SQLPromptReadyEvent) -> SQLGeneratedEvent:
        """Generate SQL query using LLM"""
        # You'll need to configure Settings.llm
        sql_response = await Settings.llm.acomplete(ev.t2s_prompt)
        
        return SQLGeneratedEvent(
            sql_query=str(sql_response).strip(),
            query_str=ev.query_str,
            table_schema=ev.table_schema,
            retry_count=ev.retry_count,
            error_message=ev.error_message
        )
    
    @step
    async def sql_output_parser_step(self, ev: SQLGeneratedEvent) -> SQLParsedEvent:
        """Parse and clean the generated SQL query"""
        # Extract clean SQL from the LLM response
        clean_sql = extract_sql_from_response(ev.sql_query)
        
        print(f"Attempt #{ev.retry_count + 1}")
        print(f"Original LLM Response: {ev.sql_query}")
        print(f"Cleaned SQL Query: {clean_sql}")
        
        # Validate that we have a reasonable SQL query
        if not clean_sql or clean_sql == "SELECT 1":
            print("Warning: Could not extract valid SQL, using fallback")
        
        return SQLParsedEvent(
            sql_query=clean_sql,
            query_str=ev.query_str,
            table_schema=ev.table_schema,
            retry_count=ev.retry_count,
            error_message=ev.error_message
        )
    
    @step
    async def sql_retriever_step(self, ev: SQLParsedEvent) -> SQLResultsEvent:
        """Execute SQL query and get results with retry logic"""
        max_retries = 3
        
        try:
            # You'll need to define sql_retriever
            results = sql_retriever.retrieve(ev.sql_query)
            
            print(f"[SUCCESS] SQL executed successfully on attempt #{ev.retry_count + 1}")
            return SQLResultsEvent(
                context_str=str(results),
                sql_query=ev.sql_query,
                query_str=ev.query_str,
                success=True
            )
        except Exception as e:
            error_msg = str(e)
            print(f"[ERROR] SQL Execution Error (Attempt #{ev.retry_count + 1}): {error_msg}")
            print(f"Failed SQL Query: {ev.sql_query}")
            
            # Check if we should retry
            if ev.retry_count < max_retries:
                print(f"[RETRY] Retrying... (Attempt #{ev.retry_count + 2}/{max_retries + 1})")
                
                # Create a new event that will trigger a retry
                error_analysis = analyze_sql_error(error_msg, ev.sql_query, ev.table_schema)
                
                # Return an SQLResultsEvent that will trigger a retry
                retry_event = SQLResultsEvent(
                    context_str="",
                    sql_query=ev.sql_query,
                    query_str=ev.query_str,
                    success=False
                )
                retry_event.retry_count = ev.retry_count + 1
                retry_event.error_message = error_analysis
                retry_event.table_schema = ev.table_schema
                
                return retry_event
            else:
                print(f"[ERROR, RETRY FAILED] Max retries ({max_retries}) reached. Giving up.")
                return SQLResultsEvent(
                    context_str=f"Failed to execute SQL after {max_retries + 1} attempts. Final error: {error_msg}",
                    sql_query=ev.sql_query,
                    query_str=ev.query_str,
                    success=False
                )
    
    @step
    async def retry_handler_step(self, ev: SQLResultsEvent) -> StopEvent | None:
        """End the workflow once retries are exhausted.

        Retryable failures are re-prompted by text2sql_prompt_step, which also
        accepts SQLResultsEvent. Emitting a SQLPromptReadyEvent here as well
        would send a second, empty prompt straight to text2sql_llm_step.
        """
        if ev.success:
            return None  # Handled by response_synthesis_prompt_step
        
        if hasattr(ev, 'retry_count'):
            print(f"[RETRY] Preparing retry #{ev.retry_count}")
            return None  # Handled by text2sql_prompt_step
        
        # Final failure: context_str carries the last error message
        return StopEvent(result=ev.context_str)
    
    @step
    async def response_synthesis_prompt_step(self, ev: SQLResultsEvent) -> ResponsePromptReadyEvent | None:
        """Create the response synthesis prompt - only for successful SQL results"""
        # Only process successful SQL results
        if not ev.success:
            return None
            
        # You'll need to define response_synthesis_prompt template
        prompt = response_synthesis_prompt.format(
            query_str=ev.query_str,
            context_str=ev.context_str,
            sql_query=ev.sql_query
        )
        
        return ResponsePromptReadyEvent(rs_prompt=prompt)
    
    @step
    async def response_synthesis_llm_step(self, ev: ResponsePromptReadyEvent) -> StopEvent:
        """Generate final answer using LLM"""
        answer = await Settings.llm.acomplete(ev.rs_prompt)
        
        return StopEvent(result=str(answer))


async def run_text2sql_workflow(query: str):
    workflow = Text2SQLWorkflow(timeout=240)
    result = await workflow.run(query=query)
    return result
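
Stripped of the event plumbing, the retry behaviour above amounts to a bounded loop that feeds the previous error back into the next generation attempt. The following is a minimal synchronous sketch of that idea, not the workflow itself: `fake_llm` is a hypothetical stand-in for `Settings.llm`, and the `demo` table and its row are invented for the example.

```python
import sqlite3

MAX_RETRIES = 3  # same bound as max_retries in sql_retriever_step

# Invented table and row, for illustration only
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (name TEXT, year_signed INTEGER)")
conn.execute("INSERT INTO demo VALUES ('Artist A', 2001)")


def fake_llm(question: str, error_message: str = "") -> str:
    """Stand-in for the LLM: wrong column first, corrected once it sees an error."""
    if not error_message:
        return "SELECT year FROM demo WHERE name = 'Artist A'"
    return "SELECT year_signed FROM demo WHERE name = 'Artist A'"


def run_with_retries(question: str):
    """Return (rows, attempts_used) or raise after MAX_RETRIES + 1 attempts."""
    error_message = ""
    for attempt in range(MAX_RETRIES + 1):
        sql = fake_llm(question, error_message)
        try:
            return conn.execute(sql).fetchall(), attempt + 1
        except sqlite3.Error as exc:
            error_message = str(exc)  # fed into the next prompt
    raise RuntimeError(f"Failed after {MAX_RETRIES + 1} attempts: {error_message}")


rows, attempts = run_with_retries("When was Artist A signed?")
print(rows, attempts)
```

The loop makes the attempt budget explicit: one initial try plus at most `MAX_RETRIES` corrections, which is what the `{max_retries + 1}` in the log messages refers to.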

### Visualize Workflow

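Workflows can be visualized as a graph. The first cell below is a commented-out version that builds the graph by hand with networkx; the cell after it uses `draw_all_possible_flows` and `draw_most_recent_execution` instead.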

In [None]:
# # Build a directed graph of steps
# G = nx.DiGraph()

# # Nodes
# steps = [
#     "input",
#     "table_retriever",
#     "table_output_parser",
#     "text2sql_prompt",
#     "text2sql_llm",
#     "sql_output_parser",
#     "sql_retriever",
#     "response_synthesis_prompt",
#     "response_synthesis_llm"
# ]
# G.add_nodes_from(steps)

# # Edges
# edges = [
#     ("input", "table_retriever"),
#     ("table_retriever", "table_output_parser"),
    
#     ("input", "text2sql_prompt"),
#     ("table_output_parser", "text2sql_prompt"),

#     ("text2sql_prompt", "text2sql_llm"),
#     ("text2sql_llm", "sql_output_parser"),
#     ("sql_output_parser", "sql_retriever"),
    
#     ("sql_output_parser", "response_synthesis_prompt"),
#     ("sql_retriever", "response_synthesis_prompt"),
#     ("input", "response_synthesis_prompt"),
    
#     ("response_synthesis_prompt", "response_synthesis_llm")
# ]
# G.add_edges_from(edges)

# # Visualize
# net = Network(notebook=True, cdn_resources="in_line", directed=True)
# net.from_nx(G)

# html_content = net.generate_html()
# with open("../outputs/trials_v1/text2sql_dag.html", "w", encoding="utf-8") as f:
#     f.write(html_content)

# print("Saved text2sql_dag.html successfully.")

Saved text2sql_dag.html successfully.


In [66]:
async def visualize_text2sql_workflow():
    """
    Function to visualize the Text2SQL workflow both as all possible flows
    and a specific execution example
    """
    output_dir = "../outputs/trials_v1"
    os.makedirs(output_dir, exist_ok=True)
    
    # 1. Draw ALL possible flows through your workflow
    print("Drawing all possible flows...")
    all_flows_path = os.path.join(output_dir, "text2sql_workflow_all_flows.html")
    draw_all_possible_flows(
        Text2SQLWorkflow, 
        filename=all_flows_path
    )
    print(f"[SUCCESS] All possible flows saved to: {all_flows_path}")

    # 2. Draw a specific execution to see the actual path taken
    print("Running workflow and drawing execution path...")
    
    # Create workflow instance
    workflow = Text2SQLWorkflow(timeout=240)
    
    # Run with a sample query
    sample_query = "What are the top 5 customers by total orders?"
    
    try:
        # Execute the workflow
        result = await workflow.run(query=sample_query)
        
        # Draw the execution path
        execution_path = os.path.join(output_dir, "text2sql_workflow_recent_execution.html")
        draw_most_recent_execution(
            workflow,
            filename=execution_path
        )
        print(f"[SUCCESS] Recent execution path saved to: {execution_path}")
        print(f"Workflow result: {result}")
        
    except Exception as e:
        print(f"[ERROR] Error during workflow execution: {e}")
        print("Note: You may need to set up your retriever and LLM settings first")

# Alternative: Just visualize all flows without execution
def visualize_workflow_structure_only():
    """
    Just visualize the workflow structure without executing it
    """
    output_dir = "../outputs/trials_v1"
    os.makedirs(output_dir, exist_ok=True)
    
    structure_path = os.path.join(output_dir, "fixed_text2sql_workflow_structure.html")
    print("Drawing workflow structure...")
    draw_all_possible_flows(
        Text2SQLWorkflow,
        filename=structure_path
    )
    print(f"[SUCCESS] Workflow structure saved to: {structure_path}")


# Option 1: Just structure
visualize_workflow_structure_only()

# Option 2: Full visualization with execution
# (use await in a notebook; asyncio.run() fails inside an already running event loop)
# await visualize_text2sql_workflow()

Drawing workflow structure...
../outputs/trials_v1\fixed_text2sql_workflow_structure.html
[SUCCESS] Workflow structure saved to: ../outputs/trials_v1\fixed_text2sql_workflow_structure.html


### Run Some Queries!

Now we're ready to run some queries across this entire pipeline.

In [76]:
tables = obj_retriever.retrieve("What was the year that The Notorious B.I.G was signed to Bad Boy?")
for table in tables:
    print(f"Table: {table}, Type: {type(table)}")

Table: table_name='people_info' context_str="Summary of information about artists' years of signing and album releases", Type: <class 'llama_index.core.objects.table_node_mapping.SQLTableSchema'>
Table: table_name='grammy_awards' context_str='Summary of Grammy Award data', Type: <class 'llama_index.core.objects.table_node_mapping.SQLTableSchema'>
Table: table_name='drop_event_data' context_str='Summary of historical drop event data over time', Type: <class 'llama_index.core.objects.table_node_mapping.SQLTableSchema'>
Table: table_name='award_data_1972' context_str='Summary of awards in 1972', Type: <class 'llama_index.core.objects.table_node_mapping.SQLTableSchema'>
Table: table_name='movie_data' context_str='Summary of movie data', Type: <class 'llama_index.core.objects.table_node_mapping.SQLTableSchema'>


In [67]:
result = await run_text2sql_workflow("What was the year that The Notorious B.I.G was signed to Bad Boy?")
print(result)

Attempt #1
Original LLM Response: <think>
Okay, let's see. The user is asking for the year that The Notorious B.I.G was signed to Bad Boy. So first, I need to figure out which tables can provide this information.

The table 'people_info' has columns Act, Year_signed, and others. The user is asking about the year of signing, so maybe that's the relevant column. The 'grammy_awards' table has Year, but that's for Grammy awards. The question is about the year the artist was signed, not their awards. So probably the people_info table is the right one.

Looking at the schema, the people_info table has Year_signed as a column. So the query should select Year_signed from people_info where Act is 'The Notorious B.I.G'. But wait, does the people_info table have the artist's name stored? The problem says the user question is about the year the artist was signed, so the Act column would contain the artist's name. So the SQL query would be SELECT Year_signed FROM people_info WHERE Act = 'The Notori

In [75]:
tables = obj_retriever.retrieve("Who won best director in the 1972 academy awards?")
for table in tables:
    print(f"Table: {table}, Type: {type(table)}")

Table: table_name='award_nominations' context_str='Summary of award data', Type: <class 'llama_index.core.objects.table_node_mapping.SQLTableSchema'>
Table: table_name='award_data_1972' context_str='Summary of awards in 1972', Type: <class 'llama_index.core.objects.table_node_mapping.SQLTableSchema'>
Table: table_name='grammy_awards' context_str='Summary of Grammy Award data', Type: <class 'llama_index.core.objects.table_node_mapping.SQLTableSchema'>
Table: table_name='award_data' context_str='Summary of awards data across categories and years', Type: <class 'llama_index.core.objects.table_node_mapping.SQLTableSchema'>
Table: table_name='movie_data' context_str='Summary of movie data', Type: <class 'llama_index.core.objects.table_node_mapping.SQLTableSchema'>


In [71]:
response_1 = await run_text2sql_workflow("Who won best director in the 1972 academy awards?")
print(response_1)

Attempt #1
Original LLM Response: <think>
Okay, let's see. The user wants to know who won best director in the 1972 academy awards. First, I need to figure out which tables contain the relevant information.

Looking at the tables provided: there's 'award_data_1972' which has columns like Award, Category, Nominee, Result. The user is asking about the 1972 academy awards, so maybe the 'award_data_1972' table is the right one. The 'award_data' table has Year, Award, Category, Nominated_work, Result. But the user is asking about best director, which might be a specific category. Wait, the 'award_data_1972' has 'Category' as a column. So, if the category is 'Best Director', then we can join with 'award_data_1972' to find the nominee.

But the user's question is about the 1972 academy awards, which probably refers to the 'award_data_1972' table. So the SQL query should select the Nominee from that table where the Category matches 'Best Director' and the Year is 1972. Also, need to check if t

In [74]:
tables = obj_retriever.retrieve("What was the term of Pasquale Preziosa?")
for table in tables:
    print(f"Table: {table}, Type: {type(table)}")

Table: table_name='government_ministers' context_str='Summary of historical government ministers', Type: <class 'llama_index.core.objects.table_node_mapping.SQLTableSchema'>
Table: table_name='new_municipality_old_municipality_seat' context_str='This table shows entries with the same values in two columns, but the third is unique.', Type: <class 'llama_index.core.objects.table_node_mapping.SQLTableSchema'>
Table: table_name='people_terms' context_str='Summary of individual term data', Type: <class 'llama_index.core.objects.table_node_mapping.SQLTableSchema'>
Table: table_name='award_data_1972' context_str='Summary of awards in 1972', Type: <class 'llama_index.core.objects.table_node_mapping.SQLTableSchema'>
Table: table_name='people_info' context_str="Summary of information about artists' years of signing and album releases", Type: <class 'llama_index.core.objects.table_node_mapping.SQLTableSchema'>


In [73]:
response_2 = await run_text2sql_workflow("What was the term of Pasquale Preziosa?")
print(response_2)

Attempt #1
Original LLM Response: <think>
Okay, let's see. The user is asking for the term of Pasquale Preziosa. First, I need to figure out if there's a table that contains information about people, their terms, and maybe some connections.

Looking at the tables provided: there's 'people_terms' which has Name, Term_start, Term_end. The question is about Pasquale Preziosa's term. So I need to check if there's a way to link Pasquale Preziosa to the 'people_terms' table.

The problem mentions that the 'people_terms' table might be connected to other tables. Wait, the user's question doesn't specify any other tables. The only tables are 'government_ministers', 'new_municipality_old_municipality_seat', 'people_terms', 'award_data_1972', and 'people_info'. 

But the 'people_terms' table has Name, Term_start, Term_end. If Pasquale Preziosa is a person in 'people_terms', then her term would be in that table. So I need to check if there's a way to link her to the 'people_terms' table. But the 

In [77]:
tables = obj_retriever.retrieve("Show me total sales by region")
for table in tables:
    print(f"Table: {table}, Type: {type(table)}")

Table: table_name='chart_positions' context_str='Summary of music chart data across countries', Type: <class 'llama_index.core.objects.table_node_mapping.SQLTableSchema'>
Table: table_name='district_info' context_str='Summary of district data', Type: <class 'llama_index.core.objects.table_node_mapping.SQLTableSchema'>
Table: table_name='movie_chart_positions' context_str='Summary of movie chart positions across different countries', Type: <class 'llama_index.core.objects.table_node_mapping.SQLTableSchema'>
Table: table_name='broadcasting_info' context_str='Summary of broadcasting data', Type: <class 'llama_index.core.objects.table_node_mapping.SQLTableSchema'>
Table: table_name='bbc_radio_costs' context_str='Summary of BBC Radio service costs compared to 2011.', Type: <class 'llama_index.core.objects.table_node_mapping.SQLTableSchema'>


In [78]:
response_3 = await run_text2sql_workflow("Show me total sales by region")
print(response_3)

Attempt #1
Original LLM Response: <think>
Okay, let me try to figure out how to answer the user's question. The user wants to show total sales by region. Looking at the tables provided, there's a table called 'chart_positions' which has columns related to chart positions in different countries. Another table is 'Certifications_sales_thresholds_'. Hmm, wait, the user mentioned 'total sales by region', and the chart_positions table has fields like Peak_chart_positions_US (VARCHAR), but I don't see a column for sales yet.

Wait, the user's question is about total sales, so maybe there's a connection to the 'Certifications_sales_thresholds_'. Let me check the schema again. Oh, right, the 'chart_positions' table has a column named Certifications_sales_thresholds_ (INTEGER). But how does that relate to total sales? Maybe each entry in the 'chart_positions' table corresponds to a certification, and the sales threshold is a value. But how to aggregate that into total sales by region?

Wait, pe

### Final Score: 2/4

**Problem:** Column names aren't getting specified properly.
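
One way to investigate this, sketched here as an assumption rather than something this notebook does, is to check the identifiers in a generated query against the table's actual columns before executing it. SQLite exposes column names through `PRAGMA table_info`. The table below and the `unknown_columns` helper are hypothetical, and the tokenizer is deliberately crude: it ignores aliases, function names, and quoted identifiers.

```python
import re
import sqlite3

# Stand-in table for illustration only
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people_info (Act TEXT, Year_signed INTEGER)")

# Small, incomplete keyword list; extend as needed
SQL_KEYWORDS = {
    "select", "from", "where", "and", "or", "order", "by",
    "group", "limit", "as", "desc", "asc",
}


def table_columns(table: str) -> set:
    """Lowercased column names; PRAGMA table_info puts the name at index 1."""
    return {row[1].lower() for row in conn.execute(f"PRAGMA table_info({table})")}


def unknown_columns(sql: str, table: str) -> list:
    """Return identifiers in sql that are neither keywords, the table, nor its columns."""
    no_strings = re.sub(r"'[^']*'", "", sql)  # drop string literals first
    tokens = re.findall(r"[A-Za-z_]\w*", no_strings)
    known = table_columns(table) | {table.lower()} | SQL_KEYWORDS
    return [t for t in tokens if t.lower() not in known]


print(unknown_columns("SELECT Year FROM people_info WHERE Act = 'X'", "people_info"))
print(unknown_columns("SELECT Year_signed FROM people_info WHERE Act = 'X'", "people_info"))
```

A non-empty result could be passed back as the error message for a retry without spending a database round trip on a query that is known to fail.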