# Generating Synthetic Training Data for Semantic Parsers

Even in the age of pre-trained Large Language Models (LLMs), we still need to train or fine-tune models for natural language tasks, to make them better at specific tasks and/or because we want to use a smaller model that will incur lower costs. One approach that has become popular is to use a very large and capable LLM to take "naturally occurring" text and annotate it with whatever output we want to train the smaller model to produce. However, I think that it may often be more effective to do the reverse.

I will illustrate what I mean using the task of semantic parsing: mapping natural language to a representation that a machine can execute. To produce semantic parsing data, I suggest we:

1. **Generate** a wide variety of machine-executable commands (annotations).
2. **Synthesize** natural language questions corresponding to those commands.

This idea is heavily inspired by the methodology in the paper "[Building a Semantic Parser Overnight](https://aclanthology.org/P15-1129/)" by Wang, Berant, and Liang. While the original paper used crowdworkers to paraphrase stilted, automatically generated descriptions into natural language, we can now leverage LLMs for this step, allowing for massive scalability.

Why do I think this is better? As you will see below, it has the following advantages:

- LLMs are built to generate natural language. Language is also inherently less precision-oriented than anything machine readable, which makes this task easier for the LLM.
- It is far easier to make sure that you cover all possible cases. With user logs you can get stuck in a cycle: users stop asking a certain type of question because the current system misunderstands it, and the system never improves because there are no such queries to train on. Possible gold outputs, by contrast, can generally be enumerated.
- It will be easier for a human to verify the resulting data if they can assume the code is correct.

## A Toy Example: A Restaurant Booking API

Let's demonstrate this with a simple restaurant booking API. We'll imagine our API has the following capabilities:

* **`find_restaurant`**: Find restaurants by cuisine, rating, or both
* **`sort_restaurants`**: Sort restaurants from best to worst rating
* **`make_booking`**: Book a table at a specific restaurant for a given time

First, we will define some Python commands.

In [None]:
# let's write down the actual Python API for concreteness
from datetime import datetime, timedelta

restaurants = [
    {
        "name": "The Pizza Place",
        "cuisine": "Italian",
        "rating": 4.5,
    },
    {
        "name": "Curry House",
        "cuisine": "Indian",
        "rating": 4.2,
    },
    {
        "name": "Taco Town",
        "cuisine": "Mexican",
        "rating": 3.8,
    },
    {
        "name": "Playa Bar",
        "cuisine": "Mexican",
        "rating": 4.7,
    },
    {
        "name": "Spice Merchant",
        "cuisine": "Indian",
        "rating": 4.1,
    },
]

class Booking:
    def __init__(self, restaurant, time):
        self.restaurant = restaurant
        self.time = time

bookings = []


def find_restaurant(cuisine=None, rating_gt=None):
    """Return the restaurants that match the optional cuisine and rating filters."""
    results = []
    for r in restaurants:
        if cuisine is not None and r["cuisine"] != cuisine:
            continue
        # strictly greater than; `is not None` keeps rating_gt=0 usable as a filter
        if rating_gt is not None and r["rating"] <= rating_gt:
            continue
        results.append(r)
    return results

def make_booking(restaurant, time):
    bookings.append(Booking(restaurant, time))

def sort_restaurants(restaurants):
    return sorted(restaurants, key=lambda r: r["rating"], reverse=True)

It seems simplistic, but this interface already allows for a lot of complexity through **compositionality**, e.g., the equivalent of "Make a booking for tomorrow at the best-rated Indian restaurant".

A system would need to parse this into a sequence of operations:
1.  **Find all Indian restaurants:**
    `find_restaurant(cuisine="Indian")`
2.  **Sort them by rating to find the best one:**
    `sort_restaurants(restaurants=results_from_step_1)`
3.  **Take the top result and book it:**
    `make_booking(restaurant=best_restaurant_from_step_2, time="tomorrow")`

As a single, nested Python call, this composition looks like this:

In [2]:
make_booking(sort_restaurants(find_restaurant(cuisine="Indian"))[0], datetime.now() + timedelta(days=1))

The complexity we've seen justifies building a natural language interface. To do this effectively, we need to train a fast, specialized semantic parser and so we need to create data.

This is where our core strategy comes into play. Instead of waiting for user queries to annotate, we will **generate the machine-readable commands first** and then synthesize the corresponding natural language. The structured commands we want to generate follow predictable patterns and can be systematically enumerated, making this generation easy.
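
Because the commands follow predictable patterns, they can also be checked mechanically before they are paired with any text. The following is a minimal sketch of such a check, not part of the original pipeline: the whitelists and the helper name `is_valid_command` are assumptions made for illustration, based on the three API functions and the `datetime`/`timedelta` expressions used in this post.

```python
import ast

# Hypothetical whitelists, derived from the toy API in this post
ALLOWED_FUNCTIONS = {"find_restaurant", "sort_restaurants", "make_booking", "timedelta"}
ALLOWED_METHODS = {"now"}  # for datetime.now()
ALLOWED_NAMES = ALLOWED_FUNCTIONS | {"datetime"}

def is_valid_command(code: str) -> bool:
    """Return True if `code` is a single expression that only uses whitelisted calls and names."""
    try:
        tree = ast.parse(code, mode="eval")  # mode="eval" rejects statements
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):
                allowed = func.id in ALLOWED_FUNCTIONS
            elif isinstance(func, ast.Attribute):
                allowed = func.attr in ALLOWED_METHODS
            else:
                allowed = False
            if not allowed:
                return False
        elif isinstance(node, ast.Name) and node.id not in ALLOWED_NAMES:
            return False
    return True

print(is_valid_command('find_restaurant(cuisine="Indian")'))  # True
print(is_valid_command('delete_all_bookings()'))              # False
```

The check only inspects syntax and never executes the string, so it says nothing about whether the arguments make sense; it is a cheap first filter rather than a full validator.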

Let's start by writing a simple generator to produce a variety of valid API calls. For each call, we'll also generate a canonical, "stilted" description of what it does. This description will serve as the basis for our later LLM paraphrasing step.

In [3]:
# we use random number generation to get variance in what we generate
import random
random.seed(235711)

cuisines = [
    "Italian",
    "Indian",
    "Mexican"
] # could be a random draw from all cuisines for all restaurants in the database

# we will produce two strings, one for the instructions, one for the "natural language" description
def make_find_restaurant_expression() -> tuple[str,str]:
  code = "find_restaurant("
  description = "find restaurants"

  if random.uniform(0,1) > 0.5: # we also want some simple expressions
    cuisine_code, cuisine_text = make_cuisine_expression()
    code += f"cuisine={cuisine_code},"
    description += f" that has cuisine {cuisine_text}"
  if random.uniform(0,1) > 0.5: # we also want some simple expressions
    rating_gt_code, rating_gt_text = make_rating_gt_expression()
    code += f"rating_gt={rating_gt_code}"
    description += f" with a rating above {rating_gt_text}"
  code += ")"
  return code, description

def make_cuisine_expression() -> tuple[str,str]:
  cuisine = random.choice(cuisines)
  return f"\"{cuisine}\"", f"{cuisine}"

def make_rating_gt_expression() -> tuple[str,str]:
  # it may be a bit weird for someone to ask for a rating of 1 or better, but we can just ensure coverage
  rating = random.choice(range(1,5))
  return f"{rating}", f" at least {rating} rating"

def make_sort_restaurants_expression() -> tuple[str,str]:
  # pick how many of the top results to keep (1 to 9)
  num_to_pick = random.choice(range(1,10))
  restaurants_expression, restaurants_description = make_find_restaurant_expression()
  code = f"sort_restaurants({restaurants_expression})[:{num_to_pick}]"
  description = f"the best {num_to_pick} {restaurants_description}"
  return code, description

def make_make_booking_expression() -> tuple[str,str]:
  restaurant_expression, restaurant_description = make_find_restaurant_expression()
  time_expression, time_description = make_time_expression()
  code = f"make_booking(restaurant={restaurant_expression}, time={time_expression})"
  description = f"make a booking for {time_description} at {restaurant_description}"
  return code, description

# we will be lazy for time and only cover today, tomorrow and one week from now
time_exp = [
    ("datetime.now()", "today"),
    ("datetime.now() + timedelta(days=1)", "tomorrow"),
    ("datetime.now() + timedelta(days=7)", "one week from now")
]

def make_time_expression() -> tuple[str,str]:
  # TODO :) : replace with logic to pick a time and then prompt an LLM to describe it
  time, description = random.choice(time_exp)
  return time, description

# look at some examples:
for _ in range(5):
  print(make_find_restaurant_expression())
for _ in range(5):
  print(make_make_booking_expression())
for _ in range(5):
  print(make_sort_restaurants_expression())

('find_restaurant(cuisine="Italian",rating_gt=2)', 'find restaurants that has cuisine Italian with a rating above  at least 2 rating')
('find_restaurant(cuisine="Indian",)', 'find restaurants that has cuisine Indian')
('find_restaurant()', 'find restaurants')
('find_restaurant(cuisine="Indian",rating_gt=2)', 'find restaurants that has cuisine Indian with a rating above  at least 2 rating')
('find_restaurant(rating_gt=4)', 'find restaurants with a rating above  at least 4 rating')
('make_booking(restaurant=find_restaurant(cuisine="Mexican",), time=datetime.now() + timedelta(days=1))', 'make a booking for tomorrow at find restaurants that has cuisine Mexican')
('make_booking(restaurant=find_restaurant(cuisine="Italian",), time=datetime.now() + timedelta(days=1))', 'make a booking for tomorrow at find restaurants that has cuisine Italian')
('make_booking(restaurant=find_restaurant(rating_gt=1), time=datetime.now() + timedelta(days=7))', 'make a booking for one week from now at find restau

This programmatic approach immediately highlights a key advantage: **total control over data coverage and distribution**, so you're not limited to what people *currently* ask for, which often creates blind spots.

For example, less common cuisines might rarely appear, preventing the model from ever learning them. By generating the structured commands first, we can systematically ensure that every parameter, function, and complex composition is well-represented in our training set.
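
When the parameter space is small, coverage does not have to be left to random sampling at all. The sketch below enumerates every cuisine/rating combination for `find_restaurant` with `itertools.product` and then counts which argument patterns occur. It is an illustration under assumptions: `CUISINES`, `RATINGS`, and `enumerate_find_restaurant_calls` are hypothetical names, and the description wording is simplified compared with the random generator in this post.

```python
from collections import Counter
from itertools import product

# Assumed parameter values mirroring the toy generator; None means "omit this argument"
CUISINES = [None, "Italian", "Indian", "Mexican"]
RATINGS = [None, 1, 2, 3, 4]

def enumerate_find_restaurant_calls():
    """Yield a (code, description) pair for every cuisine/rating combination."""
    for cuisine, rating in product(CUISINES, RATINGS):
        args = []
        description = "find restaurants"
        if cuisine is not None:
            args.append(f'cuisine="{cuisine}"')
            description += f" that has cuisine {cuisine}"
        if rating is not None:
            args.append(f"rating_gt={rating}")
            description += f" with a rating above {rating}"
        yield f"find_restaurant({', '.join(args)})", description

calls = list(enumerate_find_restaurant_calls())
print(len(calls))  # 20 = 4 cuisine options x 5 rating options

# Count how often each (has cuisine, has rating) pattern occurs
coverage = Counter(("cuisine=" in code, "rating_gt=" in code) for code, _ in calls)
print(coverage)
```

For larger APIs the full product quickly becomes too big, so a reasonable compromise is to enumerate the individual parameter values exhaustively and sample the compositions.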

Of course, the descriptions we generated (e.g., `find restaurants that has cuisine Indian with a rating above  at least 2 rating`) are stilted and robotic. This brings us to the final, crucial step: using an LLM to paraphrase these canonical descriptions into a diverse set of natural-sounding user queries. Fortunately, this task plays directly to the primary strength of modern LLMs.

In [None]:
from vllm import LLM
# We can use a smaller, non-reasoning model. You still need a decent GPU, or you need to use a cloud model.
# This will print a lot of initialization information when you first run the cell.
# Running this cell multiple times may make vLLM reserve too much GPU memory, which will lead to errors.
llm = LLM(model="HuggingFaceTB/SmolLM2-1.7B-Instruct")

In [None]:
from vllm import SamplingParams
# Sampling parameters for text generation
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=1000)

PROMPT_TEMPLATE = """You are a helpful assistant that translates code and a stilted description into natural language requests for a restaurant booking API.
You will be given pairs of code and a stilted description, and your task is to generate a semicolon-separated list of natural language requests that correspond to the given code and description.
Only output the semicolon-separated list of natural language requests and nothing else. Here are some examples:
Code: find_restaurant(cuisine="Italian",rating_gt=2)
Description: find restaurants that has cuisine Italian with a rating above  at least 2 rating
Requests: Show me Italian restaurants with a rating over 2 ; Find Italian restaurants with a rating above 2 ; Are there any Italian places rated higher than 2 stars?

Code: find_restaurant()
Description: find restaurants
Requests: Show me all restaurants ; Find restaurants ; List all restaurants

Code: sort_restaurants(find_restaurant(cuisine="Indian", rating_gt=3))[:3]
Description: the best 3 restaurants that has cuisine Indian with a rating above  at least 3 rating
Requests: What are the best 3 Indian restaurants with a rating above 3?; Show me the 3 best Indian restaurants with at least a 3 rating

Code: make_booking(restaurant=find_restaurant(cuisine="Mexican",), time=datetime.now() + timedelta(days=1))
Description: make a booking for tomorrow at find restaurants that has cuisine Mexican
Requests: Please make a booking for tomorrow at a Mexican restaurant; I'd like to make a reservation for tomorrow at a Mexican restaurant

Now, turn the following code and description into natural language requests:
Code: {code}
Description: {description}
Requests:"""

def naturalize_expression(codes, descriptions):
    """Uses the LLM to turn the code and description into a natural language question using a chat template."""
    prompts = []
    for code, description in zip(codes, descriptions):
        message = [
            {"role": "user", "content": PROMPT_TEMPLATE.format(description=description, code=code)}
        ]
        prompt = llm.get_tokenizer().apply_chat_template(
            message, tokenize=False, add_generation_prompt=True
        )
        prompts.append(prompt)
        
    # Send all prompts in one batched call (calling generate twice would only repeat the work)
    outputs = llm.generate(prompts, sampling_params)
    # Each prompt yields one completion with the default n=1, so we take outputs[0]
    return [output.outputs[0].text.strip() for output in outputs]

# Test with an example; you can rerun this cell to see different examples.
# vLLM is very fast at processing large batches of requests simultaneously.
# To build a real dataset, you'd want to generate hundreds of prompts and send them all in a single batch.
code, description = make_find_restaurant_expression()
natural_question = naturalize_expression([code], [description])
print(f"Code: {code}")
print(f"Description: {description}")
print(f"Natural Questions: {natural_question[0]}")


Code: find_restaurant(rating_gt=1)
Description: find restaurants with a rating above  at least 1 rating
Natural Questions: Show me restaurants with a rating above 1 ; Find restaurants with a rating above 1 ; Are there any restaurants rated higher than 1 stars?
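
The model returns all paraphrases in one semicolon-separated string, so it still has to be split into individual training pairs. A minimal sketch of that post-processing step follows; the helper names `split_requests` and `to_training_pairs` are hypothetical, and the approach assumes that no single request contains a semicolon of its own.

```python
def split_requests(raw_output: str) -> list[str]:
    """Split a semicolon-separated LLM response into cleaned, de-duplicated requests."""
    seen = set()
    requests = []
    for piece in raw_output.split(";"):
        request = " ".join(piece.split())  # collapse whitespace and strip both ends
        key = request.lower()              # treat case variants as duplicates
        if request and key not in seen:
            seen.add(key)
            requests.append(request)
    return requests

def to_training_pairs(code: str, raw_output: str) -> list[tuple[str, str]]:
    """Pair every request with the code it was generated from."""
    return [(request, code) for request in split_requests(raw_output)]

# The raw string is the model output shown above
raw = ("Show me restaurants with a rating above 1 ; Find restaurants with a rating above 1 ; "
       "Are there any restaurants rated higher than 1 stars?")
pairs = to_training_pairs("find_restaurant(rating_gt=1)", raw)
for question, code in pairs:
    print(f"{question!r} -> {code}")
```

In practice it is also worth spot-checking the output, since a small model may occasionally ignore the format or drift from the meaning of the code.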





## Conclusion and Next Steps

You now have an end-to-end, scalable pipeline for generating training data for a semantic parser. By starting with the structured code and synthesizing natural language, you gain precise control over data coverage and can create a large dataset with little manual effort. The paraphrasing task is also simple enough that a relatively small LLM, such as the 1.7B-parameter model used here, can handle it.

The `(natural_question, code_string)` pairs can be used to:

1.  **Train a specialized encoder-decoder model**: The most direct next step is to fine-tune an encoder-decoder model like [T5](https://huggingface.co/docs/transformers/en/model_doc/t5) or BART. This can yield a small, fast parser that is well suited to a production environment.

2.  **Fine-tune a larger LLM**: The same data can be used to fine-tune a more powerful LLM. This would make the model an expert at your specific API, capable of handling even more complex nuances.

3.  **Create few-shot examples**: For the largest models, these high-quality pairs are perfect for creating few-shot prompts to guide the model's behavior at inference time without any fine-tuning at all.
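
Whichever option you choose, the pairs first need to be stored in a form that training code can read. The sketch below shows one possibility: a shuffled train/dev split serialized as JSON Lines. The field names `input` and `target`, the helper names, and the 10% dev fraction are assumptions for illustration, not requirements of any particular training library.

```python
import json
import random

def to_jsonl(pairs):
    """Serialize (natural_question, code_string) pairs, one JSON object per line."""
    return "\n".join(
        json.dumps({"input": question, "target": code}) for question, code in pairs
    )

def train_dev_split(pairs, dev_fraction=0.1, seed=0):
    """Shuffle a copy of the pairs and hold out a fraction (at least one pair) for evaluation."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]

example_pairs = [
    ("Show me all restaurants", "find_restaurant()"),
    ("Find Italian restaurants with a rating above 2", 'find_restaurant(cuisine="Italian",rating_gt=2)'),
    ("List all restaurants", "find_restaurant()"),
]
train, dev = train_dev_split(example_pairs)
print(to_jsonl(train))
```

A purely random split lets paraphrases of the same command land in both sets; if you want to measure generalization to unseen commands, splitting by `code_string` instead is a reasonable alternative.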

This "generate-first" methodology gives you a powerful toolkit for building robust natural language interfaces for any structured API.