In [1]:
%load_ext autoreload
%autoreload 2

# Markdown Table Manipulation Skills

Modern enterprises rely on structured data to drive decisions across operations, HR, product, and sales. But real-world data is rarely clean. Tables are often inconsistent, incomplete, or split across sources. Analysts and engineers spend countless hours fixing formatting issues, merging data, and applying business logic manually.

This project teaches a language model how to understand, clean, manipulate, and reason over markdown tables—turning messy or fragmented tabular inputs into clean, analysis-ready markdown outputs that can be dropped into dashboards, reports, or downstream systems.

We do this using InstructLab, by providing examples of real-world table tasks that require reasoning, formatting precision, and consistency.


These tasks develop a model’s capabilities in:
* Cleaning: Normalize inconsistent entries (e.g., “USA”, “U.S.”, “United States” → “US”)
* Filtering: Apply multi-column conditions (e.g., Progress < 60% and Budget < 100k)
* Computation: Derive new columns from formulas (e.g., Adjusted Revenue = Revenue × Multiplier)
* Joining: Merge data from multiple markdown tables using a shared key
* Classification: Infer labels like “Seniority” from unstructured title strings
* Standardization: Enforce markdown formatting, column consistency, and data integrity
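
To make the cleaning and filtering capabilities concrete, here is a minimal pure-Python sketch of the transformation the model is expected to perform on a markdown table. The helper names (`parse_markdown_table`, `render_markdown_table`), the alias map, and the sample table are illustrative assumptions and are not part of InstructLab. The parser only handles simple pipe-delimited tables without escaped pipes or empty edge cells.

```python
def parse_markdown_table(text):
    """Parse a simple pipe-delimited markdown table into (header, rows)."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    header = [c.strip() for c in lines[0].strip("|").split("|")]
    rows = []
    for ln in lines[2:]:  # skip the header row and the |---| separator row
        cells = [c.strip() for c in ln.strip("|").split("|")]
        rows.append(dict(zip(header, cells)))
    return header, rows


def render_markdown_table(header, rows):
    """Render a list of row dicts back into a markdown table string."""
    out = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        out.append("| " + " | ".join(str(row[h]) for h in header) + " |")
    return "\n".join(out)


# Hypothetical alias map for the "Cleaning" capability.
COUNTRY_ALIASES = {"usa": "US", "u.s.": "US", "united states": "US"}

# Made-up sample data for illustration only.
table = """
| Project | Country | Progress | Budget |
| --- | --- | --- | --- |
| Atlas | USA | 45% | 80000 |
| Borealis | United States | 75% | 50000 |
| Cascade | U.S. | 30% | 120000 |
"""

header, rows = parse_markdown_table(table)

# Cleaning: map every known spelling of the country to one canonical value.
for row in rows:
    row["Country"] = COUNTRY_ALIASES.get(row["Country"].lower(), row["Country"])

# Filtering: keep rows where Progress < 60% and Budget < 100k.
selected = [
    r for r in rows
    if int(r["Progress"].rstrip("%")) < 60 and int(r["Budget"]) < 100_000
]

result = render_markdown_table(header, selected)
print(result)
```

A trained model should produce the same kind of output directly from the instruction and the input table, without any helper code.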


Task examples include:
1. Applying Rules Across Columns

    Derive new columns by applying conditional logic to existing data. Examples include assigning statuses, flags, or labels based on thresholds, categories, or rule-based formulas.

2. Cleaning and Normalizing Tabular Data

    Standardize inconsistent entries such as location names, department labels, or text casing to ensure consistency across rows, which is essential for reliable analysis or joins.

3. Inferring Categorical Labels from Text

    Extract or classify values (e.g., seniority, department type, status) from semi-structured strings using pattern recognition or keyword-based inference.

4. Merging and Enriching Data Across Tables

    Perform relational joins using keys like ID or Region, and enhance the dataset by combining fields from multiple sources.

5. Retrieval and Filtering From the Table

    Retrieve specific rows or columns based on conditions or patterns, useful for ad-hoc queries or filtering out irrelevant data.
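
As a rough illustration of tasks 3 and 4, the sketch below infers a seniority label from job titles with keyword rules and enriches the rows with a left join on `Region`. It assumes the tables have already been parsed into lists of dicts. The rule set, the `infer_seniority` and `left_join` helpers, and the sample rows are all hypothetical and only show the shape of the reasoning the model has to learn.

```python
import re

# Hypothetical keyword rules, checked in order; the first match wins.
SENIORITY_RULES = [
    (r"\b(chief|vp|vice president|director|head)\b", "Executive"),
    (r"\b(senior|sr|lead|principal|staff)\b", "Senior"),
    (r"\b(junior|jr|intern|associate)\b", "Junior"),
]


def infer_seniority(title, default="Mid"):
    """Classify a free-text job title using simple keyword matching."""
    lowered = title.lower()
    for pattern, label in SENIORITY_RULES:
        if re.search(pattern, lowered):
            return label
    return default


def left_join(left, right, key, fill="N/A"):
    """Left-join two lists of row dicts on `key`, filling unmatched cells."""
    index = {row[key]: row for row in right}
    extra_cols = [c for c in right[0] if c != key]
    joined = []
    for row in left:
        match = index.get(row[key], {})
        merged = dict(row)
        for col in extra_cols:
            merged[col] = match.get(col, fill)
        joined.append(merged)
    return joined


# Made-up rows standing in for two parsed markdown tables.
employees = [
    {"ID": "1", "Title": "Sr. Data Engineer", "Region": "EMEA"},
    {"ID": "2", "Title": "Marketing Intern", "Region": "APAC"},
    {"ID": "3", "Title": "VP of Sales", "Region": "EMEA"},
    {"ID": "4", "Title": "Software Engineer", "Region": "LATAM"},
]
regions = [
    {"Region": "EMEA", "Manager": "Dana"},
    {"Region": "APAC", "Manager": "Kenji"},
]

enriched = left_join(employees, regions, key="Region")
for row in enriched:
    row["Seniority"] = infer_seniority(row["Title"])

for row in enriched:
    print(row)
```

Keyword rules like these are brittle on unusual titles, which is part of the motivation for teaching the skill to a model instead of hard-coding it.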


## 🧑‍🏫 Step 1: Set Up the Teacher Model

This demo expects an OpenAI-compatible endpoint. You can use your favorite inference server, such as vLLM, HFInferenceServer, or LlamaStack. For more details on how to set up an inference server using vLLM, please refer to the [README](README.md).

For this demo we will use Llama-3.3-70B-Instruct as our teacher model.

#### Let's test the connection

In [2]:
from openai import OpenAI

openai_api_key = "EMPTY" # replace with your inference server api key
openai_api_base = "http://0.0.0.0:8000/v1" # replace with your inference server endpoint


client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
teacher_model = models.data[0].id

# Test the connection with a simple completion
response = client.chat.completions.create(
    model=teacher_model,
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.0,
    max_tokens=10
)
completion = response.choices[0].message.content

print(f"Connection successful! {teacher_model}: {completion}")

Connection successful! meta-llama/Llama-3.3-70B-Instruct: Hello. How can I help you today?


## ✍️ Step 2: Provide Custom Examples

As outlined in the LAB paper, the first step is to provide a small number of **seed examples** (typically 5) to bootstrap the skill. These examples are passed into the generation pipeline as input and are stored in a `qna.yaml` file.

For this demo, we’ll use the pre-populated seed file located at: [table_manipulation_qna.yaml](seed_data/table_manipulation_qna.yaml)
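
Before feeding a seed file into the pipeline, it can help to sanity-check its structure. The sketch below works on the dict that `yaml.safe_load` would return and checks only the keys that this notebook reads (`task_description`, `seed_examples`, `question`, `answer`). The `validate_seed_data` helper and the sample dict are hypothetical, and the default of 5 examples simply mirrors the "typically 5" guidance above; the real `qna.yaml` schema may carry additional fields.

```python
REQUIRED_EXAMPLE_KEYS = ("question", "answer")


def validate_seed_data(data, expected_examples=5):
    """Return a list of problems found in a parsed qna.yaml dict (empty if none)."""
    problems = []

    task = data.get("task_description")
    if not isinstance(task, str) or not task.strip():
        problems.append("task_description is missing or empty")

    examples = data.get("seed_examples") or []
    if not examples:
        problems.append("seed_examples is missing or empty")
    elif len(examples) < expected_examples:
        problems.append(
            f"only {len(examples)} seed examples; {expected_examples} recommended"
        )

    # Every example needs a non-empty question and answer string.
    for i, example in enumerate(examples):
        for key in REQUIRED_EXAMPLE_KEYS:
            value = example.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"seed_examples[{i}].{key} is missing or empty")

    return problems


# Made-up stand-in for the result of yaml.safe_load(...).
sample = {
    "task_description": "Manipulate markdown tables.",
    "seed_examples": [
        {"question": "Q1", "answer": "A1"},
        {"question": "Q2", "answer": ""},
    ],
}

print(validate_seed_data(sample, expected_examples=2))
```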

Let's load the YAML file and convert it into a Hugging Face `Dataset`, which is used as input to bootstrap the skill.

In [3]:
import yaml
from datasets import Dataset

def convert_yaml_to_jsonl(yaml_path):
    # Load YAML file
    with open(yaml_path, 'r') as f:
        yaml_data = yaml.safe_load(f)
    
    # Extract examples into list of dicts
    examples = []
    for example in yaml_data['seed_examples']:
        examples.append({
            'task_description': yaml_data['task_description'],
            'seed_question': example['question'],
            'seed_response': example['answer']
        })
    
    # Convert to HF Dataset
    dataset = Dataset.from_list(examples)
    return dataset

# Load and convert the seed data
seed_data = convert_yaml_to_jsonl('seed_data/table_manipulation_qna.yaml')



In [4]:
from rich import print
from rich.panel import Panel

print(Panel(
    "\n\n".join(f"[bold]{k}:[/bold] \n\n{v}" for k,v in seed_data[0].items()),
    title="Seed Data Example"
))

## 🚀 Step 3: Generate Synthetic Data

Now that we have our seed data ready, we can use LAB’s Skill Data Generator to create **high-quality synthetic training examples** for our custom skill.

This step leverages a predefined **flow configuration** that encodes how seed examples are expanded — by generating new contexts, questions, and responses, and filtering them for quality.

In this demo, we'll use the `flows/table_manipulation.yaml` pipeline to generate synthetic data.

In [5]:
import os
from instructlab.sdg.pipeline import Pipeline, PipelineContext
from blocks import *

ctx = PipelineContext(client=client, model_family="llama", model_id=teacher_model)
skills_pipe = Pipeline.from_file(ctx, os.path.join(os.getcwd(), "flows/table_manipulation.yaml"))

In [6]:
generated_data = skills_pipe.generate(seed_data)

Map (num_proc=8): 100%|██████████| 8/8 [00:00<00:00, 61.81 examples/s]
Filter (num_proc=8): 100%|██████████| 8/8 [00:00<00:00, 77.55 examples/s]

## 🔍 Step 4: Explore and Validate the Synthetically Generated Data

Once the skill generation pipeline has been executed, the output is a set of **synthetically generated examples** — new context-question-response triples that follow the same structure as the seed data but are expanded and refined by the teacher model.

Below is an example of one generated entry:

In [8]:
import random
from rich.panel import Panel
from rich.console import Console

console = Console()
rand_idx = random.randrange(len(generated_data))

# Pretty-print one randomly chosen generated example using rich
example = generated_data[rand_idx]
console.print(Panel.fit(
    f"[bold orange1]Question:[/bold orange1]\n{example['question']}\n\n" 
    f"[bold green]Response:[/bold green]\n{example['response']}"
))
console.rule(style="bright_white")
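
Beyond eyeballing a random sample, a simple structural check can flag responses whose markdown tables are malformed. The following is a minimal sketch, not part of the SDG pipeline: `table_problems` is a hypothetical helper that looks at all pipe-prefixed lines in a text, treats them as one table, and reports rows whose cell count differs from the header. It does not handle escaped pipes or multiple tables in one response.

```python
import re

# A separator cell is one or more hyphens with optional alignment colons.
SEPARATOR_CELL = re.compile(r"^:?-+:?$")


def split_row(line):
    """Split one pipe-delimited table line into stripped cell strings."""
    return [c.strip() for c in line.strip().strip("|").split("|")]


def table_problems(text):
    """Return formatting problems for the pipe-prefixed lines in `text`."""
    lines = [ln for ln in text.splitlines() if ln.strip().startswith("|")]
    if len(lines) < 2:
        return ["no markdown table found"]

    header = split_row(lines[0])
    separator = split_row(lines[1])
    problems = []

    if not all(SEPARATOR_CELL.match(c) for c in separator):
        problems.append("second row is not a valid separator row")
    if len(separator) != len(header):
        problems.append("separator width differs from header")

    for i, ln in enumerate(lines[2:], start=1):
        n_cells = len(split_row(ln))
        if n_cells != len(header):
            problems.append(f"row {i} has {n_cells} cells, expected {len(header)}")
    return problems


# Made-up responses for illustration.
good = "| A | B |\n| --- | --- |\n| 1 | 2 |"
bad = "Here is the table:\n| A | B |\n| --- | --- |\n| 1 | 2 | 3 |"

print(table_problems(good))
print(table_problems(bad))
```

Assuming the generated dataset exposes a `response` column as in the cell above, something like `[ex for ex in generated_data if table_problems(ex["response"])]` would list candidates for manual review. Some valid responses may not contain a table at all, so treat the result as a hint and not as a hard filter.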

## 🏁 Conclusion

In this notebook, we demonstrated how to teach a custom skill to a language model using the InstructLab Skill Data Generator (SDG). Starting from a small set of seed examples, we walked through the full synthetic data generation pipeline — including context creation, question generation, response synthesis, evaluation, and filtering.

We explored a real-world use case: Manipulating Markdown Tables, and showed how the LAB framework can automate the generation of high-quality, instructional training data at scale.

This approach is especially powerful for procedural or domain-specific tasks where labeled data is scarce but consistent task logic can be modeled. With just a few carefully curated seed examples, you can unlock scalable skill creation and push new capabilities into LLMs with minimal manual effort.

You’re now ready to use these synthetic examples for fine-tuning small models!

Next steps?

* Try changing the parameters of the flow to see how the generated data changes (e.g., change `num_samples` or generate with a different temperature)
* Try adapting this pipeline to your own task, domain, or format — whether it’s triaging support tickets, extracting structured data, or following domain-specific workflows. The skills are yours to create.