In [1]:
%load_ext autoreload
%autoreload 2

## Teaching a Language Model the Skill: Unstructured Text → Markdown Table

Company X receives large volumes of user feedback through support emails, in-app surveys, and app store reviews. These messages often contain valuable product insights, but the content is unstructured and difficult to analyze at scale.

To streamline internal workflows, an AI team at Company X wants to teach a language model how to convert raw user feedback into structured markdown tables. These tables summarize key topics, user sentiment, and issues in a format that’s easy to scan, report, or push into dashboards and tracking systems.

We can do this using InstructLab!

#### 🧾 Example Input and Output

📥 Input (Unstructured Feedback)
```
Hey team — I’ve been using the new update for about a week now.

Couple of things:
- The dark mode is awesome, great job!
- But the loading time after login feels slower than before. Not a deal breaker but noticeable.
- I also noticed that the calendar widget doesn’t update properly if I change time zones.

Overall, I love where this is going. Just needs a few tweaks.
```
📤 Output (Markdown Table)

| Feature           | Feedback                                                               | Sentiment |
|------------------|------------------------------------------------------------------------|-----------|
| Dark Mode        | Works well, user is satisfied.                                          | Positive  |
| Login Performance| Loading time after login is slower than previous version.               | Negative  |
| Calendar Widget  | Doesn't update correctly when time zones change.                        | Negative  |
| Overall          | User is happy with the direction of the product, but suggests tweaks.   | Positive  |

## Recap: Setting up the data generation pipeline

```mermaid
flowchart LR
    A[Flows] --> B[Blocks] --> C[Prompts]
    C --> D[Synthetic Data!]
```

## 🧑‍🏫 Step 1: Serving the Teacher Model

This demo expects an OpenAI-compatible endpoint. You can use any inference server that exposes one, such as vLLM, HFInferenceServer, or LlamaStack. For details on how to set up an inference server using vLLM, please refer to the [README](README.md).

For this demo we will use meta-llama/Llama-3.3-70B-Instruct as our teacher model.

#### Let's test the connection

In [2]:
from openai import OpenAI

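# Placeholder key; replace it if your server enforces authentication
openai_api_key = "EMPTY"
# Endpoint used in this demo; point this at your own inference server
openai_api_base = "http://150.239.209.43:8008/v1"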


client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
teacher_model = models.data[0].id

# Test the connection with a simple completion
response = client.chat.completions.create(
    model=teacher_model,
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.0,
    max_tokens=10
)
completion = response.choices[0].message.content

print(f"Connection successful! {teacher_model}: {completion}")

Connection successful! meta-llama/Llama-3.3-70B-Instruct: Hello. How can I help you today?


## ✍️ Step 2: Provide Custom Examples

As outlined in the LAB paper, the first step is to provide a small number of **seed examples** (typically 5) to bootstrap the skill. These examples are passed into the generation pipeline as input and are stored in a `qna.yaml` file.

For this demo, we’ll use the pre-populated seed file located at: [unstructured_to_structured_qna.yaml](seed_data/unstructured_to_structured_qna.yaml)

```yaml
 version: 3
 created_by: Red Hat AI Innovation Team
 domain: Information Extraction
 seed_examples:
 - answer: |
     | Feature           | Feedback                                                           | Sentiment |
     |------------------|--------------------------------------------------------------------|-----------|
     | Dashboard        | Much faster than previous version, filters are responsive.         | Positive  |
     | Export to CSV    | Clicking the export button doesn't trigger a download.             | Negative  |
     | Dark Mode        | Resets to light mode on login.                                     | Negative  |
   context: >-
     Been using the new dashboard for a few days. It's way faster than the
     previous one, really appreciate the snappy filters. But export to CSV seems
     broken — nothing happens when I click it. Also, dark mode resets every
     time I log in.
   question: Convert the above feedback into a markdown table with columns for Feature, Feedback, and Sentiment?
```

Let's convert the YAML into a Hugging Face `Dataset`, which is the input format the pipeline uses to bootstrap the skill.

In [3]:
import yaml
from datasets import Dataset

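def convert_yaml_to_jsonl(yaml_path):
    """Load the seed YAML and return its examples as a Hugging Face Dataset.

    Despite the name, nothing is written to disk; the rows stay in memory.
    """
    with open(yaml_path, "r", encoding="utf-8") as f:
        yaml_data = yaml.safe_load(f)

    # `task_description` is read from the top level of the seed file
    # (it is not shown in the excerpt above) and copied onto every row.
    task_description = yaml_data["task_description"]

    # Flatten each seed example into one row per example
    examples = [
        {
            "task_description": task_description,
            "seed_context": example["context"],
            "seed_question": example["question"],
            "seed_response": example["answer"],
        }
        for example in yaml_data["seed_examples"]
    ]

    # Convert the list of dicts to a HF Dataset
    return Dataset.from_list(examples)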

# Load and convert the seed data
seed_data = convert_yaml_to_jsonl('seed_data/unstructured_to_structured_qna.yaml')

from rich import print
from rich.panel import Panel

print(Panel(
    "\n\n".join(f"[bold]{k}:[/bold] \n\n{v}" for k,v in seed_data[0].items()),
    title="Seed Data Example"
))




## 🚀 Step 3: Generate Synthetic Data

Now that we have our seed data ready, we can use LAB’s Skill Data Generator to create **high-quality synthetic training examples** for our custom skill.

This step leverages a predefined **flow configuration** that encodes how seed examples are expanded — by generating new contexts, questions, and responses, and filtering them for quality.

In this demo, we'll use the `flows/unstructured_to_structured.yaml` pipeline to generate synthetic data.

### Flows

```mermaid
 flowchart LR
     A[LLMBlock<br/>gen_contexts<br/>⟶ context] --> B[AddStaticValue<br/>add_question<br/>⟶ question]
     B --> C[LLMBlock<br/>gen_responses<br/>⟶ response]
     C --> D[LLMBlock<br/>evaluate_qa_pair<br/>⟶ evaluation, score]
     D --> E[FilterByValueBlock<br/>filter_qa_pair<br/>score >= 2.0]
     E --> F[Generated Data]
```
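
To make the data flow concrete, here is a minimal plain-Python sketch of the same chain of stages, run over a list of dict rows. It is not the SDG API: every function below is a hypothetical stand-in, and the teacher model is replaced by simple stubs so the sketch runs offline. Only the stage order, the column names, and the `score >= 2.0` threshold are taken from the diagram.

```python
from typing import Dict, List

Row = Dict[str, object]

QUESTION = (
    "Convert the above feedback into a markdown table with columns "
    "for Feature, Feedback, and Sentiment?"
)


def gen_contexts(rows: List[Row]) -> List[Row]:
    # Stub for the LLM stage that writes a new context from each seed context.
    return [{**r, "context": f"(variation of) {r['seed_context']}"} for r in rows]


def add_question(rows: List[Row]) -> List[Row]:
    # Mirrors the static-value stage: the same question is attached to every row.
    return [{**r, "question": QUESTION} for r in rows]


def gen_responses(rows: List[Row]) -> List[Row]:
    # Stub for the LLM stage that answers the question for each context.
    return [{**r, "response": "| Feature | Feedback | Sentiment |"} for r in rows]


def evaluate_qa_pair(rows: List[Row]) -> List[Row]:
    # Stub judge: a toy rule that rates very short seed contexts low.
    scored = []
    for r in rows:
        score = 3.0 if len(str(r["seed_context"]).split()) > 3 else 1.0
        scored.append({**r, "evaluation": "stub", "score": score})
    return scored


def filter_qa_pair(rows: List[Row], threshold: float = 2.0) -> List[Row]:
    # Keep only rows whose score meets the threshold from the diagram.
    return [r for r in rows if float(r["score"]) >= threshold]


FLOW = [gen_contexts, add_question, gen_responses, evaluate_qa_pair, filter_qa_pair]


def run_flow(rows: List[Row]) -> List[Row]:
    # Each stage consumes the previous stage's output, adding or dropping rows.
    for stage in FLOW:
        rows = stage(rows)
    return rows


seeds = [
    {"seed_context": "The export button does nothing when clicked."},
    {"seed_context": "ok"},
]
generated = run_flow(seeds)
print(len(generated), generated[0]["score"])
```

The second seed is dropped by the filter because the stub judge scores it below the threshold, which is the same role the evaluation and filtering stages play in the real flow.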

### Blocks: Adding Custom Blocks

One of the core design goals of SDG Hub is **modularity and extensibility**. Creating a new block is as simple as writing a Python class. Any Pythonic transformation or logic—no matter how simple or complex—can be encapsulated as a block and plugged into a pipeline.

Here’s an example of how to create a custom block that adds a static value to every row in the dataset:

```python
@BlockRegistry.register("AddStaticValue")
class AddStaticValue(Block):
    def __init__(self, ctx, pipe, block_name, column_name: str, static_value: str):
        super().__init__(ctx, pipe, block_name)
        self.column_name = column_name
        self.static_value = static_value

    @staticmethod
    def _map_populate_column(samples, column_name, static_value, num_proc=1):
        def populate_column(sample):
            sample[column_name] = static_value
            return sample

        return samples.map(populate_column, num_proc=num_proc)

    def generate(self, samples: Dataset) -> Dataset:
        samples = self._map_populate_column(
            samples, self.column_name, self.static_value
        )
        return samples
```

✨ Why This Matters
* Simplicity: You can wrap any custom Python function into a block—no special framework or boilerplate needed.
* Composable: Once registered, blocks can be easily used in your YAML workflows alongside LLM-based and filtering blocks.
* Parallel-ready: Custom blocks can leverage the existing multiprocessing implementation.
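
The idea of registering a block under a name and then referring to it from a config can be pictured with a small, self-contained sketch. This is an illustrative pattern only, not the actual `BlockRegistry` implementation: `ToyRegistry`, `ToyAddStaticValue`, and the shape of the `step` dict are all hypothetical.

```python
from typing import Callable, Dict, List


class ToyRegistry:
    """Hypothetical name -> class registry (not the real BlockRegistry)."""

    _blocks: Dict[str, type] = {}

    @classmethod
    def register(cls, name: str) -> Callable[[type], type]:
        # Returns a class decorator that stores the class under `name`.
        def decorator(block_cls: type) -> type:
            cls._blocks[name] = block_cls
            return block_cls

        return decorator

    @classmethod
    def get(cls, name: str) -> type:
        return cls._blocks[name]


@ToyRegistry.register("ToyAddStaticValue")
class ToyAddStaticValue:
    def __init__(self, column_name: str, static_value: str):
        self.column_name = column_name
        self.static_value = static_value

    def generate(self, rows: List[dict]) -> List[dict]:
        # Add the same value to every row under the configured column.
        return [{**row, self.column_name: self.static_value} for row in rows]


# A config entry such as a parsed YAML step might produce (shape is assumed).
step = {
    "block_type": "ToyAddStaticValue",
    "block_config": {"column_name": "question", "static_value": "Convert to a table."},
}

# Look the class up by name and build it from the config values.
block = ToyRegistry.get(step["block_type"])(**step["block_config"])
rows = block.generate([{"context": "Dark mode resets on login."}])
print(rows)
```

Looking classes up by name is what lets a declarative flow file mix custom blocks with built-in ones without importing them explicitly in the flow itself.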

### Prompts 

```yaml
system: You are a highly capable AI Assistant that specializes in generating high-quality content tailored to specific tasks.

introduction: |
  Your task is to write a rich, relevant, and well-structured **context** for the following task:
  Task Description: {{task_description}}

principles: |
  Please follow these guiding principles when generating the context:
  * The context should be coherent, informative, and closely aligned with the task description.
  * Do not include any greetings, explanations, or meta commentary.
  * Maintain a natural, human-like tone suitable for the domain.
  * Follow the formatting shown in the example exactly.
  * Wrap the output between the tags: [Start of Context] and [End of Context].

examples: |
  To guide you, here is an example of a well-structured context:
  
  [Start of Context]
  {{seed_context}}
  [End of Context]

generation: |
  Now generate a new context following the same structure and principles. 
  Begin your output with [Start of Context] and end with [End of Context]. 
  Do not include any additional text outside these tags.

start_tags: ["[Start of Context]"]
end_tags: ["[End of Context]"]
```
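
Two mechanics in this prompt config are worth seeing in isolation: filling the `{{...}}` placeholders from a seed row, and using the start and end tags to pull the generated context out of the model's reply. The sketch below uses only the standard library; `render` and `extract_between` are hypothetical helpers, the template is a shortened version of the one above, and the real pipeline's templating and parsing may differ.

```python
import re

# Shortened stand-in for the prompt config above.
TEMPLATE = (
    "Task Description: {{task_description}}\n\n"
    "[Start of Context]\n{{seed_context}}\n[End of Context]"
)


def render(template: str, values: dict) -> str:
    """Replace each {{name}} placeholder; an unknown name raises KeyError."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda match: str(values[match.group(1)]),
        template,
    )


def extract_between(text: str, start_tag: str, end_tag: str) -> list:
    """Return the stripped text found between each start/end tag pair."""
    pattern = re.escape(start_tag) + r"(.*?)" + re.escape(end_tag)
    return [span.strip() for span in re.findall(pattern, text, flags=re.DOTALL)]


prompt = render(
    TEMPLATE,
    {
        "task_description": "Convert feedback into a markdown table.",
        "seed_context": "Dark mode resets every time I log in.",
    },
)

# A made-up teacher reply, used only to exercise the tag extraction.
reply = (
    "[Start of Context]\n"
    "The search bar is great, but sync fails on mobile.\n"
    "[End of Context]"
)
contexts = extract_between(reply, "[Start of Context]", "[End of Context]")
print(contexts)
```

Requiring explicit tags gives the pipeline a simple way to discard replies that add greetings or commentary outside the requested content.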

In [4]:
import os
from instructlab.sdg.pipeline import Pipeline, PipelineContext
from blocks import *

ctx = PipelineContext(client=client, model_family="llama", model_id=teacher_model)
skills_pipe = Pipeline.from_file(ctx, os.path.join(os.getcwd(), "flows/unstructured_to_structured.yaml"))

In [5]:
generated_data = skills_pipe.generate(seed_data)

Map: 100%|██████████| 8/8 [00:00<00:00, 3281.61 examples/s]
Map: 100%|██████████| 8/8 [00:00<00:00, 3765.08 examples/s]
Map: 100%|██████████| 8/8 [00:00<00:00, 4091.50 examples/s]
Map: 100%|██████████| 8/8 [00:00<00:00, 4202.71 examples/s]
Map: 100%|██████████| 8/8 [00:00<00:00, 4159.98 examples/s]
Map: 100%|██████████| 8/8 [00:00<00:00, 3915.34 examples/s]
Map: 100%|██████████| 2/2 [00:00<00:00, 1105.36 examples/s]
Map (num_proc=8): 100%|██████████| 8/8 [00:00<00:00, 54.49 examples/s]
Filter (num_proc=8): 100%|██████████| 8/8 [00:00<00:00, 66.77 examples/s]
Map (num_proc=8): 100%|██████████| 8/8 [00:00<00:00, 57.70 examples/s]
Filter (num_proc=8): 100%|██████████| 8/8 [00:00<00:00, 67.49 examples/s]
Map (num_proc=8): 100%|██████████| 8/8 [00:00<00:00, 56.44 examples/s]
Filter (num_proc=8): 100%|██████████| 8/8 [00:00<00:00, 68.41 examples/s]
Map (num_proc=8): 100%|██████████| 8/8 [00:00<00:00, 56.99 examples/s]
Filter (num_proc=8): 100%|██████████| 8/8 [00:00<00:00, 67.80 examples/s]


## 🔍 Step 4: Explore and Validate the Synthetically Generated Data

Once the skill generation pipeline has been executed, the output is a set of **synthetically generated examples** — new context-question-response triples that follow the same structure as the seed data but are expanded and refined by the teacher model.

Below is an example of one generated entry:

In [7]:
import random
from rich.panel import Panel
from rich.console import Console

console = Console()
# Pick one generated row at random to inspect
rand_idx = random.randrange(len(generated_data))

# Pretty-print the selected example using rich
example = generated_data[rand_idx]
console.print(Panel.fit(
    f"[bold orange1]Context:[/bold orange1]\n{example['context']}\n\n"
    f"[bold cyan]Question:[/bold cyan]\n{example['question']}\n\n" 
    f"[bold green]Response:[/bold green]\n{example['response']}"
))
console.rule(style="bright_white")
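
Beyond eyeballing a random sample, a lightweight structural check can flag responses that are not well-formed tables. The following is a minimal sketch, assuming the expected columns are `Feature`, `Feedback`, and `Sentiment` as in the seed data. `parse_markdown_table` and `is_valid_table` are hypothetical helpers, and the sample string is illustrative rather than actual pipeline output.

```python
from typing import Dict, List

EXPECTED_COLUMNS = ["Feature", "Feedback", "Sentiment"]


def _split_row(line: str) -> List[str]:
    # Drop the outer pipes, then split on the inner ones.
    # Note: escaped pipes inside cells are not handled in this sketch.
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def parse_markdown_table(text: str) -> List[Dict[str, str]]:
    """Parse a simple pipe table into a list of row dicts; raise ValueError if malformed."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    if len(lines) < 3:
        raise ValueError("expected a header, a separator, and at least one data row")

    header = _split_row(lines[0])
    separator = _split_row(lines[1])
    separator_ok = len(separator) == len(header) and all(
        cell and set(cell) <= set("-: ") for cell in separator
    )
    if not separator_ok:
        raise ValueError("malformed separator row")

    rows = []
    for line in lines[2:]:
        cells = _split_row(line)
        if len(cells) != len(header):
            raise ValueError(f"row has {len(cells)} cells, expected {len(header)}")
        rows.append(dict(zip(header, cells)))
    return rows


def is_valid_table(text: str) -> bool:
    """True if the text parses as a table with exactly the expected columns."""
    try:
        rows = parse_markdown_table(text)
    except ValueError:
        return False
    return list(rows[0].keys()) == EXPECTED_COLUMNS


sample = """
| Feature   | Feedback                       | Sentiment |
|-----------|--------------------------------|-----------|
| Dark Mode | Resets to light mode on login. | Negative  |
"""
print(is_valid_table(sample))
print(is_valid_table("No table here."))
```

Assuming `generated_data` is the Hugging Face `Dataset` produced above, a check like this could be applied with `generated_data.filter(lambda ex: is_valid_table(ex["response"]))` to keep only structurally valid rows.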

## 💾 Save the generated data

```python
generated_data.to_json("llama_generated_unstructured_to_structured.jsonl", orient="records", lines=True)
```
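
Many fine-tuning tools expect chat-style records rather than separate columns. As a sketch only, the snippet below reshapes one generated row into a `messages` list and serializes it as a single JSONL line. The exact schema depends on your training framework, so treat the `messages` layout and the `to_chat_record` helper as assumptions; the `context`, `question`, and `response` column names match the fields printed in Step 4, and the row contents here are made up for illustration.

```python
import json


def to_chat_record(row: dict) -> dict:
    """Combine context and question into the user turn; the response is the assistant turn."""
    user_turn = f"{row['context']}\n\n{row['question']}"
    return {
        "messages": [
            {"role": "user", "content": user_turn},
            {"role": "assistant", "content": row["response"]},
        ]
    }


# Illustrative row, not actual pipeline output.
row = {
    "context": "Sync fails on mobile, but the search bar is great.",
    "question": "Convert the above feedback into a markdown table.",
    "response": "| Feature | Feedback | Sentiment |",
}

record = to_chat_record(row)
line = json.dumps(record, ensure_ascii=False)  # one JSONL line per record
print(line)
```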

## 🏁 Conclusion

In this notebook, we demonstrated how to teach a custom skill to a language model using the InstructLab Skill Data Generator (SDG). Starting from a small set of seed examples, we walked through the full synthetic data generation pipeline: context creation, attaching the task question, response synthesis, evaluation, and filtering.

We explored a real-world use case: **transforming unstructured user feedback into structured markdown tables**, and showed how the LAB framework can automate the generation of high-quality, instructional training data at scale.

This approach is especially powerful for procedural or domain-specific tasks where labeled data is scarce but consistent task logic can be modeled. With just a few carefully curated seed examples, you can unlock scalable skill creation and push new capabilities into LLMs with minimal manual effort.

You’re now ready to use these synthetic examples for fine-tuning small models!

Next steps? 

* Try changing the flow parameters to see how the generated data changes (e.g., change `num_samples` or generate with a different temperature).
* Try adapting this pipeline to your own task, domain, or format — whether it’s triaging support tickets, extracting structured data, or following domain-specific workflows. The skills are yours to create.