# Synthetic Dialogue Generation


## Getting started

### Setup (Ollama)

Let's run the ollama server first, as a background process:

In [1]:
import os
get_ipython().system = os.system  # a hack to allow running background processes from Jupyter notebook

!OLLAMA_KEEP_ALIVE=-1 ollama serve > /dev/null 2>&1 &

0
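
Because the server is started in the background, the cell returns immediately, before we know whether the server is actually accepting requests. The following is a minimal sketch of a reachability check, assuming the server listens on Ollama's default address `http://localhost:11434`; the helper name `ollama_is_up` is hypothetical and not part of any library.

```python
import http.client
import urllib.request


def ollama_is_up(base_url="http://localhost:11434", timeout=2.0):
    """Return True if an HTTP server answers with status 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, ValueError, http.client.HTTPException):
        # Connection errors and timeouts are OSError subclasses;
        # ValueError covers malformed URLs.
        return False


print("Ollama reachable:", ollama_is_up())
```

If this prints `False` right after starting the server, waiting a few seconds and checking again is usually enough.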

Let's use Qwen 2.5 (14b) as our base LLM:

In [None]:
MODEL_NAME = "qwen2.5:14b"  # The llm we want to use (https://ollama.com/library)

Let's make sure we have the model downloaded for ollama to use:

In [3]:
!ollama pull qwen2.5:14b

pulling manifest
pulling 2049f5674b1e: 100% ▕██████████████████▏ 9.0 GB
pulling 66b9ea09bd5b: 100% ▕██████████████████▏   68 B
pulling eb4402837c78: 100% ▕██████████████████▏ 1.5 KB
pulling 832dd9e00a68: 100% ▕██████████████████▏  11 KB
pulling db59b814cab7: 100% ▕██████████████████▏  488 B
verifying sha256 digest
writing manifest
success


0

Let's check if our selected model is now part of Ollama's available local models:

In [4]:
!ollama list

NAME               ID              SIZE      MODIFIED               
qwen2.5:14b        7cdf5a0187d5    9.0 GB    Less than a second ago    
llama3.2:1b        baf6a787fdff    1.3 GB    11 days ago               
gemma3:4b          a2af6cc3eb7f    3.3 GB    2 weeks ago               
gemma3:27b         a418f5838eaf    17 GB     6 weeks ago               
deepseek-r1:14b    ea35dfe18182    9.0 GB    7 weeks ago               
deepseek-r1:32b    38056bbcbb2d    19 GB     7 weeks ago               
gemma3:1b          8648f39daa8f    815 MB    7 weeks ago               


0

### Defining the Output (Dialogue)

We will begin by defining the JSON objects that we will use to represent the generated dialogues. For now, this object will have four fields: `"model"`, `"seed"`, `"scenario"`, and `"turns"`, which store the name of the model and the seed used to generate the dialogue, the scenario associated with the dialogue, and the dialogue itself, respectively. More precisely, the `"turns"` field will contain the list of turns of the conversation in order, each with the speaker name and the corresponding utterance, as shown in the following example:

In [None]:
example_dialogue = {
    "model": "qwen2.5:14b",  # the model used to generate the dialogue
    "seed": 123,  # the seed used to generate the dialogue
    "scenario": "short hello and good bye conversation",  # the scenario used to generate the dialogue
    "turns": [
        {"speaker": "Alice", "text": "Hey Bob!"},
        {"speaker": "Bob", "text": "Hey Alice!"},
        {"speaker": "Alice", "text": "Bye Bob!"},
        {"speaker": "Bob", "text": "Bye bye!"},
    ]
}

We can use `pydantic` to properly define our `Dialogue` type:

In [6]:
from pydantic import BaseModel
from typing import List, Union, Optional

class Turn(BaseModel):
    speaker: str
    text: str

class Dialog(BaseModel):
    model: str  # the model used to generate the dialogue
    seed: int  # the seed used to generate the dialogue
    scenario: Optional[Union[dict, str]] = None  # the scenario used to generate the dialogue
    turns: List[Turn]  # the list of turns of the conversation

Having a Python `pydantic` class to formally represent our dialogues is quite useful. For example, we can convert any JSON dialogue to our `Dialog` class as follows:

In [7]:
my_dialogue = Dialog.model_validate(example_dialogue)
my_dialogue

Dialog(model='qwen2.5:14b', seed=123, scenario='short hello and good bye conversation', turns=[Turn(speaker='Alice', text='Hey Bob!'), Turn(speaker='Bob', text='Hey Alice!'), Turn(speaker='Alice', text='Bye Bob!'), Turn(speaker='Bob', text='Bye bye!')])

Or, going the other way, we can convert a `Dialog` to a `dict` or a JSON string as follows:

In [8]:
my_dialogue.model_dump()  # a dict

{'model': 'qwen2.5:14b',
 'seed': 123,
 'scenario': 'short hello and good bye conversation',
 'turns': [{'speaker': 'Alice', 'text': 'Hey Bob!'},
  {'speaker': 'Bob', 'text': 'Hey Alice!'},
  {'speaker': 'Alice', 'text': 'Bye Bob!'},
  {'speaker': 'Bob', 'text': 'Bye bye!'}]}

In [9]:
my_dialogue_json = my_dialogue.model_dump_json(indent=2)  # a string containing the dialog as a JSON object
print(my_dialogue_json)

{
  "model": "qwen2.5:14b",
  "seed": 123,
  "scenario": "short hello and good bye conversation",
  "turns": [
    {
      "speaker": "Alice",
      "text": "Hey Bob!"
    },
    {
      "speaker": "Bob",
      "text": "Hey Alice!"
    },
    {
      "speaker": "Alice",
      "text": "Bye Bob!"
    },
    {
      "speaker": "Bob",
      "text": "Bye bye!"
    }
  ]
}


Or, of course, create a new `Dialog` from scratch:

In [10]:
Dialog(
    model="qwen2.5:14b",
    seed=123,
    turns=[
        Turn(speaker="Alice", text="Hi :)"),
        Turn(speaker="Bob", text="Bye! :(")
    ]
)

Dialog(model='qwen2.5:14b', seed=123, scenario=None, turns=[Turn(speaker='Alice', text='Hi :)'), Turn(speaker='Bob', text='Bye! :(')])

Alternatively, we can use the built-in `Dialog` class from `sdialog`:

In [11]:
from sdialog import Dialog

my_dialog = Dialog.model_validate(example_dialogue)
my_dialog

Dialog(formatVersion='0.0.5', model='qwen2.5:14b', seed=123, dialogId=None, complete=None, scenario='short hello and good bye conversation', turns=[Turn(speaker='Alice', text='Hey Bob!'), Turn(speaker='Bob', text='Hey Alice!'), Turn(speaker='Alice', text='Bye Bob!'), Turn(speaker='Bob', text='Bye bye!')], events=None)

Besides providing the exact same functionality, it also allows us, among other things, to:

- Pretty print the dialogue:

In [12]:
my_dialog.print()

[model] qwen2.5:14b
[seed] 123
--- Dialogue Begins ---
[Alice] Hey Bob!
[Bob] Hey Alice!
[Alice] Bye Bob!
[Bob] Bye bye!
--- Dialogue Ends ---


- Print it in a vanilla textual form:

In [13]:
print(my_dialog)

Alice: Hey Bob!
Bob: Hey Alice!
Alice: Bye Bob!
Bob: Bye bye!


- Save it to a file:

In [14]:
# either as a JSON object
my_dialog.to_file("output/my_dialogue.json")

# or a txt file
my_dialog.to_file("output/my_dialogue.txt")

_(check created files [`output/my_dialogue.json`](output/my_dialogue.json) and [`output/my_dialogue.txt`](output/my_dialogue.txt))_

- Load a dialogue from disk

In [15]:
my_dialog = Dialog.from_file("output/my_dialogue.json")
my_dialog.print(scenario=True)  # `scenario=True` to also print the metadata stored in scenario field

[model] qwen2.5:14b
[seed] 123
[scenario]
short hello and good bye conversation
--- Dialogue Begins ---
[Alice] Hey Bob!
[Bob] Hey Alice!
[Alice] Bye Bob!
[Bob] Bye bye!
--- Dialogue Ends ---


In [16]:
my_dialog = Dialog.from_file("output/my_dialogue.txt")
my_dialog.print(scenario=True)

--- Dialogue Begins ---
[Alice] Hey Bob!
[Bob] Hey Alice!
[Alice] Bye Bob!
[Bob] Bye bye!
--- Dialogue Ends ---


- Or do simple things, like quickly checking how long a dialogue is:

In [17]:
len(my_dialog)

4
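
Note that the dialogue loaded from the `.txt` file above printed without `model`, `seed`, or `scenario`: the plain-text form only keeps one `Speaker: utterance` line per turn. As a rough illustration of that format (a sketch only, not `sdialog`'s actual parser; the function name `parse_text_dialogue` is hypothetical), such text could be turned back into a list of turns like this:

```python
def parse_text_dialogue(text):
    """Parse 'Speaker: utterance' lines into a list of turn dicts."""
    turns = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first colon only, so colons inside the utterance survive
        speaker, sep, utterance = line.partition(":")
        if not sep:
            raise ValueError(f"line has no 'Speaker:' prefix: {line!r}")
        turns.append({"speaker": speaker.strip(), "text": utterance.strip()})
    return turns


plain = "Alice: Hey Bob!\nBob: Hey Alice!\nAlice: Bye Bob!\nBob: Bye bye!"
turns = parse_text_dialogue(plain)
print(len(turns), turns[0])
```

The resulting list has the same shape as the `"turns"` field of our JSON dialogues, which is why metadata has to come from the JSON file if we need it.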

Now we are ready to begin working on synthetic `Dialog` (;)) generation!

## Dialogue Generation

### Introduction

In the previous section we defined what a synthetic dialogue looks like: it contains not only the conversational turns but also useful metadata.

However, we want the LLM to generate only the dialogue itself, not the metadata.

Therefore, let's now define what we want the actual output of the LLM to look like.

The simplest option is to use the same format as our `Dialog` but without the metadata fields. For instance, we would like the LLM to simply generate a JSON like this one:

```json
{
    "dialog": [
        {"speaker": "Alice", "text": "Hey Bob!"},
        {"speaker": "Bob", "text": "Hey Alice!"},
        {"speaker": "Alice", "text": "Bye Bob!"},
        {"speaker": "Bob", "text": "Bye bye!"}
    ]
}
```
Or, as we already did with `Dialog`, we can formally define an `LLMDialogOutput` as:

In [18]:
# Let's define the LLM output simply as a "dialog" field containing the list of turns
class LLMDialogOutput(BaseModel):
    dialog: List[Turn]

Now we need the LLM to generate a dialogue as such JSON object. 

The first thing we could try is to simply instruct the LLM in its prompt to _"output a JSON object with a `"dialog"` key containing a list of turns, where each turn contains two keys, `"speaker"` and `"text"`, to save the speaker name and the utterance, respectively"_.

However, describing JSON objects unambiguously in words is not an easy task, and there's no guarantee the LLM will actually follow the instruction 100% of the time.

Instead, we can work smarter (not harder! ;)): We can [force the LLM to generate a structured output](https://blog.danielclayton.co.uk/posts/ollama-structured-outputs/) by using a formal grammar to constrain decoding.

But wait, with `ollama` it is even easier, since we can simply pass the [JSON **schema**](https://json-schema.org/overview/what-is-jsonschema) (yes, you guessed it, a JSON describing a JSON :)) that formally describes the expected format of the output, and it will automatically do the work for us.

But how do we get the JSON schema? Don't worry! We don't have to write it manually!
If the output is defined as a `pydantic` model, as we already did with `Dialog` and `LLMDialogOutput`, **we can use the built-in `.model_json_schema()` method to obtain its JSON schema**, as follows:

In [19]:
# let's get the json schema for our defined Output
LLMDialogOutput.model_json_schema()

{'$defs': {'Turn': {'properties': {'speaker': {'title': 'Speaker',
     'type': 'string'},
    'text': {'title': 'Text', 'type': 'string'}},
   'required': ['speaker', 'text'],
   'title': 'Turn',
   'type': 'object'}},
 'properties': {'dialog': {'items': {'$ref': '#/$defs/Turn'},
   'title': 'Dialog',
   'type': 'array'}},
 'required': ['dialog'],
 'title': 'LLMDialogOutput',
 'type': 'object'}

Now that we know everything we need, we can force the LLM to always produce its output in this format, simply as follows:

In [20]:
from langchain_ollama.chat_models import ChatOllama

llm = ChatOllama(model=MODEL_NAME,
                 format=LLMDialogOutput.model_json_schema())

# NOTE: here we're NOT giving a single instruction about what the output should look like
llm_output = llm.invoke([("human", "generate a short and weird random dialogue between Alice and Bob")]).content

# and still, the output is valid JSON in the format we wanted :)
print(llm_output)

{ "dialog": [
  { "speaker": "Alice", "text": "Bob, did you know that pineapples can't actually grow on trees?" },
  { "speaker": "Bob", "text": "Really? Then where do they come from—outer space?" },
  { "speaker": "Alice", "text": "Sort of. Pineapples grow at the center of a plant, like a big flower, not on a tree." },
  { "speaker": "Bob", "text": "Wow, and I thought my belief in flying spaghetti monsters was strange!" }
] }
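
Since `llm_output` is just a string, it still has to be parsed before we can work with the turns. The following is a minimal standard-library sketch that parses such a string and checks the shape described by the schema; `parse_llm_dialog` and `sample_output` are hypothetical names used only for illustration.

```python
import json


def parse_llm_dialog(raw):
    """Parse the raw LLM string and check it has the expected 'dialog' shape."""
    data = json.loads(raw)
    turns = data.get("dialog") if isinstance(data, dict) else None
    if not isinstance(turns, list):
        raise ValueError("expected a JSON object with a 'dialog' list")
    for i, turn in enumerate(turns):
        if not isinstance(turn, dict):
            raise ValueError(f"turn {i} is not a JSON object")
        for key in ("speaker", "text"):
            if not isinstance(turn.get(key), str):
                raise ValueError(f"turn {i} is missing a string '{key}' field")
    return turns


# A small hand-written string standing in for `llm_output`
sample_output = '{"dialog": [{"speaker": "Alice", "text": "Hi"}, {"speaker": "Bob", "text": "Bye"}]}'
for turn in parse_llm_dialog(sample_output):
    print(f"{turn['speaker']}: {turn['text']}")
```

In the notebook itself, the same parsing and validation can likely be done in one step with `pydantic`, for example `LLMDialogOutput.model_validate_json(llm_output)`.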


### Description-based Generation

We can use everything we have learned so far to define our own `DialogueGenerator` class that we can instantiate with different LLMs and descriptions to generate our dialogues or, better, we can use `sdialog`'s built-in one:

In [21]:
from sdialog.generators import DialogGenerator

It takes the following arguments as input:
- `model`: the name of the model to use (any model tag from the [ollama hub](https://ollama.com/library)).
- `dialogue_details`: the details about the desired dialogue.
- `output_format`: the output format as a `pydantic` class or JSON schema, as we did above (`LLMDialogOutput` by default).
- `scenario`: an optional metadata field that describes the scenario used to generate the dialogue.

For instance, let's create an instance of `DialogGenerator` to generate conversations between Bob and Alice about her birthday:

In [22]:
dialog_generator = DialogGenerator(
    model=MODEL_NAME,
    dialogue_details="The conversation is between a dad (Bob) and his daughter (Alice). "
                     "Her birthday is coming up and she wants to throw a Star Wars themed party.",
)

Then, we can use the built-in `.generate()` method to generate conversations for this instance:

_(each time you run the code below, a different dialogue will be generated)_

In [23]:
dialog = dialog_generator.generate()
dialog.print(scenario=True)

[model] qwen2.5:14b
[seed] 3164540885
[scenario]
The conversation is between a dad (Bob) and his daughter (Alice). Her birthday is coming up and she wants to throw a Star Wars themed party.
--- Dialogue Begins ---
[Alice] Hey Dad, can we talk about my birthday party?
[Bob] Of course, sweetie! What's on your mind for your special day?
[Alice] Well, I've been thinking... what if we have a Star Wars themed party? Can we do that?
[Bob] A Star Wars party? That sounds like fun! What kind of things were you thinking about for the party?
[Alice] I was thinking, maybe decorations with lightsabers and the Death Star. And everyone could come dressed as their favorite character!
[Bob] That sounds really cool. Do you have a list of who you'd like to invite? And do you know what games or activities they might enjoy?
[Alice] I've got a 

Let's now change the description: the party should now be about Lord of the Rings, not Star Wars. To do this, we can use the `.set()` method to set new details:

In [24]:
dialog_generator.set("The conversation is between a dad (Bob) and his daughter (Alice). "
                     "Her birthday is coming up and she wants to throw a Lord of the Rings themed party.")

In [25]:
dialog = dialog_generator.generate()
dialog.print(scenario=True)

[model] qwen2.5:14b
[seed] 2444532898
[scenario]
The conversation is between a dad (Bob) and his daughter (Alice). Her birthday is coming up and she wants to throw a Lord of the Rings themed party.
--- Dialogue Begins ---
[Bob] Hey Alice, how are you doing today?
[Alice] Hi Dad! I'm great, thanks for asking. Actually, there's something important I wanted to talk to you about.
[Bob] Sure thing, what's on your mind?
[Alice] Well, my birthday is coming up soon and I've been thinking... What if we throw a Lord of the Rings themed party for my birthday? Wouldn't that be awesome?
[Bob] A Lord of the Rings party? That does sound fun! What kind of things were you thinking about doing?
[Alice] I was thinking we could set up different stations based on each of the books. There could be a Hobbiton-style snack table, and maybe activities like rin

You can also pass a seed number, like the ones printed above, as an argument to `generate()` to re-generate the exact same dialogue each time, as follows:

In [26]:
dialog_generator.generate(4216355045).print()

[model] qwen2.5:14b
[seed] 4216355045
--- Dialogue Begins ---
[Bob] Hey Alice, how are you today?
[Alice] Hi Dad! I'm great, thanks for asking. Actually, there's something really important that I want to talk to you about.
[Bob] Sure thing, what's on your mind?
[Alice] Well, my birthday is coming up soon and I've been thinking of having a special party. There's this movie series called 'The Lord of the Rings' that I really love. Could we maybe have a theme party based on it?
[Bob] That sounds like an interesting idea! What kind of things do you have in mind for the party?
[Alice] I was thinking we could invite all my friends, decorate with some cool props and maybe even dress up as our favorite characters. We could also watch one or two movies together.
[Bob] That does sound like a lot of fun! Do you have any specific ideas about decorations or gam

### Role-Playing-based Generation

Our goal here is to have the LLM generate the dialogues by role-playing the different characters.

Each character will be fully defined by its persona. So, the same way we started this tutorial by defining what a "synthetic `Dialog`" actually is, we should now define our `Persona`.

Fortunately, `sdialog` contains a `BasePersona` class that we can import to create our own custom persona classes. Let's import it:

In [27]:
from sdialog.personas import BasePersona

And let's define our concrete `Persona` class now by specifying some useful attributes like a name, role, background, etc:

In [28]:
class Persona(BasePersona):
    name: str = ""
    role: str = ""
    background: str = ""
    personality: str = ""
    circumstances: str = ""
    rules: str = ""

Now we can create/instantiate any `Persona` we want for our characters. Let's create our Bob and Alice:

In [29]:
bob_persona = Persona(
    name="Bob",
    role="great dad",
    circumstances="Your daughter will talk to you",
    background="Computer Science PhD.",
    personality="an extremely happy person that likes to help people",
)

alice_persona = Persona(
    name="Alice",
    role="lovely daughter",
    # note the trailing space, so the two concatenated sentences do not run together
    circumstances="Your birthday is getting closer and you are talking with your dad to organize the party. "
                  "You want your party to be themed as Lord of The Rings."
)

If we print `bob_persona`, we will see that it is automatically converted to a natural language description, which is useful when we want to create the actual prompt for our LLM.

In [30]:
print(bob_persona)

Your name: Bob
Your role: great dad
Your circumstances: Your daughter will talk to you
Your background: Computer Science PhD.
Your personality: an extremely happy person that likes to help people
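
To make the idea concrete, here is a minimal sketch of how a persona's non-empty attributes could be rendered as `Your <attribute>: <value>` lines. It uses a plain `dataclass` rather than `BasePersona`, the class name `SketchPersona` is hypothetical, and it is not `sdialog`'s actual implementation (for instance, the attribute order here simply follows the class definition).

```python
from dataclasses import dataclass, fields


@dataclass
class SketchPersona:
    name: str = ""
    role: str = ""
    background: str = ""
    personality: str = ""
    circumstances: str = ""
    rules: str = ""

    def description(self):
        """Render every non-empty attribute as a 'Your <attribute>: <value>' line."""
        lines = []
        for field in fields(self):
            value = getattr(self, field.name)
            if value:  # empty attributes are left out of the description
                lines.append(f"Your {field.name}: {value}")
        return "\n".join(lines)

    def __str__(self):
        return self.description()


bob = SketchPersona(name="Bob", role="great dad", background="Computer Science PhD.")
print(bob)
```

Keeping the rendering in a single `description()` method means the text that ends up in the prompt can be changed in one place.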


In case we are working with a really complex persona and this default description does not fit our needs, we can override it by defining our own `description()` method, as in the following example:

In [31]:
class PersonaCustom(BasePersona):
    name: str = ""
    role: str = ""

    def description(self):
        return f"Your awesome name is {self.name} and your awesome role is being a {self.role}"

awesome_bob = PersonaCustom(
    name="Bob",
    role="great dad"
)

# Let's print "awesome_bob" persona
print(awesome_bob)

Your awesome name is Bob and your awesome role is being a great dad


So, now that we know how to create personas, let's move on to the fun part: creating the actual generator.

Fortunately, we can simply use `sdialog`'s built-in `PersonaDialogGenerator` class to generate our persona-based dialogues as follows:

In [32]:
from sdialog.generators import PersonaDialogGenerator

dialog_generator = PersonaDialogGenerator(
    model=MODEL_NAME,
    persona_a=bob_persona,
    persona_b=alice_persona,
)

dialog_generator.generate().print()

[model] qwen2.5:14b
[seed] 1665669424
--- Dialogue Begins ---
[Bob] Hello!
[Alice] Hi Dad!
[Bob] How's everything going, sweetie?
[Alice] I'm good dad! My birthday is coming up soon and I have a special theme in mind.
[Bob] Oh really? What kind of theme are you thinking about for your party?
[Alice] I want it to be themed as Lord of The Rings. Can we do that?
[Bob] That sounds like a fantastic idea! I'm sure we can make it happen.
[Alice] Thanks Dad! What kind of decorations do you think we could use?
[Bob] We could have banners with Elvish script, some maps of Middle-earth, and even some Gandalf and Aragorn posters. How does that sound?
[Alice] That sounds amazing! Can we also have a Hobbit cake?
[Bob] Of course, we can definitely get a Hobbit-themed cake or maybe one shaped like the One Ring!

## Use Case: Dialogue Generation for STAR Dataset

### Introduction

> ℹ️ Before we begin this section, make sure you have the STAR dataset downloaded in your system, inside the `datasets` folder:
> ```bash
> cd datasets
> git clone git@github.com:RasaHQ/STAR.git
> ```
> Make sure you have a `datasets/STAR` folder with the `dialogues` and `tasks` folders inside.

The [STAR](https://arxiv.org/pdf/2010.11853) dataset contains 6652 human-generated dialogues as JSON objects where files are named as `NUMBER.json`.

Humans had to follow a well-defined set of instructions to generate each dialogue, role-playing the system (wizard) and the client (user).

For instance, clicking [here](datasets/STAR/dialogues/1.json) we can open the file [`1.json`](datasets/STAR/dialogues/1.json) containing the first dialogue. For now, let's focus only on the `"Scenario"` field.

For the dialogue in `1.json`, it is as follows:

```json
{
    "Domains": [  # List of domains
        "doctor"
    ],
    "Happy": true,  # Whether or not the dialogue follows a happy path
    "MultiTask": false,  # Whether or not this dialogue involves more than one task
    "UserTask": "You (Alexis) had an appointment with Dr. Morgan the other day. Unfortunately, you forgot to write down the instructions the doctor gave you. Please followup and find out how often to take your medicine.",
    "WizardTask": "Inform the user of his/her doctor's orders.",
    "WizardCapabilities": [  # List of flowcharts describing each task the Wizard is capable of doing
        {
            "Domain": "doctor",
            "SchemaImage": "doctor_followup.jpg",
            "Task": "doctor_followup"
        }
    ]
}
```

We can use the `STAR` module from `sdialog` to read the scenario object of any dialogue in STAR, given its conversation ID, as follows:

In [None]:
from sdialog.datasets import STAR

# Let's first indicate where the dataset is located
STAR.set_path("datasets/STAR/")

# Let's set the first dialogue as the target example
TARGET_DIALOG = 1

# Let's load the scenario of the first dialog
scenario = STAR.get_dialog_scenario(TARGET_DIALOG)
scenario

{'Domains': ['doctor'],
 'Happy': True,
 'MultiTask': False,
 'UserTask': 'You (Alexis) had an appointment with Dr. Morgan the other day. Unfortunately, you forgot to write down the instructions the doctor gave you. Please followup and find out how often to take your medicine.',
 'WizardCapabilities': [{'Domain': 'doctor',
   'SchemaImage': 'doctor_followup.jpg',
   'Task': 'doctor_followup'}],
 'WizardTask': "Inform the user of his/her doctor's orders."}

Which corresponds to the following dialogue:

In [None]:
original_dialogue = STAR.get_dialog(1)
original_dialogue.print()

[dialog_id] 1
--- Dialogue Begins ---
[User] Hello, I'm really worried. I forgot what I'm supposed to do and forgot to write it down... What do I do?
[System] Could I get your name, please?
[User] My name is Alexis and my last doctor was Dr. Morgan, but now my doctor is Dr. Johnson and I forgot how to take my medicine.
[System] Your instructions are: Take your medicine before you go to sleep. If you experience nausea, please contact your doctor immediately..
[User] Are you sure I'm supposed to take it before bed? I don't go to sleep every day because my sleep schedule is totally off right now because of the Coronavirus.
[System] Yes. It must be before bed or it will not be effective.
[User] Okay thank you. I will get back in touch if this doesn't help.
[System] Thank you and goodbye.
--- Dialogue Ends ---


### Description-based Generation

In the `scenario`, we can see that the user's behavior is defined by instructions given in natural language (`"UserTask"`). The system/wizard behavior, however, is defined more rigidly, as a graph describing the dialogue policy to be followed (since the system was expected to be more deterministic). These graphs are described as JSON objects storing the graph edges as key:value pairs (source:destination). We can find these graphs in the [`STAR/tasks`](datasets/STAR/tasks) folder.

Ideally, we would like our `DialogGenerator` to generate dialogues for each different scenario. That is, given a `scenario`, we would like to generate multiple dialogues for it.

To achieve this, we only need to find a way to describe each `scenario` using natural language so that we can pass it to our `DialogGenerator`.

Fortunately, we can use the built-in `get_scenario_description()` method to do this, which takes a `scenario` as input and returns its natural language description containing all the details (including the system behavior described by the graphs):

In [35]:
print(STAR.get_scenario_description(scenario))

The conversation is between a User and a AI assistant in the following domains: doctor.

The User instructions are: You (Alexis) had an appointment with Dr. Morgan the other day. Unfortunately, you forgot to write down the instructions the doctor gave you. Please followup and find out how often to take your medicine.
The AI assistant instructions are: Inform the user of his/her doctor's orders.

In addition, the AI assistant is instructed to follow specific flowcharts to address the tasks. Flowcharts are defined as graph described using DOT.
The actual DOT for the current tasks are:

The graph for the task 'doctor_followup' with domain 'doctor' is:
```dot
digraph doctor_followup  {
    hello -> ask_name;
    ask_name -> doctor_ask_doctor_name;
    doctor_ask_doctor_name -> query;
    query -> doctor_inform_doctors_instructions;
    doctor_inform_doctors_instructions -> anything_else
}
```
and one example responses for each node is provided in the following json:
```json
{
  "hello": "H
```

Note how the original JSON graph describing the system's behavior has been converted to a [DOT](https://en.wikipedia.org/wiki/DOT_(graph_description_language)) description, which should be easier for the LLM to interpret, since DOT is a well-known format for describing graphs in plain text.
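
To illustrate the idea behind this conversion, here is a minimal sketch that renders a source:destination edge mapping as a DOT digraph. The function name `edges_to_dot` is hypothetical, the edge mapping is copied by hand from the DOT output above, and the exact layout of the files in `STAR/tasks` (and how `sdialog` reads them) is not assumed here.

```python
def edges_to_dot(task_name, edges):
    """Render a {source: destination} edge mapping as a DOT digraph string."""
    lines = [f"digraph {task_name} {{"]
    for source, destination in edges.items():
        lines.append(f"    {source} -> {destination};")
    lines.append("}")
    return "\n".join(lines)


# Edge mapping mirroring the 'doctor_followup' graph printed above
doctor_followup = {
    "hello": "ask_name",
    "ask_name": "doctor_ask_doctor_name",
    "doctor_ask_doctor_name": "query",
    "query": "doctor_inform_doctors_instructions",
    "doctor_inform_doctors_instructions": "anything_else",
}
print(edges_to_dot("doctor_followup", doctor_followup))
```

This sketch assumes node names are plain identifiers (letters, digits, underscores); names with spaces or other characters would need to be quoted in DOT.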

Let's now put these two methods together and create a function that, given a STAR dialogue ID, returns the scenario associated with it along with its natural language description, simply as follows:

In [36]:
def get_dialog_scenario_description(dialogue_id):
    # Get the scenario of the target dialogue
    scenario = STAR.get_dialog_scenario(dialogue_id)
    # Then return it along its description in natural language
    return scenario, STAR.get_scenario_description(scenario)

With this function, we now have everything we need to generate synthetic dialogues for STAR that follow the same scenario as a given real target dialogue.

For instance, let's say we want to generate dialogues following the same scenario as the first STAR dialogue. We can simply do:

In [37]:
# First, let's get the scenario and description of the first dialogue
scenario, description = get_dialog_scenario_description(dialogue_id=1)

# let's now create a dialogue generator for it
dialog_generator = DialogGenerator(
    model=MODEL_NAME,
    dialogue_details=description,
    scenario=scenario
)

Let's now generate multiple conversations that follow the same scenario as dialogue 1 of the STAR dataset.
> **Note**
> Run the cell multiple times to get different conversations for it

In [38]:
dialog = dialog_generator.generate()
dialog.print(scenario=True)

[model] qwen2.5:14b
[seed] 3277458858
[scenario]
{
  "Domains": [
    "doctor"
  ],
  "Happy": true,
  "MultiTask": false,
  "UserTask": "You (Alexis) had an appointment with Dr. Morgan the other day. Unfortunately, you forgot to write down the instructions the doctor gave you. Please followup and find out how often to take your medicine.",
  "WizardCapabilities": [
    {
      "Domain": "doctor",
      "SchemaImage": "doctor_followup.jpg",
      "Task": "doctor_followup"
    }
  ],
  "WizardTask": "Inform the user of his/her doctor's orders."
}
--- Dialogue Begins ---
[AI] Hello, how can I help?
[User] Hi there, could you please remind me of what Dr. Morgan told me about my medication?
[AI] Could I get your name, please?
[User] My name is Alexis.
[AI] Who is your doctor?
[User] Dr. Morgan.
[AI] Your instructions are: Tak

We can see the LLM is able to follow the scenario surprisingly well, especially for the system part, which is guided by a graph with pre-defined responses.
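
Instead of re-running the cell by hand, several dialogues can also be collected in a loop while recording the seed of each one so it can be replayed later. The sketch below shows the pattern with a stand-in function: `generate_many` and `fake_generate` are hypothetical names, and in the notebook one would pass `dialog_generator.generate` (which accepts a seed, as shown earlier) in place of `fake_generate`.

```python
import random


def generate_many(generate_fn, n, rng=None):
    """Call generate_fn(seed) n times with freshly drawn seeds; return (seed, result) pairs."""
    rng = rng or random.Random()
    results = []
    for _ in range(n):
        seed = rng.randrange(2**32)  # 32-bit seed, recorded so the run can be replayed
        results.append((seed, generate_fn(seed)))
    return results


def fake_generate(seed):
    """Hypothetical stand-in for `dialog_generator.generate(seed)`."""
    return f"dialogue sampled with seed {seed}"


batch = generate_many(fake_generate, n=3, rng=random.Random(0))
for seed, dialogue in batch:
    print(seed, "->", dialogue)
```

Keeping the `(seed, result)` pairs together makes it easy to regenerate any single dialogue that turns out to be interesting.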

Now let's update the generator to match a more challenging scenario, say that of dialogue 5100, which is multi-task and does not follow a happy path:

In [None]:
scenario, description = get_dialog_scenario_description(5100)

dialog_generator.set(description, scenario)

scenario

{'Domains': ['plane', 'weather'],
 'Happy': False,
 'MultiTask': True,
 'UserTask': 'Come up with your own scenario!\n\nAbout you:\n- Your name: Ben\n\n The AI Assistant can handle:\n- Search for a flight (e.g. from Chicago to Pittsburgh)\n- Book a flight (e.g. with id 193)\n- Checking the weather forecast in different Cities (e.g. Chicago or Pittsburgh)',
 'WizardCapabilities': [{'Domain': 'plane',
   'SchemaImage': 'plane_search.jpg',
   'Task': 'plane_search'},
  {'Domain': 'plane', 'SchemaImage': 'plane_book.jpg', 'Task': 'plane_book'},
  {'Domain': 'weather', 'SchemaImage': 'weather.jpg', 'Task': 'weather'}],
 'WizardTask': 'Follow the flow charts and help the user.'}

And generate dialogues for it:

_(run the cell multiple times to generate different ones)_

In [40]:
dialog_generator.generate().print()

[model] qwen2.5:14b
[seed] 3526676935
--- Dialogue Begins ---
[AI Assistant] Hello, how can I help?
[Ben] Hi there. First of all, could you book a flight for me with the ID 193 please?
[AI Assistant] May I have your name, please?
[Ben] Sure, my name is Ben.
[AI Assistant] Can I have your flight ID, please?
[Ben] The flight ID is 193. Wait a moment though, can you also check the weather in Pittsburgh for me before we proceed with booking?
[AI Assistant] Of course, I'll do that after checking if your flight is available. Let's first see about the availability of the flight.
[Ben] 
[AI Assistant] The flight is available. Should I reserve it for you?
[Ben] I think there might be a problem, let's check the weather in Pittsburgh first and if everything looks good we'll book.
[AI Assistant] Sure thing, 

### Role-playing-based Generation

In the previous section, we only had to find a way to describe each `scenario` using natural language so that we could pass it to our `DialogGenerator`.

Likewise, we now have to find a way to create the right system and user personas for each scenario, that is, to return the right system and user `Persona`s for a given `scenario`.

Fortunately, we can use the built-in `STAR.get_user_persona_for_scenario(scenario)` and `STAR.get_system_persona_for_scenario(scenario)` methods to achieve this.

For instance, let's get the user persona for the `scenario` of the first dialogue above:

In [41]:
scenario = STAR.get_dialog_scenario(TARGET_DIALOG)

user_persona = STAR.get_user_persona_for_scenario(scenario)
print(user_persona)

Your role: user calling a AI assistant that can perform multiple tasks in the following domains: doctor.

The following should be considered regarding the conversation:
   1. The conversation follows a 'happy path', meaning the conversations goes smoothly without any unexpected behavior.
   2. The conversation involves only one task you were instructed to (doctor_followup), nothing else
Your circumstances: You (Alexis) had an appointment with Dr. Morgan the other day. Unfortunately, you forgot to write down the instructions the doctor gave you. Please followup and find out how often to take your medicine.


Now, similar to what we did in the previous subsection, we just need to define a function that, given a dialogue ID, returns its scenario together with the corresponding system and user personas:

In [42]:
def get_dialog_scenario_and_personas(dialogue_id):
    # Get the scenario of the target dialogue
    scenario = STAR.get_dialog_scenario(dialogue_id)
    # Get the personas
    system = STAR.get_system_persona_for_scenario(scenario)
    user = STAR.get_user_persona_for_scenario(scenario)
    return scenario, system, user

And that's it, now we can simply create a `PersonaDialogGenerator` using the system and user personas as follows:

In [43]:
# let's get the personas and the scenario
scenario, system, user = get_dialog_scenario_and_personas(dialogue_id=1)

# let's now create a dialogue generator for it
dialog_generator = PersonaDialogGenerator(
    model=MODEL_NAME,
    persona_a=system,
    persona_b=user,
    scenario=scenario
)

# let's generate the dialogue
dialog_generator.generate().print()

[1m[95m[model] [35mqwen2.5:14b[0m
[1m[95m[seed] [35m1388934746[0m
[1m[35m--- Dialogue Begins ---[0m
[31m[AI assistant] [0mHello.[0m
[94m[Alexis] [37mHi there, could you please help me remember what my doctor told me after my appointment last week?[0m
[31m[AI assistant] [0mOf course. Could I get your name, please?[0m
[94m[Alexis] [37mMy name is Alexis.[0m
[31m[AI assistant] [0mWho is your doctor?[0m
[94m[Alexis] [37mIt's Dr. Morgan.[0m
[31m[AI assistant] [0mYour instructions are: Take the prescribed medication three times a day, before meals.[0m
[94m[Alexis] [37mThank you so much! Is there anything else that I can do for you?[0m
[31m[AI assistant] [0mNo, thank you and goodbye.[0m
[1m[35m--- Dialogue Ends ---[0m


And that's it for role-playing-based generation, congrats! 😎

### Saving our dialogues

Finally, let's generate one synthetic dialog for each happy `"doctor_followup"` dialog in STAR and save it to disk for later use.

Let's first get all happy dialogues for this task using `sdialog`'s built-in `STAR.get_dialogs()` function:

In [44]:
original_dialogs = STAR.get_dialogs(task_name="doctor_followup", happy=True, multitask=False)
print('Total number of happy "doctor_followup" dialogues in STAR:', len(original_dialogs))

Reading dialogs:   0%|          | 0/6652 [00:00<?, ?it/s]

Total number of happy "doctor_followup" dialogues in STAR: 105


Now let's generate the dialogues and save them to the path given by the `PATH_OUTPUT` variable.

In [45]:
import os

from tqdm.auto import tqdm

PATH_OUTPUT = "output/STAR/full-generation"

path_txt = os.path.join(PATH_OUTPUT, "txt")
path_json = os.path.join(PATH_OUTPUT, "json")
os.makedirs(path_txt, exist_ok=True)
os.makedirs(path_json, exist_ok=True)

for dialog in tqdm(original_dialogs, desc="Dialog generation"):
    if os.path.exists(os.path.join(path_txt, f"{dialog.dialogId}.txt")):
        continue

    scenario, description = STAR.get_dialog_scenario_description(dialog.dialogId)
    dialog_generator = DialogGenerator(
        model=MODEL_NAME,
        dialogue_details=description,
        scenario=scenario
    )
    dialog = dialog_generator.generate(id=dialog.dialogId, seed=dialog.dialogId)

    # Normalize speaker names in each turn (since they're also generated by the LLM)
    for turn in dialog.turns:
        turn.speaker = "System" if "AI" in turn.speaker else "User"

    dialog.to_file(os.path.join(path_json, f"{dialog.dialogId}.json"))
    dialog.to_file(os.path.join(path_txt, f"{dialog.dialogId}.txt"))

Dialog generation:   0%|          | 0/105 [00:00<?, ?it/s]
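The loop above labels a turn as `"System"` whenever the speaker name contains the substring `"AI"`. As a slightly more defensive variant, the sketch below matches whole words case-insensitively. Note that `normalize_speaker` and `SYSTEM_MARKERS` are hypothetical names introduced here for illustration, not part of `sdialog`, and the list of markers is an assumption you may need to adapt to what your model actually produces.

```python
# Hypothetical helper (not part of sdialog): map LLM-generated speaker
# names to the two canonical labels used when saving the dialogues.
SYSTEM_MARKERS = ("ai", "assistant", "system")  # assumed labels, adjust as needed


def normalize_speaker(name: str) -> str:
    """Return "System" if the name looks like the assistant, else "User"."""
    # Compare whole lowercase words so that names merely containing
    # the letters "ai" (e.g. "Kai") are not mistaken for the system.
    words = name.lower().replace("-", " ").split()
    return "System" if any(word in SYSTEM_MARKERS for word in words) else "User"


for raw in ["AI Assistant", "AI assistant", "Alexis", "Ben"]:
    print(raw, "->", normalize_speaker(raw))
```

Inside the loop, this would replace the inline condition with `turn.speaker = normalize_speaker(turn.speaker)`.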

Finally, let's check the files were generated:

In [46]:
%ls output/STAR/full-generation/

[0m[01;34mjson[0m/
[01;34mtxt[0m/


In [None]:
%ls output/STAR/full-generation/txt

1.txt
1848.txt
1886.txt
1896.txt
1899.txt
1942.txt
1963.txt
1970.txt
2103.txt
2228.txt
2273.txt
2487.txt
2526.txt
2578.txt
2579.txt
2624.txt
2699.txt
2733.txt
3007.txt
3024.txt
3050.txt
3071.txt
3073.txt
3086.txt
3088.txt
3110.txt
3116.txt
3126.txt
3136.txt
3155.txt
3198.txt
3202.txt
3210.txt
3234.txt
3254.txt
3264.txt
3269.txt
3274.txt
3298.txt
3316.txt
3323.txt
3329.txt
3330.txt
3356.txt
3371.txt
3391.txt
3403.txt
3418.txt
3422.txt
3437.txt
3446.txt
3454.txt
3469.txt
3494.txt
3516.txt
3528.txt
3550.txt
3652.txt
3675.txt
3743.txt
3769.txt
4055.txt
4058.txt
4067.txt
4076.txt
4082.txt
4093.txt
4100.txt
4111.txt
4159.txt
4173.txt
4191.txt
4205.txt
4214.txt
4224.txt
4231.txt
4233.txt
4253.txt
4255.txt
4263.txt
4341.txt
4349.txt
4358.txt
4381.txt
4395.txt
4403.txt
4416.txt
4459.txt
4468.txt
4477.txt
4502.txt
4515.txt
4524.txt
4536.txt
4570.txt
4591.txt
4623.txt
4653.txt
4737.txt
4743.txt
4752.txt
4851.txt
4870.txt
4875.txt
9.txt
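To use the saved dialogues later, we can read them back with the standard library alone. The sketch below is only an illustration: `list_dialog_ids` and `load_json_dialogs` are hypothetical helpers, and since the exact JSON structure written by `to_file` is not shown here, the files are simply parsed into plain Python objects. It also checks that every `.txt` file has a `.json` counterpart, because the generation loop only tests for the `.txt` file when deciding what to skip.

```python
import json
import os


def list_dialog_ids(folder: str, extension: str) -> set:
    """Return the file names (without extension) found in `folder`."""
    if not os.path.isdir(folder):
        return set()
    return {
        os.path.splitext(name)[0]
        for name in os.listdir(folder)
        if name.endswith(extension)
    }


def load_json_dialogs(folder: str) -> dict:
    """Load every .json file in `folder` into a dict keyed by dialogue ID."""
    dialogs = {}
    for dialog_id in sorted(list_dialog_ids(folder, ".json")):
        path = os.path.join(folder, f"{dialog_id}.json")
        with open(path, encoding="utf-8") as f:
            dialogs[dialog_id] = json.load(f)
    return dialogs


PATH_OUTPUT = "output/STAR/full-generation"  # same path as in the cell above
txt_ids = list_dialog_ids(os.path.join(PATH_OUTPUT, "txt"), ".txt")
json_ids = list_dialog_ids(os.path.join(PATH_OUTPUT, "json"), ".json")
print("txt files:", len(txt_ids), "| json files:", len(json_ids))
print("IDs missing a JSON counterpart:", sorted(txt_ids - json_ids))
```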


## Exercise: Doctor-Patient Conversations

Suppose you now have to generate synthetic doctor-patient conversations. Is it better to use the "description" or the "role-playing" approach?

- How would you define the Patient and Doctor personas?

In [49]:
# TODO: do your magic!
class DoctorPersona(BasePersona):
    pass

class PatientPersona(BasePersona):
    pass

- How would you initialize them? (e.g. concrete values for each defined attribute)

In [None]:
doctor = DoctorPersona(
    # TODO: assign values to the attributes!
)
patient = PatientPersona(
    # TODO: assign values to the attributes!
)
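If you are unsure where to start, here is one possible starting point. It is only a sketch: it uses plain dataclasses as stand-ins so that it runs without `sdialog`, whereas in the notebook the classes would subclass `BasePersona`. Every attribute name and value below is a hypothetical choice, not something prescribed by the library.

```python
from dataclasses import dataclass, asdict


# Stand-ins for illustration only: in the notebook these would subclass
# sdialog's BasePersona; every attribute below is a hypothetical choice.
@dataclass
class DoctorPersonaSketch:
    name: str = ""
    specialty: str = ""
    years_of_experience: int = 0
    communication_style: str = ""


@dataclass
class PatientPersonaSketch:
    name: str = ""
    age: int = 0
    symptoms: str = ""
    medical_history: str = ""
    attitude: str = ""


# Concrete (made-up) values for each attribute
doctor_sketch = DoctorPersonaSketch(
    name="Dr. Morgan",
    specialty="general practice",
    years_of_experience=12,
    communication_style="calm and concise",
)
patient_sketch = PatientPersonaSketch(
    name="Alexis",
    age=34,
    symptoms="persistent cough",
    medical_history="no chronic conditions",
    attitude="worried",
)
print(asdict(doctor_sketch))
print(asdict(patient_sketch))
```

A useful rule of thumb is to keep as attributes only the properties you want to vary or control across dialogues.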

Once we have defined our persona classes and created our doctor and patient personas, we can simply create a generator for them as we did earlier in this tutorial:

In [None]:
# let's now create a dialogue generator for our doctor and patient
dialog_generator = PersonaDialogGenerator(
    model=MODEL_NAME,
    persona_a=doctor,
    persona_b=patient
)

And generate as many dialogues as we want for them _(by running the cell multiple times)_:

In [None]:
# let's generate the doctor-patient dialogue
dialog_generator.generate().print()

- Can you think of a `scenario` object for doctor-patient conversations? What would this scenario contain?

> 💡 **Hint:** what is it that you want to keep track of or control when generating the conversations? (different diseases? different outcomes? different skills? etc.)

In [None]:
scenario = {}  # TODO: define your own, perhaps better to use pydantic's BaseModel instead of a dict {}

If we had such a `scenario`, we could define a function that returns the right doctor and patient for it:

In [None]:
def get_doctor_patient_for_scenario(scenario):
    # TODO: do some magic to return the right doctor
    #       and patient personas for the given scenario
    # doctor = DoctorPersona()
    # patient = PatientPersona()
    return doctor, patient
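As a hedged illustration of this idea, the sketch below defines a small scenario object and derives persona attributes from it. `ConsultationScenario`, its fields, and `persona_attributes_for_scenario` are all hypothetical names. The function returns plain dictionaries of keyword arguments so that the example runs without `sdialog`; in the notebook you would pass them to your own `DoctorPersona` and `PatientPersona` classes, provided these define matching attributes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConsultationScenario:
    """Hypothetical scenario: the properties we want to control per dialogue."""
    condition: str         # e.g. the disease or complaint being discussed
    visit_type: str        # e.g. "first visit" or "follow-up"
    expected_outcome: str  # e.g. "prescription renewal" or "referral"


def persona_attributes_for_scenario(scenario: ConsultationScenario):
    """Derive keyword arguments for the doctor and patient personas."""
    doctor_kwargs = {
        "role": f"doctor handling a {scenario.visit_type}",
        "goal": f"reach this outcome: {scenario.expected_outcome}",
    }
    patient_kwargs = {
        "role": f"patient attending a {scenario.visit_type}",
        "circumstances": f"is seeking help for {scenario.condition}",
    }
    return doctor_kwargs, patient_kwargs


example = ConsultationScenario("seasonal allergies", "follow-up", "prescription renewal")
doctor_kwargs, patient_kwargs = persona_attributes_for_scenario(example)
print(doctor_kwargs)
print(patient_kwargs)
```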

Which we could finally use to create generators for different `scenarios`:

In [None]:
def get_generator_for_scenario(scenario):
    # Get the right doctor and patient personas for the given scenario
    doctor, patient = get_doctor_patient_for_scenario(scenario)

    # Create and return a dialogue generator for them
    return PersonaDialogGenerator(
        model=MODEL_NAME,
        persona_a=doctor,
        persona_b=patient,
        scenario=scenario,
        # dialogue_details=""  # TODO: optional, in case scenario also requires defining certain properties outside the personas
    )

In [None]:
generator = get_generator_for_scenario(scenario)

Which will allow us to generate multiple dialogues belonging to the same `scenario`:

In [None]:
generator.generate()
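To collect several dialogues for the same scenario in a reproducible way, one option is to call `generate()` with a different seed each time. The sketch below assumes the generator accepts a `seed` keyword, as `DialogGenerator.generate()` does in the saving loop earlier in this tutorial; whether `PersonaDialogGenerator` accepts the same keyword is an assumption. `generate_many` is a hypothetical helper, and `EchoGenerator` is a dummy stand-in so that the example runs without an LLM.

```python
from typing import Any, List, Protocol


class SupportsGenerate(Protocol):
    """Anything exposing generate(seed=...), e.g. a dialogue generator."""

    def generate(self, seed: int) -> Any: ...


def generate_many(generator: SupportsGenerate, n: int, base_seed: int = 0) -> List[Any]:
    """Call generator.generate() n times, each with a distinct, reproducible seed."""
    return [generator.generate(seed=base_seed + i) for i in range(n)]


class EchoGenerator:
    """Dummy generator used only to make this sketch runnable without an LLM."""

    def generate(self, seed: int) -> str:
        return f"dialogue generated with seed {seed}"


for dialog in generate_many(EchoGenerator(), n=3, base_seed=42):
    print(dialog)
```

With a real generator, you would pass `generator` instead of `EchoGenerator()` and then print or save each returned dialogue.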

Cool, huh? Congrats on finishing the tutorial! You did a great job! 😎

## Acknowledgments

Content created for [JSALT 2025](https://jsalt2025.fit.vut.cz/) as a tutorial for the ["Play your part"](https://jsalt2025.fit.vut.cz/summer-workshop#play-your-part) research group.

License: MIT License. Copyright (c) 2025 Idiap Research Institute.

Author: Sergio Burdisso (sergio.burdisso@idiap.ch)