# Analysis

This notebook contains the code to reproduce the experiments presented in the paper.

Note that we focus on the dialogue policy and do not consider the other modules such as NLU and NLG of the conversational agent. The interaction model QRFA is used to build the dialogue policies for both the user simulators and the conversational agent. The user's actions are query and feedback, while the agent's actions are request and answer. Additionally, there is an action to finish the conversation that is shared by both.

## Data

For the experiments, we use the annotated datasets released along with the QRFA model. This choice is motivated by the fact that these datasets comprise different user behaviors to complete an information seeking task. Therefore, we assume a certain level of realism in the user simulators and conversational agents built using these datasets. The table below introduces the datasets.

## Methodology

For each dataset, we build a user simulator and a conversational agent.
Based on the idea of leave-one-out cross-validation, we study the implication relationships between the objectives of training and evaluation by considering the user population and agent associated with a dataset as the reference and the other user populations as simulated user populations. For each reference pair, we execute the following steps:

1. Get transition probabilities from the reference user population and conversational agent.
2. Train a success predictor for each simulated user population.
3. Generate synthetic dialogues between the reference agent and the simulated user populations. Each dialogue is given a success score using the success predictors.
4. Compute metrics associated with the objectives of training and evaluation.
5. Identify the best user simulator for training and evaluation based on the computed metrics.
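
The pairing behind these steps can be sketched as follows. This is a minimal illustration, assuming the population names `U1` to `U5` used later in this notebook; the helper `leave_one_out_pairs` is hypothetical and not part of the original code.

```python
from typing import Dict, List

# Population names follow the notebook's naming; the helper is a sketch.
populations: List[str] = ["U1", "U2", "U3", "U4", "U5"]


def leave_one_out_pairs(names: List[str]) -> Dict[str, List[str]]:
    """Maps each reference population to the populations that simulate it."""
    return {ref: [name for name in names if name != ref] for ref in names}


pairs = leave_one_out_pairs(populations)
for reference, simulated in pairs.items():
    print(f"Reference {reference}: simulated by {simulated}")
```

Each reference population is therefore compared against every other population, never against itself.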


In [30]:
from typing import Any, List, Optional, Tuple, Dict
import pandas as pd
import random
import numpy as np
from collections import defaultdict
from statistics import mean, stdev

ParticipantTransitionProbs = Dict[str, Dict[str, float]]

## Outcome predictor

The outcome predictor is defined as follows:

$$\hat{o} = \frac{1}{1 + \exp(-h(L, a^t_u, a^t_{CA}, p, i))}$$

where $L$ is the length of the dialogue, $a^t_u$ and $a^t_{CA}$ are the actions of the user and the conversational agent at time $t$, $p$ is the patience, and $i$ is the inclination towards goal completion. The function $h$ is defined as follows:

$$h(L, a^t_u, a^t_{CA}, p, i) = w_1 * \frac{p}{L} + w_2 * \tanh(i) * \mathbb{1}(a^t_u = \text{F}) + w_3 * \mathbb{1}(a^t_{CA} = \text{A})$$

where $w_1$, $w_2$, and $w_3$ are the weights of the features and $\mathbb{1}$ is the indicator function.
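
As a worked example, the formula can be evaluated by hand for a single dialogue. The values below are illustrative assumptions (a patient, cooperative user, a dialogue of 10 actions ending with user feedback and an agent answer), combined with the default weights used in this notebook.

```python
import math

# Illustrative values (assumptions, not taken from the datasets).
patience, inclination = 0.9, 0.9
dialogue_length = 10
last_user_is_feedback = 1.0  # indicator: last user action is feedback (F)
last_agent_is_answer = 1.0  # indicator: last agent action is answer (A)
w1, w2, w3 = 1.0, 1.0, 0.5  # default weights used in this notebook

h = (
    w1 * patience / dialogue_length
    + w2 * math.tanh(inclination) * last_user_is_feedback
    + w3 * last_agent_is_answer
)
outcome_prob = 1.0 / (1.0 + math.exp(-h))
print(f"h = {h:.3f}, outcome probability = {outcome_prob:.3f}")
```

Here the probability is roughly 0.79, which is above the 0.5 threshold, so the dialogue would be labeled as successful.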


In [19]:
def predict_dialogue_outcome(
    patience: float,
    inclination: float,
    dialogue: List[str],
    weights: List[float] = [1.0, 1.0, 0.5],
) -> int:
    """Predicts the outcome of a dialogue.

    Args:
        patience: User's patience.
        inclination: User's inclination towards goal completion.
        dialogue: Dialogue to predict outcome for.
        weights: Weights for the features.

    Returns:
        Outcome of the dialogue (0: failure, 1: success).
    """
    last_user_action = None
    last_agent_action = None

    for action in reversed(dialogue):
        if last_user_action is not None and last_agent_action is not None:
            break

        if action.startswith("U_") and last_user_action is None:
            last_user_action = 1.0 if "F" in action else 0.0
        elif action.startswith("S_") and last_agent_action is None:
            last_agent_action = 1.0 if "A" in action else 0.0

    if last_user_action is None:
        last_user_action = 0.0
    if last_agent_action is None:
        last_agent_action = 0.0

    features = [
        patience / len(dialogue),
        np.tanh(inclination) * last_user_action,
        last_agent_action,
    ]
    h = sum([w * f for w, f in zip(weights, features)])
    outcome_prob = 1 / (1 + np.exp(-h))
    return 1 if outcome_prob >= 0.5 else 0

## User populations and conversational agents

From the annotated dialogues, we can extract the transition probabilities for the user populations and conversational agents. The table summarizes the different user populations and conversational agents.

| Dataset | User populations | Conversational agents |
| ------- | ---------------- | --------------------- |
| DSTC1   | U1               | A1                    |
| DSTC2   | U2               | A2                    |
| ODE     | U3               | A3                    |
| SCS     | U4               | A4                    |
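
To illustrate how transition probabilities are estimated, here is a minimal sketch on two made-up dialogues. It assumes the label scheme used in this notebook (`U_` for user actions, `S_` for agent actions) and omits the `Start` state for brevity.

```python
from collections import Counter, defaultdict
from typing import Dict, List

# Toy dialogues (made up for illustration) using the notebook's label scheme:
# "U_" prefixes user actions and "S_" prefixes agent actions.
toy_dialogues: List[List[str]] = [
    ["U_Q", "S_R", "U_F", "S_A"],
    ["U_Q", "S_A", "U_F", "S_A"],
]

# Count transitions between consecutive actions; the last action leads to End.
counts: Dict[str, Counter] = defaultdict(Counter)
for dialogue in toy_dialogues:
    for current_action, next_action in zip(dialogue, dialogue[1:]):
        counts[current_action][next_action] += 1
    counts[dialogue[-1]]["End"] += 1

# Normalize the counts of each state into a probability distribution.
probs = {
    state: {nxt: c / sum(nexts.values()) for nxt, c in nexts.items()}
    for state, nexts in counts.items()
}

# The agent policy is conditioned on user actions, and vice versa.
agent_policy = {s: p for s, p in probs.items() if s.startswith("U_")}
user_policy = {s: p for s, p in probs.items() if s.startswith("S_")}
print(agent_policy["U_Q"])
```

In this toy case, the agent answers or requests with equal probability after a query, while the user ends the conversation after two out of three answers.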


In [3]:
class UserPopulation:
    def __init__(
        self,
        name: str,
        patience: float,
        inclination: float,
        transition_probabilities: Optional[ParticipantTransitionProbs] = None,
    ) -> None:
        """Initializes a user population.

        Args:
            name: Name of the user population.
            patience: User's patience.
            inclination: User's inclination towards goal completion.
            transition_probabilities: Transition probabilities. Defaults to None.
        """
        self.name = name
        self.patience = patience
        self.inclination = inclination
        self.transition_probabilities = transition_probabilities

    def add_historical_dialogues(self, dialogues: List[List[str]]) -> None:
        """Adds historical dialogues to the user population.

        Args:
            dialogues: List of dialogues.
        """
        self.historical_dialogues = dialogues
        self.historical_outcomes = [
            predict_dialogue_outcome(self.patience, self.inclination, dialogue)
            for dialogue in dialogues
        ]

    def get_user_actions(self) -> List[str]:
        """Returns the list of possible user actions."""
        user_actions = set()
        for a_action in self.transition_probabilities.keys():
            for u_action in self.transition_probabilities[a_action].keys():
                user_actions.add(u_action)
        return list(user_actions)

    def get_agent_actions(self) -> List[str]:
        """Returns the list of possible agent actions."""
        return list(self.transition_probabilities.keys()) + ["End"]

In [4]:
def preprocess_dialogues(utterances: pd.DataFrame) -> List[List[str]]:
    """Preprocesses utterances to get dialogues.

    Args:
        utterances: All utterances in dataset.

    Returns:
        List of dialogues, each dialogue is a list of utterances.
    """
    dialogues = []
    case = 0
    dialogue = []

    for _, utterance in utterances.iterrows():
        actions = np.unique(
            [a[0] for a in utterance["new"].split("+")]
        ).tolist()
        if utterance["case ID"] != case:
            dialogues.append(dialogue)
            dialogue = []
            case = utterance["case ID"]
        elif "Hello" not in utterance["new"] and "Bye" not in utterance["new"]:
            if len(dialogue) > 0 and dialogue[-1].startswith(
                f"{utterance['resource']}_"
            ):
                prev_actions = [a[-1] for a in dialogue.pop(-1).split("+")]
                actions = prev_actions + actions

            dialogue.append(
                "+".join(
                    [
                        f"{utterance['resource']}_{action[0]}"
                        for action in np.unique(actions)
                    ]
                )
            )

    dialogues = list(filter(None, dialogues))
    return dialogues

In [5]:
def get_transition_probabilities(
    dialogues: List[List[str]],
) -> Dict[str, Dict[str, float]]:
    """Gets transition probabilities from a list of dialogues.

    Args:
        dialogues: Dialogues where each dialogue is a list of actions.

    Returns:
        Transition probabilities to the next action for each action.
    """
    transitions = defaultdict(lambda: defaultdict(int))

    for dialogue in dialogues:
        for i in range(len(dialogue) - 1):
            current_action = dialogue[i]
            next_action = dialogue[i + 1]
            if i == 0:
                transitions["Start"][current_action] += 1

            transitions[current_action][next_action] += 1

        transitions[dialogue[-1]]["End"] += 1

    probabilities = {}
    for action in transitions.keys():
        total = sum(transitions[action].values())
        if total > 0:
            probabilities[action] = {
                next_action: count / total
                for next_action, count in transitions[action].items()
            }
        else:
            probabilities[action] = {}

    return probabilities


def get_participants_transition_probs(
    transition_probs: Dict[str, Dict[str, float]],
) -> Tuple[ParticipantTransitionProbs, ParticipantTransitionProbs]:
    """Gets the transition probabilities for each participant.

    Args:
        transition_probs: Transition probabilities for all actions.

    Returns:
        Transition probabilities for each participant.
    """
    user_transition_probs = {}
    agent_transition_probs = {}
    for state, transition in transition_probs.items():
        if state.startswith("U_"):
            agent_transition_probs[state] = transition
        elif state.startswith("S_"):
            user_transition_probs[state] = transition

    return user_transition_probs, agent_transition_probs

In [6]:
USER_POPULATIONS = {}
AGENT_POPULATIONS = {}

datasets = [
    (
        "U1",
        -0.9,
        -0.9,
        "A1",
        "data/annotated_datasets/1_dstc1_updated.csv",
    ),  # Impatient and critical user
    (
        "U2",
        0.9,
        -0.9,
        "A2",
        "data/annotated_datasets/2_dstc2_updated.csv",
    ),  # Patient and critical user
    (
        "U3",
        -0.9,
        0.9,
        "A3",
        "data/annotated_datasets/5_ode_updated.csv",
    ),  # Impatient and cooperative user
    (
        "U4",
        0.9,
        0.9,
        "A4",
        "data/annotated_datasets/4_scs_updated.csv",
    ),  # Patient and cooperative user
    (
        "U5",
        1e-5,
        1e-5,
        "A5",
        "data/annotated_datasets/6_mgshopdial_updated.csv",
    ),  # Neutral user
]

In [8]:
data_stats = {}

for user_pop, patience, inclination, agent, path in datasets:
    print(f"Processing {path}")
    data = pd.read_csv(path)
    data = data.dropna(subset=["new"])
    dialogues = preprocess_dialogues(data)

    # Compute statistics on the dialogues: avg. # utterance and std dev
    num_utterances = [len(dialogue) for dialogue in dialogues]
    data_stats[f"D({user_pop}, {agent})"] = {
        "# dialogues": len(dialogues),
        "Avg. # utterances": mean(num_utterances),
        "Std. dev. # utterances": stdev(num_utterances),
    }

    transition_probabilities = get_transition_probabilities(dialogues)
    user_transition_probs, agent_transition_probs = (
        get_participants_transition_probs(transition_probabilities)
    )

    population = UserPopulation(
        user_pop, patience, inclination, user_transition_probs
    )
    population.add_historical_dialogues(dialogues)
    USER_POPULATIONS[user_pop] = population

    AGENT_POPULATIONS[agent] = {
        "transition_probabilities": agent_transition_probs,
    }

Processing data/annotated_datasets/1_dstc1_updated.csv
Processing data/annotated_datasets/2_dstc2_updated.csv
Processing data/annotated_datasets/5_ode_updated.csv
Processing data/annotated_datasets/4_scs_updated.csv
Processing data/annotated_datasets/6_mgshopdial_updated.csv


Dialogues statistics


In [9]:
pd.DataFrame(data_stats).transpose().style.format(precision=3)

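| Dataset   | # dialogues | Avg. # utterances | Std. dev. # utterances |
| --------- | ----------- | ----------------- | ---------------------- |
| D(U1, A1) | 15577       | 24.217            | 22.145                 |
| D(U2, A2) | 2117        | 10.171            | 4.497                  |
| D(U3, A3) | 25          | 15.0              | 8.679                  |
| D(U4, A4) | 38          | 1.579             | 0.5                    |
| D(U5, A5) | 63          | 20.159            | 9.474                  |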


In [10]:
del (
    data,
    dialogues,
    num_utterances,
    transition_probabilities,
    user_transition_probs,
    agent_transition_probs,
    data_stats,
)

## Generation of synthetic dialogues


In [11]:
def sample_next_action(
    current_action: str, transition_probs: ParticipantTransitionProbs
) -> str:
    """Samples the next action based on transition probabilities.

    Args:
        current_action: Current action.
        transition_probs: Transition probabilities.

    Returns:
        Next action.
    """
    next_actions = list(transition_probs[current_action].keys())
    probabilities = list(transition_probs[current_action].values())
    sampled_action = np.random.choice(next_actions, p=probabilities)
    return sampled_action


def sample_dialogue(
    agent_transition_probs: ParticipantTransitionProbs,
    user_transition_probs: ParticipantTransitionProbs,
) -> List[str]:
    """Samples a dialogue.

    Args:
        agent_transition_probs: Transition probabilities for the agent.
        user_transition_probs: Transition probabilities for the user.

    Returns:
        Dialogue as list of actions.
    """
    dialogue = []
    is_finished = False

    current_action = random.choice(
        list(agent_transition_probs.keys())
        + list(user_transition_probs.keys())
    )
    dialogue.append(current_action)
    while not is_finished:
        try:
            if current_action.startswith("U_"):
                current_action = sample_next_action(
                    current_action, agent_transition_probs
                )
            else:
                current_action = sample_next_action(
                    current_action, user_transition_probs
                )
            if current_action == "End":
                is_finished = True
                break
            dialogue.append(current_action)
        except KeyError:
            current_action = current_action.split("+")[-1]

    return dialogue


def sample_dialogues(
    agent_transition_probs: ParticipantTransitionProbs,
    user_transition_probs: ParticipantTransitionProbs,
    num_dialogues: int,
    patience: float,
    inclination: float,
) -> List[Tuple[List[str], bool]]:
    """Samples dialogues.

    Args:
        agent_transition_probs: Transition probabilities for the agent.
        user_transition_probs: Transition probabilities for the user.
        num_dialogues: Number of dialogues to sample.
        patience: User's patience.
        inclination: User's inclination towards goal completion.

    Returns:
        Dialogues with success status.
    """
    dialogues = []
    for _ in range(num_dialogues):
        dialogue = sample_dialogue(
            agent_transition_probs, user_transition_probs
        )

        success = predict_dialogue_outcome(patience, inclination, dialogue)
        dialogues.append((dialogue, success))

    return dialogues
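
The alternation between the two policies can be sketched on toy data. The policies below are hypothetical, and the sketch differs from the functions above in two labeled ways: it uses the standard library's `random.choices` instead of NumPy, and it adds a `max_len` cap (an assumption, not present in the original) to guarantee termination.

```python
import random
from typing import Dict, List

Policy = Dict[str, Dict[str, float]]

# Hypothetical toy policies: each maps the other participant's last action to
# a distribution over this participant's next action.
toy_agent_policy: Policy = {
    "U_Q": {"S_A": 0.7, "S_R": 0.3},
    "U_F": {"S_A": 0.5, "End": 0.5},
}
toy_user_policy: Policy = {
    "S_A": {"U_F": 0.6, "End": 0.4},
    "S_R": {"U_F": 1.0},
}


def sample_toy_dialogue(
    agent_policy: Policy,
    user_policy: Policy,
    first_action: str,
    rng: random.Random,
    max_len: int = 50,
) -> List[str]:
    """Alternates between the two policies until "End" or max_len actions."""
    dialogue = [first_action]
    current = first_action
    while len(dialogue) < max_len:
        # A user action is followed by an agent action, and vice versa.
        policy = agent_policy if current.startswith("U_") else user_policy
        next_actions = list(policy[current].keys())
        weights = list(policy[current].values())
        current = rng.choices(next_actions, weights=weights, k=1)[0]
        if current == "End":
            break
        dialogue.append(current)
    return dialogue


toy_dialogue = sample_toy_dialogue(
    toy_agent_policy, toy_user_policy, "U_Q", random.Random(0)
)
print(toy_dialogue)
```

Passing an explicit `random.Random` instance keeps the sampling reproducible without touching the global random state.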

## Metrics

This part contains the methods to compute the metrics associated with the training and evaluation objectives.


In [12]:
from scipy.spatial import distance
from rouge_score import rouge_scorer
from itertools import product

### Training

We choose to use Jensen-Shannon divergence (JSD) and ROUGE-L as metrics to assess the similarity between the user population and simulated user populations. These allow us to make an assessment at the utterance- and dialogue-level respectively.
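
For intuition, the Jensen-Shannon divergence can be computed by hand on a toy pair of next-action distributions. This is a standard-library sketch with made-up values, not the implementation used below. Note that `scipy.spatial.distance.jensenshannon` returns the Jensen-Shannon distance, that is, the square root of the divergence.

```python
import math
from typing import List


def kl_divergence(p: List[float], q: List[float]) -> float:
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: List[float], q: List[float]) -> float:
    """Jensen-Shannon divergence (base 2) between two distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Toy next-action distributions over the same three actions (made-up values).
reference = [0.5, 0.5, 0.0]
simulated = [0.5, 0.0, 0.5]

divergence = js_divergence(reference, simulated)
distance = math.sqrt(divergence)  # Jensen-Shannon distance
print(f"JS divergence = {divergence:.3f}, JS distance = {distance:.3f}")
```

With base 2, both quantities are bounded between 0 (identical distributions) and 1 (disjoint supports).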


In [13]:
def compute_jsd(
    user_policy: ParticipantTransitionProbs,
    simulated_user_policy: ParticipantTransitionProbs,
) -> float:
    """Computes Jensen-Shannon divergence between user and simulated user
    policies.

    It computes the Jensen-Shannon divergence between the transition
    probabilities for each state and then averages them. Epsilon is added to
    avoid division by zero.

    Args:
        user_policy: User policy.
        simulated_user_policy: Simulated user policy.

    Returns:
        Jensen-Shannon divergence.
    """
    epsilon = 1e-9
    total_jsd = 0.0
    for state, transitions_probabilities in user_policy.items():
        # Align the simulated distribution with the reference support without
        # mutating the caller's policy; epsilon replaces missing transitions.
        simulated_transitions = simulated_user_policy.get(state, {})
        probabilities = np.array(list(transitions_probabilities.values()))
        simulated_probabilities = np.array(
            [
                simulated_transitions.get(k, epsilon)
                for k in transitions_probabilities.keys()
            ]
        )

        # Note: SciPy returns the Jensen-Shannon distance, i.e., the square
        # root of the divergence, and normalizes both input vectors.
        total_jsd += distance.jensenshannon(
            probabilities, simulated_probabilities, base=2
        )
    return total_jsd / len(user_policy.keys())


def compute_rouge_score(
    historical_dialogues: List[List[str]], simulated_dialogues: List[List[str]]
) -> float:
    """Computes ROUGE-L score between historical and simulated dialogues.

    It computes the average ROUGE-L score between all pairs of historical and
    simulated dialogues.

    Args:
        historical_dialogues: Historical dialogues.
        simulated_dialogues: Simulated dialogues.

    Returns:
        ROUGE-L score.
    """
    historical_dialogues = [" ".join(d) for d in historical_dialogues]
    simulated_dialogues = [" ".join(d) for d in simulated_dialogues]
    total_score = 0.0
    scorer = rouge_scorer.RougeScorer(["rougeL"])
    pairs = list(product(historical_dialogues, simulated_dialogues))
    for h, s in pairs:
        total_score += scorer.score(h, s)["rougeL"].fmeasure
    return total_score / len(pairs)
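
To make the dialogue-level metric concrete, here is a minimal sketch of ROUGE-L based on the longest common subsequence (LCS). It treats each action label as one token, which is an assumption: the `rouge_score` package applies its own tokenization, so its values may differ from this sketch.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence (dynamic programming)."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rouge_l_f1(reference: List[str], candidate: List[str]) -> float:
    """ROUGE-L F-measure with equal weight on precision and recall."""
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


# Made-up dialogues: they share the subsequence U_Q, U_F, S_A.
historical = ["U_Q", "S_R", "U_F", "S_A"]
simulated = ["U_Q", "S_A", "U_F", "S_A"]
score = rouge_l_f1(historical, simulated)
print(round(score, 3))
```

Because the LCS does not require adjacency, the score rewards dialogues that follow the same overall order of actions even when individual turns differ.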

### Evaluation

We use the success rate as the performance metric to evaluate the conversational agents.


In [14]:
def compute_success_rate(successes: List[int]) -> float:
    """Computes success rate.

    Args:
        successes: Successes.

    Returns:
        Success rate.
    """
    return sum(successes) / len(successes)

## Leave-one-out cross-validation

In this part, we perform a leave-one-out experiment to answer the following question: is the optimal user simulator for training also the best for evaluation, and vice versa?
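
One way to read this question off the computed metrics is sketched below. The metric values are hypothetical (not results from the paper), and the selection criteria are assumptions: the highest ROUGE-L for the training objective and the smallest absolute difference in success rate for the evaluation objective.

```python
from typing import Dict

# Hypothetical metric values for one reference pair (NOT results from the
# paper); keys are simulated user populations.
metrics: Dict[str, Dict[str, float]] = {
    "U2": {"ROUGE-L": 0.40, "Abs. diff. success rate": 0.10},
    "U3": {"ROUGE-L": 0.55, "Abs. diff. success rate": 0.25},
    "U4": {"ROUGE-L": 0.30, "Abs. diff. success rate": 0.05},
}

# Assumption: the best simulator for training is the most similar one.
best_for_training = max(metrics, key=lambda name: metrics[name]["ROUGE-L"])
# Assumption: the best simulator for evaluation yields the closest success rate.
best_for_evaluation = min(
    metrics, key=lambda name: metrics[name]["Abs. diff. success rate"]
)

same_simulator = best_for_training == best_for_evaluation
print(best_for_training, best_for_evaluation, same_simulator)
```

In this made-up case the two objectives select different simulators, which is exactly the situation the experiment is designed to detect.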


In [15]:
import time

In [None]:
participant_pairs = [
    ("U1", "A1"),
    ("U2", "A2"),
    ("U3", "A3"),
    ("U4", "A4"),
    ("U5", "A5"),
]
# participant_pairs = [("U2", "A2"), ("U3", "A3"), ("U4", "A4"), ("U5", "A5")]
num_synthetic_dialogues = 500

results = defaultdict(dict)

for user_pop, agent in participant_pairs:
    print(f"Reference - {user_pop}, {agent}")
    user_population = USER_POPULATIONS[user_pop]
    historical_success_rate = compute_success_rate(
        user_population.historical_outcomes
    )
    for _, simulated_user_population in USER_POPULATIONS.items():
        if user_pop == simulated_user_population.name:
            continue

        print(
            f"{time.ctime()} - Simulated user population: {simulated_user_population.name}"
        )

        # Generate synthetic dialogues
        synthetic_dialogues_data = sample_dialogues(
            AGENT_POPULATIONS[agent]["transition_probabilities"],
            simulated_user_population.transition_probabilities,
            num_synthetic_dialogues,
            simulated_user_population.patience,
            simulated_user_population.inclination,
        )

        simulated_dialogues = []
        simulated_dialogues_success = []
        for dialogue, success in synthetic_dialogues_data:
            simulated_dialogues.append(dialogue)
            simulated_dialogues_success.append(success)

        # Compute ROUGE-L score
        rouge_l_score = compute_rouge_score(
            user_population.historical_dialogues,
            simulated_dialogues,
        )

        # Compute success rate
        success_rate = compute_success_rate(simulated_dialogues_success)

        # Absolute difference success rate
        abs_diff_success_rate = abs(success_rate - historical_success_rate)

        results[f"{user_pop}, {agent}"][simulated_user_population.name] = {
            "ROUGE-L": rouge_l_score,
            "Success rate": success_rate,
            "Abs. diff. success rate": abs_diff_success_rate,
        }

In [None]:
rows = []
for participant_pair, d in results.items():
    for simulated_user, metrics in d.items():
        rows.append(
            (
                participant_pair,
                simulated_user,
                *(value for _, value in metrics.items()),
            )
        )

summary = pd.DataFrame(
    rows,
    columns=[
        "Reference",
        "Simulated user pop.",
        "ROUGE-L",
        "Success rate",
        "Abs. diff. success rate",
    ],
)
summary.set_index(["Reference", "Simulated user pop."], inplace=True)

summary.style.format(precision=3)

Jensen-Shannon divergence


In [19]:
jsd_results = defaultdict(dict)

for user_pop1, user_pop2 in product(USER_POPULATIONS.keys(), repeat=2):
    if user_pop1 != user_pop2:
        user_policy1 = USER_POPULATIONS[user_pop1].transition_probabilities
        user_policy2 = USER_POPULATIONS[user_pop2].transition_probabilities
        jsd = compute_jsd(user_policy1, user_policy2)
        jsd_results[user_pop1][user_pop2] = jsd

In [None]:
pd.DataFrame(jsd_results).sort_index().style.format(precision=3)

## Extra experiments

In this part, we perform an additional experiment to verify that an agent trained with the optimal user simulator gets a better success rate than the untrained agent when used with the reference user population.


### Create custom environment

We create a custom Gymnasium (formerly OpenAI Gym) environment to train the conversational agent. The environment comprises a user simulator that converses with the conversational agent and provides feedback. It is important to note that the simulated user always starts the conversation.

#### Action space

The conversational agent can take four actions:

- Answer: The conversational agent answers the user simulator's query.
- Request: The conversational agent requests information from the user simulator.
- Answer+Request: Combines the request and answer actions.
- End: The conversational agent finishes the conversation.

#### Observation space

For simplicity, the observation space comprises the user simulator's possible actions:

- Feedback: The user simulator provides feedback to the conversational agent.
- Query: The user simulator requests information from the conversational agent.
- Feedback+Query: Combines the query and feedback actions.
- End: The user simulator finishes the conversation.
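
Since both spaces are discrete, actions are exchanged as integer indices. The sketch below shows the mapping between labels and indices, assuming the label lists passed to the environment later in this notebook; `encode` and `decode` are hypothetical helper names.

```python
from typing import List

# Label lists as used when instantiating the environment in this notebook.
agent_actions: List[str] = ["S_R", "S_A", "S_A+S_R", "End"]
user_actions: List[str] = ["U_F", "U_Q", "U_F+U_Q", "End"]


def encode(label: str, labels: List[str]) -> int:
    """Returns the discrete index of an action label."""
    return labels.index(label)


def decode(index: int, labels: List[str]) -> str:
    """Returns the action label of a discrete index."""
    return labels[index]


observation = encode("U_Q", user_actions)  # what the agent observes
agent_label = decode(1, agent_actions)  # what the agent's choice means
print(observation, agent_label)
```

The order of the lists therefore defines the meaning of each index and must stay the same between training and evaluation.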


In [26]:
import gymnasium as gym
from gymnasium import spaces

In [131]:
class ConversationalEnv(gym.Env):
    """Custom environment to train a dialogue policy using a user simulator."""

    metadata = {"render_modes": ["console"]}

    def __init__(
        self,
        agent_actions: List[str],
        user_actions: List[str],
        simulated_user_population: UserPopulation,
        max_nb_utterances: int = 100,
    ) -> None:
        """Initializes the conversational environment.

        Args:
            agent_actions: List of possible actions for the agent.
            user_actions: List of possible actions for the user.
            simulated_user_population: Simulated user population.
            max_nb_utterances: Maximum number of utterances in the dialogue.
              Defaults to 100.
        """
        super(ConversationalEnv, self).__init__()

        self.agent_possible_actions = agent_actions
        self.user_possible_actions = user_actions

        # Define the action and observation space
        self.action_space = spaces.Discrete(len(self.agent_possible_actions))
        self.observation_space = spaces.Discrete(
            len(self.user_possible_actions)
        )

        self.user = simulated_user_population
        self.max_nb_utterances = max_nb_utterances

        self.agent_action = None
        self.user_action = None
        self.dialogue = []

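    def reset(
        self, seed: Optional[int] = None, options: Optional[Dict] = None
    ) -> Tuple[int, Dict]:
        """Resets the environment and returns the initial observation.

        Args:
            seed: Random seed to use for reproducibility. Defaults to None.
            options: Additional options (unused). Defaults to None.

        Returns:
            Initial observation and additional information.
        """
        super().reset(seed=seed)
        if seed is not None:
            # Re-seed only on request; a fixed default seed would make every
            # episode start with the same user action.
            random.seed(seed)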
        self.agent_action = None
        u_act = "End"
        while u_act == "End":
            u_act = random.choice(self.user_possible_actions)
        self.user_action = u_act
        self.dialogue = [self.user_action]

        return self.user_possible_actions.index(self.user_action), {}

    def _compute_reward(self) -> Tuple[float, Optional[bool]]:
        """Computes the reward based on the predicted dialogue outcome.

        Returns:
            Reward value and success flag. The flag is None while the
              conversation is still ongoing.
        """
        if self.agent_action == "End" or self.user_action == "End":
            b_success = predict_dialogue_outcome(
                self.user.patience, self.user.inclination, self.dialogue
            )
            return (1.0, True) if b_success else (-1.0, False)
        return -0.1, None

    def step(self, action: int) -> Tuple:
        """Executes the agent action and returns the next user action.

        Args:
            action: Action selected by the agent.

        Returns:
            Tuple with the next user action (observation), reward, termination
              flag, truncation flag, and additional information.
        """
        self.agent_action = self.agent_possible_actions[action]
        if self.agent_action == "End":
            # End the conversation
            is_terminated = True
            is_truncated = False
            reward, b_success = self._compute_reward()
            return (
                None,
                reward,
                is_terminated,
                is_truncated,
                {"success": b_success},
            )

        try:
            self.user_action = sample_next_action(
                self.agent_action, self.user.transition_probabilities
            )
        except KeyError:
            # Not all user populations have the combined action
            self.user_action = sample_next_action(
                self.agent_action.split("+")[-1],
                self.user.transition_probabilities,
            )

        self.dialogue.extend([self.agent_action, self.user_action])

        # Check if the conversation is terminated or truncated
        is_truncated = len(self.dialogue) >= self.max_nb_utterances
        is_terminated = self.agent_action == "End" or self.user_action == "End"

        # Compute the reward
        reward, b_success = self._compute_reward()

        next_observation = self.user_possible_actions.index(self.user_action)
        return (
            next_observation,
            reward,
            is_terminated,
            is_truncated,
            {"success": b_success},
        )

    def render(self, mode: str = "console") -> None:
        """Renders the current state of the environment.

        Raises:
            NotImplementedError: If the rendering mode is not supported.
        """
        if mode != "console":
            raise NotImplementedError("Only console rendering is supported.")
        print(f"Dialogue: {self.dialogue}")

    def close(self) -> List[str]:
        """Cleans up the environment.

        Returns:
            The dialogue history.
        """
        return self.dialogue
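
The reward scheme of this environment can be summarized with a small sketch of the undiscounted episode return. The constants mirror the values in `_compute_reward`; the function `episode_return` is a hypothetical helper for illustration only.

```python
from typing import Optional

STEP_PENALTY = -0.1  # reward for each non-terminal agent turn
SUCCESS_REWARD = 1.0  # final reward when the dialogue is predicted successful
FAILURE_REWARD = -1.0  # final reward when the dialogue is predicted to fail


def episode_return(num_agent_turns: int, success: Optional[bool]) -> float:
    """Undiscounted return of an episode with `num_agent_turns` agent actions.

    `success` is None for an episode that is truncated without terminating,
    in which case every turn only receives the step penalty.
    """
    if num_agent_turns < 1:
        raise ValueError("An episode has at least one agent turn.")
    if success is None:
        return STEP_PENALTY * num_agent_turns
    final = SUCCESS_REWARD if success else FAILURE_REWARD
    return STEP_PENALTY * (num_agent_turns - 1) + final


print(episode_return(5, True), episode_return(5, False), episode_return(3, None))
```

The step penalty means that, for the same outcome, shorter dialogues yield a higher return.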

Utility function


In [132]:
def create_env(env_kwargs: Dict[str, Any]) -> ConversationalEnv:
    """Creates a single environment instance.

    Args:
        env_kwargs: Keyword arguments to pass to the environment constructor.

    Returns:
        Environment.
    """
    return ConversationalEnv(**env_kwargs)

##### Environment validation and testing


In [133]:
from stable_baselines3.common.env_checker import check_env

env = create_env(
    {
        "agent_actions": ["S_R", "S_A", "S_A+S_R", "End"],
        "user_actions": ["U_F", "U_Q", "U_F+U_Q", "End"],
        "simulated_user_population": USER_POPULATIONS["U1"],
        "max_nb_utterances": 30,
    }
)
check_env(env)

In [None]:
obs, _ = env.reset()
env.render()

print(
    f"Observation space: {env.observation_space}\nAction space: "
    f"{env.action_space}"
)

n_steps = 10
for _ in range(n_steps):
    action = env.action_space.sample()
    obs, reward, done, truncated, info = env.step(action)
    if done or truncated:
        env.render()
        print(f"Episode finished - Reward: {reward}")
        break

In [None]:
from stable_baselines3 import PPO, A2C
from stable_baselines3.common.vec_env import DummyVecEnv

env_kwargs = {
    "agent_actions": ["S_R", "S_A", "S_A+S_R", "End"],
    "user_actions": ["U_F", "U_Q", "U_F+U_Q", "End"],
    "simulated_user_population": USER_POPULATIONS["U1"],
    "max_nb_utterances": 30,
}
n_envs = 3
envs = DummyVecEnv([lambda: create_env(env_kwargs) for _ in range(n_envs)])

model = PPO("MlpPolicy", envs, verbose=1).learn(25_000)

env = create_env(env_kwargs)
obs = env.reset()[0]
n_steps = 10
for _ in range(n_steps):
    action, _ = model.predict(obs)
    obs, reward, done, truncated, info = env.step(action)
    if done or truncated:
        env.render()
        print(f"Episode finished - Reward: {reward}")
        break

### Training of conversational agents

We train conversational agents using different user simulators. The training process is based on the PPO algorithm.


In [135]:
from stable_baselines3.common.callbacks import (
    EvalCallback,
    StopTrainingOnNoModelImprovement,
)

In [136]:
def train_agent(
    env_kwargs: Dict[str, Any],
    population_name: str,
    n_envs: int,
    n_steps: int = int(1e10),
) -> PPO:
    """Trains an agent using the PPO algorithm.

    Args:
        env_kwargs: Keyword arguments to pass to the environment constructor.
        population_name: Name of the simulated user population.
        n_envs: Number of parallel environments.
        n_steps: Number of training steps. Defaults to 1e10.

    Returns:
        Trained agent.
    """
    envs = DummyVecEnv([lambda: create_env(env_kwargs) for _ in range(n_envs)])

    stop_train_callback = StopTrainingOnNoModelImprovement(
        max_no_improvement_evals=3, min_evals=5, verbose=1
    )

    # Eval environment
    eval_env = create_env(env_kwargs)
    eval_callback = EvalCallback(
        eval_env,
        best_model_save_path=f"./models/{population_name}/",
        log_path="./logs",
        eval_freq=500,
        callback_after_eval=stop_train_callback,
        deterministic=True,
        render=False,
    )

    model = PPO("MlpPolicy", envs, tensorboard_log="./logs")
    model.learn(n_steps, callback=eval_callback, tb_log_name=population_name)
    return model

##### Training plots

Start TensorBoard to check the training plots.


In [None]:
%load_ext tensorboard

In [None]:
%tensorboard --logdir ./logs

#### Training


In [141]:
TRAINED_POLICIES = {}

In [None]:
for pop_name, user_pop in USER_POPULATIONS.items():
    print(f"Training agent for {pop_name}")
    env_kwargs = {
        "agent_actions": ["S_R", "S_A", "S_A+S_R", "End"],
        "user_actions": ["U_F", "U_Q", "U_F+U_Q", "End"],
        "simulated_user_population": user_pop,
        "max_nb_utterances": 35,
    }
    model = train_agent(env_kwargs, pop_name, n_envs=3, n_steps=5000)
    TRAINED_POLICIES[f"A*_{pop_name}"] = model

#### Compute success rate

For each trained conversational agent, we compute the success rate when interacting with the user populations.

In [198]:
agents_success_rate = defaultdict(lambda: defaultdict(list))
num_eval_episodes = 500

In [199]:
for _ in range(100):
    for agent_name, model in TRAINED_POLICIES.items():
        for pop_name, user_pop in USER_POPULATIONS.items():
            env_kwargs = {
                "agent_actions": ["S_R", "S_A", "S_A+S_R", "End"],
                "user_actions": ["U_F", "U_Q", "U_F+U_Q", "End"],
                "simulated_user_population": user_pop,
                "max_nb_utterances": 30,
            }
            env = create_env(env_kwargs)
            obs = env.reset()[0]
            successes = []
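            # Note: this loop runs `num_eval_episodes` environment steps;
            # an outcome is recorded whenever an episode ends.
            for _ in range(num_eval_episodes):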
                action, _ = model.predict(obs)
                obs, reward, done, truncated, info = env.step(action)
                if done or truncated:
                    successes.append(int(info["success"] == True))
                    obs = env.reset()[0]
            success_rate = compute_success_rate(successes)
            agents_success_rate[agent_name][pop_name].append(success_rate)
        

In [None]:
pd.DataFrame(
    {
        agent_name: {
            pop_name: f"{round(mean(rates), 3)} +/- {round(stdev(rates), 3)}"
            for pop_name, rates in success_rates.items()
        }
        for agent_name, success_rates in agents_success_rate.items()
    }
).style.format(precision=3)

Statistical analysis

1. Perform Kruskal-Wallis H-test to check if there is a significant difference between the success rates of the user populations.
2. Perform Mann-Whitney U-test for pairwise comparisons between the user populations.
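
For intuition about the pairwise test, the U statistic can be computed by hand. This is a standard-library sketch on hypothetical success rates; it only computes the statistic, while the p-value is left to `scipy.stats.mannwhitneyu` as in the cell below.

```python
from typing import List


def mann_whitney_u(x: List[float], y: List[float]) -> float:
    """U statistic of sample x: pairs with x_i > y_j, ties counting 0.5."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u


# Hypothetical success rates of two agents on the same user population.
agent_a = [0.62, 0.58, 0.65, 0.60]
agent_b = [0.51, 0.55, 0.58, 0.49]

u_a = mann_whitney_u(agent_a, agent_b)
u_b = mann_whitney_u(agent_b, agent_a)
print(u_a, u_b)
```

The two statistics always sum to the number of pairs, so a value of `u_a` close to that total indicates that the first agent's success rates are almost always higher.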

In [201]:
from scipy import stats

sst_results = defaultdict(dict)
alpha = 0.05

for user_pop in USER_POPULATIONS.keys():
    success_rates = []
    for agent_name, success_rate in agents_success_rate.items():
        success_rates.append(success_rate[user_pop])
    kruskal_result = stats.kruskal(*success_rates)
    sst_results[user_pop]["Kruskal"] = kruskal_result.pvalue

    if kruskal_result.pvalue < alpha:
        for i in range(len(success_rates)-1):
            for j in range(i+1, len(success_rates)):
                mannwhitney_result = stats.mannwhitneyu(
                    success_rates[i], success_rates[j]
                )
                sst_results[user_pop][f"Mann-Whitney {i+1}-{j+1}"] = mannwhitney_result.pvalue
    

In [None]:
pd.DataFrame(sst_results).style.format(precision=3)