# Theoretical analysis ICTIR '24

This notebook contains the code to reproduce the experiments presented in the paper.

Note that we focus on the dialogue policy and do not consider the other modules such as NLU and NLG of the conversational agent. The interaction model QRFA is used to build the dialogue policies for both the user simulators and the conversational agent. The user's actions are query and feedback, while the agent's actions are request and answer. Additionally, there is an action to finish the conversation that is shared by both.

## Data

For the experiments, we use the annotated datasets released along with the QRFA model. This choice is motivated by the fact that these datasets comprise different user behaviors to complete an information-seeking task. Therefore, we assume a certain level of realism in the user simulators and conversational agents built using these datasets. The table below introduces the datasets.

| Dataset | # Dialogues |
| ------- | ----------- |
| DSTC1   | 15,577      |
| DSTC2   | 2,117       |
| ODE     | 25          |
| SCS     | 38          |


## Methodology

For each dataset, we build a user simulator and a conversational agent.
Based on the idea of leave-one-out cross-validation, we study the implication relationships between the objectives of training and evaluation: the user population and agent associated with a dataset serve as the reference, and the other user populations serve as simulated user populations. For each reference pair, we execute the following steps:

1. Get transition probabilities from the reference user population and conversational agent.
2. Train a success predictor for each simulated user population.
3. Generate synthetic dialogues between the reference agent and the simulated user populations. Each dialogue is given a success score using the success predictors.
4. Compute metrics associated with the objectives of training and evaluation.
5. Identify the best user simulator for training and evaluation based on the computed metrics.
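
The leave-one-out setup above can be sketched as follows. This is only an illustration: `PAIRS` and `leave_one_out_plan` are hypothetical names that do not appear elsewhere in this notebook, and the mapping simply mirrors the dataset-to-participant naming used below (U1/A1 for DSTC1, and so on).

```python
from typing import Dict, List, Tuple

# Hypothetical mapping from dataset to (user population, agent).
PAIRS: Dict[str, Tuple[str, str]] = {
    "DSTC1": ("U1", "A1"),
    "DSTC2": ("U2", "A2"),
    "ODE": ("U3", "A3"),
    "SCS": ("U4", "A4"),
}


def leave_one_out_plan(
    pairs: Dict[str, Tuple[str, str]]
) -> Dict[Tuple[str, str], List[str]]:
    """Maps each reference (user, agent) pair to its simulated user populations."""
    plan = {}
    for reference_user, reference_agent in pairs.values():
        # Every other user population acts as a simulated user population.
        simulated = [user for user, _ in pairs.values() if user != reference_user]
        plan[(reference_user, reference_agent)] = simulated
    return plan


plan = leave_one_out_plan(PAIRS)
for reference, simulated_users in plan.items():
    print(reference, "->", simulated_users)
```

Each reference pair is then run through steps 1 to 5 with every simulated user population in its list.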


In [1]:
!pip install simpletransformers



In [2]:
from typing import List, Tuple, Dict
import pandas as pd
import random
import numpy as np
from collections import defaultdict
from statistics import mean, stdev
from sklearn.model_selection import train_test_split
from simpletransformers.classification import (
    ClassificationModel,
    ClassificationArgs,
)


ParticipantTransitionProbs = Dict[str, Dict[str, float]]



## User populations and conversational agents

From the annotated dialogues, we can extract the transition probabilities for the user populations and conversational agents. The table below summarizes the different user populations and conversational agents.

| Dataset | User populations | Conversational agents |
| ------- | ---------------- | --------------------- |
| DSTC1   | U1               | A1                    |
| DSTC2   | U2               | A2                    |
| ODE     | U3               | A3                    |
| SCS     | U4               | A4                    |


In [3]:
class UserPopulation:
    def __init__(
        self,
        name: str,
        transition_probabilities: ParticipantTransitionProbs = None,
    ) -> None:
        """Initializes a user population.

        Args:
            name: Name of the user population.
            transition_probabilities: Transition probabilities. Defaults to None.
        """
        self.name = name
        self.transition_probabilities = transition_probabilities

    def add_historical_dialogues(self, dialogues: List[List[str]]) -> None:
        """Adds historical dialogues to the user population.

        Args:
            dialogues: List of dialogues.
        """
        self.historical_dialogues = dialogues

    def get_user_actions(self) -> List[str]:
        """Returns the list of possible user actions."""
        user_actions = set()
        for a_action in self.transition_probabilities.keys():
            for u_action in self.transition_probabilities[a_action].keys():
                user_actions.add(u_action)
        return list(user_actions)

    def get_agent_actions(self) -> List[str]:
        """Returns the list of possible agent actions."""
        return list(self.transition_probabilities.keys()) + ["End"]

    def set_success_predictor(self, success_predictor: ClassificationModel) -> None:
        """Sets the success predictor.

        Args:
            success_predictor: Success predictor model.
        """
        self.success_predictor = success_predictor

In [4]:
def preprocess_dialogues(utterances: pd.DataFrame) -> List[List[str]]:
    """Preprocesses utterances to get dialogues.

    Args:
        utterances: All utterances in dataset.

    Returns:
        List of dialogues, each dialogue is a list of utterances.
    """
    dialogues = []
    case = 0
    dialogue = []

    for _, utterance in utterances.iterrows():
        actions = np.unique(
            [a[0] for a in utterance["new"].split("+")]
        ).tolist()
        if utterance["case ID"] != case:
            dialogues.append(dialogue)
            dialogue = []
            case = utterance["case ID"]
        elif "Hello" not in utterance["new"] and "Bye" not in utterance["new"]:
            if len(dialogue) > 0 and dialogue[-1].startswith(
                f"{utterance['resource']}_"
            ):
                prev_actions = [a[-1] for a in dialogue.pop(-1).split("+")]
                actions = prev_actions + actions

            dialogue.append(
                "+".join(
                    [
                        f"{utterance['resource']}_{action[0]}"
                        for action in np.unique(actions)
                    ]
                )
            )

    dialogues = list(filter(None, dialogues))
    return dialogues

In [5]:
def get_transition_probabilities(
    dialogues: List[List[str]],
) -> Dict[str, Dict[str, float]]:
    """Gets transition probabilities for a list of dialogues.

    Args:
        dialogues: Dialogues where each dialogue is a list of actions.

    Returns:
        Transition probabilities for each action in the dialogues.
    """
    transitions = defaultdict(lambda: defaultdict(int))

    for dialogue in dialogues:
        for i in range(len(dialogue) - 1):
            current_action = dialogue[i]
            next_action = dialogue[i + 1]
            if i == 0:
                transitions["Start"][current_action] += 1

            transitions[current_action][next_action] += 1

        transitions[dialogue[-1]]["End"] += 1

    probabilities = {}
    for action in transitions.keys():
        total = sum(transitions[action].values())
        if total > 0:
            probabilities[action] = {
                next_action: count / total
                for next_action, count in transitions[action].items()
            }
        else:
            probabilities[action] = {}

    return probabilities


def get_participants_transition_probs(
    transition_probs: Dict[str, Dict[str, float]],
) -> Tuple[ParticipantTransitionProbs, ParticipantTransitionProbs]:
    """Gets the transition probabilities for each participant.

    Args:
        transition_probs: Transition probabilities for all actions.

    Returns:
        User and agent transition probabilities, in this order.
    """
    user_transition_probs = {}
    agent_transition_probs = {}
    for state, transition in transition_probs.items():
        if state.startswith("U_"):
            # The agent acts after a user action ("U_" prefix).
            agent_transition_probs[state] = transition
        elif state.startswith("S_"):
            # The user acts after an agent action ("S_" prefix).
            user_transition_probs[state] = transition

    return user_transition_probs, agent_transition_probs

In [6]:
USER_POPULATIONS = {}
AGENT_POPULATIONS = {}

datasets = [
    # ("U1", "A1", "data/annotated_datasets/1_dstc1_updated.csv"),
    ("U2", "A2", "data/annotated_datasets/2_dstc2_updated.csv"),
    ("U3", "A3", "data/annotated_datasets/5_ode_updated.csv"),
    ("U4", "A4", "data/annotated_datasets/4_scs_updated.csv"),
]

In [7]:
data_stats = {}

for user_pop, agent, path in datasets:
    print(f"Processing {path}")
    data = pd.read_csv(path)
    data = data.dropna(subset=["new"])
    dialogues = preprocess_dialogues(data)

    # Compute statistics on the dialogues: avg. # utterance and std dev
    num_utterances = [len(dialogue) for dialogue in dialogues]
    data_stats[f"D({user_pop}, {agent})"] = {
        "# dialogues": len(dialogues),
        "Avg. # utterances": mean(num_utterances),
        "Std. dev. # utterances": stdev(num_utterances),
    }

    transition_probabilities = get_transition_probabilities(dialogues)
    user_transition_probs, agent_transition_probs = (
        get_participants_transition_probs(transition_probabilities)
    )

    population = UserPopulation(user_pop, user_transition_probs)
    population.add_historical_dialogues(dialogues)
    USER_POPULATIONS[user_pop] = population

    AGENT_POPULATIONS[agent] = {
        "transition_probabilities": agent_transition_probs,
    }

Processing data/annotated_datasets/2_dstc2_updated.csv
Processing data/annotated_datasets/5_ode_updated.csv
Processing data/annotated_datasets/4_scs_updated.csv


Dialogue statistics


In [8]:
pd.DataFrame(data_stats).transpose().style.format(precision=3)

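| Dataset   | # dialogues | Avg. # utterances | Std. dev. # utterances |
| --------- | ----------- | ----------------- | ---------------------- |
| D(U2, A2) | 2117.0      | 10.171            | 4.497                  |
| D(U3, A3) | 25.0        | 15.0              | 8.679                  |
| D(U4, A4) | 38.0        | 1.579             | 0.5                    |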


In [9]:
del data, dialogues, num_utterances, transition_probabilities, user_transition_probs, agent_transition_probs, data_stats

## Training of success predictors

In this part, we train a success predictor for each user population. It is a binary classifier, based on a transformer model, that predicts the success of a dialogue from its sequence of actions. Success annotations are available for the ODE and SCS datasets. For DSTC2, we assume that thank-you messages indicate a successful dialogue. For DSTC1, we manually annotate a subset of the dialogues to train the success predictor.


In [11]:
def train_success_predictor(
    train_data: pd.DataFrame,
    test_data: pd.DataFrame,
    output_dir: str,
    class_weights: List[float] = None,
) -> ClassificationModel:
    """Trains a success predictor model.

    Args:
        train_data: Training data.
        test_data: Test data.
        output_dir: Output directory to save the model.
        class_weights: Class weights for the model. Defaults to None.

    Returns:
        Success predictor model.
    """
    model_args = ClassificationArgs(
        num_train_epochs=1, overwrite_output_dir=True
    )
    model = ClassificationModel(
        "distilbert",
        "distilbert/distilbert-base-uncased",
        args=model_args,
        use_cuda=False,
        weight=class_weights,
    )

    model.train_model(train_data, output_dir=output_dir)

    # Evaluate the model
    result, _, _ = model.eval_model(test_data)
    print(f"Evaluation results: {result}")

    return model
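
The `class_weights` argument is forwarded to the classifier's `weight` parameter, which simpletransformers documents as per-label weights for the loss calculation. When successful dialogues are rare, one common heuristic is to weight each class by its inverse frequency. The sketch below illustrates that heuristic only; `inverse_frequency_weights` is a hypothetical helper, and this is not necessarily the weighting behind the results reported in this notebook.

```python
from collections import Counter
from typing import List, Sequence


def inverse_frequency_weights(
    labels: Sequence[int], num_classes: int = 2
) -> List[float]:
    """Computes balanced class weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    total = len(labels)
    return [
        # Classes absent from the data get weight 0.0 (assumed convention).
        total / (num_classes * counts[c]) if counts[c] else 0.0
        for c in range(num_classes)
    ]


# Toy labels (assumed): 9 unsuccessful dialogues and 1 successful dialogue.
toy_labels = [0] * 9 + [1]
weights = inverse_frequency_weights(toy_labels)
print(weights)
```

With this scheme, errors on the rare class contribute more to the loss than errors on the frequent class.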

U1


U2


In [12]:
dstc2_data = pd.read_csv("data/annotated_datasets/2_dstc2_updated.csv")
u2_success_labels = (
    dstc2_data.apply(
        lambda x: 1 if "be contented" in x["Sitter"] else 0, axis=1
    )
    .groupby(dstc2_data["case ID"])
    .max()
    .to_list()
)

data = pd.DataFrame(
    zip(
        [" ".join(d) for d in USER_POPULATIONS["U2"].historical_dialogues],
        u2_success_labels,
    )
)
train_data, test_data = train_test_split(
    data, test_size=0.1, random_state=123, stratify=data[1]
)

class_counts = data[1].value_counts()
class_weights = [class_counts[0] / len(data), class_counts[1] / len(data)]
output_dir = "success_predictors/dstc2"
success_predictor = train_success_predictor(
    train_data, test_data, output_dir=output_dir, class_weights=class_weights
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  0%|          | 0/3 [00:00<?, ?it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

Evaluation results: {'mcc': 0.0, 'accuracy': 0.9528301886792453, 'f1_score': 0.0, 'tp': 0, 'tn': 202, 'fp': 0, 'fn': 10, 'auroc': 0.5683168316831683, 'auprc': 0.05917786066324769, 'eval_loss': 0.012262825776512424}
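
The evaluation output above combines a high accuracy with an MCC of 0.0. The sketch below recomputes both from the reported confusion-matrix counts to show why: the predictor never outputs the positive class. The helper names are hypothetical, and returning 0.0 when the MCC denominator is zero is an assumed convention that matches the reported value.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of correctly classified dialogues."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient, 0.0 if the denominator is zero."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Counts copied from the evaluation results printed above.
counts = {"tp": 0, "tn": 202, "fp": 0, "fn": 10}
acc = accuracy(**counts)
coefficient = mcc(**counts)
recall = counts["tp"] / (counts["tp"] + counts["fn"])
print(f"accuracy={acc:.4f}, mcc={coefficient:.1f}, recall={recall:.1f}")
```

Since no dialogue is predicted as successful, `tp + fp` is zero and the MCC degenerates to 0.0, while the accuracy only reflects the share of unsuccessful dialogues in the test split.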





In [13]:
USER_POPULATIONS["U2"].set_success_predictor(success_predictor)

In [14]:
del dstc2_data, u2_success_labels, data, train_data, test_data, class_counts, class_weights

U3


In [15]:
ode_data = pd.read_csv("data/annotated_datasets/5_ode_updated.csv")
u3_success_labels = (
    ode_data.apply(
        lambda x: 1 if x["activity name"] == "success()" else 0, axis=1
    )
    .groupby(ode_data["case ID"])
    .max()
    .to_list()
)

data = pd.DataFrame(
    zip(
        [" ".join(d) for d in USER_POPULATIONS["U3"].historical_dialogues],
        u3_success_labels,
    )
)
train_data, test_data = train_test_split(
    data, test_size=0.1, random_state=123, stratify=data[1]
)

class_counts = data[1].value_counts()
class_weights = [class_counts[0] / len(data), class_counts[1] / len(data)]
output_dir = "success_predictors/ode"
success_predictor = train_success_predictor(
    train_data, test_data, output_dir=output_dir, class_weights=class_weights
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
0it [00:00, ?it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
1it [00:03,  3.35s/it]
Epochs 1/1. Running Loss:    0.5247: 100%|██████████| 3/3 [00:02<00:00,  1.37it/s]
Epoch 1 of 1: 100%|██████████| 1/1 [00:03<00:00,  3.23s/it]

Evaluation results: {'mcc': 0.0, 'accuracy': 0.6666666666666666, 'f1_score': 0.8, 'tp': 2, 'tn': 0, 'fp': 1, 'fn': 0, 'auroc': 0.5, 'auprc': 0.8333333333333333, 'eval_loss': 0.462820440530777}





In [16]:
USER_POPULATIONS["U3"].set_success_predictor(success_predictor)

In [17]:
del ode_data, u3_success_labels, data, train_data, test_data, class_counts, class_weights

U4


In [18]:
u4_success_labels = pd.read_csv("data/success_annotation/scs.csv")
u4_success_labels = u4_success_labels.apply(max, axis=1).to_list()

data = pd.DataFrame(
    zip(
        [" ".join(d) for d in USER_POPULATIONS["U4"].historical_dialogues],
        u4_success_labels,
    )
)
train_data, test_data = train_test_split(
    data, test_size=0.1, random_state=123, stratify=data[1]
)

class_counts = data[1].value_counts()
class_weights = [class_counts[0] / len(data), class_counts[1] / len(data)]
output_dir = "success_predictors/scs"
success_predictor = train_success_predictor(
    train_data, test_data, output_dir=output_dir, class_weights=class_weights
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
0it [00:00, ?it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
1it [00:02,  2.93s/it]
Epochs 1/1. Running Loss:    0.6860: 100%|██████████| 5/5 [00:02<00:00,  1.89it/s]
Epoch 1 of 1: 100%|██████████| 1/1 [00:03<00:00,  3.60s/it]

Evaluation results: {'mcc': 0.0, 'accuracy': 0.5, 'f1_score': 0.0, 'tp': 0, 'tn': 2, 'fp': 0, 'fn': 2, 'auroc': 0.25, 'auprc': 0.5, 'eval_loss': 0.7209495306015015}





In [19]:
USER_POPULATIONS["U4"].set_success_predictor(success_predictor)

In [20]:
del u4_success_labels, data, train_data, test_data, class_counts, class_weights

## Generation of synthetic dialogues


In [21]:
def sample_next_action(
    current_action: str, transition_probs: ParticipantTransitionProbs
) -> str:
    """Samples the next action based on transition probabilities.

    Args:
        current_action: Current action.
        transition_probs: Transition probabilities.

    Returns:
        Next action.
    """
    next_actions = list(transition_probs[current_action].keys())
    probabilities = list(transition_probs[current_action].values())
    sampled_action = np.random.choice(next_actions, p=probabilities)
    return sampled_action


def sample_dialogue(
    agent_transition_probs: ParticipantTransitionProbs,
    user_transition_probs: ParticipantTransitionProbs,
) -> List[str]:
    """Samples a dialogue.

    Args:
        agent_transition_probs: Transition probabilities for the agent.
        user_transition_probs: Transition probabilities for the user.

    Returns:
        Dialogue as list of actions.
    """
    dialogue = []
    is_finished = False

    current_action = random.choice(
        list(agent_transition_probs.keys())
        + list(user_transition_probs.keys())
    )
    dialogue.append(current_action)
    while not is_finished:
        try:
            if current_action.startswith("U_"):
                current_action = sample_next_action(
                    current_action, agent_transition_probs
                )
            else:
                current_action = sample_next_action(
                    current_action, user_transition_probs
                )
            if current_action == "End":
                is_finished = True
                break
            dialogue.append(current_action)
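        except KeyError:
            # Unknown composite action (e.g., "U_F+U_Q"): fall back to its
            # last atomic action. Stop if no fallback is left, to avoid an
            # infinite loop.
            fallback_action = current_action.split("+")[-1]
            if fallback_action == current_action:
                break
            current_action = fallback_action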

    return dialogue


def sample_dialogues(
    agent_transition_probs: ParticipantTransitionProbs,
    user_transition_probs: ParticipantTransitionProbs,
    success_predictor: ClassificationModel,
    num_dialogues: int,
) -> List[Tuple[List[str], bool]]:
    """Samples dialogues.

    Args:
        agent_transition_probs: Transition probabilities for the agent.
        user_transition_probs: Transition probabilities for the user.
        success_predictor: Success predictor model.
        num_dialogues: Number of dialogues to sample.

    Returns:
        Dialogues with success status.
    """
    dialogues = []
    for _ in range(num_dialogues):
        dialogue = sample_dialogue(
            agent_transition_probs, user_transition_probs
        )
        success = success_predictor.predict([" ".join(dialogue)])[0][0]
        dialogues.append((dialogue, success))

    return dialogues
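
As an illustration of the sampling procedure, the following self-contained sketch alternates between a toy agent policy and a toy user policy until the shared `End` action is drawn. The policies, the fixed start action, the `max_turns` cap, and the name `sample_toy_dialogue` are all assumptions made for this example; it uses the standard library's `random.choices` instead of NumPy to stay dependency-free.

```python
import random
from typing import Dict, List, Optional

Policy = Dict[str, Dict[str, float]]

# Toy policies (assumed values). The agent policy is keyed by user actions,
# the user policy by agent actions.
AGENT_POLICY: Policy = {
    "U_Q": {"S_R": 0.5, "S_A": 0.5},
    "U_F": {"S_A": 1.0},
}
USER_POLICY: Policy = {
    "S_R": {"U_F": 1.0},
    "S_A": {"U_Q": 0.2, "End": 0.8},
}


def sample_toy_dialogue(
    agent_policy: Policy,
    user_policy: Policy,
    start: str = "U_Q",
    max_turns: int = 50,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Samples a dialogue by alternating between the two policies."""
    rng = rng or random.Random()
    dialogue = [start]
    current = start
    while len(dialogue) < max_turns:
        # After a user action the agent acts, and vice versa.
        policy = agent_policy if current.startswith("U_") else user_policy
        if current not in policy:
            break  # No known continuation for this action.
        next_actions = list(policy[current].keys())
        weights = list(policy[current].values())
        current = rng.choices(next_actions, weights=weights, k=1)[0]
        if current == "End":
            break
        dialogue.append(current)
    return dialogue


dialogue = sample_toy_dialogue(AGENT_POLICY, USER_POLICY, rng=random.Random(0))
print(dialogue)
```

The `max_turns` cap guarantees termination even when a policy rarely or never reaches `End`.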

## Metrics

This part contains the methods to compute the metrics associated with the training and evaluation objectives.


In [22]:
from scipy.spatial import distance
from rouge_score import rouge_scorer
from itertools import product

### Training

We choose to use Jensen-Shannon divergence (JSD) and ROUGE-L as metrics to assess the similarity between the user population and simulated user populations. These allow us to make an assessment at the utterance and dialogue levels, respectively.
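
For intuition, here is a minimal, dependency-free sketch of the Jensen-Shannon divergence between two next-action distributions, using base-2 logarithms so that the value lies in [0, 1]. The function names and toy distributions are assumptions for illustration. Note that `scipy.spatial.distance.jensenshannon` returns the Jensen-Shannon distance, that is, the square root of this divergence.

```python
import math
from typing import Dict


def js_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Jensen-Shannon divergence (base 2) between two non-empty distributions."""
    keys = set(p) | set(q)

    def normalize(dist: Dict[str, float]) -> Dict[str, float]:
        total = sum(dist.get(k, 0.0) for k in keys)
        return {k: dist.get(k, 0.0) / total for k in keys}

    def kl(a: Dict[str, float], b: Dict[str, float]) -> float:
        # Terms with a[k] == 0 contribute nothing to the sum.
        return sum(a[k] * math.log2(a[k] / b[k]) for k in keys if a[k] > 0)

    p_n, q_n = normalize(p), normalize(q)
    # Mixture distribution; it is positive wherever p_n or q_n is positive.
    m = {k: 0.5 * (p_n[k] + q_n[k]) for k in keys}
    return 0.5 * kl(p_n, m) + 0.5 * kl(q_n, m)


def js_distance(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Jensen-Shannon distance: square root of the divergence."""
    return math.sqrt(max(js_divergence(p, q), 0.0))


# Toy next-action distributions after the same agent action (assumed values).
same = js_divergence({"U_Q": 0.5, "U_F": 0.5}, {"U_Q": 0.5, "U_F": 0.5})
disjoint = js_divergence({"U_Q": 1.0}, {"U_F": 1.0})
partial = js_distance({"U_Q": 0.5, "U_F": 0.5}, {"U_Q": 1.0})
print(same, disjoint, partial)
```

Working on the union of keys without modifying the inputs keeps the comparison symmetric and free of side effects.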


In [23]:
def compute_jsd(
    user_policy: ParticipantTransitionProbs,
    simulated_user_policy: ParticipantTransitionProbs,
) -> float:
    """Computes the average Jensen-Shannon distance between user and simulated
    user policies.

    For each state of the user policy, it compares the transition
    probabilities of both policies using `distance.jensenshannon`, which
    returns the Jensen-Shannon distance (the square root of the divergence),
    and then averages over states. Transitions missing from the simulated
    policy get probability epsilon. Note that the aligned distributions are
    written back into `simulated_user_policy`.

    Args:
        user_policy: User policy.
        simulated_user_policy: Simulated user policy.

    Returns:
        Average Jensen-Shannon distance over the states of the user policy.
    """
    epsilon = 1e-9
    total_jsd = 0.0
    for state, transitions_probabilities in user_policy.items():
        # Align the simulated policy with the user policy for this state;
        # transitions unseen in the simulated policy get probability epsilon.
        simulated_user_policy[state] = {
            k: simulated_user_policy.get(state, {}).get(k, epsilon)
            for k in transitions_probabilities.keys()
        }

        probabilities = np.array(list(transitions_probabilities.values()))
        simulated_probabilities = np.array(
            list(simulated_user_policy[state].values())
        )

        total_jsd += distance.jensenshannon(
            probabilities, simulated_probabilities, base=2
        )
    return total_jsd / len(user_policy.keys())


def compute_rouge_score(
    historical_dialogues: List[List[str]], simulated_dialogues: List[List[str]]
) -> float:
    """Computes ROUGE-L score between historical and simulated dialogues.

    It computes the average ROUGE-L score between all pairs of historical and
    simulated dialogues.

    Args:
        historical_dialogues: Historical dialogues.
        simulated_dialogues: Simulated dialogues.

    Returns:
        ROUGE-L score.
    """
    historical_dialogues = [" ".join(d) for d in historical_dialogues]
    simulated_dialogues = [" ".join(d) for d in simulated_dialogues]
    total_score = 0.0
    scorer = rouge_scorer.RougeScorer(["rougeL"])
    pairs = list(product(historical_dialogues, simulated_dialogues))
    for h, s in pairs:
        total_score += scorer.score(h, s)["rougeL"].fmeasure
    return total_score / len(pairs)
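
ROUGE-L is based on the longest common subsequence (LCS) between a reference and a candidate sequence. The sketch below computes the ROUGE-L F-measure directly on action sequences, treating each action label as one token. It is an illustration with hypothetical helper names, not the `rouge_score` implementation. As far as we know, the default tokenizer of `rouge_score` lowercases the text and splits on non-alphanumeric characters, so labels such as `U_Q` may be split into several tokens there and the scores can differ from this sketch.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, via dynamic programming."""
    previous = [0] * (len(b) + 1)
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_f1(reference: Sequence[str], candidate: Sequence[str]) -> float:
    """ROUGE-L F-measure between two token sequences."""
    if not reference or not candidate:
        return 0.0
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


# Toy dialogues (assumed data).
historical = ["U_Q", "S_R", "U_F", "S_A"]
simulated = ["U_Q", "S_A"]
score = rouge_l_f1(historical, simulated)
print(f"{score:.3f}")
```

Here the LCS is `U_Q, S_A`, giving a precision of 1.0, a recall of 0.5, and an F-measure of about 0.667.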

### Evaluation

We use the success rate as the performance metric to evaluate the conversational agents.


In [24]:
def compute_success_rate(successes: List[int]) -> float:
    """Computes success rate.

    Args:
        successes: Successes.

    Returns:
        Success rate.
    """
    return sum(successes) / len(successes)

## Leave-one-out cross-validation

In this part, we perform a leave-one-out experiment to answer the following question: is the optimal user simulator for training also the best for evaluation, and vice versa?


In [25]:
import transformers

transformers.logging.set_verbosity_error()

In [None]:
# participant_pairs = [("U1", "A1"), ("U2", "A2"), ("U3", "A3"), ("U4", "A4")]
participant_pairs = [("U2", "A2"), ("U3", "A3"), ("U4", "A4")]
num_synthetic_dialogues = 100

results = defaultdict(dict)

for user_pop, agent in participant_pairs:
    print(f"Reference - {user_pop}, {agent}")
    user_population = USER_POPULATIONS[user_pop]
    for _, simulated_user_population in USER_POPULATIONS.items():
        if user_pop == simulated_user_population.name:
            continue

        # Generate synthetic dialogues
        synthetic_dialogues_data = sample_dialogues(
            AGENT_POPULATIONS[agent]["transition_probabilities"],
            simulated_user_population.transition_probabilities,
            simulated_user_population.success_predictor,
            num_synthetic_dialogues,
        )

        simulated_dialogues = []
        simulated_dialogues_success = []
        for dialogue, success in synthetic_dialogues_data:
            simulated_dialogues.append(dialogue)
            simulated_dialogues_success.append(success)

        # Compute ROUGE-L score
        rouge_l_score = compute_rouge_score(
            user_population.historical_dialogues,
            simulated_dialogues,
        )

        # Compute success rate
        success_rate = compute_success_rate(simulated_dialogues_success)

        results[f"{user_pop}, {agent}"][simulated_user_population.name] = {
            "ROUGE-L": rouge_l_score,
            "Success rate": success_rate,
        }

In [27]:
rows = []
for participant_pair, d in results.items():
    for simulated_user, metrics in d.items():
        rows.append((participant_pair, simulated_user, *(value for _, value in metrics.items())))

summary = pd.DataFrame(rows, columns=["Reference", "Simulated user pop.","ROUGE-L", "Success rate"])
summary.set_index(["Reference", "Simulated user pop."], inplace=True)

summary.style.format(precision=3)


| Reference | Simulated user pop. | ROUGE-L | Success rate |
| --------- | ------------------- | ------- | ------------ |
| U2, A2    | U3                  | 0.523   | 1.0          |
| U2, A2    | U4                  | 0.448   | 0.0          |
| U3, A3    | U2                  | 0.523   | 0.0          |
| U3, A3    | U4                  | 0.426   | 0.0          |
| U4, A4    | U2                  | 0.527   | 0.0          |
| U4, A4    | U3                  | 0.445   | 1.0          |


Jensen-Shannon divergence


In [28]:
jsd_results = defaultdict(dict)

for user_pop1, user_pop2 in product(USER_POPULATIONS.keys(), repeat=2):
    if user_pop1 != user_pop2:
        user_policy1 = USER_POPULATIONS[user_pop1].transition_probabilities
        user_policy2 = USER_POPULATIONS[user_pop2].transition_probabilities
        jsd = compute_jsd(user_policy1, user_policy2)
        jsd_results[user_pop1][user_pop2] = jsd

In [29]:
pd.DataFrame(jsd_results).sort_index().style.format(precision=3)

|     | U2    | U3    | U4    |
| --- | ----- | ----- | ----- |
| U2  |       | 0.412 | 0.383 |
| U3  | 0.412 |       | 0.33  |
| U4  | 0.383 | 0.33  |       |
