```
Author : Javier Chiyah-Garcia
GitHub : https://github.com/JChiyah/what-are-you-referring-to
Date   : August 2023
Python : 3.7+
```

Notebook with experiments for the paper __'What are you referring to?' Evaluating the Ability of Multi-Modal Dialogue Models to Process Clarificational Exchanges__

You need to clone this repository as well as the original SIMMC2 repository, because the notebook uses the original dataset from there to find the clarifications. We also provide the analysed model outputs in a separate folder, so you can run the evaluation without having to train the models yourself.

Some values may differ from those in the paper, as we have improved the tagging algorithm and fixed some bugs in the code (the 'All Turns' split was skipping turns that did not have a target). The results are very similar, though somewhat lower across all models by a few decimals.

Requirements:
- Python 3.7 or above
- Numpy
- Tqdm

In [1]:
import os
import sys
import json
import glob

from tqdm import tqdm

# we assume that the simmc2 data is just outside the current folder (sibling dir)
sys.path.append('../')
# imported here to make sure it works, but used in evaluation.py
from simmc2.model.mm_dst.utils.evaluate_dst import evaluate_from_flat_list
SIMMC2_FOLDER = '../simmc2/data'

from src import *
from src import evaluation

DATA_FOLDER = 'data'

68 Clarification Exchange Tagging tests passed!


In [2]:
# read original SIMMC 2.0 data
simmc2_metadata = {}
for domain in tqdm(['fashion', 'furniture'], desc='Reading Metadata'):
	with open(os.path.join(SIMMC2_FOLDER, f"{domain}_prefab_metadata_all.json"), 'r') as f_in:
		simmc2_metadata = {**simmc2_metadata, **json.load(f_in)}

simmc2_scenes_jsons ={}
_files = glob.glob(f"{SIMMC2_FOLDER}/simmc2_scene_jsons_dstc10_public/*.json")
for file in tqdm(_files, desc='  JSON scenes'):
	with open(file, "r") as f_in:
		simmc2_scenes_jsons[os.path.splitext(os.path.basename(file))[0]] = json.load(f_in)

with open(os.path.join(SIMMC2_FOLDER, 'simmc2_dials_dstc10_devtest.json'), 'r') as f_in:
	simmc2_dataset = json.load(f_in)

Reading Metadata: 100%|██████████| 2/2 [00:00<00:00, 1538.07it/s]
  JSON scenes: 100%|██████████| 2743/2743 [00:00<00:00, 5560.45it/s]


In [3]:
# read model output files
model_outputs = {}

for subdir, dirs, files in os.walk(DATA_FOLDER):
	if 'coref-pred-devtest-mini.json' in files:
		with open(f"{subdir}/coref-pred-devtest-mini.json", 'r') as f_in:
			model_name = subdir.split('/')[-1]
			model_outputs[model_name] = json.load(f_in)
			# make sure that the dialogues have at most 1 turn (it's a specific format from SIMMC2 challenge)
			for dialogue in model_outputs[model_name]['dialogue_data'][:3]:
				if len(dialogue['dialogue']) > 1:
					# we need to fix this dataset!
					model_outputs[model_name] = fix_prediction_data_format(model_outputs[model_name])
					break

# sort dictionary by key
model_outputs = dict(sorted(model_outputs.items(), key=lambda item: item[0]))

# This is another variant of the Baseline GPT-2 from the challenge without the MultiModal help
del model_outputs['Baseline_GPT2_noMM']		# ignore for now
# We can process the output of this model, but Team9 skipped predictions for ambiguous (Before-CR) turns,
# which would need to be changed in their original code. We left it here for comparing All and After-CR turns
del model_outputs['Team9']

all_models = list(model_outputs.keys())
print(f"Loaded outputs: {all_models}")

dict_keys(['1-Baseline_GPT2', '2-GroundedLan_GPT2', '3-VisLan_LXMERT', '4-MultiTask_BART', 'Baseline_GPT2_noMM', 'Team9'])
Loaded outputs: ['1-Baseline_GPT2', '2-GroundedLan_GPT2', '3-VisLan_LXMERT', '4-MultiTask_BART']
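
The helper `fix_prediction_data_format` comes from `src` and is not shown in this notebook. As a rough illustration of the "at most 1 turn per dialogue" format checked above, here is a minimal sketch that splits multi-turn dialogues into single-turn entries. The function name `flatten_to_single_turn` and the toy input are hypothetical, and the real helper may behave differently.

```python
import copy


def flatten_to_single_turn(prediction_data):
    """Hypothetical helper: return a copy where every entry in
    'dialogue_data' holds exactly one turn in its 'dialogue' list."""
    flat = []
    for dialogue in prediction_data['dialogue_data']:
        for turn in dialogue['dialogue']:
            # copy the dialogue-level fields (e.g. dialogue_idx) to each new entry
            entry = {k: copy.deepcopy(v) for k, v in dialogue.items() if k != 'dialogue'}
            entry['dialogue'] = [copy.deepcopy(turn)]
            flat.append(entry)

    result = {k: v for k, v in prediction_data.items() if k != 'dialogue_data'}
    result['dialogue_data'] = flat
    return result


# toy input (assumed structure, mirroring the keys used in this notebook)
example = {'dialogue_data': [
    {'dialogue_idx': 7, 'dialogue': [{'turn_idx': 0}, {'turn_idx': 1}]},
]}
flattened = flatten_to_single_turn(example)
print(len(flattened['dialogue_data']))  # 2
```

After this conversion, `dialogue_data[t_index]['dialogue'][0]` can be read with a single running turn index, which is how the next cell accesses the model outputs.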


In [4]:
# do some pre-processing on the original simmc2 data
last_ambiguous_turn = None

# we can iterate the original data easily, but the model output data is formatted
# in a slightly different way, so we need the t_index to access the turn
print('Preprocessing dataset and printing example Clarification Exchanges (CEs)')
for t_index, simmc2_datum in enumerate(iterate_over_dataset_entries(simmc2_dataset)):
	simmc2_dialogue, simmc2_turn = simmc2_datum
	simmc2_turn['model_outputs'] = {}
	simmc2_turn['scene_idx'], simmc2_turn['previous_scene_idx'] = get_scene_idx(
		simmc2_dialogue['scene_ids'], simmc2_turn['turn_idx'])

	for model_name, model_output in model_outputs.items():
		pred_dialogue = model_output['dialogue_data'][t_index]
		pred_turn = pred_dialogue['dialogue'][0]	# in model outputs, there is at most 1 turn per dialogue

		# check that we are in the same dialogue and turn in all the model outputs
		assert simmc2_dialogue['dialogue_idx'] == pred_dialogue['dialogue_idx'] \
			and simmc2_turn['turn_idx'] == pred_turn['turn_idx'], \
			f"Model output {model_name} does not match the original SIMMC2 data: " \
			f"Dialogue {simmc2_dialogue['dialogue_idx']} Turn {simmc2_turn['turn_idx']} " \
			f"Dialogue {pred_dialogue['dialogue_idx']} Turn {pred_turn['turn_idx']}"

		# add the model output to the turn as 'synced' or joint data
		simmc2_turn['model_outputs'][model_name] = pred_turn

	# check for Clarification Exchanges
	if ce.is_ambiguous_turn(simmc2_turn):
		last_ambiguous_turn = simmc2_dialogue, simmc2_turn
	elif last_ambiguous_turn is not None:
		# this is the turn after the CR, make sure they are the same dialogue
		if last_ambiguous_turn[0]['dialogue_idx'] == simmc2_dialogue['dialogue_idx']:
			ce.mark_clarification_exchange(
				ambiguous_turn=last_ambiguous_turn[1], response_turn=simmc2_turn)
		last_ambiguous_turn = None

	# last_ambiguous_turn = None	# reset for new dialogues, probably redundant

Preprocessing dataset and printing example Clarification Exchanges (CEs)
  Clarification Exchange
	USR: Does the grey have good reviews? | ['individual_property']
	SYS: Which one do you mean? | []
	USR: The grey one on the hanging rack. | ['individual_property', 'relational_context']
	SYS: That dress has a high rating at 4.3.
	Tags=['individual_property', 'relational_context']
  Clarification Exchange
	USR: What size is the pair on the left and who makes it? | ['relational_context']
	SYS: Sorry, which one? | []
	USR: The jeans on the left. | ['individual_property', 'relational_context']
	SYS: It's a size L,  from Cats Are Great.
	Tags=['individual_property', 'relational_context']
  Clarification Exchange
	USR: Can you tell me who makes it and how much it costs? | []
	SYS: Which ones? | []
	USR: The grey pair of jeans. | ['individual_property']
	SYS: This pair is made by Cats Are Great and costs $164.99.
	Tags=['individual_property']
  Clarification Exchange
	USR: What's the prive of th
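
The pairing logic in the cell above (remember the last ambiguous turn, then pair it with the next turn only if both belong to the same dialogue) can be seen in isolation with toy data. This is a minimal sketch: the tuple layout and the function name `pair_clarification_exchanges` are hypothetical, and a plain boolean flag stands in for `ce.is_ambiguous_turn`.

```python
def pair_clarification_exchanges(turns):
    """turns: iterable of (dialogue_idx, turn_idx, is_ambiguous) in dataset order.
    Returns a list of ((dialogue_idx, turn_idx), (dialogue_idx, turn_idx)) pairs,
    each made of an ambiguous turn and the turn that follows it."""
    pairs = []
    last_ambiguous = None
    for dialogue_idx, turn_idx, is_ambiguous in turns:
        if is_ambiguous:
            last_ambiguous = (dialogue_idx, turn_idx)
        elif last_ambiguous is not None:
            # only pair turns that belong to the same dialogue
            if last_ambiguous[0] == dialogue_idx:
                pairs.append((last_ambiguous, (dialogue_idx, turn_idx)))
            last_ambiguous = None
    return pairs


toy_turns = [
    (0, 0, False),
    (0, 1, True),   # ambiguous turn, followed by its response turn
    (0, 2, False),
    (0, 3, True),   # ambiguous turn at the end of dialogue 0
    (1, 0, False),  # first turn of a new dialogue, so no pair is created
]
pairs = pair_clarification_exchanges(toy_turns)
print(pairs)  # [((0, 1), (0, 2))]
```

The same-dialogue check is what prevents an ambiguous final turn from being paired with the first turn of the next dialogue.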

In [5]:
# Define dataset splits as filters through the original SIMMC2 data
all_splits = [
	('All Turns', None),
	# ('Unambiguous Turns (All - CR Turns)', lambda x: not ce.is_ce_turn(x)),
	('CR Turns', lambda x: ce.is_ce_turn(x)),
	('Individual Property', lambda x: ce.is_tag_in_ce(x, tagging.TAG_INDIVIDUAL_PROPERTY)),
	('Dialogue History', lambda x: ce.is_tag_in_ce(x, tagging.TAG_DIALOGUE_HISTORY)),
	('Relational Context', lambda x: ce.is_tag_in_ce(x, tagging.TAG_RELATIONAL_CONTEXT)),
]
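
Each split is a `(name, filter)` pair. How `evaluation.evaluate_dataset` applies the filter is not shown here, so the following is only a sketch under the assumption that `None` keeps every turn and a callable keeps the turns for which it returns a truthy value. The `select_turns` function and the toy turn fields (`is_ce`, `tags`) are hypothetical.

```python
def select_turns(turns, filter_func=None):
    """Return the turns belonging to a split; None means no filtering."""
    if filter_func is None:
        return list(turns)
    return [turn for turn in turns if filter_func(turn)]


# toy turns with made-up fields standing in for the CE annotations
toy_turns = [
    {'turn_idx': 0, 'is_ce': False, 'tags': []},
    {'turn_idx': 1, 'is_ce': True, 'tags': ['individual_property']},
    {'turn_idx': 2, 'is_ce': True, 'tags': ['individual_property', 'relational_context']},
]
toy_splits = [
    ('All Turns', None),
    ('CR Turns', lambda t: t['is_ce']),
    ('Relational Context', lambda t: 'relational_context' in t['tags']),
]

split_sizes = {name: len(select_turns(toy_turns, func)) for name, func in toy_splits}
print(split_sizes)  # {'All Turns': 3, 'CR Turns': 2, 'Relational Context': 1}
```

Because the splits are plain predicates, a turn can belong to several of them at once, which matches the overlapping entry counts reported for the tag-based splits.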

In [6]:
# Create the Evaluation Table 2 from the paper by analysing the data and printing it in LaTeX format
print(f"Evaluation Results Table (Latex)\n{'=' * 24}\n")

def format_row_as_latex(_analysis):
	final_str = []
	for _model_name in all_models:
		if 'Before-CR' in _analysis:
			final_str += [
				f"{format_f1(_analysis['Before-CR'][_model_name]):<14}",
				f"{format_f1(_analysis['After-CR'][_model_name]):<14}",
				f"{format_delta(_analysis['Before-CR'][_model_name], _analysis['After-CR'][_model_name]):<10}"]
		else:
			# escape the backslash explicitly: '\m' is an invalid escape sequence in a normal string
			final_str += [f"\\multicolumn{{2}}{{c}}{{{format_f1(_analysis[_model_name])}}} ", ' '*10]

	return ' & '.join(final_str) + ' \\\\'

analysis = {}
for split_name, filter_func in all_splits:
	if split_name == 'All Turns':
		# first time! print headers
		headers = ["\\multicolumn{3}{c}{" + x + "}}" for x in all_models]
		subheaders = ['Before-CR     ', 'After-CR      ', '$\\Delta$  '] * len(all_models)
		print(' & '.join(['Model' + ' '*15] + [f"{h:<44}" for h in headers]) + ' \\\\')
		print(f"{' & '.join(['Split' + ' '*15] + subheaders)} \\\\")

	# use the filter func to create a split of the data, then check the results of each model
	analysis[split_name] = evaluation.evaluate_dataset(simmc2_dataset, filter_func)
	print(f"{split_name:<20} & {format_row_as_latex(analysis[split_name])}")

Evaluation Results Table (Latex)

Model                & \multicolumn{3}{c}{1-Baseline_GPT2}}         & \multicolumn{3}{c}{2-GroundedLan_GPT2}}      & \multicolumn{3}{c}{3-VisLan_LXMERT}}         & \multicolumn{3}{c}{4-MultiTask_BART}}        \\
Split                & Before-CR      & After-CR       & $\Delta$   & Before-CR      & After-CR       & $\Delta$   & Before-CR      & After-CR       & $\Delta$   & Before-CR      & After-CR       & $\Delta$   \\
All Turns            & \multicolumn{2}{c}{34.1 (.01)}  &            & \multicolumn{2}{c}{67.6 (.01)}  &            & \multicolumn{2}{c}{68.3 (.01)}  &            & \multicolumn{2}{c}{73.8 (.01)}  &            \\
CR Turns             & 36.4 (.01)     & 29.1 (.01)     & -20.1      & 64.8 (.01)     & 67.7 (.01)     & +4.4       & 65.7 (.01)     & 69.2 (.01)     & +5.4       & 66.9 (.01)     & 74.3 (.01)     & +11.1      \\
Individual Property  & 35.6 (.01)     & 28.1 (.01)     & -21.0      & 64.4 (.01)     & 67.6 (.01)     & +4.9       & 6
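
`format_delta` is defined in `src` and not shown in this notebook. The printed values look consistent with a relative change in percent, i.e. `(after - before) / before * 100`, but that is our reading of the table and not something the notebook confirms. A minimal sketch under that assumption, with hypothetical function names:

```python
def relative_delta(before, after):
    """Relative change from 'before' to 'after', in percent (assumed definition)."""
    return (after - before) / before * 100.0


def format_relative_delta(before, after):
    """Format the relative change with an explicit sign and one decimal."""
    return f"{relative_delta(before, after):+.1f}"


# rounded F1 scores taken from the 'CR Turns' row above
print(format_relative_delta(36.4, 29.1))  # -20.1
print(format_relative_delta(66.9, 74.3))  # +11.1
```

Recomputing from the rounded scores in the table can differ from the printed delta in the last digit, since the notebook presumably works with unrounded values.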

In [7]:
# Create the Candidate Objects Table from Appendix A.2
print(f"Candidate Objects Table (Latex)\n{'=' * 23}\n")

headers = ['Split' + ' '*15, 'Mean Candidate Objects Type (SD)  ', 'Mean Candidate Objects Colour (SD)', 'Entries']

for split_name, filter_func in all_splits:
	if split_name == 'All Turns':
		# first time! print headers
		print(f"{' & '.join(headers)} \\\\")

	analysis = evaluation.extract_candidate_objects(simmc2_dataset, simmc2_metadata, simmc2_scenes_jsons, filter_func)

	print(f"{split_name:<20} & {format_mean(analysis['type'])}{' '*23} & {format_mean(analysis['color'])}{' '*23} & {analysis['type']['count']} \\\\")
	# & {format_mean(analysis['brand'])}{' '*22} - used rarely in clarifications, so skipped from table

Candidate Objects Table (Latex)

Split                & Mean Candidate Objects Type (SD)   & Mean Candidate Objects Colour (SD) & Entries \\
All Turns            & 3.13 (5.18)                        & 2.61 (4.27)                        & 8609 \\
CR Turns             & 5.41 (5.62)                        & 4.53 (4.63)                        & 855 \\
Individual Property  & 5.49 (5.66)                        & 4.58 (4.68)                        & 825 \\
Dialogue History     & 5.33 (5.62)                        & 4.71 (4.81)                        & 198 \\
Relational Context   & 5.81 (5.95)                        & 4.67 (4.72)                        & 685 \\
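
The `mean (SD)` cells come from `format_mean`, which lives in `src`. The sketch below shows one way such a summary could be computed and formatted using only the standard library. The function names and dictionary keys other than `count` are hypothetical, and the use of the population standard deviation is an assumption; the original code may use the sample standard deviation.

```python
import statistics


def summarise(values):
    """Collect mean, standard deviation and count for a list of numbers.
    Assumption: population SD (statistics.pstdev), not sample SD."""
    return {
        'mean': statistics.mean(values),
        'sd': statistics.pstdev(values),
        'count': len(values),
    }


def format_mean_sd(summary):
    """Format a summary as 'mean (SD)' with two decimals."""
    return f"{summary['mean']:.2f} ({summary['sd']:.2f})"


toy_summary = summarise([1, 3, 5, 7])  # toy candidate-object counts
print(format_mean_sd(toy_summary), toy_summary['count'])  # 4.00 (2.24) 4
```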
