# Exploration of the MultiWOZ-2 dataset

In this notebook, we assess whether or not it is possible and feasible to convert the MultiWOZ-2 dataset to the Rasa Story format.

## Background information

Basic information from the corresponding paper: https://arxiv.org/abs/1810.00278

- 10x bigger than predecessor data sets
- Human-Human dialogues
- 3406 single-domain dialogues
- 7032 multi-domain dialogues
- 3406 + 7032 = 10438 dialogues in total
- 70% have more than 10 turns
- average number of turns: 8.93 (single-domain) and 15.39 (multi-domain)
- the authors provide a set of benchmark results on the data set

Data accessible here: [http://dialogue.mi.eng.cam.ac.uk/index.php/corpus/](http://dialogue.mi.eng.cam.ac.uk/index.php/corpus/)

## Import

Download the MultiWOZ-2 data set zip file.

In [1]:
import os
import urllib.request
import zipfile

In [2]:
url = "https://www.repository.cam.ac.uk/bitstream/handle/1810/280608/MULTIWOZ2.zip?sequence=3&isAllowed=y"
fname = os.path.abspath(os.path.join(".", "data", "MultiWOZ-2.zip"))

if not os.path.isdir(os.path.join(".", "data")):
    os.makedirs(os.path.join(".", "data"))

print(f"Downloading {url} to {fname}...")
urllib.request.urlretrieve(url, fname);
print("Done.")

Downloading https://www.repository.cam.ac.uk/bitstream/handle/1810/280608/MULTIWOZ2.zip?sequence=3&isAllowed=y to d:\rasa\research\report_multiWOZ\data\MultiWOZ-2.zip...
Done.


Unpack the data.

In [3]:
with zipfile.ZipFile(fname, "r") as zip_ref:
    zip_ref.extractall(os.path.abspath(os.path.join(".", "data", "MultiWOZ-2")))

In [4]:
data_file_name = os.path.abspath(os.path.join(".", "data", "MultiWOZ-2", "MULTIWOZ2 2", "data.json"))

Import the json data.

In [5]:
import json

In [6]:
with open(data_file_name, "r") as read_file:
    data = json.load(read_file)

Number of dialogues:

In [7]:
len(data)

10438

Files with multi-domain dialogues have "MUL" in their names. Single domain dialogues have either "SNG" or "WOZ" in their names.
(This is slightly different from the description on the website, where "WOZ" is not mentioned.)
These tags, however, are not necessarily the beginnings of the names:

In [8]:
names = list(data.keys())

In [9]:
set([name[:4] for name in names])

{'MUL0', 'MUL1', 'MUL2', 'PMUL', 'SNG0', 'SNG1', 'SSNG', 'WOZ2'}

In [10]:
def dialog_class_from_name(name):
    if "MUL" in name:
        return "MUL"
    elif "SNG" in name:
        return "SNG"
    elif "WOZ" in name:
        return "WOZ"
    else:
        return "Neither"

In [11]:
set([dialog_class_from_name(name) for name in names])

{'MUL', 'SNG', 'WOZ'}
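
To see how many dialogues fall into each class, the classification above can be combined with a counter. This is only a minimal sketch: `count_dialog_classes` and the sample names are illustrative (the `WOZ20001.json` and `MUL0001.json` names are made up), and in the notebook one would pass the real `names` list instead.

```python
from collections import Counter


def dialog_class_from_name(name):
    """Return the class tag ("MUL", "SNG" or "WOZ") contained in a dialogue name."""
    for tag in ("MUL", "SNG", "WOZ"):
        if tag in name:
            return tag
    return "Neither"


def count_dialog_classes(names):
    """Count how many dialogue names belong to each class."""
    return Counter(dialog_class_from_name(name) for name in names)


# Illustrative names only; in the notebook, pass the real `names` list.
sample_names = ["SNG01856.json", "PMUL1635.json", "MUL0001.json", "WOZ20001.json"]
class_counts = count_dialog_classes(sample_names)
print(class_counts)
```

If the naming convention holds, the `SNG` and `WOZ` counts on the real data should together match the 3406 single-domain dialogues reported in the paper.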

### A SNG dialog example

In [12]:
example_sng = names[0]
example_sng

'SNG01856.json'

There are two entries in this example:

In [13]:
list(data[example_sng])

['goal', 'log']

### `log` data

The `log` data contains the conversation.
We can print out the conversation like this:

In [14]:
log = data[example_sng]["log"]
for step in log:
    if len(step["metadata"]) == 0:   # It looks like user-texts don't have metadata
        print("U:  " + step["text"])  # User
    else:
        print("W:  " + step["text"])  # Wizard

U:  am looking for a place to to stay that has cheap price range it should be in a type of hotel
W:  Okay, do you have a specific area you want to stay in?
U:  no, i just need to make sure it's cheap. oh, and i need parking
W:  I found 1 cheap hotel for you that includes parking. Do you like me to book it?
U:  Yes, please. 6 people 3 nights starting on tuesday.
W:  I am sorry but I wasn't able to book that for you for Tuesday. Is there another day you would like to stay or perhaps a shorter stay?
U:  how about only 2 nights.
W:  Booking was successful.
Reference number is : 7GAWK763. Anything else I can do for you?
U:  No, that will be all. Good bye.
W:  Thank you for using our services.


Metadata about the user intent is available only for the Wizard's messages:

In [15]:
print(data[example_sng]["log"][2]["text"])
print(data[example_sng]["log"][3]["metadata"])  # Note: This is 3 instead of 2
print()
print(data[example_sng]["log"][3]["text"])
print(data[example_sng]["log"][4]["metadata"])

no, i just need to make sure it's cheap. oh, and i need parking
{'taxi': {'book': {'booked': []}, 'semi': {'leaveAt': '', 'destination': '', 'departure': '', 'arriveBy': ''}}, 'police': {'book': {'booked': []}, 'semi': {}}, 'restaurant': {'book': {'booked': [], 'time': '', 'day': '', 'people': ''}, 'semi': {'food': '', 'pricerange': '', 'name': '', 'area': ''}}, 'hospital': {'book': {'booked': []}, 'semi': {'department': ''}}, 'hotel': {'book': {'booked': [], 'stay': '', 'day': '', 'people': ''}, 'semi': {'name': 'not mentioned', 'area': 'not mentioned', 'parking': 'yes', 'pricerange': 'cheap', 'stars': 'not mentioned', 'internet': 'not mentioned', 'type': 'hotel'}}, 'attraction': {'book': {'booked': []}, 'semi': {'type': '', 'name': '', 'area': ''}}, 'train': {'book': {'booked': [], 'people': ''}, 'semi': {'leaveAt': '', 'destination': '', 'day': '', 'arriveBy': '', 'departure': ''}}}

I found 1 cheap hotel for you that includes parking. Do you like me to book it?
{}


In [16]:
data[example_sng]["log"][3]["metadata"]["hotel"].keys()

dict_keys(['book', 'semi'])

The metadata gets updated when the user provides new information:

In [17]:
from termcolor import colored

In [18]:
log = data[example_sng]["log"]
semi = {'name': 'not mentioned',
 'area': 'not mentioned',
 'parking': 'not mentioned',
 'pricerange': 'not mentioned',
 'stars': 'not mentioned',
 'internet': 'not mentioned',
 'type': 'not mentioned'}  # Baseline missing information about the hotel
book = {'booked': [], 'stay': '', 'day': '', 'people': ''}  # Baseline: no booking information yet
for step in log:
    if len(step["metadata"]) == 0:   # It looks like user-texts don't have metadata
        print("U:  " + step["text"])  # User
    else:
        # Print new information in red
        x = semi
        y = step['metadata']['hotel']['semi']
        changed_items = {k: y[k] for k in x if k in y and x[k] != y[k]}
        print(colored(f"new semi: {changed_items}", "red"))
        x = book
        y = step['metadata']['hotel']['book']
        changed_items = {k: y[k] for k in x if k in y and x[k] != y[k]}
        print(colored(f"new book: {changed_items}", "red"))
        semi = step['metadata']['hotel']['semi']
        book = step['metadata']['hotel']['book']
        
        print("W:  " + step["text"])  # Wizard

U:  am looking for a place to to stay that has cheap price range it should be in a type of hotel
new semi: {'pricerange': 'cheap', 'type': 'hotel'}
new book: {}
W:  Okay, do you have a specific area you want to stay in?
U:  no, i just need to make sure it's cheap. oh, and i need parking
new semi: {'parking': 'yes'}
new book: {}
W:  I found 1 cheap hotel for you that includes parking. Do you like me to book it?
U:  Yes, please. 6 people 3 nights starting on tuesday.
new semi: {}
new book: {'stay': '3', 'day': 'tuesday', 'people': '6'}
W:  I am sorry but I wasn't able to book that for you for Tuesday. Is there another day you would like to stay or perhaps a shorter stay?
U:  how about only 2 nights.
new semi: {}
new book: {'booked': [{'name': 'the cambridge belfry', 'reference': '7GAWK763'}], 'stay': '2'}
W:  Booking was successful.
Reference number is : 7GAWK763. Anything else I can do for you?
U:  No, that will be 

This would get us the user utterances, but not most of the bot's actions. We could only infer that the bot tried to book when data arrived in the `book` category. 
But from the `log` data alone, there seems to be no way to infer that the first information gained (`{'pricerange': 'cheap', 'type': 'hotel'}`) triggers something like `action_ask_location`.

A possible workaround would be to **train an NLU model to predict actions from the Wizard's utterances**, but I don't know if that is feasible.
As a final resort, let's have a look at the `goal` data.
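
For the user side, the metadata differences printed above could be turned into abstract intents. The following is a minimal sketch under stated assumptions: the helper names `slot_changes` and `as_inform_intent` and the intent name `inform` are hypothetical, and `""`, `"not mentioned"` and `[]` are treated as "no value" because that is how unset slots appear in the example above.

```python
import json

# Values that appear in the metadata for slots the user has not specified.
EMPTY_VALUES = ("", "not mentioned", [])


def slot_changes(previous, current):
    """Return the slots in `current` whose value is set and differs from `previous`."""
    return {
        slot: value
        for slot, value in current.items()
        if value not in EMPTY_VALUES and previous.get(slot) != value
    }


def as_inform_intent(changes):
    """Format slot changes as an `inform{...}` line (hypothetical intent name)."""
    return "inform" + json.dumps(changes, sort_keys=True)


# Hotel `semi` slots before and after the second user utterance (abbreviated).
before = {"parking": "not mentioned", "pricerange": "cheap", "type": "hotel"}
after = {"parking": "yes", "pricerange": "cheap", "type": "hotel"}
intent_line = as_inform_intent(slot_changes(before, after))
print(intent_line)
```

This covers only what the user informed the Wizard about; it says nothing about how the Wizard reacted.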

### `goal` data

The `goal` data contains information about all the domains. Since this is a single domain file (plus booking), most entries are empty:

In [19]:
print(data[example_sng]["goal"])

{'taxi': {}, 'police': {}, 'hospital': {}, 'hotel': {'info': {'type': 'hotel', 'parking': 'yes', 'pricerange': 'cheap', 'internet': 'yes'}, 'fail_info': {}, 'book': {'pre_invalid': True, 'stay': '2', 'day': 'tuesday', 'invalid': False, 'people': '6'}, 'fail_book': {'stay': '3'}}, 'topic': {'taxi': False, 'police': False, 'restaurant': False, 'hospital': False, 'hotel': False, 'general': False, 'attraction': False, 'train': False, 'booking': False}, 'attraction': {}, 'train': {}, 'message': ["You are looking for a <span class='emphasis'>place to stay</span>. The hotel should be in the <span class='emphasis'>cheap</span> price range and should be in the type of <span class='emphasis'>hotel</span>", "The hotel should <span class='emphasis'>include free parking</span> and should <span class='emphasis'>include free wifi</span>", "Once you find the <span class='emphasis'>hotel</span> you want to book it for <span class='emphasis'>6 people</span> and <span class='emphasis'>3 nights</span> sta

In [20]:
print(data[example_sng]["goal"]["taxi"])
print(data[example_sng]["goal"]["police"])
print(data[example_sng]["goal"]["hospital"])
print(data[example_sng]["goal"]["attraction"])
print(data[example_sng]["goal"]["train"])
print(data[example_sng]["goal"]["restaurant"])

{}
{}
{}
{}
{}
{}


The only topic that was actually relevant in this conversation was the `hotel`. 
Therefore, the `hotel` data contains the final requirements and booking intents, as well as the failed attempts:

In [21]:
list(data[example_sng]["goal"]["hotel"])

['info', 'fail_info', 'book', 'fail_book']

In [22]:
data[example_sng]["goal"]["hotel"]["info"]

{'type': 'hotel', 'parking': 'yes', 'pricerange': 'cheap', 'internet': 'yes'}

In [23]:
data[example_sng]["goal"]["hotel"]["book"]

{'pre_invalid': True,
 'stay': '2',
 'day': 'tuesday',
 'invalid': False,
 'people': '6'}

In [24]:
data[example_sng]["goal"]["hotel"]["fail_info"]

{}

In [25]:
data[example_sng]["goal"]["hotel"]["fail_book"]

{'stay': '3'}

The `message` data contains the goal that the user was given before the dialog started.
This should not be relevant for us.

In [26]:
for msg in data[example_sng]["goal"]["message"]:
    print("> " + msg)

> You are looking for a <span class='emphasis'>place to stay</span>. The hotel should be in the <span class='emphasis'>cheap</span> price range and should be in the type of <span class='emphasis'>hotel</span>
> The hotel should <span class='emphasis'>include free parking</span> and should <span class='emphasis'>include free wifi</span>
> Once you find the <span class='emphasis'>hotel</span> you want to book it for <span class='emphasis'>6 people</span> and <span class='emphasis'>3 nights</span> starting from <span class='emphasis'>tuesday</span>
> If the booking fails how about <span class='emphasis'>2 nights</span>
> Make sure you get the <span class='emphasis'>reference number</span>


What does the "topic" data signify? In this example, it is not saying what topics were encountered, but instead it is just `False` for all topics. 

In [27]:
data[example_sng]["goal"]["topic"]

{'taxi': False,
 'police': False,
 'restaurant': False,
 'hospital': False,
 'hotel': False,
 'general': False,
 'attraction': False,
 'train': False,
 'booking': False}

I think it _does_ represent the topics of the conversation (the user has to check boxes at the end of the conversation), but the user probably made a mistake in this example.
It could serve to check whether the user's and the wizard's topics concur.
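
Such a consistency check could compare the domains that have a non-empty goal with the topics that were ticked. The sketch below is an assumption about how one might do this; `goal_domains` and `flagged_topics` are hypothetical helpers, and the example goal is an abbreviated copy of the one printed above.

```python
def goal_domains(goal, skip=("topic", "message")):
    """Return the domains whose goal entry is a non-empty dict."""
    return sorted(
        key for key, value in goal.items()
        if key not in skip and isinstance(value, dict) and value
    )


def flagged_topics(goal):
    """Return the topics marked True in the `topic` entry."""
    return sorted(topic for topic, flag in goal.get("topic", {}).items() if flag)


# Abbreviated copy of the goal of SNG01856.json shown above.
example_goal = {
    "taxi": {},
    "police": {},
    "hospital": {},
    "hotel": {"info": {"type": "hotel", "parking": "yes"}, "fail_info": {}},
    "topic": {"taxi": False, "hotel": False, "booking": False},
    "attraction": {},
    "train": {},
    "restaurant": {},
    "message": [],
}

domains = goal_domains(example_goal)
topics = flagged_topics(example_goal)
print(domains, topics)
```

For this example the two lists disagree (a `hotel` goal, but no topic ticked), which is the discrepancy described above.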

Unfortunately, the `goal` data also does not contain any hints about the Wizard's concrete actions.

### A MUL dialog example

In [28]:
example_mul = names[2]
example_mul

'PMUL1635.json'

This dialogue contains keys that the SNG example lacks, but they refer to other contexts (such as `bus`) and do not seem to provide information that would be useful to us:

In [29]:
def compare_structure(ds1, ds2, name="root", level=0):
    """Print the nested structure of ds1 and report where ds2 differs from it."""
    print(" " * level + name)
    if type(ds1) is not type(ds2):
        print(colored("Type mismatch", "red"))
        return False
    if type(ds1) is dict:
        if set(ds1) != set(ds2):
            mismatch_l = {k for k in ds1 if k not in ds2}
            mismatch_r = {k for k in ds2 if k not in ds1}
            print(colored(f"Key mismatch: {mismatch_l} | {mismatch_r}", "red"))
            return False
        # Build a list (not a generator) so that every branch gets printed
        return all([compare_structure(ds1[k], ds2[k], k, level+2) for k in ds1])
    if type(ds1) is list:
        # zip stops at the shorter list, so lists of unequal length cannot raise an IndexError
        return all([compare_structure(a, b, str(i), level+2) for i, (a, b) in enumerate(zip(ds1, ds2))])
    return True

In [30]:
compare_structure(data[example_sng], data[example_mul])

root
  goal
    taxi
    police
    hospital
    hotel
      info
Key mismatch: {'type', 'pricerange'} | {'area', 'stars'}
      fail_info
      book
Key mismatch: {'pre_invalid'} | set()
      fail_book
Key mismatch: {'stay'} | set()
    topic
      taxi
      police
      restaurant
      hospital
      hotel
      general
      attraction
      train
      booking
    attraction
    train
Key mismatch: set() | {'info', 'reqt', 'fail_info'}
    message
      0
      1
      2
      3
      4
    restaurant
  log
    0
      text
      metadata
    1
      text
      metadata
Key mismatch: set() | {'bus'}
    2
      text
      metadata
    3
      text
      metadata
Key mismatch: set() | {'bus'}
    4
      text
      metadata
    5
      text
      metadata
Key mismatch: set() | {'bus'}
    6
      text
      metadata
    7
      text
      metadata
Key mismatch: set() | {'bus'}
    8
      text
      metadata


False

Check if any of the dialogues have metadata about the wizards.
To this end, we find dialogues where more than 50% of the turns are annotated.

In [31]:
def annotation_ratio(name):
    annotated = 0.0
    for x in data[name]["log"]:
        if len(x["metadata"]) > 0:
            annotated += 1.0
    annotated /= len(data[name]["log"])
    return annotated

In [32]:
annotation_ratio(names[0])

0.5

In [33]:
odd_sets = [name for name in names if annotation_ratio(name) > 0.5]
print(odd_sets)

['SNG1213.json', 'PMUL0382.json', 'PMUL0237.json']


The first one has more annotations because the Wizard kept talking and the user did not reply.

In [34]:
for step in data[odd_sets[0]]["log"]:
    if len(step["metadata"]) == 0:   # It looks like user-texts don't have metadata
        print("U:  " + step["text"])  # User
        print(step["metadata"])
    else:
        print("W:  " + step["text"])  # Wizard

U:  I'm interested in finding an expensive guesthouse to stay at during my visit to Cambridge
{}
W:  I'm sorry I don't have any matches. Should we try a different price range?
U:  Can you search for hotels instead of a guesthouse?
{}
W:  I have expensive hotels in the east, centre, and south areas of town. Do you have a preference?
U:  The east would be great.
{}
W:  There is one in the east. It is express by holiday inn cambridge. Do you need a booking?
U:  Does the Holiday Inn Cambridge have free parking? I forgot to tell you I need that.
{}
W:  Yes, it has free parking. Would you like me to book that for you?
U:  No, what is the phone number for the hotel?
{}
W:  It is 01223866800.  May I help with anything else?
U:  Yes, can you please tell me how many stars it has? 
{}
W:  It is a 2 star hotel. Would you like me to make those reservations?
W:  It has 2 stars. Is there anything else I can do for you?
W:  Great, thank you and goodbye!


Same here:

In [35]:
for step in data[odd_sets[1]]["log"]:
    if len(step["metadata"]) == 0:   # It looks like user-texts don't have metadata
        print("U:  " + step["text"])  # User
        print(step["metadata"])
    else:
        print("W:  " + step["text"])  # Wizard

U:  I'm looking for a cheap Italian restaurant. 
{}
W:  How about Pizza Hut City Centre?  I hear it is very good.
U:  I s it located in the north?
{}
W:  No, that one was in the Centre.  But, da vinci pizzeria is in the north
U:  Da Vinci Pizzeria is fine can you book a table for one for Monday at 12:00?
{}
W:  I will book it for you,is there anything else I can do for you ?
U:  Yes, what about the reference number?
{}
W:  My system is down and I am unable to book at the moment.  Here is their # 01223351707 or we can try again later.  Sorry for inconvenience.
U:  Wow, really? Could you try again please?
{}
W:  My apologies, this is actually a system issue and i cannot book.   I have reported in, in the meantime do you need other information?
U:  I will try back later for the reference number. Could you find me a place to stay that is near the restaurant?
{}
W:  There are several in the moderate to cheap range. Do you have a star preference?
U:  no stars would work for me. 
{}
W:  I wou

Again, the Wizard replied multiple times.

In [36]:
for step in data[odd_sets[2]]["log"]:
    if len(step["metadata"]) == 0:   # It looks like user-texts don't have metadata
        print("U:  " + step["text"])  # User
        print(step["metadata"])
    else:
        print("W:  " + step["text"])  # Wizard

U:  I am trying to find a Jamaican restaurant
{}
W:  We don't have any Jamaican restaurants here. Would you like to try something else?
U:  How about a expensive mediterranean?
{}
W:  There are two restaurants, La Mimosa or Shiraz
U:  let's book a table at Shiraz.
{}
W:  Can I get what day,time and how many people will be dining?
U:  Monday at noon and five.
{}
W:  Your booking was successful. The table will be reserved for 15 minutes. Reference number is:  QJI9U6C7. Is there anything else I can assist you with today?
U:  Also looking for a place to stay. The hotel should include free parking and should include free wifi.
{}
W:  I recommend acorn guest house on 154 chesterton road would you like me to book?
U:  Can I get the address, postcode and price range of that hotel?
{}
W:  It is a moderate priced guesthouse. The address is 154 chesterton road and postcode is cb41da. Can I book you any rooms?
U:  Sure, that would be great. one person for 2 nights please.
{}
W:  What day will you 
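
Since all three outliers come from the Wizard replying several times in a row, such turns can also be located directly. This is a small sketch assuming, as above, that non-empty metadata marks a Wizard turn; `repeated_wizard_turns` is a hypothetical helper and the toy log uses placeholder metadata.

```python
def is_wizard_turn(step):
    """Wizard turns carry non-empty metadata; user turns have an empty dict."""
    return len(step["metadata"]) > 0


def repeated_wizard_turns(log):
    """Return the indices of Wizard turns that directly follow another Wizard turn."""
    return [
        i for i in range(1, len(log))
        if is_wizard_turn(log[i]) and is_wizard_turn(log[i - 1])
    ]


# Toy log with placeholder metadata; the last two turns are both Wizard turns.
toy_log = [
    {"text": "Yes, can you please tell me how many stars it has?", "metadata": {}},
    {"text": "It is a 2 star hotel.", "metadata": {"hotel": {}}},
    {"text": "It has 2 stars.", "metadata": {"hotel": {}}},
]
repeats = repeated_wizard_turns(toy_log)
print(repeats)
```

A converter could use these indices to merge or drop the extra Wizard turns before writing a story.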

## Further samples

In [37]:
def print_dialog(n):
    dialog = data[names[n]]
    log = dialog["log"]
    print(f"Length: {len(log)}")
    
    for step in log:
        if len(step["metadata"]) == 0:   # It looks like user-texts don't have metadata
            print("U:  " + step["text"])  # User
        else:
            print("W:  " + step["text"])  # Wizard

In [38]:
print_dialog(1)

Length: 10
U:  Hello, I have been robbed.  Can you please help me get in touch with the police?
W:  Parkside Police Station is in Parkside, Cambridge. Their number is 01223358966. Anything else I can do for you?
U:  Can I please have the postcode as well?
W:  The postcode for the Parkside Police Station is CB11JG. Can I help you with anything else?
U:  Was Parkside the address of the police station? If not, can I have the address please?
W:  Yes, Parkside is the address.
U:  Thank you that will be all for now.
W:  Great. Thank you for contacting Cambridge Towninfo Centre.
U:  You were great. Goodbye.
W:  We are happy to help. Have a good day!


Here the user asks whether information that was clearly given is correct. Otherwise the dialog seems fine.

In [39]:
print_dialog(2)

Length: 18
U:  I need to book a hotel in the east that has 4 stars.  
W:  I can help you with that. What is your price range?
U:  That doesn't matter as long as it has free wifi and parking.
W:  If you'd like something cheap, I recommend the Allenbell. For something moderately priced, I would recommend the Warkworth House.
U:  Could you book the Wartworth for one night, 1 person?
W:  What day will you be staying?
U:  Friday and Can you book it for me and get a reference number ?
W:  Booking was successful.
Reference number is : BMUKPTG6.  Can I help you with anything else today?
U:  I am looking to book a train that is leaving from Cambridge to Bishops Stortford on Friday. 
W:  There are a number of trains leaving throughout the day.  What time would you like to travel?
U:  I want to get there by 19:45 at the latest. 
W:  Okay! The latest train you can take leaves at 17:29, and arrives by 18:07. Would you like for me to book that for you?
U:  Yes please. I also need the travel time, de

The dialog looks fine. Here the Wizard provides multiple hotel options at once, and a train is booked, too.

In [40]:
print_dialog(42)

Length: 8
U:  I am looking for a train departing from london liverpool please.
W:  I'll be glad to help. You would like to from london liverpool street to what destination, please?
U:  Cambridge. I'd like to leave after 10:00 on Friday.
W:  I have seven trains, could you tell me when you would like to arrive by ?
U:  Just any time after 10:00, can I get the train ID of one of them please?
W:  TR7943 leaves at 11:39
U:  Thank you very much! That is all I need for today.
W:  Have a great day!


Looks fine.

In [41]:
print_dialog(-1)

Length: 12
U:  I am looking to travel to Cambridge by train. 
W:  What is your departure day, time, and location?
U:  I'll be leaving from london liverpool street on Tuesday. I need to arrive by 12:30.
W:  There is a train that leaves at 07:39 and arrives at 09:07. Should I book this one?
U:  No, I just need the price and Train ID for the 9:07 train.
W:  Sure, the trainID is TR2715 and the price is 16.60 GBP per ticket. Can I help with anything else today?
U:  Actually yes, I also want to find out what kind of boating attractions there are.
W:  I see Camboats, Riverboat Georgina, Scudamores Punting Co., and The Cambridge Punter listed. 
U:  I'd like the address for the Riverboat Georgina, please.
W:  Certainly, riverboat georgina, cambridge passenger cruisers, jubilee house. Is there anything else I can do for you?
U:  I think that's everything, thank you for your time.
W:  Thank you, enjoy your visit!


Looks good.

In [42]:
print_dialog(-2)

Length: 14
U:  I am looking for a cheap modern European restaurant. Can you help me?
W:  The River Bar Steakhouse and Grill fits that description. Would you like me to book you? 
U:  Yes. 3 people at 13:00 on Thursday. 
W:  I will book that for you now.
U:  Thank you! What's the reference number?
W:  I have booked your table and the reference number is : A36LRSL9.
U:  I am also looking for places to go where I can take a boat.
W:  Yes I have places in the east, north and center. Any preference?
U:  Preferably in the centre, I'll be staying in that area with some friends. 
W:  There is a place in the centre named Scudamores punting co where you can get a boat.  Would you like their phone number or address?
U:  Yes, may I get the address, entrance fee and phone number please?
W:  Yes, they are located at Granta Place, Mill Lane, postcode cb21rs.  You will have to call for an entrance fee, the phone number is 01223359750.  
U:  Thanks, that's all I needed today. Goodbye!
W:  You're welcom

Looks good.

In [43]:
print_dialog(100)

Length: 22
U:  I need a hotel close to downtown Cambridge please.
W:  I assume you mean the center of town? We have many hotels in Cambridge covering all price ranges.
U:  I'm sorry, I wasn't thinking clearly. I'm looking for a place in the West. I need free wifi as well, please.
W:  There are four places to stay in the west that provide internet. What price range would you prefer?
U:  I don't care about the price range, but do you have one that is a guesthouse?
W:  Yes, there are two when will you need the room?
U:  Does one of those two have free parking? If so, I'll book it for 4 people for 5 nights starting Friday.
W:  They both have free parking and internet.  Would you prefer cheap or moderate priced?
U:  I don't mind cheap as long as it's still reasonably nice.
W:  I recommend Finches Bed & Breakfast. It's cheap, but still 4 stars. Would you like a reservation?
U:  I need to book 4 people for 5 nights starting friday please
W:  I'm sorry, but I wasn't able to book for 5 nights. 

Looks good.

In [44]:
print_dialog(101)

Length: 10
U:  I'd like to find a train that leaves Monday and arrives by 19:00.
W:  Very well. Where are you traveling from, and to where?
U:  I am departing from birmingham new street, going to cambridge.
W:  I have a train that departs at 15:40 and arrives in Cambridge at 18:23. Will that do?
U:  That would work just fine. Could you tell me the price and travel time for that train? 
W:  The TR5385 has a travel time of 163 minutes. Would you like me to book you some tickets?
U:  No, I don't want to book today.  Could you just give me the price for that train?  I think that is all I need today.
W:  Sure.  The price of those tickets are 75.10 pounds each.
U:  Great! That's all I need. Thank you for your help.
W:  Thank you for using Cambridge TownInfo Centre. Have a wonderful day!


Looks good.

In [45]:
print_dialog(202)

Length: 26
U:  I want to know if any trains run on Sunday?
W:  Yes, I have 404 trains running on Sunday. Where are you departing from?
U:  I'll be heading to Cambridge from Stevenage, and I'd like to arrive on Sunday by 12:45, please.
W:  How about train TR8799, which departs on Sunday at 11:00 and arrives by 11:51?
U:  That sounds good, how much is a ticket?
W:  The price is 18.88 pounds. Would you like me to book it?
U:  Yes please go ahead and book that for me. I am also looking for a place to stay and can you find me a Guest house with free wifi?
W:  I'm unsure if the wifi at Alexander Bed and Breakfast is free but it is cheap and in the center of town and definitely does have internet access. Would that do?
U:  Actually, I would prefer a guesthouse is the north.
W:  Okay I have many options. Any price preferences?
U:  no, i just need to know the price
W:  I would recommend Archway House, it is moderately priced and 4 stars. Would you like a reservation?
U:  Okay.  I would like to 

Looks good, but this might be difficult: "I don't actually need the train booked yet".

## Other files

There are other files in the data set, besides data.json.

In [46]:
with open(os.path.join(".", "data", "MultiWOZ-2", "MULTIWOZ2 2", "README.json")) as f:
    print(f.read())

#####################################################
#####################################################
#  Copyright Cambridge Dialogue Systems Group, 2018 #
#####################################################
#####################################################

Dataset contains the following json files:
1. data.json: the woz dialogue dataset, which contains the conversation  users and wizards, as well as a set of coarse labels for each user turn. Files with multi-domain dialogues have "MUL" in their names. Single domain dialogues have either "SNG" or "WOZ" in their names.
2. restaurant_db.json: the Cambridge restaurant database file, containing restaurants in the Cambridge UK area and a set of attributes.
3. attraction_db.json: the Cambridge attraction database file, contining attractions in the Cambridge UK area and a set of attributes.
4. hotel_db.json: the Cambridge hotel database file, containing hotels in the Cambridge UK area and a set of attributes.
5. train_db.json: th

Aha! The "system_acts.json" file seems interesting, but it was apparently renamed to "dialogue_acts.json".

In [47]:
acts_file_name = os.path.abspath(os.path.join(".", "data", "MultiWOZ-2", "MULTIWOZ2 2", "dialogue_acts.json"))

with open(acts_file_name, "r") as read_file:
    acts = json.load(read_file)

len(acts)

10438

In [48]:
acts[names[2][:-5]]

{'1': {'Hotel-Request': [['Price', '?']]},
 '6': {'Train-Inform': [['Leave', '17:29'], ['Arrive', '18:07']],
  'Train-OfferBook': [['none', 'none']]},
 '9': {'general-bye': [['none', 'none']],
  'general-welcome': [['none', 'none']]},
 '5': {'Train-Inform': [['Choice', 'a number'],
   ['Leave', 'throughout the day']],
  'Train-Request': [['Leave', '?']]},
 '4': {'general-reqmore': [['none', 'none']],
  'Booking-Book': [['Ref', 'BMUKPTG6']]},
 '7': {'general-reqmore': [['none', 'none']],
  'Train-OfferBooked': [['Time', '38 minutes'],
   ['Ref', 'UIFV8FAS'],
   ['Ticket', '10.1 GBP']]},
 '2': {'Hotel-Recommend': [['Price', 'cheap'],
   ['Price', ' moderately priced'],
   ['Name', 'Allenbell'],
   ['Name', ' Warkworth House']]},
 '8': {'Booking-Book': [['Ref', 'YF86GE4J']]},
 '3': {'Booking-Request': [['Day', '?']]}}
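
Note that the turn numbers are string keys and are not stored in order. For a conversion they probably need to be sorted numerically, since a plain string sort would put `'10'` before `'2'`. A minimal sketch, where `acts_in_turn_order` is a hypothetical helper and the example is an abbreviated copy of the output above:

```python
def acts_in_turn_order(dialog_acts):
    """Return (turn_number, acts) pairs sorted by numeric turn number."""
    return sorted(
        ((int(turn), turn_acts) for turn, turn_acts in dialog_acts.items()),
        key=lambda pair: pair[0],
    )


# Abbreviated copy of the output above.
example_acts = {
    "1": {"Hotel-Request": [["Price", "?"]]},
    "9": {"general-bye": [["none", "none"]]},
    "3": {"Booking-Request": [["Day", "?"]]},
}
ordered = acts_in_turn_order(example_acts)
for turn, turn_acts in ordered:
    print(turn, list(turn_acts))
```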

In [49]:
def print_dialog_acts(n):
    name = names[n]
    key = name[:-5]  # Keys in dialogue_acts.json lack the ".json" suffix
    log = data[name]["log"]
    print(f"Dialog {key}, length: {len(log)}")
    
    wizard_turn = 0
    for step in log:
        if len(step["metadata"]) == 0:   # It looks like user-texts don't have metadata
            print("U:  " + step["text"])  # User
        else:
            wizard_turn += 1  # Acts are looked up by the 1-based Wizard turn number
            print("W:  " + step["text"])  # Wizard
            print(colored(str(acts[key][str(wizard_turn)]), "red"))

In [50]:
print_dialog_acts(0)

Dialog SNG01856, length: 10
U:  am looking for a place to to stay that has cheap price range it should be in a type of hotel
W:  Okay, do you have a specific area you want to stay in?
{'Hotel-Request': [['Area', '?']]}
U:  no, i just need to make sure it's cheap. oh, and i need parking
W:  I found 1 cheap hotel for you that includes parking. Do you like me to book it?
{'Booking-Inform': [['none', 'none']], 'Hotel-Inform': [['Price', 'cheap'], ['Choice', '1'], ['Parking', 'none']]}
U:  Yes, please. 6 people 3 nights starting on tuesday.
W:  I am sorry but I wasn't able to book that for you for Tuesday. Is there another day you would like to stay or perhaps a shorter stay?
{'Booking-NoBook': [['Day', 'Tuesday']], 'Booking-Request': [['Stay', '?'], ['Day', '?']]}
U:  how about only 2 nights.
W:  Booking was successful.
Reference number is : 7GAWK763. Anything else I can do for you?
{'general-reqmore': [['none', 'none']], 'Booking-Book': [['Ref', '7GAWK763']

In [51]:
print_dialog_acts(123)

Dialog PMUL1229, length: 16
U:  I need a train going to cambridge. 
W:  Absolutely! What day and time would you like to leave?
{'Train-Request': [['Leave', '?'], ['Day', '?']]}
U:  On Thursday any time after 21:00.
W:  Where will you be departing from? 
{'Train-Request': [['Depart', '?']]}
U:  I will departing from birmingham new street.
W:  I would recommend TR7324 which leaves Birmingham New Street at 21:40 and arrives at Cambridge at 24:23. 
{'Train-Inform': [['Id', 'TR7324 '], ['Dest', 'Cambridge '], ['Arrive', '24:23'], ['Leave', '21:40'], ['Depart', 'Birmingham New Street']]}
U:  Yes can you book that for 1 person?
W:  Your booking for one ticket is complete. Your reference number is VEG5Q87Q and 75.09GBP will be due at the station.
{'Train-OfferBooked': [['Ref', 'VEG5Q87Q '], ['Ticket', '75.09GBP '], ['People', 'one ']]}
U:  I am also looking for a attraction called old schools. 
W:  Yes, Old Schools is located in the centre area, and has no e

So **the Wizard's actions are actually provided in this separate file!**
Let's see which actions the Wizard can take overall.

In [52]:
actions = set()

for name in names:
    act = acts[name[:-5]]
    for a in act.values():
        actions.update(set(list(a)))
actions = list(actions)
actions.sort()
for a in actions:
    print(a)

 
A
Attraction-Inform
Attraction-NoOffer
Attraction-Recommend
Attraction-Request
Attraction-Select
Booking-Book
Booking-Inform
Booking-NoBook
Booking-Request
Hotel-Inform
Hotel-NoOffer
Hotel-Recommend
Hotel-Request
Hotel-Select
N
Restaurant-Inform
Restaurant-NoOffer
Restaurant-Recommend
Restaurant-Request
Restaurant-Select
Taxi-Inform
Taxi-Request
Train-Inform
Train-NoOffer
Train-OfferBook
Train-OfferBooked
Train-Request
Train-Select
a
general-bye
general-greet
general-reqmore
general-welcome
i
n
o
t
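
The single-character entries in this list (` `, `A`, `N`, `a`, `i`, `n`, `o`, `t`) are exactly the distinct characters of the string "No Annotation". This suggests that some turns are stored as a plain string instead of a dict, so that `set(list(a))` adds the string's characters. The sketch below skips such turns and counts them; `collect_act_names` is a hypothetical helper, the toy data is made up, and the exact placeholder string is an assumption.

```python
def collect_act_names(acts):
    """Collect all act names, skipping turns whose annotation is not a dict."""
    names_seen = set()
    unannotated = 0
    for dialog_acts in acts.values():
        for turn_acts in dialog_acts.values():
            if isinstance(turn_acts, dict):
                names_seen.update(turn_acts)
            else:
                unannotated += 1  # e.g. a placeholder string instead of acts
    return sorted(names_seen), unannotated


# Toy data; the "No Annotation" string is an assumed placeholder value.
toy_acts = {
    "SNG01856": {
        "1": {"Hotel-Request": [["Area", "?"]]},
        "2": "No Annotation",
    },
    "PMUL1229": {"1": {"Train-Request": [["Leave", "?"], ["Day", "?"]]}},
}
act_names, n_unannotated = collect_act_names(toy_acts)
print(act_names, n_unannotated)
```

Counting the skipped turns also shows how many Wizard turns would lack an action in a converted story.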


Another helpful piece of information is that, according to the paper, "the **validation and test sets only contain fully successful dialogues**", and these sets are defined in the `testListFile.json` and `valListFile.json` files.
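
These lists could be used to split the dialogue names into training, validation and test sets. The sketch below assumes that each list file contains one dialogue name per line; this has not been verified here, and if the files turn out to be real JSON they would have to be read with `json.load` instead. `read_name_list` and `split_names` are hypothetical helpers, and the list contents in the example are made up.

```python
import io


def read_name_list(lines):
    """Read dialogue names from an iterable of lines, ignoring blank lines."""
    return {line.strip() for line in lines if line.strip()}


def split_names(names, val_names, test_names):
    """Assign each dialogue name to the train, validation or test split."""
    splits = {"train": [], "val": [], "test": []}
    for name in names:
        if name in test_names:
            splits["test"].append(name)
        elif name in val_names:
            splits["val"].append(name)
        else:
            splits["train"].append(name)
    return splits


# In-memory stand-ins for valListFile.json and testListFile.json (made-up content).
val_names = read_name_list(io.StringIO("PMUL1635.json\n"))
test_names = read_name_list(io.StringIO("SNG01856.json\n\n"))
splits = split_names(
    ["SNG01856.json", "PMUL1635.json", "PMUL1229.json"], val_names, test_names
)
print(splits)
```

In the notebook, the `io.StringIO` objects would be replaced by the opened list files.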

## Conclusions

Our goal is to convert the MultiWOZ-2 data set into the Rasa Story format.
The roles of user and wizard in the MultiWOZ setup correspond to the user and the bot.

The MultiWOZ data set `data.json` is annotated well enough to infer the abstract user utterances (like `inform{"location": "rome", "price": "cheap"}`), and the wizard's actions are labeled in the `dialogue_acts.json` file.
This should enable us to write a parser, at least for the "successful dialogues" on the validation and test lists, and the dialog quality of the samples inspected above seems generally ok as well.
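
As a rough illustration of what such a parser might look like, the sketch below converts one single-domain dialogue into story lines. Everything in it is an assumption rather than a finished design: the Markdown-style layout (`## name`, `* intent`, `  - action`), the intent names `inform` and `other`, the `utter_...` action names, and the lookup of acts by 1-based Wizard turn number (as in `print_dialog_acts`). The toy data is an abbreviated version of SNG01856.

```python
import json

EMPTY_VALUES = ("", "not mentioned", [])


def belief_state(metadata, domain):
    """Merge the `semi` and `book` slots of one domain, leaving out `booked`."""
    state = dict(metadata[domain]["semi"])
    state.update({k: v for k, v in metadata[domain]["book"].items() if k != "booked"})
    return state


def changed_slots(previous, current):
    """Slots whose value is set in `current` and differs from `previous`."""
    return {s: v for s, v in current.items()
            if v not in EMPTY_VALUES and previous.get(s) != v}


def action_name(act):
    """Map an act such as "Hotel-Request" to a hypothetical action name."""
    return "utter_" + act.lower().replace("-", "_")


def dialog_to_story(name, log, dialog_acts, domain):
    """Build story lines for one single-domain dialogue."""
    lines = [f"## {name}"]
    previous, wizard_turn, user_spoke = {}, 0, False
    for step in log:
        if not step["metadata"]:  # user turn: remember it, the text is not needed
            user_spoke = True
            continue
        wizard_turn += 1
        current = belief_state(step["metadata"], domain)
        if user_spoke:  # emit an intent only if the user actually said something
            changes = changed_slots(previous, current)
            intent = ("inform" + json.dumps(changes, sort_keys=True)) if changes else "other"
            lines.append("* " + intent)
        previous, user_spoke = current, False
        turn_acts = dialog_acts.get(str(wizard_turn))
        if isinstance(turn_acts, dict):  # skip missing or non-dict annotations
            lines.extend("  - " + action_name(act) for act in sorted(turn_acts))
    return lines


# Abbreviated toy version of SNG01856 (texts shortened, most slots omitted).
no_booking = {"booked": [], "stay": "", "day": "", "people": ""}
toy_log = [
    {"text": "am looking for a place to stay", "metadata": {}},
    {"text": "Okay, do you have a specific area?", "metadata": {"hotel": {
        "book": no_booking,
        "semi": {"parking": "not mentioned", "pricerange": "cheap", "type": "hotel"}}}},
    {"text": "oh, and i need parking", "metadata": {}},
    {"text": "I found 1 cheap hotel for you.", "metadata": {"hotel": {
        "book": no_booking,
        "semi": {"parking": "yes", "pricerange": "cheap", "type": "hotel"}}}},
]
toy_acts = {
    "1": {"Hotel-Request": [["Area", "?"]]},
    "2": {"Booking-Inform": [["none", "none"]], "Hotel-Inform": [["Price", "cheap"]]},
}
story = dialog_to_story("SNG01856", toy_log, toy_acts, "hotel")
print("\n".join(story))
```

Multi-domain dialogues, act arguments (slot values of the Wizard's acts) and failed bookings are deliberately left out of this sketch and would need additional handling.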


# ToDo: Exploration of the MultiWOZ-1 data set