# Parser for MultiWOZ-2

In [1]:
# Use %%black at the beginning of a cell to auto-convert to
# "black" format. See https://github.com/csurfer/blackcellmagic
# %load_ext blackcellmagic

In [2]:
import sys
sys.path.append("./multiwoz")

In [3]:
from multiwoz.parser import MultiWOZParser

## Import

Download the MultiWOZ-2 dataset's zip file and unpack it, if this has not been done already.

In [4]:
parser = MultiWOZParser()  # Use `debug=True` for smaller dataset

Number of dialogues:

In [5]:
assert len(parser.data) == len(parser.acts)
len(parser.data)

10438

In [6]:
parser.story_names[:3]

['SNG01856.json', 'SNG0129.json', 'PMUL1635.json']

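The following actions exist in `acts`. The single-character entries in the output are the characters of the string `No Annotation`; they presumably come from unannotated turns, whose value is that string rather than a dict of actions.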

In [7]:
def list_actions():
    actions = set()
    for name in parser.story_names:
        act = parser.acts[name[:-5]]
        for a in act.values():
            actions.update(set(list(a)))
    actions = list(actions)
    actions.sort()
    for a in actions:
        print(a)

list_actions()

 
A
Attraction-Inform
Attraction-NoOffer
Attraction-Recommend
Attraction-Request
Attraction-Select
Booking-Book
Booking-Inform
Booking-NoBook
Booking-Request
Hotel-Inform
Hotel-NoOffer
Hotel-Recommend
Hotel-Request
Hotel-Select
N
Restaurant-Inform
Restaurant-NoOffer
Restaurant-Recommend
Restaurant-Request
Restaurant-Select
Taxi-Inform
Taxi-Request
Train-Inform
Train-NoOffer
Train-OfferBook
Train-OfferBooked
Train-Request
Train-Select
a
general-bye
general-greet
general-reqmore
general-welcome
i
n
o
t
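
If only real action names are wanted, string-valued turns can be skipped before the set is updated. The sketch below assumes `acts` maps dialogue name to turn to either a dict of actions or a plain string; `collect_action_names` and `toy_acts` are hypothetical names, not part of the parser.

```python
def collect_action_names(acts):
    """Collect dialogue-act names, skipping turns that are not dicts."""
    names = set()
    for dialogue in acts.values():
        for turn in dialogue.values():
            # Unannotated turns hold a plain string; iterating it yields characters.
            if isinstance(turn, dict):
                names.update(turn)
    return sorted(names)


# Toy data shaped like `parser.acts` (assumed layout: dialogue -> turn -> acts).
toy_acts = {
    "SNG0001": {
        "1": {"Taxi-Request": [["Leave", "?"]]},
        "2": "No Annotation",
        "3": {"general-bye": [["none", "none"]]},
    }
}
print(collect_action_names(toy_acts))
```

With the real data this would presumably be called as `collect_action_names(parser.acts)`.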


## Parsing examples

### Parsing actions

In [8]:
test_actions = {
    "Booking-Inform": [["none", "none"]],
    "Hotel-Inform": [["Price", "cheap"], ["Choice", "1"], ["Parking", "none"]],
}

In [9]:
parsed_actions, problems, _ = parser.parse_action(test_actions)
print(parsed_actions)

   - inform_booking
   - inform_choice_parking_price_hotel



In [10]:
problems

[]
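
The two names printed above suggest a naming scheme: the act type comes first, then the slot names in lower case and alphabetical order (with the `none` placeholder dropped), then the domain. The following is a minimal sketch of that scheme as inferred from the output only; `action_name` is a hypothetical helper and may differ from what `parser.parse_action` does internally.

```python
def action_name(act, slot_pairs):
    """Build a name such as 'inform_price_hotel' from 'Hotel-Inform'."""
    domain, act_type = act.lower().split("-")
    # Keep each slot once, drop the 'none' placeholder, and sort alphabetically.
    slots = sorted({slot.lower() for slot, _ in slot_pairs} - {"none"})
    return "_".join([act_type, *slots, domain])


test_actions = {
    "Booking-Inform": [["none", "none"]],
    "Hotel-Inform": [["Price", "cheap"], ["Choice", "1"], ["Parking", "none"]],
}
for act, slot_pairs in test_actions.items():
    print(action_name(act, slot_pairs))
```

For the two test actions this reproduces `inform_booking` and `inform_choice_parking_price_hotel`.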

### Parsing whole stories

In [11]:
story, problems = parser.parse_story(parser.story_names[4], verbose=2)

## story_SNG0073
U:  I would like a taxi from Saint John's college to Pizza Hut Fen Ditton.
* inform{"taxi_destination": "pizza hut fen ditton", "taxi_departure": "saint john's college"}
W:  What time do you want to leave and what time do you want to arrive by?
   - request_arrive_leave_taxi
U:  I want to leave after 17:15.
* inform{"taxi_leaveAt": "17:15"}
   - action_book_taxi
   - slot{"taxi_phone": "07218068540", "taxi_type": "blue honda"}
W:  
Booking completed! your taxi will be blue honda Contact number is 07218068540
   - inform_car_phone_taxi
U:  Thank you for all the help! I appreciate it.
* chitchat
W:  You are welcome.  Is there anything else I can help you with today?
   - reqmore_general
U:  No, I am all set.  Have a nice day.  Bye.
* bye
W:  you too! thank you
   - bye_general


In [12]:
story, problems = parser.parse_story(parser.story_names[13], verbose=2)

## story_MUL2261
U:  I am looking for information about the City Centre North B and B hotel.  Can you help me with that?
* inform{"hotel_name": "city centre north b and b"}
W:  Yes! City Centre North B and B is an inexpensively-priced guesthouse in the north of town at 328A Histon Road. Would you like to book a room?
   - inform_addr_area_name_price_type_hotel
   - inform_booking
U:  Yes can I book it for 6 people starting on Saturday for 4 nights please?
* inform{"hotel_stay": "4", "hotel_day": "saturday", "hotel_people": "6"}
W:  They weren't able to accommodate those parameters. Would you like to try a shorter stay?
   - nobook_booking
   - request_stay_booking
U:  Sure.  Let's try 2 nights.  Thanks. I'll need a reference number, too, please.
* inform{"hotel_stay": "2", "hotel_pricerange": "cheap"}
   - action_book_hotel
   - slot{"hotel_name": "city centre north b and b", "hotel_reference": "0FAFY5PP"}
W:  The booking for City

In [13]:
problems

[]

No problems are reported. However, the user utterance _"Sure.  Let's try 2 nights.  Thanks. I'll need a reference number, too, please."_ is parsed as `* inform{"hotel_stay": "2", "hotel_pricerange": "cheap"}`, which includes a price range the user never mentioned.
To see why, let's print the metadata structure before and after this utterance.

In [14]:
def print_structure(ds, name="root", level=0):
    indent = " " * level
    if type(ds) is str:
        if ds.strip() != "" and ds.strip() != "not mentioned":
            print(f"{indent}{name} > {ds.strip()}")
    else:
        print(f"{indent}{name}")
        if type(ds) is dict:
            for k in ds.keys():
                print_structure(ds[k], k, level+2)
        elif type(ds) is list:
            for k in range(len(ds)):
                print_structure(ds[k], str(k), level+2)
        else:
            raise ValueError("Bad type")

In [15]:
print_structure(parser.data[parser.story_names[13]]["log"][3]["metadata"])
print()
print(parser.data[parser.story_names[13]]["log"][4]["text"])
print()
print_structure(parser.data[parser.story_names[13]]["log"][5]["metadata"])

root
  taxi
    book
      booked
    semi
  police
    book
      booked
    semi
  restaurant
    book
      booked
    semi
  hospital
    book
      booked
    semi
  hotel
    book
      booked
      stay > 4
      day > saturday
      people > 6
    semi
      name > city centre north b and b
  attraction
    book
      booked
    semi
  train
    book
      booked
    semi

Sure.  Let's try 2 nights.  Thanks. I'll need a reference number, too, please.

root
  taxi
    book
      booked
    semi
  police
    book
      booked
    semi
  restaurant
    book
      booked
    semi
  hospital
    book
      booked
    semi
  hotel
    book
      booked
        0
          name > city centre north b and b
          reference > 0FAFY5PP
      stay > 2
      day > saturday
      people > 6
    semi
      name > city centre north b and b
      pricerange > cheap
  attraction
    book
      booked
    semi
  train
    book
      booked
    semi


So this is labeling noise.
The entry `pricerange > cheap` was added to the metadata at this turn, even though the user said nothing about the price.
The Wizard, however, had given this information to the user at the beginning of the conversation.
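
The manual comparison of the two printouts can also be done programmatically. Here is a minimal sketch that flattens two metadata structures into path/value pairs and reports the paths whose values differ; `flatten` and `changed_slots` are hypothetical helpers, and the toy dictionaries only mimic the hotel branch shown above.

```python
def flatten(ds, prefix=""):
    """Map a nested dict/list structure to {path: value} for non-empty strings."""
    if isinstance(ds, str):
        value = ds.strip()
        return {prefix: value} if value not in ("", "not mentioned") else {}
    # Only str, dict and list nodes are expected, as in `print_structure`.
    items = ds.items() if isinstance(ds, dict) else enumerate(ds)
    flat = {}
    for key, child in items:
        flat.update(flatten(child, f"{prefix}/{key}"))
    return flat


def changed_slots(before, after):
    """Return {path: (old, new)} for every path whose value differs."""
    old, new = flatten(before), flatten(after)
    return {
        path: (old.get(path), new.get(path))
        for path in sorted(old.keys() | new.keys())
        if old.get(path) != new.get(path)
    }


# Toy metadata mirroring part of the hotel branch (values are illustrative).
before = {"hotel": {"book": {"booked": [], "stay": "4"},
                    "semi": {"pricerange": "not mentioned"}}}
after = {"hotel": {"book": {"booked": [], "stay": "2"},
                   "semi": {"pricerange": "cheap"}}}
print(changed_slots(before, after))
```

On the real data, the two arguments would be the `metadata` entries of log items 3 and 5 of the story.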

Here are a few more parsed stories.

The first one shows how little information the annotations really provide about what the user says:

In [16]:
parser.parse_story(parser.story_names[101], verbose=2);

## story_SNG0385
U:  I'd like to find a train that leaves Monday and arrives by 19:00.
* inform{"train_day": "monday", "train_arriveBy": "19:00"}
W:  Very well. Where are you traveling from, and to where?
   - request_depart_dest_train
U:  I am departing from birmingham new street, going to cambridge.
* inform{"train_destination": "cambridge", "train_departure": "birmingham new street"}
W:  I have a train that departs at 15:40 and arrives in Cambridge at 18:23. Will that do?
   - inform_arrive_dest_leave_train
U:  That would work just fine. Could you tell me the price and travel time for that train? 
* chitchat
W:  The TR5385 has a travel time of 163 minutes. Would you like me to book you some tickets?
   - inform_id_time_train
   - offerbook_train
U:  No, I don't want to book today.  Could you just give me the price for that train?  I think that is all I need today.
* chitchat
W:  Sure.  The price of those tick

This can be especially problematic when the user is confused and the utterance has to be labeled as "chitchat".
Also, at the end of the following dialogue, the Wizard seems to give the wrong phone number.

In [17]:
parser.parse_story(parser.story_names[201], verbose=2);

## story_PMUL1296
U:  Hi, can you help me find a train on Monday, please?
* inform{"train_day": "monday"}
W:  Sure, where will you be departing from?
   - request_depart_train
U:  I'm departing from leicester.
* inform{"train_departure": "leicester"}
W:  There are many trains available on that day. At what time will you be traveling?
   - inform_choice_train
   - request_leave_train
U:  I will be leaving by 5:15.
* inform{"train_leaveAt": "05:15"}
W:  what is your destination?
   - request_dest_train
U:  It should go to leicester
* chitchat
W:  Did you mean departing from Leicester and going to Cambridge?  We do not have trains from Leicester to Leicester.  
   - nooffer_depart_dest_train
   - request_depart_dest_train
U:  sorry, im going from cambridge to leicester
* inform{"train_destination": "leicester", "train_departure": "cambridge"}
W:  TR8631 will leave at 5:21 and arrive by 7:06. Woul

This Wizard is very eager to sell train tickets...

In [18]:
parser.parse_story(parser.story_names[400], verbose=2);

## story_PMUL1021
U:  I need a train from  london liverpool street to cambridge.
* inform{"train_destination": "cambridge", "train_departure": "london liverpool street"}
W:  There are many trains. What day and what time?
   - request_day_leave_train
U:  I would like to leave on Friday after 21:45 please. 
* inform{"train_leaveAt": "21:45", "train_day": "friday"}
W:  Train TR6769 fits what you're looking for. Would you like me to book you passage?
   - inform_id_train
   - offerbook_train
U:  No, I just need to know the travel time of the train at the moment.
* bye
W:  I will book it for you,is there anything else I can do for you ?
   - offerbook_train
   - reqmore_general


Here is a conversation where everything went well:

In [19]:
parser.parse_story(parser.story_names[2597], verbose=2);

## story_SNG01775
U:  help me get a taxi to the cambridge museum of technology please
* inform{"taxi_destination": "cambridge museum of technology"}
W:  Sure. What is your departure site?
   - request_depart_taxi
U:  I'll be departing from rice boat
* inform{"taxi_departure": "rice boat"}
W:  Okay. Can you give me a time that you'd prefer to leave after or arrive by?
   - request_arrive_leave_taxi
U:  Have the taxi leave after 13:45 please
* inform{"taxi_leaveAt": "300"}
   - action_book_taxi
   - slot{"taxi_phone": "07774899596", "taxi_type": "white audi"}
W:  Booking completed!
Booked car type	:	white audi
Contact number	:	07774899596
   - inform_car_phone_taxi


Now, we **parse all the 'good' stories into a file** for training and testing (good = no problems found).
We also create separate files containing only the stories that appear in the validation and test lists, respectively.
Along the way, we count how many dialogues have no problems.

In [20]:
all_problems = []
count = 0
count_good = 0

# Write all stories
with open("stories_all.md", "w") as story_file:
    with open("stories_vallist.md", "w") as val_file:
        with open("stories_testlist.md", "w") as test_file:
            for name in parser.story_names:
                story, problems = parser.parse_story(name)
                count += 1
                if len(problems) == 0:
                    count_good += 1
                    story_file.write(story)
                else:
                    all_problems += problems

                if name in parser.validation_list:
                    val_file.write(story)

                if name in parser.test_list:
                    test_file.write(story)

# Inform about number of bad stories
assert count == len(parser.story_names)
print(f"{count_good}/{count} = {100.0 * count_good / count:.2f}%")

9074/10438 = 86.93%


Write the domain file for Rasa Core.

In [21]:
import domain_info

In [22]:
with open("domain.yml", "w") as domain_file:

    # Intents
    domain_file.write("intents:\n")
    domain_file.write("  - inform\n")
    domain_file.write("  - chitchat\n")
    domain_file.write("  - bye\n")
    domain_file.write("\n")
    
    # Entities
    domain_file.write("entities:\n")
    for slot in domain_info.slots:
        domain_file.write(f"  - {slot}\n")
    domain_file.write("\n")    

    # Actions
    domain_file.write("actions:\n")
    for a in domain_info.actions:
        domain_file.write(f"  - {a}\n")
    domain_file.write("\n")

    # Templates
    domain_file.write("templates:\n")
    for a in domain_info.actions:
        domain_file.write(f"  {a}:\n")
        domain_file.write(f'  - text: "{a}"\n')
    domain_file.write("\n")

    # Slots
    domain_file.write("slots:\n")
    for slot in domain_info.slots:
        domain_file.write(f"  {slot}:\n")
        domain_file.write("    type: text\n")

The number of problems encountered is:

In [23]:
len(all_problems)

1924

Let's see what the problems are.

In [24]:
def tally(data):
    """Tally identical elements in the list `data`"""
    counts = {}
    for datum in data:
        if datum in counts:
            counts[datum] += 1
        else:
            counts[datum] = 1
    return counts

In [25]:
tally([p["type"] for p in all_problems])

{'no_annotation': 1904, 'no_action': 16, 'long_baseline': 4}
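
The same tally can be obtained with `collections.Counter` from the standard library, which also offers `most_common()` for sorting by frequency. A minimal sketch on a made-up list (`problem_types` is illustrative data, not taken from the parser):

```python
from collections import Counter

# Illustrative input; in the notebook this would be
# [p["type"] for p in all_problems].
problem_types = ["no_annotation", "no_action", "no_annotation"]

counts = Counter(problem_types)
print(counts.most_common())
```

Since `Counter` is a `dict` subclass, `dict(counts)` gives the same kind of mapping that `tally` returns.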

So almost all problems (1904 of 1924) occur because some turns are not annotated, and in 16 cases an action annotation for the Wizard is missing.
In addition, there are 4 cases where information was deleted from the slots.

In [26]:
[p for p in all_problems if p["type"] == "long_baseline"]

[{'type': 'long_baseline',
  'branch': '_taxi_book_booked',
  'baseline': [{'phone': '07715015033', 'type': 'white tesla'}],
  'dataset': []},
 {'type': 'long_baseline',
  'branch': '_taxi_book_booked',
  'baseline': [{'phone': '07383383242', 'type': 'yellow bmw'}],
  'dataset': []},
 {'type': 'long_baseline',
  'branch': '_restaurant_book_booked',
  'baseline': [{'name': 'da vinci pizzeria', 'reference': 'QLKCDA7V'}],
  'dataset': []},
 {'type': 'long_baseline',
  'branch': '_taxi_book_booked',
  'baseline': [{'phone': '07410117478', 'type': 'black volvo'}],
  'dataset': []}]

Let's find the dialogues in which data were deleted.

In [27]:
for name in parser.story_names:
    story, problems = parser.parse_story(name)
    if len(problems) > 0:
        for problem in problems:
            if problem["type"] == "long_baseline":
                print(name)
                break

PMUL2430.json
PMUL0382.json
PMUL0181.json


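All the dialogues in which data were deleted (the `long_baseline` problem) also contain a missing annotation, as the following three cells show.
These dialogues are excluded for that reason anyway, so the parser is complete for all fully annotated dialogues.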

In [28]:
parser.parse_story("PMUL2430.json")[1]

[{'type': 'long_baseline',
  'branch': '_taxi_book_booked',
  'baseline': [{'phone': '07715015033', 'type': 'white tesla'}],
  'dataset': []},
 {'type': 'no_annotation'}]

In [29]:
parser.parse_story("PMUL0382.json")[1]

[{'type': 'no_annotation'},
 {'type': 'long_baseline',
  'branch': '_taxi_book_booked',
  'baseline': [{'phone': '07383383242', 'type': 'yellow bmw'}],
  'dataset': []},
 {'type': 'long_baseline',
  'branch': '_restaurant_book_booked',
  'baseline': [{'name': 'da vinci pizzeria', 'reference': 'QLKCDA7V'}],
  'dataset': []}]

In [30]:
parser.parse_story("PMUL0181.json")[1]

[{'type': 'long_baseline',
  'branch': '_taxi_book_booked',
  'baseline': [{'phone': '07410117478', 'type': 'black volvo'}],
  'dataset': []},
 {'type': 'no_annotation'}]
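
The three checks above can be folded into a single set comparison. The sketch below assumes the problems of each story have been collected once into a dict, for example via `{name: parser.parse_story(name)[1] for name in parser.story_names}`; `stories_with` and the story names `SNG0001.json` and `SNG0002.json` are hypothetical.

```python
def stories_with(problems_by_story, problem_type):
    """Names of stories that report at least one problem of the given type."""
    return {
        name
        for name, problems in problems_by_story.items()
        if any(p["type"] == problem_type for p in problems)
    }


# Hypothetical cache of problems per story (toy data).
problems_by_story = {
    "PMUL2430.json": [{"type": "long_baseline"}, {"type": "no_annotation"}],
    "SNG0001.json": [{"type": "no_annotation"}],
    "SNG0002.json": [],
}
deleted = stories_with(problems_by_story, "long_baseline")
unannotated = stories_with(problems_by_story, "no_annotation")

# True if every story with deleted data also has a missing annotation.
print(deleted <= unannotated)
```

Caching the problems this way would also avoid parsing every story a second time just to look up its name.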