In [4]:
import pandas as pd
import numpy as np
import sys

In [8]:
# multi-intent SNIPS dataset
# on_bad_lines="warn" replaces warn_bad_lines/error_bad_lines, which were removed in pandas 2.0
training_snips = pd.read_csv("./MixSNIPS_clean/train.txt", sep=" ",
                             on_bad_lines="warn", header=None)

In [9]:
training_snips

Unnamed: 0,0,1
0,play,O
1,isham,B-artist
2,jones,I-artist
3,and,O
4,swine,B-object_name
...,...,...
822203,a,B-object_name
822204,hat,I-object_name
822205,in,I-object_name
822206,time,I-object_name


In [10]:
training_snips.columns = ["utterance", "slot-label"]

In [11]:
training_snips.head(60)

Unnamed: 0,utterance,slot-label
0,play,O
1,isham,B-artist
2,jones,I-artist
3,and,O
4,swine,B-object_name
5,not,I-object_name
6,deserves,O
7,four,B-rating_value
8,points,B-rating_unit
9,PlayMusic#RateBook,


The table above is a sample of the multi-intent detection dataset. The left column holds the tokens of each sentence, and the right column holds the slot label of each token. Some rows have the form ...#...#... and no slot label: these are the intent labels of the sentence that precedes them. For example, "RateBook#SearchScreeningEvent" means the sentence carries 2 intents, RateBook and SearchScreeningEvent. Not every sentence has a multi-intent label: some sentences contain one intent, and others contain several.
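
To make the ...#...#... format concrete, here is a minimal sketch that splits one intent label into its individual intents. The helper name `split_intents` is hypothetical and not part of the dataset tooling; it only assumes that `#` is the separator, as in the rows shown above.

```python
from typing import List


def split_intents(label: str) -> List[str]:
    """Split a label such as 'PlayMusic#RateBook' into its individual intents."""
    # Assumption: '#' is only ever used as the separator between intents.
    return [part for part in label.split("#") if part]


print(split_intents("PlayMusic#RateBook"))   # ['PlayMusic', 'RateBook']
print(split_intents("SearchCreativeWork"))   # ['SearchCreativeWork']
```

A single-intent label comes back as a one-element list, so downstream code can treat both cases the same way.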

You might be curious why we have the slot labels. Traditionally, intent detection and slot filling were treated as independent tasks. More recently, joint models for intent classification and slot filling have achieved state-of-the-art performance and have shown that the two tasks are strongly related. Research in this area has passed three milestones so far: 1. intent detection, which identifies the speaker's intention; 2. slot filling, which labels each word token in the speech/text; 3. joint intent classification and slot filling.
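
A joint model needs each training example to carry both targets: one slot label per token and the intent(s) of the whole sentence. The sketch below is one possible way to regroup the flat token table into such records. The function name `group_sentences` and the record keys are hypothetical; it assumes the columns are named "utterance" and "slot-label" and that a row without a slot label closes the current sentence. The toy frame reuses the first ten rows shown earlier.

```python
import pandas as pd


def group_sentences(df):
    """Turn the flat token table into one record per sentence."""
    samples, tokens, slots = [], [], []
    for token, slot in zip(df["utterance"], df["slot-label"]):
        if pd.isna(slot):
            # A row with no slot label holds the intent label of the sentence so far.
            samples.append({
                "tokens": tokens,
                "slots": slots,
                "intents": str(token).split("#"),
            })
            tokens, slots = [], []
        else:
            tokens.append(token)
            slots.append(slot)
    return samples


# First sentence of the training file, as displayed in the table above.
toy = pd.DataFrame({
    "utterance": ["play", "isham", "jones", "and", "swine", "not",
                  "deserves", "four", "points", "PlayMusic#RateBook"],
    "slot-label": ["O", "B-artist", "I-artist", "O", "B-object_name",
                   "I-object_name", "O", "B-rating_value", "B-rating_unit", None],
})

for sample in group_sentences(toy):
    print(sample["tokens"], sample["slots"], sample["intents"])
```

Each record keeps the tokens and slot labels aligned by position, which is the shape most sequence-labelling code expects.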

In our case, I recommend performing intent classification and slot filling jointly to get the best performance on our task. First, however, we can explore how the multi-intent dataset is labelled.

In [12]:
def show_multi_intent(df):
    """Return the intent labels, i.e. the rows that have no slot label."""
    # Intent rows contain a single field, so pandas reads their slot label as NaN.
    is_intent_row = df["slot-label"].isna()
    return df.loc[is_intent_row, "utterance"].tolist()

In [13]:
# print first 50 intents
intents = show_multi_intent(training_snips)
print(intents[:50])

['PlayMusic#RateBook', 'PlayMusic#SearchCreativeWork', 'SearchCreativeWork', 'RateBook#SearchScreeningEvent', 'AddToPlaylist#SearchScreeningEvent', 'BookRestaurant#SearchScreeningEvent', 'AddToPlaylist#RateBook#SearchCreativeWork', 'BookRestaurant#GetWeather', 'BookRestaurant', 'SearchScreeningEvent', 'SearchScreeningEvent', 'RateBook', 'BookRestaurant#SearchCreativeWork', 'SearchScreeningEvent', 'AddToPlaylist#PlayMusic', 'BookRestaurant#PlayMusic', 'GetWeather#PlayMusic', 'GetWeather', 'PlayMusic#RateBook', 'RateBook', 'AddToPlaylist#SearchCreativeWork', 'BookRestaurant#GetWeather', 'AddToPlaylist', 'AddToPlaylist#BookRestaurant', 'AddToPlaylist#PlayMusic', 'AddToPlaylist#SearchCreativeWork', 'SearchCreativeWork', 'BookRestaurant', 'AddToPlaylist', 'BookRestaurant', 'AddToPlaylist#GetWeather#SearchScreeningEvent', 'AddToPlaylist#PlayMusic#SearchScreeningEvent', 'AddToPlaylist#GetWeather', 'PlayMusic', 'GetWeather#SearchCreativeWork', 'PlayMusic#SearchCreativeWork', 'SearchCreativeWor

In [14]:
# analysis on the snips training dataset
print("Total number of intents is {}.".format(len(intents)))
print("unique number of intents is {}.".format(len(set(intents))))

Total number of intents is 39776.
unique number of intents is 63.
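
The counts above treat each label combination as one value. To see how often each combination and each individual intent occurs, a frequency table can help. The following is a rough sketch using `collections.Counter`; the function name `intent_frequencies` is hypothetical, and the sample list is just the first four labels from the printed output above. On the real data you would pass the full `intents` list instead.

```python
from collections import Counter


def intent_frequencies(labels):
    """Count label combinations and the individual intents inside them."""
    combos = Counter(labels)
    # Split each combination on '#' so every individual intent is counted once per sentence.
    singles = Counter(intent for label in labels for intent in label.split("#"))
    return combos, singles


sample = [
    "PlayMusic#RateBook",
    "PlayMusic#SearchCreativeWork",
    "SearchCreativeWork",
    "RateBook#SearchScreeningEvent",
]
combos, singles = intent_frequencies(sample)
print(combos.most_common(3))
print(singles.most_common())
```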


Many intent labels repeat: the 39,776 labels take only 63 distinct values. How often each label occurs depends on how the dataset was constructed. If we want to construct a dataset of this kind ourselves, we should follow the approach shown in the next 2 sections.

In [15]:
utterance_intent_label_dat = training_snips["utterance"]

In [16]:
utterance_intent_label_dat[:50]

0                             play
1                            isham
2                            jones
3                              and
4                            swine
5                              not
6                         deserves
7                             four
8                           points
9               PlayMusic#RateBook
10                            play
11                             the
12                            last
13                            song
14                              by
15                          goldie
16                             and
17                            then
18                             can
19                             you
20                             get
21                              me
22                             the
23                          rakuen
24                        tsuihou:
25                        expelled
26                            from
27                        paradise
28                  