In [1]:
import matchzoo as mz
import pandas as pd

Using TensorFlow backend.


# Data Pack

## Structure

`matchzoo.DataPack` is a MatchZoo native data structure that most MatchZoo data handling processes build upon. A `matchzoo.DataPack` consists of three parts: `left`, `right` and `relation`, each of which is a `pandas.DataFrame`.

In [2]:
data_pack = mz.datasets.toy.load_data()

In [3]:
data_pack.left.head()

Unnamed: 0_level_0,text_left
id_left,Unnamed: 1_level_1
Q1,how are glacier caves formed?
Q2,How are the directions of the velocity and for...
Q5,how did apollo creed die
Q6,how long is the term for federal judges
Q7,how a beretta model 21 pistols magazines works


In [4]:
data_pack.right.head()

Unnamed: 0_level_0,text_right
id_right,Unnamed: 1_level_1
D1-0,A partly submerged glacier cave on Perito More...
D1-1,The ice facade is approximately 60 m high
D1-2,Ice formations in the Titlis glacier cave
D1-3,A glacier cave is a cave formed within the ice...
D1-4,"Glacier caves are often called ice caves , but..."


In [5]:
data_pack.relation.head()

Unnamed: 0,id_left,id_right,label
0,Q1,D1-0,0.0
1,Q1,D1-1,0.0
2,Q1,D1-2,0.0
3,Q1,D1-3,1.0
4,Q1,D1-4,0.0


The main reason for using a `matchzoo.DataPack` instead of a single `pandas.DataFrame` is efficiency: each distinct text is stored once and processed once, which saves both the space and the time that duplicate texts would otherwise cost.
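
To make the saving concrete, here is a small plain-Python sketch. It is not MatchZoo code: the dicts and list named `left`, `right` and `relation` only mirror the three parts described above, and the texts are shortened stand-ins. It compares how many characters a flat table would hold against the three-part layout.

```python
# Conceptual sketch only: plain dicts/lists standing in for the three parts.
left = {'Q1': 'how are glacier caves formed?'}
right = {
    'D1-0': 'A partly submerged glacier cave',
    'D1-1': 'The ice facade is approximately 60 m high',
    'D1-2': 'Ice formations in the Titlis glacier cave',
}
relation = [('Q1', 'D1-0', 0.0), ('Q1', 'D1-1', 0.0), ('Q1', 'D1-2', 0.0)]

# A flat table repeats the left text once per relation row.
flat_rows = [(left[l], right[r], label) for l, r, label in relation]
flat_chars = sum(len(tl) + len(tr) for tl, tr, _ in flat_rows)

# The three-part layout stores each distinct text exactly once.
packed_chars = sum(map(len, left.values())) + sum(map(len, right.values()))

print(flat_chars, packed_chars)
```

With one question matched against three candidates, the flat layout carries two extra copies of the question text. The same ratio applies to any per-text processing step.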

## FrameView

However, since a single big table is easier to understand and manage, `matchzoo.DataPack` provides `frame`, which merges the three parts into one `pandas.DataFrame` when called.

In [6]:
data_pack.frame().head()

Unnamed: 0,id_left,text_left,id_right,text_right,label
0,Q1,how are glacier caves formed?,D1-0,A partly submerged glacier cave on Perito More...,0.0
1,Q1,how are glacier caves formed?,D1-1,The ice facade is approximately 60 m high,0.0
2,Q1,how are glacier caves formed?,D1-2,Ice formations in the Titlis glacier cave,0.0
3,Q1,how are glacier caves formed?,D1-3,A glacier cave is a cave formed within the ice...,1.0
4,Q1,how are glacier caves formed?,D1-4,"Glacier caves are often called ice caves , but...",0.0


Notice that `frame` is not a method, but a property that returns a `matchzoo.DataPack.FrameView` object.

In [7]:
type(data_pack.frame)

matchzoo.data_pack.data_pack.DataPack.FrameView

This view reflects changes in the data pack, and can be called to create a `pandas.DataFrame` at any time.

In [8]:
frame = data_pack.frame
data_pack.relation['label'] = data_pack.relation['label'] + 1

In [9]:
frame().head()

Unnamed: 0,id_left,text_left,id_right,text_right,label
0,Q1,how are glacier caves formed?,D1-0,A partly submerged glacier cave on Perito More...,1.0
1,Q1,how are glacier caves formed?,D1-1,The ice facade is approximately 60 m high,1.0
2,Q1,how are glacier caves formed?,D1-2,Ice formations in the Titlis glacier cave,1.0
3,Q1,how are glacier caves formed?,D1-3,A glacier cave is a cave formed within the ice...,2.0
4,Q1,how are glacier caves formed?,D1-4,"Glacier caves are often called ice caves , but...",1.0
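
One way to picture this behavior is a property that returns a small callable object holding a reference to its owner. The sketch below is a toy model under that assumption. The `Pack` class is hypothetical and is not MatchZoo's implementation; it only reuses the `FrameView` name for readability.

```python
class Pack:
    """Toy stand-in for a data pack; not MatchZoo's implementation."""

    class FrameView:
        def __init__(self, pack):
            # Keep a reference to the pack instead of copying its data.
            self._pack = pack

        def __call__(self):
            # Build a fresh list of rows from the pack's current state.
            return [dict(row) for row in self._pack.relation]

    def __init__(self, relation):
        self.relation = relation

    @property
    def frame(self):
        # Property access returns the view; calling the view builds the table.
        return Pack.FrameView(self)


pack = Pack([{'id_left': 'Q1', 'id_right': 'D1-0', 'label': 0.0}])
view = pack.frame                 # property access, nothing is built yet
pack.relation[0]['label'] += 1    # change the pack after taking the view
rows = view()                     # the call sees the updated label
```

Because the view only stores a reference, nothing is copied until it is called, so later changes to the pack show up in the next call.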


## Slicing a DataPack

You may use `[]` to slice a `matchzoo.DataPack`, much like slicing a `list`. As with a `list`, the result is a shallow copy of the sliced data.

In [10]:
data_slice = data_pack[5:10]

A sliced data pack's `relation` will directly reflect the slicing.

In [11]:
data_slice.relation

Unnamed: 0,id_left,id_right,label
0,Q2,D2-0,1.0
1,Q2,D2-1,1.0
2,Q2,D2-2,1.0
3,Q2,D2-3,1.0
4,Q2,D2-4,1.0


In addition, `left` and `right` will be processed so that only relevant information is kept.

In [12]:
data_slice.left

Unnamed: 0_level_0,text_left
id_left,Unnamed: 1_level_1
Q2,How are the directions of the velocity and for...


In [13]:
data_slice.right

Unnamed: 0_level_0,text_right
id_right,Unnamed: 1_level_1
D2-0,"In physics , circular motion is a movement of ..."
D2-1,"It can be uniform, with constant angular rate ..."
D2-2,The rotation around a fixed axis of a three-di...
D2-3,The equations of motion describe the movement ...
D2-4,Examples of circular motion include: an artifi...
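
The effect shown in the last three cells can be sketched in plain Python: slice the relation rows first, then keep only the left and right texts that the surviving rows reference. The `slice_pack` helper and its toy data are hypothetical and only illustrate the idea. They are not how MatchZoo implements slicing.

```python
def slice_pack(left, right, relation, start, stop):
    """Slice relation rows, then keep only the texts those rows reference."""
    rel = relation[start:stop]
    left_ids = {l for l, _, _ in rel}
    right_ids = {r for _, r, _ in rel}
    return (
        {k: v for k, v in left.items() if k in left_ids},
        {k: v for k, v in right.items() if k in right_ids},
        rel,
    )


left = {'Q1': 'first question', 'Q2': 'second question'}
right = {'D1-0': 'a', 'D1-1': 'b', 'D2-0': 'c', 'D2-1': 'd'}
relation = [
    ('Q1', 'D1-0', 0), ('Q1', 'D1-1', 1),
    ('Q2', 'D2-0', 1), ('Q2', 'D2-1', 0),
]

# Rows 2..3 only mention Q2, so Q1 and its candidates drop out.
sub_left, sub_right, sub_relation = slice_pack(left, right, relation, 2, 4)
```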


It is also possible to slice a frame view object.

In [14]:
data_pack.frame[5:10]

Unnamed: 0,id_left,text_left,id_right,text_right,label
0,Q2,How are the directions of the velocity and for...,D2-0,"In physics , circular motion is a movement of ...",1.0
1,Q2,How are the directions of the velocity and for...,D2-1,"It can be uniform, with constant angular rate ...",1.0
2,Q2,How are the directions of the velocity and for...,D2-2,The rotation around a fixed axis of a three-di...,1.0
3,Q2,How are the directions of the velocity and for...,D2-3,The equations of motion describe the movement ...,1.0
4,Q2,How are the directions of the velocity and for...,D2-4,Examples of circular motion include: an artifi...,1.0


This is equivalent to slicing the data pack first and then building the frame, since both operations are driven by the rows of `relation`.

In [15]:
data_slice.frame() == data_pack.frame[5:10]

Unnamed: 0,id_left,text_left,id_right,text_right,label
0,True,True,True,True,True
1,True,True,True,True,True
2,True,True,True,True,True
3,True,True,True,True,True
4,True,True,True,True,True


Slicing is handy for partitioning data into training and test sets.

In [16]:
num_train = int(len(data_pack) * 0.8)
data_pack.shuffle(inplace=True)
train_slice = data_pack[:num_train]
test_slice = data_pack[num_train:]
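
The same shuffle-then-slice pattern, written against a plain list, shows that the two slices are disjoint and together cover every row. The fixed seed below is an assumption added for repeatability. It is not something the cell above sets.

```python
import random

rows = list(range(10))        # stand-in for ten relation rows
rng = random.Random(0)        # fixed seed: an assumption, for repeatability
rng.shuffle(rows)             # shuffle in place before splitting

num_train = int(len(rows) * 0.8)
train_rows, test_rows = rows[:num_train], rows[num_train:]
```

Shuffling before slicing matters here because the toy data is grouped by question, so an unshuffled split would cut along question boundaries.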

## Transforming Texts

Use `apply_on_text` to transform texts in a `matchzoo.DataPack`. Check the documentation for more information.

In [17]:
data_slice.apply_on_text(len).frame()

Processing text_left with len: 100%|██████████| 1/1 [00:00<00:00, 1730.32it/s]
Processing text_right with len: 100%|██████████| 5/5 [00:00<00:00, 9562.94it/s]


Unnamed: 0,id_left,text_left,id_right,text_right,label
0,Q2,85,D2-0,126,1.0
1,Q2,85,D2-1,128,1.0
2,Q2,85,D2-2,99,1.0
3,Q2,85,D2-3,78,1.0
4,Q2,85,D2-4,312,1.0


In [18]:
data_slice.apply_on_text(len, rename=('left_length', 'right_length')).frame()

Processing left_length with len: 100%|██████████| 1/1 [00:00<00:00, 1904.77it/s]
Processing right_length with len: 100%|██████████| 5/5 [00:00<00:00, 8456.26it/s]


Unnamed: 0,id_left,text_left,left_length,id_right,text_right,right_length,label
0,Q2,How are the directions of the velocity and for...,85,D2-0,"In physics , circular motion is a movement of ...",126,1.0
1,Q2,How are the directions of the velocity and for...,85,D2-1,"It can be uniform, with constant angular rate ...",128,1.0
2,Q2,How are the directions of the velocity and for...,85,D2-2,The rotation around a fixed axis of a three-di...,99,1.0
3,Q2,How are the directions of the velocity and for...,85,D2-3,The equations of motion describe the movement ...,78,1.0
4,Q2,How are the directions of the velocity and for...,85,D2-4,Examples of circular motion include: an artifi...,312,1.0
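
The two calls above differ only in where the result goes: without `rename` the text is overwritten, and with `rename` it is kept and the result lands in a new column. A plain-Python sketch of that distinction follows. The `apply_on_text_sketch` function and its dict-based tables are hypothetical and do not reproduce MatchZoo's implementation.

```python
def apply_on_text_sketch(left, right, func, rename=None):
    """Apply func to every text; with rename, add a new field instead of overwriting."""
    def transform(table, new_name):
        target = 'text' if new_name is None else new_name
        # Build new records so the input tables are left untouched.
        return {key: {**record, target: func(record['text'])}
                for key, record in table.items()}

    left_name, right_name = rename or (None, None)
    return transform(left, left_name), transform(right, right_name)


left = {'Q2': {'text': 'circular motion?'}}
right = {'D2-0': {'text': 'In physics'}, 'D2-1': {'text': 'It can be uniform'}}

overwritten = apply_on_text_sketch(left, right, len)
appended = apply_on_text_sketch(left, right, len,
                                rename=('left_length', 'right_length'))
```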


Since adding a column that holds the text length is quite common, you may simply do:

In [19]:
data_slice.append_text_length().frame()

Processing length_left with len: 100%|██████████| 1/1 [00:00<00:00, 1575.03it/s]
Processing length_right with len: 100%|██████████| 5/5 [00:00<00:00, 11037.64it/s]


Unnamed: 0,id_left,text_left,length_left,id_right,text_right,length_right,label
0,Q2,How are the directions of the velocity and for...,85,D2-0,"In physics , circular motion is a movement of ...",126,1.0
1,Q2,How are the directions of the velocity and for...,85,D2-1,"It can be uniform, with constant angular rate ...",128,1.0
2,Q2,How are the directions of the velocity and for...,85,D2-2,The rotation around a fixed axis of a three-di...,99,1.0
3,Q2,How are the directions of the velocity and for...,85,D2-3,The equations of motion describe the movement ...,78,1.0
4,Q2,How are the directions of the velocity and for...,85,D2-4,Examples of circular motion include: an artifi...,312,1.0


To one-hot encode the labels:

In [20]:
data_pack.relation['label'] = data_pack.relation['label'].astype(int)
data_pack.one_hot_encode_label(num_classes=3).frame().head()

Unnamed: 0,id_left,text_left,id_right,text_right,label
0,Q15,how are pointe shoes made,D15-3,Pointe shoes developed from the desire for dan...,"[0, 1, 0]"
1,Q12,how big did girl scout cookie boxes used to be,D12-6,"As of 2007, sales were estimated at about 200 ...","[0, 1, 0]"
2,Q6,how long is the term for federal judges,D6-1,In addition to the Supreme Court of the United...,"[0, 1, 0]"
3,Q13,how big is the purdue greek system,D13-1,"Purdue University, located in West Lafayette, ...","[0, 1, 0]"
4,Q12,how big did girl scout cookie boxes used to be,D12-0,A mound of Girl Scout cookies.,"[0, 1, 0]"
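
For reference, one-hot encoding turns an integer label into a vector with a single 1 at that index. The helper below is a minimal sketch in plain Python. The name `one_hot` is hypothetical and this is not the MatchZoo routine.

```python
def one_hot(label, num_classes):
    """Return a list of zeros with a 1 at position `label`."""
    vec = [0] * num_classes
    vec[label] = 1
    return vec


# Labels in this notebook are 1 or 2 after the earlier "+ 1" step,
# so three classes (indices 0, 1, 2) are needed to hold them.
encoded = [one_hot(label, num_classes=3) for label in [1, 2, 1]]
```

In this sketch a float label could not be used as a list index, which is presumably why the cell above casts the labels to `int` first.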


## Building Your Own DataPack

Use `matchzoo.pack` to build your own data pack. Check the documentation for more information.

In [21]:
data = pd.DataFrame({
    'text_left': list('ARSAARSA'),
    'text_right': list('arstenus')
})
my_pack = mz.pack(data)
my_pack.frame()

Unnamed: 0,id_left,text_left,id_right,text_right
0,L-0,A,R-0,a
1,L-1,R,R-1,r
2,L-2,S,R-2,s
3,L-0,A,R-3,t
4,L-0,A,R-4,e
5,L-1,R,R-5,n
6,L-2,S,R-6,u
7,L-0,A,R-2,s
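
In the output above, repeated texts share an id: every `A` is `L-0` and both `s` rows are `R-2`. A numbering rule consistent with that output is "assign the next id on first occurrence, reuse it afterwards". The sketch below implements that rule in plain Python. It is inferred from the output only, and `assign_ids` is a hypothetical helper, not a MatchZoo function.

```python
def assign_ids(texts, prefix):
    """Give each distinct text an id on first occurrence and reuse it afterwards."""
    seen = {}
    ids = []
    for text in texts:
        if text not in seen:
            seen[text] = f'{prefix}-{len(seen)}'
        ids.append(seen[text])
    return ids


left_ids = assign_ids('ARSAARSA', 'L')
right_ids = assign_ids('arstenus', 'R')
```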


## Unpack

`unpack` formats the data so that MatchZoo models can fit on it directly. For more details, consult `matchzoo/tutorials/models.ipynb`.

In [51]:
x, y = data_pack[:3].unpack()

In [52]:
x

{'id_left': array(['Q15', 'Q12', 'Q6'], dtype='<U3'),
 'text_left': array(['how are pointe shoes made',
        'how big did girl scout cookie boxes used to be',
        'how long is the term for federal judges'], dtype='<U46'),
 'id_right': array(['D15-3', 'D12-6', 'D6-1'], dtype='<U5'),
 'text_right': array(['Pointe shoes developed from the desire for dancers to appear weightless and sylph -like and have evolved to enable dancers to dance en pointe (on the tips of their toes) for extended periods of time.',
        'As of 2007, sales were estimated at about 200 million boxes per year.',
        'In addition to the Supreme Court of the United States , whose existence and some aspects of whose jurisdiction are beyond the constitutional power of Congress to alter, acts of Congress have established 13 courts of appeals (also called "circuit courts") with appellate jurisdiction over different regions of the United States, and 94 United States district courts .'],
       dtype='<U366')}

In [53]:
y

array([[1],
       [1],
       [1]])
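
The shape of the result above, a dict of per-column values for `x` and a column of labels for `y`, can be sketched in plain Python. This is only an illustration: `unpack_rows` is a hypothetical helper, the texts are shortened, and lists stand in for the NumPy arrays MatchZoo returns.

```python
def unpack_rows(rows):
    """Turn row dicts into (x, y): x maps column name to values, y holds labels."""
    feature_names = [name for name in rows[0] if name != 'label']
    x = {name: [row[name] for row in rows] for name in feature_names}
    y = [[row['label']] for row in rows]   # one single-element row per example
    return x, y


rows = [
    {'id_left': 'Q15', 'text_left': 'how are pointe shoes made',
     'id_right': 'D15-3', 'text_right': 'Pointe shoes developed ...', 'label': 1},
    {'id_left': 'Q12', 'text_left': 'how big did girl scout cookie boxes used to be',
     'id_right': 'D12-6', 'text_right': 'As of 2007, sales ...', 'label': 1},
    {'id_left': 'Q6', 'text_left': 'how long is the term for federal judges',
     'id_right': 'D6-1', 'text_right': 'In addition to ...', 'label': 1},
]
x, y = unpack_rows(rows)
```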

# Data Sets

MatchZoo incorporates various datasets that can be loaded as MatchZoo native data structures.

In [22]:
mz.datasets.list_available()

['toy', 'wiki_qa', 'embeddings', 'snli']

The toy dataset doesn't need to be downloaded and can be directly used. It's the best choice to get things rolling.

In [23]:
toy_train_rank = mz.datasets.toy.load_data()
toy_train_rank.frame().head()

Unnamed: 0,id_left,text_left,id_right,text_right,label
0,Q1,how are glacier caves formed?,D1-0,A partly submerged glacier cave on Perito More...,0.0
1,Q1,how are glacier caves formed?,D1-1,The ice facade is approximately 60 m high,0.0
2,Q1,how are glacier caves formed?,D1-2,Ice formations in the Titlis glacier cave,0.0
3,Q1,how are glacier caves formed?,D1-3,A glacier cave is a cave formed within the ice...,1.0
4,Q1,how are glacier caves formed?,D1-4,"Glacier caves are often called ice caves , but...",0.0


In [24]:
toy_train_classification, classes = mz.datasets.toy.load_data(stage='train', task='classification')
toy_train_classification.frame().head()

Unnamed: 0,id_left,text_left,id_right,text_right,label
0,Q1,how are glacier caves formed?,D1-0,A partly submerged glacier cave on Perito More...,"[1, 0]"
1,Q1,how are glacier caves formed?,D1-1,The ice facade is approximately 60 m high,"[1, 0]"
2,Q1,how are glacier caves formed?,D1-2,Ice formations in the Titlis glacier cave,"[1, 0]"
3,Q1,how are glacier caves formed?,D1-3,A glacier cave is a cave formed within the ice...,"[0, 1]"
4,Q1,how are glacier caves formed?,D1-4,"Glacier caves are often called ice caves , but...","[1, 0]"


In [25]:
classes

[False, True]

Other, larger datasets will be automatically downloaded the first time you use them. Run the following lines to trigger downloading.

In [26]:
wiki_dev_entailment_rank = mz.datasets.wiki_qa.load_data(stage='dev')
wiki_dev_entailment_rank.frame().head()

Unnamed: 0,id_left,text_left,id_right,text_right,label
0,D8,How are epithelial tissues joined together?,D8-0,Cross section of sclerenchyma fibers in plant ...,0
1,D8,How are epithelial tissues joined together?,D8-1,Microscopic view of a histologic specimen of h...,0
2,D8,How are epithelial tissues joined together?,D8-2,"In Biology , Tissue is a cellular organization...",0
3,D8,How are epithelial tissues joined together?,D8-3,A tissue is an ensemble of similar cells from ...,0
4,D8,How are epithelial tissues joined together?,D8-4,Organs are then formed by the functional group...,0


In [27]:
snli_test_classification, classes = mz.datasets.snli.load_data(stage='test', task='classification')
snli_test_classification.frame().head()

Unnamed: 0,id_left,text_left,id_right,text_right,label
0,L-0,This church choir sings to the masses as they ...,R-0,The church has cracks in the ceiling.,"[0, 0, 1, 0]"
1,L-0,This church choir sings to the masses as they ...,R-1,The church is filled with song.,"[1, 0, 0, 0]"
2,L-0,This church choir sings to the masses as they ...,R-2,A choir singing at a baseball game.,"[0, 1, 0, 0]"
3,L-1,"A woman with a green headscarf, blue shirt and...",R-3,The woman is young.,"[0, 0, 1, 0]"
4,L-1,"A woman with a green headscarf, blue shirt and...",R-4,The woman is very happy.,"[1, 0, 0, 0]"


In [28]:
classes

['entailment', 'contradiction', 'neutral', '-']
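
To read a one-hot label back as a class name, take the position of the 1 and look it up in `classes`. The sketch below assumes that positions in the one-hot vector line up with positions in `classes`. That is consistent with the rows shown above, but it is an assumption rather than something this notebook states. The `decode` helper is hypothetical.

```python
classes = ['entailment', 'contradiction', 'neutral', '-']


def decode(one_hot_label, classes):
    """Map a one-hot vector to its class name (assumes aligned positions)."""
    return classes[one_hot_label.index(1)]


# The first three labels from the SNLI frame shown above.
labels = [[0, 0, 1, 0], [1, 0, 0, 0], [0, 1, 0, 0]]
names = [decode(vec, classes) for vec in labels]
```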