## Hierarchical Indexing (Multi-index)

Hierarchical indexing enables sophisticated data analysis and manipulation, especially when working with higher-dimensional data.

In essence, it lets you store and manipulate data with an arbitrary number of dimensions in lower-dimensional data structures like Series (1d) and DataFrame (2d).

The *MultiIndex* object is the hierarchical analogue of the standard *Index* object, which typically stores the axis labels in pandas objects. You can think of a *MultiIndex* as an array of tuples where each tuple is unique.
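
To make the "higher dimensions in a lower-dimensional structure" idea concrete, here is a minimal sketch that flattens a 3-d NumPy array into a Series with a three-level *MultiIndex*. The array `cube` and the level names `i`, `j`, `k` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Illustrative 3-d array with shape (2, 3, 4); the level names are hypothetical.
cube = np.arange(24).reshape(2, 3, 4)

# One index level per array axis; from_product varies the last level fastest,
# which matches the C-order layout produced by ravel().
idx = pd.MultiIndex.from_product(
    [range(2), range(3), range(4)], names=['i', 'j', 'k'])
s3d = pd.Series(cube.ravel(), index=idx)

# A full tuple of labels addresses a single cell of the original array.
print(s3d.loc[(1, 2, 3)] == cube[1, 2, 3])

# Moving one level to the columns gives a 2-d view of the same data.
wide = s3d.unstack('k')
print(wide.shape)
```

Each tuple in the index plays the role of a coordinate, so the Series stays one-dimensional while still addressing three axes.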

In [1]:
import pandas as pd
import numpy as np

arrays = [['bar', 'bar', 'cat', 'cat', 'dog', 'dog', 'ant', 'ant'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
print(arrays)
tuples = list(zip(*arrays))
tuples

[['bar', 'bar', 'cat', 'cat', 'dog', 'dog', 'ant', 'ant'], ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]


[('bar', 'one'),
 ('bar', 'two'),
 ('cat', 'one'),
 ('cat', 'two'),
 ('dog', 'one'),
 ('dog', 'two'),
 ('ant', 'one'),
 ('ant', 'two')]

In [2]:
# Create MultiIndex from tuples
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
index

MultiIndex(levels=[['ant', 'bar', 'cat', 'dog'], ['one', 'two']],
           labels=[[1, 1, 2, 2, 3, 3, 0, 0], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=['first', 'second'])

In [3]:
s = pd.Series(np.random.randn(8), index=index)
s

first  second
bar    one      -0.725250
       two      -0.282291
cat    one      -0.219755
       two      -1.588924
dog    one       0.108462
       two      -0.730629
ant    one      -0.406205
       two       0.214514
dtype: float64

In [4]:
# Create MultiIndex from every pairing of the elements
arrays2 = [['bar', 'cat', 'dog', 'ant'],
           ['one', 'two']]
s2 = pd.Series(np.random.randn(8), 
               index=pd.MultiIndex.from_product(arrays2, names=['first', 'second']))
s2

first  second
bar    one      -1.533847
       two       1.753168
cat    one      -0.703716
       two       0.650929
dog    one       0.344623
       two      -1.948729
ant    one      -0.799954
       two       1.326959
dtype: float64

In [5]:
# A list of arrays passed as the index is converted to a MultiIndex automatically
arrays3 = [np.array(['bar', 'bar', 'cat', 'cat', 'dog', 'dog', 'ant', 'ant']),
           np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]
s3 = pd.Series(np.random.randn(8), index=arrays3)
print(s3)

df3 = pd.DataFrame(np.random.randn(8, 4), index=arrays3)
df3

bar  one    0.708273
     two    0.310426
cat  one   -0.295035
     two    1.012179
dog  one    1.066491
     two   -1.093649
ant  one   -0.083057
     two    0.707237
dtype: float64


                0         1         2         3
bar one -0.909474  1.499622  0.821909 -1.122663
    two -0.387693  2.249439  0.658162  0.419776
cat one -0.129603  0.242902  0.738085 -0.252586
    two -1.432529  1.773576  0.352101 -0.464754
dog one  0.025586 -1.951273 -0.973766  0.044732
    two  0.696683 -1.193574  1.560547 -0.654714
ant one  0.892502 -0.940173  0.291252  1.636593
    two  0.705944  0.430051  0.880048  0.246392


In [6]:
# Retrieve the level names of each two-level index
print(s.index.names)
print(s2.index.names)
print(s3.index.names)
print(df3.index.names)

['first', 'second']
['first', 'second']
[None, None]
[None, None]


The *MultiIndex* matters because it lets you perform grouping, selection, and reshaping operations, as described below and in subsequent areas of the documentation.

As you will see in later sections, you can find yourself working with hierarchically indexed data without ever creating a *MultiIndex* explicitly. However, when loading data from a file, you may wish to generate your own *MultiIndex* while preparing the dataset.
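
As a rough sketch of building a *MultiIndex* while loading data, the example below reads a small CSV and turns two of its columns into index levels. The CSV text, its column names, and the variable names `df_a` and `df_b` are made up for illustration; a real workflow would pass a file path instead of an in-memory buffer.

```python
import io
import pandas as pd

# Hypothetical file contents, held in memory so the example is self-contained.
csv_text = """first,second,value
bar,one,1.0
bar,two,2.0
cat,one,3.0
cat,two,4.0
"""

# Option 1: ask read_csv to build the MultiIndex from the first two columns.
df_a = pd.read_csv(io.StringIO(csv_text), index_col=[0, 1])

# Option 2: load the data flat, then promote columns to index levels.
df_b = pd.read_csv(io.StringIO(csv_text)).set_index(['first', 'second'])

print(list(df_a.index.names))
print(df_a.index.equals(df_b.index))
print(df_a.loc[('cat', 'two'), 'value'])
```

Both routes should end with the same two-level index; which one is more convenient depends on whether the columns need cleaning before they become index levels.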

In [7]:
# display.multi_sparse controls whether repeated outer labels are shown (False) or left blank (True, the default)
pd.set_option('display.multi_sparse', False)
print(df3)
pd.set_option('display.multi_sparse', True)
print(df3)

                0         1         2         3
bar one -0.909474  1.499622  0.821909 -1.122663
bar two -0.387693  2.249439  0.658162  0.419776
cat one -0.129603  0.242902  0.738085 -0.252586
cat two -1.432529  1.773576  0.352101 -0.464754
dog one  0.025586 -1.951273 -0.973766  0.044732
dog two  0.696683 -1.193574  1.560547 -0.654714
ant one  0.892502 -0.940173  0.291252  1.636593
ant two  0.705944  0.430051  0.880048  0.246392
                0         1         2         3
bar one -0.909474  1.499622  0.821909 -1.122663
    two -0.387693  2.249439  0.658162  0.419776
cat one -0.129603  0.242902  0.738085 -0.252586
    two -1.432529  1.773576  0.352101 -0.464754
dog one  0.025586 -1.951273 -0.973766  0.044732
    two  0.696683 -1.193574  1.560547 -0.654714
ant one  0.892502 -0.940173  0.291252  1.636593
    two  0.705944  0.430051  0.880048  0.246392


#### Reconstructing the level labels
The method *get_level_values* returns a vector of the labels for each location at a particular level.

In [8]:
df3.index.get_level_values(0)

Index(['bar', 'bar', 'cat', 'cat', 'dog', 'dog', 'ant', 'ant'], dtype='object')

In [9]:
s2.index.get_level_values('second')

Index(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'], dtype='object', name='second')

## Basic Indexing using MultiIndex

One of the important features of hierarchical indexing is that you can select data by a 'partial' label identifying a subgroup in the data.

Partial selection "drops" levels of the hierarchical index in the result in a completely analogous way to selecting a column in a regular DataFrame.


In [10]:
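print(df3)
# .ix has been removed from pandas; .loc performs the same label-based selection
print(df3.loc['bar', [0, 3]])
print("\n")
# The remaining inner level can then be indexed in turn
print(df3.loc['bar', [0, 3]].loc['two'])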

                0         1         2         3
bar one -0.909474  1.499622  0.821909 -1.122663
    two -0.387693  2.249439  0.658162  0.419776
cat one -0.129603  0.242902  0.738085 -0.252586
    two -1.432529  1.773576  0.352101 -0.464754
dog one  0.025586 -1.951273 -0.973766  0.044732
    two  0.696683 -1.193574  1.560547 -0.654714
ant one  0.892502 -0.940173  0.291252  1.636593
    two  0.705944  0.430051  0.880048  0.246392
            0         3
one -0.909474 -1.122663
two -0.387693  0.419776


0   -0.387693
3    0.419776
Name: two, dtype: float64


In [11]:
print(df3.loc['bar', 'two'])

0   -0.387693
1    2.249439
2    0.658162
3    0.419776
Name: (bar, two), dtype: float64


In [12]:
# Slicing and arithmetic work as on a regular Series; arithmetic aligns on the full index tuples
print(s)
print('\n')
print(s[:-2])
print('\n')
print(s + s[:-2])

first  second
bar    one      -0.725250
       two      -0.282291
cat    one      -0.219755
       two      -1.588924
dog    one       0.108462
       two      -0.730629
ant    one      -0.406205
       two       0.214514
dtype: float64


first  second
bar    one      -0.725250
       two      -0.282291
cat    one      -0.219755
       two      -1.588924
dog    one       0.108462
       two      -0.730629
dtype: float64


first  second
ant    one            NaN
       two            NaN
bar    one      -1.450499
       two      -0.564582
cat    one      -0.439510
       two      -3.177847
dog    one       0.216923
       two      -1.461258
dtype: float64


In [13]:
# Transpose works as you would expect
df3.T

        bar                 cat                 dog                 ant
        one       two       one       two       one       two       one       two
0 -0.909474 -0.387693 -0.129603 -1.432529  0.025586  0.696683  0.892502  0.705944
1  1.499622  2.249439  0.242902  1.773576 -1.951273 -1.193574 -0.940173  0.430051
2  0.821909  0.658162  0.738085  0.352101 -0.973766  1.560547  0.291252  0.880048
3 -1.122663  0.419776 -0.252586 -0.464754  0.044732 -0.654714  1.636593  0.246392


#### Cross-section

The *xs* method of DataFrame additionally takes a level argument to make selecting data at a particular level of a MultiIndex easier.

In [14]:
print(df3)
df3.xs('two', level=1)

                0         1         2         3
bar one -0.909474  1.499622  0.821909 -1.122663
    two -0.387693  2.249439  0.658162  0.419776
cat one -0.129603  0.242902  0.738085 -0.252586
    two -1.432529  1.773576  0.352101 -0.464754
dog one  0.025586 -1.951273 -0.973766  0.044732
    two  0.696683 -1.193574  1.560547 -0.654714
ant one  0.892502 -0.940173  0.291252  1.636593
    two  0.705944  0.430051  0.880048  0.246392


            0         1         2         3
bar -0.387693  2.249439  0.658162  0.419776
cat -1.432529  1.773576  0.352101 -0.464754
dog  0.696683 -1.193574  1.560547 -0.654714
ant  0.705944  0.430051  0.880048  0.246392


In [15]:
# To select columns with xs(), you need to provide the axis argument.
df_T = df3.T
df_T.xs('two', level=1, axis=1)

        bar       cat       dog       ant
0 -0.387693 -1.432529  0.696683  0.705944
1  2.249439  1.773576 -1.193574  0.430051
2  0.658162  0.352101  1.560547  0.880048
3  0.419776 -0.464754 -0.654714  0.246392


In [16]:
# You can also select with multiple keys, one per level
df_T.xs(('two', 'bar'), level=(1, 0), axis=1)

        bar
        two
0 -0.387693
1  2.249439
2  0.658162
3  0.419776


#### Reindexing and Alignment

The *level* parameter of *reindex* and *align* is useful for broadcasting values across a level.

In [17]:
midx = pd.MultiIndex.from_product([['one', 'zero'], 
                                   ['x', 'y']], 
                                  )
df = pd.DataFrame(np.random.randn(4,2), index=midx)
df

               0         1
one  x -0.757636 -1.535119
     y  1.348964 -1.808988
zero x -2.162444 -0.215149
     y  0.498657 -1.024355


In [18]:
# mean(level=...) has been removed from pandas; group by the index level instead
df2 = df.groupby(level=0).mean()
df2

             0         1
one   0.295664 -1.672054
zero -0.831894 -0.619752


In [19]:
# Reindexing
df2.reindex(df.index, level=0)

               0         1
one  x  0.295664 -1.672054
     y  0.295664 -1.672054
zero x -0.831894 -0.619752
     y -0.831894 -0.619752


In [20]:
# Aligning
df_aligned, df2_aligned = df.align(df2, level=0)
df_aligned
df2_aligned

               0         1
one  x  0.295664 -1.672054
     y  0.295664 -1.672054
zero x -0.831894 -0.619752
     y -0.831894 -0.619752


#### Swapping levels with swaplevel()

The swaplevel() method switches the order of two levels.

In [21]:
df.swaplevel(0, 1, axis=0)

               0         1
x one  -0.757636 -1.535119
y one   1.348964 -1.808988
x zero -2.162444 -0.215149
y zero  0.498657 -1.024355
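
Note that swapping levels only relabels the rows; it does not reorder them, so the result is generally no longer sorted. The sketch below rebuilds a frame with the same index as above (using `np.arange` values instead of random ones, an illustrative choice) and re-sorts after the swap. The names `swapped` and `resorted` are hypothetical.

```python
import numpy as np
import pandas as pd

midx = pd.MultiIndex.from_product([['one', 'zero'], ['x', 'y']])
df = pd.DataFrame(np.arange(8).reshape(4, 2), index=midx)

# Swap the two row levels; the row order itself is left untouched.
swapped = df.swaplevel(0, 1, axis=0)
print(swapped.index.is_monotonic_increasing)

# Sort afterwards if the new outer level should be grouped together.
resorted = swapped.sort_index()
print(resorted.index.tolist())
```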


#### Sorting the index


In [22]:
import random
print(tuples)
random.shuffle(tuples)
print(tuples)

[('bar', 'one'), ('bar', 'two'), ('cat', 'one'), ('cat', 'two'), ('dog', 'one'), ('dog', 'two'), ('ant', 'one'), ('ant', 'two')]
[('cat', 'two'), ('dog', 'one'), ('ant', 'one'), ('bar', 'one'), ('cat', 'one'), ('dog', 'two'), ('ant', 'two'), ('bar', 'two')]


In [23]:
s = pd.Series(np.random.randn(8), index=pd.MultiIndex.from_tuples(tuples))
s

cat  two   -0.655848
dog  one   -0.379822
ant  one   -0.156211
bar  one    0.403121
cat  one   -0.493898
dog  two   -0.662357
ant  two   -0.246847
bar  two   -0.128355
dtype: float64

In [24]:
s.sort_index()

ant  one   -0.156211
     two   -0.246847
bar  one    0.403121
     two   -0.128355
cat  one   -0.493898
     two   -0.655848
dog  one   -0.379822
     two   -0.662357
dtype: float64

In [25]:
s.index.set_names(['L1', 'L2'], inplace=True)
s.sort_index()

L1   L2 
ant  one   -0.156211
     two   -0.246847
bar  one    0.403121
     two   -0.128355
cat  one   -0.493898
     two   -0.655848
dog  one   -0.379822
     two   -0.662357
dtype: float64

Note that sort_index() is essential if you want to use slicing syntax to select partial data from a Series or DataFrame with a *MultiIndex*, in addition to giving a neater display.
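
The following sketch illustrates why sorting matters for slicing. It builds a small, deliberately unsorted index (the labels and the names `s_unsorted`, `s_sorted`, and `subset` are illustrative assumptions), attempts a label slice, and then repeats the slice after sort_index(). On an unsorted *MultiIndex*, pandas is expected to raise `UnsortedIndexError`, which is a subclass of `KeyError`.

```python
import numpy as np
import pandas as pd

# Deliberately unsorted two-level index (illustrative labels).
idx = pd.MultiIndex.from_tuples(
    [('cat', 'two'), ('ant', 'one'), ('bar', 'one'),
     ('cat', 'one'), ('ant', 'two'), ('bar', 'two')],
    names=['L1', 'L2'])
s_unsorted = pd.Series(np.arange(6.0), index=idx)

# Label slicing needs a lexsorted index, so this attempt should fail.
try:
    s_unsorted.loc['ant':'bar']
    slice_failed = False
except KeyError:  # UnsortedIndexError derives from KeyError
    slice_failed = True
print(slice_failed)

# After sorting, the same slice selects every row from 'ant' through 'bar'.
s_sorted = s_unsorted.sort_index()
subset = s_sorted.loc['ant':'bar']
print(subset)
```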

## Categorical Index

*CategoricalIndex* is an index type that supports indexing with duplicates. It is a container that allows efficient indexing and storage of an index with a large number of duplicated elements.

In [26]:
df = pd.DataFrame({'A': np.arange(6),
                   'B': list('aabbca')})
print(df)
print(df.dtypes)
# astype no longer accepts a categories keyword; pass a CategoricalDtype instead
df['B'] = df['B'].astype(pd.CategoricalDtype(categories=list('cab')))
print(df.dtypes)
df

   A  B
0  0  a
1  1  a
2  2  b
3  3  b
4  4  c
5  5  a
A     int64
B    object
dtype: object
A       int64
B    category
dtype: object


   A  B
0  0  a
1  1  a
2  2  b
3  3  b
4  4  c
5  5  a


In [27]:
# Setting the index will create a CategoricalIndex
df2 = df.set_index('B')
df2.index

CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')

In [28]:
df2.sort_index()

   A
B
c  4
a  0
a  1
a  5
b  2
b  3

In [29]:
# Groupby operations on the index preserve its categorical nature as well

df2.groupby(level=0).sum()

   A
B
c  4
a  6
b  5
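
The cells above show sorting and grouping but not the indexing with duplicates mentioned at the start of this section. As a small sketch, selecting a label that occurs several times returns every matching row, and the result should keep a *CategoricalIndex*. The names `df_cat` and `rows_a` are hypothetical; the data mirrors the frame built above.

```python
import numpy as np
import pandas as pd

# Same data as above, rebuilt so this example runs on its own.
df_cat = pd.DataFrame({'A': np.arange(6), 'B': list('aabbca')})
df_cat['B'] = df_cat['B'].astype(pd.CategoricalDtype(categories=list('cab')))
df_cat = df_cat.set_index('B')

# 'a' appears three times in the index, so three rows come back.
rows_a = df_cat.loc['a']
print(rows_a)
print(type(rows_a.index).__name__)
```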


## MultiIndex Summary

In [30]:
df = pd.DataFrame({"row" : [0, 1, 2],
                   'One_X': [1.1, 1.1, 1.1],
                   'One_Y': [1.2, 1.2, 1.2],
                   'Two_X': [1.11, 1.11, 1.11],
                   'Two_Y': [1.22, 1.22, 1.22]}); df

   One_X  One_Y  Two_X  Two_Y  row
0    1.1    1.2   1.11   1.22    0
1    1.1    1.2   1.11   1.22    1
2    1.1    1.2   1.11   1.22    2


In [31]:
# As labelled index
df = df.set_index('row'); df

     One_X  One_Y  Two_X  Two_Y
row
0      1.1    1.2   1.11   1.22
1      1.1    1.2   1.11   1.22
2      1.1    1.2   1.11   1.22


In [32]:
# With hierarchical columns
df.columns = pd.MultiIndex.from_tuples(
             [tuple(c.split('_')) for c in df.columns])
df

     One        Two
       X    Y     X     Y
row
0    1.1  1.2  1.11  1.22
1    1.1  1.2  1.11  1.22
2    1.1  1.2  1.11  1.22


In [33]:
# Now stack the outer column level into the rows, then reset it to a regular column
print(df.stack(0))
df = df.stack(0).reset_index(1)
df

            X     Y
row                
0   One  1.10  1.20
    Two  1.11  1.22
1   One  1.10  1.20
    Two  1.11  1.22
2   One  1.10  1.20
    Two  1.11  1.22


     level_1     X     Y
row
0        One  1.10  1.20
0        Two  1.11  1.22
1        One  1.10  1.20
1        Two  1.11  1.22
2        One  1.10  1.20
2        Two  1.11  1.22


In [34]:
# Rename the variables
df.columns = ['Sample', 'All_X', 'All_Y']; df

     Sample  All_X  All_Y
row
0       One   1.10   1.20
0       Two   1.11   1.22
1       One   1.10   1.20
1       Two   1.11   1.22
2       One   1.10   1.20
2       Two   1.11   1.22


In [35]:
list(df.columns)

['Sample', 'All_X', 'All_Y']