## TL;DR

Hierarchical indexing:

- Also known as **multi-indexing**
- Incorporates multiple index *levels* within a single index
- Each extra level in a multi-index represents an extra dimension of data



`MultiIndex`:

- Creation:

  - Implicit: pass a dictionary with appropriate tuples as keys
  - Explicit constructors
    - `pd.MultiIndex.from_arrays()`
    - `pd.MultiIndex.from_tuples()`
    - `pd.MultiIndex.from_product()`
    - `pd.MultiIndex()`

- Level names:

  - Pass the `names` argument to the `MultiIndex` constructors
  - Set the names of the index (`index.names`) after the fact

- `MultiIndex` also works by columns

- Indexing: think about the indices as **added dimensions**

  - Multiply indexed `Series`

    - Full indexing: `series[first_level_key, second_level_key]`
    - Partial indexing: `series[first_level_key]`
    - Partial slicing
    - Selection

  - Multiply indexed `DataFrame`

    - Columns are primary

    - As in the single-index case: use the `loc` and `iloc` indexers

      - Array-like view of the two-dimensional data

      - Each individual index in `loc` or `iloc` can be passed a tuple of multiple indices:

        `df.loc[(row_level_1, row_level_2,...), (col_level_1, col_level_2,...)]`

    - Slices: use the `pd.IndexSlice` object

- Rearranging 

  - Sort: `sort_index()`
  - Stack and unstack:
    - `unstack()`: convert a dataset from a stacked multi-index to a simple two-dimensional representation
    - `stack()`: inverse of `unstack()`
  - Set and reset index:
    - `reset_index()`: Turn the index labels into columns
    - `set_index()`: Build a `MultiIndex` from the column values
  - Data aggregation on Multi-Indices
    - `level`: controls which subset of the data the aggregate is computed on
    - `axis`: selects the axis (for example, `axis=1` for the columns) along whose levels the aggregation takes place



In [12]:
import numpy as np
import pandas as pd

Make use of *hierarchical indexing* (also known as *multi-indexing*) to incorporate multiple index *levels* within a single index. In this way, higher-dimensional data can be compactly represented within the familiar one-dimensional `Series` and two-dimensional `DataFrame` objects.



## Multiply indexed `Series`

For concreteness, we will consider a series of data where each point has a character and numerical key.

In [13]:
index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]

### The bad way

In [14]:
pop = pd.Series(populations, index=index)
pop

(California, 2000)    33871648
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
(Texas, 2010)         25145561
dtype: int64

Why is this bad? 

For example, if you need to select all values from 2010, you'll need to do some messy (and potentially slow) munging to make it happen:

In [15]:
pop[[i for i in pop.index if i[1] == 2010]]

(California, 2010)    37253956
(New York, 2010)      19378102
(Texas, 2010)         25145561
dtype: int64

### The better way: Pandas `MultiIndex`

Create a multi-index (`MultiIndex`) from the tuples:

In [16]:
multi_index = pd.MultiIndex.from_tuples(index)
multi_index

MultiIndex([('California', 2000),
            ('California', 2010),
            (  'New York', 2000),
            (  'New York', 2010),
            (     'Texas', 2000),
            (     'Texas', 2010)],
           )

Re-index the series `pop` with this `MultiIndex` and see the hierarchical representation of the data:

In [17]:
pop = pop.reindex(multi_index)
pop

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

To access all data for which the second index is 2010, we can simply use the Pandas slicing notation:

In [18]:
pop[:, 2010]

California    37253956
New York      19378102
Texas         25145561
dtype: int64

In [19]:
pop.loc['California']

2000    33871648
2010    37253956
dtype: int64

### `MultiIndex` as extra dimension

`unstack()`: quickly convert a multiply indexed `Series` into a conventionally indexed `DataFrame`

In [20]:
pop_df = pop.unstack()
pop_df

                2000      2010
California  33871648  37253956
New York    18976457  19378102
Texas       20851820  25145561


`stack()`: provides the opposite operation

In [21]:
pop_df.stack()

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

Why hierarchical indexing?

Just as we were able to use multi-indexing to represent two-dimensional data within a one-dimensional `Series`, we can also use it to represent data of three or more dimensions in a `Series` or `DataFrame`. Each extra level in a multi-index represents an extra dimension of data; taking advantage of this property gives us much more flexibility in the types of data we can represent. 

Concretely, we might want to add another column of demographic data for each state at each year (say, population under 18); with a `MultiIndex`, this is as easy as adding another column to the `DataFrame`:

In [27]:
pop_df = pd.DataFrame({'total': pop, 
                       'under18': [9267089, 9284094,
                                   4687374, 4318033,
                                   5906301, 6879014]})
pop_df

                    total  under18
California 2000  33871648  9267089
           2010  37253956  9284094
New York   2000  18976457  4687374
           2010  19378102  4318033
Texas      2000  20851820  5906301
           2010  25145561  6879014


Universal functions (ufuncs) and other Pandas functionality work with hierarchical indices as well. This allows us to easily and quickly manipulate and explore even high-dimensional data.

In [24]:
f_u18 = pop_df['under18'] / pop_df['total']

In [25]:
f_u18

California  2000    0.273594
            2010    0.249211
New York    2000    0.247010
            2010    0.222831
Texas       2000    0.283251
            2010    0.273568
dtype: float64

In [26]:
f_u18.unstack()

                2000      2010
California  0.273594  0.249211
New York    0.247010  0.222831
Texas       0.283251  0.273568
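The idea that each index level acts as an extra dimension can be sketched with a third level. The example below is only an illustration: the `group` level and the name `pop3` are assumptions, and the values are copied from the tables above.

```python
import pandas as pd

# Hypothetical three-level index: state x year x population group
index = pd.MultiIndex.from_tuples(
    [('California', 2000, 'total'), ('California', 2000, 'under18'),
     ('California', 2010, 'total'), ('California', 2010, 'under18')],
    names=['state', 'year', 'group'])
pop3 = pd.Series([33871648, 9267089, 37253956, 9284094], index=index)

# Three levels, so the one-dimensional Series holds three-dimensional data
print(pop3.index.nlevels)

# Unstacking moves the innermost level ('group') into the columns
print(pop3.unstack())
```

Unstacking once more would move `year` into the columns as well, which is one way to view the same data as a wider table.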


## Methods of `MultiIndex` Creation

### Implicit: Simply pass a list of two or more index arrays to the constructor

The work of creating the `MultiIndex` is done in the background.

In [36]:
df = pd.DataFrame(np.random.randint(10, size=(4, 2)), 
                  index=[list('aabc'), [1, 2, 1, 2]],
                  columns = ['data1', 'data2'])
df

     data1  data2
a 1      3      8
  2      6      0
b 1      9      4
c 2      8      4


### Implicit: Pass a dictionary with appropriate tuples as keys

Pandas will automatically recognize this and use a `MultiIndex` by default:

In [37]:
data = {('California', 2000): 33871648,
        ('California', 2010): 37253956,
        ('Texas', 2000): 20851820,
        ('Texas', 2010): 25145561,
        ('New York', 2000): 18976457,
        ('New York', 2010): 19378102}
pd.Series(data)

California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
New York    2000    18976457
            2010    19378102
dtype: int64
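A quick way to confirm that both implicit forms really produce a `MultiIndex` is to inspect the index type. This is a minimal sketch; the names `from_lists` and `from_dict` are made up for the illustration.

```python
import pandas as pd

# Implicit form 1: a list of index arrays
from_lists = pd.Series([10, 20, 30, 40],
                       index=[list('aabc'), [1, 2, 1, 2]])

# Implicit form 2: a dictionary keyed by tuples
from_dict = pd.Series({('California', 2000): 33871648,
                       ('California', 2010): 37253956})

# In both cases Pandas builds the MultiIndex in the background
print(type(from_lists.index).__name__, from_lists.index.nlevels)
print(type(from_dict.index).__name__, from_dict.index.nlevels)
```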

### Explicit `MultiIndex` constructors

Use the class-method constructors available on `pd.MultiIndex`:

- From a simple list of arrays giving the index values within each level

In [38]:
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

- From a list of tuples giving the multiple index values of each point

In [39]:
pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

- From a Cartesian product of single indices 

In [40]:
pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )
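The three class-method constructors above are different routes to the same index. A small sketch that checks this with `MultiIndex.equals` (the variable names are illustrative):

```python
import pandas as pd

from_arrays = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
from_tuples = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
from_product = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

# All three describe the same four (letter, number) pairs in the same order
print(from_arrays.equals(from_tuples))
print(from_arrays.equals(from_product))
```

`from_product` is usually the shortest to write when every combination of the level values is present; the other two are needed when only some combinations occur.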

Construct the `MultiIndex` directly from its internal encoding by passing:

- `levels`: a list of lists containing the available index values for each level

- `codes`: a list of lists of integers that, for each level, give the position in `levels` of the label at each location (the older name for this argument, `labels`, is deprecated)

In [43]:
pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
              codes=[[0, 0, 1, 1], [0, 1, 0, 1]])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )
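To make the relationship between `levels` and `codes` concrete, the sketch below decodes the index by hand: the entry at position `i` is built from `levels[k][codes[k][i]]` for each level `k`. The helper list `decoded` is only for illustration.

```python
import pandas as pd

levels = [['a', 'b'], [1, 2]]
codes = [[0, 0, 1, 1], [0, 1, 0, 1]]
mi = pd.MultiIndex(levels=levels, codes=codes)

# Decode manually: one tuple per position, one element per level
decoded = [tuple(levels[k][codes[k][i]] for k in range(len(levels)))
           for i in range(len(codes[0]))]

print(decoded)
print(list(mi) == decoded)
```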

### `MultiIndex` level names

Pass the `names` argument to any of the above `MultiIndex` constructors, or set the `names` attribute of the index after the fact.

In [46]:
pop

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [49]:
pop.index.names = ['state', 'year']
pop

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
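The first option, naming the levels at construction time, might look like the sketch below. The index contents are a reduced version of the population example, and the replacement names in the second step are purely illustrative.

```python
import pandas as pd

# Name the levels while building the index
index = pd.MultiIndex.from_product([['California', 'Texas'], [2000, 2010]],
                                   names=['state', 'year'])
print(list(index.names))

# The names can still be changed afterwards through the same attribute
index.names = ['region', 'census_year']
print(list(index.names))
```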

### `MultiIndex` for columns

In a `DataFrame`, the rows and columns are completely symmetric, and just as the rows can have multiple levels of indices, the columns can have multiple levels as well. 

In [53]:
# Hierarchical index
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                    names=['year', 'visit'])

# Hierarchical columns
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'],
                                      ['HR', 'Temp']],
                                      names=['subject', 'type'])

In [56]:
# Mock some data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37
data

array([[66. , 36.8, 21. , 37.2, 46. , 38. ],
       [28. , 37. , 17. , 36.2, 43. , 35.5],
       [39. , 35.3, 23. , 35.7, 53. , 34.4],
       [39. , 37.2, 51. , 36.1, 32. , 37.6]])

In [57]:
# Create DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data

subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      66.0  36.8  21.0  37.2  46.0  38.0
     2      28.0  37.0  17.0  36.2  43.0  35.5
2014 1      39.0  35.3  23.0  35.7  53.0  34.4
     2      39.0  37.2  51.0  36.1  32.0  37.6


This is four-dimensional data:

- subject
- measurement type
- year
- visit number

Access a person's information:

In [58]:
health_data['Bob']

type          HR  Temp
year visit
2013 1      66.0  36.8
     2      28.0  37.0
2014 1      39.0  35.3
     2      39.0  37.2


## Indexing and Slicing a `MultiIndex`

Think about the indices as **added dimensions.**

### Multiply indexed `Series`

In [59]:
pop

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

Access single elements by indexing with multiple terms

In [60]:
pop['California', 2000]

33871648

The `MultiIndex` also supports *partial indexing*, or indexing just one of the levels in the index. The result is another `Series`, with the lower-level indices maintained:

In [61]:
pop['California']

year
2000    33871648
2010    37253956
dtype: int64

Partial slicing (as long as the `MultiIndex` is sorted):

In [62]:
pop.loc['California':'New York']

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
dtype: int64

Partial indexing on lower levels:

In [63]:
pop[:, 2000]

state
California    33871648
New York      18976457
Texas         20851820
dtype: int64

Indexing and selection:

In [64]:
pop[['California', 'Texas']]

state       year
California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
dtype: int64
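The same selections can also be written with the explicit `loc` indexer, which some readers find less ambiguous than plain brackets. A self-contained sketch that rebuilds the `pop` series from the values above:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
pop = pd.Series([33871648, 37253956, 18976457,
                 19378102, 20851820, 25145561], index=index)

single = pop.loc[('California', 2000)]      # one element: both levels given
partial = pop.loc['California']             # partial indexing on the first level
lower = pop.loc[:, 2000]                    # partial indexing on the second level
subset = pop.loc[['California', 'Texas']]   # selection with a list of first-level keys

print(single)
print(lower)
```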

### Multiply indexed DataFrames

Columns are primary in a `DataFrame`, and the syntax used for multiply indexed `Series` applies to the columns.

In [65]:
health_data

subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      66.0  36.8  21.0  37.2  46.0  38.0
     2      28.0  37.0  17.0  36.2  43.0  35.5
2014 1      39.0  35.3  23.0  35.7  53.0  34.4
     2      39.0  37.2  51.0  36.1  32.0  37.6


In [66]:
health_data['Bob']

type          HR  Temp
year visit
2013 1      66.0  36.8
     2      28.0  37.0
2014 1      39.0  35.3
     2      39.0  37.2


In [67]:
health_data['Bob', 'HR']

year  visit
2013  1        66.0
      2        28.0
2014  1        39.0
      2        39.0
Name: (Bob, HR), dtype: float64

As in the single-index case, use the `loc` and `iloc` indexers:

- They provide an array-like view of the underlying two-dimensional data

- Each individual index in `loc` or `iloc` can be passed a tuple of multiple indices.

Syntax for `loc`: `df.loc[(row_level_1, row_level_2,...), (col_level_1, col_level_2,...)]`

In [76]:
health_data.loc[(2013, 1), ('Bob', 'HR')]

66.0

In [77]:
health_data.loc[:, ('Guido', 'Temp')]

year  visit
2013  1        37.2
      2        36.2
2014  1        35.7
      2        36.1
Name: (Guido, Temp), dtype: float64

In [80]:
health_data.iloc[0, :3]

subject  type
Bob      HR      66.0
         Temp    36.8
Guido    HR      21.0
Name: (2013, 1), dtype: float64

Working with slices within these index tuples is not especially convenient; trying to create a slice within a tuple will lead to a syntax error:

In [81]:
health_data.loc[(:, 1), (:, 'HR')]

SyntaxError: invalid syntax (<ipython-input-81-fb34fa30ac09>, line 1)

In [82]:
health_data.iloc[(:, 1), (:, :2)]

SyntaxError: invalid syntax (<ipython-input-82-df8586a5f979>, line 1)

We can build the desired slice with the `pd.IndexSlice` object:

In [83]:
idx = pd.IndexSlice
health_data.loc[idx[:, 1], idx[:, 'HR']] # HR of each person in the first visit of each year

subject      Bob Guido   Sue
type          HR    HR    HR
year visit
2013 1      66.0  21.0  46.0
2014 1      39.0  23.0  53.0
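It may help to see what `IndexSlice` actually produces: it only turns the bracket syntax into an ordinary tuple of `slice` objects and labels, so the same selection can be spelled out with the built-in `slice(None)`. The small frame below is an assumption for illustration (two subjects, integer values), not the `health_data` frame itself.

```python
import numpy as np
import pandas as pd

idx = pd.IndexSlice
print(idx[:, 1])   # a plain tuple: (slice(None, None, None), 1)

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
df = pd.DataFrame(np.arange(16).reshape(4, 4), index=index, columns=columns)

# First visit of each year, HR column of each subject
with_indexslice = df.loc[idx[:, 1], idx[:, 'HR']]
# Equivalent spelling without IndexSlice
with_builtin = df.loc[(slice(None), 1), (slice(None), 'HR')]

print(with_indexslice)
```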


## Rearranging Multi-indices

### Sorted and unsorted indices

*Many of the `MultiIndex` slicing operations will fail if the index is not sorted.* Let's take a look at this here.

In [89]:
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
data = pd.Series(np.random.randint(6, size=6), index=index)
data.index.names = ['char', 'int']
data

char  int
a     1      2
      2      5
c     1      4
      2      0
b     1      1
      2      0
dtype: int64

Trying to take a partial slice of this unsorted index results in an error:

In [88]:
try:
    data['a':'b']
except KeyError as e:
    print(type(e))
    print(e)

<class 'pandas.errors.UnsortedIndexError'>
'Key length (1) was greater than MultiIndex lexsort depth (0)'


In [90]:
data = data.sort_index()
data

char  int
a     1      2
      2      5
b     1      1
      2      0
c     1      4
      2      0
dtype: int64

In [91]:
data['a':'b']

char  int
a     1      2
      2      5
b     1      1
      2      0
dtype: int64
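One way to check whether an index is sorted before slicing is the `is_monotonic_increasing` property. The sketch below uses fixed values instead of random ones so that the result is reproducible; the name `data_sorted` is illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]],
                                   names=['char', 'int'])
data = pd.Series(list(range(6)), index=index)

# 'c' comes before 'b', so the index is not sorted
print(data.index.is_monotonic_increasing)

data_sorted = data.sort_index()
print(data_sorted.index.is_monotonic_increasing)

# Partial slicing is safe once the index is sorted
sliced = data_sorted.loc['a':'b']
print(sliced)
```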

### Stacking and unstacking indices

`unstack()` converts a dataset from a stacked multi-index to a simple two-dimensional representation, optionally specifying the level to use; `stack()` reverses the operation.

In [92]:
pop

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [93]:
pop.unstack()

year            2000      2010
state
California  33871648  37253956
New York    18976457  19378102
Texas       20851820  25145561


In [95]:
pop.unstack(level=0) # Specified level of index will be unstacked

state  California  New York     Texas
year
2000     33871648  18976457  20851820
2010     37253956  19378102  25145561


In [97]:
pop.unstack().stack()

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
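The relationship between the two `unstack` variants and the `stack` round trip can be checked on a reduced version of the population data. This is a sketch; `pop_small` and the other variable names are assumptions for the example.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['California', 'Texas'], [2000, 2010]],
                                   names=['state', 'year'])
pop_small = pd.Series([33871648, 37253956, 20851820, 25145561], index=index)

wide = pop_small.unstack()                  # rows: state, columns: year
wide_by_state = pop_small.unstack(level=0)  # rows: year, columns: state
roundtrip = wide.stack()                    # back to a two-level Series

print(wide)
print(wide_by_state)
```

Unstacking level 0 gives the transpose of the default result, and stacking the wide table recovers the original values and index.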

### Index setting and resetting

`reset_index`: Turn the index labels into columns

In [98]:
pop

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [99]:
pop_flat = pop.reset_index(name='population')
pop_flat 

        state  year  population
0  California  2000    33871648
1  California  2010    37253956
2    New York  2000    18976457
3    New York  2010    19378102
4       Texas  2000    20851820
5       Texas  2010    25145561


In [100]:
type(pop_flat)

pandas.core.frame.DataFrame

Build a `MultiIndex` from the column values:

In [101]:
pop_flat.set_index(['state', 'year'])

                 population
state      year
California 2000    33871648
           2010    37253956
New York   2000    18976457
           2010    19378102
Texas      2000    20851820
           2010    25145561


This is one of the most useful patterns when encountering real-world datasets.
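A compact sketch of the full round trip, from a multiply indexed `Series` to a flat table and back, using a reduced version of the population data (the names `pop_small`, `flat`, and `rebuilt` are illustrative):

```python
import pandas as pd

index = pd.MultiIndex.from_product([['California', 'Texas'], [2000, 2010]],
                                   names=['state', 'year'])
pop_small = pd.Series([33871648, 37253956, 20851820, 25145561], index=index)

# Index levels become ordinary columns; the values get the given column name
flat = pop_small.reset_index(name='population')
print(list(flat.columns))

# Rebuild the MultiIndex from those columns and pull the values back out
rebuilt = flat.set_index(['state', 'year'])['population']
print(rebuilt)
```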

## Data Aggregation on Multi-Indices

The built-in data aggregation methods of Pandas can be passed a `level` parameter that controls which subset of the data the aggregate is computed on.

In [102]:
health_data

subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      66.0  36.8  21.0  37.2  46.0  38.0
     2      28.0  37.0  17.0  36.2  43.0  35.5
2014 1      39.0  35.3  23.0  35.7  53.0  34.4
     2      39.0  37.2  51.0  36.1  32.0  37.6


Average the measurements from the two visits in each year:

In [103]:
data_mean = health_data.mean(level='year')
data_mean

subject   Bob        Guido         Sue
type       HR   Temp    HR  Temp    HR   Temp
year
2013     47.0  36.90  19.0  36.7  44.5  36.75
2014     39.0  36.25  37.0  35.9  42.5  36.00
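The `level` argument of aggregation methods such as `mean()` was deprecated in pandas 1.3 and removed in pandas 2.0. On those versions the same per-year average can be written with `groupby(level=...)`. A sketch using only Bob's columns from the table above (the frame name `bob` is an assumption):

```python
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
bob = pd.DataFrame({'HR': [66.0, 28.0, 39.0, 39.0],
                    'Temp': [36.8, 37.0, 35.3, 37.2]}, index=index)

# Group rows by the 'year' level and average the two visits in each group
yearly_mean = bob.groupby(level='year').mean()
print(yearly_mean)
```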


Use the `axis` keyword to take the mean along levels of the columns as well:

In [104]:
data_mean.mean(axis=1, level='type')

type         HR       Temp
year
2013  36.833333  36.783333
2014  39.500000  36.050000
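For pandas versions without the `level` argument, one possible way to aggregate over a column level is to transpose, group by that level, and transpose back. This is a sketch under that assumption, with the yearly means from the table above typed in by hand (the names `yearly` and `by_type` are illustrative):

```python
import pandas as pd

columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
yearly = pd.DataFrame([[47.0, 36.9, 19.0, 36.7, 44.5, 36.75],
                       [39.0, 36.25, 37.0, 35.9, 42.5, 36.0]],
                      index=pd.Index([2013, 2014], name='year'),
                      columns=columns)

# Transpose so the column levels become row levels, average over subjects
# within each measurement type, then transpose back
by_type = yearly.T.groupby(level='type').mean().T
print(by_type)
```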
