In [1]:
import numpy as np
import pandas as pd


-----


Hierarchical indexing, also known as multi-indexing, incorporates multiple index levels within a single index. In this way, higher-dimensional data can be compactly represented in a one-dimensional `Series` or a two-dimensional `DataFrame` object.

## A Multiply Indexed Series - Bad Example

Suppose you would like to track data about states from two different years. A first attempt might use Python tuples as index keys:

In [2]:
index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]

populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]

pop_series = pd.Series(populations, index=index)
pop_series

(California, 2000)    33871648
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
(Texas, 2010)         25145561
dtype: int64

This works for looking up a single entry, but anything more is awkward. For example, to select all values from 2010, you need some messy (and potentially slow) munging:

In [3]:
[states for states in pop_series.index if states[1]==2010]

[('California', 2010), ('New York', 2010), ('Texas', 2010)]

In [4]:
pop_series[[states for states in pop_series.index if states[1]==2010]]

(California, 2010)    37253956
(New York, 2010)      19378102
(Texas, 2010)         25145561
dtype: int64


-----


### The Better Way: Pandas MultiIndex

Tuple-based indexing is essentially a rudimentary multi-index.
The `pd.MultiIndex.from_tuples()` class method creates a multi-index from the tuples as follows:

In [5]:
print(index)

[('California', 2000), ('California', 2010), ('New York', 2000), ('New York', 2010), ('Texas', 2000), ('Texas', 2010)]


In [6]:
# The index is currently a plain Python list of tuples

print("Type is:", type(index), "   ",type(index[0]))

Type is: <class 'list'>     <class 'tuple'>


In [7]:
# Convert index into multi-index

index = pd.MultiIndex.from_tuples(index)
print(index)

MultiIndex([('California', 2000),
            ('California', 2010),
            (  'New York', 2000),
            (  'New York', 2010),
            (     'Texas', 2000),
            (     'Texas', 2010)],
           )


In [8]:
# Shows Index Levels 

index.levels

FrozenList([['California', 'New York', 'Texas'], [2000, 2010]])

In [9]:
# Shows Index Level Shape

print("Index Shape (Unique):",index.levshape)

Index Shape (Unique): (3, 2)


In [10]:
# Number of Index Levels  
print("Index Levels :", index.nlevels)

Index Levels : 2


In [11]:
index.values

array([('California', 2000), ('California', 2010), ('New York', 2000),
       ('New York', 2010), ('Texas', 2000), ('Texas', 2010)], dtype=object)

Notice that the ``MultiIndex`` contains multiple *levels* of indexing (in this case, the state names and the years) as well as multiple *labels* for each data point, which encode these levels.
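
To make the relationship between levels and labels concrete, the following sketch inspects an index like the one built above. It assumes a reasonably recent pandas release in which the integer labels are exposed through the `codes` attribute; the variable names `mi`, `levels`, `codes`, and `states` are illustrative only.

```python
import pandas as pd

tuples = [('California', 2000), ('California', 2010),
          ('New York', 2000), ('New York', 2010),
          ('Texas', 2000), ('Texas', 2010)]
mi = pd.MultiIndex.from_tuples(tuples)

# Each level stores its unique values only once
levels = [level.tolist() for level in mi.levels]

# The codes hold one integer per row and per level,
# pointing at a position in the corresponding level
codes = [code.tolist() for code in mi.codes]

# get_level_values() expands one level back to full length
states = mi.get_level_values(0).tolist()

print(levels)
print(codes)
print(states)
```

Storing each level value once and referring to it by position is what lets the index stay compact even when the same state or year appears many times.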

If we re-index the `Series` `pop_series` with this ``MultiIndex``, we see the hierarchical representation of the data:

In [12]:
pop_series = pop_series.reindex(index)
pop_series

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

Here the first two columns of the ``Series`` representation show the multiple index values, while the third column shows the data.

***Notice that some entries are missing in the first column:*** in this multi-index representation, any blank entry indicates the same value as the line above it.

In [13]:
pop_series[['Texas','California']]

Texas       2000    20851820
            2010    25145561
California  2000    33871648
            2010    37253956
dtype: int64

In [14]:
pop_series[:,2000]

California    33871648
New York      18976457
Texas         20851820
dtype: int64

In [15]:
pop_series['New York']

2000    18976457
2010    19378102
dtype: int64

In [16]:
print(pop_series['New York'][2010])
print(pop_series['New York'].iloc[1])
print(pop_series['New York'].iloc[1:2])

19378102
19378102
2010    19378102
dtype: int64
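
The lookups above chain two indexing steps. As a hedged alternative, the sketch below performs the same selections in a single step with `.loc` and a full key tuple, and with the `xs()` cross-section method. The data is rebuilt so that the block runs on its own; `pop`, `ny_2010`, and `all_2010` are illustrative names.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('California', 2000), ('California', 2010),
     ('New York', 2000), ('New York', 2010),
     ('Texas', 2000), ('Texas', 2010)])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

# One-step scalar lookup with a full (state, year) key
ny_2010 = pop.loc[('New York', 2010)]

# Cross-section: every state for the year 2010 (level 1)
all_2010 = pop.xs(2010, level=1)

print(ny_2010)
print(all_2010)
```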



-----


### MultiIndex as extra dimension

You might notice something else here: we could easily have stored the same data using a simple ``DataFrame`` with index and column labels.
In fact, Pandas is built with this equivalence in mind. The ``unstack()`` method will quickly convert a multiply indexed ``Series`` into a conventionally indexed ``DataFrame``:

In [17]:
df_pop = pop_series.unstack()
df_pop

Unnamed: 0,2000,2010
California,33871648,37253956
New York,18976457,19378102
Texas,20851820,25145561


The ``stack()`` method performs the inverse operation of `unstack()`:

In [18]:
df_pop = df_pop.stack()
df_pop

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
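
As a quick sanity check of the `unstack()` / `stack()` equivalence, the sketch below rebuilds the population data and confirms that a round trip returns the original values and index entries. The variable names are illustrative, and the exact ordering produced by `stack()` is assumed to match the original for this small, complete dataset.

```python
import pandas as pd

tuples = [('California', 2000), ('California', 2010),
          ('New York', 2000), ('New York', 2010),
          ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
pop = pd.Series(populations, index=pd.MultiIndex.from_tuples(tuples))

wide = pop.unstack()        # states as rows, years as columns
round_trip = wide.stack()   # back to a multiply indexed Series

print(wide.shape)
print(round_trip.tolist() == populations)
```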

In [19]:
# Unstack based on level 0

df_pop.unstack(level=0)

Unnamed: 0,California,New York,Texas
2000,33871648,18976457,20851820
2010,37253956,19378102,25145561


In [20]:
# Unstack based on level 1

df_pop.unstack(level=1)

Unnamed: 0,2000,2010
California,33871648,37253956
New York,18976457,19378102
Texas,20851820,25145561


In [21]:
# Unstack based on level -1 (last level)

df_pop.unstack(level=-1)

Unnamed: 0,2000,2010
California,33871648,37253956
New York,18976457,19378102
Texas,20851820,25145561


In [22]:
# Unstack based on level -2, second level in reverse

df_pop.unstack(level=-2)

Unnamed: 0,California,New York,Texas
2000,33871648,18976457,20851820
2010,37253956,19378102,25145561


In [23]:
df_pop = pop_series.unstack()
df_pop

Unnamed: 0,2000,2010
California,33871648,37253956
New York,18976457,19378102
Texas,20851820,25145561


In [24]:
df_pop = pd.DataFrame({'Total': pop_series,
                       'Under18': [9267089, 9284094,
                                   4687374, 4318033,
                                   5906301, 6879014]})
df_pop

Unnamed: 0,Unnamed: 1,Total,Under18
California,2000,33871648,9267089
California,2010,37253956,9284094
New York,2000,18976457,4687374
New York,2010,19378102,4318033
Texas,2000,20851820,5906301
Texas,2010,25145561,6879014


In [25]:
df_u18 = df_pop['Under18'] / df_pop['Total']
df_u18

California  2000    0.273594
            2010    0.249211
New York    2000    0.247010
            2010    0.222831
Texas       2000    0.283251
            2010    0.273568
dtype: float64

In [26]:
df_u18.unstack()

Unnamed: 0,2000,2010
California,0.273594,0.249211
New York,0.24701,0.222831
Texas,0.283251,0.273568


In [27]:
df_pop['U18_Ratio'] = df_pop['Under18'] / df_pop['Total']
df_pop

Unnamed: 0,Unnamed: 1,Total,Under18,U18_Ratio
California,2000,33871648,9267089,0.273594
California,2010,37253956,9284094,0.249211
New York,2000,18976457,4687374,0.24701
New York,2010,19378102,4318033,0.222831
Texas,2000,20851820,5906301,0.283251
Texas,2010,25145561,6879014,0.273568


In [28]:
df_pop_clevels = df_pop.unstack(level=1)
df_pop_clevels

Unnamed: 0_level_0,Total,Total,Under18,Under18,U18_Ratio,U18_Ratio
Unnamed: 0_level_1,2000,2010,2000,2010,2000,2010
California,33871648,37253956,9267089,9284094,0.273594,0.249211
New York,18976457,19378102,4687374,4318033,0.24701,0.222831
Texas,20851820,25145561,5906301,6879014,0.283251,0.273568


In [29]:
df_pop_clevels['U18_Ratio']

Unnamed: 0,2000,2010
California,0.273594,0.249211
New York,0.24701,0.222831
Texas,0.283251,0.273568


In [30]:
df_pop_clevels['U18_Ratio'][2010]

California    0.249211
New York      0.222831
Texas         0.273568
Name: 2010, dtype: float64


-----


## Methods of MultiIndex Creation

The most straightforward way to construct a multiply indexed ``Series`` or ``DataFrame`` is to simply pass a list of two or more index arrays to the constructor. For example:

In [31]:
df_mi = pd.DataFrame(np.random.rand(4, 2),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=['data1', 'data2'])
df_mi

Unnamed: 0,Unnamed: 1,data1,data2
a,1,0.628333,0.282772
a,2,0.475996,0.208109
b,1,0.804909,0.920961
b,2,0.245935,0.339669


The work of creating the ``MultiIndex`` is done in the background.

Similarly, if you pass a dictionary with appropriate tuples as keys, Pandas will automatically recognize this and use a ``MultiIndex`` by default:

In [32]:
mydict_data = {('California', 2000): 33871648,
        ('California', 2010): 37253956,
        ('Texas', 2000): 20851820,
        ('Texas', 2010): 25145561,
        ('New York', 2000): 18976457,
        ('New York', 2010): 19378102}

myseries = pd.Series(mydict_data)
myseries

California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
New York    2000    18976457
            2010    19378102
dtype: int64
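
A minimal way to confirm that both shortcuts really produce a ``MultiIndex`` behind the scenes is to inspect the resulting index objects. The sketch below uses placeholder zeros instead of random data and a shortened dictionary; `df` and `series` are illustrative names.

```python
import numpy as np
import pandas as pd

# A list of index arrays passed straight to the constructor
df = pd.DataFrame(np.zeros((4, 2)),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=['data1', 'data2'])

# A dictionary keyed by tuples
series = pd.Series({('Texas', 2000): 20851820,
                    ('Texas', 2010): 25145561})

print(type(df.index).__name__)
print(type(series.index).__name__)
print(series.index.nlevels)
```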

### Explicit MultiIndex constructors

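For more flexibility in how the index is constructed, you can instead use the class method constructors available on ``pd.MultiIndex``.

We can construct the ``MultiIndex`` from a simple list of arrays giving the index values within each level (``from_arrays``), from a list of tuples giving the multiple index values of each point (``from_tuples``), or from a Cartesian product of single indices (``from_product``):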

In [33]:
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

In [34]:
pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

In [35]:
pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

Similarly, you can construct the ``MultiIndex`` directly using its internal encoding by passing ``levels`` (a list of lists containing the available index values for each level) and ``codes`` (a list of lists of integer positions that reference those level values):

In [36]:
# Codes are integer positions into the levels: 2 levels, 2 x 2 = 4 entries each

pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
              codes=[[0, 0, 1, 1], [0, 1, 0, 1]])           

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

In [37]:
# For 3 items in level 0 and 3 items in level 1,
# we need 3 x 3 = 9 entries in each codes list

pd.MultiIndex(levels=[['a', 'b', 'c'], [1, 2, 3]],
              codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2], 
                     [0, 1, 2, 0, 1, 2, 0, 1, 2]])  

MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 2),
            ('b', 3),
            ('c', 1),
            ('c', 2),
            ('c', 3)],
           )

In [38]:
# Equivalent of above using .from_product()

pd.MultiIndex.from_product([['a', 'b', 'c'], [1, 2, 3]])

MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 2),
            ('b', 3),
            ('c', 1),
            ('c', 2),
            ('c', 3)],
           )
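
The constructors shown in this section are different routes to the same object. The sketch below builds the small `a`/`b` index four ways and compares the results with `MultiIndex.equals()`; the variable names are illustrative.

```python
import pandas as pd

from_arrays = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
from_tuples = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
from_product = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])
from_codes = pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
                           codes=[[0, 0, 1, 1], [0, 1, 0, 1]])

# Every construction route should describe the same four entries
all_equal = all(from_arrays.equals(other)
                for other in (from_tuples, from_product, from_codes))
print(all_equal)
```

In practice `from_product` tends to be the most compact choice when every combination of level values is present, while `from_tuples` and `from_arrays` suit data where only some combinations occur.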

### MultiIndex level names

Sometimes it is convenient to name the levels of the ``MultiIndex``.
This can be accomplished by passing the ``names`` argument to any of the above ``MultiIndex`` constructors, or by setting the ``names`` attribute of the index after the fact:

In [39]:
index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]

populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]

pop_series = pd.Series(populations, index=index)
index = pd.MultiIndex.from_tuples(index)
pop_series = pop_series.reindex(index)
pop_series

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [40]:
# Names assigned to the Index

pop_series.index.names = ['state', 'year']
pop_series

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
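
The other option mentioned above, passing ``names`` directly to a constructor, might look like the following sketch. Once the levels are named, the names can stand in for integer level positions in methods such as `xs()` and `unstack()`. The variable names `pop`, `pop_2010`, and `wide` are illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

# Level names can replace integer positions
pop_2010 = pop.xs(2010, level='year')
wide = pop.unstack(level='year')

print(list(pop.index.names))
print(pop_2010)
print(wide)
```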

### MultiIndex for columns

In a ``DataFrame``, the rows and columns are completely symmetric, and just as the rows can have multiple levels of indices, the columns can have multiple levels as well.
Consider the following, which is a mock-up of some (somewhat realistic) medical data:

In [41]:
# hierarchical indices and columns
index = pd.MultiIndex.from_product(
    [[2016, 2017, 2018, 2019, 2020],
     [1, 2, 3, 4]],
    names=['year', 'visit'])

columns = pd.MultiIndex.from_product(
    [['Bob', 'Guido', 'Sue'], 
     ['HR', 'Temp']],
    names=['subject', 'type'])

# mock some data
data = np.round(np.random.randn(20, 6), 1)
data[:, ::2] *= 10
data += 37

# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data

Unnamed: 0_level_0,subject,Bob,Bob,Guido,Guido,Sue,Sue
Unnamed: 0_level_1,type,HR,Temp,HR,Temp,HR,Temp
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
2016,1,52.0,36.7,35.0,35.8,41.0,37.4
2016,2,43.0,37.2,29.0,37.2,42.0,36.5
2016,3,41.0,36.9,35.0,36.3,27.0,38.0
2016,4,39.0,36.4,21.0,36.9,32.0,38.2
2017,1,50.0,36.6,46.0,37.8,25.0,36.3
2017,2,35.0,37.6,37.0,36.2,39.0,35.8
2017,3,35.0,38.4,49.0,37.4,17.0,37.3
2017,4,48.0,35.9,56.0,37.9,10.0,37.3
2018,1,40.0,38.0,39.0,37.9,27.0,36.8
2018,2,34.0,37.8,50.0,38.1,22.0,37.1


Here we see where the multi-indexing for both rows and columns can come in *very* handy.

This is fundamentally four-dimensional data, where the dimensions are the subject, the measurement type, the year, and the visit number.

With this in place we can, for example, index the top-level column by the person's name and get a full ``DataFrame`` containing just that person's information:

In [42]:
health_data['Guido']

Unnamed: 0_level_0,type,HR,Temp
year,visit,Unnamed: 2_level_1,Unnamed: 3_level_1
2016,1,35.0,35.8
2016,2,29.0,37.2
2016,3,35.0,36.3
2016,4,21.0,36.9
2017,1,46.0,37.8
2017,2,37.0,36.2
2017,3,49.0,37.4
2017,4,56.0,37.9
2018,1,39.0,37.9
2018,2,50.0,38.1
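
To illustrate the four-dimensional view, the sketch below builds a smaller frame with the same row and column structure and reads one value by giving all four coordinates (year, visit, subject, type). It uses `np.arange` placeholder values rather than the random data above so that the result is predictable; `df` and `value` are illustrative names.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2016, 2017], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# Placeholder values 0..23, not real measurements
df = pd.DataFrame(np.arange(24).reshape(4, 6), index=index, columns=columns)

# Row key (year, visit) and column key (subject, type) pin down one cell
value = df.loc[(2017, 2), ('Guido', 'HR')]

# Indexing the top column level keeps both measurement types
guido = df['Guido']

print(value)
print(guido.shape)
```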


## Indexing and Slicing a MultiIndex

Indexing and slicing on a ``MultiIndex`` is designed to be intuitive, and it helps if you think about the indices as added dimensions.
We'll first look at indexing multiply indexed ``Series``, and then multiply-indexed ``DataFrames``.

### Multiply indexed Series

Consider the multiply indexed ``Series`` of state populations we saw earlier:

In [43]:
pop_series

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

####  Access single elements by indexing with multiple terms

In [44]:
pop_series['New York', 2000]

18976457

The ``MultiIndex`` also supports *partial indexing*, or indexing just one of the levels in the index. The result is another ``Series``, with the **lower-level indices maintained**.

In [45]:
pop_series['New York']

year
2000    18976457
2010    19378102
dtype: int64

Partial slicing is available as well, as long as the ``MultiIndex`` is sorted:

In [46]:
pop_series['California':'New York']

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
dtype: int64

**_With sorted indices_**, partial indexing can be performed on lower levels by passing an empty slice in the first index:

In [47]:
pop_series[:, 2000]

state
California    33871648
New York      18976457
Texas         20851820
dtype: int64

Other types of indexing and selection work as well; for example, selection based on Boolean masks:

In [48]:
pop_series[pop_series > 22000000]

state       year
California  2000    33871648
            2010    37253956
Texas       2010    25145561
dtype: int64

Selection based on fancy indexing also works:

In [49]:
pop_series[['California', 'Texas']]

state       year
California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
dtype: int64

### Multiply indexed DataFrames

A multiply indexed ``DataFrame`` behaves in a similar manner.
We use the ``DataFrame`` ``health_data`` for testing.

**Key point:** _columns are primary_ in a ``DataFrame``, so the syntax used for multiply indexed ``Series`` applies to the columns.

In [67]:
#  Guido's heart rate data
health_data['Guido', 'HR']

year  visit
2016  1        35.0
      2        29.0
      3        35.0
      4        21.0
2017  1        46.0
      2        37.0
      3        49.0
      4        56.0
2018  1        39.0
      2        50.0
      3        40.0
      4        28.0
2019  1        39.0
      2        35.0
      3        33.0
      4        24.0
2020  1        42.0
      2        35.0
      3        41.0
      4        29.0
Name: (Guido, HR), dtype: float64

In [82]:
health_data.iloc[:6, [1,3,5]]

Unnamed: 0_level_0,subject,Bob,Guido,Sue
Unnamed: 0_level_1,type,Temp,Temp,Temp
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
2016,1,36.7,35.8,37.4
2016,2,37.2,37.2,36.5
2016,3,36.9,36.3,38.0
2016,4,36.4,36.9,38.2
2017,1,36.6,37.8,36.3
2017,2,37.6,36.2,35.8


In [80]:
health_data.iloc[12::2, [0,2,4]]

Unnamed: 0_level_0,subject,Bob,Guido,Sue
Unnamed: 0_level_1,type,HR,HR,HR
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
2019,1,28.0,39.0,13.0
2019,3,56.0,33.0,41.0
2020,1,38.0,42.0,32.0
2020,3,36.0,41.0,7.0


In [96]:
health_data.loc[2019:2020]['Bob','HR']

year  visit
2019  1        28.0
      2        34.0
      3        56.0
      4        25.0
2020  1        38.0
      2        20.0
      3        36.0
      4        41.0
Name: (Bob, HR), dtype: float64

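Working with slices within these index tuples is not especially convenient: trying to create a slice within a tuple, as in `health_data.loc[(:, 1), (:, 'HR')]`, leads to a syntax error. Pandas provides the ``pd.IndexSlice`` object for precisely this situation: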

In [107]:
idx = pd.IndexSlice
health_data.loc[idx[:, 1], idx[:, 'HR']]

Unnamed: 0_level_0,subject,Bob,Guido,Sue
Unnamed: 0_level_1,type,HR,HR,HR
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
2016,1,52.0,35.0,41.0
2017,1,50.0,46.0,25.0
2018,1,40.0,39.0,27.0
2019,1,28.0,39.0,13.0
2020,1,38.0,42.0,32.0


In [112]:
idx = pd.IndexSlice
health_data.loc[idx[[2020], [1,2]], idx[['Bob','Sue'], 'HR']]

Unnamed: 0_level_0,subject,Bob,Sue
Unnamed: 0_level_1,type,HR,HR
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2
2020,1,38.0,32.0
2020,2,20.0,34.0


In [103]:
idx = pd.IndexSlice
health_data.loc[idx[[2016,2018], :], idx[:, 'HR']]

Unnamed: 0_level_0,subject,Bob,Guido,Sue
Unnamed: 0_level_1,type,HR,HR,HR
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
2016,1,52.0,35.0,41.0
2016,2,43.0,29.0,42.0
2016,3,41.0,35.0,27.0
2016,4,39.0,21.0,32.0
2018,1,40.0,39.0,27.0
2018,2,34.0,50.0,22.0
2018,3,42.0,40.0,29.0
2018,4,43.0,28.0,38.0


In [115]:
idx = pd.IndexSlice
health_data.loc[idx[:,:], idx[:,:]]

Unnamed: 0_level_0,subject,Bob,Bob,Guido,Guido,Sue,Sue
Unnamed: 0_level_1,type,HR,Temp,HR,Temp,HR,Temp
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
2016,1,52.0,36.7,35.0,35.8,41.0,37.4
2016,2,43.0,37.2,29.0,37.2,42.0,36.5
2016,3,41.0,36.9,35.0,36.3,27.0,38.0
2016,4,39.0,36.4,21.0,36.9,32.0,38.2
2017,1,50.0,36.6,46.0,37.8,25.0,36.3
2017,2,35.0,37.6,37.0,36.2,39.0,35.8
2017,3,35.0,38.4,49.0,37.4,17.0,37.3
2017,4,48.0,35.9,56.0,37.9,10.0,37.3
2018,1,40.0,38.0,39.0,37.9,27.0,36.8
2018,2,34.0,37.8,50.0,38.1,22.0,37.1



---


## Rearranging Multi-Indices

### Sorted and unsorted indices

Earlier, we briefly mentioned a caveat, but we should emphasize it more here.
*Many of the ``MultiIndex`` slicing operations will fail if the index is not sorted.*
Let's take a look at this here.

We'll start by creating some simple multiply indexed data where the indices are **not lexicographically sorted**:


In [117]:
# hierarchical indices and columns
# Index is not sorted 

index = pd.MultiIndex.from_product(
    [[2019, 2018, 2020],
     [1, 2]],
    names=['year', 'visit'])

columns = pd.MultiIndex.from_product(
    [['Bob', 'Guido', 'Sue'], 
     ['HR', 'Temp']],
    names=['subject', 'type'])

# mock some data
data = np.round(np.random.randn(6, 6), 1)
data[:, ::2] *= 10
data += 37

# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data

Unnamed: 0_level_0,subject,Bob,Bob,Guido,Guido,Sue,Sue
Unnamed: 0_level_1,type,HR,Temp,HR,Temp,HR,Temp
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
2019,1,33.0,37.4,29.0,38.6,36.0,37.4
2019,2,45.0,36.6,51.0,36.5,35.0,39.3
2018,1,52.0,38.5,11.0,36.2,45.0,37.2
2018,2,48.0,36.1,49.0,37.1,20.0,38.0
2020,1,29.0,35.6,41.0,37.3,41.0,36.7
2020,2,44.0,36.3,20.0,36.8,36.0,38.4


In [119]:
try:
    health_data.loc[2019:2020]['Bob','HR']
except KeyError as e:
    print(type(e))
    print(e)


<class 'pandas.errors.UnsortedIndexError'>
'Key length (1) was greater than MultiIndex lexsort depth (0)'


Although it is not entirely clear from the error message, this is the result of the ``MultiIndex`` not being sorted. For various reasons, partial slices and other similar operations require the levels in the ``MultiIndex`` to be in sorted (i.e., lexicographical) order.

Pandas provides convenience routines to perform this type of sorting, most notably the ``sort_index()`` method of the ``DataFrame``. (Older material also mentions a ``sortlevel()`` method on ``DataFrame``; it was deprecated and is no longer available in pandas 1.0 and later.)

We'll use ``sort_index()`` here:

In [120]:
health_data = health_data.sort_index()
health_data

Unnamed: 0_level_0,subject,Bob,Bob,Guido,Guido,Sue,Sue
Unnamed: 0_level_1,type,HR,Temp,HR,Temp,HR,Temp
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
2018,1,52.0,38.5,11.0,36.2,45.0,37.2
2018,2,48.0,36.1,49.0,37.1,20.0,38.0
2019,1,33.0,37.4,29.0,38.6,36.0,37.4
2019,2,45.0,36.6,51.0,36.5,35.0,39.3
2020,1,29.0,35.6,41.0,37.3,41.0,36.7
2020,2,44.0,36.3,20.0,36.8,36.0,38.4


In [122]:
health_data.loc[2019:2020]['Bob','HR']

year  visit
2019  1        33.0
      2        45.0
2020  1        29.0
      2        44.0
Name: (Bob, HR), dtype: float64
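
One way to check in advance whether an index is sorted, rather than waiting for the error, is the `is_monotonic_increasing` property. The sketch below uses a small single-column frame with placeholder values; `df`, `df_sorted`, and `subset` are illustrative names.

```python
import numpy as np
import pandas as pd

# Years deliberately out of order, as in the example above
index = pd.MultiIndex.from_product([[2019, 2018, 2020], [1, 2]],
                                   names=['year', 'visit'])
df = pd.DataFrame({'HR': np.arange(6)}, index=index)

sorted_before = df.index.is_monotonic_increasing
df_sorted = df.sort_index()
sorted_after = df_sorted.index.is_monotonic_increasing

# Label-based slicing on the first level works once the index is sorted
subset = df_sorted.loc[2019:2020]

print(sorted_before, sorted_after)
print(subset)
```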


---


### Stacking and unstacking indices

It is possible to convert a dataset from a stacked multi-index to a simple two-dimensional representation, optionally specifying the level to use:

In [124]:
pop_series

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [125]:
pop_series.unstack(level=0)

state,California,New York,Texas
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2000,33871648,18976457,20851820
2010,37253956,19378102,25145561


In [126]:
pop_series.unstack(level=1)

year,2000,2010
state,Unnamed: 1_level_1,Unnamed: 2_level_1
California,33871648,37253956
New York,18976457,19378102
Texas,20851820,25145561


In [127]:
pop_series.unstack(level=-1)

year,2000,2010
state,Unnamed: 1_level_1,Unnamed: 2_level_1
California,33871648,37253956
New York,18976457,19378102
Texas,20851820,25145561


In [128]:
pop_series.unstack(level=-2)

state,California,New York,Texas
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2000,33871648,18976457,20851820
2010,37253956,19378102,25145561


In [130]:
pop_series.unstack(level=-2).stack()

year  state     
2000  California    33871648
      New York      18976457
      Texas         20851820
2010  California    37253956
      New York      19378102
      Texas         25145561
dtype: int64
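
The `unstack(level=-2).stack()` combination above ends up with the two index levels in swapped order. As a hedged aside, the same arrangement can be reached with `swaplevel()` followed by `sort_index()`; the sketch below rebuilds the population data and compares both routes. The variable names are illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

# Route 1: pivot the state level into columns, then stack it back
via_unstack = pop.unstack(level=-2).stack()

# Route 2: swap the two levels and restore sorted order
swapped = pop.swaplevel().sort_index()

print(list(swapped.index.names))
print(swapped.tolist() == via_unstack.tolist())
```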

### Index setting and resetting

Another way to rearrange hierarchical data is to turn the index labels into columns; this can be accomplished with the ``reset_index`` method.
Calling this on the population ``Series`` results in a ``DataFrame`` with *state* and *year* columns holding the information that was formerly in the index.
For clarity, we can optionally specify the name of the data for the column representation:

In [133]:
pop_series

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [137]:
pop_series_flat = pop_series.reset_index(name="Population")
pop_series_flat

Unnamed: 0,state,year,Population
0,California,2000,33871648
1,California,2010,37253956
2,New York,2000,18976457
3,New York,2010,19378102
4,Texas,2000,20851820
5,Texas,2010,25145561


Often when working with data in the real world, the raw input data looks like this and it's useful to build a ``MultiIndex`` from the column values.

This can be done with the ``set_index`` method of the ``DataFrame``, which returns a multiply indexed ``DataFrame``:

In [138]:
pop_series_flat.set_index(['state', 'year'])

Unnamed: 0_level_0,Unnamed: 1_level_0,Population
state,year,Unnamed: 2_level_1
California,2000,33871648
California,2010,37253956
New York,2000,18976457
New York,2010,19378102
Texas,2000,20851820
Texas,2010,25145561
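
Taken together, ``reset_index`` and ``set_index`` form a round trip. The sketch below rebuilds the population data, flattens it, and then restores the hierarchical index; `flat` and `restored` are illustrative names.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
pop = pd.Series(populations, index=index)

# Index levels become ordinary columns
flat = pop.reset_index(name='Population')

# Columns become index levels again; select the single data column
restored = flat.set_index(['state', 'year'])['Population']

print(list(flat.columns))
print(restored.index.equals(pop.index))
```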



****


## Data Aggregations on Multi-Indices

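We've previously seen that Pandas has built-in data aggregation methods, such as ``mean()``, ``sum()``, and ``max()``.
For hierarchically indexed data, these can be combined with ``groupby(level=...)`` to control which subset of the data the aggregate is computed on. (Older pandas releases also accepted a ``level`` parameter directly on the aggregation methods; that form was removed in pandas 2.0.)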

For example, let's return to our health data:

In [161]:
data_mean = health_data.groupby(level='year', axis=0).mean()
data_mean

subject,Bob,Bob,Guido,Guido,Sue,Sue
type,HR,Temp,HR,Temp,HR,Temp
year,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
2018,50.0,37.3,30.0,36.65,32.5,37.6
2019,39.0,37.0,40.0,37.55,35.5,38.35
2020,36.5,35.95,30.5,37.05,38.5,37.55


In [165]:
data_mean = health_data.groupby(level='subject', axis=1).mean()
data_mean

Unnamed: 0_level_0,subject,Bob,Guido,Sue
year,visit,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2018,1,45.25,23.6,41.1
2018,2,42.05,43.05,29.0
2019,1,35.2,33.8,36.7
2019,2,40.8,43.75,37.15
2020,1,32.3,39.15,38.85
2020,2,40.15,28.4,37.2
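
Recent pandas releases have deprecated the `axis=1` argument of `groupby`, so the column-wise aggregation above may emit a warning or stop working depending on the installed version. A minimal sketch of one alternative, assuming a small frame with placeholder values rather than the random data above, is to transpose, group on the row axis, and transpose back. The names `per_year` and `per_subject` are illustrative.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2018, 2019], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
# Placeholder values 0..15, not real measurements
df = pd.DataFrame(np.arange(16).reshape(4, 4), index=index, columns=columns)

# Average over visits within each year (row level)
per_year = df.groupby(level='year').mean()

# Average within each subject (column level):
# transpose, group the former columns as rows, transpose back
per_subject = df.T.groupby(level='subject').mean().T

print(per_year)
print(per_subject)
```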



****
