In [1]:
import pandas as pd
import numpy as np

# Hierarchical indexing (MultiIndex)

A MultiIndex, also known as a multi-level or hierarchical index, lets us store and manipulate data with an arbitrary number of dimensions in lower-dimensional data structures such as Series (1D) and DataFrame (2D).

It lets several columns act together as the row identifier, with each index level nested under the previous one in a parent/child relationship.

Reference: [https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html)
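To make the idea concrete, here is a minimal sketch of a Series (1D) that carries two-dimensional information through a two-level index. The variable names (`beer`, `eu`, `wide`) are illustrative only, and the four values are taken from the first rows of the drinks dataset loaded later in this notebook.

```python
import pandas as pd

# Toy index with two levels: continent (outer) and country (inner)
index = pd.MultiIndex.from_tuples(
    [("AF", "Algeria"), ("AF", "Angola"), ("EU", "Albania"), ("EU", "Andorra")],
    names=["continent", "country"],
)
beer = pd.Series([25, 217, 89, 245], index=index, name="beer_servings")

# Selecting on the outer level returns a Series indexed by the inner level
eu = beer.loc["EU"]
print(eu)

# unstack() pivots the inner level into columns, giving a 2D view of the same data
wide = beer.unstack("country")
print(wide)
```

Combinations that do not exist in the Series (for example `("EU", "Algeria")`) show up as `NaN` in the unstacked frame.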



## Load and explore data

In [2]:
# load datasets
df = pd.read_csv('../../datasets/various/drinks.csv')

In [3]:
df.head(3)

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,AS
1,Albania,89,132,54,4.9,EU
2,Algeria,25,0,14,0.7,AF


In [5]:
# get index labels
print(df.index)
print(df.index.values[:20])
# print(df.index.name)
print(df.index.names)
# print(df.index.value_counts())

RangeIndex(start=0, stop=193, step=1)
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
[None]


## Create MultiIndex





### From columns with [set_index()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html)


In [6]:
print(df.columns)

Index(['country', 'beer_servings', 'spirit_servings', 'wine_servings',
       'total_litres_of_pure_alcohol', 'continent'],
      dtype='object')


In [7]:
df_multi = df.set_index(['continent','country'])

print(df_multi.head())

print(df_multi.index.names)
print(df_multi.index.values[:5])

                       beer_servings  spirit_servings  wine_servings  \
continent country                                                      
AS        Afghanistan              0                0              0   
EU        Albania                 89              132             54   
AF        Algeria                 25                0             14   
EU        Andorra                245              138            312   
AF        Angola                 217               57             45   

                       total_litres_of_pure_alcohol  
continent country                                    
AS        Afghanistan                           0.0  
EU        Albania                               4.9  
AF        Algeria                               0.7  
EU        Andorra                              12.4  
AF        Angola                                5.9  
['continent', 'country']
[('AS', 'Afghanistan') ('EU', 'Albania') ('AF', 'Algeria')
 ('EU', 'Andorra') ('AF', 'Angola')]

Note that `set_index()` does not sort the index: rows keep their original order. To sort them, we can use the [sort_index()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_index.html) method.
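A small sketch of this on a stand-in for `df_multi`, built from the five rows printed above. The names `toy` and `toy_sorted` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Stand-in for df_multi, using the first five rows shown above
toy = pd.DataFrame(
    {
        "continent": ["AS", "EU", "AF", "EU", "AF"],
        "country": ["Afghanistan", "Albania", "Algeria", "Andorra", "Angola"],
        "beer_servings": [0, 89, 25, 245, 217],
    }
).set_index(["continent", "country"])

# set_index() keeps the original row order, so the index is not sorted
print(toy.index.is_monotonic_increasing)  # False

# sort_index() sorts by continent first, then by country within each continent
toy_sorted = toy.sort_index()
print(toy_sorted.index.is_monotonic_increasing)  # True
print(toy_sorted)
```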

## Remove MultiIndex

### With [reset_index()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html)

Reset the index of the DataFrame, and use the default one instead. If the DataFrame has a MultiIndex, this method can remove one or more levels.

In [8]:
# remove all index levels
tmp = df_multi.reset_index()
tmp.head(3)


Unnamed: 0,continent,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol
0,AS,Afghanistan,0,0,0,0.0
1,EU,Albania,89,132,54,4.9
2,AF,Algeria,25,0,14,0.7


In [9]:
# reset only a subset of the index levels
tmp = df_multi.reset_index(level='country')
tmp.head(10)
# df_multi.head(3)

continent,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol
AS,Afghanistan,0,0,0,0.0
EU,Albania,89,132,54,4.9
AF,Algeria,25,0,14,0.7
EU,Andorra,245,138,312,12.4
AF,Angola,217,57,45,5.9
,Antigua & Barbuda,102,128,45,4.9
SA,Argentina,193,25,221,8.3
EU,Armenia,21,179,11,3.8
OC,Australia,261,72,212,10.4
EU,Austria,279,75,191,9.7
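Beyond the two calls above, `reset_index()` also accepts a level position and a `drop` flag. The sketch below uses a three-row toy frame (values from the output above) with hypothetical names `toy`, `by_position`, `dropped` and `restored`.

```python
import pandas as pd

toy = pd.DataFrame(
    {"beer_servings": [0, 89, 25]},
    index=pd.MultiIndex.from_tuples(
        [("AS", "Afghanistan"), ("EU", "Albania"), ("AF", "Algeria")],
        names=["continent", "country"],
    ),
)

# Levels can be given by position as well as by name (0 is the outermost level);
# continent becomes a regular column and country stays as the index
by_position = toy.reset_index(level=0)

# drop=True discards the level instead of turning it into a column
dropped = toy.reset_index(level="continent", drop=True)

# set_index(..., append=True) adds a column back as an extra (innermost) level
restored = by_position.set_index("continent", append=True)

print(by_position)
print(dropped)
print(restored.index.names)
```

Note that `append=True` places `continent` after `country`, so the restored index has its levels in the opposite order from the original.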


## Sorting a MultiIndex

For MultiIndex-ed objects to be indexed and sliced effectively, they need to be sorted. As with any index, you can use sort_index().

Indexing will work even if the data are not sorted, but will be rather inefficient (and show a PerformanceWarning). It will also return a copy of the data rather than a view.
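Label-based range slicing is one place where sorting matters in practice. The sketch below uses a toy frame with hypothetical names (`toy`, `toy_sorted`, `slicing_failed`). To the best of our knowledge, slicing with tuple bounds on an unsorted MultiIndex raises `pandas.errors.UnsortedIndexError` (a `KeyError` subclass), so the code catches `KeyError` to stay on the safe side.

```python
import pandas as pd

toy = pd.DataFrame(
    {"beer_servings": [0, 89, 25, 245, 217]},
    index=pd.MultiIndex.from_tuples(
        [
            ("AS", "Afghanistan"),
            ("EU", "Albania"),
            ("AF", "Algeria"),
            ("EU", "Andorra"),
            ("AF", "Angola"),
        ],
        names=["continent", "country"],
    ),
)

# Slicing between two full keys needs a sorted index
try:
    toy.loc[("AF", "Algeria"):("AS", "Afghanistan")]
    slicing_failed = False
except KeyError:  # expected to be UnsortedIndexError, which subclasses KeyError
    slicing_failed = True
print(slicing_failed)

# After sort_index() the same slice is well defined (both bounds are inclusive)
toy_sorted = toy.sort_index()
subset = toy_sorted.loc[("AF", "Algeria"):("AS", "Afghanistan")]
print(subset)
```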