
# Hierarchical Indexing

Up to now, we have discussed one-dimensional and two-dimensional data, stored in Pandas Series and DataFrame objects respectively. Often, though, it is useful to store higher-dimensional data, that is, data indexed by more than one or two keys.

While older versions of Pandas provided the Panel and Panel4D objects to handle three-dimensional and four-dimensional data (both have since been removed from the library), a far more common practice is to use hierarchical indexing, also known as multi-indexing, to incorporate multiple index levels within a single index. In this way, higher-dimensional data can be compactly represented within the familiar one-dimensional Series and two-dimensional DataFrame objects.

In this section, we'll discuss:
1. Creation of `MultiIndex` objects
2. Considerations around indexing, slicing and computing statistics for multiply indexed data
3. Useful routines for converting between simple and hierarchically indexed representations of your data.

In [None]:

import numpy as np
import pandas as pd

## A Multiply Indexed Series

We'll start by considering how we might represent two-dimensional data within a one-dimensional Series.

### The bad way

Suppose we want to track data about states from two different years. The bad way is to use Python tuples of (state, year) as keys.

In [None]:
index=[('California',2000),('California',2010),
('New York',2000),('New York',2010),('Texas',2000),('Texas',2010)]
populations = [33871648, 37253956,
 18976457, 19378102,
 20851820, 25145561]
s=pd.Series(populations,index=index)
s

(California, 2000)    33871648
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
(Texas, 2010)         25145561
dtype: int64

With this indexing scheme, we can index or slice the Series directly based on the tuple keys.

In [None]:
#indexing
s[('New York',2010)]

19378102

In [None]:
#slicing
s[:('Texas',2000)]

(California, 2000)    33871648
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
dtype: int64

But the convenience ends there.
For example, if we want to select all values from the year 2010, we need some messy (and potentially slow) code, such as a list comprehension, to access them.

In [None]:
s[[i for i in s.index if i[1] == 2010]]

(California, 2010)    37253956
(New York, 2010)      19378102
(Texas, 2010)         25145561
dtype: int64

This is where `MultiIndex` is helpful.

### The better way: Pandas MultiIndex

The Pandas MultiIndex provides a better way. Our tuple-based indexing is essentially a rudimentary multi-index, and the Pandas MultiIndex type gives us the kinds of operations we wish to have.

Let's create a MultiIndex from the list of tuples:

In [None]:
index=pd.MultiIndex.from_tuples(index)
index

MultiIndex([('California', 2000),
            ('California', 2010),
            (  'New York', 2000),
            (  'New York', 2010),
            (     'Texas', 2000),
            (     'Texas', 2010)],
           )

Notice that the MultiIndex contains multiple levels of indexing, in this case the state names and the years, as well as multiple labels for each data point which encode these levels.

If we reindex our Series with this MultiIndex, we see the hierarchical representation of the data.

In [None]:
s=s.reindex(index)
s

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

Here the first two columns of the Series representation show the MultiIndex values, and the third column shows the data.

In this representation, any blank entry in the first column indicates the same value as the line above it.

Now, to access all data for which the second index is 2010, we can use Pandas slicing notation:

In [None]:
s[:,2010]

California    37253956
New York      19378102
Texas         25145561
dtype: int64

The result is a singly indexed Series with just the keys we are interested in. This syntax is much more convenient (and the operation much more efficient) than the tuple-based approach we started with.

We'll discuss this sort of indexing on hierarchically indexed data in more detail later in this section.

### MultiIndex as Extra dimension

You may have noticed that we could just as easily have stored the same data in a simple DataFrame with index and column labels. In fact, Pandas is built with this equivalence in mind.

The `unstack()` method will quickly convert a multiply indexed Series into a conventionally indexed DataFrame:

In [None]:
s_df=s.unstack()
s_df

                2000      2010
California  33871648  37253956
New York    18976457  19378102
Texas       20851820  25145561


The `stack()` method provides the opposite operation to `unstack()`.

In [None]:
s_df.stack()

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

So why do we need hierarchical indexing at all?

The reason is that, just as we used multi-indexing to represent two-dimensional data in a one-dimensional Series, we can use it to represent data of three or more dimensions in a Series or DataFrame.

Each extra level in a MultiIndex represents an extra dimension of data, which gives us much more flexibility in the types of data we can represent.

For example, we might want to add another column of data for each state in each year (here, the population under 18). With a MultiIndex this is as easy as adding another column to the DataFrame.

In [None]:
s_df=pd.DataFrame({'total':s,'under18':[9267089, 9284094,
 4687374, 4318033,
 5906301, 6879014]})
s_df

                    total  under18
California 2000  33871648  9267089
           2010  37253956  9284094
New York   2000  18976457  4687374
           2010  19378102  4318033
Texas      2000  20851820  5906301
           2010  25145561  6879014


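All the ufuncs and other functionality discussed earlier work with hierarchical indices as well. Here we compute the fraction of people under 18 by year: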

In [None]:
frac=s_df['under18']/s_df['total']
frac.unstack()

                2000      2010
California  0.273594  0.249211
New York    0.247010  0.222831
Texas       0.283251  0.273568


This allows us to easily and quickly manipulate and explore higher dimensional data.

## Methods of Multi index creation

The most straightforward way to construct a multiply indexed Series or DataFrame is to pass a list of two or more index arrays to the constructor.

In [None]:
df=pd.DataFrame(np.random.randint(1,20,(4,2)),index=[['a','a','b','b'],[1,2,1,2]],columns=['hello','bye'])
df

     hello  bye
a 1     14    8
  2      3    8
b 1      8   15
  2     16   11


The work of creating the MultiIndex is done by Pandas in the background.

Similarly, if we pass a dictionary with appropriate tuples as keys, Pandas will automatically recognise this and use a MultiIndex by default.

In [None]:
data = {('California', 2000): 33871648,
 ('California', 2010): 37253956,
 ('Texas', 2000): 20851820,
 ('Texas', 2010): 25145561,
 ('New York', 2000): 18976457,
 ('New York', 2010): 19378102}
data=pd.Series(data)
data

California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
New York    2000    18976457
            2010    19378102
dtype: int64

Nevertheless, it's sometimes useful to explicitly create a MultiIndex.

### Explicit MultiIndex construction

For more flexibility in how the index is constructed, we can instead use the class method constructors available in `pd.MultiIndex`.

For example, we can construct a MultiIndex from a simple list of arrays giving the index values within each level.

In [None]:
m=pd.MultiIndex.from_arrays([['a','a','b','b'],[1,2,1,2]])
m

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

We can construct a MultiIndex from a list of tuples giving the multiple index values of each point.

In [None]:
mt=pd.MultiIndex.from_tuples([('a',1),('a',2),('b',1),('b',2)])
mt

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

We can even construct it from a Cartesian product of single indices.

In [None]:
cp=pd.MultiIndex.from_product([['a','b'],[1,2]])
cp

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

Similarly, we can construct a MultiIndex directly using its internal encoding by passing *levels* (a list of lists containing the available index values for each level) and *codes* (a list of lists that reference these values by position; this argument was called *labels* in older versions of Pandas).

Any of these objects can be passed as the `index` argument when creating a Series or DataFrame, or passed to the `reindex` method of an existing Series or DataFrame.
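As a minimal sketch of the levels-and-codes construction described above, we can rebuild the same four-entry index used in the previous examples. The variable names and the placeholder values in the Series are assumptions made for illustration only.

```python
import pandas as pd

# levels: the distinct values available at each level of the index
levels = [['a', 'b'], [1, 2]]
# codes: for every row, the position of its value inside the matching level
codes = [[0, 0, 1, 1], [0, 1, 0, 1]]

mi = pd.MultiIndex(levels=levels, codes=codes)
print(mi)

# pass the index when building a Series (the values are placeholders)
ser = pd.Series([10, 20, 30, 40], index=mi)
print(ser)
```

Each row's label is found by looking up its code in the corresponding level, so the first row is `('a', 1)` and the last is `('b', 2)`.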

### MultiIndex level names

It is sometimes convenient to name the levels of the MultiIndex.
This can be accomplished by passing the `names` argument to any of the above **MultiIndex constructors**, or by setting the *names* attribute of the index after the fact.

In [None]:
s.index.names=['state','year']
s

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

With larger datasets, this is a useful way to keep track of the meaning of the various index values.
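The other option mentioned above, passing `names` directly to a constructor, might look like the following sketch. It uses `from_product` with the same state and year values; the variable name `named` is only illustrative.

```python
import pandas as pd

# name the levels at construction time instead of setting .names afterwards
named = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'],
)
print(named.names)
print(named)
```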

### MultiIndex for columns

In a *DataFrame*, the rows and columns are completely symmetric, and just as the rows can have multiple levels of indices, the columns can have multiple levels as well.

We'll demonstrate this with some mock-up data.

In [None]:

#hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
 names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
 names=['subject', 'type'])

#create mock up data
data=np.round(np.random.randn(4,6),1)
data[:,::2]*=10
data+=37

#create DataFrame
d=pd.DataFrame(data,index=index,columns=columns)
d

subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      29.0  38.2  46.0  38.2  43.0  36.4
     2      32.0  37.8  50.0  37.9  39.0  37.2
2014 1      44.0  35.8  48.0  36.4  40.0  36.7
     2      41.0  36.7  40.0  36.6  45.0  38.4


Here we see where multi-indexing for both rows and columns can come in very handy. This is fundamentally four-dimensional data, where the dimensions are:
1. subject
2. measurement type
3. year
4. visit

In [None]:
d['Guido']

type          HR  Temp
year visit
2013 1      46.0  38.2
     2      50.0  37.9
2014 1      48.0  36.4
     2      40.0  36.6


For complicated records containing multiple labeled measurements across multiple
times for many subjects (people, countries, cities, etc.), use of hierarchical rows and
columns can be extremely convenient!

## Indexing and slicing MultiIndex

Indexing and slicing on a MultiIndex is designed to be intuitive, and it helps if we think of the indices as added dimensions. We'll first look at multiply indexed Series and then at multiply indexed DataFrames.

### Multiply indexed Series

Consider the multiply indexed Series of state populations we saw earlier. We can access single elements by indexing with multiple terms.

In [None]:
s['New York',2010]

19378102

The MultiIndex also supports *partial indexing*, or indexing just one of the levels in the index. The result is another Series, with the lower-level indices maintained.

In [None]:
s['California']

year
2000    33871648
2010    37253956
dtype: int64

It also supports *partial slicing*, as long as the MultiIndex is sorted.

In [None]:
s.loc['New York':'Texas']

state     year
New York  2000    18976457
          2010    19378102
Texas     2000    20851820
          2010    25145561
dtype: int64

With sorted indices, we can perform partial indexing on lower levels by passing an empty slice in the first index.

In [None]:
s[:,2000]

state
California    33871648
New York      18976457
Texas         20851820
dtype: int64

Similarly, masking and fancy indexing can also be performed on these MultiIndex.
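A short sketch of both ideas on the population data follows. The Series is rebuilt here so the example is self-contained, and the 22,000,000 threshold is an arbitrary value chosen for illustration.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'],
)
pop = pd.Series(
    [33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
    index=index,
)

# Boolean masking: keep only the entries above the threshold
large = pop[pop > 22000000]
print(large)

# Fancy indexing: select several labels of the first level at once
subset = pop.loc[['California', 'Texas']]
print(subset)
```

In both cases the result keeps its MultiIndex, so it can be sliced or unstacked further.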

### Multiply indexed DataFrames

A multiply indexed DataFrame behaves in a similar manner.

In [None]:
d

subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      29.0  38.2  46.0  38.2  43.0  36.4
     2      32.0  37.8  50.0  37.9  39.0  37.2
2014 1      44.0  35.8  48.0  36.4  40.0  36.7
     2      41.0  36.7  40.0  36.6  45.0  38.4


Remember that columns are primary in a DataFrame, and the syntax used for multiply indexed Series applies to the columns.

In [None]:
# Guido's HR column
d['Guido','HR']

year  visit
2013  1        46.0
      2        50.0
2014  1        48.0
      2        40.0
Name: (Guido, HR), dtype: float64

As in the single-index case, we can also use the `loc` and `iloc` indexers discussed before (the older `ix` indexer has been removed from current versions of Pandas).

In [None]:
d.iloc[:3,:1]

subject      Bob
type          HR
year visit
2013 1      29.0
     2      32.0
2014 1      44.0


These indexers provide an array-like view of the underlying two-dimensional data, but each individual index in `loc` or `iloc` can be passed a tuple of multiple indices.

In [None]:
d.loc[:,('Bob','HR')]

year  visit
2013  1        29.0
      2        32.0
2014  1        44.0
      2        41.0
Name: (Bob, HR), dtype: float64

Working with slices inside these index tuples is not especially convenient: trying to write a slice directly within a tuple leads to a syntax error.

In [None]:
d.loc[(:,2),(:'HR')]

SyntaxError: invalid syntax

We could get around this by building the desired slice explicitly with Python's built-in `slice()` function, but a better way in this context is to use the Pandas `IndexSlice` object, which is designed for exactly this situation.

In [None]:
idx=pd.IndexSlice
d.loc[idx[:,1],idx[:,'HR']]

subject      Bob Guido   Sue
type          HR    HR    HR
year visit
2013 1      29.0  46.0  43.0
2014 1      44.0  48.0  40.0


Here we've selected the rows for visit 1 of each year and all columns that hold HR measurements.
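For comparison, the same kind of selection can be written with Python's built-in `slice()`. The sketch below uses a small frame with the same row and column structure but deterministic placeholder values (`np.arange`), so the numbers are not the random data shown above.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
# placeholder values 0..23 so the result is reproducible
demo = pd.DataFrame(np.arange(24).reshape(4, 6), index=index, columns=columns)

# slice(None) plays the role of a bare ':' inside the tuple
with_slice = demo.loc[(slice(None), 1), (slice(None), 'HR')]

# the IndexSlice spelling of the same selection
idx = pd.IndexSlice
with_idx = demo.loc[idx[:, 1], idx[:, 'HR']]

print(with_slice)
print(with_slice.equals(with_idx))
```

Both spellings produce the same selection; `IndexSlice` simply lets us keep the familiar `:` notation.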

**There are many ways to interact with multiply indexed Series and DataFrame and the best way to become familiar with them is to try them out.**

## Rearranging Multi Indices

One of the keys to working with multiply indexed data is knowing how to transform it effectively. There are a number of operations that preserve all the information in the dataset but rearrange it for the purposes of various computations.

We've already seen `stack()` and `unstack()`, but there are many more methods for rearranging data between hierarchical indices and columns.

### Sorted and Unsorted indices

Earlier we briefly mentioned a caveat, and it deserves more detail here: many *MultiIndex* slicing operations will fail if the index is not sorted.

In [None]:
index=pd.MultiIndex.from_product([['a','c','b'],[1,2]])
data=pd.Series(np.random.rand(6),index=index)

#adding index names
data.index.names=['char','int']
data

char  int
a     1      0.789663
      2      0.327627
c     1      0.129235
      2      0.642455
b     1      0.742926
      2      0.559889
dtype: float64

If we try to take a partial slice of this data, we get an error.

In [None]:
data['a':'b']

UnsortedIndexError: ignored

Although it is not entirely clear from the error message, this failure is the result of the *MultiIndex* not being sorted. Partial slicing and other similar operations require the levels in the MultiIndex to be in sorted (that is, lexicographical) order.

Pandas provides routines to perform this kind of sorting, such as `sort_index()` and, in older versions of Pandas, `sortlevel()`. We'll use the simplest, `sort_index()`, here.

In [None]:
data=data.sort_index()
data

char  int
a     1      0.789663
      2      0.327627
b     1      0.742926
      2      0.559889
c     1      0.129235
      2      0.642455
dtype: float64

With this sorted data, partial slicing will work as expected.

In [None]:
data['a':'b']

char  int
a     1      0.789663
      2      0.327627
b     1      0.742926
      2      0.559889
dtype: float64
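One way to check whether an index is in sorted order before slicing is the `is_monotonic_increasing` property of the index. The sketch below rebuilds a small unsorted Series with placeholder integer values rather than the random numbers shown above.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]],
                                   names=['char', 'int'])
unsorted = pd.Series(np.arange(6), index=index)

# the labels run a, c, b, so the index is not in sorted order
print(unsorted.index.is_monotonic_increasing)

# sort_index() returns a new Series; the original is left untouched
sorted_ser = unsorted.sort_index()
print(sorted_ser.index.is_monotonic_increasing)

# partial slicing works on the sorted copy
sliced = sorted_ser.loc['a':'b']
print(sliced)
```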

### Stacked and Unstacked indices

As discussed earlier, we can convert a dataset from a stacked MultiIndex into a simple two-dimensional representation, optionally specifying the level to use.

In [None]:
s

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

Here, the state is of level 0 and year is level 1.

In [None]:
s.unstack(level=0)

state  California  New York     Texas
year
2000     33871648  18976457  20851820
2010     37253956  19378102  25145561


The indices in level 0 become the column names to give a two dimensional representation

In [None]:
s.unstack(level=1)

year            2000      2010
state
California  33871648  37253956
New York    18976457  19378102
Texas       20851820  25145561


Similarly, the indices in level 1 are unstacked into columns to give a two-dimensional representation.

The opposite of `unstack()` is `stack()` which can be used to recover the MultiIndex

In [None]:
s.unstack().stack()

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

### Index setting and resetting

Another way to rearrange hierarchical data is to turn the index labels into columns, which can be done with the `reset_index()` method.

Calling this method on the population Series results in a DataFrame with *state* and *year* columns holding the information that was formerly in the index.

For clarity, we can optionally specify the name of the data for the column representation; if we don't, the column is named 0.

In [None]:
s_flat=s.reset_index()
s_flat

        state  year         0
0  California  2000  33871648
1  California  2010  37253956
2    New York  2000  18976457
3    New York  2010  19378102
4       Texas  2000  20851820
5       Texas  2010  25145561


In [None]:
s_flat=s.reset_index(name='population')
s_flat

        state  year  population
0  California  2000    33871648
1  California  2010    37253956
2    New York  2000    18976457
3    New York  2010    19378102
4       Texas  2000    20851820
5       Texas  2010    25145561


Often when working with real-world data, the raw input data looks like this, and it's helpful to build a *MultiIndex* from the column values.

This can be done with the `set_index()` method of the DataFrame, which returns a multiply indexed DataFrame.

In [None]:
s_flat.set_index(['state','year'])

                 population
state      year
California 2000    33871648
           2010    37253956
New York   2000    18976457
           2010    19378102
Texas      2000    20851820
           2010    25145561


In practice, this type of reindexing is one of the most useful patterns when working with real-world datasets.
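To see that `reset_index()` and `set_index()` undo each other, here is a small round-trip sketch. The Series is rebuilt and given the name `population` (an assumption made so that the flattened column gets that name automatically).

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'],
)
pop = pd.Series(
    [33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
    index=index, name='population',
)

# index levels become ordinary columns
flat = pop.reset_index()
print(list(flat.columns))

# rebuild the MultiIndex from those columns and pick the data column again
restored = flat.set_index(['state', 'year'])['population']
print(restored.equals(pop))
```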

## Data Aggregation on Multi indices

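Pandas has built-in data aggregation methods such as `mean()`, `sum()`, `min()` and `max()`.

For hierarchically indexed data, the aggregate can be computed over a subset of the data defined by an index level. Older versions of Pandas accepted a *level* parameter on these methods for this purpose; that parameter was deprecated and then removed in Pandas 2.0, and the same result is obtained by grouping on the level with `groupby(level=...)`.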

In [None]:
data

char  int
a     1      0.789663
      2      0.327627
b     1      0.742926
      2      0.559889
c     1      0.129235
      2      0.642455
dtype: float64

In [None]:
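# average the visits within each year by grouping on the 'year' level
# (d.mean(level='year') was removed in pandas 2.0)
data_mean=d.groupby(level='year').mean()
data_mean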


subject   Bob        Guido          Sue
type       HR   Temp    HR   Temp    HR   Temp
year
2013     30.5  38.00  48.0  38.05  41.0  36.80
2014     42.5  36.25  44.0  36.50  42.5  37.55


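We can likewise take the mean among levels on the columns, for example the mean of each measurement type across subjects. One way to do this in current versions of Pandas is to transpose, group on the column level, and transpose back.

In [None]:
# equivalent to the older d.mean(axis=1,level='type')
d.T.groupby(level='type').mean().T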


type               HR       Temp
year visit
2013 1      39.333333  37.600000
     2      40.333333  37.633333
2014 1      44.000000  36.300000
     2      42.000000  37.233333


## Panel data

Older versions of Pandas had two other data structures, Panel and Panel4D, which were three-dimensional and four-dimensional generalizations of the one-dimensional Series and two-dimensional DataFrame. Both have since been removed from Pandas.

Once we are familiar with indexing and manipulating Series and DataFrame objects, the way Panel and Panel4D worked is straightforward to understand.

In most cases, multi-indexing is a more useful and conceptually simpler representation of higher-dimensional data.

Panel and Panel4D were dense data representations, whereas multi-indexing is a sparse data representation.

As the number of dimensions increases, a dense representation can become very inefficient for the majority of real-world data.
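A tiny sketch of the difference: in the made-up Series below, only three of the four possible `(char, int)` combinations exist. The multiply indexed form stores just those three values, while the dense two-dimensional form produced by `unstack()` has to hold a cell for every combination and fills the missing one with NaN.

```python
import pandas as pd

# three observed combinations out of four possible ones (made-up values)
sparse = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1)],
                                    names=['char', 'int']),
)

# the dense form needs one cell per (char, int) pair
dense = sparse.unstack()
print(dense)

print(len(sparse))                     # values stored in the multi-indexed form
print(dense.size)                      # cells in the dense form
print(int(dense.isna().sum().sum()))   # cells that only hold NaN padding
```

With more dimensions and more missing combinations, the share of padding cells in the dense form grows quickly.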