<a href="https://colab.research.google.com/github/ProfessorPatrickSlatraigh/CST3512/blob/main/HierarchicalIndexing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Hierarchical Indexing    

This notebook is based on [Section 8.1 Hierarchical Indexing](https://wesmckinney.com/book/data-wrangling.html) from Chapter 8, "Data Wrangling: Join, Combine, and Reshape", in Wes McKinney's *Python for Data Analysis*.



In many applications, data may be spread across a number of files or databases, or be arranged in a form that is not convenient to analyze. Chapter 8 of the book focuses on tools to help combine, join, and rearrange data.

This notebook introduces the concept of **hierarchical indexing** in pandas, which is used extensively in some of these operations. Chapter 8 of the book then digs into the particular data manipulations. Various applied usages of these tools can be seen in [Data Analysis Examples](https://wesmckinney.com/book/data-wrangling.html#data-analysis-examples).



---



##**Housekeeping**    

Import required modules    


In [1]:
# Import pandas 
import pandas as pd

# Import numpy   
import numpy as np


**Hierarchical indexing** is an important feature of pandas that enables you to have multiple (two or more) index levels on an axis. Another way of thinking about it is that it provides a way for you to work with higher dimensional data in a lower dimensional form. Let’s start with a simple example: create a Series with a list of lists (or arrays) as the index:

In [16]:
data = pd.Series(np.random.randn(9),
       index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
       [1, 2, 3, 1, 3, 1, 2, 2, 3]])

print(data)    


a  1    0.819729
   2    1.591709
   3   -0.380602
b  1   -0.481111
   3    0.321316
c  1   -0.528414
   2    1.701760
d  2    0.112483
   3    0.580383
dtype: float64


In [26]:
corp = pd.Series(['first', 'second', 'third', 'fourth', 'fifth',
                  'sixth', 'seventh', 'eighth', 'ninth'],
                 index=[['Americas', 'Americas', 'Americas', 'EMEA', 'EMEA',
                         'AsiaPac', 'AsiaPac', 'Corp', 'Corp'],
                        [101, 201, 301, 101, 301, 101, 201, 201, 301]])

print(corp)    

Americas  101      first
          201     second
          301      third
EMEA      101     fourth
          301      fifth
AsiaPac   101      sixth
          201    seventh
Corp      201     eighth
          301      ninth
dtype: object


What you’re seeing is a prettified view of a Series with a MultiIndex as its index. The “gaps” in the index display mean “use the label directly above”:

In [17]:
print(data.index)

MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 3),
            ('c', 1),
            ('c', 2),
            ('d', 2),
            ('d', 3)],
           )


In [27]:
print(corp.index)

MultiIndex([('Americas', 101),
            ('Americas', 201),
            ('Americas', 301),
            (    'EMEA', 101),
            (    'EMEA', 301),
            ( 'AsiaPac', 101),
            ( 'AsiaPac', 201),
            (    'Corp', 201),
            (    'Corp', 301)],
           )
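The structure behind that display can also be inspected directly. The following is a minimal sketch, assuming a deterministic stand-in Series `s` (a name not used elsewhere in this notebook) with the same index as `data`; it uses the `nlevels` and `levels` attributes and the `get_level_values` method of the `MultiIndex`.

```python
import pandas as pd

# Deterministic stand-in for the random `data` Series (assumption for illustration)
s = pd.Series(range(9),
              index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                     [1, 2, 3, 1, 3, 1, 2, 2, 3]])

idx = s.index
n_levels = idx.nlevels                         # number of index levels
outer_labels = list(idx.levels[0])             # unique labels of the outer level
inner_per_row = list(idx.get_level_values(1))  # inner label for every row

print(n_levels)
print(outer_labels)
print(inner_per_row)
```

The "gaps" in the printed Series correspond to repeated outer labels; `get_level_values` shows that every row still carries a label at each level.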


With a hierarchically indexed object, so-called partial indexing is possible, enabling you to concisely select subsets of the data:

In [18]:
data['b']

1   -0.481111
3    0.321316
dtype: float64

In [28]:
corp['EMEA']

101    fourth
301     fifth
dtype: object

In [20]:
data['c']

1   -0.528414
2    1.701760
dtype: float64

In [24]:
data['b':'c']

b  1   -0.481111
   3    0.321316
c  1   -0.528414
   2    1.701760
dtype: float64

In [23]:
data['b':'d']

b  1   -0.481111
   3    0.321316
c  1   -0.528414
   2    1.701760
d  2    0.112483
   3    0.580383
dtype: float64

In [22]:
data.loc[['b', 'd']]

b  1   -0.481111
   3    0.321316
d  2    0.112483
   3    0.580383
dtype: float64

In [29]:
corp.loc[['EMEA', 'Americas', 'AsiaPac']]

EMEA      101     fourth
          301      fifth
Americas  101      first
          201     second
          301      third
AsiaPac   101      sixth
          201    seventh
dtype: object
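Partial indexing supplies only the outer label. Supplying a label for every level selects a single value instead. The sketch below illustrates the difference, assuming a deterministic stand-in Series `s` rather than the random `data` above.

```python
import pandas as pd

# Deterministic stand-in for `data` (assumption for illustration)
s = pd.Series(range(9),
              index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                     [1, 2, 3, 1, 3, 1, 2, 2, 3]])

group_b = s.loc['b']         # outer label only: a Series indexed by the inner level
one_value = s.loc[('b', 3)]  # labels for both levels: a single scalar

print(group_b)
print(one_value)
```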

Selection is even possible from an “inner” level. Here I select all of the values whose second index level has the label 2:

In [30]:
data.loc[:, 2]

a    1.591709
c    1.701760
d    0.112483
dtype: float64

In [32]:
corp.loc[:,201]

Americas     second
AsiaPac     seventh
Corp         eighth
dtype: object
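As a possible alternative to the `.loc[:, 2]` form, the `xs` (cross-section) method selects by label at a named or numbered level. This is a sketch on a deterministic stand-in Series `s`, not on the notebook's random `data`.

```python
import pandas as pd

# Deterministic stand-in for `data` (assumption for illustration)
s = pd.Series(range(9),
              index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                     [1, 2, 3, 1, 3, 1, 2, 2, 3]])

by_loc = s.loc[:, 2]      # slice the outer level, pick label 2 on the inner level
by_xs = s.xs(2, level=1)  # cross-section at label 2 of level 1

print(by_loc)
print(by_xs)
```

Both forms drop the selected level from the result, leaving only the outer labels as the index.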

Hierarchical indexing plays an important role in reshaping data and group-based operations like forming a pivot table. For example, you can rearrange this data into a DataFrame using its `unstack` method:

In [33]:
data.unstack()

          1         2         3
a  0.819729  1.591709 -0.380602
b -0.481111       NaN  0.321316
c -0.528414  1.701760       NaN
d       NaN  0.112483  0.580383


In [34]:
corp.unstack()

             101      201    301
Americas   first   second  third
AsiaPac    sixth  seventh    NaN
Corp         NaN   eighth  ninth
EMEA      fourth      NaN  fifth


The inverse operation of `unstack` is `stack`:

In [35]:
data.unstack().stack()

a  1    0.819729
   2    1.591709
   3   -0.380602
b  1   -0.481111
   3    0.321316
c  1   -0.528414
   2    1.701760
d  2    0.112483
   3    0.580383
dtype: float64

In [36]:
corp.unstack().stack()

Americas  101      first
          201     second
          301      third
AsiaPac   101      sixth
          201    seventh
Corp      201     eighth
          301      ninth
EMEA      101     fourth
          301      fifth
dtype: object

`stack` and `unstack` are explored in more detail in [Chapter 8 of Wes McKinney's Python for Data Analysis](https://wesmckinney.com/book/data-wrangling.html).
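The round trip can be checked on a small deterministic example. This is a sketch under stated assumptions: `s` is a stand-in for `data`, and the explicit `dropna()` is a precaution, because `stack` has historically discarded missing entries by default and that default may differ in newer pandas versions.

```python
import pandas as pd

# Deterministic stand-in for `data` (assumption for illustration)
s = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
              index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                     [1, 2, 3, 1, 3, 1, 2, 2, 3]])

wide = s.unstack()  # the inner level becomes the columns

# Label combinations absent from the Series appear as missing values
n_missing = int(wide.isna().sum().sum())

# Stack back into a Series; dropna() removes any missing entries that remain
restored = wide.stack().dropna()

print(wide.shape)
print(n_missing)
print(restored)
```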

With a DataFrame, either axis can have a hierarchical index:


In [37]:
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
            index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
            columns=[['Ohio', 'Ohio', 'Colorado'],
            ['Green', 'Red', 'Green']])

In [38]:
print(frame)

     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11


The hierarchical levels can have names (as strings or any Python objects). If so, these will show up in the console output:


In [45]:
# Assign key1 and key2 as `frame` index hierarchy names, respectively   
frame.index.names = ['key1', 'key2'] 

# Assign state and color as `frame` column hierarchy names, respectively 
frame.columns.names = ['state', 'color']


In [46]:
print(frame)

state      Ohio     Colorado
color     Green Red    Green
key1 key2                   
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11


In [41]:
# Alternatively, assign Region and Product as the `frame` index level names
# (later cells that refer to 'key1' and 'key2' assume the names from the previous cell)
frame.index.names = ['Region', 'Product'] 

# Assign state and color as the `frame` column level names
frame.columns.names = ['state', 'color']

In [42]:
print(frame)

state           Ohio     Colorado
color          Green Red    Green
Region Product                   
a      1           0   1        2
       2           3   4        5
b      1           6   7        8
       2           9  10       11


***Caution***    
*Be careful to note that the level names 'state' and 'color' belong to the columns and are not part of the row labels (the `frame.index` values).*
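To make the distinction between level names and labels concrete, the sketch below rebuilds a frame like the one above under the assumed name `df`, with the 'key1'/'key2' naming, and reads the names and the labels separately.

```python
import numpy as np
import pandas as pd

# Rebuild of the example frame under a different name (assumption for illustration)
df = pd.DataFrame(np.arange(12).reshape((4, 3)),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=[['Ohio', 'Ohio', 'Colorado'],
                           ['Green', 'Red', 'Green']])
df.index.names = ['key1', 'key2']
df.columns.names = ['state', 'color']

row_names = list(df.index.names)    # names of the row levels
col_names = list(df.columns.names)  # names of the column levels

# The labels themselves are retrieved per level, here by level name
key1_labels = list(df.index.get_level_values('key1'))

print(row_names, col_names)
print(key1_labels)
```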

With partial column indexing you can similarly select groups of columns:

In [43]:
frame['Ohio']

color           Green  Red
Region Product            
a      1            0    1
       2            3    4
b      1            6    7
       2            9   10


A `MultiIndex` can be created by itself and then reused; the columns in the preceding DataFrame with level names could be created like this:

In [44]:
pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'],
                          ['Green', 'Red', 'Green']],
                          names=['state', 'color'])

MultiIndex([(    'Ohio', 'Green'),
            (    'Ohio',   'Red'),
            ('Colorado', 'Green')],
           names=['state', 'color'])
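`from_arrays` is one of several constructors. As a hedged sketch, the same index can be built from a list of tuples with `from_tuples`, while `from_product` builds every combination of the given labels, so it is not a drop-in replacement here.

```python
import pandas as pd

from_arrays = pd.MultiIndex.from_arrays(
    [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
    names=['state', 'color'])

# One tuple per entry, outer label first
from_tuples = pd.MultiIndex.from_tuples(
    [('Ohio', 'Green'), ('Ohio', 'Red'), ('Colorado', 'Green')],
    names=['state', 'color'])

same = from_tuples.equals(from_arrays)

# Cartesian product of the labels: also contains ('Colorado', 'Red')
full = pd.MultiIndex.from_product(
    [['Ohio', 'Colorado'], ['Green', 'Red']],
    names=['state', 'color'])

print(same)
print(full)
```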

##Reordering and Sorting Levels    



At times you may need to rearrange the order of the levels on an axis or sort the data by the values in one specific level. The `swaplevel` method takes two level numbers or names and returns a new object with the levels interchanged (but the data is otherwise unaltered):

In [47]:
frame.swaplevel('key1', 'key2')

state      Ohio     Colorado
color     Green Red    Green
key2 key1                   
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    b        9  10       11


`sort_index`, on the other hand, sorts the data lexicographically using all of the index levels by default; passing the `level` argument sorts by the chosen level first. When swapping levels, it’s not uncommon to also use `sort_index` so that the result is lexicographically sorted by the indicated level:

In [48]:
frame.sort_index(level=1)

state      Ohio     Colorado
color     Green Red    Green
key1 key2                   
a    1        0   1        2
b    1        6   7        8
a    2        3   4        5
b    2        9  10       11


In [None]:
frame.swaplevel(0, 1).sort_index(level=0)

***Note:***    

*Data selection performance is much better on hierarchically indexed objects if the index is lexicographically sorted starting with the outermost level—that is, the result of calling `sort_index(level=0)` or `sort_index()`.*    
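Whether an index is sorted in this sense can be checked with the `is_monotonic_increasing` attribute. The sketch below assumes a small frame `df` with hypothetical flat columns 'x', 'y', 'z' and compares the index before and after `sort_index`.

```python
import numpy as np
import pandas as pd

# Small frame with a two-level row index; the flat column names are hypothetical
df = pd.DataFrame(np.arange(12).reshape((4, 3)),
                  index=pd.MultiIndex.from_arrays(
                      [['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                      names=['key1', 'key2']),
                  columns=['x', 'y', 'z'])

swapped = df.swaplevel('key1', 'key2')  # levels interchanged, row order unchanged
sorted_swapped = swapped.sort_index()   # sort lexicographically from the outermost level

before = swapped.index.is_monotonic_increasing
after = sorted_swapped.index.is_monotonic_increasing

print(before, after)
print(sorted_swapped)
```

Swapping levels alone leaves the rows in their original order, which is why the swapped index is no longer sorted until `sort_index` is applied.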



##Summary Statistics by Level    



Descriptive and summary statistics on DataFrame and Series can be aggregated by index level on a particular axis. Older pandas releases exposed this as a `level` option on many of the statistics themselves; the cells below instead pass `level` to `groupby`. Consider the above DataFrame; we can aggregate by `level` on either the rows or columns like so:

In [None]:
frame

In [49]:
frame.groupby(level='key2').sum()

state  Ohio     Colorado
color Green Red    Green
key2                    
1         6   8       10
2        12  14       16


In [50]:
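# Note: recent pandas versions deprecate `axis=1` in groupby;
# frame.T.groupby(level='color').sum().T gives the same result
frame.groupby(level='color', axis=1).sum()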

color      Green  Red
key1 key2            
a    1         2    1
     2         8    4
b    1        14    7
     2        20   10


These aggregations rely on pandas’s `groupby` machinery, which is discussed in more detail in the book [Python for Data Analysis](https://wesmckinney.com/book/data-aggregation.html).
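The two aggregations can be reproduced in a self-contained sketch. It assumes a rebuilt frame named `df` and, for the column-wise case, groups the transposed frame rather than passing `axis=1`, which is one way to avoid depending on that argument.

```python
import numpy as np
import pandas as pd

# Rebuild of the example frame under a different name (assumption for illustration)
df = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=pd.MultiIndex.from_arrays(
        [['a', 'a', 'b', 'b'], [1, 2, 1, 2]], names=['key1', 'key2']),
    columns=pd.MultiIndex.from_arrays(
        [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
        names=['state', 'color']))

# Sum rows that share the same 'key2' label
by_key2 = df.groupby(level='key2').sum()

# Sum columns that share the same 'color' label: transpose, group, transpose back
by_color = df.T.groupby(level='color').sum().T

print(by_key2)
print(by_color)
```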

##Indexing with a DataFrame's Columns    



It’s not unusual to want to use one or more columns from a DataFrame as the row index; alternatively, you may wish to move the row index into the DataFrame’s columns. Here’s an example DataFrame:

In [56]:
frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
            'c': ['one', 'one', 'one', 'two', 'two',
                 'two', 'two'],
            'd': [0, 1, 2, 0, 1, 2, 3]})


In [53]:
print(frame)

   a  b    c  d
0  0  7  one  0
1  1  6  one  1
2  2  5  one  2
3  3  4  two  0
4  4  3  two  1
5  5  2  two  2
6  6  1  two  3


DataFrame’s `set_index` method creates a new DataFrame using one or more of its columns as the index:

In [54]:
frame2 = frame.set_index(['c', 'd'])

In [55]:
print(frame2)

       a  b
c   d      
one 0  0  7
    1  1  6
    2  2  5
two 0  3  4
    1  4  3
    2  5  2
    3  6  1


By default the columns are removed from the DataFrame, though you can leave them in by passing `drop=False` to `set_index`:

In [57]:
frame.set_index(['c', 'd'], drop=False)

       a  b    c  d
c   d              
one 0  0  7  one  0
    1  1  6  one  1
    2  2  5  one  2
two 0  3  4  two  0
    1  4  3  two  1
    2  5  2  two  2
    3  6  1  two  3


`reset_index`, on the other hand, does the opposite of `set_index`; the hierarchical index levels are moved into the columns:

In [58]:
frame2.reset_index()

     c  d  a  b
0  one  0  0  7
1  one  1  1  6
2  one  2  2  5
3  two  0  3  4
4  two  1  4  3
5  two  2  5  2
6  two  3  6  1
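A short round-trip sketch shows that `reset_index` undoes `set_index` apart from column order. It assumes a copy of the example data under the name `df`.

```python
import pandas as pd

# Copy of the example data under a different name (assumption for illustration)
df = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                   'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                   'd': [0, 1, 2, 0, 1, 2, 3]})

indexed = df.set_index(['c', 'd'])  # 'c' and 'd' become the two index levels
flat = indexed.reset_index()        # the levels move back into the columns

# reset_index places the former index levels first, so restore the
# original column order before comparing with the starting frame
reordered = flat[['a', 'b', 'c', 'd']]

print(list(flat.columns))
print(reordered)
```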




---



#**Related Exercise**


*See the notebook ['Dewey_Dictionary'](https://bit.ly/dewey_notebook) for a related exercise on hierarchical indexing using the Dewey Decimal System.* 



---

