<a href="https://colab.research.google.com/github/ProfessorPatrickSlatraigh/CST3512/blob/main/HierarchicalIndexing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Hierarchical Indexing    

This notebook is based on [Section 8.1 Hierarchical Indexing](https://wesmckinney.com/book/data-wrangling.html) from Chapter 8, "Data Wrangling: Join, Combine, and Reshape", in Wes McKinney's 'Python for Data Analysis'.



In many applications, data may be spread across a number of files or databases, or be arranged in a form that is not convenient to analyze. Chapter 8 of the book focuses on tools that help combine, join, and rearrange data.

This notebook introduces the concept of **hierarchical indexing** in pandas, which is used extensively in some of these operations. Chapter 8 of the book then digs into the particular data manipulations. Various applied usages of these tools can be seen in [Data Analysis Examples](https://wesmckinney.com/book/data-wrangling.html#data-analysis-examples).



---



##**Housekeeping**    

Import required modules    


In [2]:
# Import pandas 
import pandas as pd

# Import numpy   
import numpy as np


**Hierarchical indexing** is an important feature of pandas that enables you to have multiple (two or more) index levels on an axis. Another way of thinking about it is that it provides a way for you to work with higher dimensional data in a lower dimensional form. Let’s start with a simple example: create a Series with a list of lists (or arrays) as the index:

In [None]:
data = pd.Series(np.random.randn(9),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])

print(data)    


What you’re seeing is a prettified view of a Series with a MultiIndex as its index. The “gaps” in the index display mean “use the label directly above”:

In [None]:
print(data.index)
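
As a minimal sketch of what a MultiIndex holds, the cell below builds a small Series with fixed values and inspects its index. The name `demo` and the values 0 through 8 are illustrative assumptions, chosen so that the notebook's own `data` variable is left untouched.

```python
import pandas as pd

# Hypothetical demo Series: same index layout as `data`, but with fixed values 0..8
demo = pd.Series(range(9),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])

idx = demo.index
print(type(idx).__name__)     # MultiIndex
print(idx.nlevels)            # 2
print(list(idx.levels[0]))    # ['a', 'b', 'c', 'd']
print(list(idx.levels[1]))    # [1, 2, 3]

# Each row label is a tuple with one entry per level
fourth_label = idx[3]
print(fourth_label)
```

Each level stores its unique labels once, and every row label is a tuple that combines one label from each level.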

With a hierarchically indexed object, so-called partial indexing is possible, enabling you to concisely select subsets of the data:

In [None]:
data['b']

In [None]:
data['b':'c']

In [None]:
data.loc[['b', 'd']]

Selection is even possible from an “inner” level. Here I select all of the values whose label in the second index level is the integer `2`:

In [None]:
data.loc[:, 2]
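
Because `data` holds random values, the selections above are hard to check by eye. The following sketch repeats them on a hypothetical Series named `demo` with fixed values, and also shows `xs` as an explicit way to take a cross-section on the second level.

```python
import pandas as pd

# Hypothetical demo Series with fixed values 0..8 (not the notebook's `data`)
demo = pd.Series(range(9),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])

# Partial indexing on the outer level: the inner labels remain as the index
outer_b = demo['b']
print(outer_b.tolist())         # [3, 4]
print(outer_b.index.tolist())   # [1, 3]

# Selection on the inner level: the outer labels remain as the index
inner_2 = demo.loc[:, 2]
print(inner_2.tolist())         # [1, 6, 7]
print(inner_2.index.tolist())   # ['a', 'c', 'd']

# The same cross-section, written with an explicit level argument
inner_2_xs = demo.xs(2, level=1)
print(inner_2_xs.tolist())      # [1, 6, 7]
```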

Hierarchical indexing plays an important role in reshaping data and group-based operations like forming a pivot table. For example, you can rearrange this data into a DataFrame using its `unstack` method:

In [None]:
data.unstack()

The inverse operation of unstack is stack:

In [None]:
data.unstack().stack()
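
One detail worth seeing with fixed values: `unstack` fills label combinations that do not exist in the Series with `NaN`. The sketch below uses a hypothetical Series named `demo`. The explicit `dropna()` is an assumption added here so that the round trip gives the same result whether or not the installed pandas version drops those entries inside `stack` itself.

```python
import pandas as pd

# Hypothetical demo Series with float values 0.0..8.0 (not the notebook's `data`)
demo = pd.Series([float(i) for i in range(9)],
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])

# Outer level becomes the rows, inner level becomes the columns
wide = demo.unstack()
print(wide.shape)                      # (4, 3)

# ('b', 2), ('c', 3) and ('d', 1) do not exist in demo, so they are NaN
n_missing = int(wide.isna().sum().sum())
print(n_missing)                       # 3

# Stack back into a Series and discard the filled-in NaN entries
roundtrip = wide.stack().dropna()
print(roundtrip.to_dict() == demo.to_dict())   # True
```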

`stack` and `unstack` are explored in more detail in [Chapter 8 of Wes McKinney's Python for Data Analysis](https://wesmckinney.com/book/data-wrangling.html).

With a DataFrame, either axis can have a hierarchical index:


In [11]:
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])

In [None]:
print(frame)

The hierarchical levels can have names (as strings or any Python objects). If so, these will show up in the console output:


In [15]:
# Assign key1 and key2 as `frame` index hierarchy names, respectively   
frame.index.names = ['key1', 'key2'] 

# Assign state and color as `frame` column hierarchy names, respectively 
frame.columns.names = ['state', 'color']


In [None]:
print(frame)

***Caution***    
*Be careful to note that the names 'state' and 'color' label the column levels; they are not part of the row labels (the `frame.index` values).*

With partial column indexing you can similarly select groups of columns:

In [None]:
frame['Ohio']

A `MultiIndex` can be created by itself and then reused; the columns in the preceding DataFrame with level names could be created like this:

In [None]:
pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'],
                          ['Green', 'Red', 'Green']],
                          names=['state', 'color'])
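
To illustrate the reuse mentioned above, here is a small sketch that builds the column MultiIndex once and attaches it to two DataFrames. The names `cols`, `first`, and `second` are hypothetical and exist only for this example.

```python
import numpy as np
import pandas as pd

# Build the column index once
cols = pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'],
                                  ['Green', 'Red', 'Green']],
                                 names=['state', 'color'])

# Reuse it for two hypothetical frames of the same width
first = pd.DataFrame(np.arange(6).reshape((2, 3)), columns=cols)
second = pd.DataFrame(np.ones((2, 3)), columns=cols)

print(first.columns.equals(second.columns))              # True

# Partial column indexing leaves the inner 'color' level as the columns
print(first['Ohio'].columns.tolist())                    # ['Green', 'Red']

# Level names let you look up one level's labels by name
print(first.columns.get_level_values('color').tolist())  # ['Green', 'Red', 'Green']
```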

##Reordering and Sorting Levels    



At times you may need to rearrange the order of the levels on an axis or sort the data by the values in one specific level. The `swaplevel` method takes two level numbers or names and returns a new object with the levels interchanged (the data is otherwise unaltered):

In [None]:
frame.swaplevel('key1', 'key2')

`sort_index`, on the other hand, sorts the rows by their index labels. By default it uses all levels; passing `level` sorts by the indicated level first, with the remaining levels breaking ties. When swapping levels, it’s not uncommon to also use `sort_index` so that the result is lexicographically sorted by the indicated level:

In [None]:
frame.sort_index(level=1)

In [None]:
frame.swaplevel(0, 1).sort_index(level=0)
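
The difference between swapping and sorting is easier to see on the row labels themselves. This sketch uses a hypothetical frame named `demo_frame` with plain columns `x`, `y`, `z` (an assumption, to keep the output narrow) and records the label order after each step.

```python
import numpy as np
import pandas as pd

# Hypothetical demo frame with the same row index as `frame`
demo_frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                                    names=['key1', 'key2']),
    columns=['x', 'y', 'z'])

# swaplevel only reorders the levels inside each label; rows stay in place
swapped = demo_frame.swaplevel('key1', 'key2')
swapped_labels = swapped.index.tolist()
print(list(swapped.index.names))       # ['key2', 'key1']
print(swapped['x'].tolist())           # [0, 3, 6, 9]

# sort_index then reorders the rows by the new outer level
sorted_swapped = swapped.sort_index(level=0)
sorted_labels = sorted_swapped.index.tolist()
print(sorted_swapped['x'].tolist())    # [0, 6, 3, 9]

# The original frame is unchanged because both methods return new objects
print(list(demo_frame.index.names))    # ['key1', 'key2']
```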

***Note:***    

*Data selection performance is much better on hierarchically indexed objects if the index is lexicographically sorted starting with the outermost level—that is, the result of calling `sort_index(level=0)` or `sort_index()`.*    
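
One way to check whether an index is already lexicographically sorted is the `is_monotonic_increasing` property. The sketch below uses a deliberately shuffled, hypothetical Series named `unsorted`. It only demonstrates the check and makes no claim about how large the speed difference is.

```python
import pandas as pd

# Hypothetical Series whose MultiIndex is deliberately out of order
unsorted = pd.Series(range(4), index=[['b', 'a', 'b', 'a'], [2, 1, 1, 2]])
print(unsorted.index.is_monotonic_increasing)    # False

# sort_index() with no arguments sorts starting from the outermost level
in_order = unsorted.sort_index()
print(in_order.index.is_monotonic_increasing)    # True
print(in_order.tolist())                         # [1, 3, 2, 0]
```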



##Summary Statistics by Level    



To compute descriptive and summary statistics by level, group on the level you want to aggregate by with `groupby(level=...)`. (Older pandas versions also accepted a `level` argument directly on methods such as `sum`; that argument was removed in pandas 2.0.) Consider the above DataFrame; we can aggregate by level on either the rows or columns like so:

In [None]:
frame.groupby(level='key2').sum()

In [None]:
# groupby(..., axis=1) is deprecated in recent pandas, so group the transposed frame instead
frame.T.groupby(level='color').sum().T

These aggregations rely on pandas’s `groupby` machinery, which is discussed in more detail in the data aggregation chapter of [Python for Data Analysis](https://wesmckinney.com/book/data-aggregation.html).
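
As a worked check of the level-wise sums, the sketch below rebuilds the same 4x3 table under the hypothetical name `demo_frame` and lists the results. Grouping the columns is done on the transposed frame, which is an assumption made here to avoid the `axis=1` argument of `groupby`.

```python
import numpy as np
import pandas as pd

# Hypothetical demo frame with the same values and level names as `frame`
demo_frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                                    names=['key1', 'key2']),
    columns=pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'],
                                       ['Green', 'Red', 'Green']],
                                      names=['state', 'color']))

# Sum rows that share the same key2 label
by_key2 = demo_frame.groupby(level='key2').sum()
print(by_key2.values.tolist())     # [[6, 8, 10], [12, 14, 16]]

# Sum columns that share the same color: transpose, group rows, transpose back
by_color = demo_frame.T.groupby(level='color').sum().T
print(by_color['Green'].tolist())  # [2, 8, 14, 20]
print(by_color['Red'].tolist())    # [1, 4, 7, 10]
```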

##Indexing with a DataFrame's Columns    



It’s not unusual to want to use one or more columns from a DataFrame as the row index; alternatively, you may wish to move the row index into the DataFrame’s columns. Here’s an example DataFrame:

In [24]:
frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two',
                            'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})


In [None]:
print(frame)

DataFrame’s `set_index` method will create a new DataFrame using one or more of its columns as the index:

In [26]:
frame2 = frame.set_index(['c', 'd'])

In [None]:
print(frame2)

By default the columns are removed from the DataFrame, though you can leave them in by passing `drop=False` to `set_index`:

In [None]:
frame.set_index(['c', 'd'], drop=False)

`reset_index`, on the other hand, does the opposite of `set_index`; the hierarchical index levels are moved into the columns:

In [None]:
frame2.reset_index()
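
The two methods can be checked as a round trip. This is a minimal sketch on a hypothetical copy of the table named `demo`. It shows that `reset_index` places the former index levels first, so the column order changes even though the data does not.

```python
import pandas as pd

# Hypothetical copy of the example table (not the notebook's `frame`)
demo = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                     'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                     'd': [0, 1, 2, 0, 1, 2, 3]})

# Move columns c and d into a two-level row index
indexed = demo.set_index(['c', 'd'])
print(indexed.index.nlevels)         # 2
print(indexed.columns.tolist())      # ['a', 'b']

# Move the index levels back into columns; they come first in the result
restored = indexed.reset_index()
print(restored.columns.tolist())     # ['c', 'd', 'a', 'b']

# Apart from column order, the round trip reproduces the original table
same_data = restored[list(demo.columns)].equals(demo)
print(same_data)                     # True
```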



---



#**Related Exercise**


*See the notebook ['Dewey_Dictionary'](https://bit.ly/dewey_notebook) for a related exercise on hierarchical indexing using the Dewey Decimal System.* 



---

