## Hierarchical Indexing
* Also known as *multi-indexing*.
* Incorporates multiple index *levels* within a single index.
* Lets higher-dimensional data be represented compactly within the familiar 1D **Series** and 2D **DataFrame** objects.

### A Multiply Indexed Series
#### The bad way

In [1]:
import pandas as pd
import numpy as np

In [40]:
# The bad way
index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
pop = pd.Series(populations, index=index)
pop

(California, 2000)    33871648
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
(Texas, 2010)         25145561
dtype: int64

In [None]:
# The bad way
pop[('California', 2010):('Texas', 2000)]

In [None]:
# The bad way
# select all values from 2010
pop[[i for i in pop.index if i[1] == 2010]]

#### The better way: Pandas MultiIndex

In [41]:
# Create a multi-index from the tuples as follows:
index = pd.MultiIndex.from_tuples(index)
index
# The MultiIndex contains multiple levels of indexing (in this case,
# the state names and the years) as well as multiple labels for each data point,
# which encode these levels.

MultiIndex(levels=[['California', 'New York', 'Texas'], [2000, 2010]],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])

In [42]:
# reindex to see the hierarchical representation
pop = pop.reindex(index)
pop
# data type: pandas.core.series.Series
# The first two columns show the multiple index values.
# The third column shows the data.
# In this multi-index representation, any blank entry indicates the same values 
# as the line above it.

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [None]:
type(pop)

In [None]:
# Access all data for which the 2nd index is 2010
pop[:, 2010]

#### MultiIndex as extra dimension
* Store the same data in a DataFrame.
* pop.unstack(level=-1, fill_value=None)
    * Unstack, a.k.a. pivot, <mark>a Series with a MultiIndex to produce a DataFrame.</mark>
    * The level involved will automatically get sorted.
    * level: int, string, or list of these, default last level
        * Level(s) to unstack; a level name can be passed.
    * fill_value: replace NaN with this value if the unstack produces missing values.
* pop_df.stack(level=-1, dropna=True)
    * Pivot a level of the (possibly hierarchical) column labels into the row index. The result is a DataFrame (or a Series, if the columns have only a single level) whose index has a new inner-most level of row labels.
    * The level involved will automatically get sorted.
    * level: int, string, or list of these, default last level
        * Level(s) to stack; a level name can be passed.
    * dropna: boolean, default True
        * Whether to drop rows in the resulting Frame/Series with no valid values.
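A minimal sketch of how `fill_value` behaves. The `sales` Series below is hypothetical (it is not part of these notes); it deliberately lacks one (store, year) combination so that unstacking produces a missing cell.

```python
import pandas as pd

# Hypothetical data: store 'B' has no entry for 2011.
idx = pd.MultiIndex.from_tuples(
    [('A', 2010), ('A', 2011), ('B', 2010)],
    names=['store', 'year'])
sales = pd.Series([10, 12, 7], index=idx)

# Without fill_value, the missing (B, 2011) cell becomes NaN.
wide_nan = sales.unstack()

# With fill_value, the missing cell is filled with the given value instead.
wide_zero = sales.unstack(fill_value=0)

print(wide_nan)
print(wide_zero)
```

Whether 0 is a sensible filler depends on the data; for counts it usually is, for measurements NaN is often the more honest choice.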

In [None]:
pop.unstack()

In [None]:
pop.unstack(level = 0)

In [None]:
pop_df = pop.unstack()
pop_df.stack()

<font color = red size = 2> **Why bother?!** </font>
* As we were able to use multi-indexing to represent two-dimensional data within a one-dimensional Series, we can also use it to represent data of three or more dimensions in a Series or DataFrame. 
* Each extra level in a multi-index represents an extra dimension of data; taking advantage of this property gives us much more flexibility in the types of data we can represent.
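To make the "extra level = extra dimension" idea concrete, here is a small sketch with a three-level index. The level names and the placeholder values are assumptions chosen for illustration, not real population figures.

```python
import numpy as np
import pandas as pd

# Hypothetical 3D data: 2 states x 2 years x 2 groups.
index = pd.MultiIndex.from_product(
    [['California', 'Texas'], [2000, 2010], ['total', 'under18']],
    names=['state', 'year', 'group'])
cube = pd.Series(np.arange(8), index=index)  # placeholder values 0..7

# Each level plays the role of one axis of a 2 x 2 x 2 array.
as_array = cube.to_numpy().reshape(2, 2, 2)

# One label per level addresses a single cell, much like as_array[i, j, k].
cell = cube.loc[('Texas', 2010, 'total')]
print(cell, as_array[1, 1, 0])
```

The reshape only lines up with the labels because `from_product` enumerates the combinations in row-major order; for an arbitrary MultiIndex, `unstack` is the safer way to move between the two views.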


In [None]:
# Concretely, we might want to add another column of demographic data 
# for each state at each year (say, population under 18); 
# with a MultiIndex this is as easy as adding another column to the DataFrame
pop_df = pd.DataFrame({'total': pop,
                       'under18': [9267089, 9284094,
                                   4687374, 4318033,
                                   5906301, 6879014]})
pop_df

In [None]:
pop_df.stack()

In [None]:
pop_df.unstack(level = 0)

In [None]:
# Calculation
f_u18 = pop_df['under18'] / pop_df['total']
f_u18

In [None]:
f_u18.unstack()

### Methods of MultiIndex Creation
* Pass a list of two or more index arrays to the constructor.
* If you pass a dictionary with appropriate tuples as keys, Pandas will automatically recognize this and use a MultiIndex by default.

In [None]:
# pass a list of two or more index arrays to the constructor
df = pd.DataFrame(np.random.rand(4, 2),
                 index = [['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                 columns = ['data1', 'data2'])
df

In [None]:
pd.DataFrame?

In [37]:
# pass a dictionary with tuples as keys; Pandas builds a MultiIndex automatically
data = {('California', 2000): 33871648,
        ('California', 2010): 37253956,
        ('Texas', 2000): 20851820,
        ('Texas', 2010): 25145561,
        ('New York', 2000): 18976457,
        ('New York', 2010): 19378102}
pd.Series(data)

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

### Explicit MultiIndex Constructors
* Use the class method constructors available in pd.MultiIndex:
    * from a simple list of arrays, *giving the index values within each level*
    * from a list of tuples, *giving the multiple index values of each point*
    * from a Cartesian product of single indices
    * Or construct it directly from its internal encoding by passing levels and labels (the labels argument is named codes in pandas 0.24 and later)
        * levels: a list of lists containing the available index values for each level
        * labels: a list of lists of integer positions that reference these level values
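As a quick check that these constructors describe the same thing, the sketch below builds one index three ways and compares the results, then prints the internal encoding. It assumes pandas 0.24 or later, where the integer positions are exposed as `.codes`.

```python
import pandas as pd

from_arrays = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
from_tuples = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
from_product = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

# All three describe the same four (letter, number) pairs.
same = from_arrays.equals(from_tuples) and from_tuples.equals(from_product)
print(same)

# Internal encoding: the unique values per level, plus integer positions into them.
print(from_product.levels)
print(from_product.codes)
```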

In [None]:
# construct the MultiIndex from a simple list of arrays
# giving the index values within each level
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1,2]])

In [None]:
# Construct it from a list of tuples, 
# giving the multiple index values of each point
pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)], 
                          names =['Letters', 'Numbers'])

In [None]:
# Construct it from a Cartesian product of single indices
inda = pd.MultiIndex.from_product([['a', 'b'], [1, 2]],
                          names=['Letters', 'Numbers'])

In [None]:
# Using its internal encoding by passing levels and codes
#   levels: a list of lists containing available index values for each level
#   codes: a list of lists of integer positions into those levels
#          (this argument was named `labels` before pandas 0.24)
pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
              codes=[[0, 0, 1, 1], [0, 1, 0, 1]])

In [None]:
df2 = df.reindex(inda)

In [None]:
df2.keys()

#### MultiIndex Level Names
* Pass the **names** argument to any of the above *MultiIndex* constructors, or
* set the **names** attribute of the index after the fact.
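A short sketch of both options, plus one reason naming is worthwhile: named levels can be referred to by name rather than by position. The names `letter` and `number` are made up for this example.

```python
import pandas as pd

# Option 1: name the levels at construction time.
named = pd.MultiIndex.from_product([['a', 'b'], [1, 2]],
                                   names=['letter', 'number'])

# Option 2: assign the names afterwards.
unnamed = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])
unnamed.names = ['letter', 'number']

# A named level can be passed by name wherever a level is expected.
s = pd.Series(range(4), index=named)
by_name = s.unstack(level='letter')
by_position = s.unstack(level=0)
print(by_name)
```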

In [None]:
# set the names attribute of the index after the fact
pop.index.names = ['state', 'year']
pop

### MultiIndex for columns
* df columns can have multiple levels of indices

In [2]:
# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]], 
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# mock some data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37

# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data

# 4D data:
#  Dimensions: subject, measurement type, year, and visit number.

subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      47.0  37.2  34.0  35.1  56.0  36.6
     2      29.0  37.2  42.0  38.4  49.0  35.5
2014 1      59.0  37.6  31.0  36.1  44.0  37.2
     2      31.0  36.3  17.0  37.4  13.0  36.6


In [None]:
health_data.loc[2013, 1]

### Indexing and Slicing a MultiIndex
* Think about the indices as added dimensions.

#### Multiply indexed Series
* Access single elements by indexing with multiple terms.
* *Partial indexing*, i.e. indexing just one of the levels in the index, is supported.
    * The result is another *Series*, with the lower-level indices maintained.
* Partial slicing is available, as long as the MultiIndex is sorted.
* With sorted indices, we can perform partial indexing on lower levels by passing an empty slice in the first index.
* Other types of indexing and selection:
    * Selection based on Boolean masks
    * Selection based on fancy indexing

In [None]:
pop

In [None]:
# Access single elements by indexing with multiple terms:
pop['California', 2000]

In [None]:
# Partial indexing (indexing just one of the levels in the index) is supported.
# The result is another Series, with the lower-level indices maintained.
pop['California']

In [None]:
# Partial slicing is available, as long as the MultiIndex is sorted.
pop.loc['California':'New York']

In [None]:
# With sorted indices, we can perform partial indexing on lower levels 
# by passing an empty slice in the first index
pop[:, 2000]

In [None]:
# Selection based on Boolean masks
pop[pop > 22000000]

In [None]:
# Selection based on fancy indexing
pop[['California', 'Texas']]

### Multiply indexed DataFrame
* <font color = red> Remember that the columns are primary in a DataFrame, and the syntax used for multiply indexed Series applies to the columns.</font>
* As with the single-index case, use the *loc* and *iloc* indexers (the older *ix* indexer is deprecated and has been removed from recent pandas releases).
* These indexers provide an array-like view of the underlying two-dimensional data, but each individual index in *loc* or *iloc* can be passed a tuple of multiple indices.
    * Trying to create a slice within a tuple will lead to a syntax error.
    * One workaround is to build the slice explicitly using Python's built-in *slice()* function (see the pd.IndexSlice help example).
    * It is better to use an *IndexSlice* object.
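The two workarounds can be compared side by side. This sketch uses a small stand-in frame with placeholder values (only the labels matter here) and checks that explicit `slice(None)` objects and `pd.IndexSlice` select the same cells.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
# Placeholder values; this frame is an assumption made for the example.
df = pd.DataFrame(np.arange(16).reshape(4, 4), index=index, columns=columns)

# Explicit slice objects: slice(None) is what a bare ':' stands for.
explicit = df.loc[(slice(None), 1), (slice(None), 'HR')]

# IndexSlice allows the same selection to be written with ':' notation.
idx = pd.IndexSlice
concise = df.loc[idx[:, 1], idx[:, 'HR']]

print(concise)
```

Both forms rely on the row and column indices being sorted, which `from_product` gives us here because the input lists are already in order.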

In [None]:
health_data

In [None]:
# Remember that the columns are primary in a DataFrame,
# and the syntax used for multiply indexed Series applies to the columns.
health_data['Guido', 'HR']

In [None]:
# As with the single-index case, use the loc and iloc indexers
health_data.iloc[:3, :3]

In [None]:
# each individual index in loc or iloc
# can be passed a tuple of multiple indices
health_data.loc[:, ('Bob', 'HR')]

In [None]:
# Trying to create a slice within a tuple will lead to a syntax error
health_data.loc[(:, 1), (:, 'HR')]

In [None]:
idx = pd.IndexSlice
health_data.loc[idx[:, 1], idx[:, 'HR']]

### Rearranging Multi-Indices
#### Sorted and unsorted indices
* <font color = red> *Many of the MultiIndex slicing operations will fail if the index is not sorted.* </font>
* Partial slices and other similar operations require the levels in the MultiIndex to be in sorted (i.e., lexicographical) order.
* Sorting:
    * data.sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True)
        * **axis**: the axis along which to sort (0 for the row index, 1 for the columns)
        * **level**: int or level name or list of ints or list of level names
            * If not None, sort on values in the specified index level(s).
        * **ascending**: boolean, default True
            * Sort ascending vs. descending.
        * **inplace**: bool, default False
            * If True, perform the operation in place.
        * **kind**: {'quicksort', 'mergesort', 'heapsort'}, default 'quicksort'
            * Choice of sorting algorithm. See also numpy.sort for more information.
            * `mergesort` is the only stable algorithm.
            * For DataFrames, this option is only applied when sorting on a single column or label.
        * **na_position**: {'first', 'last'}, default 'last'
            * `first` puts NaNs at the beginning, `last` puts NaNs at the end. Not implemented for MultiIndex.
        * **sort_remaining**: bool, default True
            * If True and sorting by level on a multilevel index, sort by the other levels too (in order) after sorting by the specified level.
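A brief sketch of a few of these parameters on a deliberately unsorted index. The Series `s` and its values are made up for the example; `is_monotonic_increasing` is used here as one way to check whether the index is already in sorted order.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['a', 'c', 'b'], [2, 1]],
                                   names=['char', 'int'])
s = pd.Series(range(6), index=index)

# The index is not in lexicographic order yet.
print(s.index.is_monotonic_increasing)

# Default: sort by every level, outermost first.
fully_sorted = s.sort_index()

# Sort on the 'int' level only; sort_remaining=False leaves 'char' out of the sort.
by_int = s.sort_index(level='int', sort_remaining=False)

# Descending order on all levels.
descending = s.sort_index(ascending=False)

print(fully_sorted)
```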


In [None]:
data.sort_index?

In [None]:
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
data = pd.Series(np.random.rand(6), index = index)
data.index.names = ['char', 'int']
data

In [None]:
# Many of the MultiIndex slicing operations will fail if the index is not sorted
try:
    data['a':'b']
except KeyError as e:
    print(type(e))
    print(e)

<font color = red size = 2> In mathematics, the lexicographic or lexicographical order (also known as lexical order, dictionary order, alphabetical order or lexicographic(al) product) is a generalization of the way words are alphabetically ordered based on the alphabetical order of their component letters. </font>
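Python's own tuple comparison is lexicographic, which makes the definition easy to try out. The sketch below sorts a made-up list of (letter, number) keys and then checks that sorting a MultiIndex built from the same keys yields the same order.

```python
import pandas as pd

# Tuples compare on their first elements; ties are broken by the second elements.
keys = [('b', 1), ('a', 2), ('c', 1), ('a', 1), ('b', 2)]
ordered = sorted(keys)
print(ordered)

# A MultiIndex built from the same keys sorts into the same order.
s = pd.Series(range(len(keys)), index=pd.MultiIndex.from_tuples(keys))
index_order = list(s.sort_index().index)
print(index_order == ordered)
```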

In [None]:
data.sort_index(inplace = True)
data

In [None]:
data['a':'b']

#### Stacking and unstacking indices
* Convert a dataset from a stacked multi-index to a simple 2D representation, optionally specifying the level to use
    * **unstack()** and **stack()**
    * pop.unstack(level=-1, fill_value=None)
        * fill_value: replace NaN with this value if the unstack produces missing values.
    * a.stack(level=-1, dropna=True)
        * dropna: whether to drop rows in the resulting Frame/Series with no valid values.

In [None]:
pop

In [None]:
pop.unstack(level = 0)

In [None]:
pop.unstack(level = 1)

In [None]:
pop.unstack().stack()

#### Index setting and resetting
* **reset_index method**: turn the index labels into columns
    * b.reset_index(level=None, drop=False, name=None, inplace=False)
    * level : int, str, tuple, or list, default None
        * Only remove the given levels from the index. Removes all levels by default.
    * drop : boolean, default False
        * Do not try to insert index into dataframe columns
    * name : object, default None
        * The name of the column corresponding to the Series values
    * inplace : boolean, default False
        * Modify the Series in place (do not create a new object)

* **set_index** method: set the DataFrame index (row labels) using one or more existing columns.
* c.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)
    * keys : column label or list of column labels / arrays
    * drop : boolean, default True
        * Delete columns to be used as the new index
    * append : boolean, default False
        * Whether to append columns to existing index
    * inplace : boolean, default False
        * Modify the DataFrame in place (do not create a new object)
    * verify_integrity : boolean, default False
        * Check the new index for duplicates. Otherwise defer the check until necessary.
        * Setting to False will improve the performance of this method
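The two methods are roughly inverses of each other. The sketch below round-trips a four-row subset of the population figures used earlier in these notes; the column name `population` is an assumption chosen for the example.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['California', 'Texas'], [2000, 2010]],
                                   names=['state', 'year'])
pop = pd.Series([33871648, 37253956, 20851820, 25145561], index=index)

# Move both index levels into ordinary columns; name= labels the value column.
flat = pop.reset_index(name='population')

# Move only the 'year' level out, keeping 'state' as the index.
partial = pop.reset_index(level='year', name='population')

# Rebuild the MultiIndex from the columns and pull the values back out.
rebuilt = flat.set_index(['state', 'year'])['population']

print(flat)
```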

In [3]:
health_data

subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      47.0  37.2  34.0  35.1  56.0  36.6
     2      29.0  37.2  42.0  38.4  49.0  35.5
2014 1      59.0  37.6  31.0  36.1  44.0  37.2
     2      31.0  36.3  17.0  37.4  13.0  36.6


In [8]:
health_data.stack()

subject           Bob  Guido   Sue
year visit type
2013 1     HR    47.0   34.0  56.0
           Temp  37.2   35.1  36.6
     2     HR    29.0   42.0  49.0
           Temp  37.2   38.4  35.5
2014 1     HR    59.0   31.0  44.0
           Temp  37.6   36.1  37.2
     2     HR    31.0   17.0  13.0
           Temp  36.3   37.4  36.6


In [44]:
a = health_data.stack()
b = a.stack()

In [45]:
b.name = 'Temperature'
b

year  visit  type  subject
2013  1      HR    Bob        47.0
                   Guido      34.0
                   Sue        56.0
             Temp  Bob        37.2
                   Guido      35.1
                   Sue        36.6
      2      HR    Bob        29.0
                   Guido      42.0
                   Sue        49.0
             Temp  Bob        37.2
                   Guido      38.4
                   Sue        35.5
2014  1      HR    Bob        59.0
                   Guido      31.0
                   Sue        44.0
             Temp  Bob        37.6
                   Guido      36.1
                   Sue        37.2
      2      HR    Bob        31.0
                   Guido      17.0
                   Sue        13.0
             Temp  Bob        36.3
                   Guido      37.4
                   Sue        36.6
Name: Temperature, dtype: float64

In [11]:
idx = pd.IndexSlice
b = a.loc[idx[:, :, :], idx['Bob']]
b

year  visit  type
2013  1      HR      47.0
             Temp    37.2
      2      HR      29.0
             Temp    37.2
2014  1      HR      59.0
             Temp    37.6
      2      HR      31.0
             Temp    36.3
Name: Bob, dtype: float64

In [54]:
c = b.reset_index()
c.set_index?

In [52]:
b.reset_index().set_index(['year', 'visit'])

            type subject  Temperature
year visit
2013 1        HR     Bob         47.0
     1        HR   Guido         34.0
     1        HR     Sue         56.0
     1      Temp     Bob         37.2
     1      Temp   Guido         35.1
     1      Temp     Sue         36.6
     2        HR     Bob         29.0
     2        HR   Guido         42.0
     2        HR     Sue         49.0
     2      Temp     Bob         37.2


In [16]:
b.reset_index(level = 0, name = 'value')

            year  value
visit type
1     HR    2013   47.0
      Temp  2013   37.2
2     HR    2013   29.0
      Temp  2013   37.2
1     HR    2014   59.0
      Temp  2014   37.6
2     HR    2014   31.0
      Temp  2014   36.3


### Data Aggregations on Multi-Indices
* Pass a level parameter that controls which subset of the data the aggregate is computed on.
    * health_data.mean(axis=None, skipna=None, level=None, numeric_only=None, \**kwargs)
        * axis : {index (0), columns (1)}
        * skipna : boolean, default True
            * Exclude NA/null values when computing the result.
        * level : int or level name, default None
            * If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
        * numeric_only : boolean, default None
            * Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
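The `level` argument of `mean` belongs to older pandas releases; it was deprecated in pandas 1.3 and removed in 2.0. A sketch of the equivalent computation with `groupby(level=...)`, which works in both old and new versions, is shown below on a stand-in frame with placeholder numbers (not the `health_data` values above).

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
# Placeholder numbers, chosen so that the means are easy to check by hand.
data = pd.DataFrame(np.arange(16, dtype=float).reshape(4, 4),
                    index=index, columns=columns)

# Average the visits within each year (aggregation over a row level).
by_year = data.groupby(level='year').mean()

# Average within each subject (aggregation over a column level):
# transpose, group the rows by 'subject', then transpose back.
by_subject = data.T.groupby(level='subject').mean().T

print(by_year)
print(by_subject)
```

The transpose trick is one way to avoid `groupby(axis=1)`, which newer pandas releases also discourage.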

In [60]:
health_data.mean(level = 'year')

subject   Bob        Guido          Sue
type       HR   Temp    HR   Temp    HR   Temp
year
2013     38.0   37.2  38.0  36.75  52.5  36.05
2014     45.0  36.95  24.0  36.75  28.5   36.9


In [61]:
health_data.mean(axis = 1, level = 'subject')

subject       Bob  Guido    Sue
year visit
2013 1       42.1  34.55   46.3
     2       33.1   40.2  42.25
2014 1       48.3  33.55   40.6
     2      33.65   27.2   24.8
