In [1]:
import pandas as pd
import numpy as np

# DATA MANIPULATION IN PANDAS - PART 2

## Hierarchical Indexing (Multi-Indexing)
Part 1 of this course dealt with one-dimensional and two-dimensional data stored in Pandas Series and DataFrames.
Part 2 starts with higher-dimensional data, i.e. data indexed by more than one or two keys. Hierarchical indexing incorporates multiple index levels within a single index, so higher-dimensional data can be represented compactly within a 1D Series or a 2D DataFrame.

* In this section, we explore the following:

  1. Direct Creation of MultiIndex Objects
  2. Considerations around Indexing
  3. Computing Statistics across multiple-indexed data
  4. Routines for converting between simple and hierarchically indexed data


### 1. MULTIPLE INDEXED SERIES
Let's start by considering how to represent 2D data within a 1D Series.

### The Bad Way to Index

* Example - Let's consider a Series in which each data point has a character key and a numerical key

In [2]:
populations = [33871648, 37253956, 18976457, 19378102, 20851820, 25145561]
pindex = [('California',2000),('California',2010),('New York',2000),('New York',2010),('Texas', 2000),('Texas',2010)]

popn = pd.Series(populations, index=pindex)
popn

(California, 2000)    33871648
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
(Texas, 2010)         25145561
dtype: int64

We can index or slice this Series using tuples as keys

In [3]:
# Example
popn[('California',2010):('Texas',2000)] # slice from (California, 2010) through (Texas, 2000), both ends included

(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
dtype: int64

#### This scheme makes it hard to select all values from 2010, or to do SQL-style selections, without some data munging

* Example - `Retrieve States population in 2010`

In [4]:
popn[[i for i in popn.index if i[1]==2010]]
#This is not clean code

(California, 2010)    37253956
(New York, 2010)      19378102
(Texas, 2010)         25145561
dtype: int64

### The Better Way: Pandas MultiIndex
The idea here is to first build a MultiIndex from the tuples and then reindex the Series with it

In [5]:
# Example
pindex = pd.MultiIndex.from_tuples(pindex)
pindex

MultiIndex([('California', 2000),
            ('California', 2010),
            (  'New York', 2000),
            (  'New York', 2010),
            (     'Texas', 2000),
            (     'Texas', 2010)],
           )

In [58]:
popn = popn.reindex(pindex)
popn

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
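The reindex step can also be skipped by handing the MultiIndex to the Series constructor. The following is a minimal sketch under that assumption; the names `tuples`, `mi` and `popn_direct` are illustrative and not part of the notebook.

```python
import pandas as pd

populations = [33871648, 37253956, 18976457, 19378102, 20851820, 25145561]
tuples = [('California', 2000), ('California', 2010),
          ('New York', 2000), ('New York', 2010),
          ('Texas', 2000), ('Texas', 2010)]

# Build the MultiIndex first and name its levels up front
# (the level names 'state' and 'year' are chosen for illustration).
mi = pd.MultiIndex.from_tuples(tuples, names=['state', 'year'])

# Passing the MultiIndex to the constructor avoids the separate reindex step.
popn_direct = pd.Series(populations, index=mi)
print(popn_direct)
```

Naming the levels at construction time also makes later selections by level name possible.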

The first two columns show the index levels and the third column shows the data. A blank entry in the first column means the same value as the line above it.
* Example - To access all data for which the second index is 2010

In [71]:
popn[:,2010]

state
California    37253956
New York      19378102
Texas         25145561
dtype: int64

* Example - To access all Texas records

In [75]:
popn['Texas':]

state  year
Texas  2000    20851820
       2010    25145561
dtype: int64

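Selecting a single value on one level, as in popn[:, 2010], returns a singly indexed Series with just the keys we're interested in. A slice such as popn['Texas':] keeps both index levels.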
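As a hedged aside, the same selections can be written with `xs` and `loc`, which state the intended level explicitly. The sketch below rebuilds the population Series under the illustrative name `popn_demo` so that it runs on its own.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
popn_demo = pd.Series(
    [33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
    index=index)

# Cross-section on the 'year' level: the selected level is dropped,
# leaving a Series indexed by state only.
pop_2010 = popn_demo.xs(2010, level='year')

# Selecting a label on the outer level also drops that level,
# leaving a Series indexed by year only.
texas = popn_demo.loc['Texas']

print(pop_2010)
print(texas)
```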

### 2. MULTIINDEX AS EXTRA DIMENSION 

Referring back to the earlier example, we could just as easily have stored the same data in a simple DataFrame with index and column labels. Pandas makes this conversion with the unstack() method

* Example - Converting a multiple indexed Series into a dataframe

In [8]:
popn_df = popn.unstack()
popn_df

                2000      2010
California  33871648  37253956
New York    18976457  19378102
Texas       20851820  25145561


* Example - Converting a dataframe to a Multiple indexed Series

In [9]:
popn_df.stack()

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
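A small check, sketched here with an illustrative copy of the data (`popn_demo`), suggests that `unstack()` followed by `stack()` is lossless when no values are missing.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
popn_demo = pd.Series(
    [33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
    index=index)

wide = popn_demo.unstack()    # states as rows, years as columns
round_trip = wide.stack()     # back to a two-level Series

# Same labels and same values as the Series we started from.
print(round_trip.index.equals(popn_demo.index))
print((round_trip.values == popn_demo.values).all())
```

If some state/year combinations were absent, `unstack()` would fill the gaps with NaN, and the round trip would no longer be exact.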

### Why bother about Hierarchical Indexing ?
Just as a MultiIndex lets us represent 2D data within a 1D Series, it lets us represent data of three or more dimensions in a Series or DataFrame: each extra level in the index acts as an extra dimension.

* Example - Adding an extra column to our population data

In [10]:
pops_df = pd.DataFrame({'total_pop': popn, 'under18_pop': [9267089, 9284094, 4687374, 4318033, 5906301, 6879014]},index=pindex)
pops_df

                 total_pop  under18_pop
California 2000   33871648      9267089
           2010   37253956      9284094
New York   2000   18976457      4687374
           2010   19378102      4318033
Texas      2000   20851820      5906301
           2010   25145561      6879014


* Example - Performing operations on Hierarchical Data
* Calculating the fraction of people under 18 by year

In [11]:
frac_u18 = pops_df['under18_pop']/pops_df['total_pop']
frac_u18

California  2000    0.273594
            2010    0.249211
New York    2000    0.247010
            2010    0.222831
Texas       2000    0.283251
            2010    0.273568
dtype: float64

Now unstack to see it by year

In [12]:
frac_u18.unstack()

                2000      2010
California  0.273594  0.249211
New York    0.247010  0.222831
Texas       0.283251  0.273568


## METHODS OF MULTIINDEX CREATION
The most straightforward way to construct a multiply indexed Series or DataFrame is to pass a list of two or more index arrays to the constructor

In [13]:
new_df = pd.DataFrame(np.random.rand(4,2),index=[['a','a','b','b'],[1,2,1,2]],columns=['data1','data2'])
new_df

        data1     data2
a 1  0.543847  0.618855
  2  0.628742  0.407979
b 1  0.337872  0.315155
  2  0.034455  0.620377


#### Passing a dictionary with appropriate tuples as keys will also create a MultiIndex

* Example 

In [14]:
data = {('California',2000):33871648, ('California',2010): 37253956, ('Texas',2000): 20851820, ('Texas',2010):25145161,('New York',2000):12783211,('New York',2010):31267822}
dat = pd.Series(data)
dat.unstack()

                2000      2010
California  33871648  37253956
New York    12783211  31267822
Texas       20851820  25145161


### 1. EXPLICIT MULTIINDEX CONSTRUCTORS


* A. From a simple list of Arrays

In [15]:
pd.MultiIndex.from_arrays([['a','a','b','b'],[1,2,1,2]])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

* B. From a list of tuples giving the multiple index values of each point

In [16]:
pd.MultiIndex.from_tuples([('a',1),('a',2),('b',1),('b',2)])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

* C. From a Cartesian product of single Indices

In [17]:
pd.MultiIndex.from_product([['a','b'],[1,2]])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )
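The three constructors above build the same index. The sketch below checks this with `equals` and peeks at the `levels` and `codes` attributes in which a MultiIndex stores its labels; the variable names are illustrative.

```python
import pandas as pd

from_arrays = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
from_tuples = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
from_product = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

# All three describe the same four (letter, number) pairs.
same = from_arrays.equals(from_tuples) and from_tuples.equals(from_product)
print(same)

# Internally, each level stores its unique values once, and the codes
# say which value of each level applies at each position.
print(from_product.levels)
print(from_product.codes)
```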

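MultiIndex Level Names - It is convenient to name the levels of the MultiIndex, either by passing the names argument to a constructor or by setting the names attribute of the index afterwards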
* Example 

In [18]:
dat.index.names = ['state', 'year']
dat

state       year
California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145161
New York    2000    12783211
            2010    31267822
dtype: int64
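Assigning to `index.names` changes the Series in place. When that is not wanted, `rename_axis` returns a renamed copy instead. This is a sketch with illustrative variable names (`dat_demo`, `named`), not the notebook's own objects.

```python
import pandas as pd

data = {('California', 2000): 33871648, ('California', 2010): 37253956,
        ('Texas', 2000): 20851820, ('Texas', 2010): 25145161}
dat_demo = pd.Series(data)

# rename_axis returns a new Series; dat_demo keeps its unnamed levels.
named = dat_demo.rename_axis(['state', 'year'])

print(dat_demo.index.names)
print(named.index.names)
```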

### 2. MultiIndex For Columns
In a DataFrame, rows and columns are completely symmetric: just as the rows can have multiple levels of indices, the columns can have multiple levels as well

* Example 

In [19]:
# Obtain some hierarchical Indices & Columns
index = pd.MultiIndex.from_product([[2013,2014],[1,2]],names=['year','visit'])
columns = pd.MultiIndex.from_product([['Bob','Guido','Sue'],['HR8','Temp']],names=['subject','type'])

# Make up some mock data
data = np.round(np.random.randn(4,6),1)
data[:,::2] # selects every other column (the HR8 columns); a no-op here because the result is not used
data += 37 # shift every value by adding 37

# Create a DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data

subject      Bob        Guido         Sue
type         HR8  Temp    HR8  Temp   HR8  Temp
year visit
2013 1      36.9  35.8   36.3  36.8  36.4  38.4
     2      38.0  36.2   37.7  37.8  35.5  35.5
2014 1      37.0  36.4   37.0  35.7  35.1  38.0
     2      36.1  36.5   34.7  35.7  37.7  39.1


##### This is basically 4D data, where the dimensions are subject, measurement type, year and visit number. We can index the top-level column by a person's name and get a full DataFrame containing just that person's data

* Example

In [20]:
health_data['Guido']

type         HR8  Temp
year visit
2013 1      36.3  36.8
     2      37.7  37.8
2014 1      37.0  35.7
     2      34.7  35.7
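Indexing with a name selects on the top column level. To select on the inner level instead (for example, every subject's temperature), one option is `xs` with `axis=1`. The sketch below rebuilds a comparable frame under the illustrative name `health_demo`, using a seeded generator so the mock values are reproducible; the values themselves are made up and differ from the table above.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR8', 'Temp']],
                                     names=['subject', 'type'])

# Seeded mock data centred on 37 (illustrative values only).
rng = np.random.default_rng(0)
values = np.round(rng.normal(loc=37, scale=1, size=(4, 6)), 1)
health_demo = pd.DataFrame(values, index=index, columns=columns)

# Top column level: everything recorded for one subject.
guido = health_demo['Guido']

# Inner column level: one measurement type for every subject.
temps = health_demo.xs('Temp', axis=1, level='type')

print(guido.columns.tolist())
print(temps.columns.tolist())
```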


# INDEXING & SLICING A MULTIINDEX
Indexing and slicing a MultiIndex is most intuitive if each index level is thought of as an extra dimension. We look first at multiply indexed Series and then at multiply indexed DataFrames

## Multiple-Indexed Series
We shall consider the multilevel-indexed Series of state populations (popn) studied earlier

* Example - Given the multilevel indexed Series

In [21]:
popn

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

* Extract California record for the year 2000

In [22]:
popn['California',2000]

33871648

* Partial indexing is supported: indexing just one of the levels returns a Series with the remaining level as its index

In [23]:
popn['California']

2000    33871648
2010    37253956
dtype: int64

* Partial Slicing is also supported as long as MultiIndex is sorted

In [24]:
popn.loc['California':'New York']

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
dtype: int64

* Selection based on boolean masks

In [78]:
popn[popn > 22000000] #Select records with population greater than 22 million

state       year
California  2000    33871648
            2010    37253956
Texas       2010    25145561
dtype: int64

* Selection based on Fancy Indexing

In [26]:
popn[['California','Texas']]

California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
dtype: int64

# MULTIPLE INDEXED DATAFRAMES

* Example Considering our health data

In [27]:
health_data

subject      Bob        Guido         Sue
type         HR8  Temp    HR8  Temp   HR8  Temp
year visit
2013 1      36.9  35.8   36.3  36.8  36.4  38.4
     2      38.0  36.2   37.7  37.8  35.5  35.5
2014 1      37.0  36.4   37.0  35.7  35.1  38.0
     2      36.1  36.5   34.7  35.7  37.7  39.1


Columns are primary in a DataFrame, so plain indexing applies to the column levels. We can retrieve Guido's heart rate data with a simple indexing operation

In [28]:
health_data['Guido','HR8']

year  visit
2013  1        36.3
      2        37.7
2014  1        37.0
      2        34.7
Name: (Guido, HR8), dtype: float64

We can use the loc and iloc attributes to index data as well

In [29]:
health_data.iloc[:3,:3]#Give me the first 3 rows & first 3 columns

subject      Bob        Guido
type         HR8  Temp    HR8
year visit
2013 1      36.9  35.8   36.3
     2      38.0  36.2   37.7
2014 1      37.0  36.4   37.0


These indexers give an array-like view of the underlying two-dimensional data, and each individual index passed to loc can itself be a tuple of multiple indices

In [30]:
health_data.loc[:,('Bob','HR8')]

year  visit
2013  1        36.9
      2        38.0
2014  1        37.0
      2        36.1
Name: (Bob, HR8), dtype: float64

### Working with Slices in Multiple Indexed Series & DataFrames
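Writing a slice such as : directly inside one of these index tuples is a syntax error. Slices within a tuple index are therefore written with the Pandas IndexSlice object (or built explicitly with Python's slice())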

* Example

In [31]:
idx = pd.IndexSlice
health_data.loc[idx[:,2],idx[:,'HR8']]

subject      Bob  Guido   Sue
type         HR8    HR8   HR8
year visit
2013 2      38.0   37.7  35.5
2014 2      36.1   34.7  37.7
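`pd.IndexSlice` is a convenience for building tuples of `slice` objects. The sketch below, on an illustrative frame named `health_demo` with made-up values, shows that the explicit `slice(None)` spelling selects the same data.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR8', 'Temp']],
                                     names=['subject', 'type'])
rng = np.random.default_rng(1)
health_demo = pd.DataFrame(np.round(rng.normal(37, 1, size=(4, 6)), 1),
                           index=index, columns=columns)

idx = pd.IndexSlice

# Second visit of every year, HR8 column of every subject.
with_index_slice = health_demo.loc[idx[:, 2], idx[:, 'HR8']]

# The same selection with explicit slice objects: slice(None) means ':'.
with_builtin_slice = health_demo.loc[(slice(None), 2), (slice(None), 'HR8')]

print(with_index_slice.equals(with_builtin_slice))
```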


### DATA TRANSFORMATION 
### Rearranging MultiIndices for Operations

One of the keys to working with multiply indexed data is knowing how to transform it. A number of operations preserve all the information in the dataset but rearrange it for the purposes of computation. We will explore three of them here:

1. Sorted and Unsorted Indices
2. Stacking and Unstacking Indices
3. Index Setting & Resetting

### NOTE: MANY MULTIINDEX SLICING OPERATIONS WILL FAIL IF THE INDEX IS NOT SORTED

### 1. Sorted & Unsorted Indices

* Example - Let's create multiply indexed data whose indices are not lexicographically sorted

In [32]:
index = pd.MultiIndex.from_product([['a','c','b'],[1,2]])
data = pd.Series(np.random.rand(6), index=index)
data.index.names = ['char','int']
data

char  int
a     1      0.009352
      2      0.513924
c     1      0.275427
      2      0.501500
b     1      0.294214
      2      0.925914
dtype: float64

* Slicing on the index requires the index to be sorted first

In [33]:
data = data.sort_index()
data

char  int
a     1      0.009352
      2      0.513924
b     1      0.294214
      2      0.925914
c     1      0.275427
      2      0.501500
dtype: float64

* Perform the slicing operation

In [34]:
data['a':'b']

char  int
a     1      0.009352
      2      0.513924
b     1      0.294214
      2      0.925914
dtype: float64
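To see why the sort matters, the sketch below attempts the same slice before and after `sort_index()`. The Series `unsorted` is an illustrative stand-in with fixed values; catching `KeyError` relies on Pandas raising `UnsortedIndexError`, which is a `KeyError` subclass.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]],
                                   names=['char', 'int'])
unsorted = pd.Series(np.arange(6.0), index=index)

# A quick way to check whether label slicing is safe.
print(unsorted.index.is_monotonic_increasing)  # False for this index

slice_failed = False
try:
    unsorted.loc['a':'b']
except KeyError as err:  # UnsortedIndexError subclasses KeyError
    slice_failed = True
    print(type(err).__name__, err)

# After sorting, the same slice works.
ordered = unsorted.sort_index()
subset = ordered.loc['a':'b']
print(subset)
```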

### 2. Stacking & Unstacking Indices
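The level argument of unstack() selects which index level is moved into the columns: level=0 is the outermost level (state here) and level=1 is the next one (year here). It does not mean "by rows" versus "by columns"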

* Example - Going back to our population data

In [35]:
popn

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [36]:
popn.unstack(level=0) # move level 0 (state) into the columns

      California  New York     Texas
2000    33871648  18976457  20851820
2010    37253956  19378102  25145561


In [37]:
popn.unstack(level=1) # move level 1 (year) into the columns

                2000      2010
California  33871648  37253956
New York    18976457  19378102
Texas       20851820  25145561


In [38]:
popn.unstack().stack() #Recover the original series using stack

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
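When the levels are named, `unstack` also accepts the level name, which tends to read more clearly than a position. A brief sketch, using an illustrative copy of the data (`popn_demo`), also shows that the two results are transposes of each other.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
popn_demo = pd.Series(
    [33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
    index=index)

by_state = popn_demo.unstack(level='state')  # states become the columns
by_year = popn_demo.unstack(level='year')    # years become the columns

print(by_state)
print(by_year)

# Each table is the transpose of the other.
print((by_state.T.values == by_year.values).all())
```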

### 3. Index Setting & Resetting
Raw data in the real world rarely comes with a MultiIndex; you usually have to create it. The reset_index method turns index labels into ordinary columns, and the set_index method builds a MultiIndex from columns.

* Example - Flatten the population dataset

In [39]:
popn.index.names = ['state','year']
popn_flat = popn.reset_index(name='population') # name= labels the data column
popn_flat

        state  year  population
0  California  2000    33871648
1  California  2010    37253956
2    New York  2000    18976457
3    New York  2010    19378102
4       Texas  2000    20851820
5       Texas  2010    25145561


* Set the index to rebuild the MultiIndex from the state and year columns

In [40]:
popn_flat.set_index(['state','year'])

                 population
state      year
California 2000    33871648
           2010    37253956
New York   2000    18976457
           2010    19378102
Texas      2000    20851820
           2010    25145561
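The two methods are inverses of each other. The sketch below, using illustrative names (`popn_demo`, `flat`, `restored`), flattens the Series, rebuilds the MultiIndex, and checks that nothing was lost.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
popn_demo = pd.Series(
    [33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
    index=index)

# Index levels become ordinary columns; name= labels the data column.
flat = popn_demo.reset_index(name='population')

# Columns become index levels again; selecting the one remaining column
# returns a Series with the original two-level index.
restored = flat.set_index(['state', 'year'])['population']

print(flat.columns.tolist())
print(restored.index.equals(popn_demo.index))
```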


## DATA AGGREGATION ON MULTIINDICES
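Pandas has built-in data aggregation methods such as mean(), sum() and max(). For hierarchically indexed data, grouping by an index level with groupby(level=...) controls which subset of the data each aggregate is computed on. Older Pandas versions also accepted a level argument directly on these methods; that argument was removed in Pandas 2.0.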

* Example - Lets consider the health data

In [41]:
health_data

subject      Bob        Guido         Sue
type         HR8  Temp    HR8  Temp   HR8  Temp
year visit
2013 1      36.9  35.8   36.3  36.8  36.4  38.4
     2      38.0  36.2   37.7  37.8  35.5  35.5
2014 1      37.0  36.4   37.0  35.7  35.1  38.0
     2      36.1  36.5   34.7  35.7  37.7  39.1


* Example - Compute the average of heart rate and temperature per year 

In [42]:
mean_data_peryr = health_data.groupby(level='year').mean()
mean_data_peryr

subject    Bob         Guido          Sue
type       HR8   Temp    HR8  Temp    HR8   Temp
year
2013     37.45  36.00  37.00  37.3  35.95  36.95
2014     36.55  36.45  35.85  35.7  36.40  38.55


* Example - Compute the average heart rate and temperature for each visit number, across years

In [43]:
mean_healthvisit_peryr = health_data.groupby(level='visit').mean()
mean_healthvisit_peryr

subject    Bob         Guido           Sue
type       HR8   Temp    HR8   Temp    HR8  Temp
visit
1        36.95  36.10  36.65  36.25  35.75  38.2
2        37.05  36.35  36.20  36.75  36.60  37.3


By further making use of the axis keyword, we can take the mean among levels on columns 

In [44]:
mean_data_peryr.groupby(level='type',axis=1).mean() # mean per measurement type across the 3 subjects

type        HR8   Temp
year
2013  36.800000  36.75
2014  36.266667  36.90
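Recent Pandas releases (2.1 and later, to the best of our knowledge) deprecate the `axis=1` argument of `groupby`. One alternative, sketched below on an illustrative frame named `health_demo` with made-up values, is to transpose, group on the row level, and transpose back.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR8', 'Temp']],
                                     names=['subject', 'type'])
rng = np.random.default_rng(2)
health_demo = pd.DataFrame(np.round(rng.normal(37, 1, size=(4, 6)), 1),
                           index=index, columns=columns)

# Step 1: average the visits within each year (grouping on a row level).
yearly = health_demo.groupby(level='year').mean()

# Step 2: average over subjects for each measurement type.
# Transposing turns the column levels into row levels, so an ordinary
# groupby(level=...) applies; the final .T restores the orientation.
by_type = yearly.T.groupby(level='type').mean().T

print(by_type)
```

The result has one row per year and one column per measurement type, matching the layout of the table above.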
