# Exercise 4: Hierarchical Indexing in pandas
<span style="color: brown;">Jessica Reyes</span>

In [1]:
import pandas as pd
import numpy as np

This set of exercises focuses on hierarchical indexing. First, let's create some data covering nine individuals ('A' through 'I') observed across three years (1977 to 1979); not every individual appears in every year:

In [2]:
data_list = [ [1977, 'A', 1.20, 0.60], 
        [1977, 'B', 1.50, 0.50], 
        [1977, 'C', 1.70, 0.80],
        [1978, 'A', 0.20, 0.06],
        [1978, 'B', 0.70, 0.20],
        [1978, 'C', 0.80, 0.30],
        [1978, 'D', 0.90, 0.50],
        [1978, 'E', 1.40, 0.90],
        [1979, 'C', 0.20, 0.15],
        [1979, 'D', 0.14, 0.05],
        [1979, 'E', 0.50, 0.15],
        [1979, 'F', 1.20, 0.50],
        [1979, 'G', 3.40, 1.90],
        [1979, 'H', 5.40, 2.70],
        [1979, 'I', 6.40, 1.20] ]
data = pd.DataFrame(data_list, columns=['year', 'indiv', 'x', 'z'])
data

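| | year | indiv | x | z |
|---|---|---|---|---|
| 0 | 1977 | A | 1.2 | 0.6 |
| 1 | 1977 | B | 1.5 | 0.5 |
| 2 | 1977 | C | 1.7 | 0.8 |
| 3 | 1978 | A | 0.2 | 0.06 |
| 4 | 1978 | B | 0.7 | 0.2 |
| 5 | 1978 | C | 0.8 | 0.3 |
| 6 | 1978 | D | 0.9 | 0.5 |
| 7 | 1978 | E | 1.4 | 0.9 |
| 8 | 1979 | C | 0.2 | 0.15 |
| 9 | 1979 | D | 0.14 | 0.05 |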


Using this data, answer the following questions:

* Create a hierarchical index for the DataFrame based on the ``year`` and ``indiv`` columns using the [set_index](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html) method.

In [3]:
# add a hierarchical index
data = data.set_index(['year', 'indiv'])
data

| year | indiv | x | z |
|---|---|---|---|
| 1977 | A | 1.2 | 0.6 |
| 1977 | B | 1.5 | 0.5 |
| 1977 | C | 1.7 | 0.8 |
| 1978 | A | 0.2 | 0.06 |
| 1978 | B | 0.7 | 0.2 |
| 1978 | C | 0.8 | 0.3 |
| 1978 | D | 0.9 | 0.5 |
| 1978 | E | 1.4 | 0.9 |
| 1979 | C | 0.2 | 0.15 |
| 1979 | D | 0.14 | 0.05 |
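
As a minimal sketch of what `set_index` does here, the snippet below rebuilds a three-row subset of the exercise data (the names `df`, `indexed`, and `restored` are illustrative, not part of the exercise) and inspects the resulting `MultiIndex`. It also shows that `reset_index` moves the levels back into ordinary columns.

```python
import pandas as pd

# A three-row subset of the exercise data, rebuilt so the snippet runs on its own.
df = pd.DataFrame(
    [[1977, 'A', 1.20, 0.60],
     [1977, 'B', 1.50, 0.50],
     [1978, 'A', 0.20, 0.06]],
    columns=['year', 'indiv', 'x', 'z'],
)

indexed = df.set_index(['year', 'indiv'])

# The two columns are now the levels of a MultiIndex.
index_names = list(indexed.index.names)
n_levels = indexed.index.nlevels

# reset_index() turns the index levels back into regular columns.
restored = indexed.reset_index()
print(index_names, n_levels)
print(restored)
```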


* Extract the data from 1977

In [5]:
# the 1977 data
data.loc[1977]

| indiv | x | z |
|---|---|---|
| A | 1.2 | 0.6 |
| B | 1.5 | 0.5 |
| C | 1.7 | 0.8 |


* Extract the data for individual A

In [6]:
# the data for individual A
# keep in mind that the following won't work, because it looks for a
# column labelled 'A' rather than an index value:
# data.loc[:, 'A']
# use pd.IndexSlice, or reorder the levels of the index using the pandas
# swaplevel method
data.loc[pd.IndexSlice[:, 'A'], :]

| year | indiv | x | z |
|---|---|---|---|
| 1977 | A | 1.2 | 0.6 |
| 1978 | A | 0.2 | 0.06 |
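
The code comments above mention `swaplevel` as an alternative to `pd.IndexSlice`. The sketch below compares the two on a four-row subset of the data and adds a third option, `xs`, which is not used in the exercise itself. The variable names are illustrative.

```python
import pandas as pd

# A four-row subset of the exercise data with the same hierarchical index.
df = pd.DataFrame(
    [[1977, 'A', 1.20, 0.60],
     [1977, 'B', 1.50, 0.50],
     [1978, 'A', 0.20, 0.06],
     [1978, 'B', 0.70, 0.20]],
    columns=['year', 'indiv', 'x', 'z'],
).set_index(['year', 'indiv'])

# Option 1: cross-section on the inner level; the 'indiv' level is dropped.
a_xs = df.xs('A', level='indiv')

# Option 2: swap the levels so 'indiv' is outermost, sort, then use plain .loc.
a_swapped = df.swaplevel('year', 'indiv').sort_index().loc['A']

# Option 3: the IndexSlice form used in the exercise keeps both index levels.
a_slice = df.loc[pd.IndexSlice[:, 'A'], :]

print(a_xs)
print(a_slice)
```

Options 1 and 2 return a frame indexed by `year` only, while option 3 keeps the full `(year, indiv)` index, which can matter if the result is later aligned with the original data.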


* When performing `data.unstack()` on your multi-indexed DataFrame you get a lot of `NaN`s.  Explain why!


In [10]:
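# pivot the inner index level ('indiv') into columns
# (the Mean columns appear in the output because this cell, In [10],
# was run after the Mean column was added in In [9])
data.unstack()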

| year | x / A | x / B | x / C | x / D | x / E | x / F | x / G | x / H | x / I | z / A | ... | z / I | Mean / A | Mean / B | Mean / C | Mean / D | Mean / E | Mean / F | Mean / G | Mean / H | Mean / I |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1977 | 1.2 | 1.5 | 1.7 | NaN | NaN | NaN | NaN | NaN | NaN | 0.6 | ... | NaN | 0.9 | 1.0 | 1.25 | NaN | NaN | NaN | NaN | NaN | NaN |
| 1978 | 0.2 | 0.7 | 0.8 | 0.9 | 1.4 | NaN | NaN | NaN | NaN | 0.06 | ... | NaN | 0.13 | 0.45 | 0.55 | 0.7 | 1.15 | NaN | NaN | NaN | NaN |
| 1979 | NaN | NaN | 0.2 | 0.14 | 0.5 | 1.2 | 3.4 | 5.4 | 6.4 | NaN | ... | 1.2 | NaN | NaN | 0.175 | 0.095 | 0.325 | 0.85 | 2.65 | 4.05 | 3.8 |


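<span style="color: brown;">`unstack()` pivots one of the index levels (by default the innermost, here `indiv`) into columns, so the result has one row per year and one column per individual for each variable. If a particular combination of index values (such as 1977 and D) is missing from the original DataFrame, `unstack()` still creates a cell for it and fills that cell with NaN. Because most individuals were observed in only one or two of the three years, many cells are NaN.</span>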
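
The explanation above can be checked on a small example. This is a sketch on a hypothetical four-row subset (the names `df`, `wide`, and `wide_filled` are illustrative): two year/individual combinations are absent from the input, and exactly those two become NaN after `unstack()`.

```python
import pandas as pd

# Four observations; (1977, 'D') and (1978, 'B') are deliberately absent.
df = pd.DataFrame(
    [[1977, 'A', 1.20],
     [1977, 'B', 1.50],
     [1978, 'A', 0.20],
     [1978, 'D', 0.90]],
    columns=['year', 'indiv', 'x'],
).set_index(['year', 'indiv'])

# Rows are years, columns are individuals A, B, D.
wide = df['x'].unstack()

# Count the placeholder cells created for the missing combinations.
n_missing = int(wide.isna().sum().sum())

# stack() reverses the pivot; dropping NaN leaves the original observations.
n_restored = len(wide.stack().dropna())

# fill_value replaces the NaN placeholder with a value of your choice.
wide_filled = df['x'].unstack(fill_value=0.0)

print(wide)
print(n_missing, n_restored)
```

Whether filling with 0.0 is appropriate depends on the data; for measurements like these, NaN is usually the more honest marker of "not observed".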

* Run the statement `data['z'] == data.loc[:,'z']` and explain the result.

In [8]:
# compare two ways of selecting the column z, element by element
data['z'] == data.loc[:,'z']

year  indiv
1977  A        True
      B        True
      C        True
1978  A        True
      B        True
      C        True
      D        True
      E        True
1979  C        True
      D        True
      E        True
      F        True
      G        True
      H        True
      I        True
Name: z, dtype: bool

<span style="color: brown;">Both `data['z']` and `data.loc[:, 'z']` select the same column `z`, so the element-wise comparison returns a boolean Series, indexed like the DataFrame, that is True for every row. This holds here because `z` contains no NaN values; NaN never compares equal to itself, so a row with a missing `z` would yield False.</span>
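
To illustrate the NaN caveat, the sketch below uses a hypothetical column with one missing value (the exercise data has none, so this is an assumption made purely for demonstration). It contrasts the element-wise `==` with `Series.equals`, which treats NaNs in the same position as equal.

```python
import numpy as np
import pandas as pd

# Hypothetical data: unlike the exercise data, this z contains a NaN.
df = pd.DataFrame({'z': [0.6, 0.5, np.nan]})

# Element-wise comparison of the two selection styles.
same = df['z'] == df.loc[:, 'z']

# equals() compares whole Series and treats NaNs in the same position as equal.
identical = df['z'].equals(df.loc[:, 'z'])

print(same.tolist())
print(identical)
```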

* Add a new column to the data called `Mean` which equals the average of the two columns.  Use the `mean` method of a pandas DataFrame.

In [9]:
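# add a mean column (row-wise mean over x and z; the columns are selected
# explicitly so that re-running the cell does not average in Mean itself)
data['Mean'] = data[['x', 'z']].mean(axis=1)
data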

| year | indiv | x | z | Mean |
|---|---|---|---|---|
| 1977 | A | 1.2 | 0.6 | 0.9 |
| 1977 | B | 1.5 | 0.5 | 1.0 |
| 1977 | C | 1.7 | 0.8 | 1.25 |
| 1978 | A | 0.2 | 0.06 | 0.13 |
| 1978 | B | 0.7 | 0.2 | 0.45 |
| 1978 | C | 0.8 | 0.3 | 0.55 |
| 1978 | D | 0.9 | 0.5 | 0.7 |
| 1978 | E | 1.4 | 0.9 | 1.15 |
| 1979 | C | 0.2 | 0.15 | 0.175 |
| 1979 | D | 0.14 | 0.05 | 0.095 |
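
The `axis` argument decides the direction of the average. As a small sketch on two rows of the data (variable names are illustrative), `axis=1` yields one mean per row, which is what the `Mean` column needs, while the default `axis=0` yields one mean per column.

```python
import pandas as pd

# Two rows of the exercise data, without the index, to keep the example short.
df = pd.DataFrame({'x': [1.2, 1.5], 'z': [0.6, 0.5]})

# axis=1: average across the columns, giving one value per row.
row_means = df[['x', 'z']].mean(axis=1)

# axis=0 (the default): average down the rows, giving one value per column.
col_means = df.mean(axis=0)

print(row_means)
print(col_means)
```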
