
<font size="36"><b>Pandas - Part III</b></font> <img src = "https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2018/03/pandas.jpg" height=200 width=200>

## Multi-Index

Often it is useful to store higher-dimensional data, that is, data indexed by more than one or two keys.

A common pattern in practice is to make use of hierarchical indexing (AKA: multi-indexing) to incorporate multiple index levels within a single index. 

In this way, higher-dimensional data can be compactly represented within a one-dimensional Series or a two-dimensional DataFrame.
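As a minimal sketch of the idea, here is a Series that carries two keys per value through a `pd.MultiIndex`. The restaurant names and numbers are made up for illustration and are not part of the data used below.

```python
import pandas as pd

# Hypothetical toy data: one value per (restaurant, year) pair.
index = pd.MultiIndex.from_tuples(
    [('cafe_a', '2017'), ('cafe_a', '2019'),
     ('cafe_b', '2017'), ('cafe_b', '2019')],
    names=['restaurant', 'year'])
visits = pd.Series([10, 12, 20, 18], index=index)

print(visits.index.nlevels)            # number of index levels
print(visits.loc['cafe_a'])            # all years for one restaurant
print(visits.loc[('cafe_b', '2019')])  # a single value addressed by both keys
```

Both keys live in the index, so the object itself stays one-dimensional.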

Recall the number of customers data from before:

In [136]:
import pandas as pd

customer_num = pd.Series([100, 80, 60, 200],
                       index=['humus_hakerem', 'falafel_gina', 
                              '24_rupee', 'pizza_munch']) # number of customers per day

hours_open = pd.Series([10, 12, 9, 17],
                      index=['humus_hakerem', 'falafel_gina', 
                             'al_harampa', '24_rupee'])

print('customer_num=\n{}\n'.format(customer_num))
print('hours_open=\n{}\n'.format(hours_open))

print('average number of customers per hour\n{}\n'.format(customer_num/hours_open))

customer_num=
humus_hakerem    100
falafel_gina      80
24_rupee          60
pizza_munch      200
dtype: int64

hours_open=
humus_hakerem    10
falafel_gina     12
al_harampa        9
24_rupee         17
dtype: int64

average number of customers per hour
24_rupee          3.529412
al_harampa             NaN
falafel_gina      6.666667
humus_hakerem    10.000000
pizza_munch            NaN
dtype: float64



It turns out this data is from 2017. The new data, reflecting the ITC fellows' increasing hunger for data and food, is:

In [137]:
customer_num_2019 = pd.Series([120, 90, 62, 180], 
                             index=['humus_hakerem', 'falafel_gina',
                                   '24_rupee', 'pizza_munch'])
customer_num_2019

humus_hakerem    120
falafel_gina      90
24_rupee          62
pizza_munch      180
dtype: int64

It sure would be nice to show them side by side. I have an idea: let's use a DataFrame.

In [138]:
customer_num_df = pd.DataFrame({'2017':customer_num, '2019':customer_num_2019})
customer_num_df

Unnamed: 0,2017,2019
humus_hakerem,100,120
falafel_gina,80,90
24_rupee,60,62
pizza_munch,200,180


This is nice. Let's do the same for opening hours:

In [139]:
hours_open_2019 = pd.Series([11, 11, 9, 15],
                            index=['humus_hakerem', 'falafel_gina', 
                                   'al_harampa', '24_rupee'])
hours_open_df = pd.DataFrame({'2017':hours_open, '2019':hours_open_2019})
hours_open_df

Unnamed: 0,2017,2019
humus_hakerem,10,11
falafel_gina,12,11
al_harampa,9,9
24_rupee,17,15


But how can we represent opening hours and number of customers in the same table? Here's a trick:

In [140]:
print(customer_num_df.stack())

display(pd.DataFrame(customer_num_df.stack()))

humus_hakerem  2017    100
               2019    120
falafel_gina   2017     80
               2019     90
24_rupee       2017     60
               2019     62
pizza_munch    2017    200
               2019    180
dtype: int64


Unnamed: 0,Unnamed: 1,0
humus_hakerem,2017,100
humus_hakerem,2019,120
falafel_gina,2017,80
falafel_gina,2019,90
24_rupee,2017,60
24_rupee,2019,62
pizza_munch,2017,200
pizza_munch,2019,180


What's this now? You've just created your first multi-index. See how well it turns out

**Stacking** moves the innermost (last) column level into the index, where it becomes the innermost index level.  
- Since we had only one column level (2017, 2019), it became a second level of the row index, turning the index into a MultiIndex.  
- Since no column level was left after stacking, the result is a Series; wrapping it in `pd.DataFrame` gives its single column the default label `0`.
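The same mechanics on a tiny frame may make the move easier to follow. This is only a sketch; the frame `df` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame: rows are shops, columns are years.
df = pd.DataFrame({'2017': [1, 2], '2019': [3, 4]}, index=['a', 'b'])

stacked = df.stack()  # column labels become the innermost index level

print(stacked)
print(stacked.index.nlevels)       # 2: (row label, former column label)
print(stacked.loc[('a', '2019')])  # the value that sat at row 'a', column '2019'
```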

In [141]:
customer_num_df.stack().index

MultiIndex([('humus_hakerem', '2017'),
            ('humus_hakerem', '2019'),
            ( 'falafel_gina', '2017'),
            ( 'falafel_gina', '2019'),
            (     '24_rupee', '2017'),
            (     '24_rupee', '2019'),
            (  'pizza_munch', '2017'),
            (  'pizza_munch', '2019')],
           )

Also note that `customer_num_df` was a DataFrame, while `customer_num_df.stack()` is a Series, since the columns had only one level and none was left after stacking. With more than one column level, the result would still be a DataFrame.

In [142]:
type(customer_num_df), type(customer_num_df.stack())

(pandas.core.frame.DataFrame, pandas.core.series.Series)
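A small sketch of the contrasting case, using a hypothetical toy frame whose columns have two levels: stacking moves only the innermost level, so the result remains a DataFrame.

```python
import pandas as pd

# Hypothetical toy frame with two column levels: (measure, year).
columns = pd.MultiIndex.from_product([['hours', 'customers'], ['2017', '2019']])
df = pd.DataFrame([[10, 11, 100, 120],
                   [12, 11, 80, 90]],
                  index=['a', 'b'], columns=columns)

stacked = df.stack()  # moves only the innermost column level (the year)

print(type(stacked).__name__)   # the 'measure' level is still a column level
print(sorted(stacked.columns))  # remaining column labels
print(stacked.loc[('a', '2019'), 'customers'])
```

Depending on the pandas version, `stack()` may emit a `FutureWarning` about its new implementation; the result for this NaN-free frame should be the same either way, apart from column order.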

Let's stack and give names to the columns:

In [143]:
res_data = pd.DataFrame({'opening_hours': hours_open_df.stack(),
                         'customer_num': customer_num_df.stack()})
res_data

Unnamed: 0,Unnamed: 1,opening_hours,customer_num
24_rupee,2017,17.0,60.0
24_rupee,2019,15.0,62.0
al_harampa,2017,9.0,
al_harampa,2019,9.0,
falafel_gina,2017,12.0,80.0
falafel_gina,2019,11.0,90.0
humus_hakerem,2017,10.0,100.0
humus_hakerem,2019,11.0,120.0
pizza_munch,2017,,200.0
pizza_munch,2019,,180.0


Notice that we could have created the DataFrames with two column levels from the start, in which case we would not need to assign new column names after stacking:

In [144]:
hours_open_df

Unnamed: 0,2017,2019
humus_hakerem,10,11
falafel_gina,12,11
al_harampa,9,9
24_rupee,17,15


In [145]:
hours_open_df_multi = hours_open_df.copy()
hours_open_df_multi.columns = [['opening_hours', 'opening_hours'], ['2017', '2019']]
display(hours_open_df_multi)
print('After stacking:')
display(hours_open_df_multi.stack())

customer_num_df_multi = customer_num_df.copy()
customer_num_df_multi.columns = [['customer_num', 'customer_num'], ['2017', '2019']]
display(customer_num_df_multi)
print('After stacking:')
display(customer_num_df_multi.stack())

df_multi_col = pd.concat([hours_open_df_multi, customer_num_df_multi], axis=1)
display(df_multi_col)
print('After stacking:')
display(df_multi_col.stack())  

Unnamed: 0_level_0,opening_hours,opening_hours
Unnamed: 0_level_1,2017,2019
humus_hakerem,10,11
falafel_gina,12,11
al_harampa,9,9
24_rupee,17,15


After stacking:


Unnamed: 0,Unnamed: 1,opening_hours
humus_hakerem,2017,10
humus_hakerem,2019,11
falafel_gina,2017,12
falafel_gina,2019,11
al_harampa,2017,9
al_harampa,2019,9
24_rupee,2017,17
24_rupee,2019,15


Unnamed: 0_level_0,customer_num,customer_num
Unnamed: 0_level_1,2017,2019
humus_hakerem,100,120
falafel_gina,80,90
24_rupee,60,62
pizza_munch,200,180


After stacking:


Unnamed: 0,Unnamed: 1,customer_num
humus_hakerem,2017,100
humus_hakerem,2019,120
falafel_gina,2017,80
falafel_gina,2019,90
24_rupee,2017,60
24_rupee,2019,62
pizza_munch,2017,200
pizza_munch,2019,180


Unnamed: 0_level_0,opening_hours,opening_hours,customer_num,customer_num
Unnamed: 0_level_1,2017,2019,2017,2019
humus_hakerem,10.0,11.0,100.0,120.0
falafel_gina,12.0,11.0,80.0,90.0
al_harampa,9.0,9.0,,
24_rupee,17.0,15.0,60.0,62.0
pizza_munch,,,200.0,180.0


After stacking:


Unnamed: 0,Unnamed: 1,customer_num,opening_hours
humus_hakerem,2017,100.0,10.0
humus_hakerem,2019,120.0,11.0
falafel_gina,2017,80.0,12.0
falafel_gina,2019,90.0,11.0
al_harampa,2017,,9.0
al_harampa,2019,,9.0
24_rupee,2017,60.0,17.0
24_rupee,2019,62.0,15.0
pizza_munch,2017,200.0,
pizza_munch,2019,180.0,


Notice that the output DataFrame contains the same data as when we stacked the frames individually; only the order of rows and columns differs.

We can also reverse stacking by calling `DataFrame.unstack()`, which moves the innermost (last) index level to become the innermost column level:

In [146]:
print('Before unstacking:')
display(res_data)
print('After unstacking:')
display(res_data.unstack())

Before unstacking:


Unnamed: 0,Unnamed: 1,opening_hours,customer_num
24_rupee,2017,17.0,60.0
24_rupee,2019,15.0,62.0
al_harampa,2017,9.0,
al_harampa,2019,9.0,
falafel_gina,2017,12.0,80.0
falafel_gina,2019,11.0,90.0
humus_hakerem,2017,10.0,100.0
humus_hakerem,2019,11.0,120.0
pizza_munch,2017,,200.0
pizza_munch,2019,,180.0


After unstacking:


Unnamed: 0_level_0,opening_hours,opening_hours,customer_num,customer_num
Unnamed: 0_level_1,2017,2019,2017,2019
24_rupee,17.0,15.0,60.0,62.0
al_harampa,9.0,9.0,,
falafel_gina,12.0,11.0,80.0,90.0
humus_hakerem,10.0,11.0,100.0,120.0
pizza_munch,,,200.0,180.0
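As a sketch of the round trip on a hypothetical toy frame: `stack()` followed by `unstack()` restores the original layout, and `unstack(level=...)` can move a different index level instead of the innermost one.

```python
import pandas as pd

# Hypothetical toy frame; labels are already sorted, so ordering is not an issue.
df = pd.DataFrame({'2017': [1, 2], '2019': [3, 4]}, index=['a', 'b'])

stacked = df.stack()          # Series with a (row, year) MultiIndex
restored = stacked.unstack()  # innermost index level goes back to the columns

print(restored)
print(restored.loc['b', '2017'])  # same cell as df.loc['b', '2017']

# unstack() accepts a level, so the outer level can be moved instead.
by_row = stacked.unstack(level=0)
print(list(by_row.columns))
```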


We can do a lot of clever stuff with indices. Let's give them names to make this easier.

In [147]:
res_data.index.names = ['restaurant', 'year']
res_data

Unnamed: 0_level_0,Unnamed: 1_level_0,opening_hours,customer_num
restaurant,year,Unnamed: 2_level_1,Unnamed: 3_level_1
24_rupee,2017,17.0,60.0
24_rupee,2019,15.0,62.0
al_harampa,2017,9.0,
al_harampa,2019,9.0,
falafel_gina,2017,12.0,80.0
falafel_gina,2019,11.0,90.0
humus_hakerem,2017,10.0,100.0
humus_hakerem,2019,11.0,120.0
pizza_munch,2017,,200.0
pizza_munch,2019,,180.0


We can convert an index level to a column with `reset_index` and a column to an index level with `set_index`. Try it yourself:
1. turn `restaurant` and `year` into columns
2. in the resulting DataFrame, make `opening_hours` and `restaurant` the index

In [148]:
res_data.reset_index(inplace=True)
print(res_data.columns)


Index(['restaurant', 'year', 'opening_hours', 'customer_num'], dtype='object')


In [149]:
res_data.set_index(['opening_hours','restaurant'],inplace=True)
print(res_data.index)

MultiIndex([(17.0,      '24_rupee'),
            (15.0,      '24_rupee'),
            ( 9.0,    'al_harampa'),
            ( 9.0,    'al_harampa'),
            (12.0,  'falafel_gina'),
            (11.0,  'falafel_gina'),
            (10.0, 'humus_hakerem'),
            (11.0, 'humus_hakerem'),
            ( nan,   'pizza_munch'),
            ( nan,   'pizza_munch')],
           names=['opening_hours', 'restaurant'])
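The cells above modify `res_data` in place. A sketch of the same round trip that returns new objects instead, on a hypothetical toy frame (`df`, `flat` and `by_year` are illustrative names):

```python
import pandas as pd

# Hypothetical toy frame with a named two-level index.
index = pd.MultiIndex.from_product([['a', 'b'], ['2017', '2019']],
                                   names=['restaurant', 'year'])
df = pd.DataFrame({'customer_num': [1, 2, 3, 4]}, index=index)

flat = df.reset_index()    # index levels become ordinary columns
print(list(flat.columns))

# Any columns can become the new index; sorting keeps lookups predictable.
by_year = flat.set_index(['year', 'restaurant']).sort_index()
print(list(by_year.index.names))
print(by_year.loc[('2019', 'b'), 'customer_num'])
```

Leaving out `inplace=True` keeps `df` unchanged, which makes it easier to re-run a cell without ending up in a different state each time.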


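Aggregations can be computed per index level. Older pandas versions accepted a `level` argument (an index level name) on methods such as `mean`; that argument was deprecated in pandas 1.3 and removed in 2.0, where `groupby(level=...)` does the same job. Use `mean` on `res_data` to derive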
1. average number of customers per restaurant
2. average number of customers per year

<div class="alert alert-info">
<b>Note:</b>
<code>pandas</code> object methods are mostly <code>NaN</code> safe by default
</div>

In [150]:
# numeric_only=True skips the string-valued 'year' column, whose "mean" is meaningless
res_data.mean(skipna=True, numeric_only=True)

customer_num    111.5
dtype: float64

In [151]:
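# mean(level=...) was removed in pandas 2.0; groupby(level=...) is the equivalent
res_data.groupby(level='restaurant').mean(numeric_only=True)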

Unnamed: 0_level_0,customer_num
restaurant,Unnamed: 1_level_1
24_rupee,61.0
al_harampa,
falafel_gina,85.0
humus_hakerem,110.0
pizza_munch,190.0


In [154]:
res_data.reset_index(inplace=True)
print(res_data.columns)
res_data.set_index(['year'],inplace=True)
print(res_data.columns)
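# mean(level=...) was removed in pandas 2.0; groupby(level=...) is the equivalent
res_data.groupby(level='year').mean(numeric_only=True)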

Index(['year', 'opening_hours', 'restaurant', 'customer_num'], dtype='object')
Index(['opening_hours', 'restaurant', 'customer_num'], dtype='object')


Unnamed: 0_level_0,opening_hours,customer_num
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2017,12.0,110.0
2019,11.5,113.0
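A self-contained sketch of per-level averages on hypothetical toy data, using `groupby(level=...)`. It also shows that `NaN` values are skipped rather than propagated.

```python
import pandas as pd

# Hypothetical toy data with a named (restaurant, year) index.
index = pd.MultiIndex.from_product([['a', 'b'], ['2017', '2019']],
                                   names=['restaurant', 'year'])
df = pd.DataFrame({'customer_num': [10.0, 20.0, 30.0, float('nan')]},
                  index=index)

# Group rows by one index level, then average; NaN values are skipped.
per_restaurant = df.groupby(level='restaurant')['customer_num'].mean()
per_year = df.groupby(level='year')['customer_num'].mean()

print(per_restaurant)
print(per_year)
```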


In [155]:
res_data

Unnamed: 0_level_0,opening_hours,restaurant,customer_num
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2017,17.0,24_rupee,60.0
2019,15.0,24_rupee,62.0
2017,9.0,al_harampa,
2019,9.0,al_harampa,
2017,12.0,falafel_gina,80.0
2019,11.0,falafel_gina,90.0
2017,10.0,humus_hakerem,100.0
2019,11.0,humus_hakerem,120.0
2017,,pizza_munch,200.0
2019,,pizza_munch,180.0


Multi-indexing is powerful and allows great flexibility in handling and displaying complicated data.

It's best to think of adding a multi-index as adding a new dimension (or sub-dimension) to your data.

Slicing and indexing dataframes with multi-index is tricky and should be handled with care. 
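To give a flavour of what selection looks like, here is a minimal sketch on hypothetical toy data. It covers only three common cases: selecting by the outer level, by a full key, and by an inner level with `xs`.

```python
import pandas as pd

# Hypothetical toy data with a named (restaurant, year) index.
index = pd.MultiIndex.from_product([['a', 'b'], ['2017', '2019']],
                                   names=['restaurant', 'year'])
df = pd.DataFrame({'customer_num': [1, 2, 3, 4]}, index=index)

one_restaurant = df.loc['a']                      # all years for 'a'; index is now just 'year'
one_cell = df.loc[('b', '2017'), 'customer_num']  # a full key selects a single value
one_year = df.xs('2019', level='year')            # cross-section on an inner level

print(one_restaurant)
print(one_cell)
print(one_year)
```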

The intricacies of indexing and the multiple ways of creating multi-indexes are beyond the scope of this exercise.