___

<a href='http://www.pieriandata.com'><img src='../Pierian_Data_Logo.png'/></a>
___
<center><em>Copyright by Pierian Data Inc.</em></center>
<center><em>For more information, visit us at <a href='http://www.pieriandata.com'>www.pieriandata.com</a></em></center>

# Groupby Operations and Multi-level Index

In [None]:
import numpy as np
import pandas as pd

## Data

In [None]:
df = pd.read_csv('./data/mpg.csv')

In [None]:
df

## groupby() method

In [None]:
# Creates a groupby object waiting for an aggregate method
df.groupby('model_year')

In [None]:
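# model_year becomes the index! It is NOT a column name; it is now the name of the index.
# numeric_only=True skips any non-numeric columns (pandas 2.0+ raises a TypeError on them otherwise)
df.groupby('model_year').mean(numeric_only=True)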

In [None]:
avg_year = df.groupby('model_year').mean(numeric_only=True)

In [None]:
avg_year.index

In [None]:
avg_year.columns

In [None]:
avg_year['mpg']

In [None]:
# Select the column first so that only mpg is averaged
df.groupby('model_year')['mpg'].mean()
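The cells above can be reproduced without the CSV file. The following is a minimal sketch on a made-up `toy` DataFrame (the values and the `name` column are assumptions, not taken from mpg.csv). It shows that the groupby object does nothing until an aggregate is called, and that the grouping key ends up as the index name rather than a column.

```python
import pandas as pd

# Hypothetical toy data standing in for mpg.csv (values are made up)
toy = pd.DataFrame({
    'model_year': [70, 70, 71, 71],
    'mpg': [18.0, 15.0, 25.0, 27.0],
    'name': ['car a', 'car b', 'car c', 'car d'],
})

# The groupby object only records how rows are split; nothing is aggregated yet
grouped = toy.groupby('model_year')
print(grouped.ngroups)

# numeric_only=True skips the text column
toy_avg = grouped.mean(numeric_only=True)
print(toy_avg.index.name)      # the grouping key is now the index name
print(list(toy_avg.columns))   # 'model_year' is no longer among the columns

# Selecting the column before aggregating gives the same mpg values
mpg_by_year = toy.groupby('model_year')['mpg'].mean()
print(mpg_by_year)
```

If you need `model_year` back as a regular column, `toy_avg.reset_index()` moves it out of the index.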

In [None]:
df.groupby('model_year').describe()

In [None]:
df.groupby('model_year').describe().transpose()

## Groupby Multiple Columns
Let's explore the average mpg per year per cylinder count.

In [None]:
df.groupby(['model_year','cylinders']).mean(numeric_only=True)

In [None]:
df.groupby(['model_year','cylinders']).mean(numeric_only=True).index

# MultiIndex

## The MultiIndex Object

In [None]:
year_cyl = df.groupby(['model_year','cylinders']).mean(numeric_only=True)

In [None]:
year_cyl

In [None]:
year_cyl.index

In [None]:
year_cyl.index.levels

In [None]:
year_cyl.index.names
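It can help to see how `levels` and `names` relate to the actual rows. This is a small sketch on made-up data (the `toy` frame is an assumption, not the mpg dataset): `levels` holds the unique values of each level, while the row labels are the (year, cylinders) combinations that really occur.

```python
import pandas as pd

# Hypothetical toy data (values are made up)
toy = pd.DataFrame({
    'model_year': [70, 70, 70, 71, 71],
    'cylinders': [4, 8, 8, 4, 6],
    'mpg': [26.0, 15.0, 13.0, 28.0, 19.0],
})

toy_year_cyl = toy.groupby(['model_year', 'cylinders']).mean()
idx = toy_year_cyl.index

print(list(idx.names))        # one name per level
print([list(level) for level in idx.levels])  # unique values per level
print(idx.tolist())           # the combinations that actually exist as rows

# Labels of a single level, aligned with the rows
print(list(idx.get_level_values('cylinders')))
```

Note that not every combination of level values appears as a row: in this toy data there is no (70, 6) group.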

# Indexing with the Hierarchical Index

Full Documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html

In [None]:
year_cyl.head()

## Grab Based on Outside Index

In [None]:
year_cyl.loc[70]

In [None]:
year_cyl.loc[[70,72]]

## Grab a Single Row

In [None]:
year_cyl.loc[(70,8)]
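The difference between the two kinds of `.loc` lookup is easier to see on a small frame. This is a sketch on made-up data (the `toy` frame is an assumption): a single outer label returns a DataFrame indexed by the remaining level, while a full tuple returns one row as a Series.

```python
import pandas as pd

# Hypothetical toy data (values are made up)
toy = pd.DataFrame({
    'model_year': [70, 70, 70, 71, 71],
    'cylinders': [4, 8, 8, 4, 6],
    'mpg': [26.0, 15.0, 13.0, 28.0, 19.0],
    'weight': [2100, 3500, 3700, 2000, 2900],
})
toy_year_cyl = toy.groupby(['model_year', 'cylinders']).mean()

# Outer label only: a DataFrame whose index is the inner level (cylinders)
outer = toy_year_cyl.loc[70]
print(outer)

# Full (outer, inner) tuple: a single row, returned as a Series
single = toy_year_cyl.loc[(70, 8)]
print(single)
```

Writing `toy_year_cyl.loc[(70, 8), :]` makes it explicit that the tuple refers to the row labels and not to a (row, column) pair.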

# Grab Based on Cross-section with .xs()

This method takes a `key` argument to select data at a particular level of a MultiIndex.

Parameters:

* `key` (label or tuple of labels): Label contained in the index, or partially in a MultiIndex.
* `axis` ({0 or 'index', 1 or 'columns'}, default 0): Axis to retrieve the cross-section on.
* `level` (object, defaults to the first n levels, where n=1 or len(key)): In case of a key partially contained in a MultiIndex, indicates which levels are used. Levels can be referred to by label or position.

In [None]:
year_cyl.xs(key=70,axis=0,level='model_year')

In [None]:
# Mean column values for 4 cylinders per year
year_cyl.xs(key=4,axis=0,level='cylinders')
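As a self-contained illustration of `.xs()` on an inner level, here is a sketch on made-up data (the `toy` frame is an assumption). By default the selected level is dropped from the result; passing `drop_level=False` keeps it.

```python
import pandas as pd

# Hypothetical toy data (values are made up)
toy = pd.DataFrame({
    'model_year': [70, 70, 70, 71, 71],
    'cylinders': [4, 8, 8, 4, 6],
    'mpg': [26.0, 15.0, 13.0, 28.0, 19.0],
})
toy_year_cyl = toy.groupby(['model_year', 'cylinders']).mean()

# All 4-cylinder rows; the cylinders level is dropped, leaving model_year
four_cyl = toy_year_cyl.xs(key=4, axis=0, level='cylinders')
print(four_cyl)

# Same selection, but keeping both index levels
four_cyl_kept = toy_year_cyl.xs(key=4, axis=0, level='cylinders', drop_level=False)
print(four_cyl_kept)
```

This is something plain `.loc[4]` cannot do here, because `.loc` with a single label looks in the outer level (`model_year`).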

### Careful note!

Keep in mind, it's usually much easier to filter out values **before** running a groupby() call, so you should try to remove any values/categories you don't want to use first. For example, it's much easier to remove **4** cylinder cars before the groupby() call than to do this sort of thing after a groupby. The cell below keeps only the 6 and 8 cylinder cars.

In [None]:
df[df['cylinders'].isin([6,8])].groupby(['model_year','cylinders']).mean(numeric_only=True)
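To compare the two orders of operation, here is a sketch on made-up data (the `toy` frame is an assumption). Filtering afterwards is possible, but it means reaching into the index level, which is why filtering the rows first usually reads more simply.

```python
import pandas as pd

# Hypothetical toy data (values are made up)
toy = pd.DataFrame({
    'model_year': [70, 70, 70, 71, 71],
    'cylinders': [4, 8, 8, 4, 6],
    'mpg': [26.0, 15.0, 13.0, 28.0, 19.0],
})

# Filter first, then group: an ordinary boolean mask on a column
before = toy[toy['cylinders'].isin([6, 8])].groupby(['model_year', 'cylinders']).mean()

# Group first, then filter: the mask has to be built from an index level
grouped_all = toy.groupby(['model_year', 'cylinders']).mean()
mask = grouped_all.index.get_level_values('cylinders').isin([6, 8])
after = grouped_all[mask]

print(before)
print(after)
```

Both approaches give the same rows here; filtering first also avoids aggregating groups that are thrown away immediately.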

## Swap Levels

* Swapping Levels: https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#swapping-levels-with-swaplevel
* Generalized Method is reorder_levels: https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#reordering-levels-with-reorder-levels

In [None]:
year_cyl.swaplevel().head()
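A short sketch on made-up data (the `toy` frame is an assumption) shows what `swaplevel()` does and does not do: it swaps the order of the levels but leaves the rows in their original order, so a `sort_index()` call is often useful afterwards. `reorder_levels` gives the same result when you name the full order.

```python
import pandas as pd

# Hypothetical toy data (values are made up)
toy = pd.DataFrame({
    'model_year': [70, 70, 70, 71, 71],
    'cylinders': [4, 8, 8, 4, 6],
    'mpg': [26.0, 15.0, 13.0, 28.0, 19.0],
})
toy_year_cyl = toy.groupby(['model_year', 'cylinders']).mean()

# Swap the two levels; row order is unchanged
swapped = toy_year_cyl.swaplevel()
print(swapped.index.tolist())

# Sort so that rows are ordered by the new outer level (cylinders)
swapped_sorted = swapped.sort_index()
print(swapped_sorted.index.tolist())

# reorder_levels takes the complete new order of level names
reordered = toy_year_cyl.reorder_levels(['cylinders', 'model_year'])
```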

## Sorting MultiIndex

* https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#sorting-a-multiindex 

In [None]:
year_cyl.sort_index(level='model_year',ascending=False)

In [None]:
year_cyl.sort_index(level='cylinders',ascending=False)

# Advanced: agg() method

The agg() method lets you choose which aggregate functions to apply, including different functions for different columns.

In [None]:
df

## agg() on a DataFrame

In [None]:
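# These strings need to match up with built-in method names
# select_dtypes keeps only numeric columns, since median/mean are not defined for text
df.select_dtypes('number').agg(['median','mean'])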

In [None]:
# Select the columns first so only mpg and weight are aggregated
df[['mpg','weight']].agg(['sum','mean'])

### Specify aggregate methods per column

**agg()** is very powerful, allowing you to pass in a dictionary where the keys are the column names and the values are lists of aggregate methods.

In [None]:
df.agg({'mpg':['median','mean'],'weight':['mean','std']})
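One detail of the dictionary form is worth seeing on small data. In this sketch (the `toy` frame is made up, not the mpg dataset), the result has one row per requested function, and a cell is NaN wherever that function was not requested for that column.

```python
import pandas as pd

# Hypothetical toy data (values are made up)
toy = pd.DataFrame({
    'mpg': [18.0, 15.0, 27.0],
    'weight': [3500, 3700, 2100],
})

summary = toy.agg({'mpg': ['median', 'mean'], 'weight': ['mean', 'std']})
print(summary)

# 'std' was only requested for weight, so the mpg cell is missing
print(summary.loc['std', 'mpg'])
```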

## agg() with groupby()

In [None]:
df.groupby('model_year').agg({'mpg':['median','mean'],'weight':['mean','std']})
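Combining `groupby()` with a dictionary passed to `agg()` produces MultiIndex columns of (column, function) pairs. The sketch below uses made-up data (the `toy` frame and the output names `mpg_mean` and `weight_std` are assumptions) to show how to select one of those columns, and how named aggregation gives flat column names instead.

```python
import pandas as pd

# Hypothetical toy data (values are made up)
toy = pd.DataFrame({
    'model_year': [70, 70, 71, 71],
    'mpg': [18.0, 15.0, 25.0, 27.0],
    'weight': [3500, 3700, 2200, 2100],
})

by_year = toy.groupby('model_year').agg({'mpg': ['median', 'mean'],
                                         'weight': ['mean', 'std']})

# Columns are (column, function) tuples
print(by_year.columns.tolist())

# Select a single aggregated column with a tuple
mean_mpg = by_year[('mpg', 'mean')]
print(mean_mpg)

# Named aggregation: output_name=(column, function) gives flat column names
flat = toy.groupby('model_year').agg(mpg_mean=('mpg', 'mean'),
                                     weight_std=('weight', 'std'))
print(flat)
```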