# Split-Apply-Combine

### Objectives
After this lesson you should be able to...
+ Define split, apply, combine and why it is useful for data analysis
+ Know the definition of an aggregation
+ Group on a single and multiple keys
+ Know several different syntaxes for performing an aggregation
+ Perform a group by aggregation on a subset of the DataFrame columns
+ Perform a group by with multiple aggregation columns
+ Perform a group by with different aggregations for different columns
### Prepare for this lesson by
+ Read the introduction to Hadley Wickham's paper on [split-apply-combine](http://vita.had.co.nz/papers/plyr.pdf) and skim the rest, as it is heavy on R.
+ Read the pandas [split apply combine documentation](http://pandas.pydata.org/pandas-docs/stable/groupby.html) stopping at 'transformation'.

### Introduction
Split-apply-combine is a pattern for investigating data that has been around for a long time. The term itself was coined only recently by Hadley Wickham, perhaps the most famous data scientist ever to have lived in Houston. Wickham has developed many tools for the R data frame that are similar to those Wes McKinney built for the pandas DataFrame.

* **Split** - Instead of calculating a statistic on the entirety of the data, the data is split into groups, with each member assigned to a group by some criterion
* **Apply** - Apply a function to each group independently
* **Combine** - Combine the results of the function applied to each group back together to form a single dataset again

![SAC](./images/sac.png)

In [210]:
import pandas as pd
import numpy as np

### A manual split, apply, combine
Let's use for loops to split, then apply, then combine.

In [211]:
# let's recreate the data from the image above
df = pd.DataFrame({'key': list('abc') * 2, 'data': range(1,7)}, columns=['key', 'data'])
df

Unnamed: 0,key,data
0,a,1
1,b,2
2,c,3
3,a,4
4,b,5
5,c,6


### Split
The data will be split into a dictionary that maps each key to all of its data from the DataFrame. A **`defaultdict`** from the very useful [collections module](https://docs.python.org/3/library/collections.html) will store the data. A **`defaultdict`** differs from a normal **`dict`** in that looking up a missing key creates and returns a default object instead of raising a `KeyError`.
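As a minimal sketch of that difference (the names `plain` and `auto` are made up for this illustration), compare what happens when a missing key is used:

```python
from collections import defaultdict

plain = {}
try:
    plain['a'].append(1)      # a missing key raises KeyError in a normal dict
except KeyError:
    plain['a'] = [1]          # so the key has to be created by hand

auto = defaultdict(list)      # list() is called to build each default value
auto['a'].append(1)           # missing key: an empty list is created, then appended to
auto['a'].append(4)

print(plain)
print(dict(auto))
```

The `defaultdict` version needs no special handling for the first value of each key, which is why it suits the grouping loop in this section.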

Looping through a DataFrame is seldom done and is usually a sign that you are doing something wrong, but it is necessary for this demonstration. There are a few ways to loop through a DataFrame, one of which is the **`itertuples`** method. It yields one [namedtuple](https://docs.python.org/3/library/collections.html#namedtuple-factory-function-for-tuples-with-named-fields) per row. A namedtuple, another object from the collections module, is just like a normal tuple except that each element also has a reference name. The reference names in this instance happen to be the column names.
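A short sketch of what **`itertuples`** yields, using an illustrative frame named `demo` that mirrors the small DataFrame in this lesson:

```python
import pandas as pd

# illustrative copy of the small DataFrame used in this lesson
demo = pd.DataFrame({'key': list('abc') * 2, 'data': range(1, 7)})

# itertuples yields one namedtuple per row; the index comes first as `Index`
first = next(demo.itertuples())
print(first.Index, first.key, first.data)

# index=False leaves the index out of each namedtuple
pairs = [(row.key, row.data) for row in demo.itertuples(index=False)]
print(pairs[:3])
```

Accessing `row.key` and `row.data` by name is what makes the loop in the next cell readable.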

In [212]:
from collections import defaultdict

In [213]:
# create a defaultdict with an empty list as its default
keys_dict = defaultdict(list)

# loop through DataFrame and use namedtuple to store values in a defaultdict
for tup in df.itertuples():
    keys_dict[tup.key].append(tup.data)
    
# output the defaultdict
keys_dict

defaultdict(list, {'a': [1, 4], 'b': [2, 5], 'c': [3, 6]})

In [214]:
# Turn the default dict into a DataFrame
split_data = pd.DataFrame(keys_dict)

split_data

Unnamed: 0,a,b,c
0,1,2,3
1,4,5,6


### Apply
Sum each of the columns.

In [215]:
sum_dict = split_data.sum()

sum_dict

a    5
b    7
c    9
dtype: int64

### Combine
The data is already combined, but we can turn it into a DataFrame by creating a new column out of the index. The **`reset_index`** method will accomplish this. Then the columns will be renamed and the task is complete.

In [770]:
df_combined = pd.Series(sum_dict).reset_index()
df_combined.columns = ['Key', 'Data']
df_combined

Unnamed: 0,Key,Data
0,a,5
1,b,7
2,c,9


### That was about 100 lines too many
Let's do this the pandas way in one line.
* Use the **`groupby`** method to split your data
* Use the **`agg`** method on the resulting `groupby` object to apply an aggregating function to the groups.
* pandas combines the results back together for you

In [97]:
# pass in the column you are grouping by and the function you would like to apply
df.groupby('key').agg(np.sum)

Unnamed: 0_level_0,data
key,Unnamed: 1_level_1
a,5
b,7
c,9


### Alternative ways to get same result
Like many operations in pandas there are many ways to achieve the same result. Some of the aggregation functions have string aliases that can be passed to the **`agg`** method.

In [203]:
# most functions have string aliases
df.groupby('key').agg('sum') 

Unnamed: 0_level_0,data
key,Unnamed: 1_level_1
a,5
b,7
c,9


In [99]:
# can use the built-in Python functions
df.groupby('key').agg(sum) 

Unnamed: 0_level_0,data
key,Unnamed: 1_level_1
a,5
b,7
c,9


It's possible and actually preferable to bypass the **`agg`** method altogether and directly call the aggregate function.

In [190]:
# you can actually bypass the agg method altogether for quite a few aggregate functions
df.groupby('key').sum()

Unnamed: 0_level_0,data
key,Unnamed: 1_level_1
a,5
b,7
c,9


[Check the docs as usual](http://pandas.pydata.org/pandas-docs/stable/groupby.html) for all the groupby powers, that is, all the methods you can call with **`df.groupby('key').<insert method>`**.

### Aliases
**`agg`** is an alias for **`aggregate`**. An alias is simply another name for a Python construct that does exactly the same thing. Aliases are commonly created when importing Python modules and referencing them with another name. For instance, **`import numpy as np`** allows access to all the **`numpy`** names through the alias **`np`**. The alias adds no functionality; it just shortens the code needed to reach the underlying functionality.

In [100]:
# if you like being verbose you can use aggregate - aliased as agg
df.groupby('key').aggregate(np.sum)

Unnamed: 0_level_0,data
key,Unnamed: 1_level_1
a,5
b,7
c,9


### Removing the key from the index
The default behavior of a groupby operation is to put the key in the index of the resulting object. This may be undesirable, so you can set the **`as_index`** argument to **False**. It defaults to **True**.

In [101]:
# if you want to move the index out use the as_index = False parameter 
df.groupby('key', as_index=False).aggregate(np.sum)

Unnamed: 0,key,data
0,a,5
1,b,7
2,c,9


Alternatively, you can chain the **`reset_index`** method to the result to move the key out of the index and back into a column.

In [102]:
df.groupby('key').aggregate('sum').reset_index()

Unnamed: 0,key,data
0,a,5
1,b,7
2,c,9


### Definition of Aggregation
The term aggregation has been tossed around quite liberally in this notebook, and it is important that its definition is clear. An [**aggregation**](https://en.wikipedia.org/wiki/Aggregate_function) produces a single result for each group: one summary statistic per group is returned. Examples include max, min, sum, average, variance, count, and count of missing values. Not every function that can be used with **`.agg`** produces a single result, but almost all do.
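To make the definition concrete, here is a small sketch contrasting an aggregating method with a non-aggregating one. The frame `demo` is an illustrative copy of the small DataFrame from earlier, and `cumsum` is chosen only as one example of a groupby method that is not an aggregation.

```python
import pandas as pd

# illustrative copy of the small DataFrame used in this lesson
demo = pd.DataFrame({'key': list('abc') * 2, 'data': range(1, 7)})
by_key = demo.groupby('key')['data']

totals = by_key.sum()       # aggregation: one value per group
running = by_key.cumsum()   # not an aggregation: one value per original row

print(len(demo), len(totals), len(running))
```

The aggregated result has one entry per group, while the cumulative sum keeps the length and index of the original data.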

### The groupby object
Occasionally, you will want to inspect the groups by hand. All the examples above have implicitly used the **`groupby`** object by chaining a method directly after creating it. Here we will save and inspect it.

In [162]:
# save groupby object
grouped = df.groupby('key')

In [163]:
# verify it is a groupby object
type(grouped)

pandas.core.groupby.DataFrameGroupBy

The groupby object is an iterable (can be placed directly in a for loop) and all groups can be iterated through one by one.

In [167]:
# grouped is actually an iterable and can be iterated over
# it's rare to ever do this but occasionally you will need to inspect by hand
for name, group in grouped:
    print(name, group, sep="\n", end="\n\n")

a
  key  data
0   a     1
3   a     4

b
  key  data
1   b     2
4   b     5

c
  key  data
2   c     3
5   c     6
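When only one group is of interest, looping is not needed. A minimal sketch using the `get_group` method, with `demo` and `demo_grouped` as illustrative stand-ins for `df` and `grouped`:

```python
import pandas as pd

# illustrative copy of the small DataFrame used in this lesson
demo = pd.DataFrame({'key': list('abc') * 2, 'data': range(1, 7)})
demo_grouped = demo.groupby('key')

# pull out the rows belonging to a single group by its name
group_b = demo_grouped.get_group('b')
print(group_b)
```

The result is an ordinary DataFrame that keeps the original row labels of the selected group.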



In [169]:
# print all the public methods
print([g for g in dir(grouped) if g[0] != '_'])

['agg', 'aggregate', 'all', 'any', 'apply', 'backfill', 'bfill', 'boxplot', 'corr', 'corrwith', 'count', 'cov', 'cumcount', 'cummax', 'cummin', 'cumprod', 'cumsum', 'data', 'describe', 'diff', 'dtypes', 'expanding', 'ffill', 'fillna', 'filter', 'first', 'get_group', 'groups', 'head', 'hist', 'idxmax', 'idxmin', 'indices', 'irow', 'key', 'last', 'mad', 'max', 'mean', 'median', 'min', 'name', 'ndim', 'ngroups', 'nth', 'ohlc', 'pad', 'pct_change', 'plot', 'prod', 'quantile', 'rank', 'resample', 'rolling', 'sem', 'shift', 'size', 'skew', 'std', 'sum', 'tail', 'take', 'transform', 'tshift', 'var']


### Exploring the methods and attributes of a groupby object
Most of the methods and attributes are straightforward.

In [175]:
# get the number of groups
grouped.ngroups

3

In [184]:
# see all the groups and their indices
grouped.groups

{'a': Int64Index([0, 3], dtype='int64'),
 'b': Int64Index([1, 4], dtype='int64'),
 'c': Int64Index([2, 5], dtype='int64')}

In [188]:
# get first value of every group - an aggregation
grouped.first()

Unnamed: 0_level_0,data
key,Unnamed: 1_level_1
a,1
b,2
c,3


In [189]:
# count the non-missing values in each group
grouped.count()

Unnamed: 0_level_0,data
key,Unnamed: 1_level_1
a,2
b,2
c,2


In [191]:
# get the index of the max value for each group
grouped.idxmax()

Unnamed: 0_level_0,data
key,Unnamed: 1_level_1
a,3
b,4
c,5


In [195]:
# shift each element in each group up 1 (each row gets the next value in its group)
grouped.shift(-1)

Unnamed: 0,data
0,4.0
1,5.0
2,6.0
3,
4,
5,


In [196]:
# one of the methods that returns many aggregations all at once for each group
grouped.describe()

Unnamed: 0_level_0,Unnamed: 1_level_0,data
key,Unnamed: 1_level_1,Unnamed: 2_level_1
a,count,2.0
a,mean,2.5
a,std,2.12132
a,min,1.0
a,25%,1.75
a,50%,2.5
a,75%,3.25
a,max,4.0
b,count,2.0
b,mean,3.5


In [197]:
# gets the total number of elements in each group, whether missing or not
df.groupby('key').size()

key
a    2
b    2
c    2
dtype: int64
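The small DataFrame has no missing values, so `count` and `size` agree above. A sketch with one made-up missing value (the frame `with_gap` is hypothetical) shows where they differ:

```python
import numpy as np
import pandas as pd

# hypothetical data with one missing value in group 'a'
with_gap = pd.DataFrame({'key': ['a', 'a', 'b', 'b'],
                         'data': [1.0, np.nan, 3.0, 4.0]})
gap_grouped = with_gap.groupby('key')

counts = gap_grouped['data'].count()   # non-missing values per group
sizes = gap_grouped.size()             # rows per group, missing or not

print(counts)
print(sizes)
```

Group `a` has two rows but only one non-missing value, so the two results differ for that group only.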

### More than one aggregate function
You can apply more than one aggregate function to each group. Pass a list of the functions that you would like to apply to each group to the **`agg`** method.

In [204]:
# let's keep using our groupby object
grouped.agg([np.max, np.median, np.size])

Unnamed: 0_level_0,data,data,data
Unnamed: 0_level_1,amax,median,size
key,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
a,4,2.5,2
b,5,3.5,2
c,6,4.5,2


In [205]:
# can use strings of the same name
grouped.agg(['max', 'median', 'size'])

Unnamed: 0_level_0,data,data,data
Unnamed: 0_level_1,max,median,size
key,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
a,4,2.5,2
b,5,3.5,2
c,6,4.5,2


### Renaming Columns
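The **`agg`** groupby method can accept a dictionary of dictionaries that maps each aggregated column to new column names and their aggregation functions. This nested form was deprecated in pandas 0.20 and removed in pandas 1.0, where it raises a `SpecificationError`, so the cells that use it run only on older versions. The format works as such. 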

**`{aggregated_column : {'new column name': agg function, 'new column name2' : agg function 2}} `**

In [214]:
# rename columns with dictionary of dictionary
grouped.agg({'data' : {'MAX': 'max', 'MEDIAN!': np.median, 'CT':'size'}})

Unnamed: 0_level_0,data,data,data
Unnamed: 0_level_1,CT,MAX,MEDIAN!
key,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
a,2,4,2.5
b,2,5,3.5
c,2,6,4.5
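On pandas 0.25 and later, the same renaming can be sketched with named aggregation, where each keyword argument is the new column name and its value is a `(column, function)` pair. The frame `demo` is an illustrative copy of the small DataFrame used above, and `MEDIAN` stands in for `MEDIAN!` only because keyword names must be valid Python identifiers.

```python
import pandas as pd

# illustrative copy of the small DataFrame used in this lesson
demo = pd.DataFrame({'key': list('abc') * 2, 'data': range(1, 7)})

# new_column_name=(column_to_aggregate, aggregation_function)
renamed = demo.groupby('key').agg(
    MAX=('data', 'max'),
    MEDIAN=('data', 'median'),
    CT=('data', 'size'),
)
print(renamed)
```

The columns come back in the order the keywords were written, and the result has a single column level.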


### An index with multiple levels - a hierarchical index

The code cell above produced a DataFrame with a structure that has not been covered before: its column index has two levels. This is called a **hierarchical index** and allows for selection of data at different levels. Hierarchical indexes are represented by pandas **`MultiIndex`** objects.

In [218]:
# store above DataFrame into a variable
df1 = grouped.agg({'data' : {'MAX': 'max', 'MEDIAN!': np.median, 'CT':'size'}})

# a multiindex is produced
df1.columns

MultiIndex(levels=[['data'], ['CT', 'MAX', 'MEDIAN!']],
           labels=[[0, 0, 0], [0, 1, 2]])
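As a brief sketch of selecting data at different levels, the example below builds a two-level column index with a list of functions (which works across pandas versions) rather than with the nested dictionary. The names `demo` and `two_level` are illustrative.

```python
import pandas as pd

# illustrative copy of the small DataFrame used in this lesson
demo = pd.DataFrame({'key': list('abc') * 2, 'data': range(1, 7)})

# a list of functions produces two column levels: ('data', 'max'), ('data', 'median')
two_level = demo.groupby('key').agg(['max', 'median'])
print(two_level.columns.nlevels)

# selecting on the top level returns a DataFrame of the inner columns
inner = two_level['data']

# selecting with a tuple, one label per level, returns a single Series
single = two_level[('data', 'max')]
print(single)
```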

In [215]:
# alternate syntax
# specify in brackets the column that is being aggregated first
# now there is no need for a nested dictionary; a single dictionary will suffice
# (like the nested form, this renaming dictionary only runs on pandas < 1.0)

# no hierarchical index is produced
grouped['data'].agg({'MAX': 'max', 'MEDIAN!': np.median, 'CT':'size'})

Unnamed: 0_level_0,CT,MAX,MEDIAN!
key,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
a,2,4,2.5
b,2,5,3.5
c,2,6,4.5
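For a single selected column, a comparable sketch on pandas 0.25 and later uses keyword arguments whose values are just the function names. The frame `demo` and the new column names are illustrative.

```python
import pandas as pd

# illustrative copy of the small DataFrame used in this lesson
demo = pd.DataFrame({'key': list('abc') * 2, 'data': range(1, 7)})

# on a single selected column each keyword is new_name=function
renamed_series = demo.groupby('key')['data'].agg(MAX='max', MEDIAN='median', CT='size')
print(renamed_series)
```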


### Data with more than 2 columns
The simple DataFrame used thus far has only two columns, one being the grouping column and the other being the aggregated column. Real datasets usually have many more columns. The **`college`** dataset will now be used.

In [4]:
college = pd.read_csv('data/college.csv')

In [5]:
# make all columns visible
pd.options.display.max_columns = 40

In [6]:
# inspect the top 5 rows
college.head()

Unnamed: 0,INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
0,Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
1,University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
2,Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
3,University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
4,Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,4811.0,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5


### Groupby a single variable again
The college DataFrame has many more columns than the simple one above. The basic groupby works nearly the same for frames with more than 2 columns, but for most operations you will want to specify the column being aggregated.

In [7]:
# Find the states that have the most colleges 
# groupby STABBR and then aggregate using the size method
# no aggregation column specified

college.groupby('STABBR').size().head(10)

STABBR
AK     10
AL     96
AR     86
AS      1
AZ    133
CA    773
CO    125
CT    102
DC     26
DE     19
dtype: int64

The **`size`** groupby method is one of the few where the aggregation column does not need to be specified. In fact, here it returns the same counts as calling **`value_counts`** on the grouping column; only the ordering of the result differs.

In [8]:
# value_counts does the same thing but also sorts
college['STABBR'].value_counts().head(10)

CA    773
TX    472
NY    459
FL    436
PA    394
OH    352
IL    300
MI    207
NC    204
MA    194
Name: STABBR, dtype: int64
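The equivalence can be checked on a tiny made-up frame (the `state` values below are hypothetical, not taken from the college data):

```python
import pandas as pd

# hypothetical grouping column
demo_states = pd.DataFrame({'state': ['TX', 'CA', 'TX', 'NY', 'CA', 'TX']})

by_size = demo_states.groupby('state').size()       # ordered by group label
by_counts = demo_states['state'].value_counts()     # ordered by count, largest first

# after sorting by label the two results hold the same numbers
print(by_counts.sort_index().tolist() == by_size.tolist())
```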

### Issues with not specifying the aggregating columns
When doing a groupby and then an aggregation, two different sets of columns are being used - the columns being grouped and the columns being aggregated. The grouping columns are necessary. The aggregation columns on the other hand are not. If no aggregation columns are given, then the aggregation method will be applied to all of the non-grouped columns.

Most code thus far has not specified the aggregation column, but since the simple DataFrame only had one non-grouped column, the aggregation was applied to just that column.

In [9]:
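# The aggregation will be applied to all non-grouped columns
# No aggregation column is specified
# All non-grouped numeric columns get aggregated - the mean is found in this case
# numeric_only=True is needed in pandas 2.0+, where text columns otherwise raise an error

college.groupby('STABBR').mean(numeric_only=True).head(10)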

Unnamed: 0_level_0,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV
STABBR,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
AK,0.0,0.0,0.0,0.3,555.0,503.0,0.0,2493.2,0.44107,0.02832,0.07401,0.04605,0.24254,0.01601,0.04828,0.00523,0.09847,0.32555,1.0,0.39453,0.38166,0.50624
AL,0.166667,0.0,0.011111,0.25,508.47619,504.285714,0.011111,2789.865169,0.477513,0.422081,0.023597,0.009043,0.006187,0.001679,0.012596,0.008462,0.038836,0.247658,0.9375,0.603621,0.509734,0.387039
AR,0.04878,0.0,0.0,0.209302,491.875,515.9375,0.0,1644.146341,0.593701,0.291888,0.052059,0.009322,0.009121,0.001143,0.018289,0.010313,0.014157,0.190789,0.965116,0.58147,0.505556,0.356059
AS,0.0,0.0,0.0,0.0,,,0.0,1276.0,0.0016,0.0,0.0,0.0047,0.0,0.9193,0.0,0.0721,0.0024,0.4389,1.0,0.7245,0.0,0.1774
AZ,0.0,0.0,0.0,0.067669,538.333333,536.666667,0.030075,4130.468254,0.439933,0.079717,0.270277,0.019196,0.062446,0.005358,0.029239,0.006382,0.087448,0.259733,0.879699,0.549792,0.543702,0.480859
CA,0.001429,0.001429,0.004286,0.21216,549.083333,562.902778,0.002857,3518.308397,0.285552,0.093199,0.377686,0.109572,0.006368,0.008625,0.032323,0.027345,0.057802,0.249757,0.860285,0.518089,0.446283,0.429779
CO,0.0,0.0,0.0,0.056,537.714286,540.214286,0.040323,2324.880342,0.549994,0.072085,0.181465,0.027321,0.012725,0.004146,0.039187,0.017256,0.078726,0.28915,0.904,0.479792,0.547177,0.505041
CT,0.0,0.0,0.01087,0.166667,517.857143,522.5,0.01087,1873.550562,0.524426,0.158501,0.179922,0.028096,0.002631,0.000835,0.024175,0.016624,0.053537,0.281913,0.882353,0.496229,0.49074,0.419941
DC,0.083333,0.0,0.041667,0.346154,589.166667,588.333333,0.0,2645.277778,0.256667,0.518117,0.068989,0.027639,0.002283,0.001694,0.033928,0.030394,0.060306,0.247194,0.961538,0.467861,0.573283,0.420782
DE,0.052632,0.0,0.0,0.157895,486.666667,495.0,0.0,2491.052632,0.522147,0.311711,0.066089,0.019105,0.003611,0.001026,0.037542,0.016953,0.021805,0.331932,1.0,0.484211,0.551926,0.422658


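Since no aggregation columns were specified, the mean of every numeric column was returned. The non-numeric columns were silently dropped by the pandas version used to produce this output; pandas 2.0 and later raise an error instead unless **`numeric_only=True`** is passed.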
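A small sketch of this on a made-up frame (the column names `state`, `name`, and `enrollment` are hypothetical) that runs the same way on older and newer pandas versions:

```python
import pandas as pd

# hypothetical frame with one text column and one numeric column
mixed = pd.DataFrame({'state': ['TX', 'TX', 'CA'],
                      'name': ['A', 'B', 'C'],
                      'enrollment': [10.0, 30.0, 50.0]})

# numeric_only=True restricts the aggregation to the numeric columns
means = mixed.groupby('state').mean(numeric_only=True)
print(means)
```

Only `enrollment` appears in the result; the text column `name` is left out explicitly rather than silently.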

### Easy syntax to specify aggregation column
Aggregation columns are selected with the brackets after the groupby method. A list of columns can be used to aggregate multiple columns.

In [10]:
# Aggregate just a single column
# Same syntax as selecting columns from a DataFrame

college.groupby('STABBR')['HBCU'].mean().head(10)

STABBR
AK    0.000000
AL    0.166667
AR    0.048780
AS    0.000000
AZ    0.000000
CA    0.001429
CO    0.000000
CT    0.000000
DC    0.083333
DE    0.052632
Name: HBCU, dtype: float64

In [11]:
# select more than 1 column to aggregate
# pass a list of column names, which means double brackets

college.groupby('STABBR')[['HBCU', 'UGDS']].mean().head(10)

Unnamed: 0_level_0,HBCU,UGDS
STABBR,Unnamed: 1_level_1,Unnamed: 2_level_1
AK,0.0,2493.2
AL,0.166667,2789.865169
AR,0.04878,1644.146341
AS,0.0,1276.0
AZ,0.0,4130.468254
CA,0.001429,3518.308397
CO,0.0,2324.880342
CT,0.0,1873.550562
DC,0.083333,2645.277778
DE,0.052632,2491.052632


### Flexible use of dictionaries on aggregation

A **single dictionary** can be used to perform different aggregations on different columns.

A **dictionary of dictionaries** can be used to do the same thing but gives you the ability to rename the columns.

In [35]:
# can also use a single dictionary to perform different aggregations on different columns
college.groupby('STABBR').agg({'HBCU':'mean', 'UGDS':'max'}).head(10)

Unnamed: 0_level_0,UGDS,HBCU
STABBR,Unnamed: 1_level_1,Unnamed: 2_level_1
AK,12865.0,0.0
AL,29851.0,0.166667
AR,21405.0,0.04878
AS,1276.0,0.0
AZ,151558.0,0.0
CA,44744.0,0.001429
CO,25873.0,0.0
CT,18016.0,0.0
DC,10433.0,0.083333
DE,18222.0,0.052632


### Use lists of functions to perform different aggregations on the same column

Many times, different aggregations will need to be performed on the same column. Use a list of functions in the dictionary to accomplish this.

In [45]:
# to make it easier to read use multiple lines for each column

college.groupby('STABBR').agg({'HBCU':'mean', 
                               'UGDS':['min', 'mean', 'max'],
                               'UG25ABV':['median', 'std']}).head()

Unnamed: 0_level_0,UGDS,UGDS,UGDS,HBCU,UG25ABV,UG25ABV
Unnamed: 0_level_1,min,mean,max,mean,median,std
STABBR,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
AK,27.0,2493.2,12865.0,0.0,0.5191,0.134334
AL,12.0,2789.865169,29851.0,0.166667,0.36795,0.223052
AR,18.0,1644.146341,21405.0,0.04878,0.358,0.184185
AS,1276.0,1276.0,1276.0,0.0,0.1774,
AZ,1.0,4130.468254,151558.0,0.0,0.4684,0.204215


### Use a dictionary of dictionaries to rename aggregated columns

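The syntax for this was covered above, but here it is again. Look at [this stackoverflow answer](http://stackoverflow.com/a/40962126/3707607) as well. Keep in mind that this nested-dictionary form only runs on pandas versions before 1.0; later versions use named aggregation instead, written as keyword arguments of the form `new_name=('column', 'function')`.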

**`{aggregated_column : {'new column name': agg function, 'new column name2' : agg function 2},
   aggregated_column2 : {'new column name3': agg function, 'new column name4' : agg function 3}} `**

In [251]:
# to make it easier to read use multiple lines for each column

college_1 = college.groupby('STABBR').agg({'HBCU': {'HBCU_mean':'mean'}, 
                               'UGDS':{'UGDS_min':'min', 'UGDS_mean':'mean', 'UGDS_max':'max'},
                               'UG25ABV':{'UG25ABV_median':'median', 'UG25ABV_std': 'std'}})

college_1.head(10)

Unnamed: 0_level_0,UGDS,UGDS,UGDS,HBCU,UG25ABV,UG25ABV
Unnamed: 0_level_1,UGDS_mean,UGDS_max,UGDS_min,HBCU_mean,UG25ABV_median,UG25ABV_std
STABBR,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
AK,2493.2,12865.0,27.0,0.0,0.5191,0.134334
AL,2789.865169,29851.0,12.0,0.166667,0.36795,0.223052
AR,1644.146341,21405.0,18.0,0.04878,0.358,0.184185
AS,1276.0,1276.0,1276.0,0.0,0.1774,
AZ,4130.468254,151558.0,1.0,0.0,0.4684,0.204215
CA,3518.308397,44744.0,0.0,0.001429,0.4043,0.206506
CO,2324.880342,25873.0,0.0,0.0,0.48555,0.229806
CT,1873.550562,18016.0,0.0,0.0,0.4249,0.245002
DC,2645.277778,10433.0,22.0,0.083333,0.3918,0.309858
DE,2491.052632,18222.0,42.0,0.052632,0.3875,0.246804


### Removing level of column index

Now that the columns have been renamed, the top **level** of the DataFrame columns above is not needed. The **`droplevel`** index method will remove one of the levels. Each level is 0-indexed starting from the top. Pass the level number to **`droplevel`** in order to drop it.

In [252]:
# drop a level and store result as new columns
college_1.columns = college_1.columns.droplevel(0)

# unnecessary column level removed
college_1.head()

Unnamed: 0_level_0,UGDS_mean,UGDS_max,UGDS_min,HBCU_mean,UG25ABV_median,UG25ABV_std
STABBR,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
AK,2493.2,12865.0,27.0,0.0,0.5191,0.134334
AL,2789.865169,29851.0,12.0,0.166667,0.36795,0.223052
AR,1644.146341,21405.0,18.0,0.04878,0.358,0.184185
AS,1276.0,1276.0,1276.0,0.0,0.1774,
AZ,4130.468254,151558.0,1.0,0.0,0.4684,0.204215
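When the column names have not been renamed beforehand, an alternative to dropping a level is to join the two levels into a single name. The sketch below uses a made-up frame `toy` that borrows the column names `STABBR`, `HBCU`, and `UGDS` but not the real values.

```python
import pandas as pd

# hypothetical values; only the column names mirror the college data
toy = pd.DataFrame({'STABBR': ['AK', 'AK', 'AL', 'AL'],
                    'HBCU': [0.0, 0.0, 1.0, 0.0],
                    'UGDS': [100.0, 300.0, 50.0, 150.0]})

# a dictionary of lists produces two column levels: (column, function)
stats = toy.groupby('STABBR').agg({'HBCU': ['mean'], 'UGDS': ['min', 'max']})

# each column label is a tuple such as ('UGDS', 'min'); join it into 'UGDS_min'
stats.columns = ['_'.join(pair) for pair in stats.columns]
print(stats)
```

This keeps the information from both levels in one flat set of column names, so no level has to be dropped.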


### Grouping multiple columns

Thus far, all examples have formed groups from a single column. Any number of columns can be used to form groups. Simply pass all the columns in a list to the **`groupby`** method. All grouping columns will be placed in the index.

In [92]:
# Group by state and religious affiliation
# find average undergraduate population by state and religious affiliation
# returns a Series since it's aggregating one column

college.groupby(['STABBR', 'RELAFFIL'])['UGDS'].mean().head(10)

STABBR  RELAFFIL
AK      0           3508.857143
        1            123.333333
AL      0           3248.774648
        1            979.722222
AR      0           1793.691176
        1            917.785714
AS      0           1276.000000
AZ      0           4363.533898
        1            692.750000
CA      0           3802.089810
Name: UGDS, dtype: float64

In [98]:
# multiple aggregation functions

college.groupby(['STABBR', 'RELAFFIL'])['UGDS'].agg(['size', 'min', 'max']).head(12)

Unnamed: 0_level_0,Unnamed: 1_level_0,size,min,max
STABBR,RELAFFIL,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
AK,0,7,109.0,12865.0
AK,1,3,27.0,275.0
AL,0,72,12.0,29851.0
AL,1,24,13.0,3033.0
AR,0,68,18.0,21405.0
AR,1,18,20.0,4485.0
AS,0,1,1276.0,1276.0
AZ,0,124,1.0,151558.0
AZ,1,9,25.0,4102.0
CA,0,609,0.0,44744.0


In [170]:
# multiple grouping columns, multiple aggregated columns and multiple aggregation functions

college.groupby(['STABBR', 'RELAFFIL'])[['UGDS', 'SATMTMID']].agg(['size', 'min', 'max']).head(12)

Unnamed: 0_level_0,Unnamed: 1_level_0,UGDS,UGDS,UGDS,SATMTMID,SATMTMID,SATMTMID
Unnamed: 0_level_1,Unnamed: 1_level_1,size,min,max,size,min,max
STABBR,RELAFFIL,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
AK,0,7,109.0,12865.0,7,,
AK,1,3,27.0,275.0,3,503.0,503.0
AL,0,72,12.0,29851.0,72,420.0,590.0
AL,1,24,13.0,3033.0,24,400.0,560.0
AR,0,68,18.0,21405.0,68,427.0,565.0
AR,1,18,20.0,4485.0,18,495.0,600.0
AS,0,1,1276.0,1276.0,1,,
AZ,0,124,1.0,151558.0,124,503.0,580.0
AZ,1,9,25.0,4102.0,9,480.0,480.0
CA,0,609,0.0,44744.0,609,445.0,785.0


### Multiple levels for index and columns
Both the **`index`** and the **`columns`** have multiple levels and are officially **`MultiIndex`** objects. More will be said on MultiIndex objects in another notebook.

### Reset index

All grouping columns are pushed into the index unless the argument **`as_index`** is False (although I've found this doesn't always work). You can always use **`reset_index`** to convert the index levels to columns.

In [912]:
# save DataFrame from above
temp = college.groupby(['STABBR', 'RELAFFIL'])[['UGDS', 'SATMTMID']].agg(['size', 'min', 'max']).head(12)

# inspect index
temp.index

MultiIndex(levels=[['AK', 'AL', 'AR', 'AS', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'FM', 'GA', 'GU', 'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME', 'MH', 'MI', 'MN', 'MO', 'MP', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM', 'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'PR', 'PW', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VA', 'VI', 'VT', 'WA', 'WI', 'WV', 'WY'], [0, 1]],
           labels=[[0, 0, 1, 1, 2, 2, 3, 4, 4, 5, 5, 6], [0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0]],
           names=['STABBR', 'RELAFFIL'])

In [913]:
# move the index back out as columns
temp.reset_index()

Unnamed: 0_level_0,STABBR,RELAFFIL,UGDS,UGDS,UGDS,SATMTMID,SATMTMID,SATMTMID
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,size,min,max,size,min,max
0,AK,0,7,109.0,12865.0,7,,
1,AK,1,3,27.0,275.0,3,503.0,503.0
2,AL,0,72,12.0,29851.0,72,420.0,590.0
3,AL,1,24,13.0,3033.0,24,400.0,560.0
4,AR,0,68,18.0,21405.0,68,427.0,565.0
5,AR,1,18,20.0,4485.0,18,495.0,600.0
6,AS,0,1,1276.0,1276.0,1,,
7,AZ,0,124,1.0,151558.0,124,503.0,580.0
8,AZ,1,9,25.0,4102.0,9,480.0,480.0
9,CA,0,609,0.0,44744.0,609,445.0,785.0


# End of Section Summary

1. Be familiar with the split-apply-combine methodology
2. Know that a groupby object is returned from the **`groupby`** method
3. The most common thing to do is aggregate after a groupby
4. Aggregate means to return a single value for each group
5. It's possible to aggregate with **`.agg`** (an alias of **`.aggregate`**) or by calling the aggregating method directly
6. When using **`.agg`** the name of the aggregating function can be passed as a string
7. Use multiple aggregate functions with a list or a dictionary
8. Use multiple grouping columns with a list
9. Use a dictionary of dictionaries to specify different aggregation functions and rename the results (pandas versions before 1.0 only)
10. Use **`droplevel`** method to remove a level of a hierarchical index

In [48]:
college = pd.read_csv('data/college.csv')
college.columns[college.columns.str.contains('C')]

Index(['CITY', 'HBCU', 'DISTANCEONLY', 'UGDS_BLACK', 'CURROPER', 'PCTPELL',
       'PCTFLOAN'],
      dtype='object')

### Problem 1
<span  style="color:green; font-size:16px">In the **`college`** DataFrame without using a groupby, which city name appears the most frequently?</span>

In [15]:
import pandas as pd
import numpy as np

college = pd.read_csv('data/college.csv')
college['CITY'].value_counts().index[0]


'New York'

### Problem 2
<span  style="color:green; font-size:16px">Does the city **`Houston`** only appear in the state of **`Texas`**?</span>

In [23]:
college.loc[college.CITY == "Houston","STABBR"].unique()

array(['TX', 'MO'], dtype=object)

### Problem 3
<span  style="color:green; font-size:16px">Find the maximum undergraduate population for each state.</span>

In [26]:
college.groupby('STABBR')['UGDS'].agg([np.max]).head(5)

Unnamed: 0_level_0,amax
STABBR,Unnamed: 1_level_1
AK,12865.0
AL,29851.0
AR,21405.0
AS,1276.0
AZ,151558.0


### Problem 4
<span  style="color:green; font-size:16px">Among colleges that have the largest undergrad population for each state, what is the difference between the most and least populous college?</span>

In [68]:
totaldiff = college.groupby('STABBR')['UGDS'].agg([np.max]).max() - college.groupby('STABBR')['UGDS'].agg([np.max]).min()
totaldiff

amax    602.0
dtype: float64

### Problem 5: Advanced
<span  style="color:green; font-size:16px">Find the name and population of the largest college per state.</span>

In [73]:
# first way
group = college.groupby('STABBR')['UGDS']
s = group.nlargest(1)
college.loc[s.index.get_level_values(1),['INSTNM','UGDS']].head(5)


Unnamed: 0,INSTNM,UGDS
60,University of Alaska Anchorage,12865.0
5,The University of Alabama,29851.0
137,University of Arkansas,21405.0
4138,American Samoa Community College,1276.0
7116,University of Phoenix-Arizona,151558.0


In [72]:
# second way
college_instm = college.set_index('INSTNM')[['STABBR', 'UGDS']]
max_colleges = college_instm.groupby('STABBR')['UGDS'].idxmax()
college_instm.loc[max_colleges].head(5)

Unnamed: 0_level_0,STABBR,UGDS
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1
University of Alaska Anchorage,AK,12865.0
The University of Alabama,AL,29851.0
University of Arkansas,AR,21405.0
American Samoa Community College,AS,1276.0
University of Phoenix-Arizona,AZ,151558.0


### Problem 6
<span  style="color:green; font-size:16px">Do distance-only schools tend to have larger or smaller student populations than non-distance-only schools?</span>

In [83]:
college.groupby(['DISTANCEONLY'])['UGDS'].agg([np.mean])

Unnamed: 0_level_0,mean
DISTANCEONLY,Unnamed: 1_level_1
0.0,2334.648135
1.0,6245.74359


### Problem 7
<span  style="color:green; font-size:16px">Do distance-only schools tend to be more or less religiously affiliated than non-distance-only schools?</span>

In [85]:
college.groupby(['DISTANCEONLY'])['RELAFFIL'].agg([np.mean])

Unnamed: 0_level_0,mean
DISTANCEONLY,Unnamed: 1_level_1
0.0,0.149635
1.0,0.05


### Problem 8
<span  style="color:green; font-size:16px">What state has the highest percentage of currently operating schools of those that have religious affiliation?</span>

In [127]:
# keep only the currently operating schools
cr = college[college['CURROPER'] == 1]

# fraction of those schools with a religious affiliation, by state
cr.groupby(['STABBR'])['RELAFFIL'].mean().sort_values(ascending = False).head(5)


STABBR
VI    0.500000
IN    0.406667
GU    0.333333
DC    0.320000
IA    0.314607
Name: RELAFFIL, dtype: float64

### Problem 9: Advanced
<span  style="color:green; font-size:16px">Trim the **`college`** DataFrame to only the 'race' columns - those beginning with **`UGDS_`**. Create a new column called **`UGDS_OTHER`** that is the sum of any race column that averages under 4% for the entire dataset.</span>

In [192]:
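# take an explicit copy so that adding a column later does not touch (or warn about) college
ctrim = college.loc[:, college.columns[college.columns.str.contains('UGDS_')]].copy()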

In [193]:
minoritycols = ctrim.columns[ctrim.mean() < .04]
minoritycols

Index(['UGDS_ASIAN', 'UGDS_AIAN', 'UGDS_NHPI', 'UGDS_2MOR', 'UGDS_NRA'], dtype='object')

In [194]:
ctrim['UGDS_OTHER'] = ctrim[minoritycols].sum(axis=1)
ctrim.head(10)

Unnamed: 0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,UGDS_OTHER
0,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138,0.0121
1,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01,0.1094
2,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715,0.0034
3,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035,0.1025
4,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137,0.0376
5,0.7825,0.1119,0.0348,0.0106,0.0038,0.0009,0.0261,0.0268,0.0026,0.0682
6,0.7255,0.2613,0.0044,0.0025,0.0044,0.0,0.0,0.0,0.0019,0.0069
7,0.7823,0.12,0.0191,0.0053,0.0157,0.001,0.0174,0.0057,0.0334,0.0451
8,0.5328,0.3376,0.0074,0.0221,0.0044,0.0016,0.0297,0.0397,0.0246,0.0975
9,0.8507,0.0704,0.0248,0.0227,0.0074,0.0,0.0,0.01,0.014,0.0401


### Problem 10
<span  style="color:green; font-size:16px">Use the column **`UG25ABV`** and the **`quantile`** Series function to get 5 evenly spaced quantiles (use 6 numbers). Use this output to create a categorical variable using the **`cut`** function and label the bins Youngest, Young, Average, Old, Oldest and assign it to the **`AGEGROUP`** column.

Then find the average SAT math scores by AGEGROUP. Any surprising result?</span>

In [208]:
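# six quantile edges define five bins
quants = college.UG25ABV.quantile([0, .2, .4, .6, .8, 1])
# include_lowest=True keeps the minimum value, which would otherwise fall outside the first bin
college['AGEGROUP'] = pd.cut(college.UG25ABV, quants, labels=['Youngest', 'Young', 'Average', 'Old', 'Oldest'], include_lowest=True)
college.groupby('AGEGROUP')['SATMTMID'].mean()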


### Problem 11
<span  style="color:green; font-size:16px">Which are the top 5 historically black colleges that have the highest white percentage?</span>

In [197]:
college.loc[college.HBCU == 1, ['INSTNM', 'UGDS_WHITE']].sort_values('UGDS_WHITE', ascending=False).head()

Unnamed: 0,INSTNM,UGDS_WHITE
4021,Bluefield State College,0.8437
17,Gadsden State Community College,0.6921
4050,West Virginia State University,0.5816
48,Shelton State Community College,0.5613
55,H Councill Trenholm State Community College,0.3951


### Problem 12: Advanced
<span  style="color:green; font-size:16px">Again make a DataFrame of all the race percentage columns. Read the documentation on the **`mul`** DataFrame method and use it to multiply the race percentage DataFrame by the total undergraduate population (**`UGDS`**) to get the actual population of each race.</span>

In [205]:
df_race = college.filter(like='UGDS_')
df_race.mul(college['UGDS'], axis=0).round(0).head(5) # not in place


Unnamed: 0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
0,140.0,3934.0,23.0,8.0,10.0,8.0,0.0,25.0,58.0
1,6741.0,2960.0,322.0,590.0,25.0,8.0,419.0,204.0,114.0
2,87.0,122.0,2.0,1.0,0.0,0.0,0.0,0.0,79.0
3,3809.0,684.0,208.0,205.0,78.0,1.0,94.0,181.0,191.0
4,76.0,4430.0,58.0,9.0,5.0,3.0,47.0,117.0,66.0
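The role of `axis=0` in the solution above can be seen on a tiny made-up example. The frame `shares` and the Series `totals` are hypothetical; only the column names echo the college data.

```python
import pandas as pd

# hypothetical percentages for two schools
shares = pd.DataFrame({'UGDS_WHITE': [0.5, 0.25],
                       'UGDS_BLACK': [0.5, 0.75]})

# hypothetical total undergraduate population of each school
totals = pd.Series([1000, 400])

# axis=0 lines the Series up with the rows, so each row is scaled by its own total
head_counts = shares.mul(totals, axis=0)
print(head_counts)
```

Without `axis=0`, `mul` would try to match the Series index against the column names instead of the row labels, which is not what this problem needs.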
