# Split, Apply, Combine

In [1]:
import numpy as np
import pandas as pd

pd.set_option('display.precision', 2)

In [2]:
dat = pd.read_csv('starmine.csv', parse_dates=['date'])
# dat = dat.set_index(['date', 'symbol'], verify_integrity=True).sort_index()

sectors = 'Durbl Enrgy HiTec'.split(' ')
dates = ['1995-01-31', '1995-02-28']
cols = ['date', 'symbol', 'sector', 'smi', 'ret_0_1_m', 'cap_usd']
dat = dat[cols].query('sector in @sectors and date in @dates').reset_index(drop=True)

## Split, Apply, Combine

Hadley Wickham  
**The split-apply-combine strategy for data analysis.**  
_Journal of Statistical Software_, vol. 40, no. 1, pp. 1–29, 2011  

In [3]:
grp = dat.groupby('sector')

## Split

Conceptually, a `groupby` object lets you iterate over the pieces of the split `DataFrame`, yielding one (group name, sub-frame) pair per group.

In [4]:
for sector_name, sector_df in grp:
    print(f'Sector Name: {sector_name}')
    break # stop the iteration

sector_df.head()

Sector Name: Durbl


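|    | date       | symbol | sector | smi  | ret_0_1_m | cap_usd     |
|---:|:-----------|:-------|:-------|-----:|----------:|------------:|
| 4  | 1995-01-31 | 3FDMLQ | Durbl  | 31.0 | 0.15      | 501000000.0 |
| 28 | 1995-01-31 | AICOQ  | Durbl  | 68.0 | -0.04     | 168000000.0 |
| 29 | 1995-01-31 | AIHI   | Durbl  | 45.0 | 0.11      | 303000000.0 |
| 48 | 1995-01-31 | APN    | Durbl  | NaN  | 0.1       | 124000000.0 |
| 52 | 1995-01-31 | ARV    | Durbl  | 18.0 | -0.02     | 511000000.0 |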
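
A minimal sketch of the split step on a made-up toy frame (the `toy` data and its values are illustrative assumptions, not the starmine data). It shows that iterating yields (name, sub-frame) pairs and that `get_group` retrieves a single piece without a loop.

```python
import pandas as pd

# Toy data standing in for the real extract (values are made up).
toy = pd.DataFrame({
    'sector': ['Durbl', 'Enrgy', 'Durbl', 'HiTec', 'Enrgy'],
    'smi': [31.0, 20.0, 68.0, 90.0, 40.0],
})
toy_grp = toy.groupby('sector')

# Each iteration yields (group key, sub-DataFrame); keys come out sorted.
group_sizes = {name: len(sub_df) for name, sub_df in toy_grp}

# get_group pulls out one piece directly, keeping the original row labels.
durbl = toy_grp.get_group('Durbl')
print(group_sizes)
print(durbl)
```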



## Apply

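By default, the aggregation is applied to every numeric column; non-numeric columns such as `symbol` do not appear in the result.

In [6]:
# pandas >= 2.0 needs numeric_only=True here; older versions
# dropped the non-numeric columns silently.
grp.mean(numeric_only=True)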

| sector | smi   | ret_0_1_m | cap_usd      |
|:-------|------:|----------:|-------------:|
| Durbl  | 47.6  | 0.01      | 2060000000.0 |
| Enrgy  | 31.3  | 0.05      | 3870000000.0 |
| HiTec  | 61.95 | 0.06      | 1480000000.0 |
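
A small sketch of the default aggregation on made-up data (the `toy` frame is an assumption for illustration). It shows that only numeric columns survive `mean(numeric_only=True)` and that missing values are skipped rather than propagated.

```python
import pandas as pd

# Made-up data: one string column, two numeric columns, one missing value.
toy = pd.DataFrame({
    'symbol': ['AAA', 'BBB', 'CCC', 'DDD'],
    'sector': ['Durbl', 'Durbl', 'Enrgy', 'Enrgy'],
    'smi': [30.0, 50.0, 10.0, None],
    'cap_usd': [1.0e8, 3.0e8, 2.0e8, 4.0e8],
})

# 'symbol' is non-numeric, so it is left out of the result.
# The NaN in 'smi' is ignored when averaging the Enrgy group.
sector_means = toy.groupby('sector').mean(numeric_only=True)
print(sector_means)
```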


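## Apply

Select a single column before aggregating. Single brackets return a `Series`; double brackets return a one-column `DataFrame`.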

In [7]:
# Returns a Series
grp['smi'].mean()

sector
Durbl    47.60
Enrgy    31.30
HiTec    61.95
Name: smi, dtype: float64

In [8]:
# Returns a DataFrame
grp[['smi']].mean()

| sector | smi   |
|:-------|------:|
| Durbl  | 47.6  |
| Enrgy  | 31.3  |
| HiTec  | 61.95 |
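
The difference between the two selections can be checked directly. This is a sketch on made-up data (the `toy` frame is an illustrative assumption); the values are identical and only the container type differs.

```python
import pandas as pd

toy = pd.DataFrame({
    'sector': ['Durbl', 'Durbl', 'Enrgy'],
    'smi': [40.0, 60.0, 30.0],
})
toy_grp = toy.groupby('sector')

as_series = toy_grp['smi'].mean()    # single brackets -> Series
as_frame = toy_grp[['smi']].mean()   # double brackets -> DataFrame

print(type(as_series).__name__, type(as_frame).__name__)
# The single column of the DataFrame holds the same values as the Series.
print(as_frame['smi'].equals(as_series))
```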


## Apply

Apply the same function to multiple *selected* columns.

In [9]:
grp[['smi', 'cap_usd']].mean()

| sector | smi   | cap_usd      |
|:-------|------:|-------------:|
| Durbl  | 47.6  | 2060000000.0 |
| Enrgy  | 31.3  | 3870000000.0 |
| HiTec  | 61.95 | 1480000000.0 |


## Apply

Apply different functions to a single column and give the
resulting columns custom names.

Use the `aggregate` or `agg` method.

In [10]:
# Named aggregation: each keyword becomes an output column.
# (Passing a dict of name -> function to a single column was
# deprecated and no longer works in pandas 1.0+.)
grp['smi'].agg(
    avg='mean',
    median='median',
    q75=lambda x: x.dropna().quantile(0.75)
)


| sector | avg   | median | q75  |
|:-------|------:|-------:|-----:|
| Durbl  | 47.6  | 44.0   | 69.0 |
| Enrgy  | 31.3  | 27.0   | 44.0 |
| HiTec  | 61.95 | 65.0   | 88.0 |
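
A sketch of custom-named results using keyword arguments to `agg` (named aggregation), run on made-up data; the `toy` frame and the output names `avg`, `median` and `q75` are illustrative choices. Strings name built-in aggregations, and a callable can be mixed in where no built-in name fits.

```python
import pandas as pd

toy = pd.DataFrame({
    'sector': ['Durbl'] * 4 + ['Enrgy'] * 4,
    'smi': [10.0, 20.0, 30.0, 40.0, 1.0, 2.0, 3.0, None],
})

# keyword = output column name, value = aggregation to run.
summary = toy.groupby('sector')['smi'].agg(
    avg='mean',
    median='median',
    q75=lambda x: x.quantile(0.75),  # quantile skips NaN by default
)
print(summary)
```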


## Apply

Apply different functions to a single column. Results have the same
names as the functions.

In [11]:
grp['smi'].agg(['mean', 'median'])

| sector | mean  | median |
|:-------|------:|-------:|
| Durbl  | 47.6  | 44.0   |
| Enrgy  | 31.3  | 27.0   |
| HiTec  | 61.95 | 65.0   |


## Apply

Apply different functions to different columns.

Again, use the `aggregate` or `agg` method.

In [12]:
grp.agg({
    'smi': lambda x: x.median(),
    'cap_usd': 'mean'
})

| sector | smi  | cap_usd      |
|:-------|-----:|-------------:|
| Durbl  | 44.0 | 2060000000.0 |
| Enrgy  | 27.0 | 3870000000.0 |
| HiTec  | 65.0 | 1480000000.0 |
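
With a dict, each result column keeps the source column's name, which hides which function was applied. One possible alternative, sketched here on made-up data, is named aggregation with (column, function) tuples; the output names `smi_median` and `cap_mean` are illustrative assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    'sector': ['Durbl', 'Durbl', 'Durbl', 'Enrgy', 'Enrgy'],
    'smi': [10.0, 20.0, 60.0, 5.0, 15.0],
    'cap_usd': [1.0e8, 2.0e8, 3.0e8, 4.0e8, 6.0e8],
})

# output name = (source column, aggregation)
summary = toy.groupby('sector').agg(
    smi_median=('smi', 'median'),
    cap_mean=('cap_usd', 'mean'),
)
print(summary)
```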


## Grouping by Multiple Variables

Same idea as before, except our results now have a MultiIndex.

In [13]:
grp2 = dat.groupby(['sector', 'date'])
grp2[['cap_usd', 'smi']].mean()

| sector | date       | cap_usd      | smi   |
|:-------|:-----------|-------------:|------:|
| Durbl  | 1995-01-31 | 1970000000.0 | 50.58 |
| Durbl  | 1995-02-28 | 2150000000.0 | 44.44 |
| Enrgy  | 1995-01-31 | 3790000000.0 | 28.49 |
| Enrgy  | 1995-02-28 | 3940000000.0 | 34.17 |
| HiTec  | 1995-01-31 | 1430000000.0 | 61.98 |
| HiTec  | 1995-02-28 | 1530000000.0 | 61.92 |
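
A sketch of three common ways to work with a MultiIndex result, on made-up data (the `toy` frame is an illustrative assumption): look up one (sector, date) cell, pivot one index level into columns with `unstack`, or flatten the index back into ordinary columns with `reset_index`.

```python
import pandas as pd

toy = pd.DataFrame({
    'sector': ['Durbl', 'Durbl', 'Durbl', 'Enrgy', 'Enrgy'],
    'date': pd.to_datetime(['1995-01-31', '1995-01-31', '1995-02-28',
                            '1995-01-31', '1995-02-28']),
    'smi': [40.0, 60.0, 44.0, 28.0, 34.0],
})

means = toy.groupby(['sector', 'date'])['smi'].mean()

# Look up a single cell with a (sector, date) tuple.
durbl_jan = means.loc[('Durbl', pd.Timestamp('1995-01-31'))]

# Move the 'date' level into columns: one row per sector.
wide = means.unstack('date')

# Or turn both index levels back into regular columns.
flat = means.reset_index()
print(wide)
print(flat)
```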


## Flexible Apply

Use `apply` to operate on each grouped subset of the `DataFrame`.

In [14]:
grp3 = dat.groupby('sector')
grp3.apply(lambda df: pd.Series(df.shape, index=['nrow', 'ncol']))

| sector | nrow | ncol |
|:-------|-----:|-----:|
| Durbl  | 131  | 6    |
| Enrgy  | 164  | 6    |
| HiTec  | 672  | 6    |
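
`apply` is most useful when the computation needs several columns at once, which `agg` cannot express per column. As a hedged illustration on made-up data, the sketch below computes a cap-weighted average of `smi` per sector; the helper `cap_weighted_smi` and the `toy` frame are hypothetical. Selecting the needed columns before `apply` keeps the grouping column out of each sub-frame.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'sector': ['Durbl', 'Durbl', 'Enrgy', 'Enrgy'],
    'smi': [20.0, 80.0, 10.0, 30.0],
    'cap_usd': [3.0e8, 1.0e8, 1.0e8, 1.0e8],
})

def cap_weighted_smi(sub_df):
    """Average of smi within one group, weighted by cap_usd."""
    return np.average(sub_df['smi'], weights=sub_df['cap_usd'])

# Each call receives a sub-DataFrame with just the two selected columns.
weighted = toy.groupby('sector')[['smi', 'cap_usd']].apply(cap_weighted_smi)
print(weighted)
```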
