## Grouping using heterogeneous operators via an abstracted function
One nice thing about SQL's GROUP BY clause is how easily it mixes aggregate operators (MAX, MIN, SUM, AVG) across columns in a single query. I wrote a function that makes the same kind of task easy on a pandas DataFrame.

In [1]:
from agg_records import hybrid_groupby

The function imports alongside the usual modules:

In [2]:
import numpy as np
import pandas as pd
import sqlalchemy as sql
import matplotlib.pyplot as plt
import seaborn as sns

sql_engine = sql.create_engine('mssql+pyodbc://@localhost')

#### Pull some data

In [16]:
query = "SELECT * FROM FOREST_FIRES"  # no placeholders, so a plain string is enough
df = pd.read_sql_query(query, sql_engine)
df.head()

Unnamed: 0,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area
0,mar,fri,86.2,26.2,94.3,5.1,8.2,51,6.7,0.0,0.0
1,oct,tue,90.6,35.4,669.1,6.7,18.0,33,0.9,0.0,0.0
2,oct,sat,90.6,43.7,686.9,6.7,14.6,33,1.3,0.0,0.0
3,mar,fri,91.7,33.3,77.5,9.0,8.3,97,4.0,0.2,0.0
4,mar,sun,89.3,51.3,102.2,9.6,11.4,99,1.8,0.0,0.0


#### Now let's say we want to summarize every month, but the operator depends on the column
We want the max temp & RH, but the min ISI & DC. We also want the total rain, but the average wind. 

In [17]:
grp_df = hybrid_groupby(df, grp_cols = ['month'], 
                            max_cols = ['temp', 'RH'], 
                            min_cols = ['ISI', 'DC'], 
                            sum_cols = ['rain'], 
                            avg_cols = ['wind']) 

grp_df

Unnamed: 0,month,day,FFMC,DMC,area,rain,wind,ISI,DC,temp,RH
0,mar,fri,86.2,26.2,0.0,0.2,4.968519,0.7,15.5,18.8,99
1,oct,tue,90.6,35.4,0.0,0.0,3.46,3.0,664.2,21.7,60
2,aug,sun,92.3,85.3,0.0,10.8,4.086413,0.8,100.7,33.3,96
3,sep,tue,91.0,129.5,0.0,0.0,3.557558,0.4,665.3,30.2,86
4,apr,sat,86.3,27.4,0.0,0.0,4.666667,2.3,7.9,17.6,75
5,jun,sun,94.3,96.3,0.0,0.0,4.135294,0.4,200.0,28.0,90
6,jul,tue,79.5,60.6,0.0,0.2,3.734375,1.5,296.3,30.2,90
7,feb,mon,84.0,9.3,0.0,0.0,3.755,0.8,15.3,15.7,82
8,jan,sat,82.1,3.7,0.0,0.0,2.0,0.0,9.3,5.3,100
9,dec,sun,84.4,27.2,8.98,0.0,7.644444,2.0,349.7,5.1,61


The columns we didn't pass to hybrid_groupby are still present, but each one holds the value from the first record in its group. That suits a column like day, where we are indifferent to the max or min within each month.
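The source of `hybrid_groupby` is not shown in this notebook, so the following is only a minimal sketch of how a function with the same call signature could be built on `DataFrame.groupby(...).agg(...)`. The name `hybrid_groupby_sketch`, the toy data, and the choice of `sort=False` (to keep groups in order of first appearance, as in the output above) are assumptions for illustration, not the author's implementation.

```python
import pandas as pd


def hybrid_groupby_sketch(df, grp_cols, max_cols=(), min_cols=(),
                          sum_cols=(), avg_cols=()):
    """Hypothetical stand-in for hybrid_groupby (not the original code)."""
    agg_map = {}
    # Map each explicitly listed column to its pandas aggregation name.
    for cols, func in ((max_cols, "max"), (min_cols, "min"),
                       (sum_cols, "sum"), (avg_cols, "mean")):
        for col in cols:
            agg_map[col] = func
    # Any remaining non-grouping column falls back to "first",
    # which takes the first non-null value in each group.
    for col in df.columns:
        if col not in grp_cols and col not in agg_map:
            agg_map[col] = "first"
    # sort=False keeps groups in order of first appearance (an assumption).
    return df.groupby(list(grp_cols), sort=False).agg(agg_map).reset_index()


# Toy data, made up for this example.
toy = pd.DataFrame({
    "month": ["mar", "oct", "mar", "oct"],
    "day": ["fri", "tue", "sun", "sat"],
    "temp": [8.2, 18.0, 11.4, 14.6],
    "wind": [6.0, 1.0, 2.0, 3.0],
    "rain": [0.0, 0.0, 0.2, 0.0],
})

out = hybrid_groupby_sketch(toy, grp_cols=["month"],
                            max_cols=["temp"],
                            sum_cols=["rain"],
                            avg_cols=["wind"])
print(out)
```

In this sketch the `day` column for `mar` comes back as `fri`, the first record in that group, while `temp`, `rain`, and `wind` are aggregated independently of one another. The values in a single output row therefore do not necessarily come from the same source record.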

#### Now we want just the max temp and minimum wind for each day of the week

In [19]:
day_df = hybrid_groupby(df, grp_cols = ['day'], 
                            max_cols = ['temp'], 
                            min_cols = ['wind']) 

day_df

Unnamed: 0,month,day,FFMC,DMC,DC,ISI,RH,rain,area,wind,temp
0,mar,fri,86.2,26.2,94.3,5.1,51,0.0,0.0,0.9,32.4
1,oct,tue,90.6,35.4,669.1,6.7,33,0.0,0.0,0.9,33.3
2,oct,sat,90.6,43.7,686.9,6.7,33,0.0,0.0,0.9,30.8
3,mar,sun,89.3,51.3,102.2,9.6,99,0.0,0.0,0.9,33.1
4,aug,mon,92.3,88.9,495.6,8.5,27,0.0,0.0,1.3,32.6
5,sep,wed,92.9,133.3,699.6,9.2,21,0.0,0.0,0.4,30.8
6,sep,thu,92.9,137.0,706.4,9.2,17,0.0,0.0,0.9,32.4


We can narrow the view to the columns of interest and sort by wind:

In [24]:
day_df[['day', 'temp', 'wind']].sort_values('wind')

Unnamed: 0,day,temp,wind
5,wed,30.8,0.4
0,fri,32.4,0.9
1,tue,33.3,0.9
2,sat,30.8,0.9
3,sun,33.1,0.9
6,thu,32.4,0.9
4,mon,32.6,1.3


It looks like the calmest wind reading (0.4) fell on a Wednesday. This is not statistically interesting, but the tool is cool.
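Because the grouped table aggregates each column separately, it does not say which record produced a given minimum. One way to recover the full source row is `idxmin` on the grouped column, sketched here on made-up toy data rather than the forest-fires table:

```python
import pandas as pd

# Toy data, made up for this example.
toy = pd.DataFrame({
    "month": ["mar", "sep", "oct", "sep"],
    "day": ["wed", "wed", "tue", "tue"],
    "wind": [4.0, 0.4, 0.9, 2.2],
})

# For each day, find the index label of the row with the lowest wind,
# then pull those complete rows from the original frame.
calmest_idx = toy.groupby("day", sort=False)["wind"].idxmin()
calmest_rows = toy.loc[calmest_idx]
print(calmest_rows)
```

Each row of `calmest_rows` is an intact record, so the month beside the minimum wind is the month in which that reading actually occurred.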