# PandasSQLWindow Examples

Pandas' many user-friendly features make it the preferred data-manipulation framework in Python (especially when compared to PySpark). Even so, I often receive questions about how to conveniently perform SQL window-function-like operations in Pandas, particularly on datasets with many logically partitioned groups.


The walkthrough below is an attempt to bring some window features common to SQL and PySpark into our beloved Pandas.

In [0]:
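# Assumes the PandasSQLWindow class lives in a module of the same name.
from PandasSQLWindow import PandasSQLWindow
import numpy as np
import pandas as pd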

In [9]:
df = pd.DataFrame({'group': ['a', 'b', 'b', 'c', 'c', 'c'],
                   'timestamp': [1, 2, 1, 3, 2, 1],
                   'value': [1, 2, 3, 4, np.nan, 6]})

df

|   | group | timestamp | value |
|---|-------|-----------|-------|
| 0 | a     | 1         | 1.0   |
| 1 | b     | 2         | 2.0   |
| 2 | b     | 1         | 3.0   |
| 3 | c     | 3         | 4.0   |
| 4 | c     | 2         | NaN   |
| 5 | c     | 1         | 6.0   |


In [10]:
data = df
partition_by = ['group']
order_by = ['timestamp']
rows_rolling = 2

w = PandasSQLWindow(data=data, 
                    partition_by=partition_by,
                    order_by=order_by, 
                    rows_rolling=rows_rolling)

df['count'] = w.rank(method='first')
df['value_lag'] = w.lag('value')
df['value_last'] = w.last('value')

df['expanding_sum'] = w.expanding_sum('value')
df['expanding_min'] = w.expanding_min('value')

df['rolling_mean'] = w.rolling_mean('value')
df['rolling_sum'] = w.rolling_sum('value')

# Just for ease of reading:
df.sort_values(['group', 'timestamp'])

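|   | group | timestamp | value | count | value_lag | value_last | expanding_sum | expanding_min | rolling_mean | rolling_sum |
|---|-------|-----------|-------|-------|-----------|------------|---------------|---------------|--------------|-------------|
| 0 | a     | 1         | 1.0   | 1     | NaN       | NaN        | 1.0           | 1.0           | 1.0          | 1.0         |
| 2 | b     | 1         | 3.0   | 1     | NaN       | NaN        | 3.0           | 3.0           | 3.0          | 3.0         |
| 1 | b     | 2         | 2.0   | 2     | 3.0       | 3.0        | 5.0           | 2.0           | 2.5          | 5.0         |
| 5 | c     | 1         | 6.0   | 1     | NaN       | NaN        | 6.0           | 6.0           | 6.0          | 6.0         |
| 4 | c     | 2         | NaN   | 2     | 6.0       | 6.0        | 6.0           | 6.0           | 6.0          | 6.0         |
| 3 | c     | 3         | 4.0   | 3     | NaN       | 6.0        | 10.0          | 4.0           | 4.0          | 4.0         |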
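
For comparison, the same columns can be approximated with plain pandas `groupby` calls. This is only a sketch, not the PandasSQLWindow implementation: the semantics are inferred from the output above. In particular, it assumes that `last` returns the most recent non-null value before the current row, and that the expanding and rolling aggregates skip NaN and accept a single valid observation (`min_periods=1`). The names `ordered`, `grouped` and `out` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'group': ['a', 'b', 'b', 'c', 'c', 'c'],
                   'timestamp': [1, 2, 1, 3, 2, 1],
                   'value': [1, 2, 3, 4, np.nan, 6]})

# Sort by the ORDER BY column within each partition so that positional
# operations (shift, expanding, rolling) follow timestamp order.
ordered = df.sort_values(['group', 'timestamp'])
grouped = ordered.groupby('group')['value']

out = ordered.copy()

# ROW_NUMBER() OVER (PARTITION BY group ORDER BY timestamp)
out['count'] = (ordered.groupby('group')['timestamp']
                .rank(method='first').astype(int))

# LAG(value, 1): the previous row's value, NaN included.
out['value_lag'] = grouped.shift(1)

# Assumed meaning of `last`: most recent non-null value before this row.
out['value_last'] = grouped.transform(lambda s: s.ffill().shift(1))

# Expanding (unbounded preceding) aggregates; NaN values are skipped.
out['expanding_sum'] = grouped.transform(lambda s: s.expanding().sum())
out['expanding_min'] = grouped.transform(lambda s: s.expanding().min())

# Rolling aggregates over the current row and the one before it
# (rows_rolling = 2); min_periods=1 is an assumption read off the output.
out['rolling_mean'] = grouped.transform(
    lambda s: s.rolling(2, min_periods=1).mean())
out['rolling_sum'] = grouped.transform(
    lambda s: s.rolling(2, min_periods=1).sum())

print(out)
```

Because every result keeps the index of `ordered`, the columns can also be assigned straight back onto the unsorted `df`; pandas aligns them by index label.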
