# PandasSQLWindow Examples

Pandas offers many brilliant, user-friendly features that make it the preferred data-manipulation framework in Python (especially when compared to PySpark). Even so, I have often received questions and requests about how to conveniently perform SQL window-function-like operations in Pandas, especially when working on datasets with many logically partitioned groups.


As a result, I wrote the PandasSQLWindow class as an attempt to unify some features common to SQL and PySpark using our beloved Pandas. This notebook walks through some of the functionality of the (not so creatively named) PandasSQLWindow.

**Both rolling and expanding windows can be used.**

Currently implemented Window functions:
- shift
- lag
- lead
- last (last known, non-null value)
- rank
- count

- min
- max
- mean
- median
- quantile
- sum
- var
- std


**Tip:** use `help(Window)` to list all parameters and available methods.
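For readers who want a mental model of the positional functions listed above, here is a minimal sketch of roughly equivalent operations in plain Pandas. It does not use PandasSQLWindow and is not a description of how the class is implemented; the toy data and the column names (`row_number`, `last_known`, and so on) are assumptions made for illustration. In particular, `last_known` is assumed to mean the most recent non-null value among the *previous* rows of the partition.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, already sorted by partition and ordering column
toy = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b"],
    "timestamp": [1, 2, 3, 1, 2],
    "value": [10.0, np.nan, 30.0, 5.0, 7.0],
}).sort_values(["group", "timestamp"])

g = toy.groupby("group")["value"]

toy["lag"] = g.shift(1)    # previous row's value within the partition
toy["lead"] = g.shift(-1)  # next row's value within the partition
toy["row_number"] = toy.groupby("group").cumcount() + 1  # expanding count
toy["rank"] = g.rank(method="min")  # rank of value within the partition; NaN stays NaN

# Assumed semantics of "last": shift by one row, then forward-fill within the partition
toy["last_known"] = g.shift(1).groupby(toy["group"]).ffill()

print(toy)
```

The key idea is that `groupby` plays the role of `PARTITION BY`, while sorting beforehand plays the role of `ORDER BY`.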

In [None]:
import pandas as pd
import numpy as np

from PandasSQLWindow import Window

In [None]:
df = pd.DataFrame({'group':['a', 'b', 'b', 'c', 'c', 'c'],
                   'timestamp':[1, 2, 1, 3, 2, 1], 
                   'value': [1, 2, 3, 4, np.nan, 6]})

df

In [3]:
w = Window(data=df, 
           partition_by='group', # alternatively, use a list to partition by multiple columns
           order_by='timestamp', 
           rows_rolling=2)

df['count'] = w.count() # always an expanding count (i.e. row number within the partition)
df['lag'] = w.lag('value') # previous row's value within the partition
df['last_known'] = w.last('value') # last known, non-null value

df['rolling_sum'] = w.sum('value') # rolling sum, because rows_rolling is specified
df['expanding_sum'] = w.expanding_sum('value') # expanding sum over the partition, available when called explicitly

df['rolling_mean'] = w.rolling_mean('value') # explicit equivalent of mean() when rows_rolling is specified
df['expanding_mean'] = w.expanding_mean('value') # expanding mean over the partition, available when called explicitly

column_order = [
    'group',
    'timestamp',
    'value',
    'count',
    'lag',
    'last_known',
    'rolling_sum',
    'expanding_sum',
    'rolling_mean',
    'expanding_mean'
]

# Just for ease of reading:
df[column_order].sort_values(['group', 'timestamp'])

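| | group | timestamp | value | count | lag | last_known | rolling_sum | expanding_sum | rolling_mean | expanding_mean |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | a | 1 | 1.0 | 1 | NaN | NaN | 1.0 | 1.0 | 1.0 | 1.0 |
| 2 | b | 1 | 3.0 | 1 | NaN | NaN | 3.0 | 3.0 | 3.0 | 3.0 |
| 1 | b | 2 | 2.0 | 2 | 3.0 | 3.0 | 5.0 | 5.0 | 2.5 | 2.5 |
| 5 | c | 1 | 6.0 | 1 | NaN | NaN | 6.0 | 6.0 | 6.0 | 6.0 |
| 4 | c | 2 | NaN | 2 | 6.0 | 6.0 | 6.0 | 6.0 | 6.0 | 6.0 |
| 3 | c | 3 | 4.0 | 3 | NaN | 6.0 | 4.0 | 10.0 | 4.0 | 5.0 |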
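
As a sanity check, the rolling and expanding columns in the output above can be reproduced with plain Pandas. This is only a sketch for comparison, not the PandasSQLWindow implementation: the use of `min_periods=1` is an assumption inferred from the output (partial windows and windows containing a NaN still produce a value), and `check` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Same data as in the notebook
df = pd.DataFrame({'group': ['a', 'b', 'b', 'c', 'c', 'c'],
                   'timestamp': [1, 2, 1, 3, 2, 1],
                   'value': [1, 2, 3, 4, np.nan, 6]})

# Sort so that row order within each partition follows the ordering column
check = df.sort_values(['group', 'timestamp']).copy()
g = check.groupby('group')['value']

# Rolling window of 2 rows; min_periods=1 (assumed) lets short or NaN-containing windows return a value
check['rolling_sum'] = g.transform(lambda s: s.rolling(2, min_periods=1).sum())
check['rolling_mean'] = g.transform(lambda s: s.rolling(2, min_periods=1).mean())

# Expanding window over all rows seen so far in the partition
check['expanding_sum'] = g.transform(lambda s: s.expanding(min_periods=1).sum())
check['expanding_mean'] = g.transform(lambda s: s.expanding(min_periods=1).mean())

print(check)
```

Note how the row for group `c`, timestamp 3 separates the two window types: the rolling window only covers the NaN and 4.0, while the expanding window still includes the 6.0 from timestamp 1.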
