# [Go to "Computational Tools" in Pandas Docs](https://pandas.pydata.org/docs/user_guide/computation.html)

In [1]:
import pandas as pd
import numpy as np

# 1. Statistical Functions

## 1.1 Percentage Change

>Use the [pct_change][1] method.
>
>Parameters:
>- `periods` (int, default 1)
>- `fill_method` (str, default 'pad'; deprecated since pandas 2.1 except for `fill_method=None`)
>- `limit` (int, default None; deprecated since pandas 2.1)
>- `freq` (DateOffset, timedelta, or str, optional)

[1]: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pct_change.html#pandas-dataframe-pct-change

In [2]:
s = pd.Series([42, 57, 35, 101, 88])
spc = s.pct_change()

# by default, each value is compared to the previous one (periods=1)
# e.g. 0.357143 == 35.7143% increase: 57 from 42
pd.concat([s, spc], axis=1, keys=['s', '% change'])

Unnamed: 0,s,% change
0,42,
1,57,0.357143
2,35,-0.385965
3,101,1.885714
4,88,-0.128713


In [3]:
df = pd.DataFrame(np.random.randint(1, 10, (7, 3)))
dfpct = df.pct_change(periods=3)

pd.concat([df, dfpct], axis=1)

Unnamed: 0,0,1,2,0,1,2
0,4,8,8,,,
1,7,7,1,,,
2,8,1,8,,,
3,2,4,5,-0.5,-0.5,-0.375
4,8,4,1,0.142857,-0.428571,0.0
5,4,4,5,-0.5,3.0,-0.375
6,5,1,8,1.5,-0.75,0.6
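
The calculation behind `pct_change` can be sketched by hand: each value is compared with the one `periods` rows earlier. The helper name `manual_pct_change` below is hypothetical and exists only for this illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([42, 57, 35, 101, 88])

def manual_pct_change(series, periods=1):
    # Hypothetical helper: fractional change relative to the value
    # `periods` rows earlier (NaN where no earlier value exists).
    previous = series.shift(periods)
    return (series - previous) / previous

builtin = s.pct_change(periods=2)
manual = manual_pct_change(s, periods=2)

# Compare the two results, treating NaN positions as equal.
matches = np.allclose(builtin, manual, equal_nan=True)
print(pd.concat([builtin, manual], axis=1, keys=["pct_change", "manual"]))
print(matches)
```

With `periods=2`, the first two rows have no earlier value to compare against, so they are `NaN`.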


## 1.2 Covariance

>Use [Series.cov()][1] for the covariance between two `Series`, and [DataFrame.cov()][2] for the pairwise covariances among the columns of a `DataFrame`. Missing values are excluded.

[1]: https://pandas.pydata.org/docs/reference/api/pandas.Series.cov.html
[2]: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.cov.html

In [4]:
s1 = pd.Series(np.random.randn(200), name='s1')
s2 = pd.Series(np.random.randn(200), name='s2')

s1.cov(s2), s2.cov(s1)

(0.0004713130710796497, 0.0004713130710796497)

In [5]:
df = pd.DataFrame(np.random.rand(50, 4))
df.cov()

Unnamed: 0,0,1,2,3
0,0.07653,-0.001002,0.002664,-0.017268
1,-0.001002,0.07783,-0.012775,-0.010667
2,0.002664,-0.012775,0.085897,0.011337
3,-0.017268,-0.010667,0.011337,0.077523
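
To illustrate how missing values are excluded, here is a small sketch with made-up data in which only three rows have a value in both series. The variable names are assumptions for this example.

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0, 4.0, np.nan])
b = pd.Series([2.0, 4.0, 6.0, np.nan, 10.0])

# Only rows 0-2 have values in both series, so only they contribute.
cov_ab = a.cov(b)

# Same calculation by hand: drop incomplete rows, then use np.cov
# (which, like pandas, normalizes by N - 1 by default).
complete = pd.concat([a, b], axis=1, keys=["a", "b"]).dropna()
cov_manual = np.cov(complete["a"], complete["b"])[0, 1]

# min_periods sets how many complete pairs are required for a result.
cov_strict = a.cov(b, min_periods=4)

print(cov_ab, cov_manual, cov_strict)
```

Because only three complete pairs exist, asking for `min_periods=4` yields `NaN` instead of a number.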


## 1.3 Correlation

>Use the [corr()][1] method. The `method` argument accepts `pearson` (the default), `kendall`, or `spearman`.

[1]: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html

In [6]:
s1.corr(s2, method='pearson') 

0.0004241222803424599

In [7]:
s1.corr(s2, method='kendall')

0.02381909547738693

In [8]:
s1.corr(s2, method='spearman')

0.03335333383334584

In [9]:
# Pairwise correlation of DataFrame columns
df.corr()

Unnamed: 0,0,1,2,3
0,1.0,-0.012979,0.032863,-0.224188
1,-0.012979,1.0,-0.156247,-0.137321
2,0.032863,-0.156247,1.0,0.138932
3,-0.224188,-0.137321,0.138932,1.0
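
One way to build intuition for the difference between `pearson` and `spearman`: Spearman correlation is the Pearson correlation of the ranks. The sketch below uses made-up data and `DataFrame.corr`; the column names are assumptions for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = pd.DataFrame({"x": rng.standard_normal(200),
                     "y": rng.standard_normal(200)})
# z is a monotonic but non-linear function of x.
data["z"] = np.exp(data["x"])

spearman_xy = data.corr(method="spearman").loc["x", "y"]
pearson_of_ranks_xy = data.rank().corr(method="pearson").loc["x", "y"]

pearson_xz = data.corr(method="pearson").loc["x", "z"]
spearman_xz = data.corr(method="spearman").loc["x", "z"]

print(spearman_xy, pearson_of_ranks_xy)
print(pearson_xz, spearman_xz)
```

Since `z` preserves the ordering of `x`, the rank-based measure treats the relationship as perfect, while Pearson only measures how linear it is.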


>[DataFrame.corrwith()][1] computes the correlation between like-labeled `Series` in two different `DataFrame`s. Labels that appear in only one of them yield `NaN`.

[1]: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corrwith.html

In [10]:
df1 = pd.DataFrame(np.random.randn(7, 4), columns=list('ABCD'))
df2 = pd.DataFrame(np.random.randn(7, 4), columns=list('ACDF'))
df1.corrwith(df2) 

A   -0.217784
C    0.220371
D    0.157940
B         NaN
F         NaN
dtype: float64

## 1.4 Rank

> Use the [rank()][1] method to numerically rank data (rank 1 through n) along an axis.

[1]: https://pandas.pydata.org/docs/reference/api/pandas.Series.rank.html

In [11]:
s = pd.Series([2.2, 4.5, 3.8, 5.9, 3.8])

# By default, tied values receive the average of the ranks they span,
# e.g. 3.8 ties for 2nd and 3rd, and is thus ranked as mean([2, 3]) = 2.5
s.rank() # method=average

0    1.0
1    4.0
2    2.5
3    5.0
4    2.5
dtype: float64

In [12]:
# To return largest rank for equal records
s.rank(method='max')

0    1.0
1    4.0
2    3.0
3    5.0
4    3.0
dtype: float64
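
Beyond `average` and `max`, `rank()` accepts a few other tie-breaking methods. The sketch below reuses the same values to compare them side by side; the column labels of `ranks` are chosen only for this illustration.

```python
import pandas as pd

s = pd.Series([2.2, 4.5, 3.8, 5.9, 3.8])

ranks = pd.DataFrame({
    "average": s.rank(),                    # default: ties share the mean rank
    "min": s.rank(method="min"),            # ties share the lowest rank
    "first": s.rank(method="first"),        # ties ranked in order of appearance
    "dense": s.rank(method="dense"),        # like "min", but ranks have no gaps
    "descending": s.rank(ascending=False),  # largest value gets rank 1
})
print(ranks)
```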

In [13]:
df = pd.DataFrame(np.random.randn(5,3))
print(df)

# axis=0 (default): rank the values within each column, i.e. down the rows
df.rank() # axis=0

          0         1         2
0  0.079170  1.201684 -0.451849
1 -1.037938  0.434777 -0.433522
2  1.085266 -0.728022 -0.866959
3 -0.087712 -2.083456 -2.352187
4  0.009106  0.265087 -0.795417


Unnamed: 0,0,1,2
0,4.0,5.0,4.0
1,1.0,4.0,5.0
2,5.0,2.0,2.0
3,2.0,1.0,1.0
4,3.0,3.0,3.0


In [14]:
print(df)
# axis=1: rank the values within each row, i.e. across the columns
df.rank(axis=1)

          0         1         2
0  0.079170  1.201684 -0.451849
1 -1.037938  0.434777 -0.433522
2  1.085266 -0.728022 -0.866959
3 -0.087712 -2.083456 -2.352187
4  0.009106  0.265087 -0.795417


Unnamed: 0,0,1,2
0,2.0,3.0,1.0
1,1.0,3.0,2.0
2,3.0,2.0,1.0
3,3.0,2.0,1.0
4,2.0,3.0,1.0


# 2. Window Functions

> `.rolling` applies aggregations or custom functions to sliding "windows" of the data.

> `.expanding` applies aggregations or custom functions to all the data available up to each point in time.

> `.ewm` applies exponentially weighted statistical functions.
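
As a quick orientation, the three window types can be compared on a tiny series. This is a minimal sketch; the column labels are chosen only for this illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

comparison = pd.DataFrame({
    "rolling(2).sum": s.rolling(2).sum(),            # the last 2 values only
    "expanding().sum": s.expanding().sum(),          # everything seen so far
    "ewm(alpha=0.5).mean": s.ewm(alpha=0.5).mean(),  # all values, recent ones weighted more
})
print(comparison)
```

The rolling column starts with `NaN` because the first row does not yet fill a window of 2.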

## 2.1 Rolling Windows

### 2.1.1 Method Summary

In [16]:
#pd.read_html('https://pandas.pydata.org/docs/user_guide/computation.html')[1]

In [17]:
s = pd.Series(np.random.randn(500),
              index=pd.date_range('2020-01-01', periods=500, freq='s'))

s.rolling(window=50).sum()

2020-01-01 00:00:00         NaN
2020-01-01 00:00:01         NaN
2020-01-01 00:00:02         NaN
2020-01-01 00:00:03         NaN
2020-01-01 00:00:04         NaN
                         ...   
2020-01-01 00:08:15    4.411485
2020-01-01 00:08:16    2.616218
2020-01-01 00:08:17   -0.363874
2020-01-01 00:08:18   -1.394876
2020-01-01 00:08:19   -1.173139
Freq: S, Length: 500, dtype: float64

### 2.1.2 Rolling Apply

> The `apply()` method takes a `func` argument and calls it on each window to perform generic rolling computations. `func` must reduce the window to a single value.

In [18]:
def foo(x):
    return min(x) + max(x) 

s.rolling(25).apply(foo)

2020-01-01 00:00:00         NaN
2020-01-01 00:00:01         NaN
2020-01-01 00:00:02         NaN
2020-01-01 00:00:03         NaN
2020-01-01 00:00:04         NaN
                         ...   
2020-01-01 00:08:15   -0.280747
2020-01-01 00:08:16   -0.081495
2020-01-01 00:08:17   -0.081495
2020-01-01 00:08:18   -0.081495
2020-01-01 00:08:19   -0.510110
Freq: S, Length: 500, dtype: float64

In [19]:
s.rolling(10).apply(np.ptp)

2020-01-01 00:00:00         NaN
2020-01-01 00:00:01         NaN
2020-01-01 00:00:02         NaN
2020-01-01 00:00:03         NaN
2020-01-01 00:00:04         NaN
                         ...   
2020-01-01 00:08:15    3.364631
2020-01-01 00:08:16    3.364631
2020-01-01 00:08:17    3.364631
2020-01-01 00:08:18    3.364631
2020-01-01 00:08:19    3.364631
Freq: S, Length: 500, dtype: float64
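
By default `apply()` hands each window to the function as a `Series`. Passing `raw=True` hands over a NumPy array instead, which is usually faster. A minimal sketch, with a hypothetical helper `value_range` and made-up data:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.random.default_rng(1).standard_normal(100))

def value_range(window):
    # Hypothetical helper: max minus min of one window.
    return window.max() - window.min()

as_series = s.rolling(10).apply(value_range)             # window arrives as a Series
as_ndarray = s.rolling(10).apply(value_range, raw=True)  # window arrives as an ndarray
builtin = s.rolling(10).max() - s.rolling(10).min()      # same result without apply

same = (np.allclose(as_series, as_ndarray, equal_nan=True)
        and np.allclose(as_series, builtin, equal_nan=True))
print(same)
```

When a built-in rolling method covers the computation, as in the last variant, it is generally preferable to `apply()`.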

> Weighted windows are also supported: pass the name of a [scipy.signal window function][1] through the `win_type` keyword to set the weights applied within each window. This requires SciPy to be installed.

[1]: https://docs.scipy.org/doc/scipy/reference/signal.windows.html
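
To show what a weighted window computes without depending on SciPy, the sketch below applies hand-picked weights through `apply()`. The weights and the helper name `weighted_mean` are assumptions for this illustration; a `win_type` window would take its weights from the corresponding SciPy function instead.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])

# Assumed weights for a 3-point triangular-style window (illustration only).
weights = np.array([0.5, 1.0, 0.5])

def weighted_mean(window):
    # `window` is a NumPy array because raw=True is passed below.
    return np.dot(window, weights) / weights.sum()

weighted = s.rolling(3).apply(weighted_mean, raw=True)
print(weighted)
```

The middle observation of each window counts twice as much as its neighbours, and the result is still labeled at the right edge of the window.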

### 2.1.3 Time-Aware Rolling

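Set `window` to a time offset (for example `'5s'`) to get a variable-size window that covers that span of time up to each observation. This works with a datetime-like index and is particularly useful when the index has an irregular frequency.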

In [20]:
st = pd.Series([2.5, 4.8, 1.3, np.nan, 5.9],
                index=pd.Index([pd.Timestamp(2020, 1, 1, 1, 0, 1),
                                pd.Timestamp(2020, 1, 1, 1, 0, 3),
                                pd.Timestamp(2020, 1, 1, 1, 0, 4),
                                pd.Timestamp(2020, 1, 1, 1, 0, 7),
                                pd.Timestamp(2020, 1, 1, 1, 0, 11)]))
st 

2020-01-01 01:00:01    2.5
2020-01-01 01:00:03    4.8
2020-01-01 01:00:04    1.3
2020-01-01 01:00:07    NaN
2020-01-01 01:00:11    5.9
dtype: float64

In [21]:
st.rolling('5s').max()

2020-01-01 01:00:01    2.5
2020-01-01 01:00:03    4.8
2020-01-01 01:00:04    4.8
2020-01-01 01:00:07    4.8
2020-01-01 01:00:11    5.9
dtype: float64
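
The difference between a fixed-size window and a time-based window shows up most clearly on an irregular index. A minimal sketch with made-up timestamps:

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime([
    "2020-01-01 09:00:00",
    "2020-01-01 09:00:02",
    "2020-01-01 09:00:03",
    "2020-01-01 09:00:05",
    "2020-01-01 09:00:06",
])
ts = pd.Series([0.0, 1.0, 2.0, np.nan, 4.0], index=idx)

by_count = ts.rolling(2, min_periods=1).sum()  # always the last 2 rows
by_time = ts.rolling("2s").sum()               # rows within the 2 seconds up to each timestamp

print(pd.DataFrame({"by_count": by_count, "by_time": by_time}))
```

At `09:00:05` the 2-second window contains only the missing value, so the time-based result is `NaN`, while the fixed-size window still reaches back to the previous row.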

### 2.1.4 Rolling Window Endpoints

The `closed` parameter controls which interval endpoints are included in a rolling window calculation. It accepts `'right'` (the default for offset-based windows), `'left'`, `'both'`, or `'neither'`:

In [23]:
#pd.read_html('https://pandas.pydata.org/docs/user_guide/computation.html')[3].set_index('closed')

In [24]:
st2 = pd.Series([1] * 5, index=st.index)
pd.DataFrame({'right': st2.rolling('5s', closed='right').sum(),
              'left': st2.rolling('5s', closed='left').sum(),
              'both': st2.rolling('5s', closed='both').sum(),
              'neither': st2.rolling('5s', closed='neither').sum()})

Unnamed: 0,right,left,both,neither
2020-01-01 01:00:01,1.0,,1.0,
2020-01-01 01:00:03,2.0,1.0,2.0,1.0
2020-01-01 01:00:04,3.0,2.0,3.0,2.0
2020-01-01 01:00:07,3.0,2.0,3.0,2.0
2020-01-01 01:00:11,2.0,1.0,2.0,1.0


### 2.1.5 Iteration Over Window

`Rolling` and `Expanding` objects support iteration; each step yields the current window as a `DataFrame` or `Series`.

In [25]:
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

for win in df.rolling(2):
    print(win)

   A  B
0  1  4
   A  B
0  1  4
1  2  5
   A  B
1  2  5
2  3  6
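
Iterating makes it easy to inspect exactly which rows each window contains. A small sketch with made-up data, collecting the window sizes:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40])

# Windows at the start are shorter because fewer rows are available.
rolling_sizes = [len(win) for win in s.rolling(3)]

# An expanding window grows by one row at each step.
expanding_sizes = [len(win) for win in s.expanding()]

last_window = list(s.rolling(3))[-1].tolist()

print(rolling_sizes, expanding_sizes, last_window)
```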


### 2.1.6 Centering Windows

By default, each result is labeled with the right edge of its window. Pass `center=True` to label it with the center of the window instead.

In [26]:
s.rolling(5).mean()

2020-01-01 00:00:00         NaN
2020-01-01 00:00:01         NaN
2020-01-01 00:00:02         NaN
2020-01-01 00:00:03         NaN
2020-01-01 00:00:04    0.037254
                         ...   
2020-01-01 00:08:15   -0.255695
2020-01-01 00:08:16   -0.450270
2020-01-01 00:08:17   -0.402972
2020-01-01 00:08:18   -0.702490
2020-01-01 00:08:19   -0.579671
Freq: S, Length: 500, dtype: float64

In [27]:
s.rolling(5, center=True).mean()

2020-01-01 00:00:00         NaN
2020-01-01 00:00:01         NaN
2020-01-01 00:00:02    0.037254
2020-01-01 00:00:03   -0.194403
2020-01-01 00:00:04   -0.281057
                         ...   
2020-01-01 00:08:15   -0.402972
2020-01-01 00:08:16   -0.702490
2020-01-01 00:08:17   -0.579671
2020-01-01 00:08:18         NaN
2020-01-01 00:08:19         NaN
Freq: S, Length: 500, dtype: float64
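
On a small deterministic series the effect is easier to see: for a fixed window of 3, centering moves every label one position earlier. A minimal sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

trailing = s.rolling(3).mean()               # label = last row of the window
centered = s.rolling(3, center=True).mean()  # label = middle row of the window

# For an odd window of size 3, centering equals shifting the trailing result up by one row.
same_as_shift = np.allclose(centered, trailing.shift(-1), equal_nan=True)

print(pd.DataFrame({"trailing": trailing, "centered": centered}))
print(same_as_shift)
```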

### 2.1.7 Binary Window Functions

`cov()` and `corr()` compute moving-window statistics between two objects: two `Series`, a `DataFrame` and a `Series`, or two `DataFrame`s.

In [28]:
s1 = pd.Series(np.random.randn(200))
s2 = pd.Series(np.linspace(-1, 1, 75))

s1.rolling(window=50).cov(s2).dropna().head()

49   -0.064074
50   -0.061207
51   -0.057924
52   -0.066899
53   -0.071555
dtype: float64

In [29]:
s1.rolling(window=50).corr(s2).dropna().head()

49   -0.202614
50   -0.193949
51   -0.183656
52   -0.210291
53   -0.223290
dtype: float64

In [30]:
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))

df.rolling(25).cov(s2).dropna().head()

Unnamed: 0,A,B,C,D
24,0.032381,-0.015729,-0.037395,-0.080062
25,0.034755,-0.036984,-0.031313,-0.093851
26,0.040741,-0.01809,-0.031408,-0.10843
27,0.02202,-0.012121,-0.016808,-0.088513
28,0.001322,-0.011965,-0.02323,-0.084722


In [31]:
df2 = pd.DataFrame(np.random.randn(70, 5), columns=list('ACDEI'))

# pairwise=True computes the statistic for every pair of columns from df and df2;
# without it, only columns with matching names are paired
df.rolling(25).cov(df2, pairwise=True).dropna().head()

Unnamed: 0,Unnamed: 1,A,B,C,D
24,A,0.099812,0.084246,-0.037931,-0.25432
24,C,-0.288167,0.068294,0.05961,-0.13629
24,D,0.147117,-0.007319,-0.510334,-0.27395
24,E,-0.140294,-0.228852,0.254669,0.28695
24,I,0.186871,0.376234,0.053578,-0.052365
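
As a sanity check on what the rolling covariance represents, the sketch below compares it with a direct NumPy calculation on the final window, and with the rolling variance when a series is paired with itself. The data and variable names are made up for this illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
a = pd.Series(rng.standard_normal(30))
b = pd.Series(rng.standard_normal(30))

rolling_cov = a.rolling(10).cov(b)

# The last rolling value should match the sample covariance of the final 10 pairs.
last_manual = np.cov(a.iloc[-10:], b.iloc[-10:])[0, 1]

# The covariance of a series with itself is its variance.
self_cov = a.rolling(10).cov(a)
rolling_var = a.rolling(10).var()
self_cov_is_var = np.allclose(self_cov, rolling_var, equal_nan=True)

print(rolling_cov.iloc[-1], last_manual, self_cov_is_var)
```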


# 3. Aggregations

In [32]:
s.rolling(window=15, min_periods=5).aggregate(np.std)

2020-01-01 00:00:00         NaN
2020-01-01 00:00:01         NaN
2020-01-01 00:00:02         NaN
2020-01-01 00:00:03         NaN
2020-01-01 00:00:04    1.172901
                         ...   
2020-01-01 00:08:15    0.954081
2020-01-01 00:08:16    0.923337
2020-01-01 00:08:17    1.012017
2020-01-01 00:08:18    1.059645
2020-01-01 00:08:19    1.063690
Freq: S, Length: 500, dtype: float64
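
The `min_periods` argument used above sets how many observations a window needs before a result is produced. A minimal sketch on a tiny series:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

strict = s.rolling(3).sum()                  # needs a full window of 3
relaxed = s.rolling(3, min_periods=1).sum()  # a single value is enough

print(pd.DataFrame({"strict": strict, "relaxed": relaxed}))
```

This is why the output above has its first value at the fifth row even though the window spans 15 rows.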

## 3.1 Applying Multiple Functions

Pass a list of functions to `agg`; each function produces its own column in the result.

In [33]:
s.rolling(20).agg([max, min, np.mean, np.std]).dropna().head()

Unnamed: 0,max,min,mean,std
2020-01-01 00:00:19,2.398171,-3.188709,-0.181614,1.457391
2020-01-01 00:00:20,2.398171,-3.188709,-0.195338,1.457802
2020-01-01 00:00:21,2.398171,-3.188709,-0.178194,1.460581
2020-01-01 00:00:22,2.398171,-3.188709,-0.227011,1.472324
2020-01-01 00:00:23,2.398171,-3.188709,-0.317295,1.390012


In [34]:
df.rolling(20).agg([np.mean, np.std]).dropna().head()

Unnamed: 0_level_0,A,A,B,B,C,C,D,D
Unnamed: 0_level_1,mean,std,mean,std,mean,std,mean,std
19,-0.0052,0.82157,0.265957,1.121051,-0.020599,0.981455,-0.133167,0.986771
20,0.038861,0.814978,0.289102,1.090982,-0.02295,0.978578,0.000187,1.016078
21,-0.026102,0.866989,0.196051,1.092259,-0.066525,1.001465,-0.088621,1.072933
22,0.046091,0.902908,0.173368,1.108488,-0.048073,1.00622,-0.075147,1.086425
23,0.162479,0.934632,0.245575,1.130927,-0.179093,0.954806,-0.161402,1.174323


## 3.2 Applying Different Functions to DataFrame Columns

Pass a `dict` to `agg` that maps column names to one aggregating function or a list of them.

In [35]:
df.rolling(20).agg({'A': [max, min], 'B': np.std, 'D': lambda x: np.quantile(x, 0.5)}
                  ).dropna().head()

Unnamed: 0_level_0,A,A,B,D
Unnamed: 0_level_1,max,min,std,<lambda>
19,1.360824,-1.420331,1.121051,-0.137902
20,1.360824,-1.420331,1.090982,0.094189
21,1.360824,-1.420331,1.092259,-0.137902
22,1.360824,-1.420331,1.108488,-0.137902
23,1.506076,-1.420331,1.130927,-0.175253


> The aggregating functions can also be given as strings.

In [36]:
df.rolling(20).agg({'A': 'sum', 'B': ['max', 'min'], 'D': 'std'}
                  ).dropna().head()

Unnamed: 0_level_0,A,B,B,D
Unnamed: 0_level_1,sum,max,min,std
19,-0.103998,2.522166,-1.318976,0.986771
20,0.777216,2.522166,-0.881766,1.016078
21,-0.522041,2.522166,-0.881766,1.072933
22,0.921819,2.522166,-0.881766,1.086425
23,3.249582,2.522166,-0.881766,1.174323


# 4. Expanding Windows

A common alternative to rolling statistics is to use an expanding window, which yields the value of the statistic with all the data available up to that point in time.

## 4.1 Method Summary

In [38]:
#pd.read_html('https://pandas.pydata.org/docs/user_guide/computation.html')[4]

In [39]:
s.expanding(min_periods=2).sum()

2020-01-01 00:00:00          NaN
2020-01-01 00:00:01    -0.226221
2020-01-01 00:00:02    -0.275775
2020-01-01 00:00:03     1.594472
2020-01-01 00:00:04     0.186271
                         ...    
2020-01-01 00:08:15    12.184333
2020-01-01 00:08:16    12.032256
2020-01-01 00:08:17    10.331376
2020-01-01 00:08:18     9.012293
2020-01-01 00:08:19     9.412297
Freq: S, Length: 500, dtype: float64
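
For data without missing values, some expanding statistics coincide with the familiar cumulative methods, which can help in reading the output above. A minimal sketch with made-up values:

```python
import pandas as pd

s = pd.Series([3, 1, 4, 1, 5])

expanding_sum = s.expanding().sum()
expanding_max = s.expanding().max()
expanding_mean = s.expanding(min_periods=2).mean()  # NaN until 2 values are available

sum_matches_cumsum = expanding_sum.tolist() == s.cumsum().astype(float).tolist()
max_matches_cummax = expanding_max.tolist() == s.cummax().astype(float).tolist()

print(expanding_mean)
print(sum_matches_cumsum, max_matches_cummax)
```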

# 5. Exponentially Weighted Windows

Exponentially weighted versions of several of the statistics above are available through the `.ewm` method, whose interface is similar to that of `.rolling` and `.expanding`.

>Exactly one of `span`, `com` (center of mass), `halflife`, or `alpha` must be specified for the EW functions.
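
These four parameters are different ways of stating the same smoothing factor `alpha`. The sketch below converts each one to `alpha` and checks that the results agree; the data and the specific parameter values are chosen only for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.random.default_rng(3).standard_normal(50))

span, com, halflife = 15, 80, 50

# Conversions to the smoothing factor alpha.
alpha_from_span = 2 / (span + 1)
alpha_from_com = 1 / (1 + com)
alpha_from_halflife = 1 - np.exp(-np.log(2) / halflife)

span_matches = np.allclose(s.ewm(span=span).mean(),
                           s.ewm(alpha=alpha_from_span).mean())
com_matches = np.allclose(s.ewm(com=com).mean(),
                          s.ewm(alpha=alpha_from_com).mean())
halflife_matches = np.allclose(s.ewm(halflife=halflife).mean(),
                               s.ewm(alpha=alpha_from_halflife).mean())

print(span_matches, com_matches, halflife_matches)
```

A larger `span`, `com`, or `halflife` corresponds to a smaller `alpha`, meaning older observations fade more slowly.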

## 5.1 Method Summary

In [40]:
#pd.read_html('https://pandas.pydata.org/docs/user_guide/computation.html')[5]

In [41]:
s.ewm(span=15).mean()

2020-01-01 00:00:00   -0.092702
2020-01-01 00:00:01   -0.114471
2020-01-01 00:00:02   -0.089887
2020-01-01 00:00:03    0.502201
2020-01-01 00:00:04    0.011943
                         ...   
2020-01-01 00:08:15   -0.037161
2020-01-01 00:08:16   -0.051525
2020-01-01 00:08:17   -0.257694
2020-01-01 00:08:18   -0.390368
2020-01-01 00:08:19   -0.291572
Freq: S, Length: 500, dtype: float64

In [42]:
s.ewm(halflife=50).mean()

2020-01-01 00:00:00   -0.092702
2020-01-01 00:00:01   -0.113252
2020-01-01 00:00:02   -0.091724
2020-01-01 00:00:03    0.409015
2020-01-01 00:00:04    0.035426
                         ...   
2020-01-01 00:08:15    0.096308
2020-01-01 00:08:16    0.092885
2020-01-01 00:08:17    0.068164
2020-01-01 00:08:18    0.049047
2020-01-01 00:08:19    0.053883
Freq: S, Length: 500, dtype: float64

In [43]:
s.ewm(com=80).mean()

2020-01-01 00:00:00   -0.092702
2020-01-01 00:00:01   -0.113237
2020-01-01 00:00:02   -0.091745
2020-01-01 00:00:03    0.407930
2020-01-01 00:00:04    0.035624
                         ...   
2020-01-01 00:08:15    0.094997
2020-01-01 00:08:16    0.091940
2020-01-01 00:08:17    0.069761
2020-01-01 00:08:18    0.052580
2020-01-01 00:08:19    0.056878
Freq: S, Length: 500, dtype: float64