# [Go to "Computational Tools" in Pandas Docs](https://pandas.pydata.org/docs/user_guide/computation.html)

In [1]:
import pandas as pd
import numpy as np

# 1. Statistical Functions

## 1.1 Percentage Change

>Use the [pct_change][1] method.
>
>Parameters:
>- `periods` (int, default 1)
>- `fill_method` (str, default `'pad'`; deprecated since pandas 2.1)
>- `limit` (int, default None; deprecated since pandas 2.1)
>- `freq` (DateOffset, timedelta, or str, optional)

[1]: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pct_change.html#pandas-dataframe-pct-change

In [2]:
s = pd.Series([42, 57, 35, 101, 88])
spc = s.pct_change()

# by default, each value is compared to the previous one (periods=1)
# e.g. 0.357143 == 35.7143% increase: 57 from 42
pd.concat([s, spc], axis=1, keys=['s', '% change'])

Unnamed: 0,s,% change
0,42,
1,57,0.357143
2,35,-0.385965
3,101,1.885714
4,88,-0.128713


In [3]:
df = pd.DataFrame(np.random.randint(1, 10, (7, 3)))
dfpct = df.pct_change(periods=3)

pd.concat([df, dfpct], axis=1)

Unnamed: 0,0,1,2,0.1,1.1,2.1
0,4,5,7,,,
1,5,2,6,,,
2,2,8,7,,,
3,3,9,3,-0.25,0.8,-0.571429
4,1,4,1,-0.8,1.0,-0.833333
5,7,7,3,2.5,-0.125,-0.571429
6,3,8,9,0.0,-0.111111,2.0
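
As a minimal sketch of what `periods` does, the result of `pct_change(periods=k)` can be reproduced by dividing each value by the value `k` rows earlier and subtracting 1. The names `k`, `auto` and `manual` are illustrative only.

```python
import pandas as pd

s = pd.Series([42, 57, 35, 101, 88])

# pct_change(periods=k) compares each value with the one k rows earlier
k = 2
auto = s.pct_change(periods=k)

# The same ratio written out by hand; the first k rows have nothing to compare with
manual = s / s.shift(k) - 1

print(pd.concat([auto, manual], axis=1, keys=["pct_change", "manual"]))
```

The first `k` rows are `NaN` in both columns because no earlier value exists for them.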


## 1.2 Covariance

>Use [Series.cov()][1] for covariance between series, and [DataFrame.cov()][2] for pairwise covariances among the series/columns in a dataframe. Missing values are excluded.

[1]: https://pandas.pydata.org/docs/reference/api/pandas.Series.cov.html
[2]: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.cov.html

In [4]:
s1 = pd.Series(np.random.randn(200), name='s1')
s2 = pd.Series(np.random.randn(200), name='s2')

s1.cov(s2), s2.cov(s1)

(-0.07033665510798649, -0.07033665510798649)

In [5]:
df = pd.DataFrame(np.random.rand(50, 4))
df.cov()

Unnamed: 0,0,1,2,3
0,0.114131,-0.019598,0.000311,0.016941
1,-0.019598,0.089141,-0.01424,0.018757
2,0.000311,-0.01424,0.067451,-0.004582
3,0.016941,0.018757,-0.004582,0.077566
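
The note above says missing values are excluded. A small sketch with hand-picked numbers (the series `a` and `b` are made up for illustration) shows that `Series.cov()` matches the sample covariance computed by NumPy on the complete pairs only.

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0, 4.0, np.nan])
b = pd.Series([2.0, 4.0, 7.0, 8.0, 10.0])

# Series.cov ignores the pair that contains NaN and uses the sample formula (ddof=1)
pandas_cov = a.cov(b)

# The same computation by hand, restricted to rows where both values are present
mask = a.notna() & b.notna()
numpy_cov = np.cov(a[mask], b[mask], ddof=1)[0, 1]

print(pandas_cov, numpy_cov)
```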


## 1.3 Correlation

>Use the [corr()][1] method. Set `method` to `'pearson'` (the default), `'kendall'`, or `'spearman'`.

[1]: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html

In [6]:
s1.corr(s2, method='pearson') 

-0.07193544390691518

In [7]:
s1.corr(s2, method='kendall')

-0.05155778894472361

In [8]:
s1.corr(s2, method='spearman')

-0.07335633390834771
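
To illustrate how the methods relate, the sketch below checks that the Spearman coefficient equals the Pearson coefficient computed on ranks. It uses `DataFrame.corr()` on randomly generated data; the column names `x` and `y` and the seed are arbitrary choices for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = pd.DataFrame({"x": rng.normal(size=100), "y": rng.normal(size=100)})

# Spearman correlation as reported by pandas
spearman = data.corr(method="spearman").loc["x", "y"]

# Pearson correlation of the ranked values gives the same number
pearson_on_ranks = data.rank().corr(method="pearson").loc["x", "y"]

print(spearman, pearson_on_ranks)
```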

In [9]:
# Pairwise correlation of DataFrame columns
df.corr()

Unnamed: 0,0,1,2,3
0,1.0,-0.194297,0.003539,0.180058
1,-0.194297,1.0,-0.18364,0.225571
2,0.003539,-0.18364,1.0,-0.063341
3,0.180058,0.225571,-0.063341,1.0


>[DataFrame.corrwith()][1] calculates the correlation between like-labeled `Series` in different `DataFrame`s. Labels that appear in only one of the two frames produce `NaN`.

[1]: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corrwith.html

In [10]:
df1 = pd.DataFrame(np.random.randn(7, 4), columns=list('ABCD'))
df2 = pd.DataFrame(np.random.randn(7, 4), columns=list('ACDF'))
df1.corrwith(df2) 

A    0.056148
C   -0.199097
D   -0.234600
B         NaN
F         NaN
dtype: float64
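
A small deterministic sketch of the same behavior: only the shared column `A` gets a value, and the `drop` parameter of `corrwith()` removes the unmatched labels from the result. The frames `left` and `right` are invented for the example.

```python
import pandas as pd

left = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0], "B": [4.0, 3.0, 2.0, 1.0]})
right = pd.DataFrame({"A": [2.0, 4.0, 6.0, 8.0], "C": [1.0, 0.0, 1.0, 0.0]})

# Only "A" exists in both frames; "B" and "C" come back as NaN
result = left.corrwith(right)
print(result)

# drop=True keeps only the labels that could be matched
shared_only = left.corrwith(right, drop=True)
print(shared_only)
```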

## 1.4 Rank

> Use the [rank()][1] method to numerically rank data (rank 1 through n) along an axis.

[1]: https://pandas.pydata.org/docs/reference/api/pandas.Series.rank.html

In [11]:
s = pd.Series([2.2, 4.5, 3.8, 5.9, 3.8])

# By default, tied values receive the average of the ranks they span
# e.g. 3.8 ties for 2nd and 3rd, and is thus ranked as mean([2, 3]) = 2.5
s.rank() # method=average

0    1.0
1    4.0
2    2.5
3    5.0
4    2.5
dtype: float64

In [12]:
# method='max' gives tied values the largest rank in their group
s.rank(method='max')

0    1.0
1    4.0
2    3.0
3    5.0
4    3.0
dtype: float64
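
The other tie-breaking options of `rank()` can be compared side by side. This is a sketch on the same five values; the column labels of `ranks` are only descriptive names chosen for the example.

```python
import pandas as pd

s = pd.Series([2.2, 4.5, 3.8, 5.9, 3.8])

ranks = pd.DataFrame({
    "average": s.rank(),                     # ties share the mean rank
    "min": s.rank(method="min"),             # ties share the smallest rank
    "first": s.rank(method="first"),         # ties ranked in order of appearance
    "dense": s.rank(method="dense"),         # like "min", but ranks have no gaps
    "descending": s.rank(ascending=False),   # largest value gets rank 1
    "pct": s.rank(pct=True),                 # ranks expressed as a fraction of n
})
print(ranks)
```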

In [13]:
df = pd.DataFrame(np.random.randn(5,3))
print(df)

# rank the values within each column (moving down the rows)
df.rank() # axis=0

          0         1         2
0 -1.638948  1.040232 -1.701030
1 -2.659299 -0.316685 -0.295384
2 -0.153426  0.708146  0.530945
3  0.753937  0.999822 -0.309514
4  0.175637 -0.378550 -0.700197


Unnamed: 0,0,1,2
0,2.0,5.0,1.0
1,1.0,2.0,4.0
2,3.0,3.0,5.0
3,5.0,4.0,3.0
4,4.0,1.0,2.0


In [14]:
print(df)
# rank the values within each row (moving across the columns)
df.rank(axis=1)

          0         1         2
0 -1.638948  1.040232 -1.701030
1 -2.659299 -0.316685 -0.295384
2 -0.153426  0.708146  0.530945
3  0.753937  0.999822 -0.309514
4  0.175637 -0.378550 -0.700197


Unnamed: 0,0,1,2
0,2.0,3.0,1.0
1,1.0,2.0,3.0
2,1.0,3.0,2.0
3,2.0,3.0,1.0
4,3.0,2.0,1.0


# 2. Window Functions

> `.rolling` aggregates, or applies a function to, a sliding "window" of the data.

> `.expanding` aggregates, or applies a function to, all the data available up to each point in time.

> `.ewm` applies exponentially weighted statistical functions.
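
The three window types can be compared on a tiny series. This is only a sketch: the window size of 3 and `alpha=0.5` are arbitrary values picked to keep the numbers easy to follow.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

comparison = pd.DataFrame({
    # sum of the current value and the two before it
    "rolling(3).sum": s.rolling(3).sum(),
    # sum of everything seen so far
    "expanding().sum": s.expanding().sum(),
    # weighted mean where older values count progressively less
    "ewm(alpha=0.5).mean": s.ewm(alpha=0.5).mean(),
})
print(comparison)
```

The rolling column starts with two `NaN` values because a full window of three observations is not yet available there.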

## 2.1 Rolling Windows

### 2.1.1 Method Summary

In [15]:
pd.read_html('https://pandas.pydata.org/docs/user_guide/computation.html')[1]

Unnamed: 0,Method,Description
0,count(),Number of non-null observations
1,sum(),Sum of values
2,mean(),Mean of values
3,median(),Arithmetic median of values
4,min(),Minimum
5,max(),Maximum
6,std(),Sample standard deviation
7,var(),Sample variance
8,skew(),Sample skewness (3rd moment)
9,kurt(),Sample kurtosis (4th moment)


In [16]:
s = pd.Series(np.random.randn(500),
              index=pd.date_range('2020-01-01', periods=500, freq='s'))

s.rolling(window=50).sum()

2020-01-01 00:00:00         NaN
2020-01-01 00:00:01         NaN
2020-01-01 00:00:02         NaN
2020-01-01 00:00:03         NaN
2020-01-01 00:00:04         NaN
                         ...   
2020-01-01 00:08:15    3.121968
2020-01-01 00:08:16    5.669720
2020-01-01 00:08:17    4.869147
2020-01-01 00:08:18    3.302875
2020-01-01 00:08:19    3.131162
Freq: S, Length: 500, dtype: float64
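
The leading `NaN` values appear because a fixed-size window needs `window` observations before it produces a result. The `min_periods` parameter of `rolling()` lowers that requirement, as this small sketch shows (the variable names are illustrative).

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Default: a result appears only once 3 observations are available
strict = s.rolling(window=3).sum()

# min_periods=1: partial windows at the start are aggregated too
relaxed = s.rolling(window=3, min_periods=1).sum()

print(pd.DataFrame({"strict": strict, "relaxed": relaxed}))
```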

### 2.1.2 Rolling Apply

> The `apply()` method takes a function `func` and calls it on each window, which allows arbitrary rolling computations. `func` must reduce each window to a single value.

In [17]:
def foo(x):
    return min(x) + max(x) 

s.rolling(25).apply(foo)

2020-01-01 00:00:00         NaN
2020-01-01 00:00:01         NaN
2020-01-01 00:00:02         NaN
2020-01-01 00:00:03         NaN
2020-01-01 00:00:04         NaN
                         ...   
2020-01-01 00:08:15   -1.044606
2020-01-01 00:08:16    0.618902
2020-01-01 00:08:17    0.618902
2020-01-01 00:08:18    0.618902
2020-01-01 00:08:19    0.618902
Freq: S, Length: 500, dtype: float64

In [18]:
s.rolling(10).apply(np.ptp)

2020-01-01 00:00:00         NaN
2020-01-01 00:00:01         NaN
2020-01-01 00:00:02         NaN
2020-01-01 00:00:03         NaN
2020-01-01 00:00:04         NaN
                         ...   
2020-01-01 00:08:15    2.842497
2020-01-01 00:08:16    4.508891
2020-01-01 00:08:17    4.508891
2020-01-01 00:08:18    4.508891
2020-01-01 00:08:19    4.141751
Freq: S, Length: 500, dtype: float64
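
`apply()` also has a `raw` parameter that controls what `func` receives: a `Series` for each window when `raw=False`, or a plain NumPy array when `raw=True`. The sketch below uses a hypothetical helper named `spread` that works with either input, so both calls give the same result.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(10, dtype=float) ** 2)

def spread(x):
    # Works for both a Series and an ndarray, since both have max() and min()
    return x.max() - x.min()

as_series = s.rolling(4).apply(spread, raw=False)  # func receives a Series
as_array = s.rolling(4).apply(spread, raw=True)    # func receives an ndarray

print(pd.DataFrame({"raw=False": as_series, "raw=True": as_array}))
```

Passing arrays avoids building a `Series` for every window, which is usually the cheaper option when `func` only needs the values.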

> Rolling windows can also be weighted. Pass the name of a [scipy.signal window function][1] through the `win_type` keyword of `rolling()` to choose the weights.

[1]: https://docs.scipy.org/doc/scipy/reference/signal.windows.html
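
To make the effect of `win_type` concrete, the sketch below computes a triangular weighted mean by hand with `apply()` and, if SciPy is installed, compares it with `win_type="triang"`. The weights `[0.5, 1.0, 0.5]` are the triangular weights for a window of three; the helper `weighted_mean` is a name made up for this example.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0, 16.0, 25.0])
weights = np.array([0.5, 1.0, 0.5])  # triangular weights for a window of 3

def weighted_mean(x):
    # Weighted average of one window
    return np.sum(weights * x) / np.sum(weights)

manual = s.rolling(3).apply(weighted_mean, raw=True)

# win_type needs SciPy; skip the comparison when it is not available
try:
    via_win_type = s.rolling(3, win_type="triang").mean()
except ImportError:
    via_win_type = None

print(manual)
print(via_win_type)
```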

### 2.1.3 Time-Aware Rolling

Set `window` to a time offset (for example `'5s'`) to define the window by elapsed time rather than by a fixed number of observations. This is particularly useful when the index is irregularly spaced.

In [19]:
st = pd.Series([2.5, 4.8, 1.3, np.nan, 5.9],
                index=pd.Index([pd.Timestamp(2020, 1, 1, 1, 0, 1),
                                pd.Timestamp(2020, 1, 1, 1, 0, 3),
                                pd.Timestamp(2020, 1, 1, 1, 0, 4),
                                pd.Timestamp(2020, 1, 1, 1, 0, 7),
                                pd.Timestamp(2020, 1, 1, 1, 0, 11)]))
st 

2020-01-01 01:00:01    2.5
2020-01-01 01:00:03    4.8
2020-01-01 01:00:04    1.3
2020-01-01 01:00:07    NaN
2020-01-01 01:00:11    5.9
dtype: float64

In [20]:
st.rolling('5s').max()

2020-01-01 01:00:01    2.5
2020-01-01 01:00:03    4.8
2020-01-01 01:00:04    4.8
2020-01-01 01:00:07    4.8
2020-01-01 01:00:11    5.9
dtype: float64
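
The difference between a count-based and a time-based window shows up clearly on an irregular index. This sketch reuses the timestamps from above with made-up values; `by_count` and `by_time` are illustrative names.

```python
import pandas as pd

idx = pd.to_datetime([
    "2020-01-01 01:00:01",
    "2020-01-01 01:00:03",
    "2020-01-01 01:00:04",
    "2020-01-01 01:00:07",
    "2020-01-01 01:00:11",
])
ts = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=idx)

# Always the last 2 observations, however far apart they are
by_count = ts.rolling(2).sum()

# Only the observations within the last 2 seconds of each timestamp
by_time = ts.rolling("2s").sum()

print(pd.DataFrame({"rolling(2)": by_count, "rolling('2s')": by_time}))
```

With the time offset, a row whose neighbors are more than two seconds away is aggregated on its own.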

### 2.1.4 Rolling Window Endpoints

The inclusion of the interval endpoints in rolling window calculations can be specified with the `closed` parameter:

In [21]:
pd.read_html('https://pandas.pydata.org/docs/user_guide/computation.html')[3].set_index('closed')

Unnamed: 0_level_0,Description,Default for
closed,Unnamed: 1_level_1,Unnamed: 2_level_1
right,close right endpoint,time-based windows
left,close left endpoint,
both,close both endpoints,fixed windows
neither,open endpoints,


In [22]:
st2 = pd.Series([1] * 5, index=st.index)
pd.DataFrame({'right': st2.rolling('5s', closed='right').sum(),
              'left': st2.rolling('5s', closed='left').sum(),
              'both': st2.rolling('5s', closed='both').sum(),
              'neither': st2.rolling('5s', closed='neither').sum()})

Unnamed: 0,right,left,both,neither
2020-01-01 01:00:01,1.0,,1.0,
2020-01-01 01:00:03,2.0,1.0,2.0,1.0
2020-01-01 01:00:04,3.0,2.0,3.0,2.0
2020-01-01 01:00:07,3.0,2.0,3.0,2.0
2020-01-01 01:00:11,2.0,1.0,2.0,1.0


### 2.1.5 Iteration Over Window

`Rolling` and `Expanding` objects support iteration: looping over one yields each window in turn.

In [23]:
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

for win in df.rolling(2):
    print(win)

   A  B
0  1  4
   A  B
0  1  4
1  2  5
   A  B
1  2  5
2  3  6
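
The same loop works on an `Expanding` object, where each window grows by one row. A short sketch on the same frame (the list names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# Each expanding window contains every row seen so far
expanding_windows = [win["A"].tolist() for win in df.expanding()]

# Rolling windows of size 2: the first one is partial
rolling_sizes = [len(win) for win in df.rolling(2)]

print(expanding_windows)
print(rolling_sizes)
```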


### 2.1.6 Centering Windows

Use the `center` keyword to set the labels at the center (the default is to set the labels to the right edge of the window).

In [24]:
s.rolling(5).mean()

2020-01-01 00:00:00         NaN
2020-01-01 00:00:01         NaN
2020-01-01 00:00:02         NaN
2020-01-01 00:00:03         NaN
2020-01-01 00:00:04   -0.066459
                         ...   
2020-01-01 00:08:15   -0.198437
2020-01-01 00:08:16    0.301173
2020-01-01 00:08:17    0.252032
2020-01-01 00:08:18   -0.129130
2020-01-01 00:08:19    0.125827
Freq: S, Length: 500, dtype: float64

In [25]:
s.rolling(5, center=True).mean()

2020-01-01 00:00:00         NaN
2020-01-01 00:00:01         NaN
2020-01-01 00:00:02   -0.066459
2020-01-01 00:00:03    0.065352
2020-01-01 00:00:04    0.298356
                         ...   
2020-01-01 00:08:15    0.252032
2020-01-01 00:08:16   -0.129130
2020-01-01 00:08:17    0.125827
2020-01-01 00:08:18         NaN
2020-01-01 00:08:19         NaN
Freq: S, Length: 500, dtype: float64
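
On a short series the effect of `center` is easy to see: with a window of 3, the centered result is the right-labeled result moved one position earlier. A minimal sketch with made-up values:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Label each window at its right edge (default)
right_labeled = s.rolling(3).mean()

# Label each window at its middle element
centered = s.rolling(3, center=True).mean()

print(pd.DataFrame({"right": right_labeled, "center": centered}))
```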

### 2.1.7 Binary Window Functions

`cov()` and `corr()` can compute moving window statistics between two `Series`, or for any `DataFrame`/`Series` or `DataFrame`/`DataFrame` combination.

In [26]:
s1 = pd.Series(np.random.randn(200))
s2 = pd.Series(np.linspace(-1, 1, 75))

s1.rolling(window=50).cov(s2).dropna().head()

49   -0.072340
50   -0.055808
51   -0.052393
52   -0.061466
53   -0.062777
dtype: float64

In [27]:
s1.rolling(window=50).corr(s2).dropna().head()

49   -0.165880
50   -0.127058
51   -0.119802
52   -0.138848
53   -0.141888
dtype: float64

In [28]:
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))

df.rolling(25).cov(s2).dropna().head()

Unnamed: 0,A,B,C,D
24,-0.057412,0.01666,0.040934,0.023718
25,-0.0295,0.010963,0.058961,0.027977
26,-0.043282,0.041234,0.103737,0.032296
27,-0.021037,0.052134,0.112376,0.004187
28,0.00115,0.05492,0.092018,0.015283


In [29]:
df2 = pd.DataFrame(np.random.randn(70, 5), columns=list('ACDEI'))

# pairwise=True computes every column pair; without it only matching column names are paired
df.rolling(25).cov(df2, pairwise=True).dropna().head()

Unnamed: 0,Unnamed: 1,A,B,C,D
24,A,0.160062,0.447953,0.069412,-0.118122
24,C,0.135157,0.030076,-0.054503,0.119467
24,D,-0.090223,-0.123577,-0.190041,0.153211
24,E,0.074544,0.129176,0.245553,0.095869
24,I,0.427202,0.240583,-0.02112,-0.216323
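
The sketch below contrasts the two `DataFrame`/`DataFrame` modes on small random frames. The frames `left` and `right`, their column labels, and the seed are assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
left = pd.DataFrame(rng.normal(size=(6, 2)), columns=["A", "B"])
right = pd.DataFrame(rng.normal(size=(6, 2)), columns=["A", "C"])

# Default: columns are paired by label, so only "A" has a partner
matched = left.rolling(3).cov(right)

# pairwise=True: every column of `left` against every column of `right`;
# the result is indexed by (row label, column of `right`)
pairwise = left.rolling(3).cov(right, pairwise=True)

print(matched["A"])
print(pairwise.dropna().head())
```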


# 3. Aggregations

In [30]:
s.rolling(window=15, min_periods=5).aggregate(np.std)

2020-01-01 00:00:00         NaN
2020-01-01 00:00:01         NaN
2020-01-01 00:00:02         NaN
2020-01-01 00:00:03         NaN
2020-01-01 00:00:04    0.978224
                         ...   
2020-01-01 00:08:15    0.824425
2020-01-01 00:08:16    1.075773
2020-01-01 00:08:17    1.062857
2020-01-01 00:08:18    1.081028
2020-01-01 00:08:19    1.075770
Freq: S, Length: 500, dtype: float64

## 3.1 Applying Multiple Functions

Pass a list of functions to `agg()` to compute several statistics at once; each function becomes a column in the result.

In [31]:
s.rolling(20).agg([max, min, np.mean, np.std]).dropna().head()

Unnamed: 0,max,min,mean,std
2020-01-01 00:00:19,1.045701,-2.216395,-0.239092,0.877622
2020-01-01 00:00:20,1.045701,-2.216395,-0.246477,0.885425
2020-01-01 00:00:21,1.045701,-2.216395,-0.139763,0.899956
2020-01-01 00:00:22,1.045701,-2.216395,-0.159944,0.878229
2020-01-01 00:00:23,1.045701,-2.216395,-0.210316,0.883316


In [32]:
df.rolling(20).agg([np.mean, np.std]).dropna().head()

Unnamed: 0_level_0,A,A,B,B,C,C,D,D
Unnamed: 0_level_1,mean,std,mean,std,mean,std,mean,std
19,-0.086699,1.105987,0.15483,0.897953,-0.378123,0.817277,-0.138586,1.149093
20,-0.2189,0.964423,0.163654,0.898648,-0.353397,0.821864,-0.097206,1.175845
21,-0.097373,1.030695,0.154619,0.898472,-0.455909,0.739727,-0.144674,1.163109
22,-0.172383,1.025177,0.146045,0.889797,-0.48887,0.690804,-0.200479,1.197468
23,-0.215198,1.021522,0.100831,0.905433,-0.406956,0.732196,-0.187642,1.197415


## 3.2 Applying Different Functions to DataFrame Columns

Pass a `dict` to `agg()` that maps each column name to one aggregating function or to a list of them.

In [33]:
df.rolling(20).agg({'A': [max, min], 'B': np.std, 'D': lambda x: np.quantile(x, 0.5)}
                  ).dropna().head()

Unnamed: 0_level_0,A,A,B,D
Unnamed: 0_level_1,max,min,std,<lambda>
19,2.222313,-2.181016,0.897953,-0.316002
20,1.111283,-2.181016,0.898648,-0.316002
21,1.573933,-2.181016,0.898472,-0.316002
22,1.573933,-2.181016,0.889797,-0.338491
23,1.573933,-2.181016,0.905433,-0.323687


> The aggregating functions can also be given as strings

In [34]:
df.rolling(20).agg({'A': 'sum', 'B': ['max', 'min'], 'D': 'std'}
                  ).dropna().head()

Unnamed: 0_level_0,A,B,B,D
Unnamed: 0_level_1,sum,max,min,std
19,-1.733971,1.783753,-1.530046,1.149093
20,-4.378009,1.783753,-1.530046,1.175845
21,-1.947456,1.783753,-1.530046,1.163109
22,-3.447661,1.783753,-1.530046,1.197468
23,-4.303962,1.783753,-1.530046,1.197415


# 4. Expanding Windows

A common alternative to rolling statistics is to use an expanding window, which yields the value of the statistic with all the data available up to that point in time.

## 4.1 Method Summary

In [35]:
pd.read_html('https://pandas.pydata.org/docs/user_guide/computation.html')[4]

Unnamed: 0,Function,Description
0,count(),Number of non-null observations
1,sum(),Sum of values
2,mean(),Mean of values
3,median(),Arithmetic median of values
4,min(),Minimum
5,max(),Maximum
6,std(),Sample standard deviation
7,var(),Sample variance
8,skew(),Sample skewness (3rd moment)
9,kurt(),Sample kurtosis (4th moment)


In [36]:
s.expanding(min_periods=2).sum()

2020-01-01 00:00:00          NaN
2020-01-01 00:00:01    -2.198507
2020-01-01 00:00:02    -1.237187
2020-01-01 00:00:03    -1.003094
2020-01-01 00:00:04    -0.332293
                         ...    
2020-01-01 00:08:15   -14.162348
2020-01-01 00:08:16   -11.598452
2020-01-01 00:08:17   -12.214303
2020-01-01 00:08:18   -13.222610
2020-01-01 00:08:19   -13.525680
Freq: S, Length: 500, dtype: float64
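
An expanding window can be read as a cumulative statistic. As a sketch on made-up values, `expanding().sum()` matches `cumsum()`, and it also matches a rolling window that is as wide as the whole series with `min_periods=1`.

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0, 1.0, 4.0, 1.0, 5.0])

expanding_sum = s.expanding().sum()
cumulative_sum = s.cumsum()

# A rolling window covering the full series behaves like an expanding one
wide_rolling = s.rolling(window=len(s), min_periods=1).sum()

print(pd.DataFrame({
    "expanding": expanding_sum,
    "cumsum": cumulative_sum,
    "wide rolling": wide_rolling,
}))
```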

# 5. Exponentially Weighted Windows

A related set of functions computes exponentially weighted versions of several of the statistics above. The `.ewm` method exposes them through an interface similar to `.rolling` and `.expanding`.

>Specify exactly one of `span`, `com` (center of mass), `halflife`, or `alpha` when calling `.ewm`.
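
The four parameters are different ways of stating the same smoothing factor. Following the relations given in the pandas documentation, the sketch below converts `span`, `com` and `halflife` to `alpha` and checks that each pair produces the same result. The test series and seed are arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(rng.normal(size=50))

# Conversions to the smoothing factor alpha
alpha_from_span = 2 / (15 + 1)                       # span = 15
alpha_from_com = 1 / (1 + 80)                        # com = 80
alpha_from_halflife = 1 - np.exp(-np.log(2) / 50)    # halflife = 50

same_as_span = np.allclose(s.ewm(span=15).mean(),
                           s.ewm(alpha=alpha_from_span).mean())
same_as_com = np.allclose(s.ewm(com=80).mean(),
                          s.ewm(alpha=alpha_from_com).mean())
same_as_halflife = np.allclose(s.ewm(halflife=50).mean(),
                               s.ewm(alpha=alpha_from_halflife).mean())

print(same_as_span, same_as_com, same_as_halflife)
```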

## 5.1 Method Summary

In [37]:
pd.read_html('https://pandas.pydata.org/docs/user_guide/computation.html')[5]

Unnamed: 0,Function,Description
0,mean(),EW moving average
1,var(),EW moving variance
2,std(),EW moving standard deviation
3,corr(),EW moving correlation
4,cov(),EW moving covariance


In [38]:
s.ewm(span=15).mean()

2020-01-01 00:00:00   -1.053725
2020-01-01 00:00:01   -1.102289
2020-01-01 00:00:02   -0.320804
2020-01-01 00:00:03   -0.153189
2020-01-01 00:00:04    0.058268
                         ...   
2020-01-01 00:08:15   -0.225316
2020-01-01 00:08:16    0.123336
2020-01-01 00:08:17    0.030937
2020-01-01 00:08:18   -0.098968
2020-01-01 00:08:19   -0.124481
Freq: S, Length: 500, dtype: float64

In [39]:
s.ewm(halflife=50).mean()

2020-01-01 00:00:00   -1.053725
2020-01-01 00:00:01   -1.099569
2020-01-01 00:00:02   -0.403061
2020-01-01 00:00:03   -0.240445
2020-01-01 00:00:04   -0.053108
                         ...   
2020-01-01 00:08:15    0.028846
2020-01-01 00:08:16    0.063782
2020-01-01 00:08:17    0.054416
2020-01-01 00:08:18    0.039771
2020-01-01 00:08:19    0.035046
Freq: S, Length: 500, dtype: float64

In [40]:
s.ewm(com=80).mean()

2020-01-01 00:00:00   -1.053725
2020-01-01 00:00:01   -1.099536
2020-01-01 00:00:02   -0.404033
2020-01-01 00:00:03   -0.241517
2020-01-01 00:00:04   -0.054492
                         ...   
2020-01-01 00:08:15    0.024456
2020-01-01 00:08:16    0.055872
2020-01-01 00:08:17    0.047562
2020-01-01 00:08:18    0.034500
2020-01-01 00:08:19    0.030324
Freq: S, Length: 500, dtype: float64