# [Go to "Computational Tools" in Pandas Docs](https://pandas.pydata.org/docs/user_guide/computation.html)

In [1]:
import pandas as pd
import numpy as np

# 1. Statistical Functions

## 1.1 Percentage Change

>Use the [pct_change][1] method.
>
>Parameters:
>- `periods` (int, default 1)
>- `fill_method` (str, default 'pad')
>- `limit` (int, default None)
>- `freq` (DateOffset, timedelta, or str, optional)

[1]: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pct_change.html#pandas-dataframe-pct-change

In [2]:
s = pd.Series([42, 57, 35, 101, 88])
spc = s.pct_change()

# By default, each value is compared to the previous one (periods=1)
# e.g. 0.357143 is a 35.7143% increase (from 42 to 57)
pd.concat([s, spc], axis=1, keys=['s', '% change'])

Unnamed: 0,s,% change
0,42,
1,57,0.357143
2,35,-0.385965
3,101,1.885714
4,88,-0.128713


In [3]:
df = pd.DataFrame(np.random.randint(1, 10, (7, 3)))
dfpct = df.pct_change(periods=3)

pd.concat([df, dfpct], axis=1)

Unnamed: 0,0,1,2,0,1,2
0,6,9,9,,,
1,9,7,7,,,
2,6,9,9,,,
3,7,8,3,0.166667,-0.111111,-0.666667
4,3,2,8,-0.666667,-0.714286,0.142857
5,1,9,4,-0.833333,0.0,-0.555556
6,2,9,9,-0.714286,0.125,2.0
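The percentage change can also be reproduced by hand with `shift()`, which makes the role of `periods` explicit. This is a minimal sketch reusing the series from the first example; `manual_1` and `manual_2` are illustrative names.

```python
import pandas as pd

s = pd.Series([42, 57, 35, 101, 88])

# pct_change(periods=n) compares each value with the one n rows earlier
manual_1 = (s - s.shift(1)) / s.shift(1)
manual_2 = s / s.shift(2) - 1

print(pd.concat([s.pct_change(), manual_1], axis=1, keys=['pct_change', 'manual']))
print(pd.concat([s.pct_change(periods=2), manual_2], axis=1, keys=['pct_change', 'manual']))
```

The first `periods` rows are always `NaN` because they have no earlier value to compare against.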


## 1.2 Covariance

>Use [Series.cov()][1] for covariance between series, and [DataFrame.cov()][2] for pairwise covariances among the series/columns in a dataframe. Missing values are excluded.

[1]: https://pandas.pydata.org/docs/reference/api/pandas.Series.cov.html
[2]: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.cov.html

In [4]:
s1 = pd.Series(np.random.randn(200), name='s1')
s2 = pd.Series(np.random.randn(200), name='s2')

s1.cov(s2), s2.cov(s1)

(0.1418521704781405, 0.1418521704781405)

In [5]:
df = pd.DataFrame(np.random.rand(50, 4))
df.cov()

Unnamed: 0,0,1,2,3
0,0.100312,-0.019691,-0.027244,-0.006723
1,-0.019691,0.077404,0.001938,-0.008211
2,-0.027244,0.001938,0.083108,7.3e-05
3,-0.006723,-0.008211,7.3e-05,0.071362
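To see what "missing values are excluded" means in practice, the sample covariance can be computed by hand on the rows where both series have a value. This is a small sketch with made-up data; `a`, `b` and `pair` are illustrative names.

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])
b = pd.Series([2.0, 1.0, 3.0, np.nan, 6.0])

# Keep only the rows where both series have a value
pair = pd.concat([a, b], axis=1, keys=['a', 'b']).dropna()

# Sample covariance: sum of products of deviations, divided by (n - 1)
dev_a = pair['a'] - pair['a'].mean()
dev_b = pair['b'] - pair['b'].mean()
manual_cov = (dev_a * dev_b).sum() / (len(pair) - 1)

print(manual_cov, a.cov(b))
```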


## 1.3 Correlation

>Use the [corr()][1] method. You can specify `method` as one of `pearson` (default), `kendall` or `spearman`.

[1]: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html

In [6]:
s1.corr(s2, method='pearson') 

0.12996730235093143

In [7]:
s1.corr(s2, method='kendall')

0.1307537688442211

In [8]:
s1.corr(s2, method='spearman')

0.19012075301882547

In [9]:
# Pairwise correlation of DataFrame columns
df.corr()

Unnamed: 0,0,1,2,3
0,1.0,-0.223465,-0.298378,-0.079457
1,-0.223465,1.0,0.024165,-0.110477
2,-0.298378,0.024165,1.0,0.00095
3,-0.079457,-0.110477,0.00095,1.0
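One way to build intuition for the `spearman` method: it is the Pearson correlation of the ranks. The sketch below checks this on random data with `DataFrame.corr()`; the column names `x` and `y` and the seed are arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
frame = pd.DataFrame({'x': rng.standard_normal(50),
                      'y': rng.standard_normal(50)})

# Spearman correlation computed directly
spearman = frame.corr(method='spearman').loc['x', 'y']

# Pearson correlation of the (average) ranks
pearson_of_ranks = frame.rank().corr(method='pearson').loc['x', 'y']

print(spearman, pearson_of_ranks)
```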


>[DataFrame.corrwith()][1] calculates the correlation between like-labeled `Series` in two different `DataFrame` objects. Labels that appear in only one of them yield `NaN`.

[1]: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corrwith.html

In [10]:
df1 = pd.DataFrame(np.random.randn(7, 4), columns=list('ABCD'))
df2 = pd.DataFrame(np.random.randn(7, 4), columns=list('ACDF'))
df1.corrwith(df2) 

A    0.433448
C   -0.188393
D    0.118545
B         NaN
F         NaN
dtype: float64

## 1.4 Rank

> Use the [rank()][1] method to numerically rank data (rank 1 through n) along an axis.

[1]: https://pandas.pydata.org/docs/reference/api/pandas.Series.rank.html

In [11]:
s = pd.Series([2.2, 4.5, 3.8, 5.9, 3.8])

# By default, tied values receive the average of the ranks they span
# e.g. 3.8 ties for 2nd and 3rd, and is thus ranked as mean([2, 3]) = 2.5
s.rank() # method=average

0    1.0
1    4.0
2    2.5
3    5.0
4    2.5
dtype: float64

In [12]:
# Assign tied values the largest rank in their group
s.rank(method='max')

0    1.0
1    4.0
2    3.0
3    5.0
4    3.0
dtype: float64
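Besides `average` and `max`, `rank()` also accepts `min`, `first` and `dense`, plus the `ascending` and `pct` keywords. The sketch below lines them up on the same series so the tie handling can be compared side by side; the `ranks` frame and its column names are illustrative.

```python
import pandas as pd

s = pd.Series([2.2, 4.5, 3.8, 5.9, 3.8])

ranks = pd.DataFrame({
    'value': s,
    'min': s.rank(method='min'),        # ties get the smallest rank in the group
    'first': s.rank(method='first'),    # ties are broken by order of appearance
    'dense': s.rank(method='dense'),    # like 'min', but ranks have no gaps
    'descending': s.rank(ascending=False),
    'pct': s.rank(pct=True),            # average rank divided by the number of values
})
print(ranks)
```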

In [13]:
df = pd.DataFrame(np.random.randn(5,3))
print(df)

# rank the values within each column (down the rows)
df.rank() # axis=0

          0         1         2
0 -1.647236  0.165342 -1.472941
1  0.611870  0.023304 -0.521819
2 -0.623801  0.160415  0.483638
3 -0.624204 -0.822573 -2.145984
4 -0.470108  0.374735  0.017061


Unnamed: 0,0,1,2
0,1.0,4.0,2.0
1,5.0,2.0,3.0
2,3.0,3.0,5.0
3,2.0,1.0,1.0
4,4.0,5.0,4.0


In [14]:
print(df)
# rank the values within each row (across the columns)
df.rank(axis=1)

          0         1         2
0 -1.647236  0.165342 -1.472941
1  0.611870  0.023304 -0.521819
2 -0.623801  0.160415  0.483638
3 -0.624204 -0.822573 -2.145984
4 -0.470108  0.374735  0.017061


Unnamed: 0,0,1,2
0,1.0,3.0,2.0
1,3.0,2.0,1.0
2,1.0,2.0,3.0
3,3.0,2.0,1.0
4,1.0,3.0,2.0


# 2. Window Functions

> `.rolling` for aggregations or to apply functions to "windows" of the data

>`.expanding` for aggregations or to apply functions to all the data available up to that point in time. 

>`.ewm` applies exponentially weighted statistical functions
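The difference between the three window types is easiest to see on a tiny series. This is a minimal sketch; the window size of 3 and `alpha=0.5` are arbitrary choices for illustration.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

compare = pd.DataFrame({
    'value': s,
    'rolling(3)': s.rolling(3).mean(),      # last 3 values only
    'expanding': s.expanding().mean(),      # everything seen so far
    'ewm(0.5)': s.ewm(alpha=0.5).mean(),    # everything so far, recent values weighted more
})
print(compare)
```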

## 2.1 Rolling Windows

### 2.1.1 Method Summary

In [15]:
pd.read_html('https://pandas.pydata.org/docs/user_guide/computation.html')[1]

Unnamed: 0,Method,Description
0,count(),Number of non-null observations
1,sum(),Sum of values
2,mean(),Mean of values
3,median(),Arithmetic median of values
4,min(),Minimum
5,max(),Maximum
6,std(),Sample standard deviation
7,var(),Sample variance
8,skew(),Sample skewness (3rd moment)
9,kurt(),Sample kurtosis (4th moment)


In [16]:
s = pd.Series(np.random.randn(500),
              index=pd.date_range('2020-01-01', periods=500, freq='s'))

s.rolling(window=50).sum()

2020-01-01 00:00:00         NaN
2020-01-01 00:00:01         NaN
2020-01-01 00:00:02         NaN
2020-01-01 00:00:03         NaN
2020-01-01 00:00:04         NaN
                         ...   
2020-01-01 00:08:15   -4.541972
2020-01-01 00:08:16   -5.936031
2020-01-01 00:08:17   -6.439987
2020-01-01 00:08:18   -6.357397
2020-01-01 00:08:19   -5.689216
Freq: S, Length: 500, dtype: float64

### 2.1.2 Rolling Apply

> The `apply()` method takes a `func` argument and performs generic rolling computations. `func` is called once per window and must return a single value.

In [17]:
def foo(x):
    return min(x) + max(x) 

s.rolling(25).apply(foo)

2020-01-01 00:00:00         NaN
2020-01-01 00:00:01         NaN
2020-01-01 00:00:02         NaN
2020-01-01 00:00:03         NaN
2020-01-01 00:00:04         NaN
                         ...   
2020-01-01 00:08:15   -0.222455
2020-01-01 00:08:16   -0.222455
2020-01-01 00:08:17   -0.222455
2020-01-01 00:08:18   -0.222455
2020-01-01 00:08:19    0.123818
Freq: S, Length: 500, dtype: float64

In [18]:
s.rolling(10).apply(np.ptp)

2020-01-01 00:00:00         NaN
2020-01-01 00:00:01         NaN
2020-01-01 00:00:02         NaN
2020-01-01 00:00:03         NaN
2020-01-01 00:00:04         NaN
                         ...   
2020-01-01 00:08:15    3.075773
2020-01-01 00:08:16    3.041160
2020-01-01 00:08:17    2.250781
2020-01-01 00:08:18    1.908662
2020-01-01 00:08:19    1.918418
Freq: S, Length: 500, dtype: float64
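By default `apply()` passes each window to the function as a `Series`; with `raw=True` it passes a NumPy array instead, which is usually faster when the function only needs the values. A small sketch, with `spread` as an illustrative helper that works on both input types:

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0, 1.0, 4.0, 1.0, 5.0, 9.0])

def spread(window):
    # .max() and .min() exist on both Series and ndarray
    return window.max() - window.min()

as_series = s.rolling(3).apply(spread)            # raw=False: window is a Series
as_array = s.rolling(3).apply(spread, raw=True)   # raw=True: window is an ndarray

print(pd.concat([as_series, as_array], axis=1, keys=['raw=False', 'raw=True']))
```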

> Weighted windows from [scipy.signal][1] can be used by passing the `win_type` keyword, which names the window function that supplies the weights. This requires SciPy to be installed.

[1]: https://docs.scipy.org/doc/scipy/reference/signal.windows.html
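The idea behind a weighted window can be sketched without SciPy by applying a weight vector by hand. The weights below are a hypothetical triangular shape chosen for illustration, and normalizing by their sum is an assumption of this sketch; check the pandas documentation for how a given `win_type` is normalized.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 2.0, 8.0, 3.0])

# Hypothetical weights: the middle of each window counts twice as much as the edges
weights = np.array([0.5, 1.0, 0.5])

def weighted_mean(window):
    # window arrives as an ndarray because raw=True is passed below
    return np.dot(window, weights) / weights.sum()

weighted = s.rolling(3).apply(weighted_mean, raw=True)
plain = s.rolling(3).mean()

print(pd.concat([plain, weighted], axis=1, keys=['plain mean', 'weighted mean']))
```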

### 2.1.3 Time-Aware Rolling

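You can set `window` to a time offset such as `'5s'`. This is particularly useful when the time index is irregularly spaced, because each window then covers a fixed span of time rather than a fixed number of rows.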

In [19]:
st = pd.Series([2.5, 4.8, 1.3, np.nan, 5.9],
                index=pd.Index([pd.Timestamp(2020, 1, 1, 1, 0, 1),
                                pd.Timestamp(2020, 1, 1, 1, 0, 3),
                                pd.Timestamp(2020, 1, 1, 1, 0, 4),
                                pd.Timestamp(2020, 1, 1, 1, 0, 7),
                                pd.Timestamp(2020, 1, 1, 1, 0, 11)]))
st 

2020-01-01 01:00:01    2.5
2020-01-01 01:00:03    4.8
2020-01-01 01:00:04    1.3
2020-01-01 01:00:07    NaN
2020-01-01 01:00:11    5.9
dtype: float64

In [20]:
st.rolling('5s').max()

2020-01-01 01:00:01    2.5
2020-01-01 01:00:03    4.8
2020-01-01 01:00:04    4.8
2020-01-01 01:00:07    4.8
2020-01-01 01:00:11    5.9
dtype: float64
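To contrast the two window types on an irregular index, the sketch below sums the same data with a two-row window and with a five-second window. It rebuilds the series from the cell above so that it runs on its own.

```python
import numpy as np
import pandas as pd

st = pd.Series([2.5, 4.8, 1.3, np.nan, 5.9],
               index=pd.to_datetime(['2020-01-01 01:00:01',
                                     '2020-01-01 01:00:03',
                                     '2020-01-01 01:00:04',
                                     '2020-01-01 01:00:07',
                                     '2020-01-01 01:00:11']))

by_rows = st.rolling(2).sum()     # always the last 2 rows; NaN if either value is missing
by_time = st.rolling('5s').sum()  # all rows in the last 5 seconds; NaN values are skipped

print(pd.concat([by_rows, by_time], axis=1, keys=['rolling(2)', "rolling('5s')"]))
```

The time-based window holds three rows at `01:00:04` but only one valid value at `01:00:11`, while the fixed window always looks at exactly two rows.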

### 2.1.4 Rolling Window Endpoints

The inclusion of the interval endpoints in rolling window calculations can be specified with the `closed` parameter:

In [21]:
pd.read_html('https://pandas.pydata.org/docs/user_guide/computation.html')[3].set_index('closed')

Unnamed: 0_level_0,Description,Default for
closed,Unnamed: 1_level_1,Unnamed: 2_level_1
right,close right endpoint,time-based windows
left,close left endpoint,
both,close both endpoints,fixed windows
neither,open endpoints,


In [22]:
st2 = pd.Series([1] * 5, index=st.index)
pd.DataFrame({'right': st2.rolling('5s', closed='right').sum(),
              'left': st2.rolling('5s', closed='left').sum(),
              'both': st2.rolling('5s', closed='both').sum(),
              'neither': st2.rolling('5s', closed='neither').sum()})

Unnamed: 0,right,left,both,neither
2020-01-01 01:00:01,1.0,,1.0,
2020-01-01 01:00:03,2.0,1.0,2.0,1.0
2020-01-01 01:00:04,3.0,2.0,3.0,2.0
2020-01-01 01:00:07,3.0,2.0,3.0,2.0
2020-01-01 01:00:11,2.0,1.0,2.0,1.0


### 2.1.5 Iteration Over Window

`Rolling` and `Expanding` objects support iteration. Each step yields the data in the current window, including the shorter windows at the start.

In [23]:
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

for win in df.rolling(2):
    print(win)

   A  B
0  1  4
   A  B
0  1  4
1  2  5
   A  B
1  2  5
2  3  6


### 2.1.6 Centering Windows

Use the `center` keyword to set the labels at the center (the default is to set the labels to the right edge of the window).

In [24]:
s.rolling(5).mean()

2020-01-01 00:00:00         NaN
2020-01-01 00:00:01         NaN
2020-01-01 00:00:02         NaN
2020-01-01 00:00:03         NaN
2020-01-01 00:00:04    0.495057
                         ...   
2020-01-01 00:08:15   -0.479937
2020-01-01 00:08:16   -0.608534
2020-01-01 00:08:17   -0.481100
2020-01-01 00:08:18   -0.297336
2020-01-01 00:08:19   -0.175097
Freq: S, Length: 500, dtype: float64

In [25]:
s.rolling(5, center=True).mean()

2020-01-01 00:00:00         NaN
2020-01-01 00:00:01         NaN
2020-01-01 00:00:02    0.495057
2020-01-01 00:00:03    0.431148
2020-01-01 00:00:04    0.203207
                         ...   
2020-01-01 00:08:15   -0.481100
2020-01-01 00:08:16   -0.297336
2020-01-01 00:08:17   -0.175097
2020-01-01 00:08:18         NaN
2020-01-01 00:08:19         NaN
Freq: S, Length: 500, dtype: float64
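For an odd window size, centering only changes where each result is labeled, not the values themselves. A quick check on a small series, assuming a window of 5 (so the shift is `(5 - 1) // 2 = 2` rows):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(10, dtype=float))

trailing = s.rolling(5).mean()               # label at the right edge of the window
centered = s.rolling(5, center=True).mean()  # label at the middle of the window

# The centered result is the trailing result moved up by two rows
shifted = trailing.shift(-2)

print(pd.concat([trailing, centered, shifted], axis=1,
                keys=['trailing', 'centered', 'trailing.shift(-2)']))
```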

### 2.1.7 Binary Window Functions

`cov()` and `corr()` can compute moving window statistics about two Series or any combination of `DataFrame/Series` or `DataFrame/DataFrame`.

In [26]:
s1 = pd.Series(np.random.randn(200))
s2 = pd.Series(np.linspace(-1, 1, 75))

s1.rolling(window=50).cov(s2).dropna().head()

49   -0.028233
50    0.009222
51    0.029689
52    0.030457
53    0.011550
dtype: float64

In [27]:
s1.rolling(window=50).corr(s2).dropna().head()

49   -0.073683
50    0.024443
51    0.077780
52    0.079713
53    0.029609
dtype: float64

In [28]:
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))

df.rolling(25).cov(s2).dropna().head()

Unnamed: 0,A,B,C,D
24,0.021572,0.05154,-0.050062,-0.045238
25,0.020245,0.025835,-0.013548,-0.02136
26,0.051145,-0.017756,-0.018925,-0.033465
27,0.053317,0.002606,-0.048665,-0.032443
28,0.042833,0.013238,-0.053503,-0.017919


In [29]:
df2 = pd.DataFrame(np.random.randn(70, 5), columns=list('ACDEI'))

# pairwise=True computes the statistic for every pair of columns;
# by default only columns with matching names are paired
df.rolling(25).cov(df2, pairwise=True).dropna().head()

Unnamed: 0,Unnamed: 1,A,B,C,D
24,A,-0.039057,0.084499,0.314296,0.055429
24,C,-0.087545,-0.063866,-0.083898,-0.074239
24,D,0.123655,0.049059,-0.017986,0.11451
24,E,0.135151,0.051768,-0.014491,-0.123784
24,I,-0.291543,-0.014254,-0.153343,-0.294555
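The two DataFrame/DataFrame modes return differently shaped results. The sketch below compares them on small frames that share only column `A`; the frames `left` and `right` and the seed are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
left = pd.DataFrame(rng.standard_normal((10, 2)), columns=list('AB'))
right = pd.DataFrame(rng.standard_normal((10, 2)), columns=list('AC'))

# Default: only same-named columns are paired (here just 'A')
matched = left.rolling(5).cov(right)

# pairwise=True: every column of `left` against every column of `right`
every_pair = left.rolling(5).cov(right, pairwise=True)

print(matched.tail(3))
print(every_pair.tail(4))
```

With `pairwise=True` the result is indexed by (row label, column of `right`) and has one column per column of `left`.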


# 3. Aggregations

In [30]:
s.rolling(window=15, min_periods=5).aggregate(np.std)

2020-01-01 00:00:00         NaN
2020-01-01 00:00:01         NaN
2020-01-01 00:00:02         NaN
2020-01-01 00:00:03         NaN
2020-01-01 00:00:04    0.992858
                         ...   
2020-01-01 00:08:15    1.009871
2020-01-01 00:08:16    0.954966
2020-01-01 00:08:17    0.956861
2020-01-01 00:08:18    0.956035
2020-01-01 00:08:19    0.921477
Freq: S, Length: 500, dtype: float64

## 3.1 Applying Multiple Functions

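Pass a list of functions to `agg` to compute several statistics at once. A `Series` yields one column per function; a `DataFrame` yields a column `MultiIndex` of (column, function).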

In [31]:
s.rolling(20).agg([max, min, np.mean, np.std]).dropna().head()

Unnamed: 0,max,min,mean,std
2020-01-01 00:00:19,1.871727,-0.699064,0.274015,0.675513
2020-01-01 00:00:20,1.871727,-0.699064,0.245949,0.69516
2020-01-01 00:00:21,1.837576,-0.699064,0.244242,0.690986
2020-01-01 00:00:22,1.837576,-0.938497,0.202688,0.736722
2020-01-01 00:00:23,1.837576,-0.938497,0.235604,0.716287


In [32]:
df.rolling(20).agg([np.mean, np.std]).dropna().head()

Unnamed: 0_level_0,A,A,B,B,C,C,D,D
Unnamed: 0_level_1,mean,std,mean,std,mean,std,mean,std
19,0.012553,0.744065,0.098507,0.994935,0.035644,1.23236,-0.026447,1.084731
20,-0.000832,0.767881,0.245266,1.022956,-0.124647,1.198281,-0.101494,1.163584
21,-0.030664,0.768417,0.301748,1.03746,-0.19073,1.231533,-0.101431,1.163479
22,0.007053,0.837586,0.279709,1.036418,-0.154726,1.187117,-0.113263,1.165808
23,0.052876,0.817725,0.203232,1.039959,-0.088517,1.210047,-0.159192,1.121992


## 3.2 Applying Different Functions to DataFrame Columns

Just pass a `dict` to `agg`, mapping column names to aggregating functions.

In [33]:
df.rolling(20).agg({'A': [max, min], 'B': np.std, 'D': lambda x: np.quantile(x, 0.5)}
                  ).dropna().head()

Unnamed: 0_level_0,A,A,B,D
Unnamed: 0_level_1,max,min,std,<lambda>
19,1.220806,-1.140664,0.994935,-0.031614
20,1.220806,-1.405811,1.022956,-0.031614
21,1.220806,-1.405811,1.03746,-0.031614
22,1.76435,-1.405811,1.036418,-0.031614
23,1.76435,-1.405811,1.039959,-0.031614


> The aggregating functions can also be given as strings

In [34]:
df.rolling(20).agg({'A': 'sum', 'B': ['max', 'min'], 'D': 'std'}
                  ).dropna().head()

Unnamed: 0_level_0,A,B,B,D
Unnamed: 0_level_1,sum,max,min,std
19,0.251067,1.910436,-1.631778,1.084731
20,-0.016643,1.910436,-1.631778,1.163584
21,-0.613275,1.910436,-1.631778,1.163479
22,0.141065,1.910436,-1.631778,1.165808
23,1.057514,1.910436,-1.631778,1.121992
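Individual results can be pulled out of the aggregated frame with a (column, function) tuple, and each one matches the corresponding single rolling call. A minimal sketch on made-up data; `frame` and `combined` are illustrative names.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.standard_normal((30, 2)), columns=['A', 'B'])

combined = frame.rolling(5).agg({'A': 'sum', 'B': ['max', 'min']})

# Select one statistic with a (column, function) tuple
b_max = combined[('B', 'max')]

print(combined.columns.tolist())
print(b_max.dropna().head())
```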


# 4. Expanding Windows

A common alternative to rolling statistics is to use an expanding window, which yields the value of the statistic with all the data available up to that point in time.

## 4.1 Method Summary

In [35]:
pd.read_html('https://pandas.pydata.org/docs/user_guide/computation.html')[4]

Unnamed: 0,Function,Description
0,count(),Number of non-null observations
1,sum(),Sum of values
2,mean(),Mean of values
3,median(),Arithmetic median of values
4,min(),Minimum
5,max(),Maximum
6,std(),Sample standard deviation
7,var(),Sample variance
8,skew(),Sample skewness (3rd moment)
9,kurt(),Sample kurtosis (4th moment)


In [36]:
s.expanding(min_periods=2).sum()

2020-01-01 00:00:00          NaN
2020-01-01 00:00:01     1.956576
2020-01-01 00:00:02     1.849149
2020-01-01 00:00:03     1.310644
2020-01-01 00:00:04     2.475285
                         ...    
2020-01-01 00:08:15   -24.337817
2020-01-01 00:08:16   -24.853042
2020-01-01 00:08:17   -25.294729
2020-01-01 00:08:18   -24.842805
2020-01-01 00:08:19   -24.381126
Freq: S, Length: 500, dtype: float64
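On data without missing values, several expanding statistics coincide with the familiar cumulative methods, which is a handy sanity check. A small sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0, 1.0, 4.0, 1.0, 5.0])

exp_sum = s.expanding().sum()
exp_max = s.expanding().max()
exp_mean = s.expanding().mean()

# Running mean by hand: cumulative sum divided by the number of values seen so far
running_mean = s.cumsum() / np.arange(1, len(s) + 1)

# min_periods only masks the early rows; later values are unchanged
exp_sum_min2 = s.expanding(min_periods=2).sum()

print(pd.concat([exp_sum, s.cumsum(), exp_sum_min2], axis=1,
                keys=['expanding sum', 'cumsum', 'min_periods=2']))
```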

# 5. Exponentially Weighted Windows

A related set of functions are exponentially weighted versions of several of the above statistics. A similar interface to `.rolling` and `.expanding` is accessed through the `.ewm` method.

>Specify exactly one of `span`, `com` (center of mass), `halflife` or `alpha` when calling `.ewm`.
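The four parameters are different ways of expressing the same smoothing factor `alpha`. The sketch below converts each one to `alpha` using the relations given in the pandas documentation and checks that the results agree; the numeric values are arbitrary.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 3.0, 2.0, 5.0, 4.0])

span, com, halflife = 9, 4, 3

alpha_from_span = 2 / (span + 1)                          # 0.2
alpha_from_com = 1 / (1 + com)                            # 0.2
alpha_from_halflife = 1 - np.exp(-np.log(2) / halflife)   # weight halves every 3 rows

by_span = s.ewm(span=span).mean()
by_com = s.ewm(com=com).mean()
by_halflife = s.ewm(halflife=halflife).mean()

print(alpha_from_span, alpha_from_com, alpha_from_halflife)
```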

## 5.1 Method Summary

In [37]:
pd.read_html('https://pandas.pydata.org/docs/user_guide/computation.html')[5]

Unnamed: 0,Function,Description
0,mean(),EW moving average
1,var(),EW moving variance
2,std(),EW moving standard deviation
3,corr(),EW moving correlation
4,cov(),EW moving covariance


In [38]:
s.ewm(span=15).mean()

2020-01-01 00:00:00    0.084849
2020-01-01 00:00:01    1.037851
2020-01-01 00:00:02    0.604136
2020-01-01 00:00:03    0.258984
2020-01-01 00:00:04    0.491399
                         ...   
2020-01-01 00:08:15   -0.306430
2020-01-01 00:08:16   -0.332529
2020-01-01 00:08:17   -0.346174
2020-01-01 00:08:18   -0.246412
2020-01-01 00:08:19   -0.157900
Freq: S, Length: 500, dtype: float64

In [39]:
s.ewm(halflife=50).mean()

2020-01-01 00:00:00    0.084849
2020-01-01 00:00:01    0.984481
2020-01-01 00:00:02    0.615454
2020-01-01 00:00:03    0.320938
2020-01-01 00:00:04    0.494389
                         ...   
2020-01-01 00:08:15   -0.074853
2020-01-01 00:08:16   -0.080922
2020-01-01 00:08:17   -0.085894
2020-01-01 00:08:18   -0.078482
2020-01-01 00:08:19   -0.071039
Freq: S, Length: 500, dtype: float64

In [40]:
s.ewm(com=80).mean()

2020-01-01 00:00:00    0.084849
2020-01-01 00:00:01    0.983837
2020-01-01 00:00:02    0.615555
2020-01-01 00:00:03    0.321642
2020-01-01 00:00:04    0.494456
                         ...   
2020-01-01 00:08:15   -0.069464
2020-01-01 00:08:16   -0.074979
2020-01-01 00:08:17   -0.079515
2020-01-01 00:08:18   -0.072941
2020-01-01 00:08:19   -0.066328
Freq: S, Length: 500, dtype: float64
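With the default `adjust=True`, the EW mean at each row is a weighted average of all values so far, where a value `i` rows back gets weight `(1 - alpha) ** i`. The sketch below recomputes it by hand for data without missing values; `ew_mean_adjusted` is an illustrative helper, not a pandas function.

```python
import numpy as np
import pandas as pd

def ew_mean_adjusted(values, alpha):
    """Weighted average of values[0..t] with weights (1 - alpha) ** (t - k)."""
    out = []
    for t in range(len(values)):
        # Exponents run t, t-1, ..., 0 so the newest value gets weight 1
        weights = (1 - alpha) ** np.arange(t, -1, -1)
        out.append(np.dot(weights, values[: t + 1]) / weights.sum())
    return pd.Series(out)

s = pd.Series([1.0, 3.0, 2.0, 5.0, 4.0])
alpha = 0.3

manual = ew_mean_adjusted(s.to_numpy(), alpha)
builtin = s.ewm(alpha=alpha).mean()

print(pd.concat([builtin, manual], axis=1, keys=['ewm', 'manual']))
```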