# Statistical tools

In [1]:
import addutils.toc ; addutils.toc.js(ipy_notebook=True)

This tutorial covers some of the statistical and computational tools offered by `pandas`.

In [2]:
import datetime
import scipy.io
import numpy as np
import pandas as pd
import bokeh.plotting as bk
from IPython.display import display, HTML
from addutils import css_notebook, side_by_side2
css_notebook()

## 1 Percent change

Given a `pandas.Series`, the method `pct_change` returns a new `pandas.Series` containing the fractional change over a given number of periods (one by default). Multiplying the result by 100, as done below, expresses it as a percentage.

In [3]:
s1 = pd.Series(range(10, 18) + np.random.randn(8) / 10)

pct_ch_1d = s1.pct_change() * 100
pct_ch_3d = s1.pct_change(periods=3) * 100

HTML(side_by_side2(s1, pct_ch_1d, pct_ch_3d))

| | s1 | pct_ch_1d | pct_ch_3d |
|---|---|---|---|
| 0 | 9.996227 | NaN | NaN |
| 1 | 11.092212 | 10.963978 | NaN |
| 2 | 11.961875 | 7.840303 | NaN |
| 3 | 12.825606 | 7.220702 | 28.304463 |
| 4 | 13.957037 | 8.821654 | 25.827355 |
| 5 | 15.052242 | 7.846976 | 25.83514 |
| 6 | 15.986984 | 6.209985 | 24.648954 |
| 7 | 16.997841 | 6.323003 | 21.786893 |
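
As a sanity check, the result of `pct_change(periods=n)` can be reproduced by dividing each value by the value `n` rows earlier and subtracting 1. The sketch below uses a small made-up series (the values are illustrative and not taken from the data above):

```python
import numpy as np
import pandas as pd

# Hypothetical values, chosen only for illustration
s = pd.Series([10.0, 11.0, 12.1, 13.31])

# Fractional change over one period: s[t] / s[t-1] - 1
manual_1 = s / s.shift(1) - 1
# Fractional change over two periods: s[t] / s[t-2] - 1
manual_2 = s / s.shift(2) - 1

# Both comparisons treat the leading NaN values as equal
print(np.allclose(s.pct_change(), manual_1, equal_nan=True))
print(np.allclose(s.pct_change(periods=2), manual_2, equal_nan=True))
```

The first `n` entries are `NaN` because there is no earlier value to compare against.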


## 2 Covariance

Given two `pandas.Series`, the method `cov` computes the covariance between them, excluding missing values.

In [4]:
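# Seven random values indexed by consecutive business days
# (pd.util.testing is deprecated in recent pandas releases)
s1 = pd.Series(np.random.randn(7), index=pd.bdate_range('2000-01-03', periods=7))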
s2 = s1 + np.random.randn(len(s1)) / 10
HTML(side_by_side2(s1, s2))

| | s1 | s2 |
|---|---|---|
| 2000-01-03 | -0.900324 | -0.960679 |
| 2000-01-04 | -0.727965 | -0.653139 |
| 2000-01-05 | 1.1249 | 1.161671 |
| 2000-01-06 | 0.020534 | 0.058083 |
| 2000-01-07 | -0.889789 | -0.752655 |
| 2000-01-10 | 0.373402 | 0.337995 |
| 2000-01-11 | -2.832169 | -2.723136 |


In [5]:
s1.cov(s2)

1.537248338742436
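
To make "excluding missing values" concrete, here is a minimal sketch with two made-up series containing `NaN` entries. It assumes the default settings of `Series.cov` (sample covariance, `ddof=1`) and checks the result against a manual computation on the complete pairs only:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each series
a = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])
b = pd.Series([2.0, 4.1, 6.2, np.nan, 9.9])

# Series.cov only uses the rows where both values are present
pandas_cov = a.cov(b)

# Manual check: keep the complete pairs, then apply the sample formula
pairs = pd.concat([a, b], axis=1).dropna()
x, y = pairs[0], pairs[1]
manual_cov = ((x - x.mean()) * (y - y.mean())).sum() / (len(pairs) - 1)

print(len(pairs))                          # number of complete pairs
print(np.isclose(pandas_cov, manual_cov))  # the two computations agree
```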

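It is also possible to compute the pairwise covariance of the columns of a `pandas.DataFrame` using the `pandas.DataFrame.cov` method. Here we generate a `pandas.DataFrame` of random data indexed by business days:

In [6]:
# Random data: 30 business days, four columns
# (pd.util.testing is deprecated in recent pandas releases)
d1 = pd.DataFrame(np.random.randn(30, 4), columns=list('ABCD'),
                  index=pd.bdate_range('2000-01-03', periods=30))
print(d1.head())
print(d1.cov())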

                   A         B         C         D
2000-01-03  0.227604  0.630300  1.294112 -0.036318
2000-01-04  1.180444 -0.936626 -1.712326  0.009644
2000-01-05  0.121701  1.138558  0.653508  0.369642
2000-01-06 -1.091047  1.811374  0.152829 -1.174066
2000-01-07 -0.428008 -1.217586  0.591549 -0.512264
          A         B         C         D
A  0.635379  0.055008 -0.021344  0.113902
B  0.055008  1.535345 -0.292853 -0.194581
C -0.021344 -0.292853  1.543322  0.144696
D  0.113902 -0.194581  0.144696  0.798089
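
The structure of the matrix returned by `DataFrame.cov` can be checked directly: the diagonal holds the per-column sample variances, each off-diagonal entry matches the pairwise `Series.cov`, and the matrix is symmetric. A small sketch on synthetic data (the frame `df_demo` and the seed are assumptions for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df_demo = pd.DataFrame(rng.standard_normal((50, 3)), columns=list('ABC'))
cov_demo = df_demo.cov()

# Diagonal entries are the per-column sample variances
print(np.allclose(np.diag(cov_demo), df_demo.var()))
# Off-diagonal entries match the pairwise Series.cov
print(np.isclose(cov_demo.loc['A', 'B'], df_demo['A'].cov(df_demo['B'])))
# The matrix is symmetric
print(np.allclose(cov_demo, cov_demo.T))
```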


## 3 Correlation

`pandas.Series.corr` computes the correlation between two `pandas.Series`. The `method` parameter selects one of:

* Pearson
* Kendall
* Spearman

In [7]:
s1.corr(s2, method='pearson')

0.9986048219596884
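
The choice of method matters when the relationship is monotonic but not linear. The sketch below uses made-up data and computes Spearman's coefficient as Pearson's coefficient on the ranks, which is its definition. Calling `x.corr(y, method='spearman')` should give the same value, but that path may require SciPy to be installed, so it is not used here:

```python
import pandas as pd

# Hypothetical data: y grows with x, but not linearly
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3

# Pearson (the default) measures linear association
pearson = x.corr(y)
# Spearman is Pearson applied to the ranks, so it measures monotonic association
spearman = x.rank().corr(y.rank())

print(pearson < 1.0)  # the relationship is not perfectly linear
print(spearman)       # the ranks agree exactly
```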

As with covariance, it is possible to call `pandas.DataFrame.corr` to obtain the pairwise correlation of the columns of a `pandas.DataFrame`:

In [8]:
d1.corr()

| | A | B | C | D |
|---|---|---|---|---|
| A | 1.0 | 0.055694 | -0.021554 | 0.159953 |
| B | 0.055694 | 1.0 | -0.190247 | -0.175781 |
| C | -0.021554 | -0.190247 | 1.0 | 0.130377 |
| D | 0.159953 | -0.175781 | 0.130377 | 1.0 |
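
Pearson correlation is covariance rescaled by the standard deviations of the two columns. A minimal sketch on synthetic data without missing values (the frame `df_demo` is an assumption for illustration) verifies this against `DataFrame.corr`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df_demo = pd.DataFrame(rng.standard_normal((50, 3)), columns=list('ABC'))

std = df_demo.std()
# corr[i, j] = cov[i, j] / (std[i] * std[j])
manual_corr = df_demo.cov() / np.outer(std, std)

print(np.allclose(manual_corr, df_demo.corr()))
```

This is why the diagonal of the correlation matrix is always 1: each column's variance is divided by the square of its own standard deviation.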


## 4 Rolling moments and Binary rolling moments

`pandas` also provides many methods for calculating rolling moments.

In [9]:
[n for n in dir(pd) if n.startswith('rolling')]

[]

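The list is empty because recent `pandas` versions no longer provide top-level `rolling_*` functions: rolling moments are computed through the `rolling` method of `pandas.Series` and `pandas.DataFrame`. Let's see some examples: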

In [10]:
s3 = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
s3 = s3.cumsum()
s3_max = s3.rolling(60).max()
s3_mean = s3.rolling(60).mean()
s3_min = s3.rolling(60).min()
data = {'cumsum':s3, 'max':s3_max, 'mean':s3_mean, 'min':s3_min}
df = pd.DataFrame(data)
df.tail()

| | cumsum | max | mean | min |
|---|---|---|---|---|
| 2002-09-22 | -21.382742 | -17.256591 | -21.552457 | -26.390522 |
| 2002-09-23 | -21.284376 | -17.256591 | -21.526603 | -26.390522 |
| 2002-09-24 | -21.225513 | -17.256591 | -21.493997 | -26.390522 |
| 2002-09-25 | -19.325814 | -17.256591 | -21.438869 | -26.390522 |
| 2002-09-26 | -18.891911 | -17.256591 | -21.371697 | -26.390522 |
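
With a window of 60, the first 59 rows of each rolling column are `NaN`, because a result is only produced once the window is full. The following sketch shows the same behaviour on a tiny made-up series with a window of 3, together with the `min_periods` argument, which relaxes that requirement:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Window of 3: the first two results are NaN because the window is not full yet
full = s.rolling(3).mean()
# min_periods=1 emits a value as soon as one observation is available
partial = s.rolling(3, min_periods=1).mean()

print(full.tolist())     # NaN, NaN, then 2.0, 3.0, 4.0
print(partial.tolist())  # 1.0, 1.5, 2.0, 3.0, 4.0
```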


In [11]:
bk.output_notebook()

In [12]:
fig = bk.figure(x_axis_type = "datetime",
               tools="pan,box_zoom,reset", title = 'Rolling Moments',
               plot_width=750, plot_height=400)
fig.line(df.index, df['cumsum'], color='cadetblue', legend='Cumulative Sum')
fig.line(df.index, df['max'], color='mediumorchid', legend='Max')
fig.line(df.index, df['min'], color='mediumpurple', legend='Min')
fig.line(df.index, df['mean'], color='navy', legend='Mean')
bk.show(fig)

`pandas.Series.cumsum` returns a new `pandas.Series` containing the cumulative sum of the given values.
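
A tiny illustration with made-up increments: the cumulative sum turns a sequence of steps into a path, and `diff` recovers the steps again (apart from the first one, which has no predecessor).

```python
import pandas as pd

# Hypothetical increments
steps = pd.Series([1, -2, 3, 4])

# Running total of the increments
walk = steps.cumsum()
print(walk.tolist())  # [1, -1, 2, 6]

# Differencing undoes the cumulative sum; the first entry is NaN
print(walk.diff().tolist())
```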

In [13]:
s4 = s3 + np.random.randn(len(s3))
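# Rolling correlation between the original series and its noisy copy
# (correlating s3 with itself would trivially give 1)
rollc = s3.rolling(window=10).corr(s4)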
data2 = {'cumsum':s3, 'similar':s4, 'rolling correlation':rollc}
df2 = pd.DataFrame(data2)
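
A binary rolling moment applies a two-series statistic to each window. As a hedged sketch on synthetic data (the series `a` and `b` are assumptions for illustration), the last value of a rolling correlation should equal the plain correlation computed on the last window of rows:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
a = pd.Series(rng.standard_normal(30)).cumsum()
b = a + rng.standard_normal(30)  # noisy copy of a

# Correlation between a and b over a sliding window of 10 rows
roll = a.rolling(window=10).corr(b)

# The last rolling value matches the correlation of the last 10 rows
last_window = a.iloc[-10:].corr(b.iloc[-10:])
print(np.isclose(roll.iloc[-1], last_window))

# The first 9 values are NaN because the window is not full yet
print(roll.iloc[:9].isna().all())
```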

In [14]:
fig = bk.figure(x_axis_type = "datetime", title = 'Rolling Correlation',
       plot_width=750, plot_height=400)
fig.line(df2.index, df2['cumsum'], color='cadetblue', legend='Cumulative Sum')
fig.line(df2.index, df2['similar'], color='mediumorchid', legend='Similar')
fig.line(df2.index, df2['rolling correlation'], color='navy', legend='Rolling Corr.')
fig.legend.location = "bottom_right"
bk.show(fig)

## 5 A practical example: Return indexes and cumulative returns

In [15]:
AAPL = pd.read_csv('example_data/p03_AAPL.txt', index_col='Date', parse_dates=True)
price = AAPL['Adj Close']
display(price.tail())

Date
2012-09-17    699.78
2012-09-18    701.91
2012-09-19    702.10
2012-09-20    698.70
2012-09-21    700.09
Name: Adj Close, dtype: float64

`pandas.Series.tail` returns the last `n` rows of a given `pandas.Series` (5 by default).

In [16]:
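# Total return between two dates (computed but not displayed)
price['2011-10-03'] / price['2011-03-01'] - 1
returns = price.pct_change()
# Return index: growth of one unit invested on the first date
ret_index = (1 + returns).cumprod()
ret_index.iloc[0] = 1  # the first return is NaN, so set the starting value explicitly
# Month-end values of the index, then month-over-month returns
monthly_returns = ret_index.resample('BM').last().pct_change()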
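
The return index built this way is simply the price series rescaled so that it starts at 1. A minimal sketch with made-up prices (not the AAPL data) shows the equivalence:

```python
import numpy as np
import pandas as pd

# Hypothetical daily prices, for illustration only
idx = pd.date_range('2020-01-01', periods=4, freq='D')
p = pd.Series([100.0, 110.0, 99.0, 121.0], index=idx)

r = p.pct_change()       # simple returns; the first one is NaN
ri = (1 + r).cumprod()   # compound the returns
ri.iloc[0] = 1           # starting value of the index

# Compounded returns equal the price divided by the first price
print(np.allclose(ri, p / p.iloc[0]))
```

Because the index is a rescaled price, the total return between any two dates is the ratio of the index values minus 1.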

In [17]:
fig = bk.figure(x_axis_type = 'datetime', title = 'Monthly Returns',
                plot_width=750, plot_height=400)
fig.line(monthly_returns.index, monthly_returns)
bk.show(fig)

---

Visit [www.add-for.com](<http://www.add-for.com/>) for more tutorials and updates.

This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.