# scipy stats

This notebook focuses on the use of the `scipy.stats` module.

It follows a learn-by-example approach, so it covers only a small part of the module's functionality but shows a practical application.

Some knowledge of `numpy` and `matplotlib` is needed to fully understand the content.

## Introduction

The `scipy.stats` module mainly provides:
* probability distributions: continuous, discrete and multivariate
* statistical functions such as summary statistics and hypothesis tests
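As a quick taste of both groups, here is a minimal sketch. The sample values and distribution parameters are made up for illustration and are not part of the stock data used later.

```python
import numpy as np
from scipy import stats

# A continuous distribution: the standard normal
std_normal = stats.norm(loc=0.0, scale=1.0)
p_below_zero = std_normal.cdf(0.0)  # probability of drawing a value below 0

# A discrete distribution: number of heads in 10 fair coin tosses
coin = stats.binom(n=10, p=0.5)
p_five_heads = coin.pmf(5)  # probability of exactly 5 heads

# A statistical function: summary statistics of a small, made-up sample
sample = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
summary = stats.describe(sample)

print(p_below_zero, p_five_heads)
print(summary.mean, summary.variance)
```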

## Imports

In [None]:
%matplotlib inline
import numpy as np
import scipy as sp
from scipy import stats
import matplotlib.pyplot as plt
from datetime import datetime

## Load the data

In [None]:
stock_prices = np.loadtxt('../resources/stock.csv', skiprows=1, delimiter=',', usecols=(1, 2, 3))

In [None]:
plt.plot(stock_prices[:, 0], label='Apple')
plt.plot(stock_prices[:, 1], label='Microsoft')
plt.plot(stock_prices[:, 2], label='Intel')
_ = plt.legend()
_ = plt.title('2016 stock prices')

In [None]:
# Compute the daily increments
stock_incs = (stock_prices[1:, :] - stock_prices[:-1, :])/stock_prices[:-1, :]
n_incs, n_stocks = stock_incs.shape
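The increment formula can be checked on a small price table. The sketch below uses invented prices (not the contents of `stock.csv`) and also shows an equivalent form based on `np.diff`.

```python
import numpy as np

# Made-up prices: 4 days (rows) for 2 stocks (columns)
prices = np.array([[100.0, 50.0],
                   [102.0, 49.0],
                   [101.0, 49.5],
                   [103.0, 51.0]])

# Same formula as in the cell above: (today - yesterday) / yesterday
incs = (prices[1:, :] - prices[:-1, :]) / prices[:-1, :]

# Equivalent formulation with np.diff along the time axis
incs_diff = np.diff(prices, axis=0) / prices[:-1, :]

print(incs.shape)  # one row fewer than prices
print(np.allclose(incs, incs_diff))
```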

In [None]:
plt.plot(stock_incs[:, 0], label='Apple')
plt.plot(stock_incs[:, 1], label='Microsoft')
plt.plot(stock_incs[:, 2], label='Intel')
_ = plt.legend()
_ = plt.title('2016 stock prices - daily relative increments')

In [None]:
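# Compute some stats (not using scipy.stats)
m = np.mean(stock_incs, axis=0)
# np.std uses ddof=0 by default (divides by n)
s = np.std(stock_incs, axis=0)
c = np.cov(stock_incs, rowvar=0)
# Unbiased estimate (divides by n - 1), same as np.std(stock_incs, axis=0, ddof=1)
# Note: this name shadows the `scipy as sp` alias imported above
sp = np.sqrt(s**2*n_incs/(n_incs - 1))
print('mean = {}'.format(m))
print('std (ddof=0) = {}'.format(s))
print('std (ddof=1) = {}'.format(sp))
print('cov = {}'.format(c))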

Up to this point we haven't used `scipy.stats`.
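For comparison, `stats.describe` returns similar summary numbers in a single call. The sketch below runs on synthetic data (a stand-in array named `fake_incs`, an assumption for this example) rather than on the real increments.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-in for the increments: 250 days, 3 stocks
fake_incs = rng.normal(loc=0.001, scale=0.02, size=(250, 3))

summary = stats.describe(fake_incs, axis=0)

# describe() reports the unbiased variance (ddof=1 by default)
std_ddof1 = np.sqrt(summary.variance)

print(summary.nobs, summary.mean)
print(np.allclose(std_ddof1, np.std(fake_incs, axis=0, ddof=1)))
```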

## Create a Normal distribution

Let's assume that the daily relative increments of the stock prices follow a Normal distribution.

In [None]:
# Create estimated distributions based on the sample
app_dist = stats.norm(m[0], sp[0])
win_dist = stats.norm(m[1], sp[1])
intl_dist = stats.norm(m[2], sp[2])
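An alternative way to get the parameters is `stats.norm.fit`, which returns maximum-likelihood estimates of `loc` and `scale`. A minimal sketch on synthetic data (the array `fake_incs` is an assumption for this example); note that the fitted scale matches the `ddof=0` standard deviation, not the `ddof=1` one used above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-in for one column of daily increments
fake_incs = rng.normal(loc=0.001, scale=0.02, size=250)

# Maximum-likelihood estimates of the normal parameters
loc, scale = stats.norm.fit(fake_incs)
fitted_dist = stats.norm(loc, scale)

print(loc, scale)
print(np.isclose(scale, fake_incs.std(ddof=0)))
```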

In [None]:
x_range = np.arange(-0.05, +0.0501, 0.001)

# All the continuous distributions have the pdf, cdf and rvs methods
# Probability Density Function
pdf = app_dist.pdf(x_range)
# Cumulative Distribution Function
cdf = app_dist.cdf(x_range)
# Random sample
rvs = app_dist.rvs(1000)

# Plot the data
fig = plt.figure(figsize=(15., 5.))

ax1 = fig.add_subplot(131)
ax1.plot(x_range, pdf)
_ = ax1.set_title('Probability Density Function')

ax2 = fig.add_subplot(132)
ax2.plot(x_range, cdf)
_ = ax2.set_title('Cumulative Distribution Function')

ax3 = fig.add_subplot(133)
ax3.hist(rvs, bins=100)
_ = ax3.set_title('Random Sample')
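Frozen distributions offer a few more methods that are handy for risk questions. The sketch below uses made-up parameters (not the ones estimated from the stock data) to show `ppf` (the inverse of `cdf`), `sf` (the upper tail, `1 - cdf`) and `interval`.

```python
import numpy as np
from scipy import stats

# Made-up parameters, roughly the size of a daily relative increment
dist = stats.norm(loc=0.001, scale=0.02)

# Percent point function: the value below which 5% of the mass lies
q05 = dist.ppf(0.05)

# Survival function: probability of an increment above 3%
tail = dist.sf(0.03)

# Central interval containing 95% of the mass
low, high = dist.interval(0.95)

print(q05, tail, low, high)
```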


In [None]:
# We can test if this data fits a normal distribution (Kolmogorov-Smirnov test)
app_KS = stats.kstest(stock_incs[:, 0], 'norm', [m[0], sp[0]])
win_KS = stats.kstest(stock_incs[:, 1], 'norm', [m[1], sp[1]])
intl_KS = stats.kstest(stock_incs[:, 2], 'norm', [m[2], sp[2]])
print(app_KS, '\n', win_KS, '\n', intl_KS)

![Ummmmmm](../resources/homer-doh.jpg)
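The third argument of `kstest` carries the distribution parameters (`loc`, `scale`). Without it, the sample is compared with a standard normal, which is hopeless for data of this scale. A minimal sketch on synthetic data (not the stock data) shows the difference. Keep in mind that when the parameters are estimated from the same sample, the reported p-value is only approximate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic stand-in for one column of daily increments
fake_incs = rng.normal(loc=0.001, scale=0.02, size=250)

loc = fake_incs.mean()
scale = fake_incs.std(ddof=1)

# Compare against a normal with the estimated parameters
fitted = stats.kstest(fake_incs, 'norm', args=(loc, scale))
# Compare against the standard normal N(0, 1): wrong scale, large statistic
unfitted = stats.kstest(fake_incs, 'norm')

print(fitted.statistic, fitted.pvalue)
print(unfitted.statistic, unfitted.pvalue)
```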

In [None]:
# Compare histogram with estimated distribution
x_range = np.arange(-0.05, +0.0501, 0.001)
x_axis = (x_range[1:] + x_range[:-1])/2.
y_app = (app_dist.cdf(x_range[1:]) - app_dist.cdf(x_range[:-1]))*n_incs
y_win = (win_dist.cdf(x_range[1:]) - win_dist.cdf(x_range[:-1]))*n_incs
y_intl = (intl_dist.cdf(x_range[1:]) - intl_dist.cdf(x_range[:-1]))*n_incs

fig = plt.figure(figsize=(16., 6.))
ax_app = fig.add_subplot(131)
_ = ax_app.hist(stock_incs[:, 0], bins=x_range, color='powderblue')
_ = ax_app.set_xlabel('Apple')
_ = ax_app.plot(x_axis, y_app, color='blue', linewidth=3)
ax_win = fig.add_subplot(132)
_ = ax_win.hist(stock_incs[:, 1], bins=x_range, color='navajowhite')
_ = ax_win.set_xlabel('Microsoft')
_ = ax_win.plot(x_axis, y_win, color='orange', linewidth=3)
ax_intl = fig.add_subplot(133)
_ = ax_intl.hist(stock_incs[:, 2], bins=x_range, color='lightgreen')
_ = ax_intl.set_xlabel('Intel')
_ = ax_intl.plot(x_axis, y_intl, color='green', linewidth=3)
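The curves drawn over the histograms are expected bin counts: the probability mass of each bin (a difference of two `cdf` values) times the number of observations. A minimal sketch with made-up parameters and a synthetic sample:

```python
import numpy as np
from scipy import stats

n_obs = 250
dist = stats.norm(loc=0.001, scale=0.02)  # made-up parameters
edges = np.linspace(-0.05, 0.05, 101)     # 100 bins

# Expected count per bin = n * (cdf(right edge) - cdf(left edge))
expected = np.diff(dist.cdf(edges)) * n_obs

# Observed counts for a synthetic sample drawn from the same distribution
sample = dist.rvs(size=n_obs, random_state=0)
observed, _ = np.histogram(sample, bins=edges)

print(expected.sum(), observed.sum())
```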

## Exercise:

Imagine you are a product designer in a financial company. You want to create a new investment product to be "sold" to your clients based on the future stock prices of some IT companies. The client's profit from the investment is calculated like this:
* At the time of the investment we check the initial stock prices
* 12 months later (let's say 240 working days), the client gets 100% of the investment back. Additionally, if all stock prices are higher than the initial ones, the client earns half the lowest increment (in %).
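To make the payoff rule concrete, here is a small worked example. The helper name `client_profit` and the growth factors are hypothetical, chosen only for illustration.

```python
def client_profit(growth_factors):
    """Profit rate paid on top of the returned capital.

    growth_factors: final price / initial price, one value per stock.
    """
    # The bonus is paid only if every stock ended above its initial price
    if all(g > 1.0 for g in growth_factors):
        return (min(growth_factors) - 1.0) / 2.0
    return 0.0

# Both stocks went up (+10% and +6%): half of the lowest increment
both_up = client_profit([1.10, 1.06])
# One stock went down: no bonus, the client only gets the capital back
one_down = client_profit([1.10, 0.98])

print('{:.2%} {:.2%}'.format(both_up, one_down))
```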

**What is the expected profit of this investment?**

**What is the 5% highest risk that the financial company is assuming?**

First we will try to create a financial product based on the stock prices of Apple and Microsoft.

### Create a multivariate normal distribution

In [None]:
# Create a multivariate normal distribution object
m_norm = stats.multivariate_normal(m[:2], c[:2, :2])

In [None]:
# Show the contour plot of the pdf
x_range = np.arange(-0.05, +0.0501, 0.001)
x, y = np.meshgrid(x_range, x_range)

pos = np.dstack((x, y))
fig_m_norm = plt.figure(figsize=(6., 6.))
ax_m_norm = fig_m_norm.add_subplot(111)
ax_m_norm.contourf(x, y, m_norm.pdf(pos), 50)
_ = ax_m_norm.set_xlabel('Apple')
_ = ax_m_norm.set_ylabel('Microsoft')

### Compute the expected profit and top 5% risk

In [None]:
# Create N (e.g. 1000) random simulations of the daily relative increments, with 240 samples each
N_SIMS = 1000
daily_incs = m_norm.rvs(size=[240, N_SIMS])
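It helps to keep track of the array shapes here: with `size=[240, N_SIMS]` and two stocks, `rvs` returns a three-dimensional array, and compounding over the first axis leaves one row per simulation. A minimal sketch with a made-up mean vector and covariance matrix:

```python
import numpy as np
from scipy import stats

# Made-up parameters for two stocks
mean = np.array([0.001, 0.0005])
cov = np.array([[4e-4, 2e-4],
                [2e-4, 3e-4]])
dist = stats.multivariate_normal(mean, cov)

# 240 days x 1000 simulations, each draw being a pair of increments
draws = dist.rvs(size=[240, 1000], random_state=0)
print(draws.shape)

# Compound the daily increments over the 240 days (axis 0)
growth = (1.0 + draws).prod(axis=0)
print(growth.shape)
```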

In [None]:
# Calculate yearly increments (from the composition of the daily increments)
year_incs = (daily_incs + 1.).prod(axis=0)

In [None]:
# Calculate the amount paid for each simulation
def amount_to_pay(a):
    if np.all( a >= 1.):
        return (a.min() - 1)/2
    else:
        return 0.
earnings = np.apply_along_axis(amount_to_pay, 1, year_incs)
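`np.apply_along_axis` calls the Python function once per row. The same result can be obtained with array operations only, as in the sketch below. It uses a tiny made-up array of yearly growth factors and is an alternative formulation, not the notebook's own method.

```python
import numpy as np

# Made-up yearly growth factors: 4 simulations (rows), 2 stocks (columns)
demo_year_incs = np.array([[1.10, 1.06],
                           [1.10, 0.98],
                           [0.95, 0.90],
                           [1.20, 1.30]])

def amount_to_pay(a):
    # Same rule as in the cell above
    if np.all(a >= 1.):
        return (a.min() - 1)/2
    else:
        return 0.

loop_result = np.apply_along_axis(amount_to_pay, 1, demo_year_incs)

# Vectorized form: the worst stock of each simulation decides the payoff
worst = demo_year_incs.min(axis=1)
vectorized = np.where(worst >= 1.0, (worst - 1.0) / 2.0, 0.0)

print(loop_result)
print(np.allclose(loop_result, vectorized))
```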

In [None]:
_ = plt.hist(earnings, bins=50)

In [None]:
print('Expected profit of the investment: {:.2%}'.format(earnings.mean()))

In [None]:
# To compute the top 5% profit use the stats.scoreatpercentile function
print('Top 5% profit of the investment: {:.2%}'.format(stats.scoreatpercentile(earnings, 95)))
print('Top 1% profit of the investment: {:.2%}'.format(stats.scoreatpercentile(earnings, 99)))
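The same percentiles can also be computed with `np.percentile`, which with its default settings should agree with `stats.scoreatpercentile`. A minimal sketch on a synthetic earnings array (the data is made up for the example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic stand-in for the simulated earnings: non-negative values
fake_earnings = np.clip(rng.normal(0.0, 0.05, size=1000), 0.0, None)

p95_scipy = stats.scoreatpercentile(fake_earnings, 95)
p95_numpy = np.percentile(fake_earnings, 95)

print('{:.2%} {:.2%}'.format(p95_scipy, p95_numpy))
```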

Both the expected profit and the assessed risk are too high!

**Try adding Intel to the product in order to lower them**

In [None]:
# %load -r 2:10 solutions/07_02_scipy_stats.py