# Using Cholesky and Singular Value Decomposition to generate correlated random numbers
## The problem:
The ability to simulate correlated risk factors is key to many risk models. Historical Simulation achieves this implicitly, by using actual time series data for risk factors and applying the changes in all risk factors for a given day, for a large number of days (typically 250 or 500). The empirically observed correlations, as well as the means and standard deviations, are embedded in the historical time series. Historical simulation can be projected into the future (synthetic future scenarios) by scaling by the square root of time (the time into the future).
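
As a rough illustration of the historical approach, the sketch below applies each observed one-day change to the latest prices and scales it by the square root of the horizon. The function name `historical_scenarios`, the use of log returns, and the toy price history are assumptions made for this example, not part of the notebook's data.

```python
import numpy as np

def historical_scenarios(prices, horizon_days=1):
    """Build one scenario per historical day from a price history.

    prices: 2-D array, rows are days (oldest first), columns are risk factors.
    """
    prices = np.asarray(prices, dtype=float)
    # One-day log returns; each row moves all risk factors together,
    # which is how the historical correlations are preserved
    log_returns = np.diff(np.log(prices), axis=0)
    # Square-root-of-time scaling to the chosen horizon
    scaled = log_returns * np.sqrt(horizon_days)
    # Apply every historical change to the most recent prices
    return prices[-1] * np.exp(scaled)

# Made-up price history for two risk factors (illustrative values only)
toy_prices = np.array([[100.0, 50.0],
                       [101.0, 49.5],
                       [99.0, 50.5],
                       [100.5, 51.0]])

scenarios_1d = historical_scenarios(toy_prices)
scenarios_10d = historical_scenarios(toy_prices, horizon_days=10)
print(scenarios_1d)
print(scenarios_10d)
```

Because whole rows of returns are applied at once, no correlation matrix is ever estimated explicitly.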

If we are doing *Monte Carlo* simulation, however, we need to do something different, since independent random drawings from a Normal (Gaussian) distribution will be uncorrelated, whereas real data will exhibit correlations. Therefore a technique must be developed to transform uncorrelated random variables into variables which exhibit the empirically observed correlations.

In this Jupyter notebook we explore some techniques for producing correlated random variables and variations on these techniques.
- Cholesky Factorisation : $LL^T=\Sigma$, using both covariance and correlation matrix variations to generate trials 
- Singular Value Decomposition : $UDV^T=\Sigma$ [TODO - help appreciated!]
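
A minimal sketch of both factorisations, assuming a small made-up covariance matrix `Sigma` rather than one estimated from market data. For the SVD route it relies on the fact that, for a symmetric positive-definite matrix, $U D U^T = \Sigma$, so $A = U\sqrt{D}$ plays the same role as the Cholesky factor $L$. This is one possible way to fill in the SVD item, not necessarily the approach the notebook intends.

```python
import numpy as np

# Hypothetical 3x3 covariance matrix, made up for illustration
Sigma = np.array([[4.0, 1.2, 0.8],
                  [1.2, 1.0, 0.3],
                  [0.8, 0.3, 2.25]])

# Cholesky: lower-triangular L with L @ L.T == Sigma
L = np.linalg.cholesky(Sigma)

# SVD: Sigma == U @ diag(s) @ Vt. Scaling the columns of U by sqrt(s)
# gives a matrix A with A @ A.T == Sigma for symmetric positive-definite Sigma
U, s, Vt = np.linalg.svd(Sigma)
A = U * np.sqrt(s)

# Transform uncorrelated standard normal draws (one trial per row)
rng = np.random.default_rng(0)
X = rng.standard_normal((200_000, 3))
Y_chol = X @ L.T   # each row x becomes L x
Y_svd = X @ A.T    # each row x becomes A x

print(np.cov(Y_chol, rowvar=False))
print(np.cov(Y_svd, rowvar=False))
```

The two sets of trials differ draw by draw, because $L$ and $A$ are different matrices, but both sample covariances should be close to `Sigma`.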
## Theory - Cholesky Factorisation approach:
Consider a random vector, $X$, consisting of uncorrelated random variables, with each random variable, $X_i$, having zero mean and unit variance ($X_i\sim N(0,1)$). What we want is a technique for converting these standard normal variables to correlated variables which exhibit the observed empirical means and variances of the problem we are modelling.


- Useful identities and results:
    - $\mathbb E[XX^T] = I$, where $X\sim N(0,1)$, since $Cov[X,X]=\mathbb E [XX^T] - \mathbb E[X] \mathbb E[X^T]$ with $Cov[X,X]=I$ and $\mathbb E[X]=0$
- To show that we can create new, correlated, random variables $Y$, where $Y=LX$ and
    - $L$ is the Cholesky factorisation matrix (see above "Cholesky"), 
    - X is a vector of independent uncorrelated variables from a Normal distribution with mean of zero and variance of one : $\boxed {X\sim N(0,1)}$
    - $Cov[Y,Y] = \mathbb E[YY^T]$
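
The remaining steps of the argument can be written out as follows, using only $Y=LX$, $\mathbb E[X]=0$, $\mathbb E[XX^T]=I$ and $LL^T=\Sigma$ from above. Since $\mathbb E[Y]=L\,\mathbb E[X]=0$, the covariance of $Y$ reduces to $\mathbb E[YY^T]$:

$$
Cov[Y,Y] = \mathbb E[YY^T] = \mathbb E[LX(LX)^T] = \mathbb E[LXX^TL^T] = L\,\mathbb E[XX^T]\,L^T = L I L^T = LL^T = \Sigma
$$

So multiplying uncorrelated standard normal draws by $L$ produces variables whose covariance matrix is $\Sigma$.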


In [1]:
import pandas as pd

In [2]:
from IPython.display import display, Math, Latex, IFrame
import pandas as pd
#import pandas.io.data as pd_io
from pandas_datareader import data, wb
import numpy as np
import scipy as sci
n=10000000
# Work on a plain NumPy array so the matrix product is purely positional
G=np.random.normal(size=(n,5))
m=pd.DataFrame(G.T @ G)  # 5x5 sum of outer products x x^T over all draws
display(Math(r'Demonstration~of~~ \mathbb E[XX^T] = I, ~~where~X\sim N(0,1)'))
print(m/n)  # sample estimate of E[XX^T], which should be close to the identity

In [42]:
import datetime as dt
import numpy as np
import pandas as pd
import scipy as sci
import pandas_datareader.data as web
from pandas_datareader.iex.daily import IEXDailyReader

AV_Token='D9TF0WKQJ9EMXK3N' # Alpha Vantage API token
IEX_token='sk_50446876d4364cc8b218b298207a8e8f'
symbols = ['GM', 'TSLA','AAPL', 'GOOG', 'MSFT','ZNGA', 'VIXY', 'SPY']
stocks=['WDC', 'AAPL', 'IBM', 'MSFT', 'ORCL']
# p=data.DataReader(stocks,data_source='google')#[['Adj Close']] GOOGLE permanently broken
# p=IEXDailyReader(symbols=stocks, api_key=IEX_token).read()
p = web.DataReader('AAPL', "av-daily", api_key=AV_Token)
q = web.DataReader('WDC',  "av-daily", api_key=AV_Token)


In [56]:
d=pd.DataFrame()
for s in stocks:
    new = web.DataReader(s, "av-daily", api_key=AV_Token)
    new.columns = pd.MultiIndex.from_product([[s],new.columns ])
    d=pd.concat([d,new],axis=1)

In [59]:
cols=d.columns.to_list()
cols

[('WDC', 'open'),
 ('WDC', 'high'),
 ('WDC', 'low'),
 ('WDC', 'close'),
 ('WDC', 'volume'),
 ('AAPL', 'open'),
 ('AAPL', 'high'),
 ('AAPL', 'low'),
 ('AAPL', 'close'),
 ('AAPL', 'volume'),
 ('IBM', 'open'),
 ('IBM', 'high'),
 ('IBM', 'low'),
 ('IBM', 'close'),
 ('IBM', 'volume'),
 ('MSFT', 'open'),
 ('MSFT', 'high'),
 ('MSFT', 'low'),
 ('MSFT', 'close'),
 ('MSFT', 'volume'),
 ('ORCL', 'open'),
 ('ORCL', 'high'),
 ('ORCL', 'low'),
 ('ORCL', 'close'),
 ('ORCL', 'volume')]

In [44]:
p.columns = pd.MultiIndex.from_product([['AAPL'],p.columns ])
q.columns = pd.MultiIndex.from_product([['WDC'],q.columns ])


In [47]:
d=pd.concat([p,q],axis=1)

In [91]:
# Positional selection of the 'close' columns for all rows: start at the 4th column (index 3) and step by 5
cps=d.iloc[:,3::5]


                WDC     AAPL      IBM     MSFT     ORCL
              close    close    close    close    close
1999-11-01    3.125   77.625   96.750   92.375   12.797
1999-11-02    3.125   80.250   94.813   92.563   13.250
1999-11-03    2.938   81.500   94.375   92.000   14.328
1999-11-04    2.813   83.625   91.563   91.750   14.547
1999-11-05    2.813   88.313   90.250   91.563   14.672
...             ...      ...      ...      ...      ...
2019-10-21   58.600  240.510  132.580  138.430   55.130
2019-10-22   58.860  239.960  133.960  136.370   54.110
2019-10-23   57.650  243.180  134.380  137.240   54.130
2019-10-24   59.520  243.580  134.070  139.940   54.260
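
The positional slice works only while every ticker carries the same five fields in the same order. A label-based alternative is sketched below on a toy frame with the same (ticker, field) column layout as `d`; the frame `toy` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with (ticker, field) MultiIndex columns; values are made up
fields = ['open', 'high', 'low', 'close', 'volume']
tickers = ['WDC', 'AAPL']
columns = pd.MultiIndex.from_product([tickers, fields])
toy = pd.DataFrame(np.arange(20, dtype=float).reshape(2, 10), columns=columns)

# Positional: every 5th column starting at the 4th
by_position = toy.iloc[:, 3::5]

# Label-based: select 'close' on the second column level;
# the result has one plain column per ticker
by_label = toy.xs('close', axis=1, level=1)

print(by_position)
print(by_label)
```

The label-based form also drops the constant `close` level, which makes later calls such as `.corr()` print with ticker names only.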


In [62]:
q=web.DataReader('AAPL', "av-daily", api_key=AV_Token)

In [None]:
from pivottablejs import pivot_ui
pivot_ui(m)

In [66]:
# Closing prices with one column per ticker (drop the constant 'close' column level)
prices=cps.droplevel(1, axis=1)
#prices.pop('ATML') get rid of duff entry with NaNs!! - handy as you can just remove (and optionally save) a chunk!!
# Create 1-day log returns for data frame content
df=np.log(prices/prices.shift(1))
df=df.dropna()
print("Days:{}".format(len(df)))
corr=df.corr()
print(corr)
chol=np.linalg.cholesky(corr)  # lower-triangular L with L L^T = corr
#chol=sci.linalg.cholesky(corr, lower=True)
print(chol)
sigma=df.std()
mu=df.mean()
print("sigma=\n{}\n mu=\n{}".format(sigma,mu))
# Now generate standard normal samples; the observed means ("mu"s) and st_devs ("sigma"s) are applied further down
n_trials=1000000
G_rands=np.random.normal(size=(n_trials,len(sigma)))
# Correlate the draws: each row x becomes L x, i.e. the whole matrix is G_rands L^T
G_Corr_rand=G_rands.dot(chol.T)
# Now apply the std dev and mean by multiplication and addition, respectively - return as pandas df
G_=pd.DataFrame(G_Corr_rand * sigma.values + mu.values, columns=df.columns)
print(G_.head())
print(corr)
print(G_.corr())
df.describe().T

In [None]:
import pandas as pd
from pandas_datareader import data, wb
import numpy as np
import scipy as sci

stocks=['WDC', 'AAPL', 'IBM', 'MSFT', 'ORCL']
p=data.DataReader(stocks,data_source='yahoo')
# Assumes the reader returns columns keyed by attribute, then ticker
df=p['Adj Close'] # one column of adjusted closing prices per ticker

df=df.dropna()

cov=df.cov()
chol=np.linalg.cholesky(cov) # NumPy returns the lower factor L; sci.linalg.cholesky(cov, lower=False) gives the upper factor instead
print ('Cholesky L=\n{}, \nL^T=\n{},\nLL^T=\n{}'.format(chol, chol.transpose(), chol.dot(chol.T)))

# One row of independent standard normal draws per trial, one column per stock
G_rands=np.random.normal(size=(1000000,df.shape[1]))
# Each row x becomes L x, so the sample covariance of G_ should approach cov
G_=pd.DataFrame(G_rands.dot(chol.T), columns=df.columns)

print(G_.head())
print(cov)
print(G_.cov())

In [None]:
#Check for tiny size - LL^T should be equal to cov, so diff should be negligible
chol.dot(chol.T) - cov

In [None]:
print(np.abs(chol.dot(chol.T) - cov.values).max())  # largest absolute entry of LL^T - cov
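
The correlation-based and covariance-based cells above are two routes to the same transform: scaling the rows of the correlation Cholesky factor by the standard deviations gives the covariance Cholesky factor. The sketch below checks this on made-up inputs; `sigma` and `corr` here are hypothetical values, not the ones estimated from the stock data.

```python
import numpy as np

# Hypothetical volatilities and correlation matrix, made up for illustration
sigma = np.array([0.02, 0.01, 0.03])
corr = np.array([[1.0, 0.6, 0.3],
                 [0.6, 1.0, 0.5],
                 [0.3, 0.5, 1.0]])

# Covariance implied by the two: cov[i, j] = sigma[i] * sigma[j] * corr[i, j]
cov = np.outer(sigma, sigma) * corr

L_corr = np.linalg.cholesky(corr)
L_cov = np.linalg.cholesky(cov)

# Multiply row i of the correlation factor by sigma[i]
scaled = sigma[:, None] * L_corr

print(np.abs(scaled - L_cov).max())
```

So factorising the correlation matrix and then multiplying each simulated column by its standard deviation should give the same distribution as factorising the covariance matrix directly.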