# Some Matrix calculus
### Differentiating a linear form

### Differentiating a quadratic form


# (Ordinary) Linear Regression
### Univariate regression
Consider an independent variable $x$ and dependent variable $y$. A **univariate regression** model assumes every observation obeys the following equation:
\begin{equation*}
    y=\beta_1 x+ \beta_0 + \epsilon
\end{equation*}
where $\beta_1$ and $\beta_0$ are linear regression coefficients and $\epsilon$ is the residual term. 

To solve for the best-fit coefficients, we consider minimizing the **sum of squared errors**.
\begin{equation*}
SSE = \sum_{i=1}^n(y_i-(\beta_0+\beta_1 x_i))^2
\end{equation*}
where each $(x_i,y_i)$ is an observation.

By differentiation of the $SSE$ with respect to $\beta_0$ and $\beta_1$, we get the following closed form solutions to the best estimators for the coefficients given the data:
\begin{array}{rl}
\beta_1 &= \displaystyle\frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{\sum(x_i-\bar{x})^2} \\\\
\beta_0 &= \bar{y}-\beta_1\bar{x}
\end{array}

If the residual/error terms were assumed to follow a normal distribution, then the same formula above can also be derived via a maximum likelihood interpretation.

The **coefficient of determination** measures how much of the variance in $y$ is explained by $x$ based on the linear model. In univariate regression, this is the same as the square of the sample correlation between $x$ and $y$.
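
As a quick numerical check of the closed-form estimators and of the claim that $R^2$ equals the squared sample correlation, here is a minimal NumPy sketch. The synthetic data, the seed and the definition $R^2 = 1 - SSE/SST$ (the usual one, not spelled out above) are assumptions made for illustration.

```python
import numpy as np

# Synthetic data (assumed for illustration): true beta_1 = 2, beta_0 = 1
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=200)

# Closed-form univariate estimators
x_bar, y_bar = x.mean(), y.mean()
beta_1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta_0 = y_bar - beta_1 * x_bar

# Coefficient of determination, assuming R^2 = 1 - SSE / SST
sse = np.sum((y - (beta_0 + beta_1 * x)) ** 2)
sst = np.sum((y - y_bar) ** 2)
r_squared = 1.0 - sse / sst

# Sample correlation between x and y
corr = np.corrcoef(x, y)[0, 1]
print(beta_1, beta_0, r_squared, corr ** 2)
```

The last two printed values should agree up to floating point error.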

### Multivariate regression (Matrix form)
#### Derivation
**Multivariate regression** is the extension of univariate regression to $n$ independent variables and $m$ dependent variables. This is given by the model below:
\begin{equation*}
y=\beta^Tx + \epsilon
\end{equation*}
where $y=(y_1,..,y_m)^T\in\mathbb{R}^m$ and $x=(1,x_1,..,x_n)^T\in\mathbb{R}^{n+1}$ are random vectors, $\beta\in\mathbb{R}^{(n+1)\times m}$ is the matrix of regression coefficients (one column per dependent variable) and $\epsilon\in\mathbb{R}^m$ is the vector of residuals. With a single dependent variable ($m=1$), $\beta=(\beta_0,\beta_1,...,\beta_n)^T\in\mathbb{R}^{n+1}$ is a vector and the model reads $y=x^T\beta+\epsilon$.

Given $p$ data points, we stack the observations as rows and derive the best-fit regression coefficients by minimizing the sum of squared residuals, just as in the univariate case:
\begin{array}{rl}
Y &= \displaystyle X\beta + \epsilon ; \quad Y\in\mathbb{R}^{p\times m}, X\in\mathbb{R}^{p\times (n+1)} \\\\
\implies SSE(\beta) &= \displaystyle\epsilon^T\epsilon \\\\
&= (Y - X\beta)^T(Y - X\beta) \\\\
\hat{\beta} &=\displaystyle\argmin_\beta SSE(\beta) \text{ (Best estimator)} \\\\
\displaystyle\nabla_\beta SSE=\frac{\partial SSE}{\partial\beta} &=\displaystyle -2X^T(Y-X\beta) \text{ (Matrix differentiation)} \\\\
\nabla_\beta SSE = 0 \implies X^TX\hat{\beta} &= X^TY \text{ (Normal equations)} \\\\
\implies \hat{\beta} &= \displaystyle(X^TX)^{-1}X^TY \text{ (Assuming invertibility of $X^TX$)}
\end{array}
The steps are written for a single dependent variable ($m=1$). For $m>1$ the same steps apply with $SSE=\operatorname{tr}(\epsilon^T\epsilon)$, and the formula for $\hat{\beta}$ is unchanged since each column of $Y$ is regressed on $X$ separately.
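
The derivation can be checked numerically. The sketch below assumes a single dependent variable, synthetic data and an explicit column of ones for the intercept. It solves the normal equations $X^TX\hat{\beta}=X^TY$ with a linear solve (rather than forming the inverse) and compares against NumPy's least squares routine.

```python
import numpy as np

# Synthetic data (assumed): p observations, n features plus an intercept column
rng = np.random.default_rng(1)
p, n = 100, 3
features = rng.normal(size=(p, n))
X = np.column_stack([np.ones(p), features])      # shape (p, n + 1)
beta_true = np.array([0.5, 1.0, -2.0, 3.0])
Y = X @ beta_true + rng.normal(scale=0.1, size=p)

# Normal equations: solve (X^T X) beta = X^T Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Reference solution from NumPy's least squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

# The gradient -2 X^T (Y - X beta) should vanish at the estimator
gradient = -2 * X.T @ (Y - X @ beta_hat)
print(beta_hat, np.abs(gradient).max())
```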

Sidenote: There exists an iterative algorithm called **least mean squares** (LMS) which updates the coefficients incrementally with a gradient step for each new observation, instead of recomputing the matrix products and inverse, making each update much cheaper.

#### Basic case: $Y\in\mathbb{R},X\in\mathbb{R}^2$ (Good to memorize by heart)
Suppose that we have two independent variables for one dependent with all the bias/intercept terms zeroed/assumed to be zeroed. Then we can represent our data as matrices $X\in\mathbb{R}^{p\times 2}$ and $Y\in\mathbb{R}^{p\times 1}$.
\begin{array}{rl}
X^TX &= \begin{bmatrix}a & b\\ c & d \end{bmatrix}, X^TY = \begin{bmatrix}e \\ f \end{bmatrix}\\\\
\implies (X^TX)^{-1} &= \displaystyle \frac{1}{ad-bc}\begin{bmatrix}d & -b\\ -c & a\end{bmatrix} \\\\
\implies \hat{\beta} &= \displaystyle \frac{1}{ad-bc}\begin{bmatrix}d & -b\\ -c & a\end{bmatrix}\begin{bmatrix}e \\ f \end{bmatrix}
\end{array}
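
To see the $2\times 2$ formula at work, the following sketch (synthetic data and variable names are assumptions) computes $\hat{\beta}$ from the entries $a,b,c,d,e,f$ and compares it with a generic linear solve.

```python
import numpy as np

# Synthetic two-regressor data without an intercept (assumed for illustration)
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))
Y = X @ np.array([1.5, -0.5]) + rng.normal(scale=0.1, size=50)

# Entries of X^T X and X^T Y, named as in the formulas above
(a, b), (c, d) = X.T @ X
e, f = X.T @ Y

# Explicit 2x2 inverse applied to (e, f)
det = a * d - b * c
beta_hat_2x2 = np.array([d * e - b * f, a * f - c * e]) / det

# Generic solve for comparison
beta_ref = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_hat_2x2, beta_ref)
```

Since $X^TX$ is symmetric, $b=c$ here, which is a handy sanity check when doing this by hand.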

### Non-exhaustive considerations for OLS:
#### Coefficients as random variables based on data
Expected value
Variance

#### Linearity assumption

#### Homo/Heteroskedasticity, Exogeneity, Orthogonality and other assumptions about errors/residuals

#### Issues of singular matrices/correlation/collinearity/multicollinearity

#### $R^2$ and Spurious regression

#### Frisch–Waugh–Lovell theorem

#### Some special cases (easily derivable but good to know for intuition):
Regression coefficients of Y against X vs X against Y

Perfectly correlated columns

Orthogonal regressors

Degrees of freedom and sample points

Adding a shift

### Regularization
#### Definition
Any model $f$ has a variance (how much its predictions fluctuate across different training samples) and a bias (how far its predictions are, on average, from the true values). The **bias-variance tradeoff** is the phenomenon where attempting to decrease bias leads to an increase in variance and vice versa. The two extremes of this tradeoff are underfitting (high bias, low variance) and overfitting (low bias, high variance).

**Regularization** refers to a family of techniques that make optimization problems more robust to overfitting, i.e. they introduce some bias in exchange for reduced variance. These techniques broadly fall into two categories:
- **explicit** techniques where penalty terms or constraints are added to the loss/objective functions when computing parameters;
- and **implicit** techniques where the algorithm for optimization does the regularization, e.g. early stopping and dropout in training neural nets.

#### L2 penalty (Ridge)
In **ridge regression**, the optimization objective for the regression coefficients includes the **L2 penalty**, which is the squared $L2$-norm of the coefficients scaled by a parameter $\lambda\geq 0$. (When $\lambda=0$ we get back the regular OLS formula.)
\begin{equation*}
\hat{\beta} = \argmin_{\beta} SSE + \lambda \beta^T\beta = \argmin_{\beta} J(\beta)
\end{equation*}
where $J(\cdot)$ denotes the objective function.

Applying matrix differentiation and setting the gradient to zero, we get a closed-form formula for the coefficients:
\begin{array}{rl}
\nabla_\beta J(\beta) &= -2X^T(Y-X\beta) + 2\lambda\beta \\\\
\nabla_\beta J(\beta) = 0 \implies (X^TX+\lambda I)\hat{\beta} &= X^TY \\\\
\implies \hat{\beta} &= (X^TX+\lambda I)^{-1}X^TY
\end{array}
where $I$ denotes the identity matrix of the same size as $X^TX$.
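
A minimal sketch of the ridge closed form follows. The helper name `ridge_closed_form`, the synthetic data and the choice $\lambda=10$ are assumptions. For simplicity there is no intercept column, so every coefficient is penalized.

```python
import numpy as np

def ridge_closed_form(X, Y, lam):
    """Return (X^T X + lam * I)^{-1} X^T Y, computed with a linear solve."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ Y)

# Synthetic data (assumed for illustration)
rng = np.random.default_rng(3)
X = rng.normal(size=(80, 4))
Y = X @ np.array([2.0, -1.0, 0.0, 0.5]) + rng.normal(scale=0.3, size=80)

beta_ols = ridge_closed_form(X, Y, lam=0.0)      # lam = 0 recovers OLS
beta_ridge = ridge_closed_form(X, Y, lam=10.0)   # lam > 0 shrinks the coefficients

# Gradient of J at the ridge solution should vanish
ridge_gradient = -2 * X.T @ (Y - X @ beta_ridge) + 2 * 10.0 * beta_ridge
print(np.linalg.norm(beta_ols), np.linalg.norm(beta_ridge))
```

The norm of the ridge solution is smaller than that of the OLS solution, which is the shrinkage effect the penalty is designed to produce.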

The purpose of the L2 penalty is first and foremost to keep coefficient values small, which lowers variance and reduces overfitting. Besides this, it also makes the matrix inversion more numerically stable: for $\lambda>0$ the matrix $X^TX+\lambda I$ is always invertible, which is important when dealing with high dimensions and collinearity.

Its primary drawback is that it is poor at feature selection: it shrinks the coefficients of irrelevant predictors towards zero but, unlike the L1 penalty, generally does not set them exactly to zero.

#### L1 penalty (Lasso)
In **Lasso** (least absolute shrinkage and selection operator) regression, we have as a penalty term the $L1$-norm of the coefficients scaled by a parameter $\lambda$. 
\begin{equation*}
\hat{\beta} = \argmin_{\beta} SSE + \lambda ||\beta||_1
\end{equation*}

Since the L1 penalty involves absolute values, it tends to shrink some coefficients exactly to 0, which amounts to feature selection.

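When only a single coefficient $\beta_j$ is free (the univariate case, or one coordinate of $\beta$ with all the others held fixed), the problem is solvable by a case-by-case differentiation with respect to $\beta_j$ on either side of the kink at $0$ (formally, by using the subgradient). This leads to the soft-thresholding operator, a closed-form solution to the 1D problem:
\begin{equation*}
\hat{\beta}_j = \operatorname{sign}(z)\max(|z|-\gamma,0)
\end{equation*}
where $z = \frac{X_j^T(y-X_{-j}\beta_{-j})}{X_j^TX_j}$ is the OLS coefficient of column $X_j$ on the partial residual, $X_{-j}$ and $\beta_{-j}$ denote $X$ and $\beta$ with the $j^{th}$ column/entry removed, and $\gamma = \frac{\lambda}{2X^T_jX_j}$. The factor of $2$ comes from differentiating the unscaled $SSE$ in the objective above.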

Since the L1 norm is not differentiable at $0$, the regression coefficients for multivariate regression have no general closed form and are solved via iterative optimization algorithms. One of the most popular algorithms is called **coordinate descent** which breaks down into the following steps:
- Pick a coefficient $\beta_j$ to update in the vector
- Fix all other coefficients
- Solve the 1D Lasso problem on just $\beta_j$ via soft-thresholding
- Repeat with other coefficients till convergence
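
The steps above can be sketched as cyclic coordinate descent. This is an illustrative implementation under stated assumptions, not a tuned solver. It targets the objective $SSE+\lambda||\beta||_1$ exactly as written (no factor of $\frac{1}{2}$ on the $SSE$), for which differentiating gives a threshold of $\lambda/(2X_j^TX_j)$. The function names, data and $\lambda=50$ are hypothetical choices.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Soft-thresholding operator: sign(z) * max(|z| - gamma, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iters=200, tol=1e-8):
    """Minimise ||y - X beta||^2 + lam * ||beta||_1 by cyclic coordinate descent."""
    n_features = X.shape[1]
    beta = np.zeros(n_features)
    col_sq_norms = np.sum(X ** 2, axis=0)            # X_j^T X_j for each column
    for _ in range(n_iters):
        beta_old = beta.copy()
        for j in range(n_features):
            # Residual with the contribution of coordinate j added back in
            partial_residual = y - X @ beta + X[:, j] * beta[j]
            z = X[:, j] @ partial_residual / col_sq_norms[j]
            gamma = lam / (2.0 * col_sq_norms[j])    # threshold for unscaled SSE
            beta[j] = soft_threshold(z, gamma)
        if np.max(np.abs(beta - beta_old)) < tol:    # stop once updates stall
            break
    return beta

# Synthetic data (assumed): only features 0 and 3 are relevant
rng = np.random.default_rng(10)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, 0.0, -2.0, 0.0]) + rng.normal(scale=0.1, size=100)

beta_lasso = lasso_coordinate_descent(X, y, lam=50.0)
print(beta_lasso)
```

On this data the irrelevant coefficients should come out exactly zero, while the relevant ones are shrunk slightly towards zero.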

Like the L2 penalty, the L1 penalty also aims to reduce variance at the cost of bias. However, it also helps to perform feature selection by encouraging interpretable models through sparsity (which is when coefficients go to 0). 

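Its primary drawback is that it requires iterative algorithms, which makes it slower to compute than ridge. It also handles collinearity poorly: among a group of highly correlated predictors it tends to keep one and drop the rest, and which one it keeps can be unstable.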

Sidenote: You can add both L1 and L2 penalties into the objective function. This is called **elastic net** and its regression coefficients can also be solved via coordinate descent.

#### Hyperparameter tuning: determining penalty parameters
The penalty parameter $\lambda$ is known as a hyperparameter since it is not calculated during the regression on the data itself. One way to optimize $\lambda$ from a set of candidate values is through **model evaluation** techniques which determine general model performance. Below are three such techniques that help quantify **generalization error**:
- Cross validation: This is the process of splitting the data into multiple "folds", then repeatedly fitting on all but one fold and testing on the held-out fold, averaging the errors across folds to estimate the model's error.
- AIC and BIC criteria: AIC and BIC stand for Akaike and Bayesian information criterion respectively. Both combine how well a model fits with its complexity, whereby the lower the value the better.
\begin{array}{rl}
AIC &= 2k-2\ln(L(\hat{\theta}|x)) \\\\
BIC &= k\ln(n)-2\ln(L(\hat{\theta}|x))
\end{array}
where $k$ is the number of model parameters, $n$ is the number of observations, and $L$ is the likelihood given the fitted parameters. In both cases, they favour simpler models over complex ones, with BIC penalizing complexity more whenever $\ln(n)>2$, i.e. $n\geq 8$.
- Bootstrapping: This is the process of fitting models on bootstrap samples and then testing their accuracy on points outside the samples, repeating to get an average error.
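
As an illustration of the cross validation approach, the sketch below picks $\lambda$ for ridge regression from a candidate grid by 5-fold cross validation. The helper names, the candidate grid and the synthetic data are assumptions. In practice a library routine would normally be used instead.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge closed form (all coefficients penalised, no intercept)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def k_fold_mse(X, y, lam, k=5, seed=0):
    """Average held-out mean squared error over k folds."""
    indices = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(indices, k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        beta = ridge_fit(X[train_idx], y[train_idx], lam)    # fit on k - 1 folds
        errors.append(np.mean((y[test_idx] - X[test_idx] @ beta) ** 2))
    return float(np.mean(errors))

# Synthetic data (assumed): 10 features, only the first two matter
rng = np.random.default_rng(4)
X = rng.normal(size=(60, 10))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=60)

candidate_lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
cv_errors = {lam: k_fold_mse(X, y, lam) for lam in candidate_lambdas}
best_lambda = min(cv_errors, key=cv_errors.get)
print(cv_errors, best_lambda)
```

The same seed is used for every candidate so that all values of $\lambda$ are compared on identical folds.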

#### Importance of data standardization
**Standardization** is the idea of shifting and scaling the data so that all variables are on a comparable scale. Some of the most common approaches to this are:
- **min-max standardization** where you scale every column/variable between 0 and 1 based on minimum and maximums;
- **normalization** where you shift the data by the mean and divide by the standard deviation.

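When introducing L1 and L2 penalties, standardization is important since penalization should operate uniformly across all coefficients. Without it, the amount of shrinkage depends on each feature's units: a feature measured on a small scale needs a large coefficient and is therefore penalized more heavily than an equally informative feature measured on a large scale. We also need to make sure the intercept isn't penalized, since it only sets the overall level of the predictions and shrinking it would make the fit depend on where the origin of $y$ is placed.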

Proper scaling also allows better convergence for optimization, and better interpretability of coefficients as measures of relative feature importance.
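
Both scalings described above can be sketched in a few lines. The function names and the two-column synthetic data (with deliberately mismatched scales) are assumptions.

```python
import numpy as np

def z_score(X):
    """Shift each column by its mean and divide by its standard deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / std, mean, std

def min_max(X):
    """Rescale each column to the interval [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)

# Two features on very different scales (assumed for illustration)
rng = np.random.default_rng(5)
X_raw = np.column_stack([
    rng.normal(loc=1000.0, scale=250.0, size=100),
    rng.normal(loc=0.0, scale=0.01, size=100),
])

X_std, col_mean, col_std = z_score(X_raw)
X_mm = min_max(X_raw)
print(X_std.mean(axis=0), X_std.std(axis=0), X_mm.min(axis=0), X_mm.max(axis=0))
```

When a model is evaluated on held-out data, the means and standard deviations should be computed on the training data only and then reused on the test data, which is why `z_score` returns them.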

# Time series [[src](https://www.statslab.cam.ac.uk/~rrw1/timeseries/t.pdf)]

### Definitions
A **time series** is a set of statistics/data $\lbrace X_t\rbrace_{t\geq 0}$ collected at regular intervals. Time series analysis is the process of summarizing time series data, making forecasts, and fitting models.

### Classical decomposition
A simple decomposition of a time series is into 4 components: trend (long term drift), seasonality (calendar based fluctuations), cycles (non calendar fluctuations), and residuals (random fluctuations). These can be interpreted as four separate processes whose sum (additive model) or product (multiplicative model) gives the observed process.

### Stationarity
The **stationarity** of a time series refers to how time invariant its properties are. A time series can either be weakly or strongly stationary:
- A **strongly stationary** process $\lbrace X_t\rbrace_{t\geq 0}$ satisfies the property that its joint distributions are invariant to time shifts, so in particular all moments that exist (mean, variance, skew, kurtosis, etc.) stay the same over time. Formally:
\begin{equation*}
(X_{t_1},...,X_{t_k})\stackrel{\mathcal{D}}{=}(X_{t_1+h},...,X_{t_k+h})
\end{equation*}
for all positive integers $k$, shifts $h$ and indices $t_i$.
- A **weakly/second order stationary** process $\lbrace X_t\rbrace_{t\geq 0}$ only requires the mean and autocovariance to be invariant under time shifts. Formally:
\begin{array}{rl}
\mathbb{E}[X_t]&=\mu\\\\
Var(X_t)&=\gamma(0)=\sigma^2 < \infty\\\\
Cov(X_t,X_{t+h}) &= \gamma(h)
\end{array}
where $\gamma(h)$ is the autocovariance function, which depends only on the lag $h$ and not on $t$.

Just as **autocovariance** describes covariance between different lagged values in a time series, the **autocorrelation** describes the correlation between these same values. Given an autocovariance function $\gamma(h)$, the autocorrelation $\rho_k$ at lag $k$ is given by:
\begin{equation*}
\rho_k=\gamma(k)/\gamma(0)=Corr(X_t,X_{t+k})
\end{equation*}
This uses the fact that, under weak stationarity, both $X_t$ and $X_{t+k}$ have standard deviation $\sqrt{\gamma(0)}$.
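
In practice $\gamma$ and $\rho$ are estimated from data. The sketch below uses the common biased estimator that divides by $n$. This choice, the function names and the simulated uncorrelated noise are assumptions.

```python
import numpy as np

def sample_autocovariance(x, lag):
    """Biased sample estimate of gamma(lag): divides by n rather than n - lag."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    centered = x - x.mean()
    return np.sum(centered[: n - lag] * centered[lag:]) / n

def sample_autocorrelation(x, lag):
    """rho_k = gamma(k) / gamma(0)."""
    return sample_autocovariance(x, lag) / sample_autocovariance(x, 0)

# Uncorrelated Gaussian noise (assumed): autocorrelations at lag >= 1 should be near 0
rng = np.random.default_rng(6)
noise = rng.normal(size=2000)
rho = [sample_autocorrelation(noise, k) for k in range(4)]
print(rho)
```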

### Some basic time series models
#### Autoregressive, Moving average and white noise processes
An **autoregressive** process of order $p$, $AR(p)$, is one where the current value in the time series depends linearly on prior values up to $p$ steps ago.
\begin{equation*}
X_t = \sum_{r=1}^p\phi_rX_{t-r}+\epsilon_t
\end{equation*}
where $\phi_r$ are all constants and $\epsilon_t$ is noise with mean $0$ and variance $\sigma^2$.

A **moving average** process of order $q$, $MA(q)$, is a time series whose current value is a linear combination of white noise/error terms:
\begin{equation*}
X_t = \sum_{s=0}^q\theta_s\epsilon_{t-s}
\end{equation*}
where $\theta_s$ are all fixed constants (conventionally $\theta_0=1$), and $\epsilon_t$ are "innovations" with mean $0$ and variance $\sigma^2$.
Note that all moving average processes of finite order are weakly stationary, with strong stationarity occurring if the $\epsilon_t$ are also iid.

**Invertibility** is an important property for MA processes: an $MA$ process is invertible if $\epsilon_t$ can be expressed as a convergent linear combination of the current and past values of $X_t$, i.e. if $X_t$ can be written as an $AR(\infty)$ process. Invertibility lets you recover the residual process from the data, which would otherwise be difficult even with estimates of $\theta_s$.

**White noise** corresponds to the residuals or error terms $\epsilon_t$ in the previous equations. These are uncorrelated random variables (zero autocovariance and autocorrelation at every nonzero lag) with mean 0 and variance $\sigma^2$. This makes white noise weakly stationary. If the terms are also iid, then the process is strongly stationary as well.

#### ARMA, ARIMA
**ARMA(p,q)** or autoregressive moving average processes are a combination of both $AR(p)$ and $MA(q)$ processes:
\begin{equation*}
X_t =\sum_{r=1}^p\phi_rX_{t-r} + \sum_{s=0}^q\theta_s\epsilon_{t-s}
\end{equation*}

**ARIMA(p,d,q)** or autoregressive integrated moving average processes are processes $X_t$ for which there exists an integer $d\geq 0$, known as the order of integration, such that the $d^{th}$ order difference $\nabla^dX_t$ is an $ARMA(p,q)$ process, where $\nabla X_t = X_t - X_{t-1}$.
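
To make the relationship between ARMA and ARIMA concrete, the sketch below simulates an $ARMA(1,1)$ path, integrates it once with a cumulative sum to obtain an $ARIMA(1,1,1)$ path, and then recovers the ARMA values by first differencing. The parameters, the initial condition $X_0=\epsilon_0$ and the function name are assumptions.

```python
import numpy as np

def simulate_arma11(phi, theta, n, rng):
    """Simulate X_t = phi * X_{t-1} + eps_t + theta * eps_{t-1}, with X_0 = eps_0."""
    eps = rng.normal(size=n)
    x = np.zeros(n)
    x[0] = eps[0]
    for t in range(1, n):
        x[t] = phi * x[t - 1] + eps[t] + theta * eps[t - 1]
    return x

rng = np.random.default_rng(7)
arma = simulate_arma11(phi=0.6, theta=0.3, n=500, rng=rng)

# Integrating once (d = 1) gives an ARIMA(1, 1, 1) path
arima = np.cumsum(arma)

# First differencing undoes the integration, except for the first value
differenced = np.diff(arima)
print(np.allclose(differenced, arma[1:]))
```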

### Cointegrated processes [[src](https://www.uh.edu/~bsorense/coint.pdf)]
Two processes $X_t$ and $Y_t$ are **cointegrated** if both have order of integration 1 and there exists a parameter $\alpha$ such that $Y_t-\alpha X_t$ is a stationary process. This idea extends to multiple processes $X_t^i$, which are cointegrated if some nontrivial linear combination of them is stationary.

The further generalization of cointegration is **multicointegration** which looks at the cointegration between processes of different orders of integration.

### Some statistical tests
#### Stationarity
To test for non-stationarity, the **Augmented Dickey-Fuller** or ADF test can be used. Its null hypothesis is that the process has a unit root, i.e. that 1 is a root of the characteristic polynomial of the model. If the p-value from the ADF test is below the chosen significance level, the null hypothesis is rejected and the data suggests that the process is stationary.

The **Engle-Granger** test looks at whether two series are cointegrated by performing OLS on the data followed by ADF on the residuals. 

The extension of this is the Johansen test which can test multiple time series for cointegration.
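
The two-step structure of the Engle-Granger procedure can be sketched with NumPy alone. This is a simplified illustration under several assumptions: the second step uses a plain Dickey-Fuller regression with no lagged differences (so it is not the full augmented test), no p-value is computed, and the data and function names are made up. The resulting statistic would have to be compared against Engle-Granger critical values, which are more negative than the standard Dickey-Fuller ones because the residuals are estimated.

```python
import numpy as np

def ols_with_intercept(x, y):
    """Regress y on (1, x); return the coefficients and the residuals."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef, y - design @ coef

def dickey_fuller_t_stat(series):
    """t-statistic of rho in: delta e_t = rho * e_{t-1} + u_t (no lags, no constant)."""
    lagged = series[:-1]
    delta = np.diff(series)
    rho = (lagged @ delta) / (lagged @ lagged)
    resid = delta - rho * lagged
    sigma2 = (resid @ resid) / (len(delta) - 1)
    return rho / np.sqrt(sigma2 / (lagged @ lagged))

rng = np.random.default_rng(8)
x = np.cumsum(rng.normal(size=500))     # random walk, integrated of order 1
y = 2.0 * x + rng.normal(size=500)      # cointegrated with x (alpha = 2)
w = np.cumsum(rng.normal(size=500))     # independent random walk

# Step 1: OLS of y on x. Step 2: unit root statistic on the residuals.
(_, alpha_hat), spread = ols_with_intercept(x, y)
t_coint = dickey_fuller_t_stat(spread)

# Same procedure on an unrelated pair, for contrast
_, unrelated_resid = ols_with_intercept(x, w)
t_unrelated = dickey_fuller_t_stat(unrelated_resid)
print(alpha_hat, t_coint, t_unrelated)
```

The cointegrated pair should give a far more negative statistic than the unrelated pair. For real work, `adfuller` and `coint` in `statsmodels.tsa.stattools` implement the full tests with proper critical values.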

#### Differentiating models
The **turning point test** is a test that checks for white noise. It counts the number of turning points $T$ in the series, where $X_t$ is a turning point if it is a strict local maximum or minimum among the 3 successive values $X_{t-1},X_t,X_{t+1}$. Under the null hypothesis of (iid) white noise and for a large enough number of sample points $n$, $T$ approximately follows $N(2n/3,8n/45)$, and thus we can reject the null hypothesis when the observed count lies far in the tails of this distribution.
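
A minimal sketch of the turning point test, using the large-sample $N(2n/3,8n/45)$ approximation quoted above and a two-sided p-value. The function name and the two simulated series are assumptions.

```python
import math
import numpy as np

def turning_point_test(x):
    """Return (number of turning points, z-score, two-sided p-value)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    left, mid, right = x[:-2], x[1:-1], x[2:]
    is_peak = (mid > left) & (mid > right)
    is_trough = (mid < left) & (mid < right)
    count = int(np.sum(is_peak | is_trough))
    mean = 2.0 * n / 3.0                 # large-sample mean under the null
    var = 8.0 * n / 45.0                 # large-sample variance under the null
    z = (count - mean) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))   # equals 2 * (1 - Phi(|z|))
    return count, z, p_value

rng = np.random.default_rng(11)
noise = rng.normal(size=1000)                # iid noise: should not be rejected
walk = np.cumsum(rng.normal(size=1000))      # random walk: too few turning points

count_noise, z_noise, p_noise = turning_point_test(noise)
count_walk, z_walk, p_walk = turning_point_test(walk)
print(count_noise, p_noise, count_walk, p_walk)
```

A random walk turns only when successive increments change sign, so it has roughly $n/2$ turning points instead of $2n/3$, and the test rejects it decisively.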

When plotting the autocorrelations, if there is a cutoff after lag $q$, then it suggests that the time series fits an $MA(q)$ model.

When plotting the **partial autocorrelation**, defined as the autocorrelation with the effects of intermediate lags from 1 to $k-1$ removed, a cutoff at lag $p$ suggests that the time series fits an $AR(p)$ model.

In combination, these can help to identify whether a series is better described by an $ARMA(p,q)$ model, which is typically suggested when both plots tail off gradually rather than cut off.

To test for the invertibility of an $MA(q)$ process, we need to consider its **characteristic polynomial**. Given $X_t = \sum_{s=0}^q\theta_s\epsilon_{t-s}$ with $\theta_0=1$, we have the polynomial:
\begin{equation*}
f(L) = 1 + \sum_{i=1}^q\theta_iL^i
\end{equation*}
where $L$ is the lag operator, defined by $L\epsilon_t = \epsilon_{t-1}$, so that $X_t=f(L)\epsilon_t$. If all the roots of the polynomial lie outside the unit circle of the complex plane, then the process is invertible.
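
The root condition is easy to check numerically. The sketch below assumes $\theta_0=1$ and takes the remaining coefficients $\theta_1,\dots,\theta_q$ as input. The function name is hypothetical.

```python
import numpy as np

def ma_is_invertible(thetas):
    """Check whether all roots of 1 + theta_1 L + ... + theta_q L^q lie outside the unit circle.

    thetas = [theta_1, ..., theta_q]; theta_0 is taken to be 1.
    """
    # np.roots expects coefficients ordered from the highest power down to the constant
    coeffs = list(thetas[::-1]) + [1.0]
    roots = np.roots(coeffs)
    return bool(np.all(np.abs(roots) > 1.0))

print(ma_is_invertible([0.5]))         # root at L = -2
print(ma_is_invertible([2.0]))         # root at L = -0.5
print(ma_is_invertible([0.5, 0.06]))   # roots at L = -5 and L = -10/3
```

For an $MA(1)$ process this reduces to the familiar condition $|\theta_1|<1$.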

# Statistical Arbitrage via pairs trading [[src](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=141615)]


# Statistical Arbitrage via ETFs


# Statistical Arbitrage using factors: PCA [[src](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1153505)]

### Assumed returns model


### PCA
**PCA** stands for **principal component analysis** and is an unsupervised way of reducing the dimensionality of data. At its core, it aims to find the **principal components**, a list of orthogonal vectors, each giving the coefficients of a linear combination of the $n$ input random variables, that best describe the variance of the data. The heuristic underlying this is that these principal components are the most important directions since they describe the majority of the variance.

Consider an $m\times n$ matrix $X$ of data representing $m$ sample points for $n$ variables. Assuming everything has been mean-centered (i.e. each column has had the mean $\mu_i$ subtracted to center values around 0), we can derive the covariance matrix of the $n$ variables:
\begin{equation}
\Sigma = \frac{1}{m}X^TX
\end{equation}

For any unit vector $u$ (i.e. $||u|| = 1$), we have that the variance of $u^Tx$, the linear combination of each of the $n$ random variables, is given by
\begin{equation}
Var(u^Tx) = u^T\Sigma u
\end{equation}

The principal components are ordered by how much variance they capture, from highest to lowest. So the problem for the first principal component $u_1$ is given by:
\begin{equation}
u_1 = \argmax_{u\in\mathbb{R}^n, u\neq 0}\frac{u^T\Sigma u}{u^Tu}
\end{equation}

The second, third and general $i^{th}$ components are then described by the following:
\begin{equation}
u_i = \argmax_{u\in\mathbb{R}^n;||u||=1, u \perp u_j \forall 1\leq j < i}\frac{u^T\Sigma u}{u^Tu}
\end{equation}

The quotient in the maximization problem is known as the **Rayleigh quotient**. By the **Courant-Fischer theorem**, its maximum value is the largest eigenvalue of $\Sigma$ and it is attained at the corresponding eigenvector. Subsequently, the $i^{th}$ principal component is the eigenvector corresponding to the $i^{th}$ largest eigenvalue.

Two of the most common ways of solving for the principal components, therefore, are to directly perform either
- **eigenvalue decomposition**: $\Sigma = U\Lambda U^T$ where $\Lambda$ is diagonal and $U$ is orthogonal (its columns are the principal components);
- or **SVD (singular value decomposition)**: $X=U\Sigma_sV^T$ where $U$ and $V$ are orthogonal matrices and $\Sigma_s$ is diagonal. Under SVD, the covariance matrix is $\Sigma = \frac{1}{m}V\Sigma_s^T\Sigma_s V^T$, so the principal components are the columns of $V$.

Note that **SVD** is generally preferred for its numerical stability, since it avoids forming $X^TX$ explicitly.
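
Both routes can be compared numerically. The sketch below (synthetic data and mixing matrix are assumptions) computes the principal components from the eigendecomposition of $\Sigma$ and from the SVD of $X$, and checks that they agree up to the sign of each component.

```python
import numpy as np

# Synthetic correlated data (assumed): m samples of n variables
rng = np.random.default_rng(9)
m, n = 500, 3
mixing = np.array([[3.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.5, 0.2, 0.3]])
X = rng.normal(size=(m, n)) @ mixing
X = X - X.mean(axis=0)                       # mean-center each column

# Route 1: eigendecomposition of the covariance matrix
cov = X.T @ X / m
eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]            # reorder from largest to smallest
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: SVD of the data matrix; rows of Vt are the principal components
U, singular_values, Vt = np.linalg.svd(X, full_matrices=False)
svd_eigvals = singular_values ** 2 / m

# Share of total variance captured by each component
explained_ratio = eigvals / eigvals.sum()

# Rayleigh quotient at the first component equals the largest eigenvalue
u1 = eigvecs[:, 0]
rayleigh_u1 = (u1 @ cov @ u1) / (u1 @ u1)
print(eigvals, svd_eigvals, explained_ratio)
```

Eigenvectors are only defined up to sign, so the comparison between the two routes should use absolute values of the inner products.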

### PCA on returns data


### Statistical arbitrage of principal components



# Extensions:
### Latency arbitrage
### Sniping arbitrage
### Stat arb using ICA
Like PCA, ICA (independent component analysis) is also a method of dimensionality reduction. 
### Risk control in stat arb
### Multiasset stat arb framework
### Stat arb with Deep ML
### The factor zoo