# Regression

## Setup

Let $i$ be an index representing a unique observational setup. A unique value of $i$ could be a unique value of time, or represent a unique bundle of variables that are set when we run the experiment, or anything else that uniquely encodes our observational setup when measuring our set of variables. 

Let $i\in\mathbb{I}$ where $\mathbb{I}$ is the set of all possible observational setup indices.

Let $x_{i1}, x_{i2}, ..., x_{in}, y_{i}$ be the set of variables we measure with observational setup $i$, where $x_{i1}, x_{i2}, ..., x_{in}$ are the feature/regressor/predictor variables and $y_{i}$ is the target/response variable. Let $j \in \{1, ..., n\}$ be the feature index, so that $x_{ij}$ is the $j$-th feature/regressor/predictor variable for observational setup $i$.

Define $\{X_{ij}|i\in\mathbb{I}\}$ and $\{Y_{i}|i\in\mathbb{I}\}$ to be the stochastic processes that generate the values of the feature variables $x_{ij}$ and the target variables $y_{i}$, respectively.

Let: $\mathbb{X}_{i} =  X_{i1} \times X_{i2} \times ...\times X_{in}$ and $\vec{X_{i}} \in \mathbb{X}_{i}$

Goal: model the target/response variables using the feature/regressor/predictor variables.

## Assumptions

1. Each $X_{ij}$ and $Y_{i}$ can be modeled as a random variable, with a possibly different random variable for each observational setup (value of $i$) and each feature index (value of $j$).

2. There is reason to believe that a statistical relationship exists between the random variables in the feature vector $\vec{X_{i}}$ and the target variable $Y_{i}$. This means $\vec{X_{i}}$ and $Y_{i}$ are statistically dependent: knowing something about the distribution of one tells us something about the distribution of the other. This dependence can be modeled by a function between the random variables. $$Y_{i} = f_{i}(\vec{X_{i}}, \epsilon_{i})$$ <br> where $\epsilon_{i}$ is an unmeasurable random variable that models the randomness of $Y_{i}$ that is not due to the randomness of the feature variables in $\vec{X_{i}}$ for the given observational setup $i$. The random variable $\epsilon_{i}$ is necessary because, if $Y_{i}$ depended only on the feature variables and not on $\epsilon_{i}$, then $Y_{i}$ would no longer be random once the feature vector $\vec{X_{i}}$ is known.

3. The function is independent of the observational setup.

That is, $f_{i} = f$ for every $i\in\mathbb{I}$: once the form of the function has been learned, the same function applies to every observational setup.

Usually in regression we are interested in predicting values for $Y_{i}$ from already known values of $X_{i1}, X_{i2}, ..., X_{in}$. Since $Y_{i}$ is still random even when the values of $X_{i1}, X_{i2}, ..., X_{in}$ are known (thanks to $\epsilon_{i}$), we settle for predicting the expected value of $Y_{i}$ when $X_{i1}, X_{i2}, ..., X_{in}$ are measured and known.

Thus the goal of regression is usually to learn an expression for: $E[Y_{i}|\vec{X_{i}}=\vec{x_{i}}] = E[f(\vec{X_{i}}, \epsilon_{i})|\vec{X_{i}}=\vec{x_{i}}] = E[f(\vec{x_{i}}, \epsilon_{i})|\vec{X_{i}}=\vec{x_{i}}] $
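As a rough illustration of what this conditional expectation means, the sketch below fixes a feature value and averages $Y$ over many draws of $\epsilon$. The function `f`, the normal distribution of `eps`, and the independence of `eps` from the feature are all assumptions made for the example; none of them comes from the setup above.

```python
import numpy as np

def f(x, eps):
    """Hypothetical data-generating function, chosen only for illustration."""
    return 2.0 * x + x * eps

def conditional_mean_estimate(x, n_draws=200_000, seed=0):
    """Monte Carlo estimate of E[Y | X = x]: average f(x, eps) over draws of eps."""
    rng = np.random.default_rng(seed)
    # eps is the unmeasured randomness; here it is assumed independent of X.
    eps = rng.normal(loc=0.0, scale=1.0, size=n_draws)
    return float(np.mean(f(x, eps)))

estimate = conditional_mean_estimate(3.0)
print(estimate)  # close to 2 * 3 = 6, because E[eps] = 0 in this example
```

Individual draws of $Y$ at $x = 3$ scatter widely, but their average settles near a single number, which is the quantity regression tries to learn as a function of $x$.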

## Ordinary Least Squares (OLS) Regression

In addition to the assumptions above, OLS makes the following assumptions.

### Assumptions

1. The randomness in $Y_{i}$ that comes from the randomness of the feature vector $\vec{X_{i}}$ can be additively separated from the randomness in $Y_{i}$ that comes from the unmeasurable random variable $\epsilon_{i}$:

$$Y_{i} = f(\vec{X_{i}}) + \epsilon_{i}$$


2. Strict exogeneity: the expected value of $\epsilon_{i}$ (the randomness in $Y_{i}$ **not** due to the feature variables) is zero once the feature vector $\vec{X_{k}}$ of any observational setup $k$ is known, whether or not $k = i$:

$$E[\epsilon_{i}|\vec{X_{k}}=\vec{x_{k}}] = 0 \quad \text{for all } i, k \in \mathbb{I}$$

3. $f(\vec{X_{i}})$ is linear in the components of $\vec{X_{i}}$:

$$f(\vec{X_{i}}) = \beta_{0} + \sum_{j=1}^{n}\beta_{j}X_{ij}$$

4. The components of $\vec{X_{i}}$ are linearly independent, which means that no component $X_{ij}$ can be built as a linear combination of the other components of $\vec{X_{i}}$:

$$\sum_{j=1}^{n} a_{j}X_{ij} = 0 \implies a_{j} = 0 \quad \text{for all } j$$

5. Spherical errors: conditional on the feature variables, the error variables $\epsilon_{i}$ of different observational setups are linearly uncorrelated, and each error term has the same variance $\sigma^{2}$, which does not depend on the feature variables. On its own, this assumption does not imply that the error variables of different observational setups are uncorrelated in general. It only rules out linear correlation conditional on the feature variables. There might still exist a linear relationship between the $\epsilon_{i}$s of different observational setups that is not a function of our feature variables; this could come from a feature variable that we forgot to include in our model, or it could simply be a property of the error distribution (see the law of total covariance). <br> $$COV[\epsilon_{i}, \epsilon_{j}|\vec{X_{k}}=\vec{x_{k}}, \vec{X_{l}}=\vec{x_{l}}] = \sigma^{2}\delta_{ij}$$ Expanding the definition of the conditional covariance gives two equivalent forms: $$COV[\epsilon_{i}, \epsilon_{j}|\vec{X_{k}}=\vec{x_{k}}, \vec{X_{l}}=\vec{x_{l}}]=E[(\epsilon_{i} - E[\epsilon_{i}|\vec{X_{k}}=\vec{x_{k}},  \vec{X_{l}}=\vec{x_{l}}])(\epsilon_{j} - E[\epsilon_{j}|\vec{X_{k}}=\vec{x_{k}}, \vec{X_{l}}=\vec{x_{l}}])|\vec{X_{k}}=\vec{x_{k}}, \vec{X_{l}}=\vec{x_{l}}]=\sigma^{2}\delta_{ij}$$ $$COV[\epsilon_{i}, \epsilon_{j}|\vec{X_{k}}=\vec{x_{k}}, \vec{X_{l}}=\vec{x_{l}}] = E[\epsilon_{i}\epsilon_{j}|\vec{X_{k}}=\vec{x_{k}}, \vec{X_{l}}=\vec{x_{l}}] - E[\epsilon_{i}|\vec{X_{k}}=\vec{x_{k}}, \vec{X_{l}}=\vec{x_{l}}]E[\epsilon_{j}|\vec{X_{k}}=\vec{x_{k}}, \vec{X_{l}}=\vec{x_{l}}] = \sigma^{2}\delta_{ij}$$ <br>
where $\delta_{ij} = 1$ for $i=j$ and $\delta_{ij}=0$ for $i \neq j$. Here $i$, $j$, $k$ and $l$ all index observational setups ($j$ is not the feature index in this assumption).
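The law of total covariance mentioned in assumption 5 makes the distinction between conditional and unconditional correlation explicit. As a sketch, write $\vec{X}$ as shorthand for the feature vectors being conditioned on (this shorthand is introduced here for brevity and is not used elsewhere in these notes):

$$COV[\epsilon_{i}, \epsilon_{j}] = E[COV[\epsilon_{i}, \epsilon_{j}|\vec{X}]] + COV[E[\epsilon_{i}|\vec{X}], E[\epsilon_{j}|\vec{X}]]$$

Spherical errors fix the first term at $\sigma^{2}\delta_{ij}$ but say nothing about the second term. The second term vanishes when the conditional means are constant, for instance under strict exogeneity (assumption 2), in which case $COV[\epsilon_{i}, \epsilon_{j}] = \sigma^{2}\delta_{ij}$ holds unconditionally as well.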

### Important Implications and Notes

$1. \land 2. \implies E[Y_{i}|\vec{X_{i}}=\vec{x_{i}}] = f(\vec{x_{i}})$

$E[Y_{i}|\vec{X_{i}}=\vec{x_{i}}] = E[f(\vec{X_{i}}) + \epsilon_{i}|\vec{X_{i}}=\vec{x_{i}}] = E[f(\vec{X_{i}})|\vec{X_{i}}=\vec{x_{i}}] + E[\epsilon_{i}|\vec{X_{i}}=\vec{x_{i}}] = E[f(\vec{X_{i}})|\vec{X_{i}}=\vec{x_{i}}] = f(\vec{x_{i}})$

This is important because it means that on average, the value of $Y_{i}$ is equal to a deterministic function of our features, which means that as long as we are only concerned with the average values of $Y_{i}$ for each observational setup, we have boiled our regression problem down to that of function fitting.
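The function-fitting view can be sketched with a least-squares fit. The example below generates data from a linear $f$ with made-up coefficients (`beta_0`, `beta`, the noise scale, and the normal distributions are all assumptions for illustration) and recovers them with `numpy.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs = 10_000

# Hypothetical "true" coefficients, used only to generate example data.
beta_0 = 1.0
beta = np.array([2.0, -3.0])

X = rng.normal(size=(n_obs, 2))           # one row per observational setup
eps = rng.normal(scale=0.5, size=n_obs)   # noise drawn independently of X
y = beta_0 + X @ beta + eps               # y_i = f(x_i) + eps_i

# Prepend a column of ones so that the intercept beta_0 is fitted as well.
design = np.column_stack([np.ones(n_obs), X])

# Least-squares fit: minimizes ||design @ b - y||^2 over b.
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)

def predict(x):
    """Estimated conditional mean E[Y | X = x] under the fitted linear model."""
    return float(beta_hat[0] + np.asarray(x) @ beta_hat[1:])

print(beta_hat)             # approximately [1, 2, -3]
print(predict([1.0, 1.0]))  # approximately 1 + 2 - 3 = 0
```

The fitted function predicts the average of $Y_{i}$ at a given feature vector, not the individual noisy value.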

$2. \implies E[\epsilon_{i}] = 0$

By the law of total expectation, $E[\epsilon_{i}] = E[E[\epsilon_{i}|\vec{X_{i}}]] = E[0] = 0$

This is important because it means that the randomness of our target variable that is not due to our features, averages out to zero, because it is noise.

$2. \implies E[\vec{X_{i}}\epsilon_{i}] = 0$

$E[\vec{X_{i}}\epsilon_{i}] = E[E[\vec{X_{i}}\epsilon_{i}|\vec{X_{i}}]] = E[\vec{X_{i}}E[\epsilon_{i}|\vec{X_{i}}]] = E[\vec{X_{i}} \cdot 0] = 0$

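This is important because it shows that there should be no correlation between the features and the $\epsilon$ error terms: since $E[\epsilon_{i}] = 0$, we have $COV[\vec{X_{i}}, \epsilon_{i}] = E[\vec{X_{i}}\epsilon_{i}] - E[\vec{X_{i}}]E[\epsilon_{i}] = 0$. Essentially, the random errors should not vary systematically with our features.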
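The sample counterpart of these two moment conditions shows up in any least-squares fit that includes an intercept: the residuals sum to zero and are orthogonal to every feature column. The sketch below uses made-up data and coefficients. Note that the residuals are only stand-ins for the unobservable $\epsilon_{i}$, and that they satisfy these conditions by construction, so this check cannot by itself confirm strict exogeneity.

```python
import numpy as np

rng = np.random.default_rng(1)
n_obs = 200

X = rng.normal(size=(n_obs, 2))                           # hypothetical features
y = 0.5 + X @ np.array([1.0, 2.0]) + rng.normal(size=n_obs)

design = np.column_stack([np.ones(n_obs), X])             # column of ones for the intercept
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
residuals = y - design @ beta_hat                         # sample stand-ins for eps_i

# Normal equations of least squares: design.T @ residuals = 0.
# Entry 0 is the mean residual (sample analogue of E[eps] = 0).
# Entries 1.. are the means of x_ij * residual_i (sample analogue of E[X eps] = 0).
moments = design.T @ residuals / n_obs
print(moments)  # every entry is zero up to floating-point error
```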

$2. \land 5. \implies (\text{Homoscedasticity: } E[\epsilon_{i}^{2}|\vec{X_{i}}=\vec{x_{i}}] = \sigma^{2}) \land (\text{No autocorrelation: } E[\epsilon_{i}\epsilon_{j}|\vec{X_{i}}=\vec{x_{i}}, \vec{X_{j}}=\vec{x_{j}}] = 0 \text{ for } i \neq j)$

Starting from assumption 5 and the definition of conditional covariance:

$COV[\epsilon_{i}, \epsilon_{j}|\vec{X_{i}}=\vec{x_{i}}, \vec{X_{j}}=\vec{x_{j}}]=E[(\epsilon_{i} - E[\epsilon_{i}|\vec{X_{i}}=\vec{x_{i}},  \vec{X_{j}}=\vec{x_{j}}])(\epsilon_{j} - E[\epsilon_{j}|\vec{X_{i}}=\vec{x_{i}}, \vec{X_{j}}=\vec{x_{j}}])|\vec{X_{i}}=\vec{x_{i}}, \vec{X_{j}}=\vec{x_{j}}]=\sigma^{2}\delta_{ij}$

$COV[\epsilon_{i}, \epsilon_{j}|\vec{X_{i}}=\vec{x_{i}}, \vec{X_{j}}=\vec{x_{j}}] = E[\epsilon_{i}\epsilon_{j}|\vec{X_{i}}=\vec{x_{i}}, \vec{X_{j}}=\vec{x_{j}}] - E[\epsilon_{i}|\vec{X_{i}}=\vec{x_{i}}, \vec{X_{j}}=\vec{x_{j}}]E[\epsilon_{j}|\vec{X_{i}}=\vec{x_{i}}, \vec{X_{j}}=\vec{x_{j}}] = \sigma^{2}\delta_{ij}$

By assumption 2 both conditional means are zero, so the product term vanishes:

$COV[\epsilon_{i}, \epsilon_{j}|\vec{X_{i}}=\vec{x_{i}}, \vec{X_{j}}=\vec{x_{j}}] = E[\epsilon_{i}\epsilon_{j}|\vec{X_{i}}=\vec{x_{i}}, \vec{X_{j}}=\vec{x_{j}}] = \sigma^{2}\delta_{ij}$

$\implies (\text{Homoscedasticity: } E[\epsilon_{i}^{2}|\vec{X_{i}}=\vec{x_{i}}] = \sigma^{2})$ and $(\text{No autocorrelation: } E[\epsilon_{i}\epsilon_{j}|\vec{X_{i}}=\vec{x_{i}}, \vec{X_{j}}=\vec{x_{j}}] = 0 \text{ for } i \neq j)$

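This is important because it means that, given the features, every error term has the same variance $\sigma^{2}$ and the error terms of different observational setups are uncorrelated.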
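To make homoscedasticity concrete, the sketch below simulates one error series whose variance is constant and one whose variance grows with the feature, then compares the error variance at large and small feature values. The uniform feature, the normal errors, and the threshold of 1.0 are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n_obs = 100_000
x = rng.uniform(0.0, 2.0, size=n_obs)          # a single hypothetical feature

eps_homo = rng.normal(scale=1.0, size=n_obs)   # variance 1 regardless of x
eps_hetero = rng.normal(size=n_obs) * x        # standard deviation grows with x

def variance_ratio(eps, x, threshold=1.0):
    """Error variance where x >= threshold divided by error variance where x < threshold."""
    high = eps[x >= threshold]
    low = eps[x < threshold]
    return float(np.var(high) / np.var(low))

ratio_homo = variance_ratio(eps_homo, x)
ratio_hetero = variance_ratio(eps_hetero, x)
print(ratio_homo)    # close to 1: the variance does not depend on x
print(ratio_hetero)  # well above 1: the variance depends on x
```

Only the first series is consistent with the homoscedasticity condition $E[\epsilon_{i}^{2}|\vec{X_{i}}=\vec{x_{i}}] = \sigma^{2}$.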

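$3. \land 4. \implies$ no feature is redundant in the linear model (no perfect multicollinearity), so the contribution of each $\beta_{j}$ can be separated from the others. In practice the condition is usually taken to include the constant term that multiplies $\beta_{0}$.

Note that linear independence is weaker than statistical independence: the features may still be correlated with one another, as long as none of them is an exact linear combination of the others.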
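On a finite sample, assumption 4 can be checked by testing whether the matrix of observed features (with a column of ones for $\beta_{0}$) has full column rank. The sketch below uses small made-up feature columns; the helper name `has_linearly_independent_columns` is hypothetical.

```python
import numpy as np

def has_linearly_independent_columns(design):
    """True if no column is an exact linear combination of the other columns."""
    design = np.asarray(design, dtype=float)
    return bool(np.linalg.matrix_rank(design) == design.shape[1])

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])   # correlated with x1, but not a linear combination of it
x3 = 2.0 * x1 + x2                          # exact linear combination of x1 and x2
ones = np.ones_like(x1)                     # constant column for the intercept

ok = np.column_stack([ones, x1, x2])
bad = np.column_stack([ones, x1, x2, x3])

print(has_linearly_independent_columns(ok))   # True
print(has_linearly_independent_columns(bad))  # False
print(np.corrcoef(x1, x2)[0, 1])              # 0.8 up to rounding
```

The first matrix passes the check even though `x1` and `x2` have a sample correlation of 0.8, while adding `x3` makes the columns linearly dependent.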