# Models

## Linear Regression

For two variables $X$ and $Y$, each with $n$ values, the (population) covariance is:

$\Large\sigma_{XY} = \frac{\Sigma^n_{i = 1}(x_i - \mu_x)(y_i - \mu_y)}{n}$ <br/>

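`np.cov(X, Y, ddof=0)` <br/>
This returns the 2×2 covariance matrix; $\sigma_{XY}$ is the off-diagonal entry `[0, 1]`. Setting `ddof=0` divides by $n$, matching the formula above.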

Pearson Correlation:<br/>$\Large r_P = \frac{\Sigma^n_{i = 1}(x_i - \mu_x)(y_i - \mu_y)}{\sqrt{\Sigma^n_{i = 1}(x_i - \mu_x)^2\Sigma^n_{i = 1}(y_i -\mu_y)^2}}$

Note that we are simply standardizing the covariance by the standard deviations of X and Y (the $n$'s cancel!).

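`np.corrcoef(X, Y)`

**Here `X` and `Y` are lists (or arrays) of the x and y values.** The result is a 2×2 correlation matrix; $r_P$ is the off-diagonal entry `[0, 1]`.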

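Similarly, you can use SciPy (with `from scipy import stats`):<br>
`stats.pearsonr(X, Y)` <br>
This returns both $r_P$ and a p-value for the test that the true correlation is zero.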
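
As a quick check of the formulas above, here is a minimal sketch that computes the covariance and Pearson correlation by hand and compares them with NumPy. The values in `X` and `Y` are made up purely for illustration.

```python
import numpy as np

# Illustrative toy data (not from any real dataset).
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Population covariance straight from the formula (divide by n).
cov_manual = np.sum((X - X.mean()) * (Y - Y.mean())) / len(X)

# np.cov returns a 2x2 matrix; the off-diagonal entry is cov(X, Y).
cov_numpy = np.cov(X, Y, ddof=0)[0, 1]

# Pearson r: covariance divided by the product of the standard deviations.
# np.std uses ddof=0 by default, so the n's cancel as noted above.
r_manual = cov_manual / (X.std() * Y.std())

# np.corrcoef also returns a 2x2 matrix.
r_numpy = np.corrcoef(X, Y)[0, 1]

print(cov_manual, cov_numpy)
print(r_manual, r_numpy)
```

Both pairs of numbers should agree up to floating-point rounding.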

### Regression equations

The solution for a simple regression best-fit line is as follows:

- slope: <br/>$\Large m = r_P\frac{\sigma_y}{\sigma_x} = \frac{\mathrm{cov}(X, Y)}{\mathrm{var}(X)}$

- y-intercept:<br/> $\Large b = \mu_y - m\mu_x$
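
The two equations above can be sketched directly in NumPy. The data are again made-up toy values, and `np.polyfit` is used only as an independent cross-check of the closed-form result.

```python
import numpy as np

# Illustrative toy data (not from any real dataset).
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Slope: cov(X, Y) / var(X), both computed with the same divisor (n).
cov_xy = np.cov(X, Y, ddof=0)[0, 1]
m = cov_xy / X.var()          # np.var uses ddof=0 by default

# Intercept: the best-fit line passes through the point of means.
b = Y.mean() - m * X.mean()

# Cross-check against NumPy's least-squares polynomial fit (degree 1).
m_check, b_check = np.polyfit(X, Y, 1)

print(m, b)
print(m_check, b_check)
```

For this toy data the slope is 0.6 and the intercept is 2.2, and `np.polyfit` gives the same line.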

### Using statsmodels

`sm.formula.ols(formula="y ~ x", data=test_df).fit().summary()` <br>
Here `sm` is `statsmodels.api` and `test_df` is a DataFrame whose columns `x` and `y` hold the predictor and target values.

### $R^2$

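$R^2$, the *coefficient of determination*, measures how well the model fits the data: it is the proportion of the variance in $y$ that the model explains.

It is calculated as: <br/> $\Large R^2\equiv 1-\frac{\Sigma_i(y_i - \hat{y}_i)^2}{\Sigma_i(y_i - \bar{y})^2}$, <br/> where $\hat{y}_i$ is the model's prediction for point $i$ and $\bar{y}$ is the mean of the observed $y$ values.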

Adjusted $R^2$ adds a penalty for the complexity of the model. <br>
The penalty grows with the number of predictors, so adding a predictor raises adjusted $R^2$ only if it improves the fit by more than would be expected by chance.
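
To make both quantities concrete, here is a small sketch on made-up toy data. The adjusted value uses the standard definition $1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, where $p$ is the number of predictors; that formula is not stated above and is included here as an assumption about which adjustment is meant.

```python
import numpy as np

# Illustrative toy data (not from any real dataset).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Fit a simple regression line and compute predictions.
m, b = np.polyfit(x, y, 1)
y_hat = m * x + b

# R^2 = 1 - (residual sum of squares) / (total sum of squares).
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Adjusted R^2 (standard definition; p = number of predictors).
n = len(y)
p = 1
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(r2, adj_r2)
```

For this toy data $R^2$ is 0.6 and the adjusted value is about 0.47. With a single predictor, $R^2$ also equals the square of the Pearson correlation.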

### Assumptions

#### Linearity

**The relationship between the target and predictor is linear.** Check this by drawing a scatter plot of your predictor against your target and looking for evidence that the relationship does not follow a straight line.

#### Independence

**The errors are independent.** In other words, knowing the error for one point doesn't tell you anything about the error for another.

#### Normality

**The errors are normally distributed.** That is, smaller errors are more probable than larger errors, according to the familiar bell curve.

#### Homoskedasticity

**The errors are homoskedastic.** That is, the errors have the same variance. 
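
The last three assumptions are usually examined through the residuals of a fitted model. The sketch below runs a few rough numerical checks on simulated data that satisfies the assumptions by construction. The specific diagnostics (lag-1 autocorrelation, a split-half spread ratio, skewness and excess kurtosis) are illustrative choices rather than formal tests, and the seed and parameters are arbitrary.

```python
import numpy as np

# Simulated data that meets the assumptions: linear trend plus
# independent, normal, constant-variance noise (arbitrary parameters).
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)

# Fit the line and compute residuals.
m, b = np.polyfit(x, y, 1)
fitted = m * x + b
resid = y - fitted

# Independence: correlation between consecutive residuals should be near 0.
lag1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]

# Homoskedasticity: residual spread should be similar for low and
# high fitted values, so this ratio should be near 1.
order = np.argsort(fitted)
half = resid.size // 2
spread_ratio = resid[order][half:].std() / resid[order][:half].std()

# Normality: skewness and excess kurtosis should both be near 0.
z = (resid - resid.mean()) / resid.std()
skew = np.mean(z ** 3)
excess_kurtosis = np.mean(z ** 4) - 3

print(lag1, spread_ratio, skew, excess_kurtosis)
```

In practice these checks are often done visually, for example with a plot of residuals against fitted values and a histogram or Q-Q plot of the residuals.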