# Linear Regression

Regression Model -> values drifting/regressing towards the average/mean

Linear Regression:
> A linear approximation of the causal relationship between two or more variables.

**OLS** = **O**rdinary **L**east **S**quares

**MAE** = **M**ean **A**bsolute **E**rror

**SST** = **S**um of **S**quares **T**otal

**SSR** = **S**um of **S**quares **R**egression

**SSE** = **S**um of **S**quares of **E**rrors

---
Pearson's correlation coefficient is regularly used to measure the strength of the linear relationship between two variables.
   - Values range between -1 and 1
   
https://en.wikipedia.org/wiki/Pearson_correlation_coefficient
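
As a minimal sketch of how the coefficient is computed, the snippet below implements the textbook formula in plain Python. The function name `pearson_r` and the education/income numbers are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (numerator) and sums of squared deviations.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data: years of education vs. income (made-up numbers).
education = [10, 12, 14, 16, 18]
income = [500, 610, 690, 800, 900]
r = pearson_r(education, income)
print(round(r, 3))  # close to +1: strong positive linear relationship
```

A value near +1 or -1 indicates a strong linear relationship; a value near 0 indicates little or no linear relationship.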

---
## Process

1. Samples (acquiring)
2. Create Model for samples
3. Make predictions for the entire population

$$y=\beta_0 + \beta_1 x_1 + \epsilon $$

Where:
- $y$ = Dependent variable, Predicted variable
- $x_1$ = Independent variable, Predictor variable, Regressor

---
**Example**

$y$ = Income

$x$ = Education

$\beta_1 = 50$

$\beta_1 \rightarrow$ quantifies the effect of education on income.

A one-unit increase in education $\rightarrow$ income increases by 50 units

When $x = 0$: $y=\beta_0 + \epsilon$

$\beta_0$ is the intercept: the expected income at zero education (here, interpreted as the minimum wage a person can earn)

$\epsilon \rightarrow$ error of estimation

The Error of Estimation is the difference between the actual values from the sample data and the predicted values from the model.
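
The income example can be sketched in a few lines of Python. Only $\beta_1 = 50$ comes from the example; the intercept value, the helper name `predict_income` and the observed samples are hypothetical.

```python
def predict_income(education, beta_0, beta_1):
    """Predicted income from the fitted line y = beta_0 + beta_1 * x."""
    return beta_0 + beta_1 * education

beta_0 = 1000.0  # hypothetical intercept: income at zero education
beta_1 = 50.0    # from the example: effect of one unit of education

# Hypothetical observed samples as (education, actual income) pairs.
samples = [(0, 1020.0), (10, 1480.0), (12, 1630.0)]

# Error of estimation = actual value - predicted value, per sample.
errors = [income - predict_income(edu, beta_0, beta_1) for edu, income in samples]
print(errors)  # [20.0, -20.0, 30.0]
```

Each extra unit of education moves the prediction by exactly $\beta_1$, while the errors capture whatever the line does not explain.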

---
## Sample Data

Sample data should be split into training and testing data.
   - This is typically split 75% training, 25% testing
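
A minimal sketch of a 75/25 split using only the standard library. The helper `split_samples`, its parameters and the toy data are assumptions for illustration; libraries such as scikit-learn ship their own split utilities.

```python
import random

def split_samples(samples, test_fraction=0.25, seed=0):
    """Shuffle a copy of the samples and return (training, testing) lists."""
    shuffled = list(samples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical dataset of 20 (x, y) pairs.
data = [(x, 2 * x + 1) for x in range(20)]
train, test = split_samples(data)
print(len(train), len(test))  # 15 5
```

Shuffling before splitting avoids a test set made up only of the last rows, which matters when the data is sorted.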

---
ANOVA framework - Analysis of Variance

1. SST (aka TSS)
    - Sum of the squared differences between the observed dependent variable and its mean: $\sum_i (y_i - \bar{y})^2$
    - Sometimes referred to as: Total variability of the dataset
2. SSR (aka ESS)
    - Sum of the squared differences between the predicted variable and the mean of the dependent variable: $\sum_i (\hat{y}_i - \bar{y})^2$
    - Describes how well the line fits the data
    - Sometimes referred to as: ESS - Explained Sum of Squares
    - If SSR = SST, the regression model captures all the variability and is perfect.
3. SSE (aka RSS)
    - Sum of Squared Errors
    - AKA: RSS - Residual Sum of Squares
    - Measures the unexplained variability of the regression
    - Sum of the squared differences between the observed value and the predicted value: $\sum_i (y_i - \hat{y}_i)^2$

$$
\begin{align}
\text{Total Variability} &= \text{Explained variability} + \text{Unexplained Variability}\\
\text{SST} &= \text{SSR} + \text{SSE}\\
\text{Actual Value} &= \text{Predicted Value} + \text{Error}
\end{align}
$$

The lower the error, the better the model. Because SST is fixed for a given dataset, a higher SSR implies a lower SSE.
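
The decomposition can be checked numerically. This is a small sketch on a hypothetical four-point sample; the predictions are taken from the least-squares line $y = 1.3 + 0.8x$ for exactly these points, which is what makes the identity hold.

```python
# Hypothetical sample; the predictions come from the OLS line y = 1.3 + 0.8x
# fitted to exactly these four points.
xs = [0, 1, 2, 3]
observed = [1.0, 3.0, 2.0, 4.0]
predicted = [1.3 + 0.8 * x for x in xs]

mean_y = sum(observed) / len(observed)

sst = sum((y - mean_y) ** 2 for y in observed)                # total variability
ssr = sum((p - mean_y) ** 2 for p in predicted)               # explained variability
sse = sum((y - p) ** 2 for y, p in zip(observed, predicted))  # unexplained variability

print(f"SST={sst:.2f}  SSR={ssr:.2f}  SSE={sse:.2f}")
# SST=5.00  SSR=3.20  SSE=1.80
```

For predictions that do not come from a least-squares fit with an intercept, SSR + SSE is in general not equal to SST.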

---
OLS = Ordinary Least Squares

Method: Least Squares (SSE):
   - Aims at finding a line which minimizes the SSE
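
For a single predictor, the SSE-minimizing line has a well-known closed form: slope = (sum of cross-deviations) / (sum of squared deviations of $x$), intercept = $\bar{y} - \beta_1 \bar{x}$. The sketch below applies it to made-up data; `fit_ols` and `sum_squared_errors` are illustrative names, not a library API.

```python
def fit_ols(xs, ys):
    """Return (beta_0, beta_1) minimizing the SSE of y = beta_0 + beta_1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    beta_1 = s_xy / s_xx
    beta_0 = mean_y - beta_1 * mean_x
    return beta_0, beta_1

def sum_squared_errors(xs, ys, beta_0, beta_1):
    """SSE of the line y = beta_0 + beta_1 * x on the given points."""
    return sum((y - (beta_0 + beta_1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical noisy samples scattered around y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

beta_0, beta_1 = fit_ols(xs, ys)
best_sse = sum_squared_errors(xs, ys, beta_0, beta_1)

# Nudging either coefficient away from the OLS solution increases the SSE.
worse_sse = sum_squared_errors(xs, ys, beta_0 + 0.1, beta_1)
print(best_sse < worse_sse)  # True
```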

$$R^2 = \frac{\text{SSR (Variability explained by this Model)}}{\text{SST (Total Variability of the Dataset)}}$$

$R^2$ lies between 0 and 1 (for an OLS fit with an intercept, evaluated on the data it was fitted to).

- For scientific experiments an $R^2$ value between 0.7 and 0.9 would be good.
- For social sciences it might be more like 0.2 to 0.4.
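
The two ends of the range can be illustrated with a short sketch. The helper `r_squared` follows the SSR / SST definition and assumes the predictions come from a least-squares fit with an intercept; the data is made up.

```python
def r_squared(observed, predicted):
    """R^2 = SSR / SST, assuming OLS predictions from a model with an intercept."""
    mean_y = sum(observed) / len(observed)
    sst = sum((y - mean_y) ** 2 for y in observed)
    ssr = sum((p - mean_y) ** 2 for p in predicted)
    return ssr / sst

observed = [2.0, 4.0, 6.0, 8.0]

# A model that reproduces every observation explains all the variability.
perfect_fit = list(observed)
# A model that always predicts the mean explains none of it.
mean_only = [5.0] * len(observed)

print(r_squared(observed, perfect_fit))  # 1.0
print(r_squared(observed, mean_only))    # 0.0
```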

---
## F-statistic (F-Test)

$$
\begin{align}
H_0 &\text{: } \beta_1 = \beta_2 = \beta_3 = \dots = \beta_n = 0\\
H_1 &\text{: at least one } \beta_i \ne 0
\end{align}
$$

If all $\beta$'s are $0$, then none of the $x$'s matter.

The F-statistic measures the overall significance of the model.
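
These notes do not give the formula; the standard form for a model with $k$ predictors and $n$ samples is $F = \frac{\text{SSR}/k}{\text{SSE}/(n-k-1)}$. A minimal sketch with hypothetical sums of squares (`f_statistic` is an illustrative name):

```python
def f_statistic(ssr, sse, n_samples, n_predictors):
    """Overall F-statistic: explained variance per predictor over residual variance."""
    df_model = n_predictors
    df_resid = n_samples - n_predictors - 1
    return (ssr / df_model) / (sse / df_resid)

# Hypothetical values: 12 samples, 1 predictor, SSR = 80, SSE = 20.
f_value = f_statistic(ssr=80.0, sse=20.0, n_samples=12, n_predictors=1)
print(f_value)  # 40.0
```

A large value is evidence against $H_0$. Deciding whether it is large enough requires a critical value or p-value from the F-distribution with $(k,\ n-k-1)$ degrees of freedom, which this sketch does not compute.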

---
## Assumptions:

### 1. Linearity
   - $y=\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon$
   - To resolve: If the relationship between the variables is not linear, perform a LOG or EXP transformation.
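
A small sketch of why a log transformation helps, assuming a hypothetical exponential relationship $y = 2e^{0.5x}$: the raw values grow by ever larger steps, while $\log(y)$ grows by a constant step per unit of $x$, which is exactly what a straight line does.

```python
import math

# Hypothetical exponential relationship: y = 2 * exp(0.5 * x).
xs = [0, 1, 2, 3, 4]
ys = [2.0 * math.exp(0.5 * x) for x in xs]
log_ys = [math.log(y) for y in ys]

# Change in y (and in log y) for each unit step in x.
raw_steps = [ys[i + 1] - ys[i] for i in range(len(ys) - 1)]
log_steps = [log_ys[i + 1] - log_ys[i] for i in range(len(log_ys) - 1)]

print([round(s, 2) for s in raw_steps])  # increasing steps: not linear in x
print([round(s, 2) for s in log_steps])  # [0.5, 0.5, 0.5, 0.5]: linear in x
```

After the transformation, a linear regression of $\log(y)$ on $x$ fits this data exactly.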

### 2. No Endogeneity
   - (Covariance between the error term and the independent variables = 0)
   - Violated when the difference between the observed value and the predicted value (error) is correlated with the independent variables. A common cause is Omitted Variable Bias.
   - Omitted Variable Bias is introduced in the model when we forget to include a relevant variable.
   - e.g.

      $y \leftarrow x$ (included variable)

      $y \leftarrow x^*$ (omitted relevant variable)

      identified by a relationship between $x$ and $\epsilon$ (error)
   - To resolve: Check to ensure that all relevant variables have been included.
      
      
### 3. Normality & Homoscedasticity
   - Assumption: the error term is normally distributed with constant variance
   - The variance of the different error terms obtained must be equal
   - To resolve: perform a transformation, check for Omitted Variable Bias

### 4. No Auto-correlation
   - Assumption: covariance of any 2 error terms = 0
   - Often encountered with time series analysis
   - See: "Day of the Week Effect"
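
One simple way to look for auto-correlation is the lag-1 correlation of the residuals, i.e. how strongly each error resembles the one before it. The sketch below is an illustration with made-up residual sequences, not a formal test; `lag1_autocorrelation` is a hypothetical helper.

```python
def lag1_autocorrelation(residuals):
    """Correlation between each residual and the previous one (time-ordered)."""
    n = len(residuals)
    mean_r = sum(residuals) / n
    num = sum((residuals[t] - mean_r) * (residuals[t - 1] - mean_r)
              for t in range(1, n))
    den = sum((r - mean_r) ** 2 for r in residuals)
    return num / den

# Hypothetical residuals in time order.
trending = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]   # errors drift steadily
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]     # errors flip sign each step

print(round(lag1_autocorrelation(trending), 3))     # 0.571
print(round(lag1_autocorrelation(alternating), 3))  # -0.833
```

Values far from 0 in either direction suggest the errors are not independent of each other.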

### 5. No Multicollinearity 
   - (2 or more variables have a high correlation with each other)
   - If there's perfect multicollinearity between two variables, then one of them is redundant and it is not necessary to include both variables in the model.
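
A sketch of what perfect multicollinearity does to the least-squares problem, assuming a model with two predictors and an intercept. The two slope coefficients solve a 2x2 linear system built from the centered sums of squares and cross-products. If one predictor is an exact multiple of the other, the determinant of that system is 0 and no unique solution exists. The helper name and data are hypothetical.

```python
def normal_equations_determinant(x1, x2):
    """Determinant of the centered 2x2 cross-product matrix of two predictors."""
    n = len(x1)
    mean_1 = sum(x1) / n
    mean_2 = sum(x2) / n
    s11 = sum((a - mean_1) ** 2 for a in x1)
    s22 = sum((b - mean_2) ** 2 for b in x2)
    s12 = sum((a - mean_1) * (b - mean_2) for a, b in zip(x1, x2))
    return s11 * s22 - s12 ** 2

x1 = [1.0, 2.0, 3.0, 4.0]
collinear = [2.0 * a for a in x1]     # exact multiple of x1
unrelated = [1.0, -1.0, -1.0, 1.0]    # uncorrelated with x1

print(normal_equations_determinant(x1, collinear))  # 0.0  -> no unique coefficients
print(normal_equations_determinant(x1, unrelated))  # 20.0 -> unique coefficients
```

Dropping one of the two collinear predictors removes the problem without losing any information.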