# The 5 Regression Assumptions 
---

##### 1. Linearity 
- Linear regression is the simplest model and assumes linearity: each independent variable is multiplied by a coefficient, and the products are summed to predict the value.

##### 2. No Endogeneity of Regressors
- Mathematically, the covariance between the error term and each independent variable X is 0.

##### 3. Normality and Homoscedasticity of the error term 
- Normality means the error term is normally distributed. The expected value of the error is zero as we expect to have no errors on average. Homoscedasticity in plain English means constant variance.

##### 4. No Autocorrelation 
- Mathematically, the covariance of any two error terms is zero. This is the assumption that would usually stop you from using a linear regression in your analysis.

##### 5. No Multicollinearity 
- Multicollinearity is observed when two or more independent variables are highly correlated with each other.

## 1. Linearity - use a scatter plot (only for Linear Regression)
---

A linear regression is the simplest non-trivial relationship. It is called linear because the equation is linear. Each independent variable is multiplied by a coefficient, and summed up to predict the value of the dependent variable.

- The easiest way is to choose an independent variable x1, and plot it against the dependent y on a scatter plot. If the data points form a pattern that looks like a straight line, then a linear regression model is suitable.

- If the relationship is non-linear, do not fit a linear regression until you have transformed the data appropriately, for example with a log or exponential transformation.
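
As a numerical companion to the scatter plot, the sketch below compares the Pearson correlation of x with y before and after a log transformation. The data is hypothetical (y grows exponentially with x), and `pearson_r` is a helper written for this illustration, not part of any library.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: y grows exponentially with x, so the raw scatter is curved.
x = list(range(1, 11))
y = [math.exp(0.5 * xi) for xi in x]

r_raw = pearson_r(x, y)                            # curved relationship
r_log = pearson_r(x, [math.log(yi) for yi in y])   # straight line after the log

print(f"r(x, y)     = {r_raw:.3f}")
print(f"r(x, log y) = {r_log:.3f}")
```

A correlation close to 1 (or -1) after the transformation suggests the transformed relationship is close to linear. It does not replace looking at the plot, since a high correlation can still hide curvature.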

## 2. No Endogeneity
---

This assumption prohibits any link (correlation) between the independent variables and the errors.

- Omitted variable bias is introduced to the model when you forget to include a relevant variable. As each independent variable explains y, they move together and are somewhat correlated. Similarly, y is also explained by the omitted variable, so they are also correlated. Chances are, the omitted variable is also correlated with at least one independent x. However, you forgot to include it as a regressor. Everything that you don't explain with your model goes into the error, so the error becomes correlated with the regressors you did include.

- An incorrect exclusion of a variable, like in this case, leads to biased and counterintuitive estimates that are toxic to your regression analysis.

- An incorrect inclusion of a variable, as we saw in our "Adjusted R-squared" lecture, leads to inefficient estimates. These do not bias the regression, and you can simply drop the irrelevant variable.
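
Omitted variable bias can be seen in a small simulation. The sketch below assumes a made-up data-generating process (y = 1·x1 + 2·x2 + noise, with x2 correlated with x1) and uses two hypothetical helpers, `ols_slope` and `residuals`. To include x2 without a matrix library, it partials x2 out of x1 first (the Frisch-Waugh-Lovell approach), which gives the same x1 coefficient as a regression on both variables.

```python
import random

def ols_slope(xs, ys):
    """Slope of the simple regression of ys on xs (with an intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def residuals(xs, ys):
    """Residuals from the simple regression of ys on xs (with an intercept)."""
    b = ols_slope(xs, ys)
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return [(y - mean_y) - b * (x - mean_x) for x, y in zip(xs, ys)]

random.seed(0)  # fixed seed so the sketch is reproducible
n = 2000

# Assumed process: the true coefficient on x1 is 1, and x2 moves with x1.
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [0.8 * a + random.gauss(0, 0.6) for a in x1]
y = [1.0 * a + 2.0 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

# x2 omitted: its effect lands in the error term, which is now correlated with x1.
slope_omitted = ols_slope(x1, y)

# x2 included: regress y on the part of x1 that x2 does not explain.
slope_full = ols_slope(residuals(x2, x1), y)

print(f"x1 coefficient, x2 omitted:  {slope_omitted:.2f}")
print(f"x1 coefficient, x2 included: {slope_full:.2f}")
```

With x2 omitted, the x1 coefficient should land well above the true value of 1, because it absorbs part of x2's effect. With x2 accounted for, it should land close to 1.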

## 3. Normality and Homoscedasticity
---

Normality - assume the error term is normally distributed

Zero Mean - if the mean is not expected to be zero, then the line is not the best fitting one. Having an intercept solves that problem

Homoscedasticity - the error terms should all have equal (constant) variance. If there is heteroscedasticity (the variance of the errors changes across observations), you can often reduce it by taking the natural log of the variables.

- Semi-log model (natural log of Y only) - as X increases by one unit, Y changes by roughly 100 × b1 percent.

- Log-log model - for each one percent change in X, Y changes by roughly b1 percent. Taking logs of both variables shrinks the graph in height and width.
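
Written out, and assuming natural logs with $b_0$ as the intercept (a symbol not used in the notes above), the two models and their approximate interpretations are:

$$
\ln y = b_0 + b_1 x \quad\Rightarrow\quad \frac{\Delta y}{y} \approx b_1 \,\Delta x
$$

$$
\ln y = b_0 + b_1 \ln x \quad\Rightarrow\quad b_1 = \frac{d \ln y}{d \ln x} \approx \frac{\%\Delta y}{\%\Delta x}
$$

Both readings are approximations that hold for small changes. In the semi-log model the exact change in $y$ for a one-unit increase in $x$ is $100\,(e^{b_1} - 1)$ percent.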

## 4. No Autocorrelation
---

Errors are assumed to be uncorrelated

To check for this: plot all the residuals on a graph and look for patterns. If you can't find any, you're safe.

Another way is the Durbin-Watson test. Its statistic ranges from 0 to 4: a value of 2 indicates no autocorrelation, while values below 1 or above 3 are cause for alarm.

- Do not use the linear regression model when error terms are autocorrelated (like time series data).

- The Durbin-Watson statistic ranges from 0 to 4:

    - A value of 2: indicates no autocorrelation.
    - Values from 0 to < 2: indicate positive autocorrelation.
    - Values from > 2 to 4: indicate negative autocorrelation.
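
The statistic itself is simple to compute: the sum of squared differences between consecutive residuals, divided by the sum of squared residuals. Below is a minimal sketch on two made-up residual series; `durbin_watson` here is a hand-written helper for illustration.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    divided by the sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Made-up residuals that stay on one side for a while (positive autocorrelation).
trending = [1, 1, 1, 1, -1, -1, -1, -1]

# Made-up residuals that flip sign every step (negative autocorrelation).
alternating = [1, -1, 1, -1]

print(durbin_watson(trending))     # 0.5
print(durbin_watson(alternating))  # 3.0
```

In practice the residuals would come from a fitted regression, in observation order. The order matters, because the statistic only compares each residual with the one before it.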

## 5. No Multicollinearity
---

We observe multicollinearity when two or more variables have a high correlation.

- Multicollinearity poses a big problem for our regression model because the coefficients will be unreliably estimated. The reasoning is that, if A can be represented using B, there is no point in using both.

- VIF = 1: no multicollinearity

- 1 < VIF < 5: perfectly okay

- VIF above 5 (some sources use 10 as the cutoff): unacceptable

Fix: 1. Drop one of the two variables. 2. Transform them into one variable.
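
The VIF of a variable is 1 / (1 - R²), where R² comes from regressing that variable on all the other independent variables. With exactly two independent variables, that R² is just their squared correlation, which is the simplified case the sketch below assumes. The data and the helper names `pearson_r` and `vif_two_predictors` are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def vif_two_predictors(x1, x2):
    """VIF shared by both variables when the model has exactly two predictors."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)

# Made-up predictors: moderately correlated (r = 0.8).
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]

# Made-up predictor that is almost exactly 2 * x1.
x3 = [2.1, 3.9, 6.2, 7.8, 10.1]

vif_moderate = vif_two_predictors(x1, x2)
vif_severe = vif_two_predictors(x1, x3)

print(f"VIF(x1, x2) = {vif_moderate:.2f}")  # 2.78
print(f"VIF(x1, x3) = {vif_severe:.1f}")
```

For more than two independent variables, each VIF needs its own auxiliary regression, which is easier with a statistics library than by hand.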

# Overfitting & Underfitting - a problem common to all predictive analytics
---

##### Overfitting: 
- Means the regression has focused on the particular dataset so much it has "missed the point." This happens when a model is too complex and captures noise in the data. It will perform exceptionally well on training data but poorly on unseen data.
    - Line follows the data points too closely
    - Misses the point
    - High train accuracy

##### Solution
- Split the initial dataset into training and test data, for example 90/10 or 80/20.
- Fit the regression on the training data, then evaluate the model on the test data. For a classification model such as logistic regression, this means creating a confusion matrix and assessing accuracy.
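
The split step can be sketched in a few lines. `split_train_test` below is a hypothetical helper written for this illustration, and the 80/20 ratio and the seed are arbitrary choices; shuffling before splitting keeps the test set from being just the tail of the file.

```python
import random

def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Stand-in dataset: 100 observations identified by their index.
data = list(range(100))

train, test = split_train_test(data, test_fraction=0.2)
print(len(train), len(test))  # 80 20
```

The model is then fitted on `train` only. A large gap between its performance on `train` and on `test` is the usual sign of overfitting.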

##### Underfitting: 
- Means the model has not captured the underlying logic of the data. It doesn't know what to do and therefore provides an answer that is far from correct. This happens when a model is too simple to capture the underlying trends in the data, resulting in poor performance on both the training and testing sets: poor predictive power and low accuracy.
    - Line does not follow the data points
    - Doesn't capture any logic
    - Low train accuracy

##### A good model:
- Captures the underlying logic of the dataset
- High train accuracy