# The 5 Assumptions of Linear Regression

These are the conditions that must be met for the results of a regression analysis to be valid and reliable. 

---

##### 1. Linearity 
- The relationship between the independent and dependent variables is linear.

##### 2. No Multicollinearity 
- Multicollinearity occurs when two or more independent variables are highly correlated with each other.

##### 3. No Endogeneity of Regressors
- Endogeneity refers to situations in which an independent variable in a linear regression model is correlated with the error term.

##### 4. Normality and Homoscedasticity of the error term 
- Normality means the error term is normally distributed. The expected value of the error is zero: positive and negative errors cancel out on average.
- Homoscedasticity in plain English means constant variance.

##### 5. No Autocorrelation 
- Mathematically, the covariance of any two error terms is zero. This is the assumption that would usually stop you from using a linear regression in your analysis.

## How to test
---

### 1. Linearity - Use a scatter plot 

Plot each independent variable against the dependent variable. A clear linear relationship will appear as a straight line or a tight cluster of points around a straight line.


- If the data points form a pattern that looks like a straight line, then a linear regression model is suitable.
- If the relationship is non-linear, do not fit a linear regression until you have transformed the data appropriately, for example with a log or exponential transformation.
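As a rough numeric companion to the scatter plot, the sketch below compares the Pearson correlation before and after a log transformation. The data is synthetic and purely an assumption for illustration (an exponential relationship with multiplicative noise); the correlation is only a stand-in for what the eye would see in the plot.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): y grows exponentially with x, with multiplicative noise.
x = rng.uniform(0, 10, size=500)
y = np.exp(0.5 * x + rng.normal(0, 0.2, size=500))

# Pearson correlation measures how closely the points follow a straight line.
r_raw = np.corrcoef(x, y)[0, 1]          # x against the raw y
r_log = np.corrcoef(x, np.log(y))[0, 1]  # x against log(y)

print(f"corr(x, y)     = {r_raw:.3f}")
print(f"corr(x, log y) = {r_log:.3f}")
```

The transformed pair should sit much closer to a straight line, which is the situation in which a linear regression on the transformed variable becomes reasonable.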

### 2. No Multicollinearity - Use Variance Inflation Factor (VIF):

In [None]:
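# X is assumed to be a pandas DataFrame containing only the independent variables
from statsmodels.stats.outliers_influence import variance_inflation_factor
# One VIF per column, in the same order as X.columns
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]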

- VIF = 1: no multicollinearity
- 1 < VIF < 5: perfectly okay
- VIF > 5 (some sources use 10 as the cutoff): unacceptable

- If there is multicollinearity, it poses a big problem for the regression model: the coefficient estimates become unstable and their standard errors inflate, so the individual effects cannot be separated reliably. The reasoning is that if A can be represented using B, there is no point in using both.

Fix: 1. Drop one of the two variables. 2. Transform them into one variable.
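To make the number less of a black box, here is a minimal sketch that computes VIF by hand from its definition, VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing column j on the other columns. The helper `vif_by_hand` and the synthetic columns are hypothetical names for illustration. It adds an intercept to each auxiliary regression, so its values can differ from the statsmodels call when no constant column is supplied.

```python
import numpy as np

def vif_by_hand(X):
    """Return the VIF of each column of X (2-D array, one column per feature)."""
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        # Regress column j on the remaining columns plus an intercept.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1 - resid.var() / target.var()
        vifs.append(1 / (1 - r2))
    return np.array(vifs)

rng = np.random.default_rng(0)
x1 = rng.normal(size=1000)
x2 = x1 + rng.normal(scale=0.1, size=1000)  # nearly a copy of x1
x3 = rng.normal(size=1000)                  # unrelated to the others
vifs = vif_by_hand(np.column_stack([x1, x2, x3]))
print(vifs.round(2))
```

In this setup the first two columns should show very large VIFs while the third stays close to 1, matching the "drop one of the two variables" fix.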

### 3. No Endogeneity - Durbin-Wu-Hausman (DWH) Test

A variable is endogenous if it is linked to information that is missing from the model or cannot be measured, because that information ends up in the error term.

Cause & Example
1. Omitted Variable - You forgot to include “weather front strength,” which affects both humidity and temperature
2. Reverse Causality - Higher temperature itself changes humidity
3. Measurement Error - The humidity sensor is noisy, so the “error” spills into the model's error term

- If any of these happen → endogeneity problem → OLS is biased and an instrumental-variables approach such as 2SLS is needed
- If not → your model is fine with OLS, and you can keep the independent variables whose p-value is < 0.05

The Hausman test compares:
- OLS (ordinary least squares): efficient when the regressors are exogenous, but biased and inconsistent when they are endogenous
- IV/2SLS (instrumental variables): consistent if the instruments are valid, but less precise than OLS

If there’s a significant difference, OLS is likely biased → endogeneity is present.

1. Perform OLS and 2SLS regressions:
- Run the original regression using Ordinary Least Squares (OLS) to get the OLS coefficients.
- Perform a Two-Stage Least Squares (2SLS) regression using the same model but including the instrumental variable(s).
2. Compare the coefficients: Compare the coefficients for the suspected endogenous variable from both the OLS and 2SLS regressions.
3. Test for significance: If the coefficients differ significantly, it indicates endogeneity. The DWH test formally compares these coefficients. A statistically significant difference suggests the OLS results are biased and the 2SLS results are more appropriate.
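The three steps can be sketched on simulated data where the endogeneity is built in on purpose. Everything here is an assumption for illustration: the instrument `z`, the omitted factor `u`, the true slope of 2.0, and the helper `ols`. The significance check uses the regression-based form of the DWH test (add the first-stage residual to the OLS model and look at its t-statistic), which is one common variant, not necessarily what a given library runs.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients and their standard errors."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return coef, se

rng = np.random.default_rng(0)
n = 5000
z = rng.normal(size=n)                  # instrument: moves x, unrelated to the error
u = rng.normal(size=n)                  # omitted factor that drives both x and y
x = z + u + 0.5 * rng.normal(size=n)    # endogenous regressor
y = 1.0 + 2.0 * x + u + 0.5 * rng.normal(size=n)  # true slope is 2.0

const = np.ones(n)

# Step 1a: plain OLS of y on x.
b_ols, _ = ols(np.column_stack([const, x]), y)

# Step 1b: 2SLS. The first stage predicts x from z; the second stage uses the prediction.
Z = np.column_stack([const, z])
g, _ = ols(Z, x)
x_hat = Z @ g
b_2sls, _ = ols(np.column_stack([const, x_hat]), y)

# Steps 2-3: add the first-stage residual to the OLS model.
# A large t-statistic on it signals that OLS and 2SLS disagree (endogeneity).
v_hat = x - x_hat
b_aug, se_aug = ols(np.column_stack([const, x, v_hat]), y)
t_endog = b_aug[2] / se_aug[2]

print(f"OLS slope:  {b_ols[1]:.3f}")
print(f"2SLS slope: {b_2sls[1]:.3f}")
print(f"t-statistic on first-stage residual: {t_endog:.1f}")
```

Only the 2SLS point estimates are used here; the standard errors from a hand-rolled second stage are not valid, so a dedicated IV routine is the better choice for real inference.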

### 4. Normality and Homoscedasticity - Test after predictions with a Histogram and Scatter plot

Normality - Histogram: Create a histogram of the residuals, calculated as the difference between the actual value and the predicted value for each data point (y - ŷ). A normal distribution should resemble a bell curve. If the histogram is heavily skewed or has multiple peaks, the normality assumption may be violated.
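A minimal sketch of the residual calculation and histogram, on synthetic data whose errors are normal by construction (an assumption for illustration). The histogram counts come from `np.histogram`; the sample skewness is added as a rough numeric companion to the visual check.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=2000)
y = 3.0 + 1.5 * x + rng.normal(0, 1.0, size=2000)  # normal errors by construction

# Fit y = b0 + b1 * x and compute residuals (actual minus predicted).
b1, b0 = np.polyfit(x, y, deg=1)
residuals = y - (b0 + b1 * x)

# Histogram counts per bin; a plotting library would draw these as bars.
counts, bin_edges = np.histogram(residuals, bins=20)

# Sample skewness: close to 0 for a symmetric, bell-shaped distribution.
skewness = np.mean((residuals - residuals.mean()) ** 3) / residuals.std() ** 3

print(counts)
print(f"skewness = {skewness:.3f}")
```

The counts should rise towards the middle bins and fall off on both sides, and the skewness should sit near zero.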

Homoscedasticity - Scatter plot: Plot the residuals against the predicted values. They should be roughly centered around 0 with no clear pattern, which indicates the model errors are random, a good sign for the linear regression assumptions.
- If points are randomly scattered → good, homoscedastic.
- If residuals fan out or form a pattern → heteroscedasticity.
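The "fan" shape can also be checked with a number. The sketch below, on synthetic data where the noise is assumed to grow with x, compares the residual spread for high versus low predicted values. A ratio near 1 suggests constant variance; a ratio well above 1 suggests the residuals fan out.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=2000)
# Assumption for the demo: the noise standard deviation grows with x.
y = 2.0 * x + rng.normal(0, 0.5 * x)

b1, b0 = np.polyfit(x, y, deg=1)
fitted = b0 + b1 * x
residuals = y - fitted

# Compare residual spread for low vs. high fitted values.
low = fitted < np.median(fitted)
spread_ratio = residuals[~low].std() / residuals[low].std()

print(f"spread ratio (high / low fitted values) = {spread_ratio:.2f}")
```

This is only a crude split into two halves; it complements the scatter plot and does not replace it.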

If there is heteroscedasticity (non-constant variance), take the natural log of the variable(s):
- Semi-log model (log of Y only) - as X increases by one unit, Y changes by approximately 100 × b1 percent.
- Log-log model (log of both X and Y) - for each 1 percent change in X, Y changes by approximately b1 percent. The graph shrinks in both height and width.
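A sketch of the semi-log fix, assuming synthetic data with multiplicative errors (so the spread of Y grows with its level). The helper `high_low_spread` is a hypothetical name; it compares residual spread in the upper and lower halves of the fitted values, before and after taking the log of Y.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=1000)
# Assumption for the demo: errors are multiplicative, so they scale with the level of y.
y = np.exp(0.5 + 0.1 * x + rng.normal(0, 0.2, size=1000))

def high_low_spread(x, target):
    """Residual std for high fitted values divided by that for low fitted values."""
    b1, b0 = np.polyfit(x, target, deg=1)
    fitted = b0 + b1 * x
    resid = target - fitted
    low = fitted < np.median(fitted)
    return resid[~low].std() / resid[low].std()

ratio_raw = high_low_spread(x, y)          # linear model on y
ratio_log = high_low_spread(x, np.log(y))  # semi-log model on ln(y)

# Semi-log interpretation: one extra unit of x changes y by about 100 * b1 percent.
b1_log, b0_log = np.polyfit(x, np.log(y), deg=1)
pct_change_per_unit = 100 * (np.exp(b1_log) - 1)

print(f"spread ratio, raw y:  {ratio_raw:.2f}")
print(f"spread ratio, log y:  {ratio_log:.2f}")
print(f"b1 = {b1_log:.3f}, about {pct_change_per_unit:.1f}% change in y per unit of x")
```

The exact percentage change is 100 × (exp(b1) - 1); for small b1 this is close to 100 × b1, which is why the shortcut reading works.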

### 5. No Autocorrelation

To check for this: plot all the residuals on a graph and look for patterns. If you can't find any, you're safe.

Another way is the Durbin-Watson test. As a rule of thumb, values below 1 or above 3 are cause for alarm.

- The Durbin-Watson statistic ranges from 0 to 4:

    - A value of about 2: indicates no autocorrelation.
    - A value from 0 to < 2: indicates positive autocorrelation.
    - A value from > 2 to 4: indicates negative autocorrelation.

- Do not use the linear regression model when error terms are autocorrelated, as is common with time series data.
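The statistic itself is short enough to compute by hand: the sum of squared differences between consecutive residuals, divided by the sum of squared residuals. The sketch below applies it to two synthetic residual series (both assumptions for illustration): independent noise, and an AR(1) series in which each value carries over 80% of the previous one.

```python
import numpy as np

def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    diff = np.diff(residuals)
    return np.sum(diff ** 2) / np.sum(residuals ** 2)

rng = np.random.default_rng(0)
n = 2000

# Independent residuals: no autocorrelation.
white = rng.normal(size=n)

# AR(1) residuals: positive autocorrelation, typical of time series.
shocks = rng.normal(size=n)
ar1 = np.zeros(n)
for t in range(1, n):
    ar1[t] = 0.8 * ar1[t - 1] + shocks[t]

dw_white = durbin_watson(white)
dw_ar1 = durbin_watson(ar1)

print(f"Durbin-Watson, independent residuals: {dw_white:.2f}")
print(f"Durbin-Watson, AR(1) residuals:       {dw_ar1:.2f}")
```

The independent series should land near 2 and the AR(1) series well below 1. statsmodels also ships this statistic as `statsmodels.stats.stattools.durbin_watson`.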

# Overfitting & Underfitting - a problem across all of predictive analytics
---

##### Overfitting: 
 - Means the regression has focused on the particular dataset so much that it has "missed the point." This happens when a model is too complex and captures noise in the data. It will perform exceptionally well on training data but poorly on unseen data.
    - Line follows the data points too closely
    - Misses the point
    - High train accuracy

##### Solution
- Split the initial dataset into training and test data, 90/10 or 80/20
- Fit the regression on the training data, then evaluate the model on the test data: for a linear regression, compare an error metric such as R² or RMSE; for a classifier, create a confusion matrix and assess accuracy
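A sketch of the split-and-compare idea on synthetic data (a linear signal plus noise, assumed for illustration). Mean squared error is used as the score because the target is continuous. To make overfitting visible, the straight line is compared with a deliberately over-flexible stand-in, a 1-nearest-neighbour lookup that memorises every training point. That lookup is a swapped-in technique for demonstration, not a regression line.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x = rng.uniform(0, 10, size=n)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, size=n)  # assumed: linear signal plus noise

# 80/20 split on shuffled row indices.
idx = rng.permutation(n)
train, test = idx[:800], idx[800:]

def mse(actual, predicted):
    """Mean squared error."""
    return np.mean((actual - predicted) ** 2)

# Model A: straight line fitted on the training rows only.
b1, b0 = np.polyfit(x[train], y[train], deg=1)
line_train = mse(y[train], b0 + b1 * x[train])
line_test = mse(y[test], b0 + b1 * x[test])

# Model B: 1-nearest-neighbour lookup, which memorises every training point.
def predict_1nn(x_new):
    nearest = np.abs(x_new[:, None] - x[train][None, :]).argmin(axis=1)
    return y[train][nearest]

memo_train = mse(y[train], predict_1nn(x[train]))
memo_test = mse(y[test], predict_1nn(x[test]))

print(f"line:      train MSE {line_train:.2f}, test MSE {line_test:.2f}")
print(f"memoriser: train MSE {memo_train:.2f}, test MSE {memo_test:.2f}")
```

The memoriser scores perfectly on the training rows yet does worse than the line on the held-out rows. That gap between training and test performance is what the split is designed to expose.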

##### Underfitting: 
- Means the model has not captured the underlying logic of the data. It doesn't know what to do and therefore provides an answer that is far from correct. This happens when a model is too simple to capture the underlying trends in the data, resulting in poor performance on both the training and test sets: poor predictive power and low accuracy.
    - Line does not follow the data points
    - Doesn't capture any logic
    - Low train accuracy
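Underfitting can be sketched the same way. Here the synthetic data is assumed to follow a curve (y = x² plus noise), and a straight line is compared with a quadratic fit using R² on both the training and test rows. The helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=500)
y = x ** 2 + rng.normal(0, 0.5, size=500)  # assumed: curved signal plus noise

# 80/20 split on shuffled row indices.
idx = rng.permutation(500)
train, test = idx[:400], idx[400:]

def r_squared(actual, predicted):
    """Share of the variance in `actual` explained by `predicted`."""
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1 - ss_res / ss_tot

def fit_and_score(degree):
    """Fit a polynomial on the training rows; return (train R², test R²)."""
    coefs = np.polyfit(x[train], y[train], deg=degree)
    return (r_squared(y[train], np.polyval(coefs, x[train])),
            r_squared(y[test], np.polyval(coefs, x[test])))

line_train_r2, line_test_r2 = fit_and_score(1)  # too simple for curved data
quad_train_r2, quad_test_r2 = fit_and_score(2)  # matches the shape of the data

print(f"line:      train R² {line_train_r2:.2f}, test R² {line_test_r2:.2f}")
print(f"quadratic: train R² {quad_train_r2:.2f}, test R² {quad_test_r2:.2f}")
```

The straight line scores poorly on both sets, which is the signature of underfitting, while the quadratic scores well on both.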

##### A good model:
- Captures the underlying logic of the dataset
- High train accuracy and similarly high test accuracy