# Linear Regression

## Summary


- **Regression**: a method for modeling the relationship between one dependent and one or more independent variables

Linear regression describes the linear relationship between the independent variables and the dependent one. Linear regression solves for the "line of best fit".

**Keywords**:
- supervised learning
- regression


### Assumptions

- **Linearity**: There needs to be a linear relationship between the dependent variable and the independent variables.
- **Independence of residuals**: Errors should be independent of one another (no autocorrelation) and should not be correlated with the independent variables.

![](../img/independence-residuals.jpg)

- **Normal distribution of residuals**: Error should be normally distributed, with a mean near 0.
- **Equal variance of residuals**: Error must have constant variance, or homoscedasticity.

![](../img/homoscedasticity.jpg)

- **No perfect or very high multicollinearity**

### Pros

- E

### Cons

- As the number of features increases, the model becomes more prone to overfitting

### Common Use Cases

- C


## How It Works

### Linear Regression

To calculate the best-fit line, linear regression uses the traditional slope-intercept form, extended to multiple independent variables: $y_i = \beta_0 + \beta_1x_{1,i} + ... + \beta_nx_{n,i}$

Mathematically, the best fit line is obtained by minimizing the **Residual Sum of Squares (RSS)**.

Linear regression generally uses **Mean Squared Error (MSE)**, which is the RSS averaged over the $N$ observations, as its **cost / loss function**, although other loss functions exist: $MSE=\frac{1}{N}\sum_{i=1}^{N}{(y_i - (\beta_0+\beta_1x_{1,i}+ ... + \beta_nx_{n,i}))^2}$.
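
As a minimal sketch of the two quantities above, the following NumPy snippet computes RSS and MSE for a set of coefficients and finds the least-squares coefficients for a toy dataset. The data values and the helper names `predict`, `rss`, and `mse` are made up for illustration and are not part of any library.

```python
import numpy as np

def predict(X, beta):
    """Return beta_0 + beta_1*x_1 + ... + beta_n*x_n for every row of X."""
    return beta[0] + X @ beta[1:]

def rss(X, y, beta):
    """Residual Sum of Squares for the coefficients beta."""
    residuals = y - predict(X, beta)
    return float(np.sum(residuals ** 2))

def mse(X, y, beta):
    """Mean Squared Error: RSS divided by the number of observations."""
    return rss(X, y, beta) / len(y)

# Toy data (illustrative only): y = 1 + 2x with no noise
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Least-squares fit: prepend a column of ones so beta_hat[0] is the intercept
design = np.column_stack([np.ones(len(X)), X])
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)

print("coefficients:", beta_hat)
print("MSE at the fit:", mse(X, y, beta_hat))
print("MSE with all-zero coefficients:", mse(X, y, np.zeros(2)))
```

Because MSE is just RSS divided by a constant, the coefficients that minimize one also minimize the other.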

Steps for **gradient descent**, which aims to optimize the cost function by minimizing its error through an iterative process:

1. Initialize the coefficients to zero and choose a **learning rate**, which defines how large each step will be. A larger learning rate can converge faster, but if it is too large the updates can overshoot and fail to converge to a minimum. A smaller learning rate takes longer to converge but follows the cost surface more closely.
2. Calculate the partial derivative of the MSE with respect to each coefficient ( $D_{\beta0}, D_{\beta1}, \text{etc}$ ). Each partial derivative is the slope of the line tangent to the cost curve at the current coefficient values. Together, these are called the **gradients**.
3. Update the coefficients: $\beta_0 = \beta_0 - L \times D_{\beta0}$, $\beta_1 = \beta_1 - L \times D_{\beta1}$, etc
    1. Where $L$ is the learning rate.
4. Repeat steps 2 and 3 until the cost function stops decreasing meaningfully. Ideally the cost approaches 0, which would mean the line fits the training data perfectly, but with noisy data the minimum is usually greater than 0.
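
The steps above can be sketched in NumPy as follows. This is a minimal illustration, not a production implementation: the function name `gradient_descent`, the default learning rate, the iteration cap, and the stopping tolerance `tol` are all assumptions chosen for this toy example.

```python
import numpy as np

def gradient_descent(X, y, learning_rate=0.05, n_iters=5000, tol=1e-12):
    """Fit linear regression coefficients by minimizing MSE with gradient descent."""
    n_samples = len(y)
    # Prepend a column of ones so beta[0] acts as the intercept
    design = np.column_stack([np.ones(n_samples), X])
    beta = np.zeros(design.shape[1])      # step 1: start with all coefficients at zero

    prev_cost = np.inf
    for _ in range(n_iters):
        residuals = y - design @ beta
        cost = np.mean(residuals ** 2)    # MSE at the current coefficients
        # step 4: stop once the cost no longer improves meaningfully
        if abs(prev_cost - cost) < tol:
            break
        prev_cost = cost
        # step 2: partial derivatives of MSE with respect to each coefficient
        gradients = (-2.0 / n_samples) * (design.T @ residuals)
        # step 3: move against the gradient, scaled by the learning rate
        beta = beta - learning_rate * gradients
    return beta

# Toy data (illustrative only): y = 1 + 2x with no noise
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = 1.0 + 2.0 * X[:, 0]

beta = gradient_descent(X, y)
print("estimated coefficients:", beta)
```

On this dataset a much larger learning rate would make the updates diverge, which is the overshooting behavior described in step 1.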

### Logistic Regression

For **logistic regression**, which has a binary dependent variable, we model the log-odds of the probability $p_i$ that observation $i$ belongs to the positive class: $\operatorname{logit}(p_i) = \ln(\frac{p_i}{1-p_i}) = \beta_0 + \beta_1x_{1,i} + ... + \beta_nx_{n,i}$.
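
To make the equation concrete, here is a small sketch of the logit and its inverse (the sigmoid), which converts the linear predictor back into a probability. The coefficient values are hypothetical and chosen only for illustration.

```python
import numpy as np

def logit(p):
    """Log-odds of probability p: ln(p / (1 - p))."""
    return np.log(p / (1.0 - p))

def sigmoid(z):
    """Inverse of the logit: maps a linear predictor z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients, for illustration only
beta = np.array([-1.0, 0.5])      # beta_0 (intercept), beta_1
x = np.array([0.0, 2.0, 4.0])

z = beta[0] + beta[1] * x         # linear predictor, i.e. the log-odds
p = sigmoid(z)                    # predicted probabilities

print("log-odds:", z)
print("probabilities:", p)
```

Applying `logit` to the predicted probabilities recovers the linear predictor, which is what the equation states.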



## Bias Variance Trade-Off

- **Bias** measures the error introduced by the simplifying assumptions a model makes, or how far its average predictions are from the true values.
- Generally, linear algorithms have high bias. They are simple and less flexible, so they can miss patterns that are not linear, but they also tend to have lower variance.
- **Variance** measures the sensitivity of the model to the training data, or how much the model changes when the training data changes.
- Ideally, variance is low and the model doesn't change much because the algorithm is well-suited to the underlying patterns in the training dataset. In other words, the model is more generalizable.
- We want both low bias and low variance. However, bias and variance tend to have an inverse relationship, known as the **bias variance trade-off**: a decrease in one often leads to an increase in the other.

![](../img/bias-variance.jpg)

## Overfitting and Underfitting

- **Overfitting** occurs when a model learns the patterns and the noise in the training data to such an extent that it hurts the performance of the model on unseen data. It interprets noise as patterns.
    - This happens when a model has low bias and high variance.
    - To prevent overfitting, try:
        - Cross-validation
        - If the training data is too small, add more examples
        - If there are too many features, use feature selection
        - Regularization
- **Underfitting** occurs when the model fails to learn the patterns in the training dataset and therefore cannot generalize to the test dataset either. This usually leads to low training and test accuracy.
    - This happens when a model has high bias and low variance.
    - To prevent underfitting, try:
        - Increasing the model complexity
        - Increasing the number of features in the training data
        - Removing noise from the data
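
One way to see the difference between the two in practice is to compare training error with cross-validated error as model complexity grows. The sketch below is only an illustration under stated assumptions: it uses polynomial degree as a stand-in for model complexity, synthetic data generated from a straight line plus noise, and hypothetical helper names `train_mse` and `k_fold_mse`.

```python
import numpy as np

def train_mse(x, y, degree):
    """MSE of a polynomial of the given degree, measured on the data it was fit to."""
    coeffs = np.polyfit(x, y, degree)
    return float(np.mean((y - np.polyval(coeffs, x)) ** 2))

def k_fold_mse(x, y, degree, k=5):
    """Average held-out MSE of a polynomial fit across k folds."""
    indices = np.arange(len(x))
    folds = np.array_split(indices, k)
    errors = []
    for held_out in folds:
        train = np.setdiff1d(indices, held_out)
        coeffs = np.polyfit(x[train], y[train], degree)
        preds = np.polyval(coeffs, x[held_out])
        errors.append(np.mean((y[held_out] - preds) ** 2))
    return float(np.mean(errors))

# Synthetic data (illustrative only): a straight line plus random noise
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=40)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.3, size=40)

for degree in (1, 6):
    print(f"degree={degree}",
          f"train MSE={train_mse(x, y, degree):.4f}",
          f"cross-validated MSE={k_fold_mse(x, y, degree):.4f}")
```

Training error cannot increase when the degree goes up, so it is not a reliable guide on its own. A held-out error that rises while training error falls is the usual sign of overfitting, and both errors being high suggests underfitting.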

## Paradigms

- **Statistician's paradigm** has the goal of understanding underlying causal relationships (inferential)
    - General question: "What is the causal effect of changes in an independent variable on changes in a dependent variable?"
    - Consistency and bias are important
- **Predictive paradigm** has the goal of prediction and pattern recognition
    - General question: "How accurately can we predict a dependent variable based on independent variables?"
    - Goodness-of-fit takes precedence over bias and efficiency

## Hypothesis Testing

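For each coefficient, we test whether its independent variable has a statistically significant linear relationship with the dependent variable.

$H_0: \beta_1=0$

$H_A: \beta_1\ne0$

To test, we use a **t-test**, computing a test statistic for each coefficient. The general form is $t=\frac{m-\mu}{s/\sqrt{n}}$. For a regression coefficient, the estimate $\hat{\beta}_1$ plays the role of $m$, the hypothesized value 0 plays the role of $\mu$, and the standard error of the estimate replaces $s/\sqrt{n}$, giving $t=\frac{\hat{\beta}_1-0}{SE(\hat{\beta}_1)}$.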
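
As a rough sketch of how this test statistic can be computed for an ordinary least-squares fit, the snippet below estimates each coefficient, its standard error, and the ratio of the two. The function name `coefficient_t_statistics` and the synthetic data are assumptions made for illustration. The standard errors use the usual formula based on the residual variance and $(X^TX)^{-1}$.

```python
import numpy as np

def coefficient_t_statistics(X, y):
    """Return (beta_hat, standard_errors, t_statistics) for a least-squares fit with intercept."""
    n_samples = len(y)
    design = np.column_stack([np.ones(n_samples), X])
    n_params = design.shape[1]

    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta_hat

    # Estimate of the error variance: RSS / (n - number of coefficients)
    sigma2 = residuals @ residuals / (n_samples - n_params)
    # Covariance of the coefficient estimates: sigma^2 * (X^T X)^{-1}
    cov = sigma2 * np.linalg.inv(design.T @ design)
    std_err = np.sqrt(np.diag(cov))

    # Under H_0 (coefficient = 0), t is the estimate divided by its standard error
    t_stats = beta_hat / std_err
    return beta_hat, std_err, t_stats

# Synthetic data (illustrative only): the second feature has a true coefficient of 0
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 3.0 + 2.0 * X[:, 0] + rng.normal(scale=1.0, size=200)

beta_hat, std_err, t_stats = coefficient_t_statistics(X, y)
print("estimates:", beta_hat)
print("standard errors:", std_err)
print("t statistics:", t_stats)
```

A large absolute t value is evidence against $H_0$ for that coefficient. Converting it to a p-value requires the t distribution with $n$ minus the number of coefficients degrees of freedom, which is left out here to keep the sketch dependent on NumPy only.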

## Cautions

- Regression finds correlations in data, and correlation does not necessarily imply causation.


## Evaluating the Model

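- R-Squared ($R^2$), also known as the coefficient of determination, represents the proportion of variation in the dependent variable that is explained / captured by the model
    - Typically ranges from 0 to 1 (it can be negative when the model fits worse than simply predicting the mean)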
    - The higher it is, the better the fit
    - $R^2=1-\frac{RSS}{TSS}$
        - Where RSS is the Residual Sum of Squares: $RSS=\sum(y-\beta_0-\beta_1x_1-...-\beta_nx_n)^2$
        - Where TSS is the Total Sum of Squares: $TSS=\sum(y-\bar{y})^2$
            - Where $\bar{y}$ is the mean of the observed $y$ values
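
The formula translates directly into a few lines of NumPy. This is a minimal sketch with made-up numbers, and the function name `r_squared` is chosen here for illustration.

```python
import numpy as np

def r_squared(y, y_pred):
    """Coefficient of determination: 1 - RSS / TSS."""
    rss = np.sum((y - y_pred) ** 2)            # Residual Sum of Squares
    tss = np.sum((y - np.mean(y)) ** 2)        # Total Sum of Squares
    return 1.0 - rss / tss

# Observed values (illustrative only)
y = np.array([1.0, 3.0, 5.0, 7.0])

perfect = r_squared(y, y)                                # predictions match exactly
baseline = r_squared(y, np.full_like(y, y.mean()))       # always predict the mean
partial = r_squared(y, np.array([2.0, 3.0, 5.0, 6.0]))   # RSS = 2, TSS = 20

print(perfect, baseline, partial)
```

Predicting every value exactly gives an $R^2$ of 1, always predicting the mean gives 0, and the third set of predictions falls in between at 0.9.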

## Improving the Model

- Look for **multicollinearity**, or strong relationships between independent variables. Remove redundancy when you identify it, because it can make it difficult to determine which variable is contributing to the prediction of the dependent variable. Find multicollinearity by:
    - Correlation
    - **Variance Inflation Factor** (VIF) measures how well one independent variable is explained by the other independent variables: $VIF=\frac{1}{1-R^2}$, where $R^2$ comes from regressing that variable on the others
        - As a rule of thumb, VIF > 10 is high and the variable should be considered for removal, 5 < VIF < 10 warrants inspection, and VIF < 5 is a good value.
- Spend time on feature selection, especially when there are many features.
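
The VIF calculation can be sketched by regressing each feature on the remaining ones and plugging the resulting $R^2$ into the formula. The function name `variance_inflation_factors` and the synthetic features below are assumptions for illustration, with one feature deliberately built as a near copy of another.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF for each column of X: regress it on the other columns, then 1 / (1 - R^2)."""
    n_samples, n_features = X.shape
    vifs = np.empty(n_features)
    for j in range(n_features):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Include an intercept in the auxiliary regression
        design = np.column_stack([np.ones(n_samples), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        rss = residuals @ residuals
        tss = np.sum((target - target.mean()) ** 2)
        r2 = 1.0 - rss / tss
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

# Synthetic features (illustrative only)
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)                 # unrelated to the others
c = a + 0.1 * rng.normal(size=500)       # nearly a copy of `a`
X = np.column_stack([a, b, c])

vifs = variance_inflation_factors(X)
print("VIF per feature:", vifs)
```

Here the first and third features inflate each other's VIF because either one can be predicted almost perfectly from the other, while the unrelated second feature stays close to 1. Dropping one of the redundant pair would bring the remaining values down.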
