# Regression

## Key Takeaways
- Supervised learning predicts a target from one or more features
    - If target is numerical → Regression Analysis
    - If target is categorical → Classification

- r² and RMSE are used as measures of goodness of fit, but a good fit is not necessarily a good solution to the _Business Problem_


---------------------------------------

h08c - Advanced Linear Regression (Notebook)

l08 - Lecture
        - What is Regression Analysis?
        - Regression Models
        - Simple Linear Regression Model
        - Coefficient of determination (r2) and RMSE – Goodness of the fit
        - Multiple Linear Regression

l08b - has something to be interpreted.


--------------------------------------


# Simple Linear Regression

Simple linear regression models the response as a linear function of a single feature: $y = \beta_0 + \beta_1x$

What does each term represent?
- $y$ is the response
- $x$ is the feature
- $\beta_0$ is the intercept
- $\beta_1$ is the coefficient for x

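The _Learning_ step consists of choosing $\beta_0$ and $\beta_1$ so that the sum of squared residuals (the squared differences between observed and predicted $y$) is as small as possible.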
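As a minimal sketch of this minimization, the closed-form least-squares solution for a single feature can be written in plain Python. The function names and the toy data below are made up for illustration and do not come from the lecture or notebook.

```python
def fit_simple_linear_regression(x, y):
    """Return (beta_0, beta_1) that minimize the sum of squared residuals."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope: co-variation of x and y divided by the variation of x.
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    beta_1 = s_xy / s_xx
    # Intercept: the fitted line passes through the point of means.
    beta_0 = y_mean - beta_1 * x_mean
    return beta_0, beta_1


def sum_squared_residuals(x, y, beta_0, beta_1):
    """Sum of squared differences between observed and predicted y."""
    return sum((yi - (beta_0 + beta_1 * xi)) ** 2 for xi, yi in zip(x, y))


# Toy data (hypothetical), roughly following y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

beta_0, beta_1 = fit_simple_linear_regression(x, y)
print(f"intercept={beta_0:.3f}, slope={beta_1:.3f}")
print("SSR at the fit:", sum_squared_residuals(x, y, beta_0, beta_1))
```

Any other choice of intercept and slope gives a larger sum of squared residuals on the same data, which is what "minimizing" means here.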

![Slope](img/slope_intercept.png)

## How well does the model fit the data?
The most common way to evaluate the overall fit of a linear model is $r^2$, the coefficient of determination. $r^2$ is the proportion of variance explained, meaning the proportion of variance in the observed data that is explained by the model, or the reduction in error over the null model. (The null model just predicts the mean of the observed response, and thus it has an intercept and no slope.)

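On the training data of a least-squares model with an intercept, $r^2$ is between 0 and 1, and higher is better because it means that more variance is explained by the model. On unseen data it can even be negative, which means the model predicts worse than the null model.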
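The definition of $r^2$ as a reduction in error over the null model can be sketched as follows. The function name and the numbers are illustrative assumptions, not values from the lecture.

```python
def r_squared(y_true, y_pred):
    """1 - (squared error of the model) / (squared error of the null model)."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model error
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)             # null-model error
    return 1 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]        # observed response (made up)
y_model = [2.8, 5.3, 6.9, 9.2]       # predictions of some fitted model (made up)
y_null = [sum(y_true) / len(y_true)] * len(y_true)  # null model: always the mean

r2_model = r_squared(y_true, y_model)
r2_null = r_squared(y_true, y_null)
print("model:", round(r2_model, 3))
print("null model:", round(r2_null, 3))
```

The null model scores 0 by construction, and a model that reproduces the observations exactly would score 1.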

![r_squared](img/r_squared.png)

## Model Evaluation Metrics for Regression
To measure the *quality of a model*, three error metrics are commonly used: Mean Absolute Error (MAE), Mean Squared Error (MSE) and Root Mean Squared Error (RMSE). MSE is more popular than MAE because squaring the errors "punishes" larger errors more heavily. RMSE is even more popular than MSE because it is expressed in the same units as $y$ and is therefore easier to interpret. Hint: **Smaller is better**, since all these metrics measure the _Error_.
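A small sketch of the three metrics, using made-up observations and predictions in which one prediction is far off, to show how squaring weights that error:

```python
import math


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    # Taking the square root brings the metric back to the units of y.
    return math.sqrt(mean_squared_error(y_true, y_pred))


# Hypothetical values; the last prediction misses by 4 units.
y_true = [10, 12, 15, 20]
y_pred = [11, 11, 15, 24]

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = root_mean_squared_error(y_true, y_pred)
print(f"MAE={mae}, MSE={mse}, RMSE={rmse:.3f}")
```

The single error of 4 contributes 16 of the 18 squared-error units, which is why MSE and RMSE react more strongly to outliers than MAE does.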

---

# Advanced Linear Regression
Simple linear regression can easily be extended to include multiple features. This is then called _multiple linear regression_.

$y = \beta_0 + \beta_1x_1 + ... + \beta_nx_n$

Each $x$ represents a different feature, and each feature has its own coefficient. For example, with advertising spend on TV, Radio and Newspaper as the features and Sales as the response:

$y = \beta_0 + \beta_1 \times TV + \beta_2 \times Radio + \beta_3 \times Newspaper$

As an example let's say we calculated the following coefficients:  
$\beta_1 = 0.046$  
$\beta_2 = 0.188$  
$\beta_3 = -0.001$  

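How do we interpret these coefficients? For a given amount of Radio and Newspaper ad spending, an **increase of $1000 in TV ad spending** is associated with an **increase in Sales of 46 widgets**. This reading assumes that ad spending is measured in thousands of dollars and Sales in thousands of widgets, so 0.046 units of Sales correspond to 46 widgets.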
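The arithmetic behind this interpretation can be sketched as below. The intercept value and the spending figures are hypothetical (the text gives neither), and the units are an assumption: ad spending in thousands of dollars, Sales in thousands of widgets.

```python
BETA_0 = 2.9  # hypothetical intercept, NOT given in the text
BETA_TV, BETA_RADIO, BETA_NEWSPAPER = 0.046, 0.188, -0.001


def predict_sales(tv, radio, newspaper):
    """Predicted Sales (assumed unit: thousands of widgets)."""
    return BETA_0 + BETA_TV * tv + BETA_RADIO * radio + BETA_NEWSPAPER * newspaper


# Same Radio and Newspaper spending, TV spending raised by one unit ($1000).
base = predict_sales(tv=100, radio=20, newspaper=10)
more_tv = predict_sales(tv=101, radio=20, newspaper=10)

extra_widgets = (more_tv - base) * 1000
print(f"Extra widgets for $1000 more TV spending: {extra_widgets:.0f}")
```

The intercept and the other two features cancel in the difference, so the result depends only on the TV coefficient.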

## Feature Selection

Which features should be included? One idea is to try different models and check whether $r^2$ goes up when new features are added. The drawback is that $r^2$ on the training data never decreases when features are added, even if they are unrelated to the response, so this approach always favors the larger model.

A better approach is _train/test split_ or _cross-validation_. Importantly, cross-validation can be applied to any model, whereas the methods described above only apply to linear models. 

As an example, apply a train/test split to the previous example and build two linear regression models: one with all features and one without _Newspaper_. Let's say the outcome is something like:

**Include Newspaper**: $RMSE = 1.4047; r^2 = 0.896$  
**Excluded Newspaper**: $RMSE = 1.3879; r^2 = 0.897$

We can see that it makes sense to _not_ include Newspaper - the RMSE goes up and $r^2$ goes down if it is included, so the model without Newspaper is better.
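The comparison in this section can be sketched with the standard library alone. Everything below is an assumption for illustration: the data is synthetic (Sales is generated from TV and Radio only), the helper names are made up, and the small least-squares solver stands in for whatever library the notebook uses (in practice a library such as scikit-learn would do the fitting and splitting).

```python
import math
import random


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_ols(rows, y):
    """Return [intercept, coef_1, ...] from the normal equations (X'X) b = X'y."""
    design = [[1.0] + list(row) for row in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


def predict(coefs, rows):
    return [coefs[0] + sum(c * v for c, v in zip(coefs[1:], row)) for row in rows]


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Synthetic rows of (TV, Radio, Newspaper, Sales); Newspaper has no real effect.
rng = random.Random(0)
data = []
for _ in range(200):
    tv, radio, newspaper = rng.uniform(0, 300), rng.uniform(0, 50), rng.uniform(0, 100)
    sales = 3.0 + 0.046 * tv + 0.188 * radio + rng.gauss(0, 1.5)
    data.append((tv, radio, newspaper, sales))

# Train/test split: shuffle once, then hold out the last 25% for testing.
rng.shuffle(data)
cut = int(0.75 * len(data))
train, test = data[:cut], data[cut:]

results = {}
for name, cols in {"with Newspaper": (0, 1, 2), "without Newspaper": (0, 1)}.items():
    x_train = [[row[c] for c in cols] for row in train]
    x_test = [[row[c] for c in cols] for row in test]
    y_train = [row[3] for row in train]
    y_test = [row[3] for row in test]
    coefs = fit_ols(x_train, y_train)  # fit on the training rows only
    results[name] = {
        "train_r2": r_squared(y_train, predict(coefs, x_train)),
        "test_rmse": rmse(y_test, predict(coefs, x_test)),
        "test_r2": r_squared(y_test, predict(coefs, x_test)),
    }
    print(name, results[name])
```

The training $r^2$ of the larger model is never lower than that of the smaller one, so only the test-set metrics say anything about whether Newspaper helps on unseen data. With a single split the difference between the two models can be small and depend on the shuffle, which is one reason cross-validation is often preferred.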

### Handling categorical Features with two categories
If the categorical feature has only two possibilities like "large/small" the data can be represented by a dummy variable _size\_large_ coded as 0/1.

**Interpretation:**  
Let's say the following coefficients are the outcome:  
$TV = 0.046$  
$Radio = 0.188$  
$Newspaper = -0.001$  
$Size\_large = 0.057$  

The interpretation for _Size\_large_ is as follows: For a given amount of TV/Radio/Newspaper ad spending, being a large market is associated with an average increase in Sales of 57 widgets (as compared to a small market, which is called the baseline level). Reversing the encoding (a dummy variable _size\_small_ with large as the baseline) would simply flip the sign of the coefficient, meaning _"Being a small market is associated with an average decrease in Sales of 57 widgets, compared to the baseline (which is now the large market)"_.


### Handling categorical Features with more than two categories
If a categorical feature has $k$ unordered categories, we need to create $k-1$ dummy variables. For example, Area has three categories: rural, suburban and urban. The needed dummy variables would be `area_suburban` and `area_urban`. `area_rural` is not needed, since rural is going to be our baseline. Both dummy variables are binary coded, and if both are zero, the baseline holds.
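A minimal sketch of $k-1$ dummy encoding in plain Python. The helper `encode_dummies` is hypothetical, and the baseline is a parameter; the example uses rural as the baseline so that the columns match the coefficients listed under the interpretation of this example.

```python
def encode_dummies(values, prefix, baseline):
    """Encode a categorical column as k-1 binary (0/1) dummy variables.

    The baseline category gets no column of its own: it is represented
    by a row in which every dummy is 0.
    """
    levels = sorted(set(values) - {baseline})
    return [{f"{prefix}_{level}": int(v == level) for level in levels} for v in values]


areas = ["rural", "suburban", "urban", "suburban"]  # made-up observations
encoded = encode_dummies(areas, prefix="area", baseline="rural")
for area, row in zip(areas, encoded):
    print(area, row)
```

Dropping one category avoids redundancy: once the other dummies and the intercept are known, a $k$-th dummy would carry no additional information.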

**Interpretation:**  
$TV = 0.046$  
$Radio = 0.188$  
$Newspaper = -0.001$  
$Size\_large = 0.077$  
$Area\_suburban = -0.107$  
$Area\_urban = 0.268$  

- Holding all other variables fixed, being a **suburban** area is associated with an average **decrease** in Sales of 107 widgets (as compared to the baseline level, which is rural)
- Holding all other variables fixed, being an **urban** area is associated with an average **increase** in Sales of 268 widgets (as compared to rural)
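Because both coefficients are measured against the same baseline, any two areas can be compared by subtracting their coefficients. A short sketch, assuming Sales is measured in thousands of widgets (the dictionary and function names are made up):

```python
# Area coefficients from the list above; the rural baseline is 0 by definition.
AREA_COEF = {"rural": 0.0, "suburban": -0.107, "urban": 0.268}


def sales_difference(area_a, area_b):
    """Average Sales difference in widgets of area_a versus area_b, all else equal."""
    return (AREA_COEF[area_a] - AREA_COEF[area_b]) * 1000


for a, b in [("suburban", "rural"), ("urban", "rural"), ("urban", "suburban")]:
    print(f"{a} vs {b}: {sales_difference(a, b):+.0f} widgets")
```

The last comparison is not listed in the coefficient table but follows from it: an urban area is associated with about 375 more widgets than a suburban one.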
