# COMP-2704: Supervised Machine Learning
### <span style="color:blue"> Week 6 </span>

## <span style="color:blue"> Chapter 4 continued</span>
### Another way to avoid overfitting: Regularization

* Regularization is a method for reducing overfitting.
* One can tune the hyperparameters of a model so that it overfits slightly, then use regularization to improve the model.
* This will tend to increase the training error, but lower the validation error.
* Remember, one should not find the testing error until after all of the hyperparameters, including those for regularization, have been set.

### The leaky roof example

<img src='leaky_house.jpg' width='300'/>

*Generated by AI*

There are three roofers to consider. Here is how well they perform:
* Roofer 1: uses cardboard and duct tape; reduces leakage to 1,000 ml of water per day.
* Roofer 2: uses shingles and nails; reduces leakage to 1 ml of water per day.
* Roofer 3: uses marble slabs and reinforcing beams; reduces leakage to 0 ml of water per day. 

A repair by roofer 3 will work the best. But how expensive is each fix?
* Roofer 1: \$1
* Roofer 2: \$100
* Roofer 3: \$100,000

Roofer 1 is the cheapest. However, roofer 2 seems like the best overall option.

**<span style="color:green">Q: What is a metric that would measure roofer 2 to be the best?</span>**

Well, we could add the *performance* and the *price*, then choose the roofer with the lowest total value:
* Roofer 1: $1000 + 1 = 1001$
* Roofer 2: $1 + 100 = 101$
* Roofer 3: $0 + 100{,}000 = 100{,}000$

We see that roofer 2 has the lowest value of this metric.
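
The metric is simple enough to compute by hand, but a short sketch makes the idea explicit. The dictionary layout and the name `total_cost` are illustrative assumptions; the leakage and price values are the ones listed above.

```python
# Leakage (ml per day) and price ($) for each roofer, from the lists above.
roofers = {
    "Roofer 1": {"leakage": 1000, "price": 1},
    "Roofer 2": {"leakage": 1, "price": 100},
    "Roofer 3": {"leakage": 0, "price": 100_000},
}

def total_cost(roofer):
    """Combined metric: performance (leakage) plus price."""
    return roofer["leakage"] + roofer["price"]

# Score every roofer and pick the one with the lowest combined value.
scores = {name: total_cost(r) for name, r in roofers.items()}
best = min(scores, key=scores.get)

print(scores)
print("Best overall:", best)  # Roofer 2
```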

* In machine learning, *performance* is analogous to *error* and *price* is analogous to a *regularization* term.
* Next we discuss how to create a regularization term and how it is combined with error to create a new error function.

### Another example of overfitting: Movie recommendations

Consider the following example:
* We have 10 movies, M1, M2, ..., M10; each one is a feature.
* There are 100 users; each one is a sample of data.
* The data set has 100 rows (one per user) and 10 columns; column $i$ holds the time $x_i$ (in seconds) that the user watched movie $i$.
* We want a model that predicts the time a user will spend watching a new movie, M11.
* We develop the following two models:
    * Model 1: $\hat{y} = 2x_3 + 1.4x_7 - 0.5x_9 + 8$
    * Model 2: $\hat{y} = 22x_1 - 103x_2 - 14x_3 + 109x_4 - 93x_5 + 203x_6 + 87x_7 - 55x_8 + 378x_9 - 25x_{10} + 8$
    
**<span style="color:green">Q: Which model do you think is overfitting?</span>**

* Notice that many coefficients in model 1 are 0.
* There are more coefficients in model 2, and they have larger absolute values.
* Using only regression error (such as RMSE) will lead to a model like model 2 during training. We need a new error function that leads to something more like model 1.

To create a new error function, consider the following two 'norms' of a model's coefficients:
* **L1** norm: sum the absolute values of all coefficients (but not the bias):
    * Model 1 $\Rightarrow |2| + |1.4| + |-0.5| = 3.9$
    * Model 2 $\Rightarrow |22| + |-103| + |-14| + |109| + |-93| + |203| + |87| + |-55| + |378| + |-25| = 1{,}089$
* **L2** norm: sum the squared values of all coefficients (but not the bias). Strictly speaking, this is the square of the L2 norm; the square root is usually dropped for regularization:
    * Model 1 $\Rightarrow 2^2 + 1.4^2 + (-0.5)^2 = 6.21$
    * Model 2 $\Rightarrow 22^2 + (-103)^2 + (-14)^2 + 109^2 + (-93)^2 + 203^2 + 87^2 + (-55)^2 + 378^2 + (-25)^2 = 227{,}131$
    
So we see that these norms are larger for complex models that tend to overfit.
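
The arithmetic above can be checked with a few lines of plain Python. This is a minimal sketch: the function names `l1_norm` and `l2_norm` are illustrative, and `l2_norm` follows the definition used here (sum of squares, no square root).

```python
def l1_norm(coefficients):
    """Sum of the absolute values of the coefficients (bias excluded)."""
    return sum(abs(c) for c in coefficients)

def l2_norm(coefficients):
    """Sum of the squared coefficients (bias excluded), as defined above."""
    return sum(c ** 2 for c in coefficients)

# Coefficients of the two movie models, without the bias term.
model_1 = [2, 1.4, -0.5]
model_2 = [22, -103, -14, 109, -93, 203, 87, -55, 378, -25]

# Round the float results so the printout is easy to compare with the text.
print(round(l1_norm(model_1), 2), round(l2_norm(model_1), 2))
print(l1_norm(model_2), l2_norm(model_2))
```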

### Modifying the error function to solve our problem: Lasso regression and ridge regression

We now have a regression error and a regularization term (L1 or L2), and we would like to keep both small:
* Keeping the regression error low will lead to better predictions on the training set.
* Keeping the norm low will help prevent the model from overfitting, so it makes good predictions on new data.

The way to do this is to create a new error function from the sum of these two terms:
$$ \text{error} = \text{regression error} + \lambda \left(\text{regularization term}\right)$$
* $\lambda \geq 0$ is a hyperparameter that controls the relative weight of the two terms.
    * A larger $\lambda$ gives the regularization term more weight, so training focuses more on preventing overfitting.
    * A smaller $\lambda$ gives the regression error more weight, so training focuses more on fitting the training data; $\lambda = 0$ turns regularization off.
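
As a rough illustration of the combined error function, the sketch below uses RMSE as the regression error and adds $\lambda$ times the chosen norm. The function names, the `norm` argument, and the three-sample toy data are assumptions made for this example; they are not taken from the textbook code.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between labels and predictions."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def regularized_error(y_true, y_pred, coefficients, lam, norm="l1"):
    """Regression error plus lam times the chosen regularization term."""
    if norm == "l1":
        penalty = sum(abs(c) for c in coefficients)
    elif norm == "l2":
        penalty = sum(c ** 2 for c in coefficients)
    else:
        raise ValueError("norm must be 'l1' or 'l2'")
    return rmse(y_true, y_pred) + lam * penalty

# Hypothetical toy data: three labels and three predictions.
y_true = [10.0, 12.0, 14.0]
y_pred = [11.0, 12.0, 13.0]
coefficients = [2, 1.4, -0.5]  # model 1 from the movie example, bias excluded

# The same predictions are penalized more heavily as lam grows.
for lam in (0, 0.1, 1):
    print(lam, round(regularized_error(y_true, y_pred, coefficients, lam), 4))
```

With `lam = 0` the result is just the RMSE; increasing `lam` raises the error of a model with large coefficients even though its predictions have not changed.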
    
There are three ways to add regularization:
* **Lasso** regularization uses the L1 norm
    * Coefficients tend to shrink to zero: $2 \to 1.99 \to 1.98 \to \dots \to 0.02 \to 0.01 \to 0$
    * Leaves a model with fewer coefficients.
* **Ridge** regularization uses the L2 norm
    * Coefficients tend to become smaller, but not zero: $2 \to 1.98 \to 1.9602 \to \dots \to 0.2734 \to 0.2707 \to 0.2680$
    * Leaves a model with more coefficients than Lasso regularization, but they are small.
* **Elastic net** regularization uses both the L1 and L2 norms
    * Adds both penalty terms to the error function at the same time.
    * The $\lambda_1$ and $\lambda_2$ coefficients for each can be set separately.
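
The two shrinkage sequences above can be reproduced with a simplified sketch. It assumes that the lasso penalty moves a coefficient toward zero by a fixed amount per step (0.01 here) and that the ridge penalty multiplies it by a fixed factor (0.99 here). These update rules and step sizes are illustrative assumptions chosen to match the numbers shown; they isolate the effect of the penalty and ignore the regression error, so this is not a full training loop.

```python
def lasso_step(w, step=0.01):
    """Move w toward zero by a fixed amount, stopping at zero."""
    if abs(w) <= step:
        return 0.0
    return w - step if w > 0 else w + step

def ridge_step(w, rate=0.01):
    """Shrink w by a fixed fraction of its current value."""
    return w * (1 - rate)

# Start both coefficients at 2, as in the sequences above.
w_lasso = w_ridge = 2.0
for _ in range(200):
    w_lasso = lasso_step(w_lasso)
    w_ridge = ridge_step(w_ridge)

# After 200 steps the lasso coefficient is (numerically) zero,
# while the ridge coefficient is small but still nonzero.
print(round(w_lasso, 4), round(w_ridge, 4))
```

The fixed-size step is what lets lasso remove coefficients entirely, whereas the proportional step used for ridge gets smaller as the coefficient shrinks and so never quite reaches zero.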
    
<img src='Fig4.7.png' width='600'/>

<span style="color:red">*Let us now review the textbook code Polynomial_regression_regularization.ipynb.*</span>

**<span style="color:green">Q: Notice that we are using a linear regression model to do polynomial regression. Can you explain that?</span>**