# Regularization

### Overfitting and Occam's Razor

In the last subject, we discussed model complexity and the ability to generalize from data. We saw two cases.

- Underfitting - the model is too simple and fails to fit the data/signal
- Overfitting - the model is too complex and fits the noise in addition to the signal

In this subject, we will see how to control **overfitting using regularization**. But first, **let's talk about Occam's razor**, which is the basic idea behind it and also an interesting principle in general.

### Occam's razor
Occam's razor is a principle which states that if multiple solutions are available, the simplest one is better than the others. **The idea is that it's easy to build overly complicated solutions with ad-hoc rules that don't generalize well.**

In the context of machine learning, the principle says that we should prefer simpler models unless we are sure that the complex ones are necessary.

We often say that generalization is the central goal of machine learning. Occam's razor is one of the important principles to achieve this. You can take a look at sections 3 and 4 of the paper "A few useful things to know about machine learning" by Pedro Domingos to learn more about the intuition behind generalization. Here is the link to the Google Scholar page: https://scholar.google.ch/scholar?cluster=4404716649035182981&hl=en&as_sdt=0,5

### Increasing the amount of data
The amount of data also plays a role in the under-/overfitting balance. Let's do a quick experiment. In this image, we show two polynomial regressions of degree 9 fitted to 10 and 80 data points from the same source of data.

https://d7whxh71cqykp.cloudfront.net/uploads/image/data/3734/data-size.svg

In the first case, the model is strongly overfitting. In fact, the polynomial passes through each data point. The problem is lessened in the second case.
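The experiment above can be sketched in a few lines of NumPy. This is only an illustration under assumptions that are not in the course material: the data source is taken to be a noisy sine curve, the inputs are evenly spaced on [0, 1], and the seed and noise level are arbitrary.

```python
import numpy as np

def true_signal(x):
    # Assumed data source for this sketch: one period of a sine wave
    return np.sin(2 * np.pi * x)

def sample(n, rng, noise=0.2):
    # n evenly spaced inputs with Gaussian noise added to the signal
    x = np.linspace(0, 1, n)
    y = true_signal(x) + rng.normal(scale=noise, size=n)
    return x, y

rng = np.random.default_rng(0)
x_test = np.linspace(0, 1, 500)  # dense grid to measure generalization

results = {}
for n in (10, 80):
    x, y = sample(n, rng)
    coefs = np.polyfit(x, y, deg=9)  # polynomial regression of degree 9
    train_mse = np.mean((np.polyval(coefs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - true_signal(x_test)) ** 2)
    results[n] = (train_mse, test_mse)
    print(f"n={n}: train MSE={train_mse:.2e}, test MSE={test_mse:.2e}")
```

With 10 points, a degree 9 polynomial has as many coefficients as there are data points, so the training error is essentially zero while the error against the true signal stays high. With 80 points, the same model has far less freedom to follow the noise.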

### Summary
In this unit, we learned about Occam's razor which is an important principle in machine learning. In the next unit, we will learn about regularization which is an efficient way to reduce overfitting.


# Regularization

In practice, we use regularization to fight overfitting. In this unit, we will see the basic idea behind it. We will then implement regularization with Scikit-learn in the next unit.

### L2 Regularization

When fitting a model, we are searching for a set of optimal parameters $\vec{w}$ that minimize the loss function $L(\vec{w})$.

As you can see, there are **no constraints on the parameters** $\vec{w}$. In particular, **they can get very large as long as they minimize the loss function.** However, **large coefficients are one of the symptoms of overfitting.** **With large coefficients, a small variation in the input data has a big effect on the predictions.**
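To see how large coefficients go hand in hand with overfitting, here is a minimal sketch. The noisy sine data, the seed, and the choice of degrees 3 and 9 are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ten noisy observations of an assumed sine-shaped signal
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

# A simple model (degree 3) and a model flexible enough to fit the noise (degree 9)
coefs_simple = np.polyfit(x, y, deg=3)
coefs_complex = np.polyfit(x, y, deg=9)

largest_simple = np.abs(coefs_simple).max()
largest_complex = np.abs(coefs_complex).max()

print(f"largest |coefficient|, degree 3: {largest_simple:.1f}")
print(f"largest |coefficient|, degree 9: {largest_complex:.1f}")
```

The degree 9 fit needs much larger coefficients to pass through every noisy point, which is the symptom that regularization targets.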

The idea behind **regularization is to add a constraint on the value of the parameters**. In practice, we include a penalization term in the cost function that measures how large the parameters are. For instance, $L_2$ regularization measures the squares of the parameters $w_i$.

$$\min_{\vec{w}} \; L(\vec{w}) + \alpha \sum_i w_i^2$$

When $\alpha$ tends toward zero, the constraint on the parameters vanishes, and the problem is the same as before. When $\alpha$ tends toward infinity, the $L(\vec{w})$ term becomes irrelevant compared to the $L_2$ one. In this case, all parameters are zero except the intercept term $w_0$.
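The two limits can be checked numerically. The sketch below assumes a specific loss that the text has not fixed at this point: the sum of squared residuals of a linear model, for which the penalized problem has a closed-form solution. The function name `fit_l2_penalized` and the synthetic data are hypothetical, introduced only for this example.

```python
import numpy as np

def fit_l2_penalized(X, y, alpha):
    """Minimize sum((y - w0 - X @ w) ** 2) + alpha * sum(w ** 2).

    The intercept w0 is left unpenalized by centering X and y first.
    """
    x_mean, y_mean = X.mean(axis=0), y.mean()
    X_c, y_c = X - x_mean, y - y_mean
    n_features = X.shape[1]
    # Closed-form solution of the penalized least-squares problem
    w = np.linalg.solve(X_c.T @ X_c + alpha * np.eye(n_features), X_c.T @ y_c)
    w0 = y_mean - x_mean @ w
    return w0, w

# Synthetic data (assumption): three features, known coefficients, small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 2.0 + X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

alphas = [0.0, 1.0, 100.0, 1e8]
fits = [fit_l2_penalized(X, y, alpha) for alpha in alphas]
norms = [np.linalg.norm(w) for _, w in fits]

for alpha, (w0, w), norm in zip(alphas, fits, norms):
    print(f"alpha={alpha:g}  w0={w0:.3f}  w={np.round(w, 3)}  ||w||={norm:.4f}")
```

With `alpha=0` the fit is ordinary least squares. As `alpha` grows, the coefficients shrink toward zero and the intercept approaches the mean of `y`.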

### Geometrical interpretation

*Note: it's not necessary to understand the mathematics behind this. Here, we introduce this geometrical interpretation because it can help us visualize the effect of regularization. However, if you are curious about the maths behind this interpretation and want to learn more about Lagrange multipliers, you can take a look at this excellent tutorial from khanacademy.org.*

The penalized problem above can equivalently be viewed as a constrained one: minimize $L(\vec{w})$ subject to $\sum_i w_i^2 \leq c^2$, where the radius $c$ shrinks as $\alpha$ grows. In practice, we never use this formulation to fit our models. However, it's useful because it provides a nice geometrical interpretation. For a model with two parameters $w_1$ and $w_2$, we are searching for a point inside a circle of radius $c$. In three dimensions, we are searching for a set of values inside a sphere of radius $c$. Here is an illustration of the two-dimensional case.

https://d7whxh71cqykp.cloudfront.net/uploads/image/data/2938/regularization-geometrical-interpretation.svg

In this image, the blue point outside the circle represents the parameters $w_1$ and $w_2$ that minimize the unconstrained loss value $L(\vec{w})$. This minimal value is not inside the circle of radius $c$. Hence it's not a valid solution according to the constraint. The solution that minimizes the cost function inside the gray circle is denoted $w^*$ ($w$ star).
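The picture can be reproduced numerically for a two-parameter model. The sketch below is one possible way to solve the constrained problem $w_1^2 + w_2^2 \leq c^2$, using projected gradient descent; this technique, the squared-error loss, the absence of an intercept, and the synthetic data are all assumptions made for illustration, not the method used in the course.

```python
import numpy as np

def loss(w, X, y):
    # Assumed loss: sum of squared residuals of a linear model without intercept
    return np.sum((X @ w - y) ** 2)

def project_onto_ball(w, c):
    # Closest point to w inside the circle (ball) of radius c
    norm = np.linalg.norm(w)
    return w if norm <= c else w * (c / norm)

def constrained_fit(X, y, c, n_iter=2000):
    # Step size 1/L, where L = 2 * sigma_max(X)^2 bounds the gradient's Lipschitz constant
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ w - y)
        w = project_onto_ball(w - step * grad, c)  # gradient step, then projection
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([3.0, 2.0]) + rng.normal(scale=0.1, size=100)

c = 1.0
w_unconstrained = np.linalg.lstsq(X, y, rcond=None)[0]  # the "blue point"
w_star = constrained_fit(X, y, c)                        # best point inside the circle

print("unconstrained:", w_unconstrained, "norm:", np.linalg.norm(w_unconstrained))
print("constrained:  ", w_star, "norm:", np.linalg.norm(w_star))
```

Because the unconstrained minimum lies outside the circle, the constrained solution ends up on its boundary, with a norm equal to `c`.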
 
### Other regularizers

We will use $L_2$ regularization in this course, but there are other regularizers. For instance, the **lasso regularization $L_1$ is a variant of $L_2$** which penalizes the absolute value of the coefficients instead of their squares.
 
 https://d7whxh71cqykp.cloudfront.net/uploads/image/data/2937/l1-regularization.svg
 
In other words, with $L_1$ regularization, the optimal solution tends to have only a few non-zero parameters. We say that the solution is sparse, which is a desired property in some cases. You can take a look at this thread if you want to learn more about this topic.
 https://www.quora.com/What-is-the-difference-between-L1-and-L2-regularization-How-does-it-solve-the-problem-of-overfitting-Which-regularizer-to-use-and-when/answer/Xavier-Amatriain
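The sparsity effect can be seen in a deliberately simplified setting. The sketch below assumes that the loss decouples into one term per parameter, $\frac{1}{2}(w_i - a_i)^2$, where $a_i$ is the value the parameter would take without any penalty. Under that assumption, both penalties have a closed-form solution. The values in `a` and the function names are hypothetical.

```python
import numpy as np

def l1_solution(a, alpha):
    # Minimizes 0.5 * (w - a)**2 + alpha * |w| elementwise (soft-thresholding)
    return np.sign(a) * np.maximum(np.abs(a) - alpha, 0.0)

def l2_solution(a, alpha):
    # Minimizes 0.5 * (w - a)**2 + alpha * w**2 elementwise (uniform shrinkage)
    return a / (1.0 + 2.0 * alpha)

a = np.array([3.0, -0.4, 0.2, -2.5, 0.05])  # unpenalized optimum (made-up values)
alpha = 0.5

w_l1 = l1_solution(a, alpha)
w_l2 = l2_solution(a, alpha)

print("L1:", w_l1, "non-zero:", np.count_nonzero(w_l1))
print("L2:", w_l2, "non-zero:", np.count_nonzero(w_l2))
```

The $L_1$ penalty sets every parameter whose unpenalized value is smaller than `alpha` in absolute value exactly to zero, while the $L_2$ penalty only scales all parameters down and leaves them non-zero.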

### Summary
Let's summarize what we've learned in this unit. Here are a few takeaways.

- The idea behind regularization is to add a **constraint** on the amplitude of the coefficients.
- This constraint corresponds to an additional term in the cost function called the **penalization term.**
- We use a parameter $\alpha$ to control the **regularization strength.**

In the next unit, we will implement $L_2$ regularization for linear regressions.

### Ridge Regression

In the last unit, we learned about $L_2$ regularization and saw that it adds a constraint on the length of the vector of parameters $\vec{w}$. So far, **we didn't specify any particular model or cost function**, but if we use **multi-linear regressions and minimize the sum of the squared residuals**, we obtain the ridge regression model.
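Written out, the ridge objective combines the two terms as follows. The notation is introduced here for illustration and is an assumption: $n$ data points $(\vec{x}_j, y_j)$, $p$ features per data point, and an intercept $w_0$ that is not penalized.

$$\min_{\vec{w}} \; \sum_{j=1}^{n} \Big( y_j - w_0 - \sum_{i=1}^{p} w_i x_{j,i} \Big)^2 + \alpha \sum_{i=1}^{p} w_i^2$$

The first sum is the loss $L(\vec{w})$ of the linear regression, and the second is the $L_2$ penalization term weighted by $\alpha$.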