# **Lecture 4: Foundations of Supervised Learning**
## Applied Machine Learning

## **Why does Supervised Learning work?**
Previously, we learned about supervised learning, derived our first algorithm, and used it to predict diabetes risk.

In this lecture, we are going to dive deeper into why supervised learning really works.

## **Part 1: Data Distribution**
First, let's look at the data, and define where it comes from.\
Later, this will be useful to precisely define when supervised learning is guaranteed to work.



### **Review: Components of A Supervised Machine Learning Problem**
At a high level, a supervised machine learning problem has the following structure:
$$\underbrace{\text{Training Dataset}}_\text{Features + Attributes} + \underbrace{\text{Learning Algorithm}}_\text{Model class + Objective + Optimizer} \to \text{Predictive Model}$$

Where does the data come from?

### **Data Distribution**
We will assume that the dataset is sampled from a probability distribution $\mathbb{P}$, which we will call the *data distribution*. We will denote this as
$$x, y \sim \mathbb{P}$$
The training set $\mathcal{D} = \{(x^{(i)},y^{(i)}) \mid i = 1,2,\dots, n \}$ consists of independent and identically distributed (IID) samples from $\mathbb{P}$.

### **Data Distribution: IID Sampling**
The key assumption is that the training examples are independent and identically distributed (IID).

*   Each training example is drawn from the same distribution.
*   This distribution does not depend on previous training examples.

**Example:** Flipping a coin. Each flip has the same probability of heads and tails and does not depend on the previous flips.

**Counter-Example:** Yearly census data. The population in each year will be close to that of the previous year, so the samples are not independent.
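
To make the contrast concrete, here is a minimal sketch that simulates both cases and measures how strongly each value depends on the one before it. The `population` series is a hypothetical stand-in for census data (a simple random walk), not real data, and `lag1_correlation` is a helper introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# IID example: 10,000 fair coin flips (1 = heads, 0 = tails).
flips = rng.integers(0, 2, size=10_000)

# Non-IID example: a hypothetical "population" that changes a little each year.
# Each value is the previous value plus a small random change.
population = 1000.0 + np.cumsum(rng.normal(0.0, 1.0, size=10_000))

def lag1_correlation(series):
    """Correlation between each value and the value that precedes it."""
    return np.corrcoef(series[:-1], series[1:])[0, 1]

flip_corr = lag1_correlation(flips)
population_corr = lag1_correlation(population)

print(f"coin flips: {flip_corr:+.3f}")       # expected to be close to 0
print(f"population: {population_corr:+.3f}")  # expected to be close to 1
```

A correlation near zero is consistent with independence between consecutive flips, while the population series is almost perfectly predictable from its previous value.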

### **Data Distribution: Example**

Let's implement an example of a data distribution in numpy.

In [None]:
import numpy as np
np.random.seed(0)

def true_fn(X):
  return np.cos(1.5*np.pi*X)


Let's visualize it

In [None]:
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [12,4]

X_test = np.linspace(0, 1, 100)
plt.plot(X_test, true_fn(X_test), label="True function")
plt.legend()

Let's now draw samples from the distribution. We will generate a random input $x$, and then generate a random target $y$ using
$$y = f(x) + \epsilon$$
for a random noise variable $\epsilon$.

In [None]:
n_samples = 30

X = np.sort(np.random.rand(n_samples))
y = true_fn(X) + np.random.rand(n_samples)*0.1

We can visualize the samples

In [None]:
plt.plot(X_test, true_fn(X_test), label='True function')
plt.scatter(X, y, edgecolor='b', s=20, label='Samples' )
plt.legend()

### **Data Distribution: Motivation**
Why assume that the dataset is sampled from a distribution?

*   There is inherent uncertainty in the data. The data may consist of noisy measurements (e.g., readings from an imperfect thermometer).
*   There is uncertainty in the process we model. If $y$ is a stock price, there is randomness in the market that cannot be modeled.
*   We can use probability and statistics to analyze supervised machine learning algorithms and prove that they work.




## **Part 2: Why Does Supervised Learning Work?**

We assumed that the training dataset is sampled from a data distribution.

Let's now use this assumption to gain intuition about why supervised learning works.

### **Review: Data distribution**

We will assume that the dataset is sampled from a probability distribution $\mathbb{P}$, which we will call the *data distribution*. We will denote this as
$$x, y \sim \mathbb{P}$$
The training set $\mathcal{D} = \{(x^{(i)},y^{(i)}) \mid i = 1,2,\dots, n \}$ consists of independent and identically distributed (IID) samples from $\mathbb{P}$.

### **Review: Supervised Learning Model**
We'll say that a model is a function
$$f: \mathcal{X} \to \mathcal{Y}$$

that maps inputs $x \in \mathcal{X}$ to targets $y \in \mathcal{Y}$

### **What Makes A Good Model?**
There are several things we may want out of a good model.

1.   Interpretable features that explain how $x$ affects $y$,
2.   Confidence intervals around $y$ (we will see later how to obtain these),
3.   Accurate predictions of the targets $y$ from inputs $x$.

In this lecture, we will focus on the last of these.




### **What Makes A Good Model?**
A good predictive model is one that makes **accurate predictions** on **new data** that it has not seen at training time.

### **Hold-out Dataset: Definition**
A hold-out dataset:
$$\dot{\mathcal{D}} = \{(\dot{x}^{(i)},\dot{y}^{(i)})\mid i = 1,2,\dots,m\}$$
is another dataset that is sampled IID from the same distribution $\mathbb{P}$ as the training dataset $\mathcal{D}$ and the two datasets are disjoint.

Let's generate a hold-out dataset for the example we saw earlier

In [None]:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0)

def true_fn(X):
    return np.cos(1.5 * np.pi * X)

X_test = np.linspace(0, 1, 100)
plt.plot(X_test, true_fn(X_test), label="True function")
plt.legend()

In [None]:
n_samples, n_holdout_samples = 30, 30

X = np.sort(np.random.rand(n_samples))
y = true_fn(X) + np.random.rand(n_samples)*0.1
X_holdout = np.sort(np.random.rand(n_holdout_samples))
y_holdout = true_fn(X_holdout) + np.random.rand(n_holdout_samples)*0.1

plt.plot(X_test, true_fn(X_test), label='True Function')
plt.scatter(X,y, edgecolor='b', s=20, label='Samples')
plt.scatter(X_holdout, y_holdout, edgecolor='r', s=20, label='Holdout Samples')
plt.legend()

### **Defining What is an Accurate Prediction**
Suppose that we have a function $isaccurate(y,y')$ that determines if $y$ is an accurate estimate of $y'$, e.g.:


*   Is the target variable close enough to the true target?
$$isaccurate(y,y') = \text{true if } |y - y'| \text{ is small, else false}$$
*   Did we predict the right class?
$$isaccurate(y,y') = \text{true if } y = y' \text{, else false}$$

This defines accuracy on a single data point. We say a supervised learning model is accurate if it correctly predicts the target on new (held-out) data.

### **Defining What is an Accurate Model**
We can say that a predictive model $f$ is accurate if its probability of making an error on a random hold-out sample is small:
$$1 - \mathbb{P}[isaccurate(\dot{y}, f(\dot{x}))] \leq \epsilon$$
for $\dot{x}, \dot{y} \sim \mathbb{P}$, for some small $\epsilon > 0$ and some definition of accuracy.

We can also say that a predictive model $f$ is inaccurate if its probability of making an error on a random hold-out sample is large:
$$1 - \mathbb{P}[isaccurate(\dot{y}, f(\dot{x}))] \geq \epsilon$$
Equivalently,
$$\mathbb{P}[isaccurate(\dot{y}, f(\dot{x}))] \leq 1 - \epsilon$$
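
As a rough illustration, the error probability above can be estimated by drawing many hold-out samples and counting how often a model is inaccurate. The sketch below reuses the synthetic distribution from Part 1; the tolerance `tol=0.2`, the helper `sample`, and the two models `good_model` and `bad_model` are assumptions made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    return np.cos(1.5 * np.pi * x)

def is_accurate(y_true, y_pred, tol=0.2):
    """Regression-style accuracy: the prediction is within `tol` of the target."""
    return np.abs(y_true - y_pred) <= tol

def sample(n):
    """Draw n IID samples (x, y) from the synthetic data distribution."""
    x = rng.random(n)
    y = true_fn(x) + rng.random(n) * 0.1
    return x, y

# Two hypothetical models: one equal to the true function, one that always predicts 0.
good_model = true_fn
bad_model = lambda x: np.zeros_like(x)

x_holdout, y_holdout = sample(10_000)
errors = {}
for name, model in [("good", good_model), ("bad", bad_model)]:
    # Fraction of hold-out points on which the model is NOT accurate.
    errors[name] = 1.0 - is_accurate(y_holdout, model(x_holdout)).mean()
    print(f"{name} model: estimated error probability = {errors[name]:.3f}")
```

Under this definition, the first model would count as accurate for a small $\epsilon$, while the constant model would count as inaccurate.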

### **Generalization**
In machine learning, **generalization** is the property of a predictive model to achieve good performance on new, held-out data that is distinct from the training set.

Will supervised learning return a model that generalizes?

### **Recall: Supervised Learning**
Recall our intuitive definition of supervised learning:


1.   First, we collect a dataset of labeled training examples.
2.   We train a model to output accurate predictions on this dataset.
3.   When the model sees new, similar data, it will also be accurate.


### **Recall: Supervised Learning**
Recall that supervised learning at a high level performs the following procedure:



1.   Collect the training dataset $\mathcal{D}$  of labeled examples,
2.   Output a model that is accurate on $\mathcal{D}$

I claim that the output model is also guaranteed to generalize if $\mathcal{D}$ is large enough.





### **Applying Supervised Learning**
In order to prove that supervised learning works, we will make two simplifying assumptions:

1.   We define a model class $\mathcal{M}$ containing $H$ different models.
2.   One of these models fits the training dataset perfectly (is accurate on every datapoint) and we choose that model.

(Both of these assumptions can be relaxed.)



### **Why Supervised Learning Works**
**Claim:** The probability that supervised learning will return an inaccurate model decreases exponentially with the training set size $n$.

1.   A model $f$ is inaccurate if $\mathbb{P}[isaccurate(\dot{y}, f(\dot{x}))] \leq 1 - \epsilon$. Because the training examples are IID, the probability that an inaccurate model $f$ fits the training set perfectly is at most $\prod_{i=1}^n\mathbb{P}[isaccurate(y^{(i)}, f(x^{(i)}))] \leq (1 - \epsilon)^n$.
2.   We have $H$ models in $\mathcal{M}$, and any of them could be inaccurate. By the union bound, the probability that at least one of the at most $H$ inaccurate models will fit the training set perfectly is $\leq H(1 - \epsilon)^n$.

Supervised learning returns a model that fits the training set perfectly, so the probability that this model is inaccurate is at most $H(1 - \epsilon)^n$, which decreases exponentially in $n$. Therefore, the claim holds.
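
To get a feel for the bound $H(1 - \epsilon)^n$, the sketch below evaluates it for a few training set sizes and solves for the smallest $n$ that pushes it below a target failure probability. The values of `H`, `eps`, and `delta` are illustrative assumptions, and `failure_bound` and `samples_needed` are helper names introduced here.

```python
import math

def failure_bound(H, eps, n):
    """Upper bound H * (1 - eps)^n on the probability of returning an inaccurate model."""
    return H * (1.0 - eps) ** n

def samples_needed(H, eps, delta):
    """Smallest n for which H * (1 - eps)^n <= delta."""
    return math.ceil(math.log(H / delta) / -math.log(1.0 - eps))

H, eps, delta = 1000, 0.05, 0.01  # illustrative values, not taken from the lecture

for n in [50, 100, 200, 400]:
    print(f"n = {n:3d}: bound = {failure_bound(H, eps, n):.6f}")

n_min = samples_needed(H, eps, delta)
print("samples needed:", n_min)
```

For small $n$ the bound can exceed 1, in which case it says nothing. The required $n$ grows only logarithmically with the number of models $H$.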



## **Part 3: Underfitting and Overfitting**
Let's now dive deeper into the concept of generalization and two possible failure modes of supervised learning: overfitting and underfitting.

### **Review: Generalization**
We will assume that the dataset is governed by a probability distribution $\mathbb{P}$, which we will call the *data distribution*. We will denote this as
$$x, y \sim \mathbb{P}$$

A hold-out set $\dot{\mathcal{D}} = \{(\dot{x}^{(i)}, \dot{y}^{(i)}) \mid i= 1,2, \dots, m\}$ consists of independent and identically distributed (IID) samples from $\mathbb{P}$ and is distinct from the training set.

A model that **generalizes** is accurate on a hold-out dataset.

### **Review: Polynomial Regression**
In 1D polynomial regression, we fit a model
$$f_{\theta}(x) := \theta^\top \phi(x)$$
that is linear in $\theta$ but non-linear in $x$ because the feature map $\phi : \mathbb{R} \to \mathbb{R}^{p+1}$ is non-linear.

By using polynomial features such as $\phi(x) = [1\; x\; x^2\; \dots\; x^p]$, we can fit any polynomial of degree $p$.
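
A minimal numpy sketch of this idea follows. It builds the features $[1, x, \dots, x^p]$ with `np.vander` and fits $\theta$ by ordinary least squares. The cubic used to generate the data and the helper name `poly_features` are assumptions chosen for illustration.

```python
import numpy as np

def poly_features(x, p):
    """Map each scalar x to the feature vector [1, x, x^2, ..., x^p]."""
    return np.vander(x, N=p + 1, increasing=True)

# Hypothetical example: noiseless samples from a known cubic.
x = np.linspace(-1.0, 1.0, 20)
y = 1.0 - 2.0 * x + 0.5 * x**3

Phi = poly_features(x, p=3)                      # one row of p + 1 features per sample
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # the fit is linear in theta

print(Phi.shape)
print(np.round(theta, 6))
```

Because the data are an exact cubic, the fitted $\theta$ should recover the generating coefficients up to numerical error.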

### **Polynomials Fit the Data Better**
When we switch from linear models to polynomials, we can better fit the data and increase the accuracy of our models.

Consider the synthetic dataset we have seen earlier.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

np.random.seed(0)
n_samples = 30 
X = np.sort(np.random.rand(n_samples))
y = true_fn(X) + np.random.rand(n_samples)*0.1

X_test = np.linspace(0, 1, 100)
plt.plot(X_test, true_fn(X_test), label='True Functions')
plt.scatter(X, y, edgecolor='b', s=20, label='Samples')
plt.legend()

Although fitting a linear model does not work well, quadratic or cubic polynomials improve the fit.

In [None]:
degrees = [1, 2, 3]
plt.figure(figsize = (14,5))
for i in range(len(degrees)):
  ax = plt.subplot(1, len(degrees), i+1)

  polynomial_features = PolynomialFeatures(degree=degrees[i], include_bias = False)
  linear_regression = LinearRegression()
  pipeline = Pipeline([("pf", polynomial_features), ('lr', linear_regression)])
  pipeline.fit(X[:, np.newaxis], y )

  ax.plot(X_test, true_fn(X_test), label='True Functions')
  ax.plot(X_test, pipeline.predict(X_test[:,np.newaxis]), label='Model')
  ax.scatter(X, y, edgecolor='b', s=20, label='Samples')
  ax.set_xlim((0,1))
  ax.set_ylim((-2,2))
  ax.legend(loc='best')
  ax.set_title("Polynomial of Degree {}".format(degrees[i]))


### **Towards Higher-Degree Polynomial Features?**
As we increase the complexity of our model class $\mathcal{M}$ to even higher degree polynomials, we are able to fit the training data increasingly well.

What happens if we further increase the degree of the polynomial?

In [None]:
degrees = [30]
plt.figure(figsize = (14,5))
for i in range(len(degrees)):
  ax = plt.subplot(1, len(degrees), i+1)

  polynomial_features = PolynomialFeatures(degree=degrees[i], include_bias = False)
  linear_regression = LinearRegression()
  pipeline = Pipeline([("pf", polynomial_features), ('lr', linear_regression)])
  pipeline.fit(X[:, np.newaxis], y )

  ax.plot(X_test, true_fn(X_test), label='True Functions')
  ax.plot(X_test, pipeline.predict(X_test[:,np.newaxis]), label='Model')
  ax.scatter(X, y, edgecolor='b', s=20, label='Samples')
  ax.set_xlim((0,1))
  ax.set_ylim((-2,2))
  ax.legend(loc='best')
  ax.set_title("Polynomial of Degree {}".format(degrees[i]))

### **The Problem With Increasing Model Capacity**

As the degree of the polynomial increases to the size of the dataset, we are increasingly able to fit every point in the dataset.

However, this results in a highly irregular curve: its behaviour outside the training set is wildly inaccurate.


### **Overfitting**
Overfitting is one of the most common failure modes of machine learning:

*   A very expressive model (a high degree polynomial) fits the training dataset perfectly.
*   The model also makes incorrect predictions outside the training set, and does not generalize.




### **Underfitting**
A related failure mode is underfitting:


*   A small model (e.g., a straight line) will not fit the training data well.
*   Held-out data is similar to training data, so the model will not be accurate on it either.

Finding the tradeoff between overfitting and underfitting is one of the main challenges in applying machine learning.



### **Overfitting vs. Underfitting: Evaluation**
We can measure overfitting and underfitting by estimating accuracy on a held-out dataset and comparing it to accuracy on the training dataset.

*   If training performance is high but held-out performance is low, we are overfitting.
*   If performance on both the training and held-out datasets is low, we are underfitting.



In [None]:
degrees = [1,40,5]
title = ['Underfitting', 'Overfitting', 'Appropriate Capacity']
plt.figure(figsize=(14,5))
for i in range(len(degrees)):
  ax = plt.subplot(1, len(degrees), i+1)

  polynomial_features = PolynomialFeatures(degree=degrees[i], include_bias = False)
  linear_regression = LinearRegression()
  pipeline = Pipeline([("pf", polynomial_features), ('lr', linear_regression)])
  pipeline.fit(X[:, np.newaxis], y )

  ax.plot(X_test, true_fn(X_test), label='True Functions')
  ax.plot(X_test, pipeline.predict(X_test[:,np.newaxis]), label='Model')
  ax.scatter(X, y, edgecolor='b', s=20, label='Samples', alpha=0.2)
  ax.scatter(X_holdout[::3], y_holdout[::3], edgecolor= 'r', s=20, label='Holdout Samples')
  ax.set_xlim((0,1))
  ax.set_ylim((-2,2))
  ax.legend(loc='best')
  ax.set_title("{} (Degree {})".format(title[i], degrees[i]))
  ax.text(0.05, -1.7, 'Hold-out MSE: %.4f' %((y_holdout - pipeline.predict(X_holdout[:,np.newaxis]))**2).mean())
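
The training versus held-out comparison described above can also be tabulated directly. The following is a self-contained sketch that uses `numpy.polynomial.Polynomial.fit` rather than the scikit-learn pipeline from the cell above. The degrees, the helpers `sample` and `mse`, and the fresh random draws are assumptions for illustration, so the numbers will not match the plots exactly.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def true_fn(x):
    return np.cos(1.5 * np.pi * x)

def sample(n):
    """Draw n IID samples from the synthetic data distribution."""
    x = np.sort(rng.random(n))
    return x, true_fn(x) + rng.random(n) * 0.1

def mse(model, x, y):
    return float(np.mean((y - model(x)) ** 2))

x_train, y_train = sample(30)
x_hold, y_hold = sample(30)

results = {}
for degree in [1, 5, 15]:
    # Least-squares polynomial fit of the given degree on the training set only.
    model = Polynomial.fit(x_train, y_train, deg=degree)
    results[degree] = (mse(model, x_train, y_train), mse(model, x_hold, y_hold))
    print(f"degree {degree:2d}: train MSE = {results[degree][0]:.5f}, "
          f"hold-out MSE = {results[degree][1]:.5f}")
```

A degree whose error is high on both sets is underfitting. A degree whose training error keeps shrinking while its hold-out error does not follow is a sign of overfitting.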

### **Dealing with Underfitting**
Balancing overfitting vs. underfitting is a major challenge in applying machine learning. Briefly, here are some approaches:
*   To fight underfitting, we may enlarge our model class to encompass more expressive models.
*   We may also create richer features for the data that will make the dataset easier to fit.



### **Dealing with Overfitting**
We will see many ways to deal with overfitting, but here are some ideas:

*   We may reduce the complexity of our model by reducing the size of $\mathcal{M}$.
*   We may also modify our objective to penalize complex models that may overfit our data.



## **Part 4: Regularization**
We will now see a very important way to reduce overfitting - regularization. We will see several important new algorithms.

### **Review: Generalization**
We will assume that the dataset is governed by a probability distribution $\mathbb{P}$, which we will call the *data distribution*. We will denote this as
$$x, y \sim \mathbb{P}.$$

A hold-out set $\dot{\mathcal{D}} = \{(\dot{x}^{(i)}, \dot{y}^{(i)}) \mid i = 1,2,\dots,m\}$ consists of *independent and identically distributed* (IID) samples from $\mathbb{P}$ and is distinct from the training set.

### **Review: Overfitting**
Overfitting is one of the most common failure modes of machine learning.
* A very expressive model (a high degree polynomial) fits the training dataset perfectly.
* The model also makes wildly incorrect predictions outside this dataset, and doesn't generalize.

We will visualize overfitting by trying to fit a small dataset with a high degree polynomial.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

degrees = [30]
plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)

    polynomial_features = PolynomialFeatures(degree=degrees[i], include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([("pf", polynomial_features), ("lr", linear_regression)])
    pipeline.fit(X[:, np.newaxis], y)

    X_test = np.linspace(0, 1, 100)
    ax.plot(X_test, true_fn(X_test), label="True function")    
    ax.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    ax.scatter(X, y, edgecolor='b', s=20, label="Samples")
    ax.set_xlim((0, 1))
    ax.set_ylim((-2, 2))
    ax.legend(loc="best")
    ax.set_title("Polynomial of Degree {}".format(degrees[i]))

### **Regularization: Intuition**
The idea of regularization is to penalize complex models that may overfit the data.\
In the previous example, a less complex model would rely less on polynomial terms of high degree.

### **Regularization: Definition**
The idea of regularization is to train models with an augmented objective $J : \mathcal{M} \to \mathbb{R}$ defined over a training dataset $\mathcal{D}$ of size $n$ as
$$J(f) = \frac{1}{n}\sum_{i=1}^nL(y^{(i)}, f(x^{(i)})) + \lambda\cdot R(f)$$
Let's dissect the components of this objective:

*   A loss function $L(y, f(x))$ such as the mean squared error.
*   A regularizer $R: \mathcal{M} \to \mathbb{R}$ that penalizes models that are overly complex.
*   A regularization parameter $\lambda> 0$, which controls the strength of the regularizer.

When the model $f_{\theta}$ is parametrized by $\theta$, we can also use the following notation:
$$J(\theta) = \frac{1}{n}\sum_{i=1}^nL(y^{(i)}, f_{\theta}(x^{(i)})) + \lambda\cdot R(\theta)$$


### **L2 Regularization: Definition**
How can we define a regularizer $R: \mathcal{M} \to \mathbb{R}$ to control the complexity of a model $f \in \mathcal{M}$?

In the context of linear models $f(x) = \theta^\top x$, a widely used approach is L2 regularization, which defines the following objective:
$$J(\theta) = \frac{1}{n}\sum_{i=1}^nL(y^{(i)}, \theta^\top x^{(i)}) + \frac{\lambda}{2}\cdot||\theta||_2^2. $$

Let's dissect the components of this objective:

*   The regularizer $R:\mathcal{M} \to \mathbb{R}$ is the function $R(\theta) = ||\theta||_2^2 = \sum_{j=1}^d\theta_j^2$. This is the squared L2 norm of $\theta$.
*   The regularizer penalizes large parameters. This prevents us from over-relying on any single feature and penalizes wildly irregular solutions.
*   L2 regularization can be used with most models (linear, neural, etc.).
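
As a small sketch, the L2-regularized objective for a linear model can be written in a few lines of numpy. Here the loss $L$ is assumed to be the squared error, and the function name `l2_regularized_objective` and the toy dataset are hypothetical.

```python
import numpy as np

def l2_regularized_objective(theta, X, y, lam):
    """Mean squared error plus (lam / 2) * ||theta||_2^2 for a linear model."""
    residuals = X @ theta - y
    data_term = np.mean(residuals ** 2)
    penalty = 0.5 * lam * np.sum(theta ** 2)
    return data_term + penalty

# Tiny hypothetical dataset in which y = 2 * x exactly.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])

# A perfect fit with a small weight versus a poor fit with a large weight.
small = l2_regularized_objective(np.array([2.0]), X, y, lam=0.1)
large = l2_regularized_objective(np.array([20.0]), X, y, lam=0.1)
print(small, large)
```

Even the perfectly fitting weight pays a small penalty, which is what pulls the minimizer of $J$ slightly toward zero compared with unregularized least squares.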





### **L2 Regularization for Polynomial Regression**
Let's consider an application to the polynomial model we have seen so far. Given the polynomial features $\phi(x)$, we optimize the following objective:
$$J(\theta) = \frac{1}{2n}\sum_{i=1}^n\left(y^{(i)} - \theta^\top \phi(x^{(i)})\right)^2 + \frac{\lambda}{2}||\theta||^2_2.$$

We are going to implement regularized and standard polynomial regression on three random training sets sampled from the same distribution.

In [None]:
from sklearn.linear_model import Ridge

degrees = [15,15,15]
plt.figure(figsize=(14,5))
for idx,i in enumerate(range(len(degrees))):
  # sample a dataset
  np.random.seed(idx)
  n_samples = 30
  X = np.sort(np.random.rand(n_samples))
  y = true_fn(X) + np.random.rand(n_samples)*0.1

  # Fit a least square model
  polynomial_features = PolynomialFeatures(degree=degrees[i], include_bias=False)
  linear_regression = LinearRegression()
  pipeline = Pipeline([("pf", polynomial_features), ("lr", linear_regression)])
  pipeline.fit(X[:,np.newaxis], y)

  # Fit a Ridge model
  polynomial_features = PolynomialFeatures(degree=degrees[i], include_bias=False)
  linear_regression = Ridge(alpha=0.1)  # sklearn calls the regularization parameter alpha instead of lambda
  pipeline2 = Pipeline([("pf", polynomial_features), ("lr", linear_regression)])
  pipeline2.fit(X[:,np.newaxis], y)

  # Visualize the results
  ax = plt.subplot(1, len(degrees), i+1)
  #ax.plot(X_test, true_fn(X_test), label='True Function')
  ax.plot(X_test, pipeline.predict(X_test[:,np.newaxis]), label='No Regularization')
  ax.plot(X_test, pipeline2.predict(X_test[:,np.newaxis]), label = "L2 Regularization")
  ax.scatter(X,y, edgecolor='b', s=20, label='Samples')
  ax.set_xlim([0,1])
  ax.set_ylim((-2, 2))
  ax.legend(loc='best')
  ax.set_title("Dataset sample #{}".format(idx))




We can show that by using small weights, we prevent the model from learning irregular functions.

In [None]:
print('Non-regularized weights of the polynomial model need to be large to fit every point:')
print(pipeline.named_steps['lr'].coef_[:4])
print()

print('By regularizing the weights to be small, we force the curve to be more regular:')
print(pipeline2.named_steps['lr'].coef_[:4])

### **How to choose $\lambda$?**
In brief, the most common approach is to choose the value of $\lambda$ that results in the best performance on a held-out validation set.

We will see this strategy and several others in more detail.
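
The selection procedure can be sketched as a simple loop over candidate values. The example below is one possible setup, not the course's reference implementation: it fits degree-15 polynomial ridge models with the closed-form solution derived in the next section, regularizes all weights including the constant term, and uses a made-up grid `candidates`.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    return np.cos(1.5 * np.pi * x)

def sample(n):
    x = rng.random(n)
    return x, true_fn(x) + rng.random(n) * 0.1

def features(x, degree=15):
    return np.vander(x, N=degree + 1, increasing=True)

def fit_ridge(Phi, y, lam):
    # Closed-form minimizer of 0.5 * ||Phi @ theta - y||^2 + 0.5 * lam * ||theta||^2.
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)

x_train, y_train = sample(30)
x_val, y_val = sample(30)  # held-out validation set, never used for fitting

candidates = [1e-6, 1e-4, 1e-2, 1.0, 100.0]  # hypothetical grid of lambda values
val_mse = {}
for lam in candidates:
    theta = fit_ridge(features(x_train), y_train, lam)
    val_mse[lam] = float(np.mean((features(x_val) @ theta - y_val) ** 2))

best_lam = min(val_mse, key=val_mse.get)
print(val_mse)
print("selected lambda:", best_lam)
```

The grid is logarithmic because useful values of $\lambda$ often span several orders of magnitude.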

### **Normal Equations for Regularized Models**
How do we fit regularized models? In the linear case, we can do this easily by using generalized normal equations.\
Let $L(\theta) = \frac{1}{2}(X\theta - y)^\top(X\theta - y)$ be our least squares objective. We can write the Ridge objective as:
$$J(\theta) = \frac{1}{2}(X\theta - y)^\top(X\theta - y) + \frac{1}{2}\lambda||\theta||^2_2$$

This allows us to derive the gradient as follows:

$$\begin{align*}
\nabla_{\theta}J(\theta) &= \nabla_{\theta}\left(\frac{1}{2}(X\theta - y)^\top(X\theta - y) + \frac{1}{2}\lambda||\theta||^2_2 \right) \\
&= \nabla_{\theta}(L(\theta) + \frac{1}{2}\lambda||\theta||^2_2)\\
&= (X^\top X)\theta - X^\top y + \lambda\theta\\
&= (X^\top X + \lambda I)\theta - X^\top y
\end{align*}$$

We used the derivation of the normal equations for least squares to obtain $\nabla_{\theta}L(\theta)$ as well as the fact that $\nabla_x x^\top x = 2x$.

We can set the gradient to zero to obtain the normal equations for the Ridge model:
$$(X^\top X + \lambda I)\theta = X^\top y.$$

Hence, the value $\theta^*$ that minimizes the objective is given by:
$$\theta^* = (X^\top X + \lambda I)^{-1}X^\top y.$$

Note that the matrix $(X^\top X + \lambda I)$ is always invertible when $\lambda > 0$, which addresses the problems with least squares that we saw earlier (where $X^\top X$ may not be invertible).
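
The normal equations translate almost directly into numpy. The sketch below is a minimal illustration on made-up data. It deliberately duplicates a feature column so that $X^\top X$ is singular, and then checks that the Ridge system can still be solved. The name `ridge_fit` is a helper introduced here.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) theta = X^T y for theta."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X = np.hstack([X, X[:, :1]])  # duplicate the first column, so X^T X is singular
y = X @ np.array([1.0, -2.0, 0.5, 1.0]) + 0.1 * rng.normal(size=50)

lam = 0.1
theta = ridge_fit(X, y, lam)

# The gradient (X^T X + lam * I) theta - X^T y should vanish at the solution.
gradient = (X.T @ X + lam * np.eye(X.shape[1])) @ theta - X.T @ y

print("rank of X:", np.linalg.matrix_rank(X))  # below the number of columns
print("theta:", np.round(theta, 4))
print("max |gradient|:", np.abs(gradient).max())
```

Using `np.linalg.solve` instead of forming the inverse explicitly is a common numerical choice. The two duplicated columns end up sharing the weight equally.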

### **Algorithm: Ridge Model**

*   **Type**: Supervised Learning (Regression)
*   **Model Family**: Linear models
*   **Objective function**: L2-regularized mean square error
*   **Optimizer**: Normal Equations

## **Part 5: Regularization and Sparsity**
We will now look at another form of regularization, which has an important new property called **sparsity**.

### **Regularization: Definition**
The idea of regularization is to train models with an augmented objective $J : \mathcal{M} \to \mathbb{R}$ defined over a training set $\mathcal{D}$ of size $n$ as:
$$J(f) = \frac{1}{n}\sum_{i=1}^nL(y^{(i)}, f(x^{(i)})) + \lambda R(f)$$

Let's dissect the components of this objective:


*   A loss function $L(y^{(i)}, f(x^{(i)}))$ such as the mean squared error.
*   A regularizer $R: \mathcal{M} \to \mathbb{R}$ that penalizes models that are overly complex.



### **L1 Regularization: Definition**
Another closely related approach to regularization is to penalize the size of weights using the L1 norm.\
In the context of linear models $f(x) = \theta^\top x$, L1 regularization yields the following objective:
$$J(\theta) = \frac{1}{n}\sum_{i=1}^nL(y^{(i)}, \theta^\top x^{(i)}) + \lambda ||\theta||_1.$$

Let's dissect the components of this objective:


*   The regularizer $R:\mathcal{M} \to \mathbb{R}$ is the function $R(\theta) = ||\theta||_1 = \sum_{j=1}^d|\theta_j|$. This is also known as the L1 norm of $\theta$.
*   The regularizer also penalizes large weights. In addition, it forces more weights to decay to exactly zero, as opposed to just being small.


### **Algorithm: Lasso**
L1-regularized linear regression is also known as Lasso (least absolute shrinkage and selection operator).


*   **Type**: Supervised learning (Regression)
*   **Model family**: Linear models
*   **Objective function**: L1-regularized mean squared error
*   **Optimizer**: Gradient descent, coordinate descent, least angle regression (LARS), and others.
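
To illustrate one of the optimizers listed above, here is a bare-bones coordinate descent sketch for Lasso. It assumes the objective $\frac{1}{2n}||X\theta - y||_2^2 + \lambda||\theta||_1$ and a fixed number of sweeps. The data, the value `lam=0.5`, and the function names are made up for the example, and production solvers add convergence checks and other refinements.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold`, returning 0 inside the band."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iters=200):
    """Minimize (1 / (2n)) * ||X @ theta - y||^2 + lam * ||theta||_1."""
    n, d = X.shape
    theta = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iters):
        for j in range(d):
            # Residual with feature j's current contribution removed.
            partial_residual = y - X @ theta + X[:, j] * theta[j]
            rho = X[:, j] @ partial_residual / n
            # Exact minimizer over theta[j] with all other coordinates held fixed.
            theta[j] = soft_threshold(rho, lam) / col_sq[j]
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_theta = np.array([3.0, 0.0, 0.0, -2.0, 0.0])  # only two features matter
y = X @ true_theta + 0.1 * rng.normal(size=100)

theta = lasso_coordinate_descent(X, y, lam=0.5)
print(np.round(theta, 3))
```

The soft-thresholding step is what sets weights to exactly zero: any coordinate whose correlation with the residual falls below $\lambda$ is clipped to 0 rather than merely shrunk.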



### **Regularizing via Constraints**
Consider a regularized problem with a penalty term:
$$\min_{\theta \in \Theta}L(\theta) + \lambda R(\theta).$$
We may also enforce an explicit constraint on the complexity of the model:
$$\begin{align*}
\min_{\theta \in \Theta} \; &L(\theta) \\
\text{such that} \; &R(\theta) \leq \lambda'
\end{align*}$$
We will not prove this, but solving this problem is equivalent to solving the penalized problem for some $\lambda > 0$ that is different from $\lambda'$.

In other words,


*   We can regularize by explicitly enforcing $R(\theta)$ to be less than a value instead of penalizing it.
*   For each value of $\lambda$, we are implicitly setting a constraint on $R(\theta)$.



### **Example:**
This is what it looks like for a linear model:
$$\begin{align*}
\min_{\theta \in \Theta} \; &\frac{1}{2n}\sum_{i=1}^n(y^{(i)} - \theta^\top x^{(i)})^2 \\
\text{such that} \; &||\theta|| \leq \lambda'
\end{align*}$$

where $||\cdot||$ can either be the L1 or L2 norm.
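
The link between the penalty and the constraint can be observed numerically: as $\lambda$ grows, the norm of the penalized solution shrinks, so each $\lambda$ corresponds to some implicit bound on $||\theta||$. The sketch below checks this for the L2 case using the closed-form Ridge solution. The data and the grid `lambdas` are illustrative assumptions.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form Ridge solution (X^T X + lam * I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=40)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
norms = [float(np.linalg.norm(ridge_fit(X, y, lam))) for lam in lambdas]

for lam, norm in zip(lambdas, norms):
    print(f"lambda = {lam:6.2f}  ->  ||theta||_2 = {norm:.4f}")
```

Reading the table in reverse gives the constrained view: asking for a smaller bound on $||\theta||_2$ corresponds to a larger penalty weight.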

### **L1 vs. L2 Regularization**
The following image by David Kapil and Hastie et al. explains the difference between the two norms.

<left><img width=75% src="/content/l1-vs-l2-annotated.png"></left>


### **Sparsity: Definition**
A vector is said to be sparse if a large fraction of its entries is zero.\
L1-regularized linear regression produces *sparse weights.*

*   This makes the model more interpretable.
*   It also makes it computationally more tractable in very large dimensions.


### **Sparsity: Ridge Model**
To better understand sparsity, we will fit L2-regularized linear models to the UCI Diabetes Dataset and observe the magnitude of each weight (colored lines) as a function of the regularization parameter.

In [None]:
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from matplotlib import pyplot as plt

X,y = load_diabetes(return_X_y=True)

# Create Ridge coefficient
alphas = np.logspace(-5,2,200)
ridge_coefs = []
for a in alphas:
  ridge = Ridge(alpha = a, fit_intercept=False)
  ridge.fit(X,y)
  ridge_coefs.append(ridge.coef_)

# Plot Ridge Coefficients
plt.figure(figsize=(14,5))
plt.plot(alphas, ridge_coefs)
plt.xscale('log')
plt.xlabel('Regularization parameter (lambda)')
plt.ylabel('Magnitude of model parameters')
plt.title('Ridge coefficients as a function of the regularization')
plt.axis('tight')

### **Sparsity: Lasso**
The above Ridge model does not produce sparse weights. Let's compare it to the Lasso model.

In [None]:
import warnings
warnings.filterwarnings('ignore')
from sklearn.datasets import load_diabetes
from sklearn.linear_model import lars_path

# Create Lasso coefficients
_, _, lasso_coefs = lars_path(X,y,method='lasso')
xx = np.sum(np.abs(lasso_coefs.T), axis=1)

# Plot the Ridge coefficients
plt.figure(figsize=(14, 5))
plt.subplot(121)    
plt.plot(alphas, ridge_coefs)
plt.xscale('log')
plt.xlabel('Regularization parameter (alpha)')
plt.ylabel('Magnitude of model parameters')
plt.title('Ridge coefficients as a function of the regularization')
plt.axis('tight')

# Plot the Lasso coefficients
plt.subplot(122)
plt.plot(3500 - xx, lasso_coefs.T)
# The x-axis is 3500 minus the L1 norm of the weights, so it grows as regularization gets stronger
plt.xlabel('Regularization strength (3500 - L1 norm of the weights)')
plt.ylabel('Magnitude of model parameters')
plt.title('Lasso Path')
plt.axis('tight')