### Introduction  

Now that you've seen some basic linear regression models, it's time to discuss how to tune them. As you saw, we usually begin with an error (or loss) function and then apply an optimization algorithm, such as gradient descent, to minimize it. And voila, we have an optimized solution! Unfortunately, things aren't quite that simple. 

### Overfitting and Underfitting
The most important issue is generalization.
This is often examined in terms of underfitting and overfitting.
![](./images/overfit_underfit.png)

Recall our main goal when performing regression: we're attempting to find relationships that generalize to new cases. Generally, the more data we have, the better off we'll be, since we can observe more patterns and relationships within that data. However, some of these patterns and relationships may be specific to the sample we happened to collect and may not generalize well to other cases. 


### Systems of Equations
If you recall from your earlier life as an algebra student:

$2x + 10 = 18$ has a unique solution ($x = 4$): one variable, one equation, one solution.

Similarly, two variables with two equations have one solution*   
$x+y=4$  
$2x+y=10$

However, if we allow two variables with only one equation, we can have infinitely many solutions.  
$x+y=4$

*(An inconsistent system, such as $x+y=4$ and $2x+2y=10$, has no solution, and a system where the second equation is a multiple of the first has infinitely many solutions.)
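
The three cases above can be checked numerically. The following is a minimal sketch using NumPy; the specific coefficient values are illustrative choices, not part of the lesson data.

```python
import numpy as np

# Two equations, two unknowns: x + y = 4 and 2x + y = 10 (illustrative values)
A = np.array([[1.0, 1.0], [2.0, 1.0]])
b = np.array([4.0, 10.0])
unique_solution = np.linalg.solve(A, b)  # exactly one (x, y) pair satisfies both

# One equation, two unknowns: x + y = 4 has infinitely many solutions.
A_under = np.array([[1.0, 1.0]])
b_under = np.array([4.0])
# lstsq picks the minimum-norm solution out of the infinite set
min_norm_solution, _, rank_under, _ = np.linalg.lstsq(A_under, b_under, rcond=None)
# Adding any multiple of (1, -1) gives another valid solution
other_solution = min_norm_solution + 3.0 * np.array([1.0, -1.0])

# Inconsistent system: x + y = 4 and 2x + 2y = 10 contradict each other.
A_bad = np.array([[1.0, 1.0], [2.0, 2.0]])
b_bad = np.array([4.0, 10.0])
rank_coeffs = np.linalg.matrix_rank(A_bad)
rank_augmented = np.linalg.matrix_rank(np.column_stack([A_bad, b_bad]))
# A system has no solution when the augmented matrix has higher rank
is_inconsistent = rank_augmented > rank_coeffs

print("unique:", unique_solution)
print("two of many:", min_norm_solution, other_solution)
print("inconsistent:", is_inconsistent)
```

Comparing the rank of the coefficient matrix with the rank of the augmented matrix is one standard way to tell these cases apart without solving the system by hand.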

### Fundamental Theorem of Algebra
http://mathworld.wolfram.com/FundamentalTheoremofAlgebra.html

This sets us up for a useful counting argument (a loose analogy to the fundamental theorem of algebra rather than a literal restatement of it).
If we allow ourselves more variables than we have equations (more parameters than observations), then we can have infinitely many solutions.

If I have 30 observed data points with distinct $x$ values, I can make a model that passes exactly through all 30 points using a polynomial of degree 29, which has 30 coefficients. But such a complex model may not generalize well.
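
To make this concrete, here is a minimal sketch on a smaller, hypothetical dataset: 8 noisy points drawn from a straight line (the line, noise level, and seed are assumptions chosen for illustration). A degree-7 polynomial has 8 coefficients, so it can pass through all 8 points, while a straight line cannot.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Hypothetical training data: a noisy straight line (assumed for illustration)
x_train = np.linspace(0, 1, 8)
y_train = 2 * x_train + 1 + rng.normal(scale=0.2, size=x_train.size)

# Degree 1: two coefficients, cannot pass through every noisy point
line = Polynomial.fit(x_train, y_train, deg=1)
# Degree 7: eight coefficients, enough to interpolate all eight points
wiggly = Polynomial.fit(x_train, y_train, deg=7)

def mse(model, x, y):
    """Mean squared error of a fitted polynomial on the points (x, y)."""
    return np.mean((model(x) - y) ** 2)

train_mse_line = mse(line, x_train, y_train)
train_mse_wiggly = mse(wiggly, x_train, y_train)

# New points from the same underlying line, unseen during fitting
x_new = np.linspace(0.05, 0.95, 50)
y_new = 2 * x_new + 1
new_mse_line = mse(line, x_new, y_new)
new_mse_wiggly = mse(wiggly, x_new, y_new)

print(f"train MSE  line: {train_mse_line:.4f}  degree 7: {train_mse_wiggly:.2e}")
print(f"new   MSE  line: {new_mse_line:.4f}  degree 7: {new_mse_wiggly:.4f}")
```

The degree-7 fit drives the training error to essentially zero, but that alone says nothing about how it behaves between the training points. Comparing the two printed rows is one way to see whether the extra flexibility actually helped on unseen data.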


### Regularization

So with all of this, what do we do? What we've seen is that an error function alone may not be sufficient. Yes, we want good performance, but we also want our model to hold up for future cases that may not be within the data we have available. A common method of regularizing models is therefore to penalize model complexity, such as the number or size of the terms allowed within the model. This also captures the intuition behind Occam's razor: when two explanations produce comparable results, we should choose the simpler one.


## $L^p$ norm of x
In order to help account for underfitting and overfitting, we often use what are called $L^p$ norms.   
The **$L^p$ norm of x** is defined as:  

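### $||x||_p  =  \big(\sum_{i} |x_i|^p\big)^\frac{1}{p}$

For $p = 1$ this is the sum of the absolute values of the entries; for $p = 2$ it is the familiar Euclidean length.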

In [None]:
# Calculate the L1 and L2 Norms
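
One possible way to fill in this cell is sketched below. The helper name `lp_norm` and the example vector are hypothetical choices for illustration; `np.linalg.norm` is NumPy's built-in equivalent.

```python
import numpy as np

def lp_norm(x, p):
    """L^p norm: (sum_i |x_i|^p)^(1/p), intended for p >= 1."""
    x = np.asarray(x, dtype=float)
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

x = np.array([3.0, -4.0])  # example vector, chosen for illustration

l1 = lp_norm(x, 1)  # |3| + |-4|
l2 = lp_norm(x, 2)  # sqrt(3^2 + (-4)^2)

# NumPy's built-in norm, for comparison
l1_builtin = np.linalg.norm(x, ord=1)
l2_builtin = np.linalg.norm(x, ord=2)

print(l1, l1_builtin)
print(l2, l2_builtin)
```

Note the absolute value inside the sum: without it, negative entries would cancel positive ones for odd $p$, and the result would no longer measure size.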

## Ridge (L2)
One common form of regularization is called Ridge Regression and uses the square of the $L^2$ norm (also known as the Euclidean norm) as defined above.   
The ridge coefficients minimize a penalized residual sum of squares:    
    $ \sum_i(\hat{y}_i-y_i)^2 + \lambda \sum_j w_j^2$

Write this loss function for performing ridge regression.

In [None]:
#Your function goes here
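
One possible sketch of such a function follows. It assumes predictions of the form $\hat{y} = Xw$ with no separate intercept term; the function name, argument names, and the tiny dataset are hypothetical.

```python
import numpy as np

def ridge_loss(X, y, w, lam):
    """Residual sum of squares plus lam times the sum of squared weights.

    Assumes predictions are y_hat = X @ w (no separate intercept).
    """
    y_hat = X @ w
    rss = np.sum((y_hat - y) ** 2)
    penalty = lam * np.sum(w ** 2)
    return rss + penalty

# Tiny made-up example where the weights fit the data exactly
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([5.0, 11.0])
w = np.array([1.0, 2.0])  # X @ w equals y, so the RSS term is 0

loss_no_penalty = ridge_loss(X, y, w, lam=0.0)
loss_with_penalty = ridge_loss(X, y, w, lam=0.5)  # only the penalty 0.5 * (1 + 4) remains
print(loss_no_penalty, loss_with_penalty)
```

Even with a perfect fit, the penalized loss is nonzero: the penalty keeps pulling the weights toward zero, and $\lambda$ controls how strongly.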

## Lasso (L1)
Another common form of regularization is called Lasso Regression and uses the $L^1$ norm.   
The lasso coefficients minimize a penalized residual sum of squares:    
    $ \sum_i(\hat{y}_i-y_i)^2 + \lambda \sum_j |w_j|$

Write this loss function for performing lasso regression.
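
As with ridge, one possible sketch is shown here, under the same assumption that predictions are $\hat{y} = Xw$ with no intercept. The names and the small dataset are hypothetical.

```python
import numpy as np

def lasso_loss(X, y, w, lam):
    """Residual sum of squares plus lam times the sum of absolute weights.

    Assumes predictions are y_hat = X @ w (no separate intercept).
    """
    y_hat = X @ w
    rss = np.sum((y_hat - y) ** 2)
    penalty = lam * np.sum(np.abs(w))
    return rss + penalty

# Tiny made-up example with one negative weight
X = np.array([[1.0, 2.0], [3.0, 4.0]])
w = np.array([-1.0, 2.0])
y = np.array([3.0, 5.0])  # equals X @ w, so the RSS term is 0

loss_with_penalty = lasso_loss(X, y, w, lam=0.5)  # penalty 0.5 * (|-1| + |2|)
print(loss_with_penalty)
```

The only difference from the ridge sketch is the penalty term: absolute values instead of squares.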

In [None]:
#Your function goes here

Elastic Net: a weighted combination of the L1 and L2 penalties
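
A minimal sketch of an elastic net style loss is given below. It simply adds both penalties with separate weights `lam1` and `lam2`; these names are hypothetical, and libraries often parameterize the mix differently (for example with an overall strength and a mixing ratio), so treat this as one illustrative form rather than a standard definition.

```python
import numpy as np

def elastic_net_loss(X, y, w, lam1, lam2):
    """RSS plus an L1 penalty weighted by lam1 and a squared-L2 penalty weighted by lam2.

    Assumes predictions are y_hat = X @ w (no separate intercept).
    """
    y_hat = X @ w
    rss = np.sum((y_hat - y) ** 2)
    l1_penalty = lam1 * np.sum(np.abs(w))
    l2_penalty = lam2 * np.sum(w ** 2)
    return rss + l1_penalty + l2_penalty

# Tiny made-up example where the RSS term is 0
X = np.array([[1.0, 2.0], [3.0, 4.0]])
w = np.array([-1.0, 2.0])
y = np.array([3.0, 5.0])  # equals X @ w

combined = elastic_net_loss(X, y, w, lam1=0.5, lam2=0.5)
print(combined)  # 0.5 * (1 + 2) + 0.5 * (1 + 4)
```

Setting `lam2=0` recovers the lasso penalty and setting `lam1=0` recovers the ridge penalty, which is why elastic net is often described as sitting between the two.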

### Bias Variance Tradeoffs

### Choosing $\lambda$