# Regression

The goal of this chapter is to take a tour of some regression algorithms, build up an understanding of how they differ from one another and when to use them, and show how their parameters work.

We will journey through:
* linear regression (with gradient descent) 
* polynomial regression (with a Bayesian variant)
* regression trees
* isotonic detour (broken stick) 
* logit and softmax

## Overview

When we look at our data, our minds start to seek out patterns. Even before we apply any of our statistics knowledge, we look for shapes and repeating figures. Augmenting our senses with mathematical analysis and inference tools, so that we can observe even more patterns in the world around us, is one of the many joys of data science.

## Linear regression

Each regression has a prediction function that defines how it calculates the line of best fit, as well as a cost function that measures how far that line falls from the data. Let's get comfortable with these two concepts so we can watch them change with each regression.

There are two ways we can write a regression prediction function:

$\hat{y} = \theta_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} + \cdots + \theta_{n}x_{n}$

$\hat{y} = \theta^T\cdot\mathbf{x}$

The first shows us that there is an initial bias term $\theta_{0}$, which serves as the intercept of our function, and then a series of weight parameters, each multiplied by one of our input features.

The second equation shows the same thing in vector form. The parameters have been bundled up into a column vector $\theta$, and the transpose $\theta^T$ flips that column on its side into a row. The same bundling has been done to all the x's, which sit in their own column vector $\mathbf{x}$, with an extra $x_{0} = 1$ added so that $\theta_{0}$ still acts as the intercept. The little dot between them is a dot product, which, when performed on these two vectors, yields what we see in the first equation. For a fabulous review of linear algebra notation, check out: https://github.com/ageron/handson-ml/blob/master/math_linear_algebra.ipynb
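To see that the two forms agree, here is a minimal sketch in plain Python. The parameter values and feature values are made up for illustration, and the function names are my own. The second function assumes an extra $x_{0} = 1$ is prepended to the features so that $\theta_{0}$ survives the dot product as the intercept.

```python
# Hypothetical parameters and a single instance; the numbers are made up.
theta = [4.0, 3.0, -2.0]   # [theta_0 (bias), theta_1, theta_2]
features = [1.5, 0.5]      # [x_1, x_2]


def predict_expanded(theta, features):
    """First form: theta_0 + theta_1*x_1 + ... + theta_n*x_n."""
    y_hat = theta[0]
    for weight, value in zip(theta[1:], features):
        y_hat += weight * value
    return y_hat


def predict_dot(theta, features):
    """Second form: dot product of theta with x, where x_0 = 1."""
    x = [1.0] + list(features)  # prepend x_0 = 1 to pair with theta_0
    return sum(t * v for t, v in zip(theta, x))


y_expanded = predict_expanded(theta, features)
y_dot = predict_dot(theta, features)
print(y_expanded, y_dot)  # both forms give the same prediction
```

In practice you would likely reach for a numerical library for the dot product; plain lists are used here only to keep every step visible.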

The goal is to find the weights for each feature that will yield the most accurate target output. To find these ideal parameter weights, we must use some kind of cost function, and the convention is to use Mean Squared Error (MSE):

$MSE(\theta) = \frac{1}{m}\sum_{i=1}^m(\theta^T\cdot\mathbf{x}^{(i)} - y^{(i)})^2$

The part in the parentheses after the $\sum$ should look familiar. It is the difference between our predicted value $\hat{y}$ and the actual value $y$ for the $i$-th instance. Square these differences, sum them across all $m$ instances, and divide by $m$, and you've got your MSE.
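The recipe above (predict, subtract, square, sum, divide by $m$) can be sketched in a few lines of plain Python. The tiny dataset and the parameter values are invented for illustration, and `predict` and `mse` are names I have chosen, not part of any library.

```python
def predict(theta, features):
    """Dot product of theta with x, where x_0 = 1 is prepended for the bias."""
    x = [1.0] + list(features)
    return sum(t * v for t, v in zip(theta, x))


def mse(theta, X, y):
    """Mean of the squared differences between predictions and targets."""
    m = len(X)
    squared_errors = [(predict(theta, x_i) - y_i) ** 2 for x_i, y_i in zip(X, y)]
    return sum(squared_errors) / m


# Made-up data: three instances with one feature each.
X = [[1.0], [2.0], [3.0]]
y = [3.0, 5.0, 8.0]

theta = [1.0, 2.0]  # candidate line: y_hat = 1 + 2x
mse_value = mse(theta, X, y)
print(mse_value)  # the first two points are fit exactly; the third is off by 1
```

Only the third instance contributes an error (a difference of 1), so the MSE here is one third.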

### Gradient Descent

Are there programmatic ways to optimize our parameters?
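One answer is batch gradient descent: start from some initial parameters, compute the gradient of the MSE with respect to each parameter, and step a little way in the opposite direction, over and over. For the MSE, the gradient for parameter $j$ is $\frac{2}{m}\sum_{i=1}^m(\hat{y}^{(i)} - y^{(i)})x_j^{(i)}$. The sketch below is one minimal way to write this in plain Python. The data, the learning rate, and the iteration count are all assumptions picked so this toy problem converges; they are not recommended defaults.

```python
def gradient_descent(X, y, learning_rate=0.1, n_iterations=2000):
    """Batch gradient descent on the MSE cost for linear regression."""
    m = len(X)
    rows = [[1.0] + list(x_i) for x_i in X]  # prepend x_0 = 1 for the bias
    theta = [0.0] * len(rows[0])             # start with all parameters at zero

    for _ in range(n_iterations):
        # Prediction error for every instance under the current theta.
        errors = [
            sum(t * v for t, v in zip(theta, row)) - y_i
            for row, y_i in zip(rows, y)
        ]
        # Partial derivative of the MSE with respect to each parameter.
        gradient = [
            (2.0 / m) * sum(e * row[j] for e, row in zip(errors, rows))
            for j in range(len(theta))
        ]
        # Step against the gradient.
        theta = [t - learning_rate * g for t, g in zip(theta, gradient)]

    return theta


# Made-up data generated from y = 1 + 2x with no noise.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]

theta = gradient_descent(X, y)
print(theta)  # should land very close to [1.0, 2.0]
```

The learning rate matters: too small and convergence is slow, too large and the updates overshoot and diverge. On this dataset a rate of 0.1 is comfortably in the stable range.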

### Regularized linear regression (ridge, lasso, elastic net)

There are three regularized versions of linear regression that I want to mention in passing, as they will come in handy when trying to reduce overfitting of your data. Basically, if you add a penalty on the size of the weights to your cost function, the weights can only get so large before the penalty outweighs the improvement in fit, and you increase your likelihood of building a model that is more useful in novel situations. The three versions differ in how they measure the size of the weights.

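Ridge regression adds a regularization term proportional to the sum of the squared weights (an L2 penalty), which shrinks all of the weights toward zero.

Lasso regression adds a different regularization term, proportional to the sum of the absolute values of the weights (an L1 penalty), which tends to push the weights of the least useful features all the way to zero.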

Elastic Net adds a weighted mix of the Ridge and Lasso terms, with a mixing ratio that controls how much of each is applied.
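As a rough sketch of how the three penalty terms compare, the plain-Python functions below compute each one for a made-up set of weights. The names `alpha` (penalty strength) and `l1_ratio` (mixing ratio) are my own choices, and the exact scaling constants vary between textbooks and libraries, so treat this as one possible convention. In each case the penalty would be added to the MSE, and the bias term $\theta_{0}$ is left out of the penalty.

```python
def ridge_penalty(theta, alpha):
    """L2 penalty: alpha times the sum of squared weights (bias excluded)."""
    return alpha * sum(w ** 2 for w in theta[1:])


def lasso_penalty(theta, alpha):
    """L1 penalty: alpha times the sum of absolute weights (bias excluded)."""
    return alpha * sum(abs(w) for w in theta[1:])


def elastic_net_penalty(theta, alpha, l1_ratio):
    """Weighted mix: l1_ratio of the L1 penalty plus the rest as L2."""
    return (l1_ratio * lasso_penalty(theta, alpha)
            + (1.0 - l1_ratio) * ridge_penalty(theta, alpha))


theta = [4.0, 3.0, -2.0]  # made-up parameters: [bias, weight_1, weight_2]

ridge = ridge_penalty(theta, alpha=0.5)
lasso = lasso_penalty(theta, alpha=0.5)
elastic = elastic_net_penalty(theta, alpha=0.5, l1_ratio=0.5)
print(ridge, lasso, elastic)
```

With `l1_ratio` set to 1 the Elastic Net penalty reduces to the Lasso penalty, and with it set to 0 it reduces to the Ridge penalty.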

### Example

Ada Lovelace is having a rough day. It's the middle of the summer, and even though the air is hot and humid, she's hacking away at a mathematical proof fully decked out in the attire of her time, which requires her to wear layer upon layer of impractical clothes. There's a fly buzzing around the _______

## Polynomial regression

Most of the world's phenomena cannot be explained by straight lines, and that's where polynomial regression comes in.
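One common way to think about polynomial regression is as linear regression on an expanded set of features: each input $x$ is turned into $x, x^2, \ldots, x^d$, and the model stays linear in its parameters. The plain-Python sketch below shows only that expansion step with a hand-picked, hypothetical set of parameters; `polynomial_features` and `predict` are illustrative names, and no fitting is performed here.

```python
def polynomial_features(x, degree):
    """Expand a single input x into [x, x^2, ..., x^degree]."""
    return [x ** d for d in range(1, degree + 1)]


def predict(theta, features):
    """Same linear prediction as before: dot product with x_0 = 1 prepended."""
    x = [1.0] + list(features)
    return sum(t * v for t, v in zip(theta, x))


# Hypothetical parameters describing the curve y_hat = 1 + 0*x + 2*x^2.
theta = [1.0, 0.0, 2.0]

xs = [-2.0, 0.0, 3.0]
predictions = [predict(theta, polynomial_features(x, degree=2)) for x in xs]
print(predictions)  # a curved relationship from a model linear in theta
```

Because the model is still linear in $\theta$, the same cost function and optimization approach used for linear regression apply to the expanded features.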

## Bayesian polynomial regression

I find polynomial regressions to be useful, and I like them even better when they take on a Bayesian flavor.