<h1 style="font-size:42px; text-align:center; margin-bottom:30px;"><span style="color:SteelBlue">Module 4:</span> Regression Algorithms</h1>
<hr>

Welcome to the workbook for <span style="color:royalblue">Module 4: Regression Algorithms</span>!

We'll introduce two powerful mechanisms in modern algorithms: regularization and ensembles. As you'll see, these mechanisms "fix" some fatal flaws in older methods, which has led to their popularity.

Let's get started!


y = mx + c

- y = dependent variable (the value we want to predict)
- x = independent variable (the input feature)
- m, c = constants/coefficients (m is the slope, c is the intercept)

y = m1x1 + m2x2 + m3x3 + m4x4 + m5x5 + ... + c (for Multiple Linear Regression)

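To make the two equations above concrete, here is a minimal sketch in plain Python. The function name `predict` and all of the numbers are hypothetical, chosen only to show how the coefficients and the constant combine into a prediction.

```python
def predict(features, coefficients, intercept):
    """Return y = m1*x1 + m2*x2 + ... + c for a single observation."""
    return sum(m * x for m, x in zip(coefficients, features)) + intercept

# Simple linear regression: one feature, y = m*x + c
y_simple = predict([4.0], [2.0], 1.0)  # 2*4 + 1

# Multiple linear regression: three features
y_multiple = predict([1.0, 2.0, 3.0], [0.5, -1.0, 2.0], 10.0)  # 0.5 - 2 + 6 + 10

print(y_simple, y_multiple)
```

Simple linear regression is just the special case where the feature list has length one.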

<br><hr id="toc">

### How to Pick ML Algorithms


In this lesson, we'll introduce 5 very effective machine learning algorithms for regression tasks. They each have classification counterparts as well.

And yes, just 5 for now. Instead of giving you a long list of algorithms, our goal is to explain a few essential concepts (e.g. regularization, ensembling, automatic feature selection) that will teach you why some algorithms tend to perform better than others.

In applied machine learning, individual algorithms should be swapped in and out depending on which performs best for the problem and the dataset. Therefore, we will focus on intuition and practical benefits over math and theory.

## Why Linear Regression is Flawed

To introduce the reasoning for some of the advanced algorithms, let's start by discussing basic linear regression. Linear regression models are very common, yet deeply flawed.

Simple linear regression models fit a "straight line" (technically a hyperplane when there is more than one feature, but it's the same idea). In practice, they rarely perform well. We actually recommend skipping them for most machine learning problems.

Their main advantage is that they are easy to interpret and understand. However, our goal is not to study the data and write a research report. Our goal is to build a model that can make accurate predictions.

In this regard, simple linear regression suffers from two major flaws:

1. It's prone to overfitting when there are many input features.
2. It cannot easily express non-linear relationships.


![220px-Linear_regression.svg.png](attachment:220px-Linear_regression.svg.png)

![noisy-sine-linear-regression.png](attachment:noisy-sine-linear-regression.png)

Let's take a look at how we can address the first flaw.

# Regularization in Machine Learning


Regularization is the first "advanced" tactic for improving model performance. Many ML courses treat it as an advanced topic, but it's really pretty easy to understand and implement.

The first flaw of linear models is that they are prone to overfitting when there are many input features.

Let's take an extreme example to illustrate why this happens:

- Let's say you have 100 observations in your training dataset.
- Let's say you also have 100 features.
- If you fit a linear regression model with all 100 of those features, you can perfectly "memorize" the training set.
- In effect, each coefficient memorizes one observation. This model would fit the training data perfectly, but perform poorly on unseen data.
- It hasn't learned the true underlying patterns; it has only memorized the noise in the training data.

Regularization is a technique used to prevent overfitting by artificially penalizing model coefficients.

1. It can discourage large coefficients (by dampening them).
2. It can also remove features entirely (by setting their coefficients to 0).

The "strength" of the penalty is tunable. 

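The extreme example above (as many features as observations) is easy to reproduce. The sketch below is only an illustration on synthetic data and assumes NumPy is available. It uses 20 observations and 20 features instead of 100 to keep things small, and only the first feature actually drives the target.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_features = 20, 20  # as many features as observations

def make_data(n):
    # Synthetic data: only the first feature matters, the rest is noise.
    X = rng.normal(size=(n, n_features))
    y = 3.0 * X[:, 0] + rng.normal(size=n)
    return X, y

X_train, y_train = make_data(n_obs)
X_test, y_test = make_data(200)

# Plain least squares with no penalty on the coefficients.
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

train_mse = np.mean((X_train @ coef - y_train) ** 2)
test_mse = np.mean((X_test @ coef - y_test) ** 2)
print(f"train MSE: {train_mse:.2e}, test MSE: {test_mse:.2f}")
```

The training error is essentially zero because the model has enough coefficients to pass through every training point, while the error on fresh data is far larger.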
### Regularized Regression Algos

# Lasso Regression
Lasso, or LASSO, stands for Least Absolute Shrinkage and Selection Operator.

Lasso regression penalizes the absolute size of coefficients (the sum of their absolute values, also known as an L1 penalty).

Practically, this leads to coefficients that can be exactly 0.

Thus, Lasso offers automatic feature selection because it can completely remove some features.
Remember, the "strength" of the penalty should be tuned.
A stronger penalty leads to more coefficients pushed to zero.

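Here is a rough sketch of how an absolute-size penalty ends up producing coefficients that are exactly 0. It is not how any particular library implements Lasso. It assumes NumPy, a loss of the form (1 / 2n) * squared error + alpha * sum of absolute coefficients, no intercept, and synthetic data in which only the first two of five features matter. The helper names are hypothetical.

```python
import numpy as np

def soft_threshold(value, threshold):
    # Shrink toward zero, and snap to exactly zero inside the threshold.
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_iters=200):
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_iters):
        for j in range(n_features):
            # Residual with feature j's current contribution added back in.
            residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ residual / n_samples
            z = X[:, j] @ X[:, j] / n_samples
            w[j] = soft_threshold(rho, alpha) / z
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

weak = lasso_coordinate_descent(X, y, alpha=0.01)   # weak penalty
strong = lasso_coordinate_descent(X, y, alpha=1.0)  # strong penalty

n_zero_weak = int(np.sum(weak == 0.0))
n_zero_strong = int(np.sum(strong == 0.0))
print("weak penalty:  ", np.round(weak, 3))
print("strong penalty:", np.round(strong, 3))
```

With the stronger penalty, the three irrelevant features are dropped entirely and the two useful coefficients are shrunk.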
# Ridge Regression

Ridge stands for Really Intense Dangerous Grapefruit Eating (just kidding... it's just ridge).

Ridge regression penalizes the squared size of coefficients (the sum of their squares, also known as an L2 penalty).

Practically, this leads to smaller coefficients, but it doesn't force them to 0.

In other words, Ridge offers feature shrinkage.

Again, the "strength" of the penalty should be tuned.
A stronger penalty leads to coefficients pushed closer to zero.

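To see the shrinkage in action, here is a small sketch that assumes NumPy and uses the textbook closed-form Ridge solution (no intercept) on synthetic data. The function name `ridge_fit` and the penalty values are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, penalty):
    # Closed form: solve (X^T X + penalty * I) w = X^T y
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + penalty * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

weak = ridge_fit(X, y, penalty=1.0)
strong = ridge_fit(X, y, penalty=1000.0)

print("weak penalty:  ", np.round(weak, 3))
print("strong penalty:", np.round(strong, 3))
```

The stronger penalty pulls every coefficient closer to zero, but none of them lands on exactly 0, which is the key difference from Lasso.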
# Elastic Net Regression

Elastic-Net is a compromise between Lasso and Ridge.

Elastic-Net penalizes a mix of both absolute and squared size.

The ratio of the two penalty types should be tuned.

The overall strength should also be tuned.

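The "mix" of penalties can be written down in a few lines. This sketch uses one common parameterization (an overall `strength` plus an `l1_ratio` between 0 and 1). Both names and the example coefficients are assumptions for illustration, and other sources scale the terms differently.

```python
def elastic_net_penalty(coefficients, strength, l1_ratio):
    """Blend of the Lasso-style (absolute) and Ridge-style (squared) penalties."""
    l1 = sum(abs(w) for w in coefficients)
    l2 = sum(w * w for w in coefficients)
    return strength * (l1_ratio * l1 + 0.5 * (1.0 - l1_ratio) * l2)

coefs = [3.0, -4.0, 0.0]

lasso_like = elastic_net_penalty(coefs, strength=0.5, l1_ratio=1.0)  # absolute only
ridge_like = elastic_net_penalty(coefs, strength=0.5, l1_ratio=0.0)  # squared only
mixed = elastic_net_penalty(coefs, strength=0.5, l1_ratio=0.5)       # half and half

print(lasso_like, ridge_like, mixed)
```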
Oh, and in case you're wondering, there's no "best" type of penalty. It really depends on the dataset and the problem. We recommend trying different algorithms that use a range of penalty strengths as part of the tuning process, which we'll cover in detail tomorrow.

Awesome, we've just seen 3 algorithms that can protect linear regression from overfitting.

But if you remember, linear regression suffers from two main flaws:

1. It's prone to overfitting when there are many input features.
2. It cannot easily express non-linear relationships.

How can we address the second flaw?

### Decision Tree Algos

Well, to do so, we need to move away from linear models and bring in a new category of algorithms.

Decision trees model data as a "tree" of hierarchical branches. They make branches until they reach "leaves" that represent predictions.

Due to their branching structure, decision trees can easily model nonlinear relationships.

- For example, let's say for Single Family homes, larger lots command higher prices.
- However, let's say for Apartments, smaller lots command higher prices (i.e. lot size is a proxy for urban / rural).
- This reversal of correlation is difficult for linear models to capture unless you explicitly add an interaction term (i.e. you can anticipate it ahead of time).
- On the other hand, decision trees can capture this relationship naturally.

Unfortunately, decision trees suffer from a major flaw as well. If you allow them to grow limitlessly, they can completely "memorize" the training data, just by creating more and more branches.

As a result, individual unconstrained decision trees are very prone to overfitting.

So, how can we take advantage of the flexibility of decision trees while preventing them from overfitting the training data?

![Decision-Tree-Example.jpg](attachment:Decision-Tree-Example.jpg)


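To see why the lot-size reversal is natural for a tree, it helps to write one out by hand. The sketch below is a hand-written two-level tree, not a fitted model. The function name, thresholds, and prices are all made up for illustration.

```python
def predict_price(property_type, lot_size_sqft):
    """A tiny hand-built decision tree; every number here is hypothetical."""
    if property_type == "single_family":
        # Branch for houses: bigger lots lead to higher prices.
        if lot_size_sqft > 8000:
            return 450_000
        return 350_000
    # Branch for apartments: smaller lots (denser, more urban) lead to higher prices.
    if lot_size_sqft > 2000:
        return 200_000
    return 300_000

print(predict_price("single_family", 10_000), predict_price("single_family", 5_000))
print(predict_price("apartment", 3_000), predict_price("apartment", 1_000))
```

Because the second split sits inside each property-type branch, the effect of lot size is free to point in opposite directions, with no interaction term needed.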
# Tree Ensembles
Ensembles are machine learning methods for combining predictions from multiple separate models. There are a few different methods for ensembling, but the two most common are:

### Bagging
Bagging attempts to reduce the chance of overfitting complex models.

- It trains a large number of "strong" learners in parallel.
- A strong learner is a model that's relatively unconstrained.
- Bagging then combines all the strong learners together in order to "smooth out" their predictions.

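Here is a rough sketch of the bagging idea, assuming NumPy and synthetic noisy sine data. To keep it short, the "strong" learner is a 1-nearest-neighbor predictor, used as a stand-in for a fully grown tree because it also memorizes its training data. That substitution, and every name and number below, is an assumption made for illustration.

```python
import numpy as np

def nearest_neighbor_predict(x_train, y_train, x_query):
    # Unconstrained learner: return the target of the closest training point.
    idx = np.abs(x_query[:, None] - x_train[None, :]).argmin(axis=1)
    return y_train[idx]

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 6, size=200)
y_train = np.sin(x_train) + rng.normal(scale=0.5, size=200)
x_test = np.linspace(0, 6, 500)
y_true = np.sin(x_test)  # noise-free signal, used only for evaluation

# One strong learner on its own.
single_pred = nearest_neighbor_predict(x_train, y_train, x_test)

# Bagging: train each learner on a resample drawn with replacement, then average.
n_models = 50
bagged_pred = np.zeros_like(x_test)
for _ in range(n_models):
    sample = rng.integers(0, len(x_train), size=len(x_train))
    bagged_pred += nearest_neighbor_predict(x_train[sample], y_train[sample], x_test)
bagged_pred /= n_models

single_mse = np.mean((single_pred - y_true) ** 2)
bagged_mse = np.mean((bagged_pred - y_true) ** 2)
print(f"single learner MSE: {single_mse:.3f}, bagged MSE: {bagged_mse:.3f}")
```

Averaging many memorizing learners smooths out the noise that each one picks up individually.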
### Boosting
Boosting attempts to improve the predictive flexibility of simple models.

- It trains a large number of "weak" learners in sequence.
- A weak learner is a constrained model (e.g. you could limit the max depth of each decision tree).
- Each one in the sequence focuses on learning from the mistakes of the one before it.
- Boosting then combines all the weak learners into a single strong learner.

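The sketch below shows one simple flavor of boosting for regression, assuming NumPy and synthetic data. Each weak learner is a depth-1 tree (a "stump") fit to the errors left over by the learners before it. The helper names, the learning rate, and the number of rounds are assumptions, and real boosting libraries differ in many details.

```python
import numpy as np

def fit_stump(x, residual):
    """Fit a depth-1 regression tree: one threshold, one mean on each side."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so the right side is never empty
        left = x <= threshold
        left_mean, right_mean = residual[left].mean(), residual[~left].mean()
        sse = np.sum((residual - np.where(left, left_mean, right_mean)) ** 2)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return np.where(x <= threshold, left_mean, right_mean)

rng = np.random.default_rng(0)
x = rng.uniform(0, 6, size=100)
y = np.sin(x) + rng.normal(scale=0.2, size=100)

learning_rate = 0.5
prediction = np.full_like(y, y.mean())  # start from a constant guess
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(50):
    stump = fit_stump(x, y - prediction)  # learn from the current mistakes
    prediction += learning_rate * predict_stump(stump, x)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"training MSE: {mse_history[0]:.3f} -> {mse_history[-1]:.3f}")
```

Note that this tracks training error only. Add enough rounds and boosting can overfit too, which is why the number of learners and their depth are usually tuned.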

While bagging and boosting are both ensemble methods, they approach the problem from opposite directions. Bagging uses complex base models and tries to "smooth out" their predictions, while boosting uses simple base models and tries to "boost" their aggregate complexity.

Ensembling is a general term, but when the base models are decision trees, they have special names: random forests and boosted trees!

### Random forests
Random forests train a large number of "strong" decision trees and combine their predictions through bagging.

In addition, there are two sources of "randomness" for random forests:

- Each tree is only allowed to choose from a random subset of features to split on (leading to feature selection).
- Each tree is only trained on a random subset of observations (a process called resampling, typically done by drawing observations with replacement).

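The two sources of randomness can be sketched with a couple of NumPy calls. This follows the per-tree wording above. Many implementations instead draw a fresh feature subset at every split, and the sizes used here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_observations, n_features = 100, 10
n_features_per_tree = 3  # hypothetical choice; normally tuned

# Randomness 1: a random subset of feature indices, with no repeats.
feature_subset = rng.choice(n_features, size=n_features_per_tree, replace=False)

# Randomness 2: a resample of row indices, drawn with replacement.
row_sample = rng.integers(0, n_observations, size=n_observations)
n_unique_rows = len(np.unique(row_sample))

print("features for this tree:", sorted(feature_subset.tolist()))
print("unique rows in resample:", n_unique_rows, "of", n_observations)
```

Because rows are drawn with replacement, each tree sees some observations more than once and misses others entirely, which helps make the trees different from one another.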
### Boosted trees
Boosted trees train a sequence of "weak", constrained decision trees and combine their predictions through boosting.

- Each tree is allowed a maximum depth, which should be tuned.
- Each tree in the sequence tries to correct the prediction errors of the one before it.
