<a href="https://colab.research.google.com/github/HaidyGiratallah/Intro_to_Machine_Learning_2019/blob/master/Intro_to_ML_linear_regression_code_along.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Linear Regression

In this notebook we'll explore how to perform linear regression and the principles behind it.
We'll begin with a simple example of linear regression through gradient descent, then show how to do the equivalent fit with sklearn. Finally, we'll work through cross-validation using both holdout and k-fold, then explore the effect of regularization on the models we create.

***

First, we'll import some Python modules that will prove useful for visualization and for performing gradient descent.

### Creating your dataset

In [0]:
#Make synthetic dataset for regression
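
One way to fill in the cell above is sklearn's <code>make_regression</code>. This is only a sketch: the sample count, noise level, bias and seed below are assumptions, not the values used in the original notebook.

```python
from sklearn.datasets import make_regression

# Assumed settings: 100 samples, a single feature, Gaussian noise, fixed seed.
X, y = make_regression(
    n_samples=100,
    n_features=1,
    noise=15.0,
    bias=25.0,
    random_state=0,
)

print(X.shape)  # features come back as a 2-D array: (n_samples, n_features)
print(y.shape)  # targets come back as a 1-D array: (n_samples,)
```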


The first thing one might do when working with a new dataset is plot the data. With high-dimensional datasets this becomes rather difficult, so alternative methods of exploration are needed, but we're keeping it simple here.

Now that we've visualized our features (which, in effect, has shown us that *we can predict y from X*), we can fit a linear regression model:

With sklearn, model fitting is incredibly easy. All you need to do is specify a model, then fit it to the data that you have.

Once a model is fit, we can query some of its properties, such as the coefficients of the linear model and the intercept.
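
A minimal sketch of these two steps follows. The data is a hypothetical stand-in (roughly $y = 30x + 20$ plus noise), not the notebook's dataset, so the printed numbers will differ from the original.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in data: y is roughly 30 * x + 20 plus Gaussian noise.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(100, 1))
y = 30 * X[:, 0] + 20 + rng.normal(scale=10, size=100)

lm = LinearRegression()  # 1. specify the model
lm.fit(X, y)             # 2. fit it to the data

print("coefficients:", lm.coef_)    # one weight per feature
print("intercept:", lm.intercept_)  # a single float for a 1-D target
```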

You can interpret the linear model as follows:

$$\hat{y} = w^Tx + \beta$$

Where:

- $w$ = <code>lm.coef_</code>
- $\beta$ = <code>lm.intercept_</code>

To use the model to predict new values, we can feed it a set of inputs using <code>lm.predict</code>:
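
As a sketch (again on hypothetical stand-in data, with made-up input values), prediction looks like this. The last lines check that <code>lm.predict</code> agrees with the formula $w^Tx + \beta$.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in data, not the notebook's dataset.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(100, 1))
y = 30 * X[:, 0] + 20 + rng.normal(scale=10, size=100)
lm = LinearRegression().fit(X, y)

# New inputs must be 2-D: one row per sample, one column per feature.
X_new = np.array([[-1.0], [0.0], [1.5]])
y_new = lm.predict(X_new)
print(y_new)

# For a linear model, predict is equivalent to w^T x + beta.
manual = X_new @ lm.coef_ + lm.intercept_
print(np.allclose(y_new, manual))
```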

Visualizing the model is sometimes useful, especially in the low-dimensional case. To do so, we can generate predictions for a set of input X values and overlay the model on our raw data:
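
One possible way to do this with matplotlib is sketched below. The data, the number of grid points and the styling are all assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in data, not the notebook's dataset.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(100, 1))
y = 30 * X[:, 0] + 20 + rng.normal(scale=10, size=100)
lm = LinearRegression().fit(X, y)

# Evenly spaced inputs across the observed range, reshaped to 2-D for sklearn.
X_grid = np.linspace(X.min(), X.max(), 50).reshape(-1, 1)
y_grid = lm.predict(X_grid)

fig, ax = plt.subplots()
ax.scatter(X[:, 0], y, alpha=0.6, label="raw data")
ax.plot(X_grid[:, 0], y_grid, color="red", label="fitted line")
ax.set_xlabel("X")
ax.set_ylabel("y")
ax.legend()
```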

Now that we've visualized the model, we can also calculate residuals, which are a way of assessing how well our model did. To do this, we generate a prediction for each point in our raw data and compare it to the actual value of that point. Averaging the squared residuals gives the mean squared error (MSE):
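
A short sketch of the residual and MSE calculation, using plain NumPy on hypothetical stand-in data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in data, not the notebook's dataset.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(100, 1))
y = 30 * X[:, 0] + 20 + rng.normal(scale=10, size=100)
lm = LinearRegression().fit(X, y)

y_pred = lm.predict(X)         # one prediction per raw data point
residuals = y - y_pred         # actual value minus predicted value
mse = np.mean(residuals ** 2)  # mean squared error

print("MSE:", mse)
```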

Of course, the MSE value isn't really a useful indicator all by itself. It only really makes sense when we compare it to that of an alternative model. Let's suppose that we have an alternative model of the following form, which we've just eye-balled:

$$f(x) = 25x + 25$$

### Exercise: 
Visualize the two linear regression models. Which one do you think is better? Calculate the MSE on the alternative model. Which model has the lower MSE? 

### Solution:
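
One way to approach the MSE part of the exercise is sketched here. The data is a hypothetical stand-in, so the exact numbers will not match the notebook's. The helper name <code>mse</code> is ours.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in data, not the notebook's dataset.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(100, 1))
y = 30 * X[:, 0] + 20 + rng.normal(scale=10, size=100)
lm = LinearRegression().fit(X, y)


def mse(y_true, y_pred):
    """Mean squared error between two 1-D arrays."""
    return np.mean((y_true - y_pred) ** 2)


# Eye-balled alternative model: f(x) = 25x + 25
y_alt = 25 * X[:, 0] + 25

mse_lm = mse(y, lm.predict(X))
mse_alt = mse(y, y_alt)
print("sklearn model MSE:", mse_lm)
print("eye-balled model MSE:", mse_alt)
```

Because least squares minimizes the MSE on the data it was fit to, no other straight line can score lower on that same data.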

Our eye-balled linear model did surprisingly well, but it does not perform as well as sklearn's optimized linear model.

***

But wait! We just fit some ML models on our entire dataset, then calculated the MSE using the same data that we fit the models to! **This is wrong!** Since the goal of statistical modelling is for your models to generalize, we need to redo this properly. Let's perform cross-validation to get an idea of how well our model will perform on unseen data. We'll perform two types of cross-validation:

1. Holdout Cross Validation
2. K-Fold Cross Validation

### Holdout Cross Validation

Let's use Pandas to simplify working with the data and keep tabs on which subset of the data is meant for training or testing:

Let's perform a 60/40 holdout split. That means we'll assign 60% of our data to training and 40% to testing. We'll do this using pandas' <code>DataFrame.sample</code>, which shuffles our raw data; we'll then label 60% of the shuffled rows as train and the remaining 40% as test:
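
A sketch of this split is shown below. The column names (<code>X</code>, <code>y</code>, <code>subset</code>) and the data itself are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data, not the notebook's dataset.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=100)
df = pd.DataFrame({"X": x, "y": 30 * x + 20 + rng.normal(scale=10, size=100)})

# Shuffle all rows: frac=1 samples 100% of the data without replacement.
df = df.sample(frac=1, random_state=0).reset_index(drop=True)

# The first 60% of the shuffled rows become train, the rest become test.
n_train = int(0.6 * len(df))
df["subset"] = "test"
df.loc[: n_train - 1, "subset"] = "train"  # .loc slicing includes the end label

print(df["subset"].value_counts())
```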

In [0]:
#First get indices for the training data


In [0]:
df.head()

### Exercise

Train the model on the training dataset, then compute the MSE on both the training and testing datasets.

### Solution
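
One possible solution, sketched on hypothetical stand-in data with an assumed <code>subset</code> column marking each row as train or test:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in data with a 60/40 train/test label per row.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=100)
df = pd.DataFrame({"X": x, "y": 30 * x + 20 + rng.normal(scale=10, size=100)})
df = df.sample(frac=1, random_state=0).reset_index(drop=True)
df["subset"] = np.where(np.arange(len(df)) < 60, "train", "test")

train = df[df["subset"] == "train"]
test = df[df["subset"] == "test"]

# Fit on the training rows only; double brackets keep X as a 2-D table.
lm = LinearRegression().fit(train[["X"]], train["y"])

mse_train = np.mean((train["y"] - lm.predict(train[["X"]])) ** 2)
mse_test = np.mean((test["y"] - lm.predict(test[["X"]])) ** 2)
print("train MSE:", mse_train)
print("test MSE:", mse_test)
```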

With **holdout** in particular, there's an easier way to set up your data using sklearn, which can produce train and test data for you automatically:
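
The function in question is <code>train_test_split</code> from <code>sklearn.model_selection</code>. A minimal sketch, where the data and the seed are assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data, not the notebook's dataset.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(100, 1))
y = 30 * X[:, 0] + 20 + rng.normal(scale=10, size=100)

# test_size=0.4 gives the same 60/40 holdout split; rows are shuffled by default.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

print(X_train.shape, X_test.shape)  # features stay 2-D
print(y_train.shape, y_test.shape)  # targets stay 1-D
```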

We can proceed as usual, taking the same care with data shapes (sklearn expects the features as a 2-D array with one row per sample).

### Exercise:

Perform holdout cross-validation on data generated by <code>train_test_split</code>.

As expected, a model optimized on the training data need not perform as well on the testing data. This tells us that our expected generalization performance is not as good as the error we measured on the full dataset suggested. This is actually rather typical. However, because this is a **holdout** validation score and our dataset is relatively small, the estimate is subject to error. Instead, let's use a better estimate, **k-fold** cross validation, which is slightly more complicated to set up.

***

### K-Fold Cross Validation

Recall that **K-Fold** cross validation is done by splitting the data into $K$ different subsets (folds). We select each fold in turn and use it as a test set, while the remaining $K-1$ folds become our training data to build a model on:

Easy steps to building a **K-Fold** cross-validation dataset:
1. Shuffle the data order to ensure randomization (in case your data is ordered for whatever reason)
2. Split the data into $K$ equal segments
3. Assign each segment a fold number equal to its position among the segments

In [0]:
#Pick number of folds


In [0]:
#Shuffle the data


In [0]:
#Get total number of samples:


In [0]:
#Loop through K times and assign each subset its fold
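
The four cells above could be filled in along these lines. This is a sketch under assumptions: $K = 5$, a sample count that divides evenly by $K$, and hypothetical column names.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data, not the notebook's dataset.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=100)
df = pd.DataFrame({"X": x, "y": 30 * x + 20 + rng.normal(scale=10, size=100)})

# Pick number of folds (assumed value)
K = 5

# Shuffle the data so any ordering in the rows is broken up
df = df.sample(frac=1, random_state=0).reset_index(drop=True)

# Get total number of samples and the size of each fold
n = len(df)
fold_size = n // K  # assumes n divides evenly by K

# Loop through K times and label each contiguous segment with its fold number
df["fold"] = -1
for k in range(K):
    start, stop = k * fold_size, (k + 1) * fold_size
    df.loc[start : stop - 1, "fold"] = k  # .loc slicing includes the end label

print(df["fold"].value_counts().sort_index())
```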


Now that we've assigned each subset of the data a fold number, the next step is to loop through each fold, train the model on the rest of the data, and test it on the selected fold:
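
A sketch of that loop, assuming a DataFrame with a hypothetical <code>fold</code> column built as described above:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in data with K = 5 equal folds assigned after shuffling.
K = 5
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=100)
df = pd.DataFrame({"X": x, "y": 30 * x + 20 + rng.normal(scale=10, size=100)})
df = df.sample(frac=1, random_state=0).reset_index(drop=True)
df["fold"] = np.arange(len(df)) // (len(df) // K)

fold_mse = []
for k in range(K):
    train = df[df["fold"] != k]  # the other K-1 folds
    test = df[df["fold"] == k]   # the held-out fold
    lm = LinearRegression().fit(train[["X"]], train["y"])
    pred = lm.predict(test[["X"]])
    fold_mse.append(np.mean((test["y"] - pred) ** 2))

print("per-fold test MSE:", np.round(fold_mse, 1))
print("mean test MSE:", np.mean(fold_mse))
```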

K-fold gives a much more stable estimate of what our MSE would look like on unseen data. In fact, the larger $K$ is, the more stable your estimate is (under some conditions), at the cost of more compute time.

***

Another easy way to calculate MSE on your data is to use <code>sklearn.metrics</code> which provides a function to easily calculate mean squared error. It looks as follows:
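
The function is <code>mean_squared_error</code>. The sketch below, on hypothetical stand-in data, checks it against the manual NumPy calculation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Hypothetical stand-in data, not the notebook's dataset.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(100, 1))
y = 30 * X[:, 0] + 20 + rng.normal(scale=10, size=100)
lm = LinearRegression().fit(X, y)

y_pred = lm.predict(X)
mse_sklearn = mean_squared_error(y, y_pred)  # argument order: (y_true, y_pred)
mse_manual = np.mean((y - y_pred) ** 2)

print(mse_sklearn, mse_manual)
```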

***

## Expanding to multiple dimensions and regularization of your linear model


Now that we've familiarized ourselves with running a linear model in 2-dimensional space, it's time to really take advantage of linear regression's ability to deal with high-dimensional space!

We'll go over the following concepts:

1. Training a high-dimensional linear regression model
2. What overfitting looks like
3. Ridge penalization
4. LASSO regression as a linear feature selector

### Make dataset

With this dataset, we have $50$ features to work with, and $2000$ samples! This is effectively not visualizable, and will therefore require us to rely on our intuition for how data behaves in high-dimensional space.
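
A dataset of this shape could be generated as sketched below. Only the sample and feature counts come from the text; the number of informative features, the noise level and the seed are assumptions.

```python
from sklearn.datasets import make_regression

X, y = make_regression(
    n_samples=2000,    # from the text
    n_features=50,     # from the text
    n_informative=10,  # assumed: only some features actually drive y
    noise=20.0,        # assumed noise level
    random_state=0,    # assumed seed, for reproducibility
)

print(X.shape, y.shape)
```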

We'll start with the most naive approach: a simple regression model. We'll explore where it goes wrong using K-fold cross-validation, then try other approaches that can help us get better generalization out of our data:

I've created some "helper functions" to get you set up quickly. If you'd like to know how things work, feel free to copy and paste the function code and examine how each step works to assign K folds.
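
The original helper functions are not reproduced here. As a rough idea of what such a helper might look like, here is a hypothetical <code>kfold_mse</code> built on sklearn's <code>KFold</code>. The toy data deliberately uses few samples relative to the number of features (an assumption, not the notebook's dataset) so that the gap between training and testing error is easy to see.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold


def kfold_mse(model, X, y, n_splits=5, seed=0):
    """Hypothetical helper: return (train_mse, test_mse), one entry per fold."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    train_mse, test_mse = [], []
    for train_idx, test_idx in kf.split(X):
        # Fit a fresh copy of the model on the K-1 training folds.
        fitted = clone(model).fit(X[train_idx], y[train_idx])
        train_mse.append(
            mean_squared_error(y[train_idx], fitted.predict(X[train_idx]))
        )
        test_mse.append(
            mean_squared_error(y[test_idx], fitted.predict(X[test_idx]))
        )
    return np.array(train_mse), np.array(test_mse)


# Assumed toy data: 80 samples, 50 features.
X, y = make_regression(
    n_samples=80, n_features=50, n_informative=10, noise=20.0, random_state=0
)
train_mse, test_mse = kfold_mse(LinearRegression(), X, y)
print("mean train MSE:", train_mse.mean())
print("mean test MSE:", test_mse.mean())
```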

Looks like our model performs a lot worse on the test set than on our training set... The error is sometimes 3-4 times larger than what we see in the training set. This is a classic result of over-fitting your data!

The probable reason is that the number of features we have is just **way too high**. We have a tiny sample relative to the number of features that we need to deal with. This scenario is all too common in big data derived from health care (and is usually even worse!).

If you recall from the ML theory lecture, there are a few ways to combat this:

1. ***Stop being greedy* and pick out your important features**. However, in this case we have no idea what our data actually means; it's just a bunch of meaningless features $x_i$ that we have no *a priori* knowledge about. So this isn't a feasible option...
2. **Dimensionality Reduction**. This is a great way to reduce the number of features in your dataset while maintaining as much variance as possible. We'll get to this in a later component of this workshop.
3. **Regularization**. We could try using **regularization** to deal with the problem of having too many features.
***

Recall that **regularization** works by modifying the cost function so that it includes a penalty on the total magnitude of the feature weights. There are two main kinds of regularization that are typically used (although an infinite number of regularization methods exist!).

1. **Ridge** - $L_2$ penalization using the form $\text{Cost} + \lambda \sum_{i=0}^{K}w_i^2$
2. **LASSO** - $L_1$ penalization using the form $\text{Cost} + \lambda \sum_{i=0}^{K}|w_i|$
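
To make the two penalties concrete, here is a small NumPy sketch. It assumes that $\text{Cost}$ is the mean squared error and that the intercept is left out of the penalty; neither detail is stated above. The function names and the tiny dataset are ours.

```python
import numpy as np


def mse_cost(w, b, X, y):
    """Unpenalized cost: mean squared error of the linear model."""
    return np.mean((X @ w + b - y) ** 2)


def ridge_cost(w, b, X, y, lam):
    """MSE plus the L2 penalty: lam * sum of squared weights."""
    return mse_cost(w, b, X, y) + lam * np.sum(w ** 2)


def lasso_cost(w, b, X, y, lam):
    """MSE plus the L1 penalty: lam * sum of absolute weights."""
    return mse_cost(w, b, X, y) + lam * np.sum(np.abs(w))


# Tiny hand-made example: 2 samples, 2 features.
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 2.0])
w = np.array([0.5, -2.0])
b = 0.0

print(mse_cost(w, b, X, y))         # 46.25
print(ridge_cost(w, b, X, y, 0.1))  # 46.25 + 0.1 * 4.25 = 46.675
print(lasso_cost(w, b, X, y, 0.1))  # 46.25 + 0.1 * 2.5 = 46.5
```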

We'll explore both these techniques and how they influence our ability to generalize to unseen data. 

***
Remember that $\lambda$ is a weight factor that determines just *how strongly we should penalize high feature weights*. This is called a **hyperparameter** of the model and is typically also optimized. However, we won't explore this topic here, as it is pretty advanced. If you're interested, look up:

1. Hyperparameter optimization
2. sklearn's GridSearchCV, RandomizedSearchCV
3. Bayesian Optimization of Hyperparameters
4. Nested K-fold cross validation
***

Luckily, using both **Ridge** and **LASSO** penalization for our models is as easy as importing them from <code>sklearn</code>! We train them the exact same way we do the standard <code>LinearRegression</code> model.

### Ridge-penalized Regression

Let's first train our Ridge-penalized linear model using the same K-fold cross-validation.
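
A sketch of this comparison follows. It uses sklearn's <code>cross_validate</code> in place of the notebook's own helper functions, on assumed toy data (80 samples, 50 features). In sklearn, $\lambda$ is called <code>alpha</code>; the value used here is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_validate

# Assumed toy data: few samples relative to the number of features.
X, y = make_regression(
    n_samples=80, n_features=50, n_informative=10, noise=20.0, random_state=0
)
cv = KFold(n_splits=5, shuffle=True, random_state=0)


def mean_mse(model):
    """Return (mean train MSE, mean test MSE) across the K folds."""
    scores = cross_validate(
        model, X, y, cv=cv,
        scoring="neg_mean_squared_error",
        return_train_score=True,
    )
    # sklearn reports negated MSE so that larger is better; flip the sign back.
    return -scores["train_score"].mean(), -scores["test_score"].mean()


ols_train, ols_test = mean_mse(LinearRegression())
ridge_train, ridge_test = mean_mse(Ridge(alpha=10.0))  # alpha plays the role of lambda

print(f"OLS:   train MSE {ols_train:.1f}, test MSE {ols_test:.1f}")
print(f"Ridge: train MSE {ridge_train:.1f}, test MSE {ridge_test:.1f}")
```

The Ridge training error can never be lower than the unpenalized one, since ordinary least squares already minimizes training MSE. What matters is how the test error changes.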

Notice what happened here! With Ridge penalization we performed a trade-off. Intuitively, Ridge penalization reduces the feature weights (linear model coefficients) across all our features, which effectively reduces the complexity of our model. Thus our model is shifted from being too complex (over-fitting, too much variance) to being simpler (possibly slightly under-fitting, less variance)!

Since our end goal really is to maximize the generalizability of our model, we've gained a net positive by penalizing our model for selecting weights that are too high!

### LASSO Regression

**LASSO** performs a somewhat similar task to **Ridge**. If you recall from the lecture, the main difference is that while Ridge only shrinks the weights across all features, LASSO has a tendency to push the weights of some features to exactly 0 (features it deems "unimportant" by way of collinearity).

The game-changing thing about **LASSO** is that it gives us a *subset of features that stand out as being useful in predicting the outcome $y$*. This is incredibly useful for helping us narrow down useful features when we're flooded with tons of data. In a scientific setting, **LASSO** is great for data-driven hypothesis generation.
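
The feature-selection effect can be seen by counting how many coefficients a fitted <code>Lasso</code> model sets to exactly zero. This is a sketch on assumed toy data, with an arbitrary <code>alpha</code> (sklearn's name for $\lambda$).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Assumed toy data: 50 features, of which only 10 actually influence y.
X, y = make_regression(
    n_samples=80, n_features=50, n_informative=10, noise=20.0, random_state=0
)

lasso = Lasso(alpha=5.0).fit(X, y)

n_zero = int(np.sum(lasso.coef_ == 0))  # features LASSO dropped entirely
selected = np.flatnonzero(lasso.coef_)  # indices of the features it kept

print("features set to zero:", n_zero)
print("selected feature indices:", selected)
```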

LASSO performs almost as well as Ridge on the training dataset, but more importantly it also selects a subset of useful features. Let's take a look at the feature weights determined by each of the models that we've trained. Remember that the final model is trained on the full dataset, so let's do that:

With each model trained, let's pull out the coefficients of each model. We'll include the intercepts as well:
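
A sketch of both steps: training each model on the full dataset, then collecting intercepts and coefficients into one table. The data, the <code>alpha</code> values and the column names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Assumed toy data, not the notebook's dataset.
X, y = make_regression(
    n_samples=80, n_features=50, n_informative=10, noise=20.0, random_state=0
)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=10.0),
    "lasso": Lasso(alpha=5.0),
}

rows = {}
for name, model in models.items():
    model.fit(X, y)  # final models are trained on the full dataset
    # Put the intercept first, followed by one coefficient per feature.
    rows[name] = np.append(model.intercept_, model.coef_)

columns = ["intercept"] + [f"x{i}" for i in range(X.shape[1])]
coef_df = pd.DataFrame.from_dict(rows, orient="index", columns=columns)

print(coef_df.iloc[:, :6].round(2))  # first few columns only
```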

Now with every coefficient pulled out, let's plot them all:
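
One possible plot, sketched with matplotlib on assumed toy data, shows each model's coefficients against the feature index:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Assumed toy data and alpha values, not the notebook's.
X, y = make_regression(
    n_samples=80, n_features=50, n_informative=10, noise=20.0, random_state=0
)
models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=10.0),
    "lasso": Lasso(alpha=5.0),
}

fig, ax = plt.subplots(figsize=(10, 4))
feature_idx = np.arange(X.shape[1])
for name, model in models.items():
    model.fit(X, y)
    # One marker per feature; no connecting line between features.
    ax.plot(feature_idx, model.coef_, marker="o", linestyle="", alpha=0.7, label=name)

ax.axhline(0, color="grey", linewidth=0.5)  # reference line at zero weight
ax.set_xlabel("feature index")
ax.set_ylabel("coefficient")
ax.legend()
```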