# Linear Regression Bakeoff

![Paul Hollywood gif](https://media.giphy.com/media/OjrcZp4fXMHBryKoXZ/giphy.gif)

### Inferential vs. Predictive
You should think of this primarily as a project in **inferential** statistics. That means:
- focusing on trying to satisfy the assumptions of linear regression;
- using all your records to build models;
- aiming for understanding how features influence sales prices.

But we also invite you to a level-up: a friendly competition among the teams. And here the goal is **predictive**. That means:
- maximizing $R^2$;
- utilizing train-test splits;
- utilizing validation sets (or cross-validation).

We’ll have some unlabeled test data for you to plug into your models.


# Training Data

Like a Kaggle competition, you are provided with the following training data representing 3/4 of the data set.  
It is split into **predictive features** (`X_train`) and a **target variable** (`y_train`).

In [1]:
import pandas as pd
import numpy as np

X_train = pd.read_csv('bakeoff_data/Xtrain.csv')
y_train = pd.read_csv('bakeoff_data/ytrain.csv')

In [2]:
print(X_train.shape)

(16197, 19)


In [3]:
print(y_train.shape)

(16197, 1)


In [4]:
X_train.head()

Unnamed: 0,date,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,3/4/2015,3,2.5,1880,4499,2.0,0.0,0.0,3,8,1880,0.0,1993,0.0,98029,47.5664,-121.999,2130,5114
1,10/7/2014,3,2.5,2020,6564,1.0,0.0,0.0,3,7,1310,710.0,1994,0.0,98042,47.3545,-122.158,1710,5151
2,1/16/2015,5,4.0,4720,493534,2.0,0.0,0.0,5,9,3960,760.0,1975,0.0,98027,47.4536,-122.009,2160,219542
3,3/30/2015,2,2.0,1430,3880,1.0,0.0,0.0,4,7,1430,0.0,1949,0.0,98117,47.6844,-122.392,1430,3880
4,10/14/2014,3,2.25,2270,32112,1.0,0.0,0.0,4,8,1740,530.0,1980,0.0,98042,47.3451,-122.094,2310,41606


As you can see, you have been provided with 19 independent features.  You may use as many of them as you like in your model.  The goal is to get the highest $R^2$ on the test data.

# Test Data

But how will you know that your model resulted in a high $R^2$ on the test data? You won't! At least, you won't know until the submission window has closed.

You will notice that while you have a file named `Xtest.csv`, you do not have a file named `ytest.csv`. Your instructor has that in their possession, and will keep it secret from the bakeoff contestants.

Once you have decided on your best model, you will then make predictions.  These predictions will be compared to the labels held in the hidden `ytest.csv`, resulting in a final $R^2$ score. In order for your submission to be valid, you have to have a prediction for every row of `Xtest.csv`.

Below, the `Xtest.csv` has been imported into this notebook for you.

In [8]:
X_test = pd.read_csv('bakeoff_data/Xtest.csv')
X_test.head()


Unnamed: 0,date,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,2/20/2015,3,0.75,850,8573,1.0,0.0,0.0,3,6,600,250.0,1945,0.0,98146,47.503,-122.356,850,8382
1,10/8/2014,3,1.0,1510,6083,1.0,0.0,0.0,4,6,860,650.0,1940,0.0,98115,47.6966,-122.324,1510,5712
2,3/25/2015,4,2.25,1790,42000,1.0,0.0,0.0,3,7,1170,620.0,1983,0.0,98045,47.4819,-121.744,2060,50094
3,2/17/2015,2,1.5,1140,2500,1.0,0.0,1.0,3,7,630,510.0,1988,,98106,47.5707,-122.359,1500,5000
4,5/23/2014,3,1.0,1500,3920,1.0,0.0,0.0,3,7,1000,500.0,1947,0.0,98107,47.6718,-122.359,1640,4017


In [9]:
print(X_test.shape)

(5400, 19)


Notice how the cell above indicates that there are **5400** records in `X_test`.  You should therefore submit 5400 predicted sale prices.

# Building Your Best Model

So how does one build a model that one has confidence will perform well on the test data? You could just fit the model on the training data and look at the $R^2$.  But remember, your training $R^2$ will never go down when you add more features to a least-squares fit. With that in mind, you could implement a 6th-degree polynomial transformation, and your training $R^2$ would be very high.  What would that mean in terms of the bias-variance tradeoff?  Your model would be highly complex and would very likely overfit, so you should expect it to perform poorly on the test set.
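The point about training $R^2$ can be illustrated with a minimal sketch. The data here is synthetic (an assumption made purely for illustration, not the bakeoff data), and the names `x_fit`, `x_val`, `train_r2`, and `val_r2` are hypothetical. The training score never drops as the polynomial degree grows, while the held-out score is under no such guarantee.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (assumption): a noisy linear relationship, not the housing data.
rng = np.random.default_rng(0)
x_fit = rng.uniform(-1, 1, size=(40, 1))
y_fit = 3 * x_fit[:, 0] + rng.normal(scale=0.5, size=40)
x_val = rng.uniform(-1, 1, size=(40, 1))
y_val = 3 * x_val[:, 0] + rng.normal(scale=0.5, size=40)

train_r2, val_r2 = [], []
for degree in range(1, 7):
    # Each higher degree adds one more column to the design matrix.
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    model = LinearRegression().fit(poly.fit_transform(x_fit), y_fit)
    train_r2.append(model.score(poly.transform(x_fit), y_fit))
    val_r2.append(model.score(poly.transform(x_val), y_val))

for degree, (tr, va) in enumerate(zip(train_r2, val_r2), start=1):
    print(f"degree {degree}: train R^2 = {tr:.3f}, held-out R^2 = {va:.3f}")
```

Comparing the two printed columns is the habit to build: a gap that widens with complexity is a sign of overfitting.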

To get an idea of how your model will perform on unseen data, you will have to choose some method of creating a validation set within your training data.  

There are several ways to do that, and you will have to pick the method that you are most comfortable with.  

The simplest way is to perform another train-test split on your training data, fit your model on the larger part of that secondary split, and then score it on the smaller validation set.
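A hold-out split of this kind might look like the sketch below. It uses a small synthetic frame so that it runs on its own; `X_demo`, `y_demo`, the 80/20 ratio, and the `random_state` are all assumptions for illustration. With the real data you would pass `X_train` and `y_train` instead.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training data (assumption, not the bakeoff files).
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "sqft_living": rng.uniform(500, 4000, size=200),
    "grade": rng.integers(5, 12, size=200),
})
y_demo = (150 * X_demo["sqft_living"] + 20000 * X_demo["grade"]
          + rng.normal(scale=50000, size=200))

# Secondary split: fit on 80% of the training data, validate on the other 20%.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_fit, y_fit)
val_r2 = model.score(X_val, y_val)  # R^2 on rows the model never saw
print(f"validation R^2: {val_r2:.3f}")
```

A single split is quick, but the score depends on which rows happen to land in the validation set.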

The more comprehensive way is to use scikit-learn's cross-validation tools, such as `cross_val_score` or `KFold`.  If you specify 5 folds, your model is trained 5 times, each time on a different 4/5 of the training data, and scored on the remaining 1/5.  You would then look at the mean $R^2$ across the 5 validation folds.
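As a rough sketch of 5-fold cross-validation, the block below scores a plain linear regression with `cross_val_score` and a shuffled `KFold`. The data is synthetic and the names `X_demo`, `y_demo`, and `mean_r2` are hypothetical; shuffling and the seed are choices made for this example, not requirements of the bakeoff.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the training data (assumption).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "sqft_living": rng.uniform(500, 4000, size=200),
    "grade": rng.integers(5, 12, size=200),
})
y_demo = (150 * X_demo["sqft_living"] + 20000 * X_demo["grade"]
          + rng.normal(scale=50000, size=200))

# Five folds: each row is used for validation exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, scoring="r2")

mean_r2 = scores.mean()
print("R^2 per fold:", np.round(scores, 3))
print(f"mean R^2: {mean_r2:.3f}")
```

The spread of the five fold scores is worth a look too, since a large spread suggests the mean is not very stable.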

Your task will be to try out different hypotheses iteratively, and select the combination of predictors that explains the most variance.

# Generating Predictions

After you have selected your best combination of features, your work is not quite done. You have to use your trained model to make predictions.  In doing so, you have to watch out for a few stumbling points.

## 1: Retrain Your Model on the Entire Training Set

When you are iteratively building your model with cross-validation, some data (the validation data) is necessarily left out of the training process.  You always want to train your model on as much data as possible. The validation process tells you which features to use in your final model, but you then need to retrain your model on the entire training set using those features.  You could skip this step, but your model would then be fit on less data and will likely perform worse.
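One way to picture this step: use cross-validation only to choose between candidate feature sets, then fit a fresh model on every training row with the winning set. The sketch below assumes synthetic data and two hypothetical candidates (`size_only`, `size_and_grade`); none of these names come from the bakeoff materials.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data (assumption).
rng = np.random.default_rng(1)
X_demo = pd.DataFrame({
    "sqft_living": rng.uniform(500, 4000, size=200),
    "grade": rng.integers(5, 12, size=200),
})
y_demo = (150 * X_demo["sqft_living"] + 20000 * X_demo["grade"]
          + rng.normal(scale=50000, size=200))

# Hypothetical candidate feature sets to compare.
candidates = {
    "size_only": ["sqft_living"],
    "size_and_grade": ["sqft_living", "grade"],
}

# Step 1: cross-validation decides which feature set to keep.
cv_means = {
    name: cross_val_score(LinearRegression(), X_demo[cols], y_demo,
                          cv=5, scoring="r2").mean()
    for name, cols in candidates.items()
}
best_name = max(cv_means, key=cv_means.get)
best_cols = candidates[best_name]

# Step 2: refit on ALL training rows with the chosen features.
final_model = LinearRegression().fit(X_demo[best_cols], y_demo)
print(best_name, cv_means)
```

The models fit inside `cross_val_score` are thrown away; only `final_model` would be used to predict on the test features.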



## 2: Prepare your X_test Exactly as You Prepared your X_train

When selecting the best features for your model, you will almost certainly alter your X_train data frame.  For example, maybe you did not include the `date` feature. After fitting your final model to a version of X_train without `date`, you then try to make a prediction on X_test.  Sklearn will complain that the dimensions of X_test do not match the dimensions the fitted model expects.  So, before making your predictions, you will have to drop the `date` column from X_test.  Any transformation you apply to X_train will also have to be applied to X_test.
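A simple way to keep the two frames in sync is to put every preparation step in one function and call it on both. The sketch below is only an illustration: `prepare_features` is a hypothetical helper, the log transform is an arbitrary example step, and the tiny frames reuse a few values from the tables above rather than loading the real files.

```python
import numpy as np
import pandas as pd

def prepare_features(df):
    """Apply the same (hypothetical) preparation steps to any feature frame."""
    out = df.drop(columns=["date"])                      # drop the unused column
    out["log_sqft_living"] = np.log(out["sqft_living"])  # example transformation
    return out

# Tiny stand-ins for X_train and X_test (assumption, for a runnable example).
X_train_demo = pd.DataFrame({
    "date": ["3/4/2015", "10/7/2014"],
    "bedrooms": [3, 3],
    "sqft_living": [1880, 2020],
})
X_test_demo = pd.DataFrame({
    "date": ["2/20/2015", "10/8/2014"],
    "bedrooms": [3, 3],
    "sqft_living": [850, 1510],
})

X_train_prepared = prepare_features(X_train_demo)
X_test_prepared = prepare_features(X_test_demo)

# Same columns, in the same order, in both frames.
print(list(X_train_prepared.columns) == list(X_test_prepared.columns))
```

Because both frames pass through the same function, a step added later cannot be forgotten on the test side.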

You will also have to deal with the missing values in X_test.  There are 3 columns which include NA's.  You cannot drop rows containing missing values, since that would leave you with fewer predictions than there are rows in `Xtest.csv`. If those columns are important to your model, you will have to fill the NA's in the test set just as you did in your training set. Of course, you could opt to not include those columns in your final model.
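As one possible approach (an assumption, not the required method), you could compute fill values from the training features and reuse them on the test features, so both sets are filled identically and no rows are lost. The columns and values below are illustrative only, and the median is just one reasonable choice of fill value.

```python
import numpy as np
import pandas as pd

# Tiny stand-ins with missing values (assumption; column choice is illustrative).
X_train_demo = pd.DataFrame({
    "waterfront": [0.0, np.nan, 0.0, 1.0, 0.0],
    "yr_renovated": [0.0, 0.0, np.nan, 1995.0, 0.0],
})
X_test_demo = pd.DataFrame({
    "waterfront": [0.0, 0.0, np.nan],
    "yr_renovated": [np.nan, 0.0, 0.0],
})

# Learn the fill values from the training data only (medians skip NaN).
fill_values = X_train_demo.median()

# Fill both frames with the same values; no rows are dropped.
X_train_filled = X_train_demo.fillna(fill_values)
X_test_filled = X_test_demo.fillna(fill_values)

print(X_test_filled)
```

The row count of `X_test_filled` matches `X_test_demo`, which is what keeps the number of predictions intact.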





# Checking your Prediction Shape

You have selected the features for your best model, and trained your model on the entire training set.  You have transformed the X_test in the same way that you transformed your X_train.  You have made a set of predictions.

In the cell below, you will find a fake y_test; it has been filled with zeros.

In [10]:
import numpy as np
y_test_fake = np.full((5400,1), 0)

In order to test that your predictions are of the correct shape, feed your 5400 predicted values into the cell below.

In [11]:
from sklearn.metrics import r2_score

# Fake predictions using the mean of y_train.
your_y_hat_predictions = np.full((5400, 1), y_train.to_numpy().mean())

# r2_score expects the true values first, then the predictions.
r2_score(y_test_fake, your_y_hat_predictions)

0.0

Only pay attention to errors thrown by the cell above, not the $R^2$.   If the cell runs without errors, your predictions have the right shape and are ready for submission.
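If you would like a more explicit check, a small helper along these lines could be run before saving. `check_predictions` is a hypothetical function written for this sketch, not part of the bakeoff materials; it assumes the expected count is the 5400 rows reported for `X_test`.

```python
import numpy as np

def check_predictions(preds, n_expected=5400):
    """Return predictions as a flat float array, or raise if they look invalid."""
    arr = np.asarray(preds, dtype=float)
    # Accept a single-column 2-D array such as shape (5400, 1).
    if arr.ndim == 2 and arr.shape[1] == 1:
        arr = arr[:, 0]
    if arr.shape != (n_expected,):
        raise ValueError(f"expected {n_expected} predictions, got shape {arr.shape}")
    if not np.all(np.isfinite(arr)):
        raise ValueError("predictions contain NaN or infinite values")
    return arr

# Example with placeholder predictions (all zeros).
checked = check_predictions(np.zeros((5400, 1)))
print(checked.shape)
```

A `ValueError` here usually means rows were dropped from the test features or a missing value slipped through to the model.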

Convert the array of predictions into a `csv` by filling in the placeholder filepath and variable name with the appropriate values.

In [95]:
np.savetxt('your_team_member_names.csv', your_y_hat_predictions, delimiter=',')

There will be a Slack channel designated for submitting your final predictions `csv`. 

Only predictions received before 5 pm PST will be considered valid.  

The team with the highest $R^2$ will be deemed the Linear Regression Bakeoff winner.



![on your marks, get set, bake](https://media.giphy.com/media/l3vRhl6k5tb3oPGLK/giphy.gif)