# Linear Regression Bakeoff

![Paul Hollywood gif](https://media.giphy.com/media/OjrcZp4fXMHBryKoXZ/giphy.gif)

### Inferential vs. Predictive
You should think of this primarily as a project in **inferential** statistics. That means:
- focusing on trying to satisfy the assumptions of linear regression;
- using all your records to build models;
- aiming for understanding how features influence sales prices.

But we also invite you to a level-up: a friendly competition among the teams. And here the goal is **predictive**. That means:
- maximizing $R^2$;
- utilizing train-test splits;
- utilizing validation sets (or cross-validation).

We’ll have some unlabeled test data for you to plug into your models.


# Training Data

Like a Kaggle competition, you are provided with the following training data representing 3/4 of the data set.  
It is split into **predictive features** (`X_train`) and the **target variable** (`y_train`).

In [1]:
import pandas as pd
import numpy as np

X_train = pd.read_csv('bakeoff_data/Xtrain.csv')
y_train = pd.read_csv('bakeoff_data/ytrain.csv')

In [2]:
print(X_train.shape)

(16197, 19)


In [3]:
print(y_train.shape)

(16197, 1)


In [4]:
X_train.head()

Unnamed: 0,date,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,3/4/2015,3,2.5,1880,4499,2.0,0.0,0.0,3,8,1880,0.0,1993,0.0,98029,47.5664,-121.999,2130,5114
1,10/7/2014,3,2.5,2020,6564,1.0,0.0,0.0,3,7,1310,710.0,1994,0.0,98042,47.3545,-122.158,1710,5151
2,1/16/2015,5,4.0,4720,493534,2.0,0.0,0.0,5,9,3960,760.0,1975,0.0,98027,47.4536,-122.009,2160,219542
3,3/30/2015,2,2.0,1430,3880,1.0,0.0,0.0,4,7,1430,0.0,1949,0.0,98117,47.6844,-122.392,1430,3880
4,10/14/2014,3,2.25,2270,32112,1.0,0.0,0.0,4,8,1740,530.0,1980,0.0,98042,47.3451,-122.094,2310,41606


As you can see, you have been provided with 19 independent features.  You may use as many of them as you like in your model.  The goal is to get the highest R^2 on the test data.

# Test Data

But how will you know that your model resulted in a high R^2 in the test data? You won't! At least, you won't know until the submission window has closed.  

You will notice that while you have a file named `Xtest.csv`, you do not have a file named `ytest.csv`. Your instructor has that in their possession, and will keep it secret from the bakeoff contestants.  

Once you have decided on your best model, you will then make predictions.  These predictions will be compared to the labels held in the hidden `ytest.csv`, resulting in a final R^2 score. In order for your submission to be valid, you have to have a prediction for every row of `Xtest.csv`.
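
To make the scoring concrete, here is a minimal sketch of how R^2 is computed from true values and predictions. The helper name `r_squared` and the three prices are made up for illustration; the instructor's actual scoring code is not shown in this notebook.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (hypothetical helper, for illustration only)."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)         # error left after predicting
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always guessing the mean
    return 1 - ss_res / ss_tot

# Made-up prices, not taken from the bakeoff data.
prices = np.array([300_000.0, 450_000.0, 650_000.0])

perfect_score = r_squared(prices, prices)
mean_score = r_squared(prices, np.full(3, prices.mean()))
print(perfect_score)  # 1.0
print(mean_score)     # 0.0
```

A model that always predicts the mean scores 0, and a model that does worse than the mean scores below 0. Note that the two arguments are not interchangeable: the true values go first.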

Below, the `Xtest.csv` has been imported into this notebook for you.

In [5]:
X_test = pd.read_csv('bakeoff_data/Xtest.csv')
X_test.head()


Unnamed: 0,date,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,2/20/2015,3,0.75,850,8573,1.0,0.0,0.0,3,6,600,250.0,1945,0.0,98146,47.503,-122.356,850,8382
1,10/8/2014,3,1.0,1510,6083,1.0,0.0,0.0,4,6,860,650.0,1940,0.0,98115,47.6966,-122.324,1510,5712
2,3/25/2015,4,2.25,1790,42000,1.0,0.0,0.0,3,7,1170,620.0,1983,0.0,98045,47.4819,-121.744,2060,50094
3,2/17/2015,2,1.5,1140,2500,1.0,0.0,1.0,3,7,630,510.0,1988,,98106,47.5707,-122.359,1500,5000
4,5/23/2014,3,1.0,1500,3920,1.0,0.0,0.0,3,7,1000,500.0,1947,0.0,98107,47.6718,-122.359,1640,4017


In [6]:
print(X_test.shape)

(5400, 19)


Notice how the cell above indicates that there are **5400** records in `X_test`.  You should therefore submit 5400 predicted sale prices.  

# Building Your Best Model

So how do you build a model that you can be confident will perform well on the test data? You could just fit the model on the training data and look at the R^2.  But remember, your training R^2 never goes down when you add more features, because ordinary least squares can always give a useless feature a coefficient of zero. With that in mind, you could just implement a 6th-degree polynomial transformation, and your training R^2 would be very high.  What does that mean in terms of the bias-variance tradeoff?  Your model would be highly complex (low bias, high variance) and would almost certainly overfit. Therefore, you should expect it to perform poorly on the test set.
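
The effect is easy to see on a toy problem. The sketch below uses synthetic data (an assumption, not the housing data) where the true relationship is linear, and fits polynomials of degree 1 through 6. Training R^2 cannot go down as the degree grows, while validation R^2 is free to fall.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Synthetic data (assumption for illustration): a linear signal plus noise.
x = rng.uniform(-1, 1, size=(60, 1))
y = 3 * x[:, 0] + rng.normal(scale=0.5, size=60)

# Hold out the last 20 rows as a validation set.
x_fit, x_val = x[:40], x[40:]
y_fit, y_val = y[:40], y[40:]

train_scores, val_scores = [], []
for degree in range(1, 7):
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    model = LinearRegression().fit(poly.fit_transform(x_fit), y_fit)
    train_scores.append(model.score(poly.transform(x_fit), y_fit))
    val_scores.append(model.score(poly.transform(x_val), y_val))

for degree, (tr, va) in enumerate(zip(train_scores, val_scores), start=1):
    print(f"degree {degree}: train R^2 = {tr:.3f}, validation R^2 = {va:.3f}")
```

Compare the two columns of the printout: the training column only tells you how flexible the model is, not how well it will generalize.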

To get an idea of how your model will perform on unseen data, you will have to choose some method of creating a validation set within your training data.  

There are several ways to do that, and you will have to pick the method that you are most comfortable with.  

The simplest way is to perform another train-test split on your training data, fit your model on the larger part of that secondary split, and then score your model on the smaller validation set. 

The more comprehensive way is to use scikit-learn's cross-validation tools, such as `cross_val_score` or `KFold`.  If you specify 5 folds, the training data is cut into 5 parts and the model is fit 5 times, each time training on 4 of the parts and validating on the remaining one.  You would then look at the mean R^2 across the 5 validation folds.
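
A minimal sketch of 5-fold cross-validation is shown here. `X_demo` and `y_demo` are synthetic stand-ins (an assumption) for whatever cleaned, all-numeric version of `X_train` and `y_train` you end up with.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(42)

# Synthetic stand-in for the prepared training data (assumption).
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=200)

# Shuffle before splitting so each fold is a random sample of the rows.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, scoring="r2")

print(scores)         # one validation R^2 per fold
print(scores.mean())  # the number to compare across candidate models
```

The spread of the five scores is also informative: a model whose fold scores vary widely is less trustworthy than its mean suggests.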

Your task will be to try out different hypotheses iteratively, and select the combination of predictors that explains the most variance.

In [7]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16197 entries, 0 to 16196
Data columns (total 19 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   date           16197 non-null  object 
 1   bedrooms       16197 non-null  int64  
 2   bathrooms      16197 non-null  float64
 3   sqft_living    16197 non-null  int64  
 4   sqft_lot       16197 non-null  int64  
 5   floors         16197 non-null  float64
 6   waterfront     14441 non-null  float64
 7   view           16148 non-null  float64
 8   condition      16197 non-null  int64  
 9   grade          16197 non-null  int64  
 10  sqft_above     16197 non-null  int64  
 11  sqft_basement  16197 non-null  object 
 12  yr_built       16197 non-null  int64  
 13  yr_renovated   13318 non-null  float64
 14  zipcode        16197 non-null  int64  
 15  lat            16197 non-null  float64
 16  long           16197 non-null  float64
 17  sqft_living15  16197 non-null  int64  
 18  sqft_l

In [8]:
X_train.describe()

Unnamed: 0,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
count,16197.0,16197.0,16197.0,16197.0,16197.0,14441.0,16148.0,16197.0,16197.0,16197.0,16197.0,13318.0,16197.0,16197.0,16197.0,16197.0,16197.0
mean,3.372229,2.116426,2083.69303,15071.89,1.494752,0.007686,0.232165,3.410385,7.658702,1790.467926,1971.019942,81.993843,98078.10008,47.560975,-122.21372,1987.809286,12784.065074
std,0.905951,0.768049,918.209756,40775.85,0.540474,0.087338,0.766092,0.650777,1.169277,827.5986,29.325399,396.213694,53.486457,0.138273,0.141639,685.189105,26833.379871
min,1.0,0.5,370.0,520.0,1.0,0.0,0.0,1.0,3.0,370.0,1900.0,0.0,98001.0,47.1559,-122.519,399.0,651.0
25%,3.0,1.75,1430.0,5058.0,1.0,0.0,0.0,3.0,7.0,1200.0,1952.0,0.0,98033.0,47.4725,-122.329,1490.0,5100.0
50%,3.0,2.25,1912.0,7620.0,1.5,0.0,0.0,3.0,7.0,1560.0,1975.0,0.0,98065.0,47.5733,-122.231,1840.0,7620.0
75%,4.0,2.5,2560.0,10720.0,2.0,0.0,0.0,4.0,8.0,2220.0,1997.0,0.0,98117.0,47.6783,-122.124,2360.0,10086.0
max,11.0,8.0,13540.0,1651359.0,3.5,1.0,4.0,5.0,13.0,9410.0,2015.0,2015.0,98199.0,47.7776,-121.315,6210.0,871200.0


In [9]:
y_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16197 entries, 0 to 16196
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   price   16197 non-null  float64
dtypes: float64(1)
memory usage: 126.7 KB


In [10]:
y_train.describe()

Unnamed: 0,price
count,16197.0
mean,541284.5
std,366344.7
min,78000.0
25%,323500.0
50%,450000.0
75%,645000.0
max,7700000.0


In [11]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5400 entries, 0 to 5399
Data columns (total 19 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   date           5400 non-null   object 
 1   bedrooms       5400 non-null   int64  
 2   bathrooms      5400 non-null   float64
 3   sqft_living    5400 non-null   int64  
 4   sqft_lot       5400 non-null   int64  
 5   floors         5400 non-null   float64
 6   waterfront     4780 non-null   float64
 7   view           5386 non-null   float64
 8   condition      5400 non-null   int64  
 9   grade          5400 non-null   int64  
 10  sqft_above     5400 non-null   int64  
 11  sqft_basement  5400 non-null   object 
 12  yr_built       5400 non-null   int64  
 13  yr_renovated   4437 non-null   float64
 14  zipcode        5400 non-null   int64  
 15  lat            5400 non-null   float64
 16  long           5400 non-null   float64
 17  sqft_living15  5400 non-null   int64  
 18  sqft_lot

In [12]:
X_test.describe()

Unnamed: 0,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
count,5400.0,5400.0,5400.0,5400.0,5400.0,4780.0,5386.0,5400.0,5400.0,5400.0,5400.0,4437.0,5400.0,5400.0,5400.0,5400.0,5400.0
mean,3.376111,2.114028,2070.210185,15181.94,1.49213,0.007322,0.238953,3.408148,7.655556,1782.98463,1970.938889,88.568177,98077.507222,47.557447,-122.21477,1983.054074,12680.953148
std,0.984894,0.771851,917.805949,43270.26,0.537347,0.085265,0.764518,0.649909,1.184994,828.294279,29.526848,410.952944,53.595322,0.139365,0.137951,685.405621,28558.979278
min,1.0,0.75,410.0,609.0,1.0,0.0,0.0,1.0,4.0,410.0,1900.0,0.0,98001.0,47.1622,-122.515,670.0,659.0
25%,3.0,1.75,1420.0,5001.0,1.0,0.0,0.0,3.0,7.0,1190.0,1951.0,0.0,98032.0,47.465725,-122.327,1480.0,5100.0
50%,3.0,2.25,1910.0,7616.5,1.5,0.0,0.0,3.0,7.0,1550.0,1975.0,0.0,98065.0,47.5689,-122.228,1830.0,7619.5
75%,4.0,2.5,2520.0,10588.0,2.0,0.0,0.0,4.0,8.0,2200.0,1997.0,0.0,98118.0,47.6775,-122.127,2370.0,10080.0
max,33.0,7.75,10040.0,1164794.0,3.0,1.0,4.0,5.0,13.0,8860.0,2015.0,2015.0,98199.0,47.7775,-121.315,5790.0,858132.0


In [13]:
X_test[X_test['bedrooms'] > 10]

Unnamed: 0,date,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
2667,6/25/2014,33,1.75,1620,6000,1.0,0.0,0.0,5,7,1040,580.0,1947,0.0,98103,47.6878,-122.331,1330,4700


# Generating Predictions

After you have selected your best combination of features, your work is not quite done. You have to use your trained model to make predictions.  In doing so, you have to watch out for a few stumbling points.

## 1: Retrain Your Model on the Entire Training Set

When you are iteratively building your model with cross-validation, you are required to leave out some data (the validation data) in the training process.  But you want your final model to be trained on as much data as possible. The validation process tells you which features to use in your final model; you then need to retrain your model on the entire training set using those features.  You could skip this step, but a model fit on less data will generally perform worse. 
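
One way to organize this, sketched on synthetic data: score each candidate feature set with cross-validation, pick the winner, then fit a fresh model on every training row. The frame `demo`, the `candidates` dictionary, and the coefficients used to simulate `price` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 300

# Synthetic stand-in for a cleaned training frame (assumption).
demo = pd.DataFrame({
    "sqft_living": rng.uniform(500, 4000, n),
    "grade": rng.integers(4, 12, n).astype(float),
})
price = 200 * demo["sqft_living"] + 30_000 * demo["grade"] + rng.normal(scale=50_000, size=n)

# Candidate feature sets to compare (hypothetical names).
candidates = {
    "living_only": ["sqft_living"],
    "living_grade": ["sqft_living", "grade"],
}

# Step 1: validation picks the feature set.
cv_means = {
    name: cross_val_score(LinearRegression(), demo[cols], price, cv=5, scoring="r2").mean()
    for name, cols in candidates.items()
}
best = max(cv_means, key=cv_means.get)

# Step 2: refit on ALL training rows with the chosen features.
final_model = LinearRegression().fit(demo[candidates[best]], price)
print(best, cv_means[best])
```

The cross-validated models are thrown away; only `final_model`, which has seen every row, is used for the test predictions.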



## 2: Prepare your X_test Exactly as You Prepared your X_train

When selecting the best features for your model, you will almost certainly alter your X_train data frame.  For example, maybe you did not include the `date` feature. After fitting your final model to a version of X_train without `date`, you then try to make a prediction on X_test.  Sklearn will complain that the dimensions of X_test do not match the dimensions required by the fitted model.  So, before making your predictions, you will have to drop the `date` column from X_test.  Any transformation you apply to X_train will have to be applied to X_test as well. 

You will also have to deal with the missing values in the X_test.  There are 3 columns which include NA's.  You will not be able to drop rows containing missing values, since doing so will result in diminishing the number of predictions in your final set. If those columns are important to your model, you will have to fill the NA's in the test set just as you did in your training set. Of course, you could opt to not include those columns in your final model.
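
A simple way to keep the two frames in sync is to put every cleaning step in one function and call it on both. The sketch below assumes you drop `date` and fill the three NA columns with 0; `prepare_features` is a hypothetical helper, and the tiny frames are made up (the NaN positions are invented for the demo).

```python
import numpy as np
import pandas as pd

def prepare_features(df):
    """Apply one set of cleaning steps to any feature frame (hypothetical helper)."""
    out = df.drop(columns=["date"])
    # Fill missing values instead of dropping rows, so no predictions are lost.
    out = out.fillna({"waterfront": 0, "view": 0, "yr_renovated": 0})
    return out

# Tiny made-up frames that mimic a few of the bakeoff columns.
train_demo = pd.DataFrame({
    "date": ["3/4/2015", "10/7/2014", "1/16/2015"],
    "sqft_living": [1880, 2020, 4720],
    "waterfront": [0.0, np.nan, 0.0],
    "view": [0.0, 0.0, np.nan],
    "yr_renovated": [0.0, np.nan, 0.0],
})
test_demo = pd.DataFrame({
    "date": ["2/20/2015", "10/8/2014"],
    "sqft_living": [850, 1510],
    "waterfront": [np.nan, 0.0],
    "view": [0.0, 0.0],
    "yr_renovated": [np.nan, 0.0],
})

train_ready = prepare_features(train_demo)
test_ready = prepare_features(test_demo)

# Same columns in the same order, no missing values, and no rows lost.
print(list(train_ready.columns) == list(test_ready.columns))
print(int(test_ready.isna().sum().sum()))
print(len(test_ready) == len(test_demo))
```

If a fill value is computed from the data (a median, say), compute it on the training frame and reuse that same number for the test frame.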





# Checking your Prediction Shape

You have selected the features for your best model, and trained your model on the entire training set.  You have transformed the X_test in the same way that you transformed your X_train.  You have made a set of predictions. 

In the cell below, you will find a fake y_test; it has been filled with zeros.

In [14]:
import numpy as np
y_test_fake = np.full((5400,1), 0)

In order to test that your predictions are of the correct shape, feed your 5400 predicted values into the cell below.

In [15]:
from sklearn.metrics import r2_score

# Fake predictions: every row is the mean training price.
your_y_hat_predictions = np.full((5400, 1), y_train['price'].mean())

# r2_score expects the true values first, then the predictions.
r2_score(y_test_fake, your_y_hat_predictions)

0.0

Only pay attention to errors thrown by the cell above, not the R^2.   If the cell does not throw any errors, your predictions are ready for submission.
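
If you would like a more explicit check than "no error was thrown", something along these lines may help. `check_predictions` is a hypothetical helper, not part of the official submission process, and the two arrays are placeholders.

```python
import numpy as np

def check_predictions(preds, expected_rows=5400):
    """Return a list of problems found; an empty list means the checks passed."""
    arr = np.asarray(preds, dtype=float)
    problems = []
    if arr.shape[0] != expected_rows:
        problems.append(f"expected {expected_rows} rows, got {arr.shape[0]}")
    if arr.ndim > 2 or (arr.ndim == 2 and arr.shape[1] != 1):
        problems.append(f"expected a single column, got shape {arr.shape}")
    if not np.isfinite(arr).all():
        problems.append("predictions contain NaN or infinite values")
    return problems

# Placeholder arrays standing in for real predictions.
good = np.full((5400, 1), 500_000.0)
bad = np.full((5399, 1), np.nan)

print(check_predictions(good))  # no problems
print(check_predictions(bad))   # wrong row count and NaN values
```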

Convert the array of predictions into a `csv` by filling in the placeholder filepath and variable name with the appropriate values.

In [16]:
np.savetxt('your_team_member_names.csv', your_y_hat_predictions, delimiter=',')

There will be a Slack channel designated for submitting your final predictions `csv`. 

Only predictions received before 5 pm PST will be considered valid.  

The team with the highest R^2 will be deemed the Linear Regression Bakeoff winner.



![on your marks, get set, bake](https://media.giphy.com/media/l3vRhl6k5tb3oPGLK/giphy.gif)

## Simplest Way

In [17]:
from sklearn.linear_model import LinearRegression

In [18]:
lr = LinearRegression()

X_train_num = X_train.select_dtypes(exclude='object')
X_train.isna().sum()

date                0
bedrooms            0
bathrooms           0
sqft_living         0
sqft_lot            0
floors              0
waterfront       1756
view               49
condition           0
grade               0
sqft_above          0
sqft_basement       0
yr_built            0
yr_renovated     2879
zipcode             0
lat                 0
long                0
sqft_living15       0
sqft_lot15          0
dtype: int64

In [19]:
X_train_no_na = X_train_num.fillna({'waterfront': 0, 'view': 0, 'yr_renovated': 0})

In [20]:
X_train_no_na.isna().sum()

bedrooms         0
bathrooms        0
sqft_living      0
sqft_lot         0
floors           0
waterfront       0
view             0
condition        0
grade            0
sqft_above       0
yr_built         0
yr_renovated     0
zipcode          0
lat              0
long             0
sqft_living15    0
sqft_lot15       0
dtype: int64

In [21]:
lr.fit(X_train_no_na, y_train)

LinearRegression()

In [22]:
lr.score(X_train_no_na,y_train)

0.70339000928039

In [23]:
X_test_num = X_test.select_dtypes(exclude ='object')

In [24]:
X_test_num.isna().sum()

bedrooms           0
bathrooms          0
sqft_living        0
sqft_lot           0
floors             0
waterfront       620
view              14
condition          0
grade              0
sqft_above         0
yr_built           0
yr_renovated     963
zipcode            0
lat                0
long               0
sqft_living15      0
sqft_lot15         0
dtype: int64

In [25]:
X_test_no_na = X_test_num.fillna({'waterfront': 0, 'view': 0, 'yr_renovated': 0})

In [26]:
X_test_no_na.isna().sum()

bedrooms         0
bathrooms        0
sqft_living      0
sqft_lot         0
floors           0
waterfront       0
view             0
condition        0
grade            0
sqft_above       0
yr_built         0
yr_renovated     0
zipcode          0
lat              0
long             0
sqft_living15    0
sqft_lot15       0
dtype: int64

In [27]:
y_hat = lr.predict(X_test_no_na)
y_hat.shape

(5400, 1)

## Secondary Train-Test-Split:

In [28]:
from sklearn.model_selection import train_test_split

X_t, X_val, y_t, y_val = train_test_split(X_train, y_train, random_state=42, test_size=.2)

In [29]:
X_t_num = X_t.select_dtypes(exclude='object')

In [30]:
X_t_no_na = X_t_num.fillna({'waterfront': 0, 'view': 0, 'yr_renovated': 0})

In [31]:
X_t_no_na.isna().sum()

bedrooms         0
bathrooms        0
sqft_living      0
sqft_lot         0
floors           0
waterfront       0
view             0
condition        0
grade            0
sqft_above       0
yr_built         0
yr_renovated     0
zipcode          0
lat              0
long             0
sqft_living15    0
sqft_lot15       0
dtype: int64

In [32]:
lr.fit(X_t_no_na,y_t)

LinearRegression()

In [33]:
lr.score(X_t_no_na,y_t)

0.7105556206432422

In [34]:
X_val_num = X_val.select_dtypes(exclude ='object')
X_val_no_na = X_val_num.fillna({'waterfront': 0, 'view': 0, 'yr_renovated': 0})

In [35]:
lr.score(X_val_no_na, y_val)

0.6658166143285698

## Log Transform

In [36]:
lr_log = LinearRegression()

In [37]:
y_t_log = np.log(y_t)
y_val_log = np.log(y_val)

In [38]:
lr_log.fit(X_t_no_na, y_t_log)

LinearRegression()

In [39]:
lr_log.score(X_t_no_na,y_t_log)

0.7774196654712606

In [40]:
lr_log.score(X_val_no_na, y_val_log)

0.7504632344551945
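
A word of caution when comparing these numbers: the validation score just above is measured on log(price), while the 0.666 from the secondary split was measured on price itself, and the bakeoff is scored on price. To compare like with like, convert the log-scale predictions back to prices and score those. The sketch below shows the idea on synthetic data (an assumption); with the notebook's variables, the analogous call would be `r2_score(y_val, np.exp(lr_log.predict(X_val_no_na)))`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(7)

# Synthetic data (assumption): prices that are linear on the log scale.
X_demo = rng.normal(size=(500, 2))
y_demo = np.exp(13 + 0.4 * X_demo[:, 0] + 0.2 * X_demo[:, 1]
                + rng.normal(scale=0.2, size=500))

X_fit, X_hold = X_demo[:400], X_demo[400:]
y_fit, y_hold = y_demo[:400], y_demo[400:]

# Fit on log(price), as in the cells above.
model = LinearRegression().fit(X_fit, np.log(y_fit))

# R^2 on the log scale (what .score reports for this model).
log_scale_r2 = model.score(X_hold, np.log(y_hold))

# R^2 on the price scale: undo the log before scoring.
price_scale_r2 = r2_score(y_hold, np.exp(model.predict(X_hold)))

print(log_scale_r2, price_scale_r2)
```

The two numbers generally differ, and only the price-scale one is comparable to the score your submission will receive.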

In [41]:
lr_final = LinearRegression()
y_train_log = np.log(y_train)

lr_final.fit(X_train_no_na, y_train_log)
lr_final.score(X_train_no_na, y_train_log)


0.7724254311572506

In [44]:
y_hat_log = lr_final.predict(X_test_no_na)
y_hat_log


array([[12.31579799],
       [12.85915736],
       [12.62860151],
       ...,
       [13.9940947 ],
       [14.0911344 ],
       [12.39352374]])

In [46]:
y_hat = np.exp(y_hat_log)  # undo the log transform to get prices
y_hat.shape

(5400, 1)

In [None]:
np.savetxt('example.csv', y_hat, delimiter=',')