# Houses Kaggle Competition (revisited with Deep Learning 🔥) 

[<img src='https://github.com/lewagon/data-images/blob/master/ML/kaggle-batch-challenge.png?raw=true' width=600>](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data)

⚙️ Let's re-use our previous **pipeline** built in the module **`05-07-Ensemble-Methods`** and try to improve our final predictions with a Neural Network!

## (0) Libraries and imports

In [1]:
%load_ext autoreload
%autoreload 2

# DATA MANIPULATION
import pandas as pd
import numpy as np

# DATA VISUALISATION
import matplotlib.pyplot as plt
import seaborn as sns

# VIEWING OPTIONS IN THE NOTEBOOK
from sklearn import set_config; set_config(display='diagram')

## (1) 🚀 Getting Started

### (1.1) Load the datasets

💾 Let's load our **training dataset**

In [48]:
data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/houses_train_raw.csv")
X = data.drop(columns='SalePrice')
y = data['SalePrice']

In [3]:
X.shape, y.shape

((1460, 80), (1460,))

💾 Let's also load the **test set**

❗️ Remember ❗️ You have access to `X_test` but only Kaggle has `y_test`

In [4]:
X_test = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/houses_test_raw.csv")

In [5]:
X_test.shape

(1459, 80)

### (1.2) Train/Val Split

❓ **Holdout** ❓ 

As you are not allowed to use the test set (and you don't have access to *y_test* anyway), split your dataset into a training set and a validation set.

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.7)

### (1.3) Import the preprocessor

🎁 You will find in `utils/preprocessor.py` the **`data-preprocessing pipeline`** that was built in our previous iteration.

❓ Run the cell below, and make sure you understand what the pipeline does. Look at the code in `preprocessor.py` ❓

In [7]:
from utils.preprocessor import create_preproc

preproc = create_preproc(X_train)
preproc

❓ **Scaling your numerical features and encoding the categorical features** ❓

Apply these transformations to _both_ your training set and your validation set.

In [8]:
X_train_treated = preproc.fit_transform(X_train, y_train)
X_val_treated = preproc.transform(X_val)

## (2) 🔮 Your predictions in Tensorflow/Keras

🚀 This is your first **regression** task with Keras! 

💡 Here a few tips to get started:
- Kaggle's [rule](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview/evaluation) requires to minimize **`rmsle`** (Root Mean Square Log Error). 
    - As you can see, we can specify `msle` directly as a loss-function with Tensorflow.Keras!
    - Just remember to take the square-root of your loss results to read your rmsle metric.
    
    
😃 The best boosted-tree ***rmsle*** score to beat is around ***0.13***

---

<img src="https://i.pinimg.com/564x/4c/fe/ef/4cfeef34af09973211f584e8307b433c.jpg" alt="`Impossible mission" style="height: 300px; width:500px;"/>

---


❓ **Your mission, should you choose to accept it:** ❓
- 💪 Beat the best boosted-tree 💪 

    - Your responsibilities are:
        - to build the ***best neural network architecture*** possible,
        - and to control the epoch number to ***avoid overfitting***.

### (2.1) Predicting the houses' prices using a Neural Network

❓ **Preliminary Question: Initializing a Neural Network** ❓

Create a function `initialize_model` which initializes a Dense Neural network:
- You are responsible for designing the architecture (number of layers, number of neurons)
- The function should also compile the model with the following parameters:
    - ***optimizer = "adam"***
    - ***loss = "msle"*** (_Optimizing directly for the Squared Log Error!_)
        

In [9]:
X_train_treated.shape

(1021, 159)

In [10]:
X_val_treated.shape

(439, 159)

In [11]:
from tensorflow.keras import models
from tensorflow.keras import layers

In [13]:
#model = initialize_model()
#model.fit(X_train_treated, y_train, batch_size=12, epochs = 200,
#         validation_data = (X_val_treated, y_val))

❓ **Questions/Guidance** ❓

1. Initialize a Neural Network
2. Train it
3. Evaluate its performance
4. Is the model overfitting the dataset? 

🎁 We coded a `plot_history` function that you can use to detect overfitting

In [14]:
def plot_history(history):
    plt.plot(np.sqrt(history.history['loss']))
    plt.plot(np.sqrt(history.history['val_loss']))
    plt.title('Model Loss')
    plt.ylabel('RMSLE')
    plt.xlabel('Epoch')
    plt.legend(['Train', 'Val'], loc='best')
    plt.show()

### (2.2) Challenging yourself

❓ **Questions to challenge yourself:** ❓
- Are you satisfied with your score ?
- Before publishing it, ask yourself whether you could really trust it or not ?
- Have you cross-validated your neural network ? 
    - Feel free to cross-validate it manually with a *for loop* in Python to make sure that your results are robust against the randomness of a _train-val split_ before before submitting to Kaggle

In [32]:
X_treated = pd.DataFrame(preproc.fit_transform(X, y))
X_treated

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,152,153,154,155,156,157,158,159,160,161
0,0.119780,0.413559,0.375,0.125089,0.333333,0.0,0.064212,0.000000,0.000000,0.666667,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
1,0.212942,0.000000,0.375,0.173281,0.000000,0.5,0.121575,0.000000,0.333333,0.666667,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
2,0.134465,0.419370,0.375,0.086109,0.333333,0.0,0.185788,0.000000,0.333333,0.666667,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
3,0.143873,0.366102,0.375,0.038271,0.333333,0.0,0.231164,0.492754,0.333333,0.333333,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
4,0.186095,0.509927,0.500,0.116052,0.333333,0.0,0.209760,0.000000,0.333333,0.666667,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,0.142038,0.336077,0.375,0.000000,0.000000,0.0,0.407962,0.000000,0.333333,0.666667,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
1456,0.399036,0.000000,0.375,0.139972,0.333333,0.0,0.252140,0.000000,0.666667,0.666667,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
1457,0.195961,0.557869,0.500,0.048724,0.000000,0.0,0.375428,0.000000,0.666667,0.666667,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
1458,0.170721,0.000000,0.250,0.008682,0.333333,0.0,0.000000,0.202899,0.000000,0.333333,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0


In [49]:
X_preproc = preproc.fit_transform(X, y)

In [None]:
X_treated.shape, y.shape

In [56]:
sorted(y.unique())

[34900,
 35311,
 37900,
 39300,
 40000,
 52000,
 52500,
 55000,
 55993,
 58500,
 60000,
 61000,
 62383,
 64500,
 66500,
 67000,
 68400,
 68500,
 72500,
 73000,
 75000,
 75500,
 76000,
 76500,
 78000,
 79000,
 79500,
 79900,
 80000,
 80500,
 81000,
 82000,
 82500,
 83000,
 83500,
 84000,
 84500,
 84900,
 85000,
 85400,
 85500,
 86000,
 87000,
 87500,
 88000,
 89000,
 89471,
 89500,
 90000,
 90350,
 91000,
 91300,
 91500,
 92000,
 92900,
 93000,
 93500,
 94000,
 94500,
 94750,
 95000,
 96500,
 97000,
 97500,
 98000,
 98300,
 98600,
 99500,
 99900,
 100000,
 101000,
 101800,
 102000,
 102776,
 103000,
 103200,
 103600,
 104000,
 104900,
 105000,
 105500,
 105900,
 106000,
 106250,
 106500,
 107000,
 107400,
 107500,
 107900,
 108000,
 108480,
 108500,
 108959,
 109000,
 109008,
 109500,
 109900,
 110000,
 110500,
 111000,
 111250,
 112000,
 112500,
 113000,
 114500,
 114504,
 115000,
 116000,
 116050,
 116500,
 116900,
 117000,
 117500,
 118000,
 118400,
 118500,
 118858,
 118964,
 119000

In [130]:
def initialize_model(x):
    model = models.Sequential()
    model.add(layers.Dense(125, activation='relu', input_dim=x.shape[1]))
    model.add(layers.Dense(80, activation='relu'))
    model.add(layers.Dense(40, activation='relu'))
    model.add(layers.Dense(25, activation='relu'))
    model.add(layers.Dense(1, activation = 'linear'))
    
    model.compile(loss='msle', optimizer = "adam", metrics = 'msle', )
    
    return model

In [142]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_log_error
from sklearn.model_selection import KFold

kfold = KFold(n_splits=5, shuffle=True)

scores = []

for train, val in kfold.split(X_treated, y):
    model = initialize_model(X_treated)
    model.fit(X_treated.loc[train],y.loc[train], batch_size=16, epochs = 280, verbose=0)
    print(y.loc[train].unique().mean())
    scores.append(model.evaluate(X_treated.loc[val],y.loc[val]))

scores

199207.98981324278
197764.03321678322
198352.6529209622
200528.9913644214
199876.05114638447


[[0.023445632308721542, 0.023445632308721542],
 [0.027645908296108246, 0.027645908296108246],
 [0.03743932023644447, 0.03743932023644447],
 [0.0150569723919034, 0.0150569723919034],
 [0.02236449345946312, 0.02236449345946312]]

In [143]:
np.mean([k[1] for k in scores])

0.025190465338528156

### (2.3) (Bonus) Using all your CPU cores to run Neural Networks

🔥 **BONUS** 🔥 **Multiprocessing  computing using [dask](https://docs.dask.org/en/latest/delayed.html)** and **all your CPU cores**:

_(to mimic SkLearn's `n_jobs=-1`)_

In [27]:
!pip install --quiet dask

In [38]:
def evaluate_model(X, y, train_index, val_index):
    
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]
    
    model = initialize_model(X_train)
    
    history = model.fit(X_train, 
                        y_train,
                        validation_data = (X_val, y_val), 
                        epochs = 500,
                        batch_size = 16,
                        verbose=0)
    return pd.DataFrame({
                'rmsle_final_epoch': [model.evaluate(X_val, y_val)**0.5],
                'rmsle_min': [min(history.history['val_loss'])**0.5]
                        })

In [50]:
from sklearn.model_selection import KFold
from dask import delayed

cv = 5
kf = KFold(n_splits = cv, shuffle = True)
f = delayed(evaluate_model)

results = delayed([f(X_preproc, y, train_index, val_index) for (train_index, val_index) in kf.split(X_preproc)
                  ]).compute(
                      scheduler='processes', num_workers=8)

pd.concat(results, axis=0).reset_index(drop=True)

2022-02-14 16:03:11.476728: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-02-14 16:03:11.546393: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)




TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'float'

Traceback
---------
  File "/Users/humbert/.pyenv/versions/3.8.12/envs/lewagon/lib/python3.8/site-packages/dask/local.py", line 219, in execute_task
    result = _execute_task(task, data)
  File "/Users/humbert/.pyenv/versions/3.8.12/envs/lewagon/lib/python3.8/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/var/folders/yf/53d25rm10gq_l7wdtlp_205w0000gn/T/ipykernel_48005/574681042.py", line 15, in evaluate_model


## (3) 🏅FINAL SUBMISSION

🦄 Predict the ***prices of the houses in your test set*** and submit your results to Kaggle! 



In [61]:
X_test = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/houses_test_raw.csv")

In [101]:
X_preproc, y

(array([[0.11977972, 0.41355932, 0.        , ..., 1.        , 1.        ,
         0.        ],
        [0.21294172, 0.        , 0.        , ..., 1.        , 1.        ,
         0.        ],
        [0.13446535, 0.41937046, 0.        , ..., 1.        , 1.        ,
         0.        ],
        ...,
        [0.19596145, 0.55786925, 0.        , ..., 1.        , 1.        ,
         0.        ],
        [0.17072051, 0.        , 0.        , ..., 1.        , 1.        ,
         0.        ],
        [0.21156494, 0.        , 0.        , ..., 1.        , 1.        ,
         0.        ]]),
 0       208500
 1       181500
 2       223500
 3       140000
 4       250000
          ...  
 1455    175000
 1456    210000
 1457    266500
 1458    142125
 1459    147500
 Name: SalePrice, Length: 1460, dtype: int64)

In [145]:
model = initialize_model(X_preproc)
model.fit(X_preproc,y, batch_size=16, epochs = 280, verbose=0)
#model.fit(X_preproc, y, batch_size=16, epochs = 200, verbose=0)

<tensorflow.python.keras.callbacks.History at 0x15ab7c550>

In [109]:
model.count_params()

21870

In [146]:
y_pred = model.predict(preproc.transform(X_test))
y_pred

array([[127331.63],
       [162913.5 ],
       [178692.36],
       ...,
       [165877.6 ],
       [101068.05],
       [212998.38]], dtype=float32)

In [79]:
preproc.transform(X_test)

array([[0.12895824, 0.        , 0.        , ..., 1.        , 1.        ,
        0.        ],
       [0.22831574, 0.        , 0.        , ..., 1.        , 1.        ,
        0.        ],
       [0.13630106, 0.33946731, 0.        , ..., 1.        , 1.        ,
        0.        ],
       ...,
       [0.20422212, 0.        , 0.        , ..., 1.        , 1.        ,
        0.        ],
       [0.1459385 , 0.        , 0.        , ..., 1.        , 1.        ,
        0.        ],
       [0.15190454, 0.48619855, 0.        , ..., 1.        , 1.        ,
        0.        ]])

💾 Save your predictions in a Dataframe called `results` with the format required by Kaggle so that when you will export it to a `.csv`, Kaggle can read it.

In [116]:
X_test

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,...,120,0,,MnPrv,,0,6,2010,WD,Normal
1,1462,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,...,0,0,,,Gar2,12500,6,2010,WD,Normal
2,1463,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,...,0,0,,MnPrv,,0,3,2010,WD,Normal
3,1464,60,RL,78.0,9978,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,6,2010,WD,Normal
4,1465,120,RL,43.0,5005,Pave,,IR1,HLS,AllPub,...,144,0,,,,0,1,2010,WD,Normal
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1454,2915,160,RM,21.0,1936,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,6,2006,WD,Normal
1455,2916,160,RM,21.0,1894,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,4,2006,WD,Abnorml
1456,2917,20,RL,160.0,20000,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,9,2006,WD,Abnorml
1457,2918,85,RL,62.0,10441,Pave,,Reg,Lvl,AllPub,...,0,0,,MnPrv,Shed,700,7,2006,WD,Normal


In [147]:
results = pd.DataFrame(y_pred, columns=['SalePrice'])
results = pd.concat([X_test[['Id']],results], axis = 1)
results

Unnamed: 0,Id,SalePrice
0,1461,127331.632812
1,1462,162913.500000
2,1463,178692.359375
3,1464,189830.218750
4,1465,183711.250000
...,...,...
1454,2915,98789.226562
1455,2916,74510.460938
1456,2917,165877.593750
1457,2918,101068.046875


📤  Export your results using the Kaggle's submission format and submit it online!

_(Uncomment the last cell of this notebook)_

In [148]:
results.to_csv("submission_final.csv", header = True, index = False)

---

🏁 Congratulations !

💾 Don't forget to `git add/commit/push` your notebook...

🚀 ... it's time for the Recap !