# Validation and cross-validation

In this exercise you will implement a validation pipeline.

At the end of the MSLE exercise you tested your model against the training and test datasets. You should have observed a gap between the two results. Validating your model lets you anticipate its test-time performance and also gives you a way to compare different models.

Implement the basic validation method, i.e. a random split. Test it with your model from Exercise MSLE.

In [124]:
%matplotlib inline

#!wget -O mieszkania.csv https://www.dropbox.com/s/zey0gx91pna8irj/mieszkania.csv?dl=1
#!wget -O mieszkania_test.csv https://www.dropbox.com/s/dbrj6sbxb4ayqjz/mieszkania_test.csv?dl=1

In [125]:
from typing import Tuple

import numpy as np
import pandas as pd
from sklearn import preprocessing

np.random.seed(357)

In [126]:
def load(name: str) -> Tuple[np.ndarray, np.ndarray]:
    data = pd.read_csv(name)
    x = data.loc[:, data.columns != 'cena'].to_numpy()
    y = data['cena'].to_numpy()

    return x, y

In [127]:
x_train, y_train = load('mieszkania.csv')
x_test, y_test = load('mieszkania_test.csv')

x_test, y_test

(array([[71, 'wolowo', 2, 2, 1912, 1],
        [45, 'mokotowo', 1, 1, 1938, 0],
        [38, 'mokotowo', 1, 1, 1999, 1],
        ...,
        [89, 'wolowo', 2, 2, 1922, 1],
        [40, 'wolowo', 1, 1, 1959, 0],
        [68, 'grodziskowo', 2, 1, 1927, 0]], dtype=object),
 array([ 322227,  295878,  306530,  553641,  985348,  695726,   99751,
         891261,  536499,  527093,  861472,  701472,  429776,  547725,
         669560,  318362, 1140170,  341242,  113580,  456093,  470730,
         421012,  617318,  796117,  138901,  857820,  939450,  398165,
         944399, 1025413,  522440,  344346,  145702,  246712,  574154,
         807608,  568048,  412494,  588840,  766040,  979540, 1044803,
         742235,  758936,  388672,  178238,  530053, 1150687,  587013,
         269316,  270969, 1008103,  299708,  393925,  511106,  947932,
         127717,  752428, 1185932,  330988,  330699,  403778,  584561,
         795392,  602356,  680512,  202121,  888872,  456054,  227841,
         343730,  

In [175]:
labelencoder = preprocessing.LabelEncoder()
labelencoder.fit(x_train[:, 1])
x_train[:, 1] = labelencoder.transform(x_train[:, 1])
x_test[:, 1] = labelencoder.transform(x_test[:, 1])

x_train = x_train.astype(np.float64)
x_test = x_test.astype(np.float64)

y_train, y_test = y_train.reshape(-1, 1), y_test.reshape(-1, 1)
x_train

array([[1.040e+02, 1.000e+00, 2.000e+00, 2.000e+00, 1.940e+03, 1.000e+00],
       [4.300e+01, 2.000e+00, 1.000e+00, 1.000e+00, 1.970e+03, 1.000e+00],
       [1.280e+02, 0.000e+00, 3.000e+00, 2.000e+00, 1.916e+03, 1.000e+00],
       ...,
       [1.070e+02, 0.000e+00, 2.000e+00, 2.000e+00, 1.935e+03, 0.000e+00],
       [1.170e+02, 0.000e+00, 3.000e+00, 2.000e+00, 1.978e+03, 1.000e+00],
       [5.600e+01, 3.000e+00, 2.000e+00, 1.000e+00, 1.923e+03, 0.000e+00]])

In [176]:
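import plotly.express as px

def plot_loss(l):
    # Plot the loss history (one value per epoch) as a line chart.
    fig = px.line(y=l, labels={'y': 'loss'})
    fig.show()

def mse(xs, ps):
    # Mean squared error between targets xs and predictions ps.
    assert len(xs) == len(ps)

    n = len(xs)
    total = np.sum((xs - ps) ** 2)  # avoid shadowing the built-in sum

    return total / n

def msle(xs, ps):
    # Mean squared logarithmic error between targets xs and predictions ps.
    assert len(xs) == len(ps)

    n = len(xs)
    total = np.sum((np.log1p(xs) - np.log1p(ps)) ** 2)  # log1p(v) == log(1 + v)

    return total / n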

In [177]:
n, features = x_train.shape
w = np.zeros(features).reshape(-1, 1)
lr = 1e-3 # step size
t = 0.99
n_epochs = 60


In [178]:

#print(y_train.shape)

def predict(w, xs):
    return xs @ w

def evaluate(w, xs, ys):
    return msle(ys, predict(w, xs))

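def linear_regression(n_epochs, lr, w, x, y):
    # Batch gradient descent on the squared error; returns final weights and loss history.
    n = len(x)  # size of the data passed in, so the function also works on subsets
    losses = [evaluate(w, x, y)]

    for i in range(n_epochs):
        y_hat = predict(w, x)
        # Per-weight gradient of the summed squared error (the factor 2/n is applied below).
        dJdwi = np.sum((y_hat - y) * x, axis=0)

        w = w - (2/n) * (lr * dJdwi).reshape(-1, 1)

        loss = evaluate(w, x, y)
        losses.append(loss)

        if i == 0 or (i+1) % 20 == 0:
            print(f'Iter: {i:>3} Loss: {loss:8.8f} w: {[f"{wi[0]:.2f}" for wi in w]}')

    return (w, losses)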

(w_msle, losses) = linear_regression(n_epochs, lr, w, np.log(1 + x_train), np.log(1 + y_train))

plot_loss(losses)

Iter:   0 Loss: 2.32887578 w: ['0.11', '0.02', '0.03', '0.02', '0.20', '0.01']
Iter:  19 Loss: 0.00116427 w: ['0.68', '0.13', '0.17', '0.15', '1.21', '0.05']
Iter:  39 Loss: 0.00023183 w: ['0.71', '0.14', '0.18', '0.16', '1.25', '0.06']
Iter:  59 Loss: 0.00023095 w: ['0.71', '0.14', '0.18', '0.16', '1.25', '0.06']


In [132]:
#######################################################
# TODO: Implement the basic validation method,        #
# compare MSLE on training, validation, and test sets #
#######################################################
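
One possible starting point for the random split is sketched below. It is not the reference solution: the helper name `random_split`, the `val_fraction` parameter, and the synthetic `x_demo`/`y_demo` arrays are assumptions made for illustration only.

```python
import numpy as np


def random_split(x, y, val_fraction=0.2, rng=None):
    """Shuffle the rows and hold out a fraction of them for validation."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(x)
    perm = rng.permutation(n)              # random order of the row indices
    n_val = int(round(val_fraction * n))   # number of held-out rows
    val_idx, train_idx = perm[:n_val], perm[n_val:]
    return x[train_idx], y[train_idx], x[val_idx], y[val_idx]


# Synthetic stand-in data (hypothetical, not mieszkania.csv).
rng = np.random.default_rng(357)
x_demo = rng.uniform(1.0, 10.0, size=(100, 3))
y_demo = x_demo @ np.array([[1.0], [2.0], [3.0]])

x_tr, y_tr, x_val, y_val = random_split(x_demo, y_demo, val_fraction=0.2, rng=rng)
print(x_tr.shape, x_val.shape)
```

With such a split in hand, the model would be fitted on the training part only, and the same error measure would then be reported on the training, validation, and test parts so the three numbers can be compared.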


To make random-split validation reliable, a large amount of training data may be needed. To overcome this problem, one may apply cross-validation.

![10-fold cross-validation](https://chrisjmccormick.files.wordpress.com/2013/07/10_fold_cv.png)

Let's now implement the method. Make sure that:
* the number of partitions is a parameter,
* the method is not limited to `mieszkania.csv`,
* the method is not limited to one specific model.

In [133]:
####################################
# TODO: Implement cross-validation #
####################################
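
A minimal sketch of k-fold cross-validation that meets the three requirements above might look as follows. The names `cross_validate`, `fit`, and `score`, the least-squares stand-in model, and the synthetic data are all assumptions for illustration; any pair of fitting and scoring callables could be plugged in instead.

```python
import numpy as np


def cross_validate(x, y, k, fit, score, rng=None):
    """Return the k validation scores of a k-fold cross-validation.

    fit(x, y) -> model and score(model, x, y) -> float are supplied by the
    caller, so the procedure is tied to neither a dataset nor a model.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Shuffle the row indices once and cut them into k nearly equal folds.
    folds = np.array_split(rng.permutation(len(x)), k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(x[train_idx], y[train_idx])
        scores.append(score(model, x[val_idx], y[val_idx]))
    return np.array(scores)


# Hypothetical stand-in model: ordinary least squares.
def fit_lstsq(x, y):
    w, *_ = np.linalg.lstsq(x, y, rcond=None)
    return w


def mse_score(w, x, y):
    return float(np.mean((x @ w - y) ** 2))


# Synthetic data (hypothetical): a noisy linear relation.
rng = np.random.default_rng(357)
x_demo = rng.normal(size=(50, 3))
y_demo = x_demo @ np.array([[1.0], [-2.0], [0.5]]) + 0.1 * rng.normal(size=(50, 1))

scores = cross_validate(x_demo, y_demo, k=5, fit=fit_lstsq, score=mse_score, rng=rng)
print(scores.mean(), scores.std())
```

Reporting both the mean and the spread of the fold scores gives a rough idea of how much the estimate depends on the particular partition.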

Recall that validation can sometimes be tricky, e.g. when there is significant class imbalance, a small number of subjects, or geographically clustered instances.

What could, in theory, go wrong here with random, unstratified partitions? Think about potential solutions and investigate the data to check whether these problems arise here.

In [134]:
##############################
# TODO: Investigate the data #
##############################
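
One way to start the investigation is to check how a categorical feature is distributed across random folds. The sketch below is only an illustration under stated assumptions: the helper `category_counts_per_fold` and the synthetic, deliberately imbalanced `demo` array are hypothetical, and in the notebook the encoded district column `x_train[:, 1]` could be passed in their place.

```python
import numpy as np


def category_counts_per_fold(categories, k, rng=None):
    """Count how often each category value lands in each of k random folds."""
    rng = np.random.default_rng() if rng is None else rng
    categories = np.asarray(categories)
    labels = np.unique(categories)
    folds = np.array_split(rng.permutation(len(categories)), k)
    # Rows correspond to folds, columns to the sorted category labels.
    counts = np.array(
        [[np.sum(categories[fold] == label) for label in labels] for fold in folds]
    )
    return labels, counts


# Synthetic, deliberately imbalanced categorical feature (hypothetical data).
rng = np.random.default_rng(357)
demo = rng.choice([0, 1, 2, 3], size=200, p=[0.6, 0.25, 0.1, 0.05])

labels, counts = category_counts_per_fold(demo, k=10, rng=rng)
print(labels)
print(counts)
```

If some folds contain very few or no rows of a rare category, unstratified partitions may give unstable validation scores, and a stratified or grouped split would be worth considering.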