# Linear regression

In this exercise you will use linear regression to predict flat prices. Training will be handled via gradient descent and we will:
* have multiple features (i.e. variables used to make the prediction),
* employ some basic feature engineering,
* work with a non-standard loss function.

Let's start with getting the data.

In [39]:
%matplotlib inline

#!wget -O mieszkania.csv https://www.dropbox.com/s/zey0gx91pna8irj/mieszkania.csv?dl=1
#!wget -O mieszkania_test.csv https://www.dropbox.com/s/dbrj6sbxb4ayqjz/mieszkania_test.csv?dl=1

In [40]:
!head mieszkania.csv mieszkania_test.csv

==> mieszkania.csv <==
m2,dzielnica,ilość_sypialni,ilość_łazienek,rok_budowy,parking_podziemny,cena
104,mokotowo,2,2,1940,1,780094
43,ochotowo,1,1,1970,1,346912
128,grodziskowo,3,2,1916,1,523466
112,mokotowo,3,2,1920,1,830965
149,mokotowo,3,3,1977,0,1090479
80,ochotowo,2,2,1937,0,599060
58,ochotowo,2,1,1922,0,463639
23,ochotowo,1,1,1929,0,166785
40,mokotowo,1,1,1973,0,318849

==> mieszkania_test.csv <==
m2,dzielnica,ilość_sypialni,ilość_łazienek,rok_budowy,parking_podziemny,cena
71,wolowo,2,2,1912,1,322227
45,mokotowo,1,1,1938,0,295878
38,mokotowo,1,1,1999,1,306530
70,ochotowo,2,2,1980,1,553641
136,mokotowo,3,2,1939,1,985348
128,wolowo,3,2,1983,1,695726
23,grodziskowo,1,1,1975,0,99751
117,mokotowo,3,2,1942,0,891261
65,ochotowo,2,1,2002,1,536499


Each row in the data represents a separate property. Our goal is to use the data from `mieszkania.csv` to create a model that can predict a property's price (i.e. `cena`) given its features (i.e. `m2,dzielnica,ilosc_sypialni,ilosc_lazienek,rok_budowy,parking_podziemny`).

From now on, we should interfere only with `mieszkania.csv` (dubbed the training dataset) to make our decisions and create the model. The (only) purpose of `mieszkania_test.csv` is to test our model on **unseen** data.

Our predictions should minimize the so-called mean squared logarithmic error:
$$
MSLE = \frac{1}{n} \sum_{i=1}^n (\log(1+y_i) - \log(1+p_i))^2,
$$
where $y_i$ is the ground truth, and $p_i$ is our prediction.

Let's start with implementing the loss function.

In [41]:
from typing import Tuple

import numpy as np
import pandas as pd
from sklearn import preprocessing

def msle(ys, ps):
    assert len(ys) == len(ps)

    n = len(ys)
    result = (1/n) * np.sum((np.log(1 + ys) - np.log(1 + ps))**2)

    return result

The simplest model is predicting the same constant for each instance. Test your implementation of msle against outputing the mean price.

In [42]:
# load and prepare the data

def load(name):
    data = pd.read_csv(name)
    xs = data[data.columns[:-1]].to_numpy()
    ys = data[data.columns[-1]].to_numpy()
    return (xs, ys)

x_train, y_train = load('mieszkania.csv')
x_test, y_test = load('mieszkania_test.csv')
districts = len(np.unique(x_train.T[1]))

def one_hot(arr):
    columns = arr.shape[1]

    for i in range(columns):
        if type(arr[0][i]) != int :

            column = arr[:, i]
            rest = arr[:, [j != i for j in range(columns)]]

            unique = np.unique(column).reshape(1,-1)
            one_hot = np.vstack([column]*len(unique))
            one_hot = (one_hot.T == unique).astype(int)

            return np.hstack([rest, one_hot])

def add_column(value, arr):
    ones_column = np.ones(x_train.shape[0]).reshape(-1,1)
    return np.hstack([ones_column * value, arr])


In [43]:

x_train, y_train

x_train = one_hot(x_train)

x_train = add_column(1, x_train).astype(np.float64)

#normalize
x_norm = []
y_norm = np.max(y_train)

for i in range(x_train.shape[1]):
    max = np.max(x_train.T[i])
    x_norm.append(max)
    x_train.T[i] /= max

y_train = (y_train / np.max(y_train)).reshape(-1,1)

In [44]:
###################################################
# TODO: Compute msle for outputing the mean price #
###################################################

def mse(xs, ps):
    assert len(xs) == len(ps)

    n = len(xs)
    sum = np.sum((xs - ps) ** 2)

    return (1/n) * sum

def msle(xs, ps):
    assert len(xs) == len(ps)

    n = len(xs)
    sum = np.sum((np.log(1 + xs) - np.log(1 + ps)) ** 2)

    return (1/n) * sum

mean_price = np.mean(y_train)

msle(y_train, np.ones(x_train.shape[0]) * mean_price)

np.float64(5.468949827661844)

Recall that outputing the mean minimzes $MSE$. However, we're now dealing with $MSLE$.

Think of a constant that should result in the lowest $MSLE$.

In [45]:
#############################################
# TODO: Find this constant and compute msle #
#############################################

# (log(1 + p) - log(1 + x)) ** 2

msle_mean = (np.e ** ((1 / len(y_train)) * np.sum(np.log(1 + y_train)))) - 1

msle(y_train, np.ones(x_train.shape[0]) * msle_mean)

np.float64(5.431039782425594)

Now, let's implement a standard linear regression model.

In [46]:
n, features = x_train.shape
w = np.zeros(features).reshape(-1, 1)
lr = 0.2 # step size
t = 0.99
n_epochs = 250


In [51]:

#print(y_train.shape)

def predict(w, xs):
    return xs @ w

def evaluate(w, xs, ys):
    return mse(ys, predict(w, xs))

def evaluate_msle(w, xs, ys):
    return msle(ys, predict(w, xs))

def linear_regression(n_epochs, lr, w, x, y, eval_fun = evaluate):
    losses = [eval_fun(w, x, y)]

    for i in range(n_epochs):
        y_hat = predict(w, x)
        dJdwi = np.sum((y_hat - y) * x, axis = 0)

        w = w - (2/n) * (lr * dJdwi).reshape(-1,1)

        loss = eval_fun(w, x, y)
        losses.append(loss)

        if i == 0 or (i+1) % 200 == 0:
            print(f'Iter: {i:>3} Loss: {loss:8.8f} w: {[f"{x[0]:.2f}" for x in w]}')

    return (w, losses)

(w_mse, losses1) = linear_regression(n_epochs, lr, w, x_train, y_train, evaluate_msle)

""
import plotly.express as px
def plot_loss(l):
    fig = px.line(y=l, labels={'y':'loss'})
    fig.show()

plot_loss(losses1)

Iter:   0 Loss: 0.03616379 w: ['0.18', '0.12', '0.14', '0.12', '0.18', '0.09', '0.03', '0.06', '0.06', '0.04']
Iter: 199 Loss: 0.00208007 w: ['-0.03', '0.46', '0.26', '0.18', '-0.03', '0.02', '-0.13', '0.08', '0.08', '-0.07']


Note that the loss function that the algorithms optimizes (i.e $MSE$) differs from $MSLE$. We've already seen that this may result in a suboptimal solution.

How can you change the setting so that we optimze $MSLE$ instead?

Hint:
<sub><sup><sub><sup><sub><sup>
Be lazy. We don't want to change the algorithm.
</sup></sub></sup></sub></sup></sub>

In [52]:
#############################################
# TODO: Optimize msle and compare the error #
#############################################

(w_msle, l2) = linear_regression(n_epochs, lr, w, np.log(x_train + 1), np.log(y_train + 1))

plot_loss(l2)

Iter:   0 Loss: 0.02699879 w: ['0.10', '0.07', '0.08', '0.07', '0.10', '0.05', '0.02', '0.03', '0.03', '0.02']
Iter: 199 Loss: 0.00154459 w: ['-0.04', '0.43', '0.29', '0.22', '-0.04', '0.01', '-0.13', '0.07', '0.07', '-0.06']


Without any feature engineering our model approximates the price as a linear combination of original features:
$$
\text{price} \approx w_1 \cdot \text{area} + w_2 \cdot \text{district} + \dots.
$$
Let's now introduce some interactions between the variables. For instance, let's consider a following formula:
$$
\text{price} \approx w_1 \cdot \text{area} \cdot \text{avg. price in the district per sq. meter} + w_2 \cdot \dots + \dots.
$$
Here, we model the price with far greater granularity, and we may expect to see more acurate results.

Add some feature engineering to your model. Be sure to play with the data and not with the algorithm's code.

Think how to make sure that your model is capable of capturing the $w_1 \cdot \text{area} \cdot \text{avg. price...}$ part, without actually computing the averages.

Hint:
<sub><sup><sub><sup><sub><sup>
Is having a binary encoding for each district and multiplying it by area enough?
</sup></sub></sup></sub></sup></sub>

Hint 2:
<sub><sup><sub><sup><sub><sup>
Why not multiply everything together? I.e. (A,B,C) -> (AB,AC,BC).
</sup></sub></sup></sub></sup></sub>

In [53]:
###############################################
# TODO: Implement the feature engineering part #
###############################################

def add_avg_price(x_train, districts = districts):
    districts_cols = x_train.T[-districts:]
    avg_price = x_train.T[1] * districts_cols
    ex_train = np.vstack([x_train.T, avg_price]).T #appended new features

    return ex_train

ex_train = add_avg_price(x_train)

#print(ex_train.T[9:])
efeatures = ex_train.shape[1]
ew = np.zeros(efeatures).reshape(-1, 1)

(ew_mse, _) = linear_regression(n_epochs, lr, ew, ex_train, y_train, eval_fun=evaluate_msle)
print('optimizing mse^\n')
(ew_msle, _) = linear_regression(n_epochs, lr, ew, np.log(1 + ex_train), np.log(1 + y_train))
print('optimizing msle^')


Iter:   0 Loss: 0.03794292 w: ['0.18', '0.12', '0.14', '0.12', '0.18', '0.09', '0.03', '0.06', '0.06', '0.04', '0.02', '0.04', '0.04', '0.03']
Iter: 199 Loss: 0.00067180 w: ['-0.02', '0.40', '0.23', '0.15', '-0.02', '0.01', '-0.03', '0.02', '-0.00', '-0.01', '-0.10', '0.22', '0.28', '-0.00']
optimizing mse^

Iter:   0 Loss: 0.02467929 w: ['0.10', '0.07', '0.08', '0.07', '0.10', '0.05', '0.02', '0.03', '0.03', '0.02', '0.01', '0.02', '0.02', '0.02']
Iter: 199 Loss: 0.00079184 w: ['-0.03', '0.38', '0.26', '0.19', '-0.03', '0.01', '-0.06', '0.04', '0.02', '-0.02', '-0.03', '0.16', '0.20', '0.04']
optimizing msle^


In [54]:
### Check models on test data

x_norm, y_norm

def prepare_x(x, features):
    x = one_hot(x)
    x = add_column(1, x)

    for i in range(x.shape[1]):
        max = x_norm[i]
        x.T[i] /= max
    
    if features == x.shape[1]:
        return x
    else:
        return add_avg_price(x)

def check_loss(w, x_ = x_test, y_ = y_test):
    features = w.shape[0]
    x_ = prepare_x(x_, features).astype(np.float64)  
    y_= (y_/ y_norm).reshape(-1,1)

    print(f"MSE: {mse(predict(w, x_), y_)} \nMSLE: {msle(predict(w, x_), y_)}\n")

print("Test accuracies:")
check_loss(w_mse)
print("omse^     omsle\\/")
check_loss(w_msle)
check_loss(ew_mse)
print("eomse^    ewomsle\\/")
check_loss(ew_msle)

Test accuracies:
MSE: 0.003997138382053667 
MSLE: 0.0018071128203093123

omse^     omsle\/
MSE: 0.004420362293649104 
MSLE: 0.0020094707581270743

MSE: 0.0012601271497606206 
MSLE: 0.0006040250205272509

eomse^    ewomsle\/
MSE: 0.002353637445160785 
MSLE: 0.0010727266737188587

