## Exercise 2

The goal of this exercise is to predict house prices in Boston. Load the dataset from the text file boston.txt using the function np.genfromtxt. Use the first 100 rows for testing, the next 50 rows for validation (i.e., for tuning hyperparameters), and the rest of the dataset for fitting your linear model. You do not have to introduce any additional features. For each dataset report the test mean squared error.


In [1]:
import numpy as np
import pandas as pd

In [2]:
data = np.genfromtxt('boston.txt', skip_header = 22)

In [3]:
data_set = pd.DataFrame(data, columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV'])

I drop the "B" feature, which is derived from the proportion of Black residents per town, because I don't want my model to be racist.

In [4]:
data_set = data_set.drop('B', axis = 1)

In [5]:
data_set

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 5.33 | 36.2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | 0.06263 | 0.0 | 11.93 | 0.0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1.0 | 273.0 | 21.0 | 9.67 | 22.4 |
| 502 | 0.04527 | 0.0 | 11.93 | 0.0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1.0 | 273.0 | 21.0 | 9.08 | 20.6 |
| 503 | 0.06076 | 0.0 | 11.93 | 0.0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1.0 | 273.0 | 21.0 | 5.64 | 23.9 |
| 504 | 0.10959 | 0.0 | 11.93 | 0.0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1.0 | 273.0 | 21.0 | 6.48 | 22.0 |


### Data split - features and targets

In [6]:
X_train = data_set.iloc[150:, :-1]  # features
y_train = data_set.iloc[150:, -1]   # last column - target median value of owner-occupied homes in $1000's

X_test = data_set.iloc[:100, :-1]
y_test = data_set.iloc[:100, -1]

X_val = data_set.iloc[100:150, :-1]
y_val = data_set.iloc[100:150, -1]

### Linear regression using the MLE estimator

Maximum likelihood estimate (MLE) of the weights:
$$ w_{MLE} = (X^TX)^{-1}X^Ty $$
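For reference, a short derivation of this formula. It assumes Gaussian noise on the targets, so that maximizing the likelihood is equivalent to minimizing the squared error, and it assumes that $X^TX$ is invertible:

$$ w_{MLE} = \arg\min_w \|y - Xw\|^2, \qquad \nabla_w \|y - Xw\|^2 = -2X^T(y - Xw) = 0 \;\Rightarrow\; X^TXw = X^Ty \;\Rightarrow\; w_{MLE} = (X^TX)^{-1}X^Ty $$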

In [7]:
X_train = X_train.to_numpy()
y_train = y_train.to_numpy()

w_mle = np.dot(np.dot(np.linalg.inv(np.dot(X_train.T, X_train)), X_train.T), y_train)
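Forming the explicit inverse works here, but solving the linear system directly is usually the better-conditioned option. The following is a minimal sketch on synthetic data (the `*_demo` arrays and the random data are illustrative assumptions, not the Boston set) that compares the formula above with `np.linalg.solve` and `np.linalg.lstsq`:

```python
import numpy as np

# Synthetic stand-in data (assumption): 200 samples, 12 features, known weights.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 12))
w_true = rng.normal(size=12)
y_demo = X_demo @ w_true + 0.1 * rng.normal(size=200)

# Formula as used in the notebook: explicit inverse of X^T X.
w_inv = np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T @ y_demo

# Same estimate, obtained by solving the normal equations (X^T X) w = X^T y.
w_solve = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

# Least-squares solver, which avoids forming X^T X at all.
w_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)
```

On well-conditioned data all three agree up to floating-point error; the difference matters mainly when features are strongly correlated.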

Parameters of the linear regression, computed with the MLE formula:

In [8]:
w_mle

array([-1.10360975e-01,  6.84258958e-02, -7.31769025e-02,  3.00158574e+00,
       -6.71850899e+00,  5.87379340e+00,  2.07389281e-02, -1.27870759e+00,
        8.07953547e-03, -4.59799140e-03,  1.08706374e-01, -5.53690108e-01])

In [9]:
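def predict(weights, features):
    """Return one prediction per row of `features` (dot product with `weights`)."""
    # np.asarray accepts both a DataFrame and an ndarray; the matrix-vector
    # product replaces an explicit loop over the rows.
    return np.asarray(features) @ weights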


test_pred_mle = predict(w_mle, X_test)

#### Evaluation

In [10]:
mse_mle = np.square(np.subtract(y_test, test_pred_mle)).mean()

In [11]:
mse_mle

11.959718840650325

### Linear regression using the ridge regression estimator

Ridge regression estimator formula:
$$ w_{ridge} = ( \lambda I + X^TX)^{-1}X^Ty $$

In [12]:
lam = 0.6

# lam * I needs one row and column per feature, so take the size from X_train
w_ridge = np.dot(np.dot(np.linalg.inv(lam * np.identity(X_train.shape[1]) + np.dot(X_train.T, X_train)), X_train.T), y_train)

In [13]:
test_pred_ridge = predict(w_ridge, X_test)

#### Evaluation

In [14]:
mse_ridge = np.square(np.subtract(y_test, test_pred_ridge)).mean()

In [15]:
mse_ridge

11.796587217512931
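The value `lam = 0.6` is fixed by hand above, while the task reserves a validation split for tuning hyperparameters. One possible way to pick the regularization strength is a small grid search scored on the validation rows. The sketch below is only an illustration: the helper names (`ridge_fit`, `mse`, `select_lambda`), the candidate grid, and the synthetic data are all assumptions and not part of the original notebook.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge weights: solve (lam*I + X^T X) w = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(lam * np.identity(n_features) + X.T @ X, X.T @ y)

def mse(y_true, y_pred):
    """Mean squared error between two 1-D arrays."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def select_lambda(X_tr, y_tr, X_va, y_va, candidates):
    """Return (best_lam, best_val_mse) over the candidate values."""
    best_lam, best_mse = None, np.inf
    for lam in candidates:
        w = ridge_fit(X_tr, y_tr, lam)      # fit on training rows only
        val_mse = mse(y_va, X_va @ w)       # score on validation rows
        if val_mse < best_mse:
            best_lam, best_mse = lam, val_mse
    return best_lam, best_mse

# Synthetic stand-in data (assumption): 300 samples, 12 features.
rng = np.random.default_rng(0)
X_all = rng.normal(size=(300, 12))
y_all = X_all @ rng.normal(size=12) + rng.normal(size=300)
X_tr, y_tr = X_all[50:], y_all[50:]
X_va, y_va = X_all[:50], y_all[:50]

grid = np.logspace(-3, 3, 13)  # hypothetical candidate grid
best_lam, best_val_mse = select_lambda(X_tr, y_tr, X_va, y_va, grid)
```

In the notebook this would be called with `X_train`, `y_train`, `X_val.to_numpy()` and `y_val.to_numpy()`, and the test MSE would then be computed once for the selected value, so that the test rows play no part in choosing `lam`.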