# Homework 5

This homework asks you to perform various experiments with ensemble methods.

The dataset is the same real estate dataset we previously used from:

https://www.kaggle.com/datasets/mirbektoktogaraev/madrid-real-estate-market

You will write code and discussion into code and text cells in this notebook.

If a code block starts with TODO:, this means that you need to write something there.

There are also markdown blocks with questions. Write the answers to these questions in the specified locations.

Some code has been written for you to guide the project. Don't change the already written code.

## Grading
The points add up to 10. Extensive partial credit will be offered. Thus, make sure that you are at least attempting all problems.

Make sure to comment your code, such that the grader can understand what different components are doing or attempting to do.

In [1]:
import numpy as np
import pandas as pd
import sklearn.linear_model
import sklearn.metrics
import sklearn.ensemble


# A. Setup.

In this project we are going to work in a multi-variable setting.

This time, there are 7 explanatory variables: ``sq_mt_built``, ``n_rooms``, ``n_bathrooms``, ``has_individual_heating``, ``is_renewal_needed``, ``is_new_development`` and ``has_fitted_wardrobes``.

We will first create the training and test data while doing some minimal data cleaning.

In [2]:
df = pd.read_csv("houses_Madrid.csv")
#print(f"The columns of the database {df.columns}")

xfields = ["sq_mt_built", "n_rooms", "n_bathrooms", "has_individual_heating", \
           "is_renewal_needed", "is_new_development", "has_fitted_wardrobes"]
yfield = ["buy_price"]
# print (xfields + yfield)
dfsel = df[xfields + yfield]
dfselnona = dfsel.dropna()
df_shuffled = dfselnona.sample(frac=1) # shuffle the rows
x = df_shuffled[xfields].to_numpy(dtype=np.float64)
y = df_shuffled[yfield].to_numpy(dtype=np.float64)
print(x.shape)
training_data_x = x[:8000]
training_data_y = y[:8000]
test_data_x = x[8000:]
test_data_y = y[8000:]
print(f"Training data is composed of {len(training_data_x)} samples.")
print(f"Test data is composed of {len(test_data_x)} samples.")
# print(test_data_x[45])

(9764, 7)
Training data is composed of 8000 samples.
Test data is composed of 1764 samples.


# B. Creating a linear regression multi-variable baseline.

In this section we make a linear regression predictor for the multi-variable case. We also check the performance of the resulting regressor, and print the error.

This part has been done for you, so that the work does not depend on you importing parts from the previous projects.

You will need to adapt this for the other models.

In [3]:
# training the linear regressor
regressor = sklearn.linear_model.LinearRegression()
regressor.fit(training_data_x, training_data_y)
# We will create the predictions yhat for every x from the test data.
# We will do this one at a time. This is not an efficient way to do it,
# but it allows you to write and debug functions that return a scalar number.
yhats = []
for x in test_data_x:
    yhat = regressor.predict([x])
    yhats.append(yhat[0])

# Now, print some examples of the quality of the regressor
examples = [45, 67, 170, 189, 207]
for i in examples:
    x = test_data_x[i]
    y = test_data_y[i]
    yhat = regressor.predict([x])[0][0]
    print(f"House {i} with {x[0]} sqmt was sold for {y} euros, but our system predicted {yhat:.2f}")

# Now calculate the root mean square error on the resulting arrays.
# np.sqrt of the MSE gives the RMSE without relying on the squared=False
# argument, which is deprecated in newer sklearn releases.
error = np.sqrt(sklearn.metrics.mean_squared_error(test_data_y, yhats))
print(f"The root mean square error of the linear regression is {error:.2f} euro")

House 45 with 239.0 sqmt was sold for [600000.] euros, but our system predicted 1041267.85
House 67 with 345.0 sqmt was sold for [1375000.] euros, but our system predicted 1643583.67
House 170 with 118.0 sqmt was sold for [480000.] euros, but our system predicted 553617.87
House 189 with 670.0 sqmt was sold for [2800000.] euros, but our system predicted 2888539.26
House 207 with 216.0 sqmt was sold for [1180000.] euros, but our system predicted 983130.87
The root mean square error of the linear regression is 351143.43 euro
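
The loop in the cell above predicts one row at a time for easier debugging. As a rough illustration of the more efficient alternative, the sketch below checks that a single `predict` call on the whole test matrix gives the same values. It runs on synthetic stand-in data, and the names `demo_x`, `demo_y` and `demo_regressor` are made up for this example; they are not part of the homework code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumption): 7 features and a 2D target column,
# shaped like training_data_x / training_data_y in this notebook.
rng = np.random.default_rng(0)
demo_x = rng.normal(size=(200, 7))
demo_y = demo_x @ rng.normal(size=(7, 1)) + rng.normal(scale=0.1, size=(200, 1))

demo_regressor = LinearRegression().fit(demo_x[:150], demo_y[:150])

# One predict call per row, mirroring the loop used in this notebook.
loop_yhats = np.array([demo_regressor.predict([row])[0] for row in demo_x[150:]])

# One predict call on the whole held-out matrix.
batch_yhats = demo_regressor.predict(demo_x[150:])

# Root mean square error, computed directly from its definition.
rmse = np.sqrt(np.mean((batch_yhats - demo_y[150:]) ** 2))
print(np.allclose(loop_yhats, batch_yhats), f"RMSE on the synthetic data: {rmse:.3f}")
```

Both variants produce an array with one prediction per test row, so either can be passed to the error computation.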


# P1: Random Forest using sklearn (5 points)

Use the RandomForestRegressor class from sklearn to predict the prices of the houses. Print the resulting error and samples, similar to the way in Section B.

Experiment with the settings of the hyperparameters: n_estimators (try at least values 10, 25, 100, 200) and max_depth (try at least values 1, 2, 4, 8, 16 and None).

Retain the hyperparameter values that give you the best result.
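
One possible way to organize this experiment is a loop over all combinations of the two hyperparameters that records the test error of each. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_regression` instead of the Madrid dataset, and the names `rf_results` and `best_rf_setting` are hypothetical.

```python
import itertools
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Synthetic stand-in data (assumption), not the Madrid real estate data.
grid_x, grid_y = make_regression(n_samples=400, n_features=7, noise=10.0, random_state=0)
train_x, test_x = grid_x[:300], grid_x[300:]
train_y, test_y = grid_y[:300], grid_y[300:]

n_estimators_values = [10, 25, 100, 200]
max_depth_values = [1, 2, 4, 8, 16, None]

# Map each (n_estimators, max_depth) pair to its RMSE on the held-out rows.
rf_results = {}
for n_est, depth in itertools.product(n_estimators_values, max_depth_values):
    model = RandomForestRegressor(n_estimators=n_est, max_depth=depth, random_state=0)
    model.fit(train_x, train_y)
    rmse = np.sqrt(mean_squared_error(test_y, model.predict(test_x)))
    rf_results[(n_est, depth)] = rmse

# The setting with the smallest error is the one to retain.
best_rf_setting = min(rf_results, key=rf_results.get)
print(f"Best (n_estimators, max_depth): {best_rf_setting}, RMSE {rf_results[best_rf_setting]:.2f}")
```

Fixing `random_state` makes the runs repeatable. Picking the setting on the test rows is a simplification that matches this homework; a separate validation split would usually be preferable.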



In [18]:
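# TODO implement here
from sklearn.ensemble import RandomForestRegressor
# Best setting found in the experiments: 25 trees with a maximum depth of 8
regressor = RandomForestRegressor(n_estimators=25, max_depth=8)
# ravel() turns the (n, 1) target column into the 1D array the forest expects
regressor.fit(training_data_x, training_data_y.ravel())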

yhats = []
for x in test_data_x:
    yhat = regressor.predict([x])
    yhats.append(yhat[0])

# Now, print some examples of the quality of the regressor
examples = [45, 67, 170, 189, 207]
for i in examples:
    x = test_data_x[i]
    y = test_data_y[i]
    #yhat = regressor.predict([x])[0][0]
    yhat = regressor.predict([x])[0]
    print(f"House {i} with {x[0]} sqmt was sold for {y} euros, but our system predicted {yhat:.2f}")

# Now calculate the root mean square error on the resulting arrays
error = np.sqrt(sklearn.metrics.mean_squared_error(test_data_y, yhats))
print(f"The root mean square error of the random forest is {error:.2f} euro")

House 45 with 239.0 sqmt was sold for [600000.] euros, but our system predicted 1326430.61
House 67 with 345.0 sqmt was sold for [1375000.] euros, but our system predicted 1851927.77
House 170 with 118.0 sqmt was sold for [480000.] euros, but our system predicted 488443.39
House 189 with 670.0 sqmt was sold for [2800000.] euros, but our system predicted 2360319.16
House 207 with 216.0 sqmt was sold for [1180000.] euros, but our system predicted 897915.97
The root mean square error of the random forest is 335896.89 euro


# Questions:
* Q: Do you find that Random Forest performs better than the previous approaches you implemented? Discuss.
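* A: Random Forest does perform better than the previous approaches that I have implemented: its root mean square error on the test data is 335896.89 euro, compared with 351143.43 euro for the linear regression baseline. It does not train as fast as some other approaches, but its runtime is still short.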
* Q: Explain the impact of the number of estimators and max tree depth hyperparameters on the accuracy. Which hyperparameter setting gives you the best value? Is this the same as the default settings in sklearn?
* A: The best settings that I got were n_estimators=25 and max_depth=8. These differ from the sklearn defaults of 100 and None, respectively.
* Q: Explain the impact of the hyperparameters on the training time.
* A: Increasing n_estimators increased the training time, since every additional tree has to be fit.
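
The effect of `n_estimators` on training time can be measured directly by timing `fit`. The following is a minimal sketch, assuming synthetic data and a fixed `max_depth=8`; the dictionary name `fit_seconds` is invented for this example, and the absolute times depend on the machine.

```python
import time
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption), not the Madrid real estate data.
timing_x, timing_y = make_regression(n_samples=500, n_features=7, noise=10.0, random_state=0)

# Record the wall-clock time of fit() for each forest size.
fit_seconds = {}
for n_est in [10, 25, 100, 200]:
    model = RandomForestRegressor(n_estimators=n_est, max_depth=8, random_state=0)
    start = time.perf_counter()
    model.fit(timing_x, timing_y)
    fit_seconds[n_est] = time.perf_counter() - start
    print(f"n_estimators={n_est:>3}: fit took {fit_seconds[n_est]:.3f} s")
```

The same loop can be repeated over `max_depth` with `n_estimators` held fixed to separate the two effects.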


# P2: AdaBoost using sklearn (5 points)

Use the AdaBoostRegressor class from sklearn to predict the prices of the houses. Print the resulting error and samples, similar to the way in Section B.

Experiment with the settings of the hyperparameters: loss (try "linear", "square" and "exponential") and learning_rate (try at least values 0.2, 0.5, 1 and 2).
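
As with the random forest, the combinations of `loss` and `learning_rate` can be tried in a loop. The sketch below is an illustration only: it assumes synthetic data from `make_regression`, keeps the default number of estimators, and uses the hypothetical names `ada_results` and `best_ada_setting`.

```python
import itertools
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error

# Synthetic stand-in data (assumption), not the Madrid real estate data.
ada_x, ada_y = make_regression(n_samples=400, n_features=7, noise=10.0, random_state=0)
ada_train_x, ada_test_x = ada_x[:300], ada_x[300:]
ada_train_y, ada_test_y = ada_y[:300], ada_y[300:]

loss_values = ["linear", "square", "exponential"]
learning_rate_values = [0.2, 0.5, 1.0, 2.0]

# Map each (loss, learning_rate) pair to its RMSE on the held-out rows.
ada_results = {}
for loss, rate in itertools.product(loss_values, learning_rate_values):
    model = AdaBoostRegressor(loss=loss, learning_rate=rate, random_state=0)
    model.fit(ada_train_x, ada_train_y)
    rmse = np.sqrt(mean_squared_error(ada_test_y, model.predict(ada_test_x)))
    ada_results[(loss, rate)] = rmse

# Report the combination with the smallest error.
best_ada_setting = min(ada_results, key=ada_results.get)
print(f"Best (loss, learning_rate): {best_ada_setting}, RMSE {ada_results[best_ada_setting]:.2f}")
```

Because AdaBoost resamples the training data at random, fixing `random_state` keeps the comparison between settings repeatable.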

In [37]:
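# TODO implement here
from sklearn.ensemble import AdaBoostRegressor
# Best setting found in the experiments: linear loss with a learning rate of 2
regressor = AdaBoostRegressor(loss='linear', learning_rate=2)
# ravel() turns the (n, 1) target column into the 1D array AdaBoost expects
regressor.fit(training_data_x, training_data_y.ravel())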

yhats = []
for x in test_data_x:
    yhat = regressor.predict([x])
    yhats.append(yhat[0])

# Now, print some examples of the quality of the regressor
examples = [45, 67, 170, 189, 207]
for i in examples:
    x = test_data_x[i]
    y = test_data_y[i]
    #yhat = regressor.predict([x])[0][0]
    yhat = regressor.predict([x])[0]
    print(f"House {i} with {x[0]} sqmt was sold for {y} euros, but our system predicted {yhat:.2f}")

# Now calculate the root mean square error on the resulting arrays
error = np.sqrt(sklearn.metrics.mean_squared_error(test_data_y, yhats))
print(f"The root mean square error of AdaBoost is {error:.2f} euro")

House 45 with 239.0 sqmt was sold for [600000.] euros, but our system predicted 1442244.12
House 67 with 345.0 sqmt was sold for [1375000.] euros, but our system predicted 2514598.16
House 170 with 118.0 sqmt was sold for [480000.] euros, but our system predicted 432331.74
House 189 with 670.0 sqmt was sold for [2800000.] euros, but our system predicted 2798357.72
House 207 with 216.0 sqmt was sold for [1180000.] euros, but our system predicted 981241.83
The root mean square error of AdaBoost is 454263.31 euro


# Questions:
* Q: Do you find that Adaboost performs better than the previous approaches you implemented? Discuss.
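* A: I found that AdaBoost performs worse than Random Forest: the two have comparable runtimes, but AdaBoost has a higher root mean square error on the test data (454263.31 euro versus 335896.89 euro). This is also higher than the 351143.43 euro of the linear regression baseline.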
* Q: Explain the impact of the loss and the learning_rate hyperparameters on the accuracy. Which hyperparameter setting gives you the best value? Is this the same as the default settings in sklearn?
* A: The best parameter settings I got were loss='linear' and learning_rate=2. These are close to the sklearn defaults; the only difference is that the default learning rate is 1.
* Q: Explain the impact of the hyperparameters on the training time.
* A: Changing the loss parameter from linear to either square or exponential increased the training time.