# Homework 5

This homework asks you to perform various experiments with ensemble methods.

The dataset is the same real estate dataset we previously used from:

https://www.kaggle.com/datasets/mirbektoktogaraev/madrid-real-estate-market

You will write code and discussion into code and text cells in this notebook.

If a code block starts with TODO:, this means that you need to write something there.

There are also markdown blocks with questions. Write the answers to these questions in the specified locations.

Some code has been written for you to guide the project. Don't change the already written code.

## Grading
The points add up to 10. Extensive partial credit will be offered. Thus, make sure that you are at least attempting all problems.

Make sure to comment your code so that the grader can understand what the different components are doing or attempting to do.

In [1]:
import numpy as np
import pandas as pd
import sklearn.linear_model
import sklearn.metrics
import sklearn.ensemble


# A. Setup.

In this project we are going to work in a multi-variable setting.

This time, there are 7 explanatory variables: ``sq_mt_built``, ``n_rooms``, ``n_bathrooms``, ``has_individual_heating``, ``is_renewal_needed``, ``is_new_development`` and ``has_fitted_wardrobes``.

We will first create the training and test data while doing some minimal data cleaning.

In [5]:
df = pd.read_csv("houses_Madrid.csv")
#print(f"The columns of the database {df.columns}")

xfields = ["sq_mt_built", "n_rooms", "n_bathrooms", "has_individual_heating", \
           "is_renewal_needed", "is_new_development", "has_fitted_wardrobes"]
yfield = ["buy_price"]
# print (xfields + yfield)
dfsel = df[xfields + yfield]
dfselnona = dfsel.dropna()
df_shuffled = dfselnona.sample(frac=1) # shuffle the rows
x = df_shuffled[xfields].to_numpy(dtype=np.float64)
y = df_shuffled[yfield].to_numpy(dtype=np.float64)
print(x.shape)
training_data_x = x[:8000]
training_data_y = y[:8000]
test_data_x = x[8000:]
test_data_y = y[8000:]
print(f"Training data is composed of {len(training_data_x)} samples.")
print(f"Test data is composed of {len(test_data_x)} samples.")
# print(test_data_x[45])

(9764, 7)
Training data is composed of 8000 samples.
Test data is composed of 1764 samples.
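
The shuffle above is unseeded, so each run of the notebook produces a different split and slightly different errors. Below is a minimal sketch of a repeatable shuffle and split. The helper `shuffled_split` and the toy DataFrame are made up for illustration and are not part of the assignment code.

```python
import numpy as np
import pandas as pd

def shuffled_split(frame, n_train, seed):
    """Shuffle rows with a fixed seed, then split into train and test parts.

    Hypothetical helper for illustration; not part of the assignment code.
    """
    shuffled = frame.sample(frac=1, random_state=seed)  # seeded shuffle
    return shuffled.iloc[:n_train], shuffled.iloc[n_train:]

# Made-up toy data standing in for the cleaned housing table
toy = pd.DataFrame({
    "sq_mt_built": np.arange(10.0),
    "buy_price": np.arange(10.0) * 1000.0,
})

train_a, test_a = shuffled_split(toy, n_train=8, seed=0)
train_b, test_b = shuffled_split(toy, n_train=8, seed=0)

print(len(train_a), len(test_a))  # sizes of the two parts
print(train_a.equals(train_b))    # same seed, same split
```

With a fixed seed, the errors reported by two runs of the notebook can be compared directly.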


# B. Creating a linear regression multi-variable baseline.

In this section we make a linear regression predictor for the multi-variable case. We also check the performance of the resulting regressor, and print the error.

This part has been done for you, so that the work does not depend on you importing parts from the previous projects.

You will need to adapt this for the other models.

In [6]:
# training the linear regressor
regressor = sklearn.linear_model.LinearRegression()
regressor.fit(training_data_x, training_data_y)
# We will create the predictions yhat for every x from the test data, one at a time.
# This is not an efficient way to do it, but it allows you to write and debug
# functions that return a scalar number.
yhats = []
for x in test_data_x:
    yhat = regressor.predict([x])
    yhats.append(yhat[0])

# Now, print some examples of the quality of the regressor
examples = [45, 67, 170, 189, 207]
for i in examples:
    x = test_data_x[i]
    y = test_data_y[i]
    yhat = regressor.predict([x])[0][0]
    print(f"House {i} with {x[0]} sqmt was sold for {y} euros, but our system predicted {yhat:.2f}")

# Now calculate the root mean square error on the resulting arrays
error = sklearn.metrics.mean_squared_error(yhats, test_data_y, squared=False)
print(f"The mean square error of the linear regression is {error:.2f} euro")

House 45 with 58.0 sqmt was sold for [340000.] euros, but our system predicted 237474.00
House 67 with 50.0 sqmt was sold for [295000.] euros, but our system predicted 205471.15
House 170 with 162.0 sqmt was sold for [570000.] euros, but our system predicted 638787.34
House 189 with 77.0 sqmt was sold for [390000.] euros, but our system predicted 366002.59
House 207 with 80.0 sqmt was sold for [290000.] euros, but our system predicted 258236.29
The mean square error of the linear regression is 380831.49 euro
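
The value printed above is a root mean square error (RMSE), because `squared=False` is passed. Below is a minimal NumPy sketch of the same quantity, assuming a hypothetical helper named `rmse` and made-up prices. Both inputs are flattened first, so that a column of shape `(n, 1)` and a vector of shape `(n,)` are not broadcast into an `n × n` matrix. Computing the value by hand also avoids relying on the `squared` argument, which newer scikit-learn releases deprecate.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error; hypothetical helper for illustration."""
    # Flatten both inputs so shapes (n, 1) and (n,) line up element by element
    y_true = np.ravel(np.asarray(y_true, dtype=np.float64))
    y_pred = np.ravel(np.asarray(y_pred, dtype=np.float64))
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up values: true prices as a column, predictions as a flat vector
prices = np.array([[300000.0], [400000.0]])   # shape (2, 1)
predictions = np.array([310000.0, 380000.0])  # shape (2,)

print(f"RMSE: {rmse(prices, predictions):.2f} euro")
```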


# P1: Random Forest using sklearn (5 points)

Use the RandomForestRegressor class from sklearn to predict the house prices. Print the resulting error and samples, similar to the way in Section B.

Experiment with the settings of the hyperparameters: n_estimators (try at least values 10, 25, 100, 200) and max_depth (try at least values 1, 2, 4, 8, 16 and None).

Retain the hyperparameter value that gives you the best result.



In [52]:
# TODO implement here
# Define hyperparameter values to experiment with
n_estimators_values = [10, 25, 100, 200]
max_depth_values = [1, 2, 4, 8, 16, None]

best_n_estimators = None
best_max_depth = None
best_error = float('inf')

training_data_y = np.ravel(training_data_y)
# Loop through hyperparameter values
for n_estimator in n_estimators_values:
    for max_dep in max_depth_values:
        # Create and train the RandomForestRegressor
        model = sklearn.ensemble.RandomForestRegressor(n_estimators=n_estimator, max_depth=max_dep)
        model.fit(training_data_x, training_data_y)

        # Make predictions on the test set
        y_preds= model.predict(test_data_x)

        # Now, print some examples of the quality of the regressor
        examples = [45, 67, 170, 189, 207]
        for i in examples:
            x = test_data_x[i]
            y = test_data_y[i]
            y_pred = model.predict([x])[0]
            print(f"House {i} with {x[0]} sqmt was sold for {y} euros, but our system predicted {y_pred:.2f}")
        # Now calculate the root mean square error on the resulting arrays
        error = sklearn.metrics.mean_squared_error(y_preds, test_data_y, squared=False)
        print(f"The mean square error of the random forest regression is {error:.2f} euro")

        # Update best hyperparameters if the current model performs better
        if error < best_error:
            best_error = error
            best_n_estimators = n_estimator
            best_max_depth = max_dep

# Print the best hyperparameter values
print(f"\nBest hyperparameters: n_estimators={best_n_estimators}, max_depth={best_max_depth}, Best RMSE={best_error:.2f}")

House 45 with 58.0 sqmt was sold for [340000.] euros, but our system predicted 433963.63
House 67 with 50.0 sqmt was sold for [295000.] euros, but our system predicted 433963.63
House 170 with 162.0 sqmt was sold for [570000.] euros, but our system predicted 433963.63
House 189 with 77.0 sqmt was sold for [390000.] euros, but our system predicted 433963.63
House 207 with 80.0 sqmt was sold for [290000.] euros, but our system predicted 433963.63
The mean square error of the random forest regression is 483348.31 euro
House 45 with 58.0 sqmt was sold for [340000.] euros, but our system predicted 318394.98
House 67 with 50.0 sqmt was sold for [295000.] euros, but our system predicted 318394.98
House 170 with 162.0 sqmt was sold for [570000.] euros, but our system predicted 758232.50
House 189 with 77.0 sqmt was sold for [390000.] euros, but our system predicted 318394.98
House 207 with 80.0 sqmt was sold for [290000.] euros, but our system predicted 318394.98
The mean square error of the r

# Questions:
* Q: Do you find that Random Forest performs better than the previous approaches you implemented? Discuss.
* A: Yes. Comparing the best RMSE of RandomForestRegressor and KNeighborsRegressor, the random forest's error tends to be about 10,000 euros lower than the k-nearest-neighbors regressor's.
* Q: Explain the impact of the number of estimators and max tree depth hyperparameters on the accuracy. Which hyperparameter setting gives you the best value? Is this the same as the default settings in sklearn?
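* A: Over many runs, the best accuracy is always at a max tree depth of 8, with 25, 100, or 200 estimators. Very shallow trees underfit: with max_depth=1, every sample house in the output above receives the same prediction. This is not the same as the sklearn defaults (n_estimators=100, max_depth=None). The exact winner changes between runs because the algorithm is randomized and no random_state is set, so even the same hyperparameters produce different output each time.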
* Q: Explain the impact of the hyperparameters on the training time.
* A: Training time grows with both hyperparameters: the more estimators and the deeper the trees, the longer training takes.
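
The answers above can be checked with a short sketch. It reads the default settings from `get_params()`, shows that fixing `random_state` makes two fits agree, and times a few settings. The synthetic data and the helper `fit_and_time` are assumptions made for illustration; they are not the housing data or part of the assignment code.

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Default hyperparameters of the installed scikit-learn version
rf_defaults = RandomForestRegressor().get_params()
print(rf_defaults["n_estimators"], rf_defaults["max_depth"])

# Made-up regression data with 7 features, standing in for the housing data
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(500, 7))
y = 3.0 * X[:, 0] + rng.normal(0.0, 0.1, size=500)

def fit_and_time(n_estimators, max_depth, seed=0):
    """Fit a forest and return (model, seconds spent in fit). Hypothetical helper."""
    model = RandomForestRegressor(
        n_estimators=n_estimators, max_depth=max_depth, random_state=seed
    )
    start = time.perf_counter()
    model.fit(X, y)
    return model, time.perf_counter() - start

# With the same seed, two separately trained forests give the same predictions
model_a, _ = fit_and_time(25, 8)
model_b, _ = fit_and_time(25, 8)
same_predictions = np.allclose(model_a.predict(X), model_b.predict(X))
print(same_predictions)

# Training time for a few settings; exact numbers depend on the machine
for n in (10, 100):
    for depth in (2, None):
        _, seconds = fit_and_time(n, depth)
        print(f"n_estimators={n}, max_depth={depth}: {seconds:.3f} s")
```

Passing `random_state` in the hyperparameter loop would make the best setting comparable from one run to the next.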


# P2: AdaBoost using sklearn (5 points)

Use the AdaBoostRegressor class from sklearn to predict the house prices. Print the resulting error and samples, similar to the way in Section B.

Experiment with the settings of the hyperparameters: loss (try "linear", "square" and "exponential") and learning_rate (try at least values 0.2, 0.5, 1 and 2).

In [51]:
# TODO implement here
loss_values = ["linear", "square", "exponential"]
learning_rate_values = [0.2, 0.5, 1, 2]

best_loss = None
best_learning_rate = None
best_error = float('inf')

# Loop through hyperparameter values' combinations
for cur_loss in loss_values:
    for cur_learning_rate in learning_rate_values:
        # Create and train the AdaBoostRegressor; the fixed seed makes runs repeatable
        model = sklearn.ensemble.AdaBoostRegressor(loss=cur_loss, learning_rate=cur_learning_rate, random_state=42)
        model.fit(training_data_x, np.ravel(training_data_y))

        # Predict the whole test set in one call
        y_hats = model.predict(test_data_x)

        # Print some examples of the quality of the regressor
        # (examples is the list of test indices defined in Section B)
        for i in examples:
            x = test_data_x[i]
            y = test_data_y[i]
            y_hat = model.predict([x])[0]
            print(f"House {i} with {x[0]} sqmt was sold for {y} euros, but our system predicted {y_hat:.2f}")
        # Calculate the root mean square error on the test set
        error = sklearn.metrics.mean_squared_error(np.ravel(test_data_y), y_hats, squared=False)
        print(f"The mean square error of the AdaBoost regression is {error:.2f} euro")

        # Update best hyperparameters
        if error < best_error:
            best_error = error
            best_loss = cur_loss
            best_learning_rate = cur_learning_rate

print(f"\nBest hyperparameters: loss={best_loss}, learning_rate={best_learning_rate}, Best RMSE={best_error:.2f}")


House 45 with 58.0 sqmt was sold for [340000.] euros, but our system predicted 313594.84
House 67 with 50.0 sqmt was sold for [295000.] euros, but our system predicted 313594.84
House 170 with 162.0 sqmt was sold for [570000.] euros, but our system predicted 756494.61
House 189 with 77.0 sqmt was sold for [390000.] euros, but our system predicted 313594.84
House 207 with 80.0 sqmt was sold for [290000.] euros, but our system predicted 313594.84
The mean square error of the AdaBoost regression is 398806.26 euro
House 45 with 58.0 sqmt was sold for [340000.] euros, but our system predicted 328190.86
House 67 with 50.0 sqmt was sold for [295000.] euros, but our system predicted 328190.86
House 170 with 162.0 sqmt was sold for [570000.] euros, but our system predicted 756494.61
House 189 with 77.0 sqmt was sold for [390000.] euros, but our system predicted 328190.86
House 207 with 80.0 sqmt was sold for [290000.] euros, but our system predicted 328190.86
The mean square error of the AdaBoo

# Questions:
* Q: Do you find that Adaboost performs better than the previous approaches you implemented? Discuss.
* A: No. The best RMSE is about 20,000 euros higher than the best result of the previous approach.
* Q: Explain the impact of the loss and the learning_rate hyperparameters on the accuracy. Which hyperparameter setting gives you the best value? Is this the same as the default settings in sklearn?
* A: The best setting is loss="exponential" with learning_rate=0.2. It is the same every time I run it because the model is created with a fixed random_state=42. This is not the same as the sklearn defaults, which are loss="linear" and learning_rate=1.0.
* Q: Explain the impact of the hyperparameters on the training time.
* A: Training took about the same time for all of the hyperparameter settings.
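
Whether a setting matches the sklearn defaults can be read from the estimator itself. The sketch below prints the defaults of `AdaBoostRegressor` and shows that a fixed `random_state` makes repeated fits agree. The synthetic data and the helper `fit_predict` are assumptions made for illustration, not the housing data or part of the assignment code.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor

# Default hyperparameters of the installed scikit-learn version
ada_defaults = AdaBoostRegressor().get_params()
print(ada_defaults["loss"], ada_defaults["learning_rate"], ada_defaults["n_estimators"])

# Made-up regression data with 7 features, standing in for the housing data
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(300, 7))
y = 3.0 * X[:, 0] + rng.normal(0.0, 0.1, size=300)

def fit_predict(seed):
    """Train with the setting reported above and predict on X. Hypothetical helper."""
    model = AdaBoostRegressor(loss="exponential", learning_rate=0.2, random_state=seed)
    return model.fit(X, y).predict(X)

# The same seed gives the same model, which is why repeated runs of the cell agree
repeatable = np.allclose(fit_predict(42), fit_predict(42))
print(repeatable)
```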