# Homework 5

This homework asks you to perform various experiments with ensemble methods. 

The dataset is the same real estate dataset we previously used from:

https://www.kaggle.com/datasets/mirbektoktogaraev/madrid-real-estate-market

You will write code and discussion into code and text cells in this notebook. 

If a code block starts with TODO:, this means that you need to write something there. 

There are also markdown blocks with questions. Write the answers to these questions in the specified locations.

Some code has been written for you to guide the project. Don't change the already written code.

## Grading
The points add up to 10. Extensive partial credit will be offered. Thus, make sure that you are at least attempting all problems. 

Make sure to comment your code, such that the grader can understand what different components are doing or attempting to do. 

In [None]:
import numpy as np
import pandas as pd
import sklearn.linear_model
import sklearn.metrics
import sklearn.ensemble


# A. Setup. 

In this project we are going to work in a multi-variable setting. 

This time, there are 7 explanatory variables: ``sq_mt_built``, ``n_rooms``, ``n_bathrooms``, ``has_individual_heating``, ``is_renewal_needed``, ``is_new_development`` and ``has_fitted_wardrobes``. 

We will first create the training and test data while doing some minimal data cleaning.

In [None]:
np.random.seed(1)
df = pd.read_csv("houses_Madrid.csv")
#print(f"The columns of the database {df.columns}")

xfields = ["sq_mt_built", "n_rooms", "n_bathrooms", "has_individual_heating", \
           "is_renewal_needed", "is_new_development", "has_fitted_wardrobes"]
yfield = ["buy_price"]
# print (xfields + yfield)
dfsel = df[xfields + yfield]
dfselnona = dfsel.dropna()
df_shuffled = dfselnona.sample(frac=1) # shuffle the rows
x = df_shuffled[xfields].to_numpy(dtype=np.float64)
y = df_shuffled[yfield].to_numpy(dtype=np.float64)
print(x.shape)
training_data_x = x[:8000]
training_data_y = y[:8000]
test_data_x = x[8000:]
test_data_y = y[8000:]
print(f"Training data is composed of {len(training_data_x)} samples.")
print(f"Test data is composed of {len(test_data_x)} samples.")
# print(test_data_x[45])

# B. Creating a linear regression multi-variable baseline. 

In this section we make a linear regression predictor for the multi-variable case. We also check the performance of the resulting regressor, and print the error. 

This part has been done for you, so that the work does not depend on you importing parts from the previous projects. 

You will need to adapt this for the other models. 

In [None]:
np.random.seed(1)
# training the linear regressor
regressor = sklearn.linear_model.LinearRegression()
regressor.fit(training_data_x, training_data_y)
# We will create the predictions yhat for every x from the test data, one at a time.
# This is not an efficient way to do it, but it allows you to write and debug
# functions that return a scalar number.
yhats = []
for x in test_data_x:
    yhat = regressor.predict([x])
    yhats.append(yhat[0])

# Now, print some examples of the quality of the regressor
examples = [45, 67, 170, 189, 207]
for i in examples:
    x = test_data_x[i]
    y = test_data_y[i]
    yhat = regressor.predict([x])[0][0]
    print(f"House {i} with {x[0]} sqmt was sold for {y} euros, but our system predicted {yhat:.2f}")

# Now calculate the root mean square error on the resulting arrays
error = sklearn.metrics.mean_squared_error(yhats, test_data_y, squared=False)
print(f"The root mean square error of the linear regression is {error:.2f} euro")
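
The evaluation steps above repeat for every model in this notebook, so one way to adapt them is to wrap them in a small helper. The sketch below is only an illustration: `evaluate_regressor` and the `demo_*` arrays are hypothetical names, and the data is a tiny synthetic stand-in rather than the Madrid dataset. It predicts the whole test set in one call and flattens both arrays before subtracting, which avoids accidental broadcasting between a column vector of shape `(n, 1)` and a flat array of shape `(n,)`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def evaluate_regressor(model, x_test, y_test):
    """Return the RMSE of a fitted regressor on test data (hypothetical helper)."""
    # Predict all rows at once, then flatten predictions and targets so that
    # (n, 1) column vectors and (n,) arrays are compared element by element.
    y_pred = np.asarray(model.predict(x_test)).ravel()
    y_true = np.asarray(y_test).ravel()
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Synthetic stand-in for the housing data (assumption): price = 2000 * area + 10000.
demo_x = np.array([[50.0], [80.0], [100.0], [120.0]])
demo_y = 2000.0 * demo_x + 10000.0
demo_model = LinearRegression().fit(demo_x, demo_y)
demo_rmse = evaluate_regressor(demo_model, demo_x, demo_y)
print(f"RMSE on the synthetic data: {demo_rmse:.2f}")
```

The same helper should accept any fitted sklearn regressor, since it relies only on `predict`.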

# P1: Random Forest using sklearn (5 points)

Use the RandomForestRegressor class from sklearn to predict the house prices. Print the resulting error and samples, similar to the way in Section B. 

Experiment with the settings of the hyperparameters: n_estimators (try at least values 10, 25, 100, 200) and max_depth (try at least values 1, 2, 4, 8, 16 and None).

Retain the hyperparameter value that gives you the best result. 
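
The experiment described above can also be written as a loop instead of one cell per setting. The following is a minimal sketch under stated assumptions: the data is synthetic (the `sweep_*` arrays stand in for the housing data), `random_state=1` is passed explicitly so each fit is reproducible, and the RMSE is computed with NumPy. It combines both hyperparameters in one grid, which goes slightly beyond varying them one at a time.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption): 300 samples, 3 features, noisy target.
rng = np.random.default_rng(0)
sweep_x = rng.uniform(0.0, 10.0, size=(300, 3))
sweep_y = 3.0 * sweep_x[:, 0] + sweep_x[:, 1] ** 2 + rng.normal(0.0, 1.0, size=300)
x_train, y_train = sweep_x[:200], sweep_y[:200]
x_test, y_test = sweep_x[200:], sweep_y[200:]

# Fit one forest per (n_estimators, max_depth) pair and record its test RMSE.
results = {}
for n_estimators in [10, 25, 100, 200]:
    for max_depth in [1, 2, 4, 8, 16, None]:
        model = RandomForestRegressor(
            n_estimators=n_estimators, max_depth=max_depth, random_state=1
        )
        model.fit(x_train, y_train)
        y_pred = model.predict(x_test)
        results[(n_estimators, max_depth)] = float(
            np.sqrt(np.mean((y_test - y_pred) ** 2))
        )

# The setting with the lowest RMSE is the one to retain.
best_setting = min(results, key=results.get)
print(f"Best (n_estimators, max_depth): {best_setting}, RMSE {results[best_setting]:.3f}")
```

Keeping the results in a dictionary keyed by the setting makes it easy to compare runs afterwards or to print them as a table.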



In [None]:
# TODO implement here
np.random.seed(1)
from sklearn.ensemble import RandomForestClassifier


# Default Random Forest:
# Price is a continuous target, so a regressor is used here
rf = sklearn.ensemble.RandomForestRegressor().fit(training_data_x, training_data_y.ravel())

# Create yhat one at a time from test x
yhats = []
for x in test_data_x:
    yhat = rf.predict([x])
    yhats.append(yhat[0])

# Examples showing the efficacy of this model
examples = [45, 67, 170, 189, 207]
for i in examples:
    x = test_data_x[i]
    y = test_data_y[i]
    yhat = rf.predict([x])[0]
    print(f"House {i} with {x[0]} sqmt was sold for {y} euros, but Default Random Forest predicted {yhat:.2f}")

# Root mean squared error (RMSE) of the random forest (squared=False returns the root):
error = sklearn.metrics.mean_squared_error(yhats, test_data_y, squared=False)
print(f"The root mean square error of Default Random Forest is {error:.2f} euro")

In [None]:
# n_estimators = 10:
np.random.seed(1)

rf = sklearn.ensemble.RandomForestRegressor(n_estimators=10).fit(training_data_x, training_data_y.ravel())

# Create yhat one at a time from test x
yhats = []
for x in test_data_x:
    yhat = rf.predict([x])
    yhats.append(yhat[0])

# Examples showing the efficacy of this model
examples = [45, 67, 170, 189, 207]
for i in examples:
    x = test_data_x[i]
    y = test_data_y[i]
    yhat = rf.predict([x])[0]
    print(f"House {i} with {x[0]} sqmt was sold for {y} euros, but Random Forest with 10 estimators predicted {yhat:.2f}")

# Root mean squared error (RMSE) of the random forest:
error = sklearn.metrics.mean_squared_error(yhats, test_data_y, squared=False)
print(f"The root mean square error of Random Forest with 10 estimators is {error:.2f} euro")

In [None]:
# n_estimators = 25:
np.random.seed(1)

rf = sklearn.ensemble.RandomForestRegressor(n_estimators=25).fit(training_data_x, training_data_y.ravel())

# Create yhat one at a time from test x
yhats = []
for x in test_data_x:
    yhat = rf.predict([x])
    yhats.append(yhat[0])

# Examples showing the efficacy of this model
examples = [45, 67, 170, 189, 207]
for i in examples:
    x = test_data_x[i]
    y = test_data_y[i]
    yhat = rf.predict([x])[0]
    print(f"House {i} with {x[0]} sqmt was sold for {y} euros, but Random Forest with 25 estimators predicted {yhat:.2f}")

# Root mean squared error (RMSE) of the random forest:
error = sklearn.metrics.mean_squared_error(yhats, test_data_y, squared=False)
print(f"The root mean square error of Random Forest with 25 estimators is {error:.2f} euro")

In [None]:
# n_estimators = 100:
np.random.seed(1)

rf = sklearn.ensemble.RandomForestRegressor(n_estimators=100).fit(training_data_x, training_data_y.ravel())

# Create yhat one at a time from test x
yhats = []
for x in test_data_x:
    yhat = rf.predict([x])
    yhats.append(yhat[0])

# Examples showing the efficacy of this model
examples = [45, 67, 170, 189, 207]
for i in examples:
    x = test_data_x[i]
    y = test_data_y[i]
    yhat = rf.predict([x])[0]
    print(f"House {i} with {x[0]} sqmt was sold for {y} euros, but Random Forest with 100 estimators (Default) predicted {yhat:.2f}")

# Root mean squared error (RMSE) of the random forest:
error = sklearn.metrics.mean_squared_error(yhats, test_data_y, squared=False)
print(f"The root mean square error of Random Forest with 100 estimators (Default) is {error:.2f} euro")

In [None]:
# n_estimators = 200:
np.random.seed(1)

rf = sklearn.ensemble.RandomForestRegressor(n_estimators=200).fit(training_data_x, training_data_y.ravel())

# Create yhat one at a time from test x
yhats = []
for x in test_data_x:
    yhat = rf.predict([x])
    yhats.append(yhat[0])

# Examples showing the efficacy of this model
examples = [45, 67, 170, 189, 207]
for i in examples:
    x = test_data_x[i]
    y = test_data_y[i]
    yhat = rf.predict([x])[0]
    print(f"House {i} with {x[0]} sqmt was sold for {y} euros, but Random Forest with 200 estimators predicted {yhat:.2f}")

# Root mean squared error (RMSE) of the random forest:
error = sklearn.metrics.mean_squared_error(yhats, test_data_y, squared=False)
print(f"The root mean square error of Random Forest with 200 estimators is {error:.2f} euro")

In [None]:
# max_depth = 1:
np.random.seed(1)

rf = sklearn.ensemble.RandomForestRegressor(max_depth=1).fit(training_data_x, training_data_y.ravel())

# Create yhat one at a time from test x
yhats = []
for x in test_data_x:
    yhat = rf.predict([x])
    yhats.append(yhat[0])

# Examples showing the efficacy of this model
examples = [45, 67, 170, 189, 207]
for i in examples:
    x = test_data_x[i]
    y = test_data_y[i]
    yhat = rf.predict([x])[0]
    print(f"House {i} with {x[0]} sqmt was sold for {y} euros, but Random Forest with Max Depth 1 predicted {yhat:.2f}")

# Root mean squared error (RMSE) of the random forest:
error = sklearn.metrics.mean_squared_error(yhats, test_data_y, squared=False)
print(f"The root mean square error of Random Forest with Max Depth 1 is {error:.2f} euro")

In [None]:
# max_depth = 2:
np.random.seed(1)

rf = sklearn.ensemble.RandomForestRegressor(max_depth=2).fit(training_data_x, training_data_y.ravel())

# Create yhat one at a time from test x
yhats = []
for x in test_data_x:
    yhat = rf.predict([x])
    yhats.append(yhat[0])

# Examples showing the efficacy of this model
examples = [45, 67, 170, 189, 207]
for i in examples:
    x = test_data_x[i]
    y = test_data_y[i]
    yhat = rf.predict([x])[0]
    print(f"House {i} with {x[0]} sqmt was sold for {y} euros, but Random Forest with Max Depth 2 predicted {yhat:.2f}")

# Root mean squared error (RMSE) of the random forest:
error = sklearn.metrics.mean_squared_error(yhats, test_data_y, squared=False)
print(f"The root mean square error of Random Forest with Max Depth 2 is {error:.2f} euro")

In [None]:
# max_depth = 4:
np.random.seed(1)

rf = sklearn.ensemble.RandomForestRegressor(max_depth=4).fit(training_data_x, training_data_y.ravel())

# Create yhat one at a time from test x
yhats = []
for x in test_data_x:
    yhat = rf.predict([x])
    yhats.append(yhat[0])

# Examples showing the efficacy of this model
examples = [45, 67, 170, 189, 207]
for i in examples:
    x = test_data_x[i]
    y = test_data_y[i]
    yhat = rf.predict([x])[0]
    print(f"House {i} with {x[0]} sqmt was sold for {y} euros, but Random Forest with Max Depth 4 predicted {yhat:.2f}")

# Root mean squared error (RMSE) of the random forest:
error = sklearn.metrics.mean_squared_error(yhats, test_data_y, squared=False)
print(f"The root mean square error of Random Forest with Max Depth 4 is {error:.2f} euro")

In [None]:
# max_depth = 8:
np.random.seed(1)

rf = sklearn.ensemble.RandomForestRegressor(max_depth=8).fit(training_data_x, training_data_y.ravel())

# Create yhat one at a time from test x
yhats = []
for x in test_data_x:
    yhat = rf.predict([x])
    yhats.append(yhat[0])

# Examples showing the efficacy of this model
examples = [45, 67, 170, 189, 207]
for i in examples:
    x = test_data_x[i]
    y = test_data_y[i]
    yhat = rf.predict([x])[0]
    print(f"House {i} with {x[0]} sqmt was sold for {y} euros, but Random Forest with Max Depth 8 predicted {yhat:.2f}")

# Root mean squared error (RMSE) of the random forest:
error = sklearn.metrics.mean_squared_error(yhats, test_data_y, squared=False)
print(f"The root mean square error of Random Forest with Max Depth 8 is {error:.2f} euro")

In [None]:
# max_depth = 16:
np.random.seed(1)

rf = sklearn.ensemble.RandomForestRegressor(max_depth=16).fit(training_data_x, training_data_y.ravel())

# Create yhat one at a time from test x
yhats = []
for x in test_data_x:
    yhat = rf.predict([x])
    yhats.append(yhat[0])

# Examples showing the efficacy of this model
examples = [45, 67, 170, 189, 207]
for i in examples:
    x = test_data_x[i]
    y = test_data_y[i]
    yhat = rf.predict([x])[0]
    print(f"House {i} with {x[0]} sqmt was sold for {y} euros, but Random Forest with Max Depth 16 predicted {yhat:.2f}")

# Root mean squared error (RMSE) of the random forest:
error = sklearn.metrics.mean_squared_error(yhats, test_data_y, squared=False)
print(f"The root mean square error of Random Forest with Max Depth 16 is {error:.2f} euro")

In [None]:
# max_depth = None:
np.random.seed(1)

rf = sklearn.ensemble.RandomForestRegressor(max_depth=None).fit(training_data_x, training_data_y.ravel())

# Create yhat one at a time from test x
yhats = []
for x in test_data_x:
    yhat = rf.predict([x])
    yhats.append(yhat[0])

# Examples showing the efficacy of this model
examples = [45, 67, 170, 189, 207]
for i in examples:
    x = test_data_x[i]
    y = test_data_y[i]
    yhat = rf.predict([x])[0]
    print(f"House {i} with {x[0]} sqmt was sold for {y} euros, but Random Forest with Max Depth None (Default) predicted {yhat:.2f}")

# Root mean squared error (RMSE) of the random forest:
error = sklearn.metrics.mean_squared_error(yhats, test_data_y, squared=False)
print(f"The root mean square error of Random Forest with Max Depth None (Default) is {error:.2f} euro")

# Questions: 
* Q: Do you find that Random Forest performs better than the previous approaches you implemented? Discuss. 
* A: Random Forest RMSEs on the test set (in euros):
    * Default Random Forest: 464957.87
    * 10 Estimators: 446286.40
    * 25 Estimators: 457670.26
    * 100 Estimators (Default): 442164.61
    * 200 Estimators: 441326.10
    * Max Depth 1: 690792.12
    * Max Depth 2: 531027.36
    * Max Depth 4: 457221.31
    * Max Depth 8: 432126.40
    * Max Depth 16: 445192.75
    * Max Depth None (Default): 451945.06


* Q: Explain the impact of the number of estimators and max tree depth hyperparameters on the accuracy. Which hyperparameter setting gives you the best value? Is this the same as the default settings in sklearn?
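* A: Adding estimators generally lowered the error, but not monotonically: 25 estimators (457670.26) did worse than 10 (446286.40), while 100 and 200 gave the lowest values in that sweep (442164.61 and 441326.10). Deeper trees also helped up to a point: the error fell from 690792.12 at max_depth = 1 to 432126.40 at max_depth = 8, then rose again for max_depth = 16 and None.

* The best value (432126.40) came from max_depth = 8 with n_estimators left at 100. This is different from the default setting, which is n_estimators = 100 and max_depth = None.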


* Q: Explain the impact of the hyperparameters on the training time. 
* A: << Fill in your answer here >>
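
One way to gather evidence for the training-time question is to time each call to `fit`. The sketch below assumes synthetic stand-in data (the `timing_*` arrays) and uses `time.perf_counter` around `fit`. The measured values depend on the machine and on what else is running, so treat them as rough indications rather than precise benchmarks.

```python
import time

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption): 500 samples, 4 features.
rng = np.random.default_rng(0)
timing_x = rng.uniform(0.0, 10.0, size=(500, 4))
timing_y = timing_x.sum(axis=1) + rng.normal(0.0, 0.5, size=500)

# Measure only the fit call for each forest size.
fit_seconds = {}
for n_estimators in [10, 25, 100, 200]:
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=1)
    start = time.perf_counter()
    model.fit(timing_x, timing_y)
    fit_seconds[n_estimators] = time.perf_counter() - start
    print(f"n_estimators={n_estimators}: fit took {fit_seconds[n_estimators]:.3f} s")
```

The same loop could iterate over `max_depth` values instead to compare shallow and deep trees.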


# P2: AdaBoost using sklearn (5 points)

Use the AdaBoostRegressor class from sklearn to predict the house prices. Print the resulting error and samples, similar to the way in Section B. 

Experiment with the settings of the hyperparameters: loss (try "linear", "square" and "exponential") and learning_rate (try at least values 0.2, 0.5, 1 and 2).
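
The loss and learning_rate experiment can be sketched as a loop over both settings. This is an illustration under stated assumptions, not the required solution: the `ada_*` arrays are synthetic stand-ins for the housing data, `random_state=1` is passed so runs are reproducible, and the target is kept one-dimensional, which is the shape `AdaBoostRegressor.fit` expects.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor

# Synthetic stand-in data (assumption): 300 samples, 3 features, noisy target.
rng = np.random.default_rng(0)
ada_x = rng.uniform(0.0, 10.0, size=(300, 3))
ada_y = 3.0 * ada_x[:, 0] + ada_x[:, 1] ** 2 + rng.normal(0.0, 1.0, size=300)
ada_train_x, ada_train_y = ada_x[:200], ada_y[:200]
ada_test_x, ada_test_y = ada_x[200:], ada_y[200:]

# Fit one booster per (loss, learning_rate) pair and record its test RMSE.
ada_results = {}
for loss in ["linear", "square", "exponential"]:
    for learning_rate in [0.2, 0.5, 1.0, 2.0]:
        model = AdaBoostRegressor(loss=loss, learning_rate=learning_rate, random_state=1)
        model.fit(ada_train_x, ada_train_y)
        y_pred = model.predict(ada_test_x)
        ada_results[(loss, learning_rate)] = float(
            np.sqrt(np.mean((ada_test_y - y_pred) ** 2))
        )

# Print the settings from lowest to highest RMSE.
for setting, rmse in sorted(ada_results.items(), key=lambda item: item[1]):
    print(f"loss={setting[0]}, learning_rate={setting[1]}: RMSE {rmse:.3f}")
```

Sorting by RMSE puts the best setting first, which makes it easy to compare against the sklearn defaults (`loss="linear"`, `learning_rate=1.0`).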

In [None]:
# TODO implement here
from sklearn.ensemble import AdaBoostRegressor
# Default AdaBoost regressor:
np.random.seed(1)
# ravel() turns the (n, 1) target column into the 1-D array that fit expects
ada = AdaBoostRegressor().fit(training_data_x, training_data_y.ravel())

# Create yhat one at a time from test x
yhats = []
for x in test_data_x:
    yhat = ada.predict([x])
    yhats.append(yhat[0])

# Examples showing the efficacy of this model
examples = [45, 67, 170, 189, 207]
for i in examples:
    x = test_data_x[i]
    y = test_data_y[i]
    yhat = ada.predict([x])[0]
    print(f"House {i} with {x[0]} sqmt was sold for {y} euros, but AdaBoost predicted {yhat:.2f}")

# Root mean squared error (RMSE) of AdaBoost:
error = sklearn.metrics.mean_squared_error(yhats, test_data_y, squared=False)
print(f"The root mean square error of AdaBoost is {error:.2f} euro")

# Questions: 
* Q: Do you find that AdaBoost performs better than the previous approaches you implemented? Discuss. 
* A: << Fill in your answer here >>
* Q: Explain the impact of the loss and the learning_rate hyperparameters on the accuracy. Which hyperparameter setting gives you the best value? Is this the same as the default settings in sklearn?
* A: << Fill in your answer here >>
* Q: Explain the impact of the hyperparameters on the training time. 
* A: << Fill in your answer here >>