# Homework 5

This homework asks you to perform various experiments with ensemble methods. 

The dataset is the same real estate dataset we previously used from:

https://www.kaggle.com/datasets/mirbektoktogaraev/madrid-real-estate-market

You will write code and discussion into code and text cells in this notebook. 

If a code block starts with TODO:, this means that you need to write something there. 

There are also markdown blocks with questions. Write the answers to these questions in the specified locations.

Some code has already been written for you to guide the project. Don't change the code that is already written.

## Grading
The points add up to 10. Extensive partial credit will be offered. Thus, make sure that you are at least attempting all problems. 

Make sure to comment your code, so that the grader can understand what the different components are doing or attempting to do. 

In [1]:
import numpy as np
import pandas as pd
import sklearn.linear_model
import sklearn.metrics
import sklearn.ensemble


# A. Setup. 

In this project we are going to work in a multi-variable setting. 

This time, there are 7 explanatory variables: ``sq_mt_built``, ``n_rooms``, ``n_bathrooms``, ``has_individual_heating``, ``is_renewal_needed``, ``is_new_development`` and ``has_fitted_wardrobes``. 

We will first create the training and test data while doing some minimal data cleaning.

In [2]:
np.random.seed(1)
df = pd.read_csv("houses_Madrid.csv")
#print(f"The columns of the database {df.columns}")

xfields = ["sq_mt_built", "n_rooms", "n_bathrooms", "has_individual_heating", \
           "is_renewal_needed", "is_new_development", "has_fitted_wardrobes"]
yfield = ["buy_price"]
# print (xfields + yfield)
dfsel = df[xfields + yfield]
dfselnona = dfsel.dropna()
df_shuffled = dfselnona.sample(frac=1) # shuffle the rows
x = df_shuffled[xfields].to_numpy(dtype=np.float64)
y = df_shuffled[yfield].to_numpy(dtype=np.float64)
print(x.shape)
training_data_x = x[:8000]
training_data_y = y[:8000]
test_data_x = x[8000:]
test_data_y = y[8000:]
print(f"Training data is composed of {len(training_data_x)} samples.")
print(f"Test data is composed of {len(test_data_x)} samples.")
# print(test_data_x[45])

(9764, 7)
Training data is composed of 8000 samples.
Test data is composed of 1764 samples.


# B. Creating a linear regression multi-variable baseline. 

In this section we make a linear regression predictor for the multi-variable case. We also check the performance of the resulting regressor, and print the error. 

This part has been done for you, so that the work does not depend on importing code from the previous projects. 

You will need to adapt this for the other models. 

In [3]:
np.random.seed(1)
# training the linear regressor
regressor = sklearn.linear_model.LinearRegression()
regressor.fit(training_data_x, training_data_y)
# We will create the predictions yhat for every x from the test data.
# We will do this one at a time. This is not an efficient way to do it,
# but it allows you to write and debug functions that return a scalar number.
yhats = []
for x in test_data_x:
    yhat = regressor.predict([x])
    yhats.append(yhat[0])

# Now, print some examples of the quality of the regressor
examples = [45, 67, 170, 189, 207]
for i in examples:
    x = test_data_x[i]
    y = test_data_y[i]
    yhat = regressor.predict([x])[0][0]
    print(f"House {i} with {x[0]} sqmt was sold for {y} euros, but our system predicted {yhat:.2f}")

# Now calculate the root mean square error on the resulting arrays
error = sklearn.metrics.mean_squared_error(yhats, test_data_y, squared=False)
print(f"The mean square error of the linear regression is {error:.2f} euro")

House 45 with 70.0 sqmt was sold for [218000.] euros, but our system predicted 223709.52
House 67 with 460.0 sqmt was sold for [950000.] euros, but our system predicted 1969126.92
House 170 with 360.0 sqmt was sold for [2150000.] euros, but our system predicted 1735252.34
House 189 with 240.0 sqmt was sold for [1600000.] euros, but our system predicted 924119.15
House 207 with 86.0 sqmt was sold for [129000.] euros, but our system predicted 219496.45
The mean square error of the linear regression is 388066.28 euro
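
The per-row loop above can be replaced by a single call, because `predict` accepts the whole test matrix. The sketch below illustrates this on synthetic data (the arrays and coefficients are made up for illustration and are not the Madrid dataset). It also computes the RMSE as the square root of `mean_squared_error`, which does not depend on the `squared` argument being available in the installed sklearn version.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for the housing data (an assumption, not the real dataset).
rng = np.random.default_rng(1)
X = rng.uniform(30, 300, size=(200, 3))
y = (X @ np.array([3000.0, 500.0, 200.0])
     + rng.normal(0, 1e4, size=200)).reshape(-1, 1)
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

model = LinearRegression().fit(X_train, y_train)

# One prediction per call, as in the cell above.
yhats_loop = np.array([model.predict([row])[0] for row in X_test])

# A single call on the whole test matrix returns the same values.
yhats_batch = model.predict(X_test)

# RMSE as the square root of the MSE; the true values go first.
rmse = np.sqrt(mean_squared_error(y_test, yhats_batch))
print(f"RMSE: {rmse:.2f}")
```

Both prediction arrays have one row per test sample, so either can be passed to the metric.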


# P1: Random Forest using sklearn (5 points)

Use the RandomForestRegressor class from sklearn to predict the house prices. Print the resulting error and samples, similar to the way in Section B. 

Experiment with the settings of the hyperparameters: n_estimators (try at least values 10, 25, 100, 200) and max_depth (try at least values 1, 2, 4, 8, 16 and None).

Retain the hyperparameter value that gives you the best result. 
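
A minimal sketch of such an experiment, assuming synthetic data in place of the housing arrays: a small helper (`rf_rmse`, a hypothetical name) fits a `RandomForestRegressor` with the given settings and returns the test RMSE, so each hyperparameter value needs one line rather than one copied cell. A regressor averages the numeric predictions of its trees, whereas a classifier votes among discrete labels, which is why the regressor is the natural fit for prices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Synthetic regression data (an assumption, not the Madrid dataset).
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(300, 4))
y = 100 * X[:, 0] + 50 * X[:, 1] ** 2 + rng.normal(0, 5, size=300)
X_train, X_test, y_train, y_test = X[:240], X[240:], y[:240], y[240:]

def rf_rmse(**params):
    """Fit a RandomForestRegressor with the given settings; return test RMSE."""
    model = RandomForestRegressor(random_state=1, **params)
    model.fit(X_train, y_train)
    return float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))

# Vary one hyperparameter at a time, leaving the other at its default.
results = {}
for n in (10, 25, 100, 200):
    results[("n_estimators", n)] = rf_rmse(n_estimators=n)
for depth in (1, 2, 4, 8, 16, None):
    results[("max_depth", depth)] = rf_rmse(max_depth=depth)

for (name, value), rmse in results.items():
    print(f"{name}={value}: RMSE {rmse:.2f}")

best = min(results, key=results.get)
print(f"Best setting in this sweep: {best[0]}={best[1]}")
```

Passing `random_state` to the estimator makes each fit reproducible independently of the global NumPy seed.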



In [4]:
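# TODO implement here
np.random.seed(1)
# Note: RandomForestClassifier treats every distinct price as a class label,
# so each prediction is a price copied from the training set.
# RandomForestRegressor is the estimator the task asks for.
from sklearn.ensemble import RandomForestClassifier


# Default Random Forest:
rf = RandomForestClassifier().fit(training_data_x,training_data_y.ravel())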

# Create yhat one at a time from test x
yhats = []
for x in test_data_x:
    yhat = rf.predict([x])
    yhats.append(yhat[0])

# Examples showing the efficacy of this model
examples = [45, 67, 170, 189, 207]
for i in examples:
    x = test_data_x[i]
    y = test_data_y[i]
    yhat = rf.predict([x])[0]
    print(f"House {i} with {x[0]} sqmt was sold for {y} euros, but Default Random Forest predicted {yhat:.2f}")

# Root mean squared error (RMSE) of Random Forest:
error = sklearn.metrics.mean_squared_error(yhats, test_data_y, squared=False)
print(f"The mean square error of Default Random Forest is {error:.2f} euro")

House 45 with 70.0 sqmt was sold for [218000.] euros, but Default Random Forest predicted 186000.00
House 67 with 460.0 sqmt was sold for [950000.] euros, but Default Random Forest predicted 1300000.00
House 170 with 360.0 sqmt was sold for [2150000.] euros, but Default Random Forest predicted 870000.00
House 189 with 240.0 sqmt was sold for [1600000.] euros, but Default Random Forest predicted 1575000.00
House 207 with 86.0 sqmt was sold for [129000.] euros, but Default Random Forest predicted 129000.00
The mean square error of Default Random Forest is 464957.87 euro


In [5]:
# n_estimators = 10:
np.random.seed(1)

rf = RandomForestClassifier(n_estimators=10).fit(training_data_x,training_data_y.ravel())

# Create yhat one at a time from test x
yhats = []
for x in test_data_x:
    yhat = rf.predict([x])
    yhats.append(yhat[0])

# Examples showing the efficacy of this model
examples = [45, 67, 170, 189, 207]
for i in examples:
    x = test_data_x[i]
    y = test_data_y[i]
    yhat = rf.predict([x])[0]
    print(f"House {i} with {x[0]} sqmt was sold for {y} euros, but Random Forest with 10 estimators predicted {yhat:.2f}")

# Root mean squared error (RMSE) of Random Forest:
error = sklearn.metrics.mean_squared_error(yhats, test_data_y, squared=False)
print(f"The mean square error of Random Forest with 10 estimators is {error:.2f} euro")

House 45 with 70.0 sqmt was sold for [218000.] euros, but Random Forest with 10 estimators predicted 235000.00
House 67 with 460.0 sqmt was sold for [950000.] euros, but Random Forest with 10 estimators predicted 1300000.00
House 170 with 360.0 sqmt was sold for [2150000.] euros, but Random Forest with 10 estimators predicted 2150000.00
House 189 with 240.0 sqmt was sold for [1600000.] euros, but Random Forest with 10 estimators predicted 1575000.00
House 207 with 86.0 sqmt was sold for [129000.] euros, but Random Forest with 10 estimators predicted 127000.00
The mean square error of Random Forest with 10 estimators is 472473.67 euro


In [6]:
# n_estimators = 25:
np.random.seed(1)

rf = RandomForestClassifier(n_estimators=25).fit(training_data_x,training_data_y.ravel())

# Create yhat one at a time from test x
yhats = []
for x in test_data_x:
    yhat = rf.predict([x])
    yhats.append(yhat[0])

# Examples showing the efficacy of this model
examples = [45, 67, 170, 189, 207]
for i in examples:
    x = test_data_x[i]
    y = test_data_y[i]
    yhat = rf.predict([x])[0]
    print(f"House {i} with {x[0]} sqmt was sold for {y} euros, but Random Forest with 25 estimators predicted {yhat:.2f}")

# Root mean squared error (RMSE) of Random Forest:
error = sklearn.metrics.mean_squared_error(yhats, test_data_y, squared=False)
print(f"The mean square error of Random Forest with 25 estimators is {error:.2f} euro")

House 45 with 70.0 sqmt was sold for [218000.] euros, but Random Forest with 25 estimators predicted 145000.00
House 67 with 460.0 sqmt was sold for [950000.] euros, but Random Forest with 25 estimators predicted 1300000.00
House 170 with 360.0 sqmt was sold for [2150000.] euros, but Random Forest with 25 estimators predicted 2150000.00
House 189 with 240.0 sqmt was sold for [1600000.] euros, but Random Forest with 25 estimators predicted 1575000.00
House 207 with 86.0 sqmt was sold for [129000.] euros, but Random Forest with 25 estimators predicted 156000.00
The mean square error of Random Forest with 25 estimators is 462026.46 euro


In [7]:
# n_estimators = 100:
np.random.seed(1)

rf = RandomForestClassifier(n_estimators=100).fit(training_data_x,training_data_y.ravel())

# Create yhat one at a time from test x
yhats = []
for x in test_data_x:
    yhat = rf.predict([x])
    yhats.append(yhat[0])

# Examples showing the efficacy of this model
examples = [45, 67, 170, 189, 207]
for i in examples:
    x = test_data_x[i]
    y = test_data_y[i]
    yhat = rf.predict([x])[0]
    print(f"House {i} with {x[0]} sqmt was sold for {y} euros, but Random Forest with 100 estimators (Default) predicted {yhat:.2f}")

# Root mean squared error (RMSE) of Random Forest:
error = sklearn.metrics.mean_squared_error(yhats, test_data_y, squared=False)
print(f"The mean square error of Random Forest with 100 estimators (Default) is {error:.2f} euro")

House 45 with 70.0 sqmt was sold for [218000.] euros, but Random Forest with 100 estimators (Default) predicted 186000.00
House 67 with 460.0 sqmt was sold for [950000.] euros, but Random Forest with 100 estimators (Default) predicted 1300000.00
House 170 with 360.0 sqmt was sold for [2150000.] euros, but Random Forest with 100 estimators (Default) predicted 870000.00
House 189 with 240.0 sqmt was sold for [1600000.] euros, but Random Forest with 100 estimators (Default) predicted 1575000.00
House 207 with 86.0 sqmt was sold for [129000.] euros, but Random Forest with 100 estimators (Default) predicted 129000.00
The mean square error of Random Forest with 100 estimators (Default) is 464957.87 euro


In [8]:
# n_estimators = 200:
np.random.seed(1)

rf = RandomForestClassifier(n_estimators=200).fit(training_data_x,training_data_y.ravel())

# Create yhat one at a time from test x
yhats = []
for x in test_data_x:
    yhat = rf.predict([x])
    yhats.append(yhat[0])

# Examples showing the efficacy of this model
examples = [45, 67, 170, 189, 207]
for i in examples:
    x = test_data_x[i]
    y = test_data_y[i]
    yhat = rf.predict([x])[0]
    print(f"House {i} with {x[0]} sqmt was sold for {y} euros, but Random Forest with 200 estimators predicted {yhat:.2f}")

# Root mean squared error (RMSE) of Random Forest:
error = sklearn.metrics.mean_squared_error(yhats, test_data_y, squared=False)
print(f"The mean square error of Random Forest with 200 estimators is {error:.2f} euro")

House 45 with 70.0 sqmt was sold for [218000.] euros, but Random Forest with 200 estimators predicted 186000.00
House 67 with 460.0 sqmt was sold for [950000.] euros, but Random Forest with 200 estimators predicted 1300000.00
House 170 with 360.0 sqmt was sold for [2150000.] euros, but Random Forest with 200 estimators predicted 870000.00
House 189 with 240.0 sqmt was sold for [1600000.] euros, but Random Forest with 200 estimators predicted 1575000.00
House 207 with 86.0 sqmt was sold for [129000.] euros, but Random Forest with 200 estimators predicted 188000.00
The mean square error of Random Forest with 200 estimators is 441489.29 euro


In [9]:
# max_depth = 1:
np.random.seed(1)

rf = RandomForestClassifier(max_depth=1).fit(training_data_x,training_data_y.ravel())

# Create yhat one at a time from test x
yhats = []
for x in test_data_x:
    yhat = rf.predict([x])
    yhats.append(yhat[0])

# Examples showing the efficacy of this model
examples = [45, 67, 170, 189, 207]
for i in examples:
    x = test_data_x[i]
    y = test_data_y[i]
    yhat = rf.predict([x])[0]
    print(f"House {i} with {x[0]} sqmt was sold for {y} euros, but Random Forest with Max Depth 1 predicted {yhat:.2f}")

# Root mean squared error (RMSE) of Random Forest:
error = sklearn.metrics.mean_squared_error(yhats, test_data_y, squared=False)
print(f"The mean square error of Random Forest with Max Depth 1 is {error:.2f} euro")

House 45 with 70.0 sqmt was sold for [218000.] euros, but Random Forest with Max Depth 1 predicted 550000.00
House 67 with 460.0 sqmt was sold for [950000.] euros, but Random Forest with Max Depth 1 predicted 650000.00
House 170 with 360.0 sqmt was sold for [2150000.] euros, but Random Forest with Max Depth 1 predicted 650000.00
House 189 with 240.0 sqmt was sold for [1600000.] euros, but Random Forest with Max Depth 1 predicted 650000.00
House 207 with 86.0 sqmt was sold for [129000.] euros, but Random Forest with Max Depth 1 predicted 550000.00
The mean square error of Random Forest with Max Depth 1 is 715704.78 euro


In [10]:
# max_depth = 2:
np.random.seed(1)

rf = RandomForestClassifier(max_depth=2).fit(training_data_x,training_data_y.ravel())

# Create yhat one at a time from test x
yhats = []
for x in test_data_x:
    yhat = rf.predict([x])
    yhats.append(yhat[0])

# Examples showing the efficacy of this model
examples = [45, 67, 170, 189, 207]
for i in examples:
    x = test_data_x[i]
    y = test_data_y[i]
    yhat = rf.predict([x])[0]
    print(f"House {i} with {x[0]} sqmt was sold for {y} euros, but Random Forest with Max Depth 2 predicted {yhat:.2f}")

# Root mean squared error (RMSE) of Random Forest:
error = sklearn.metrics.mean_squared_error(yhats, test_data_y, squared=False)
print(f"The mean square error of Random Forest with Max Depth 2 is {error:.2f} euro")

House 45 with 70.0 sqmt was sold for [218000.] euros, but Random Forest with Max Depth 2 predicted 210000.00
House 67 with 460.0 sqmt was sold for [950000.] euros, but Random Forest with Max Depth 2 predicted 1800000.00
House 170 with 360.0 sqmt was sold for [2150000.] euros, but Random Forest with Max Depth 2 predicted 1250000.00
House 189 with 240.0 sqmt was sold for [1600000.] euros, but Random Forest with Max Depth 2 predicted 950000.00
House 207 with 86.0 sqmt was sold for [129000.] euros, but Random Forest with Max Depth 2 predicted 210000.00
The mean square error of Random Forest with Max Depth 2 is 544482.11 euro


In [11]:
# max_depth = 4:
np.random.seed(1)

rf = RandomForestClassifier(max_depth=4).fit(training_data_x,training_data_y.ravel())

# Create yhat one at a time from test x
yhats = []
for x in test_data_x:
    yhat = rf.predict([x])
    yhats.append(yhat[0])

# Examples showing the efficacy of this model
examples = [45, 67, 170, 189, 207]
for i in examples:
    x = test_data_x[i]
    y = test_data_y[i]
    yhat = rf.predict([x])[0]
    print(f"House {i} with {x[0]} sqmt was sold for {y} euros, but Random Forest with Max Depth 4 predicted {yhat:.2f}")

# Root mean squared error (RMSE) of Random Forest:
error = sklearn.metrics.mean_squared_error(yhats, test_data_y, squared=False)
print(f"The mean square error of Random Forest with Max Depth 4 is {error:.2f} euro")

House 45 with 70.0 sqmt was sold for [218000.] euros, but Random Forest with Max Depth 4 predicted 155000.00
House 67 with 460.0 sqmt was sold for [950000.] euros, but Random Forest with Max Depth 4 predicted 1850000.00
House 170 with 360.0 sqmt was sold for [2150000.] euros, but Random Forest with Max Depth 4 predicted 2100000.00
House 189 with 240.0 sqmt was sold for [1600000.] euros, but Random Forest with Max Depth 4 predicted 950000.00
House 207 with 86.0 sqmt was sold for [129000.] euros, but Random Forest with Max Depth 4 predicted 155000.00
The mean square error of Random Forest with Max Depth 4 is 456202.14 euro


In [12]:
# max_depth = 8:
np.random.seed(1)

rf = RandomForestClassifier(max_depth=8).fit(training_data_x,training_data_y.ravel())

# Create yhat one at a time from test x
yhats = []
for x in test_data_x:
    yhat = rf.predict([x])
    yhats.append(yhat[0])

# Examples showing the efficacy of this model
examples = [45, 67, 170, 189, 207]
for i in examples:
    x = test_data_x[i]
    y = test_data_y[i]
    yhat = rf.predict([x])[0]
    print(f"House {i} with {x[0]} sqmt was sold for {y} euros, but Random Forest with Max Depth 8 predicted {yhat:.2f}")

# Root mean squared error (RMSE) of Random Forest:
error = sklearn.metrics.mean_squared_error(yhats, test_data_y, squared=False)
print(f"The mean square error of Random Forest with Max Depth 8 is {error:.2f} euro")

House 45 with 70.0 sqmt was sold for [218000.] euros, but Random Forest with Max Depth 8 predicted 210000.00
House 67 with 460.0 sqmt was sold for [950000.] euros, but Random Forest with Max Depth 8 predicted 3200000.00
House 170 with 360.0 sqmt was sold for [2150000.] euros, but Random Forest with Max Depth 8 predicted 1150000.00
House 189 with 240.0 sqmt was sold for [1600000.] euros, but Random Forest with Max Depth 8 predicted 895000.00
House 207 with 86.0 sqmt was sold for [129000.] euros, but Random Forest with Max Depth 8 predicted 170000.00
The mean square error of Random Forest with Max Depth 8 is 429131.01 euro


In [13]:
# max_depth = 16:
np.random.seed(1)

rf = RandomForestClassifier(max_depth=16).fit(training_data_x,training_data_y.ravel())

# Create yhat one at a time from test x
yhats = []
for x in test_data_x:
    yhat = rf.predict([x])
    yhats.append(yhat[0])

# Examples showing the efficacy of this model
examples = [45, 67, 170, 189, 207]
for i in examples:
    x = test_data_x[i]
    y = test_data_y[i]
    yhat = rf.predict([x])[0]
    print(f"House {i} with {x[0]} sqmt was sold for {y} euros, but Random Forest with Max Depth 16 predicted {yhat:.2f}")

# Root mean squared error (RMSE) of Random Forest:
error = sklearn.metrics.mean_squared_error(yhats, test_data_y, squared=False)
print(f"The mean square error of Random Forest with Max Depth 16 is {error:.2f} euro")

House 45 with 70.0 sqmt was sold for [218000.] euros, but Random Forest with Max Depth 16 predicted 165000.00
House 67 with 460.0 sqmt was sold for [950000.] euros, but Random Forest with Max Depth 16 predicted 1300000.00
House 170 with 360.0 sqmt was sold for [2150000.] euros, but Random Forest with Max Depth 16 predicted 870000.00
House 189 with 240.0 sqmt was sold for [1600000.] euros, but Random Forest with Max Depth 16 predicted 1575000.00
House 207 with 86.0 sqmt was sold for [129000.] euros, but Random Forest with Max Depth 16 predicted 210000.00
The mean square error of Random Forest with Max Depth 16 is 461845.72 euro


In [14]:
# max_depth = None:
np.random.seed(1)

rf = RandomForestClassifier(max_depth=None).fit(training_data_x,training_data_y.ravel())

# Create yhat one at a time from test x
yhats = []
for x in test_data_x:
    yhat = rf.predict([x])
    yhats.append(yhat[0])

# Examples showing the efficacy of this model
examples = [45, 67, 170, 189, 207]
for i in examples:
    x = test_data_x[i]
    y = test_data_y[i]
    yhat = rf.predict([x])[0]
    print(f"House {i} with {x[0]} sqmt was sold for {y} euros, but Random Forest with Max Depth None (Default) predicted {yhat:.2f}")

# Root mean squared error (RMSE) of Random Forest:
error = sklearn.metrics.mean_squared_error(yhats, test_data_y, squared=False)
print(f"The mean square error of Random Forest with Max Depth None (Default) is {error:.2f} euro")

House 45 with 70.0 sqmt was sold for [218000.] euros, but Random Forest with Max Depth None (Default) predicted 186000.00
House 67 with 460.0 sqmt was sold for [950000.] euros, but Random Forest with Max Depth None (Default) predicted 1300000.00
House 170 with 360.0 sqmt was sold for [2150000.] euros, but Random Forest with Max Depth None (Default) predicted 870000.00
House 189 with 240.0 sqmt was sold for [1600000.] euros, but Random Forest with Max Depth None (Default) predicted 1575000.00
House 207 with 86.0 sqmt was sold for [129000.] euros, but Random Forest with Max Depth None (Default) predicted 129000.00
The mean square error of Random Forest with Max Depth None (Default) is 464957.87 euro


# Questions: 
* Q: Do you find that Random Forest performs better than the previous approaches you implemented? Discuss. 
* A: As the values listed below show, the Random Forest error depends strongly on its hyperparameters, such as the number of estimators and the maximum depth, so it can do better or worse than another model depending on how it is configured. Overall, though, linear regression does better on this dataset: its error is about 388,000 euros, whereas every Random Forest configuration is above 430,000. By the same comparison, Random Forest also does worse than grid search, random search, and ridge. Against kNN the picture is mixed: most Random Forest configurations beat kNN with k = 1, but kNN with well-chosen parameters (for example k = 20 with distance weighting) is better than any Random Forest setting tried here. Random Forest therefore does not appear to be the best model for this dataset, as it generally has higher errors than the other techniques.
    * Random Forest MSEs: (These can change between runs because Random Forests are randomized; most differ from the cell outputs above)
        * Default Random Forest: 464957.87
        * 10 Estimators: 446286.40
        * 25 Estimators: 457670.26
        * 100 Estimators (Default): 442164.61
        * 200 Estimators: 441326.10
        * Max Depth 1: 690792.12
        * Max Depth 2: 531027.36
        * Max Depth 4: 457221.31
        * Max Depth 8: 432126.40
        * Max Depth 16: 445192.75
        * Max Depth None (Default): 451945.06
    * Linear Regression MSE: 388066.28
    * Grid search: 401108.81
    * Random search: 395892.43
    * Ridge: 394881.47
    * Multivariate kNN MSEs:
        * kNN = 1: 479167.53
        * kNN = 3: 403527.37
        * kNN = 15: 393917.63
        * kNN = 20: 393440.89
        * The following are all k = 20:
            * Uniform weight parameter (All points weighted equally): 393440.89
            * Distance weight parameter (Closer points weighted more): 377271.89
            * Euclidean distance parameter with Uniform weight: 393440.89 (Notice that this is the same as uniform weight, because Euclidean is the default metric)
            * Manhattan distance parameter with Uniform weight: 390028.71
            * Euclidean with Distance weight: 377271.89 (Notice that this is the same as distance weight, because Euclidean is the default metric)
            * Manhattan with Distance weight: 374194.76
            * Chebyshev distance parameter with Uniform weight: 396331.16
            * Chebyshev distance parameter with Distance weight: 380412.40





* Q: Explain the impact of the number of estimators and max tree depth hyperparameters on the accuracy. Which hyperparameter setting gives you the best value? Is this the same as the default settings in sklearn?
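* A: Broadly, more estimators gave a lower error, although not monotonically: in the values listed above, 25 estimators did slightly worse than 10, while 100 and 200 were the two best. Deeper trees also helped up to a point. The error fell sharply from max_depth = 1 to max_depth = 8 and then rose slightly for 16 and None. Because the two hyperparameters were varied one at a time, the best result observed was max_depth = 8 with n_estimators left at its default of 100. This differs from the sklearn defaults (n_estimators = 100, max_depth = None) only in max_depth.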
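
Varying one hyperparameter at a time does not show how the two interact. A hedged sketch of a joint sweep over every (n_estimators, max_depth) pair follows, again on synthetic data rather than the housing arrays. Selecting on the test set mirrors what this notebook does; a held-out validation split would be the more careful choice.

```python
import itertools
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Synthetic regression data (an assumption, not the Madrid dataset).
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(200, 4))
y = 100 * X[:, 0] + 50 * X[:, 1] ** 2 + rng.normal(0, 5, size=200)
X_train, X_test, y_train, y_test = X[:160], X[160:], y[:160], y[160:]

# Test RMSE for every combination of the two hyperparameters.
grid = {}
for n, depth in itertools.product((10, 25, 100, 200), (1, 2, 4, 8, 16, None)):
    model = RandomForestRegressor(n_estimators=n, max_depth=depth, random_state=1)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    grid[(n, depth)] = float(np.sqrt(mean_squared_error(y_test, preds)))

best_n, best_depth = min(grid, key=grid.get)
print(f"Best pair: n_estimators={best_n}, max_depth={best_depth}, "
      f"RMSE {grid[(best_n, best_depth)]:.2f}")
```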


* Q: Explain the impact of the hyperparameters on the training time. 
* A: From the training times listed below, training time grows with the number of estimators, and roughly in proportion to it, at least on my computer. Going from 25 to 100 estimators (4x) would predict 4 s x 4 = 16 s, close to the observed 15 s. Going from 100 to 200 estimators (2x) would predict 15 s x 2 = 30 s, close to the observed 32 s. Training time also tends to increase with maximum depth, although it is nearly flat for depths 1 to 8 and rises noticeably only at 16 and None.
    * 10 estimators: 2s
    * 25 estimators: 4s
    * 100 estimators: 15s
    * 200 estimators: 32s
    * max depth 1: 8s
    * max depth 2: 7.5s
    * max depth 4: 7.8s
    * max depth 8: 8.5s
    * max depth 16: 13s
    * max depth None: 16s
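
Training times like those above can be measured in code instead of read off the notebook. The sketch below times `fit` with `time.perf_counter` for each value of `n_estimators`; the data is synthetic and the `fit_seconds` dictionary is a hypothetical name, so the absolute numbers will not match the ones reported above.

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (an assumption, not the Madrid dataset).
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(500, 5))
y = X.sum(axis=1) + rng.normal(0, 0.1, size=500)

# Wall-clock time of fit() for each forest size.
fit_seconds = {}
for n in (10, 25, 100, 200):
    model = RandomForestRegressor(n_estimators=n, random_state=1)
    start = time.perf_counter()
    model.fit(X, y)
    fit_seconds[n] = time.perf_counter() - start
    print(f"n_estimators={n}: {fit_seconds[n]:.3f} s")
```

The same loop works for `max_depth` by changing the parameter that is varied.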



# P2: AdaBoost using sklearn (5 points)

Use the AdaBoostRegressor class from sklearn to predict the house prices. Print the resulting error and samples, similar to the way in Section B. 

Experiment with the settings of the hyperparameters: loss (try "linear", "square" and "exponential") and learning_rate (try at least values 0.2, 0.5, 1 and 2).
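
One way to organize this experiment is a single loop over every (loss, learning_rate) pair. The sketch below assumes synthetic data in place of the housing arrays and collects the test RMSE of each `AdaBoostRegressor` in a dictionary (`results`, a hypothetical name).

```python
import itertools
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error

# Synthetic regression data (an assumption, not the Madrid dataset).
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(300, 4))
y = 100 * X[:, 0] + 50 * X[:, 1] ** 2 + rng.normal(0, 5, size=300)
X_train, X_test, y_train, y_test = X[:240], X[240:], y[:240], y[240:]

losses = ("linear", "square", "exponential")
rates = (0.2, 0.5, 1.0, 2.0)

# Test RMSE for every combination of loss and learning rate.
results = {}
for loss, rate in itertools.product(losses, rates):
    model = AdaBoostRegressor(loss=loss, learning_rate=rate, random_state=1)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    results[(loss, rate)] = float(np.sqrt(mean_squared_error(y_test, preds)))

for (loss, rate), rmse in results.items():
    print(f"loss={loss}, learning_rate={rate}: RMSE {rmse:.2f}")
```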

In [15]:
# TODO implement here
# linear loss is default, learning rate is 1 by default
from sklearn.ensemble import AdaBoostRegressor
np.random.seed(1)
ada = AdaBoostRegressor().fit(training_data_x,training_data_y.ravel())

# Create yhat one at a time from test x
yhats = []
for x in test_data_x:
    yhat = ada.predict([x])
    yhats.append(yhat[0])

# Examples showing the efficacy of this model
examples = [45, 67, 170, 189, 207]
for i in examples:
    x = test_data_x[i]
    y = test_data_y[i]
    yhat = ada.predict([x])[0]
    print(f"House {i} with {x[0]} sqmt was sold for {y} euros, but AdaBoost (Linear Loss, Learning Rate 1) predicted {yhat:.2f}")

# Root mean squared error (RMSE) of AdaBoost:
error = sklearn.metrics.mean_squared_error(yhats, test_data_y, squared=False)
print(f"The mean square error of AdaBoost (Linear Loss, Learning Rate 1) is {error:.2f} euro")

House 45 with 70.0 sqmt was sold for [218000.] euros, but AdaBoost (Linear Loss, Learning Rate 1) predicted 319010.72
House 67 with 460.0 sqmt was sold for [950000.] euros, but AdaBoost (Linear Loss, Learning Rate 1) predicted 2021583.00
House 170 with 360.0 sqmt was sold for [2150000.] euros, but AdaBoost (Linear Loss, Learning Rate 1) predicted 1917245.73
House 189 with 240.0 sqmt was sold for [1600000.] euros, but AdaBoost (Linear Loss, Learning Rate 1) predicted 1311297.85
House 207 with 86.0 sqmt was sold for [129000.] euros, but AdaBoost (Linear Loss, Learning Rate 1) predicted 319010.72
The mean square error of AdaBoost (Linear Loss, Learning Rate 1) is 422594.14 euro


In [16]:
#square loss
np.random.seed(1)
ada = AdaBoostRegressor(loss='square').fit(training_data_x,training_data_y.ravel())

# Create yhat one at a time from test x
yhats = []
for x in test_data_x:
    yhat = ada.predict([x])
    yhats.append(yhat[0])

# Examples showing the efficacy of this model
examples = [45, 67, 170, 189, 207]
for i in examples:
    x = test_data_x[i]
    y = test_data_y[i]
    yhat = ada.predict([x])[0]
    print(f"House {i} with {x[0]} sqmt was sold for {y} euros, but AdaBoost (Square Loss, Learning Rate 1) predicted {yhat:.2f}")

# Root mean squared error (RMSE) of AdaBoost:
error = sklearn.metrics.mean_squared_error(yhats, test_data_y, squared=False)
print(f"The mean square error of AdaBoost (Square Loss, Learning Rate 1) is {error:.2f} euro")

House 45 with 70.0 sqmt was sold for [218000.] euros, but AdaBoost (Square Loss, Learning Rate 1) predicted 463107.03
House 67 with 460.0 sqmt was sold for [950000.] euros, but AdaBoost (Square Loss, Learning Rate 1) predicted 2373567.67
House 170 with 360.0 sqmt was sold for [2150000.] euros, but AdaBoost (Square Loss, Learning Rate 1) predicted 2170600.55
House 189 with 240.0 sqmt was sold for [1600000.] euros, but AdaBoost (Square Loss, Learning Rate 1) predicted 1280926.33
House 207 with 86.0 sqmt was sold for [129000.] euros, but AdaBoost (Square Loss, Learning Rate 1) predicted 463107.03
The mean square error of AdaBoost (Square Loss, Learning Rate 1) is 454124.59 euro


In [17]:
#exponential loss
np.random.seed(1)
ada = AdaBoostRegressor(loss='exponential').fit(training_data_x,training_data_y.ravel())

# Create yhat one at a time from test x
yhats = []
for x in test_data_x:
    yhat = ada.predict([x])
    yhats.append(yhat[0])

# Examples showing the efficacy of this model
examples = [45, 67, 170, 189, 207]
for i in examples:
    x = test_data_x[i]
    y = test_data_y[i]
    yhat = ada.predict([x])[0]
    print(f"House {i} with {x[0]} sqmt was sold for {y} euros, but AdaBoost (Exponential Loss, Learning Rate 1) predicted {yhat:.2f}")

# Root mean squared error (RMSE) of AdaBoost:
error = sklearn.metrics.mean_squared_error(yhats, test_data_y, squared=False)
print(f"The mean square error of AdaBoost (Exponential Loss, Learning Rate 1) is {error:.2f} euro")

House 45 with 70.0 sqmt was sold for [218000.] euros, but AdaBoost (Exponential Loss, Learning Rate 1) predicted 578176.93
House 67 with 460.0 sqmt was sold for [950000.] euros, but AdaBoost (Exponential Loss, Learning Rate 1) predicted 2669300.27
House 170 with 360.0 sqmt was sold for [2150000.] euros, but AdaBoost (Exponential Loss, Learning Rate 1) predicted 2421636.31
House 189 with 240.0 sqmt was sold for [1600000.] euros, but AdaBoost (Exponential Loss, Learning Rate 1) predicted 1461619.21
House 207 with 86.0 sqmt was sold for [129000.] euros, but AdaBoost (Exponential Loss, Learning Rate 1) predicted 578176.93
The mean square error of AdaBoost (Exponential Loss, Learning Rate 1) is 552471.26 euro


In [21]:
# Learning Rate 0.2

# linear loss
from sklearn.ensemble import AdaBoostRegressor
np.random.seed(1)
ada = AdaBoostRegressor(learning_rate=0.2).fit(training_data_x,training_data_y.ravel())

# Create yhat one at a time from test x
yhats = []
for x in test_data_x:
    yhat = ada.predict([x])
    yhats.append(yhat[0])

# Examples showing the efficacy of this model
examples = [45, 67, 170, 189, 207]
for i in examples:
    x = test_data_x[i]
    y = test_data_y[i]
    yhat = ada.predict([x])[0]
    print(f"House {i} with {x[0]} sqmt was sold for {y} euros, but AdaBoost (Linear Loss, Learning Rate 0.2) predicted {yhat:.2f}")

# Root mean squared error (RMSE) of AdaBoost:
error = sklearn.metrics.mean_squared_error(yhats, test_data_y, squared=False)
print(f"The mean square error of AdaBoost (Linear Loss, Learning Rate 0.2) is {error:.2f} euro")

House 45 with 70.0 sqmt was sold for [218000.] euros, but AdaBoost (Linear Loss, Learning Rate 0.2) predicted 318968.48
House 67 with 460.0 sqmt was sold for [950000.] euros, but AdaBoost (Linear Loss, Learning Rate 0.2) predicted 2384084.56
House 170 with 360.0 sqmt was sold for [2150000.] euros, but AdaBoost (Linear Loss, Learning Rate 0.2) predicted 1893228.97
House 189 with 240.0 sqmt was sold for [1600000.] euros, but AdaBoost (Linear Loss, Learning Rate 0.2) predicted 1365918.02
House 207 with 86.0 sqmt was sold for [129000.] euros, but AdaBoost (Linear Loss, Learning Rate 0.2) predicted 318968.48
The mean square error of AdaBoost (Linear Loss, Learning Rate 0.2) is 429843.59 euro


In [22]:
#square loss
np.random.seed(1)
ada = AdaBoostRegressor(loss='square',learning_rate=0.2).fit(training_data_x,training_data_y.ravel())

# Create yhat one at a time from test x
yhats = []
for x in test_data_x:
    yhat = ada.predict([x])
    yhats.append(yhat[0])

# Examples showing the efficacy of this model
examples = [45, 67, 170, 189, 207]
for i in examples:
    x = test_data_x[i]
    y = test_data_y[i]
    yhat = ada.predict([x])[0]
    print(f"House {i} with {x[0]} sqmt was sold for {y} euros, but AdaBoost (Square Loss, Learning Rate 0.2) predicted {yhat:.2f}")

# Root mean squared error (RMSE) of AdaBoost:
error = sklearn.metrics.mean_squared_error(yhats, test_data_y, squared=False)
print(f"The mean square error of AdaBoost (Square Loss, Learning Rate 0.2) is {error:.2f} euro")

House 45 with 70.0 sqmt was sold for [218000.] euros, but AdaBoost (Square Loss, Learning Rate 0.2) predicted 307040.20
House 67 with 460.0 sqmt was sold for [950000.] euros, but AdaBoost (Square Loss, Learning Rate 0.2) predicted 2287317.55
House 170 with 360.0 sqmt was sold for [2150000.] euros, but AdaBoost (Square Loss, Learning Rate 0.2) predicted 1835721.49
House 189 with 240.0 sqmt was sold for [1600000.] euros, but AdaBoost (Square Loss, Learning Rate 0.2) predicted 1259458.04
House 207 with 86.0 sqmt was sold for [129000.] euros, but AdaBoost (Square Loss, Learning Rate 0.2) predicted 307040.20
The mean square error of AdaBoost (Square Loss, Learning Rate 0.2) is 441601.39 euro


In [23]:
#exponential loss
np.random.seed(1)
ada = AdaBoostRegressor(loss='exponential',learning_rate=0.2).fit(training_data_x,training_data_y.ravel())

# Create yhat one at a time from test x
yhats = []
for x in test_data_x:
    yhat = ada.predict([x])
    yhats.append(yhat[0])

# Examples showing the efficacy of this model
examples = [45, 67, 170, 189, 207]
for i in examples:
    x = test_data_x[i]
    y = test_data_y[i]
    yhat = ada.predict([x])[0]
    print(f"House {i} with {x[0]} sqmt was sold for {y} euros, but AdaBoost (Exponential Loss, Learning Rate 0.2) predicted {yhat:.2f}")

# Root mean squared error of AdaBoost on the test set (squared=False returns the RMSE, in euros):
error = sklearn.metrics.mean_squared_error(yhats, test_data_y, squared=False)
print(f"The mean square error of AdaBoost (Exponential Loss, Learning Rate 0.2) is {error:.2f} euro")

House 45 with 70.0 sqmt was sold for [218000.] euros, but AdaBoost (Exponential Loss, Learning Rate 0.2) predicted 321307.56
House 67 with 460.0 sqmt was sold for [950000.] euros, but AdaBoost (Exponential Loss, Learning Rate 0.2) predicted 2298739.03
House 170 with 360.0 sqmt was sold for [2150000.] euros, but AdaBoost (Exponential Loss, Learning Rate 0.2) predicted 1928194.23
House 189 with 240.0 sqmt was sold for [1600000.] euros, but AdaBoost (Exponential Loss, Learning Rate 0.2) predicted 1376737.64
House 207 with 86.0 sqmt was sold for [129000.] euros, but AdaBoost (Exponential Loss, Learning Rate 0.2) predicted 321307.56
The mean square error of AdaBoost (Exponential Loss, Learning Rate 0.2) is 425379.25 euro


In [24]:
# Learning Rate 0.5

# linear loss
from sklearn.ensemble import AdaBoostRegressor
np.random.seed(1)
ada = AdaBoostRegressor(learning_rate=0.5).fit(training_data_x,training_data_y.ravel())

# Create yhat one at a time from test x
yhats = []
for x in test_data_x:
    yhat = ada.predict([x])
    yhats.append(yhat[0])

# Examples showing the efficacy of this model
examples = [45, 67, 170, 189, 207]
for i in examples:
    x = test_data_x[i]
    y = test_data_y[i]
    yhat = ada.predict([x])[0]
    print(f"House {i} with {x[0]} sqmt was sold for {y} euros, but AdaBoost (Linear Loss, Learning Rate 0.5) predicted {yhat:.2f}")

# Root mean squared error of AdaBoost on the test set (squared=False returns the RMSE, in euros):
error = sklearn.metrics.mean_squared_error(yhats, test_data_y, squared=False)
print(f"The mean square error of AdaBoost (Linear Loss, Learning Rate 0.5) is {error:.2f} euro")

House 45 with 70.0 sqmt was sold for [218000.] euros, but AdaBoost (Linear Loss, Learning Rate 0.5) predicted 320346.67
House 67 with 460.0 sqmt was sold for [950000.] euros, but AdaBoost (Linear Loss, Learning Rate 0.5) predicted 2169413.60
House 170 with 360.0 sqmt was sold for [2150000.] euros, but AdaBoost (Linear Loss, Learning Rate 0.5) predicted 1897647.98
House 189 with 240.0 sqmt was sold for [1600000.] euros, but AdaBoost (Linear Loss, Learning Rate 0.5) predicted 1316129.41
House 207 with 86.0 sqmt was sold for [129000.] euros, but AdaBoost (Linear Loss, Learning Rate 0.5) predicted 320346.67
The mean square error of AdaBoost (Linear Loss, Learning Rate 0.5) is 428720.01 euro


In [25]:
#square loss
np.random.seed(1)
ada = AdaBoostRegressor(loss='square',learning_rate=0.5).fit(training_data_x,training_data_y.ravel())

# Create yhat one at a time from test x
yhats = []
for x in test_data_x:
    yhat = ada.predict([x])
    yhats.append(yhat[0])

# Examples showing the efficacy of this model
examples = [45, 67, 170, 189, 207]
for i in examples:
    x = test_data_x[i]
    y = test_data_y[i]
    yhat = ada.predict([x])[0]
    print(f"House {i} with {x[0]} sqmt was sold for {y} euros, but AdaBoost (Square Loss, Learning Rate 0.5) predicted {yhat:.2f}")

# Root mean squared error of AdaBoost on the test set (squared=False returns the RMSE, in euros):
error = sklearn.metrics.mean_squared_error(yhats, test_data_y, squared=False)
print(f"The mean square error of AdaBoost (Square Loss, Learning Rate 0.5) is {error:.2f} euro")

House 45 with 70.0 sqmt was sold for [218000.] euros, but AdaBoost (Square Loss, Learning Rate 0.5) predicted 327474.21
House 67 with 460.0 sqmt was sold for [950000.] euros, but AdaBoost (Square Loss, Learning Rate 0.5) predicted 2640456.23
House 170 with 360.0 sqmt was sold for [2150000.] euros, but AdaBoost (Square Loss, Learning Rate 0.5) predicted 2135307.44
House 189 with 240.0 sqmt was sold for [1600000.] euros, but AdaBoost (Square Loss, Learning Rate 0.5) predicted 1264757.76
House 207 with 86.0 sqmt was sold for [129000.] euros, but AdaBoost (Square Loss, Learning Rate 0.5) predicted 327474.21
The mean square error of AdaBoost (Square Loss, Learning Rate 0.5) is 459890.88 euro


In [26]:
#exponential loss
np.random.seed(1)
ada = AdaBoostRegressor(loss='exponential',learning_rate=0.5).fit(training_data_x,training_data_y.ravel())

# Create yhat one at a time from test x
yhats = []
for x in test_data_x:
    yhat = ada.predict([x])
    yhats.append(yhat[0])

# Examples showing the efficacy of this model
examples = [45, 67, 170, 189, 207]
for i in examples:
    x = test_data_x[i]
    y = test_data_y[i]
    yhat = ada.predict([x])[0]
    print(f"House {i} with {x[0]} sqmt was sold for {y} euros, but AdaBoost (Exponential Loss, Learning Rate 0.5) predicted {yhat:.2f}")

# Root mean squared error of AdaBoost on the test set (squared=False returns the RMSE, in euros):
error = sklearn.metrics.mean_squared_error(yhats, test_data_y, squared=False)
print(f"The mean square error of AdaBoost (Exponential Loss, Learning Rate 0.5) is {error:.2f} euro")

House 45 with 70.0 sqmt was sold for [218000.] euros, but AdaBoost (Exponential Loss, Learning Rate 0.5) predicted 391314.24
House 67 with 460.0 sqmt was sold for [950000.] euros, but AdaBoost (Exponential Loss, Learning Rate 0.5) predicted 2524546.24
House 170 with 360.0 sqmt was sold for [2150000.] euros, but AdaBoost (Exponential Loss, Learning Rate 0.5) predicted 2015182.61
House 189 with 240.0 sqmt was sold for [1600000.] euros, but AdaBoost (Exponential Loss, Learning Rate 0.5) predicted 1352784.08
House 207 with 86.0 sqmt was sold for [129000.] euros, but AdaBoost (Exponential Loss, Learning Rate 0.5) predicted 391314.24
The mean square error of AdaBoost (Exponential Loss, Learning Rate 0.5) is 450257.76 euro


In [27]:
# Learning Rate 2

# linear loss
from sklearn.ensemble import AdaBoostRegressor
np.random.seed(1)
ada = AdaBoostRegressor(learning_rate=2).fit(training_data_x,training_data_y.ravel())

# Create yhat one at a time from test x
yhats = []
for x in test_data_x:
    yhat = ada.predict([x])
    yhats.append(yhat[0])

# Examples showing the efficacy of this model
examples = [45, 67, 170, 189, 207]
for i in examples:
    x = test_data_x[i]
    y = test_data_y[i]
    yhat = ada.predict([x])[0]
    print(f"House {i} with {x[0]} sqmt was sold for {y} euros, but AdaBoost (Linear Loss, Learning Rate 2) predicted {yhat:.2f}")

# Root mean squared error of AdaBoost on the test set (squared=False returns the RMSE, in euros):
error = sklearn.metrics.mean_squared_error(yhats, test_data_y, squared=False)
print(f"The mean square error of AdaBoost (Linear Loss, Learning Rate 2) is {error:.2f} euro")

House 45 with 70.0 sqmt was sold for [218000.] euros, but AdaBoost (Linear Loss, Learning Rate 2) predicted 410523.36
House 67 with 460.0 sqmt was sold for [950000.] euros, but AdaBoost (Linear Loss, Learning Rate 2) predicted 2267798.35
House 170 with 360.0 sqmt was sold for [2150000.] euros, but AdaBoost (Linear Loss, Learning Rate 2) predicted 2267798.35
House 189 with 240.0 sqmt was sold for [1600000.] euros, but AdaBoost (Linear Loss, Learning Rate 2) predicted 1492648.43
House 207 with 86.0 sqmt was sold for [129000.] euros, but AdaBoost (Linear Loss, Learning Rate 2) predicted 410523.36
The mean square error of AdaBoost (Linear Loss, Learning Rate 2) is 473509.03 euro


In [28]:
#square loss
np.random.seed(1)
ada = AdaBoostRegressor(loss='square',learning_rate=2).fit(training_data_x,training_data_y.ravel())

# Create yhat one at a time from test x
yhats = []
for x in test_data_x:
    yhat = ada.predict([x])
    yhats.append(yhat[0])

# Examples showing the efficacy of this model
examples = [45, 67, 170, 189, 207]
for i in examples:
    x = test_data_x[i]
    y = test_data_y[i]
    yhat = ada.predict([x])[0]
    print(f"House {i} with {x[0]} sqmt was sold for {y} euros, but AdaBoost (Square Loss, Learning Rate 2) predicted {yhat:.2f}")

# Root mean squared error of AdaBoost on the test set (squared=False returns the RMSE, in euros):
error = sklearn.metrics.mean_squared_error(yhats, test_data_y, squared=False)
print(f"The mean square error of AdaBoost (Square Loss, Learning Rate 2) is {error:.2f} euro")

House 45 with 70.0 sqmt was sold for [218000.] euros, but AdaBoost (Square Loss, Learning Rate 2) predicted 950000.00
House 67 with 460.0 sqmt was sold for [950000.] euros, but AdaBoost (Square Loss, Learning Rate 2) predicted 1845695.55
House 170 with 360.0 sqmt was sold for [2150000.] euros, but AdaBoost (Square Loss, Learning Rate 2) predicted 1748233.48
House 189 with 240.0 sqmt was sold for [1600000.] euros, but AdaBoost (Square Loss, Learning Rate 2) predicted 1070000.00
House 207 with 86.0 sqmt was sold for [129000.] euros, but AdaBoost (Square Loss, Learning Rate 2) predicted 950000.00
The mean square error of AdaBoost (Square Loss, Learning Rate 2) is 667926.89 euro


In [29]:
#exponential loss
np.random.seed(1)
ada = AdaBoostRegressor(loss='exponential',learning_rate=2).fit(training_data_x,training_data_y.ravel())

# Create yhat one at a time from test x
yhats = []
for x in test_data_x:
    yhat = ada.predict([x])
    yhats.append(yhat[0])

# Examples showing the efficacy of this model
examples = [45, 67, 170, 189, 207]
for i in examples:
    x = test_data_x[i]
    y = test_data_y[i]
    yhat = ada.predict([x])[0]
    print(f"House {i} with {x[0]} sqmt was sold for {y} euros, but AdaBoost (Exponential Loss, Learning Rate 2) predicted {yhat:.2f}")

# Root mean squared error of AdaBoost on the test set (squared=False returns the RMSE, in euros):
error = sklearn.metrics.mean_squared_error(yhats, test_data_y, squared=False)
print(f"The mean square error of AdaBoost (Exponential Loss, Learning Rate 2) is {error:.2f} euro")

House 45 with 70.0 sqmt was sold for [218000.] euros, but AdaBoost (Exponential Loss, Learning Rate 2) predicted 789090.59
House 67 with 460.0 sqmt was sold for [950000.] euros, but AdaBoost (Exponential Loss, Learning Rate 2) predicted 2461907.68
House 170 with 360.0 sqmt was sold for [2150000.] euros, but AdaBoost (Exponential Loss, Learning Rate 2) predicted 2361805.87
House 189 with 240.0 sqmt was sold for [1600000.] euros, but AdaBoost (Exponential Loss, Learning Rate 2) predicted 1481088.19
House 207 with 86.0 sqmt was sold for [129000.] euros, but AdaBoost (Exponential Loss, Learning Rate 2) predicted 789090.59
The mean square error of AdaBoost (Exponential Loss, Learning Rate 2) is 672398.53 euro
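The cells above repeat the same fit, predict, and score steps for every combination of loss and learning rate. A minimal sketch of how that sweep could be written as one loop is shown below. The function name `sweep_adaboost`, the result keys, and the synthetic stand-in data are assumptions made for illustration and are not part of the assignment. The sketch passes `random_state` instead of calling `np.random.seed`, and it computes the RMSE as the square root of `mean_squared_error`, so its numbers are not expected to match the outputs above exactly. It also records the wall-clock fit time, which is one way to back up statements about training time.

```python
import time

import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error


def sweep_adaboost(train_x, train_y, test_x, test_y,
                   losses=("linear", "square", "exponential"),
                   learning_rates=(0.2, 0.5, 1.0, 2.0),
                   n_estimators=50, seed=1):
    """Fit one AdaBoostRegressor per (loss, learning_rate) pair.

    Returns a list of dicts holding the test RMSE and the fit time in seconds.
    """
    results = []
    for lr in learning_rates:
        for loss in losses:
            model = AdaBoostRegressor(loss=loss, learning_rate=lr,
                                      n_estimators=n_estimators,
                                      random_state=seed)
            start = time.perf_counter()
            model.fit(train_x, np.ravel(train_y))
            fit_seconds = time.perf_counter() - start

            # Predict the whole test set in one call instead of row by row.
            yhat = model.predict(test_x)
            rmse = float(np.sqrt(mean_squared_error(np.ravel(test_y), yhat)))
            results.append({"loss": loss, "learning_rate": lr,
                            "rmse": rmse, "fit_seconds": fit_seconds})
    return results


# Synthetic stand-in data (hypothetical); in the notebook you would pass
# training_data_x, training_data_y, test_data_x, test_data_y instead.
rng = np.random.default_rng(0)
demo_x = rng.uniform(0, 10, size=(300, 3))
demo_y = 3.0 * demo_x[:, 0] + rng.normal(0, 1, size=300)

demo_results = sweep_adaboost(demo_x[:200], demo_y[:200],
                              demo_x[200:], demo_y[200:],
                              n_estimators=10)
for row in demo_results:
    print(f"loss={row['loss']:<12} lr={row['learning_rate']:<4} "
          f"RMSE={row['rmse']:.3f} fit={row['fit_seconds']:.4f}s")
```

Collecting the results in one list makes it straightforward to sort by RMSE or by fit time rather than reading the numbers off separate cell outputs.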


# Questions: 
* Q: Do you find that Adaboost performs better than the previous approaches you implemented? Discuss. 
* A: Depending on the hyperparameters, AdaBoost can perform quite badly. With square or exponential loss and a learning rate of 2, the test error was above 600,000 euros. (The values are labelled MSE, but because the code calls `mean_squared_error` with `squared=False`, they are root mean squared errors in euros.) Most other settings give errors in the 400,000s, with a minimum of 422594.14. This means that AdaBoost does not perform as well as linear regression, grid search, random search, or ridge techniques. At its best, AdaBoost does surpass the best Random Forest result from our previous tests above. Depending on the parameters, AdaBoost can beat some kNN configurations, but kNN's best result is better than AdaBoost's best. In summary, AdaBoost does not appear to be the best model for this data, although it may do better than Random Forest. Based on these error values, a different model such as kNN would perform better.
    * AdaBoost test errors (labelled MSE in the output; actually RMSE, in euros):
        * AdaBoost (Linear Loss, Learning Rate 1) is 422594.14
        * AdaBoost (Square Loss, Learning Rate 1) is 454124.59
        * AdaBoost (Exponential Loss, Learning Rate 1) is 552471.26
        * AdaBoost (Linear Loss, Learning Rate 0.2) is 429843.59
        * AdaBoost (Square Loss, Learning Rate 0.2) is 441601.39
        * AdaBoost (Exponential Loss, Learning Rate 0.2) is 425379.25
        * AdaBoost (Linear Loss, Learning Rate 0.5) is 428720.01
        * AdaBoost (Square Loss, Learning Rate 0.5) is 459890.88
        * AdaBoost (Exponential Loss, Learning Rate 0.5) is 450257.76
        * AdaBoost (Linear Loss, Learning Rate 2) is 473509.03
        * AdaBoost (Square Loss, Learning Rate 2) is 667926.89
        * AdaBoost (Exponential Loss, Learning Rate 2) is 672398.53
* Q: Explain the impact of the loss and the learning_rate hyperparameters on the accuracy. Which hyperparameter setting gives you the best value? Is this the same as the default settings in sklearn?
* A: 
    * For the loss, it is difficult to give a clear-cut answer about which is best: as the values above show, the same loss can improve or worsen the accuracy depending on the learning rate. Averaged over the four learning rates, however, linear loss is the most accurate (about 439,000, versus about 506,000 for square and about 525,000 for exponential). Square and exponential losses perform similarly on average, with square loss slightly ahead. Overall, linear loss has the best effect on accuracy.
    * Higher learning rates lead to more variation in error across the different losses: the gap between the best and worst loss grows from about 16,000 at learning rate 0.2 to about 199,000 at learning rate 2. This makes sense, since the learning rate affects how much each "submodel" contributes to the overall boosted model. In addition, the average error rises as the learning rate increases (about 432,000 at 0.2, 446,000 at 0.5, 476,000 at 1, and 605,000 at 2). So a low learning rate like 0.2 appears to give better accuracy on average.
    * The best hyperparameter setting was Linear Loss, Learning Rate 1, with an error of 422594.14. The linear loss is not surprising given the analysis above, but the learning rate of 1 is, since lower learning rates do better on average. Interestingly, this is the same as the default settings in sklearn for AdaBoostRegressor (`loss='linear'`, `learning_rate=1.0`).
* Q: Explain the impact of the hyperparameters on the training time. 
* A: Learning rate does not seem to affect the training time much: there was no obvious correlation between the two. The loss does appear to matter, though: linear loss led to the shortest training time, square loss to a longer one, and exponential loss to the longest.
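
The averages discussed in the answers above can be checked directly from the twelve reported error values. The sketch below copies those values from the list and computes the mean per loss, the mean per learning rate, the spread between the best and worst loss at each learning rate, and the best single setting. The variable names are illustrative and not part of the assignment.

```python
from statistics import mean

# Test errors copied from the list above, keyed by (loss, learning_rate).
reported_rmse = {
    ("linear", 1.0): 422594.14, ("square", 1.0): 454124.59, ("exponential", 1.0): 552471.26,
    ("linear", 0.2): 429843.59, ("square", 0.2): 441601.39, ("exponential", 0.2): 425379.25,
    ("linear", 0.5): 428720.01, ("square", 0.5): 459890.88, ("exponential", 0.5): 450257.76,
    ("linear", 2.0): 473509.03, ("square", 2.0): 667926.89, ("exponential", 2.0): 672398.53,
}

losses = ("linear", "square", "exponential")
learning_rates = (0.2, 0.5, 1.0, 2.0)

# Average error for each loss, taken over all learning rates.
mean_by_loss = {
    loss: mean(reported_rmse[(loss, lr)] for lr in learning_rates) for loss in losses
}
# Average error for each learning rate, taken over all losses.
mean_by_lr = {
    lr: mean(reported_rmse[(loss, lr)] for loss in losses) for lr in learning_rates
}
# Gap between the worst and best loss at each learning rate.
spread_by_lr = {
    lr: max(reported_rmse[(loss, lr)] for loss in losses)
        - min(reported_rmse[(loss, lr)] for loss in losses)
    for lr in learning_rates
}
# Setting with the lowest reported error.
best_setting = min(reported_rmse, key=reported_rmse.get)

for loss in losses:
    print(f"mean error for {loss} loss: {mean_by_loss[loss]:.2f}")
for lr in learning_rates:
    print(f"learning rate {lr}: mean {mean_by_lr[lr]:.2f}, spread {spread_by_lr[lr]:.2f}")
print(f"best setting: {best_setting} with error {reported_rmse[best_setting]:.2f}")
```

Keeping the values in one dictionary keyed by `(loss, learning_rate)` makes it easy to add further settings later without rewriting the summary code.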