# Foundations of Data Mining: Resit Assignment Part 3


Please complete all assignments in this notebook. You should submit this notebook, as well as a PDF version (see File > Download as).

In [1]:
%matplotlib inline
from preamble import *
plt.rcParams['savefig.dpi'] = 100 # This controls the size of your figures
# Comment out and restart notebook if you only want the last output of each cell.
InteractiveShell.ast_node_interactivity = "all"

## 1. Preprocessing vs Tuning (6 points (0.5+0.5+1+1+1+2))

Consider the [Body fat dataset](http://www.openml.org/d/560). It contains a number of body measurements and the goal is to accurately predict bodyfat using only a few measurements. 

- Analyze which are the most relevant features. Use a model-based feature selection technique based on GradientBoosting and plot the feature importance of all features. You can use the plotting code below to create the plot.
- Split the data in a training and test set (the default split is fine), and apply the RandomForest-based technique to select and report the most relevant features. 
    - You can use the `SelectFromModel` method with the `median` threshold.
    - Avoid data leaks: don't select features based on the test data.
- Evaluate at least 4 regression algorithms, using default parameter settings, on the selected features and train-test split. Compare at least Decision Trees, Random Forests, Gaussian Processes, and SVMs (you can try several kernels). Report and interpret the R^2 scores.
    - You can use the scikit-learn implementation of Gaussian processes (GPY is also allowed)
- Explore the effect of the number of features by making the feature selection more strict (remove more features). The goal is to predict bodyfat using as few features as possible. What is the effect of using fewer features for each of your methods?
    - For correct results, use a pipeline that includes at least a feature selection, normalization, and regressor step. Add more components if you think that they may help, but also discuss why and whether they helped.
- Go back to your first selection of features (using the `median` threshold). Explore which models are overfitting by plotting the predictions. Do this for the most relevant feature (on the X-axis), and plot it against the target (Bodyfat, on the Y-axis). You can use the plotting code below, and adapt it to look at different parts of the X-axis. Interpret the plot: how well does each model perform? Does it match your earlier R^2 scores? Which algorithms are over/underfitting and may require more tuning?
    - Don't include algorithms that received negative R^2 scores.
    - Keep in mind that you are seeing only 1 feature while the model was trained on several more.
- Optimize the main hyperparameters of the SVM and Gaussian Process algorithms. For SVM, you can stick to the RBF kernel. For Gaussian Processes, two kernels are given below. Don't forget to tune your regularization hyperparameters (C and alpha, both on a log scale, use at least 10 values). Use a grid search and plot the results in a heatmap (one for SVM, one for GPs). Interpret the results. Which values work best? Can you get much better results?
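
The leak-free feature selection described above can be sketched roughly as follows. This is only an illustration, not a reference solution: it uses a synthetic dataset from `make_regression` as a stand-in for the body fat data (the 14-feature shape is an assumption), and the names `X_demo`, `y_demo`, and `pipe` are hypothetical. Because the selector sits inside the pipeline, it is fitted on the training split only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the real data (assumption: 14 numeric features).
X_demo, y_demo = make_regression(n_samples=250, n_features=14,
                                 n_informative=4, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Selection, scaling, and the regressor are all fitted on the training data only.
pipe = Pipeline([
    ("select", SelectFromModel(
        RandomForestRegressor(n_estimators=100, random_state=0),
        threshold="median")),
    ("scale", StandardScaler()),
    ("model", SVR()),
])
pipe.fit(X_tr, y_tr)

# Boolean mask over the original columns: True means the feature was kept.
selected_mask = pipe.named_steps["select"].get_support()
test_r2 = pipe.score(X_te, y_te)  # R^2 on the held-out split
print("Kept feature indices:", np.flatnonzero(selected_mask))
print("Test R^2:", test_r2)
```

A stricter selection can be tried by changing `threshold`, for example to `"1.5*median"`, or by passing `max_features` to `SelectFromModel`.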

In [2]:
# Get the data 
fat_data = oml.datasets.get_dataset(560)
X, y, attribute_names = fat_data.get_data(
    target=fat_data.default_target_attribute, 
    return_attribute_names=True)

In [3]:
# Feature importance plot. Input the training data, the attribute names (already given in the code above), 
# and the trained model
def plot_feature_importances(data,attribute_names,model):
    n_features = data.shape[1]
    plt.barh(range(n_features), model.feature_importances_, align='center')
    plt.yticks(np.arange(n_features), attribute_names)
    plt.xlabel("Feature importance")
    plt.ylabel("Feature")
    plt.ylim(-1, n_features)

In [4]:
# Shows the predictions of multiple models.
# property_index is the index of the most relevant feature (0 by default if feature selection was applied correctly)
def plot_fat_predictions(X_train, X_test, y_train, y_test, models, property_index=0):
    
    X_all = np.concatenate((X_train, X_test),axis=0)
    
    plt.figure(figsize=(8,6))
    plt.rcParams['lines.linewidth'] = 1

    prop = X_all[:,property_index]
    prop_train = X_train[:,property_index]
    prop_test = X_test[:,property_index]

    sort_all = prop.argsort()
    sort_train = prop_train.argsort()
    sort_test = prop_test.argsort()

    plt.scatter(prop_train[sort_train], y_train[sort_train], label="Training data")
    plt.scatter(prop_test[sort_test], y_test[sort_test], label="Test data")
    for name, model in models.items():
        predictions = model.predict(X_all)
        plt.plot(prop[sort_all], predictions[sort_all], label=name)
    plt.xlabel("Property")
    plt.ylabel("Bodyfat");
    plt.legend()
    #plt.xlim(1.04,1.06); # Zoom in on parts of the plot

In [5]:
# These are two simple kernels for the Gaussian Process. 
# Feel free to adjust the parameters (e.g. the length_scale parameter of the RBF kernel)
from sklearn.gaussian_process.kernels import ConstantKernel, RBF

ker_rbf = RBF(0.1, length_scale_bounds="fixed")
ker_rbf2 = ConstantKernel(1.0, constant_value_bounds="fixed") * RBF(0.1, length_scale_bounds="fixed")
kernel_list = [ker_rbf, ker_rbf2]
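
One possible way to scan `alpha` on a log scale for these kernels is sketched below. It is a hedged illustration on synthetic data (`X_gp`, `y_gp`, and `gp_scores` are hypothetical names), and it uses a plain loop with `cross_val_score` instead of `GridSearchCV` so that the layout of the score matrix is explicit. The two kernels are repeated here only to keep the sketch self-contained.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the selected body fat features (assumption).
X_gp, y_gp = make_regression(n_samples=120, n_features=4, noise=5.0,
                             random_state=0)

kernels = {
    "RBF": RBF(0.1, length_scale_bounds="fixed"),
    "Constant*RBF": ConstantKernel(1.0, constant_value_bounds="fixed")
                    * RBF(0.1, length_scale_bounds="fixed"),
}
alphas = np.logspace(-5, 2, 10)  # 10 values on a log scale

# Rows are kernels, columns are alpha values; entries are mean CV R^2.
gp_scores = np.zeros((len(kernels), len(alphas)))
for i, kernel in enumerate(kernels.values()):
    for j, alpha in enumerate(alphas):
        gp = make_pipeline(
            StandardScaler(),
            GaussianProcessRegressor(kernel=kernel, alpha=alpha,
                                     normalize_y=True))
        gp_scores[i, j] = cross_val_score(gp, X_gp, y_gp, cv=3).mean()

best_i, best_j = np.unravel_index(gp_scores.argmax(), gp_scores.shape)
print("Best kernel:", list(kernels)[best_i], "best alpha:", alphas[best_j])
```

The `gp_scores` matrix can be passed to a heatmap function such as `plt.imshow`, with the kernel names and alpha values as tick labels.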

## 2. Cross-validation, Out-of-bag error, Bias-Variance (4 points (1+1+2))
RandomForests offer us an interesting, and cheaper, way to evaluate models instead of cross-validation: the out-of-bag error.
It is often used to decide when to stop adding more trees to a forest.
We will explore this on the [Wall Robot Navigation dataset](http://www.openml.org/d/1497), which contains 
about 5500 readings of an ultrasound sensor array mounted on a robot, and your task is to train a model
to predict how the robot should move next.

* First, train a RandomForest classifier on this dataset with an increasing number of trees (on a log scale to max. 512). Plot the Out-Of-Bag error against the number of trees. Which model would you build to control the robot? It should be as cheap as possible, and make predictions fast.
    - The Out-Of-Bag error was discussed in the lecture on trees and ensembles. Example code is also added below.
* Construct the same plot, but now use 10-fold Cross-validation and error rate instead of the OOB error. Compare the two. What can you deduce from this?
* Compute the bias and variance for increasing numbers of trees. Do the bias and variance of the ensemble increase or decrease? Do these results match your earlier observations? Explain in detail.
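
For the bias and variance question, a bootstrap-based estimate could look roughly like the sketch below. This is one common 0-1 loss formulation (majority-vote "main prediction"), not necessarily the decomposition code provided in the course. It assumes binary labels and synthetic data, and the function name `bias_variance_01` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary problem as a stand-in (the robot data has more classes).
X_bv, y_bv = make_classification(n_samples=400, n_features=10,
                                 n_informative=5, random_state=0)
X_fit, X_eval, y_fit, y_eval = train_test_split(X_bv, y_bv, random_state=0)

def bias_variance_01(n_trees, n_rounds=20, seed=0):
    """Estimate 0-1 loss bias and variance from bootstrap resamples."""
    rng = np.random.RandomState(seed)
    preds = np.empty((n_rounds, len(y_eval)), dtype=int)
    for r in range(n_rounds):
        # Draw a bootstrap sample of the training data.
        idx = rng.randint(0, len(y_fit), len(y_fit))
        model = RandomForestClassifier(n_estimators=n_trees, random_state=r)
        preds[r] = model.fit(X_fit[idx], y_fit[idx]).predict(X_eval)
    # Main prediction: majority vote over rounds (valid for 0/1 labels only).
    main = (preds.mean(axis=0) >= 0.5).astype(int)
    bias = np.mean(main != y_eval)    # how often the main prediction is wrong
    variance = np.mean(preds != main) # how often a round disagrees with it
    return bias, variance

bv_results = {n: bias_variance_01(n) for n in [1, 4, 16]}
for n, (b, v) in bv_results.items():
    print(f"{n:>3} trees: bias={b:.3f}, variance={v:.3f}")
```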

Hint: Error rate = 1 - accuracy. It is not a standard scoring metric for `cross_val_score`, so you'll need to let it compute the accuracy values, and then compute the mean error rate yourself.  
Hint: We discussed bias-variance decomposition in class, and we have provided code to compute it earlier.
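
The first hint can be illustrated with a minimal sketch. It assumes a synthetic classification dataset in place of the robot data, and `X_cv`, `y_cv`, and `error_rate` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the robot navigation data (assumption).
X_cv, y_cv = make_classification(n_samples=300, n_features=10,
                                 n_informative=5, random_state=0)

# cross_val_score returns one accuracy value per fold.
accuracies = cross_val_score(
    RandomForestClassifier(n_estimators=32, random_state=0),
    X_cv, y_cv, cv=10, scoring="accuracy")

# Error rate is the complement of the mean accuracy.
error_rate = 1 - accuracies.mean()
print("Mean 10-fold CV error rate:", error_rate)
```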

In [None]:
robot = oml.datasets.get_dataset(1497) # Download Wall Robot Navigation data
X, y = robot.get_data(target=robot.default_target_attribute);

# Out of bag errors can be retrieved from the RandomForest classifier as follows. 
# You'll need to loop over the number of trees yourself.
# http://scikit-learn.org/stable/auto_examples/ensemble/plot_ensemble_oob.html
from sklearn import ensemble
clf = ensemble.RandomForestClassifier(oob_score=True) # oob_score=True is needed to compute the OOB score
clf.fit(X, y)
(1 - clf.oob_score_) # OOB error = 1 - OOB accuracy
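
The loop over the number of trees might be sketched as follows, in the spirit of the linked scikit-learn example. It uses synthetic four-class data as a stand-in and starts at 16 trees, because with very few trees some samples may never be out-of-bag. The names `X_oob`, `y_oob`, `forest`, and `oob_errors` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic four-class stand-in for the robot navigation data (assumption).
X_oob, y_oob = make_classification(n_samples=500, n_features=10,
                                   n_informative=5, n_classes=4,
                                   random_state=0)

# warm_start=True keeps the existing trees and only adds new ones on refit.
forest = RandomForestClassifier(warm_start=True, oob_score=True,
                                random_state=0)

tree_counts = [16, 32, 64, 128]  # extend on a log scale up to 512 as needed
oob_errors = []
for n in tree_counts:
    forest.set_params(n_estimators=n)
    forest.fit(X_oob, y_oob)
    oob_errors.append(1 - forest.oob_score_)  # OOB error = 1 - OOB accuracy

for n, err in zip(tree_counts, oob_errors):
    print(f"{n:>4} trees: OOB error = {err:.3f}")
```

Reusing one forest with `warm_start=True` avoids retraining from scratch for each tree count, which keeps the scan cheap.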