# Regression with Gaussian Processes

------------------------------------------------------
*Machine Learning, Master in Big Data Analytics, 2023-2024*

*Pablo M. Olmos olmos@tsc.uc3m.es, Emilio Parrado Hernandez, eparrado@ing.uc3m.es*

------------------------------------------------------

The aim of this homework is to solve a real data problem using Gaussian Processes.

The problem is the prediction of both the heating load (HL) and cooling load (CL) of residential buildings. We consider eight input variables for each building: relative compactness, surface area, wall area, roof area, overall height, orientation, glazing area, glazing area distribution.

In this [paper](https://www.sciencedirect.com/science/article/pii/S037877881200151X) you can find a detailed description of the problem and a solution based on linear regression [(iteratively reweighted least squares (IRLS) algorithm)](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=10&ved=2ahUKEwjZuoLY2OjgAhUs3uAKHUZ7BVcQFjAJegQIAhAC&url=https%3A%2F%2Fpdfs.semanticscholar.org%2F9b92%2F18e7233f4d0b491e1582c893c9a099470a73.pdf&usg=AOvVaw3YDwqZh1xyF626VqfnCM2k) and random forests. Using GPs, our goal is not only to estimate both HL and CL accurately, but also to obtain a measure of uncertainty in our predictions.

The data set can be downloaded from the [UCI repository](https://archive.ics.uci.edu/ml/datasets/Energy+efficiency#).

## 1. Loading and preparing the data

* Download the dataset
* Randomly split the dataset into training (80%) and test (20%) sets

In [12]:
#Your code here
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the Excel file into a pandas DataFrame
data = pd.read_excel('ENB2012_data.xlsx')

# Define the new column names
new_column_names = {
    'X1': 'Relative Compactness',
    'X2': 'Surface Area',
    'X3': 'Wall Area',
    'X4': 'Roof Area',
    'X5': 'Overall Height',
    'X6': 'Orientation',
    'X7': 'Glazing Area',
    'X8': 'Glazing Area Distribution',
    'Y1': 'Heating Load',
    'Y2': 'Cooling Load'
}

# Rename the columns
data = data.rename(columns=new_column_names)
data

|  | Relative Compactness | Surface Area | Wall Area | Roof Area | Overall Height | Orientation | Glazing Area | Glazing Area Distribution | Heating Load | Cooling Load |
|---|---|---|---|---|---|---|---|---|---|---|
| 60 | 0.82 | 612.5 | 318.5 | 147.00 | 7.0 | 2 | 0.10 | 1 | 23.53 | 27.31 |
| 618 | 0.64 | 784.0 | 343.0 | 220.50 | 3.5 | 4 | 0.40 | 2 | 18.90 | 22.09 |
| 346 | 0.86 | 588.0 | 294.0 | 147.00 | 7.0 | 4 | 0.25 | 2 | 29.27 | 29.90 |
| 294 | 0.90 | 563.5 | 318.5 | 122.50 | 7.0 | 4 | 0.25 | 1 | 32.84 | 32.71 |
| 231 | 0.66 | 759.5 | 318.5 | 220.50 | 3.5 | 5 | 0.10 | 4 | 11.43 | 14.83 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 71 | 0.76 | 661.5 | 416.5 | 122.50 | 7.0 | 5 | 0.10 | 1 | 32.21 | 33.67 |
| 106 | 0.86 | 588.0 | 294.0 | 147.00 | 7.0 | 4 | 0.10 | 2 | 26.33 | 27.36 |
| 270 | 0.71 | 710.5 | 269.5 | 220.50 | 3.5 | 4 | 0.10 | 5 | 10.67 | 14.26 |
| 435 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 5 | 0.25 | 4 | 28.62 | 30.12 |


## 2. Baseline results: Random Forests (10%)

Train a Random Forest, selecting the number of trees in the forest and the maximum number of leaves using cross-validation.

**Print the scores in the test set for each target. These will be our baseline results.** 


In [16]:
import pandas as pd
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

# Separate features and target variables
X = data.drop(columns=['Heating Load', 'Cooling Load'])
y_heating = data['Heating Load']
y_cooling = data['Cooling Load']

# Split the data into train and test sets
# (a single call keeps the features and both targets aligned on the same rows)
X_train, X_test, y_train_heating, y_test_heating, y_train_cooling, y_test_cooling = train_test_split(
    X, y_heating, y_cooling, test_size=0.2, random_state=42
)

# Define preprocessing steps for numerical features
numeric_features = X.columns
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features)
    ])

# Define the Random Forest Regressor pipeline
rf_pipeline_heating = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor())
])

rf_pipeline_cooling = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor())
])

# Define hyperparameters to tune
param_grid = {
    'regressor__n_estimators': [50, 100, 150],
    'regressor__max_leaf_nodes': [None, 10, 20, 30]
}

# Perform cross-validation to find the best hyperparameters for heating load
# (with its default n_iter=10, RandomizedSearchCV samples 10 of the 12 combinations)
grid_search_heating = RandomizedSearchCV(rf_pipeline_heating, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search_heating.fit(X_train, y_train_heating)

# Perform cross-validation to find the best hyperparameters for cooling load
grid_search_cooling = RandomizedSearchCV(rf_pipeline_cooling, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search_cooling.fit(X_train, y_train_cooling)

# Get the best estimators
best_rf_heating = grid_search_heating.best_estimator_
best_rf_cooling = grid_search_cooling.best_estimator_

# best_estimator_ is already refit on the full training set (refit=True is the default),
# so no additional call to fit is needed here

# Predict on the test set
y_pred_heating = best_rf_heating.predict(X_test)
y_pred_cooling = best_rf_cooling.predict(X_test)

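# Calculate RMSE for each target variable
# (the square root is taken explicitly because the `squared` argument of
# mean_squared_error is deprecated and no longer accepted in recent scikit-learn releases)
rmse_heating = mean_squared_error(y_test_heating, y_pred_heating) ** 0.5
rmse_cooling = mean_squared_error(y_test_cooling, y_pred_cooling) ** 0.5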

print("RMSE for Heating Load:", rmse_heating)
print("RMSE for Cooling Load:", rmse_cooling)


RMSE for Heating Load: 0.49969530508308696
RMSE for Cooling Load: 1.7547635597380087


Use the attribute `feature_importances_` to discuss which features are, for RF, the most informative for predicting each target.

In [17]:
# Get feature importances for heating load prediction
feature_importance_heating = best_rf_heating.named_steps['regressor'].feature_importances_

# Get feature importances for cooling load prediction
feature_importance_cooling = best_rf_cooling.named_steps['regressor'].feature_importances_

# Combine feature importances with feature names
feature_importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance for Heating Load': feature_importance_heating,
    'Importance for Cooling Load': feature_importance_cooling
})

# Sort features by importance for each target
feature_importance_df = feature_importance_df.sort_values(by='Importance for Heating Load', ascending=False)
print("Feature importance for Heating Load:")
print(feature_importance_df)

feature_importance_df = feature_importance_df.sort_values(by='Importance for Cooling Load', ascending=False)
print("\nFeature importance for Cooling Load:")
print(feature_importance_df)


Feature importance for Heating Load:
                     Feature  Importance for Heating Load  Importance for Cooling Load
0       Relative Compactness                     0.375779                     0.464365
1               Surface Area                     0.196229                     0.078125
4             Overall Height                     0.173437                     0.280514
3                  Roof Area                     0.128402                     0.060578
6               Glazing Area                     0.080092                     0.048570
2                  Wall Area                     0.034538                     0.039623
7  Glazing Area Distribution                     0.010775                     0.016380
5                Orientation                     0.000748                     0.011844

Feature importance for Cooling Load:
                     Feature  Importance for Heating Load  \
0       Relative 

## 3. First result Gaussian Process (10%)

You will train two independent GPs, one to estimate HL and one to estimate CL. Each of the two GPs will be endowed with a composite kernel function $\kappa_T = \kappa_c \cdot \kappa_r + \kappa_w$, where:
- $\kappa_c$ is a constant kernel
- $\kappa_r$ is an RBF kernel
- $\kappa_w$ is a White Noise kernel

Evaluate those GPs with the corresponding test sets.

**How do these results compare with those obtained using Random Forests?**

**Discuss the contribution of the White Noise kernel and of the constant kernel to the final results**.
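As a starting point, the composite kernel $\kappa_T = \kappa_c \cdot \kappa_r + \kappa_w$ can be written with the kernel classes of scikit-learn. The sketch below is only an illustration under stated assumptions: it runs on synthetic data (`X_demo`, `y_demo` are hypothetical stand-ins for the building features and one target), it standardises the inputs, and it sets `normalize_y=True`; none of these choices is prescribed by the assignment. In this construction the constant kernel scales the amplitude of the RBF term, while the White Noise kernel adds a learned noise variance on the diagonal of the covariance matrix.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF, WhiteKernel
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the building data (assumption: 8 numeric features, 1 target)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 8))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 4] + 0.1 * rng.normal(size=120)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Standardise the inputs using statistics from the training split only
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

# kappa_T = kappa_c * kappa_r + kappa_w (single RBF lengthscale shared by all features)
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0)

gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=0)
gp.fit(X_tr_s, y_tr)  # hyperparameters are tuned by maximising the marginal likelihood

y_hat = gp.predict(X_te_s)
rmse_gp = float(np.sqrt(np.mean((y_te - y_hat) ** 2)))

print("Fitted kernel:", gp.kernel_)
print("Test RMSE:", rmse_gp)
```

For the homework, the same pattern would be applied twice, once per target, with `X_train`, `X_test` and the corresponding HL or CL vectors in place of the synthetic arrays. Inspecting `gp.kernel_` after fitting shows the learned constant and noise levels, which is one way to ground the discussion of each kernel's contribution.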




In [4]:
#Your code here

## 4. Gaussian Process with Features selected by Random Forest (10%)

Now train two independent GPs with the same composite kernel as above, using only the features that RF indicated as most relevant.

**Is there any significant improvement over RF or GP with all the features?**
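One possible way to turn the RF importances into a feature subset is sketched below. The selection rule (keep the top-ranked features until their cumulative importance reaches `importance_threshold`) is a hypothetical choice made for illustration, not something the assignment specifies, and the data are synthetic, with only features 0 and 4 carrying signal.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF, WhiteKernel

# Synthetic stand-in: 8 features, only columns 0 and 4 influence the target
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(150, 8))
y_demo = 2.0 * X_demo[:, 0] + 1.5 * X_demo[:, 4] + 0.1 * rng.normal(size=150)
X_tr, X_te = X_demo[:120], X_demo[120:]
y_tr, y_te = y_demo[:120], y_demo[120:]

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Hypothetical rule: keep the top features whose cumulative importance reaches the threshold
importance_threshold = 0.9
order = np.argsort(rf.feature_importances_)[::-1]          # most important first
cumulative = np.cumsum(rf.feature_importances_[order])
n_keep = int(np.searchsorted(cumulative, importance_threshold)) + 1
n_keep = min(n_keep, X_tr.shape[1])
selected = np.sort(order[:n_keep])


def fit_gp_rmse(cols):
    """Fit the composite-kernel GP on the given columns and return the test RMSE."""
    kernel = ConstantKernel(1.0) * RBF(1.0) + WhiteKernel(1.0)
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=0)
    gp.fit(X_tr[:, cols], y_tr)
    resid = y_te - gp.predict(X_te[:, cols])
    return float(np.sqrt(np.mean(resid ** 2)))


rmse_all = fit_gp_rmse(np.arange(X_tr.shape[1]))
rmse_selected = fit_gp_rmse(selected)

print("Selected feature indices:", selected)
print("GP RMSE, all features:     ", rmse_all)
print("GP RMSE, selected features:", rmse_selected)
```

With the pandas DataFrames used earlier in the notebook, the same idea would select columns by name instead of by position. Whether the reduced model actually improves on the full one has to be checked on the real test set for each target.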


In [5]:
#Your code here

## 5. Gaussian Processes with ARD kernels (30%)

Now use an ARD RBF kernel in the composite kernel, that is, allow a different lengthscale for each feature.

**Discuss the impact of the ARD kernel in the results of the GPs fit for each target.**

**Print the parameters of the kernels after the GPs are fit. In particular, discuss how the lengthscale learned for each feature can be used as a proxy for the relevance of that feature. Are these relevances aligned with those found by Random Forests?**

**Discuss how scaling the data (or not) affects the conclusions drawn from the ARD results.**
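In scikit-learn, passing a vector as `length_scale` to `RBF` gives one lengthscale per input dimension, which is the ARD construction. The following sketch assumes synthetic data with four features (only feature 0 matters), standardised inputs, and hypothetical bounds `(1e-2, 1e3)` on the lengthscales. The inverse lengthscale is used as a rough relevance score: a long lengthscale means the function barely changes along that feature. Because each lengthscale is expressed in the units of its feature, comparing them across features is only meaningful when the features are on comparable scales.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF, WhiteKernel
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 4 features, only column 0 influences the target
rng = np.random.default_rng(2)
n_features = 4
X_demo = rng.normal(size=(100, n_features))
y_demo = np.sin(2.0 * X_demo[:, 0]) + 0.05 * rng.normal(size=100)

# Put all features on a comparable scale before reading the lengthscales
X_s = StandardScaler().fit_transform(X_demo)

# A vector-valued length_scale makes the RBF anisotropic (one lengthscale per feature)
ard_kernel = (
    ConstantKernel(1.0)
    * RBF(length_scale=np.ones(n_features), length_scale_bounds=(1e-2, 1e3))
    + WhiteKernel(noise_level=1.0)
)

gp_ard = GaussianProcessRegressor(kernel=ard_kernel, normalize_y=True, random_state=0)
gp_ard.fit(X_s, y_demo)

# The fitted kernel is Sum(Product(Constant, RBF), White):
# kernel_.k1 is the product and kernel_.k1.k2 is the RBF factor
length_scales = np.asarray(gp_ard.kernel_.k1.k2.length_scale)
relevance = 1.0 / length_scales

print("Fitted kernel:", gp_ard.kernel_)
for idx in np.argsort(relevance)[::-1]:
    print(f"feature {idx}: lengthscale = {length_scales[idx]:.3g}, relevance = {relevance[idx]:.3g}")
```

Repeating the fit on unscaled inputs and comparing the two rankings is a direct way to examine the effect of scaling on the ARD conclusions.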

In [6]:
#Your code here

## 6. Predictive distributions (15%)

The predictive distribution can be used to assess the confidence in the predictions made by the GPs. For the GPs fit in parts 3 (composite kernel, single RBF lengthscale for all features) and 5 (ARD kernel), compute the predictions for the test data, including mean and standard deviation.

**For each GP, produce a scatter plot of the absolute error of the predictions (absolute value of the true target minus the mean of the predictive distribution) vs. the standard deviation of the corresponding predictive distribution. Is the standard deviation of the predictive distribution informative about the confidence in the predictions?**
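The predictive mean and standard deviation are both returned by `predict(..., return_std=True)`. The sketch below is a minimal illustration on synthetic one-dimensional data, where some test inputs deliberately fall outside the training range so that the predictive uncertainty varies. The helper `plot_error_vs_std` and the `coverage` summary are hypothetical additions for illustration, and the plotting helper assumes matplotlib is installed.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF, WhiteKernel

# Synthetic stand-in: test inputs extend beyond the range covered by the training inputs
rng = np.random.default_rng(3)
X_tr = rng.uniform(-2.0, 2.0, size=(60, 1))
y_tr = np.sin(3.0 * X_tr[:, 0]) + 0.1 * rng.normal(size=60)
X_te = rng.uniform(-4.0, 4.0, size=(40, 1))
y_te = np.sin(3.0 * X_te[:, 0]) + 0.1 * rng.normal(size=40)

kernel = ConstantKernel(1.0) * RBF(1.0) + WhiteKernel(0.1)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=0)
gp.fit(X_tr, y_tr)

# Mean and standard deviation of the predictive distribution at each test point
mean, std = gp.predict(X_te, return_std=True)
abs_err = np.abs(y_te - mean)

# Fraction of test targets that fall inside mean +/- 2 standard deviations
coverage = float(np.mean(abs_err <= 2.0 * std))
print("Empirical coverage of the 2-sigma band:", coverage)


def plot_error_vs_std(abs_err, std, ax=None):
    """Scatter plot of absolute error against predictive standard deviation."""
    import matplotlib.pyplot as plt  # imported lazily; assumes matplotlib is available

    if ax is None:
        _, ax = plt.subplots()
    ax.scatter(std, abs_err, alpha=0.7)
    ax.set_xlabel("Predictive standard deviation")
    ax.set_ylabel("Absolute error")
    return ax
```

Calling `plot_error_vs_std(abs_err, std)` produces the requested scatter plot. If the standard deviation is informative, large errors should tend to appear at large standard deviations rather than being spread uniformly along the horizontal axis.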

In [7]:
#Your code here

## 7. Advanced kernels (25%)

Finally, we will try to improve the results by using more complex kernels that combine several covariance functions. Try to evaluate at least 10 different composite kernels by combining different instances of the basic kernels presented in class. Consider doing this programmatically.

**Discuss how the results for each target vary as you change the composite kernel configuration.**
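One way to generate the candidate kernels programmatically is to enumerate sums and products of a small set of base kernels. The sketch below is an assumption-laden illustration: the choice of base kernels (RBF, two Matérn variants and Rational Quadratic from scikit-learn) may differ from those presented in class, the helper `build_candidates` is hypothetical, and the data are synthetic. It builds 16 candidates: 4 single kernels plus a sum and a product for each of the 6 pairs.

```python
import itertools

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (
    ConstantKernel,
    Matern,
    RationalQuadratic,
    RBF,
    WhiteKernel,
)

# Factories return fresh kernel objects so that candidates never share state
base_kernels = {
    "RBF": lambda: RBF(length_scale=1.0),
    "Matern32": lambda: Matern(length_scale=1.0, nu=1.5),
    "Matern52": lambda: Matern(length_scale=1.0, nu=2.5),
    "RQ": lambda: RationalQuadratic(length_scale=1.0, alpha=1.0),
}


def build_candidates():
    """Return a dict of named composite kernels, each with a White Noise term."""
    candidates = {}
    for name, make in base_kernels.items():
        candidates[name] = ConstantKernel(1.0) * make() + WhiteKernel(1.0)
    for (n1, m1), (n2, m2) in itertools.combinations(base_kernels.items(), 2):
        candidates[f"{n1}+{n2}"] = (
            ConstantKernel(1.0) * m1() + ConstantKernel(1.0) * m2() + WhiteKernel(1.0)
        )
        candidates[f"{n1}*{n2}"] = ConstantKernel(1.0) * m1() * m2() + WhiteKernel(1.0)
    return candidates


# Synthetic stand-in for one target
rng = np.random.default_rng(4)
X_demo = rng.normal(size=(90, 3))
y_demo = np.sin(2.0 * X_demo[:, 0]) + 0.5 * X_demo[:, 1] + 0.1 * rng.normal(size=90)
X_tr, X_te = X_demo[:60], X_demo[60:]
y_tr, y_te = y_demo[:60], y_demo[60:]

results = {}
for name, kernel in build_candidates().items():
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=0)
    gp.fit(X_tr, y_tr)
    resid = y_te - gp.predict(X_te)
    results[name] = float(np.sqrt(np.mean(resid ** 2)))

# Report the candidates from best to worst test RMSE
for name, rmse in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name:20s} RMSE = {rmse:.3f}")
```

Running the same loop once per target, and collecting the RMSE values in a table, gives a compact view of how sensitive each target is to the kernel configuration.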

In [8]:
#Your code here