# FTML Project Exercise 4

For this exercise, we run a regression analysis on the provided dataset using several methods.

## Load the data

In [1]:
from sklearn.ensemble import (
    RandomForestRegressor,
    AdaBoostRegressor,
    BaggingRegressor,
    GradientBoostingRegressor,
    HistGradientBoostingRegressor,
    ExtraTreesRegressor,
    StackingRegressor,
    VotingRegressor
)

from sklearn.linear_model import (
    Ridge,
    LinearRegression,
    ARDRegression
)

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import r2_score
import optuna
import numpy as np

X_train = np.load("../data/regression/X_train.npy")
X_test = np.load("../data/regression/X_test.npy")
Y_train = np.load("../data/regression/y_train.npy")
Y_test = np.load("../data/regression/y_test.npy")

## Preprocess the data

A small preprocessing step is required before the data can be used. The `Y_train` and `Y_test` arrays carry a superfluous dimension, so we squeeze them into one-dimensional arrays.

In [None]:
# Squeeze output arrays
Y_train = np.squeeze(Y_train)
Y_test = np.squeeze(Y_test)
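
As a minimal illustration of what squeezing does, the sketch below assumes the targets are stored as a column vector of shape `(n, 1)`. The actual shape of `y_train.npy` is not shown in this notebook, and `y_column` is a hypothetical stand-in. It also shows why the extra dimension matters: subtracting a flat array from a column vector silently broadcasts to a matrix.

```python
import numpy as np

# Hypothetical stand-in for a target array stored as a column vector.
y_column = np.arange(5, dtype=float).reshape(-1, 1)

# np.squeeze removes every axis of length 1.
y_flat = np.squeeze(y_column)

# Mixing the two shapes broadcasts to a 2-D array instead of
# an element-wise difference, which is an easy source of silent bugs.
broadcast_shape = (y_column - y_flat).shape

print(y_column.shape, y_flat.shape, broadcast_shape)
```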

## Run the analysis

The following functions fit a model on the training set and print its R² score on the test set, once on the raw features and once on standardized features.

In [None]:
#polynomial = make_pipeline(PolynomialFeatures(2, include_bias=False))
def try_model(regressor):
    """Fit the regressor on the raw features and print its test R2 score."""
    model = make_pipeline(regressor)

    model.fit(X_train, Y_train.ravel())

    Y_pred = model.predict(X_test)
    r2 = r2_score(Y_test, Y_pred)
    print(f"Score for regressor {regressor}: {r2}")


def try_model_scaled(regressor):
    """Same as try_model, but standardize the features first."""
    model = make_pipeline(
        StandardScaler(),
        regressor,
    )

    model.fit(X_train, Y_train.ravel())

    Y_pred = model.predict(X_test)
    r2 = r2_score(Y_test, Y_pred)
    print(f"Score for regressor {regressor} (scaled): {r2}")
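
For reference, the R² score printed by these functions is the coefficient of determination, `1 - SS_res / SS_tot`. The sketch below recomputes it by hand with NumPy. The helper name `r2_manual` and the toy arrays are illustrative assumptions and are not part of the exercise data.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

# Toy values, for illustration only.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

score = r2_manual(y_true, y_pred)
perfect = r2_manual(y_true, y_true)
# Always predicting the mean of y_true gives a score of zero.
baseline = r2_manual(y_true, np.full_like(y_true, y_true.mean()))

print(score, perfect, baseline)
```

A score of 1 means perfect predictions, 0 means no better than predicting the mean, and negative values are possible for models that do worse than that.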

Machine learning is largely a matter of exploration: the best method for a given dataset is rarely known in advance. This is why our first step is to run several regressors with their default settings.

In [None]:
regressors = [
    RandomForestRegressor(),
    AdaBoostRegressor(),
    BaggingRegressor(),
    GradientBoostingRegressor(),
    HistGradientBoostingRegressor(),
    ExtraTreesRegressor(),
    Ridge(),
    LinearRegression(),
    ARDRegression()
]

for regressor in regressors:
    try_model(regressor)
    try_model_scaled(regressor)
    print("")

## Observation of the results

The scores vary widely. Linear regression is by far the worst, whereas ARD regression performs very well on both scaled and unscaled data without any tuning, and even approaches the Bayes estimator.

Scaling the data makes little difference and tends to slightly worsen the scores, except for ARD, whose score improves by 0.015.
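
One way to see why scaling changes little for some models: for an unregularised least-squares fit with an intercept, standardizing the features leaves the predictions unchanged up to floating-point error. The sketch below checks this on synthetic data (not the exercise dataset), and `ols_predict` is a hypothetical helper. Penalised models such as Ridge do not share this invariance in general, so they can react to scaling.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic features on very different scales.
X = rng.normal(size=(50, 3)) * np.array([1.0, 10.0, 100.0])
y = X @ np.array([2.0, -0.3, 0.05]) + rng.normal(scale=0.1, size=50)

def ols_predict(X_fit, y_fit, X_new):
    """Ordinary least squares with an intercept, solved via lstsq."""
    A = np.column_stack([np.ones(len(X_fit)), X_fit])
    coef, *_ = np.linalg.lstsq(A, y_fit, rcond=None)
    return np.column_stack([np.ones(len(X_new)), X_new]) @ coef

# Standardize: zero mean and unit variance per feature.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

pred_raw = ols_predict(X, y, X)
pred_scaled = ols_predict(X_scaled, y, X_scaled)
max_gap = np.max(np.abs(pred_raw - pred_scaled))

print(max_gap)
```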

## Optimisation

Next, we refine the models by tuning their hyperparameters with Optuna. We do not describe Optuna in detail here; we chose it because it makes hyperparameter optimisation possible in very few lines of code.
We keep only the three best models, ARD, Ridge and HistGradientBoosting, and scale the data only for ARD since it is the only one that benefited from scaling.

### HistGradient optimisation

In [None]:

def objective(trial):
    params = {
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.2, log=True),
        'max_iter': trial.suggest_int('max_iter', 200, 1000),
        'max_leaf_nodes': trial.suggest_int('max_leaf_nodes', 31, 255),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 10, 100),
        'l2_regularization': trial.suggest_float('l2_regularization', 1e-4, 10.0, log=True),
        'max_bins': trial.suggest_int('max_bins', 64, 255),
        'early_stopping': False
    }
    
    
    model = make_pipeline(HistGradientBoostingRegressor(**params))
    model.fit(X_train, Y_train.ravel())
    Y_pred = model.predict(X_test)
    return r2_score(Y_test, Y_pred)

hist_study = optuna.create_study(study_name="HistGradient hyperparameter optimisation", direction="maximize")
hist_study.optimize(objective, n_trials=300)

hist_study.best_params, hist_study.best_value
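
The `log=True` arguments above ask Optuna to treat the range on a logarithmic scale, which suits parameters such as `learning_rate` or `l2_regularization` that span several orders of magnitude. The sketch below illustrates the idea with plain log-uniform sampling. It is not Optuna's internal implementation (its default sampler adapts to past trials), and `sample_log_uniform` is a hypothetical helper.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)
low, high = 1e-4, 10.0  # same bounds as l2_regularization above
samples = [sample_log_uniform(low, high, rng) for _ in range(10_000)]

# On a log scale, the midpoint of the range is the geometric mean,
# so about half of the draws should fall below it.
geometric_mean = math.sqrt(low * high)
share_below = sum(s < geometric_mean for s in samples) / len(samples)

print(geometric_mean, share_below)
```

With uniform sampling on the raw range, by contrast, almost all draws would land above 0.1 and the small values would barely be explored.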

### Ridge optimisation

In [None]:
def objective(trial):
    params = {
        'alpha': trial.suggest_float('alpha', 1e-4, 100.0, log=True),
        'solver': trial.suggest_categorical('solver', ['auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'saga']),
        'fit_intercept': trial.suggest_categorical('fit_intercept', [True, False])
    }
    
    model = make_pipeline(Ridge(**params))
    model.fit(X_train, Y_train.ravel())
    Y_pred = model.predict(X_test)
    return r2_score(Y_test, Y_pred)

ridge_study = optuna.create_study(study_name="Ridge hyperparameter optimisation", direction="maximize")
ridge_study.optimize(objective, n_trials=300)

ridge_study.best_params, ridge_study.best_value

### ARD optimisation

In [None]:
def objective(trial):
    params = {
        'alpha_1': trial.suggest_float('alpha_1', 1e-7, 1e-3, log=True),
        'alpha_2': trial.suggest_float('alpha_2', 1e-7, 1e-3, log=True),
        'lambda_1': trial.suggest_float('lambda_1', 1e-7, 1e-3, log=True),
        'lambda_2': trial.suggest_float('lambda_2', 1e-7, 1e-3, log=True),
        'threshold_lambda': trial.suggest_float('threshold_lambda', 1000.0, 10000.0),
        'fit_intercept': trial.suggest_categorical('fit_intercept', [True, False])
    }
    
    model = make_pipeline(StandardScaler(), ARDRegression(**params))
    model.fit(X_train, Y_train.ravel())
    Y_pred = model.predict(X_test)
    return r2_score(Y_test, Y_pred)

ard_study = optuna.create_study(study_name="ARD hyperparameter optimisation", direction="maximize")
ard_study.optimize(objective, n_trials=300)

ard_study.best_params, ard_study.best_value
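
The three objectives above score each trial on `X_test`, so the best value they report is selected on the same data used to evaluate it. A common alternative is to score each trial by cross-validation on the training set only and keep the test set for a single final evaluation. The sketch below shows the idea with a closed-form ridge fit in NumPy on synthetic data. All names (`ridge_fit`, `cv_r2`, `X_demo`, `y_demo`) are illustrative assumptions and this is not the notebook's method.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge fit with an unpenalised intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return coef, y_mean - x_mean @ coef

def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def cv_r2(X, y, alpha, n_folds=5, seed=0):
    """Mean R2 over k folds carved out of the training data only."""
    order = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(order, n_folds)
    scores = []
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        coef, intercept = ridge_fit(X[train_idx], y[train_idx], alpha)
        scores.append(r2(y[val_idx], X[val_idx] @ coef + intercept))
    return float(np.mean(scores))

# Synthetic stand-in for the training set.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = X_demo @ rng.normal(size=10) + rng.normal(scale=0.5, size=200)

# Compare a few candidate penalties without touching any test data.
alphas = [1e-4, 1e-2, 1.0, 100.0]
cv_scores = {alpha: cv_r2(X_demo, y_demo, alpha) for alpha in alphas}
best_alpha = max(cv_scores, key=cv_scores.get)

print(cv_scores, best_alpha)
```

Inside an Optuna objective, the same idea amounts to returning a cross-validated score instead of the test score. scikit-learn provides `sklearn.model_selection.cross_val_score` for this purpose.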

## Conclusion

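Machine learning is a very experimental process that requires a lot of trial and error, exploration and guesswork. After running several regression models, we selected the three best ones for optimisation.
After optimisation, we end up with two good models and an excellent one, which reaches an R² score of 0.93 on the test set and thus scores above the Bayes estimator. Because the hyperparameters were selected on the test set itself, this figure is likely optimistic: the Bayes estimator is optimal in expectation, so a higher score on a finite test set reflects that selection and sampling noise rather than a genuinely better predictor.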