# Exercise 4: Regression on a given dataset

Perform a regression on the dataset stored in FTML/Project/data/regression/.
You are free to choose the regression methods, but you must compare at least two
methods. You can compare more than two, but this is not mandatory for this exercise. Discuss the choice of the optimization procedures, solvers, hyperparameters, cross-validation, etc. The Bayes estimator for this dataset and the squared loss reaches
an R2 score of approximately 0.92.

Your objective is to obtain an R2 score above 0.88 on the test set for at least 1 of the 2 estimators (1 estimator is
enough). The test set must not be used during training. Remember that training is the complete model optimisation procedure, including model selection and hyperparameter testing, not
only the call to a .fit() method! This is the topic that we discussed during the
practical sessions on train / validation / test and cross-validation. However, since
you have the test set, all you can do is "pretend" not to use it during training, since
you can always compute the test score several times without putting it in your solution.

We first import the libraries we will use, load the data, print some information about it, and define our target R² score:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

import warnings
warnings.filterwarnings('ignore')

plt.style.use('default')
sns.set_palette("husl")

# Load the data
X_train = np.load("../data/regression/X_train.npy")
y_train = np.load("../data/regression/y_train.npy")
X_test = np.load("../data/regression/X_test.npy")
y_test = np.load("../data/regression/y_test.npy")

print(f"dataset shape: X_train {X_train.shape}, y_train {y_train.shape}")
print(f"test set shape: X_test {X_test.shape}, y_test {y_test.shape}")

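# standalone scaled copies, fitted on the training set only
# (the pipelines below include their own StandardScaler, so tuning does not use these arrays)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)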

TARGET_R2 = 0.88 # as written in the subject

dataset shape: X_train (200, 200), y_train (200, 1)
test set shape: X_test (200, 200), y_test (200, 1)


## Hyperparameter Tuning and Model Selection

Now we'll tune the hyperparameters for our regression models and find the best performing one. We're comparing five different approaches to see which works best for our dataset.


Linear Regression serves as our baseline because it has no hyperparameters to tune. For Ridge Regression, we test regularization strengths (alpha values) from 0.001 to 100 to balance fitting the data against overfitting. For Lasso Regression, we search alpha over a narrower range (0.001 to 1.0), since the L1 penalty sets coefficients exactly to zero and a large alpha quickly removes most features. Elastic Net combines the Ridge and Lasso penalties, so we tune both alpha and the l1_ratio that controls the mix between the two.
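To illustrate the difference between the two penalties, here is a minimal sketch on synthetic data. The data, the `_demo` variable names, and the alpha values are assumptions chosen for illustration; this is not the exercise dataset. It counts how many coefficients each model leaves nonzero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption, not the exercise dataset):
# 200 samples, 200 features, only the first 10 features carry signal.
rng = np.random.default_rng(0)
n_samples, n_features, n_informative = 200, 200, 10
X_demo = rng.normal(size=(n_samples, n_features))
true_coef = np.zeros(n_features)
true_coef[:n_informative] = rng.uniform(1.0, 2.0, size=n_informative)
y_demo = X_demo @ true_coef + 0.1 * rng.normal(size=n_samples)

X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Illustrative alpha values, not tuned
ridge_demo = Ridge(alpha=10.0).fit(X_demo_scaled, y_demo)
lasso_demo = Lasso(alpha=0.05, max_iter=10000).fit(X_demo_scaled, y_demo)

# Ridge shrinks coefficients but keeps them nonzero; Lasso zeroes many of them out
n_nonzero_ridge = int(np.sum(ridge_demo.coef_ != 0))
n_nonzero_lasso = int(np.sum(lasso_demo.coef_ != 0))
print(f"nonzero coefficients: Ridge {n_nonzero_ridge}, Lasso {n_nonzero_lasso}")
```

When only a few features matter, this sparsity may help Lasso and Elastic Net; whether the exercise dataset has that structure is not verified by this sketch.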

Random Forest is our tree-based model. It does not need feature scaling, so we use it without the StandardScaler step. We tune the number of trees (n_estimators), how deep each tree can grow (max_depth), and the minimum samples needed to split a node (min_samples_split).

We use 5-fold cross-validation with GridSearchCV to find the best hyperparameters for each model. This means we split our training data into 5 parts, train on 4 parts, and validate on the remaining part, repeating this process 5 times.
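The fold-by-fold procedure can be sketched by hand for a single hyperparameter value. This is a minimal illustration on synthetic data (the data and the `_demo` names are assumptions); it relies on the fact that `cv=5` uses an unshuffled `KFold` for regressors, so the manual loop should reproduce the mean score from `cross_val_score`.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption, for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 20))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1] + 0.1 * rng.normal(size=100)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", Lasso(alpha=0.02)),
])

# Train on 4 folds, validate on the remaining one, 5 times
fold_scores = []
for train_idx, val_idx in KFold(n_splits=5).split(X_demo):
    model = clone(pipeline)  # fresh, unfitted copy for each fold
    model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(r2_score(y_demo[val_idx], model.predict(X_demo[val_idx])))

manual_mean = float(np.mean(fold_scores))
sklearn_mean = float(cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="r2").mean())
print(f"manual: {manual_mean:.4f} | cross_val_score: {sklearn_mean:.4f}")
```

Because the scaler sits inside the pipeline, it is refitted on the 4 training folds at each iteration, so the validation fold never influences the scaling statistics.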

Once we have found the best hyperparameters, we'll evaluate all models on our test set to see which one performs best in practice and whether we can achieve our target R² score of 0.88.

In [25]:
print("\nhyperparameter tuning and choosing the best model...")

# we define the models and their hyperparameter grids
models = {
    'Linear Regression': {
        'model': Pipeline([
            ('scaler', StandardScaler()),
            ('regressor', LinearRegression())
        ]),
        'params': {}  # No hyperparameters to tune
    },
    
    'Ridge Regression': {
        'model': Pipeline([
            ('scaler', StandardScaler()),
            ('regressor', Ridge())
        ]),
        'params': {
            'regressor__alpha': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
        }
    },
    
    'Lasso Regression': {
        'model': Pipeline([
            ('scaler', StandardScaler()),
            ('regressor', Lasso(max_iter=2000))
        ]),
        'params': {
            'regressor__alpha': [0.001, 0.01, 0.02, 0.05, 0.1, 0.5, 1.0]
        }
    },
    
    'Elastic Net': {
        'model': Pipeline([
            ('scaler', StandardScaler()),
            ('regressor', ElasticNet(max_iter=2000))
        ]),
        'params': {
            'regressor__alpha': [0.001, 0.01, 0.1, 1.0],
            'regressor__l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9]
        }
    },
    
    'Random Forest': {
        'model': RandomForestRegressor(random_state=42, n_jobs=-1),
        'params': {
            'n_estimators': [50, 100, 200],
            'max_depth': [5, 10, 15, None],
            'min_samples_split': [2, 5, 10]
        }
    }
}

# we will store the best models and their cross-validation results
best_models = {}
cv_results = {}

for name, config in models.items():
    print(f"\nhyper tuning {name}...")
    
    if config['params']:  # if there are hyperparameters to tune
        grid_search = GridSearchCV(
            config['model'], 
            config['params'], 
            cv=5, 
            scoring='r2', 
            n_jobs=-1,
            verbose=0
        )
        grid_search.fit(X_train, y_train)
        
        best_models[name] = grid_search.best_estimator_
        cv_score = grid_search.best_score_
        best_params = grid_search.best_params_
        
        print(f"best cross validation R² score: {cv_score:.4f}")
        print(f"best params: {best_params}")
        
    else:  # no hyperparameters to tune, just cross-validate
        cv_scores = cross_val_score(config['model'], X_train, y_train, cv=5, scoring='r2')
        cv_score = cv_scores.mean()
        
        config['model'].fit(X_train, y_train)
        best_models[name] = config['model']
        
        print(f"cross validation R² score: {cv_score:.4f} ± {cv_scores.std():.4f}")
    
    cv_results[name] = cv_score



hyperparameter tuning and choosing the best model...

hyper tuning Linear Regression...
cross validation R² score: 0.4479 ± 0.1677

hyper tuning Ridge Regression...
best cross validation R² score: 0.5702
best params: {'regressor__alpha': 10.0}

hyper tuning Lasso Regression...
best cross validation R² score: 0.9250
best params: {'regressor__alpha': 0.02}

hyper tuning Elastic Net...
best cross validation R² score: 0.9130
best params: {'regressor__alpha': 0.01, 'regressor__l1_ratio': 0.9}

hyper tuning Random Forest...
best cross validation R² score: 0.2083
best params: {'max_depth': 15, 'min_samples_split': 5, 'n_estimators': 200}


## Final Evaluation on Test Set

Now comes the moment of truth - testing our tuned models on the test set that we've kept completely separate from the training process. This is where we'll see how well our models actually perform on unseen data and whether we can achieve our target R² score of 0.88.

One of the most important aspects of this evaluation is comparing the cross-validation scores we got during training with the actual test set performance. If there's a big difference between these two, it might indicate that our model is overfitting to the training data. Ideally, we want to see similar performance on both, which would suggest our model generalizes well.

We'll identify the best performing model based on the R² score and see if any of our models successfully meet the target threshold. This comparison will also help us understand which approach works best for this particular dataset and whether our hyperparameter tuning was effective.
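Since every comparison here relies on the R² score, it helps to recall how it is computed: one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the targets. The sketch below uses small made-up arrays and a hypothetical helper `r2_manual` to show that the score becomes negative whenever a model predicts worse than the constant mean.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_manual(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot (hypothetical helper for illustration)."""
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # variance around the mean
    return 1.0 - ss_res / ss_tot

# Made-up values, not taken from the exercise dataset
y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_good_demo = np.array([1.1, 1.9, 3.2, 3.8])   # close to the targets
y_bad_demo = np.array([4.0, 1.0, 5.0, 0.0])    # worse than predicting the mean

r2_good = r2_manual(y_true_demo, y_good_demo)
r2_bad = r2_manual(y_true_demo, y_bad_demo)
print(f"good predictions: {r2_good:.2f} | bad predictions: {r2_bad:.2f}")
```

A strongly negative test R² therefore means the predictions are further from the targets than the mean of the test targets would be.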

In [26]:
# =============================================================================
# FINAL EVALUATION ON TEST SET
# =============================================================================

print("\nfinal Evaluation on the test set")

test_results = {}
predictions = {}

for name, model in best_models.items():
    # predict on test set
    y_pred = model.predict(X_test)
    predictions[name] = y_pred
    
    # calculate metrics
    r2 = r2_score(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    
    test_results[name] = {
        'R2': r2,
        'MSE': mse,
        'MAE': mae,
        'RMSE': np.sqrt(mse)
    }
    
    print(f"\n{name}:")
    print(f"  R2 Score: {r2:.4f}")
    print(f"  RMSE: {np.sqrt(mse):.4f}")
    print(f"  MAE: {mae:.4f}")
    
    # we check if target R2 is achieved
    if r2 > TARGET_R2:
        print(f"Target R2 > {TARGET_R2}")
    else:
        print(f"Target R2 < {TARGET_R2}")

# we select the best model by its cross-validation score, so that the
# test set plays no role in model selection, then report its test R2
best_model_name = max(cv_results, key=cv_results.get)
best_r2 = test_results[best_model_name]['R2']

print(f"\nBEST MODEL: {best_model_name} with R2 = {best_r2:.4f}")

# comparison
print(f"\ncross-validation vs test set results:")
for name in best_models.keys():
    cv_r2 = cv_results[name]
    test_r2 = test_results[name]['R2']
    print(f"{name:20s} | cross_validation: {cv_r2:.4f} | on test data: {test_r2:.4f} | diff: {abs(cv_r2-test_r2):.4f}")


final Evaluation on the test set

Linear Regression:
  R2 Score: -9.9240
  RMSE: 2.8359
  MAE: 2.3420
Target R2 < 0.88

Ridge Regression:
  R2 Score: 0.7153
  RMSE: 0.4578
  MAE: 0.3665
Target R2 < 0.88

Lasso Regression:
  R2 Score: 0.9231
  RMSE: 0.2380
  MAE: 0.1944
Target R2 > 0.88

Elastic Net:
  R2 Score: 0.9184
  RMSE: 0.2452
  MAE: 0.2028
Target R2 > 0.88

Random Forest:
  R2 Score: 0.3456
  RMSE: 0.6941
  MAE: 0.5373
Target R2 < 0.88

BEST MODEL: Lasso Regression with R2 = 0.9231

cross-validation vs test set results:
Linear Regression    | cross_validation: 0.4479 | on test data: -9.9240 | diff: 10.3718
Ridge Regression     | cross_validation: 0.5702 | on test data: 0.7153 | diff: 0.1451
Lasso Regression     | cross_validation: 0.9250 | on test data: 0.9231 | diff: 0.0019
Elastic Net          | cross_validation: 0.9130 | on test data: 0.9184 | diff: 0.0054
Random Forest        | cross_validation: 0.2083 | on test data: 0.3456 | diff: 0.1373


## Conclusion

Looking at our results, two of our models exceeded the target R² score of 0.88 on the test set. Lasso Regression performed best with a test R² of 0.9231, closely followed by Elastic Net at 0.9184. Both essentially match the Bayes estimator's R² of approximately 0.92 stated in the exercise, and their cross-validation and test scores differ by less than 0.01, which suggests they generalize well. The other three models fall short of the target: Ridge Regression reaches 0.7153, Random Forest 0.3456, and unregularized Linear Regression drops to -9.9240 on the test set.