I have successfully performed cross-validation and grid search for various regression models and obtained RMSE and MAE scores. Now, we can run some key evaluations to further analyse the model performance.

Visualising Predictions vs. Actual Values: You can create scatter plots or line plots to visualise how well each model's predictions align with the actual target values. This can give you a qualitative sense of the model's performance.
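As a rough illustration of the predicted-vs-actual plot described above, here is a minimal sketch. It uses synthetic data from make_regression as a stand-in (an assumption, since the notebook's X and y are not loaded here) and the names X_demo, y_demo and y_pred are hypothetical. The predictions come from cross_val_predict, so each point is predicted by a model that did not see it during training.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in data (assumption); replace with the notebook's X and y
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)

# Out-of-fold predictions: every sample is predicted by a model that never saw it
model = RandomForestRegressor(n_estimators=50, random_state=42)
y_pred = cross_val_predict(model, X_demo, y_demo, cv=5)

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(y_demo, y_pred, s=10, alpha=0.6)

# Diagonal reference line: points lying on it are perfect predictions
lims = [min(y_demo.min(), y_pred.min()), max(y_demo.max(), y_pred.max())]
ax.plot(lims, lims, linestyle="--", color="grey")

ax.set_xlabel("Actual")
ax.set_ylabel("Predicted")
ax.set_title("Random Forest: predicted vs. actual (5-fold CV)")
```

Points far from the dashed diagonal are the samples the model predicts worst, which is often where RMSE and MAE start to disagree.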

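Statistical Tests: Paired statistical tests on the per-fold scores can indicate whether the differences between models are larger than fold-to-fold variation.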
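One possible way to compare two models statistically is a paired t-test on their per-fold RMSE values. The sketch below is an assumption about how this could be done, not the notebook's actual method: it uses synthetic data, a fixed KFold object so both models see identical folds, and scipy.stats.ttest_rel. Fold scores are not fully independent, so the p-value should be read as indicative only.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption); replace with the notebook's X and y
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)

# Fixed folds so both models are scored on exactly the same splits
cv = KFold(n_splits=5, shuffle=True, random_state=42)

def fold_rmse(model):
    """Return the RMSE of `model` on each cross-validation fold."""
    neg_mse = cross_val_score(model, X_demo, y_demo, cv=cv,
                              scoring="neg_mean_squared_error")
    return np.sqrt(-neg_mse)

rmse_rf = fold_rmse(RandomForestRegressor(n_estimators=50, random_state=42))
rmse_ridge = fold_rmse(Ridge(alpha=1.0))

# Paired t-test on the fold-wise RMSE differences
t_stat, p_value = stats.ttest_rel(rmse_rf, rmse_ridge)
print(f"Mean RMSE difference (RF - Ridge): {np.mean(rmse_rf - rmse_ridge):.3f}")
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

With only five folds the test has little power, so repeated cross-validation or a non-parametric alternative such as scipy.stats.wilcoxon may be worth considering.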

In [5]:
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error



# Initialise the models
models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    "Ridge": Ridge(alpha=1.0, random_state=42),
    "Lasso": Lasso(alpha=0.1, random_state=42),
    "XGBoost": xgb.XGBRegressor(objective ='reg:squarederror', random_state=42)
}

# Collect one result row per model
rows = []

# Perform cross-validation and store results
for model_name, model in models.items():
    # Calculate cross-validated RMSE
    neg_mse_scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
    rmse_scores = np.sqrt(-neg_mse_scores)
    avg_rmse = np.mean(rmse_scores)

    # Calculate cross-validated MAE
    mae_scores = -cross_val_score(model, X, y, cv=5, scoring='neg_mean_absolute_error')
    avg_mae = np.mean(mae_scores)

    rows.append({'Model': model_name, 'RMSE': avg_rmse, 'MAE': avg_mae})

# Build the results DataFrame once (DataFrame.append was removed in pandas 2.0)
results_df = pd.DataFrame(rows, columns=['Model', 'RMSE', 'MAE'])

# Hyperparameter grid for NN
param_grid = {
    'hidden_layer_sizes': [(100,), (100, 50)],
    'activation': ['relu', 'tanh'],
    'learning_rate_init': [0.001, 0.01],
}

# Initialise the NN Regressor model
nn_model = MLPRegressor(max_iter=2000, random_state=42)

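# Scale the features
# (note: fitting the scaler on all of X before cross-validation lets each fold
#  see statistics from its own validation data; a Pipeline would avoid this)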
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Perform Grid Search for NN
grid_search = GridSearchCV(nn_model, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_scaled, y)
best_nn_model = grid_search.best_estimator_

# Compute RMSE and MAE for NN model and add it to the results DataFrame
nn_rmse = np.sqrt(-grid_search.best_score_)
nn_mae_scores = -cross_val_score(best_nn_model, X_scaled, y, cv=5, scoring='neg_mean_absolute_error')
nn_avg_mae = np.mean(nn_mae_scores)

# Add the NN results to the results DataFrame
nn_row = pd.DataFrame([{
    'Model': 'Neural Network Regressor',
    'RMSE': nn_rmse,
    'MAE': nn_avg_mae
}])
results_df = pd.concat([results_df, nn_row], ignore_index=True)

# Print the updated results table
print(results_df)


                      Model      RMSE       MAE
0         Linear Regression  1.067225  0.305496
1             Random Forest  0.255806  0.193998
2         Gradient Boosting  0.263997  0.198448
3                     Ridge  0.394622  0.326558
4                     Lasso  0.392976  0.333159
5                   XGBoost  0.264377  0.198081
6  Neural Network Regressor  0.330178  0.241556


When evaluating model performance:

The model with the lowest RMSE and MAE has the least prediction error on average. However, if a model is too complex, it might perform exceptionally well on the training set but poorly on unseen data (overfitting).
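A simple way to look for the overfitting mentioned above is to compare training and validation error within cross-validation. The following is a minimal sketch on synthetic data (an assumption, with hypothetical names X_demo and y_demo), using cross_validate with return_train_score=True. A training RMSE far below the validation RMSE would suggest the model is fitting noise.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (assumption); replace with the notebook's X and y
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)

cv_out = cross_validate(
    RandomForestRegressor(n_estimators=50, random_state=42),
    X_demo, y_demo,
    cv=5,
    scoring="neg_mean_squared_error",
    return_train_score=True,  # also score each model on its own training folds
)

# Scores are negated MSE, so flip the sign before taking the square root
train_rmse = np.sqrt(-cv_out["train_score"]).mean()
val_rmse = np.sqrt(-cv_out["test_score"]).mean()

print(f"Mean training RMSE:   {train_rmse:.3f}")
print(f"Mean validation RMSE: {val_rmse:.3f}")
```

Some gap is expected for flexible models such as Random Forests; what matters is whether the validation error is still acceptable.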

Comparative Analysis:

Linear Regression: the most basic model and often serves as a benchmark. If complex models don't perform significantly better than linear regression, it could be a sign that the added complexity isn't beneficial.

Random Forest & Gradient Boosting: These are ensemble models. In our results, they have much lower RMSE compared to the linear models, suggesting they are capturing nonlinear patterns in the data better.

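Ridge & Lasso: These are regularised linear regression models. In the cross-validation results they have a much lower RMSE than plain linear regression (about 0.39 versus 1.07) but a slightly higher MAE (about 0.33 versus 0.31). Because RMSE penalises large errors more heavily than MAE, this pattern suggests that plain linear regression makes a few very large errors that regularisation dampens. The regularisation strengths used here (alpha=1.0 and alpha=0.1) were not tuned, so these scores may improve with a hyperparameter search.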

XGBoost: An advanced gradient boosting algorithm. It seems to be performing comparably to the Random Forest and Gradient Boosting models in our case.

Neural Network Regressor: NNs are highly flexible models whose performance can vary significantly with architecture and hyperparameters. Our neural network performs better than the linear models but not as well as the tree-based models. This suggests that the data contain non-linear patterns that Linear Regression, Ridge, and Lasso cannot capture as effectively. Despite its flexibility, the NN has not outperformed Random Forest, Gradient Boosting, or XGBoost in this instance. Tree-based models handle non-linearity and interaction effects well and can be more robust to certain types of data structures. It is also possible that the tree-based models are inherently better suited to this dataset, or that the neural network architecture and hyperparameters searched were not optimal.
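The comparison above notes that the Ridge and Lasso hyperparameters may need tuning. A minimal sketch of how the regularisation strength could be searched is shown below. The alpha grid, the synthetic data and the names X_demo, y_demo and best_alpha are all assumptions for illustration. The scaler sits inside a Pipeline so that it is refitted on the training portion of each fold.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); replace with the notebook's X and y
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)

# Hypothetical grid of regularisation strengths
alphas = [0.001, 0.01, 0.1, 1.0, 10.0]

best_alpha = {}
for name, estimator in {"Ridge": Ridge(), "Lasso": Lasso(max_iter=10000)}.items():
    # Scaling inside the pipeline keeps validation folds out of the scaler's fit
    pipe = Pipeline([("scale", StandardScaler()), ("model", estimator)])
    search = GridSearchCV(
        pipe,
        {"model__alpha": alphas},
        cv=5,
        scoring="neg_root_mean_squared_error",
    )
    search.fit(X_demo, y_demo)
    best_alpha[name] = search.best_params_["model__alpha"]
    print(f"{name}: best alpha = {best_alpha[name]}, CV RMSE = {-search.best_score_:.3f}")
```

The same Pipeline pattern could be used for the MLPRegressor grid search, with parameter names prefixed by the step name (for example model__hidden_layer_sizes).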

In [10]:
# Fit the Random Forest on the training data
# (X_train and y_train come from the train/validation/test split)
models["Random Forest"].fit(X_train, y_train)

# Feature names and their impurity-based importances, in column order
feature_names = X.columns
importances = models["Random Forest"].feature_importances_

# Print one "feature: importance" line per feature
for feature, importance in zip(feature_names, importances):
    print(f"{feature}: {importance}")


LAEI 1km2 ID: 0.032061205274607665
GRID_ExactCut_ID: 0.03633522462479362
Easting_y: 0.03162297042149062
Northing_y: 0.012987770141891427
SO2: 0.062056202208094775
NMVOC: 0.42571840196920224
NH3: 0.02355037323216946
CO: 0.04887203651307892
CH4: 0.04969606085089379
N2O: 0.02809915907548689
Cd: 0.025181446314902604
Hg: 0.012835127935211334
Pb: 0.058198675432969596
BaP: 0.050208850514976235
PCB: 0.0317355005801904
HCl: 4.34364484089935e-05
PM10_cox_lag1: 0.03585803273884311
PM10_cox_lag2: 0.03493952572278818


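The importance value of each feature tells us how much that feature contributes to the model's decisions. In scikit-learn's Random Forest these values are impurity-based: each one reflects how much the splits on that feature reduce the prediction error across the trees, and the values sum to 1. The higher the importance, the more influential the feature is in determining the model's predictions.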

NMVOC has the highest feature importance at approximately 0.4257. This suggests that the Random Forest model considers NMVOC to be the most informative feature when making predictions.
Other features with relatively high importances include SO2, CO, CH4, BaP, Pb, and GRID_ExactCut_ID.

HCl has the smallest importance value, almost 0, which means it has minimal influence on the model's decisions.
Note that feature importances in a Random Forest model do not tell you the direction of a relationship (whether it is positive or negative), only the strength or magnitude of the influence.
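Impurity-based importances are computed from the training data and can favour features with many distinct values, so it can be useful to cross-check them with permutation importance on held-out data. The sketch below is one way to do that, using synthetic data and hypothetical names (X_demo, rf, perm); it is not part of the original analysis.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 6 features, 3 of them informative
X_demo, y_demo = make_regression(n_samples=400, n_features=6, n_informative=3,
                                 noise=5.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Impurity-based importances (the kind printed above); they sum to 1
impurity_imp = rf.feature_importances_

# Permutation importance: drop in score when one feature's values are shuffled
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=42)

# List features from most to least important by permutation importance
for i in np.argsort(perm.importances_mean)[::-1]:
    print(f"feature {i}: impurity = {impurity_imp[i]:.3f}, "
          f"permutation = {perm.importances_mean[i]:.3f} "
          f"+/- {perm.importances_std[i]:.3f}")
```

If both measures rank the same features at the top, the ranking is more likely to be robust.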



In [14]:
# Modify the instantiation
models["Random Forest"] = RandomForestRegressor(oob_score=True, random_state=42)

# Train the model (assuming X_train and y_train are your training data)
models["Random Forest"].fit(X_train, y_train)

# Access the OOB score
print(models["Random Forest"].oob_score_)


0.6311366570845751


The oob_score_ is the Out-Of-Bag (OOB) score for a Random Forest model. It measures predictive performance using, for each tree, the samples that were left out of that tree's bootstrap sample. For a RandomForestRegressor the score is the coefficient of determination (R²), not a classification accuracy, so a value of 0.6311 means the model explains roughly 63% of the variance in the target on the out-of-bag samples. This is a reasonable score, although a good deal of variance remains unexplained.
It provides an estimate of the model's performance without needing a separate validation set, which can be especially valuable if you have limited data.
It is similar in spirit to cross-validation and comes for free from the way Random Forests are constructed.
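To make the meaning of the OOB score concrete, the sketch below recomputes it from the out-of-bag predictions. For a RandomForestRegressor, scikit-learn reports oob_score_ as the R² of oob_prediction_ against the training targets. The data are synthetic and the names X_demo, y_demo, oob_r2 and oob_rmse are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in data (assumption); replace with X_train and y_train
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)

rf = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=42)
rf.fit(X_demo, y_demo)

# Each sample's OOB prediction averages only the trees that did not train on it
oob_r2 = r2_score(y_demo, rf.oob_prediction_)

# The same predictions give an OOB RMSE on the scale of the target
oob_rmse = np.sqrt(mean_squared_error(y_demo, rf.oob_prediction_))

print(f"oob_score_ = {rf.oob_score_:.4f}")
print(f"R^2 from oob_prediction_ = {oob_r2:.4f}")
print(f"OOB RMSE = {oob_rmse:.4f}")
```

The OOB RMSE is directly comparable to the cross-validated RMSE values reported for the other models, which an R² value on its own is not.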


Further evaluation with separate training, validation, and test sets.

In [17]:
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Splitting the data
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # 20% for testing
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)  # 60% for training, 20% for validation

# Initialise the models
models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    "Ridge": Ridge(alpha=1.0, random_state=42),
    "Lasso": Lasso(alpha=0.1, random_state=42),
    "XGBoost": xgb.XGBRegressor(objective ='reg:squarederror', random_state=42)
}

# Collect one validation result row per model
val_rows = []

# Train and validate the models
for model_name, model in models.items():
    model.fit(X_train, y_train)
    val_predictions = model.predict(X_val)

    rmse_val = np.sqrt(mean_squared_error(y_val, val_predictions))
    mae_val = mean_absolute_error(y_val, val_predictions)

    val_rows.append({'Model': model_name, 'RMSE': rmse_val, 'MAE': mae_val})

# Build the validation DataFrame once (DataFrame.append was removed in pandas 2.0)
val_results_df = pd.DataFrame(val_rows, columns=['Model', 'RMSE', 'MAE'])

# Scaling for NN
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

# Hyperparameter grid for NN
param_grid = {
    'hidden_layer_sizes': [(100,), (100, 50)],
    'activation': ['relu', 'tanh'],
    'learning_rate_init': [0.001, 0.01],
}

# Initialise the NN Regressor model
nn_model = MLPRegressor(max_iter=2000, random_state=42)

# Grid Search for NN
grid_search = GridSearchCV(nn_model, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train_scaled, y_train)
best_nn_model = grid_search.best_estimator_

val_predictions_nn = best_nn_model.predict(X_val_scaled)
rmse_val_nn = np.sqrt(mean_squared_error(y_val, val_predictions_nn))
mae_val_nn = mean_absolute_error(y_val, val_predictions_nn)

# Add the NN validation results
nn_val_row = pd.DataFrame([{
    'Model': 'Neural Network Regressor',
    'RMSE': rmse_val_nn,
    'MAE': mae_val_nn
}])
val_results_df = pd.concat([val_results_df, nn_val_row], ignore_index=True)

# Final test evaluation
# Choose best model based on validation RMSE/MAE 

best_model = models["Random Forest"]
test_predictions = best_model.predict(X_test)
rmse_test = np.sqrt(mean_squared_error(y_test, test_predictions))
mae_test = mean_absolute_error(y_test, test_predictions)

print("Test RMSE for best model:", rmse_test)
print("Test MAE for best model:", mae_test)

from IPython.display import display

# Display the validation results
display(val_results_df)

# Rounded, captioned view without the index column
# (Styler.format and Styler.hide replace the deprecated set_precision and hide_index)
display(
    val_results_df.style
    .format(precision=4)
    .set_caption("Validation Results")
    .hide(axis="index")
)



Test RMSE for best model: 0.24275280057627383
Test MAE for best model: 0.17751598684054395


                      Model      RMSE       MAE
0         Linear Regression  0.347094  0.274564
1             Random Forest  0.263417  0.198446
2         Gradient Boosting  0.265198  0.197990
3                     Ridge  0.397263  0.324250
4                     Lasso  0.396276  0.333569
5                   XGBoost  0.263830  0.195503
6  Neural Network Regressor  0.300518  0.228293


                   Model    RMSE     MAE
       Linear Regression  0.3471  0.2746
           Random Forest  0.2634  0.1984
       Gradient Boosting  0.2652  0.1980
                   Ridge  0.3973  0.3243
                   Lasso  0.3963  0.3336
                 XGBoost  0.2638  0.1955
Neural Network Regressor  0.3005  0.2283


Adding logic to identify the best model automatically.

In [16]:
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Splitting the data
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # 20% for testing
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)  # 60% for training, 20% for validation

# Initialise the models
models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    "Ridge": Ridge(alpha=1.0, random_state=42),
    "Lasso": Lasso(alpha=0.1, random_state=42),
    "XGBoost": xgb.XGBRegressor(objective ='reg:squarederror', random_state=42)
}

# Collect one validation result row per model
val_rows = []

# Train and validate the models
for model_name, model in models.items():
    model.fit(X_train, y_train)
    val_predictions = model.predict(X_val)

    rmse_val = np.sqrt(mean_squared_error(y_val, val_predictions))
    mae_val = mean_absolute_error(y_val, val_predictions)

    val_rows.append({'Model': model_name, 'RMSE': rmse_val, 'MAE': mae_val})

# Build the validation DataFrame once (DataFrame.append was removed in pandas 2.0)
val_results_df = pd.DataFrame(val_rows, columns=['Model', 'RMSE', 'MAE'])

# Scaling for NN
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

# Hyperparameter grid for NN
param_grid = {
    'hidden_layer_sizes': [(100,), (100, 50)],
    'activation': ['relu', 'tanh'],
    'learning_rate_init': [0.001, 0.01],
}

# Initialise the NN Regressor model
nn_model = MLPRegressor(max_iter=2000, random_state=42)

# Grid Search for NN
grid_search = GridSearchCV(nn_model, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train_scaled, y_train)
best_nn_model = grid_search.best_estimator_

val_predictions_nn = best_nn_model.predict(X_val_scaled)
rmse_val_nn = np.sqrt(mean_squared_error(y_val, val_predictions_nn))
mae_val_nn = mean_absolute_error(y_val, val_predictions_nn)

# Add the NN validation results
nn_val_row = pd.DataFrame([{
    'Model': 'Neural Network Regressor',
    'RMSE': rmse_val_nn,
    'MAE': mae_val_nn
}])
val_results_df = pd.concat([val_results_df, nn_val_row], ignore_index=True)

# Pick the model with the lowest validation RMSE
best_model_name = val_results_df.loc[val_results_df['RMSE'].idxmin(), 'Model']

if best_model_name == 'Neural Network Regressor':
    # The tuned NN is not stored in `models` and expects scaled features
    best_model = best_nn_model
    X_fit = scaler.fit_transform(X_temp)
    X_eval = scaler.transform(X_test)
else:
    best_model = models[best_model_name]
    X_fit, X_eval = X_temp, X_test

# Retrain on training and validation data combined, then evaluate once on the test set
best_model.fit(X_fit, y_temp)
test_predictions = best_model.predict(X_eval)
rmse_test = np.sqrt(mean_squared_error(y_test, test_predictions))
mae_test = mean_absolute_error(y_test, test_predictions)

print(f"Best Model based on Validation RMSE: {best_model_name}")
print("Test RMSE for best model:", rmse_test)
print("Test MAE for best model:", mae_test)

print(val_results_df)


Best Model based on Validation RMSE: Random Forest
Test RMSE for best model: 0.24589050844471486
Test MAE for best model: 0.18008911169668682
                      Model      RMSE       MAE
0         Linear Regression  0.347094  0.274564
1             Random Forest  0.263417  0.198446
2         Gradient Boosting  0.265198  0.197990
3                     Ridge  0.397263  0.324250
4                     Lasso  0.396276  0.333569
5                   XGBoost  0.263830  0.195503
6  Neural Network Regressor  0.300518  0.228293


The Random Forest model has been identified as the best model based on the validation RMSE.
On the test set, after retraining on the combined training and validation data, this Random Forest model achieved an RMSE of approximately 0.2459 and an MAE of approximately 0.1801.
This is slightly better than its validation RMSE of 0.2634, which is a good sign of generalisation, although part of the improvement may come from the larger training set used for the final fit.

Comparing across all models:

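The Random Forest, XGBoost, and Gradient Boosting models are the top three performers in terms of RMSE. Their performance is closely matched: Random Forest leads XGBoost on validation RMSE by only about 0.0004 (0.2634 versus 0.2638), while XGBoost has the lowest validation MAE (0.1955), so the ranking among these three should not be over-interpreted.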
The Neural Network Regressor, though not matching the tree-based models, still outperforms the linear models (Linear Regression, Ridge, and Lasso) in terms of RMSE.