# Shallow Machine Learning Models: Random Forest

<div class="alert alert-info" role="alert" 
     style="font-size: 1.1em; padding: 10px; margin: 10px 0; text-align: center;">
    
    Random forest is an *ensemble* learning method that builds and aggregates multiple decision trees to capture
    complex nonlinear relationships in data.
</div>

### Import Libraries including from `sklearn` for shallow ML

In [None]:
# Data Wrangling
import glob
import pandas as pd
import numpy as np

# Machine Learning
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import shap

# Data Plotting
import matplotlib.pyplot as plt
import seaborn as sns

# Ignore warnings
import warnings
warnings.filterwarnings("ignore")

<div class="alert alert-info" role="alert" 
     style="font-size: 1.8em; font-weight: bold; padding: 15px; margin: 10px 0; text-align: center; background-color: #d9edf7; border-color: #bce8f1; color: #31708f; border-radius: 8px;">
    Data Preprocessing
</div>

### Option 1 to Load CSV files containing variables -- pandas' `read_csv` function

In [None]:
# Each CSV file is read into a pandas DataFrame
SSH = pd.read_csv("cmems_mod_glo_phy_my_0.083deg_P1D-m_SSH.csv", comment='#') # tells pandas to ignore lines starting with '#'
SST = pd.read_csv("cmems_mod_glo_phy_my_0.083deg_P1D-m_SST.csv", comment='#')
SSS = pd.read_csv("cmems_mod_glo_phy_my_0.083deg_P1D-m_SSs.csv", comment='#')
VEL = pd.read_csv("cmems_mod_glo_phy_my_0.083deg_P1D-m_VEL.csv", comment='#')
MLD = pd.read_csv("cmems_mod_glo_phy_my_0.083deg_P1D-m_MLD.csv", comment='#')

In [None]:
# Parse the ISO 8601 UTC timestamps in 'time' and store them in a new 'dates' column:
SSH['dates'] = pd.to_datetime(SSH['time'], format='%Y-%m-%dT%H:%M:%S.%fZ', utc=True)

# Step 1: format as 'YYYY-MM-DD HH:MM' strings (drops the seconds and the UTC offset)
SSH['dates'] = SSH['dates'].dt.strftime('%Y-%m-%d %H:%M')
# Step 2: convert the strings back to (timezone-naive) datetimes
SSH['dates'] = pd.to_datetime(SSH['dates'])
# Step 3: drop the now unnecessary 'time' column
SSH = SSH.drop(columns=['time'])

# Re-order your columns so date is still first
SSH = SSH[['dates','zos']]
# Display
print(SSH.head(2))
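
If the goal is to keep only the calendar date, one possible shortcut is sketched below. It runs on made-up timestamps (the `raw` series is an assumption for illustration, not the CMEMS data) and uses the pandas `.dt` accessor to drop the timezone and reset the time of day.

```python
import pandas as pd

# Made-up timestamps in the same ISO 8601 UTC style as the 'time' column
raw = pd.Series(["2020-01-01T00:00:00.000Z", "2020-01-02T00:00:00.000Z"])

dates = (
    pd.to_datetime(raw, utc=True)   # parse as timezone-aware UTC datetimes
      .dt.tz_localize(None)         # drop the timezone, keeping the UTC clock time
      .dt.normalize()               # reset the time of day to midnight
)
print(dates)
```

This avoids the round trip through strings, at the cost of discarding any sub-daily information.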

In [None]:
# We can now combine these different vars to make a new df
df = pd.DataFrame({'Date':SSH['dates'], 'SSH':SSH['zos'], 'SST':SST['thetao'], 'SSS':SSS['so'], 
                   'Vuo':VEL['uo'], 'Vvo':VEL['vo'], 'MLD':MLD['mlotst']})
print(df.head(3))
print(':')
print(len(df))

### Option 2 to Load CSV files containing variables -- `glob` and pandas' `read_csv` function
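
A minimal sketch of how this option could look, assuming the files share the prefix used in Option 1 and differ only in the variable suffix. The helper name `load_csvs` and the dictionary layout are illustrative choices, not part of the original workflow.

```python
import glob
import os
import pandas as pd

def load_csvs(pattern):
    """Read every CSV matching `pattern` into a dict keyed by the filename suffix."""
    frames = {}
    for path in sorted(glob.glob(pattern)):
        # e.g. "cmems_..._P1D-m_SSH.csv" -> "SSH"
        name = os.path.splitext(os.path.basename(path))[0].rsplit("_", 1)[-1]
        # comment='#' tells pandas to ignore lines starting with '#'
        frames[name] = pd.read_csv(path, comment="#")
    return frames

# Assumed filename pattern; adjust it to match your own files
data = load_csvs("cmems_mod_glo_phy_my_0.083deg_P1D-m_*.csv")
print(sorted(data))  # names of the variables that were found
```

Each variable is then available as `data["SSH"]`, `data["SST"]`, and so on, depending on the suffixes in your filenames.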

### Set the `predictor` and `target` variables (X, y)

In [None]:
predictors = ['SSH', 'SSS', 'Vuo', 'Vvo', 'MLD'] # Predictor vars
X = df[predictors].values 
y = df['SST'].values      # Target variable
# Shapes should be (n_samples, n_features) for X and (n_samples,) for y
print(X.shape, y.shape)

### Split the data into two sets: `training` (80%) and `test` (20%)

In [None]:
# Split your dataset so 20% is set aside for testing (test_size=0.2)
# Set random_state to ensure your train-test split is always the same (for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check the 80% training: 20% testing split
print("Training set size:", X_train.shape[0])
print("Testing set size:",  X_test.shape[0])
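
To see what `random_state` does in practice, the sketch below splits a small synthetic array twice with the same seed and checks that both splits are identical. The arrays `X_demo` and `y_demo` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 50 samples, 2 features
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.arange(50)

# Two independent calls with the same seed
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Compare X_train, X_test, y_train, y_test pairwise
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("Identical splits:", same_split)
print("Train/test sizes:", len(split_a[0]), len(split_a[1]))
```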

<div class="alert alert-info" role="alert" 
     style="font-size: 1.8em; font-weight: bold; padding: 15px; margin: 10px 0; text-align: center; background-color: #d9edf7; border-color: #bce8f1; color: #31708f; border-radius: 8px;">
    
    Random Forest (tree-based) Model
</div>

In [None]:
# Initialize and Train Random Forest Model
rf_model = RandomForestRegressor(
    n_estimators = 100,  # Number of trees in the forest
    max_depth = None,    # No depth limit: trees grow until leaves are pure (default behavior)
    random_state = 42,   # Ensures reproducibility
    n_jobs = -1)         # Use all available CPU cores

In [None]:
# Fit the model on the training set
rf_model.fit(X_train, y_train)

# Predict SST on the test dataset
y_pred = rf_model.predict(X_test)
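
For regression, the forest's prediction is the average of the predictions of its individual trees. The sketch below checks this on synthetic data (the arrays and the small `forest` model are assumptions for illustration, separate from `rf_model`).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic nonlinear regression problem
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Predict with every fitted tree, then average across trees
tree_preds = np.stack([tree.predict(X_demo) for tree in forest.estimators_])
manual_mean = tree_preds.mean(axis=0)

matches = np.allclose(manual_mean, forest.predict(X_demo))
print("Forest prediction equals mean of tree predictions:", matches)
```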

<div class="alert alert-info" role="alert" 
     style="font-size: 1.8em; font-weight: bold; padding: 15px; margin: 10px 0; text-align: center; background-color: #d9edf7; border-color: #bce8f1; color: #31708f; border-radius: 8px;">
    Evaluating Model Performance
</div>

### Metrics for Random Forest Model: `R2` and `RMSE`

In [None]:
# Evaluate RF Model Performance
r2   = r2_score(y_test, y_pred)                      # R² score (goodness of fit)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))   # Root Mean Squared Error (same units as SST)

print(f"Random Forest R²  : {r2:.2f}")
print(f"Random Forest RMSE: {rmse:.2f}")
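
As a reminder of what these metrics measure, the sketch below computes R² and RMSE by hand with NumPy and compares them with the `sklearn` functions. The values in `y_true` and `y_hat` are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

# Made-up observed and predicted values
y_true = np.array([20.1, 21.4, 19.8, 22.0, 20.7])
y_hat  = np.array([20.0, 21.0, 20.2, 21.5, 21.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot                 # R² = 1 - SS_res / SS_tot

rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))  # square root of the MSE

print(np.isclose(r2_manual, r2_score(y_true, y_hat)))
print(np.isclose(rmse_manual, np.sqrt(mean_squared_error(y_true, y_hat))))
```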

### Better scores again! Do we see the same feature contributions now that elastic net penalties aren't being applied?

<div class="alert alert-info" role="alert" 
     style="font-size: 1.8em; font-weight: bold; padding: 15px; margin: 10px 0; text-align: center; background-color: #d9edf7; border-color: #bce8f1; color: #31708f; border-radius: 8px;">
    Feature Importance
</div>

In [None]:
# Using SHAP to explain our model predictions
explainer = shap.TreeExplainer(rf_model)     # Explainer for tree-based models
shap_vals = explainer.shap_values(X_test)    # Compute SHAP values for test data

# Compare Feature Importances from RF
rf_importance = pd.Series(rf_model.feature_importances_, index=predictors)
# Create DataFrame (df) to hold the feature names and their RF importance:
rf_df = pd.DataFrame({
    "Variables": predictors,
    "RF Importance": rf_importance})

# Sort rf_df so most important variables are at the top (descending order)
rf_df.sort_values(by = "RF Importance", ascending = False, inplace = True)
print(rf_df)

In [None]:
# Compute the mean absolute SHAP values for each feature
# Averaging over all test samples gives a global measure of feature importance
shap_importance = np.abs(shap_vals).mean(axis = 0)

# Create a DataFrame to hold the feature names and their SHAP importance
shap_df = pd.DataFrame({
    "Variables": predictors,
    "Mean Absolute SHAP": shap_importance})

# Sort the DataFrame so that the most important features are at the top
shap_df.sort_values(by="Mean Absolute SHAP", ascending=False, inplace=True)
print(shap_df)

In [None]:
# For plotting SHAP:
# Use the diverging "Spectral" palette for colormap:
cmap = sns.color_palette("Spectral", as_cmap = True)

# Compute a normalized (percentile) rank for each feature, between 0 and 1
# Note -- shap_df values used to determine relative order:
norm_ranks = shap_df["Mean Absolute SHAP"].rank(pct = True)

# Map each normalized rank to a colour via colourmap:
colors = norm_ranks.apply(lambda x: cmap(x)).tolist()

In [None]:
# Create a figure with two subplots side-by-side
fig, axes = plt.subplots(1, 2, figsize=(9, 4))

# ----- Plot RF Feature Importance -----
sns.barplot(data = rf_df, x = "RF Importance", y = "Variables", ax = axes[0],
    palette = colors)
axes[0].set_title( "RF Variable Importance", fontweight='bold')
axes[0].set_xlabel("RF Importance", fontsize = 10)
axes[0].set_ylabel("Features", fontsize = 10)
axes[0].grid(axis = 'x', linestyle = '--', alpha = 0.7)

# ----- Plot SHAP Feature Importance -----
sns.barplot(data = shap_df, x = "Mean Absolute SHAP", y = "Variables", ax=axes[1],
    palette = colors)
axes[1].set_title( "SHAP Variable Importance", fontweight ='bold')
axes[1].set_xlabel("Mean Absolute SHAP Value", fontsize = 10)
axes[1].set_ylabel("")  # Remove redundant ylabel on the right plot
axes[1].grid(axis = 'x', linestyle = '--', alpha = 0.7)

# Show plots
plt.tight_layout()
plt.show()

<div class="alert alert-info" role="alert" 
     style="font-size: 1.em; padding: 15px; margin: 10px 0; text-align: left; background-color: #d9edf7; border-color: #bce8f1; color: #31708f; border-radius: 8px;">

    While the rankings are similar, the scales differ because the two methods measure importance in different ways:
    
    ⦾ Random Forest Feature Importance 
       - Calculated from the reduction in impurity (for regression, the MSE) that each feature provides when used for splits.
       - Normalized so that the importances sum to 1, so they represent the relative importance of features in splitting decisions.
       - Rankings are more about the model's inner workings:
       - i.e. how frequently and effectively features are used to split nodes and reduce error during model training.
        
    ⦾ SHAP Mean Absolute Values
       - Computed per sample as the contribution each feature makes to the model’s prediction; the absolute contributions are then averaged across all samples.
       - Values are in the same units as the output variable (SST) and are not normalized the way RF importances are.
       - More about the actual impact of feature values:
       - i.e. how much each feature’s actual value contributed to the prediction (game theory foundation).
</div>
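
The normalization of the Random Forest importances can be checked directly. The sketch below fits a small forest on synthetic data (an assumption for illustration, separate from `rf_model`) in which the first feature dominates the target, and confirms that the importances are non-negative and sum to 1.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: feature 0 drives the target, feature 2 is pure noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = 3.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X_demo, y_demo)

importances = forest.feature_importances_  # impurity-based importances
total = importances.sum()

print("Importances:", np.round(importances, 3))
print("Sum of importances:", round(float(total), 6))
```

Because the values are relative shares rather than quantities in the units of the target, they cannot be compared directly with mean absolute SHAP values.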