# Special Case: Impact of Model Complexity on Forest-Guided Clustering

📚 In this notebook, we explore how increasing the **complexity of a Random Forest model affects the interpretability** of its explanations using Forest-Guided Clustering (FGC). Model optimization typically focuses on predictive performance, measured with metrics such as accuracy or R². This often leads to highly complex models with deeply grown trees that closely fit the training data. Such models may perform well on the training set, but they are at risk of overfitting, i.e., capturing noise rather than meaningful patterns. Overfitting not only reduces generalization performance, but also lowers the quality of the insights obtained from post-hoc explainability methods. FGC leverages the internal structure of Random Forests to uncover stable and interpretable subgroups in the data. If the forest becomes too complex, however, the discovered patterns may no longer reflect general behavior, but rather instance-specific artifacts.

📦 **Installation:** To get started, you need to install the `fgclustering` package. Please follow the instructions on the [official installation guide](https://forest-guided-clustering.readthedocs.io/en/latest/_getting_started/installation.html).

🚧 **Note:** For a general introduction to FGC, please refer to our [Introduction Notebook](https://forest-guided-clustering.readthedocs.io/en/latest/_tutorials/introduction_to_FGC_use_cases.html).

**Imports:**

In [1]:
## Import the Forest-Guided Clustering package
from fgclustering import forest_guided_clustering, DistanceRandomForestProximity, ClusteringKMedoids

## Imports for datasets
import pandas as pd
from sklearn.datasets import fetch_california_housing

## Additional imports for use-cases
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

## 🏠 The California Housing Dataset

To investigate how model complexity affects the interpretability of Forest-Guided Clustering (FGC), we will use the **California Housing dataset**. This dataset was also used in *Use Case 3*; please refer to that use case for a full description. In this example, we focus on a subset of the data: the first 1,500 samples are used as the training set to fit a `RandomForestRegressor`. We vary the `max_depth` hyperparameter to compare different model complexities. This helps us understand how increasing the depth of the trees affects both model performance and the quality of the explanations retrieved by FGC.


In [2]:
data_housing = fetch_california_housing(as_frame=True).frame

data_housing_train = data_housing.iloc[:1500]
X_housing_train = data_housing_train.loc[:, data_housing_train.columns != 'MedHouseVal']
y_housing_train = data_housing_train.MedHouseVal

## 🌲 Evaluating Model Complexity: Shallow vs. Deep Random Forests

Below, we train two Random Forest models with different levels of complexity. The first is a **shallow model** with `max_depth=10`, meaning that no tree in the ensemble can grow deeper than 10 levels. The second is a **deep model** with `max_depth=50`, which allows much deeper trees and therefore greater model complexity. If we were to optimize purely on performance metrics, we might prefer the deeper model, as it achieves a slightly higher training R² (0.96) than the shallow model (0.94). However, raw performance doesn't tell the full story: both scores are computed on the training data itself, so they cannot show whether the small gain comes from real structure or from fitted noise. To evaluate interpretability and the stability of the patterns captured by each model, we now apply **Forest-Guided Clustering (FGC)** to both trained models. This allows us to assess whether the discovered clusters reflect meaningful, generalizable structure or whether the model has simply overfit the training data.

In [3]:
rf_housing_shallow = RandomForestRegressor(max_depth=10, max_features='log2', max_samples=0.8, bootstrap=True, random_state=42)
rf_housing_shallow.fit(X_housing_train, y_housing_train)

print(f'Train R^2 score Shallow Model: {rf_housing_shallow.score(X_housing_train, y_housing_train)}')

rf_housing_deep = RandomForestRegressor(max_depth=50, max_features='log2', max_samples=0.8, bootstrap=True, random_state=42)
rf_housing_deep.fit(X_housing_train, y_housing_train)

print(f'Train R^2 score Deep Model: {rf_housing_deep.score(X_housing_train, y_housing_train)}')

Train R^2 score Shallow Model: 0.9391860496404276
Train R^2 score Deep Model: 0.9624061659550253
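
FGC clusters the samples on a distance derived from the forest itself, which is what `DistanceRandomForestProximity` supplies in the calls that follow. As a rough illustration, the sketch below uses the common definition of Random Forest proximity, the fraction of trees in which two samples end up in the same leaf, and turns it into a distance. The function name `proximity_distance` and the toy leaf assignments are made up for this example, and the package's own implementation may differ in detail.

```python
from itertools import combinations


def proximity_distance(leaf_ids):
    """Distance = 1 - fraction of trees in which two samples share a leaf.

    leaf_ids[i][t] is the leaf that sample i reaches in tree t. With
    scikit-learn, such a matrix would come from `estimator.apply(X)`.
    """
    n_samples = len(leaf_ids)
    n_trees = len(leaf_ids[0])
    dist = [[0.0] * n_samples for _ in range(n_samples)]
    for i, j in combinations(range(n_samples), 2):
        # Leaf ids are only comparable within the same tree, hence the zip.
        shared = sum(a == b for a, b in zip(leaf_ids[i], leaf_ids[j]))
        dist[i][j] = dist[j][i] = 1.0 - shared / n_trees
    return dist


# Hypothetical leaf assignments for 4 samples and 3 trees (not real model output).
shallow_leaves = [[1, 1, 2], [1, 1, 2], [1, 3, 2], [4, 3, 5]]
deep_leaves = [[1, 7, 2], [3, 8, 2], [5, 9, 4], [6, 10, 11]]

shallow_dist = proximity_distance(shallow_leaves)
deep_dist = proximity_distance(deep_leaves)

print(shallow_dist[0][1])  # samples 0 and 1 share a leaf in every tree -> 0.0
print(deep_dist[0][3])     # samples 0 and 3 never share a leaf -> 1.0
```

In the toy matrices, samples that share leaves in most of the shallow trees end up close together, while in the deep variant almost every pair sits at the maximum distance of 1, which leaves the clustering step little structure to work with.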


In [4]:
fgc = forest_guided_clustering(
    estimator=rf_housing_shallow, 
    X=data_housing_train, 
    y='MedHouseVal', 
    clustering_distance_metric=DistanceRandomForestProximity(), 
    clustering_strategy=ClusteringKMedoids(method="fasterpam"),
)

Using a sample size of 66.66666666666666 % of the input data.
Using range k = (2, 6) to optimize k.


Optimizing k: 100%|██████████| 5/5 [00:12<00:00,  2.59s/it]


Optimal number of clusters k = 4

Clustering Evaluation Summary:
 k    Score  Stable  Mean_JI                                                  Cluster_JI
 2 0.988031    True    0.678                                        {1: 0.654, 2: 0.703}
 3      NaN   False    0.513                              {1: 0.621, 2: 0.555, 3: 0.364}
 4 0.754074    True    0.620                    {1: 0.711, 2: 0.491, 3: 0.612, 4: 0.665}
 5      NaN   False    0.600          {1: 0.498, 2: 0.525, 3: 0.651, 4: 0.684, 5: 0.643}
 6      NaN   False    0.537 {1: 0.611, 2: 0.379, 3: 0.61, 4: 0.577, 5: 0.678, 6: 0.367}





In [5]:
fgc = forest_guided_clustering(
    estimator=rf_housing_deep, 
    X=data_housing_train, 
    y='MedHouseVal', 
    clustering_distance_metric=DistanceRandomForestProximity(), 
    clustering_strategy=ClusteringKMedoids(method="fasterpam"),
)

Using a sample size of 66.66666666666666 % of the input data.
Using range k = (2, 6) to optimize k.


Optimizing k: 100%|██████████| 5/5 [00:11<00:00,  2.25s/it]


Clustering Evaluation Summary:
 k Score  Stable  Mean_JI                                                   Cluster_JI
 2  None   False    0.494                                         {1: 0.912, 2: 0.076}
 3  None   False    0.320                               {1: 0.117, 2: 0.709, 3: 0.135}
 4  None   False    0.227                     {1: 0.126, 2: 0.499, 3: 0.129, 4: 0.155}
 5  None   False    0.218           {1: 0.483, 2: 0.131, 3: 0.158, 4: 0.166, 5: 0.153}
 6  None   False    0.185 {1: 0.106, 2: 0.326, 3: 0.157, 4: 0.172, 5: 0.175, 6: 0.175}





## 🏁 Insights: Why Model Complexity Affects Interpretability

As shown in the results above, Forest-Guided Clustering (FGC) fails to identify any stable clustering when applied to the **deep Random Forest model** (`max_depth=50`): no value of *k* is marked as stable, and the mean JI drops from 0.494 at *k = 2* to 0.185 at *k = 6*. This indicates that the model does not learn **generalizable patterns**, and instead captures noise in the training data that does not translate into coherent, robust groupings.

But why does this happen, especially when the model performs well in terms of R²?

The key lies in understanding that **performance metrics alone do not guarantee interpretability**. When we optimize exclusively for predictive performance, we often end up with highly complex models. In this case, increasing the tree depth to 50 led to very finely partitioned decision trees, with many leaves containing just a handful of samples. Such deep splits are likely to reflect **noise** or overly specific characteristics of individual training samples rather than meaningful structure in the data.

This lack of structure is clearly reflected in the **Jaccard Index (JI)**, which measures the **stability of clusters across bootstrap samples**. A low JI indicates that the clusters are not reproducible, i.e., the data are segmented differently each time, pointing to **fragile or non-generalizable splits**. In contrast, the **shallow model** (`max_depth=10`), which performs nearly as well in terms of R², yields a **stable clustering** for *k = 2* and *k = 4*, with a mean JI above 0.6 in both cases (0.678 and 0.620). This shows that its clusters are both **coherent** and **reproducible**.
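
To make the Jaccard Index concrete, here is a minimal sketch of how per-cluster stability could be computed. It assumes that each original cluster is compared with its best-matching cluster in every resampled run, restricted to the samples drawn in that run. The helper names `jaccard_index` and `cluster_stability` and the toy clusterings are hypothetical and are not part of the `fgclustering` API.

```python
def jaccard_index(a, b):
    """Size of the intersection divided by the size of the union of two sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_stability(original, resampled_runs):
    """Average best-match Jaccard Index per original cluster.

    original:       dict mapping cluster label -> set of sample ids
    resampled_runs: list of dicts with the same structure, one per resample
    """
    scores = {}
    for label, members in original.items():
        best_per_run = []
        for run in resampled_runs:
            # Only samples that were drawn in this run can be compared.
            drawn = set().union(*run.values())
            restricted = members & drawn
            best_per_run.append(
                max(jaccard_index(restricted, cluster) for cluster in run.values())
            )
        scores[label] = sum(best_per_run) / len(best_per_run)
    return scores


# Toy example (made-up sample ids): sample 3 is not drawn in either run.
original = {1: {0, 1, 2, 3}, 2: {4, 5, 6, 7}}
resampled_runs = [
    {1: {0, 1, 2}, 2: {4, 5, 6, 7}},  # clusters reproduced exactly
    {1: {0, 1, 4}, 2: {2, 5, 6, 7}},  # two samples switch clusters
]

stability = cluster_stability(original, resampled_runs)
print(stability)
```

In this toy case cluster 1 scores 0.75 and cluster 2 scores 0.8. Both would count as stable under a 0.6 threshold, whereas clusters that are reshuffled in most runs, as with the deep forest above, fall far below it.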

The takeaway? If you plan to interpret a Random Forest model using FGC, it’s critical to **balance performance with interpretability**. Choosing a simpler model that avoids overfitting often leads to more trustworthy, insightful explanations and to stable, actionable clusters.