<img src="../Images/DSC_Logo.png" style="width: 400px;">

In [None]:
!pip install seaborn pandas matplotlib scikit-learn numpy tensorflow

# Black Box Models

Black box models cannot be understood or interpreted by themselves. This notebook briefly introduces some of the most famous black box models (including in the geosciences) with the goal of offering just enough background to understand both their predictive power and the challenges they pose for interpretation.

## 1. Ensemble Methods

Ensemble methods combine multiple simple building blocks to create a stronger, more accurate, or more robust model. The simple building blocks are also called "weak learners" because they aren't very powerful on their own. But when combined into a "strong learner", they can be very powerful. Decision trees (see Notebook 2; Sect. 5) are the most common building blocks in ensemble methods such as Bagging, Random Forests, and Boosting.

Ensemble tree methods address the main weakness of decision trees: their instability. If you train a single tree on slightly different data, it can give very different results. This is called high variance (see Notebook 1; Sect. 3.4).

### **Interpretation:**

The downside of combining many decision trees in an ensemble is reduced interpretability, as it becomes difficult to visualize the overall model or clearly identify which variables are most influential. While the ensemble as a whole is hard to interpret, we can still assess **feature importance** post-hoc by measuring how much each variable contributes to reducing error across all trees. For regression tasks, this is typically done using the reduction in residual sum of squares (RSS), and for classification, by evaluating the decrease in Gini index. Features that lead to larger reductions are considered more important.

In many tree-based ensemble implementations (like in `scikit-learn` or `XGBoost`), feature importance values are automatically computed and stored as part of the model during training. So in practice, you don’t need to calculate them separately after training. However, we gain interpretability “post-hoc” because these importance values are derived after the model is trained. Feature importance in ensemble methods is therefore an example of a model-specific post-hoc XAI method (Fig. 1).

<div style="text-align: center;">
  <img src="../Images/XAI_Model_Specific.png" style="width: 400px;">
  <div style="font-size: 14px; margin-top: 8px;">Fig. 1 Overview of interpretability methods in machine learning, modified from Molnar (2025)</div>
</div>
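
To make the "decrease in Gini index" mentioned above more concrete, here is a minimal sketch for a single split. The helper names `gini` and `gini_decrease` and the toy labels are hypothetical and only serve as illustration; `scikit-learn` additionally weights each decrease by the share of samples reaching the node and normalizes the result across features.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a node: 1 - sum_k p_k^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children

# Hypothetical node with 4 samples of class "A" and 4 of class "B"
parent = np.array(["A"] * 4 + ["B"] * 4)

# A split that separates the two classes perfectly
perfect = gini_decrease(parent, parent[:4], parent[4:])

# A split that leaves both children as mixed as the parent
mixed_child = np.array(["A", "A", "B", "B"])
useless = gini_decrease(parent, mixed_child, mixed_child)

print(f"Decrease for the perfect split:       {perfect:.2f}")
print(f"Decrease for the uninformative split: {useless:.2f}")
```

A feature that is repeatedly chosen for splits with a large decrease accumulates a high importance, while a feature that only produces uninformative splits stays near zero.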

## 1.1 Bagging

Bagging uses the **bootstrap** method. It creates many new datasets by randomly sampling from the original data with replacement, so the same sample can be picked more than once. Each bootstrapped dataset has the same size as the original.
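
As a rough illustration of bootstrap sampling, the following sketch draws one bootstrapped dataset from a hypothetical "dataset" of 1000 sample indices. Because sampling is done with replacement, only about 63% (1 - 1/e) of the original samples are expected to appear in any one bootstrap sample. The variable names and the sample size are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical original dataset: 1000 samples, represented by their indices
n = 1000
original = np.arange(n)

# Draw n indices with replacement: the same sample may be picked several times
boot_idx = rng.integers(0, n, size=n)
bootstrap_sample = original[boot_idx]

# Fraction of distinct original samples that made it into the bootstrap sample
unique_fraction = np.unique(bootstrap_sample).size / n

print(f"Size of bootstrap sample: {bootstrap_sample.size}")
print(f"Fraction of distinct original samples: {unique_fraction:.2f}")
```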

With bagging, a separate decision tree is trained on each of these datasets and the trees are not pruned (each tree has high variance, but low bias). Then, their predictions are averaged (for numbers) or a majority vote is taken (for categories). This reduces the variance. In this way, hundreds or thousands of trees are combined into a single procedure. The result is a model that is more stable and accurate than a single tree. Bagging is the basis for methods like **Random Forests**.
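
The procedure described above can be sketched by hand in a few lines. This is only an illustration on a synthetic dataset from `make_classification`; the number of trees and all variable names are assumptions. In practice one would typically use a ready-made implementation such as `sklearn.ensemble.BaggingClassifier`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (labels are 0 and 1)
X, y = make_classification(n_samples=400, n_features=6, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_trees = 51  # odd number, so a majority vote between two classes cannot tie

all_votes = []
for _ in range(n_trees):
    # Bootstrap sample of the training data (same size, with replacement)
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0)  # fully grown, not pruned
    tree.fit(X_train[idx], y_train[idx])
    all_votes.append(tree.predict(X_test))

votes = np.array(all_votes)  # shape: (n_trees, n_test_samples)

# Majority vote for 0/1 labels: class 1 wins if more than half of the trees vote for it
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)

single_acc = (votes[0] == y_test).mean()
bagged_acc = (bagged_pred == y_test).mean()
print(f"Accuracy of the first single tree: {single_acc:.2f}")
print(f"Accuracy of the bagged ensemble:   {bagged_acc:.2f}")
```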

## 1.2 Random Forests

Random Forests improve on Bagging by adding a simple but important twist: when a tree decides how to split, it only chooses from a **random subset of features**, rather than all features. This prevents all trees from using the same strong predictors and becoming too similar. By decorrelating the trees, Random Forests reduce overall model variance more effectively than Bagging, leading to better prediction accuracy. In practice, we typically choose the number of features at each split as about the square root of the total number of features. Like Bagging, Random Forests remain stable as more trees are added and do not overfit with large numbers of trees.

Fig. 2 illustrates how using a bootstrapped sample and considering only a subset of the variables at each step results in a wide variety of trees (imagine we would have not only 3 trees but hundreds of trees).

<div style="text-align: center;">
  <img src="../Images/RF.png" style="width: 400px;">
  <div style="font-size: 14px; margin-top: 8px;">Fig. 2 A Random Forest combines predictions from multiple decision trees (DTs), trained on different data subsets, using averaging or majority voting to produce a more robust final result.</div>
</div>

---

Let's again fit a tree classifier using the Palmer Penguins dataset (compare Notebook 2; Sects. 2 and 5). However, now we use Random Forest.

In [None]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 12})

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OrdinalEncoder

Load data:

In [None]:
penguins = sns.load_dataset("penguins")
penguins = penguins.dropna()
penguins.head()

Data preparation as before:

In [None]:
# Separate features and target:
X = penguins.drop(columns=['species'])
y = penguins['species']
feature_names = X.columns.tolist()

# Encode categorical features (e.g. OneHotEncoder, OrdinalEncoder): 
categorical_cols = X.select_dtypes(include='object').columns
encoder = OrdinalEncoder()
X[categorical_cols] = encoder.fit_transform(X[categorical_cols])

# Train-test split:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

Train model:

In [None]:
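# Consider only about sqrt(n_features) candidate features at each split.
# Setting max_features to the total number of features would reduce the model to Bagging.
rfc = RandomForestClassifier(max_features='sqrt', random_state=0)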
rfc.fit(X_train , y_train)

Predict:

In [None]:
y_pred = rfc.predict(X_test)

Check how many trees were built:

In [None]:
print(f"Number of trees in the forest: {len(rfc.estimators_)}") # n_estimators=100 is the default in scikit-learn

Evaluate with classification metrics:

In [None]:
accuracy = accuracy_score(y_test, y_pred)
print(f"Test set accuracy: {accuracy:.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

The accuracy is nearly perfect, but this was already achieved with the single decision tree in Notebook 2, as this dataset isn't very challenging. Still, let’s explore how we can inspect feature importance in a Random Forest, since this is one of the options we have to interpret parts of the model. We use the built-in feature importance, the Gini importance:

In [None]:
importances = rfc.feature_importances_
feature_imp_df = pd.DataFrame({'Feature': feature_names, 'Gini Importance': importances}).sort_values('Gini Importance', ascending=False) 
print(feature_imp_df)

# Visualize:
sorted_features = feature_imp_df['Feature']
sorted_importances = feature_imp_df['Gini Importance']
plt.figure(figsize=(5, 2))
plt.barh(sorted_features, sorted_importances, color='skyblue')
plt.xlabel('Gini Importance')
plt.gca().invert_yaxis()  # Highest importance at the top
plt.show()

## 1.3 Boosting

Boosting is an ensemble learning technique that turns a weak learner into a strong one by training models **sequentially**. Unlike Bagging, which builds trees independently using bootstrap samples, Boosting does not rely on random sampling with replacement. Instead, it fits new models to the residual errors of previous ones. Each subsequent model is trained to correct the mistakes of the prior model, gradually improving the overall prediction.

In Boosting, instances that are predicted incorrectly are given more importance in the next iteration, often by adjusting their weights or by minimizing a loss function (as in **Gradient Boosting**). A key element of boosting is the learning rate, also known as the shrinkage parameter, which controls how much each new model contributes to the final prediction. This slow, deliberate adjustment process helps reduce overfitting and results in a model that performs well on complex tasks.
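
The sequential idea can be sketched for a regression task with squared error, where fitting each new tree to the residuals of the current ensemble corresponds to gradient boosting. The synthetic sine data, the use of depth-1 trees ("stumps"), the learning rate of 0.1, and the 100 rounds are all assumptions chosen for illustration, not recommended settings.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic 1D regression problem: noisy sine curve
X = np.sort(rng.uniform(0, 6, size=200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=200)

learning_rate = 0.1  # shrinkage: how much each new tree contributes
n_rounds = 100

# Start from a constant prediction (the mean of the targets)
prediction = np.full_like(y, y.mean())
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(n_rounds):
    residuals = y - prediction                    # mistakes of the current ensemble
    stump = DecisionTreeRegressor(max_depth=1, random_state=0)
    stump.fit(X, residuals)                       # weak learner fitted to the residuals
    prediction += learning_rate * stump.predict(X)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"Training MSE before boosting: {mse_history[0]:.3f}")
print(f"Training MSE after {n_rounds} rounds: {mse_history[-1]:.3f}")
```

Each stump alone is a very weak model, but the shrunken sum of many stumps follows the curve closely. Note that the sketch only tracks the training error; in practice, the number of rounds and the learning rate are tuned on held-out data.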

---
---

## 2. k-Nearest Neighbors

In contrast to linear regression, k-Nearest Neighbors (KNN) is a non-parametric method for regression and classification tasks that makes no assumptions about the form of the relationship between predictors and the outcome. It does, however, assume that similar data points are located near each other and can be grouped together. This allows it to adapt to complex patterns by averaging nearby data points.

Given a value for K and a prediction point x₀, **KNN regression** first identifies the **K training points (neighbors)** that are closest to x₀, represented by N₀. It then estimates f(x₀) using the **average** of all the training responses in N₀ (James et al. 2023):

$$
\hat{y}(x_0) = \frac{1}{K} \sum_{x_i \in N_0} y_i
$$
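
The formula above can be written out directly. The following is a minimal sketch with a hypothetical helper `knn_regress` and a tiny made-up training set; it uses the Euclidean distance to find the neighbors.

```python
import numpy as np

def knn_regress(X_train, y_train, x0, k):
    """Average the responses of the k training points closest to x0."""
    distances = np.linalg.norm(X_train - x0, axis=1)  # Euclidean distance to x0
    neighbor_idx = np.argsort(distances)[:k]          # indices of the neighborhood N_0
    return y_train[neighbor_idx].mean()

# Hypothetical training data with a single feature
X_train = np.array([[1.0], [2.0], [3.0], [6.0], [7.0]])
y_train = np.array([1.0, 2.0, 3.0, 10.0, 12.0])

x0 = np.array([2.5])
pred_k2 = knn_regress(X_train, y_train, x0, k=2)
pred_k3 = knn_regress(X_train, y_train, x0, k=3)

print(pred_k2)  # mean of the responses at x=2 and x=3 -> 2.5
print(pred_k3)  # mean of the responses at x=1, x=2 and x=3 -> 2.0
```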

In **KNN classification**, the K nearest neighbors are used to count how many of them belong to each class. The prediction point x₀ is then assigned to the class with the highest frequency among those neighbors.

In summary, KNN requires two main components: the number of neighbors K and a distance metric to measure similarity between data points. A common choice for the metric is the Euclidean distance. K controls how many neighbors the model uses to make a prediction or classification. A small K (like 1 or 2) allows the model to be very flexible and responsive to the training data (low bias), but it can also make predictions unstable and sensitive to noise (high variance). A larger K averages over more neighbors, leading to smoother and more stable predictions (lower variance), but it may miss important local patterns (higher bias).
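
The effect of K can be illustrated with a small hand-written classifier. The helper `knn_classify` and the toy data are hypothetical: a cluster of class "A" contains one stray "B" point (noise), and the query point lies right next to it.

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x0, k):
    """Majority vote among the k training points closest to x0 (Euclidean distance)."""
    distances = np.sqrt(((X_train - x0) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    return str(Counter(y_train[nearest]).most_common(1)[0][0])

X_train = np.array([
    [0.0, 0.0], [0.2, 0.1], [-0.1, 0.2], [0.1, -0.2],  # cluster of class "A"
    [0.05, 0.05],                                      # a single noisy "B" inside that cluster
    [2.0, 2.0], [2.1, 1.9], [1.9, 2.2],                # cluster of class "B"
])
y_train = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])

x0 = np.array([0.08, 0.06])  # query point close to the noisy "B"

pred_k1 = knn_classify(X_train, y_train, x0, k=1)
pred_k5 = knn_classify(X_train, y_train, x0, k=5)

print("K = 1:", pred_k1)  # follows the single noisy neighbor -> B
print("K = 5:", pred_k5)  # the surrounding cluster outvotes the noise -> A
```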

---
To demonstrate KNN, we use the following example from GeoSMART (2025).

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 12})

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
from matplotlib.colors import ListedColormap

We first create the synthetic dataset representing three rock types: Granite, Basalt, and Sandstone. Each type will have characteristic values for density and magnetic susceptibility:

In [None]:
# Set random seed for reproducibility
np.random.seed(42)

# Number of samples per class
n_samples = 100

# Generate features for Granite
granite_density = np.random.normal(2.9, 0.2, n_samples)
granite_susceptibility = np.random.normal(0.0001, 0.0001, n_samples)
granite_label = ['Granite'] * n_samples

# Generate features for Basalt
basalt_density = np.random.normal(3.2, 0.2, n_samples)
basalt_susceptibility = np.random.normal(0.001, 0.0005, n_samples)
basalt_label = ['Basalt'] * n_samples

# Generate features for Sandstone
sandstone_density = np.random.normal(2.4, 0.2, n_samples)
sandstone_susceptibility = np.random.normal(0.00005, 0.00005, n_samples)
sandstone_label = ['Sandstone'] * n_samples

# Combine data
density = np.concatenate([granite_density, basalt_density, sandstone_density])
susceptibility = np.concatenate([granite_susceptibility, basalt_susceptibility, sandstone_susceptibility])
labels = np.concatenate([granite_label, basalt_label, sandstone_label])

# Create DataFrame
data = pd.DataFrame({
    'Density': density,
    'Magnetic Susceptibility': susceptibility,
    'Lithology': labels
})
data.head()

# Plot
plt.figure(figsize=(5, 3)) 
sns.scatterplot(
    x='Density',
    y='Magnetic Susceptibility',
    hue='Lithology',
    data=data
)
plt.title('Rock Types Based on Density and Magnetic Susceptibility')
plt.show()

Assign X and y:

In [None]:
X = data[['Density', 'Magnetic Susceptibility']]
y = data['Lithology']

Magnetic susceptibility and density are on very different scales. In KNN, the feature with the larger numeric range (density) would dominate the distance calculation. We therefore scale features first before classifying: 

In [None]:
X_unscaled = data[['Density', 'Magnetic Susceptibility']] # save unscaled data before scaling

scaler = StandardScaler()
X = scaler.fit_transform(X)

Split the data between training and testing:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

Train a KNN classifier on the training data:

In [None]:
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)

Evaluate performance using common metrics for classification tasks:

In [None]:
y_pred = classifier.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

### **Interpretation:**

By printing the test sample, its nearest neighbors, and their labels we can try to explain the model outputs. Let's check the first test point:

In [None]:
sample = X_test[0]
distances, indices = classifier.kneighbors([sample])

print("Test point:")
print(sample)

print("\nNearest neighbors (from training set):")
print(X_train[indices[0]])
print("\nTheir true labels:")
print(y_train.iloc[indices[0]].values)

The test point was classified as Sandstone because all 5 of its nearest neighbors are labeled Sandstone. They are extremely close in both density and magnetic susceptibility.

Let's explore this further by visualizing the decision boundaries across the feature space using a grid:

In [None]:
# Decision boundary plot
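# Use a fixed number of grid points per axis: the two features differ by
# orders of magnitude, so a single step size cannot resolve both axes.
n_grid = 300
x_min, x_max = X_unscaled['Density'].min() - 0.1, X_unscaled['Density'].max() + 0.1
y_min, y_max = X_unscaled['Magnetic Susceptibility'].min() - 0.0001, X_unscaled['Magnetic Susceptibility'].max() + 0.0001
xx, yy = np.meshgrid(np.linspace(x_min, x_max, n_grid),
                     np.linspace(y_min, y_max, n_grid))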

mesh_points = np.c_[xx.ravel(), yy.ravel()]
mesh_points_scaled = scaler.transform(mesh_points)

Z = classifier.predict(mesh_points_scaled)
Z = Z.reshape(xx.shape)

label_map = {label: i for i, label in enumerate(np.unique(y))}
Z_int = np.vectorize(label_map.get)(Z)
colors = ['#FFA07A', '#87CEFA', '#90EE90']
cmap_light = ListedColormap(colors)

plt.figure(figsize=(5, 3)) 
plt.contourf(xx, yy, Z_int, cmap=cmap_light, alpha=0.5)
sns.scatterplot(x='Density', y='Magnetic Susceptibility', hue='Lithology', data=data, edgecolor='k')
plt.title("KNN Classification Decision Boundary (K = 5)")
plt.xlabel("Density")
plt.ylabel("Magnetic Susceptibility")
plt.legend()
plt.show()

The colored regions illustrate where each class would be predicted in the feature space. However, KNN is not as globally interpretable as a linear model. It's difficult to explain why a prediction was made beyond saying "these K points were closest." With only two features, this is visually intuitive, and we see that KNN adapts to the local class density. However, in high-dimensional spaces, data points become more isolated, making it harder to identify truly close neighbors. This not only affects predictive performance but also reduces interpretability.

> ### **Exercise 1:**
>
>In some areas of the KNN decision boundary plot, it appears that a point surrounded by visually similar neighbors (e.g. a blue dot among other blue dots) is misclassified. What could explain this apparent misclassification when considering the data preparation steps?

---
---

## 3. Support Vector Machines

The core idea of Support Vector Machines (SVMs) is to find the optimal **hyperplane** that separates data into classes by maximizing the **margin** - the distance between the hyperplane and the nearest training points from each class, known as **support vectors**. In simple cases where the data can be perfectly divided, this results in a maximal margin classifier, which draws the widest possible boundary between the classes. However, perfect separation is rare in real-world data due to overlaps or outliers. For example, in Fig. 3A, although we can separate the green and blue points by eye, the two groups come very close to each other. This makes the margin very narrow, meaning the boundary would be sensitive to small changes in the data. 

In these situations, SVMs allow some flexibility: a few points can be on the wrong side of the margin or even misclassified. This more adaptable version is called the **support vector classifier**. Instead of strictly separating all points, it aims to find a boundary that separates most of the data well while still maintaining a reasonably wide margin. Choosing a threshold like this is another example of the bias–variance tradeoff (see Notebook 1; Sect. 3.4): If we allowed no misclassifications, we would pick a threshold that is very sensitive to the training data, including outliers, and would not perform robustly on new data (low bias; high variance). When we allow misclassifications, the distance between the data points and the threshold is called a **soft margin**, and we determine the best soft margin (how many misclassifications and data points inside the margin to tolerate) using cross-validation. For example, Fig. 3B shows a soft margin with one misclassification and two data points that are correctly classified but lie within the margin. Fig. 3C shows how the support vector classifier is a line in a 2-dimensional space (with 2 features). With 3-dimensional data, the support vector classifier would form a plane.

<figure style="text-align: center;">
  <img src="../Images/SVM.png" alt="Support Vector Machines and margin-based classification" style="width: 500px;">
  <figcaption style="font-size: 14px; margin-top: 8px;">
    Fig. 3 Support Vector Machines and margin-based classification, modified from 
    <a href="https://www.youtube.com/watch?v=Gv9_4yMHFhI&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF" target="_blank">
      Josh Starmer (YouTube) 
    </a>.
      A. One-dimensional data with a narrow margin between two classes (blue and green). B. A soft margin classifier allows for one misclassification (green point on the blue side) and two correctly classified points within the margin zone. C. The support vector classifier in a two-dimensional feature space, with the decision boundary (solid line) and margins (dotted lines).
  </figcaption>
</figure>
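
The selection of a soft margin by cross-validation can be sketched with `scikit-learn`. In its `SVC` class, the parameter `C` controls how soft the margin is: smaller values tolerate more margin violations (wider margin), larger values penalize them more strongly. The synthetic overlapping blobs and the candidate values for `C` are assumptions made for this example.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Two synthetic classes with a large spread, so that they may overlap
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

# Candidate values for the softness of the margin
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}

# 5-fold cross-validation over a linear support vector classifier
search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=5)
search.fit(X, y)

best_C = search.best_params_["C"]
n_support = int(search.best_estimator_.n_support_.sum())

print(f"Best C found by cross-validation: {best_C}")
print(f"Mean cross-validated accuracy:    {search.best_score_:.2f}")
print(f"Number of support vectors:        {n_support}")
```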

SVMs can also handle more complex, non-linear boundaries using what’s known as the **kernel trick**. Rather than drawing a straight line or flat plane in the original feature space, this method implicitly maps the data into a higher-dimensional space in which a support vector classifier can separate the classes. Kernel functions compute the relationships between pairs of points as if they had been moved to that space, without having to compute the transformation directly.
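
A small numerical check can make the kernel trick less abstract. For two features, the polynomial kernel $(x \cdot z + 1)^2$ gives the same value as a dot product in a six-dimensional feature space, without ever constructing that space. The feature map `phi` below is the standard one for this particular kernel; the two points are arbitrary values chosen for illustration.

```python
import numpy as np

def phi(x):
    """Explicit map into the 6-dimensional space that belongs to (x.z + 1)^2."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, x2 ** 2, s * x1 * x2])

def poly_kernel(x, z):
    """Polynomial kernel of degree 2, evaluated in the original 2D space."""
    return (np.dot(x, z) + 1.0) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = np.dot(phi(x), phi(z))  # transform first, then take the dot product
implicit = poly_kernel(x, z)       # kernel trick: no transformation needed

print(explicit, implicit)  # both are 4.0 (up to floating-point rounding)
```

For kernels such as the radial basis function (RBF), the corresponding space is infinite-dimensional, so the explicit route is not available at all and only the kernel evaluation remains.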

### **Interpretation:**

In simple, low-dimensional settings, it’s often possible to visualize how an SVM separates data and which points (the support vectors) influence the decision boundary - much like how nearest neighbors affect predictions in KNN. However, in higher-dimensional spaces, especially when kernel functions are used, the transformations and resulting decision boundary become much harder to interpret. This is why SVMs are generally considered black box models. The learned decision function is typically complex and often impossible to express in human-readable form.

---
---

## 4. Neural Networks

Neural networks (NNs) form the foundation of many recent advances in AI, thanks to their ability to model complex, high-dimensional data. They are inspired by the way the human brain works: They consist of layers of connected "neurons" that process information. We see NNs in everyday life, for example, in voice assistants, image recognition, or large language models (LLMs). Fig. 4A shows a very simple example of a NN. It has an **input layer**, two **hidden layers** (each with five **nodes**), and an **output layer**. Compared to this, LLMs are like super-sized versions with billions of connections. Deep neural networks, commonly referred to as **deep learning** models, are widely used in science for a variety of tasks. Notable examples include convolutional neural networks (CNNs) for image recognition or spatially structured data such as satellite images, recurrent neural networks (RNNs) and long short-term memory networks (LSTMs) for time series and language tasks, and transformer-based models like BERT and GPT for natural language processing.

When building a NN, we need to make several design choices, such as:
- How many hidden layers to include, and how many nodes to place in each (essentially making an educated guess).
- Which loss function to optimize (e.g., mean squared error, or MSE).
- Which activation functions to use.
- …

There are many more adjustments and a variety of algorithms available. However, for now, let’s focus on what all NNs have in common to better understand their basic functionality and their limitations when it comes to interpretability. 

**All NNs share a basic structure of interconnected layers with weighted connections, activation functions, and an optimization process that adjusts weights based on a loss function during training.** Fig. 4B shows a tiny NN applied to the regression task from Notebook 2 (find the optimal drug dosage for patients by modeling the relationship between dosage and effectiveness). The NN has a single node in the input layer for the input feature "Dosage" and a single node in the output layer for the target feature "Efficiency". We see and know these inputs and outputs. What makes NNs black boxes are their hidden layers. Here, we have a single hidden layer with two nodes. This is what generally happens inside:

1. The input (dosage) is passed to the two hidden neurons.

2. Each hidden neuron is doing:
- A weighted sum of inputs (from dosage)
- A non-linear transformation (e.g. SoftPlus; ReLU)

3. The final output is:
- A weighted sum of the two activated outputs
- Plus a bias term
- Sometimes there is an extra activation (not in Fig. 4B)
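
The three steps above can be sketched in plain NumPy for a network with one input, two hidden nodes, and one output. All weights and biases below are hypothetical values picked by hand (not learned and not taken from Fig. 4B), and ReLU is used as the activation function.

```python
import numpy as np

def relu(z):
    """ReLU activation: negative values are set to zero."""
    return np.maximum(0.0, z)

# Hypothetical parameters of a 1-2-1 network
w_hidden = np.array([2.0, 4.0])   # weights from the input to the two hidden nodes
b_hidden = np.array([0.0, -2.0])  # biases of the two hidden nodes
w_out = np.array([1.0, -1.0])     # weights from the hidden nodes to the output
b_out = 0.0                       # bias of the output node

def forward(dosage):
    z = w_hidden * dosage + b_hidden  # step 2a: weighted sum in each hidden node
    a = relu(z)                       # step 2b: non-linear transformation
    return np.dot(w_out, a) + b_out   # step 3: weighted sum of activations plus bias

for dosage in [0.0, 0.5, 1.0]:
    print(f"Dosage {dosage:.1f} -> predicted efficiency {forward(dosage):.1f}")
# Dosage 0.0 -> predicted efficiency 0.0
# Dosage 0.5 -> predicted efficiency 1.0
# Dosage 1.0 -> predicted efficiency 0.0
```

Even with only two hidden nodes, combining two bent lines produces a response that is low for small and large dosages and high in between, which a single straight line could not represent.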

### **Interpretation:**

Weights and biases are learned during training: The NN makes a prediction, compares it to the correct answer using a loss function, and then adjusts the weights and biases step by step using an optimization algorithm (like **gradient descent**) to reduce the error. Even though this is a tiny network, it's already difficult to intuitively interpret because the hidden layer uses non-linearities: The activation functions apply non-linear transformations, the network combines these in complex ways, and bias terms shift the resulting curves. In essence, the hidden layer learns internal representations or features from the input that are:

- Non-observable (they’re not in the data).
- Intermediate (they’re not final predictions).
- Learned (they're shaped during training to help the network make better predictions).
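
To give a feeling for how weights and biases are adjusted step by step, here is a minimal sketch of gradient descent. To keep the gradients easy to write by hand, it trains a single linear "neuron" (one weight, one bias, no hidden layer) on made-up data with a mean squared error loss. Real NNs apply the same update rule to all their parameters and compute the gradients with backpropagation. The learning rate and number of steps are assumptions for this toy problem.

```python
import numpy as np

# Made-up training data that follows y = 2x + 1 exactly
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

w, b = 0.0, 0.0       # initial weight and bias
learning_rate = 0.05
losses = []

for step in range(2000):
    y_hat = w * x + b                  # prediction with the current parameters
    error = y_hat - y
    losses.append(np.mean(error ** 2)) # mean squared error loss

    # Gradients of the loss with respect to w and b
    grad_w = 2.0 * np.mean(error * x)
    grad_b = 2.0 * np.mean(error)

    # Move a small step against the gradient to reduce the loss
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"Learned weight: {w:.3f}, learned bias: {b:.3f}")
print(f"Loss at start: {losses[0]:.3f}, loss at end: {losses[-1]:.6f}")
```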

Although neural networks lack inherent interpretability, certain internal components, such as weights, activations, and gradients, can be analyzed to gain limited insights. Tools like saliency maps, SHAP, LIME, and feature visualization are used to interpret what parts of the model (like weights, activations, or gradients) are doing. These tools typically don’t explain the weights directly, but they reveal how input signals flow through the network (via gradients or activations) or how the network behaves (via output approximations). In short, they make parts of the "black box" visible, but not fully understandable in human terms.

<figure style="text-align: center;">
  <img src="../Images/NN.png" alt="Structure and inner workings of neural networks" style="width: 500px;">
  <figcaption style="font-size: 14px; margin-top: 8px;">
    Fig. 4 Structure and inner workings of neural networks, modified from 
    <a href="https://www.youtube.com/watch?v=83LYR-1IcjA&list=PLblh5JKOoLUIxGDQs4LFFD--41Vzf-ME1&index=10" target="_blank">
      Josh Starmer (YouTube) 
    </a>.
      A. A generic deep neural network with an input layer, multiple hidden layers, and an output layer.
      B. A simple neural network with one input ("Dosage"), one output ("Efficiency"), and a hidden layer with two nodes. Each hidden node computes a weighted sum of the input plus a bias, followed by a non-linear activation. The final prediction is a weighted combination of the hidden activations.
  </figcaption>
</figure>

> ### **Exercise 2:**
>
>You've already seen a simple NN that takes one input feature ("Dosage") and predicts one output ("Efficiency") using a single hidden layer with two neurons.
> Now imagine you are building a NN to classify iris flowers (dataset introduced in Notebook 2; Sect. 6) into one of three species: Setosa, Versicolor, or Virginica. You will use two input features: Petal Width and Sepal Width.
> Based on your understanding, sketch out how this NN could be structured. Think about: How many input and output nodes are needed? What happens in the hidden layer? Where would weights and biases be placed? How many would be involved?

---

The following is a simple implementation of a NN that trains on a sample dataset and makes predictions using `TensorFlow` and `Keras`, modified from [geeksforgeeks](https://www.geeksforgeeks.org/machine-learning/neural-networks-a-beginners-guide/). 

In [None]:
import numpy as np
import pandas as pd

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

Create a dataset. Convert the data into a format suitable for training (usually NumPy arrays). Define features (X) and labels (y).

In [None]:
data = {
    'feature1': [0.1, 0.2, 0.3, 0.4, 0.5],
    'feature2': [0.5, 0.4, 0.3, 0.2, 0.1],
    'label': [0, 0, 1, 1, 1]
}

df = pd.DataFrame(data)
X = df[['feature1', 'feature2']].values
y = df['label'].values

To build a NN, first create an instance of a `Sequential` model. Then, add layers to it using `Dense` layers, which define fully connected layers. For each layer, specify the number of neurons and an activation function. The first `Dense` layer also defines the input shape, while subsequent layers form the hidden and output layers.

In this example, the model expects two input features (input_dim=2), processes them through a hidden layer with 8 neurons using the ReLU activation, and produces a single output using the sigmoid activation which is suitable for binary classification tasks.

In comparison, in NNs for regression tasks, the output layer often uses a linear activation. It then simply returns the raw output of the neuron without applying any transformation. This linearity is important because it allows the model to predict any real-valued number, not just values in a fixed range. While the hidden layers can be highly nonlinear, the final output here needs to remain unconstrained to represent continuous outcomes.

In [None]:
model = Sequential()
model.add(Dense(8, input_dim=2, activation='relu'))  # Hidden layer
model.add(Dense(1, activation='sigmoid'))  # Output layer

The following step compiles the model, which means preparing it for training by specifying:
- Loss function: 'binary_crossentropy' is used for binary classification problems. It measures how far the predicted probabilities are from the actual class labels (0 or 1).
- Optimizer: 'adam' is an efficient algorithm that adjusts the model’s weights to reduce the loss during training.
- Metrics: 'accuracy' tracks how often the model predicts the correct class, helping you monitor performance during training and evaluation.
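
As a rough illustration of the loss named above, binary cross-entropy can be computed by hand. The helper `binary_crossentropy` is a hypothetical NumPy re-implementation for illustration only, and the small clipping constant `eps` is an assumption that avoids taking the logarithm of zero. The predicted probabilities are made up.

```python
import numpy as np

def binary_crossentropy(y_true, p_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    p = np.clip(p_pred, eps, 1.0 - eps)  # keep probabilities away from exactly 0 and 1
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([0, 0, 1, 1, 1])

# Made-up predictions: confident and correct vs. confident and wrong
confident_right = np.array([0.1, 0.1, 0.9, 0.9, 0.9])
confident_wrong = np.array([0.9, 0.9, 0.1, 0.1, 0.1])

loss_right = binary_crossentropy(y_true, confident_right)
loss_wrong = binary_crossentropy(y_true, confident_wrong)

print(f"Loss for confident, correct predictions: {loss_right:.3f}")
print(f"Loss for confident, wrong predictions:   {loss_wrong:.3f}")
```

The loss is small when the predicted probabilities are close to the true labels and grows quickly when the model is confidently wrong.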

In [None]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In Python, training a NN is conceptually similar to how other ML models (like Random Forests or linear regression) "fit" to data.
The following code trains the model on the input data (X) and target labels (y) with the following settings:
- `epochs=100`: The model will go through the entire training dataset 100 times.
- `batch_size=1`: The weights are updated after every single training example. Larger batches mean faster training, but may be less accurate or use more memory. It's like learning in small groups instead of one example at a time.
- `verbose=1`: Displays training progress.

In [None]:
model.fit(X, y, epochs=100, batch_size=1, verbose=1)

Make predictions. We provide the model with one input example for testing (new unseen data): a data point that has two feature values (0.2 and 0.4).

In [None]:
test_data = np.array([[0.2, 0.4]])
prediction = model.predict(test_data)

The model outputs a probability, which we convert into a binary class label (0 or 1).

In [None]:
predicted_label = (prediction > 0.5).astype(int)
print("Predicted probability:", prediction)
print("Predicted label:", predicted_label)

## References and Further Learning

Denolle, M., Mehra, A., Todoran, S., Cristea, N., Arendt, A., Henderson, S., Sun, Z., Ni, Y., and Kharita, A.: Machine Learning in the Geosciences, **GeoSMART**, University of Washington eScience Institute, available at: https://geo-smart.github.io/mlgeo-book/about_this_book/about_this_book.html, last access: 30 June **2025**.

James, G., Witten, D., Hastie, T., Tibshirani, R., and Taylor, J.: An Introduction to Statistical Learning: with Applications in Python, Springer, 2023.

Molnar, C.: Interpretable Machine Learning: A Guide for Making Black Box Models Explainable (3rd ed.). Retrieved from christophm.github.io/interpretable-ml-book/, 2025.

[StatQuest with Josh Starmer](https://www.youtube.com/channel/UCtYLUTtgS3k1Fg4y5tAhLbw) on YouTube.