# Lab: Trees and Model Stability

Trees are notorious for being **unstable**: small changes in the data can lead to noticeable or even large changes in the fitted tree. We're going to explore this phenomenon, along with a common rebuttal.

In the folder for this lab, there are three datasets that we used in class: Divorce, heart failure, and the AirBnB price dataset.

1. Pick one of the datasets and appropriately clean it.
2. Perform a train-test split for a specific seed (save the seed for reproducibility). Fit a classification/regression tree and a linear model on the training data and evaluate their performance on the test data. Set aside the predictions these models make.
3. Repeat step 2 for three to five different seeds (save the seeds for reproducibility). How different are the trees that you get? How different are your linear model coefficients? Set aside the predictions these models make.

Typically, the trees change by what appears to be a non-trivial amount, while the linear model coefficients vary much less. Often, the changes to the trees appear substantial.

But are they?

4. Instead of focusing on the tree or model coefficients, do three things:
    1. Make scatterplots of the predicted values on the test set from step 2 against the predicted values from the alternative models in step 3, separately for your trees and linear models. Do they appear reasonably similar?
    2. Compute the correlation between the predictions of your model from step 2 and the predictions of your alternative models from step 3, separately for your trees and linear models. Are they highly correlated or not?
    3. Run a simple linear regression of the predicted values on the test set from the alternative models on the predicted values from step 2, separately for your trees and linear models. Is the intercept close to zero? Is the slope close to 1? Is the $R^2$ close to 1?

5. Do linear models appear to have similar coefficients and predictions across train/test splits? Do trees?
6. True or false, and explain: "Even if the models end up having a substantially different appearance, the predictions they generate are often very similar."

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, roc_auc_score

**1.**

In [2]:
# Load heart failure data
df = pd.read_csv('./data/heart_failure/heart_failure_clinical_records_dataset.csv')
print(df.dtypes)
df.head()

age                         float64
anaemia                       int64
creatinine_phosphokinase      int64
diabetes                      int64
ejection_fraction             int64
high_blood_pressure           int64
platelets                   float64
serum_creatinine            float64
serum_sodium                  int64
sex                           int64
smoking                       int64
time                          int64
DEATH_EVENT                   int64
dtype: object


Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,75.0,0,582,0,20,1,265000.0,1.9,130,1,0,4,1
1,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6,1
2,65.0,0,146,0,20,0,162000.0,1.3,129,1,1,7,1
3,50.0,1,111,0,20,0,210000.0,1.9,137,1,0,7,1
4,65.0,1,160,1,20,0,327000.0,2.7,116,0,0,8,1


In [3]:
# Clean data
df = df.drop_duplicates()

if df.isna().sum().sum() > 0:
    df = df.dropna()

print("Dimensions:", df.shape)
df.head()

Dimensions: (299, 13)


Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,75.0,0,582,0,20,1,265000.0,1.9,130,1,0,4,1
1,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6,1
2,65.0,0,146,0,20,0,162000.0,1.3,129,1,1,7,1
3,50.0,1,111,0,20,0,210000.0,1.9,137,1,0,7,1
4,65.0,1,160,1,20,0,327000.0,2.7,116,0,0,8,1


**2.**

In [4]:
# Features
X = df.drop(columns=['DEATH_EVENT'])
y = df['DEATH_EVENT']

# Seed for reproducibility
INITIAL_SEED = 1

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y,
    test_size=0.3,
    random_state=INITIAL_SEED,
    stratify=y
)
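The `stratify=y` argument asks `train_test_split` to keep the class proportions roughly equal in the training and test sets. A minimal sketch of how one might check that is below; it uses synthetic data as a stand-in (the names `X_demo` and `y_demo` and the positive rate are assumptions for illustration, not the lab dataset).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption): 299 rows with roughly one-third positives
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(299, 3))
y_demo = (rng.random(299) < 0.32).astype(int)

# Same split settings as the cell above, applied to the synthetic data
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)

# With stratification, the three positive rates should be nearly identical
print("Overall positive rate:", round(y_demo.mean(), 3))
print("Train positive rate:  ", round(y_tr.mean(), 3))
print("Test positive rate:   ", round(y_te.mean(), 3))
```

In the notebook itself, the same check would amount to comparing `y_train.mean()` and `y_test.mean()` after the split.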

In [5]:
# Classification tree
clf = DecisionTreeClassifier(max_depth=4, random_state=INITIAL_SEED)
clf.fit(X_train, y_train)

# Predictions
y_pred_tree = clf.predict(X_test)
y_proba_tree = clf.predict_proba(X_test)[:, 1]

# Evaluate
clf_acc = accuracy_score(y_test, y_pred_tree)
clf_auc = roc_auc_score(y_test, y_proba_tree)

print("Decision Tree Accuracy:", round(clf_acc, 3))
print("Decision Tree ROC AUC:", round(clf_auc, 3))

# Tree structure
print("\nTree structure:")
print(export_text(clf, feature_names=list(X.columns)))

Decision Tree Accuracy: 0.756
Decision Tree ROC AUC: 0.688

Tree structure:
|--- time <= 67.50
|   |--- creatinine_phosphokinase <= 80.50
|   |   |--- serum_creatinine <= 2.80
|   |   |   |--- class: 0
|   |   |--- serum_creatinine >  2.80
|   |   |   |--- class: 1
|   |--- creatinine_phosphokinase >  80.50
|   |   |--- ejection_fraction <= 72.50
|   |   |   |--- creatinine_phosphokinase <= 2018.00
|   |   |   |   |--- class: 1
|   |   |   |--- creatinine_phosphokinase >  2018.00
|   |   |   |   |--- class: 1
|   |   |--- ejection_fraction >  72.50
|   |   |   |--- class: 0
|--- time >  67.50
|   |--- ejection_fraction <= 27.50
|   |   |--- ejection_fraction <= 22.50
|   |   |   |--- serum_sodium <= 137.00
|   |   |   |   |--- class: 1
|   |   |   |--- serum_sodium >  137.00
|   |   |   |   |--- class: 0
|   |   |--- ejection_fraction >  22.50
|   |   |   |--- creatinine_phosphokinase <= 920.50
|   |   |   |   |--- class: 0
|   |   |   |--- creatinine_phosphokinase >  920.50
|   |   | 

In [6]:
# Linear model (a linear probability model: OLS fit directly to the 0/1 outcome)
reg = LinearRegression()
reg.fit(X_train, y_train)

# Predictions: continuous scores, thresholded at 0.5 to get class labels
y_pred_lr = reg.predict(X_test)
y_pred_lr_class = (y_pred_lr >= 0.5).astype(int)

# Evaluate
reg_acc = accuracy_score(y_test, y_pred_lr_class)
reg_auc = roc_auc_score(y_test, y_pred_lr)

print("Linear Regression Accuracy:", round(reg_acc, 3))
print("Linear Regression ROC AUC:", round(reg_auc, 3))

Linear Regression Accuracy: 0.789
Linear Regression ROC AUC: 0.818


**3.**

In [12]:
seeds = [0, 10, 42, 67, 100, 2025]

results_accuracy = [] # store accuracy/AUC for each seed
tree_structures = {} # store tree text structure
reg_coefs = {} # store linear regression coefficients
predictions = {} # store predictions for each seed

for seed in seeds:
    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y,
        test_size=0.3,
        random_state=seed,
        stratify=y
    )

    # Decision tree
    clf = DecisionTreeClassifier(max_depth=4, random_state=seed)
    clf.fit(X_train, y_train)

    # Predictions
    y_pred_tree = clf.predict(X_test)
    y_proba_tree = clf.predict_proba(X_test)[:, 1]

    # Evaluate
    clf_acc = accuracy_score(y_test, y_pred_tree)
    clf_auc = roc_auc_score(y_test, y_proba_tree)

    # Save tree structure
    tree_structures[seed] = export_text(clf, feature_names=list(X.columns))

    # Linear regression
    reg = LinearRegression()
    reg.fit(X_train, y_train)

    # Predictions
    y_pred_lr = reg.predict(X_test)
    y_pred_lr_class = (y_pred_lr >= 0.5).astype(int)

    # Evaluate
    reg_acc = accuracy_score(y_test, y_pred_lr_class)
    reg_auc = roc_auc_score(y_test, y_pred_lr)

    # Save coefficients
    reg_coefs[seed] = {
        "intercept": reg.intercept_,
        "coefficients": pd.Series(reg.coef_, index=X.columns)
    }

    # Save predictions for seed
    preds_df = pd.DataFrame({
        "y_true": y_test.values,
        "clf_pred": y_pred_tree,
        "clf_proba": y_proba_tree,
        "reg_pred_cont": y_pred_lr,
        "reg_pred_class": y_pred_lr_class
    }, index=y_test.index)

    predictions[seed] = preds_df

    # Store metrics
    results_accuracy.append({
        "seed": seed,
        "clf_acc": clf_acc,
        "clf_auc": clf_auc,
        "reg_acc": reg_acc,
        "reg_auc": reg_auc
    })

# Convert to DataFrame (once, after the loop has finished)
results_df = pd.DataFrame(results_accuracy)

In [13]:
print("\n========== Metrics across different seeds ==========")
print(results_df)

# Show the tree structure for each seed
for seed in seeds:
    print("\n========== Example Tree (seed =", seed, ") ==========\n")
    print(tree_structures[seed])

# Compare linear regression coefficients across seeds
for seed in seeds:
    print("\n========== Linear Regression Coefficients (seed =", seed, ") ==========\n")
    print(reg_coefs[seed]["coefficients"])


   seed   clf_acc   clf_auc   reg_acc   reg_auc
0     0  0.855556  0.815715  0.800000  0.879028
1    10  0.833333  0.887790  0.833333  0.860373
2    42  0.833333  0.784341  0.844444  0.867722
3    67  0.833333  0.880441  0.800000  0.853024
4   100  0.811111  0.801865  0.844444  0.860373
5  2025  0.833333  0.893443  0.833333  0.910684


|--- time <= 67.50
|   |--- ejection_fraction <= 27.50
|   |   |--- class: 1
|   |--- ejection_fraction >  27.50
|   |   |--- time <= 52.00
|   |   |   |--- platelets <= 299500.00
|   |   |   |   |--- class: 1
|   |   |   |--- platelets >  299500.00
|   |   |   |   |--- class: 1
|   |   |--- time >  52.00
|   |   |   |--- class: 0
|--- time >  67.50
|   |--- ejection_fraction <= 32.50
|   |   |--- serum_creatinine <= 1.35
|   |   |   |--- time <= 78.50
|   |   |   |   |--- class: 1
|   |   |   |--- time >  78.50
|   |   |   |   |--- class: 0
|   |   |--- serum_creatinine >  1.35
|   |   |   |--- creatinine_phosphokinase <= 65.00
|   |   |   |   |--- cla
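
Reading six separate coefficient printouts makes it hard to judge how much the linear model moves between splits. One option is to put the coefficients side by side, one column per seed, and add a spread measure. The sketch below does this on synthetic data (the feature names `x0` to `x4` and the dataset are assumptions for illustration, not the lab data).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned lab data (assumption)
cols = [f"x{i}" for i in range(5)]
X_arr, y_syn = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_syn = pd.DataFrame(X_arr, columns=cols)

# Refit the linear model on a different training split for each seed
coef_by_seed = {}
for seed in [0, 10, 42]:
    X_tr, _, y_tr, _ = train_test_split(
        X_syn, y_syn, test_size=0.3, random_state=seed, stratify=y_syn
    )
    reg_demo = LinearRegression().fit(X_tr, y_tr)
    coef_by_seed[seed] = pd.Series(reg_demo.coef_, index=cols)

# Rows are features, columns are seeds; the last column summarizes the spread
coef_table = pd.DataFrame(coef_by_seed)
coef_table["std_across_seeds"] = coef_table.std(axis=1)
print(coef_table.round(4))
```

With the objects already built in this notebook, the equivalent table would presumably be `pd.DataFrame({s: reg_coefs[s]["coefficients"] for s in seeds})`.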

**4.**
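
The loop in step 3 draws a different test set for each seed, so the saved predictions are not aligned row by row. To compare predictions directly, one possible approach is to hold out a single evaluation set and vary only the training sample. The sketch below illustrates that idea on synthetic data; the helpers `compare_predictions` and `fit_and_predict`, the 80% subsampling, and the dataset are all assumptions for illustration, not the lab's prescribed method.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def compare_predictions(base, alt):
    """Correlation plus a simple regression of alt on base (hypothetical helper)."""
    corr = np.corrcoef(base, alt)[0, 1]
    slope, intercept = np.polyfit(base, alt, deg=1)  # highest degree first
    # For a one-predictor regression, R^2 equals the squared correlation
    return {"corr": corr, "intercept": intercept, "slope": slope, "r2": corr ** 2}


# Synthetic stand-in for the cleaned lab data (assumption)
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_redundant=0, random_state=0
)

# Hold out ONE evaluation set so every model is scored on the same rows
X_pool, X_eval, y_pool, y_eval = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0, stratify=y_syn
)


def fit_and_predict(seed):
    """Fit both models on a seed-specific 80% subsample of the training pool."""
    X_tr, _, y_tr, _ = train_test_split(
        X_pool, y_pool, train_size=0.8, random_state=seed, stratify=y_pool
    )
    tree = DecisionTreeClassifier(max_depth=4, random_state=seed).fit(X_tr, y_tr)
    lin = LinearRegression().fit(X_tr, y_tr)
    return tree.predict_proba(X_eval)[:, 1], lin.predict(X_eval)


# Baseline model (seed 1) versus alternative models (other seeds)
base_tree, base_lin = fit_and_predict(1)

comparison = {}
for seed in [0, 10, 42]:
    alt_tree, alt_lin = fit_and_predict(seed)
    comparison[seed] = {
        "tree": compare_predictions(base_tree, alt_tree),
        "linear": compare_predictions(base_lin, alt_lin),
    }

for seed, res in comparison.items():
    for model, stats in res.items():
        print(seed, model, {k: round(float(v), 3) for k, v in stats.items()})
```

For the scatterplots, the same pairs of arrays (for example `base_tree` against `alt_tree`) can be passed to `matplotlib.pyplot.scatter`. An intercept near 0, a slope near 1, and an $R^2$ near 1 would suggest the alternative model's predictions closely track the baseline's.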