* STEP 1: Imports

In [16]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder

from sklearn.compose import ColumnTransformer

* STEP 2: Train-Test Split

In [17]:
import pandas as pd
import numpy as np

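# Load the data and preview the first rows
df = pd.read_csv("loan_dataset.csv")
df.head()

# Separate the features from the binary target column
X = df.drop("default", axis=1)
y = df["default"]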

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,
    random_state=42,
    stratify=y
)


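# Numeric columns are listed by hand; string columns are detected by dtype.
# ColumnTransformer drops any column that appears in neither list.
num_cols = ["age", "income", "loan_amount"]
cat_cols = X_train.select_dtypes(include="object").columns.tolist()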


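# Scaling does not change tree splits; it is harmless here but not required.
numeric_pipeline = Pipeline([
    ("scaler", StandardScaler())
])

# Combining drop="first" with handle_unknown="ignore" needs scikit-learn >= 1.0.
categorical_pipeline = Pipeline([
    ("onehot", OneHotEncoder(drop="first", handle_unknown="ignore"))
])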

preprocessor = ColumnTransformer([
    ("num", numeric_pipeline, num_cols),
    ("cat", categorical_pipeline, cat_cols)
])



* MODEL 1: SHALLOW TREE (CONTROLLED)

In [19]:
tree_shallow = Pipeline([
    ("preprocessing", preprocessor),
    ("model", DecisionTreeClassifier(
        max_depth=2,
        random_state=42
    ))
])

tree_shallow.fit(X_train, y_train)

train_auc_shallow = roc_auc_score(
    y_train, tree_shallow.predict_proba(X_train)[:, 1]
)

test_auc_shallow = roc_auc_score(
    y_test, tree_shallow.predict_proba(X_test)[:, 1]
)

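# cross_val_score clones the pipeline, so preprocessing is re-fit inside each fold.
# It runs on the full X, y, which includes the rows held out as X_test.
cv_shallow = cross_val_score(
    tree_shallow, X, y,
    cv=5,
    scoring="roc_auc"
)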


* MODEL 2: DEEP TREE (Variance Explosion)

In [21]:
# No max_depth is set, so the tree keeps splitting until every leaf is pure.
tree_deep = Pipeline([
    ("preprocessing", preprocessor),
    ("model", DecisionTreeClassifier(
        random_state=42
    ))
])

tree_deep.fit(X_train, y_train)

train_auc_deep = roc_auc_score(
    y_train, tree_deep.predict_proba(X_train)[:, 1]
)

test_auc_deep = roc_auc_score(
    y_test, tree_deep.predict_proba(X_test)[:, 1]
)

# Same 5-fold setup as the shallow tree, again on the full X, y.
cv_deep = cross_val_score(
    tree_deep, X, y,
    cv=5,
    scoring="roc_auc"
)


* Comparison Output

In [22]:
print("SHALLOW TREE")
print("Train AUC:", train_auc_shallow)
print("Test  AUC:", test_auc_shallow)
print("CV Mean :", cv_shallow.mean())
print("CV Std  :", cv_shallow.std())

print("\nDEEP TREE")
print("Train AUC:", train_auc_deep)
print("Test  AUC:", test_auc_deep)
print("CV Mean :", cv_deep.mean())
print("CV Std  :", cv_deep.std())


SHALLOW TREE
Train AUC: 0.5505348255912625
Test  AUC: 0.5536915961373678
CV Mean : 0.5435685172376259
CV Std  : 0.008130738480495961

DEEP TREE
Train AUC: 1.0
Test  AUC: 0.5435624304564322
CV Mean : 0.5373304659604397
CV Std  : 0.003150418902986174


---
 
The notes below cover **(1) where errors can occur, (2) why they occur, and (3) how to correctly interpret your output**.

---

# Decision Tree (Shallow vs Deep): Errors & Output Explanation

## 1️⃣ Where errors can occur and why

### ❌ Error 1: *"could not convert string to float"*

**Where it happens**

* When calling `fit(X_train, y_train)` on a bare `DecisionTreeClassifier`, without the preprocessing pipeline

**Why it happens**

* `DecisionTreeClassifier` (like most scikit-learn estimators) requires **numeric input**
* Your dataset contains **categorical string values** (e.g., `"Self-Employed"`)
* Trees do **not** accept raw strings

**Correct understanding**

* Decision Trees **do not need scaling**
* But they **do require categorical encoding**

**Correct fix**

* Always use a **Pipeline with a ColumnTransformer**
* Encode categorical variables (OneHotEncoder)
* Then pass the transformed data to the tree
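The error and its fix can be reproduced on a few rows. This is a minimal sketch on made-up data: the `toy_X` frame, its `income` and `employment` columns, and the variable names are illustrative assumptions, not columns confirmed to exist in `loan_dataset.csv`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one numeric column and one string-valued column.
toy_X = pd.DataFrame({
    "income": [30, 45, 60, 25, 80, 52],
    "employment": ["Salaried", "Self-Employed", "Salaried",
                   "Self-Employed", "Salaried", "Self-Employed"],
})
toy_y = [0, 1, 0, 1, 0, 1]

# Fitting the tree on raw strings raises a ValueError.
try:
    DecisionTreeClassifier(random_state=42).fit(toy_X, toy_y)
    raw_fit_error = None
except ValueError as exc:
    raw_fit_error = str(exc)
print("Raw fit error:", raw_fit_error)

# One-hot encode the string column and pass the numeric column through.
encoder = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["employment"])],
    remainder="passthrough",
)
toy_pipeline = Pipeline([
    ("preprocessing", encoder),
    ("model", DecisionTreeClassifier(random_state=42)),
])
toy_pipeline.fit(toy_X, toy_y)
toy_train_accuracy = toy_pipeline.score(toy_X, toy_y)
print("Train accuracy with encoding:", toy_train_accuracy)
```

Because the encoder lives inside the pipeline, the same encoding is applied automatically at prediction time.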

---

### ❌ Error 2: Misunderstanding "Trees don't need preprocessing"

**Why this is wrong**

* This statement is only partially true

**Correct version**

> *Decision Trees do not need feature scaling, but they still require categorical features to be encoded.*

This is a very common interview trap.
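The "no scaling needed" half of that statement can be checked empirically. The sketch below uses synthetic data from `make_classification` (an assumption standing in for the loan data) and compares a tree fitted on raw features with one fitted on standardized features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the loan dataset.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=5, n_informative=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# Tree on raw features.
raw_tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)

# Tree on standardized features (scaler fitted on the training split only).
scaler = StandardScaler().fit(X_tr)
scaled_tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(
    scaler.transform(X_tr), y_tr
)

# Fraction of test rows on which both trees predict the same class.
agreement = np.mean(
    raw_tree.predict(X_te) == scaled_tree.predict(scaler.transform(X_te))
)
print("Prediction agreement:", agreement)
```

Standardization is a monotonic rescaling of each feature, so the ordering of values, and therefore the available splits, is unchanged. Agreement should be at or very near 1.0.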

---

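### ❌ Error 3: Misinterpreting CV Std as "variance of the model"

**Why this happens**

* Many people assume:

  > low CV Std = low variance model

This is **not always true**, especially on weak-signal datasets: CV Std measures how much the fold scores vary, not how much the fitted tree changes from one training sample to another.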

---

## 2️⃣ Your actual output (cleaned)

### 🌱 Shallow Tree

* **Train AUC:** 0.5505
* **Test AUC:** 0.5537
* **CV Mean:** 0.5436
* **CV Std:** 0.0081

### 🌳 Deep Tree

* **Train AUC:** **1.0**
* **Test AUC:** 0.5436
* **CV Mean:** 0.5373
* **CV Std:** **0.0031**
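A quick way to read these numbers is the train-test gap. The snippet below only does arithmetic on the rounded values listed above; the dictionary layout and names are illustrative.

```python
# Rounded metrics copied from the output above.
reported = {
    "shallow": {"train": 0.5505, "test": 0.5537},
    "deep": {"train": 1.0, "test": 0.5436},
}

# Train-test gap: close to zero means no overfitting, large means memorization.
gaps = {
    name: round(scores["train"] - scores["test"], 4)
    for name, scores in reported.items()
}
for name, gap in gaps.items():
    print(f"{name:8s} train-test AUC gap: {gap:+.4f}")
```

The shallow tree's gap is about -0.003 (test is marginally higher than train), while the deep tree's gap is about 0.456.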

---

## 3️⃣ How to interpret this output correctly

### 🔹 Shallow Tree interpretation

**What you see**

* Train AUC ≈ Test AUC
* CV Std is small

**What it means**

* The model is **not overfitting**
* It is learning only simple patterns (at most four leaves at `max_depth=2`)
* Performance is limited (AUC only slightly above the 0.5 chance level), but **stable**

**Bias-Variance view**

* **High bias**
* **Low variance**

**Engineering conclusion**

> On this weak-signal dataset the shallow tree generalizes slightly better (Test AUC 0.5537 vs 0.5436) and does not overfit.
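To see how simple the patterns learned by a depth-2 tree are, its rules can be printed with `export_text`. This sketch fits on synthetic data with placeholder feature names (`f0` to `f3`), since the fitted loan-data tree is not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with hypothetical feature names.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
feature_names = ["f0", "f1", "f2", "f3"]

small_tree = DecisionTreeClassifier(max_depth=2, random_state=42)
small_tree.fit(X_demo, y_demo)

# A depth-2 tree has at most 3 split nodes and 4 leaves.
rules = export_text(small_tree, feature_names=feature_names)
n_leaves = small_tree.get_n_leaves()
print(rules)
print("Number of leaves:", n_leaves)
```

For the notebook's pipeline, the fitted tree would be available as `tree_shallow.named_steps["model"]`, with feature names taken from the fitted preprocessor.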

---

### 🔹 Deep Tree interpretation

**What you see**

* Train AUC = **1.0**
* Test AUC is only 0.5436, a train-test gap of about 0.46
* CV Mean (0.5373) is lower than the shallow tree's (0.5436)
* CV Std is very small (0.0031)

**What this actually means**

* The model **perfectly memorized the training data**
* This is **textbook overfitting**
* The model fails to generalize
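The same pattern can be reproduced by sweeping `max_depth` and watching the train-test gap grow. This is a sketch on synthetic noisy data (`flip_y=0.4` randomizes a share of the labels to mimic a weak signal), not on the loan dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic weak-signal data: 2 informative features, heavy label noise.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, n_informative=2, n_redundant=0,
    flip_y=0.4, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

gap_by_depth = {}
for depth in [2, 5, 10, None]:  # None lets the tree grow fully
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_tr, y_tr)
    train_auc = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
    test_auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    gap_by_depth[depth] = train_auc - test_auc
    print(f"max_depth={depth}: train={train_auc:.3f} "
          f"test={test_auc:.3f} gap={gap_by_depth[depth]:.3f}")
```

A gap that widens as depth increases, while test AUC stays flat or falls, is the signature of memorization.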

---

## 4️⃣ Critical confusion point (very important)

### ❓ Why is CV Std low for a deep tree if it is high variance?

Because:

> **Low CV Std does NOT always mean low variance.**

In your case:

* The dataset has **weak predictive signal**
* The deep tree performs **consistently poorly** on unseen data
* Poor performance is **stable across folds**

So:

* **Model structure is unstable**
* **But performance is consistently bad**
* Hence: **low CV Std**

---

## 5️⃣ Structural instability vs performance stability

These are **different concepts**:

| Concept        | Meaning                                |
| -------------- | -------------------------------------- |
| Model variance | Sensitivity of model structure to data |
| CV Std         | Variability of performance scores      |

Your deep tree:

* Has **unstable structure**
* But **consistently poor generalization**
* Therefore CV Std appears low
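Structural instability can be measured directly, separately from CV Std. One possible approach, sketched here on synthetic data, is to fit the same tree on several bootstrap resamples and count how often the resulting models disagree on a fixed hold-out set. The helper `mean_disagreement` is a hypothetical name introduced for this illustration.

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

# Synthetic weak-signal data standing in for the loan dataset.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=6, n_informative=2, n_redundant=0,
    flip_y=0.4, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

def mean_disagreement(max_depth, n_models=10):
    """Average pairwise disagreement of trees fitted on bootstrap resamples."""
    predictions = []
    for seed in range(n_models):
        X_boot, y_boot = resample(X_tr, y_tr, random_state=seed)
        model = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
        predictions.append(model.fit(X_boot, y_boot).predict(X_te))
    return float(np.mean(
        [np.mean(a != b) for a, b in combinations(predictions, 2)]
    ))

shallow_disagreement = mean_disagreement(max_depth=2)
deep_disagreement = mean_disagreement(max_depth=None)
print("Shallow trees disagree on", shallow_disagreement, "of hold-out rows")
print("Deep trees disagree on   ", deep_disagreement, "of hold-out rows")
```

A high disagreement rate for the deep trees, alongside a low CV Std, would show that the two quantities in the table above measure different things.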

---

## 6️⃣ Correct final conclusions (interview-ready)

### ✅ Correct conclusions

* The deep tree **severely overfits** (Train AUC = 1.0, Test AUC ≈ 0.54)
* Low CV Std here reflects **consistent scores**, not a stable model structure
* The dataset is **bias-dominated / weak-signal**: both trees score only slightly above 0.5 AUC
* The depth-2 tree removes the overfitting but still reaches only about 0.55 AUC, so regularization or depth control alone cannot fix this
* Shallow trees are **engineering-wise safer**
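Whether depth control helps on a given dataset can be checked with a small grid search. The sketch below runs on synthetic data and uses an arbitrary illustrative grid; the values are assumptions, not tuned recommendations for the loan data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic weak-signal data standing in for the loan dataset.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, n_informative=2, n_redundant=0,
    flip_y=0.4, random_state=0,
)

# Illustrative grid over two common complexity controls.
param_grid = {
    "max_depth": [2, 4, 8, None],
    "min_samples_leaf": [1, 20, 100],
}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="roc_auc",
)
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_cv_auc = search.best_score_
print("Best params:", best_params)
print("Best CV AUC:", best_cv_auc)
```

If the best cross-validated AUC stays near 0.55 for every setting, the limit is the signal in the features rather than the tree's complexity, and better features are the next thing to try.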

---

## 7️⃣ One-line statements you can say in interviews

> **"A deep Decision Tree can show low CV standard deviation not because it is stable, but because it generalizes poorly in a consistent way on a weak-signal dataset."**

> **"Low CV Std does not always imply low variance; it can also indicate consistently poor generalization."**

---

