# Decision Tree: Overfitting vs Good Fit vs Underfitting

**Dataset**: *Banknote Authentication* (binary: 0 = authentic, 1 = forged).

- Kaggle (reference): Bank Note Authentication UCI data. "https://www.kaggle.com/datasets/ritesaluja/bank-note-authentication-uci-data"
- UCI (downloaded directly in the code): *Banknote Authentication Data Set*.

**Learning objective**: train **three trees** of differing complexity and compare:
1) **Overfitting** (very deep tree)
2) **Good fit** (moderate depth)
3) **Underfitting** (very shallow tree)

At the end, **write down your observations** and state **which model you would choose and why**.

Supporting resources:
- https://4geeks.com/es/lesson/arboles-de-decision
- https://scikit-learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html
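
The second resource covers cost-complexity pruning, which controls tree size through `ccp_alpha` rather than `max_depth`. The following is a minimal sketch of that idea, not part of the exercise itself. It assumes synthetic two-feature data from `make_classification` as a stand-in for the banknote data, and the names `X_demo`, `X_tr`, `X_te`, and `pruning_results` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in data with two features and a binary target.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=2, n_informative=2, n_redundant=0,
    flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# Candidate alphas at which the fully grown tree would lose a subtree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
ccp_alphas = path.ccp_alphas

# Refit one tree per alpha and record its size and accuracy.
pruning_results = []
for alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    pruning_results.append(
        (alpha, clf.get_n_leaves(), clf.score(X_tr, y_tr), clf.score(X_te, y_te))
    )

for alpha, n_leaves, acc_train, acc_test in pruning_results[::5]:
    print(f"alpha={alpha:.4f} leaves={n_leaves} train={acc_train:.3f} test={acc_test:.3f}")
```

Larger values of `ccp_alpha` prune more aggressively, so the number of leaves shrinks as alpha grows.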

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix


## 1) Load the dataset and keep 2 features
We use **variance** and **skewness** so that the decision regions can be drawn in two dimensions.

In [None]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00267/data_banknote_authentication.txt'
cols = ['variance','skewness','curtosis','entropy','target']
df = pd.read_csv(url, header=None, names=cols)

X = df[['variance','skewness']].values
y = df['target'].values
df.head()
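
Before modelling, it can help to check the class balance and look for missing values. The sketch below shows one way to do that on a header-less CSV read with `header=None, names=cols`. To stay runnable without the download it reads four made-up rows from an in-memory string; those values are illustrative only and are not real banknote measurements. In the notebook, the same two checks could be applied to `df`.

```python
import io
import pandas as pd

# Made-up rows for illustration only (NOT real banknote measurements).
raw = io.StringIO(
    "1.5,2.0,-0.5,0.1,0\n"
    "-2.0,1.0,3.0,-1.2,1\n"
    "0.3,-4.0,2.2,0.7,1\n"
    "2.8,5.1,-1.9,-0.3,0\n"
)
cols = ['variance', 'skewness', 'curtosis', 'entropy', 'target']
demo_df = pd.read_csv(raw, header=None, names=cols)

# Number of rows per class, ordered by class label.
class_counts = demo_df['target'].value_counts().sort_index()
# Number of missing values in each column.
missing_per_column = demo_df.isna().sum()

print(class_counts)
print(missing_per_column)
```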

## 2) Train/test split

In [None]:
RANDOM_STATE = 42
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=RANDOM_STATE, stratify=y
)
X_train.shape, X_test.shape
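
The `stratify=y` argument keeps the class proportions close to equal in the train and test subsets. A small sketch of that effect, assuming a synthetic 60/40 label vector (`X_demo`, `y_demo`, and the `frac_*` names are illustrative, not notebook variables):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumption: random features and a fixed 60/40 class split as stand-in data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 2))
y_demo = np.array([0] * 600 + [1] * 400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# Fraction of class 1 in the full data and in each subset.
frac_full = y_demo.mean()
frac_train = y_tr.mean()
frac_test = y_te.mean()
print(frac_full, frac_train, frac_test)
```

The same comparison can be run on `y`, `y_train`, and `y_test` from the cell above.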

## 3) Function for plotting decision regions

In [4]:
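def plot_decision_regions(X, y, clf, title='Decision regions'):
    """Shade the class predicted by clf over a 2D grid and overlay the points in X."""
    # Pad the plotting range by 1.0 on each side of the data.
    x_min, x_max = X[:,0].min() - 1.0, X[:,0].max() + 1.0
    y_min, y_max = X[:,1].min() - 1.0, X[:,1].max() + 1.0
    # Build a 400 x 400 grid covering that range.
    xx, yy = np.meshgrid(
        np.linspace(x_min, x_max, 400),
        np.linspace(y_min, y_max, 400)
    )
    # Predict the class at every grid point, then restore the grid shape.
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.figure(figsize=(5,4))
    plt.contourf(xx, yy, Z, alpha=0.3)
    plt.scatter(X[:,0], X[:,1], c=y, s=25)
    plt.title(title)
    # Axis labels assume X holds the variance and skewness columns, in that order.
    plt.xlabel('variance')
    plt.ylabel('skewness')
    plt.tight_layout()
    plt.show()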
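
The function relies on `np.meshgrid` and `np.c_` to turn two axis ranges into a list of (x, y) points that `clf.predict` can consume. A tiny sketch of the shapes involved, using an arbitrary 3 x 2 grid chosen only for illustration:

```python
import numpy as np

# Three x positions and two y positions give a grid of 2 rows by 3 columns.
xx, yy = np.meshgrid(np.linspace(0, 1, 3), np.linspace(10, 20, 2))

# Flatten both grids and pair them column-wise: one (x, y) row per grid point.
grid_points = np.c_[xx.ravel(), yy.ravel()]

print(xx.shape)           # (2, 3)
print(grid_points.shape)  # (6, 2)
print(grid_points[0])     # [ 0. 10.]
```

Reshaping the predictions back to `xx.shape` is what lets `contourf` draw them as a filled surface.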


## 4) Train three trees of differing complexity

### 4.1 Overfitting (overfitted tree)

In [None]:
over = DecisionTreeClassifier(random_state=RANDOM_STATE, max_depth=None, min_samples_leaf=1)
over.fit(X_train, y_train)
y_pred_tr = over.predict(X_train)
y_pred_te = over.predict(X_test)
acc_tr = accuracy_score(y_train, y_pred_tr)
acc_te = accuracy_score(y_test, y_pred_te)
cm_over = confusion_matrix(y_test, y_pred_te)
print('Accuracy train:', round(acc_tr,3))
print('Accuracy test :', round(acc_te,3))
print('Confusion matrix (test):\n', cm_over)
plot_decision_regions(X_train, y_train, over, title='Overfitting — decision regions (train)')
plot_decision_regions(X_test, y_test, over, title='Overfitting — decision regions (test)')
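
One way to see how complex an unrestricted tree becomes is to query its depth and leaf count with `get_depth()` and `get_n_leaves()`. The sketch below assumes noisy synthetic data from `make_classification` as a stand-in for the banknote features; `deep`, `shallow`, `X_tr`, and `X_te` are illustrative names. In the notebook, the same calls could be made on `over`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in data with about 10% label noise.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=2, n_informative=2, n_redundant=0,
    flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# Unrestricted tree versus a depth-limited one, for comparison.
deep = DecisionTreeClassifier(random_state=0, max_depth=None, min_samples_leaf=1).fit(X_tr, y_tr)
shallow = DecisionTreeClassifier(random_state=0, max_depth=3).fit(X_tr, y_tr)

for name, clf in [('deep', deep), ('shallow', shallow)]:
    print(name,
          'depth:', clf.get_depth(),
          'leaves:', clf.get_n_leaves(),
          'train acc:', round(clf.score(X_tr, y_tr), 3),
          'test acc:', round(clf.score(X_te, y_te), 3))
```

An unrestricted tree keeps splitting until every training point is classified correctly, so a perfect training score by itself says little about generalization.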


### 4.2 Good fit (moderate depth)

In [None]:
good = DecisionTreeClassifier(random_state=RANDOM_STATE, max_depth=3, min_samples_leaf=5)
good.fit(X_train, y_train)
y_pred_tr = good.predict(X_train)
y_pred_te = good.predict(X_test)
acc_tr = accuracy_score(y_train, y_pred_tr)
acc_te = accuracy_score(y_test, y_pred_te)
cm_good = confusion_matrix(y_test, y_pred_te)
print('Accuracy train:', round(acc_tr,3))
print('Accuracy test :', round(acc_te,3))
print('Confusion matrix (test):\n', cm_good)
plot_decision_regions(X_train, y_train, good, title='Good fit — decision regions (train)')
plot_decision_regions(X_test, y_test, good, title='Good fit — decision regions (test)')
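
A depth-3 tree is small enough to read as a set of rules, and `sklearn.tree.export_text` prints them. This sketch assumes synthetic stand-in data, and the feature names are attached to it only to mirror the notebook's columns. In the notebook, `export_text(good, feature_names=['variance', 'skewness'])` would be the analogous call.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumption: synthetic stand-in data; its columns are not real variance/skewness values.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=2, n_informative=2, n_redundant=0,
    flip_y=0.1, random_state=0
)

moderate = DecisionTreeClassifier(random_state=0, max_depth=3, min_samples_leaf=5)
moderate.fit(X_demo, y_demo)

# Text rendering of the fitted splits, one line per node.
rules = export_text(moderate, feature_names=['variance', 'skewness'])
print(rules)
```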


### 4.3 Underfitting (tree that is too simple)

In [None]:
under = DecisionTreeClassifier(random_state=RANDOM_STATE, max_depth=1)
under.fit(X_train, y_train)
y_pred_tr = under.predict(X_train)
y_pred_te = under.predict(X_test)
acc_tr = accuracy_score(y_train, y_pred_tr)
acc_te = accuracy_score(y_test, y_pred_te)
cm_under = confusion_matrix(y_test, y_pred_te)
print('Accuracy train:', round(acc_tr,3))
print('Accuracy test :', round(acc_te,3))
print('Confusion matrix (test):\n', cm_under)
plot_decision_regions(X_train, y_train, under, title='Underfitting — decision regions (train)')
plot_decision_regions(X_test, y_test, under, title='Underfitting — decision regions (test)')
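
With `max_depth=1` the tree is a single split, so its whole decision boundary is one threshold on one feature. The fitted `tree_` attribute exposes both. The sketch assumes synthetic stand-in data, and `stump` is an illustrative name; in the notebook, the same attributes could be read from `under`.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in data with two features.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=2, n_informative=2, n_redundant=0,
    flip_y=0.1, random_state=0
)

stump = DecisionTreeClassifier(random_state=0, max_depth=1).fit(X_demo, y_demo)

# Node 0 is the root: the index of the feature it splits on and the cut value.
split_feature = stump.tree_.feature[0]
split_threshold = stump.tree_.threshold[0]
print('feature index:', split_feature, 'threshold:', round(split_threshold, 3))
print('leaves:', stump.get_n_leaves())
```

A single axis-aligned cut appears as one straight vertical or horizontal line in the decision-region plot.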


## 5) Conclusion: space for your observations
- **Overfitting**: what do you observe on the training set compared with the test set?
- **Good fit**: how do the decision boundaries between the classes compare? And the metrics?
- **Underfitting**: what pattern do you see? What is happening to the boundary?
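
To support the observations above, train and test accuracy can be tabulated side by side for several depths. This is a sketch under stated assumptions: it uses synthetic stand-in data, an arbitrary list of depths, and the illustrative names `summary`, `X_tr`, and `X_te`. Swapping in `X_train`, `X_test`, `y_train`, and `y_test` would give the corresponding table for the banknote data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in data with about 10% label noise.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=2, n_informative=2, n_redundant=0,
    flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

rows = []
for depth in [1, 2, 3, 5, 8, None]:  # None means no depth limit
    clf = DecisionTreeClassifier(random_state=0, max_depth=depth).fit(X_tr, y_tr)
    acc_train = clf.score(X_tr, y_tr)
    acc_test = clf.score(X_te, y_te)
    rows.append({
        'max_depth': str(depth),
        'train_acc': acc_train,
        'test_acc': acc_test,
        'gap': acc_train - acc_test,  # a large gap suggests overfitting
    })

summary = pd.DataFrame(rows)
print(summary.round(3))
```

Low accuracy on both subsets points toward underfitting, while high training accuracy with a wide gap points toward overfitting.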

**Which model is best, and why?**

> Write your final conclusion here.