# **Iris Dataset**

In [2]:
# iris_boosting_stacking.py
# Requirements: scikit-learn, optionally xgboost, lightgbm, catboost
# Run: python iris_boosting_stacking.py  (or copy into a notebook and run the cell)

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, classification_report
from sklearn.preprocessing import label_binarize
import warnings
warnings.filterwarnings("ignore")

# Load Iris
iris = load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names

# Split 80/20 stratified
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Baseline tree, bagging-style ensembles, and boosting models
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Extra Trees": ExtraTreesClassifier(n_estimators=100, random_state=42),
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=42),
    "GradientBoosting": GradientBoostingClassifier(n_estimators=100, random_state=42)
}

# Try optional boosted libs
try:
    from xgboost import XGBClassifier
    # Iris has three classes, so the multiclass log loss ("mlogloss") is the matching metric
    models["XGBoost"] = XGBClassifier(use_label_encoder=False, eval_metric="mlogloss", random_state=42)
except Exception:
    print("XGBoost not available — skipping.")

try:
    from lightgbm import LGBMClassifier
    models["LightGBM"] = LGBMClassifier(random_state=42)
except Exception:
    print("LightGBM not available — skipping.")

try:
    from catboost import CatBoostClassifier
    models["CatBoost"] = CatBoostClassifier(verbose=0, random_state=42)
except Exception:
    print("CatBoost not available — skipping.")

# Stacking: RF, ET, GB as base, logistic as final
base_estimators = [
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('et', ExtraTreesClassifier(n_estimators=100, random_state=42)),
    ('gb', GradientBoostingClassifier(n_estimators=100, random_state=42))
]
stack = StackingClassifier(estimators=base_estimators, final_estimator=LogisticRegression(max_iter=1000), cv=5, n_jobs=-1)
models["Stacking (RF,ET,GB -> LR)"] = stack

# Helper: ROC AUC multi-class
def safe_roc_auc(model, X_test, y_test):
    try:
        if hasattr(model, "predict_proba"):
            y_score = model.predict_proba(X_test)
            y_bin = label_binarize(y_test, classes=np.unique(y_test))
            if y_score.ndim==2 and y_score.shape[1] >= 2:
                return roc_auc_score(y_bin, y_score, average="macro", multi_class="ovr")
        if hasattr(model, "decision_function"):
            y_score = model.decision_function(X_test)
            y_bin = label_binarize(y_test, classes=np.unique(y_test))
            return roc_auc_score(y_bin, y_score, average="macro", multi_class="ovr")
    except Exception:
        return np.nan
    return np.nan

# Train & evaluate
rows = []
for name, mdl in models.items():
    try:
        mdl.fit(X_train, y_train)
        y_pred = mdl.predict(X_test)
        acc = accuracy_score(y_test, y_pred)
        prec = precision_score(y_test, y_pred, average="macro", zero_division=0)
        rec = recall_score(y_test, y_pred, average="macro", zero_division=0)
        f1 = f1_score(y_test, y_pred, average="macro", zero_division=0)
        rocauc = safe_roc_auc(mdl, X_test, y_test)
        rows.append({
            "Model": name,
            "Accuracy": acc,
            "Precision (macro)": prec,
            "Recall (macro)": rec,
            "F1 (macro)": f1,
            "ROC AUC (macro/ovr)": rocauc
        })
    except Exception as e:
        rows.append({
            "Model": name,
            "Accuracy": np.nan,
            "Precision (macro)": np.nan,
            "Recall (macro)": np.nan,
            "F1 (macro)": np.nan,
            "ROC AUC (macro/ovr)": np.nan
        })
        print(f"Model {name} failed: {e}")

results_df = pd.DataFrame(rows).sort_values(by="F1 (macro)", ascending=False).reset_index(drop=True)
print("\n=== Results (Iris) ===")
print(results_df.to_string(index=False))

# Save results
results_df.to_csv("iris_model_performance.csv", index=False)
print("\nSaved iris_model_performance.csv")


CatBoost not available — skipping.
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000034 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 86
[LightGBM] [Info] Number of data points in the train set: 120, number of used features: 4
[LightGBM] [Info] Start training from score -1.098612
[LightGBM] [Info] Start training from score -1.098612
[LightGBM] [Info] Start training from score -1.098612

=== Results (Iris) ===
                    Model  Accuracy  Precision (macro)  Recall (macro)  F1 (macro)  ROC AUC (macro/ovr)
Stacking (RF,ET,GB -> LR)  0.966667           0.969697        0.966667    0.966583             0.993333
         GradientBoosting  0.966667           0.969697        0.966667    0.966583             0.990000
            Decision Tree  0.933333           0.933333        0.933333    0.933333             0.950000
              Extra Trees  0.933333           0.933333        0.933333    0.933

# **Code Walkthrough: Iris Dataset Classification (Bagging, Boosting, Stacking)**

## **1. Importing Libraries and the Dataset**
This step imports the required libraries:
- `pandas`, `numpy`, `matplotlib` → data handling and visualization.
- `sklearn` → provides the Iris dataset plus preprocessing utilities, machine learning models, evaluation metrics, and data splitting.
- `StratifiedKFold` → preserves the class proportions in every fold.

The Iris dataset is then loaded with `load_iris()` and separated into the feature matrix `X` and the label vector `y`.

## **2. Splitting the Dataset**
The dataset is divided into:
- **80% training data**
- **20% test data**

The split uses `train_test_split()` with `stratify=y` so that both subsets keep the original class distribution.
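
As a quick check of what `stratify=y` does, the sketch below repeats the split with the same settings and counts the classes in each subset. The names `train_counts` and `test_counts` are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same settings as the split described above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Iris has 50 samples per class, so a stratified 80/20 split
# keeps the three classes equally represented in both subsets.
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print("train:", train_counts)
print("test: ", test_counts)
```

Without `stratify=y`, the per-class counts in the 30-sample test set could drift away from 10 each, which makes per-class metrics noisier.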

## **3. Building and Training the Models**
Three tree-based models are built: a single decision tree as the baseline and two bagging-style ensembles.

### **a. Decision Tree**
A simple model based on a single decision tree. It serves as the baseline.

### **b. Random Forest**
An ensemble of decision trees (`n_estimators=100` in this code), each trained on a bootstrap sample, which reduces overfitting compared with a single tree.

### **c. Extra Trees**
Similar to Random Forest, but the split thresholds are drawn at random, which usually speeds up training and can reduce variance further.

All models are trained with `fit(X_train, y_train)`.
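
The practical difference between the two ensembles shows up in their scikit-learn defaults. The sketch below fits both on the full Iris data purely for illustration and inspects the `bootstrap` setting; the variable names `rf` and `et` are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

X, y = load_iris(return_X_y=True)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
et = ExtraTreesClassifier(n_estimators=100, random_state=42).fit(X, y)

# Random Forest resamples the training rows for every tree by default.
# Extra Trees trains each tree on the full set and randomizes the
# split thresholds instead.
print("Random Forest bootstrap:", rf.bootstrap)
print("Extra Trees bootstrap:  ", et.bootstrap)
print("Trees per ensemble:", len(rf.estimators_), len(et.estimators_))
```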

## **4. Evaluation with a Confusion Matrix**
The confusion matrix shows the counts of:
- True Positive
- False Positive
- True Negative
- False Negative

Because Iris has three classes, these counts are read per class in a one-vs-rest sense.

For each model:
- Predictions are generated with `predict()`
- The confusion matrix is drawn with `ConfusionMatrixDisplay()`

The three plots are shown side by side to compare the models.
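
For a three-class problem, the four counts have to be derived per class from the 3x3 matrix. A minimal sketch, assuming the same split as above and using only the decision tree (the plotting step is left out):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
cm = confusion_matrix(y_test, model.predict(X_test))  # rows: true, columns: predicted

# One-vs-rest counts for each class, derived from the 3x3 matrix
tp = np.diag(cm)
fp = cm.sum(axis=0) - tp        # predicted as the class, actually another
fn = cm.sum(axis=1) - tp        # actually the class, predicted as another
tn = cm.sum() - (tp + fp + fn)  # everything else

for k, name in enumerate(iris.target_names):
    print(f"{name}: TP={tp[k]} FP={fp[k]} FN={fn[k]} TN={tn[k]}")
```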

## **5. ROC/AUC Evaluation**
The Iris labels have 3 classes. To visualize ROC curves:
- The labels are binarized with `label_binarize()`
- Predicted probabilities are taken from `predict_proba()`
- Each ROC curve is computed with `roc_curve()`
- The AUC is computed with `auc()`

The ROC curve shows how well the model separates the classes across all probability thresholds.
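
The four steps above can be sketched as follows for one model. Random Forest is used here only as an example, and the plot itself is omitted; the per-class average is compared against `roc_auc_score` as a sanity check.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
y_score = model.predict_proba(X_test)                    # shape (n_samples, 3)
y_bin = label_binarize(y_test, classes=model.classes_)   # one indicator column per class

# One ROC curve per class, treating that class as "positive" and the rest as "negative"
per_class_auc = []
for k in range(y_bin.shape[1]):
    fpr, tpr, _ = roc_curve(y_bin[:, k], y_score[:, k])
    per_class_auc.append(auc(fpr, tpr))

macro_auc = float(np.mean(per_class_auc))
reference = roc_auc_score(y_test, y_score, multi_class="ovr", average="macro")
print("per-class AUC:", np.round(per_class_auc, 3))
print("macro AUC:", round(macro_auc, 3), "| roc_auc_score:", round(reference, 3))
```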

## **6. Learning Curve**
A learning curve is used to find out:
- Whether the model is overfitting or underfitting
- How performance changes as the amount of training data grows

The `learning_curve()` function returns:
- Training scores
- Cross-validation scores

A plot is then drawn for each model:
- Decision Tree
- Random Forest
- Extra Trees

The learning curve helps assess model stability and whether more data would help.
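
A minimal sketch of the `learning_curve()` call for one model, printing the averaged scores instead of plotting them. The fold count, the `train_sizes` grid, and the seeds are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# shuffle=True matters here: Iris is ordered by class, so the smallest
# training subsets would otherwise contain a single class.
train_sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=42),
    X, y,
    cv=cv,
    train_sizes=np.linspace(0.2, 1.0, 5),
    scoring="accuracy",
    shuffle=True,
    random_state=42,
)

# Average over the 5 folds; a persistent gap between the two columns suggests overfitting
for n, tr, va in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:3d}  train={tr:.3f}  cv={va:.3f}")
```

The same call can be repeated with `RandomForestClassifier` and `ExtraTreesClassifier` to obtain the other two curves.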


# **Heart Attack Analysis & Prediction Dataset**

In [3]:
# heart_boosting_stacking.py
# Requirements: scikit-learn, pandas, numpy, matplotlib
# Optional: xgboost, lightgbm, catboost
# Put heart.csv in the same folder or change the path.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, classification_report
from sklearn.preprocessing import label_binarize
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

# Load data (change the path if needed)
data = pd.read_csv("/content/heart.csv")
print("Dataset shape:", data.shape)

# The target column is expected to be named 'output'; fall back to 'target' if present.
if 'output' not in data.columns:
    # try the common alternative name
    if 'target' in data.columns:
        data = data.rename(columns={'target':'output'})
    else:
        raise ValueError("No 'output' or 'target' column found in heart.csv; adjust the target column name.")

X = data.drop('output', axis=1)
y = data['output']

# Check missing values
print("Missing values per column:\n", X.isnull().sum())

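# Split first so the scaler never sees the test rows
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Standardize features using statistics learned from the training split only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)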

# Models
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Extra Trees": ExtraTreesClassifier(n_estimators=100, random_state=42),
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=42),
    "GradientBoosting": GradientBoostingClassifier(n_estimators=100, random_state=42)
}

# optional
try:
    from xgboost import XGBClassifier
    models["XGBoost"] = XGBClassifier(use_label_encoder=False, eval_metric="logloss", random_state=42)
except Exception:
    print("XGBoost not available.")

try:
    from lightgbm import LGBMClassifier
    models["LightGBM"] = LGBMClassifier(random_state=42)
except Exception:
    print("LightGBM not available.")

try:
    from catboost import CatBoostClassifier
    models["CatBoost"] = CatBoostClassifier(verbose=0, random_state=42)
except Exception:
    print("CatBoost not available.")

# Stacking
base_estimators = [
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('gb', GradientBoostingClassifier(n_estimators=100, random_state=42)),
    ('et', ExtraTreesClassifier(n_estimators=100, random_state=42))
]
stack = StackingClassifier(estimators=base_estimators, final_estimator=LogisticRegression(max_iter=1000), cv=5, n_jobs=-1)
models["Stacking (RF,GB,ET -> LR)"] = stack

# Evaluate
def safe_roc_auc_binary(model, X_test, y_test):
    try:
        if hasattr(model, "predict_proba"):
            y_score = model.predict_proba(X_test)[:,1]
            return roc_auc_score(y_test, y_score)
        if hasattr(model, "decision_function"):
            y_score = model.decision_function(X_test)
            return roc_auc_score(y_test, y_score)
    except Exception:
        return np.nan
    return np.nan

rows = []
for name, mdl in models.items():
    try:
        mdl.fit(X_train, y_train)
        y_pred = mdl.predict(X_test)
        acc = accuracy_score(y_test, y_pred)
        prec = precision_score(y_test, y_pred, zero_division=0)
        rec = recall_score(y_test, y_pred, zero_division=0)
        f1 = f1_score(y_test, y_pred, zero_division=0)
        rocauc = safe_roc_auc_binary(mdl, X_test, y_test)
        rows.append({
            "Model": name,
            "Accuracy": acc,
            "Precision": prec,
            "Recall": rec,
            "F1": f1,
            "ROC AUC": rocauc
        })
    except Exception as e:
        rows.append({
            "Model": name,
            "Accuracy": np.nan,
            "Precision": np.nan,
            "Recall": np.nan,
            "F1": np.nan,
            "ROC AUC": np.nan
        })
        print(f"Model {name} failed: {e}")

results_df = pd.DataFrame(rows).sort_values(by="ROC AUC", ascending=False).reset_index(drop=True)
print("\n=== Results (Heart) ===")
print(results_df.to_string(index=False))

# Confusion matrix for top model
top_model_name = results_df.loc[0, "Model"]
print("\nTop model:", top_model_name)
top_model = models[top_model_name]
y_pred_top = top_model.predict(X_test)
print("\nClassification report (top model):\n", classification_report(y_test, y_pred_top, zero_division=0))

cm = confusion_matrix(y_test, y_pred_top)
print("Confusion matrix (top model):\n", cm)

# Save results
results_df.to_csv("heart_model_performance.csv", index=False)
print("\nSaved heart_model_performance.csv")


Dataset shape: (303, 14)
Missing values per column:
 age         0
sex         0
cp          0
trtbps      0
chol        0
fbs         0
restecg     0
thalachh    0
exng        0
oldpeak     0
slp         0
caa         0
thall       0
dtype: int64
CatBoost not available.
[LightGBM] [Info] Number of positive: 132, number of negative: 110
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000077 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 250
[LightGBM] [Info] Number of data points in the train set: 242, number of used features: 13
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.545455 -> initscore=0.182322
[LightGBM] [Info] Start training from score 0.182322

=== Results (Heart) ===
                    Model  Accuracy  Precision   Recall       F1  ROC AUC
Stacking (RF,GB,ET -> LR)  0.819672   0.775000 0.939394 0.849315 0.915584
              Extra Trees  0.819672   0.789474 0.909091 0.845070 0.910

# **Code Walkthrough: Heart Attack Prediction (Bagging, Boosting, Stacking)**

## **1. Importing Libraries and Loading the Dataset**
The libraries used include:
- `pandas`, `numpy`, `matplotlib`, `seaborn` → data analysis and visualization.
- `sklearn` → preprocessing, splitting, machine learning models, evaluation, and learning curves.

The Heart Attack dataset is loaded with `pd.read_csv()`, after which its shape and the first few rows are displayed.

## **2. Data Preprocessing**
### **a. Checking for Missing Values**
`isnull().sum()` is used to confirm that no column contains missing data.

### **b. Checking Data Types**
`info()` is used to confirm that every column has a type suitable for modeling.

### **c. Feature Standardization**
Because the numeric features are on different scales:
- The features are standardized with `StandardScaler()`
- The target `output` is separated as the label

Standardization matters mainly for distance-based and linear models; tree-based ensembles are largely insensitive to feature scale, so for them it is harmless rather than required.
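
One common way to apply `StandardScaler()` without leaking test information is to learn the mean and standard deviation from the training rows only. The sketch below uses a synthetic stand-in generated with `make_classification` (same shape as the heart data, but not the real values), so every number it prints is illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for heart.csv (303 rows, 13 features); NOT the real data
X, y = make_classification(
    n_samples=303, n_features=13, n_informative=6, random_state=42
)
X = X * 50 + 100  # give the columns a non-unit scale, purely for illustration

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # learn mean/std from training rows only
X_test_std = scaler.transform(X_test)        # reuse those statistics on the test rows

print("train mean ~ 0:", np.allclose(X_train_std.mean(axis=0), 0.0))
print("train std  ~ 1:", np.allclose(X_train_std.std(axis=0), 1.0))
print("test mean (not exactly 0):", np.round(X_test_std.mean(axis=0)[:3], 3))
```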

## **3. Splitting the Dataset**
The dataset is divided into:
- **80% training data**
- **20% test data**

The split uses `train_test_split()` with `stratify=y` so that both subsets keep the original class proportions.

## **4. Building and Training the Models**
The tree-based models are the same as in the Iris case:

### **a. Decision Tree**
A single-tree model used as the baseline.

### **b. Random Forest**
Combines many decision trees to make predictions more stable.

### **c. Extra Trees**
Similar to Random Forest, but with more randomness in the choice of split thresholds.

All models are trained with the `.fit()` method.

## **5. Confusion Matrix Evaluation**
The confusion matrices are shown in one figure with three plots:
- Decision Tree
- Random Forest
- Extra Trees

The class labels are renamed to:
- `"No Attack"`
- `"Attack"`

This makes the evaluation easier to read in a medical context.
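
For a binary target the four cells can be unpacked directly, and the renamed labels can be passed to the text report as well. This sketch again runs on a synthetic stand-in from `make_classification`, not on heart.csv, so the counts are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for heart.csv (303 rows, 13 features); NOT the real data
X, y = make_classification(
    n_samples=303, n_features=13, n_informative=6, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)

# For binary labels 0/1, ravel() returns the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")

# Label 0 -> "No Attack", label 1 -> "Attack"
print(classification_report(
    y_test, y_pred, target_names=["No Attack", "Attack"], zero_division=0
))
```

In a screening setting, the false negatives (`fn`) are usually the cell to watch, since they correspond to at-risk patients predicted as `"No Attack"`.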

## **6. ROC Curve and AUC**
For each model:
- The probability of the positive class (`Attack`) is taken from `predict_proba()[:,1]`
- The ROC curve is computed with `roc_curve()`
- The AUC is computed with `auc()`

ROC/AUC shows how well the model distinguishes patients with a heart attack from those without.
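
The binary case needs no label binarization. A minimal sketch on the same kind of synthetic stand-in (not the real heart data), with Extra Trees chosen arbitrarily and the plot omitted:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for heart.csv (303 rows, 13 features); NOT the real data
X, y = make_classification(
    n_samples=303, n_features=13, n_informative=6, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = ExtraTreesClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
y_score = model.predict_proba(X_test)[:, 1]  # probability of the positive class

# Each (fpr, tpr) pair corresponds to one decision threshold on y_score
fpr, tpr, thresholds = roc_curve(y_test, y_score)
roc_auc = auc(fpr, tpr)

print("AUC via roc_curve + auc:", round(roc_auc, 4))
print("AUC via roc_auc_score:  ", round(roc_auc_score(y_test, y_score), 4))
```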

## **7. Learning Curve**
The learning curve is used to analyze:
- Model stability
- Whether the model is overfitting or underfitting
- How much additional data would be needed

Learning curves are shown for:
- Decision Tree
- Random Forest
- Extra Trees
---

