<a href="https://colab.research.google.com/github/Ellinei/229352-StatisticalLearning/blob/main/Lab05_decision_tree_bagging_RF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Statistical Learning for Data Science 2 (229352)
#### Instructor: Donlapark Ponnoprat

#### [Course website](https://donlapark.pages.dev/229352/)

## Lab #5

#### Load data at: https://donlapark.pages.dev/229352/heart_disease.csv

* Decision tree ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html))
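* Hyperparameter search using cross-validation: grid search ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)) and randomized search ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html))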

In [None]:
import pandas as pd
import graphviz

from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

# import data
data = pd.read_csv("heart_disease.csv", na_values="?")
data.head()

In [None]:

# split into X and y
y = data["label"]
X = data.drop("label", axis=1)

# split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y)

# impute missing values
imputer = SimpleImputer(strategy="mean")
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

# Create a decision tree
clf = DecisionTreeClassifier()

![5CV](https://scikit-learn.org/stable/_images/grid_search_cross_validation.png)

In [None]:
params = {'max_depth': [3, 6, 9, 12]}

gridcv = GridSearchCV(clf, params, scoring='accuracy', cv=5)
gridcv.fit(X_train, y_train)

In [None]:
gridcv.best_estimator_
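
The list at the top of this lab also links to randomized hyperparameter search. A minimal sketch of how `RandomizedSearchCV` might stand in for the grid search above. It assumes a synthetic dataset from `make_classification` in place of the heart-disease data, and the sampling ranges and `n_iter=10` are illustrative choices, not values from the lab.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart-disease data (assumption for this sketch).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Distributions to sample from, instead of an exhaustive grid.
param_distributions = {
    "max_depth": randint(2, 13),         # integers 2..12 (upper bound is exclusive)
    "min_samples_leaf": randint(1, 11),  # integers 1..10
}

randcv = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions,
    n_iter=10,           # number of sampled configurations
    scoring="accuracy",
    cv=5,
    random_state=0,
)
randcv.fit(X_demo, y_demo)

print(randcv.best_params_)
print(round(randcv.best_score_, 3))
```

Unlike a grid, the cost here is set by `n_iter` rather than by the number of value combinations, which helps when several hyperparameters are searched at once.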

In [None]:
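# X is still the original DataFrame, so its columns are exactly the feature names
# (this avoids assuming that "label" is the last column of `data`).
plot_data = export_graphviz(gridcv.best_estimator_,
                            out_file=None,
                            filled=True,
                            rounded=True,
                            feature_names=list(X.columns),
                            class_names=['0', '1'])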

graph = graphviz.Source(plot_data)
graph

## Bagged decision trees
* Bagging classifier ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html))

In [None]:
clf = DecisionTreeClassifier()
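
The cell above only creates the base tree. A minimal sketch of how such a tree might be wrapped in a `BaggingClassifier`, where each tree is fit on a bootstrap sample of the training rows and the trees' predictions are combined. The synthetic data, `n_estimators=50`, and the variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart-disease data (assumption for this sketch).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# The base tree is passed positionally, so the call does not depend on
# the keyword name used for the base estimator in a given sklearn version.
bag = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=50,   # number of bootstrapped trees
    bootstrap=True,    # sample training rows with replacement
    oob_score=True,    # score each row using only trees that did not see it
    random_state=0,
)
bag.fit(X_tr, y_tr)

print(len(bag.estimators_))
print(round(bag.oob_score_, 3))
print(round(bag.score(X_te, y_te), 3))
```

The out-of-bag score gives a validation-style estimate without setting aside extra data, since each tree leaves out part of the training rows.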

## Random forest classifier
* Random forest ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html))

In [None]:
rf = RandomForestClassifier()
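
A random forest also bootstraps the training rows, but additionally considers only a random subset of features at each split. A minimal sketch on synthetic data (an assumption, as are the parameter values) that fits a forest and lists its most important features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart-disease data (assumption for this sketch).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",  # features considered at each split
    random_state=0,
)
forest.fit(X_tr, y_tr)

print(round(forest.score(X_te, y_te), 3))

# Impurity-based importances: one value per feature, summing to 1.
top = np.argsort(forest.feature_importances_)[::-1][:3]
print(top)
```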

#### Exercise
1. Study the hyperparameters of three models: [Decision tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html), [Bagged Decision Trees](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html), and [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).
2. For each model, use a pipeline with grid search cross-validation over multiple hyperparameters to find the best model.
* Decision tree: choose at least 3 hyperparameters
* Bagged decision trees: choose at least 3 hyperparameters
* Random forest: choose at least 3 hyperparameters
3. For each model, compute the `f1_macro` and `accuracy` score on the test set.
* What is your best model?
* Plot the best tree model
* What hyperparameters did you choose? (Explain in words, not with `sklearn`'s parameter names.)
* What are the best values of your hyperparameters?

In [None]:
dt_pipeline = Pipeline([
    ('imputer', imputer),
    ('decisiontreeclassifier', DecisionTreeClassifier())
])

dt_params = {
    'decisiontreeclassifier__max_depth': [3, 6, 9, 12],
    'decisiontreeclassifier__min_samples_split': [2, 5, 10],
    'decisiontreeclassifier__min_samples_leaf': [1, 2, 4]
}

In [None]:
bagging_pipeline = Pipeline([
    ('imputer', imputer),
    ('baggingclassifier', BaggingClassifier(estimator=DecisionTreeClassifier()))
])

bagging_params = {
    'baggingclassifier__n_estimators': [10, 50, 100],
    'baggingclassifier__max_samples': [0.5, 0.7, 1.0],
    'baggingclassifier__max_features': [0.5, 0.7, 1.0]
}


In [None]:
rf_pipeline = Pipeline([
    ('imputer', imputer),
    ('randomforestclassifier', RandomForestClassifier())
])

rf_params = {
    'randomforestclassifier__n_estimators': [10, 50, 100],
    'randomforestclassifier__max_depth': [3, 6, 9, 12],
    'randomforestclassifier__min_samples_leaf': [1, 2, 4]
}

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, f1_score, accuracy_score

scoring = {
    'f1_macro': make_scorer(f1_score, average='macro'),
    'accuracy': make_scorer(accuracy_score)
}

dt_grid_search = GridSearchCV(dt_pipeline, dt_params, scoring=scoring, refit='f1_macro', cv=5)
dt_grid_search.fit(X_train, y_train)

bagging_grid_search = GridSearchCV(bagging_pipeline, bagging_params, scoring=scoring, refit='f1_macro', cv=5)
bagging_grid_search.fit(X_train, y_train)

rf_grid_search = GridSearchCV(rf_pipeline, rf_params, scoring=scoring, refit='f1_macro', cv=5)
rf_grid_search.fit(X_train, y_train)
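
With several scorers, `GridSearchCV` stores one set of result columns per metric in `cv_results_`, and `refit` selects which metric decides the best model. A minimal sketch of how those results might be inspected, assuming synthetic data and a reduced grid; the step name mirrors the decision-tree pipeline above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart-disease data (assumption for this sketch).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("decisiontreeclassifier", DecisionTreeClassifier(random_state=0)),
])

grid = GridSearchCV(
    pipe,
    {"decisiontreeclassifier__max_depth": [3, 6, 9]},
    scoring={"f1_macro": "f1_macro", "accuracy": "accuracy"},
    refit="f1_macro",  # the best model is chosen by macro F1
    cv=5,
)
grid.fit(X_demo, y_demo)

# One row per hyperparameter combination, one column group per metric.
results = pd.DataFrame(grid.cv_results_)
cols = [
    "param_decisiontreeclassifier__max_depth",
    "mean_test_f1_macro",
    "mean_test_accuracy",
    "rank_test_f1_macro",
]
print(results[cols].sort_values("rank_test_f1_macro"))
```

Looking at the whole table, rather than only `best_params_`, shows how close the runner-up settings are to the selected one.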

In [None]:
best_dt = dt_grid_search.best_estimator_
best_bagging = bagging_grid_search.best_estimator_
best_rf = rf_grid_search.best_estimator_

dt_pred = best_dt.predict(X_test)
bagging_pred = best_bagging.predict(X_test)
rf_pred = best_rf.predict(X_test)

dt_f1 = f1_score(y_test, dt_pred, average='macro')
dt_accuracy = accuracy_score(y_test, dt_pred)

bagging_f1 = f1_score(y_test, bagging_pred, average='macro')
bagging_accuracy = accuracy_score(y_test, bagging_pred)

rf_f1 = f1_score(y_test, rf_pred, average='macro')
rf_accuracy = accuracy_score(y_test, rf_pred)


In [None]:
print("\n--- Model Comparison ---")
print(f"Decision Tree - F1 Macro: {dt_f1:.4f}, Accuracy: {dt_accuracy:.4f}")
print(f"Bagging - F1 Macro: {bagging_f1:.4f}, Accuracy: {bagging_accuracy:.4f}")
print(f"Random Forest - F1 Macro: {rf_f1:.4f}, Accuracy: {rf_accuracy:.4f}")

print("\nBest Model")
# Pick the model with the highest test-set macro F1.
# On an exact tie, max() keeps the first entry in the dict.
test_f1 = {
    "Decision Tree": dt_f1,
    "Bagging": bagging_f1,
    "Random Forest": rf_f1,
}
best_model = max(test_f1, key=test_f1.get)

print(best_model)

In [None]:
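# export_graphviz and graphviz are already imported in the first cell.
# Pull the fitted tree out of the best decision-tree pipeline.
best_dt_classifier = best_dt.named_steps['decisiontreeclassifier']

# X is still the original DataFrame, so its columns are exactly the feature names.
plot_data = export_graphviz(best_dt_classifier,
                            out_file=None,
                            filled=True,
                            rounded=True,
                            feature_names=list(X.columns),
                            class_names=['0', '1'])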

graph = graphviz.Source(plot_data)
display(graph)

In [None]:
print("\nBest hyperparameters for Decision Tree:")
print(dt_grid_search.best_params_)

print("\nBest hyperparameters for Bagging Classifier:")
print(bagging_grid_search.best_params_)

print("\nBest hyperparameters for Random Forest:")
print(rf_grid_search.best_params_)

**Decision Tree:**

*   **Maximum depth:** Limits how deep the tree can grow. The best value is `3`.
*   **Minimum samples to split:** The minimum number of samples a node must contain before it can be split. The best value is `5`.
*   **Minimum samples per leaf:** The minimum number of samples that must remain in a leaf node. The best value is `2`.

**Bagging Classifier:**

*   **Number of estimators:** The number of decision trees in the bagging ensemble. The best value is `100`.
*   **Maximum samples:** The fraction of the training set drawn to train each base model. The best value is `0.5`.
*   **Maximum features:** The fraction of the features drawn to train each base model. The best value is `0.5`.

**Random Forest:**

*   **Number of estimators:** The number of decision trees in the random forest. The best value is `100`.
*   **Maximum depth:** Limits the maximum depth of each tree in the forest. The best value is `9`.
*   **Minimum samples per leaf:** The minimum number of samples that must remain in a leaf node of each tree. The best value is `4`.