<a href="https://colab.research.google.com/github/Ellinei/229352-StatisticalLearning/blob/main/Lab07_Boosted_trees.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Statistical Learning for Data Science 2 (229352)
#### Instructor: Donlapark Ponnoprat

#### [Course website](https://donlapark.pages.dev/229352/)

## Lab #6

## Boosted tree models on a simulated dataset

- [AdaBoostClassifier documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html#sklearn-ensemble-adaboostclassifier)
- [XGBClassifier documentation](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBClassifier)
- [LGBMClassifier documentation](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html#lightgbm-lgbmclassifier)
- [GridSearchCV documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)


- [Data](https://github.com/donlapark/ds352-labs/raw/main/Lab06-data.zip)


Perform `GridSearchCV` for the following three models on the provided training set (`X_train.csv` and `y_train.csv`):

- AdaBoost. Grid search over `n_estimators` and `learning_rate`.
- XGBoost. Grid search over `n_estimators`, `max_depth` and `learning_rate`.
- LightGBM. Grid search over `n_estimators`, `max_depth` and `learning_rate`.

Then:

1. Evaluate these models on the test set (`X_test.csv` and `y_test.csv`).

2. For each model, plot the feature importances.

For `AdaBoostClassifier`, feature importances can be obtained from the `feature_importances_` attribute after fitting the model.

For `XGBClassifier` and `LGBMClassifier`, feature importances can be plotted with each library’s `plot_importance` function. The iris cell further down gives a minimal XGBoost example.
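
For the AdaBoost case described above, the following is a minimal sketch of reading `feature_importances_` from a fitted model. It uses a synthetic dataset from `make_classification` as a stand-in for the lab data; the names `X_demo`, `y_demo`, and `ada_demo` and the dataset sizes are illustrative assumptions, not part of the lab.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic stand-in for the lab data (assumption: 6 features, 3 informative).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0,
    random_state=0,
)

ada_demo = AdaBoostClassifier(n_estimators=50, random_state=0)
ada_demo.fit(X_demo, y_demo)

# One non-negative score per feature (note the trailing underscore).
importances = ada_demo.feature_importances_

# Feature indices ordered from most to least important.
ranking = np.argsort(importances)[::-1]
for idx in ranking:
    print(f"Feature {idx}: {importances[idx]:.3f}")
```

The attribute only exists after `fit` has been called, which is why scikit-learn marks it with a trailing underscore.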

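In [None]:
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Both xgboost and lightgbm export a function named `plot_importance`.
# Importing both here would let the second silently shadow the first,
# so each one is imported further down, before it is used.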

In [None]:
from xgboost import XGBClassifier, plot_importance

from sklearn import datasets


iris = datasets.load_iris()
X = iris.data
y = iris.target

model = XGBClassifier()
model.fit(X, y)
plot_importance(model);

In [None]:
from xgboost import plot_tree

# Draw the tree at index 1 of the fitted ensemble (requires the graphviz package).
plot_tree(model, num_trees=1);

In [None]:
import pandas as pd

# Option 1: load the training features as a pandas DataFrame (the CSV has no header row).
data = pd.read_csv('X_train.csv', header=None)

data

In [None]:
import numpy as np

# Option 2: load the same file as a plain NumPy array.
data = np.genfromtxt("X_train.csv", delimiter=",")

data

In [None]:
X_train = pd.read_csv('X_train.csv', header=None)
y_train = pd.read_csv('y_train.csv', header=None).values.ravel()

param_grid_ada = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 1.0]
}

grid_search_ada = GridSearchCV(AdaBoostClassifier(), param_grid_ada, cv=5)

grid_search_ada.fit(X_train, y_train)

print("Best parameters for AdaBoost:", grid_search_ada.best_params_)
print("Best cross-validation score for AdaBoost:", grid_search_ada.best_score_)
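
Beyond `best_params_` and `best_score_`, a fitted `GridSearchCV` keeps the score of every grid point in `cv_results_`. The sketch below shows one way to tabulate it with pandas. It runs on a small synthetic dataset with a deliberately tiny grid so that it is self-contained; `X_demo`, `y_demo`, `demo_grid`, and `demo_search` are illustrative names, not part of the lab.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the lab data (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# A 2 x 2 grid: four parameter combinations in total.
demo_grid = {"n_estimators": [10, 20], "learning_rate": [0.1, 1.0]}
demo_search = GridSearchCV(AdaBoostClassifier(random_state=0), demo_grid, cv=3)
demo_search.fit(X_demo, y_demo)

# cv_results_ is a dict of arrays; a DataFrame makes it easier to read.
columns = [
    "param_n_estimators",
    "param_learning_rate",
    "mean_test_score",
    "std_test_score",
    "rank_test_score",
]
cv_table = pd.DataFrame(demo_search.cv_results_)[columns].sort_values("rank_test_score")
print(cv_table)
```

The top row of `cv_table` corresponds to `best_params_`, and the remaining rows show how sensitive the score is to each hyperparameter.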

In [None]:
X_test = pd.read_csv('X_test.csv', header=None)
y_test = pd.read_csv('y_test.csv', header=None).values.ravel()

best_ada_model = grid_search_ada.best_estimator_
y_pred_ada = best_ada_model.predict(X_test)
accuracy_ada = accuracy_score(y_test, y_pred_ada)
print("AdaBoost Test Accuracy:", accuracy_ada)

feature_importances_ada = best_ada_model.feature_importances_
plt.figure(figsize=(10, 6))
plt.bar(range(len(feature_importances_ada)), feature_importances_ada)
plt.xticks(range(len(feature_importances_ada)), [f'Feature {i}' for i in range(len(feature_importances_ada))], rotation=90)
plt.xlabel("Feature")
plt.ylabel("Importance")
plt.title("AdaBoost Feature Importances")
plt.tight_layout()
plt.show()

In [None]:
param_grid_xgb = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 1.0]
}

grid_search_xgb = GridSearchCV(XGBClassifier(), param_grid_xgb, cv=5)

grid_search_xgb.fit(X_train, y_train)

print("Best parameters for XGBoost:", grid_search_xgb.best_params_)
print("Best cross-validation score for XGBoost:", grid_search_xgb.best_score_)

In [None]:
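# Import xgboost's plot_importance explicitly; lightgbm exports a function with the same name.
from xgboost import plot_importance

best_xgb_model = grid_search_xgb.best_estimator_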
y_pred_xgb = best_xgb_model.predict(X_test)
accuracy_xgb = accuracy_score(y_test, y_pred_xgb)
print("XGBoost Test Accuracy:", accuracy_xgb)

plt.figure(figsize=(10, 6))
plot_importance(best_xgb_model, ax=plt.gca())
plt.title("XGBoost Feature Importances")
plt.tight_layout()
plt.show()

In [None]:
param_grid_lgbm = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 1.0]
}

grid_search_lgbm = GridSearchCV(LGBMClassifier(), param_grid_lgbm, cv=5)

grid_search_lgbm.fit(X_train, y_train)

print("Best parameters for LightGBM:", grid_search_lgbm.best_params_)
print("Best cross-validation score for LightGBM:", grid_search_lgbm.best_score_)

In [None]:
from lightgbm import plot_importance

best_lgbm_model = grid_search_lgbm.best_estimator_
y_pred_lgbm = best_lgbm_model.predict(X_test)
accuracy_lgbm = accuracy_score(y_test, y_pred_lgbm)
print("LightGBM Test Accuracy:", accuracy_lgbm)

plt.figure(figsize=(10, 6))
plot_importance(best_lgbm_model, ax=plt.gca())
plt.title("LightGBM Feature Importances")
plt.tight_layout()
plt.show()
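
The three grid searches above follow the same pattern: build a grid, fit, then score on the test set. A possible way to express that pattern once is to loop over (estimator, grid) pairs. The sketch below is an assumption-laden illustration: it uses synthetic data and scikit-learn's `GradientBoostingClassifier` as a stand-in for `XGBClassifier` and `LGBMClassifier` so that it runs without extra packages. The names `candidates` and `results` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the lab's train/test CSV files (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Each entry pairs an estimator with its parameter grid.
# GradientBoostingClassifier stands in for XGBClassifier / LGBMClassifier here.
candidates = {
    "AdaBoost": (
        AdaBoostClassifier(random_state=0),
        {"n_estimators": [20, 50], "learning_rate": [0.1, 1.0]},
    ),
    "GradientBoosting": (
        GradientBoostingClassifier(random_state=0),
        {"n_estimators": [20, 50], "max_depth": [2, 3], "learning_rate": [0.1]},
    ),
}

results = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=3)
    search.fit(X_tr, y_tr)
    # Score the refitted best model on the held-out test split.
    test_acc = accuracy_score(y_te, search.best_estimator_.predict(X_te))
    results[name] = {
        "best_params": search.best_params_,
        "cv_score": search.best_score_,
        "test_accuracy": test_acc,
    }

for name, summary in results.items():
    print(name, summary)
```

Swapping in `XGBClassifier()` or `LGBMClassifier()` with the grids used earlier should work the same way, since both follow the scikit-learn estimator interface.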