# Exercise 01 Recap ML & Ensemble Learning I

## Pedagogy

This notebook contains both theoretical explanations and executable cells in which you can run your code.

When you see the <span style="color:red">**[TBC]**</span> (To Be Completed) sign, it means that you need to perform an action beyond executing the existing code cells. These actions can be:
- Complete the code with proper comments
- Respond to a question
- Write an analysis
- etc.

### Import libraries

In [None]:
# import all libraries used in this notebook here
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, mean_squared_error, r2_score
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

In [None]:
# suppress all warnings
warnings.filterwarnings("ignore")

## Part 1. Recap of Classical ML

In this part, we will follow the best practices of supervised learning to build a classification model using the decision tree algorithm.

We will execute the following steps:
- Load the dataset
- Split the dataset into training and test sets
- Hyper-parameter tuning through cross-validation using the training dataset
- Evaluate the best model with the best hyper-parameters using the test dataset
- Inference with unseen data

### 1.1. Load dataset

We will use a toy dataset, the Iris plants dataset, provided by `scikit-learn`.

There are four feature variables and one target variable in this dataset, which are:
- Feature
    - Sepal length in cm
    - Sepal width in cm
    - Petal length in cm
    - Petal width in cm
- Target
    - Class of iris
        - Iris-Setosa
        - Iris-Versicolour
        - Iris-Virginica
        
There are 150 labeled examples in the dataset.

Use `sklearn.datasets.load_iris()` to get this dataset.

In [None]:
# load dataset
feature_df, target_df = datasets.load_iris(
    return_X_y = True, # If True, returns (data.data, data.target) instead of a Bunch object.
    as_frame = True # If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric).
)

In [None]:
# display the first five rows of the features
feature_df.head()

In [None]:
# display the unique values of the target variable
target_df.unique()

### 1.2. Train test split

We will split the whole dataset into two parts: the training and test dataset.
- 70% for training
- 30% for test

Use `sklearn.model_selection.train_test_split()` to do this.

In [None]:
# train test split
X_train, X_test, y_train, y_test = train_test_split(
    feature_df.values, # call `.values` to convert the feature from pd.DataFrame to np.array
    target_df.values, # call `.values` to convert the target from pd.Series to np.array
    train_size = 0.7, # 70% for training, 30% for test
    random_state = 0 # controls the shuffling, set to zero for reproducibility
)

### 1.3. Hyper-parameter tuning through cross-validation

We will perform hyper-parameter tuning through cross-validation using `sklearn.model_selection.GridSearchCV()` on the training dataset.
- First, define the potential values of hyper-parameters for search
- Call `sklearn.model_selection.GridSearchCV()` to perform an exhaustive search
    - Use the weighted average of the F1-score as the evaluation metric
    - Set the number of folds to 5 for the cross-validation
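An exhaustive search fits one model per hyper-parameter combination and per fold, so the cost grows quickly with the size of the grid. As a rough sketch, `sklearn.model_selection.ParameterGrid` can count the candidates before the search is launched. The grid below mirrors the one used in this section (six hyper-parameters, three values each); the variable names `n_candidates`, `n_folds` and `n_fits` are only illustrative.

```python
from sklearn.model_selection import ParameterGrid

# a grid with six hyper-parameters and three candidate values each
param_dict = {
    'criterion': ['gini', 'entropy', 'log_loss'],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 3, 4],
    'min_samples_leaf': [1, 2, 3],
    'max_features': [None, 'sqrt', 'log2'],
    'min_impurity_decrease': [0.0, 0.1, 0.2]
}

# number of hyper-parameter combinations in the grid (3 ** 6)
n_candidates = len(ParameterGrid(param_dict))
# each combination is fitted once per cross-validation fold
n_folds = 5
n_fits = n_candidates * n_folds

print("Candidates:", n_candidates)
print("Total fits:", n_fits)
```

This count excludes the final refit on the whole training set, which adds one more fit when `refit` is `True`.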

In [None]:
# define the hyper-parameters to search
param_dict = {
    'criterion': ['gini', 'entropy', 'log_loss'],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 3, 4],
    'min_samples_leaf': [1, 2, 3],
    'max_features': [None, 'sqrt', 'log2'],
    'min_impurity_decrease': [0.0, 0.1, 0.2]
}

In [None]:
# hyper-parameter tuning through cross-validation
grid_clf = GridSearchCV(
    estimator = DecisionTreeClassifier(),
    param_grid = param_dict,
    scoring = 'f1_weighted',
    refit = True,
    cv = 5,
    verbose = 1,
    n_jobs = -1
)
grid_clf.fit(X_train, y_train)

In [None]:
# obtain the best hyper-parameters and the best score
print('Best hyper-parameters:', grid_clf.best_params_)
print('Best score:', grid_clf.best_score_)

### 1.4. Evaluation using the test dataset

After finding the best combination of hyper-parameter values, we can use the test dataset to evaluate the performance of the best model.

If the parameter `refit` was set to `True`, `GridSearchCV` has already refitted a model with the best hyper-parameters on the whole training dataset. There is no need to refit the model manually.
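The following is a minimal sketch of what the refit step provides, assuming a deliberately small grid so that it runs quickly. The names `search` and `best_tree` are illustrative and not part of this notebook. After fitting, the refitted model is exposed as `best_estimator_`, and calling `predict` on the search object delegates to it.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0
)

# small illustrative grid (an assumption, not the grid used above)
search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={'max_depth': [2, 3, None]},
    scoring='f1_weighted',
    refit=True,  # refit the best candidate on the whole training set
    cv=5
)
search.fit(X_train, y_train)

# the refitted model is stored in `best_estimator_`
best_tree = search.best_estimator_

# predicting through the search object gives the same result as
# predicting with the refitted estimator directly
same_predictions = bool(
    (search.predict(X_test) == best_tree.predict(X_test)).all()
)
print(type(best_tree).__name__, same_predictions)
```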

In [None]:
# predict categories for test dataset
y_pred = grid_clf.predict(X_test)

In [None]:
# obtain classification metrics using `classification_report`
print(classification_report(y_test, y_pred))

### 1.5. Inference with unseen data

After evaluation, if we are satisfied with the performance of the model, we can use the learned model to make predictions on unseen data. This step is called inference, and it can take two forms:
- Predict the category directly
- Predict the probability of belonging to a specific category
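One practical detail: scikit-learn estimators expect a 2D array of shape `(n_samples, n_features)`, so a single observation has to be reshaped into one row before prediction. The sketch below illustrates this with a default decision tree fitted on the full Iris data, which is an assumption made only to keep the example short; the names `model`, `x_new` and `x_row` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0).fit(X, y)

x_new = np.array([5.2, 3.1, 1.3, 0.2])  # one observation, shape (4,)
x_row = x_new.reshape(1, -1)            # one row, shape (1, 4)

label = model.predict(x_row)        # one predicted class per row
proba = model.predict_proba(x_row)  # one probability per class and row
print(label.shape, proba.shape)

# passing the 1D array directly is rejected with a ValueError
try:
    model.predict(x_new)
    raised = False
except ValueError:
    raised = True
print("1D input rejected:", raised)
```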

In [None]:
# inference with unseen data
X_unseen = np.array([5.2, 3.1, 1.3, 0.2])
print("The predicted category of the unseen data:", grid_clf.predict(X_unseen.reshape(-1, 4)))
print("The predicted probabilities:", grid_clf.predict_proba(X_unseen.reshape(-1, 4)))

## Part 2. The bagging method - Random forests

In this part, we will use the random forests algorithm, a bagging method for ensemble learning, to build a classifier with the same dataset.

A random forest is a meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and combines their predictions.

### 2.1. Train the random forest using the training dataset

To test whether a random forest can improve prediction accuracy compared to a single decision tree, we set the hyper-parameters of each decision tree in the random forest to the values found by the hyper-parameter tuning in Part 1.

The only difference is that the random forest fits 10 decision trees, each on a different bootstrap sample (drawn with replacement) of the training dataset.
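To make the idea concrete, here is a hand-written sketch of bagging with decision trees: each tree is fitted on its own bootstrap sample and the final class is chosen by majority vote. It is an illustration under simplifying assumptions, not the implementation used by `RandomForestClassifier`, which combines trees by averaging their predicted class probabilities. All variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0
)

rng = np.random.default_rng(0)
n_trees = 10
trees = []
for i in range(n_trees):
    # bootstrap sample: draw as many row indices as there are
    # training rows, with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # each tree also considers a random subset of features per split
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=i)
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# collect the votes of all trees, shape (n_trees, n_test_samples)
all_votes = np.stack([t.predict(X_test) for t in trees])

# majority vote per test sample (the class labels are 0, 1 and 2)
y_vote = np.array(
    [np.bincount(votes, minlength=3).argmax() for votes in all_votes.T]
)
accuracy = (y_vote == y_test).mean()
print("Accuracy of the hand-made ensemble:", accuracy)
```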

In [None]:
# create the random forest classifier
clf = RandomForestClassifier(
    n_estimators = 10, # set the number of decision trees to 10
    criterion = grid_clf.best_params_['criterion'],
    max_depth = grid_clf.best_params_['max_depth'],
    max_features = grid_clf.best_params_['max_features'],
    min_impurity_decrease = grid_clf.best_params_['min_impurity_decrease'],
    min_samples_leaf = grid_clf.best_params_['min_samples_leaf'],
    min_samples_split = grid_clf.best_params_['min_samples_split'],
    n_jobs = -1, # -1 means using all available processors for parallelized computation
    random_state = 0 # set random state to 0 for reproducibility
)

In [None]:
# fit the model to the training dataset
clf.fit(X_train, y_train)

### 2.2. Evaluate the random forest using the test dataset

After training, we can use the test dataset to evaluate the performance of the random forest and compare it with the performance of the single decision tree classifier.

In [None]:
# predict categories for test dataset
y_pred = clf.predict(X_test)

In [None]:
# obtain classification metrics using `classification_report`
print(classification_report(y_test, y_pred))

In general, the random forest is expected to perform at least as well as the single decision tree classifier, although on a small and easy dataset such as Iris the two scores can be very close.
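A single train/test split gives only one estimate of this difference. As a hedged sketch, the comparison can be repeated over several folds with `sklearn.model_selection.cross_val_score`. Default hyper-parameters and the full Iris data are used here as simplifying assumptions, so the numbers will not match the cells above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validated weighted F1-score of a single decision tree
tree_scores = cross_val_score(
    DecisionTreeClassifier(random_state=0),
    X, y, cv=5, scoring='f1_weighted'
)

# same protocol for a random forest with 10 trees
forest_scores = cross_val_score(
    RandomForestClassifier(n_estimators=10, random_state=0),
    X, y, cv=5, scoring='f1_weighted'
)

print("Tree:   mean %.3f, std %.3f" % (tree_scores.mean(), tree_scores.std()))
print("Forest: mean %.3f, std %.3f" % (forest_scores.mean(), forest_scores.std()))
```

Looking at the standard deviation next to the mean helps to judge whether a small gap between the two models is meaningful.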

### 2.3. Obtain feature importances

Random forests provide an importance score for each feature, which indicates how much that feature contributes to the model's predictions.

We can call the attribute `feature_importances_` to get this information.
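The attribute returns a plain array in the order of the input columns, which is hard to read on its own. A small sketch, assuming a forest with default hyper-parameters fitted on the full Iris data, pairs each score with its feature name and sorts the result. The names `forest` and `importances` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris(as_frame=True)
forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(iris.data, iris.target)

# label each importance with its feature name, most important first
importances = pd.Series(
    forest.feature_importances_,
    index=iris.data.columns
).sort_values(ascending=False)

print(importances)
```

The scores are normalized, so they sum to 1 across the features.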

In [None]:
# obtain feature importances
clf.feature_importances_

In [None]:
# display the boxplots of the 4 features for the 3 different classes
# merge the feature and target into one DataFrame
# work on a copy so that `feature_df` itself is not modified
df = feature_df.copy()
df['class'] = target_df
# create the figure
df.boxplot(
    column = ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'],
    by = 'class',
    grid = False,
    figsize = (8, 5),
    layout = (1, 4)
)
plt.show()

We can see that for different iris classes:
- Petal length and width are distributed in different intervals, so the importances of these two features are higher.
- The distribution intervals of sepal length overlap slightly, so the importance is low.
- The distribution intervals of sepal width overlap strongly, so the importance is close to zero.

### 2.4. Inference with unseen data

In [None]:
# predict the category and probabilities for the unseen data
print("The predicted category of the unseen data:", clf.predict(X_unseen.reshape(-1, 4)))
print("The predicted probabilities:", clf.predict_proba(X_unseen.reshape(-1, 4)))

## Part 3. Hands-on exercise

In this exercise, you are required to build a regression model using the random forests algorithm.

The problem to be solved is predicting the price of flights.

Please download the flight price dataset from Learn.

<span style="color:red">**[TBC]**</span> Please complete the following tasks:

- Load and explore the dataset
- Preprocess the dataset
- Build and evaluate a regression model using random forests with default hyper-parameters
- Hyper-parameter tuning through cross-validation for random forests

### Task 1. Load and explore the dataset

The dataset contains information about flight booking options from the website Easemytrip for flight travel between India's top 6 metro cities.

There are 300,261 data points and 11 features in the cleaned dataset.

More details can be found in this [link](https://www.kaggle.com/datasets/shubhambathwal/flight-price-prediction?resource=download#).

<span style="color:red">**[TBC]**</span> Please complete the following tasks:

- Load the dataset
- Obtain the general information of the dataset
- Obtain the number of unique values of each feature in the dataset

In [None]:
# [TBC] complete your code here with proper comments
# load the dataset
# hint: pandas.read_csv()


In [None]:
# [TBC] complete your code here with proper comments
# obtain the general information of the dataset
# hint: pandas.DataFrame.info()


In [None]:
# [TBC] complete your code here with proper comments
# obtain the number of unique values of each feature in the dataset
# hint: pandas.DataFrame.nunique()


### Task 2. Preprocess the dataset

You may find that some features in the dataset are categorical. Before training, you need to convert these categorical features into numerical variables.

<span style="color:red">**[TBC]**</span> Please complete the following tasks:

- Encode categorical features
- Train test split
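As a hint, the sketch below shows the mechanics of label encoding on a tiny made-up table. The column names (`carrier`, `cabin`, `price`) and values are hypothetical and are not taken from the flight price dataset. One encoder is kept per column so that the original categories can be recovered later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# toy data with hypothetical columns, for illustration only
toy_df = pd.DataFrame({
    'carrier': ['A', 'B', 'A', 'C'],
    'cabin': ['economy', 'business', 'economy', 'economy'],
    'price': [100.0, 450.0, 120.0, 90.0]
})
categorical_cols = ['carrier', 'cabin']

encoders = {}
encoded_df = toy_df.copy()
for col in categorical_cols:
    # fit one encoder per column and replace strings by integer codes
    encoders[col] = LabelEncoder()
    encoded_df[col] = encoders[col].fit_transform(toy_df[col])

print(encoded_df)
# the learned categories, in the order of their integer codes
print(list(encoders['cabin'].classes_))
```

Note that scikit-learn documents `LabelEncoder` as intended for target labels; `sklearn.preprocessing.OrdinalEncoder` is the feature-oriented counterpart and can encode several columns at once.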

In [None]:
# [TBC] complete your code here with proper comments
# Encode categorical features
# hint: sklearn.preprocessing.LabelEncoder()


In [None]:
# [TBC] complete your code here with proper comments
# Train test split
# hint: sklearn.model_selection.train_test_split()
# hint: first divide the encoded dataset into features and target, then perform train test split


### Task 3. Build and evaluate a regression model using random forests with default hyper-parameters

You need to train a regression model using the random forests algorithm with default hyper-parameters and evaluate its performance.

<span style="color:red">**[TBC]**</span> Please complete the following tasks:

- Train the regression model
- Evaluate the performance on test dataset

Warning: The difficulty with this task lies in the size of the data set. If your computer doesn't have enough memory to complete the task, you need to find workarounds.

- You may reduce the size of the training set, but be careful with the size of the test set. An overly large test set may cause a failure when you use the learned model to make predictions on it.
- You may adjust the hyper-parameters of the random forest to reduce the complexity of the model, which will reduce the memory required.
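The sketch below illustrates which hyper-parameters of `RandomForestRegressor` limit model size, using synthetic data from `sklearn.datasets.make_regression` as a stand-in for the real dataset. The specific values are arbitrary assumptions for demonstration, not recommendations for the flight price data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# synthetic stand-in data (not the flight price dataset)
X, y = make_regression(
    n_samples=2000, n_features=10, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0
)

reg = RandomForestRegressor(
    n_estimators=50,     # fewer trees than the default of 100
    max_depth=12,        # cap the depth of each tree
    min_samples_leaf=5,  # larger leaves give smaller trees
    max_samples=0.5,     # each tree sees half of the training rows
    random_state=0
)
reg.fit(X_train, y_train)

# evaluate with RMSE and the R2 score
y_pred = reg.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print("RMSE:", rmse)
print("R2:", r2)
```

Smaller trees usually trade some accuracy for memory, so it is worth checking the test metrics after each change.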

In [None]:
# [TBC] complete your code here with proper comments
# train the regression model
# hint: sklearn.ensemble.RandomForestRegressor()


In [None]:
# [TBC] complete your code here with proper comments
# evaluate the performance on test dataset
# hint: you may use RMSE and r2 score
# hint: you may also create a plot to visualize the predictions


### Task 4. Hyper-parameter tuning through cross-validation for random forests

You need to perform hyper-parameter tuning through cross-validation for random forests.

<span style="color:red">**[TBC]**</span> Please complete the following tasks:

- Hyper-parameter tuning through cross-validation
- Evaluate the performance of the final model on test dataset
- Get feature importances from the final model

Warning: The difficulty with this task lies in the size of the data set. If your computer doesn't have enough memory to complete the task, you need to find workarounds.
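One way to keep the search affordable is to combine a small grid, a small forest, and few folds. The following sketch shows the mechanics on synthetic data; the grid values, the data, and the names `search` and `best_forest` are all assumptions for illustration and would need to be adapted to the flight price dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# synthetic stand-in data (not the flight price dataset)
X, y = make_regression(
    n_samples=500, n_features=6, n_informative=3, noise=5.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0
)

search = GridSearchCV(
    estimator=RandomForestRegressor(n_estimators=20, random_state=0),
    param_grid={'max_depth': [4, 8], 'min_samples_leaf': [1, 5]},
    scoring='neg_root_mean_squared_error',  # higher (closer to 0) is better
    cv=3,
    refit=True
)
search.fit(X_train, y_train)

# the final model, refitted on the whole training set
best_forest = search.best_estimator_
importances = best_forest.feature_importances_

print("Best hyper-parameters:", search.best_params_)
print("Best CV score (negative RMSE):", search.best_score_)
print("Feature importances:", importances)
```

Because scikit-learn maximizes scores, the RMSE is reported with a negative sign; its absolute value is the cross-validated RMSE.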

In [None]:
# [TBC] complete your code here with proper comments
# Hyper-parameter tuning through cross-validation
# hint: sklearn.model_selection.GridSearchCV()


In [None]:
# [TBC] complete your code here with proper comments
# evaluate the performance of the final model on test dataset
# hint: you may use RMSE and r2 score
# hint: you may also create a plot to visualize the predictions


In [None]:
# [TBC] complete your code here with proper comments
# Get feature importances from the final model
# hint: you may first obtain the best estimator from GridSearchCV
# hint: then get feature importances from the best estimator
