# Ensemble Systems
This notebook contains an exercise that helps you understand bagging, a very common form of ensemble learning.

## Authors
- Sahan Bulathwela

## Learning Outcomes
- **Fundamental Concepts:** Gain a deep understanding of the foundational concepts in bagging.
- **Weak vs. Strong Classifier:** Experience the performance difference between a single strong classifier and an ensemble system.
- **Random Forest Algorithm:** Understand the Random Forest algorithm as a bagging classifier.

## Task

This notebook builds a random forest classifier for the Breast Cancer Wisconsin dataset (a classification task).

## Data 

The breast cancer dataset is a classic and very easy binary classification dataset. A copy can be found [here](https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic).

## Libraries

In [1]:
!pip install pandas
!pip install scikit-learn



### Importing Libraries

In [2]:
# import libraries
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split

### Loading Data

We load the dataset here and partition it into a training and a test set.

In [3]:
# Download the data and load it into variables 
bc = datasets.load_breast_cancer()
X, y = bc.data, bc.target

# Partition the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)
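
Before modelling, it can help to confirm what the split produced. The following is a minimal sketch, not part of the original exercise, that reloads the data with the same settings and prints the dataset shape and the class counts in each partition. The use of `collections.Counter` is an assumption made here for illustration.

```python
from collections import Counter

from sklearn import datasets
from sklearn.model_selection import train_test_split

# Same loading and splitting settings as the cell above
bc = datasets.load_breast_cancer()
X, y = bc.data, bc.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1234
)

# Overall size of the dataset and the meaning of the labels 0 and 1
print("Samples, features:", X.shape)
print("Class names:", list(bc.target_names))

# Class counts in each partition
print("Train class counts:", Counter(y_train.tolist()))
print("Test class counts:", Counter(y_test.tolist()))
```

The partition sizes should agree with the `support` column in the classification reports that follow (455 training rows and 114 test rows).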

# Setting a baseline

Before we move on to building an ensemble, it is sensible to know how well we can do with a single model. This will set a competitive baseline for our ensemble model.

## Single model performance

We will use a single decision tree classifier here with default parameters.

Complete the code below to instantiate a decision tree model using the [`sklearn.tree.DecisionTreeClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) class. Import the required class and instantiate it with the variable name `dtc`. Make sure you set the `random_state` parameter to 42 to keep the results reproducible.

In [9]:
from sklearn.tree import DecisionTreeClassifier

def create_dtc():
    """
    Instantiates a decision tree classifier

    Returns:
        dtc: Decision Tree Classifier
    """
    # Insert Your Code Here
    dtc = DecisionTreeClassifier(random_state=42)
    return dtc

dtc = create_dtc()

In [10]:
dtc

Let us now evaluate the performance of our model. Use the [`sklearn.metrics.classification_report`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) helper function to get the critical classification metrics.
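
To make the columns of the report concrete, here is a small sketch on hand-made labels (the `y_true_toy` and `y_pred_toy` arrays are invented for illustration and are not from the dataset). It shows how precision and recall for each class follow from the confusion matrix.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Toy labels, made up purely for illustration
y_true_toy = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred_toy = [0, 0, 1, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy)
print(cm)

# Precision for class 0: correct class-0 predictions / all class-0 predictions
precision_0 = cm[0, 0] / cm[:, 0].sum()
# Recall for class 0: correct class-0 predictions / all true class-0 samples
recall_0 = cm[0, 0] / cm[0, :].sum()
print("precision (class 0):", precision_0)
print("recall (class 0):", recall_0)

# The same numbers as reported by the helper functions
report = classification_report(y_true_toy, y_pred_toy, output_dict=True)
toy_accuracy = accuracy_score(y_true_toy, y_pred_toy)
print(classification_report(y_true_toy, y_pred_toy))
```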

In [11]:
from sklearn.metrics import classification_report, accuracy_score

# train the model
dtc.fit(X_train, y_train)

# print accuracy on the train data
y_train_pred = dtc.predict(X_train)
print("Training Accuracy:", accuracy_score(y_train, y_train_pred))
print("\nTraining Classification Report:")
print(classification_report(y_train, y_train_pred))

# evaluate performance on the test data
y_test_pred = dtc.predict(X_test)
print("Test Accuracy:", accuracy_score(y_test, y_test_pred))
print("\nTest Classification Report:")
print(classification_report(y_test, y_test_pred))


Training Accuracy: 1.0

Training Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       167
           1       1.00      1.00      1.00       288

    accuracy                           1.00       455
   macro avg       1.00      1.00      1.00       455
weighted avg       1.00      1.00      1.00       455

Test Accuracy: 0.8947368421052632

Test Classification Report:
              precision    recall  f1-score   support

           0       0.90      0.82      0.86        45
           1       0.89      0.94      0.92        69

    accuracy                           0.89       114
   macro avg       0.90      0.88      0.89       114
weighted avg       0.90      0.89      0.89       114



# Building a bagging ensemble

Now, let's attempt to observe the effect a simple bagging ensemble has.

`sklearn` implements a helper class that handles this for us. It is called [`sklearn.ensemble.BaggingClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html).

This class takes several parameters that can be tuned. The main parameters of interest are:

`estimator` - the base model to use.

`n_estimators` - the number of base models in the ensemble.

`max_samples` - the number (integer) or fraction (float) of training samples drawn to train each base model.

`random_state` - setting this parameter ensures reproducibility of the result.
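
To see what these parameters control, the sketch below builds a bagging ensemble by hand: each tree is trained on a bootstrap sample (rows drawn with replacement) and the trees then vote on the test set. This is a simplified illustration under stated assumptions, not how `BaggingClassifier` is implemented. In particular, it uses a hard majority vote, whereas scikit-learn averages predicted probabilities when the base model provides them. The names `rng`, `n_draw`, `trees` and `votes` are introduced here for illustration only.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1234
)

rng = np.random.default_rng(42)
n_estimators = 10                   # plays the role of `n_estimators`
n_draw = int(0.4 * len(X_train))    # plays the role of `max_samples=0.4`

trees = []
for _ in range(n_estimators):
    # Bootstrap: draw row indices with replacement
    idx = rng.integers(0, len(X_train), size=n_draw)
    tree = DecisionTreeClassifier(random_state=int(rng.integers(0, 2**31 - 1)))
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# Each row of `votes` holds one tree's 0/1 predictions for the test set
votes = np.stack([t.predict(X_test) for t in trees])

# Majority vote; an exact tie (mean of 0.5) falls to class 0 in this sketch
y_pred = (votes.mean(axis=0) > 0.5).astype(int)
manual_accuracy = (y_pred == y_test).mean()
print("Manual bagging test accuracy:", manual_accuracy)
```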

Create a bagging classifier named `dtc_ensemble`, using `DecisionTreeClassifier()` as the `estimator`. Make sure that 40% of the training data is used in training each of the base learners. Keep the number of estimators at 10 and set `random_state` to 42.

In [16]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

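dtc_ensemble = BaggingClassifier(
    estimator=DecisionTreeClassifier(random_state=42),  # base learner
    n_estimators=100,     # number of trees (the task text asks for 10)
    max_samples=0.3,      # fraction of training samples per base learner (the task text asks for 0.4)
    bootstrap=True,       # sample with replacement (the default)
    random_state=42,
    n_jobs=-1             # use all available CPU cores
)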

dtc_ensemble

Once you have created this ensemble, it works the same way as any other model in `sklearn`. You can use methods like `.fit()`, `.score()` and `.predict()`, just as with a typical scikit-learn model instance.

Repeat the training/scoring/classification report process from the single model approach you did previously on this new ensemble model.

Train only on the training data and get the classification report based on the test data.

What do you observe now?

In [17]:
from sklearn.metrics import classification_report, accuracy_score

# train the model
dtc_ensemble.fit(X_train, y_train)

# print accuracy on the train data
y_tr_pred = dtc_ensemble.predict(X_train)
print("Training Accuracy (Bagging):", accuracy_score(y_train, y_tr_pred))
print("\nTraining Classification Report (Bagging):")
print(classification_report(y_train, y_tr_pred))

# evaluate performance on the test data
y_te_pred = dtc_ensemble.predict(X_test)
print("Test Accuracy (Bagging):", accuracy_score(y_test, y_te_pred))
print("\nTest Classification Report (Bagging):")
print(classification_report(y_test, y_te_pred))

Training Accuracy (Bagging): 0.9802197802197802

Training Classification Report (Bagging):
              precision    recall  f1-score   support

           0       0.98      0.97      0.97       167
           1       0.98      0.99      0.98       288

    accuracy                           0.98       455
   macro avg       0.98      0.98      0.98       455
weighted avg       0.98      0.98      0.98       455

Test Accuracy (Bagging): 0.9035087719298246

Test Classification Report (Bagging):
              precision    recall  f1-score   support

           0       0.90      0.84      0.87        45
           1       0.90      0.94      0.92        69

    accuracy                           0.90       114
   macro avg       0.90      0.89      0.90       114
weighted avg       0.90      0.90      0.90       114



# Adapting the bagging ensemble to be a Random Forest

What turns a vanilla bagging classifier into a Random Forest is how features are chosen at each split of each decision tree. A typical bagging classifier only subsamples the data that goes to each weak learner; the random forest algorithm additionally samples the features considered at each split, which adds to the diversity of the trees.

The number of features considered at each split can be controlled using the `max_features` hyperparameter of the decision tree model. You can learn more about how to set this hyperparameter in the [`sklearn.tree.DecisionTreeClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) documentation.
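
`max_features` accepts either an integer count or a float fraction. The short sketch below, with variable names chosen here for illustration, fits one tree with each form and reads back the resolved value from the fitted `max_features_` attribute.

```python
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_breast_cancer(return_X_y=True)
n_features = X.shape[1]

# Integer form: an explicit number of features to consider at each split
tree_int = DecisionTreeClassifier(
    max_features=int(0.7 * n_features), random_state=42
).fit(X, y)

# Float form: a fraction of the total number of features
tree_frac = DecisionTreeClassifier(max_features=0.7, random_state=42).fit(X, y)

print("Total features:", n_features)
print("Resolved max_features (int form):", tree_int.max_features_)
print("Resolved max_features (float form):", tree_frac.max_features_)
```

Both forms should resolve to the same count on this dataset, so which one to use is mostly a matter of readability.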

To create the ideal base model for a random forest, implement the following function (similar to the `create_dtc` function above), where the maximum number of features used is 70% of the number of features in the dataset. 

In [19]:
from sklearn.tree import DecisionTreeClassifier

def create_rf_dtc():
    """
    Instantiates a decision tree classifier for creating a random forest

    Returns:
        dtc: Decision Tree Classifier
    """
    # Total number of features in the dataset
    n_features = X.shape[1]

    # max_features = 70% of the total number of features
    max_features = int(0.7 * n_features)

    # Decision Tree Classifier
    dtc = DecisionTreeClassifier(max_features=max_features, random_state=42)  # fixed random seed
    
    return dtc

fr_dtc = create_rf_dtc()
fr_dtc

## Build a Random Forest

Now let's build a Random Forest classifier using the bagging classifier. The only change from our previous bagging implementation is to use the base model from the new function, in which each tree considers a random 70% of the features at each split.

Let us plug the new base model into a new bagging classifier, which makes it a random forest classifier. We name this model `dtc_rf_ensemble`.


In [32]:
from sklearn.ensemble import BaggingClassifier


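# fr_dtc = create_rf_dtc()  # uncomment to create a fresh base estimator

dtc_rf_ensemble = BaggingClassifier(
    estimator=fr_dtc,    # tree that considers 70% of the features at each split
    n_estimators=10,     # number of trees
    max_samples=0.3,     # fraction of training samples per tree
    bootstrap=True,      # sample with replacement
    random_state=42,
    n_jobs=-1            # use all available CPU cores
)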
dtc_rf_ensemble

Repeat the training/scoring/classification report process you did previously on this new ensemble model.

Train only on the training data and get the classification report based on the test data.

What do you observe now?

In [33]:
from sklearn.metrics import accuracy_score, classification_report

# train the model
dtc_rf_ensemble.fit(X_train, y_train)

# print accuracy on the train data
y_tr_pred = dtc_rf_ensemble.predict(X_train)
print("Training Accuracy (RF via Bagging):", accuracy_score(y_train, y_tr_pred))
print("\nTraining Classification Report (RF via Bagging):")
print(classification_report(y_train, y_tr_pred))

# evaluate performance on the test data
y_te_pred = dtc_rf_ensemble.predict(X_test)
print("Test Accuracy (RF via Bagging):", accuracy_score(y_test, y_te_pred))
print("\nTest Classification Report (RF via Bagging):")
print(classification_report(y_test, y_te_pred))

Training Accuracy (RF via Bagging): 0.978021978021978

Training Classification Report (RF via Bagging):
              precision    recall  f1-score   support

           0       0.97      0.97      0.97       167
           1       0.98      0.98      0.98       288

    accuracy                           0.98       455
   macro avg       0.98      0.98      0.98       455
weighted avg       0.98      0.98      0.98       455

Test Accuracy (RF via Bagging): 0.9035087719298246

Test Classification Report (RF via Bagging):
              precision    recall  f1-score   support

           0       0.90      0.84      0.87        45
           1       0.90      0.94      0.92        69

    accuracy                           0.90       114
   macro avg       0.90      0.89      0.90       114
weighted avg       0.90      0.90      0.90       114



# Compare performance with Random Forest

Now let us recreate the same classifier with scikit-learn's built-in `RandomForestClassifier` to compare the results. Use [`sklearn.ensemble.RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) to implement a Random Forest classifier similar to the one above. Let's call this model `rf`.
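
One caveat when comparing the two: `dtc_rf_ensemble` trained each tree on 30% of the training rows (`max_samples=0.3`), whereas `RandomForestClassifier` by default draws a bootstrap sample as large as the training set. `RandomForestClassifier` has its own `max_samples` parameter, so a more closely matched configuration might look like the sketch below. The name `rf_matched` is hypothetical and is not part of the exercise.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1234
)

# Hypothetical variant that also limits the rows drawn per tree
rf_matched = RandomForestClassifier(
    n_estimators=10,
    max_features=0.7,   # fraction of features considered at each split
    max_samples=0.3,    # fraction of training rows drawn for each tree
    bootstrap=True,     # max_samples only applies when bootstrap is enabled
    random_state=42,
)
rf_matched.fit(X_train, y_train)
matched_accuracy = rf_matched.score(X_test, y_test)
print("Test accuracy (matched RandomForest):", matched_accuracy)
```

Even with matched settings, the two models use their random seeds differently, so identical scores should not be expected.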

In [34]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=10,  
    max_features=0.7, 
    bootstrap=True,   
    random_state=42,
    n_jobs=-1        
)
rf

Repeat the training/scoring/classification report process you did previously on this new ensemble model.

Train only on the training data and get the classification report based on the test data.

What do you observe now?

In [35]:
from sklearn.metrics import accuracy_score, classification_report

# train the model
rf.fit(X_train, y_train)

# print accuracy on the train data
y_tr_pred = rf.predict(X_train)
print("Training Accuracy (RandomForest):", accuracy_score(y_train, y_tr_pred))
print("\nTraining Classification Report (RandomForest):")
print(classification_report(y_train, y_tr_pred))

# evaluate performance on the test data
y_te_pred = rf.predict(X_test)
print("Test Accuracy (RandomForest):", accuracy_score(y_test, y_te_pred))
print("\nTest Classification Report (RandomForest):")
print(classification_report(y_test, y_te_pred))

Training Accuracy (RandomForest): 0.9956043956043956

Training Classification Report (RandomForest):
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       167
           1       1.00      1.00      1.00       288

    accuracy                           1.00       455
   macro avg       1.00      1.00      1.00       455
weighted avg       1.00      1.00      1.00       455

Test Accuracy (RandomForest): 0.9298245614035088

Test Classification Report (RandomForest):
              precision    recall  f1-score   support

           0       0.93      0.89      0.91        45
           1       0.93      0.96      0.94        69

    accuracy                           0.93       114
   macro avg       0.93      0.92      0.93       114
weighted avg       0.93      0.93      0.93       114

