### Codio Activity 21.4: Implementing Bagging

**Expected Time = 60 minutes**

**Total Points = 50**

This activity focuses on using the `BaggingClassifier`. You will use the scikit-learn implementation and compare its performance on the fetal health dataset with that of the other models in the module: Random Forests, Adaptive Boosting, and Gradient Boosting. The `BaggingClassifier` is a meta-estimator that aggregates base estimators, each fit on a random sample of the data. You will specify particular estimators and sampling schemes to become familiar with the estimator and with the variations its key arguments produce.

#### Index

- [Problem 1](#-Problem-1)
- [Problem 2](#-Problem-2)
- [Problem 3](#-Problem-3)
- [Problem 4](#-Problem-4)
- [Problem 5](#-Problem-5)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline


### Data and Documentation

Below the data is loaded and prepared.  For this exercise, you will be expected to consult the documentation on the `BaggingClassifier` [here](https://scikit-learn.org/stable/modules/ensemble.html#bagging-meta-estimator).  The vocabulary in each problem can be found in the documentation and you are expected to use the correct settings for the arguments as a result of reading the documentation.  For each model, be sure to set `random_state = 42`.    

In [2]:
df = pd.read_csv('data/fetal.zip', compression = 'zip')
X, y = df.drop('fetal_health', axis = 1), df['fetal_health']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

[Back to top](#-Index)

### Problem 1

#### Aggregating bootstrap models

**10 Points**

To start, create an ensemble of `DecisionTreeClassifier` models built on bootstrap samples of the data. Remember to set `random_state = 42`. *This is equivalent to the default model for `BaggingClassifier`.*

In [3]:
### GRADED
bagged_model = ''
bagged_score = ''
    
# YOUR CODE HERE
bagged_model = BaggingClassifier(random_state=42).fit(X_train, y_train)
bagged_score = bagged_model.score(X_test, y_test)

### ANSWER CHECK
print(bagged_score)

0.9511278195488722
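
The equivalence noted in Problem 1 can be checked directly. The sketch below is only an illustration: it uses synthetic data from `make_classification` as a stand-in for the fetal health data, and the names `X_demo`, `default_bag`, and `explicit_bag` are hypothetical. The tree is passed as the first positional argument because the keyword for the base estimator has changed name between scikit-learn releases.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; the activity uses the fetal health data).
X_demo, y_demo = make_classification(n_samples=300, n_features=12, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=0)

# Default model: no base estimator given, so a decision tree is used.
default_bag = BaggingClassifier(random_state=42).fit(Xd_train, yd_train)

# Same ensemble with the decision tree passed explicitly (first positional argument).
explicit_bag = BaggingClassifier(DecisionTreeClassifier(), random_state=42).fit(Xd_train, yd_train)

# With the same random_state, both ensembles should make identical predictions.
same_predictions = bool((default_bag.predict(Xd_test) == explicit_bag.predict(Xd_test)).all())
print(same_predictions)
```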


[Back to top](#-Index)

### Problem 2

#### Pasting vs. Bagging

**10 Points**

Consult the documentation [here](https://scikit-learn.org/stable/modules/ensemble.html#bagging-meta-estimator) and adjust the appropriate argument to change from **bagging** to **pasting**.  Create your model as `pasted_model` and score on the test data as `pasted_score`.  Be sure to set `random_state = 42`.

In [4]:
### GRADED
pasted_model = ''
pasted_score = ''
    
# YOUR CODE HERE
pasted_model = BaggingClassifier(random_state=42, bootstrap=False).fit(X_train, y_train)
pasted_score = pasted_model.score(X_test, y_test)

### ANSWER CHECK
print(pasted_score)

0.9379699248120301
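
The difference between bagging and pasting comes down to whether rows are drawn with or without replacement. The pure-Python sketch below illustrates the idea only; `draw_indices` is a hypothetical helper, not scikit-learn's internal sampler. It also suggests why pasting with the default `max_samples=1.0` appears to hand every estimator the full set of rows, so a smaller `max_samples` is what makes pasted estimators see different data.

```python
import random

def draw_indices(n_population, n_samples, bootstrap, seed):
    """Draw row indices with replacement (bagging) or without (pasting)."""
    rng = random.Random(seed)
    if bootstrap:
        # With replacement: the same row can appear more than once.
        return [rng.randrange(n_population) for _ in range(n_samples)]
    # Without replacement: every drawn row is distinct.
    return rng.sample(range(n_population), n_samples)

n_rows = 10
bagging_rows = draw_indices(n_rows, n_rows, bootstrap=True, seed=42)
pasting_rows = draw_indices(n_rows, n_rows, bootstrap=False, seed=42)
pasting_subset = draw_indices(n_rows, 6, bootstrap=False, seed=42)

print("bagging:", sorted(bagging_rows))
print("pasting, full size:", sorted(pasting_rows))
print("pasting, 6 rows:", sorted(pasting_subset))
```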


[Back to top](#-Index)

### Problem 3

#### Random Subspaces

**10 Points**

Consult the documentation [here](https://scikit-learn.org/stable/modules/ensemble.html#bagging-meta-estimator) and adjust the appropriate argument to change from **bagging** to **random subspaces** with at most 10 features sampled. Train this on the training data and score it on the test data. Create your model as `subspace_model` and score as `subspace_score`.

In [5]:
### GRADED
subspace_model = ''
subspace_score = ''
    
# YOUR CODE HERE
subspace_model = BaggingClassifier(random_state=42, bootstrap=False, max_features=10).fit(X_train, y_train)
subspace_score = subspace_model.score(X_test, y_test)

### ANSWER CHECK
print(subspace_score)

0.943609022556391
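
To see what random subspaces does in practice, the fitted ensemble's `estimators_features_` attribute lists the columns each base estimator was given. This is a minimal sketch on synthetic data from `make_classification` (an assumption, standing in for the fetal health data); `subspace_demo` and `feature_sets` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Synthetic stand-in data with 20 columns (an assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

subspace_demo = BaggingClassifier(
    random_state=42,
    bootstrap=False,   # rows are not bootstrapped, as in the solution above
    max_features=10,   # integer value: 10 columns drawn per estimator
).fit(X_demo, y_demo)

# One array of column indices per base estimator.
feature_sets = [sorted(int(i) for i in cols) for cols in subspace_demo.estimators_features_]
for cols in feature_sets[:3]:
    print(cols)
```

Each estimator trains on its own subset of columns, which is the source of diversity in this variant.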


[Back to top](#-Index)

### Problem 4

#### Random Patches

**10 Points**

Consult the documentation [here](https://scikit-learn.org/stable/modules/ensemble.html#bagging-meta-estimator) and adjust the appropriate arguments to change from **bagging** to **random patches**. Use no more than 30% of the data and no more than 10 features in your samples. Train this on the training data as `patches_model` and score it on the test data as `patches_score` below.

In [6]:
### GRADED
patches_model = ''
patches_score = ''
    
# YOUR CODE HERE
patches_model = BaggingClassifier(random_state=42, bootstrap=False, max_features=10, max_samples=0.3).fit(X_train, y_train)
patches_score = patches_model.score(X_test, y_test)

### ANSWER CHECK
print(patches_score)

0.9304511278195489
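
Random patches restricts both rows and columns, and the row limit can be checked through the `estimators_samples_` attribute. The sketch below again uses synthetic data (an assumption) and hypothetical names (`patches_demo`, `row_counts`, `col_counts`); it reports how many rows and columns each base estimator received.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Synthetic stand-in data: 200 rows, 20 columns (an assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

patches_demo = BaggingClassifier(
    random_state=42,
    bootstrap=False,   # draw rows without replacement
    max_samples=0.3,   # float value: a fraction of the rows
    max_features=10,   # integer value: a count of columns
).fit(X_demo, y_demo)

# Rows and columns seen by each base estimator.
row_counts = [len(rows) for rows in patches_demo.estimators_samples_]
col_counts = [len(cols) for cols in patches_demo.estimators_features_]
print(row_counts)
print(col_counts)
```

Note the mixed argument types: a float for `max_samples` is read as a fraction, while an integer for `max_features` is read as a count.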


[Back to top](#-Index)

### Problem 5

#### Nature of the Tree Models

**10 Points**

Consult the documentation [here](https://scikit-learn.org/stable/modules/ensemble.html#bagging-meta-estimator) and determine whether bagging typically works best with simple or complex tree models. Assign your answer to `ans5` as the string `simple` or `complex`.

In [7]:
### GRADED
ans5 = ''
    
# YOUR CODE HERE
ans5 = 'complex'

### ANSWER CHECK
print(ans5)

complex
