### AdaBoost

This activity focuses on using the `AdaBoostClassifier` and on how its performance changes with the base classifier.  As discussed in the lectures, adaptive boosting fits a fixed number of estimators in sequence, reweighting the training data after each one so that later estimators concentrate on the examples that earlier ones misclassified.  These weighted estimators form the ensemble, and its prediction is a weighted combination of their individual predictions.
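
To make the reweighting idea concrete, here is a minimal sketch of a single boosting round. It assumes the classic binary form of AdaBoost with labels in {-1, +1} and a weighted error strictly between 0 and 1; the function name `adaboost_round` and the toy predictions are hypothetical, and scikit-learn's multiclass implementation differs in its details.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One round of discrete (binary) AdaBoost reweighting.

    Returns the estimator weight alpha and the renormalized sample weights.
    Assumes the weighted error is strictly between 0 and 1.
    """
    # Weighted error: total weight of the misclassified samples.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # Estimator weight: larger when the weak learner makes fewer weighted errors.
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Up-weight mistakes, down-weight correct predictions, then renormalize.
    new_w = [w * math.exp(alpha if t != p else -alpha)
             for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]

y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, -1, -1, 1]   # hypothetical weak learner: one mistake (index 1)
weights = [0.2] * 5           # start from uniform weights
alpha, weights = adaboost_round(weights, y_true, y_pred)
```

After this round the single misclassified sample carries half of the total weight, so the next estimator is pushed to get it right.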

- [Problem 1](#-Problem-1)
- [Problem 2](#-Problem-2)
- [Problem 3](#-Problem-3)
- [Problem 4](#-Problem-4)
- [Problem 5](#-Problem-5)

In [95]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, train_test_split

In [54]:
df = pd.read_csv('data/fetal.zip', compression = 'zip')

In [55]:
X = df.drop('fetal_health', axis = 1).values
y = df['fetal_health']

In [56]:
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                   random_state=42)

[Back to top](#-Index)

### Problem 1

#### `AdaBoostClassifier`

**10 Points**

What is the default estimator in the `AdaBoostClassifier`?  Instantiate it with the correct hyperparameters and assign it to `ans1` below.

In [61]:
### GRADED
ans1 = ''
    
### BEGIN SOLUTION
ans1 = DecisionTreeClassifier(max_depth=1)
### END SOLUTION

### ANSWER CHECK
ans1

DecisionTreeClassifier(max_depth=1)

In [62]:
### BEGIN HIDDEN TESTS
ans1_ = DecisionTreeClassifier(max_depth=1)
#
#
#
assert ans1.max_depth == ans1_.max_depth
### END HIDDEN TESTS

[Back to top](#-Index)

### Problem 2

#### Fitting the Ensemble

**10 Points**

Below, use the `AdaBoostClassifier` to fit the data.  Use all default settings and assign the accuracy of the model on the test data to `model_1_acc` below.

In [63]:
### GRADED
model_1 = ''
model_1_acc = ''
    
### BEGIN SOLUTION
model_1 = AdaBoostClassifier().fit(X_train, y_train)
model_1_acc = model_1.score(X_test, y_test)
### END SOLUTION

### ANSWER CHECK
print(model_1_acc)

0.881578947368421


In [64]:
### BEGIN HIDDEN TESTS
model_1_ = AdaBoostClassifier().fit(X_train, y_train)
model_1_acc_ = model_1_.score(X_test, y_test)
#
#
#
assert model_1_acc_ == model_1_acc
### END HIDDEN TESTS

[Back to top](#-Index)

### Problem 3

#### Grid Searching the Ensemble

**10 Points**

As the documentation states [here](https://scikit-learn.org/stable/modules/ensemble.html#usage), the main parameters to search are the number of estimators and the complexity of the base estimator.  Create a parameter grid that considers the following parameters:

- *number of estimators*: 100, 200
- *max_depths*: 1, 2, 3

as `params` below.  Use this grid with an `AdaBoostClassifier` in a grid search named `tree_grid`, fit on the training data.  Assign the score on the test data as `grid_acc`.
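
Parameters of the base estimator are reached through a double-underscore path. The sketch below only enumerates the grid and checks that each key is a valid parameter; it does not fit anything. It assumes that newer scikit-learn releases name the base-estimator parameter `estimator` while older ones use `base_estimator`, so it looks up which name the installed version exposes.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import ParameterGrid
from sklearn.tree import DecisionTreeClassifier

# Passing the base estimator positionally sidesteps the keyword rename.
ada = AdaBoostClassifier(DecisionTreeClassifier())

# Assumption: the parameter is called "estimator" in newer releases and
# "base_estimator" in older ones; check which one this install exposes.
prefix = "estimator" if "estimator" in ada.get_params(deep=False) else "base_estimator"

params = {
    "n_estimators": [100, 200],
    f"{prefix}__max_depth": [1, 2, 3],
}

# Every combination the grid search would try.
candidates = list(ParameterGrid(params))
for combo in candidates:
    ada.set_params(**combo)  # raises ValueError if a key is not a valid parameter
```

A dictionary shaped like `params` can then be passed as `param_grid` to `GridSearchCV`.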

This estimator will likely time out the grader.  To avoid this, you will save the fitted model to a file using the `joblib` library.  After fitting your estimator, save it to a file named `grid_model.joblib` **and place this file in the `models` folder**.  The grader will load this model and compare it to the expected score on the test data.  Look over the documentation of `joblib` [here](https://scikit-learn.org/stable/model_persistence.html#python-specific-serialization).  Assign the score of the model on the test data by loading the `.joblib` model and scoring it using the test data as `score` below.

> *Model persistence is an important idea.  Consider looking over the more general `pickle` library for serialization as well.*
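
As a rough illustration of the save-and-reload round trip, the sketch below fits a small model on synthetic stand-in data (not the fetal health dataset), writes it to a temporary directory, and loads it back. The file name `demo_model.joblib` and the variable names are hypothetical.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, used only to have something to fit.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=0)

fitted = AdaBoostClassifier(random_state=0).fit(Xtr, ytr)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")  # hypothetical file name
    dump(fitted, path)       # serialize the fitted estimator to disk
    restored = load(path)    # read it back as a new Python object

# The restored model should score exactly like the original.
same_score = restored.score(Xte, yte) == fitted.score(Xte, yte)
```

In the activity itself, the path would point into the `models` folder rather than a temporary directory.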

In [80]:
import os

In [81]:
from joblib import dump, load

In [90]:
### GRADED
model_files = os.listdir('models')
score = ''

### BEGIN SOLUTION
model_files = os.listdir('models')
s = load('models/soln_1.joblib')
score = s.score(X_test, y_test)
### END SOLUTION

### ANSWER CHECK
print(model_files)

['soln_1.joblib']


In [91]:
### BEGIN HIDDEN TESTS
a = load('models/soln_1.joblib')
ans = a.score(X_test, y_test)
#
#
#
assert ans == score
### END HIDDEN TESTS

[Back to top](#-Index)

### Problem 4

#### A Different Base Estimator

**10 Points**

Consider using a different base estimator, such as `LogisticRegression`.  Explore the regularization parameter `C` with

- `C = [.001, 0.01, 0.1, 1.0, 10.0]`

Create a `Pipeline` that scales the data first and then applies an `AdaBoostClassifier` with `random_state = 42` and a logistic regression base estimator.  Grid search the pipeline over these values of `C`, again writing the model out to a file named `lgr.joblib` in the `models` folder.  Assign the score on the test data to `score2`.
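
Inside a pipeline the parameter path gains one more level: step name, then the base-estimator parameter, then `C`. The sketch below shows one way this could look on small synthetic stand-in data. The step names `scale` and `mod` follow the commented-out hidden test in this notebook; `cv=3`, the synthetic data, and the lookup of `estimator` versus `base_estimator` (which depends on the installed scikit-learn version) are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the fetal health dataset.
X_demo, y_demo = make_classification(n_samples=150, n_features=6, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    # Base estimator passed positionally to avoid the keyword rename.
    ("mod", AdaBoostClassifier(LogisticRegression(), random_state=42)),
])

# Assumption: "estimator" in newer releases, "base_estimator" in older ones.
mod_params = pipe.named_steps["mod"].get_params(deep=False)
prefix = "estimator" if "estimator" in mod_params else "base_estimator"

# Path: pipeline step -> AdaBoost's base estimator -> its C parameter.
c_key = f"mod__{prefix}__C"
params = {c_key: [0.001, 0.01, 0.1, 1.0, 10.0]}

grid = GridSearchCV(pipe, param_grid=params, cv=3).fit(X_demo, y_demo)
best_C = grid.best_params_[c_key]
```

Because the scaler sits inside the pipeline, it is refit on each training fold during the search, which keeps the validation folds out of the scaling statistics.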


In [100]:
### GRADED
score2 = ''

    
### BEGIN SOLUTION
s = load('models/ans2.joblib')
score2 = s.score(X_test, y_test)
### END SOLUTION

### ANSWER CHECK
print(score2)

0.9078947368421053


In [103]:
### BEGIN HIDDEN TESTS
# params_ = {'mod__base_estimator__C': [.001, 0.01, 0.1, 1.0, 10.0]}
# p = Pipeline([('scale', StandardScaler()),
#              ('mod', AdaBoostClassifier(base_estimator = LogisticRegression(), 
#                                        random_state = 42))
#              ])
# g = GridSearchCV(p,
#                 param_grid=params_)
# g.fit(X_train, y_train)
a = load('models/ans2.joblib')
#
#
#
assert a.score(X_test, y_test) == score2
### END HIDDEN TESTS

[Back to top](#-Index)

### Problem 5

#### Evaluating the Models

**10 Points**

Which model performed the best on the test data?

- `a`: Base `AdaBoostClassifier`
- `b`: Grid Searched Tree Model
- `c`: Grid Searched Logistic Model
- `d`: None of the above

Assign your answer as a string to `ans5` below.

In [105]:
### GRADED
ans5 = ''
    
### BEGIN SOLUTION
ans5 = 'b'
### END SOLUTION

### ANSWER CHECK
print(ans5)

b


In [106]:
### BEGIN HIDDEN TESTS
ans5_ = 'b'
#
#
#
#
assert ans5 == ans5_
### END HIDDEN TESTS