In [1]:
# Run this cell first: it builds the path to the datasets folder used by the imports below
import os
cwd = os.path.dirname(os.getcwd())+'/'
path_data = os.path.join(os.path.dirname(os.getcwd()), 'datasets/')


In [2]:
import pandas as pd
data = pd.read_csv(path_data+'Pokemon.csv')
data.head()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False


In [8]:
from sklearn.model_selection import train_test_split as tts
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score, confusion_matrix

X = data.iloc[:, 4:-2]
y = data.iloc[:, -1].astype(int)  # Legendary flag as 0/1

X_train, X_test, y_train, y_test = tts(X, y, test_size=0.2, random_state=42)



In [9]:
# exercise 01

"""
Restricted and unrestricted decision trees

For this exercise, we will revisit the Pokémon dataset from the last chapter. Recall that the goal is to predict whether or not a given Pokémon is legendary.

Here, you will build two separate decision tree classifiers. In the first, you will specify the parameters min_samples_leaf and min_samples_split, but not a maximum depth, so that the tree can fully develop without any restrictions.

In the second, you will specify some constraints by limiting the depth of the decision tree. By then comparing the two models, you'll better understand the notion of a "weak" learner.
"""

# Instructions

"""
Build an unrestricted decision tree using the parameters min_samples_leaf=3, min_samples_split=9, and random_state=500.
---
Build a restricted tree by replacing min_samples_leaf and min_samples_split with max_depth=4 and max_features=2.
"""

# solution

# Build unrestricted decision tree
clf = DecisionTreeClassifier(min_samples_leaf=3, min_samples_split=9, random_state=500)
clf.fit(X_train, y_train)

# Predict the labels
pred = clf.predict(X_test)

# Print the confusion matrix
cm = confusion_matrix(y_test, pred)
print('Confusion matrix:\n', cm)

# Print the F1 score
score = f1_score(y_test, pred)
print('F1-Score: {:.3f}'.format(score))

#----------------------------------#

# Build restricted decision tree
clf = DecisionTreeClassifier(max_depth=4, max_features=2, random_state=500)
clf.fit(X_train, y_train)

# Predict the labels
pred = clf.predict(X_test)

# Print the confusion matrix
cm = confusion_matrix(y_test, pred)
print('Confusion matrix:\n', cm)

# Print the F1 score
score = f1_score(y_test, pred)
print('F1-Score: {:.3f}'.format(score))

#----------------------------------#

# Conclusion

"""
Well done! Notice how the restricted decision tree performs worse: its F1 score drops from 0.667 to 0.583. A tree constrained this way is what we call a "weak" learner.
"""

Confusion matrix:
 [[144   6]
 [  2   8]]
F1-Score: 0.667
Confusion matrix:
 [[143   7]
 [  3   7]]
F1-Score: 0.583


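The two F1 scores can be checked by hand from the printed confusion matrices. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predicted labels, with class 1 in the second row and column). The helper name f1_from_confusion is hypothetical and not part of the exercise.

```python
def f1_from_confusion(cm):
    """Return the F1 score of the positive class from a 2x2 confusion matrix.

    Assumes the scikit-learn layout [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Matrices copied from the output of exercise 01
cm_unrestricted = [[144, 6], [2, 8]]
cm_restricted = [[143, 7], [3, 7]]

f1_unrestricted = f1_from_confusion(cm_unrestricted)
f1_restricted = f1_from_confusion(cm_restricted)
print('Unrestricted: {:.3f}'.format(f1_unrestricted))
print('Restricted:   {:.3f}'.format(f1_restricted))
```

Both trees find most of the ten legendary Pokémon in the test set; the restricted tree simply misses one more and raises one more false alarm.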

In [1]:
# exercise 02

"""
"Weak" decision tree

In the previous exercise you built two decision trees. Which one is fine-tuned and which one is "weak"?

Decision tree "A":

    min_samples_leaf = 3 and min_samples_split = 9
    F1-Score: ~67%

Decision tree "B":

    max_depth = 4 and max_features = 2
    F1-Score: ~58%

Both classifiers are available for you as clf_A and clf_B.
"""

# Instructions

"""
Possible answers:
    
    Model A is "weak" while model B is fine-tuned.
    
    Model A is fine-tuned while model B is "weak". {Answer}
    
    Both models are fine-tuned with optimal parameters.
    
    Both models are "weak", as they are restricted.
"""

# solution



#----------------------------------#

# Conclusion

"""
Correct choice! Model A is a fine-tuned decision tree with decent performance on its own. Model B is 'weak': its depth is restricted and its F1 score is noticeably lower.
"""


In [10]:
# exercise 03

"""
Training with bootstrapping

Let's now build a "weak" decision tree classifier and train it on a sample of the training set drawn with replacement. This will help you understand what happens on every iteration of a bagging ensemble.

To take a sample, you'll use pandas' .sample() method, which has a replace parameter. For example, the following line of code samples with replacement from the whole DataFrame df:

df.sample(frac=1.0, replace=True, random_state=42)

"""

# Instructions

"""

    Take a sample drawn with replacement (replace=True) from the whole (frac=1.0) training set, X_train.
    Build a decision tree classifier using the parameter max_depth = 4.
    Fit the model to the sampled training data.


"""

# solution

# Take a sample with replacement
X_train_sample = X_train.sample(frac=1.0, replace=True, random_state=42)
y_train_sample = y_train.loc[X_train_sample.index]

# Build a "weak" Decision Tree classifier
clf = DecisionTreeClassifier(max_depth=4, random_state=500)

# Fit the model to the training sample
clf.fit(X_train_sample, y_train_sample)

#----------------------------------#

# Conclusion

"""
Nicely done! You took a sample with replacement from the training set and built a decision tree with it. This represents one iteration of a bagging ensemble.
"""

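To make the effect of sampling with replacement concrete, here is a minimal sketch on toy data. X_toy and y_toy are hypothetical stand-ins for X_train and y_train. A bootstrap sample has as many rows as the original, but only about 63% of the distinct rows are expected to appear in it (1 - 1/e for large samples); the rest are never drawn.

```python
import pandas as pd

# Toy stand-ins for X_train / y_train (hypothetical data, 1000 rows)
X_toy = pd.DataFrame({'feature': range(1000)})
y_toy = pd.Series([i % 2 for i in range(1000)], index=X_toy.index)

# Draw a bootstrap sample: same size as the original, with replacement
X_boot = X_toy.sample(frac=1.0, replace=True, random_state=42)
# .loc with the sampled index repeats labels exactly as the rows are repeated
y_boot = y_toy.loc[X_boot.index]

# Fraction of distinct original rows that made it into the sample
unique_fraction = X_boot.index.nunique() / len(X_toy)
# Rows that were never drawn for this sample
unseen_index = X_toy.index.difference(X_boot.index)

print('Rows in sample:       ', len(X_boot))
print('Unique rows fraction:  {:.3f}'.format(unique_fraction))
print('Rows never drawn:     ', len(unseen_index))
```

Because every tree in a bagging ensemble gets a different sample like this one, each tree sees a slightly different view of the training data.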

In [31]:
import numpy as np
import scipy.stats as stats

def build_decision_tree(X_train, y_train, random_state=None):
	# Take a sample with replacement
	X_train_sample = X_train.sample(frac=1.0, replace=True, random_state=random_state)
	y_train_sample = y_train.loc[X_train_sample.index]

	# Build a "weak" Decision Tree classifier
	clf = DecisionTreeClassifier(max_depth=4, random_state=500)

	# Fit the model on the training sample
	clf.fit(X_train_sample, y_train_sample)
	
	return clf

def predict_voting(classifiers, X):
	# Make the individual predictions
	pred_list = [clf.predict(X) for clf in classifiers]
	# Combine the predictions using "Voting"
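	pred_vote = []
	for i in range(X.shape[0]):
		individual_preds = np.array([pred[i] for pred in pred_list])
		# stats.mode returns a scalar or a 1-element array depending on the
		# SciPy version, so flatten it and keep a plain scalar label
		combined_pred = np.ravel(stats.mode(individual_preds)[0])[0]
		pred_vote.append(combined_pred)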
	
	return pred_vote
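
To see what the voting step does, here is a small self-contained sketch with made-up predictions from three hypothetical classifiers. It counts votes with np.unique instead of stats.mode, purely for illustration; majority_vote and pred_array are assumed names, not part of the exercise.

```python
import numpy as np

# Made-up predictions: one row per classifier, one column per sample
pred_array = np.array([
    [0, 1, 1, 0],   # classifier 1
    [0, 0, 1, 1],   # classifier 2
    [1, 1, 1, 0],   # classifier 3
])

def majority_vote(column):
    """Return the most frequent label in a 1-D array of votes."""
    labels, counts = np.unique(column, return_counts=True)
    return labels[np.argmax(counts)]

# Vote sample by sample (column by column)
pred_vote = np.array([majority_vote(pred_array[:, j])
                      for j in range(pred_array.shape[1])])
print(pred_vote)
```

With two classes and an odd number of voters a tie cannot occur, which may be one practical reason to pick an odd ensemble size.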


In [32]:
# exercise 04

"""
A first attempt at bagging

You've seen what happens in a single iteration of a bagging ensemble. Now let's build a custom bagging model!

Two functions have been prepared for you:

def build_decision_tree(X_train, y_train, random_state=None):
    # Takes a sample with replacement,
    # builds a "weak" decision tree,
    # and fits it to the train set

def predict_voting(classifiers, X_test):
    # Makes the individual predictions 
    # and then combines them using "Voting"

Technically, the build_decision_tree() function is what you did in the previous exercise. Here, you will build multiple such trees and then combine them. Let's see if this ensemble of "weak" models improves performance!
"""

# Instructions

"""

    Build the individual models by calling build_decision_tree(), passing the training set and the index i as the random state.
    Predict the labels of the test set using predict_voting(), with the list of classifiers clf_list and the input test features.

"""

# solution

# Build the list of individual models
clf_list = []
for i in range(21):
	weak_dt = build_decision_tree(X_train, y_train, random_state=i)
	clf_list.append(weak_dt)

# Predict on the test set
pred = predict_voting(clf_list, X_test)

# Print the F1 score
print('F1 score: {:.3f}'.format(f1_score(y_test, pred)))

#----------------------------------#

# Conclusion

"""
Excellent! You just built a custom bagging ensemble. It performed better than a single 'weak' model, and you only used 21 of them! Now that you have an intuition for how bagging ensembles work under the hood, let's learn how to build them using the scikit-learn framework.
"""

F1 score: 0.800

In [33]:
# exercise 05

"""
Bagging: the scikit-learn way

Let's now apply scikit-learn's BaggingClassifier to the Pokémon dataset.

You obtained an F1 score of around 0.80 with your custom bagging ensemble.

Will BaggingClassifier() beat it? Time to find out!
"""

# Instructions

"""

    Instantiate the base model, clf_dt: a "restricted" decision tree with a max depth of 4.
    Build a bagging classifier using 21 estimators, with the decision tree as base estimator.
    Predict the labels of the test set.

"""

# solution
from sklearn.ensemble import BaggingClassifier
# Instantiate the base model
clf_dt = DecisionTreeClassifier(max_depth=4)

# Build and train the Bagging classifier
clf_bag = BaggingClassifier(
  clf_dt,
  n_estimators=21,
  random_state=500)
clf_bag.fit(X_train, y_train)

# Predict the labels of the test set
pred = clf_bag.predict(X_test)

# Show the F1-score
print('F1-Score: {:.3f}'.format(f1_score(y_test, pred)))

#----------------------------------#

# Conclusion

"""
You just built a bagging classifier using the scikit-learn framework. Well done! It matched the performance of our custom ensemble (an F1 score of 0.80 for both), again using only 21 estimators!
"""

F1-Score: 0.800


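One difference from the custom ensemble is worth knowing: when the base estimator provides predict_proba, scikit-learn's BaggingClassifier averages the predicted class probabilities rather than counting hard votes. The sketch below checks this on synthetic data (X_demo and y_demo are hypothetical stand-ins, not the Pokémon dataset), using the fitted estimators_ and estimators_features_ attributes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Pokémon dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=0)

clf_demo = BaggingClassifier(
    DecisionTreeClassifier(max_depth=4),
    n_estimators=21,
    random_state=500)
clf_demo.fit(X_demo, y_demo)

# Average the class probabilities of the individual fitted trees,
# giving each tree the feature columns recorded for it
mean_proba = np.mean(
    [tree.predict_proba(X_demo[:, feats])
     for tree, feats in zip(clf_demo.estimators_, clf_demo.estimators_features_)],
    axis=0)
manual_pred = clf_demo.classes_[np.argmax(mean_proba, axis=1)]

# The manual average should match the ensemble's own probabilities
matches = np.allclose(mean_proba, clf_demo.predict_proba(X_demo))
print('Manual average matches predict_proba:', matches)
```

Averaging probabilities and hard voting usually agree, but they can differ on borderline samples, so small gaps between a custom voting ensemble and BaggingClassifier are not surprising.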

In [35]:
# exercise 06

"""
Checking the out-of-bag score

Let's now check the out-of-bag score for the model from the previous exercise.

So far you've used the F1 score to measure performance. However, in this exercise you should use the accuracy score so that you can easily compare it to the out-of-bag score.

The Pokémon dataset is already loaded for you and split into train and test sets. The decision tree classifier from the previous exercise, clf_dt, is also available in your workspace to use as the base estimator.
"""

# Instructions

"""

    Build the bagging classifier using the decision tree as base estimator and 21 estimators. This time, use the out-of-bag score by specifying an argument for the oob_score parameter.
    Print the classifier's out-of-bag score.

"""

# solution
from sklearn.metrics import accuracy_score
# Build and train the bagging classifier
clf_bag = BaggingClassifier(
  clf_dt,
  n_estimators=21,
  oob_score=True,
  random_state=500)
clf_bag.fit(X_train, y_train)

# Print the out-of-bag score
print('OOB-Score: {:.3f}'.format(clf_bag.oob_score_))

# Evaluate the performance on the test set to compare
pred = clf_bag.predict(X_test)
print('Accuracy: {:.3f}'.format(accuracy_score(y_test, pred)))

#----------------------------------#

# Conclusion

"""
Both scores are close and above 90%! Now you know how to use the out-of-bag score for bagging ensemble models. Great work! Let's now learn a few more bagging tips and tricks.
"""

OOB-Score: 0.942
Accuracy: 0.969
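
The out-of-bag score can be reproduced by hand, which shows what it measures. In the sketch below (synthetic data; X_demo, y_demo and clf_demo are hypothetical names), each row of oob_decision_function_ holds class probabilities averaged only over the trees that did not see that sample during training. Taking the most likely class per row and scoring it with accuracy should give back oob_score_.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Pokémon dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=0)

clf_demo = BaggingClassifier(
    DecisionTreeClassifier(max_depth=4),
    n_estimators=50,
    oob_score=True,
    random_state=500)
clf_demo.fit(X_demo, y_demo)

# Class probabilities per training sample, averaged over the trees
# whose bootstrap sample did not contain that sample
oob_proba = clf_demo.oob_decision_function_
oob_pred = clf_demo.classes_[np.argmax(oob_proba, axis=1)]

manual_oob = accuracy_score(y_demo, oob_pred)
print('Manual OOB accuracy: {:.3f}'.format(manual_oob))
print('oob_score_:          {:.3f}'.format(clf_demo.oob_score_))
```

Since the out-of-bag score is an accuracy computed on samples each tree has not seen, it behaves like a built-in validation score and needs no separate hold-out set.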

In [49]:
uci_secom = pd.read_csv(path_data+'uci-secom.csv')
uci_secom.head(), uci_secom.shape

(                  Time        0        1          2          3       4      5  \
 0  2008-07-19 11:55:00  3030.93  2564.00  2187.7333  1411.1265  1.3602  100.0   
 1  2008-07-19 12:32:00  3095.78  2465.14  2230.4222  1463.6606  0.8294  100.0   
 2  2008-07-19 13:17:00  2932.61  2559.94  2186.4111  1698.0172  1.5102  100.0   
 3  2008-07-19 14:43:00  2988.72  2479.90  2199.0333   909.7926  1.3204  100.0   
 4  2008-07-19 15:22:00  3032.24  2502.87  2233.3667  1326.5200  1.5334  100.0   
 
           6       7       8  ...       581     582     583     584      585  \
 0   97.6133  0.1242  1.5005  ...       NaN  0.5005  0.0118  0.0035   2.3630   
 1  102.3433  0.1247  1.4966  ...  208.2045  0.5019  0.0223  0.0055   4.4447   
 2   95.4878  0.1241  1.4436  ...   82.8602  0.4958  0.0157  0.0039   3.1745   
 3  104.2367  0.1217  1.4882  ...   73.8432  0.4990  0.0103  0.0025   2.0544   
 4  100.3967  0.1235  1.5031  ...       NaN  0.4800  0.4766  0.1045  99.3032   
 
       586     587     5

In [50]:
# exercise 07

"""
Exploring the UCI SECOM data

To round out this chapter and solidify your understanding of bagging, it's time to work with a new dataset! This data is from a semiconductor manufacturing process, obtained from the UCI Machine Learning Repository.

Each row represents a production entity. The features are measurements from sensors or points in the process. The labels represent whether the entity passes (-1) or fails (1) the test.

The dataset is loaded and available to you as uci_secom. The target is the 'Pass/Fail' column. Use the .value_counts() and .describe() methods to check this variable. What do you notice?
"""

# Instructions

"""
Possible answers:
    
    There are fewer negative than positive tests.
    
    The target has many missing values.
    
    There is evidence of high class imbalance in the target. {Answer}
"""

# solution

print(uci_secom['Pass/Fail'].value_counts(dropna=False))

#----------------------------------#

# Conclusion

"""
Correct! It seems like this target is imbalanced, as more than 90% of the tests are negative. An individual model may be prone to overfitting, so it's a good idea to leverage an ensemble method here.
"""

Pass/Fail
-1    1463
 1     104
Name: count, dtype: int64
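
The imbalance can be quantified directly from these counts. The sketch below rebuilds the counts as a small Series (the numbers are copied from the output above; counts, shares and imbalance_ratio are assumed names). On the real DataFrame, uci_secom['Pass/Fail'].value_counts(normalize=True) should give the same shares in one call.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({-1: 1463, 1: 104}, name='count')

# Share of each class in the target
shares = counts / counts.sum()
# How many majority-class rows there are per minority-class row
imbalance_ratio = counts.max() / counts.min()

print(shares.round(3))
print('Imbalance ratio: {:.1f} to 1'.format(imbalance_ratio))
```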

In [54]:
uci_secom.fillna(0, inplace=True)

In [55]:
X = uci_secom.iloc[:, :-1]
y = uci_secom.iloc[:, -1]

X_train, X_test, y_train, y_test = tts(X, y, test_size=0.4, random_state=42)

In [57]:
# exercise 08

"""
A more complex bagging model

Having explored the semiconductor data, let's now build a bagging classifier to predict the 'Pass/Fail' label given the input features.

The preprocessed dataset is available in your workspace as uci_secom, and training and test sets have been created for you.

As the target has a high class imbalance, use a "balanced" logistic regression as the base estimator here.

We will also reduce the computation time for LogisticRegression with the parameter solver='liblinear', which is a faster optimizer than the default.
"""

# Instructions

"""

    
    Instantiate a logistic regression to use as the base classifier with the parameters: class_weight='balanced', solver='liblinear', and random_state=42.
    
    Build a bagging classifier using the logistic regression as the base estimator, including the out-of-bag score, and with the maximum number of features set to 10.
    
    Print the out-of-bag score to compare to the accuracy.


"""

# solution
from sklearn.linear_model import LogisticRegression
# Build a balanced logistic regression
clf_lr = LogisticRegression(class_weight='balanced', solver='liblinear', random_state=42)

# Build and fit a bagging classifier
clf_bag = BaggingClassifier(clf_lr, max_features=10, oob_score=True, random_state=500)
clf_bag.fit(X_train, y_train)

# Evaluate the accuracy on the test set and show the out-of-bag score
pred = clf_bag.predict(X_test)
print('Accuracy:  {:.2f}'.format(accuracy_score(y_test, pred)))
print('OOB-Score: {:.2f}'.format(clf_bag.oob_score_))

# Print the confusion matrix
print(confusion_matrix(y_test, pred))

#----------------------------------#

# Conclusion

"""
Not bad for an initial model, with an accuracy of about 70% and predictions spread across both classes on the test set. In addition, the out-of-bag score, close to 60%, gives a rough and somewhat lower estimate of that performance.
"""

Accuracy:  0.70
OOB-Score: 0.58
[[416 170]
 [ 21  20]]


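With a target this imbalanced, accuracy alone is hard to interpret, so it helps to read a few more numbers off the confusion matrix. The sketch below uses the matrix printed above and assumes scikit-learn's ordering (rows and columns sorted as -1, then 1). The variable names are hypothetical.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true -1, true 1; columns: predicted -1, predicted 1)
cm = np.array([[416, 170],
               [21, 20]])

accuracy = np.trace(cm) / cm.sum()
# Accuracy of a trivial model that always predicts the majority class (-1)
baseline_accuracy = cm[0].sum() / cm.sum()
# Share of true class-1 rows that the model actually finds
recall_minority = cm[1, 1] / cm[1].sum()

print('Accuracy:          {:.2f}'.format(accuracy))
print('Majority baseline: {:.2f}'.format(baseline_accuracy))
print('Recall (class 1):  {:.2f}'.format(recall_minority))
```

A model that always predicts -1 would score higher on accuracy but would never flag a class-1 entity, which is why the balanced base estimator trades accuracy for recall on the minority class.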

In [59]:
# exercise 09

"""
Tuning bagging hyperparameters

While you can easily build a bagging classifier using the default parameters, it is highly recommended that you tune these in order to achieve optimal performance. Ideally, these should be optimized using K-fold cross-validation.

In this exercise, let's see if we can improve model performance by modifying the parameters of the bagging classifier.

Here we are also passing the parameter solver='liblinear' to LogisticRegression to reduce the computation time.
"""

# Instructions

"""

    Build a bagging classifier with 20 base estimators, 10 maximum features, and 0.65 (65%) maximum samples (max_samples). Sample without replacement.
    
    Use clf_bag to predict the labels of the test set, X_test.

"""

# solution
from sklearn.metrics import classification_report
# Build a balanced logistic regression
clf_base = LogisticRegression(class_weight='balanced', solver='liblinear', random_state=42)

# Build and fit a bagging classifier with custom parameters
clf_bag = BaggingClassifier(clf_base, n_estimators=20, max_features=10, max_samples=0.65, bootstrap=False, random_state=500)
clf_bag.fit(X_train, y_train)

# Calculate predictions and evaluate the accuracy on the test set
y_pred = clf_bag.predict(X_test)
print('Accuracy:  {:.2f}'.format(accuracy_score(y_test, y_pred)))

# Print the classification report
print(classification_report(y_test, y_pred))

#----------------------------------#

# Conclusion

"""
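Great work! You built a bagging classifier with custom hyperparameters. In this run the test accuracy (0.69) is about the same as in the previous exercise (0.70), so finding values that actually improve performance calls for a systematic search, ideally with K-fold cross-validation.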
"""

Accuracy:  0.69
              precision    recall  f1-score   support

          -1       0.95      0.71      0.81       586
           1       0.09      0.41      0.15        41

    accuracy                           0.69       627
   macro avg       0.52      0.56      0.48       627
weighted avg       0.89      0.69      0.77       627



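The exercise text recommends tuning these hyperparameters with K-fold cross-validation. One possible way to do that is scikit-learn's GridSearchCV, sketched below on synthetic imbalanced data (X_demo and y_demo are stand-ins, not the SECOM dataset). The grid values are hypothetical and only meant to show the mechanics; the F1 scoring choice is also an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced stand-in data (not the SECOM dataset)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, weights=[0.9, 0.1], random_state=0)

clf_base = LogisticRegression(
    class_weight='balanced', solver='liblinear', random_state=42)
clf_bag = BaggingClassifier(clf_base, random_state=500)

# Small, hypothetical grid; a real search would likely be wider
param_grid = {
    'n_estimators': [10, 20],
    'max_features': [5, 10],
    'max_samples': [0.65, 1.0],
}

# 5-fold cross-validation, scored with F1 on the minority class (label 1)
search = GridSearchCV(clf_bag, param_grid, scoring='f1', cv=5)
search.fit(X_demo, y_demo)

print('Best parameters:', search.best_params_)
print('Best CV F1: {:.3f}'.format(search.best_score_))
```

After fitting, search.best_estimator_ holds a bagging classifier refit on all the data with the best parameter combination, ready to be evaluated on a held-out test set.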