# Ensemble learning in scikit-learn

This notebook shows how to use several types of [ensembles](https://scikit-learn.org/stable/modules/ensemble.html) in scikit-learn, using the Adult dataset as a running example.

### Reading data and preprocessing

What we do here is probably going to be a bit more obvious after the next lecture, where we discuss preprocessing.

We use the same dataset as we'll use elsewhere in the course (among other places, in Programming assignment 2). The task here is a binary classification task, where we want to predict whether someone earns more than 50K dollars a year or not, given a set of demographic features. The dataset comes with a pre-defined train/test split and you can download the [training set](http://www.cse.chalmers.se/~richajo/dit866/data/adult_train.csv) and the [test set](http://www.cse.chalmers.se/~richajo/dit866/data/adult_test.csv) as separate files.

As in PA 2, we convert the rows of the Pandas dataframe into dictionaries, which works nicely with the `DictVectorizer`.

In [1]:
import pandas as pd

train_data = pd.read_csv('data/adult_train.csv')

n_cols = len(train_data.columns)
Xtrain_dicts = train_data.iloc[:, :n_cols-1].to_dict('records')
Ytrain = train_data.iloc[:, n_cols-1]

test_data = pd.read_csv('data/adult_test.csv')
Xtest_dicts = test_data.iloc[:, :n_cols-1].to_dict('records')
Ytest = test_data.iloc[:, n_cols-1]


To give you a feel for the dataset, here are the first five rows. We want to predict the `target` column, given the other columns.

In [2]:
train_data.head()

| Unnamed: 0 | age | workclass | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 27 | Private | Some-college | 10 | Divorced | Adm-clerical | Unmarried | White | Female | 0 | 0 | 44 | United-States | <=50K |
| 1 | 27 | Private | Bachelors | 13 | Never-married | Prof-specialty | Not-in-family | White | Female | 0 | 0 | 40 | United-States | <=50K |
| 2 | 25 | Private | Assoc-acdm | 12 | Married-civ-spouse | Sales | Husband | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 46 | Private | 5th-6th | 3 | Married-civ-spouse | Transport-moving | Husband | Amer-Indian-Eskimo | Male | 0 | 1902 | 40 | United-States | <=50K |
| 4 | 45 | Private | 11th | 7 | Divorced | Transport-moving | Not-in-family | White | Male | 0 | 2824 | 76 | United-States | >50K |


Here is the representation of the first individual:

In [3]:
Xtrain_dicts[0]

{'age': 27,
 'workclass': 'Private',
 'education': 'Some-college',
 'education-num': 10,
 'marital-status': 'Divorced',
 'occupation': 'Adm-clerical',
 'relationship': 'Unmarried',
 'race': 'White',
 'sex': 'Female',
 'capital-gain': 0,
 'capital-loss': 0,
 'hours-per-week': 44,
 'native-country': 'United-States'}

To work with scikit-learn, we need to convert the symbolic features into a numerical matrix. Here is how we do this. Again, this is going to be clearer after the next lecture!

In [4]:
# basic preprocessing stuff
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction import DictVectorizer


preprocessing_pipeline = make_pipeline(DictVectorizer(), StandardScaler(with_mean=False))

Xtrain = preprocessing_pipeline.fit_transform(Xtrain_dicts)
Xtest = preprocessing_pipeline.transform(Xtest_dicts)

### Building an ensemble of any set of classifiers

The following example shows how to combine a set of classifiers into an ensemble. This can be done simply in scikit-learn by using a [`VotingClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html). Here, we combine two types of logistic regression, a decision tree, and a neural network.

By default, the `VotingClassifier` uses majority ("hard") voting over the predicted class labels to compute the final prediction. With the option `voting='soft'`, it instead averages the predicted probabilities. Note that this requires probability-aware classifiers: each sub-classifier needs to have a method called `predict_proba`.

The option `n_jobs=-1` is for efficiency and simply means that we use all available processors on the machine and run the training of the submodels in parallel.

In [5]:
# for evaluation
from sklearn.metrics import accuracy_score

# a few different types of classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
# turn off annoying warnings
import warnings; warnings.simplefilter('ignore')

# and the VotingClassifier
from sklearn.ensemble import VotingClassifier

In [6]:
ensemble = [
            ('lr', LogisticRegression()),
            ('dt', DecisionTreeClassifier(max_depth=5)),
            ('lr1', LogisticRegression(penalty='l1', solver='liblinear')),
            ('mlp', MLPClassifier(hidden_layer_sizes=(8,), max_iter=10000))
           ]

voting = VotingClassifier(ensemble)
#voting = VotingClassifier(ensemble, voting='soft')

voting.fit(Xtrain, Ytrain)

accuracy_score(Ytest, voting.predict(Xtest))

0.8540630182421227
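
The difference between the two voting modes can be illustrated by hand for a single instance. This is only a sketch of the idea, not how the `VotingClassifier` is implemented internally, and the three probabilities are invented for the example.

```python
import numpy as np

# Made-up values of P(positive class) from three classifiers for one instance.
proba_positive = np.array([0.45, 0.40, 0.90])

# Hard voting: each classifier casts one vote for its most probable class,
# and the class with the majority of the votes wins.
votes = (proba_positive > 0.5).astype(int)
hard_prediction = int(votes.sum() > len(votes) / 2)

# Soft voting: average the probabilities first, then pick the most probable class.
mean_proba = proba_positive.mean()
soft_prediction = int(mean_proba > 0.5)

print(hard_prediction, soft_prediction)
```

Here, two of the three classifiers lean slightly towards the negative class, so hard voting predicts 0. The third classifier is very confident, which pulls the average probability above 0.5, so soft voting predicts 1.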

### Stacking

We can create an ensemble using stacking in more or less the same way. This will take a bit of time, because cross-validation is used during training. (Why?)

In [7]:
from sklearn.ensemble import StackingClassifier

ensemble = [
            ('lr', LogisticRegression()),
            ('dt', DecisionTreeClassifier(max_depth=5)),
            ('lr1', LogisticRegression(penalty='l1', solver='liblinear')),
            ('mlp', MLPClassifier(hidden_layer_sizes=(8,), max_iter=10000))
           ]

stacking = StackingClassifier(ensemble)

stacking.fit(Xtrain, Ytrain)

accuracy_score(Ytest, stacking.predict(Xtest))

0.8560899207665377
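
One way to think about the role of cross-validation is to build a small stacked model by hand. The sketch below uses a synthetic dataset from `make_classification` and only two base models; these choices, and all variable names, are assumptions made for illustration. The key point is that the meta-classifier is trained on predictions the base models made for instances they did not see during training.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used here only to keep the example self-contained.
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

base_models = [
    LogisticRegression(),
    DecisionTreeClassifier(max_depth=5, random_state=0),
]

# For each base model, collect out-of-fold probabilities of the positive class:
# every instance is predicted by a model that was trained without that instance.
meta_columns = [
    cross_val_predict(model, X_toy, y_toy, cv=5, method='predict_proba')[:, 1]
    for model in base_models
]
meta_features = np.column_stack(meta_columns)

# The meta-classifier learns how to combine the base models' outputs.
meta_model = LogisticRegression()
meta_model.fit(meta_features, y_toy)

print(meta_features.shape)
```

If the base models instead predicted on the same data they were trained on, an overfitted base model would look deceptively reliable to the meta-classifier. To make predictions for new instances, the base models would also need to be refitted on the full training set, which the `StackingClassifier` takes care of.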

### Creating an ensemble using bagging and random subspace learning

In contrast to the examples above, where we just combined a few different classifiers, we will now see how an ensemble can be created in a way that more systematically tries to achieve diversity among the classifiers. We will use decision trees in this example.

Before we do that, let's see what kind of accuracy we get when we use a single decision tree with this dataset.

In [8]:
tree = DecisionTreeClassifier()

tree.fit(Xtrain, Ytrain)
accuracy_score(Ytest, tree.predict(Xtest))

0.8170259812050856

The [`BaggingClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html) creates an ensemble using the [*bagging*](https://en.wikipedia.org/wiki/Bootstrap_aggregating) method and/or [*random subspace learning*](https://en.wikipedia.org/wiki/Random_subspace_method) ("feature bagging"). In bagging, diversity of sub-classifiers is achieved by selecting new training sets from the original set by drawing *instances* with replacement. In random subspace learning, the different sub-classifiers instead use different subsets of *features*.

By setting `bootstrap=True` (the default), bagging is enabled; random subspace learning is turned on by setting `bootstrap_features=True`. As you can see, with both options turned on, the ensemble reaches an accuracy in the 0.85-0.86 range, compared to about 0.82 for a single decision tree. The exact accuracy you get will depend on how the random sampling of instances and features is done; you can get reproducible results by setting the `random_state`. In general, you will get a higher accuracy when using a larger number of sub-classifiers (`n_estimators`), and there is no risk of overfitting by increasing this value, but it will of course make the ensemble slower.

In [9]:
from sklearn.ensemble import BaggingClassifier

for bootstrap_instances in [False, True]:
    for bootstrap_features in [False, True]:
        bagging = BaggingClassifier(DecisionTreeClassifier(), 
                                    n_estimators=10, 
                                    bootstrap=bootstrap_instances, bootstrap_features=bootstrap_features, 
                                    random_state=0, n_jobs=-1)
        

        bagging.fit(Xtrain, Ytrain)

        acc = accuracy_score(Ytest, bagging.predict(Xtest))

        print(f'Instance bootstrapping: {bootstrap_instances}; feature bootstrapping: {bootstrap_features}; accuracy: {acc:.3f}')


Instance bootstrapping: False; feature bootstrapping: False; accuracy: 0.819
Instance bootstrapping: False; feature bootstrapping: True; accuracy: 0.847
Instance bootstrapping: True; feature bootstrapping: False; accuracy: 0.844
Instance bootstrapping: True; feature bootstrapping: True; accuracy: 0.851
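
The two sampling schemes can be sketched with plain NumPy. This is an illustration of the idea under the settings used above (drawing with replacement for both instances and features), not the actual implementation in the `BaggingClassifier`; the tiny matrix and the variable names are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_instances, n_features = 8, 6

# A tiny made-up data matrix: one row per instance, one column per feature.
X_toy = np.arange(n_instances * n_features).reshape(n_instances, n_features)

# Bagging: draw row indices with replacement, so some instances are
# repeated and others are left out of this sub-classifier's training set.
instance_idx = rng.integers(0, n_instances, size=n_instances)

# Feature bootstrapping: draw column indices with replacement, so this
# sub-classifier only sees the subset of features that happened to be drawn.
feature_idx = rng.integers(0, n_features, size=n_features)

# The training data for one sub-classifier.
X_sub = X_toy[instance_idx][:, feature_idx]

print(sorted(set(instance_idx.tolist())))
print(sorted(set(feature_idx.tolist())))
```

Repeating this with a new random draw for each sub-classifier gives every tree a slightly different view of the data, which is where the diversity in the ensemble comes from.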


### Random forests

The [`RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) uses the [*random forest*](https://en.wikipedia.org/wiki/Random_forest) method to build ensembles of decision trees. This ensemble training method combines training set bagging with a form of random subspace learning: each time a split is selected when building a decision tree, only a random subset of the features is considered. This is usually a high-quality model for "tabular" data: that is, a set of named columns, such as what we get if we load a CSV or Excel file using Pandas. This is also the situation we have here.

As in the `BaggingClassifier`, the main hyperparameter to adjust when constructing the ensemble is the number of sub-trees used in the ensemble (`n_estimators`). Apart from that, the `RandomForestClassifier` (and the equivalent model for regression, `RandomForestRegressor`) has a number of hyperparameters controlling the tree building, similar to a `DecisionTreeClassifier`.

In [10]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, max_depth=20, random_state=0, n_jobs=-1)

rf.fit(Xtrain, Ytrain)
accuracy_score(Ytest, rf.predict(Xtest))

0.8624777347828757

Important hyperparameters for random forests:

- `n_estimators` controls the size of the ensemble
- `max_features`: how many features to consider when splitting; by default for classification, sqrt(n_features)
- tree-related hyperparameters including `max_depth`
- `n_jobs` for how many CPU cores to use
- `random_state` for reproducibility
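
As a small illustration of how these hyperparameters show up in a fitted model, the following sketch trains a forest on a synthetic dataset and inspects its trees. The dataset from `make_classification` and the specific values are assumptions chosen to keep the example fast; they are not tuned for the Adult data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 16 features, so that sqrt(n_features) is exactly 4.
X_toy, y_toy = make_classification(n_samples=300, n_features=16, random_state=0)

forest = RandomForestClassifier(
    n_estimators=25,      # size of the ensemble
    max_features='sqrt',  # features considered at each split
    max_depth=5,          # limits how deep each tree can grow
    n_jobs=-1,            # use all available CPU cores
    random_state=0,       # reproducible sampling
)
forest.fit(X_toy, y_toy)

# The fitted sub-trees are available in the estimators_ attribute.
n_trees = len(forest.estimators_)
deepest = max(tree.get_depth() for tree in forest.estimators_)
features_per_split = forest.estimators_[0].max_features_

print(n_trees, deepest, features_per_split)
```

The fitted forest contains one tree per estimator, none of the trees exceeds the depth limit, and each tree considers 4 of the 16 features when looking for a split.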
