Choosing a Classifier Ensemble

This Python code is aimed to measure and compare predictive powers of Scikit-Learn Classifier Ensembles.

Background

General Info

There are many useful classification methods within Scikit-Learn, machine learning library for Python. In this article, we'll concentrate on ensemble methods. Ensemble models combine multiple models to create a better model and produce more accurate predictions.

Remember that each dataset has its unique characteristics, meaning the results would vary depending on dataset.

The aim of this code is to make things easy to decide which ensemble model should be selected. Since the goal of this code is comparing models, you can add or remove other models including non-ensemble ones (like Logistic Regression). If you decide to make changes, the code would still be very similar as long as you use Scikit-Learn library.

Dataset

The dataset used here is found on Kaggle and the information was provided by Sloan Digital Sky Survey. Various measurements about sky objects are presented, and you're asked for classifying the objects as galaxies, quasars or stars. More information about the dataset and each variable can be found here. You can download the dataset directly from this link.

Methodology

In this article, we'll use Python to compare

AdaBoost
Bagging
Extra Trees
Gradient Boosting
Random Forest

classifiers. All classifiers will be used with their default parameters.

Also, we'll employ KFold to get average model scores, use the classifiers to make predictions and then create confusion matrices to visualize prediction results.

Let's start.

Analysis

Preparations

First, we need to import the necessary modules.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, ExtraTreesClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn import model_selection
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
import itertools
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import time

The reason to import warnings is to suppress FutureWarnings.
We're also going to use the time library to measure the KFold model running times.

df = pd.read_csv('C:/Users/Emir/Desktop/Skyserver.csv')
df = df.fillna(0)
df.head()

	objid	ra	dec	u	g	r	i	z	run	rerun	camcol	field	specobjid	class	redshift	plate	mjd	fiberid
0	1.237650e+18	183.531326	0.089693	19.47406	17.04240	15.94699	15.50342	15.22531	752	301	4	267	3.722360e+18	STAR	-0.000009	3306	54922	491
1	1.237650e+18	183.598371	0.135285	18.66280	17.21449	16.67637	16.48922	16.39150	752	301	4	267	3.638140e+17	STAR	-0.000055	323	51615	541
2	1.237650e+18	183.680207	0.126185	19.38298	18.19169	17.47428	17.08732	16.80125	752	301	4	268	3.232740e+17	GALAXY	0.123111	287	52023	513
3	1.237650e+18	183.870529	0.049911	17.76536	16.60272	16.16116	15.98233	15.90438	752	301	4	269	3.722370e+18	STAR	-0.000111	3306	54922	510
4	1.237650e+18	183.883288	0.102557	17.55025	16.26342	16.43869	16.55492	16.61326	752	301	4	269	3.722370e+18	STAR	0.000590	3306	54922	512

Now that the data is properly loaded, let's take a deper look into each object class.

df.groupby(['class']).mean()

	objid	ra	dec	u	g	r	i	z	run	rerun	camcol	field	specobjid	redshift	plate	mjd	fiberid
class
GALAXY	1.237650e+18	177.333570	15.764372	18.804339	17.350216	16.649506	16.272770	16.017923	996.711685	301.0	3.654662	300.963585	5.379141e+17	0.080325	477.680672	52030.280912	340.108844
QSO	1.237650e+18	177.468000	20.570639	18.942928	18.678714	18.498535	18.360007	18.274761	1036.120000	301.0	3.694118	304.983529	1.447231e+18	1.218366	1285.305882	52694.289412	381.558824
STAR	1.237650e+18	172.962158	12.544824	18.330439	17.130547	16.732093	16.594047	16.531119	950.886561	301.0	3.632225	303.552264	3.018202e+18	0.000043	2680.613198	54093.892823	362.838391

It is already obvious that any variable that contains some sort of ID number won't help for analysis; so 'objid' and 'specobjid' will be out. Beside those two, we can also see that the variable 'rerun' is constant among variables and doesn't provide any useful information.

So, we get our predictors and target like this:

x = df.drop(['class','objid','rerun','specobjid'], axis=1)
y = df.filter(['class'])

Next, we need to split the dataset to train the models and test them on the splits.

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.8, random_state=0)

Method Declaration

Now it's time to specify the models to be compared and the methods of comparison.

models = []
models.append(('ABC', AdaBoostClassifier()))
models.append(('BC ', BaggingClassifier()))
models.append(('ETC', ExtraTreesClassifier()))
models.append(('GBC', GradientBoostingClassifier()))
models.append(('RFC', RandomForestClassifier()))

We're creating a list to store all models. First value of each list object are the initials of model names. There's one extra space character at the end of BC, that is for alignment.
This also is the part you can change if you want to make similar classification comparisons. Just append any model that you've imported into this list (remember to use that model's own initials!). You can also change the default parameters by stating them within the brackets.

kfold = model_selection.KFold(n_splits=10, random_state=7)
scoring = 'accuracy'

KFold will split the training data into 10 pieces, and use each one as validation while the other 9 pieces serve as training data.
Models will be scored according to their accuracy.

Models Runs and Scores

It's time to see what the models can produce.

for name, model in models:
    start_time = time.time()
    cv_results = model_selection.cross_val_score(model, x, y.values.ravel(), cv=kfold, scoring=scoring)
    print('%s: %f  in %.2f seconds' % (name, cv_results.mean(), time.time() - start_time))

This part of the code will print the name, accuracy and time needed to complete the KFold validation for each model.
The start_time will fetch the exact time when the for loop starts a new loop. When it's time to print the result, that time will be subtracted from the current time, and '%.2f' part of the code will print it as a float with 2 decimal places.

Here are the results:

ABC: 0.843600  in 8.53 seconds
BC : 0.988700  in 4.99 seconds
ETC: 0.970200  in 0.54 seconds
GBC: 0.989700  in 22.41 seconds
RFC: 0.987900  in 1.31 seconds

Gradient Boosting Classifier has the best score, but also took the longest time to run. Bagging Classifier is a close second with much better running time.

Let's use the models for predictions:

y_pred = []
for name, model in models:
    model.fit(x_train,y_train.values.ravel())
    PredictionResults = model.predict(x_test)
    y_pred.append([name, PredictionResults])
    print('%s fitted and used for predictions.' % name)

We have a list called y_pred, and it will save all the prediction information for each model. Notice that the name of the model is also recorded into the list because we will need it during the creation of confusion matrices.

The data is fitted into all models, prediction results are individually saved into the 'PredictionResults' and then appended to y_pred for future use. Because this is a loop, anything you want to use later must be recorded into a variable that is out of the loop!

The print command is here to show that the code ran succesfully.

ABC fitted and used for predictions.
BC  fitted and used for predictions.
ETC fitted and used for predictions.
GBC fitted and used for predictions.
RFC fitted and used for predictions.

Visualization

So, we got our predictions. But how do we see and compare them?

Here we'll use confusion matrices. Below is a custom function for creating them; this was taken from an example on Scikit-Learn website and several changes were made to give it a little bit of a custom look.

def plot_confusion_matrix(cm, classes, normalize=False, title='Confusion matrix', cmap=plt.cm.Greens):
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes)
    plt.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt), horizontalalignment="center", color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

This code can be used for both standard and normalized confusion matrices. It has a default title, and the color map can be specified. Details of this code won't be discussed here since it's out of the scope of this article.

In the final step, we'll run two confusion matrices for each prediction set; one will be standard, one will be normalized. We'll use another for loop to do this.

for name, pred in y_pred:
    conmat = confusion_matrix(y_test, pred)
    y_names=np.unique(y_test.iloc[:,0:1].values)
    f = plt.figure(figsize=(12, 5))
    f.add_subplot(1,2,1)
    plot_confusion_matrix(conmat, classes=y_names, title='%s Standard Confusion Matrix' % name)
    f.add_subplot(1,2,2)
    plot_confusion_matrix(conmat, classes=y_names, normalize=True, title='%s Normalized Confusion Matrix' % name)
    plt.subplots_adjust(wspace=0.2)
    plt.show()

All our predictions in our y_pred list go through the confusion_matrix() function and saved as 'conmat' within the loop. Names of the target variables are taken from the y_test dataset. We create a new figure with 12 to 5 inches in size. Then the plot_confusion_matrix() custom function takes over: It is run twice for each figure and creates two confusion matrix subplots. The ones at the left are the standard plots, the ones at the right are the normalized plots. The plots will have a space between them that is 20% of the whole axis width.

All confusion matrices are shown below:

As you can see, confusion matrices allow us to easily notice where the models were successful and where they failed.

AdaBoost Classifier had a big problem labeling stars; it labeled 78% of all stars within the test dataset as galaxies.
Bagging Classifier was quite successful, its only notable fault was labeling 6% of quasars as galaxies.
Extra Trees Classifier performed well, though had problems both in quasar and star labeling.
Gradient Boosting Classifier has the best results, maximum partial error rate is 5%.
Random Forest Classifier accomplished good results, again having difficulties labeling quasars. Note that each time the same code is run, you might have slightly different results.

Futrher Interpretation for this Analysis

If your only choice of classifying these objects would be one of these 5 models with their default parameters, you would have to go with the Gradient Boosting Classifier despite the 5% partial error rate - which is not that high and also consider that the class where this error rate pccurs is smaller than the other classes. In real world, you probably wouldn't have such model selection limitations, so you might choose one of the following directions to follow:

Play around with default parameters of each ensemble model until you get a better result. Keep in mind that the both AdaBoost and Bagging Classifiers have a base_estimator parameter where you can change the standard decision tree estimator into something else, like Logistic Regression.
All 5 models had some level of problem when it came to labeling quasars; this indicates that the dataset itself may not contain sufficient information to classify all objects correctly. You may want to search for other related variables to increase model accuracy.
Look for alternative models, such as Artifical Neural Network Classifier, and measure their accuracies on the dataset.

Conclusion

Although AdaBoost Classifier didn't perform as well in this example, you can use all of these ensembles for your classification analyzes. They are powerful methods and can increase the accuracy of any existing simple method. Running times vary depending on model, so if the computational power would be an issue - especially for a large dataset - then you may choose a model that runs faster in exchange of some accuracy.

Have a nice day,
Emir Korkut Ünal

Name		Name	Last commit message	Last commit date
Latest commit History 41 Commits
images		images
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

images

images

README.md

README.md

Repository files navigation

Choosing a Classifier Ensemble

Background

General Info

Dataset

Methodology

Analysis

Preparations

Method Declaration

Models Runs and Scores

Visualization

Futrher Interpretation for this Analysis

Conclusion

About

Releases

Packages

EmirKorkutUnal/Choosing-a-Classifier-Ensemble

Folders and files

Latest commit

History

images

images

README.md

README.md

Repository files navigation

Choosing a Classifier Ensemble

Background

General Info

Dataset

Methodology

Analysis

Preparations

Method Declaration

Models Runs and Scores

Visualization

Futrher Interpretation for this Analysis

Conclusion

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages