Choosing a Classifier Ensemble

This Python code measures and compares the predictive power of Scikit-Learn classifier ensembles.

Background

General Info

There are many useful classification methods in Scikit-Learn, the machine learning library for Python. In this article, we'll concentrate on ensemble methods. Ensemble models combine multiple models to produce a better model and more accurate predictions.

Remember that each dataset has its own unique characteristics, so results will vary from dataset to dataset.

The aim of this code is to make it easy to decide which ensemble model to select. Since the goal is to compare models, you can add or remove models, including non-ensemble ones (such as Logistic Regression). If you decide to make changes, the code will stay very similar as long as you use the Scikit-Learn library.

Dataset

The dataset used here was found on Kaggle, and the information was provided by the Sloan Digital Sky Survey. Various measurements of sky objects are presented, and the task is to classify each object as a galaxy, quasar, or star. More information about the dataset and each variable can be found here. You can download the dataset directly from this link.

Methodology

In this article, we'll use Python to compare the following classifiers:

  • AdaBoost
  • Bagging
  • Extra Trees
  • Gradient Boosting
  • Random Forest

All classifiers will be used with their default parameters.

Also, we'll employ KFold to get average model scores, use the classifiers to make predictions and then create confusion matrices to visualize prediction results.

Let's start.

Analysis

Preparations

First, we need to import the necessary modules.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, ExtraTreesClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn import model_selection
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
import itertools
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import time
The reason to import warnings is to suppress FutureWarnings.
We're also going to use the time library to measure the KFold model running times.
df = pd.read_csv('C:/Users/Emir/Desktop/Skyserver.csv')
df = df.fillna(0)
df.head()
objid ra dec u g r i z run rerun camcol field specobjid class redshift plate mjd fiberid
0 1.237650e+18 183.531326 0.089693 19.47406 17.04240 15.94699 15.50342 15.22531 752 301 4 267 3.722360e+18 STAR -0.000009 3306 54922 491
1 1.237650e+18 183.598371 0.135285 18.66280 17.21449 16.67637 16.48922 16.39150 752 301 4 267 3.638140e+17 STAR -0.000055 323 51615 541
2 1.237650e+18 183.680207 0.126185 19.38298 18.19169 17.47428 17.08732 16.80125 752 301 4 268 3.232740e+17 GALAXY 0.123111 287 52023 513
3 1.237650e+18 183.870529 0.049911 17.76536 16.60272 16.16116 15.98233 15.90438 752 301 4 269 3.722370e+18 STAR -0.000111 3306 54922 510
4 1.237650e+18 183.883288 0.102557 17.55025 16.26342 16.43869 16.55492 16.61326 752 301 4 269 3.722370e+18 STAR 0.000590 3306 54922 512
Now that the data is properly loaded, let's take a deeper look into each object class.
df.groupby(['class']).mean()
objid ra dec u g r i z run rerun camcol field specobjid redshift plate mjd fiberid
class
GALAXY 1.237650e+18 177.333570 15.764372 18.804339 17.350216 16.649506 16.272770 16.017923 996.711685 301.0 3.654662 300.963585 5.379141e+17 0.080325 477.680672 52030.280912 340.108844
QSO 1.237650e+18 177.468000 20.570639 18.942928 18.678714 18.498535 18.360007 18.274761 1036.120000 301.0 3.694118 304.983529 1.447231e+18 1.218366 1285.305882 52694.289412 381.558824
STAR 1.237650e+18 172.962158 12.544824 18.330439 17.130547 16.732093 16.594047 16.531119 950.886561 301.0 3.632225 303.552264 3.018202e+18 0.000043 2680.613198 54093.892823 362.838391
It is already obvious that any variable containing some sort of ID number won't help the analysis, so 'objid' and 'specobjid' will be left out. Besides those two, we can also see that the variable 'rerun' is constant across all observations and doesn't provide any useful information.

So, we get our predictors and target like this:
x = df.drop(['class','objid','rerun','specobjid'], axis=1)
y = df.filter(['class'])
Next, we need to split the dataset so we can train the models and test them on held-out data.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.8, random_state=0)

Method Declaration

Now it's time to specify the models to be compared and the methods of comparison.
models = []
models.append(('ABC', AdaBoostClassifier()))
models.append(('BC ', BaggingClassifier()))
models.append(('ETC', ExtraTreesClassifier()))
models.append(('GBC', GradientBoostingClassifier()))
models.append(('RFC', RandomForestClassifier()))
We're creating a list to store all the models. The first value of each list entry is the initials of the model's name. There's an extra space character at the end of 'BC ' for alignment.
This is also the part you can change if you want to make similar classification comparisons. Just append any model you've imported to this list (remember to use that model's own initials!). You can also change the default parameters by stating them within the parentheses, as shown below.
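For example, here is a minimal, hypothetical addition of a non-ensemble model with non-default parameters (the model choice and parameter values are arbitrary illustrations, not tuned recommendations):
from sklearn.linear_model import LogisticRegression

# Hypothetical example: a non-ensemble model with custom parameters added to the comparison
models.append(('LR ', LogisticRegression(max_iter=1000, C=0.5)))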
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)  # newer Scikit-Learn requires shuffle=True when random_state is set
scoring = 'accuracy'
KFold will split the data into 10 folds, using each fold for validation while the other 9 folds serve as training data.
Models will be scored according to their accuracy.
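If accuracy isn't the metric you care about, any other Scikit-Learn scoring string can be used here instead; for example (an alternative not used in this analysis):
scoring = 'f1_macro'  # macro-averaged F1 score, useful when classes are imbalanced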

Model Runs and Scores

It's time to see what the models can produce.
for name, model in models:
    start_time = time.time()
    cv_results = model_selection.cross_val_score(model, x, y.values.ravel(), cv=kfold, scoring=scoring)
    print('%s: %f  in %.2f seconds' % (name, cv_results.mean(), time.time() - start_time))
This part of the code will print the name, accuracy, and time needed to complete the KFold validation for each model.
start_time records the exact time at which each iteration of the for loop begins. When it's time to print the result, that time is subtracted from the current time, and the '%.2f' format specifier prints the difference as a float with 2 decimal places.

Here are the results:
ABC: 0.843600  in 8.53 seconds
BC : 0.988700  in 4.99 seconds
ETC: 0.970200  in 0.54 seconds
GBC: 0.989700  in 22.41 seconds
RFC: 0.987900  in 1.31 seconds
Gradient Boosting Classifier has the best score, but it also took the longest time to run. Bagging Classifier is a close second with a much shorter running time.

Let's use the models for predictions:
y_pred = []
for name, model in models:
    model.fit(x_train,y_train.values.ravel())
    PredictionResults = model.predict(x_test)
    y_pred.append([name, PredictionResults])
    print('%s fitted and used for predictions.' % name)
We have a list called y_pred, and it will save all the prediction information for each model. Notice that the name of the model is also recorded into the list because we will need it during the creation of confusion matrices.

Each model is fitted on the training data, its prediction results are saved into 'PredictionResults' and then appended to y_pred for later use. Because this is a loop, anything you want to use later must be stored in a variable defined outside the loop!

The print command is here to show that the code ran successfully.
ABC fitted and used for predictions.
BC  fitted and used for predictions.
ETC fitted and used for predictions.
GBC fitted and used for predictions.
RFC fitted and used for predictions.

Visualization

So, we got our predictions. But how do we see and compare them?

Here we'll use confusion matrices. Below is a custom function for creating them; it was taken from an example on the Scikit-Learn website, with several changes made to give it a slightly custom look.
def plot_confusion_matrix(cm, classes, normalize=False, title='Confusion matrix', cmap=plt.cm.Greens):
    # Optionally convert counts into row-wise proportions
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes)
    plt.yticks(tick_marks, classes)
    # Proportions are printed with 2 decimals, raw counts as integers
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    # Write each cell's value, using white text on dark cells for readability
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt), horizontalalignment="center", color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
This code can be used for both standard and normalized confusion matrices. It has a default title, and the color map can be specified. The details of this code won't be discussed here since they're outside the scope of this article.

In the final step, we'll create two confusion matrices for each prediction set: one standard and one normalized. We'll use another for loop to do this.
for name, pred in y_pred:
    conmat = confusion_matrix(y_test, pred)
    y_names=np.unique(y_test.iloc[:,0:1].values)
    f = plt.figure(figsize=(12, 5))
    f.add_subplot(1,2,1)
    plot_confusion_matrix(conmat, classes=y_names, title='%s Standard Confusion Matrix' % name)
    f.add_subplot(1,2,2)
    plot_confusion_matrix(conmat, classes=y_names, normalize=True, title='%s Normalized Confusion Matrix' % name)
    plt.subplots_adjust(wspace=0.2)
    plt.show()
All the predictions in our y_pred list go through the confusion_matrix() function and are saved as 'conmat' within the loop. The names of the target classes are taken from the y_test dataset. We create a new figure that is 12 by 5 inches in size. Then the plot_confusion_matrix() custom function takes over: it is run twice for each figure and creates two confusion matrix subplots. The ones on the left are the standard plots, the ones on the right are the normalized plots. The plots have a space between them equal to 20% of the axis width.

All confusion matrices are shown below:

As you can see, confusion matrices allow us to easily notice where the models were successful and where they failed.

  • AdaBoost Classifier had a big problem labeling stars; it labeled 78% of all stars within the test dataset as galaxies.
  • Bagging Classifier was quite successful; its only notable fault was labeling 6% of quasars as galaxies.
  • Extra Trees Classifier performed well, though it had problems with both quasar and star labeling.
  • Gradient Boosting Classifier has the best results; its maximum partial error rate is 5%.
  • Random Forest Classifier accomplished good results, again having some difficulty labeling quasars.

Note that each time the same code is run, you might get slightly different results.
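If you want results that are reproducible from run to run, one option (not used in this analysis) is to fix each model's random_state when building the models list, for example:
# Hypothetical variant of the RFC line above: a fixed seed makes repeated runs produce identical results
models.append(('RFC', RandomForestClassifier(random_state=0)))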

Further Interpretation of this Analysis

If your only choice for classifying these objects were one of these 5 models with their default parameters, you would have to go with the Gradient Boosting Classifier despite the 5% partial error rate - which is not that high, and the class where this error occurs is smaller than the other classes. In the real world, you probably wouldn't have such model selection limitations, so you might choose one of the following directions:
  • Play around with the default parameters of each ensemble model until you get a better result. Keep in mind that both the AdaBoost and Bagging Classifiers have a base_estimator parameter (renamed to estimator in newer Scikit-Learn versions) where you can swap the standard decision tree estimator for something else, like Logistic Regression; see the sketch after this list.
  • All 5 models had some level of problem when it came to labeling quasars; this indicates that the dataset itself may not contain sufficient information to classify all objects correctly. You may want to search for other related variables to increase model accuracy.
  • Look for alternative models, such as an Artificial Neural Network classifier, and measure their accuracy on the dataset.
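To illustrate the first suggestion, here is a minimal sketch of swapping AdaBoost's base estimator; the choice of Logistic Regression and the parameter values are illustrative assumptions, not recommendations:
from sklearn.linear_model import LogisticRegression

# Illustrative sketch: AdaBoost boosting a logistic regression instead of a decision tree.
# On Scikit-Learn >= 1.2 the parameter is 'estimator'; older versions call it 'base_estimator'.
ada_lr = AdaBoostClassifier(estimator=LogisticRegression(max_iter=1000), n_estimators=100)
models.append(('ALR', ada_lr))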

Conclusion

Although the AdaBoost Classifier didn't perform as well in this example, you can use all of these ensembles for your classification analyses. They are powerful methods and can increase the accuracy of any existing simple method. Running times vary depending on the model, so if computational power is an issue - especially for a large dataset - you may choose a model that runs faster in exchange for some accuracy.

Have a nice day,
Emir Korkut Ünal
