Remember that each dataset has its unique characteristics, meaning the results would vary depending on dataset.
The aim of this code is to make things easy to decide which ensemble model should be selected. Since the goal of this code is comparing models, you can add or remove other models including non-ensemble ones (like Logistic Regression). If you decide to make changes, the code would still be very similar as long as you use Scikit-Learn library. The dataset used here is found on Kaggle and the information was provided by Sloan Digital Sky Survey. Various measurements about sky objects are presented, and you're asked for classifying the objects as galaxies, quasars or stars. More information about the dataset and each variable can be found here. You can download the dataset directly from this link. In this article, we'll use Python to compare
- AdaBoost
- Bagging
- Extra Trees
- Gradient Boosting
- Random Forest
Also, we'll employ KFold to get average model scores, use the classifiers to make predictions and then create confusion matrices to visualize prediction results.
Let's start. First, we need to import the necessary modules.
import pandas as pd import matplotlib.pyplot as plt import numpy as np from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, ExtraTreesClassifier, GradientBoostingClassifier, RandomForestClassifier from sklearn import model_selection from sklearn.model_selection import train_test_split from sklearn.model_selection import cross_val_score from sklearn.metrics import confusion_matrix import itertools import warnings warnings.simplefilter(action='ignore', category=FutureWarning) import timeThe reason to import warnings is to suppress FutureWarnings.
We're also going to use the time library to measure the KFold model running times.
df = pd.read_csv('C:/Users/Emir/Desktop/Skyserver.csv') df = df.fillna(0) df.head()
objid | ra | dec | u | g | r | i | z | run | rerun | camcol | field | specobjid | class | redshift | plate | mjd | fiberid | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.237650e+18 | 183.531326 | 0.089693 | 19.47406 | 17.04240 | 15.94699 | 15.50342 | 15.22531 | 752 | 301 | 4 | 267 | 3.722360e+18 | STAR | -0.000009 | 3306 | 54922 | 491 |
1 | 1.237650e+18 | 183.598371 | 0.135285 | 18.66280 | 17.21449 | 16.67637 | 16.48922 | 16.39150 | 752 | 301 | 4 | 267 | 3.638140e+17 | STAR | -0.000055 | 323 | 51615 | 541 |
2 | 1.237650e+18 | 183.680207 | 0.126185 | 19.38298 | 18.19169 | 17.47428 | 17.08732 | 16.80125 | 752 | 301 | 4 | 268 | 3.232740e+17 | GALAXY | 0.123111 | 287 | 52023 | 513 |
3 | 1.237650e+18 | 183.870529 | 0.049911 | 17.76536 | 16.60272 | 16.16116 | 15.98233 | 15.90438 | 752 | 301 | 4 | 269 | 3.722370e+18 | STAR | -0.000111 | 3306 | 54922 | 510 |
4 | 1.237650e+18 | 183.883288 | 0.102557 | 17.55025 | 16.26342 | 16.43869 | 16.55492 | 16.61326 | 752 | 301 | 4 | 269 | 3.722370e+18 | STAR | 0.000590 | 3306 | 54922 | 512 |
df.groupby(['class']).mean()
objid | ra | dec | u | g | r | i | z | run | rerun | camcol | field | specobjid | redshift | plate | mjd | fiberid | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
class | |||||||||||||||||
GALAXY | 1.237650e+18 | 177.333570 | 15.764372 | 18.804339 | 17.350216 | 16.649506 | 16.272770 | 16.017923 | 996.711685 | 301.0 | 3.654662 | 300.963585 | 5.379141e+17 | 0.080325 | 477.680672 | 52030.280912 | 340.108844 |
QSO | 1.237650e+18 | 177.468000 | 20.570639 | 18.942928 | 18.678714 | 18.498535 | 18.360007 | 18.274761 | 1036.120000 | 301.0 | 3.694118 | 304.983529 | 1.447231e+18 | 1.218366 | 1285.305882 | 52694.289412 | 381.558824 |
STAR | 1.237650e+18 | 172.962158 | 12.544824 | 18.330439 | 17.130547 | 16.732093 | 16.594047 | 16.531119 | 950.886561 | 301.0 | 3.632225 | 303.552264 | 3.018202e+18 | 0.000043 | 2680.613198 | 54093.892823 | 362.838391 |
So, we get our predictors and target like this:
x = df.drop(['class','objid','rerun','specobjid'], axis=1) y = df.filter(['class'])Next, we need to split the dataset to train the models and test them on the splits.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.8, random_state=0)Now it's time to specify the models to be compared and the methods of comparison.
models = [] models.append(('ABC', AdaBoostClassifier())) models.append(('BC ', BaggingClassifier())) models.append(('ETC', ExtraTreesClassifier())) models.append(('GBC', GradientBoostingClassifier())) models.append(('RFC', RandomForestClassifier()))We're creating a list to store all models. First value of each list object are the initials of model names. There's one extra space character at the end of BC, that is for alignment.
This also is the part you can change if you want to make similar classification comparisons. Just append any model that you've imported into this list (remember to use that model's own initials!). You can also change the default parameters by stating them within the brackets.
kfold = model_selection.KFold(n_splits=10, random_state=7) scoring = 'accuracy'KFold will split the training data into 10 pieces, and use each one as validation while the other 9 pieces serve as training data.
Models will be scored according to their accuracy. It's time to see what the models can produce.
for name, model in models: start_time = time.time() cv_results = model_selection.cross_val_score(model, x, y.values.ravel(), cv=kfold, scoring=scoring) print('%s: %f in %.2f seconds' % (name, cv_results.mean(), time.time() - start_time))This part of the code will print the name, accuracy and time needed to complete the KFold validation for each model.
The start_time will fetch the exact time when the for loop starts a new loop. When it's time to print the result, that time will be subtracted from the current time, and '%.2f' part of the code will print it as a float with 2 decimal places.
Here are the results:
ABC: 0.843600 in 8.53 seconds BC : 0.988700 in 4.99 seconds ETC: 0.970200 in 0.54 seconds GBC: 0.989700 in 22.41 seconds RFC: 0.987900 in 1.31 secondsGradient Boosting Classifier has the best score, but also took the longest time to run. Bagging Classifier is a close second with much better running time.
Let's use the models for predictions:
y_pred = [] for name, model in models: model.fit(x_train,y_train.values.ravel()) PredictionResults = model.predict(x_test) y_pred.append([name, PredictionResults]) print('%s fitted and used for predictions.' % name)We have a list called y_pred, and it will save all the prediction information for each model. Notice that the name of the model is also recorded into the list because we will need it during the creation of confusion matrices.
The data is fitted into all models, prediction results are individually saved into the 'PredictionResults' and then appended to y_pred for future use. Because this is a loop, anything you want to use later must be recorded into a variable that is out of the loop!
The print command is here to show that the code ran succesfully.
ABC fitted and used for predictions. BC fitted and used for predictions. ETC fitted and used for predictions. GBC fitted and used for predictions. RFC fitted and used for predictions.So, we got our predictions. But how do we see and compare them?
Here we'll use confusion matrices. Below is a custom function for creating them; this was taken from an example on Scikit-Learn website and several changes were made to give it a little bit of a custom look.
def plot_confusion_matrix(cm, classes, normalize=False, title='Confusion matrix', cmap=plt.cm.Greens): if normalize: cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis] plt.imshow(cm, interpolation='nearest', cmap=cmap) plt.title(title) plt.colorbar() tick_marks = np.arange(len(classes)) plt.xticks(tick_marks, classes) plt.yticks(tick_marks, classes) fmt = '.2f' if normalize else 'd' thresh = cm.max() / 2. for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])): plt.text(j, i, format(cm[i, j], fmt), horizontalalignment="center", color="white" if cm[i, j] > thresh else "black") plt.tight_layout() plt.ylabel('True label') plt.xlabel('Predicted label')This code can be used for both standard and normalized confusion matrices. It has a default title, and the color map can be specified. Details of this code won't be discussed here since it's out of the scope of this article.
In the final step, we'll run two confusion matrices for each prediction set; one will be standard, one will be normalized. We'll use another for loop to do this.
for name, pred in y_pred: conmat = confusion_matrix(y_test, pred) y_names=np.unique(y_test.iloc[:,0:1].values) f = plt.figure(figsize=(12, 5)) f.add_subplot(1,2,1) plot_confusion_matrix(conmat, classes=y_names, title='%s Standard Confusion Matrix' % name) f.add_subplot(1,2,2) plot_confusion_matrix(conmat, classes=y_names, normalize=True, title='%s Normalized Confusion Matrix' % name) plt.subplots_adjust(wspace=0.2) plt.show()All our predictions in our y_pred list go through the confusion_matrix() function and saved as 'conmat' within the loop. Names of the target variables are taken from the y_test dataset. We create a new figure with 12 to 5 inches in size. Then the plot_confusion_matrix() custom function takes over: It is run twice for each figure and creates two confusion matrix subplots. The ones at the left are the standard plots, the ones at the right are the normalized plots. The plots will have a space between them that is 20% of the whole axis width.
All confusion matrices are shown below:
As you can see, confusion matrices allow us to easily notice where the models were successful and where they failed.
AdaBoost Classifier had a big problem labeling stars; it labeled 78% of all stars within the test dataset as galaxies.
Bagging Classifier was quite successful, its only notable fault was labeling 6% of quasars as galaxies.
Extra Trees Classifier performed well, though had problems both in quasar and star labeling.
Gradient Boosting Classifier has the best results, maximum partial error rate is 5%.
Random Forest Classifier accomplished good results, again having difficulties labeling quasars. Note that each time the same code is run, you might have slightly different results. If your only choice of classifying these objects would be one of these 5 models with their default parameters, you would have to go with the Gradient Boosting Classifier despite the 5% partial error rate - which is not that high and also consider that the class where this error rate pccurs is smaller than the other classes. In real world, you probably wouldn't have such model selection limitations, so you might choose one of the following directions to follow:
- Play around with default parameters of each ensemble model until you get a better result. Keep in mind that the both AdaBoost and Bagging Classifiers have a base_estimator parameter where you can change the standard decision tree estimator into something else, like Logistic Regression.
- All 5 models had some level of problem when it came to labeling quasars; this indicates that the dataset itself may not contain sufficient information to classify all objects correctly. You may want to search for other related variables to increase model accuracy.
- Look for alternative models, such as Artifical Neural Network Classifier, and measure their accuracies on the dataset.
Have a nice day,
Emir Korkut Ünal