# ***Comparing Machine Learning algorithms with the Penn Machine Learning Benchmarks (PMLB)***

This notebook is mainly based on the article *PMLB: a large benchmark suite for machine learning evaluation and comparison* by Randal S. Olson et al. (1).

Notebook by Adrien Delgado.

*Disclaimer: this notebook was made on and designed for Google Colab. It can be run locally, but the visuals might not render as well in Jupyter and JupyterLab as in Colab. Also, it looks better in light mode, so if you have dark mode enabled, you might want to switch back.*



# Background

Nowadays, the number of problems and algorithms suited to Machine Learning is skyrocketing. The multiplicity of Machine Learning (ML) methods makes their selection, development, and comparison very difficult. The specifics of each problem, as well as its goals, can drastically change the performance of one algorithm relative to another. So how can we efficiently benchmark the different ML algorithms on each problem?

Numerous publicly available real-world and simulated benchmark datasets have emerged from different sources, but their organization and adoption as standards have been inconsistent. 

Therefore, selecting and preprocessing specific benchmarks is still a burden for machine learning practitioners and data scientists.

Here is a quick overview of the different ML algorithms' history:
![History of ML algorithms](https://media.springernature.com/full/springer-static/image/art%3A10.1186%2Fs13174-018-0087-2/MediaObjects/13174_2018_87_Fig5_HTML.png?as=webp)


# Introduction

### What is benchmarking?
![Benchmarking](https://cdn2.hubspot.net/hubfs/2621212/job-benchmarking-basics.gif)
The term **benchmarking** is used in ML to refer to the **evaluation and comparison of ML methods** regarding their ability to learn patterns in ‘benchmark’ datasets that have been applied as ‘standards’. 

Benchmarking can be thought of simply as a **sanity check** to confirm that a new method runs as expected and can reliably find simple patterns that existing methods are known to identify (3). A more rigorous way to view benchmarking is as an approach to identify the respective **strengths and weaknesses** of a given methodology in contrast with others (4). Comparisons can be made over a range of evaluation metrics, e.g., power to detect signal, prediction accuracy, computational complexity, and model interpretability. 

This approach to benchmarking would be important for demonstrating new methodological abilities or simply to guide the selection of an appropriate ML method for a given problem.



### Types of benchmark datasets

Benchmark **datasets** typically take one of **three forms**. The first is accessible, well-studied *real-world data*, taken from different real-world problem domains of interest. The second is *simulated data*, or data that has been artificially generated, often to ‘look’ like real-world data, but with known, underlying patterns. The third form is *toy data*, which we will define here as data that is also artificially generated with a known embedded pattern but without an emphasis on representing real-world data, and has often been used to describe a small and simple dataset such as the examples included with algorithm software.

While some benchmark repositories and datasets have emerged as more popular than others, ML still lacks a **central, comprehensive, and concise set of benchmark datasets** that accentuate the strengths and weaknesses of established ML methods. Individual studies often restrict their benchmarking efforts for various reasons, for example based on comparing variants of the ML algorithm of interest. The scope of benchmarking may also be limited by practical computational requirements.



### Benchmarking challenges

There are currently a **number of challenges** that make it difficult to benchmark ML methods in a useful and globally accepted manner. For one, an overwhelming number of publications reference the use of benchmark datasets, yet surprisingly few discuss the topic of appropriate ML benchmarking in general. Additionally, **collecting and curating** real-world benchmark datasets remains a challenge for many researchers (5). As a result, many benchmark datasets go unused simply because they are **too difficult to preprocess**. 

Another challenge in benchmarking is that researchers often use only a handful of datasets when evaluating their methods, which can make it difficult to properly compare one ML method to the state-of-the-art ML methods (5). For example, these datasets may be handpicked to **highlight the strengths** of the proposed method, while **failing to demonstrate** the proposed method’s **potential weaknesses**. As a result, although an ML method may perform well on a handful of datasets, it may fail to generalize to a broader range of problems.

![preprocessing](https://thelastbyteblog.files.wordpress.com/2020/06/data-scrubbing-service.gif?w=800&zoom=2)

### The Penn Machine Learning Benchmark

Given these issues, it is vital for the bioinformatics and ML communities to have a comprehensive benchmark suite with which to compare and contrast ML methods. Towards this goal, the article (1) introduces the **Penn Machine Learning Benchmark (PMLB)**, a publicly available dataset suite initialized with real-world, simulated, and toy benchmark datasets for evaluating supervised classification methods. The PMLB is not meant to be comprehensive; it initially included many real-world datasets and focused only on **classification**, totalling 165 datasets. Since version 1.0, it also includes **regression** datasets, for a **total suite of 286 datasets** with a broad distribution of characteristics.

![dataset_sizes](https://github.com/EpistasisLab/pmlb/raw/master/datasets/dataset_sizes.svg)



### Algorithms comparison and evaluation
The number of ML algorithms keeps increasing every year, so trying to cover all of them would be unrealistic. The study here focuses on a fixed number of methods, chosen for their diversity and widespread use.

Further, we evaluate the performance of **13 standard statistical ML methods** from scikit-learn (8) over the full set of PMLB datasets. We then assess the diversity of these benchmark datasets from the perspective of their meta-features as well as based on the predictive performance over the set of ML methods applied. Beyond introducing a new simplified resource for ML benchmarks, this study was designed to provide insight into the limitations of currently utilized benchmarks, and direct the expansion and curation of a future improved PMLB dataset suite that more efficiently and comprehensively allows for the comparison of ML methods. This work provides another important step toward the assembly of an effective and diverse set of benchmarking standards integrating real-world, simulated, and toy datasets for generalized ML evaluation and comparison.

# Penn machine learning benchmark

## Overview

We compiled the Penn Machine Learning Benchmark (PMLB) datasets from a **wide range of existing ML benchmark suites**; as such, the PMLB includes most of the real-world benchmark datasets commonly used in ML benchmarking studies.

All datasets are stored in a **common format**:

- First row is the column names
- Each following row corresponds to one row of the data
- The target column is named `target`
- All columns are tab (\t) separated
- All files are compressed with gzip to conserve space
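
As a rough illustration of this layout, the sketch below writes a tiny made-up table in the same format (header row, tab-separated, gzip-compressed, label column named `target`) and reads it back with pandas. The file name and the values are invented for the example; in practice PMLB files are fetched through the `pmlb` package rather than read by hand.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data laid out like a PMLB file (values are made up)
toy = pd.DataFrame({
    'feature_a': [0, 1, 1, 0],
    'feature_b': [0.5, 1.2, 3.3, 0.1],
    'target': [0, 1, 1, 0],
})

# Write it as a gzip-compressed, tab-separated file with a header row
path = os.path.join(tempfile.mkdtemp(), 'toy_dataset.tsv.gz')
toy.to_csv(path, sep='\t', index=False, compression='gzip')

# Read it back: the first row gives the column names, columns are tab-separated
loaded = pd.read_csv(path, sep='\t', compression='gzip')

# Split features and labels using the conventional `target` column
toy_X = loaded.drop(columns='target').values
toy_y = loaded['target'].values
print(loaded.columns.tolist(), toy_X.shape, toy_y.shape)
```

Because every dataset follows the same convention, the same two lines (drop `target` for the features, select `target` for the labels) work for any file in the suite.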

The complete table of dataset characteristics is also available for download (7). Please note, in our documentation, a feature is considered:

- "binary" if it is of type integer and has 2 unique values (equivalent to pandas profiling's "boolean")
- "categorical" if it is of type integer and has more than 2 unique values (equivalent to pandas profiling's "categorical")
- "continuous" if it is of type float (equivalent to pandas profiling's "numeric").
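
The rule above can be sketched in a few lines of pandas. This is only an illustration of the integer/float convention as described here, not the code used to build the PMLB documentation; the helper name `feature_type` and the toy columns are made up.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype


def feature_type(column):
    """Label a column following the integer/float rule described above."""
    if is_float_dtype(column):
        return 'continuous'
    if is_integer_dtype(column):
        n_unique = column.nunique()
        if n_unique == 2:
            return 'binary'
        if n_unique > 2:
            return 'categorical'
    # Anything the rule does not cover (constant columns, strings, ...)
    return 'unknown'


# Hypothetical toy columns, one per feature type
toy_features = pd.DataFrame({
    'sex': [0, 1, 1, 0],
    'education': [1, 2, 3, 1],
    'hours': [40.0, 37.5, 20.0, 45.0],
})
feature_types = {name: feature_type(toy_features[name]) for name in toy_features.columns}
print(feature_types)
```

Note that the rule depends on how a column is stored: an integer-valued feature saved as floats would be counted as continuous.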

## Datasets Meta-Features

Here are the meta-features of the classification datasets of the PMLB:

![link text](https://media.springernature.com/full/springer-static/image/art%3A10.1186%2Fs13040-017-0154-4/MediaObjects/13040_2017_154_Fig1_HTML.gif?as=webp)

- *#Instances*: The number of instances in each dataset.

- *#Features*: The number of features in each dataset.

- *#Binary Features*: The number of categorical features in each dataset with only two levels.

- *#Categorical and Ordinal Features*: The number of discrete features in each dataset with > 2 levels.

- *#Continuous Features*: The number of continuous-valued features in each dataset. Continuous features were distinguished from categorical and ordinal features automatically, based on whether a variable was stored as a ‘float’ in a pandas DataFrame.

- *Endpoint Type*: Whether each dataset is a binary or multiclass supervised classification problem. Again, continuous endpoints for regression have been excluded in this study.

- *#Classes*: The number of classes to predict in each dataset’s endpoint.

- *Class Imbalance*: The level of class imbalance in each dataset ∈[0, 1), where 0.0 corresponds to perfectly balanced classes and a value approaching 1.0 corresponds to extreme class imbalance, i.e. where nearly all instances have the same class value. Imbalance is calculated by measuring the squared distance of each class’s instance proportion from perfect balance in the dataset, as:

  $I = K\sum_{i=1}^{K}\left(\frac{n_i}{N}-\frac{1}{K}\right)^2$

  where $n_i$ is the number of instances of class $i \in Y$, $N$ is the total number of instances, and $K$ is the number of classes.
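
For readers who prefer code, here is a direct transcription of the formula as written above. It assumes that $N$ is the total number of instances and $K$ the number of classes; the function name `class_imbalance` is made up, and this is not necessarily how the PMLB maintainers compute the value.

```python
from collections import Counter


def class_imbalance(labels):
    """Transcription of I = K * sum_i (n_i / N - 1 / K)^2 (assumed N, K)."""
    counts = Counter(labels)
    n_total = len(labels)     # N: total number of instances (assumption)
    n_classes = len(counts)   # K: number of classes (assumption)
    return n_classes * sum(
        (n_i / n_total - 1 / n_classes) ** 2 for n_i in counts.values()
    )


balanced = class_imbalance([0, 0, 1, 1])   # two classes, two instances each
skewed = class_imbalance([0, 0, 0, 1])     # three instances against one
print(balanced, skewed)
```

The balanced example gives 0, and the value grows as one class takes a larger share of the instances.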



# Time for fun!

![coding](https://miro.medium.com/max/1600/1*vJjJ3Mdok6Rvxx85IIRqBQ.gif)

## The Datasets

Ok, now let's run that thing. We will begin by importing the package, which is conveniently named pmlb. 

*If you already have the pmlb package, you can skip the next cell.*

In [None]:
pip install pmlb

Importing datasets is relatively easy and done through the `fetch_data` function.

We can also find a list of classification and regression datasets under `classification_dataset_names` and `regression_dataset_names` respectively.



In [None]:
from pmlb import fetch_data, classification_dataset_names, regression_dataset_names

Let's see what a typical dataset looks like:

In [None]:
# Returns a pandas DataFrame
adult_data = fetch_data('adult')
print(adult_data.describe())

The `fetch_data` function has two additional parameters:

- `return_X_y` (True/False): Whether to return the data in scikit-learn format, with the features and labels stored in separate NumPy arrays.
- `local_cache_dir` (string): The directory on your local machine in which to store the data files so you don't have to fetch them over the web again. By default, the wrapper does not use a local cache directory. (If you are on Colab, there is no real benefit from using `local_cache_dir`.)

For example:

In [None]:
# Run this cell if you use your local machine 
# (unless you don't want to have more files locally)

# Returns NumPy arrays
adult_X, adult_y = fetch_data('adult', return_X_y=True, local_cache_dir='./datasets/')
print(adult_X)
print(adult_y)

In [None]:
# Run this cell if you use Google Colab

# Returns NumPy arrays
adult_X, adult_y = fetch_data('adult', return_X_y=True)
print(adult_X)
print(adult_y)

You can also use `dataset_names` to get the list of all dataset names, covering both classification and regression, instead of `classification_dataset_names` and `regression_dataset_names`.

In [None]:
from pmlb import dataset_names

print(dataset_names)

In [None]:
print(classification_dataset_names)
print('')
print(regression_dataset_names)

## Evaluating ML Algorithms

Now that we know how to import the datasets, let's take a look at the algorithms and see how they compare. The PMLB works well with the scikit-learn library (8), which includes numerous ML classifiers and regressors. Perfect for what we need!

![scikit-learn](https://scikit-learn.org/stable/_static/ml_map.png)

Let's begin by importing a couple of classifiers from scikit-learn, along with plotting libraries to show our results.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
import seaborn as sb

We will start by comparing the performance of two classifiers on the first one of the datasets included in the PMLB, called *GAMETES_Epistasis_2_Way_1000atts_0.4H_EDM_1_EDM_1_1*.

In [None]:
gametes_data = fetch_data('GAMETES_Epistasis_2_Way_1000atts_0.4H_EDM_1_EDM_1_1')
print(gametes_data.describe())

Looking at the mean of the target column in the dataset, we can notice that it is perfectly balanced, which seems to make it a good starting point for our comparison.

In [None]:
X, y = fetch_data('GAMETES_Epistasis_2_Way_1000atts_0.4H_EDM_1_EDM_1_1', return_X_y=True)
train_X, test_X, train_y, test_y = train_test_split(X, y)

logit = LogisticRegression(max_iter=500)
gnb = GaussianNB()

logit.fit(train_X, train_y)
gnb.fit(train_X, train_y)

logit_test_score = logit.score(test_X, test_y)
gnb_test_score = gnb.score(test_X, test_y)

print('---Comparison of the algorithms---')
print('Logistic regressor: accuracy =',logit_test_score)
print('Gaussian Naive Bayes: accuracy =',gnb_test_score,'\n')

Let's take a quick look at the results. It seems that for this dataset, logistic regression and Gaussian Naive Bayes have similar accuracy. You can run the above cell a few more times to see how the results fluctuate, since each run draws a new random train/test split.
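
Instead of re-running the cell by hand, the fluctuation can be measured by repeating the split with different seeds. The sketch below uses synthetic data from `make_classification` as a stand-in so that it runs offline; the `_demo` names and the choice of 10 repetitions are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not a PMLB dataset), so the sketch runs offline
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)

split_scores = []
for seed in range(10):
    # A different random_state gives a different train/test partition
    tr_X, te_X, tr_y, te_y = train_test_split(X_demo, y_demo, random_state=seed)
    model = LogisticRegression(max_iter=500).fit(tr_X, tr_y)
    split_scores.append(model.score(te_X, te_y))

print('mean accuracy =', np.mean(split_scores))
print('std across splits =', np.std(split_scores))
```

The standard deviation gives a feel for how much of a difference between two classifiers could be explained by the split alone.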

However, something we can notice is that the datasets from PMLB are not scaled. This is a choice from the development team, to stick to the original data as much as possible.

In order to have standardized results across the different datasets in the PMLB, we need to scale the data so that features are comparable. We can use scikit-learn's `StandardScaler` for that (each feature is rescaled to zero mean and unit variance), along with a pipeline for convenient implementation.
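
To make the effect of the standard scaler concrete, here is a minimal sketch on a made-up array whose two columns live on very different scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up features on very different scales
raw = np.array([[1.0, 1000.0],
                [2.0, 3000.0],
                [3.0, 5000.0]])

scaled = StandardScaler().fit_transform(raw)

# After scaling, each column has mean 0 and standard deviation 1
print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```

Wrapping the scaler in a pipeline has a practical advantage: the mean and standard deviation are learned from the training data only when `fit` is called, and then reused as-is on the test data.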

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

In [None]:
X, y = fetch_data('GAMETES_Epistasis_2_Way_1000atts_0.4H_EDM_1_EDM_1_1', return_X_y=True)
train_X, test_X, train_y, test_y = train_test_split(X, y)

logit = LogisticRegression(max_iter=500)
gnb = GaussianNB()

pipe_logit = make_pipeline(StandardScaler(), logit)
pipe_gnb = make_pipeline(StandardScaler(), gnb)

pipe_logit.fit(train_X, train_y)
pipe_gnb.fit(train_X, train_y)

logit_test_score = pipe_logit.score(test_X, test_y)
gnb_test_score = pipe_gnb.score(test_X, test_y)

print('Comparison of the algorithms')
print('Logistic regressor: accuracy =',logit_test_score)
print('Gaussian Naive Bayes: accuracy =',gnb_test_score)

Ok, after scaling, the two classifiers still have a similar accuracy on this dataset. From this result alone, one might conclude that logistic regression and Gaussian Naive Bayes are similar in terms of performance. 

But is it *true*? Can we just state that they are similar based on a single dataset? 

Luckily, thanks to PMLB, we can check that in an easy way. Let's just try with another dataset!

In [None]:
X, y = fetch_data('Hill_Valley_with_noise', return_X_y=True)
train_X, test_X, train_y, test_y = train_test_split(X, y)

logit = LogisticRegression(max_iter=500)
gnb = GaussianNB()

pipe_logit = make_pipeline(StandardScaler(), logit)
pipe_gnb = make_pipeline(StandardScaler(), gnb)

pipe_logit.fit(train_X, train_y)
pipe_gnb.fit(train_X, train_y)

logit_test_score = pipe_logit.score(test_X, test_y)
gnb_test_score = pipe_gnb.score(test_X, test_y)

print('Comparison of the algorithms')
print('Logistic regressor: accuracy =',logit_test_score)
print('Gaussian Naive Bayes: accuracy =',gnb_test_score)

Ok, so this time the difference is substantial, in favor of logistic regression on this particular dataset.

One of the main goals of PMLB is being able to **benchmark algorithms on multiple datasets at once**. We will now iterate over all classification datasets in PMLB, and look at the distribution of accuracy for each method.

- NB: the `max_iter` parameter of the logistic regression sets the maximum number of solver iterations for each fit. It defaults to 100, but the solver will fail to converge on a few datasets with that value. To avoid that, you can increase `max_iter`, but be careful as it will also increase the computation time of your benchmark! For this first test, I recommend leaving the default value, but you can try increasing it a little if you want better results.
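
When the iteration limit is reached, scikit-learn emits a `ConvergenceWarning`. The sketch below shows one way to detect it programmatically; it uses synthetic stand-in data and a deliberately tiny `max_iter` to trigger the warning, and the helper name `converged` is made up.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (not a PMLB dataset)
X_conv, y_conv = make_classification(n_samples=300, n_features=20, random_state=0)


def converged(max_iter):
    """Return True if fitting finishes without a ConvergenceWarning."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')
        LogisticRegression(max_iter=max_iter).fit(X_conv, y_conv)
    return not any(issubclass(w.category, ConvergenceWarning) for w in caught)


for max_iter in (1, 1000):
    print('max_iter =', max_iter, '-> converged:', converged(max_iter))
```

A check like this could be used in a benchmark loop to count how many datasets did not converge at a given `max_iter`, rather than scrolling through the warnings.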

### **Coffee break!** *(...or tea if you are British)*
The next cell takes a few minutes to compute (5-6 minutes on Colab if `max_iter` is set to 100), so now is a good time to take something warm and relax a bit. 

### **Do not forget to run the next cell before leaving!**
![coffee_break](https://i.pinimg.com/originals/b5/0b/47/b50b4768194455943ca0f1cf07fcf9af.gif)

In [None]:
logit_test_scores = []
gnb_test_scores = []

for classification_dataset in classification_dataset_names:
    X, y = fetch_data(classification_dataset, return_X_y=True)
    train_X, test_X, train_y, test_y = train_test_split(X, y)

    logit = LogisticRegression(max_iter=100)
    gnb = GaussianNB()

    pipe_logit = make_pipeline(StandardScaler(), logit)
    pipe_gnb = make_pipeline(StandardScaler(), gnb)

    pipe_logit.fit(train_X, train_y)
    pipe_gnb.fit(train_X, train_y)

    logit_test_scores.append(pipe_logit.score(test_X, test_y))
    gnb_test_scores.append(pipe_gnb.score(test_X, test_y))

sb.boxplot(data=[logit_test_scores, gnb_test_scores], notch=True)
plt.xticks([0, 1], ['LogisticRegression', 'GaussianNB'])
plt.ylabel('Test Accuracy')

Looking at the results, it becomes clear that one of the algorithms performs noticeably better than the other across PMLB, with a median around 80% accuracy against 70-75% for the other.

Tweaking the meta-parameters (often called hyperparameters) of each algorithm can lead to better results, as we will see just below.

### Algorithms meta-parameters

As said before, to provide a basis for comparison, we evaluated 13 supervised ML classification methods from scikit-learn on the 165 classification datasets in PMLB.

Once the datasets were scaled, we performed a comprehensive **grid search** of each of the ML method’s parameters using **10-fold cross-validation** to find the best parameters (according to mean cross-validation balanced accuracy) for each ML method on each data set. This process resulted in a **total of over 5.5 million evaluations** of the 13 ML methods over the 165 data sets. For a comprehensive parameter search, we used expert knowledge about the ML methods to decide what parameters and parameter values to evaluate. It should be noted that due to the different number of parameters for each algorithm, **not every algorithm had the same number of evaluations**. A complete table of ML algorithms meta-parameters is available with the original article (9).
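
The procedure described above can be sketched with scikit-learn's `GridSearchCV`, using 10 folds and balanced accuracy as the selection metric. The data here is synthetic and the grid is a small hypothetical one chosen for speed; the grids actually used in the article are listed in (9).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, mildly imbalanced stand-in data (not a PMLB dataset)
X_grid, y_grid = make_classification(
    n_samples=300, n_features=8, weights=[0.7, 0.3], random_state=0
)

# Hypothetical small grid, addressed as '<pipeline step>__<parameter>'
param_grid = {
    'kneighborsclassifier__n_neighbors': [3, 5, 7],
    'kneighborsclassifier__weights': ['uniform', 'distance'],
}
n_folds = 10

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
demo_search = GridSearchCV(pipe, param_grid, cv=n_folds, scoring='balanced_accuracy')
demo_search.fit(X_grid, y_grid)

# Each parameter combination is fitted once per cross-validation fold
n_cv_fits = len(ParameterGrid(param_grid)) * n_folds
print('cross-validation fits:', n_cv_fits)
print('best parameters:', demo_search.best_params_)
print('best mean balanced accuracy:', demo_search.best_score_)
```

The fit count is the number of parameter combinations times the number of folds, which is why methods with more parameters end up with more evaluations than others.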

Due to limited computation time, we cannot afford to reevaluate the whole PMLB with cross-validation for each method here. What we can do is take one algorithm and see how meta-parameter tuning affects its performance.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

In [None]:
wine_data = fetch_data('wine_quality_white')
print(wine_data.describe())

In [None]:
X, y = fetch_data('wine_quality_white', return_X_y=True)

train_X, test_X, train_y, test_y = train_test_split(X, y)

# Both models share the same scaling step, so only the meta-parameters differ
tuned_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
default_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Parameters of a pipeline step are addressed as '<step name>__<parameter>'
meta_parameters = {
    'kneighborsclassifier__weights': ['uniform', 'distance'],
    'kneighborsclassifier__n_neighbors': [3, 4, 5, 6, 7, 8]
}

search = GridSearchCV(tuned_pipe, meta_parameters, cv=6)

search.fit(train_X, train_y)
default_pipe.fit(train_X, train_y)

grid_score = search.score(test_X, test_y)
default_score = default_pipe.score(test_X, test_y)

print('\n---Comparison of the algorithms---')
print('Best meta-parameters:', search.best_params_)
print('Optimized KNN: accuracy =', grid_score)
print('Default KNN: accuracy =', default_score)

### Yes but...

I know what you may say:
- After looking at the dataset, one could have guessed that the default KNN settings (5 neighbors, uniform weights) were not the best fit from the beginning.
- The dataset was maybe chosen to show a difference, and sometimes the results will be the same with or without parameter tuning.
- We could just look at the data, take the right parameters and save computation time...

Yes, you are absolutely right, a good look at the data and careful parameter picking is sometimes more efficient than cross-validation. But I may also answer...
- `n_neighbors` is the number of neighbors that vote, not the number of target classes, so even here the best value is not obvious from the data. For other algorithms, the link between data and parameters may be even harder to understand.
- The whole point of the PMLB is to automate and almost 'industrialize' the benchmarking; it is not about fine-tuning each parameter for each method on each dataset.


# Results

![results](https://i.pinimg.com/originals/80/04/e7/8004e78d9a4d63d94f3cff837e27790c.gif)

## Datasets meta-features

We used k-means to **cluster the normalized meta-features of the datasets** into 5 clusters, visualized along the first two principal component axes in the figure below (note that the first two components of the PCA explain 49% of the variance, so we expect some overlap between clusters in the visualization).

![datasets meta-features](https://media.springernature.com/full/springer-static/image/art%3A10.1186%2Fs13040-017-0154-4/MediaObjects/13040_2017_154_Fig2_HTML.gif?as=webp)

![datasets meta-features](https://media.springernature.com/full/springer-static/image/art%3A10.1186%2Fs13040-017-0154-4/MediaObjects/13040_2017_154_Fig3_HTML.gif?as=webp)

- Clusters 0 and 1 contain most of the datasets, and are separated by their **endpoint type**, i.e. cluster 0 is comprised of binary classification problems, whereas cluster 1 is comprised of multiclass classification problems. 
- Cluster 2 is made up of 3 datasets with relatively **high numbers of features** (including a GAMETES dataset with 1000 features and the MNIST dataset with 784). 
- Cluster 3 contains datasets with **high imbalance between classes**. 
- Finally, cluster 4 is reserved for the KDD Cup dataset, which has an **exceptionally high number of instances** (nearly 500,000). 

The clustering analysis thus reflects fairly intuitive ways in which the challenges presented by a particular dataset can be categorized, namely: large numbers of instances, large numbers of features, high class imbalance, and binary versus multiclass classification.
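
The clustering and projection steps described in this section could look roughly like the sketch below. The meta-feature table is random stand-in data (165 datasets by 8 meta-features is an assumption made for the example), so the clusters it finds have no meaning; only the sequence of steps is illustrated.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random stand-in for a (datasets x meta-features) table; values are made up
rng = np.random.default_rng(0)
meta_features = rng.lognormal(size=(165, 8))

# Normalize the meta-features so that no single one dominates the distances
normalized = StandardScaler().fit_transform(meta_features)

# Group the datasets into 5 clusters
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(normalized)

# Project onto the first two principal components for plotting
pca = PCA(n_components=2)
projected = pca.fit_transform(normalized)
explained = pca.explained_variance_ratio_.sum()
print('variance explained by 2 components:', explained)
```

A scatter plot of `projected` colored by `cluster_labels` would give a figure in the spirit of the one shown above. Clusters that are separated in the full meta-feature space can still overlap in such a plot when two components explain only part of the variance.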

## Model-dataset biclustering

We now analyze the datasets based on ML performance in a “**Model-dataset biclustering**” figure (see below), which identifies **which datasets can be solved with high or low accuracy**, as well as **which datasets appear universally easy or hard** for the set of different ML algorithms to model accurately **versus which ones appear to be particularly useful for highlighting differential ML algorithm performance**.

![bi-clustering](https://media.springernature.com/full/springer-static/image/art%3A10.1186%2Fs13040-017-0154-4/MediaObjects/13040_2017_154_Fig4_HTML.gif?as=webp)

- a. Biclustering of the 13 ML models and 165 datasets according to the balanced accuracy of the models using their best parameter settings **-> Basically how they perform, the more blue the better!**
- b. Deviation from the mean balanced accuracy across all 13 ML models. Highlights datasets upon which all ML methods performed similarly versus those where certain ML methods performed better or worse than others **-> Basically how they compare to others on the same dataset, the more blue the better!**
- c. Identifies the boundaries of the 40 contiguous biclusters identified based on the 4 ML-wise clusters by the 10 data-wise clusters **-> similar algorithms are grouped together here.**
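
As a rough sketch of how such a figure could be produced, the block below biclusters a random (datasets x models) score table with scikit-learn's `SpectralBiclustering`. The choice of this routine is an assumption made for illustration and may not match what the article's authors used; the scores are random, so the resulting groups carry no meaning.

```python
import numpy as np
from sklearn.cluster import SpectralBiclustering

# Random stand-in for a (datasets x models) balanced-accuracy table
rng = np.random.default_rng(0)
scores = rng.uniform(0.5, 1.0, size=(165, 13))

# Panel b idea: how far each model is from the dataset's mean score
deviation = scores - scores.mean(axis=1, keepdims=True)

# 10 dataset-wise clusters by 4 model-wise clusters, as in panel c
bicluster = SpectralBiclustering(n_clusters=(10, 4), random_state=0)
bicluster.fit(scores)

# Reorder rows and columns so that members of a bicluster are contiguous
row_order = np.argsort(bicluster.row_labels_)
col_order = np.argsort(bicluster.column_labels_)
reordered = scores[row_order][:, col_order]
print(reordered.shape, deviation.shape)
```

Plotting `reordered` as a heatmap would give a picture comparable to panel a, and applying the same reordering to `deviation` would correspond to panel b.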

## Performance by dataset

Overall, the current suite of datasets spans a **reasonable range of difficulty** for the tested ML approaches. The figure below shows the distribution of scores for each tuned ML method for each dataset in the suite, sorted by best balanced accuracy score achieved by any method. The left-most dataset corresponds to `clean2`, and the right-most is `analcatdata_dmft`, with a maximum accuracy score of 0.544 for the methods tested. Approximately half (87) of the current suite can be classified with a balanced accuracy of 0.9 or higher, and nearly all (98.8%) of the datasets can be classified with a balanced accuracy of 0.6 or higher. Thus, although a range of model fidelity is observed, **the datasets are biased towards problems that can be solved with a higher balanced accuracy**.

A good challenge now could be trying to find algorithms that perform well on the right side of this figure, where the best scores remain fairly low.

![performance](https://media.springernature.com/full/springer-static/image/art%3A10.1186%2Fs13040-017-0154-4/MediaObjects/13040_2017_154_Fig5_HTML.gif?as=webp)
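
The summary statistics quoted in this section (the share of datasets that at least one method solves above a threshold) are simple to compute from a score table. The sketch below uses random stand-in scores, so the printed shares are not the article's numbers.

```python
import numpy as np

# Random stand-in for a (datasets x models) balanced-accuracy table
rng = np.random.default_rng(0)
score_table = rng.uniform(0.4, 1.0, size=(165, 13))

# Best score reached by any model on each dataset, sorted from easiest to hardest
best_per_dataset = score_table.max(axis=1)
sorted_best = np.sort(best_per_dataset)[::-1]

# Share of datasets that at least one model solves above a threshold
share_above_09 = np.mean(best_per_dataset >= 0.9)
share_above_06 = np.mean(best_per_dataset >= 0.6)
print(share_above_09, share_above_06)
```

Datasets at the end of `sorted_best` are the ones where even the best method scores low, which is where new algorithms have the most room to stand out.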

# Your time to play!

![play](https://i.pinimg.com/originals/c5/b3/ed/c5b3ed1aa1b9b28be4ce963fdc6347f5.gif)

The cells below will let you **compare any algorithm on the PMLB datasets**. Feel free to use ones that were not originally covered by the article, such as **regression** or even **classification algorithms left aside before**. The first cells will focus on classification, while the others will focus on regression.

## **Have fun!**

In [None]:
# Run that to get the list of all estimators in scikit-learn
from sklearn.utils import all_estimators

### Classification

In [None]:
# Here you will have all the available classifiers in scikit-learn

estimators = all_estimators(type_filter='classifier')

all_clfs = []
for name, ClassifierClass in estimators:
    try:
        clf = ClassifierClass()
        all_clfs.append(clf)
        print(name)
    except Exception as e:
        # Some classifiers cannot be built without arguments
        print(e)

To get the import path of the classifiers you want, just copy and paste their names from the list above into the scikit-learn search bar and add the imports to the cell below (or uncomment the ones you want that are already there): https://scikit-learn.org/stable/search.html

In [None]:
# Classification algorithms, just uncomment the ones you want to try out
# Or search their import path at: https://scikit-learn.org/stable/search.html

from sklearn.dummy import DummyClassifier
from sklearn.ensemble import AdaBoostClassifier

#from sklearn.tree import ExtraTreeClassifier
#from sklearn.tree import DecisionTreeClassifier
#from sklearn.neural_network import MLPClassifier
#from sklearn.neighbors import RadiusNeighborsClassifier
#from sklearn.neighbors import KNeighborsClassifier
#from sklearn.multioutput import ClassifierChain
#from sklearn.multioutput import MultiOutputClassifier
#from sklearn.multiclass import OutputCodeClassifier
#from sklearn.multiclass import OneVsOneClassifier
#from sklearn.multiclass import OneVsRestClassifier
#from sklearn.linear_model import SGDClassifier
#from sklearn.linear_model import RidgeClassifierCV
#from sklearn.linear_model import RidgeClassifier
#from sklearn.linear_model import PassiveAggressiveClassifier
#from sklearn.gaussian_process import GaussianProcessClassifier
#from sklearn.ensemble import VotingClassifier
#from sklearn.ensemble import GradientBoostingClassifier
#from sklearn.ensemble import BaggingClassifier
#from sklearn.ensemble import ExtraTreesClassifier
#from sklearn.ensemble import RandomForestClassifier
#from sklearn.naive_bayes import BernoulliNB
#from sklearn.calibration import CalibratedClassifierCV
#from sklearn.naive_bayes import GaussianNB
#from sklearn.semi_supervised import LabelPropagation
#from sklearn.semi_supervised import LabelSpreading
#from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
#from sklearn.svm import LinearSVC
#from sklearn.linear_model import LogisticRegression
#from sklearn.linear_model import LogisticRegressionCV
#from sklearn.naive_bayes import MultinomialNB
#from sklearn.neighbors import NearestCentroid
#from sklearn.svm import NuSVC
#from sklearn.linear_model import Perceptron
#from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
#from sklearn.svm import SVC


If you want to run your tests on a random subset of the available datasets to save time, enter a value in `NB_OF_DATASETS` (20 is a good compromise).

In [None]:
import numpy.random as rd

###
NB_OF_DATASETS = None 
###

select_clf_datasets = classification_dataset_names
select_reg_datasets = regression_dataset_names

if NB_OF_DATASETS is not None:
  max_clf = len(select_clf_datasets)
  max_reg = len(select_reg_datasets)

  # replace=False so that no dataset is drawn twice
  if NB_OF_DATASETS < max_clf:
    select_clf_datasets = rd.choice(select_clf_datasets, NB_OF_DATASETS, replace=False)
  
  if NB_OF_DATASETS < max_reg:
    select_reg_datasets = rd.choice(select_reg_datasets, NB_OF_DATASETS, replace=False)

In [None]:
clf1_scores = []
clf2_scores = []

for classification_dataset in select_clf_datasets:
    X, y = fetch_data(classification_dataset, return_X_y=True)
    train_X, test_X, train_y, test_y = train_test_split(X, y)

    clf1 = DummyClassifier() ## Pick a classifier from above here
    clf2 = AdaBoostClassifier() ## Pick a classifier from above here

    pipe1 = make_pipeline(StandardScaler(), clf1)
    pipe2 = make_pipeline(StandardScaler(), clf2)

    pipe1.fit(train_X, train_y)
    pipe2.fit(train_X, train_y)

    clf1_scores.append(pipe1.score(test_X, test_y))
    clf2_scores.append(pipe2.score(test_X, test_y))

sb.boxplot(data=[clf1_scores, clf2_scores], notch=True)
plt.xticks([0, 1], ['First classifier', 'Second classifier'])
plt.ylabel('Test Accuracy')

### Regression

In [None]:
# Here you will have all the available regressors in scikit-learn

estimators = all_estimators(type_filter='regressor')

all_regs = []
for name, RegressorClass in estimators:
    try:
        reg = RegressorClass()
        all_regs.append(reg)
        print(name)
    except Exception as e:
        print(e)

To get the import path of the regressors you want, just copy and paste their name from the list above into the search bar of scikit-learn and put them in the cell below: 
https://scikit-learn.org/stable/search.html

In [None]:
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import AdaBoostRegressor

# Add any regressor you want

In [None]:
reg1_scores = []
reg2_scores = []

for regression_dataset in select_reg_datasets:
    X, y = fetch_data(regression_dataset, return_X_y=True)
    train_X, test_X, train_y, test_y = train_test_split(X, y)

    reg1 = DummyRegressor() ## Pick a regressor from above here
    reg2 = AdaBoostRegressor() ## Pick a regressor from above here

    pipe1 = make_pipeline(StandardScaler(), reg1)
    pipe2 = make_pipeline(StandardScaler(), reg2)

    pipe1.fit(train_X, train_y)
    pipe2.fit(train_X, train_y)

    reg1_scores.append(pipe1.score(test_X, test_y))
    reg2_scores.append(pipe2.score(test_X, test_y))

sb.boxplot(data=[reg1_scores, reg2_scores], notch=True)
plt.xticks([0, 1], ['First regressor', 'Second regressor'])
plt.ylabel('Test R² score')
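
One detail worth keeping in mind when reading the regression plot: for scikit-learn regressors, `score` returns the coefficient of determination R² rather than an accuracy. The minimal sketch below checks this on synthetic stand-in data, with `LinearRegression` picked arbitrarily for the example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not a PMLB dataset)
X_reg, y_reg = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
tr_X, te_X, tr_y, te_y = train_test_split(X_reg, y_reg, random_state=0)

reg_pipe = make_pipeline(StandardScaler(), LinearRegression()).fit(tr_X, tr_y)

# For regressors, `score` is the coefficient of determination R^2
pipeline_score = reg_pipe.score(te_X, te_y)
explicit_r2 = r2_score(te_y, reg_pipe.predict(te_X))
print(pipeline_score, explicit_r2)
```

Unlike accuracy, R² can be negative when a model does worse than always predicting the mean of the test targets, so boxplots of regression scores may extend below zero.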

# Sources:


(1) [Randal S. Olson, William La Cava, Patryk Orzechowski, Ryan J. Urbanowicz, and Jason H. Moore (2017). *PMLB: a large benchmark suite for machine learning evaluation and comparison.* BioData Mining 10, page 36.](https://biodatamining.biomedcentral.com/articles/10.1186/s13040-017-0154-4)

(2) [PMLB GitHub repository - EpistasisLab](https://github.com/EpistasisLab/pmlb) 

(3) [Hastie TJ, Tibshirani RJ, Friedman JH. *The elements of statistical learning: data mining, inference, and prediction.* New York: Springer; 2009.](http://scholar.google.com/scholar_lookup?&title=The%20elements%20of%20statistical%20learning%3A%20data%20mining%2C%20inference%2C%20and%20prediction&publication_year=2009&author=Hastie%2CTJ&author=Tibshirani%2CRJ&author=Friedman%2CJH)

(4) [Caruana R, Niculescu-Mizil A. *An empirical comparison of supervised learning algorithms.* In: Proceedings of the 23rd International Conference on Machine Learning. Pittsburgh: ACM: 2006. p. 161–8.](http://scholar.google.com/scholar_lookup?&title=Proceedings%20of%20the%2023rd%20International%20Conference%20on%20Machine%20Learning&publication_year=2006&author=Caruana%2CR&author=Niculescu-Mizil%2CA)

(5) [Macià N, Bernadó-Mansilla E. *Towards UCI+: a mindful repository design.* Inf Sci. 2014; 261:237–62.](http://scholar.google.com/scholar_lookup?&title=Towards%20UCI%2B%3A%20a%20mindful%20repository%20design&journal=Inf%20Sci&volume=261&pages=237-62&publication_year=2014&author=Maci%C3%A0%2CN&author=Bernad%C3%B3-Mansilla%2CE)

(6) [Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E. *Scikit-learn: machine learning in Python.* J Mach Learn Res. 2011; 12:2825–30.](http://scholar.google.com/scholar_lookup?&title=Scikit-learn%3A%20machine%20learning%20in%20Python&journal=J%20Mach%20Learn%20Res&volume=12&pages=2825-30&publication_year=2011&author=Pedregosa%2CF&author=Varoquaux%2CG&author=Gramfort%2CA&author=Michel%2CV&author=Thirion%2CB&author=Grisel%2CO&author=Blondel%2CM&author=Prettenhofer%2CP&author=Weiss%2CR&author=Dubourg%2CV&author=Vanderplas%2CJ&author=Passos%2CA&author=Cournapeau%2CD&author=Brucher%2CM&author=Perrot%2CM&author=Duchesnay%2CE)

(7) [Complete table of PMLB datasets - GitHub](https://github.com/EpistasisLab/pmlb/blob/master/pmlb/all_summary_stats.tsv)

(8) [Scikit-Learn official page](https://scikit-learn.org/stable/index.html)

(9) [PMLB algorithms meta-parameters](https://biodatamining.biomedcentral.com/articles/10.1186/s13040-017-0154-4/tables/1)

![end](https://dwnloadsexy330.weebly.com/uploads/1/2/4/8/124816452/389481239.gif)