<div style='margin: auto; width: 40%;'><h1 style='font-size: 55px; display: inline-block'>OpenML</h1> <img style="float: left; height:80px; margin-right:10px;" src="https://raw.githubusercontent.com/PGijsbers/Talks/master/odsc/images/openml/dots.png"></div>


[OpenML](https://www.openml.org) is an open platform for open science collaboration in machine learning,
used to share datasets and results of machine learning experiments.
It defines machine learning experiments through the following concepts:

 - *Datasets*. Regular (tabular) datasets.
 - *Tasks*. Define splits of a dataset that should be used to evaluate a model (in a reproducible way). E.g. 10-fold cross-validation.
 - *Flows*. Define algorithms, workflows or scripts solving tasks. E.g. RandomForest, auto-sklearn.
 - *Runs*. The result of running a Flow on a Task, i.e. a recorded machine learning experiment.
   Concretely, a run is the set of predictions generated by the trained model, the corresponding scores, and some metadata on its execution.
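
To make the idea of a Task more concrete, the sketch below builds reproducible 10-fold cross-validation splits in plain Python. It is only an illustration: the helper `kfold_indices` and the seed are assumptions of this sketch, not part of `openml-python`, and OpenML tasks come with their own predefined splits rather than being generated like this.

```python
import random


def kfold_indices(n_instances, n_folds=10, seed=0):
    """Return a list of (train_indices, test_indices) pairs, one per fold.

    Hypothetical helper for illustration; not an OpenML API.
    """
    indices = list(range(n_instances))
    # A fixed seed makes the shuffle, and therefore the splits, reproducible.
    random.Random(seed).shuffle(indices)
    # Fold i takes every n_folds-th shuffled index, starting at position i.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for i, test in enumerate(folds):
        # The training set is the union of all folds except fold i.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


# 20000 instances, as in the 'letter' dataset used in this tutorial.
splits = kfold_indices(n_instances=20000, n_folds=10, seed=0)
train, test = splits[0]
print(len(splits), len(train), len(test))  # prints: 10 18000 2000
```

Because every model is evaluated on exactly the same splits, scores from different flows on the same task can be compared directly.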

OpenML hosts thousands of datasets and flows and millions of runs (recorded machine learning experiments).
All of this information is publicly available for download through any of its interfaces (the website, a REST API, and APIs in R, Java, Python and C#). This tutorial uses Python. With this data we will answer questions such as:

 - What datasets are available?
 - What is the best algorithm for a dataset?
 - How do the hyperparameter values affect the performance of an algorithm on a dataset?
 - Can I model this relation?

In [1]:
import openml

You can download from the OpenML server without any authentication. To upload data, however, you need to authenticate by setting an API key. For more on setting up authentication for `openml-python`, see [this tutorial](https://openml.github.io/openml-python/master/examples/20_basic/introduction_tutorial.html#sphx-glr-examples-20-basic-introduction-tutorial-py).

In [2]:
# openml.config.apikey = 'YOUR_KEY_HERE'  # or modify the ~/.openml/config file.

<h2 style='background-color: #4caf50; height:50px; line-height: 50px; color: white; text-align: center;'>Datasets</h2>

In [3]:
openml.datasets.list_datasets(output_format='dataframe', size=5)

Unnamed: 0,did,name,version,uploader,status,format,MajorityClassSize,MaxNominalAttDistinctValues,MinorityClassSize,NumberOfClasses,NumberOfFeatures,NumberOfInstances,NumberOfInstancesWithMissingValues,NumberOfMissingValues,NumberOfNumericFeatures,NumberOfSymbolicFeatures
2,2,anneal,1,1,active,ARFF,684.0,7.0,8.0,5.0,39.0,898.0,898.0,22175.0,6.0,33.0
3,3,kr-vs-kp,1,1,active,ARFF,1669.0,3.0,1527.0,2.0,37.0,3196.0,0.0,0.0,0.0,37.0
4,4,labor,1,1,active,ARFF,37.0,3.0,20.0,2.0,17.0,57.0,56.0,326.0,8.0,9.0
5,5,arrhythmia,1,1,active,ARFF,245.0,13.0,2.0,13.0,280.0,452.0,384.0,408.0,206.0,74.0
6,6,letter,1,1,active,ARFF,813.0,26.0,734.0,26.0,17.0,20000.0,0.0,0.0,16.0,1.0


In [4]:
dataset = openml.datasets.get_dataset('letter')
# openml.datasets.get_dataset(6)  # fetching by id is also possible, and advised for reproducibility (names are not unique)
dataset

OpenML Dataset
Name..........: letter
Version.......: 1
Format........: ARFF
Upload Date...: 2014-04-06 23:19:41
Licence.......: Public
Download URL..: https://www.openml.org/data/v1/download/6/letter.arff
OpenML URL....: https://www.openml.org/d/6
# of features.: 17
# of instances: 20000

In [5]:
data, _, is_categorical, column_names = dataset.get_data()
data.head()

Unnamed: 0,x-box,y-box,width,high,onpix,x-bar,y-bar,x2bar,y2bar,xybar,x2ybr,xy2br,x-ege,xegvy,y-ege,yegvx,class
0,2,4,4,3,2,7,8,2,9,11,7,7,1,8,5,6,Z
1,4,7,5,5,5,5,9,6,4,8,7,9,2,9,7,10,P
2,7,10,8,7,4,8,8,5,10,11,2,8,2,5,5,10,S
3,4,9,5,7,4,7,7,13,1,7,6,8,3,8,0,8,H
4,6,7,8,5,4,7,6,3,7,10,7,9,3,8,3,7,H


<h2 style='background-color: #ff9800; height:50px; line-height: 50px; color: white; text-align: center;'>Tasks</h2>

In [6]:
openml.tasks.list_tasks(output_format='dataframe', data_id=6).head()

Unnamed: 0,tid,ttid,did,name,task_type,status,estimation_procedure,source_data,target_feature,MajorityClassSize,...,NumberOfFeatures,NumberOfInstances,NumberOfInstancesWithMissingValues,NumberOfMissingValues,NumberOfNumericFeatures,NumberOfSymbolicFeatures,evaluation_measures,number_samples,quality_measure,target_value
6,6,1,6,letter,Supervised Classification,active,10-fold Crossvalidation,6,class,813,...,17,20000,0,0,16,1,,,,
236,236,1,6,letter,Supervised Classification,active,33% Holdout set,6,class,813,...,17,20000,0,0,16,1,predictive_accuracy,,,
1705,1705,3,6,letter,Learning Curve,active,10-fold Learning Curve,6,class,813,...,17,20000,0,0,16,1,predictive_accuracy,18.0,,
1770,1770,1,6,letter,Supervised Classification,active,5 times 2-fold Crossvalidation,6,class,813,...,17,20000,0,0,16,1,predictive_accuracy,,,
1886,1886,1,6,letter,Supervised Classification,active,10 times 10-fold Crossvalidation,6,class,813,...,17,20000,0,0,16,1,predictive_accuracy,,,


In [7]:
task = openml.tasks.get_task(6)
task

OpenML Classification Task
Task Type Description: https://www.openml.org/tt/1
Task ID..............: 6
Task URL.............: https://www.openml.org/t/6
Estimation Procedure.: crossvalidation
Target Feature.......: class
# of Classes.........: 26
Cost Matrix..........: Available

<h2 style='background-color: #2196f3; height:50px; line-height: 50px; color: white; text-align: center;'>Flows</h2>

In [8]:
from sklearn.tree import DecisionTreeClassifier

In [9]:
flow_id = openml.flows.get_flow_id(model=DecisionTreeClassifier())  # Comment out this line if no API key is configured.
# flow_id = 17350  # Use this line if no API key is configured.
flow = openml.flows.get_flow(flow_id)
flow

OpenML Flow
Flow ID.........: 17350 (version 58)
Flow URL........: https://www.openml.org/f/17350
Flow Name.......: sklearn.tree.tree.DecisionTreeClassifier
Flow Description: A decision tree classifier.
Upload Date.....: 2019-11-07 19:05:02
Dependencies....: sklearn==0.21.3
numpy>=1.6.1
scipy>=0.9

*Note*: results may vary, because different versions of scikit-learn or its dependencies are considered separate flows.

<h2 style='background-color: #f44336; height:50px; line-height: 50px; color: white; text-align: center;'>Runs</h2>

In [10]:
openml.runs.list_runs(output_format='dataframe', flow=[flow_id])

Unnamed: 0,run_id,task_id,setup_id,flow_id,uploader,task_type,upload_time,error_message
10417575,10417575,6,8254532,17350,869,1,2019-11-17 13:31:57,
10417577,10417577,6,8254534,17350,869,1,2019-11-18 12:53:28,
10417580,10417580,6,8254537,17350,869,1,2019-11-18 15:18:21,
10417581,10417581,6,8254538,17350,869,1,2019-11-18 16:50:20,


In [11]:
run = openml.runs.get_run(10417577)
run

OpenML Run
Uploader Name...: Pieter Gijsbers
Uploader Profile: https://www.openml.org/u/869
Metric..........: None
Run ID..........: 10417577
Run URL.........: https://www.openml.org/r/10417577
Task ID.........: 6
Task Type.......: Supervised Classification
Task URL........: https://www.openml.org/t/6
Flow ID.........: 17350
Flow Name.......: sklearn.tree.tree.DecisionTreeClassifier(58)
Flow URL........: https://www.openml.org/f/17350
Setup ID........: 8254534
Setup String....: None
Dataset ID......: 6
Dataset URL.....: https://www.openml.org/d/6

In [12]:
run.evaluations

OrderedDict([('area_under_roc_curve', 0.938776184239766),
             ('average_cost', 0.0),
             ('f_measure', 0.8823034192477287),
             ('kappa', 0.8775364690090109),
             ('kb_relative_information_score', 0.8807574205814684),
             ('mean_absolute_error', 0.00905769230769213),
             ('mean_prior_absolute_error', 0.07396191835229461),
             ('number_of_instances', 20000.0),
             ('precision', 0.8824526736855726),
             ('predictive_accuracy', 0.88225),
             ('prior_entropy', 4.6998107276322685),
             ('recall', 0.88225),
             ('relative_absolute_error', 0.12246426957922626),
             ('root_mean_prior_squared_error', 0.19230433563020993),
             ('root_mean_squared_error', 0.09517190923635045),
             ('root_relative_squared_error', 0.4949025664161858),
             ('total_cost', 0.0)])
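
Since `run.evaluations` behaves like a dictionary keyed by metric name, individual scores can be read out directly. The sketch below copies a few values from the output above into a literal `OrderedDict` so that it runs without a server connection. The derived count of misclassified instances assumes the accuracy was computed over all instances; that is an assumption of this sketch.

```python
from collections import OrderedDict

# A subset of the values shown in the run.evaluations output above.
evaluations = OrderedDict([
    ('area_under_roc_curve', 0.938776184239766),
    ('kappa', 0.8775364690090109),
    ('number_of_instances', 20000.0),
    ('predictive_accuracy', 0.88225),
])

accuracy = evaluations['predictive_accuracy']
n_instances = int(evaluations['number_of_instances'])
# Assumption: accuracy is the fraction of all instances predicted correctly.
n_errors = round((1 - accuracy) * n_instances)
print(f"{n_errors} of {n_instances} instances misclassified")
```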

<h2 style='background-color: black; height:50px; line-height: 50px; color: white; text-align: center;'>ML Experiments with OpenML</h2>

## Performing a machine learning experiment

In [13]:
# Running the DecisionTree flow on the Letter task:
run = openml.runs.run_flow_on_task(flow, task, avoid_duplicate_runs=False)
run

OpenML Run
Uploader Name: None
Metric.......: None
Run ID.......: None
Task ID......: 6
Task Type....: None
Task URL.....: https://www.openml.org/t/6
Flow ID......: None
Flow Name....: sklearn.tree.tree.DecisionTreeClassifier
Flow URL.....: https://www.openml.org/f/None
Setup ID.....: None
Setup String.: Python_3.6.9. Sklearn_0.21.3. NumPy_1.17.4. SciPy_1.3.2. DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=4699, splitter='best')
Dataset ID...: 6
Dataset URL..: https://www.openml.org/d/6

In [14]:
# run.publish()  # This only works if an API key is configured.

## Finding the best Flow for a Task

In [15]:
df = openml.evaluations.list_evaluations_setups(
    function='predictive_accuracy',
    task=[6],  # the letter task
    output_format='dataframe',
    sort_order='desc',
    size=5  # There are tens of thousands of runs for this task; we only want the top 5
)
df[['run_id', 'setup_id', 'flow_id', 'flow_name', 'value']]

Unnamed: 0,run_id,setup_id,flow_id,flow_name,value
0,6065128,4099662,6952,sklearn.pipeline.Pipeline(imputation=openmlstu...,0.9809
1,6058845,4093489,6952,sklearn.pipeline.Pipeline(imputation=openmlstu...,0.98085
2,6021912,4056597,6952,sklearn.pipeline.Pipeline(imputation=openmlstu...,0.9808
3,6040703,4075384,6952,sklearn.pipeline.Pipeline(imputation=openmlstu...,0.98075
4,6056465,4091114,6952,sklearn.pipeline.Pipeline(imputation=openmlstu...,0.98075


In [16]:
df.flow_name[0]

'sklearn.pipeline.Pipeline(imputation=openmlstudy14.preprocessing.ConditionalImputer,hotencoding=sklearn.preprocessing.data.OneHotEncoder,variencethreshold=sklearn.feature_selection.variance_threshold.VarianceThreshold,classifier=sklearn.svm.classes.SVC)(1)'
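
The flow name encodes the structure of the pipeline: the outer class, its named steps, and a trailing version number in parentheses. As a rough sketch, the name above can be split into those parts with a regular expression. The helper `parse_flat_flow_name` is hypothetical and only handles flat pipelines like this one; nested components would need a real parser.

```python
import re

# The flow name printed above, reproduced verbatim (including its spelling).
flow_name = (
    "sklearn.pipeline.Pipeline("
    "imputation=openmlstudy14.preprocessing.ConditionalImputer,"
    "hotencoding=sklearn.preprocessing.data.OneHotEncoder,"
    "variencethreshold=sklearn.feature_selection.variance_threshold.VarianceThreshold,"
    "classifier=sklearn.svm.classes.SVC)(1)"
)


def parse_flat_flow_name(name):
    """Split 'Root(step=Class,...)(version)' into its parts.

    Hypothetical helper for illustration; assumes no nested parentheses.
    """
    match = re.fullmatch(r"([\w.]+)\((.*)\)\((\d+)\)", name)
    if match is None:
        raise ValueError(f"Unexpected flow name format: {name!r}")
    root, body, version = match.groups()
    # Each step is written as 'step_name=fully.qualified.ClassName'.
    steps = dict(part.split("=", 1) for part in body.split(","))
    return root, steps, int(version)


root, steps, version = parse_flat_flow_name(flow_name)
print(root, version)
for step_name, class_name in steps.items():
    print(f"  {step_name}: {class_name}")
```

Read this way, the best-performing flow on this task is a pipeline that ends in a Support Vector Machine (`sklearn.svm.classes.SVC`).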

<h2 style='background-color: black; height:50px; line-height: 50px; color: white; text-align: center;'>Up Next</h2>

This data can be used to answer many questions. In this tutorial session we will explore two settings.
First, we will visualize the effect of tuning the hyperparameters 'C' and 'gamma' of a Support Vector Machine on a specific problem.
Second, we will use the (public) results of machine learning experiments to train a surrogate model
that predicts the performance of an algorithm on a given problem.