<div style='margin: auto; width: 40%;'><h1 style='font-size: 55px; display: inline-block'>OpenML</h1> <img style="float: left; height:80px; margin-right:10px;" src="images/openml/dots.png"></div>


[OpenML](https://www.openml.org) is an open platform for collaborative open science in machine learning,
used to share datasets and the results of machine learning experiments.
It defines machine learning experiments through the following concepts:

 - *Datasets*. Regular (tabular) datasets, each identified by an ID, a name and a version.
 - *Tasks*. Define splits of a dataset that should be used to evaluate a model (in a reproducible way). E.g. 10-fold cross-validation.
 - *Flows*. Define algorithms, workflows or scripts solving tasks. E.g. RandomForest, auto-sklearn.
 - *Runs*. The result of running a Flow on a Task, i.e. a recorded machine learning experiment.
   Concretely, a run is the set of predictions generated by the trained model, the corresponding scores, and some metadata on its execution.
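
As a rough mental model, the four concepts can be pictured as linked records. The sketch below is only an illustration: the class and field names are hypothetical simplifications, not the classes of the `openml` package, and the IDs are the ones that appear in the examples later in this notebook.

```python
from dataclasses import dataclass, field
from typing import Dict, List


# Hypothetical, simplified records. The real OpenML objects carry many more fields.
@dataclass(frozen=True)
class Dataset:
    dataset_id: int
    name: str


@dataclass(frozen=True)
class Task:
    task_id: int
    dataset: Dataset            # a task is always defined on one dataset
    estimation_procedure: str   # how the dataset is split for evaluation
    target_feature: str


@dataclass(frozen=True)
class Flow:
    flow_id: int
    name: str


@dataclass
class Run:
    run_id: int
    task: Task                  # what was solved
    flow: Flow                  # what solved it
    predictions: List[str] = field(default_factory=list)
    scores: Dict[str, float] = field(default_factory=dict)


dataset = Dataset(dataset_id=3, name="kr-vs-kp")
task = Task(task_id=3, dataset=dataset,
            estimation_procedure="10-fold Crossvalidation",
            target_feature="class")
flow = Flow(flow_id=17334, name="sklearn.tree.tree.DecisionTreeClassifier")
run = Run(run_id=10417339, task=task, flow=flow)

# A run links back to its flow, its task and, through the task, its dataset.
print(run.flow.name, "on", run.task.dataset.name)
```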

OpenML hosts thousands of datasets and flows and has millions of runs (recorded machine learning experiments).
All this information is publicly available for download through any of its interfaces (the website, a REST API, and APIs in R, Java, Python and C#). In this example we use Python.

In [1]:
import openml

<h2 style='background-color: #4caf50; height:50px; line-height: 50px; color: white; text-align: center;'>Datasets</h2>

In [2]:
openml.datasets.list_datasets(output_format='dataframe', size=5)

Unnamed: 0,did,name,version,uploader,status,format,MajorityClassSize,MaxNominalAttDistinctValues,MinorityClassSize,NumberOfClasses,NumberOfFeatures,NumberOfInstances,NumberOfInstancesWithMissingValues,NumberOfMissingValues,NumberOfNumericFeatures,NumberOfSymbolicFeatures
2,2,anneal,1,1,active,ARFF,684.0,7.0,8.0,5.0,39.0,898.0,898.0,22175.0,6.0,33.0
3,3,kr-vs-kp,1,1,active,ARFF,1669.0,3.0,1527.0,2.0,37.0,3196.0,0.0,0.0,0.0,37.0
4,4,labor,1,1,active,ARFF,37.0,3.0,20.0,2.0,17.0,57.0,56.0,326.0,8.0,9.0
5,5,arrhythmia,1,1,active,ARFF,245.0,13.0,2.0,13.0,280.0,452.0,384.0,408.0,206.0,74.0
6,6,letter,1,1,active,ARFF,813.0,26.0,734.0,26.0,17.0,20000.0,0.0,0.0,16.0,1.0
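
The quality columns in this listing make it possible to select datasets by their properties. As a minimal sketch, the snippet below copies a few columns from the output above into a CSV string and filters it with the standard library. With the actual DataFrame returned by `list_datasets`, the same selection would be a boolean filter on those columns.

```python
import csv
import io

# A subset of the columns from the listing above, copied by hand for illustration.
listing = """did,name,NumberOfClasses,NumberOfFeatures,NumberOfInstances,NumberOfMissingValues
2,anneal,5,39,898,22175
3,kr-vs-kp,2,37,3196,0
4,labor,2,17,57,326
5,arrhythmia,13,280,452,408
6,letter,26,17,20000,0
"""

rows = list(csv.DictReader(io.StringIO(listing)))

# Keep binary classification datasets that have no missing values.
complete_binary = [
    row["name"]
    for row in rows
    if int(row["NumberOfClasses"]) == 2 and int(row["NumberOfMissingValues"]) == 0
]

print(complete_binary)  # ['kr-vs-kp']
```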


In [3]:
openml.datasets.get_dataset('kr-vs-kp')

OpenML Dataset
Name..........: kr-vs-kp
Version.......: 1
Format........: ARFF
Upload Date...: 2014-04-06 23:19:28
Licence.......: Public
Download URL..: https://www.openml.org/data/v1/download/3/kr-vs-kp.arff
OpenML URL....: https://www.openml.org/d/3
# of features.: 37
# of instances: 3196

In [4]:
openml.datasets.get_dataset(3)

OpenML Dataset
Name..........: kr-vs-kp
Version.......: 1
Format........: ARFF
Upload Date...: 2014-04-06 23:19:28
Licence.......: Public
Download URL..: https://www.openml.org/data/v1/download/3/kr-vs-kp.arff
OpenML URL....: https://www.openml.org/d/3
# of features.: 37
# of instances: 3196

<h2 style='background-color: #ff9800; height:50px; line-height: 50px; color: white; text-align: center;'>Tasks</h2>

In [5]:
openml.tasks.list_tasks(output_format='dataframe', data_id=3).head()

Unnamed: 0,tid,ttid,did,name,task_type,status,estimation_procedure,source_data,target_feature,MajorityClassSize,...,NumberOfFeatures,NumberOfInstances,NumberOfInstancesWithMissingValues,NumberOfMissingValues,NumberOfNumericFeatures,NumberOfSymbolicFeatures,evaluation_measures,number_samples,quality_measure,target_value
3,3,1,3,kr-vs-kp,Supervised Classification,active,10-fold Crossvalidation,3,class,1669,...,37,3196,0,0,0,37,,,,
63,63,3,3,kr-vs-kp,Learning Curve,active,10 times 10-fold Learning Curve,3,class,1669,...,37,3196,0,0,0,37,predictive_accuracy,12.0,,
233,233,1,3,kr-vs-kp,Supervised Classification,active,33% Holdout set,3,class,1669,...,37,3196,0,0,0,37,predictive_accuracy,,,
1702,1702,3,3,kr-vs-kp,Learning Curve,active,10-fold Learning Curve,3,class,1669,...,37,3196,0,0,0,37,predictive_accuracy,12.0,,
1767,1767,1,3,kr-vs-kp,Supervised Classification,active,5 times 2-fold Crossvalidation,3,class,1669,...,37,3196,0,0,0,37,predictive_accuracy,,,


In [6]:
openml.tasks.get_task(3)

OpenML Classification Task
Task Type Description: https://www.openml.org/tt/1
Task ID..............: 3
Task URL.............: https://www.openml.org/t/3
Estimation Procedure.: crossvalidation
Target Feature.......: class
# of Classes.........: 2
Cost Matrix..........: Available
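
To make the estimation procedure concrete, the sketch below shows what a plain 10-fold split of the 3196 instances of this dataset looks like. It is an illustration only: the function names and the seed are assumptions, it ignores details such as stratification, and it does not reproduce the splits that OpenML stores with the task.

```python
import random


def fold_sizes(n_instances, n_folds):
    """Spread n_instances over n_folds so that sizes differ by at most one."""
    base, remainder = divmod(n_instances, n_folds)
    return [base + 1 if i < remainder else base for i in range(n_folds)]


def kfold_indices(n_instances, n_folds, seed=0):
    """Shuffle the row indices with a fixed seed and cut them into folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split repeatable
    folds, start = [], 0
    for size in fold_sizes(n_instances, n_folds):
        folds.append(indices[start:start + size])
        start += size
    return folds


folds = kfold_indices(n_instances=3196, n_folds=10)

# In each of the 10 iterations one fold is held out for testing
# and the other nine are used for training.
test_indices = folds[0]
train_indices = [i for fold in folds[1:] for i in fold]

print(len(train_indices), len(test_indices))  # 2876 320
```

Because every participant evaluates on the same predefined splits, scores of different runs on one task are directly comparable.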

<h2 style='background-color: #2196f3; height:50px; line-height: 50px; color: white; text-align: center;'>Flows</h2>

In [7]:
from sklearn.tree import DecisionTreeClassifier

In [8]:
flow_id = openml.flows.get_flow_id(model=DecisionTreeClassifier())
openml.flows.get_flow(flow_id)

OpenML Flow
Flow ID.........: 17334 (version 55)
Flow URL........: https://www.openml.org/f/17334
Flow Name.......: sklearn.tree.tree.DecisionTreeClassifier
Flow Description: A decision tree classifier.
Upload Date.....: 2019-10-30 15:21:50
Dependencies....: sklearn==0.21.3
numpy>=1.6.1
scipy>=0.9

*Note*: the flow ID you get may differ, because the same model built with a different scikit-learn or dependency version is considered a separate flow.
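
One way to picture why versions matter is to treat a flow's identity as the combination of its name and its pinned dependencies. The sketch below is a hypothetical simplification of that idea, not OpenML's actual lookup logic; `flow_key` is an illustrative helper and the second scikit-learn version is made up for the comparison.

```python
def flow_key(name, dependencies):
    """Hypothetical identity key: the flow name plus its sorted dependency pins."""
    return (name, tuple(sorted(dependencies)))


name = "sklearn.tree.tree.DecisionTreeClassifier"

# Dependencies as shown in the flow description above.
recorded = flow_key(name, ["sklearn==0.21.3", "numpy>=1.6.1", "scipy>=0.9"])

# Same model class, but a different (illustrative) scikit-learn version.
other = flow_key(name, ["sklearn==0.21.2", "numpy>=1.6.1", "scipy>=0.9"])

# The keys differ, so the two would be treated as separate flows.
print(recorded == other)  # False
```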

<h2 style='background-color: #f44336; height:50px; line-height: 50px; color: white; text-align: center;'>Runs</h2>

In [9]:
openml.runs.list_runs(output_format='dataframe', flow=[flow_id])

Unnamed: 0,run_id,task_id,setup_id,flow_id,uploader,upload_time,error_message
10417339,10417339,3,8254412,17334,869,2019-11-14 12:25:54,


In [10]:
openml.runs.get_run(10417339)

OpenML Run
Uploader Name...: Pieter Gijsbers
Uploader Profile: https://www.openml.org/u/869
Metric..........: None
Run ID..........: 10417339
Run URL.........: https://www.openml.org/r/10417339
Task ID.........: 3
Task Type.......: Supervised Classification
Task URL........: https://www.openml.org/t/3
Flow ID.........: 17334
Flow Name.......: sklearn.tree.tree.DecisionTreeClassifier(55)
Flow URL........: https://www.openml.org/f/17334
Setup ID........: 8254412
Setup String....: None
Dataset ID......: 3
Dataset URL.....: https://www.openml.org/d/3
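
A run like the one above is produced by evaluating a model on a task. The function below is a minimal sketch of how that could look with `openml.runs.run_model_on_task`, assuming the `openml` and `scikit-learn` packages are installed and the server is reachable. The function name is hypothetical, and keyword arguments may differ between `openml` versions.

```python
def run_decision_tree_on_task(task_id=3):
    """Evaluate a decision tree on an OpenML task (requires network access)."""
    # Imported here so that defining the function does not require the packages.
    import openml
    from sklearn.tree import DecisionTreeClassifier

    # Fetch the task definition from the server.
    task = openml.tasks.get_task(task_id)

    model = DecisionTreeClassifier()

    # avoid_duplicate_runs=False skips the check for an identical existing run,
    # so the model is always trained and evaluated locally.
    run = openml.runs.run_model_on_task(model, task, avoid_duplicate_runs=False)

    # The run exists only locally at this point. Uploading it with
    # run.publish() requires an OpenML API key.
    return run
```

Calling `run_decision_tree_on_task()` returns a local run object for task 3; nothing is uploaded unless the run is published explicitly.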

<h2 style='background-color: black; height:50px; line-height: 50px; color: white; text-align: center;'>Let's continue</h2>

This data can be used to answer many questions, and in this tutorial session we will explore two settings.
First, we will visualize the effect of tuning the hyperparameters 'C' and 'Gamma' of a Support Vector Machine for a specific problem.
Second, we will use the (public) results of machine learning experiments to train a surrogate model,
which will predict the performance of an algorithm for a given problem.