In [None]:
%matplotlib inline


# OpenML and Scikit-Learn Tutorial 

This notebook covers key concepts of OpenML and simple examples of its use in combination with Python and Scikit-Learn.


OpenML is an online collaboration platform for machine learning which allows
you to:

* Find or share interesting, well-documented datasets
* Define research / modelling goals (tasks)
* Explore a large number of machine learning algorithms, with APIs in Java, R, and Python
* Log and share reproducible experiments, models, and results
* Work seamlessly with scikit-learn and other libraries
* Run large-scale benchmarks and compare against the state of the art




# Setup


## Installation
Installation is done either through *Anaconda-Navigator* (see *Environments*), or via ``pip``:


    pip install openml scikit-learn

For further information, please check out the installation guide at https://openml.github.io/openml-python/develop/usage.html#installation and https://scikit-learn.org/stable/install.html
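
To confirm that both packages are visible in the active environment, a small check like the one below may help. It is a sketch that uses only the standard library; the helper name `installed_version` is illustrative and not part of OpenML or scikit-learn.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Distribution names as used on the pip command line above.
for name in ("openml", "scikit-learn"):
    version = installed_version(name)
    print(f"{name}: {version if version else 'not installed'}")
```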



In [3]:
# License: BSD 3-Clause

import openml
import sklearn



## Authentication with main server

The OpenML main server can only be accessed by users who have signed up on the
OpenML platform. If you don’t have an account yet, sign up now.
You will receive an *API key*, which will authenticate you to the server
and allow you to download and upload datasets, tasks, runs and flows.

It is important to configure the Python connector with the proper API endpoint (the default is usually correct) and the proper API key.

* Create an OpenML account (free) on https://www.openml.org.
* After logging in, open your account page (avatar on the top right)
* Open 'Account Settings', then 'API authentication' to find your API key.

There are several ways to authenticate:

* Create a plain text file **~/.openml/config** with the line
  **'apikey=MYKEY'**, replacing **MYKEY** with your API key. The config
  file must be located in the directory **~/.openml/** and exist prior to
  importing the openml module.
* Run the code below, replacing 'MYKEY' with your API key. This authenticates you only for the duration of the Python process.
* Use the ``openml`` CLI tool with ``openml configure apikey MYKEY``,
  replacing **MYKEY** with your API key.

**IMPORTANT: Do not share code with credentials such as your OpenML API key.**
For example, in the first option above, keep your API key in a configuration file that is not synchronised through your Git repository.
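
One way to keep the key out of shared code is to look it up at runtime. The sketch below is only an illustration: the helper `read_apikey` and the environment variable name `MY_OPENML_APIKEY` are made up for this example, and the parser assumes simple `key=value` lines such as `apikey=MYKEY`. It is not how the `openml` package reads its own configuration.

```python
import os
from pathlib import Path
from typing import Optional


def read_apikey(config_path: Path, env_var: str = "MY_OPENML_APIKEY") -> Optional[str]:
    """Look up an API key without hard-coding it in the script.

    The environment variable name is a hypothetical example. The file is
    expected to contain `key=value` lines such as `apikey=MYKEY`.
    """
    key = os.environ.get(env_var)
    if key:
        return key
    if not config_path.is_file():
        return None
    for line in config_path.read_text().splitlines():
        name, sep, value = line.partition("=")
        if sep and name.strip() == "apikey":
            # Drop surrounding whitespace and optional quotes.
            return value.strip().strip("'\"")
    return None


# Intended usage (requires the openml package):
#   import openml
#   key = read_apikey(Path.home() / ".openml" / "config")
#   if key:
#       openml.config.apikey = key
```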


In [1]:
import openml

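# Note: this URL points at the OpenML test server (see the next section).
# Leave `openml.config.server` at its default to use the main server.
openml.config.server = 'https://test.openml.org/api/v1/'
openml.config.apikey = 'MYKEY'  # replace 'MYKEY' with your API key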

## Alternative: Test server

You can use a test server for OpenML instead of the main server, for example when you run code that simulates the upload of data. In such cases, connect to the test server at test.openml.org. This keeps the main server from being cluttered with example datasets, tasks, runs, and so on. Note that using the test server can affect the behaviour and performance of the OpenML-Python API.

Before first connecting to the test server, specify:

    openml.config.start_using_configuration_for_example()


After completing all connections, specify:

    openml.config.stop_using_configuration_for_example()


When using the main server instead, make sure your apikey is configured.
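
Because the two calls above belong together, one option is to wrap them in a context manager so that the stop call also runs when the code in between raises an error. This is a sketch, not something the `openml` package provides: the name `example_configuration` is hypothetical, and the only assumption about `config` is that it offers the two functions named above.

```python
from contextlib import contextmanager


@contextmanager
def example_configuration(config):
    """Switch to the example configuration and always switch back.

    `config` is expected to provide the two functions named in the text,
    for instance the `openml.config` module.
    """
    config.start_using_configuration_for_example()
    try:
        yield config
    finally:
        # Runs even if the code inside the `with` block raises.
        config.stop_using_configuration_for_example()


# Intended usage (requires the openml package):
#   import openml
#   with example_configuration(openml.config):
#       ...  # code that connects to the test server
```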


In [7]:
import openml

openml.config.start_using_configuration_for_example()
# ... some code that connects to the server ...
openml.config.stop_using_configuration_for_example()

### Caching
Downloaded datasets, tasks, runs and flows are cached locally, so that they
can be retrieved later without calling the server again. As with the API key,
the cache directory can be specified either through the config file or
through the API:

* Add the line **cachedir = 'MYDIR'** to the config file, replacing
  'MYDIR' with the path to the cache directory. By default, OpenML
  will use **~/.openml/cache** as the cache directory.
* Run the code below, replacing 'MYDIR' with the path to the cache directory.
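
The idea behind such a cache can be sketched in a few lines. The example below is a generic illustration and not OpenML's actual implementation: `cached_fetch` and `fake_download` are hypothetical names, and a JSON file per key stands in for whatever format the library really stores.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def cached_fetch(cache_dir: Path, key: str, fetch: Callable[[], dict]) -> dict:
    """Return the cached entry for `key`, calling `fetch` only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    result = fetch()
    cache_file.write_text(json.dumps(result))
    return result


calls = []


def fake_download() -> dict:
    # Stand-in for a server request; records how often it is invoked.
    calls.append(1)
    return {"name": "example"}


with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch(Path(tmp), "dataset_1", fake_download)
    second = cached_fetch(Path(tmp), "dataset_1", fake_download)

print(len(calls))  # 1: the second call was served from the cache
```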



In [6]:
# Uncomment and set your OpenML cache directory
# import os
# openml.config.cache_directory = os.path.expanduser('MYDIR')

## Datasets, tasks, flows, and runs

OpenML distinguishes datasets, tasks, flows, and runs. This structure facilitates sharing the results of experiments with different models and configurations.

- **Datasets** simply consist of a number of rows, also called instances, usually in tabular form. They are identified by their *did*.
  Example: https://www.openml.org/search?type=data&sort=runs&id=40704
  Note: There are often multiple variants of one dataset. For example, compare the results of a search for all datasets that contain 'Titanic': https://www.openml.org/search?type=data&status=active
  
- **Tasks** consist of a dataset together with a machine learning task to perform, such as classification or clustering, and an evaluation method. For supervised tasks, this also specifies the target column in the data.
  Example: https://www.openml.org/search?type=task&id=146230&source_data.data_id=40704

- **Flows** identify a particular machine learning algorithm from a particular library or framework such as scikit-learn, Weka, or mlr. A flow should contain at least a name, details about the workbench and its version, and a list of settable hyperparameters. Ideally, the appropriate workbench can deserialize it again (the algorithm, not the model).
  Example: https://www.openml.org/search?type=flow&id=18869

- **Runs** are particular flows (that is, algorithms) with a particular parameter setting, applied to a particular task.
  Example: https://www.openml.org/search?type=run&id=10229128&run_task.task_id=146230


OpenML ensures that all datasets are uniformly formatted. They have rich, consistent metadata that is managed automatically and allows you to filter datasets. This is particularly important for datasets with a large number of instances and variables.

For more information, see https://docs.openml.org/
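
The relationship between the four concepts can be pictured with a few plain data classes. This is a conceptual sketch only: the class and field names are chosen for illustration and do not mirror the classes of the `openml` package. The IDs and the target column are the ones quoted in this notebook.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class Dataset:
    did: int
    name: str


@dataclass(frozen=True)
class Task:
    tid: int
    dataset: Dataset   # a task always refers to one dataset
    task_type: str
    target: str        # target column, for supervised tasks


@dataclass(frozen=True)
class Flow:
    name: str
    library: str


@dataclass
class Run:
    task: Task         # what was solved
    flow: Flow         # which algorithm was used
    parameters: Dict[str, object] = field(default_factory=dict)


titanic = Dataset(did=40704, name="Titanic")
task = Task(tid=146230, dataset=titanic, task_type="classification", target="class")
knn = Flow(name="KNeighborsClassifier", library="scikit-learn")
run = Run(task=task, flow=knn, parameters={"n_neighbors": 5})

print(run.task.dataset.name)  # Titanic
```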


## Simple Example


**Example-1**: task `146230` https://www.openml.org/search?type=task&sort=runs&id=146230 for the `Titanic` dataset with did `40704`: https://www.openml.org/search?type=data&sort=runs&id=40704&status=active

**Example-2 (Bonus)**: the more complex classification task `167119`  https://www.openml.org/search?type=task&sort=runs&id=167119 for the `Jungle Chess` dataset with did `41027`: https://www.openml.org/search?type=data&sort=runs&id=41027&status=active


Download the OpenML task 146230 for the Titanic dataset and execute it with a kNN classifier. Report the accuracy and the training runtime. (The code cell below has the bonus task `167119` selected; swap the commented `task_id` line to run the Titanic task.)



- Get the task and dataset using

      openml.tasks.get_task(tid)
      openml.datasets.get_dataset(did)
      
- Run the model of a KNeighborsClassifier on that task using

      openml.runs.run_model_on_task(clf, task)

- Print a summary of the run using

      print(run)
           
- State the accuracy and runtime on each of the first (bonus: all 10) fold(s) using
      
      measures["predictive_accuracy"][repeat][fold]
      measures["wall_clock_time_millis_training"][repeat][fold]    
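
The indexing pattern above implies a nested mapping of the form metric, then repeat, then fold. The sketch below shows how such a structure can be read and averaged. The numbers are placeholders for illustration only, not results of a real run, and `values_of` is a hypothetical helper.

```python
from statistics import mean
from typing import Dict, List

FoldEvaluations = Dict[str, Dict[int, Dict[int, float]]]

# Placeholder numbers for illustration only, not results of a real run.
measures: FoldEvaluations = {
    "predictive_accuracy": {0: {0: 0.80, 1: 0.70}},
    "wall_clock_time_millis_training": {0: {0: 10.0, 1: 14.0}},
}


def values_of(measures: FoldEvaluations, metric: str) -> List[float]:
    """Flatten measures[metric][repeat][fold] into a list of values."""
    return [
        value
        for folds in measures[metric].values()
        for value in folds.values()
    ]


first_fold_accuracy = measures["predictive_accuracy"][0][0]
mean_accuracy = mean(values_of(measures, "predictive_accuracy"))
print(first_fold_accuracy, mean_accuracy)
```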


In [48]:
print('Simple OpenML Example:\n')
import openml
from sklearn import neighbors

# Get the task and dataset:
#task_id = 146230 # Titanic
task_id = 167119 # Jungle Chess
task = openml.tasks.get_task(task_id)
data = openml.datasets.get_dataset(task.dataset_id)

# Specify a classifier and run it:
clf = neighbors.KNeighborsClassifier(n_neighbors=5)
print('Experiment started.')
run = openml.runs.run_model_on_task(clf, task, avoid_duplicate_runs=False)
print('Experiment completed.')
print()

# Print a summary of the run:
print(run)
print()

# List the available performance metrics:
measures = run.fold_evaluations
print("The timing and performance metrics available: ")
for key in measures.keys():
    print(key)
print()

# State the accuracy and runtime of the first repetition and fold:
print("For the first repetition and fold, accuracy is {:.4f}, training wall time is {:.4f}".format(measures["predictive_accuracy"][0][0], measures["wall_clock_time_millis_training"][0][0]))
print()

# Bonus: State the accuracy and runtime on each of the 10 folds:
print(
    "The results over all repetitions and folds, \nwith accuracy (`predictive_accuracy`) \nand training time(wall_clock_time_millis_training):"
)
for repeat, val1 in measures["predictive_accuracy"].items():
    for fold, val2 in val1.items():
        print("Repeat #{}, Fold #{}: Accuracy: {:.4f}, Wall-Time: {:.4f}".format(repeat, fold, measures["predictive_accuracy"][repeat][fold], measures["wall_clock_time_millis_training"][repeat][fold]))
    print()
    

# Optional: Publish the experiment on OpenML (requires an API key).
# For such tests/tutorials, consider using the test server
# as to not crowd the main server with runs created by examples.
#myrun = run.publish()
#print(f"kNN on {data.name}: {myrun.openml_url}")


Simple OpenML Example:

Experiment started.
Experiment completed.

OpenML Run
Uploader Name: None
Metric.......: None
Run ID.......: None
Task ID......: 167119
Task Type....: None
Task URL.....: https://www.openml.org/t/167119
Flow ID......: None
Flow Name....: sklearn.neighbors._classification.KNeighborsClassifier
Flow URL.....: https://www.openml.org/f/None
Setup ID.....: None
Setup String.: Python_3.9.12. Sklearn_1.0.2. NumPy_1.21.2. SciPy_1.7.3.
Dataset ID...: 41027
Dataset URL..: https://www.openml.org/d/41027

The timing and performance metrics available: 
usercpu_time_millis_training
wall_clock_time_millis_training
usercpu_time_millis_testing
usercpu_time_millis
wall_clock_time_millis_testing
wall_clock_time_millis
predictive_accuracy

For the first repetition and fold, accuracy is 0.7361, training wall time is 129.1437

The results over all repetitions and folds, 
with accuracy (`predictive_accuracy`) 
and training time(wall_clock_time_millis_training):
Repeat #0, Fold #0: Accu

# Optimising a model in OpenML

A key step in machine learning is performing a systematic hyper-parameter optimization/tuning (HPO) and evaluation.
This can easily be done by combining OpenML with the automated HPO tools provided by scikit-learn.

**Example-1**: In the examples below, we will use the same task from the simple example above (task `146230` https://www.openml.org/search?type=task&sort=runs&id=146230 for the `Titanic` dataset with did `40704`: https://www.openml.org/search?type=data&sort=runs&id=40704&status=active), but with optimised classifiers. 

**Example-2 (Bonus)**: the more complex classification task `167119`  https://www.openml.org/search?type=task&sort=runs&id=167119 for the `Jungle Chess` dataset with did `41027`: https://www.openml.org/search?type=data&sort=runs&id=41027&status=active

A detailed tutorial on HPO with scikit-learn is available here: https://scikit-learn.org/stable/modules/grid_search.html#exhaustive-grid-search
A related tutorial for OpenML is given here: https://openml.github.io/openml-python/main/examples/30_extended/fetch_runtimes_tutorial.html

## Example: Run a single model on a task

Running a scikit-learn model on a task is done using the function `run_model_on_task(...)` ([see docs](https://openml.github.io/openml-python/master/generated/openml.runs.run_model_on_task.html#openml.runs.run_model_on_task)) or `run_flow_on_task(...)`. In particular, review the `avoid_duplicate_runs` option (especially important for tutorials). The method `get_metric_fn` of the resulting run ([doc](https://openml.github.io/openml-python/master/generated/openml.OpenMLRun.html#openml.OpenMLRun)) can be used to obtain metric scores before uploading.
* Use the function `run_model_on_task` to run your favorite scikit-learn classifier (e.g., a `Random Forest Classifier`) on the task `146230` of the `Titanic` dataset. Report the score.
* Use the function `run_flow_on_task` to run another scikit-learn classifier on the same task. Report the score.

In [46]:
# Use the function `run_model_on_task` to run your favorite scikit-learn classifier (e.g., a `Random Forest Classifier`) for task 146230 on the `Titanic` dataset.
# For starters, do a single run without hyper-parameter tuning:

# imports and classifier specification:
import openml
import numpy as np
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=10, n_jobs=2)

# helper function for printing the evaluation results:
def print_compare_runs(measures):
    for repeat, val1 in measures["usercpu_time_millis_training"].items():
        for fold, val2 in val1.items():
            print(
                "Repeat #{}, Fold #{}: ACC {:.3f}, Training Walltime {:.3f}".format(
                    repeat, fold, measures["predictive_accuracy"][repeat][fold], measures["wall_clock_time_millis_training"][repeat][fold]
                )
            )


# get the task and print its summary (Titanic: 146230, Jungle Chess: 167119):
#task_id = 146230 # Titanic
task_id = 167119 # Jungle Chess
task = openml.tasks.get_task(task_id)
print(task)

# run the configuration and print the results:
print('Starting run.')
run = openml.runs.run_model_on_task(
    model=clf, task=task, upload_flow=False, avoid_duplicate_runs=False,
)

print('Run completed.')
measures = run.fold_evaluations
print_compare_runs(measures)


OpenML Classification Task
Task Type Description: https://www.openml.org/tt/TaskType.SUPERVISED_CLASSIFICATION
Task ID..............: 167119
Task URL.............: https://www.openml.org/t/167119
Estimation Procedure.: crossvalidation
Target Feature.......: class
# of Classes.........: 3
Cost Matrix..........: Available
Starting run.
Run completed.
Repeat #0, Fold #0: ACC 0.773, Training Walltime 298.536
Repeat #0, Fold #1: ACC 0.771, Training Walltime 243.038
Repeat #0, Fold #2: ACC 0.779, Training Walltime 214.649
Repeat #0, Fold #3: ACC 0.784, Training Walltime 349.574
Repeat #0, Fold #4: ACC 0.779, Training Walltime 304.194
Repeat #0, Fold #5: ACC 0.781, Training Walltime 223.819
Repeat #0, Fold #6: ACC 0.766, Training Walltime 223.972
Repeat #0, Fold #7: ACC 0.771, Training Walltime 269.892
Repeat #0, Fold #8: ACC 0.780, Training Walltime 230.138
Repeat #0, Fold #9: ACC 0.777, Training Walltime 204.337


In [0]:
# Use the function `run_flow_on_task` to run another scikit-learn classifier on the same task.

## Advanced Examples: Random Search and Grid Search

Scikit-learn natively supports Random Search and Grid Search procedures to optimize hyperparameters. These search estimators can be used directly with the OpenML connector. Read [this article](https://scikit-learn.org/stable/auto_examples/model_selection/plot_randomized_search.html) to understand how they work.

* Run Random Search and Grid Search on an SVM from scikit-learn. Make sure to optimize at least 2 hyperparameters. What are the most important hyperparameters? What is the main difference between these two search procedures?
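
As a hint for the last question, the difference between the two procedures lies in which candidate settings get evaluated. The sketch below uses only the standard library and hypothetical helper names. It samples without replacement from lists of values, which is a simplification of what scikit-learn does when continuous distributions are supplied. The value ranges for `C` and `gamma` are illustrative.

```python
import itertools
import random
from typing import Dict, List, Sequence

ParamGrid = Dict[str, Sequence[object]]


def grid_candidates(grid: ParamGrid) -> List[dict]:
    """Every combination of the listed values (what a grid search evaluates)."""
    names = sorted(grid)
    combos = itertools.product(*(grid[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


def random_candidates(grid: ParamGrid, n_iter: int, seed: int = 0) -> List[dict]:
    """A random subset of `n_iter` combinations (what a random search evaluates)."""
    rng = random.Random(seed)
    return rng.sample(grid_candidates(grid), n_iter)


# Two SVM hyperparameters with illustrative value ranges.
svm_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}

print(len(grid_candidates(svm_grid)))       # 9
print(len(random_candidates(svm_grid, 4)))  # 4
```

The grid grows multiplicatively with every added hyperparameter, while the random search budget stays fixed at `n_iter`.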

In [47]:
# As above, use the function `run_model_on_task` to run your favorite scikit-learn classifier (e.g., a `Random Forest Classifier`) for task 146230 on the `Titanic` dataset.
# This time, use a pipe for RandomizedSearchCV:

# as above, imports and classifier specification:
import openml
import numpy as np
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=10, n_jobs=2)

# as above, helper function for printing the evaluation results:
def print_compare_runs(measures):
    for repeat, val1 in measures["usercpu_time_millis_training"].items():
        for fold, val2 in val1.items():
            print(
                "Repeat #{}, Fold #{}: ACC {:.3f}, Training Walltime {:.3f}".format(
                    repeat, fold, measures["predictive_accuracy"][repeat][fold], measures["wall_clock_time_millis_training"][repeat][fold]
                )
            )


# as above, get the task and print its summary (Titanic: 146230, Jungle Chess: 167119):
#task_id = 146230 # Titanic
task_id = 167119 # Jungle Chess
task = openml.tasks.get_task(task_id)
print(task)

# create a pipeline for a RandomizedSearchCV using 2-fold cross validation:
n_iter = 2
rs_pipe = RandomizedSearchCV(
    estimator=clf,
    param_distributions={
        "n_estimators": np.linspace(start=1, stop=50, num=15).astype(int).tolist()
    },
    cv=2,
    n_iter=n_iter,
    n_jobs=2,
)


# run the search and print the results:
print('Starting runs.')
run = openml.runs.run_model_on_task(
    model=rs_pipe, task=task, upload_flow=False, avoid_duplicate_runs=False, n_jobs=2
)
print('Runs completed.')
measures = run.fold_evaluations
print_compare_runs(measures)



OpenML Classification Task
Task Type Description: https://www.openml.org/tt/TaskType.SUPERVISED_CLASSIFICATION
Task ID..............: 167119
Task URL.............: https://www.openml.org/t/167119
Estimation Procedure.: crossvalidation
Target Feature.......: class
# of Classes.........: 3
Cost Matrix..........: Available
Starting runs.
Runs completed.
Repeat #0, Fold #0: ACC 0.775, Training Walltime 1308.550
Repeat #0, Fold #1: ACC 0.771, Training Walltime 1301.561
Repeat #0, Fold #2: ACC 0.772, Training Walltime 1525.371
Repeat #0, Fold #3: ACC 0.787, Training Walltime 1572.490
Repeat #0, Fold #4: ACC 0.783, Training Walltime 1542.226
Repeat #0, Fold #5: ACC 0.781, Training Walltime 1529.850
Repeat #0, Fold #6: ACC 0.766, Training Walltime 1542.290
Repeat #0, Fold #7: ACC 0.774, Training Walltime 1553.626
Repeat #0, Fold #8: ACC 0.779, Training Walltime 1240.561
Repeat #0, Fold #9: ACC 0.777, Training Walltime 1242.521
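
The training times of the random search can be put next to those of the single run shown earlier for the same task (`167119`). The sketch below copies the per-fold wall times from the two outputs and compares their means. The count of inner fits assumes scikit-learn's default of refitting the best candidate once. Since the candidates use different `n_estimators` values and run in parallel, the ratio is only a rough indication.

```python
from statistics import mean

# Training wall times (milliseconds) per fold, copied from the two outputs above.
single_run_ms = [298.536, 243.038, 214.649, 349.574, 304.194,
                 223.819, 223.972, 269.892, 230.138, 204.337]
random_search_ms = [1308.550, 1301.561, 1525.371, 1572.490, 1542.226,
                    1529.850, 1542.290, 1553.626, 1240.561, 1242.521]

# Inner model fits per outer fold: n_iter candidates x cv folds, plus one refit.
n_iter, cv = 2, 2
fits_per_fold = n_iter * cv + 1

ratio = mean(random_search_ms) / mean(single_run_ms)
print(f"fits per fold: {fits_per_fold}, observed slowdown: {ratio:.1f}x")
```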


In [44]:
# As above, use the function `run_model_on_task` to run your favorite scikit-learn classifier (e.g., a `Random Forest Classifier`) for task 146230 on the `Titanic` dataset.
# This time, use a pipe for GridSearchCV:

# as above, imports and classifier specification:
import openml
import numpy as np
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=10, n_jobs=2)

# as above, helper function for printing the evaluation results:
def print_compare_runs(measures):
    for repeat, val1 in measures["usercpu_time_millis_training"].items():
        for fold, val2 in val1.items():
            print(
                "Repeat #{}, Fold #{}: ACC {:.3f}, Training Walltime {:.3f}".format(
                    repeat, fold, measures["predictive_accuracy"][repeat][fold], measures["wall_clock_time_millis_training"][repeat][fold]
                )
            )


# as above, get the task and print its summary (Titanic: 146230, Jungle Chess: 167119):
#task_id = 146230 # Titanic
task_id = 167119 # Jungle Chess
task = openml.tasks.get_task(task_id)
print(task)

# create a GridSearchCV using 2-fold cross validation;
# n_iter is the number of grid points for n_estimators:
n_iter = 5
grid_pipe = GridSearchCV(
    estimator=clf,
    param_grid={"n_estimators": np.linspace(start=1, stop=50, num=n_iter).astype(int).tolist()},
    cv=2,
    n_jobs=2,
)


# run the search and print the results:
print('Starting runs.')
run = openml.runs.run_model_on_task(
    model=grid_pipe, task=task, upload_flow=False, avoid_duplicate_runs=False, n_jobs=2
)
print('Runs completed.')
measures = run.fold_evaluations
print_compare_runs(measures)



OpenML Classification Task
Task Type Description: https://www.openml.org/tt/TaskType.SUPERVISED_CLASSIFICATION
Task ID..............: 146230
Task URL.............: https://www.openml.org/t/146230
Estimation Procedure.: crossvalidation
Evaluation Measure...: precision
Target Feature.......: class
# of Classes.........: 2
Cost Matrix..........: Available
Starting runs.
Runs completed.
Repeat #0, Fold #0: ACC 0.833, Training Walltime 732.122
Repeat #0, Fold #1: ACC 0.809, Training Walltime 744.972
Repeat #0, Fold #2: ACC 0.800, Training Walltime 840.389
Repeat #0, Fold #3: ACC 0.768, Training Walltime 849.087
Repeat #0, Fold #4: ACC 0.786, Training Walltime 599.890
Repeat #0, Fold #5: ACC 0.791, Training Walltime 587.464
Repeat #0, Fold #6: ACC 0.741, Training Walltime 637.936
Repeat #0, Fold #7: ACC 0.800, Training Walltime 637.594
Repeat #0, Fold #8: ACC 0.764, Training Walltime 568.018
Repeat #0, Fold #9: ACC 0.805, Training Walltime 573.165
