In [1]:
import os
import sys

# In case it is not installed as a package
# the project folder needs to be added to
# the path

# Change the path to the project folder
project_path = os.path.expanduser(os.path.join("~", "Documents", "ResultExtractor"))
if project_path not in sys.path:
    sys.path.append(project_path)

# Introduction to the OpenML ResultExtractor

The OpenML ResultExtractor is a package for analysing the data at OpenML under different task and flow filters. It builds a Pandas DataFrame from the Cartesian product of the tasks and flows, where each entry is a run, a list of runs, or empty for that task and flow combination.
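
To make the table layout concrete, here is a minimal sketch in plain Python. It is not the package's implementation: the run records are made up, and a dictionary keyed by `(task, flow)` pairs stands in for the DataFrame. It only shows how every task and flow combination gets a cell, and how a cell without runs stays empty.

```python
from itertools import product

# Hypothetical (task_id, flow_id, run_id) records; the ids are made up
# for illustration and are not taken from OpenML.
runs = [
    (3, 5527, 101),
    (3, 5527, 102),
    (31, 5527, 103),
    (31, 4102, 104),
]

tasks = sorted({task for task, _, _ in runs})
flows = sorted({flow for _, flow, _ in runs})

# One cell per (task, flow) pair of the Cartesian product.
# None marks a combination for which no run exists.
table = {pair: None for pair in product(tasks, flows)}
for task, flow, run in runs:
    if table[(task, flow)] is None:
        table[(task, flow)] = set()
    table[(task, flow)].add(run)

for pair in product(tasks, flows):
    print(pair, table[pair])
```

Task 3 has no run for flow 4102 in this toy data, so that cell stays `None`, which plays the role of the `NaN` entries in the real results.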

The ResultExtractor can be initialized in multiple ways:

* With positional arguments, which represent the flow ids to consider.
* With keyword arguments, which represent run restrictions.
* With positional and keyword arguments combined.
* Without any arguments, in which case all flows and all tasks run on them are considered.

The last case is not covered in this tutorial: OpenML holds a large number of results, and retrieving all of them takes too long.
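
The four call styles listed above rely on standard Python argument passing. The sketch below uses a hypothetical stand-in function, `describe_filters`, to show how positional flow ids and keyword run restrictions arrive separately. It is not the ResultExtractor constructor, whose actual signature may differ.

```python
def describe_filters(*flow_ids, **run_filters):
    """Hypothetical stand-in that only reports what it received."""
    return {"flow_ids": sorted(flow_ids), "run_filters": dict(run_filters)}


# A set of flow ids, unpacked into positional arguments with '*'.
flow_ids = {5527, 4102}

only_flows = describe_filters(*flow_ids)
only_filters = describe_filters(uploader=[86])
combined = describe_filters(*flow_ids, task_type=1)
unrestricted = describe_filters()

print(only_flows)
print(only_filters)
print(combined)
print(unrestricted)
```

Because the flow ids are collected through `*flow_ids`, a set or list of ids has to be unpacked with `*` at the call site, which is what the later cells do with `ResultExtractor(*flow_ids)`.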

## Getting flow ids given flow identifiers

The package offers a helper function that returns flow ids for given flow qualifiers.
A flow qualifier can be a flow name, e.g. **'mlr.classif.svm'**, or a flow name combined with a flow version, e.g. **'mlr.classif.svm_6'**. The latter is a **unique** flow qualifier and corresponds to a single id.

A list of flows can be found at https://www.openml.org/search?type=flow and it 
can be sorted according to your needs.
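
As an illustration of the qualifier format described above, the following sketch splits a qualifier into a name and an optional version. The function `split_flow_qualifier` is hypothetical and assumes the version is the trailing integer after the last underscore; the package's own helper may parse qualifiers differently.

```python
def split_flow_qualifier(qualifier):
    """Split 'name_version' into (name, version).

    Hypothetical helper: assumes the version, if present, is the
    integer following the last underscore. Returns (qualifier, None)
    when no such version suffix is found.
    """
    name, sep, version = qualifier.rpartition("_")
    if sep and name and version.isdecimal():
        return name, int(version)
    return qualifier, None


print(split_flow_qualifier("mlr.classif.svm"))
print(split_flow_qualifier("mlr.classif.svm_6"))
print(split_flow_qualifier("weka.RandomForest_5"))
```

A qualifier without a version yields `None` as its second element, which corresponds to the case where several flow versions, and therefore several ids, match.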

In [2]:
# Covering three simple use cases of the helper function
from pprint import pprint 
from src.util import get_flow_ids

# There are 10 different flow versions 
# for the svm algorithm.
print("Providing only a flow name:")
flow_ids = get_flow_ids('mlr.classif.svm')
pprint(flow_ids)

# A unique flow identifier.
print("Providing a flow name and a version:")
flow_ids = get_flow_ids('mlr.classif.svm_6')
pprint(flow_ids)

# Providing multiple arguments
print("Providing multiple flows:")
flow_ids = get_flow_ids('mlr.classif.svm', 'weka.RandomForest_5')
pprint(flow_ids)

Providing only a flow name:
{5891, 4102, 6599, 4141, 6669, 5969, 6322, 5524, 5527, 4319}
Providing a flow name and a version:
{5527}
Providing multiple flows:
{5891, 4102, 6599, 4141, 6669, 5969, 6322, 5524, 1079, 5527, 4319}


## Restricting the flows considered

To limit the number of flows considered, you have to initialize the result extractor with flow ids, given as positional arguments.

In [3]:
from src.result_extractor import ResultExtractor

# Using a single flow
flow_ids = get_flow_ids('mlr.classif.svm_6')
# Initializing the ResultExtractor with the flow ids
result_extracter = ResultExtractor(*flow_ids)
print("Showing the first 3 tasks out of %d" % len(result_extracter.results))
pprint(result_extracter.results.iloc[0:3, 0:3])

Showing the first 3 tasks out of 63
                                                 5527
3   {3932162, 5013506, 5013507, 5505026, 3932175, ...
31  {6684673, 6684677, 4718607, 4718608, 4718609, ...
37  {5079040, 5079060, 5079111, 5079112, 5079115, ...


## Restricting the runs

To filter results by run parameters, initialize the result extractor with keyword arguments.
The run filters currently supported are **uploader**, **task_type** and **tag**.

In [4]:
# The uploader argument should be a list
result_extracter = ResultExtractor(uploader=[86])
print("Filtering by uploader")
print("Showing the first 3 tasks out of %d and limiting to only 3 flows out of %d" % (len(result_extracter.results), len(result_extracter.results.columns)))
pprint(result_extracter.results.iloc[0:3, 0:3])

Filtering by uploader
Showing the first 3 tasks out of 104 and limiting to only 3 flows out of 23
        7218       7223       7226
2  {7942330}  {7943064}  {7942154}
3  {7942334}  {7943085}  {7942221}
6  {7943048}  {7942450}  {7943071}


In [5]:
print("Filtering by task type: 'Learning Curve'")
result_extracter = ResultExtractor(task_type=3)
print("Showing the first 3 tasks out of %d and limiting to only 3 flows out of %d" % (len(result_extracter.results), len(result_extracter.results.columns)))
pprint(result_extracter.results.iloc[0:3, 0:3])

Filtering by task type: 'Learning Curve'
Showing the first 3 tasks out of 252 and limiting to only 3 flows out of 323
               381             391                           385
61         {25089}  {25147, 51318}  {51392, 51390, 25214, 51391}
62  {47972, 25037}  {48354, 25044}                       {48118}
63  {25090, 51235}         {48353}                       {51405}


In [6]:
print("Filtering by tag: 'weka'")
result_extracter = ResultExtractor(tag='weka')
print("Showing the first 3 tasks out of %d and limiting to only 3 flows out of %d" % (len(result_extracter.results), len(result_extracter.results.columns)))
pprint(result_extracter.results.iloc[0:3, 0:3])

Filtering by tag: 'weka'
Showing the first 3 tasks out of 1632 and limiting to only 3 flows out of 1229
   527                                                364  \
1  NaN  {66336, 64549, 64550, 84076, 66093, 84014, 645...   
2  NaN                                            {84019}   
3  NaN                                            {84020}   

                                                 675  
1  {148513, 88866, 148506, 148519, 284748, 84028,...  
2                                            {84029}  
3                                            {84030}  


### Minimum number of tasks for flow

One more available filter is the minimum number of tasks per flow. It is passed as 'min_task_flow' and keeps only **tasks** for which the requirement is fulfilled for **each flow**.

Keep in mind that with multiple flows, the requirement may fail for a single flow or a minority of flows, in which case the ResultExtractor discards the task from the results.

In that situation, it is better to keep the minimum number of tasks per flow at a low value or to omit it.

## Restricting the flows and tasks considered

Combining the above, the following example limits the flows considered and also applies task restrictions.

In [7]:
# Using a single flow
flow_identifier = 'weka.RandomForest_5'
task_type = 1
min_task_flow = 5
flow_ids = get_flow_ids(flow_identifier)
result_extracter = ResultExtractor(
    *flow_ids, 
    task_type=task_type, 
    min_task_flow=min_task_flow
)
print("Filtering by flow %s, \ntask type %d and \nminimum number of tasks for flow %d" 
      % (
          flow_identifier,
          task_type,
          min_task_flow
      )
)
print("Showing the first 3 tasks out of %d" % len(result_extracter.results))
pprint(result_extracter.results.iloc[0:3, 0:3])

Filtering by flow weka.RandomForest_5, 
task type 1 and 
minimum number of tasks for flow 5
Showing the first 3 tasks out of 968
                                                1079
1  {385697, 148578, 348327, 361032, 475240, 31969...
2  {326273, 385698, 365763, 475255, 355976, 47524...
3  {185440, 374561, 385699, 355973, 348330, 36103...
