<div style='margin: auto; width: 80%;'><h1 style='font-size: 55px; display: inline-block'>CSV to OpenML Helper</h1> <img style="float: left; height:80px; margin-right:10px;" src="https://raw.githubusercontent.com/PGijsbers/Talks/master/odsc/images/openml/dots.png"></div> <i class="fas fa-file-csv"></i>

This notebook helps you upload a csv-file to OpenML.
To use this notebook, run it cell-by-cell.
Whenever the text prompts you to do something, do that before continuing in the notebook.

If you experience issues using this notebook, or have further questions, please [click here](https://github.com/PGijsbers/csv-to-openml/issues/new) to open an issue on GitHub.

In [1]:
from __future__ import print_function
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets
from IPython.display import Markdown, display

import csv
import io
import re
import numpy as np
import openml
import pandas as pd

In [2]:
# UI components that will be rendered in this notebook:
upload_widget = widgets.FileUpload(
    accept='.csv',
    multiple=False,
    description='Select a csv file'
)

publish_button = widgets.Button(
    description='Publish dataset',
    disabled=True,
    button_style='', # 'success', 'info', 'warning', 'danger' or ''
    tooltip='Click me',
    icon='check',
    visible=False
)

In [3]:
data = None

def on_file_uploaded(input_):
    global data
    file_content = io.StringIO(upload_widget.data[0].decode())
    
    has_header = csv.Sniffer().has_header(file_content.read(1024))
    file_content.seek(0)
    
    data = pd.read_csv(file_content, header=0 if has_header else None)
    publish_button.visible=True

upload_widget.observe(on_file_uploaded, 'data')
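
The upload handler above relies on `csv.Sniffer().has_header` to decide whether the first row holds column names. As a minimal sketch of that step in isolation (the two sample strings and the helper name `header_argument` are made up for illustration), the following shows how the sniffer result maps to the `header` argument that the notebook passes to `pd.read_csv`:

```python
import csv


def header_argument(text, sample_size=1024):
    """Return 0 if the first row looks like a header, else None."""
    # Like the notebook, only inspect the first `sample_size` characters.
    has_header = csv.Sniffer().has_header(text[:sample_size])
    return 0 if has_header else None


# Hypothetical file contents, one with and one without a header row.
with_header = (
    "sepal_length,sepal_width,species\n"
    "5.1,3.5,setosa\n"
    "4.9,3.0,setosa\n"
    "4.7,3.2,setosa\n"
)
without_header = (
    "5.1,3.5,setosa\n"
    "4.9,3.0,setosa\n"
    "4.7,3.2,setosa\n"
)

print(header_argument(with_header))
print(header_argument(without_header))
```

The sniffer is a heuristic, so it can misjudge unusual files. If the preview further down shows your column names as the first data row (or the reverse), the header was probably detected incorrectly.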

Please select the CSV file you want to upload to OpenML:

In [4]:
upload_widget

FileUpload(value={}, accept='.csv', description='Select a csv file')

In [5]:
# Without a header, pandas assigns integer column names, which is inconvenient later; convert them to str.
data.columns = data.columns.astype(str)

In [6]:
Markdown(f"The selected file has {len(data)} rows and {len(data.columns)} columns. "
         "Below is a preview of the first rows of your csv file.")

The selected file has 150 rows and 5 columns. Below is a preview of the first rows of your csv file.

In [7]:
data.head()

Unnamed: 0,A,B,C,D,Flower
0,1,3.5,1.4,0.2,Iris-setosa
1,9,3.0,1.4,0.2,Iris-setosa
2,7,3.2,1.3,0.2,Iris-setosa
3,6,3.1,1.5,0.2,Iris-setosa
4,0,3.6,1.4,0.2,Iris-setosa


Next, we are going to run some code to help with data annotation...

In [8]:
# infer which variables are categorical
MAX_UNIQUE_VALUES = 10
for column in data.columns:
    if data[column].nunique() <= MAX_UNIQUE_VALUES:
        data[column] = data[column].astype('category')
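
The cell above treats every column with at most `MAX_UNIQUE_VALUES` distinct values as categorical; the remaining columns are later classified as numeric or string based on their dtype. As a rough, dependency-free sketch of the same idea (the helper `infer_column_type` and the sample columns are hypothetical, and the notebook itself works on pandas dtypes rather than parsing values):

```python
MAX_UNIQUE_VALUES = 10  # same threshold as in the cell above


def infer_column_type(values, max_unique=MAX_UNIQUE_VALUES):
    """Classify a column as 'categorical', 'numeric' or 'string'."""
    # Few distinct values: assume the column encodes categories.
    if len(set(values)) <= max_unique:
        return 'categorical'
    # Otherwise the column is numeric only if every value parses as a number.
    try:
        for value in values:
            float(value)
    except (TypeError, ValueError):
        return 'string'
    return 'numeric'


# Made-up example columns.
species = ['setosa', 'versicolor', 'virginica'] * 50
lengths = [str(4.0 + i / 10) for i in range(40)]
names = [f'plant-{i}' for i in range(40)]

for column in (species, lengths, names):
    print(infer_column_type(column))
```

Note that a numeric column with only a handful of distinct values (for example a 0/1 flag) also ends up as categorical under this rule, which is why the inferred types can be corrected by hand further down.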

In [9]:
ignore_columns = {}
id_column = []
id_radio_buttons = []
target_column = []
target_radio_buttons = []

def set_ignore_column_maker(column_widget):
    def ignore_column(change):
        if change['new']:  # Checkbox is set to True.
            ignore_columns[column_widget.value] = True
        else:
            ignore_columns[column_widget.value] = False
    return ignore_column

def set_id_column_maker(column_widget):
    def set_id_column(change):
        if change['new'] == '':
            id_column[:] = [column_widget.value]
            for radio_button in id_radio_buttons:
                if change['owner'] != radio_button:
                    radio_button.value = None
    return set_id_column

def set_target_column_maker(column_widget):
    def set_target_column(change):
        if change['new'] == '':
            target_column[:] = [column_widget.value]
            for radio_button in target_radio_buttons:
                if change['owner'] != radio_button:
                    radio_button.value = None
    return set_target_column

def set_column_type_maker(column_widget):
    def set_column_type(change):
        if change['new'] == 'categorical':
            data[column_widget.value] = data[column_widget.value].astype('category')
        if change['new'] == 'numeric':
            data[column_widget.value] = pd.to_numeric(data[column_widget.value])
        if change['new'] == 'string':
            data[column_widget.value] = data[column_widget.value].astype('str')
    return set_column_type

w_col_name, w_col_type, w_col_examples, w_ignore, w_id = '200px', '150px', '300px', '60px', '40px'

coltype_widgets = []
coltype_widgets.append(
    widgets.HBox(
        [
            widgets.Label(
                'Column Names',
                layout=widgets.Layout(width=w_col_name)),
            widgets.Label(
                'Column Types',
                layout=widgets.Layout(width=w_col_type)),
            widgets.Label(
                'Example Values',
                layout=widgets.Layout(width=w_col_examples)),
            widgets.Label(
                'Ignore',
                layout=widgets.Layout(width=w_ignore)),
            widgets.Label(
                'ID',
                layout=widgets.Layout(width=w_id)),
            widgets.Label(
                'Target',
                layout=widgets.Layout(width=w_ignore)),
        ])
)

for i, column in enumerate(data.columns):
    column_name_widget = widgets.Text(
        value=column,
        layout=widgets.Layout(width=w_col_name)
    )    
    def set_column_name(change):        
        data.rename(columns={change['old']: change['new']}, inplace=True)
        # Update Ignore dict and ID field to refer to the new column name
        if change['old'] in ignore_columns:
            ignore_columns[change['new']] = ignore_columns[change['old']]
            del ignore_columns[change['old']]
        if change['old'] in id_column:
            id_column[:] = [change['new']]
        if change['old'] in target_column:
            target_column[:] = [change['new']]
    
    column_name_widget.observe(set_column_name, 'value')
    
    if data[column].dtype.name == 'category':
        coltype = 'categorical'
    elif np.issubdtype(data[column].dtype, np.number):
        coltype = 'numeric'
    else:
        coltype = 'string'
    
    column_type_widget = widgets.Dropdown(
        options=['numeric', 'string', 'categorical'],
        value=coltype,
        layout=widgets.Layout(width=w_col_type)
    )    
    set_column_type = set_column_type_maker(column_name_widget)    
    column_type_widget.observe(set_column_type, 'value')
    
    example_values_widget = widgets.Text(
        value=', '.join([str(v) for v in data[column].head().values]),
        layout=widgets.Layout(width=w_col_examples)
    )    

    ignore_widget = widgets.Checkbox(value=False, 
                                     layout=widgets.Layout(width=w_ignore), 
                                     style={'description_width':'0px'})
    set_ignore_column = set_ignore_column_maker(column_name_widget)
    ignore_widget.observe(set_ignore_column, 'value')
    
    id_widget = widgets.RadioButtons(options=[''],
                                     value=None,
                                     layout=widgets.Layout(width=w_id), 
                                     style={'description_width':'0px'})
    set_id_column = set_id_column_maker(column_name_widget)
    id_widget.observe(set_id_column, 'value')
    id_radio_buttons.append(id_widget)
    
    target_widget = widgets.RadioButtons(options=[''],
                                 value=None,
                                 layout=widgets.Layout(width=w_id), 
                                 style={'description_width':'0px'})
    set_target_column = set_target_column_maker(column_name_widget)
    target_widget.observe(set_target_column, 'value')
    target_radio_buttons.append(target_widget)
    
    coltype_widgets.append(
        widgets.HBox(
        [
            column_name_widget,
            column_type_widget,
            example_values_widget,
            ignore_widget,
            id_widget,
            target_widget
        ])
    )    

OpenML wants to capture some rich meta-data about uploaded datasets, so that other users and programs may make better use of the datasets.

It's crucial for OpenML to know the *type* of data in each column.
Each feature should be one of:

 - A numeric feature. Examples: `car price` or `tree height`
 - A string (text). Examples: `sales text` or `tree name`
 - A categorical feature (can only take one of a set of unique values). Examples: `car color` (red, blue, ...) or `evergreen` (yes, no).
 
Based on the data found in your csv file, we inferred the type of each column.
Below you will find a table which allows you to add or edit any of the feature meta-data OpenML accepts:

In the **'Column Names'** column you will find the column names of your data.
You can edit the column names directly.

The **'Column Types'** column shows the column types as described above, as inferred from the data.
If the column type is not correct, please select the correct option from the dropdown menu.

The **'Example Values'** column simply shows some values of the column for easy reference.
This column should not be edited (editing it has no effect).

In the **'Ignore'** column, you can select the columns which should be ignored when creating models (e.g. identifiers or indexes).
If no such column exists in the dataset, this column may be ignored.

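In the **'ID'** column you can select the column that contains row ids, if such a column is present.
If no row id column is present in the dataset, this column may be ignored.

In the **'Target'** column you can select the column that should be used as the default prediction target, if there is one.
If the dataset has no natural target, this column may be ignored.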
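
Behind the table, the notebook tracks these selections in plain Python containers: a dict for 'Ignore' and single-element lists for 'ID' and 'Target'. When a column is renamed, the stored names have to follow. A simplified sketch of that bookkeeping, with a hypothetical helper `rename_selection` standing in for the notebook's widget callback:

```python
def rename_selection(old, new, ignore_columns, id_column, target_column):
    """Update the Ignore/ID/Target bookkeeping after a column rename."""
    if old in ignore_columns:
        # Move the ignore flag over to the new column name.
        ignore_columns[new] = ignore_columns.pop(old)
    if old in id_column:
        id_column[:] = [new]
    if old in target_column:
        target_column[:] = [new]


# Made-up selections for the preview table shown earlier.
ignore_columns = {'Unnamed: 0': True, 'A': False}
id_column = ['Unnamed: 0']
target_column = ['Flower']

rename_selection('Unnamed: 0', 'row_id', ignore_columns, id_column, target_column)
rename_selection('Flower', 'species', ignore_columns, id_column, target_column)

# Only columns whose checkbox is ticked are reported as ignored.
ignored = [col for col, ignore in ignore_columns.items() if ignore]
print(ignored, id_column, target_column)
```

The lists are updated in place (`[:] = ...`) so that every callback holding a reference to them sees the new name.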


Please check that the names and types are correct, and complete the 'Ignore', 'ID' and 'Target' columns.

In [10]:
widgets.VBox(coltype_widgets)

VBox(children=(HBox(children=(Label(value='Column Names', layout=Layout(width='200px')), Label(value='Column T…

Before you continue, double-check that the data, column names and data types look correct (if not, retrace steps above). 

*If you accidentally selected an ID or Target column but there should be none, please rerun the large code cell before the previous markdown segment as well as the `widgets.VBox(coltype_widgets)` cell. This will erase the Ignore, ID and Target data (but not names and types).*

## Meta-data

Thanks for bearing with us! All this extra information is going to make sure the dataset is easier for others to find and understand. There are just a few more things we'd like to know:

In [11]:
def create_widget(label: str, long_description: str, type_, layout_args=dict(width='900px')):
    return type_(
        value=None,
        placeholder=long_description,
        description=label,
        long_description=long_description,
        layout=widgets.Layout(**layout_args)
    )


name_widget = create_widget('Name', 'Name of the dataset', widgets.Text)

desc_long =(
    'A description of the dataset. '
    'Include for example:\n'
    ' - What is the domain of this dataset?\n'
    ' - How was the dataset gathered?\n'
    ' - What is the meaning of each feature?'
)
description_widget = create_widget(
    'Description', desc_long, widgets.Textarea,
    dict(width='900px', height='120px')
)

collection_date_widget = create_widget('Collection Date', 'Date data was originally collected', widgets.DatePicker)
creator_widget = create_widget('Creator(s)', 'Original creator(s) of the dataset', widgets.Text)
contributor_widget = create_widget('Contributor(s)', 'People who further contributed to the dataset (e.g. formatting)', widgets.Text)
dataset_url_widget  = create_widget('Data URL', 'URL to the dataset if it is also hosted elsewhere', widgets.Text)
paper_url_widget =  create_widget('Paper URL', 'URL to the paper which introduced the dataset', widgets.Text)
citation_widget = create_widget('Citation', 'Citation for inclusion in a bibliography', widgets.Text)
# https://help.data.world/hc/en-us/articles/115006114287-Common-license-types-for-datasets
licence_widget = create_widget('Licence', 'Licence of the dataset, e.g. Public Domain, CC0, CC BY-NC', widgets.Text)
language_widget = create_widget('Language(s)', 'Language(s) in which the data is represented.', widgets.Text)

widgets.VBox([
    name_widget, description_widget, creator_widget, contributor_widget, collection_date_widget,
    dataset_url_widget, paper_url_widget, citation_widget, licence_widget, language_widget
])


VBox(children=(Text(value='', description='Name', layout=Layout(width='900px'), placeholder='Name of the datas…

## Uploading to OpenML
The following few code cells process your input and format the data for uploading.
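
Most of that processing is normalisation: fields left empty in the form are passed on as `None`, and the collection date is formatted as day-month-year. A small sketch of those two steps on their own, using hypothetical helper names (`optional`, `format_collection_date`) and made-up form values:

```python
import datetime


def optional(value):
    """Map empty widget values ('' or None) to None."""
    return value or None


def format_collection_date(date):
    """Format a date as day-month-year, or return None if no date was picked."""
    return None if date is None else date.strftime("%d-%m-%Y")


# Made-up form input; empty strings stand for fields that were left blank.
form = {'creator': 'Jane Doe', 'contributor': '', 'licence': ''}

cleaned = {field: optional(value) for field, value in form.items()}
date_string = format_collection_date(datetime.date(2020, 3, 5))

print(cleaned)
print(date_string)
```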

In [12]:
from openml.datasets.functions import create_dataset

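# Use the selected ID and target columns, or None if none was selected.
id_ = id_column[0] if id_column else None
target = target_column[0] if target_column else None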
ignore_ = [col for col, ignore in ignore_columns.items() if ignore]
collection_date = None if collection_date_widget.value is None else collection_date_widget.value.strftime("%d-%m-%Y")

for column in data.columns:
    if data[column].dtype.name == 'category':
        # OpenML Python requires categorical values to be strings.
        data[column] = data[column].astype(str).astype('category')

oml_dataset = create_dataset(
    name=name_widget.value,
    description=description_widget.value,
    creator=creator_widget.value or None,
    contributor=contributor_widget.value or None,
    collection_date=collection_date,
    language=language_widget.value  or None,
    licence=licence_widget.value  or None,
    default_target_attribute=target or None,
    row_id_attribute=id_ or None,
    citation=citation_widget.value or None,
    ignore_attribute=ignore_ or None,
    attributes='auto',
    data=data,
    #version_label
    original_data_url=dataset_url_widget.value or None,
    paper_url=paper_url_widget.value or None
)



In [13]:
# Cell for testing
# openml.config.start_using_configuration_for_example()
# openml.config.apikey = ''

In [14]:
def publish(_):
    oml_dataset.publish()
    display(oml_dataset)
    publish_button.disabled = True
    publish_button.description = 'Published!'

publish_button.on_click(publish)

In [15]:
text_and_input = None
if openml.config.apikey == '':
    key_text = widgets.Output()    
    need_api_key_text = """    
    We noticed you have not configured an API key for OpenML yet.
    To find your API key, log in on the [OpenML website](https://openml.org) ([register](https://www.openml.org/register) if needed)
    , go to your account page (click the avatar image on the top right) and click "API Authentication".
    """
    with key_text:
        display(Markdown(need_api_key_text))

    def set_openml_apikey(key):
        openml.config.apikey = key
        if re.fullmatch('[a-f0-9]{32}', key):
            text_and_input.close()
            publish_button.disabled = False

    key_input = interactive(set_openml_apikey, key='')
    key_input.kwargs_widgets[0].description = 'API Key:'

    text_and_input = widgets.VBox([key_text, key_input])
    # show 'need_api_key_text'
else:
    publish_button.disabled = False

Below the following cell you should find the button which allows you to publish to OpenML!
In case your authentication is not (correctly) configured, follow the instructions to enable the button.
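
The button is enabled as soon as the entered key matches the pattern `[a-f0-9]{32}`, i.e. 32 lowercase hexadecimal characters. This is a format check only; it does not verify the key with the OpenML server. A minimal sketch of the check, where `looks_like_api_key` is a hypothetical helper and `example_key` is a made-up value rather than a real key:

```python
import re

# Same pattern as used in the cell above.
API_KEY_PATTERN = re.compile(r'[a-f0-9]{32}')


def looks_like_api_key(key):
    """Return True if the whole string is 32 lowercase hex characters."""
    return API_KEY_PATTERN.fullmatch(key) is not None


example_key = '0123456789abcdef' * 2  # made-up value, not a real key

print(looks_like_api_key(example_key))
print(looks_like_api_key(example_key.upper()))
print(looks_like_api_key(example_key[:-1]))
```

If the button stays disabled after pasting your key, check for stray spaces or missing characters at either end.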

In [16]:
if text_and_input is not None:
    display(text_and_input)
display(publish_button)

Button(description='Publish dataset', icon='check', style=ButtonStyle(), tooltip='Click me')

OpenML Dataset
Name.........: Iris-Test3
Version......: None
Format.......: arff
Licence......: None
Download URL.: None
OpenML URL...: https://www.openml.org/d/18939
# of features: None

---
### Thank you very much for sharing your dataset and contributing to a world of Open Science!

-----
#### Please Ignore Anything Below

TODOs:

[ ] Format text as Markdown for API Key message (Maybe have separate cell output markdown widget).

[x] Infer Row ID attribute and/or allow user to set this. 
    [ ] try infer?
    
[x] Set default target attribute 
    [ ] try infer?

[ ] Perform checks for **required** meta-data fields before publish (name, description)

[ ] Bug - Column names may not be identical **at any point**.

[ ] Input Checking - Perform xsd checks (e.g. no space in column name)