## Ignore this section

TODOs:

[x] Feedback on loaded file: column names, number columns, instances.

[x] Infer Column Types

[x] Allow user to modify column types.

[ ] Perform Publish call

[ ] Fix bug: spaces introduced in categorical values list

[ ] Format text as Markdown for API Key message (Maybe have separate cell output markdown widget).

[ ] Modify column types with a dropdown instead (but then text for categorical?)

[ ] Infer Row ID attribute and/or allow user to set this. (radio button in Infer Column Types table?)

[ ] Set default target attribute (radio button in Infer Column Types Table?)

[ ] Add required meta-data fields (name and description? more?)

[ ] OpenML Logo :)


# CSV to OpenML helper

This notebook helps you upload a CSV file to OpenML.
To use this notebook, run it cell-by-cell.
Whenever the text prompts you to do something, do that before continuing in the notebook.

If you experience issues using this notebook, or have further questions, please [click here](https://github.com/PGijsbers/csv-to-openml/issues/new) to open an issue on GitHub.

In [2]:
from __future__ import print_function
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets
from IPython.display import Markdown, display  # display is called explicitly in later cells

import csv
import io
import re
import numpy as np
import openml
import pandas as pd

In [3]:
# UI components that will be rendered in this notebook:
upload_widget = widgets.FileUpload(
    accept='.csv',
    multiple=False,
    description='Select a csv file'
)

publish_button = widgets.Button(
    description='Publish dataset',
    disabled=True,
    button_style='', # 'success', 'info', 'warning', 'danger' or ''
    tooltip='Click me',
    icon='check',
    visible=False
)

In [4]:
data = None

def on_file_uploaded(input_):
    global data
    file_content = io.StringIO(upload_widget.data[0].decode())
    
    has_header = csv.Sniffer().has_header(file_content.read(1024))
    file_content.seek(0)
    
    data = pd.read_csv(file_content, header=0 if has_header else None)
    publish_button.visible=True

upload_widget.observe(on_file_uploaded, 'data')

Please select the CSV file you want to upload to OpenML:

In [5]:
upload_widget

FileUpload(value={}, accept='.csv', description='Select a csv file')
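
The upload handler above relies on `csv.Sniffer().has_header` to decide whether the first row holds column names. The sketch below shows that check in isolation, using only the standard library instead of pandas. The helper name `read_rows` and the sample text are made up for illustration, and the sniffer is a heuristic, so its answer can be wrong for unusual files.

```python
import csv
import io


def read_rows(text, sample_size=1024):
    """Split CSV text into (header, rows); header is None if none is detected."""
    # Same heuristic as the notebook: inspect only the first part of the file.
    has_header = csv.Sniffer().has_header(text[:sample_size])
    rows = list(csv.reader(io.StringIO(text)))
    if has_header:
        return rows[0], rows[1:]
    return None, rows


# Hypothetical sample data, not the file used in this notebook.
sample = (
    "sepal_length,sepal_width,species\n"
    "5.1,3.5,setosa\n"
    "4.9,3.0,setosa\n"
    "4.7,3.2,setosa\n"
)
header, rows = read_rows(sample)
print(header)     # ['sepal_length', 'sepal_width', 'species']
print(len(rows))  # 3
```

If the detection is wrong for your file, adding or removing the header row in the CSV file itself is the simplest workaround.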

In [7]:
Markdown(f"The selected file has {len(data)} rows and {len(data.columns)} columns. "
         "Below is a preview of the first rows of your csv file.")

The selected file has 150 rows and 5 columns. Below is a preview of the first rows of your csv file.

In [8]:
data.head()

   A    B    C    D       Flower
0  1  3.5  1.4  0.2  Iris-setosa
1  9  3.0  1.4  0.2  Iris-setosa
2  7  3.2  1.3  0.2  Iris-setosa
3  6  3.1  1.5  0.2  Iris-setosa
4  0  3.6  1.4  0.2  Iris-setosa


Next, we run some code that infers a type for each column, to help with data annotation.

In [46]:
# infer which variables are categorical
MAX_UNIQUE_VALUES = 10
for column in data.columns:
    if data[column].nunique() <= MAX_UNIQUE_VALUES:
        data[column] = data[column].astype('category')

In [61]:
def set_column_type_maker(column_widget):
    """Build a callback that converts the column named in `column_widget`."""
    def set_column_type(change):
        # Read the name at call time, so a renamed column is still found.
        column = column_widget.value
        if change['new'] == 'categorical':
            data[column] = data[column].astype('category')
        elif change['new'] == 'numeric':
            data[column] = pd.to_numeric(data[column])
        elif change['new'] == 'string':
            data[column] = data[column].astype('str')
    return set_column_type

coltype_widgets = []
coltype_widgets.append(
    widgets.HBox(
        [
            widgets.Label(
                'Column Names',
                layout=widgets.Layout(width='300px')),
            widgets.Label(
                'Column Types',
                layout=widgets.Layout(width='300px')),
            widgets.Label(
                'Example Values',
                layout=widgets.Layout(width='300px')),
        ])
)

for i, column in enumerate(data.columns):
    column_name_widget = widgets.Text(
        value=column,
        layout=widgets.Layout(width='300px')
    )    
    def set_column_name(change):
        # Keep the DataFrame in sync with the edited name. The earlier
        # `column_types` bookkeeping is dropped: that dict is never defined.
        data.rename(columns={change['old']: change['new']}, inplace=True)
    
    column_name_widget.observe(set_column_name, 'value')
    
    if data[column].dtype.name == 'category':
        coltype = 'categorical'
    elif np.issubdtype(data[column].dtype, np.number):
        coltype = 'numeric'
    else:
        coltype = 'string'
    
    column_type_widget = widgets.Dropdown(
        options=['numeric', 'string', 'categorical'],
        value=coltype,
        layout=widgets.Layout(width='300px')
    )    
    set_column_type = set_column_type_maker(column_name_widget)    
    column_type_widget.observe(set_column_type, 'value')
    
    example_values_widget = widgets.Text(
        value=', '.join([str(v) for v in data[column].head().values]),
        layout=widgets.Layout(width='300px')
    )    
    
    coltype_widgets.append(
        widgets.HBox(
        [
            column_name_widget,
            column_type_widget,
            example_values_widget
        ])
    )    

It's crucial for OpenML to know the *type* of data in each column.
Each feature should be one of:

 - A numeric feature. Examples: `car price` or `tree height`
 - A string (text). Examples: `sales text` or `tree name`
 - A nominal feature (can only take one of a set of unique values). Examples: `car color` (red, blue, ...) or `evergreen` (yes, no).
 
Based on the data found in your CSV file, we inferred the types for each column.
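
The inference rule used earlier in this notebook (a column with at most `MAX_UNIQUE_VALUES` distinct values is treated as categorical, otherwise numeric if possible, otherwise string) can be sketched without pandas as follows. The function `infer_column_type` is a hypothetical stand-in for the dtype checks in the cells above, not part of the notebook or of OpenML.

```python
MAX_UNIQUE_VALUES = 10  # same threshold as the inference cell above


def infer_column_type(values, max_unique=MAX_UNIQUE_VALUES):
    """Return 'categorical', 'numeric' or 'string' for a list of raw cell values."""
    # Few distinct values: treat the column as categorical (checked first,
    # mirroring the order of the checks in the notebook).
    if len(set(values)) <= max_unique:
        return 'categorical'
    # Otherwise the column is numeric only if every value parses as a number.
    try:
        for value in values:
            float(value)
    except (TypeError, ValueError):
        return 'string'
    return 'numeric'


print(infer_column_type(['red', 'blue'] * 10))               # categorical
print(infer_column_type([str(i / 10) for i in range(50)]))   # numeric
print(infer_column_type([f'tree-{i}' for i in range(50)]))   # string
```

Because the distinct-value check comes first, a numeric column with only a handful of different values is also flagged as categorical, which is one reason the inferred types need a manual check.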

The table below has three columns. The left column shows the column names of your data.
You can edit the column names directly.

The middle column shows the column types.
Select `numeric`, `string` or `categorical` (a nominal feature) from the dropdown.

The right column shows some values of the column for easy reference.
This column should not be edited.

Please check that the types are correct, and correct any mistakes.

In [65]:
widgets.VBox(coltype_widgets)

VBox(children=(HBox(children=(Label(value='Column Names', layout=Layout(width='300px')), Label(value='Column T…

Double-check that the data, column names and data types look correct (if not, retrace steps above):

In [68]:
display(data.head())
display(data.dtypes)

   Actually  Bees  Cover  Droves       Flower
0         1   3.5    1.4     0.2  Iris-setosa
1         9   3.0    1.4     0.2  Iris-setosa
2         7   3.2    1.3     0.2  Iris-setosa
3         6   3.1    1.5     0.2  Iris-setosa
4         0   3.6    1.4     0.2  Iris-setosa


Actually    category
Bees         float64
Cover        float64
Droves       float64
Flower      category
dtype: object

## Meta-data

Great! Now we just need some meta-data such as a name and description.
This meta-data makes it easier for others to find and understand your dataset.

In [26]:
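def create_widget(label: str, long_description: str, type_, layout_args=None):
    """Create a widget of class `type_` with a label and placeholder text."""
    # Avoid a mutable default argument; fall back to the standard width.
    if layout_args is None:
        layout_args = dict(width='900px')
    return type_(
        placeholder=long_description,
        description=label,
        layout=widgets.Layout(**layout_args)
    )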


name_widget = create_widget('Name', 'Name of the dataset', widgets.Text)

desc_long = (
    'A description of the dataset. '
    'Include for example:\n'
    ' - What is the domain of this dataset?\n'
    ' - How was the dataset gathered?\n'
    ' - What is the meaning of each feature?'
)
description_widget = create_widget(
    'Description', desc_long, widgets.Textarea,
    dict(width='900px', height='120px')
)

collection_date_widget = create_widget('Collection Date', 'Date data was originally collected', widgets.DatePicker)
creator_widget = create_widget('Creator(s)', 'Original creator(s) of the dataset', widgets.Text)
contributor_widget = create_widget('Contributor(s)', 'People who further contributed to the dataset (e.g. formatting)', widgets.Text)
dataset_url_widget = create_widget('Data URL', 'URL to the dataset if it is also hosted elsewhere', widgets.Text)
paper_url_widget = create_widget('Paper URL', 'URL to the paper which introduced the dataset', widgets.Text)
citation_widget = create_widget('Citation', 'Citation for inclusion in a bibliography', widgets.Text)
# https://help.data.world/hc/en-us/articles/115006114287-Common-license-types-for-datasets
licence_widget = create_widget('Licence', 'Licence of the dataset, e.g. Public Domain, CC0, CC BY-NC', widgets.Text)
language_widget = create_widget('Language(s)', 'Language(s) in which the data is represented.', widgets.Text)

widgets.VBox([
    name_widget, description_widget, creator_widget, contributor_widget, collection_date_widget,
    dataset_url_widget, paper_url_widget, citation_widget, licence_widget, language_widget
])


VBox(children=(Text(value='', description='Name', layout=Layout(width='900px'), placeholder='Name of the datas…

## Uploading to OpenML

In [44]:
from openml.datasets.functions import create_dataset

oml_dataset = create_dataset(
    name=name_widget.value,
    description=description_widget.value,
    creator=creator_widget.value,
    contributor=contributor_widget.value,
    collection_date=collection_date_widget.value.strftime("%d-%m-%Y"),
    language=language_widget.value,
    licence=licence_widget.value,
    #default_target_attribute
    #row_id_attribute
    citation=citation_widget.value,
    attributes='auto',
    data=data,
    #version_label
    original_data_url=dataset_url_widget.value,
    paper_url=paper_url_widget.value
)

TypeError: create_dataset() missing 2 required positional arguments: 'default_target_attribute' and 'ignore_attribute'
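
The `TypeError` above names the two arguments the call still lacks: `default_target_attribute` and `ignore_attribute`. A minimal sketch of how they could be prepared is shown below. The helper `target_and_ignore_arguments` is hypothetical, and which value types `create_dataset` accepts for these arguments (a column name, a list of names, `None`) is an assumption that should be checked against the docstring of your installed `openml` version.

```python
def target_and_ignore_arguments(columns, target=None, ignore=None):
    """Validate the two arguments reported missing and return them as a dict."""
    ignore = list(ignore or [])
    # Every referenced column must exist in the data.
    unknown = [c for c in [target, *ignore] if c is not None and c not in columns]
    if unknown:
        raise ValueError(f"Unknown column(s): {unknown}")
    return {
        'default_target_attribute': target,
        # Assumption: None means "no columns are ignored".
        'ignore_attribute': ignore or None,
    }


# Column names taken from the renamed example data in this notebook.
extra = target_and_ignore_arguments(
    ['Actually', 'Bees', 'Cover', 'Droves', 'Flower'], target='Flower'
)
print(extra)  # {'default_target_attribute': 'Flower', 'ignore_attribute': None}
```

The resulting dictionary could then be unpacked into the call with `**extra`, next to the keyword arguments already listed in the cell above.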

In [None]:
openml.config.start_using_configuration_for_example()

In [None]:
# openml.config.apikey = ''

In [None]:
# retrieve data: name_widget.value

In [None]:
if openml.config.apikey == '':
    key_text = widgets.Output()    
    # Built without leading indentation, because Markdown renders lines
    # indented by four spaces as a code block.
    need_api_key_text = (
        "We noticed you have not configured an API key for OpenML yet. "
        "To find your API key, log in on the [OpenML website](https://openml.org) "
        "([register](https://www.openml.org/register) if needed), "
        "go to your account page (click the avatar image on the top right) "
        'and click "API Authentication".'
    )
    with key_text:
        display(Markdown(need_api_key_text))

    def set_openml_apikey(key):
        openml.config.apikey = key
        if re.fullmatch('[a-f0-9]{32}', key):
            text_and_input.close()
            publish_button.disabled = False

    key_input = interactive(set_openml_apikey, key='')
    key_input.kwargs_widgets[0].description = 'API Key:'

    text_and_input = widgets.VBox([key_text, key_input])
    # show 'need_api_key_text'
    display(text_and_input)
display(publish_button)
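
The cell above enables the publish button only once the entered key matches `[a-f0-9]{32}`. The sketch below shows that check on its own. The helper name `looks_like_api_key` is made up for illustration, and the assumption that every OpenML API key consists of exactly 32 lowercase hexadecimal characters comes from the notebook's pattern, not from the OpenML documentation.

```python
import re

# Pattern used by the notebook: exactly 32 lowercase hexadecimal characters.
API_KEY_PATTERN = re.compile(r'[a-f0-9]{32}')


def looks_like_api_key(key):
    """Return True if `key` has the shape the notebook expects of an API key."""
    return API_KEY_PATTERN.fullmatch(key) is not None


print(looks_like_api_key('0123456789abcdef' * 2))  # True
print(looks_like_api_key('0123456789ABCDEF' * 2))  # False (uppercase)
print(looks_like_api_key('abc123'))                # False (too short)
```

Using `fullmatch` rather than `match` matters here: a longer string that merely starts with 32 valid characters is rejected.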