# Regression Datasets

In [1]:
import openml
import pandas as pd

In [2]:
did_tid = {41021: 167210, 550: 359930, 546: 359931, 541: 359932, 507: 359933, 505: 359934, 287: 359935, 216: 359936, 42705: 317614, 42571: 233212, 41540: 359937, 42225: 233211, 42708: 317613, 42688: 359938, 42572: 233214, 42570: 233215, 422: 359939, 416: 359940, 3050: 13854, 3277: 14097, 42724: 359941, 42721: 359926, 42727: 359942, 42729: 359943, 42726: 359944, 42730: 359945, 201: 359946, 4549: 233213, 41702: 359947, 41980: 359948, 42731: 359949, 531: 359950, 42563: 359951, 574: 359952}
# The keys of did_tid are the OpenML dataset IDs to look up.
regression_datasets = list(did_tid)
datasets = openml.datasets.list_datasets(regression_datasets, output_format="dataframe")
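The `did_tid` dictionary is not explained in the notebook. The sketch below assumes, based on the name only, that it maps each OpenML dataset ID (`did`) to a matching task ID (`tid`). It uses three entries copied from the mapping above; the names `did_tid_excerpt`, `task_for_dataset` and `tid_did` are hypothetical helpers introduced for illustration.

```python
# Three entries copied from the did_tid mapping above.
# Assumption: keys are dataset IDs, values are the paired task IDs.
did_tid_excerpt = {574: 359952, 531: 359950, 42563: 359951}


def task_for_dataset(did: int) -> int:
    """Return the task ID paired with a dataset ID (KeyError if unknown)."""
    return did_tid_excerpt[did]


# Reverse lookup; only valid while no two datasets share a task ID.
tid_did = {tid: did for did, tid in did_tid_excerpt.items()}
assert len(tid_did) == len(did_tid_excerpt)

print(task_for_dataset(574))
```

The same pattern would apply to the full `did_tid` dictionary without modification.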

In [3]:
datasets.columns

Index(['did', 'name', 'version', 'uploader', 'status', 'format',
       'NumberOfClasses', 'NumberOfFeatures', 'NumberOfInstances',
       'NumberOfInstancesWithMissingValues', 'NumberOfMissingValues',
       'NumberOfNumericFeatures', 'NumberOfSymbolicFeatures',
       'MaxNominalAttDistinctValues'],
      dtype='object')

In [4]:
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
import mpld3
import numpy as np
mpld3.enable_notebook()


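from warnings import simplefilter
# matplotlib.cbook.deprecation is no longer importable in recent Matplotlib
# releases; the warning class is also exposed at the package top level.
from matplotlib import MatplotlibDeprecationWarning
# ignore Matplotlib deprecation warnings
simplefilter(action='ignore', category=MatplotlibDeprecationWarning)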

In [5]:
import ipywidgets as widgets
from ipywidgets import Layout

# Visualizations

The following code lets you explore the dataset characteristics. You first need to click on one of the UI controls so that the plot is updated and shown.

In [9]:
x_var = widgets.Select(
    options=list(datasets.columns),
    value="NumberOfInstances",
    rows=1,
    description='X:',
)

y_var = widgets.Select(
    options=list(datasets.columns),
    value="NumberOfFeatures",
    rows=1,
    description='Y:',
)

log_x = widgets.Checkbox(
    value=False,
    description='log x',
    disabled=False,
    indent=False,
    layout=Layout(width="75px"),
)

log_y = widgets.Checkbox(
    value=False,
    description='log y',
    disabled=False,
    indent=False,
    layout=Layout(width="75px"),
)

exclude = widgets.SelectMultiple(
    options=["None"] + list(datasets["name"]),
    value=["None"],
    description='Exclude',
    disabled=False
)

suite = widgets.Select(
    options=["Regression", "Classification", "Classification (old)", "Classification (new)"],
    value="Regression",
    rows=1,
    description='Set:',
)

exclude.options = ["None"] + list(datasets["name"])
exclude.value = ["None"]

def plot(x, y, logx, logy, exclude):
    """Scatter-plot column `x` against column `y`, skipping datasets named in `exclude`."""
    fig, ax = plt.subplots()
    filtered = datasets[~datasets["name"].isin(exclude)]
    # Transform the values directly: ax.set(xscale="log") did not render
    # correctly here (points showed up all over the plot).
    xval = np.log10(filtered[x]) if logx else filtered[x]
    yval = np.log10(filtered[y]) if logy else filtered[y]
    ax = sns.scatterplot(x=xval, y=yval, data=filtered, ax=ax)
    ax.yaxis.labelpad = 25
    ax.xaxis.labelpad = 5
    # Tooltips show the untransformed values; rows with a missing value are skipped.
    labels = [f"{name} ({xi}, {yi})"
              for name, xi, yi in zip(filtered["name"], filtered[x], filtered[y])
              if not (np.isnan(xi) or np.isnan(yi))]
    # The first child of the axes is expected to be the scatter's point collection.
    tooltip = mpld3.plugins.PointLabelTooltip(ax.get_children()[0], labels=labels)
    mpld3.plugins.connect(fig, tooltip)


out = widgets.interactive_output(plot, {'x': x_var, 'y': y_var, 'logx': log_x, 'logy': log_y, 'exclude': exclude})
widgets.HBox([widgets.VBox([exclude, x_var, y_var, widgets.HBox([log_x, log_y])]), out])
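The `plot` function builds its tooltip labels only from rows where both values are present, so that the labels line up with the points that can actually be drawn. A minimal, widget-free sketch of that idea is given below. It uses the standard library instead of pandas and NumPy, and the toy rows and the helper names `tooltip_labels` and `plottable_points` are hypothetical, not part of the notebook.

```python
import math


def tooltip_labels(rows):
    """One label per row whose x and y are both present (not NaN)."""
    return [f"{name} ({x}, {y})" for name, x, y in rows
            if not (math.isnan(x) or math.isnan(y))]


def plottable_points(rows, logx=False, logy=False):
    """(x, y) pairs for rows without NaN, optionally on a log10 scale.

    Values must be positive when a log scale is requested.
    """
    points = []
    for _, x, y in rows:
        if math.isnan(x) or math.isnan(y):
            continue
        points.append((math.log10(x) if logx else x,
                       math.log10(y) if logy else y))
    return points


# Hypothetical toy data: (name, x, y); row "b" has a missing x value.
rows = [("a", 100.0, 10.0), ("b", float("nan"), 5.0), ("c", 1000.0, 20.0)]
labels = tooltip_labels(rows)
points = plottable_points(rows, logx=True)

# Labels keep the untransformed values and match the points one to one.
assert len(labels) == len(points)
print(labels)
```

Applying the same missing-value filter to the labels and to the points is what keeps each tooltip attached to the right point.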


# Potential Issues

## Problem 1: Four housing datasets
This selection contains the following four housing datasets (in addition to a house rent price dataset):

| Dataset  | N (rows) | P (cols)  | Used in  | Source  | openml id |
|---|---|---|---|---|---|
| House_16H  | 22784   | 17  | Balaji  | 1990 Census USA  | 574 |
| Boston  | 506  | 14  | Balaji  | Boston 1970s | 531 |
| house_price_nominal  | 1460  | 80  | AutoGluon  | Iowa 2006-2010  | 42563 |
| king county  |  21613 | 20  | mlr3  | KC (Seattle) 2014-2015  | 42092 |

These datasets cover different periods and regions, so including all of them isn't necessarily problematic. However, with a total suite size of 30-35 datasets, four housing datasets (more than 10% of the suite) would make housing a very dominant category.
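To put a number on that concern, the share of housing datasets can be computed for both ends of the stated suite size. This is only a quick arithmetic sketch: the count of four comes from the table above, the sizes 30 and 35 from the stated range, and `housing_share` is a hypothetical helper name.

```python
# Number of housing datasets listed in the table above.
N_HOUSING = 4


def housing_share(suite_size: int) -> float:
    """Fraction of the suite made up of housing datasets."""
    return N_HOUSING / suite_size


for suite_size in (30, 35):
    print(f"{N_HOUSING}/{suite_size} = {housing_share(suite_size):.1%}")
# prints:
# 4/30 = 13.3%
# 4/35 = 11.4%
```

Counting the house rent price dataset as well would raise the share further.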

## Problem 2: Lack of meta-data
Some datasets lack the meta-data needed to interpret the underlying problem. This is true for both AutoML ChaLearn challenge datasets (Flora and Yolanda). It has also been identified as an issue for [pol](https://www.openml.org/d/201), and the meta-data for [Buzzinsocialmedia_Twitter](https://www.openml.org/d/4549) needs to be translated from French.

## Problem 3: Runtime Prediction Datasets
[MIP-2016-Regression](https://www.openml.org/d/41702) and [SAT-hand-runtime-regression](https://www.openml.org/d/41980) are algorithm selection/runtime prediction datasets. Each dataset features 5-15 algorithms, which are run on 200-300 problems, so 10-fold CV would be a less than optimal evaluation procedure.
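The notebook does not say why plain 10-fold CV is a poor fit. One possible reading, assumed here, is that each row is one (problem, algorithm) pair, so a random split can put rows for the same problem in both training and test folds. Under that assumption, a grouped split keeps all rows of a problem in the same fold. The sketch below is an illustration in plain Python, not a procedure proposed in the notebook; `grouped_folds` and `instance_id` are hypothetical names and the data layout is made up.

```python
from collections import defaultdict


def grouped_folds(groups, n_folds):
    """Assign row indices to folds so rows sharing a group stay together."""
    unique = sorted(set(groups))
    # Deal the groups out round-robin over the folds.
    fold_of_group = {g: i % n_folds for i, g in enumerate(unique)}
    folds = defaultdict(list)
    for row, g in enumerate(groups):
        folds[fold_of_group[g]].append(row)
    return [folds[k] for k in range(n_folds)]


# Hypothetical layout: 6 problems x 2 algorithms = 12 rows,
# where instance_id[i] is the problem that row i belongs to.
instance_id = [p for p in range(6) for _ in range(2)]
folds = grouped_folds(instance_id, n_folds=3)

for k, test_rows in enumerate(folds):
    test_groups = {instance_id[r] for r in test_rows}
    train_groups = set(instance_id) - test_groups
    # No problem appears on both sides of the split.
    assert test_groups.isdisjoint(train_groups)
    print(k, test_rows)
```

scikit-learn's `GroupKFold` implements the same idea and would be the more usual choice in practice.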