# Load toy datasets

Loads a scikit-learn toy dataset for classification or regression.

The following datasets are available ('name' : description):

    'iris'            : iris dataset (classification)
    'wine'            : wine dataset (classification)
    'breast_cancer'   : breast cancer wisconsin dataset (classification)
    'digits'          : digits dataset (classification)
    'boston'          : boston house-prices dataset (regression)
    'diabetes'        : diabetes dataset (regression)
    'linnerud'        : linnerud dataset (multivariate regression)

Currently, the `iris`, `wine` and `breast_cancer` datasets run through the `sklearn_classifier` function.
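
The names in the list above match the suffixes of the `sklearn.datasets.load_*` loaders, so a name can be resolved to its loader by lookup. The following is only a minimal sketch under that assumption: `get_toy_loader` and `TOY_DATASETS` are hypothetical names, not necessarily how this function is implemented. Note that `load_boston` was removed in scikit-learn 1.2, so `'boston'` may be unavailable depending on the installed version.

```python
from sklearn import datasets

# Names taken from the list above; the tuple itself is an illustrative assumption.
TOY_DATASETS = (
    "iris", "wine", "breast_cancer", "digits", "boston", "diabetes", "linnerud",
)


def get_toy_loader(name: str):
    """Return the sklearn.datasets loader for a toy dataset name (hypothetical helper)."""
    if name not in TOY_DATASETS:
        raise ValueError(f"unknown toy dataset: {name!r}")
    try:
        return getattr(datasets, f"load_{name}")
    except (AttributeError, ImportError) as err:
        # e.g. 'boston' on scikit-learn versions where load_boston no longer exists
        raise ValueError(
            f"dataset {name!r} is not available in this scikit-learn version"
        ) from err


wine = get_toy_loader("wine")()
print(wine["data"].shape, wine["target"].shape)
```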

**TODO**: The digits dataset requires support for images. Options: add one or two lines of code to flatten the image pixel matrix into a feature vector; add a parameter indicating that the inputs are images, together with the final image size fed to the ML algorithm; or create a separate image-preprocessing stage that runs in parallel and feeds the trainer from a queue, with the trainer blocking until the queue starts to fill.
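
The flattening step mentioned in this TODO could look roughly like the sketch below. It assumes the images arrive as a NumPy array of shape `(n_samples, height, width)`; `flatten_images` is a hypothetical helper, and the array here is a synthetic stand-in rather than the real digits data. (scikit-learn's digits bunch already exposes a flattened `data` array next to its `images` array, so a step like this mainly matters for image inputs in general.)

```python
import numpy as np


def flatten_images(images: np.ndarray) -> np.ndarray:
    """Reshape (n_samples, height, width) into (n_samples, height * width)."""
    return images.reshape(len(images), -1)


# Synthetic stand-in for an image array: 5 grayscale 8x8 images.
images = np.arange(5 * 8 * 8, dtype=float).reshape(5, 8, 8)
features = flatten_images(images)
print(features.shape)  # (5, 64)
```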

**TODO**: The regression datasets are available through this function, but a `sklearn_regression` function still needs to be written; it would be almost a copy-paste of `sklearn_classifier`. Alternatively, the training function can be split into two parts, fit and evaluate, where fit is identical for regressors and classifiers and only evaluate differs.
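
One way to picture the fit/evaluate split proposed in this TODO is sketched below. This is not the `sklearn_classifier` code: the function names, the choice of estimators, and the metrics (accuracy for classification, mean squared error for regression) are all assumptions made for illustration.

```python
from sklearn.datasets import load_diabetes, load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import train_test_split


def fit(model, X_train, y_train):
    """Shared step: identical for classifiers and regressors."""
    return model.fit(X_train, y_train)


def evaluate_classifier(model, X_test, y_test):
    """Classification-specific evaluation (metric choice is an assumption)."""
    return {"accuracy": accuracy_score(y_test, model.predict(X_test))}


def evaluate_regressor(model, X_test, y_test):
    """Regression-specific evaluation (metric choice is an assumption)."""
    return {"mse": mean_squared_error(y_test, model.predict(X_test))}


def train(model, X, y, evaluate):
    """Split the data, run the shared fit, then the task-specific evaluate."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    fitted = fit(model, X_train, y_train)
    return evaluate(fitted, X_test, y_test)


X_clf, y_clf = load_iris(return_X_y=True)
clf_scores = train(LogisticRegression(max_iter=1000), X_clf, y_clf, evaluate_classifier)

X_reg, y_reg = load_diabetes(return_X_y=True)
reg_scores = train(LinearRegression(), X_reg, y_reg, evaluate_regressor)

print(clf_scores, reg_scores)
```

With this layout, only the `evaluate` callable changes between the two task types, which is the point the TODO makes.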

The scikit-learn toy dataset functions return a data bunch that includes the following items:<br>
&emsp;{<br>
&emsp;&emsp;'data' : the features matrix,<br>
&emsp;&emsp;'target' : the ground truth labels,<br>
&emsp;&emsp;'DESCR' : a description of the dataset,<br>
&emsp;&emsp;'feature_names' : header for data<br>
&emsp;}<br>

The features (and their names) are stored with the target labels in a DataFrame.
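
As a rough illustration of that layout, the sketch below builds a DataFrame from a bunch's `data`, `feature_names` and `target` items. The helper name `bunch_to_frame` and the label column name `labels` are assumptions for this example; the column name used by the actual function may differ.

```python
import pandas as pd
from sklearn.datasets import load_wine


def bunch_to_frame(bunch, label_column: str = "labels") -> pd.DataFrame:
    """Combine features and target of a toy-dataset bunch into one DataFrame.

    Assumes a single target column (so not the multivariate 'linnerud' case).
    """
    df = pd.DataFrame(bunch["data"], columns=bunch["feature_names"])
    df[label_column] = bunch["target"]
    return df


bunch = load_wine()
df = bunch_to_frame(bunch)
print(df.shape)  # (178, 14): 13 feature columns plus the label column
```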

For further details, see **[Scikit Learn Toy Datasets](https://scikit-learn.org/stable/datasets/index.html#toy-datasets)**.

## mlconfig

In [1]:
from mlrun import mlconf
import os

mlconf.dbpath = mlconf.dbpath or "http://mlrun-api:8080"
mlconf.artifact_path = mlconf.artifact_path or f'{os.environ["HOME"]}/artifacts'

## Save

In [2]:
import yaml

with open("item.yaml") as item_file:
    items = yaml.load(item_file, Loader=yaml.FullLoader)

In [3]:
from mlrun import code_to_function

# create job function object from notebook code
fn = code_to_function(
    name=items["name"],
    kind=items["spec"]["kind"],
    handler=items["spec"]["handler"],
    filename=items["spec"]["filename"],
    image=items["spec"]["image"],
    description=items["description"],
    categories=items["categories"],
    labels=items["labels"],
    requirements=items["spec"]["requirements"],
)

fn.export("load_dataset.yaml")

> 2021-02-17 09:38:25,592 [info] function spec saved to path: load_dataset.yaml


`<mlrun.runtimes.kubejob.KubejobRuntime at 0x7f91b205dad0>`

## Examples

In [4]:
# load function from marketplace
from mlrun import import_function

# vcs_branch = 'development'
# base_vcs = f'https://raw.githubusercontent.com/mlrun/functions/{vcs_branch}/'
# mlconf.hub_url = mlconf.hub_url or base_vcs + f'{name}/function.yaml'
# fn = import_function("hub://load_dataset")

In [5]:
if "V3IO_HOME" in os.environ:
    # V3IO_HOME is set: mount the v3io volume
    from mlrun import mount_v3io

    fn.apply(mount_v3io())
else:
    # if you set up mlrun using the instructions at https://github.com/mlrun/mlrun/blob/master/hack/local/README.md
    from mlrun.platforms import mount_pvc

    fn.apply(mount_pvc("nfsvol", "nfsvol", "/home/joyan/data"))

In [6]:
from mlrun import NewTask

task_params = {"name": "tasks-load-toy-dataset", "params": {"dataset": "wine"}}

### run remotely

In [7]:
run = fn.run(NewTask(**task_params), artifact_path=mlconf.artifact_path)

> 2021-02-17 09:38:25,665 [info] starting run tasks-load-toy-dataset uid=88b40f3c85b7450d8cbed5b86dbf2865 DB=http://mlrun-api:8080
> 2021-02-17 09:38:25,879 [info] Job is running in the background, pod: tasks-load-toy-dataset-vx9sg
> 2021-02-17 09:38:31,248 [info] run executed, status=completed
final state: completed


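| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| default | ...6dbf2865 | 0 | Feb 17 09:38:30 | completed | tasks-load-toy-dataset | v3io_user=admin<br>kind=job<br>owner=admin<br>host=tasks-load-toy-dataset-vx9sg | | dataset=wine | | wine |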


to track results use .show() or .logs() or in CLI: 
!mlrun get run 88b40f3c85b7450d8cbed5b86dbf2865 --project default , !mlrun logs 88b40f3c85b7450d8cbed5b86dbf2865 --project default
> 2021-02-17 09:38:32,085 [info] run executed, status=completed


### or locally

In [8]:
from mlrun import run_local
from load_dataset import load_dataset

In [9]:
for dataset in ["wine", "iris", "breast_cancer"]:
    run_local(
        handler=load_dataset,
        inputs={"dataset": dataset},
        artifact_path=mlconf.artifact_path,
    )

> 2021-02-17 09:38:32,172 [info] starting run mlrun-446035-load_dataset uid=a618556def4b42af9c72d909c54247b8 DB=http://mlrun-api:8080


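| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| default | ...c54247b8 | 0 | Feb 17 09:38:32 | completed | mlrun-446035-load_dataset | v3io_user=admin<br>kind=handler<br>owner=admin<br>host=jupyter-7b854d9bd6-mkmbn | dataset | | | wine |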


to track results use .show() or .logs() or in CLI: 
!mlrun get run a618556def4b42af9c72d909c54247b8 --project default , !mlrun logs a618556def4b42af9c72d909c54247b8 --project default
> 2021-02-17 09:38:33,172 [info] run executed, status=completed
> 2021-02-17 09:38:33,173 [info] starting run mlrun-2c9af6-load_dataset uid=c3bff9ff6d1640d69c77074e51cb2b05 DB=http://mlrun-api:8080


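| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| default | ...51cb2b05 | 0 | Feb 17 09:38:33 | completed | mlrun-2c9af6-load_dataset | v3io_user=admin<br>kind=handler<br>owner=admin<br>host=jupyter-7b854d9bd6-mkmbn | dataset | | | iris |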


to track results use .show() or .logs() or in CLI: 
!mlrun get run c3bff9ff6d1640d69c77074e51cb2b05 --project default , !mlrun logs c3bff9ff6d1640d69c77074e51cb2b05 --project default
> 2021-02-17 09:38:33,428 [info] run executed, status=completed
> 2021-02-17 09:38:33,429 [info] starting run mlrun-4d1ed9-load_dataset uid=5b89549ca1c147ae8b625e8fd5c7a848 DB=http://mlrun-api:8080


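| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| default | ...d5c7a848 | 0 | Feb 17 09:38:33 | completed | mlrun-4d1ed9-load_dataset | v3io_user=admin<br>kind=handler<br>owner=admin<br>host=jupyter-7b854d9bd6-mkmbn | dataset | | | breast_cancer |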


to track results use .show() or .logs() or in CLI: 
!mlrun get run 5b89549ca1c147ae8b625e8fd5c7a848 --project default , !mlrun logs 5b89549ca1c147ae8b625e8fd5c7a848 --project default
> 2021-02-17 09:38:33,887 [info] run executed, status=completed
