# Create a new Dataset version

In this notebook we create a new dataset in the Azure Machine Learning (AML) workspace. We use the *diabetes* dataset, which you can download from the following URL:

- https://datahub.io/machine-learning/diabetes/r/diabetes.csv
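
The download can also be scripted. The following is a minimal sketch using only the standard library; the helper name `download_csv` is hypothetical, and the target path `data/diabetes_csv.csv` is an assumption chosen to match the file name read later in this notebook.

```python
import urllib.request
from pathlib import Path

DIABETES_URL = "https://datahub.io/machine-learning/diabetes/r/diabetes.csv"


def download_csv(url: str, dest: str) -> Path:
    """Fetch `url` and save it at `dest`, creating parent folders if needed."""
    target = Path(dest)
    target.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response:
        target.write_bytes(response.read())
    return target


# Usage (needs network access):
# download_csv(DIABETES_URL, "data/diabetes_csv.csv")
```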

Download the dataset and place it in a `data` folder in your workspace. You can load and display the data with the pandas package:


In [2]:
import pandas as pd

# load and display the diabetes dataframe
data = pd.read_csv("data/diabetes_csv.csv")
data.head()

Unnamed: 0,preg,plas,pres,skin,insu,mass,pedi,age,class
0,6,148,72,35,0,33.6,0.627,50,tested_positive
1,1,85,66,29,0,26.6,0.351,31,tested_negative
2,8,183,64,0,0,23.3,0.672,32,tested_positive
3,1,89,66,23,94,28.1,0.167,21,tested_negative
4,0,137,40,35,168,43.1,2.288,33,tested_positive
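
Before uploading the file, it can help to confirm that it has the columns shown above. The following is a minimal sketch of such a check with the standard `csv` module; the helper name `missing_columns` is hypothetical, and the expected column list is simply copied from the table above.

```python
import csv

# Column names as displayed in the preview above.
EXPECTED_COLUMNS = ["preg", "plas", "pres", "skin", "insu", "mass", "pedi", "age", "class"]


def missing_columns(csv_path, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from the CSV header."""
    with open(csv_path, newline="") as handle:
        header = next(csv.reader(handle), [])
    return [name for name in expected if name not in header]


# Usage:
# print(missing_columns("data/diabetes_csv.csv"))  # an empty list means all columns are present
```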


# 1. Create a Dataset with Python

## Connect to the AML Workspace

In [4]:
from azureml.core import Workspace, Datastore, Dataset, Environment

# connect to the workspace with config
# interactive_auth = InteractiveLoginAuthentication(tenant_id="tenant-id")
ws = Workspace.from_config(".azure")

# connect to the workspace with credentials
# ws = Workspace.get(name="myworkspace",
#                    subscription_id='<azure-subscription-id>',
#                    resource_group='myresourcegroup')
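
`Workspace.from_config` reads the workspace coordinates from a JSON file named `config.json`. The following is a minimal sketch of creating such a file; the helper name `write_workspace_config` is hypothetical, the values in the usage comment are the placeholders from the commented-out call above, and placing the file directly inside `.azure` is an assumption about how this notebook is set up.

```python
import json
from pathlib import Path


def write_workspace_config(folder, subscription_id, resource_group, workspace_name):
    """Write a config.json with the three keys the SDK expects."""
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    path = Path(folder) / "config.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=4))
    return path


# Usage (placeholder values, replace with your own):
# write_workspace_config(".azure", "<azure-subscription-id>", "myresourcegroup", "myworkspace")
```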


## Get the default datastore and compute target

In [5]:
datastore = ws.get_default_datastore()
compute_target = ws.compute_targets["cpu-cluster"]

In [6]:
datastore.upload(src_dir="data", target_path='raw/diabetes.csv')

Uploading an estimated of 1 files
Uploading data\diabetes_csv.csv
Uploaded data\diabetes_csv.csv, 1 files out of an estimated total of 1
Uploaded 1 files


$AZUREML_DATAREFERENCE_2b8124e0c1464737a4a215857bcba85e

## Create a Dataset

In [9]:
datastore_paths = [(datastore, 'raw/diabetes.csv')]

diabetes_ds = Dataset.Tabular.from_delimited_files(path=datastore_paths)

In [10]:
diabetes_ds

{
  "source": [
    "('workspaceblobstore', 'raw/diabetes.csv')"
  ],
  "definition": [
    "GetDatastoreFiles",
    "ParseDelimited",
    "DropColumns",
    "SetColumnTypes"
  ]
}

## Convert the dataset to a pandas dataframe

In [12]:
diabetes_df = diabetes_ds.to_pandas_dataframe()
diabetes_df.head()

Unnamed: 0,preg,plas,pres,skin,insu,mass,pedi,age,class
0,6,148,72,35,0,33.6,0.627,50,tested_positive
1,1,85,66,29,0,26.6,0.351,31,tested_negative
2,8,183,64,0,0,23.3,0.672,32,tested_positive
3,1,89,66,23,94,28.1,0.167,21,tested_negative
4,0,137,40,35,168,43.1,2.288,33,tested_positive


## Register the dataset to the AML Workspace

In [14]:
diabetes_ds = diabetes_ds.register(
    workspace=ws,
    name='diabetes',
    description='Diabetes training data',
    create_new_version=True)
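
With `create_new_version=True`, the dataset is registered as a new version under the same name, so later code usually needs to pick either the newest or a specific version. The following is a minimal sketch using `Dataset.get_by_name`; the helper name `load_registered_dataset` is hypothetical, and the SDK import sits inside the function only so that the snippet can be loaded without `azureml-core` installed.

```python
def load_registered_dataset(workspace, name="diabetes", version="latest"):
    """Return a registered dataset; `version` is 'latest' or an integer."""
    from azureml.core import Dataset  # requires the azureml-core package

    return Dataset.get_by_name(workspace, name=name, version=version)


# Usage, with `ws` from the connection cell above:
# newest = load_registered_dataset(ws)
# first = load_registered_dataset(ws, version=1)
# print(newest.name, newest.version)
```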

****

# 2. Define a Pipeline to Update the Dataset Regularly

## Define the Python execution script

With the IPython magic command *%%writefile* we can write the contents of a code cell to a Python script.
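
Outside a notebook, or when the `src` folder does not exist yet, the same result can be obtained with plain Python. The following is a minimal sketch; the helper name `write_script` is hypothetical, and it assumes that the target folder may be missing, which is why it creates the folder first.

```python
from pathlib import Path


def write_script(path: str, source: str) -> Path:
    """Write `source` to `path`, creating the parent folder if it is missing."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(source)
    return target


# Usage:
# write_script("src/data_ingest.py", "print('hello from the ingest script')\n")
```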

In [1]:
%%writefile src/data_ingest.py

from azureml.core import Run, Dataset

# get the workspace from the current run
run = Run.get_context()
ws = run.experiment.workspace


# get the default datastore
datastore = ws.get_default_datastore()

# create a TabularDataset from the uploaded file
datastore_paths = [(datastore, 'raw/diabetes.csv')]
diabetes_ds = Dataset.Tabular.from_delimited_files(path=datastore_paths)
# diabetes_df = diabetes_ds.to_pandas_dataframe()

# register the dataset as a new version under the same name
diabetes_ds = diabetes_ds.register(
    workspace=ws,
    name='diabetes',
    description='Diabetes training data',
    create_new_version=True)


Overwriting src/data_ingest.py


## Create a new Pipeline step

In [16]:
from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import PythonScriptStep
from azureml.pipeline.core.graph import PipelineParameter

# create a pipeline step
data_ingest_step = PythonScriptStep(
    script_name="data_ingest.py",
    source_directory="src",
    # outputs = [diabetes_ds],
    compute_target=compute_target,
    allow_reuse=False)

## Create and run the Pipeline

In [17]:
from azureml.pipeline.core import Pipeline

# create the pipeline
pipeline = Pipeline(workspace=ws, steps=[data_ingest_step])

In [18]:
from azureml.core import Experiment

# Submit the pipeline to be run
pipeline_run = Experiment(ws, 'data_ingest').submit(pipeline)
pipeline_run.wait_for_completion(show_output=True)

Created step data_ingest.py [62a2f86c][c179c6a5-8e1a-4a46-ad56-77c79ccdbc2b], (This step will run and generate new outputs)
Submitted PipelineRun 1a5c0f11-a35f-43d3-a657-f478b03d8349
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/data_ingest/runs/1a5c0f11-a35f-43d3-a657-f478b03d8349?wsid=/subscriptions/3a0172d3-ec0d-46bb-a88a-ff41a302711a/resourcegroups/Evonik/workspaces/AMLWorkspace
PipelineRunId: 1a5c0f11-a35f-43d3-a657-f478b03d8349
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/data_ingest/runs/1a5c0f11-a35f-43d3-a657-f478b03d8349?wsid=/subscriptions/3a0172d3-ec0d-46bb-a88a-ff41a302711a/resourcegroups/Evonik/workspaces/AMLWorkspace
PipelineRun Status: NotStarted
PipelineRun Status: Running
Expected a StepRun object but received <class 'azureml.core.run.Run'> instead.
This usually indicates a package conflict with one of the dependencies of azureml-core or azureml-pipeline-core.
Please check for package conflicts in your python enviro

'Finished'

In [19]:
pipeline_run.publish_pipeline(
     name="Ingest_Pipeline",
     description="Pipeline to create new Dataset with ADF",
     version="1.0")

Name,Id,Status,Endpoint
Ingest_Pipeline,36ddcf2f-5e9f-41ba-afd3-423c4b2bba4a,Active,REST Endpoint
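
The published pipeline can be started from an external tool by sending an authenticated POST request to its REST endpoint. The following is a minimal sketch that only builds such a request with the standard library; the helper name `build_trigger_request` is hypothetical, the endpoint URL and access token in the usage comment are placeholders, and the experiment name `data_ingest` is taken from the run above.

```python
import json
import urllib.request


def build_trigger_request(endpoint: str, token: str, experiment_name: str) -> urllib.request.Request:
    """Build a POST request that asks the published pipeline to start a run."""
    body = json.dumps({"ExperimentName": experiment_name}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Usage (placeholders, not real values):
# request = build_trigger_request("<rest-endpoint-url>", "<aad-access-token>", "data_ingest")
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Building the request separately from sending it keeps the sketch easy to inspect without network access or credentials.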
