<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Environment" data-toc-modified-id="Environment-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Environment</a></span></li><li><span><a href="#Compute-Target-and-container" data-toc-modified-id="Compute-Target-and-container-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Compute Target and container</a></span></li><li><span><a href="#Cleaning-step" data-toc-modified-id="Cleaning-step-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Cleaning step</a></span></li><li><span><a href="#Pipeline" data-toc-modified-id="Pipeline-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Pipeline</a></span></li><li><span><a href="#Run-pipeline" data-toc-modified-id="Run-pipeline-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Run pipeline</a></span></li></ul></div>

In [1]:
import azureml.core
from azureml.core import Dataset, Datastore, Experiment, Workspace
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.runconfig import DEFAULT_CPU_IMAGE, RunConfiguration
from azureml.data.data_reference import DataReference
from azureml.data.datapath import DataPath
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep

# Environment

In [2]:
ws = Workspace.from_config()
def_blob_store = Datastore(ws, "workspaceblobstore")
steps_dir = './pipeline_steps'
cpu_cluster_name = "cpucluster"

# Compute Target and container

In [3]:
cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
cpu_cluster.wait_for_completion(show_output=True)

# Create a new runconfig object
run_amlcompute = RunConfiguration()

# Use the existing cpu_cluster retrieved above
run_amlcompute.target = cpu_cluster

# Enable Docker
run_amlcompute.environment.docker.enabled = True

# Set Docker base image to the default CPU-based image
run_amlcompute.environment.docker.base_image = DEFAULT_CPU_IMAGE

# Use conda_dependencies.yml to create a conda environment in the Docker image for execution
run_amlcompute.environment.python.user_managed_dependencies = False

# Specify the CondaDependencies object and add the necessary packages
pip_packages = ['azureml-dataprep[fuse,pandas]',
                'azureml-core']
conda_packages = ['pandas',
                  'scikit-learn==0.22']
run_amlcompute.environment.python.conda_dependencies = CondaDependencies.create(python_version='3.7.7',
                                                                                pip_packages=pip_packages,
                                                                                conda_packages=conda_packages)

Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


# Cleaning step

In [4]:
dataset_full = Dataset.get_by_name(ws, name="annonces_ds")

clean_ds = PipelineData("dataset_clean",
                        datastore=def_blob_store)

clean_step = PythonScriptStep(
    script_name="clean.py",
    arguments=["--input", dataset_full.name, "--output", clean_ds.name],
    inputs=[dataset_full.as_named_input(dataset_full.name)],
    outputs=[clean_ds],
    compute_target=cpu_cluster,
    runconfig=run_amlcompute,
    source_directory=steps_dir
)
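
For orientation, here is a minimal sketch of how `clean.py` might read the two arguments this step passes. The script itself is not shown in this notebook, so the parser and the helper name `parse_step_args` are assumptions, not the actual implementation. Note that the step passes `dataset_full.name` and `clean_ds.name`, so the script receives the plain strings `annonces_ds` and `dataset_clean` rather than resolved paths.

```python
import argparse


def parse_step_args(argv=None):
    """Parse the arguments that clean_step passes to the script.

    Hypothetical helper: the real clean.py is not shown in the notebook.
    """
    parser = argparse.ArgumentParser(description="Cleaning step (sketch)")
    parser.add_argument("--input", required=True,
                        help="name of the named input dataset")
    parser.add_argument("--output", required=True,
                        help="name of the PipelineData output")
    return parser.parse_args(argv)


# With the arguments list from clean_step, the script would see:
args = parse_step_args(["--input", "annonces_ds", "--output", "dataset_clean"])
print(args.input, args.output)  # annonces_ds dataset_clean
```

Inside a run, the named input would typically be looked up with `Run.get_context().input_datasets[args.input]`, since only the name (not the data) travels on the command line.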

# Pipeline

In [5]:
train_pipeline = Pipeline(workspace=ws, steps=[clean_step])

# Run pipeline

In [6]:
pipeline_run = Experiment(ws, 'PipelineTest').submit(train_pipeline)
pipeline_run.wait_for_completion()

Created step clean.py [9555ef0f][6abd6c82-df8d-4386-a8c3-40e751d55a35], (This step is eligible to reuse a previous run's output)
Submitted PipelineRun e8c776ef-8596-4c5f-8016-5a4822c4a656
Link to Azure Machine Learning studio: https://ml.azure.com/experiments/PipelineTest/runs/e8c776ef-8596-4c5f-8016-5a4822c4a656?wsid=/subscriptions/68bdd703-8837-469c-80bd-bfb35f3b886f/resourcegroups/ProjectGroup2/workspaces/RealEstatePG2
PipelineRunId: e8c776ef-8596-4c5f-8016-5a4822c4a656
Link to Portal: https://ml.azure.com/experiments/PipelineTest/runs/e8c776ef-8596-4c5f-8016-5a4822c4a656?wsid=/subscriptions/68bdd703-8837-469c-80bd-bfb35f3b886f/resourcegroups/ProjectGroup2/workspaces/RealEstatePG2
PipelineRun Status: NotStarted
PipelineRun Status: Running


StepRunId: 7d3b4e37-64e8-488b-b171-99e77f6882e4
Link to Portal: https://ml.azure.com/experiments/PipelineTest/runs/7d3b4e37-64e8-488b-b171-99e77f6882e4?wsid=/subscriptions/68bdd703-8837-469c-80bd-bfb35f3b886f/resourcegroups/ProjectGroup2/workspac


libedit-3.1.20181209 | 188 KB    |            |   0%
libedit-3.1.20181209 | 188 KB    | ########## | 100%

numpy-1.18.1         | 5.2 MB    |            |   0%
numpy-1.18.1         | 5.2 MB    | #######6   |  76%
numpy-1.18.1         | 5.2 MB    | #########2 |  92%
numpy-1.18.1         | 5.2 MB    | ########## | 100%

libgfortran-ng-7.3.0 | 1.7 MB    |            |   0%
libgfortran-ng-7.3.0 | 1.7 MB    | #######9   |  80%
libgfortran-ng-7.3.0 | 1.7 MB    | ########## | 100%

ld_impl_linux-64-2.3 | 616 KB    |            |   0%
ld_impl_linux-64-2.3 | 616 KB    | #########  |  91%
ld_impl_linux-64-2.3 | 616 KB    | ########## | 100%

pip-20.0.2           | 1.0 MB    |            |   0%
pip-20.0.2           | 1.0 MB    | #######9   |  80%
pip-20.0.2           | 1.0 MB    | #########6 |  97%
pip-20.0.2           | 1.0 MB    | ########## | 10


Streaming azureml-logs/55_azureml-execution-tvmps_9ff32a432ad4b1961816ca61d191e830b378461ba630ed66d72b50e2b5675e5b_d.txt
2020-03-30T14:00:24Z Starting output-watcher...
2020-03-30T14:00:24Z IsDedicatedCompute == True, won't poll for Low Pri Preemption
Login Succeeded
Using default tag: latest
latest: Pulling from azureml/azureml_b9d4df30c66ad988b08f2d1b638cec91
a1298f4ce990: Pulling fs layer
04a3282d9c4b: Pulling fs layer
9b0d3db6dc03: Pulling fs layer
8269c605f3f1: Pulling fs layer
6504d449e70c: Pulling fs layer
4e38f320d0d4: Pulling fs layer
b0a763e8ee03: Pulling fs layer
11917a028ca4: Pulling fs layer
a6c378d11cbf: Pulling fs layer
6cc007ad9140: Pulling fs layer
6c1698a608f3: Pulling fs layer
412c27e4d1ba: Pulling fs layer
da54ffda441c: Pulling fs layer
d501dc84689a: Pulling fs layer
7024abdf3949: Pulling fs layer
950a4737a0e4: Pulling fs layer
9533c7360891: Pulling fs layer
6cc007ad9140: Waiting
6c1698a608f3: Waiting
412c27e4d1ba: Waiting
da54ffda441c: Waiting
d501dc84689a: Waitin

ActivityFailedException: ActivityFailedException:
	Message: Activity Failed:
{
    "error": {
        "code": "UserError",
        "message": "User program failed with ValueError: could not convert string to float: ",
        "detailsUri": "https://aka.ms/azureml-known-errors",
        "details": [],
        "debugInfo": {
            "type": "ValueError",
            "message": "could not convert string to float: ",
            "stackTrace": "  File \"/mnt/batch/tasks/shared/LS_root/jobs/realestatepg2/azureml/7d3b4e37-64e8-488b-b171-99e77f6882e4/mounts/workspaceblobstore/azureml/7d3b4e37-64e8-488b-b171-99e77f6882e4/azureml-setup/context_manager_injector.py\", line 127, in execute_with_context\n    runpy.run_path(sys.argv[0], globals(), run_name=\"__main__\")\n  File \"/azureml-envs/azureml_aacecb477b18031a273ac58adf6d7773/lib/python3.7/runpy.py\", line 263, in run_path\n    pkg_name=pkg_name, script_name=fname)\n  File \"/azureml-envs/azureml_aacecb477b18031a273ac58adf6d7773/lib/python3.7/runpy.py\", line 96, in _run_module_code\n    mod_name, mod_spec, pkg_name, script_name)\n  File \"/azureml-envs/azureml_aacecb477b18031a273ac58adf6d7773/lib/python3.7/runpy.py\", line 85, in _run_code\n    exec(code, run_globals)\n  File \"clean.py\", line 29, in <module>\n    df_full['surface'] = df_full['surface'].str.replace(\",\", \".\").astype(float)\n  File \"/azureml-envs/azureml_aacecb477b18031a273ac58adf6d7773/lib/python3.7/site-packages/pandas/core/generic.py\", line 5698, in astype\n    new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors)\n  File \"/azureml-envs/azureml_aacecb477b18031a273ac58adf6d7773/lib/python3.7/site-packages/pandas/core/internals/managers.py\", line 582, in astype\n    return self.apply(\"astype\", dtype=dtype, copy=copy, errors=errors)\n  File \"/azureml-envs/azureml_aacecb477b18031a273ac58adf6d7773/lib/python3.7/site-packages/pandas/core/internals/managers.py\", line 442, in apply\n    applied = getattr(b, f)(**kwargs)\n  File \"/azureml-envs/azureml_aacecb477b18031a273ac58adf6d7773/lib/python3.7/site-packages/pandas/core/internals/blocks.py\", line 625, in astype\n    values = astype_nansafe(vals1d, dtype, copy=True)\n  File \"/azureml-envs/azureml_aacecb477b18031a273ac58adf6d7773/lib/python3.7/site-packages/pandas/core/dtypes/cast.py\", line 897, in astype_nansafe\n    return arr.astype(dtype, copy=True)\n"
        }
    },
    "time": "0001-01-01T00:00:00.000Z"
}
	InnerException None
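
The traceback points at line 29 of `clean.py`, where `.astype(float)` meets a value of `surface` that is not a number. The message shows nothing after the colon, which suggests an empty or whitespace-only string, although the data itself is not shown here. Below is a minimal sketch of a tolerant conversion, assuming such values should become missing (`NaN`) rather than abort the step. `parse_surface` is a hypothetical helper name, not part of the original script.

```python
import math


def parse_surface(raw):
    """Convert a decimal-comma string such as "45,5" to a float.

    Returns NaN for None, empty, or otherwise unparsable values
    (an assumption about how bad rows should be handled).
    """
    if raw is None:
        return math.nan
    text = str(raw).strip().replace(",", ".")
    try:
        return float(text)
    except ValueError:
        return math.nan


samples = ["45,5", " 120 ", "", "n/a"]
cleaned = [parse_surface(s) for s in samples]
print(cleaned)  # [45.5, 120.0, nan, nan]
```

In `clean.py` this could be applied with `df_full['surface'].map(parse_surface)`. A vectorized alternative is `pd.to_numeric(..., errors='coerce')` applied after the comma replacement, which likewise turns unparsable entries into `NaN`.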

In [None]:
dataset_clean = clean_ds.as_dataset().parse_delimited_files()
clean_df = dataset_clean.to_pandas_dataframe()

clean_df.head()