# Create Azure Machine Learning Data assets

**Requirements** - In order to benefit from this tutorial, you will need:
- A basic understanding of Machine Learning
- An Azure account with an active subscription. [Create an account for free](https://azure.microsoft.com/free/?WT.mc_id=A261C142F)
- An Azure ML workspace. [Check this notebook for creating a workspace](/sdk/resources/workspace/workspace.ipynb)
- A Python environment
- Installed Azure Machine Learning Python SDK v2 - [install instructions](/sdk/README.md#getting-started)

**Learning Objectives** - By the end of this tutorial, you should be able to:
- Create Azure Machine Learning `Data` assets from the Python SDK for
  - Local files and folders
  - Remote files and folders
- Use a data asset in a `CommandJob`

**Motivations** - Azure Machine Learning `data` assets are references to files or folders in local or remote storage, along with any corresponding metadata. They are not copies of your data. You can use these data assets to access the relevant data during model training, and to mount or download the referenced data to your compute target.

# 1. Connect to Azure Machine Learning Workspace

The [workspace](https://docs.microsoft.com/en-us/azure/machine-learning/concept-workspace) is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. In this section we will connect to the workspace in which the job will be run.

## 1.1. Import the required libraries

In [None]:
#import required libraries
from azure.ml import MLClient
from azure.ml.entities import Data, CommandJob, JobInput
from azure.identity import InteractiveBrowserCredential
from azure.ml._constants import AssetTypes

## 1.2. Configure workspace details and get a handle to the workspace

To connect to a workspace, we need three identifiers: a subscription ID, a resource group, and a workspace name. We pass these details to `MLClient` from `azure.ml` to get a handle to the required Azure Machine Learning workspace. We use the default [interactive authentication](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity.interactivebrowsercredential?view=azure-python) for this tutorial. More advanced connection methods are described in the [azure-identity reference](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity?view=azure-python).

In [None]:
#Enter details of your AML workspace
subscription_id = '<SUBSCRIPTION_ID>'
resource_group = '<RESOURCE_GROUP>'
workspace = '<AML_WORKSPACE_NAME>'

In [None]:
#get a handle to the workspace
ml_client = MLClient(InteractiveBrowserCredential(), subscription_id, resource_group, workspace)
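
As an alternative to typing the three identifiers into the notebook, they can be read from the environment. The following is a minimal sketch, not part of the SDK: the helper `load_workspace_details` and the environment variable names are assumptions chosen for illustration.

```python
import os

# Hypothetical environment variable names; use whatever fits your setup.
ENV_KEYS = {
    "subscription_id": "AZURE_SUBSCRIPTION_ID",
    "resource_group": "AZURE_RESOURCE_GROUP",
    "workspace": "AZUREML_WORKSPACE_NAME",
}


def load_workspace_details(env=os.environ):
    """Return the three workspace identifiers, failing early if any is missing."""
    missing = [var for var in ENV_KEYS.values() if not env.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {field: env[var] for field, var in ENV_KEYS.items()}
```

The returned values can then be passed to `MLClient` in the same order as above: `subscription_id`, `resource_group`, `workspace`.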

# 2. Create data asset
In this section we will create data assets from various locations.

## 2.1 Configuring data asset
The `Data` class lets you configure the following key aspects:
- `name` - Name of the data asset in the workspace
- `type` - The type of data being referred to. Allowed values are `uri_file` and `uri_folder`. The default is `uri_folder`.
- `path` - The path to the file or folder. This can be a local or a remote location. For remote locations, `http`/`https`, `wasb`/`wasbs` and `azureml` paths are supported.
- `version` - Version of the data asset. If omitted, Azure ML will autogenerate a version.
- `description` - Description of the data asset.
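
To make the fields and their defaults concrete, here is a small stand-in written with a plain dataclass. It is only a sketch that mirrors the list above: `DataAssetSpec` is a hypothetical name, not the SDK's `Data` class, and the validation it performs is an assumption about what is worth checking before calling the service.

```python
from dataclasses import dataclass
from typing import Optional

# Allowed values for `type`, as listed above.
ALLOWED_TYPES = ("uri_file", "uri_folder")


@dataclass
class DataAssetSpec:
    """Hypothetical stand-in mirroring the fields above (not the SDK's Data class)."""

    name: str
    path: str
    type: str = "uri_folder"  # the default stated in the list above
    version: Optional[str] = None  # None: leave version generation to the service
    description: str = ""

    def __post_init__(self):
        # Reject anything other than the two documented asset types.
        if self.type not in ALLOWED_TYPES:
            raise ValueError(f"type must be one of {ALLOWED_TYPES}, got {self.type!r}")
```

For example, `DataAssetSpec(name="local-folder-example", path="./data")` leaves `type` at `uri_folder` and `version` unset.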

## 2.2 Create data asset from a local file or folder
Let us use `Data` to create one data asset from a local file and another from a local folder.

In [None]:
# Use a local file
local_dataset = Data(
    type=AssetTypes.URI_FILE,
    path="./data/titanic.csv", 
    name="local-file-example", 
    description="Dataset created from local file.")

ml_client.data.create_or_update(local_dataset)

# Use a local folder
local_folder_dataset = Data(
    type=AssetTypes.URI_FOLDER,
    path="./data",
    name="local-folder-example", 
    description="Dataset created from local folder.")

ml_client.data.create_or_update(local_folder_dataset)
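
For local paths, the asset type can be derived from the filesystem rather than typed by hand, which avoids registering a folder as a file. The helper below is a hedged sketch: `local_asset_type` is a hypothetical name, and it returns the plain strings `uri_file` and `uri_folder` listed in section 2.1 on the assumption that they can be passed as `type`.

```python
from pathlib import Path


def local_asset_type(path):
    """Return 'uri_file' or 'uri_folder' depending on what exists at `path`."""
    p = Path(path)
    if p.is_file():
        return "uri_file"
    if p.is_dir():
        return "uri_folder"
    # Fail before any upload is attempted if the path does not exist.
    raise FileNotFoundError(f"No file or folder at {p}")
```

With the layout used above, `local_asset_type("./data/titanic.csv")` would give `uri_file` and `local_asset_type("./data")` would give `uri_folder`, provided both exist on disk.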

## 2.3 Create data asset from files or folders in the cloud
Let us use `Data` to create data assets from remote locations. Supported remote locations are `http`/`https`, `wasb`/`wasbs` and `azureml` paths.

In [None]:
#Create dataset from a file in the aml workspace
cloud_ds_aml_file = Data(
    path="azureml://datastores/workspaceblobstore/paths/example-data/titanic.csv",
    type=AssetTypes.URI_FILE,
    name="cloud-file-example",
    description="Dataset created from file in cloud."
)
ml_client.data.create_or_update(cloud_ds_aml_file)

#Create dataset from a public file with an https URL
cloud_ds_file = Data(
    type=AssetTypes.URI_FILE,
    path="https://azuremlexamples.blob.core.windows.net/datasets/titanic.csv",
    name="public-file-https-example",
    description="Dataset created from a publicly available file using https URL."
)
ml_client.data.create_or_update(cloud_ds_file)

#Create dataset from a folder in the cloud
cloud_ds_folder = Data(
    type=AssetTypes.URI_FOLDER,
    path="https://mainstorage9c05dabf5c924.blob.core.windows.net/azureml-blobstore-54887b46-3cb0-485b-bb15-62e7b5578ee6/example-data/",
    name="cloud-folder-https-example",
    description="Dataset created from folder in cloud using https URL."
)
ml_client.data.create_or_update(cloud_ds_folder)

#Create a dataset from a file with wasbs URL
cloud_ds_wasbs_file = Data(
    type=AssetTypes.URI_FILE,
    path="wasbs://mainstorage9c05dabf5c924.blob.core.windows.net/azureml-blobstore-54887b46-3cb0-485b-bb15-62e7b5578ee6/example-data/titanic.csv",
    name="cloud-file-wasbs-example",
    description="Dataset created from a file in cloud using wasbs URL."
)
ml_client.data.create_or_update(cloud_ds_wasbs_file)

#Create a dataset from a folder with wasbs URL
cloud_ds_wasbs_folder = Data(
    type=AssetTypes.URI_FOLDER,
    path="wasbs://mainstorage9c05dabf5c924.blob.core.windows.net/azureml-blobstore-54887b46-3cb0-485b-bb15-62e7b5578ee6/example-data/",
    name="cloud-folder-wasbs-example",
    description="Dataset created from folder in cloud using wasbs URL."
)
ml_client.data.create_or_update(cloud_ds_wasbs_folder)
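
The remote examples above differ only in URL scheme and in whether the path points to a file or a folder. A quick check before registering can be sketched with the standard library as follows. `classify_remote` is a hypothetical helper, and treating a trailing slash as "folder" is a heuristic taken from the example paths in this section, not a rule enforced by the service.

```python
from urllib.parse import urlparse

# Schemes named in this section as supported remote locations.
SUPPORTED_SCHEMES = ("http", "https", "wasb", "wasbs", "azureml")


def classify_remote(path):
    """Return (scheme, guessed asset type) for a remote path, or raise ValueError."""
    scheme = urlparse(path).scheme
    if scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"Unsupported or missing scheme in {path!r}")
    # Heuristic only: the folder examples above end with a slash.
    guessed_type = "uri_folder" if path.endswith("/") else "uri_file"
    return scheme, guessed_type
```

Local paths such as `./data` have no scheme and are rejected by this helper; for those, inspect the filesystem instead.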

# 3. Use data asset in a Job
You can now use any of the above data assets in a job (or a pipeline).

To illustrate, let us use the data asset `public-file-https-example` in a `CommandJob`. We will look for a file _titanic.csv_ in the `dataset`, and print out the column names and number of rows in the file.

## 3.1 Configure the CommandJob

In [None]:
#create the command job
job = CommandJob(
    code="./src",  # local path where the code is stored
    command="python main.py --input-dataset ${{inputs.input_dataset}}",
    inputs={"input_dataset": JobInput(type=AssetTypes.URI_FOLDER, path="public-file-https-example:1")},
    #inputs={"input_dataset":JobInput(dataset="public-file-https-example:1")},
    environment="AzureML-sklearn-0.24-ubuntu18.04-py37-cpu:9",
    compute="cpu-cluster",  # replace this with compute in your workspace
    display_name="use-dataset-in-a-job",
)
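
The job runs `main.py` from `./src`, which is not shown in this tutorial. The following is a minimal sketch of what such a script might look like, assuming the input resolves to a local path on the compute target. The function names and the folder fallback to `titanic.csv` are assumptions for illustration; the actual script may differ.

```python
import argparse
import csv
import sys
from pathlib import Path


def resolve_csv(input_path, filename="titanic.csv"):
    """Accept either a CSV file path or a folder that contains `filename`."""
    path = Path(input_path)
    return path / filename if path.is_dir() else path


def summarize_csv(csv_path):
    """Return (column names, number of data rows) for a CSV file with a header row."""
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        columns = next(reader, [])  # first row is the header
        row_count = sum(1 for _ in reader)  # remaining rows are data
    return columns, row_count


def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--input-dataset", required=True)
    args = parser.parse_args(argv)
    columns, row_count = summarize_csv(resolve_csv(args.input_dataset))
    print("Columns:", columns)
    print("Rows:", row_count)


# Only parse the command line when arguments were actually supplied.
if __name__ == "__main__" and len(sys.argv) > 1:
    main()
```

Because `resolve_csv` handles both cases, the same script works whether the input is mounted as a single file or as a folder.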

## 3.2 Run the CommandJob
Using the `MLClient` created earlier, we will now run this CommandJob in the workspace.

In [None]:
#submit the command job
returned_job = ml_client.jobs.create_or_update(job)
#get a URL for the status of the job
returned_job.services["Studio"].endpoint