# Working with Data

In this notebook you will learn how to use the AzureML SDK to:

1. Read/write data in a job.
1. Create a data asset to share with others in your team.
1. Abstract schema for tabular data using `MLTable`.

## Connect to Azure Machine Learning Workspace

To connect to a workspace, we need three identifiers: a subscription ID, a resource group, and a workspace name. We pass these to `MLClient` from `azure.ai.ml` to get a handle to the required Azure Machine Learning workspace. This tutorial authenticates with [`DefaultAzureCredential`](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity.defaultazurecredential?view=azure-python). Check the [configuration notebook](../../jobs/configuration.ipynb) for more details on how to configure credentials and connect to a workspace.

In [None]:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# enter details of your AML workspace
subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"
workspace = "<AML_WORKSPACE_NAME>"

# get a handle to the workspace
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

## Reading/writing data in a job

In this example we use the Titanic dataset in this repo ([./sample_data/titanic.csv](./sample_data/titanic.csv)) and set up a command that executes the following Python code:

```python
df = pd.read_csv(args.input_data)
print(df.head(10))
```
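
The two lines above rely on an `args` object that the snippet does not define. As a minimal sketch, and assuming the script lives at `./src/read_data.py` (the name used by the job command in this notebook), the full file might look like the following. The function names `parse_args` and `main` are illustrative choices, not something the repository is known to use.

```python
import argparse


def parse_args(argv=None):
    """Parse the --input_data argument that the job command passes in."""
    parser = argparse.ArgumentParser(description="Print the first rows of a CSV file.")
    parser.add_argument(
        "--input_data",
        type=str,
        required=True,
        help="Path to the CSV file (resolved by AzureML at run time).",
    )
    return parser.parse_args(argv)


def main(argv=None):
    # pandas is imported here so that parse_args can be exercised
    # on a machine where pandas is not installed.
    import pandas as pd

    args = parse_args(argv)
    df = pd.read_csv(args.input_data)
    print(df.head(10))
```

To use this as the job's entry point, end the file with the usual `if __name__ == "__main__": main()` guard.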

Below is the code for submitting the command to the cloud. Note that both the code *and* the data are automatically uploaded to the cloud. On subsequent job submissions, the data is re-uploaded only if it has changed.

In [None]:
from azure.ai.ml import command
from azure.ai.ml.entities import Data
from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes

# === Note on path ===
# `path` can be a local path or a cloud path. AzureML supports `https://`, `abfss://`, `wasbs://` and `azureml://` URIs.
# Local paths are automatically uploaded to the default datastore in the cloud.
# More details on supported paths: https://docs.microsoft.com/azure/machine-learning/how-to-read-write-data-v2#supported-paths

inputs = {
    "input_data": Input(type=AssetTypes.URI_FILE, path="./sample_data/titanic.csv")
}

job = command(
    code="./src",  # local path where the code is stored
    command="python read_data.py --input_data ${{inputs.input_data}}",
    inputs=inputs,
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu:1",
    compute="cpu-cluster",
)

# submit the command
returned_job = ml_client.jobs.create_or_update(job)
# get a URL for the status of the job
returned_job.studio_url
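
The note on `path` in the cell above distinguishes local paths (which are uploaded) from cloud URIs (which are referenced in place). The sketch below illustrates that distinction by checking the URI scheme with the standard library. `is_cloud_path` and `CLOUD_SCHEMES` are hypothetical names introduced for illustration; the SDK performs its own path resolution and does not expose this helper.

```python
from urllib.parse import urlparse

# URI schemes listed in the note above as supported cloud locations.
CLOUD_SCHEMES = {"https", "abfss", "wasbs", "azureml"}


def is_cloud_path(path: str) -> bool:
    """Return True if `path` uses one of the cloud URI schemes, False otherwise."""
    return urlparse(path).scheme in CLOUD_SCHEMES


# A relative local path has no scheme, so it is treated as local.
print(is_cloud_path("./sample_data/titanic.csv"))
# An azureml:// datastore URI is treated as a cloud path.
print(is_cloud_path("azureml://datastores/workspaceblobstore/paths/titanic.csv"))
```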

### Reading *and* writing data in a job

By design, you cannot *write* to `Inputs`, only to `Outputs`. The code below creates an `Output` that will mount your AzureML default datastore (Azure Blob) in Read-*Write* mode. The Python code simply takes the CSV as input and exports it as a parquet file, i.e.

```python
df = pd.read_csv(args.input_data)
output_path = os.path.join(args.output_folder, "my_output.parquet")
df.to_parquet(output_path)
```
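
As with the read-only example, the snippet above assumes an `args` object. A minimal sketch of a complete `./src/read_write_data.py` follows, assuming it accepts the `--input_data` and `--output_folder` arguments used by the job command in this notebook. The helper names are illustrative, and `to_parquet` additionally assumes a parquet engine such as `pyarrow` is available in the job environment.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the input file and output folder passed in by the job command."""
    parser = argparse.ArgumentParser(description="Convert a CSV file to parquet.")
    parser.add_argument("--input_data", type=str, required=True, help="Path to the input CSV file.")
    parser.add_argument("--output_folder", type=str, required=True, help="Folder to write the parquet file to.")
    return parser.parse_args(argv)


def output_file_path(output_folder, filename="my_output.parquet"):
    """Build the full path of the parquet file inside the output folder."""
    return os.path.join(output_folder, filename)


def main(argv=None):
    # pandas is imported here so the helpers above can be exercised without it.
    import pandas as pd

    args = parse_args(argv)
    df = pd.read_csv(args.input_data)
    df.to_parquet(output_file_path(args.output_folder))
```

To use this as the job's entry point, end the file with the usual `if __name__ == "__main__": main()` guard.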

In [None]:
from azure.ai.ml import command
from azure.ai.ml.entities import Data
from azure.ai.ml import Input, Output
from azure.ai.ml.constants import AssetTypes

inputs = {
    "input_data": Input(type=AssetTypes.URI_FILE, path="./sample_data/titanic.csv")
}

outputs = {
    "output_folder": Output(
        type=AssetTypes.URI_FOLDER,
        path=f"azureml://subscriptions/{subscription_id}/resourcegroups/{resource_group}/workspaces/{workspace}/datastores/workspaceblobstore/paths/",
    )
}

job = command(
    code="./src",  # local path where the code is stored
    command="python read_write_data.py --input_data ${{inputs.input_data}} --output_folder ${{outputs.output_folder}}",
    inputs=inputs,
    outputs=outputs,
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu:1",
    compute="cpu-cluster",
)

# submit the command
returned_job = ml_client.jobs.create_or_update(job)
# get a URL for the status of the job
returned_job.studio_url
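
The output `path` in the cell above points at the root of the `workspaceblobstore` datastore. To write to a subfolder instead, the same URI pattern can be extended with a relative path. The sketch below wraps that pattern in a small function. `datastore_uri` is a hypothetical helper written for this notebook, not part of the SDK, and the `outputs/titanic` folder is only an example value.

```python
def datastore_uri(subscription_id, resource_group, workspace,
                  datastore="workspaceblobstore", relative_path=""):
    """Build a long-form azureml:// URI for a location inside a datastore."""
    # Drop any leading slash so the result never contains "paths//".
    relative_path = relative_path.lstrip("/")
    return (
        f"azureml://subscriptions/{subscription_id}"
        f"/resourcegroups/{resource_group}"
        f"/workspaces/{workspace}"
        f"/datastores/{datastore}/paths/{relative_path}"
    )


# Example with placeholder identifiers, mirroring the ones used at the top of the notebook.
print(datastore_uri("<SUBSCRIPTION_ID>", "<RESOURCE_GROUP>", "<AML_WORKSPACE_NAME>",
                    relative_path="outputs/titanic/"))
```

The returned string can be passed as the `path` of an `Output`, in the same way as the f-string in the cell above.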

## Create Data Assets

You can create a data asset in Azure Machine Learning, which has the following benefits:

- Easy to share with other members of the team (no need to remember file locations)
- Versioning of the metadata (location, description, etc)
- Lineage tracking

Below we show an example of versioning the sample data in this repo. The data is uploaded to cloud storage and registered as an asset.

In [None]:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

try:
    registered_data_asset = ml_client.data.get(name="titanic", version="1")
    print("Found data asset. Will not create again")
except Exception as ex:
    my_data = Data(
        path="./sample_data/titanic.csv",
        type=AssetTypes.URI_FILE,
        description="Titanic Data",
        name="titanic",
        version="1",
    )
    ml_client.data.create_or_update(my_data)
    registered_data_asset = ml_client.data.get(name="titanic", version="1")
    print("Created data asset")

> Note: Whilst the above example uses a local file, remember that `path` also supports cloud storage (`https`, `abfss`, `wasbs` protocols). Therefore, if you want to register data that already lives in a cloud location, just specify the path with any of the supported protocols.

### Consume data assets in a job

The following cell shows how to consume a data asset in a job:


In [None]:
from azure.ai.ml import command, Input, Output
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

registered_data_asset = ml_client.data.get(name="titanic", version="1")

my_job_inputs = {
    "input_data": Input(type=AssetTypes.URI_FILE, path=registered_data_asset.id)
}

job = command(
    code="./src",
    command="python read_data.py --input_data ${{inputs.input_data}}",
    inputs=my_job_inputs,
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu:1",
    compute="cpu-cluster",
)

# submit the command
returned_job = ml_client.jobs.create_or_update(job)
# get a URL for the status of the job
returned_job.studio_url

### Authenticate with user identity

When running a job on a compute cluster, you can also use your own user identity to access data. To let the job access data on your behalf, specify `identity=UserIdentityConfiguration()` in the job definition, as shown below. For more details, see [Accessing storage services](https://learn.microsoft.com/azure/machine-learning/how-to-identity-based-service-authentication).

In [None]:
from azure.ai.ml import UserIdentityConfiguration

job = command(
    code="./src",
    command="python read_data.py --input_data ${{inputs.input_data}}",
    inputs=my_job_inputs,
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu:1",
    compute="cpu-cluster",
    identity=UserIdentityConfiguration(),
)

## MLTable

`MLTable` is a way to abstract the schema definition for tabular data so that it is easier for consumers of the data to materialize the table into a Pandas/Dask/Spark dataframe. [A more detailed explanation and motivation is provided on docs.microsoft.com](https://docs.microsoft.com/azure/machine-learning/concept-data#mltable).

The ideal scenarios to use `MLTable` are:

- The schema of your data is complex and/or changes frequently.
- You only need a subset of data (for example: a sample of rows or files, specific columns, etc).
- AutoML jobs requiring tabular data.

If your scenario does not fit the above then it is likely that URIs are a more suitable type.

### The `MLTable` file

The `MLTable` file defines the schema for tabular data. Below is a sample:

In [None]:
! cat ./sample-mltable/MLTable
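
The output of the cell above is not reproduced here. As a rough sketch of what an `MLTable` file for a single CSV file can contain, the code below writes one by hand. The file layout (a `paths` list plus a `read_delimited` transformation) is an assumption about a typical CSV case and may differ from the file in `./sample-mltable`. `write_mltable` and `MLTABLE_TEMPLATE` are hypothetical names introduced for illustration.

```python
from pathlib import Path

# Illustrative MLTable definition for one delimited file stored next to it.
MLTABLE_TEMPLATE = """\
type: mltable
paths:
  - file: ./{data_file}
transformations:
  - read_delimited:
      delimiter: ','
      encoding: ascii
      header: all_files_same_headers
"""


def write_mltable(folder, data_file):
    """Write an MLTable file into `folder` that points at `data_file` in the same folder."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / "MLTable"
    target.write_text(MLTABLE_TEMPLATE.format(data_file=data_file), encoding="utf-8")
    return target
```

For example, `write_mltable("./my-mltable", "titanic.csv")` would create `./my-mltable/MLTable` describing a `titanic.csv` file expected to sit in that same folder.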

We recommend that you co-locate your `MLTable` file with the underlying data (i.e. the `MLTable` file should be in the same directory as the data, or in a parent directory). You can load an `MLTable` artefact using the `mltable` library, as shown below.

In [None]:
import mltable

# Note: the uri below can be a local folder or folder located in cloud storage. The folder must contain a valid MLTable file.
tbl = mltable.load(uri="./sample-mltable")
tbl.to_pandas_dataframe()

### Read an MLTable in a job

#### Create an environment

First, create an environment that contains the `mltable` Python library:

In [None]:
from azure.ai.ml.entities import Environment

env_docker_conda = Environment(
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
    conda_file="env-mltable.yml",
    name="mltable",
    description="Environment created for consuming MLTable.",
)

ml_client.environments.create_or_update(env_docker_conda)

#### Create a job

In [None]:
from azure.ai.ml import command
from azure.ai.ml.entities import Data
from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes

inputs = {"input_data": Input(type=AssetTypes.MLTABLE, path="./sample-mltable")}

job = command(
    code="./src",  # local path where the code is stored
    command="python read_mltable.py --input_data ${{inputs.input_data}}",
    inputs=inputs,
    environment=env_docker_conda,
    compute="cpu-cluster",
)

# submit the command
returned_job = ml_client.jobs.create_or_update(job)
# get a URL for the status of the job
returned_job.studio_url
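
The job above runs `read_mltable.py`, whose contents are not shown in this notebook. A minimal sketch of such a script follows, assuming it mirrors the local `mltable.load` example and accepts the `--input_data` argument from the job command. The function names are illustrative.

```python
import argparse


def parse_args(argv=None):
    """Parse the --input_data argument, which points at a folder containing an MLTable file."""
    parser = argparse.ArgumentParser(description="Load an MLTable and print the first rows.")
    parser.add_argument(
        "--input_data",
        type=str,
        required=True,
        help="Folder that contains a valid MLTable file.",
    )
    return parser.parse_args(argv)


def main(argv=None):
    # mltable is imported here so that parse_args can be exercised
    # on a machine where the library is not installed.
    import mltable

    args = parse_args(argv)
    tbl = mltable.load(uri=args.input_data)
    df = tbl.to_pandas_dataframe()
    print(df.head(10))
```

To use this as the job's entry point, end the file with the usual `if __name__ == "__main__": main()` guard.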