# Working with MLTable

In this notebook you will learn how to:

1. Read `mltable` in a job
1. Register an `mltable` as a data asset in Azure Machine Learning
1. Consume registered `mltable` assets in a job

## Prerequisites

You will need to install the following Python package dependencies:

```bash
pip uninstall azure-ai-ml
pip install --pre azure-ai-ml
pip uninstall mltable
pip install --pre mltable
pip install pandas
```
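
After installing, it can help to confirm which versions ended up in the active environment. The snippet below is a minimal sketch using only the standard library; the helper name `installed_versions` is an illustration and not part of any Azure package.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Report the packages this notebook depends on.
for name, version in installed_versions(["azure-ai-ml", "mltable", "pandas"]).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```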

## Connect to Azure Machine Learning Workspace

To connect to a workspace, we need three identifiers: a subscription ID, a resource group, and a workspace name. We pass these to `MLClient` from `azure.ai.ml` to get a handle to the required Azure Machine Learning workspace. This tutorial authenticates with [DefaultAzureCredential](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity.defaultazurecredential?view=azure-python). More advanced connection methods are described in the [azure-identity reference](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity?view=azure-python).
...

In [None]:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Enter details of your AML workspace
subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"
workspace = "<AML_WORKSPACE_NAME>"

# get a handle to the workspace
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)
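
If you would rather not hard-code the identifiers in the notebook, one option is to read them from environment variables and fail early when a value is missing or still a placeholder. This is only a sketch: the helper `workspace_settings` and the environment variable names are assumptions made for illustration, not names that Azure Machine Learning defines.

```python
import os


def workspace_settings(env=None):
    """Collect the workspace identifiers from environment variables.

    The variable names below are hypothetical; use whatever names suit your setup.
    """
    env = os.environ if env is None else env
    variables = {
        "subscription_id": "AZURE_SUBSCRIPTION_ID",
        "resource_group": "AZURE_RESOURCE_GROUP",
        "workspace": "AZUREML_WORKSPACE_NAME",
    }
    settings = {field: env.get(var, "") for field, var in variables.items()}
    # Treat empty values and unreplaced "<PLACEHOLDER>" strings as missing.
    missing = [
        variables[field]
        for field, value in settings.items()
        if not value or value.startswith("<")
    ]
    if missing:
        raise ValueError(f"Missing workspace settings: {', '.join(missing)}")
    return settings
```

The returned values can then be passed to `MLClient` positionally, in the same order as in the cell above.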

## MLTable definition file from local data path

In [None]:
!cat ./sample_data/MLTable

We can load the table described by the `MLTable` file and preview its contents in the notebook, using:

In [None]:
import mltable

tbl = mltable.load("./sample_data")
df = tbl.to_pandas_dataframe()
df.head(5)
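
The contents of `./sample_data/MLTable` are not reproduced in this text. As a rough sketch of what such a file can look like for a delimited text file, the snippet below writes a small MLTable definition into a temporary folder. The data file name `titanic.csv` and the chosen `read_delimited` options are assumptions for illustration and may differ from the actual sample file.

```python
import tempfile
from pathlib import Path

# Illustrative MLTable definition; the data file name is hypothetical.
MLTABLE_SKETCH = """\
paths:
  - file: ./titanic.csv   # hypothetical file name
transformations:
  - read_delimited:
      delimiter: ","
      header: all_files_same_headers
"""


def write_mltable_sketch(folder):
    """Write the sketch to <folder>/MLTable and return the path of the new file."""
    target = Path(folder) / "MLTable"
    target.write_text(MLTABLE_SKETCH, encoding="utf-8")
    return target


example_dir = tempfile.mkdtemp()
mltable_path = write_mltable_sketch(example_dir)
print(mltable_path.read_text(encoding="utf-8"))
```

Loading this folder with `mltable.load` would only succeed if a matching data file were placed next to the `MLTable` file.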

## MLTable definition file from datastore URI path

1. Get the datastore URI path from a registered data asset of type `uri_file`
2. Construct an MLTable YAML file with the datastore URI path and load it into a pandas DataFrame

Get the datastore URI path from a registered data asset of type `uri_file`:

In [None]:
# register a local parquet file as a uri_file asset; the path of the returned asset is its datastore URI
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

my_parquet_data = Data(
    path="./sample_data/data.parquet",
    type=AssetTypes.URI_FILE,
    description="Parquet data",
    name="v2_parquet_urifile",
)

my_parquet_data = ml_client.data.create_or_update(my_parquet_data)
print(my_parquet_data.path)

Construct an MLTable definition file from the datastore URI path (using a parquet file as the example data source) and load it into pandas:

In [None]:
# Helper that writes an MLTable definition for a parquet file and loads it.
# uri_path: long-form datastore URI, in the format
#   azureml://subscriptions/<sub-id>/resourcegroups/<resource-group>/workspaces/<workspace>/datastores/{datastore_name}/paths/{relative_data_path}
# mltable_folder: folder in which to save the MLTable YAML file
def mltable_from_parquet(uri_path, mltable_folder):
    import yaml
    import mltable

    # MLTable YAML content as a dictionary
    mltable_definition = {
        "paths": [{"file": str(uri_path)}],
        "transformations": ["read_parquet"],
    }

    # the definition file must be named "MLTable" inside the target folder
    with open(f"{mltable_folder}/MLTable", "w") as mltable_yaml:
        yaml.dump(mltable_definition, mltable_yaml, default_flow_style=False)

    return mltable.load(mltable_folder)

In [None]:
import tempfile

temp_dir = tempfile.gettempdir()

# get datastore uri path from the registered asset
# save MLTable file to temp folder
mlt = mltable_from_parquet(my_parquet_data.path, temp_dir)
mlt.to_pandas_dataframe()
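
The long-form datastore URI described in the helper's comments has a fixed layout, so its parts can be pulled out with a regular expression. The following is a minimal sketch based only on that layout; `parse_datastore_uri` is a hypothetical helper, and the sample values in the usage line are placeholders.

```python
import re

# Pattern for the long-form datastore URI layout described above.
DATASTORE_URI = re.compile(
    r"^azureml://subscriptions/(?P<subscription_id>[^/]+)"
    r"/resourcegroups/(?P<resource_group>[^/]+)"
    r"/workspaces/(?P<workspace>[^/]+)"
    r"/datastores/(?P<datastore>[^/]+)"
    r"/paths/(?P<relative_path>.+)$",
    re.IGNORECASE,
)


def parse_datastore_uri(uri):
    """Split a long-form datastore URI into its named components."""
    match = DATASTORE_URI.match(uri)
    if match is None:
        raise ValueError(f"Not a long-form datastore URI: {uri!r}")
    return match.groupdict()


parts = parse_datastore_uri(
    "azureml://subscriptions/my-sub/resourcegroups/my-rg/workspaces/my-ws"
    "/datastores/my-datastore/paths/folder/data.parquet"
)
print(parts["datastore"], parts["relative_path"])
```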

## Reading `mltable` in a job

Below we show how you can consume an `mltable` in a job. This job just prints the first 10 records of the table:

```python
# read_mltable.py
import argparse
import mltable
import pandas

parser = argparse.ArgumentParser()
parser.add_argument("--input_data", type=str)
args = parser.parse_args()

tbl = mltable.load(args.input_data)
df = tbl.to_pandas_dataframe()
print(df.head(10))
```
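
For a quick local check of the argument handling in `read_mltable.py`, the parser can be exercised with an explicit argument list. This is a minimal sketch: the folder path is a hypothetical placeholder, and the `mltable.load` call is left out so that the snippet runs without any data.

```python
import argparse


def build_parser():
    """Build the same parser that read_mltable.py uses."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_data", type=str)
    return parser


# Simulate the command line the job would run; the path is a placeholder.
args = build_parser().parse_args(["--input_data", "/tmp/example_mltable_folder"])
print(args.input_data)
```

The script receives the input location as a plain string, which is why it can be handed directly to `mltable.load`.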


In [None]:
from azure.ai.ml import Input, command
from azure.ai.ml.entities import Data, Environment
from azure.ai.ml.constants import AssetTypes

env = Environment(
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
    conda_file="environments/mltable_environment.yaml",
)

my_job_inputs = {"input_data": Input(type=AssetTypes.MLTABLE, path="./sample_data")}

job = command(
    code="./src",  # local path where the code is stored
    command="python read_mltable.py --input_data ${{inputs.input_data}}",
    inputs=my_job_inputs,
    environment=env,
    compute="cpu-cluster",
)

# submit the command job
returned_job = ml_client.jobs.create_or_update(job)
# get a URL for the status of the job
returned_job.studio_url

### Understanding the code

When the job has executed, the log files will contain a printout of the first 10 records of the Titanic sample data. In the cell above, you can see that the inputs to the job were defined using a `dict`:

```python
my_job_inputs = {
    "input_data": Input(
        type=AssetTypes.MLTABLE, 
        path='./sample_data'
    )
}
```

The `Input` class allows you to define data inputs where:

- `type` can be a `uri_file` (a specific file), `uri_folder` (a folder location) or `mltable` (an abstraction over tabular data)
- `path` can be a local path or a cloud path. Azure Machine Learning supports `https://`, `abfss://`, `wasbs://` and `azureml://` URIs. As you saw above, if the path is local but your compute is defined to be in the cloud, Azure Machine Learning will automatically upload the data to cloud storage for you.
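
To make the local versus cloud distinction concrete, the sketch below classifies a `path` by its URI scheme, using the four schemes listed above. The helper `is_cloud_path` is hypothetical and only inspects the string; it does not check that the location exists or is accessible.

```python
from urllib.parse import urlparse

# URI schemes listed above as supported cloud paths.
CLOUD_SCHEMES = {"https", "abfss", "wasbs", "azureml"}


def is_cloud_path(path):
    """Return True if the path uses one of the supported cloud URI schemes."""
    return urlparse(path).scheme.lower() in CLOUD_SCHEMES


for candidate in ["./sample_data", "https://example.com/data/", "azureml://datastores/x/paths/y"]:
    kind = "cloud" if is_cloud_path(candidate) else "local"
    print(f"{candidate} -> {kind}")
```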

`Input` defaults `mode` (how the input is exposed at job runtime) to `InputOutputModes.RO_MOUNT` (read-only mount). In other words, Azure Machine Learning mounts the file or folder on the compute and makes it read-only. By design, you cannot *write* to `Inputs`, only to `JobOutputs`.

#### Accessing data already in the cloud

As mentioned above, the `path` in `Input` supports the `https://`, `abfss://`, `wasbs://` and `azureml://` protocols. Therefore, you can simply change the `path` in the above cell to a cloud-based URI.
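
As an illustration of what such cloud-based URIs look like, the sketch below assembles `wasbs://` and `abfss://` paths from their parts. It assumes the public Azure cloud endpoint suffixes (`blob.core.windows.net` and `dfs.core.windows.net`); the helper names and the account, container, and folder values are hypothetical.

```python
def wasbs_uri(account, container, relative_path):
    """Build a wasbs:// URI for Azure Blob Storage (public cloud endpoint assumed)."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{relative_path.lstrip('/')}"


def abfss_uri(account, filesystem, relative_path):
    """Build an abfss:// URI for Azure Data Lake Storage Gen2 (public cloud endpoint assumed)."""
    return f"abfss://{filesystem}@{account}.dfs.core.windows.net/{relative_path.lstrip('/')}"


# Placeholder names; replace them with your own storage account and container.
print(wasbs_uri("mystorageaccount", "mycontainer", "titanic/"))
print(abfss_uri("mystorageaccount", "myfilesystem", "titanic/"))
```

For an `mltable` input, the resulting string would point at the folder that contains the `MLTable` file.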

## Registering an `mltable` as an asset in Azure Machine Learning

You can register an `mltable` as a data asset in Azure Machine Learning. The benefits of registering data are:

- Easy to share with other members of the team (no need to remember file locations)
- Versioning of the metadata (location, description, etc.)

Below we show an example of versioning the sample data in this repo. The data is uploaded to cloud storage and registered as an asset.

In [None]:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

my_data = Data(
    path="./sample_data",
    type=AssetTypes.MLTABLE,
    description="Titanic Data",
    name="titanic-mltable-example",
)

my_mltable = ml_client.data.create_or_update(my_data)
mltable_version = my_mltable.version
print(mltable_version)

> Note: Although the example above registers a local path, remember that `path` also supports cloud storage (`https`, `abfss`, `wasbs` protocols). To register data that is already in a cloud location, specify the path with any of the supported protocols.

### Consume data assets in an Azure Machine Learning Job

Below we use the previously registered data asset in the job by referring to the long-form ID in the `path`:

```txt
/subscriptions/XXXXX/resourceGroups/XXXXX/providers/Microsoft.MachineLearningServices/workspaces/XXXXX/datasets/titanic-mltable-example/versions/1
```

This long-form ID is accessed using:

```python
registered_data_asset = ml_client.data.get(name="titanic-mltable-example", version=mltable_version)
registered_data_asset.id
```
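
Because the long-form ID ends with the asset name followed by `/versions/<version>`, both values can be recovered from the string when only the ID is at hand. This is a small sketch under that assumption; `asset_name_and_version` is a hypothetical helper, and the sample ID mirrors the placeholder format shown above.

```python
import re

# Matches the trailing "/<name>/versions/<version>" part of a long-form asset ID.
ASSET_ID_TAIL = re.compile(r"/(?P<name>[^/]+)/versions/(?P<version>[^/]+)$")


def asset_name_and_version(asset_id):
    """Return (name, version) taken from the end of a long-form asset ID."""
    match = ASSET_ID_TAIL.search(asset_id)
    if match is None:
        raise ValueError(f"Unexpected asset ID format: {asset_id!r}")
    return match.group("name"), match.group("version")


sample_id = (
    "/subscriptions/XXXXX/resourceGroups/XXXXX/providers/"
    "Microsoft.MachineLearningServices/workspaces/XXXXX/"
    "datasets/titanic-mltable-example/versions/1"
)
print(asset_name_and_version(sample_id))
```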


In [None]:
from azure.ai.ml import Input, command
from azure.ai.ml.entities import Data, Environment
from azure.ai.ml.constants import AssetTypes

env = Environment(
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
    conda_file="environments/mltable_environment.yaml",
)

registered_data_asset = ml_client.data.get(
    name="titanic-mltable-example", version=mltable_version
)

my_job_inputs = {
    "input_data": Input(type=AssetTypes.MLTABLE, path=registered_data_asset.id)
}

job = command(
    code="./src",
    command="python read_mltable.py --input_data ${{inputs.input_data}}",
    inputs=my_job_inputs,
    environment=env,
    compute="cpu-cluster",
)

# submit the command job
returned_job = ml_client.jobs.create_or_update(job)
# get a URL for the status of the job
returned_job.studio_url