# MLTable

In this notebook you will learn how to:

1. Read data in a job
1. Read *and* write data in a job
1. Register data as an asset in Azure Machine Learning
1. Read registered data assets from Azure Machine Learning in a job

## Connect to Azure Machine Learning Workspace

To connect to a workspace, we need three identifiers: a subscription ID, a resource group, and a workspace name. We pass these to `MLClient` from `azure.ml` to get a handle to the required Azure Machine Learning workspace. This tutorial uses [interactive browser authentication](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity.interactivebrowsercredential?view=azure-python); other credential types are described in the [`azure.identity` reference](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity?view=azure-python).

In [1]:
from azure.ml import MLClient
from azure.identity import InteractiveBrowserCredential

#enter details of your AML workspace
subscription_id = ''
resource_group = ''
workspace = ''

#get a handle to the workspace
ml_client = MLClient(InteractiveBrowserCredential(), subscription_id, resource_group, workspace)
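
Rather than hard-coding the three identifiers in the notebook, they can be read from environment variables. The sketch below is one way to do that; the variable names (`AZURE_SUBSCRIPTION_ID`, `AZURE_RESOURCE_GROUP`, `AZUREML_WORKSPACE_NAME`) and the helper `workspace_details_from_env` are assumptions made for illustration, not names the SDK requires.

```python
import os


def workspace_details_from_env(environ=os.environ):
    """Return (subscription_id, resource_group, workspace) read from environment variables.

    The variable names are illustrative assumptions; use whatever names suit your setup.
    """
    names = ("AZURE_SUBSCRIPTION_ID", "AZURE_RESOURCE_GROUP", "AZUREML_WORKSPACE_NAME")
    missing = [name for name in names if not environ.get(name)]
    if missing:
        # Fail early with a clear message instead of passing empty strings on.
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return tuple(environ[name] for name in names)
```

The returned tuple can be unpacked into `subscription_id`, `resource_group` and `workspace` before constructing `MLClient` as in the cell above.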

## MLTable file

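An `MLTable` file is a YAML file that lists the data files to read (`paths`) and the steps needed to turn them into a table (`transformations`). The one shipped with the sample data looks like this: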

In [4]:
!cat ./sample_data/MLTable

paths: 
  - file: ./titanic.csv
transformations: 
  - read_delimited: 
      delimiter: ',' 
      encoding: 'ascii' 
      empty_as_string: false 
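
To make the `read_delimited` options concrete, here is a rough standard-library sketch of what they describe: decode the file with `encoding`, split each line on `delimiter`, and, because `empty_as_string` is `false`, treat empty fields as missing values rather than empty strings. This only illustrates the options under that reading of them; it is not how the `mltable` package is implemented, and the helper name `read_delimited_sketch` is hypothetical.

```python
import csv
import io


def read_delimited_sketch(raw_bytes, delimiter=",", encoding="ascii", empty_as_string=False):
    """Parse delimited bytes into rows, mirroring the three options in the MLTable file."""
    text = raw_bytes.decode(encoding)
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter=delimiter):
        if not empty_as_string:
            # Empty fields become None (missing) instead of "".
            fields = [field if field != "" else None for field in fields]
        rows.append(fields)
    return rows


# One record from the Titanic sample; the Cabin field is empty.
sample = b'1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S\n'
print(read_delimited_sketch(sample))
```

Note that the quoted name keeps its embedded comma, while the empty `Cabin` field is returned as `None`.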

In [2]:
import mltable as mlt

tbl = mlt.load("./sample_data")
df = tbl.to_pandas_dataframe()
df.head(5)

| | Column1 | Column2 | Column3 | Column4 | Column5 | Column6 | Column7 | Column8 | Column9 | Column10 | Column11 | Column12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| 1 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 2 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 4 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S |


## Reading data in a job

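This works the same way as passing a URI (a file or folder) to a job, except that the input type is `MLTable` and the `path` points to the folder that contains the MLTable file.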

In [31]:
from azure.ml.entities import Environment

env = Environment(image="mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04", 
    conda_file="env/mltable_dep.yaml"
)

In [32]:
from azure.ml.entities import Data, JobInput, CommandJob
from azure.ml._constants import AssetTypes

my_job_inputs = {
    "input_data": JobInput(
        type=AssetTypes.MLTABLE, 
        path='./sample_data'
    )
}

job = CommandJob(
    code="./src", #local path where the code is stored
    command='python read_mltable.py --input_data ${{inputs.input_data}}',
    inputs=my_job_inputs,
    environment=env,
    compute="cpu-cluster"
)

#submit the command job
returned_job = ml_client.jobs.create_or_update(job)
#get a URL for the status of the job
returned_job.services["Studio"].endpoint


'https://ml.azure.com/runs/amusing_corn_gxs3xymwyw?wsid=/subscriptions/15ae9cb6-95c1-483d-a0e3-b1a1a3b06324/resourcegroups/kemp-testing-rg/workspaces/mldev&tid=72f988bf-86f1-41af-91ab-2d7cd011db47'

### Understanding the code

When the job has executed, the log files contain a printout of the first 10 records of the Titanic sample data. In the cell above, the inputs to the job were defined using a `dict`:

```python
my_job_inputs = {
    "input_data": JobInput(
        type=AssetTypes.MLTABLE, 
        path='./sample_data'
    )
}
```

The `JobInput` class allows you to define data inputs where:

- `type` can be `uri_file` (a specific file), `uri_folder` (a folder location) or `mltable` (a folder containing an MLTable file, as used in the cell above)
- `path` can be a local path or a cloud path. Azure Machine Learning supports `https://`, `abfss://`, `wasbs://` and `azureml://` URIs. As you saw above, if the path is local but your compute is in the cloud, Azure Machine Learning automatically uploads the data to cloud storage for you.

The `mode` of a `JobInput`, which controls how the input is exposed at job runtime, defaults to `InputOutputModes.RO_MOUNT` (read-only mount). In other words, Azure Machine Learning mounts the file or folder on the compute and makes it read-only. By design, you cannot *write* to a `JobInput`, only to a `JobOutput`, which is covered later in this notebook.
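
Inside the job, `${{inputs.input_data}}` is replaced by the path where the input has been mounted, so the script only needs to parse one argument. The contents of `src/read_mltable.py` are not shown in this notebook; the following is a minimal sketch of what such a script could look like, reusing the `mlt.load` and `to_pandas_dataframe` calls from earlier. The function names `parse_args` and `main` are illustrative.

```python
import argparse


def parse_args(argv=None):
    """Parse the single --input_data argument passed by the job command."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--input_data",
        type=str,
        required=True,
        help="Folder containing the MLTable file (mounted read-only by the job)",
    )
    return parser.parse_args(argv)


def main(argv=None):
    # Imported here so argument parsing can be exercised without mltable installed.
    import mltable as mlt

    args = parse_args(argv)
    tbl = mlt.load(args.input_data)
    df = tbl.to_pandas_dataframe()
    # Print the first 10 records; this is what shows up in the job logs.
    print(df.head(10))
```

In a real script, `main()` would be called at the bottom of the file under an `if __name__ == "__main__":` guard.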

#### Accessing data already in the cloud

As mentioned above, the `path` in JobInput supports `https://`, `abfss://`, `wasbs://` and `azureml://` protocols. Therefore, you can simply change the `path` in the above cell to a cloud-based URI.
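
A quick way to tell whether a `path` points at the cloud or at a local location is to look at its URI scheme. The helper below is a hypothetical illustration (it is not part of the SDK) that checks a path against the four schemes listed above; the example URIs use placeholder account, container and file system names.

```python
from urllib.parse import urlparse

# Schemes named in this notebook as supported cloud locations.
CLOUD_SCHEMES = {"https", "abfss", "wasbs", "azureml"}


def is_cloud_path(path):
    """Return True if path uses one of the supported cloud URI schemes."""
    return urlparse(path).scheme.lower() in CLOUD_SCHEMES


examples = [
    "./sample_data",
    "https://myaccount.blob.core.windows.net/mycontainer/sample_data/",
    "wasbs://mycontainer@myaccount.blob.core.windows.net/sample_data/",
    "abfss://myfilesystem@myaccount.dfs.core.windows.net/sample_data/",
]
for path in examples:
    print(path, "->", "cloud" if is_cloud_path(path) else "local")
```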

## Reading *and* writing data in a job

As noted earlier, you cannot *write* to a `JobInput`, only to a `JobOutput`. Suppose you want to read some data, process it, and then write the processed data back to the cloud. The cell below builds the URI of the workspace's default Azure Machine Learning datastore:

In [None]:
default_dstor = ml_client.datastores.get_default()
output_path = (
    f"{default_dstor.protocol}://{default_dstor.account_name}"
    f".blob.{default_dstor.endpoint}/{default_dstor.container_name}"
)

print(output_path)
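
To see what the `output_path` expression above produces without connecting to a workspace, the sketch below rebuilds it against a stand-in object. `FakeDatastore` and all of its field values are invented for illustration; only the four attribute names come from the cell above.

```python
from dataclasses import dataclass


@dataclass
class FakeDatastore:
    """Stand-in exposing the four attributes used in the cell above (values are made up)."""
    protocol: str = "https"
    account_name: str = "mystorageaccount"
    endpoint: str = "core.windows.net"
    container_name: str = "mycontainer"


def build_output_path(dstor):
    """Assemble <protocol>://<account>.blob.<endpoint>/<container>."""
    return (
        f"{dstor.protocol}://{dstor.account_name}"
        f".blob.{dstor.endpoint}/{dstor.container_name}"
    )


print(build_output_path(FakeDatastore()))
```

With a real workspace, the values come from `ml_client.datastores.get_default()` instead.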

The code below creates a `JobOutput` that mounts your default Azure Machine Learning storage (Azure Blob) in read-*write* mode. The Python script simply reads the CSV input and writes it back out as a Parquet file:

```python
df = pd.read_csv(args.input_data)
output_path = os.path.join(args.output_folder, "my_output.parquet")
df.to_parquet(output_path)
```

In [None]:
from azure.ml.entities import Data, UriReference, JobInput, CommandJob, JobOutput
from azure.ml._constants import AssetTypes

my_job_inputs = {
    "input_data": JobInput(
        type=AssetTypes.URI_FILE, 
        path='./sample_data/titanic.csv'
    )
}

my_job_outputs = {
    "output_folder": JobOutput(
        type=AssetTypes.URI_FOLDER, 
        path=output_path
    )
}

job = CommandJob(
    code="./src", #local path where the code is stored
    command='python read_write_data.py --input_data ${{inputs.input_data}} --output_folder ${{outputs.output_folder}}',
    inputs=my_job_inputs,
    outputs=my_job_outputs,
    environment="AzureML-sklearn-0.24-ubuntu18.04-py37-cpu:9",
    compute="cpu-cluster"
)

#submit the command job
returned_job = ml_client.jobs.create_or_update(job)
#get a URL for the status of the job
returned_job.services["Studio"].endpoint


## Registering data as an asset in Azure Machine Learning

You can register data as an asset in Azure Machine Learning. The benefits of registering data are:

- Easy to share with other members of the team (no need to remember file locations)
- Versioning of the metadata (location, description, etc.)
- Lineage tracking

Below we show an example of versioning the sample data in this repo. The data is uploaded to cloud storage and registered as an asset.

In [None]:
from azure.ml.entities import Data
from azure.ml._constants import AssetTypes

my_data = Data(
    path="./sample_data/titanic.csv",
    type=AssetTypes.URI_FILE,
    description="Titanic Data",
    name="titanic",
    version='1'
)

ml_client.data.create_or_update(my_data)

> Note: The example above registers a local file, but `path` also supports cloud storage (`https`, `abfss` and `wasbs` protocols). To register data that already lives in a cloud location, specify the path using any of the supported protocols.

### Consume data assets in an Azure Machine Learning Job

Below we use the previously registered data asset in a job by referring to its long-form ID in the `path`:

```txt
/subscriptions/XXXXX/resourceGroups/XXXXX/providers/Microsoft.MachineLearningServices/workspaces/XXXXX/datasets/titanic/versions/1
```

The long-form ID can be retrieved with:

```python
registered_data_asset = ml_client.data.get(name='titanic', version='1')
registered_data_asset.id
```
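
If you need the individual parts of the long-form ID, for example to log the asset name and version, they can be pulled out with a regular expression. The helper `parse_data_asset_id` below is hypothetical and assumes the ID has exactly the shape shown above.

```python
import re

# Matches the long-form ID layout shown above; segment names are compared case-insensitively.
_ASSET_ID_PATTERN = re.compile(
    r"^/subscriptions/(?P<subscription>[^/]+)"
    r"/resourceGroups/(?P<resource_group>[^/]+)"
    r"/providers/Microsoft\.MachineLearningServices"
    r"/workspaces/(?P<workspace>[^/]+)"
    r"/datasets/(?P<name>[^/]+)"
    r"/versions/(?P<version>[^/]+)$",
    re.IGNORECASE,
)


def parse_data_asset_id(asset_id):
    """Split a long-form data asset ID into a dict of its components."""
    match = _ASSET_ID_PATTERN.match(asset_id)
    if match is None:
        raise ValueError(f"Unrecognised data asset ID: {asset_id}")
    return match.groupdict()


example_id = (
    "/subscriptions/XXXXX/resourceGroups/XXXXX/providers/"
    "Microsoft.MachineLearningServices/workspaces/XXXXX/datasets/titanic/versions/1"
)
print(parse_data_asset_id(example_id))
```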


In [None]:
from azure.ml.entities import Data, UriReference, JobInput, CommandJob
from azure.ml._constants import AssetTypes

registered_data_asset = ml_client.data.get(name='titanic', version='1')

my_job_inputs = {
    "input_data": JobInput(
        type=AssetTypes.URI_FILE,
        path=registered_data_asset.id
    )
}

job = CommandJob(
    code="./src", 
    command='python read_data_asset.py --input_data ${{inputs.input_data}}',
    inputs=my_job_inputs,
    environment="AzureML-sklearn-0.24-ubuntu18.04-py37-cpu:9",
    compute="cpu-cluster"
)

#submit the command job
returned_job = ml_client.jobs.create_or_update(job)
#get a URL for the status of the job
returned_job.services["Studio"].endpoint