# MLTable Quickstart 🚀

In this notebook, you create a Table (`mltable`) of the [NYC Green Taxi Data](https://learn.microsoft.com/azure/open-datasets/dataset-taxi-green?tabs=azureml-opendatasets) from Azure Open Datasets. The data is in Parquet format and covers the years 2008-2021. The data files are stored in the following folder structure in a publicly accessible blob storage account:

```text
/
└── green
    ├── puYear=2008
    │   ├── puMonth=1
    │   │   ├── _committed_2983805876188002631
    │   │   └── part-XXX.snappy.parquet
    │   ├── ... 
    │   └── puMonth=12
    │       ├── _committed_2983805876188002631
    │       └── part-XXX.snappy.parquet
    ├── ...
    └── puYear=2021
        ├── puMonth=1
        │   ├── _committed_2983805876188002631
        │   └── part-XXX.snappy.parquet
        ├── ...
        └── puMonth=12
            ├── _committed_2983805876188002631
            └── part-XXX.snappy.parquet
```

From this data, you want to load the following into a Pandas data frame:

- Only the parquet files for years 2015-19.
- A random sample of the data.
- Correct data (for example, where trip distance is greater than 0).
- Relevant columns.
- New columns - year and month - using the path information (`puYear=X/puMonth=Y`).
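
As a rough illustration of the last item, the year and month can be read straight out of a file path. The sketch below uses only the standard library; the helper name `partition_columns` and the example path are hypothetical, and this is not how `mltable` implements the step.

```python
import re

# Matches the partition segment of a path, e.g. "puYear=2017/puMonth=3".
PARTITION_RE = re.compile(r"puYear=(?P<year>\d{4})/puMonth=(?P<month>\d{1,2})")


def partition_columns(path: str) -> dict:
    """Return the year and month encoded in a partitioned path (hypothetical helper)."""
    match = PARTITION_RE.search(path)
    if match is None:
        raise ValueError(f"no puYear/puMonth partition found in {path!r}")
    return {"year": int(match["year"]), "month": int(match["month"])}


# Example path following the folder structure shown earlier.
example = "green/puYear=2017/puMonth=3/part-XXX.snappy.parquet"
print(partition_columns(example))  # {'year': 2017, 'month': 3}
```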

You could achieve these data loading steps with Pandas code. However, achieving *reproducibility* is difficult because you'd either need to:

1. share code, which means if the schema changes (for example, a column name change) then all users need to update their code, or
1. write an ETL pipeline, which is heavyweight.

Azure ML Tables provide a lightweight mechanism to serialize (save) the data loading steps in an `MLTable` file so that you and your team members can *reproduce* the Pandas data frame. If the schema changes, you update only the `MLTable` file rather than Python data loading code in multiple places.


In [None]:
# ensure you have the dependencies for this notebook installed.
%pip install -r ../mltable-requirements.txt

## Create an MLTable using the Python SDK 🐍

Here you build your data loading steps using the `mltable` Python SDK. The `show()` method allows you to see the effect of the data loading transformation.
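
The glob patterns used in the next cell differ only by year, so the list could also be generated rather than typed out. This is a minimal sketch; the constant `BASE` and the helper `year_patterns` are names introduced here for illustration and are not part of the `mltable` SDK.

```python
# Root of the public NYC Green Taxi dataset used in this notebook.
BASE = "wasbs://nyctlc@azureopendatastorage.blob.core.windows.net/green"


def year_patterns(first_year: int, last_year: int) -> list:
    """Build one glob pattern per year, covering all months (hypothetical helper)."""
    return [
        {"pattern": f"{BASE}/puYear={year}/puMonth=*/*.parquet"}
        for year in range(first_year, last_year + 1)
    ]


paths = year_patterns(2015, 2019)
for entry in paths:
    print(entry["pattern"])
```

The resulting `paths` list has the same shape as the hand-written one, so it could be passed to `mltable.from_parquet_files` in the same way.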

In [None]:
import mltable

# glob the parquet file paths for years 2015-19, all months.
paths = [
    {
        "pattern": "wasbs://nyctlc@azureopendatastorage.blob.core.windows.net/green/puYear=2015/puMonth=*/*.parquet"
    },
    {
        "pattern": "wasbs://nyctlc@azureopendatastorage.blob.core.windows.net/green/puYear=2016/puMonth=*/*.parquet"
    },
    {
        "pattern": "wasbs://nyctlc@azureopendatastorage.blob.core.windows.net/green/puYear=2017/puMonth=*/*.parquet"
    },
    {
        "pattern": "wasbs://nyctlc@azureopendatastorage.blob.core.windows.net/green/puYear=2018/puMonth=*/*.parquet"
    },
    {
        "pattern": "wasbs://nyctlc@azureopendatastorage.blob.core.windows.net/green/puYear=2019/puMonth=*/*.parquet"
    },
]

# create a table from the parquet paths
tbl = mltable.from_parquet_files(paths)

# take a random sample
tbl = tbl.take_random_sample(probability=0.001, seed=735)

# filter trips with a distance > 0
tbl = tbl.filter("col('tripDistance') > 0")

# Drop columns
tbl = tbl.drop_columns(["puLocationId", "doLocationId", "storeAndFwdFlag"])

# Create two new columns - year and month - where the values are taken from the path
tbl = tbl.extract_columns_from_partition_format("/puYear={year}/puMonth={month}")

# print the first 5 records of the table as a check
tbl.show(5)

### 💾 Save data loading steps 

Next, you'll save all your data loading steps into an `MLTable` file. This allows you to *reproduce* your Pandas data frame at a later point in time without having to redefine the data loading steps in your code.

In [None]:
# serialize the above data loading steps into an MLTable file
tbl.save("./nyc_taxi")

#### 👓 View the saved file

To understand what is saved, you can view the saved `MLTable` file by reading it with Python, as in the next cell (on Linux, the `cat` command works too). Notice that all your data loading steps have been serialized into a YAML-based file.

In [None]:
with open("./nyc_taxi/MLTable", "r") as f:
    print(f.read())

## ♻️ Reproduce data loading steps

Now that the data loading steps have been serialized into a file, you can reproduce them at any point in time using the `load()` method. This means you do not need to redefine your data loading steps in code and makes it easier to share with others.
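
Reproducing the same data frame also relies on the fixed `seed` saved with the random sampling step. The idea can be sketched with the standard library, assuming each row is kept independently with the given probability; `bernoulli_sample` is a hypothetical helper and not `mltable`'s sampler.

```python
import random


def bernoulli_sample(rows, probability, seed):
    """Keep each row with the given probability, using a seeded generator."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < probability]


rows = list(range(100_000))  # stand-in for table rows

first = bernoulli_sample(rows, probability=0.001, seed=735)
second = bernoulli_sample(rows, probability=0.001, seed=735)

# The same seed selects the same rows on every run.
print(first == second)  # True
```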

In [None]:
import mltable

# load the previously saved MLTable file
tbl = mltable.load("./nyc_taxi/")

# load the table into a pandas dataframe
# NOTE: The data is large and stored in the East US region, so loading takes several minutes (~7 minutes) if you are in a different region.
df = tbl.to_pandas_dataframe()

In [None]:
# print the head of the data frame
df.head()

In [None]:
# print the shape and column types of the data frame
print(f"Shape: {df.shape}")
print(f"Columns:\n{df.dtypes}")

## Create a data asset to aid sharing and reproducibility 🤝

Your `MLTable` file is currently saved on disk, which makes it hard to share with team members. When you create a *data asset* in AzureML, your MLTable is uploaded to cloud storage and "bookmarked", so your team members can access the MLTable using a friendly name. The data asset is also *versioned*.

In [None]:
subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"
workspace = "<AML_WORKSPACE_NAME>"

In [None]:
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

# connect to the AzureML workspace
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

my_path = "./nyc_taxi"

my_data = Data(
    path=my_path,
    type=AssetTypes.MLTABLE,
    description="A random sample of NYC Green Taxi Data between 2015-19.",
    name="green-quickstart",
)

ml_client.data.create_or_update(my_data)

## Access data asset in an interactive session

Now that your MLTable is stored in the cloud, you and your team members can access it using a friendly name in an interactive session (for example, a notebook).
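
The next cell builds the URI passed to `mltable.load` by prefixing the asset ID with `azureml:/`. Because the ID itself starts with a slash, the result begins with `azureml://`. The sketch below shows only the string handling; the ID is a hypothetical example with placeholder values, not one returned by a real workspace.

```python
# Hypothetical asset ID with placeholder values, in the Azure resource ID format.
asset_id = (
    "/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>"
    "/providers/Microsoft.MachineLearningServices"
    "/workspaces/<AML_WORKSPACE_NAME>/data/green-quickstart/versions/1"
)

# The single slash after "azureml:" combines with the ID's leading slash.
uri = f"azureml:/{asset_id}"
print(uri)
```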

In [None]:
import mltable
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# connect to the AzureML workspace
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

# get the latest version of the data asset
data_asset = next(ml_client.data.list(name="green-quickstart"))

# create a table
tbl = mltable.load(f"azureml:/{data_asset.id}")

tbl.show(5)

# load into pandas
# NOTE: The data is large and stored in the East US region, so loading takes several minutes (~7 minutes) if you are in a different region.
# df = tbl.to_pandas_dataframe()

## Access data asset in a job

You can also access your Table in a job by passing the data asset as an `mltable` input:

In [None]:
from azure.ai.ml import MLClient, command, Input
from azure.ai.ml.entities import Environment
from azure.identity import DefaultAzureCredential

# connect to the AzureML workspace
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

# get the latest version of the data asset
data_asset = next(ml_client.data.list(name="green-quickstart"))

job = command(
    command="python train.py --input ${{inputs.green}}",
    inputs={"green": Input(type="mltable", path=data_asset.id)},
    compute="cpu-cluster",
    environment=Environment(
        image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
        conda_file="./job-env/conda_dependencies.yml",
    ),
    code="./src",
)

ml_client.jobs.create_or_update(job)
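
The job above runs a `train.py` script that is not shown in this notebook. One possible shape for it is sketched below, assuming the job environment has the `mltable` package installed; the function names are hypothetical, and only the `--input` argument comes from the job's command line.

```python
import argparse


def parse_args(argv=None):
    """Parse the --input argument that the job command passes to the script."""
    parser = argparse.ArgumentParser(description="Load an MLTable job input.")
    parser.add_argument("--input", required=True, help="Path to the MLTable input")
    return parser.parse_args(argv)


def load_dataframe(input_path):
    """Materialize the Table at input_path as a Pandas data frame."""
    # Imported here so that argument parsing works even without mltable installed.
    import mltable

    tbl = mltable.load(input_path)
    return tbl.to_pandas_dataframe()


def main(argv=None):
    args = parse_args(argv)
    df = load_dataframe(args.input)
    print(f"Shape: {df.shape}")


# Parsing can be tried locally with an explicit argument list.
args = parse_args(["--input", "./nyc_taxi"])
print(args.input)  # ./nyc_taxi
```

A real script would drop the last two lines and instead call `main()` under the usual `if __name__ == "__main__":` guard.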