# 8. Registering the training dataset as an Azure ML Data asset

Before submitting a cloud run, we must make the local `../data/raw/` folder discoverable by Azure ML.  
We do this by creating a **Data asset** (`AssetTypes.URI_FOLDER`). Once registered, the asset:

* Points to a copy of the folder uploaded to the workspace’s default datastore (Azure Blob / ADLS).  
* Gets a **name + version** so every pipeline can reference an exact snapshot.  
* Can be mounted or downloaded transparently inside training jobs.

The helper below is **idempotent**: if the `name`+`version` pair already exists, it reuses the registered asset; otherwise it uploads the folder and registers it.
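The get-or-create behaviour described above can be illustrated without touching a workspace. The sketch below uses a hypothetical in-memory registry (`InMemoryRegistry` and `get_or_create` are illustrative names, not part of the Azure ML SDK) and counts uploads to show that a second call with the same `name`+`version` does not upload again.

```python
class InMemoryRegistry:
    """Hypothetical stand-in for a workspace asset registry (not an Azure ML class)."""

    def __init__(self):
        self._assets = {}
        self.uploads = 0  # how many times a "folder upload" happened

    def get(self, name, version):
        # Raises KeyError when the name+version pair is unknown.
        return self._assets[(name, version)]

    def create(self, name, version, path):
        self.uploads += 1
        asset = {"name": name, "version": version, "path": path}
        self._assets[(name, version)] = asset
        return asset


def get_or_create(registry, name, version, path):
    """Return the existing asset, or create it if the lookup fails."""
    try:
        return registry.get(name, version)
    except KeyError:
        return registry.create(name, version, path)


registry = InMemoryRegistry()
first = get_or_create(registry, "finally_az_train_4", "1", "../data/raw")
second = get_or_create(registry, "finally_az_train_4", "1", "../data/raw")
```

Calling the function twice leaves `registry.uploads` at 1, which is the property the real helper relies on.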


### Expected keys in `config.yaml`

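```yaml
azure:
  subscription_id: "<subscription-id>"
  resource_group_name: "<resource-group>"
  workspace_name: "<workspace-name>"

train:
  data_asset_name: "finally_az_train_4"
  version: "1"
  description: "Raw training images and COCO annotations"
```
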
In [1]:
import yaml

from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient

from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

## 1) Load settings from config.yaml

In [2]:
# Load configuration from the YAML file
with open("../config.yaml", "r") as file:
    config = yaml.safe_load(file)

In [3]:
subscription_id = config["azure"]["subscription_id"]
resource_group_name = config["azure"]["resource_group_name"]
workspace_name = config["azure"]["workspace_name"]

data_asset_name = config["train"]["data_asset_name"]
version = config["train"]["version"]
description = config["train"]["description"]
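The lookups above raise a bare `KeyError` if a key is missing from `config.yaml`. A minimal sketch of an up-front check follows; `REQUIRED_KEYS` and `missing_config_keys` are hypothetical names introduced here, and the key list is simply the set of keys this notebook reads.

```python
# Keys this notebook reads from config.yaml, grouped by section (assumed list).
REQUIRED_KEYS = {
    "azure": ("subscription_id", "resource_group_name", "workspace_name"),
    "train": ("data_asset_name", "version", "description"),
}


def missing_config_keys(config, required=REQUIRED_KEYS):
    """Return dotted names of required keys that are absent or empty."""
    missing = []
    for section, keys in required.items():
        values = config.get(section)
        if not isinstance(values, dict):
            values = {}  # a missing or malformed section counts as all keys missing
        for key in keys:
            if values.get(key) in (None, ""):
                missing.append(f"{section}.{key}")
    return missing
```

A possible use is `assert not missing_config_keys(config)` right after loading the file. Note also that an unquoted `version: 1` is loaded by YAML as an integer, so wrapping it in `str(...)` before passing it on may be a sensible precaution.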

## 2) Authenticate & connect to the workspace

In [4]:
# Initialize DefaultAzureCredential
credential = DefaultAzureCredential()

In [5]:
ml_client = MLClient(
    credential=credential,
    subscription_id=subscription_id,
    resource_group_name=resource_group_name,
    workspace_name=workspace_name,
)

## 3) Register or retrieve the Data asset

In [6]:
from azure.core.exceptions import ResourceNotFoundError


def create_data_asset(ml_client, asset_name, version, description, asset_path, asset_type=AssetTypes.URI_FOLDER):
    """
    Gets or creates a data asset in Azure ML from a local folder path.

    This function first tries to get the data asset with the specified name and version.
    If found, it prints a message and returns the asset. Otherwise, it creates a new
    data asset by registering the local folder (asset_path) with the provided description,
    and returns the newly created asset.

    Parameters:
        ml_client (MLClient): An instance of the Azure ML client.
        asset_name (str): The name of the data asset.
        version (str): The version identifier for the data asset.
        description (str): A short description of the asset.
        asset_path (str): The local path to the data folder to register.
        asset_type (str, optional): The type of asset.
                                    Use AssetTypes.URI_FOLDER for a folder (default)
                                    or AssetTypes.URI_FILE for a single file.

    Returns:
        Data: The registered data asset object.
    """
    try:
        # Reuse the asset if this name+version is already registered
        data_asset = ml_client.data.get(name=asset_name, version=version)
        print(f"Data asset already exists. Name: {asset_name}, version: {version}")
        return data_asset
    except ResourceNotFoundError:
        # Only "not found" falls through to creation; auth or network errors propagate
        pass

    # Describe the asset; the local folder is uploaded when it is registered
    my_data = Data(
        name=asset_name,
        version=version,
        description=description,
        path=asset_path,
        type=asset_type,
    )

    # create_or_update returns the registered asset, so no second lookup is needed
    data_asset = ml_client.data.create_or_update(my_data)
    print(f"Data asset created. Name: {asset_name}, version: {version}")
    return data_asset

In [7]:
# Use the name and version loaded from config.yaml rather than hard-coded values
data_asset = create_data_asset(ml_client, data_asset_name, version,
                               description, asset_path="../data/raw")
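Because registration uploads whatever is under the local path, a pre-flight check run before the call above can catch a wrong or empty folder early. The following is a minimal sketch using only the standard library; `summarize_folder` is a hypothetical helper, not part of the Azure ML SDK.

```python
from pathlib import Path


def summarize_folder(folder):
    """Return (file_count, total_bytes) for all files under folder.

    Raises FileNotFoundError if the folder is missing and ValueError if it
    contains no files.
    """
    root = Path(folder)
    if not root.is_dir():
        raise FileNotFoundError(f"Not a directory: {root}")
    # Walk the tree recursively and collect the size of every regular file.
    sizes = [p.stat().st_size for p in root.rglob("*") if p.is_file()]
    if not sizes:
        raise ValueError(f"No files found under {root}")
    return len(sizes), sum(sizes)
```

For example, `count, size = summarize_folder("../data/raw")` gives a rough idea of how much data the upload will transfer.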

## 4) Persist the remote path back into config.yaml for later steps

In [8]:
config["train"]["data_asset_path"] = data_asset.path

In [9]:
with open("../config.yaml", "w") as f:
    # sort_keys=False keeps the sections in their original order
    yaml.safe_dump(config, f, default_flow_style=False, sort_keys=False)
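Opening `config.yaml` in `"w"` mode truncates it immediately, so an interruption mid-write could leave later steps with a broken file. One possible safeguard, sketched below with the standard library only, is to write to a temporary file in the same directory and then rename it over the target. `atomic_write_text` is a hypothetical helper name.

```python
import os
import tempfile


def atomic_write_text(path, text):
    """Write text to path via a temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(text)
        # os.replace overwrites the target in a single step on the same filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the temp file if writing or renaming failed, then re-raise.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this helper, the cell above could become `atomic_write_text("../config.yaml", yaml.safe_dump(config, default_flow_style=False))`.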

### After the upload

* The local folder is copied to the workspace’s **default datastore** (Azure Blob).  
* Training jobs can now reference the dataset by name and version, for example  
  `inputs: { data: { type: uri_folder, path: "azureml:<data_asset_name>:<version>" } }`  
  in a job YAML, or simply mount it with the SDK.  
* For large datasets consider using `azcopy` first, then registering the *already-uploaded* folder with `path="azureml://datastores/<name>/paths/..."`.
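The two reference styles mentioned in the list above are plain strings, so they can be built from the values already in `config.yaml`. The helpers below are a small sketch; `asset_reference` and `datastore_uri` are hypothetical names, and `workspaceblobstore` in the usage lines is assumed to be the workspace's default datastore name.

```python
def asset_reference(name, version):
    """Build the short reference to a registered data asset: azureml:<name>:<version>."""
    return f"azureml:{name}:{version}"


def datastore_uri(datastore_name, relative_path):
    """Build a datastore URI for a folder or file that is already uploaded."""
    # Drop any leading slash so the path segment is not doubled.
    return f"azureml://datastores/{datastore_name}/paths/{relative_path.lstrip('/')}"


ref = asset_reference("finally_az_train_4", "1")
uri = datastore_uri("workspaceblobstore", "/raw/images/")
```

Either string can then be passed as the `path` argument of `Input(...)` or `Data(...)` from `azure.ai.ml`.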
