# Work with Data

Data is the foundation on which machine learning models are built. Managing data centrally in the cloud and making it accessible to teams of data scientists, who run experiments and train models on multiple workstations and compute targets, is an important part of any professional data science solution.

In this notebook, you'll explore two Azure Machine Learning objects for working with data: *datastores* and *data assets*.

## Before you start

You'll need the latest version of the **azure-ai-ml** package to run the code in this notebook. Run the cells below to install or upgrade it, and then verify the installed version.

> **Note**:
> If the **azure-ai-ml** package is not installed, run `pip install azure-ai-ml` to install it.

In [1]:
!pip install azure-ai-ml
!pip install mltable
!pip install --upgrade azure-ai-ml mltable pandas




In [3]:
pip show azure-ai-ml

Name: azure-ai-ml
Version: 1.24.0
Summary: Microsoft Azure Machine Learning Client Library for Python
Home-page: https://github.com/Azure/azure-sdk-for-python
Author: Microsoft Corporation
Author-email: azuresdkengsysadmins@microsoft.com
License: MIT License
Location: /anaconda/envs/azureml_py38/lib/python3.10/site-packages
Requires: azure-common, azure-core, azure-mgmt-core, azure-monitor-opentelemetry, azure-storage-blob, azure-storage-file-datalake, azure-storage-file-share, colorama, isodate, jsonschema, marshmallow, msrest, pydash, pyjwt, pyyaml, strictyaml, tqdm, typing-extensions
Required-by: 
Note: you may need to restart the kernel to use updated packages.


## Connect to your workspace

With the required SDK packages installed, now you're ready to connect to your workspace.

To connect to a workspace, you need three identifiers: a subscription ID, a resource group name, and a workspace name. Because you're working on a compute instance managed by Azure Machine Learning, you can use the default values to connect to the workspace.

In [4]:
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
from azure.ai.ml import MLClient

try:
    credential = DefaultAzureCredential()
    # Check whether the credential can obtain a token.
    credential.get_token("https://management.azure.com/.default")
except Exception:
    # Fall back to InteractiveBrowserCredential if DefaultAzureCredential does not work
    credential = InteractiveBrowserCredential()


In [5]:
# Get a handle to workspace
ml_client = MLClient.from_config(credential=credential)

Found the config file in: /config.json
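`MLClient.from_config()` reads the workspace identifiers from a `config.json` file, which a compute instance provides for you. As a rough sketch of what such a file carries, the snippet below writes and reads a stand-in config. The identifier values, the temporary file location, and the `load_workspace_config` helper are assumptions made for illustration; only the three key names follow the workspace config format.

```python
import json
import tempfile
from pathlib import Path

# Placeholder identifiers (hypothetical values, not a real workspace)
example_config = {
    "subscription_id": "00000000-0000-0000-0000-000000000000",
    "resource_group": "my-resource-group",
    "workspace_name": "my-workspace",
}

REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def load_workspace_config(path):
    """Read a workspace config file and check that all three identifiers are present."""
    config = json.loads(Path(path).read_text())
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"config file is missing: {', '.join(missing)}")
    return config


# Write the stand-in config to a temporary folder and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = Path(tmp_dir) / "config.json"
    config_path.write_text(json.dumps(example_config, indent=2))
    workspace_config = load_workspace_config(config_path)

print(workspace_config["workspace_name"])
```

When no config file is available, the same three values can instead be passed to `MLClient` directly through its `subscription_id`, `resource_group_name`, and `workspace_name` parameters.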


## List the datastores

When you create the Azure Machine Learning workspace, an Azure Storage Account is created too. The Storage Account includes Blob and file storage, which are automatically connected to your workspace as **datastores**. You can list all datastores connected to your workspace:

In [7]:
stores = ml_client.datastores.list()
for ds_name in stores:
    print(ds_name.name)

workspaceblobstore
workspacefilestore
workspaceartifactstore
workspaceworkingdirectory


Note the `workspaceblobstore`, which connects to the **azureml-blobstore-...** container you explored earlier. The `workspacefilestore` connects to the **code-...** file share.

## Create a datastore

Whenever you want to connect another Azure storage service to the Azure Machine Learning workspace, you can create a datastore. Note that creating a datastore creates the connection between your workspace and the storage; it doesn't create the storage service itself.

To create a datastore and connect to an existing storage service, you'll need to specify:

- The class that indicates which type of storage service you want to connect to. The example below connects to a Blob storage (`AzureBlobDatastore`).
- `name`: The display name of the datastore in the Azure Machine Learning workspace.
- `description`: Optional description to provide more information about the datastore.
- `account_name`: The name of the Azure Storage Account.
- `container_name`: The name of the container to store blobs in the Azure Storage Account.
- `credentials`: Provide the method of authentication and the credentials to authenticate. The example below uses an account key.

**Important**:
- Replace the value of `account_name` with the name of the Storage Account that was automatically created for you.
- Replace the value of `account_key` with the account key of your Azure Storage Account.

To retrieve the account key, go to the [Azure portal](https://portal.azure.com), open your Storage Account, and on the **Access keys** tab copy the **Key** value for key1 or key2.


In [8]:
from azure.ai.ml.entities import AzureBlobDatastore
from azure.ai.ml.entities import AccountKeyConfiguration

store = AzureBlobDatastore(
    name="blob_training_data",
    description="Blob Storage for training data",
    account_name="mlwdp100labs7786297541",  # replace with your Storage Account name
    container_name="training-data",
    credentials=AccountKeyConfiguration(
        account_key="XXXX-XXXX"  # replace with your account key; never share a real key
    ),
)

ml_client.create_or_update(store)

AzureBlobDatastore({'type': <DatastoreType.AZURE_BLOB: 'AzureBlob'>, 'name': 'blob_training_data', 'description': 'Blob Storage for training data', 'tags': {}, 'properties': {}, 'print_as_yaml': False, 'id': '/subscriptions/cda9116f-5326-4a9b-9407-bc3a4391c27c/resourceGroups/rg-dp100/providers/Microsoft.MachineLearningServices/workspaces/mlw-dp100-labs/datastores/blob_training_data', 'Resource__source_path': '', 'base_path': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/captgt0072/code/Users/captgt007/azure-ml-labs/Labs/03', 'creation_context': None, 'serialize': <msrest.serialization.Serializer object at 0x7fc5af1a27a0>, 'credentials': {'type': 'account_key'}, 'container_name': 'training-data', 'account_name': 'mlwdp100labs7786297541', 'endpoint': 'core.windows.net', 'protocol': 'https'})
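Because an account key grants full access to the Storage Account, you may prefer not to paste it into a notebook cell at all. A minimal sketch of one alternative is to read the key from an environment variable. The variable name `STORAGE_ACCOUNT_KEY` and both helper functions are hypothetical names chosen for this example, not part of the SDK.

```python
import os

# Hypothetical variable name chosen for this sketch
KEY_VARIABLE = "STORAGE_ACCOUNT_KEY"


def get_account_key(variable=KEY_VARIABLE):
    """Return the storage account key from the environment, or fail with a clear message."""
    key = os.environ.get(variable, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {variable} environment variable before creating the datastore."
        )
    return key


def mask(secret, visible=4):
    """Hide all but the last few characters so the key can be printed safely."""
    return "*" * max(len(secret) - visible, 0) + secret[-visible:]
```

The returned value could then be passed as `AccountKeyConfiguration(account_key=get_account_key())`, and `mask()` can be used whenever you want to confirm which key was picked up without revealing it.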

List the datastores again to verify that a new datastore named `blob_training_data` has been created:

In [9]:
stores = ml_client.datastores.list()
for ds_name in stores:
    print(ds_name.name)

blob_training_data
workspaceblobstore
workspacefilestore
workspaceartifactstore
workspaceworkingdirectory


## Create data assets

To point to a specific folder or file in a datastore, you can create data assets. There are three types of data assets:

- `URI_FILE` points to a specific file.
- `URI_FOLDER` points to a specific folder.
- `MLTABLE` points to an MLTable file, which specifies how to read one or more files within a folder.

You'll create all three types of data assets to experience the differences between them.

To create a `URI_FILE` data asset, you have to specify a path that points to a specific file. The path can be a local path or cloud path.

In the example below, you'll create a data asset by referencing a *local* path. To ensure the data is always available when working with the Azure Machine Learning workspace, local files are automatically uploaded to the default datastore. In this case, the `accident.csv` file is uploaded to the **LocalUpload** folder in the **workspaceblobstore** datastore.

To create a data asset from a local file, run the following cell:

In [10]:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

my_path = './data/accident.csv'

my_data = Data(
    path=my_path,
    type=AssetTypes.URI_FILE,
    description="Data asset pointing to a local file, automatically uploaded to the default datastore",
    name="accident-local"
)

ml_client.data.create_or_update(my_data)

Data({'path': 'azureml://subscriptions/cda9116f-5326-4a9b-9407-bc3a4391c27c/resourcegroups/rg-dp100/workspaces/mlw-dp100-labs/datastores/workspaceblobstore/paths/LocalUpload/8f0b4ad0fab8fe4de591ab7a24e502b5/accident.csv', 'skip_validation': False, 'mltable_schema_url': None, 'referenced_uris': None, 'type': 'uri_file', 'is_anonymous': False, 'auto_increment_version': False, 'auto_delete_setting': None, 'name': 'accident-local', 'description': 'Data asset pointing to a local file, automatically uploaded to the default datastore', 'tags': {}, 'properties': {}, 'print_as_yaml': False, 'id': '/subscriptions/cda9116f-5326-4a9b-9407-bc3a4391c27c/resourceGroups/rg-dp100/providers/Microsoft.MachineLearningServices/workspaces/mlw-dp100-labs/data/accident-local/versions/3', 'Resource__source_path': '', 'base_path': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/captgt0072/code/Users/captgt007/azure-ml-labs/Labs/03', 'creation_context': <azure.ai.ml.entities._system_data.SystemData object at 0x

To create a `URI_FOLDER` data asset, you have to specify a path that points to a specific folder. The path can be a local path or cloud path.

In the example below, you'll create a data asset by referencing a *cloud* path. The path doesn't have to exist yet. The folder will be created when data is uploaded to the path.

In [11]:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

datastore_path = 'azureml://datastores/blob_training_data/paths/data-asset-path/'

my_data = Data(
    path=datastore_path,
    type=AssetTypes.URI_FOLDER,
    description="Data asset pointing to data-asset-path folder in datastore",
    name="accident-datastore-path"
)

ml_client.data.create_or_update(my_data)

Data({'path': 'azureml://subscriptions/cda9116f-5326-4a9b-9407-bc3a4391c27c/resourcegroups/rg-dp100/workspaces/mlw-dp100-labs/datastores/blob_training_data/paths/data-asset-path/', 'skip_validation': False, 'mltable_schema_url': None, 'referenced_uris': None, 'type': 'uri_folder', 'is_anonymous': False, 'auto_increment_version': False, 'auto_delete_setting': None, 'name': 'accident-datastore-path', 'description': 'Data asset pointing to data-asset-path folder in datastore', 'tags': {}, 'properties': {}, 'print_as_yaml': False, 'id': '/subscriptions/cda9116f-5326-4a9b-9407-bc3a4391c27c/resourceGroups/rg-dp100/providers/Microsoft.MachineLearningServices/workspaces/mlw-dp100-labs/data/accident-datastore-path/versions/3', 'Resource__source_path': '', 'base_path': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/captgt0072/code/Users/captgt007/azure-ml-labs/Labs/03', 'creation_context': <azure.ai.ml.entities._system_data.SystemData object at 0x7fc5af1a3dc0>, 'serialize': <msrest.serializati
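The cell above passes a short datastore path, while the returned asset shows the fully qualified form that includes the subscription, resource group, and workspace. As an illustration of how the two forms relate, here is a small sketch that builds and parses such paths. The regular expression and both function names are assumptions made for this example, inferred from the two URIs shown in this notebook rather than taken from the SDK.

```python
import re

# Short form: azureml://datastores/<datastore>/paths/<path>
# Long form:  azureml://subscriptions/<id>/resourcegroups/<rg>/workspaces/<ws>/datastores/<datastore>/paths/<path>
DATASTORE_URI = re.compile(
    r"^azureml://"
    r"(?:subscriptions/(?P<subscription>[^/]+)"
    r"/resourcegroups/(?P<resource_group>[^/]+)"
    r"/workspaces/(?P<workspace>[^/]+)/)?"
    r"datastores/(?P<datastore>[^/]+)/paths/(?P<path>.*)$",
    re.IGNORECASE,
)


def parse_datastore_uri(uri):
    """Split a datastore URI into its parts; workspace parts are None for the short form."""
    match = DATASTORE_URI.match(uri)
    if match is None:
        raise ValueError(f"not a datastore URI: {uri}")
    return match.groupdict()


def build_datastore_uri(datastore, path):
    """Build the short form of a datastore URI."""
    return f"azureml://datastores/{datastore}/paths/{path.lstrip('/')}"


parts = parse_datastore_uri("azureml://datastores/blob_training_data/paths/data-asset-path/")
print(parts["datastore"], parts["path"])
```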

To create an `MLTable` data asset, you have to specify a path that points to a folder which contains an MLTable file. The path can be a local path or cloud path.

> **Note**:
> Do **not** rename the `MLTable` file to `MLTable.yaml` or `MLTable.yml`. Azure Machine Learning expects an `MLTable` file.

In the example below, you'll create a data asset by referencing a *local* path which contains an MLTable file and a CSV file.

In [12]:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

local_path = 'data/'

my_data = Data(
    path=local_path,
    type=AssetTypes.MLTABLE,
    description="MLTable pointing to accident.csv in data folder",
    name="accident-table"   
)

ml_client.data.create_or_update(my_data)

Uploading data (0.01 MBs): 100%|██████████| 5001/5001 [00:00<00:00, 220866.74it/s]



Data({'path': 'azureml://subscriptions/cda9116f-5326-4a9b-9407-bc3a4391c27c/resourcegroups/rg-dp100/workspaces/mlw-dp100-labs/datastores/workspaceblobstore/paths/LocalUpload/ea756ad94b1e09efc21fbfc778a26af1/data/', 'skip_validation': False, 'mltable_schema_url': None, 'referenced_uris': ['./accident.csv'], 'type': 'mltable', 'is_anonymous': False, 'auto_increment_version': False, 'auto_delete_setting': None, 'name': 'accident-table', 'description': 'MLTable pointing to accident.csv in data folder', 'tags': {}, 'properties': {}, 'print_as_yaml': False, 'id': '/subscriptions/cda9116f-5326-4a9b-9407-bc3a4391c27c/resourceGroups/rg-dp100/providers/Microsoft.MachineLearningServices/workspaces/mlw-dp100-labs/data/accident-table/versions/3', 'Resource__source_path': '', 'base_path': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/captgt0072/code/Users/captgt007/azure-ml-labs/Labs/03', 'creation_context': <azure.ai.ml.entities._system_data.SystemData object at 0x7fc5af154400>, 'serialize': <ms
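The `MLTable` file in the folder is what tells Azure Machine Learning how to read `accident.csv`. The lab's actual file isn't shown in this notebook, so the contents below are an assumption: a minimal sketch of an MLTable definition for a single comma-delimited CSV with a header row. It is written to a temporary folder so that it doesn't overwrite the file in your `data` folder.

```python
import tempfile
from pathlib import Path

# Assumed contents of a minimal MLTable file for one CSV file with a header row
MLTABLE_TEXT = """\
type: mltable
paths:
  - file: ./accident.csv
transformations:
  - read_delimited:
      delimiter: ','
      encoding: utf8
      header: all_files_same_headers
"""


def write_mltable(folder):
    """Write the definition to a file named exactly 'MLTable' (no extension)."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / "MLTable"
    target.write_text(MLTABLE_TEXT)
    return target


# Use a temporary folder so the real data folder is left untouched
with tempfile.TemporaryDirectory() as tmp_dir:
    mltable_path = write_mltable(Path(tmp_dir) / "data")
    written_name = mltable_path.name
    written_text = mltable_path.read_text()

print(written_name)
```

The relative `./accident.csv` entry matches the `referenced_uris` value in the output above, which is why the CSV file has to sit in the same folder as the `MLTable` file.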

To verify that the new data assets have been created, you can list all data assets in the workspace:

In [18]:
datasets = ml_client.data.list()
for ds in datasets:
    print(f"Name: {ds.name}, Type: {ds.type}")

Name: accident-local, Type: uri_file
Name: accident-datastore-path, Type: uri_folder
Name: accident-table, Type: mltable


## Read data in notebook

Initially, you may want to work with data assets in notebooks, to explore the data and experiment with machine learning models. Any `URI_FILE` or `URI_FOLDER` type data assets are read as you would normally read data. For example, to read a CSV file a data asset points to, you can use the pandas function `read_csv()`. 

An `MLTable` data asset carries its own reading instructions: the **MLTable** file specifies the schema and how to interpret the data. Because of this, you can convert an MLTable data asset directly to a pandas DataFrame.

You'll need the `mltable` library, which you installed at the start of this notebook. Then, you can convert the data asset to a DataFrame and inspect the data.

In [19]:
import mltable

registered_data_asset = ml_client.data.get(name='accident-table', version='3')
tbl = mltable.load(f"azureml:/{registered_data_asset.id}")


# Convert MLTable to Pandas DataFrame
df = tbl.to_pandas_dataframe()

# Display first 5 rows
df.head()

Overriding of current TracerProvider is not allowed
Overriding of current LoggerProvider is not allowed
Overriding of current MeterProvider is not allowed
Attempting to instrument while already instrumented
Attempting to instrument while already instrumented
Attempting to instrument while already instrumented
Attempting to instrument while already instrumented
Attempting to instrument while already instrumented


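| Unnamed: 0 | Age | Gender | Speed_of_Impact | Helmet_Used | Seatbelt_Used | Survived |
|---|---|---|---|---|---|---|
| 0 | 56 | Female | 27.0 | False | False | True |
| 1 | 69 | Female | 46.0 | False | True | True |
| 2 | 46 | Male | 46.0 | True | True | False |
| 3 | 32 | Male | 117.0 | False | True | False |
| 4 | 60 | Female | 40.0 | True | True | False |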
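The cell above pins the data asset to version `3`. Versions are stored as strings, so if you want the newest one you need a numeric comparison rather than a text comparison. The sketch below assumes integer-like versions and uses a made-up list; in the workspace, the versions could presumably be collected from `ml_client.data.list(name='accident-table')`. The `latest_version` helper is a hypothetical name for this example.

```python
def latest_version(versions):
    """Return the highest version, comparing numerically rather than as text."""
    numeric = [version for version in versions if version.isdigit()]
    if not numeric:
        raise ValueError("no integer-like versions found")
    return max(numeric, key=int)


# Made-up version strings for illustration
available = ["1", "2", "3", "10"]
newest = latest_version(available)
print(newest)
```

A plain `max(available)` would compare the strings character by character and pick `"3"` over `"10"`, which is why the helper converts to integers for the comparison.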


## Use data in a job

After using a notebook for experimentation, you can use scripts to train machine learning models. A script can be run as a job, and for each job you can specify inputs and outputs.

You can use either **data assets** or **datastore paths** as inputs or outputs of a job. 

The cells below create the **src** folder and write the **move-data.py** script to it. The script reads the input data with the `read_csv()` function, splits it into training and test sets, and stores each set as CSV files in its output path.

In [20]:
import os

# create a folder for the script files
script_folder = 'src'
os.makedirs(script_folder, exist_ok=True)
print(script_folder, 'folder created')

src folder created


In [22]:
%%writefile src/move-data.py
# Import libraries
import argparse
import pandas as pd
from sklearn.model_selection import train_test_split
from pathlib import Path


# Function to read and split data
def get_data(path):
    df = pd.read_csv(path)

    # Count rows
    row_count = len(df)
    print(f'Analyzing {row_count} rows of data')

    # Keep only the feature and label columns, dropping rows with missing values
    feature_columns = ['Age', 'Gender', 'Speed_of_Impact', 'Helmet_Used', 'Seatbelt_Used']
    df = df[feature_columns + ['Survived']].dropna()

    # Separate features and labels
    X = df[feature_columns]
    y = df['Survived']

    # Split into training & test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
    return X_train, X_test, y_train, y_test

# Main function
def main(args):
    # Read and split data
    X_train, X_test, y_train, y_test = get_data(args.input_data)

    # Ensure output folders exist
    Path(args.output_datastore_train).mkdir(parents=True, exist_ok=True)
    Path(args.output_datastore_test).mkdir(parents=True, exist_ok=True)

    # Save train and test datasets
    X_train.to_csv(Path(args.output_datastore_train) / "train.csv", index=False)
    y_train.to_csv(Path(args.output_datastore_train) / "train_label.csv", index=False, header=True)
    X_test.to_csv(Path(args.output_datastore_test) / "test.csv", index=False)
    y_test.to_csv(Path(args.output_datastore_test) / "test_label.csv", index=False, header=True)

    print(f"Data saved in {args.output_datastore_train} and {args.output_datastore_test}")

# Function to parse arguments
def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_data", type=str, required=True)
    parser.add_argument("--output_datastore_train", type=str, required=True)
    parser.add_argument("--output_datastore_test", type=str, required=True)
    return parser.parse_args()

# Run script
if __name__ == "__main__":
    print("\n" + "*" * 60)
    args = parse_args()
    main(args)
    print("*" * 60 + "\n")



Writing src/move-data.py
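Before submitting the script as a job, it can help to check how it receives its three paths. The sketch below repeats the script's argument parser, with one assumed change: it accepts an explicit argument list so that it can be tried in a notebook cell. The local paths are hypothetical stand-ins for the locations a job would supply.

```python
import argparse


def parse_args(argv=None):
    """Same arguments as move-data.py; argv=None falls back to the real command line."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_data", type=str, required=True)
    parser.add_argument("--output_datastore_train", type=str, required=True)
    parser.add_argument("--output_datastore_test", type=str, required=True)
    return parser.parse_args(argv)


# Hypothetical local paths standing in for the job's input and outputs
args = parse_args([
    "--input_data", "data/accident.csv",
    "--output_datastore_train", "outputs/train",
    "--output_datastore_test", "outputs/test",
])
print(args.input_data, args.output_datastore_train, args.output_datastore_test)
```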


To submit a job that runs the **move-data.py** script, run the cell below. 

The job is configured to use the data asset `accident-local`, which points to the uploaded **accident.csv** file, as input. The outputs are two paths pointing to folders in the new datastore `blob_training_data`: one for the training data and one for the test data.

In [23]:
from azure.ai.ml import Input, Output
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml import command

# configure input and output
my_job_inputs = {
    "local_data": Input(type=AssetTypes.URI_FILE, path="azureml:accident-local:1")
}

my_job_outputs = {
    "train_data": Output(type=AssetTypes.URI_FOLDER, path="azureml://datastores/blob_training_data/paths/train-path"),
    "test_data": Output(type=AssetTypes.URI_FOLDER, path="azureml://datastores/blob_training_data/paths/test-path")
}

# configure job
job = command(
    code="./src",
    command="python move-data.py --input_data ${{inputs.local_data}} --output_datastore_train ${{outputs.train_data}} --output_datastore_test ${{outputs.test_data}}",
    inputs=my_job_inputs,
    outputs=my_job_outputs,
    environment="AzureML-sklearn-0.24-ubuntu18.04-py37-cpu@latest",
    compute="captgt0072",
    display_name="move-accident-data",
    experiment_name="move-accident-data"
)

# submit job
returned_job = ml_client.create_or_update(job)
aml_url = returned_job.studio_url
print("Monitor your job at", aml_url)

Class AutoDeleteSettingSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class AutoDeleteConditionSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class BaseAutoDeleteSettingSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class IntellectualPropertySchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class ProtectionLevelSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class BaseIntellectualPropertySchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Uploading src (0.0 MBs): 100%|████████

Monitor your job at https://ml.azure.com/runs/lucid_street_0qln3g46nq?wsid=/subscriptions/cda9116f-5326-4a9b-9407-bc3a4391c27c/resourcegroups/rg-dp100/workspaces/mlw-dp100-labs&tid=aef6e45c-850f-4f38-a10b-1df3ad33cdb0
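The `${{inputs...}}` and `${{outputs...}}` expressions in the job's command are placeholders that are replaced with concrete locations when the job runs. To make that idea tangible, here is an illustrative sketch of the substitution. It is not how Azure Machine Learning implements it; the `render_command` function and the `/mnt/...` paths are assumptions made for this example.

```python
import re

# Matches ${{inputs.name}} and ${{outputs.name}}
PLACEHOLDER = re.compile(r"\$\{\{\s*(inputs|outputs)\.(\w+)\s*\}\}")


def render_command(template, inputs, outputs):
    """Replace each placeholder with the path registered under that input or output name."""
    values = {"inputs": inputs, "outputs": outputs}

    def substitute(match):
        kind, name = match.group(1), match.group(2)
        if name not in values[kind]:
            raise KeyError(f"no entry named '{name}' in {kind}")
        return values[kind][name]

    return PLACEHOLDER.sub(substitute, template)


template = (
    "python move-data.py --input_data ${{inputs.local_data}} "
    "--output_datastore_train ${{outputs.train_data}} "
    "--output_datastore_test ${{outputs.test_data}}"
)

# Hypothetical resolved locations on the compute target
rendered = render_command(
    template,
    inputs={"local_data": "/mnt/inputs/accident.csv"},
    outputs={"train_data": "/mnt/outputs/train", "test_data": "/mnt/outputs/test"},
)
print(rendered)
```

This is why the script only needs plain path arguments: by the time it starts, each placeholder has already been turned into a location it can read from or write to.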
