### Work with Data in Azure Machine Learning


#### Datastore

_Datastores_ are abstractions for cloud data sources: they encapsulate the information required to connect to a data source.

AzureML supports the creation of datastores for multiple kinds of Azure data source, including:
- Azure Storage
- Azure Data Lake
- Azure SQL Database
- Azure Databricks file system

By default, each workspace contains two datastores: an Azure Storage blob container and an Azure Storage file share.

In [35]:
from azureml.core import Workspace, Datastore

ws = Workspace.from_config()

In [11]:
# Print the default datastore of the workspace.
default_ds = ws.get_default_datastore()

for ds_name in ws.datastores:
    if ds_name == default_ds.name:
        print(f'Default Data Storage: {default_ds}')

Default Data Storage: {
  "name": "workspaceblobstore",
  "container_name": "azureml-blobstore-b0c040f4-cad9-4180-843c-f18afbed9fe7",
  "account_name": "amlworksstorage59918767d",
  "protocol": "https",
  "endpoint": "core.windows.net"
}
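
The loop above prints only the default datastore. A minimal sketch of a helper that lists every datastore name and flags the default one could look like this. The names `list_datastores` and `print_datastores` are hypothetical; the sketch relies only on `ws.datastores` and `ws.get_default_datastore()`, as used in the cell above.

```python
def list_datastores(ws):
    """Return (name, is_default) pairs for every datastore in the workspace.

    `ws` is expected to be an azureml.core.Workspace; only `ws.datastores`
    and `ws.get_default_datastore()` are used.
    """
    default_name = ws.get_default_datastore().name
    return [(name, name == default_name) for name in ws.datastores]


def print_datastores(ws):
    """Print one datastore name per line, marking the default one."""
    for name, is_default in list_datastores(ws):
        marker = " (default)" if is_default else ""
        print(f"{name}{marker}")
```

Calling `print_datastores(ws)` in the notebook would then show all registered datastores rather than only the default.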


In [6]:
# Retrieve a datastore.
blob_store = Datastore.get(ws, datastore_name='workspaceblobstore')

In [None]:
# Retrieve the default datastore (workspaceblobstore) from the workspace.
blob_store = ws.get_default_datastore()

In [None]:
# Set the default datastore.
# ws.set_default_datastore('blob_data')

In [13]:
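# Register an existing blob container as a datastore. A real call also needs
# the storage account name and a credential (account key or SAS token).
#blob_ds = Datastore.register_azure_blob_container(workspace = ws,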
#                                                  datastore_name = 'blob_data',
#                                                  container_name = 'data_container')

In [15]:
# Upload a local directory to the datastore.
from azureml.data.datapath import DataPath
from azureml.core import Dataset

Dataset.File.upload_directory(src_dir = 'Script/data',
                              target = DataPath(default_ds, 'data'))

Validating arguments.
Arguments validated.
Uploading file to data
Uploading an estimated of 1 files
Uploading Script/data\diabetes.csv
Uploaded Script/data\diabetes.csv, 1 files out of an estimated total of 1
Uploaded 1 files
Creating new dataset


{
  "source": [
    "('workspaceblobstore', '/data')"
  ],
  "definition": [
    "GetDatastoreFiles"
  ]
}

#### Dataset

Datasets are _versioned_, packaged data objects that can be easily consumed in experiments and pipelines.

Datasets are typically based on files in a datastore, though they can also be based on URLs and other sources. The following types of dataset can be created:
- _Tabular_: the data is read as a table.
- _File_: the data is exposed as a list of file paths.

A dataset must first be created and then registered in the workspace before it can be used in experiments.

#### Create a dataset object

In [17]:
from azureml.core import Dataset

# Create a tabular dataset from files in the datastore.
blob_ds = ws.get_default_datastore()
csv_paths = [(blob_ds, 'data/diabetes.csv')]
tab_ds = Dataset.Tabular.from_delimited_files(path = csv_paths)

tab_ds.take(20).to_pandas_dataframe()

Unnamed: 0,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic
0,1354778,0,171,80,34,23,43.509726,1.213191,21,0
1,1147438,8,92,93,47,36,21.240576,0.158365,23,0
2,1640031,7,115,47,52,35,41.511523,0.079019,23,0
3,1883350,9,103,78,25,304,29.582192,1.28287,43,1
4,1424119,1,85,59,27,35,42.604536,0.549542,22,0
5,1619297,0,82,92,9,253,19.72416,0.103424,26,0
6,1660149,0,133,47,19,227,21.941357,0.17416,21,0
7,1458769,0,67,87,43,36,18.277723,0.236165,26,0
8,1201647,8,80,95,33,24,26.624929,0.443947,53,1
9,1403912,1,72,31,40,42,36.889576,0.103944,26,0


In [19]:
# Create a file dataset from files in the datastore.

file_ds = Dataset.File.from_files(path=(blob_ds, 'data/*.csv'))
for file_path in file_ds.to_path():
    print(file_path)

/diabetes.csv


#### Register datasets

In [30]:
# Register the Tabular dataset.

try:
    tab_ds = tab_ds.register(workspace = ws,
                             name = 'diabetes table',
                             description = 'Diabetes data',
                             tags = {'format': 'CSV'},
                             create_new_version = True)
except Exception as ex:
    print(ex)

In [31]:
# Register the file dataset.

try:
    file_ds = file_ds.register(workspace = ws,
                               name = 'diabetes file data',
                               description = 'Diabetes data',
                               tags = {'format': 'CSV'},
                               create_new_version = True)
except Exception as ex:
    print(ex)

In [32]:
print("Datasets:")
for dataset_name in list(ws.datasets.keys()):
    dataset = Dataset.get_by_name(ws, dataset_name)
    print("\t", dataset.name, 'version', dataset.version)

Datasets:
	 diabetes file data version 1
	 diabetes table version 1


Versioning is important for machine learning because it lets you register a new version of a dataset without removing the previous one. Otherwise, an experiment that relies on the earlier version of the data would fail or produce different results.
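
To make the idea concrete, here is a toy in-memory model of a versioned registry. It is not the AzureML implementation: `ToyRegistry` and its methods are hypothetical, and the rule that re-registering a name without `create_new_version` raises an error is a simplifying assumption. It only illustrates why code pinned to version 1 keeps working after version 2 is registered.

```python
class ToyRegistry:
    """Toy stand-in for a dataset registry; NOT the AzureML API."""

    def __init__(self):
        # Maps a dataset name to its payloads; list index 0 holds version 1.
        self._versions = {}

    def register(self, name, payload, create_new_version=False):
        """Store a payload under `name` and return the new version number."""
        versions = self._versions.setdefault(name, [])
        if versions and not create_new_version:
            # Assumption of this sketch: never overwrite silently.
            raise ValueError(f"'{name}' is already registered")
        versions.append(payload)
        return len(versions)

    def get_by_name(self, name, version=None):
        """Return the latest payload, or a specific 1-based version."""
        versions = self._versions[name]
        if version is None:
            return versions[-1]
        return versions[version - 1]
```

With this model, registering twice under the same name with `create_new_version=True` keeps both payloads, and `get_by_name(name, version=1)` still returns the first one.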

In [34]:
# If you want to use a specific version of data.
dataset_v1 = Dataset.get_by_name(ws, 'diabetes table', version = 1)

#### Train a model from a tabular dataset

In [41]:
from azureml.core import Experiment, ScriptRunConfig, Environment
from azureml.core.runconfig import DockerConfiguration
from azureml.widgets import RunDetails

env = Environment.from_conda_specification('experiment_env', 'environment.yml')

diabetes_ds = ws.datasets.get("diabetes table")

# Note that if you use this approach, you still need to include a script
# argument for the dataset, even though you don’t actually use it to retrieve the dataset.
script_config = ScriptRunConfig(source_directory = 'Script',
                                script = '5_Train_Dataset.py',
                                environment = env,
                                compute_target = 'my-compute',
                                arguments = ['--regularization', 0.1,
                                             '--input-data', diabetes_ds.as_named_input('training_data')],
                                docker_runtime_config=DockerConfiguration(use_docker=True)
                                )

experiment_name = 'mslearn-train-diabetes'
exp = Experiment(workspace = ws, name = experiment_name)
run = exp.submit(config = script_config)
# Block until the submitted run finishes.
run.wait_for_completion()

{'runId': 'mslearn-train-diabetes_1671482266_9449f548',
 'target': 'my-compute',
 'status': 'Completed',
 'startTimeUtc': '2022-12-19T20:37:46.603191Z',
 'endTimeUtc': '2022-12-19T20:37:48.352495Z',
 'services': {},
 'properties': {'_azureml.ComputeTargetType': 'amlctrain',
  'ContentSnapshotId': '13660cc4-770e-40fd-958a-4c50471328d9',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json',
  'azureml.git.repository_uri': 'https://github.com/LuciaRavazzi/AzureML.git',
  'mlflow.source.git.repoURL': 'https://github.com/LuciaRavazzi/AzureML.git',
  'azureml.git.branch': 'main',
  'mlflow.source.git.branch': 'main',
  'azureml.git.commit': '8d87150fc14085920177589fd302d4f345ee8897',
  'mlflow.source.git.commit': '8d87150fc14085920177589fd302d4f345ee8897',
  'azureml.git.dirty': 'True'},
 'inputDatasets': [{'dataset': {'id': '62c31925-6d0a-466c-8d6a-8573666c3b2d'}, 'consumptionDetails': {'type': 'RunInput', 'inputName': 'training_da
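
On the script side, the named input can be read from the run context. The sketch below shows what a script such as `5_Train_Dataset.py` might contain. The actual script is not shown in this section, so the function names, the `dest` names, the default regularization rate, and the use of `run.input_datasets['training_data']` are assumptions based on the submission cell above. The AzureML import is kept inside `load_training_data` so that the argument parsing can be tried without the SDK installed.

```python
import argparse


def parse_args(argv=None):
    """Parse the arguments passed by ScriptRunConfig (names are assumptions)."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--regularization', type=float, dest='reg_rate',
                        default=0.01)  # default value is an assumption
    # Required by the named-input approach even though the script reads the
    # dataset from the run context rather than from this argument.
    parser.add_argument('--input-data', type=str, dest='training_dataset_id')
    return parser.parse_args(argv)


def load_training_data():
    """Fetch the named input 'training_data' as a pandas DataFrame."""
    from azureml.core import Run  # imported lazily; needs the AzureML SDK

    run = Run.get_context()
    return run.input_datasets['training_data'].to_pandas_dataframe()


def main(argv=None):
    args = parse_args(argv)
    print('Regularization rate:', args.reg_rate)
    data = load_training_data()
    print('Rows loaded:', len(data))
```

A real script would call `main()` at the bottom and continue with model training on `data`.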

In [None]:
# If you want to retrieve the dataset through its ID, pass the dataset itself
# as a script argument: the script then receives the dataset ID as a string.
#script_config = ScriptRunConfig(source_directory = 'Script',
#                                script = '5_Train_Dataset.py',
#                                environment = env,
#                                compute_target = 'my-compute',
#                                arguments = ['--regularization', 0.1,
#                                             '--ds', diabetes_ds],
#                                docker_runtime_config=DockerConfiguration(use_docker=True)
#                                )

# In the script:
# import argparse
# from azureml.core import Run, Dataset
# parser = argparse.ArgumentParser()
# parser.add_argument('--regularization', type=float, dest='reg_rate')
# parser.add_argument('--ds', type=str, dest='dataset_id')
# args = parser.parse_args()
# run = Run.get_context()
# ws = run.experiment.workspace
# dataset = Dataset.get_by_id(ws, id=args.dataset_id)
# data = dataset.to_pandas_dataframe()

#### Train a model from file data

In this case, you should use `.as_download()` or `.as_mount()` so that the script can read the files from a temporary location on the compute target. The former downloads the files before the run starts; the latter streams them from the source, which is convenient when the volume of data is large.
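
The choice between the two modes can be wrapped in a small helper. This is a minimal sketch: `file_input` and its `mode` parameter are hypothetical names, and the helper only chains `as_named_input()` with `as_download()` or `as_mount()` on the dataset it is given.

```python
def file_input(dataset, name, mode='download'):
    """Build the script argument for a file dataset.

    `dataset` is expected to be an AzureML file dataset, for example the
    result of ws.datasets.get("diabetes file data").
    """
    named = dataset.as_named_input(name)
    if mode == 'download':
        # Copy the files to the compute target before the script starts.
        return named.as_download()
    if mode == 'mount':
        # Stream the files from the source while the script runs.
        return named.as_mount()
    raise ValueError(f"mode must be 'download' or 'mount', got {mode!r}")
```

The returned object can be placed in the `arguments` list of `ScriptRunConfig`, for example `['--input-data', file_input(diabetes_ds, 'training_files', mode='mount')]`.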

In [44]:
from azureml.core import Experiment, ScriptRunConfig, Environment
from azureml.core.runconfig import DockerConfiguration
from azureml.widgets import RunDetails

env = Environment.from_conda_specification('experiment_env', 'environment.yml')

diabetes_ds = ws.datasets.get("diabetes file data")

script_config = ScriptRunConfig(source_directory = 'Script',
                                script = '6_Train_File_Dataset.py',
                                environment = env,
                                compute_target = 'my-compute',
                                arguments = ['--regularization', 0.1,
                                             '--input-data', diabetes_ds.as_named_input('training_files').as_download()],
                                docker_runtime_config=DockerConfiguration(use_docker=True)
                                )

experiment_name = 'mslearn-train-diabetes'
exp = Experiment(workspace = ws, name = experiment_name)
run = exp.submit(config = script_config)
# Block until the submitted run finishes.
run.wait_for_completion()

{'runId': 'mslearn-train-diabetes_1671483479_77de376a',
 'target': 'my-compute',
 'status': 'Completed',
 'startTimeUtc': '2022-12-19T20:58:02.117808Z',
 'endTimeUtc': '2022-12-19T20:58:04.030098Z',
 'services': {},
 'properties': {'_azureml.ComputeTargetType': 'amlctrain',
  'ContentSnapshotId': '258542a0-9636-402b-9202-08b2637ba3a3',
  'azureml.git.repository_uri': 'https://github.com/LuciaRavazzi/AzureML.git',
  'mlflow.source.git.repoURL': 'https://github.com/LuciaRavazzi/AzureML.git',
  'azureml.git.branch': 'main',
  'mlflow.source.git.branch': 'main',
  'azureml.git.commit': '8d87150fc14085920177589fd302d4f345ee8897',
  'mlflow.source.git.commit': '8d87150fc14085920177589fd302d4f345ee8897',
  'azureml.git.dirty': 'True',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json'},
 'inputDatasets': [{'dataset': {'id': '007a308b-03d7-45dc-b762-672f073c27db'}, 'consumptionDetails': {'type': 'RunInput', 'inputName': 'training_fi
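
With a file dataset, the script receives a folder path rather than a table, so it has to read the files itself. The sketch below shows one way a script such as `6_Train_File_Dataset.py` might do this. The actual script is not shown in this section, so `read_csv_folder` is a hypothetical helper, and the assumption is that the CSV files sit directly in the folder passed through `--input-data`. It uses the standard `csv` module instead of pandas so that it runs without extra packages.

```python
import csv
import glob
import os


def read_csv_folder(data_path):
    """Read every .csv file in data_path into a list of row dictionaries."""
    rows = []
    # Sort the paths so that the row order is reproducible across runs.
    for file_path in sorted(glob.glob(os.path.join(data_path, '*.csv'))):
        with open(file_path, newline='') as handle:
            rows.extend(csv.DictReader(handle))
    return rows
```

Inside the script, the folder path parsed from `--input-data` would be passed to `read_csv_folder`; all values are returned as strings and need converting before training.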