Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.


# Train with Azure Machine Learning datasets
Datasets are categorized into TabularDataset and FileDataset based on how users consume them in training. 
* A TabularDataset represents data in a tabular format by parsing the provided file or list of files. TabularDataset can be created from csv, tsv, parquet files, SQL query results etc. For the complete list, please visit our [documentation](https://aka.ms/tabulardataset-api-reference). It provides you with the ability to materialize the data into a pandas DataFrame.
* A FileDataset references single or multiple files in your datastores or public URLs. This provides you with the ability to download or mount the files to your compute. The files can be of any format, which enables a wider range of machine learning scenarios, including deep learning.

In this tutorial, you will learn how to train with Azure Machine Learning datasets:

&#x2611; Use datasets directly in your training script

&#x2611; Use datasets to mount files to a remote compute

## Prerequisites


In [1]:
# NOTE: PLEASE RUN THIS PIP INSTALL BEFORE THE REST OF THE NOTEBOOK
# !pip install numpy==1.19.3
# !pip install azureml-dataprep[pandas]
# !pip install azureml-widgets
# !pip install scikit-learn
# !pip install azureml-sdk==0.1.0.* --index-url https://azuremlsdktestpypi.azureedge.net/DataDrift-SDK-Unit/25355784  --extra-index-url https://pypi.python.org/simple

In [8]:
# Check core SDK version number
import azureml.core

print('SDK version:', azureml.core.VERSION)

SDK version: 0.1.0.25355784


## Initialize Workspace

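Initialize a workspace object from your subscription ID, resource group, and workspace name. If you have a persisted configuration file, `Workspace.from_config()` is an alternative.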

In [1]:
from azureml.core import Workspace

ws = Workspace('b1fff005-d722-4d97-99ac-7c6e9ef020aa', 'rafarmahtestrg', 'rafarmahcredpassthrough')
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')

If you run your code in unattended mode, i.e., where you can't give a user input, then we recommend to use ServicePrincipalAuthentication or MsiAuthentication.
Please refer to aka.ms/aml-notebook-auth for different authentication mechanisms in azureml-sdk.


rafarmahcredpassthrough
rafarmahtestrg
centraluseuap
b1fff005-d722-4d97-99ac-7c6e9ef020aa


## Create Experiment

**Experiment** is a logical container in an Azure ML Workspace. It hosts run records which can include run metrics and output artifacts from your experiments.

In [10]:
from azureml.core import Experiment
experiment_name = 'may-credential-passthrough'
experiment = Experiment(workspace = ws, name = experiment_name)

## Create or Attach existing compute resource
By using Azure Machine Learning Compute, a managed service, data scientists can train machine learning models on clusters of Azure virtual machines. Examples include VMs with GPU support. In this tutorial, you create Azure Machine Learning Compute as your training environment. The code below creates the compute clusters for you if they don't already exist in your workspace.

**Creation of compute takes approximately 5 minutes.** If the AmlCompute with that name is already in your workspace the code will skip the creation process.

In [11]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your CPU cluster
cpu_cluster_name = 'hosttoolstest' # "hosttoolstest2", "hosttoolsttest3"

# Verify that cluster does not exist already
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           max_nodes=4)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

cpu_cluster.wait_for_completion(show_output=True)

Found existing cluster, use it.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


You now have the necessary packages and compute resources to train a model in the cloud.
## Use datasets directly in training

### Create a TabularDataset
By creating a dataset, you create a reference to the data source location. If you applied any subsetting transformations to the dataset, they will be stored in the dataset as well. The data remains in its existing location, so no extra storage cost is incurred. 
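The idea of a dataset as a reference plus recorded transformations can be sketched with the standard library alone. This is a toy model, not how Azure ML implements datasets: the `LazyTabularRef` class, its `take` and `to_rows` methods, and the small hand-written sample file are all hypothetical names and data made up for illustration.

```python
import csv
import os
import tempfile
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class LazyTabularRef:
    """Hypothetical stand-in for a dataset: a path plus a recorded row limit."""
    path: str
    row_limit: Optional[int] = None  # recorded subsetting step, not yet applied

    def take(self, n: int) -> "LazyTabularRef":
        # Only record the subsetting step; no data is read here.
        return LazyTabularRef(self.path, n)

    def to_rows(self) -> List[Dict[str, str]]:
        # The file is read from its original location only on materialization.
        rows = []
        with open(self.path, newline="") as handle:
            for index, row in enumerate(csv.DictReader(handle)):
                if self.row_limit is not None and index >= self.row_limit:
                    break
                rows.append(dict(row))
        return rows


with tempfile.TemporaryDirectory() as tmp:
    # Hand-written sample data standing in for a file in remote storage.
    sample_path = os.path.join(tmp, "sample.csv")
    with open(sample_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sepal_length", "species"])
        writer.writerows([["5.1", "Iris-setosa"], ["4.9", "Iris-setosa"],
                          ["4.7", "Iris-setosa"], ["4.6", "Iris-setosa"]])

    reference = LazyTabularRef(sample_path)   # creates a reference, reads nothing
    subset = reference.take(3)                # stores the transformation
    preview = subset.to_rows()                # the read happens here

print(len(preview))
```

The point of the sketch is only the ordering: creating the reference and recording `take(3)` copy no data, which is why the original file stays where it is and no extra storage is used.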

First, upload the [iris dataset](./train-dataset/iris.csv) to your ADLS Gen2 storage account. Make sure that you grant yourself the 'Storage Blob Data Contributor' role on the storage account for read and write access. ADLS Gen2 also supports POSIX-like access control lists (ACLs); learn how to set ACLs [here](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-access-control).

![roleaccess](roleaccess.PNG)

Then create an unregistered TabularDataset that points to the ADLS Gen2 storage URL. You can also create a dataset from multiple paths. [Learn more](https://aka.ms/azureml/howto/createdatasets) <br>
You can find the storage URL in Storage Explorer on the Azure portal.
![image.png](storageurl.jpg)

[TabularDataset](https://docs.microsoft.com/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py) represents data in a tabular format by parsing the provided file or list of files. This provides you with the ability to materialize the data into a Pandas or Spark DataFrame. You can create a TabularDataset object from .csv, .tsv, and parquet files, and from SQL query results. For a complete list, see [TabularDatasetFactory](https://docs.microsoft.com/python/api/azureml-core/azureml.data.dataset_factory.tabulardatasetfactory?view=azure-ml-py) class.

**NOTE** You will get a permission denied error if you try to load data from the sample URL below, because you do not have permission to access that ADLS Gen2 storage account. Upload the [iris dataset](./train-dataset/iris.csv) to your own ADLS Gen2 storage account and replace the URL with your own storage URL.

In [2]:
from azureml.core import Dataset
dataset = Dataset.Tabular.from_delimited_files('https://mayadls2.blob.core.windows.net/tabular/iris.csv')

# preview the first 3 rows of the dataset
dataset.take(3).to_pandas_dataframe()

Credentials are not provided to access data from the source. Please sign in using identity with required permission granted.


Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa


Alternatively, you can register an ADLS Gen2 datastore without providing credentials and create the dataset from a datastore path.

In [3]:
#from azureml.core import Datastore
#datastore = Datastore.register_azure_data_lake_gen2(workspace=ws, datastore_name='mayadlsgen2',
#                                                    filesystem='tabular', account_name='mayadls2')
#dataset = Dataset.Tabular.from_delimited_files((datastore, 'iris.csv'))
#dataset.take(3).to_pandas_dataframe()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa


### Create a training script

To submit the job to the cluster, first create a training script. Run the following code to create the training script called `train_iris.py` in the script_folder.

In [2]:
import os
script_folder = os.path.join(os.getcwd(), 'train-dataset')

In [3]:
%%writefile $script_folder/train_iris.py

import os

from azureml.dataprep import __version__ as dprepver
from azureml.core import Dataset, Run
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# sklearn.externals.joblib is removed in 0.23
from sklearn import __version__ as sklearnver
from packaging.version import Version
if Version(sklearnver) < Version("0.23.0"):
    from sklearn.externals import joblib
else:
    import joblib

print('dprep version: {}'.format(dprepver))

run = Run.get_context()
# get input dataset by name
dataset = run.input_datasets['iris']

df = dataset.to_pandas_dataframe()

x_col = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
y_col = ['species']
x_df = df.loc[:, x_col]
y_df = df.loc[:, y_col]

#dividing X,y into train and test data
x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, test_size=0.2, random_state=223)

data = {'train': {'X': x_train, 'y': y_train},

        'test': {'X': x_test, 'y': y_test}}

clf = DecisionTreeClassifier().fit(data['train']['X'], data['train']['y'])
model_file_name = 'decision_tree.pkl'

print('Accuracy of Decision Tree classifier on training set: {:.2f}'.format(clf.score(x_train, y_train)))
print('Accuracy of Decision Tree classifier on test set: {:.2f}'.format(clf.score(x_test, y_test)))

os.makedirs('./outputs', exist_ok=True)
# joblib.dump writes the file itself, so no separate open() call is needed
joblib.dump(value=clf, filename=os.path.join('outputs', model_file_name))

Overwriting C:\Users\SIHHU\project\identitypassthrough\train-with-dataset\train-dataset/train_iris.py


### Create an environment

Define a conda environment YAML file with your training script dependencies and create an Azure ML environment.

In [4]:
%%writefile conda_dependencies.yml

dependencies:
- python=3.6.2
- scikit-learn
- pip:
  - packaging
  - nltk
  - azureml-core==0.1.0.26419284
  - --extra-index-url https://azuremlsdktestpypi.azureedge.net/DataDrift-SDK-Unit/26419284
  - azureml-telemetry==1.6.0
  - azureml-dataprep[pandas]==2.7.0.dev0+e4ddba8
  - --extra-index-url https://dataprepdownloads.azureedge.net/pypi/test-M3ME5B1GMEM3SW0W/26411269/
  - --pre

Overwriting conda_dependencies.yml


In [5]:
from azureml.core import Environment

sklearn_env = Environment.from_conda_specification(name = 'sklearn-env', file_path = './conda_dependencies.yml')

### Configure training run

A ScriptRunConfig object specifies the configuration details of your training job, including your training script, environment to use, and the compute target to run on. Specify the following in your script run configuration:
* The directory that contains your scripts. All the files in this directory are uploaded into the cluster nodes for execution. 
* The training script name, train_iris.py
* The input dataset for training, passed as an argument to your training script. `as_named_input()` is required so that the input dataset can be referenced by the assigned name in your training script. 
* The compute target. In this case you will use the AmlCompute you created
* The environment definition for the experiment

In [12]:
from azureml.core import ScriptRunConfig

src = ScriptRunConfig(source_directory=script_folder,
                      script='train_iris.py',
                      arguments=[dataset.as_named_input('iris')],
                      compute_target=cpu_cluster,
                      environment=sklearn_env)

### Submit job to run
Submit the ScriptRunConfig to the Azure ML experiment to start execution. Set `credential_passthrough=True` to opt in to using your own identity for data access authentication in remote training. Otherwise, the service will try to use the identity of the compute for data access authentication.

In [13]:
run = experiment.submit(src, credential_passthrough=True)
run

Experiment,Id,Type,Status,Details Page,Docs Page
may-credential-passthrough,may-credential-passthrough_1606346412_ce38320f,azureml.scriptrun,Starting,Link to Azure Machine Learning studio,Link to Documentation


In [14]:
from azureml.widgets import RunDetails

# monitor the run
RunDetails(run).show()

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

## Use datasets to mount files to a remote compute

You can use the `Dataset` object to mount or download the files it references. When you mount a file system, you attach that file system to a directory (mount point) and make it available to the system. Because mounting loads files at the time of processing, it is usually faster than downloading.<br> 
Note: mounting is only available for Linux-based compute (DSVM/VM, AmlCompute, HDInsight).
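The timing difference between downloading and mounting can be modeled with a small standard-library sketch. It is only an analogy: a real mount is a file-system-level mechanism, and the helper names `download_all` and `read_on_demand`, as well as the placeholder file contents, are assumptions made up for this illustration.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Iterator, List, Tuple


def download_all(source: Path, target: Path) -> List[Path]:
    """Eager model: copy every file before any processing starts."""
    target.mkdir(parents=True, exist_ok=True)
    return [Path(shutil.copy2(item, target / item.name))
            for item in sorted(source.iterdir()) if item.is_file()]


def read_on_demand(source: Path) -> Iterator[Tuple[str, bytes]]:
    """Lazy model: a file is read only when the consumer reaches it."""
    for item in sorted(source.iterdir()):
        if item.is_file():
            yield item.name, item.read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder files standing in for data held in remote storage.
    source = Path(tmp) / "source"
    source.mkdir()
    for name in ("features.npy", "labels.npy"):
        (source / name).write_bytes(b"placeholder")

    # Download: both files are copied up front.
    copied = [path.name for path in download_all(source, Path(tmp) / "local")]

    # Mount-like access: only the first file has been read at this point.
    first_name, first_bytes = next(read_on_demand(source))

print(copied, first_name)
```

With the eager approach the cost of transferring every file is paid before training begins, while the lazy approach pays only for the files the script actually touches.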

### Upload data files into your ADLS Gen2 storage account
We will first load the diabetes data from `scikit-learn` and save it as two `.npy` files in the local `./data` folder.

In [24]:
!pip install sklearn

Processing c:\users\sihhu\appdata\local\pip\cache\wheels\76\03\bb\589d421d27431bcd2c6da284d5f2286c8e3b2ea3cf1594c074\sklearn-0.0-py2.py3-none-any.whl
Collecting scikit-learn
  Using cached scikit_learn-0.23.2-cp37-cp37m-win_amd64.whl (6.8 MB)
Collecting threadpoolctl>=2.0.0
  Using cached threadpoolctl-2.1.0-py3-none-any.whl (12 kB)
Collecting scipy>=0.19.1
  Downloading scipy-1.5.4-cp37-cp37m-win_amd64.whl (31.2 MB)
Collecting joblib>=0.11
  Using cached joblib-0.17.0-py3-none-any.whl (301 kB)
Installing collected packages: threadpoolctl, scipy, joblib, scikit-learn, sklearn
Successfully installed joblib-0.17.0 scikit-learn-0.23.2 scipy-1.5.4 sklearn-0.0 threadpoolctl-2.1.0


In [28]:
import os

from sklearn.datasets import load_diabetes
import numpy as np

os.makedirs('./data', exist_ok=True)
training_data = load_diabetes()
np.save(file='./data/features.npy', arr=training_data['data'])
np.save(file='./data/labels.npy', arr=training_data['target'])

Now upload the two files to your ADLS Gen2 storage account, into a folder named `diabetes`.

### Create a FileDataset

[FileDataset](https://docs.microsoft.com/python/api/azureml-core/azureml.data.file_dataset.filedataset?view=azure-ml-py) references single or multiple files in your storage. Using this method, you can download or mount the files to your compute as a FileDataset object. The files can be in any format, which enables a wider range of machine learning scenarios, including deep learning.

In this example, we will create the FileDataset directly from the storage URL of the diabetes folder.

**NOTE** You will get a permission denied error if you try to load data from the sample URL below, because you do not have permission to access that ADLS Gen2 storage account. Upload the diabetes dataset to your own ADLS Gen2 storage account and replace the URL with your own storage URL.

In [15]:
from azureml.core import Dataset

dataset = Dataset.File.from_files('https://mayadls2.blob.core.windows.net/tabular/diabetes')

# see a list of files referenced by dataset
dataset.to_path()

['/diabetes/features.npy', '/diabetes/labels.npy']

### Create a training script

To submit the job to the cluster, first create a training script. Run the following code to create the training script called `train_diabetes.py` in the script_folder. 

In [16]:
%%writefile $script_folder/train_diabetes.py

import os
import glob
import argparse

from azureml.core.run import Run
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
# sklearn.externals.joblib is removed in 0.23
from sklearn import __version__ as sklearnver
from packaging.version import Version
if Version(sklearnver) < Version("0.23.0"):
    from sklearn.externals import joblib
else:
    import joblib

import numpy as np

parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str, help='training dataset')
args = parser.parse_args()

os.makedirs('./outputs', exist_ok=True)

base_path = args.data_folder

run = Run.get_context()

X = np.load(glob.glob(os.path.join(base_path, '**/features.npy'), recursive=True)[0])
y = np.load(glob.glob(os.path.join(base_path, '**/labels.npy'), recursive=True)[0])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
data = {'train': {'X': X_train, 'y': y_train},
        'test': {'X': X_test, 'y': y_test}}

# list of numbers from 0.0 up to (but not including) 1.0 with a 0.05 interval
alphas = np.arange(0.0, 1.0, 0.05)

for alpha in alphas:
    # use Ridge algorithm to create a regression model
    reg = Ridge(alpha=alpha)
    reg.fit(data['train']['X'], data['train']['y'])

    preds = reg.predict(data['test']['X'])
    mse = mean_squared_error(data['test']['y'], preds)
    run.log('alpha', alpha)
    run.log('mse', mse)

    model_file_name = 'ridge_{0:.2f}.pkl'.format(alpha)
    # joblib.dump writes the file itself, so no separate open() call is needed
    joblib.dump(value=reg, filename=os.path.join('outputs', model_file_name))

    print('alpha is {0:.2f}, and mse is {1:0.2f}'.format(alpha, mse))

Overwriting C:\Users\SIHHU\project\identitypassthrough\train-with-dataset\train-dataset/train_diabetes.py


### Configure & Run

Now configure your run. We will reuse the same `sklearn_env` environment from the previous run. Once the environment is built, and if you don't change your dependencies, it will be reused in subsequent runs. 

We will pass the DatasetConsumptionConfig of our FileDataset to the `'--data-folder'` argument of the script. Azure ML will resolve this to the mount point of the data on the compute target, which we parse in the training script.
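What the training script sees at run time can be simulated locally. The sketch below assumes the resolved argument is an ordinary directory path and uses a temporary directory as a stand-in for the mount point; the `find_data_file` helper and the `diabetes` subfolder layout are illustrative assumptions, not something Azure ML guarantees.

```python
import argparse
import glob
import os
import tempfile


def find_data_file(base_path: str, file_name: str) -> str:
    """Return the first file named file_name anywhere under base_path."""
    pattern = os.path.join(base_path, "**", file_name)
    matches = glob.glob(pattern, recursive=True)
    if not matches:
        raise FileNotFoundError(f"{file_name} not found under {base_path}")
    return matches[0]


parser = argparse.ArgumentParser()
parser.add_argument("--data-folder", type=str, help="training dataset")

with tempfile.TemporaryDirectory() as fake_mount:
    # Stand-in for the mount point: an empty file in a nested folder.
    os.makedirs(os.path.join(fake_mount, "diabetes"))
    open(os.path.join(fake_mount, "diabetes", "features.npy"), "wb").close()

    # At run time the script would receive the real path in place of fake_mount.
    args = parser.parse_args(["--data-folder", fake_mount])
    found = find_data_file(args.data_folder, "features.npy")
    relative = os.path.relpath(found, args.data_folder)

print(relative)
```

Searching with a recursive `**` pattern means the script does not need to know whether the files sit directly under the mount point or inside a subfolder.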

In [17]:
from azureml.core import ScriptRunConfig

src = ScriptRunConfig(source_directory=script_folder, 
                      script='train_diabetes.py', 
                      # to mount the dataset on the remote compute and pass the mounted path as an argument to the training script
                      arguments =['--data-folder', dataset.as_mount()],
                      compute_target=cpu_cluster,
                      environment=sklearn_env)

Set `credential_passthrough=True` to opt in to using your own identity for data access authentication in remote training. Otherwise, the service will try to use the identity of the compute for data access authentication.

In [18]:
run = experiment.submit(config=src, credential_passthrough=True)

# monitor the run
RunDetails(run).show()

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

### Display run results
You now have a model trained on a remote cluster. Retrieve all the metrics logged during the run, here the `alpha` values and the corresponding mean squared error (`mse`):

In [20]:
run.wait_for_completion()
metrics = run.get_metrics()
print(metrics)

{'alpha': [0.0, 0.05, 0.1, 0.15000000000000002, 0.2, 0.25, 0.30000000000000004, 0.35000000000000003, 0.4, 0.45, 0.5, 0.55, 0.6000000000000001, 0.65, 0.7000000000000001, 0.75, 0.8, 0.8500000000000001, 0.9, 0.9500000000000001], 'mse': [3424.3166882137343, 3408.9153122589296, 3372.649627810032, 3345.14964347419, 3325.294679467878, 3311.5562509289744, 3302.6736334017264, 3297.658733944204, 3295.74106435581, 3296.316884705676, 3298.9096058070622, 3303.140055527517, 3308.7042707723226, 3315.3568399622573, 3322.898314903962, 3331.1656169285875, 3340.024662032161, 3349.364644348603, 3359.093569748443, 3369.1347399130477]}
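To pick the best regularization strength from these results, one option is to pair each `alpha` with its `mse` and take the minimum. The sketch below assumes the metrics dictionary has the shape printed above (two parallel lists) and uses a three-value excerpt of that output; the `best_alpha` helper is a hypothetical name.

```python
from typing import Dict, List, Tuple


def best_alpha(metrics: Dict[str, List[float]]) -> Tuple[float, float]:
    """Return (alpha, mse) for the entry with the lowest mean squared error."""
    mse, alpha = min(zip(metrics["mse"], metrics["alpha"]))
    return alpha, mse


# Excerpt of the logged values around the minimum.
metrics_excerpt = {
    "alpha": [0.35000000000000003, 0.4, 0.45],
    "mse": [3297.658733944204, 3295.74106435581, 3296.316884705676],
}

best = best_alpha(metrics_excerpt)
print(best)  # (0.4, 3295.74106435581)
```

In the full output the lowest `mse` also occurs at `alpha` 0.4, so the model saved as `ridge_0.40.pkl` would be the natural candidate to keep.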
