# Tutorial: Bring your own data (Part 3 of 3)

---
## Introduction

In the previous article, [Tutorial: Train a model in the cloud](2.train-model.ipynb), the CIFAR10 data was downloaded using the built-in `torchvision.datasets.CIFAR10` method in the PyTorch API. In many cases, however, you will want to use your own data in a remote training run. This article covers the workflow for working with your own data in Azure Machine Learning.

By the end of this tutorial, you will have a better understanding of:

- How to upload your data to Azure
- Best practices for working with cloud data in Azure Machine Learning
- Working with command-line arguments

## Prerequisites

- 

---

## Your machine learning code

By now you have your training script running in Azure Machine Learning and can monitor the model's performance. Let's _parametrize_ the training script by introducing arguments. Using arguments lets you easily compare different hyperparameters.

Until now, our training script has downloaded the CIFAR10 dataset on each run. The Python code in [train-with-cloud-data-and-logging.py](../../code/models/pytorch/cifar10-cnn/train-with-cloud-data-and-logging.py) now uses **`argparse` to parametrize the script.**

### Understanding your machine learning code changes

The script `train-with-cloud-data-and-logging.py` uses the `argparse` library to set up the `--data-path`, `--learning-rate`, `--momentum`, and `--epochs` arguments:

```python
parser = argparse.ArgumentParser()
parser.add_argument("--data-path", type=str, help="Path to the training data")
parser.add_argument("--learning-rate", type=float, default=0.001, help="Learning rate for SGD")
parser.add_argument("--momentum", type=float, default=0.9, help="Momentum for SGD")
parser.add_argument("--epochs", type=int, default=2, help="Number of epochs to train")
args = parser.parse_args()
```
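
To see how these arguments behave, the following minimal sketch rebuilds the same parser and parses an explicit argument list instead of the real command line. The `build_parser` function and the `./data` path are hypothetical names used only for illustration. Note that `argparse` converts the hyphens in option names to underscores, so `--data-path` becomes `args.data_path`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Recreate the four arguments described above (illustrative helper)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-path", type=str, help="Path to the training data")
    parser.add_argument("--learning-rate", type=float, default=0.001, help="Learning rate for SGD")
    parser.add_argument("--momentum", type=float, default=0.9, help="Momentum for SGD")
    parser.add_argument("--epochs", type=int, default=2, help="Number of epochs to train")
    return parser


# Parse an explicit list rather than sys.argv so the example runs anywhere.
# "./data" is a placeholder path, not a location used by the tutorial.
args = build_parser().parse_args(["--data-path", "./data", "--learning-rate", "0.003"])

print(args.data_path)      # ./data
print(args.learning_rate)  # 0.003 (supplied on the command line)
print(args.momentum)       # 0.9   (default, not supplied)
print(args.epochs)         # 2     (default, not supplied)
```

Arguments that are not supplied fall back to their defaults, which is what makes it easy to vary one hyperparameter at a time between runs.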

The script was adapted to update the optimizer to use the user-defined parameters:

```python
optimizer = optim.SGD(
    net.parameters(),
    lr=args.learning_rate,     # get learning rate from command-line argument
    momentum=args.momentum,    # get momentum from command-line argument
)
```

Similarly, the training loop now runs for the user-defined number of epochs:

```python
for epoch in range(args.epochs):
```


## Upload your data to Azure

In order to run this script in Azure Machine Learning, you need to make your training data available in Azure. Your Azure Machine Learning workspace comes equipped with a _default_ **Datastore** - an Azure Blob storage account - that you can use to store your training data.

> <span style="color:purple; font-weight:bold">! NOTE <br>
> Azure Machine Learning allows you to connect other cloud-based datastores that store your data. For more details, see [datastores documentation](./concept-data.md).</span>


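```python
from azureml.core import Workspace, Dataset
from torchvision import datasets

# Connect to the workspace described by the local config file
ws = Workspace.from_config("~/code/default.json")

# Download CIFAR10 to the current directory
datasets.CIFAR10(".", download=True)

# Upload the extracted data to the workspace's default datastore
ds = ws.get_default_datastore()
ds.upload(
    src_dir="cifar-10-batches-py",
    target_path="datasets/cifar10",
    overwrite=False,
)
```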

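The `target_path` specifies the path on the datastore where the CIFAR10 data will be uploaded.

Once the upload completes, you can remove the local copies of the data:

```python
import os
import shutil

# Delete the downloaded archive and the extracted folder
os.remove("cifar-10-python.tar.gz")
shutil.rmtree("cifar-10-batches-py")
```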

## Submit your machine learning code to Azure Machine Learning

As you have done previously, create a new Python control script:

```python
from azureml.core import (
    Workspace,
    Experiment,
    Environment,
    ScriptRunConfig,
    Dataset,
)
from azureml.widgets import RunDetails

ws = Workspace.from_config("~/code/default.json")

ds = Dataset.File.from_files(
    path=(ws.get_default_datastore(), "datasets/cifar10")
)
env = Environment.from_conda_specification(
    name="pytorch-env-tutorial",
    file_path="../../environments/pytorch-example.yml",
)
exp = Experiment(
    workspace=ws, name="getting-started-train-model-cloud-data-tutorial"
)
src = ScriptRunConfig(
    source_directory="../../code/models/pytorch/cifar10-cnn",
    script="train-with-cloud-data-and-logging.py",
    compute_target="cpu-cluster",
    environment=env,
    arguments=[
        "--data-path",
        ds.as_mount(),
        "--learning-rate",
        0.003,
        "--momentum",
        0.92,
        "--epochs",
        2,
    ],
)

run = exp.submit(src)
RunDetails(run).show()
```


### Understand the control code

The control code above adds the following to the control code written in the [previous tutorial](03-train-model.ipynb):

**`ds = Dataset.File.from_files(path=(ws.get_default_datastore(), "datasets/cifar10"))`**: A Dataset is used to reference the data you uploaded to Azure Blob storage. Datasets are an abstraction layer on top of your data, designed to improve reliability and trustworthiness.


**`src = ScriptRunConfig(...)`**: We modified the `ScriptRunConfig` to include a list of arguments that will be passed into the training script. We also specified `ds.as_mount()`, which means the dataset's directory will be _mounted_ on the compute target.
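
The values in `arguments` reach the training script as command-line strings, and `argparse` converts them back using each argument's `type`. The sketch below imitates that round trip locally. It is an illustration only and does not show how Azure Machine Learning is implemented: `to_command_line`, `DATASET_PLACEHOLDER`, and the mount path `/tmp/example-mount` are hypothetical, with the placeholder standing in for `ds.as_mount()`.

```python
import argparse

# Hypothetical stand-in for ds.as_mount(); the real object is resolved by the service.
DATASET_PLACEHOLDER = object()


def to_command_line(arguments, mount_path):
    """Stringify every value and substitute the mount path for the placeholder."""
    return [mount_path if a is DATASET_PLACEHOLDER else str(a) for a in arguments]


arguments = [
    "--data-path", DATASET_PLACEHOLDER,
    "--learning-rate", 0.003,
    "--momentum", 0.92,
    "--epochs", 2,
]

# Everything becomes a string, as on a real command line.
argv = to_command_line(arguments, "/tmp/example-mount")
print(argv)

# The training script's parser turns the strings back into typed values.
parser = argparse.ArgumentParser()
parser.add_argument("--data-path", type=str)
parser.add_argument("--learning-rate", type=float, default=0.001)
parser.add_argument("--momentum", type=float, default=0.9)
parser.add_argument("--epochs", type=int, default=2)
parsed = parser.parse_args(argv)

print(parsed.data_path, parsed.learning_rate, parsed.momentum, parsed.epochs)
```

Because the script only ever sees a path string, the same training code can run against a mounted dataset in the cloud or a local folder during development.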

## Inspect the 70_driver_log.txt log file

Navigate to the `70_driver_log.txt` file for the run. You should see output similar to the following:

```
Processing 'input'.
Processing dataset FileDataset
{
  "source": [
    "('workspaceblobstore', 'datasets/cifar10')"
  ],
  "definition": [
    "GetDatastoreFiles"
  ],
  "registration": {
    "id": "XXXXX",
    "name": null,
    "version": null,
    "workspace": "Workspace.create(name='XXXX', subscription_id='XXXX', resource_group='X')"
  }
}
Mounting input to /tmp/tmp9kituvp3.
Mounted input to /tmp/tmp9kituvp3 as folder.
Exit __enter__ of DatasetContextManager
Entering Run History Context Manager.
Current directory:  /mnt/batch/tasks/shared/LS_root/jobs/dsvm-aml/azureml/tutorial-session-3_1600171983_763c5381/mounts/workspaceblobstore/azureml/tutorial-session-3_1600171983_763c5381
Preparing to call script [ train.py ] with arguments: ['--data_path', '$input', '--learning_rate', '0.003', '--momentum', '0.92']
After variable expansion, calling script [ train.py ] with arguments: ['--data_path', '/tmp/tmp9kituvp3', '--learning_rate', '0.003', '--momentum', '0.92']

Script type = None
===== DATA =====
DATA PATH: /tmp/tmp9kituvp3
LIST FILES IN DATA PATH...
['cifar-10-batches-py', 'cifar-10-python.tar.gz']
```

Notice:

1. Azure Machine Learning has automatically mounted the blob store on the compute cluster for you.
2. The `ds.as_mount()` used in the control script resolves to the mount point.
3. The machine learning code includes a line that lists the directories under the data directory. You can see the list in the output above.
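
The listing step in the third point could look roughly like the sketch below. This is an assumption modeled on the log output, not the actual code from the training script: `list_data_path` is a hypothetical helper, and a temporary local directory stands in for the mounted dataset.

```python
import os
import tempfile


def list_data_path(data_path: str) -> list:
    """Print the data path and return the sorted names of its entries."""
    entries = sorted(os.listdir(data_path))
    print("===== DATA =====")
    print("DATA PATH: " + data_path)
    print("LIST FILES IN DATA PATH...")
    print(entries)
    return entries


# Local stand-in for the mount point: a temporary directory with one subfolder.
with tempfile.TemporaryDirectory() as tmp_dir:
    os.mkdir(os.path.join(tmp_dir, "cifar-10-batches-py"))
    found = list_data_path(tmp_dir)
```

Printing the contents of the data path at the start of a run is a cheap way to confirm that the mount worked before training begins.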

If you're not going to use what you've created here, delete the resources you created in this tutorial so you don't incur any charges for storage. In the Azure portal, select and delete your resource group.