In this notebook, we register the previously prepared dataset within an Azure ML Workspace, so that we can use it for remote training on Azure ML Compute.

Before registering the data, we need to make it available in a shared location. To do that, we upload it to Azure Blob Storage using the azure-storage-blob package.

To learn more about registering datasets within Azure ML, please see [here](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-register-datasets).

In [1]:
# BlockBlobService (used below) only exists in the legacy releases, so pin the last one
!pip install azure-storage-blob==2.1.0



Here we upload our dataset from a local folder to the default Azure Blob Storage container associated with our Azure ML Workspace. For more details, please see [here](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-python).

You need to replace the values for *account_name*, *account_key*, and *container_name* with the values for your own corresponding resources.

You can find those values by logging into your [Azure ML studio environment](https://ml.azure.com) and clicking *Datastores* in the left menu, where the Storage Account Name and Blob Container Name are listed. To get the corresponding Storage Account Key, open your Azure ML Workspace in the [Azure Portal](https://ms.portal.azure.com), click the Storage Account associated with your workspace, and then click *Access keys* in the left menu. You can use either *key1* or *key2*.
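
Because the Storage Account Key is a secret, one option is to keep it out of the notebook and read it from the environment instead. The following is a minimal sketch of that idea, not part of the original workflow: the helper name `read_storage_settings` and the three environment variable names are hypothetical and chosen only for illustration.

```python
import os


def read_storage_settings(env=os.environ):
    """Return (account_name, account_key, container_name) read from `env`.

    The variable names below are illustrative assumptions, not names that
    Azure defines; set them however suits your environment.
    """
    names = (
        "AZURE_STORAGE_ACCOUNT_NAME",
        "AZURE_STORAGE_ACCOUNT_KEY",
        "AZURE_STORAGE_CONTAINER_NAME",
    )
    # Fail early with a clear message if any value is missing or empty.
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return tuple(env[name] for name in names)


# Example usage (requires the three variables to be set beforehand):
# account_name, account_key, container_name = read_storage_settings()
```

With this approach, the placeholders in the next cell would be replaced by the values returned from the helper rather than by literal strings.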

In [2]:
from azure.storage.blob import BlockBlobService

account_name = '<your azure storage account name>'
account_key = '<your azure storage account access key>'

block_blob_service = BlockBlobService(account_name=account_name, account_key=account_key)

container_name = '<your azure blob storage container name>'
blob_name = 'data/complaints_dataset/consumer_complaint_data_sample_prepared.csv'
file_path = './data/consumer_complaint_data_sample_prepared.csv'

block_blob_service.create_blob_from_path(container_name=container_name, blob_name=blob_name, file_path=file_path)

<azure.storage.blob.models.ResourceProperties at 0x7f0120c3a6a0>
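
The `BlockBlobService` class used above belongs to the legacy releases of azure-storage-blob (up to 2.1.0). Version 12 and later replace it with `BlobServiceClient`. If you are on a newer release, the same upload might look roughly like the sketch below. The function names `build_account_url` and `upload_file_v12` are hypothetical, and the endpoint format assumes a storage account in the public Azure cloud.

```python
def build_account_url(account_name):
    """Build the blob endpoint URL (assumes the public Azure cloud)."""
    return "https://{}.blob.core.windows.net".format(account_name)


def upload_file_v12(account_name, account_key, container_name, blob_name, file_path):
    """Upload a local file with the v12+ client; a sketch, not the notebook's method."""
    # Imported lazily so this block can be defined without the v12 package installed.
    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient(
        account_url=build_account_url(account_name),
        credential=account_key,
    )
    blob_client = service.get_blob_client(container=container_name, blob=blob_name)
    with open(file_path, "rb") as data:
        # overwrite=True replaces an existing blob with the same name.
        blob_client.upload_blob(data, overwrite=True)


# Example usage, with the same variables as in the cell above:
# upload_file_v12(account_name, account_key, container_name, blob_name, file_path)
```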

To register the dataset within Azure ML, we first need to get a reference to the [workspace](https://docs.microsoft.com/en-us/azure/machine-learning/concept-workspace) we are registering it to.

We use the [Azure ML SDK](https://docs.microsoft.com/en-us/python/api/overview/azureml-sdk/?view=azure-ml-py) for that. If you don’t have it installed in your development environment, please follow the instructions [here](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-environment#local). If you want to run the code on a managed VM instance, which already has the SDK, please see [here](https://docs.microsoft.com/en-us/azure/machine-learning/tutorial-1st-experiment-sdk-setup).

You need to replace the values for *subscription_id*, *resource_group*, and *workspace_name* with the values for your own corresponding resources.

In [3]:
from azureml.core.authentication import InteractiveLoginAuthentication
from azureml.core import Workspace

interactive_auth = InteractiveLoginAuthentication()

subscription_id = '<your azure subscription id>'
resource_group = '<your azure ml workspace resource group>'
workspace_name = '<your azure ml workspace name>'

workspace = Workspace(subscription_id=subscription_id, resource_group=resource_group, workspace_name=workspace_name,
                      auth=interactive_auth)
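
As an alternative to passing the three identifiers explicitly in every notebook, the Azure ML SDK can also load them from a `config.json` file through `Workspace.from_config()`. The sketch below writes such a file. The helper name `write_workspace_config` is hypothetical, and the `.azureml` folder is assumed here because it is one of the locations `from_config()` searches by default.

```python
import json
import os


def write_workspace_config(subscription_id, resource_group, workspace_name,
                           folder=".azureml"):
    """Write a config.json with the keys that Workspace.from_config() reads."""
    os.makedirs(folder, exist_ok=True)
    config_path = os.path.join(folder, "config.json")
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    with open(config_path, "w") as config_file:
        json.dump(config, config_file, indent=2)
    return config_path


# Example usage, with the same variables as in the cell above:
# write_workspace_config(subscription_id, resource_group, workspace_name)
# workspace = Workspace.from_config(auth=interactive_auth)
```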

Finally, we register our dataset as a [Dataset](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.dataset.dataset?view=azure-ml-py) object within our Azure ML Workspace.

Note that the Azure Storage Account must already be registered as a [Datastore](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.datastore.datastore?view=azure-ml-py). The default Azure Storage Account associated with our Azure ML Workspace is registered as a Datastore by default, so we only need to specify its name, which is *workspaceblobstore*.

In [4]:
from azureml.core import Datastore, Dataset

datastore = Datastore.get(workspace, 'workspaceblobstore')

datastore_path = [(datastore, 'data/complaints_dataset/consumer_complaint_data_sample_prepared.csv')]
dataset = Dataset.File.from_files(path=datastore_path)

dataset_name = 'Consumer Complaints Dataset'
dataset_description = 'Consumer Complaint Database. Source: https://catalog.data.gov/dataset/consumer-complaint-database'
dataset = dataset.register(workspace=workspace, name=dataset_name, description=dataset_description)
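
To check that the registration worked, the dataset can be retrieved by name and a few rows inspected. The following is a rough sketch under stated assumptions: `preview_file_dataset` is a hypothetical helper, and it assumes the dataset object offers a `download(target_path=..., overwrite=...)` method that returns the list of downloaded file paths, as a file dataset in the Azure ML SDK is expected to.

```python
import csv
import tempfile


def preview_file_dataset(dataset, n_rows=5, target_dir=None):
    """Download the dataset's files and return the first rows of the first CSV."""
    target_dir = target_dir or tempfile.mkdtemp()
    # Assumed to return a list of local paths, one per downloaded file.
    paths = dataset.download(target_path=target_dir, overwrite=True)
    with open(paths[0], newline="", encoding="utf-8") as csv_file:
        reader = csv.reader(csv_file)
        # zip with range() stops after n_rows rows (the header counts as one).
        return [row for _, row in zip(range(n_rows), reader)]


# Example usage against the workspace from the cells above:
# registered = Dataset.get_by_name(workspace, name='Consumer Complaints Dataset')
# for row in preview_file_dataset(registered):
#     print(row)
```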