# Create Azure Machine Learning Datastore

**Requirements** - In order to benefit from this tutorial, you will need:
- A basic understanding of Machine Learning
- An Azure account with an active subscription. [Create an account for free](https://azure.microsoft.com/free/?WT.mc_id=A261C142F)
- An Azure ML workspace. [Check this notebook for creating a workspace](/sdk/resources/workspace/workspace.ipynb) 
- A Python environment
- Installed Azure Machine Learning Python SDK v2 - [install instructions](/sdk/README.md#getting-started)

**Learning Objectives** - By the end of this tutorial, you should be able to:
- Create an Azure Machine Learning datastore from the Python SDK for:
  - Azure Blob Storage container
  - Azure File share
  - Azure Data Lake Storage Gen1
  - Azure Data Lake Storage Gen2
- Use a datastore in a CommandJob

**Motivations** - Azure Machine Learning datastores securely store the connection information for your data storage, so you don't have to hard-code it in your scripts. This tutorial shows how to create datastores for several types of storage service.

**Note** - The `credentials` values in these samples are redacted. Replace the redacted `account_key`, `sas_token`, `tenant_id`, `client_id` and `client_secret` values with your own.
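
One possible way to keep real secrets out of the notebook is to read them from environment variables when building a `credentials` dictionary. The sketch below is only an illustration: the helper `credentials_from_env` and the variable name `AZURE_STORAGE_ACCOUNT_KEY` are hypothetical and not part of the SDK.

```python
import os


def credentials_from_env(mapping, environ=None):
    """Build a credentials dict from environment variables.

    `mapping` maps credential field names (for example 'account_key')
    to the names of the environment variables holding their values.
    The variable names are hypothetical; choose your own.
    """
    environ = os.environ if environ is None else environ
    # Collect every variable that is unset or empty so the error names them all.
    missing = [var for var in mapping.values() if not environ.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(sorted(missing)))
    return {field: environ[var] for field, var in mapping.items()}


# Example using an in-memory stand-in for os.environ.
fake_env = {"AZURE_STORAGE_ACCOUNT_KEY": "not-a-real-key"}
creds = credentials_from_env({"account_key": "AZURE_STORAGE_ACCOUNT_KEY"}, fake_env)
print(creds)
```

The resulting dictionary has the same shape as the `credentials` arguments used in the samples in this tutorial.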

# 1. Connect to Azure Machine Learning Workspace

The [workspace](https://docs.microsoft.com/en-us/azure/machine-learning/concept-workspace) is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. In this section we will connect to the workspace in which the datastore will be created.

## 1.1. Import the required libraries

In [None]:
#import required libraries
from azure.ml import MLClient
from azure.identity import DefaultAzureCredential
from azure.ml.entities import AzureBlobDatastore, AzureFileDatastore, AzureDataLakeGen1Datastore, AzureDataLakeGen2Datastore
from azure.ml.entities import CommandJob, JobInput, Environment

## 1.2. Configure workspace details and get a handle to the workspace

To connect to a workspace, we need its identifying parameters: a subscription ID, a resource group and a workspace name. We pass these details to the `MLClient` from `azure.ml` to get a handle to the required Azure Machine Learning workspace. We use [default Azure authentication](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity.defaultazurecredential?view=azure-python) for this tutorial. More advanced connection methods can be found [here](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity?view=azure-python).

In [None]:
#Enter details of your AML workspace
subscription_id = '<SUBSCRIPTION_ID>'
resource_group = '<RESOURCE_GROUP>'
workspace = '<AML_WORKSPACE_NAME>'

In [None]:
#get a handle to the workspace
ml_client = MLClient(DefaultAzureCredential(), subscription_id, resource_group, workspace)

# 2. Create Datastore
Datastores are attached to workspaces and store the connection information for storage services. You can then refer to a storage service by its datastore name instead of remembering the connection details and secrets needed to connect to it.

## 2.1 Create a datastore for Azure Blob Storage container
The `AzureBlobDatastore` can be used to create datastores for Azure blob containers. The key parameters needed to create this type of datastore are:
- `name` - Name of the datastore.
- `account_name` - Name of the Azure storage account.
- `container_name` - Name of the container in the storage account.
- `protocol` - Protocol to use to connect to the container. `https` and `wasbs` are supported. The default is `https`.
- `credentials` - Credentials for connecting to the Azure storage account. You can provide either an account key (`account_key`) or a shared access signature (SAS) token (`sas_token`). Credential secrets are stored in the workspace key vault.
- `description` - Description of the datastore.
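
The rules in the list above (one of two credential kinds, one of two protocols) can be checked before calling the SDK. The following is a minimal sketch under those stated rules; `check_blob_credentials` and `check_blob_protocol` are hypothetical helpers written for illustration and are not SDK functions.

```python
ALLOWED_CREDENTIAL_KEYS = {"account_key", "sas_token"}
SUPPORTED_PROTOCOLS = ("https", "wasbs")


def check_blob_credentials(credentials=None):
    """Return which kind of credential was supplied.

    Returns 'account_key', 'sas_token', or 'none' when no credentials
    are given (the credential-less case shown in section 2.1.4).
    """
    if not credentials:
        return "none"
    keys = set(credentials)
    unknown = keys - ALLOWED_CREDENTIAL_KEYS
    if unknown:
        raise ValueError("Unexpected credential fields: " + ", ".join(sorted(unknown)))
    if len(keys) != 1:
        raise ValueError("Provide either 'account_key' or 'sas_token', not both.")
    return keys.pop()


def check_blob_protocol(protocol="https"):
    """Return the protocol unchanged if it is one of the supported values."""
    if protocol not in SUPPORTED_PROTOCOLS:
        raise ValueError("Unsupported protocol: " + repr(protocol))
    return protocol


print(check_blob_credentials({"account_key": "not-a-real-key"}))
print(check_blob_protocol("wasbs"))
```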

### 2.1.1 Create a datastore with account key
In this sample we will use an account key to connect to the storage.

In [None]:
#create a datastore pointing to a blob container using an account key
blob_datastore1 = AzureBlobDatastore(
    name='blob-example',
    description = 'Datastore pointing to a blob container.',
    account_name = 'mytestblobstore',
    container_name = 'data-container',
    credentials = {'account_key': 'XXXxxxXXXxXXXXxxXXXXXxXXXXXxXxxXxXXXxXXXxXXxxxXXxxXXXxXxXXXxxXxxXXXXxxxxxXXxxxxxxXXXxXXX'})
ml_client.create_or_update(blob_datastore1)

### 2.1.2 Create a datastore with SAS token
In this sample we will use a shared access signature (`SAS`) token to connect to the storage.

In [None]:
#create a SAS based blob datastore
blob_sas_datastore = AzureBlobDatastore(
    name='blob-sas-example',
    description = 'Datastore pointing to a blob container using SAS token.',
    account_name = 'mytestblobstore',
    container_name = 'data-container',
    credentials = {'sas_token': '?xx=XXXX-XX-XX&xx=xxxx&xxx=xxx&xx=xxxxxxxxxxx&xx=XXXX-XX-XXXXX:XX:XXX&xx=XXXX-XX-XXXXX:XX:XXX&xxx=xxxxx&xxx=XXxXXXxxxxxXXXXXXXxXxxxXXXXXxxXXXXXxXXXXxXXXxXXxXX'})
ml_client.create_or_update(blob_sas_datastore)

### 2.1.3 Create a datastore with account key and wasbs protocol
In this sample we will use an account key to connect to the storage using the `wasbs` protocol.

In [None]:
#create a datastore pointing to a blob container using wasbs protocol
blob_wasb_datastore = AzureBlobDatastore(
    name='blob-protocol-example',
    description = 'Datastore pointing to a blob container using wasbs protocol.',
    account_name = 'mytestblobstore',
    container_name = 'data-container',
    protocol='wasbs',
    credentials = {'account_key': 'XXXxxxXXXxXXXXxxXXXXXxXXXXXxXxxXxXXXxXXXxXXxxxXXxxXXXxXxXXXxxXxxXXXXxxxxxXXxxxxxxXXXxXXX'})
ml_client.create_or_update(blob_wasb_datastore)

### 2.1.4 Create a datastore without adding any credentials
In this sample we will create a datastore without storing any credentials. When this datastore is used in a job, the identity used to run the job will also be used to access the datastore.

In [None]:
#create a credential-less datastore pointing to a blob container
blob_credless_datastore = AzureBlobDatastore(
    name='blob-credless-example',
    description = 'Credential-less datastore pointing to a blob container.',
    account_name = 'mytestblobstore',
    container_name = 'data-container')
ml_client.create_or_update(blob_credless_datastore)

## 2.2 Create a datastore for Azure File Share
The `AzureFileDatastore` can be used to create datastores for Azure File Share. The key parameters needed to create this type of datastore are:
- `name` - Name of the datastore.
- `account_name` - Name of the Azure storage account.
- `file_share_name` - Name of the file share in the storage account.
- `protocol` - Protocol to use to connect to the file share. Only `https` is supported.
- `credentials` - Credentials for connecting to the Azure storage account. You can provide either an account key (`account_key`) or a shared access signature (SAS) token (`sas_token`). Credential secrets are stored in the workspace key vault.
- `description` - Description of the datastore.

### 2.2.1 Create a datastore with account key
In this sample we will use an account key to connect to the storage.

In [None]:
#Datastore pointing to an Azure File Share
file_datastore = AzureFileDatastore(
    name = 'file-example',
    description = 'Datastore pointing to an Azure File Share.',
    account_name = 'mytestfilestore',
    file_share_name = 'my-share',
    credentials = {'account_key': 'XXXxxxXXXxXXXXxxXXXXXxXXXXXxXxxXxXXXxXXXxXXxxxXXxxXXXxXxXXXxxXxxXXXXxxxxxXXxxxxxxXXXxXXX'})
ml_client.create_or_update(file_datastore)

### 2.2.2 Create a datastore with SAS token
In this sample we will use a shared access signature (`SAS`) token to connect to the storage.

In [None]:
#Datastore pointing to an Azure File Share using SAS token
file_sas_datastore = AzureFileDatastore(
    name = 'file-sas-example',
    description = 'Datastore pointing to an Azure File Share using SAS token.',
    account_name = 'mytestfilestore',
    file_share_name = 'my-share',
    credentials = {'sas_token': '?xx=XXXX-XX-XX&xx=xxxx&xxx=xxx&xx=xxxxxxxxxxx&xx=XXXX-XX-XXXXX:XX:XXX&xx=XXXX-XX-XXXXX:XX:XXX&xxx=xxxxx&xxx=XXxXXXxxxxxXXXXXXXxXxxxXXXXXxxXXXXXxXXXXxXXXxXXxXX'})
ml_client.create_or_update(file_sas_datastore)

## 2.3 Create a datastore for Azure Data Lake Storage Gen1
The `AzureDataLakeGen1Datastore` class can be used to create datastores for Azure Data Lake Storage Gen1. The key parameters needed to create this type of datastore are:
- `name` - Name of the datastore.
- `store_name` - Name of the Azure Data Lake Storage Gen1 account.
- `credentials` - Service principal credentials for connecting to the Azure storage account. Credential secrets are stored in the workspace key vault.
  - `tenant_id` - The tenant ID of the service principal.
  - `client_id` - The client ID of the service principal.
  - `client_secret` - The client secret of the service principal.
- `description` - Description of the datastore.
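
Because the samples ship with redacted placeholders, it can help to confirm that the service principal fields have actually been filled in before creating the datastore. The sketch below is a hypothetical pre-flight check, not part of the SDK. It assumes that `tenant_id` and `client_id` are GUIDs, as the placeholder shapes in this tutorial suggest.

```python
import uuid

REQUIRED_FIELDS = ("tenant_id", "client_id", "client_secret")


def validate_service_principal(credentials):
    """Raise ValueError if a field is missing or an ID is not a GUID."""
    missing = [field for field in REQUIRED_FIELDS if not credentials.get(field)]
    if missing:
        raise ValueError("Missing credential fields: " + ", ".join(missing))
    # Assumption: both IDs are GUIDs. An unreplaced 'XXXXXXXX-...' placeholder
    # is not valid hexadecimal, so it fails this check.
    for field in ("tenant_id", "client_id"):
        try:
            uuid.UUID(str(credentials[field]))
        except ValueError:
            raise ValueError(field + " is not a valid GUID") from None
    return True


example = {
    "tenant_id": "11111111-2222-3333-4444-555555555555",  # made-up value
    "client_id": "66666666-7777-8888-9999-000000000000",  # made-up value
    "client_secret": "not-a-real-secret",
}
print(validate_service_principal(example))
```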

In [None]:
#create a datastore pointing to an Azure Data Lake Storage Gen1 account
adlsg1_datastore = AzureDataLakeGen1Datastore(
    name = 'adls-gen1-example',
    description = 'Datastore pointing to an Azure Data Lake Storage Gen1.',
    store_name = 'mytestdatalakegen1', 
    credentials = {
        'tenant_id': 'XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX',
        'client_id': 'XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX',
        'client_secret': 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'})
ml_client.create_or_update(adlsg1_datastore)

## 2.4 Create a datastore for Azure Data Lake Storage Gen2
The `AzureDataLakeGen2Datastore` class can be used to create datastores for Azure Data Lake Storage Gen2. The key parameters needed to create this type of datastore are:
- `name` - Name of the datastore.
- `account_name` - Name of the Azure Data Lake Storage Gen2 account.
- `filesystem` - Name of the file system, that is, the parent directory that contains the files and folders. This is equivalent to a container in Azure Blob storage.
- `protocol` - Protocol to use to connect to the file system. `https` and `abfs` are supported. The default is `https`.
- `credentials` - Service principal credentials for connecting to the Azure storage account. Credential secrets are stored in the workspace key vault.
  - `tenant_id` - The tenant ID of the service principal.
  - `client_id` - The client ID of the service principal.
  - `client_secret` - The client secret of the service principal.
- `description` - Description of the datastore.

In [None]:
#create a datastore pointing to an Azure Data Lake Storage Gen2 file system
adlsg2_datastore = AzureDataLakeGen2Datastore(
    name = 'adls-gen2-example',
    description = 'Datastore pointing to an Azure Data Lake Storage Gen2.',
    account_name = 'mytestdatalakegen2',
    filesystem = 'my-gen2-container',
    credentials = {
        'tenant_id': 'XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX',
        'client_id': 'XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX',
        'client_secret': 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'})
ml_client.create_or_update(adlsg2_datastore)

# 3. Using the datastore in a Job
A datastore can be used in a job such as a `CommandJob` or a `PipelineJob`. The snippet below lists the contents of a datastore in a `CommandJob`. It uses the default datastore `workspaceblobstore`, which is created with every Azure Machine Learning workspace.

A datastore can be referenced as a folder using the format `azureml://datastores/<datastore-name>/paths/<optional-path>`.
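
As a small illustration of that format, the hypothetical helper below (not an SDK function) assembles the URI from a datastore name and an optional path. The path `raw/2021/` is made up for the example.

```python
def datastore_uri(datastore_name, path=""):
    """Build an azureml:// folder URI for a datastore and optional path."""
    if not datastore_name:
        raise ValueError("datastore_name must not be empty")
    # Strip leading slashes so the result never contains 'paths//'.
    return "azureml://datastores/{}/paths/{}".format(datastore_name, path.lstrip("/"))


print(datastore_uri("workspaceblobstore"))
# azureml://datastores/workspaceblobstore/paths/
print(datastore_uri("blob-example", "raw/2021/"))
# azureml://datastores/blob-example/paths/raw/2021/
```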

In [None]:
explore_datastore = CommandJob(
    command = 'ls ${{inputs.datastore}}',
    inputs = {'datastore': JobInput(folder='azureml://datastores/workspaceblobstore/paths/')},
    environment=Environment(image='python:latest'),
    compute = "cpu-cluster", 
    display_name="using-datastore"
)

#submit the command job
returned_job = ml_client.create_or_update(explore_datastore)
#get a URL for the status of the job
returned_job.services["Studio"].endpoint