## Working with datasets and datastores
This notebook explains how to work with datasets stored in Azure Data Lake.

## Files on datalake 

We have created a storage account named `shareddatalakedev` to simulate a datalake. Typically, files for ML are provisioned by a data engineering team, and data science teams consume the datasets that team maintains.

These datasets live on a datalake, so the starting point for any ML/MLOps project is the dataset provisioned by the data engineering team.

In our case `shareddatalakedev` represents the datalake. This storage account has two containers:

- silver
- bronze

In our case, the silver container holds the modelling data: training and test datasets for the month of January.

We need to pull data from this container and register it in the Azure ML workspace.
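
To make the layout concrete, here is a minimal sketch of how a file inside one of these containers maps to a blob URL. The helper name `blob_url` is hypothetical, the example path is the training file registered later in this notebook, and the `blob.core.windows.net` host assumes the public Azure cloud.

```python
from urllib.parse import quote


def blob_url(account_name: str, container_name: str, blob_path: str) -> str:
    """Return the HTTPS URL of a blob (public Azure cloud endpoint assumed)."""
    # Make the path relative to the container root: drop leading "./" and "/".
    path = blob_path
    while path.startswith("./"):
        path = path[2:]
    path = path.lstrip("/")
    # Percent-encode unusual characters but keep "/" as the separator.
    return (
        f"https://{account_name}.blob.core.windows.net/"
        f"{container_name}/{quote(path, safe='/')}"
    )


print(blob_url("shareddatalakedev", "silver",
               "./mobile-pricing-train/01_01_24/train.csv"))
```

The URL alone does not grant access; a key or a role assignment is still required, as described in the following sections.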

## Register blob storage as a datastore

Register the data in `shareddatalakedev/silver` as a datastore.

For this you need the storage account key. Use the Azure CLI command `az storage account keys list --resource-group <resource group containing storage account> --account-name <storageaccountname>` to view the keys.
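
If you prefer to fetch the key from Python instead of copying it by hand, the CLI call above can be wrapped as in the sketch below. The function names are hypothetical, and the parsing assumes the CLI returns a JSON array whose entries carry the key in a `value` field; check the output of your CLI version before relying on it.

```python
import json
import shutil
import subprocess


def build_keys_command(resource_group: str, account_name: str) -> list[str]:
    """Build the argument list for `az storage account keys list`."""
    return [
        "az", "storage", "account", "keys", "list",
        "--resource-group", resource_group,
        "--account-name", account_name,
        "--output", "json",
    ]


def first_key_value(cli_json: str) -> str:
    """Extract the first key from the CLI's JSON output (assumed shape)."""
    keys = json.loads(cli_json)
    if not keys:
        raise ValueError("the CLI returned no keys")
    return keys[0]["value"]


def fetch_account_key(resource_group: str, account_name: str) -> str:
    """Run the Azure CLI and return the first storage account key."""
    az_path = shutil.which("az")  # also resolves az.cmd on Windows
    if az_path is None:
        raise FileNotFoundError("Azure CLI ('az') not found on PATH")
    command = build_keys_command(resource_group, account_name)
    command[0] = az_path
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return first_key_value(result.stdout)
```

Calling `fetch_account_key("<resource group>", "shareddatalakedev")` requires a prior `az login`. Avoid printing the returned key in notebook output.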


In [2]:
## create a client
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
from azure.ai.ml.entities import AzureBlobDatastore
from azure.ai.ml.entities import AccountKeyConfiguration

credential = DefaultAzureCredential()

subs_id = '0aa1c63a-7a46-403c-91e4-8ec91264bc42'
rg_name = 'rg-mobile-pricing-dev' 
ws_name = 'mobile-pricing-azml-dev'
ml_client = MLClient(
    credential=credential,
    subscription_id=subs_id,
    resource_group_name=rg_name,
    workspace_name=ws_name,
)
## create a data-store
name = "mobile_pricing"
description = "training dataset stored in blob store"
account_name = "shareddatalakedev"
container_name = "silver"
account_key = ""  # fill in your storage account key; never commit it
store = AzureBlobDatastore(
    name=name,
    description=description,
    account_name=account_name,
    container_name=container_name,
    protocol="https",
    credentials=AccountKeyConfiguration(account_key=account_key)
)

ml_client.create_or_update(store)

AzureBlobDatastore({'type': <DatastoreType.AZURE_BLOB: 'AzureBlob'>, 'name': 'mobile_pricing', 'description': 'training dataset stored in blob store', 'tags': {}, 'properties': {}, 'print_as_yaml': True, 'id': '/subscriptions/0aa1c63a-7a46-403c-91e4-8ec91264bc42/resourceGroups/rg-mobile-pricing-dev/providers/Microsoft.MachineLearningServices/workspaces/mobile-pricing-azml-dev/datastores/mobile_pricing', 'Resource__source_path': None, 'base_path': 'C:\\Users\\gunnv\\OneDrive\\consulting\\setu\\mlops_azure\\content\\01_Mlops_Using_Cloud_Tools\\datasets_azure_ml\\notebooks', 'creation_context': None, 'serialize': <msrest.serialization.Serializer object at 0x000002551ABF1B70>, 'credentials': {'type': 'account_key'}, 'container_name': 'silver', 'account_name': 'shareddatalakedev', 'endpoint': 'core.windows.net', 'protocol': 'https'})

**The datastore will now appear in the Azure ML workspace**

![](../../../../images/datastore1.png)

**Clicking on the datastore shows the paths in the silver container of our storage account**

![](../../../../images/datastore2.png)

After this action you will be able to see the datastore named `mobile_pricing`.

### Register dataset from datastore

Once a datastore has been created, one can register a dataset from it and then version it.

In [3]:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
subscription = "0aa1c63a-7a46-403c-91e4-8ec91264bc42"
resource_group = "rg-mobile-pricing-dev"
workspace = "mobile-pricing-azml-dev"
datastore_name = "mobile_pricing" 
path_on_datastore = "./mobile-pricing-train/01_01_24/train.csv"  # relative to the container root; check it carefully
uri = f'azureml://subscriptions/{subscription}/resourcegroups/{resource_group}/workspaces/{workspace}/datastores/{datastore_name}/paths/{path_on_datastore}'
print(uri)

azureml://subscriptions/0aa1c63a-7a46-403c-91e4-8ec91264bc42/resourcegroups/rg-mobile-pricing-dev/workspaces/mobile-pricing-azml-dev/datastores/mobile_pricing/paths/./mobile-pricing-train/01_01_24/train.csv
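
The printed URI contains `paths/./mobile-pricing-train/...` because the path starts with `./`. The sketch below wraps the same f-string in a hypothetical helper, `datastore_uri`, that strips the leading `./` or `/` first. The normalization is a tidiness choice made for this example, not a documented requirement of the service.

```python
def datastore_uri(subscription: str, resource_group: str, workspace: str,
                  datastore_name: str, path_on_datastore: str) -> str:
    """Build an azureml:// URI for a path on a registered datastore."""
    # Make the path relative to the datastore root: drop leading "./" and "/".
    path = path_on_datastore
    while path.startswith("./"):
        path = path[2:]
    path = path.lstrip("/")
    return (
        f"azureml://subscriptions/{subscription}"
        f"/resourcegroups/{resource_group}"
        f"/workspaces/{workspace}"
        f"/datastores/{datastore_name}/paths/{path}"
    )


uri = datastore_uri(
    "0aa1c63a-7a46-403c-91e4-8ec91264bc42",
    "rg-mobile-pricing-dev",
    "mobile-pricing-azml-dev",
    "mobile_pricing",
    "./mobile-pricing-train/01_01_24/train.csv",
)
print(uri)
```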


In [4]:
VERSION="1"
mobile_pricing_train_data = Data(path=uri,
                         type=AssetTypes.URI_FILE,
    description="Train data for mobile pricing",
    name="mobile_pricing_train_data",
    version=VERSION,
)

In [5]:
ml_client.data.create_or_update(mobile_pricing_train_data)

Data({'skip_validation': False, 'mltable_schema_url': None, 'referenced_uris': None, 'type': 'uri_file', 'is_anonymous': False, 'auto_increment_version': False, 'auto_delete_setting': None, 'name': 'mobile_pricing_train_data', 'description': 'Train data for mobile pricing', 'tags': {}, 'properties': {}, 'print_as_yaml': True, 'id': '/subscriptions/0aa1c63a-7a46-403c-91e4-8ec91264bc42/resourceGroups/rg-mobile-pricing-dev/providers/Microsoft.MachineLearningServices/workspaces/mobile-pricing-azml-dev/data/mobile_pricing_train_data/versions/1', 'Resource__source_path': None, 'base_path': 'C:\\Users\\gunnv\\OneDrive\\consulting\\setu\\mlops_azure\\content\\01_Mlops_Using_Cloud_Tools\\datasets_azure_ml\\notebooks', 'creation_context': <azure.ai.ml.entities._system_data.SystemData object at 0x0000025512701810>, 'serialize': <msrest.serialization.Serializer object at 0x000002551ABBAE90>, 'version': '1', 'latest_version': None, 'path': 'azureml://subscriptions/0aa1c63a-7a46-403c-91e4-8ec91264bc

**Once the dataset is registered we can see it in the datasets tab**
![](../../../../images/dataset1.png)

**Clicking on the dataset opens the detailed view**

![](../../../../images/dataset2.png)
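
Registering the same asset name with a new version string creates a new version of the dataset. Here is a minimal sketch for picking the next version, assuming versions are integer strings such as the `"1"` used above. The helper `next_version` is hypothetical, and the commented usage assumes the existing versions are collected with `ml_client.data.list(name=...)`.

```python
def next_version(existing_versions) -> str:
    """Return the next integer version as a string, starting from "1"."""
    # Ignore any non-numeric labels; only integer-like versions are counted.
    numeric = [int(v) for v in existing_versions if str(v).isdigit()]
    return str(max(numeric) + 1) if numeric else "1"


# Hypothetical usage against the workspace (not executed here):
# versions = [d.version for d in ml_client.data.list(name="mobile_pricing_train_data")]
# VERSION = next_version(versions)
print(next_version(["1", "2"]))
```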

In [6]:
import pandas as pd

In [7]:
data_asset = ml_client.data.get("mobile_pricing_train_data", version="1")
df = pd.read_csv(data_asset.path)

In [8]:
df.head(2)

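| Unnamed: 0 | battery_power | blue | clock_speed | dual_sim | fc | four_g | int_memory | m_dep | mobile_wt | n_cores | ... | px_height | px_width | ram | sc_h | sc_w | talk_time | three_g | touch_screen | wifi | price_range |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842 | 0 | 2.2 | 0 | 1 | 0 | 7 | 0.6 | 188 | 2 | ... | 20 | 756 | 2549 | 9 | 7 | 19 | 0 | 0 | 1 | 1 |
| 1 | 1021 | 1 | 0.5 | 1 | 0 | 1 | 53 | 0.7 | 136 | 3 | ... | 905 | 1988 | 2631 | 17 | 3 | 7 | 1 | 1 | 0 | 2 |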


## Storing secrets 

To access the storage account we needed a secret access key. It is not good practice to store secrets in a notebook or in source code; they should be kept in a secure place and retrieved from there at run time. We can use the Azure Key Vault service to store our secrets.

We are going to store the following secrets in key-vault and then access these secrets from there:

1. subscription-id
2. ml-resource-group
3. ml-workspace-name
4. storage-account-name
5. storage-account-key

Whenever an Azure ML workspace is created, a key vault is created along with it. We can use that key vault to store and retrieve the storage account key.

In [9]:
from azure.keyvault.secrets import SecretClient
from azure.identity import DefaultAzureCredential
keyVaultName = "mobilepricinga6945442583"
KVUri = f"https://{keyVaultName}.vault.azure.net"
credential = DefaultAzureCredential()
client = SecretClient(vault_url=KVUri, credential=credential)
account_key = client.get_secret("storage-account-key")

In [None]:
# account_key.value holds the storage account key as a string
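
The same pattern extends to all five secrets listed above. The sketch below uses a hypothetical helper, `load_settings`, that accepts any object exposing `get_secret(name)` with a `.value` attribute on the result, which is how the `SecretClient` in the previous cell is used. It assumes the secrets were stored under exactly the names in the list.

```python
SECRET_NAMES = (
    "subscription-id",
    "ml-resource-group",
    "ml-workspace-name",
    "storage-account-name",
    "storage-account-key",
)


def load_settings(secret_client, names=SECRET_NAMES) -> dict:
    """Fetch each named secret and return a {name: value} dictionary."""
    return {name: secret_client.get_secret(name).value for name in names}


# Hypothetical usage with the client created above (not executed here):
# settings = load_settings(client)
# account_key = settings["storage-account-key"]
```

Keep the returned dictionary in memory only; printing it would put the secrets into the notebook output.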

## Other ways of accessing data from azure storage account

Apart from generating an access key and storing it in a key vault, one can do the following to allow access to the data:

1. Give the ML workspace's managed identity the `Storage Blob Data Contributor` role on the storage account
2. Create a service principal and give it access to the storage account. Save the service principal's secrets in a key vault
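
For the first option, the role is assigned at the scope of the storage account. The sketch below builds that scope string and the matching `az role assignment create` argument list. The helper names are hypothetical, and the resource group and principal ID are placeholders to replace with your own values.

```python
def storage_account_scope(subscription_id: str, resource_group: str,
                          account_name: str) -> str:
    """Return the ARM resource ID of a storage account."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Storage/storageAccounts/{account_name}"
    )


def role_assignment_command(principal_id: str, scope: str,
                            role: str = "Storage Blob Data Contributor") -> list[str]:
    """Build the argument list for `az role assignment create`."""
    return [
        "az", "role", "assignment", "create",
        "--assignee", principal_id,
        "--role", role,
        "--scope", scope,
    ]


# Placeholder values, for illustration only.
scope = storage_account_scope("<subscription-id>", "<storage-resource-group>",
                              "shareddatalakedev")
print(" ".join(role_assignment_command("<principal-id>", scope)))
```

When pasting the printed command into a shell, put quotes around the role name because it contains spaces.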