# In this notebook, you'll explore how to get data from datastores (big data) and how to use data assets.


You'll need the latest version of the **azure-ai-ml** package to run the code in this notebook. Verify that it is installed before you continue.

## Setup

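The cells below load environment variables from a `.env` file and add the project root to `sys.path`. The second step is important if you want to use your own `.py` modules to create jobs, rather than writing your code directly in the notebook.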




In [None]:
from dotenv import load_dotenv

load_dotenv()

In [3]:
import os
import sys

project_root_directory = os.getcwd().split("/notebooks")[0]
sys.path.insert(0, project_root_directory)
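
Splitting the working directory on `"/notebooks"` assumes POSIX-style separators. A minimal sketch of a more portable alternative follows, assuming the same layout (a `notebooks` folder directly under the project root). `find_project_root` is a hypothetical helper, not part of any library.

```python
import sys
from pathlib import Path


def find_project_root(start: Path, marker: str = "notebooks") -> Path:
    """Return the parent of the nearest folder named `marker`, searching `start` and its ancestors.

    Falls back to `start` itself when no such folder is found.
    """
    for candidate in (start, *start.parents):
        if candidate.name == marker:
            return candidate.parent
    return start


# Put the project root at the front of sys.path so local .py modules can be imported.
project_root_directory = find_project_root(Path.cwd())
if str(project_root_directory) not in sys.path:
    sys.path.insert(0, str(project_root_directory))
```

Unlike the string split, this variant matches only a folder named exactly `notebooks` and picks the nearest one when several are nested.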

## Connect to your workspace

With the required SDK packages installed, you're now ready to connect to your workspace.

To connect to a workspace, we need identifier parameters - a subscription ID, resource group name, and workspace name. Since you're working with a compute instance, managed by Azure Machine Learning, you can use the default values to connect to the workspace.
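
These three identifiers are typically stored in a `config.json` file, which `MLClient.from_config` reads. On a compute instance the file is already provided. The sketch below shows one way to write such a file by hand when working elsewhere; `write_workspace_config` and the placeholder values are hypothetical.

```python
import json
from pathlib import Path


def write_workspace_config(
    directory: Path, subscription_id: str, resource_group: str, workspace_name: str
) -> Path:
    """Write a config.json holding the three workspace identifiers and return its path."""
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    config_path = directory / "config.json"
    config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config_path


# Example call with placeholder values (replace them with your own):
# write_workspace_config(Path.cwd(), "<subscription-id>", "<resource-group>", "<workspace-name>")
```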

In [7]:
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
from azure.ai.ml import MLClient

try:
    credential = DefaultAzureCredential()
    # Check whether the credential can obtain a token.
    credential.get_token("https://management.azure.com/.default")
except Exception:
    # Fall back to InteractiveBrowserCredential if DefaultAzureCredential does not work.
    credential = InteractiveBrowserCredential()

In [8]:
# Get a handle to workspace
ml_client = MLClient.from_config(credential=credential)

Found the config file in: /config.json





## Configure the environment

Let's explore the environments within the workspace.


> **Note**:
> If the **azure-ai-ml** package is not installed, run `pip install azure-ai-ml` to install it.

In [4]:
envs = ml_client.environments.list()
for env in envs:
    print(env.name)

DefaultNcdEnv-openmpi4-1-0-cuda11-8-cudnn8-ubuntu22-04
realesr_inference
fine-tunning-embedding
batch-inference-ncd-env
DefaultNcdEnv-ai-ml-automl
inferencing-env
custom-env-embedding
DefaultNcdEnv-foundation-model-inference
env-fine-tuning-inference
env-fine-tuning
fine-tuning-v3
fine-tuning-conda
finetuning-env
fine-tuning-env-2
fine-tuning-env
DefaultNcdEnv-mlflow-ubuntu20-04-py38-cpu-inference
docker-context-repo-based-v1
AzureML-ACPT-pytorch-1.13-py38-cuda11.7-gpu


Submitting the job with the new custom environment triggers the build of the environment. The first time you use a newly created environment, it can take 10-15 minutes to build the environment, which also means your job will take longer to complete.
You can also choose to manually trigger the build of the environment before you submit a job. The environment only needs to be built the first time you use it.

## Create the container and datastore if they do not exist

In [12]:
from azure.storage.blob import BlobServiceClient
from azure.core.exceptions import ResourceExistsError

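# Read the account name first: nesting double quotes inside an f-string fails before Python 3.12.
storage_account_name = os.getenv("storage_account_name")
account_url = f"https://{storage_account_name}.blob.core.windows.net"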

# Create the BlobServiceClient object
blob_service_client = BlobServiceClient(account_url, credential=credential)

In [6]:
def create_blob_container(blob_service_client: BlobServiceClient, container_name: str) -> None:
    """Create the container, or report that it already exists."""
    try:
        blob_service_client.create_container(name=container_name)
    except ResourceExistsError:
        print("A container with this name already exists")

In [None]:
create_blob_container(blob_service_client, "mobile-data")

AzureCliCredential.get_token_info failed: Please run 'az login' to set up an account


In [None]:
def upload_blob_file(blob_service_client: BlobServiceClient, container_name: str):
    container_client = blob_service_client.get_container_client(
        container=container_name
    )
    # "filepath" and "filename" are placeholders; replace them with the local file to upload.
    with open(file=os.path.join("filepath", "filename"), mode="rb") as data:
        container_client.upload_blob(
            name="sample-blob.txt", data=data, overwrite=True
        )

In [13]:
from azure.ai.ml.entities import AzureDataLakeGen2Datastore

# Replace with your Data Lake account details
account_name = os.getenv("storage_account_name")
filesystem_name = "mobile-price"

datalake_datastore = AzureDataLakeGen2Datastore(
    name="my_datalake",
    account_name=account_name,
    filesystem=filesystem_name,
)

created_datastore = ml_client.datastores.create_or_update(datalake_datastore)
print(f"Datastore '{created_datastore.name}' created successfully!")

Datastore 'my_datalake' created successfully!


## List the datastores

When you create an Azure Machine Learning workspace, an Azure Storage Account is created too. The Storage Account includes Blob and file storage, which are automatically connected to your workspace as **datastores**.

To access data during interactive development, you can also follow https://learn.microsoft.com/en-us/azure/machine-learning/how-to-access-data-interactive?view=azureml-api-2&tabs=adls

You can list all datastores connected to your workspace:

In [14]:
stores = ml_client.datastores.list()
for datastore in stores:
    print(datastore.name)

my_datalake
azureml
azureml_globaldatasets
blob_mobileprice_cleaned
workspaceworkingdirectory
workspacefilestore
workspaceblobstore
workspaceartifactstore


#### Read data directly from the Data Lake

In [15]:
# Azure Machine Learning workspace details:
subscription = os.getenv("subscription")
resource_group = os.getenv("resource_group")  # assumes a "resource_group" variable in your .env file
workspace = os.getenv("workspace")
datastore_name = os.getenv("datastore_name")
path_on_datastore = os.getenv("path_on_datastore")
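
Values like these are typically combined into a long-form `azureml://` datastore URI, which libraries such as pandas can read when the `azureml-fsspec` package is installed. The sketch below only builds the string; `build_datastore_uri` is a hypothetical helper and the example values are placeholders.

```python
def build_datastore_uri(
    subscription: str,
    resource_group: str,
    workspace: str,
    datastore_name: str,
    path_on_datastore: str,
) -> str:
    """Build a long-form azureml:// URI that points to a path on a datastore."""
    return (
        f"azureml://subscriptions/{subscription}"
        f"/resourcegroups/{resource_group}"
        f"/workspaces/{workspace}"
        f"/datastores/{datastore_name}"
        # Drop a leading slash so the URI does not contain "paths//".
        f"/paths/{path_on_datastore.lstrip('/')}"
    )


# Placeholder values for illustration only.
example_uri = build_datastore_uri(
    "<subscription-id>", "<resource-group>", "<workspace>", "my_datalake", "data/file.csv"
)
print(example_uri)
```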

## Create data assets using local folder

To point to a specific folder or file in a datastore, you can create data assets. There are three types of data assets:

- `URI_FILE` points to a specific file.
- `URI_FOLDER` points to a specific folder.
- `MLTABLE` points to an MLTable file that specifies how to read one or more files within a folder.

You'll create `URI_FILE` and `MLTABLE` data assets below to experience the differences between them.
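
As a rough illustration of how the three types relate to what a local path contains, the sketch below picks a type string for a given path. `infer_asset_type` is a hypothetical helper, and the returned strings are assumed to match the values behind `AssetTypes.URI_FILE`, `AssetTypes.URI_FOLDER`, and `AssetTypes.MLTABLE`.

```python
from pathlib import Path


def infer_asset_type(path: Path) -> str:
    """Return 'mltable', 'uri_folder', or 'uri_file' for a local path."""
    if path.is_dir():
        # A folder that contains a file named MLTable can be registered as an MLTable asset.
        return "mltable" if (path / "MLTable").is_file() else "uri_folder"
    return "uri_file"
```

A folder with an MLTable file can still be registered as `URI_FOLDER`; the helper only suggests the most specific type.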

To create a `URI_FILE` data asset, you have to specify a path that points to a specific file. The path can be a local path or cloud path.

In the example below, you'll create a data asset by referencing a *local* path. To ensure the data is always available when working with the Azure Machine Learning workspace, local files will automatically be uploaded to the default datastore. In this case, the `Mobile-Price-Prediction-cleaned_data.csv` file will be uploaded to the **LocalUpload** folder in the **workspaceblobstore** datastore.

To create a data asset from a local file, run the following cell:

In [None]:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

my_path = "../data/Mobile-Price-Prediction-cleaned_data.csv"

my_data = Data(
    path=my_path,
    type=AssetTypes.URI_FILE,
    description="Data asset pointing to a local file, automatically uploaded to the default datastore",
    name="mobile-price-local",
)

ml_client.data.create_or_update(my_data)

To create an `MLTable` data asset, you have to specify a path that points to a folder which contains an MLTable file. The path can be a local path or cloud path.

In the example below, you'll create a data asset by referencing a *local* path which contains an MLTable and CSV file. 
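
The MLTable file is a YAML file named `MLTable` that sits next to the data. A minimal sketch of generating one for a single CSV file follows. The transformation settings (comma delimiter, UTF-8 encoding, shared headers) are assumptions to adapt to your data, and `write_mltable_file` is a hypothetical helper.

```python
from pathlib import Path

# Minimal MLTable definition: one CSV file, read as delimited text.
MLTABLE_TEMPLATE = """\
type: mltable
paths:
  - file: ./{csv_name}
transformations:
  - read_delimited:
      delimiter: ','
      encoding: utf8
      header: all_files_same_headers
"""


def write_mltable_file(folder: Path, csv_name: str) -> Path:
    """Write an MLTable file into `folder` that points at `csv_name` and return its path."""
    mltable_path = folder / "MLTable"
    mltable_path.write_text(MLTABLE_TEMPLATE.format(csv_name=csv_name), encoding="utf-8")
    return mltable_path


# Example call for the folder used in this notebook:
# write_mltable_file(Path("../data"), "Mobile-Price-Prediction-cleaned_data.csv")
```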

In [None]:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

my_path = "../data"

my_data = Data(
    path=my_path,
    type=AssetTypes.MLTABLE,
    description="MLTable pointing to Mobile-Price-Prediction-cleaned_data.csv in data folder",
    name="mobile-price-table",
)

ml_client.data.create_or_update(my_data)

To verify that the new data assets have been created, you can list all data assets in the workspace:

In [None]:
datasets = ml_client.data.list()
for data_asset in datasets:
    print(data_asset.name)

## Create a data asset using Data Lake

In [28]:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

data_uri = f"azureml://datastores/{datastore_name}/paths/{path_on_datastore}/Mobile-Price-Prediction-cleaned_data.csv"

datalake_data_asset = Data(
    name="datalake-price-data",
    path=data_uri,
    type=AssetTypes.URI_FILE,
    description="Data asset pointing to a file in Azure Data Lake via the datastore",
)

ml_client.data.create_or_update(datalake_data_asset)
print("Data asset registered successfully!")

Data asset registered successfully!


## Read data in notebook

Initially, you may want to work with data assets in notebooks, to explore the data and experiment with machine learning models. Any `URI_FILE` or `URI_FOLDER` type data assets are read as you would normally read data. For example, to read a CSV file that a data asset points to, you can use the pandas function `read_csv()`. Reading an `azureml://` path this way typically relies on the `azureml-fsspec` package being installed.

An `MLTable` type data asset is already *read* by the **MLTable** file, which specifies the schema and how to interpret the data. Since the data is already *read*, you can easily convert an MLTable data asset to a pandas dataframe.

You'll need the `mltable` library installed. Then, you can convert the data asset to a dataframe and visualize the data.
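
The MLTable conversion described above can be sketched as follows, assuming the `mltable` package is installed and an MLTable data asset such as `mobile-price-table` is registered. The helper name `load_mltable_as_dataframe` is hypothetical, and the version you pass depends on your workspace.

```python
def load_mltable_as_dataframe(ml_client, name: str, version: str):
    """Load a registered MLTable data asset into a pandas DataFrame."""
    # Imported lazily so the function can be defined even where mltable is not installed.
    import mltable

    # Look up the registered asset, then load it through its azureml:// asset ID.
    data_asset = ml_client.data.get(name, version=version)
    table = mltable.load(f"azureml:/{data_asset.id}")
    return table.to_pandas_dataframe()


# Example call (the version "1" is an assumption):
# df = load_mltable_as_dataframe(ml_client, "mobile-price-table", "1")
```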

In [31]:
import pandas as pd
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())
data_asset = ml_client.data.get("datalake-price-data", version="3")

df = pd.read_csv(data_asset.path)
df

Found the config file in: /config.json


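| Unnamed: 0 | Ratings | RAM | ROM | Mobile_Size | Primary_Cam | Selfi_Cam | Battery_Power | Price |
|---|---|---|---|---|---|---|---|---|
| 0 | 4.3 | 4.0 | 128.0 | 6.00 | 48 | 13.0 | 4000 | 24999 |
| 1 | 3.4 | 6.0 | 64.0 | 4.50 | 48 | 12.0 | 4000 | 15999 |
| 2 | 4.3 | 4.0 | 4.0 | 4.50 | 64 | 16.0 | 4000 | 15000 |
| 3 | 4.4 | 6.0 | 64.0 | 6.40 | 48 | 15.0 | 3800 | 18999 |
| 4 | 4.5 | 6.0 | 128.0 | 6.18 | 35 | 15.0 | 3800 | 18999 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 802 | 3.8 | 6.0 | 32.0 | 4.54 | 48 | 12.0 | 2800 | 1299 |
| 803 | 4.1 | 8.0 | 64.0 | 4.54 | 64 | 8.0 | 2500 | 1390 |
| 804 | 4.4 | 3.0 | 32.0 | 6.20 | 48 | 1.0 | 3800 | 9790 |
| 805 | 3.7 | 10.0 | 32.0 | 4.50 | 64 | 8.0 | 3500 | 799 |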
