# Share data across workspaces

This is the companion notebook for the article on sharing data across workspaces with registries: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-share-data-across-workspaces-with-registries

### Prerequisites
Review the prerequisites section in the following article for all the details and relevant links:

https://learn.microsoft.com/en-us/azure/machine-learning/how-to-share-data-across-workspaces-with-registries

- Familiarity with Azure Machine Learning registries and Data concepts in Azure Machine Learning.
- An Azure Machine Learning registry (preview) to share data.
- An Azure Machine Learning workspace. (The Azure region (location) where you create your workspace must be in the list of supported regions for Azure Machine Learning registry.)
- The Azure Machine Learning Python SDK v2


### Overview and scenarios

Azure Machine Learning registry (preview) enables you to collaborate across workspaces within your organization. Using registries, you can share models, components, environments and data.

#### Key scenario addressed by data sharing using Azure Machine Learning registry

You may want to have data shared across multiple teams, projects, or workspaces in a central location. Such data doesn't have sensitive access controls and can be broadly used in the organization.  

Examples include:
* A team wants to share a public dataset that is preprocessed and ready to use in experiments.
* Your organization has acquired a particular dataset for a project from an external vendor and wants to make it available to all teams working on a project.
* A team wants to share data assets across workspaces in different regions.

#### Scenarios NOT addressed by data sharing using Azure Machine Learning registry

* Sharing sensitive data that requires fine grained access control. You can't create a data asset in a registry to share with a small subset of users/workspaces while the registry is accessible by many other users in the org.
* Sharing data that is available in existing storage and must not be copied, or is too large or too expensive to copy. Whenever a data asset is created in a registry, a copy of the data is ingested into the registry storage so that it can be replicated.


### Goals

In this article, you'll learn how to:
* Create a data asset in the registry.
* Use the data asset from registry as input to a model training job in a workspace.
* Share an existing data asset from a workspace to a registry.

In [None]:
# Import required libraries
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential

from azure.ai.ml import MLClient, Input, Output
from azure.ai.ml.dsl import pipeline
from azure.ai.ml import load_component
from azure.ai.ml.entities import (
    Environment,
    BuildContext,
    Model,
    Data,
    CodeConfiguration,
)
from azure.ai.ml.constants import AssetTypes
import time, datetime, os

# print the SDK version - you may want to include this in any issue you report if parts of this notebook don't work
!pip show azure-ai-ml

### Setup authentication

We use `DefaultAzureCredential` to get access to the workspace. When an access token is needed, it requests one using multiple identities (`EnvironmentCredential, ManagedIdentityCredential, SharedTokenCacheCredential, VisualStudioCodeCredential, AzureCliCredential, AzurePowerShellCredential`) in turn, stopping when one provides a token.
See the [`DefaultAzureCredential` reference](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity.defaultazurecredential?view=azure-python) for more information.

`DefaultAzureCredential` should be capable of handling most Azure SDK authentication scenarios. 
Reference [here](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity?view=azure-python) for all available credentials if it does not work for you.  

In [None]:
try:
    credential = DefaultAzureCredential()
    # Check if given credential can get token successfully.
    credential.get_token("https://management.azure.com/.default")
except Exception as ex:
    # Fall back to InteractiveBrowserCredential if DefaultAzureCredential does not work
    credential = InteractiveBrowserCredential()

## Connect to a workspace and registry

Most samples create one client to connect to the workspace. This sample needs two. The first client, `ml_client_workspace`, connects to a workspace to run jobs or deploy endpoints. The second client, `ml_client_registry`, connects to the registry to create components, environments, models, and data.

Replace the following:
* `<SUBSCRIPTION_ID>`
* `<RESOURCE_GROUP>`
* `<AML_WORKSPACE_NAME>`
* `<REGISTRY_NAME>`
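
If you prefer not to edit the placeholders by hand, one option is to read the four values from environment variables and pass them to `MLClient` as keyword arguments. The sketch below assumes hypothetical variable names (`AZURE_SUBSCRIPTION_ID`, `AZURE_RESOURCE_GROUP`, `AML_WORKSPACE_NAME`, `AML_REGISTRY_NAME`); the helper `client_settings` is not part of the SDK.

```python
import os


def client_settings(env=os.environ):
    """Collect MLClient keyword arguments from environment variables.

    The variable names are hypothetical; use whatever your team prefers.
    A missing variable raises KeyError, which surfaces the problem early.
    """
    workspace = {
        "subscription_id": env["AZURE_SUBSCRIPTION_ID"],
        "resource_group_name": env["AZURE_RESOURCE_GROUP"],
        "workspace_name": env["AML_WORKSPACE_NAME"],
    }
    registry = {"registry_name": env["AML_REGISTRY_NAME"]}
    return workspace, registry


# Example values only; a plain dict stands in for os.environ here.
example_env = {
    "AZURE_SUBSCRIPTION_ID": "00000000-0000-0000-0000-000000000000",
    "AZURE_RESOURCE_GROUP": "my-resource-group",
    "AML_WORKSPACE_NAME": "my-workspace",
    "AML_REGISTRY_NAME": "my-registry",
}
workspace_kwargs, registry_kwargs = client_settings(example_env)
print(workspace_kwargs["workspace_name"], registry_kwargs["registry_name"])
```

With these dictionaries, the two clients could then be built as `MLClient(credential=credential, **workspace_kwargs)` and `MLClient(credential=credential, **registry_kwargs)`.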
 

In [None]:
ml_client_workspace = MLClient(
    credential=credential,
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    workspace_name="<AML_WORKSPACE_NAME>",
)
print(ml_client_workspace)

ml_client_registry = MLClient(credential=credential, registry_name="<REGISTRY_NAME>")
print(ml_client_registry)

### Create a version number and set up root directory
Make sure that you set the version number to something unique if this notebook has been run before. The code below uses the current timestamp to generate a unique version number; a fixed version number is shown commented out as an alternative. This prevents name and version conflicts when creating assets.

Set the root directory in which the YAML definitions of the components, environments, etc. are present.
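
As a minimal sketch of the timestamp approach, the hypothetical helper below turns the current Unix time into a version string. The optional `now` argument exists only so the behavior can be checked with a fixed value.

```python
import time


def make_version(now=None):
    """Return a Unix timestamp, truncated to whole seconds, as a string."""
    return str(int(time.time() if now is None else now))


example_version = make_version()
fixed_version = make_version(1700000000.9)
print(example_version, fixed_version)
```

Because the value has one-second resolution, two runs started within the same second would produce the same version, so this is a convenience for interactive use rather than a guarantee of uniqueness.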

In [None]:
import time
import sys

# version = str(123456)
version = str(int(time.time()))
print("version: ", version)

parent_dir = os.path.abspath(
    os.path.join(
        sys.path[0],
        "../../../../cli/jobs/pipelines-with-components/nyc_taxi_data_regression",
    )
)

### Create environment in registry

**Note:** In this step, we create an environment in the registry and use it in a job later. You can also create an environment in a workspace, or use an existing environment from a registry or workspace, in your job.  

You will use a Dockerfile to create the environment. The Dockerfile has a base Python image and a few Python dependencies required to run Scikit Learn training jobs. This notebook has more samples for creating environments: [../environment/environment.ipynb](../environment/environment.ipynb).

Note that we use the `ml_client_registry` client because we plan to create the environment in the registry. The syntax for creating an environment in a workspace or a registry is identical. You just use a client that is specific to the target: workspace or registry.
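
Before building, it can help to confirm that the build context folder really contains a Dockerfile. The check below is a small sketch that assumes the file sits at the root of the context folder under the conventional name `Dockerfile`; the helper `has_dockerfile` is hypothetical, and the demo uses a temporary folder rather than the real `env_train` folder.

```python
import tempfile
from pathlib import Path


def has_dockerfile(context_dir):
    """Return True if the folder contains a regular file named Dockerfile."""
    return (Path(context_dir) / "Dockerfile").is_file()


# Demonstrate with a throwaway folder instead of the real build context.
with tempfile.TemporaryDirectory() as tmp:
    before = has_dockerfile(tmp)
    (Path(tmp) / "Dockerfile").write_text("FROM python:3.10\n")
    after = has_dockerfile(tmp)

print(before, after)
```

In this notebook the same check could be applied to `os.path.join(parent_dir, "env_train")` before calling `create_or_update`.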

In [None]:
env_docker_context = Environment(
    build=BuildContext(path=os.path.join(parent_dir, "env_train")),
    name="SKLearnEnv",
    version=version,
    description="Scikit Learn environment",
)
ml_client_registry.environments.create_or_update(env_docker_context)

### Get environment from registry

Get the environment using the `ml_client_registry` client. The syntax for getting an environment from a workspace or a registry is identical. You just use a client that is specific to the target: workspace or registry.

You will use this environment in the next step to create a component in the registry.

In [None]:
env_from_registry = ml_client_registry.environments.get(
    name="SKLearnEnv", version=version
)
print(env_from_registry)

### Create component in registry

**Note:** In this step, we create a component in the registry and use it in a job later. You can also create a component in a workspace, or use an existing component from a registry or workspace, in your job.  


You will use the [`train.yml`](../../../cli/jobs/pipelines-with-components/nyc_taxi_data_regression/train.yml) component YAML defined in `cli/jobs/pipelines-with-components/nyc_taxi_data_regression` for this. This component runs a Scikit Learn training Python script. The `train.yml` refers to the AzureML curated environment for the Scikit Learn framework, `AzureML-sklearn-0.24-ubuntu18`, but you will override this to use the Scikit Learn environment you created in the previous step.

A similar sample notebook shows how to create these components in workspaces instead of registry, in which case you can use those components only in the specific workspace: https://github.com/Azure/azureml-examples/blob/main/sdk/jobs/pipelines/1e_pipeline_with_registered_components/pipeline_with_registered_components.ipynb


Use the `ml_client_registry` client to create the component in the registry. The syntax for creating a component in a workspace or a registry is identical. You just use a client that is specific to the target: workspace or registry.

In [None]:
# load component definition from YAML
print(parent_dir)
train_model = load_component(source=os.path.join(parent_dir, "train.yml"))
# print the component as yaml
print(train_model)

# change environment reference to the environment created in registry
train_model.environment = env_from_registry

# changing the version number is optional, but useful if a component with same name and version already exist in registry
train_model.version = version

print(train_model)
ml_client_registry.components.create_or_update(train_model)

### Get component from Registry

Get the component using the `ml_client_registry` client. The syntax for getting a component from a workspace or a registry is identical. You just use a client that is specific to the target: workspace or registry.

You will use this component in the next step to run a pipeline job to train a model.

In [None]:
train_component_from_registry = ml_client_registry.components.get(
    name="train_linear_regression_model", version=version
)
print(train_component_from_registry)

#### Create data asset in registry

The data asset created in this step is used later in this article when submitting a training job. We will use transformed NYC taxi data, which is required as an input to the pipeline job we will run later. The data is available in the `data_transformed` folder.

Use the `ml_client_registry` client to create the data in the registry. The syntax for creating data in a workspace or a registry is identical. You just use a client that is specific to the target: workspace or registry.
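
Because creating a data asset in a registry ingests a copy of the data into registry storage, it can be worth checking how much data a local folder holds before uploading it. The following sketch uses a hypothetical helper, `folder_summary`, and a temporary folder with made-up files in place of `data_transformed`.

```python
import tempfile
from pathlib import Path


def folder_summary(folder):
    """Return (file_count, total_bytes) for all files under a folder."""
    files = [p for p in Path(folder).rglob("*") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)


# Demonstrate with a throwaway folder containing two small files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "part-0.csv").write_bytes(b"abc")
    nested = Path(tmp) / "nested"
    nested.mkdir()
    (nested / "part-1.csv").write_bytes(b"12345")
    file_count, total_bytes = folder_summary(tmp)

print(file_count, total_bytes)
```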

In [None]:
my_path = parent_dir + "/data_transformed/"
my_data = Data(
    path=my_path,
    type=AssetTypes.URI_FOLDER,
    description="Transformed NYC Taxi data created from local folder.",
    name="transformed-nyc-taxt-data",
    version=version,
)
ml_client_registry.data.create_or_update(my_data)

#### Get data from registry

Get the data using the `ml_client_registry` client. The syntax for getting data from a workspace or a registry is identical. You just use a client that is specific to the target: workspace or registry.

You will use this data in the next step to run a pipeline job to train a model.

In [None]:
# get the data asset
data_asset_from_registry = ml_client_registry.data.get(name="transformed-nyc-taxt-data", version=version)
print(data_asset_from_registry)

### Create a pipeline job using data from registry

Review this page to learn how to use pipelines and components: https://github.com/Azure/azureml-examples/tree/main/sdk/python/jobs/pipelines. 

You will create a pipeline job that uses the training component created in the previous step using the Python DSL for pipelines. 

Make sure your workspace has a compute with the name `cpu-cluster` or update the compute name here: `pipeline_job.settings.default_compute = `
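
The pipeline input in the next cell points at the registry data asset through its `id`. As a rough sketch, and assuming registry data asset IDs follow the `azureml://registries/<registry>/data/<name>/versions/<version>` form, the hypothetical helpers below build and parse such a URI. Printing `data_asset_from_registry.id` is the reliable way to see the actual value.

```python
import re

# Assumed URI layout for a data asset stored in a registry.
_REGISTRY_DATA_URI = re.compile(
    r"^azureml://registries/(?P<registry>[^/]+)"
    r"/data/(?P<name>[^/]+)/versions/(?P<version>[^/]+)$"
)


def registry_data_uri(registry, name, version):
    """Build the assumed URI for a registry data asset."""
    return f"azureml://registries/{registry}/data/{name}/versions/{version}"


def parse_registry_data_uri(uri):
    """Split a registry data URI into its registry, name and version parts."""
    match = _REGISTRY_DATA_URI.match(uri)
    if match is None:
        raise ValueError(f"not a registry data URI: {uri!r}")
    return match.groupdict()


uri = registry_data_uri("my-registry", "transformed-nyc-taxt-data", "1")
parts = parse_registry_data_uri(uri)
print(uri)
```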

In [None]:
@pipeline()
def pipeline_with_registered_components(training_data):
    train_job = train_component_from_registry(
        training_data=training_data,
    )


pipeline_job = pipeline_with_registered_components(
    training_data=Input(type="uri_folder", path=data_asset_from_registry.id),
)
pipeline_job.settings.default_compute = "cpu-cluster"
print(pipeline_job)

### Run pipeline job using data from registry

Submit the pipeline job and wait for it to complete. Notice that you are using the workspace client, `ml_client_workspace`, to run the pipeline job. This job runs a component and uses data that are not available in your workspace but come from a registry. This way, you can run this job in any workspace you have access to. This is useful when you want to develop a pipeline in the `dev` workspace with some sample data and run the pipeline in the `prod` workspace with actual data. It is also helpful if you want to share the components and data you develop with other teams in your organization who may be using a different workspace. 
To summarize, you can submit this job to different workspaces such as `dev`, `test` or `prod` by creating different ML clients for each of those workspaces.
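
The idea of one job definition submitted through several workspace clients can be sketched as a simple loop. The helper `submit_to_workspaces` is hypothetical, and `fake_client` is a stand-in for `MLClient` so that the example runs without Azure access; only the `jobs.create_or_update(job, experiment_name=...)` call shape is taken from this notebook.

```python
from types import SimpleNamespace


def submit_to_workspaces(job, clients, experiment_name):
    """Submit the same job through each client and return {label: result}."""
    submitted = {}
    for label, client in clients.items():
        submitted[label] = client.jobs.create_or_update(
            job, experiment_name=experiment_name
        )
    return submitted


def fake_client(workspace):
    """Stand-in for MLClient that records where a job would be submitted."""

    def create_or_update(job, experiment_name):
        return f"{job}@{workspace}/{experiment_name}"

    return SimpleNamespace(jobs=SimpleNamespace(create_or_update=create_or_update))


clients = {"dev": fake_client("dev-ws"), "prod": fake_client("prod-ws")}
results = submit_to_workspaces("pipeline", clients, "sdk_job_data_from_registry")
print(results)
```

With real clients, each entry in `clients` would be an `MLClient` created for a different workspace, and `job` would be the `pipeline_job` object built earlier.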

In [None]:
pipeline_job = ml_client_workspace.jobs.create_or_update(
    pipeline_job,
    experiment_name="sdk_job_data_from_registry",
    skip_validation=True,
)
ml_client_workspace.jobs.stream(pipeline_job.name)
pipeline_job = ml_client_workspace.jobs.get(pipeline_job.name)
pipeline_job

### Share data from workspace to registry

You can also share an existing data asset from workspace to registry. We will start with creating a data asset in workspace. For this we will use the same data from a previous step in this notebook but instead of creating it in registry, we will create it in workspace.

Note that the only difference is that we are using `ml_client_workspace` instead of `ml_client_registry`.

In [None]:
my_path = parent_dir + "/data_transformed/"
my_data = Data(
    path=my_path,
    type=AssetTypes.URI_FOLDER,
    description="Transformed NYC Taxi data created from local folder.",
    name="transformed-nyc-taxt-data-ws",
    version=version,
)
ml_client_workspace.data.create_or_update(my_data)

In this step, we will share the data asset created in the workspace in the previous step with the registry.

Note that we pass the `name` and `version` parameters to the `_prepare_to_copy` function. These parameters are optional; if you do not pass them, the data is shared with the same name and version as in the workspace.
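
The defaulting rule for `name` and `version` can be illustrated with a tiny stand-alone function. This is a sketch of the rule as described above, not the SDK's internal logic, and `target_identity` is a hypothetical name.

```python
def target_identity(source_name, source_version, name=None, version=None):
    """Return the (name, version) the shared asset would get in the registry."""
    return (
        source_name if name is None else name,
        source_version if version is None else version,
    )


kept = target_identity("transformed-nyc-taxt-data-ws", "1")
renamed = target_identity(
    "transformed-nyc-taxt-data-ws",
    "1",
    name="transformed-nyc-taxt-data-shared-from-ws",
)
print(kept, renamed)
```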

In [None]:
# fetch the data from workspace
data_in_workspace = ml_client_workspace.data.get(
    name="transformed-nyc-taxt-data-ws", version=version
)
print("data from workspace:\n\n", data_in_workspace)

# change the format so that the registry understands the data
# (when you print the data_ready_to_copy object, notice the asset id)
data_ready_to_copy = ml_client_workspace.data._prepare_to_copy(
    data_in_workspace,
    name="transformed-nyc-taxt-data-shared-from-ws",
    version=version,
)
print("\n\ndata ready to copy:\n\n", data_ready_to_copy)
# copy the data from workspace to registry
ml_client_registry.data.create_or_update(data_ready_to_copy).wait()