# Workspace

The first class you will work with is the AzureML Workspace, a class that gives
you access to all the resources within your workspace. To create a reference to your
workspace, you will need the following information:

• **Subscription ID**: The subscription where the workspace is located. This is
a Globally Unique Identifier (GUID, also known as a UUID) that consists
of 32 hexadecimal (0-F) digits; for example, ab05ab05-ab05-ab05-ab05-ab05ab05ab05.
You can find this ID in the Azure portal in the Properties tab
of the subscription you are using.

• **Resource group name**: The resource group that contains the AzureML
workspace components.

• **Workspace name**: The name of the AzureML workspace.
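
A mistyped subscription ID is an easy mistake to make, so it can help to check the format before you use it. The following is a minimal sketch that relies only on Python's standard uuid module; the is_valid_guid helper is a hypothetical name and is not part of the AzureML SDK.

```python
import uuid


def is_valid_guid(value: str) -> bool:
    """Return True if value is a GUID in the hyphenated 8-4-4-4-12 hex form."""
    try:
        parsed = uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        return False
    # uuid.UUID also accepts braces and un-hyphenated input, so compare
    # the parsed value against the canonical hyphenated representation.
    return str(parsed) == value.lower()


print(is_valid_guid("ab05ab05-ab05-ab05-ab05-ab05ab05ab05"))  # True
print(is_valid_guid("not-a-guid"))                            # False
```

The check only confirms the shape of the identifier; it cannot tell whether the subscription actually exists.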


In [53]:
subscription_id=""
resource_group=""
workspace_name=""

In [2]:
from azureml.core import Workspace
ws=Workspace(subscription_id,resource_group,workspace_name)

In [3]:
from azureml.core import Workspace
ws = Workspace.get(name=workspace_name,
 subscription_id=subscription_id,
 resource_group=resource_group)


The previous two ways of getting the AzureML workspace reference are equivalent. The
main issue with them, however, is that they hardcode the workspace to which the script
connects. Imagine that you want to share a notebook with a friend, and you have
hardcoded the subscription ID, resource group name, and workspace name in that notebook.
Your friend would have to edit that cell manually. This problem becomes even
more obvious when you want to write a script that runs in multiple environments,
such as the development environment, the quality assurance environment, and the
production environment.

The Workspace class offers the from_config() method to address this issue. This
method searches the folder tree structure for the config.json file.

In the case of the compute instance, this file is located in the root folder
(/config.json) and was automatically created there when you provisioned the compute
instance within the AzureML workspace.

In [4]:
from azureml.core import Workspace
ws = Workspace.from_config()
print(f"Connected to workspace {ws.name}")

Connected to workspace prod_dev_ml_ws
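
To make the idea of searching the folder tree more concrete, the sketch below writes a config.json holding the three values described earlier (subscription_id, resource_group, and workspace_name) and then walks up from a nested folder until it finds the file. The find_config helper is hypothetical and only approximates the idea; it is not the SDK's actual implementation, and all values are placeholders.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def find_config(start: Path, filename: str = "config.json") -> Optional[Path]:
    """Walk from start up to the filesystem root and return the first match."""
    start = start.resolve()
    for folder in [start, *start.parents]:
        candidate = folder / filename
        if candidate.is_file():
            return candidate
    return None


with tempfile.TemporaryDirectory() as root:
    root_path = Path(root)
    # Placeholder values, not a real workspace.
    (root_path / "config.json").write_text(json.dumps({
        "subscription_id": "ab05ab05-ab05-ab05-ab05-ab05ab05ab05",
        "resource_group": "my-resource-group",
        "workspace_name": "my-workspace",
    }))

    # Simulate a notebook that lives two folders below the config file.
    nested = root_path / "notebooks" / "chapter"
    nested.mkdir(parents=True)

    found = find_config(nested)
    settings = json.loads(found.read_text())
    print(settings["workspace_name"])  # my-workspace
```

Because the lookup starts at the current folder, the same notebook can connect to a different workspace simply by being run next to a different config.json.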


If you want to spin up a new AzureML workspace, you can provision one using the
Workspace.create() method.

In [5]:
from azureml.core import Workspace

In [8]:
new_ws=Workspace.create(name="new_ml_ws_prod_dev",
                        subscription_id=subscription_id,
                        resource_group=resource_group,
                        create_resource_group=False,
                        location="eastus")

Deploying StorageAccount with name newmlwspstorageefc3f126b.
Deploying KeyVault with name newmlwspkeyvault39367f99.
Deploying AppInsights with name newmlwspinsightsdb6fbf35.
Deployed AppInsights with name newmlwspinsightsdb6fbf35. Took 1.36 seconds.
Deployed KeyVault with name newmlwspkeyvault39367f99. Took 16.91 seconds.
Deploying Workspace with name new_ml_ws_prod_dev.
Deployed StorageAccount with name newmlwspstorageefc3f126b. Took 26.38 seconds.
Deployed Workspace with name new_ml_ws_prod_dev. Took 16.54 seconds.


To delete the workspace, call its delete() method:

In [9]:
new_ws.delete(delete_dependent_resources=True)

This code deletes the workspace being referenced by the new_ws variable and removes
the dependent resources, which are the storage account, the key vault, and the Application
Insights resources that were deployed with the AzureML workspace.

# Working with compute targets

The AzureML SDK allows you to list the existing compute targets in your workspace
or provision new ones if needed.

In [6]:
ws.compute_targets

{'training-compute123': {
   "id": "/subscriptions/a8b508b6-da16-4c45-84f5-cac5c9f57513/resourceGroups/azure-mlops/providers/Microsoft.MachineLearningServices/workspaces/prod_dev_ml_ws/computes/training-compute123",
   "name": "training-compute123",
   "location": "eastus",
   "tags": {},
   "properties": {
     "description": null,
     "computeType": "ComputeInstance",
     "computeLocation": "eastus",
     "resourceId": null,
     "provisioningErrors": null,
     "provisioningState": "Succeeded",
     "properties": {
       "vmSize": "STANDARD_DS11_V2",
       "applications": [
         {
           "displayName": "Jupyter",
           "endpointUri": "https://training-compute123.eastus.instances.azureml.ms/tree/"
         },
         {
           "displayName": "Jupyter Lab",
           "endpointUri": "https://training-compute123.eastus.instances.azureml.ms/lab"
         },
         {
           "displayName": "RStudio",
           "endpointUri": "https://training-compute123-8787.ea

In [7]:
for compute_name in ws.compute_targets:
    compute=ws.compute_targets[compute_name]
    print(f"Compute {compute.name} is a {type(compute)}")

Compute training-compute123 is a <class 'azureml.core.compute.computeinstance.ComputeInstance'>


The output should list at least the compute instance where you are executing the
script. You will notice that all the compute types are defined within the modules of
the azureml.core.compute package.

Another way to get a reference to a compute target is to use the ComputeTarget
constructor. You need to pass in the Workspace reference and the name of the compute
target you are looking for. If the target does not exist, a ComputeTargetException
is raised, which you have to handle in your code.

In [8]:
from azureml.core import ComputeTarget
from azureml.exceptions import ComputeTargetException

In [9]:
compute_name="training-compute123"
compute=None

In [10]:
try:
    compute=ComputeTarget(ws,compute_name)
    print(f"Found {compute_name} which is {type(compute)}")
except ComputeTargetException as e:
    print(f"Failed to get compute {compute_name}. Error: {e}")

Found training-compute123 which is <class 'azureml.core.compute.computeinstance.ComputeInstance'>


The ComputeTarget class offers the create() method, which allows you to provision
various compute targets, including compute instances (the ComputeInstance
class), compute clusters (the AmlCompute class), and Azure Kubernetes Service (the
AksCompute class) targets.

To provision a compute target, you will need to create a configuration object that
inherits from the ComputeTargetProvisioningConfiguration abstract
class.

In [11]:
from azureml.core.compute import ComputeTarget, AmlCompute
compute_name="demo-clusterakt"

In [12]:
cluster=None
if compute_name in ws.compute_targets:
    # The cluster already exists, so reuse it
    print("Getting reference to compute cluster")
    cluster=ws.compute_targets[compute_name]
else:
    # Provision a new cluster and block until it is ready
    print("creating compute cluster")
    config=AmlCompute.provisioning_configuration(vm_size='Standard_D1',max_nodes=2)
    cluster=ComputeTarget.create(ws,compute_name,config)
    cluster.wait_for_completion(show_output=True)
print(f"Got reference to cluster {cluster.name}")

creating compute cluster
InProgress.
SucceededProvisioning operation finished, operation "Succeeded"
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned
Got reference to cluster demo-clusterakt


The script specifies only the maximum number of nodes (the max_nodes argument) that the
compute cluster will have. If you do not specify the minimum number of nodes (the
min_nodes argument), it defaults to 0. This means that by default, the
cluster will scale down to 0 nodes, incurring no compute costs when no job is running.

One of the drawbacks of having 0 minimum nodes in a compute cluster is that you will
have to wait for the compute nodes to be allocated before the job you submitted gets
executed. To avoid this wait, it is common to scale up the minimum and even the
maximum nodes of the cluster during workdays, and then lower those values after
business hours to save on costs.
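
The schedule-based scaling described above can be sketched as a small function that decides the node range from the current time. The business-hours window (08:00 to 18:00, Monday to Friday) and the node counts are assumptions chosen for illustration, and desired_node_range is a hypothetical helper rather than part of the AzureML SDK.

```python
from datetime import datetime
from typing import Tuple


def desired_node_range(now: datetime) -> Tuple[int, int]:
    """Return (min_nodes, max_nodes) for the given local time."""
    is_weekday = now.weekday() < 5           # Monday is 0, Friday is 4
    is_business_hours = 8 <= now.hour < 18   # assumed 08:00-18:00 window
    if is_weekday and is_business_hours:
        return 1, 4   # keep one node warm so jobs start without waiting
    return 0, 2       # scale to zero when nobody is expected to submit jobs


# 8 January 2024 was a Monday.
min_nodes, max_nodes = desired_node_range(datetime(2024, 1, 8, 10, 0))
print(min_nodes, max_nodes)  # 1 4

# With a reference to a real cluster, the values could then be applied,
# for example through the cluster's update() method (check your SDK version):
# cluster.update(min_nodes=min_nodes, max_nodes=max_nodes)
```

A function like this could run on a timer, so that the cluster keeps a warm node only while people are likely to submit jobs.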

# Defining datastores


In [16]:
from azureml.core import Datastore

In [17]:
dstore=Datastore.register_azure_blob_container(
    ws,datastore_name="myblob_datastore",
    container_name="mynewcontainer123",
    account_name="proddevmlws8204932307",
    # Placeholder: supply your own storage account key; never commit a real key
    account_key="<storage-account-key>",
    create_if_not_exists=True
)
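
A storage account key written as plain text in a notebook is easy to leak when the notebook is shared. One possible alternative, shown here as a minimal sketch, is to read the key from an environment variable. The variable name AZURE_STORAGE_ACCOUNT_KEY and the read_account_key helper are assumptions made for this example, not names defined by the AzureML SDK.

```python
import os


def read_account_key(env_var: str = "AZURE_STORAGE_ACCOUNT_KEY") -> str:
    """Return the storage account key from the environment, failing loudly if unset."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before registering the datastore"
        )
    return key


# Demo only: provide a placeholder if the variable is not already set.
os.environ.setdefault("AZURE_STORAGE_ACCOUNT_KEY", "<storage-account-key>")
account_key = read_account_key()
print("Account key loaded from the environment")
```

The returned value can then be passed as the account_key argument when registering the datastore.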

To get a reference to the connected datastore, you can use the get() method of the
Datastore class.

In [18]:
from azureml.core import Datastore
dstore=Datastore.get(ws,datastore_name="myblob_datastore")

The Workspace class offers a shortcut that gives a reference to the workspace's
default datastore using the get_default_datastore() method.

In [19]:
dstore=ws.get_default_datastore()

### Uploading files to a datastore

In [20]:
from sklearn.datasets import load_diabetes
import pandas as pd

In [21]:
features, target=load_diabetes(return_X_y=True)

In [22]:
features

array([[ 0.03807591,  0.05068012,  0.06169621, ..., -0.00259226,
         0.01990842, -0.01764613],
       [-0.00188202, -0.04464164, -0.05147406, ..., -0.03949338,
        -0.06832974, -0.09220405],
       [ 0.08529891,  0.05068012,  0.04445121, ..., -0.00259226,
         0.00286377, -0.02593034],
       ...,
       [ 0.04170844,  0.05068012, -0.01590626, ..., -0.01107952,
        -0.04687948,  0.01549073],
       [-0.04547248, -0.04464164,  0.03906215, ...,  0.02655962,
         0.04452837, -0.02593034],
       [-0.04547248, -0.04464164, -0.0730303 , ..., -0.03949338,
        -0.00421986,  0.00306441]])

In [23]:
target

array([151.,  75., 141., 206., 135.,  97., 138.,  63., 110., 310., 101.,
        69., 179., 185., 118., 171., 166., 144.,  97., 168.,  68.,  49.,
        68., 245., 184., 202., 137.,  85., 131., 283., 129.,  59., 341.,
        87.,  65., 102., 265., 276., 252.,  90., 100.,  55.,  61.,  92.,
       259.,  53., 190., 142.,  75., 142., 155., 225.,  59., 104., 182.,
       128.,  52.,  37., 170., 170.,  61., 144.,  52., 128.,  71., 163.,
       150.,  97., 160., 178.,  48., 270., 202., 111.,  85.,  42., 170.,
       200., 252., 113., 143.,  51.,  52., 210.,  65., 141.,  55., 134.,
        42., 111.,  98., 164.,  48.,  96.,  90., 162., 150., 279.,  92.,
        83., 128., 102., 302., 198.,  95.,  53., 134., 144., 232.,  81.,
       104.,  59., 246., 297., 258., 229., 275., 281., 179., 200., 200.,
       173., 180.,  84., 121., 161.,  99., 109., 115., 268., 274., 158.,
       107.,  83., 103., 272.,  85., 280., 336., 281., 118., 317., 235.,
        60., 174., 259., 178., 128.,  96., 126., 28

In [24]:
diabetes_df=pd.DataFrame(features)

In [26]:
diabetes_df['target']=target

In [27]:
diabetes_df.head()

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019908 | -0.017646 | 151.0 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.06833 | -0.092204 | 75.0 |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.005671 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002864 | -0.02593 | 141.0 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022692 | -0.009362 | 206.0 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031991 | -0.046641 | 135.0 |


In [28]:
diabetes_df.to_csv("rawdata.csv",index=False)

In [29]:
dstore=Datastore.get(ws,datastore_name="myblob_datastore")

In [30]:
dstore.upload_files(files=["rawdata.csv"],target_path="/data",overwrite=True,show_progress=True)

"datastore.upload_files" is deprecated after version 1.0.69. Please use "FileDatasetFactory.upload_directory" instead. See Dataset API change notice at https://aka.ms/dataset-deprecation.


$AZUREML_DATAREFERENCE_96b3c398adb84646baf8888df22da9ad

In this section, you learned how to attach an Azure blob container as a new
datastore within your AzureML workspace. You also learned how to get a reference
to the workspace's default datastore, and you uploaded a CSV file to a datastore.

# Working with datasets

Datasets are an abstraction layer on top of the data that you use for training and
inference. They contain references to the physical data's location and provide
metadata that helps you understand its shape and statistical properties. They do not
copy the data that resides within the datastores. AzureML offers two types of datasets:

- FileDataset 
- TabularDataset

In [38]:
from azureml.core import Dataset

dstore=Datastore(ws,name="myblob_datastore")

file_paths=[
    (dstore,"/data"),
]

tabular_ds=Dataset.Tabular.from_delimited_files(path=file_paths)

In [39]:
tabular_ds

{
  "source": [
    "('myblob_datastore', '/data')"
  ],
  "definition": [
    "GetDatastoreFiles",
    "ParseDelimited",
    "DropColumns",
    "SetColumnTypes"
  ]
}

In [40]:
df=tabular_ds.to_pandas_dataframe()

In the preceding code, you create an unregistered TabularDataset using the
from_delimited_files() method. By default, that method validates that the data can
be loaded from the current compute; passing validate=False skips this check and
speeds up the declaration process.
Datasets do not load the data unless you explicitly invoke a method
that requires the actual data. In this case, when you invoke the
to_pandas_dataframe() method, your code reaches out to the datastore,
loads the data in memory as a pandas DataFrame, and assigns it to the df variable.
Calling len(df) then returns the number of rows in that DataFrame.
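
The idea of holding a reference and deferring the read can be illustrated with a toy class. The sketch below is not how AzureML implements datasets; LazyCsvDataset and its to_rows() method are hypothetical names, and the class uses only Python's standard library to show that nothing is read until the data is requested.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, List


class LazyCsvDataset:
    """Toy dataset that stores only a file reference until rows are requested."""

    def __init__(self, path: Path) -> None:
        self.path = path      # reference only; nothing is read here
        self.load_count = 0   # how many times the file was actually read

    def to_rows(self) -> List[Dict[str, str]]:
        """Read the referenced file and return its rows as dictionaries."""
        self.load_count += 1
        with self.path.open(newline="") as handle:
            return list(csv.DictReader(handle))


with tempfile.TemporaryDirectory() as folder:
    csv_path = Path(folder) / "rawdata.csv"
    csv_path.write_text(
        "0,1,target\n0.038076,0.05068,151.0\n-0.001882,-0.044642,75.0\n"
    )

    dataset = LazyCsvDataset(csv_path)
    print(dataset.load_count)  # 0, declaring the dataset reads nothing
    rows = dataset.to_rows()
    print(len(rows))           # 2, the data is read only on request
```

Declaring the dataset is cheap because it only records where the data lives; the cost of reading is paid when to_rows() is called.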

In [41]:
df.head()

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019908 | -0.017646 | 151.0 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.06833 | -0.092204 | 75.0 |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.005671 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002864 | -0.02593 | 141.0 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022692 | -0.009362 | 206.0 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031991 | -0.046641 | 135.0 |


If you want to reuse a dataset in multiple
experiments, you can register it in the workspace using the register() method:


In [42]:
tabular_ds.register(ws,name="diabetes_dataset",description="diabetes data taken from sklearn")

{
  "source": [
    "('myblob_datastore', '/data')"
  ],
  "definition": [
    "GetDatastoreFiles",
    "ParseDelimited",
    "DropColumns",
    "SetColumnTypes"
  ],
  "registration": {
    "id": "8df7cb97-e5e6-4577-a016-8b656ef1dfdd",
    "name": "diabetes_dataset",
    "version": 1,
    "description": "diabetes data taken from sklearn",
    "workspace": "Workspace.create(name='prod_dev_ml_ws', subscription_id='a8b508b6-da16-4c45-84f5-cac5c9f57513', resource_group='azure-mlops')"
  }
}

If, instead of a TabularDataset, you have a pandas DataFrame that you want to
register, you can use the register_pandas_dataframe() method:

In [43]:
Dataset.Tabular.register_pandas_dataframe(df,target=(dstore,"/data"),name="diabetes_df",description="registered from a dataframe")

Validating arguments.
Arguments validated.
Successfully obtained datastore reference and path.
Uploading file to /data/cc506c61-ddab-4322-b236-5b0f4273ec62/
Successfully uploaded file to datastore.
Creating and registering a new dataset.
Successfully created and registered a new dataset.


{
  "source": [
    "('myblob_datastore', '/data/cc506c61-ddab-4322-b236-5b0f4273ec62/')"
  ],
  "definition": [
    "GetDatastoreFiles",
    "ReadParquetFile",
    "DropColumns"
  ],
  "registration": {
    "id": "21ad2713-aaf7-4748-a870-6495e83e221b",
    "name": "diabetes_df",
    "version": 1,
    "description": "registered from a dataframe",
    "workspace": "Workspace.create(name='prod_dev_ml_ws', subscription_id='a8b508b6-da16-4c45-84f5-cac5c9f57513', resource_group='azure-mlops')"
  }
}

Once you have registered a dataset, either a FileDataset or a TabularDataset,
you can retrieve it using the get_by_name() method of the Dataset class.

In [44]:
from azureml.core import Dataset
diabetes_ds=Dataset.get_by_name(ws,name="diabetes_dataset")

In [49]:
diabetes_ds.name

'diabetes_dataset'

The preceding code snippet returns an instance of the TabularDataset class, but
the data hasn't been loaded yet. You can load the dataset partially using various
methods of the TabularDataset class. Registration metadata, such as the name and
version, is available without loading any data.

In [50]:
diabetes_ds.version

1

In [52]:
#diabetes_ds.to_pandas_dataframe()