title | description | ms.date | ms.custom | ms.topic | ms.author | author | ms.service | ms.subservice |
---|---|---|---|---|---|---|---|---|
Azure Machine Learning monitoring data reference | This article contains important reference material you need when you monitor Azure Machine Learning. | 03/21/2024 | horz-monitor | reference | aashishb | aashishb | machine-learning | mlops |
[!INCLUDE horz-monitor-ref-intro]
See Monitor Machine Learning for details on the data you can collect for Azure Machine Learning and how to use it.
[!INCLUDE horz-monitor-ref-metrics-intro] The resource provider for these metrics is Microsoft.MachineLearningServices/workspaces.
The metrics categories are Model, Quota, Resource, Run, and Traffic. Quota information is for Machine Learning compute only. Run provides information on training runs for the workspace.
The following table lists the metrics available for the Microsoft.MachineLearningServices/workspaces resource type.
[!INCLUDE horz-monitor-ref-metrics-tableheader] [!INCLUDE Microsoft.MachineLearningServices/workspaces]
The following table lists the metrics available for the Microsoft.MachineLearningServices/workspaces/onlineEndpoints resource type.
[!INCLUDE horz-monitor-ref-metrics-tableheader] [!INCLUDE Microsoft.MachineLearningServices/workspaces/onlineEndpoints]
The following table lists the metrics available for the Microsoft.MachineLearningServices/workspaces/onlineEndpoints/deployments resource type.
[!INCLUDE horz-monitor-ref-metrics-tableheader] [!INCLUDE Microsoft.MachineLearningServices/workspaces/onlineEndpoints/deployments]
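The three resource types above nest inside one another, so it can help to know which metrics table applies to a given Azure resource ID. The following is a minimal sketch, assuming the standard Azure Resource Manager ID layout (`/subscriptions/.../resourceGroups/.../providers/<namespace>/<type>/<name>/...`). The helper name `resource_type` and the sample ID values are hypothetical and used only for illustration.

```python
def resource_type(resource_id: str) -> str:
    """Derive the resource type string from an Azure resource ID.

    Assumes a single 'providers' segment followed by alternating
    type/name pairs. Raises ValueError if 'providers' is missing.
    """
    parts = [p for p in resource_id.split("/") if p]
    lowered = [p.lower() for p in parts]
    i = lowered.index("providers")
    namespace = parts[i + 1]
    # After the namespace, segments alternate: type, name, type, name, ...
    type_segments = parts[i + 2:][0::2]
    return "/".join([namespace, *type_segments])


# Hypothetical resource ID, for illustration only.
sample_id = (
    "/subscriptions/0000/resourceGroups/rg/providers/"
    "Microsoft.MachineLearningServices/workspaces/ws/"
    "onlineEndpoints/ep/deployments/blue"
)
deployment_type = resource_type(sample_id)
```

Here `deployment_type` matches the third resource type listed above, so the deployments metrics table is the one to consult.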
[!INCLUDE horz-monitor-ref-metrics-dimensions-intro] [!INCLUDE horz-monitor-ref-metrics-dimensions]
Dimension | Description |
---|---|
Cluster Name | The name of the compute cluster resource. Available for all quota metrics. |
Vm Family Name | The name of the VM family used by the cluster. Available for quota utilization percentage. |
Vm Priority | The priority of the VM. Available for quota utilization percentage. |
CreatedTime | Only available for CpuUtilization and GpuUtilization. |
DeviceId | ID of the device (GPU). Only available for GpuUtilization. |
NodeId | ID of the node where the job is running. Only available for CpuUtilization and GpuUtilization. |
RunId | ID of the run/job. Only available for CpuUtilization and GpuUtilization. |
ComputeType | The compute type that the run used. Only available for Completed runs, Failed runs, and Started runs. |
PipelineStepType | The type of PipelineStep used in the run. Only available for Completed runs, Failed runs, and Started runs. |
PublishedPipelineId | The ID of the published pipeline used in the run. Only available for Completed runs, Failed runs, and Started runs. |
RunType | The type of run. Only available for Completed runs, Failed runs, and Started runs. |
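Because most dimensions apply only to certain metrics, a small lookup can catch an invalid metric and dimension pairing before a query is sent. The sketch below transcribes the table above into a dictionary. The quota dimensions are left out because the table doesn't name the individual quota metrics, and the helper name `dimensions_for` is hypothetical.

```python
RUN_METRICS = {"Completed runs", "Failed runs", "Started runs"}
UTILIZATION_METRICS = {"CpuUtilization", "GpuUtilization"}

# Dimension -> metrics it is available for, as stated in the table.
DIMENSION_METRICS = {
    "CreatedTime": UTILIZATION_METRICS,
    "DeviceId": {"GpuUtilization"},
    "NodeId": UTILIZATION_METRICS,
    "RunId": UTILIZATION_METRICS,
    "ComputeType": RUN_METRICS,
    "PipelineStepType": RUN_METRICS,
    "PublishedPipelineId": RUN_METRICS,
    "RunType": RUN_METRICS,
}


def dimensions_for(metric: str) -> list[str]:
    """Return the dimensions the table lists for a metric, sorted by name."""
    return sorted(
        dim for dim, metrics in DIMENSION_METRICS.items() if metric in metrics
    )
```

For example, `dimensions_for("CpuUtilization")` omits `DeviceId`, since that dimension is only available for GpuUtilization.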
The valid values for the RunType dimension are:
Value | Description |
---|---|
Experiment | Non-pipeline runs. |
PipelineRun | A pipeline run, which is the parent of a StepRun. |
StepRun | A run for a pipeline step. |
ReusedStepRun | A run for a pipeline step that reuses a previous run. |
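When you process metric data split by RunType, an enumeration keeps the four values in one place and rejects anything unexpected. This is an illustrative sketch only. The class and the helper `is_pipeline_related` are hypothetical names, and treating every value other than Experiment as pipeline-related is an interpretation of the descriptions above.

```python
from enum import Enum


class RunType(Enum):
    EXPERIMENT = "Experiment"          # Non-pipeline runs
    PIPELINE_RUN = "PipelineRun"       # Parent of a StepRun
    STEP_RUN = "StepRun"               # A run for a pipeline step
    REUSED_STEP_RUN = "ReusedStepRun"  # A step run that reuses a previous run


def is_pipeline_related(value: str) -> bool:
    """Return True for every RunType value except Experiment.

    Raises ValueError if the value is not one of the four listed values.
    """
    return RunType(value) is not RunType.EXPERIMENT
```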
[!INCLUDE horz-monitor-ref-resource-logs]
[!INCLUDE Microsoft.MachineLearningServices/registries]
[!INCLUDE Microsoft.MachineLearningServices/workspaces]
[!INCLUDE Microsoft.MachineLearningServices/workspaces/onlineEndpoints]
[!INCLUDE horz-monitor-ref-logs-tables]
Microsoft.MachineLearningServices/workspaces
- AzureActivity
- AMLOnlineEndpointConsoleLog
- AMLOnlineEndpointTrafficLog
- AMLOnlineEndpointEventLog
- AzureMetrics
- AMLComputeClusterEvent
- AMLComputeClusterNodeEvent
- AMLComputeJobEvent
- AMLRunStatusChangedEvent
- AMLComputeCpuGpuUtilization
- AMLComputeInstanceEvent
- AMLDataLabelEvent
- AMLDataSetEvent
- AMLDataStoreEvent
- AMLDeploymentEvent
- AMLEnvironmentEvent
- AMLInferencingEvent
- AMLModelsEvent
- AMLPipelineEvent
- AMLRunEvent
Microsoft.MachineLearningServices/registries
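To check which of the workspace tables listed above are receiving data, you can combine several of them in a single Kusto Query Language (KQL) query. The sketch below only builds the query text and doesn't call any Azure SDK. The function name `row_count_query`, the column names `SourceTable` and `Rows`, and the one-day lookback are assumptions made for illustration.

```python
# A subset of the workspace tables listed above; extend as needed.
WORKSPACE_TABLES = [
    "AMLComputeClusterEvent",
    "AMLComputeJobEvent",
    "AMLRunStatusChangedEvent",
]


def row_count_query(tables, lookback="1d"):
    """Build a KQL query that counts recent rows per table."""
    if not tables:
        raise ValueError("at least one table name is required")
    return "\n".join([
        # isfuzzy=true lets the union proceed if a table is absent.
        f"union withsource=SourceTable isfuzzy=true {', '.join(tables)}",
        f"| where TimeGenerated > ago({lookback})",
        "| summarize Rows = count() by SourceTable",
    ])


query_text = row_count_query(WORKSPACE_TABLES)
```

You can paste the resulting text into the Log Analytics query editor. A table only returns rows if the matching diagnostic setting category sends data to the Log Analytics workspace.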
[!INCLUDE horz-monitor-ref-activity-log]
The following table lists some operations related to Machine Learning that might be recorded in the activity log. For a complete listing of Microsoft.MachineLearningServices operations, see Microsoft.MachineLearningServices resource provider operations.
Operation | Description |
---|---|
Creates or updates a Machine Learning workspace | A workspace was created or updated |
CheckComputeNameAvailability | Check if a compute name is already in use |
Creates or updates the compute resources | A compute resource was created or updated |
Deletes the compute resources | A compute resource was deleted |
List secrets | An operation listed secrets for a Machine Learning workspace |
Azure Machine Learning uses the following schemas.
Property | Description |
---|---|
TimeGenerated | Time when the log entry was generated |
OperationName | Name of the operation associated with the log event |
Category | Name of the log event |
JobId | ID of the Job submitted |
ExperimentId | ID of the Experiment |
ExperimentName | Name of the Experiment |
CustomerSubscriptionId | SubscriptionId where the Experiment and Job were submitted |
WorkspaceName | Name of the machine learning workspace |
ClusterName | Name of the Cluster |
ProvisioningState | State of the Job submission |
ResourceGroupName | Name of the resource group |
JobName | Name of the Job |
ClusterId | ID of the cluster |
EventType | Type of the Job event. For example, JobSubmitted, JobRunning, JobFailed, JobSucceeded. |
ExecutionState | State of the job (the Run). For example, Queued, Running, Succeeded, Failed |
ErrorDetails | Details of job error |
CreationApiVersion | API version used to create the job |
ClusterResourceGroupName | Resource group name of the cluster |
TFWorkerCount | Count of TF workers |
TFParameterServerCount | Count of TF parameter servers |
ToolType | Type of tool used |
RunInContainer | Flag describing if job should be run inside a container |
JobErrorMessage | Detailed message of the Job error |
NodeId | ID of the node where the job is running |
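As an illustration of how these properties can be used together, the sketch below counts failed jobs per cluster from records that have already been exported as dictionaries, for example from query results. It relies only on the `EventType` and `ClusterName` properties from the table above. The function name and the sample records are hypothetical.

```python
from collections import Counter


def failed_jobs_by_cluster(records):
    """Count records whose EventType is JobFailed, grouped by ClusterName."""
    return Counter(
        record.get("ClusterName", "<unknown>")
        for record in records
        if record.get("EventType") == "JobFailed"
    )


# Hypothetical records, for illustration only.
sample_records = [
    {"ClusterName": "cpu-cluster", "EventType": "JobFailed", "JobName": "a"},
    {"ClusterName": "cpu-cluster", "EventType": "JobSucceeded", "JobName": "b"},
    {"ClusterName": "gpu-cluster", "EventType": "JobFailed", "JobName": "c"},
    {"ClusterName": "cpu-cluster", "EventType": "JobFailed", "JobName": "d"},
]
failures = failed_jobs_by_cluster(sample_records)
```

For the failed records, the `ErrorDetails` and `JobErrorMessage` properties are the natural next place to look.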
Property | Description |
---|---|
TimeGenerated | Time when the log entry was generated |
OperationName | Name of the operation associated with the log event |
Category | Name of the log event |
ProvisioningState | Provisioning state of the cluster |
ClusterName | Name of the cluster |
ClusterType | Type of the cluster |
CreatedBy | User who created the cluster |
CoreCount | Count of the cores in the cluster |
VmSize | VM size of the cluster |
VmPriority | Priority of the nodes created inside a cluster (Dedicated or LowPriority) |
ScalingType | Type of cluster scaling (manual or auto) |
InitialNodeCount | Initial node count of the cluster |
MinimumNodeCount | Minimum node count of the cluster |
MaximumNodeCount | Maximum node count of the cluster |
NodeDeallocationOption | How the node should be deallocated |
Publisher | Publisher of the cluster type |
Offer | Offer with which the cluster is created |
Sku | SKU of the Node/VM created inside the cluster |
Version | Version of the image used while Node/VM is created |
SubnetId | SubnetId of the cluster |
AllocationState | Cluster allocation state |
CurrentNodeCount | Current node count of the cluster |
TargetNodeCount | Target node count of the cluster while scaling up/down |
EventType | Type of event during cluster creation. |
NodeIdleTimeSecondsBeforeScaleDown | Idle time in seconds before cluster is scaled down |
PreemptedNodeCount | Preempted node count of the cluster |
IsResizeGrow | Flag indicating that cluster is scaling up |
VmFamilyName | Name of the VM family of the nodes that can be created inside cluster |
LeavingNodeCount | Leaving node count of the cluster |
UnusableNodeCount | Unusable node count of the cluster |
IdleNodeCount | Idle node count of the cluster |
RunningNodeCount | Running node count of the cluster |
PreparingNodeCount | Preparing node count of the cluster |
QuotaAllocated | Allocated quota to the cluster |
QuotaUtilized | Utilized quota of the cluster |
AllocationStateTransitionTime | Transition time from one state to another |
ClusterErrorCodes | Error code received during cluster creation or scaling |
CreationApiVersion | API version used while creating the cluster |
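The node count and quota properties above can be condensed into a short per-cluster summary. The following sketch assumes that the count and quota properties hold numeric values (or numeric strings) and that a record is available as a dictionary. Neither assumption is stated in the table, and the function name `summarize_cluster` is hypothetical.

```python
NODE_STATE_FIELDS = (
    "IdleNodeCount",
    "RunningNodeCount",
    "PreparingNodeCount",
    "LeavingNodeCount",
    "UnusableNodeCount",
    "PreemptedNodeCount",
)


def summarize_cluster(record):
    """Summarize node states and quota usage for one cluster event record."""
    # Treat missing or empty values as zero.
    nodes = {name: int(record.get(name) or 0) for name in NODE_STATE_FIELDS}
    allocated = float(record.get("QuotaAllocated") or 0)
    utilized = float(record.get("QuotaUtilized") or 0)
    return {
        "ClusterName": record.get("ClusterName"),
        "nodes": nodes,
        # None signals that no quota is allocated, avoiding division by zero.
        "quota_ratio": utilized / allocated if allocated else None,
    }


# Hypothetical record, for illustration only.
summary = summarize_cluster({
    "ClusterName": "cpu-cluster",
    "IdleNodeCount": 1,
    "RunningNodeCount": 3,
    "QuotaAllocated": 8,
    "QuotaUtilized": 2,
})
```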
Property | Description |
---|---|
Type | Name of the log event, AmlComputeInstanceEvent |
TimeGenerated | Time (UTC) when the log entry was generated |
Level | The severity level of the event. Must be one of Informational, Warning, Error, or Critical. |
ResultType | The status of the event. Typical values include Started, In Progress, Succeeded, Failed, Active, and Resolved. |
CorrelationId | A GUID used to group together a set of related events, when applicable. |
OperationName | The name of the operation associated with the log entry |
Identity | The identity of the user or application that performed the operation. |
AadTenantId | The Microsoft Entra tenant ID the operation was submitted for. |
AmlComputeInstanceName | The name of the compute instance associated with the log entry. |
Property | Description |
---|---|
Type | Name of the log event, AmlDataLabelEvent |
TimeGenerated | Time (UTC) when the log entry was generated |
Level | The severity level of the event. Must be one of Informational, Warning, Error, or Critical. |
ResultType | The status of the event. Typical values include Started, In Progress, Succeeded, Failed, Active, and Resolved. |
CorrelationId | A GUID used to group together a set of related events, when applicable. |
OperationName | The name of the operation associated with the log entry |
Identity | The identity of the user or application that performed the operation. |
AadTenantId | The Microsoft Entra tenant ID the operation was submitted for. |
AmlProjectId | The unique identifier of the Azure Machine Learning project. |
AmlProjectName | The name of the Azure Machine Learning project. |
AmlLabelNames | The label class names that are created for the project. |
AmlDataStoreName | The name of the data store where the project's data is stored. |
Property | Description |
---|---|
Type | Name of the log event, AmlDataSetEvent |
TimeGenerated | Time (UTC) when the log entry was generated |
Level | The severity level of the event. Must be one of Informational, Warning, Error, or Critical. |
ResultType | The status of the event. Typical values include Started, In Progress, Succeeded, Failed, Active, and Resolved. |
AmlWorkspaceId | A GUID and unique ID of the Azure Machine Learning workspace. |
OperationName | The name of the operation associated with the log entry |
Identity | The identity of the user or application that performed the operation. |
AadTenantId | The Microsoft Entra tenant ID the operation was submitted for. |
AmlDatasetId | The ID of the Azure Machine Learning Data Set. |
AmlDatasetName | The name of the Azure Machine Learning Data Set. |
Property | Description |
---|---|
Type | Name of the log event, AmlDataStoreEvent |
TimeGenerated | Time (UTC) when the log entry was generated |
Level | The severity level of the event. Must be one of Informational, Warning, Error, or Critical. |
ResultType | The status of the event. Typical values include Started, In Progress, Succeeded, Failed, Active, and Resolved. |
AmlWorkspaceId | A GUID and unique ID of the Azure Machine Learning workspace. |
OperationName | The name of the operation associated with the log entry |
Identity | The identity of the user or application that performed the operation. |
AadTenantId | The Microsoft Entra tenant ID the operation was submitted for. |
AmlDatastoreName | The name of the Azure Machine Learning Data Store. |
Property | Description |
---|---|
Type | Name of the log event, AmlDeploymentEvent |
TimeGenerated | Time (UTC) when the log entry was generated |
Level | The severity level of the event. Must be one of Informational, Warning, Error, or Critical. |
ResultType | The status of the event. Typical values include Started, In Progress, Succeeded, Failed, Active, and Resolved. |
OperationName | The name of the operation associated with the log entry |
Identity | The identity of the user or application that performed the operation. |
AadTenantId | The Microsoft Entra tenant ID the operation was submitted for. |
AmlServiceName | The name of the Azure Machine Learning Service. |
Property | Description |
---|---|
Type | Name of the log event, AmlInferencingEvent |
TimeGenerated | Time (UTC) when the log entry was generated |
Level | The severity level of the event. Must be one of Informational, Warning, Error, or Critical. |
ResultType | The status of the event. Typical values include Started, In Progress, Succeeded, Failed, Active, and Resolved. |
OperationName | The name of the operation associated with the log entry |
Identity | The identity of the user or application that performed the operation. |
AadTenantId | The Microsoft Entra tenant ID the operation was submitted for. |
AmlServiceName | The name of the Azure Machine Learning Service. |
Property | Description |
---|---|
Type | Name of the log event, AmlModelsEvent |
TimeGenerated | Time (UTC) when the log entry was generated |
Level | The severity level of the event. Must be one of Informational, Warning, Error, or Critical. |
ResultType | The status of the event. Typical values include Started, In Progress, Succeeded, Failed, Active, and Resolved. |
OperationName | The name of the operation associated with the log entry |
Identity | The identity of the user or application that performed the operation. |
AadTenantId | The Microsoft Entra tenant ID the operation was submitted for. |
ResultSignature | The HTTP status code of the event. Typical values include 200, 201, and 202. |
AmlModelName | The name of the Azure Machine Learning Model. |
Property | Description |
---|---|
Type | Name of the log event, AmlPipelineEvent |
TimeGenerated | Time (UTC) when the log entry was generated |
Level | The severity level of the event. Must be one of Informational, Warning, Error, or Critical. |
ResultType | The status of the event. Typical values include Started, In Progress, Succeeded, Failed, Active, and Resolved. |
AmlWorkspaceId | A GUID and unique ID of the Azure Machine Learning workspace. |
AmlWorkspaceName | The name of the Azure Machine Learning workspace. |
OperationName | The name of the operation associated with the log entry |
Identity | The identity of the user or application that performed the operation. |
AadTenantId | The Microsoft Entra tenant ID the operation was submitted for. |
AmlModuleId | A GUID and unique ID of the module. |
AmlModelName | The name of the Azure Machine Learning Model. |
AmlPipelineId | The ID of the Azure Machine Learning pipeline. |
AmlParentPipelineId | The ID of the parent Azure Machine Learning pipeline (in the case of cloning). |
AmlPipelineDraftId | The ID of the Azure Machine Learning pipeline draft. |
AmlPipelineDraftName | The name of the Azure Machine Learning pipeline draft. |
AmlPipelineEndpointId | The ID of the Azure Machine Learning pipeline endpoint. |
AmlPipelineEndpointName | The name of the Azure Machine Learning pipeline endpoint. |
Property | Description |
---|---|
Type | Name of the log event, AmlRunEvent |
TimeGenerated | Time (UTC) when the log entry was generated |
Level | The severity level of the event. Must be one of Informational, Warning, Error, or Critical. |
ResultType | The status of the event. Typical values include Started, In Progress, Succeeded, Failed, Active, and Resolved. |
OperationName | The name of the operation associated with the log entry |
AmlWorkspaceId | A GUID and unique ID of the Azure Machine Learning workspace. |
Identity | The identity of the user or application that performed the operation. |
AadTenantId | The Microsoft Entra tenant ID the operation was submitted for. |
RunId | The unique ID of the run. |
Property | Description |
---|---|
Type | Name of the log event, AmlEnvironmentEvent |
TimeGenerated | Time (UTC) when the log entry was generated |
Level | The severity level of the event. Must be one of Informational, Warning, Error, or Critical. |
OperationName | The name of the operation associated with the log entry |
Identity | The identity of the user or application that performed the operation. |
AadTenantId | The Microsoft Entra tenant ID the operation was submitted for. |
AmlEnvironmentName | The name of the Azure Machine Learning environment configuration. |
AmlEnvironmentVersion | The version of the Azure Machine Learning environment configuration. |
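The event schemas above share several properties, including `Type`, `TimeGenerated`, and `Level`, and `Level` is restricted to four values. When exported records are processed outside Azure Monitor, a quick sanity check along these lines can flag malformed rows. This is a sketch under the assumption that each record is a dictionary keyed by property name. The function name `record_problems` is hypothetical.

```python
# The four severity levels named in the schema tables.
ALLOWED_LEVELS = frozenset({"Informational", "Warning", "Error", "Critical"})


def record_problems(record):
    """Return a list of human-readable problems found in one event record."""
    issues = []
    level = record.get("Level")
    if level not in ALLOWED_LEVELS:
        issues.append(f"unexpected Level: {level!r}")
    if not record.get("TimeGenerated"):
        issues.append("missing TimeGenerated")
    return issues
```

An empty list means the record passed both checks. The check deliberately ignores `ResultType`, because the tables describe its values as typical rather than exhaustive.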
[!INCLUDE endpoint-monitor-traffic-reference]
For more information on this log, see Monitor online endpoints.
[!INCLUDE endpoint-monitor-console-reference]
For more information on this log, see Monitor online endpoints.
[!INCLUDE endpoint-monitor-event-reference]
For more information on this log, see Monitor online endpoints.
- See Monitor Machine Learning for a description of monitoring Machine Learning.
- See Monitor Azure resources with Azure Monitor for details on monitoring Azure resources.