Add distributed commands related metrics to job service
### What changes are proposed in this pull request?
Add new distributed command metrics: operation counts by status
(success, fail, cancel), file counts, and total file sizes.
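
As a rough illustration of the counting scheme described above, the sketch below keeps one counter per status plus file count and file size counters. It is not the code from this commit: `JobOutcomeCounters`, `Outcome`, and `record` are hypothetical names, a plain `LongAdder` map stands in for the real metrics registry, and updating file metrics only on success is an assumption.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stand-in for a metrics registry; not Alluxio's MetricsSystem.
class JobOutcomeCounters {
  enum Outcome { SUCCESS, FAIL, CANCEL }

  private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

  private void inc(String name, long delta) {
    counters.computeIfAbsent(name, k -> new LongAdder()).add(delta);
  }

  /**
   * Records one finished operation under the given metric prefix.
   * Assumption: file count and size are only added for successful operations.
   */
  public void record(String prefix, Outcome outcome, long fileCount, long fileSizeBytes) {
    switch (outcome) {
      case SUCCESS:
        inc(prefix + "Success", 1);
        inc(prefix + "FileCount", fileCount);
        inc(prefix + "FileSize", fileSizeBytes);
        break;
      case FAIL:
        inc(prefix + "Fail", 1);
        break;
      case CANCEL:
        inc(prefix + "Cancel", 1);
        break;
    }
  }

  /** Returns the current value of a counter, or 0 if it was never touched. */
  public long get(String name) {
    LongAdder adder = counters.get(name);
    return adder == null ? 0 : adder.sum();
  }

  public static void main(String[] args) {
    JobOutcomeCounters c = new JobOutcomeCounters();
    // One successful copy of a single 12-byte file.
    c.record("Master.MigrateJob", Outcome.SUCCESS, 1, 12);
    System.out.println(c.get("Master.MigrateJobSuccess"));
    System.out.println(c.get("Master.MigrateJobFileSize"));
  }
}
```

Keeping the status in an enum makes it hard to forget a case when a new terminal state is added; the actual commit may structure this differently.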

1. Tests run in local mode.
2. Add unit test for cancel operations.

Ran the distributedLoad command and saw the counters change
accordingly.
` ./bin/alluxio fs distributedLoad /data/1023
Allow up to 3000 active jobs
/data/1023 loading
Successfully loaded path /data/1023 after 1 attempts
Completed count is 1,Failed count is 0.`
"JobMaster.Master.JobDistributedLoadCancel" : {
"count" : 0
},
"JobMaster.Master.JobDistributedLoadFail" : {
"count" : 0
},
"JobMaster.Master.JobDistributedLoadFileCount" : {
"count" : 1
},
"JobMaster.Master.JobDistributedLoadFileSizes" : {
"count" : 12
},
"JobMaster.Master.JobDistributedLoadSuccess" : {
"count" : 1
},
"meters" : {
"JobMaster.Master.JobDistributedLoadRate" : {
"count" : 12,
"m15_rate" : 0.01329636478824785,
"m1_rate" : 0.1918934048896241,
"m5_rate" : 0.039668510828118016,
"mean_rate" : 0.0051907953589612026,
"units" : "events/second"
}
},
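
The meter fields above match the Dropwizard Metrics meter format. Assuming that library's definition, `mean_rate` is the total count divided by the time elapsed since the meter was created, so a small value here mostly reflects how long the process had been running. The meter's count of 12 equals the file size counter, which suggests (an assumption, not confirmed by this message) that the meter is marked with bytes loaded. A minimal sketch of the arithmetic, with hypothetical class and method names:

```java
// Sketch of the mean-rate arithmetic only; this is not the Dropwizard Meter class.
class MeanRateSketch {
  /** Mean rate in events per second: total count over elapsed seconds. */
  static double meanRate(long count, double elapsedSeconds) {
    if (elapsedSeconds <= 0) {
      return 0.0;
    }
    return count / elapsedSeconds;
  }

  /** Inverts the formula to estimate how long the meter has existed. */
  static double elapsedSeconds(long count, double meanRate) {
    return count / meanRate;
  }

  public static void main(String[] args) {
    // Values taken from the JSON output above.
    double elapsed = elapsedSeconds(12, 0.0051907953589612026);
    System.out.printf("meter age: about %.0f seconds%n", elapsed);
  }
}
```

With the values shown, the meter would have existed for roughly 2,300 seconds when the snapshot was taken.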

Ran the distributedCp command and saw the counters change as shown below:
`./bin/alluxio fs distributedCp /data/1023 /data/1023copy
Allow up to 3000 active jobs
Copying /data/1023 to /data/1023copy
Successfully copied /data/1023 to /data/1023copy after 1 attempts`
"JobMaster.Master.MigrateJobCancel" : {
"count" : 0
},
"JobMaster.Master.MigrateJobFail" : {
"count" : 0
},
"JobMaster.Master.MigrateJobFileCount" : {
"count" : 1
},
"JobMaster.Master.MigrateJobFileSize" : {
"count" : 12
},
"JobMaster.Master.MigrateJobSuccess" : {
"count" : 1
},

Persist metrics after running runTests and manually loading or
copying files (14 files in total):
"JobMaster.Master.AsyncPersistCancel" : {
"count" : 0
},
"JobMaster.Master.AsyncPersistFail" : {
"count" : 0
},
"JobMaster.Master.AsyncPersistFileCount" : {
"count" : 14
},
"JobMaster.Master.AsyncPersistFileSize" : {
"count" : 1059
},
"JobMaster.Master.AsyncPersistSuccess" : {
"count" : 14
},

Please outline the changes and how this PR fixes the issue.

### Why are the changes needed?

Please clarify why the changes are needed. For instance,
1. If you propose a new API, clarify the use case for a new API.
2. If you fix a bug, describe the bug.

### Does this PR introduce any user facing changes?

Please list the user-facing changes introduced by your change, including
1. change in user-facing APIs
2. addition or removal of property keys
3. webui

pr-link: #14678
change-id: cid-6b60c5dc9c4c508b9b3fc3610b1f3046aa7600c2
luzhang6 committed Jan 15, 2022
1 parent 9f51b06 commit 696cb89
Showing 7 changed files with 631 additions and 4 deletions.
84 changes: 84 additions & 0 deletions core/common/src/main/java/alluxio/metrics/MetricKey.java
@@ -738,6 +738,90 @@ public MetricKey build() {
.setMetricType(MetricType.COUNTER)
.build();

// Distributed command related metrics
public static final MetricKey MASTER_JOB_DISTRIBUTED_LOAD_SUCCESS =
new Builder("Master.JobDistributedLoadSuccess")
.setDescription("The number of successful DistributedLoad operations")
.setMetricType(MetricType.COUNTER)
.build();
public static final MetricKey MASTER_JOB_DISTRIBUTED_LOAD_FAIL =
new Builder("Master.JobDistributedLoadFail")
.setDescription("The number of failed DistributedLoad operations")
.setMetricType(MetricType.COUNTER)
.build();
public static final MetricKey MASTER_JOB_DISTRIBUTED_LOAD_CANCEL =
new Builder("Master.JobDistributedLoadCancel")
.setDescription("The number of cancelled DistributedLoad operations")
.setMetricType(MetricType.COUNTER)
.build();
public static final MetricKey MASTER_JOB_DISTRIBUTED_LOAD_FILE_COUNT =
new Builder("Master.JobDistributedLoadFileCount")
.setDescription("The number of files by DistributedLoad operations")
.setMetricType(MetricType.COUNTER)
.build();
public static final MetricKey MASTER_JOB_DISTRIBUTED_LOAD_FILE_SIZE =
new Builder("Master.JobDistributedLoadFileSizes")
.setDescription("The total file size by DistributedLoad operations")
.setMetricType(MetricType.COUNTER)
.build();
public static final MetricKey MASTER_JOB_DISTRIBUTED_LOAD_RATE =
new Builder("Master.JobDistributedLoadRate")
.setDescription("The average DistributedLoad loading rate")
.setMetricType(MetricType.METER)
.setIsClusterAggregated(true)
.build();
public static final MetricKey MASTER_MIGRATE_JOB_SUCCESS =
new Builder("Master.MigrateJobSuccess")
.setDescription("The number of successful MigrateJob operations")
.setMetricType(MetricType.COUNTER)
.build();
public static final MetricKey MASTER_MIGRATE_JOB_FAIL =
new Builder("Master.MigrateJobFail")
.setDescription("The number of failed MigrateJob operations")
.setMetricType(MetricType.COUNTER)
.build();
public static final MetricKey MASTER_MIGRATE_JOB_CANCEL =
new Builder("Master.MigrateJobCancel")
.setDescription("The number of cancelled MigrateJob operations")
.setMetricType(MetricType.COUNTER)
.build();
public static final MetricKey MASTER_MIGRATE_JOB_FILE_COUNT =
new Builder("Master.MigrateJobFileCount")
.setDescription("The number of MigrateJob files")
.setMetricType(MetricType.COUNTER)
.build();
public static final MetricKey MASTER_MIGRATE_JOB_FILE_SIZE =
new Builder("Master.MigrateJobFileSize")
.setDescription("The total size of MigrateJob files")
.setMetricType(MetricType.COUNTER)
.build();

public static final MetricKey MASTER_ASYNC_PERSIST_SUCCESS =
new Builder("Master.AsyncPersistSuccess")
.setDescription("The number of successful AsyncPersist operations")
.setMetricType(MetricType.COUNTER)
.build();
public static final MetricKey MASTER_ASYNC_PERSIST_FAIL =
new Builder("Master.AsyncPersistFail")
.setDescription("The number of failed AsyncPersist operations")
.setMetricType(MetricType.COUNTER)
.build();
public static final MetricKey MASTER_ASYNC_PERSIST_CANCEL =
new Builder("Master.AsyncPersistCancel")
.setDescription("The number of cancelled AsyncPersist operations")
.setMetricType(MetricType.COUNTER)
.build();
public static final MetricKey MASTER_ASYNC_PERSIST_FILE_COUNT =
new Builder("Master.AsyncPersistFileCount")
.setDescription("The number of files created by AsyncPersist operations")
.setMetricType(MetricType.COUNTER)
.build();
public static final MetricKey MASTER_ASYNC_PERSIST_FILE_SIZE =
new Builder("Master.AsyncPersistFileSize")
.setDescription("The total size of files created by AsyncPersist operations")
.setMetricType(MetricType.COUNTER)
.build();

// Cluster metrics
public static final MetricKey CLUSTER_ACTIVE_RPC_READ_COUNT =
new Builder("Cluster.ActiveRpcReadCount")
@@ -29,7 +29,7 @@
@ThreadSafe
public class MigrateConfig implements PlanConfig {
private static final long serialVersionUID = 8014674802258120190L;
- private static final String NAME = "Migrate";
+ public static final String NAME = "Migrate";

private final String mSource;
private final String mDestination;
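
Making `NAME` public presumably lets code outside `MigrateConfig` compare a plan's name against it, for example to pick which metric family to update. The sketch below shows that idea under stated assumptions: `MetricPrefixSketch` and `metricPrefix` are hypothetical, `MIGRATE_NAME` mirrors the `"Migrate"` constant from the diff, and `LOAD_NAME` is an illustrative value that is not taken from this commit.

```java
import java.util.Optional;

// Hypothetical dispatch from a plan name to a metric name prefix.
class MetricPrefixSketch {
  // Mirrors MigrateConfig.NAME from the diff above.
  static final String MIGRATE_NAME = "Migrate";
  // Assumed name for the load plan; not shown in this commit.
  static final String LOAD_NAME = "Load";

  /** Returns the metric prefix for a plan name, or empty if the plan is not tracked. */
  static Optional<String> metricPrefix(String planName) {
    if (MIGRATE_NAME.equals(planName)) {
      return Optional.of("Master.MigrateJob");
    }
    if (LOAD_NAME.equals(planName)) {
      return Optional.of("Master.JobDistributedLoad");
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    System.out.println(metricPrefix("Migrate").orElse("(untracked)"));
    System.out.println(metricPrefix("Unknown").orElse("(untracked)"));
  }
}
```

Comparing against a shared constant rather than a repeated string literal keeps the lookup in sync if the plan name ever changes.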