Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove Short Circuit read/write #17243

Merged
merged 3 commits into from
Jul 16, 2023

Conversation

apc999
Copy link
Contributor

@apc999 apc999 commented Apr 13, 2023

What changes are proposed in this pull request?

Remove source code related to Short Circuit read/write

Why are the changes needed?

  • In K8s env (increasingly more popular), support short-ciruit I/O is challenging
  • We shall use SDK instead

Does this PR introduce any user facing changes?

No more Short Circuit read/write

@alluxio-bot alluxio-bot added the API Change Changes covering public API label Apr 13, 2023
@apc999 apc999 requested a review from beinan April 18, 2023 05:25
@apc999 apc999 force-pushed the dora-remove-shortcircuit branch 3 times, most recently from 8605ed6 to dc05cb1 Compare April 19, 2023 17:44
@ssz1997
Copy link
Contributor

ssz1997 commented Apr 19, 2023

Shall the short-circuit-related metrics be removed in this PR as well? Or metrics cleanup will be an another process?

@Xenorith Xenorith force-pushed the dora branch 3 times, most recently from 253ed31 to 6ecdda5 Compare April 28, 2023 21:46
@apc999 apc999 requested review from jiacheliu3 and removed request for beinan July 14, 2023 06:49
Copy link
Contributor

@jiacheliu3 jiacheliu3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some hard coded tests are complaining:

2023-07-14T07:29:09.9167401Z 07:29:09.915 [ERROR] Failures: 
2023-07-14T07:29:09.9168088Z 07:29:09.915 [ERROR] alluxio.cli.fsadmin.report.MetricsCommandTest.metrics
2023-07-14T07:29:09.9169006Z 07:29:09.915 [ERROR]   Run 1: MetricsCommandTest.metrics:62->checkIfOutputValid:145 
2023-07-14T07:29:09.9175392Z Expected: iterable containing ["Cluster.BytesReadDomain  (Type: GAUGE, Value: 4145.73KB)", "Cluster.BytesReadDomainThroughput  (Type: GAUGE, Value: 29.97MB/MIN)", "Cluster.BytesReadLocal  (Type: GAUGE, Value: 11.47GB)", "Cluster.BytesReadLocalThroughput  (Type: GAUGE, Value: 117.42MB/MIN)", "Cluster.BytesReadRemote  (Type: GAUGE, Value: 401.79MB)", "Cluster.BytesReadRemoteThroughput  (Type: GAUGE, Value: 518.36MB/MIN)", "Cluster.BytesReadUfsAll  (Type: GAUGE, Value: 509.47MB)", "Cluster.BytesReadUfsThroughput  (Type: GAUGE, Value: 728.16KB/MIN)", "Cluster.BytesWrittenDomain  (Type: GAUGE, Value: 62.43MB)", "Cluster.BytesWrittenDomainThroughput  (Type: GAUGE, Value: 1202.37KB/MIN)", "Cluster.BytesWrittenRemote  (Type: GAUGE, Value: 22.98KB)", "Cluster.BytesWrittenRemoteThroughput  (Type: GAUGE, Value: 8.03MB/MIN)", "Cluster.BytesWrittenUfsAll  (Type: GAUGE, Value: 317.70KB)", "Cluster.BytesWrittenUfsThroughput  (Type: GAUGE, Value: 33.46KB/MIN)", "Cluster.CapacityTotal  (Type: GAUGE, Value: 1,154,531,246,129,122)", "Master.CompleteFileOps  (Type: COUNTER, Value: 813)", "Master.UfsSessionCount-Ufs:_alluxio_underFSStorage  (Type: COUNTER, Value: 8,535)", "Master.UfsSessionCount-Ufs:file:___Users_alluxio_alluxioMountedFolder  (Type: COUNTER, Value: 1,231)", "Master.getMetrics.User:alluxio  (Type: TIMER, Value: 4)"]
2023-07-14T07:29:09.9178398Z      but: item 2: was "Cluster.BytesReadRemote  (Type: GAUGE, Value: 401.79MB)"
2023-07-14T07:29:09.9179129Z 07:29:09.915 [ERROR]   Run 2: MetricsCommandTest.metrics:62->checkIfOutputValid:145 
2023-07-14T07:29:09.9184592Z Expected: iterable containing ["Cluster.BytesReadDomain  (Type: GAUGE, Value: 4145.73KB)", "Cluster.BytesReadDomainThroughput  (Type: GAUGE, Value: 29.97MB/MIN)", "Cluster.BytesReadLocal  (Type: GAUGE, Value: 11.47GB)", "Cluster.BytesReadLocalThroughput  (Type: GAUGE, Value: 117.42MB/MIN)", "Cluster.BytesReadRemote  (Type: GAUGE, Value: 401.79MB)", "Cluster.BytesReadRemoteThroughput  (Type: GAUGE, Value: 518.36MB/MIN)", "Cluster.BytesReadUfsAll  (Type: GAUGE, Value: 509.47MB)", "Cluster.BytesReadUfsThroughput  (Type: GAUGE, Value: 728.16KB/MIN)", "Cluster.BytesWrittenDomain  (Type: GAUGE, Value: 62.43MB)", "Cluster.BytesWrittenDomainThroughput  (Type: GAUGE, Value: 1202.37KB/MIN)", "Cluster.BytesWrittenRemote  (Type: GAUGE, Value: 22.98KB)", "Cluster.BytesWrittenRemoteThroughput  (Type: GAUGE, Value: 8.03MB/MIN)", "Cluster.BytesWrittenUfsAll  (Type: GAUGE, Value: 317.70KB)", "Cluster.BytesWrittenUfsThroughput  (Type: GAUGE, Value: 33.46KB/MIN)", "Cluster.CapacityTotal  (Type: GAUGE, Value: 1,154,531,246,129,122)", "Master.CompleteFileOps  (Type: COUNTER, Value: 813)", "Master.UfsSessionCount-Ufs:_alluxio_underFSStorage  (Type: COUNTER, Value: 8,535)", "Master.UfsSessionCount-Ufs:file:___Users_alluxio_alluxioMountedFolder  (Type: COUNTER, Value: 1,231)", "Master.getMetrics.User:alluxio  (Type: TIMER, Value: 4)"]
2023-07-14T07:29:09.9187204Z      but: item 2: was "Cluster.BytesReadRemote  (Type: GAUGE, Value: 401.79MB)"

Other than that, I browsed all changes and I think they mostly LGTM. I need to take a closer look tonight. From the scale of this PR (~2000 lines), I figure I can do a full scan of all changed lines.

If a change grows much larger than that (>3000 lines or even much larger than that), I believe it's close to impossible to do a thorough review. And for those PRs I'd recommend:

  1. Break it into smaller reviewable parts and allow leftovers in between, as long as leftovers do not interfere other code;
  2. Or try to allow ONLY removing code files as a whole - try to do changes as sparingly as possible because the changes cannot be reviewed or noticed at all. If a change in-line is compulsory, segregate that to one commit and log it in description. That was the approach we collectively took in Remove block interface - worker side #17638 with ~40k lines of change.

Copy link
Contributor

@jiacheliu3 jiacheliu3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work!

Found some extra occurrences of the term short circuit, which are not included here. It depends on your intention of this PR whether we want to fix those individual occurrences scattered every where, or you may create a separate one to address those comment-only occurrences.

Some other relevant changes you won't be able to cover in this PR, but might need extra awareness:

  1. In web UI there are at least some gauges for short circuit related metrics for display
  2. Some relevant sections in the doc pages

*
* @return the cache hit local
*/
public String getCacheHitLocal() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[no action required in this PR I guess]

Seems it used to be:

local -> short circuit
domain -> domain socket
remote -> remote grpc not including local domain socket traffic

Now we take away "short circuit", then "local" is gone. Suddenly we only have "domain" and "remote" and no "local", although "domain" should theoretically be "local".

My point is, the opposite of "remote" should be "local" instead of "domain". Now the naming becomes a bit misleading because "local" is gone. A user may just ask "Is domain local or not?"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm good with the code change here, in that we are just removing code instead of doing a refactor. It's just we may need to rename the metrics in the next refactor attempt.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree. let's revisit the metrics name later. I will leave a TODO here

@@ -115,49 +114,27 @@ public static BlockInStream create(FileSystemContext context, BlockInfo info,
}

AlluxioConfiguration alluxioConf = context.getClusterConf();
boolean shortCircuit = alluxioConf.getBoolean(PropertyKey.USER_SHORT_CIRCUIT_ENABLED);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the comment of this method, there's "short-circuit" mentioned.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

@@ -57,9 +57,6 @@ private Factory() {} // prevent instantiation
public static DataWriter create(FileSystemContext context, long blockId, long blockSize,
WorkerNetAddress address, OutStreamOptions options) throws IOException {
AlluxioConfiguration alluxioConf = context.getClusterConf();
boolean shortCircuit = alluxioConf.getBoolean(PropertyKey.USER_SHORT_CIRCUIT_ENABLED);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In GrpcDataWriter there are comments mentioning "short-circuit". I found those occurrences by searching "shortcircuit" and "short-circuit" and "short circuit".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some other occurrences include Format.java, package-info.java, BlockInStream.java, DecommissionWorkerCommand etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all removed

Copy link
Contributor Author

@apc999 apc999 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java comment part addressed. PTAL

For webui/metrics, we shall have a separate round of revamp

@@ -115,49 +114,27 @@ public static BlockInStream create(FileSystemContext context, BlockInfo info,
}

AlluxioConfiguration alluxioConf = context.getClusterConf();
boolean shortCircuit = alluxioConf.getBoolean(PropertyKey.USER_SHORT_CIRCUIT_ENABLED);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

@@ -57,9 +57,6 @@ private Factory() {} // prevent instantiation
public static DataWriter create(FileSystemContext context, long blockId, long blockSize,
WorkerNetAddress address, OutStreamOptions options) throws IOException {
AlluxioConfiguration alluxioConf = context.getClusterConf();
boolean shortCircuit = alluxioConf.getBoolean(PropertyKey.USER_SHORT_CIRCUIT_ENABLED);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all removed

Copy link
Contributor

@jiacheliu3 jiacheliu3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the work!

@apc999
Copy link
Contributor Author

apc999 commented Jul 16, 2023

alluxio-bot, merge this please

@alluxio-bot
Copy link
Contributor

merge failed:
Merge refused because pull request does not have label start with type-

@apc999 apc999 added the type-debt This issue is about tech debt label Jul 16, 2023
@apc999
Copy link
Contributor Author

apc999 commented Jul 16, 2023

alluxio-bot, merge this please

@alluxio-bot alluxio-bot merged commit 7f98cab into Alluxio:main Jul 16, 2023
15 checks passed
@apc999 apc999 deleted the dora-remove-shortcircuit branch July 18, 2023 04:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Change Changes covering public API type-debt This issue is about tech debt
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants