Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add collectInfo command #10596

Merged
merged 35 commits into from Feb 13, 2020
Merged

Add collectInfo command #10596

merged 35 commits into from Feb 13, 2020

Conversation

jiacheliu3
Copy link
Contributor

@jiacheliu3 jiacheliu3 commented Dec 9, 2019

Add Alluxio CollectInfo command set to Alluxio command line. This command will collect information on every host in the Alluxio cluster, tarball the information into a local tarball file, then gather all the tarballs back to where the command is issued.

This command is designed to help troubleshooting and collect information from Alluxio clusters. The collected information include logs, config files and it also executes a set of commands on each of the hosts and include outputs in the tarball.

Added commands:
This will run one or all checks on the local machine and write the output into files in targetDir. At the end of execution a tarball will be produced containing all the information collected, put in the targetDir as well.
bin/alluxio collectInfo --local [all/collectConfig/collectEnv/collectLog/collectMetrics/collectAlluxioInfo] targetDir

This will run the corresponding collectInfo --local command on all hosts in the Alluxio cluster defined in conf/masters and conf/workers via SSH. After that the tarballs generated by each individual infoBundle call will be SCPed into a local temp directory, and then put into one final tarball in the targetDir.
bin/alluxio collectInfo [all/collectConfig/collectEnv/collectLog/collectMetrics/collectAlluxioInfo] targetDir

Phase 2:
This PR is the Phase 1 of CollectInfo feature. This PR provides basic implementations and defines the structure.

Some features are not implemented in this PR and scheduled to a separate Phase 2 PR. These features include the following:

  1. Skipping a command when pre-existing work are found. This is because if the cluster is large, running infoBundle can take a long time and fail in the middle.
  2. Add a -f option to force overwrite existing work.
  3. Add a -c or --components option to copy only certain logs.
  4. Add a -h or --hosts option to only invoke collectInfo on certain hosts.
  5. Add security check to avoid copying conf/ files that contain credentials.

@alluxio-bot
Copy link
Contributor

Automated checks report:

  • AmplabJenkins build check: PENDING
    • We were not able to detect AmplabJenkins test results on this PR. Status will update when testing completes.
  • Commits associated with Github account: PASS
  • PR title follows the conventions: PASS

Some checks failed. Please fix the reported issues and reply 'alluxio-bot, check this please' to re-run checks.

@jiacheliu3 jiacheliu3 changed the title Log bundle command [WIP] Log bundle command Dec 9, 2019
@AmplabJenkins
Copy link

Merged build finished. Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Alluxio-Pull-Request-Builder/7217/
Test FAILed.

@AmplabJenkins
Copy link

Merged build finished. Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Alluxio-Pull-Request-Builder/7231/
Test FAILed.

@jiacheliu3 jiacheliu3 changed the title [WIP] Log bundle command [WIP] Info bundle command Dec 11, 2019
@AmplabJenkins
Copy link

Merged build finished. Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Alluxio-Pull-Request-Builder/7264/
Test FAILed.

@AmplabJenkins
Copy link

Merged build finished. Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Alluxio-Pull-Request-Builder/7300/
Test FAILed.

alluxio-bot pushed a commit that referenced this pull request Jan 7, 2020
This change adds a wrapper to SSH executing a command to ShellUtils,
together with SCP commands.

`ShellUtils` class refactored with inner classed extracted to standalone
classes.

PR #10596 relies on this utility
as it extensively invokes commands over SSH.

pr-link: #10658
change-id: cid-d4d74b1d4b1aff513fe298effee509f620e5821c
@AmplabJenkins
Copy link

Merged build finished. Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Alluxio-Pull-Request-Builder/7830/
Test FAILed.

@AmplabJenkins
Copy link

Merged build finished. Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Alluxio-Pull-Request-Builder/7831/
Test FAILed.

@AmplabJenkins
Copy link

Merged build finished. Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Alluxio-Pull-Request-Builder/7886/
Test FAILed.

@jiacheliu3
Copy link
Contributor Author

@jiacheliu3 have you considered having a single command , and just using an option like --local for just collecting the local info? It is confusing to have 2 commands collectInfo and collectInfoAll

Good idea. Now CollectInfoAll has been merged into CollectInfo and added a --local option as the switch. @gpang PTAL at the new logic now. I may add more unit test coverage tmr.

@AmplabJenkins
Copy link

Merged build finished. Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Alluxio-Pull-Request-Builder/8150/
Test FAILed.

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Alluxio-Pull-Request-Builder/8151/
Test PASSed.

@gpang
Copy link
Contributor

gpang commented Feb 12, 2020

@jiacheliu3 Thanks! Can you update the PR title and description with the updated syntax?

new AlluxioCommand(mAlluxioPath, "fs mount"), null);
registerCommand("version",
new AlluxioCommand(mAlluxioPath, "version -r"), null);
registerCommand("job",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bradyoo is there a better api for getting JobService information instead of dumping the list of JobID's? If not, I'd rather we just collect the number of jobs instead of the jobs themselves. A list of JobId's will have almost not use to an external reviewer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

job ID's might not be immediately useful, but a string dump of the Job type and config for each would be the most useful IMO

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI @ns1123 job ls in 2.2 prints Job Type

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool.

registerCommand("job",
new AlluxioCommand(mAlluxioPath, "job ls"), null);
registerCommand("journal",
new AlluxioCommand(mAlluxioPath, String.format("fs ls -R %s",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming the journal is local not UFS. The journal is pretty important for our review so I'd definitely try to handle at least HDFS journal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added logic to check if the journal dir is local or hdfs.
Local -> ls -R journalDir
Hdfs -> hdfs dfs -ls -R journalDir
Are there any more logic that can be helpful?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pretty much covers all scenarios that we want to.


@Override
protected void registerCommands() {
registerCommand("ps", new ShellCommand(new String[]{"ps", "-ef", "|grep alluxio*"}), null);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to also grab information about a presto/spark/yarn process if it's colocated?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added attempts to grab Spark/Yarn/Hdfs/Presto.

Map<String, String> procs = new HashMap<>();

// Get Jps output
String[] jpsCommand = new String[]{"jps"};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interestingly, jstack will only work if run within the same cgroup. Even sudo will not be able to jstack another user process correctly. WRT too many java processes, that is one of the pieces of information we want to be aware of. I think it's perfectly fine to grab all java processes on the machine.

@jiacheliu3 jiacheliu3 changed the title Add collectInfo and collectInfoAll command Add collectInfo command Feb 13, 2020
@jiacheliu3
Copy link
Contributor Author

@jiacheliu3 Thanks! Can you update the PR title and description with the updated syntax?

This is done. Thanks for the catch!

Copy link
Contributor

@bf8086 bf8086 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me. Only a couple minor comments.

shell/src/main/java/alluxio/cli/bundler/CollectInfo.java Outdated Show resolved Hide resolved
shell/src/main/java/alluxio/cli/bundler/CollectInfo.java Outdated Show resolved Hide resolved
shell/src/main/java/alluxio/cli/bundler/CollectInfo.java Outdated Show resolved Hide resolved
@jiacheliu3
Copy link
Contributor Author

@ns1123 PTAL now the CollectJvmInfoCommand attempts sudo.

@AmplabJenkins
Copy link

Merged build finished. Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Alluxio-Pull-Request-Builder/8187/
Test FAILed.

@AmplabJenkins
Copy link

Merged build finished. Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Alluxio-Pull-Request-Builder/8188/
Test FAILed.

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Alluxio-Pull-Request-Builder/8189/
Test PASSed.

@jiacheliu3
Copy link
Contributor Author

@bf8086 @ns1123 if you have great ideas on improvements that are not low-hanging fruits (like a smarter extra command option), I would propose we leave a mark there and I can address it in the phase 2 improvement on this command. This PR is getting a little too long to keep track and keep focused on all aspects.

Copy link
Contributor

@gpang gpang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ns1123
Copy link
Contributor

ns1123 commented Feb 13, 2020

LGTM. Thanks @jiacheliu3 for doing this! looking forward to getting to use it.

Copy link
Contributor

@bf8086 bf8086 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@calvinjia
Copy link
Contributor

alluxio-bot, merge this please.

@alluxio-bot alluxio-bot merged commit 1807acf into Alluxio:master Feb 13, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants