Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify help message for "lenkf.j -debug" #591

Merged
merged 6 commits into from
Oct 18, 2022

Conversation

weiyuan-jiang
Copy link
Contributor

@weiyuan-jiang weiyuan-jiang commented Oct 13, 2022

Clarify help message that users see when running "lenkf.j -debug".

@weiyuan-jiang weiyuan-jiang added the 0-diff trivial very, very obvious 0-diff change label Oct 13, 2022
@weiyuan-jiang weiyuan-jiang requested a review from a team as a code owner October 13, 2022 13:51
Copy link
Contributor

@gmao-rreichle gmao-rreichle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@weiyuan-jiang, thanks for the proposed change. I don't think it's a good idea, though. Totalview may not be available on some compute systems. Also, the one-line command isn't onerous given that the user still has to go through the totalview option windows and select the MPI stack etc. It's also good to have the message pointing to the Wiki page with instructions. The message could, of course, be added back, but if lenkf.j launches totalview, the user is almost certainly going to miss the message because they are then presented with a totalview window.

@weiyuan-jiang
Copy link
Contributor Author

weiyuan-jiang commented Oct 13, 2022

I think the wiki is obsolete. The debugger can not be launched that way. I will update wiki

@gmao-rreichle
Copy link
Contributor

I think the wiki is obsolete. The debugger can not be launched that way. I will update wiki

@weiyuan-jiang, @gmao-qliu:
Please do not yet edit the Wiki. I'm rather confused now. In #587, there is comment that things worked for @gmao-qliu, which plainly contradicts @weiyuan-jiang's message above. It's not clear to me what works and what does not.

For the record, I'm adding an excerpt from an off-github email by @gmao-qliu on 13-Oct-2022:
"My understanding is that running lenkf.j will start a new shell which is different from the current one. initiating Totalview within lenkf.j will ensure totalview to start from the same shell as "source g5_modules". Weiyuan can correct me if inaccurate."

So, can we launch totalview manually from the command line after "source g5_modules"? That would be a minimal addition to the Wiki and would make the code less specific on what is available on Discover.

@weiyuan-jiang
Copy link
Contributor Author

can we launch totalview manually from the command line after "source g5_modules"? That would be a minimal addition to the Wiki and would make the code less specific on what is available on Discover.

Yes, we can. So you want "lenkf.j -debug" just echo help information?

@gmao-qliu
Copy link
Contributor

can we launch totalview manually from the command line after "source g5_modules"? That would be a minimal addition to the Wiki and would make the code less specific on what is available on Discover.

Yes, we can. So you want "lenkf.j -debug" just echo help information?

yes. that'll work. but can we remove the "source g5_module" part if with "-debug"? It outputs "g5_modules: Setting BASEDIR and modules for .." which is confusing.

an attempt to get back to the original, system-agnostic approach of stopping lenkf.j before manually launching the debugger;  probably needs further tweaking
Comment on lines 41 to 43
if ( $debug_flag == 0 ) then
source $GEOSBIN/g5_modules
endif
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is meant to take care of @gmao-qliu's suggestion to avoid "source g5_modules" when in debug mode.
If this works, then I'm not sure why we would need unset argv; setenv argv above that was introduced to avoid g5_modules tripping up when lenkf.j is called with the -debug flag. Maybe the suggestion to omit "source g5_modules" here won't work? I'm getting really confused.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since '-debug" option only prints out information, I would like to change it to "--help". It prints out and exits before source g5_modules

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@weiyuan-jiang : I don't think --help is good here. Doesn't -debug run all of the preprocessing and stops only when it's time to run GEOSldas.x? In other words, -debug does not just print out info.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having said that, does the preprocessing possibly need "source g5_modules"? @gmao-qliu: Did you verify that we can skip "source g5_modules" when running with -debug?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, the preprocessing is not necessary for debug. Except users change the directory to scratch and launch totalview

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having said that, does the preprocessing possibly need "source g5_modules"? @gmao-qliu: Did you verify that we can skip "source g5_modules" when running with -debug?

actually "source g5_modules" is needed with -debug because otherwise $BASEDIR is undefined. I take my suggestion back.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After users finish ldas_setup and get the interactive nodes, they can go to the "run" directory, source ../build/bin/g5_modules.sh , then load and launch totalview in run directory

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@weiyuan-jiang : Are you seriously saying that the first 200-300 lines in lenkf.j are not needed? If that's the case, why can't we just delete them altogether?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@weiyuan-jiang, maybe what you are suggesting is to run in debug mode from a crashed experiment directory? Otherwise, how can we do without preprocess_ldas.x ?
I think we should keep -debug such that it exactly replicates what sbatch lenkf.j would do.

@weiyuan-jiang
Copy link
Contributor Author

We can discard my changes. But we need to add more information to the wiki. 1) launch totalview in scratch directory 2) config the same number of processors in totalvies as that in lenkf.j

@gmao-rreichle
Copy link
Contributor

@weiyuan-jiang, I'm still not sure that for debugging we can simply skip the first 200-300 lines of lenkf.j in all cases. Let's keep everything as is for now and discuss at our next tag-up.

Copy link
Contributor Author

@weiyuan-jiang weiyuan-jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After lenkf.j exit, SCRDIR and GEOSBIN are not defined anymore. May be it is ease to say go to scratch directory,
source ../build/bin/g5_modules.sh
module load tview
totalview

@gmao-rreichle
Copy link
Contributor

After lenkf.j exit, SCRDIR and GEOSBIN are not defined anymore.

But the "echo" commands that use $SCRDIR and $GEOSBIN are within lenkf.j, so shouldn't the user get the correct dir names printed in the help text on the screen?

@weiyuan-jiang
Copy link
Contributor Author

After lenkf.j exit, SCRDIR and GEOSBIN are not defined anymore.

But the "echo" commands that use $SCRDIR and $GEOSBIN are within lenkf.j, so shouldn't the user get the correct dir names printed in the help text on the screen?

Oh, you are right. The echo should print out right information. Another suggestions: 1) The users should choose the same number of processors as that in lenkf.j 2) Just launch totalview without executable because usually we debug in parallel mode. The users will have to open a new parallel session.

@gmao-qliu
Copy link
Contributor

After lenkf.j exit, SCRDIR and GEOSBIN are not defined anymore.

But the "echo" commands that use $SCRDIR and $GEOSBIN are within lenkf.j, so shouldn't the user get the correct dir names printed in the help text on the screen?

Oh, you are right. The echo should print out right information. Another suggestions: 1) The users should choose the same number of processors as that in lenkf.j 2) Just launch totalview without executable because usually we debug in parallel mode. The users will have to open a new parallel session.

@weiyuan-jiang @gmao-rreichle's suggestion to start totalview with executable and edit the parallel setting afterwards using Ctrl+A works. The current instruction on the wiki page is overall good to me. Someone who tries to run debugging the first time might have more insight of the clarity. One minor edit suggestion: add the highlighted word to the line "To use debugging tools at NCCS, you must be on an interactive compute node, which can be obtained with the following sample command:"

@gmao-rreichle gmao-rreichle changed the title load and launch totalview in lenkf.j Clarify help message for "lenkf.j -debug" Oct 18, 2022
@gmao-rreichle gmao-rreichle added documentation Improvements or additions to documentation and removed infrastructure cleanup labels Oct 18, 2022
Copy link
Contributor

@gmao-rreichle gmao-rreichle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, I further edited the Wiki page to reflect the latest changes by @weiyuan-jiang and @gmao-qliu.

@gmao-rreichle gmao-rreichle marked this pull request as ready for review October 18, 2022 19:15
@gmao-rreichle gmao-rreichle merged commit 8ad174f into develop Oct 18, 2022
@gmao-rreichle gmao-rreichle deleted the feature/wjiang/load_launch_tview branch October 18, 2022 19:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0-diff trivial very, very obvious 0-diff change documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants