jScope freezes during loading with a "broken pipe" error #2704

Open
mwinkel-dev opened this issue Feb 8, 2024 · 22 comments
Assignees
Labels
bug An unexpected problem or unintended behavior tool/jScope Relates to the jScope tool US Priority

Comments

@mwinkel-dev
Contributor

mwinkel-dev commented Feb 8, 2024

Affiliation
LLNL / DIII-D
(submitted by @mwinkel-dev of MIT PSFC on behalf of Brian V. of LLNL)

Version(s) Affected
Client MDSplus: TBD
Server MDSplus: TBD

Platform
Client: GA's Iris cluster, CentOS 6.10 (Final)
Server: GA's Atlas cluster, TBD

Describe the bug
Intermittent socket failures when using jScope to display DIII-D data. Causes jScope to freeze when loading / displaying data.

To Reproduce
This description is from the email of 8-Feb-2024 that reported the issue.

Here are four screenshots that explain what happens.

  1. Typical jScope display showing many signals. This is working correctly.

  2. Many of the signals fail to load. jScope loads signals top to bottom, left to right, so in this case it failed to load ‘i_boot’ from the automatic onetwo tree. After that failure, none of the other signals load.

Notes: The failure doesn’t always occur on the same signal. Often it fails on Ip. I don’t know why it failed on this shot. Anecdotally, some shots seem to fail more often than others. This morning it took longer than usual to reproduce this error.

  3. If I try any other shot in the same jScope session, I get this error. The only thing I can do is restart jScope.

  4. The error indicates a 'socket Exception: broken pipe'.

Expected behavior
All signals should load and display in jScope.

Screenshots
These screenshots are from the 8-Feb-2024 email (and presented in the same order).

[Screenshots 1-4: images not included]

Additional context
n/a

@mwinkel-dev mwinkel-dev added bug An unexpected problem or unintended behavior US Priority tool/jScope Relates to the jScope tool labels Feb 8, 2024
@mwinkel-dev mwinkel-dev self-assigned this Feb 8, 2024
@mwinkel-dev
Contributor Author

This is likely a network issue regarding the mdsip protocol. The error message probably indicates that the mdsip socket for the jScope connection is being killed for some reason.

We will attempt to reproduce the issue using GA's computers, and then using MIT PSFC's computers.

It is likely that eventually the troubleshooting will require the assistance of GA's networking specialists.

@mwinkel-dev
Contributor Author

mwinkel-dev commented Feb 8, 2024

Brian reported (via email) that this problem started ~4 months ago. Prior to that he used jScope for years without any issues.

@victorbs

victorbs commented Feb 8, 2024

Client computer details:
iris cluster at DIII-D
I don't know the version of linux or MDSplus installed on the cluster
MDSplus Archive computer details:
Host name = atlas.gat.com
I don't know the version of linux or MDSplus installed on the server
Network details:
I'm physically at DIII-D running on the iris cluster through noMachine.
I have the most problems with jScope, but I also use it the most. Programs like efitviewer and reviewplus have occasional data loading issues, too.

@mwinkel-dev
Contributor Author

Hi @victorbs -- Thanks for the information. Much appreciated.

@mwinkel-dev
Contributor Author

Hi @ModestMC -- Am hoping you can provide some additional context regarding this issue.

  • Would appreciate it if you can provide the operating system and MDSplus version for GA's Iris and Atlas clusters.
  • Also, are you aware of any networking issues at GA in the past ~4 months that would account for the intermittent mdsip socket errors that @victorbs has encountered?
  • And have any other users at GA reported problems with mdsip sockets in recent months?

Thanks,
-MarkW

@GabrieleManduchi
Contributor

Just for information, jScope has not been changed in at least the last year.

@zack-vii
Contributor

zack-vii commented Feb 9, 2024

Hi there, just a hunch, but since it's a broken pipe: what protocol are you using to connect to the server (plain mdsip, via tunnel, ssh)? I am not familiar with noMachine.
If I am not mistaken, you get a 'broken pipe' if you try to send something down a socket that no longer has a receiver. Can you check the data server's logs (although this may be an mdsip process spawned on your user machine) for any potential crashes of mdsip sessions? If the error is new but the software is old, there may be a memory-heavy process causing sporadic OutOfMemory issues.
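
To illustrate the 'broken pipe' condition described above, here is a minimal Python sketch. It does not use mdsip or jScope at all: a local socket pair stands in for the client/server connection, and closing one end stands in for an mdsip session that has gone away. The function name is made up for this illustration. On a real TCP connection the error may only surface on a later write, or show up as a connection reset instead, so both are caught.

```python
import socket


def send_after_peer_closed() -> str:
    """Write to a socket whose peer has closed and report what happens.

    Returns the name of the exception raised, or "no error" if the write
    unexpectedly succeeds.
    """
    # A connected pair of local sockets; "server" is a stand-in for the
    # remote end of the connection (an assumption for illustration only).
    client, server = socket.socketpair()
    try:
        server.close()  # the receiver goes away
        try:
            client.sendall(b"request")  # the sender does not know yet
        except (BrokenPipeError, ConnectionResetError) as exc:
            return type(exc).__name__
        return "no error"
    finally:
        client.close()


if __name__ == "__main__":
    print(send_after_peer_closed())
```

The point of the sketch is that the sender only learns about the failure when it next tries to write, which would be consistent with a session that fails on whichever signal happens to be requested after the far end has died.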

@mwinkel-dev
Contributor Author

mwinkel-dev commented Feb 9, 2024

Hi @ModestMC,

Would also appreciate it if you can check the mdsip logs on GA's Atlas server. Typically, the log files are in /var/log/mdsplus/mdsipd, and there can be clues in both the access and the errors files.

And as per @zack-vii's post above, would also be good to check the various system logs on the Atlas server for networking issues.
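
As a rough aid for that log check, the following Python sketch filters log lines by search term. It assumes nothing about the mdsip log format: it does a case-insensitive substring match, and the search terms (a user name, a host name, or an error phrase) are whatever the person searching chooses. The function names are hypothetical, and the directory would be the /var/log/mdsplus/mdsipd path mentioned above only if that is where the logs actually live on the server in question.

```python
from pathlib import Path
from typing import Iterable, Iterator, Tuple


def matching_lines(lines: Iterable[str], needles: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, text) for lines containing any search term.

    Matching is a case-insensitive substring test; no log format is assumed.
    """
    lowered_needles = [needle.lower() for needle in needles]
    for number, line in enumerate(lines, start=1):
        lowered = line.lower()
        if any(needle in lowered for needle in lowered_needles):
            yield number, line.rstrip("\n")


def scan_log_dir(log_dir: str, needles: Iterable[str]) -> Iterator[Tuple[str, int, str]]:
    """Apply matching_lines to every regular file directly inside log_dir."""
    needle_list = list(needles)
    for path in sorted(Path(log_dir).iterdir()):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a file has odd bytes.
        with path.open("r", errors="replace") as handle:
            for number, text in matching_lines(handle, needle_list):
                yield path.name, number, text
```

For example, `scan_log_dir("/var/log/mdsplus/mdsipd", ["broken pipe", "some_username"])` would list the file name, line number, and text of each hit, where both search terms are placeholders.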

Thanks,

-MarkW

@ModestMC

Iris uses 6.1.84 as its default; Atlas was updated in November from 7.96.?? to 7.139.59 (and its OS went from RHEL6 to RHEL8). Without an exact timeframe of the change from Brian, it's hard to say whether or not this is what gave rise to the issue. We will not be updating the version on Iris, so this might not be worth trying to reproduce.

Our recommendation is that @victorbs try using JScope on Omega (which also runs 7.139.59) to see if the bug persists. Many users here also use Reviewplus or OMFIT for visualization. As for the log files, we tried looking but there are too many entries to have any idea who is associated with what (see #2683).

@victorbs

victorbs commented Feb 14, 2024 via email

@mwinkel-dev
Contributor Author

Hi @ModestMC and @victorbs,

Thanks to both of you for the additional information.

If the problem is reproducible on Omega, let me know. I will then see if I can reproduce the issue at MIT using MDSplus 7.139.59 and RHEL8.

@ModestMC

@victorbs the simplest way for me to attempt to reproduce your errors (though I'm not optimistic) would be for you to give me a basic example that breaks, and for me to then run it at a time when Atlas usage is minimal (e.g. the wee hours on a weekend or something) until I see something interesting. Realistically, I think this is a good sign that we should find you a more stable long-term workflow.
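
A "basic example that breaks" could take roughly the following shape in Python. This is only a sketch under stated assumptions: `run_until_failure` and `make_mdsplus_fetch` are names invented here, the host, tree, shot, and expression arguments are placeholders to be filled in by whoever runs it, and the MDSplus part assumes the MDSplus Python package with its thin-client `Connection` class is installed. It reuses a single connection for every request, on the assumption that this resembles what a long-lived jScope session does.

```python
import time
from typing import Callable, List, Tuple


def run_until_failure(
    fetch: Callable[[], object], max_attempts: int, delay_s: float = 0.0
) -> Tuple[int, List[str]]:
    """Call fetch() repeatedly and stop at the first exception.

    Returns (number_of_successful_calls, list_of_error_descriptions).
    """
    errors: List[str] = []
    successes = 0
    for attempt in range(1, max_attempts + 1):
        try:
            fetch()
        except Exception as exc:  # diagnostic harness: record any failure
            errors.append(f"attempt {attempt}: {type(exc).__name__}: {exc}")
            break
        successes += 1
        time.sleep(delay_s)
    return successes, errors


def make_mdsplus_fetch(host: str, tree: str, shot: int, expression: str) -> Callable[[], object]:
    """Build a fetch() that reads one expression over one shared connection.

    All four arguments are placeholders supplied by the caller.
    """
    from MDSplus import Connection  # imported here so the harness runs without MDSplus

    conn = Connection(host)

    def fetch() -> object:
        conn.openTree(tree, shot)
        return conn.get(expression)

    return fetch
```

Recording the attempt number and exception type for the first failure gives a timestamp-like marker that can be lined up against the server-side logs.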

The reviewplus issues I'm recalling were the result of network changes which have since been patched, but Sterling would know better than I would. Definitely let me know what happens when you try using JScope on Omega, as it's a datapoint worth having. Feel welcome to email me from the original email thread if you'd like.

As for @mwinkel-dev, my hunch is that this is some kind of incompatibility between 6.x.x and 7.x.x in a manner like what @zack-vii described, specifically when the server is updated. If Brian has no issue with the same versions communicating (Omega <--> Atlas), I think this bug can be closed as a known version incompatibility.

@victorbs

I haven't had time to test jScope on Omega extensively, but in a couple days of use, I haven't had any problems with the data loading. I will continue to use jScope on Omega and will keep you posted if I begin to have any issues.

@mwinkel-dev
Contributor Author

Hi @victorbs -- Thanks for the update. If jScope on Omega works well for you during the next two weeks or so, then let me know if this issue should be closed.

@victorbs

Hi @mwinkel-dev I'm starting to have similar problems using jScope on omega that I was having on iris. I get an error that 'the connection to atom.gat.com' was lost. After I get that error, signals will no longer load.

@mwinkel-dev
Contributor Author

Hi @victorbs -- That is unfortunate news. But thanks for the update.

Hi @ModestMC -- What is the atom.gat.com server? And what version of MDSplus is it running? Could this be another cross-version incompatibility similar to your conjecture regarding Iris and Atlas?
#2704 (comment)

@margomw
Contributor

margomw commented Mar 22, 2024

@mwinkel-dev : ATOM is a Linux server similar to Omega and is restricted to the team that operates D3D. It does not have an MDSplus server at all, only clients. Perhaps the use case is misunderstood?

The available versions are

  1. mdsplus/core/alpha-7.130.1
  2. mdsplus/core/alpha-7.139.39
  3. mdsplus/core/alpha-7.139.40
  4. mdsplus/core/alpha-7.139.59

@mwinkel-dev
Contributor Author

Hi @margomw -- Thanks for explaining the purpose of the atom.gat.com server. That is useful to know.

Hi @sflanagan and @ModestMC -- Any idea why jScope on Omega would be connecting to Atom? For details, see the post from @victorbs .
#2704 (comment)

Note though that jScope (from Omega to Atlas) has apparently worked well for about a month.

@victorbs

I was mistaken above. I lost connection to 'atlas.gat.com' not 'atom.gat.com'. Sorry for the confusion. Here are screenshots of my connection and the error message.
Screen Shot 2024-03-25 at 8 51 30 AM
Screen Shot 2024-03-25 at 8 57 26 AM

@mwinkel-dev
Contributor Author

Hi @victorbs -- Thanks for the clarification. According to a previous post, both Omega and Atlas are running MDSplus alpha-7.139.59. Therefore, I will see if I can reproduce the problem using that version of MDSplus for both client and server.

@victorbs

victorbs commented May 6, 2024

Hi. Is there any update on this?

There was a period about a month ago when I wasn't having data loading issues. In the last week or two, I have had more connection issues than usual.

@mwinkel-dev
Contributor Author

mwinkel-dev commented May 8, 2024

Hi @victorbs,

Thanks for reminding us to look at this. (We've been swamped with tasks associated with the startup of DIII-D.)

Hi @sflanagan and @ModestMC,

Have there been any changes regarding Omega and/or GA's networking that would explain why jScope is freezing for @victorbs? When he switched to Omega (instead of Iris) the problem vanished for a month or so. Strikes me as odd that the problem has arisen again. (My guess is that we'll probably have to fix Issue #2683 to troubleshoot this jScope issue at GA.)


6 participants