-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
translate Cluster Training and Prediction #9356
Changes from 1 commit
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2,4 +2,15 @@ | |
Cluster Training and Prediction | ||
############################### | ||
|
||
TBD | ||
.. contents:: | ||
|
||
1. Network connection errors in the log during muliti-node cluster training | ||
------------------------------------------------ | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. mulilti-node -> multi-node |
||
The errors in the log belong to network connection during mulilti-node cluster training, for example, :code:`Connection reset by peer`. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This sentence only have subject but no predicate and no object. |
||
This kind of error is usually caused by the abnormal exit of the training process in some node, and the others cannot connect with this node any longer. Steps to troubleshoot the problem as follows: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. of the training process -> of a training process There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. others -> other nodes There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. the problem as follows -> the problem are as follows |
||
|
||
* Find the first error in the :code:`train.log`, :code:`server.log`, check whether other fault casued the problem, such as FPE, lacking of memory or disk. | ||
|
||
* If network connection gave rise to the first error in the log, this may be caused by the port conflict of the non-exclusive execution. Connect with the operator to check if the current MPI cluster supports jobs submitted with parameter :code:`resource=full`. If so, change the port of job. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. "If network connection gave rise to the first error in the log" => If the first error in There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Connect with the operator => Contact the sys-admin There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If so, change the port of job. => If the current MPI cluster does not support this parameter, change the server port and try agian. |
||
|
||
* If the currnet MPI cluster does not support exclusive pattern, ask the operator to replace or update the current cluster. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. people may want to know what the "exclusive pattern" is. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. operator => cluster administrator There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Which doc should we refer this term to ? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. currnet -> current |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
muliti-node -> multi-node