errBadParameters:DESC IS TOO LONG #155
Comments
Hi @yzzer123, please provide the logs from all the Swarm containers (SWOP, SWCI, SL, and ML); that would help us debug the issue. Also, since you ran a modified example, it would help if you could describe the changes you made.
Hi @yzzer123, could you share how you solved this issue? I get the same error, "errBadParameters: DESC IS TOO LONG", when running the MNIST-PYT or FRAUD-DETECTION examples, and the "WAIT FOR TASKRUNNER defaulttaskbb.taskdb.sml.hpe" command fails after waiting for more than 120 minutes. Thank you in advance.
I haven't solved this problem, and I plan to go back to SL 1.1.0.
Maybe pulling the 1.2.0 images again would help; this might be a problem related to the SL node: Pull images
Hi @yzzer123, closing this issue as there are no further questions from you.
Issue description
When I run the MNIST example on SL 1.2, it does not work correctly. A few weeks ago I ran the MNIST example on SL 1.1 and it worked well. The cluster I used consists of two cloud servers with 1 core/2 GB memory.
SWOP logs:
2023-01-09 13:07:13,844 : swarm.swop : INFO : SL Nodes validation is started
2023-01-09 13:07:13,844 : swarm.swop : INFO : Attempting to contact API-Server at : xxxxxx:30304
2023-01-09 13:07:13,868 : swarm.swop : INFO : API-Server is UP!
2023-01-09 13:07:13,874 : swarm.swop : INFO : SWOPCtx :
/usr/lib/python3.8/site-packages/urllib3/connection.py:460: SubjectAltNameWarning: Certificate for xxxxx has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See urllib3/urllib3#497 for details.)
warnings.warn(
2023-01-09 13:15:18,637 : swarm.swop : INFO : SWOPExecutor: ENROLL TO Task : user_env_tf_build_task , opId : 9837225082559815058 - Begins
2023-01-09 13:15:45,688 : swarm.swop : INFO : SWOPExecutor: ENROLL TO Task : user_env_tf_build_task , opId : 9837225082559815058 - Ends
2023-01-09 13:15:45,721 : swarm.swop : INFO : SWOPBuildTask: Validating profile
2023-01-09 13:15:51,814 : swarm.swop : INFO : Extracted container id and image info from /tmp/container_info_file file
2023-01-09 13:16:04,351 : swarm.swop : INFO : SWOPBuildTask: prerequisites OK
2023-01-09 13:16:07,372 : swarm.swop : INFO : SWOPBuildTask: start build thread
2023-01-09 13:16:07,449 : swarm.swop : INFO : Step 1/5 : FROM tensorflow/tensorflow:2.7.0
2023-01-09 13:16:07,451 : swarm.swop : INFO : ---> b51f642475ab
2023-01-09 13:16:07,451 : swarm.swop : INFO : Step 2/5 : RUN pip3 install --upgrade pip && pip3 install keras matplotlib opencv-python pandas protobuf==3.15.6 sklearn
2023-01-09 13:16:07,452 : swarm.swop : INFO : ---> Using cache
2023-01-09 13:16:07,453 : swarm.swop : INFO : ---> b7909ca127e0
2023-01-09 13:16:07,453 : swarm.swop : INFO : Step 3/5 : RUN mkdir -p /tmp/hpe-swarmcli-pkg
2023-01-09 13:16:07,454 : swarm.swop : INFO : ---> Using cache
2023-01-09 13:16:07,454 : swarm.swop : INFO : ---> 4d5668e8fb75
2023-01-09 13:16:07,455 : swarm.swop : INFO : Step 4/5 : COPY swarmlearning-client-py3-none-manylinux_2_24_x86_64.whl /tmp/hpe-swarmcli-pkg/swarmlearning-client-py3-none-manylinux_2_24_x86_64.whl
2023-01-09 13:16:07,456 : swarm.swop : INFO : ---> Using cache
2023-01-09 13:16:07,456 : swarm.swop : INFO : ---> ac4b70a54382
2023-01-09 13:16:07,457 : swarm.swop : INFO : Step 5/5 : RUN pip3 install /tmp/hpe-swarmcli-pkg/swarmlearning-client-py3-none-manylinux_2_24_x86_64.whl
2023-01-09 13:16:07,458 : swarm.swop : INFO : ---> Using cache
2023-01-09 13:16:07,458 : swarm.swop : INFO : ---> bd7a54bb4f9a
2023-01-09 13:16:07,458 : swarm.swop : INFO : ID: sha256:bd7a54bb4f9a54063a747772fa3efed5ff6435c0b9854627c3414044370eb186
2023-01-09 13:16:07,459 : swarm.swop : INFO : Successfully built bd7a54bb4f9a
2023-01-09 13:16:07,463 : swarm.swop : INFO : Successfully tagged user-env-tf2.7.0-swop:latest
2023-01-09 13:16:21,407 : swarm.swop : INFO : SWOPBuildTask: build task completed
2023-01-09 13:16:21,408 : swarm.swop : INFO : SWOPBuildTask: Stopping Task
2023-01-09 13:16:24,429 : swarm.swop : INFO : SWOPExecutor: Total Tasks: 1 , Current Task : user_env_tf_build_task , opId : 9837225082559815058 Done
2023-01-09 13:16:55,538 : swarm.swop : INFO : SWOPExecutor: ENROLL TO Task : swarm_mnist_task , opId : 13286232894752780294 - Begins
2023-01-09 13:16:58,564 : swarm.swop : INFO : SWOPExecutor: ENROLL TO Task : swarm_mnist_task , opId : 13286232894752780294 - Ends
2023-01-09 13:16:58,607 : swarm.swop : INFO : Extracted container id and image info from /tmp/container_info_file file
2023-01-09 13:16:58,618 : swarm.swop : INFO : SWOPRunTask: Stopping Task
2023-01-09 13:16:58,625 : swarm.swop : INFO : errBadParameters:DESC IS TOO LONG
2023-01-09 13:16:58,625 : swarm.swop : INFO : SWOPRunTask: Stopping Task
2023-01-09 13:16:58,626 : swarm.swop : INFO : SWOPExecutor: Total Tasks: 2 , Current Task : swarm_mnist_task , opId : 13286232894752780294 Done
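Note that the failure in the SWOP log above is reported at INFO level, so filtering the log for ERROR entries would miss it. A minimal sketch of a filter that flags such lines, assuming the `<timestamp> : <logger> : <LEVEL> : <message>` layout seen in this excerpt and assuming (from this single example) that error messages begin with `err`; the function name `find_err_messages` is hypothetical and not part of Swarm Learning:

```python
import re

# Assumed line layout, inferred from the SWOP log excerpt in this issue:
# "<date> <time>,<ms> : <logger> : <LEVEL> : <message>"
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) : "
    r"(?P<logger>\S+) : (?P<level>\w+) : (?P<msg>.*)$"
)


def find_err_messages(lines):
    """Return (timestamp, message) pairs whose message starts with 'err'.

    The 'err' prefix is an assumption based on the one failing line
    in this issue, not a documented Swarm Learning convention.
    """
    hits = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group("msg").startswith("err"):
            hits.append((match.group("ts"), match.group("msg")))
    return hits


# Two lines copied from the SWOP log in this issue.
sample = [
    "2023-01-09 13:16:58,618 : swarm.swop : INFO : SWOPRunTask: Stopping Task",
    "2023-01-09 13:16:58,625 : swarm.swop : INFO : errBadParameters:DESC IS TOO LONG",
]

for ts, msg in find_err_messages(sample):
    print(ts, msg)
```

Lines that do not match the assumed layout (for example the multi-line urllib3 warning) are skipped rather than reported.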
SWCI logs:
SWCI:18 > EXIT ON FAILURE OFF
SWCI:18 > EXIT ON FAILURE IS TURNED OFF
SWCI:19 >
SWCI:19 > # Build task was already run. Now build and run swarm run tasks
SWCI:19 >
SWCI:19 > # Create and finalize swarm run task
SWCI:19 > EXIT ON FAILURE
SWCI:19 > EXIT ON FAILURE IS TURNED ON
SWCI:20 > create task from taskdefs/swarm_mnist_task.yaml
Task definition is valid
Task Registered : swarm_mnist_task
Appending Task Body
batch start : 1 , len : 4 Successful
batch start : 5 , len : 4 Successful
batch start : 9 , len : 4 Successful
batch start : 13 , len : 1 Successful
Task creation Successful
WARNING: Task should be finalized by user explicitly
SWCI:21 > finalize task swarm_mnist_task
Task Finalized
SWCI:22 > get task info swarm_mnist_task
NAME : swarm_mnist_task
TASKTYPE : RUN_SWARM
CREATETIME : 2023-01-09 13:16:28
AUTHOR : HPESwarm
CONTENTLINES : 14
PREREQ : user_env_tf_build_task
OUTCOME : swarm_mnist_task
FINALIZED : True
SWCI:23 > get task body swarm_mnist_task
0000: ---
0001: Command : model/mnist_tf.py
0002: Entrypoint : python3
0003: WorkingDir : /tmp/test
0004: PrivateContent : /tmp/test/app-data
0005: SharedContent :
0006: - Src : /opt/hpe/swarm-learning/workspace/mnist/model
0007: Tgt : /tmp/test/model
0008: MType : BIND
0009: Envvars :
0010: - DATA_DIR : app-data
0011: - MODEL_DIR : model
0012: - MAX_EPOCHS : 2
0013: - MIN_PEERS : 2
SWCI:24 > list tasks
ROOTTASK
user_env_tf_build_task
swarm_mnist_task
SWCI:25 > EXIT ON FAILURE OFF
SWCI:25 > EXIT ON FAILURE IS TURNED OFF
SWCI:26 >
SWCI:26 > # Assign run task
SWCI:26 > EXIT ON FAILURE
SWCI:26 > EXIT ON FAILURE IS TURNED ON
SWCI:27 > RESET TASKRUNNER defaulttaskbb.taskdb.sml.hpe
TaskRunner Reset
SWCI:28 > ASSIGN TASK swarm_mnist_task TO defaulttaskbb.taskdb.sml.hpe WITH 2 PEERS
Task assigned to TaskRunner
SWCI:29 > WAIT FOR TASKRUNNER defaulttaskbb.taskdb.sml.hpe
WAITING FOR TASKRUNNER TO COMPLETE - Maximum wait time is : 120 mins
#################################################