errBadParameters:DESC IS TOO LONG #155

Closed
yzzer123 opened this issue Jan 9, 2023 · 5 comments

yzzer123 commented Jan 9, 2023

Issue description

When running the MNIST example on SL 1.2, it does not work correctly. A few weeks earlier I ran the MNIST example on SL 1.1 and it worked well. The cluster I used consists of two cloud servers with 1 core / 2 GB memory.

SWOP logs:
2023-01-09 13:07:13,844 : swarm.swop : INFO : SL Nodes validation is started
2023-01-09 13:07:13,844 : swarm.swop : INFO : Attempting to contact API-Server at : xxxxxx:30304
2023-01-09 13:07:13,868 : swarm.swop : INFO : API-Server is UP!
2023-01-09 13:07:13,874 : swarm.swop : INFO : SWOPCtx :
/usr/lib/python3.8/site-packages/urllib3/connection.py:460: SubjectAltNameWarning: Certificate for xxxxx has no subjectAltName, falling back to check for a commonName for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See urllib3/urllib3#497 for details.)
warnings.warn(
2023-01-09 13:15:18,637 : swarm.swop : INFO : SWOPExecutor: ENROLL TO Task : user_env_tf_build_task , opId : 9837225082559815058 - Begins
2023-01-09 13:15:45,688 : swarm.swop : INFO : SWOPExecutor: ENROLL TO Task : user_env_tf_build_task , opId : 9837225082559815058 - Ends
2023-01-09 13:15:45,721 : swarm.swop : INFO : SWOPBuildTask: Validating profile
2023-01-09 13:15:51,814 : swarm.swop : INFO : Extracted container id and image info from /tmp/container_info_file file
2023-01-09 13:16:04,351 : swarm.swop : INFO : SWOPBuildTask: prerequisites OK
2023-01-09 13:16:07,372 : swarm.swop : INFO : SWOPBuildTask: start build thread

2023-01-09 13:16:07,449 : swarm.swop : INFO : Step 1/5 : FROM tensorflow/tensorflow:2.7.0
2023-01-09 13:16:07,451 : swarm.swop : INFO : ---> b51f642475ab
2023-01-09 13:16:07,451 : swarm.swop : INFO : Step 2/5 : RUN pip3 install --upgrade pip && pip3 install keras matplotlib opencv-python pandas protobuf==3.15.6 sklearn
2023-01-09 13:16:07,452 : swarm.swop : INFO : ---> Using cache
2023-01-09 13:16:07,453 : swarm.swop : INFO : ---> b7909ca127e0
2023-01-09 13:16:07,453 : swarm.swop : INFO : Step 3/5 : RUN mkdir -p /tmp/hpe-swarmcli-pkg
2023-01-09 13:16:07,454 : swarm.swop : INFO : ---> Using cache
2023-01-09 13:16:07,454 : swarm.swop : INFO : ---> 4d5668e8fb75
2023-01-09 13:16:07,455 : swarm.swop : INFO : Step 4/5 : COPY swarmlearning-client-py3-none-manylinux_2_24_x86_64.whl /tmp/hpe-swarmcli-pkg/swarmlearning-client-py3-none-manylinux_2_24_x86_64.whl
2023-01-09 13:16:07,456 : swarm.swop : INFO : ---> Using cache
2023-01-09 13:16:07,456 : swarm.swop : INFO : ---> ac4b70a54382
2023-01-09 13:16:07,457 : swarm.swop : INFO : Step 5/5 : RUN pip3 install /tmp/hpe-swarmcli-pkg/swarmlearning-client-py3-none-manylinux_2_24_x86_64.whl
2023-01-09 13:16:07,458 : swarm.swop : INFO : ---> Using cache
2023-01-09 13:16:07,458 : swarm.swop : INFO : ---> bd7a54bb4f9a
2023-01-09 13:16:07,458 : swarm.swop : INFO : ID: sha256:bd7a54bb4f9a54063a747772fa3efed5ff6435c0b9854627c3414044370eb186
2023-01-09 13:16:07,459 : swarm.swop : INFO : Successfully built bd7a54bb4f9a
2023-01-09 13:16:07,463 : swarm.swop : INFO : Successfully tagged user-env-tf2.7.0-swop:latest
2023-01-09 13:16:21,407 : swarm.swop : INFO : SWOPBuildTask: build task completed
2023-01-09 13:16:21,408 : swarm.swop : INFO : SWOPBuildTask: Stopping Task
2023-01-09 13:16:24,429 : swarm.swop : INFO : SWOPExecutor: Total Tasks: 1 , Current Task : user_env_tf_build_task , opId : 9837225082559815058 Done
2023-01-09 13:16:55,538 : swarm.swop : INFO : SWOPExecutor: ENROLL TO Task : swarm_mnist_task , opId : 13286232894752780294 - Begins
2023-01-09 13:16:58,564 : swarm.swop : INFO : SWOPExecutor: ENROLL TO Task : swarm_mnist_task , opId : 13286232894752780294 - Ends
2023-01-09 13:16:58,607 : swarm.swop : INFO : Extracted container id and image info from /tmp/container_info_file file
2023-01-09 13:16:58,618 : swarm.swop : INFO : SWOPRunTask: Stopping Task
2023-01-09 13:16:58,625 : swarm.swop : INFO : errBadParameters:DESC IS TOO LONG
2023-01-09 13:16:58,625 : swarm.swop : INFO : SWOPRunTask: Stopping Task
2023-01-09 13:16:58,626 : swarm.swop : INFO : SWOPExecutor: Total Tasks: 2 , Current Task : swarm_mnist_task , opId : 13286232894752780294 Done
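
Note that the failure above is logged at INFO level, so filtering the SWOP log by level would not surface it. A minimal sketch of a filter that matches on the message instead, assuming the `timestamp : logger : LEVEL : message` layout seen in this log and assuming error messages start with `err` followed by a capital letter (inferred only from `errBadParameters` here, not from any Swarm Learning documentation). The function name `swop_errors` and the file name `swop.log` are hypothetical:

```shell
# Print SWOP log lines whose message field starts with an "errXxx" code.
# Assumption: fields are separated by " : " and the message is field 4.
swop_errors() {
  awk -F' : ' '$4 ~ /^err[A-Z]/' "$@"
}

# Usage (file name is hypothetical; with no argument the function reads stdin):
#   swop_errors swop.log
```

Run against the log above, this would keep only the `errBadParameters:DESC IS TOO LONG` line, together with its timestamp.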

SWCI logs:
SWCI:18 > EXIT ON FAILURE OFF
SWCI:18 > EXIT ON FAILURE IS TURNED OFF
SWCI:19 >
SWCI:19 > # Build task was already run. Now build and run swarm run tasks
SWCI:19 >
SWCI:19 > # Create and finalize swarm run task
SWCI:19 > EXIT ON FAILURE
SWCI:19 > EXIT ON FAILURE IS TURNED ON
SWCI:20 > create task from taskdefs/swarm_mnist_task.yaml
Task definition is valid
Task Registered : swarm_mnist_task
Appending Task Body
batch start : 1 , len : 4 Successful
batch start : 5 , len : 4 Successful
batch start : 9 , len : 4 Successful
batch start : 13 , len : 1 Successful
Task creation Successful
WARNING: Task should be finalized by user explicitly
SWCI:21 > finalize task swarm_mnist_task
Task Finalized
SWCI:22 > get task info swarm_mnist_task
NAME : swarm_mnist_task
TASKTYPE : RUN_SWARM
CREATETIME : 2023-01-09 13:16:28
AUTHOR : HPESwarm
CONTENTLINES : 14
PREREQ : user_env_tf_build_task
OUTCOME : swarm_mnist_task
FINALIZED : True
SWCI:23 > get task body swarm_mnist_task
0000: ---
0001: Command : model/mnist_tf.py
0002: Entrypoint : python3
0003: WorkingDir : /tmp/test
0004: PrivateContent : /tmp/test/app-data
0005: SharedContent :
0006: - Src : /opt/hpe/swarm-learning/workspace/mnist/model
0007: Tgt : /tmp/test/model
0008: MType : BIND
0009: Envvars :
0010: - DATA_DIR : app-data
0011: - MODEL_DIR : model
0012: - MAX_EPOCHS : 2
0013: - MIN_PEERS : 2
SWCI:24 > list tasks
ROOTTASK
user_env_tf_build_task
swarm_mnist_task
SWCI:25 > EXIT ON FAILURE OFF
SWCI:25 > EXIT ON FAILURE IS TURNED OFF
SWCI:26 >
SWCI:26 > # Assign run task
SWCI:26 > EXIT ON FAILURE
SWCI:26 > EXIT ON FAILURE IS TURNED ON
SWCI:27 > RESET TASKRUNNER defaulttaskbb.taskdb.sml.hpe
TaskRunner Reset
SWCI:28 > ASSIGN TASK swarm_mnist_task TO defaulttaskbb.taskdb.sml.hpe WITH 2 PEERS
Task assigned to TaskRunner
SWCI:29 > WAIT FOR TASKRUNNER defaulttaskbb.taskdb.sml.hpe
WAITING FOR TASKRUNNER TO COMPLETE - Maximum wait time is : 120 mins
#################################################

Swarm Learning Version:

  • Find the docker tag of the Swarm images ( $ docker images | grep hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning )

OS and ML Platform

  • details of host OS: Ubuntu20
  • details of ML platform used: tensorflow
  • details of Swarm learning Cluster (Number of machines, SL nodes, SN nodes): 2 nodes

Quick Checklist: Respond [Yes/No]

  • APLS server web GUI shows available Licenses? yes
  • If Multiple systems are used, can each system access every other system? yes
  • Is Password-less SSH configuration setup for all the systems? no
  • If GPU or other protected resources are used, does the account have sufficient privileges to access and use them? yes
  • Is the user id a member of the docker group? yes

Additional notes

  • Are you running documented example without any modification? no
  • Add any additional information about use case or any notes which supports for issue investigation:
@iArpanPatel
Collaborator

Hi @yzzer123, please provide the logs from all the Swarm containers (SWOP, SWCI, SL and ML); that would help us debug the issue. Also, since you ran a modified example, please describe the changes you made.
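
One way to gather those logs in a single pass is sketched below. It is only an illustration: the function name `collect_logs` and the output directory are hypothetical, and the container names depend on how the nodes were started, so they are passed in as arguments rather than assumed.

```shell
# DOCKER_CMD can be overridden (for example with a stub) when testing.
DOCKER_CMD="${DOCKER_CMD:-docker}"

# Save the logs of each named container to <outdir>/<name>.log.
collect_logs() {
  outdir="$1"
  shift
  mkdir -p "$outdir"
  for name in "$@"; do
    # "docker logs" can write to both stdout and stderr, so capture both.
    "$DOCKER_CMD" logs "$name" > "$outdir/$name.log" 2>&1
  done
}

# Usage (collects from every container on the host, running or stopped):
#   collect_logs ./swarm-logs $(docker ps -a --format '{{.Names}}')
```

Including stopped containers (`-a`) matters here, since a failed task's container may already have exited.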

joaquingarciaatos commented Mar 14, 2023

Hi @yzzer123, could you share how you solved this issue? I get the same error, "errBadParameters: DESC IS TOO LONG", when executing the MNIST-PYT or FRAUD-DETECTION examples, and the "WAIT FOR TASKRUNNER defaulttaskbb.taskdb.sml.hpe" command fails after waiting for more than 120 minutes. Thank you in advance.

@yzzer123
Author

I haven't solved this problem, and I plan to go back to SL 1.1.0.

@Ultimate-Storm

Pulling the 1.2.0 images again might help; the problem may be related to the SL node:

Pull images

echo "Download Swarm Network (SN) Node"
sudo docker pull hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning/sn:1.2.0
echo "Download Swarm Learning (SL) Node"
sudo docker pull hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning/sl:1.2.0
echo "Download Swarm Learning Command Interface (SWCI) Node"
sudo docker pull hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning/swci:1.2.0
echo "Download Swarm Operator (SWOP) Node"
sudo docker pull hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning/swop:1.2.0
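
The four pulls above can also be written as a loop, which keeps the tag in one place and makes a mismatched version across components less likely. This is a sketch under the same assumptions as the commands above (same registry path, `sudo` needed for docker); the function names are hypothetical:

```shell
REGISTRY="hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning"

# Print one image reference per Swarm component for the given tag.
swarm_image_refs() {
  tag="$1"
  for component in sn sl swci swop; do
    printf '%s/%s:%s\n' "$REGISTRY" "$component" "$tag"
  done
}

# Pull every image for the given tag; stop at the first failed pull.
pull_swarm_images() {
  swarm_image_refs "$1" | while read -r ref; do
    echo "Pulling $ref"
    sudo docker pull "$ref" || exit 1
  done
}

# Usage:
#   pull_swarm_images 1.2.0
```

Afterwards, the `docker images | grep hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning` check from the issue template shows whether all four components carry the same tag.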

@iArpanPatel
Collaborator

Hi @yzzer123, closing this issue as there are no further questions from you.
You can upgrade to our latest 2.0.0 release.
