Would you please check this problem (task_executor.py, line 118)? The error log is uploaded, thanks #53

Closed
HelloLadsAndGents opened this issue Nov 28, 2019 · 5 comments

Comments

@HelloLadsAndGents

HelloLadsAndGents commented Nov 28, 2019

I followed the steps for the Docker Compose deployment
(https://github.com/FederatedAI/KubeFATE/blob/master/docker-deploy/README.md).

At the last step, "Verify the Deployment", I ran the script:
python run_toy_example.py 10000 9999 1
and got this error:

(part of the log)
Traceback (most recent call last):
  File "/data/projects/fate/python/fate_flow/driver/task_executor.py", line 118, in run_task
    run_object.run(parameters, task_run_args)
  File "/data/projects/fate/python/federatedml/toy_example/secure_add_guest.py", line 113, in run
    self._init_data()
  File "/data/projects/fate/python/federatedml/toy_example/secure_add_guest.py", line 54, in _init_data
    self.x = session.parallelize(kvs, include_key=True, partition=self.partition)
  File "/data/projects/fate/python/arch/api/utils/profile_util.py", line 31, in _fn
    rtn = func(*args, **kwargs)
  File "/data/projects/fate/python/arch/api/session.py", line 76, in parallelize
    error_if_exist=error_if_exist)
  File "/data/projects/fate/python/arch/api/table/eggroll/session_impl.py", line 69, in parallelize
    error_if_exist=error_if_exist)
  File "/data/projects/fate/python/eggroll/api/cluster/eggroll.py", line 410, in parallelize
    _table = self._create_table(create_table_info)
  File "/data/projects/fate/python/eggroll/api/cluster/eggroll.py", line 444, in _create_table
    count = self.eggroll_session._gc_table.get(info.storageLocator.name)
  File "/data/projects/fate/python/eggroll/api/cluster/eggroll.py", line 214, in get
    return _EggRoll.get_instance().get(self, k, use_serialize=use_serialize)
  File "/data/projects/fate/python/eggroll/api/cluster/eggroll.py", line 571, in get
    operand = self.kv_stub.get(kv_pb2.Operand(key=k), metadata=_get_meta(_table))
  File "/data/projects/python/venv/lib/python3.6/site-packages/grpc/_channel.py", line 533, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/data/projects/python/venv/lib/python3.6/site-packages/grpc/_channel.py", line 467, in _end_unary_response_blocking
    raise _Rendezvous(state, None, None, deadline)
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
status = StatusCode.INTERNAL
details = "172.18.0.8:8011: java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
at

err_20191128.log

@jiahaoc1993
Contributor

jiahaoc1993 commented Nov 28, 2019

@HelloLadsAndGents It seems your storage-service is not working properly. Please go into the egg container and check the logs located at "/data/projects/fate/eggroll/storage-service-cxx/logs/".
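
A minimal sketch of that log check, meant to be run inside the egg container. The directory is the one quoted above; the helper name `summarize_logs` is hypothetical and not part of FATE or KubeFATE. It only lists each file with its size, so that empty logs stand out.

```python
from pathlib import Path


def summarize_logs(log_dir):
    """Return a sorted list of (file name, size in bytes) for files in log_dir."""
    directory = Path(log_dir)
    if not directory.is_dir():
        # A missing directory is reported as "no logs" rather than an error.
        return []
    return sorted(
        (entry.name, entry.stat().st_size)
        for entry in directory.iterdir()
        if entry.is_file()
    )


if __name__ == "__main__":
    # Path taken from the comment above; adjust it if your layout differs.
    for name, size in summarize_logs(
        "/data/projects/fate/eggroll/storage-service-cxx/logs/"
    ):
        print(f"{name}: {size} bytes" + (" (empty)" if size == 0 else ""))
```

If every file is reported as empty, the service most likely died before it could write anything, and starting it by hand is probably the quicker way to see the failure.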

@HelloLadsAndGents
Author

Hello,
I searched this directory, but both logs are empty.
Are there any other ways to find out where the problem is?
pic

@jiahaoc1993
Contributor

jiahaoc1993 commented Nov 29, 2019

@HelloLadsAndGents It seems your "storage service" failed to start. You can start it manually to identify the problem. To start the "storage service", use the following command:

./storage-service -p 7778 -d /data/projects/fate/data-dir

Depending on the output, different solutions may apply. For more details, please check out our wiki page.

@HelloLadsAndGents
Author

HelloLadsAndGents commented Nov 29, 2019

Thanks, it does help. I followed your steps:

cd /data/projects/fate/eggroll/storage-service-cxx
./storage-service -p 7778 -d /data/projects/fate/data-dir

and I got this output:
Illegal instruction (core dumped)
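
For context, "Illegal instruction" is the message a shell prints when a process is killed by the SIGILL signal, which usually means the binary executed a CPU instruction the processor does not implement. Below is a minimal sketch, assuming a POSIX system, that decodes a child's return code the way Python's `subprocess` reports it (a negative value -N means "killed by signal N"). The helper names `describe_exit` and `run_and_describe` are hypothetical.

```python
import signal
import subprocess


def describe_exit(returncode):
    """Translate a subprocess return code into a short human-readable string."""
    if returncode < 0:
        # On POSIX, subprocess reports death by signal N as return code -N.
        try:
            return "killed by " + signal.Signals(-returncode).name
        except ValueError:
            return f"killed by signal {-returncode}"
    return f"exited with status {returncode}"


def run_and_describe(argv):
    """Run argv, wait for it to finish, and describe how the process ended."""
    completed = subprocess.run(argv)
    return describe_exit(completed.returncode)
```

Called from the storage-service-cxx directory as `run_and_describe(["./storage-service", "-p", "7778", "-d", "/data/projects/fate/data-dir"])`, this would be expected to report SIGILL in the situation described here.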

Then I checked your wiki page (https://github.com/FederatedAI/KubeFATE/wiki/KubeFATE) and found:
The storage-service of egg service requires CPU instruction set like avx2 etc. Please make sure your CPU supports these instructions otherwise the storage-service will fail to start with the following error:

I think this is because our nodes' CPUs do not support instruction sets like avx2.
Are there any solutions to this kind of problem?
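
One way to confirm the AVX2 suspicion on a Linux node is to read the feature flags from /proc/cpuinfo. The sketch below is an illustration only: the helper names are hypothetical, and `avx2` is the default simply because it is the one flag the wiki text names; the full set of instructions the storage-service needs is not stated in this thread.

```python
def cpu_flags(cpuinfo_text):
    """Collect the CPU feature flags listed in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # x86 Linux kernels list features on lines whose key is "flags".
        if key.strip() == "flags":
            flags.update(value.split())
    return flags


def missing_flags(cpuinfo_text, required=("avx2",)):
    """Return the required flags the CPU does not report, in the given order."""
    present = cpu_flags(cpuinfo_text)
    return [flag for flag in required if flag not in present]
```

On the affected node, `missing_flags(open("/proc/cpuinfo").read())` returning `["avx2"]` would support the diagnosis; an empty list would point to a different cause.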

I also suggest listing the CPU requirements on pages like this one (it currently only mentions requirements for space, memory, and cores):
https://github.com/FederatedAI/FATE/tree/master/cluster-deploy

Thanks again.

@jiahaoc1993
Contributor

@HelloLadsAndGents Will do, I appreciate your feedback. Let's close the issue now.
