
RayDistributor for using Ray to distribute the calculations in tsfresh #1030

Open
wants to merge 2 commits into main

Conversation

TheaperDeng

Ray is getting popular for building distributed applications and is easy to fit into tsfresh through a RayDistributor.

Distributed tsfresh on Ray

This PR adds a new RayDistributor so that tsfresh can use Ray to distribute the calculations.

RayDistributor is a subclass of IterableDistributorBaseClass in tsfresh and follows the development instructions in https://tsfresh.readthedocs.io/en/latest/text/tsfresh_on_a_cluster.html.
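
For orientation, a minimal sketch of what such a subclass can look like. This is illustrative only and not the code in this PR: the import location of IterableDistributorBaseClass and the exact distribute signature are assumptions taken from the linked cluster documentation and from the existing distributors in tsfresh.utilities.distribution.

import ray
from tsfresh.utilities.distribution import IterableDistributorBaseClass

class SketchRayDistributor(IterableDistributorBaseClass):
    """Illustrative only: run each chunk of feature calculations as a Ray task."""

    def __init__(self, address=None, n_workers=1):
        # Connect to an existing Ray cluster if an address is given,
        # otherwise start a local Ray runtime.
        ray.init(address=address)
        self.n_workers = n_workers

    def distribute(self, func, partitioned_chunks, kwargs):
        # Run func on every chunk as a Ray remote task and collect the results.
        remote_func = ray.remote(func)
        futures = [remote_func.remote(chunk, **kwargs) for chunk in partitioned_chunks]
        return ray.get(futures)

    def close(self):
        ray.shutdown()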

Quick Start

Use RayDistributor the same way as MultiprocessingDistributor, ClusterDaskDistributor or LocalDaskDistributor.

from tsfresh import extract_features
from tsfresh.utilities.distribution import RayDistributor

distributor = RayDistributor(n_workers=4)
# ...
extracted_features = extract_features(..., distributor=distributor)
# ...
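
For a fuller, runnable sketch, the same flow can be written out with the robot execution failures example data that also appears in the documentation changes reviewed below; the column_id and column_sort arguments follow the standard tsfresh example and are not part of this PR's diff.

from tsfresh import extract_features
from tsfresh.examples.robot_execution_failures import (
    download_robot_execution_failures,
    load_robot_execution_failures,
)
from tsfresh.utilities.distribution import RayDistributor

download_robot_execution_failures()
df, y = load_robot_execution_failures()

# Distribute the feature calculations over 4 local Ray workers.
distributor = RayDistributor(n_workers=4)
extracted_features = extract_features(
    df, column_id="id", column_sort="time", distributor=distributor
)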

Code change summary

  • Add the RayDistributor definition in tsfresh.utilities.distribution
  • Add RayDistributor documentation in docs/text/tsfresh_on_a_cluster.rst
  • Update pre-commit-config to enable future development
  • Update test-requirements.txt for unit tests
  • Manually test the unit tests and documentation generation locally

@TheaperDeng
Author

@nils-braun It would be great to have some suggestions on how to avoid changing the pre-commit-config version, and on the PR itself.

nils-braun self-requested a review on June 28, 2023 07:29
Collaborator

nils-braun left a comment


Thanks @TheaperDeng!
Really nice addition. I had a few comments, but in general I am fine with the changes.

Two additional questions:

  • did you have the chance to run some speed tests? Is it faster/slower than other options?
  • I know that Ray also has a datasets feature, which would allow for data locality. Right now, you need to move all data from the main node to the worker nodes. Did you have a look into this as well? Is this worth exploring?

Collaborator


@TheaperDeng - I have updated the pre-commit file on the main branch to use the newest versions. Can you please merge in the newest changes and resolve the merge conflicts?

Collaborator


The structure of the test requirements changed on main to have a more "modern" or typical repository structure. Those changes will go into the setup.cfg file once you merge in the newest main.

@@ -8,4 +8,6 @@ seaborn>=0.7.1
 ipython>=5.3.0
 notebook>=4.4.1
 pandas-datareader>=0.5.0
+ray>=2.5.0
+protobuf<=3.20.3
Collaborator


Why is that needed? Could you maybe add a comment? I do not see protobuf being used directly.


ray.init(address=address, **rayinit_config)
self.n_workers = n_workers
self.cpu_per_worker = max(
Collaborator


I am not an expert in ray, but this assumes that you have a homogeneous cluster where each machine has the same number of CPUs, or? Why is this setting needed at all? Does ray not use all CPUs of a machine by default (again, not an expert in ray!)?
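
(As a hedged aside, not part of the PR: on a single machine, ray.init() detects and uses all local CPUs by default, which can be checked via the reported cluster resources.)

import ray

ray.init()  # with no arguments, Ray detects all CPUs of the local machine
print(ray.cluster_resources().get("CPU"))  # total CPUs Ray schedules tasks on
ray.shutdown()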

import ray

ray.init(address=address, **rayinit_config)
self.n_workers = n_workers
Collaborator


The number of workers is defined by the ray cluster size when starting the cluster and cannot be controlled by the user at this point, or? So the user needs to make sure to always pass in the correct number of workers according to the cluster. Can this also be retrieved from ray? We do something similar for dask.

Collaborator


If possible, I would recommend the following: as the number-of-workers property currently does not change the cluster deployment, I would prefer if it is filled automatically. If this is not possible, we should remove the default value of 1 and maybe rename the parameter to make sure users know they need to set it to the number of cluster workers.
Right now, it might look to users as if they can control the number of workers in their cluster using this variable (which I think they cannot).
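
A minimal sketch of how the worker count could be read from the running Ray cluster instead of being supplied by the user (illustrative only; ray.nodes() and ray.cluster_resources() are standard Ray calls, but whether this fits the distributor design is up to the PR):

import ray

ray.init(address="auto")  # attach to an already running cluster

# Option 1: count the alive nodes the cluster reports.
n_workers = len([node for node in ray.nodes() if node["Alive"]])

# Option 2: derive the parallelism from the total CPUs the cluster reports.
total_cpus = int(ray.cluster_resources().get("CPU", 1))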

download_robot_execution_failures()
df, y = load_robot_execution_failures()

Distributor = RayDistributor(address="ray://123.45.67.89:10001")
Collaborator


If the number of workers is not retrieved automatically, don't you need to pass it here because the default is 1?

Collaborator


Nit: can you use a lowercase distributor? The object is not a class but an instance.

download_robot_execution_failures()
df, y = load_robot_execution_failures()

Distributor = RayDistributor(n_workers=3)
Collaborator


As far as I understood, this will not automatically start a ray cluster with 3 workers, or?

Ray is an optional dependency and users who need to use Ray to distribute the calculations should install
it first via `pip install ray`.

Ray is an easy-to-use framework for developing distributed computing workloads. Users can use it on a single node or scale
Collaborator


Could you add a few words on why a user would choose the ray distributor and not any other distributor? I am totally fine with having it in the code-base, I just want to make sure users are not confused about what to choose. What are the benefits compared to e.g. dask?
From what I understood, using Ray allows parallelizing the computation but does not help with out-of-memory data. And it is of course useful if you already run a Ray cluster.

class LocalRayDistributorTestCase(DataTestCase):
    def test_ray_cluster_extraction_one_worker(self):

        Distributor = RayDistributor(n_workers=1)
Collaborator


Same nit as before, can you use lowercase?


    def test_ray_cluster_extraction_two_worker(self):

        Distributor = RayDistributor(n_workers=2)
Collaborator


Again the question: this is not creating a ray cluster with two workers, or? It is technically the same cluster as without this option - you just change the chunking.
Not sure if this is expected.

Collaborator


Is it possible to actually start a 2-worker local ray cluster?
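
One possible way to do that in a test is sketched below. Note that ray.cluster_utils.Cluster is an internal Ray testing utility rather than a documented public API, so treat this as an assumption to verify:

import ray
from ray.cluster_utils import Cluster

# Start a local head node plus one extra worker node, each with one CPU.
cluster = Cluster(initialize_head=True, head_node_args={"num_cpus": 1})
cluster.add_node(num_cpus=1)
ray.init(address=cluster.address)

assert len([node for node in ray.nodes() if node["Alive"]]) == 2

ray.shutdown()
cluster.shutdown()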
