forked from modin-project/modin
DOCS-modin-project#6871: Update Modin on Ray cluster tutorial
Signed-off-by: Kirill Suvorov <kirill.suvorov@intel.com>
1 parent
097ea52
commit e917bd3
Showing 6 changed files with 234 additions and 415 deletions.
81 changes: 81 additions & 0 deletions
examples/tutorial/jupyter/execution/pandas_on_ray/cluster/README.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,81 @@ | ||
![LOGO](../../../img/MODIN_ver2_hrz.png) | ||
|
||
<center> | ||
<h1>Scale your pandas workflows by changing one line of code</h1> | ||
</center> | ||
|
||
# Exercise 5: Setting up cluster environment | ||
|
||
**GOAL**: Learn how to set up a cluster for Modin. | ||
|
||
**NOTE**: This exercise has extra requirements. Read instructions carefully before attempting. | ||
|
||
**This exercise instructs the user on how to start a 700+ core cluster, and it is not shut down until the end of Exercise 6. Read instructions carefully.** | ||
|
||
Often in practice we have a need to exceed the capabilities of a single machine. Modin works and performs well in both local mode and in a cluster environment. The key advantage of Modin is that your notebook does not change between local development and cluster execution. Users are not required to think about how many workers exist or how to distribute and partition their data; Modin handles all of this seamlessly and transparently. | ||
|
||
![Cluster](../../../img/modin_cluster.png) | ||
|
||
**Extra Requirements for this exercise** | ||
|
||
Detailed instructions can be found here: https://docs.ray.io/en/latest/cluster/getting-started.html | ||
|
||
First of all, please install the necessary dependencies in your environment: | ||
```bash | ||
pip install boto3 | ||
``` | ||
|
||
The next step is to set up your AWS credentials. We can either set the `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_SESSION_TOKEN` environment variables or run the following command (which requires the AWS CLI): | ||
|
||
```bash | ||
aws configure | ||
``` | ||
|
||
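Before starting the cluster, it can help to confirm that the credentials are visible to your shell. The following is a minimal sketch, not part of the tutorial: it only inspects the environment variables named above, and the helper `missing_aws_variables` is a hypothetical name introduced for illustration. It treats `AWS_SESSION_TOKEN` as optional, on the assumption that it is only needed for temporary credentials.

```python
import os

# Variable names taken from the text above.
REQUIRED = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_variables(env):
    """Return the required AWS variable names that are unset or empty in `env`."""
    return [name for name in REQUIRED if not env.get(name)]


if __name__ == "__main__":
    missing = missing_aws_variables(os.environ)
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("AWS credential variables are set.")
```

If you configured credentials with `aws configure` instead, they are stored in a credentials file rather than in the environment, so this check reporting missing variables does not necessarily indicate a problem.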
## Starting and connecting to the cluster | ||
|
||
This example starts 1 head node (m5.24xlarge) and 7 workers (m5.24xlarge), 768 total CPUs. | ||
|
||
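The CPU total quoted above follows from the node count and the instance size. As a quick check, assuming 96 vCPUs per m5.24xlarge instance (the figure listed by AWS for this instance type):

```python
# vCPUs per m5.24xlarge instance (assumption: taken from the AWS instance type listing).
VCPUS_PER_NODE = 96
HEAD_NODES = 1
WORKER_NODES = 7

total_cpus = (HEAD_NODES + WORKER_NODES) * VCPUS_PER_NODE
print(total_cpus)  # 768
```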
The cost of this cluster can be estimated from the EC2 on-demand pricing: https://aws.amazon.com/ec2/pricing/on-demand/. | ||
|
||
You can use the Ray autoscaler to create and initialize a Ray cluster from a configuration file. This file is included in this directory and is called [`modin-cluster.yaml`](https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/modin-cluster.yaml). You can read more about how to modify the cluster YAML configuration file here: https://docs.ray.io/en/latest/cluster/vms/references/ray-cluster-configuration.html#cluster-yaml-configuration-options | ||
```bash | ||
ray up modin-cluster.yaml | ||
``` | ||
|
||
Once the cluster initialization is complete, run the following command to connect to it: | ||
```bash | ||
ray attach modin-cluster.yaml | ||
``` | ||
|
||
Now our console is connected to the cluster head node via SSH, and we can run our code there. | ||
|
||
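Once attached to the head node, one way to confirm that the workers have joined is to ask Ray how many CPUs it can see. This is a sketch under stated assumptions rather than a step from the tutorial: `has_expected_cpus` and `check_cluster` are hypothetical helper names, and the expected count of 768 is the figure given in Exercise 5. It relies on `ray.init(address="auto")` to attach to the running cluster and on `ray.cluster_resources()` to report its resources.

```python
EXPECTED_CPUS = 768  # 8 x m5.24xlarge, per the cluster description in Exercise 5


def has_expected_cpus(resources, expected_cpus=EXPECTED_CPUS):
    """Return True if a Ray resource dict reports at least `expected_cpus` CPUs."""
    return resources.get("CPU", 0) >= expected_cpus


def check_cluster():
    """Connect to the already running Ray cluster and report its CPU count."""
    import ray  # imported lazily so the helper above works without Ray installed

    ray.init(address="auto")  # attach to the existing cluster instead of starting a new one
    resources = ray.cluster_resources()
    print("CPUs visible to Ray:", resources.get("CPU", 0))
    return has_expected_cpus(resources)
```

Calling `check_cluster()` from a Python session on the head node prints the CPU count. Worker nodes may take some time to join after the head node is up, so a lower number shortly after start-up does not necessarily mean the cluster failed to launch.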
# Exercise 6: Executing on a cluster environment | ||
|
||
**GOAL**: Learn how to connect Modin to a Ray cluster and run pandas queries on a cluster. | ||
|
||
**NOTE**: Exercise 5 must be completed first; this exercise relies on the cluster created in Exercise 5. | ||
|
||
Modin lets you instantly speed up your workflows on large data by scaling pandas on a cluster. In this exercise, we will use the 12.5 GB [NYC Taxi dataset](https://modin-datasets.s3.amazonaws.com/testing/yellow_tripdata_2015-01.csv), which, together with the [executable script](https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/exercise_6.py), is provided as part of our [modin-cluster.yaml](https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/modin-cluster.yaml) configuration file. | ||
|
||
![ClusterPerf](../../../img/modin_cluster_perf.png) | ||
|
||
We can run the following command to reproduce these results: | ||
```bash | ||
python exercise_6.py | ||
``` | ||
|
||
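For readers who want a feel for what running pandas queries on the cluster looks like, here is a minimal sketch. It is not the contents of `exercise_6.py`: the queries, the `timed` helper, and the `run_queries` function are hypothetical, and reading the dataset directly from the URL given above is an assumption (the script may read a local copy instead). The only Modin-specific step is connecting Ray to the running cluster before the first Modin operation.

```python
import time

# Dataset URL taken from the text above.
DATASET_URL = "https://modin-datasets.s3.amazonaws.com/testing/yellow_tripdata_2015-01.csv"


def timed(label, fn):
    """Run `fn()`, print how long it took, and return (result, seconds)."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f}s")
    return result, elapsed


def run_queries(path=DATASET_URL):
    """Hypothetical workload: load the dataset with Modin and run two simple queries."""
    import ray

    ray.init(address="auto")  # connect to the running Ray cluster first

    import modin.pandas as pd  # drop-in replacement for `import pandas as pd`

    df, _ = timed("read_csv", lambda: pd.read_csv(path))
    timed("count", lambda: df.count())
    timed("isnull().sum()", lambda: df.isnull().sum())
    return df
```

Calling `run_queries()` on the head node runs the workload. Modin may execute some operations asynchronously, so the printed timings should be read as indicative rather than as a benchmark.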
## Shutting down the cluster | ||
|
||
Now that the computation has finished, we first need to exit the SSH console before shutting down the cluster: | ||
```bash | ||
exit | ||
``` | ||
|
||
Then we can shut down the cluster with `ray down`. | ||
|
||
```bash | ||
ray down modin-cluster.yaml | ||
``` | ||
|
||
### This ends the cluster exercise |
146 changes: 0 additions & 146 deletions
examples/tutorial/jupyter/execution/pandas_on_ray/cluster/exercise_5.ipynb
This file was deleted.