In the Azure Distributed Data Engineering Toolkit, a cluster is primarily designed to run Spark jobs. This document describes how to create a cluster to use for Spark jobs. Alternatively, for getting started and debugging, you can also use the cluster in interactive mode, which allows you to log into the master node, use Jupyter, and view the Spark UI.
Creating a Spark cluster only takes a few simple steps after which you will be able to SSH into the master node of the cluster and interact with Spark. You will be able to view the Spark Web UI, Spark Jobs UI, submit Spark jobs (with spark-submit), and even interact with Spark in a Jupyter notebook.
For the advanced user, please note that the default cluster settings are preconfigured in the .aztk/cluster.yaml file that is generated when you run aztk spark init. More information on cluster config here.
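As an illustration, a minimal .aztk/cluster.yaml might look like the sketch below. The field names here are assumptions based on the CLI flags in this document; inspect the file generated by aztk spark init for the authoritative schema.

```yaml
# Hypothetical sketch of .aztk/cluster.yaml -- field names are assumptions;
# check the file generated by `aztk spark init` for the real schema.
id: spark                # default cluster id
vm_size: standard_a2     # official Azure VM SKU name
size: 4                  # number of dedicated nodes
# size_low_pri: 4        # low-priority nodes, used instead of `size`
```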
Create a Spark cluster:
aztk spark cluster create --id <your_cluster_id> --vm-size <vm_size_name> --size <number_of_nodes>
For example, to create a cluster of 4 Standard_A2 nodes called 'spark' you can run:
aztk spark cluster create --id spark --vm-size standard_a2 --size 4
You can find more information on VM sizes here. Please note that you must use the official SKU name when setting your VM size - they usually come in the form: "standard_d2_v2".
NOTE: The cluster id (--id) can only contain alphanumeric characters, hyphens, and underscores, and cannot be longer than 64 characters. Each cluster must have a unique cluster id.
You can create your cluster with low-priority VMs at an 80% discount by using --size-low-pri instead of --size. Low-priority VMs are great for experimental use, but can be preempted at any time; we recommend against them for long-running jobs or critical workloads.
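For example, to create the same 4-node Standard_A2 cluster as above, but with low-priority nodes instead of dedicated ones (a sketch of the flag usage described above; this assumes the same cluster id and VM size from the earlier example):

```shell
# Same cluster as the earlier example, but with 4 low-priority nodes
aztk spark cluster create --id spark --vm-size standard_a2 --size-low-pri 4
```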
By default, the Azure Distributed Data Engineering Toolkit will use Spark v2.2.0 and Python v3.5.4. However, you can set your Spark and/or Python versions by configuring the base Docker image used by this package.
You can list all clusters currently running in your account by running
aztk spark cluster list
To view details about a particular cluster run:
aztk spark cluster get --id <your_cluster_id>
Note that the cluster is not fully usable until a master node has been selected and its state is idle.
For example, here cluster 'spark' has 2 nodes, and node tvm-257509324_2-20170820t200959z is the master and ready to run a job.
Cluster spark
------------------------------------------
State: active
Node Size: standard_a2
Nodes: 2
| Dedicated: 2
| Low priority: 0
Nodes | State | IP:Port | Master
------------------------------------|-----------------|----------------------|--------
tvm-257509324_1-20170820t200959z | idle | 40.83.254.90:50001 |
tvm-257509324_2-20170820t200959z | idle | 40.83.254.90:50000 | *
To delete a cluster run:
aztk spark cluster delete --id <your_cluster_id>
You are charged for the cluster as long as the nodes are provisioned in your account. Make sure to delete any clusters you are not using to avoid unwanted costs.
All interaction with the cluster is done via SSH and SSH tunneling. If you did not create a user during cluster creation (aztk spark cluster create), the first step is to add a user to the master node.
Make sure that the .aztk/secrets.yaml file has your SSH key (or path to the SSH key), and it will automatically use it to make the SSH connection.
aztk spark cluster add-user --id spark --username admin
Alternatively, you can add the SSH key as a parameter when running the add-user command.
aztk spark cluster add-user --id spark --username admin --ssh-key <your_key_OR_path_to_key>
You can also use a password to create your user:
aztk spark cluster add-user --id spark --username admin --password <my_password>
Using an SSH key is the recommended method.
After a user has been created, SSH into the Spark container on the master node with:
aztk spark cluster ssh --id spark --username admin
If you would like to ssh into the host instead of the Spark container on it, run:
aztk spark cluster ssh --id spark --username admin --host
If you ssh into the host and wish to access the running Docker Spark environment, you can run the following:
sudo docker exec -it spark /bin/bash
Now that you're in, you can change directory to your familiar $SPARK_HOME
cd $SPARK_HOME
By default, the aztk spark cluster ssh command port-forwards the Spark Web UI to localhost:8080, the Spark Jobs UI to localhost:4040, and the Spark History Server to localhost:18080. This can be configured in .aztk/ssh.yaml.
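As an illustration, the forwarded ports might be customized with an .aztk/ssh.yaml along these lines. The key names below are assumptions; check the file generated by aztk spark init for the exact schema.

```yaml
# Hypothetical sketch of .aztk/ssh.yaml -- key names are assumptions
username: admin
web_ui_port: 8080     # Spark Web UI -> localhost:8080
job_ui_port: 4040     # Spark Jobs UI -> localhost:4040
# jupyter_port: 8888  # Jupyter, if enabled on the cluster
```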
Once the appropriate ports have been forwarded, simply navigate to the local ports for viewing. In this case, if you used port 8888 (the default) for Jupyter then navigate to http://localhost:8888.
The notebooks are only persisted on the cluster itself. Once the cluster is deleted, all notebooks are deleted with it. We recommend saving your notebooks elsewhere if you do not want to lose them.