Cloudera Data Engineering (CDE) provides a command-line tool that makes migrating Spark workloads to CDE straightforward. The key advantage of this tool is that Spark workloads can be migrated to CDE without rewriting their spark-submit command lines. While the tool works on both Public and Private Cloud form factors, this example covers running a spark-submit workload on a Public Cloud CDE cluster only.
- This setup only works on a Mac notebook
- You have access to Cloudera Data Platform (CDP) and a Cloudera Data Engineering (CDE) Virtual Cluster
- You have successfully set up the CDE command-line tool. If you have not done so already, please refer to the Cloudera documentation here
- You have access keys for CDP. Click here for more help documentation
- A basic knowledge of executing Spark jobs remotely using command-line tools.
- Clone the repo into a folder on your Mac. You can do this in multiple ways:
- If you are familiar with the git command line, use the following command:
git clone https://github.com/SuperEllipse/cde-spark-submit-migration
- You can also go directly to the git location here, download it as a zip, and unzip it, although the first method is preferable.
- Go to the directory cde-env-tool:
cd /your/path/install-for-mac/cde-env-tool
- Install the tool by running the following command, which ensures you can run the tool without root permissions on your laptop:
sed -i '' "s#CLOUDERA_BIN=/opt/cloudera/bin#CLOUDERA_BIN=$HOME/bin#g" cde-env.sh && ./cde-env.sh enable-spark-submit-proxy -f private
- Update PATH to give access to those binary and script files:
export PATH=$HOME/bin:$PATH
- Check that the tool is installed with the following command:
which cde-env.sh
The result should be:
$HOME/bin/cde-env.sh
where $HOME is your home directory on your laptop. In my case the result is as below, since my HOME directory is /Users/vrajagopalan; your output may differ based on your home directory location:
/Users/vrajagopalan/bin/cde-env.sh
- Ensure that the which command above succeeds; otherwise the tool setup has not been successful and the rest of the steps won't work.
Important Note: To configure the spark-submit migration tool, we need to modify the config.yaml file of the CDE CLI. Internally, the spark-submit migration tool uses the CDE CLI, so it is essential that the CDE CLI be set up first.
- Copy the URL of the CDP environment you are planning to use into a TextEdit window on your Mac. The image below shows how to get the CDP endpoint.
- Copy the URL of the Virtual Cluster endpoint into a TextEdit window on your Mac. The image below shows how to get the Virtual Cluster endpoint.
- Copy the full path of the CDP access credentials file (refer to the step in the prerequisites).
- Launch Terminal on your Mac and change to the .cde directory:
cd $HOME/.cde
- Next, open the config.yaml file and change its contents to the following:
# ~/.cde/config.yaml
allow-all-spark-submit-flags: true
credentials-file: <credentials-location>
cdp-endpoint: <CDP-endpoint>
tls-insecure: true
profiles:
- name: vc-1 # enter any profile name you wish, but you must activate this profile later
  vcluster-endpoint: <VC-endpoint>
Enter the CDP credentials file location, CDP endpoint, and Virtual Cluster endpoint.
An example of this file looks like below:
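For illustration, a filled-in config.yaml might look like the following. The endpoint URLs and file path here are placeholders, not real values; substitute the ones you copied in the steps above:

```yaml
# ~/.cde/config.yaml
allow-all-spark-submit-flags: true
credentials-file: /Users/jdoe/.cdp/credentials
cdp-endpoint: https://console.us-west-1.cdp.cloudera.com
tls-insecure: true
profiles:
- name: vc-1
  vcluster-endpoint: https://abc123.cde-xyz.example.cloudera.site/dex/api/v1
```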
In this step we will activate the user profile we created and then submit simple-spark-job.py to CDE with the spark-submit command we are familiar with for submitting Spark jobs. Behind the scenes, the spark-submit command is redirected to the CDE cluster. To get started, let us activate the profile we created earlier (if you did not use vc-1, substitute the profile name you used instead):
- First, navigate to the folder where you installed this demo:
cd /path/to/this-demo-on-your-mac/
- Ensure that you are able to see the folder structure below
- We now need to activate the profile we created earlier in our config.yaml file. To do so, execute the command below
cde-env.sh activate -p vc-1
- Now we are ready to execute the spark-submit command on our virtual cluster using our simple-spark-job.py file. This script simply calculates the value of pi as a Spark job:
spark-submit \
--deploy-mode client \
--num-executors 1 \
--driver-memory 512m \
--executor-memory 512m \
--executor-cores 1 \
./simple-spark-job.py
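The repo already contains simple-spark-job.py; for reference, a minimal pi-estimation Spark script of this kind looks roughly like the sketch below. This is an assumption about the script's contents for illustration, not a copy of it:

```python
# simple-spark-job.py -- Monte Carlo estimate of pi, run as a Spark job (sketch)
from operator import add
from random import random

def inside(_):
    """Sample a random point in the unit square; return 1 if it lands
    inside the quarter circle of radius 1, else 0."""
    x, y = random(), random()
    return 1 if x * x + y * y <= 1.0 else 0

if __name__ == "__main__":
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("simple-spark-job").getOrCreate()
    n = 100_000
    # The fraction of sampled points inside the quarter circle approximates pi / 4
    count = spark.sparkContext.parallelize(range(n), 2).map(inside).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))
    spark.stop()
```

Because spark-submit is proxied by the migration tool, this script runs on the CDE virtual cluster even though the command line is unchanged.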
You should now see a job submitted to your Spark cluster, as in the image below. ![job submitted](./install-for-mac/cde-env-tool/img/spark-job-submitted.png) After waiting a little while, you should see that the Spark job has completed: ![job-submitted](./install-for-mac/cde-env-tool/img/job-succeeded.png)
To validate that our spark-submit executed, we go to CDE and check the submitted jobs. Go to Cloudera Data Engineering in CDP and click the Job Runs menu option on the right. You may also need to switch the Virtual Cluster to the one you used for submitting your Spark workload. Your spark-submit should have executed as the job cli-submit--XXXXXXXXXX. See the example screenshot below.
Click on the Job Run ID and open the Logs. Clicking on stdout shows the output of our submitted Spark job.
With this demo, we have shown how the spark-submit migration tool can be used to migrate existing Spark workloads to CDE without changes to spark-submit commands. Although we used a Public Cloud example here, the tool can be used equally effectively with Private Cloud Spark workloads. More details can be found in the documentation here (Note: the documentation is currently under review, subject to change, and hence not yet published externally). The spark-submit migration tool can also be configured with multiple profiles, so that Spark workloads can be executed against different virtual clusters.
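As an illustration of the multi-profile setup, the profiles section of config.yaml could list several virtual clusters (the names and endpoints below are placeholders), and you would switch between them with the same activate command used earlier, e.g. cde-env.sh activate -p vc-prod:

```yaml
profiles:
- name: vc-dev
  vcluster-endpoint: https://dev-cluster.example.cloudera.site/dex/api/v1
- name: vc-prod
  vcluster-endpoint: https://prod-cluster.example.cloudera.site/dex/api/v1
```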