How Spark Submit works
The following explanations mainly describe how YARN jobs are submitted in cluster mode.
- The `spark-submit` script launches a Java process using the script `spark-class` with the main class `org.apache.spark.deploy.SparkSubmit`.
- `spark-class` finds `java`, executes `$SPARK_HOME/conf/spark-env.sh` through `$SPARK_HOME/bin/load-spark-env.sh` and starts the Java process.
- `org.apache.spark.launcher.Main` builds the command to be executed by `spark-class`, e.g. something like

```
/usr/lib/jvm/java-1.8.0-openjdk/jre/bin/java -cp /opt/spark/conf/:/opt/spark/jars/*:/opt/hadoop/ \
-Dscala.usejavacp=true org.apache.spark.deploy.SparkSubmit --master yarn --deploy-mode cluster \
--conf spark.executor.memory=1g --conf spark.driver.memory=1g \
--class za.co.absa.hyperdrive.driver.drivers.CommandLineIngestionDriver --name Hyperdrive \
--jars spark-jobs-current.jar hyperdrive-release-latest.jar arg1 arg2
```
- `SparkSubmit`
  - Call chain: `main` -> `doSubmit` -> `submit` -> `doRunMain` -> `runMain` -> `prepareSubmitEnvironment`
  - `prepareSubmitEnvironment` returns, among others, the main class to execute. For YARN in cluster mode, this is the YARN client main class, `org.apache.spark.deploy.yarn.YarnClusterApplication`.
  - Login with keytab and principal is also done here: https://github.com/apache/spark/blob/v3.2.0/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L359
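As a simplified illustration (not Spark's actual code), the main-class decision made in `prepareSubmitEnvironment` can be sketched as follows; the function name and signature are made up for this example:

```scala
// Hypothetical sketch of the main-class selection in prepareSubmitEnvironment:
// in YARN cluster mode, SparkSubmit does not run the user class directly but
// hands control to the YARN client application, which submits the job to YARN.
def childMainClass(master: String, deployMode: String, userMainClass: String): String =
  if (master == "yarn" && deployMode == "cluster")
    "org.apache.spark.deploy.yarn.YarnClusterApplication" // user class runs later, in the AM
  else
    userMainClass // e.g. client mode: runMain invokes the user class directly
```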
- `org.apache.spark.deploy.yarn.Client`
  - Call chain: `YarnClusterApplication.start` -> `Client.run` -> `Client.submitApplication`
  - The `__spark_conf__.zip` archive is created here: https://github.com/apache/spark/blob/v3.2.1/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L750
  - The upload of jars happens here: https://github.com/apache/spark/blob/v3.2.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L391
  - `submitApplication` calls `yarnClient.createApplication()` and then `yarnClient.submitApplication(ApplicationSubmissionContext)`, as sketched below.
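For context, a minimal sketch of the two Hadoop YARN client calls that Spark's `Client` wraps; the application name is a placeholder, and the whole `ContainerLaunchContext` setup that Spark performs in between is omitted (without it, the ResourceManager would reject this bare context):

```scala
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

object YarnSubmitSketch {
  def main(args: Array[String]): Unit = {
    val yarnClient = YarnClient.createYarnClient()
    yarnClient.init(new YarnConfiguration())
    yarnClient.start()

    // First call: ask the ResourceManager for a new application id.
    val newApp = yarnClient.createApplication()
    val appContext = newApp.getApplicationSubmissionContext
    appContext.setApplicationName("Hyperdrive") // placeholder name

    // Spark's Client fills in the ContainerLaunchContext here: the command
    // that starts the ApplicationMaster, plus local resources such as
    // __spark_conf__.zip and the uploaded jars.

    // Second call: hand the prepared submission context to the ResourceManager.
    val appId = yarnClient.submitApplication(appContext)
    println(s"Submitted application $appId")

    yarnClient.stop()
  }
}
```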
- There are two implementations of `AbstractLauncher`, which can be used to programmatically launch a Spark job with the method `startApplication` (see the sketch after this list):
  - `SparkLauncher` creates a `java.lang.Process` that executes the `spark-submit` script.
  - `InProcessLauncher` calls `org.apache.spark.deploy.InProcessSparkSubmit.main()` directly within a thread (using `new Thread()`).
  - Both `SparkLauncher` and `InProcessLauncher` start a static instance of `LauncherServer`, which is used to keep track of the launched Spark jobs. The `LauncherServer` is not used when submitting an application through `spark-submit`.
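A minimal sketch of programmatic submission through `SparkLauncher`; the Spark home path, class, jar and arguments echo the command example above but are placeholders, not a verified Hyperdrive setup:

```scala
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

object LauncherSketch {
  def main(args: Array[String]): Unit = {
    val handle: SparkAppHandle = new SparkLauncher()
      .setSparkHome("/opt/spark") // placeholder path
      .setMaster("yarn")
      .setDeployMode("cluster")
      .setAppResource("hyperdrive-release-latest.jar")
      .setMainClass("za.co.absa.hyperdrive.driver.drivers.CommandLineIngestionDriver")
      .setAppName("Hyperdrive")
      .setConf(SparkLauncher.DRIVER_MEMORY, "1g")
      .addAppArgs("arg1", "arg2")
      // startApplication spawns the spark-submit process and registers the
      // job with the static LauncherServer so its state can be tracked.
      .startApplication(new SparkAppHandle.Listener {
        override def stateChanged(h: SparkAppHandle): Unit =
          println(s"State changed: ${h.getState}")
        override def infoChanged(h: SparkAppHandle): Unit = ()
      })
  }
}
```

Using `InProcessLauncher` instead avoids spawning a `spark-submit` process, at the cost of running the submission logic inside the calling JVM.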
- Is it really necessary for `spark-submit` to have the full `$SPARK_HOME`, or are only specific files required?