<img src="../ucsb_logo_seal.png"> 

## Running on a Cluster
### PSTAT 135 / 235: Big Data Analytics
### University of California, Santa Barbara
### Last Updated: Sep 4, 2019

---  

### Sources 

1. Learning Spark, Chapter 7

### OBJECTIVES
- Learn how to run distributed Spark
- Learn about some of the common deployment environments


### CONCEPTS AND FUNCTIONS
- Cluster manager (Hadoop YARN, Apache Mesos, Standalone)
- Driver and worker/executor
- Spark application
- Directed acyclic graph (DAG)
- Build tool
- Assembly JAR

---  

### Spark Architecture

One benefit of Spark is the ability to scale computation by adding more machines and running in cluster mode.

The *driver* is in charge of coordinating the workers.

The *workers* / *executors* receive code and data and do the processing, sending results back to the driver.

Driver + Workers = Spark application

### Driver

The `main()` method of the program runs on the driver

Converts the user program into a logical *directed acyclic graph* (DAG) of operations

Converts the DAG into tasks, the units of work that are sent to executors

Coordinates scheduling of tasks on executors (like a manager)

### Executors

Run the individual tasks and return results to the driver

Launch at the start of the application and typically run for the lifetime of the app

Provide in-memory (RAM) storage for RDDs that are cached
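
To make the driver/executor split concrete, here is a toy sketch in plain Python (no Spark required). It is only an illustration under simplifying assumptions: `ToyDataset` is a hypothetical class, not part of the Spark API, the recorded "DAG" is a simple linear chain of operations, and a thread pool stands in for the executors.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyDataset:
    """Hypothetical stand-in for an RDD: partitions plus a recorded chain of operations."""

    def __init__(self, partitions, ops=()):
        self.partitions = partitions  # list of lists, one list per partition
        self.ops = tuple(ops)         # recorded operations (a linear chain, the simplest DAG)

    def map(self, fn):
        # Transformations are lazy: record the operation, compute nothing yet.
        return ToyDataset(self.partitions, self.ops + (("map", fn),))

    def filter(self, pred):
        return ToyDataset(self.partitions, self.ops + (("filter", pred),))

    def _run_task(self, partition):
        # One task = the whole recorded chain applied to a single partition.
        rows = iter(partition)
        for kind, fn in self.ops:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)

    def collect(self, num_executors=2):
        # Action: the "driver" creates one task per partition and
        # schedules the tasks on a pool of "executors" (threads here).
        with ThreadPoolExecutor(max_workers=num_executors) as pool:
            results = list(pool.map(self._run_task, self.partitions))
        # Results come back to the driver, in partition order.
        return [row for part in results for row in part]


data = ToyDataset([[1, 2, 3], [4, 5, 6]])
evens_squared = data.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(evens_squared.collect())  # [4, 16, 36]
```

Nothing is computed until `collect()` is called; up to that point the driver has only recorded what to do. Real Spark additionally handles shuffles, stages, and failures, which this sketch ignores.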

### Cluster Manager

External service that Spark relies on to launch executors and, in some deploy modes, the driver.  

Spark is packaged with the Standalone cluster manager.

Manages the sharing of resources between Spark applications.  
Can queue applications when demand for executors exceeds the available resources.
 
### Launching a Program
`spark-submit` is called to launch a Spark app

**Run in local mode using single core**

In [15]:
$ bin\spark-submit --master local python_scripts\textAnalysis1.py

**Run in local mode using 4 cores**

In [None]:
$ bin\spark-submit --master local[4] python_scripts\textAnalysis1.py

**Run in local mode using all cores**

In [3]:
$ bin\spark-submit --master local[*] python_scripts\textAnalysis1.py

**Run on Spark Standalone cluster at default port**

In [11]:
$ bin\spark-submit --master spark://host:7077 python_scripts\textAnalysis1.py

**Run on Spark Standalone cluster at default port, specifying memory to allocate**

In [None]:
$ bin\spark-submit --master spark://host:7077 --executor-memory 10g python_scripts\textAnalysis1.py

**Generic Form to run Spark App**

In [None]:
$ bin\spark-submit [options] <app jar | python file> [app options]

Flags can be given in short or long format: `-shortflag` or `--longflag`  

See page 122 of Learning Spark for a list of flags  

The flags control scheduling information (such as how much memory the executors get) and dependencies such as libraries and files

For a list of all flags, run:  
`bin\spark-submit --help`
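
The generic form `[options] <app jar | python file> [app options]` can also be assembled programmatically. The sketch below is one possible helper, not an official API: the function name `build_spark_submit` and its parameters are made up for illustration, it uses forward-slash paths, and it covers only a few real flags (`--master`, `--executor-memory`, `--num-executors`, `--py-files`).

```python
import shlex


def build_spark_submit(app, master="local[*]", executor_memory=None,
                       num_executors=None, py_files=(), app_args=()):
    """Return the argument list: spark-submit [options] <app> [app options]."""
    cmd = ["spark-submit", "--master", master]
    if executor_memory is not None:
        cmd += ["--executor-memory", executor_memory]
    if num_executors is not None:
        # Applies when running on YARN.
        cmd += ["--num-executors", str(num_executors)]
    if py_files:
        # Multiple files are passed as one comma-separated value.
        cmd += ["--py-files", ",".join(py_files)]
    cmd.append(app)          # the application comes after all options
    cmd.extend(app_args)     # anything after the app is passed to the app itself
    return cmd


cmd = build_spark_submit("python_scripts/textAnalysis1.py",
                         master="spark://host:7077", executor_memory="10g")
print(shlex.join(cmd))
# To actually launch (requires Spark on the PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```

Building a list rather than a single string avoids shell quoting problems; for example, `shlex.join` quotes `local[*]` so that a Unix shell does not try to expand the brackets as a filename pattern.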


### Spark Web UI

Local mode:  
http://localhost:4040/jobs/


<img src="spark_app_mgr.png">  

### Packaging Code and Dependencies  

**Python**  
PySpark uses the Python installation on the worker machines, so libraries can be installed there with `pip`  
Can also submit libraries using the `--py-files` argument to `spark-submit`  
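
As an illustration of the `--py-files` route, the sketch below zips a package directory so it can be shipped with the application. The helper `zip_package` and the package name `mylib` are hypothetical and exist only for this example; only the Python standard library is used.

```python
import os
import tempfile
import zipfile


def zip_package(package_dir, zip_path):
    """Zip the .py files of a package directory for use with --py-files."""
    parent = os.path.dirname(os.path.abspath(package_dir))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(package_dir):
            for name in files:
                if name.endswith(".py"):
                    full = os.path.join(root, name)
                    # Keep the package folder in the archive path
                    # so that `import mylib` works on the workers.
                    zf.write(full, os.path.relpath(full, parent))
    return zip_path


# Demo with a throwaway package created in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    pkg = os.path.join(tmp, "mylib")
    os.makedirs(pkg)
    with open(os.path.join(pkg, "__init__.py"), "w") as f:
        f.write("def shout(s):\n    return s.upper()\n")
    archive = zip_package(pkg, os.path.join(tmp, "mylib.zip"))
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()

print(names)  # ['mylib/__init__.py']
```

The resulting archive would then be passed along with the application, for example `bin\spark-submit --py-files mylib.zip python_scripts\textAnalysis1.py`.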

**Java and Scala**   
Can submit individual JAR files using `--jars`  
For a large set of dependencies, it is better to use a build tool (`sbt` or `maven`) to package all dependencies into one JAR, called the *assembly* JAR  

Maven reads a `pom.xml` file containing the build definition

A *Project Object Model* or *POM* is the fundamental unit of work in Maven. It is an XML file that contains information about the project and configuration details used by Maven to build the project. It contains default values for most projects.

https://maven.apache.org/guides/introduction/introduction-to-the-pom.html

Packaging a Spark application built with Maven is straightforward:    

**Build the assembly JAR, then submit it**


In [None]:
$ mvn package          # create the assembly JAR

# The assembly JAR will be placed in the target directory
$ bin\spark-submit --master local … target\name_of_assembly.jar

### Hadoop YARN

**Y**et **A**nother **R**esource **N**egotiator 

YARN is a cluster manager introduced in Hadoop 2.0  
It allocates system resources to the various applications running in a Hadoop cluster  
It schedules tasks to be executed on different cluster nodes  
It is typically installed on the same nodes as *HDFS*, making it quicker to access data  

To use YARN in Spark, set an environment variable that points to Hadoop config directory, then submit jobs to a special master URL with `spark-submit`

In [None]:
export HADOOP_CONF_DIR="..."
spark-submit --master yarn appname

By default, Spark on YARN runs with 2 executors, so you will likely need to change this with the flag:  
`--num-executors`

<img src="yarn.png">  

*Resource Manager* accepts jobs from users, schedules them and allocates resources  

*Node Manager* monitors the node and provides reporting

An *Application Master* is created for each application; it negotiates for resources and works with the NodeManager to execute and monitor tasks

*Containers* are controlled by the NodeManager and assigned system resources



### Amazon EC2 (Elastic Compute Cloud)

Spark 1.x has a built-in script to launch clusters on EC2: `spark-ec2` (since Spark 2.0 the script is maintained as a separate project)

You will need an Amazon Web Services (AWS) account  
Export the *access key ID* and *secret access key* as environment variables    
By default, launching the cluster produces one master and one slave  
Storage: Spark EC2 clusters include two installations of HDFS (one ephemeral, one persistent)  
See p. 136 of Learning Spark for details