#Overview

__Driver__: The process that holds the _SparkContext_ object and coordinates the other Spark processes. The driver runs either on the cluster (__cluster mode__) or on the machine that submitted the application, such as a user's laptop (__client mode__).

__Cluster Manager__: Allocates resources across applications. The SparkContext can connect to Spark's own standalone cluster manager, Mesos, or YARN.

__Executors__: Processes on the worker nodes that run computations and store data for an application. Spark sends the application code, and then tasks, to the executors to run.

![Spark cluster overview diagram](https://spark.apache.org/docs/1.4.0/img/cluster-overview.png)

- Each application gets its own set of executors.

- Spark is agnostic to the underlying cluster manager, as long as it can create executors and they can communicate with each other.

- Because the driver schedules tasks on the cluster, it should run close to the worker nodes: on the same local area network, or somewhere with a very fast connection to the cluster.


##Cluster Manager Types

__Standalone__: A simple cluster manager included with Spark.

__Apache Mesos__: A general cluster manager that can run Hadoop and service apps.

__Hadoop YARN__: Resource manager for Hadoop 2.
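
The cluster manager in use is typically selected through the master URL passed to Spark. As a rough sketch, the mapping can be expressed as a prefix check. The function name `cluster_manager_for` is hypothetical, and the prefixes (`spark://`, `mesos://`, `yarn`, `local`) are assumed from Spark's documented master URL formats rather than taken from these notes:

```python
def cluster_manager_for(master_url):
    """Return the cluster manager type implied by a Spark master URL.

    Hypothetical helper: the prefixes are assumed from Spark's
    documented master URL formats.
    """
    if master_url.startswith("spark://"):
        return "standalone"
    if master_url.startswith("mesos://"):
        return "mesos"
    # Older releases used yarn-client / yarn-cluster; newer ones use yarn.
    if master_url == "yarn" or master_url.startswith("yarn-"):
        return "yarn"
    if master_url.startswith("local"):
        return "local"  # no cluster manager; Spark runs on one machine
    raise ValueError(f"unrecognized master URL: {master_url!r}")


print(cluster_manager_for("spark://23.195.26.187:7077"))
```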

##Submitting Applications
An application can be submitted by using `spark-submit`:

```bash
$ spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
```

If the code depends on other projects, you need to package them together with your application (except for the Spark and Hadoop dependencies, which the cluster provides at runtime).

For Java and Scala code, you create an assembly jar (or "uber" jar) containing your code and its dependencies. Both sbt and Maven have assembly plugins.

For Python, you can use the `--py-files` argument to add a `.py`, `.zip`, or `.egg` file to be distributed with your application.
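
One way to produce such a `.zip` is with Python's standard `zipfile` module. The sketch below is only an illustration: the helper name `zip_package` and the file names in the usage note are hypothetical, and it assumes the dependency is a plain Python package directory.

```python
import zipfile
from pathlib import Path


def zip_package(package_dir, archive_path):
    """Zip a Python package directory for use with --py-files.

    Hypothetical helper; returns the archive member names it wrote.
    """
    root = Path(package_dir).resolve()
    names = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(root.rglob("*.py")):
            # Keep paths relative to the package's parent directory so that
            # `import <package>` works once the zip is on the Python path.
            arcname = path.relative_to(root.parent).as_posix()
            archive.write(path, arcname)
            names.append(arcname)
    return names
```

With a package directory called `mylib`, calling `zip_package("mylib", "deps.zip")` would produce an archive that could then be passed as `spark-submit --py-files deps.zip code.py`.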

To submit code that has no external dependencies, using the default settings:

```bash
$ spark-submit code.py
```

###Commonly Used Options
| Option | Description | Example |
|--------|-------------|---------|
| `--class` | Entry point for the application | `org.apache.spark.examples.SparkPi` |
| `--master` | Master URL for cluster | `spark://23.195.26.187:7077` |
| `--deploy-mode` | Deployment mode for driver (`client`, `cluster`) | Default: `client` |
| `--conf` | Arbitrary Spark config property in key=value format. For values with spaces, wrap the whole `key=value` pair in quotes. | `"key=value"` |
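
The options in the table can also be assembled programmatically, for example before handing the command to `subprocess.run`. This is a minimal sketch: `build_spark_submit` is a hypothetical helper, and the property `spark.executor.extraJavaOptions` is used only to illustrate a value that contains a space.

```python
import shlex


def build_spark_submit(app, master=None, deploy_mode=None, main_class=None,
                       conf=None, app_args=()):
    """Assemble a spark-submit argument list (hypothetical helper)."""
    cmd = ["spark-submit"]
    if main_class:
        cmd += ["--class", main_class]
    if master:
        cmd += ["--master", master]
    if deploy_mode:
        cmd += ["--deploy-mode", deploy_mode]
    for key, value in (conf or {}).items():
        # Each property is passed as one key=value argument.
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)
    cmd += list(app_args)
    return cmd


cmd = build_spark_submit(
    "code.py",
    master="spark://23.195.26.187:7077",
    deploy_mode="client",
    conf={"spark.executor.extraJavaOptions":
          "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps"},
)
# shlex.join quotes the --conf pair because its value contains a space.
print(shlex.join(cmd))
```

Passing the list form directly to `subprocess.run(cmd)` avoids shell quoting altogether; the quoted string is only needed when the command is pasted into a shell.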

##Monitoring Jobs
Every SparkContext launches a web UI on the driver, usually on port 4040 (`http://<driver-node>:4040`), that displays useful information about the running application:
- List of scheduler stages and tasks
- Summary of RDD sizes and memory usage
- Environment information
- Information about running executors
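
The same information can be read programmatically. The sketch below builds the UI address described above and, as an assumption not covered in these notes, queries the `/api/v1/applications` JSON endpoint that Spark's monitoring documentation describes. The function names `ui_url` and `list_applications` are hypothetical.

```python
import json
from urllib.request import urlopen


def ui_url(driver_host, port=4040):
    """Base URL of the driver's web UI (4040 is the usual port)."""
    return f"http://{driver_host}:{port}"


def list_applications(driver_host, port=4040, timeout=5):
    """Fetch application info as JSON from the driver's web UI.

    Assumes the /api/v1/applications REST endpoint is available on
    this driver; requires a running Spark application to succeed.
    """
    url = f"{ui_url(driver_host, port)}/api/v1/applications"
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


print(ui_url("localhost"))
```

`list_applications("localhost")` only returns data while a driver is running on that host; if the port is different from 4040, pass it explicitly.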
