In [None]:
%matplotlib inline
import matplotlib
import seaborn as sns
matplotlib.rcParams['savefig.dpi'] = 2 * matplotlib.rcParams['savefig.dpi']

# Creating Spark applications with PySpark

The PySpark API closely mirrors the Scala API, but the two are not identical, and some features may be missing on the Python side. When reading the [documentation](https://spark.apache.org/docs/latest/programming-guide.html), make sure the examples you are following are the Python ones.

PySpark ships with every Spark distribution, so you should be able to `import pyspark` without any trouble.
For machine learning applications, you'll probably need a Spark SQL context in addition to the usual Spark context, since the SQL context is what enables DataFrame functionality.

In [None]:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

conf = SparkConf()
conf.setMaster("local[*]")
conf.setAppName("main")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

lines = sc.textFile("small_data/gutenberg/")
totalLines = lines.count()
print("total lines: %d" % totalLines)

sc.stop()

To run locally:

`$SPARK_HOME/bin/spark-submit --py-files src/aux_classes.py src/main.py arg1 arg2`
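
Everything after the script path (`arg1 arg2` here) is passed through to your script as ordinary command-line arguments. Below is a minimal sketch of how `src/main.py` might read them, assuming exactly two positional arguments. The `parse_args` helper is hypothetical and not part of the sample project.

```python
def parse_args(argv):
    """Return the two positional arguments that follow the script name.

    `argv` has the same shape as sys.argv: [script_path, arg1, arg2].
    """
    if len(argv) != 3:
        raise ValueError("usage: main.py <arg1> <arg2>")
    _, arg1, arg2 = argv
    return arg1, arg2


# Simulates what sys.argv would look like for the command above.
example_argv = ["src/main.py", "arg1", "arg2"]
first, second = parse_args(example_argv)
print("first=%s second=%s" % (first, second))
```

In the real script you would call `parse_args(sys.argv)` before creating the `SparkContext`.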

## Dependencies and class definitions

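Use the `--py-files` flag with `spark-submit` to specify additional Python modules that should be made available to each worker. These may include class definitions or third-party dependencies. If you're using your own classes, you will usually not be able to define them in the main file: the workers need to import a class's module to deserialize its instances, and they typically cannot import your main script. Put the classes in a separate module and ship it with `--py-files`.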
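
As a sketch of that layout, here is what a module like `src/aux_classes.py` might contain. The `LineInfo` class is a made-up example, not the contents of the actual file, and the last few lines exercise it on a plain Python list so the snippet runs without Spark.

```python
# Hypothetical contents of src/aux_classes.py
class LineInfo(object):
    """Small record describing one line of text."""

    def __init__(self, text):
        self.text = text
        self.num_words = len(text.split())

    def is_blank(self):
        return self.num_words == 0


# In src/main.py you would then write something along these lines:
#     from aux_classes import LineInfo
#     infos = lines.map(LineInfo)   # LineInfo objects are built on the workers
#     non_blank = infos.filter(lambda info: not info.is_blank()).count()
#
# The same logic on a plain Python list, without Spark:
sample = ["It was the best of times", "", "it was the worst of times"]
infos = [LineInfo(line) for line in sample]
non_blank = len([info for info in infos if not info.is_blank()])
print("non-blank lines: %d" % non_blank)
```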

## PySpark on EMR

The setup is similar to Scala, although you will need to manage your dependencies through the `--py-files` flag. Since you can pass Python code to `spark-submit`, you can simply use the script-runner JAR to start the cluster. See the [EMR documentation on `spark-submit` steps](http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-spark-submit-step.html).

# Creating Spark applications with Scala

While the interactive console is fun, it is (likely) not how you will be submitting a job.  Instead, you will want to follow these steps.  A sample simple application is provided in [projects/simple-spark-project](projects/simple-spark-project).

1. **Build your Spark application**: Scala is a compiled language so you will need to build a jar that can be run on the Java Virtual Machine (JVM).  JAR (Java Archive) is a package file format typically used to aggregate many Java class files and associated metadata and resources (text, images, etc.) into one file to distribute application software or libraries on the Java platform.  Go to the project directory `projects/simple-spark-project` and run
```bash
$ sbt package
```

2. **Submit the job locally**, running on 4 cores:
```bash
$ $SPARK_HOME/bin/spark-submit \
  --class "com.thedataincubator.simplespark.SimpleApp" \
  --master local[4] \
  target/scala-2.10/simple-project_2.10-1.0.jar
```

You can use `local[*]` to run with as many worker threads as there are logical cores on your machine.

## Packaging with `sbt`

### What is `sbt`?
`sbt` is a modern build tool written in and for Scala, though it is also a general purpose build tool.  Build definitions are written in a Scala [Domain Specific Language (DSL)](https://en.wikipedia.org/wiki/Domain-specific_language), meaning they are real Scala (with enough new constructs that they may not look like it).  To invoke `sbt`, run
``` bash
$ sbt
```
which brings you into an "sbt session."  The commands given below are to be typed within an `sbt` session.

### Why `sbt`?
- Sane(ish) dependency management
- Incremental recompilation and keeping the compiler alive in between compilations (see [this article](http://www.scala-sbt.org/0.13.2/docs/Detailed-Topics/Understanding-incremental-recompilation.html))
- Automatic recompilation triggered by file-change.  Within an sbt session, enter:
    ```
    ~compile
    ```
- Run the program within sbt:
    ```
    run
    ```
- Test the program within sbt:
    ```
    test
    ```
- Full Scala language support for creating tasks (it's a DSL)
- Launch a Scala REPL in the project context, with your project's classes on the classpath:
    ```
    console
    ```
    You can then type commands into the REPL to play around:
    ```scala
    import com.thedataincubator.simplespark.SimpleApp

    val x = 1
    ```

## Project Layout (Directory Structure)

A sample simple application is provided in [projects/simple-spark-project](projects/simple-spark-project).

### Source files:
1. `src/main` – your app code goes here, in a subdirectory indicating the code’s language, e.g.
    1. `src/main/scala`
    1. `src/main/java`
1. `src/main/resources` – static files you want added to your jar (e.g. logging config)
1. `src/test` – like `src/main`, but for tests
1. `src/main/scala/com/thedataincubator/simplespark/SimpleApp.scala` - an actual code file.  The path has two components:
    1. `src/main/scala` - overhead (explained above)
    1. `com/thedataincubator/simplespark/` - mirrors the package hierarchy (because it's written by people at the domain `thedataincubator.com`).  There are two files in our sample app:
        1. `src/main/scala/com/thedataincubator/simplespark/SimpleApp.scala` - main app
        1. `src/main/scala/com/thedataincubator/simplespark/Foo.scala` - helper class and methods
    
    This affects your code in two places:

    1. All your `*.scala` files in this directory should declare a package that matches their directory:
        ```scala
        package com.thedataincubator.simplespark
        ```
        You can then easily access other files in this folder (package).  For example, there is also a `Foo.scala` in this directory (with the same package declaration), and `SimpleApp.scala` can use it directly without an import.
    1. When you invoke the jar, you need to specify the fully qualified class name, package included (see the `spark-submit` command above).
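
The relationship between a source file's path and the class name passed to `--class` can be sketched in a few lines of Python. The `qualified_class_name` helper is hypothetical, and it assumes the usual convention that each file defines a class or object named after the file.

```python
def qualified_class_name(source_path, source_root="src/main/scala/"):
    """Map a Scala source path to the fully qualified class name it conventionally holds."""
    if not source_path.startswith(source_root):
        raise ValueError("expected a path under %s" % source_root)
    if not source_path.endswith(".scala"):
        raise ValueError("expected a .scala file")
    # Drop the source root and the extension, then turn directories into packages.
    relative = source_path[len(source_root):-len(".scala")]
    return relative.replace("/", ".")


main_class = qualified_class_name(
    "src/main/scala/com/thedataincubator/simplespark/SimpleApp.scala"
)
package = main_class.rsplit(".", 1)[0]
print(main_class)  # the value to pass to --class
print(package)     # the value to put in the package declaration
```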

### Build Files:
1. `build.sbt` - This is like a Makefile.  It tells `sbt` how to build your project.  You can specify the version of your application, the Scala version you want, and the versions of the dependencies you require (e.g. Spark) in the `build.sbt`:
    ```scala
    name := "Simple Project"
    version := "1.0"
    scalaVersion := "2.10.4"
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0"
    ```
1. `project/` – Because the build definition is itself Scala code, it has to be compiled before it can run.  That compilation is also done by `sbt`.  The instructions for this meta-build are placed in this directory, which allows you to tweak the build's build.
1. `project/build.sbt` - The instructions for the meta-build (like `build.sbt`, but for the build definition rather than your main project).  You can also tweak the build's build's build by adding `project/project`, and continue iterating forever (see [Organizing the Build](http://www.scala-sbt.org/0.13/tutorial/Organizing-Build.html)).
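
The settings in `build.sbt` also determine where `sbt package` puts the jar, which is where the path in the `spark-submit` command comes from. The sketch below approximates sbt's default naming as we understand it: lowercase the name, replace runs of non-word characters with `-`, and append the Scala binary version. The helper names are hypothetical, so check your `target/` directory if in doubt.

```python
import re


def scala_binary_version(scala_version):
    """'2.10.4' -> '2.10' (assumes a Scala 2.10+ style version string)."""
    return ".".join(scala_version.split(".")[:2])


def cross_versioned(artifact, scala_version):
    """Approximate what `%%` does: append the Scala binary version to the artifact name."""
    return "%s_%s" % (artifact, scala_binary_version(scala_version))


def packaged_jar_path(name, version, scala_version):
    """Approximate the default output path of `sbt package`."""
    normalized = re.sub(r"\W+", "-", name.lower())
    binary = scala_binary_version(scala_version)
    return "target/scala-%s/%s_%s-%s.jar" % (binary, normalized, binary, version)


# Values taken from the build.sbt shown above.
print(cross_versioned("spark-core", "2.10.4"))
print(packaged_jar_path("Simple Project", "1.0", "2.10.4"))
```

With the values from the sample `build.sbt`, the second line printed matches the jar path used in the `spark-submit` command earlier.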

### Output files:
1. `target/` – The destination for generated files (e.g. class files, jars).

### Further information:
For more, check out the [documentation](http://www.scala-sbt.org/0.13/tutorial/Directories.html).

## Spark on EMR
This section describes how to run a Spark jar as part of the SparkOverflow project. It assumes you are using stevenskelton's class structure and `spark-overflow_2.10-1.0.jar`.

You'll first need to create a folder on S3 called `target`, which contains your packaged jar and the `classes` folder (the contents of the `target/scala-2.10/` folder after running `sbt package`). Run the `create_spark_cluster.py` script (in `~/datacourse/scripts/`) with `target`'s parent directory as the argument, e.g.

``` bash
$ python create_spark_cluster.py s3://thedataincubator-fellow/tempFellow/spark/
```

The full path to the jar would be `s3://thedataincubator-fellow/tempFellow/spark/target/spark-overflow_2.10-1.0.jar`.

The script will create folders in your base directory called `logs/` and `output/` - make sure these are empty or nonexistent before running.
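
To make the expected layout concrete, this sketch derives the locations described above from the base directory. It only illustrates the naming. The `cluster_paths` helper is hypothetical and says nothing about how `create_spark_cluster.py` works internally.

```python
def cluster_paths(base, jar_name="spark-overflow_2.10-1.0.jar"):
    """Return the S3 locations implied by a base directory (illustration only)."""
    base = base.rstrip("/") + "/"  # tolerate a missing trailing slash
    return {
        "jar": base + "target/" + jar_name,
        "logs": base + "logs/",      # should be empty or nonexistent before running
        "output": base + "output/",  # should be empty or nonexistent before running
    }


paths = cluster_paths("s3://thedataincubator-fellow/tempFellow/spark/")
for key in sorted(paths):
    print("%s: %s" % (key, paths[key]))
```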

*Please don't alter the script!*

## Final gotcha

Scala (like Java) requires everything to be inside a class or object.  The REPL will accept a global assignment like
```scala
val x = 1
```
but this will not work in application code that you compile.  You would need to put this into an object or class.

*Copyright &copy; 2015 The Data Incubator.  All rights reserved.*