# Apache Spark @ DESC -- Part I: Installation and first steps

Author: **Julien Peloton** [@JulienPeloton](https://github.com/LSSTDESC/desc-spark/issues/new?body=@JulienPeloton)  
Last Verified to Run: **2018-10-23**  

Welcome to the series of notebooks on Apache Spark! The main goal of this series is to get familiar with Apache Spark, and in particular its Python API called pyspark. 

__Learning objectives__

- Apache Spark: what is it?
- Installation @ HOME
- Installation @ NERSC
- Using the pyspark shell
- Your first Spark program


## Apache Spark 

Apache Spark is a cluster computing framework, that is, a set of tools to perform computations on a network of many machines. Spark started in 2009 as a research project, and it has since been hugely successful in industry. It builds on the MapReduce cluster computing paradigm popularized by the Hadoop framework, and it provides implicit data parallelism and fault tolerance.
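To make the MapReduce idea concrete, here is a minimal sketch in plain Python (no Spark involved). The data and the helper names `map_partition` and `merge_counts` are made up for illustration: each "partition" is processed independently (map step), and the partial results are then merged pairwise (reduce step). On a real cluster, Spark would run the map step on different machines.

```python
from collections import Counter
from functools import reduce

def map_partition(lines):
    """Map step: count the words of one partition independently."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def merge_counts(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right

# Hypothetical data, split by hand into two "partitions".
partitions = [
    ["spark is fast", "spark is fun"],
    ["scala runs on the jvm", "spark is written in scala"],
]

# Each partition could be processed on a different machine.
partial_counts = [map_partition(p) for p in partitions]

# Partial results are combined into the final answer.
word_counts = reduce(merge_counts, partial_counts, Counter())
print(word_counts["spark"], word_counts["scala"])
```

Because `merge_counts` is associative, the partial results can be merged in any order, which is what lets a framework distribute the work freely.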

The core of Spark is written in Scala, a general-purpose programming language started in 2004 by Martin Odersky (EPFL). The language is interoperable with Java and Java-like languages, and Scala executables run on the Java Virtual Machine (JVM). Note that Scala is not a purely functional programming language. It is multi-paradigm, supporting functional programming, imperative programming, object-oriented programming and concurrent computing.

Spark exposes its functionality through Scala, Python, Java and R APIs (the Scala one being the most complete). As far as DESC is concerned, I would advocate using the Python API (called pyspark), since Python is already widely used in DESC. But feel free to get your hands on Scala, it's worth it. For those interested, you can have a look at this [tutorial](https://gitlab.in2p3.fr/MaitresNageurs/QuatreNages/Scala) on Scala.

## Installation @ HOME

You might want to install Apache Spark on your laptop, to prototype programs and perform local checks. The easiest way to do so is to [download](https://spark.apache.org/downloads.html) a pre-built version of Spark (take the latest one). Untar it, move it to the location you want, and update your path such that it can be found when you launch a job:

```bash
# Put these lines in your $HOME/.bash_profile
export SPARK_HOME=/path/to/spark
export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH
```
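Once your profile is reloaded, you can check the setup from Python. The sketch below is only a convenience check, not part of Spark: the helper name `check_spark_install` is hypothetical, and it simply looks up the `SPARK_HOME` variable and searches the `PATH` for `spark-submit`. Note that `SPARK_HOME` is only visible from Python if it has been exported.

```python
import os
import shutil

def check_spark_install(env=None):
    """Return (spark_home, spark_submit_path); either may be None."""
    env = os.environ if env is None else env
    spark_home = env.get("SPARK_HOME")
    # Search the given PATH for the spark-submit launcher script.
    spark_submit = shutil.which("spark-submit", path=env.get("PATH"))
    return spark_home, spark_submit

spark_home, spark_submit = check_spark_install()
print("SPARK_HOME:", spark_home or "not set")
print("spark-submit:", spark_submit or "not found on PATH")
```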

The latest version of Spark should run on Java 8+, but I recommend running it on Java 8. On macOS, to list the JDKs installed on your machine:

```
/usr/libexec/java_home -V
```

If Java 8 is not present, download the JDK and set it using:

```bash
# Put this line in your $HOME/.bash_profile, with the
# version number you just downloaded. Example:
export JAVA_HOME=$(/usr/libexec/java_home -v 1.8.0_151)
```
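The `java_home` tool is specific to macOS. As a platform-independent alternative, here is a small sketch that asks `java -version` which version is active. The helper names `parse_java_major` and `installed_java_major` are hypothetical, and the parsing assumes the usual `version "..."` line printed by Oracle and OpenJDK builds.

```python
import re
import subprocess

def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output.

    Java 8 and earlier report e.g. "1.8.0_151" (major is the second field);
    Java 9 and later report e.g. "11.0.2" (major is the first field).
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first, second = match.group(1), match.group(2)
    if first == "1" and second is not None:
        return int(second)
    return int(first)

def installed_java_major():
    """Run `java -version` and return the major version, or None."""
    try:
        # `java -version` writes its report to stderr, not stdout.
        result = subprocess.run(
            ["java", "-version"], capture_output=True, text=True
        )
    except OSError:
        # The `java` executable could not be launched.
        return None
    return parse_java_major(result.stderr)

print("Java major version:", installed_java_major())
```

If this prints something other than 8, check that `JAVA_HOME` points to the JDK you intend to use.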

## Installation @ NERSC

### Batch mode

NERSC provides support to run Spark at scale. Note that for Spark version 2.3.0+, Spark runs inside [Shifter](http://www.nersc.gov/research-and-development/user-defined-images/). Complete information is available at [spark-distributed-analytic-framework](http://www.nersc.gov/users/data-analytics/data-analytics-2/spark-distributed-analytic-framework/).

### JupyterLab

We provide kernels to work with Apache Spark and DESC. To get a DESC python + Apache Spark kernel, follow these steps:

```bash
# Clone the repo
git clone https://github.com/astrolabsoftware/spark-kernel-nersc.git
cd spark-kernel-nersc

# Where the Spark logs will be stored
# Logs can then be browsed from the Spark UI
LOGDIR=${SCRATCH}/spark/event_logs
mkdir -p ${LOGDIR}

# Resource to use. Here we will use 4 threads.
RESOURCE=local[4]

# Extra libraries (comma separated if many) to use.
SPARKFITS=com.github.astrolabsoftware:spark-fits_2.11:0.7.1

# Create the kernel - it will be stored under
# $HOME/.ipython/kernels/<kernelname>
python makekernel.py \
  -kernelname desc-pyspark --desc \
  -pyspark_args "--master ${RESOURCE} \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=file://${LOGDIR} \
  --conf spark.history.fs.logDirectory=file://${LOGDIR} \
  --packages ${SPARKFITS}"


```

Then select the kernel `desc-pyspark` in the JupyterLab [interface](https://jupyter-dev.nersc.gov/).
More information can be found at [spark-kernel-nersc](https://github.com/astrolabsoftware/spark-kernel-nersc).

## Using the pyspark shell (@ HOME or NERSC interactive)

### Python/IPython shells

To access the pyspark shell, just type `pyspark` in a terminal. You will be dropped into the standard Python shell, augmented with the Spark environment and pre-loaded objects such as the `SparkContext` (`sc`) and the `SparkSession` (`spark`). Between you and me, the standard Python shell is rather ugly and lacks nice features. If you really want to increase your productivity, you probably want to switch to IPython. Just type in your shell:

```
PYSPARK_DRIVER_PYTHON=ipython pyspark
```
And you should see (with your corresponding Spark, Python and IPython versions):

```
Python 3.7.0 (default, Jun 28 2018, 07:39:16)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.0.1 -- An enhanced Interactive Python. Type '?' for help.
2018-10-24 21:13:45 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.2
      /_/

Using Python version 3.7.0 (default, Jun 28 2018 07:39:16)
SparkSession available as 'spark'.

In [1]:
```

If it complains that IPython cannot be found but you know it is installed somewhere, specify the full path to it (to find it: `which ipython`). As mentioned above, your Spark environment will be loaded and a few objects will be ready to use:

```python
In [1]: # Spark Session 
In [2]: spark
Out[2]: <pyspark.sql.session.SparkSession at 0x10d8d6e80>
    
In [3]: # Spark Context 
In [4]: sc
Out[4]: <SparkContext master=local[*] appName=PySparkShell>
```
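The pre-loaded `sc` can be used right away. As a minimal sketch (the helper name `sum_of_squares` is hypothetical, not part of pyspark), the function below distributes a range of integers as an RDD, squares each element in parallel, and sums the results. It only relies on the `parallelize`, `map` and `sum` methods of the Spark context and RDD.

```python
def sum_of_squares(sc, n):
    """Distribute range(n), square each element, and add up the results.

    `sc` is expected to be a SparkContext, such as the one
    pre-loaded in the pyspark shell.
    """
    # Split the integers 0..n-1 across the available workers.
    rdd = sc.parallelize(range(n))
    # map() is lazy: nothing is computed until sum() is called.
    return rdd.map(lambda x: x * x).sum()

# In the pyspark shell, `sc` already exists, so you can call:
#   sum_of_squares(sc, 10)
```

In the shell, `sum_of_squares(sc, 10)` should return 285, the sum of the squares of 0 to 9.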

### Specifying resources

By default ... see above ... 

## Your first Spark program

```
spark-submit ....
```