# Big Data with Spark HATS _Pre-Exercises_

These pre-exercises will introduce you to the basics of
[Jupyter](https://jupyter.org) and verify that your environment
is properly configured.

*Note* - To perform any exercise, these notebooks must be open
within [Jupyter](https://jupyter.org). GitHub has a very nice
notebook renderer, but it is read-only and won't actually
execute any code. Information on how to access Jupyter can
be found in the [README](../README.md).

## Introduction to Jupyter

Developing via Jupyter is the inverse of how most people interact with Python. Normally, people write a whole script, pass it in bulk to Python, execute it, then examine the outputs.

With Jupyter, however, you start Python first, *then* pass it snippets of code to execute. Since the code is added incrementally to a constantly running Python process, intermediate values stay in memory, and the coding cycle of "write, execute, print outputs, loop" becomes much more interactive. This mode of development isn't totally new: REPL (Read, Eval, Print, Loop) shells exist for a number of languages.

Below this text is your first code entry box, known as a "cell" in Jupyter parlance. Enter the following code, then press Shift+Enter to tell Jupyter to execute it:
```python
x = 10
y = 5
print(x + y)
```

Notice that after you press Shift+Enter, the text to the left of your cell changed to `In [1]`. Jupyter counts the number of times that cells are executed, and updates the legend for each cell. This is useful if you want to find the last cell you executed.

This was a quick execution, so you may have missed it, but while the Jupyter kernel is running a cell, that cell's label changes to `In [*]`. While there's an asterisk, Python is running in the background. If it seems to be running too long, you can either interrupt the current execution via the `Kernel -> Interrupt` menu item or restart the whole process via `Kernel -> Restart`, which also discards all variables.

Once the cell has run, you should get the expected value of `15`. Remember, though, that the Python process is still running in the background. You can verify this by printing the value of `y` and noting that it returns the expected value of `5`:
```python
print(y)
```

Don't forget to press Shift+Enter to execute cells.
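
To illustrate why `y` is still available after the first cell has finished, here is a minimal sketch of the idea behind a kernel: snippets are executed one at a time against a single, long-lived namespace. This is a toy model, not how Jupyter is actually implemented, and `run_cell` and `namespace` are hypothetical names used only for illustration.

```python
# A toy stand-in for a kernel: one dictionary holds every variable
# defined so far, and each "cell" is executed against it.
namespace = {}

def run_cell(source):
    """Execute one snippet of code in the shared namespace."""
    exec(source, namespace)

run_cell("x = 10\ny = 5")    # first cell defines x and y
run_cell("total = x + y")    # a later cell can still see them

print(namespace["total"])    # prints 15
```

Restarting the kernel corresponds to throwing this dictionary away, which is why variables disappear after `Kernel -> Restart`.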

## Verify Spark is Properly Installed
Each notebook runs as a separate Python process, but Spark itself is implemented in Scala, which runs on the Java Virtual Machine (JVM). The PySpark interface connects the two languages and handles passing data back and forth between them.

To gain access to Spark, you first ask PySpark's `SparkSession.builder` to load the Scala binaries and connect to them. Once the connection is complete, PySpark returns a SparkSession, which is the central entry point to Spark.

In [None]:
# Start up spark and get our SparkSession...
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName(# Name of your application in the dashboard/UI
             "00-test-spark"
            ) \
    .config(# Tell Spark to load some extra libraries from Maven (the Java repository)
            'spark.jars.packages',
            'org.diana-hep:spark-root_2.11:0.1.16,org.diana-hep:histogrammar-sparksql_2.11:1.0.4'
            ) \
    .getOrCreate()
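
The value passed to `spark.jars.packages` is a comma-separated list of Maven coordinates, each of the form `groupId:artifactId:version`. As a sketch, the same string can be assembled from a list, which is easier to read and extend. The variable names `packages` and `packages_option` are illustrative and do not appear elsewhere in this tutorial.

```python
# Maven coordinates (groupId:artifactId:version) used in the cell above.
packages = [
    "org.diana-hep:spark-root_2.11:0.1.16",
    "org.diana-hep:histogrammar-sparksql_2.11:1.0.4",
]

# Spark expects a single comma-separated string with no spaces.
packages_option = ",".join(packages)
print(packages_option)
```

The resulting string can be passed as the second argument to `.config('spark.jars.packages', ...)`.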

## Calculate π using Spark
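To verify that Spark is working, estimate π by randomly choosing one million points in the unit square and counting how many fall inside the unit circle. Since the points have non-negative coordinates, they cover one quarter of the circle, so the fraction inside approaches π/4 and multiplying it by 4 gives the estimate.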

In [None]:
import random

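def inside(p):
    # p is the sample index supplied by filter(); it is unused because
    # each call draws a fresh random point in the unit square
    x, y = random.random(), random.random()
    return x*x + y*y < 1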

# SparkContexts provide most of the "execution engine" of Spark
sc = spark.sparkContext
# One million
num_samples = 1000000
count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4.0 * count / num_samples
print(pi)
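
For comparison, the same estimate can be sketched in plain Python without Spark. A point drawn uniformly from the unit square lands inside the quarter circle of radius 1 with probability π/4, so four times the observed fraction approximates π. The seed and the smaller sample count are arbitrary choices made here to keep the example quick and repeatable.

```python
import random

random.seed(42)               # arbitrary seed so the run is repeatable
num_local_samples = 100000    # fewer samples than the Spark version

hits = 0
for _ in range(num_local_samples):
    x, y = random.random(), random.random()
    if x * x + y * y < 1:     # point lies inside the quarter circle
        hits += 1

pi_estimate = 4.0 * hits / num_local_samples
print(pi_estimate)
```

The Spark version does the same counting, but `parallelize` splits the samples across workers and `filter(...).count()` combines their partial counts.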

## Verify Spark-ROOT functionality
When Spark was started, we loaded two additional Scala libraries into the environment:
* **[Spark-ROOT](https://github.com/diana-hep/Spark-ROOT)** - Scala-based ROOT/IO interface to Spark
* **[Histogrammar](https://histogrammar.org)** - Functional histogramming framework, optimized for Spark

First, verify that Spark-ROOT is functional by loading a ROOT file and counting the number of events it contains. If successful, you will see this example file has four events.

In [None]:
# Read the ROOT file into a Spark DataFrame...
import os
testPath = "file://%s/../root/test-tuple.root" % os.getcwd()
df = spark.read.format("org.dianahep.sparkroot.experimental").load(testPath)
# ... and print the number of events
print(df.count())
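
If the load fails, one possible cause is a wrong path, since the URI is built relative to the notebook's working directory. The sketch below uses only the standard library to resolve the `..` component and report whether the file exists. `local_path` and `test_uri` are names introduced here for illustration.

```python
import os

# Resolve "<cwd>/../root/test-tuple.root" to a normalized absolute path.
local_path = os.path.normpath(
    os.path.join(os.getcwd(), "..", "root", "test-tuple.root"))

# Same "file://" prefix as the cell above, but with ".." resolved.
test_uri = "file://" + local_path

print(test_uri)
print("exists: %s" % os.path.exists(local_path))
```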

## Verify Histogrammar works
The second Scala library PySpark loaded was Histogrammar, which is used to quickly produce histograms of very large datasets in a distributed, functional manner. Verify that it works by generating random numbers distributed according to the unit Gaussian distribution, adding them to a histogram, then plotting the result.
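
To make the binning concrete, here is a plain-Python sketch of what a 16-bin histogram over the range -4 to 4 does: each value is mapped to a bin index, and values outside the range are counted separately. This illustrates the idea only and is not Histogrammar's implementation. All names in it are illustrative.

```python
import random

random.seed(0)                     # arbitrary seed for repeatability
num, low, high = 16, -4.0, 4.0     # 16 bins covering [-4, 4)
width = (high - low) / num         # each bin is 0.5 wide

counts = [0] * num
outside = 0                        # values below low or at/above high

for _ in range(10000):
    value = random.gauss(0, 1)
    if low <= value < high:
        counts[int((value - low) / width)] += 1
    else:
        outside += 1

# Crude text rendering: one '#' per 50 entries in each bin.
for i, c in enumerate(counts):
    print("%5.1f | %s" % (low + i * width, "#" * (c // 50)))
```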

In [None]:
# Produce 200,000 random numbers
import random
data = [random.gauss(0, 1) for i in range(200000)]

# Define the histogram
import histogrammar as hg
gauss = hg.Bin(num=16,
               low=-4,
               high=4,
               quantity=lambda x: x,
               value=hg.Count())

# Fill the histogram with the values we generated earlier
for d in data:
    gauss.fill(d)

# Plot the resulting histogram
print(gauss.ascii())

If successful, you will see a text representation of a unit Gaussian. This is underwhelming, so let's test the final piece of the puzzle, a plotting library.

## Verify Matplotlib Functionality
One of the very nice features of Jupyter is its ability to display visualisations inline within the browser. To tell Jupyter to display images inline, we can use what is known as a [magic command](http://ipython.readthedocs.io/en/5.x/interactive/magics.html). Magic commands begin with a `%` character and are interpreted directly by Jupyter instead of being passed to Python. Enable inline plots by executing the following in the next cell:
```python
%matplotlib inline
```

Now that Jupyter knows we want to see matplotlib outputs in the browser, plot the histogram we previously generated:

In [None]:
plot = gauss.plot.matplotlib(name="Unit Gaussian")

## Complete
If each step ran without error, congratulations, your environment is properly set up!

One note: starting a notebook automatically starts a Python interpreter in the background, so you can close your web browser, come back later, and resume where you left off. However, each Python process consumes resources, which can be significant if you're processing large datasets. When you're done with a notebook, free up those resources by shutting down its interpreter via `File -> Close and Halt`. The server itself uses very few resources, so you don't need to click the `Shutdown Server` button.

If you're interested, feel free to browse the rest of the tutorial, starting from the [index](../Start-Here.ipynb).