# Big Data with Spark HATS

This Hands-on Advanced Tutorial Session
([HATS](http://lpc.fnal.gov/programs/schools-workshops/hats.shtml)) is
presented by the LPC to demonstrate a CMS analysis using
[Apache Spark](http://spark.apache.org/),
[Spark-ROOT](https://github.com/diana-hep/spark-root),
[Histogrammar](http://histogrammar.org/), and
[Matplotlib](https://matplotlib.org/). After introducing Spark, students
will learn the steps needed to perform a basic measurement
of the Z-boson mass using CMS data recorded in 2016.

*Note* - To perform any exercise, these notebooks must be open
within [Jupyter](https://jupyter.org). GitHub has a very nice
notebook renderer, but it is read-only and won't actually
execute any code. Information on how to access Jupyter can
be found in the [README](./README.md).



## Setup Instructions

These instructions only need to be run once, to install the libraries the tutorial requires.

Jupyter has the concept of _kernels_, which are independent execution environments. Kernels don't
even have to be Python; they exist for other languages as well.

By using a separate kernel for each project, we keep the dependencies of different
components and projects from interfering with one another, which helps reproducibility.
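To see which environment a notebook is actually running in, a short check like the one below can help. It is only a sketch: `describe_environment` is a hypothetical helper name, and the virtualenv detection relies on the `sys.real_prefix` / `sys.base_prefix` attributes that virtualenv and venv set, so other environment managers may report differently.

```python
import sys


def describe_environment():
    """Summarize the interpreter behind the current kernel (hypothetical helper)."""
    # virtualenv on Python 2 sets sys.real_prefix; venv on Python 3 sets
    # sys.base_prefix. Outside a virtual environment both fall back to sys.prefix.
    base = getattr(sys, "real_prefix", None) or getattr(sys, "base_prefix", sys.prefix)
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "in_virtualenv": base != sys.prefix,
    }


info = describe_environment()
for key in sorted(info):
    print("{}: {}".format(key, info[key]))
```

Running this cell under two different kernels should print two different `executable` paths, which is a quick way to confirm that the kernels really are isolated from each other.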

We first create a new virtualenv containing the libraries we require, then register this
environment with Jupyter using the `ipython kernel install` command.

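```bash
%%bash
# Stop at the first failing command
set -e
# Create a virtualenv named hats-spark and activate it
python2 -m virtualenv hats-spark
source hats-spark/bin/activate
# Histogrammar is installed from the 1.0.x branch of its GitHub repository
HISTOGRAMMAR_PATH='git+https://github.com/histogrammar/histogrammar-python.git@1.0.x#egg=histogrammar'
pip install ipykernel matplotlib numpy py4j "$HISTOGRAMMAR_PATH"
# Register the virtualenv with Jupyter as a kernel named hats-spark
ipython kernel install --user --name=hats-spark
```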

## Results

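If successful, you should see something similar to the following. The paths and kernel name will reflect your own setup; this sample was captured with an environment named `hats-template`: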

```
New python executable in /home/meloam/hats-template/hats-template/bin/python2
Also creating executable in /home/meloam/hats-template/hats-template/bin/python
Installing setuptools, pip, wheel...done.
/home/meloam/hats-template/hats-template/bin/pip
Collecting numpy==1.14.3
  Using cached https://files.pythonhosted.org/packages/c0/e7/08f059a00367fd613e4f2875a16c70b6237268a1d6d166c6d36acada8301/numpy-1.14.3-cp27-cp27mu-manylinux1_x86_64.whl
<snip>
Installed kernelspec hats-template in /home/meloam/.local/share/jupyter/kernels/hats-template
```

The new kernel you just made will then show up in the various Jupyter dropdowns, allowing you to use it for different notebooks. You can run the [pre-exercises](notebooks/00-preexercise.ipynb) to validate that your environment is properly configured.
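As a quick sanity check before opening the pre-exercises, the sketch below tries to import each package installed by the setup cell. It is not part of the tutorial notebooks: `check_modules` and `REQUIRED` are hypothetical names, and the check only confirms that each module can be imported, not that its version is suitable.

```python
import importlib

# Import names of the packages installed by the setup cell.
REQUIRED = ["ipykernel", "matplotlib", "numpy", "py4j", "histogrammar"]


def check_modules(names):
    """Return a dict mapping each module name to True if it can be imported."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status


for name, ok in sorted(check_modules(REQUIRED).items()):
    print("{:<14} {}".format(name, "ok" if ok else "MISSING"))
```

If any line reports `MISSING`, the notebook is probably not using the `hats-spark` kernel, or the `pip install` step did not complete.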

## Tutorial
* [Building blocks](notebooks/10-building-blocks.ipynb) - Introduction to the concepts of a Spark-based analysis
* [Z-Peak with CMS data](notebooks/20-z-peak.ipynb) - Use Spark to plot the dimuon invariant mass peak
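
For orientation, the dimuon invariant mass plotted in the Z-peak notebook can be computed from each muon's transverse momentum, pseudorapidity, and azimuthal angle. The sketch below uses the standard massless approximation, M² = 2·pT1·pT2·(cosh(Δη) − cos(Δφ)); the function name and argument names are illustrative assumptions, and the notebook's own implementation and column names may differ.

```python
import math


def dimuon_mass(pt1, eta1, phi1, pt2, eta2, phi2):
    """Invariant mass of two muons in the units of pt, neglecting the muon mass.

    Hypothetical helper: pt is transverse momentum, eta is pseudorapidity,
    phi is the azimuthal angle in radians.
    """
    m_squared = 2.0 * pt1 * pt2 * (math.cosh(eta1 - eta2) - math.cos(phi1 - phi2))
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(m_squared, 0.0))


# Two back-to-back central muons with pt = 45.6 GeV each give a mass
# close to 91.2 GeV, roughly the Z-boson mass.
print(dimuon_mass(45.6, 0.0, 0.0, 45.6, 0.0, math.pi))
```

Neglecting the muon mass (about 0.106 GeV) is a reasonable simplification at these energies, since it is much smaller than the muon momenta from Z decays.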

## Built With

* [Jupyter](http://jupyter.org/) - Interactive Python notebook interface
* [Apache Spark](http://spark.apache.org/) - Fast and general engine for large-scale data processing
* [Spark-ROOT](https://github.com/diana-hep/spark-root) - Scala-based ROOT/IO interface to Spark
* [Histogrammar](http://histogrammar.org/) - Functional histogramming framework, optimized for Spark
* [Matplotlib](https://matplotlib.org/) - Python plotting library

## Authors

* **Andrew Melo** - http://lpc.fnal.gov/fellows/2017/Andrew_Melo.shtml

## Acknowledgments

* The LPC Distinguished Researcher Program ([link](http://lpc.fnal.gov/fellows/2017.shtml)) - *Support for the author*
* Advanced Computing Center for Research and Education (ACCRE) ([link](http://www.accre.vanderbilt.edu/)) - *Host facility and sysadmin support*
* The Diana-HEP project ([link](http://diana-hep.org/)) - *Interoperability and compatibility libraries*
* Vanderbilt Trans-Institutional Programs (TIPs) Award ([link](https://vanderbilt.edu/provost/occi/tips.php)) - *Big Data hardware seed funding*