## Introduction
This template illustrates the structure of a typical analysis notebook on coffea-casa. We will load a sample file, create a minimal processor class, and run it with the Dask executor.

In [1]:
import coffea
import coffea.processor as processor
from coffea import hist  # provides hist.Cat, hist.Bin and hist.Hist used by the processor

## Loading in a file.

A fileset is a dictionary in which each key is a dataset name and each value is a list of the files in that dataset. You can have multiple datasets (multiple keys) and multiple files per dataset (multiple paths in the list). CMS files normally require authentication, but coffea-casa handles this for you with tokens. To skip the authentication step, replace the redirector portion of the file URL with xcache; i.e., the file:

`root://`**xrootd.unl.edu**`//eos/cms/store/mc/RunIIAutumn18NanoAODv7/DYJetsToLL_M-50_TuneCP5_13TeV-amcatnloFXFX-pythia8/NANOAODSIM/Nano02Apr2020_102X_upgrade2018_realistic_v21_ext2-v1/260000/47DA174D-9F5A-F745-B2AA-B9F66CDADB1A.root`

becomes

`root://`**xcache**`//eos/cms/store/mc/RunIIAutumn18NanoAODv7/DYJetsToLL_M-50_TuneCP5_13TeV-amcatnloFXFX-pythia8/NANOAODSIM/Nano02Apr2020_102X_upgrade2018_realistic_v21_ext2-v1/260000/47DA174D-9F5A-F745-B2AA-B9F66CDADB1A.root`
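When a fileset contains many files, the substitution can be scripted. The following is a minimal sketch, assuming every URL has the form `root://<redirector>//<path>`; the helper name `to_xcache` and the shortened example path are hypothetical and not part of coffea or coffea-casa.

```python
def to_xcache(url, cache_host="xcache"):
    """Swap the redirector (host) of an XRootD URL for `cache_host`.

    Hypothetical helper for illustration; not part of coffea or coffea-casa.
    """
    scheme, sep, rest = url.partition("://")
    if scheme != "root" or not sep:
        raise ValueError(f"not an XRootD URL: {url!r}")
    # `rest` looks like "<redirector>//<path>"; drop everything up to the first slash.
    _redirector, slash, path = rest.partition("/")
    if not slash:
        raise ValueError(f"URL has no path: {url!r}")
    return f"{scheme}://{cache_host}/{path}"


original = "root://xrootd.unl.edu//eos/cms/store/example.root"  # shortened, made-up path
print(to_xcache(original))
```

The double slash before the path is preserved, so the result keeps the same layout as the hand-edited URL above.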

In [2]:
fileset = {'SingleMu' : ["root://xcache//eos/cms/store/mc/RunIIAutumn18NanoAODv7/DYJetsToLL_M-50_TuneCP5_13TeV-amcatnloFXFX-pythia8/NANOAODSIM/Nano02Apr2020_102X_upgrade2018_realistic_v21_ext2-v1/260000/47DA174D-9F5A-F745-B2AA-B9F66CDADB1A.root"]}
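A fileset with more than one dataset follows the same pattern. The sketch below uses made-up dataset names and file paths purely to show the shape of the dictionary; substitute real files before passing it to a processor.

```python
# Hypothetical dataset names and paths, shown only to illustrate the structure.
example_fileset = {
    "DatasetA": [
        "root://xcache//path/to/datasetA_file1.root",
        "root://xcache//path/to/datasetA_file2.root",
    ],
    "DatasetB": [
        "root://xcache//path/to/datasetB_file1.root",
    ],
}

# Each key is a dataset name; each value is the list of files in that dataset.
files_per_dataset = {name: len(files) for name, files in example_fileset.items()}
print(files_per_dataset)
```

The dataset name chosen as the key is the label that the processor later reads from `events.metadata["dataset"]`.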

## Creating a minimal processor class.

This part is pure [Coffea](https://coffeateam.github.io/coffea/reference.html). The processor class encapsulates all of our analysis. It is what we send to our executor, which forwards it to our workers. For detailed instructions on how to create the processor class, see the Coffea examples and documentation, or refer to the benchmarks and analysis in this repository. In short:

`__init__`: This is where we define our histograms. Categorical or sparse axes (*Cats*) split data vertically, into different categories. Bin or dense axes (*Bins*) split data horizontally, into the 'bars' of the histogram. We also define an accumulator here. Data that is fed to the Processor is split into chunks, and we need to add all of these chunks together to get a histogram of **all** chunks. The accumulator is a tool that allows us to do this, by enabling easy object addition; i.e., \[AwkwardArray1\] + \[AwkwardArray2\] = \[AwkwardArray1 + AwkwardArray2\].

`accumulator`: This property exposes the accumulator defined in `__init__`; it simply returns it.

`process`: This is where all of the magic actually happens. All of your analysis code should go here. The current Coffea standard is to use NanoEvents for reading data, but outdated analyses may still make use of the old standard of JaggedCandidateArrays. It's recommended that you update to NanoEvents if this is the case; see the end section for more discussion. For a primer on columnar analysis, see the benchmarks and analysis in this repository, or the Coffea documentation's examples.

`postprocess`: This is where we can make post-analysis adjustments, such as rebinning or scaling our histograms.
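To see why an additive accumulator matters, here is a plain-Python analogy that does not use coffea: each chunk produces its own small histogram, and the per-chunk histograms are summed into a total. The names `bin_index` and `fill_chunk`, the handling of out-of-range values, and the sample values are all assumptions for illustration; coffea's `dict_accumulator` and `hist.Hist` take care of the equivalent addition in a real analysis.

```python
from collections import Counter

N_BINS, LOW, HIGH = 50, 0.0, 100.0  # 50 bins from 0 to 100, as in the MET axis


def bin_index(value):
    """Return the dense-axis bin for `value`, or None if it is out of range."""
    if not (LOW <= value < HIGH):
        return None
    return int((value - LOW) / (HIGH - LOW) * N_BINS)


def fill_chunk(dataset, values):
    """Histogram one chunk; keys are (dataset, bin), i.e. (category, bin)."""
    counts = Counter()
    for value in values:
        index = bin_index(value)
        if index is not None:
            counts[(dataset, index)] += 1
    return counts


# Two chunks of made-up MET values from the same dataset.
chunk_1 = fill_chunk("SingleMu", [12.5, 13.0, 47.2])
chunk_2 = fill_chunk("SingleMu", [13.9, 99.9, 150.0])  # 150.0 falls outside the axis

# Counter supports "+", so combining chunks is plain addition.
total = chunk_1 + chunk_2
print(total[("SingleMu", 6)])
```

The dataset name plays the role of the categorical axis and the bin index plays the role of the dense axis; because the per-chunk results can be added, the order in which chunks finish does not matter.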

In [3]:
class Processor(processor.ProcessorABC):
    def __init__(self):
        dataset_axis = hist.Cat("dataset", "")
        # Split data into 50 bins, ranging from 0 to 100.
        MET_axis = hist.Bin("MET", "MET [GeV]", 50, 0, 100)
        
        self._accumulator = processor.dict_accumulator({
            'MET': hist.Hist("Counts", dataset_axis, MET_axis),
        })
    
    @property
    def accumulator(self):
        return self._accumulator
    
    def process(self, events):
        output = self.accumulator.identity()
        
        dataset = events.metadata["dataset"]
        MET = events.MET.pt
        
        # Flatten so that we don't pass jagged data into a histogram (that would make no sense!)
        output['MET'].fill(dataset=dataset, MET=MET.flatten())
        return output

    def postprocess(self, accumulator):
        return accumulator

## Running the Dask executor.

This is where [Dask](https://dask.org/) comes in. Now that we have a minimal processor, we can run it on our sample file. This requires an executor. Coffea comes with basic executors such as `futures_executor` and `iterative_executor`, which rely only on standard Python tools. The Dask executor (`dask_executor`), however, is better suited to cluster computing, and coffea-casa enables its use.

In the JupyterLab sidebar, you should see a panel dedicated to Dask.


<img src="dask.png" alt="Drawing" width="35%"/>


You can click on the UNL HTCondor Cluster button and drag it out into a block of the Jupyter Notebook, and it will paste everything necessary to connect to the Dask scheduler. The Dask workers will then connect to this scheduler when the executor is run.

In [4]:
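from dask.distributed import Client

# The scheduler address is specific to your account; dragging the cluster
# button into the notebook pastes the correct one for you.
client = Client("tls://matousadamec-40gmail-2ecom.dask.coffea.casa:8786")
client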


Then, all we have to do is run the executor. This is done with the `processor.run_uproot_job` function, which takes the following arguments:

`fileset`: The files we want to run our analysis on. In our case, the sample file defined earlier.

`treename`: The name of the tree inside the ROOT file. For NanoAOD files, this is `Events`.

`processor_instance`: An instance of the processor class defined above.

`executor`: The executor that we wish to use; coffea-casa is intended to be used with the Dask executor. You can also try `futures_executor` and `iterative_executor`, and both can be useful for debugging or troubleshooting when workers are acting up with errors.

`executor_args`: There are many optional arguments you can put in this dictionary; see the `run_uproot_job` [documentation](https://coffeateam.github.io/coffea/api/coffea.processor.run_uproot_job.html). At minimum, the Dask executor needs a pointer to a Dask scheduler (`'client': client`); the futures and iterative executors do not. If you're using NanoEvents, you need to say so (`'nano': True`). For JaggedCandidateArrays, you shouldn't need to specify anything, until NanoEvents becomes the default.

`chunksize`: Coffea will split your data into chunks of this many events. If your data has a million events and your chunksize is 250000, you'll have four chunks. There is also a `maxchunks` argument, which stops the analysis once that many chunks have been processed. In other words, `maxchunks=2` will process only 500000 of your million events. This can be useful for debugging.
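The chunk arithmetic can be written out explicitly. This is a simplified sketch that treats the data as a single file; the function name `plan_chunks` is hypothetical, and coffea's actual chunking may divide files somewhat differently.

```python
import math


def plan_chunks(n_events, chunksize, maxchunks=None):
    """Return (chunks_processed, events_processed) for a single file."""
    n_chunks = math.ceil(n_events / chunksize)
    if maxchunks is not None:
        n_chunks = min(n_chunks, maxchunks)
    # The last chunk may be partial, so never count more events than exist.
    events = min(n_events, n_chunks * chunksize)
    return n_chunks, events


print(plan_chunks(1_000_000, 250_000))               # (4, 1000000)
print(plan_chunks(1_000_000, 250_000, maxchunks=2))  # (2, 500000)
```

A small `maxchunks` together with a modest `chunksize` gives a quick smoke test before committing to a full run.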

In [None]:
output = processor.run_uproot_job(fileset=fileset, 
                       treename="Events", 
                       processor_instance=Processor(),
                       executor=processor.dask_executor,
                       executor_args={'client': client, 'nano': True},
                       chunksize=250000)

## Miscellaneous


### JaggedCandidateArrays vs. NanoEvents
JaggedCandidateArrays employ explicit instantiation of data. For example, to get muons with a JaggedCandidateArray:

    def process(self, df):
        dataset = df['dataset']
        muons = JaggedCandidateArray.candidatesfromcounts(
            df['nMuon'],
            pt=df['Muon_pt'].content,
            eta=df['Muon_eta'].content,
            phi=df['Muon_phi'].content,
            mass=df['Muon_mass'].content,
            charge=df['Muon_charge'].content)
            
Conversely, NanoEvents reads data lazily and doesn't require explicit instantiation. This makes it both more efficient and more elegant. To get muons with a NanoEvents array:

    def process(self, events):
        dataset = events.metadata['dataset']
        muons = events.Muon

If we also wanted to get electrons, then we'd have to construct another JaggedCandidateArray with a similar block of code. With NanoEvents, we'd just call `electrons = events.Electron`. Thus, it's recommended that you make the swap to NanoEvents if you're still using JaggedCandidateArrays!
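To make "explicit instantiation" concrete: the JaggedCandidateArray snippet passes per-event counts (`df['nMuon']`) together with flat `.content` arrays, and the per-event structure is rebuilt from the two. The following plain-Python sketch shows that bookkeeping with made-up numbers; `unflatten` is a hypothetical helper, not a coffea function, and both JaggedCandidateArray and NanoEvents do this work for you.

```python
def unflatten(counts, content):
    """Group a flat list of values into one list per event using per-event counts."""
    if sum(counts) != len(content):
        raise ValueError("counts do not add up to the length of content")
    events, start = [], 0
    for n in counts:
        events.append(content[start:start + n])
        start += n
    return events


n_muon = [2, 0, 1]               # made-up counts: three events
muon_pt = [41.3, 27.8, 55.0]     # made-up flat pt values, in event order
print(unflatten(n_muon, muon_pt))  # [[41.3, 27.8], [], [55.0]]
```

With JaggedCandidateArrays this pairing of counts and content is spelled out once per attribute and per collection, which is the repetition that NanoEvents removes.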

### ServiceX
[ServiceX](https://servicex.readthedocs.io/en/latest/introduction/) is a data delivery package which uses [func_adl](https://pypi.org/project/func-adl/) to fetch data. 

**The coffea-casa facility is built to support ServiceX, though it is currently in experimental stages. This section will be updated as ServiceX implementation becomes more stable.**