In [None]:
# import all the libraries we need
import os

import parsl
from parsl import python_app
from parsl.config import Config
from parsl.channels import LocalChannel
from parsl.executors import HighThroughputExecutor
from parsl.providers import LocalProvider

# helpers
from grouputils import initialize_stager
from grouputils import plot_tiles


## Background

The first step in our workflow is to "stage" our data. Staging the data encompasses the following pre-processing tasks:

- simplify the polygons 
- set an input CRS if one is missing
- reproject the data when required
- add additional properties to each polygon, including: the centroid x and y
  coordinates, the area, a unique ID, and the name of the file that the
  polygon originated from
- break each input file into [standardized tiles](https://docs.opengeospatial.org/is/17-083r2/17-083r2.html), and save them to disk.

Here is a diagram showing what the most important step, the last one, looks like.

![](https://raw.githubusercontent.com/PermafrostDiscoveryGateway/viz-staging/develop/docs/images/staging_tldr.png)
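To make the tiling idea concrete, here is a minimal sketch of how a point could be mapped to a tile index. It assumes a simple quad grid in longitude/latitude (2 columns by 1 row at zoom 0, with tile size halving at each zoom level); the tile matrix set that `pdgstaging` actually uses may differ. The function name `tile_index` and the sample centroids are hypothetical, chosen only for illustration.

```python
import math
from collections import defaultdict


def tile_index(lon, lat, zoom):
    """Return (column, row) of the tile containing a lon/lat point.

    Assumption: a quad grid in degrees where zoom 0 has 2 columns and
    1 row, and row 0 is the northernmost row.
    """
    tile_size = 180.0 / (2 ** zoom)  # degrees per tile side
    n_cols = 2 ** (zoom + 1)
    n_rows = 2 ** zoom
    # Clamp so points on the east/south edge fall in the last tile.
    col = min(int(math.floor((lon + 180.0) / tile_size)), n_cols - 1)
    row = min(int(math.floor((90.0 - lat) / tile_size)), n_rows - 1)
    return col, row


# Hypothetical polygon centroids (lon, lat), grouped by the tile they fall in.
centroids = [(-150.2, 70.1), (-150.1, 70.2), (20.5, 68.9)]
tiles = defaultdict(list)
for lon, lat in centroids:
    tiles[tile_index(lon, lat, zoom=3)].append((lon, lat))

for index, members in sorted(tiles.items()):
    print(index, len(members))
```

Grouping by tile index like this is what lets each output file hold only the polygons for one tile, regardless of which input file they came from.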

We will use some methods from the `pdgstaging` library to stage our tiles. The first step is to initialize the `TileStager`. `TileStager` is a class with two methods of interest: `stage`, which works on a single vector file, and `stage_all`, which stages all vector files in a directory.

### Initialize the stager

First we need to use the `initialize_stager` function to instantiate the `TileStager` object. The arguments to this function are `dir_input`, the directory of input vector files, and `dir_staged`, the directory of output vector files.

Input vector files are located **in `/home/jclark/example-data`**

In [None]:
# execute the initialize_stager function with the appropriate arguments
# save the result to a variable called iwp_stager
iwp_stager = initialize_stager("/home/jclark/example-data")

Feel free to explore the `iwp_stager` object in any way you like.

Next let's use it to get a list of files to stage.

In [None]:
files_to_stage = iwp_stager.tiles.get_filenames_from_dir('input')

## Stage one file

Here is an example of how to run the stager on one file. We use the `stage` method on the `iwp_stager` object, with a path to a file as the argument to the method.

In [None]:
example_file = files_to_stage[0]
iwp_stager.stage(example_file)

Based on how long it took to stage one file, estimate how long it would take to stage all of the input files in this example serially. How long would it take if we had 100 files? 1000?

In [None]:
# estimate computation time
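
One rough way to make this estimate is to multiply a measured per-file time by the number of files. The sketch below assumes every file takes about as long as the one you timed, which is only approximate because the input files vary in size. The value of `seconds_per_file` is a hypothetical placeholder, and `estimate_serial_seconds` is a name made up for this example.

```python
def estimate_serial_seconds(seconds_per_file, n_files):
    """Estimate total serial run time, assuming similar cost per file."""
    return seconds_per_file * n_files


# Hypothetical placeholder: replace with the time you measured for one file,
# for example by wrapping the stage call with time.perf_counter().
seconds_per_file = 30.0

# 35 is the number of input files in this example; 100 and 1000 are the
# larger cases asked about above.
for n_files in (35, 100, 1000):
    total = estimate_serial_seconds(seconds_per_file, n_files)
    print(f"{n_files} files: about {total / 60:.1f} minutes")
```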

For these example data, the total time is manageable. But as the number of files grows, the run time of a serial workflow grows with it and quickly becomes impractical. Luckily for us, this problem is pleasingly parallel: the staging of each file is completely independent of the others. So, let's set this up as a `parsl` workflow using the skills we learned in Section 4.

Just to get a sense of what happened, let's plot the result of our test staging effort using a `plot_tiles` helper we wrote for this activity.

In [None]:
plot_tiles(iwp_stager)

Finally, let's remove the files we just created (including the staging summary CSV file) to prepare to run this over all of the files. If we don't do this, polygons will be appended to the existing staged files, which will result in duplicates.

In [None]:
os.system(f'rm -rf {iwp_stager.config.get("dir_staged")}')
os.system(f'rm {iwp_stager.config.get("filename_staging_summary")}')

### Staging in parallel

First, set up the configuration for `parsl` using `Config` and a `HighThroughputExecutor`. For the executor, set `max_workers` to 32, and set `max_blocks` for the `provider` to 1. This will spread our work over 32 processes on the server. Make sure you pass the bash command you use to activate your virtual environment to the `worker_init` argument as a string.

In [None]:
# htex_config = Config(
#   executors=[
#       HighThroughputExecutor(
#           ..., 
#           provider = LocalProvider(...)
#       )
#   ]
# )
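
As a point of comparison, a filled-in configuration might look something like the sketch below. This is one possible setup, not the only valid one. The activation command in `activate_env` is a hypothetical placeholder, the `label` is arbitrary, and placing `worker_init` on the `LocalProvider` is an assumption about where you want the command to run. Argument names can also vary between Parsl releases, so check the documentation for the version you have installed.

```python
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.providers import LocalProvider

# Hypothetical activation command; replace with the one for your environment.
activate_env = "source /path/to/your/venv/bin/activate"

htex_config = Config(
    executors=[
        HighThroughputExecutor(
            label="htex_local",  # arbitrary label chosen for this sketch
            # Newer Parsl releases name this argument max_workers_per_node.
            max_workers=32,
            provider=LocalProvider(
                worker_init=activate_env,  # runs before workers start
                init_blocks=1,
                max_blocks=1,
            ),
        )
    ]
)
```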

Next, set up your Parsl app to run the `stage` method in parallel. You'll need to pass two arguments to the app function: the path of the file to stage and the `TileStager` instance we created earlier.

In [None]:
# Make a Parsl app that uses the stage method


Now, execute the app in parallel over all of the `files_to_stage`.

In [None]:
# execute the app using app.futures

In [None]:
# don't forget to shutdown and clear your executor

Now we can look at the `plot_tiles` result again. Note that it only plots the first 92 of our tiled files.

In [None]:
plot_tiles(iwp_stager)

## Bonus

This process took the original 35 files, ranging in size from 20 MB to 500 MB (6 GB total), and tiled them into around 2200 files. If you set up your executor as described above, it should have taken around 15 minutes.

Discuss in your groups whether you suspect this process is CPU bound, I/O bound, memory bound, or network bound. How would you figure it out for sure? Why would you want to know?
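
One simple measurement that can inform this discussion is to compare CPU time with wall-clock time for a piece of work. The sketch below is a toy illustration, not a profile of the staging code: `busy_work` and `waiting_work` are hypothetical stand-ins, with `time.sleep` imitating time spent waiting on disk or network. Note that `time.process_time` only counts CPU time used by the current process, so for work running in separate worker processes you would need a system monitor or a profiler instead.

```python
import time


def cpu_fraction(func, *args):
    """Return CPU time divided by wall-clock time for one call of func."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func(*args)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu / wall


def busy_work(n):
    # Pure computation: keeps the CPU occupied the whole time.
    total = 0
    for i in range(n):
        total += i * i
    return total


def waiting_work(seconds):
    # Sleeping stands in for waiting on disk or network.
    time.sleep(seconds)


busy_ratio = cpu_fraction(busy_work, 2_000_000)
waiting_ratio = cpu_fraction(waiting_work, 0.2)
print(f"busy:    {busy_ratio:.2f}")
print(f"waiting: {waiting_ratio:.2f}")
```

A ratio close to 1 suggests the work is limited by computation, while a ratio close to 0 suggests the process spends most of its time waiting on something else.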

### Answer

