# Example: Virtual Detector with Visualization in HyperSpy

An example of how to process a pixelated STEM dataset from an HDFS filesystem.

Prerequisites: 

* a python3.6 virtualenv with all requirements installed
  * ``pip install -e .`` in your LiberTEM source dir
  * ``pip install hyperspy hyperspy_gui_ipywidgets hdfs3`` for additional dependencies
  * hdfs3 might need some care as it depends on libhdfs3 which needs to be built first...
* a running HDFS installation (single node is fine)
* a dataset imported onto the HDFS filesystem with ``libertem-ingest``
  * example call: ``$ libertem-ingest ~/Data/EMPAD/scan_11_x256_y256.emd experimental/science_data/data test/``
  * ``experimental/science_data/data`` is the path to the HDF5 DataSet that contains your data, ``test/`` is the destination on the Hadoop filesystem
  * see also: ``$ libertem-ingest --help``

In [1]:
%matplotlib nbagg

We disable threading in OpenBLAS because it would interfere with the dask multiprocessing, and because OpenBLAS likes to call ``sched_yield`` for no discernible reason. The environment variables have to be set before numpy is imported.

In [2]:
import os
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"

Import our dependencies. There may be some warnings about ``hyperspy_gui_traitui`` and so on which can be safely ignored (you do need ``hyperspy_gui_ipywidgets`` though):

In [5]:
from libertem.io.dataset.hdf5 import H5DataSet
from libertem.executor.dask import DaskJobExecutor
from libertem.executor.inline import InlineJobExecutor
from libertem.job.masks import ApplyMasksJob
from libertem.utils import _make_circular_mask
import numpy as np
import hyperspy.api as hs
import matplotlib.pyplot as plt


Now onto our masks. Many problems in pixelated STEM, including virtual detectors, can be solved by multiplying each frame element-wise with a mask and summing the result. Here we define some simple radial masks:

In [6]:
masks=[
    # random noise for comparison
    lambda: np.random.randn(128, 128),
    
    # summation of all pixels:
    lambda: np.ones(shape=(128, 128)),
    
    # small radial mask at the center of the frame:
    lambda: np.ones(shape=(128, 128)) * _make_circular_mask(
        centerX=64, centerY=64,
        imageSizeX=128, imageSizeY=128, radius=15
    ),
    
    # large radial mask:
    lambda: np.ones(shape=(128, 128)) * _make_circular_mask(
        centerX=64, centerY=64,
        imageSizeX=128, imageSizeY=128, radius=50
    ),
    
    # ring, built from a large radial mask minus a smaller radial mask:
    lambda: np.ones(shape=(128, 128)) * (
        _make_circular_mask(
            centerX=64, centerY=64,
            imageSizeX=128, imageSizeY=128, radius=64
        ) &
        (~_make_circular_mask(
            centerX=64, centerY=64,
            imageSizeX=128, imageSizeY=128, radius=50
        ))
    ),
]
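
To make the "multiply and sum" idea concrete, here is a minimal numpy sketch of a virtual detector on a single frame. The ``circular_mask`` function is a hypothetical stand-in written for this illustration; it is not LiberTEM's ``_make_circular_mask``, and the uniform frame is made-up data.

```python
import numpy as np


def circular_mask(center_x, center_y, size_x, size_y, radius):
    """Boolean disk mask; an illustrative stand-in, not LiberTEM's helper."""
    y, x = np.ogrid[:size_y, :size_x]
    return (x - center_x) ** 2 + (y - center_y) ** 2 <= radius ** 2


# Hypothetical frame: uniform intensity of 1 on a 128x128 detector
frame = np.ones((128, 128))

# Small disk at the center, and a ring built as "large disk AND NOT smaller disk"
disk = circular_mask(64, 64, 128, 128, radius=15)
ring = (
    circular_mask(64, 64, 128, 128, radius=64)
    & ~circular_mask(64, 64, 128, 128, radius=50)
)

# Virtual detector signal: element-wise product of frame and mask, then summed
disk_signal = np.sum(frame * disk)
ring_signal = np.sum(frame * ring)
```

For a uniform frame, each signal is simply the number of pixels covered by the mask; on real data it is the intensity collected by that virtual detector.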

Here we define the ``stackheight``, which says how many frames we process at once. Our dataset has ``128×128`` frames at 8 bytes per pixel, so 8 frames are exactly 1 MiB. My laptop CPU has an L3 cache with a capacity of 8 MiB, so with 4 threads the working set of 4 MiB fits into the L3 cache and still leaves room for about 32 masks.

In [7]:
stackheight = 8
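
The cache estimate above can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch only: it assumes that each thread holds exactly one tile, that every mask takes 8 bytes per element, and that the masks are shared between threads. None of this is measured.

```python
frame_bytes = 128 * 128 * 8      # one frame at 8 bytes per pixel
stackheight = 8
n_threads = 4
n_masks = 32
l3_bytes = 8 * 1024 * 1024       # 8 MiB L3 cache

tile_bytes = stackheight * frame_bytes      # data for one tile
tiles_in_flight = n_threads * tile_bytes    # assumption: one tile per thread
mask_bytes = n_masks * frame_bytes          # each mask is as large as one frame
working_set = tiles_in_flight + mask_bytes

print(tile_bytes, working_set, working_set <= l3_bytes)
```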

Using ``H5DataSet`` we define the input dataset, and define the tileshape using the ``stackheight`` we set above. With ``ApplyMasksJob`` we define the operation we want to do on the dataset: apply masks. Finally, with ``DaskJobExecutor`` we define the execution environment - we want to use dask in this case, with a scheduler running on the local computer.

In [11]:
ds = H5DataSet(
    path="/home/clausen/Data/EMPAD/scan_11_x256_y256.emd",
    ds_path="experimental/science_data/data",
    target_size=512*1024*1024,
    tileshape=(1, stackheight, 128, 128)
)
# if you want to test with just a single mask, pass mask_factories=masks[-1:] to select just the last one
job = ApplyMasksJob(dataset=ds, mask_factories=masks)

executor = DaskJobExecutor.make_local(cluster_kwargs={"threads_per_worker": 1, "n_workers": 4})

# we use a local cluster here; to use a real cluster, use the following line instead:
# executor = DaskJobExecutor(scheduler_uri="tcp://localhost:8786")

# executor = InlineJobExecutor()
# print(executor.client.scheduler_info())

Now we define a HyperSpy signal that will hold our result. The navigation axis selects the result for a specific mask, and the two signal axes span the 256×256 result image.

In [12]:
full_result = hs.signals.Signal2D(np.zeros(shape=(len(job.masks), 256, 256)))
full_result.axes_manager[0].name = "masks"
full_result.axes_manager[1].name = "x"
full_result.axes_manager[2].name = "y"
full_result.plot(navigator="slider")

In [13]:
from ipywidgets import FloatProgress
from IPython.display import display
f = FloatProgress(min=0, max=len(list(job.get_tasks())))
display(f)
full_result.data = np.zeros(shape=(len(job.masks), 256, 256))
full_result.events.data_changed.trigger(full_result)
full_result._plot.signal_plot.figure.canvas.draw()

FloatProgress(value=0.0, max=16.0)

To kick off the computation, we tell the ``executor`` to run ``job``. ``run_job`` is a generator, and we get the results back as they arrive from the dask scheduler. Each ``result`` is a list of ``ResultTile`` instances, and each ``tile`` represents the result of all masks applied to ``stackheight`` frames. Each ``ResultTile`` knows where in the ``full_result`` it should end up, so we ask it via ``copy_to_result`` to copy its data to the HyperSpy signal data array. Once all results are in, we trigger the ``data_changed`` HyperSpy event so that the plot is refreshed. Uncommenting the canvas redraw inside the loop gives live feedback as results come in, at the cost of speed.

In [15]:
def run():
    f.value = 0
    for result in executor.run_job(job):
        for tile in result:
            tile.copy_to_result(full_result.data)
        # updating the whole canvas is expensive, but gives us live feedback:
        # full_result._plot.signal_plot.figure.canvas.draw()
        f.value += 1
        
    return tile
%time tile = run()
%timeit run()
full_result.events.data_changed.trigger(full_result)

CPU times: user 496 ms, sys: 39.1 ms, total: 535 ms
Wall time: 1.2 s
1.12 s ± 31.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Nice! With a warm fs cache, LiberTEM can apply 5 masks to the 8 GB dataset in about 1.1 s (a single mask takes only ~800 ms). You can have a look at the different results for each mask by using the index slider in the plot above.

Let's have a look at one of the result tiles:

In [16]:
tile

<ResultTile for slice=<Slice origin=(47, 248, 0, 0) shape=(1, 8, 128, 128)>>

The ``origin`` tells us where in the full dataset this tile originated, and ``shape`` tells us the shape of the *source ``DataTile``* that was processed into this ``ResultTile`` (more on ``DataTile`` later).

In [17]:
tile.data.shape

(8, 5)

The source ``DataTile`` had a shape of ``(1, 8, 128, 128)``, that is, 8 frames. Applying 5 masks reduces each frame to one number per mask, so the result has a shape of ``(8, 5)``.
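
The reduction from ``(1, 8, 128, 128)`` to ``(8, 5)`` can be reproduced with a single matrix product in plain numpy. The sketch below uses random stand-in data and also shows one plausible way such a tile could be placed into a ``(5, 256, 256)`` result array at the origin seen above. How ``copy_to_result`` does this internally is not shown in this notebook, so the placement step is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n_masks, stackheight = 5, 8

# Stand-in data: one tile of 8 frames and a stack of 5 masks
data_tile = rng.random((1, stackheight, 128, 128))
mask_stack = rng.random((n_masks, 128, 128))

# Flatten frames and masks so that "multiply and sum" becomes a matrix product
frames_flat = data_tile.reshape(stackheight, -1)    # (8, 16384)
masks_flat = mask_stack.reshape(n_masks, -1).T      # (16384, 5)
tile_result = frames_flat @ masks_flat              # (8, 5)

# Assumed placement: the tile covers 8 consecutive scan positions in one row
full = np.zeros((n_masks, 256, 256))
row, col = 47, 248
full[:, row, col:col + stackheight] = tile_result.T
```

Each entry ``tile_result[i, j]`` equals ``np.sum(data_tile[0, i] * mask_stack[j])``, which is the virtual detector signal of mask ``j`` for frame ``i``.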

## Data Partitioning

For distributing the computation, we also have to distribute the data. For that, the ``libertem-ingest`` script slices the data into partitions. Let's have a look at them:

In [19]:
[p.slice for p in ds.get_partitions()]

[<Slice origin=(0, 0, 0, 0) shape=(16, 256, 128, 128)>,
 <Slice origin=(16, 0, 0, 0) shape=(16, 256, 128, 128)>,
 <Slice origin=(32, 0, 0, 0) shape=(16, 256, 128, 128)>,
 <Slice origin=(48, 0, 0, 0) shape=(16, 256, 128, 128)>,
 <Slice origin=(64, 0, 0, 0) shape=(16, 256, 128, 128)>,
 <Slice origin=(80, 0, 0, 0) shape=(16, 256, 128, 128)>,
 <Slice origin=(96, 0, 0, 0) shape=(16, 256, 128, 128)>,
 <Slice origin=(112, 0, 0, 0) shape=(16, 256, 128, 128)>,
 <Slice origin=(128, 0, 0, 0) shape=(16, 256, 128, 128)>,
 <Slice origin=(144, 0, 0, 0) shape=(16, 256, 128, 128)>,
 <Slice origin=(160, 0, 0, 0) shape=(16, 256, 128, 128)>,
 <Slice origin=(176, 0, 0, 0) shape=(16, 256, 128, 128)>,
 <Slice origin=(192, 0, 0, 0) shape=(16, 256, 128, 128)>,
 <Slice origin=(208, 0, 0, 0) shape=(16, 256, 128, 128)>,
 <Slice origin=(224, 0, 0, 0) shape=(16, 256, 128, 128)>,
 <Slice origin=(240, 0, 0, 0) shape=(16, 256, 128, 128)>]

So our dataset is neatly sliced in the first dimension, each partition containing ``16*256 = 4096`` frames.
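
The partition shape is consistent with the ``target_size`` passed to ``H5DataSet`` earlier, if we assume that ``target_size`` is a partition size in bytes. That interpretation matches the output above but is not confirmed here, so treat the following as a plausibility check rather than a description of the partitioning code.

```python
frame_bytes = 128 * 128 * 8                # 8 bytes per pixel
target_size = 512 * 1024 * 1024            # value passed to H5DataSet above
scan_rows, scan_cols = 256, 256

frames_per_partition = target_size // frame_bytes        # 4096
rows_per_partition = frames_per_partition // scan_cols   # 16 full scan rows
n_partitions = scan_rows // rows_per_partition           # 16

print(frames_per_partition, rows_per_partition, n_partitions)
```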

Now, for efficient computation, each partition itself is split into even smaller ``DataTile``s that fit into the L3 cache:

In [21]:
p = next(ds.get_partitions())
data_tile = next(p.get_tiles())

In [22]:
data_tile

<DataTile <Slice origin=(0, 0, 0, 0) shape=(1, 8, 128, 128)>>

So each ``DataTile`` contains ``stackheight`` frames, each of which is 128×128.

In [23]:
len(list(ds.get_partitions()))

16

In [24]:
len(list(p.get_tiles()))

512
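
The tile count follows directly from the shapes: a partition of ``(16, 256)`` scan positions, cut into stacks of ``stackheight`` frames along the second dimension, gives ``16 * 256 / 8 = 512`` tiles. The sketch below enumerates the tile origins for the first partition. The iteration order is an assumption made for illustration, not taken from LiberTEM.

```python
stackheight = 8
partition_origin = (0, 0)
partition_shape = (16, 256)    # scan rows x scan columns in one partition

# One tile per run of `stackheight` consecutive scan positions within a row
tile_origins = [
    (partition_origin[0] + r, c)
    for r in range(partition_shape[0])
    for c in range(0, partition_shape[1], stackheight)
]

tiles_per_partition = len(tile_origins)    # 512
total_tiles = 16 * tiles_per_partition     # 8192 across all 16 partitions
```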