# Software Architecture

## Software Design

Unlike most InSAR processing software (e.g., StaMPS, MintPy), which has a designated processing workflow, Decorrelation only provides a collection of Python functions and commands. The reason is that, in real applications, no single workflow always generates a satisfactory deformation result, especially when coherence is poor and atmospheric artifacts are strong. One needs to try many different methods, but they are generally implemented in different packages. Even worse, workflow-based software is often heavily encapsulated and comes without detailed documentation. It is frustrating when users need to save intermediate data from one software package and prepare them in the format and structure required by another. Sometimes it is necessary to read a lot of source code to understand what the outputs are, what their data structures are, and what inputs are needed, because the typical workflow is not being followed.

So, instead of providing a standard workflow, Decorrelation is designed as a collection of functions that implement specific InSAR processing techniques (e.g., calculating the dispersion index, phase linking), and users are encouraged to build their own workflows that suit their cases. We provide the necessary infrastructure and your role is to be innovative! To make this easier, Decorrelation provides detailed documentation for each function that explains its usage. We also provide a tutorials section with some typical workflows for your reference. In case users want to try methods that are not implemented in Decorrelation, the inputs and outputs of every Decorrelation function are well explained in its documentation.

Although we provide detailed documentation and reference workflows, we admit that this software is not so easy that users only need to run it from the first step to the last. This doesn't mean we don't value user-friendliness, but it shouldn't come at the expense of flexibility and creativity.

## Software Structure

Most of the functions in this package provide two kinds of API: the array-based API and the file-based API. The inputs and outputs of array-based functions are generally numpy or cupy arrays (simply put, cupy is a package that provides the same functions as numpy but runs on the GPU), while the inputs and outputs of file-based functions are strings giving the paths to arrays stored on disk. InSAR techniques that can be greatly accelerated with parallel processing are implemented in cupy for better performance, while all other functions are implemented with numpy arrays. The file-based functions are not simple wrappers of the array-based functions. Due to the limitations of numpy and cupy, most array-based functions can only run on a single CPU core or a single GPU. The file-based functions, however, support parallel processing on multiple CPU cores and multiple GPUs with the help of [dask](https://www.dask.org/). There is a performance cost to using `dask`, though, so the array-based functions are sometimes faster. Another benefit of `dask` is lower memory usage, since the chunks can be processed in sequence.
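To make the difference between the two API styles concrete, here is a minimal sketch that uses only numpy and plain `.npy` files. The names `wrapped_phase` and `de_wrapped_phase` are hypothetical and are not part of Decorrelation, and the real file-based functions rely on zarr and `dask` rather than on the memory-mapped loop shown here. The sketch only illustrates the pattern: an array-based function takes and returns arrays, while a file-based function takes paths and processes the data chunk by chunk, so only one chunk has to be in memory at a time.

```python
import os
import tempfile

import numpy as np


def wrapped_phase(arr):
    """Array-based style (hypothetical example): array in, array out."""
    return np.angle(arr)


def de_wrapped_phase(in_path, out_path, chunk_size):
    """File-based style (hypothetical example): paths in, result written to disk."""
    # Memory-map the input so that data are only read when a chunk is accessed.
    src = np.load(in_path, mmap_mode="r")
    out_dtype = wrapped_phase(src[:1]).dtype
    # Create the output .npy file on disk with the same shape as the input.
    dst = np.lib.format.open_memmap(
        out_path, mode="w+", dtype=out_dtype, shape=src.shape
    )
    # Process the chunks in sequence along the first dimension.
    for start in range(0, src.shape[0], chunk_size):
        stop = min(start + chunk_size, src.shape[0])
        dst[start:stop] = wrapped_phase(src[start:stop])
    dst.flush()


with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, "input.npy")
    out_path = os.path.join(tmp, "phase.npy")
    rng = np.random.default_rng(0)
    data = rng.standard_normal((10, 3)) + 1j * rng.standard_normal((10, 3))
    np.save(in_path, data)
    de_wrapped_phase(in_path, out_path, chunk_size=4)
    result = np.load(out_path)  # read the result back into memory
```

Here `result` holds the same values as `wrapped_phase(data)`, but the file-based call never needed the whole input array in memory at once.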

All functions in the file-based API start with the prefix `de_` to make them more distinguishable. Terminal commands with the same names as the file-based functions are also provided. For simplicity, we call the file-based functions and commands the CLI (command line interface) and the array-based API simply the API. The API and CLI functions are arranged in different namespaces. In this documentation website, if the documentation of an API function `xxx` is in section `API` and subsection `pl`, then it is in the namespace `decorrelation.pl`, and the correct way to import it is `from decorrelation.pl import xxx`. For a CLI function `de_xxx`, if the documentation page is in section `CLI` and subsection `pl`, then it is in the namespace `decorrelation.cli.pl`, and the correct way to import it is `from decorrelation.cli.pl import de_xxx`.

## Data format

Most of the stored data in this package are in the [zarr](https://zarr.readthedocs.io/en/stable/) format, a file storage format for chunked, compressed, N-dimensional arrays. The figure below shows the structure of zarr data. Reading and writing are fast since the data volume is compressed. Before compression, the data are divided into chunks, which makes them more flexible for `dask` parallel operations. Generally, the file name is `xxxxxx.zarr`. You will find that it is actually a directory in the file system, but just treat it as a single file in use.

![imga](./software_architecture/array.svg)

Note that the structure of a dask array is similar. Each chunk of a big dask array is just a numpy or cupy array. Independent operations on each chunk are automatically parallelized.
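The idea of independent per-chunk operations can be mimicked without `dask`. The sketch below is only an illustration with numpy and the standard library: the helper names `chunk_slices` and `map_chunks` are hypothetical, and `dask` does considerably more (lazy task graphs, scheduling across processes and GPUs). It splits an array into chunks along the first dimension, applies a function to each chunk in a thread pool, and concatenates the results.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def chunk_slices(n, chunk_size):
    """Slices that cover range(n) in consecutive pieces of at most chunk_size."""
    return [slice(start, min(start + chunk_size, n)) for start in range(0, n, chunk_size)]


def map_chunks(func, arr, chunk_size, max_workers=2):
    """Apply func to every chunk of arr independently, then stitch the results."""
    slices = chunk_slices(arr.shape[0], chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps the results in the same order as the input slices.
        results = list(pool.map(lambda s: func(arr[s]), slices))
    return np.concatenate(results, axis=0)


data = np.arange(10, dtype=np.float64)
doubled = map_chunks(lambda chunk: chunk * 2, data, chunk_size=4)
```

With 10 elements and a chunk size of 4, the array is processed as three chunks of 4, 4, and 2 elements.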

In this software, there are mainly two kinds of dataset: stacks of raster data and stacks of point cloud data. Raster datasets are divided into chunks along both the azimuth and range dimensions. Point cloud datasets are divided into chunks along the spatial dimension. These two chunk sizes need to be determined by the user. The chunk sizes in the higher dimensions are determined automatically, so users don't need to care about them.

Chunk size affects the performance of the program. An improper chunk size slows down processing or even crashes the program.
A chunk size that is too small causes too much inter-process communication and slows down the program.
A chunk size that is too big may crash the program by exceeding the memory limit.
For raster data, it is good to make sure the range chunk size of the last chunk is the same as that of the others.
It is also preferable to divide raster data along the azimuth direction rather than the range direction.
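When picking a chunk size, a quick back-of-the-envelope calculation helps. The sketch below uses two hypothetical helpers that are not part of Decorrelation: `chunk_nbytes` estimates the uncompressed size of one chunk in memory, and `even_chunk_sizes` lists chunk sizes that divide a dimension exactly, so that the last chunk has the same size as the others. The raster width of 2000 samples is an assumed value for illustration only.

```python
import numpy as np


def chunk_nbytes(chunk_shape, dtype):
    """Uncompressed size in bytes of one chunk with the given shape and dtype."""
    n_elements = int(np.prod(chunk_shape, dtype=np.int64))
    return n_elements * np.dtype(dtype).itemsize


def even_chunk_sizes(length, max_chunk):
    """Chunk sizes <= max_chunk that divide length exactly, largest first."""
    return [c for c in range(min(max_chunk, length), 0, -1) if length % c == 0]


# One chunk of 200000 points with a 17 x 17 complex64 matrix per point.
nbytes = chunk_nbytes((200000, 17, 17), np.complex64)  # 462400000 bytes

# Candidate range chunk sizes for an assumed raster width of 2000 samples.
candidates = even_chunk_sizes(2000, max_chunk=600)[:3]  # [500, 400, 250]
```

Keep in mind that a processing step usually needs room for the input chunk, the output chunk, and intermediate arrays, so the memory actually used per chunk is larger than this estimate.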

## An example

Here we provide a simple example. The API function `decorrelation.pl.emi` implements the `EMI` phase linking method, and `decorrelation.cli.pl.de_emi` is its file-based counterpart.

Import them first:

In [None]:
from decorrelation.pl import emi
from decorrelation.cli.pl import de_emi
from nbdev.showdoc import show_doc # this is just a function to show the document

In [None]:
show_doc(emi)

---

[source](https://github.com/kanglcn/decorrelation/blob/main/decorrelation/pl.py#LNone){target="_blank" style="float:right; font-size:smaller"}

### emi

>      emi (coh:cupy.ndarray, ref:int=0)

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| coh | ndarray |  | complex coherence matrix, dtype cupy.complex |
| ref | int | 0 | index of reference image in the phase history output, optional. Default: 0 |
| **Returns** | **tuple** |  | **estimated phase history `ph`, dtype complex; quality (minimum eigenvalue, dtype float)** |
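For readers who want to see what such an eigendecomposition-based phase linking step looks like, here is a minimal numpy sketch of the EMI idea as it is commonly described in the literature: take the element-wise (Hadamard) product of the inverse of the absolute coherence matrix with the complex coherence matrix, and use the eigenvector of the smallest eigenvalue as the phase history. The function name `emi_sketch` is hypothetical, it runs on the CPU with numpy, and it is not the `decorrelation.pl.emi` implementation, which may differ in details such as regularization and GPU execution.

```python
import numpy as np


def emi_sketch(coh, ref=0):
    """EMI-style phase linking for a batch of coherence matrices.

    coh: complex array of shape (..., n_image, n_image), Hermitian per point.
    Returns (ph, quality): unit-modulus complex phase history of shape
    (..., n_image) referenced to image `ref`, and the smallest eigenvalue.
    """
    coh = np.asarray(coh)
    # Hadamard product of inv(|coh|) and coh; the result is Hermitian.
    m = np.linalg.inv(np.abs(coh)) * coh
    # eigh returns eigenvalues in ascending order, eigenvectors as columns.
    eigval, eigvec = np.linalg.eigh(m)
    v = eigvec[..., :, 0]
    # Remove the arbitrary global phase by referencing to image `ref`.
    ph = v * np.conj(v[..., ref:ref + 1])
    ph = ph / np.abs(ph)
    return ph, eigval[..., 0]


# Synthetic, noise-free test case: known phases and an assumed coherence decay.
rng = np.random.default_rng(0)
n_image = 5
phi = rng.uniform(-np.pi, np.pi, n_image)
idx = np.arange(n_image)
gamma = 0.6 ** np.abs(idx[:, None] - idx[None, :])
coh = gamma * np.exp(1j * (phi[:, None] - phi[None, :]))

ph, quality = emi_sketch(coh[None])  # a batch holding a single point
```

For a noise-free, phase-consistent coherence matrix such as this synthetic one, the smallest eigenvalue is 1 and `ph` reproduces the simulated phases relative to the reference image.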

In [None]:
show_doc(de_emi)

---

[source](https://github.com/kanglcn/decorrelation/blob/main/decorrelation/cli/pl.py#LNone){target="_blank" style="float:right; font-size:smaller"}

### de_emi

>      de_emi (coh:str, ph:str, emi_quality:str, ref:int=0,
>              point_chunk_size:int=None, log:str=None,
>              plot_emi_quality:bool=False, vmin:int=1.0, vmax:int=1.3,
>              ds_idx:str=None, shape:tuple=None, emi_quality_fig:str=None)

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| coh | str |  | coherence matrix |
| ph | str |  | output, wrapped phase |
| emi_quality | str |  | output, pixel quality |
| ref | int | 0 | reference image for phase |
| point_chunk_size | int | None | parallel processing point chunk size |
| log | str | None | log |
| plot_emi_quality | bool | False | if plot the emi quality |
| vmin | int | 1.0 | min value of emi quality to plot |
| vmax | int | 1.3 | max value of emi quality to plot |
| ds_idx | str | None | index of ds |
| shape | tuple | None | shape of one image |
| emi_quality_fig | str | None | path to save the emi quality plot, optional. Default, no saving |

To apply the `emi` API function:

In [None]:
import zarr
import numpy as np
import cupy as cp

In [None]:
coh_zarr = zarr.open('./software_architecture/ds_can_coh.zarr/','r') 

In [None]:
coh_zarr,coh_zarr.shape,coh_zarr.chunks,coh_zarr.dtype

(<zarr.core.Array (740397, 17, 17) complex64 read-only>,
 (740397, 17, 17),
 (200000, 17, 17),
 dtype('complex64'))

This is the coherence matrix for 740397 selected DS candidates, and there are 17 SAR images.
So the coherence matrix for one pixel is 17 $\times$ 17.
The coherence matrix is stored in 4 chunks, and each chunk stores data for 200000 DS candidates
(the last chunk has only 140397 DS candidates).
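The chunk count and the size of the last chunk follow from simple arithmetic on the array shape and chunk size printed above. A small sketch, with a hypothetical helper `chunk_layout` that is not part of Decorrelation:

```python
import math


def chunk_layout(n_points, chunk_size):
    """Sizes of the consecutive chunks that cover n_points items."""
    n_chunks = math.ceil(n_points / chunk_size)
    return tuple(
        min(chunk_size, n_points - i * chunk_size) for i in range(n_chunks)
    )


layout = chunk_layout(740397, 200000)
# (200000, 200000, 200000, 140397)
```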

In [None]:
!ls -al ./software_architecture/ds_can_coh.zarr/ #It is a directory indeed!

total 1570400
drwxrwxr-x 2 kangl kangl      4096 Sep 28 12:30 .
drwxrwxr-x 5 kangl kangl      4096 Oct  4 12:15 ..
-rw-rw-r-- 1 kangl kangl 434775676 Sep 28 12:30 0.0.0
-rw-rw-r-- 1 kangl kangl 432578417 Sep 28 12:30 1.0.0
-rw-rw-r-- 1 kangl kangl 434846911 Sep 28 12:30 2.0.0
-rw-rw-r-- 1 kangl kangl 305857416 Sep 28 12:30 3.0.0
-rw-rw-r-- 1 kangl kangl       398 Sep 28 12:30 .zarray


In [None]:
coh = coh_zarr[:] # read as numpy array

In [None]:
coh = cp.asarray(coh) # convert to cupy array

In [None]:
coh.shape

(740397, 17, 17)

In [None]:
%%time
# The processing is really fast!
ph,emi_quality = emi(coh)

CPU times: user 1.18 s, sys: 447 ms, total: 1.62 s
Wall time: 1.69 s


Now we apply the CLI function:

In [None]:
%%time
de_emi('./software_architecture/ds_can_coh.zarr/',
       './software_architecture/ds_can_ph.zarr',
       './software_architecture/ds_can_emi_quality.zarr',
       point_chunk_size = 200000)

2023-10-08 23:30:17 - de_emi - INFO - fetching args:
2023-10-08 23:30:17 - de_emi - INFO - coh = './software_architecture/ds_can_coh.zarr/'
2023-10-08 23:30:17 - de_emi - INFO - ph = './software_architecture/ds_can_ph.zarr'
2023-10-08 23:30:17 - de_emi - INFO - emi_quality = './software_architecture/ds_can_emi_quality.zarr'
2023-10-08 23:30:17 - de_emi - INFO - ref = 0
2023-10-08 23:30:17 - de_emi - INFO - point_chunk_size = 200000
2023-10-08 23:30:17 - de_emi - INFO - log = None
2023-10-08 23:30:17 - de_emi - INFO - plot_emi_quality = False
2023-10-08 23:30:17 - de_emi - INFO - vmin = 1.0
2023-10-08 23:30:17 - de_emi - INFO - vmax = 1.3
2023-10-08 23:30:17 - de_emi - INFO - ds_idx = None
2023-10-08 23:30:17 - de_emi - INFO - shape = None
2023-10-08 23:30:17 - de_emi - INFO - emi_quality_fig = None
2023-10-08 23:30:17 - de_emi - INFO - fetching args done.
2023-10-08 23:30:17 - de_emi - INFO - coh dataset shape: (740397, 17, 17)
2023-10-08 23:30:17 - de_emi - INFO - coh dataset chunks: 

2023-10-08 23:30:19,866 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2023-10-08 23:30:19,866 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2023-10-08 23:30:19,869 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2023-10-08 23:30:19,869 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2023-10-08 23:30:19,872 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2023-10-08 23:30:19,872 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2023-10-08 23:30:19,873 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2023-10-08 23:30:19,873 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2023-10-08 23:30:19,880 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2023-10-08 23:30:19,880 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2023-

2023-10-08 23:30:23 - de_emi - INFO - dask local CUDA cluster started.
2023-10-08 23:30:23 - de_emi - INFO - coh dask array shape: (740397, 17, 17)
2023-10-08 23:30:23 - de_emi - INFO - coh dask array chunks: ((200000, 200000, 200000, 140397), (17,), (17,))
2023-10-08 23:30:23 - de_emi - INFO - phase linking with EMI.
2023-10-08 23:30:23 - de_emi - INFO - got ph and emi_quality.
2023-10-08 23:30:23 - de_emi - INFO - ph dask array shape: (740397, 17)
2023-10-08 23:30:23 - de_emi - INFO - ph dask array chunks: ((200000, 200000, 200000, 140397), (17,))
2023-10-08 23:30:23 - de_emi - INFO - emi_quality dask array shape: (740397,)
2023-10-08 23:30:23 - de_emi - INFO - emi_quality dask array chunks: ((200000, 200000, 200000, 140397),)
2023-10-08 23:30:23 - de_emi - INFO - saving ph and emi_quality.
2023-10-08 23:30:23 - de_emi - INFO - computing graph setted. doing all the computing.
2023-10-08 23:30:26 - de_emi - INFO - computing finished.
2023-10-08 23:30:28 - de_emi - INFO - dask cluster 

The CLI function is slower than the API function since it needs to read and write the data and set up the dask CUDA cluster.

Notice that there is a `point_chunk_size` option in the CLI function. It means the data are processed separately in chunks, each holding `point_chunk_size` pixels. By default, this number is set to the chunk size of the input zarr data.

There are more options in the CLI function, e.g., to save the printed information to a log file or to plot some results.

As mentioned, the CLI function is also available as a terminal command, but the command won't show any plot on screen since that is not supported in the terminal.

In [None]:
!de_emi -h

usage: de_emi [-h] [--ref REF] [--point_chunk_size POINT_CHUNK_SIZE] [--log LOG]
              [--plot_emi_quality] [--vmin VMIN] [--vmax VMAX] [--ds_idx DS_IDX]
              [--shape SHAPE] [--emi_quality_fig EMI_QUALITY_FIG]
              coh ph emi_quality

positional arguments:
  coh                                  coherence matrix
  ph                                   output, wrapped phase
  emi_quality                          output, pixel quality

options:
  -h, --help                           show this help message and exit
  --ref REF                            reference image for phase (default: 0)
  --point_chunk_size POINT_CHUNK_SIZE  parallel processing point chunk size
  --log LOG                            log
  --plot_emi_quality                   if plot the emi quality (default: False)
  --vmin VMIN                          min value of emi quality to plot
                                       (default: 1.0)
  --vmax VMAX                          max value of em

The CLI also includes functions for simple data manipulation (e.g., array slicing and point cloud merging). As these operations are very easy to do with numpy/cupy arrays, these CLI functions do not have corresponding API functions.