# Data Manipulation Tools

In this tutorial, we'll use several Python libraries to work with geospatial data. There are numerous ways to work with such data, so choosing tooling can be difficult. The principal library we'll be using is [*Xarray*](https://docs.xarray.dev/en/stable/index.html) for its `DataArray` and `Dataset` data structures and associated utilities, along with [NumPy](https://numpy.org/) and [Pandas](https://pandas.pydata.org/) for manipulating homogeneous numerical arrays and tabular data respectively. We'll also make some use of [Rasterio](https://rasterio.readthedocs.io/en/stable) as a tool for reading data or metadata from GeoTIFF files; judicious use of Rasterio can make a big difference when working with remote files in the cloud.

In [None]:
from warnings import filterwarnings
filterwarnings('ignore')

from pathlib import Path
import xarray as xr
import numpy as np
import pandas as pd
import rioxarray as rio
import rasterio

## [rioxarray](https://corteva.github.io/rioxarray/html/index.html)

+ `rioxarray` is a package to *extend* Xarray
+ Primary use within this tutorial:
  + `rioxarray.open_rasterio` enables loading GeoTIFF files directly into Xarray `DataArray` structures
  + `xarray.DataArray.rio` extension provides useful utilities (e.g., for specifying CRS information)

Observe first that `open_rasterio` works on local file paths and remote URLs.
+ Predictably, local access is faster than remote access.

In [None]:
%%time
LOCAL_PATH = Path('..') / 'assets' / 'OPERA_L3_DIST-ALERT-HLS_T10TEM_20220815T185931Z_20220817T153514Z_S2A_30_v0.1_VEG-ANOM-MAX.tif'
da = rio.open_rasterio(LOCAL_PATH)

In [None]:
%%time
REMOTE_URL ='https://opera-provisional-products.s3.us-west-2.amazonaws.com/DIST/DIST_HLS/WG/DIST-ALERT/McKinney_Wildfire/OPERA_L3_DIST-ALERT-HLS_T10TEM_20220815T185931Z_20220817T153514Z_S2A_30_v0.1/OPERA_L3_DIST-ALERT-HLS_T10TEM_20220815T185931Z_20220817T153514Z_S2A_30_v0.1_VEG-ANOM-MAX.tif'
da_remote = rio.open_rasterio(REMOTE_URL)

In [None]:
(da_remote == da).all() # Verify that the data is identical from both sources

## [Xarray](https://docs.xarray.dev/en/stable/index.html)

Let's examine the data structure loaded from the file `LOCAL_PATH`.

Observe, in this notebook, the `repr` for an Xarray `DataArray` can be interactively examined.

In [None]:
print(f'{type(da)=}\n')
da

Of course, it is also possible to access various `DataArray` attributes programmatically.

In [None]:
print(da.coords)

We can extract the coordinates as one-dimensional (homogeneous) NumPy arrays.

In [None]:
print(da.coords['x'].values)

The dimensions `da.dims` are the strings/labels associated with the `DataArray` axes.

In [None]:
da.dims

`da.attrs` is a dictionary containing other metadata parsed from the GeoTIFF tags.

In [None]:
da.attrs

As mentioned, `rioxarray` extends the class `xarray.DataArray` with the `rio` accessor. For our purposes, an important attribute retrievable from this namespace is `da.rio.crs`, the coordinate reference system associated with this raster dataset.

In [None]:
da.rio.crs

Given that this data is stored using a particular UTM CRS, let's relabel the coordinates to reflect this; that is, the coordinate labelled `x` would conventionally be called `easting` and the coordinate called `y` would be called `northing`.

In [None]:
da = da.rename({'x':'easting', 'y':'northing', 'band':'band'})

Recall that Xarray permits selection by coordinate values or by the corresponding integer positions using the `sel` and `isel` methods respectively.

In [None]:
da.isel(easting=slice(0,2))

In [None]:
da.sel(easting=[499995, 500025])
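
To make the difference between the two selection styles concrete, here is a minimal sketch on a small synthetic `DataArray`. The array `toy` and its coordinate values are made up for illustration (they are assumptions, not values read from the GeoTIFF). The `method="nearest"` option of `sel` is handy when a requested coordinate does not fall exactly on a pixel centre.

```python
import numpy as np
import xarray as xr

# Synthetic 1-band raster: 3 rows by 4 columns on a 30 m grid.
# The coordinate values are hypothetical pixel centres, not read from the file.
easting = 499995.0 + 30.0 * np.arange(4)      # increases left to right
northing = 4700025.0 - 30.0 * np.arange(3)    # decreases top to bottom
toy = xr.DataArray(
    np.arange(12, dtype="uint8").reshape(1, 3, 4),
    dims=("band", "northing", "easting"),
    coords={"band": [1], "northing": northing, "easting": easting},
)

# Position-based: the first two columns, whatever their coordinate labels are.
by_position = toy.isel(easting=slice(0, 2))

# Label-based: the same two columns, named by their coordinate values.
by_label = toy.sel(easting=[499995.0, 500025.0])

# Label-based with nearest-neighbour lookup: the column whose centre is
# closest to the requested easting.
nearest = toy.sel(easting=500000.0, method="nearest")

print(by_position.equals(by_label))
print(float(nearest.easting))
```

Selecting a single label (as in the `nearest` case) drops that dimension, whereas selecting with a list or slice keeps it.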

If we take a 2D slice from this 3D `DataArray`, we can plot it using the `.plot` accessor (more on this later).

In [None]:
da.isel(band=0).plot();

Finally, recall that a `DataArray` is a wrapper around a NumPy array. That NumPy array can be retrieved using the `.values` attribute.

In [None]:
array = da.values
print(f'{type(array)=}')
print(f'{array.shape=}')
print(f'{array.dtype=}')
print(f'{array.nbytes=}')

This raster data is stored as 8-bit unsigned integer data, so one byte for each pixel. A single unsigned 8-bit integer can represent integer values between 0 and 255. In an array with a bit more than thirteen million elements, that means there are many repeated values.

In [None]:
s_flat = pd.Series(array.flatten()).value_counts()
s_flat

Most of the entries in this raster array are zero. The numerical values vary between 0 and 100 with the exception of some 1709 pixels with the value 255. This will make more sense when we discuss the DIST data product specification.
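
The same kind of tally can be sketched with NumPy alone. The array below is a synthetic stand-in whose proportions of zeros, 1 to 100 values, and 255s are invented for illustration; it only mimics the general shape of the distribution described above.

```python
import numpy as np

# Synthetic stand-in for the raster: mostly zeros, some values in 1..100,
# and a few 255s. The proportions are made up for illustration.
rng = np.random.default_rng(seed=0)
toy = np.zeros((100, 100), dtype=np.uint8)
toy[:10, :] = rng.integers(1, 101, size=(10, 100), dtype=np.uint8)
toy[-1, :5] = 255

# One byte per pixel, so the memory footprint equals the number of elements.
print(toy.nbytes == toy.size)

# Count occurrences of each distinct value, most frequent first.
values, counts = np.unique(toy, return_counts=True)
order = np.argsort(counts)[::-1]
for v, c in zip(values[order][:3], counts[order][:3]):
    print(f"value {v:3d}: {c} pixels")
```

`np.unique(..., return_counts=True)` returns the distinct values in sorted order, while `pd.Series.value_counts` sorts by frequency; the explicit `argsort` above bridges that difference.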

## [rasterio](https://rasterio.readthedocs.io/en/stable)

Having reviewed some features of Xarray (and of its extension `rioxarray`), let's briefly look at `rasterio` as a means of exploring GeoTIFF files.

From the [Rasterio documentation](https://rasterio.readthedocs.io/en/stable):

> Before Rasterio there was one Python option for accessing the many different kind of raster data files used in the GIS field: the Python bindings distributed with the [Geospatial Data Abstraction Library, GDAL](http://gdal.org/). These bindings extend Python, but provide little abstraction for GDAL’s C API. This means that Python programs using them tend to read and run like C programs. For example, GDAL’s Python bindings require users to watch out for dangling C pointers, potential crashers of programs. This is bad: among other considerations we’ve chosen Python instead of C to avoid problems with pointers.
>
>What would it be like to have a geospatial data abstraction in the Python standard library? One that used modern Python language features and idioms? One that freed users from concern about dangling pointers and other C programming pitfalls? Rasterio’s goal is to be this kind of raster data library – expressing GDAL’s data model using fewer non-idiomatic extension classes and more idiomatic Python types and protocols, while performing as fast as GDAL’s Python bindings.
>
>High performance, lower cognitive load, cleaner and more transparent code. This is what Rasterio is about.

In [None]:
# Path to the local GeoTIFF used in the rasterio examples below
import rasterio
from pathlib import Path
LOCAL_PATH = Path('..') / 'assets' / \
             'OPERA_L3_DIST-ALERT-HLS_T10TEM_20220815T185931Z_20220817T153514Z_S2A_30_v0.1_VEG-ANOM-MAX.tif'
print(LOCAL_PATH)

Given a data source (e.g., a GeoTIFF file in the current context), we can open a `DatasetReader` object associated with it using `rasterio.open`. Technically, we have to remember to close the object afterward. That is, our code would look like this:

```python
ds = rasterio.open(LOCAL_PATH)
...
# do some computation
...
ds.close()
```

As with file-handling in Python, we can use a *context manager* (i.e., a `with` clause) instead.
```python
with rasterio.open(LOCAL_PATH) as ds:
  ...
  # do some computation
  ...

# more code outside the scope of the with block.
```
The dataset is closed automatically when the `with` block exits.

In [None]:
with rasterio.open(LOCAL_PATH) as ds:
    print(f'{type(ds)=}')
    assert not ds.closed

# outside the scope of the with block
assert ds.closed

The `DatasetReader` class has a number of attributes and methods of interest to us:

 |  | | |
 |--|--|--|
 |`profile`|`height`|`width` |
 |`shape` |`count`|`nodata`|
 |`crs`|`transform`|`bounds`|
 |`xy`|`index`|`read` |

First, given a `DatasetReader` `ds` associated with a data source, examining `ds.profile` returns some diagnostic information.

In [None]:
with rasterio.open(LOCAL_PATH) as ds:
    print(f'{ds.profile=}')

The attributes `ds.height`, `ds.width`, `ds.count`, `ds.nodata`, `ds.crs`, and `ds.transform` all appear in the output from `ds.profile` but are also accessible individually; `ds.shape` is simply the pair `(ds.height, ds.width)`.

In [None]:
with rasterio.open(LOCAL_PATH) as ds:
    print(f'{ds.height=}')
    print(f'{ds.width=}')
    print(f'{ds.shape=}')
    print(f'{ds.count=}')
    print(f'{ds.nodata=}')
    print(f'{ds.crs=}')
    print(f'{ds.transform=}')

Part of the motivation for using `rioxarray` is to simplify the loading of GeoTIFF files into Xarray structures while taking care of these coordinate mappings for us. If needed, `rasterio` provides access to the details.

![](http://ioam.github.io/topographica/_images/matrix_coords.png)
![](http://ioam.github.io/topographica/_images/sheet_coords_-0.2_0.4.png)
(from `http://ioam.github.io/topographica`)

Notice that the absolute values of the diagonal entries of the matrix `ds.transform` give the spatial dimensions of pixels.

In [None]:
with rasterio.open(LOCAL_PATH) as ds:
    print(f'{ds.transform=}')
    print(f'{np.abs(ds.transform[0])=}')
    print(f'{np.abs(ds.transform[4])=}')

In [None]:
with rasterio.open(LOCAL_PATH) as ds:
    print(f'{ds.transform * (0,0)=}')       # top-left corner
    print(f'{ds.transform * (0,3660)=}')    # bottom-left corner
    print(f'{ds.transform * (3660,0)=}')    # top-right corner
    print(f'{ds.transform * (3660,3660)=}') # bottom-right corner
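
As a rough sketch of what the multiplication above computes, an affine transform can be written as a 3×3 matrix acting on the vector `(col, row, 1)`. The coefficients below (a 30 m pixel size and a top-left corner at `(499980, 4700040)`) are hypothetical placeholders chosen for illustration, not values read from the file, and `pixel_to_world` is a helper name invented here.

```python
import numpy as np

# Hypothetical north-up transform: 30 m pixels, top-left corner at (c, f).
# These numbers are placeholders, not values read from the GeoTIFF.
a, b, c = 30.0, 0.0, 499980.0      # x = a*col + b*row + c
d, e, f = 0.0, -30.0, 4700040.0    # y = d*col + e*row + f
T = np.array([[a, b, c],
              [d, e, f],
              [0.0, 0.0, 1.0]])

def pixel_to_world(col, row):
    """Map fractional (col, row) pixel coordinates to (x, y)."""
    x, y, _ = T @ np.array([col, row, 1.0])
    return float(x), float(y)

width = height = 3660
left, top = pixel_to_world(0, 0)                # outer top-left corner
right, bottom = pixel_to_world(width, height)   # outer bottom-right corner
print((left, bottom, right, top))               # same ordering as ds.bounds
```

Because `e` is negative, `y` decreases as the row index grows, which is why the top row of the array corresponds to the largest northing.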

In [None]:
with rasterio.open(LOCAL_PATH) as ds:
    print(f'coordinate bounds: {ds.bounds=}')

The method `ds.xy` converts integer index coordinates `(row, col)` to continuous coordinates `(x, y)`. Notice that `ds.xy` maps integers to the centre of pixels.

In [None]:
with rasterio.open(LOCAL_PATH) as ds:
    for k in range(3):
        for l in range(4):
            print(f'({k:2d},{l:2d})','\t', end='')
        print()
    print()
    for k in range(3):
        for l in range(4):
            e,n = ds.xy(k,l)
            print(f'({e},{n})','\t', end='')
        print()
    print()
    print(ds.bounds)
    print(ds.transform)
    print(ds.transform * (0.5,0.5))
    print(ds.transform * (0,0))
    print(ds.transform[0])

`ds.index` does the reverse: given spatial coordinates `(x,y)`, it returns the integer indices `(row, col)` of the pixel that contains that point.

In [None]:
with rasterio.open(LOCAL_PATH) as ds:
    print(ds.index(500000, 4700015))
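
The arithmetic behind these two methods can be sketched in plain Python for the simple case of a north-up grid with square pixels and no rotation. The functions `xy` and `index` below are illustrative stand-ins that mirror the argument order of the Rasterio methods; the origin and resolution are hypothetical values, not read from the file.

```python
import math

# Hypothetical north-up grid (no rotation): 30 m pixels, top-left corner
# at (X0, Y0). Placeholder values, not read from the GeoTIFF.
X0, Y0, RES = 499980.0, 4700040.0, 30.0

def xy(row, col):
    """Centre of pixel (row, col): offset the indices by half a pixel."""
    return X0 + (col + 0.5) * RES, Y0 - (row + 0.5) * RES

def index(x, y):
    """(row, col) of the pixel containing the point (x, y)."""
    return math.floor((Y0 - y) / RES), math.floor((x - X0) / RES)

print(xy(0, 0))                     # centre of the top-left pixel
print(index(*xy(2, 3)))             # round trip through a pixel centre
print(index(500000.0, 4700015.0))   # a point that is not a pixel centre
```

Note that the round trip is only exact in one direction: every point inside a pixel maps to the same indices, so `xy(*index(x, y))` returns the pixel centre rather than the original point.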

Finally, the method `ds.read` loads an array from the data file into memory. Notice this can be done on local or remote files.

In [None]:
%%time
with rasterio.open(LOCAL_PATH) as ds:
    array = ds.read()
    print(f'{array.shape=}')

In [None]:
%%time
with rasterio.open(REMOTE_URL) as ds:
    array = ds.read()
    print(f'{array.shape=}')

These tools are used in some of the case studies that follow.