# Raster Data Analysis in Python

*2 hours*

---

## The Geospatial Data Abstraction Library (GDAL)

We're finally ready to start working with spatial file formats in Python!

**Today we'll introduce the Geospatial Data Abstraction Library (GDAL).** GDAL is a collection of tools for working with geospatial raster data. It is often referenced as part of "GDAL/OGR" where "OGR" is a related set of tools for working with vector data; OGR originally stood for OpenGIS Simple Features Reference Implementation, but is always called just "OGR."

GDAL is a library written in C and C++, but it also provides Python bindings. When we import it in Python we write:

In [None]:
from osgeo import gdal

`osgeo` refers to the Open Source Geospatial Foundation (OSGeo).

When we use the `gdal` library, we're most often going to work with a `gdal.Dataset` object. This is the type of object that `gdal` uses to represent raster datasets. For this example, we'll use data collected as part of the National Ecological Observatory Network (NEON), at a Long-Term Ecological Research (LTER) site. [You can read more about these data here.](https://data.neonscience.org/data-products/DP3.30011.001)

**We create a new `gdal.Dataset` by calling `gdal.Open()` on a raster file path.**

In [None]:
ds = gdal.Open('http://files.ntsg.umt.edu/data/GIS_Programming/data/NEON_albedo.tif')
ds

The attributes of a `gdal.Dataset` are named a little differently...

```py
dir(ds)
```

They generally appear in what is called *camel-case,* because the alternating capital letters form "humps" in the attribute names.

In [None]:
ds.RasterCount

When `RasterCount` is 1, we know we're working with a single-band image.

In [None]:
(ds.RasterXSize, ds.RasterYSize)

**What distinguishes a `gdal.Dataset`, of course, is that it represents a spatial dataset. What is the spatial reference system (SRS) of this dataset?**

In [None]:
ds.GetProjection()

Wow, that's a lot! This can be pretty hard to read the way it is currently displayed.

Below, we'll extract the `SpatialReference` object, which is part of a different Python module in the `osgeo` package called `osr`.

In [None]:
srs = ds.GetSpatialRef()
srs

In [None]:
print(srs.ExportToPrettyWkt())

**This SRS or projection definition is in a format called Well-Known Text (WKT).** It defines many of the projection's parameters in terms of EPSG codes, which you may recall using before in a Desktop GIS program.

Here, we see that this raster data file is in the Universal Transverse Mercator (UTM) projection, using the WGS84 datum.

### Defining the Spatial Reference System in Python

**To completely describe a raster dataset in Python, we need two pieces of information:**

1. The spatial reference system (SRS) or *projection* of the data; this describes how the flat image corresponds to the non-flat Earth.
2. The *affine transformation* or **geotransform,** which describes how the raster's rows and columns line up with the geospatial coordinates, e.g., latitude and longitude.

So, if we're using GDAL, we'll keep the SRS information around in its WKT format:

In [None]:
wkt = ds.GetProjection()

**The second piece of information, the geotransform, is obtained:**

In [None]:
gt = ds.GetGeoTransform()
gt

**What does this mean?** The GeoTransform always consists of 6 numbers:

```py
(x_min, pixel_width, row_rotation, y_max, col_rotation, -pixel_height)
```

1. `x_min` is the minimum X coordinate, for example, the minimum or west-most Longitude.
2. `pixel_width` is the width of a raster pixel in the units used by the SRS, usually meters or degrees.
3. `row_rotation` describes how rows are oriented on the map; for North-up maps this is always zero.
4. `y_max` is the maximum Y coordinate, for example, the maximum or north-most Latitude.
5. `col_rotation` is similar to `row_rotation` and is always zero for North-up maps.
6. `-pixel_height` is the *negative* height of a raster pixel, in the units used by the SRS.


[You can read more about the GeoTransform here.](https://gdal.org/tutorials/geotransforms_tut.html)
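To build some intuition for these six numbers, here is a minimal sketch of the arithmetic they encode, written in plain Python. The function name `pixel_to_map` and the values in `example_gt` are hypothetical, made up for illustration; they are not read from the NEON file.

```python
def pixel_to_map(gt, col, row):
    """Convert image coordinates (col, row) to spatial coordinates (x, y)."""
    # gt[2] and gt[4] are the rotation terms; they are zero for north-up images
    x = gt[0] + col * gt[1] + row * gt[2]
    y = gt[3] + col * gt[4] + row * gt[5]
    return (x, y)

# Hypothetical north-up geotransform for a 1-meter image (made-up values)
example_gt = (470000.0, 1.0, 0.0, 7228000.0, 0.0, -1.0)

top_left = pixel_to_map(example_gt, 0, 0)
bottom_right = pixel_to_map(example_gt, 1000, 1000)
print(top_left, bottom_right)
```

Because both rotation terms are zero in this example, X depends only on the column and Y depends only on the row.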

An **affine transformation** is a kind of transformation that preserves lines and parallelism; in GIS, it's a way of mapping one 2D grid onto a different gridded coordinate system.

![](./assets/affine-transformation.jpg)

*Image from [GeeksForGeeks.org](https://www.geeksforgeeks.org/python-opencv-affine-traansformation/)*

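For example, an affine transformation can approximate the change in view between a billboard seen at an angle and the original image. (Strictly speaking, a true change in perspective is a *projective* transformation, which does not have to preserve parallelism.)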

![](./assets/affine-transformation2.png)

*Image from [Felipe Meganha](https://felipemeganha.medium.com/perspective-transformation-with-kornia-8bf86718adfd)*

Let's build some intuition about raster data arrays in Python. When we work with raster data in Python, we have to keep in mind that there are two coordinate systems: 

1. The spatial coordinate system (SRS), which describes where something is on our representation of the Earth;
2. The image coordinate system, which describes the location of a value within the array's data structure.

The image coordinate system, for a single-band raster image, consists of rows and columns. **One important thing to remember is that row indices *increase* from top to bottom in the image coordinate system, whereas spatial coordinates (e.g., latitude) *decrease* in that same direction.**

![](./assets/coordinate-system-diagram.png)

We'll use the `affine` module to translate between row-column coordinates and spatial coordinates.

In [None]:
from affine import Affine

transform = Affine.from_gdal(*gt)

The notation `*gt` tells Python to take each element of the `gt` tuple and provide it as an argument, in order, to the `from_gdal()` function. The resulting `Affine` object has a slightly different internal representation of these arguments...

In [None]:
transform

What are the spatial coordinates of the top-left corner of the image?

In [None]:
transform * (0, 0)

Obviously, these numbers come straight from our GeoTransform: the left-most (west-most) coordinate and the top-most (north-most) coordinate.

What are the spatial coordinates of the bottom-right corner of the image? Note that `Affine` expects image coordinates in (column, row) order.

In [None]:
transform * (1000, 1000)

This is a 1-meter image and the UTM projection has very little distortion for such small areas, so we can see that we would get the same result if we added 1000 meters (equivalent to 1000 pixels when we have 1-meter resolution) to the upper-left corner coordinate.

Note that we technically *subtract* 1000 meters from the Y coordinate (the northing) because we are moving from the *top-left* corner to the *bottom-right* corner.

In [None]:
(gt[0] + 1000, gt[3] - 1000)

Again, spatial coordinate systems increase to the top and to the right; but the image coordinate system increases to the *bottom* and the right.

If we ever wanted to find the image coordinates corresponding to a given pair of spatial coordinates, we could write:

In [None]:
~transform * (471000, 7227000)

And, sure enough, these coordinates correspond to the last row and the last column of the image (bottom-right corner). [You can read more about the `affine` library here.](https://pypi.org/project/affine/)
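The `~` operator above inverts the affine transformation. As a rough sketch of what that inversion involves, the function below solves the geotransform equations for the column and row. The name `map_to_pixel` and the geotransform values are hypothetical, chosen only for illustration.

```python
def map_to_pixel(gt, x, y):
    """Convert spatial coordinates (x, y) to image coordinates (col, row)."""
    # Determinant of the 2x2 matrix [[gt[1], gt[2]], [gt[4], gt[5]]]
    det = gt[1] * gt[5] - gt[2] * gt[4]
    if det == 0:
        raise ValueError('This geotransform cannot be inverted')
    dx = x - gt[0]
    dy = y - gt[3]
    col = (gt[5] * dx - gt[2] * dy) / det
    row = (-gt[4] * dx + gt[1] * dy) / det
    return (col, row)

# Hypothetical north-up geotransform for a 1-meter image (made-up values)
example_gt = (470000.0, 1.0, 0.0, 7228000.0, 0.0, -1.0)

col_row = map_to_pixel(example_gt, 471000.0, 7227000.0)
print(col_row)
```

The result is a pair of floating-point numbers; to index an array, we would still need to convert them to integers and swap them into (row, column) order.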

**If we need to look up the WKT for a given coordinate reference system (CRS) or SRS, [we can use the `pyproj` library.](https://pyproj4.github.io/pyproj/stable/api/crs/crs.html)**

In [None]:
from pyproj import CRS

srs = CRS.from_user_input(4326)
srs.to_wkt()

Where `4326` is the European Petroleum Survey Group (EPSG) code for the WGS84 Geographic Coordinate System. [EPSG.io](https://epsg.io/) is a great website for getting detailed information on any CRS/SRS.

So, with `wkt` and `gt`, we have all the information needed to display the raster's rows and columns correctly on a map. But what about the raster data itself? How can we actually start working with the raster data values?

---

## Working with Raster Data in Python

Raster data in Python can be represented by NumPy arrays.

In [None]:
arr = ds.ReadAsArray()
arr.shape

As expected, the raster has 1000 rows and 1000 columns, just as the `gdal.Dataset` reported.

In [None]:
from matplotlib import pyplot
pyplot.imshow(arr)

This image shows part of a fire scar in central Alaska. The data values correspond to the surface *albedo,* which describes the fraction of incoming sunlight that is reflected by the Earth's surface. You can see that the fire scar has a much lower albedo, and appears darker, than the unburned areas at the bottom-left and top-right of the image. Two roads cutting through the area also appear brighter.

**There is one important thing to note when we're working with raster data as `numpy` arrays:**

- The rows of an array increase from top to bottom. This is different from most spatial coordinates, like latitude.
- The columns of an array increase from left to right, which is similar to spatial coordinates like longitude.

Therefore, the top-left value of an array is the north-west corner of our image.

In [None]:
arr[0,0]

Because these are albedo data, the values fall between zero and one. We can ask `numpy` for the percentiles of the data, e.g., the 0th percentile (minimum), 50th percentile (median), and 100th percentile (maximum).

In [None]:
import numpy as np

np.percentile(arr, (0, 50, 100))

Because the raster data are a `numpy` array, we can operate on them as if they contain any other kind of data. In addition, we usually don't need to think at all about the spatial coordinate system when we're working with the data.

In [None]:
# Applying a stretch to the data
lower, upper = np.percentile(arr, (2, 98))

stretch = arr.copy()
stretch[stretch < lower] = lower
stretch[stretch > upper] = upper
pyplot.imshow(stretch)
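The same stretch can probably be written more compactly with `np.clip()`. The sketch below compares the two approaches on a small random array, `fake`, which is a made-up stand-in for the albedo data.

```python
import numpy as np

# Made-up stand-in for the albedo raster: random values between 0 and 1
generator = np.random.default_rng(0)
fake = generator.random((50, 50))

lower, upper = np.percentile(fake, (2, 98))

# Approach 1: boolean indexing, as in the cell above
manual = fake.copy()
manual[manual < lower] = lower
manual[manual > upper] = upper

# Approach 2: np.clip() limits the values to the range [lower, upper]
stretched = np.clip(fake, lower, upper)

print(np.array_equal(manual, stretched))
```

Both approaches leave the original array untouched, because `np.clip()` returns a new array by default.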

---

## Multispectral Raster Data and Spectral Indices

Multispectral or multi-band imagery can be read in the same way as single-band imagery. [Here, we'll use an ortho-image of the same area as our albedo data, also from NEON.](https://data.neonscience.org/data-products/DP3.30010.001/RELEASE-2022)

In [None]:
ds = gdal.Open('http://files.ntsg.umt.edu/data/GIS_Programming/data/NEON_ortho.tif')

If we want to plot the data, however, it's a little different.

In [None]:
rgb = ds.ReadAsArray()
rgb.shape

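`pyplot.imshow()` treats a 2D array as a single-band image, using the first and second axes as the row and column coordinates and mapping each value of the array to a color. GDAL returns multi-band data with the band axis first, i.e., `(bands, rows, columns)`, so here we plot each band separately.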

In [None]:
fig = pyplot.figure(figsize = (10, 8))
fig.subplots_adjust(wspace = 0.2) # More horizontal space between plots
for band in range(0, ds.RasterCount):
    # add_subplot(nrows, ncols, index, ...); index must be non-zero so we add 1
    ax = fig.add_subplot(1, 3, 1 + band, title = f'Band {band+1}')
    ax.imshow(rgb[band])
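If we want a single true-color picture instead of three panels, one option is to move the band axis to the end, because `imshow()` also accepts arrays shaped `(rows, columns, 3)`. The sketch below uses a small made-up array, `fake_rgb`, in place of the NEON image.

```python
import numpy as np

# Made-up 3-band image with GDAL's (bands, rows, columns) layout
fake_rgb = np.arange(3 * 4 * 5, dtype=np.uint8).reshape((3, 4, 5))

# Move axis 0 (bands) to the last position: (rows, columns, bands)
bands_last = np.moveaxis(fake_rgb, 0, -1)
print(bands_last.shape)
```

For display, `imshow()` expects RGB values as integers from 0 to 255 or as floating-point numbers from 0 to 1, so real data may need to be rescaled first.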

---

### Challenge: Python Raster Calculator

The Normalized Difference Greenness Index (Escadafal & Huete 1991) or NDGI is a variation on the Normalized Difference Vegetation Index (NDVI). It can be calculated when only visible bands (e.g., red, green, blue) are available:

$$
\text{NDGI} = \frac{G - R}{G + R}
$$

Where $R$ is the Red band value and $G$ is the Green band value.

**Calculate the NDGI using this multi-band raster and then plot the resulting image.** The bands of this NEON dataset are, in order: Red, Green, Blue.

**NOTE:** Because the above calculation involves a fraction and will return floating-point data, we should first convert our array to a floating-point data type:

In [None]:
rgb = rgb.astype(np.float32)

---

## Break

*10 minute break for learners.*

---

## Applying Functions to Axes

Let's switch to a different raster dataset, one that has multiple time periods, where each band in the image is a different time period.

We'll continue working with the NOAA NCEP data we saw earlier. In this version, I subset the data to part of Africa (as the file name below indicates) and calculated an annual average.

In [None]:
import requests
import numpy as np

content = requests.get('http://files.ntsg.umt.edu/data/GIS_Programming/data/NOAA_NCEP_CPC_gridded_deg0p5_1948-2022_Africa_74x149x143.float32')
data = np.frombuffer(content.content, dtype = np.float32)\
    .reshape((74, 149, 143))

As a reminder, we can plot the first year (1948) of data by typing:

In [None]:
from matplotlib import pyplot

pyplot.imshow(data[0])
pyplot.colorbar()

In our lab this week, we calculated the maximum surface temperature at each pixel, over time, by typing:

In [None]:
max_temp = data.max(axis = 0)
pyplot.imshow(max_temp)

Recall that `numpy` arrays have methods like `data.min()` and `data.mean()`, as well, which take axis arguments.

**But what if we wanted to calculate something other than the minimum, maximum, mean, or median in surface temperatures over this 74-year period?** There's no obvious way to do this with the tools we already have. We need a way of applying a custom function to an array.

For situations like this, we can use `numpy.apply_along_axis()`.

In [None]:
max_temp2 = np.apply_along_axis(max, 0, data)
pyplot.imshow(max_temp2)

The arguments to `numpy.apply_along_axis()` are, in order:

- The function you want to apply
- The axis you want to apply the function over
- The array

For the second argument, it's helpful to remember how the `axis` argument works, from this diagram:

![](./assets/numpy-axis.jpg)

**When we write `np.apply_along_axis(max, 0, data)`, we are saying we want the function `max()` to be applied to slices along the 0th axis.** 

That is, every time `max()` is called it receives a slice along the 0th axis which, in this case, is the years axis. You can prove this to yourself by trying out:

In [None]:
np.apply_along_axis(lambda x: x.size, 0, data)

We can see from this example that `max()` receives a 74-year time series when it is called. This allows us to summarize interannual data quickly and easily!

For instance, where do we see the biggest range in inter-annual temperatures?

In [None]:
rng = np.apply_along_axis(lambda x: np.max(x) - np.min(x), 0, data)
pyplot.imshow(rng, vmax = 12)
pyplot.colorbar()
pyplot.show()
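To check that `np.apply_along_axis()` behaves the way we expect, we can try it on an array small enough to verify by hand. The array `small` below is made up for illustration: it has 4 "years," 2 rows, and 3 columns.

```python
import numpy as np

# Made-up array: 4 "years", 2 rows, 3 columns
small = np.arange(24, dtype=np.float32).reshape((4, 2, 3))

# Each call receives one pixel's time series, which has 4 values
sizes = np.apply_along_axis(lambda x: x.size, 0, small)

# The range (maximum minus minimum) of each pixel's time series
small_range = np.apply_along_axis(lambda x: x.max() - x.min(), 0, small)

print(sizes.shape, small_range.shape)
```

The output has the shape of the remaining axes, `(2, 3)`, because the axis we applied the function along was reduced to a single value per pixel.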

But clearly, the most interesting thing we could do is to calculate trends.

In [None]:
from scipy import stats

def linear_trend(array):
    # linregress(x, y) takes two arguments: y is regressed on x
    result = stats.linregress(np.arange(0, 74), array)
    return result[0] # Just the slope

trends = np.apply_along_axis(linear_trend, 0, data)

In [None]:
pyplot.imshow(trends)
pyplot.colorbar()
pyplot.show()

**Both the range and trends maps, above, look a little weird, probably because these gridded temperatures are interpolated from station data, which can be sparse.** If we were really interested in extrapolating the range or trend in temperatures, we should probably use a remote-sensing-based product instead. But this works well for educational purposes.
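Because `np.apply_along_axis()` calls a Python function once per pixel, it can be slow for large rasters. One possible alternative, sketched below, is to fit every pixel at once with `np.polyfit()`, which accepts a 2D array of `y` values with one column per time series. This swaps ordinary least squares from `np.polyfit()` in for `stats.linregress()`, and the synthetic `cube` (with slopes we choose ourselves) is an assumption made for illustration. It also assumes there are no NaN values in the data.

```python
import numpy as np

years = np.arange(10)

# Synthetic data cube (years, rows, columns) with a known slope at each pixel
true_slope = np.linspace(-1.0, 1.0, 12).reshape((3, 4))
cube = true_slope[np.newaxis, :, :] * years[:, np.newaxis, np.newaxis] + 5.0

# Flatten the spatial axes so that each column is one pixel's time series
flat = cube.reshape((years.size, -1))

# For a degree-1 fit, the first row is the slope and the second the intercept
slope, intercept = np.polyfit(years, flat, deg=1)
fast_trends = slope.reshape(cube.shape[1:])

print(np.allclose(fast_trends, true_slope))
```

The final `reshape()` restores the `(rows, columns)` layout so the result can be passed to `pyplot.imshow()` just like the output of `np.apply_along_axis()`.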

---

## Array Masks and Zonal Statistics

We've seen some examples of how raster data can be handled in Python as `numpy` arrays. Arrays come in multiple data types and we often need to handle or combine different types of numbers. For example, an array might be used to represent a categorical value, like a land-cover classification. Categorical data are usually represented by integers.

One of the common GIS analysis routines performed on raster data is **zonal statistics,** a statistical summary of a raster's values that fall in different *zones* or spatial regions. Here, we'll use zonal statistics to summarize the amount of soil organic carbon in different global plant communities.

The soil organic carbon (SOC) data we'll use come from [the NASA Soil Moisture Active Passive (SMAP) Level 4 Carbon (L4C) product.](https://nsidc.org/data/SPL4CMDL/)

In [None]:
from osgeo import gdal

ds_soc = gdal.Open('http://files.ntsg.umt.edu/data/GIS_Programming/data/SPL4CMDL_Vv6040_20220901_SOC_9km.tiff')
soc = ds_soc.ReadAsArray()
soc.shape

In [None]:
pyplot.imshow(soc)

**This doesn't look right. What did we forget?**

It's always helpful to look at the raw data.

In [None]:
soc

Ah, so we have a bunch of NoData values in our array. Our plotting library doesn't know the difference between the number we use to represent NoData (-9999) and any other number value. Because -9999 is such an extreme value, our plot's colorbar is stretched too thin; we can't make out any variation in the actual data values.

### Handling NoData Values

Obviously, we need a way of telling our plotting library to ignore NoData values.

In [None]:
soc[soc == -9999] = np.nan

We're on the right track, but the assignment above fails because `numpy` can't store `np.nan`, a floating-point value, in an array with an integer data type. We first have to convert the array to a floating-point data type.

In [None]:
soc = soc.astype(np.float32)
soc[soc == -9999] = np.nan

pyplot.figure(figsize = (12, 8))
pyplot.imshow(soc, vmin = 1000, vmax = 4000)

If the above image looks blurry or like it has a lot of holes in the data over land, note that you can tell `pyplot` to plot the data differently... Nearest-neighbor interpolation will pick the nearest raster pixel for each pixel on your screen.

In [None]:
pyplot.figure(figsize = (12, 8))
pyplot.imshow(soc, vmin = 1000, vmax = 4000, interpolation = 'nearest')
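Returning to the NoData conversion above: since we'll likely repeat those two steps for every raster we open, it may be convenient to wrap them in a small function. The sketch below is one way to do that; the function name `mask_nodata` and the tiny `raw` array are hypothetical, and -9999 is assumed as the NoData value, as in the SOC data.

```python
import numpy as np

def mask_nodata(array, nodata=-9999):
    """Return a floating-point copy of array with NoData replaced by NaN."""
    result = array.astype(np.float32)  # astype() returns a copy
    result[array == nodata] = np.nan
    return result

# Made-up integer raster with one NoData pixel
raw = np.array([[1200, -9999], [3400, 2600]], dtype=np.int16)
clean = mask_nodata(raw)

# NaN-aware functions like np.nanmean() skip the missing pixel
print(np.nanmean(clean))
```

Note that ordinary functions like `np.mean()` return NaN if any value is NaN, which is why the NaN-aware versions (`np.nanmean()`, `np.nanmax()`, and so on) are useful after this conversion.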

### Array Masks

Before we talk about zonal statistics, we should talk about array masks.

As we saw when we converted our NoData values to `np.nan`, we can query `numpy` arrays using conditional expressions like this:

In [None]:
soc[soc == -9999]

In [None]:
soc[soc > 4000]

In these examples, because the expression is not followed by an assignment operator (`=`), the values in the array that match the conditional expression are pulled out and printed to the screen.

Note that the values are returned in a predictable order but they don't have a specific shape.

In [None]:
soc[soc > 4000].shape

Here, there are over 65,000 SOC values greater than 4000 grams of carbon per meter squared... But we got all of the values as a 1D array.

Sometimes, the position of certain values within an array is important. In such cases, we actually want to take the conditional expression out of the slicing `[]` notation and use it to create a boolean array:

In [None]:
high_soc = soc > 4000
high_soc

In [None]:
high_soc.shape == soc.shape

**These kinds of arrays can be called array masks or masking arrays,** because they can be used to mask out or filter an array's contents. You'll use these kinds of conditional expressions frequently, often in combination with `np.where` or `np.argwhere`.

In [None]:
pyplot.figure(figsize = (10, 8))
pyplot.imshow(np.where(soc > 2000, soc, np.nan))

[You should know that `numpy` has support for something called a *masked array*.](https://numpy.org/doc/stable/reference/maskedarray.html) However, NumPy masked arrays can be slow to work with, especially when the array size is large, so I would recommend you avoid them. Once you're comfortable with `numpy` functions like `np.where()` and with boolean arrays, you'll never need NumPy masked arrays.
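Boolean masks are also the building block for the zonal statistics described at the start of this section. As a minimal sketch, assuming we have a `zones` array of integer class codes on the same grid as a `values` array, we could compute the mean of each zone like this. Both arrays below are made up for illustration; they are not the SOC or land-cover data.

```python
import numpy as np

# Made-up raster values (with one NaN for NoData) and matching zone codes
values = np.array([[1.0, 2.0, np.nan],
                   [4.0, 5.0, 6.0]])
zones = np.array([[1, 1, 2],
                  [2, 2, 2]])

zonal_mean = {}
for zone in np.unique(zones):
    mask = zones == zone  # Boolean mask, True where a pixel is in this zone
    zonal_mean[int(zone)] = float(np.nanmean(values[mask]))

print(zonal_mean)
```

This only works when the two arrays have exactly the same shape and are aligned to the same grid, so that each zone code lines up with the right pixel.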

---

## More Resources

- GIS&T Body of Knowledge: [Python for GIS](https://gistbok.ucgis.org/bok-topics/python-gis)
- GIS&T Body of Knowledge: [GDAL/OGR and Geospatial Data IO Libraries](https://gistbok.ucgis.org/bok-topics/gdalogr-and-geospatial-data-io-libraries)
- [GDAL Python API documentation](https://gdal.org/api/python/osgeo.gdal.html)
- [GDAL-OGR Cookbook](https://pcjericks.github.io/py-gdalogr-cookbook/index.html)