# Crop Type Prediction

This notebook trains a model to predict crop types from Sentinel 2 Level 2-A imagery.

Our training labels come from the Radiant Earth [South Africa Crop Type Competition](https://registry.mlhub.earth/10.34911/rdnt.j0co8q/). They're a collection of scenes, with integers indicating the crop type at each pixel in the scene.

Our training data comes from Microsoft's Planetary Computer. The [Sentinel 2 Level 2-A](https://planetarycomputer.microsoft.com/dataset/sentinel-2-l2a) page describes what is available.

## Data access

We'll use STAC for data access. Specifically, we'll interact with two STAC catalogs:

1. A static catalog for the labels, hosted in a Blob Storage container
2. The Planetary Computer's STAC API, to query for scenes matching some condition

The overall workflow will be:

1. Load a "chip" with the label data (a 256x256 array of integer codes indicating the crop type)
2. Search for and load a scene with Sentinel 2 imagery covering the `labels` chip
3. Transform and crop the (very large) Sentinel 2 scene to match the 256x256 label scene
4. Stack and reshape the data for the machine learning model

In [None]:
import pystac
import pystac_client
import requests
import shapely.geometry
import shapely.ops
import warnings

warnings.filterwarnings("ignore", "Creating an ndarray from ragged")

### Load labels

We have a STAC catalog of labels for the training data, which is based on the collection used in the Radiant Earth competition.

In [None]:
training_catalog = pystac.read_file(
    "https://esip2021.blob.core.windows.net/esip2021/train/collection.json"
)
training_catalog

Each of these Items contains a few things. They all share the same set of labels: integer codes indicating a particular crop type.

In [None]:
N_SCENES = 25
links = training_catalog.get_item_links()[:N_SCENES]
label_items = [link.resolve_stac_object().target for link in links]

labels = requests.get(label_items[0].assets["raster_values"].href).json()

labels

And like any STAC item, they have a specific footprint. Let's plot them on the map.

In [None]:
import geopandas

df = geopandas.GeoDataFrame.from_features([x.to_dict() for x in label_items]).set_crs(
    4326
)
m = df.explore()
m

Each of these footprints corresponds to a (256 x 256) "chip".

In [None]:
import rioxarray

rioxarray.open_rasterio(label_items[9].assets["labels"].href).squeeze().plot.imshow(
    cmap="tab10"
);

We need to associate the label items with a Sentinel-2 Level 2-A item. We need to find an item that (mostly) covers the field and isn't too cloudy.

We could make one STAC query per label item, but that would be a bit slow and inefficient. Instead, we'll do one search to get all the items covering the bounding box of *all* of our fields. So we need the union of all the bounding boxes.

In [None]:
bbox = shapely.ops.unary_union(
    [shapely.geometry.box(*item.bbox) for item in label_items]
).bounds
bbox
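For axis-aligned boxes, the bounds of the union reduce to the smallest minimum and the largest maximum along each axis. Here's a minimal pure-Python sketch of that computation. The helper `union_bounds` is hypothetical (it is not part of shapely), and the coordinates are made up rather than taken from the label catalog.

```python
def union_bounds(bboxes):
    """Return (minx, miny, maxx, maxy) enclosing every box in `bboxes`.

    Each box is a (minx, miny, maxx, maxy) tuple, the same order STAC
    uses for `item.bbox`. Hypothetical helper, for illustration only.
    """
    minxs, minys, maxxs, maxys = zip(*bboxes)
    return (min(minxs), min(minys), max(maxxs), max(maxys))


# Made-up footprints, not taken from the label catalog.
example_bboxes = [
    (18.0, -34.0, 18.5, -33.5),
    (18.4, -33.8, 19.0, -33.0),
    (17.9, -33.6, 18.2, -33.2),
]
example_union = union_bounds(example_bboxes)
print(example_union)
```

The resulting tuple can be passed directly as the `bbox` argument of a STAC search.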

Now we'll make a search for all the items matching our requirements, similar to the previous notebook.

In [None]:
stac_client = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1/"
)

date_range = "2017-06-01/2017-09-01"

search = stac_client.search(
    collections=["sentinel-2-l2a"],
    bbox=bbox,
    datetime=date_range,
    query={"eo:cloud_cover": {"lt": 25}},
)
sentinel_items = list(search.get_all_items())
len(sentinel_items)

So we have a bunch of Sentinel 2 items that together cover all of our fields. But these Sentinel scenes are much larger than our fields:

In [None]:
import folium

sentinel_item = sentinel_items[1]

layer = folium.TileLayer(
    requests.get(sentinel_item.assets["tilejson"].href).json()["tiles"][0],
    attr="Sentinel-2 L2A",
)

layer.add_to(m)
m

How do we know which (part of a) Sentinel-2 scene goes with each field? That's what we do in the next section. It's a bit complicated, but the basic idea is to pick the least-cloudy Sentinel-2 scene that covers more than 90% of our field.

In [None]:
def find_match(label_item, sentinel_items):
    # make sure we pick a sentinel scene that overlaps substantially with the label
    label_shape = shapely.geometry.shape(label_item.geometry)
    items2 = [
        item
        for item in sentinel_items
        if (
            shapely.geometry.shape(item.geometry).intersection(label_shape).area
            / label_shape.area
        )
        > 0.90
    ]
    sentinel_item = min(
        items2, key=lambda item: pystac.extensions.eo.EOExtension.ext(item).cloud_cover
    )
    return sentinel_item
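To make the selection rule concrete, here's a simplified sketch that uses plain axis-aligned rectangles instead of shapely geometries. The names `overlap_fraction` and `pick_scene`, the dictionary layout, and all the numbers are made up for illustration. One deliberate difference from `find_match`: this version returns `None` when no scene qualifies, whereas calling `min` on an empty list raises a `ValueError`.

```python
def overlap_fraction(label_box, scene_box):
    """Fraction of `label_box` area covered by `scene_box` (both axis-aligned)."""
    lx0, ly0, lx1, ly1 = label_box
    sx0, sy0, sx1, sy1 = scene_box
    # Width and height of the intersection, clamped at zero when disjoint.
    width = max(0.0, min(lx1, sx1) - max(lx0, sx0))
    height = max(0.0, min(ly1, sy1) - max(ly0, sy0))
    label_area = (lx1 - lx0) * (ly1 - ly0)
    return (width * height) / label_area


def pick_scene(label_box, scenes, min_overlap=0.90):
    """Return the least-cloudy scene covering more than `min_overlap`, or None."""
    candidates = [
        scene
        for scene in scenes
        if overlap_fraction(label_box, scene["box"]) > min_overlap
    ]
    return min(candidates, key=lambda scene: scene["cloud_cover"], default=None)


label_box = (0.0, 0.0, 10.0, 10.0)
scenes = [
    {"id": "a", "box": (-50.0, -50.0, 50.0, 50.0), "cloud_cover": 12.0},  # covers 100%
    {"id": "b", "box": (5.0, -50.0, 50.0, 50.0), "cloud_cover": 1.0},  # covers 50%
    {"id": "c", "box": (-50.0, 0.5, 50.0, 50.0), "cloud_cover": 3.0},  # covers 95%
]
best = pick_scene(label_box, scenes)
print(best["id"])
```

Scene `b` is the least cloudy overall, but it only covers half of the label, so the choice falls to `c`.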

In [None]:
import planetary_computer

matched = [
    planetary_computer.sign(find_match(label_item, sentinel_items))
    for label_item in label_items
]

Given the matched `(label_item, sentinel_item)` pairs, we can load in the actual data. Like in the last notebook, we'll use `stackstac` to load a bunch of bands for the training data. We'll also load the label data at the same time.

Finally, there's a slight pixel alignment issue, where the coordinates on the `label` data are shifted by a half-pixel from the coordinates for the training data. We'll shift the training data to match the label data.

In [None]:
import rioxarray
import stackstac


def load(label_item, sentinel_item):
    label_data = rioxarray.open_rasterio(label_item.assets["labels"].href).squeeze()

    assets = ["B02", "B03", "B04", "B05", "B06", "B07", "B09"]
    data = (
        stackstac.stack(
            sentinel_item.to_dict(),
            assets=assets,
            epsg=label_data.rio.crs.to_epsg(),  # reproject to the labels' CRS
            bounds=label_data.rio.bounds(),  # crop to the labels' bounds
            resolution=10,  # resample all assets to the highest resolution
            dtype="float32",
            fill_value=0,
        )
        .squeeze()
        .assign_coords(
            y=lambda ds: (ds.y - 5).round(),  # fix half-pixel label issue
            x=lambda ds: (ds.x + 5).round(),
        )
        .compute()
    )

    assert data.shape[1:] == label_data.shape

    # Add a label_id dimension, to track which training data goes with
    # which pixels. This will be helpful later on in evaluation.
    data = data.expand_dims({"label_id": [label_item.id]})
    label_data = label_data.expand_dims({"label_id": [label_item.id]})

    return data, label_data
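The offsets of 5 in `load` are half of the 10 m pixel size. We're assuming here (the notebook doesn't say so explicitly) that the mismatch arises because one raster labels each pixel by its top-left corner while the other labels it by its center. Under that assumption, the conversion looks like the sketch below. The helper name and coordinates are made up.

```python
PIXEL_SIZE = 10  # metres, matching `resolution=10` in `load`
HALF_PIXEL = PIXEL_SIZE / 2


def corner_to_center(x_left, y_top):
    """Convert a pixel's top-left corner coordinate to its center.

    x grows to the east, so the center is half a pixel to the right.
    y grows to the north, so the center is half a pixel below the top edge.
    """
    return x_left + HALF_PIXEL, y_top - HALF_PIXEL


# Made-up top-left corners for three neighbouring pixels in a projected CRS.
corners = [(274720, -3717120), (274730, -3717120), (274740, -3717120)]
centers = [corner_to_center(x, y) for x, y in corners]
print(centers)
```

This mirrors the signs used in `assign_coords`: `x` moves by `+5` and `y` by `-5`.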

In [None]:
import warnings

warnings.filterwarnings("ignore", message="pandas.Float64")

We're actually loading data now. This will take a bit of time.

In [None]:
%%time
Xs, ys = zip(
    *[
        load(label_item, sentinel_item)
        for label_item, sentinel_item in zip(label_items, matched)
    ]
)

In [None]:
Xs[0].shape

Now we have a list of DataArrays, each with the dimensions `(label_id, band, y, x)`. We'll use Scikit-Learn to train the model, which expects a 2-D array with dimensions `(observations, features)`. In this case, an "observation" is a single pixel (the pixel at coordinate `(-3717125, 274725)` for example), and the features are the 7 bands.

So we need to reshape each DataArray from size `(1, 7, 256, 256)` to `(65536, 7)` and then concatenate them all vertically.

In [None]:
import xarray as xr

X = xr.concat([x.stack(pixel=("label_id", "y", "x")).T for x in Xs], dim="pixel")
y = xr.concat([y.stack(pixel=("label_id", "y", "x")) for y in ys], dim="pixel")
assert X.indexes["pixel"].equals(y.indexes["pixel"])
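The same reshape can be sketched in plain NumPy on a tiny made-up array, which may make the axis bookkeeping easier to follow. Unlike the xarray version, this drops the coordinate labels, so it's only an illustration of the shapes involved.

```python
import numpy as np

n_bands, height, width = 7, 4, 4  # small stand-in for (7, 256, 256)
chip = np.arange(n_bands * height * width, dtype="float32").reshape(
    1, n_bands, height, width
)

# Move the band axis last, then flatten (label_id, y, x) into one pixel axis.
observations = np.moveaxis(chip, 1, -1).reshape(-1, n_bands)
print(observations.shape)

# Row 0 holds the 7 band values of the top-left pixel.
assert np.array_equal(observations[0], chip[0, :, 0, 0])
```

Each row of `observations` is one pixel and each column is one band, which is the `(observations, features)` layout scikit-learn expects.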

In [None]:
X.shape

In [None]:
y.shape

Thanks to xarray's indexing, we can easily go from these stacked DataArrays back to a plot.

In [None]:
label_id = label_items[0].id
X.sel(label_id=label_id).unstack().sel(band="B04").plot(cmap="Reds", figsize=(12, 9));

## Train the model

Now that we've done all the pre-processing, we can train the actual model.

We'll start with a scikit-learn KNeighborsClassifier ([User Guide](https://scikit-learn.org/stable/modules/neighbors.html#classification), [API Reference](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)) to establish a baseline model for this dataset.

In [None]:
import sklearn.neighbors
import sklearn.model_selection

In [None]:
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y)

In [None]:
clf = sklearn.neighbors.KNeighborsClassifier()
clf.fit(X_train, y_train)
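For intuition, a k-nearest-neighbors classifier labels a new pixel by majority vote among the closest training pixels in band space. With its default settings, `KNeighborsClassifier` votes among the 5 nearest neighbors using Euclidean distance. The sketch below is a brute-force pure-Python illustration of that idea, not scikit-learn's actual implementation. The function `knn_predict` and the toy data are made up.

```python
from collections import Counter
import math


def knn_predict(train_X, train_y, query, k=5):
    """Predict a label for `query` by majority vote among its k nearest points."""
    # Sort training points by Euclidean distance to the query.
    by_distance = sorted(
        zip(train_X, train_y), key=lambda pair: math.dist(pair[0], query)
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up two-band "pixels" with made-up crop codes 1 and 2.
toy_X = [
    (0.10, 0.20), (0.12, 0.22), (0.11, 0.19),
    (0.80, 0.90), (0.82, 0.88), (0.79, 0.91),
]
toy_y = [1, 1, 1, 2, 2, 2]

print(knn_predict(toy_X, toy_y, (0.10, 0.21), k=3))
print(knn_predict(toy_X, toy_y, (0.81, 0.90), k=3))
```

Because every prediction compares the query against stored training points, fitting is cheap but predicting on many pixels can be slow.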

Training score:

In [None]:
clf.score(X_train[::100], y_train[::100])

Test score:

In [None]:
clf.score(X_test[::100], y_test[::100])
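For a classifier, `score` returns the mean accuracy: the fraction of pixels whose predicted label matches the true label. The `[::100]` slices keep every 100th pixel, presumably to keep scoring fast. Here's a small sketch of both ideas with made-up labels and a hypothetical `accuracy` helper.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Made-up labels and predictions for ten pixels.
true_labels = [1, 1, 2, 2, 3, 3, 1, 2, 3, 1]
predicted = [1, 2, 2, 2, 3, 1, 1, 2, 3, 3]

print(accuracy(true_labels, predicted))
# Score every 5th pixel only, analogous to the `[::100]` slices above.
print(accuracy(true_labels[::5], predicted[::5]))
```

A subsampled score is only an estimate of the full score, so the two numbers generally differ.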

Plot the first field.

In [None]:
x = X.sel(label_id=label_id)
yhat = clf.predict(x)

In [None]:
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(8, 4))

ys[0].plot(x="x", y="y", cmap="tab10", ax=ax1, add_colorbar=False)
ax2.imshow(yhat.reshape(256, 256), cmap="tab10")
plt.tight_layout()

ax1.set_axis_off()
ax2.set_axis_off()

ax1.set(title="Actual")
ax2.set(title="Predicted");

So we seem to be able to differentiate "field" from "not a field", but do a poor job of predicting the actual crop type. Plenty of room for improvement.

## Recap

We were able to train a basic ML model to predict crop types from Sentinel-2 satellite imagery. We used STAC to find and load our data, xarray to reshape the data into an appropriate form for the model, and scikit-learn to train the model.