# Data Ingestion - General Purpose Tooling

![Landsat8](./images/nasa_landsat8.jpg "Landsat8")

---

## Overview

In the previous notebook, you learned how to efficiently load data using geospatial-specific tooling. If that approach meets your needs, we recommend advancing to a workflow example, such as [Spectral Clustering](2.0_Spectral_Clustering_PC.ipynb).

Still here? Great! Let's explore an alternative method for data access, centered around [Intake](https://intake.readthedocs.io). Intake is a high-level library designed for data ingestion and catalog management. It excels at organizing datasets into easily manageable catalogs and provides a unified interface to load data from various sources. This is particularly useful for projects involving diverse and complex datasets.

While the geospatial-specific tooling approach is optimized for satellite data, Intake offers a broader and potentially more flexible approach for multimodal data workflows, characterized by:

- **Unified Interface**: Abstracts the details of data sources, enabling users to interact with a consistent API regardless of the data's underlying format.
- **Catalog System**: Allows for the organization of data sources into catalogs that can be version-controlled and shared, thus enhancing collaboration and transparency.
- **Extensible**: Facilitates the addition of new data sources and formats through its plugin system.
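
To make these three ideas concrete, here is a toy sketch in plain Python. It is not Intake's API or implementation: every name in it (`READERS`, `register`, `CATALOG`, `load`) is hypothetical, and the data are made-up inline strings so the example runs without files or network access.

```python
# Toy illustration only: all names are hypothetical and none of this is Intake's API.
import csv
import io
import json

READERS = {}  # maps a format name to a function that parses raw text


def register(fmt):
    """Decorator that adds a reader to the registry (a stand-in for a plugin system)."""
    def wrap(func):
        READERS[fmt] = func
        return func
    return wrap


@register("json")
def read_json(text):
    return json.loads(text)


@register("csv")
def read_csv(text):
    return list(csv.DictReader(io.StringIO(text)))


# A "catalog": named entries that record what the data is and how to read it.
CATALOG = {
    "scenes": {"format": "csv", "source": "id,cloud_cover\nA,10\nB,55\n"},
    "settings": {"format": "json", "source": '{"bands": ["red", "green", "blue"]}'},
}


def load(name):
    """Unified entry point: the caller never needs to know the underlying format."""
    entry = CATALOG[name]
    return READERS[entry["format"]](entry["source"])


scenes = load("scenes")
settings = load("settings")
print(scenes)
print(settings)
```

Supporting a new format in this sketch only requires registering one more reader function; the catalog entries and the `load` call stay the same.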

The following sections introduce the Intake features that simplify data access and improve both modularity and reproducibility in geospatial workflows.


## Prerequisites

| Concepts | Importance | Notes |
| --- | --- | --- |
| [Intro to Landsat](./0.0_Intro_Landsat.ipynb) | Necessary | Background |
| [Data Ingestion - Geospatial-Specific Tooling](1.0_Data_Ingestion-Geospatial.ipynb) | Helpful | |
| [Pandas Cookbook](https://foundations.projectpythia.org/core/pandas.html) | Helpful |  |
| [xarray Cookbook](https://foundations.projectpythia.org/core/xarray.html) | Necessary |  |
| [Intake Quickstart](https://intake.readthedocs.io/en/latest/index.html) | Helpful |  |
| [Intake Cookbook](https://projectpythia.org/intake-cookbook/README.html) | Necessary | |

- **Time to learn**: 20 minutes

---

## Imports

In [1]:
import intake
import hvplot.xarray
import planetary_computer

# import warnings
# warnings.simplefilter('ignore', FutureWarning) # Ignore warning about the format of epsg codes

To get started, we pass a STAC URL (or any other data source URL) to Intake and ask it to recommend suitable data types.

In [2]:
url = "https://planetarycomputer.microsoft.com/api/stac/v1"
data_types = intake.readers.datatypes.recommend(url)
print(data_types)

[<class 'intake.readers.datatypes.JSONFile'>, <class 'intake.readers.datatypes.STACJSON'>, <class 'intake.readers.datatypes.Handle'>, <class 'intake.readers.datatypes.CatalogAPI'>, <class 'intake.readers.datatypes.TiledService'>]


Of these, `STACJSON` matches our source, so we use it to wrap the URL.

In [None]:
data_type = intake.datatypes.STACJSON(url)
data_type

Similarly, we can list the readers that can be used with the `STACJSON` data type.

In [None]:
readers = data_type.possible_readers
print(readers)

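The `StacCatalogReader` is probably the most suitable for our use case. We can use it to read the STAC catalog and explore its contents. The `signer` argument receives `planetary_computer.sign_inplace`, which signs asset URLs so that the data can be accessed.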

In [None]:
reader = intake.catalogs.StacCatalogReader(
    data_type, signer=planetary_computer.sign_inplace
)
reader

We can read the catalog and then collect each collection's ID and description into a dictionary to see what's available:

In [None]:
stac_cat = reader.read()

In [None]:
metadata = {}
for data_description in stac_cat.data.values():
    data = data_description.kwargs["data"]
    metadata[data["id"]] = data["description"]
list(metadata.keys())
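
The list of IDs can be long, so it may help to narrow it down by keyword. The sketch below is one hedged way to do that with plain Python. `find_collections` is a hypothetical helper, and `example_metadata` is a small stand-in for the `metadata` dictionary built above, with placeholder descriptions rather than the real text from the catalog.

```python
# Stand-in for the `metadata` dictionary built above. The descriptions are
# placeholders, not the real text returned by the catalog.
example_metadata = {
    "landsat-c2-l1": "placeholder description",
    "landsat-c2-l2": "placeholder description",
    "sentinel-2-l2a": "placeholder description",
}


def find_collections(metadata, keyword):
    """Return the sorted collection IDs whose ID or description mentions `keyword`."""
    keyword = keyword.lower()
    return sorted(
        collection_id
        for collection_id, description in metadata.items()
        if keyword in collection_id.lower() or keyword in (description or "").lower()
    )


landsat_ids = find_collections(example_metadata, "landsat")
print(landsat_ids)
```

With the real dictionary, the call would be `find_collections(metadata, "landsat")`.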

We can print the descriptions of the two Landsat Collection 2 entries to compare them.

In [None]:
print("1:", metadata["landsat-c2-l1"])
print("2:", metadata["landsat-c2-l2"])

Specifically, we want `landsat-c2-l2`, the Level-2 collection.

In [None]:
landsat_reader = stac_cat["landsat-c2-l2"]

Its metadata is shown below.

In [None]:
landsat_reader.read().metadata

We can get a preview of the dataset by looking at the thumbnail.

In [None]:
# read the thumbnail as an array
landsat_reader["thumbnail"].read()

In [None]:
# render the thumbnail with Panel
landsat_reader["thumbnail"].to_reader("panel")

If the preview looks right, we can move on to the items in the collection.


In [None]:
landsat_items = landsat_reader["geoparquet-items"]
landsat_items

In [None]:
# Note `output_instance`: .tail() turns the dask dataframe into a pandas one, but
# GeoDataFrameToSTACCatalog works only with geopandas, so we state the output type explicitly
cat = landsat_items.tail(output_instance="geopandas:GeoDataFrame").GeoDataFrameToSTACCatalog.read()

In [None]:
# this is an "item collection"; each item is a set of assets (many levels here)
cat

As before, we pick an entry (here, the first item) and read it to get a sub-catalog of its assets.

In [None]:
item_key = list(cat.entries.keys())[0]
subcat = cat[item_key].read()
subcat

In [None]:
# single image in one band
subcat.red.read()

In [None]:
# The signer given to the catalog reader is not carried over to this item, so we pass it again here
catbands = cat[item_key].to_reader(
    reader="StackBands",
    bands=["red", "green", "blue"],
    signer=planetary_computer.sign_inplace,
)

Then, we can load the actual assets, stacked along a `band` dimension.

In [None]:
# Multiband image. Note that the "band" coordinate of each input is 1 rather than the
# band's real name; the bands could be relabelled here
data = catbands.read(dim="band")
data
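
The relabelling mentioned in the comment above could look like the following sketch. It assumes the stacked result is an `xarray.DataArray` with dimensions `("band", "y", "x")` and that the layers are in the order the bands were requested. Both are assumptions, so the example uses a small synthetic array (`stacked`) rather than the real data.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for the stacked result: three 4x5 layers whose `band`
# coordinate is 1 for every layer, mimicking the situation described above.
stacked = xr.DataArray(
    np.arange(60, dtype="float32").reshape(3, 4, 5),
    dims=("band", "y", "x"),
    coords={"band": [1, 1, 1]},
)

# Replace the uninformative labels with names, in the order the bands were requested.
band_names = ["red", "green", "blue"]
relabelled = stacked.assign_coords(band=band_names)

# Bands can now be selected by name.
red = relabelled.sel(band="red")
print(relabelled.band.values)
print(red.shape)
```

If the same assumptions hold for the real data, the equivalent call would be `data.assign_coords(band=["red", "green", "blue"])`.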

Now we can plot a true-color image from the extracted bands.

In [None]:
data.plot.imshow(robust=True, figsize=(10, 10))
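
The `robust=True` option is documented by xarray as basing the color limits on the 2nd and 98th percentiles instead of the minimum and maximum, which keeps a few very bright pixels from washing out the image. The pure-Python sketch below illustrates that idea on a made-up list of values. It is not xarray's implementation, and `percentile` and `robust_stretch` are hypothetical helper names.

```python
def percentile(values, q):
    """Linearly interpolated percentile (q in [0, 100]) of a non-empty sequence."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def robust_stretch(values, low=2, high=98):
    """Rescale values to [0, 1] using percentile limits, clipping anything outside."""
    vmin = percentile(values, low)
    vmax = percentile(values, high)
    if vmax == vmin:
        return [0.0 for _ in values]
    return [min(1.0, max(0.0, (v - vmin) / (vmax - vmin))) for v in values]


# Made-up pixel values: a smooth ramp plus one extreme outlier.
pixels = list(range(100)) + [10000]
stretched = robust_stretch(pixels)
print(stretched[50], stretched[-1])
```

With a plain min/max stretch, the single outlier would compress every other value into the bottom 1% of the range; with percentile limits, the outlier is simply clipped to 1.0.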