# InstaGeo Demo

<a href="https://colab.research.google.com/github/instadeepai/InstaGeo-E2E-Geospatial-ML/blob/main/notebooks/InstaGeo_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Welcome to the InstaGeo demo notebook! This tutorial showcases the capabilities of InstaGeo, an end-to-end package designed for geospatial machine learning with multispectral data.

In this demonstration, we use ground truth geospatial point observations for cropland classification in Rwanda. The notebook will guide you through the process of creating segmentation-like data from these observations, fine-tuning the [Prithvi](https://huggingface.co/ibm-nasa-geospatial/Prithvi-100M) model, and finally visualizing the inference results on an interactive map.

By the end of this demo, you will gain hands-on experience with key InstaGeo functionalities and learn how it streamlines geospatial ML workflows from data preparation to model inference.

## Install InstaGeo

In [1]:
repository_url = "https://github.com/instadeepai/InstaGeo-E2E-Geospatial-ML"

!git clone {repository_url}

Cloning into 'InstaGeo-E2E-Geospatial-ML'...
remote: Enumerating objects: 2297, done.
remote: Counting objects: 100% (1900/1900), done.
remote: Compressing objects: 100% (699/699), done.
remote: Total 2297 (delta 1249), reused 1780 (delta 1179), pack-reused 397 (from 2)
Receiving objects: 100% (2297/2297), 11.63 MiB | 12.63 MiB/s, done.
Resolving deltas: 100% (1431/1431), done.


In [2]:
%%bash
cd InstaGeo-E2E-Geospatial-ML
pip install -e .[all]

Obtaining file:///content/InstaGeo-E2E-Geospatial-ML
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Checking if build backend supports build_editable: started
  Checking if build backend supports build_editable: finished with status 'done'
  Getting requirements to build editable: started
  Getting requirements to build editable: finished with status 'done'
  Preparing editable metadata (pyproject.toml): started
  Preparing editable metadata (pyproject.toml): finished with status 'done'
Collecting codecarbon>=3.0.7 (from instageo==0.1.0)
  Downloading codecarbon-3.2.0-py3-none-any.whl.metadata (12 kB)
Collecting rioxarray>=0.19.0 (from instageo==0.1.0)
  Downloading rioxarray-0.20.0-py3-none-any.whl.metadata (5.4 kB)
Collecting absl-py>=2.3.0 (from instageo==0.1.0)
  Downloading absl_py-2.3.1-py3-none-any.whl.metadata (3.3 kB)
Collecting astral>=3.2.0 (from instageo==0.1.0)
  Downloading astral-3.2-py3-none-any.whl.metadata (1.7 

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
firebase-admin 6.9.0 requires httpx[http2]==0.28.1, but you have httpx 0.27.2 which is incompatible.
datasets 4.0.0 requires fsspec[http]<=2025.3.0,>=2023.1.0, but you have fsspec 2025.12.0 which is incompatible.
google-genai 1.55.0 requires httpx<1.0.0,>=0.28.1, but you have httpx 0.27.2 which is incompatible.
cuml-cu12 25.10.0 requires numba<0.62.0a0,>=0.60.0, but you have numba 0.63.1 which is incompatible.
cudf-cu12 25.10.0 requires numba<0.62.0a0,>=0.60.0, but you have numba 0.63.1 which is incompatible.
gcsfs 2025.3.0 requires fsspec==2025.3.0, but you have fsspec 2025.12.0 which is incompatible.


## EarthData Login

InstaGeo currently supports multispectral data from NASA [Harmonized Landsat and Sentinel-2 (HLS)](https://hls.gsfc.nasa.gov/). Accessing HLS data requires an EarthData user account, which can be created [here](https://urs.earthdata.nasa.gov/).

In [3]:
from getpass import getpass
import os

In [4]:
# Enter your EarthData user account credentials
USERNAME = getpass('Enter your EarthData username: ')
PASSWORD = getpass('Enter your EarthData password: ')

content = f"""machine urs.earthdata.nasa.gov login {USERNAME} password {PASSWORD}"""

with open(os.path.expanduser('~/.netrc'), 'w') as file:
    file.write(content)

Enter your EarthData username: ··········
Enter your EarthData password: ··········
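
Because `~/.netrc` stores the password in plain text, it is worth restricting the file to the current user. The following is a minimal sketch assuming a POSIX system; `write_netrc` is a hypothetical helper for illustration, not part of InstaGeo.

```python
import os
import stat
from pathlib import Path


def write_netrc(path, username, password, host="urs.earthdata.nasa.gov"):
    """Write a single-host netrc entry readable only by the current user."""
    path = Path(path).expanduser()
    entry = f"machine {host} login {username} password {password}\n"
    # Creating the file with mode 0o600 avoids group/other access from the start.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(entry)
    # Tighten permissions in case the file already existed with a looser mode.
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    return path
```

Calling `write_netrc("~/.netrc", USERNAME, PASSWORD)` would produce the same entry as the cell above, with owner-only permissions.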


## InstaGeo - Data

With InstaGeo installed and EarthData authentication configured, we are now ready to download and process HLS (Harmonized Landsat and Sentinel-2) granules using the `InstaGeo-Data` module. This module offers several functionalities for handling geospatial data, including:

- Searching and retrieving metadata for HLS granules
- Downloading specific spectral bands from HLS granules
- Generating data chips and corresponding target labels for machine learning tasks

These capabilities streamline the preprocessing of multispectral data, setting the foundation for efficient geospatial model development.



In [5]:
import pandas as pd
import numpy as np
from pathlib import Path

The ground-truth geospatial observations for Rwanda cropland classification used in this notebook were sourced from the [Rwanda 2019 Crop/Non-Crop Labels (HarvestPortal)](https://data.harvestportal.org/dataset/rwanda-2019-crop-non-crop-labels) dataset. Run the following cell to download the data.

In [6]:
!wget -q --show-progress https://data.harvestportal.org/dataset/9f4b6470-2c7b-4559-95cb-49e9fd2923f6/resource/ed0ab379-a688-4419-ab96-181c726e1b22/download/ceo-2019-rwanda-cropland-sample-data-2021-04-20.csv
!wget -q --show-progress https://data.harvestportal.org/dataset/9f4b6470-2c7b-4559-95cb-49e9fd2923f6/resource/0cfc1320-f909-4759-90f9-cb5c92ca019e/download/ceo-2019-rwanda-cropland-rcmrd-set-1-sample-data-2021-04-20.csv
!wget -q --show-progress https://data.harvestportal.org/dataset/9f4b6470-2c7b-4559-95cb-49e9fd2923f6/resource/6675cc7e-e6da-4889-9905-60c0d5369ce6/download/ceo-2019-rwanda-cropland-rcmrd-set-2-sample-data-2021-04-20.csv



In [7]:
df1 = pd.read_csv("ceo-2019-rwanda-cropland-sample-data-2021-04-20.csv")
df2 = pd.read_csv("ceo-2019-rwanda-cropland-rcmrd-set-1-sample-data-2021-04-20.csv")
df3 = pd.read_csv("ceo-2019-rwanda-cropland-rcmrd-set-2-sample-data-2021-04-20.csv")

df = pd.concat([df1, df2, df3])

In [8]:
df = df[['lat', 'lon', 'collection_time', 'Crop/ or not', 'sample_id']]
df = df.rename({"lon": "x", "lat":"y", "Crop/ or not":'label', 'collection_time':"date"}, axis=1)
df.head(10)

Unnamed: 0,y,x,date,label,sample_id
0,-0.867936,29.220932,2021-03-06 18:12,Cropland,540531505
1,-1.497424,30.901432,2021-03-06 18:13,Non-crop,540531506
2,-1.759318,28.537325,2021-03-06 18:13,Non-crop,540531507
3,-2.235693,29.16731,2021-03-06 18:13,Non-crop,540531508
4,-1.128458,28.957315,2021-03-06 18:14,Cropland,540531509
5,-1.18638,28.539301,2021-03-06 18:14,Non-crop,540531510
6,-1.339662,29.692388,2021-03-06 18:15,Cropland,540531511
7,-1.758779,30.900893,2021-03-06 18:16,Non-crop,540531512
8,-2.915554,29.593948,2021-03-06 18:16,Non-crop,540531513
9,-1.398118,29.908671,2021-03-06 18:16,Cropland,540531514


In [9]:
def label_map(x):
    if x == "Cropland":
        return 1
    elif x == "Non-crop":
        return 0
    else:
        return np.nan

df['date'] = df['date'].map(lambda x: pd.to_datetime(x).strftime("%Y-%m-%d"))
df['label'] = df['label'].map(label_map)
df = df.dropna().reset_index()
df.head(10)

Unnamed: 0,index,y,x,date,label,sample_id
0,0,-0.867936,29.220932,2021-03-06,1.0,540531505
1,1,-1.497424,30.901432,2021-03-06,0.0,540531506
2,2,-1.759318,28.537325,2021-03-06,0.0,540531507
3,3,-2.235693,29.16731,2021-03-06,0.0,540531508
4,4,-1.128458,28.957315,2021-03-06,1.0,540531509
5,5,-1.18638,28.539301,2021-03-06,0.0,540531510
6,6,-1.339662,29.692388,2021-03-06,1.0,540531511
7,7,-1.758779,30.900893,2021-03-06,0.0,540531512
8,8,-2.915554,29.593948,2021-03-06,0.0,540531513
9,9,-1.398118,29.908671,2021-03-06,1.0,540531514
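
The same cleaning step can also be written with vectorized pandas operations. This is a sketch on a toy frame, not the notebook's own code: `clean_observations` is a hypothetical helper, and the `"Unsure"` value is invented to stand in for any label other than `Cropland` or `Non-crop`.

```python
import pandas as pd

LABELS = {"Cropland": 1, "Non-crop": 0}


def clean_observations(frame):
    """Normalize dates to YYYY-MM-DD and map text labels to 1/0."""
    out = frame.copy()
    out["date"] = pd.to_datetime(out["date"]).dt.strftime("%Y-%m-%d")
    # Values missing from LABELS become NaN and are dropped below.
    out["label"] = out["label"].map(LABELS)
    return out.dropna(subset=["label"]).reset_index(drop=True)


sample = pd.DataFrame(
    {
        "date": ["2021-03-06 18:12", "2021-03-06 18:13", "2021-03-06 18:14"],
        "label": ["Cropland", "Non-crop", "Unsure"],  # "Unsure" is a made-up value
    }
)
cleaned = clean_observations(sample)
```

Using `reset_index(drop=True)` here avoids carrying the old index along as an extra `index` column.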


In [10]:
print(f"The number of labeled observations in the aggregated dataset is: {df.shape[0]}")

The number of labeled observations in the aggregated dataset is: 3589


**Optional**: For the sake of rapid experimentation, let's use a subset of the observations (for instance 10%), while keeping approximately the same distribution for the labels.

In [11]:
df = df.groupby('label', as_index=False).sample(frac=0.1).reset_index(drop=True)
print(f"The number of labeled observations in the subset is: {df.shape[0]}")

The number of labeled observations in the subset is: 359
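
Sampling the same fraction from each label group is what keeps the class balance roughly unchanged. The sketch below shows this on a toy frame and fixes a seed so the subset is reproducible; `stratified_subset` and the `seed` argument are illustrative additions, not part of the notebook.

```python
import pandas as pd


def stratified_subset(frame, label_col="label", frac=0.1, seed=0):
    """Sample the same fraction from every label group, reproducibly."""
    return (
        frame.groupby(label_col, group_keys=False)
        .sample(frac=frac, random_state=seed)
        .reset_index(drop=True)
    )


# Toy data: 70 positive and 30 negative rows.
toy = pd.DataFrame({"label": [1] * 70 + [0] * 30, "value": range(100)})
subset = stratified_subset(toy)
counts = subset["label"].value_counts().to_dict()
```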


In [12]:
from sklearn.model_selection import train_test_split

train, val_and_test = train_test_split(df, test_size=0.3)
val, test = train_test_split(val_and_test, test_size=0.5)

print(len(train), len(val), len(test))

251 54 54
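
The row counts behind this split can be worked out by hand. The sketch below assumes the rounding that scikit-learn's `train_test_split` applies when only a fractional `test_size` is given (test count rounded up, remainder used for training); `split_sizes` is a hypothetical helper. Note that `DataFrame.size` counts cells (rows × columns) rather than rows, so with the six columns present here the same split reads as 1506, 324 and 324.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test count is rounded up and the rest goes to training.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


n_train, n_holdout = split_sizes(359, 0.3)   # 70% train, 30% held out
n_val, n_test = split_sizes(n_holdout, 0.5)  # held-out rows split in half
n_columns = 6  # index, y, x, date, label, sample_id

print(n_train, n_val, n_test)
print(n_train * n_columns, n_val * n_columns, n_test * n_columns)
```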


In [13]:
train.to_csv("rwanda_cropland_data_train.csv")
val.to_csv("rwanda_cropland_data_val.csv")
test.to_csv("rwanda_cropland_data_test.csv")

After splitting the data into training, validation, and test sets, the next step is to group the data by the HLS granules they belong to and download the corresponding spectral bands for each granule. Once the bands are retrieved, we will generate smaller chips and target labels with dimensions of 256 x 256 pixels.

By the end of this process, the input data will have a shape of 3 x 6 x 256 x 256 (three time steps, each with six spectral bands, over a 256 x 256 pixel chip), and the target labels will have a shape of 256 x 256.
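
These shapes can be made concrete with NumPy. The sketch below uses zero-filled arrays in place of real reflectance values and assumes the three time steps are stacked along the first axis; the flattened 18-band layout is only one plausible on-disk arrangement, not a confirmed detail of InstaGeo.

```python
import numpy as np

NUM_STEPS, NUM_BANDS, CHIP_SIZE = 3, 6, 256

# One 6-band chip per time step (zeros stand in for real pixel values).
steps = [
    np.zeros((NUM_BANDS, CHIP_SIZE, CHIP_SIZE), dtype=np.float32)
    for _ in range(NUM_STEPS)
]

chip = np.stack(steps, axis=0)                               # (3, 6, 256, 256)
seg_map = np.zeros((CHIP_SIZE, CHIP_SIZE), dtype=np.int64)   # (256, 256)

# The same data with time and band axes merged (assumed storage layout).
flat = chip.reshape(NUM_STEPS * NUM_BANDS, CHIP_SIZE, CHIP_SIZE)  # (18, 256, 256)
```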

While these tasks might seem complex, the `InstaGeo-Data` module abstracts this process, allowing you to configure it with a simple command, as shown in the following cells.

### Training Split

In [23]:
%%bash
mkdir train
python -m "instageo.data.chip_creator" \
    --dataframe_path="rwanda_cropland_data_train.csv" \
    --output_directory="train" \
    --min_count=3 \
    --chip_size=256 \
    --temporal_tolerance=10 \
    --temporal_step=30 \
    --num_steps=3 \
    --masking_strategy=any \
    --mask_types=water \
    --window_size=1 \
    --processing_method=cog

mkdir: cannot create directory ‘train’: File exists
INFO:earthaccess.auth:You're now authenticated with NASA Earthdata Login
I1226 12:28:50.938397 134799735382016 chip_creator.py:309] Using HLS pipeline
I1226 12:28:50.938599 134799735382016 chip_creator.py:236] HLS dataset JSON already created
I1226 12:28:50.983706 134799735382016 chip_creator.py:241] Creating Chips and Segmentation Maps
I1226 12:28:50.983961 134799735382016 data_pipeline.py:794] All STAC items have already been processed. Nothing to do.


In [24]:
root_dir = Path.cwd()
chips_orig = os.listdir(os.path.join(root_dir, "train/chips"))
chips = [chip.replace("chip", "train/chips/chip") for chip in chips_orig]
seg_maps = [chip.replace("chip", "train/seg_maps/seg_map") for chip in chips_orig]

df = pd.DataFrame({"Input": chips, "Label": seg_maps})
df.to_csv("train.csv")

In [25]:
print(f"The size of the train split: {df.shape[0]}")

The size of the train split: 0
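
The chip/label pairing is repeated for every split, so it can be factored into one function. The sketch below is a hypothetical helper (`build_split_index` is not part of InstaGeo). It assumes, as the cells in this notebook do, that chip files start with `chip` and that the matching label file has the same name with `seg_map` in place of that prefix. It returns an empty list when the `chips` directory is missing, and replaces only the leading prefix rather than every occurrence of `chip` in the name.

```python
from pathlib import Path


def build_split_index(split_dir):
    """Return (chip_path, seg_map_path) pairs for one split directory."""
    split_dir = Path(split_dir)
    chips_dir = split_dir / "chips"
    rows = []
    if not chips_dir.is_dir():
        return rows
    for chip in sorted(chips_dir.iterdir()):
        if not chip.name.startswith("chip"):
            continue  # skip unrelated files
        # Swap only the leading "chip" prefix of the file name.
        seg_name = "seg_map" + chip.name[len("chip"):]
        rows.append((str(chips_dir / chip.name), str(split_dir / "seg_maps" / seg_name)))
    return rows
```

The result can be turned into the same CSV as above with `pd.DataFrame(build_split_index("train"), columns=["Input", "Label"]).to_csv("train.csv")`. A length of zero, as in the output above, means no chips were created for that split.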


### Validation Split

In [26]:
%%bash
mkdir val
python -m "instageo.data.chip_creator" \
    --dataframe_path="rwanda_cropland_data_val.csv" \
    --output_directory="val" \
    --min_count=3 \
    --chip_size=256 \
    --temporal_tolerance=3 \
    --temporal_step=30 \
    --num_steps=3 \
    --masking_strategy=any \
    --mask_types=water \
    --window_size=1 \
    --processing_method=cog

mkdir: cannot create directory ‘val’: File exists
INFO:earthaccess.auth:You're now authenticated with NASA Earthdata Login
I1226 12:29:24.684516 137559234473984 chip_creator.py:309] Using HLS pipeline
I1226 12:29:24.684725 137559234473984 chip_creator.py:236] HLS dataset JSON already created
I1226 12:29:24.725983 137559234473984 chip_creator.py:241] Creating Chips and Segmentation Maps
I1226 12:29:24.726228 137559234473984 data_pipeline.py:794] All STAC items have already been processed. Nothing to do.


In [27]:
root_dir = Path.cwd()
chips_orig = os.listdir(os.path.join(root_dir, "val/chips"))
chips = [chip.replace("chip", "val/chips/chip") for chip in chips_orig]
seg_maps = [chip.replace("chip", "val/seg_maps/seg_map") for chip in chips_orig]

df = pd.DataFrame({"Input": chips, "Label": seg_maps})
df.to_csv("val.csv")

In [28]:
print(f"The size of the validation split: {df.shape[0]}")

The size of the validation split: 0


### Test Split

In [29]:
%%bash
mkdir test
python -m "instageo.data.chip_creator" \
    --dataframe_path="rwanda_cropland_data_test.csv" \
    --output_directory="test" \
    --min_count=3 \
    --chip_size=256 \
    --temporal_tolerance=3 \
    --temporal_step=30 \
    --num_steps=3 \
    --masking_strategy=any \
    --mask_types=water \
    --window_size=1 \
    --processing_method=cog

mkdir: cannot create directory ‘test’: File exists
INFO:earthaccess.auth:You're now authenticated with NASA Earthdata Login
I1226 12:29:58.806387 138830736789504 chip_creator.py:309] Using HLS pipeline
I1226 12:29:58.806589 138830736789504 chip_creator.py:215] Creating HLS dataset JSON.
I1226 12:29:58.806659 138830736789504 chip_creator.py:216] Retrieving HLS tile ID for each observation.
W1226 12:30:05.230779 138830736789504 stac_utils.py:384] No items found for 35MQT
W1226 12:30:06.434482 138830736789504 stac_utils.py:384] No items found for 35MQU
I1226 12:30:20.669516 138830736789504 raw.py:733] Created 0 records
I1226 12:30:20.671977 138830736789504 chip_creator.py:241] Creating Chips and Segmentation Maps
I1226 12:30:20.672315 138830736789504 data_pipeline.py:794] All STAC items have already been processed. Nothing to do.


In [30]:
root_dir = Path.cwd()
chips_orig = os.listdir(os.path.join(root_dir, "test/chips"))
chips = [chip.replace("chip", "test/chips/chip") for chip in chips_orig]
seg_maps = [chip.replace("chip", "test/seg_maps/seg_map") for chip in chips_orig]

df = pd.DataFrame({"Input": chips, "Label": seg_maps})
df.to_csv("test.csv")

In [31]:
print(f"The size of the test split: {df.shape[0]}")

The size of the test split: 0


## InstaGeo - Model

After creating our dataset using the `InstaGeo-Data` module, we can move on to fine-tuning a model that includes a Prithvi backbone paired with a classification head. For regression tasks, the classification head can easily be replaced with a suitable regression head. Additionally, if a completely different model architecture is needed, it can be designed and implemented within this framework.

In [32]:
import os
import pandas as pd
import numpy as np
from pathlib import Path

**Launch Training**

First, compute the mean and standard deviation for the dataset and update the corresponding config file, in this case `locust.yaml`.

In [33]:
!python -m instageo.model.run --config-name=locust \
    root_dir='.' \
    train.batch_size=8 \
    mode=stats \
    train_filepath="train.csv"


Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/content/InstaGeo-E2E-Geospatial-ML/instageo/model/run.py", line 29, in <module>
    import pytorch_lightning as pl
ModuleNotFoundError: No module named 'pytorch_lightning'
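
For reference, the statistics this step is meant to produce are per-band means and standard deviations over all training chips. The sketch below shows one way to compute them with NumPy using running sums; it is not InstaGeo's implementation, `band_statistics` is a hypothetical name, and it ignores details such as no-data masking. It raises an error on an empty dataset, which is the situation an empty `train.csv` would lead to.

```python
import numpy as np


def band_statistics(chips):
    """Per-band mean and standard deviation over an iterable of chips.

    Each chip is an array shaped (bands, height, width). Sums are accumulated
    chip by chip so the whole dataset never has to be held in memory.
    """
    total = sq_total = None
    count = 0
    for chip in chips:
        data = np.asarray(chip, dtype=np.float64)
        pixels = data.reshape(data.shape[0], -1)
        if total is None:
            total = np.zeros(pixels.shape[0])
            sq_total = np.zeros(pixels.shape[0])
        total += pixels.sum(axis=1)
        sq_total += (pixels ** 2).sum(axis=1)
        count += pixels.shape[1]
    if count == 0:
        raise ValueError("no chips provided")
    mean = total / count
    # Population variance: E[x^2] - E[x]^2, clipped at zero for numerical safety.
    std = np.sqrt(np.maximum(sq_total / count - mean ** 2, 0.0))
    return mean, std


rng = np.random.default_rng(0)
demo = [rng.normal(size=(6, 8, 8)) for _ in range(4)]  # synthetic 6-band chips
mean, std = band_statistics(demo)
```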


In [34]:
!python -m instageo.model.run --config-name=locust \
    root_dir='.' \
    train.batch_size=8 \
    train.num_epochs=5 \
    train_filepath="train.csv" \
    valid_filepath="val.csv"

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/content/InstaGeo-E2E-Geospatial-ML/instageo/model/run.py", line 29, in <module>
    import pytorch_lightning as pl
ModuleNotFoundError: No module named 'pytorch_lightning'
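
The traceback above means `pytorch_lightning` was not importable in that session, so no training took place. A quick check before launching can catch this early. The sketch below uses only the standard library; `missing_modules` is a hypothetical helper, and the list of names to check is an assumption based on the traceback.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(["pytorch_lightning"])
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```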


**Run Model Evaluation**

Adjust the `checkpoint_path` argument to use the desired model checkpoint.

In [35]:
!python -m instageo.model.run --config-name=locust \
    root_dir='.' \
    test_filepath="test.csv" \
    train.batch_size=8 \
    checkpoint_path='checkpoint-path' \
    mode=eval

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/content/InstaGeo-E2E-Geospatial-ML/instageo/model/run.py", line 29, in <module>
    import pytorch_lightning as pl
ModuleNotFoundError: No module named 'pytorch_lightning'


**Prepare Inference Directory**

In [36]:
# !gsutil cp gs://instageo/utils/africa_prediction_template.csv .
!mkdir -p inference/2023-06

**Create Inference Data**

For inference, we only need to download the necessary HLS tiles and run inference directly using the sliding window inference feature.

If you're running inference across the entire African continent, you can use the `africa_prediction_template.csv`, which will automatically download 2,120 HLS granules covering Africa and parts of Asia.

For this demo, we'll limit the scope to the HLS granules included in our test split.

Note: Ensure you have approximately 1TB of storage space available for this process if you are running inference across Africa.
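
To give a sense of what sliding-window inference involves, the sketch below enumerates 256 x 256 windows over a full HLS tile, which is 3660 x 3660 pixels at 30 m resolution. It is a generic illustration, not InstaGeo's implementation: the function names are hypothetical, and shifting the final window back so it ends at the tile edge is just one common way to handle the remainder.

```python
def window_origins(length, window, stride=None):
    """Start offsets of windows that together cover [0, length) on one axis."""
    stride = stride or window
    if length <= window:
        return [0]
    origins = list(range(0, length - window + 1, stride))
    # Add a final window flush with the edge if the regular grid falls short.
    if origins[-1] + window < length:
        origins.append(length - window)
    return origins


def sliding_windows(height, width, window=256, stride=None):
    """Top-left (row, col) corners of every window over a height x width tile."""
    rows = window_origins(height, window, stride)
    cols = window_origins(width, window, stride)
    return [(r, c) for r in rows for c in cols]


windows = sliding_windows(3660, 3660)
print(len(windows), windows[0], windows[-1])
```

With non-overlapping windows, each axis needs 15 positions, giving 225 windows per tile.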

In [None]:
!python -m "instageo.data.chip_creator" \
    --dataframe_path="rwanda_cropland_data_test.csv" \
    --output_directory="inference/2023-06" \
    --chip_size=256 \
    --processing_method=download-only

**Run Inference**

Adjust the `checkpoint_path` argument to use the desired model checkpoint.

In [None]:
!python -m instageo.model.run --config-name=locust \
    root_dir='inference/2023-06' \
    test_filepath='hls_dataset.json' \
    train.batch_size=16 \
    test.mask_cloud=True \
    checkpoint_path='checkpoint-path' \
    mode=sliding_inference

## InstaGeo - Apps
Once inference has been completed on the HLS tiles and the results have been saved, we can use the `InstaGeo-Apps` module to visualize the predictions on an interactive map.

To visualize the results, simply move the HLS prediction GeoTIFF files to the appropriate directory, and `InstaGeo-Apps` will handle the rest, providing an intuitive and interactive mapping experience.

In [None]:
!mkdir -p predictions/2023/6
!mv inference/2023-06/predictions/* predictions/2023/6
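
The same move can be done in Python, deriving the `year/month` folder from the inference directory name. This is a sketch under the assumption that inference directories are named `YYYY-MM` and that the app expects `predictions/<year>/<month>` with an unpadded month, as in the commands above; `publish_predictions` is a hypothetical helper.

```python
import shutil
from pathlib import Path


def publish_predictions(inference_dir, output_root="predictions"):
    """Move files from <inference_dir>/predictions to <output_root>/<year>/<month>."""
    inference_dir = Path(inference_dir)
    year, month = inference_dir.name.split("-")           # e.g. "2023-06"
    target = Path(output_root) / year / str(int(month))   # "06" -> "6"
    target.mkdir(parents=True, exist_ok=True)
    moved = []
    for src in sorted((inference_dir / "predictions").glob("*")):
        moved.append(Path(shutil.move(str(src), str(target / src.name))))
    return moved
```

For this demo the call would be `publish_predictions("inference/2023-06")`.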

In [None]:
!npm install localtunnel

In [None]:
!nohup streamlit run InstaGeo-E2E-Geospatial-ML/instageo/apps/app.py --server.address=localhost &

Retrieve your public IP address, which serves as the password for the localtunnel endpoint.

In [None]:
import urllib.request  # the request submodule must be imported explicitly

print("Password/Endpoint IP for localtunnel is:", urllib.request.urlopen('https://ipv4.icanhazip.com').read().decode('utf8').strip("\n"))

In [None]:
!npx localtunnel --port 8501

## Summary

In this notebook, we demonstrated the end-to-end capabilities of InstaGeo for geospatial machine learning using multispectral data. We began by downloading and processing HLS granules, creating data chips for training, and fine-tuning a model with the Prithvi backbone. Finally, we ran inference on test data and visualized the results using the `InstaGeo-Apps` module.

By leveraging InstaGeo, complex tasks such as data preprocessing, model training, and large-scale inference can be streamlined and efficiently handled with minimal configuration.

If you found this demo helpful, please consider giving our [InstaGeo GitHub repository](https://github.com/instadeepai/InstaGeo-E2E-Geospatial-ML) a star ⭐! Your support helps us continue improving the tool for the community.

Thank you for exploring InstaGeo with us!