# MAAP AWS Access With Python

Authors: Harshini Girish (UAH), Sheyenne Kirkland (UAH), Chuck Daniels (DevSeed), Alex Mandel (DevSeed)

Date: March 26, 2025

Description: In this tutorial, we walk through accessing MAAP data in S3 buckets (maap-ops-workspace and nasa-maap-data-store) in Python. We’ll also demonstrate opening raster, vector, and text files.



## Run This Notebook

To access and run this tutorial within MAAP's Algorithm Development Environment (ADE), please refer to the ["Getting started with the MAAP"](https://docs.maap-project.org/en/latest/getting_started/getting_started.html) section of our documentation.

Disclaimer: it is highly recommended to run a tutorial within MAAP’s ADE, which already includes packages specific to MAAP, such as maap-py. Running the tutorial outside of the MAAP ADE may lead to errors.

## Additional Resources

- [earthdata: Python–R Handoff](https://github.com/NASA-Openscapes/earthdata-cloud-cookbook/blob/main/earthdata-cloud-r/python-r-handoff.Rmd)  
A notebook in NASA Openscapes that shows users how to access data from S3 links.

- [MAAP AWS Access Tutorial (R)](https://docs.maap-project.org/en/develop/technical_tutorials/working_with_r/access_aws_maap.html)  
Official MAAP documentation showing how to work with AWS-hosted datasets in R.


## Install/Import Packages
 
Let's import the packages needed for this tutorial.

In [5]:
from maap.maap import MAAP
from pystac_client import Client
import geopandas as gpd
from osgeo import gdal
import pandas as pd
import boto3
import rasterio
import os
import re
import subprocess
from rasterio.session import AWSSession
from rasterio.env import Env


## Set up Access

We don’t need to manually handle temporary credentials, but we do need to set the default AWS region to `us-west-2`.

In [6]:
# Connect to MAAP API and S3
maap = MAAP()
s3 = boto3.client("s3", region_name="us-west-2")
# Workspace bucket name, referenced as `bucket` in later cells
bucket = "maap-ops-workspace"
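The `region_name` argument above applies only to that boto3 client. GDAL-based readers (rasterio, geopandas, `gdalinfo`) resolve the region separately; GDAL's `/vsis3/` documentation lists the `AWS_REGION` and `AWS_DEFAULT_REGION` options for this. The following is a minimal sketch, assuming credentials are already supplied by the ADE. `ensure_aws_region` is a hypothetical helper name, not part of maap-py or boto3.

```python
import os


def ensure_aws_region(region="us-west-2", env=os.environ):
    """Set AWS_DEFAULT_REGION only if it is unset; return the effective value.

    `env` defaults to the process environment, so child processes started
    later (for example via subprocess) inherit the setting.
    """
    return env.setdefault("AWS_DEFAULT_REGION", region)


effective_region = ensure_aws_region()
print("AWS region:", effective_region)
```

Using `setdefault` leaves an existing value untouched, so a region configured elsewhere is not silently overridden.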


## Explore Buckets

Now that we have access to MAAP buckets, we can retrieve data stored in AWS.

Mounted paths (like `/projects/` or `/shared/`) are convenient for interactive browsing in the ADE, but they can be slower and are not portable to other environments like the DPS. For reproducible and scalable workflows, especially those intended to run in the cloud or on DPS, we recommend direct S3 paths or GDAL-style virtual file paths.

Typically, users will interact with two main buckets:

1. **maap-ops-workspace** – Contains both user-private and user-shared data.  
   - Private files are found under `s3://maap-ops-workspace/<username>/...`  
   - Shared files are available under `s3://maap-ops-workspace/shared/<username>/...`

2. **nasa-maap-data-store** – Hosts curated datasets that have been ingested into the MAAP STAC catalog.  
   - This is the primary location for analysis-ready data used in DPS jobs and shared workflows.
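The shared-folder layout described above can be captured in a small helper, so prefixes are built the same way every time. This is only a sketch: `shared_prefix` and `s3_uri` are hypothetical names introduced for illustration, and the example username and folder are the ones used later in this tutorial.

```python
WORKSPACE_BUCKET = "maap-ops-workspace"


def shared_prefix(username, *parts):
    """Return a key prefix of the form 'shared/<username>/<parts...>/'."""
    if not username or "/" in username:
        raise ValueError("username must be a non-empty string without '/'")
    segments = ["shared", username, *parts]
    # Drop stray slashes so segments join cleanly.
    cleaned = [s.strip("/") for s in segments if s.strip("/")]
    return "/".join(cleaned) + "/"


def s3_uri(bucket, key):
    """Return the s3:// URI for a bucket and key (or key prefix)."""
    return f"s3://{bucket}/{key}"


prefix = shared_prefix("alexdevseed", "cog-tests")
print(prefix)
print(s3_uri(WORKSPACE_BUCKET, prefix))
```

The trailing slash matters when the result is used as a `Prefix`: without it, `shared/alex` would also match `shared/alexdevseed/`.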


## User Shared Buckets

To list objects from a shared bucket, run the code below. Be sure to update the prefix path after "shared/" to match your desired directory.

In [21]:
s3_response = s3.list_objects_v2(
    Bucket="maap-ops-workspace",
    Prefix="shared/alexdevseed/cog-tests/"
)


To collect the key (identifier) of each object under that prefix and keep only the `.tif` files, run the following cell.

In [8]:
all_objects = [obj["Key"] for obj in s3_response.get("Contents", [])]
tif_objects = [key for key in all_objects if key.endswith(".tif")]
for tif in tif_objects:
    print(tif)


shared/alexdevseed/cog-tests/Landsat8_275_comp_cog_2015-2020_dps.tif
shared/alexdevseed/cog-tests/boreal_agb_20211015_0249_cog-ovr3.tif
shared/alexdevseed/cog-tests/boreal_agb_20211015_0249_cog-ovr4.tif
shared/alexdevseed/cog-tests/boreal_agb_20211015_0249_cog-ovr6.tif
shared/alexdevseed/cog-tests/boreal_agb_20211015_0249_cog-ovr8.tif
shared/alexdevseed/cog-tests/boreal_agb_20211015_0249_cog-s3o8.tif
shared/alexdevseed/cog-tests/boreal_agb_20211015_0249_cog.tif
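A single `list_objects_v2` call returns at most 1,000 keys, so larger folders need pagination. The sketch below uses boto3's `get_paginator("list_objects_v2")` for that. `list_keys` is a hypothetical helper name; it takes the client as an argument, so it can be used with the `s3` client created earlier.

```python
def list_keys(s3_client, bucket, prefix, suffix=""):
    """Return every key under `prefix` that ends with `suffix`.

    Follows pagination, so results are not capped at one page.
    An empty `suffix` matches all keys.
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages for an empty prefix carry no "Contents" entry.
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(suffix):
                keys.append(obj["Key"])
    return keys
```

For example, `list_keys(s3, "maap-ops-workspace", "shared/alexdevseed/cog-tests/", suffix=".tif")` would be expected to return the same `.tif` keys as the cells above.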


## User Private Buckets

To access data in your private bucket, you'll follow a similar approach as before, but with an updated prefix. First, we’ll retrieve your username to correctly construct the path.

In [12]:
username = maap.profile.account_info()['username']
print("Username:", username)

Username: harshinigirish


In [18]:
prefix = f"shared/{username}/"  # lists your shared folder; see the note below for the private prefix
s3_response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
s3_object_keys = [obj["Key"] for obj in s3_response.get("Contents", [])]
print("S3 Objects:")
for key in s3_object_keys:
    print(key)

S3 Objects:
shared/harshinigirish/
shared/harshinigirish/GLLIDARPC_FL_20200311_FIA8_l0s12.las


**Note**:
While the following examples don't explicitly access private buckets, the process is exactly the same as for shared buckets. The only difference is the prefix path: use `<username>/` as the prefix instead of `shared/<username>/`.

## nasa-maap-data-store Buckets

To access data from the `nasa-maap-data-store` bucket, we’ll use a STAC query via the `pystac-client` library to retrieve item metadata, including file paths. These paths can then be used with tools that support STAC or direct S3 access.

In [14]:
stac_url = "https://stac.maap-project.org/"
client = Client.open(stac_url)

In this example, we'll query the `icesat2-boreal` collection to explore its available data items.

In [15]:
collection_id = "icesat2-boreal"
search = client.search(collections=[collection_id], max_items=10)
items = list(search.items())  # items() replaces the deprecated get_items()
print("First 10 STAC Items:")
for item in items:
    print(item.id)


First 10 STAC Items:
boreal_agb_202302151676439579_1326
boreal_agb_202302151676435792_3402
boreal_agb_202302151676435665_3417
boreal_agb_202302151676434536_3215
boreal_agb_202302151676434460_3035
boreal_agb_202302151676432986_2782
boreal_agb_202302151676430990_1278
boreal_agb_202302151676430794_26340
boreal_agb_202302151676430633_40664
boreal_agb_202302151676430594_0611


Now that we've specified our collection and retrieved a list of items, we can extract the S3 URL linked to the first item in the collection.

In [16]:
first_item = items[0]
asset_href = list(first_item.assets.values())[0].href
print("S3 URL:", asset_href)


S3 URL: s3://nasa-maap-data-store/file-staging/nasa-map/icesat2-boreal/boreal_agb_202302151676439579_1326_train_data.csv
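Taking the first value of `first_item.assets` works here, but it depends on the order in which assets are listed. If an item carries more than one asset, selecting by file extension is more explicit. The sketch below assumes only that each asset exposes an `href` attribute, as pystac assets do; `find_asset_href` is a hypothetical helper name.

```python
def find_asset_href(assets, suffix):
    """Return the href of the first asset whose href ends with `suffix`.

    `assets` is a mapping of asset key to an object with an `href`
    attribute (for example `item.assets`). Returns None if nothing matches.
    """
    for asset in assets.values():
        if asset.href.lower().endswith(suffix.lower()):
            return asset.href
    return None
```

With the item above, `find_asset_href(first_item.assets, ".csv")` would return the training-data CSV URL if the item has such an asset.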


## Accessing an Item

## TIFF

In this example, we’ll access a TIFF file from a shared S3 bucket. To read the file directly from S3 with GDAL-based tools, the path must begin with `/vsis3/`. We'll construct the full path by combining `/vsis3/` with the bucket name and the object key.

In [22]:
key = "shared/alexdevseed/cog-tests/boreal_agb_20211015_0249_cog.tif"
tiff_path = f"/vsis3/{bucket}/{key}"
print("TIFF path:", tiff_path)

TIFF path: /vsis3/maap-ops-workspace/shared/alexdevseed/cog-tests/boreal_agb_20211015_0249_cog.tif


This code block uses the `rio cogeo info` command-line tool (provided by the `rio-cogeo` plugin) to inspect a Cloud Optimized GeoTIFF (COG) directly from S3. It prints detailed metadata specific to the COG structure, such as tile layout, overviews, and internal organization. This information is useful for evaluating whether the file is optimized for cloud-based access and helps inform decisions before processing or visualization.


In [26]:
cmd = ["rio", "cogeo", "info", tiff_path]
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)

Driver: GTiff
File: /vsis3/maap-ops-workspace/shared/alexdevseed/cog-tests/boreal_agb_20211015_0249_cog.tif
COG: True
Compression: None
ColorSpace: None

Profile
    Width:            3000
    Height:           3000
    Bands:            4
    Tiled:            True
    Dtype:            float32
    NoData:           -3.3999999521443642e+38
    Alpha Band:       False
    Internal Mask:    False
    Interleave:       PIXEL
    ColorMap:         False
    ColorInterp:      ('gray', 'undefined', 'undefined', 'undefined')
    Scales:           (1.0, 1.0, 1.0, 1.0)
    Offsets:          (0.0, 0.0, 0.0, 0.0)

Geo
    Crs:              PROJCS["unknown",GEOGCS["NAD83",DATUM["North_American_Datum_1983",SPHEROID["GRS 1980",6378137,298.257222101004,AUTHORITY["EPSG","7019"]],AUTHORITY["EPSG","6269"]],PRIMEM["Greenwich",0],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4269"]],PROJECTION["Albers_Conic_Equal_Area"],PARAMETER["latitude_of_center",40],PARAMETER["longitud

As a best practice, it's important to know which GDAL drivers are available, as using the appropriate driver ensures efficient and reliable access to geospatial data. Different drivers support different formats (e.g., GeoTIFF, NetCDF, Shapefile), and selecting the right one can significantly impact performance and compatibility.

Please refer to the GDAL [raster driver list](https://gdal.org/en/stable/drivers/raster/index.html) and [vector (OGR) driver list](https://gdal.org/en/stable/drivers/vector/index.html) for more details.

In [24]:
with rasterio.Env() as env:
    # env.drivers() maps each driver's short name to its long (descriptive) name
    drivers = list(env.drivers().items())
    for short_name, long_name in drivers[:5]:
        print(f"{short_name:<10} | {long_name}")


VRT        | Virtual Raster
GTI        | GDAL Raster Tile Index
DERIVED    | Derived datasets using VRT pixel functions
GTiff      | GeoTIFF
COG        | Cloud optimized GeoTIFF generator


The `gdalinfo` command-line utility can also be run from within Python to read metadata from a TIFF file stored in an AWS S3 bucket. The file path is formatted with `/vsis3/`, which allows GDAL to access cloud-hosted data directly. When the command is executed with Python’s `subprocess` module, the output (detailed raster metadata such as size, projection, and geotransform) can be captured and printed.
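The `gdalinfo` call described above can be sketched as follows. The helper names `gdalinfo_command` and `run_gdalinfo` are hypothetical; the sketch assumes `gdalinfo` is on the `PATH` and that S3 credentials and region are already configured, as they are in the ADE. The actual call is left commented out because it needs network access.

```python
import shutil
import subprocess


def gdalinfo_command(path, as_json=False):
    """Build the gdalinfo argument list for a local or /vsis3/ path."""
    cmd = ["gdalinfo"]
    if as_json:
        cmd.append("-json")  # machine-readable output
    cmd.append(path)
    return cmd


def run_gdalinfo(path, timeout=120):
    """Run gdalinfo and return its stdout; raise if the tool is missing or fails."""
    if shutil.which("gdalinfo") is None:
        raise FileNotFoundError("gdalinfo was not found on PATH")
    result = subprocess.run(
        gdalinfo_command(path), capture_output=True, text=True, timeout=timeout
    )
    if result.returncode != 0:
        raise RuntimeError(result.stderr.strip())
    return result.stdout


tiff_path = "/vsis3/maap-ops-workspace/shared/alexdevseed/cog-tests/boreal_agb_20211015_0249_cog.tif"
print(" ".join(gdalinfo_command(tiff_path)))
# print(run_gdalinfo(tiff_path))  # uncomment where GDAL and S3 access are available
```

Checking the return code surfaces access or path errors, which would otherwise appear only as empty output.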

## Vector

In this example, we access a GeoPackage file stored in a shared S3 bucket using the `geopandas` package. As with raster data, we prepend `/vsis3/` to the file path so that GDAL can stream the data directly from S3 without downloading it locally.

In [27]:
prefix = "shared/smk0033/CONUSbiohex2020/biohex.gpkg"
response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
gpkg_keys = [obj["Key"] for obj in response.get("Contents", []) if obj["Key"].endswith(".gpkg")]

key = gpkg_keys[0]
vector_path = f"/vsis3/{bucket}/{key}"
print("GeoPackage path:", vector_path)

gdf = gpd.read_file(vector_path)
print(vector_path)


GeoPackage path: /vsis3/maap-ops-workspace/shared/smk0033/CONUSbiohex2020/biohex.gpkg
/vsis3/maap-ops-workspace/shared/smk0033/CONUSbiohex2020/biohex.gpkg


In [28]:
print(gdf.head())

   USHEXES_ID  EMAP_HEX  PROP_FORES  SE_PROP_FO   CRM_LIVE  SE_CRM_LIV  \
0        1680    1680.0    0.966835    3.247659  76.729213   14.810822   
1        1681    1681.0    0.983914    1.123591  72.751194   10.498955   
2        1568    1568.0    0.854100   12.539034  88.527037   20.416719   
3        1456    1456.0    0.543536   22.598699  52.052440   40.713392   
4        1345    1345.0    0.520229   23.210199  42.179547   29.260777   

   CRM_STND_D  SE_CRM_STN  CRM_LIVE_D  SE_CRM_L_1  ...  SE_JENK_LI  \
0    2.091053   68.108338   78.820266   15.381299  ...   17.287907   
1    1.870613   25.186416   74.621807   10.496552  ...    9.390085   
2    0.703147   58.649462   89.230184   20.333036  ...   20.126408   
3    3.783766   37.665236   55.836206   39.080061  ...   37.532659   
4    0.340501   50.498881   42.520048   29.336943  ...   27.366023   

   JENK_STND_  SE_JENK_ST  JENK_LIVE_  SE_JENK__1    EST_SAMPLE  SAMPLED_PL  \
0   23.530244   59.050753  127.717229   10.583475  1424

## CSV (Spatial)

In this example, we’ll load a CSV file containing spatial data directly from the MAAP STAC results. We reuse the asset URL retrieved from the first STAC item (`asset_href`) and modify it to stream data using the `/vsis3/` prefix.

In [29]:
asset_href = "s3://nasa-maap-data-store/file-staging/nasa-map/icesat2-boreal/boreal_agb_202302151676439579_1326_train_data.csv"


Since we already have a complete S3 path, we convert the `"s3://"` prefix to `"/vsis3/"`. Additionally, we define the appropriate field names for longitude and latitude so that the file is interpreted as spatial.

To learn more, refer to the [GDAL Comma Separated Value (.csv) driver documentation](https://gdal.org/drivers/vector/csv.html).


In [30]:
csv_path = asset_href.replace("s3://", "/vsis3/")
gdf = gpd.read_file(
    f"CSV:{csv_path}",
    engine="fiona",
    X_POSSIBLE_NAMES="lon",
    Y_POSSIBLE_NAMES="lat"
)
print(gdf.head())


         lon        lat               AGB                   SE  \
0 -76.301546  51.089067  13.2031877105918  0.00120325130936702   
1 -79.011834  50.972447  3.88344532354623  0.00107527195707417   
2 -76.397307  50.458315   4.3007091919769  0.00107527195707417   
3 -76.308436  50.442678  43.3027732332638  0.00120325130936702   
4 -77.456452  52.031459  2.34135031326733  0.00107527195707417   

                     geometry  
0  POINT (-76.30155 51.08907)  
1  POINT (-79.01183 50.97245)  
2  POINT (-76.39731 50.45832)  
3  POINT (-76.30844 50.44268)  
4  POINT (-77.45645 52.03146)  
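The `s3://` to `/vsis3/` conversion used above relies on `str.replace`, which passes any string through unchanged if it is not an S3 URI. A slightly stricter variant is sketched below; `s3_to_vsis3` is a hypothetical helper name, and the example URI is the one from this tutorial.

```python
def s3_to_vsis3(uri):
    """Convert 's3://bucket/key' to the GDAL path '/vsis3/bucket/key'."""
    scheme = "s3://"
    if not uri.startswith(scheme):
        raise ValueError(f"not an s3:// URI: {uri!r}")
    # Split only on the first slash: everything after it is the object key.
    bucket, _, key = uri[len(scheme):].partition("/")
    if not bucket or not key:
        raise ValueError(f"URI must include a bucket and a key: {uri!r}")
    return f"/vsis3/{bucket}/{key}"


asset_href = "s3://nasa-maap-data-store/file-staging/nasa-map/icesat2-boreal/boreal_agb_202302151676439579_1326_train_data.csv"
print(s3_to_vsis3(asset_href))
```

Failing early on a malformed URI gives a clearer error than the one GDAL would raise when it cannot open the path.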


## CSV (non-spatial)

For this example, we’ll access a CSV file from our shared bucket.

In [31]:
csv_listing = s3.list_objects_v2(Bucket=bucket, Prefix="shared/smk0033/csv_ex/")
csv_keys = [obj["Key"] for obj in csv_listing.get("Contents", [])]
csv_key = csv_keys[3]
print(csv_key)


shared/smk0033/csv_ex/country_estimates_gedi_l4b_v002.csv


Although this CSV file can be accessed directly from shared storage or S3, we’re downloading it locally before reading. This approach helps avoid potential memory issues or latency that can arise when reading files over a network connection, especially for formats like CSV that aren’t inherently cloud-optimized.

Downloading can also improve compatibility with processing tools like `pandas`, which work most simply with local files. While cloud-native streaming is preferred for large geospatial formats (e.g., COGs), working with local copies of non-spatial files can improve stability and simplicity in many cases.


Before downloading, let’s create a new directory to put our file.

In [32]:
os.makedirs("./data", exist_ok=True)

In [33]:
# Create the file name for the download
filename = os.path.basename(csv_key)
print("Filename:", filename)

Filename: country_estimates_gedi_l4b_v002.csv


In [34]:
download_path = os.path.join("./data", filename)
s3.download_file(Bucket=bucket, Key=csv_key, Filename=download_path)
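When a notebook is rerun, the download step above fetches the file again. The sketch below wraps the same `download_file` call so an existing local copy is reused. `download_if_missing` is a hypothetical helper name; the check looks only at whether the file exists, so it will not notice if the object in S3 has changed.

```python
import os


def download_if_missing(s3_client, bucket, key, dest_dir="./data"):
    """Download `key` into `dest_dir` unless a file with that name already exists.

    Returns the local file path in either case.
    """
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, os.path.basename(key))
    if not os.path.exists(dest):
        s3_client.download_file(Bucket=bucket, Key=key, Filename=dest)
    return dest
```

With the variables defined earlier, `download_path = download_if_missing(s3, bucket, csv_key)` would stand in for the two cells above.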

In [35]:
# Read CSV into DataFrame
data = pd.read_csv(download_path)
print(data.head())


       Country ISO3  Percent_Forest  FAO_Forested_AGBD  FAO_Forested_AGBD.1  \
0        Aruba  ABW             2.3            -9999.0              -9999.0   
1  Afghanistan  AFG             1.9            -9999.0              -9999.0   
2       Angola  AGO            53.4               30.3                 16.2   
3     Anguilla  AIA            61.1              210.0                128.3   
4      Albania  ALB            28.8            -9999.0              -9999.0   

   GEDI_L4B_Total_AGBD  GEDI_L4B_Total_AGBD.1  GEDI_L4B_AGBD_SE_Percent  \
0                  2.1                    0.5                      23.6   
1                 24.7                    1.3                       5.4   
2                 34.6                    0.6                       1.9   
3                  4.4                    1.0                      22.5   
4                 56.9                    1.4                       2.5   

      FAO_AGB  GEDI_L4B_AGB  GEDI_L4B_AGB_SE  
0 -9999.00000      0.000036