This document includes code to test and document various pygcdl functions. Before running the code in this document, run the code in "sample_data/create_spatial_data.ipynb" to create the data files that this code uses.

In [1]:
# First we import the necessary libraries
import sys
import pygcdl
import geopandas as gpd
import os
import pandas as pd
import numpy as np
from pathlib import Path
import zipfile
import importlib
importlib.reload(pygcdl)

<module 'pygcdl' from 'C:\\Users\\Noa.Mills\\Documents\\pygcdl\\pygcdl.py'>

In [2]:
# Create the pygcdl object
# url_base set for local development
# Remove url_base for Ceres development and testing
pygcdl_obj = pygcdl.PyGeoCDL(url_base="http://127.0.0.1:8000")

In [3]:
print(pygcdl_obj.list_datasets())

{'DaymetV4': 'Daymet Version 4', 'GTOPO30': 'Global 30 Arc-Second Elevation', 'MODIS_NDVI': 'MODIS NDVI Data, Smoothed and Gap-filled, for the Conterminous US: 2000-2015', 'NASS_CDL': 'NASS Cropland Data Layer', 'NLCD': 'National Land Cover Database', 'PRISM': 'PRISM', 'RAPV3': 'Rangeland Analysis Platform Version 3', 'SMAP-HB1km': 'SMAP HydroBlocks - 1 km', 'Soilgrids250mV2': 'SoilGrids — global gridded soil information', 'VIP': 'Vegetation Index and Phenology (VIP) Vegetation Indices Daily Global 0.05Deg CMG V004'}


In [4]:
print(pygcdl_obj.get_dataset_info("PRISM")["vars"])

{'ppt': 'total precipitation (rain+melted snow)', 'tmean': 'mean temperature (mean of tmin and tmax)', 'tmin': 'minimum temperature', 'tmax': 'maximum temperature', 'tdmean': 'mean dew point temperature', 'vpdmin': 'minimum vapor pressure deficit', 'vpdmax': 'maximum vapor pressure deficit'}


We can upload a geometry as:
- A geojson file
- A shapefile
- A zipfile containing shapefile files
- A csv file (for point data)
- A geopandas dataframe

The GCDL can only generate polygon subsets from a single polygon, or from a multipolygon object that contains only one polygon. If the user attempts to upload a geopandas dataframe that contains multiple polygons, pygcdl calculates the ratio of the area of the union of the polygons to the area of their convex hull. If the union covers at least 80% of the area of the convex hull, pygcdl uploads the convex hull. Otherwise, pygcdl uploads each polygon individually and returns a list of GUIDs.
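
The area-ratio decision described above can be illustrated with a minimal, standard-library sketch. This is not pygcdl's implementation: `polygon_area`, `convex_hull`, and `choose_upload_strategy` are hypothetical names, and the sketch assumes simple, non-overlapping polygons given as lists of (x, y) vertices, so that the area of the union is just the sum of the individual areas.

```python
# Illustrative only: not pygcdl's actual implementation.
# Assumes simple, non-overlapping polygons given as lists of (x, y) vertices,
# so the area of their union equals the sum of their areas.

def polygon_area(ring):
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def convex_hull(points):
    """Convex hull of a set of points (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half_hull(seq):
        hull = []
        for p in seq:
            # Pop points that would make a clockwise or collinear turn.
            while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
                hull.pop()
            hull.append(p)
        return hull[:-1]  # the last point starts the other half

    return half_hull(pts) + half_hull(reversed(pts))


def choose_upload_strategy(polygons, threshold=0.8):
    """Return (strategy, ratio), where strategy is 'convex_hull' or 'individual'."""
    union_area = sum(polygon_area(p) for p in polygons)
    hull = convex_hull([pt for p in polygons for pt in p])
    ratio = union_area / polygon_area(hull)
    return ("convex_hull" if ratio >= threshold else "individual"), ratio


# Two unit squares that share an edge fill their convex hull completely.
adjacent = [[(0, 0), (1, 0), (1, 1), (0, 1)], [(1, 0), (2, 0), (2, 1), (1, 1)]]
# Two unit squares far apart cover only a small part of their convex hull.
distant = [[(0, 0), (1, 0), (1, 1), (0, 1)], [(9, 0), (10, 0), (10, 1), (9, 1)]]

print(choose_upload_strategy(adjacent))
print(choose_upload_strategy(distant))
```

The adjacent squares give a ratio of 1.0, so the convex hull would be uploaded. The distant squares give a ratio of 0.2, so each polygon would be uploaded separately.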

If the user uploads a file, the file contents are not checked. So, it is possible for a user to upload a multipolygon file without any errors or warnings, and then run into errors when trying to use the GUID for that upload to download a polygon subset.
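
Because file contents are not checked on upload, it can help to count the polygons in a file yourself before uploading it. The sketch below does this for a GeoJSON file using only the standard library. `count_polygons` is a hypothetical helper, not part of pygcdl, and it ignores `GeometryCollection` objects for brevity.

```python
import json
import tempfile
from pathlib import Path


def count_polygons(geojson_path):
    """Count the individual polygons in a GeoJSON file (hypothetical helper)."""
    data = json.loads(Path(geojson_path).read_text())

    # Collect geometries from a FeatureCollection, a Feature, or a bare geometry.
    if data.get("type") == "FeatureCollection":
        geometries = [f.get("geometry") for f in data.get("features", [])]
    elif data.get("type") == "Feature":
        geometries = [data.get("geometry")]
    else:
        geometries = [data]

    count = 0
    for geom in geometries:
        if not geom:
            continue
        if geom["type"] == "Polygon":
            count += 1
        elif geom["type"] == "MultiPolygon":
            # Each entry in "coordinates" is one polygon.
            count += len(geom["coordinates"])
    return count


# Demo file: one Polygon feature plus one MultiPolygon holding two polygons.
square = [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]]
sample = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {},
         "geometry": {"type": "Polygon", "coordinates": square}},
        {"type": "Feature", "properties": {},
         "geometry": {"type": "MultiPolygon", "coordinates": [square, square]}},
    ],
}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.geojson"
    demo_path.write_text(json.dumps(sample))
    n_polygons = count_polygons(demo_path)

print(n_polygons)
```

A count greater than one suggests that the GUID returned for that file may fail when used for a polygon subset.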

In [3]:
# Specify location of sample data files
sample_data_dir = Path("sample_data/output_data")

In [4]:
# Upload polygon shapefiles
subset_counties1_guid = pygcdl_obj.upload_geometry(sample_data_dir / "subset_counties1.zip")
subset_counties2_guid = pygcdl_obj.upload_geometry(sample_data_dir / "subset_counties2.zip")
subset_counties3_guid = pygcdl_obj.upload_geometry(sample_data_dir / "subset_counties3.zip")
subset_counties4_guid = pygcdl_obj.upload_geometry(sample_data_dir / "subset_counties4.zip")
subset_counties5_guid = pygcdl_obj.upload_geometry(sample_data_dir / "subset_counties5.zip")
subset_counties6_guid = pygcdl_obj.upload_geometry(sample_data_dir / "subset_counties6.zip")
subset_counties7_guid = pygcdl_obj.upload_geometry(sample_data_dir / "subset_counties7.zip")
subset_counties8_guid = pygcdl_obj.upload_geometry(sample_data_dir / "subset_counties8.zip")
print(subset_counties1_guid)
print(subset_counties2_guid)
print(subset_counties3_guid)
print(subset_counties4_guid)
print(subset_counties5_guid)
print(subset_counties6_guid)
print(subset_counties7_guid)
print(subset_counties8_guid)

f321a02a-5a58-49c1-95a8-8ceca98a9a62
f92e1fc4-6e6a-4c9c-b897-d5ea9c372ddb
9169d66f-beda-4b57-80aa-9dfc7bb9b2f5
35dc9b8a-836b-4995-8166-18a8793623b3
08c9204b-4b8f-454c-a19f-31fa139ecff8
b70ab7a6-5b0a-4d3c-99e7-142e779148e9
145ec8fb-3219-4343-86a2-203e9f0e93ca
9f2574d8-cd6c-42cb-94e5-0e19fe0d7bca


Upload a polygon .shp file. pygcdl finds the associated files (e.g. .shx, .dbf, .prj) and creates a zip file of them. Since all of our shapefiles are already in zip files, we will first unzip a .zip file, then call upload_geometry on the .shp file.

In [7]:
# Specify paths to zip file, and path to unzip those files into
path_to_zip_file = sample_data_dir / "subset_counties1.zip"
upload_shp_dir = sample_data_dir / "upload_shp_dir"
upload_shp_dir.mkdir(exist_ok=True)
# Unzip subset_counties1.zip files into upload_shp_dir
with zipfile.ZipFile(path_to_zip_file, 'r') as zip_ref:
    zip_ref.extractall(path=upload_shp_dir)

In [8]:
subset_counties1_shp_guid = pygcdl_obj.upload_geometry(upload_shp_dir / "subset_counties1.shp")
print(subset_counties1_shp_guid)

1edd0164-8d5d-427e-adec-c88b28686ffd


If you inspect your file system, you will now see the following files in sample_data/output_data/upload_shp_dir:
- subset_counties1.cpg
- subset_counties1.dbf
- subset_counties1.prj
- subset_counties1.shp
- subset_counties1.shx
- subset_counties1.zip

We asked pygcdl to upload `subset_counties1.shp`, so it identified the minimum related files (.shp, .shx, .dbf, .prj), created the zipfile `subset_counties1.zip` containing these files, and uploaded that zipfile.
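
The bundling step described above can be sketched with `pathlib` and `zipfile`. This is an illustration under assumptions, not pygcdl's internal code: `zip_shapefile_parts` and `REQUIRED_EXTS` are hypothetical names, and the sketch assumes all four parts exist next to the .shp file.

```python
import tempfile
import zipfile
from pathlib import Path

# Extensions named in the text above as the minimum set.
REQUIRED_EXTS = (".shp", ".shx", ".dbf", ".prj")


def zip_shapefile_parts(shp_path):
    """Bundle a .shp file and its required sidecar files into a zip beside it.

    Hypothetical helper; assumes every required part exists.
    """
    shp_path = Path(shp_path)
    zip_path = shp_path.with_suffix(".zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        for ext in REQUIRED_EXTS:
            part = shp_path.with_suffix(ext)
            # ZipFile.write raises FileNotFoundError if a part is missing.
            zf.write(part, arcname=part.name)
    return zip_path


# Demo with empty placeholder files, including an optional .cpg file.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "demo"
    for ext in REQUIRED_EXTS + (".cpg",):
        base.with_suffix(ext).write_bytes(b"")
    demo_zip = zip_shapefile_parts(base.with_suffix(".shp"))
    with zipfile.ZipFile(demo_zip) as zf:
        zipped_names = sorted(zf.namelist())

print(zipped_names)
```

In the demo, the optional .cpg file is left out of the archive, matching the "minimum related files" behaviour described above.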

Now, we will show what happens if you attempt to upload a polygon .shp file that doesn't have the associated files in the same directory. First, we will remove all files except for `subset_counties1.shp` from sample_data/output_data/upload_shp_dir.

In [9]:
# Remove all files except for subset_counties1.shp
ext_to_remove = [".cpg", ".dbf", ".prj", ".shx", ".zip"]
base_file = upload_shp_dir / "subset_counties1"
for ext in ext_to_remove:
    base_file.with_suffix(ext).unlink()

In [10]:
# CAUSES AN ERROR
# Attempt to upload the lonely shapefile, and observe the error produced by pygcdl
faulty_guid = pygcdl_obj.upload_geometry(upload_shp_dir / "subset_counties1.shp")

FileNotFoundError: [Errno 2] No such file or directory: WindowsPath('C:/Users/Noa.Mills/Documents/pygcdl/sample_data/output_data/upload_shp_dir/subset_counties1.shx')
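
To avoid this error, you can check for the sidecar files before calling `upload_geometry`. The sketch below is one way to do that. `missing_shapefile_parts` is a hypothetical helper, not a pygcdl function, and the list of required extensions is taken from the text above.

```python
import tempfile
from pathlib import Path


def missing_shapefile_parts(shp_path, required=(".shx", ".dbf", ".prj")):
    """Return the required sidecar extensions that are absent next to shp_path."""
    shp_path = Path(shp_path)
    return [ext for ext in required if not shp_path.with_suffix(ext).is_file()]


# Demo: a directory holding only the .shp and .dbf parts.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "demo"
    base.with_suffix(".shp").write_bytes(b"")
    base.with_suffix(".dbf").write_bytes(b"")
    missing = missing_shapefile_parts(base.with_suffix(".shp"))

print(missing)
```

An empty list means the upload can proceed. Otherwise the list names the files to restore first.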

In [11]:
# Upload polygon geojson file
subset_counties1_geojson_guid = pygcdl_obj.upload_geometry(sample_data_dir / "subset_counties1.geojson")
subset_counties2_geojson_guid = pygcdl_obj.upload_geometry(sample_data_dir / "subset_counties2.geojson")
print(subset_counties1_geojson_guid)
print(subset_counties2_geojson_guid)

92531d94-e941-46d5-bbdf-b05313f6ff55
e80c2e94-7fb6-42c6-a7af-b14070411979


In [12]:
# Upload a points shapefile zipfile
county_centroids_shp_guid = pygcdl_obj.upload_geometry(sample_data_dir / "county_centroids.zip")
print(county_centroids_shp_guid)

ba6cbc39-31cf-40c4-8b67-0cd10856fd0c


In [13]:
# Upload a points geojson
county_centroids_geojson_guid = pygcdl_obj.upload_geometry(sample_data_dir / "county_centroids.geojson")
print(county_centroids_geojson_guid)

9d770c49-95c2-407a-9dcc-c3e0f38a10a7


In [14]:
# Upload a points csv file
county_centroids_csv_guid = pygcdl_obj.upload_geometry(sample_data_dir / "county_centroids.csv")
print(county_centroids_csv_guid)

528b2fd6-02bb-4db4-aef7-07f12dcaf909


Now that we have covered the different ways to upload data, and uploaded various data files for testing, let's take a look at how to download subsets of data. First we specify the datasets and variables we would like to use. We can do this with a pandas dataframe, a dict, a list of lists, or a numpy array as follows.

In [15]:
# Specify datasets and variables as a pandas dataframe
dsvars1 = pd.DataFrame(
    [["PRISM", "ppt"], ["MODIS_NDVI", "NDVI"]], 
    columns=["dataset", "variable"])
print(dsvars1)

      dataset variable
0       PRISM      ppt
1  MODIS_NDVI     NDVI


In [5]:
# Specify datasets and variables as a dict
dsvars2 = {"PRISM":["ppt", "tmean"], "MODIS_NDVI":["NDVI"]}

In [18]:
# Specify datasets and variables as a list
dsvars3 = [["PRISM", "ppt"],["PRISM", "tmean"], ["MODIS_NDVI", "NDVI"]]

In [20]:
# Specify datasets and variables as a numpy array
dsvars4 = np.array([["PRISM", "ppt"],["PRISM", "tmean"], ["MODIS_NDVI", "NDVI"]])
dsvars4

array([['PRISM', 'ppt'],
       ['PRISM', 'tmean'],
       ['MODIS_NDVI', 'NDVI']], dtype='<U10')
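
All of these forms carry the same information: a mapping from each dataset to its variables. The sketch below shows one way to normalise a dict or a sequence of (dataset, variable) pairs into that mapping, and then into a `dataset:var,var;dataset:var` string like the `datasets` parameter visible in the request URL printed later in this document. `dsvars_to_dict` and `datasets_param` are hypothetical helpers, and pygcdl's own conversion may differ. A pandas dataframe would need to be passed as rows, for example via `dsvars1.to_numpy()`.

```python
from urllib.parse import quote


def dsvars_to_dict(dsvars):
    """Normalise a dict or a sequence of (dataset, variable) pairs to a dict."""
    if isinstance(dsvars, dict):
        return {ds: list(variables) for ds, variables in dsvars.items()}
    grouped = {}
    for dataset, variable in dsvars:
        grouped.setdefault(str(dataset), []).append(str(variable))
    return grouped


def datasets_param(dsvars):
    """Build a 'dataset:var,var;dataset:var' string (format assumed from the URL)."""
    grouped = dsvars_to_dict(dsvars)
    return ";".join(f"{ds}:{','.join(v)}" for ds, v in grouped.items())


as_dict = {"PRISM": ["ppt", "tmean"], "MODIS_NDVI": ["NDVI"]}
as_list = [["PRISM", "ppt"], ["PRISM", "tmean"], ["MODIS_NDVI", "NDVI"]]

param = datasets_param(as_list)
print(param)
print(quote(param, safe=""))
print(datasets_param(as_dict) == param)
```

Both inputs produce the same string, which illustrates why the forms can be used interchangeably.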

Next, we specify the years, months, and grain method.

In [11]:
years = "2008"
months = "7:8"
grain_method = "any"
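
The `"7:8"` string above uses a `start:end` form. Assuming this denotes an inclusive range (an assumption here, not something this document confirms), it can be expanded as in the sketch below. `expand_range` is a hypothetical helper and handles only the two forms shown above: a single value and a `start:end` pair.

```python
def expand_range(spec):
    """Expand '2008' to [2008] and '7:8' to [7, 8] (inclusive range assumed)."""
    if ":" in spec:
        start, end = (int(part) for part in spec.split(":"))
        return list(range(start, end + 1))
    return [int(spec)]


print(expand_range("2008"))
print(expand_range("7:8"))
```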

Then, we specify our spatial resolution and resampling method.

In [12]:
spat_res = 1000 # in units of meters
resample_method = "bilinear"

Lastly, we create a directory to hold the downloaded files.

In [13]:
output_path = Path("output_test")
output_path.mkdir(exist_ok=True)

In [22]:
output_files = pygcdl_obj.download_polygon_subset(
    dsvars=dsvars4, 
    years=years,
    months=months,
    grain_method=grain_method,
    t_geom=subset_counties6_guid,
    dsn=output_path,
)

http://127.0.0.1:8000/subset_polygon?datasets=PRISM%3Appt%2Ctmean%3BMODIS_NDVI%3ANDVI&geom_guid=b70ab7a6-5b0a-4d3c-99e7-142e779148e9&resample_method=nearest&years=2008&months=7&grain_method=any&validate_method=strict&output_format=geotiff
Files downloaded and unzipped:  ['metadata.json', 'PRISM_ppt_2008-07.tif', 'PRISM_tmean_2008-07.tif', 'MODIS_NDVI_NDVI_2008-07-07.tif', 'MODIS_NDVI_NDVI_2008-07-15.tif', 'MODIS_NDVI_NDVI_2008-07-23.tif', 'MODIS_NDVI_NDVI_2008-07-31.tif']
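
The request URL printed above can be decoded with the standard library to check exactly which parameters were sent. This sketch only parses the URL string copied from the output and makes no pygcdl calls.

```python
from urllib.parse import parse_qs, urlsplit

# URL copied from the output above.
url = (
    "http://127.0.0.1:8000/subset_polygon"
    "?datasets=PRISM%3Appt%2Ctmean%3BMODIS_NDVI%3ANDVI"
    "&geom_guid=b70ab7a6-5b0a-4d3c-99e7-142e779148e9"
    "&resample_method=nearest&years=2008&months=7"
    "&grain_method=any&validate_method=strict&output_format=geotiff"
)

# parse_qs returns a list per key; each key appears once here.
params = {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}

for key, value in params.items():
    print(f"{key} = {value}")
```

Comparing these values with the variables defined earlier is a quick way to confirm that a request matches what you intended. In the output above, for example, the request carries `months=7` and `resample_method=nearest`.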
