"Geo Data Science with Python" 
### Notebook Lesson 06d

# Python Packages: Download data from the Web II

This lesson discusses several smaller Python modules that are useful for downloading and retrieving geoscience data from the internet.

### Sources
This notebook contains information from the following resources:

PyDap
- https://www.pydap.org/en/latest
- https://www.pydap.org/en/latest/client.html
- https://github.com/pydap/pydap

OpenDAP Data Access Protocol
- https://www.opendap.org/
- https://opendap.github.io/documentation/QuickStart.html
- https://www.opendap.org/support/user-documentation

NASA Earthdata

- Register: https://urs.earthdata.nasa.gov/documentation/what_do_i_need_to_know
- Data Access: https://disc.gsfc.nasa.gov/data-access
- OpenDap: https://earthdata.nasa.gov/collaborate/open-data-services-and-software/api/opendap
- GESDISC data recipes: https://disc.gsfc.nasa.gov/information/howto

Examples from Earthdata
- Earthdata, read NetCDF data in Python: https://disc.gsfc.nasa.gov/information/howto?title=How%20to%20read%20and%20plot%20NetCDF%20MERRA-2%20data%20in%20Python
- Download subset of data: https://disc.gsfc.nasa.gov/information/howto?title=How%20to%20download%20a%20spatial%20and%20variable%20subset%20of%20Level%201B%20data%20using%20OPeNDAP

---
## CODE EXAMPLE 1: Open remote datasets

Let’s start **accessing gridded data**, i.e., data that is stored as a regular multidimensional array. Here’s a simple example where we access the COADS climatology from the official OPeNDAP server:

In [1]:
from pydap.client import open_url


In [None]:
# open_url: function that opens a remote dataset given its URL
pdpData = open_url('http://test.opendap.org/dap/data/nc/coads_climatology.nc')

In [None]:
# this returns a DatasetType object (which is a fancy dictionary)
type(pdpData)

In [None]:
# check the names of the stored variables
pdpData.keys()  # pdpData.values() returns the variable objects instead

In [None]:
# reference the SST variable
sst = pdpData['SST']  

# or with "lazy syntax": pdpData.SST

In [None]:
# this returns a GridType object (which is a multidimensional array)
type(sst)

In [None]:
# the array has specific axes defining each of its dimensions
sst.dimensions

In [None]:
# the maps are the coordinate variables (axes) that belong to each dimension
sst.maps

In [None]:
# axes of the array can be called directly
sst.TIME  # gives same as pdpData['TIME']

In [None]:
# the axes of the array are BaseType objects
type(sst.TIME)  # same for TIME, COADSY and COADSX

---
## CODE EXAMPLE 2: Introspect the variable attributes

Inspect the attributes of the GridType object (class):

In [None]:
# check shape of the sst variable from the coads_climatology dataset
sst.shape

In [None]:
sst.dtype

In [None]:
sst.TIME.shape

In [None]:
sst.TIME.dtype

In [None]:
# the metadata of a GridType object is stored in its attributes dictionary
sst.attributes

---
## CODE EXAMPLE 3: Download a subset of gridded data

Pydap will download the accessed data on-the-fly as needed:

In [None]:
# get shape again and compare to mappings
sst.shape 

In [None]:
sst.TIME.shape, sst.COADSY.shape, sst.COADSX.shape

In [None]:
# only now, this will download data from the server !!!
grid = sst[0,10:14,10:14] 

In [None]:
# the data subset is also a GridType object
type(grid)

In [None]:
# the grid data and its maps can be viewed with the data attribute
grid.data

In [None]:
# the data itself can be accessed through the array attribute of the grid...
grid.array[:]

In [None]:
# ... or, as a NumPy array, through the data attribute of that array
grid.array[:].data

In [None]:
# access with dictionary syntax or 'lazy' syntax:
grid['SST'][:].data
grid['TIME'][:].data    # or grid.TIME[:].data
grid['COADSX'][:].data  # or grid.COADSX[:].data
grid['COADSY'][:].data  # or grid.COADSY[:].data


In [None]:
# Alternatively, download the data directly, skipping the axes:
sst.array[0,10:14,10:14].data
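
To get a feeling for how much less data a subset request involves, the selection can be counted before anything is downloaded. The following is a minimal sketch in plain Python. The helper `selected_shape` is a hypothetical name, and the full shape (12, 90, 180) for TIME, COADSY and COADSX is an assumption based on the COADS climatology used in this lesson.

```python
from math import prod

def selected_shape(index, full_shape):
    """Count the elements selected along each dimension by a tuple of ints and slices."""
    counts = []
    for idx, length in zip(index, full_shape):
        if isinstance(idx, int):
            counts.append(1)  # a single integer index selects one element
        else:
            counts.append(len(range(*idx.indices(length))))
    return tuple(counts)

full_shape = (12, 90, 180)  # assumed lengths of TIME, COADSY, COADSX
subset = selected_shape((0, slice(10, 14), slice(10, 14)), full_shape)
print(subset, prod(subset), 'of', prod(full_shape), 'values')
# prints: (1, 4, 4) 16 of 194400 values
```

The selection `[0,10:14,10:14]` therefore covers 16 of the 194400 values in the full array.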

---
## CODE EXAMPLE 4: Determine Download URL for Certain Subset

In [None]:
# open the website: http://test.opendap.org/opendap/data/nc/
# click on the nc file
# get the link for a subset of the dataset through the data request form ...
# e.g., only for the SST variable and the TIME variable

In [None]:
pdpData2 = open_url('http://test.opendap.org/opendap/data/nc/coads_climatology.nc?TIME[0:1:11],SST[0:1:11][0:1:89][0:1:179]')
pdpData2.keys()
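
The subset URL above follows the OPeNDAP constraint expression syntax: each dimension gets a `[start:stride:stop]` hyperslab in which, unlike in a Python slice, the stop index is inclusive. A minimal sketch of building such a URL from Python slices is shown below. The helper names `hyperslab` and `subset_url` are hypothetical, and the dimension lengths (12, 90, 180) are taken from the index ranges in the URL above.

```python
def hyperslab(index, length):
    """Translate an int or slice into a '[start:stride:stop]' string (stop inclusive)."""
    if isinstance(index, int):
        if index < 0:
            index += length
        index = slice(index, index + 1)
    start, stop, step = index.indices(length)
    if step < 1 or stop <= start:
        raise ValueError('only non-empty selections with a positive stride are supported')
    return f'[{start}:{step}:{stop - 1}]'

def subset_url(base_url, selections):
    """Append a constraint expression; selections maps a variable name to (indices, shape)."""
    parts = []
    for name, (indices, shape) in selections.items():
        slabs = ''.join(hyperslab(i, n) for i, n in zip(indices, shape))
        parts.append(name + slabs)
    return base_url + '?' + ','.join(parts)

base = 'http://test.opendap.org/opendap/data/nc/coads_climatology.nc'
everything = slice(None)
url = subset_url(base, {
    'TIME': ((everything,), (12,)),
    'SST': ((everything, everything, everything), (12, 90, 180)),
})
print(url)
```

With these inputs the sketch reproduces the URL used in the cell above. A Python slice such as `10:14` becomes `[10:1:13]`.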

---
## CODE EXAMPLE 5: Open the file with netCDF4 (after downloading) 

You could also download the file we just worked with from the OPeNDAP server and interact with it via the netCDF4 package, using the same netCDF4 functions and attributes that we discussed in the last lesson. You can try this below. It demonstrates that the package **pydap** provides just another interface (like the package netCDF4) to the same science data structure (NetCDF science data files). The power of pydap and OPeNDAP servers, though, is that you do not have to download the entire dataset, only the variables or slices that you need.

In [None]:
# uncomment the last three lines to download the file to your computer
import requests
url = 'http://test.opendap.org/dap/data/nc/coads_climatology.nc'
filename = 'coads_climatology.nc'
# r = requests.get(url, allow_redirects=True)
# with open(filename, 'wb') as f:  # the with block closes the file after writing
#     f.write(r.content)

In [None]:
# open it with netCDF4
from netCDF4 import Dataset
data = Dataset('coads_climatology.nc')

In [None]:
data
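
Note that `r.content` in the download cell holds the complete file in memory before it is written to disk. For larger files it can help to copy the response to disk in chunks instead. The sketch below uses `urllib.request` from the standard library rather than `requests`, a deliberate substitution to keep the example free of dependencies. The function name `download_file` and the chunk size are assumptions.

```python
import shutil
import urllib.request

def download_file(url, filename, chunk_size=1024 * 1024):
    """Stream the resource at url to filename without loading it fully into memory."""
    with urllib.request.urlopen(url) as response, open(filename, 'wb') as f:
        # copyfileobj reads and writes chunk_size bytes at a time
        shutil.copyfileobj(response, f, chunk_size)
    return filename

# usage (uncomment to download the file to your computer):
# download_file('http://test.opendap.org/dap/data/nc/coads_climatology.nc',
#               'coads_climatology.nc')
```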

---
## CODE EXAMPLE 6: OPeNDAP with Earthdata and Authentication

In [None]:
# to make this work:
#   - you need to install the lxml package
#   - you need to authorize these applications in your Earthdata account:
#     "NSIDC V0 OPeNDAP" and "NASA GESDISC DATA ARCHIVE"
#     for later also: OB.DAAC MERIS, GESDISC Test Data Archive

In [None]:
from pydap.client import open_url       # needed for OPeNDAP access
from pydap.cas.urs import setup_session # needed for Earthdata login
import numpy as np

In [None]:
url = 'https://hydro1.gesdisc.eosdis.nasa.gov/opendap/hyrax/GLDAS/GLDAS_CLSM10_M.2.1/2021/GLDAS_CLSM10_M.A202101.021.nc4'
username = 'geopythonvt'      # please replace with your own username
password = 'GeoPythonVT2021'  # please replace with your own password
session = setup_session(username, password, check_url=url) # create Earthdata session
dataset = open_url(url, session=session)  # create connection to dataset

In [None]:
# The dataset has lots of variables:
dataset.keys()

In [None]:
dataset.attributes

In [None]:
# Let's say we are only interested in Rainfall data
dataset.Rainf_tavg.attributes

In [None]:
dataset.Rainf_tavg.shape

In [None]:
dataset.time[:].data

In [None]:
dataset.time[:].units
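
The `units` attribute of a time axis typically follows the pattern `<unit> since <reference date>`, so the numeric time values are offsets from a reference date. A minimal sketch of decoding such values with the standard library is shown below. The function name `decode_time` and the example units string are assumptions (check the actual string returned by the cell above), and the sketch only covers the standard calendar.

```python
from datetime import datetime, timedelta

SECONDS_PER_UNIT = {'seconds': 1, 'minutes': 60, 'hours': 3600, 'days': 86400}

def decode_time(values, units):
    """Convert numeric offsets to datetimes for units like 'days since 2021-01-01 00:00:00'."""
    unit, _, reference = units.partition(' since ')
    origin = datetime.fromisoformat(reference.strip())
    factor = SECONDS_PER_UNIT[unit.strip()]
    return [origin + timedelta(seconds=float(v) * factor) for v in values]

# assumed units string, for illustration only
dates = decode_time([0, 31], 'days since 2021-01-01 00:00:00')
print(dates)
```

For other calendars or less regular units strings, the `num2date` function of the netCDF4 package is the more robust choice.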

In [None]:
# Now we can download only the rainfall variable 
# and its mappings and metadata: time, lon, lat, missing_value
rainf = dataset.Rainf_tavg.array[:].data
lon = dataset.lon[:].data
lat = dataset.lat[:].data
lonGrid, latGrid = np.meshgrid(lon,lat)
fillVal = dataset.Rainf_tavg.missing_value

#### Plot the rainfall data

In [None]:
fillVal

In [None]:
# replace missing values with nan
rainf[rainf==fillVal] = np.nan
# convert unit from kg m-2 s-1 to kg m-2 d-1
rainf_day = rainf*60*60*24
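
The two steps above can be tried without a server connection. The sketch below uses a small synthetic array. The fill value of -9999 and the rain rates are made up for illustration and are not taken from the GLDAS file.

```python
import numpy as np

fill_value = -9999.0  # assumed fill value, for illustration only
# synthetic rain rates in kg m-2 s-1 with shape (time, lat, lon)
rain_rate = np.array([[[0.0, 1.0e-5],
                       [fill_value, 2.5e-5]]], dtype=np.float32)

# np.where returns a new array, so the original values stay untouched
rain_valid = np.where(rain_rate == fill_value, np.nan, rain_rate)

SECONDS_PER_DAY = 60 * 60 * 24
# 1 kg of water spread over 1 m2 forms a 1 mm deep layer, so kg m-2 d-1 equals mm/day
rain_mm_per_day = rain_valid * SECONDS_PER_DAY

print(np.nanmax(rain_mm_per_day))  # largest valid value, NaN entries are ignored
```

Unlike the in-place assignment `rainf[rainf==fillVal] = np.nan`, this variant keeps the original array unchanged, which can be convenient when the cell is run more than once.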

In [None]:
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(10, 6))
plt.pcolormesh(lonGrid, latGrid, rainf_day[0], shading='auto', cmap='nipy_spectral')
# or use:
# plt.scatter(lonGrid, latGrid, c=rainf_day[0], cmap='nipy_spectral', s=1)
plt.colorbar(label='GLDAS average daily rainfall in Jan 2021 (mm/day)')

---
## CODE EXAMPLE 7: Download entire file with requests (no subsetting)

For this, we need to set up the login to Earthdata by creating a .netrc file. See also the description here: https://disc.gsfc.nasa.gov/data-access

You need to replace `<username>` and `<password>` with your own Earthdata login credentials.

In [None]:
%%bash
cd $HOME
touch .netrc
# append the Earthdata entry (">>" keeps any existing entries in .netrc)
echo "machine urs.earthdata.nasa.gov login <username> password <password>" >> .netrc
chmod 0600 .netrc
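
To check that the entry was written correctly, the .netrc file can be read back with the `netrc` module from the Python standard library. This is a minimal sketch. The function name `earthdata_login` is hypothetical, and the host name is the one used in the bash cell above.

```python
import netrc

def earthdata_login(netrc_path, host='urs.earthdata.nasa.gov'):
    """Return (login, password) stored for host in the given .netrc file, or None."""
    entry = netrc.netrc(netrc_path).authenticators(host)
    if entry is None:
        return None  # no entry for this host
    login, _account, password = entry
    return login, password

# usage (uncomment to check the file in your home directory):
# from pathlib import Path
# print(earthdata_login(str(Path.home() / '.netrc')))
```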

In [None]:
# Set the URL string to point to a specific data URL. Some generic examples are:
#   https://servername/data/path/file
#   https://servername/opendap/path/file[.format[?subset]]
#   https://servername/daac-bin/OTF/HTTP_services.cgi?KEYWORD=value[&KEYWORD=value]
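# NOTE: the ".html" suffix requests the OPeNDAP data request form (an HTML page), not the data file;
#       to retrieve the data itself, use the download link of the file instead
URL = 'https://hydro1.gesdisc.eosdis.nasa.gov/opendap/hyrax/GLDAS/GLDAS_CLSM10_M.2.1/2001/GLDAS_CLSM10_M.A200101.021.nc4.html'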

# Set the FILENAME string to the data file name, the LABEL keyword value, or any customized name. 
FILENAME = 'GLDAS_CLSM10_M.A200101.021.nc4'

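import requests
result = requests.get(URL)
try:
    result.raise_for_status()
    # the with block closes the file even if writing fails
    with open(FILENAME, 'wb') as f:
        f.write(result.content)
    print('contents of URL written to ' + FILENAME)
except requests.exceptions.HTTPError:
    # raise_for_status() raises HTTPError for 4xx and 5xx responses
    print('requests.get() returned an error code ' + str(result.status_code))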
