# Direct S3 Data Access with GDAL Virtual Raster Format (VRT)

## Summary

Hello World

---

## Exercise

### Import Required Packages

In [None]:
#

### Get Temporary Credentials and Configure Local Environment  

To perform direct S3 data access, one needs to acquire temporary S3 credentials. The credentials give users direct access to S3 buckets in NASA Earthdata Cloud. **AWS credentials should not be shared**, so take precautions when using them in notebooks or scripts. **Note:** these temporary credentials are valid for only **1 hour**. For more information regarding the temporary credentials, visit <https://data.lpdaac.earthdatacloud.nasa.gov/s3credentialsREADME>.
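
As a minimal sketch of how a credentials response might be handled once it has been retrieved, the snippet below maps a JSON payload to the standard AWS environment variable names. The key names in the payload (`accessKeyId`, `secretAccessKey`, `sessionToken`, `expiration`) are an assumption about the response format, and the helper `to_aws_env` and the sample values are hypothetical. No request is made here.

```python
import json

# Hypothetical sample of the JSON a credentials endpoint might return.
# The key names are an assumption; inspect the real response before relying on them.
SAMPLE_RESPONSE = json.dumps({
    "accessKeyId": "EXAMPLE_KEY_ID",
    "secretAccessKey": "EXAMPLE_SECRET",
    "sessionToken": "EXAMPLE_TOKEN",
    "expiration": "2021-01-27 00:50:09+00:00",
})


def to_aws_env(response_text):
    """Map a credentials JSON payload to the standard AWS variable names."""
    creds = json.loads(response_text)
    return {
        "AWS_ACCESS_KEY_ID": creds["accessKeyId"],
        "AWS_SECRET_ACCESS_KEY": creds["secretAccessKey"],
        "AWS_SESSION_TOKEN": creds["sessionToken"],
    }


aws_env = to_aws_env(SAMPLE_RESPONSE)

# Print only the variable names, never the secret values.
print(sorted(aws_env))
```

Printing only the key names keeps the secrets out of the notebook output, which matters if the notebook is saved or shared.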

In [None]:
#

In [None]:
#

#### Insert the credentials into our `boto3` session and configure our `rasterio` environment for data access

Create a boto3 Session object using your temporary credentials. This Session can then be used to pass those credentials and get S3 objects from applicable buckets.

In [None]:
#

For this exercise, we are going to open a context manager for the notebook using the `rasterio.env` module to store the GDAL and AWS configurations we need to access the data in Earthdata Cloud. While the context manager is open (`rio_env.__enter__()`), we can run the open or get data commands that would typically be executed within a `with` statement, which lets us interact with the data more freely. We’ll close the context (`rio_env.__exit__()`) at the end of the notebook.

GDAL environment variables must be configured to access Earthdata Cloud data assets. Geospatial data access Python packages like `rasterio` and `rioxarray` depend on GDAL, leveraging GDAL's "Virtual File Systems" to read remote files. GDAL has many environment variables that control its behavior. Changing these settings can mean the difference between being able to access a file or not. They can also affect performance.
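
To make the idea concrete, here is a hedged sketch of a few GDAL configuration options that are often set when reading cloud-hosted GeoTIFFs. The option names are real GDAL configuration options, but the specific values are assumptions for illustration and are not necessarily the settings used in this notebook.

```python
import os

# Illustrative values only; tune them for your own environment.
gdal_config = {
    # Do not list the other files in the remote directory when opening a file.
    "GDAL_DISABLE_READDIR_ON_OPEN": "EMPTY_DIR",
    # Restrict remote file probing to files with this extension.
    "CPL_VSIL_CURL_ALLOWED_EXTENSIONS": "TIF",
    # Retry transient HTTP failures, waiting between attempts (seconds).
    "GDAL_HTTP_MAX_RETRY": "10",
    "GDAL_HTTP_RETRY_DELAY": "0.5",
}

# Build a copy of the current environment with the GDAL options added.
# The copy can be handed to a child process without touching os.environ.
env = dict(os.environ)
env.update(gdal_config)

print({key: env[key] for key in gdal_config})
```

Working on a copy of the environment keeps the settings scoped to whatever process receives it, rather than changing the whole Python session.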

In [None]:
#

### Read In and Process STAC Asset Links

In the previous section, we used the NASA CMR-STAC API to discover HLS assets that intersect with our search criteria, i.e., ROI, date range, and collections. The search results were filtered and saved as text files by individual bands for each tile. We will read in the text files for tile **T13TGF** for the **RED** (L30: B04 & S30: B04), **NIR** (L30: B05 & S30: B8A), and **Fmask** bands.

#### List text files with HLS links

In [None]:
#

#### Read in our asset links for **B04** (RED)

In [None]:
#

#### Read in and combine our asset links for **B05** (Landsat NIR) and **B8A** (Sentinel-2 NIR)

The near-infrared (NIR) band for Landsat is **B05** while the NIR band for Sentinel-2 is **B8A**. In the next step we will read in and combine the lists into a single NIR list.
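
Combining the two lists can be sketched as follows. The file names, the links, and the helper `read_links` are hypothetical; the example writes two small text files to a temporary directory so that it runs on its own.

```python
import tempfile
from pathlib import Path


def read_links(path):
    """Return the non-empty lines of a text file, stripped of whitespace."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


# Hypothetical link files, one per NIR band, created only for this example.
with tempfile.TemporaryDirectory() as tmp:
    b05_file = Path(tmp) / "B05_links.txt"
    b8a_file = Path(tmp) / "B8A_links.txt"
    b05_file.write_text("s3://example-bucket/HLSL30/scene_a.B05.tif\n")
    b8a_file.write_text("s3://example-bucket/HLSS30/scene_b.B8A.tif\n\n")

    # Concatenate the Landsat and Sentinel-2 NIR links into one list.
    nir_links = read_links(b05_file) + read_links(b8a_file)

print(nir_links)
```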

In [None]:
#

#### Read in our asset links for **Fmask**

In [None]:
#

In this example we will use the `gdalbuildvrt` utility to create a time series virtual raster format (VRT) file. The utility, however, expects the links to be formatted with the GDAL virtual file system (VSI) path, rather than the actual asset links. We will therefore use the VSI path to access our assets. The example below shows the VSI path substitution for S3 (`vsis3`) links.

```text
/vsis3/lp-prod-protected/HLSS30.015/HLS.S30.T13TGF.2020191T172901.v1.5.B04.tif
``` 

See the [GDAL Virtual File Systems](https://gdal.org/user/virtual_file_systems.html) documentation for more information regarding GDAL VSI.
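
The substitution itself is a simple prefix swap. The sketch below assumes the asset links are stored in `s3://bucket/key` form; the helper name `to_vsis3` is hypothetical.

```python
def to_vsis3(s3_link):
    """Convert an s3://bucket/key link to GDAL's /vsis3/bucket/key form."""
    prefix = "s3://"
    if not s3_link.startswith(prefix):
        raise ValueError(f"Not an S3 link: {s3_link!r}")
    return "/vsis3/" + s3_link[len(prefix):]


# Assumed s3:// form of the asset shown in the example path above.
link = "s3://lp-prod-protected/HLSS30.015/HLS.S30.T13TGF.2020191T172901.v1.5.B04.tif"
vsi_path = to_vsis3(link)
print(vsi_path)
```

Raising on an unexpected prefix avoids silently writing malformed paths into the input file list.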

#### Write out a new text file containing the `vsis3` path

In [None]:
#

In [None]:
#

In [None]:
#

### Read in GeoJSON for subsetting

We will use the input GeoJSON file to `clip` the source data to our desired region of interest.

In [None]:
#

To `clip` the source data to our input feature boundary, we need to transform the feature boundary from its original WGS84 coordinate reference system to the projected reference system of the source HLS file (i.e., UTM Zone 13). 
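
Before reprojecting, it helps to know which projected CRS to target. The sketch below derives the WGS84 UTM EPSG code from a longitude and latitude using the standard 6-degree zone rule. The helper `utm_epsg` and the sample coordinate are hypothetical, and the rule ignores the special zone exceptions around Norway and Svalbard. The reprojection itself would still need a library such as `pyproj`.

```python
import math


def utm_epsg(lon, lat):
    """Return the WGS84 UTM EPSG code for a longitude/latitude in degrees."""
    # UTM zones are 6 degrees wide, numbered 1 to 60 starting at 180 W.
    zone = int(math.floor((lon + 180.0) / 6.0)) % 60 + 1
    # EPSG 326xx covers the northern hemisphere, 327xx the southern.
    base = 32600 if lat >= 0 else 32700
    return base + zone


# Hypothetical point that falls inside UTM Zone 13 North.
epsg_code = utm_epsg(-104.0, 40.0)
print(f"EPSG:{epsg_code}")
```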

In [None]:
#

#### Transform GeoJSON feature from WGS84 to UTM

In [None]:
#

### Direct S3 Data Access

#### Start up a Dask client

In [None]:
#

In [None]:
#

There are multiple ways to read COG data in as a time series. The `subprocess` package is used in this example to run GDAL's build virtual raster file (`gdalbuildvrt`) executable outside our Python session. First we’ll need to construct a string object with the command and its parameters (including our temporary credentials). Then, we run the command using the `subprocess.call()` function.

#### Build GDAL VRT Files

##### Construct the GDAL VRT call

In [None]:
#

We now have a fully configured `gdalbuildvrt` string that we can pass to Python's `subprocess` module to run the `gdalbuildvrt` executable outside our Python environment.
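
A hedged sketch of how such a command might be assembled is shown below. It builds an argument list rather than a single string, which `subprocess.call()` also accepts. The use of `-separate` (one band per input file) is an assumption, and the function `build_vrt_command`, the file names, and the placeholder credentials are hypothetical. Nothing is executed here.

```python
import shlex


def build_vrt_command(out_vrt, file_list, config):
    """Assemble a gdalbuildvrt call as an argument list (not executed here)."""
    cmd = ["gdalbuildvrt"]
    # Pass each setting as a GDAL --config KEY VALUE pair.
    for key, value in config.items():
        cmd += ["--config", key, value]
    # -separate places each input file in its own band of the VRT.
    cmd += ["-separate", "-input_file_list", file_list, out_vrt]
    return cmd


# Placeholder values only; real credentials should never be hard-coded.
placeholder_config = {
    "AWS_ACCESS_KEY_ID": "EXAMPLE_KEY_ID",
    "AWS_SECRET_ACCESS_KEY": "EXAMPLE_SECRET",
    "AWS_SESSION_TOKEN": "EXAMPLE_TOKEN",
}

cmd = build_vrt_command("red_stack.vrt", "red_vsis3_links.txt", placeholder_config)

# shlex.join gives a shell-safe string for inspection. With real credentials,
# avoid printing it, since the secrets would appear in the notebook output.
print(shlex.join(cmd))
```

An argument list avoids shell quoting problems with paths or tokens that contain special characters.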

#### Execute `gdalbuildvrt` to construct a VRT on disk from the `S3` links

In [None]:
#

`0` means success! We'll have some troubleshooting to do if you get any other value. In this tutorial, the path for the output VRT file or the input file list are the first things to check.

While we're here, we'll build the VRT files for the NIR layers and the Fmask layers.

In [None]:
#

In [None]:
#

### Reading in an HLS time series

We can now read the VRT files into our Python session. A drawback of reading VRTs into Python is that the `time` coordinate variable needs to be constructed. Below we not only read in the VRT file using `rioxarray`, but we also repurpose the `band` variable, which is generated automatically, to hold our time information.
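
One way to construct the time values is to parse them from the file names. The sketch below assumes the HLS naming pattern seen in the VSI path earlier, where the fourth dot-separated field holds the year, day of year, and time of day. The helper `time_from_hls_name` is hypothetical.

```python
from datetime import datetime


def time_from_hls_name(path):
    """Parse the acquisition time from an HLS file name or path."""
    name = path.rsplit("/", 1)[-1]
    # Fields: HLS . S30 . T13TGF . 2020191T172901 . v1 . 5 . B04 . tif
    stamp = name.split(".")[3]
    # %Y is the four-digit year and %j the three-digit day of year.
    return datetime.strptime(stamp, "%Y%jT%H%M%S")


path = "/vsis3/lp-prod-protected/HLSS30.015/HLS.S30.T13TGF.2020191T172901.v1.5.B04.tif"
acquired = time_from_hls_name(path)
print(acquired.isoformat())
```

Applying this to every link in the input file list, in the same order, yields one time value per band of the VRT.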

#### Read the RED VRT in as xarray with Dask backing

In [None]:
#

Above we use the `chunks` parameter of the `rioxarray.open_rasterio()` function to enable the Dask backing. This allows *lazy reading* of the data, which means the data is not actually read into memory at this point. What we have is an object with some metadata and a pointer to the source data. The data will be streamed to us when we call for it, but it is not stored in memory until we call the Dask `compute()` or `persist()` methods.

#### Print out the `time` coordinate

In [None]:
#

#### Clip out the ROI and persist the result in memory

Up until now, we haven't read any of the HLS data into memory. Now we will use the `persist()` method to load the data into memory.

In [None]:
#

Above, we persisted the clipped results to memory using the `persist()` method. This doesn't necessarily need to be done, but it will substantially improve the performance of the visualization of the time series below. 

#### Plot `red_clip` with `hvplot`

In [None]:
#

### Read in the NIR and Fmask VRT files

In [None]:
#

In [None]:
#

### Create an `xarray dataset`

We will now combine the **RED**, **NIR**, and **Fmask** arrays into a dataset and create/add a new NDVI variable.
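
NDVI is the normalized difference of the NIR and RED reflectance, (NIR - RED) / (NIR + RED). The sketch below shows the arithmetic on plain Python numbers with hypothetical sample values; on an `xarray` dataset the same expression is applied element-wise across the arrays.

```python
import math


def ndvi(red, nir):
    """Return (nir - red) / (nir + red), or NaN when the sum is zero."""
    total = nir + red
    if total == 0:
        return math.nan
    return (nir - red) / total


# Hypothetical reflectance values for three pixels.
red_values = [0.05, 0.10, 0.0]
nir_values = [0.45, 0.30, 0.0]

ndvi_values = [ndvi(r, n) for r, n in zip(red_values, nir_values)]
print(ndvi_values)
```

Returning NaN for a zero denominator keeps empty pixels from raising an error or producing a misleading value.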

In [None]:
#

Above, we created a new NDVI variable. Now, we will clip and plot our results.

In [None]:
#

#### Plot NDVI

In [None]:
#

You may have noticed that the images for some of the time steps are 'blurrier' than others. This is because they are contaminated in some way, be it by clouds, cloud shadows, snow, or ice.

### Apply quality filter

We want to keep NDVI data values where Fmask equals 0 (no clouds, no cloud shadow, no snow/ice, no water).
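
The filter can be sketched on plain Python lists: keep a value where the matching Fmask value is 0 and replace it with NaN otherwise. The helper `mask_by_fmask` and the sample values are hypothetical.

```python
import math


def mask_by_fmask(values, fmask):
    """Keep values whose Fmask flag is 0; replace the rest with NaN."""
    return [v if q == 0 else math.nan for v, q in zip(values, fmask)]


# Hypothetical NDVI values and Fmask flags for four pixels.
ndvi_values = [0.8, 0.5, 0.3, 0.6]
fmask_values = [0, 2, 0, 64]

ndvi_clean = mask_by_fmask(ndvi_values, fmask_values)
print(ndvi_clean)
```

Using NaN rather than dropping entries preserves the shape of the data, so the filtered values stay aligned with their pixels and time steps.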

In [None]:
#

In [None]:
#

#### Aggregate by month

Finally, we will use xarray's `groupby` operation to aggregate by month.
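
The idea behind a monthly aggregation can be sketched without `xarray`: bucket the values by the month of their time stamp, then average each bucket while ignoring NaN. The function `monthly_mean` and the sample series are hypothetical.

```python
import math
from collections import defaultdict
from datetime import datetime


def monthly_mean(times, values):
    """Average the non-NaN values for each calendar month number."""
    groups = defaultdict(list)
    for t, v in zip(times, values):
        if not math.isnan(v):
            groups[t.month].append(v)
    return {month: sum(vs) / len(vs) for month, vs in sorted(groups.items())}


# Hypothetical NDVI time series; NaN marks a time step that was filtered out.
times = [
    datetime(2020, 7, 9),
    datetime(2020, 7, 19),
    datetime(2020, 7, 29),
    datetime(2020, 8, 3),
]
values = [0.6, math.nan, 0.8, 0.4]

monthly = monthly_mean(times, values)
print(monthly)
```

Note that grouping by month number pools the same month across different years, which is fine for a single-year series like this one.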

In [None]:
#

In [None]:
#