# Data integration with ICESat-2 - Part I

```{admonition} Learning Objectives
**Goals**
- Identify and locate non-ICESat-2 data sets
- Acquire data from the cloud or via download
- Open data in Pandas and Xarray and learn the basics of working with DataFrames
```

```{admonition} Key Takeaway
You will be able to open a time series of Cloud Optimized GeoTIFFs with multiple bands from the cloud directly into memory in Xarray and visualize them with ICESat-2 and ATM data.
```

For this tutorial, feel free to run the code along with us as we live code by shrinking the Zoom window and splitting your screen (or using two screens). Or you can simply watch the Zoom walkthrough. Don't worry if you fall behind on the code. The notebook is standalone, and you can easily run the code at your own pace another time to catch anything you missed.

We will have one exercise that you can type into a notebook or work out in a separate document. We will also ask some questions that you can respond to in the tutorial Slack channel.

## Computing environment

We'll be using the following open source Python libraries in this notebook:

In [None]:
import ipyleaflet
from ipyleaflet import Map, GeoData, LayersControl,Rectangle, basemaps, basemap_to_tiles, TileLayer, SplitMapControl, Polygon

import ipywidgets
import datetime
import re

In [None]:
# %matplotlib widget
import satsearch
from satsearch import Search
import geopandas as gpd
import ast
import pandas as pd
import geoviews as gv
import hvplot.pandas
from ipywidgets import interact
from IPython.display import display, Image
import intake # if you've installed intake-STAC, it will automatically import alongside intake
import xarray as xr
import matplotlib.pyplot as plt
import boto3
import rasterio as rio
from rasterio.session import AWSSession
from rasterio.plot import show
import rioxarray as rxr
from dask.utils import SerializableLock
import os
import hvplot.xarray
import numpy as np
from pyproj import Proj, transform

# Suppress library deprecation warnings
import warnings
warnings.filterwarnings('ignore')

## 1. Identify and acquire the ICESat-2 product(s) of interest

* What is the application of this product?
* What region and resolution is needed?

#### Download ICESat-2 ATL06 data from desired region

Remember icepyx? We are going to use that again to download some ICESat-2 ATL06 data over our region of interest.

In [None]:
import icepyx as ipx

In [None]:
# Specifying the necessary icepyx parameters
short_name = 'ATL06'
spatial_extent = 'hackweek_kml_jakobshavan.kml' # KML polygon centered on Jakobshavn
date_range = ['2019-04-01', '2019-04-30']
rgts = ['338'] # IS-2 RGT of interest

You may notice that we specified a reference ground track (RGT). As seen below, a large number of ICESat-2 overpasses occur over Jakobshavn. In the interest of time (and computer memory), we are going to look at only one of these tracks.

In [None]:
# Show the area of interest on a map (the data visualization tutorial covers this in more depth)
center = [69.2, -50]
zoom = 7

# Open KML file for visualizing
gpd.io.file.fiona.drvsupport.supported_drivers['KML'] = 'rw'
jk = gpd.read_file(spatial_extent, driver='KML')

m = Map(basemap=basemap_to_tiles(basemaps.NASAGIBS.ModisAquaTrueColorCR, '2020-07-18'),center=center,zoom=zoom)
geo_data = GeoData(geo_dataframe = jk)

m.add_layer(geo_data)
m.add_control(LayersControl())
m

In [None]:
# Setup the Query object
region = ipx.Query(short_name, spatial_extent, date_range, tracks=rgts)

# Show the available granules
region.avail_granules(ids=True)

Looks like we have an ICESat-2 track! Let's quickly visualize the data to ensure that there are no clouds.

In [None]:
# Request information from OpenAltimetry
cyclemap, rgtmap = region.visualize_elevation()

rgtmap

Looks good! Now it's time to download the data.

In [None]:
# Set Earthdata credentials
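# Placeholders: replace with your own Earthdata username and email
uid = 'your_earthdata_username'
email = 'your_email@example.com'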
region.earthdata_login(uid, email)

# Order the granules
region.order_granules()

In [None]:
# Download the data
path = '/home/jovyan/website2022/book/tutorials/DataIntegration/'
region.download_granules(path)

In [None]:
import h5py

# Load the ICESat-2 data. We will just look at the central beams (GT2R/L)
is2_file = 'processed_ATL06_20190420093051_03380303_005_01_full.h5'
with h5py.File(is2_file, 'r') as f:
    is2_gt2r = pd.DataFrame(data={'lat': f['gt2r/land_ice_segments/latitude'][:],
                                  'lon': f['gt2r/land_ice_segments/longitude'][:],
                                  'elev': f['gt2r/land_ice_segments/h_li'][:]})
    is2_gt2l = pd.DataFrame(data={'lat': f['gt2l/land_ice_segments/latitude'][:],
                                  'lon': f['gt2l/land_ice_segments/longitude'][:],
                                  'elev': f['gt2l/land_ice_segments/h_li'][:]})
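
The two DataFrames above are built the same way, so the pattern can be wrapped in a small helper. The sketch below is an illustration and not part of the original workflow: `beam_to_frame` is a hypothetical name, and it assumes that invalid `h_li` heights are stored as a very large fill value (around 3.4e38), which it replaces with NaN. It accepts anything that supports `f[path][:]`, so a plain dictionary of NumPy arrays with made-up values stands in for the open HDF5 file.

```python
import numpy as np
import pandas as pd

def beam_to_frame(f, beam, fill_threshold=1e38):
    """Read latitude, longitude and height for one beam into a DataFrame.

    `f` is anything indexable like an open h5py.File (f[path][:]).
    Heights at or above `fill_threshold` are treated as fill values (an assumption).
    """
    group = f'{beam}/land_ice_segments'
    df = pd.DataFrame({'lat': f[f'{group}/latitude'][:],
                       'lon': f[f'{group}/longitude'][:],
                       'elev': f[f'{group}/h_li'][:]})
    # Replace fill values with NaN so they do not skew statistics or plots
    df['elev'] = df['elev'].where(df['elev'] < fill_threshold, np.nan)
    return df

# Stand-in for an HDF5 file: a dictionary of arrays with made-up values
fake_file = {
    'gt2l/land_ice_segments/latitude': np.array([69.10, 69.11, 69.12]),
    'gt2l/land_ice_segments/longitude': np.array([-49.50, -49.51, -49.52]),
    'gt2l/land_ice_segments/h_li': np.array([120.5, 3.4028235e38, 118.2]),
}
demo_gt2l = beam_to_frame(fake_file, 'gt2l')
print(demo_gt2l)
```

With a real granule, the same call would be made inside the `with h5py.File(...)` block, once per beam.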

## 2. Identify other products of interest

### Question - Respond in Slack
What research problems have you wanted to address that require more than one dataset?

### Where are other data sets stored?

* Cloud datasets in AWS 
   https://registry.opendata.aws/ 
* NASA EarthData
   https://search.earthdata.nasa.gov/search/
* ESA Copernicus Hub
   https://scihub.copernicus.eu
* Etc.

More on this in the Cloud Computing Tools tutorial

Today, we will show ATM (non-AWS) and Landsat (AWS)

## 3. Acquire non-cloud data and open:  ATM data access

Why did we choose April 2019 and RGT 338? In Spring 2019, an airborne campaign called Operation IceBridge was flown across Jakobshavn as validation for ICESat-2. Onboard was the Airborne Topographic Mapper (ATM), a lidar that operates at both 532 nm (like ICESat-2) and 1064 nm (near-infrared). More information about Operation IceBridge and ATM may be found here: https://nsidc.org/data/icebridge

Here, we are going to try and co-register ATM spot measurements with ICESat-2. Because both data sets are rather large, this can be computationally expensive, so we will only consider one flight track with the ATM 532 nm beam.

Operation IceBridge data is not available on the cloud, so this data was downloaded directly from NSIDC. If you are interested in using IceBridge data, NSIDC has a useful data portal here: https://nsidc.org/icebridge/portal/map

In [None]:
# Load the ATM data into a DataFrame
atm_file = 'ILATM2_20190506_151600_smooth_nadir3seg_50pt.csv'
atm_l2 = pd.read_csv(atm_file)

atm_l2.head()

We opened this data into a Pandas DataFrame, which is a handy tool for Earth data exploration and analysis. The column names derive automatically from the first row of the csv and each row corresponds to an ATM measurement.

### Opening and manipulating data in Pandas

Pandas excels at helping you explore, clean, and process tabular data, such as data stored in spreadsheets or databases. In pandas, a Series is a 1-D data table and a DataFrame is the 2-D data table, which we just saw above.

`read_csv`, the function we used above, is the easiest way to open a CSV file into a Pandas DataFrame. We can specify formatting, data selection, indexing, and much more when reading data into a DataFrame. Below we read in the data again, specifying new column names (`names=headers`), assigning the first column as the index (`index_col=0`), and marking the first row as the header (`header=0`) so that, even though we are renaming the columns, the original header row is not mistaken for data.

In [None]:
# Load data with specific headers
headers = ['UTC', 'Lat', 'Lon',
       'Height', 'South-to-North_Slope',
       'West-to-East_Slope', 'RMS_Fit', 'Number_Measurements',
       'Number_Measurements_Removed',
       'Distance_Of_Block_To_The_Right_Of_Aircraft', 'Track_Identifier']
atm_l2 = pd.read_csv(atm_file, names=headers, index_col=0, header=0)
atm_l2.head()
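
To see what `header=0` does when combined with `names=`, here is a minimal sketch on a tiny in-memory CSV. The column names and values are made up for illustration and are not taken from the ATM file.

```python
import io
import pandas as pd

# A tiny made-up CSV: one header row followed by two data rows
csv_text = """UTC_Seconds_Of_Day,Latitude(deg),Longitude(deg),Height(m)
54969.50,69.10,310.50,120.5
54969.75,69.11,310.49,121.0
"""

new_names = ['UTC', 'Lat', 'Lon', 'Height']

# header=0 marks the first row as a header to discard; names= supplies the replacements
renamed = pd.read_csv(io.StringIO(csv_text), names=new_names, index_col=0, header=0)

# Without header=0, the original header row is read as an extra row of data
misread = pd.read_csv(io.StringIO(csv_text), names=new_names, index_col=0)

print(renamed)
print(len(renamed), len(misread))
```

The second read keeps the header text as a data row, which also turns every column into strings, so it is worth checking the row count and dtypes after renaming columns this way.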

Now we can explore the data and DataFrame functions...

In [None]:
# Find out the names of the columns
atm_l2.columns

In [None]:
# Show data in only one of those columns
atm_l2['Lat'].head()

In [None]:
# Same thing, but another way
atm_l2.Lat.head()

If we want something more intuitive for our index, we can create a column of datetime objects that combines the date from the ATM file name with the seconds-of-day values in the `UTC` column.

In [None]:
# Reset the index
atm_l2 = atm_l2.reset_index()

# Use the date in the string of the file name to create a datetime of the date
date = pd.to_datetime(atm_file[7:15])

# Use UTC seconds_of_day column to calculate time of day that we will use to add time to the datetime
time = pd.to_timedelta(atm_l2.UTC, unit='s')

# Add the time to date and set as index
atm_l2['DateTime'] = date + time
atm_l2 = atm_l2.set_index('DateTime')
atm_l2.head()
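
The date arithmetic above can be checked in isolation. This sketch uses the same file name but made-up seconds-of-day values, and passes an explicit `format` to `pd.to_datetime` (an addition here, so the parsing does not rely on format inference).

```python
import pandas as pd

file_name = 'ILATM2_20190506_151600_smooth_nadir3seg_50pt.csv'

# Characters 7 through 14 of the file name hold the date as YYYYMMDD
date = pd.to_datetime(file_name[7:15], format='%Y%m%d')

# Made-up seconds-of-day values standing in for the UTC column
seconds_of_day = pd.Series([54969.5, 54969.75])

# A Timestamp plus a Series of Timedeltas gives a Series of Timestamps
stamps = date + pd.to_timedelta(seconds_of_day, unit='s')
print(stamps)
```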

Now we can easily slice data by date, month, day, etc.

In [None]:
atm_l2.loc['2019-05-06'].head()

Or by a range of dates:

In [None]:
atm_l2.loc['2019-05-06':'2019-05-07']

#### Slicing

Two methods for slicing:

* .loc for label-based indexing
* .iloc for positional indexing

Use loc to slice specific indices

In [None]:
atm_l2.index

In [None]:
atm_l2.loc['2019-05-06 15:16:09.500']

Select indices and columns

In [None]:
atm_l2.loc['2019-05-06 15:16:09.500', ['Lat', 'Lon', 'Height']]

Use iloc to select by row and column number

In [None]:
atm_l2.iloc[[0,1],[2,3,4]]
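
One difference between the two indexers is easy to miss: a `.loc` slice includes its end label, while an `.iloc` slice excludes its end position. The sketch below shows this on a small made-up DataFrame with a datetime index.

```python
import pandas as pd

times = pd.to_datetime(['2019-05-06 15:16:09.500',
                        '2019-05-06 15:16:09.750',
                        '2019-05-06 15:16:10.000'])
demo = pd.DataFrame({'Lat': [69.10, 69.11, 69.12],
                     'Lon': [-49.50, -49.51, -49.52],
                     'Height': [120.5, 121.0, 119.8]}, index=times)

# Label-based: the end label of the slice is included
by_label = demo.loc['2019-05-06 15:16:09.500':'2019-05-06 15:16:09.750', ['Lat', 'Height']]

# Position-based: the end position is excluded, as with Python lists
by_position = demo.iloc[0:2, [0, 2]]

print(by_label.equals(by_position))
```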

#### Statistical manipulations

Say we want to know the mean ATM height for the data...

In [None]:
atm_l2['Height'].mean()

In [None]:
atm_l2['Height'].describe()

In [None]:
atm_l2['Number_Measurements'].sum()

If we want to know the mean of each column grouped for each Track Identifier...

In [None]:
# Group rows together by a column and take the mean of each group
atm_l2.groupby('Track_Identifier').mean()

Groupby is pretty complex. You can dive deeper here: 
    https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html

You can also resample your data to get the mean of all measurements at 30 second intervals...

In [None]:
atm_l2.resample('30S').mean()
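
Both `groupby` and `resample` follow the same split-apply-combine idea: one splits on the values of a column, the other on time bins of the index. A minimal sketch on made-up data, small enough to verify by hand (the bin width is passed as a `pd.Timedelta` here instead of a frequency string):

```python
import pandas as pd

# Six made-up measurements, 15 seconds apart, alternating between two tracks
times = pd.date_range('2019-05-06 15:16:00', periods=6, freq=pd.Timedelta(seconds=15))
demo = pd.DataFrame({'Height': [100.0, 102.0, 104.0, 106.0, 108.0, 110.0],
                     'Track_Identifier': [0, 1, 0, 1, 0, 1]}, index=times)

# Split on a column: one mean per track
track_means = demo.groupby('Track_Identifier')['Height'].mean()

# Split on time: one mean per 30-second bin
half_minute_means = demo['Height'].resample(pd.Timedelta(seconds=30)).mean()

print(track_means)
print(half_minute_means)
```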

To use your own functions, one might first try to use a for loop to iterate over rows or columns of data. Pandas has made an easy and fast alternative, __apply()__. This function acts as a map() function in Python. It takes a function as an input and applies the function to an entire DataFrame. 

Something easy could be to take the median of each column of the data. We specify the np.median function and axis=0 to pass in all rows and iterate over each column.

In [None]:
atm_l2.apply(np.median, axis=0)

Say you want to use only specific rows or columns in your function. For instance, you want to know the total number of measurements (i.e. Number_Measurements + Number_Measurements_Removed). The cell below defines a function, `calc_total`, that takes two values and adds them together. We then want to apply it to the data.

First, we call the .apply() method on the atm_l2 dataframe. Then use the lambda function to iterate over the rows (or columns) of the dataframe. For every row, we grab the Number_Measurements and Number_Measurements_Removed columns and pass them to the calc_total function. Finally, we specify the axis=1 to tell the .apply() method that we are passing in columns to apply the function to each row.

In [None]:
def calc_total(a,b):
    total = a + b
    return total

atm_l2['Total_Measurements'] = atm_l2.apply(lambda row: calc_total(row['Number_Measurements'], row['Number_Measurements_Removed']), axis=1)
atm_l2.head()
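
For simple arithmetic like this, operating on whole columns gives the same result as a row-wise `apply` and is usually much faster, because the loop runs inside NumPy and not in Python. A small comparison on made-up counts:

```python
import pandas as pd

demo = pd.DataFrame({'Number_Measurements': [50, 48, 47],
                     'Number_Measurements_Removed': [0, 2, 3]})

# Row by row with apply, as in the cell above
via_apply = demo.apply(
    lambda row: row['Number_Measurements'] + row['Number_Measurements_Removed'], axis=1)

# Vectorized: add the two columns in one operation
via_columns = demo['Number_Measurements'] + demo['Number_Measurements_Removed']

print(via_apply.tolist(), via_columns.tolist())
```

`apply` remains the practical choice when the function cannot be written with column operations, for example when it relies on scalar-only functions from the `math` module.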

## Exercise

Try making .apply work with this new function to create a new column, 'Distance_to_Jakobshavn', using the Lat and Lon columns as inputs. We've already supplied the function, which requires a latitude and longitude input to calculate the distance between the ATM measurement and a specified point on the Jakobshavn Glacier ice front (`jlat`/`jlon`). Complete the line that applies the function to those columns in the data to get the Distance_to_Jakobshavn.

In [None]:
from math import sin, cos, sqrt, atan2, radians

def distance(a,b):
    '''
    Calculate distance between a set point and a lat lon pair from the data
    a = lat
    b = lon
    '''
    
    jlat,jlon = 69.2330, -49.2434
    
    # approximate radius of earth in km
    R = 6373.0

    lat1 = radians(jlat)
    lon1 = radians(jlon)
    lat2 = radians(a)
    lon2 = radians(b)

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))

    distance = R * c
    return distance

atm_l2['Distance_to_Jakobshavn']  = .....
atm_l2.head()
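
Before applying a function like `distance` to thousands of rows, it can help to check it against cases with a known answer. The sketch below restates the same haversine formula as a standalone function (`haversine_km` is a name introduced here for illustration, with both points passed as arguments) and checks two simple cases: identical points are 0 km apart, and one degree of latitude along a meridian is about 111 km.

```python
from math import sin, cos, sqrt, atan2, radians

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6373.0):
    """Great-circle distance in km between two lat/lon points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2) - radians(lon1)
    a = sin(dphi / 2)**2 + cos(phi1) * cos(phi2) * sin(dlam / 2)**2
    return radius_km * 2 * atan2(sqrt(a), sqrt(1 - a))

jlat, jlon = 69.2330, -49.2434
same_point = haversine_km(jlat, jlon, jlat, jlon)
one_degree_north = haversine_km(jlat, jlon, jlat + 1.0, jlon)
print(round(same_point, 3), round(one_degree_north, 1))
```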

Great work! Now let's reset the index to start fresh for multi-indexing.

In [None]:
atm_l2 = atm_l2.reset_index()
atm_l2.head()

#### Multi-indexing

Multi-level indexing opens the door to more sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to store and manipulate data with an arbitrary number of dimensions in lower dimensional data structures like Series (1d) and DataFrame (2d).

Here, we will demonstrate a few basic things you can do with MultiIndexing. If we wanted to think about our data by DateTime and then by Track Identifier, we would set both columns as indices...

In [None]:
atm_l2 = atm_l2.set_index(['DateTime','Track_Identifier']).sort_index()
atm_l2

For some terminology, the `levels` of a MultiIndex are the former column names (DateTime, Track_Identifier). The `labels` are the actual values in a `level` (2019-05-06 15:16:09.500, 0-3, ...). `Levels` can be referred to by name or position, with 0 being the outermost level.

Slicing the outermost index level is pretty easy: we just use our regular .loc[row_indexer, column_indexer] to grab the datetime we want. We'll select the columns Lat and Lon where the DateTime was `2019-05-06 15:16:09.500`.

In [None]:
atm_l2.loc['2019-05-06 15:16:09.500',['Lat','Lon']]

If you wanted to select the rows whose track was 0 or 1, .loc wants [row_indexer, column_indexer], so let's wrap the two elements of our row indexer (the DateTime and the list of tracks) in a tuple to make it a single unit:

In [None]:
atm_l2.loc[('2019-05-06 15:16:09.500',[0,1]),['Lat','Lon']]

Now you want info from any DateTime but for the specific tracks again (0,1). Below, the : indicates all labels in this level.

In [None]:
atm_l2.loc[pd.IndexSlice[:,[0,1]], ['Lat','Lon']]
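
The same selections can be tried on a small made-up DataFrame, where the result is easy to check by eye. This sketch builds a two-level index, lists the labels of one level by name, and then uses `pd.IndexSlice` to keep tracks 0 and 1 at every DateTime.

```python
import pandas as pd

times = pd.to_datetime(['2019-05-06 15:16:09.500'] * 3 + ['2019-05-06 15:16:09.750'] * 3)
demo = pd.DataFrame({'DateTime': times,
                     'Track_Identifier': [0, 1, 2, 0, 1, 2],
                     'Lat': [69.10, 69.11, 69.12, 69.13, 69.14, 69.15]})
demo = demo.set_index(['DateTime', 'Track_Identifier']).sort_index()

# Levels can be addressed by name; labels are the values within a level
track_labels = demo.index.get_level_values('Track_Identifier').unique().tolist()

# Every DateTime (the bare colon), but only tracks 0 and 1
subset = demo.loc[pd.IndexSlice[:, [0, 1]], ['Lat']]

print(track_labels)
print(subset)
```

Sorting the index first, as done above with `sort_index()`, matters: slicing a MultiIndex that is not sorted can raise an error or be slow.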

You can do a lot with groupby, pivoting, and reshaping, but I won't dive into that in this tutorial. You can check out more here: https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html

Now that we are oriented with the Pandas DataFrame, let's get back to the ATM data we have opened and co-register it with the ICESat-2 lines.

### Co-register ICESat-2 with ATM data

In [None]:
# Load the ATM data into a DataFrame
atm_file = 'ILATM2_20190506_151600_smooth_nadir3seg_50pt.csv'
atm_l2 = pd.read_csv(atm_file)

atm_l2.head()

The ATM L2 file contains plenty of information, including surface height estimates and slope of the local topography. It also contains a track identifier - ATM takes measurements from multiple parts of the aircraft, namely starboard, port, and nadir. To keep things simple, we will filter the DataFrame to only look at the nadir track (Track_Identifier = 0).

In [None]:
# Keep only the nadir track; .copy() avoids a SettingWithCopyWarning on the edit below
atm_l2 = atm_l2[atm_l2['Track_Identifier']==0].copy()

# Change the longitudes to be consistent with ICESat-2
atm_l2['Longitude(deg)'] -= 360

print(atm_l2.size)
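
Subtracting 360 works when every longitude is reported in the 0 to 360 convention and is greater than 180, which appears to be the case for this region. A more general conversion, sketched below with a hypothetical helper `wrap_longitude`, maps any longitude into the range [-180, 180) and leaves values already in that range unchanged.

```python
def wrap_longitude(lon_deg):
    """Map a longitude in degrees into the range [-180, 180)."""
    return (lon_deg + 180.0) % 360.0 - 180.0

# Made-up test values: one in 0-360 form, the others already in range or on the edge
examples = [310.35, 10.0, -50.0, 180.0]
wrapped = [round(wrap_longitude(lon), 2) for lon in examples]
print(wrapped)
```

Because the function only uses arithmetic operators, it should also work unchanged on a whole Pandas column, for example `wrap_longitude(atm_l2['Longitude(deg)'])`.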

Let's take a quick look at where ATM is relative to ICESat-2...

In [None]:
# Subset the ICESat-2 data to the ATM latitudes
is2_gt2r = is2_gt2r[(is2_gt2r['lat']<atm_l2['Latitude(deg)'].max()) & (is2_gt2r['lat']>atm_l2['Latitude(deg)'].min())]
is2_gt2l = is2_gt2l[(is2_gt2l['lat']<atm_l2['Latitude(deg)'].max()) & (is2_gt2l['lat']>atm_l2['Latitude(deg)'].min())]


# Set up a map with the flight tracks as overlays
from ipyleaflet import Map, basemaps, basemap_to_tiles, Polyline

m = Map(
    basemap=basemap_to_tiles(basemaps.Esri.WorldImagery),
    center=(69.25, 310.35-360),
    zoom=10
)

gt2r_line = Polyline(
    locations=[
        [is2_gt2r['lat'].min(), is2_gt2r['lon'].max()],
        [is2_gt2r['lat'].max(), is2_gt2r['lon'].min()]
    ],
    color="green" ,
    fill=False
)
m.add_layer(gt2r_line)

gt2l_line = Polyline(
    locations=[
        [is2_gt2l['lat'].min(), is2_gt2l['lon'].max()],
        [is2_gt2l['lat'].max(), is2_gt2l['lon'].min()]
    ],
    color="green" ,
    fill=False
)
m.add_layer(gt2l_line)

atm_line = Polyline(
    locations=[
        [atm_l2['Latitude(deg)'].min(), atm_l2['Longitude(deg)'].max()],
        [atm_l2['Latitude(deg)'].max(), atm_l2['Longitude(deg)'].min()]
    ],
    color="orange" ,
    fill=False
)
m.add_layer(atm_line)

m

Looks like ATM aligns very closely with the left beam (GT2L), so hopefully the two data sets will agree there. The terrain over this region is quite rough, so we may expect some differences between ATM and GT2R. ICESat-2 also flew over Jakobshavn 16 days before ATM, so there might be slight differences due to ice movement.
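
As a preview of what co-registration can look like in Pandas, here is a minimal sketch that pairs each ICESat-2 point with the nearest ATM point in latitude using `pd.merge_asof`. It is an illustration under strong assumptions, not the method used in the follow-up session: the values are made up, matching on latitude alone only makes sense when the two tracks are nearly coincident, and the 0.0005 degree tolerance is arbitrary.

```python
import pandas as pd

# Made-up along-track samples; real data would come from is2_gt2l and atm_l2
is2 = pd.DataFrame({'lat': [69.100, 69.102, 69.104],
                    'elev_is2': [120.0, 121.0, 122.0]})
atm = pd.DataFrame({'lat': [69.1001, 69.1019, 69.1100],
                    'elev_atm': [120.3, 120.8, 130.0]})

# merge_asof requires both frames to be sorted by the key
paired = pd.merge_asof(is2.sort_values('lat'), atm.sort_values('lat'),
                       on='lat', direction='nearest', tolerance=0.0005)

# Rows with no ATM point within the tolerance get NaN and are dropped
paired = paired.dropna(subset=['elev_atm'])
paired['elev_diff'] = paired['elev_is2'] - paired['elev_atm']
print(paired)
```

A more careful comparison would match points on distance in both latitude and longitude and account for the different footprint sizes of the two instruments.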

We have looked at how we can quickly download ICESat-2 and airborne lidar data, and process them using Pandas. We will engage in a more thorough analysis in the Data Integration session tomorrow.

## 4. Search and open (Landsat) images from the cloud

Next we will show how we can open and manipulate cloud-based Landsat imagery (raster data) for analysis with ICESat-2 data.  

What if you want to search for Landsat 8 data over an area of interest? Browsing through lists of files is cumbersome. Instead of downloading, we can use cloud-optimized approaches that require no downloading to search and access the data.

### Cloud optimized approaches
* Organize data as an aggregation of small, independently retrievable objects (e.g., zarr, HDF, GeoTIFF in the Cloud)
* Allow access to pieces of large objects (e.g., Cloud-Optimized GeoTIFF, OPeNDAP in the Cloud)

We will be working with Cloud Optimized GeoTIFFs (COGs). A COG is a GeoTIFF file with an internal organization that enables more efficient workflows in the cloud environment. It does this by letting clients issue HTTP GET range requests for just the parts of a file they need, instead of having to open the entire image or data set (see more at https://www.cogeo.org/).
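
To make the idea of a range request concrete, the sketch below builds (but does not send) an HTTP request that asks for only the first 16 KiB of a file, using the standard library. The URL is hypothetical, and the 16 KiB figure is just an illustrative choice for reading the start of a file, where a COG keeps its header information.

```python
from urllib.request import Request

# Hypothetical URL used only for illustration
url = 'https://example.com/scene_B4.TIF'

# The Range header names the byte span wanted: here bytes 0 through 16383
first_bytes = Request(url, headers={'Range': 'bytes=0-16383'})
print(first_bytes.get_header('Range'))
```

Libraries such as rasterio issue requests like this behind the scenes, so in practice you rarely write them yourself. A server that supports ranges answers with status 206 (Partial Content) and only the requested bytes.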

An Amazon Web Services (AWS) account is required to directly access Landsat data in the cloud, which is housed in the usgs-landsat S3 requester-pays bucket. Landsat is stored in the AWS US West data center, the same one that hosts our Hub, so egress of Landsat data is free. Users can find all objects within the Landsat record by using either the Satellite (SAT) Application Programming Interface (API) or the SpatioTemporal Asset Catalog (STAC) Browser.

This is an example url for accessing a band in the cloud from an S3 bucket:\
s3://usgs-landsat/collection02/level-2/standard/oli-tirs/2020/202/025/LC08_L1TP_202025_20190420_20190507_01_T1/LC08_L1TP_202025_20190420_20190507_01_T1_B4.TIF

For more information about accessing Landsat S3, this is the User Manual: \
https://d9-wret.s3.us-west-2.amazonaws.com/assets/palladium/production/s3fs-public/atoms/files/LSDS-2032-Landsat-Commercial-Cloud-Direct-Access-Users-Guide-v2.pdf.pdf

### Search for Landsat imagery

To explore and access COGs easily, we will use a SpatioTemporal Asset Catalog (STAC). The STAC specification provides a common, machine-readable (JSON) format for describing a wide range of geospatial datasets. STAC's goal is to make it easier to index and discover any geospatial dataset that can be described by a spatial extent and time. STAC defines how geospatial asset metadata is structured and queried, which makes querying S3 buckets easier. Learn more here: https://github.com/radiantearth/stac-spec.
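
For a feel of what a catalog entry contains, here is a hand-written, heavily trimmed item in the general shape STAC uses: an `id`, a `bbox`, `properties` with a `datetime`, and `assets` that point to the files. All values are illustrative and not taken from the Landsat catalog, and `bbox_intersects` is a hypothetical helper that mimics the spatial filter a search API applies.

```python
import json

# A made-up item; real items carry many more fields
item = {
    'type': 'Feature',
    'id': 'example_scene_001',
    'bbox': [-51.3, 68.9, -48.9, 69.5],   # west, south, east, north
    'properties': {'datetime': '2019-05-06T15:00:00Z', 'eo:cloud_cover': 12.0},
    'assets': {'blue': {'href': 's3://example-bucket/example_scene_001_B2.TIF'}},
}

def bbox_intersects(a, b):
    """True when two (west, south, east, north) boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

search_box = (-50.0, 69.0, -49.0, 69.4)
matches = bbox_intersects(item['bbox'], search_box)
print(matches)
print(json.dumps(item['assets'], indent=2))
```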

In [None]:
# Sets up credentials for acquiring images through dask/xarray
os.environ["AWS_REQUEST_PAYER"] = "requester"

# Sets up proper credentials for acquiring data through rasterio
aws_session = AWSSession(boto3.Session(), requester_pays=True)

Extract geometry bounds from the ICESat-2 KML file used above so that we can perform the Landsat spatial search.

In [None]:
# Extract geometry bounds
geom = jk.geometry[0]
print(geom.bounds)

We will search for imagery in STAC catalog using satsearch: 
https://github.com/sat-utils/sat-search

In [None]:
# Search STAC API for Landsat images based on a bounding box, date and other metadata if desired

bbox = (geom.bounds[0], geom.bounds[1], geom.bounds[2], geom.bounds[3]) #(west, south, east, north) 

timeRange = '2019-05-06/2019-05-07'
    
results = Search.search(url='https://ibhoyw8md9.execute-api.us-west-2.amazonaws.com/prod',
                        collection='usgs-landsat/collection02/',
                        datetime=timeRange,
                        bbox=bbox,    
                        # properties=properties,
                        sort=['<datetime'])

print('%s items' % results.found())
items = results.items()

# Save search to geojson file
gjson_outfile = f'/home/jovyan/website2022/book/tutorials/DataIntegration/Landsat.geojson'
items.save(gjson_outfile)

We can include property searches, such as path, row, cloud-cover, as well with the `properties` flag.

We are given a satsearch collection of items (images)

In [None]:
items

Load the geojson file into geopandas and inspect the items we want to collect

In [None]:
# Load the geojson file
gf = gpd.read_file(gjson_outfile)
gf.head(2)

In [None]:
# Plot search area of interest and frames on a map using Holoviz Libraries (more on these later)
cols = gf.loc[:,('id','landsat:wrs_path','landsat:wrs_row','geometry')]
footprints = cols.hvplot(geo=True, line_color='k', hover_cols=['landsat:wrs_path','landsat:wrs_row'], alpha=0.2, title='Landsat 8 T1',tiles='ESRI')
tiles = gv.tile_sources.CartoEco.options(width=700, height=500) 
labels = gv.tile_sources.StamenLabels.options(level='annotation')
tiles * footprints * labels

### Intake all scenes using the intake-STAC library
Intake-STAC facilitates discovering, exploring, and loading spatio-temporal datasets.

Intake-STAC provides Intake Drivers for SpatioTemporal Asset Catalogs (STAC). This provides a simple toolkit for working with STAC catalogs and for loading STAC assets as Xarray objects.

In [None]:
catalog = intake.open_stac_item_collection(items)
list(catalog)

Let's explore the metadata and keys for one of the scenes

In [None]:
sceneids = list(catalog)
item3 = catalog[sceneids[3]]
item3.metadata
# for key in item3.keys():
#     print(key)

We can explore the metadata for any of these:

In [None]:
item3['blue'].metadata

In [None]:
# This is the url needed to grab data from the S3 bucket
# From the satsearch catalog:
items[3].asset('blue')['alternate']['s3']['href'] # can use item asset name or title (blue or B2)

# From the intake-STAC catalog
item3.blue.metadata['alternate']['s3']['href'] # must use item asset name (blue)

### Open and visualize each image using RasterIO 

In [None]:
import rasterio as rio

# Retrieve the blue band of the scene using rio
item_url = item3.blue.metadata['alternate']['s3']['href']

# Read and plot with grid coordinates 
with rio.Env(aws_session):
    with rio.open(item_url) as src:
        fig, ax = plt.subplots(figsize=(9,8))
        
        # Plot band 1 on the axes created above
        show((src, 1), ax=ax)
        
        # To open data into a numpy array
        profile = src.profile
        arr = src.read(1)

We can open directly into xarray using RasterIO...

### Manipulating data in Xarray

Pandas and Xarray have very similar structures and ways of manipulating data, but Pandas excels with 2-D data and Xarray is ideal for higher-dimensional data. Xarray introduces labels in the form of dimensions, coordinates, and attributes on top of raw NumPy-like arrays.

Xarray has 2 fundamental data structures:

* `DataArray`, which holds a single multi-dimensional variable and its coordinates
* `Dataset`, which holds multiple variables that potentially share the same coordinates

<img src="https://xarray-contrib.github.io/xarray-tutorial/_images/xarray-data-structures.png" width="800px">


Xarray doesn’t just keep track of labels on arrays – it uses them to provide a powerful and concise interface. We will only scratch the surface here on what Xarray can do. To learn more, there are great Xarray tutorials here: https://xarray-contrib.github.io/xarray-tutorial/online-tutorial-series/01_xarray_fundamentals.html

We can use RasterIO to easily open into an Xarray `DataArray`:

In [None]:
rastxr = xr.open_dataarray(item_url,engine='rasterio')
rastxr

Or a `Dataset`:

In [None]:
rastxr = xr.open_dataset(item_url,engine='rasterio')
rastxr

We can open using rioxarray, which integrates RasterIO and Xarray and is the most efficient way of opening using RasterIO:

In [None]:
import rioxarray as rxr

rastrxr = rxr.open_rasterio(item_url)
rastrxr

We can see `Attributes` have been added to the Xarray using the same url.

Beyond what Xarray and Rasterio provide, Rioxarray has these added benefits (plus others):
* It supports multidimensional datasets such as netCDF.
* It loads in the CRS, transform, and nodata metadata in standard CF & GDAL locations.
* It supports masking and scaling data with the masked and mask_and_scale kwargs.
* It loads raster metadata into the attributes.

For more info: https://corteva.github.io/rioxarray/stable/index.html

Another convenient means for opening a lot of raster data into Xarray is using Dask. Xarray integrates with Dask to support parallel computations and streaming computation on datasets that don’t fit into memory. So this is perfect when you want to process a lot of data. 

Dask divides arrays into many small pieces, called chunks, each of which is presumed to be small enough to fit into memory.

Unlike NumPy, which has eager evaluation, operations on Dask arrays are lazy. Operations queue up a series of tasks mapped over blocks, and no computation is performed until you actually ask values to be computed (e.g., to print results to your screen or write to disk). At that point, data is loaded into memory and computation proceeds in a streaming fashion, block-by-block.

More on Dask in the Cloud Computing Tools tutorial.

To expand our Xarray toolbox for working with larger data sets that we don't necessarily want entirely in memory, we will start by reading in 3 bands of a Landsat scene to Xarray using Dask.

In [None]:
sceneid = catalog[sceneids[0]]
print (sceneid.name)

band_names = ['red','green','blue']

bands = []

# Construct xarray for scene
for band_name in band_names:
    # Specify chunk size (x,y), Landsat COG is natively in 512 chunks so is best to use this or a multiple
    band = sceneid[band_name](chunks=dict(band=1, x=2048, y=2048),urlpath=sceneid[band_name].metadata['alternate']['s3']['href']).to_dask()
    band['band'] = [band_name]
    bands.append(band)
scene = xr.concat(bands, dim='band')
scene

We can choose not to specify chunk sizes and have everything read as one chunk. When we load larger sets of imagery, like this one, we can set these chunk sizes so that dask splits up the work. Typically, it's best to align dask chunks with the way image chunks (typically called "tiles") are stored on disk or in cloud storage buckets. The Landsat data is stored on AWS S3 in a tiled GeoTIFF format where tiles are 512x512, so we should pick some multiple of that, and typically aim for chunk sizes of ~100 MB (although this is subjective). You can read more about dask chunks here: https://docs.dask.org/en/latest/array-best-practices.html.
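
The chunk-size guidance can be turned into quick arithmetic. The sketch below assumes 16-bit integer pixels (2 bytes each), which is an assumption about the band data type, and treats the ~100 MB figure as a rough upper target. It computes the in-memory size of one 2048 x 2048 single-band chunk and the largest square chunk edge that is a multiple of 512 and stays under the target.

```python
import math

tile = 512                 # native tile edge of the COG, in pixels
bytes_per_pixel = 2        # assumes 16-bit integer pixels
target_bytes = 100e6       # the rough ~100 MB target

# Size in memory of one 2048 x 2048 single-band chunk
chunk_2048_mib = 2048 * 2048 * bytes_per_pixel / 2**20

# Largest square chunk edge that is a multiple of the tile size and stays under the target
edge = int(math.sqrt(target_bytes / bytes_per_pixel)) // tile * tile
edge_mb = edge * edge * bytes_per_pixel / 1e6

print(chunk_2048_mib, edge, round(edge_mb, 1))
```

Under these assumptions a 2048 x 2048 chunk is 8 MiB, well below the target, which keeps individual tasks small at the cost of creating more of them.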

As with pandas, we can explore variables easily. This time we can work with coordinates (equivalent to indices in pandas). Here `x` is the horizontal map coordinate (an easting in this projected image; for data in geographic coordinates it would be the longitude, and it can be renamed accordingly):

In [None]:
scene.x

We can also keep track of arbitrary metadata (called attributes) in the form of a Python dictionary: 

In [None]:
scene.attrs

In [None]:
scene.crs

We can apply operations over dimensions by name. Here, if we want to slice the data to only have the blue band:

In [None]:
scene.sel(band='blue')

Notice that instead of loc (from Pandas) we are using sel, but they function similarly.

Mathematical operations (e.g., x - y) vectorize across multiple dimensions (array broadcasting) based on dimension names. Let's determine the mean reflectance for the blue band:

In [None]:
scene.sel(band='blue').mean()#.values 
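
The same label-based operations can be tried without any cloud access on a small synthetic array. The sketch below builds a `DataArray` with `band`, `y`, and `x` dimensions (all values and coordinates are made up), selects one band by label, and then averages over the spatial dimensions by name to get one value per band.

```python
import numpy as np
import xarray as xr

# Three 2 x 2 "bands" filled with the numbers 0 through 11
values = np.arange(3 * 2 * 2, dtype=float).reshape(3, 2, 2)
demo = xr.DataArray(values, dims=('band', 'y', 'x'),
                    coords={'band': ['red', 'green', 'blue'],
                            'y': [7695000.0, 7694970.0],
                            'x': [300000.0, 300030.0]},
                    attrs={'note': 'synthetic example'})

# Select by label, then reduce over everything that is left
blue_mean = float(demo.sel(band='blue').mean())

# Reduce over the spatial dimensions by name, keeping one value per band
band_means = demo.mean(dim=('y', 'x'))

print(blue_mean, band_means.values)
```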

And you can easily use the split-apply-combine paradigm with groupby, which I won't show here.

The N-dimensional nature of xarray’s data structures makes it suitable for dealing with multi-dimensional scientific data, and its use of dimension names instead of axis labels (dim='time' instead of axis=0) makes such arrays much more manageable than the raw numpy ndarray or pandas.

Now let's open all the bands and multiple days together into an xarray. This will be a more complex xarray with an extra `'time'` dimension. Since the catalog we have has a combination of Level 1 and 2 data, let's keep only the scene IDs for Level 1 data.

In [None]:
sceneids = list(catalog)
sceneids = [sceneid for sceneid in sceneids if sceneid.endswith('_T1')]
sceneids

Let's create the time variable first for the xarray time dimension. This finds the desired scene IDs in the geopandas dataframe we have above and extracts their 'datetime' information. These datetimes get recorded into an Xarray variable object for 'time'.

In [None]:
# Create time variable for time dim
time_var = xr.Variable('time',gf.loc[gf.id.isin(sceneids)]['datetime'])
time_var

Now we will search and collect band names for grabbing each desired band. We will just grab the bands that have 30 m pixels. This provides an example of how you can search data in the STAC catalog.

In [None]:
band_names = []

# Get band names
sceneid = catalog[sceneids[1]]
for k in sceneid.keys():
    M = getattr(sceneid, k).metadata
    if 'eo:bands' in M:
        resol = M['eo:bands'][0]['gsd']
        print(k, resol)
        if resol == 30: 
            band_names.append(k)
            
# Add qa band
band_names.append('qa_pixel')

And now open all of it...

In [None]:
# Import to xarray
# NaNs appear in the combined array where concatenating scenes from different paths/rows has expanded the grid.
scenes = []

for sceneid in sceneids:
    sceneid = catalog[sceneid]
    print (sceneid.name)

    bands = []

    # Construct xarray for scene: open each band, append and concatenate them to create a scene,
    # then append and concatenate each scene to create the full array
    for band_name in band_names:
        band = sceneid[band_name](chunks=dict(band=1, x=2048, y=2048),urlpath=sceneid[band_name].metadata['alternate']['s3']['href']).to_dask()
        band['band'] = [band_name]
        bands.append(band)
    scene = xr.concat(bands, dim='band')
    scenes.append(scene)

# Concatenate scenes with time variable
ls_scenes = xr.concat(scenes, dim=time_var)

ls_scenes
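
The note about NaNs can be seen on a toy case. In this sketch two small synthetic "scenes" cover overlapping but different grids. Concatenating them along a new `time` dimension with an outer join (passed explicitly here) expands both to the union of their coordinates and fills the cells a scene does not cover with NaN. All values are made up.

```python
import numpy as np
import xarray as xr

# Two 2 x 2 scenes on grids that only partly overlap
a = xr.DataArray(np.ones((2, 2)), dims=('y', 'x'),
                 coords={'y': [20.0, 10.0], 'x': [0.0, 10.0]})
b = xr.DataArray(np.full((2, 2), 2.0), dims=('y', 'x'),
                 coords={'y': [10.0, 0.0], 'x': [10.0, 20.0]})

time_var = xr.Variable('time', np.array(['2019-05-06', '2019-05-07'],
                                        dtype='datetime64[ns]'))

# join='outer' keeps the union of the y and x coordinates
stack = xr.concat([a, b], dim=time_var, join='outer')

# Count the cells of the first scene that had to be filled in
missing_first = int(stack.isel(time=0).isnull().sum())
print(stack.shape, missing_first)
```

The one cell covered by both scenes (y=10, x=10) holds real values at both times, which is where a time series could be extracted.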

We now have 2 Landsat scenes with all of the bands we are interested in stored in an xarray. 

We can also easily subset and visualize:

In [None]:
sbands = ['blue', 'nir08', 'swir16']

# Select the second datetime (index 1)
t = ls_scenes.time.values[1]
print (t)

# Create upper-left and lower-right coordinates for subsetting
ulx = 300000
uly = 7695000
lrx = 330000
lry = 7670000

image = ls_scenes.sel(time=t,band=sbands,y=slice(lry,uly),x=slice(ulx,lrx))

In [None]:
fig, ax = plt.subplots(figsize=(8,6))
image.sel(band='blue').plot()

Since this data is in UTM 22N, we can reproject to the standard lat/lon coordinate system (WGS-84) and map with the ICESat-2 and ATM lines.

In [None]:
image = image.rio.reproject(4326)

ISlats = [is2_gt2r['lat'].min(), is2_gt2r['lat'].max()]
# ISlons = (is2_gt2r['lon'].max(), is2_gt2r['lon'].min())
ISlons = [-55.624,-55.646]

ATMlats = [atm_l2['Latitude(deg)'].min(), atm_l2['Latitude(deg)'].max()]
# ATMlons = [atm_l2['Longitude(deg)'].max(), atm_l2['Longitude(deg)'].min()]
ATMlons = [-55.624,-55.646]

fig, ax = plt.subplots(figsize=(8,6))
image.sel(band='blue').plot()
plt.plot(ISlons,ISlats,color = 'green')
plt.plot(ATMlons,ATMlats,color = 'orange')

The reprojection to WGS-84 calculated longitudes that are 6 degrees off, so we shifted the IS2 and ATM data for ease of visualization. This issue only seems to arise with reprojections from some UTM projections and should not be an issue with most data.

## Summary

Congratulations! You've completed the tutorial. In this tutorial you have gained the skills to: 
* Search for non-ICESat-2 datasets
* Open data into pandas DataFrames and xarray arrays, and
* Manipulate, visualize, and explore the data

We have concluded by mapping the three data sets together. More to come in the second data integration tutorial tomorrow!