# Automated bulk download of Landsat image subsets through AWS

_Last modified 2022-05-11._

This notebook downloads Landsat image subsets over glaciers from the USGS Landsat AWS s3 bucket. The workflow is designed to process tens to hundreds of glaciers, specifically the marine-terminating glaciers along the periphery of Greenland. Sections of code that may need to be modified are marked as shown below:

    ##########################################################################################

    Code that must be modified.

    ##########################################################################################

 
### Configure your AWS profile to access the Landsat images on the s3 bucket:

Follow the instructions at https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html to install the required __aws__ command line software.

Set up your AWS account with a payment option. Then configure the profile on your machine with the following command:

    aws configure --profile terminusmapping
    
Enter your credentials when prompted.
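
To confirm that the profile was saved before running the notebook, a minimal sketch like the one below can check the AWS credentials file. It assumes the default file location (`~/.aws/credentials`) and the standard `aws_access_key_id` entry; the helper name `profile_is_configured` is hypothetical and not part of this repository.

```python
import configparser
from pathlib import Path

def profile_is_configured(profile, credentials_path=None):
    """Return True if `profile` has an access key in the AWS credentials file."""
    # Assumed default location of the credentials file written by `aws configure`
    path = Path(credentials_path) if credentials_path else Path.home() / ".aws" / "credentials"
    parser = configparser.ConfigParser()
    parser.read(path)  # a missing file is ignored and simply yields no sections
    return parser.has_section(profile) and parser.has_option(profile, "aws_access_key_id")

print(profile_is_configured("terminusmapping"))
```

If this prints `False`, re-run the `aws configure` command above before continuing.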
 
### Steps in the script:
    1. Set-up: import packages, set paths, and enter glacier IDs
    2. Find all the Landsat footprints that overlap the glaciers
    3. Download Landsat metadata (*MTL.txt) files from AWS for all overlapping scenes
    4. Calculate cloud % over terminus box using Landsat quality band (QA_PIXEL)
    5. Create buffer zone around terminus boxes and rasterize terminus boxes
    6. Download non-cloudy Landsat images from AWS
    7. Grab image acquisition dates from metadata files
    8. Delete the *QA_PIXEL.TIF files downloaded in step (4) to save space

# 1) Set-up: import packages, set paths, and enter glacier IDs

In [23]:
import numpy as np
import pandas as pd
import scipy
import math
import subprocess
import os
import shutil
import datetime
import cv2
from PIL import Image
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import glob

# geospatial packages
import fiona
import geopandas as gpd
from shapely.geometry import Polygon, Point, LineString
import shapely
from matplotlib.pyplot import imshow
import rasterio as rio

# Enable fiona KML file reading driver
gpd.io.file.fiona.drvsupport.supported_drivers['KML'] = 'rw'

# import necessary functions from automated_terminus_functions.py
from automated_terminus_functions import distance

In [None]:
# # change display width if desired
# from IPython.display import display, HTML

# display(HTML(data="""
# <style>
#     div#notebook-container    { width: 95%; }
#     div#menubar-container     { width: 65%; }
#     div#maintoolbar-container { width: 99%; }
# </style>
# """))

## AWS configuration:

In [None]:
# ! aws s3 ls --profile terminusmapping

In [24]:
# AWS settings
from rasterio.session import AWSSession
import pickle
import boto3
import boto3.session

cred = boto3.Session(profile_name='terminusmapping').get_credentials()
ACCESS_KEY = cred.access_key
SECRET_KEY = cred.secret_key
SESSION_TOKEN = cred.token  ## optional


s3client = boto3.client('s3', 
                        aws_access_key_id = ACCESS_KEY, 
                        aws_secret_access_key = SECRET_KEY, 
#                         aws_session_token = SESSION_TOKEN
                       )

# response = s3client.get_object(Bucket='name_of_your_bucket', Key='path/to_your/file.pkl')
# body = response['Body'].read()
# data = pickle.loads(body)
 
######################################################################################
# path to the collection on AWS usgs-landsat s3 bucket:
collectionpath = 'collection02/level-1/standard/' # collection 2 level 1 data being used
######################################################################################

## Define paths, satellites, geographic projections:

In [25]:
######################################################################################
# ADJUST THESE VARIABLES:
basepath = '/home/jukes/Documents/Sample_glaciers/' # folder containing all the glacier shapefile(s)
downloadpath ='/media/jukes/jukes1/LS8aws/' # folder to eventually contain downloaded Landsat images

sats = ['L7','L8'] # Landsat satellites to download images from ('L7' for Landsat 7, 'L8' for Landsat 8, or both)
L8_yrs = np.arange(2013,2022).astype(str) # set target years for L8: 2013-2021
L7_yrs = np.arange(1999,2004).astype(str) # set target years for L7: 1999-2003
L8_bands = [8] # panchromatic band for L8
L7_bands = [8] # panchromatic band for L7

repopath = '/home/jukes/automated-glacier-terminus/' # path to this repository
os.chdir(repopath) # change directories to this repo

source_srs = '3413' # EPSG code for the current projection of the glacier shapefiles 
# (3413 = Greenland polar stereographic)

csvext = '_test_Box009.csv' # enter a file suffix for the CSV files produced 
# that describes the analysis (e.g., glacier or group of glaciers)

RGIpath = '/media/jukes/jukes1/RGI_shps/' # path to folder with all individual RGI glacier outline shapefiles
boxespath = '/media/jukes/jukes1/Boxes_individual/' # folder with all individual glacier terminus box shapes
######################################################################################

In [26]:
# filenames that will be written in this script
# all with common extension
print("CSV files that will be produced:"); print()
PR_FILENAME = 'LS_pathrows'+csvext; print(PR_FILENAME) # glacier Landsat path, row, zone info
BOX_FILENAME = 'Buffdist'+csvext; print(BOX_FILENAME) # buffer distances around glacier terminus boxes
DATES_FILENAME = 'imgdates'+csvext; print(DATES_FILENAME) # acquisition dates for downloaded Landsat images

CSV files that will be produced:

LS_pathrows_test_Box009.csv
Buffdist_test_Box009.csv
imgdates_test_Box009.csv


#### Keep a record of the CSV file names generated, as many of them will be used later for analysis.

##  Enter in the glacier BoxIDs:

The Greenland peripheral glacier terminus boxes are referenced by their 3-digit BoxID: Box###.
For other glaciers, replace this code with a list of IDs that match the glaciers and their shapefiles (e.g., BoxHelheim.shp). 

In [27]:
######################################################################################
BoxIDs = []
boxes = list(map(str, np.arange(9, 10, 1))) #1, 642, 1
for BoxID in boxes: # convert integers to 3-digit strings with leading zeros
    BoxID = BoxID.zfill(3)
    BoxIDs.append(BoxID)
print(BoxIDs) # show the final BoxIDs
######################################################################################

['009']
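
For a mix of numbered and named glaciers, the ID list could be built as in the sketch below. The helper `format_box_id` is hypothetical; it zero-pads numeric IDs as the cell above does and leaves named IDs such as `Helheim` untouched.

```python
def format_box_id(box_id, width=3):
    """Zero-pad numeric IDs; leave named IDs (e.g. 'Helheim') unchanged."""
    text = str(box_id)
    return text.zfill(width) if text.isdigit() else text

# Numeric IDs 9-11 plus one named glacier (illustrative values)
box_ids = [format_box_id(i) for i in range(9, 12)] + [format_box_id("Helheim")]
print(box_ids)  # ['009', '010', '011', 'Helheim']
```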


### Create new folders corresponding to these glaciers:

In [10]:
# create new BoxID folders 
for BoxID in BoxIDs:
    # create folder to hold glacier shapefiles
    shapefilepath = basepath+'Box'+BoxID+'/' # path to that folder
    if os.path.exists(shapefilepath):
#         shutil.rmtree(shapefilepath) # remove the old folder
        print("Path exists already for Box", BoxID)
    else:
        os.mkdir(basepath+'Box'+BoxID)
            
    # create folder to hold glacier images (inside downloadpath)
    if os.path.exists(downloadpath+'Box'+BoxID):
        print("Path exists already in LS8aws for Box", BoxID)
    else:
        os.mkdir(downloadpath+'Box'+BoxID)
    
    # Now copy the terminus box shapefile and RGI glacier outline shapefile into the
    # BoxID folder (shapefilepath). Done automatically below for the Greenland peripheral glaciers:
    ######################################################################################
    ID = int(BoxID) # make into an integer in order to grab the .shp files
    
    # if the terminus box shapefile is not in this folder, then move it
    if not os.path.exists(shapefilepath+'Box'+BoxID+'.shp'):
        for filename in os.listdir(boxespath):
            if filename.startswith('BoxID_'+str(ID)):
                shutil.copyfile(boxespath+filename, basepath+'Box'+BoxID+'/Box'+BoxID+filename[-4:])
                print("Box"+BoxID+filename[-4:], "moved")
    else:
        print("Box"+BoxID+'.shp', "already in folder")

    if not os.path.exists(shapefilepath+'RGI_Box'+BoxID+'.shp'): # if the RGI shapefile is not in this folder
        # copy RGI glacier outline into the new folder
        for filename in os.listdir(RGIpath):
            if filename.startswith('BoxID_'+str(ID)):
                shutil.copyfile(RGIpath+filename, basepath+'Box'+BoxID+'/RGI_Box'+BoxID+filename[-4:])
                print("RGI_Box"+BoxID+filename[-4:], "moved")
    else:
        print("RGI_Box"+BoxID+'.shp', "already in folder")
    ######################################################################################

Path exists already for Box 009
Box009.shp already in folder
RGI_Box009.shp already in folder


# 2) Find all the Landsat footprints that overlap the glaciers

This step requires the WRS-2_bound_world_0.kml file containing the footprints of all the Landsat scene boundaries available through the USGS (https://www.usgs.gov/land-resources/nli/landsat/landsat-shapefiles-and-kml-files). Place this file in your base directory (basepath). 

To check whether the footprints overlap the glacier terminus boxes, the box shapefiles must be in WGS84 coordinates (EPSG:4326). If they are not, we use the following GDAL command to reproject them into WGS84:

        ogr2ogr -f "ESRI Shapefile" -t_srs EPSG:NEW_EPSG_NUMBER -s_srs EPSG:OLD_EPSG_NUMBER out.shp in.shp

In [11]:
# Reproject terminus box shapefiles to WGS84 if in a different projection
for BoxID in BoxIDs:
    boxpath = basepath+"Box"+BoxID+"/Box"+BoxID # access the BoxID folders created (separate name so boxespath is not overwritten)
    # construct the gdal command
    rp = "ogr2ogr -f 'ESRI Shapefile' -t_srs EPSG:4326 -s_srs EPSG:"+source_srs+" "
    rp +=boxpath+"_WGS.shp "+boxpath+".shp"
    print("Command:", rp) # check command
    subprocess.run(rp, shell=True, check=True) # run the command on terminal
    
    # if an error is produced, check the error output on the terminal window that runs this notebook

Command: ogr2ogr -f 'ESRI Shapefile' -t_srs EPSG:4326 -s_srs EPSG:3413 /home/jukes/Documents/Sample_glaciers/Box009/Box009_WGS.shp /home/jukes/Documents/Sample_glaciers/Box009/Box009.shp
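
As an alternative to concatenating a shell string, the same command can be assembled as an argument list, which avoids quoting problems when paths contain spaces. This is only a sketch: `build_reproject_cmd` is a hypothetical helper, and the file names are placeholders.

```python
import shlex

def build_reproject_cmd(src_shp, dst_shp, source_epsg, target_epsg="4326"):
    """Return the ogr2ogr reprojection command as an argument list."""
    return [
        "ogr2ogr", "-f", "ESRI Shapefile",
        "-t_srs", f"EPSG:{target_epsg}",
        "-s_srs", f"EPSG:{source_epsg}",
        dst_shp, src_shp,  # ogr2ogr takes the destination before the source
    ]

cmd = build_reproject_cmd("Box009.shp", "Box009_WGS.shp", "3413")
print(shlex.join(cmd))  # printable form for checking the command
# To execute (requires GDAL on the PATH): subprocess.run(cmd, check=True)
```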


In [12]:
# Grab the WGS84 coordinates of the boxes
box_points = {} # dictionary of points
for BoxID in BoxIDs:
    boxpath = basepath+"Box"+BoxID+"/Box"+BoxID # path to the reprojected terminus box
    termbox = fiona.open(boxpath+'_WGS.shp') # open reprojected terminus box
    box = next(iter(termbox)); box_coords=box['geometry']['coordinates'][0] # grab coords of the first feature
    points = [] # to hold the box vertices
    
    # read coordinates and convert to a shapely object
    for coord_pair in box_coords: 
        lon = coord_pair[0]; lat = coord_pair[1] # shapefile coordinates are stored as (x, y) = (lon, lat)
        point = shapely.geometry.Point(lon, lat) # create shapely point 
        points.append(point) # append to points list
        
    box_points.update({BoxID: points}) # update dictionary
    print("Box"+BoxID+" coordinates recorded.") # keep track of progress

Box009 coordinates recorded.


  


In [13]:
######################################################################################
# open the kml file with the Landsat path, row footprints:
WRS = fiona.open(basepath+'WRS-2_bound_world_0.kml', driver='KML') # check the path to the world bounds file
print('Landsat footprint file opened.')
######################################################################################

Landsat footprint file opened.


In [14]:
paths = []; rows = []; boxes = [] # create lists to hold the paths and rows and BoxIDs

#loop through all Landsat scenes (path, row footprints)
for feature in WRS:
    # create shapely polygons from the Landsat footprints
    coordinates = feature['geometry']['coordinates'][0]
    coords = [xy[0:2] for xy in coordinates]
    pathrow_poly = Polygon(coords)
    
    # grab the path and row name from the WRS kml file:
    pathrowname = feature['properties']['Name']  
    path = pathrowname.split('_')[0]; row = pathrowname.split('_')[1]
#     print(path, row)
    
    # for each feature, loop through each of the vertices stored in the dictionary
    for BoxID in box_points:  
        box_points_in = 0 # counter for number of box_points in the pathrow_geom:
        points = box_points.get(BoxID) # grab the points corresponding to the ID
        for i in range(0, len(points)):
            point = points[i]
            if point.within(pathrow_poly): # if the pathrow shape contains the point
                box_points_in = box_points_in+1 # increment the counter
        if box_points_in == len(points): # if all box vertices are inside the footprint, save the path, row, BoxID
            paths.append('%03d' % int(path))
            rows.append('%03d' % int(row))
            boxes.append(BoxID)

# Store in dataframe
boxes_pr_df = pd.DataFrame(list(zip(boxes, paths, rows)), columns=['BoxID','Path', 'Row'])
boxes_pr_df = boxes_pr_df.sort_values(by='BoxID')
boxes_pr_df # display

|  | BoxID | Path | Row |
|---|---|---|---|
| 0 | 9 | 30 | 6 |
| 1 | 9 | 29 | 6 |
| 2 | 9 | 28 | 6 |
| 3 | 9 | 27 | 6 |
| 4 | 9 | 32 | 5 |
| 5 | 9 | 31 | 5 |

In [17]:
# save to file
boxes_pr_df.to_csv(path_or_buf = basepath+PR_FILENAME, sep=',') # write to csv

# 3) Download metadata files from AWS s3 for overlapping Landsat scenes
     
The syntax for listing the Collection 2 Landsat image files in the AWS s3 bucket is as follows:

    aws s3 ls --request-payer requester s3://usgs-landsat/collection02/level-1/standard/oli-tirs/yyyy/path/row/LC08_L1TP_pathrow_yyyyMMdd_yyyyMMdd_02_T1/ 
    
__NOTE: Including --request-payer requester in this command means that the requesting user will be charged for the data download.__

We can use the paths and rows in the dataframe to access the full Landsat scene list and the corresponding metadata files. Read https://docs.opendata.aws/landsat-pds/readme.html to learn more.
    
The metadata files will be downloaded into folders corresponding to the Landsat footprint, identified by the Path Row numbers:
    
    aws s3api get-object --bucket usgs-landsat --key collection02/level-1/standard/oli-tirs/yyyy/path/row/LC08_L1TP_pathrow_yyyyMMdd_yyyyMMdd_02_T1/LC08_L1TP_pathrow_yyyyMMdd_yyyyMMdd_02_T1_MTL.txt  --request-payer requester LC08_L1TP_pathrow_yyyyMMdd_yyyyMMdd_02_T1_MTL.txt
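
The key passed to `--key` follows directly from the scene name. The sketch below shows one way to derive it, assuming the Level-1 collection path set in this notebook and the two sensor folders used here; `mtl_key` and `SENSOR_FOLDERS` are hypothetical names.

```python
# Sensor prefix -> bucket folder, as used in this notebook
SENSOR_FOLDERS = {"LC08": "oli-tirs", "LE07": "etm"}

def mtl_key(scene, collectionpath="collection02/level-1/standard/"):
    """Build the S3 key of a scene's MTL file from its product identifier."""
    parts = scene.split("_")
    sensor, pathrow, acquired = parts[0], parts[2], parts[3]
    path, row, year = pathrow[:3], pathrow[3:], acquired[:4]
    return (f"{collectionpath}{SENSOR_FOLDERS[sensor]}/{year}/{path}/{row}/"
            f"{scene}/{scene}_MTL.txt")

print(mtl_key("LE07_L1TP_030006_19990705_20200918_02_T1"))
```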

In [28]:
# Read in csv file from Step 2
boxes_pr_df = pd.read_csv(basepath+PR_FILENAME, dtype=str)
boxes_pr_df = boxes_pr_df.set_index('BoxID'); boxes_pr_df

| BoxID | Unnamed: 0 | Path | Row |
|---|---|---|---|
| 9 | 0 | 30 | 6 |
| 9 | 1 | 29 | 6 |
| 9 | 2 | 28 | 6 |
| 9 | 3 | 27 | 6 |
| 9 | 4 | 32 | 5 |
| 9 | 5 | 31 | 5 |


In [33]:
# Loop through the dataframe containing overlapping path, row info:
for index, row in boxes_pr_df.iterrows():
    p = row['Path']; r = row['Row']; folder_name = 'Path'+p+'_Row'+r+'_c2' # folder name
    bp_out = downloadpath+folder_name+'/' # output path for the downloaded files
    print("Downloaded metadata files are stored in:",bp_out)
    
    # create Path_Row folders if they don't exist already
    if os.path.exists(bp_out):
        print(folder_name, " exists already, skip directory creation")
    else:
        os.mkdir(bp_out)
        print(folder_name+" directory made")
    
    for sat in sats: # for each satellite
        if sat == 'L8':
            collectionfolder = 'oli-tirs/'; years = L8_yrs; prefix='LC08' # set folder, years, file prefix
        elif sat == 'L7':
            collectionfolder = 'etm/'; years = L7_yrs; prefix='LE07' # set folder, years, file prefix
        
        # loop through years
        for year in years:
            # grab list of images in each year, path, row folder
            find_imgs = 'aws s3 ls --profile terminusmapping --request-payer requester s3://usgs-landsat/'
            find_imgs += collectionpath+collectionfolder
            find_imgs += year+'/'+p+'/'+r+'/'
            print(find_imgs)
            listing = subprocess.run(find_imgs, shell=True, capture_output=True) # list the folder once
            if listing.returncode != 0:
                print('No results found for '+collectionpath+collectionfolder+year+'/'+p+'/'+r+'/')
                results = [] # empty results
            else:
                results = listing.stdout.split() # split the available image names (bytes)
            
            imagenames = []
            for line in results: # loop through strings
                line = str(line)
                if prefix in line and 'T1' in line: # find just the Tier-1 images
                    imgname = line[2:-2]; imagenames.append(imgname)

            # download the metadata (MTL.txt) file if it doesn't exist
            for imgname in imagenames:
                if not os.path.exists(bp_out+imgname+'_MTL.txt'): # check in output directory
                    command = 'aws s3api get-object --bucket usgs-landsat --key '+collectionpath+collectionfolder
                    command += year+'/'+p+'/'+r+'/'
                    command += imgname+'/'+imgname+'_MTL.txt'
                    command += ' --profile terminusmapping --request-payer requester '
                    command += bp_out+imgname+'_MTL.txt'
                    print('Downloading', imgname+'_MTL.txt')
                    subprocess.run(command,shell=True,check=True)
                else:
                    print(imgname+'_MTL.txt exists. Skip.')

Downloaded metadata files are stored in: /media/jukes/jukes1/LS8aws/Path030_Row006_c2/
Path030_Row006_c2  exists already, skip directory creation
aws s3 ls --profile terminusmapping --request-payer requester s3://usgs-landsat/collection02/level-1/standard/etm/1999/030/006/
LE07_L1TP_030006_19990705_20200918_02_T1_MTL.txt exists. Skip.
aws s3 ls --profile terminusmapping --request-payer requester s3://usgs-landsat/collection02/level-1/standard/etm/2000/030/006/
LE07_L1TP_030006_20000520_20200918_02_T1_MTL.txt exists. Skip.
LE07_L1TP_030006_20000605_20200918_02_T1_MTL.txt exists. Skip.
aws s3 ls --profile terminusmapping --request-payer requester s3://usgs-landsat/collection02/level-1/standard/etm/2001/030/006/
LE07_L1TP_030006_20010405_20200917_02_T1_MTL.txt exists. Skip.
LE07_L1TP_030006_20010811_20200917_02_T1_MTL.txt exists. Skip.
aws s3 ls --profile terminusmapping --request-payer requester s3://usgs-landsat/collection02/level-1/standard/etm/2002/030/006/
LE07_L1TP_030006_20020323_2

LE07_L1TP_029006_20000427_20200918_02_T1_MTL.txt exists. Skip.
LE07_L1TP_029006_20000513_20200918_02_T1_MTL.txt exists. Skip.
LE07_L1TP_029006_20001004_20200917_02_T1_MTL.txt exists. Skip.
aws s3 ls --profile terminusmapping --request-payer requester s3://usgs-landsat/collection02/level-1/standard/etm/2001/029/006/
LE07_L1TP_029006_20010414_20200917_02_T1_MTL.txt exists. Skip.
LE07_L1TP_029006_20010516_20200917_02_T1_MTL.txt exists. Skip.
aws s3 ls --profile terminusmapping --request-payer requester s3://usgs-landsat/collection02/level-1/standard/etm/2002/029/006/
LE07_L1TP_029006_20020722_20200916_02_T1_MTL.txt exists. Skip.
LE07_L1TP_029006_20020823_20200916_02_T1_MTL.txt exists. Skip.
aws s3 ls --profile terminusmapping --request-payer requester s3://usgs-landsat/collection02/level-1/standard/etm/2003/029/006/
LE07_L1TP_029006_20030404_20200915_02_T1_MTL.txt exists. Skip.
aws s3 ls --profile terminusmapping --request-payer requester s3://usgs-landsat/collection02/level-1/standard/ol

LC08_L1TP_028006_20130907_20200913_02_T1_MTL.txt exists. Skip.
LC08_L1TP_028006_20130923_20200913_02_T1_MTL.txt exists. Skip.
aws s3 ls --profile terminusmapping --request-payer requester s3://usgs-landsat/collection02/level-1/standard/oli-tirs/2014/028/006/
LC08_L1TP_028006_20140302_20200911_02_T1_MTL.txt exists. Skip.
LC08_L1TP_028006_20140318_20200911_02_T1_MTL.txt exists. Skip.
LC08_L1TP_028006_20140505_20200911_02_T1_MTL.txt exists. Skip.
LC08_L1TP_028006_20140521_20200911_02_T1_MTL.txt exists. Skip.
LC08_L1TP_028006_20140708_20200911_02_T1_MTL.txt exists. Skip.
LC08_L1TP_028006_20140825_20200911_02_T1_MTL.txt exists. Skip.
aws s3 ls --profile terminusmapping --request-payer requester s3://usgs-landsat/collection02/level-1/standard/oli-tirs/2015/028/006/
LC08_L1TP_028006_20150305_20200909_02_T1_MTL.txt exists. Skip.
LC08_L1TP_028006_20150321_20201015_02_T1_MTL.txt exists. Skip.
LC08_L1TP_028006_20150406_20201016_02_T1_MTL.txt exists. Skip.
LC08_L1TP_028006_20150422_20200909_02_T1_

LC08_L1TP_027006_20160229_20200907_02_T1_MTL.txt exists. Skip.
LC08_L1TP_027006_20160417_20200907_02_T1_MTL.txt exists. Skip.
LC08_L1TP_027006_20160503_20200907_02_T1_MTL.txt exists. Skip.
LC08_L1TP_027006_20160620_20201016_02_T1_MTL.txt exists. Skip.
LC08_L1TP_027006_20160706_20200906_02_T1_MTL.txt exists. Skip.
LC08_L1TP_027006_20160722_20200906_02_T1_MTL.txt exists. Skip.
LC08_L1TP_027006_20160807_20201016_02_T1_MTL.txt exists. Skip.
LC08_L1TP_027006_20160823_20200906_02_T1_MTL.txt exists. Skip.
LC08_L1TP_027006_20160908_20200906_02_T1_MTL.txt exists. Skip.
LC08_L1TP_027006_20160924_20200906_02_T1_MTL.txt exists. Skip.
aws s3 ls --profile terminusmapping --request-payer requester s3://usgs-landsat/collection02/level-1/standard/oli-tirs/2017/027/006/
LC08_L1TP_027006_20170319_20200904_02_T1_MTL.txt exists. Skip.
LC08_L1TP_027006_20170404_20200904_02_T1_MTL.txt exists. Skip.
LC08_L1TP_027006_20170506_20200904_02_T1_MTL.txt exists. Skip.
LC08_L1TP_027006_20170623_20200903_02_T1_MTL.txt

LC08_L1TP_032005_20180309_20200901_02_T1_MTL.txt exists. Skip.
LC08_L1TP_032005_20180325_20200901_02_T1_MTL.txt exists. Skip.
LC08_L1TP_032005_20180410_20201016_02_T1_MTL.txt exists. Skip.
LC08_L1TP_032005_20180426_20200901_02_T1_MTL.txt exists. Skip.
LC08_L1TP_032005_20180512_20200901_02_T1_MTL.txt exists. Skip.
LC08_L1TP_032005_20180528_20201015_02_T1_MTL.txt exists. Skip.
LC08_L1TP_032005_20180613_20200831_02_T1_MTL.txt exists. Skip.
LC08_L1TP_032005_20180629_20200831_02_T1_MTL.txt exists. Skip.
LC08_L1TP_032005_20180731_20200831_02_T1_MTL.txt exists. Skip.
LC08_L1TP_032005_20180816_20200831_02_T1_MTL.txt exists. Skip.
LC08_L1TP_032005_20180917_20200830_02_T1_MTL.txt exists. Skip.
LC08_L1TP_032005_20181003_20200830_02_T1_MTL.txt exists. Skip.
aws s3 ls --profile terminusmapping --request-payer requester s3://usgs-landsat/collection02/level-1/standard/oli-tirs/2019/032/005/
LC08_L1TP_032005_20190312_20200829_02_T1_MTL.txt exists. Skip.
LC08_L1TP_032005_20190413_20200828_02_T1_MTL.txt

LC08_L1TP_031005_20180302_20200902_02_T1_MTL.txt exists. Skip.
LC08_L1TP_031005_20180318_20200901_02_T1_MTL.txt exists. Skip.
LC08_L1TP_031005_20180403_20200901_02_T1_MTL.txt exists. Skip.
LC08_L1TP_031005_20180419_20200901_02_T1_MTL.txt exists. Skip.
LC08_L1TP_031005_20180521_20200901_02_T1_MTL.txt exists. Skip.
LC08_L1TP_031005_20180606_20200831_02_T1_MTL.txt exists. Skip.
LC08_L1TP_031005_20180622_20201015_02_T1_MTL.txt exists. Skip.
LC08_L1TP_031005_20180708_20200831_02_T1_MTL.txt exists. Skip.
LC08_L1TP_031005_20180724_20200831_02_T1_MTL.txt exists. Skip.
LC08_L1TP_031005_20180825_20200831_02_T1_MTL.txt exists. Skip.
LC08_L1TP_031005_20180910_20200830_02_T1_MTL.txt exists. Skip.
LC08_L1TP_031005_20180926_20200830_02_T1_MTL.txt exists. Skip.
LC08_L1TP_031005_20181012_20200830_02_T1_MTL.txt exists. Skip.
aws s3 ls --profile terminusmapping --request-payer requester s3://usgs-landsat/collection02/level-1/standard/oli-tirs/2019/031/005/
LC08_L1TP_031005_20190321_20200829_02_T1_MTL.txt
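
The scene names are recovered from the text that `aws s3 ls` prints, where each folder appears as `PRE <name>/`. A minimal sketch of that parsing on decoded text is shown below. The helper `tier1_scenes` is hypothetical, the second sample line is invented to show a non-Tier-1 entry being filtered out, and the filter uses a stricter `endswith('_T1')` check than the `'T1' in line` test in the cell above.

```python
def tier1_scenes(listing, prefix):
    """Extract Tier-1 scene names for one sensor from `aws s3 ls` output."""
    scenes = []
    for line in listing.splitlines():
        tokens = line.split()
        # Folder entries are printed as "PRE <name>/"
        if len(tokens) == 2 and tokens[0] == "PRE":
            name = tokens[1].rstrip("/")
            if name.startswith(prefix) and name.endswith("_T1"):
                scenes.append(name)
    return scenes

# Sample listing text; the T2 line is an invented example
sample = """\
                           PRE LC08_L1TP_028006_20140302_20200911_02_T1/
                           PRE LC08_L1GT_028006_20140403_20200911_02_T2/
"""
print(tier1_scenes(sample, "LC08"))
```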

# 4) Calculate cloud % over terminus box using Landsat quality band

If the terminus box shapefiles were not originally in a UTM projection, they must be reprojected into UTM to match the Landsat projection. The code automatically finds the UTM zones from the metadata files and fills in the following syntax to reproject:
    
    ogr2ogr -f "ESRI Shapefile" -t_srs EPSG:326zone output.shp input.shp
    
#### If the terminus box shapefiles are already in UTM projection, skip the following cell and rename the files to end with "\_UTM\_##.shp" where ## corresponds to the zone number (e.g., "\_UTM\_07.shp", "\_UTM\_21.shp").

In [18]:
zones = {} # initialize dictionary to hold UTM zone for each Landsat scene path row
zone_list = [] # list of zones

# Loop through all scenes:
for index, row in boxes_pr_df.iterrows():
    BoxID = str(index)
    p = row['Path']; r = row['Row']; folder_name = 'Path'+p+'_Row'+r+'_c2' # Landsat path and row
    pr_folderpath = downloadpath+folder_name+'/' # path to the downloaded metadata files
    pathtoshp = basepath+"Box"+BoxID+"/Box"+BoxID # path to the terminus box shapefiles (all projections)
    
    if len(os.listdir(pr_folderpath)) > 0: # if there are files in the folder
        # grab UTM Zone from the first metadata file
        mtl_scene = glob.glob(pr_folderpath+'*_MTL.txt')[0]
        
        # loop through lines in the metadata file to find the UTM ZONE
        with open(mtl_scene, 'r') as mtl: # the file is closed automatically after reading
            for line in mtl:
                variable = line.split("=")[0] # grab the variable name
                if ("UTM_ZONE" in variable):
                    zone = '%02d' % int(line.split("=")[1][1:-1]) # grab the 2-digit zone number
                    zones.update({folder_name: zone}); zone_list.append(zone) # add to zone lists
                    break
                
        # reproject shapefile(s) into UTM
        zone = zones[folder_name]
        rp_shp = 'ogr2ogr -f "ESRI Shapefile" '+pathtoshp+'_UTM_'+zone+'.shp '+pathtoshp+'_WGS.shp'
        rp_shp += ' -t_srs EPSG:326'+zone
        subprocess.run(rp_shp, shell=True,check=True)
        
    else: # if no files in folder, zone = nan, must fill in manually
        zone_list.append(np.nan)
        
boxes_pr_df['Zone'] = zone_list # add to the path row dataframe
boxes_pr_df.head()

| BoxID | Unnamed: 0 | Path | Row | Zone |
|---|---|---|---|---|
| 8 | 0 | 31 | 6 | 19 |
| 8 | 1 | 30 | 6 | 19 |
| 8 | 2 | 29 | 6 | 19 |
| 8 | 3 | 28 | 6 | 19 |
| 8 | 4 | 27 | 6 | 20 |


In [19]:
# overwrite path row csv file with UTM zone information, see above for variable PR_FILENAME
boxes_pr_df.to_csv(path_or_buf = basepath+PR_FILENAME, sep=',')
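
The zone lookup above can be isolated into a small function that is easy to check on its own. This sketch assumes the metadata contains a line of the form `UTM_ZONE = 19` and that the scene is in a northern-hemisphere zone (EPSG 326xx), as the reprojection command above does; the sample text is a shortened, illustrative excerpt, and `utm_epsg_from_mtl` is a hypothetical name.

```python
def utm_epsg_from_mtl(mtl_text):
    """Return the EPSG code (as a string) for the UTM zone named in MTL text."""
    for line in mtl_text.splitlines():
        name, _, value = line.partition("=")
        if name.strip() == "UTM_ZONE":
            zone = int(value.strip())
            return f"326{zone:02d}"  # 326xx = WGS 84 / UTM zone xx North
    return None  # no UTM_ZONE entry found

# Illustrative excerpt of a metadata file
sample_mtl = """\
    MAP_PROJECTION = "UTM"
    DATUM = "WGS84"
    UTM_ZONE = 19
"""
print(utm_epsg_from_mtl(sample_mtl))  # 32619
```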

Use GDAL and the __/vsis3/__ virtual file system to download a subset of the quality band (QA_PIXEL), which we will use to determine cloud cover over the terminus:

    gdalwarp -cutline path_to_shp.shp -crop_to_cutline /vsis3/usgs-landsat/collection02/level-1/standard/oli-tirs/yyyy/path/row/scene/scene_QA_PIXEL.TIF path_to_subset_QA_PIXEL.TIF
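
A sketch of how this command could be assembled for one scene is given below. It assumes the Level-1 collection path from this notebook and that AWS credentials are supplied separately (for example through the environment), so no keys appear in the command. `qa_pixel_warp_cmd` and the file names are hypothetical.

```python
def qa_pixel_warp_cmd(scene, cutline_shp, out_tif,
                      collectionpath="collection02/level-1/standard/"):
    """Return a gdalwarp argument list that crops a scene's QA_PIXEL band."""
    folders = {"LC08": "oli-tirs", "LE07": "etm"}  # sensor folders used here
    parts = scene.split("_")
    path, row, year = parts[2][:3], parts[2][3:], parts[3][:4]
    src = (f"/vsis3/usgs-landsat/{collectionpath}{folders[parts[0]]}/"
           f"{year}/{path}/{row}/{scene}/{scene}_QA_PIXEL.TIF")
    return ["gdalwarp", "-overwrite",
            "-cutline", cutline_shp, "-crop_to_cutline",
            "--config", "AWS_REQUEST_PAYER", "requester",
            src, out_tif]

cmd = qa_pixel_warp_cmd("LC08_L1TP_031006_20180505_20200901_02_T1",
                        "Box008_UTM_19.shp", "QA_PIXEL_Box008.TIF")
print(" ".join(cmd))
```

Because only the pixels inside the cutline are written out, each output file stays small even though the source scene is large.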


In [20]:
# Loop through all scenes:
for index, row in boxes_pr_df.iterrows():
    p = row['Path']; r = row['Row']; zone = row['Zone'] # grab path, row, zone
    BoxID = str(index)
    folder_name = 'Path'+p+'_Row'+r+'_c2'
    pr_folderpath = downloadpath+folder_name+'/' # path to the downloaded metadata files
    pathtoshp = basepath+"Box"+BoxID+"/Box"+BoxID # path to the terminus box shapefiles (all projections)
    pathtoshp_rp = pathtoshp+'_UTM_'+zone # path to the UTM projected box shapefile

    files = os.listdir(pr_folderpath) # grab the names of the Landsat scenes
    
    # for all files in the path row folders
    for file in files:
        scene = file[:40] # slice the filename to grab the scene name

        if scene.startswith('L') and 'T1' in scene: # L1TP scenes
            scene_year = scene[17:21] # grab the year from the scene name
            
            if scene.startswith('LC08'):
                collectionfolder='oli-tirs/'
            elif scene.startswith('LE07'):
                collectionfolder='etm/'
            else:
                continue # skip files from sensors not handled here
                
            # set path to the QA pixel Landsat files
            pathtoQAPIXEL='/vsis3/usgs-landsat/'+collectionpath+collectionfolder
            pathtoQAPIXEL+=scene_year+'/'
            pathtoQAPIXEL+=p+'/'+r+'/'
            pathtoQAPIXEL+=scene+'/'+scene+"_QA_PIXEL.TIF"
            
            # set path to the subset QA pixel files inside the path row folders
            subsetout = pr_folderpath+scene+'_QA_PIXEL_Box'+BoxID+'.TIF' 
            
            # if the file hasn't already been downloaded
            if not os.path.exists(subsetout):
                print('Downloading', scene)
                # construct download command
                QAPIXEL_dwnld_cmd='gdalwarp -overwrite -cutline '+pathtoshp_rp+'.shp -crop_to_cutline '
                QAPIXEL_dwnld_cmd+= pathtoQAPIXEL+' '+subsetout
                QAPIXEL_dwnld_cmd+=' --config AWS_REQUEST_PAYER requester --config AWS_REGION us-west-2'
                QAPIXEL_dwnld_cmd+=' --config AWS_SECRET_ACCESS_KEY '+SECRET_KEY
                QAPIXEL_dwnld_cmd+=' --config AWS_ACCESS_KEY_ID '+ACCESS_KEY

                subprocess.run(QAPIXEL_dwnld_cmd, shell=True, check=True)

Downloading LC08_L1TP_031006_20180505_20200901_02_T1
Downloading LC08_L1TP_031006_20200627_20200823_02_T1
Downloading LC08_L1TP_031006_20140526_20200911_02_T1
Downloading LC08_L1TP_031006_20150716_20200908_02_T1
Downloading LC08_L1TP_031006_20190913_20200826_02_T1
Downloading LC08_L1TP_031006_20190711_20200828_02_T1
Downloading LC08_L1TP_031006_20140814_20200911_02_T1
Downloading LC08_L1TP_031006_20150614_20200909_02_T1
Downloading LC08_L1TP_031006_20170619_20200903_02_T1
Downloading LC08_L1TP_031006_20180403_20200901_02_T1
Downloading LC08_L1TP_031006_20170416_20201015_02_T1
Downloading LC08_L1TP_031006_20140915_20200911_02_T1
Downloading LC08_L1TP_031006_20200713_20200912_02_T1
Downloading LC08_L1TP_031006_20160616_20201016_02_T1
Downloading LC08_L1TP_031006_20161006_20200906_02_T1
Downloading LC08_L1TP_031006_20190929_20200825_02_T1
Downloading LC08_L1TP_031006_20150411_20200909_02_T1
Downloading LC08_L1TP_031006_20140830_20200911_02_T1
Downloading LC08_L1TP_031006_20170705_20201016

Downloading LC08_L1TP_030006_20140706_20200911_02_T1
Downloading LC08_L1TP_030006_20200807_20200916_02_T1
Downloading LC08_L1TP_030006_20140908_20200911_02_T1
Downloading LC08_L1TP_030006_20190906_20200828_02_T1
Downloading LC08_L1TP_030006_20210506_20210517_02_T1
Downloading LC08_L1TP_030006_20210420_20210430_02_T1
Downloading LC08_L1TP_030006_20140823_20200911_02_T1
Downloading LC08_L1TP_030006_20160406_20200907_02_T1
Downloading LC08_L1TP_030006_20200503_20200820_02_T1
Downloading LC08_L1TP_030006_20210404_20210409_02_T1
Downloading LE07_L1TP_030006_20030513_20200916_02_T1
Downloading LC08_L1TP_030006_20140417_20200911_02_T1
Downloading LC08_L1TP_030006_20180717_20200831_02_T1
Downloading LC08_L1TP_030006_20180802_20200831_02_T1
Downloading LC08_L1TP_030006_20210522_20210529_02_T1
Downloading LC08_L1TP_030006_20200706_20200913_02_T1
Downloading LC08_L1TP_030006_20200229_20200822_02_T1
Downloading LC08_L1TP_030006_20160727_20200906_02_T1
Downloading LC08_L1TP_030006_20170714_20201015

Downloading LC08_L1TP_028006_20180905_20200831_02_T1
Downloading LC08_L1TP_028006_20190823_20200828_02_T1
Downloading LC08_L1TP_028006_20200910_20200919_02_T1
Downloading LC08_L1TP_028006_20190807_20200827_02_T1
Downloading LC08_L1TP_028006_20170427_20201015_02_T1
Downloading LC08_L1TP_028006_20191010_20200825_02_T1
Downloading LC08_L1TP_028006_20140318_20200911_02_T1
Downloading LC08_L1TP_028006_20180719_20200831_02_T1
Downloading LC08_L1TP_028006_20190417_20200829_02_T1
Downloading LC08_L1TP_028006_20150727_20200908_02_T1
Downloading LC08_L1TP_028006_20161001_20200906_02_T1
Downloading LC08_L1TP_028006_20210812_20210819_02_T1
Downloading LC08_L1TP_028006_20200521_20200820_02_T1
Downloading LC08_L1TP_028006_20210929_20211013_02_T1
Downloading LC08_L1TP_028006_20160713_20200906_02_T1
Downloading LC08_L1TP_028006_20140521_20200911_02_T1
Downloading LC08_L1TP_028006_20180804_20200831_02_T1
Downloading LE07_L1TP_028006_20020325_20200916_02_T1
Downloading LC08_L1TP_028006_20190401_20200829

Downloading LE07_L1TP_032005_20020609_20200916_02_T1
Downloading LC08_L1TP_032005_20180309_20200901_02_T1
Downloading LC08_L1TP_032005_20150621_20201015_02_T1
Downloading LC08_L1TP_032005_20210808_20210819_02_T1
Downloading LC08_L1TP_032005_20180528_20201015_02_T1
Downloading LC08_L1TP_032005_20160911_20200906_02_T1
Downloading LC08_L1TP_032005_20210605_20210614_02_T1
Downloading LE07_L1TP_032005_20010606_20200917_02_T1
Downloading LC08_L1TP_032005_20190920_20200826_02_T1
Downloading LC08_L1TP_032005_20150418_20200909_02_T1
Downloading LC08_L1TP_032005_20190413_20200828_02_T1
Downloading LC08_L1TP_032005_20150808_20200908_02_T1
Downloading LC08_L1TP_032005_20210621_20210629_02_T1
Downloading LC08_L1TP_032005_20210909_20210916_02_T1
Downloading LC08_L1TP_032005_20180816_20200831_02_T1
Downloading LC08_L1TP_032005_20170322_20200904_02_T1
Downloading LC08_L1TP_032005_20190904_20200826_02_T1
Downloading LC08_L1TP_032005_20180410_20201016_02_T1
Downloading LC08_L1TP_032005_20170914_20200903

# 5) Create buffer around terminus boxes

First, we calculate the buffer distance for each glacier, which we set equal to the longer side of its terminus box (in meters).

In [21]:
buffers = []
# Calculate a buffer distance around the terminus box:
for BoxID in BoxIDs:
    for file in os.listdir(basepath+'Box'+BoxID+'/'):
        if 'UTM' in file and '.shp' in file and "Box" in file: # identify UTM projected box
            boxpath = basepath+"Box"+BoxID+"/"+file  
            termbox = fiona.open(boxpath)
            
    # grab the box coordinates (x = easting, y = northing in the UTM projection):
    box = next(iter(termbox)); box_geom = box.get('geometry'); box_coords = box_geom.get('coordinates')[0]
    points = []
    for coord_pair in box_coords:
        x = coord_pair[0]; y = coord_pair[1]; points.append([x, y])
    # Calculate distance between coord 1 and 2 and between 2 and 3
    coord1 = points[0]; coord2 = points[1]; coord3 = points[2]   
    dist1 = distance(coord1[0], coord1[1], coord2[0], coord2[1]);
    dist2 = distance(coord2[0], coord2[1], coord3[0], coord3[1]) 
    buff_dist = int(np.max([dist1, dist2])) # pick the longer one as the buffer distance
    buffers.append(buff_dist)

# store as dataframe:
buff_df = pd.DataFrame(list(zip(BoxIDs, buffers)), columns=['BoxID', 'Buff_dist_m'])
buff_df



Unnamed: 0,BoxID,Buff_dist_m
0,8,9438
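
The `distance` helper called in the cell above is defined elsewhere in the notebook. As a minimal sketch, assuming it is a planar Euclidean distance between two points in UTM coordinates, the buffer calculation for a single box could look like the following. The names `side_length`, `buffer_distance`, and `example_box` are hypothetical and used only for illustration.

```python
import math

def side_length(p1, p2):
    """Euclidean distance between two (x, y) points in projected coordinates (meters)."""
    return math.hypot(p2[0] - p1[0], p2[1] - p1[1])

def buffer_distance(box_coords):
    """Return the longer of the first two sides of the box, truncated to whole meters."""
    side_a = side_length(box_coords[0], box_coords[1])  # vertex 1 to vertex 2
    side_b = side_length(box_coords[1], box_coords[2])  # vertex 2 to vertex 3
    return int(max(side_a, side_b))

# Hypothetical 3000 m x 4000 m rectangle in UTM coordinates (closed ring, 5 vertices)
example_box = [
    (500000.0, 7500000.0),
    (503000.0, 7500000.0),
    (503000.0, 7504000.0),
    (500000.0, 7504000.0),
    (500000.0, 7500000.0),
]
print(buffer_distance(example_box))  # prints 4000
```

Only the first three vertices are needed because opposite sides of a rectangular terminus box have equal length.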


In [22]:
# write to csv
buff_df.to_csv(basepath+BOX_FILENAME) 

Then, we create the buffer zone shapefile and reproject it to UTM using GDAL.

To create the buffer zone shapefile, we use the GDAL command **ogr2ogr** with the following syntax:

    ogr2ogr Buffer###.shp path_to_terminusbox###.shp  -dialect sqlite -sql "SELECT ST_Buffer(geometry, buffer_distance) AS geometry,*FROM 'Box###'" -f "ESRI Shapefile"

Then we reproject the buffer shapefiles to UTM, again using **ogr2ogr**, this time with the `-s_srs` and `-t_srs` options.

In [25]:
# loop through the buffer distance dataframe:
for index, row in buff_df.iterrows():
    BoxID = row['BoxID']
    zones = boxes_pr_df.loc[BoxID, 'Zone'] # grab zone matching BoxID from other dataframe
    buff_dist = str(row['Buff_dist_m'])
    
    # paths
    terminusbox_path = basepath+"Box"+BoxID+"/Box"+BoxID+".shp" # path to box shapefile
    outputbuffer_path = basepath+"Box"+BoxID+"/Buffer"+BoxID+".shp" # path and name of new buffer file
    
    # Set buffer creation command
    buffer_cmd = 'ogr2ogr '+outputbuffer_path+" "+terminusbox_path
    buffer_cmd +=' -dialect sqlite -sql "SELECT ST_Buffer(geometry, '+buff_dist+") AS geometry,*FROM 'Box"
    buffer_cmd +=BoxID+"'"+'" -f "ESRI Shapefile"'
    print("Command:", buffer_cmd)
    
    subprocess.run(buffer_cmd, shell=True, check=True) # run on terminal
    
    # Reprojection needs to happen for each zone
    for zone in zones:
        rp_shp = 'ogr2ogr -f "ESRI Shapefile" -t_srs EPSG:326'+zone+' -s_srs EPSG:'+source_srs+' '
        rp_shp += outputbuffer_path[:-4]+"_UTM_"+zone+".shp "+outputbuffer_path[:-4]+'.shp'
        subprocess.run(rp_shp, shell=True, check=True) # reproject

Command: ogr2ogr /home/jukes/Documents/Sample_glaciers/Box008/Buffer008.shp /home/jukes/Documents/Sample_glaciers/Box008/Box008.shp -dialect sqlite -sql "SELECT ST_Buffer(geometry, 9438) AS geometry,*FROM 'Box008'" -f "ESRI Shapefile"
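
The command above is assembled by string concatenation and run through the shell. A possible alternative, shown here only as a sketch, is to build the same `ogr2ogr` call as a list of arguments so that paths containing spaces do not need manual quoting. The function name `build_buffer_command` and the example paths are hypothetical; the `ogr2ogr` options are the ones already used in this notebook.

```python
def build_buffer_command(box_id, box_path, buffer_path, buffer_dist_m):
    """Assemble the ogr2ogr buffer command as an argument list (no shell quoting needed)."""
    sql = (
        f"SELECT ST_Buffer(geometry, {buffer_dist_m}) AS geometry, * "
        f"FROM 'Box{box_id}'"
    )
    return [
        "ogr2ogr",
        "-f", "ESRI Shapefile",
        buffer_path,           # destination shapefile
        box_path,              # source shapefile
        "-dialect", "sqlite",
        "-sql", sql,
    ]

# Hypothetical paths, for illustration only
cmd = build_buffer_command("008", "Box008/Box008.shp", "Box008/Buffer008.shp", 9438)
print(cmd)

# To execute it, pass the list directly, without shell=True:
# import subprocess
# subprocess.run(cmd, check=True)
```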


# 6) Download non-cloudy Landsat images from AWS

To remove cloudy images, we count the pixels in the terminus box whose QA_PIXEL value exceeds a threshold corresponding to cloud likelihood. If the percentage of cloudy pixels is above the cloud cover threshold, we won't download the image.

Additionally, we remove images that are primarily black (fill value of 0 or 1 in the QA_PIXEL band). This ensures that scenes that cut off partway across the glacier are excluded from further analysis. The fill percent threshold may need to be adjusted.
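
To make the two percentages concrete, here is a small sketch on a synthetic array rather than a real QA_PIXEL subset. The function name `cloud_and_fill_percent` and the pixel values in `qa` are illustrative assumptions. The comparisons mirror the ones used in the download cell: strictly greater than the cloud threshold, and 0 or 1 for fill.

```python
import numpy as np

def cloud_and_fill_percent(qa_pixel, cloud_thresh=22280, fill_max=1):
    """Return (cloud %, fill %) for a 2D array of QA_PIXEL values."""
    total = qa_pixel.size
    cloud_pct = 100.0 * np.count_nonzero(qa_pixel > cloud_thresh) / total
    fill_pct = 100.0 * np.count_nonzero(qa_pixel <= fill_max) / total
    return cloud_pct, fill_pct

# Synthetic 4 x 5 subset (20 pixels) with made-up values
qa = np.full((4, 5), 21824, dtype=np.uint16)  # below the cloud threshold
qa[0, :] = 1         # 5 fill pixels
qa[1, :3] = 22280    # exactly at the threshold, so NOT counted (strict >)
qa[2, :4] = 23888    # 4 pixels above the threshold

cloud_pct, fill_pct = cloud_and_fill_percent(qa)
print(cloud_pct, fill_pct)  # prints 20.0 25.0
```

With the recommended thresholds of 50% cloud and 60% fill, this synthetic subset would pass both checks.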

In [26]:
######################################################################################
# These are the recommended values. Adjust thresholds here:
QAPIXEL_thresh = 22280.0 # QA pixel value threshold to be considered cloud
cpercent_thresh = 50.0 # maximum cloud cover % in terminus box
fpercent_thresh = 60.0 # maximum fill % in terminus box
######################################################################################

In [27]:
# Download images that pass these thresholds:
for index, row in boxes_pr_df.iterrows():
    # grab paths
    p = row['Path']; zone = row['Zone']; r = row['Row']; BoxID = index; 
    folder_name = 'Path'+p+'_Row'+r+'_c2'
    pr_folderpath = downloadpath+folder_name+'/'
    bp_out = downloadpath+'Box'+BoxID+'/' # folder name for downloaded images
    if os.path.exists(bp_out): # create folder if it does not exist
        print("Box"+BoxID, " exists already. Skip creation of directory.")
    else:
        os.mkdir(bp_out)
        print("Box"+BoxID+" directory made.")
    
    # path to the shapefile covering the region that will be downloaded
    pathtobuffer = basepath+'Box'+BoxID+'/Buffer'+BoxID+'_UTM_'+zone+'.shp'  # buffer around box - recommended
#     pathtobox = basepath+'Box'+BoxID+'/Box'+BoxID+'_UTM_'+zone+'.shp' # just the box
    
    for scene in os.listdir(pr_folderpath):
        if scene.startswith('L') and scene.endswith(".TIF") and 'T1' in scene: # For Tier-1 images
            scene = scene[:40] # scene name
            year = scene[17:21] # grab acquisition year
            
            if scene.startswith("LC08"): # Landsat 8
                collectionfolder = 'oli-tirs/'; bands = L8_bands
            elif scene.startswith("LE07"): # Landsat 7
                collectionfolder = 'etm/'; bands = L7_bands
            else: # other sensors are not handled by this workflow
                continue
 
            QApixelpath = pr_folderpath+scene+'_QA_PIXEL_Box'+BoxID+'.TIF' # path to QA_PIXEL file
            subsetQApixel = mpimg.imread(QApixelpath) # read in as numpy array
            
            # calculate percentages of cloud and fill pixels
            totalpixels = subsetQApixel.shape[0]*subsetQApixel.shape[1] # total number of pixels
            cloudQApixel = subsetQApixel[subsetQApixel > QAPIXEL_thresh] # cloudy pixels (value > QAPIXEL_thresh)
            fillQApixel = subsetQApixel[subsetQApixel < 2.0] # fill pixels (value = 0 or 1)
            cloudpixels = len(cloudQApixel); fillpixels = len(fillQApixel) # count the cloudy and fill pixels
            cloudpercent = int(float(cloudpixels)/float(totalpixels)*100) # calculate percent cloudy
            fillpercent = int(float(fillpixels)/float(totalpixels)*100) # calculate percent fill
            print(scene, 'Cloud % ', cloudpercent, 'Fill %', fillpercent) # check values
            
            # evaluate thresholds
            if cloudpercent <= cpercent_thresh and fillpercent <= fpercent_thresh:
                # download the bands for that scene into your scene folders:
                for band in bands:
                        band = str(band) # string format
                        
                        # input path to your bands in AWS:
                        pathin = '/vsis3/usgs-landsat/'+collectionpath+collectionfolder+year+'/'+p+"/"+r+"/"+scene+"/"+scene+"_B"+band+".TIF"
                        
                        outfilename = scene+"_B"+band+'_Buffer'+BoxID+'.TIF' # output file name
                        pathout = downloadpath+'Box'+BoxID+'/'+outfilename # full output file path
                        
                        # if the file hasn't already been downloaded
                        if not os.path.exists(pathout):
                            # download
                            download_cmd = 'gdalwarp -overwrite -cutline '+pathtobuffer+' -crop_to_cutline '+pathin+' '+pathout
                            download_cmd+=' --config AWS_REQUEST_PAYER requester --config AWS_REGION us-west-2'
                            download_cmd+=' --config AWS_SECRET_ACCESS_KEY '+SECRET_KEY
                            download_cmd+=' --config AWS_ACCESS_KEY_ID '+ACCESS_KEY   
                            print('Downloading:', outfilename)
                            subprocess.run(download_cmd, shell=True, check=True)
                        else:
                            print(outfilename, 'exists')
            else:
                print(scene, 'failed cloud & fill thresholds')
                        

Box008  exists already. Skip creation of directory.
LC08_L1TP_031006_20181012_20200830_02_T1 Cloud %  59 Fill % 39
LC08_L1TP_031006_20181012_20200830_02_T1 failed cloud & fill thresholds
LC08_L1TP_031006_20140627_20200911_02_T1 Cloud %  46 Fill % 39
LC08_L1TP_031006_20140627_20200911_02_T1_B8_Buffer008.TIF exists
LC08_L1TP_031006_20180403_20200901_02_T1 Cloud %  60 Fill % 39
LC08_L1TP_031006_20180403_20200901_02_T1 failed cloud & fill thresholds
LC08_L1TP_031006_20160616_20201016_02_T1 Cloud %  17 Fill % 39
LC08_L1TP_031006_20160616_20201016_02_T1_B8_Buffer008.TIF exists
LC08_L1TP_031006_20170416_20201015_02_T1 Cloud %  60 Fill % 39
LC08_L1TP_031006_20170416_20201015_02_T1 failed cloud & fill thresholds
LC08_L1TP_031006_20190625_20200829_02_T1 Cloud %  60 Fill % 39
LC08_L1TP_031006_20190625_20200829_02_T1 failed cloud & fill thresholds
LC08_L1TP_031006_20180724_20200831_02_T1 Cloud %  60 Fill % 39
LC08_L1TP_031006_20180724_20200831_02_T1 failed cloud & fill thresholds
LC08_L1TP_031006_

LE07_L1TP_030006_20030716_20200915_02_T1 Cloud %  0 Fill % 46
Downloading: LE07_L1TP_030006_20030716_20200915_02_T1_B8_Buffer008.TIF
LC08_L1TP_030006_20190922_20200826_02_T1 Cloud %  12 Fill % 39
LC08_L1TP_030006_20190922_20200826_02_T1_B8_Buffer008.TIF exists
LC08_L1TP_030006_20150826_20200908_02_T1 Cloud %  0 Fill % 39
LC08_L1TP_030006_20150826_20200908_02_T1_B8_Buffer008.TIF exists
LC08_L1TP_030006_20180514_20200901_02_T1 Cloud %  60 Fill % 39
LC08_L1TP_030006_20180514_20200901_02_T1 failed cloud & fill thresholds
LC08_L1TP_030006_20210709_20210720_02_T1 Cloud %  60 Fill % 39
LC08_L1TP_030006_20210709_20210720_02_T1 failed cloud & fill thresholds
LC08_L1TP_030006_20150725_20200908_02_T1 Cloud %  56 Fill % 39
LC08_L1TP_030006_20150725_20200908_02_T1 failed cloud & fill thresholds
LC08_L1TP_030006_20210927_20210930_02_T1 Cloud %  49 Fill % 39
LC08_L1TP_030006_20210927_20210930_02_T1_B8_Buffer008.TIF exists
LE07_L1TP_030006_20030513_20200916_02_T1 Cloud %  0 Fill % 39
Downloading: LE07

LC08_L1TP_030006_20200908_20200919_02_T1 Cloud %  3 Fill % 39
LC08_L1TP_030006_20200908_20200919_02_T1_B8_Buffer008.TIF exists
LC08_L1TP_030006_20190602_20200828_02_T1 Cloud %  0 Fill % 39
LC08_L1TP_030006_20190602_20200828_02_T1_B8_Buffer008.TIF exists
LC08_L1TP_030006_20201010_20201015_02_T1 Cloud %  60 Fill % 39
LC08_L1TP_030006_20201010_20201015_02_T1 failed cloud & fill thresholds
LC08_L1TP_030006_20180701_20200831_02_T1 Cloud %  0 Fill % 39
LC08_L1TP_030006_20180701_20200831_02_T1_B8_Buffer008.TIF exists
LC08_L1TP_030006_20160625_20200906_02_T1 Cloud %  0 Fill % 39
LC08_L1TP_030006_20160625_20200906_02_T1_B8_Buffer008.TIF exists
LC08_L1TP_030006_20161015_20200905_02_T1 Cloud %  34 Fill % 39
LC08_L1TP_030006_20161015_20200905_02_T1_B8_Buffer008.TIF exists
LC08_L1TP_030006_20180615_20201016_02_T1 Cloud %  60 Fill % 39
LC08_L1TP_030006_20180615_20201016_02_T1 failed cloud & fill thresholds
LC08_L1TP_030006_20210725_20210803_02_T1 Cloud %  0 Fill % 39
LC08_L1TP_030006_20210725_202108

LE07_L1TP_029006_20000513_20200918_02_T1 Cloud %  0 Fill % 39
LE07_L1TP_029006_20000513_20200918_02_T1_B8_Buffer008.TIF exists
LC08_L1TP_029006_20170402_20200904_02_T1 Cloud %  60 Fill % 39
LC08_L1TP_029006_20170402_20200904_02_T1 failed cloud & fill thresholds
LC08_L1TP_029006_20150616_20201016_02_T1 Cloud %  60 Fill % 39
LC08_L1TP_029006_20150616_20201016_02_T1 failed cloud & fill thresholds
LC08_L1TP_029006_20130914_20200912_02_T1 Cloud %  56 Fill % 39
LC08_L1TP_029006_20130914_20200912_02_T1 failed cloud & fill thresholds
LC08_L1TP_029006_20190627_20200828_02_T1 Cloud %  60 Fill % 39
LC08_L1TP_029006_20190627_20200828_02_T1 failed cloud & fill thresholds
LC08_L1TP_029006_20170520_20200904_02_T1 Cloud %  60 Fill % 39
LC08_L1TP_029006_20170520_20200904_02_T1 failed cloud & fill thresholds
LC08_L1TP_029006_20160922_20200906_02_T1 Cloud %  35 Fill % 39
LC08_L1TP_029006_20160922_20200906_02_T1_B8_Buffer008.TIF exists
LC08_L1TP_029006_20180726_20200831_02_T1 Cloud %  60 Fill % 39
LC08_L1

LC08_L1TP_028006_20140505_20200911_02_T1 Cloud %  3 Fill % 39
LC08_L1TP_028006_20140505_20200911_02_T1_B8_Buffer008.TIF exists
LC08_L1TP_028006_20160526_20200906_02_T1 Cloud %  26 Fill % 39
LC08_L1TP_028006_20160526_20200906_02_T1_B8_Buffer008.TIF exists
LC08_L1TP_028006_20170427_20201015_02_T1 Cloud %  60 Fill % 39
LC08_L1TP_028006_20170427_20201015_02_T1 failed cloud & fill thresholds
LC08_L1TP_028006_20180414_20200901_02_T1 Cloud %  60 Fill % 39
LC08_L1TP_028006_20180414_20200901_02_T1 failed cloud & fill thresholds
LE07_L1TP_028006_20010407_20200917_02_T1 Cloud %  0 Fill % 39
LE07_L1TP_028006_20010407_20200917_02_T1_B8_Buffer008.TIF exists
LC08_L1TP_028006_20210321_20210401_02_T1 Cloud %  60 Fill % 39
LC08_L1TP_028006_20210321_20210401_02_T1 failed cloud & fill thresholds
LC08_L1TP_028006_20200910_20200919_02_T1 Cloud %  60 Fill % 39
LC08_L1TP_028006_20200910_20200919_02_T1 failed cloud & fill thresholds
LC08_L1TP_028006_20170716_20200903_02_T1 Cloud %  60 Fill % 39
LC08_L1TP_02800

LC08_L1TP_027006_20150821_20200908_02_T1 Cloud %  11 Fill % 46
LC08_L1TP_027006_20150821_20200908_02_T1_B8_Buffer008.TIF exists
LC08_L1TP_027006_20151008_20200908_02_T1 Cloud %  49 Fill % 46
LC08_L1TP_027006_20151008_20200908_02_T1_B8_Buffer008.TIF exists
LC08_L1TP_027006_20150618_20200909_02_T1 Cloud %  53 Fill % 46
LC08_L1TP_027006_20150618_20200909_02_T1 failed cloud & fill thresholds
LC08_L1TP_027006_20210226_20210303_02_T1 Cloud %  53 Fill % 46
LC08_L1TP_027006_20210226_20210303_02_T1 failed cloud & fill thresholds
LC08_L1TP_027006_20160823_20200906_02_T1 Cloud %  24 Fill % 46
LC08_L1TP_027006_20160823_20200906_02_T1_B8_Buffer008.TIF exists
Box008  exists already. Skip creation of directory.
LC08_L1TP_032005_20140922_20200910_02_T1 Cloud %  20 Fill % 39
LC08_L1TP_032005_20140922_20200910_02_T1_B8_Buffer008.TIF exists
LC08_L1TP_032005_20190616_20200830_02_T1 Cloud %  60 Fill % 39
LC08_L1TP_032005_20190616_20200830_02_T1 failed cloud & fill thresholds
LC08_L1TP_032005_20150824_20200
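
The download loop extracts the scene name and acquisition year by fixed character positions (`scene[:40]`, `scene[17:21]`). As a sketch of what those slices correspond to, the 40-character product name can also be split on underscores. The function name `parse_scene_name` and the dictionary keys are hypothetical; the example filename is taken from the output above.

```python
def parse_scene_name(filename):
    """Split the 40-character Landsat product name at the start of a filename into fields."""
    scene = filename[:40]
    sensor, level, pathrow, acquired, processed, collection, tier = scene.split("_")
    return {
        "scene": scene,
        "sensor": sensor,        # e.g. LC08 (Landsat 8) or LE07 (Landsat 7)
        "level": level,          # processing level, e.g. L1TP
        "path": pathrow[:3],     # WRS-2 path
        "row": pathrow[3:],      # WRS-2 row
        "acquired": acquired,    # acquisition date, YYYYMMDD
        "year": acquired[:4],    # same value as scene[17:21]
        "processed": processed,  # processing date, YYYYMMDD
        "collection": collection,
        "tier": tier,
    }

info = parse_scene_name("LC08_L1TP_031006_20140627_20200911_02_T1_B8_Buffer008.TIF")
print(info["path"], info["row"], info["year"], info["tier"])  # prints 031 006 2014 T1
```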

# 7) Automatically grab the image acquisition dates from the metadata files

In [28]:
datetimes = [] # list of scene datetimes
scenes_dated = [] # list of scenes

for BoxID in BoxIDs:
    bp_out = downloadpath+'Box'+BoxID+'/' # path to downloaded images for that glacier
    
    # Grab all path row folder names from boxes_pr_df:
    paths = boxes_pr_df.loc[BoxID,'Path']
    rows = boxes_pr_df.loc[BoxID,'Row']
    
    # Grab the downloaded scenes
    downloaded_scenes = os.listdir(bp_out)
    for scene in downloaded_scenes:
        if scene.startswith('L') and 'T1' in scene and scene.endswith('.TIF'):
            scenename = scene[:40]
            
            # Search for metadata file in each path, row folder:
            found = False # not found yet
            for a in range(0, len(paths)): # look in each path row folder
                folder_name = 'Path'+paths[a]+'_Row'+rows[a]+'_c2'
                folderpath = downloadpath+folder_name+'/'
                
                # if not there
                if not os.path.exists(folderpath+scenename+'_MTL.txt'):
                    continue # skip to the next folder
                else: # if there
                    # open the metadata file (closed automatically at the end of the with block)
                    with open(folderpath+scenename+"_MTL.txt", "r") as mdata:
                        # find the acquisition date in the file
                        for line in mdata:
                            variable = line.split("=")[0]
                            if "DATE_ACQUIRED" in variable:
                                date = line.split("=")[1].strip() # find acquisition date
                    # save scenename and date
                    dates = datetime.datetime.strptime(date, '%Y-%m-%d') # save as datetime object
                    print(scenename, dates)
                    datetimes.append(dates); scenes_dated.append(scenename) # store in lists
                    
                    found = True # found the file
                    break # stop search
            
            if not found: # if the file was not found at all
                # grab acquisition date from the filename
                date = scene[17:25]
                dates = datetime.datetime.strptime(date, '%Y%m%d') # save as datetime object
                print(scenename, 'missing metadata file. Guessing from filename instead:', dates)
                datetimes.append(dates); scenes_dated.append(scenename) # store the guessed date as well

# Store in a dataframe
datetime_df = pd.DataFrame(list(zip(scenes_dated, datetimes)), columns=['Scene', 'datetime'])
datetime_df = datetime_df.sort_values(by='datetime', ascending=True); datetime_df = datetime_df.drop_duplicates()
datetime_df

LC08_L1TP_030006_20150810_20200908_02_T1 2015-08-10 00:00:00
LC08_L1TP_032005_20181003_20200830_02_T1 2018-10-03 00:00:00
LC08_L1TP_031006_20141001_20200910_02_T1 2014-10-01 00:00:00
LC08_L1TP_028006_20170902_20200903_02_T1 2017-09-02 00:00:00
LC08_L1TP_031006_20190321_20200829_02_T1 2019-03-21 00:00:00
LC08_L1TP_030006_20210911_20210916_02_T1 2021-09-11 00:00:00
LC08_L1TP_029006_20130930_20200912_02_T1 2013-09-30 00:00:00
LC08_L1TP_031006_20160616_20201016_02_T1 2016-06-16 00:00:00
LC08_L1TP_028006_20200622_20200824_02_T1 2020-06-22 00:00:00
LC08_L1TP_031006_20180505_20200901_02_T1 2018-05-05 00:00:00
LC08_L1TP_031006_20200713_20200912_02_T1 2020-07-13 00:00:00
LC08_L1TP_031005_20210630_20210708_02_T1 2021-06-30 00:00:00
LC08_L1TP_030006_20190704_20200830_02_T1 2019-07-04 00:00:00
LC08_L1TP_028006_20190620_20200830_02_T1 2019-06-20 00:00:00
LC08_L1TP_029006_20130829_20200912_02_T1 2013-08-29 00:00:00
LC08_L1TP_031006_20210918_20210925_02_T1 2021-09-18 00:00:00
LC08_L1TP_031006_2020083

Unnamed: 0,Scene,datetime
189,LE07_L1TP_027006_19990630_20200918_02_T1,1999-06-30
51,LE07_L1TP_030006_19990705_20200918_02_T1,1999-07-05
93,LE07_L1TP_027006_19990716_20200918_02_T1,1999-07-16
188,LE07_L1TP_028006_19990723_20200918_02_T1,1999-07-23
133,LE07_L1TP_031006_19990829_20200918_02_T1,1999-08-29
...,...,...
102,LC08_L1TP_031005_20210918_20210925_02_T1,2021-09-18
15,LC08_L1TP_031006_20210918_20210925_02_T1,2021-09-18
272,LC08_L1TP_027006_20210922_20210930_02_T1,2021-09-22
25,LC08_L1TP_030006_20210927_20210930_02_T1,2021-09-27


In [29]:
# write dates to csv
datetime_df.to_csv(basepath+DATES_FILENAME, sep=',') 
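
The date lookup in this step can be condensed into one small function, sketched below under the assumption that the metadata file contains a line of the form `DATE_ACQUIRED = YYYY-MM-DD`. The function name `acquisition_date` and the metadata excerpt are made up for illustration; the filename fallback uses the same character positions as the cell above.

```python
import io
from datetime import datetime

def acquisition_date(mtl_lines, scene_name):
    """Return the acquisition date from MTL lines, or from the scene name if absent."""
    for line in mtl_lines:
        key, sep, value = line.partition("=")
        if sep and key.strip() == "DATE_ACQUIRED":
            return datetime.strptime(value.strip().strip('"'), "%Y-%m-%d")
    # fallback: characters 17-24 of the scene name hold YYYYMMDD
    return datetime.strptime(scene_name[17:25], "%Y%m%d")

scene = "LC08_L1TP_030006_20150810_20200908_02_T1"

# Hypothetical metadata excerpt standing in for an *_MTL.txt file
mtl_text = io.StringIO(
    "  GROUP = IMAGE_ATTRIBUTES\n"
    "    DATE_ACQUIRED = 2015-08-10\n"
    "  END_GROUP = IMAGE_ATTRIBUTES\n"
)
from_metadata = acquisition_date(mtl_text, scene)
from_filename = acquisition_date([], scene)  # no metadata available
print(from_metadata, from_filename)
```

Both calls return the same date for this scene, which is why the filename is a reasonable fallback when the metadata file is missing.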

# 8) Delete all quality band files (*QA_PIXEL.TIF) to save space

These files will not be needed after the download step, so they can be removed to save space.

In [30]:
for BoxID in BoxIDs:
    # Grab all path row folder names from boxes_pr_df:
    paths = boxes_pr_df.loc[BoxID,'Path']; rows = boxes_pr_df.loc[BoxID,'Row']
    
    for a in range(0, len(paths)): # look in each path row folder
        folder_name = 'Path'+paths[a]+'_Row'+rows[a]+'_c2'
        folderpath = downloadpath+folder_name+'/'
        
        # remove all files with QA_PIXEL in the name
        for file in os.listdir(folderpath):
            if 'QA_PIXEL' in file:
                os.remove(folderpath+file)
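
Because deletion cannot be undone, it may help to list the matching files before removing them. The sketch below adds a dry-run option. The function name `remove_qa_pixel_files` and the file names are hypothetical, and the example runs in a temporary directory rather than in the real download folders.

```python
import tempfile
from pathlib import Path

def remove_qa_pixel_files(folder, dry_run=True):
    """Find files with QA_PIXEL in the name; delete them only if dry_run is False."""
    matches = sorted(Path(folder).glob("*QA_PIXEL*"))
    if not dry_run:
        for path in matches:
            path.unlink()
    return [path.name for path in matches]

with tempfile.TemporaryDirectory() as tmp:
    # create two empty placeholder files
    for name in ("scene_QA_PIXEL_Box008.TIF", "scene_MTL.txt"):
        (Path(tmp) / name).touch()

    listed = remove_qa_pixel_files(tmp)                  # dry run: nothing is deleted
    removed = remove_qa_pixel_files(tmp, dry_run=False)  # actually delete
    remaining = sorted(p.name for p in Path(tmp).iterdir())

print(listed, removed, remaining)
```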