# Step 2: Create and Push a Batch Run to AWS

**WARNING**: Make sure everything is set correctly BEFORE running this notebook!  Since this notebook starts processing on the AWS servers, it has the potential  to consume a lot of resources, aka dollars...

This notebook:
* Imports the ASF database query file we created in step 1
* chooses a set of interferograms to process
* prepares the files needed to process those interferograms
* uploads files to AWS servers and starts the processing

Requires the following files to be in the folder where you run the code:
* apmb.geojson (file with coordinates for APMB)
* query.geojson (file with results of ASF database query)
* template.yml (yaml file with ISCE processing parameters)

## 2.1: Import Python packages

Import the packages needed to run this notebook.  You may need to install the dinosar package first.

In [None]:
# if dinosar library not in base environment uncomment below (run just once)
#!pip install --no-cache git+https://github.com/scottyhq/dinosar.git@master

In [1]:
import subprocess
import os
import dinosar
import geopandas as gpd
import getpass

## 2.2: Choose processing parameters

Make sure you have this section set the way you want it before running later cells!  

In this section we'll set:
* Which interferograms to make
* ISCE processing parameters (swaths, filters, etc.)
* AWS processing parameters (S3 bucket name, AWS job name)

### 2.2.1 Choose which interferograms to process
This section sets the track to process, and makes pairs from one date each month for all the data from that track.

Choose which track to process:

In [2]:
# choose track to process
track = 76
# load the ASF search results that you generated in Step 1
gf = dinosar.archive.asf.load_inventory('query.geojson')
# create a new dataframe with only the track selected, in the date bounds selected
gdf=gf.query('relativeOrbit == @track')

This section is currently set up to then make sequential pairs with one date from each month.  However, that can be changed in future!

In [3]:
# create yet another dataframe with only some info so we can quickly check that we selected the data we actually want
df = gdf.loc[:,['frameNumber','dateStamp','relativeOrbit']].sort_values(by='dateStamp')
df.head()

Unnamed: 0,frameNumber,dateStamp,relativeOrbit
1072,1100,2014-11-03,76
1071,1105,2014-11-03,76
1070,1110,2014-11-03,76
1054,1105,2014-11-27,76
1055,1100,2014-11-27,76


In [4]:
# Take only unique dates and set DateTimeIndex
df  = df.drop_duplicates('dateStamp').set_index('dateStamp', drop=False)
df.head()

Unnamed: 0_level_0,frameNumber,dateStamp,relativeOrbit
dateStamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2014-11-03,1100,2014-11-03,76
2014-11-27,1105,2014-11-27,76
2014-12-21,1100,2014-12-21,76
2015-01-14,1099,2015-01-14,76
2015-02-07,1105,2015-02-07,76


In [7]:
# function that puts two dates in the correct format for an interferogram pair
def dates_to_pairs(dates):
    pairs = []
    for i in range(len(dates)-1):
        master = dates[i]
        second = dates[i+1]
        intdir = 'int-{0}-{1}'.format(master, second)
        pairs.append(intdir)
    
    return pairs

In [8]:
# function that creates pairs with one acquisition from each month
def one_per_month(df):
    # Single date per month
    tmp = df.resample('M').first()
    tmp1 = tmp.sort_index(ascending=False)
    dates = tmp1.dateStamp.apply(lambda x: x.strftime('%Y%m%d')).to_list()
    pairs = dates_to_pairs(dates)
    
    return pairs

In [9]:
# create the pairs
pairs = one_per_month(df)
# print the number of pairs and a list of the pairs
print(f'{len(pairs)-1} Interferograms to generate:')
pairs

66 Interferograms to generate:


['int-20200505-20200405',
 'int-20200405-20200306',
 'int-20200306-20200205',
 'int-20200205-20200106',
 'int-20200106-20191201',
 'int-20191201-20191101',
 'int-20191101-20191002',
 'int-20191002-20190902',
 'int-20190902-20190803',
 'int-20190803-20190704',
 'int-20190704-20190604',
 'int-20190604-20190505',
 'int-20190505-20190405',
 'int-20190405-20190306',
 'int-20190306-20190210',
 'int-20190210-20190105',
 'int-20190105-20181206',
 'int-20181206-20181106',
 'int-20181106-20181001',
 'int-20181001-20180907',
 'int-20180907-20180802',
 'int-20180802-20180703',
 'int-20180703-20180603',
 'int-20180603-20180504',
 'int-20180504-20180404',
 'int-20180404-20180305',
 'int-20180305-20180209',
 'int-20180209-20180104',
 'int-20180104-20171211',
 'int-20171211-20171105',
 'int-20171105-20171012',
 'int-20171012-20170906',
 'int-20170906-20170801',
 'int-20170801-20170708',
 'int-20170708-20170602',
 'int-20170602-20170509',
 'int-20170509-20170415',
 'int-20170415-20170304',
 'int-201703

### 2.2.2: Set ISCE Processing Parameters

1. Open the "template.yml" file and save a copy with a new name for this batch run (e.g., "Template_83_JanFeb2019.yml").
2. In the new .yml file, change the swath numbers to match the track you want to process:
  * 156: 1,2
  * 149: 2,3
  * 76: 1,2
  * 83: 2,3
3. Optional - change other ISCE processing parameters in the template as needed.

In [10]:
# enter the name of the .yml template file here
template="templateA76_1month_All.yml"

### 2.2.3: Set AWS parameters
These are the parameters you need to set for each job you push to AWS:
* Destination for processing files on AWS (folder name)
* Job name - to track your job while it is processing

Ideally, these names should tell you something about the processing job (i.e., "Track83_2014-2020_1month")

In [11]:
# aws parameters:
dirname = 'A76-1month-All-TEST' #name of the folder on AWS where your processing files will be
jobname = 'A76-1month-All-TEST' #name for the job on AWS

## 2.3: Create processing directories and push them to AWS

Now that we've defined our processing parameters and chosen the interferograms to process, we'll create a processing directory for each interferogram, and then upload those directories to the AWS server (S3).

In [14]:
# set full path to bucket on S3
bucket = 's3://dinosar/processing/uturuncu/' + dirname
# create a text file with the interferogram pairs to process
pairsFile = 'pairs.txt'

paths = [bucket+'/'+x for x in pairs]
with open(pairsFile, 'w') as f:
    f.write('\n'.join(paths))

cmd = f'aws s3 cp {pairsFile} {bucket}/{pairsFile}'#write the command to push pairs.txt to the AWS bucket

print(cmd)

subprocess.call(cmd, shell=True) # runs the command to push the files with pairs to process to the AWS bucket

aws s3 cp pairs.txt s3://dinosar/processing/uturuncu/A76-1month-All-TEST/pairs.txt


0

In [15]:
with open(pairsFile) as f:
    pairs = [line.rstrip() for line in f]
    mapping = dict(enumerate(pairs))
mapping

{0: 's3://dinosar/processing/uturuncu/A76-1month-All-TEST/int-20200505-20200405',
 1: 's3://dinosar/processing/uturuncu/A76-1month-All-TEST/int-20200405-20200306',
 2: 's3://dinosar/processing/uturuncu/A76-1month-All-TEST/int-20200306-20200205',
 3: 's3://dinosar/processing/uturuncu/A76-1month-All-TEST/int-20200205-20200106',
 4: 's3://dinosar/processing/uturuncu/A76-1month-All-TEST/int-20200106-20191201',
 5: 's3://dinosar/processing/uturuncu/A76-1month-All-TEST/int-20191201-20191101',
 6: 's3://dinosar/processing/uturuncu/A76-1month-All-TEST/int-20191101-20191002',
 7: 's3://dinosar/processing/uturuncu/A76-1month-All-TEST/int-20191002-20190902',
 8: 's3://dinosar/processing/uturuncu/A76-1month-All-TEST/int-20190902-20190803',
 9: 's3://dinosar/processing/uturuncu/A76-1month-All-TEST/int-20190803-20190704',
 10: 's3://dinosar/processing/uturuncu/A76-1month-All-TEST/int-20190704-20190604',
 11: 's3://dinosar/processing/uturuncu/A76-1month-All-TEST/int-20190604-20190505',
 12: 's3://din

In [17]:
script = 'prep_topsApp_local'
for i,p in enumerate(pairs):
    intname = os.path.basename(p)
    junk,master,slave=intname.split('-')
    intdir = f'int-{master}-{slave}'
    cmd = f'{script} -i query.geojson -m {master} -p {track} -s {slave} -t {template}'
    print(i, cmd)
    subprocess.call(cmd, shell=True) # runs the command to make processing directories on the local system  

0 prep_topsApp_local -i query.geojson -m 20200505 -p 76 -s 20200405 -t templateA76_1month_All.yml
1 prep_topsApp_local -i query.geojson -m 20200405 -p 76 -s 20200306 -t templateA76_1month_All.yml
2 prep_topsApp_local -i query.geojson -m 20200306 -p 76 -s 20200205 -t templateA76_1month_All.yml
3 prep_topsApp_local -i query.geojson -m 20200205 -p 76 -s 20200106 -t templateA76_1month_All.yml
4 prep_topsApp_local -i query.geojson -m 20200106 -p 76 -s 20191201 -t templateA76_1month_All.yml
5 prep_topsApp_local -i query.geojson -m 20191201 -p 76 -s 20191101 -t templateA76_1month_All.yml
6 prep_topsApp_local -i query.geojson -m 20191101 -p 76 -s 20191002 -t templateA76_1month_All.yml
7 prep_topsApp_local -i query.geojson -m 20191002 -p 76 -s 20190902 -t templateA76_1month_All.yml
8 prep_topsApp_local -i query.geojson -m 20190902 -p 76 -s 20190803 -t templateA76_1month_All.yml
9 prep_topsApp_local -i query.geojson -m 20190803 -p 76 -s 20190704 -t templateA76_1month_All.yml
10 prep_topsApp_loca

Now we should have a processing directory for each interferogram in this directory.  Each processing directory should have two files:
* topsApp.xml  = input file for ISCE processing
* download-links.txt = text file with the links to download all the data we'll need for processing

Now we push all those directories to the S3 cloud storage on AWS:

In [None]:
# Move these to cloud storage
# Push folder of text files to S3
for i,p in enumerate(pairs):
    intname = os.path.basename(p)
    junk,master,slave=intname.split('-')
    intdir = f'int-{master}-{slave}'
    cmd = f'aws s3 sync {intdir}/ {bucket}/{intdir}/'
    print(cmd)
    subprocess.call(cmd, shell=True)

print(f'Moved files to {bucket}')

## 2.4: Launch Processing on AWS (WARNING: can consume lots of AWS resources!!)
Now that we have all the files we need for processing on the AWS servers, we can start processing!  Don't run these cells until you're *SURE* you've got the interferograms you want!

In [None]:
# Enter your NASA URS password to download SLCs
nasauser = 'pmacqueen' # NASA EarthData username
nasapass = getpass.getpass() # NASA EarthData password (will create an interactive textbox)

In [None]:
# don't change these:
demDir = 's3://dinosar/processing/uturuncu/dem' #where the DEM is stored on AWS
jobdef = 'uturuncu-array' # sets certain parameters on AWS
jobqueue = 'uturuncu-queue' # sets certain parameters on AWS
array_size = len(pairs)


# NOTE: job-name, job-queue, and job-definition are JSON files that I've created for AWS Batch
# The specify type of computers to use, etc
cmd = f"aws batch submit-job \
--job-name {jobname} \
--job-queue {jobqueue} \
--job-definition {jobdef} \
--array-properties size={array_size} \
--parameters 'S3_PAIRS={bucket}/{pairsFile},S3_DEM={demDir}' \
--container-overrides 'environment=[{{name=NASAUSER,value={nasauser}}},{{name=NASAPASS,value={nasapass}}}]' \
"

# WARNING: this prints your password as plain text, careful not to push to github
# If you run the command in terminal sometimes the error messages are more helpful
#print(cmd) # uncomment to print the command for debugging purposes.

In [None]:
# Run this cell to start processing!
aws_batchout = subprocess.check_output(cmd, shell=True) # runs the AWS command to start the batch job
aws_batchout # prints out the job-id - make a note of the job id for finding the logs later!


In [None]:
# Write a text file with a summary of this job for reference later
sumfilename = 'summary_' + jobname + '.txt'

with open(sumfilename, 'w') as sf:
    sf.write('track = '+ str(track) + '\n')
    sf.write('template file = '+ template + '\n')
    sf.write('S3 Bucket Name = ' + dirname + '\n')
    sf.write('Job Name = ' + jobname)
    sf.write('Job Name and Job ID on AWS = ' + aws_batchout)

## 2.5: Wait for the jobs to finish!

* The end products will be in "s3://dinosar/results/uturuncu/(dirname)/(int_dir)/merged"
* Check the file "summary_(jobname).txt" for a summary of the processing parameters you set in this script
* You can monitor jobs here: https://us-west-2.console.aws.amazon.com/batch/home?region=us-west-2#/jobs/queue/arn:aws:batch:us-west-2:783380859522:job-queue~2Futuruncu-queue?state=PENDING
* After 24 hours, you'll have to look up your job here using the job id: https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logStream:group=/aws/batch/job;streamFilter=typeLogStreamPrefix