### <span style="color: blue">Download waveform data from SCEDC public data set (on AWS) for the 2019 Ridgecrest earthquake sequence</span>

#### The aim of this tutorial is to demonstrate how to leverage several of the tools offered by the SCEDC to query and download data.  
  
**We will download waveform data for the 2019 Ridgecrest earthquake sequence from the public data set for stations of the GS network code.**  

The GS network code belongs to the US Geological Survey Networks. These GS stations were installed especially to collect data for the 2019 Ridgecrest sequence.

#### Pre-requisites

In order to use this code, you must have

a. an AWS account  
b. AWS credentials  
c. the AWS command line interface (CLI); ``pip install awscli``  
**OR**  
c. the Python module boto3; ``pip install boto3``  


Instructions on how to obtain a. and b. can be found at https://scedc.caltech.edu/data/getstarted-pds.html under Requirements


1. **The first tool we will use is the *Clickable Station Map* to get some basic metadata information regarding the GS stations that were deployed during this time period.** The map is available at https://service.scedc.caltech.edu/SCSNStationMap/station.html
  
  
    ``Using the "Control Panel" on the right, select networks = GS and click on "Display".``
  
  
     This will bring up the GS stations on the map. These stations are also available to be downloaded as a CSV file, under Station List. The downloaded file will be named station.csv
  
  
     ``Click on "Download as csv" to save the CSV file to your computer.``   
     ![img](./map.png)
       

2. The next step is to parse the network and station codes out of the CSV file so we can use them to check for data availability.

In [1]:
# Open the CSV file and parse out the network and station codes

with open("station.csv") as fp:
    all_lines = [line.strip().replace('"','') for line in fp.readlines()[2:]]
    
stations = [(line.split(',')[0],line.split(',')[1]) for line in all_lines]
print(stations)

[('GS', 'CA01'), ('GS', 'CA02'), ('GS', 'CA03'), ('GS', 'CA04'), ('GS', 'CA05'), ('GS', 'CA06'), ('GS', 'CA07'), ('GS', 'CA08'), ('GS', 'CA09'), ('GS', 'CA10')]


3. Find what data is available for these stations using the SCEDC FDSN availability web service. To narrow our search results, we will consider only HH1 channels. The availability web service has two endpoints: extent and query.

    3a. Using the extent endpoint of the availability web service, find what time spans of data are available for the HH1 channels of the GS network.  
      
    3b. Next, use the query endpoint of the availability web service with format=request and a request duration of 1 day to get a list of requests. Each entry in this list is a channel and time span that we want waveform data for. Save this list of requests to a file named avail.txt

In [2]:
# 3a. Use the extent endpoint of the availability web service, 
# find what time spans of data are available for the HH1 channels of the GS network. 

# define some variables for reuse

network = 'GS'
channel = 'HH1'

avail_extent_url = 'https://service.scedc.caltech.edu/fdsnws/availability/1/extent?'\
                   'network=' + network + '&channel=' + channel + '&format=text&nodata=404'

import urllib.request
with urllib.request.urlopen(avail_extent_url) as response:
    extent_out = response.read().decode('utf-8')

print(extent_out)

# As seen below, the earliest data for HH1 channels starts on 2019-07-08. 

#Network Station Location Channel Quality SampleRate Earliest Latest Updated TimeSpans Restriction
GS CA01 00 HH1 D 100.00 2019-07-08T00:40:57.000000Z 2019-10-30T19:10:34.000000Z NA NA OPEN
GS CA02 00 HH1 D 100.00 2019-07-08T01:54:23.000000Z 2019-07-26T01:55:33.000000Z NA NA OPEN
GS CA03 00 HH1 D 100.00 2019-07-08T20:24:36.000000Z 2020-02-18T21:23:38.000000Z NA NA OPEN
GS CA04 00 HH1 D 100.00 2019-07-09T21:55:08.000000Z 2019-11-27T18:10:30.000000Z NA NA OPEN
GS CA05 00 HH1 D 100.00 2019-07-10T18:31:16.000000Z 2020-01-22T21:06:13.000000Z NA NA OPEN
GS CA06 00 HH1 D 100.00 2019-07-11T00:29:52.000000Z 2020-02-06T18:03:34.000000Z NA NA OPEN
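
The text output above is whitespace-delimited, so it can be turned into structured records for further filtering. The following is a minimal sketch, not part of the SCEDC tooling: `parse_extent` is a hypothetical helper, the sample rows are copied from the output above, and it assumes every column (including Location) is non-empty so that a plain whitespace split lines up with the header.

```python
from datetime import datetime

# Sample copied from the extent output above (first three stations only).
extent_text = """\
#Network Station Location Channel Quality SampleRate Earliest Latest Updated TimeSpans Restriction
GS CA01 00 HH1 D 100.00 2019-07-08T00:40:57.000000Z 2019-10-30T19:10:34.000000Z NA NA OPEN
GS CA02 00 HH1 D 100.00 2019-07-08T01:54:23.000000Z 2019-07-26T01:55:33.000000Z NA NA OPEN
GS CA03 00 HH1 D 100.00 2019-07-08T20:24:36.000000Z 2020-02-18T21:23:38.000000Z NA NA OPEN
"""


def parse_extent(text):
    """Hypothetical helper: turn whitespace-delimited extent rows into dicts.

    Assumes no column is blank, so splitting on whitespace matches the header.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].lstrip('#').split()
    rows = []
    for ln in lines[1:]:
        row = dict(zip(header, ln.split()))
        # Convert the two timestamp columns to datetime objects.
        for col in ('Earliest', 'Latest'):
            row[col] = datetime.strptime(row[col], '%Y-%m-%dT%H:%M:%S.%fZ')
        rows.append(row)
    return rows


extents = parse_extent(extent_text)

# Earliest start time across the parsed channels.
first_start = min(row['Earliest'] for row in extents)
print(first_start)
```

With the records in this form, it is straightforward to pick only the stations whose time span covers a day of interest.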



In [3]:
# 3b. use the query endpoint of the availability web service with format=request, 
# and request duration of 1 day, to get a list of requests. 
# This is the list of channel + duration that we want to get waveform data for. 
# Save this list of requests to a file named avail.txt

# The day chosen to get the data for
starttime = '2019-07-09T00:00:00'
endtime = '2019-07-09T23:59:59'

import os
avail_file = os.path.join('/tmp','avail.txt')

avail_query_url = 'https://service.scedc.caltech.edu/fdsnws/availability/1/query?'\
                    'network=' + network + '&channel=' + channel + '&format=request&nodata=404'\
                    '&starttime=' + starttime + '&endtime=' + endtime

with urllib.request.urlopen(avail_query_url) as response:
    with open(avail_file,"w") as fp:
        fp.write(response.read().decode('utf-8'))
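
As an alternative to concatenating the query string by hand, the same URL can be assembled from a dictionary of parameters with the standard library. This is only a sketch under the assumption that the parameter names and values are exactly those used in the cell above; `base_url`, `params` and `query_url` are illustrative names.

```python
from urllib.parse import urlencode

# Endpoint and parameter values taken from the cell above.
base_url = 'https://service.scedc.caltech.edu/fdsnws/availability/1/query'

params = {
    'network': 'GS',
    'channel': 'HH1',
    'format': 'request',
    'nodata': '404',
    'starttime': '2019-07-09T00:00:00',
    'endtime': '2019-07-09T23:59:59',
}

# safe=':' leaves the colons in the timestamps as-is instead of encoding them as %3A.
query_url = base_url + '?' + urlencode(params, safe=':')
print(query_url)
```

Keeping the parameters in a dictionary makes it easier to change the day or channel without editing a long string.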


4. Use the program called pds-fetch-continuous-data to retrieve waveform data from the public data set for the list we produced in the previous step.  
    This program is available at https://github.com/SCEDC/cloud.git and is one of several example programs provided for accessing the public data set.   
    
    IMPORTANT: Before proceeding, ensure you have fulfilled the prerequisites listed at the beginning of the tutorial.
     
     
     
     

In [None]:
!git clone https://github.com/SCEDC/cloud.git

In [None]:
!python3 cloud/pds-fetch-continuous-data/fetch_continuous_data.py --help


In [4]:
!python3 cloud/pds-fetch-continuous-data/fetch_continuous_data.py --infile /tmp/avail.txt --outdir /tmp


Starting download from pds for 3 requests...
Dividing requests between 1 process(es)
[ 858826 ] ['aws', 's3', 'cp', '--quiet', 's3://scedc-pds/continuous_waveforms/2019/2019_190/GSCA01_HH100_2019190.ms', '/tmp/GSCA01_HH100_2019190.ms']
[ 858826 ] ['aws', 's3', 'cp', '--quiet', 's3://scedc-pds/continuous_waveforms/2019/2019_190/GSCA02_HH100_2019190.ms', '/tmp/GSCA02_HH100_2019190.ms']
[ 858826 ] ['aws', 's3', 'cp', '--quiet', 's3://scedc-pds/continuous_waveforms/2019/2019_190/GSCA03_HH100_2019190.ms', '/tmp/GSCA03_HH100_2019190.ms']
[ 858826 ] Time to download per process is  7 seconds
[ 858826 ] MB downloaded    =  41.88720703125
[ 858826 ] Rate of download =  5.98388671875  MB/sec


Summary : 

TOTAL MB downloaded :  41.88720703125 
AVG time to download per process :  7.0 seconds



In [5]:
!ls /tmp/*.ms


/tmp/GSCA01_HH100_2019190.ms  /tmp/GSCA03_HH100_2019190.ms
/tmp/GSCA02_HH100_2019190.ms  /tmp/GSCA04_HH100_2019190.ms


# OR

Another option for step 4 is to use the boto3 Python module, provided by AWS, to list and download objects (files) from the SCEDC public data set (pds). 

In [6]:
import boto3
import os
import datetime

region = 'us-west-2'

# to get your access credentials, see Pre-requisites a. and b. of this tutorial 
aws_access_key_id = 'somekey' 
aws_secret_access_key = 'sshhhh' 

bucket = 'scedc-pds'
prefix = 'continuous_waveforms'

# The SCEDC public data set stores miniSEED data in day-long files. 
# A day is the minimum temporal granularity for retrieval.
#
# So, convert the starttime (or endtime) to year_julianday, which is the path naming format expected by the pds.
#
year = datetime.datetime.fromisoformat(starttime+'+00:00').strftime('%Y') #2019
julian_day = datetime.datetime.fromisoformat(starttime+'+00:00').strftime('%Y_%j') #2019_190
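
The conversion above can be wrapped in a small function so that it is easy to reuse for other days. This is a sketch only: `pds_day_prefix` is a hypothetical helper, and the `continuous_waveforms/YYYY/YYYY_JJJ/` layout is assumed from the object paths shown in this tutorial.

```python
from datetime import datetime


def pds_day_prefix(timestamp, root='continuous_waveforms'):
    """Hypothetical helper: build root/YYYY/YYYY_JJJ/ from an ISO 8601 timestamp."""
    day = datetime.fromisoformat(timestamp)
    # %j is the zero-padded day of the year (001-366).
    return '{}/{}/{}/'.format(root, day.strftime('%Y'), day.strftime('%Y_%j'))


print(pds_day_prefix('2019-07-09T00:00:00'))
```

For the day used in this tutorial, 9 July 2019, the day of the year is 190, which matches the `2019_190` directory seen in the download log.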


In [7]:
# get an s3 client using your credentials
# (use the placeholder variables defined above; never hard-code real keys in a notebook)

s3 = boto3.client('s3',
                   region_name=region,
                   aws_access_key_id=aws_access_key_id,
                   aws_secret_access_key=aws_secret_access_key)

In [8]:
# create the prefix expected by pds; continuous_waveforms/YYYY/YYYY_JJJ/GSCA. 
# The last part is the network code plus the first two characters of the station code, to narrow our search.
# S3 keys always use '/', so join with '/' rather than os.path.join, and keep the
# base prefix unchanged so that this cell can be re-run.

day_prefix = '/'.join([prefix, year, julian_day, 'GSCA'])

# use list_objects_v2 to get a list of the filenames to download

response = s3.list_objects_v2(
            Bucket = bucket,
            Prefix = day_prefix
)

# get a list of pds objects for GS.CA*.HH1
# ('Contents' is absent from the response when nothing matches the prefix)

pds_objs = [obj['Key'] for obj in response.get('Contents', []) if 'HH1' in obj['Key']]

kwargs = {'Bucket':bucket}
for key in pds_objs:
    kwargs['Key'] = key
    
    # local filename with path 
    kwargs['Filename'] = os.path.join('/tmp', os.path.basename(key))
    
    print(key, '==>', kwargs['Filename'])
    s3.download_file(**kwargs)

continuous_waveforms/2019/2019_190/GSCA01_HH100_2019190.ms ==> /tmp/GSCA01_HH100_2019190.ms
continuous_waveforms/2019/2019_190/GSCA02_HH100_2019190.ms ==> /tmp/GSCA02_HH100_2019190.ms
continuous_waveforms/2019/2019_190/GSCA03_HH100_2019190.ms ==> /tmp/GSCA03_HH100_2019190.ms
continuous_waveforms/2019/2019_190/GSCA04_HH100_2019190.ms ==> /tmp/GSCA04_HH100_2019190.ms
