# Overview

This notebook provides an overview of how to:

1. Search the OBPG satellite data archive for L1A files
2. Output a list of all the granule IDs and download links
3. Create a small database that contains each station's information, along with the corresponding satellite granule

Note: this notebook searches through the [Level 1 & 2 browser](https://oceancolor.gsfc.nasa.gov/cgi/browse.pl?sen=amod), and parses the download links from there. 

# Requirements

This notebook assumes you already have a field data file that contains the time and location of each observation *in a particular format*. An example file is provided in this repository (`example-StationListFile.csv`) and is printed below. Getting your own field data into this format is left as an exercise (for now).

## Field data station information file format example

In [2]:
import pandas as pd

stationListFile = 'example-StationListFile.csv'
stationList = pd.read_csv(stationListFile, sep=r'\s+')
stationList

Unnamed: 0,ID,Cruise,Station,yyyy-mm-ddThh:mm:ss,Longitude,Latitude,year,month,day,hour,mins,sec
0,s980917w_103883,s980917w,103883.0,1998-09-17T13:46:48,-66.3078,43.7736,1998,9,17,13,46,48
1,s980917w_103884,s980917w,103884.0,1998-09-17T13:47:13,-66.3102,43.7732,1998,9,17,13,47,13
2,s980917w_103885,s980917w,103885.0,1998-09-17T13:47:35,-66.3124,43.7729,1998,9,17,13,47,35
3,s980917w_103886,s980917w,103886.0,1998-09-17T13:47:56,-66.3147,43.7724,1998,9,17,13,47,56
4,s101012e_144471,s101012e,144471.0,2010-10-12T14:59:51,-68.9959,44.0526,2010,10,12,14,59,51
5,s101012e_144472,s101012e,144472.0,2010-10-12T15:01:55,-68.9913,44.0504,2010,10,12,15,1,55
6,s101012e_144473,s101012e,144473.0,2010-10-12T15:03:59,-68.9858,44.0477,2010,10,12,15,3,59
7,s181205e_2018083.1,s181205e,2018083.1,2018-12-05T04:22:00,-68.444467,43.4771,2018,12,5,4,22,0
8,s181206w_2018084.1,s181206w,2018084.1,2018-12-06T10:48:00,-67.150233,43.724533,2018,12,6,10,48,0
9,s181206w_2018085.1,s181206w,2018085.1,2018-12-06T12:22:00,-67.4717,43.705533,2018,12,6,12,22,0
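
As a starting point for that exercise, the sketch below builds one record in the layout shown above from a cruise name, a station label, an ISO timestamp and a position, using only the standard library. The helper name `make_station_row` is hypothetical, and the rule that `ID` is `Cruise` and `Station` joined by an underscore is an assumption inferred from the example rows, not a documented specification.

```python
import datetime as dt

def make_station_row(cruise, station, timestamp, lon, lat):
    """Build one station record with the column names used in the example file.

    Hypothetical helper: assumes ID is '<Cruise>_<Station>' and that
    timestamp is an ISO string such as '1998-09-17T13:46:48'.
    """
    t = dt.datetime.fromisoformat(timestamp)
    return {
        'ID': f'{cruise}_{station}',
        'Cruise': cruise,
        'Station': station,
        'yyyy-mm-ddThh:mm:ss': timestamp,
        'Longitude': lon,
        'Latitude': lat,
        'year': t.year, 'month': t.month, 'day': t.day,
        'hour': t.hour, 'mins': t.minute, 'sec': t.second,
    }

row = make_station_row('s980917w', '103883', '1998-09-17T13:46:48', -66.3078, 43.7736)
print(row['ID'], row['year'], row['month'], row['day'], row['hour'], row['mins'], row['sec'])
```

A list of such dictionaries can be passed to `pd.DataFrame` and written out with `to_csv`.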


## Importing other modules

In [1]:
import re
import numpy as np
import datetime as dt
import requests
from lxml import html

# 1. Search OBPG for satellite granules

## Creating a list of search parameters

First, let's get a list of cruises:

In [2]:
cruises = stationList.Cruise.drop_duplicates()

For each cruise, we need the minimum and maximum latitudes and longitudes, which we then use as the bounding box for our search:

In [3]:
padding = 0.05 # degrees = ~7km in the Gulf of Maine

gnatsSummary = {}
for cc in cruises:
    # all stations that belong to this cruise
    cruise = stationList[stationList.Cruise == cc]
    minlat = np.nanmin(cruise.Latitude) - padding
    minlon = np.nanmin(cruise.Longitude) - padding
    maxlat = np.nanmax(cruise.Latitude) + padding
    maxlon = np.nanmax(cruise.Longitude) + padding

    gnatsSummary[cc] = {'minLon' : minlon, 'minLat' : minlat, 'maxLon' : maxlon, 'maxLat' : maxlat}
    
gnatsSummary = pd.DataFrame(gnatsSummary).T
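
To get a feel for what a padding given in degrees means on the ground, the sketch below converts it to kilometres. It assumes a spherical Earth with roughly 111 km per degree of latitude and a representative Gulf of Maine latitude of 43.7°N; both values are our assumptions, not taken from the notebook. Under them, 0.05° is about 5.5 km north-south and about 4 km east-west, so the padded corner of the box lies close to 7 km from the unpadded one.

```python
import math

KM_PER_DEG_LAT = 111.0  # rough spherical-Earth approximation (assumption)

def padding_in_km(padding_deg, latitude_deg):
    """Return the (north-south, east-west) size in km of a padding in degrees."""
    north_south = padding_deg * KM_PER_DEG_LAT
    # a degree of longitude shrinks with the cosine of the latitude
    east_west = padding_deg * KM_PER_DEG_LAT * math.cos(math.radians(latitude_deg))
    return north_south, east_west

ns_km, ew_km = padding_in_km(0.05, 43.7)
diagonal_km = math.hypot(ns_km, ew_km)
print(f'{ns_km:.1f} km north-south, {ew_km:.1f} km east-west, {diagonal_km:.1f} km diagonal')
```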

In [4]:
gnatsSummary

Unnamed: 0,minLon,minLat,maxLon,maxLat
s980917w,-70.004800,43.565700,-66.257800,43.823600
s980918w,-69.741700,43.576300,-66.248400,43.827400
s980929w,-69.408000,43.603200,-66.216522,43.825855
s981004w,-70.034804,43.560517,-66.256700,43.828600
s981005w,-69.707900,43.576500,-66.142300,43.828600
...,...,...,...,...
s180917w,-69.853000,43.516100,-66.193700,43.818500
s181001w,-69.867900,43.516200,-66.180000,43.821100
s180823w,-69.826900,0.016700,-66.270100,43.816400
s181205e,-70.473200,43.048100,-67.107000,43.773500
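
For readers who want to check the bounding-box logic without pandas, here is a plain-Python sketch of the same per-cruise computation on three stations from the example file. The function name `padded_bounds` is hypothetical.

```python
PADDING = 0.05  # degrees, same value as above

# (cruise, longitude, latitude) for three stations from the example file
stations = [
    ('s980917w', -66.3078, 43.7736),
    ('s980917w', -66.3102, 43.7732),
    ('s101012e', -68.9959, 44.0526),
]

def padded_bounds(stations, padding):
    """Return {cruise: {'minLon', 'minLat', 'maxLon', 'maxLat'}} with padding applied."""
    grouped = {}
    for cruise, lon, lat in stations:
        grouped.setdefault(cruise, []).append((lon, lat))
    bounds = {}
    for cruise, points in grouped.items():
        lons = [p[0] for p in points]
        lats = [p[1] for p in points]
        bounds[cruise] = {
            'minLon': min(lons) - padding, 'minLat': min(lats) - padding,
            'maxLon': max(lons) + padding, 'maxLat': max(lats) + padding,
        }
    return bounds

bounds = padded_bounds(stations, PADDING)
for cruise, box in bounds.items():
    print(cruise, box)
```

Unlike `np.nanmin` and `np.nanmax`, the built-in `min` and `max` do not skip missing values, so this version assumes every station has a valid position.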


Reformatting the cruise date into whole days since the Unix epoch (1970-01-01), which is the value passed as the `day` search parameter below:

In [5]:
dtstring = gnatsSummary.index
pre2000 = dtstring.str.contains('s9')
year = np.ones(pre2000.shape).astype(int)
year[pre2000] = [int(dts[1:3])+1900 for dts in dtstring[pre2000]]
year[~pre2000] = [int(dts[1:3])+2000 for dts in dtstring[~pre2000]]
month = [int(dts[3:5]) for dts in dtstring]
day = [int(dts[5:7]) for dts in dtstring]

# dividing the timedelta by 1 day gives the number of days since 1970-01-01
# as a float, which int() then truncates to a whole number of days
unixTime = [int((dt.datetime(yy,mm,dd) - dt.datetime(1970,1,1)) / dt.timedelta(days=1))
            for (yy,mm,dd) in zip(year,month,day)]
gnatsSummary['unixTime'] = unixTime
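
The date handling above can also be sketched as a single function that works on one cruise ID at a time. It assumes IDs of the form `s` + `YYMMDD` + a one-letter suffix, as in the example file, and mirrors the rule that IDs starting with `s9` belong to the 1900s and all others to the 2000s. The function name is hypothetical.

```python
import datetime as dt

def cruise_id_to_epoch_days(cruise_id):
    """Convert a cruise ID such as 's980917w' to whole days since 1970-01-01."""
    yy = int(cruise_id[1:3])
    # same century rule as above: 's9...' means 19YY, anything else 20YY
    year = 1900 + yy if cruise_id.startswith('s9') else 2000 + yy
    date = dt.date(year, int(cruise_id[3:5]), int(cruise_id[5:7]))
    return (date - dt.date(1970, 1, 1)).days

print(cruise_id_to_epoch_days('s980917w'))
print(cruise_id_to_epoch_days('s181205e'))
```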

## Searching for satellite granules

Writing a function to craft the search URLs:

In [6]:
def crafturl(cruise,sensor):
    # sensor must be one of: amod, tmod, vrsn, swml, swg
    # cruise must be one row of the summary dataframe (for example a
    # row from itertuples()), which contains:
    # unixTime = cruise date in days since 1970-01-01
    # maxLon, maxLat, minLon, minLat = bounding coordinates for the
    #                 region of interest

    # checking sensor strings
    if sensor not in ['amod','tmod','vrsn','swml','swg']:
        raise ValueError('Unrecognized sensor string, must be one of:' + \
                        ' amod, tmod, vrsn, swml, swg' )
    
    # crafting search url; str() lets numeric fields be concatenated too
    browserurl = 'https://oceancolor.gsfc.nasa.gov/cgi/browse.pl'
    url = browserurl + \
            '?sub=level1or2list&sen=' + sensor + \
            '&per=DAY&day=' + str(cruise.unixTime) + \
            '&w=' + str(cruise.minLon) + '&s=' + str(cruise.minLat) + \
            '&e=' + str(cruise.maxLon) + '&n=' + str(cruise.maxLat)

    return url
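
As an alternative to string concatenation, the query string can be assembled with `urllib.parse.urlencode` from the standard library. The sketch below is a hypothetical variant (`craft_url_encoded` is not part of the notebook) that takes plain values instead of a dataframe row and uses the same parameter names as `crafturl`. The example values are the bounding box listed above for cruise `s181205e`.

```python
from urllib.parse import urlencode

BROWSER_URL = 'https://oceancolor.gsfc.nasa.gov/cgi/browse.pl'

def craft_url_encoded(sensor, epoch_day, west, south, east, north):
    """Hypothetical variant of crafturl() that lets urlencode() build the query."""
    params = {
        'sub': 'level1or2list', 'sen': sensor,
        'per': 'DAY', 'day': epoch_day,
        'w': west, 's': south, 'e': east, 'n': north,
    }
    # urlencode() converts each value to a string and escapes it if needed
    return BROWSER_URL + '?' + urlencode(params)

url = craft_url_encoded('amod', 17870, -70.4732, 43.0481, -67.107, 43.7735)
print(url)
```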

We need to parse the HTML of the L1/2 browser results page. There are two paths here, depending on the results of the search. If multiple files match the search, we have to parse the thumbnails page and then, for each thumbnail, go to the individual swath page and parse that to get our download link. However, if only one file matches the search, the browser takes us straight to the individual swath page.
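
The link-extraction step on a single swath page can be sketched in isolation as follows. This is a minimal illustration, not the notebook's implementation: it swaps `lxml` for the standard-library `html.parser` so that it has no dependencies, and the page fragment, host name and file suffixes are made up; the real browser markup may differ.

```python
import re
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag (stand-in for htmltree.xpath('//a'))."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.links.append(href)

# Made-up page fragment, for illustration only
sample_page = '''
<html><head><title>Example swath page</title></head><body>
<a href="https://example.invalid/getfile/A2018340183500.L1A_LAC.bz2">L1A</a>
<a href="https://example.invalid/getfile/A2018340183500.L2_LAC_OC.nc">L2</a>
<a href="/about.html">About</a>
</body></html>
'''

collector = LinkCollector()
collector.feed(sample_page)

# a granule ID is one capital letter followed by 13 digits
l1a_pattern = re.compile(r'[A-Z][0-9]{13}\.L1A')
granule_id = re.compile(r'[A-Z][0-9]{13}')

l1a_links = [link for link in collector.links if l1a_pattern.search(link)]
l1a_ids = [granule_id.search(link).group() for link in l1a_links]
print(l1a_ids)
```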

Looping through all cruises:

In [8]:
sensors = ['amod','tmod','vrsn','swml']
granules = {}
for cruise in gnatsSummary.astype(str).itertuples():
    granids = []
    granlinks = []
    for sensor in sensors:
        url = crafturl(cruise,sensor)

        # do the L1/L2 browser search
        try:
            req = requests.get(url)
        except requests.exceptions.SSLError:
            print('SSLError in ('+cruise.Index+', '+sensor+') continuing with next search...')
            continue
            
        
        #status code = 200 means search was successful and
        #there is content on the found page
        if req.status_code == 200: 
            # parsing the url html into an element tree
            htmltree = html.fromstring(req.content)

            # checking the title of the page
            title = htmltree.xpath('//title')[0].text

            #search html tags for <a href...>
            #double slash means return all tags
            refs = htmltree.xpath('//a')

            if "Thumbnail" in title:
                for ref in refs:
                    link = ref.get('href','')
                    regex = re.compile(r'[A-Z][0-9]{13}\.L2')
                    if regex.search(link):
                        #there's 2 hyperlinks for each file, one on the file name, and one
                        #on the image thumbnail - we just need one
                        brws_regex = re.compile('_BRS_BRS')
                        if not brws_regex.search(link):
                            fullurl = 'https://oceancolor.gsfc.nasa.gov'+link

                            #parsing the webpage for each satellite file to get L1A download link
                            #(using the same procedure as above)
                            try:
                                req2 = requests.get(fullurl)
                            except requests.exceptions.SSLError:
                                print('SSLError in ('+cruise.Index+', '+sensor+') thumbnail continuing with next search...')
                                continue
                                
                            htmltree2 = html.fromstring(req2.content)
                            refs2 = htmltree2.xpath('//a')
                            for ref2 in refs2:
                                link2 = ref2.get('href','')
                                regex2 = re.compile(r'[A-Z][0-9]{13}\.L1A')
                                if regex2.search(link2):
                                    granlinks += [link2]
                                    granids += [re.search(r'[A-Z][0-9]{13}',link2).group()]
            else:
                #parsing the webpage for each satellite file to get L1A download link
                for ref in refs:
                    link2 = ref.get('href','')
                    regex2 = re.compile(r'[A-Z][0-9]{13}\.L1A')
                    if regex2.search(link2):
                        granlinks += [link2]
                        granids += [re.search('[A-Z][0-9]{13}',link2).group()]

    granules[cruise.Index] = {'granid' : granids, 'granlinks' : granlinks}

Reformatting as DataFrame

In [9]:
granule_dataframes = []
for cruise,graninfo in granules.items():
    df = pd.DataFrame(graninfo)
    df['cruise'] = [cruise]*len(df)
    granule_dataframes += [df]
    
granulesDF = pd.concat(granule_dataframes)

In [10]:
granulesDF

Unnamed: 0,granid,granlinks,cruise
0,S1998260183114,https://oceandata.sci.gsfc.nasa.gov/ob/getfile...,s980917w
1,S1998260165214,https://oceandata.sci.gsfc.nasa.gov/ob/getfile...,s980917w
0,S1998261173708,https://oceandata.sci.gsfc.nasa.gov/ob/getfile...,s980918w
1,S1998261155809,https://oceandata.sci.gsfc.nasa.gov/ob/getfile...,s980918w
0,S1998272173530,https://oceandata.sci.gsfc.nasa.gov/ob/getfile...,s980929w
...,...,...,...
0,A2018340183500,https://oceandata.sci.gsfc.nasa.gov/ob/getfile...,s181206w
1,A2018340170000,https://oceandata.sci.gsfc.nasa.gov/ob/getfile...,s181206w
2,T2018340151500,https://oceandata.sci.gsfc.nasa.gov/ob/getfile...,s181206w
3,V2018340174200,https://oceandata.sci.gsfc.nasa.gov/ob/getfile...,s181206w


# 2. Output list of granule IDs and download links

Reordering the columns:

In [11]:
satellite_overpasses = granulesDF[['cruise','granid','granlinks']]
satellite_overpasses = satellite_overpasses.reset_index(drop=True)

Saving to csv:

In [12]:
satellite_overpasses.to_csv('satellite-overpasses.csv',index=False,sep=',')

# 3. Create a small database that contains each station's information, with the corresponding satellite granule

In [14]:
# unique cruise list to loop through:
cruiseList = satellite_overpasses.cruise.drop_duplicates()

matchups = []
for cruise in cruiseList:
    # overpasses for this cruise (exact match on the cruise ID)
    overpasses = satellite_overpasses[satellite_overpasses.cruise == cruise]
    
    # stations for this cruise; copied so that adding a column does not
    # write into a slice of stationList
    stations = stationList[stationList.Cruise == cruise].copy()
    # dates & times reformatted into Python datetime objects (seconds set to 0)
    dt_stations = [dt.datetime(yy,mm,dd,hh,mn,0) for (yy,mm,dd,hh,mn)  
                   in zip(stations.year,stations.month,stations.day,stations.hour,stations.mins)]
    stations['datetimes'] = dt_stations
    
    # looping through each satellite overpass
    cruisematchups = []
    for op in overpasses.itertuples():
        
        # getting granule info; IDs have the form <sensor letter>YYYYDDDHHMMSS
        granule = op.granid
        yyyy = int(granule[1:5])
        doy = int(granule[5:8])
        hh = int(granule[8:10])
        mm = int(granule[10:12])

        # formatting granule date and time into a Python datetime object;
        # the day of year is converted with the standard library
        # (day 1 = 1 January) instead of an external helper module
        dt_granule = dt.datetime(yyyy,1,1) + dt.timedelta(days=doy-1,hours=hh,minutes=mm)

        # comparing times: difference in hours between granule and each station
        timediff_hrs = [(dt_granule - dts).total_seconds()/3600 for dts in stations['datetimes']]
        time_idx = np.absolute(timediff_hrs) <= 3. # using a 3 hr window

        # indexing stations (copied so the granule ID can be added safely)
        matched = stations.loc[time_idx].copy()

        # adding in granule info
        if len(matched) > 0:
            matched['granid'] = granule
            cruisematchups += [matched]   
    
    # converting matchups into a dataframe and adding to a master matchup list
    if len(cruisematchups) > 0:
        matchups += [pd.concat(cruisematchups)]

matchupList = pd.concat(matchups)

Save to csv:

In [15]:
matchupList.to_csv('matchupStationList.csv',sep='\t',index=False)
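
The time-matching rule used in section 3 can be sketched on its own with the standard library. The sketch assumes granule IDs of the form sensor letter + `YYYYDDDHHMMSS`, as in the output above, ignores seconds as the matching loop does, and converts the day of year with `datetime.timedelta`, which is our choice of method. The function names and the station time are hypothetical.

```python
import datetime as dt

def granule_start_time(granid):
    """Parse a granule ID (<letter>YYYYDDDHHMMSS) into a datetime, ignoring seconds."""
    year = int(granid[1:5])
    doy = int(granid[5:8])
    hour = int(granid[8:10])
    minute = int(granid[10:12])
    # day of year 1 corresponds to 1 January
    return dt.datetime(year, 1, 1) + dt.timedelta(days=doy - 1, hours=hour, minutes=minute)

def within_window(granid, station_time, window_hours=3.0):
    """True if the station was sampled within window_hours of the granule start."""
    diff_hours = (granule_start_time(granid) - station_time).total_seconds() / 3600
    return abs(diff_hours) <= window_hours

# Hypothetical station time, chosen only to illustrate the window
station_time = dt.datetime(1998, 9, 17, 15, 0)
for granid in ['S1998260165214', 'S1998260183114']:
    print(granid, granule_start_time(granid), within_window(granid, station_time))
```

With this station time, the first granule (16:52) falls inside the 3-hour window and the second (18:31) falls outside it.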