## Code experiments to process files to ScienceBase
## using new in-memory functions in the pysb module.

##### Date: June 12, 2017

##### Purpose:
Retrieve a data file (content) from a remote source (url) into computer memory, 
<br>then upload the content as a memory stream to an item in ScienceBase. 
<br>For items, see https://www.sciencebase.gov/catalog/item/ . 

##### Pysb author:
John Long 
<br>jllong@usgs.gov 
<br>USGS Fort Collins Science Center
<br>2150 Centre Ave, Building C
<br>Fort Collins, CO 

##### Stream modifications by:
Tristan Wellman
<br>twellman@usgs.gov
<br>Biogeographic Characterization Branch
<br>Core Science Analytics, Synthesis, and Libraries
<br>U.S. Geological Survey, Denver, Colorado

##### Options (a-c) for in-memory stream processing:
<br>    a) http(s) request from url (e.g. 'http://waterservices.usgs.gov/nwis/gwlevels/?siteStatus=all&sites=375907091432201&format=rdb')
<br>    b) ftp request from url (e.g. 'ftp://ftp.horizon-systems.com/NHDplus/NHDPlusV1/SourisRedRainy/NHDPlus09V01_02_NHD.zip')
<br>    c) pass BytesIO file object from io module (e.g. fileObj = BytesIO(urlopen(url).read()))
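
As a minimal sketch of option (c), the snippet below reads a response body fully into memory and wraps it in a `BytesIO` object. It is written for Python 3 and uses a temporary local file served through a `file://` url as a stand-in for a remote source, so it runs without network access. The helper name `fetch_to_stream` and the sample file contents are illustrative assumptions, not part of pysb.

```python
import tempfile
from io import BytesIO
from pathlib import Path
from urllib.request import urlopen


def fetch_to_stream(url):
    """Read the full response body for `url` into an in-memory BytesIO object."""
    # Hypothetical helper for illustration; not a pysb function.
    with urlopen(url) as response:
        return BytesIO(response.read())


# Stand-in for a remote file: a small local file addressed by a file:// url.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.rdb"
    sample_path.write_bytes(b"site_no\tlev_va\n375907091432201\t12.3\n")
    file_obj = fetch_to_stream(sample_path.as_uri())

# The content now lives only in memory; the temporary file is already gone.
first_line = file_obj.readline()
print(first_line)
```

Note that `BytesIO` needs the bytes returned by `.read()`, not the response object itself.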

##### New Function:
<br> usage with required inputs: 
<br>    item = sb.upload_file_to_item_stream(item, stream_src) 
<br>
<br> usage with optional inputs: 
<br>    item = sb.upload_file_to_item_stream(item, stream_src, scrape_file=True, get_info=True, post_info=True, filename_sub = fname)        
<br> where:
<br>  1) item:  ScienceBase item json (see pysb documentation)
<br>  2) stream_src: url (http or ftp) or file object, options a-c, discussed above
<br>  3) scrape_file: optional True/False flag of whether to use ScienceBase metadata processing 
<br>  4) stream_kwargs: optional arguments for file naming and request content reporting:
<br>&nbsp;&nbsp;&nbsp;&nbsp;  (a) auxiliary filename, used if absent in requested content (filename_sub = "filename.fmt")
<br>&nbsp;&nbsp;&nbsp;&nbsp;  (b) report GET request header information ("get_info=True")
<br>&nbsp;&nbsp;&nbsp;&nbsp;  (c) report POST request header information ("post_info=True")

<br>Files can be uploaded individually (one-by-one) or all together (batch mode),
<br>either as urls pointing to files or as constructed BytesIO file objects (examples below). 
<br>Follow the general pysb guidelines, e.g. batch load all component files of a shapefile together.

<br> NOTE: "stream_src" and "filename_sub" inputs should be list formatted, 
<br> e.g. stream_src = [url_1, url_2, url_3], filename_sub = ['my_file.txt']
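
The list requirement in the note above can be checked before calling the upload function. The following is only a sketch of such a pre-check: `as_list` and `pair_sources_with_names` are hypothetical helpers, and padding missing filenames with `None` is an assumption made here for illustration, not documented pysb behavior.

```python
def as_list(value):
    """Wrap a single value in a list; return lists and tuples as lists."""
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


def pair_sources_with_names(stream_src, filename_sub=None):
    """Pair each stream source with a fallback filename (None if not supplied)."""
    # Hypothetical pre-check helper; pysb itself may handle this differently.
    sources = as_list(stream_src)
    names = as_list(filename_sub) if filename_sub is not None else []
    if len(names) > len(sources):
        raise ValueError("more fallback filenames than stream sources")
    names = names + [None] * (len(sources) - len(names))
    return list(zip(sources, names))


pairs = pair_sources_with_names(["url_1", "url_2", "url_3"], "my_file.txt")
print(pairs)
```

Pairing the two lists explicitly makes it easy to see which source would fall back to which filename before anything is uploaded.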


In [1]:
import pysb
from io import BytesIO
import os
import time
import requests
from re import findall
from contextlib import closing
try:
    from urllib.request import urlopen
except ImportError:
    from urllib2 import urlopen
    
from IPython.display import clear_output


print ('modules loaded')

modules loaded


In [2]:
# login to ScienceBase through personal user account 

sb = pysb.SbSession()
try:
    username = raw_input("Username:  ")  # Python 2
except NameError:
    username = input("Username:  ")  # Python 3
sb.loginc(str(username))
time.sleep(5)

Username:  twellman@usgs.gov
········


In [10]:
# Reference or create a test ScienceBase item, accessible to the user logged in

exist = True

# existing item
if exist:
    sb_id = '5942ac52e4b0764e6c65fdb0'
    item = sb.get_item(sb_id)
    
# create a new item
else:
    new_item = {'title': 'Python test: stream download to stream upload in ScienceBase' ,
    'parentId': sb.get_my_items_id(),
    'provenance': {'annotation': 'Python ScienceBase memory process test script'}}
    item = sb.create_item(new_item)
    
print("{}\n{}".format('Using ScienceBase item: ', item))



Using ScienceBase item: 
{u'locked': False, u'hasChildren': False, u'title': u'Python test: stream download to stream upload in ScienceBase', u'provenance': {u'lastUpdatedBy': u'twellman@usgs.gov', u'createdBy': u'twellman@usgs.gov', u'annotation': u'Python ScienceBase memory process test script', u'lastUpdated': u'2017-06-23T16:22:31Z', u'dateCreated': u'2017-06-15T15:48:34Z'}, u'relatedItems': {u'link': {u'url': u'https://www.sciencebase.gov/catalog/itemLinks?itemId=5942ac52e4b0764e6c65fdb0', u'rel': u'related'}}, u'link': {u'url': u'https://www.sciencebase.gov/catalog/item/5942ac52e4b0764e6c65fdb0', u'rel': u'self'}, u'parentId': u'570c0592e4b0ef3b7ca04e9e', u'distributionLinks': [], u'id': u'5942ac52e4b0764e6c65fdb0', u'permissions': {u'read': {u'inheritsFromId': u'570c0592e4b0ef3b7ca04e9e', u'inherited': True, u'acl': [u'USER:twellman@usgs.gov']}, u'write': {u'inheritsFromId': u'570c0592e4b0ef3b7ca04e9e', u'inherited': True, u'acl': [u'USER:twellman@usgs.gov']}}}


In [11]:
# << OPTION A >> - IN MEMORY FILE PROCESSING: **HTTP** file processing 

# TEST 1 : Pass file url one-by-one to ScienceBase
#          report GET and POST request information, 
#          scrape = False --> turn off ScienceBase metadata read 
         
# files : a) NWIS data file (*.rdb) and b) NOAA biogeographic data (netcdf, *.nc)
file_list = ['http://waterservices.usgs.gov/nwis/gwlevels/?siteStatus=all&sites=375907091432201&format=rdb',
             'http://coastwatch.pfeg.noaa.gov/erddap/tabledap/erdCAMarCatLM.nc?time%2Cyear%2Cfish%2Cport%2Clandings&time%3E=2002-10-09T00%3A00%3A00Z&time%3C=2002-12-16T00%3A00%3A00Z']

# alternate filenames, used only if missing in url request
fnames = ['NWIS_TEST_1.rdb', 'Is_Not_Used.fmt']

# stream files one-by-one (single mode), note: file (f) and filenames (filename_sub) are lists "[]"
for i, f in enumerate(file_list):
    item = sb.upload_file_to_item_stream(item, [ f ], scrape_file=False, 
                                         get_info=True, post_info=True,
                                         filename_sub = [ fnames[i] ] )


** ** Attempting to stream url response to sb_item: 5942ac52e4b0764e6c65fdb0 from url: 

http://waterservices.usgs.gov/nwis/gwlevels/?siteStatus=all&sites=375907091432201&format=rdb

GET status code: 
\200
GET info: 
\{'Content-Encoding': 'gzip', 'Transfer-Encoding': 'chunked', 'Strict-Transport-Security': 'max-age=31536000', 'Vary': 'Accept-Encoding', 'Keep-Alive': 'timeout=15, max=100', 'Server': 'GlassFish Server Open Source Edition  4.1', 'Connection': 'Keep-Alive', 'X-UA-Compatible': 'IE=edge,chrome=1', 'Date': 'Fri, 23 Jun 2017 16:23:41 GMT', 'Access-Control-Allow-Origin': '*', 'Content-Type': 'text/plain;charset=UTF-8'}

Note: server provided filename (content-disposition) was not found,
       using provided filename: ['NWIS_TEST_1.rdb']

POST status code: 
\200
POST info: 
\{'Transfer-Encoding': 'chunked', 'Strict-Transport-Security': 'max-age=63072000; includeSubdomains; preload', 'Keep-Alive': 'timeout=10, max=299', 'Connection': 'Keep-Alive', 'X-UA-Compatible': 'IE=edge', 

In [12]:
# << OPTION A >> - IN MEMORY FILE PROCESSING: **HTTP** file processing 

# TEST 2 : Pass file urls all together in batch mode
#          report only minimum request information, 
#          scrape = False --> turn off ScienceBase metadata read 
         
# files : a) NWIS data file (*.rdb) and b) NOAA biogeographic data (netcdf, *.nc)
file_list = ['http://waterservices.usgs.gov/nwis/gwlevels/?siteStatus=all&sites=375907091432201&format=rdb',
             'http://coastwatch.pfeg.noaa.gov/erddap/tabledap/erdCAMarCatLM.nc?time%2Cyear%2Cfish%2Cport%2Clandings&time%3E=2002-10-09T00%3A00%3A00Z&time%3C=2002-12-16T00%3A00%3A00Z']

# alternate filenames, used only if missing in url request
fnames = ['NWIS_TEST_2.rdb', 'This_Is_Not_Used.fmt']
    
# stream file urls in batch mode
item = sb.upload_file_to_item_stream(item, file_list, scrape_file=False, 
                                         filename_sub = fnames )


** ** Attempting to stream url response to sb_item: 5942ac52e4b0764e6c65fdb0 from url: 

http://waterservices.usgs.gov/nwis/gwlevels/?siteStatus=all&sites=375907091432201&format=rdb

Note: server provided filename (content-disposition) was not found,
       using provided filename: ['NWIS_TEST_2.rdb', 'This_Is_Not_Used.fmt']

** ** Attempting to stream url response to sb_item: 5942ac52e4b0764e6c65fdb0 from url: 

http://coastwatch.pfeg.noaa.gov/erddap/tabledap/erdCAMarCatLM.nc?time%2Cyear%2Cfish%2Cport%2Clandings&time%3E=2002-10-09T00%3A00%3A00Z&time%3C=2002-12-16T00%3A00%3A00Z


** ** Stream upload to ScienceBase complete.


In [13]:
# << OPTION B >> - IN MEMORY FILE PROCESSING: complete **ZIP** file

# TEST 3 : stream process complete zip file to ScienceBase, 
#          use alternate filename provided,
#          minimum request reporting

# file source : National Hydrography Data (*.zip)
f = 'ftp://ftp.horizon-systems.com/NHDplus/NHDPlusV1/SourisRedRainy/NHDPlus09V01_02_NHD.zip'

# specify file name from url, used only if name is absent from content disposition
fname = 'TEST3_' + f.rsplit('/',1)[1] 

# stream process zip file
item = sb.upload_file_to_item_stream(item, [f], filename_sub = [fname]) 



** ** Attempting to stream url response to sb_item: 5942ac52e4b0764e6c65fdb0 from url: 

ftp://ftp.horizon-systems.com/NHDplus/NHDPlusV1/SourisRedRainy/NHDPlus09V01_02_NHD.zip

Note: server provided filename (content-disposition) was not found,
       using provided filename: ['TEST3_NHDPlus09V01_02_NHD.zip']


** ** Stream upload to ScienceBase complete.
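
Test 3 derives the fallback filename with `f.rsplit('/',1)[1]`, which works for the ftp url but would return the query string for a url such as the NWIS request. As a hedged alternative, the sketch below takes the last path segment with the standard library (Python 3) and falls back to a supplied name when the path has none. The helper `name_from_url` is hypothetical and not part of pysb.

```python
import posixpath
from urllib.parse import urlparse


def name_from_url(url, fallback=None):
    """Return the last path segment of `url`, or `fallback` if the path has none."""
    # Hypothetical helper; the query string is ignored because only .path is used.
    name = posixpath.basename(urlparse(url).path)
    return name or fallback


ftp_url = "ftp://ftp.horizon-systems.com/NHDplus/NHDPlusV1/SourisRedRainy/NHDPlus09V01_02_NHD.zip"
query_url = "http://waterservices.usgs.gov/nwis/gwlevels/?siteStatus=all&sites=375907091432201&format=rdb"

ftp_name = name_from_url(ftp_url)
query_name = name_from_url(query_url, fallback="NWIS_TEST_1.rdb")
print(ftp_name)
print(query_name)
```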


In [15]:
# << OPTION C >> - IN MEMORY FILE PROCESSING: pass BytesIO file object   

# TEST 4 : Request individual files as streamed file objects,
#          upload file objects one-by-one to ScienceBase

# Data files : a) NWIS water services data file (*.rdb) and b) NOAA biogeographic data (netcdf, *.nc)
file_list = ['http://waterservices.usgs.gov/nwis/gwlevels/?siteStatus=all&sites=375907091432201&format=rdb',
         'http://coastwatch.pfeg.noaa.gov/erddap/tabledap/erdCAMarCatLM.nc?time%2Cyear%2Cfish%2Cport%2Clandings&time%3E=2002-10-09T00%3A00%3A00Z&time%3C=2002-12-16T00%3A00%3A00Z']

# alternate file names 
fname_alt = ['TEST4_NWIS_rename.rdb', None]

    
# request files, convert to file objects, and stream objects one-by-one
for i, f in enumerate(file_list):
    with closing(urlopen(f)) as r:
        if 'content-disposition' in r.headers:
            fname = ''.join(findall("filename=(.+)", r.info()['Content-Disposition']))
        else:
            fname = fname_alt[i]
        fileObj = BytesIO(r.read())
        item = sb.upload_file_to_item_stream(item, [fileObj], filename_sub = [fname])


** ** Attempting to stream BytesIO file object to sb_item: 5942ac52e4b0764e6c65fdb0


Note: server provided filename (content-disposition) was not found,
       using provided filename: TEST4_NWIS_rename.rdb


** ** Stream upload to ScienceBase complete.

** ** Attempting to stream BytesIO file object to sb_item: 5942ac52e4b0764e6c65fdb0


Note: server provided filename (content-disposition) was not found,
       using provided filename: erdCAMarCatLM_36d5_b891_bee9.nc


** ** Stream upload to ScienceBase complete.
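
Test 4 extracts the filename from the Content-Disposition header with a regular expression, which keeps any surrounding quotes the server may send. One possible alternative, sketched here for Python 3, lets the standard library's `email.message.Message` parse the header value. The helper `filename_from_headers` and the sample header string are illustrative assumptions, not output captured from the servers used in this notebook.

```python
from email.message import Message


def filename_from_headers(content_disposition, fallback=None):
    """Return the filename parameter of a Content-Disposition value, else `fallback`."""
    # Hypothetical helper; not a pysb function.
    if not content_disposition:
        return fallback
    msg = Message()
    msg["Content-Disposition"] = content_disposition
    # get_filename() unquotes the parameter and returns `fallback` if it is absent.
    return msg.get_filename(fallback)


# Illustrative header value of the form a server might send.
served = filename_from_headers('attachment; filename="erdCAMarCatLM_36d5_b891_bee9.nc"')
missing = filename_from_headers(None, fallback="TEST4_NWIS_rename.rdb")
print(served, missing)
```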


In [17]:
# << OPTION C >> - IN MEMORY FILE PROCESSING: pass BytesIO file object  --> upload to ScienceBase 

# TEST 5 : Request zip file from url, unpack selected files, upload in batch mode as file objects

# Note: during testing, scrape_file=True caused errors in pysb when loading a shapefile (*.shp).
# Errors also occurred with a stock pysb command (e.g. sb.upload_file_to_item(item,'nhdarea.shp')).
# scrape_file=False is used for this example, which halts the ScienceBase metadata read, TBR later.

from zipfile import ZipFile

# prescribe url for remote zip file
file_list = ['ftp://ftp.horizon-systems.com/NHDplus/NHDPlusV1/SourisRedRainy/NHDPlus09V01_02_NHD.zip']

# list of files in zip, note: to view available content use zfile.namelist() on the opened archive
proc_files_inZip =  ['NHDPlus09/NHDFlowlineVAA.dbf',
                     'NHDPlus09/Hydrography/NHDFlowline.dbf',
                     'NHDPlus09/Hydrography/NHDFlowline.prj',
                     'NHDPlus09/Hydrography/nhdflowline.shp',
                     'NHDPlus09/Hydrography/nhdflowline.shx']        

# request files, convert to file objects, and stream file objects all together in batch mode
fileObj = []
for f in file_list:
    with closing(urlopen(f)) as r, ZipFile(BytesIO(r.read())) as zfile:
        for entry in proc_files_inZip:
            print("\n{}\n\t{}".format('selected file from zip folder: ', entry))
            fileObj.append(BytesIO(zfile.read(entry)))
        item = sb.upload_file_to_item_stream(item, fileObj, 
                                          scrape_file=False,
                                          filename_sub = proc_files_inZip)
        print(item['files'])
    


selected file from zip folder: 
	NHDPlus09/NHDFlowlineVAA.dbf

selected file from zip folder: 
	NHDPlus09/Hydrography/NHDFlowline.dbf

selected file from zip folder: 
	NHDPlus09/Hydrography/NHDFlowline.prj

selected file from zip folder: 
	NHDPlus09/Hydrography/nhdflowline.shp

selected file from zip folder: 
	NHDPlus09/Hydrography/nhdflowline.shx

** ** Attempting to stream BytesIO file object to sb_item: 5942ac52e4b0764e6c65fdb0


Note: server provided filename (content-disposition) was not found,
       using provided filename: NHDPlus09/NHDFlowlineVAA.dbf

** ** Attempting to stream BytesIO file object to sb_item: 5942ac52e4b0764e6c65fdb0


Note: server provided filename (content-disposition) was not found,
       using provided filename: NHDPlus09/Hydrography/NHDFlowline.dbf

** ** Attempting to stream BytesIO file object to sb_item: 5942ac52e4b0764e6c65fdb0


Note: server provided filename (content-disposition) was not found,
       using provided filename: NHDPlus09/Hydrography
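
The unpacking step in Test 5 can be tried without downloading the full archive. The sketch below (Python 3) builds a small stand-in zip in memory, selects entries by extension, and wraps each one in a `BytesIO` object. It also strips the folder part of each entry name. Whether ScienceBase needs flat filenames is not stated in this notebook, so treat that step as an optional assumption. The entry contents are dummy bytes.

```python
import posixpath
from io import BytesIO
from zipfile import ZipFile

# Build a small stand-in archive in memory (Test 5 downloads a real one).
archive = BytesIO()
with ZipFile(archive, "w") as zf:
    zf.writestr("NHDPlus09/Hydrography/nhdflowline.shp", b"shp-bytes")
    zf.writestr("NHDPlus09/Hydrography/nhdflowline.shx", b"shx-bytes")
    zf.writestr("NHDPlus09/readme.txt", b"not selected")
archive.seek(0)

wanted_suffixes = (".shp", ".shx")

# Read only the selected entries into memory, one BytesIO object per entry.
file_objs, flat_names = [], []
with ZipFile(archive) as zf:
    for entry in zf.namelist():
        if entry.lower().endswith(wanted_suffixes):
            file_objs.append(BytesIO(zf.read(entry)))
            # Zip entry names always use forward slashes, hence posixpath.
            flat_names.append(posixpath.basename(entry))

print(flat_names)
```

The two resulting lists have the same order and length, which matches the list format expected for `stream_src` and `filename_sub`.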

In [18]:
print('Notebook has run to completion')

Notebook has run to completion
