# Example: Using BagIt to package data set collections (ScienceBase example)

#### Tristan P. Wellman<br>Science Analytics and Synthesis (SAS)<br>U.S. Geological Survey, Denver, Colorado
#### last modified 1/29/2019

#### Purpose

Use BagIt to package a ScienceBase item for archiving.

#### Synopsis

This Python class was developed from foundational work authored by colleagues in the IOOS group for processing NOAA records.<br />
Please see http://ioos.github.io/notebooks_demos/notebooks/2017-11-01-Creating-Archives-Using-Bagit/ for additional information.<br />
The case example herein applies to archiving OBIS USA records stored in individual ScienceBase items referenced using unique identifiers.    

#### Script I/O 

Inputs:<br /> 
a) ScienceBase item information in *.json format,<br />
b) Archive metadata in Python dictionary format,<br />
c) Search (include/exclude) operations to constrain item file selection.
        
Output:<br /> 
a) Compressed (gzipped *.tar, saved as *.tgz) BagIt archive constructed from the ScienceBase item.
    
Operations: <br />
a) Constructs a BagIt data archive for preserving ScienceBase information,<br />
b) Selects appropriate files in the ScienceBase item to be stored in the archive using search criteria,<br /> 
c) Employs streaming requests to download relevant files into the archive folder,<br />
d) Describes archive metadata (task name, processing uuid, provider, contact, etc.),<br /> 
e) Validates BagIt archive structure and manifest information,<br />
f) Compresses the archive folder in gzipped *.tar (*.tgz) format for improved transfer capabilities.
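
To make operations (a) and (e) more concrete, here is a minimal standard-library sketch of what a BagIt payload manifest records: one SHA-256 checksum per file under `data/`. In the class below the `bagit` library handles this step; the helper name `write_manifest` and the sample file are hypothetical, and tag files such as `bagit.txt` are left out.

```python
import hashlib
import os
import tempfile


def write_manifest(bag_dir):
    """Write manifest-sha256.txt with '<checksum>  <relative path>' per payload file."""
    lines = []
    data_dir = os.path.join(bag_dir, 'data')
    for root, _dirs, files in os.walk(data_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            with open(path, 'rb') as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            # manifest paths are relative to the bag root and use forward slashes
            rel = os.path.relpath(path, bag_dir).replace(os.sep, '/')
            lines.append('{}  {}'.format(digest, rel))
    manifest = os.path.join(bag_dir, 'manifest-sha256.txt')
    with open(manifest, 'w') as f:
        f.write('\n'.join(lines) + '\n')
    return lines


# hypothetical one-file payload in a throwaway directory
bag_dir = tempfile.mkdtemp()
os.mkdir(os.path.join(bag_dir, 'data'))
with open(os.path.join(bag_dir, 'data', 'example.csv'), 'w') as f:
    f.write('id,value\n1,2\n')

manifest_lines = write_manifest(bag_dir)
print(manifest_lines)
```

Each manifest line pairs a checksum with a path, which is what allows a receiving system to detect files altered or lost in transfer.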



In [49]:
# load Python 3 packages
import os
import re
import tarfile
import bagit
import tempfile
import shutil
import json
import datetime
import requests
import uuid
import pprint
import collections


In [50]:
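# Example: request one ScienceBase item as JSON

url = 'https://www.sciencebase.gov/catalog/item/57fe9d82e4b0824b2d14f221'
response = requests.get(url, params={'format': 'json'}, timeout=60)
response.raise_for_status()  # stop early if the item could not be retrieved
sb_item = response.json()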

In [51]:
pprint.pprint(sb_item)

{'body': "In 1998, the Florida Fish and Wildlife Conservation Commission's "
         '(FWC) Fisheries Independent Monitoring (FIM) program began a '
         'long-term monitoring effort of key reef fish populations in the '
         'Florida Keys National Marine Sanctuary. This effort was aimed at '
         'evaluating the relative abundance, size structure, and habitat '
         'utilization of specific reef fish species that are targeted by '
         'commercial and recreational fisheries.; Smith, S.G., et al. 2011, '
         'Multispecies survey design for assessing reef-fish stocks, spatially '
         'explicit management performance, and ecosystem condition. Fisheries '
         'Research 109(2011)25-41; Brandt, M.E., et. al. 2009, A Cooperative '
         'Multi-agency Reef Fish Monitoring Protocol for the Florida Keys '
         'Coral Reef Ecosystem. Retrieve from '
         'http://www.coris.noaa.gov/activities/fish_monitoring_protocol/&nbsp; '
         '&nbsp;<br>\n'


In [52]:
# Optional archive arguments 

# Two optional input dictionaries can be used to:
# 1) add archive metadata, and 
# 2) constrain search criteria to select data files from ScienceBase items. 

archive_func = collections.OrderedDict()

archive_func['archive_meta'] = collections.OrderedDict([
    ('Archive-Tag-Name', 'Archive of ScienceBase Item'),
    ('Archive-Processing-Date', datetime.datetime.now().isoformat()),
    ('Archive-Host-Machine', str(uuid.uuid1())),
    ('Archive-Job-Number', str(uuid.uuid4())),
    ('Source-Agency-Name', 'United States Geological Survey'),
    ('Source-Agency-Physical-Address', 'Denver Federal Center, Building 810, Lakewood, Colorado, USA'),
    ('Source-Agency-Group', 'Science Analytics and Synthesis (SAS), Core Science Systems'),
    ('Source-Agency-Contact-Name', 'John Doe'),
    ('Source-Agency-Contact-Phone', '999-999-9999'),
    ('Source-Agency-Contact-Email', 'jdoe@usgs.gov'),
    ('Source-Agency-Data-Source', url),
    ('Source-Agency-Data-Title', sb_item['title']),
])


# Search function (archive_func['search']) - 
#
# Inputs: "include" and "exclude" keys with search parameters. 
#          The keys are applied in the order given: the first key
#          (include or exclude) is applied first, the second key second.
# 
# Function: selects files to include and/or exclude using search parameters
#
# Search Parameters
# 1) "ignore": do not use  
# 2) "all": selects all files 
# 3) custom (text, regex) search term, e.g. r'\.nc' selects files whose names contain '.nc'

# example - include all files except (exclude) those with .nc* file extensions 
#
# (raw string avoids Python's invalid escape sequence warning for '\.')
archive_func['search'] = collections.OrderedDict([('include', 'all'), ('exclude', r'\.nc')])

archive_func

OrderedDict([('archive_meta',
              OrderedDict([('Archive-Tag-Name', 'Archive of ScienceBase Item'),
                           ('Archive-Processing-Date',
                            '2019-03-15T14:30:51.684221'),
                           ('Archive-Host-Machine',
                            '39b177a2-4761-11e9-a126-f45c898ede93'),
                           ('Archive-Job-Number',
                            '9f5ed45a-2034-401e-a99d-c35997891243'),
                           ('Source-Agency-Name',
                            'United States Geological Survey'),
                           ('Source-Agency-Physical-Address',
                            'Denver Federal Center, Building 810, Lakewood, Colorado, USA'),
                           ('Source-Agency-Group',
                            'Science Analytics and Synthesis (SAS), Core Science Systems'),
                           ('Source-Agency-Contact-Name', 'John Doe'),
                           ('Source-Agency-Contact-Pho
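
The include/exclude search configured in the cell above can be sketched in isolation. This is a simplified illustration rather than the class's exact logic: the helper `select_files` and the file names are hypothetical, the "ignore" keyword is omitted, and includes are always applied before excludes, which mirrors the key order used in this example.

```python
import re


def select_files(names, include=('all',), exclude=()):
    """Return names matching any include pattern and no exclude pattern."""
    def to_regex(term):
        # 'all' is shorthand for a pattern that matches every name
        return '.*' if term.lower() == 'all' else term

    selected = []
    for name in names:
        keep = any(re.search(to_regex(t), name) for t in include)
        drop = any(re.search(to_regex(t), name) for t in exclude)
        if keep and not drop:
            selected.append(name)
    return selected


names = ['fk1995.csv', 'fk1995_iso19115.xml', 'fk1995.nc']  # hypothetical file names
picked = select_files(names, include=['all'], exclude=[r'\.nc'])
print(picked)
```

Because `re.search` matches anywhere in the name, a pattern such as `r'\.nc'` also matches names like `grid.ncml`; anchoring it as `r'\.nc$'` restricts the match to the extension.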

In [53]:
# Python class to archive ScienceBase item using BagIt
#
class archive_sbitem():
    
    '''Class to archive ScienceBase data by item number.
       Functions include: select, retrieve, describe, package, 
       and compress archive content'''
    
    
    # Store ScienceBase item information
    #
    def __init__(self, sbitem, **kwargs):

        '''Performs all processing steps in sequence to create BagIt archive'''
        
        print('{}\n\t{}\n\t{}'.format('Processing ScienceBase item',
                                      sbitem['title'][0:80], sbitem['link']['url']))
            
        # task sequence (workflow)    
        #
        self.sbitem = sbitem
        self._inputs(**kwargs)
        self._sbfile_select()
        self._chk_4updates()
        if self.update_flag:
            self._gen_archive()
            self._get_datafiles()
            self._add_meta()
            self._validate()
            self._tar_archive()
        
        # __init__ must return None, so print without an explicit return
        print("\t{}: {}".format('Archive complete for ScienceBase item: ', sbitem['id']))
        
        
    # Access get_item capabilities
    #
    def __getitem__(self, item):
        return getattr(self, item)
     
        
    # Create Bagit object and temporary workspace 
    #
    def _gen_archive(self):
        
        '''Structures BagIt folder object'''
        
        self.bagit_folder = tempfile.mkdtemp()
        self.data_folder = os.path.join(self.bagit_folder, 'data')
        self.archive = bagit.make_bag(self.bagit_folder, checksum=['sha256'])
        
        
    # Register search criteria (actual functions) and archive metadata
    #
    def _inputs(self, **kwargs):
        
        '''Stores input search criteria and archive metadata'''
        
        # search criteria (include, exclude tags)
        #
        if 'search' not in kwargs:
            self.search = {'include' :'all', 'exclude' : None}
        elif 'include' not in kwargs['search'] or 'exclude' not in kwargs['search']:
            self.search = {'include' :'all', 'exclude' : None}
        else:
            self.search = kwargs['search']
            
        if isinstance(self.search['include'], str):
            self.search['include'] = [self.search['include']]
        if isinstance(self.search['exclude'], str):
            self.search['exclude'] = [self.search['exclude']]
            
        for key in self.search:
            parlist = []
            if self.search[key] is not None:
                for criteria in self.search[key]:
                    if criteria.lower() == 'all':
                        parlist.append("(.*?)")
                    elif criteria.lower() == 'ignore':
                        # register no pattern; appending None would later be
                        # formatted into the literal regex "None"
                        continue
                    else:
                        parlist.append(criteria)
                self.search.update({key:parlist})
                        
        # archive metadata record
        #
        if 'archive_meta' not in kwargs:
            self.archive_meta = None
        else:
            self.archive_meta = kwargs['archive_meta']
    
        
    # Select ScienceBase file content using quasi-flexible search criteria
    #
    def _sbfile_select(self):
    
        '''Identifies ScienceBase files to archive based on search criteria'''
            
        cdict = {'include': True, 'exclude' : False}
         
        # search through ScienceBase item (files and facets keywords)
        #
        self.file_select = []
        self.file_name = []
        self.file_datestamp = []
        if 'facets' in self.sbitem:
            for fdic in self.sbitem['facets']:
                if 'files' in fdic:
                    for dfile in fdic['files']:
                        file_chk = None
                        for item in self.search.items():
                            fchk = cdict[item[0].lower()]
                            if item[1] is not None:
                                for criteria in item[1]:
                                    regx_srch = r"{}".format(criteria)
                                    if re.search(regx_srch, dfile['name']):
                                        file_chk = fchk                
                        if file_chk:
                            self.file_select.append(dfile['downloadUri'])
                            self.file_name.append(dfile['name'])
                            # store the file's upload date, not the literal key name
                            self.file_datestamp.append(dfile.get('dateUploaded'))
                        
        if 'files' in self.sbitem:
            for dfile in self.sbitem['files']:         
                file_chk = None
                for item in self.search.items():
                    fchk = cdict[item[0].lower()]
                    if item[1] is not None:
                        for criteria in item[1]:
                            regx_srch = r"{}".format(criteria)
                            if re.search(regx_srch, dfile['name']):
                                file_chk = fchk            
                if file_chk:
                    self.file_select.append(dfile['downloadUri'])
                    self.file_name.append(dfile['name'])
                    # store the file's upload date, not the literal key name
                    self.file_datestamp.append(dfile.get('dateUploaded'))
                    
                    
    # Compares file timestamps, determines if BagIt archive needs updating
    #
    def _chk_4updates(self):
        
        '''Compares ScienceBase item json to existing archive (BagIt) item json'''

        filename = 'ScienceBase_Archive_' + self.sbitem['id'] + '.tgz'
        tarpath = os.path.join(os.path.join(os.getcwd(),'Archives'), filename) 
        self.update_flag = True
        if os.path.exists(tarpath): 
            sbitem_json = 'ScienceBase_record_' + self.sbitem['id'] + '.json'
            # context manager closes the tar file once the stored record has been read
            with tarfile.open(tarpath, "r:gz") as tar:
                for member in tar.getmembers():
                    if sbitem_json == member.name.split('/')[-1]:
                        dfile = tar.extractfile(member).read().decode()
                        BagIt_json = json.loads(dfile)
                        # dictionaries compare equal when their keys and values match
                        self.update_flag = BagIt_json != self.sbitem
                        if self.update_flag:
                            print('\t{}\n\t{}'.format('Action: create new archive',
                                                      'Reason: old archive is out of date'))
                        else:
                            print('\t{}\n\t{}'.format('Action: skip processing',
                                                      'Reason: existing archive is up to date'))
        else:
            print('\t{}\n\t{}'.format('Action: create new archive',
                                       'Reason: archive was not found in directory'))
            
    # stream file retrieve, file insertion into archive folder 
    #
    def _get_datafiles(self):
        
        '''Streams ScienceBase files into BagIt data folder'''
        
        for indx, file_path in enumerate(self.file_select):
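            request = requests.get(file_path, stream=True)
            if request.status_code == 200:
                bag_path = os.path.join(self.data_folder, self.file_name[indx])
                with open(bag_path, 'wb') as f:
                    request.raw.decode_content = True
                    shutil.copyfileobj(request.raw, f)
            else:
                # report failed downloads instead of skipping them silently
                print('\tWarning: HTTP {} for file {}'.format(request.status_code,
                                                            self.file_name[indx]))
        sbitem_fname = os.path.join(self.data_folder,
                                    'ScienceBase_record_' + self.sbitem['id'] + '.json')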
        with open(sbitem_fname, 'w') as f:
            json.dump(self.sbitem, f)   
        self.archive.save(manifests=True, processes=3)
    
    
    # Customize archive metadata  
    #
    def _add_meta(self):
        
        '''Adds metadata to BagIt folder'''
        
        if self.archive_meta:
            self.archive.info.update(self.archive_meta)
            self.archive.save(manifests=True, processes=4)  
                
                
    # Validate archive
    #
    def _validate(self):
        
        '''Validation check for BagIt folder'''
        
        self.validate = []
        if self.archive.is_valid():
            self.validate.append("BagIt archive is structurally valid")
        else:
            self.validate.append("BagIt archive is structurally invalid")
        try:
            self.archive.validate()
        except bagit.BagValidationError as e:
            for d in e.details:
                if isinstance(d, bagit.ChecksumMismatch):
                    self.validate.append("expected %s to have %s checksum of %s but found %s" %
                          (d.path, d.algorithm, d.expected, d.found))
                    
                    
    # Package archive folder in gzipped *.tar (*.tgz) format (save to local 'Archives' folder)
    #
    def _tar_archive(self):
        
        '''Compresses BagIt folder and saves it to the archive folder'''
        
        # Ensure archive folder exists
        #
        dirname = 'Archives'
        tar_directory = os.path.join(os.getcwd(),dirname)
        if not os.path.isdir(tar_directory):  
            try:  
                os.mkdir(tar_directory)
            except OSError: 
                print("Creation of the directory %s failed" % tar_directory)

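        # Save archive as gzipped *.tar (*.tgz) file
        #
        tar_filename = 'ScienceBase_Archive_' + self.sbitem['id'] + '.tgz'
        with tarfile.open(os.path.join(tar_directory, tar_filename), "w:gz") as tar:
            # str.strip('.tgz') strips any of the characters '.', 't', 'g', 'z' from
            # both ends rather than the suffix, so remove the extension explicitly
            tar.add(self.bagit_folder, arcname=os.path.splitext(tar_filename)[0])
        self.tar_folder = tar_filename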
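
The update check performed by `_chk_4updates` can be exercised without ScienceBase access. The sketch below uses only the standard library: it writes a JSON record into a gzipped tar file and then compares the stored copy with a current record. The helper names `write_archive` and `needs_update`, the record contents, and the file names are all hypothetical.

```python
import io
import json
import os
import tarfile
import tempfile


def write_archive(tar_path, record, member_name):
    """Store a JSON record as a single member of a gzipped tar file."""
    payload = json.dumps(record).encode('utf-8')
    info = tarfile.TarInfo(name=member_name)
    info.size = len(payload)
    with tarfile.open(tar_path, 'w:gz') as tar:
        tar.addfile(info, io.BytesIO(payload))


def needs_update(tar_path, record, json_name):
    """Return True if no archive exists or its stored record differs from `record`."""
    if not os.path.exists(tar_path):
        return True
    with tarfile.open(tar_path, 'r:gz') as tar:
        for member in tar.getmembers():
            if member.name.split('/')[-1] == json_name:
                stored = json.loads(tar.extractfile(member).read().decode('utf-8'))
                return stored != record
    return True  # archive exists but holds no record to compare against


workdir = tempfile.mkdtemp()
tar_path = os.path.join(workdir, 'example_archive.tgz')
record = {'id': 'abc123', 'title': 'Example item'}

first_check = needs_update(tar_path, record, 'record.json')   # no archive yet
write_archive(tar_path, record, 'bag/data/record.json')
second_check = needs_update(tar_path, record, 'record.json')  # identical record
third_check = needs_update(tar_path, dict(record, title='Revised'), 'record.json')
print(first_check, second_check, third_check)
```

Comparing the whole record is a conservative choice: any change to the item's JSON, not only to its files, triggers a rebuild.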

In [54]:
# Create ScienceBase archive (item *.json and metadata dictionary as arguments)

# HOW TO RUN 
BagIt_Archive = archive_sbitem(sb_item, **archive_func)

Processing ScienceBase item
	1995 Florida Keys Reef Visual Census, v3.3
	https://www.sciencebase.gov/catalog/item/57fe9d82e4b0824b2d14f221
	Action: create new archive
	Reason: archive was not found in directory
	Archive complete for ScienceBase item: : 57fe9d82e4b0824b2d14f221


In [55]:
# CHECK: shows files extracted and archived

BagIt_Archive.file_name

['fk1995_712b_5843_9069.csv',
 'fk1995_iso19115.xml',
 'fk1995.csv',
 'FK1995VisualReefCensusEventCore.R',
 'FloridaKeysReefVisualCensus1995_measurementOrFact.csv',
 'FloridaKeysReefVisualCensus1995_occurrence.csv',
 'FloridaKeysReefVisualCensus1995_event.csv']

In [56]:
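# CHECK: validation status for archive 
try:
    # an expression inside a try block is not echoed by the notebook, so print it
    print(BagIt_Archive.validate)
except AttributeError:
    print('BagIt archive was not generated')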
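
The core of checksum validation can also be sketched with the standard library alone: recompute each payload checksum and compare it with the manifest entry. The helper `verify_manifest` and the sample bag are hypothetical, and this covers only the payload manifest, not the full set of checks that the `bagit` package performs.

```python
import hashlib
import os
import tempfile


def verify_manifest(bag_dir, algorithm='sha256'):
    """Return the payload paths whose checksum differs from the manifest entry."""
    mismatches = []
    manifest = os.path.join(bag_dir, 'manifest-{}.txt'.format(algorithm))
    with open(manifest) as f:
        for line in f:
            if not line.strip():
                continue
            # each line is '<checksum> <path>'; split once so paths may contain spaces
            expected, rel_path = line.strip().split(None, 1)
            with open(os.path.join(bag_dir, *rel_path.split('/')), 'rb') as data:
                found = hashlib.new(algorithm, data.read()).hexdigest()
            if found != expected:
                mismatches.append(rel_path)
    return mismatches


# build a hypothetical one-file bag, then tamper with the payload
bag_dir = tempfile.mkdtemp()
os.mkdir(os.path.join(bag_dir, 'data'))
payload = os.path.join(bag_dir, 'data', 'example.csv')
content = b'id,value\n1,2\n'
with open(payload, 'wb') as f:
    f.write(content)
with open(os.path.join(bag_dir, 'manifest-sha256.txt'), 'w') as f:
    f.write('{}  data/example.csv\n'.format(hashlib.sha256(content).hexdigest()))

before = verify_manifest(bag_dir)
with open(payload, 'ab') as f:
    f.write(b'3,4\n')
after = verify_manifest(bag_dir)
print(before, after)
```

An empty list means every payload file still matches its recorded checksum; any listed path has changed since the manifest was written.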

In [57]:
# CHECK:  display of bagit directory and metadata

def list_files(startpath):
    for root, dirs, files in os.walk(startpath):
        level = root.replace(startpath, '').count(os.sep)
        indent = ' ' * 4 * (level)
        print('{}{}/'.format(indent, os.path.basename(root)))
        subindent = ' ' * 4 * (level + 1)
        for f in files:
            print('{}{}'.format(subindent, f))

try:
    print("\n{}{}\n".format('Temporary directory path: ', BagIt_Archive.bagit_folder))

    # display temporary Bagit folder
    !ls -la $BagIt_Archive.bagit_folder

    print("\n{}\n".format('BagIt directory structure:'))
    list_files(BagIt_Archive.bagit_folder)

    # context manager closes bag-info.txt after reading
    with open(os.path.join(BagIt_Archive.bagit_folder, 'bag-info.txt')) as f:
        info = f.read()
    print("\nBagIt metadata:\n\n{}".format(info))   
    
except AttributeError:
    print('BagIt archive was not generated')
    


Temporary directory path: /var/folders/kg/s57rylgd76xdlxs5cgkzzfmw001k3f/T/tmp8cz3nz29

total 32
drwx------     7 twellman  domainusers    224 Mar 15 14:42 .
drwx------  2627 twellman  domainusers  84064 Mar 15 14:30 ..
-rw-r--r--     1 twellman  domainusers    862 Mar 15 14:42 bag-info.txt
-rw-r--r--     1 twellman  domainusers     55 Mar 15 14:30 bagit.txt
drwx------    10 twellman  domainusers    320 Mar 15 14:42 data
-rw-r--r--     1 twellman  domainusers    851 Mar 15 14:42 manifest-sha256.txt
-rw-r--r--     1 twellman  domainusers    238 Mar 15 14:42 tagmanifest-sha256.txt

BagIt directory structure:

tmp8cz3nz29/
    bagit.txt
    bag-info.txt
    tagmanifest-sha256.txt
    manifest-sha256.txt
    data/
        FloridaKeysReefVisualCensus1995_measurementOrFact.csv
        FloridaKeysReefVisualCensus1995_occurrence.csv
        FK1995VisualReefCensusEventCore.R
        fk1995.csv
        FloridaKeysReefVisualCensus1995_event.csv
        fk1995_iso