# Batch Metadata Propagation and Data Upload

As part of the USGS's [Data at Risk (DaR)](https://www.fort.usgs.gov/ldi/data-at-risk-project) project, a data release was created with 80 sets of repeat photographs taken from the Kanab Creek portion of the USGS Southwest Biological Science Center's Repeat Photography Collection.

Each set of photos is intended to be viewed or downloaded individually, and thus needs a metadata record that matches its contents. This notebook shows how the core metadata manipulation functionality of the MetadataWizard can be used to produce a set of customized records. The script also automates uploading and organizing the files on the release platform ([USGS ScienceBase](https://www.sciencebase.gov/catalog/)).

The resulting output of this script (a USGS Data Release) can be viewed at:
https://www.sciencebase.gov/catalog/item/59a06998e4b038630d030600

### Import the libraries we'll be using

In [1]:
import os
import shutil

import pandas as pd

import pysb
from pymdwizard.core.xml_utils import XMLRecord, XMLNode

### The files we will be reading from, and the output folder we'll be writing to (change these)

In [2]:
# This is a CSDGM record that we'll use as the starting point for each of our output records.
# We'll use string/XML manipulation to replace certain values with the values for a particular site.
template_xml_fname = r"USGS-Southwest-Repeat-Photography-Collection_Kanab-Creek_1872-2010_Stake-694-Metadata.xml"

# This spreadsheet lists all the data packets we'll be updating and uploading to ScienceBase
excel_fname = r"MetadataContents.xlsx"

# The local directory that contains the individual files we're creating metadata for and publishing
input_data_dname = r".\Final Kanab Materials"

# This is a local directory we'll be outputting our final metadata into for QA/QC
output_dname = r".\output"

# The ScienceBase identifier of the parent item that the outputs will be uploaded to.
parent_sb_id = '59a06998e4b038630d030600'

# The format string for our output file names
empty_fname = "USGS-Southwest-Repeat-Photography-Collection_Kanab-Creek_1872-2010_Stake-{}-Metadata.xml"
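
The loop later in this notebook zips one folder per stake (named `s` plus the stake number, under `input_data_dname`). As a minimal sketch of how `shutil.make_archive` treats its arguments, the example below builds a throwaway folder and archives only its contents. The folder name `s694` and the placeholder file are illustrative assumptions, not files from the actual release.

```python
import os
import shutil
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as workdir:
    # Hypothetical stake folder with one placeholder file in it.
    stake_dir = os.path.join(workdir, "s694")
    os.makedirs(stake_dir)
    with open(os.path.join(stake_dir, "photo_1872.txt"), "w") as f:
        f.write("placeholder")

    # base_name is the archive path without its extension;
    # root_dir is the folder whose contents go into the archive.
    zip_fname = shutil.make_archive(stake_dir, "zip", root_dir=stake_dir)

    # List what ended up inside the archive.
    with zipfile.ZipFile(zip_fname) as zf:
        names = zf.namelist()

print(os.path.basename(zip_fname))
print(names)
```

Because `base_name` here sits next to the stake folder rather than inside it, the archive does not end up trying to include itself.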

####  Open our template metadata record

In [3]:
template_md = XMLRecord(template_xml_fname)

#### Extract the title and abstract contents and prepare them

In [4]:
template_md.metadata.idinfo.citation.citeinfo.title.text

'USGS Southwest Repeat Photography Collection: Kanab Creek, southern Utah and northern Arizona, 1872-2010: Stake 694'

In [5]:
empty_title = template_md.metadata.idinfo.citation.citeinfo.title.text.replace('694', '{}')
empty_title

'USGS Southwest Repeat Photography Collection: Kanab Creek, southern Utah and northern Arizona, 1872-2010: Stake {}'

In [6]:
empty_abstract = template_md.metadata.idinfo.descript.abstract.text.replace('694', '{StakeNo}')
empty_abstract = empty_abstract.replace('1872, 1972, and 1991', '{Repeat_Dates}')
empty_abstract = empty_abstract.replace('36.39194', '{Lat}')
empty_abstract = empty_abstract.replace('-112.62944', '{Long}')
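
The replacements above turn the abstract into a template with named placeholders that `str.format` can fill from one spreadsheet row. A minimal sketch of that mechanism follows; the abstract text and the dictionary are stand-ins chosen for illustration, not the wording of the real record.

```python
# Stand-in abstract; the real text comes from the template metadata record.
abstract_template = (
    "Photographs taken at Stake {StakeNo} in {Repeat_Dates}, "
    "located at latitude {Lat} and longitude {Long}."
)

# One spreadsheet row, represented here as a plain dict keyed by column name.
row = {
    "StakeNo": 1212,
    "Lat": 36.44611,
    "Long": -112.63722,
    "Repeat_Dates": "1872, 1968 and 1990",
    "Unnamed: 0": 4,  # extra keys are simply ignored by str.format
}

filled = abstract_template.format(**row)
print(filled)
```

One caveat with this approach: if the source text contained literal `{` or `}` characters, they would need to be doubled before calling `format`.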

### Read in the list of values to propagate into each record

In [7]:
df = pd.read_excel(excel_fname)
df.head(5)

| Unnamed: 0 | StakeNo | Lat | Long | Repeat_Dates |
|---|---|---|---|---|
| 0 | 1096 | 36.39583 | -112.62389 | 1872, 1974, 1983, 1990 and 1991 |
| 1 | 1209 | 36.44944 | -112.6375 | 1872 (no physical image), 1968 and 1990 |
| 2 | 1210 | 36.44317 | -112.65206 | 1872 and 1990 |
| 3 | 1211 | 36.44317 | -112.65206 | 1872 and 1990 |
| 4 | 1212 | 36.44611 | -112.63722 | 1872, 1968 and 1990 |


### Log into ScienceBase

In [8]:
sb = pysb.SbSession()
# loginc() prompts the user for a password interactively
username = "talbertc@usgs.gov"
sb.loginc(username)
print("You are now connected.")

········
You are now connected.


### Run through them all
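
The loop below gives each record a small bounding box centered on the stake location. As a sketch of that calculation, the hypothetical helper `bounding_box` (not part of pymdwizard or pysb) offsets a point by a half-width in decimal degrees, keeping the west edge at the smaller longitude and the east edge at the larger one.

```python
def bounding_box(lat, lon, half_width=0.01):
    """Return the four CSDGM bounding coordinates around a point.

    Hypothetical helper for illustration; half_width is in decimal degrees.
    """
    return {
        "westbc": lon - half_width,   # smaller (more negative) longitude
        "eastbc": lon + half_width,   # larger longitude
        "northbc": lat + half_width,  # larger latitude
        "southbc": lat - half_width,  # smaller latitude
    }


# Coordinates of stake 1096, taken from the first row of the table above.
box = bounding_box(36.39583, -112.62389)

# Format as text with five decimals to hide floating-point noise.
text_values = {name: f"{value:.5f}" for name, value in box.items()}
print(text_values)
```

Formatting to a fixed number of decimals is a design choice made here so that sums such as `36.39583 + 0.01` do not leak floating-point artifacts into the metadata.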

In [9]:
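bounding_width = 0.01  # half the width of the bounding box, in decimal degrees

# parent_citation is used in the loop but is not defined anywhere in this notebook.
# Assumption: it is the citation string of the parent ScienceBase item.
parent_citation = sb.get_item(parent_sb_id)['citation']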

for index, row in df.iterrows():
    print(row["StakeNo"])
    
    # update the title and abstract to match this stake number
    new_title = empty_title.format(row["StakeNo"])
    template_md.metadata.idinfo.citation.citeinfo.title.text = new_title
    new_abs = empty_abstract.format(**row)
    template_md.metadata.idinfo.descript.abstract.text = new_abs
    
    # update the bounding extent to be the stake location +/- the bounding width in all directions
    # (west is the smaller longitude and east the larger; values are rounded and stored as text)
    template_md.metadata.idinfo.spdom.bounding.northbc.text = str(round(row['Lat'] + bounding_width, 5))
    template_md.metadata.idinfo.spdom.bounding.southbc.text = str(round(row['Lat'] - bounding_width, 5))
    template_md.metadata.idinfo.spdom.bounding.westbc.text = str(round(row['Long'] - bounding_width, 5))
    template_md.metadata.idinfo.spdom.bounding.eastbc.text = str(round(row['Long'] + bounding_width, 5))
    
    # save local copy of metadata file
    output_fname = os.path.join(output_dname, empty_fname.format(row["StakeNo"]))
    template_md.save(output_fname)
    
    # make a zip of the files for this output
    # (root_dir is this stake's folder, so only its contents are archived, not the whole input directory)
    data_dname = os.path.join(input_data_dname, 's' + str(row["StakeNo"]))
    shutil.make_archive(data_dname, 'zip', data_dname)
    zip_fname = data_dname + '.zip'
    
    # move the metadata and zip up to ScienceBase
    child_item = sb.upload_file_and_create_item(parent_sb_id, output_fname)
    child_item['citation'] = parent_citation
    sb.update_item(child_item)
    sb.upload_file_to_item(child_item, zip_fname)
    
    print('\t finished upload')
    

1096
	 finished upload
1209
	 finished upload
1210
	 finished upload
1211
	 finished upload
1212
	 finished upload
1213
	 finished upload
1214
	 finished upload
1215
	 finished upload
1216
	 finished upload
1217
	 finished upload
1218
	 finished upload
1219
	 finished upload
1220
	 finished upload
1224
	 finished upload
1226
	 finished upload
1227
	 finished upload
1231
	 finished upload
1232
	 finished upload
1233
	 finished upload
1234
	 finished upload
1235
	 finished upload
1236
	 finished upload
1237
	 finished upload
1238
	 finished upload
1239
	 finished upload
1360
	 finished upload
1504a
	 finished upload
1504b
	 finished upload
1505
	 finished upload
1506
	 finished upload
1792
	 finished upload
1793
	 finished upload
2038
	 finished upload
2039
	 finished upload
2040
	 finished upload
2042ab
	 finished upload
2049
	 finished upload
2050
	 finished upload
2051
	 finished upload
2052
	 finished upload
2053
	 finished upload
2054
	 finished upload
2055
	 finished upload
2066
	 