## SquamataMT - Jupyter notebook for releasing MT data to ScienceBase

This module performs the following operations:
- Create list of data directories
- Identify and load EDI file
- Collect release specific parameters common to ALL metadata
- Harvest EDI file values
- Identify files accompanying data release
- Populate MetaData Template

To execute a cell, select it and press Shift+Enter.

**The 'r' prefix marks a raw string literal, in which backslashes are not treated as escape characters. Use it for Windows paths.**
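
A short illustration of why the `r` prefix matters for Windows paths: without it, Python reads sequences such as `\t` and `\n` as escape characters. The path below is made up for the example.

```python
# Hypothetical path, chosen because it contains "\t" and "\n"
plain = "C:\temp\new_data"    # "\t" becomes a tab, "\n" a newline
raw = r"C:\temp\new_data"     # backslashes are kept literally

print(len(plain))  # 14: each of the two escapes collapses to one character
print(len(raw))    # 16: every character is kept as typed
print("\t" in plain, "\t" in raw)
```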

Metadata Wizard:  Advanced, open in a Jupyter notebook?
Metadata Wizard 2.0 from ScienceBase

In [1]:
# Phil Brown (pbrown@usgs.gov) 2018
# Working Python 3 Notebook used to facilitate the release of Magnetotelluric (MT) Data to ScienceBase.

In [2]:
# Test Cell
print ("Jupyter is working.") #To run this cell, hold down Shift and press Enter.

Jupyter is working.


In [3]:
# Load required libraries
import sys
import os
import re
import csv
import glob
import json
import pickle
import shutil
import zipfile
import datetime
import fileinput
from shutil import copyfile

import requests
import pysb
import numpy as np
import pandas as pd
from lxml import etree
##from pymdwizard.core.xml_utils import XMLRecord
##from pymdwizard.core.xml_utils import XMLNode
import ipywidgets as widgets  # widget toolkit (HBox, Label, Textarea)
from IPython.display import display, HTML



In [8]:
#Set Data Paths - perhaps we'll get a user form to do this some day?
mtDataPath = r"C:\CurrentWork\DataReleases\SquamataMT_TEST" #The 'r' prefix marks a raw string literal. Use it for Windows paths.
mtMataDataTemplatePath = r"C:\CurrentWork\DataManagement\SquamataMT"
mtMataDataTemplateName = "MT-MetaData_TEMPLATE.xml"

In [44]:
#Check paths
print ('The MT Data Path is: ' + '"' + mtDataPath + '"')
os.path.join(mtMataDataTemplatePath, mtMataDataTemplateName) #join with a path separator between folder and file name

The MT Data Path is: "C:\CurrentWork\DataReleases\SquamataMT_TEST"


'C:\\CurrentWork\\DataManagement\\SquamataMT\\MT-MetaData_TEMPLATE.xml'

## Now, let's explore our data. 
- What files do we have? 
- What files do we import values from?

In [10]:
#Review content in file explorer

In [15]:
mtDataDirList = os.listdir(mtDataPath)
mtDataDirList

['AMT01', 'AMT02', 'AMT03', 'AMT04', 'AMT05']
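
`os.listdir` returns files as well as folders, so a stray file in the release folder would be treated as a station. A minimal sketch of listing only sub-directories follows. The function name `list_station_dirs` is hypothetical, and the demonstration uses a temporary folder rather than the real data path.

```python
import os
import tempfile

def list_station_dirs(data_path):
    """Return the sorted names of the sub-directories of data_path."""
    with os.scandir(data_path) as entries:
        return sorted(entry.name for entry in entries if entry.is_dir())

# Demonstration on a throwaway folder (not the real release data)
with tempfile.TemporaryDirectory() as tmp:
    for name in ("AMT02", "AMT01"):
        os.mkdir(os.path.join(tmp, name))
    open(os.path.join(tmp, "readme.txt"), "w").close()  # a file that should be ignored
    stations = list_station_dirs(tmp)

print(stations)
```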

In [16]:
mtStationPath = os.path.join(mtDataPath, mtDataDirList[0])
mtStationPath

'C:\\CurrentWork\\DataReleases\\SquamataMT_TEST\\AMT01'

In [17]:
#Look for EDI file to load
ediList = glob.glob(os.path.join(mtStationPath, '**/*MT*.edi'),  recursive=True)
ediPath = ediList[0]
ediPath

'C:\\CurrentWork\\DataReleases\\SquamataMT_TEST\\AMT01\\USA-New_Mexico-Rio_Grande_Rift-San_Luis_Basin-2009-AMT01.edi'
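
`ediList[0]` raises an `IndexError` when a station folder has no matching EDI file. A hedged sketch of a lookup that reports this case explicitly is shown here. `find_edi` is a hypothetical helper, and the `*MT*.edi` pattern is simply the one used in the cell above.

```python
import glob
import os
import tempfile

def find_edi(station_path, pattern="*MT*.edi"):
    """Return the first EDI file under station_path, or None if there is none."""
    matches = sorted(glob.glob(os.path.join(station_path, "**", pattern), recursive=True))
    return matches[0] if matches else None

# Demonstration on a throwaway folder with a made-up file name
with tempfile.TemporaryDirectory() as tmp:
    missing = find_edi(tmp)                      # nothing there yet
    open(os.path.join(tmp, "Example-AMT01.edi"), "w").close()
    found_name = os.path.basename(find_edi(tmp))

print(missing, found_name)
```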

## Enter the information unique to this data set but common to all metadata files
### These include:
- Data Release Title
- Data Release Originator(s)
- Larger Work Title
- Larger Work Originator(s)
- Larger Work URL
- Theme Keywords
- Location Keywords
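
One way to hold these release-wide values is a small dataclass that every station's metadata record reuses. This is only a sketch: the class name, field names, and sample values are all invented for illustration and are not part of the notebook.

```python
from dataclasses import dataclass, field, asdict
from typing import List

@dataclass
class ReleaseInfo:
    """Values shared by every metadata file in one data release."""
    title: str
    originators: List[str]
    larger_work_title: str
    larger_work_originators: List[str]
    larger_work_url: str
    theme_keywords: List[str] = field(default_factory=list)
    location_keywords: List[str] = field(default_factory=list)

# Placeholder values only
release = ReleaseInfo(
    title="Example data release title",
    originators=["Originator A", "Originator B"],
    larger_work_title="Example larger work",
    larger_work_originators=["Originator C"],
    larger_work_url="https://example.com/report.pdf",
    theme_keywords=["magnetotelluric"],
    location_keywords=["New Mexico"],
)
release_dict = asdict(release)  # plain dict, convenient for template filling
print(sorted(release_dict))
```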

In [18]:
## Test of creating a Jupyter GUI to get this info
## Visit https://ipywidgets.readthedocs.io/en/stable/examples/Widget%20List.html
drTitleText = widgets.Textarea() #keep a handle on the input; the HBox itself has no .value
drTitle = widgets.HBox([widgets.Label(value="Data Release Title"), drTitleText])
display(drTitle)
drOriginators = widgets.HBox([widgets.Label(value="Data Release Originator(s)"), widgets.Textarea()])
display(drOriginators)
drLgWkTitle = widgets.HBox([widgets.Label(value="Larger Work Title"), widgets.Textarea()])
display(drLgWkTitle)

def handle_change(change):
    #Print the title each time the text area's value changes
    print(change['new'])

drTitleText.observe(handle_change, names='value')

#  At this point I realized that there is no reason for this Jupyter spaghetti and abandoned this effort to recreate the whole thing using Google Apps.  WAY less clunky and it's on the cloud :)
>> >> >> **PJB** 9/20/18 14:11

In [19]:
## Test of getting this info from a Google Sheet
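
A possible shape for this step, assuming the sheet is exported or published as CSV with one `field,value` pair per row. That layout, the column names, and the sample rows are assumptions. The text is parsed from a string here so the sketch runs offline.

```python
import csv
import io

# Stand-in for the CSV text of a sheet (layout is assumed)
sheet_csv = """field,value
Data Release Title,Example release title
Larger Work Title,Example larger work
"""

def read_release_fields(csv_text):
    """Map each 'field' cell to its 'value' cell."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["field"]: row["value"] for row in reader}

release_fields = read_release_fields(sheet_csv)
print(release_fields["Data Release Title"])
```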

## Let's now import and index values from the EDI files
- We need these values for the metadata template.  
- We also want to run stats on some of these values for the entity and attributes section

In [20]:
#Load EDI file and read it
with open(ediPath, 'r') as ediFile:
    ediContent = ediFile.read()
print(ediContent)


>HEAD                                                                           
                                                                                
  DATAID="Wheeler Peak"                                                         
  ACQBY=USGS                                                                    
  ACQDATE=2009-07-21
  STATE="New Mexico"                                                            
  COUNTY=Taos                                                                   
  UNITS=M                                                                       
  STDVERS=1.0                                                                   
  PROGVERS=GEOTOOLS_2.3                                                         
  PROGDATE=09/16/94                                                             
                                                                                
>INFO   MAXLINES=1000                                                           
       

In [21]:
#Now assign values to the SB MetaDataWizard Template unknowns
list_ = ediContent.splitlines()
list_length = len (list_)

for X in list_:
  if "STATE" in X:
    stateArray = X.split('=')
    state = stateArray[1]
    print ('State: ' + state)
  if "COUNTY" in X:
    countyArray = X.split('=')
    county = countyArray[1]
    print ('County: ' + county)
  if "Attachment Filename" in X and "http" in X:
    lgwrklinkArray = X.split('=')
    lgwrklink = lgwrklinkArray[1]
    print ('Attachment Filename Link: ' + lgwrklink)
  
 ## commented out - add larger work to template manually 
  ##if "Citation Title" in X: 
    ##lgwrkTitleArray = X.split('=')
    ##lgwrkTitle = lgwrkTitleArray[1]
    ##print ('Larger Work Title: ' + lgwrkTitle)

    
# Code below returns values that occupy more than one line
    
for i in range(list_length):
 value = list_[i] 
 if value.replace(" ", "") == 'SurveyPurposeDescription:':
   startIndPurpose = i + 1
   #print ('startIndPurpose: ' + str(startIndPurpose))
 if value.replace(" ", "") == 'DataDescription:':
   endIndPurpose = i - 1
   #print ('endIndPurpose: ' + str(endIndPurpose))
purpose = list_[startIndPurpose]
for j in range(startIndPurpose + 1,endIndPurpose): 
    purpose = purpose + list_[j]
#Collapse runs of spaces once, after all lines are joined
purposeClean = re.sub(' +', ' ',purpose)
print ('Purpose: ' + purposeClean)

for k in range(list_length):
 value = list_[k] 
 if value.replace(" ", "") == 'DataDescription:':
   startIndDescription = k + 1
   #print ('startIndDescription: ' + str(startIndDescription))
 if value.replace(" ", "") == 'FILECREATOR:':
   endIndDescription = k - 9
   #print ('endIndDescription: ' + str(endIndDescription))
description = list_[startIndDescription]
for l in range(startIndDescription + 1,endIndDescription): 
    description = description + list_[l]
#Collapse runs of spaces once, after all lines are joined
descriptionClean = re.sub(' +', ' ',description)
print ('Description: ' + descriptionClean)
    

State: "New Mexico"                                                            
County: Taos                                                                   
Attachment Filename Link: https://pubs.usgs.gov/of/2011/1264/report/OF11-1264.pdf    
Purpose:  This dataset includes audio-magnetotelluric (AMT) sounding data collected in 2009 in and near the San Luis Basin, New Mexico. The U.S. Geological Survey conducted a series of multidisciplinary studies, including AMT surveys, in the San Luis Basin to improve understanding of the hydrogeology of the Santa Fe Group and the nature of the sedimentary deposits comprising the principal groundwater aquifers of the Rio Grande rift. The shallow unconfined and the deeper confined Santa Fe Group aquifers in the San Luis Basin are the main sources of municipal water for the region. The population of the San Luis Basin region is growing rapidly and water shortfalls could have serious consequences. Future growth and land management in the region dep
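
The substring tests above (for example `"STATE" in X`) match any line that merely contains the word. A sketch of a stricter approach splits each `KEY=VALUE` line once and stores the exact key. `parse_head` is a hypothetical helper, and the sample text copies a few lines of the `>HEAD` block printed earlier.

```python
def parse_head(edi_text):
    """Collect KEY=VALUE pairs from the >HEAD block of an EDI file."""
    values = {}
    in_head = False
    for line in edi_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(">"):
            # A new block starts; only the HEAD block is of interest
            in_head = stripped.startswith(">HEAD")
            continue
        if in_head and "=" in stripped:
            key, value = stripped.split("=", 1)
            values[key.strip()] = value.strip().strip('"')
    return values

sample = '''>HEAD
  DATAID="Wheeler Peak"
  STATE="New Mexico"
  COUNTY=Taos
>INFO   MAXLINES=1000
'''
head = parse_head(sample)
print(head["STATE"], "/", head["COUNTY"])
```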

Entity and Attribute Values for the EDI file.  List !****FREQUENCIES****!,!****IMPEDANCE ROTATION ANGLES****!,!****IMPEDANCES****!,!****COMPUTED PARAMETERS****!

>!****IMPEDANCES****!

In [37]:
# Import entity and attributes - !****IMPEDANCES****! plan to break some of these individual chunks into objects/functions

# Get Range of Impedance Values in EDI File
for k in range(list_length):
 value = list_[k] 
 if value.replace(" ", "") == '>!****IMPEDANCES****!':
   startIndImpedances = k + 1
   print ('startIndImpedances: ' + str(startIndImpedances))
 
 if value.replace(" ", "") ==  '>!****TIPPERPARAMETERS****!':
   endIndImpedances = k - 1
   print ('endIndImpedances: ' + str(endIndImpedances))

#Construct Array of Channel Headers
impedanceLabel = []
#Construct a table of Headers and Values using Pandas, https://pandas.pydata.org/
impedanceDF = pd.DataFrame() #start empty; one column is added per channel
for l in range(startIndImpedances,endIndImpedances): 
    if list_[l].startswith('>'):
        label = list_[l].split(" ", 1)[0].split(">")[1] #e.g. '>ZXXR ...' gives 'ZXXR'
        impedanceLabel.append(label)
        #Join the 7 lines of values that follow the header and collapse repeated spaces
        dataTemp = re.sub(' +', ' ', "".join(list_[l+1:l+8]))
        data = dataTemp.split(" ")
        del data[0] #drop the empty string left by the leading space
        data = np.array(data).astype(float) #convert strings to floats
        print (label)
        impedanceDF[label] = data

impedanceDF

startIndImpedances: 310
endIndImpedances: 417
ZXXR
ZXXI
ZXX.VAR
ZXYR
ZXYI
ZXY.VAR
ZYXR
ZYXI
ZYX.VAR
ZYYR
ZYYI
ZYY.VAR


Unnamed: 0,ZXXR,ZXXI,ZXX.VAR,ZXYR,ZXYI,ZXY.VAR,ZYXR,ZYXI,ZYX.VAR,ZYYR,ZYYI,ZYY.VAR
0,-1034.64185,-470.741638,77675.1172,1310.89575,1485.89343,124129.258,-1041.802,124.148865,58523.8711,1096.18286,741.179688,93524.4766
1,-186.487137,-1.843341,95383.2188,-538.065002,358.139923,166608.844,232.887787,-225.325912,40724.8711,-148.396179,-412.495636,71135.4063
2,705.366882,-568.493835,52776.5898,2607.10547,725.450928,23843.0176,-1038.81592,-533.378906,42088.9336,56.658295,692.460388,19014.625
3,39.237244,-300.534912,58721.4453,499.906677,-102.019913,53571.4922,-101.110085,17.370205,39880.7617,-142.460007,208.576767,36383.1602
4,-173.708206,-108.13633,4034.93579,992.3797,-208.713364,13069.5557,154.865479,704.395508,11099.8057,-266.053741,1111.18359,35953.3633
5,-366.549561,-457.271545,5517.53516,1521.89392,423.235443,5077.17383,-504.352966,246.435242,2973.76416,-624.017212,65.458847,2736.42432
6,-642.188599,-1023.43903,1788.85681,1479.15466,981.460632,2718.04932,-255.760269,397.993866,3355.17944,143.025375,-414.32193,5097.97266
7,-270.525299,-465.956268,5237.50586,183.664383,-154.660202,3472.99194,-133.022064,91.446121,8720.07129,167.38649,251.904846,5782.28271
8,-152.127914,-132.579636,31455.8555,571.73468,387.82074,25928.2871,-110.850121,-0.469128,8548.54492,-81.162216,-112.738808,7046.35547
9,-676.03363,250.074554,62496.2813,1083.2146,636.968933,12231.9814,-260.402252,-769.099731,19806.7383,-164.157669,37.810867,3876.6416
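
The cell above assumes every channel's values occupy exactly seven lines. A sketch that avoids the fixed count reads values until the next line starting with `>`. The helper name and the miniature sample are hypothetical, and the sketch assumes, as the cell above does, that each block starts with `>LABEL` followed by whitespace-separated numbers.

```python
def parse_blocks(lines):
    """Return {label: [floats]} for each '>LABEL' block in lines."""
    blocks = {}
    current = None
    for line in lines:
        if line.startswith(">"):
            # '>ZXXR ROT=ZROT //3' gives the label 'ZXXR'
            current = line[1:].split(" ", 1)[0]
            blocks[current] = []
        elif current is not None:
            blocks[current].extend(float(tok) for tok in line.split())
    return blocks

# Tiny made-up sample in the same layout
sample_lines = [
    ">ZXXR ROT=ZROT //3",
    " -1.5 2.0",
    " 3.25",
    ">ZXXI ROT=ZROT //2",
    " 0.5 -0.25",
]
blocks = parse_blocks(sample_lines)
# (min, max) per channel, the same statistics gathered for the metadata
ranges = {label: (min(vals), max(vals)) for label, vals in blocks.items()}
print(ranges)
```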


In [38]:
# Now let's get the stats of the impedance data
#Make arrays of max and min values, one entry per channel label
impedanceMax = [impedanceDF[label].max() for label in impedanceLabel]
impedanceMin = [impedanceDF[label].min() for label in impedanceLabel]

impedanceMin

[-1034.64185,
 -1023.43903,
 25.4506073,
 -538.065002,
 -208.713364,
 32.026619,
 -1041.802,
 -769.099731,
 41.5653725,
 -624.017212,
 -414.32193,
 41.1588516]

## Now let's get the range of values from the RSP files

In [40]:
#First Get the list of RSP files
rspList = glob.glob(os.path.join(mtStationPath, '*.RSP'),  recursive=True)
rspList

['C:\\CurrentWork\\DataReleases\\SquamataMT_TEST\\AMT01\\BF6-9621.RSP',
 'C:\\CurrentWork\\DataReleases\\SquamataMT_TEST\\AMT01\\BF6-9624.RSP',
 'C:\\CurrentWork\\DataReleases\\SquamataMT_TEST\\AMT01\\BF6-9625.RSP',
 'C:\\CurrentWork\\DataReleases\\SquamataMT_TEST\\AMT01\\EF-9608X.RSP',
 'C:\\CurrentWork\\DataReleases\\SquamataMT_TEST\\AMT01\\EF-9608Y.RSP']

## Now the raw Binary File Listing - this can be T files or W files
We will need to figure out the best way of filtering on these - may need to build an array and then delete the AVG, dmp and edi files.
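
One way to do that filtering is to list everything in the station folder and drop files by extension. The sketch below is an assumption about how the exclusion might work: `EXCLUDED` holds the three extensions named above, `raw_binary_files` is a hypothetical helper, and the file names are illustrative.

```python
import os

EXCLUDED = {".avg", ".dmp", ".edi"}   # extensions named in the note above

def raw_binary_files(file_names):
    """Keep names whose extension (case-insensitive) is not excluded."""
    return [name for name in file_names
            if os.path.splitext(name)[1].lower() not in EXCLUDED]

# Illustrative names only
names = ["WP01A1.BP1", "WP01A1.TS1", "WP01.AVG", "WP01.dmp", "Example-AMT01.edi"]
kept = raw_binary_files(names)
print(kept)
```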

In [43]:
#Get the list of raw binary files (WP* prefix)
binList = glob.glob(os.path.join(mtStationPath, 'WP*.*'),  recursive=True)
binList

['C:\\CurrentWork\\DataReleases\\SquamataMT_TEST\\AMT01\\WP01A1.BP1',
 'C:\\CurrentWork\\DataReleases\\SquamataMT_TEST\\AMT01\\WP01A1.FC6',
 'C:\\CurrentWork\\DataReleases\\SquamataMT_TEST\\AMT01\\WP01A1.FC7',
 'C:\\CurrentWork\\DataReleases\\SquamataMT_TEST\\AMT01\\WP01A1.FC8',
 'C:\\CurrentWork\\DataReleases\\SquamataMT_TEST\\AMT01\\WP01A1.FC9',
 'C:\\CurrentWork\\DataReleases\\SquamataMT_TEST\\AMT01\\WP01A1.SD6',
 'C:\\CurrentWork\\DataReleases\\SquamataMT_TEST\\AMT01\\WP01A1.SD7',
 'C:\\CurrentWork\\DataReleases\\SquamataMT_TEST\\AMT01\\WP01A1.SD8',
 'C:\\CurrentWork\\DataReleases\\SquamataMT_TEST\\AMT01\\WP01A1.SD9',
 'C:\\CurrentWork\\DataReleases\\SquamataMT_TEST\\AMT01\\WP01A1.TS1',
 'C:\\CurrentWork\\DataReleases\\SquamataMT_TEST\\AMT01\\WP01A2.BP1',
 'C:\\CurrentWork\\DataReleases\\SquamataMT_TEST\\AMT01\\WP01A2.FC6',
 'C:\\CurrentWork\\DataReleases\\SquamataMT_TEST\\AMT01\\WP01A2.FC8',
 'C:\\CurrentWork\\DataReleases\\SquamataMT_TEST\\AMT01\\WP01A2.FC9',
 'C:\\CurrentWork\\D

# Populate Metadata Template

In [49]:
#Load metadata template file and read it
metaData = os.path.join(mtMataDataTemplatePath, mtMataDataTemplateName)
with open(metaData, 'r') as xmlTemplateFile:
    metaDataContent = xmlTemplateFile.read()
print(metaDataContent)


<?xml version="1.0" encoding="UTF-8"?>
<metadata>
	<idinfo>
		<citation>
			<citeinfo>
				<origin>{origin}</origin>
				<pubdate>{pubdate}</pubdate>
				<title>{title}</title>
				<edition>{edition}</edition>
				<geoform>ASCII and Binary Digital Data</geoform>
				<pubinfo>
					<pubplace>Denver, CO</pubplace>
					<publish>U.S. Geological Survey</publish>
				</pubinfo>
				<othercit>{othercit}</othercit><!--Please add an Orcid ID here e.g., "Additional information about Originator: Rodriguez, B.D, http://orcid.org/0000-0002-2263-611X"-->
				<onlink>{onlink}</onlink>
				<lworkcit>
					<citeinfo>
						{BeginOriginLoop}<!--Place to print larger work originators here. Example is:
						<origin>Originating Author Name</origin> /carrage return (CR is &#13; and not &#10; which is LF)
						-->
						<pubdate>{lworkcit-pubdate}</pubdate>
						<title>{lworkcit-title}</title>
						<geoform>PDF</geoform>
						<serinfo>
							<sername>{lworkcit-sername}</sername>
							<issue>{lworkci

In [77]:
# Replace a placeholder value
title = 'New AMT Title Yippie'

# str.replace returns a new string; assign the result to keep the change
newMetaDataContent = metaDataContent.replace('{title}',title)
print (newMetaDataContent)

<?xml version="1.0" encoding="UTF-8"?>
<metadata>
	<idinfo>
		<citation>
			<citeinfo>
				<origin>{origin}</origin>
				<pubdate>{pubdate}</pubdate>
				<title>New AMT Title Yippie</title>
				<edition>{edition}</edition>
				<geoform>ASCII and Binary Digital Data</geoform>
				<pubinfo>
					<pubplace>Denver, CO</pubplace>
					<publish>U.S. Geological Survey</publish>
				</pubinfo>
				<othercit>{othercit}</othercit><!--Please add an Orcid ID here e.g., "Additional information about Originator: Rodriguez, B.D, http://orcid.org/0000-0002-2263-611X"-->
				<onlink>{onlink}</onlink>
				<lworkcit>
					<citeinfo>
						{BeginOriginLoop}<!--Place to print larger work originators here. Example is:
						<origin>Originating Author Name</origin> /carrage return (CR is &#13; and not &#10; which is LF)
						-->
						<pubdate>{lworkcit-pubdate}</pubdate>
						<title>{lworkcit-title}</title>
						<geoform>PDF</geoform>
						<serinfo>
							<sername>{lworkcit-sername}</sername>
							<issue>{lworkci

In [None]:
# Write new xml file to appropriate directory
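
A minimal sketch of this step, assuming the template keeps its `{name}` placeholders and that each value should be XML-escaped before substitution. `fill_template`, the output file name, and the sample values are hypothetical. The demonstration writes to a temporary folder.

```python
import os
import tempfile
from xml.sax.saxutils import escape

def fill_template(template_text, values):
    """Replace each {key} placeholder with its XML-escaped value."""
    for key, value in values.items():
        template_text = template_text.replace("{" + key + "}", escape(value))
    return template_text

# Cut-down stand-in for the real template
template = "<citeinfo><origin>{origin}</origin><title>{title}</title></citeinfo>"
filled = fill_template(template, {"title": "Soundings & Surveys", "origin": "USGS"})

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "AMT01_metadata.xml")
    with open(out_path, "w", encoding="utf-8") as xml_out:
        xml_out.write(filled)
    with open(out_path, encoding="utf-8") as xml_in:
        written = xml_in.read()

print(written)
```

Plain `str.replace` is used rather than `str.format` so that placeholders that are not being filled yet are left untouched instead of raising a `KeyError`.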

## Process Logging
At some point we will probably want to trap errors and collect them in a list, then write them to a file after processing is complete.

In [15]:
#Process Logging

# Write the log next to the MT data (an assumed location; adjust as needed)
pl = os.path.join(mtDataPath, 'ProcessingLog.txt')

with open(pl, 'w') as process_log: # use 'a' to append instead of overwriting
    process_log.write(str(datetime.datetime.now()))
    process_log.write("\nSomething was performed.")
    process_log.write("\nSomething else was done.")
    process_log.write("\nWe can record information about what a script was doing in a notes/processing file.")

print ("Process log saved at:", pl)
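
The error trapping described above could look roughly like this: each station is processed inside `try`/`except`, failures are collected in a list, and the list is written to the log once processing finishes. `process_station` is a stand-in for the real per-station work, and the log is written to a temporary folder for the demonstration.

```python
import datetime
import os
import tempfile

def process_station(name):
    """Stand-in for real processing; fails for one station on purpose."""
    if name == "AMT02":
        raise ValueError("no EDI file found")
    return name

errors = []
for station in ["AMT01", "AMT02", "AMT03"]:
    try:
        process_station(station)
    except Exception as exc:          # trap the error and keep going
        errors.append(f"{station}: {exc}")

with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "ProcessingLog.txt")
    with open(log_path, "w") as log:
        log.write(str(datetime.datetime.now()) + "\n")
        log.write(f"{len(errors)} error(s)\n")
        log.writelines(line + "\n" for line in errors)
    with open(log_path) as log:
        log_lines = log.read().splitlines()

print(log_lines[1:])
```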