# YAML file upload and manipulation #

### 1. Upload a local YAML file, for instance the DataCite metadata structure in YAML ###<br>
https://github.com/G-Node/gogs/blob/master/conf/datacite/datacite.yml

This file contains mostly bibliographic metadata.

Here is a cheat sheet for YAML files.

YAML accepts only spaces for indentation, not tabs.
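
A minimal sketch of this rule, assuming the PyYAML package that is used later in this notebook. The two sample strings and the variable names are illustrative only: the same mapping is indented once with spaces and once with a tab.

```python
import yaml

spaces_doc = "license:\n  name: CC0\n"   # two spaces of indentation
tabs_doc = "license:\n\tname: CC0\n"     # a tab instead of spaces

parsed = yaml.safe_load(spaces_doc)
print(parsed)

try:
    yaml.safe_load(tabs_doc)
    tab_rejected = False
except yaml.YAMLError as err:
    # PyYAML raises a subclass of YAMLError that names the offending position
    tab_rejected = True
    print("tab indentation rejected:", type(err).__name__)
```

Configuring the editor to insert spaces when the tab key is pressed avoids this class of error.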

Declaring an integer, a float and a string <br>
`a: 1        # integer` <br>
`b: 1.234    # float` <br>
`c: 'abc'    # string` <br>
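
As a small illustration, assuming PyYAML, the parser infers the Python type from the way each value is written. The key names below are made up for this sketch.

```python
import yaml

doc = """
count: 1        # integer
ratio: 1.234    # float
label: 'abc'    # string
version: '1.2'  # quoted, so it stays a string
"""

values = yaml.safe_load(doc)
for key, value in values.items():
    # Show the Python type that each YAML scalar was converted to
    print(key, type(value).__name__, repr(value))
```

Quoting matters for values such as version numbers: without quotes, `1.2` is read as a float.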

Start a sequence (one item per line)

` keywords: ` <br>
  `- Neuroscience` <br>
  `- Keyword2 ` <br>
  `- Keyword3  `

Create a list inline (flow style)

`my_list: [1, 2, 3]`
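
The two notations are interchangeable. A short sketch, assuming PyYAML, that loads the same keywords written both ways and compares the results:

```python
import yaml

block_style = """
keywords:
  - Neuroscience
  - Keyword2
  - Keyword3
"""
flow_style = "keywords: [Neuroscience, Keyword2, Keyword3]"

# Both notations produce an ordinary Python list
block_list = yaml.safe_load(block_style)['keywords']
flow_list = yaml.safe_load(flow_style)['keywords']
print(block_list == flow_list)
```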

Create a text block (`>` folds the line breaks into spaces, `|` keeps them)

`Abstract: >` <br>
`Abstract: | `    
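
The difference between the two block styles is easiest to see on the parsed strings. A minimal sketch, assuming PyYAML; the keys `folded` and `literal` are illustrative names.

```python
import yaml

doc = """
folded: >
  Example description
  spread over two lines.
literal: |
  Example description
  spread over two lines.
"""

blocks = yaml.safe_load(doc)
# repr() makes the line breaks visible as \n
print(repr(blocks['folded']))
print(repr(blocks['literal']))
```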

Separate the documents in one file with `---` to start a document and `...` to end it

`---` <br>
`content: doc1` <br>
`...` <br>
`---` <br>
`content: doc2`
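
A file with several documents can be read in one pass. The sketch below assumes PyYAML and uses its `safe_load_all` function, which yields one Python object per document.

```python
import yaml

stream = """\
---
content: doc1
...
---
content: doc2
"""

# safe_load_all returns a generator, so it is turned into a list here
documents = list(yaml.safe_load_all(stream))
for number, document in enumerate(documents, start=1):
    print(number, document)
```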

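The following symbols are also special characters in `YAML` <br>
`` * & ! % # ` @ , . `` <br>
`*` references a repeatable node (alias) <br>
`&` defines a repeatable node (anchor) <br>
`!` introduces a tag that sets the type of a node <br>
`%` starts a directive such as `%YAML` <br>
`#` comments out the rest of a line <br>
`@` and `` ` `` are reserved and cannot start a plain value <br>
`,` separates the entries of an inline list or mapping <br>
`.` appears three times in a row (`...`) as the end-of-document marker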
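
To illustrate `&` and `*`, the sketch below (assuming PyYAML) defines a license block once and reuses it under a second key. The key names `default_license` and `dataset_license` and the anchor name `cc0` are invented for this example.

```python
import yaml

doc = """
default_license: &cc0
  name: "Creative Commons CC0 1.0 Public Domain Dedication"
  url: "https://creativecommons.org/publicdomain/zero/1.0/"
dataset_license: *cc0
"""

data = yaml.safe_load(doc)
# The alias expands to the same content as the anchored node
print(data['dataset_license'] == data['default_license'])
print(data['dataset_license']['url'])
```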

### Let's have a look at the YAML file structure and syntax ###

Generally, `YAML` files are used as configuration or parameter files.

Here is an example of a metadata definition for a data publication using the `DataCite Metadata Schema`.

In [None]:
# Metadata for DOI registration according to DataCite Metadata Schema 4.1.
# For detailed schema description see https://doi.org/10.5438/0014

## Required fields

# The main researchers involved. Include digital identifier (e.g., ORCID)
# if possible, including the prefix to indicate its type.
authors:
  -
    firstname: "GivenName1"
    lastname: "FamilyName1"
    affiliation: "Affiliation1"
    id: "ORCID:0000-0001-2345-6789"
  -
    firstname: "GivenName2"
    lastname: "FamilyName2"
    affiliation: "Affiliation2"
    id: "ResearcherID:X-1234-5678"
  -
    firstname: "GivenName3"
    lastname: "FamilyName3"

# A title to describe the published resource.
title: "Example Title"

# Additional information about the resource, e.g., a brief abstract.
description: |
  Example description
  that can contain linebreaks
  but has to maintain indentation.
# List of keywords the resource should be associated with.
# Give as many keywords as possible, to make the resource findable.
keywords:
  - Neuroscience
  - Keyword2
  - Keyword3

# License information for this resource. Please provide the license name and/or a link to the license.
# Please add also a corresponding LICENSE file to the repository.
license:
  name: "Creative Commons CC0 1.0 Public Domain Dedication"
  url: "https://creativecommons.org/publicdomain/zero/1.0/"



## Optional Fields

# Funding information for this resource.
# Separate funder name and grant number by comma.
funding:
  - "DFG, AB1234/5-6"
  - "EU, EU.12345"


# Related publications. reftype might be: IsSupplementTo, IsDescribedBy, IsReferencedBy.
# Please provide digital identifier (e.g., DOI) if possible.
# Add a prefix to the ID, separated by a colon, to indicate the source.
# Supported sources are: DOI, arXiv, PMID
# In the citation field, please provide the full reference, including title, authors, journal etc.
references:
  -
    id: "doi:10.xxx/zzzz"
    reftype: "IsSupplementTo"
    citation: "Citation1"
  -
    id: "arxiv:mmmm.nnnn"
    reftype: "IsSupplementTo"
    citation: "Citation2"
  -
    id: "pmid:nnnnnnnn"
    reftype: "IsReferencedBy"
    citation: "Citation3"


# Resource type. Default is Dataset, other possible values are Software, DataPaper, Image, Text.
resourcetype: Dataset

# Do not edit or remove the following line
templateversion: 1.2

#### The `YAML` file content can be parsed using Python functions ####

In [18]:
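import os
import sys

import yaml
import xmldict
import dicttoxml

# Assumption carried over from the original cell: the last sys.path entry is the
# '.ipython' directory, and the YAML file sits in the folder that contains it.
local = os.path.dirname(sys.path[-1])
filename = os.path.join(local, 'doi_structure.yml')
with open(filename) as file:
    # FullLoader resolves the standard YAML tags into Python objects;
    # for untrusted files, yaml.safe_load is the safer choice.
    yaml_file = yaml.load(file, Loader=yaml.FullLoader)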

The parser returns a dictionary made of keys and values.

In [19]:
print(yaml_file)

{'authors': [{'firstname': 'GivenName1', 'lastname': 'FamilyName1', 'affiliation': 'Affiliation1', 'id': 'ORCID:0000-0001-2345-6789'}, {'firstname': 'GivenName2', 'lastname': 'FamilyName2', 'affiliation': 'Affiliation2', 'id': 'ResearcherID:X-1234-5678'}, {'firstname': 'GivenName3', 'lastname': 'FamilyName3'}], 'title': 'Example Title', 'description': 'Example description\nthat can contain linebreaks\nbut has to maintain indentation.\n', 'keywords': ['Neuroscience', 'Keyword2', 'Keyword3'], 'license': {'name': 'Creative Commons CC0 1.0 Public Domain Dedication', 'url': 'https://creativecommons.org/publicdomain/zero/1.0/'}, 'funding': ['DFG, AB1234/5-6', 'EU, EU.12345'], 'references': [{'id': 'doi:10.xxx/zzzz', 'reftype': 'IsSupplementTo', 'citation': 'Citation1'}, {'id': 'arxiv:mmmm.nnnn', 'reftype': 'IsSupplementTo', 'citation': 'Citation2'}, {'id': 'pmid:nnnnnnnn', 'reftype': 'IsReferencedBy', 'citation': 'Citation3'}], 'resourcetype': 'Dataset', 'templateversion': 1.2}


### Loop over keys and values for a tabular view ###

In [28]:
yaml_file.keys()
for key,value in yaml_file.items():
    print(key + ":" +str(value))

authors:[{'firstname': 'given', 'lastname': 'FamilyName1', 'affiliation': 'Affiliation1', 'id': 'ORCID:0000-0001-2345-6789'}, {'firstname': 'GivenName2', 'lastname': 'FamilyName2', 'affiliation': 'Affiliation2', 'id': 'ResearcherID:X-1234-5678'}, {'firstname': 'GivenName3', 'lastname': 'FamilyName3'}]
title:Example Title
description:Example description
that can contain linebreaks
but has to maintain indentation.

keywords:['Neuroscience', 'Keyword2', 'Keyword3']
license:{'name': 'Creative Commons CC0 1.0 Public Domain Dedication', 'url': 'https://creativecommons.org/publicdomain/zero/1.0/'}
funding:['DFG, AB1234/5-6', 'EU, EU.12345']
references:[{'id': 'doi:10.xxx/zzzz', 'reftype': 'IsSupplementTo', 'citation': 'Citation1'}, {'id': 'arxiv:mmmm.nnnn', 'reftype': 'IsSupplementTo', 'citation': 'Citation2'}, {'id': 'pmid:nnnnnnnn', 'reftype': 'IsReferencedBy', 'citation': 'Citation3'}]
resourcetype:Dataset
templateversion:1.2
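
The loop above prints nested lists and dictionaries on a single line. As a possible refinement, the sketch below flattens the nesting into one row per value. The helper `flatten` and the shortened `metadata` sample are hypothetical and only mirror the structure of the parsed file.

```python
def flatten(value, prefix=""):
    """Yield (path, scalar) pairs for nested dicts and lists."""
    if isinstance(value, dict):
        for key, item in value.items():
            yield from flatten(item, f"{prefix}.{key}" if prefix else str(key))
    elif isinstance(value, list):
        for index, item in enumerate(value):
            yield from flatten(item, f"{prefix}[{index}]")
    else:
        yield prefix, value

# Shortened stand-in for the dictionary returned by the YAML parser
metadata = {
    'authors': [{'firstname': 'GivenName1', 'lastname': 'FamilyName1'}],
    'title': 'Example Title',
    'keywords': ['Neuroscience', 'Keyword2'],
}

rows = list(flatten(metadata))
for path, value in rows:
    # Left-align the path in a fixed-width column
    print(f"{path:<24} {value}")
```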


#### Explore the dictionary entries ####

In [23]:
yaml_file['authors']

[{'firstname': 'GivenName1',
  'lastname': 'FamilyName1',
  'affiliation': 'Affiliation1',
  'id': 'ORCID:0000-0001-2345-6789'},
 {'firstname': 'GivenName2',
  'lastname': 'FamilyName2',
  'affiliation': 'Affiliation2',
  'id': 'ResearcherID:X-1234-5678'},
 {'firstname': 'GivenName3', 'lastname': 'FamilyName3'}]

In [24]:
yaml_file['authors'][0]['firstname']

'GivenName1'

#### Assign new values from the command line ####

In [25]:
yaml_file['authors'][0]['firstname']='given'
yaml_file['authors'][0]['firstname']

'given'
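
A change like this only affects the dictionary in memory. To keep it, the dictionary can be serialized back to YAML. The sketch below assumes PyYAML 5.1 or later (for the `sort_keys` argument) and uses a shortened stand-in for the parsed metadata.

```python
import yaml

# Shortened stand-in for the dictionary returned by the YAML parser
metadata = {
    'authors': [{'firstname': 'GivenName1', 'lastname': 'FamilyName1'}],
    'title': 'Example Title',
}
metadata['authors'][0]['firstname'] = 'given'

# sort_keys=False keeps the original key order; block style is easier to read
text = yaml.safe_dump(metadata, sort_keys=False, default_flow_style=False)
print(text)

# Loading the dumped text again gives back the same dictionary
round_trip = yaml.safe_load(text)
```

Passing an open file object as the second argument of `yaml.safe_dump` writes the text to that file instead of returning a string. Comments from the original file are not preserved by this round trip.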

#### Convert to an XML structure ####

In [26]:
xmltest = dicttoxml.dicttoxml(yaml_file)
print(xmltest)

b'<?xml version="1.0" encoding="UTF-8" ?><root><authors type="list"><item type="dict"><firstname type="str">given</firstname><lastname type="str">FamilyName1</lastname><affiliation type="str">Affiliation1</affiliation><id type="str">ORCID:0000-0001-2345-6789</id></item><item type="dict"><firstname type="str">GivenName2</firstname><lastname type="str">FamilyName2</lastname><affiliation type="str">Affiliation2</affiliation><id type="str">ResearcherID:X-1234-5678</id></item><item type="dict"><firstname type="str">GivenName3</firstname><lastname type="str">FamilyName3</lastname></item></authors><title type="str">Example Title</title><description type="str">Example description\nthat can contain linebreaks\nbut has to maintain indentation.\n</description><keywords type="list"><item type="str">Neuroscience</item><item type="str">Keyword2</item><item type="str">Keyword3</item></keywords><license type="dict"><name type="str">Creative Commons CC0 1.0 Public Domain Dedication</name><url type="str

The XML metadata can then be submitted to the publication service provider.
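
If `dicttoxml` is not available, a similar nesting can be produced with the standard library. The sketch below is an alternative built on `xml.etree.ElementTree`, not the `dicttoxml` implementation: the helper `to_xml` is hypothetical, it omits the `type` attributes, and like the output above it mirrors the dictionary keys rather than the element names of the DataCite XML schema.

```python
import xml.etree.ElementTree as ET

def to_xml(parent, value):
    """Append `value` (dict, list or scalar) below the `parent` element."""
    if isinstance(value, dict):
        for key, item in value.items():
            to_xml(ET.SubElement(parent, str(key)), item)
    elif isinstance(value, list):
        # List entries have no key, so each one becomes an <item> element
        for item in value:
            to_xml(ET.SubElement(parent, 'item'), item)
    else:
        parent.text = str(value)

# Shortened stand-in for the dictionary returned by the YAML parser
metadata = {'title': 'Example Title', 'keywords': ['Neuroscience', 'Keyword2']}

root = ET.Element('root')
to_xml(root, metadata)
xml_text = ET.tostring(root, encoding='unicode')
print(xml_text)
```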

### YAML files are also very suitable as parameter files ###

Python can distinguish between the different sections of a file, so YAML can be used to run a subpackage in Python or to assign analysis parameters.

Let's have a look at the parameter file in AutoStatsQConfig, source: https://github.com/gesape/AutoStatsQ

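Here each item starting with `- !` carries a YAML tag, for example `!autostatsq.config.GeneralSettings`, which names the part of the software that the section configures.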

In [27]:
--- !autostatsq.config.AutoStatsQConfig
Settings:
- !autostatsq.config.GeneralSettings
  work_dir: /some/data/directory/
  list_station_lists: [/path/to/station-file/file.csv, /path/to/station-file/file.xml]
  st_use_list: [STATION] 
  # if set, only stations in this list are considered. remove or set to [] to use all stations
  # in station files.

- !autostatsq.config.CatalogConfig
  search_events: true 
  # search gCMT catalog for events?

  use_local_catalog: false
  # Or use a local (already downloaded) catalog? 
  # Needed for re-runs using the same catalog.

  subset_of_local_catalog: true
  # Find a subset of the full catalog?

  use_local_subsets: false
  # Use local (already saved) subset instead?
  
  subset_fns: {}
  # if so, give here paths to subset-catalog-files: e.g. 
  # {'deep': 'catalog_deep_subset.txt',
  # 'shallow': 'catalog_shallow_subset.txt'}

  min_mag: 6.5
  max_mag: 8.5
  tmin_str: '2000-01-01 00:00:00'
  tmax_str: '2018-10-01 00:00:00'
  min_dist_km: 4000.0
  max_dist_km: 20000.0
  depth_options:
    deep: [25000, 600000] # [m]
    shallow: [100, 40000] # [m]

  wedges_width: 15
  # backazimuthal step for subset generation
  # adjust to get more events, especially if time range is small

  mid_point: [46.98, 10.74]
  # give a rough estimate of midpoint of array/ network
  # optional, if not provided a geographic station midpoint is calculated

  ### catalog plotting options ###
  plot_catalog_all: false
  # plots entire catalog on a map

  plot_hist_wedges: false
  # catalog statistics plot
  
  plot_catalog_subset: false
  # plots the subset(s) on a map

- !autostatsq.config.ArrTConfig
  calc_first_arr_t: true
  # Should first arrivals be computed?

  phase_select: P|p|P(cmb)P(icb)P(icb)p(cmb)p|P(cmb)Pv(icb)p(cmb)p|P(cmb)P<(icb)(cmb)p
  # which phases?

  calc_est_R: true
  # compute arrival time of Rayleigh waves for each station-event pair? 
  # (needed for orientation test)
  v_rayleigh: 4.0  # [km/s] default

- !autostatsq.config.MetaDataDownloadConfig
  # download of metadata and data

  download_data: false
  download_metadata: false 
  use_downmeta: true
  # Set to true if downloaded metadata should be used.

  # local_metadata: [stations.xml]
  # list of local metadata files (uncomment if needed)
  # local_data: [./data]
  # list with paths to local waveform data (uncomment if needed)
  # sds_structure: true
  # if the local waveform data is saved in sds structure, set to true!
  # otherwise assessing local data might be very slow in case of large amounts of data. 
  # working on it...
  # local_data_only: true
  # if only local, no freshly downloaded data is used

  channels_download: HH*
  # '*' would download all and analyse the most broadband channel for each
  # station
  token:
    geofon: /path/to/token/token.asc
  # delete token-dictionary, if no token needed for fdsn query
  sites: [geofon, orfeus, iris]

  dt_start: 0.1
  # start time before origin time [h]
  dt_end: 1.5
  # end time after origin time [h]

- !autostatsq.config.RestDownRotConfig
  # restitution, downsampling and rotation of data
  # required for all tests
  rest_data: false
  freqlim: [0.005, 0.01, 0.2, 0.25] # [Hz]
  rotate_data: false
  deltat_down: 2  # [s]
  # set deltat_down to 0.0 if no downsampling is wanted. (This will slow down everything,
  # and the PSD-test only works if the sampling frequency of synthetic and real data is
  # the same.)

- !autostatsq.config.SynthDataConfig
  # computation of synthetic data
  # needed for PSD-test only, can otherwise be left out
  make_syn_data: false
  engine_path: /path/to/GF_stores
  store_id: global_2s

- !autostatsq.config.GainfactorsConfig
  # settings for first test
  calc_gainfactors: false
  gain_factor_method:
  - reference_nsl
  - [GE, MATE]
  ### describe different methods
  fband:
    corner_hp: 0.01 # [Hz]
    corner_lp: 0.2 # [Hz]
    order: 4
  taper_xfrac: 0.25 # [s]

  wdw_st_arr: 5
  wdw_sp_arr: 60
  # time window around P phase onset, start [s] before and end [s] after theo. arrival time

  snr_thresh: 2. # threshold for snr of used event
  debug_mode: false # if true, time windows are opened in snuffler to check window settings.

  phase_select: first(P|p|P(cmb)P(icb)P(icb)p(cmb)p|P(cmb)Pv(icb)p(cmb)p|P(cmb)P<(icb)(cmb)p)
  components: [Z, R, T]

  # plotting options
  plot_median_gain_on_map: false
  plot_allgains: false

- !autostatsq.config.PSDConfig
  # settings for PSD test
  calc_psd: false
  tinc: 600  # [s]
  tpad: 200  # [s]
  dt_start: 60  # [s] start before arrival of first P phases
  dt_end: 120  # [s] end before arrival of Rayleigh waves
  n_poly: 25
  norm_factor: 50
  f_ign: 0.02  # [Hz]
  only_first: true # outputs only first "flat" frequency range
  plot_psds: false
  plot_ratio_extra: false
  plot_m_rat: false
  plot_flat_ranges: false

- !autostatsq.config.OrientConfig
  # settings for orientation test
  orient_rayl: false
  bandpass: [3.0, 0.01, 0.05]  # [Hz]
  start_before_ev: 30.0  # start before theo. Rayleigh wave arrival, [s]
  stop_after_ev: 480.0  # end after theo. Rayleigh wave arrival, [s]
  ccmin: 0.80
  # min. cross-correlation value. results below this value will not be
  # considered
  debug_mode: false
  # if true, time windows are opened in snuffler to check window settings.

  plot_heatmap: false
  # plot correction angle vs. cross-correlation value as imshow heatmap
  # usually distribution plot is better.
  plot_distr: false
  # plot correction angle vs. cross-correlation value
  # usually distribution plot is better.
  plot_orient_map_fromfile: false
  # plot a map with correction angles as lines
  plot_angles_vs_events: false
  # plot angle vs single events, one plot for each station

- !autostatsq.config.TimingConfig
  # simple test for large timing errors (> 2s)
  timing_test: false
  bandpass: [3, 0.01, 0.1]
  time_wdw: [firstP, 1200]  
  # needs a long time window for correlation
  cc_thresh: 0.6
  # test appropriate setting with debug mode, depends on frequency range
  search_locations: false  
  # uses all stations in station list if false, otherwise all found 
  # in traces are used
  debug_mode: false 
  # starts in interactive mode in snuffler showing the traces and the obtained
  # cross correlation function

- !autostatsq.config.TeleCheckConfig
  tele_check: false

- !autostatsq.config.maps
  # settings for all output maps
  map_size: [30.0, 30.0]
  pl_opt: [46, 11.75, 800000]
  # mid point of map (lat, lon) and radius [m]
  # to use automated map dimensions:
  # pl_opt: ['automatic']
  pl_topo: false
  # plotting topography can be very slow,
  # topographic data will be downloaded first


SyntaxError: invalid syntax (<ipython-input-27-bd8c8967b6bc>, line 1)
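
A configuration like this is YAML text, not Python code, so it has to be read from a file or a string rather than run in a code cell. The software resolves its tags itself; the sketch below does not use that mechanism. It only shows, assuming PyYAML, how a generic loader can accept unknown `!` tags and record them. The names `TaggedConfigLoader`, `construct_tagged` and the `_tag` key are hypothetical, and the configuration is shortened to two sections.

```python
import yaml

class TaggedConfigLoader(yaml.SafeLoader):
    """SafeLoader subclass (hypothetical name) that accepts application tags."""

def construct_tagged(loader, tag_suffix, node):
    # Build plain Python containers and remember which tag the node carried
    if isinstance(node, yaml.MappingNode):
        section = loader.construct_mapping(node, deep=True)
        section['_tag'] = tag_suffix
        return section
    if isinstance(node, yaml.SequenceNode):
        return loader.construct_sequence(node, deep=True)
    return loader.construct_scalar(node)

# Route every tag that starts with '!' to the function above
TaggedConfigLoader.add_multi_constructor('!', construct_tagged)

config_text = """\
--- !autostatsq.config.AutoStatsQConfig
Settings:
- !autostatsq.config.GeneralSettings
  work_dir: /some/data/directory/
  st_use_list: [STATION]
- !autostatsq.config.CatalogConfig
  min_mag: 6.5
  max_mag: 8.5
"""

config = yaml.load(config_text, Loader=TaggedConfigLoader)
# Index the sections by their tag so that each one can be looked up by name
sections = {section['_tag']: section for section in config['Settings']}
print(sorted(sections))
print(sections['autostatsq.config.CatalogConfig']['min_mag'])
```

Registering the constructor on a subclass leaves the stock `SafeLoader` unchanged, so other code that loads YAML in the same session is not affected.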

#### YAML can be combined with CSV data in CSVY files or used as a metadata structure detached from the data array ####
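
A minimal sketch of reading such a combined file, assuming the common CSVY layout in which a YAML header is enclosed between two `---` lines and followed by the CSV table. The station names, the gain values and the header fields are invented for illustration.

```python
import csv
import io

import yaml

csvy_text = """\
---
title: Example Title
fields:
  - name: station
  - name: gain
---
station,gain
STA1,1.02
STA2,0.97
"""

# Split the YAML header from the CSV body at the first two '---' lines
_, header_text, body_text = csvy_text.split("---\n", 2)
header = yaml.safe_load(header_text)
rows = list(csv.DictReader(io.StringIO(body_text)))

print(header['title'])
for row in rows:
    print(row['station'], row['gain'])
```

The CSV reader returns every cell as a string, so numeric columns still need to be converted, for example with `float(row['gain'])`.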

YAML files should be validated to avoid hidden misconfigurations or indentation errors.
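
One simple form of validation is to check the parsed result against the fields the template marks as required. The sketch below assumes PyYAML; the function `validate_metadata` and the two sample strings are hypothetical, and the list of required keys is taken from the "Required fields" part of the DataCite template shown earlier.

```python
import yaml

REQUIRED_KEYS = ('authors', 'title', 'description', 'keywords', 'license')

def validate_metadata(text):
    """Return a list of problems found in a DataCite-style YAML string."""
    try:
        data = yaml.safe_load(text)
    except yaml.YAMLError as err:
        # Syntax problems such as tab indentation end up here
        return [f"not valid YAML: {err}"]
    if not isinstance(data, dict):
        return ["top level is not a mapping"]
    problems = [f"missing required key: {key}"
                for key in REQUIRED_KEYS if key not in data]
    if not isinstance(data.get('authors', []), list):
        problems.append("authors must be a list")
    return problems

complete = """
authors:
  - firstname: GivenName1
    lastname: FamilyName1
title: Example Title
description: Example description
keywords: [Neuroscience]
license:
  name: CC0
"""
incomplete = "title: Example Title\nauthors: GivenName1\n"

complete_problems = validate_metadata(complete)
incomplete_problems = validate_metadata(incomplete)
print(complete_problems)
print(incomplete_problems)
```

The second sample is valid YAML and would load without any error, which is why a structural check of this kind is useful in addition to the parser's own syntax check.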