# Meta-genome xml parse

This Jupyter notebook is designed to work with the data stored at portal.meta-genome.org via pycdcs. This repo contains a module (xml_parse) that wraps pycdcs to parse XML payloads from submissions in a Pythonic way, e.g. storing iterations of stress-strain data from a single submission in a single pandas DataFrame (assuming the same units and interval).

Our schema is large, so there are many avenues for data scraping. Here is a list of data parsing targets and ideas:
 - Collate reference information into a dictionary, i.e.:\
   {"submission ID" : {"authors": ["O. Duncan", "R. Feynman"],\
                      "publication title": "pub title here", ...}}
 - Collate metamaterial general information into a dictionary, i.e.:\
    {"submission ID" : {"metamaterial family": "Foam",\
                        "Unusual material property": "negative Poisson's ratio",\
                        "Strain convention" : "True",\
                        "Stress convention" : "True"}}
 - Collate single component measures and units for base materials, i.e.:\
   {"submission ID" : {"base material ID" : {"Material name" : "Nylon 12",\
                                             "Material classification" : "Metallic", ...\
                                             "Directional Sensitivity" : "Isotropic"}, ...}}
 - Collate single component measures and units for the metamaterial, i.e.:\
   {"submission ID" : {"metamaterial 1" : {"Material name" : "Nylon 12",\
                                             "Material classification" : "Metallic", ...\
                                             "Directional Sensitivity" : "Isotropic"}, ...}}
 - From a list of submissions available to the user, generate a dict containing a pandas DataFrame for each continuous data curve, i.e.:\
 {"submission1-ID" : {"metamaterial1" : [pandas.df.stress-strain1, pandas.df.stress-strain2, pandas.df.stress-strain3, ...],\
                      "metamaterial2" : [pandas.df.stress-strain1, pandas.df.stress-strain2, pandas.df.stress-strain3, ...]}}

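The first target in the list above (collating reference information) could be sketched roughly as follows. This is a minimal sketch, not the xml_parse implementation: the function name `collate_references`, the trimmed `SAMPLE_XML` payload and the idea of passing `(submission ID, xml string)` pairs are all assumptions for illustration. The element names are the ones shown by `interactive_expansion()`, and the sketch assumes the tags carry no default XML namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed payload: real submissions carry many more elements.
SAMPLE_XML = """<map>
  <publication-info>
    <publication-authors>O. Duncan</publication-authors>
    <publication-authors>R. Feynman</publication-authors>
    <publication-title>pub title here</publication-title>
  </publication-info>
</map>"""


def collate_references(records):
    """Map each submission ID to its authors and publication title.

    `records` is an iterable of (submission_id, xml_string) pairs.
    """
    collated = {}
    for submission_id, xml_string in records:
        root = ET.fromstring(xml_string)
        info = root.find("publication-info")
        if info is None:
            # Skip submissions that have no publication block.
            continue
        collated[submission_id] = {
            "authors": [a.text for a in info.findall("publication-authors")],
            "publication title": info.findtext("publication-title"),
        }
    return collated


references = collate_references([("82", SAMPLE_XML)])
print(references)
```

With a query result from pycdcs, the pairs could be built from the `id` and `xml_content` columns, e.g. `zip(my_query.id, my_query.xml_content)`.
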
Here is a user's workspace submission hierarchy for continuous stress-strain data (the most complex case):
```
  workspace
  |───submission-1
  |   |───base material properties
  |   |   |───Directional Sensitivity (ISOTROPIC IS UNIQUE FROM TRANS AND ORTHO)
  |   |   |   |───Stress-strain data
  |   |   |   |   |  
  |   |   |   |   |───data-block-1
```

This means we need a five-stage iteration for the continuous data. So for all records we will need a final dictionary that looks like:\
\
{"Workspace1": \
    {"submission ID": \
        {"base material properties ID" : \
            {"base material properties ID" : \
                {"Directional Sensitivity TYPE ID": \
                    {"Stress-strain datablock ID" : PANDAS.DF}}}}}}
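
One way to build a nested dictionary of this shape is to walk a tuple of keys and create the intermediate levels on demand. The sketch below is only an illustration: `insert_nested` and the key names in `path` are hypothetical, and a plain list of (strain, stress) pairs stands in where a pandas DataFrame would be stored.

```python
def insert_nested(tree, path, value):
    """Store `value` in the nested dict `tree` under the sequence of keys in `path`."""
    node = tree
    for key in path[:-1]:
        # Create the intermediate level if it does not exist yet.
        node = node.setdefault(key, {})
    node[path[-1]] = value
    return tree


records = {}
# Hypothetical keys, one per level of the hierarchy.
path = ("Workspace1", "submission-1", "base-material-1", "isotropic", "data-block-1")
# Placeholder for a DataFrame: (strain, stress) pairs.
insert_nested(records, path, [(0.0, 0.0), (0.01, 1.2)])
print(records)
```

Because every level is created lazily, the same helper can be called once per data block while iterating over the hierarchy.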

Core functionality will first need to be established, i.e. getting the data into the desired format; then I can build up to parsing through all requested data sets.

This iteration of the meta-genome-cdcs module and Jupyter notebook is primarily concerned with data formatting.

Thinking:
Produce a class that takes a pandas DataFrame from pycdcs containing the submission metadata and XML. It then needs to return a dict of dicts for each root-level element, as given above.
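
As a rough sketch of the "dict of dicts per root-level element" idea, the recursive helper below converts one XML string into nested dictionaries using only the standard library. The name `element_to_dict` and the sample payload are assumptions for illustration; repeated tags (such as several `publication-authors`) are gathered into a list.

```python
import xml.etree.ElementTree as ET


def element_to_dict(element):
    """Recursively convert an element into nested dicts; leaves become strings."""
    children = list(element)
    if not children:
        return (element.text or "").strip()
    result = {}
    for child in children:
        value = element_to_dict(child)
        if child.tag in result:
            # A repeated tag: promote the stored value to a list and append.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result


# Hypothetical trimmed payload for demonstration.
sample = """<map>
  <publication-info>
    <publication-authors>O. Duncan</publication-authors>
    <publication-authors>R. Feynman</publication-authors>
    <publication-year>2023</publication-year>
  </publication-info>
</map>"""

parsed = element_to_dict(ET.fromstring(sample))
print(parsed)
```

A class wrapping a pycdcs query result could apply this to each entry of the `xml_content` column and key the results by submission ID.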

### Initial development points
 - interactive inspection of xml data
 - extract blobs - imgs + topologies
 - extract all individual measures (tensile modulus etc)
 - extract all continuous measure (stress strain data)
 - extract publication data
 - collate related submissions: if meta-material --> collect base-material PIDs
 - xml_parse currently only works with individual xml strings, so the user must construct their own for loop to process several submissions. Could be extended to have an 'averaging' or 'multiple-submission' mode.

This code should be used as a launch point for users to interact with submitted data within meta-genome.org. Specific functionality will need to be constructed ad hoc for each user's requirements. Due to the breadth of the field, development is ad hoc right now.

## Demo of current functionality

The following code boxes will demonstrate the functionality of xml_parse.

The xml_parse code depends on the cdcs Python module (pycdcs - https://github.com/usnistgov/pycdcs/tree/master/cdcs/CDCS) version 0.2.1+ to scrape data from portal.meta-genome.org.

This code works with the xml_content column of the pandas DataFrame. Please note, if significant changes are made to the meta-genome schema, this code may stop working.

NOTE: xml_parse methods are annotated.

### Implementation:

These functions require the CDCS class to be instantiated. To do this, see the code cell below:

In [2]:
# Import cdcs (pycdcs) module
from cdcs import CDCS
# import xml_parse functions
import xml_parse

# generate cdcs class - looking to global workspace does not require login creds
curator = CDCS('https://portal.meta-genome.org/', username='')

# Parse by Schema template name
template="mecha-metagenome-schema31"

# Also parse by mongo db query where:
# map.metamaterial-material-info and map.base-material-info exist.
# For example constructions of mongodb queries, first build a query using:
# portal.meta-genome.org/explore/example > select query fields > go to query builder > build query > save query > top right 'help' dropdown >
# API documentation > /explore/example/rest/saved/query/
# This will help demo how mongodb queries can be constructed

# Here are a few dummy examples
example_query_1 = "{\"$or\": [{\"map.metamaterial-material-info\": {\"$exists\": true}}, {\"map.base-material-info\": {\"$exists\": true}}]}"
example_query_2 = "{\"map.metamaterial-material-info\": {\"$exists\": true}}"
example_query_3 = "{\"$and\": [{\"$or\": [{\"map.base-material-info.bulk-density\": {\"$gt\": 0.0}}, {\"map.base-material-info.bulk-density.#text\": {\"$gt\": 0.0}}]}, {\"$or\": [{\"map.base-material-info.isotropic-choice.tensile-poissons-ratio-iso.tensile-poissons-ratio-val-iso\": {\"$gt\": 0.0}}, {\"map.base-material-info.isotropic-choice.tensile-poissons-ratio-iso.tensile-poissons-ratio-val-iso.#text\": {\"$gt\": 0.0}}]}]}"
example_query_4 = "{\"$or\": [{\"map.base-material-info.isotropic-choice.tensile-poissons-ratio-iso.tensile-poissons-ratio-val-iso\": {\"$lt\": 0.4}}, {\"map.base-material-info.isotropic-choice.tensile-poissons-ratio-iso.tensile-poissons-ratio-val-iso.#text\": {\"$lt\": 0.4}}]}"

query_dict = "{\"$or\": [{\"map.metamaterial-material-info\": {\"$exists\": true}}, {\"map.base-material-info\": {\"$exists\": true}}]}"
my_query= curator.query(template=template, mongoquery=query_dict)
my_query


100%|██████████| 20/20 [00:00<00:00, 14.74it/s]


| | id | template | workspace | user_id | title | xml_content | creation_date | last_modification_date | last_change_date | template_title |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 82 | 56 | 1 | 10 | StretchAux-LD60-OD-21.xml | `<map xmlns:xsi="http://www.w3.org/2001/XMLSche...` | 2023-03-22 16:49:05.644000+00:00 | 2023-03-29 16:24:40.669000+00:00 | 2023-03-29 16:24:40.451000+00:00 | mecha-metagenome-schema31 |
| 1 | 51 | 56 | 1 | 10 | Chiral_10_OD-23 - S.xml | `<map xmlns:xsi="http://www.w3.org/2001/XMLSche...` | 2023-03-17 11:51:08.320000+00:00 | 2023-03-29 16:00:07.651000+00:00 | 2023-03-29 16:00:07.563000+00:00 | mecha-metagenome-schema31 |
| 2 | 81 | 56 | 1 | 10 | PlastaZote-LD60-OD-21 | `<map xmlns:xsi="http://www.w3.org/2001/XMLSche...` | 2023-03-22 16:34:02.515000+00:00 | 2023-03-22 16:34:13.771000+00:00 | 2023-03-22 16:34:13.694000+00:00 | mecha-metagenome-schema31 |
| 3 | 80 | 56 | 1 | 10 | AuxBlock-CC-OD21.xml | `<map xmlns:xsi="http://www.w3.org/2001/XMLSche...` | 2023-03-22 16:10:57.747000+00:00 | 2023-03-22 16:11:17.295000+00:00 | 2023-03-22 16:11:17.195000+00:00 | mecha-metagenome-schema31 |
| 4 | 76 | 56 | 1 | 10 | Aux_VC5-OD-23.xml | `<map xmlns:xsi="http://www.w3.org/2001/XMLSche...` | 2023-03-22 14:42:57.495000+00:00 | 2023-03-22 15:28:05.182000+00:00 | 2023-03-22 15:28:04.851000+00:00 | mecha-metagenome-schema31 |
| 5 | 75 | 56 | 1 | 10 | PUR30FR-CustomFoam-OD-23.xml | `<map xmlns:xsi="http://www.w3.org/2001/XMLSche...` | 2023-03-22 13:54:30.781000+00:00 | 2023-03-22 13:54:43.484000+00:00 | 2023-03-22 13:54:42.912000+00:00 | mecha-metagenome-schema31 |
| 6 | 73 | 56 | 1 | 10 | Ninjaflex_Shepherd_Quasistatic.xml | `<map xmlns:xsi="http://www.w3.org/2001/XMLSche...` | 2023-03-17 15:59:45.019000+00:00 | 2023-03-22 10:58:33.687000+00:00 | 2023-03-22 10:58:33.356000+00:00 | mecha-metagenome-schema31 |
| 7 | 46 | 56 | 1 | 10 | AChiral_0_OD-23 -S.xml | `<map xmlns:xsi="http://www.w3.org/2001/XMLSche...` | 2023-03-16 16:19:48.834000+00:00 | 2023-03-21 10:01:41.309000+00:00 | 2023-03-21 10:01:41.230000+00:00 | mecha-metagenome-schema31 |
| 8 | 45 | 56 | 1 | 10 | AChiral_0_OD-23.xml | `<map xmlns:xsi="http://www.w3.org/2001/XMLSche...` | 2023-03-16 16:08:49.608000+00:00 | 2023-03-21 10:00:57.596000+00:00 | 2023-03-21 10:00:57.330000+00:00 | mecha-metagenome-schema31 |
| 9 | 50 | 56 | 1 | 10 | Chiral_10_OD-23.xml | `<map xmlns:xsi="http://www.w3.org/2001/XMLSche...` | 2023-03-17 11:35:55.038000+00:00 | 2023-03-21 09:57:40.169000+00:00 | 2023-03-21 09:57:40.007000+00:00 | mecha-metagenome-schema31 |
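
The hand-escaped query strings in the cell above can be awkward to edit. As a possible alternative, the sketch below builds the same kind of string from a Python dict with `json.dumps`. The helper `compare` and the variable names are hypothetical; the pattern of matching a field both directly and under `#text` is copied from the example queries above.

```python
import json


def compare(field, operator, value):
    """Match `field` whether the value is stored directly or under '#text'."""
    return {"$or": [{field: {operator: value}},
                    {field + ".#text": {operator: value}}]}


# Equivalent of query_dict: either info block exists.
exists_query = {
    "$or": [
        {"map.metamaterial-material-info": {"$exists": True}},
        {"map.base-material-info": {"$exists": True}},
    ]
}

# Equivalent of example_query_4: tensile Poisson's ratio below 0.4.
poisson_field = ("map.base-material-info.isotropic-choice."
                 "tensile-poissons-ratio-iso.tensile-poissons-ratio-val-iso")
poisson_query = compare(poisson_field, "$lt", 0.4)

# json.dumps writes Python True as JSON true and handles the quoting.
exists_string = json.dumps(exists_query)
poisson_string = json.dumps(poisson_query)
print(exists_string)
```

The resulting strings can be passed as the `mongoquery` argument in the same way as `query_dict`.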


## Functionality 1: **interactive_expansion()**

This function enables users to observe the elements within a chosen submission. Through recursive implementation of functions, sequential nested elements and types can be inspected. 

When this function is called, the user inspects the submission through interaction with the python kernel. When first activated, the base level elements will appear:

Available elements:
0: versioning
1: publication-info
2: stress-strain-convention-info
3: metamaterial-material-info
4: developer-section

The user can then input the associated value of the region they wish to inspect e.g. 1 --> publication-info:
Contents of publication-info:
0: id
1: publication-authors
2: publication-authors
3: publication-authors
4: publication-title
5: publication-year
6: publication-journal
7: publication-volume
8: publication-issue
9: publication-page
10: publication-url
11: Publication-submitter
12: Publication-submitter-email

This process continues until a terminal element is reached (where no child node exists - only a value); at which point, the value is given.

Please note, this function is best used in a true .ipynb environment. I have seen UI issues when it is run in IDEs.

Suggestions for development:
This function is quite rudimentary and does not handle invalid options (i.e. negative numbers and numbers greater than those available). This can be handled quite easily with try-except statements.
Also, at present, the code has no direct back-step feature (i.e. returning to the previous element/type), as '-1' returns to the base-level elements. This is due to the recursive structure. As a simple extension, the XPath can be constructed and retained; dropping its last segment, e.g. `'/'.join(xpath.split('/')[:-1])`, gives the path to the previous element.
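
The two suggestions above might look something like the sketch below. Both helpers (`parse_choice` and `parent_path`) are hypothetical names and are not part of xml_parse. The sketch assumes the retained path uses '/' as its separator and that '-1' is kept as the "go back" input.

```python
def parse_choice(raw, n_options):
    """Return a valid option index, -1 for 'go back', or None if the input is invalid."""
    try:
        choice = int(raw)
    except ValueError:
        # Non-numeric input such as 'abc' or an empty string.
        return None
    if choice == -1 or 0 <= choice < n_options:
        return choice
    return None


def parent_path(path):
    """Drop the last segment of a '/'-separated path (empty string at the root)."""
    return "/".join(path.split("/")[:-1])


print(parse_choice("3", 5))
print(parse_choice("7", 5))
print(parent_path("publication-info/publication-title"))
```

Inside the interactive loop, a `None` result would simply prompt the user again, and a `-1` would re-display the element at `parent_path(current_path)` rather than the base level.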


In [3]:
# First select the submission to be inspected using iloc 
xml_string = my_query.iloc[1].xml_content
# construct xml_control class
my_control = xml_parse.xml_control(my_query, xml_string)
# call interactive_expansion method
my_control.interactive_expansion()

# interaction with schema will appear in output cell
# NOTE: This method works better outside of an IDE


Available elements:
0: versioning
1: publication-info
2: stress-strain-convention-info
3: metamaterial-material-info
4: developer-section


## Functionality 2: **print_publication_details()**

This is a quick method to print the publication details associated with a submission.

It uses the xml.etree.ElementTree Python module.

Simple print statements using variables constructed from the publication-info element.

Development ideas:
 - print PID in addition to publication

In [4]:
# First select the submission to be inspected using iloc 
xml_string = my_query.iloc[1].xml_content
# construct xml_control class
my_control = xml_parse.xml_control(my_query, xml_string)
# call print_publication_details method
my_control.print_publication_details()



Publication Details:

Title: Effect of twist on indentation resistance
Authors:
O. Duncan, M. Chester, W. Wang, A. Alderson, T. Allen
Journal: Materials Today Communications
Volume: 35
Issue: 1
Page: 105616
Year: 2023
DOI: https://doi.org/10.1016/j.mtcomm.2023.105616
URL: https://www.sciencedirect.com/science/article/pii/S2352492823003069
Submitter: Oliver Duncan
Submitter Email: o.duncan@mmu.ac.uk


## Functionality 3: **inspect_xml()**

Method to extract the single measures (tensile modulus, Poisson's ratio etc.) into a single dictionary, plus a printout.

This code looks exclusively for measurements within the directional-selectivity element. This means certain features are missed.

Extension ideas:
Easy:
 - Include bulk-density and units within the search.
 - Implement try/except for component test errors

Harder:
 - Generate parse for **component** test data - may require new method
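
To illustrate the "easy" extensions, the sketch below gathers every numeric leaf under a material block, which picks up bulk-density alongside the directional measures, and uses try/except to skip non-numeric entries. This is an assumption-laden sketch: `collect_numeric_leaves` is a hypothetical helper, the sample values are made up, and the element names are taken from the example MongoDB query paths earlier in this notebook.

```python
import xml.etree.ElementTree as ET

# Hypothetical trimmed payload; the numbers are invented for the demo.
SAMPLE_XML = """<map>
  <base-material-info>
    <bulk-density>60</bulk-density>
    <isotropic-choice>
      <tensile-poissons-ratio-iso>
        <tensile-poissons-ratio-val-iso>0.3</tensile-poissons-ratio-val-iso>
      </tensile-poissons-ratio-iso>
    </isotropic-choice>
  </base-material-info>
</map>"""


def collect_numeric_leaves(element):
    """Return {tag: float} for every childless descendant holding a number."""
    measures = {}
    for node in element.iter():
        if len(node) == 0 and node.text:
            try:
                measures[node.tag] = float(node.text)
            except ValueError:
                # Non-numeric leaves (names, units, notes) are skipped here.
                continue
    return measures


root = ET.fromstring(SAMPLE_XML)
measures = collect_numeric_leaves(root.find("base-material-info"))
print(measures)
```

Units would need separate handling, since they are text rather than numbers.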

In [3]:
import xml_parse
# First select the submission to be inspected using iloc 
xml_string = my_query.iloc[1].xml_content
# construct xml_control class
my_control = xml_parse.xml_control(my_query, xml_string)
# call inspect_xml method
my_values = my_control.inspect_xml()



		*** DATA READOUT ***

Submission Type: metamaterial-material-info
Directional Sensitivity: orthotropic
Measurement Results:
	 tensile modulus :
		 xx tensile modulus : 0.23
		 yy tensile modulus : 12.22
		 zz tensile modulus : 0.41
		 tensile modulus units : MPa
	 compressive modulus :
		 xx compressive modulus : 0.23
		 yy compressive modulus : 2.10
		 zz compressive modulus : 0.41
		 compressive modulus units : MPa
	 tensile poissons ratio :
		 xy tensile poissons : 0.14
		 yx tensile poissons : 0.14
		 zy tensile poissons : 0.02
		 yz tensile poissons : -0.32
		 zx tensile poissons : 0.02
		 xz tensile poissons : -0.02
	 compressive poissons ratio :
		 xy compressive poissons : 0
		 yx compressive poissons : 0
		 zy compressive poissons : 0.02
		 yz compressive poissons : 0.04
		 zx compressive poissons : 0
		 xz compressive poissons : 0.02


## Functionality 4: **get_topologies()**

Demo showing how to upload bulk submissions.
Demo getting meta-genome db calls to slice for the STEP file extension.

Development ideas:
 - Enable extraction of topologies stored in all locations. Since this method was created, new sites for topology storage have been introduced into the schema.
 - Investigate 3D rendering of the topology files for observation. I am aware of Fresnel for 3D rendering. Lukasz also implements this in his Jupyter notebooks.
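
Slicing for the STEP file extension, as mentioned above, could be as simple as filtering file names by suffix. The sketch below assumes a plain list of file names is already available. The helper name `select_step_files`, the accepted suffixes and the sample names are hypothetical, and how names are obtained from the database is not shown.

```python
from pathlib import PurePosixPath

# Common STEP suffixes (an assumption; adjust to whatever the portal stores).
STEP_SUFFIXES = {".step", ".stp"}


def select_step_files(filenames):
    """Keep only names whose extension marks a STEP file (case-insensitive)."""
    return [name for name in filenames
            if PurePosixPath(name).suffix.lower() in STEP_SUFFIXES]


names = ["unit_cell.STEP", "photo.png", "lattice.stp", "notes.txt"]
step_files = select_step_files(names)
print(step_files)
```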

In [5]:
xml_string = my_query.iloc[0].xml_content
my_control = xml_parse.xml_control(my_query, xml_string)
my_control.get_topologies()

{'unit-cell-topologies': [None]}

## Functionality 5: **get_base_stress_strain()**

This method is quite outdated, as it was the first method I created for this class. More work is required to make it usable.

This method captures the stress-strain curves for an isotropic system and stores them in a dict of dicts.

Points for development:
 - Collect data points for all system types (e.g. transversely isotropic, orthotropic etc.)
 - Account for component data types


In [17]:
xml_string = my_query.iloc[5].xml_content

my_control = xml_parse.xml_control(my_query, xml_string)

stress_strain = my_control.get_base_stress_strain()


{'base_material_0': {}}


In [None]:
import xml_parse
xml_string = my_query.iloc[0].xml_content
my_control = xml_parse.xml_control(my_query, xml_string)
myvar =my_control.inspect_xml()
print(myvar)


2


## Implementing mass push - pycdcs

The following code demonstrates how to upload multiple submissions to portal.meta-genome.org.

In ./Resources, there are 5 similar but unique xml forms. These will be used for demonstration.

This code requires the user to supply their own login details when creating the CDCS object (curator).

The code is annotated inline.

In [19]:
# Import cdcs (pycdcs) module
from cdcs import CDCS
# import xml_parse functions
import xml_parse
# import os for file handling
import os

# generate cdcs class - uploading records requires login creds
curator = CDCS('https://portal.meta-genome.org/', username='', password="")

# Schema template name to upload against
template="mecha-metagenome-schema31"

# Get the path of the folder containing xml documents
# NOTE: all xml records must have blank PID values - will be assigned on upload
folder_path = "./Resources"

# iterating through the files in folder path
for file in os.listdir(folder_path):
    file_path = os.path.join(folder_path, file)
    # Check file is not dir
    if os.path.isfile(file_path):
        # upload the record, using the file name as the record title
        curator.upload_record(template=template, title=file, filename=file_path, verbose=True)




record Chiral_sample_submission1.xml (91) successfully uploaded.
record Chiral_sample_submission2.xml (92) successfully uploaded.
record Chiral_sample_submission3.xml (93) successfully uploaded.
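
For larger batches it may help to keep going when a single upload fails and to report the failures at the end. The sketch below is one possible shape for that, not part of pycdcs or xml_parse: `upload_folder` is a hypothetical helper that receives the upload call as a function, and the demo uses a fake uploader and a temporary folder so it runs without a portal connection.

```python
import tempfile
from pathlib import Path


def upload_folder(folder, upload, pattern="*.xml"):
    """Call `upload(title=..., filename=...)` for each matching file.

    Returns (uploaded, failed), where failed holds (file name, error message) pairs.
    """
    uploaded, failed = [], []
    for path in sorted(Path(folder).glob(pattern)):
        if not path.is_file():
            continue
        try:
            upload(title=path.name, filename=str(path))
        except Exception as error:
            # Record the failure and carry on with the remaining files.
            failed.append((path.name, str(error)))
        else:
            uploaded.append(path.name)
    return uploaded, failed


def fake_upload(title, filename):
    """Stand-in for the real upload call; rejects one file for the demo."""
    if title == "b.xml":
        raise RuntimeError("rejected")


with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.xml", "b.xml", "notes.txt"):
        Path(tmp, name).write_text("<map/>")
    uploaded, failed = upload_folder(tmp, fake_upload)

print(uploaded, failed)
```

With the objects from the cell above, the real call could be wrapped as `lambda title, filename: curator.upload_record(template=template, title=title, filename=filename, verbose=True)` and passed in place of `fake_upload`.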


## Concluding remarks:

These functions are simple implementations of xml parsing methods.

As mentioned, development of this class is quite open-ended and can be unique to each group using the meta-genome.

From here, I suggest building robust and general-purpose methods for groups to use.