# Meta-genome xml parse -- Version 1

This Jupyter notebook is designed to work with the data stored at portal.meta-genome.org via pycdcs. The repo contains a module (xml_parse) that wraps pycdcs to parse the XML payloads of multiple submissions in a Pythonic way, e.g. storing the iterations of stress-strain data from a single submission in a single pandas DataFrame (assuming the same units and interval).

Our schema is large, so there are many avenues of data scraping we can take. Here is a list of targeted data parsing:
 - Collate reference information into a dictionary, i.e.:\
   {"submission ID" : {"authors": ["O. Duncan", "R. Feynman"],\
                      "publication title": "pub title here", ...}}
 - Collate metamaterial general information into a dictionary, i.e.:\
    {"submission ID" : {"metamaterial family": "Foam",\
                        "Unusual material property": "negative Poissons ratio",\
                        "Strain convention" : "True",\
                        "Stress convention" : "True"}}
 - Collate single component measures and units for base materials, i.e.:\
   {"submission ID" : {"base material ID" : {"Material name" : "Nylon 12",\
                                            "Material classification" : "Metallic", ...\
                                            "Directional Sensitivity" : "Isotropic"}, ...}}
 - Collate single component measures and units for the metamaterial, i.e.:\
   {"submission ID" : {"metamaterial 1" : {"Material name" : "Nylon 12",\
                                            "Material classification" : "Metallic", ...\
                                            "Directional Sensitivity" : "Isotropic"}, ...}}
 - From a list of submissions available to the user, generate a dict containing a pandas DataFrame for each continuous data curve, i.e.:\
 {"submission1-ID" : {"metamaterial1" : [pandas.df.stress-strain1, pandas.df.stress-strain2, pandas.df.stress-strain3...],\
                      "metamaterial2" : [pandas.df.stress-strain1, pandas.df.stress-strain2, pandas.df.stress-strain3...]}}
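
The first target (reference information keyed by submission ID) could look roughly like the sketch below, which uses only the standard library. The element names `publication-info`, `publication-authors` and `publication-title` are taken from the element listing later in this notebook; the sample payloads, the submission IDs and the helper `collate_references` are hypothetical and are not part of xml_parse.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed payloads; real submissions carry many more elements.
SAMPLE_SUBMISSIONS = {
    "submission-1": """
    <map>
      <publication-info>
        <publication-authors>O. Duncan</publication-authors>
        <publication-authors>R. Feynman</publication-authors>
        <publication-title>pub title here</publication-title>
      </publication-info>
    </map>""",
    "submission-2": (
        "<map><publication-info>"
        "<publication-title>another title</publication-title>"
        "</publication-info></map>"
    ),
}


def collate_references(submissions):
    """Return {submission ID: {"authors": [...], "publication title": str or None}}."""
    collated = {}
    for submission_id, xml_string in submissions.items():
        root = ET.fromstring(xml_string.strip())
        info = root.find("publication-info")
        if info is None:
            # Skip submissions that have no publication section
            continue
        collated[submission_id] = {
            # publication-authors repeats once per author
            "authors": [author.text for author in info.findall("publication-authors")],
            "publication title": info.findtext("publication-title"),
        }
    return collated


references = collate_references(SAMPLE_SUBMISSIONS)
print(references["submission-1"])
```

With real data, the input dictionary could be filled from the `id` and `xml_content` columns of the DataFrame returned by pycdcs.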

Here is a user's workspace submission hierarchy for continuous stress-strain data (the most complex case):
```
  workspace
  |───submission-1
  |   |───base material properties
  |   |   |───Directional Sensitivity (ISOTROPIC IS UNIQUE FROM TRANS AND ORTHO)
  |   |   |   |───Stress-strain data
  |   |   |   |   |  
  |   |   |   |   |───data-block-1
```

This means we need a 5-stage iteration for the continuous data. For all records, we will need a final dictionary that looks like:\
\
{"Workspace1": \
    {"submission ID": \
        {"base material properties ID" : \
            {"base material properties ID" : \
                {"Directional Sensitivity TYPE ID": \
                    {"Stress-strain datablock ID" : PANDAS.DF}}}}}}
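
A nesting like the one above can be built incrementally with `dict.setdefault`. This is a minimal sketch, assuming each parsed data block arrives as a flat record of keys plus data; the record layout, the IDs, the numbers and the helper `nest_data_blocks` are all hypothetical. The leaf is a plain list of (strain, stress) pairs standing in for the pandas DataFrame, so the example runs without third-party packages.

```python
def nest_data_blocks(records):
    """Group flat (key path, data) records into a nested dictionary."""
    nested = {}
    for key_path, data in records:
        level = nested
        # Walk down, creating each intermediate level on first use
        for key in key_path[:-1]:
            level = level.setdefault(key, {})
        # The last key holds the data block itself
        level[key_path[-1]] = data
    return nested


# Hypothetical records:
# (workspace, submission, base material, directional sensitivity, data block)
records = [
    (("Workspace1", "submission-1", "base-material-1", "isotropic", "data-block-1"),
     [(0.00, 0.0), (0.01, 2.1)]),
    (("Workspace1", "submission-1", "base-material-1", "isotropic", "data-block-2"),
     [(0.00, 0.0), (0.01, 1.9)]),
]

nested = nest_data_blocks(records)
blocks = nested["Workspace1"]["submission-1"]["base-material-1"]["isotropic"]
print(list(blocks))
```

Each leaf could then be wrapped as `pandas.DataFrame(data, columns=["strain", "stress"])` once the units and interval have been checked.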

Core functionality will first need to be established, i.e. getting the data into the desired format; then I can build up to parsing through all requested data sets.

This iteration of the meta-genome-cdcs module and Jupyter notebook will primarily be concerned with data formatting.

Current thinking:
Produce a class that takes a pandas DataFrame from pycdcs containing the submission metadata and XML. It then needs to return a dict of dicts for each root level element, as given above.

### Points for development
 - interactive inspection of xml data
 - extract blobs - imgs + topologies
 - extract all individual measures (tensile modulus etc)
 - extract all continuous measure (stress strain data)
 - extract publication data
 - collate related submissions: if meta-material --> collect base-material PIDs
 - xml_parse currently only works with individual xml strings, so the onus is on the user to construct a for loop over submissions. Could be extended to have an 'averaging' or 'multiple-submission' mode.

## Demo of current functionality

The following code boxes will demonstrate the functionality of xml_parse.

The xml_parse code depends on the cdcs Python module (pycdcs - https://github.com/usnistgov/pycdcs/tree/master/cdcs/CDCS), version 0.2.1+, to scrape data from portal.meta-genome.org.

This code works with the xml_content column of the pandas DataFrame returned by pycdcs. Please note, if significant changes are made to the meta-genome schema, this code will become out of date.

### Implementation:

These functions require the CDCS class to be initiated. To do this, observe the code cell below:

In [7]:
# Import cdcs (pycdcs) module
from cdcs import CDCS
# import xml_parse functions
import xml_parse

# generate cdcs class - looking to global workspace does not require login creds
curator = CDCS('https://portal.meta-genome.org/', username='')

# Select records by schema template name
template = "mecha-metagenome-schema31"

# Records can also be filtered with a MongoDB query. The query below matches records where
# map.metamaterial-material-info or map.base-material-info exists.
# For example constructions of MongoDB queries, first build a query using:
# portal.meta-genome.org/explore/example > select query fields > go to query builder > build query > save query > top right 'help' dropdown >
# API documentation > /explore/example/rest/saved/query/
# This will help demo how MongoDB queries can be constructed


query_dict = "{\"$or\": [{\"map.metamaterial-material-info\": {\"$exists\": true}}, {\"map.base-material-info\": {\"$exists\": true}}]}"
my_query = curator.query(template=template, mongoquery=query_dict)



100%|██████████| 21/21 [00:00<00:00, 15.83it/s]
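
The hand-escaped query string in the cell above can also be generated from a Python dictionary with `json.dumps`, which takes care of the quoting. This is a minimal sketch that only builds the string; it assumes, as in the cell above, that the result is then passed to `curator.query` through `mongoquery`.

```python
import json

# Same query as in the cell above, written as a Python dictionary
query = {
    "$or": [
        {"map.metamaterial-material-info": {"$exists": True}},
        {"map.base-material-info": {"$exists": True}},
    ]
}

# json.dumps converts Python's True to JSON's true and handles all quoting
query_string = json.dumps(query)
print(query_string)
```

Writing the query as a dictionary also makes it easier to build variations programmatically, for example by appending further conditions to the `$or` list.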


## Functionality 1: **interactive_expansion()**

This function enables users to observe the elements within a chosen submission. Through recursive implementation of functions, sequential nested elements and types can be inspected. 

When this function is called, the user inspects the submission through interaction with the python kernel. When first activated, the base level elements will appear:

Available elements:
0: versioning
1: publication-info
2: stress-strain-convention-info
3: metamaterial-material-info
4: developer-section

The user can then input the associated value of the region they wish to inspect e.g. 1 --> publication-info:
Contents of publication-info:
0: id
1: publication-authors
2: publication-authors
3: publication-authors
4: publication-title
5: publication-year
6: publication-journal
7: publication-volume
8: publication-issue
9: publication-page
10: publication-url
11: Publication-submitter
12: Publication-submitter-email

This process continues until a terminal element is reached (where no child node exists - only a value); at which point, the value is given.
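
The recursion described here can be sketched without keyboard input by passing the sequence of choices up front. The sketch below uses only the standard library; `expand`, the sample XML and its values are hypothetical stand-ins and are not the xml_parse implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed payload
SAMPLE_XML = """
<map>
  <versioning>1</versioning>
  <publication-info>
    <publication-title>pub title here</publication-title>
    <publication-year>2023</publication-year>
  </publication-info>
</map>"""


def expand(element, choices):
    """Follow a list of child indices and return the terminal value reached.

    Lists the children at each level, as the interactive version would,
    and recurses until an element with no children is found.
    """
    children = list(element)
    if not children:
        # Terminal element: no child nodes, only a value
        return element.text
    for index, child in enumerate(children):
        print(f"{index}: {child.tag}")
    if not choices:
        # Ran out of choices before reaching a terminal element
        return None
    return expand(children[choices[0]], choices[1:])


root = ET.fromstring(SAMPLE_XML.strip())
value = expand(root, [1, 0])  # 1 -> publication-info, 0 -> publication-title
print(value)
```

An interactive version would replace the `choices` list with a call to `input()` at each level.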

Please note, this function is best used in a true .ipynb environment. I have seen UI issues when it is run in IDEs.

Suggestions for development:
This function is quite rudimentary and does not account for unavailable options (i.e. negative numbers and those greater than the number of available elements). This can be handled quite easily with try-except statements.
Also, the code does not currently have a direct back-step feature (i.e. returning to the previous element/type): '-1' returns to the base level elements. This is due to the recursive structure. As a simple extension, the XPath of the current element can be constructed and retained; dropping its last segment, e.g. '/'.join(xpath.split('/')[:-1]), gives the path to the previous element.
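
Both suggestions can be sketched together: validate the input before descending, and keep the visited elements in a list so that a single step back is possible. This is a minimal sketch under those assumptions; the `Navigator` class and the sample XML are hypothetical and are not part of xml_parse.

```python
import xml.etree.ElementTree as ET


class Navigator:
    """Track the path through an XML tree so a single step back is possible."""

    def __init__(self, root):
        self.path = [root]  # elements visited so far, root first

    @property
    def current(self):
        return self.path[-1]

    def descend(self, raw_choice):
        """Move to the chosen child; return False if the input is invalid."""
        children = list(self.current)
        try:
            index = int(raw_choice)
        except ValueError:
            # Input was not a whole number
            return False
        # Reject negative numbers and indices beyond the available options
        if not 0 <= index < len(children):
            return False
        self.path.append(children[index])
        return True

    def back(self):
        """Return to the previous element, staying at the root if already there."""
        if len(self.path) > 1:
            self.path.pop()

    def location(self):
        """Slash-separated path of tags from the root to the current element."""
        return "/".join(element.tag for element in self.path)


root = ET.fromstring(
    "<map><publication-info><publication-title>t</publication-title>"
    "</publication-info></map>"
)
nav = Navigator(root)
print(nav.descend("5"))  # out of range, so the position is unchanged
print(nav.descend("0"))  # moves to publication-info
nav.descend("0")         # moves to publication-title
nav.back()               # one step back, not all the way to the root
print(nav.location())
```

Keeping the elements themselves in the list avoids re-resolving a path string on every step back, while `location()` still provides a readable path for display.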


In [9]:
import xml_parse

# First select the submission to be inspected using iloc
xml_string = my_query.iloc[1].xml_content
# Construct the xml_control class; it takes the query DataFrame followed by the xml string
my_control = xml_parse.xml_control(my_query, xml_string)
# Call the interactive_expansion method
my_control.interactive_expansion()

# Interaction with the schema will appear in the output cell


The host URL and all login access parameters (username, password, authentication, etc.) are defined when creating a CDCS object.  Setting username as an empty string will access the site as an anonymous user, i.e. someone not signed in.

In [2]:

# Anonymous access: an empty username means no login is required
curator = CDCS('https://portal.meta-genome.org/', username='')

template = "mecha-metagenome-schema31"

# Example query (not used below): isotropic tensile Poisson's ratio below 0.4
query_string = "{\"$or\": [{\"map.base-material-info.isotropic-choice.tensile-poissons-ratio-iso.tensile-poissons-ratio-val-iso\": {\"$lt\": 0.4}}, {\"map.base-material-info.isotropic-choice.tensile-poissons-ratio-iso.tensile-poissons-ratio-val-iso.#text\": {\"$lt\": 0.4}}]}"

# Example query (not used below): bulk density and isotropic tensile Poisson's ratio both above 0
query_dict1 = "{\"$and\": [{\"$or\": [{\"map.base-material-info.bulk-density\": {\"$gt\": 0.0}}, {\"map.base-material-info.bulk-density.#text\": {\"$gt\": 0.0}}]}, {\"$or\": [{\"map.base-material-info.isotropic-choice.tensile-poissons-ratio-iso.tensile-poissons-ratio-val-iso\": {\"$gt\": 0.0}}, {\"map.base-material-info.isotropic-choice.tensile-poissons-ratio-iso.tensile-poissons-ratio-val-iso.#text\": {\"$gt\": 0.0}}]}]}"

# Query used below: all records that contain a metamaterial-material-info element
query_dict = "{\"map.metamaterial-material-info\": {\"$exists\": true}}"
my_query = curator.query(template=template, mongoquery=query_dict)
my_query


100%|██████████| 16/16 [00:00<00:00, 23.67it/s]


| | id | template | workspace | user_id | title | xml_content | creation_date | last_modification_date | last_change_date | template_title |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 89 | 56 | 1 | 10 | Test1 | `<map xmlns:xsi="http://www.w3.org/2001/XMLSche...` | 2023-03-24 12:04:18.035000+00:00 | 2023-03-24 12:10:03.080000+00:00 | 2023-03-24 12:10:02.982000+00:00 | mecha-metagenome-schema31 |
| 1 | 82 | 56 | 1 | 10 | StretchAux-LD60-OD-21.xml | `<map xmlns:xsi="http://www.w3.org/2001/XMLSche...` | 2023-03-22 16:49:05.644000+00:00 | 2023-03-24 09:22:54.129000+00:00 | 2023-03-24 09:22:54.040000+00:00 | mecha-metagenome-schema31 |
| 2 | 80 | 56 | 1 | 10 | AuxBlock-CC-OD21.xml | `<map xmlns:xsi="http://www.w3.org/2001/XMLSche...` | 2023-03-22 16:10:57.747000+00:00 | 2023-03-22 16:11:17.295000+00:00 | 2023-03-22 16:11:17.195000+00:00 | mecha-metagenome-schema31 |
| 3 | 76 | 56 | 1 | 10 | Aux_VC5-OD-23.xml | `<map xmlns:xsi="http://www.w3.org/2001/XMLSche...` | 2023-03-22 14:42:57.495000+00:00 | 2023-03-22 15:28:05.182000+00:00 | 2023-03-22 15:28:04.851000+00:00 | mecha-metagenome-schema31 |
| 4 | 46 | 56 | 1 | 10 | AChiral_0_OD-23 -S.xml | `<map xmlns:xsi="http://www.w3.org/2001/XMLSche...` | 2023-03-16 16:19:48.834000+00:00 | 2023-03-21 10:01:41.309000+00:00 | 2023-03-21 10:01:41.230000+00:00 | mecha-metagenome-schema31 |
| 5 | 45 | 56 | 1 | 10 | AChiral_0_OD-23.xml | `<map xmlns:xsi="http://www.w3.org/2001/XMLSche...` | 2023-03-16 16:08:49.608000+00:00 | 2023-03-21 10:00:57.596000+00:00 | 2023-03-21 10:00:57.330000+00:00 | mecha-metagenome-schema31 |
| 6 | 50 | 56 | 1 | 10 | Chiral_10_OD-23.xml | `<map xmlns:xsi="http://www.w3.org/2001/XMLSche...` | 2023-03-17 11:35:55.038000+00:00 | 2023-03-21 09:57:40.169000+00:00 | 2023-03-21 09:57:40.007000+00:00 | mecha-metagenome-schema31 |
| 7 | 65 | 56 | 1 | 10 | Chiral_30_OD-23.xml | `<map xmlns:xsi="http://www.w3.org/2001/XMLSche...` | 2023-03-17 14:19:15.949000+00:00 | 2023-03-21 09:55:11.226000+00:00 | 2023-03-21 09:55:11.013000+00:00 | mecha-metagenome-schema31 |
| 8 | 66 | 56 | 1 | 10 | Chiral_30_OD-23 - S.xml | `<map xmlns:xsi="http://www.w3.org/2001/XMLSche...` | 2023-03-17 14:25:43.164000+00:00 | 2023-03-21 09:54:22.510000+00:00 | 2023-03-21 09:54:22.433000+00:00 | mecha-metagenome-schema31 |
| 9 | 51 | 56 | 1 | 10 | Chiral_10_OD-23 - S.xml | `<map xmlns:xsi="http://www.w3.org/2001/XMLSche...` | 2023-03-17 11:51:08.320000+00:00 | 2023-03-21 09:53:18.768000+00:00 | 2023-03-21 09:53:18.677000+00:00 | mecha-metagenome-schema31 |


In [3]:
xml_string = my_query.iloc[0].xml_content
my_control = xml_parse.xml_control(my_query, xml_string)

In [None]:
import xml_parse
xml_string = my_query.iloc[0].xml_content
my_control = xml_parse.xml_control(my_query, xml_string)
myvar = my_control.inspect_xml()
print(myvar)


2


In [5]:
xml_string = my_query.iloc[8].xml_content
my_control = xml_parse.xml_control(my_query, xml_string)
# print_publication_details() prints the details itself and returns None,
# so its result does not need to be captured or printed
my_control.print_publication_details()


Publication Details:

Title: Effect of twist on indentation resistance
Authors:
O. Duncan, M. Chester, W. Wang, A. Alderson, T. Allen
Journal: Materials Today Communications
Volume: 35
Issue: 1
Page: 105616
Year: 2023
DOI: https://doi.org/10.1016/j.mtcomm.2023.105616
URL: https://www.sciencedirect.com/science/article/pii/S2352492823003069
Submitter: Oliver Duncan
Submitter Email: o.duncan@mmu.ac.uk


REST calls can be made using the head, get, post, put, patch, and delete methods of the CDCS object.  Each method is named for the type of HTTP request to perform.  

Only the relative REST URL and any params and/or data associated with the request need to be given.  The host's URL will automatically be appended as a prefix to the REST URL, and the access parameters given when initializing the CDCS object will automatically be sent for each REST call. 

The REST call returns a requests.Response object allowing for checks of the status code as well as automatically transforming the data to str, bytes, or json contents. 