feat: addition of metadata section to the yaml file specification in RAVEN #311

Closed
2 tasks done
haowang-bioinfo opened this issue Jul 27, 2020 · 17 comments
Labels
enhancement Possible enhancement that should be considered for future versions. feature A new function or new functionality for an existing function
Comments

@haowang-bioinfo
Member

haowang-bioinfo commented Jul 27, 2020

Description of the issue:

  • This issue proposes to include a metadata section in the yaml file specification in RAVEN
    • Previously, a metadata section was introduced to the tailored yaml file in Human-GEM to serve the requirements of MetabolicAtlas, as detailed in issue #71. After continued development, this section works well in providing relevant information for GEM-type repos (e.g. Human-GEM), the GEM archive MetabolicAtlas, and the research community.

Expected changes:

  • Adjust the RAVEN model spec with the following changes:
    • Add new field version
    • Change field from description to fullName
    • Modify subfields of annotation field
      • adding subfield sourceUrl
      • combining givenName and familyName into authors
      • changing subfield from note to description
  • Adapt the writeYaml function to enable exporting of metadata information from the fields id, name (originally proposed as fullName), version and annotation
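
As an illustration only, the field changes in the list above can be sketched on a plain Python dict standing in for the RAVEN model structure. This is not RAVEN code: the function name `migrate_model_fields` is hypothetical, and the sketch follows the list as written, while later comments in this thread revise some of the names.

```python
from copy import deepcopy


def migrate_model_fields(model: dict) -> dict:
    """Return a copy of `model` with the field changes listed above applied.

    `model` is a plain dict used as a stand-in for the MATLAB struct (assumption).
    """
    new = deepcopy(model)

    # Add the new `version` field (left empty when unknown).
    new.setdefault("version", "")

    # Rename the top-level field `description` to `fullName`.
    if "description" in new:
        new["fullName"] = new.pop("description")

    ann = new.setdefault("annotation", {})

    # Add the new subfield `sourceUrl`.
    ann.setdefault("sourceUrl", "")

    # Combine `givenName` and `familyName` into a single `authors` string.
    given = ann.pop("givenName", "")
    family = ann.pop("familyName", "")
    full = " ".join(part for part in (given, family) if part)
    if full:
        ann["authors"] = full

    # Rename the subfield `note` to `description`.
    if "note" in ann:
        ann["description"] = ann.pop("note")

    return new
```

The input is copied rather than modified in place, so a model that was already loaded keeps its old field names until the caller chooses to replace it.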

I hereby confirm that I have:

@haowang-bioinfo haowang-bioinfo added discussion Not yet settled whether change in code is required. enhancement Possible enhancement that should be considered for future versions. feature A new function or new functionality for an existing function labels Jul 27, 2020
@BenjaSanchez
Contributor

For additional context, below is the current metaData field in Human-GEM:

- metaData:
    short_name: "Human-GEM"
    full_name: "Generic genome-scale metabolic model of Homo sapiens"
    version: "1.4.0"
    date: "2020-06-12"
    authors: "Jonathan Robinson, Hao Wang, Pierre-Etienne Cholley, Pinar Kocabas"
    email: "jonrob@chalmers.se"
    organization: "Chalmers University of Technology"
    taxonomy: "9606"
    github: "https://github.com/SysBioChalmers/Human-GEM"
    description: "Genome-scale metabolic models are valuable tools to study metabolism and provide a scaffold for the integrative analysis of omics data. This is the latest version of Human-GEM, which is a genome-scale metabolic model of a generic human cell. The objective of Human-GEM is to serve as a community model for enabling integrative and mechanistic studies of human metabolism."

The new fields + modifications sound good to me. Additionally, it would be ideal if the field names in the yaml file matched the RAVEN spec names, for clarity. Below are the cases that don't match, based on what is already in RAVEN + the new names @Hao-Chalmers proposed:

| Field | Name in RAVEN | Name in HumanGEM.yml |
|---|---|---|
| Model id | id | short_name |
| Model name | description | full_name |
| Authors | annotation.authorList | authors |
| URL where the model lives | annotation.sourceUrl | github |
| Additional comments | annotation.note | description |
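
To make the mismatch concrete, here is a minimal sketch that places Human-GEM yml metadata values at the RAVEN field paths from the table above. The mapping is copied from the table; the function name `yml_metadata_to_raven`, the use of nested dicts for the model structure, and the choice to keep unmapped keys at the top level are assumptions made for illustration.

```python
# Human-GEM yml key -> RAVEN field path, copied from the table above.
YML_TO_RAVEN = {
    "short_name": "id",
    "full_name": "description",
    "authors": "annotation.authorList",
    "github": "annotation.sourceUrl",
    "description": "annotation.note",
}


def yml_metadata_to_raven(meta: dict) -> dict:
    """Place each known yml value at its RAVEN field path.

    Keys without an entry in the table are kept at the top level
    (an assumption made only to keep the sketch small).
    """
    model: dict = {}
    for key, value in meta.items():
        path = YML_TO_RAVEN.get(key, key).split(".")
        target = model
        # Walk or create the nested dicts for dotted paths such as annotation.note.
        for part in path[:-1]:
            target = target.setdefault(part, {})
        target[path[-1]] = value
    return model


if __name__ == "__main__":
    meta = {
        "short_name": "Human-GEM",
        "version": "1.4.0",
        "github": "https://github.com/SysBioChalmers/Human-GEM",
    }
    print(yml_metadata_to_raven(meta))
```

Note that the yml key description and the RAVEN field description refer to different things, which is the source of confusion discussed in this comment.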

IMO the RAVEN names for id and URL would be preferable, as the former is the main choice in the COBRA community (Matlab and Python), and the latter is more generic, as not all RAVEN models are stored in Github. Could those 2 fields change in HumanGEM.yml to id and source_url? @JonathanRob @mihai-sysbio

On the other hand, the .yml standard seems more adequate for model name, authors and comments (actually it's super confusing that the RAVEN field description is the model name and the field note contains a description). Would it make sense to change those 3 fields in RAVEN to fullName, annotation.authors and annotation.description?

@edkerk
Member

edkerk commented Jul 28, 2020

Are there corresponding (or comparable) COBRA fields for fullName, annotation.authors and annotation.description?

@mihai-sysbio
Member

mihai-sysbio commented Jul 29, 2020

Here are the latest yml fields listed on COBRApy's devel branch. Imho, it doesn't look like a direct mapping of the RAVEN fields.
Cobratoolbox has some rules for modelVersion, modelName and modelID.

The short-name is something meant to be as human-friendly as possible. For example, this field is what is shown in the navigation bar on Metabolic Atlas. I found this opencobra thread illustrative of the implications of the BiGG model id spec. Also, I would like to point out the distinct fields for short-name and version. To me, it is of little importance what the keyword for the value of short-name is. However, I am an advocate for its role: human-friendliness. Therefore, I would lean towards keeping this field closer to the standard-GEM naming rather than the BiGG id spec. Needless to say, in the case of versioned models, it is expected of this short-name to be the same as the repository name.

I support changing github to something else. A potential drawback of source_url is that, as a new person, I could find it confusing whether it is meant to be the link to the repository, or directly to the file itself on a model hosting platform. But maybe that's just me - and I can't come up with a better suggestion than source_url.

@haowang-bioinfo
Member Author

@BenjaSanchez the Expected changes of this issue have been updated as you recommended.

@haowang-bioinfo
Member Author

haowang-bioinfo commented Aug 4, 2020

@edkerk according to the latest model spec in COBRA, the following four fields could be associated between RAVEN and COBRA.

| Field | Name in RAVEN | Name in COBRA |
|---|---|---|
| Model id | id | modelID |
| Model name | name | modelName |
| Model version | version | modelVersion |
| Additional comments | annotation.note | description |

@mihai-sysbio
Member

@Hao-Chalmers would the Expected changes also include something about the shortName field?

@haowang-bioinfo
Member Author

haowang-bioinfo commented Aug 10, 2020

@mihai-sysbio I don't think an additional shortName field is needed, since it is equivalent to the existing id field. Or are you suggesting renaming the field from id to shortName?

@mihai-sysbio
Member

mihai-sysbio commented Aug 11, 2020

I see. To me, an ID does not have to be human friendly, unlike shortName. I think it would be clearer if some examples would be provided, maybe even both "good" and "bad". For example, a "bad" id would be h_sap13417__1_3_0, standing for Homo Sapiens model with 13417 reactions and corresponding to version 1.3.0.

@haowang-bioinfo
Member Author

haowang-bioinfo commented Aug 11, 2020

@mihai-sysbio good point about providing examples; both can be added to the spec in the Wiki once a consensus is reached.

@edkerk
Member

edkerk commented Apr 7, 2021

So should HumanGEM's writeHumanYaml be integrated in RAVEN's writeYaml, thereby capturing this metadata?

@haowang-bioinfo
Member Author

haowang-bioinfo commented Apr 7, 2021

> So should HumanGEM's writeHumanYaml be integrated in RAVEN's writeYaml, thereby capturing this metadata?

@edkerk full support

@edkerk edkerk added wip work in progress and removed discussion Not yet settled whether change in code is required. labels Apr 7, 2021
@edkerk edkerk added this to the 2.4.4 milestone Apr 7, 2021
@edkerk
Member

edkerk commented Apr 7, 2021

It is not sufficient to just define fields in the RAVEN model structure, and support export to YML file format. SBML is still the de facto standard for model distribution, so these fields should also be properly stored there.

Related to this there are some unresolved issues:

  1. If we introduce version, where is this stored in the SBML file? As far as I can find, this is not covered by the SBML specification. I see two options:
    1. The version number can be appended to the model id, e.g. yeastGEM_v8_4_2. A benefit is that it is then also loaded when using cobrapy or COBRA toolbox. However, would we then split the model id from SBML into two parts: (1) model.id and (2) model.version? In that case the model would have different model ids in RAVEN than in cobrapy, COBRA etc. To avoid problems, I would prefer not to run regexprep on any identifier.
    2. Include version number in the SBML as model annotation, in a similar way as taxonomy, authors, organization etc. are included. See example below. However, I don't know what tags to use, something related to <rdf>? Does standard-GEM have a role to play in this?
Example from iYali, model annotation given from line 4.

<?xml version="1.0" encoding="UTF-8"?>
<sbml xmlns="http://www.sbml.org/sbml/level3/version1/core" xmlns:fbc="http://www.sbml.org/sbml/level3/version1/fbc/version2" xmlns:groups="http://www.sbml.org/sbml/level3/version1/groups/version1" level="3" version="1" fbc:required="false" groups:required="false">
  <model metaid="iYali" id="iYali" name="iYali" fbc:strict="true">
    <annotation>
      <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:vCard="http://www.w3.org/2001/vcard-rdf/3.0#" xmlns:vCard4="http://www.w3.org/2006/vcard/ns#" xmlns:bqbiol="http://biomodels.net/biology-qualifiers/" xmlns:bqmodel="http://biomodels.net/model-qualifiers/">
        <rdf:Description rdf:about="#iYali">
          <dcterms:creator>
            <rdf:Bag>
              <rdf:li rdf:parseType="Resource">
                <vCard:N rdf:parseType="Resource">
                  <vCard:Family>Kerkhoven</vCard:Family>
                  <vCard:Given>Eduard</vCard:Given>
                </vCard:N>
                <vCard:EMAIL>eduardk@chalmers.se</vCard:EMAIL>
                <vCard:ORG rdf:parseType="Resource">
                  <vCard:Orgname>Chalmers University of Technology</vCard:Orgname>
                </vCard:ORG>
              </rdf:li>
            </rdf:Bag>
          </dcterms:creator>
          <dcterms:created rdf:parseType="Resource">
            <dcterms:W3CDTF>2021-04-05T10:27:05Z</dcterms:W3CDTF>
          </dcterms:created>
          <dcterms:modified rdf:parseType="Resource">
            <dcterms:W3CDTF>2021-04-05T10:27:05Z</dcterms:W3CDTF>
          </dcterms:modified>
          <bqbiol:is>
            <rdf:Bag>
              <rdf:li rdf:resource="https://identifiers.org/taxonomy/4952"/>
            </rdf:Bag>
          </bqbiol:is>
        </rdf:Description>
      </rdf:RDF>
    </annotation>

  2. If id is used instead of shortName [and I would argue we should, as id is similar to modelID, model.id and <model id=""> as used in COBRA, cobrapy and SBML], then why use fullName and not just name? The latter is also more in line with other software and the SBML specification.
  3. In humanGEM.yml, date is also specified, should this be part of the RAVEN model structure? And what does this date reflect, when a new version was released? RAVEN generated SBML already includes the date that the file was created, but that's probably not what is meant here. Instead, the date should be set when the new version number is set, and absent if no version number is present?
  4. Where should sourceUrl be stored in the SBML? Also in annotation, as the second suggestion for version?
  5. Note that description is not problematic to store in the SBML, it is actually stored under <notes>. With that in mind, why change note to description? cobrapy has model.notes, and it is closer to the SBML specification.
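
For readers unfamiliar with where this metadata sits, the following sketch reads the authors, organization and taxonomy from a trimmed, well-formed copy of the iYali excerpt above, using only Python's standard library. It is an illustration, not how RAVEN, cobrapy or libSBML parse SBML; the function name `read_model_annotation` is hypothetical. It deliberately does not read a version, since which tag would hold it is the open question in point 1.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes taken from the iYali excerpt above.
NS = {
    "sbml": "http://www.sbml.org/sbml/level3/version1/core",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
    "vCard": "http://www.w3.org/2001/vcard-rdf/3.0#",
    "bqbiol": "http://biomodels.net/biology-qualifiers/",
}

# Trimmed copy of the excerpt, with closing tags added so that it parses.
SBML = """<sbml xmlns="http://www.sbml.org/sbml/level3/version1/core" level="3" version="1">
  <model metaid="iYali" id="iYali" name="iYali">
    <annotation>
      <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:vCard="http://www.w3.org/2001/vcard-rdf/3.0#" xmlns:bqbiol="http://biomodels.net/biology-qualifiers/">
        <rdf:Description rdf:about="#iYali">
          <dcterms:creator><rdf:Bag><rdf:li rdf:parseType="Resource">
            <vCard:N rdf:parseType="Resource">
              <vCard:Family>Kerkhoven</vCard:Family>
              <vCard:Given>Eduard</vCard:Given>
            </vCard:N>
            <vCard:ORG rdf:parseType="Resource">
              <vCard:Orgname>Chalmers University of Technology</vCard:Orgname>
            </vCard:ORG>
          </rdf:li></rdf:Bag></dcterms:creator>
          <bqbiol:is><rdf:Bag>
            <rdf:li rdf:resource="https://identifiers.org/taxonomy/4952"/>
          </rdf:Bag></bqbiol:is>
        </rdf:Description>
      </rdf:RDF>
    </annotation>
  </model>
</sbml>"""


def read_model_annotation(sbml_text: str) -> dict:
    """Collect the model id, creators and taxonomy links from the model annotation."""
    root = ET.fromstring(sbml_text)
    model = root.find("sbml:model", NS)
    desc = model.find("sbml:annotation/rdf:RDF/rdf:Description", NS)

    creators = []
    for li in desc.findall("dcterms:creator/rdf:Bag/rdf:li", NS):
        creators.append({
            "givenName": li.findtext("vCard:N/vCard:Given", default="", namespaces=NS),
            "familyName": li.findtext("vCard:N/vCard:Family", default="", namespaces=NS),
            "organization": li.findtext("vCard:ORG/vCard:Orgname", default="", namespaces=NS),
        })

    # Attributes are stored under their fully qualified name.
    resource_attr = "{%s}resource" % NS["rdf"]
    taxonomy = [li.get(resource_attr) for li in desc.findall("bqbiol:is/rdf:Bag/rdf:li", NS)]

    return {"id": model.get("id"), "creators": creators, "taxonomy": taxonomy}


if __name__ == "__main__":
    print(read_model_annotation(SBML))
```

The sketch shows that each creator carries its own given name, family name and organization, which is relevant to whether a single authors string can be stored in the same place.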

@haowang-bioinfo
Member Author

@edkerk good arguments indeed.

@mihai-sysbio what do you think, if standard-GEM can help in adopting some fields?

@edkerk
Member

edkerk commented Apr 8, 2021

On second thought, perhaps it is better to move the discussion about incorporation in SBML into a separate issue, as the current issue is just about the MATLAB structure and the yaml file.
The points that remain relevant are:
2. Have a model.name field instead of model.fullName.
5. Have a model.annotation.note field instead of model.annotation.description.

@mihai-sysbio
Member

mihai-sysbio commented Apr 9, 2021

@Hao-Chalmers it would make a lot of sense to standardize (and validate) that the yml file has these fields. However, as @edkerk pointed out, maintaining compatibility with existing formats is tricky (1.ii), especially if the newly added fields are to be parsed by other tools as well.

To me, the easiest path forward is what @edkerk suggested above:

> current issue is just about the MATLAB structure and the yaml file

I would like to emphasize the different use cases for model.short_name and model.full_name. Here is how Metabolic Atlas uses these fields:

    "short_name": "Yeast-GEM",
    "full_name": "Consensus genome-scale metabolic model of Saccharomyces cerevisiae",
    "description": "Consensus genome-scale metabolic model of Saccharomyces cerevisiae. It is the continuation of the legacy project yeastnet",
    "version": "8.4.2",

Luckily, this GEM has a nice model.id, but it's just a coincidence that it is readable. The model.id could well have been yeastGEM_v8_4_2. Since it is an identifier, it will not be parsed into anything readable or worth presenting on a website.
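
A small sketch of what recovering a name and version from such an identifier would involve. The pattern `<name>_v<major>_<minor>_<patch>` is an assumption based only on the example `yeastGEM_v8_4_2`, and the function name `split_versioned_id` is hypothetical; nothing in this thread defines such a convention.

```python
import re

# Assumed convention, inferred from "yeastGEM_v8_4_2": a name, then "_v",
# then version parts separated by underscores.
VERSIONED_ID = re.compile(r"^(?P<name>.+)_v(?P<version>\d+(?:_\d+)*)$")


def split_versioned_id(model_id: str):
    """Return (name, version) if the id follows the assumed pattern, else (model_id, None)."""
    match = VERSIONED_ID.match(model_id)
    if match is None:
        return model_id, None
    return match.group("name"), match.group("version").replace("_", ".")


if __name__ == "__main__":
    for model_id in ("yeastGEM_v8_4_2", "h_sap13417__1_3_0", "Yeast-GEM"):
        print(model_id, "->", split_versioned_id(model_id))
```

Only identifiers that happen to follow the assumed convention can be split, which is one reason to keep separate short_name and version fields rather than parsing them out of model.id.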

@haowang-bioinfo
Member Author

@edkerk @mihai-sysbio I adjusted the Expected changes of this issue according to your input.

@edkerk
Member

edkerk commented Apr 9, 2021

writeYaml (5418e88) and the model fields definition (Wiki) are changed according to the discussion here, with the following exceptions:

  • givenName and familyName remain as (non-mandatory) fields, while authors is an additional (non-mandatory) field. This is to ensure backwards compatibility, as givenName and familyName are actually coded in the SBML, and authors is not, while their meaning is not identical (givenName and familyName would match organization and email, while for authors this is ambiguous).
  • also other subfields of model.annotation (defaultLB, defaultUB) are included as metaData in the yaml file.
  • by default writeYaml no longer sorts the identifiers (it used to do this, while writeHumanYaml doesn't, probably best to keep the identifier order by default).
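
As a rough picture of the resulting metaData export, the sketch below renders the top-level fields and the annotation subfields of a dict-based model as a YAML metaData section. It is not the MATLAB writeYaml implementation: the function name `write_metadata_yaml`, the dict stand-in for the model structure, and the exact key names and quoting are assumptions, with the layout borrowed from the Human-GEM example earlier in this thread.

```python
import json


def write_metadata_yaml(model: dict) -> str:
    """Render id, name, version and all annotation subfields as a metaData section."""
    meta = {
        "id": model.get("id", ""),
        "name": model.get("name", ""),
        "version": model.get("version", ""),
    }
    # Flatten annotation subfields (e.g. defaultLB, defaultUB, authors) into metaData.
    meta.update(model.get("annotation", {}))

    lines = ["- metaData:"]
    for key, value in meta.items():  # insertion order is kept, nothing is sorted
        if value in ("", None):
            continue  # non-mandatory fields that are empty are left out
        # json.dumps produces a double-quoted scalar, which is also valid YAML.
        lines.append(f"    {key}: {json.dumps(str(value), ensure_ascii=False)}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    model = {
        "id": "iYali",
        "name": "iYali",
        "version": "",
        "annotation": {"defaultLB": -1000, "defaultUB": 1000, "givenName": "Eduard", "familyName": "Kerkhoven"},
    }
    print(write_metadata_yaml(model), end="")
```

Empty fields are skipped here to mirror the non-mandatory status of givenName, familyName and authors described above.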

Renaming model.description to model.name additionally required small refactoring of 23 files (fe7d417). As this breaks backwards compatibility with models that would already have been loaded in MATLAB, I suggest these changes result in release 2.5.0 instead of 2.4.4.

4 participants