json / yaml export - complicated data structure #102

Closed
liebermeister opened this issue Oct 7, 2019 · 18 comments

Comments

@liebermeister

Hi Jonathan,

again a comment regarding file conversion (same data file, examples/biochemical_models/data.xlsx); now it concerns json and yaml export via

obj-tables convert schema.csv data.xlsx data.json
obj-tables convert schema.csv data.xlsx data.yml

From the original tables, I had expected a data structure such as

[
[model_1],
[compound1, compound2, compound3, ..]
[reaction1, reaction2]
]

where the entries point to each other via their ids: e.g., reaction1 would have an attribute "Model", with value "e_coli", which matches the id of model1.

Now I saw that the attributes (by which objects point to other objects) do not contain ids (or "variable names"), but the objects themselves. Specifically, the data structure starts with compound1, which contains (as an attribute) a data structure describing model1, which in turn contains all compounds and reactions (which then, again, contain "simple" instances of the model). At the end, there are all the other compounds (with no model or reaction information at all).

I imagine that this data structure does the job for exporting / importing the python data objects, but it is difficult to make sense of. Can you have a look at this again and see if my solution (described above) would also work for you?
(that is, representing a table by a list of relatively "flat" objects, whose attributes can be strings or lists, but not objects themselves, and which instead point to other objects via ids?)

Thank you!

All the best,
Wolf

@jonrkarr jonrkarr self-assigned this Oct 7, 2019
@jonrkarr
Member

jonrkarr commented Oct 7, 2019

This scheme won't work. These formats are not intended to be human-readable. I don't think JSON or YAML is well-suited for that purpose. The tabular formats are better suited for human readability. These formats are intended to make it easy for a machine to reconstruct a dataset, including all of the relationships. These formats cannot rely on the human-readable ids because classes aren't required to have such ids. The formats have to encode relationships more generically. The formats also have to communicate information about the type of each object so the files are independent from schemas.

@liebermeister
Author

Thank you!!
I see the point.

The problem I have is that I thought I could import the JSON / YAML files into MATLAB and directly obtain a good data structure. At the moment, the way information is arranged seems a bit arbitrary: information about model(s), compounds, and reactions is scattered over the tree structure, and some information (e.g. id: e_coli for the model) appears many times, while other information is not duplicated at all.

What I don't understand: in the yaml tree, each object already has an id and a type (e.g., "__id: 0" and "__type: Compound" for the fructose 6 phosphate). So why can't these objects be ordered in lists (one list for each type, and within the list, objects would be ordered by their id)? That would be much closer to what I had in mind.

Basically, I think it would be nice to have a "symmetric" data structure, in which all compounds appear in the same way, reactions appear in the same way, and so on. When importing the yaml structure into matlab, I will have to convert it into such a form - so why not also structure the yaml file like this?

Maybe we can talk in the next few days?

All the best,
Wolf

@jonrkarr
Member

jonrkarr commented Oct 8, 2019

Because JSON can't represent circular relationships among objects and because JSON doesn't support custom classes, we have to use custom codes and design choices to encode ObjTables data into JSON and decode ObjTables data out of JSON. Even if we change how the data is encoded into JSON, we'd still need custom code to decode the type information and references. This encoding/decoding is encapsulated by the obj_tables.io.JsonReader and obj_tables.io.JsonWriter classes. Because these design choices are encapsulated, and because the JSON isn't intended to be human-readable, how the data is encoded into JSON isn't important to users.
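
To illustrate the point about circular relationships, here is a minimal Python sketch. The plain dicts merely stand in for ObjTables objects (they are not the library's classes), and the stub layout with `__id` is only assumed to resemble what the writer produces. The standard `json` module refuses to encode two objects that point at each other, whereas replacing each pointer by a small stub with an internal id encodes without trouble.

```python
import json

# Plain dicts standing in for a model and one of its compounds (illustrative only).
model = {"id": "e_coli", "compounds": []}
compound = {"id": "D_Glucose", "model": model}
model["compounds"].append(compound)  # model -> compound -> model: a cycle

try:
    json.dumps(model)
    circular_ok = True
except ValueError:
    # The standard encoder detects the cycle and raises ValueError.
    circular_ok = False

# Breaking the cycle: refer to the related object by an internal id instead.
flat = [
    {"__id": 0, "__type": "Model", "id": "e_coli", "compounds": [{"__id": 1}]},
    {"__id": 1, "__type": "Compound", "id": "D_Glucose", "model": {"__id": 0}},
]
encoded = json.dumps(flat)
```

Whatever layout is chosen, a reader still has to turn the stubs back into real references, which is why some custom decoding code is unavoidable.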

My thinking is that obj_tables.io.JsonReader needs to be implemented in MATLAB (and any other language where obj_tables is used). Essentially, this means implementing a version of obj_tables.io.JsonReader in MATLAB (~100 lines of code).

@liebermeister
Author

OK, I see. I will try to implement the JSON reader in MATLAB (without validation, i.e., assuming that the JSON file has been directly generated by obj_tables).

Do you think we need a JSON writer class for matlab? Probably not, right? The matlab -> python direction can always be accomplished through csv I guess.

@jonrkarr
Member

jonrkarr commented Oct 9, 2019

JsonReader

If you assume that the data has already been validated, then the JSON doesn't need to be further validated. I wouldn't recommend implementing table parsing or validation in MATLAB. Even with MATLAB object-oriented programming, this would likely take more than 10,000 lines of code.

JsonWriter

Do you want to programmatically create objects in MATLAB (i.e., programmatically generate species, reactions, rate laws, etc.)? If so, a JsonWriter class could be useful. Then you could create structs that represent species and reactions, convert them to JSON, and use ObjTables to write them to tables rather than writing the same information directly to tables with MATLAB (structs are easier to manipulate than tables).

However, making this really useful would require implementing MATLAB classes for each type of object (a class for each table and each relationship) rather than using structs. ObjTables uses Python metaprogramming to make such classes easy to generate. This is implemented in ~1,000 lines of code. Because MATLAB doesn't have metaprogramming, this would likely take at least an order of magnitude more code.

@liebermeister
Author

I think matlab classes are not necessary.

I would like to continue working with structs, for example one structure that I'm already using: a document is a struct containing the tables ("models"); each table is a struct containing the table rows ("objects"); and each table row is a struct containing the individual table cells ("attributes").

This structure could be directly converted into JSON, but it would not be the "asymmetric" JSON structure used by obj_tables, which I would not know how to generate reliably.

Another, rather pragmatic, option for MATLAB would be, given a CSV file to be imported, to use Python to validate the file and, if the file is correct, to simply read the original CSV file into MATLAB (knowing that it is correct). Then I can easily generate my own MATLAB structs. For validating an SBtab document, I export it to a CSV file and then run the Python validator.

Do you think this makes sense?

@jonrkarr
Member

jonrkarr commented Oct 9, 2019

MATLAB classes aren't necessary, but it's easier to build a user-friendly interface with custom classes. Having such an interface is more important when there are relationships among objects that need to be managed and when there are more data types. This is less necessary for SBtab since it largely ignores relationships and only has a few data types.

Yes, the ObjTables validation could be accessed by (1) saving structs to CSV and (2) using ObjTables to validate the CSV.

@jonrkarr
Member

I looked at the JSON output again to remind myself how I designed it. It's pretty simple. It's a flat list of objects. The type of each object is indicated by the key __type. Each object is also assigned an internal id (__id), which is used to encode relationships among objects. This __id should be used to decode the relationships. To decode the relationships, you need to know which attributes represent relationships. This can be obtained from the schema.

If you wish to have a more hierarchical structure, you can group the objects based on __type. I can add an option to the Python code to return the objects grouped by type (as well as to read in objects encoded in this alternative encoding).
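
Grouping by __type on the reading side could look roughly like the Python sketch below. It assumes the file has already been parsed into a flat list of dicts that each carry `__type` and `__id`; the function name `group_by_type` is made up for this example and is not part of obj_tables.

```python
from collections import defaultdict

def group_by_type(objects):
    """Group a flat list of encoded objects into {type name: [objects]}."""
    grouped = defaultdict(list)
    for obj in objects:
        grouped[obj["__type"]].append(obj)
    # Within each type, order the objects by their internal id.
    return {t: sorted(objs, key=lambda o: o["__id"]) for t, objs in grouped.items()}

flat = [
    {"__id": 6, "__type": "Reaction", "id": "PGI_R02740"},
    {"__id": 0, "__type": "Compound", "id": "D_Fructose_6_phosphate"},
    {"__id": 1, "__type": "Model", "id": "e_coli"},
    {"__id": 2, "__type": "Compound", "id": "D_Glucose"},
]
grouped = group_by_type(flat)
```

The result has one list per type, which is close to the "one list for each type" layout asked for earlier in the thread.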

@liebermeister
Author

No, that sounds good .. but it doesn't really match what I see. Here's the YAML code I obtain (which, I expect, has the same structure as the JSON code): it has several levels, with some information appearing multiple times. Can you check again if this is the structure you meant to design?

- __id: 0
  __type: Compound
  id: D_Fructose_6_phosphate
  identifiers: kegg.compound::C00085
  is_constant: true
  model:
    __id: 1
    __type: Model
    compounds:
    - __id: 0
      __type: Compound
      id: D_Fructose_6_phosphate
    - __id: 2
      __type: Compound
      id: D_Glucose
      identifiers: kegg.compound::C00031
      is_constant: true
      model:
        __id: 1
        __type: Model
        id: e_coli
      name: D-Glucose
    - __id: 3
      __type: Compound
      id: D_Glucose_6_phosphate
      identifiers: kegg.compound::C00092
      is_constant: true
      model:
        __id: 1
        __type: Model
        id: e_coli
      name: D-Glucose 6-phosphate
    - __id: 4
      __type: Compound
      id: Phosphoenolpyruvate
      identifiers: kegg.compound::C00074
      is_constant: true
      model:
        __id: 1
        __type: Model
        id: e_coli
      name: Phosphoenolpyruvate
    - __id: 5
      __type: Compound
      id: Pyruvate
      identifiers: kegg.compound::C00022
      is_constant: true
      model:
        __id: 1
        __type: Model
        id: e_coli
      name: Pyruvate
    id: e_coli
    name: ''
    reactions:
    - __id: 6
      __type: Reaction
      equation: -1.0 D_Glucose_6_phosphate; 1.0 D_Fructose_6_phosphate
      gene: PGI
      id: PGI_R02740
      identifiers: kegg.reaction::R02740
      is_reversible: true
      model:
        __id: 1
        __type: Model
        id: e_coli
      name: ''
    - __id: 7
      __type: Reaction
      equation: -1.0 D_Glucose; -1.0 Phosphoenolpyruvate; 1.0 Pyruvate; 1.0 D_Glucose_6_phosphate
      gene: PTS
      id: PTS_RPTSsy
      identifiers: kegg.reaction::RPTSsy
      is_reversible: true
      model:
        __id: 1
        __type: Model
        id: e_coli
      name: ''
  name: D-Fructose 6-phosphate
- __id: 2
  __type: Compound
  id: D_Glucose
- __id: 3
  __type: Compound
  id: D_Glucose_6_phosphate
- __id: 4
  __type: Compound
  id: Phosphoenolpyruvate
- __id: 5
  __type: Compound
  id: Pyruvate
- __id: 1
  __type: Model
  id: e_coli
- __id: 6
  __type: Reaction
  id: PGI_R02740
- __id: 7
  __type: Reaction
  id: PTS_RPTSsy

@jonrkarr
Member

jonrkarr commented Oct 11, 2019 via email

@jonrkarr
Member

jonrkarr commented Oct 11, 2019 via email

@liebermeister
Author

I generated this by

obj-tables convert schema.csv data.xlsx data.yml

with the data files from obj_tables/examples/biochemical_models

Ok, maybe it's necessary to repeat this, but I thought that writing

model:
  __id: 1
  __type: Model
  id: e_coli

multiple times is redundant, because, for example

model:
  __id: 1

should do the job. But my main worry is not the repetition, but the fact that a lot of information about model, compounds, and reactions appears inside the first compound element, and not where I would expect it - in the respective elements in the outer list.

@jonrkarr
Member

I flattened out the encoding.

Example:

- __id: 0
  __type: Compound
  id: D_Fructose_6_phosphate
  identifiers: kegg.compound::C00085
  is_constant: true
  model:
    __id: 5
    __type: Model
    id: e_coli
  name: D-Fructose 6-phosphate
- __id: 1
  __type: Compound
  id: D_Glucose
  identifiers: kegg.compound::C00031
  is_constant: true
  model:
    __id: 5
    __type: Model
    id: e_coli
  name: D-Glucose
- __id: 2
  __type: Compound
  id: D_Glucose_6_phosphate
  identifiers: kegg.compound::C00092
  is_constant: true
  model:
    __id: 5
    __type: Model
    id: e_coli
  name: D-Glucose 6-phosphate
- __id: 3
  __type: Compound
  id: Phosphoenolpyruvate
  identifiers: kegg.compound::C00074
  is_constant: true
  model:
    __id: 5
    __type: Model
    id: e_coli
  name: Phosphoenolpyruvate
- __id: 4
  __type: Compound
  id: Pyruvate
  identifiers: kegg.compound::C00022
  is_constant: true
  model:
    __id: 5
    __type: Model
    id: e_coli
  name: Pyruvate
- __id: 5
  __type: Model
  compounds:
  - __id: 0
    __type: Compound
    id: D_Fructose_6_phosphate
  - __id: 1
    __type: Compound
    id: D_Glucose
  - __id: 2
    __type: Compound
    id: D_Glucose_6_phosphate
  - __id: 3
    __type: Compound
    id: Phosphoenolpyruvate
  - __id: 4
    __type: Compound
    id: Pyruvate
  id: e_coli
  name: ''
  reactions:
  - __id: 6
    __type: Reaction
    id: PGI_R02740
  - __id: 7
    __type: Reaction
    id: PTS_RPTSsy
- __id: 6
  __type: Reaction
  equation: -1.0 D_Glucose_6_phosphate; 1.0 D_Fructose_6_phosphate
  gene: PGI
  id: PGI_R02740
  identifiers: kegg.reaction::R02740
  is_reversible: true
  model:
    __id: 5
    __type: Model
    id: e_coli
  name: ''
- __id: 7
  __type: Reaction
  equation: -1.0 D_Glucose; -1.0 Phosphoenolpyruvate; 1.0 Pyruvate; 1.0 D_Glucose_6_phosphate
  gene: PTS
  id: PTS_RPTSsy
  identifiers: kegg.reaction::RPTSsy
  is_reversible: true
  model:
    __id: 5
    __type: Model
    id: e_coli
  name: ''

@jonrkarr
Member

jonrkarr commented Oct 11, 2019

I think I can make this a bit more flexible so that the structure of the JSON/YML can be controlled by the user:

  • Make JsonWriter serialize any instance of Model, list of instances of Model, or dictionary which contains instances of Model.
  • Make JsonReader deserialize anything created by JsonWriter

This will give the user control over the many semantically-equivalent ways of encoding the same data into JSON/YML.

Then I can use this to generate JSON with the structure you're expecting.

That said, I don't think it's necessary to make the JSON human-readable. The JSON just has to capture the semantic meaning of the objects and their relationships.

@jonrkarr
Member

The relationships to other objects are represented as dictionaries. E.g.,

{
  "__id": 7,
  "__type": "Reaction",
  "id": "PTS_RPTSsy"
}

The only information that must be included is __id. I have chosen to include __type and the primary attribute (e.g., id) because I think this makes it more readable. However, this isn't necessary.
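
Since only __id is required in a reference, decoding can ignore the extra keys in each stub. The Python sketch below shows one way a reader might do this for the flat-list encoding. It is not the code of obj_tables.io.JsonReader; the function name `resolve_references` is invented here, and the sketch takes a shortcut by treating every nested dict that contains `__id` as a reference instead of consulting the schema.

```python
def resolve_references(objects):
    """Replace {"__id": n, ...} stubs with the full object that has that __id."""
    by_id = {obj["__id"]: obj for obj in objects}  # first pass: index full objects

    def resolve(value):
        if isinstance(value, dict) and "__id" in value:
            return by_id[value["__id"]]
        if isinstance(value, list):
            return [resolve(v) for v in value]
        return value

    for obj in objects:  # second pass: swap stubs for the indexed objects
        for key, value in list(obj.items()):
            if key not in ("__id", "__type"):
                obj[key] = resolve(value)
    return by_id

flat = [
    {"__id": 5, "__type": "Model", "id": "e_coli",
     "reactions": [{"__id": 7, "__type": "Reaction", "id": "PTS_RPTSsy"}]},
    {"__id": 7, "__type": "Reaction", "id": "PTS_RPTSsy", "gene": "PTS",
     "model": {"__id": 5}},
]
objs = resolve_references(flat)
```

After the second pass, `objs[7]["model"]` and `objs[5]` are the same Python object, so the circular relationship that JSON could not express directly is restored in memory.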

@liebermeister
Author

Fantastic!!! That completely solves the problem I had.

Thank you!

@jonrkarr
Member

The default output will now be grouped by class as illustrated at the bottom of this comment. This structure is more similar to the way that objects of different types are represented by different tables.

FYI, the Python code which generates the JSON/YAML is more flexible than this:

  • It can encode any object that is composed of instances of Model, list, dict, and scalars (None, str, bool, int, float) into JSON/YAML.
  • This allows customization of how objects are encoded into JSON/YAML.
  • In particular, it allows extra information to be encoded into JSON/YAML. This information could be thought of as the analog of the comments in tables.

The Python code which decodes the JSON/YAML is equally flexible:

  • Regardless of how objects are encoded in JSON/YAML, they can be converted into tables. In that case, all other data (i.e., the "comments") is ignored.

Note, this flexibility is not extended to the command line program and REST API. The command line program and REST API can only encode data into JSON/YAML as illustrated below. I don't think it makes sense to extend this flexibility to the command line program and REST API; this would require users to specify the output format, which seems unnecessarily complicated. One thing that would be easy to extend to the command line program and REST API would be an option to encode the data into JSON/YAML as a flat list (as illustrated 3 comments above) rather than as a dictionary.

Compound:
- __id: 0
  __type: Compound
  id: D_Fructose_6_phosphate
  identifiers: kegg.compound::C00085
  is_constant: true
  model:
    __id: 5
    __type: Model
    id: e_coli
  name: D-Fructose 6-phosphate
- __id: 1
  __type: Compound
  id: D_Glucose
  identifiers: kegg.compound::C00031
  is_constant: true
  model:
    __id: 5
    __type: Model
    id: e_coli
  name: D-Glucose
- __id: 2
  __type: Compound
  id: D_Glucose_6_phosphate
  identifiers: kegg.compound::C00092
  is_constant: true
  model:
    __id: 5
    __type: Model
    id: e_coli
  name: D-Glucose 6-phosphate
- __id: 3
  __type: Compound
  id: Phosphoenolpyruvate
  identifiers: kegg.compound::C00074
  is_constant: true
  model:
    __id: 5
    __type: Model
    id: e_coli
  name: Phosphoenolpyruvate
- __id: 4
  __type: Compound
  id: Pyruvate
  identifiers: kegg.compound::C00022
  is_constant: true
  model:
    __id: 5
    __type: Model
    id: e_coli
  name: Pyruvate
Model:
- __id: 5
  __type: Model
  compounds:
  - __id: 0
    __type: Compound
    id: D_Fructose_6_phosphate
  - __id: 1
    __type: Compound
    id: D_Glucose
  - __id: 2
    __type: Compound
    id: D_Glucose_6_phosphate
  - __id: 3
    __type: Compound
    id: Phosphoenolpyruvate
  - __id: 4
    __type: Compound
    id: Pyruvate
  id: e_coli
  name: ''
  reactions:
  - __id: 6
    __type: Reaction
    id: PGI_R02740
  - __id: 7
    __type: Reaction
    id: PTS_RPTSsy
Reaction:
- __id: 6
  __type: Reaction
  equation: -1.0 D_Glucose_6_phosphate; 1.0 D_Fructose_6_phosphate
  gene: PGI
  id: PGI_R02740
  identifiers: kegg.reaction::R02740
  is_reversible: true
  model:
    __id: 5
    __type: Model
    id: e_coli
  name: ''
- __id: 7
  __type: Reaction
  equation: -1.0 D_Glucose; -1.0 Phosphoenolpyruvate; 1.0 Pyruvate; 1.0 D_Glucose_6_phosphate
  gene: PTS
  id: PTS_RPTSsy
  identifiers: kegg.reaction::RPTSsy
  is_reversible: true
  model:
    __id: 5
    __type: Model
    id: e_coli
  name: ''
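
The grouped dictionary and the flat list hold the same objects, so a consumer that prefers the flat form can derive it from the grouped one. A minimal Python sketch, assuming the file has already been parsed into a dict of lists as above (the function name `flatten_grouped` is hypothetical, not an obj_tables function):

```python
def flatten_grouped(grouped):
    """Turn {"Compound": [...], "Model": [...]} into one list ordered by __id."""
    flat = [obj for objs in grouped.values() for obj in objs]
    return sorted(flat, key=lambda obj: obj["__id"])

grouped = {
    "Compound": [{"__id": 0, "__type": "Compound", "id": "D_Fructose_6_phosphate"}],
    "Model": [{"__id": 5, "__type": "Model", "id": "e_coli"}],
    "Reaction": [{"__id": 6, "__type": "Reaction", "id": "PGI_R02740"}],
}
flat = flatten_grouped(grouped)
```

Because every object keeps its own __type, nothing is lost by dropping the top-level keys.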

@jonrkarr
Member

jonrkarr commented Oct 12, 2019

Here's example code for decoding JSON in another language without access to the obj_tables Python package (< 50 lines):
https://github.com/KarrLab/obj_tables/tree/master/examples/decode_json.py

Here's the unit test for the code:
https://github.com/KarrLab/obj_tables/blob/master/tests/test_examples.py#L222

For MATLAB,

  • list should be replaced by a cell array
  • struct should be replaced by a custom class which behaves like a struct but also supports handles (references/pointers). The class below should work (copied from MATLAB Central), although my MATLAB knowledge is rusty.
classdef hstruct < handle
  properties
    data
  end
  
  methods
    function obj = hstruct(data)
      obj.data = data;
    end
  end
end
