json / yaml export - complicated data structure #102
Comments
This scheme won't work. These formats are not intended to be human-readable; I don't think JSON or YAML is well-suited for that purpose. The tabular formats are better suited for human readability. These formats are intended to make it easy for a machine to reconstruct a dataset, including all of its relationships. The formats cannot rely on human-readable ids because classes aren't required to have such ids, so they have to encode relationships more generically. The formats also have to communicate the type of each object so that the files are independent of the schemas.
Thank you!! The problem I have is that I thought I could import the JSON/YAML files into MATLAB and directly obtain a good data structure. At the moment, the way information is arranged seems a bit arbitrary: information about model(s), compounds, and reactions is scattered over the tree structure, and some information (e.g., id: e_coli for the model) appears many times, while other information is not duplicated at all. What I don't understand: in the YAML tree, each object already has an id and a type (e.g., "__id: 0" and "__type: Compound" for fructose 6 phosphate). So why can't these objects be arranged in lists (one list for each type, and within each list, objects ordered by their id)? That would be much closer to what I had in mind. Basically, I think it would be nice to have a "symmetric" data structure, in which all compounds appear in the same way, all reactions appear in the same way, and so on. When importing the YAML structure into MATLAB, I will have to convert it into such a form anyway, so why not also structure the YAML file like this? Maybe we can talk in the next few days? All the best,
Because JSON can't represent circular relationships among objects, and because JSON doesn't support custom classes, we have to use custom code and design choices to encode ObjTables data into JSON and decode ObjTables data out of JSON. Even if we change how the data is encoded into JSON, we'd still need custom code to decode the type information and references. This encoding/decoding is encapsulated by the `JsonWriter` and `JsonReader` classes.
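For illustration, here's a minimal sketch (not the ObjTables implementation; object names are made up) of why circular relationships force an id-based encoding:

```python
import json

# Two plain dicts that reference each other -- a circular relationship.
model = {"id": "e_coli"}
compound = {"id": "fru_6_p", "model": model}
model["compounds"] = [compound]

# json.dumps cannot serialize the cycle directly; it raises ValueError.
try:
    json.dumps(model)
    serializable = True
except ValueError:
    serializable = False
print(serializable)  # False

# An id-based encoding breaks the cycle: each object gets a numeric
# "__id" and a "__type", and references become {"__id": n} dicts.
encoded = [
    {"__id": 0, "__type": "Model", "id": "e_coli",
     "compounds": [{"__id": 1}]},
    {"__id": 1, "__type": "Compound", "id": "fru_6_p",
     "model": {"__id": 0}},
]
print(json.dumps(encoded) is not None)  # True: round-trips cleanly
```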
OK, I see. I will try to implement the JSON reader in MATLAB (without validation, i.e., assuming that the JSON file has been directly generated by obj_tables). Do you think we need a JSON writer class for MATLAB? Probably not, right? The MATLAB → Python direction can always be accomplished through CSV, I guess.
**JsonReader**

If you assume that the data has already been validated, then the JSON doesn't need to be further validated. I wouldn't recommend implementing table parsing or validation in MATLAB; even with MATLAB object-oriented programming, this would likely take more than 10,000 lines of code.

**JsonWriter**

Do you want to programmatically create objects in MATLAB (i.e., programmatically generate species, reactions, rate laws, etc.)? If so, a JsonWriter class could be useful. Then you could create structs that represent species and reactions, convert them to JSON, and use ObjTables to write them to tables rather than writing the same information directly to tables with MATLAB (structs are easier to manipulate than tables). However, making this really useful would require implementing MATLAB classes for each type of object (a class for each table and each relationship) rather than using structs. ObjTables uses Python metaprogramming to make such classes easy to generate; this is implemented in ~1,000 lines of code. Because MATLAB doesn't have metaprogramming, this would likely take at least an order of magnitude more code.
I think MATLAB classes are not necessary. I would like to continue working with structs, for example with a structure that I'm using already: a document is a struct containing the tables ("models"); each table is a struct containing the table rows ("objects"); and each row is a struct containing the individual table cells ("attributes"). This structure could be directly converted into JSON, but it would not be the "asymmetric" JSON structure used by obj_tables, which I would not know how to generate reliably. Another, rather pragmatic, option for MATLAB would be, given a CSV file to be imported, to use Python to validate the file, and, if the file is correct, to simply read the original CSV file into MATLAB (knowing that it is correct). Then I can easily generate my own MATLAB structs. For validating an SBtab document, I export it to CSV and then run the Python validator. Do you think this makes sense?
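The document/table/row/cell layout described above can be sketched with Python dicts standing in for MATLAB structs (all names here are illustrative, not from the actual data files):

```python
import json

# document: struct of tables; table: struct of rows; row: struct of cells.
document = {
    "Compound": {
        "fru_6_p": {"name": "fructose 6 phosphate", "model": "e_coli"},
        "glc": {"name": "glucose", "model": "e_coli"},
    },
    "Reaction": {
        "pgi": {"equation": "g6p <=> f6p", "model": "e_coli"},
    },
}

# Because attributes are plain strings and references are ids, this
# structure converts directly to JSON with no special encoding.
print(json.dumps(document, indent=2))
```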
MATLAB classes aren't necessary, but it's easier to build a user-friendly interface with custom classes. Having such an interface is more important when there are relationships among objects that need to be managed and when there are more data types. It is less necessary for SBtab, since SBtab largely ignores relationships and only has a few data types. Yes, the ObjTables validation could be accessed by (1) saving structs to CSV and (2) using ObjTables to validate the CSV.
I looked at the JSON output again to remind myself how I designed it. It's pretty simple: it's a flat list of objects, and the type of each object is indicated by the key `__type`. If you wish to have a more hierarchical structure, you can group the objects based on `__type`.
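A minimal sketch of such grouping (the object ids and names below are illustrative):

```python
from collections import defaultdict

# A flat list of encoded objects; "__type" identifies each object's class.
objects = [
    {"__id": 0, "__type": "Model", "id": "e_coli"},
    {"__id": 1, "__type": "Compound", "id": "fru_6_p"},
    {"__id": 2, "__type": "Compound", "id": "glc"},
    {"__id": 3, "__type": "Reaction", "id": "pgi"},
]

# Group the flat list into one list per type.
grouped = defaultdict(list)
for obj in objects:
    grouped[obj["__type"]].append(obj)

print(sorted(grouped))  # ['Compound', 'Model', 'Reaction']
```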
No, that sounds good ... but it doesn't really match what I see. Here's the YAML code I obtain (which, I expect, has the same structure as the JSON code): it has several levels, with some information appearing multiple times. Can you check again whether this is the structure you meant to design?
How did you generate this?
The information isn't repeated. What looks like repetition is the encoding of a relationship.
I generated this by `obj-tables convert schema.csv data.xlsx data.yml`, with the data files from `obj_tables/examples/biochemical_models`. OK, maybe it's necessary to repeat this, but I thought that writing the full `model:` subtree multiple times is redundant; for example, a short reference under `model:` should do the job. But my main worry is not the repetition; it's the fact that a lot of information about the model, compounds, and reactions appears inside the first compound element, and not where I would expect it: in the respective elements of the outer list.
I flattened out the encoding. Example:
I think I can make this a bit more flexible, so that the structure of the JSON/YML can be controlled by the user:
This will give the user control over the many semantically-equivalent ways of encoding the same data into JSON/YML. Then I can use this to generate JSON with the structure you're expecting. That said, I don't think it's necessary to make the JSON human-readable. The JSON just has to capture the semantic meaning of the objects and their relationships.
The relationships to other objects are represented as dictionaries. E.g.,
The only information that must be included is `__id`.
Fantastic!!! That completely solves the problem I had. Thank you!
The default output will now be grouped by class as illustrated at the bottom of this comment. This structure is more similar to the way that objects of different types are represented by different tables. FYI, the Python code which generates the JSON/YAML is more flexible than this:
The Python code which decodes the JSON/YAML is equally flexible:
Note, this flexibility is not extended to the command-line program and REST API, which can only encode data into JSON/YAML as illustrated below. I don't think it makes sense to extend this flexibility to the command-line program and REST API; this would require users to specify the output format, which seems unnecessarily complicated. One thing that would be easy to add to the command-line program and REST API is an option to encode the data into JSON/YAML as a flat list (as illustrated three comments above) rather than as a dictionary.
Here's example code for decoding the JSON in another language without access to the Python schema. Here's the unit test for the code, which could serve as a starting point for a MATLAB implementation.
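A minimal sketch in Python of such a schema-free decoder, assuming the `__id`/`__type` convention described above (this is an illustration, not the actual code referenced in this comment):

```python
import json

def decode(encoded):
    # Index the flat list of objects by their numeric "__id".
    objects = {obj["__id"]: obj for obj in encoded}

    def resolve(value):
        # A dict of the exact form {"__id": n} is a reference.
        if isinstance(value, dict) and set(value) == {"__id"}:
            return objects[value["__id"]]
        if isinstance(value, list):
            return [resolve(v) for v in value]
        return value

    # Replace every reference with the object it points to, in place,
    # so circular relationships become real cycles between dicts.
    for obj in objects.values():
        for key in list(obj):
            if key not in ("__id", "__type"):
                obj[key] = resolve(obj[key])
    return objects

encoded = json.loads("""[
    {"__id": 0, "__type": "Model", "id": "e_coli",
     "compounds": [{"__id": 1}]},
    {"__id": 1, "__type": "Compound", "id": "fru_6_p",
     "model": {"__id": 0}}
]""")
objects = decode(encoded)
# The compound's "model" attribute now points at the actual Model dict:
print(objects[1]["model"]["id"])  # e_coli
```

The same two-pass approach (index by `__id`, then resolve references) translates directly to MATLAB using `jsondecode` and containers.Map.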
Hi Jonathan,
Again a comment regarding file conversion (same data file, examples/biochemical_models/data.xlsx); this time it concerns JSON and YAML export via
obj-tables convert schema.csv data.xlsx data.json
obj-tables convert schema.csv data.xlsx data.yml
From the original tables, I had expected a data structure such as
[
[model_1],
[compound1, compound2, compound3, ..]
[reaction1, reaction2]
]
where the entries point to each other via their ids: e.g., reaction1 would have an attribute "Model", with value "e_coli", which matches the id of model1.
Now I saw that the attributes (by which objects point to other objects) do not contain ids (or "variable names") but the objects themselves. Specifically, the data structure starts with compound1, which contains (as an attribute) a data structure describing model1, which in turn contains all compounds and reactions (which then, again, contain "simple" instances of the model). At the end come all the other compounds (with no model or reaction information at all).
I imagine that this data structure does the job for exporting / importing the python data objects, but it is difficult to make sense of. Can you have a look at this again and see if my solution (described above) would also work for you?
(that is, representing a table by a list of relatively "flat" objects, whose attributes can be strings or lists, but not objects themselves, and which instead point to other objects via ids?)
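For concreteness, the "symmetric" encoding proposed above might look like this, sketched as JSON-ready Python data (all ids and names are illustrative): one flat list per type, attributes limited to strings or lists, and cross-references via ids.

```python
proposed = {
    "Model": [
        {"id": "e_coli"},
    ],
    "Compound": [
        {"id": "fru_6_p", "model": "e_coli"},
        {"id": "glc", "model": "e_coli"},
    ],
    "Reaction": [
        {"id": "reaction1", "model": "e_coli"},
    ],
}

# Every reference can be checked against the ids it points to:
model_ids = {m["id"] for m in proposed["Model"]}
print(all(r["model"] in model_ids for r in proposed["Reaction"]))  # True
```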
Thank you!
All the best,
Wolf