Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merging formulas to normalize them despite sources. Added Notes column #68

Merged
merged 2 commits into from
Sep 19, 2017

Conversation

samseaver
Copy link
Contributor

This is a bit of a double-whammy, all formulas updated so that there's no extra characters, parentheses, multipliers, etc. so to normalize across the board, and make it easier when balancing reactions.

In addition, if formulas had multipliers, either explicit or implicit (i.e. '(formula)[xn*]') we take note in the Notes column, as "PO" for polymers.

@samseaver
Copy link
Contributor Author

@JamesJeffryes The output of the travis checks is massive, is there a place where we can find a summary of the issues blocking the merge?

@JamesJeffryes
Copy link
Contributor

JamesJeffryes commented Sep 18, 2017

Generally you have to look at the bottom of the raw log. The build is failing because of this error 19 new unbalanced errors detected Basically there are 19 more non-obsolete reactions that are unbalanced (according to the formulas, not the status column) relative to the master branch. Unfortunately it does not track which ones changed at the moment (it just stores the overall count). I am guessing this is probably caused by compounds that did not have a valid formula now bing parseable but because the "notes" change affects every line I cant really tell.

Copy link
Contributor

@JamesJeffryes JamesJeffryes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are confident that the formulas are correct, I am OK merging this PR despite of the failing build. Changing the formulas won't make any reaction that was balanced unbalenced in practice though they could be examined.

@mmundy42
Copy link
Contributor

Take a look at cpd00021, cpd00030, cpd00058, cpd00149, and cpd00244. The formulas include the charge.

@samseaver
Copy link
Contributor Author

Yes, with hindsight, it was a mistake to commit both a massive change (i.e. the addition of a single column) and spot-changes (i.e. the update of different formulas).

Is Travis monitoring whether reaction balances are changing ? We should probably clear up why Travis had these 19 reactions listed as "balanced" in the first place.

@samseaver
Copy link
Contributor Author

In any case, yes, I'm very confident in the formula-parsing-re-merging I did, here are some examples:

Updating cpd17447: (SO4. Ca)2. H2O --> H2Ca2O9S2
Updating cpd17517: SO4. Co --> CoO4S
Updating cpd20546: Mg(Al,Fe)Si4O10(OH).4H2O --> H9AlFeMgO15Si4

I was also going to follow this up with the new set of reaction balances too, so we can see which reactions would have a different status, so lets merge, and then review.

Finally, this goes back to what I was saying about using pretty-printed JSON objects, it helps see the diffs between individual columns.

@JamesJeffryes
Copy link
Contributor

JamesJeffryes commented Sep 18, 2017 via email

@JamesJeffryes
Copy link
Contributor

basically any rxn id that appears above is a new "unbalanced reaction". I would love switch everything over to pretty print json! I'd have to rewrite the tests but it would be more legible than TSV @mmundy42 do you have a lot code outside this repo that uses the current tsv schema?

@samseaver
Copy link
Contributor Author

I spot-checked one of these: rxn06114

This happens:
Updating cpd11670: C14H21NO112 --> C28H42N2O22
Updating cpd11716: C14H21NO14S
2 --> C28H42N2O28S2

The stored formula had a multiplier. However, in my current reaction balancing code (not yet committed), the status of rxn06114 is still "OK".

@samseaver
Copy link
Contributor Author

Ah-ha, I just found a filter I had in my reaction balancing code, so now I can see that I'm getting the same newly reported imbalances. I'm recording them, and what I will do is I will follow up on them individually to clean up those imbalances and re-commit.

@mmundy42
Copy link
Contributor

Yes, I have some code that uses the tsv schema but it would be easy to switch to a JSON schema.

One thing to consider is how to handle fields that have no values in tsv files. Would all "records" in the JSON file have a key for every field?

Also, it would be good to make sure that fields are consistently defined. For example, in a biomasses file, all of the numeric values should be floats. I'd suggest defining a json-schema for each file and using the jsonschema Python package to validate the files.

@JamesJeffryes
Copy link
Contributor

Love the idea of schema validation as part of the tests. I would say "field": null for missing info is my preference. It's explicit which is better for manual curation and schema validation.

@samseaver
Copy link
Contributor Author

Sweet! Updating my code to specifically assume most multipliers should default to '1' results in passed checks on reaction balancing.

@samseaver
Copy link
Contributor Author

I've been using a "null" text string for the TSV schema, I'm ambivalent as to whether we should maintain using a "null" string or a true null value.

@JamesJeffryes
Copy link
Contributor

Great! This is how the checks are supposed to work.

@JamesJeffryes JamesJeffryes merged commit 1375055 into master Sep 19, 2017
@JamesJeffryes JamesJeffryes deleted the Merging_Formulas branch September 19, 2017 13:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants