New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Biotype so_term #364
Biotype so_term #364
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that you have accidentally committed into this branch several changes (schema changes, version bump) which have already been committed into #363. Could you remove them, please?
it was on purpose, this branch requires the changes from #363. |
In that case please base this PR against schema_update_97 for now, as it is this really is not reviewer-friendly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Much better! Just two minor issues as far as I am concerned.
modules/Bio/EnsEMBL/Feature.pm
Outdated
$so_term = $ref->SEQUENCE_ONTOLOGY->{'term'}; | ||
}; | ||
|
||
unless ($so_term || $ref eq 'Bio::EnsEMBL::Feature' ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not like using unless for complex conditions, I end up mentally substituting it with if not and propagating the negation to individual conditions. Maybe it's just me, though.
throws_ok { $biotype1->so_acc('test') } qr/so_acc must be a Sequence Ontology accession/, 'so_acc() requires a SO acc like string'; | ||
throws_ok { $biotype1->object_type('test') } qr/object_type must be gene or transcript/, 'object_type() must be gene or transcript'; | ||
|
||
# test transcript biotype object | ||
my $transcript = $gene->canonical_transcript; | ||
debug("transcript biotype"); | ||
is($transcript->biotype, 'protein_coding', "Trancript biotype is protein_coding"); | ||
is($transcript->biotype, 'protein_coding', "Transcript biotype is protein_coding"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say this and the trailing-whitespace fixes in line 102 of this file + in line 41 of cds.t should go into a separate commit, those are a different logical change from the so_term thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't want to edit commits just for this tiny thing, will keep attention to in the future commit separately when I spot those small issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Good to go as far as I am concerned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks fine.
Quick comment about the column definition and consequences.
@@ -2197,6 +2200,7 @@ CREATE TABLE biotype ( | |||
description TEXT, | |||
biotype_group ENUM('coding','pseudogene','snoncoding','lnoncoding','mnoncoding','LRG','undefined','no_group') DEFAULT NULL, | |||
so_acc VARCHAR(64), | |||
so_term VARCHAR(1023), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are you using such a large number for the varchar? Even though the used space is variable MySQL will have allotted memory units reserved for the storage of the string, which will be much larger in this case. This will reflect badly when sorting or temp tables are required.
Might be worth analysing the required size that is need and restricting what you are storing. Is worth adding a test in the code to ensure that there is not truncation of the inserted text at a later date.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That data is present and extracted from the ensembl_ontology_ database, term table, name column, which is a VARCHAR(1023).
It could be a TEXT instead, but it's usually better to be a VARCHAR if within it's limits and as you say it will only take as much space as it needs...
The usage is expected to be through the Biotype object, which is a select of a specific biotype name and object_type combination.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TEXT
would be the wrong way, if the object does not need to be up to 1023 then limiting down to a smaller number will help MySQL allot the correct memory size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I said, the source data is 1023 so it can theoretically be that large.
I do think it probably won't be, but looking at the sequenceontology website I can't find any real numbers for how large a term can be.
The only way to play safe is to make it as large as the source data can be.
The table is however quite small, and it's usage limited (and mostly as the select I mentioned).
Requirements
Description
Using one or more sentences, describe in detail the proposed changes.
This adds so_term column to the biotype column and to the Biotype objects.
Production master_biotype table needs updating (Ensembl/ensembl-production#218)
Core Features have also been updated to include feature_so_term method, in the likelihood of feature_so_acc. Features of compara, regulation and variation need updating through separate pull requests to follow.
Use case
Describe the problem. Please provide an example representing the motivation behind the need for having these changes in place.
Biotype objects will now report the so_term in addition to the so_acc, thus an ontology database dependency is avoided. Features will also report the so_term.
Benefits
If applicable, describe the advantages the changes will have.
Improved Biotype objects and no need for ontology database to look up so terms.
Possible Drawbacks
If applicable, describe any possible undesirable consequence of the changes.
A bit more data in the biotype table.
Testing
Have you added/modified unit tests to test the changes?
yes
If so, do the tests pass/fail?
pass
Have you run the entire test suite and no regression was detected?
yes