Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Biotype so_term #364

Merged
merged 11 commits into from Feb 19, 2019
Merged

Biotype so_term #364

merged 11 commits into from Feb 19, 2019

Conversation

tgrego
Copy link
Contributor

@tgrego tgrego commented Feb 12, 2019

Requirements

  • Filling out the template is required. Any pull request that does not include enough information to be reviewed in a timely manner may be closed at the maintainers' discretion;
  • Review the contributing guidelines for this repository; remember in particular:
    • do not modify code without testing for regression
    • provide simple unit tests to test the changes
    • if you change the schema you must patch the test databases as well, see Updating the schema
    • the PR must not fail unit testing

Description

Using one or more sentences, describe in detail the proposed changes.
This adds so_term column to the biotype column and to the Biotype objects.
Production master_biotype table needs updating (Ensembl/ensembl-production#218)
Core Features have also been updated to include feature_so_term method, in the likelihood of feature_so_acc. Features of compara, regulation and variation need updating through separate pull requests to follow.

Use case

Describe the problem. Please provide an example representing the motivation behind the need for having these changes in place.
Biotype objects will now report the so_term in addition to the so_acc, thus an ontology database dependency is avoided. Features will also report the so_term.

Benefits

If applicable, describe the advantages the changes will have.
Improved Biotype objects and no need for ontology database to look up so terms.

Possible Drawbacks

If applicable, describe any possible undesirable consequence of the changes.
A bit more data in the biotype table.

Testing

Have you added/modified unit tests to test the changes?
yes

If so, do the tests pass/fail?
pass

Have you run the entire test suite and no regression was detected?
yes

@coveralls
Copy link

coveralls commented Feb 12, 2019

Coverage Status

Coverage increased (+0.02%) to 81.467% when pulling 3d64a02 on biotype_so_term into 5308e4c on master.

Copy link
Contributor

@mkszuba mkszuba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that you have accidentally committed into this branch several changes (schema changes, version bump) which have already been committed into #363. Could you remove them, please?

@tgrego
Copy link
Contributor Author

tgrego commented Feb 13, 2019

it was on purpose, this branch requires the changes from #363.
Will magically disappear as that branch is merged to master.

@mkszuba
Copy link
Contributor

mkszuba commented Feb 13, 2019

In that case please base this PR against schema_update_97 for now, as it is this really is not reviewer-friendly.

@tgrego tgrego changed the base branch from master to schema_update_97 February 13, 2019 11:38
Copy link
Contributor

@mkszuba mkszuba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much better! Just two minor issues as far as I am concerned.

$so_term = $ref->SEQUENCE_ONTOLOGY->{'term'};
};

unless ($so_term || $ref eq 'Bio::EnsEMBL::Feature' ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not like using unless for complex conditions, I end up mentally substituting it with if not and propagating the negation to individual conditions. Maybe it's just me, though.

throws_ok { $biotype1->so_acc('test') } qr/so_acc must be a Sequence Ontology accession/, 'so_acc() requires a SO acc like string';
throws_ok { $biotype1->object_type('test') } qr/object_type must be gene or transcript/, 'object_type() must be gene or transcript';

# test transcript biotype object
my $transcript = $gene->canonical_transcript;
debug("transcript biotype");
is($transcript->biotype, 'protein_coding', "Trancript biotype is protein_coding");
is($transcript->biotype, 'protein_coding', "Transcript biotype is protein_coding");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say this and the trailing-whitespace fixes in line 102 of this file + in line 41 of cds.t should go into a separate commit, those are a different logical change from the so_term thing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't want to edit commits just for this tiny thing, will keep attention to in the future commit separately when I spot those small issues.

Copy link
Contributor

@mkszuba mkszuba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Good to go as far as I am concerned.

@tgrego tgrego changed the base branch from schema_update_97 to master February 13, 2019 12:41
@tgrego tgrego changed the base branch from master to schema_update_97 February 15, 2019 21:12
@tgrego tgrego changed the base branch from schema_update_97 to master February 15, 2019 21:12
Copy link
Contributor

@markmcdowall markmcdowall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks fine.

Quick comment about the column definition and consequences.

@@ -2197,6 +2200,7 @@ CREATE TABLE biotype (
description TEXT,
biotype_group ENUM('coding','pseudogene','snoncoding','lnoncoding','mnoncoding','LRG','undefined','no_group') DEFAULT NULL,
so_acc VARCHAR(64),
so_term VARCHAR(1023),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you using such a large number for the varchar? Even though the used space is variable MySQL will have allotted memory units reserved for the storage of the string, which will be much larger in this case. This will reflect badly when sorting or temp tables are required.

Might be worth analysing the required size that is need and restricting what you are storing. Is worth adding a test in the code to ensure that there is not truncation of the inserted text at a later date.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That data is present and extracted from the ensembl_ontology_ database, term table, name column, which is a VARCHAR(1023).

It could be a TEXT instead, but it's usually better to be a VARCHAR if within it's limits and as you say it will only take as much space as it needs...
The usage is expected to be through the Biotype object, which is a select of a specific biotype name and object_type combination.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TEXT would be the wrong way, if the object does not need to be up to 1023 then limiting down to a smaller number will help MySQL allot the correct memory size.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I said, the source data is 1023 so it can theoretically be that large.
I do think it probably won't be, but looking at the sequenceontology website I can't find any real numbers for how large a term can be.
The only way to play safe is to make it as large as the source data can be.
The table is however quite small, and it's usage limited (and mostly as the select I mentioned).

@markmcdowall markmcdowall merged commit adb9673 into master Feb 19, 2019
@tgrego tgrego deleted the biotype_so_term branch May 15, 2019 09:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants