Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENSCORESW-3349 changes made loops in parser for better performance. #503

Closed
wants to merge 1 commit into from

Conversation

ameya1981
Copy link

@ameya1981 ameya1981 commented Aug 13, 2020

Requirements

  • Filling out the template is required. Any pull request that does not include enough information to be reviewed in a timely manner may be closed at the maintainers' discretion;
  • Review the contributing guidelines for this repository; remember in particular:
    • do not modify code without testing for regression
    • provide simple unit tests to test the changes
    • if you change the schema you must patch the test databases as well, see Updating the schema
    • the PR must not fail unit testing

Description

Optimize loops in RefSeqCoordinateParser which runs for human xrefs. No change in output is expected.

Use case

Potential memory issues caused by holding objects in memory

Benefits

Lesser memory required for parsing

Possible Drawbacks

NA

Testing

There is no unit test for this parser.
Running the parser as a singlejob did not result into any memory failures(over multiple times).
The xref count from these changes is same as the xref count from e100 parser

If so, do the tests pass/fail?

NA

@ameya1981 ameya1981 changed the title ENSCORESW-3349 changes made loops in parser fo better performance. no… ENSCORESW-3349 changes made loops in parser for better performance. Aug 13, 2020
@coveralls
Copy link

Coverage Status

Coverage remained the same at 78.333% when pulling 41ebd28 on optimize/RefSeqCoordinateParser into b82332c on master.

Copy link

@magaliruffier magaliruffier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not very clear to me what these changes actually bring.
If the memory problem could not be reproduced with the original code, how do we know the changes improve things?

$add_dependent_xref_sth->execute($xref_id, $dependent_xref_id);
}
}

# Also store refseq protein as direct xref for ensembl translation, if translation exists
if (defined $tl && defined $tl_of) {
if ($tl_of->seq eq $tl->seq) {
if (md5($tl_of->seq) eq md5($tl->seq)) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what advantage does using the md5 bring?
we still call the seq method, so I am assuming the same query is still being run underneath

@@ -369,7 +387,7 @@ sub run_script {
# If a best match was defined for the refseq transcript, store it as direct xref for ensembl transcript
if ($best_id) {
my ($acc, $version) = split(/\./, $id);
$version =~ s/\D//g if $version;
$version =~ s/\D//g if $version;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cosmetic changes should be avoided, and if absolutely necessary, submitted as a separate commit to avoid masking the actual code changes

@magaliruffier
Copy link

can't confirm if changes are needed

@magaliruffier magaliruffier deleted the optimize/RefSeqCoordinateParser branch January 4, 2021 13:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants