Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Necessary Updates to the Mapper code [ENSCORESW-2168]. #332

Merged
merged 90 commits into from
Mar 8, 2019

Conversation

avullo
Copy link

@avullo avullo commented Nov 13, 2018

Requirements

  • Filling out the template is required. Any pull request that does not include enough information to be reviewed in a timely manner may be closed at the maintainers' discretion;
  • Review the contributing guidelines for this repository; remember in particular:
    • do not modify code without testing for regression
    • provide simple unit tests to test the changes
    • if you change the schema you must patch the test databases as well, see Updating the schema
    • the PR must not fail unit testing

Description

This is a rebase onto master of branch experimental/mapper_update, which is meant to address most of the Mapper issues documented under [ENSCORESW-2168].

Use case

The current mapper code does not correctly handle several edge cases and emits an incorrect output. See [ENSCORESW-2168] for more details.

Benefits

A much more robust mapper code. Among several things, it's making the Genebuild Loutre import pipeline work correctly.

Possible Drawbacks

The Mapper::map_coordinates method could be slower under certain circumstances, i.e. when mapping from several non overlapping intervals as it has now to take into account the cost of building and querying an interval tree. Querying is more expensive than current binary search in case the tree is unbalanced. We might use the provided mutable tree implementation since we don't incur in the cost associated with keeping the tree balanced as we're not inserting or removing intervals after construction. This might also have the benefit of exploiting an existing XS implementation.

Testing

Have you added/modified unit tests to test the changes?

Yes

If so, do the tests pass/fail?

Yes.

Have you run the entire test suite and no regression was detected?

Yes. It fails locally but on just one unrelated test (sliceVariation.t) which is due the MySQL version I have (5.7) which is currently incompatible with the Variation API. The very same fail occurs also on master.

Alessandro Vullo added 30 commits November 13, 2018 16:19
…ntered interval tree, to be used in the mapper instead of binary search.
…mapped coordinates. More uniform function prototype to allow transparent calls from Slice project method.
… using the values returned by invoking the

updated mapper interfaces which return the original and mapped coordinates. The coordinates are readjusted
consider the slice strand.

There are some minor optimisations which avoid going through unnecessary code fragments in certain circumnstances.
…rval trees.

It loads XS version if it's installed, otherwise fall back to PP implementation.
@magaliruffier
Copy link

This still needs to be resolved, with feedback from the Infrastructure team and updates from Alessandro

Copy link
Contributor

@mkszuba mkszuba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly, needs quite a bit more work. In addition to my comments inline, isn't there some sort of generic tree implementation we could use instead of rolling our own?

@@ -224,7 +224,7 @@ sub map {
throw('Incorrect number of arguments.') if (!( @_ >= 6));

my ( $self, $frm_seq_region_name, $frm_start, $frm_end, $frm_strand,
$frm_cs, $to_slice )
$frm_cs, $dummy, $to_slice, $include_org_coord )
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the argument list in POD accordingly. If possible, fix the things which were out of date before the change as well.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@@ -277,7 +277,7 @@ sub map {
throw('Incorrect number of arguments.') if(@_ < 6);

my ($self, $frm_seq_region_name, $frm_start,
$frm_end, $frm_strand, $frm_cs, $fastmap, $to_slice) = @_;
$frm_end, $frm_strand, $frm_cs, $fastmap, $to_slice, $include_org_coord) = @_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

@@ -149,6 +149,9 @@ use Bio::EnsEMBL::Mapper::IndelCoordinate;
use Bio::EnsEMBL::Mapper::Gap;

use Bio::EnsEMBL::Utils::Exception qw(throw);
use Bio::EnsEMBL::Utils::Interval;
use Bio::EnsEMBL::Utils::Tree::Interval::Immutable;
# use Bio::EnsEMBL::Utils::Tree::Interval::Mutable;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove commented-out code.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.


###
#
map { push @{$from_intervals},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not a fan of using map in void context and writing output inline.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

# Enable this as some preliminary experiments show this is slightly faster
# on the test suite.
#
# my $itree = Bio::EnsEMBL::Utils::Tree::Interval::Mutable->new();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commented-out code.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.


Bio::EnsEMBL::Utils::Tree::Interval::Mutable

=head1 SYNOPSIS
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And more...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(still here)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.


=head1 DESCRIPTION

=head1 METHODS
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And more...

use Test::Exception;
use Test::Warnings qw(warning);

use Data::Dumper;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This module isn't used anywhere in this file, is it?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.


use strict;

use Data::Dumper;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This module isn't used anywhere in this file, is it?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.


use strict;

use Data::Dumper;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This module isn't used anywhere in this file, is it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(still here)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Copy link
Contributor

@mkszuba mkszuba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just flagged two previously requested changes that have been missed. Other than those two, fine by me.


Bio::EnsEMBL::Utils::Tree::Interval::Mutable

=head1 SYNOPSIS
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(still here)


use strict;

use Data::Dumper;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(still here)

@tgrego
Copy link
Contributor

tgrego commented Feb 8, 2019

@avullo can you please merge master into your branch and push it just so we can make sure travis is happy with it all?
thanks

@avullo
Copy link
Author

avullo commented Feb 8, 2019

@tgrego just done

modules/t/interval_tree_mutable.t Outdated Show resolved Hide resolved
modules/t/interval_tree_mutable.t Outdated Show resolved Hide resolved
modules/t/interval_tree_mutable.t Show resolved Hide resolved
$start_idx = 0;
$end_idx = $#{$lr};
foreach my $i (@{$lr}) {
my $start = $i->{$from}{start}<=$i->{$from}{end}?$i->{$from}{start}:$i->{$from}{end};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't the clearest. Could this and the line below get split into an single if. The $start and $end can get created as $i->{$from}{start} and $i->{$from}{end} respectively then modified if they fail the if condition. This would also reduce the checks

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Mark, this is addressed now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That look much better, thank you

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The points against using an off the shelf implementation:

  • minimise reliance on external libraries (i.e. Set::IntervalTree)
  • Tree::Interval only handles the non-overlapping intervals case
  • ability to provide optimize (i.e. XS) code, when possible
  • ability to provide different suitable implementations (i.e. mutable/immutable) depending on the use case

Copy link
Contributor

@markmcdowall markmcdowall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than some minor clarity changes and some more docs, this is looking good.

I'd like to reiterate @mkszuba question about the implementation of tree interval code and why we could not use an off the shelf implementation.

$normal_slice->end());
}
my @coords = $asm_mapper->map($normal_slice->seq_region_name(),
$normal_slice->start(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove all tabs

$normal_slice->end());
}
my @coords = $asm_mapper->map($normal_slice->seq_region_name(),
$normal_slice->start(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove the tabs

if(@coords > 1) {
return @coords;
} elsif ($include_org_coord) {
return @coords unless $coords[0]{mapped}->isa('Bio::EnsEMBL::Mapper::Gap');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the unless get moved to the elsif block

} elsif ($include_org_coord) {
return @coords unless $coords[0]{mapped}->isa('Bio::EnsEMBL::Mapper::Gap');
} else {
return @coords unless $coords[0]->isa('Bio::EnsEMBL::Mapper::Gap');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this not get handled as part of the if rather than a postfix in the else section?

return sort_by_begin(uniq($result));
}

sub _query_point {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is the pod section?

_load($PP);
}

sub _load {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pod docs?

my $left = $self->left;

$left->parent($self->parent);
unless (defined $left->parent) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this not be swapped so the if (defined $right->parent) then perform what is in the else block

my $right = $self->right;

$right->parent($self->parent);
unless (defined $right->parent) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this not be swapped so the if (defined $right->parent) then perform what is in the else block

}
} else {
# insert into right subtree
unless (defined $self->right) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this not be swapped so the if (defined $self->right) then perform what is in the else block otherwise$self->right(Bio::EnsEMBL::Utils...


if ($i->start < $self->key) {
# insert into left subtree
unless (defined $self->left) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this not be swapped so the if (defined $self->left) then perform what is in the else block otherwise$self->left(Bio::EnsEMBL::Utils...

@tgrego
Copy link
Contributor

tgrego commented Mar 8, 2019

Looks like no more changes will be added.
As there are no blockers, I will be merging later today unless I hear something against it.

@avullo
Copy link
Author

avullo commented Mar 8, 2019

Pardon I don't have time to address the latest comments. They appear to be mostly stylistic in nature, something which can later be added. Moreover, POD's missing only for the private methods, which was intentional.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants