Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AEJ-1617 Prevent Smith-Waterman from integer overflow #142

Merged
merged 9 commits into from
Feb 9, 2021

Conversation

pkolakow
Copy link
Contributor

@pkolakow pkolakow commented Feb 3, 2021

As a result of the current implementation limitations (integer data types), the maximum sequence length is 2^15 -1 ~ 32k bytes
Thus, the maximum matching value is ~ 2^16 + 2 (max_sequence_length * match_value <= int32_max_value)

Parameter validation was added to the SmithWaterman Java class:

  • Neither reference nor aligned sequences can be longer than 2^15-1
  • Matching value cannot be greater than 2^16 (rounded)

If any of those conditions is not satisfied IllegalArgumentException is raised

The worst-case scenario is tested:

  • Alignment of two identical sequences of the maximum length
  • Match value is set to 2^16

Comment on lines 7 to 8
public static final int MAX_SW_SEQUENCE_LENGTH = 32*1024-1; //2^15 - 1
public static final int MAX_SW_MATCH_VALUE = 64*1024; // 2^16
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not used here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you reference this in tests

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be match this name to that used in SW java class MAX_SEQUENCE_LENGTH

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the TestingUtils class to hold all common functions and variables across all tests. #134 tests also use this class. I wanted to show that these values are specific to SmithWaterman only. I didn't want to make any confusion if someone would write new tests for PairHMM or Deflater/Inflater in the future.

@@ -0,0 +1,24 @@
package com.intel.gkl.testing_utils;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use camel format for pkg name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The package names should be snake_cased but I can concatenate it and make testingutils instead.

byte[] align = Arrays.copyOf(ref, ref.length);

SWParameters SWparameters = new SWParameters(matchValue, -5, -10, -10);
SWNativeAlignerResult result = smithWaterman.align(ref, align, SWparameters, SWOverhangStrategy.IGNORE);
Copy link
Contributor

@SnehalA SnehalA Feb 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to consider other SW Overhang strategies?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overhanging strategies won't make a difference in this case. These strategies determine the way misaligned sequence ends are handled. In this test, we expect a full alignment. For instance, in the SOFTCLIP strategy there would be nothing to truncate.

Comment on lines 33 to 35
private final int MAX_SEQUENCE_LENGTH = 32*1024-1; // 2^15 - 1
private final int MAXIMUM_MATCH_VALUE = 64*1024; // 2^16

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good to add note here why these values, and for match values, but I think we should be adding checks where we compute the matrix score, th8s?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants