Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mapFromAlignments() and soft-clipping #12

Open
hpages opened this issue Feb 14, 2020 · 6 comments
Open

mapFromAlignments() and soft-clipping #12

hpages opened this issue Feb 14, 2020 · 6 comments
Assignees

Comments

@hpages
Copy link
Contributor

hpages commented Feb 14, 2020

This looks wrong:

library(GenomicAlignments)
mypos <- IRanges(1:7, width=1, names=rep("read1", 7))
alignment <- GAlignments("chr1", pos=1L, cigar="2S3M2D7M", strand("+"), names="read1")
mapFromAlignments(mypos, alignment)
# IRanges object with 7 ranges and 0 metadata columns:
#             start       end     width
#         <integer> <integer> <integer>
#   read1         1         1         1
#   read1         2         2         1
#   read1         3         3         1
#   read1         4         4         1
#   read1         5         5         1
#   read1         8         8         1
#   read1         9         9         1

The reported deletion (ref pos 6 & 7) seems wrong. Should be ref pos 4 & 5.

sessionInfo():

> sessionInfo()
R Under development (unstable) (2019-10-30 r77336)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS

Matrix products: default
BLAS:   /home/hpages/R/R-4.0.r77336/lib/libRblas.so
LAPACK: /home/hpages/R/R-4.0.r77336/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] GenomicAlignments_1.23.1    Rsamtools_2.3.4            
 [3] Biostrings_2.55.4           XVector_0.27.0             
 [5] SummarizedExperiment_1.17.2 DelayedArray_0.13.4        
 [7] BiocParallel_1.21.2         matrixStats_0.55.0         
 [9] Biobase_2.47.2              GenomicRanges_1.39.2       
[11] GenomeInfoDb_1.23.13        IRanges_2.21.3             
[13] S4Vectors_0.25.12           BiocGenerics_0.33.0        

loaded via a namespace (and not attached):
 [1] lattice_0.20-38        crayon_1.3.4           bitops_1.0-6          
 [4] grid_4.0.0             zlibbioc_1.33.1        Matrix_1.2-18         
 [7] tools_4.0.0            RCurl_1.98-1.1         compiler_4.0.0        
[10] GenomeInfoDbData_1.2.2
@hpages hpages self-assigned this Feb 14, 2020
@laurenfitch
Copy link

laurenfitch commented Feb 14, 2020

Hi Herve,
I don't know if this is related but I've seen some more odd behavior while using the package today. Please see the example below. The first 2 positions correspond to the 2 Xs. mapFromAlignments reduced the width of the second range to 0.

> mypos <- IRanges(c(52, 67, 68), width = 1, names=rep("read1", 3))
> mapFromAlignments(mypos, GAlignments("chr1", pos=1L, cigar="27S39=1I11=1X14=1X5=", strand("+"), names="read1"))
IRanges object with 3 ranges and 0 metadata columns:
            start       end     width
        <integer> <integer> <integer>
  read1        52        52         1
  read1        67        66         0
  read1        67        67         1

@gaoce
Copy link

gaoce commented Jun 9, 2023

The issue is at

int to_ref(int query_loc, const char *cig0, int pos, Rboolean narrow_left)
{
int ref_loc = query_loc + pos - 1;
int n, offset = 0, OPL, query_consumed = 0;
char OP;
while (query_consumed < query_loc &&
(n = _next_cigar_OP(cig0, offset, &OP, &OPL)))
{
switch (OP) {
/* Alignment match (can be a sequence match or mismatch) */
case 'M': case '=': case 'X':
query_consumed += OPL;
break;
/* Insertion to the reference */
case 'I': {
int width_from_insertion_start = query_loc - query_consumed;
Rboolean query_loc_past_insertion = width_from_insertion_start > OPL;
if (query_loc_past_insertion) {
ref_loc -= OPL;
} else {
ref_loc -= width_from_insertion_start;
if (!narrow_left) {
ref_loc += 1;
}
}
query_consumed += OPL;
break;
}
/* Soft clip on the read */
case 'S':
query_consumed += OPL;
break;
/* Deletion from the reference */
case 'D':
case 'N': /* Skipped region from reference; narrow to query */
ref_loc += OPL;
break;
/* Hard clip on the read */
case 'H':
break;
/* Silent deletion from the padded reference */
case 'P':
break;
default:
break;
}
offset += n;
}
if (n == 0)
ref_loc = NA_INTEGER;
return ref_loc;
}

In particular

/* Soft clip on the read */
case 'S':
query_consumed += OPL;
break;

Although query_loc is supposed to have trimmed positions on the query, i.e., after soft clipping, this block of code still consume the clipping on the read.

To demonstrate, if you remove the clipping from the alignment, everything looks fine

mypos <- IRanges(1:7, width=1, names=rep("read1", 7))
alignment <- GAlignments("chr1", pos=1L, cigar="3M2D7M", strand("+"), names="read1")
mapFromAlignments(mypos, alignment)
IRanges object with 7 ranges and 0 metadata columns:
            start       end     width
        <integer> <integer> <integer>
  read1         1         1         1
  read1         2         2         1
  read1         3         3         1
  read1         6         6         1
  read1         7         7         1
  read1         8         8         1
  read1         9         9         1

If you increase the clipping size, even the last 2 positions are wrong:

mypos <- IRanges(1:7, width=1, names=rep("read1", 7))
alignment <- GAlignments("chr1", pos=1L, cigar="7S3M2D7M", strand("+"), names="read1")
mapFromAlignments(mypos, alignment)
IRanges object with 7 ranges and 0 metadata columns:
            start       end     width
        <integer> <integer> <integer>
  read1         1         1         1
  read1         2         2         1
  read1         3         3         1
  read1         4         4         1
  read1         5         5         1
  read1         6         6         1
  read1         7         7         1

Also, soft clipping on the other side is also an issue:

mypos <- IRanges(10:11, width=1, names=rep("read1", 2))
alignment <- GAlignments("chr1", pos=1L, cigar="3M2D7M2S", strand("+"), names="read1")
mapFromAlignments(mypos, alignment)
IRanges object with 2 ranges and 0 metadata columns:
            start       end     width
        <integer> <integer> <integer>
  read1        12        12         1
  read1        13        13         1

One way to fix this is to change this function (to_ref) to not consume the soft clipped bases.

@hpages
Copy link
Contributor Author

hpages commented Jun 10, 2023

Hi @gaoce ,

Thanks for taking the time to look at the code. Would you be willing to submit a PR? Problem is that I'm not familiar with the implementation of mapFromAlignments() (the function was implemented many years ago by someone who's left the Bioconductor sphere), and I never really found the time to familiarize myself with it.

Best,
H.

@gaoce
Copy link

gaoce commented Jun 13, 2023

Hi @hpages,

Sure. I will find some time to work on it.

@fedxa
Copy link
Contributor

fedxa commented Jan 7, 2024

Hi @gaoce and @hpages,

I've bumped in the same problem identified by @hpages. I'd like to point out some further subtleties in mapFrom and mapTo with soft/hard clips. While the genome coordinates are uniquely defined for an alignment, the coordinates within the alignement itself can be defined in several ways: within the mapped part of the read without clipped parts, or counting the soft (or hard) clipped parts also. Note, that both options can be useful depending on the question asked. For example, for soft clipped reads the whole sequence (and qualities) is stored in the record, so indexing into it requires the coordinates including the clipped parts. In case of hard clipped alignments usually discarding the clipped parts is sensible, but for liftover like tasks, when alignment is not a read but an alternative shifted genome, it is required to get coordinates in the alignment including hard clips (this is the case with minimap2 assembly to assembly mapping).

For example, for hard clipped alignment GAlignments("chr1", pos=3L, cigar="1H2M1H", strand("+"), names="read1") The possible mappings can be

Mapped Read portion Pos      1 2
Read seqeuence Pos           1 2
Read position                2 3  
Read CIGAR                 H M M H
chr1 pos                 1 2 3 4 5 6

And for the soft clipped case GAlignments("chr1", pos=3L, cigar="1S2M1S", strand("+"), names="read1")

Mapped Read portion Pos      1 2
Read seqeuence Pos           2 3 
Read position                2 3  
Read CIGAR                 S M M S
chr1 pos                 1 2 3 4 5 6

To add to the confusion, in the current implementation mapFromAlignments uses the convention "mapped read position", and mapToAlignments uses "read sequence position"... This makes these two operations not to be each other inverse in case of the soft clipped alignments.

I think that a good option is to add an additional parameter style = mapped|sequence|readpos to the mapping functions, which would select one of the three options, with the default being the current implementation (mapped for mapFromAlignments and sequence for mapToAlignments). Then the existing code should not break, while allowing for all options to be selected for each use case. I am adding a pull request #35 that implements this, and fixes the original bug mentioned in the issue.

@fedxa
Copy link
Contributor

fedxa commented Jan 7, 2024

So, the suggested behaviour should be like this

readpos <- GRanges("r1",IRanges(1:6, width=1, names=rep("read1",6)))
chrpos <- GRanges("chr1",IRanges(1:6, width=1, names=LETTERS[1:6]))

Suggested mapFromAlignments behaviour

In this case "mapped" is the default if called without style parameter

alignment <- GAlignments("chr1", pos=3L, cigar="2M", strand("+"), names="read1")
mapFromAlignments(readpos, alignment)

## GRanges object with 2 ranges and 2 metadata columns:
##         seqnames    ranges strand |     xHits alignmentsHits
##            <Rle> <IRanges>  <Rle> | <integer>      <integer>
##   read1     chr1         3      * |         1              1
##   read1     chr1         4      * |         2              1
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

alignment <- GAlignments("chr1", pos=3L, cigar="1H2M1H", strand("+"), names="read1")
mapFromAlignments(readpos, alignment, style="mapped")

## GRanges object with 2 ranges and 2 metadata columns:
##         seqnames    ranges strand |     xHits alignmentsHits
##            <Rle> <IRanges>  <Rle> | <integer>      <integer>
##   read1     chr1         3      * |         1              1
##   read1     chr1         4      * |         2              1
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

mapFromAlignments(readpos, alignment, style="sequence")

## GRanges object with 2 ranges and 2 metadata columns:
##         seqnames    ranges strand |     xHits alignmentsHits
##            <Rle> <IRanges>  <Rle> | <integer>      <integer>
##   read1     chr1         3      * |         1              1
##   read1     chr1         4      * |         2              1
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

mapFromAlignments(readpos, alignment, style="readpos")

## GRanges object with 2 ranges and 2 metadata columns:
##         seqnames    ranges strand |     xHits alignmentsHits
##            <Rle> <IRanges>  <Rle> | <integer>      <integer>
##   read1     chr1         3      * |         2              1
##   read1     chr1         4      * |         3              1
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

alignment <- GAlignments("chr1", pos=3L, cigar="1S2M1S", strand("+"), names="read1")
mapFromAlignments(readpos, alignment, style="mapped")

## GRanges object with 2 ranges and 2 metadata columns:
##         seqnames    ranges strand |     xHits alignmentsHits
##            <Rle> <IRanges>  <Rle> | <integer>      <integer>
##   read1     chr1         3      * |         1              1
##   read1     chr1         4      * |         2              1
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

mapFromAlignments(readpos, alignment, style="sequence")

## GRanges object with 2 ranges and 2 metadata columns:
##         seqnames    ranges strand |     xHits alignmentsHits
##            <Rle> <IRanges>  <Rle> | <integer>      <integer>
##   read1     chr1         3      * |         2              1
##   read1     chr1         4      * |         3              1
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

mapFromAlignments(readpos, alignment, style="readpos")

## GRanges object with 2 ranges and 2 metadata columns:
##         seqnames    ranges strand |     xHits alignmentsHits
##            <Rle> <IRanges>  <Rle> | <integer>      <integer>
##   read1     chr1         3      * |         2              1
##   read1     chr1         4      * |         3              1
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Suggested mapToAlignments behaviour

In this case "sequence" is the default if called without style parameter

alignment <- GAlignments("chr1", pos=3L, cigar="2M", strand("+"), names="read1")
mapToAlignments(chrpos, alignment)

## GRanges object with 2 ranges and 2 metadata columns:
##     seqnames    ranges strand |     xHits alignmentsHits
##        <Rle> <IRanges>  <Rle> | <integer>      <integer>
##   C    read1         1      * |         3              1
##   D    read1         2      * |         4              1
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

alignment <- GAlignments("chr1", pos=3L, cigar="1H2M1H", strand("+"), names="read1")
mapToAlignments(chrpos, alignment, style="mapped")

## GRanges object with 2 ranges and 2 metadata columns:
##     seqnames    ranges strand |     xHits alignmentsHits
##        <Rle> <IRanges>  <Rle> | <integer>      <integer>
##   C    read1         1      * |         3              1
##   D    read1         2      * |         4              1
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

mapToAlignments(chrpos, alignment, style="sequence")

## GRanges object with 2 ranges and 2 metadata columns:
##     seqnames    ranges strand |     xHits alignmentsHits
##        <Rle> <IRanges>  <Rle> | <integer>      <integer>
##   C    read1         1      * |         3              1
##   D    read1         2      * |         4              1
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

mapToAlignments(chrpos, alignment, style="readpos")

## GRanges object with 2 ranges and 2 metadata columns:
##     seqnames    ranges strand |     xHits alignmentsHits
##        <Rle> <IRanges>  <Rle> | <integer>      <integer>
##   C    read1         2      * |         3              1
##   D    read1         3      * |         4              1
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

alignment <- GAlignments("chr1", pos=3L, cigar="1S2M1S", strand("+"), names="read1")
mapToAlignments(chrpos, alignment, style="mapped")

## GRanges object with 2 ranges and 2 metadata columns:
##     seqnames    ranges strand |     xHits alignmentsHits
##        <Rle> <IRanges>  <Rle> | <integer>      <integer>
##   C    read1         1      * |         3              1
##   D    read1         2      * |         4              1
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

mapToAlignments(chrpos, alignment, style="sequence")

## GRanges object with 2 ranges and 2 metadata columns:
##     seqnames    ranges strand |     xHits alignmentsHits
##        <Rle> <IRanges>  <Rle> | <integer>      <integer>
##   C    read1         2      * |         3              1
##   D    read1         3      * |         4              1
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

mapToAlignments(chrpos, alignment, style="readpos")

## GRanges object with 2 ranges and 2 metadata columns:
##     seqnames    ranges strand |     xHits alignmentsHits
##        <Rle> <IRanges>  <Rle> | <integer>      <integer>
##   C    read1         2      * |         3              1
##   D    read1         3      * |         4              1
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants