Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mapping to genome fails without apparent reason #46

Open
jorainer opened this issue Nov 7, 2017 · 1 comment
Open

Mapping to genome fails without apparent reason #46

jorainer opened this issue Nov 7, 2017 · 1 comment
Assignees

Comments

@jorainer
Copy link
Contributor

jorainer commented Nov 7, 2017

Got the following example

> library(ensembldb)
> library(AnnotationHub)
> edb <- query(AnnotationHub(), c("Ensembl 90 EnsDb", "Homo sapiens"))[[1]]
> library(Pbase)
> ## Fetch the protein ENSP00000269305
> prot <- Proteins(edb, filter = ~ protein_id == "ENSP00000269305")
> ## Mapping the protein domains to the genome works
> mapToGenome(prot, edb)
Fetch coding region for proteins ... OK
GRangesList object of length 1:
$ENSP00000269305 
GRanges object with 26 ranges and 5 metadata columns:
           seqnames             ranges strand |           tx_id
              <Rle>          <IRanges>  <Rle> |     <character>
  SSF47719       17 [7670638, 7670715]      - | ENST00000269305
   PF07710       17 [7670638, 7670715]      - | ENST00000269305
   PR00386       17 [7670659, 7670715]      - | ENST00000269305
   PR00386       17 [7673535, 7673552]      - | ENST00000269305
  SSF47719       17 [7673535, 7673573]      - | ENST00000269305
       ...      ...                ...    ... .             ...
   PR00386       17 [7675994, 7676023]      - | ENST00000269305
  SSF49417       17 [7675994, 7676080]      - | ENST00000269305
   PF00870       17 [7675994, 7676086]      - | ENST00000269305
   PF08563       17 [7676391, 7676403]      - | ENST00000269305
   PF08563       17 [7676521, 7676582]      - | ENST00000269305
                                                                                                                                                                                                       pepseq
                                                                                                                                                                                                  <character>
  SSF47719                                                                                                                                                            KKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGK
   PF07710                                                                                                                                                            KKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGK
   PR00386                                                                                                                                                                          EYFTLQIRGRERFEMFRELNEALEL
   PR00386                                                                                                                                                                          EYFTLQIRGRERFEMFRELNEALEL
  SSF47719                                                                                                                                                            KKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGK
       ...                                                                                                                                                                                                ...
   PR00386                                                                                                                                                                        SGTAKSVTCTYSPALNKMFCQLAKTCP
  SSF49417    VPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEE
   PF00870 SSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEEN
   PF08563                                                                                                                                                                          QSDPSVEPPLSQETFSDLWKLLPEN
   PF08563                                                                                                                                                                          QSDPSVEPPLSQETFSDLWKLLPEN
                 accession exonJunctions     group
               <character>     <logical> <integer>
  SSF47719 ENSP00000269305          TRUE         2
   PF07710 ENSP00000269305          TRUE         4
   PR00386 ENSP00000269305          TRUE         7
   PR00386 ENSP00000269305          TRUE         7
  SSF47719 ENSP00000269305          TRUE         2
       ...             ...           ...       ...
   PR00386 ENSP00000269305          TRUE         8
  SSF49417 ENSP00000269305          TRUE         3
   PF00870 ENSP00000269305          TRUE         5
   PF08563 ENSP00000269305          TRUE         6
   PF08563 ENSP00000269305          TRUE         6

-------
seqinfo: 1 sequence from GRCh38 genome

So, mapping works. But if a single IRanges is added it fails:

> ## Replacing/adding ranges within the protein is a little tedious at present -
> irl <- IRangesList(ENSP00000269305 = IRanges(start = 281, end = 391))
> mcols(prot@aa)$region <- irl
> mapToGenome(prot, edb, pcol = "region")
Fetch coding region for proteins ... OK
GRangesList object of length 0:
<0 elements>

-------
seqinfo: no sequences
Warning message:
Mapping failed. Returning an empty range. Last message was: Error in value[[3L]](cond): unused argument (cond)
 

Interstingly, it works if one of the protein domains is selected:

> irl <- IRangesList(ENSP00000269305 = IRanges(start = 237, end = 249))
> mcols(prot@aa)$region <- irl
> mapToGenome(prot, edb, pcol = "region")
Fetch coding region for proteins ... OK
GRangesList object of length 1:
$ENSP00000269305 
GRanges object with 1 range and 5 metadata columns:
      seqnames             ranges strand |           tx_id        pepseq
         <Rle>          <IRanges>  <Rle> |     <character>   <character>
  [1]       17 [7674216, 7674254]      - | ENST00000269305 MCNSSCMGGMNRR
            accession exonJunctions     group
          <character>     <logical> <integer>
  [1] ENSP00000269305         FALSE         1

-------
seqinfo: 1 sequence from GRCh38 genome
@jorainer
Copy link
Contributor Author

I've rewritten big part of the mapping functionality (based on the original code in Pbase) in ensembldb and now it works nicely:

> library(ensembldb)
> library(AnnotationHub)
> edb <- query(AnnotationHub(), c("Ensembl 90 EnsDb", "Homo sapiens"))[[1]]
snapshotDate(): 2017-10-27
downloading 0 resources
loading from cache 
    '/Users/jo//.AnnotationHub/64495'

> ## Define the region within the protein
> rng <- IRanges(start = 281, end = 391, names = "ENSP00000269305")
> ## Now map the coordinates
> proteinToGenome(rng, edb)
Fetching CDS for 1 proteins ... 1 found
Checking CDS and protein sequence lengths ... 1/1 OK
$ENSP00000269305
GRanges object with 4 ranges and 7 metadata columns:
      seqnames             ranges strand |      protein_id           tx_id
         <Rle>          <IRanges>  <Rle> |     <character>     <character>
  [1]       17 [7673701, 7673779]      - | ENSP00000269305 ENST00000269305
  [2]       17 [7673535, 7673608]      - | ENSP00000269305 ENST00000269305
  [3]       17 [7670609, 7670715]      - | ENSP00000269305 ENST00000269305
  [4]       17 [7669618, 7669690]      - | ENSP00000269305 ENST00000269305
              exon_id exon_rank    cds_ok protein_start protein_end
          <character> <integer> <logical>     <integer>   <integer>
  [1] ENSE00003725258         8      TRUE           281         391
  [2] ENSE00003786593         9      TRUE           281         391
  [3] ENSE00003545950        10      TRUE           281         391
  [4] ENSE00003605891        11      TRUE           281         391
  -------
  seqinfo: 1 sequence from GRCh38 genome

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant