
Subsetting the SharedRaw_Pool #17

Closed
LTLA opened this issue Oct 2, 2018 · 3 comments

LTLA commented Oct 2, 2018

Parts of my workflow involve taking a subsequence and saving it to file as an RDS object. However, the way that subseq() works on the SharedRaw_Pool means that the entire sequence is still in the object:

library(Biostrings)
X <- DNAStringSet("ACACTACGACGATCGATCGATCGATCGA")
object.size(X)
## 4088 bytes
object.size(subseq(X, 1, 1))
## 4088 bytes

I understand the rationale for this - I do a very similar thing in InteractionSet - but is there a function to reconstruct the pool from only the parts of sequence that are in use? This will reduce the size of the objects being passed around, which would make life a lot easier for analyzing long-read sequencing data.

Session information
R version 3.5.0 Patched (2018-04-30 r74679)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS: /home/cri.camres.org/lun01/Software/R/R-3-5-branch/lib/libRblas.so
LAPACK: /home/cri.camres.org/lun01/Software/R/R-3-5-branch/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] Biostrings_2.49.1   XVector_0.21.3      IRanges_2.15.18    
[4] S4Vectors_0.19.19   BiocGenerics_0.27.1

loaded via a namespace (and not attached):
[1] zlibbioc_1.27.0 compiler_3.5.0  tools_3.5.0    

Collaborator

lawremi commented Oct 2, 2018

I think you could do an unlist() followed by a relist().

Contributor

hpages commented Oct 2, 2018

Yep, this is a consequence of the copy-less semantics of subseq() and other operations on DNAString/DNAStringSet objects:

X@pool
# SharedRaw_Pool of length 1
# 1: SharedRaw of length 28 (data starting at address 0x9fc52c8)

subseq(X, 5, 8)@pool
# SharedRaw_Pool of length 1
# 1: SharedRaw of length 28 (data starting at address 0x9fc52c8)

The sequences you "see" in your DNAStringSet object are just (internal) views on the pool of string data. And since the pool of string data is a list of external pointers to raw vectors, the entire thing gets serialized :-/

So before you serialize it, pass your object through compact(). This reconstructs the pool from only the regions that are in use:

compact(subseq(X, 5, 8))@pool
# SharedRaw_Pool of length 1
# 1: SharedRaw of length 4 (data starting at address 0xbd4aab0)
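
For the RDS workflow described at the top of the thread, a minimal sketch of the "compact, then serialize" step could look like the following. The temporary file names are hypothetical, and with a sequence this short the size difference on disk will be negligible; the saving should only become noticeable with long reads.

```r
library(Biostrings)

X <- DNAStringSet("ACACTACGACGATCGATCGATCGATCGA")
sub.X <- subseq(X, 5, 8)

# Hypothetical temporary files, used only for this illustration.
f_raw <- tempfile(fileext = ".rds")
f_compact <- tempfile(fileext = ".rds")

# Without compact(), the whole pool travels with the object.
saveRDS(sub.X, f_raw)

# With compact(), only the regions in use are kept before saving.
saveRDS(compact(sub.X), f_compact)

# compact() changes the internal representation, not the sequences.
stopifnot(identical(as.character(readRDS(f_compact)),
                    as.character(sub.X)))

# Compare the sizes on disk (in bytes).
file.size(f_raw)
file.size(f_compact)
```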

@lawremi The problem with relist(unlist(X), X) is that it will copy more data than necessary when the internal views are overlapping:

X2 <- extractAt(X[[1]], IRanges(5:8, 15))
X2
#   A DNAStringSet instance of length 4
#     width seq
# [1]    11 TACGACGATCG
# [2]    10 ACGACGATCG
# [3]     9 CGACGATCG
# [4]     8 GACGATCG

X2@pool
# SharedRaw_Pool of length 1
# 1: SharedRaw of length 28 (data starting at address 0x9fc52c8)

relist(unlist(X2), X2)@pool
# SharedRaw_Pool of length 1
# 1: SharedRaw of length 38 (data starting at address 0xcd64668)

compact(X2)@pool
# SharedRaw_Pool of length 1
# 1: SharedRaw of length 11 (data starting at address 0x17b88b8)
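
The three pool lengths above can be checked with a bit of arithmetic on the ranges. This is only a sketch: it assumes that relist(unlist()) stores every view separately and that compact() keeps the union of the regions in use, which is consistent with the reported lengths but is not taken from the implementation.

```r
library(Biostrings)

X <- DNAStringSet("ACACTACGACGATCGATCGATCGATCGA")
views <- IRanges(5:8, 15)          # four overlapping views, all ending at 15
X2 <- extractAt(X[[1]], views)

# Each view copied separately: 11 + 10 + 9 + 8 letters.
sum(width(X2))                     # 38, the length seen after relist(unlist())

# Union of the views: positions 5 to 15 of the original sequence.
sum(width(reduce(views)))          # 11, the length seen after compact()
```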

Hope this helps.
H.

Author

LTLA commented Oct 2, 2018

Thanks Herve, compact works well. Better than my hack, in any case:

as(as.character(sub.X), class(X))

LTLA closed this as completed Oct 2, 2018