-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Encoding ASCII up to 255 #65
Comments
Also, how do I pass ASCII (raw) value 0 to BStringSet? This works:
This doesn't work
|
Hi,
The '?' you see when displaying
You want to avoid using the character string intermediate representation when going from raw to BString. One way to do this is:
Again, display is broken:
But content is correct:
2 things that would improve the current situation: (1) make it easier to go from raw to BString or BStringSet, and (2) fix display. Unfortunately I don't have the resources to work on these improvements at the moment but PRs are welcome. Thanks, |
Hi. quite late to the conversation. What the rational for these values? I am not familiar with what the values represent. I assume some kind of quality values. In Felix |
These ascii values are encoding kinetic information for PACbio sequencing data. they are a standard part of the Pacbio data format. So it would be useful to have ability to convert integers from 0 to 255. |
From experience, representing that with Example: For DNA, the values 1,2,4,8 (4 bits) are used to store letter G, A, T, C (at least 8 bit; in case of ASCII 7 bit). Only during display to console, does the conversion take place. The following snippet throws an error, but this is just because the lookup
So you need to be careful when debugging your code, that you not mistake "data stored" and "representation shown".
So you might have the problem, that without realizing, you did an integer to character conversion, which you actually don't want to do. You want to keep the value as is, if I understand you correctly. The I could image, that you need something more like this, since quality values can be stored using the
The interesting bit is, that you need to use the
However, also in
So you might want to consider at least three steps:
Personally, I would start from the last step. |
Hint: this is a bit confusing. Do you want to store ASCII or integer values from 0 to 255? It is either or. Both cannot be true at the same time. ASCII also just has 128 values and only some are human readable: https://en.wikipedia.org/wiki/ASCII |
Sorry, I by mistake wrote ASCII. What I mean is simply integer values from 0 to 255. Here is an example tag from a read in a BAM file: Per BAM specifications, the 'B:C' type of tags encode an array of unsigned 8-bit integers, i.e. 0 to 255. Because I need to use the sequenceLayer command on these arrays, I need to load them as BStringSet objects. What I do for these arrays, where x is the array is: So the question is if there is an easier / better way to do this conversion. |
Not at the moment but the good news is that we finally might have the resources to work on this soon. |
That would be great - thanks! |
This will convert it to a character nonetheless instead of retaining the original integer value. In addition you are not covering the standard, since you cannot represent 0 that way
|
Ok. I'm assuming per @hpages that you all will be able to create a better way to do this soon. |
@FelixErnst
The strings in XString/XStringSet objects are not 0-terminated like C strings. So 0 can actually be part of an XString/XStringSet object. As explained earlier in this thread, the error you are showing above is an issue with the |
Yes, you are right. With the conversion the constructor is avoided, which either would call
So I guess this is a yes for the second point. |
Finally getting around to this--I'm working on adding an optional parameter to display extended ASCII values for BStrings. It would be ideal for to maintain the "1 position = 1 character" display type for other strings, meaning that all values in 0:255 map to exactly one character. One option to do that is to use some standard set of single character codes that supports values in 0-255. It happens that the Braille character set has exactly 256 characters (encoded in unicode This would look like this:
It's not super readable, but our options are relatively limited when it comes to maintaining readability with a 256 character set...it's at least more readable than the current implementation. Definitely open to other suggestions on that front, for now I'm just looking to have something that can display at all. Most of the Currently working on a more robust version, I'll update with a link when it's in a semi-ready state. |
Example implementation is available at ahl27@494d81a
Edit: all values are now supported. Examples below.
Again, it's not the most beautiful thing in the world, but it does display all possible 8-bit integer values in a consistent format. Still a WIP, definitely open to any suggestions on better character sets we could use. Long term it's going to need a little bit more thought to ensure that things like Long-term, it definitely makes sense to make a more robust |
Thanks @ahl27. My suggestion would be to map only non-printable ASCII codes to these special characters. But for printable ASCII codes (i.e. codes >= 32 and <= 126) I think we should just stick to their usual representation. |
Hi all, As the original poster of this issue - thank you for the efforts! I also wanted to point out another situation / consideration: doing sequenceLayer on a BStringSet, the sequenceLayer command injects custom ASCII characters for D, N, I, S, H CIGAR operations specified by the D.letter, N.letter, I.letter, S.letter, and H.letter parameters. However, if the input BStrings already span the full ASCII character range, then there is no way to distinguish the input ASCII characters from these new injected ASCII characters. I think sequenceLayer should have a way to somehow inject ASCII characters that are outside the range of the BString range. Not sure how that might be possible, but this is a use case that I think is relevant to the above. Or perhaps if you want I can post it as a separate issue. |
To Hervé's comment--That's a good point...I'll take a longer look at it tomorrow. Maybe there's a cleaner solution than this first pass. Maybe it would be easier to just build a new codec for If people want to be able to compare strings following an |
Just to add - I think that sequenceLayer also would need to be changed to allow assignment of characters for special CIGAR operations outside the range of 0 to 255, so that these operations can be distinguished from input data that spans 0 to 255. |
@gevro thats an interesting use-case...I'm unfortunately not sure if it's possible without a significant overhaul of the entire The issue is that the internal calls to pull subsequences of a Maybe @hpages can confirm that that assessment is correct, he may have other ideas that make sense. Otherwise, I'd suggest the best option is likely:
|
Thanks for your thoughts. I defer to @hpages of course on potential future ways of addressing this additional issue. Or if there is a work-around (I can't think of one). |
Just some more further thoughts, since I've been thinking about this for a while now... For completeness sake, I should also mention that there is an argument to be made that non-ASCII values aren't something we should support. It doesn't seem like the information you're trying to load is a sequence at all, and measures like quality scores are already supported for arbitrary values using things like Maybe I'm misunderstanding your previous comments, so perhaps more information on what the input data is would help. "Kinetic information for PacBio sequencing reads" indicates sequencing metadata, and according to the PacBio datastandard, That said, @FelixErnst has already mentioned that there's a usage for nonstandard ASCII values in Sorry for all the back and forth, I was excited about working on this and jumped into it a little too fast--should've reread the discussion a few more times first. |
Thanks very much for trying to help here. Hopefully I can clarify why I'm stuck a bit better. I'm analyzing kinetic PacBio data, from the tags 'ip' and 'pw' in PacBio BAM files. These are encoded as 'B,C' type tags: https://pacbiofileformats.readthedocs.io/en/13.0/BAM.html . Per SAM specifications, 'B,C' type tags are unsigned 8-bit integers, i.e. values of 0 to 255: https://github.com/samtools/hts-specs/blob/master/SAMv1.pdf. These ip and pw tags have one value per query base. My goal is to then convert these from query to reference coordinates, using the 'sequenceLayer' function, which is the only function I know of that can take a string/array of values and convert from query to reference coordinates given a CIGAR alignment string. Since 'sequenceLayer' only takes XStringSet objects as input, I'm forced to store the ip and pw values for all the reads as a BStringSet, which is then input into sequenceLayer. But the issue is that sequenceLayer injects deletions as an ASCII character between 0 to 255 in the new reference-coordinate transformed string. And I can't differentiate the injected deletion character from any identical values in the original ip / pw tags. Hopefully that makes sense now? Given that, is there any feasible work-around? |
Just a thought I had: maybe a solution for this situation would be if 'sequenceLayer' could also accept and return XRaw or another type of object that allows multi-byte values (i.e. range bigger than 0 - 255)? That way I don't have to convert to BString/BStringSet before running 'sequenceLayer', and then the injected insertion/deletion/etc characters can be set to values outside of the 0-255 range. Just note, this is a separate issue from the above visualization issue. |
ah, I see--my mistake, I was looking at This is making more sense, I just want to clarify a little more since this isn't an analysis I'm super familiar with. Thanks for bearing with me haha
Are the ip/pw tags what you're actually using to map query->reference coordinates? I'm envisioning one of two scenarios: (Scenario 1) The mapping is based on the query sequence but we can't lose the kinetics info, e.g.
Here there isn't any need to load these values as a (Scenario 2) Both query/reference are just the
And in this scenario we definitely need to load both sequences as Would you mind commenting on which of these scenarios are closer to correct? Or if neither are correct, maybe like a very small contrived example to illustrate would be helpful--I saw your previous posts, but it wasn't 100% clear. Some other quick comments:
Yep--I'm in agreement with you that adding a visualization option for ""invalid"" values should be incorporated in some way. That implementation is prototyped, but I need to finish building out some testing suites before it's added. |
Thanks so much. The alignment of the ip or pw tags from query -> reference does not depend or require the actual read or reference sequences. I am just giving sequenceLayer a BStringSet that contains the ip/pw tags and the CIGAR string. In other words, the mapping relies on the CIGAR string of the read, and it does not require any info about the read or reference sequences. And it is not using the ip / pw values themselves for the mapping. Your suggestion of adding the ip and pw values as extra metadata to the read sequence, so that ip and pw do not need to be loaded as BString is interesting. Then doing sequenceLayer, and then extracting the corresponding locations from the metadata -- I'm not sure how that would work though? Specifically this step you wrote "then subset them with the returned subsetting of sequenceLayer". How would I use the result of sequenceLayer be used to extract ip/pw values in reference coordinates, and also doing so without confusing between injected 'insertion' or 'deletion' ASCII characters? See toy example:
|
Sorry for the slow reply--I've been thinking about this more over this week and looking into the codebase. First, a temporary workaround: subset_arbitrary_values_by_cigar <- function(xv, cigar){
# query -> reference mapping
ops_to_remove <- c("I", "S")
ops_to_inject <- c("D", "N")
inject_at <- cigarRangesAlongQuerySpace(cigar, ops=ops_to_inject)[[1]]
inject_w <- rev(explodeCigarOpLengths(cigar, ops=ops_to_inject)[[1]])
filler_regions <- as(rep(0, max(inject_w)), class(xv))
filler_starts <- rev(attr(inject_at, "start"))
# inject 0 filler to insertion regions
# subseq isn't vectorized, it seems, so have to loop
# going in reverse so the positions don't change from insertion
for(i in seq_along(filler_starts))
subseq(xv, start=filler_starts[i], width=0L) <- subseq(filler_regions, start=1L, width=inject_w[i])
# Delete regions
# transform the cigar string now that we've inserted
ops_to_inject <- paste(ops_to_inject, collapse='')
cigar <- gsub(paste0("[", ops_to_inject, "]"), "M", cigar)
delete_at <- cigarRangesAlongQuerySpace(cigar, ops=ops_to_remove)[[1]]
total_range <- IRanges(1, width=length(xv))
xv <- xv[setdiff(total_range, delete_at)]
xv
} This function takes in an DNA1 <- DNAString("ATTTCTAGAGA")
DNA2 <- DNAString("CGGGCTGACGC")
DNA3 <- DNAString("ATGCCAAGGAT")
DNASet <- DNAStringSet(list(DNA1,DNA2,DNA3))
cigar1 <- "5M1X2I1M1D1M"
cigar2 <- "5M1I2D1M2M"
cigar3 <- "3M2I1M2D1M1D1I3M"
cigars <- c(cigar1, cigar2, cigar3)
# You could use any character/raw string with something like:
# xv1 <- as(as.integer(charToRaw(ip1)) + 1L, "XInteger")
# Adding 1 because 0 is reserved for gap characters. XInteger supports >255, so this isn't a problem.
# I'm going to just use integers to make the output more easily verifiable
ip_codes <- as(1:11, "XInteger")
sequenceLayer(DNASet,cigars)
## DNAStringSet object of length 3:
## width seq
## [1] 10 ATTTCTA-GA
## [2] 12 CGGGC--GACGC
## [3] 11 ATGA--A-GAT
subset_arbitrary_values_by_cigar(ip_codes, cigar1)
## XInteger of length 10
## [1] 1 2 3 4 5 6 9 0 10 11
subset_arbitrary_values_by_cigar(ip_codes, cigar2)
## XInteger of length 12
## [1] 1 2 3 4 5 0 0 7 8 9 10 11
subset_arbitrary_values_by_cigar(ip_codes, cigar3)
> subset_arbitrary_values_by_cigar(ip_codes, cigar3)
## XInteger of length 11
## [1] 1 2 3 6 0 0 7 0 9 10 11 It's not perfect (would be nice if it operated on Next, a discussion on implementation. This is unfortunately a bit of a tough spot regarding implementation. On the one hand, it's not super feasible to support values above 255 for What are the possible solutions? As far as I can tell, there's three potentially viable options:
Of those, I think (2) is likely not feasible. (1) is obviously the easiest, but doesn't really solve the issue. I think (3) is likely the best solution. If I were to implement it, I'd probably do something like this:
For an example implementation, consider storing 2 ## Very quick implementation, this is likely not the most efficient
doubleBStringLayer <- function(ip_vals, cigars, ...){
ip_vals <- lapply(ip_vals, charToRaw)
lower_vals <- lapply(ip_vals, \(x) x & as.raw(0x0F))
upper_vals <- lapply(ip_vals, \(x) x & as.raw(0xF0))
# use ASCII representation
#lower_vals <- vapply(lower_vals, \(x) intToUtf8(as.integer(x) + 64L), character(1L))
#upper_vals <- vapply(upper_vals, \(x) intToUtf8(as.integer(rawShift(x,-4)) + 64L), character(1L))
# use hexadecimal representation
lower_vals <- vapply(lower_vals, \(x) paste(substring(as.character(x), 2, 2), collapse=''), character(1L))
upper_vals <- vapply(upper_vals, \(x) paste(substring(as.character(x), 2, 2), collapse=''), character(1L))
lowerBString <- BStringSet(lower_vals)
upperBString <- BStringSet(upper_vals)
# pass through any further args to sequenceLayer
lowerBString <- sequenceLayer(lowerBString, cigars, ...)
upperBString <- sequenceLayer(upperBString, cigars, ...)
BStringSetList(upperBString, lowerBString)
}
ip1 <- "2flsd9f0slm"
ip2 <- "mkwi9v0-m31"
cigar1 <- "5M1X2I1M1D1M"
cigar2 <- "5M1I2D1M2M"
doubleBStringLayer(c(ip1, ip2), c(cigar1, cigar2))
## BStringSetList of length 2
## [[1]] 3667637-66 66763--32633
## [[2]] 26c3493-cd db799--0dd31
## these translate back to the original values when read vertically:
## 32, 66, 6C, 73, 64, ...
## = 2, f, l, s, d, ... That approach would be extensible in the event that future sequencing runs encode data larger than 255, since it would be possible to encode values as arbitrarily many Implementing that class would be slightly more challenging than just this function--you'd probably need additional methods to abstract the multiple strings into a cleaner user experience (e.g., custom So that brings my (current) final recommendation to a combo of (1) and (3)--leaving |
HI Aidan, Thanks so much - very thoughtful and creative! From my user perspective, I think a separate class for kinetic data that internally splits the data into smaller BStrings is reasonable, and then adding a sequenceLayer method that can process those new kinetic data objects. I agree this is a very specific use case. I anticipate it will grow with time, since this is the basis for methylation and other calls in PacBio data but still a small group of users. So putting this in a separate package makes sense to avoid interfering with GenomicAlignments and Biostrings. My own coding level is not up to par to be able to implement this myself, but I'm happy to beta test in detail and stress test any implementation. In the meantime, I will try your work-around, though I suspect it may be quite slow when run on millions of reads, since it is not fully vectorized. I will try at least. |
No worries! Since I've been thinking about it, try just using this solution based on the sequenceLayer_bctags <- function(values, cigars, ...){
## Input:
## - values: either a list of integer or raw OR character strings
## - cigars: cigar strings
## - ... : further arguments to be passed to `sequenceLayer`
##
## Output: list of length(values) containing integers
## - -1 is gap character, all other values converted to integers
# values are integer numbers
if(is(values, 'character'))
values <- lapply(values, \(x) as.character(charToRaw(x)))
else
values <- lapply(values, \(x) as.character(as.raw(x)))
lower_vals <- vapply(values, \(x) paste(substring(x, 2, 2), collapse=''), character(1L))
upper_vals <- vapply(values, \(x) paste(substring(x, 1, 1), collapse=''), character(1L))
lowerBString <- BStringSet(lower_vals)
upperBString <- BStringSet(upper_vals)
lowerBString <- sequenceLayer(lowerBString, cigars, ...)
upperBString <- sequenceLayer(upperBString, cigars, ...)
upper_bits <- strsplit(as.character(upperBString), '')
lower_bits <- strsplit(as.character(lowerBString), '')
retval <- vector('list', length(upper_bits))
for(i in seq_along(upper_bits)){
v <- paste0(upper_bits[[i]], lower_bits[[i]])
outv <- numeric(length(v))
outv[] <- -1L
p_valid <- v!='--'
outv[p_valid] <- as.integer(paste0("0x", v[p_valid]))
retval[[i]] <- outv
}
retval
} Example Usage: ip1 <- "2flsd9f0slm"
ip2 <- "mkwi9v0-m31"
ip_values <- c(ip1, ip2)
cigar1 <- "5M1X2I1M1D1M"
cigar2 <- "5M1I2D1M2M"
cigars <- c(cigar1,cigar2)
sequenceLayer_bctags(ip_values, cigars)
## [[1]]
## [1] 50 102 108 115 100 57 115 -1 108 109
##
## [[2]]
## [1] 109 107 119 105 57 -1 -1 48 45 109 51 49
## can also provide a list of ints
set.seed(123L)
ip_values <- list(sample(11), sample(11))
ip_values
## [[1]]
## [1] 3 11 2 6 10 5 4 9 8 1 7
##
## [[2]]
## [1] 11 5 3 9 1 4 7 10 6 8 2
# also able to change sequenceLayer args
sequenceLayer_bctags(ip_values, cigars, from='reference', to='query')
## [[1]]
## [1] 3 11 2 6 10 5 -1 -1 4 8 1 7
##
## [[2]]
## [1] 11 5 3 9 1 -1 10 6 8 2 Only note is that this assumes gap characters will be Maybe down the line I can get around to making this a real class with better error checking etc. (or if anyone else sees this and wants to, that would be even better 😂) |
Thanks so much! This is super helpful and is a great workaround. I just tested it and it seems to work, but I'm curious if there is any way to speed up this part?
|
Actually it works decently fast. I can't figure out a faster (more vectorized) way to do this. |
I'm not sure if it's faster, but a more "R" way to do it would be: sequenceLayer_bctags <- function(values, cigars, ...){
# values are integer numbers
if(is(values, 'character'))
values <- lapply(values, \(x) as.character(charToRaw(x)))
else
values <- lapply(values, \(x) as.character(as.raw(x)))
lower_vals <- vapply(values, \(x) paste(substring(x, 2, 2), collapse=''), character(1L))
upper_vals <- vapply(values, \(x) paste(substring(x, 1, 1), collapse=''), character(1L))
lowerBString <- BStringSet(lower_vals)
upperBString <- BStringSet(upper_vals)
lowerBString <- sequenceLayer(lowerBString, cigars, ...)
upperBString <- sequenceLayer(upperBString, cigars, ...)
upper_bits <- strsplit(as.character(upperBString), '')
lower_bits <- strsplit(as.character(lowerBString), '')
res <- mapply(paste0, "0x", upper_bits, lower_bits, USE.NAMES=FALSE)
res <- lapply(res, \(x) {
v <- suppressWarnings(as.integer(x))
v[is.na(v)] <- -1L
v})
res
} |
Thanks! Yes I wrote something similar, but it's about the same speed as the for loop. I also decided to keep the 'deletion' positions as NA instead of -1. This is better for downstream calculations that for example need to calculate a mean kinetics value. This allows removal of this line: v[is.na(v)] <- -1L |
How do I use BString / BStringSet for values > 127?
PacBio data uses 0:255 ranges for fp/rp/fi/ri tags. I am trying to use BioStrings to work with these tags.
But I ran into trouble, for values > 127.
Technically, ASCII shoudl go up to 255, so I'm not sure why BString shouldn't be able to handle things like this:
BStringSet(rawToChar(as.raw(135)))
It seems to just return '?' for any value > 127.
The text was updated successfully, but these errors were encountered: