Read Names

kwrodarmer edited this page Jul 23, 2018 · 1 revision

In July 2018, several issues were filed about read names and how they might be accessed from SRA objects. The short, general answer is that you can't, although there are exceptions. This is because (again, in general) we don't store them.

In the cases where we do store them, they can be accessed with our dumper tools. In all other cases, they are replaced with a serial number. Our BAM loader discards read names, while the older FASTQ loader is able to preserve them.
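To make "replaced with a serial number" concrete, here is a minimal sketch of what it amounts to. It is not the SRA loader's code: the helper names `read_fastq` and `serialize_names`, the sample read names, and the assumption of well-formed 4-line FASTQ input are all made up for illustration.

```python
from typing import Iterable, Iterator, Tuple

FastqRecord = Tuple[str, str, str]  # (name, bases, qualities)


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield (name, bases, qualities) from well-formed 4-line FASTQ text."""
    it = iter(lines)
    for header in it:
        bases = next(it).rstrip("\n")
        next(it)  # the '+' separator line carries nothing we keep
        quals = next(it).rstrip("\n")
        yield header.rstrip("\n")[1:], bases, quals  # drop the leading '@'


def serialize_names(records: Iterable[FastqRecord]) -> Iterator[FastqRecord]:
    """Discard each original name and substitute a 1-based serial number."""
    for serial, (_name, bases, quals) in enumerate(records, start=1):
        yield str(serial), bases, quals


# Hypothetical input: two records with made-up, coordinate-style names.
fastq_text = [
    "@machine7:run12:lane3:tile44:1001:2002\n",
    "ACGT\n",
    "+\n",
    "IIII\n",
    "@machine7:run12:lane3:tile44:1003:2050\n",
    "GGCA\n",
    "+\n",
    "IIII\n",
]
renamed = list(serialize_names(read_fastq(fastq_text)))
```

After this step the bases and qualities are intact, but nothing that was encoded in the original names can be recovered from the output.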

But read names have valuable information!

This is undoubtedly often the case. Typically their information density is low: much of the content is redundant from one read to the next, alongside coordinates that identify the spot location, the read number, etc. In fact, some of this information is essential, which raises the question: why is it being squirreled away inside an identifier string? After all, we don't put the base calls right in with the name; they are given their own designated field within any standard format. Why doesn't the read index get its own field?

Read names are opaque identifiers with essentially no formatting requirements - you probably take issue with my calling them "opaque". Each manufacturer started with the problem of how to automatically generate unique identifier strings, and naturally and inevitably ended up with either serial numbers or coordinates, along with some other user-provided identifier to encourage global uniqueness. When reads are not associated by spot locality, a read index is also added to their names. Over time, the names end up accumulating a large amount of valuable information.
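The role of a read index inside a name can be sketched as follows. The trailing `/1` and `/2` suffix used here is one common convention, assumed for the example; the function names and sample names are hypothetical, and real names vary by manufacturer and pipeline.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def split_read_index(name: str) -> Tuple[str, int]:
    """Split 'spot/2' into ('spot', 2); a name with no numeric suffix is read 1."""
    base, sep, suffix = name.rpartition("/")
    if sep and suffix.isdigit():
        return base, int(suffix)
    return name, 1


def assemble_spots(reads: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (name, bases) pairs by spot, ordering each spot's reads by index."""
    by_spot: Dict[str, Dict[int, str]] = defaultdict(dict)
    for name, bases in reads:
        spot, index = split_read_index(name)
        by_spot[spot][index] = bases
    return {spot: [r[i] for i in sorted(r)] for spot, r in by_spot.items()}


# Hypothetical reads arriving out of order, mates not adjacent.
reads = [
    ("flowcellA:1:7:100:200/2", "TTGA"),
    ("flowcellA:1:7:300:400/1", "CCAT"),
    ("flowcellA:1:7:100:200/1", "ACGT"),
]
spots = assemble_spots(reads)
```

Here the suffix is the only thing tying the two mates of a spot together and ordering them, which is why the read index is essential information and not mere decoration.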

How should the SRA treat read names?

If we take a brief step back to remember that the SRA receives heterogeneous data that come in several formats and from different manufacturers, and tries to harmonize these into a resource with a uniform access model, it should be clear that differences tend to drop out. A researcher should not need to be concerned with the source of data in the SRA.

When we started the SRA in 2007, we parsed all of the read names to extract all of the information we were aware they contained. This helped us to compress them while at the same time preserving potentially useful information in a first-class way with designated data series. In any event, we needed at least the read indices because we were assembling reads into a spot object internally. For the first several years, we experienced enormous error rates during submission due to malformed read names, meaning just about anything that did not conform to published (or reverse-engineered) manufacturer specs. To be clear, the parsers were doing exactly what they were supposed to do: rejecting anything out of spec. This was frustrating for everyone involved - submitters, curators, and engineers.
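A strict parser of the kind described above might look roughly like this sketch. The seven colon-separated fields follow a layout used by some Illumina software and are chosen only for illustration; the regular expression, the class names, and the sample name are assumptions and do not reproduce the SRA's actual parsers.

```python
import re
from typing import NamedTuple


class ReadNameFields(NamedTuple):
    instrument: str
    run: int
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int


# Seven colon-separated fields; anything else is out of spec for this sketch.
_NAME_RE = re.compile(
    r"(?P<instrument>[A-Za-z0-9_-]+):(?P<run>\d+):(?P<flowcell>[A-Za-z0-9-]+):"
    r"(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
)


class MalformedReadName(ValueError):
    """Raised when a name does not match the expected layout."""


def parse_read_name(name: str) -> ReadNameFields:
    """Extract every field into its own slot, or reject the name outright."""
    match = _NAME_RE.fullmatch(name)
    if match is None:
        raise MalformedReadName(f"out of spec: {name!r}")
    g = match.groupdict()
    return ReadNameFields(
        g["instrument"], int(g["run"]), g["flowcell"],
        int(g["lane"]), int(g["tile"]), int(g["x"]), int(g["y"]),
    )


ok = parse_read_name("SIM:1:FCX:1:15:6329:1045")
```

A name that a pipeline has decorated, for example `SIM:1:FCX:1:15:6329:1045_trimmed=12`, raises `MalformedReadName` under this parser. That is the correct behavior for a strict parser, and it is exactly the kind of rejection that drove up submission error rates.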

Where do we put our processing artifacts?

What was surprising to us at that time (although inevitable in retrospect) was that submitters were intentionally changing the patterns of read names they sent to us. Doesn't that break the format? Well, yes it does - unless you consider the read names to be opaque (see above). If instead the read names contain valuable information that should be parsed, then any modification to the patterns is absolutely a violation of format. But where else could a pipeline stash its data in a format that can't describe it? The read name provided exactly the sort of flexible place to store anything. I always joke that one day we will find someone's grandmother's favorite recipe in there.

The trouble with formats

A well-specified file format is a rare thing in bioinformatics, but a pleasure to load and generate! Every data series is given its own place and tools can be written without ambiguities. At the same time, such formats are frustrating to pack with information. Once a format becomes a standard, it is very difficult for it to evolve because all of the tools that depend upon it have to be updated to understand not only the latest version, but all of the prior versions as well. The research community has dealt with this by taking advantage of any ambiguity in specification to overload semantics, and in some cases (e.g. SAM/BAM) by specifically adding escape valves in the form of optional and comment fields. Fuzzy specifications allow a format to survive change for a little while through ambiguity, but at the cost of what tools can do with it.

Ambiguous formats don't survive the Extract, Transform, Load (ETL) process because the transform function must understand the source in order to generate output. The SRA performs transformation on 100% of its input to produce a canonical and harmonized representation that allows us to present a uniform access model.

VDB is a database technology that supports individual schema definitions for every object, meaning that it is entirely well-specified at the same time that it can accommodate any data. It also makes it possible for us to provide backward compatibility by supporting all versions that have ever existed. These properties make it ideal for the SRA, and allow us to create unambiguous files that expand to accommodate changes.

In conclusion

More information is being generated than will fit into FASTQ/SAM formats. The information is being generated by the manufacturers themselves, and in many cases being augmented by pre-submission pipelines. Some manufacturers have chosen technologies such as HDF5 to hold their data, but the community stubbornly clings to FASTQ and SAM/BAM. These formats have valuable bits of information stuffed into their comment or name fields under the wildly incorrect assumption that archives will store their files unchanged, without any ETL.

Using read names to store valuable information is a little like throwing a concert violin into a cloth bag and putting it through baggage handling on an airplane. Yes, it's possible you might get it back intact when you land, but it's just as possible that it might be reduced to splinters. If your data are valuable, please put them into a format that supports them - one that will survive transformation.

Because read names are treated as opaque and malleable, in the end they can only be treated as identifiers, and these are more economically represented by a serial number. And that's how the SRA treats them.