Preserving full headers in pan-genome #46

rcedgar · 2020-04-23T16:26:56Z

Per following comment in https://github.com/ababaian/serratus/blob/master/notebook/200420_cov2_pangenome.ipynb

"seqkit destroys the original headers so now each 'chromosome' or input sequence is referred to by it's accession ID only."

Seqkit truncates at the first whitespace. You can prevent it from truncating labels by replacing spaces with a character that never appears in a defline, such as '@', and reversing the change at the end of processing.

This can be implemented with a short awk script that replaces space with @ on lines matching ^>, and reversed by a similar script that replaces @ with space.
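A minimal sketch of that round trip, written in Python rather than awk. The function names and the example defline are illustrative, and it assumes the placeholder does not already occur in any header and that only spaces (not tabs) need protecting.

```python
def protect_deflines(lines, placeholder="@"):
    """Replace spaces with a placeholder on FASTA header lines only."""
    for line in lines:
        if line.startswith(">"):
            yield line.replace(" ", placeholder)
        else:
            yield line


def restore_deflines(lines, placeholder="@"):
    """Undo protect_deflines: turn the placeholder back into spaces."""
    for line in lines:
        if line.startswith(">"):
            yield line.replace(placeholder, " ")
        else:
            yield line


# Toy record; the sequence line is a placeholder, not real data.
fasta = [
    ">NC_045512.2 example description with spaces",
    "ATTAAAGGTT",
]
protected = list(protect_deflines(fasta))
restored = list(restore_deflines(protected))
```

Sequence lines pass through untouched, so the two functions can wrap any tool that truncates headers at whitespace.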

ababaian · 2020-04-23T16:55:13Z

I purposefully removed the headers and kept only the accessions to make the headers SAM-format compliant:

Reference sequence names may contain any printable ASCII characters in the range [!-~] apart from backslashes, commas, quotation marks, and brackets (i.e., apart from ‘\ , "‘’ () [] {} <>’) and may not start with ‘*’ or ‘=’.
REGEX: [0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*
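As a rough check, the reference-name pattern from the SAM specification can be applied to candidate headers before indexing. This is only a sketch; `is_valid_rname` is a hypothetical helper name, not part of any existing tool.

```python
import re

# Reference sequence name pattern as given in the SAM specification.
SAM_RNAME = re.compile(
    r"[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*"
)


def is_valid_rname(name):
    """Return True if the whole string is a legal SAM reference name."""
    return SAM_RNAME.fullmatch(name) is not None


accession_ok = is_valid_rname("NC_045512.2")
full_header_ok = is_valid_rname("NC_045512.2 example description")
```

A bare accession passes, while a full NCBI-style header fails on the first space, which is why the headers need trimming or rewriting before they reach SAM.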

Accessions are nice in that they are short, but you're right that they lack information about taxonomy. Perhaps using an ID tag for easy parsing would be the correct solution here and address #45 as well.

How about <Accession>-<tax_id>-<host_tax_id>? Then we can use the dash as a field separator and retain useful information: accession, virus species, and host species (if available; otherwise use a 0).
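A small sketch of how such an ID tag could be built and parsed. `make_id` and `parse_id` are hypothetical names, and the taxonomy IDs in the example are only sample values. Splitting from the right is a design choice so that an accession containing a dash would still parse.

```python
def make_id(accession, tax_id=None, host_tax_id=None):
    """Join the three fields with dashes, writing 0 for a missing ID."""
    return "-".join([accession, str(tax_id or 0), str(host_tax_id or 0)])


def parse_id(tag):
    """Split from the right so the last two fields are always the tax IDs."""
    accession, tax_id, host_tax_id = tag.rsplit("-", 2)
    return accession, int(tax_id), int(host_tax_id)


with_host = make_id("NC_045512.2", 2697049, 9606)
without_host = make_id("NC_045512.2", 2697049)
```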

Thoughts?

rcedgar · 2020-04-23T16:58:41Z

The ideal pan-genome reference would have full headers and taxonomy identifiers because they are easy to remove on the fly but harder to insert when needed. From memory, I think bowtie2 truncates identifiers when making the BWT index, so only accessions will appear in SAM; if not, it's trivial to truncate as a pre-processing step.

ababaian · 2020-04-23T17:02:28Z

OK I'll test that bt2 can parse NCBI headers and leave them in the pan-genome. I'll 'hotfix' cov2 and re-make it later today.

rcedgar · 2020-04-23T17:06:39Z

For the pan-genome, there are several things we might be interested in and it would be nice to have a flexible and forwards-compatible method for annotating the deflines. Examples: Cov taxid, host taxid, geneid (if only one gene), complete genome if it is one (a subset of complete genomes is very useful) etc. In my own work, I embed annotations in FASTA deflines using name=value pairs separated by semi-colons. A script can easily parse the defline and ignore names it doesn't recognize. A curated and well-annotated Coronavirus reference database would be a useful public resource in its own right.
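A minimal sketch of parsing such a defline. It assumes the accession is the first semicolon-separated field and that everything after it is a name=value pair; the annotation names used in the example (taxid, host_taxid, complete) are illustrative, not an agreed vocabulary.

```python
def parse_defline(defline):
    """Split '>acc;name=value;...' into (accession, {name: value})."""
    fields = defline.lstrip(">").strip().split(";")
    accession = fields[0]
    annotations = {}
    for field in fields[1:]:
        if "=" not in field:
            continue  # skip empty trailing fields and bare tokens
        name, value = field.split("=", 1)
        annotations[name] = value
    return accession, annotations


acc, ann = parse_defline(">NC_045512.2;taxid=2697049;host_taxid=9606;complete=yes;")

# A script only looks up the names it knows about and ignores the rest.
taxid = ann.get("taxid")
geneid = ann.get("geneid")  # absent here, so this is None
```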

rcedgar · 2020-04-23T17:13:01Z

To be clear, for mapping I would not advocate including all the annotations because this bloats the SAM and BAM files by repeating the same information thousands of times and is not forwards-compatible as new annotations are added. For SAM/BAM I would just have accessions. Other scripts, e.g. a hit summarizer, could make good use of the annotations. Annotations could be kept in separate files (acc+taxid, acc+hosttaxid, acc+geneid...), but then you have the problem of information distributed among several files and keeping them synchronized. IMO it is better to have one master file with all the information (FASTA with annotated deflines) from which information can be extracted as needed. This is tractable for Cov because the FASTA file with all Cov sequences is not too big.
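One way to picture the master-file idea: both the accession-only FASTA for mapping and a two-column table such as acc+taxid are derived from the annotated deflines on demand. The sketch below assumes deflines of the form `>accession;name=value;...`. The function names, the second accession, and the sequences are placeholders.

```python
def extract_annotation(fasta_lines, name, missing="0"):
    """Yield (accession, value) for one annotation name, per defline."""
    for line in fasta_lines:
        if not line.startswith(">"):
            continue
        accession, *fields = line[1:].strip().split(";")
        value = missing
        for field in fields:
            key, sep, val = field.partition("=")
            if sep and key == name:
                value = val
                break
        yield accession, value


def accession_only(fasta_lines):
    """Drop all annotations so that only the accession reaches the aligner."""
    for line in fasta_lines:
        if line.startswith(">"):
            yield line.split(";", 1)[0].strip()
        else:
            yield line


# Toy master file; ACC_0002 is a made-up accession.
master = [
    ">NC_045512.2;taxid=2697049;host_taxid=9606;",
    "ATTAAAGGTT",
    ">ACC_0002;taxid=2697049;",
    "ATTAAAGG",
]
acc_taxid = list(extract_annotation(master, "taxid"))
acc_host = list(extract_annotation(master, "host_taxid"))
mapping_fasta = list(accession_only(master))
```

Because every derived file is regenerated from the same input, there is nothing to keep synchronized by hand.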

ababaian · 2020-04-23T19:36:19Z

That's kind of what I was thinking before, having a header / taxonomy file which contains all the meta-data for each accession, and for alignment only use the accession ID. I currently don't have an idea which will be the most usable but am leaning towards (1):

(1) Meta-data table + accession-only fasta header in pan-genome
(2) Including standard NCBI headers in pan-genome
(3) Creating custom headers for pan-genome

rcedgar · 2020-04-23T20:30:14Z

I think we're in agreement -- the main thing is to do it, the details are not critical.

We should distinguish between Covid sequences at a minimum of two stages: (1) as downloaded from NCBI and (2) a reduced-redundancy reference used for mapping. I suggest we reserve "pan-genome" for (2), and invent a new term ("bulk"?) for (1).

My recommendation is to add annotations in a name=value format to the bulk sequences. There should be enough information in those annotations to automate construction of (2) and all other derived data, e.g. blacklist=yes or equivalent should be added. We could call this latter dataset "annotated bulk coronaviruses", or "ABC" for short. Having a single ABC file is a future-hardened, forwards-compatible software engineering architecture because it provides a single input for many different downstream processes.

When we add new metadata, say host taxonomy id, this becomes a new annotation and the core implementation task is to add these annotations to the ABC file. This can be done in serial: each annotation type can be added in one pass independently of the others.
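A sketch of one such pass, assuming deflines of the form `>accession;name=value;...` and a lookup table from accession to value. `add_annotation` is a hypothetical helper and ACC_0002 is a made-up accession. It does not check whether the name is already present, and it normalizes deflines to have no trailing semicolon.

```python
def add_annotation(fasta_lines, name, values):
    """Append name=value to each defline whose accession is in `values`.

    Lines are expected without trailing newlines. Existing annotations
    are passed through unchanged.
    """
    for line in fasta_lines:
        if line.startswith(">"):
            body = line[1:].rstrip(";")
            accession = body.split(";", 1)[0]
            if accession in values:
                body = f"{body};{name}={values[accession]}"
            yield ">" + body
        else:
            yield line


bulk = [">NC_045512.2", "ATTAAAGGTT", ">ACC_0002", "ATTAAAGG"]

# Each annotation type is added in its own independent pass.
pass1 = list(add_annotation(bulk, "taxid",
                            {"NC_045512.2": "2697049", "ACC_0002": "2697049"}))
pass2 = list(add_annotation(pass1, "host_taxid", {"NC_045512.2": "9606"}))
```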

New annotation types are then available with minimal effort to all scripts which use ABC because they already read the file and scan through the annotations. The only change that might be needed for a new annotation is to check for the new name where it is relevant.

Where file size is not important, annotations can be passed through unchanged. This is forwards-compatible because annotations will be included regardless of whether their type was known at the time a script was implemented.

On the other hand, if we have separate files for each type of annotation (blacklist, taxonomy...), then ongoing development and maintenance is more difficult because there are N:M dependencies (annotation files and their formats <-> scripts) which vary over time, while with my proposal this remains 1:N indefinitely.

rcedgar added the Bioinformatics (Bioinformatics task) and enhancement (New feature or request) labels Apr 23, 2020

ababaian closed this as completed May 1, 2020
