Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preserving full headers in pan-genome #46

Closed
rcedgar opened this issue Apr 23, 2020 · 7 comments
Closed

Preserving full headers in pan-genome #46

rcedgar opened this issue Apr 23, 2020 · 7 comments
Labels
Bioinformatics Bioinformatics task enhancement New feature or request

Comments

@rcedgar
Copy link
Collaborator

rcedgar commented Apr 23, 2020

Per following comment in https://github.com/ababaian/serratus/blob/master/notebook/200420_cov2_pangenome.ipynb

"seqkit destroys the original headers so now each 'chromosome' or input sequence is referred to by it's accession ID only."

Seqkit truncates at the first white space. You can prevent it from truncating labels by replacing spaces by a character which never appears in a defline such as '@' and reversing out the change at the end of the processing.

This can be implemented by a short awk script which replaces space by @ on lines matching ^>, and reversed out by a similar script which replaces @ by space.

@ababaian
Copy link
Owner

ababaian commented Apr 23, 2020

I purposefully removed the headers and kept only the accessions to make the headers SAM-format compliant,

Reference sequence names may contain any printable ASCII characters in the range[!-~]apart from backslashes, commas, quotation marks, and brackets—i.e., apart from ‘\ , "‘’ () [] {} <>’—and may notstart with ‘*’ or ‘=’.4
REGEX: 0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*

Accessions are nice in that they are short, but you're right they are lacking information about taxonomy. Perhaps using an ID tag for easy parsing would be the correct solution here and address #45 as well.

How about: <Accession>-<tax_id>-<host_tax_id> Then we can use the dash as a field separator and retain useful information, that is accession, virus species, and host species (if available). Else use a 0.

Thoughts?

@rcedgar rcedgar added Bioinformatics Bioinformatics task enhancement New feature or request labels Apr 23, 2020
@rcedgar
Copy link
Collaborator Author

rcedgar commented Apr 23, 2020

The ideal pan-genome reference would have full headers and taxonomy identifiers because they are easy to remove on the fly but harder to insert when needed. From memory, I think bowtie2 truncates identifiers when making the BWT index so only accessions will appear in SAM, if not then it's trivial to truncate as a pre-processing step.

@ababaian
Copy link
Owner

OK I'll test that bt2 can parse NCBI headers and leave them in the pan-genome. I'll 'hotfix' cov2 and re-make it later today.

@rcedgar
Copy link
Collaborator Author

rcedgar commented Apr 23, 2020

For the pan-genome, there are several things we might be interested in and it would be nice to have a flexible and forwards-compatible method for annotating the deflines. Examples: Cov taxid, host taxid, geneid (if only one gene), complete genome if it is one (a subset of complete genomes is very useful) etc. In my own work, I embed annotations in FASTA deflines using name=value pairs separated by semi-colons. A script can easily parse the defline and ignore names it doesn't recognize. A curated and well-annotated Coronavirus reference database would be a useful public resource in its own right.

@rcedgar
Copy link
Collaborator Author

rcedgar commented Apr 23, 2020

To be clear, for mapping I would not advocate including all the annotations because this bloats the SAM and BAM files by repeating the same information thousands of times and is not forwards-compatible as new annotations are added. For SAM/BAM I would just have accessions. Other scripts, e.g. a hit summarizer, could make good use of the annotations. Annotations could be kept in separate files (acc+taxid, acc+hosttaxid, acc+geneid...), but then you have the problem of information distributed among several files and keeping them synchronized. IMO it is better to have one master file with all the information (FASTA with annotated deflines) from which information can be extracted as needed. This is tractable for Cov because the FASTA file with all Cov sequences is not too big.

@ababaian
Copy link
Owner

That's kind of what I was thinking before, having a header / taxonomy file which contains all the meta-data for each accession, and for alignment only use the accession ID. I currently don't have an idea which will be the most usable but am leaning towards (1)

  1. Meta-data table + accession-only fasta header in pan-genome
  2. Including standard NCBI headers in pan-genome
  3. Creating custom headers for pan-genome

@rcedgar
Copy link
Collaborator Author

rcedgar commented Apr 23, 2020

I think we're in agreement -- the main thing is to do it, the details are not critical.

We should distinguish between Covid sequences in at least two stages, at a minimum (1) as downloaded from NCBI and (2) a reduced redundancy reference used for mapping. Suggest we reserve "pan-genome" for (2), and invent a new term ("bulk"?) for (1).

My recommendation is to add annotations in a name=value format to the bulk sequences. There should be enough information in those annotations to automate construction of (2) and all other derived data, e.g. blacklist=yes or equivalent should be added. We could call this latter dataset "annotated bulk coronaviruses", or "ABC" for short. Having a single ABC file is a future-hardended, forwards-compatible software engineering architecture because it provides a single input for many different downstream processes.

When we add new metadata, say host taxonomy id, this becomes a new annotation and the core implementation task is to add these annotations to the ABC file. This can be done in serial: each annotation type can be added in one pass independently of the others.

New annotation types are then available with minimal effort to all scripts which use ABC because they already read the file and scan through the annotations, the only change that might be needed for a new annotation is to check the new name if needed.

Where file size is not important, annotations can be passed through unchanged. This is forwards-compatible because annotations will be included regardless of whether their type was known at the time a script was implemented.

On the other hand, if we have separate files for each type of annotation (blacklist, taxonomy...), then ongoing development and maintenance is more difficult because there are N:M dependencies (annotation files and their formats <-> scripts) which vary over time, while with my proposal this remains 1:N indefinitely.

@ababaian ababaian closed this as completed May 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bioinformatics Bioinformatics task enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants