-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Preserving full headers in pan-genome #46
Comments
I purposefully removed the headers and kept only the accessions to make the headers SAM-format compliant,
Accessions are nice in that they are short, but you're right they are lacking information about taxonomy. Perhaps using an ID tag for easy parsing would be the correct solution here and address #45 as well. How about: Thoughts? |
The ideal pan-genome reference would have full headers and taxonomy identifiers because they are easy to remove on the fly but harder to insert when needed. From memory, I think bowtie2 truncates identifiers when making the BWT index so only accessions will appear in SAM, if not then it's trivial to truncate as a pre-processing step. |
OK I'll test that bt2 can parse NCBI headers and leave them in the pan-genome. I'll 'hotfix' |
For the pan-genome, there are several things we might be interested in and it would be nice to have a flexible and forwards-compatible method for annotating the deflines. Examples: Cov taxid, host taxid, geneid (if only one gene), complete genome if it is one (a subset of complete genomes is very useful) etc. In my own work, I embed annotations in FASTA deflines using name=value pairs separated by semi-colons. A script can easily parse the defline and ignore names it doesn't recognize. A curated and well-annotated Coronavirus reference database would be a useful public resource in its own right. |
To be clear, for mapping I would not advocate including all the annotations because this bloats the SAM and BAM files by repeating the same information thousands of times and is not forwards-compatible as new annotations are added. For SAM/BAM I would just have accessions. Other scripts, e.g. a hit summarizer, could make good use of the annotations. Annotations could be kept in separate files (acc+taxid, acc+hosttaxid, acc+geneid...), but then you have the problem of information distributed among several files and keeping them synchronized. IMO it is better to have one master file with all the information (FASTA with annotated deflines) from which information can be extracted as needed. This is tractable for Cov because the FASTA file with all Cov sequences is not too big. |
That's kind of what I was thinking before, having a header / taxonomy file which contains all the meta-data for each accession, and for alignment only use the accession ID. I currently don't have an idea which will be the most usable but am leaning towards (1)
|
I think we're in agreement -- the main thing is to do it, the details are not critical. We should distinguish between Covid sequences in at least two stages, at a minimum (1) as downloaded from NCBI and (2) a reduced redundancy reference used for mapping. Suggest we reserve "pan-genome" for (2), and invent a new term ("bulk"?) for (1). My recommendation is to add annotations in a name=value format to the bulk sequences. There should be enough information in those annotations to automate construction of (2) and all other derived data, e.g. blacklist=yes or equivalent should be added. We could call this latter dataset "annotated bulk coronaviruses", or "ABC" for short. Having a single ABC file is a future-hardended, forwards-compatible software engineering architecture because it provides a single input for many different downstream processes. When we add new metadata, say host taxonomy id, this becomes a new annotation and the core implementation task is to add these annotations to the ABC file. This can be done in serial: each annotation type can be added in one pass independently of the others. New annotation types are then available with minimal effort to all scripts which use ABC because they already read the file and scan through the annotations, the only change that might be needed for a new annotation is to check the new name if needed. Where file size is not important, annotations can be passed through unchanged. This is forwards-compatible because annotations will be included regardless of whether their type was known at the time a script was implemented. On the other hand, if we have separate files for each type of annotation (blacklist, taxonomy...), then ongoing development and maintenance is more difficult because there are N:M dependencies (annotation files and their formats <-> scripts) which vary over time, while with my proposal this remains 1:N indefinitely. |
Per following comment in https://github.com/ababaian/serratus/blob/master/notebook/200420_cov2_pangenome.ipynb
"seqkit destroys the original headers so now each 'chromosome' or input sequence is referred to by it's accession ID only."
Seqkit truncates at the first white space. You can prevent it from truncating labels by replacing spaces by a character which never appears in a defline such as '@' and reversing out the change at the end of the processing.
This can be implemented by a short awk script which replaces space by @ on lines matching ^>, and reversed out by a similar script which replaces @ by space.
The text was updated successfully, but these errors were encountered: