Genbank parser dev #61
Conversation
Didn't test, but the functionality looks great. Sorry again for treading on this earlier!
Minor suggestion for a possible future enhancement: much of the code is essentially a lookup table, which would be easier to maintain as an external file in (say) TSV format.
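As a rough illustration of the external-lookup-table idea, here is a minimal Python sketch. The column names `raw_host` and `host_taxon_id`, the function `load_host_lookup`, and the inline TSV rows are all hypothetical and not taken from the parser; in practice the table would live in its own file rather than in a string.

```python
import csv
import io

# Hypothetical TSV content standing in for an external file.
# Columns (assumed names): raw host string as seen in GenBank, NCBI taxon ID.
HOST_TSV = (
    "raw_host\thost_taxon_id\n"
    "Homo sapiens\t9606\n"
    "human\t9606\n"
    "Sus scrofa\t9823\n"
    "pig\t9823\n"
)


def load_host_lookup(handle):
    """Read a two-column TSV into a dict keyed by lower-cased host string."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        row["raw_host"].strip().lower(): int(row["host_taxon_id"])
        for row in reader
    }


# With a real file this would be: with open(path, newline="") as handle: ...
lookup = load_host_lookup(io.StringIO(HOST_TSV))

# Unknown host strings simply return None instead of raising.
print(lookup.get("human"), lookup.get("bat"))
```

Keeping the mapping in a TSV means new host spellings can be added without touching the parser code.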
I fully agree. I'll try to implement it sometime in the next few days.
What is the status of this pull request? Are we waiting on a review from @ababaian?
It's good. It's definitely working, but there are some features that still need to be added; see the taxonomy issue.
Two questions:
Never mind the "where" part of the question; I looked at the commits.
Probably not. We can merge and close this and have someone pick up from here.
I can take a crack at the metadata file, but it sounds like @r1cedgar might be doing some of this with #101, so we should coordinate.
Parsing GenBank (this issue) and uniform annotation of reference and predicted genomes (#101) are separate issues. We want both in parallel.
The genbankParser is designed to run as a standalone script that generates a formatted, cleaned CSV table of the covid pan-genome from GenBank input. It handles most hostTaxonId mapping errors (which are plentiful). For duplicate and highly homologous entries, it also attempts to infer missing hostTaxonIds: if every entry in a cluster or duplicate set that provides a hostTaxonId provides the same one, that value is inferred (in a new column) for the entries where none was provided. This infers about 1500 hostTaxonIds.
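To make the inference rule concrete, here is a minimal Python sketch of the idea, not the parser's actual code. The record layout, the `cluster` key, the `inferredHostTaxonId` column name, and the accessions are hypothetical. It also assumes the new column is filled only for entries that had no hostTaxonId of their own.

```python
from collections import defaultdict


def infer_host_taxon_ids(records):
    """Return copies of records with an added 'inferredHostTaxonId' key.

    Each record is a dict with 'accession', 'cluster' and 'hostTaxonId'
    (an int, or None when GenBank provided no usable host).
    """
    # Collect the distinct hostTaxonIds that were provided in each cluster.
    ids_by_cluster = defaultdict(set)
    for rec in records:
        if rec["hostTaxonId"] is not None:
            ids_by_cluster[rec["cluster"]].add(rec["hostTaxonId"])

    annotated = []
    for rec in records:
        new_rec = dict(rec)
        known = ids_by_cluster.get(rec["cluster"], set())
        # Infer only when the entry has no ID and the cluster is unanimous.
        if rec["hostTaxonId"] is None and len(known) == 1:
            new_rec["inferredHostTaxonId"] = next(iter(known))
        else:
            new_rec["inferredHostTaxonId"] = None
        annotated.append(new_rec)
    return annotated


# Hypothetical example data: c1 is unanimous, c2 conflicts, c3 has no IDs.
records = [
    {"accession": "A1", "cluster": "c1", "hostTaxonId": 9606},
    {"accession": "A2", "cluster": "c1", "hostTaxonId": None},
    {"accession": "B1", "cluster": "c2", "hostTaxonId": 9606},
    {"accession": "B2", "cluster": "c2", "hostTaxonId": 9823},
    {"accession": "B3", "cluster": "c2", "hostTaxonId": None},
    {"accession": "C1", "cluster": "c3", "hostTaxonId": None},
]
result = {r["accession"]: r["inferredHostTaxonId"]
          for r in infer_host_taxon_ids(records)}
```

In this sketch only `A2` receives an inferred value; `B3` is left empty because its cluster disagrees, and `C1` because its cluster provides no hostTaxonId at all.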