Genbank parser dev #61

Open
wants to merge 3 commits into
base: master
from

Conversation

Bdegraaf1234

The genbankParser is designed to be run as a standalone script that generates a formatted, cleaned CSV table of the covid pan-genome from GenBank input. It handles most hostTaxonId mapping errors (which are plentiful) and attempts to infer missing hostTaxonIds for duplicate and highly homologous entries: if every entry in a cluster or duplicate set that provides a hostTaxonId provides the same one, that id is inferred (in a new column) for the entries where none was provided. This infers about 1500 hostTaxonIds.
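
As a rough illustration of the inference rule described above (not the code from this pull request), a minimal R sketch might look like the following. The data frame, the column names `accession`, `cluster` and `inferredHostTaxonId`, the helper `infer_within_cluster`, and all values are assumptions made up for the example.

```r
# Toy input; accessions, cluster ids and taxon ids are illustrative only.
entries <- data.frame(
  accession   = c("A1", "A2", "A3", "B1", "B2", "C1"),
  cluster     = c(1, 1, 1, 2, 2, 3),
  hostTaxonId = c(9606, NA, 9606, 9606, 9397, NA)
)

# For one cluster: if all provided ids agree, fill the gaps with that id;
# if they disagree, or none was provided, leave the values unchanged.
# (Hypothetical helper, not a function from the PR.)
infer_within_cluster <- function(ids) {
  known <- unique(ids[!is.na(ids)])
  if (length(known) == 1) {
    ids[is.na(ids)] <- known
  }
  ids
}

# ave() applies the helper per cluster and keeps the original row order.
# The result goes into a new column so the provided ids stay untouched.
entries$inferredHostTaxonId <- ave(entries$hostTaxonId, entries$cluster,
                                   FUN = infer_within_cluster)
```

In this sketch the missing id in cluster 1 is filled in, cluster 2 is left alone because its entries disagree, and cluster 3 stays `NA` because no entry provides an id.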

@rcedgar
Collaborator

rcedgar commented Apr 27, 2020

Didn't test, but the functionality looks great. Sorry again for treading on this earlier!

@rcedgar
Collaborator

rcedgar commented Apr 27, 2020

Minor suggestion for a possible future enhancement: much of the code is essentially a lookup table, which would be easier to maintain as an external file in (say) TSV format, e.g.

 else if (grepl("Vespadelus baverstocki", NoParenth, ignore.case = TRUE)) {
    return("unclassified Scotoecus")
 }
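
A minimal sketch of what such a table-driven lookup could look like in R is shown below. It is only an illustration under stated assumptions: the column names `pattern` and `replacement` and the function `map_host_name` are hypothetical, the table is inlined so the example is self-contained (in practice it would be read from an external .tsv file), and the single row is the mapping quoted above.

```r
# Hypothetical lookup table, inlined here for a self-contained example.
# The only row is the mapping quoted above; more rows would be added the same way.
lookup_tsv <- "pattern\treplacement
Vespadelus baverstocki\tunclassified Scotoecus"

host_lookup <- read.delim(text = lookup_tsv, stringsAsFactors = FALSE)

# Return the replacement for the first pattern that matches `name`,
# or NA if no pattern matches. (Illustrative name, not from the PR.)
map_host_name <- function(name, lookup = host_lookup) {
  for (i in seq_len(nrow(lookup))) {
    if (grepl(lookup$pattern[i], name, ignore.case = TRUE)) {
      return(lookup$replacement[i])
    }
  }
  NA_character_
}

map_host_name("vespadelus baverstocki")
```

With this layout, adding or correcting a mapping means editing one row of the table rather than another `else if` branch.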

@Bdegraaf1234
Author

I fully agree. I'll try to implement it sometime in the next few days.

@ababaian ababaian self-requested a review April 27, 2020 16:50
@ababaian ababaian linked an issue Apr 27, 2020 that may be closed by this pull request
@mathemage mathemage added this to Task In Progress in TODO List via automation May 2, 2020
@mathemage mathemage moved this from Task In Progress to Code Review in TODO List May 2, 2020
@taltman taltman added this to To do in Serratus Annotation May 17, 2020
@taltman
Collaborator

taltman commented May 17, 2020

What is the status of this pull request? Are we waiting on a review from @ababaian?

@ababaian
Owner

It's good and it's definitely working, but there are some features that need to be added; see the taxonomy issue.

@taltman
Collaborator

taltman commented May 17, 2020

Two questions:

  • Is @Bdegraaf1234 still maintaining this code?
  • Where is this code in the repo?

@taltman
Collaborator

taltman commented May 17, 2020

Never mind the "where" part of the question; I looked at the commits.

@ababaian
Owner

Probably not. We can merge and close this and have someone pick up from here.

@taltman
Collaborator

taltman commented May 17, 2020

I can take a crack at the metadata file, but it sounds like @rcedgar might be doing some of this with #101, so we should coordinate.

@rcedgar
Collaborator

rcedgar commented May 17, 2020

Parsing genbank (this issue) and uniform annotation of reference and predicted genomes (#101) are separate issues. We want both in parallel.

@ababaian ababaian removed this from Code Review in TODO List May 20, 2020
@taltman taltman removed this from To do in Serratus Annotation Jul 7, 2020
Development

Successfully merging this pull request may close these issues.

Taxonomy identifiers for Cov reference database
4 participants