Genbank parser dev #61

Bdegraaf1234 · 2020-04-27T11:43:03Z

The genbankParser is designed to be run as a standalone script that generates a formatted and cleaned CSV table of the covid pan-genome from GenBank input. It handles most hostTaxonId mapping errors (which are plentiful) and attempts to infer hostTaxonIds for duplicate and highly homologous entries: if all annotated members of a cluster or set of duplicates provide the same hostTaxonId, that ID is inferred (in a new column) for the members where none was provided. This infers about 1500 hostTaxonIds.
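The inference step described above can be sketched roughly as follows. This is a minimal illustration, not the code from the notebook: the column names `cluster` and `inferredHostTaxonId`, the function name `infer_host_taxon`, and the example accessions are all hypothetical, and the clustering itself is assumed to have been done already.

```r
# Hypothetical sketch: fill in a host taxon ID for entries that lack one,
# but only when every annotated member of the same cluster agrees.
infer_host_taxon <- function(df) {
  # Group row indices by cluster so each group can be inspected on its own.
  groups <- split(seq_len(nrow(df)), df$cluster)
  inferred <- df$hostTaxonId
  for (idx in groups) {
    ids <- df$hostTaxonId[idx]
    known <- unique(ids[!is.na(ids)])
    # Infer only when exactly one distinct ID is present in the cluster.
    if (length(known) == 1) {
      inferred[idx] <- known
    }
  }
  # Keep the original column untouched; write the result to a new column.
  df$inferredHostTaxonId <- inferred
  df
}

# Illustrative data only (accessions and cluster labels are made up).
entries <- data.frame(
  accession = c("A1", "A2", "A3", "B1", "B2", "C1"),
  cluster = c("c1", "c1", "c1", "c2", "c2", "c3"),
  hostTaxonId = c(9606, NA, 9606, 9606, 59463, NA),
  stringsAsFactors = FALSE
)
result <- infer_host_taxon(entries)
```

In this sketch, cluster `c1` agrees on a single ID, so its unannotated member receives it. Cluster `c2` is left alone because its members disagree, and `c3` because no member carries an ID at all.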

rcedgar · 2020-04-27T14:58:35Z

Didn't test, but functionality looks great, sorry again for treading on this earlier!

rcedgar · 2020-04-27T15:05:35Z

Minor suggestion for a possible future enhancement: much of the code is essentially a lookup table, which would be easier to maintain as an external file in (say) TSV format, e.g.

 else if (grepl(fixed = FALSE, "Vespadelus baverstocki", NoParenth, ignore.case = TRUE)) {
    return("unclassified Scotoecus")
 }
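As a rough sketch of what the table-driven version might look like: the TSV column names `pattern` and `replacement` and the function name `map_host_name` are assumptions for illustration, and the table is inlined as a string here only to keep the example self-contained. In practice it would be read from an external file.

```r
# Hypothetical TSV contents; the single row mirrors the branch quoted above.
lookup_tsv <- "pattern\treplacement
Vespadelus baverstocki\tunclassified Scotoecus
"
# read.delim defaults to tab-separated input with a header row.
host_lookup <- read.delim(text = lookup_tsv, stringsAsFactors = FALSE)

# Return the replacement for the first pattern that matches, or the
# input name unchanged when nothing in the table matches.
map_host_name <- function(name, lookup) {
  for (i in seq_len(nrow(lookup))) {
    if (grepl(lookup$pattern[i], name, ignore.case = TRUE)) {
      return(lookup$replacement[i])
    }
  }
  name
}

mapped <- map_host_name("vespadelus baverstocki", host_lookup)
unmapped <- map_host_name("Homo sapiens", host_lookup)
```

Rows are checked in order, which preserves the first-match behaviour of an `else if` chain. Adding a new mapping then means adding a line to the TSV rather than another branch in the code.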

Bdegraaf1234 · 2020-04-27T16:01:27Z

I fully agree, I'll try to implement it sometime in the next few days.

taltman · 2020-05-17T07:15:34Z

What is the status of this pull request? Are we waiting on a review from @ababaian ?

ababaian · 2020-05-17T16:00:00Z

It's good and it's definitely working, but there are some features that still need to be added; see the taxonomy issue.

taltman · 2020-05-17T22:04:25Z

Two questions:

Is @Bdegraaf1234 still maintaining this code?
Where is this code in the repo?

taltman · 2020-05-17T22:05:10Z

Never mind the "where" part of the question; I looked at the commits.

ababaian · 2020-05-17T22:08:14Z

Probably not. We can merge and close this and have someone pick up from here.

taltman · 2020-05-17T22:45:41Z

I can take a crack at the metadata file, but it sounds like @rcedgar might be doing some of this with #101, so we should coordinate.

rcedgar · 2020-05-17T23:56:41Z

Parsing genbank (this issue) and uniform annotation of reference and predicted genomes (#101) are separate issues. We want both in parallel.

Bdegraaf1234 and others added 3 commits April 27, 2020 13:07

added a notebook cleaning and parsing the covid pan genome as obtained from genbank

6079108

Create README

737c23d

commit unsaved changes

26d0909

ababaian self-requested a review April 27, 2020 16:50

ababaian linked an issue Apr 27, 2020 that may be closed by this pull request

Taxonomy identifiers for Cov reference database #45

Closed

mathemage assigned rcedgar and Bdegraaf1234 May 2, 2020

mathemage added this to Task In Progress in TODO List via automation May 2, 2020

mathemage moved this from Task In Progress to Code Review in TODO List May 2, 2020

rcedgar approved these changes May 2, 2020

taltman added this to To do in Serratus Annotation May 17, 2020

ababaian mentioned this pull request May 17, 2020

Deliver uniform annotation of reference coronavirus genomes #101

Closed

ababaian removed this from Code Review in TODO List May 20, 2020

taltman removed this from To do in Serratus Annotation Jul 7, 2020
