Erroneous Newick output #17

rossmounce · 2015-08-07T16:53:30Z

Here's the full list of 16 erroneous Newick files:

001123

description: the problem is fairly obvious here. OCR has interpreted a . (full stop) as a , (comma): "B,multivorans"
In Newick, commas are special characters, so this causes problems.

001149
https://github.com/ContentMine/phylotree/blob/master/errors/TreeGraph2-validation-tests/ijs.0.001149-0-003.pbm.nwk

description: a very odd file with 199 empty/unlabelled tips! It makes a lot of sense when you view the source image: https://github.com/rossmounce/pluto-ONS/blob/master/testing/output/ijs.0.001149-0-003.pbm.png/ijs.0.001149-0-003.pbm.png
AMI has clearly tried quite faithfully to interpret this odd image; the result just isn't valid Newick.

003160

description: this is the problem bit: ":::::::::::::47.0" (unknown cause)

019687

description: comma inserted in taxon label: "X74685,D25307"

022285

description: very poor source image, poor OCR output, comma in taxon label: "(,V/lbrloxuuLMG21346T"

02251

description: not sure what the problem is here; perhaps some single quote (') marks?

02303

description: not sure. Perhaps the slash symbol (/) or the single quote marks (')

02328

description: possibly the negative branch lengths ":-59.0,Oxyrrhismarina"; there's also a square bracket symbol ([) in "Chlamydomonas[noen", and single quote marks

02329

description: taxon label with a dollar sign in it, "US$433:27.0"; also negative branch lengths ":-28.0,99:-16.0,:-17.0,99:-6.0"

02770

description: at least three slashes in taxon labels, e.g. "B.Vinson/isubsp"

02792

description: loads of odd symbols ":26.0,5'5°", lots of exclamation marks "V.loge!35077:37.0", and a pound symbol "£31"

02806

description: negative branch lengths "(:-2.0,:-5.0)NT1.16:-4.0" and a single quote mark

02994

description: slash "Methanobacler/umIvanovii", single quotes, and negative branch lengths "(:-1.0,:182.0)"

63077

description: colon in taxon name "Peptostreptococcusmicro::135.0,"

63400

description: single quote marks and double quote marks
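
A rough sketch of how the character-level problems listed above might be flagged automatically. The allowed-character set and the function names are assumptions made for illustration; no agreed Newick validator stands behind them. A comma that OCR has put inside a label (as in "B,multivorans") cannot be caught this way, because it looks exactly like a structural comma.

```python
import re

# Assumption: labels may contain only letters, digits, ".", "_" and "-";
# "(", ")", ",", ":" and ";" are Newick structure; whitespace is ignored.
SUSPECT = re.compile(r"[^A-Za-z0-9._\-(),:;\s]")
COLON_RUN = re.compile(r":{2,}")


def suspicious_characters(newick):
    """Return the sorted, de-duplicated characters that are neither
    structural nor in the assumed label alphabet."""
    return sorted(set(SUSPECT.findall(newick)))


def has_colon_run(newick):
    """True if two or more colons appear in a row, as in '::135.0'."""
    return COLON_RUN.search(newick) is not None


if __name__ == "__main__":
    sample = "(US$433:27.0,V.loge!35077:37.0);"
    print(suspicious_characters(sample))
    print(has_colon_run("Peptostreptococcusmicro::135.0,"))
```

Negative branch lengths pass this scan on purpose, since "-" is in the assumed label alphabet; they need a separate check.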

petermr · 2015-08-07T17:57:14Z

How do I know what file created this? Do we have a numbering system for
batches? The file should give the batch ID (which should be described in the
"about" pages in phylotree/) and then the filename.

On Fri, Aug 7, 2015 at 6:49 PM, Peter Murray-Rust <
peter.murray.rust@googlemail.com> wrote:

This looks like one Newick file, not a list.

Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

rossmounce · 2015-08-07T18:30:46Z

There is only 1 batch. I ran through them all in a bash loop with a timeout command. Sorry, I edited your comment rather than writing a new comment. Still learning these mobile phone app icons...

petermr · 2015-08-08T10:14:11Z

Yes. We should probably record these as garbles. This is what

    public void setSpeciesPattern(Pattern speciesPattern) ;

has been created for. (`Patter

The question is whether to

- abort the CTree
- replace the tip by some reserved Newick-friendly message, e.g. "GARBLED", which cannot be a taxon.

I think we should do the latter and I'll try to code it.
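
The second option can be sketched as follows. This is a Python illustration only, not the Java code behind `setSpeciesPattern`; the pattern itself and the `clean_tip` helper are hypothetical.

```python
import re

GARBLED = "GARBLED"  # reserved placeholder proposed in the comment above

# Hypothetical species pattern: a capital letter, one or more lower-case
# letters, then any mix of letters, digits, "." and "_".
SPECIES_PATTERN = re.compile(r"[A-Z][a-z]+[A-Za-z0-9._]*")


def clean_tip(label, pattern=SPECIES_PATTERN):
    """Keep the label if the whole of it matches the pattern,
    otherwise substitute the reserved placeholder."""
    return label if pattern.fullmatch(label) else GARBLED


if __name__ == "__main__":
    for tip in ["Oxyrrhismarina", "Chlamydomonas[noen", "US$433"]:
        print(tip, "->", clean_tip(tip))
```

Replacing rather than aborting keeps the tree topology available for inspection while making the unreadable tips easy to find.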

rossmounce · 2015-08-08T10:19:31Z

The latter seems reasonable to me, but perhaps too strict (?)

I was imagining perhaps just a post-OCR filter to remove all special characters: replace them with nothing (no space) or, for certain special characters, with the most likely replacement candidate, e.g. replace $ with S.
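
A minimal sketch of such a post-OCR filter, assuming a small substitution table and an allowed alphabet of letters, digits, ".", "_" and "-". Both are illustrative guesses, as is the `filter_label` name.

```python
import re

# Hypothetical table: special character -> most likely intended character.
LIKELY_REPLACEMENTS = {"$": "S"}

# Assumption: every other character outside this alphabet is dropped.
NOT_ALLOWED = re.compile(r"[^A-Za-z0-9._\-]")


def filter_label(label):
    """Substitute known look-alikes first, then delete what is still
    outside the allowed alphabet (replaced with nothing, no space)."""
    substituted = "".join(LIKELY_REPLACEMENTS.get(ch, ch) for ch in label)
    return NOT_ALLOWED.sub("", substituted)


if __name__ == "__main__":
    for raw in ["US$433", "V.loge!35077", "B.Vinson/isubsp"]:
        print(raw, "->", filter_label(raw))
```

Blind deletion has a cost: in "B.Vinson/isubsp" the slash probably stands for a misread letter, so removing it loses information that a substitution rule could have kept.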

petermr · 2015-08-08T10:31:22Z

The problem is that the OCR doesn't know what it is being used for. Maybe I can add a pre-OCR filter to limit the characters allowed. But this will have to wait. (Phylo is the only plugin that uses OCR ATM). I'll think about it.
This is a classic problem (https://en.wikipedia.org/wiki/Error_detection_and_correction): we can detect errors, but can we correct them? This will require heuristics.

petermr · 2015-08-08T10:37:15Z

Ross>>I was imagining perhaps just a post-OCR filter to remove all special characters.

We already have one. I'll look it up. It corrects "ch/o" to "chlo", etc.

petermr · 2015-08-08T10:53:27Z

See "/org/xmlcml/norma/images/ocr/italicGarbles.xml" in norma

<garbles title="italicGarbles" characters="I/">
  <!-- italic "l" -->
  <garble original="a/" edited="al"/>
  <garble original="e/" edited="el"/>
  <garble original="i/" edited="il"/>
  <garble original="I/" edited="ll"/>
  <garble original="l/" edited="ll"/>
  <garble original="o/" edited="ol"/>
  <garble original="u/" edited="ul"/>
  <garble original="y/" edited="yl"/>
  <garble original="r/o" edited="rio"/>
  <garble original="s/c" edited="sic"/>
  <garble original="g/uc" edited="gluc"/>
  <garble original="g/yc" edited="glyc"/>
  <garble original="d/\[" edited="cili"/>
  <garble original="f/uo" edited="fluo"/>
  <garble original="d/\[" edited="cili"/>
  <garble original="/\[" edited="il"/>
  <garble original="t/G" edited="tic"/>
  <garble original="k/n" edited="kin"/>

  <garble original="lI" edited="ll"/>

  <!-- these may be ambiguous -->
  <garble original="C/" edited="cl"/>
  <garble original="c/" edited="ci"/>
  <garble original="/Us" edited="ius"/>

  <!-- very ambiguous -->
  <garble original="n'x" edited="rix"/>

  <!-- ligatures -->
  <garble original="ﬁ" edited="fi"/>
</garbles>

These are believable garbles. The problem is that some images are not appropriate for analysis.
P.
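
A hedged sketch of how a garbles file like the one above could be applied to OCR output. It is written in Python for brevity and is not norma's implementation. It assumes that each `original` attribute is a regular expression (suggested by the escaped `\[`) and that longer patterns should be tried first; only a few entries from the file are reproduced.

```python
import re
import xml.etree.ElementTree as ET

# A subset of the entries quoted above, kept in a raw string so that
# the backslash in "d/\[" reaches the regex engine unchanged.
GARBLES_XML = r"""
<garbles title="italicGarbles" characters="I/">
  <garble original="o/" edited="ol"/>
  <garble original="r/o" edited="rio"/>
  <garble original="g/uc" edited="gluc"/>
  <garble original="d/\[" edited="cili"/>
  <garble original="lI" edited="ll"/>
</garbles>
"""


def load_garbles(xml_text):
    """Return (original, edited) pairs in document order."""
    root = ET.fromstring(xml_text)
    return [(g.get("original"), g.get("edited")) for g in root.iter("garble")]


def apply_garbles(text, garbles):
    """Apply each rule as a regex substitution, longest pattern first,
    so that specific rules win over the two-character ones."""
    for original, edited in sorted(garbles, key=lambda g: -len(g[0])):
        # A function replacement inserts 'edited' literally.
        text = re.sub(original, lambda _m, e=edited: e, text)
    return text


if __name__ == "__main__":
    rules = load_garbles(GARBLES_XML)
    for raw in ["g/ucose", "Vibr/o", "po/ymyxa", "fad/[s"]:
        print(raw, "->", apply_garbles(raw, rules))
```

Rule order matters whenever patterns overlap, which is one reason the ambiguous entries are grouped separately in the file.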

petermr · 2015-08-08T12:57:21Z

The architecture has the OCR in norma. At this stage we don't know the
context, and if we are processing money-related or computer-related
documents, "$" may be intended.

The interpretation has to be at the domain level. That means ami-phylo in
this case.

P.

petermr · 2015-08-08T13:39:55Z

Your comments are very useful. As a first pass we can say:

- negative lengths are a programming error
- commas are reserved characters and should be replaced

Because there is no agreed Newick validator we have to guess the rest of the rules. For example, Trex says that a Newick file must have at least 3 taxa. Is that true of all validators? Does Newick require a valid taxon name? I don't know, but I think some of my files with only node IDs have passed Trex.

The appropriate action is to compile a list of failing images and create a failing test from them, analogous to MergeTipTest.testConvertLabelsAndTreeAndMerge(). I'll incorporate them and gradually debug.
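
The first-pass rules could be checked roughly as below. This is a sketch, not a validator: the function names are hypothetical, and the minimum of 3 taxa is only what Trex is reported to require in this thread. Tips are counted as commas + 1, which holds for a well-formed tree only when no label contains a comma.

```python
import re

# A colon followed by a minus sign and a digit marks a negative length.
NEGATIVE_LENGTH = re.compile(r":\s*-\d")


def count_tips(newick):
    """Tips = commas + 1 in a well-formed tree with comma-free labels."""
    return newick.count(",") + 1


def first_pass_problems(newick, min_taxa=3):
    """Return the first-pass rules that the Newick string breaks."""
    problems = []
    if NEGATIVE_LENGTH.search(newick):
        problems.append("negative branch length")
    if count_tips(newick) < min_taxa:
        problems.append("fewer than %d taxa" % min_taxa)
    return problems


if __name__ == "__main__":
    print(first_pass_problems("((A:1.0,B:2.0):1.0,C:3.0);"))
    print(first_pass_problems("(:-2.0,:-5.0)NT1.16:-4.0;"))
```

A stray OCR comma inside a label inflates the tip count, so the comma rule has to be dealt with before the taxon count can be trusted.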

petermr · 2015-08-08T13:51:44Z

Trying to check your failures. Most are not in the https://github.com/rossmounce/pluto-ONS/tree/master/testing/output directory. I suspect some numbers may not be correct. As before, collect just the failing ones.

rossmounce · 2015-08-08T14:37:33Z

They are all there, all 4000+. If you are browsing via github.com with a web browser, it will show you just the first 1000 files or folders. Perhaps this could be the problem? If you clone the repo locally to your hard drive, everything should be there.

petermr · 2015-08-08T15:08:30Z

OK, thx
Probably a good idea to select out the problem files.

rossmounce · 2015-08-08T15:52:56Z

Which files? The source image file?
If you know the ID of it you can go directly to it without even cloning the whole repo, via a web browser:

Just append it to this URL, e.g. https://github.com/rossmounce/pluto-ONS/tree/master/testing/output/
'+'
ijs.0.000174-0-000.pbm.png

= https://github.com/rossmounce/pluto-ONS/tree/master/testing/output/ijs.0.000174-0-000.pbm.png

The structure and file names are all standardised.
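
The same lookup as a short Python sketch; the `image_url` helper name is made up for illustration, and the base URL is the one given above.

```python
BASE_URL = "https://github.com/rossmounce/pluto-ONS/tree/master/testing/output/"


def image_url(file_name):
    """Append a file name such as 'ijs.0.000174-0-000.pbm.png'
    to the base URL of the output directory."""
    return BASE_URL + file_name


if __name__ == "__main__":
    print(image_url("ijs.0.000174-0-000.pbm.png"))
```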

rossmounce mentioned this issue Aug 7, 2015

validation of Newick output #16

Open

rossmounce assigned petermr Aug 7, 2015

petermr assigned rossmounce and unassigned petermr Aug 7, 2015
