Erroneous Newick output #17

Open
rossmounce opened this issue Aug 7, 2015 · 13 comments

@rossmounce

Here's the full list of 16 erroneous Newick files:


  • 001123

description: the problem is fairly obvious here. OCR has interpreted a . (full stop) as a , (comma): "B,multivorans"
In Newick, commas are special characters, so this causes problems.

  • 001149

description: a very odd file with 199 empty/unlabelled tips! It makes a lot of sense when you view the source image: https://github.com/rossmounce/pluto-ONS/blob/master/testing/output/ijs.0.001149-0-003.pbm.png/ijs.0.001149-0-003.pbm.png
AMI has clearly tried quite faithfully to interpret this odd image; the result just isn't valid Newick.

  • 003160

description this is the problem bit: ":::::::::::::47.0" unknown cause

  • 019687

description: comma inserted in taxon label: "X74685,D25307"

  • 022285

description: very poor source image, poor OCR output, comma in taxon label: "(,V/lbrloxuuLMG21346T"

  • 02251

description: not sure what the problem is here; some single quote marks (') perhaps?

  • 02303

description: not sure. Perhaps the slash symbol (/) or the single quote marks (')

  • 02328

description: possibly the negative branch lengths ":-59.0,Oxyrrhismarina"; there is also a square bracket symbol ([) in "Chlamydomonas[noen", and single quote marks

  • 02329

description: taxon label with a dollar sign in it "US$433:27.0", also negative branch lengths ":-28.0,99:-16.0,:-17.0,99:-6.0"

  • 02770

description: at least three slashes in taxon labels, e.g. "B.Vinson/isubsp"

  • 02792

description: loads of odd symbols ":26.0,5'5°", lots of exclamation marks "V.loge!35077:37.0", pound symbol "£31"

  • 02806

description: negative branch lengths "(:-2.0,:-5.0)NT1.16:-4.0" and a single quote mark

  • 02994

description: slash "Methanobacler/umIvanovii", single quotes and negative branch lengths "(:-1.0,:182.0)"

  • 63077

description: colon in taxon name "Peptostreptococcusmicro::135.0,"

  • 63400

description: single quote marks and double quote marks

@petermr

petermr commented Aug 7, 2015

How do I know which file created this? Do we have a numbering system for
batches? The file should give the batch ID (which should be described in the
"about" pages in phylotree/) and then the filename.

On Fri, Aug 7, 2015 at 6:49 PM, Peter Murray-Rust wrote:

This looks like one Newick file, not a list.

On Fri, Aug 7, 2015 at 5:59 PM, Ross Mounce wrote:

Assigned #17 to @petermr.

Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069


petermr assigned rossmounce and unassigned petermr Aug 7, 2015
@rossmounce

There is only 1 batch. I ran through them all in a bash loop with a timeout command. Sorry, I edited your comment rather than writing a new comment. Still learning these mobile phone app icons...

@petermr

petermr commented Aug 8, 2015

Yes. We should probably record these as garbles. This is what

    public void setSpeciesPattern(Pattern speciesPattern) ;

has been created for. (`Patter

The question is whether to

  • abort the CTree
  • replace the tip with some reserved Newick-friendly message, e.g. "GARBLED", which cannot be a taxon.

I think we should do the latter and I'll try to code it.
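
The second option could be sketched roughly as follows. This is only an illustration, not the ami-phylo code: the class `GarbledTipReplacer`, its methods, and the example pattern are hypothetical, and the only thing taken from the discussion is the idea of a species `Pattern` and the reserved label "GARBLED".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of ami-phylo): swaps unrecognised tip labels for a placeholder. */
final class GarbledTipReplacer {
    /** Reserved label that cannot be mistaken for a taxon. */
    static final String GARBLED = "GARBLED";

    private final Pattern speciesPattern;

    GarbledTipReplacer(Pattern speciesPattern) {
        this.speciesPattern = speciesPattern;
    }

    /** Returns the label unchanged if it fully matches the pattern, otherwise the placeholder. */
    String clean(String tipLabel) {
        boolean matches = tipLabel != null && speciesPattern.matcher(tipLabel).matches();
        return matches ? tipLabel : GARBLED;
    }

    /** Applies clean() to every tip label, preserving order. */
    List<String> cleanAll(List<String> tipLabels) {
        List<String> cleaned = new ArrayList<>();
        for (String label : tipLabels) {
            cleaned.add(clean(label));
        }
        return cleaned;
    }
}
```

With an assumed pattern such as `Pattern.compile("[A-Za-z0-9._]+")`, "B.multivorans" passes through unchanged while "B,multivorans" and "US$433" become "GARBLED". The tree topology is kept, which is the advantage over aborting the whole CTree.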

@rossmounce

latter seems reasonable to me but perhaps too strict (?)

I was imagining perhaps just a post-OCR filter to remove all special characters. Replace with nothing (no space) or for certain special characters, e.g. $ replace with its most likely replacement candidate: S
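
A minimal sketch of such a post-OCR filter, applied to a single tip label. Everything here is an assumption for illustration: the class name `PostOcrLabelFilter`, the set of characters that are kept (letters, digits, "." and "_"), and the look-alike table, of which only "$" to "S" comes from this thread.

```java
import java.util.Map;

/** Hypothetical post-OCR filter for one tip label (not existing norma/ami code). */
final class PostOcrLabelFilter {
    // Assumed look-alike table; only '$' -> 'S' is suggested in the discussion.
    private static final Map<Character, Character> LOOKALIKES = Map.of('$', 'S');

    /** Keeps letters, digits, '.' and '_'; maps known look-alikes; drops everything else. */
    static String filter(String label) {
        StringBuilder sb = new StringBuilder(label.length());
        for (char c : label.toCharArray()) {
            if (Character.isLetterOrDigit(c) || c == '.' || c == '_') {
                sb.append(c);
            } else if (LOOKALIKES.containsKey(c)) {
                sb.append(LOOKALIKES.get(c));
            }
            // any other character is replaced with nothing
        }
        return sb.toString();
    }
}
```

For example, `PostOcrLabelFilter.filter("US$433")` gives "USS433" and `filter("V.loge!35077")` gives "V.loge35077". It has to run on labels rather than on the whole Newick string, because commas, colons and brackets are structural there.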

@petermr

petermr commented Aug 8, 2015

The problem is that the OCR doesn't know what it is being used for. Maybe I can add a pre-OCR filter to limit the characters allowed. But this will have to wait. (Phylo is the only plugin that uses OCR ATM). I'll think about it.
This is a classic problem (https://en.wikipedia.org/wiki/Error_detection_and_correction): we can detect errors, but can we correct them? This will require heuristics.

@petermr

petermr commented Aug 8, 2015

Ross>>I was imagining perhaps just a post-OCR filter to remove all special characters.

We already have one. I'll look it up. It corrects "ch/o" to "chlo", etc.

@petermr

petermr commented Aug 8, 2015

See "/org/xmlcml/norma/images/ocr/italicGarbles.xml" in norma

<garbles title="italicGarbles" characters="I/">
  <!-- italic "l" -->
  <garble original="a/" edited="al"/>
  <garble original="e/" edited="el"/>
  <garble original="i/" edited="il"/>
  <garble original="I/" edited="ll"/>
  <garble original="l/" edited="ll"/>
  <garble original="o/" edited="ol"/>
  <garble original="u/" edited="ul"/>
  <garble original="y/" edited="yl"/>
  <garble original="r/o" edited="rio"/>
  <garble original="s/c" edited="sic"/>
  <garble original="g/uc" edited="gluc"/>
  <garble original="g/yc" edited="glyc"/>
  <garble original="d/\[" edited="cili"/>
  <garble original="f/uo" edited="fluo"/>
  <garble original="d/\[" edited="cili"/>
  <garble original="/\[" edited="il"/>
  <garble original="t/G" edited="tic"/>
  <garble original="k/n" edited="kin"/>

  <garble original="lI" edited="ll"/>

  <!-- these may be ambiguous -->
  <garble original="C/" edited="cl"/>
  <garble original="c/" edited="ci"/>
  <garble original="/Us" edited="ius"/>

  <!-- very ambiguous -->
   <garble original="n'x" edited="rix"/>

  <!-- ligatures -->
  <garble original="ﬁ" edited="fi"/>
</garbles>

These are believable garbles. Problem is that some images are not appropriate for analysis.
P.
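
To show how a table like this might be applied, here is a rough sketch. The class `GarbleCorrector` is hypothetical and is not the norma implementation: it treats each `original` as a literal string and applies the rules in insertion order, whereas the escaped `\[` in the file suggests the real originals may be regular expressions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: applies garble -> edited substitutions in insertion order. */
final class GarbleCorrector {
    // LinkedHashMap keeps insertion order, so longer rules can be added first.
    private final Map<String, String> rules = new LinkedHashMap<>();

    /** Registers one rule; returns this so calls can be chained. */
    GarbleCorrector add(String original, String edited) {
        rules.put(original, edited);
        return this;
    }

    /** Replaces every literal occurrence of each rule's original with its edited form. */
    String correct(String text) {
        String result = text;
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            result = result.replace(rule.getKey(), rule.getValue());
        }
        return result;
    }
}
```

For instance, `new GarbleCorrector().add("g/uc", "gluc").add("i/", "il").correct("Baci/lus")` yields "Bacillus". Adding the multi-character rules before the two-character ones avoids a short rule rewriting part of a longer garble first.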

@petermr

petermr commented Aug 8, 2015

The architecture has the OCR in norma. At this stage we don't know the context, and if we are processing money-related or computer-related documents, "$" may be intended.

The interpretation has to be at the domain level. That means ami-phylo in this case.

P.


@petermr

petermr commented Aug 8, 2015

Your comments are very useful. As a first pass we can say:

  • negative lengths are a programming error
  • commas are reserved characters and should be replaced

Because there is no agreed Newick validator we have to guess the rest of the rules. For example, Trex says that a Newick file must have at least 3 taxa. Is that true of all validators? Does Newick require a valid taxon name? I don't know, but I think some of my files with only node IDs have passed Trex.

The appropriate action is to compile a list of failing images and create a failing test from them, analogous to MergeTipTest.testConvertLabelsAndTreeAndMerge(). I'll incorporate them and gradually debug.
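
The two first-pass rules could be checked with something like the sketch below. It is a heuristic under stated assumptions, not a Newick validator: the class `NewickSanityCheck` and its methods are hypothetical, and the list of suspect label characters is simply assembled from the symbols reported in this issue.

```java
import java.util.regex.Pattern;

/** Hypothetical first-pass checks; a heuristic, not a full Newick validator. */
final class NewickSanityCheck {
    // A colon followed directly by a minus sign and a digit, e.g. ":-59.0".
    private static final Pattern NEGATIVE_LENGTH = Pattern.compile(":-\\d");

    // Assumed list: Newick structural characters plus symbols seen in the failing files.
    private static final Pattern SUSPECT_LABEL_CHAR = Pattern.compile("[,:;()\\[\\]'\"/$!]");

    /** True if the Newick string contains at least one negative branch length. */
    static boolean hasNegativeLength(String newick) {
        return NEGATIVE_LENGTH.matcher(newick).find();
    }

    /** True if a single tip label contains a reserved or suspicious character. */
    static boolean hasSuspectChar(String label) {
        return SUSPECT_LABEL_CHAR.matcher(label).find();
    }
}
```

`hasNegativeLength("(:-2.0,:-5.0)NT1.16:-4.0")` and `hasSuspectChar("X74685,D25307")` both return true, so checks of this kind could serve as assertions in the failing test once the problem files are collected.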

@petermr

petermr commented Aug 8, 2015

I'm trying to check your failures. Most are not in the https://github.com/rossmounce/pluto-ONS/tree/master/testing/output directory. I suspect some numbers may not be correct. As before, please collect just the failing ones.

@rossmounce

They are all there, all 4000+. If you are browsing via github.com with a web browser it will show you just the first 1000 files or folders. Perhaps this could be the problem? If you clone the repo locally to your hard drive everything should be there.

@petermr

petermr commented Aug 8, 2015

OK, thx
Probably a good idea to select out the problem files.


@rossmounce

Which files? The source image file?
If you know the ID of it you can go directly to it without even cloning the whole repo, via a web browser:

Just append it to this URL, e.g.

https://github.com/rossmounce/pluto-ONS/tree/master/testing/output/ + ijs.0.000174-0-000.pbm.png

= https://github.com/rossmounce/pluto-ONS/tree/master/testing/output/ijs.0.000174-0-000.pbm.png

It's all standardised structure and file names.
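
If the lookup needs to be scripted, the same concatenation can be written as a small helper. The class name `SourceImageUrl` and the method `forId` are made up for illustration; the base URL and the example ID are the ones given above.

```java
/** Hypothetical helper: builds the GitHub URL for one output folder from its ID. */
final class SourceImageUrl {
    static final String BASE =
            "https://github.com/rossmounce/pluto-ONS/tree/master/testing/output/";

    /** Appends the standardised file ID to the base URL. */
    static String forId(String fileId) {
        return BASE + fileId;
    }
}
```

`SourceImageUrl.forId("ijs.0.000174-0-000.pbm.png")` reproduces the URL shown above.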
