Erroneous Newick output #17
Comments
How do I know what file created this? Do we have a numbering system for
On Fri, Aug 7, 2015 at 6:49 PM, Peter Murray-Rust <
Peter Murray-Rust |
There is only 1 batch. I ran through them all in a bash loop with a timeout command. Sorry, I edited your comment rather than writing a new comment. Still learning these mobile phone app icons... |
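A batch run with a per-file timeout, as described in the comment above, could look roughly like the following. This is a Python sketch and not the bash loop that was actually used; the function name `run_with_timeout`, its parameters and the 60 second default are assumptions made for illustration, and the command to run on each file is left to the caller.

```python
import subprocess


def run_with_timeout(command, inputs, timeout_s=60.0):
    """Run `command + [item]` once per input (hypothetical helper).

    Returns two lists: inputs whose process ended within the time limit,
    and inputs whose process was killed for exceeding it.
    """
    finished, timed_out = [], []
    for item in inputs:
        try:
            # subprocess.run kills the child and raises TimeoutExpired
            # when the limit is exceeded.
            subprocess.run(command + [str(item)], timeout=timeout_s, check=False)
            finished.append(item)
        except subprocess.TimeoutExpired:
            timed_out.append(item)
    return finished, timed_out
```

Collecting the timed-out inputs separately makes it easy to list just the files that never produced output.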
Yes. We should probably record these as garbles. This is what
has been created for. (`PatterThe question is whether to
|
The latter seems reasonable to me but perhaps too strict (?). I was imagining just a post-OCR filter to remove all special characters: replace each one with nothing (no space) or, for certain special characters, with its most likely replacement candidate, e.g. replace $ with S. |
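The post-OCR filter suggested above might be sketched as follows. This is only an illustration: `clean_label`, `LOOKALIKES` and the set of characters kept are assumptions, and only the `$` to `S` substitution comes from the comment itself.

```python
import re

# Hypothetical lookalike table; only "$" -> "S" is taken from the discussion.
LOOKALIKES = {"$": "S"}

# Assumption: letters, digits, dot, underscore and hyphen are worth keeping.
NOT_ALLOWED = re.compile(r"[^A-Za-z0-9._-]")


def clean_label(label):
    """Swap known lookalikes first, then delete every other special character."""
    swapped = "".join(LOOKALIKES.get(ch, ch) for ch in label)
    return NOT_ALLOWED.sub("", swapped)
```

Note that plain deletion cannot repair every case: a label such as "B,multivorans" would become "Bmultivorans" rather than "B.multivorans".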
The problem is that the OCR doesn't know what it is being used for. Maybe I can add a pre-OCR filter to limit the characters allowed. But this will have to wait. (Phylo is the only plugin that uses OCR ATM). I'll think about it. |
Ross>> I was imagining perhaps just a post-OCR filter to remove all special characters.
We already have one. I'll look it up. It corrects "ch/o" to "chlo", etc. |
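A substitution filter of the kind mentioned above can be pictured as a lookup table of known garbles. The sketch below is not the existing implementation, whose format is not shown in this thread; `GARBLES` and `fix_garbles` are hypothetical names, and only the "ch/o" to "chlo" pair is taken from the comment.

```python
# Hypothetical garble table; a real one would hold many more entries.
GARBLES = {"ch/o": "chlo"}


def fix_garbles(text, table=GARBLES):
    """Replace each known garbled fragment with its corrected form."""
    for wrong, right in table.items():
        text = text.replace(wrong, right)
    return text
```

For example, `fix_garbles("Proch/orococcus")` yields "Prochlorococcus".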
See "/org/xmlcml/norma/images/ocr/italicGarbles.xml" in
These are believable garbles. The problem is that some images are not appropriate for analysis. |
The architecture has the OCR in The interpretation has to be at the domain level. That means P. On Sat, Aug 8, 2015 at 11:19 AM, Ross Mounce notifications@github.com
Peter Murray-Rust |
Your comments are very useful. As a first pass we can say:
|
Trying to check your failures. Most are not in the https://github.com/rossmounce/pluto-ONS/tree/master/testing/output folder. I suspect some numbers may not be correct. As before, collect just the failing ones. |
They are all there, all 4000+. If you are browsing via github.com with a web browser, it will show you just the first 1000 files or folders. Perhaps this is the problem? If you clone the repo locally to your hard drive, everything should be there. |
OK, thx On Sat, Aug 8, 2015 at 3:37 PM, Ross Mounce notifications@github.com
Peter Murray-Rust |
Which files? The source image file? Just append its name to this URL, e.g. https://github.com/rossmounce/pluto-ONS/tree/master/testing/output/ + ijs.0.000174-0-000.pbm.png = https://github.com/rossmounce/pluto-ONS/tree/master/testing/output/ijs.0.000174-0-000.pbm.png (the structure and file names are all standardised)
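The URL rule described above amounts to simple string concatenation. A minimal sketch, where the helper name `source_image_url` is an assumption and the base URL is the one quoted in the comment:

```python
BASE_URL = "https://github.com/rossmounce/pluto-ONS/tree/master/testing/output/"


def source_image_url(image_name, base_url=BASE_URL):
    """Append an image file name to the output folder URL (hypothetical helper)."""
    # Normalise the trailing slash so the result has exactly one separator.
    return base_url.rstrip("/") + "/" + image_name
```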
Here's the full list of 16 erroneous Newick files:
Details
description: problem fairly obvious here. OCR has interpreted a . as a , (comma). "B,multivorans"
In Newick, commas separate sibling nodes, so an unquoted comma inside a taxon label causes problems.
https://github.com/ContentMine/phylotree/blob/master/errors/TreeGraph2-validation-tests/ijs.0.001149-0-003.pbm.nwk
description: a very odd file with 199 empty/unlabelled tips! It makes a lot of sense when you view the source image: https://github.com/rossmounce/pluto-ONS/blob/master/testing/output/ijs.0.001149-0-003.pbm.png/ijs.0.001149-0-003.pbm.png AMI has clearly tried quite faithfully to interpret this odd image; the result just isn't valid Newick.
description: this is the problem bit: ":::::::::::::47.0" (unknown cause)
description: comma inserted in taxon label: "X74685,D25307"
description: very poor source image, poor OCR output, comma in taxon label: "(,V/lbrloxuuLMG21346T"
description: not sure what the problem is here; some single quote marks (') perhaps?
description: not sure. Perhaps the slash symbol (/) or the single quote marks (')
description: possibly the negative branch lengths ":-59.0,Oxyrrhismarina"; there is also a square bracket symbol [ in "Chlamydomonas[noen", and single quote marks
description: taxon label with a dollar sign in it "US$433:27.0"; also negative branch lengths ":-28.0,99:-16.0,:-17.0,99:-6.0"
description: at least three slashes in taxon labels, e.g. "B.Vinson/isubsp"
description: loads of odd symbols ":26.0,5'5°", lots of exclamation marks "V.loge!35077:37.0", pound symbol "£31"
description: negative branch lengths "(:-2.0,:-5.0)NT1.16:-4.0" and a single quote mark
description: slash "Methanobacler/umIvanovii", single quotes and negative branch lengths "(:-1.0,:182.0)"
description: colon in taxon name "Peptostreptococcusmicro::135.0,"
description: single quote marks and double quote marks
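Several of the problems listed above can be spotted with a simple text scan before a file is handed to a Newick parser. The sketch below is a heuristic and not a validator: `suspect_features` and `SUSPECT_CHARS` are hypothetical names, and treating these symbols as suspect is an assumption based on the descriptions above, not a statement of what the Newick format forbids.

```python
import re

# Symbols quoted in the error list: slash, dollar, pound sign (\u00a3),
# exclamation mark, degree sign (\u00b0), square brackets and quote marks.
SUSPECT_CHARS = set("/$\u00a3!\u00b0[]'\"")


def suspect_features(newick):
    """Return a sorted list of heuristic warnings for a Newick string."""
    warnings = set()
    if re.search(r":-\d", newick):
        warnings.add("negative branch length")
    if "::" in newick:
        warnings.add("repeated colon")
    found = sorted(SUSPECT_CHARS.intersection(newick))
    if found:
        warnings.add("suspect characters: " + "".join(found))
    if newick.count("(") != newick.count(")"):
        warnings.add("unbalanced parentheses")
    return sorted(warnings)
```

A scan like this cannot detect a stray comma inside a taxon label such as "X74685,D25307", because commas are also legitimate separators; those cases still need a check against the source image.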