Ten Years Challenge: [Rp] Typographical features for scene text recognition #35

Closed
weinman opened this issue Apr 30, 2020 · 34 comments

@weinman

weinman commented Apr 30, 2020

Original article: Weinman, J. J. (2010, August). Typographical features for scene text recognition. In 2010 20th International Conference on Pattern Recognition (pp. 3987-3990). IEEE. DOI:10.1109/ICPR.2010.970

PDF URL: http://www.cs.grinnell.edu/~weinman/tmp/rescience/rescience20_submitted.pdf
Metadata URL: http://www.cs.grinnell.edu/~weinman/tmp/rescience/metadata.yaml
Code URL:

Scientific domain: Pattern Recognition (Machine Learning)
Programming language: CUDA, Matlab, Java, C
Suggested editor: Thomas Arildsen, Lorena Barba, Georgios Detorakis

The code archive DOI is being processed. It may not appear for 48–72 hours.

Some of the LaTeX output/layout seems a bit wonky. If anyone would like to make behind-the-scenes suggestions, they may view the source at overleaf

Original comment/submission

@rougier
Member

rougier commented May 3, 2020

@ThomasA Could you edit this submission for the Ten Years Reproducibility Challenge (only one reviewer needed)?

@rougier
Member

rougier commented May 3, 2020

@weinman Thanks for your submission, we'll assign an editor soon.

@rougier
Member

rougier commented Jun 15, 2020

@koustuvsinha @gdetor Can you edit this submission for the Ten Years Reproducibility Challenge (only 1 reviewer needed)?

@gdetor

gdetor commented Jun 15, 2020

Hi @rougier I can handle this submission.

@rougier
Member

rougier commented Jun 19, 2020

Oh great, thank you

@gdetor

gdetor commented Jun 19, 2020

Hi @mlosch Could you please review this submission?

@ThomasA

ThomasA commented Jun 22, 2020

@rougier sorry I was not quite "awake" at the moment. I was quite busy lately and I am afraid a lot of ReScience communication hid in heaps of GitHub threads that just kept piling up - most of it papers that I was not involved in.

@gdetor

gdetor commented Jun 29, 2020

@mlosch Gentle reminder.

@rougier
Member

rougier commented Jul 23, 2020

@gdetor If @mlosch is not available, I can review.

@gdetor

gdetor commented Jul 23, 2020

@rougier thank you.

@rougier
Member

rougier commented Aug 3, 2020

I've started and I will try to make my review for this Friday.

@rougier
Member

rougier commented Aug 7, 2020

Overall

This is fascinating work with a really complex setup and a lot of software/hardware dependencies. When you read the original article (a conference paper with a page limit, I imagine), you have no idea of the whole machinery behind the results. This resonates strongly with Claerbout's famous quote: "an article about computational result is advertising, not scholarship. The actual scholarship is the full software environment, code and data, that produced the result." Here, the author gives us really precise and insightful explanations of the whole machinery. It seems almost magical that he managed to reproduce it, but the "trick" is the use of an experimental data repository set up in 2010. This reproduction 10 years later seems to indicate that it's quite a good structure for reproducible research. I did not try to re-run the software because you obviously need a CUDA environment (which I don't currently have) and the whole pipeline produces several gigabytes of data. I have only one "major" comment/suggestion: in the introduction, the author dives directly into the details of the experimental data repository without first giving an (even small) overview/context of the original results. That might be a good strategy, because I stopped reading at this point to read the original article and later came back to this one, but I think it would be good to give some information on the original article (even if all the details are given later).

I also have a few minor comments/suggestions (see below) that need to be addressed, but I think the article is already in good shape.

Minor comments

  • You might add a few words on how you started the reproduction. Did you leave some notes in 2010 indicating the procedure, or is everything documented through the experimental data repository? The README accompanying the code archive is dated 2020; is this only an update of the original README?
  • Maybe I missed it, but I didn't see the link to the source code in the paper.
  • Would it be possible to have an online version (wherever you want) with all the scripts "expanded" so that they can be browsed online?
  • In the conclusion, the last sentence reads: "This article indicates that using long-lived tools along with well-tracked dependencies increase the chances of generating reproducible results." At the same time, the author reports difficulties in identifying the various components of the original environment (section 4.1). I would not want the author to modify the last sentence, which is a nice conclusion for this paper, but some extra explanation would be welcome.
  • In the abstract, you have "A Replication of [1]"; maybe it would be better to also add the literal citation (like the one in the references).
  • There are some font size differences in verbatim texts. Makefile appears too big, I think. Since you warn about some glitches in the template, I'll be happy to try to fix them with you.
  • Names of collections (rA, rB, etc) are not really informative. I imagine there's a reason but still...

@rougier
Member

rougier commented Aug 7, 2020

Forgot to notify @gdetor and @weinman

@gdetor

gdetor commented Aug 7, 2020

@rougier Thank you for your review.

@gdetor

gdetor commented Sep 7, 2020

@weinman Gentle reminder

@rougier
Member

rougier commented Sep 8, 2020

@gdetor @weinman Any progress?

@weinman
Author

weinman commented Sep 10, 2020

Thank you @rougier for the helpful review. I was sidelined last month by a natural disaster, but I'm now picking up this thread.

Major suggestion

I'm very happy to insert a new section after the introduction giving some additional background and context for the original work. I agree that makes a lot of sense! I've drafted 4–5 paragraphs along these lines.

Because the focus of the article (I had thought) would be on the reproduction task, I thought leaving the introduction (having that focus) intact might make sense. However, I am also amenable to repeating what is given in the abstract, if the reviewers find that a useful addition. Specifically:

The original 2010 paper demonstrated that character recognition performance could be improved on difficult problems of scene text recognition by leveraging font-specific correlations between character identity and width.

I will be fiddling with pagination and figure location, but the draft is already visible in the overleaf (linked in the original post).

Minor suggestions

Starting the reproduction

This is a good point! I will add to the article that the sources of my submitted works always include a pointer from tables or figures to the experimental collections that generated them. Thus, I always have a pointer to the leaf/leaves of the dependency tree (Fig. 1), from which I can work backwards.

There was no original README because the computational aspect of the work was not previously published. (Though there was always intent.)
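
A minimal sketch of what "working backwards" from a leaf of the dependency tree could look like, assuming the dependencies are available as a simple mapping from each collection to the collections it was built from. The function name, the mapping, and the particular edges between `rA`, `eC`, and `pA` are hypothetical illustrations; the thread does not show how the repository actually stores or traverses its dependencies.

```python
from collections import deque


def upstream_collections(leaf, deps):
    """Return every collection that `leaf` depends on, directly or indirectly.

    `deps` maps a collection name to the list of collections it was built from.
    The result is in breadth-first order, nearest dependencies first.
    """
    found = []
    visited = {leaf}
    queue = deque([leaf])
    while queue:
        current = queue.popleft()
        for parent in deps.get(current, []):
            if parent not in visited:
                visited.add(parent)
                found.append(parent)
                queue.append(parent)
    return found


# Hypothetical dependency map (assumed edges, for illustration only).
EXAMPLE_DEPS = {
    "rA": ["eC", "pA"],
    "eC": ["pA"],
    "pA": [],
}

print(upstream_collections("rA", EXAMPLE_DEPS))
```

Starting from the collection referenced by a table or figure, the returned list is the set of upstream collections that would need to be regenerated or inspected.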

Link to Source

The accompanying metadata YAML file does indeed insert the link in the footer on page 1 ("Code is available at ..."). Is there an additional location and/or preferred method for relaying this information?

Browsable source

It's certainly possible to include this. I had limited myself to linking to the permanent archive, because it's the only URL I trust to still be functional in, say, 10 years (unlike GitHub or my own institution-hosted personal web site). Unfortunately, my institution's digital archive does not really facilitate file-level browsing.

I welcome suggestions for other platforms that might blend both permanence and browsability.

Caveats in Conclusion

Agreed. Inserting a new penultimate sentence would function nicely as a transition between the "Although" and "This article indicates". The caveat(s) primarily stem from the fact that we don't have a complete docker-like archive of the host platform, which I am happy to recapitulate.

Abstract Citation

This is indeed one of the technical LaTeX issues I was wondering about. The citation is automatically generated, but perhaps I'm doing something wrong. The metadata.yaml contains the following:

# Information about the original article that has been replicated
replication:
 - cite: "Weinman, J. J. (2010, August). Typographical features for scene text recognition. In 2010 20th International Conference on Pattern Recognition (pp. 3987-3990). IEEE." # Full textual citation
 - bib:  \cite{weinman10typographical} # Bibtex key (if any) in your bibliography file
 - url:  https://www.cs.grinnell.edu/~weinman/pubs/weinman10typographical.pdf # URL to the PDF, try to link to a non-paywall version
 - doi:  10.1109/ICPR.2010.970 # Regular digital object identifier

And this produces the following items in metadata.tex

\def \replicationCITE{Weinman, J. J. (2010, August). Typographical features for scene text recognition. In 2010 20th International Conference on Pattern Recognition (pp. 3987-3990). IEEE.}
\def \replicationBIB{\cite{weinman10typographical}}
\def \replicationURL{https://www.cs.grinnell.edu/~weinman/pubs/weinman10typographical.pdf}
\def \replicationDOI{10.1109/ICPR.2010.970}

I do welcome assistance on getting this formatted as the editors intend! (I agree that a bare numerical citation is undesirable.)
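
The mapping between the two listings above can be sketched as follows. This is only an illustration inferred from the quoted input and output, not the template's actual conversion script, which is not shown in this thread. The function name is hypothetical, and the sketch assumes the `replication` entries have already been flattened into a plain dictionary.

```python
def replication_defs(replication):
    """Build one LaTeX definition line per replication field.

    `replication` is assumed to be a flat dict with optional keys
    "cite", "bib", "url" and "doi"; missing keys yield empty definitions.
    """
    lines = []
    for key in ("cite", "bib", "url", "doi"):
        value = replication.get(key, "")
        lines.append("\\def \\replication" + key.upper() + "{" + value + "}")
    return lines


# Values copied from the metadata quoted in this comment.
example = {
    "bib": "\\cite{weinman10typographical}",
    "doi": "10.1109/ICPR.2010.970",
}

for line in replication_defs(example):
    print(line)
```

Under this reading, whatever is typed in the `bib` field is copied verbatim into `\replicationBIB`, so the citation command chosen there determines what appears beneath the abstract.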

Verbatim Texts

Good typographical eye! (Appropriate, considering the subject of the original paper.)

Indeed, I actually shrunk the name of the very long collection (eighth line of section 2) so it didn't stick out too far into the margin. However, Makefile (which comes later) and all the rest are the appropriate default size in LaTeX; juxtaposed, these do look odd, but careful inspection of the x-height for Makefile reveals that it matches the default/serif text.

Collections shorthand

Admittedly, I invented the shorthand pA, eC so that the graph (Fig. 1) would be interpretable, and then found it convenient to continue to use these shorthand keys throughout. The complete names I used in the file system (e.g., experiments/text/ngrams/bigrams/tied_nums_intracase_L1_validation-20090708075734) would perhaps be overkill for casual reference throughout the manuscript, which is why I tried to describe the relevant collections before naming them specifically with these keys. They can be cross-referenced in the graph (Fig. 1) or in the table (Table 3) for more details.

I agree it's not ideal, but given the sheer number and variety of them, I do not know whether it is worth inventing a new taxonomy. Section 3.1 describes the contents as they are enumerated. Table 2 attempts to use the semantically meaningful prefix character (p, e, or r) to glob these along with the header description ("Parser", "+Training", "+Data").

A bit of "inside baseball", but another minor reason for this somewhat generic naming scheme (aside from keeping the graph tidy) is that the dependency graph structure and indeed the graphic itself are all automatically and reproducibly generated (by tracing the DEPs files, processing the hierarchy location, and generating a file that can then be used by GraphViz).
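
The last step, emitting a file that GraphViz can render, might look roughly like the sketch below. It assumes the dependency information has already been collected into a dictionary; the format of the DEPs files and the hierarchy processing are not described in this thread, so they are left out, and the function name and example edges are hypothetical.

```python
def to_dot(deps, name="collections"):
    """Render a dependency mapping as GraphViz DOT text.

    `deps` maps each collection to the collections it depends on;
    an edge is drawn from each dependency to the collection built from it.
    """
    lines = ["digraph " + name + " {"]
    for node in sorted(deps):
        lines.append('    "' + node + '";')
        for parent in sorted(deps[node]):
            lines.append('    "' + parent + '" -> "' + node + '";')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical edges, for illustration only.
dot_text = to_dot({"eC": ["pA"], "pA": []})
print(dot_text)
```

Saving the text to a file such as `graph.dot` and running `dot -Tpdf graph.dot -o graph.pdf` would then produce the figure. Short keys keep the node labels compact, which is the tidiness benefit mentioned above.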

Conclusion

I will post a revision in the next week or so, but I do invite responses to any of the comments I've made above that may help smooth the process going forward. Thank you again for hosting the challenge, giving impetus for this evaluation, and providing the very helpful review.

@rougier
Member

rougier commented Sep 11, 2020

@weinman Thank you for your very detailed report; I'm fully satisfied with the proposed corrections. To answer your question about the code, you can actually expand it on GitHub (or GitLab) and use Software Heritage (https://www.softwareheritage.org/save-and-reference-research-software/) to save the repository. Even if GitHub disappears sometime in the future, your code will be safe (Software Heritage is a non-profit foundation). You should obtain a SWH ID that you can put in the metadata.

For collection shortnames, I agree the long names would be tedious to use and I can live with the short version.

For the abstract, I think you can use \textcite instead of \cite but I'm not sure if the overleaf template is up to date. Maybe you can upload the up-to-date template from https://github.com/ReScience/template

@gdetor We're now waiting for the final version but I formally accept the submission.

@gdetor

gdetor commented Sep 12, 2020

@rougier Thank you for the review.
@weinman Congratulations. Once you upload the final version I'll proceed with publishing the paper.

@weinman
Author

weinman commented Sep 17, 2020

@rougier Thank you for the feedback.

Regarding the "A replication of [1]" beneath the abstract: I've updated the template to the latest version and changed metadata.yaml/metadata.tex to use \textcite, but all this does is change it to "A replication of Weinman [1]".

Any further pointers or suggestions on how to make the output match the editors' desired format are very welcome!

@rougier
Member

rougier commented Sep 18, 2020

Can you try \fullcite instead ?

@weinman
Author

weinman commented Sep 18, 2020

Yes, \fullcite produces a complete citation.

The metadata.yaml is peculiar on this point:

# Information about the original article that has been replicated
replication:
 - cite: # Full textual citation
 - bib:  # Bibtex key (if any) in your bibliography file
 - url:  # URL to the PDF, try to link to a non-paywall version
 - doi:  # Regular digital object identifier

My initial interpretation would be that bib is simply the bibtex key (i.e., it's weinman10typographical in my bibtex file). And cite would be the manually extracted text (I'd copied mine from what appeared in the ultimate reference list).

However, in order to produce what you've requested, the bib field above needs to be \fullcite{weinman10typographical}.

Maybe I'm misunderstanding something. However, if you too think that's peculiar, let me know if you'd like me to open an issue about it on the template.

@rougier
Member

rougier commented Sep 23, 2020

Now that you've pointed it out, I'm not sure what I meant when I created the template, and it might be worth opening an issue. I think this might be related to the old template we used some years ago.

@weinman
Author

weinman commented Sep 23, 2020

Thanks. I have some other formatting issues/questions.

SWH

I've entered everything into SWH and gotten an identifier for it. I've entered the value into metadata.yaml, which generates the appropriate \def \codeSWH line in metadata.tex. However, \codeSWH does not seem to get used anywhere by the template. (I'm looking particularly at the footer where it says "Code is available at", which gives the DOI, and I'd certainly like to keep that in there.)

Am I to manually cite it within the body of the paper? (Fine if so, just wanted to double-check!) Or is there something else I am to do?
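
Before pasting an identifier into the metadata, a quick format check can catch copy-and-paste slips. The sketch below checks only the core Software Heritage identifier shape (`swh:1:`, an object type, and 40 lowercase hexadecimal digits) and deliberately rejects identifiers that carry qualifiers after a semicolon. The function name is hypothetical and the example hash is a placeholder, since the identifier obtained here is not quoted in the thread.

```python
import re

# Core SWHID: scheme, version 1, object type, 40 lowercase hex digits.
SWHID_PATTERN = re.compile(r"swh:1:(cnt|dir|rev|rel|snp):[0-9a-f]{40}")


def is_core_swhid(text):
    """Return True if `text` is exactly a core SWHID with no qualifiers."""
    return SWHID_PATTERN.fullmatch(text) is not None


placeholder = "swh:1:dir:" + "0" * 40  # placeholder hash, not a real archive entry
print(is_core_swhid(placeholder))
print(is_core_swhid("10.5281/zenodo.4091742"))
```

A DOI such as the one on the second line fails the check, which is the kind of mix-up between the two code fields that this guards against.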

Paragraphs

The paragraph formatting of the paper seems a bit odd, there is neither an extra space between lines nor an indentation for new paragraphs.

This behavior was observable in the submitted PDF as well. Is it expected? It seems unusual to me, as typically there is either one or the other (extra space or indentation). I just wanted to verify.

Header "Replication"

The header says "A replication of" and I just wanted to be sure this is right, since the prefix in the title is [Rp] as opposed to [Re]. I'm not sure if replication applies to both or if there is another word that should appear there (i.e., "reproduction"?).

Sorry for so many questions!

@rougier
Member

rougier commented Sep 24, 2020

The \codeSWH is supposed to be used in header.tex. Maybe you need to update the template you're using. And if you started from overleaf, this template might be outdated and needs to be updated.

For the missing space, I also noticed that and forgot to fix it. It's only a matter of adding \usepackage{parskip} to the template (can you make a PR?).

The header should say "Reproduction" in your case. You have to change the type in the metadata (where this specific option is missing in the comment).

And many thanks for all your comments, your expert eye will help us to improve the template.

@weinman
Author

weinman commented Sep 26, 2020

Thanks @rougier !

I updated all the files (it turns out that before I'd only done a partial update of just rescience.cls) in the template. This solved some problems but introduced others.

  • Adding \RequirePackage{parskip} to rescience.cls fixed the paragraph issue (+)
  • With the template update, SWH now properly appears in the footer (+)
  • However, the abstract and replication lines were commented out and needed to be restored (-)
  • My metadata.yaml does have Reproduction under type. I observe that this is not among the options listed in the template. (-)
  • The template seems to use this text verbatim in the header/title, but not in the bit that comes beneath the abstract. It seems to be hardcoded to "A replication of" (-).

That last point specifically is what I was asking about. The title/header indeed appears to be correct, it's the note under the abstract I wondered about. Should I make some change or is this the text expected?

I'd be happy to submit one PR that adds the package and uncomments the needed header elements (unless they're intended to be commented?).

@rougier
Member

rougier commented Oct 1, 2020

That would be great. For the abstract and replication lines, I think I commented them out because we can now have editorials and letters that have neither an abstract nor a replication reference. The abstract inclusion should be conditional, and the same goes for replication/reproduction. If we have a bib ref for the reproduction/replication, then we can add it with \fullcite; otherwise we skip the line?

Yes, the metadata is missing the reproduction type, good catch. For the hardcoded part, we could use the type of the submission, since the line would only appear for a replication/reproduction?

If you make a PR, it would be good to reference this thread and maybe we can continue the discussion on the PR. Else, we'll continue to pollute your review 😄
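
The decision logic proposed in this comment can be modelled in a few lines. This is a sketch of the rule only, written in Python for clarity; the template itself is LaTeX, and the function name and argument names are hypothetical.

```python
def under_abstract_line(article_type, bib_key=None):
    """Return the line shown beneath the abstract, or None to skip it.

    The line appears only for replications and reproductions that supply
    a bibliography key; its wording follows the submission type.
    """
    if not bib_key:
        return None
    noun = article_type.strip().lower()
    if noun not in ("replication", "reproduction"):
        return None
    return "A " + noun + " of \\fullcite{" + bib_key + "}"


print(under_abstract_line("Reproduction", "weinman10typographical"))
print(under_abstract_line("Editorial"))
```

Deriving the wording from the type removes the hardcoded "A replication of", and returning nothing for editorials and letters covers the case that led to the lines being commented out.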

@weinman
Author

weinman commented Oct 1, 2020

@rougier Thanks again for the review and editorial guidance.
@rougier and @gdetor here's the summary of changes:

  • Added editor and reviewer to metadata
  • Added SWH link for browseable code (in addition to the tarball archive)
  • Updated template to use "reproduction of" and full citation featured beneath abstract.
  • Section 2 added with a summary of the original work and its context
  • First paragraph of Section 5 now describes the entrée to reproduction
  • Conclusion now includes a recapitulation of the caveat (" the repository does not completely capture all the host software versions and configurations ... This short-coming limited the degree of reproducibility reported in this report") before the concluding sentence.

As I understand it, the editor adds the final touches (DOI, etc.). Thus, the entire document source may be found at Overleaf. If there is some other way I should deliver it, please let me know!

@gdetor

gdetor commented Oct 2, 2020

@weinman @rougier I'll proceed with the final editing and publishing of the article.

@gdetor

gdetor commented Oct 15, 2020

Hi @weinman Could you please add the following information to the metadata.yaml file, compile the article and update the overleaf so I can get the latest version of those files?
Volume: 6, Issue: 1
DOI: 10.5281/zenodo.4091742
URL: https://zenodo.org/record/4091742/files/article.pdf
Please correct the name of the reviewer to Nicolas and remove the tag "preprint" from the manuscript.
Thank you.

@weinman
Author

weinman commented Oct 15, 2020

Thanks @gdetor! Glad we're nearly there. I can edit those things. Should I leave the dates (so you update them), or shall I insert them? Perhaps as follows?

dates:
  - received: April 30, 2020
  - accepted: September 12, 2020
  - published: October 15, 2020

I also wasn't sure what to give for the article number.

In any case, I've added the volume, issue, DOI, and URL (not sure where that shows up), made the name correction (oops!), and the Preprint has disappeared with the definition of the DOI.

Please let me know what details I still need to attend to.

@gdetor

gdetor commented Oct 15, 2020

@weinman Sure you can update the dates too. I think that's the last piece of information missing from the manuscript. The number will be assigned automatically upon submission.
Thank you

@weinman
Author

weinman commented Oct 15, 2020

@gdetor Very good, I've made those updates. If you find anything missing, please let me know!

@gdetor

gdetor commented Oct 15, 2020

@weinman Congratulations once again. The article is now online https://zenodo.org/record/4091742

@rougier rougier closed this as completed May 28, 2021