
Questions #2

Closed
rougier opened this issue Oct 11, 2019 · 35 comments

Comments

@rougier (Member) commented Oct 11, 2019

If you have any questions concerning the challenge, you can use this thread.

rougier pinned this issue Oct 11, 2019
@p16i commented Oct 15, 2019

How can a new researcher contribute to this campaign?
For example, someone just starting their research career would not have any code older than 10 years.

@khinsen (Contributor) commented Oct 16, 2019

@heytitle Here is one idea: you could identify interesting papers in your field, and contact their authors to encourage them to participate in the challenge.

@rougier (Member, Author) commented Oct 16, 2019

For new researchers, we also intend to have a repro hackathon in Bordeaux sometime next year (with @annakrystalli) where you can try to reproduce papers from the literature. We'll have a special issue linked to this repro hackathon, and @annakrystalli is also organizing several other repro hackathons.

@rougier (Member, Author) commented Oct 16, 2019

It would be nice to have an entry by Margaret Hamilton stating that the code is available but the hardware is nowhere to be found :)

@bpbond commented Dec 23, 2019

Looking through #1, most people seem to be moderately to highly confident of success. It occurs to me that there's probably a degree of self-selection at play, with people picking studies they're relatively confident of replicating. (Not sure, just speculating.) Anyway, the degree to which this isn't a random sample should probably be addressed somewhere in the special issue.

@rougier (Member, Author) commented Dec 23, 2019

You're right, and one of the problems may be the impossibility of finding sources. There is a proposal for a collective article giving an account of failed replications, so that people can quickly explain why they failed without having to write a full article just to explain, for example, that they could not find the sources. But apart from that, I'm not sure how to address this bias. We'll underline it in the editorial for the special issue. Note that the bias also holds for regular replications: so far, we've only published successful replications.

@khinsen (Contributor) commented Dec 23, 2019

Self-selection is an eternal problem with publication. It starts even earlier: people who know about ReScience are already a self-selected minority of scientists interested in reproducibility questions. To do statistics on reproducibility, we'd have to do something like a poll of a random selection of researchers, not a call for contributions. BTW, I know someone who has been wanting to do exactly that for a few years, but never got around to actually doing it (lack of funding, etc.).

@kyleniemeyer commented:

@rougier @khinsen just to clarify, what is the deadline for the article/reproducibility report associated with the challenge?

@khinsen (Contributor) commented Jan 6, 2020

@ev-br commented Jan 8, 2020

  1. Is there a deadline for the entry? (As in, is 8 Jan 2020 still OK for declaring participation?)

  2. I am nearly certain I won't be able to run the whole set of simulations from the 2006 paper, because I won't be able to justify the use of that many computational resources (the requirements were fairly substantial back in 2006; the machine was a vector Cray, and we specifically targeted vector machines). Is it still OK to target a representative subset? (If the subset runs, the rest also runs; it just requires more CPU time.)

@khinsen (Contributor) commented Jan 8, 2020

@ev-br The only deadline is April 1st for submissions. You can declare participation in the morning and submit in the afternoon if you like!

A representative subset looks reasonable, just be sure to state this in your submission. It will then be the reviewer's job to decide if it's representative enough.

@brembs commented Mar 6, 2020

Just noticed in the FAQ that the code has to come from "myself". What does that mean specifically? Here is my case:

  1. The code I used to collect the data (Turbo Pascal) already existed before I started in the lab, and I modified it.
  2. The code to analyze the data and export the derived results into a spreadsheet was written in C++ (MFC) by a grad student with whom I shared a room, and I also contributed a little to writing it. I still have the source code, and the executables (from around 2000) still run.
  3. As the software used to visualize the derived results was proprietary and I don't have it any more (not sure Statistica still exists?), I have written a short R script that takes the spreadsheet and creates bar graphs.

So the raw data (in our own local format, written by the Turbo Pascal code) that I collected myself can still be read by the C++ analysis software (written largely by someone else), and with my own, current R code I can show that the data produce exactly the same graphs as in the original publication from 2000. Does that mean I qualify, or is my contribution too little?

@rougier (Member, Author) commented Mar 9, 2020

I think the FAQ might be a bit too restrictive. The idea was to test the (original) code you used in your article. I don't think we meant that you have to have written everything yourself, so from my understanding of your explanation, I think you're good to go. You might need to add the above explanation to your article, especially for the current R code / Statistica (if you can trace the history of this software, that would be even better).

@khinsen (Contributor) commented Mar 9, 2020

I agree with @rougier's point of view. The point of the challenge is to let scientists evaluate the possibility of re-doing work they published in the past. It isn't so important who wrote exactly which code - what we are after are reports of how future-safe the methods of the past turned out to be. So if you consider your case unusual compared to the others, or to the description of the challenge, that mostly means you should describe the particularities in your article.

@brembs commented Mar 27, 2020

I've organized the code and the data according to the descriptions. Now I'm preparing to write the paper, and I'm reading these two descriptions:
https://github.com/ReScience/ReScience-submission
https://github.com/ReScience/submissions
Do I understand correctly that I have to install Python and find out how to generate LaTeX documents using make and the templates provided? (I've used Overleaf before and find it tedious and cumbersome: more time spent with the system than with the text.)

@khinsen (Contributor) commented Mar 27, 2020

@brembs You can prepare your article with whatever tools you like, provided that you can produce a PDF file for submission in the end. We also need the YAML file with the metadata. There is no requirement to use our template, which so far is LaTeX-only because that's what we know best.
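
For illustration, a minimal metadata file might look roughly like this (the field names here are a hypothetical sketch, not the authoritative schema - see the current template repository for the real metadata.yaml):

    # Illustrative sketch only - field names are hypothetical,
    # not the authoritative ReScience metadata schema.
    title: "[Re] Reproducing our 2009 simulations, ten years later"
    authors:
      - name: Jane Doe                 # hypothetical author
        orcid: 0000-0000-0000-0000
        email: jane.doe@example.org
    code:
      url: https://github.com/janedoe/ten-years-code   # hypothetical repository
      doi: 10.5281/zenodo.0000000                      # hypothetical archive DOI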

BTW, the first repository you cite is now obsolete. Use only https://github.com/ReScience/submissions.

@brembs commented Mar 27, 2020

Excellent, this is not a problem!

@weinman commented Apr 2, 2020

(Apologies for such a simple question from a newcomer to this area.) I will likely have results that are statistically similar to what was published, but the exact results are not the same. I would naively call that a successful replication, though it's not a precisely repeated/reproduced experiment where the numerical outcomes are identical. Instead, I would say the trends are the same and the conclusions hold up.

In my submission, should I classify this as Rp or ¬Rp?

@khinsen (Contributor) commented Apr 2, 2020

Ultimately that's something you should discuss with the reviewer(s) who have read and commented on your paper. But as you describe the situation, it looks more like a success than a failure to me.

@broukema commented May 2, 2020

Is there any word limit on the abstract? Since the template doesn't generate any warning to the user, I suspect the answer is "Formally, no, but be reasonable." In case it was missed - @khinsen, @rougier - over at #9 there are four of us (me plus three others) hoping for a bit of an extension of the deadline.

In my case, you can see that my text is nearly ready for submission, along with the code at Codeberg - unfortunately with a negative result, and plenty of feedback on the difficulties.

@khinsen (Contributor) commented May 2, 2020

Deadline extension: see #1 (comment)

@khinsen (Contributor) commented May 2, 2020

@weinman That's hard to say in the abstract, so it's something you should discuss with your reviewer. Pick either one for your submission and change it later if appropriate.

@broukema commented:

@rougier @oliviaguest @khinsen @pdebuyl or any of the contributors: which computer science subject class on arXiv would make most sense for the Ten Years Challenge papers? - https://arxiv.org/corr/subjectclasses

  • CY - Computers and Society - reproducibility is not only a practical issue; it's an ethical issue and something that affects society;

  • DL - Digital Libraries - this is where Rougier et al. https://arxiv.org/abs/1707.04393 - about ReScience C - is placed;

  • HC - Human-Computer Interaction - reproducibility is to some degree an issue of how humans (scientists) relate to software;

  • SE - Software Engineering - reproducibility is to some degree a question of software engineering, though it's more a question of software evolution.

I'm not really convinced by any of these. There's also:

  • OH - Other

I think it would be good to encourage all of the Ten Years Challenge authors to post our papers as (p)reprints (either before or after acceptance, including the final accepted version) on arXiv or possibly bioRxiv. Acceptance on arXiv is not guaranteed; if the ReScience C Ten Years Challenge papers are accepted by the arXiv moderators, that will count in favour of "replicability/reproducibility science" and help the journal itself gain wider recognition in the scientific community.

I guess my tendency would be to choose DL - these articles contribute to the digital preservation of scientific research papers in a deeper sense than that of the human-readable PDF.

Any arguments for/against any of these (or other) options would be welcome! :)

@oliviaguest (Member) commented Jun 28, 2020

I am not bothered either way and have no opinion formed at the moment, but I'm curious: does posting on arXiv offer more archiving possibilities (because they are picked up by Google Scholar, for example)?

@bpbond commented Jun 28, 2020

As you note, there are links to many of those categories, but DL seems the best/most logical to me.

@broukema commented:

@oliviaguest arXiv is an archive, as can be guessed from the name :). It's nearly 30 years old, so in terms of longevity it's clearly stable. Moreover, (i) it provides a uniform, community-based way of collecting together papers by an author in the physics/astronomy/maths/statistics/computer-science area of scholarly studies; and (ii) the highly standardised and well-recognised use of arXiv:yymm.nnnnn identifiers (with one change in 2007) shows a human or robot reader of a paper's bibliography that the reference in question is necessarily available under green open access.

It's much more motivating for a human to go immediately to an open access reference than to spend extra time finding the URL of a reference whose access type is unknown. Robots' decisions about where to explore are also made easier this way.

@oliviaguest (Member) commented:

@broukema OK, I think we're having a weird miscommunication. I have used arXiv, I have preprints on there, etc. I just mean: why specifically are we using it here? Maybe I've missed something above, but don't we usually use Zenodo for ReScience C?

@broukema commented:

I'm not proposing arXiv as an alternative to Zenodo for archiving ReScience C PDFs; arXiv is complementary to Zenodo, and stores the sources of papers, not the final PDFs (it caches the PDFs for some time).

I'm thinking rather of the bibliometrics of scientific articles, and of the general efficiency and modularity of scientific communication. See (i) and (ii) above. There's also the fact, mentioned above, that arXiv moderators provide a qualitative filter of minimal scientific quality for research articles, which, I presume, is not done on Zenodo (I haven't used Zenodo much, so I'm not sure).

There are quite a few differences between arXiv and Zenodo which I haven't mentioned above:

  • arXiv includes abstracts;
  • arXiv has email options - for people with low-bandwidth connections, this is an efficient, decades-long-tested method of keeping up to date with the latest results in the field(s) you're interested in (maybe Zenodo has these, but I didn't see them);
  • Zenodo does not seem to have field and sub-field categories moderated by human volunteers;
  • arXiv normally expects the LaTeX source (text/tables/figures) to be provided - this can be useful for extracting the data in tables, for visual or numerical analysis of figures at better resolution than available in the final PDF, and for plagiarism analyses;
  • arXiv publicly adds text-overlap warnings in the comments field, leaving it to the community to interpret whether that means plagiarism or not. In principle, this is an issue that the scientific community takes very seriously. Again, maybe Zenodo has a mechanism for this, but I haven't noticed it.

I certainly intend to post my article on arXiv. Whether or not other authors wish to do so, and whether or not the editors as a whole choose to make this a recommendation, is up to the authors and editors to decide: my arguments are above.

My feeling is that if a large fraction of ReScience C articles are accepted by arXiv moderators as valid scientific research papers, that will help strengthen the reputation of ReScience C as a serious scientific journal.

@khinsen (Contributor) commented Jun 29, 2020

@broukema The question of how to manage/archive our articles comes up from time to time; we are definitely not in a stable state. Currently we only use Zenodo for archiving the PDFs, and increasingly Software Heritage for archiving code repositories.

One problem with adopting a more feature-rich but also more specialized platform such as arXiv is the heterogeneity of our submissions. ReScience C covers (in theory) all domains of computational science, which have widely differing habits. Most authors use LaTeX and our article template, but this is intentionally not obligatory. Likewise, many scientific disciplines are represented in ReScience C but not in arXiv. We could certainly agree among ourselves to have authors submit all ReScience C papers under DL, but it's the arXiv curators who have the last word on the choice of category, and I have no idea how they deal with articles they consider out of scope.

I'd rather start from the other end and ask: what do we want to improve compared to our current system? A frequent request is indexing by Google Scholar, which hasn't made much progress, mainly because Google doesn't have clear rules for it. You suggest archiving the source code, which is interesting as well and can be realized in many ways. The most interesting aspect of arXiv that you point out is curation. This could indeed increase ReScience's reputation in the domains covered by arXiv (so far a small minority of our contributions), so I think it's definitely worth considering as a recommendation to our authors - but then they should ideally submit in their respective arXiv categories.

@rougier (Member, Author) commented Jun 29, 2020

Also, compiling LaTeX on arXiv is kind of a nightmare if you don't use the exact same version as they do (which is not really up to date).

@broukema commented:

@rougier I've been posting to arXiv for several decades - I vaguely recall minor LaTeXing problems only once or twice, and I assume that the version of LaTeX I use - normally the Debian GNU/Linux stable, or sometimes oldstable, version - has almost never been identical to the one provided. So our experience differs here. I use ordinary source debugging (git history, binary search) to debug LaTeX errors - see the sketch below - but more user-friendly and powerful LaTeX debugging tools exist.
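
For the binary-search part, something like git bisect works; here is a minimal sketch (the file name and the known-good tag are illustrative, not from my actual repository):

    # Hypothetical sketch: find the commit that broke the LaTeX build.
    git bisect start
    git bisect bad HEAD              # current version fails to compile
    git bisect good v1.0             # illustrative known-good tag
    # latexmk exits non-zero when compilation fails, which
    # 'git bisect run' interprets as a bad commit:
    git bisect run latexmk -pdf -interaction=nonstopmode article.tex
    git bisect reset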

@khinsen I'm not convinced that the astro-ph.CO moderators would see my article as a cosmology research article, since it's really at a meta-meta-level compared to cosmology - it's about methodology (a case study of a method) - with no cosmological result. But I think that astro-ph.CO (in my case) is worth trying as a secondary category in addition to cs.DL. The moderators in other specialties will each make their own judgments - it's certainly reasonable to try. The individual moderators' decisions are not public, but their names are public, so a systematic refusal could in principle be later raised for wider public discussion.

@oliviaguest (Member) commented:

Did we lose track of the original question?

Which computer science subject class on ArXiv would make most sense for the Ten Years Challenge papers?

@broukema go for DL, I think — why not? ☺️

@broukema commented:

@oliviaguest We did digress a bit :). But I think Konrad's point about considering the scientific topic of the original paper is a fair one - arXiv normally allows a secondary category, and that seems reasonable to me. I'll wait until either the 03 or 04 step of the editorial process before posting my paper on arXiv - so there's still time for anyone else interested to provide other suggestions/arguments. I've got an 02 label, and 03 is probably not far away ;).

@broukema commented:

Just to clarify the crosslink: I proposed cs.DL (primary) and astro-ph.CO (secondary); the arXiv moderators took three weeks to decide, and accepted my article in cs.CY (primary) and cs.SE (secondary): ReScience/submissions#41

Others intending to submit their papers to arXiv should probably consider choosing cs.CY and cs.SE immediately, rather than waiting for reclassification.

@khinsen (Contributor) commented Aug 18, 2020

Thanks @broukema for reporting on your arXiv experience!
