
\documentclass[jou]{apa6}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\usepackage{hyperref}
\usepackage{xcolor}
\hypersetup{
    colorlinks,
    linkcolor={red!50!black},
    citecolor={blue!50!black},
    urlcolor={blue!80!black}
}
\usepackage{apacite} 

% 500-1000 words by when we are done :)

\title{Diversity in Reproducibility}
\shorttitle{Diversity in Reproducibility}

\twoauthors{Olivia Guest}{Nicolas P. Rougier}
\twoaffiliations{Department of Experimental Psychology\\University College London, United Kingdom}{INRIA Bordeaux Sud-Ouest, Talence, France\\
Institut des Maladies Neurodégénératives, Université Bordeaux, Centre National de la Recherche Scientifique, UMR 5293, Bordeaux, France\\
LaBRI, Université de Bordeaux, Institut Polytechnique de Bordeaux, Centre National de la Recherche Scientifique, UMR 5800, Talence, France}
%olivia.guest@psy.ox.ac.uk
%nicolas.rougier@inria.fr
\abstract{}

\begin{document}
\maketitle
In our previous contribution, we proposed computational modelling-related definitions for \textbf{replicable}, i.e., experiments within a model can be recreated using its original codebase, and \textbf{reproducible}, i.e., a model can be recreated based on its specification. We stressed the importance of specifications and of access to codebases. Furthermore, we highlighted an issue in scholarly communication: many journals neither require nor facilitate the sharing of code. In contrast, many third-party services have filled the gaps left by traditional publishers \cite<e.g.,>{binder, github, osf, rescience}. Nevertheless, journals and peers rarely request or expect use of such services. We ended by asking: are we ready to associate codebases with articles, and are we prepared to ensure computational theories are well-specified and coherently implemented?

\section{Scope of Evaluation}

Dialogue contributions include proposals for: intermediate levels between replicability and reproducibility (Crook, Hinsen); going beyond reproducibility (Kidd); encompassing computational science at large (Gureckis \& Rich, Varoquaux); and addressing communities as a function of expertise (French \& Addyman).
On the one hand, some replies discuss evaluation more broadly, empirical data collection, and software engineering.
On the other hand, some delve into the details of evaluating modelling accounts.
We will discuss the former first.

In Varoquaux's contribution, reproducibility includes replicability and code rot \cite<e.g., in fMRI:>{Eklund12072016}.
However, the titular computational reproducibility is orthogonal to maintaining a re-usable codebase.
Software and hardware inevitably go out of fashion, meaning codebases expire.
Nevertheless, the overarching theory encapsulated by modelling software could withstand the effects of entropy if specified coherently, e.g., early artificial neural network codebases are not required to understand or reproduce these models.
Indubitably, there is a balance to be struck between reimplementation and re-use.

In contrast, Gureckis and Rich extend their scope to the empirical replication crisis in psychology.
They mention that implicit knowledge often goes unpublished and thus that only fully automated online experiments constitute computationally reproducible psychology.

Epistemically, empirical and software replication and reproduction are distinct from their modelling-related counterparts; together they are six related endeavours.
The difference between software for science (e.g., a statistical test) and science that is software (e.g., a cognitive model) is an important one to underline.
In the former case the code is a tool; in the latter it constitutes an experiment.
Notwithstanding, all such evaluations have scientific merit.

\section{Levels of Evaluation}
We mentioned two of the levels at which modelling work is evaluated.
Unanimity is reached on replication as a minimum check; however, some dialogue contributions go further.
To wit, Hinsen separates this endeavour into three steps.
Specifically, we must check that a model is: bug-free; reproducible as presented; and congruent with empirical data.
These roughly map onto the levels of talking about modelling work more generally, as Kidd notes \cite{marr82}.

\subsection{Implementation Level}
With respect to the implementation level, as Crook explains, re-running code both within a lab and by others allows checking for bugs and, importantly, whether assumed-to-be-irrelevant variables, e.g., the random seed, are driving the results.
This also ensures documentation is appropriate. 
Success at this level indicates a model is \textit{replicable}.
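
As a minimal sketch of such a check, and not a description of any contributor's actual workflow, the following Python fragment re-runs a toy stochastic model under several random seeds and summarises how much the outcome varies. The function \texttt{run\_model}, the success probability of 0.7, and the number of trials are all hypothetical stand-ins chosen for illustration.

```python
import random
import statistics


def run_model(seed, n_trials=1000):
    """Hypothetical stand-in for a stochastic model; returns mean accuracy."""
    # A private generator keeps each run independent of global random state.
    rng = random.Random(seed)
    # Each trial succeeds with probability 0.7 (an arbitrary illustrative value).
    successes = sum(rng.random() < 0.7 for _ in range(n_trials))
    return successes / n_trials


def seed_sensitivity(seeds, n_trials=1000):
    """Re-run the model once per seed and summarise the spread of results."""
    results = [run_model(seed, n_trials) for seed in seeds]
    return {
        "results": results,
        "mean": statistics.mean(results),
        "spread": max(results) - min(results),
    }


# Replication check: the same seed must give exactly the same result.
same_seed_identical = run_model(42) == run_model(42)

# Robustness check: the outcome should not hinge on one particular seed.
summary = seed_sensitivity(range(10))
print(same_seed_identical, summary["mean"], summary["spread"])
```

If the spread across seeds is large relative to the effect being reported, the seed is plausibly driving the result; what counts as large is a judgement that depends on the model in question.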

\subsection{Model Level}
To evaluate the quality of the specification, we may re-write, i.e., \textit{reproduce}, the model from scratch.
This provides evidence for or against the adequacy of the specification, depending on the reimplementation's success.
As Kidd mentions, and as we discovered \cite{cooper14}, this process allows us to: discern when implementation details must be elevated to the theory level and vice versa; evaluate the specification; and uncover bugs.

\subsection{Theory Level}
Many methods exist for testing theories.
One such method involves computationally implementing a theory; another is to test predictions by gathering empirical data.
As Crook points out, such data is also used to evaluate models and should be associated with the original article and codebase. 
In such cases, empirical data requires re-collecting.
This is because, as Hinsen warns, if the phenomenon to be modelled does not occur as described by the overarching theoretical account, then both theory and model are brought into question.
``A* is a model of [...] A to the extent that [we] can use A* to answer questions [...] about A.'' \cite[p.~426]{Minsky:1965}

\section{Conclusions}
Even though definitions for terms across the replies do not fully converge,\footnote{We do not wish to prescriptively enforce our terms and definitions, and we are open to suggestions, especially based on the use of such terms by computationally based disciplines \cite<e.g.,>{2016arXiv160504339M,Patil066803}.} all contributors agree that change is needed and imminent.
A notable divergence of opinion can be found in the reply by French and Addyman, who believe specifications are less vital than we do.
Importantly, we agree on some fundamentals: sharing codebases; linking articles with codebases; and reproducing models \cite<e.g.,>{rescience}.

In response to our question, Hinsen proposes that modellers include a specific article section on evaluation, while Crook lists community-driven initiatives for sharing codebases and specifications.
Crook hopes, as we do, for top-down publisher-enforced sharing of resources in partially centralised repositories.
However, this does not preclude, and may in fact require, grassroots demands.
If the scientific community rightfully yearns for change, we are required to act to make this happen.

\bibliographystyle{apacite}
\bibliography{ref}
% The following space works around a bug in typesetting the references, where the hanging indent of the last reference is incorrectly set.
\hspace*{1cm}
\end{document}