Data/Files spanning multiple RepSeP instances #8

SebastianSemper · 2018-09-04T13:31:22Z

I want to raise an issue that I (at least) encounter often: I work with the same data across multiple projects, such as

a collection of my TeX macros, which I rely on heavily, especially in formula-heavy publications, and
a plain-text bibliography consisting of an ever-growing number of bibtex entries.

Both types of "data" (or, let's say, constants) have to be versioned and must not be scattered across multiple unconnected places on the file system, so they are (ideally) maintained in a git repository.

I think the specification currently lacks standardized handling of this kind of resource, so a way to allow this type of external source should be up for discussion. Is this even something that should be in the standard? My personal feeling is that it should, but I might be wrong.

TheChymera · 2018-09-05T15:20:58Z

pythontex/

Interesting question. I think the pythontex/ directory is the main instance of it that I have grappled with. It is boilerplate code for the underlying pythontex features, and as such it shouldn't be duplicated all over the place. Ideally it would be distributed like any other software package. The only issue is that it is very small and might need a lot of editing/amending (possibly specific to the document), which is why I have opted to keep it in userspace for the time being.

Data

Regarding data, I think the solution is more obvious, and it is to manage it like any other software package. I maintain e.g. the mouse-brain-atlases package, which we use to version our atlases across machines, or the samri_bindata package.

Centralized bibfile

For the bib file it's trickier. If you want to edit it often, it's not an option to install it system-wide. In fact, if you edit it often and it is supposed to evolve synchronously with your documents, I believe there is no other way than to track it in the same git repo as your documents, and to break it down so that it is specific to each project.

Sounds like I'm questioning the question instead of answering it, but I think there's simply a fundamental contradiction here. If you want to centralize the bib entries, that means you need to have a package which makes sense independently of and evolves asynchronously from your documents. I think the a/synchrony issue may be marginal to your purposes, though, since it sounds like you are rather interested in always having the latest version.

Live Package

If that is the case, I know a technology which might address precisely your problem, but it is likely not available on your distribution. It's called live package versioning, and it means that your package manager has a package version called -9999 (or something else that evaluates to newest) which instructs it to download the HEAD of a specific branch. Under Gentoo Linux you could, for instance, have a cron job which runs emerge =mybib-9999 =myotherstuff-9999 hourly, so that you get continuous deployment. For users, the package manager would pull the 9999 bibtex dependency at install time.

In the absence of that you could of course just put things in userspace and deploy via a normal git command, but that will not automatically resolve for users (even if they have your distro) and can easily get lost and messy outside the package manager (i.e. there will be no way to automatically check whether the bibfile is present unless you ship explicit checks, at which point you start turning your document into a package manager).

Still, this would not address the scope issue. If we collaborate on a project (and we both use bib master files), what do we do? Do I put all your maths citations in my master file, and do you also include my neuro citations? Do we split them by field? How? Seems potentially very complicated.

File Delegation

Another option would be to delegate the specific subset of bibtex entries to each project. I believe you can use JabRef to do this or simply look at the .aux file, where you can get a list of only the citations actually used in any one document. I guess you could try to just delegate the set to each publication. Conversely if a user also has a master bib file, he could just import these delegated files into it.
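The .aux route can be sketched roughly as follows. This is a minimal illustration, not a tested tool: it assumes a classic BibTeX workflow in which the .aux file contains \citation{...} lines, and it uses a deliberately naive notion of a bib entry (anything from an "@type{key," at the start of a line up to the next such line). The function names cited_keys and delegate_entries are made up for this example.

```python
import re

# Matches the start of a bib entry such as "@article{somekey," at line start.
# Naive assumption: every entry begins on its own line.
ENTRY_START = re.compile(r"^@(\w+)\s*\{\s*([^,\s]+)\s*,", re.MULTILINE)


def cited_keys(aux_text):
    # Collect the keys listed in the \citation{...} lines of an .aux file.
    keys = set()
    for match in re.finditer(r"\\citation\{([^}]*)\}", aux_text):
        for key in match.group(1).split(","):
            key = key.strip()
            if key:
                keys.add(key)
    return keys


def delegate_entries(master_bib_text, keys):
    # Return only the entries of the master file whose key is in `keys`.
    starts = list(ENTRY_START.finditer(master_bib_text))
    selected = []
    for i, match in enumerate(starts):
        if i + 1 < len(starts):
            end = starts[i + 1].start()
        else:
            end = len(master_bib_text)
        if match.group(2) in keys:
            selected.append(master_bib_text[match.start():end].strip())
    return "\n\n".join(selected) + ("\n" if selected else "")
```

A project would then read its own .aux file and the master file, and write the result of delegate_entries to a project-local bib file that is tracked together with the document.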

To integrate this into the build process, I guess you would have to add a || conditional hack, so that when bibtex fails with the project-included file (because you have not yet re-delegated), it falls back to a canonical location of such a master file.
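One way to picture the fallback, as a sketch under stated assumptions: instead of relying on the exit status of bibtex, the helper below checks directly whether a candidate file contains every cited key, and moves on to the next candidate if it does not. The names keys_in and choose_bibfile are hypothetical, as is the idea of passing the candidates as an ordered list (project file first, master file last).

```python
import re
from pathlib import Path

# Naive key extraction: "@type{key," at the start of a line.
KEY_PATTERN = re.compile(r"^@\w+\s*\{\s*([^,\s]+)\s*,", re.MULTILINE)


def keys_in(bib_path):
    # Return the set of entry keys in a bib file, or an empty set if it is absent.
    path = Path(bib_path)
    if not path.is_file():
        return set()
    return set(KEY_PATTERN.findall(path.read_text(encoding="utf-8")))


def choose_bibfile(needed_keys, candidates):
    # Return the first candidate that contains every needed key.
    needed = set(needed_keys)
    for candidate in candidates:
        if needed <= keys_in(candidate):
            return Path(candidate)
    raise LookupError("no candidate bibliography contains all cited keys")
```

Called with the project-local file first and the master file second, this prefers the delegated file and only falls back to the master file when the delegated one is stale or missing.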

SebastianSemper · 2018-09-05T15:41:15Z

Nice hints!

The Live Package option seems too specifically tailored to Gentoo, so even for me (running the next best thing to Gentoo, Arch) it is not available.

I thought about something like a hook in the workflow, where one also incorporates one or several (possibly public) git repositories into the RepSeP metadata for the project, together with the hashes of the commits used at submission or publication of the work. This way, for instance, bibfiles or TeX macros can continue to evolve after publication, but the specific release is associated with a certain state of the externally maintained git repositories in the submodules.

So, to summarize, we would add something like external dependencies via submodules to the specification and let git take care of keeping things in order, both locally in the specific RepSeP project and globally for the content of the various included submodules.
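As a rough sketch of what recording those commit hashes could look like: the function below turns the text printed by git submodule status into a mapping from submodule path to pinned commit. It assumes the usual output shape of that command (a one-character status prefix, the commit hash, the path, and an optional description in parentheses). The function name pinned_submodules and the layout of the resulting record are assumptions; the thread does not define how RepSeP would store this.

```python
import re

# One line of `git submodule status`: prefix (" ", "-", "+" or "U"),
# commit hash, path, and optionally a "(describe)" suffix.
STATUS_LINE = re.compile(r"^([ +\-U])([0-9a-f]{40,64}) (.+?)(?: \((.*)\))?$")


def pinned_submodules(status_output):
    # Map each submodule path to its commit hash and whether the checkout
    # matches the commit recorded in the superproject (prefix " ").
    pins = {}
    for line in status_output.splitlines():
        match = STATUS_LINE.match(line)
        if match is None:
            continue  # skip anything that does not look like a status line
        flag, commit, path, _describe = match.groups()
        pins[path] = {"commit": commit, "in_sync": flag == " "}
    return pins
```

A build step could store such a record alongside a submission and refuse to tag a release while any submodule is reported as out of sync.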

I am just bringing this issue up, because an automated build system for deploying the work would need to take care of maintaining these submodules transparently and reliably.

TheChymera · 2018-09-06T21:39:51Z

It sounds like you are tending toward some sort of package management after all, and are adapting to the prospectively frequent updates by planning to use hash-based versioning instead of release versioning. The choice of using Git hooks is then merely a workaround for the fact that this is an even more niche use case for package managers than live packages.

Still, Git is not a package manager, and the most prominent consequence would be that you would not be able to share a resource between users or projects (you would have to duplicate it, possibly leading to merge conflicts, etc.). I have come across a package which does something similar to what you have suggested, and it makes keeping explicit track of the dependencies really unpleasant.

Additionally this also means you will have to integrate boilerplate dependency checking/management code into your projects, so you would also end up duplicating that all over the place. In the end, you should try to package that part separately (as we should with the pythontex/ dir), at which point you will have created a new package manager.

As it turns out, Portage (the package manager of Gentoo and a few other distros) can also do hash-based versioning, though not yet automatically. I'm also working on that, but it's nowhere close to being ready.

Not least of all, I think this use case should exist as an alternative to per-project bibliography tracking. So it would be really cool if it integrated (e.g. we collaborate on a project, you get to use your master file, but I don't have to). So I still think the best solution might be further down the “File Delegation” line of thinking.
