Strengthen and adjust `copy-royal` usage guidance and caveats #2180

EliahKagan · 2025-09-21T06:34:38Z

Copy-royal is useful for taking repository trees that triggered a bug and producing trees that diff the same. This allows regression tests to be prepared that focus on the aspects relevant to diffing.

However, the pattern represented by the output of copy-royal is effectively equivalent to a regular grammar of the possible inputs; more specifically, a concatenation of character classes, which in practice are each small. For example, copy-royal on `Hello, world!` produces `Iette, yeuta!`. If the input is assumed to consist of ASCII characters, the possible inputs that produce `Iette, yeuta!` are the strings that match the regular expression `[HR][eoy][blv][blv][eoy],\ [cmw][eoy][hr][blv][dnx]!`.
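The correspondence between one output and its candidate inputs can be sketched in Python. This is a minimal illustration, not the gitoxide implementation: the `PREIMAGES` table and the helper names `candidate_regex` and `candidate_count` are hypothetical, and the table contains only the character classes that appear in the regular expression above, so it covers just the letters of this one example.

```python
import math
import re

# Preimage classes copied from the regex quoted above for "Iette, yeuta!".
# ASSUMPTION: this is NOT the full copy-royal mapping, only the letters
# that occur in this single example.
PREIMAGES = {
    "I": "HR",
    "e": "eoy",
    "t": "blv",
    "y": "cmw",
    "u": "hr",
    "a": "dnx",
}


def candidate_regex(output: str) -> str:
    """Translate an output string into a regex matching its possible inputs."""
    parts = []
    for ch in output:
        if ch in PREIMAGES:
            # A mapped letter becomes the class of letters that map to it.
            parts.append(f"[{PREIMAGES[ch]}]")
        elif ch.isalpha():
            raise ValueError(f"no preimage class known for {ch!r}")
        else:
            # Non-letters are treated as passing through unchanged.
            parts.append(re.escape(ch))
    return "".join(parts)


def candidate_count(output: str) -> int:
    """Number of distinct strings the regex for `output` can match."""
    return math.prod(len(PREIMAGES.get(ch, ch)) for ch in output)


pattern = candidate_regex("Iette, yeuta!")
print(pattern)
print(bool(re.fullmatch(pattern, "Hello, world!")))  # True
print(candidate_count("Iette, yeuta!"))  # 26244
```

For this example the search space is 2^2 * 3^8 = 26,244 strings, of which very few read as plausible text.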

Because real-world inputs are not random selections of letters, the code or text of the input will often be possible to reconstruct from this pattern, such as by using an autoregressive LLM with constrained decoding to sample only the small subset of logits at each time step that are not contradicted by the regex, while exploring multiple paths with techniques such as beam search. This approach is explored with a proof-of-concept in this Google Colab notebook.
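The mechanics of constrained decoding with beam search can be shown with a toy sketch. This is not the method used in the notebook: a character bigram model with add-one smoothing stands in for the LLM, the names `CLASSES`, `CORPUS`, `log_prob`, and `beam_search` are hypothetical, and the tiny corpus deliberately contains the target phrase, so the result demonstrates only how the constraint and the beam interact, not how well reconstruction works in practice.

```python
import math
from collections import Counter

# Character classes from the regex for "Iette, yeuta!"; literal
# characters are written as singleton classes.
CLASSES = ["HR", "eoy", "blv", "blv", "eoy", ",", " ",
           "cmw", "eoy", "hr", "blv", "dnx", "!"]

# ASSUMPTION: a toy corpus standing in for a language model's knowledge.
CORPUS = "Hello, world! Hello there, world."

PAIRS = Counter(zip(CORPUS, CORPUS[1:]))  # bigram counts
PREVS = Counter(CORPUS[:-1])              # counts of each leading character


def log_prob(prev: str, ch: str) -> float:
    """Add-one smoothed bigram log-probability over 128 ASCII symbols."""
    return math.log((PAIRS[(prev, ch)] + 1) / (PREVS[prev] + 128))


def beam_search(classes, beam_width=3):
    """Keep the `beam_width` best partial strings that satisfy the classes."""
    beams = [(0.0, "")]  # (cumulative log-probability, text so far)
    for allowed in classes:
        expanded = []
        for score, text in beams:
            # Treat the start of the string as if it followed a space.
            prev = text[-1] if text else " "
            # Constrained decoding: only characters in the class are scored.
            for ch in allowed:
                expanded.append((score + log_prob(prev, ch), text + ch))
        expanded.sort(key=lambda beam: beam[0], reverse=True)
        beams = expanded[:beam_width]
    return beams


best_score, best_text = beam_search(CLASSES)[0]
print(best_text)  # Hello, world!
```

Replacing `log_prob` with next-token scores from an autoregressive model, masked to the tokens consistent with the regex, would give the general shape of the approach described above.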

Because copy-royal is thus effectively reversible in many cases, likely including all practical cases, we should avoid saying or implying that the input text cannot be reconstructed or discovered, and we should especially avoid saying or implying that copy-royal could or should be used to turn private or sensitive data into data that are okay to share publicly. The documentation of copy-royal has suggested irreversibility for a long time, though not initially meaning it in a robust sense, and gradually shifted to incorporate language that further suggested that it could be used for security or privacy related purposes.

This pull request rewrites the documentation for copy-royal and directly related facilities to avoid implying benefits or use cases that it does not or may not have, and instead to describe its useful purpose more clearly.

Besides the notebook linked above, the commit message contains some more details.

The `copy-royal` algorithm maintains the patterns and "shape" of text sufficiently to keep diffs the same (in the vast majority of cases). It is used in `internal-tools` to help prepare test cases with what is important and relevant to a regression test of diff behavior, rather than the exact original repository content in a tree that has been found to trigger a bug. It avoids needless verbatim reproduction, while preserving aspects that are useful and necessary for testing. It keeps the focus on patterns, preventing irrelevant details of code in a tree that triggered a bug from being confused with the logic of gitoxide itself, and makes it less likely to be touched inadvertently in efforts to fix bugs or improve style (which, in test data, would cause subtle breakage).

Although these benefits are substantial and we intend to continue using copy-royal in the preparation of test cases as needed if or when regressions arise, some of the guidance and rationale we had given for its use was inaccurate or misleading. Most importantly, copy-royal cannot be used in practice to redact sensitive information: if you have a repository whose contents should not be made public, then it is not safe to share the output of copy-royal run on that repository either.

Copy-royal is implemented (roughly speaking) by mapping alphabetic characters down to ten letters. This removes some information, at least in principle: that is, if it were given totally random letters as input, then it would be impossible to reverse it to get those letters back. Even on input that is much more structured and predictable, such as real-world input, it obfuscates it, making it look garbled and nonsensical. However, even when one intuitively feels that it has destroyed information, it is possible to reverse it in many cases, and possibly even in all practical cases.

The reason is that, in real world source code and natural language, some sequences of letters are overwhelmingly more likely to occur than others, both in general and (especially) contextually given what surrounding text is present. The information that is removed by mapping into ten letters could often be reconstructed by:

1. Building a grammar of possible inputs, which can be done in a simple manner by translating the copy-royal output one wishes to reverse into a regular expression in which every symbol in the copy-royal output becomes a character class of characters that map to it. In effect, for every output of the copy-royal algorithm, there is a regex that matches the possible inputs.

2. Predicting, stepwise, what code or text is likely to have arisen that matches that grammar. In principle this could be done with a variety of techniques or even manually. But one fruitful approach would be to use an autoregressive large language model, and apply constrained decoding[1] to sample only logits consistent with the regex. Small experiments carried out so far suggest[2] this to be a workable technique when combined with beam search[3]. (This technique does not require the specific text or code being reconstructed to have existed when the model was trained.)

Accordingly, this modifies the documentation of copy-royal to avoid claiming that the input of copy-royal cannot be recovered, or anything that recommends or may appear to recommend the use of copy-royal to redact sensitive information. It also clarifies and adjusts the explanation of when it makes sense to use copy-royal, and describes some of its benefits that do not rely on the assumption that it is infeasible (or even difficult) to reverse.

In the comment documenting `BlameCopyRoyal`, which is among those edited in the above ways, this also edits its top line to make clear more generally how `BlameCopyRoyal` relates to `git blame`.

[1]: https://github.com/Saibo-creator/Awesome-LLM-Constrained-Decoding
[2]: See link(s) in GitoxideLabs#2180
[3]: https://en.wikipedia.org/wiki/Beam_search

Co-authored-by: Sebastian Thiel <sebastian.thiel@icloud.com>

Byron · 2025-09-21T08:54:59Z

Thanks a lot, much better!

cruessler · 2025-09-22T08:39:30Z

Thanks a lot!

EliahKagan force-pushed the copy-royal-guidance branch from 3d17d49 to 90a3dcb Compare September 21, 2025 06:35

EliahKagan marked this pull request as ready for review September 21, 2025 06:36

EliahKagan enabled auto-merge September 21, 2025 06:36

EliahKagan merged commit d976848 into GitoxideLabs:main Sep 21, 2025
47 of 49 checks passed

EliahKagan deleted the copy-royal-guidance branch September 21, 2025 14:59
