Strengthen and adjust copy-royal usage guidance and caveats
#2180
Copy-royal is useful for taking repository trees that triggered a bug and producing trees that diff the same way, so that regression tests can be prepared that focus on the aspects relevant to diffing.
However, the pattern represented by the output of copy-royal is effectively equivalent to a regular grammar of the possible inputs; more specifically, a concatenation of character classes, which in practice are each small. For example, copy-royal on `Hello, world!` produces `Iette, yeuta!`. If the input is assumed to consist of ASCII characters, the possible inputs that produce `Iette, yeuta!` are the strings that match the regular expression `[HR][eoy][blv][blv][eoy],\ [cmw][eoy][hr][blv][dnx]!`.
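To make the "concatenation of character classes" concrete, here is a minimal sketch that rebuilds such a pattern from an output string. It rests on an assumption inferred only from the example above, not from copy-royal's source: each ASCII letter is replaced according to its byte value modulo 10, with case preserved, and other characters pass through unchanged. The table `RESIDUE_OF_OUTPUT` and the functions `preimage_class` and `preimage_regex` are hypothetical names, and the table lists only the letters that appear in the example output.

```python
import re
import string

# Assumption inferred from the example, not taken from copy-royal's source:
# an ASCII letter's replacement depends only on (byte value % 10) and its case.
# Only the output letters visible in the example are listed.
RESIDUE_OF_OUTPUT = {"a": 0, "e": 1, "i": 2, "u": 4, "t": 8, "y": 9}


def preimage_class(out_char: str) -> str:
    """Regex fragment matching every ASCII input character that could yield out_char."""
    if out_char not in string.ascii_letters:
        # Non-letters are assumed to pass through unchanged.
        return re.escape(out_char)
    residue = RESIDUE_OF_OUTPUT[out_char.lower()]
    alphabet = string.ascii_lowercase if out_char.islower() else string.ascii_uppercase
    members = "".join(c for c in alphabet if ord(c) % 10 == residue)
    return f"[{members}]"


def preimage_regex(output: str) -> str:
    """Concatenate one character class per output character."""
    return "".join(preimage_class(c) for c in output)


pattern = preimage_regex("Iette, yeuta!")
print(pattern)
print(bool(re.fullmatch(pattern, "Hello, world!")))
```

On Python 3.7 or later, where `re.escape` leaves `,` and `!` unescaped, this should print the same regular expression as quoted above, followed by `True`. Each class has only two or three members, which is what makes the search space small enough to attack.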
Because real-world inputs are not random selections of letters, it will often be possible to reconstruct the code or text of the input from this pattern, such as by using an autoregressive LLM with constrained decoding to sample only from the small subset of logits at each time step that are not contradicted by the regex, while exploring multiple paths with techniques such as beam search. This approach is explored with a proof-of-concept in this Google Colab notebook.
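The mechanics of constrained decoding with beam search can be sketched without an LLM. The example below is not the notebook's code: it swaps in a toy character-bigram model, trained on a tiny made-up corpus, as a stand-in for the language model. The names `CORPUS`, `ALPHA`, `log_prob`, and `beam_search` are hypothetical, and the smoothing constants are arbitrary. At each position, only characters from that position's class are considered, and the best few prefixes are kept.

```python
import math
from collections import Counter

# Character classes from the regular expression above, one entry per position.
CLASSES = ["HR", "eoy", "blv", "blv", "eoy", ",", " ",
           "cmw", "eoy", "hr", "blv", "dnx", "!"]

# Hypothetical stand-in for an LLM: a character-bigram model trained on a
# tiny made-up corpus. "^" marks the start of the text.
CORPUS = "hello there. hello, world! what a wonderful world."
BIGRAMS = Counter(zip("^" + CORPUS, CORPUS))
CONTEXTS = Counter("^" + CORPUS[:-1])
ALPHA = 0.1       # additive smoothing strength (arbitrary choice)
ASCII_SIZE = 128  # alphabet size assumed for smoothing


def log_prob(prev: str, nxt: str) -> float:
    """Smoothed log P(nxt | prev) under the toy model, ignoring case."""
    prev, nxt = prev.lower(), nxt.lower()
    return math.log((BIGRAMS[(prev, nxt)] + ALPHA)
                    / (CONTEXTS[prev] + ALPHA * ASCII_SIZE))


def beam_search(classes, beam_width=5):
    """Extend prefixes only with allowed characters, keeping the best beam_width."""
    beams = [(0.0, "")]
    for allowed in classes:
        candidates = [
            (score + log_prob(text[-1] if text else "^", ch), text + ch)
            for score, text in beams
            for ch in allowed
        ]
        beams = sorted(candidates, reverse=True)[:beam_width]
    return beams


for score, text in beam_search(CLASSES):
    print(f"{score:8.2f}  {text}")
```

Because the made-up corpus deliberately contains the relevant words, the top-ranked candidate here is `Hello, world!`. That only demonstrates the mechanics. A real LLM brings far broader knowledge of likely text, which is why reconstruction is plausible for realistic inputs.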
Because copy-royal is thus effectively reversible in many cases, likely including all practical cases, we should avoid saying or implying that the input text cannot be reconstructed or discovered. We should especially avoid saying or implying that copy-royal could or should be used to turn private or sensitive data into data that are okay to share publicly. The documentation of copy-royal has suggested irreversibility for a long time, though it did not initially mean this in a robust sense, and it gradually shifted to incorporate language further suggesting that copy-royal could be used for security- or privacy-related purposes.
This pull request rewrites the documentation for copy-royal and directly related facilities to avoid implying benefits or use cases that it does not or may not have, and instead to describe its useful purpose more clearly.
Besides the notebook linked above, the commit message contains some more details.