Add CompUnit::Repository::Lib (or something like it) to core #386
Is the idea for `CURL` to replace `CURI`? I ask because of the second drawback you mentioned: that users can't install Unicode-containing files on some file systems. I'd personally view that as a fairly large point against the idea. Given that many file systems do support Unicode characters and that many languages use non-ASCII characters fairly regularly, it's pretty easy to imagine a module author using a file name that works for them but fails for other users. And that'll only be increasingly true if some of Raku's more ambitious internationalization plans work out (of the sort discussed at the core summit, I mean). And, anyway, both S22 and the CompUnit docs make a fairly big deal out of Raku's support for Unicode file names. So, at the very least, it's not something we should give up lightly. Would there be some way to mangle/normalize the file names such that they're still human readable (at least vaguely) but that avoids breaking on non-Unicode-supporting file systems?
I agree that in theory it is a great idea. In practice it hasn't really been used, and the way we allow it to happen at all has significant drawbacks. It is only theoretically possible in a very specific situation: installing a distribution using a custom
Punycode is probably the best thing I can think of, but that doesn't handle the arguably bigger issue of case sensitivity. I'm not sure there is a way to have all three of: human readable, Unicode compatible, and case sensitive. Regardless,
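To illustrate the case-sensitivity gap (a Python sketch, used only because Python's stdlib happens to ship a Punycode codec; nothing here describes Raku's actual behavior): Punycode makes a name ASCII-safe, but names differing only in case stay distinct, so they still collide on a case-insensitive file system.

```python
# Punycode maps Unicode to pure ASCII (Python's stdlib "punycode"
# codec, used here purely for illustration).
print("bücher".encode("punycode"))       # b'bcher-kva'

# Case is preserved as-is, so these two would still clash on a
# case-insensitive (but case-preserving) file system:
print("Foo.rakumod".encode("punycode"))  # b'Foo.rakumod-'
print("foo.rakumod".encode("punycode"))  # b'foo.rakumod-'
```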
Something else worth mentioning is that Unicode module names can be used with ASCII file names, via the `provides` mapping in `META6.json`.
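For readers unfamiliar with that mechanism, here is a sketch (distribution and file names invented; shown as Python/JSON rather than Raku): the `provides` map in `META6.json` ties module names to file paths, so a Unicode module name can live in an ASCII-named file.

```python
import json

# Invented META6.json fragment: "provides" (the real S22 field) maps
# module names to file paths, decoupling the module name from the
# on-disk file name.
meta = json.loads("""
{
  "name": "Beispiel",
  "provides": { "Fünf": "lib/Fuenf.rakumod" }
}
""")

# A `use Fünf;` resolves through this map, so no non-ASCII name ever
# needs to exist on the file system.
assert meta["provides"]["Fünf"] == "lib/Fuenf.rakumod"
```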
That kind of makes it more of a footgun, though – if authors see Unicode module names used by more experienced Rakoons, and then use Unicode when naming their own modules (not realizing that they need a `META6.json` mapping). But, again, I like the general idea of moving to `CURL`.
I think ideally one could just extract an archive of a boring distribution somewhere and have it work. If we have to normalize 99% of all module filenames then I'm not sure it's a great alternative. And for that matter it doesn't even avoid the issue regarding the naming of files inside of the archive itself. I kind of think the best we can do is to allow ecosystems to warn users against distributing code that isn't very system independent. For instance fez might warn users when they try to upload a module that has a Unicode file name.
IMO, supporting Unicode filenames is pretty key to meaningfully supporting Unicode module names and, in turn, supporting Unicode module names is a core goal of Raku's whole approach to internationalization.

It also seems to me that fez or zef should be responsible for mapping existing, human-friendly names into names that work across platforms (maybe in a step that occurs before installation; maybe even at upload?) instead of asking users to do that mapping manually in their `META6.json`.

But all that's just my 2¢, and I know you've thought more deeply about this issue than I have. So, if no one else chimes in, I'm happy to defer to your judgment on this one.
To me, not having to map Unicode file names in a META6.json is a pretty low bar to support this type of feature. The fact that we even make it possible to have a Unicode module name is in line with making hard things possible. We don't need to make hard things easy, just possible. We even allow the user to just use the Unicode file name on their own system if they want.
In a way that means every package system (zef, apt, etc) would be free to come up with their own complicated logic to do this. Users will still want to know how to get the mangled file name (similar to how users still want to get the sha1 file names even though they shouldn't and even though we supply users with all the tools to do things The Right Way), but they'll have no way of knowing which scheme any two distributions are using. Taken to the extreme one could say zef should ignore the META6.json file entirely (or rather just generate what it thinks is an appropriate META6.json at build time) to always do what is probably expected, but doing what is probably expected is not ideal for something with security implications. To some degree module authors have to be explicit, strict, etc if they want their code to work outside of their own systems.
I think it would be a good idea to notify/warn/error the user at upload. Modifying the distribution at upload or after download, not so much (after all the uploader should be able to know the checksum before it is uploaded). Having fez/mi6 handle the Unicode file name in the META6.json at authoring time is also logical. But regarding installation time... how can e.g.
But then we admit we can't support non-raku code that won't work with a different name. This is not an acceptable workaround; it is just the only workaround.
Maybe I'm misunderstanding, but how would someone do this for html and javascript files practically? In a production environment you don't want to serve these types of files by going through some raku code to map the names; you want to let e.g. your reverse proxy just handle all your static content from some directory directly, which means you need to access the files via the names as they exist on the file system. Even getting the names at runtime is probably going to be impossible unless it is built into Raku itself, because each packaging system would have its own methods of doing this (which means we would have distributions depending on a specific package manager).
I have more thoughts on this than I'm capable of writing up unprompted in an initial github issue, so I'm happy (and would expect) to continue addressing people's concerns. Removing a feature is not an easy proposition to make.
I'm not even sure it must be enforced. A (suppressible) warning would be sufficient. Say, a distribution could be purportedly designed for a limited set of systems where Unicode is supported. Besides, installation can reject a distribution, with a descriptive error message, if it finds out that not all file names are suitable for the local file system. (...saying nothing of too-long paths on Windows...)
Yeah, and unfortunately I'm not sure there is a good way to do that outside of rakudo itself. A naive way would be for some program to try to write these various files, see what works and what does not, and use that knowledge to generate such warnings before zef passes the distribution to rakudo for the actual installation of files. But that would have to be done per volume/device/mount, since e.g. two directories can point at different file systems (and even then this could change after the fact, so any "database" of this info is liable to become stale). Basically such a rejection would have to come from rakudo itself.
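That naive probe could look like this sketch (Python; the helper name is invented): write a candidate name into the target directory and check that it survives a round trip, once per mount.

```python
import os

def name_roundtrips(dirpath: str, name: str) -> bool:
    """True if `name` can be created in `dirpath` and listed back under
    exactly that name (catches both unsupported characters and
    case-folding file systems). Must be run per volume/mount, and the
    answer can go stale, as noted above."""
    path = os.path.join(dirpath, name)
    try:
        with open(path, "w", encoding="utf-8") as f:
            f.write("probe")
    except OSError:
        return False
    try:
        return name in os.listdir(dirpath)
    finally:
        os.remove(path)
```

A tool like zef could run something like this over a distribution's file list before handing it to rakudo, though a definitive rejection would still have to come from the code doing the actual installation.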
I think the problems you listed for CURI can be solved within the implementation of CURI without requiring a full replacement. It's just that no one has given it a try so far. E.g. CURI can easily be changed to use subdirectories for keeping the files of multiple distros apart and to use pure-ASCII names as-is. It can even go a step further and simply test whether it can write a non-ASCII file name as-is and read it back. Nothing in CURI's interface requires it to rename all files or keep them in the current structure. The changes I mentioned can be done while retaining full backwards compatibility with existing installations.
This is only true in theory. In practice it breaks custom repository locations by hard-coupling them to a specific rakudo version or higher.
Fixing these issues in CURI will most likely not require a change in repository version, as we record the path to the stored file for every internal name. Old rakudo versions would just follow that path and not care whether it's a SHAed file name or a directory + original name.
One preliminary response, and then something that gets more at the heart of the issue. First, on the specific point:
Right now, if I have a

Now on to the more general point:
Thanks, that's a really helpful comment – it clarifies how our perspectives differ. In my view, "giving a module the name I want" should be in the easy-things-should-be-easy category. I'm thinking partly of @finanalyst's proposal to create non-English versions of Raku (via slangs) or @alabamenhu's work to support multi-lingual error messages. But, even setting those projects aside, "naming a module in my native language" strikes me as something that we should make easy – pretty much everyone names modules, after all. And, of course, for pretty much any non-English speaker, using names from their native language requires Unicode support at least some of the time. Conversely, I'm pretty willing to put "dynamically linking against a non-Raku program that requires a static name" in the hard-things-should-be-possible category. I'd venture a guess that the vast majority of Raku programs don't directly link against any non-Raku code, much less any that requires a static name. Of course, many more indirectly do so, but that's kind of my point: the interface between Raku and non-Raku code tends to be at the library level and, IMO, it's reasonable to expect library authors to do the hard thing of supporting linking via a static name. Given that perspective, I wonder whether we could solve the issues with `CURI` instead.
Maybe I'm misunderstanding, but to clarify I'm talking about with an installed distribution. The files don't exist with their original file names, so
To be fair I named OpenSSL (and thus
One of the problems this intends to solve is that our current method is not at all human readable. I'm not sure a scenario where many users are requesting various module authors to have all files be explicitly mapped to human-readable names is something we would really want.
These problems don't exactly face the same constraints, though. File systems themselves are at the core of this problem, and we would be wise to consider that abstraction when designing an interface around it. I have a strong hunch that if we asked users whether they would prefer A) the ability to use unicode file names for their modules, or B) the ability to untar a distribution directory into its install location and have it largely Just Work, users would choose B. Remember, option A really does preclude option B, because the files still have to be extracted from a tar file, git repository, etc onto the potentially problematic file system before zef or rakudo can rename them.
Even if this is technically true, it also seems a bit off. For all intents and purposes the repository format has indeed changed. In the future, when the repository format needs to change again, it seems like it would need to know what state the repository is actually in to do a meaningful upgrade, but it won't know if it's using the flat directory format, this new proposed format, or some mix of both.
When considering how Unicode file names should work, think of how to solve this workflow:

1. Download a distribution as a single file (a tar archive, a git checkout, etc).
2. Extract it onto the file system.
3. Let Raku/zef do something with the extracted files.
By the time we've reached 3 it is already too late - the archive has been extracted, but the file does not exist. There is no point where raku can give it an alternative name before it touches the file system for the first time... or rather, no core-friendly way. https://github.com/ugexe/Raku-CompUnit--Repository--Tar (which S22 also references) can actually do this by extracting single files to stdout and piping that data into a file path that raku provides. But I'm not sure every version of tar supports this, nor would I suggest something in the core that shells out to e.g. `tar`.
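The extract-to-a-path-the-loader-chooses idea can be sketched without shelling out, using Python's stdlib `tarfile` purely as a stand-in (this is not the `CompUnit::Repository::Tar` implementation): read each member's bytes and write them wherever the repository decides.

```python
import tarfile

def extract_with_renames(tar_path: str, choose_path) -> None:
    """Write every regular file in the archive to the path returned by
    choose_path(original_name), so the original (possibly Unicode or
    case-colliding) name never touches the file system."""
    with tarfile.open(tar_path) as tf:
        for member in tf.getmembers():
            if member.isfile():
                data = tf.extractfile(member).read()
                with open(choose_path(member.name), "wb") as out:
                    out.write(data)
```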
I only have a shallow understanding, so apologies if I've made incorrect assumptions. How about a solution that retains SHA install naming but adds a file-system layer of links/junctions satisfying the human desire for meaningful names? The meaningful links/junctions could sit in a distinct hierarchy - i.e. not be mixed directly into the install folders. Perhaps tooling could create/maintain a human-meaningful shadow hierarchy on the file-system from an existing installation. This restricts file-system shortcomings to the representational side and works safely alongside the universal SHA naming. There would be nothing preventing several shadow representations existing simultaneously - ASCII, English, French, Kanji - all linking to the same SHA hierarchy. On the flip side - add META entries for files/folders where install should create links. Example: files in this folder should be installed as usual (SHA naming), and then representational links created in [non-clashing install location]. If a supplied file (.DLL?) cannot be linked but requires install at a fixed place on the file-system, you likely have a significant chance of security/crash problems - so I'm not sure if this is worth supporting. SHA-then-link isolates representation issues to the link layer on non-supporting filesystems, and failures can be reported at attempted install.

Separately: IMHO Unicode module and file naming is important to developers, and should be easy. I expect filesystems will add Unicode support over time, and feel that it would be a step backwards to encourage non-Unicode in core. I'd rather see a Unicode / case-sensitive approach that errors meaningfully when a file-system has limitations.
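A minimal sketch of that shadow hierarchy (Python; all names invented; assumes a manifest from hashed names back to meaningful ones, and an OS/user able to create symlinks):

```python
import os

def build_shadow(install_dir: str, shadow_dir: str, manifest: dict) -> None:
    """Create human-readable symlinks in a separate tree, leaving the
    SHA-named install layout untouched. Several shadow trees (ASCII,
    French, Kanji, ...) can point at the same install."""
    for sha_name, nice_name in manifest.items():
        link = os.path.join(shadow_dir, nice_name)
        os.makedirs(os.path.dirname(link), exist_ok=True)
        if not os.path.lexists(link):
            os.symlink(os.path.join(install_dir, sha_name), link)
```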
@jaguart even if we implement that level of complexity, how could it solve the workflow I outlined in my previous comment? For a large percentage of people those files can't get onto the file system in the first place to even begin creating sha1 files with their data. |
Maybe I'm the one misunderstanding. In that example, why would you want to be able to use the original file names with
Yeah, that's exactly the point I was trying to get at by drawing a distinction between programs that directly link to non-Raku code and those that only link indirectly (that is, because a dependency does the actual linking). For a Raku program that depends on
I don't share that hunch. I agree that, if we're pulling from current Raku users, there wouldn't be a huge contingent of people clamoring for option A. But I expect/hope that will change as Raku becomes more international and utf-8-everywhere becomes more of a reality. But my hunch is that the group insisting on option B would be even smaller. I don't, generally speaking, expect that process to work for any software; instead, I expect that I'll need to install the software in whatever way is customary for that software/ecosystem (e.g.,

@jaguart wrote:
Agreed.
It is essentially the only workflow to install modules. What is the alternative workflow to do so that isn't based on shelling out to a system dependent outside program (tar, etc)? To install a distribution (that isn't already on the file system) you download a single file in some way (tar file, git, etc). Then you have to extract it. Then Raku can do something. If the file can't be extracted on a given file system, and Raku lacks core support for whatever archive algorithm is used, then there is no reason for Raku to try to make it work on that system because it can never reach that point in the first place. In other words - we can support those Unicode filenames but distributions using them still can't be saved/extracted (and thus installed) for any practical purposes on the systems we implement the sha1-ing for in the first place. And indeed for systems where they can e.g. extract a unicode name to the file system, we don't have to do anything extra for Raku to support it with
I want to point my reverse proxy at a directory (potentially an installed distribution's resources directory) and have it serve the files there under their original names (since the html files in that distribution would reference the original file names, similar to if it were being loaded by CURFS).
No, to install a distribution I type `zef install Foo` 😁

That's a somewhat flippant answer, but it gets at a more serious point: I don't see any problem with having Zef (or some other tool) be the "blessed" way to install Raku packages and saying that other installation methods may require more work. In fact, I'd bet that pretty much the only people who might want to install Raku packages without Zef are package maintainers for Linux distros (or BSDs, I guess). And those folks are both ① unlikely to have difficulty with Unicode and ② familiar enough with lower-level tooling.

If we start from the perspective that Zef is the way regular users install Raku programs, then the problem gets easier. Instead of needing to make a workflow easy for everyone, we just need to make it hard-but-possible for Zef to be able to extract the contents of an archive regardless of filesystem constraints. And you've explained why that poses challenges when the archive contains files with names that the OS considers illegal. Indeed, you might be correct that there's no way to do this without shelling out to `tar`.

Where I disagree is with the idea that we'd need that support in core. Since the goal is "only" to enable Zef to install packages, it seems like we could have whatever support we need in user land, and Zef could depend on that. And, of course, that distribution wouldn't use any Unicode module names.

I realize that this might seem like an "a simple matter of programming"™ type suggestion. But I'm describing what (IMO) it makes sense to aim for in the longer term. In the near/medium term I personally don't have an issue with shelling out to `tar`.
But why would the HTML files be written to reference the original file names? That's what I'm not understanding. If the HTML files are generated by the Raku distribution, then (IMO) that distribution should be able to generate them with names pointing to the installed files. If they're external to the Raku distribution, then I should be able to edit them to point to the installed files. I'm just not understanding why the name of the source-code file (as opposed to the name of the installed file) should ever need to be in my HTML. (I feel like I might be missing something basic here; my apologies if I'm being dense) |
Behind the scenes this downloads a tar file, uses git, or downloads that distribution in some way.
This requires name resolution to happen deterministically which should be in core since zef isn't responsible for module loading, or at least exist in some capacity within the CUR* - right now everything is SHA'd in CURI but not the others.
What ugexe is trying to say is that if the OS can't handle unicode names for files then this is where zef's agency ends: it can't shell out to tar or gzip, and it can't continue installation in the current state. The suggested fix is creating a CUR that can handle this mutation in both extraction and resolution - not that the CUR needs to handle the extraction itself, but that it can deterministically determine what a name would have been mutated to. This is the key.
The rub is when rakudo attempts to load/resolve unicode file names on a non-unicode file system. |
Yeah, I get that of course (hence the grin). What I was trying to get at is that, since this is done behind the scenes, it's fine for it to be hard-but-possible, which is much easier than trying to come up with a solution that fits into the workflow for typical uses. You know, the typical "torment the implementer" sort of thing…
I don't follow this. I understood @ugexe to have said (in a previous comment) that Zef could handle that situation by using the approach taken by `CompUnit::Repository::Tar`.
Because inside of the html the files are referenced by their original names.
I'm not sure how that can work with the various nativecall distributions that use e.g. |
I understand that part. But what I don't understand is why the author of the distribution wouldn't just put

None of that is to say that there couldn't be a situation in which someone really wants to have
Yeah, I can see how that'd be an issue. But, as in the OpenSSL case, nativecall distributions tend to be pretty low-level and written by fairly experienced Rakoons. And, almost by necessity, they deal with OS-specific issues. So I wouldn't mind a solution that requires nativecall-distribution developers to avoid Unicode filenames when they're targeting non-Unicode-supporting OSs. Or one that required them to add a field to their `META6.json`.
Because it then would not work when loaded by CURFS (or some other external repository class that doesn’t use sha1). The sha1 is an implementation detail of a specific repository type. |
Tar can handle it since it's just bytes in a file and does not necessarily need to be extracted anywhere. In this way tar is handling the mutation that needs to happen to the filenames (by making it unnecessary).
Oh, there is another issue with installing directly from tar files: the tests themselves must be extracted (and that says nothing of any test modules included in t/). It would be strange if different files had vastly different naming rules. |
OK, having reflected on what you've both said, you've convinced me that there are valid (and possibly/likely insurmountable) technical reasons not to implement the approach I was suggesting above. However, I continue to believe that "supporting Unicode file names" is an important design goal for Raku and is only going to be more important in the future – both as programming in general becomes less US-centric and as Raku specifically increasingly targets/supports non-English use cases. If we're at all serious about the whole "100-year language" thing, then moving away from Unicode support seems like a decision we're likely to regret. I have another thought about how we could support Unicode filenames without running into the difficulties above, which I'll write up in a separate comment later today.
Here are my revised thoughts, having slept on the issue:

@ugexe's proposal, as I understand it, is to implement

To mitigate this issue, @ugexe further suggests that

Here's my proposal: instead of giving users a warning, let's fix the problem for them. That is,

When I mentioned something like this upthread, @ugexe replied by pointing out that some distributions might be packaged without

I believe that a well-designed process for renaming in

Below, I'll sketch out one specific way this renaming scheme could be implemented. But, even if you don't like the specific implementation I suggest below, please also consider the general idea of having

The general approach of this implementation is to stick with
I believe that this implementation arguably solves all 4 of the problems listed in @ugexe's OP – or, less generously, it solves 3½ of them.

Since this approach uses

This approach arguably doesn't solve the "human-readable filenames" problem – the filenames are still hashes. However, extending

Additionally, this proposal leaves people who package distributions without

I look forward to hearing thoughts, both on the specific implementation I described above and on the more general idea of having
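On the human-readability point, one hedged sketch (Python, with invented names; `CURI`'s real scheme is more involved) is to record the hash-to-original mapping at install time, so tooling can answer "what is this sha1 file?" without the user deciphering it:

```python
import hashlib
import json

def install_name(original: str) -> str:
    """ASCII-safe, case-collision-free on-disk name (sha1, CURI-style)."""
    return hashlib.sha1(original.encode("utf-8")).hexdigest()

originals = ["lib/Foo.rakumod", "resources/données.html"]
manifest = {install_name(n): n for n in originals}

# A locate-style tool could print this map instead of bare sha1s.
print(json.dumps(manifest, ensure_ascii=False, indent=2))
```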
fwiw it might be a few days before I have time to read and digest the latest replies |
This issue is similar to the one faced by the Raku documentation suite. Here is a solution that seems to work for the documentation suite.

In the suggestions above, it seems (from my understanding of the above) that

Because

But it seems to me that module names, which I think are in the

Suppose we stipulate that a packaged Raku distribution contains a META6 file and a file system map (e.g. zef would then have its way of mapping files, and others such as debian might have another). The only requirement would be a file

Resource file names would not be renamed.
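A sketch of what such a file-system map could contain (everything here is invented, including how the map file would be named, which the comment above leaves unspecified):

```python
# Hypothetical map shipped next to META6.json: source names on the
# left, packager-chosen on-disk names on the right. Resource files are
# deliberately absent, matching the "not renamed" rule above.
file_map = {
    "lib/Fünf.rakumod": "lib/F_nf.rakumod",
}

def on_disk(name: str) -> str:
    """Identity for unmapped names, e.g. resources."""
    return file_map.get(name, name)

assert on_disk("lib/Fünf.rakumod") == "lib/F_nf.rakumod"
assert on_disk("resources/logo.png") == "resources/logo.png"
```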
`CompUnit::Repository::Lib` (which I'll call `CURL`) is a mix between `CompUnit::Repository::FileSystem` (which I'll call `CURFS`, in that it uses the same folder/file naming structure everyone is used to) and `CompUnit::Repository::Installation` (in that it allows for multiple distributions). The structure of the data on the file system is one directory per distribution, with the distribution's files kept underneath it with their original names. This solves a few of the issues `CompUnit::Repository::Installation` (which I'll call `CURI`) was created to solve, and it avoids trying to solve a few (arguably less important) others.

**`CURI` and `CURL` Solve**

Installing multiple versions of a distribution is the primary problem that needed to be solved, and it speaks for itself.

Kind of tied into the "multiple versions" problem is needing to be query-able. For instance, when two different versions of a given module are installed and someone does `use Foo;`, the repository needs to be able to pick the proper one to be loaded (as no version was explicitly declared). Additionally there is the issue of multiple versions of bin scripts: to have a single `PATH` entry for a given repository, the repository itself needs to be able to query itself to find e.g. the highest-versioned bin script to load.

**`CURI` Solves, `CURL` Doesn't Solve**

Unicode file names are a hard problem, and I'd argue `CURI` only solves a small part of it. Indeed, the hashing of file names means `CURI` can create module files that can be used via a unicode name. But `CURI` has to get the files (and the data they contain) from the file system - you have to download and extract the given distribution to your file system before `CURI` can install it, and that isn't going to work right if those files are named in a way that doesn't work with a given file system. `git` doesn't solve this problem either: if you try to clone a repo that doesn't map correctly onto your file system it will give you a warning (and your `git diff` might show one file containing the data of another similarly-cased file).

(Technically something like `Distribution::Common::Remote` can be passed to `CURI` such that the files to be installed don't need to be extracted to the file system first, but that would exclude anything that uses a build step / `Makefile`, and anything that depends on something that uses a build step / `Makefile`. And currently there isn't a way to tell from meta data alone whether a build step needs to occur for an arbitrary distribution, so strategically using that isn't a good option in my mind.)

Renaming files (like how `CURI` renames things to their sha1 on installation) also breaks some things. Notably the `OpenSSL` dlls don't work when they have been renamed, but also web stuff that may want to put assets in `resources` and reference the files in html by their original names.

**`CURI` Doesn't Solve, `CURL` Solves**

As previously mentioned, renaming files can break e.g. dlls on windows and make referencing relative resource file paths in html/javascript difficult. `CURL` doesn't have this problem, as files retain their original names.

Users have a hard time understanding what is actually inside a repository full of sha1s. `CURL` does still use a sha1 to create the root directory of each distribution, but it doesn't have to, and even with that being the case it is relatively easy to find what is inside each of the directories as, again, `CURL` files retain their original names.

I suspect users would think the benefits (human-readable installed files, easier integration with non-raku artifacts, e.g. html and dlls) outweigh the drawbacks (can't theoretically install a distribution that contains both `Foo.rakumod` and `foo.rakumod` -- or one named with e.g. unicode characters -- on certain file systems).

**Problems?**

`CURL` currently greps each directory in its prefix and lazily reads each `META6.json` until it finds the distribution it needs to load. It should probably use an index on module short names, similar to `CURI`.
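The short-name index suggested for `CURL` could start as small as this sketch (Python for illustration, not Raku; it assumes the one-`META6.json`-per-distribution-directory layout the issue describes):

```python
import json
import os

def build_index(prefix: str) -> dict:
    """Map each module short name to the distribution directories that
    provide it, replacing the grep-every-META6.json lookup."""
    index = {}
    for dist_dir in sorted(os.listdir(prefix)):
        meta_path = os.path.join(prefix, dist_dir, "META6.json")
        if not os.path.isfile(meta_path):
            continue
        with open(meta_path, encoding="utf-8") as f:
            meta = json.load(f)
        for short_name in meta.get("provides", {}):
            index.setdefault(short_name, []).append(dist_dir)
    return index
```

The index would be rebuilt on install/uninstall and consulted on `use`, similar in spirit to what `CURI` already does with its own indexes.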