Add CompUnit::Repository::Lib (or something like it) to core #386

Open
ugexe opened this issue Sep 24, 2023 · 35 comments
Labels
language Changes to the Raku Programming Language

Comments

@ugexe
Contributor

ugexe commented Sep 24, 2023

CompUnit::Repository::Lib (which I'll call CURL) is a mix between CompUnit::Repository::FileSystem (which I'll call CURFS; like CURFS it uses the same folder/file naming structure everyone is used to) and CompUnit::Repository::Installation (in that it allows for multiple distributions). The structure of the data on the file system looks like:

<some unique prefix 1>/META6.json
<some unique prefix 1>/lib/Foo.rakumod
<some unique prefix 2>/META6.json
<some unique prefix 2>/lib/Bar.rakumod

This solves a few of the issues CompUnit::Repository::Installation (which I'll call CURI) was created to solve, and it avoids trying to solve a few (arguably less important) others.


CURI and CURL Both Solve

  • Multiple versions of the same distribution

This is the primary problem that needed to be solved, and it speaks for itself.

  • Query-able

Kind of tied into the "multiple versions" problem is needing to be query-able. For instance, when two different versions of a given module are installed and someone does use Foo;, the repository needs to be able to pick the proper one to load (as no version was explicitly declared). Additionally there is the issue of multiple versions of bin scripts -- to have a single PATH entry for a given repository, the repository needs to be able to query itself to find e.g. the highest-versioned bin script to load.
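For illustration, a bare use leaves that query entirely to the repository, while the caller can also constrain it (the version/auth values below are made up):

# With two versions of Foo installed, a plain `use Foo;` asks the repository
# chain for the best candidate (normally the highest version). The caller can
# also narrow the query explicitly:
use Foo:ver<1.2+>:auth<zef:someone>;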

CURI Solves, CURL Doesn't Solve

  • Unicode file names, case insensitive file systems

This is a hard problem, and I'd argue CURI only solves a small part of it. Indeed the hashing of file names means CURI can create module files that can be used via a unicode name. But CURI gets the files (and the data they contain) from the file system - you have to download and extract the given distribution to your file system before CURI can install it, and that isn't going to work right if those files are named in a way that doesn't work with a given file system. git doesn't solve this problem either: if you try to clone a repo that doesn't map correctly onto your file system it will give you a warning (and your git diff might show one file containing the data of another similarly cased file).

(Technically something like Distribution::Common::Remote can be passed to CURI such that the files to be installed don't need to be extracted to the file system first, but that would exclude anything that uses a build step / Makefile and anything that depends on something that uses a build step / Makefile. And currently there isn't a way to tell from just meta data whether a build step needs to occur for an arbitrary distribution, so strategically using that isn't a good option in my mind.)

Renaming files (like how CURI renames things to their sha1 on installation) also breaks some things. Notably the OpenSSL dlls don't work when they have been renamed, and the same goes for web projects that want to put assets in resources and reference the files in html by their original names.

CURI Doesn't Solve, CURL Solves

  • As previously mentioned, renaming files can break e.g. dlls on windows and make referencing relative resource file paths in html/javascript difficult. CURL doesn't have this problem as files retain their original names.

  • Users have a hard time understanding what is actually inside a repository full of sha1s. CURL does still use a sha1 to create the root directory of each distribution, but it doesn't have to, and even then it is relatively easy to find what is inside each directory as, again, CURL files retain their original names.


I suspect users would think the benefits (human readable installed files, easier to integrate with non-raku languages e.g. html and dlls) outweigh the drawbacks (can't theoretically install a distribution that contains both Foo.rakumod and foo.rakumod -- or is named with e.g. unicode characters -- on certain file systems).

Problems?

  • CURL currently greps each directory in its prefix, and lazily reads each META6.json until it finds the distribution it needs to load. It should probably use an index on module short names similar to CURI.
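For reference, that scan looks roughly like the following (illustrative only, not CURL's actual code; find-dist is an invented name):

use JSON::Fast;

# Walk every distribution directory under the repository prefix and lazily
# read its META6.json until one provides the requested module.
sub find-dist(IO::Path $prefix, Str $short-name) {
    $prefix.dir.grep(*.d).first(-> $dist-dir {
        my $meta-file = $dist-dir.add('META6.json');
        $meta-file.e and (from-json($meta-file.slurp)<provides>{$short-name}:exists)
    })
}

# An index keyed on module short names (one lookup instead of a scan over
# every distribution) is what CURI maintains and what CURL would need too.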
@ugexe ugexe added the language Changes to the Raku Programming Language label Sep 24, 2023
@codesections
Contributor

Is the idea for CURL to replace CURI entirely, or would CURI still be available for situations where CURL won't work?

I ask because of the second drawback you mentioned: that users can't install Unicode-containing files on some file systems. I'd personally view that as a fairly large point against the idea. Given that many file systems do support Unicode characters and that many languages use non-ASCII characters fairly regularly, it's pretty easy to imagine a module author using a file name that works for them but fails for other users. And that'll only be increasingly true if some of Raku's more ambitious internationalization plans work out (of the sort discussed at the core summit, I mean).

And, anyway, both S22 and the CompUnit docs make a fairly big deal out of Raku's support for Unicode file names. So, at the very least, it's not something we should give up lightly.

Would there be some way to mangle/normalize the file names such that they're still human readable (at least vaguely) but that avoids breaking on non-Unicode-supporting file systems?

@ugexe
Contributor Author

ugexe commented Sep 24, 2023

Is the idea for CURL to replace CURI entirely, or would CURI still be available for situations where CURL won't work?

CURI would still be available. However it would not be as a sort of fallback for when CURL won't work, but just because a bunch of stuff already uses it (slow moving stuff, like packagers).

And, anyway, both S22 and the CompUnit docs make a fairly big deal out of Raku's support for Unicode file names.

I agree that in theory it is a great idea. In practice it hasn't really been used, and the way we allow it to happen at all has significant drawbacks. It is only theoretically possible in a very specific situation: installing a distribution using a custom Distribution that doesn't extract files to the file system and doesn't have any build steps. There isn't really a way for a language to extract a given downloaded distribution somewhere to be installed if the file system is not capable of representing those files.

Would there be some way to mangle/normalize the file names such that they're still human readable (at least vaguely) but that avoids breaking on non-Unicode-supporting file systems?

Punycode is probably the best thing I can think of, but that doesn't handle the arguably bigger issue of case sensitivity. I'm not sure there is a way to have all of: human readable, unicode compatible, case sensitivity. Regardless, CURI does not preclude the use of some theoretical name normalization.

@ugexe
Contributor Author

ugexe commented Sep 24, 2023

Something else worth mentioning is that unicode module names can be used with CURL; it's only unicode file names that CURL doesn't support. This means that if files are mapped using what is in a given META6.json (instead of naively concatting $module-name to lib/) -- and in the META6.json it maps to some non-unicode path -- you can still refer to the module by its unicode name as you'd expect.
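For example, a hypothetical provides entry that maps a Unicode module name onto an ASCII-only file path might look like:

{
  "provides": {
    "Résumé::Builder": "lib/Resume-Builder.rakumod"
  }
}

With a mapping like that, use Résumé::Builder; still resolves as expected while the on-disk name stays file-system friendly.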

@codesections
Contributor

Something else worth mentioning is that unicode module names can be used with CURL; it's only unicode file names that CURL doesn't support.

That kind of makes it more of a footgun, though – if authors see unicode module names used by more experienced Rakoons, and then use unicode when naming their own modules (not realizing that they need a META6 mapping), then they'll have a module that works perfectly for them but breaks on different OSes.

But I like the overall idea of CURL and believe that we ought to be able to come up with a normalization scheme that is still mostly human-readable without sacrificing unicode support/case sensitivity.

How about this scheme, off the top of my head:
1. Any ASCII lowercase letter is left as-is
2. Any ASCII uppercase letter is preceded by a _
3. Any other character is replaced by a _ followed by that character's decimal Unicode value

So ResuméBuilder2 would become _resum_233_builder_50. That isn't 100% human-readable, but it's close enough that you'd be able to tell what module it meant. And I think that'd work on pretty much any file system. What do you think?
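A minimal sketch of that scheme (mangle-name is just an illustration, not an existing or proposed API):

sub mangle-name(Str $name --> Str) {
    $name.comb.map(-> $c {
        $c ~~ /^<[a..z]>$/ ?? $c                # ASCII lowercase kept as-is
        !! $c ~~ /^<[A..Z]>$/ ?? '_' ~ $c.lc    # uppercase: _ plus the lowercased letter
        !! '_' ~ $c.ord                         # anything else: _ plus its decimal codepoint
    }).join
}

say mangle-name('ResuméBuilder2');   # _resum_233_builder_50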

But, again, I like the general idea of moving to CURLs or something like them.

@ugexe
Contributor Author

ugexe commented Sep 24, 2023

I think ideally one could just extract an archive of a boring distribution somewhere and have it work. If we have to normalize 99% of all module filenames then I'm not sure it's a great alternative. And even then it doesn't avoid the issue regarding the naming of files inside of resources/ (the dll problem, and the html files issue in particular).

I kind of think the best we can do is to allow ecosystems to warn users against distributing code that isn't very system independent. For instance fez might warn users when they try to upload a module that has a Foo.rakumod and foo.rakumod, or when a user uses unicode in a file path. Raku would still let users use unicode file names on their own system if they want, but what gets distributed (and thus has a higher expectation of being written to work on other systems) is enforced by a given authority's policy.

@codesections
Contributor

IMO, supporting Unicode filenames is pretty key to meaningfully supporting Unicode module names and, in turn, supporting Unicode module names is a core goal of Raku's whole approach to Distributions (it's first in the list of reasons for Raku's system in that docs page I linked earlier, for instance). And I think it'd be a shame to give up on that goal.

It also seems to me that fez or zef should be responsible for mapping existing, human-friendly names into names that work across platforms (maybe in a step that occurs before installation; maybe even at upload?) instead of asking users to do that mapping manually in their META6.json file. And when it comes to resources, I'm OK with a system that prevents people from referring to them by their original file name – so long as we have easy/well-documented methods that let them map from their source filename to their location. After all, in shell scripting we can write wc --lines /path/to/file but we can't write lines '/path/to/file' in Raku – we have to convert from the file name to an actual file, with something like lines '/path/to/file'.IO. Requiring devs to put the resources equivalent of that in HTML files doesn't strike me as too bad, especially since forgetting will generate an immediate and obvious error.

But all that's just my ¢2, and I know you've thought more deeply about this issue than I have. So, if no one else chimes in, I'm happy to defer to your judgment on this one.

@ugexe
Contributor Author

ugexe commented Sep 24, 2023

IMO, supporting Unicode filenames is pretty key to meaningfully supporting Unicode module names

To me, not having to map Unicode file names in a META6.json is a pretty low bar to support this type of feature. The fact we even make it possible to have a Unicode module name is in line with making hard things possible. We don't need to make hard things easy, just possible. We even allow the user to just use the Unicode file name on their own system if they want.

It also seems to me that fez or zef should be responsible for mapping existing, human-friendly names into names that work across platforms

In a way that means every package system (zef, apt, etc) would be free to come up with their own complicated logic to do this. Users will still want to know how to get the mangled file name (similar to how users still want to get the sha1 file names even though they shouldn't and even though we supply users with all the tools to do things The Right Way), but they'll have no way of knowing which scheme any two distributions are using. Taken to the extreme one could say zef should ignore the META6.json file entirely (or rather just generate what it thinks is an appropriate META6.json at build time) to always do what is probably expected, but doing what is probably expected is not ideal for something with security implications. To some degree module authors have to be explicit, strict, etc if they want their code to work outside of their own systems.

maybe in a step that occurs before installation; maybe even at upload?

I think it would be a good idea to notify/warn/error the user at upload. Modifying the distribution at upload or after download, not so much (after all the uploader should be able to know the checksum before it is uploaded). Having fez/mi6 handle the Unicode file name in the META6.json at authoring time is also logical. But regarding installation time... how can e.g. zef install https://github.com/foo/bar.git work if it contains Unicode file names? As soon as the repository is cloned on a file system that doesn't support Unicode or case sensitivity the file will be lost - there is no chance to rename the actual file so it can exist on the file system.

I'm OK with a system that prevents people from referring to them by their original file name

But then we admit we can't support non-raku code that won't work with a different name. This is not an acceptable workaround, it is just the only workaround.

so long as we have easy/well-documented methods that let them map from their source filename to their location

Maybe I'm misunderstanding, but how would someone do this for html and javascript files practically? In a production environment you don't want to serve these type of files by going through some raku code to map the names, you want to let your e.g. reverse proxy just handle all your static content from some directory directly, which means you need to access them via the names as they exist on the file system. Even getting the names at runtime is probably going to be impossible unless it is built into Raku itself, because each packaging system would have their own methods of doing this (which means we would have distributions depending on a specific package manager).

But all that's just my ¢2, and I know you've thought more deeply about this issue than I have. So, if no one else chimes in, I'm happy to defer to your judgment on this one.

I have more thoughts on this than I'm capable of writing up unprompted in an initial github issue, so I'm happy (and would expect) to continue addressing people's concerns. Removing a feature is not an easy proposition to make.

@vrurg
Contributor

vrurg commented Sep 24, 2023

but what gets distributed (and thus has a higher expectation of being written to work on other systems) is enforced by a given authority's policy.

I'm not even sure if it must be enforced. A (suppressible) warning would be sufficient. Say, the distribution could purportedly be designed for a limited set of systems where Unicode is supported.

Besides, installation can reject a distribution, with a descriptive error message, if it finds out that not all file names are suitable for the local file system. (...saying nothing about too long paths on Windows...)

@ugexe
Contributor Author

ugexe commented Sep 25, 2023

if it finds out that not all file names are suitable for the local file system.

Yeah, and unfortunately I'm not sure there is a good way to do that outside of rakudo itself. A naive way would be for some program to try to write these various files, see what works and what does not, and use that knowledge to generate such warnings before zef passes the distribution to rakudo for the actual installation of files. But that would have to be done per volume/device/mount, since e.g. two directories can be pointed at different file systems (and even then could change after-the-fact, so any "database" of this info is liable to become stale). Basically such a rejection would have to come from CURI.install(...) itself after it discovers it failed to create a file that is not accessible by its stated file name. I agree that would be a good thing to have.

@niner

niner commented Sep 25, 2023

I think the problems you listed for CURI can be solved within the implementation of CURI without requiring a full replacement. It's just that no one has given it a try so far. E.g. CURI can easily be changed to use subdirectories for keeping the files of multiple distros apart and use pure ASCII names as-is. It can even go a step further and simply test whether it can write a non-ASCII file name as-is and read it back. Nothing in CURI's interface requires it to rename all files or keep them in the current structure. The changes I mentioned can be done while retaining full backwards compatibility with existing installations.
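A minimal version of that write-and-read-back probe might look like this (sketch only; can-store-name is an invented name):

sub can-store-name(IO::Path $dir, Str $name --> Bool) {
    my $probe = $dir.add($name);
    my $ok = False;
    {
        $probe.spurt('probe');
        # Succeeds only if the file can be created and read back under
        # exactly the name we asked for.
        $ok = $probe.e && $probe.slurp eq 'probe';
        CATCH { default { $ok = False } }
    }
    try $probe.unlink;
    $ok
}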

@ugexe
Contributor Author

ugexe commented Sep 25, 2023

Nothing in CURI's interface requires it to rename all files or keep them in the current structure.

This is only true in theory. In practice it breaks custom repository locations by hard-coupling them to a specific rakudo version or higher. To explain for those who don't know, CURI has a repository upgrade mechanism for changing the files/layout of a repository when building rakudo. So let's pretend CURI is updated to use this new format and I load some code with this new rakudo via raku -I /my/custom/libs -e 'use My::Custom::Libs' and see it works. Then you try to do rakubrew switch $some-previous-raku-version && raku -I /my/custom/libs -e 'use My::Custom::Libs' and suddenly nothing works, because the previous version of rakudo does not know anything about the new repository format. This same workflow (which I was using regularly) was broken for me the last time the upgrade mechanism was used.

So not only is updating CURI significantly more work -- trying to maintain backwards compatibility over every tiny detail, something I'm not even sure is practically possible anymore (only technically possible) -- but even done correctly it will break some existing valid workflows.

@vrurg
Contributor

vrurg commented Sep 25, 2023

Aren't new subdirectories created for a new Rakudo version in custom locations? Ignore, I mixed this up with precomps.

@niner

niner commented Sep 25, 2023

Fixing these issues in CURI will most likely not require a change in repository version, as we record the path to the stored file for every internal name. Old rakudo versions would just follow that path and not care whether it's a SHAed file name or a directory + original name.

@codesections
Contributor

One preliminary response, and then something that gets more at the heart of the issue. First, on the specific point:

how would someone do this for html and javascript files practically? In a production environment you don't want to serve these type of files by going through some raku code to map the names, you want to let your e.g. reverse proxy just handle all your static content from some directory directly, which means you need to access them via the names as they exist on the file system.

Right now, if I have a $path that I'd like to reference in my nginx.config, I would get Raku to tell me how to do so with .canonpath or similar. What I'm suggesting is that installed files should work similarly, with a convenience method that would display their exact path to let non-Raku code point to them.
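Roughly this sort of thing, presumably (a sketch; whether the convenience method ends up being .canonpath, .resolve, or something repository-aware is exactly the open question):

my $path = 'static/css/../img/logo.png';
say $*SPEC.canonpath($path);      # cleaned-up path string to paste into an nginx config
say $path.IO.resolve.absolute;    # or the fully resolved absolute path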

Now on to the more general point:

The fact we even make it possible to have a Unicode module name is in line with making hard things possible. We don't need to make hard things easy, just possible. We even allow the user to just use the Unicode file name on their own system if they want.

Thanks, that's a really helpful comment – it clarifies how our perspectives differ. In my view, "giving a module the name I want" should be in the easy-things-should-be-easy category. I'm thinking partly of @finanalyst's work to create non-English versions of Raku (via slangs) or @alabamenhu's work to support multi-lingual error messages. But, even setting those projects aside, "naming a module in my native language" strikes me as something that we should make easy – pretty much everyone names modules, after all. And, of course, for pretty much any non-English speaker, using names from their native language requires Unicode support at least some of the time.

Conversely, I'm pretty willing to put "dynamically linking against a non-Raku program that requires a static name" in the hard-things-should-be-possible category. I'd venture a guess that the vast majority of Raku programs don't directly link against any non-Raku code, much less any that requires a static name. Of course, many more indirectly do so, but that's kind of my point: the interface between Raku and non-Raku code tends to be at the library level and, IMO, it's reasonable to expect library authors to do the hard thing of supporting linking via a static name.

Given that perspective, I wonder whether we could solve the issues with CURI from the other direction. What if we keep the current default (filename based on a hash) but allow module authors to specify a static filename in their META6.json (as a map of source-file-name → installed-file-name)? That way, anyone who needs a static name can have one (but bears the responsibility for naming it in a way that works on all target file systems). And anyone who doesn't need a static name gets full Unicode support.

@ugexe
Contributor Author

ugexe commented Sep 25, 2023

I would get Raku to tell me how to do so with .canonpath or similar

Maybe I'm misunderstanding, but to clarify I'm talking about an installed distribution. The files don't exist with their original file names, so .canonpath isn't going to be useful besides potentially absolutifying the sha1 file name path. It doesn't help me point nginx at a specific distribution's resources directory and reference those files by their original names.

I'd venture a guess that the vast majority of Raku programs don't directly link against any not Raku code, much less any that requires a static name.

To be fair I named OpenSSL (and thus IO::Socket::SSL and anything else that depends on it) specifically. And I would be willing to wager there is far more code written with OpenSSL as a dependency than there are modules using unicode file names (even if you filter down OpenSSL use to windows users only).

That way, anyone who needs a static name can have one

One of the problems this intends to solve is that our current method is not at all human readable. I'm not sure we really want a scenario where many users are asking various module authors to explicitly map all files to human readable names.

I'm thinking partly of @finanalyst's to create non-english versions of Raku (via slangs) or @alabamenhu's work to support multi-lingual error messages

These problems don't exactly face the same constraints though. File systems themselves are at the core of this problem, and we would be wise to consider that abstraction when designing an interface around it. I have a strong hunch that if we asked users if they would prefer A) the ability to use unicode file names for their modules or B) the ability to untar a distribution directory into its install location and have it largely Just Work, users would choose B. Remember, option A really does preclude option B because the files still have to be extracted from a tar file, git repository, etc onto the potentially problematic file system before zef or rakudo can rename them.

@ugexe
Contributor Author

ugexe commented Sep 25, 2023

Fixing these issues in CURI will most likely not require a change in repository version, as we record the path to the stored file for every internal name. Old rakudo versions would just follow that path and not care whether it's a SHAed file name or a directory + original name.

Even if this is technically true, it also seems a bit off. For all intents and purposes the repository format has indeed changed. In the future when the repository format needs to change again it seems like it would need to know what state the repository is actually in to do a meaningful upgrade, but it won't know if it's using the flat directory format, this new proposed format, or some mix of both.

@ugexe
Contributor Author

ugexe commented Sep 25, 2023

When considering how Unicode file names should work, think of how to solve this workflow:

  1. User downloads UnicodeNamedDist.tar.gz
  2. User extracts UnicodeNamedDist.tar.gz to ./UnicodeNamedDist
  3. User goes to install ./UnicodeNamedDist, but precompilation fails because the distribution seems to be missing a file listed in provides

By the time we've reached 3 it is already too late - the archive has been extracted but the file does not exist. There is no point where raku can give it an alternative name before it touches the file system for the first time...

...or rather no core-friendly way. https://github.com/ugexe/Raku-CompUnit--Repository--Tar (which S22 also references) can actually do this by extracting single files to stdout and piping that data into a file path that raku provides. But I'm not sure every version of tar supports this, nor would I suggest something in the core that shells out to e.g. tar. If we had core tar.gz extraction support like golang it could be an alternative option though.
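For reference, the stdout trick looks roughly like this (a sketch that shells out to tar; the member and target names are invented, and -O is the GNU/BSD tar flag for writing a member to stdout instead of creating the file):

my $member = 'UnicodeNamedDist/lib/Fœ.rakumod';   # name as recorded in the archive
my $target = 'lib/ascii-safe-name.rakumod'.IO;    # a name Raku can actually create
my $proc = run 'tar', '-xzOf', 'UnicodeNamedDist.tar.gz', $member, :out;
$target.spurt: $proc.out.slurp(:close);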

@jaguart

jaguart commented Sep 25, 2023

I only have a shallow understanding, so apologies if I've made incorrect assumptions.

How about a solution that retains SHA install naming but adds a file-system layer of link/junction satisfying the human desire for meaningful names? The meaningful links/junctions could sit in a distinct hierarchy - i.e. not be mixed directly in the install folders. Perhaps tooling to create/maintain a human-meaningful shadowed hierarchy on the file-system from an existing installation.

This restricts file-system short-comings to the representational side and works safely alongside the universal SHA naming. There would be nothing preventing several shadow representations existing simultaneously - ASCII, English, French, Kanji - all linking to the same SHA hierarchy.

On the flip side - add META for files/folders where install should create links. Example: files in this folder should be installed as usual (SHA naming), and then representational links created in [non-clashing install location]. If a supplied file (.DLL?) cannot be linked but requires install at a fixed place on the file-system, you likely have a significant chance of security/crash problems - so I'm not sure if this is worth supporting. SHA-then-link isolates issues with representation to the link layer on non-supported filesystems, and can be reported at attempted install.

Separately: IMHO Unicode module and file naming is important to developers, and should be easy. I expect filesystems will add Unicode support over time, and feel that it would be a step backwards to encourage non-Unicode in core. I'd rather see a Unicode / Case-sensitive approach that errors meaningfully when a file-system has limitations.

@ugexe
Contributor Author

ugexe commented Sep 25, 2023

@jaguart even if we implement that level of complexity, how could it solve the workflow I outlined in my previous comment? For a large percentage of people those files can't get onto the file system in the first place to even begin creating sha1 files with their data.

@codesections
Contributor

Maybe I'm misunderstanding, but to clarify I'm talking about an installed distribution. The files don't exist with their original file names, so .canonpath isn't going to be useful besides potentially absolutifying the sha1 file name path. It doesn't help me point nginx at a specific distribution's resources directory and reference those files by their original names.

Maybe I'm the one misunderstanding. In that example, why would you want to be able to use the original file names with nginx? I would think that you'd want to point nginx at the installed file – after all the original file is basically part of the source code and could be changed/deleted at any point.

To be fair I named OpenSSL (and thus IO::Socket::SSL and anything else that depends on it) specifically. And I would be willing to wager there is far more code written with OpenSSL as a dependency than there are modules using unicode file names (even if you filter down OpenSSL use to windows users only).

Yeah, that's exactly the point I was trying to get at by drawing a distinction between programs that directly link to non-Raku code and those that only link indirectly (that is, because a dependency does the actual linking). For a Raku program that depends on IO::Socket::SSL, the developer doesn't need to care at all about how OpenSSL manages to link to dlls on Windows. That's an implementation detail that's abstracted away by the library. Thus, I'm OK with it being a "hard thing"; it only needs to be solved once, at the library level. (Of course, it does need to be a possible thing to solve or else all the dependencies are in serious trouble…)

I have a strong hunch that if we asked users if they would prefer A) the ability to use unicode file names for their modules or B) the ability to untar a distribution directory into its install location and have it largely Just Work, users would choose B.

I don't share that hunch. I agree that, if we're pulling from current Raku users, there wouldn't be a huge contingent of people clamoring for option A. But I expect/hope that will change as Raku becomes more international and utf-8 everywhere becomes more of a reality.

But my hunch is that the group insisting on option B would be even smaller. I don't, generally speaking, expect that process to work for any software; instead, I expect that I'll need to install the software in whatever way is customary for that software/ecosystem (e.g., ./configure; make; make install, or cargo install, or a program-specific wizard on Windows, etc). And I have the same problem with the workflow you mention in the following comment: yes, that workflow isn't well supported. But why is it a common enough workflow that we should prioritize it? This might be slightly flippant, but it seems to me that Raku offers a way to install software and users who don't want to use that way are Doing It Wrong™.

@jaguart wrote:

Separately: IMHO Unicode module and file naming is important to developers, and should be easy. I expect filesystems will add Unicode support over time, and feel that it would be a step backwards to encourage non-Unicode in core.

Agreed.

@ugexe
Contributor Author

ugexe commented Sep 25, 2023

But why is it a common enough workflow that we should prioritize it?

It is essentially the only workflow to install modules. What is the alternative workflow to do so that isn't based on shelling out to a system dependent outside program (tar, etc)?

To install a distribution (that isn't already on the file system) you download a single file in some way (tar file, git, etc). Then you have to extract it. Then Raku can do something. If the file can't be extracted on a given file system, and Raku lacks core support for whatever archive algorithm is used, then there is no reason for Raku to try to make it work on that system because it can never reach that point in the first place. In other words - we can support those Unicode filenames but distributions using them still can't be saved/extracted (and thus installed) for any practical purposes on the systems we implement the sha1-ing for in the first place. And indeed for systems where they can e.g. extract a unicode name to the file system, we don't have to do anything extra for Raku to support it with CURL.

Maybe I'm the one misunderstanding. In that example, why would you want to be able to use the original file names with nginx?

I want to point my reverse proxy at a directory (potentially of an installed distribution's resource directory) and have it serve the files there under their original names (since the html files in that distribution would be written to the original file names similar to as if it was being loaded by CURFS).

@codesections
Contributor

To install a distribution (that isn't already on the file system) you download a single file in some way (tar file, git, etc).

No, to install a distribution I type zef install Some::Raku::Code 😁

That's a somewhat flippant answer, but it gets at a more serious point: I don't see any problem with having Zef (or some other tool) be the "blessed" way to install Raku packages and to say that other installation methods may require more work. In fact, I'd bet that pretty much the only people who might want to install Raku packages without Zef are package maintainers for Linux distros (or BSDs, I guess). And those folks are both ① unlikely to have difficulty with Unicode and ② familiar enough with using tar and other Linux tools in their build process to be willing and able to use the rename-via-stdout method you described.

If we start from the perspective that Zef is the way regular users install Raku programs, then the problem gets easier. Instead of needing to make a workflow easy for everyone, we just need to make it hard-but-possible for Zef to be able to extract the contents of an archive regardless of filesystem constraints. And you've explained why that poses challenges when the archive contains files with names that the OS considers illegal. Indeed, you might be correct that there's no way to do this without either shelling out to tar or implementing at least some level of extraction support in Raku (though I'm not sure how far we'd have to go with that implementation – once we get to the .tar stage (as opposed to the tar.gz stage) we can read the filenames from the tar header).
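For what it's worth, pulling member names out of a plain .tar really is just fixed-offset header reads. A simplified sketch (it assumes an uncompressed archive and ignores GNU long-name extensions):

my $fh = 'dist.tar'.IO.open(:bin);
while my $header = $fh.read(512) {
    # an all-zero name field marks the end-of-archive blocks
    last unless $header.subbuf(0, 100).grep(* != 0);
    my $name = $header.subbuf(0, 100).decode('utf8-c8').subst(/\x00+ $/, '');
    my $size = $header.subbuf(124, 12).decode.subst(/<-[0..7]>/, '', :g).parse-base(8);
    say $name;
    # skip the member's data, which is padded to 512-byte blocks
    $fh.seek((($size + 511) div 512) * 512, SeekFromCurrent);
}
$fh.close;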

Where I disagree is with the idea that we'd need that support in core. Since the goal is "only" to enable Zef to install packages, it seems like we could have whatever support we need in user land, and Zef could depend on that. And, of course, that distribution wouldn't use any Unicode module names.

I realize that this might seem like an "a simple matter of programming"™ type suggestion. But I'm describing what (IMO) it makes sense to aim for in the longer term. In the near/medium term I personally don't have an issue with shelling out to tar. True, it's not pure bootstrapping, but tar is so widely available – and we'd be using such a limited set of features – that it doesn't seem like a large issue, especially if we plan to move away at some point.

@codesections
Contributor

I want to point my reverse proxy at a directory (potentially of an installed distribution's resource directory) and have it serve the files there under their original names (since the html files in that distribution would be written to reference the original file names similar to as if it was being loaded by CURFS).

But why would the HTML files be written to reference the original file names? That's what I'm not understanding. If the HTML files are generated by the Raku distribution, then (IMO) that distribution should be able to generate them with names pointing to the installed files. If they're external to the Raku distribution, then I should be able to edit them to point to the installed files. I'm just not understanding why the name of the source-code file (as opposed to the name of the installed file) should ever need to be in my HTML.

(I feel like I might be missing something basic here; my apologies if I'm being dense)

@tony-o

tony-o commented Sep 25, 2023

No, to install a distribution I type zef install Some::Raku::Code 😁

Behind the scenes this downloads a tar file, uses git, or downloads that distribution in some way.

That's a somewhat flippant answer, but it gets at a more serious point: I don't see any problem with having Zef (or some other tool) be the "blessed" way to install Raku packages and to say that other installation methods may require more work. In fact, I'd bet that pretty much the only people who might want to install Raku packages without Zef are package maintainers for Linux distros (or BSDs, I guess). And those folks are both ① unlikely to have difficulty with Unicode and ② familiar enough with using tar and other Linux tools in their build process to be willing and able to use the rename-via-stdout method you described.

This requires name resolution to happen deterministically, which should be in core (since zef isn't responsible for module loading) or at least exist in some capacity within the CUR* - right now everything is SHA'd in CURI but not in the others.

If we start from the perspective that Zef is the way regular users install Raku programs, then the problem gets easier. Instead of needing to make a workflow easy for everyone, we just need to make it hard-but-possible for Zef to be able to extract the contents of an archive regardless of filesystem constraints. And you've explained why that poses challenges when the archive contains files with names that the OS considers illegal. Indeed, you might be correct that there's no way to do this without either shelling out to tar or implementing at least some level of extraction support in Raku (though I'm not sure how far we'd have to go with that implementation – once we get to the .tar stage (as opposed to the tar.gz stage) we can read the filenames from the tar header).

What ugexe is trying to say is that if the OS can't handle unicode names for files then this is where zef's agency ends. It can't shell out to tar or gzip and it can't continue installation in the current state. The suggested fix is creating a CUR that can handle this mutation in both extraction and resolution - not to say the CUR needs to handle the extraction itself, but that it can deterministically determine what a name would have been mutated to <- this is the key.

Where I disagree is with the idea that we'd need that support in core. Since the goal is "only" to enable Zef to install packages, it seems like we could have whatever support we need in user land, and Zef could depend on that. And, of course, that distribution wouldn't use any Unicode module names.

The rub is when rakudo attempts to load/resolve unicode file names on a non-unicode file system.

@codesections
Contributor

Behind the scenes this downloads a tar file, uses git, or downloads that distribution in some way.

Yeah, I get that of course (hence the grin). What I was trying to get at is that, since this is done behind the scenes, it's fine for it to be hard-but-possible, which is much easier than trying to come up with a solution that fits into the workflow for typical uses. You know, the typical "torment the implementer" sort of thing…

What ugexe is trying to say is that if the OS can't handle unicode names for files then this is where zef's agency ends. It can't shell out to tar or gzip and it can't continue installation in the current state.

I don't follow this. I understood @ugexe to have said (in a previous comment) that Zef could handle that situation by using the approach taken by CompUnit::Repository::Tar, at least if shelling out to tar is acceptable. Did I misunderstand that comment?

@ugexe
Contributor Author

ugexe commented Sep 25, 2023

But why would the HTML files be written to reference the original file names? That's what I'm not understanding. If the HTML files are generated by the Raku distribution, then (IMO) that distribution should be able to generate them with names pointing to the installed files.

Because inside of html like resources/mypage.html you might do something like <img src="myfile.png"> which would work in CURFS, but when loaded from CURI it would 404 because the file would now be called 58F5FC9AA510E61F7A2C619903AEA1C929D9E007.png. Those files aren't generated by the Raku distribution, they are only distributed with it.

@ugexe
Contributor Author

ugexe commented Sep 25, 2023

a simple matter of programming

I'm not sure how that can work with the various nativecall distributions that use e.g. Build.rakumod and/or Makefiles. Those files have to be extracted. If all the files in the archive are extracted then potentially some files fail to get created because the file system doesn't support them. If, somehow, only the files required for the Makefile are extracted then they need to also be re-archived into a new .tar.gz file to be installed (and that is ignoring that the Makefile might need to access the actual Raku module files, leading back to saying everything needs to be extracted). Furthermore, if hooks (as mentioned in S22) are ever implemented, they too would likely require all the files to be extracted pre-install.

@codesections
Contributor

Because inside of html like resources/mypage.html you might do something like <img src="myfile.png"> which would work in CURFS, but when loaded from CURI it would 404 because the file would now be called 58F5FC9AA510E61F7A2C619903AEA1C929D9E007.png.

I understand that part. But what I don't understand is why the author of the distribution wouldn't just put <img src="58F5FC9AA510E61F7A2C619903AEA1C929D9E007.png"> in the HTML file. True, that would require the distribution author to introspect enough to generate that hash, but that seems like a reasonable step to take for production code – using the hashed name clarifies that it is production code and ensures that it points to the correct file. That second point isn't as relevant for a png, but matters more for js/css; indeed, IME many js/css files are already renamed with a hash for cache-busting purposes as part of the build process.

None of that is to say that there couldn't be a situation in which someone really wants to have <img src="myfile.png"> in their HTML. But if that does come up, it seems like the sort of edge case that'd be addressed by letting developers specify that resources/myfile.png should be mapped to myfile.png (and accepting the responsibility for ensuring that myfile.png is a valid filename).

I'm not sure how that can work with the various nativecall distributions that use e.g. Build.rakumod and/or Makefiles.

Yeah, I can see how that'd be an issue. But, as in the OpenSSL case, nativecall distributions tend to be pretty low-level and written by fairly experienced Rakoons. And, almost by necessity, they deal with OS-specific issues. So I wouldn't mind a solution that requires nativecall-distribution developers to avoid Unicode filenames when they're targeting non-Unicode-supporting OSs. Or one that required them to add a field to their META6.json. Or, at least, I'd prefer that they deal with that complexity than that someone's first Raku module runs into an issue because its name includes an umlaut.

@ugexe
Contributor Author

ugexe commented Sep 25, 2023

I understand that part. But what I don't understand is why the author of the distribution wouldn't just put <img src="58F5FC9AA510E61F7A2C619903AEA1C929D9E007.png"> in the HTML file

Because it then would not work when loaded by CURFS (or some other external repository class that doesn’t use sha1). The sha1 is an implementation detail of a specific repository type.

@tony-o

tony-o commented Sep 25, 2023

Behind the scenes this downloads a tar file, uses git, or downloads that distribution in some way.

Yeah, I get that of course (hence the grin). What I was trying to get at is that, since this is done behind the scenes, it's fine for it to be hard-but-possible, which is much easier than trying to come up with a solution that fits into the workflow for typical uses. You know, the typical "torment the implementer" sort of thing…

What ugexe is trying to say is that if the OS can't handle unicode names for files then this is where zef's agency ends. It can't shell out to tar or gzip and it can't continue installation in the current state.

I don't follow this. I understood @ugexe to have said (in a previous comment) that Zef could handle that situation by using the approach taken by CompUnit::Repository::Tar, at least if shelling out to tar is acceptable. Did I misunderstand that comment?

Tar can handle it since it's just bytes in a file and does not necessarily need to be extracted anywhere. In this way TAR is handling the mutation that needs to happen to the filenames (by making it unnecessary).

@ugexe
Contributor Author

ugexe commented Sep 25, 2023

Oh, there is another issue with installing directly from tar files: the tests themselves must be extracted (and that says nothing of any test modules included in t/). It would be strange if different files had vastly different naming rules.

@codesections
Contributor

OK, having reflected on what you've both said, you've convinced me that there are valid (and possibly/likely insurmountable) technical reasons not to implement the approach I was suggesting above.

However, I continue to believe that "supporting Unicode file names" is an important design goal for Raku and is only going to be more important in the future – both as programming in general becomes less US-centric and as Raku specifically increasingly targets/supports non-English use cases. If we're at all serious about the whole "100-year language" thing, then moving away from Unicode support seems like a decision we're likely to regret.

I have another thought about how we could support Unicode filenames without running into the difficulties above, which I'll write up in a separate comment later today.

@codesections
Contributor

Here are my revised thoughts, having slept on the issue:

@ugexe's proposal, as I understand it, is to implement CURL or similar in core, which is incompatible with certain invalid filenames: those with Unicode and those that are a case-insensitive match with another file. This means that distributions containing invalidly-named files would not be installable on some filesystems.

To mitigate this issue, @ugexe further suggests that fez and other conforming ecosystem tools give a warning (or maybe an error) to users who try to upload a file with an invalid name. Users who upload a distribution without using fez or a tool from another conforming ecosystem would not get a warning and thus might upload a distribution that works on their machine but that can't be installed on a different filesystem (and possibly generates confusing error messages).

Here's my proposal: instead of giving users a warning, let's fix the problem for them. That is, fez and other conforming ecosystem tools should rename files before bundling them for upload.

When I mentioned something like this upthread, @ugexe replied by pointing out that some distributions might be packaged without fez (or another conforming tool). For example, distributions packaged by Debian maintainers for installation via apt. And that's entirely true. However, under @ugexe's current proposal, distributions packaged without fez/similar tools would already be uninstallable on some OSes if they included invalid filenames – and, of course, in that case, users wouldn't get a warning.

I believe that a well-designed process for renaming in fez could leave non-fez users in exactly the same situation (their distribution will work across OSes only if it doesn't contain invalid filenames). And it could allow users of fez – or, again, any conforming tool – to freely use Unicode filenames. That means that the vast majority of normal users could use whatever filename they like, including words from their native languages. The only people who would need to care about filenames are people who opt out of using fez – presumably advanced users.

Below, I'll sketch out one specific way this renaming scheme could be implemented. But, even if you don't like the specific implementation I suggest below, please also consider the general idea of having fez and other tools perform the renaming so that the vast majority of users can use whatever filenames their heart desires (and their host OS supports).


The general approach of this implementation is to stick with CURI but to have fez do some of the renaming – with user-specified exceptions. Specifically:

  • Raku provides a method to determine what filename CompUnit::Repository.install would generate for a given file (unless it already does? Not sure what's currently exposed).
  • Before copying files into the sdist directory, fez creates a dist metadata file (similar to the files that Zef stores in ~/.raku/dist); specifically, this dist file includes mappings between original file paths and renamed hashes.
  • fez checks the META6.json file for some field – maybe keep-names – for files that should not be renamed. If any of the filenames listed in keep-names contain non-ASCII Unicode or are otherwise invalid on some filesystems, fez issues a warning.
  • fez then copies files into sdist as normal – except that it renames each file to the hashed name unless that file was listed in keep-names. In that case, fez simply copies the file without renaming.
  • When Zef downloads and extracts an archive, it looks for the dist metadata file.
    • If this file exists, Zef is installing a distribution with already-hashed filenames without any additional renaming.
    • If the file doesn't exist, Zef attempts to install as normal (which may fail if the non-fez-uploaded distribution used filenames not supported by the current filesystem)
  • Zef extends the zef locate command to accept distribution names (e.g., zef locate JSON::Fast) and to return a tree showing the original filename and the hashed filename for each module provided by that distribution.

I believe that this implementation arguably solves all 4 of the problems listed in @ugexe's OP – or, less generously, it solves 3½ of them.

Since this approach uses CURIs, it solves the two problems that CURIs solve: Unicode support and case-insensitive filesystems. It also solves the problem of renaming files breaking dlls and similar issues, although it does require users to opt in to this behavior. This seems justified, however, since the users who need static names are likely to be more experienced Rakoons.

This approach arguably doesn't solve the "human-readable filenames" problem – the filenames are still hashes. However, extending zef locate in the way I describe would likely go a long way towards satisfying the desire for human-readable filenames; after all, that desire is mostly about seeing where a distribution lives. So I'd argue that this counts as at least a half solution. (And, of course, if we decide that human-readable filenames are essential, we could adopt a different renaming format (e.g., punycode or the system I sketched out upthread). I personally favor sticking with content hashes, for reasons I'm happy to explain if we get to that point.)

Additionally, this proposal leaves people who package distributions without fez or a similar tool in exactly the same position as they'd have been under the CURL proposal: their distributions will work on all filesystems only if they use ASCII-only filenames that don't rely on case sensitivity.

I look forward to hearing thoughts, both on the specific implementation I described above and on the more general idea of having fez and similar tools handle file renaming as a pre-upload step.

@ugexe
Contributor Author

ugexe commented Sep 26, 2023

fwiw it might be a few days before I have time to read and digest the latest replies

@finanalyst

This issue is similar to the one faced by the Raku Documentation suite. Here is a solution that seems to work for the Documentation suite.
Some files need to be created with Unicode names, such as $?LINE, leading to a URL of /routine/$*LINE, or s/// with a URL of /routine/s///. Technically URLs like this are possible, and all the bytes in a Unicode codepoint can be escaped with %xx. However, no filesystem likes some special characters in file names, such as /.
The solution is to escape all chars except ASCII ones, SHA1 the name, create a mapping file (tidy-urls), and store the rendered page at /hashed/<name in SHA1 hex characters>.
When a URL is seen by a browser, it shows the escaped characters as Unicode glyphs, so e.g. %5C is seen as \.
When the URL hits the webserver, the webserver maps the URL to the correct resource.
Any URL containing /hashed/* hitting the webserver is rejected as unknown.
So, internally the resource is stored in a way any filesystem can manage, but externally the URL can be almost anything.
There are three filenames that fail, namely ., .., and \ - that is, files named with those two single characters or that one two-character sequence. That is because browsers transform these three entities before they reach the webserver. So it is necessary to special-case these three filenames.

In the suggestions above, it seems (from my understanding) that META6.json is being suggested as a way to map from natural names to file-system-compliant file names. As is pointed out, this puts the onus on the developer to know what should and shouldn't be included in META6.

Because META is mentioned, there is the question of the resources part of META6.

But it seems to me that module names, which I think are in the depends and provides sections of META6, are different in kind from the names in the resources section, particularly if the resources are named in a way the Raku system cannot affect, e.g. OpenSSL dlls.

Suppose we stipulate that a packaged Raku distribution contains a META6 file and a file system map (e.g. RAKU_MAPS) which consists of the Unicode name of a module/distribution and its operating system equivalent. The 'operating system equivalent' being a name that all OS's will be happy with, such as a SHA1 encoded name.
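A RAKU_MAPS file of that sort might look something like this (the layout and hashes are entirely hypothetical):

{
  "lib/Résumé.rakumod": "hashed/2FD4E1C67A2D28FCED849EE1BB76E7391B93EB12.rakumod",
  "bin/résumé": "hashed/DE9F2C7FD25E1B3AFAD3E85A0BD17D9B100DB4B3"
}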

zef would then have its way of mapping files, and others such as debian might have another. The only requirement would be a file RAKU_MAPS.

Resource file names would not be renamed.
