Merge pull request #3082 from finanalyst/master

JJ · web-flow · commit 356eadf55abe · 2020-01-17T18:39:03.000+01:00
OK, I'm merging this. Better to have it here and improve it afterwards, that simply not have it. We can always create an issue with an RFE for later.
diff --git a/doc/Language/compilation.pod6 b/doc/Language/compilation.pod6
@@ -0,0 +1,295 @@
+=begin pod :kind("Language") :subkind("Language") :category("tutorial")
+
+=TITLE CompUnits and where to find them
+
+=SUBTITLE How and when Raku modules are compiled, where they are stored, and how to access them in compiled form.
+
+=head1 Overview
+
+Programs in Raku, as a member of the Perl language family, tend at the top level to be more at the interpreted
+end of the interpreted-compiled spectrum. In this tutorial, an 'interpreted' program means that the source code,
+namely the human-readable text such as C<say 'hello world';> is immediately processed by the C<Raku> program into code
+that can be executed by the computer, with any intermediate stages being stored in memory.
+
+A compiled program, by contrast, is one where the human readable source is first processed into machine-executable code
+and some form of this code is stored 'on disc'. In order to execute the program, the machine-readable version is loaded
+into memory and then run by the computer.
+
+Both compiled and interpreted forms have advantages. Briefly, interpreted programs can be 'whipped up' quickly and
+the source changed quickly. Compiled programs can be complex and take a significant time to pre-process into machine-readable
+code, but then running them is much faster for a user, who only 'sees' the loading and running time, not the compilation
+time.
+
+C<Raku> has both paradigms. At the B<top level> a Raku program is interpreted, but if code that is separated out into a
+Module will be compiled and the preprocessed version is then loaded when necessary. In practice, Modules that have been
+written by the community will only need to be pre-compiled once by a user when they are 'installed', for example by a
+Module manager such as C<zef>. Then they can be C<use'd>  by a developer in her own program. The effect is to make C<Raku>
+top level programs run quickly.
+
+One of the great strengths of the C<Perl> family of languages was the ability to integrate a whole ecosystem of modules
+written by competent programmers into a small program. This strength was widely copied and is now the norm for all
+languages. C<Raku> takes integration even further, making it relatively easy for C<Raku> programs to incorporate system
+libraries written in other languages into C<Raku> programs, see L<Native Call|Language/nativecall>.
+
+The experience from C<Perl> and other languages is that the distributive nature of Modules generate several practical difficulties:
+=item a popular module may go through several iterations as the API gets improved, without a guarantee that there is
+backward compatibility. So, if a program relies on some specific function or return, then there has to be a way to
+specify the B<Version>.
+=item a module may have been written by Bob, a very competent programmer, who moves on in life, leaving the module unmaintained,
+so Alice takes over. This means that the same module, with the same name, and the same general API may have have two
+versions in the wild. Alternatively, two developers (eg., Alice and Bob) who initially cooperated on a module, then part company about its
+development. Consequently, it sometimes is necessary for there to be a way to define the B<Auth> of the module.
+=item a module may be enhanced over time and the maintainer keeps two versions uptodate, but with different APIs. So it is
+may be necessary to define the B<API> required.
+=item when developing a new program a developer may want to have the modules written by both Alice and Bob installed locally.
+So it is not possible simply to have only one version of a module with a single name installed.
+
+C<Raku> enables all of these possibilities, allowing for multiple versions, multiple authorities, and multiple APIs to be present
+installed and available locally. The way classes and modules can be accessed with specific attributes
+is explained L<elsewhere|Language/typesystem#Versioning_and_authorship>. This tutorial is about how C<Raku> handles these
+possibilities.
+
+=head1 Introduction
+
+Before considering the C<Raku> framework, let's have a look at how languages like C<Perl> or C<Python> handle module
+installation and loading.
+
+=begin code
+ACME::Foo::Bar -> ACME/Foo/Bar.pm
+os.path -> os/path.py
+=end code
+
+In those languages, module names have a 1:1 relation with file system paths.
+We simply replace the double colons with slashes and add a .pm
+
+Note that these are relative paths.
+Both C<Python> and C<Perl> use a list of include paths, to complete these paths.
+In C<Perl> they are available in the global C<@INC> array.
+
+=begin code
+@INC
+
+/usr/lib/perl5/site_perl/5.22.1/x86_64-linux-thread-multi
+/usr/lib/perl5/site_perl/5.22.1/
+/usr/lib/perl5/vendor_perl/5.22.1/x86_64-linux-thread-multi
+/usr/lib/perl5/vendor_perl/5.22.1/
+/usr/lib/perl5/5.22.1/x86_64-linux-thread-multi
+/usr/lib/perl5/5.22.1/
+=end code
+
+Each of these include directories is checked for whether it contains a relative path determined from the module name.
+If the shoe fits, the file is loaded.
+
+Of course that's a bit of a simplified version.
+Both languages support caching compiled versions of modules.
+So instead of just the C<.pm> file C<Perl> first looks for a C<.pmc> file.
+And C<Python> first looks for C<.pyc> files.
+
+Module installation in both cases means mostly copying files into locations determined by the same simple mapping. The
+system is easy to explain, easy to understand, simple and robust.
+
+=head2 Why change?
+
+Why would C<Raku> need another framework? The reason is there are features that those languages lack, namely:
+=item Unicode module names
+=item Modules published under the same names by different authors
+=item Having multiple versions of a module installed
+
+The 26 Latin characters is too restrictive for virtually all real modern languages, including English, which
+has diacritics for many loan words.
+
+With a 1:1 relation between module names and file system paths, you enter a world of pain
+once you try to support Unicode on multiple platforms and file systems.
+
+Then there's sharing module names between multiple authors. This one may or may not work out well in practice.
+I can imagine using it for example for publishing a module with some fix until the original author includes
+the fix in the "official" version.
+
+Finally there's multiple versions. Usually people who need certain versions of modules reach for local::lib or
+containers or some home grown workarounds. They all have their own disadvantages. None of them would be necessary
+if applications could just say, hey I need good old, trusty version 2.9 or maybe a bug fix release of that branch.
+
+If you had any hopes of continuing using the simple name mapping solution, you probably gave up at the
+versioning requirement. Because, how would you find version 3.2 of a module when looking for a 2.9 or higher?
+
+Popular ideas included collecting information about installed modules in JSON files but when those turned out to be
+toe-nail growing slow, text files were replace by putting the meta data into SQLite databases.
+However, these ideas can be easily shot down by introducing another requirement: distribution packages.
+
+Packages for Linux distributions are mostly just archives containing some files plus some meta data.
+Ideally the process of installing such a package means just unpacking the files and updating the central package database.
+Uninstalling means deleting the files installed this way and again updating the package database.
+Changing existing files on install and uninstall makes packagers' lives much harder, so we really want to avoid that.
+Also the names of the installed files may not depend on what was previously installed.
+We must know at the time of packaging what the names are going to be.
+
+=head2 Long names
+
+=begin code
+Foo::Bar:auth<cpan:nine>:ver<0.3>:api<1>
+=end code
+
+Step 0 in getting us back out of this mess is to define a long name.
+A full module name in C<Raku> consists of the short-name, auth, version and API
+
+At the same time, the thing you install is usually not a single module but a distribution which probably contains one or more modules.
+Distribution names work just the same way as module names.
+Indeed, distributions often will just be called after their main module.
+An important property of distributions is that they are immutable.
+C<V< Foo:auth<nine>:ver<0.3>:api<1> >> will always be the name for exactly the same code.
+
+=head2 $*REPO
+In C<Perl> and C<Python> you deal with include paths, pointing to file system directories.
+In C<Raku> we call such directories "repositories" and each of these repositories is governed by an object that does the
+C<CompUnit::Repository> role.
+Instead of an C<B<@INC>> array, there's the C<$*REPO> variable.
+It contains a single repository object.
+This object has a B<next-repo> attribute that may contain another repository.
+In other words: repositories are managed as a I<linked list>.
+The important difference to the traditional array is, that when going through the list, each object has a say in whether
+to pass along a request to the next-repo or not.
+C<Raku> sets up a standard set of repositores, i.e. the "perl", "vendor" and "site" repositories, just like you know them from C<Perl>.
+In addition, we set up a "home" repository for the current user.
+
+Repositories must implement the C<need> method.
+A C<use> or C<require> statement in C<Raku> code is basically translated to a call to C<B<$*REPO>>'s C<need> method.
+This method may in turn delegate the request to the next-repo.
+
+=begin code
+role CompUnit::Repository {
+    has CompUnit::Repository $.next is rw;
+
+    method need(CompUnit::DependencySpecification $spec,
+                CompUnit::PrecompilationRepository $precomp,
+                CompUnit::Store :@precomp-stores
+                --> CompUnit:D
+                )
+        { ... }
+    method loaded(
+                --> Iterable
+                )
+        { ... }
+
+    method id( --> Str )
+        { ... }
+}
+=end code
+
+=head2 Repositories
+
+Rakudo comes with several classes that can be used for repositories.
+The most important ones are C<CompUnit::Repository::FileSystem> and C<CompUnit::Repository::Installation>.
+The FileSystem repo is meant to be used during module development and actually works just like C<Perl> when
+looking for a module.
+It doesn't support versions or auths and simply maps the short-name to a file system path.
+
+The Installation repository is where the real smarts are. When requesting a module, you will usually either do it
+via its exact long name, or you say something along the lines of "give me a module that matches this filter".
+Such a filter is given by way of a C<CompUnit::DependencySpecification> object which has fields for
+=item short-name,
+=item auth-matcher,
+=item version-matcher and
+=item api-matcher.
+
+When looking through candidates, the Installation repository will smart match a module's long name against this
+DependencySpecification or rather the individual fields against the individual matchers.
+Thus a matcher may be some concrete value, a version range or even a regex (though an arbitrary regex, such as C<.*>,
+would not produce a useful result, but something like C<3.20.1+> will only find candidates higher than 3.20.1).
+
+Loading the meta data of all installed distributions would be prohibitively slow. The current immplementation of
+the C<Raku> framework uses
+the file system as a kind of database. However, another implementation may use another strategy. The following description
+shows how one implementation works and is included here to illustrate what is happening.
+
+We store not only a distribution's files but also create indices for speeding up lookups.
+One of these indices comes in the form of directories named after the short-name of installed modules.
+However most of the file systems in common use today cannot handle Unicode names, so we cannot just use
+module names directly.
+This is where the now infamous SHA-1 hashes enter the game.
+The directory names are the ASCII encoded SHA-1 hashes of the UTF-8 encoded module short-names.
+
+In these directories we find one file per distribution that contains a module with a matching short name.
+These files again contain the ID of the dist and the other fields that make up the long name: auth, version and api.
+So by reading these files we have a usually short list of auth-version-api triplets which we can match against our
+DependencySpecification.
+We end up with the winning dist's ID, which we use to look up the meta data, stored in a JSON encoded file.
+This meta data contains the name of the file in the sources/ directory containing the requested module's code.
+This is what we can load.
+
+Finding names for source files is again a bit tricky, as there's still the Unicode issue and in addition the same
+relative file names may be used by different installed distributions (think versions).
+So for now at least, we use SHA-1 hashes of the long-names.
+
+=head2 Resources
+
+=begin code
+%?RESOURCES
+%?RESOURCES<libraries/p5helper>
+%?RESOURCES<icons/foo.png>
+%?RESOURCES<schema.sql>
+
+Foo
+|___ lib
+|     |____ Foo.rakumod
+|
+|___ resources
+      |___ schema.sql
+      |
+      |___ libraries
+            |____ p5helper
+            |        |___
+            |___ icons
+                     |___ foo.png
+
+=end code
+
+It's not only source files that are stored and found this way.
+Distributions may also contain arbitrary resource files.
+These could be images, language files or shared libraries that are compiled on installation.
+They can be accessed from within the module through the C<%?RESOURCES> hash
+
+As long as you stick to the standard layout conventions for distributions, this even works during development
+without installing anything.
+
+A nice result of this architecture is that it's fairly easy to create special purpose repositories.
+
+=head2 Dependencies
+
+Luckily precompilation at least works quite well in most cases. Yet it comes with its own set of challenges.
+Loading a single module is easy.
+The fun starts when a module has dependencies and those dependencies have again dependencies of their own.
+
+When loading a precompiled file in C<Raku> we need to load the precompiled files of all its dependencies, too.
+And those dependencies B<must> be precompiled, we cannot load them from source files.
+Even worse, the precomp files of the dependencies B<must> be exactly the same files we used for precompiling our
+module in the first place.
+
+To top it off, precompiled files work only with the exact C<Raku> binary, that was used for compilation.
+
+All of that would still be quite manageable if it weren't for an additional requirement: as a user you expect a new
+version of a module you just installed to be actually used, don't you?
+
+In other words: if you upgrade a dependency of a precompiled module, we have to detect this and precompile the module
+again with the new dependency.
+
+=head2 Precomp stores
+
+Now remember that while we have a standard repository chain, the user may prepend additional repositories by way of
+C<-I> on the command line or "use lib" in the code.
+
+These repositories may contain the dependencies of precompiled modules.
+
+Our first solution to this riddle was that each repository gets it's own precomp store where precompiled files are stored.
+We only ever load precomp files from the precomp store of the very first repository in the chain because this is the
+only repository that has direct or at least indirect access to all the candidates.
+
+If this repository is a FileSystem repository, we create a precomp store in a C<.precomp> directory.
+
+While being the safe option, this has the consequence that whenever you use a new repository, we will start out
+without access to precompiled files.
+
+Instead, we will precompile the modules used when they are first loaded.
+
+=head2 Credit
+This tutorial is based on a C<niner> L<talk|http://niner.name/talks/A%20look%20behind%20the%20curtains%20-%20module%20loading%20in%20Perl%206/>.
+=end pod