Commit 356eadf

Merge pull request #3082 from finanalyst/master
OK, I'm merging this. Better to have it here and improve it afterwards, than simply not have it. We can always create an issue with an RFE for later.
2 parents 99dfca4 + a10fd84 commit 356eadf

1 file changed: +295 -0 lines changed

doc/Language/compilation.pod6

Lines changed: 295 additions & 0 deletions
@@ -0,0 +1,295 @@
1+
=begin pod :kind("Language") :subkind("Language") :category("tutorial")
2+
3+
=TITLE CompUnits and where to find them
4+
5+
=SUBTITLE How and when Raku modules are compiled, where they are stored, and how to access them in compiled form.
6+
7+
=head1 Overview
8+
9+
Programs written in Raku, a member of the Perl language family, tend at the top level to be more at the interpreted
10+
end of the interpreted-compiled spectrum. In this tutorial, an 'interpreted' program means that the source code,
11+
namely the human-readable text such as C<say 'hello world';> is immediately processed by the C<Raku> program into code
12+
that can be executed by the computer, with any intermediate stages being stored in memory.
13+
14+
A compiled program, by contrast, is one where the human readable source is first processed into machine-executable code
15+
and some form of this code is stored 'on disc'. In order to execute the program, the machine-readable version is loaded
16+
into memory and then run by the computer.
17+
18+
Both compiled and interpreted forms have advantages. Briefly, interpreted programs can be 'whipped up' quickly and
19+
the source changed quickly. Compiled programs can be complex and take a significant time to pre-process into machine-readable
20+
code, but then running them is much faster for a user, who only 'sees' the loading and running time, not the compilation
21+
time.
22+
23+
C<Raku> has both paradigms. At the B<top level> a Raku program is interpreted, but code that is separated out into a
24+
Module will be compiled and the preprocessed version is then loaded when necessary. In practice, Modules that have been
25+
written by the community will only need to be pre-compiled once by a user when they are 'installed', for example by a
26+
Module manager such as C<zef>. Then they can be C<use>d by a developer in her own program. The effect is to make C<Raku>
27+
top level programs run quickly.
28+
29+
One of the great strengths of the C<Perl> family of languages was the ability to integrate a whole ecosystem of modules
30+
written by competent programmers into a small program. This strength was widely copied and is now the norm for all
31+
languages. C<Raku> takes integration even further, making it relatively easy for C<Raku> programs to incorporate system
32+
libraries written in other languages into C<Raku> programs, see L<Native Call|Language/nativecall>.
33+
34+
The experience from C<Perl> and other languages is that the distributed nature of Modules generates several practical difficulties:
35+
=item a popular module may go through several iterations as the API gets improved, without a guarantee that there is
36+
backward compatibility. So, if a program relies on some specific function or return, then there has to be a way to
37+
specify the B<Version>.
38+
=item a module may have been written by Bob, a very competent programmer, who moves on in life, leaving the module unmaintained,
39+
so Alice takes over. This means that the same module, with the same name, and the same general API may have two
40+
versions in the wild. Alternatively, two developers (e.g. Alice and Bob) who initially cooperated on a module may later part company over its
41+
development. Consequently, it sometimes is necessary for there to be a way to define the B<Auth> of the module.
42+
=item a module may be enhanced over time and the maintainer keeps two versions up to date, but with different APIs. So it
43+
may be necessary to define the B<API> required.
44+
=item when developing a new program a developer may want to have the modules written by both Alice and Bob installed locally.
45+
So it is not possible simply to have only one version of a module with a single name installed.
46+
47+
C<Raku> enables all of these possibilities, allowing for multiple versions, multiple authorities, and multiple APIs to be
48+
installed and available locally. The way classes and modules can be accessed with specific attributes
49+
is explained L<elsewhere|Language/typesystem#Versioning_and_authorship>. This tutorial is about how C<Raku> handles these
50+
possibilities.
51+
52+
=head1 Introduction
53+
54+
Before considering the C<Raku> framework, let's have a look at how languages like C<Perl> or C<Python> handle module
55+
installation and loading.
56+
57+
=begin code
58+
ACME::Foo::Bar -> ACME/Foo/Bar.pm
59+
os.path -> os/path.py
60+
=end code
61+
62+
In those languages, module names have a 1:1 relation with file system paths.
63+
We simply replace the double colons (in C<Python>, the dots) with slashes and add a C<.pm> (or C<.py>) extension.
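The mapping can be written down in a few lines. The following is only a sketch of the idea in Raku, not code taken from either language's loader; the sub name C<module-to-path> and its named parameters are made up for this illustration.

```raku
# Hypothetical helper: turn a module name into a relative file path.
# The separator and extension default to the Perl convention.
sub module-to-path(Str $name, Str :$separator = '::', Str :$extension = '.pm' --> Str) {
    $name.subst($separator, '/', :g) ~ $extension
}

say module-to-path('ACME::Foo::Bar');                           # ACME/Foo/Bar.pm
say module-to-path('os.path', :separator<.>, :extension<.py>);  # os/path.py
```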
64+
65+
Note that these are relative paths.
66+
Both C<Python> and C<Perl> use a list of include paths to complete these paths.
67+
In C<Perl> they are available in the global C<@INC> array.
68+
69+
=begin code
70+
@INC
71+
72+
/usr/lib/perl5/site_perl/5.22.1/x86_64-linux-thread-multi
73+
/usr/lib/perl5/site_perl/5.22.1/
74+
/usr/lib/perl5/vendor_perl/5.22.1/x86_64-linux-thread-multi
75+
/usr/lib/perl5/vendor_perl/5.22.1/
76+
/usr/lib/perl5/5.22.1/x86_64-linux-thread-multi
77+
/usr/lib/perl5/5.22.1/
78+
=end code
79+
80+
Each of these include directories is checked for whether it contains a relative path determined from the module name.
81+
If the shoe fits, the file is loaded.
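As a rough sketch of that search, written in Raku for illustration only: the sub name C<find-in-include-dirs> is hypothetical, the include directories are the ones from the listing above, and the lookup of cached C<.pmc> or C<.pyc> files is left out.

```raku
# Hypothetical helper: return the first existing file for a module name,
# or Nil when no include directory contains it.
sub find-in-include-dirs(Str $name, @include-dirs) {
    my $relative = $name.subst('::', '/', :g) ~ '.pm';
    for @include-dirs -> $dir {
        my $candidate = $dir.IO.add($relative);
        return $candidate if $candidate.f;   # the shoe fits
    }
    return Nil;
}

my @include-dirs = '/usr/lib/perl5/site_perl/5.22.1', '/usr/lib/perl5/5.22.1';
say find-in-include-dirs('ACME::Foo::Bar', @include-dirs) // 'not found';
```

The order of the directories matters: the first match wins, which is why prepending a directory overrides an installed module.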
82+
83+
Of course that's a bit of a simplified version.
84+
Both languages support caching compiled versions of modules.
85+
So instead of just the C<.pm> file C<Perl> first looks for a C<.pmc> file.
86+
And C<Python> first looks for C<.pyc> files.
87+
88+
Module installation in both cases means mostly copying files into locations determined by the same simple mapping. The
89+
system is easy to explain, easy to understand, simple and robust.
90+
91+
=head2 Why change?
92+
93+
Why would C<Raku> need another framework? The reason is there are features that those languages lack, namely:
94+
=item Unicode module names
95+
=item Modules published under the same names by different authors
96+
=item Having multiple versions of a module installed
97+
98+
The 26 Latin letters are too restrictive for virtually all real modern languages, including English, which
99+
has diacritics for many loan words.
100+
101+
With a 1:1 relation between module names and file system paths, you enter a world of pain
102+
once you try to support Unicode on multiple platforms and file systems.
103+
104+
Then there's sharing module names between multiple authors. This one may or may not work out well in practice.
105+
I can imagine using it for example for publishing a module with some fix until the original author includes
106+
the fix in the "official" version.
107+
108+
Finally there's multiple versions. Usually people who need certain versions of modules reach for local::lib or
109+
containers or some home grown workarounds. They all have their own disadvantages. None of them would be necessary
110+
if applications could just say, hey I need good old, trusty version 2.9 or maybe a bug fix release of that branch.
111+
112+
If you had any hopes of continuing using the simple name mapping solution, you probably gave up at the
113+
versioning requirement. Because, how would you find version 3.2 of a module when looking for a 2.9 or higher?
114+
115+
Popular ideas included collecting information about installed modules in JSON files but when those turned out to be
116+
toe-nail growing slow, text files were replaced by putting the meta data into SQLite databases.
117+
However, these ideas can be easily shot down by introducing another requirement: distribution packages.
118+
119+
Packages for Linux distributions are mostly just archives containing some files plus some meta data.
120+
Ideally the process of installing such a package means just unpacking the files and updating the central package database.
121+
Uninstalling means deleting the files installed this way and again updating the package database.
122+
Changing existing files on install and uninstall makes packagers' lives much harder, so we really want to avoid that.
123+
Also the names of the installed files must not depend on what was previously installed.
124+
We must know at the time of packaging what the names are going to be.
125+
126+
=head2 Long names
127+
128+
=begin code
129+
Foo::Bar:auth<cpan:nine>:ver<0.3>:api<1>
130+
=end code
131+
132+
Step 0 in getting us back out of this mess is to define a long name.
133+
A full module name in C<Raku> consists of the short-name, auth, version and API.
134+
135+
At the same time, the thing you install is usually not a single module but a distribution which probably contains one or more modules.
136+
Distribution names work just the same way as module names.
137+
Indeed, distributions often will just be named after their main module.
138+
An important property of distributions is that they are immutable.
139+
C<V< Foo:auth<nine>:ver<0.3>:api<1> >> will always be the name for exactly the same code.
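The parts of a long name can be inspected from code. The sketch below declares a class inline as a stand-in for a module that would normally live in an installed distribution; the C<.^auth>, C<.^ver> and C<.^api> meta-methods are assumed to be available, as they are in recent Rakudo releases.

```raku
# Inline stand-in for a package that carries a full long name.
class Foo::Bar:auth<cpan:nine>:ver<0.3>:api<1> { }

say Foo::Bar.^name;   # the short-name
say Foo::Bar.^auth;
say Foo::Bar.^ver;
say Foo::Bar.^api;

# A consumer would request exactly this code with:
#   use Foo::Bar:auth<cpan:nine>:ver<0.3>:api<1>;
```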
140+
141+
=head2 $*REPO
142+
In C<Perl> and C<Python> you deal with include paths, pointing to file system directories.
143+
In C<Raku> we call such directories "repositories" and each of these repositories is governed by an object that does the
144+
C<CompUnit::Repository> role.
145+
Instead of an C<B<@INC>> array, there's the C<$*REPO> variable.
146+
It contains a single repository object.
147+
This object has a B<next-repo> attribute that may contain another repository.
148+
In other words: repositories are managed as a I<linked list>.
149+
The important difference from the traditional array is that, when going through the list, each object has a say in whether
150+
to pass along a request to the next-repo or not.
151+
C<Raku> sets up a standard set of repositories, i.e. the "perl", "vendor" and "site" repositories, just like you know them from C<Perl>.
152+
In addition, we set up a "home" repository for the current user.
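The linked list can be walked by hand. This is a minimal sketch that assumes a Rakudo installation; the sub name C<repository-names> is made up, and the class names printed depend on the local setup.

```raku
# Follow next-repo from the head of the chain until it is undefined.
sub repository-names(--> List) {
    my @names;
    my $repo = $*REPO;
    while $repo.defined {
        @names.push: $repo.^name;
        $repo = $repo.next-repo;
    }
    @names.List
}

.say for repository-names();
```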
153+
154+
Repositories must implement the C<need> method.
155+
A C<use> or C<require> statement in C<Raku> code is basically translated to a call to C<B<$*REPO>>'s C<need> method.
156+
This method may in turn delegate the request to the next-repo.
157+
158+
=begin code
159+
role CompUnit::Repository {
160+
has CompUnit::Repository $.next-repo is rw;
161+
162+
method need(CompUnit::DependencySpecification $spec,
163+
CompUnit::PrecompilationRepository $precomp,
164+
CompUnit::PrecompilationStore :@precomp-stores
165+
--> CompUnit:D
166+
)
167+
{ ... }
168+
method loaded(
169+
--> Iterable
170+
)
171+
{ ... }
172+
173+
method id( --> Str )
174+
{ ... }
175+
}
176+
=end code
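To see the C<need> method at work, one can call it directly instead of writing a C<use> statement. The sketch below assumes that the C<Test> module shipped with Rakudo is installed; it is an illustration of the call, not a recommended way to load modules.

```raku
# Ask the head of the repository chain for a module by short-name only.
my $spec      = CompUnit::DependencySpecification.new(:short-name<Test>);
my $comp-unit = $*REPO.need($spec);

say $comp-unit.short-name;   # the module that was resolved
say $comp-unit.repo.^name;   # the repository that supplied it
```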
177+
178+
=head2 Repositories
179+
180+
Rakudo comes with several classes that can be used for repositories.
181+
The most important ones are C<CompUnit::Repository::FileSystem> and C<CompUnit::Repository::Installation>.
182+
The FileSystem repo is meant to be used during module development and actually works just like C<Perl> when
183+
looking for a module.
184+
It doesn't support versions or auths and simply maps the short-name to a file system path.
185+
186+
The Installation repository is where the real smarts are. When requesting a module, you will usually either do it
187+
via its exact long name, or you say something along the lines of "give me a module that matches this filter".
188+
Such a filter is given by way of a C<CompUnit::DependencySpecification> object which has fields for
189+
=item short-name,
190+
=item auth-matcher,
191+
=item version-matcher and
192+
=item api-matcher.
193+
194+
When looking through candidates, the Installation repository will smart match a module's long name against this
195+
DependencySpecification or rather the individual fields against the individual matchers.
196+
Thus a matcher may be some concrete value, a version range or even a regex. An arbitrary regex, such as C<.*>,
197+
would not produce a useful result, but something like C<3.20.1+> will only find candidates at version 3.20.1 or higher.
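The matchers rely on ordinary smart matching, which can be tried in isolation. This sketch only demonstrates the matching behaviour of C<Version> objects, strings and regexes; it does not reproduce how the Installation repository applies them internally.

```raku
# A trailing + means "this version or higher".
my $at-least = Version.new('3.20.1+');

say Version.new('3.20.1') ~~ $at-least;   # True
say Version.new('3.21.0') ~~ $at-least;   # True
say Version.new('3.20.0') ~~ $at-least;   # False

# An auth can be matched against a concrete value or a regex.
say 'cpan:nine' ~~ 'cpan:nine';           # True
say so 'cpan:nine' ~~ /^ 'cpan:'/;        # True
```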
198+
199+
Loading the meta data of all installed distributions would be prohibitively slow. The current implementation of
200+
the C<Raku> framework uses
201+
the file system as a kind of database. However, another implementation may use another strategy. The following description
202+
shows how one implementation works and is included here to illustrate what is happening.
203+
204+
We store not only a distribution's files but also create indices for speeding up lookups.
205+
One of these indices comes in the form of directories named after the short-name of installed modules.
206+
However, the file systems in common use today do not handle Unicode names consistently, so we cannot just use
207+
module names directly.
208+
This is where the now infamous SHA-1 hashes enter the game.
209+
The directory names are the ASCII encoded SHA-1 hashes of the UTF-8 encoded module short-names.
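A directory name of this kind can be computed as follows. This is a sketch that assumes Rakudo on MoarVM, where the C<nqp::sha1> op is available after C<use nqp>; that op is an implementation detail, not part of the Raku language, and the sub name C<index-directory-name> is made up.

```raku
use nqp;   # Rakudo-specific, used here only for its SHA-1 op

# Hypothetical helper: hex SHA-1 digest of a module short-name.
sub index-directory-name(Str $short-name --> Str) {
    nqp::sha1($short-name)
}

say index-directory-name('Foo::Bar');
say index-directory-name('Führerschein::Prüfung');   # Unicode in, ASCII out
```

Whatever characters the short-name contains, the result is a fixed-length string of hexadecimal digits, which any file system can store.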
210+
211+
In these directories we find one file per distribution that contains a module with a matching short name.
212+
These files again contain the ID of the dist and the other fields that make up the long name: auth, version and api.
213+
So by reading these files we have a usually short list of auth-version-api triplets which we can match against our
214+
DependencySpecification.
215+
We end up with the winning dist's ID, which we use to look up the meta data, stored in a JSON encoded file.
216+
This meta data contains the name of the file in the sources/ directory containing the requested module's code.
217+
This is what we can load.
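The lookup described above can be modelled in memory. Everything in this sketch is hypothetical: the dist IDs, the source file names, the sub name C<find-source>, and the rule that the highest matching version wins are assumptions made for illustration, and plain hashes stand in for the index files and the JSON meta data.

```raku
# Stand-in for the index directory of one short-name:
# one entry per installed distribution providing that module.
my @index =
    { dist-id => 'AAA', auth => 'cpan:nine',  ver => v0.3, api => '1' },
    { dist-id => 'BBB', auth => 'cpan:nine',  ver => v0.4, api => '1' },
    { dist-id => 'CCC', auth => 'cpan:alice', ver => v1.0, api => '2' };

# Stand-in for the per-distribution meta data files.
my %meta =
    AAA => { source => 'sources/1111' },
    BBB => { source => 'sources/2222' },
    CCC => { source => 'sources/3333' };

# A matcher left at True accepts every candidate.
sub find-source(@index, %meta, :$auth = True, :$ver = True, :$api = True) {
    my @candidates = @index.grep: -> $c {
        $c<auth> ~~ $auth and $c<ver> ~~ $ver and $c<api> ~~ $api
    };
    return Nil unless @candidates;
    my $winner = @candidates.sort({ .<ver> }).tail;   # assumed: highest version wins
    %meta{ $winner<dist-id> }<source>
}

say find-source(@index, %meta, :auth<cpan:nine>, :ver(Version.new('0.3+')));
```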
218+
219+
Finding names for source files is again a bit tricky, as there's still the Unicode issue and in addition the same
220+
relative file names may be used by different installed distributions (think versions).
221+
So for now at least, we use SHA-1 hashes of the long-names.
222+
223+
=head2 Resources
224+
225+
=begin code
226+
%?RESOURCES
227+
%?RESOURCES<libraries/p5helper>
228+
%?RESOURCES<icons/foo.png>
229+
%?RESOURCES<schema.sql>
230+
231+
Foo
232+
|___ lib
233+
|     |____ Foo.rakumod
234+
|
235+
|___ resources
236+
      |___ schema.sql
237+
      |
238+
      |___ libraries
239+
      |     |____ p5helper
240+
      |            |___
241+
      |___ icons
242+
            |___ foo.png
243+
244+
=end code
245+
246+
It's not only source files that are stored and found this way.
247+
Distributions may also contain arbitrary resource files.
248+
These could be images, language files or shared libraries that are compiled on installation.
249+
They can be accessed from within the module through the C<%?RESOURCES> hash.
250+
251+
As long as you stick to the standard layout conventions for distributions, this even works during development
252+
without installing anything.
253+
254+
A nice result of this architecture is that it's fairly easy to create special purpose repositories.
255+
256+
=head2 Dependencies
257+
258+
Luckily precompilation at least works quite well in most cases. Yet it comes with its own set of challenges.
259+
Loading a single module is easy.
260+
The fun starts when a module has dependencies and those dependencies in turn have dependencies of their own.
261+
262+
When loading a precompiled file in C<Raku> we need to load the precompiled files of all its dependencies, too.
263+
And those dependencies B<must> be precompiled; we cannot load them from source files.
264+
Even worse, the precomp files of the dependencies B<must> be exactly the same files we used for precompiling our
265+
module in the first place.
266+
267+
To top it off, precompiled files work only with the exact C<Raku> binary that was used for compilation.
268+
269+
All of that would still be quite manageable if it weren't for an additional requirement: as a user you expect a new
270+
version of a module you just installed to be actually used, don't you?
271+
272+
In other words: if you upgrade a dependency of a precompiled module, we have to detect this and precompile the module
273+
again with the new dependency.
274+
275+
=head2 Precomp stores
276+
277+
Now remember that while we have a standard repository chain, the user may prepend additional repositories by way of
278+
C<-I> on the command line or "use lib" in the code.
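The effect of prepending can be observed from a script. This is a small sketch assuming Rakudo, where a plain directory added with C<use lib> is expected to be handled by a C<CompUnit::Repository::FileSystem> object; the directory name C<lib> is just an example and does not have to exist.

```raku
use lib 'lib';   # prepend a repository for the local lib directory

say $*REPO.^name;             # the new head of the chain
say $*REPO.next-repo.^name;   # the repository that was the head before
```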
279+
280+
These repositories may contain the dependencies of precompiled modules.
281+
282+
Our first solution to this riddle was that each repository gets its own precomp store where precompiled files are stored.
283+
We only ever load precomp files from the precomp store of the very first repository in the chain because this is the
284+
only repository that has direct or at least indirect access to all the candidates.
285+
286+
If this repository is a FileSystem repository, we create a precomp store in a C<.precomp> directory.
287+
288+
While being the safe option, this has the consequence that whenever you use a new repository, we will start out
289+
without access to precompiled files.
290+
291+
Instead, we will precompile the modules used when they are first loaded.
292+
293+
=head2 Credit
294+
This tutorial is based on a C<niner> L<talk|http://niner.name/talks/A%20look%20behind%20the%20curtains%20-%20module%20loading%20in%20Perl%206/>.
295+
=end pod
