Conversation

aG0aep6G
Contributor

Inspired by Andrei's call for "scaling up rdmd" (forum post, issue 14654).

But this doesn't implement Andrei's idea. Instead it does what Jacob Carlborg and I thought might be a better approach: compile all outdated files, and only those, with a single compiler invocation that produces multiple object files. Seems to work fine.

See commit message for a summary of the process.

I'm planning to add some tests to rdmd_test tomorrow. But code-wise I'm happy with it now; ready to get destroyed.

@dnadlinger
Contributor

This won't work in the general case. DMD still emits template instances to the first module specified on the command line if it can't figure out where they should really go. Right now, incremental compilation is only guaranteed to work if you pass the exact same list of modules every time.

@CyberShadow
Member

DMD still emits template instances to the first module specified on the command line if it can't figure out where they should really go.

Is this still true? Is there a self-contained test case for this?

@aG0aep6G
Contributor Author

This won't work in the general case. DMD still emits template instances to the first module specified on the command line if it can't figure out where they should really go. Right now, incremental compilation is only guaranteed to work if you pass the exact same list of modules every time.

Bah. Do you have an example of when dmd can't figure it out?

string workDir, string objDir, in string[string] myDeps,
string[] compilerFlags, bool addStubMain)
private int link(in string fullExe, in string[] objects,
in string[] compilerFlags, in string workDir)
{
version (Windows)
fullExe = fullExe.defaultExtension(".exe");
Member


This won't compile on Windows any more because fullExe is an in parameter

Contributor Author


fixed

@jacob-carlborg

This won't work in the general case. DMD still emits template instances to the first module specified on the command line if it can't figure out where they should really go. Right now, incremental compilation is only guaranteed to work if you pass the exact same list of modules every time.

@klickverbot "Whatever the matters are, we will fix them" [1].

[1] http://forum.dlang.org/post/mkv452$1tdo$4@digitalmars.com

@jacob-carlborg

What about using a hash based on the files instead of using a timestamp?

@aG0aep6G
Contributor Author

What about using a hash based on the files instead of using a timestamp?

This hadn't crossed my mind. Spontaneous thoughts about it:

  • It would be more expensive. Maybe significantly so.
  • When hashes collide things go wrong in a way that's hard to understand for the user. Probably not a problem in reality.
  • It should be possible to retrofit that later on.
  • What problem do hashes solve? When do timestamps fail?

@aG0aep6G
Contributor Author

Added some tests. I'm not super happy with Thread.sleeping to get different timestamps, but I don't see another sane way. Setting modification times manually would require touching the object files.
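
For illustration only (not code from this PR, and main.d is just a placeholder), the workaround boils down to sleeping past the typical one-second filesystem timestamp resolution before touching a source file:

import core.thread : Thread;
import core.time : seconds;
import std.file : append;

void main()
{
    // Wait long enough that the next write produces a strictly newer mtime
    // than the object files from the previous build, then touch the source.
    Thread.sleep(1.seconds);
    append("main.d", "\n// trigger rebuild\n");
}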

@jacob-carlborg

What problem do hashes solve? When do timestamps fail?

The file is only recompiled if the content has changed, even if the timestamp has been updated.
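
A minimal sketch of that check, assuming the digest from the previous build has been stored somewhere (contentChanged and the stored digest are hypothetical helpers, not part of this PR):

import std.digest.md5 : md5Of;
import std.file : read;

// The file counts as changed only when its MD5 digest differs from the one
// recorded for the previous build, no matter what its timestamp says.
bool contentChanged(string srcFile, ubyte[16] storedDigest)
{
    return md5Of(cast(const(ubyte)[]) read(srcFile)) != storedDigest;
}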

aG0aep6G added 2 commits June 15, 2015 23:16
Summary of the process:
 * Build a dependency graph (getDependencies).
 * For every source file, figure out when the last change happened to the file itself or any of its dependencies (getTimes).
 * Split the source files into two groups:
   * Put files that need to be (re-)compiled into toCompile.
   * Put files for which an older object file can be reused into toReuse.
 * Compile all files in toCompile in one compiler call (compile).
 * Link all new and old objects together in another compiler call (link).

Also move the temporary directory for the tests to /tmp/rdmd_test. We generate quite a few files and we don't want to overwrite other people's files.
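
To make that concrete, here is a rough, hypothetical sketch of the decision logic described above, not rdmd's actual code; the deps mapping, the objs directory, and the single-pass driver are all assumptions for illustration.

import std.algorithm : max;
import std.datetime : SysTime;
import std.file : exists, timeLastModified;
import std.path : baseName, buildPath, setExtension;
import std.process : execute;

// Newest modification time of a source file or any of its dependencies.
SysTime newestTime(string src, string[][string] deps)
{
    auto t = timeLastModified(src);
    foreach (dep; deps.get(src, null))
        t = max(t, timeLastModified(dep));
    return t;
}

void main(string[] args)
{
    enum objDir = "objs";
    string[][string] deps; // result of the dependency scan (not shown)

    string[] toCompile, toReuse;
    foreach (src; args[1 .. $])
    {
        auto obj = buildPath(objDir, src.baseName.setExtension(".o"));
        if (!obj.exists || timeLastModified(obj) < newestTime(src, deps))
            toCompile ~= src; // outdated: recompile
        else
            toReuse ~= obj;   // up to date: reuse the existing object file
    }

    if (toCompile.length)
        execute(["dmd", "-c", "-od" ~ objDir] ~ toCompile); // one compiler call
    // A link step would then combine the fresh objects with toReuse (not shown).
}
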
@DmitryOlshansky
Member

When hashes collide things go wrong in a way that's hard to understand for the user. Probably not a problem in reality.

I suggest you actually try to find a collision, even in (technically "broken") MD5. It's not something that happens without considerable effort, compute power, and time spent searching for a collision. Note that the git we all use is built on the assumption that a SHA-1 hash won't collide within one repo. In fact, most often people use just the first 8 characters of a hash to check out a commit.

It would be more expensive. Maybe significantly so.

I believe computing a checksum of the sources should be fairly cheap; sources rarely exceed a few MB in total. Phobos is about 9 MB with all the fluff, and the generated tables in std/internal clock in at 1.6 MB. Speed depends on the CPU, but it is in the range of 200+ MB/s on circa-2011 hardware.

http://stackoverflow.com/questions/2722943/is-calculating-an-md5-hash-less-cpu-intensive-than-sha-family-functions

@aG0aep6G
Contributor Author

Heads up: I think I've fucked things up with getTimes. I'm going to look at that again and will likely change things.

@CyberShadow
Member

I believe computing a checksum of the sources should be fairly cheap

The biggest bottleneck is going to be reading all source code from disk for every rdmd run (even if nothing changed). You could use timestamps AND checksums, but it's still an unnecessary performance hit.

Very few build systems make the decision to use checksums because it is usually simply unnecessary. It only makes sense in situations with non-trivial caching systems (where a --force switch is impractical or isn't supposed to need to exist).

@aG0aep6G
Contributor Author

Heads up: I think I've fucked things up with getTimes. I'm going to look at that again and will likely change things.

Update: I now think everything's fine. I may have been staring at this for too long. Getting the newest timestamp of a file and its dependencies, and then comparing that to the object file - that's sound, isn't it?

// Validate extensions of extra files (--extra-file)
foreach (immutable f; extraFiles)
{
if (![".d", ".di", objExt].canFind(f.extension))
Member


if (!only(".d", ".di", objExt).canFind(f.extension))

}
}

immutable rootDir = dirName(rootModule);
return deps;
}
Member


At 220 lines, this seems to be quite a long function. I think it should be broken into smaller ones.

@andralex
Member

This is a bit heavier than I'd hoped for what it does, but I found no obvious bilgewater. It does raise an eyebrow that a 1 KLOC program needs a HashSet on top of an associative array instead of just making use of the associative array itself.

Regarding discussion of using hashes instead of timestamps - that's a fine idea (git and others do that), but it's outside the charter of this PR, and may be implemented independently of it. Please focus reviews on the design and implementation of this PR.

@aG0aep6G: did you run speed measurements?

@aG0aep6G
Contributor Author

Followed Andrei's suggestions or replied when I didn't.

@aG0aep6G
Contributor Author

@aG0aep6G: did you run speed measurements?

Not extensively. I did time a project of mine (about 8 kloc).
Nothing changed for --force builds.
A build without any changes went up from 0.034s to 0.073s.
Building when just the main file changed takes about as long as a full rebuild, but the issued commands are fine; the program is just template-heavy.

A toy example where an imported file takes long to compile works nicely:

module main;
import slow_compile;
void main() {}

module slow_compile;
struct S(ulong depth)
{
    static if(depth > 0)
    {
        S!(depth - 1) sub1;
        S!(depth - 1) sub2;
    }
}
alias Instance = S!15;

3s for a full build, 0.2s when only main was touched.

@rainers
Member

rainers commented Jun 16, 2015

This won't work in the general case. DMD still emits template instances to the first module specified on the command line if it can't figure out where they should really go.

Here is an example of what David referred to:

module main;
import imp;
import foo;

Struc!int x; // remove this line for second build

void main()
{
    fun();
}
module foo;
import imp;

int fun()
{
    Struc!int ix;
    return ix.get();
}
module imp;

struct Struc(T)
{
    T x;
    T get() { return x; }
}

Module imp.d is meant as an import from a library like std.algorithm, so it is not compiled with the other files. Making an example where it is compiled as well is a bit harder, but not much harder.
Compile with dmd -c main.d foo.d, then link with dmd main.obj foo.obj. Everything fine.

Then remove the commented line in main.d, recompile with dmd -c main.d and link:

OPTLINK (R) for Win32  Release 8.00.17
Copyright (C) Digital Mars 1989-2013  All rights reserved.
http://www.digitalmars.com/ctg/optlink.html
foo.obj(foo)
 Error 42: Symbol Undefined _D3imp12__T5StrucTiZ5Struc3getMFNaNbNiNfZi

In the first run, Struc!int is emitted into main.obj, so once the line is removed from main.d and main.d is recompiled, the instance is gone, while foo.obj still references it.

I think an easier approach to incremental compilation would be for dmd to support replacing existing symbols in a library. Such a library would accumulate every template instance ever used, though, so it might grow indefinitely...

@DmitryOlshansky
Member

The biggest bottleneck is going to be reading all source code from disk for every rdmd run (even if nothing changed).

That's true.

@WalterBright
Member

DMD still emits template instances to the first module specified on the command line if it can't figure out where they should really go.

No. When presented with multiple modules on the command line, DMD builds exactly ONE combined object file, not one per module. Template instances that the compiler does not see instantiated by one of the imported modules are inserted into that combined object file.

Oh, I see what you're doing, you're using -c. That does present a problem with "where does the template instance go". No obvious solution, other than using -allinst.

@andralex
Member

So... we're kind of stuck here. @WalterBright what's the way out?

@rainers
Member

rainers commented Jun 20, 2015

Oh, I see what you're doing, you're using -c. That does present a problem with "where does the template instance go". No obvious solution, other than using -allinst.

The problem is that the instance is generated into only one of the object files and can disappear from it after source code changes, while another object file still depends on it. -allinst won't help.

@jacob-carlborg

Would there be a problem if the compiler outputted the symbols in all object files? Doesn't LDC do that?

@dnadlinger
Contributor

@jacob-carlborg: LDC (mostly) uses the same symbol emission strategy as DMD.

@aG0aep6G
Contributor Author

Would there be a problem if the compiler outputted the symbols in all object files?

Basically what happens with separate compilation, no?

Sticking with Rainer's example, this works:

dmd -c main.d
dmd -c foo.d
dmd main.o foo.o
# remove the marked line from main.d
dmd -c main.d
dmd main.o foo.o # no problem

nm shows that there are weak duplicates in the object files (before removing the marked line). nm main.o and nm foo.o both include:

0000000000000000 W _D3imp12__T5StrucTiZ5Struc3getMFNaNbNiNfZi
0000000000000000 V _D3imp12__T5StrucTiZ5Struc6__initZ
0000000000000000 W _D3imp15__unittest_failFiZv
0000000000000000 W _D3imp7__arrayZ
0000000000000000 W _D3imp8__assertFiZv

@dnadlinger
Contributor

Would there be a problem if the compiler outputted the symbols in all object files?

Basically what happens with separate compilation, no?

Yes. At the risk of sounding like a broken record, incremental compilation right now is only guaranteed to work if any given module is only compiled as part of one list of source files, which cannot be changed as long as you want to reuse other object files. Only ever compiling a single module at once trivially satisfies this criterion.

@MartinNowak
Member

Compiling multiple modules to separate object files in one go is fundamentally broken, and we're unlikely to fix that because we decided to emit as few template instances as possible rather than always emitting all of them into every object file.
You should follow Andrei's original proposal and rebuild a static library for each package, or use dub, which already knows how to do this efficiently.
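
For reference, a minimal sketch of that per-package approach (hypothetical; mypkg, its files, and main.d are placeholders, and a real tool would rebuild only packages whose sources changed):

import std.process : execute;

void main()
{
    // Rebuild the whole package into one static library in a single dmd call.
    execute(["dmd", "-lib", "-ofmypkg.a", "mypkg/a.d", "mypkg/b.d"]);

    // Link the application against the per-package library.
    execute(["dmd", "-ofapp", "main.d", "mypkg.a"]);
}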
