Skip to content
This repository has been archived by the owner on Oct 12, 2022. It is now read-only.

[WIP] Sociomantic CDGC port #985

Closed
wants to merge 14 commits into from
Closed

[WIP] Sociomantic CDGC port #985

wants to merge 14 commits into from

Conversation

mihails-strasuns-sociomantic
Copy link
Contributor

Here is initial port result available for early experiments. It can be compiled with make -f posix.mak GC_TYPE=concurrent and passes the test suite with only shared library tests disabled (ef20b7a).

There are still many issues to be aware of:

  1. Documentation is largely missing. Working on it, reading @leandro-lucarella-sociomantic old posts (http://www.llucax.com.ar/blog/blog/tag/understanding%20the%20current%20gc) may help in the meanwhile
  2. Code style differs from Phobos standards. To be fixed soon.
  3. Shared library support is completely missing. Proxy infrastructure similar to one in existing gc needs to be added and I don't know if actual implementation will work in such environments or more changes will be needed.
  4. Deadlock issue (http://www.dsource.org/projects/tango/ticket/2087) still remains. It is not critical to our code because it almost never uses threads so no big effort was put into it but this can be huge problem for any other project.

In general this is very far from something that can be merged upstream straight away and replace default GC on linux. It can be interesting for other projects with similar architecture and requirements and probably helpful for anyone else working on better D GC.

And contributions are welcome via pull requests towards this PR base branch (https://github.com/mihails-strasuns-sociomantic/druntime-1/tree/sociomantic-cdgc-wip) or as e-mail patches (public@dicebot.lv)

@MartinNowak
Copy link
Member

Now that we have druntime configuration options (#817) I would suggest to add this as an optional GC that can be selected at program start.

@mihails-strasuns-sociomantic
Copy link
Contributor Author

Have totally missed that PR. I have actually considered implementing similar runtime approach (as CDGC naturally uses env vars for all configuration) but it did seem wrong to include code for all garbage collectors in distribution.

I'll change that once code in general will gets at least somewhat close to possibility of being merged.

@mihails-strasuns-sociomantic
Copy link
Contributor Author

btw, @MartinNowak do you have any hints about shared library support? What was needed to be changed in original GC to implement it other than proxy.d thing?

@leandro-lucarella-sociomantic
Copy link
Contributor

I recommend reading the blog in chronological order :)
http://www.llucax.com.ar/blog/blog/tag/understanding%20the%20current%20gc?sort=+date

BTW, that documents the Tango "basic" GC, which was the base GC used by druntime. The fork was done years ago, so there are some difference, but I think the basics are the same in both CDGC and druntime.

@leandro-lucarella-sociomantic
Copy link
Contributor

Now that we have druntime configuration options (#817) I would suggest to add this as an optional GC that can be selected at program start.

Oh, cool! BTW, I think at some point both GC's should be merged (this is what I wanted to do when I started my attempt to port the GC but it was unrealistic in terms of time). CDGC can already be configured to be concurrent or not at runtime via ENV vars.

@leandro-lucarella-sociomantic
Copy link
Contributor

I wonder if the code in that PR has any relation with the ENV var parser I wrote for CDGC, they look, at least in concept, mostly the same. I guess the opts module in CDGC can go now...

@MartinNowak
Copy link
Member

btw, @MartinNowak do you have any hints about shared library support? What was needed to be changed in original GC to implement it other than proxy.d thing?

GC proxy is complete BS, it is a crappy hack to avoid the most obvious ODR violations.
The only thing that I did on the GC for shared libraries was adding the runFinalizers method, which takes a segment and frees every object that depends on a finalizer in that segment.

* Authors: Leandro Lucarella <llucax@gmail.com>
*/

module gc.concurrent.dynarray;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There already is a module for arrays that doesn't use the GC, rt.util.container.array. They don't seem to have the exact same functionality but perhaps they can be merged into one implementation?

@mihails-strasuns-sociomantic
Copy link
Contributor Author

@jacob-carlborg thanks for you comments but.. I have already mentioned that sources are pending full rewrite to adjust to Phobos code style (currently using original one), I am afraid it wasn't most useful time investment :P

@MartinNowak thanks!

@jacob-carlborg
Copy link
Contributor

@mihails-strasuns-sociomantic right, I missed that. But why did you create a pull request in the first place?

@leandro-lucarella-sociomantic
Copy link
Contributor

On Wed, Oct 08, 2014 at 10:06:12AM -0700, jacob-carlborg wrote:

@mihails-strasuns-sociomantic right, I missed that. But why did you
create a pull request in the first place?

I can think of at least 2 reasons:

  1. So people can start testing it. Style is very important to get it
    merged but have 0 effect on testing.
  2. So people can start reviewing the most important aspects of the PR,
    for example what to do with shared libraries and the (in)famous
    proxy. Doing this with the code at hand is much easier :-)

@mihails-strasuns-sociomantic
Copy link
Contributor Author

But why did you create a pull request in the first place?

Because I have been asked to do so :) That way anyone can experiment with it or borrow some implementation ideas while it gets slowly tweaked to be merge-ready

Mihails Strasuns added 12 commits October 9, 2014 14:30
Moves default gc implementation to `gc.basic` and gcstub to `gc.stub`.
Build system uses `gc.basic` by default, run `make -f posix.mak GC_TYPE=stub` to pick stub one
Does not compile, files are copied from tango run-time as-is
In D2 runtime stack bottom data is provided by thread runtime and thus its needs
to be iniitalized before any relevant functions get called.
Forked process must avoid any of deinitialization - it triggers GC cleanup stage
in D2 runtime (it was added as part of shared library suppport). There is a
special C standard library function _Exit for that.
Includes all straightforward adjustments for different language
constructs, runtime difference and module names.

Can be built using `make -f posix.mak GC_TYPE=concurrent`
D2 runtime introduces new attributes GC must support
Adds malloc overload that returns actual allocation size
(which equals capacity at the point of allocation) as an out
parameter.

gc_qalloc updated to use that overload to fill the BlkInfo.size
so that druntime can write necessary metadata to the correct end
of block.
Pool update did not trigger cached size update if the pool
was cached
Probably just `shared` actually but making CDGC shared-correct is more advanced task
This is better to be fixed in druntime by moving
capacity to the end of block for all sizes. However
this simple hack similar to one present in default d2
gc is enough to pass tests.
CDGC is currently missing important bits of infrastructure
for shared library support. This temporarily disables the relevant
test case so that other ones can pass
Originally it was ended for early attempt of precise scanning
support but this approach has been reconsidered and if precise
scanning is to be added it will be done in a different way.
@leandro-lucarella-sociomantic
Copy link
Contributor

Just for reference: https://issues.dlang.org/show_bug.cgi?id=10184

@MartinNowak MartinNowak modified the milestones: 2.065, 2.067 Oct 10, 2014
It worked in D1 because of more permissive implicit conversions
src\gc\basic\bits.d \
src\gc\basic\stats.d \
src\gc\basic\proxy.d
endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The manifest should list all druntime files independent of any settings.

@MartinNowak
Copy link
Member

I tried our GC benchmark suite (fix runbench script by MartinNowak · Pull Request #998 · D-Programming-Language/druntime) and the results were not so good.
Some tests fail or never finish execution, probably due to memory corruptions.
Most of the ones that do pass take significantly longer with the concurrent GC than with the basic GC.

While the times might be caused by some bugs and can likely be improved this GC design is inherently slower than a pausing GC. Additionally to the fork overhead it has to copy memory pages whenever the concurrent program modifies them. So clearly this might be an interesting alternative GC for low latency applications but it achieves this at the cost of a higher CPU/memory bandwidth usage.
Other schemes like performing GC in idle times (vibe.d) or using a region allocator per server request might be equally interesting to achieve low latency.

@leandro-lucarella-sociomantic
Copy link
Contributor

On Sat, Oct 18, 2014 at 11:27:10AM -0700, Martin Nowak wrote:

I tried our GC benchmark suite (fix runbench script by MartinNowak · Pull Request #998 · D-Programming-Language/druntime) and the results were not so good.
Some tests fail or never finish execution, probably due to memory corruptions.
Most of the ones that do pass take significantly longer with the concurrent GC than with the basic GC.

While the times might be caused by some bugs and can likely be improved this GC design is inherently slower than a pausing GC. Additionally to the fork overhead it has to copy memory pages whenever the concurrent programs modifies them. So clearly this might be an interesting alternative GC for low latency applications but it achieves this at the cost of a higher CPU/memory bandwidth usage.

This does not match my testing (years ago, so maybe the basic GC in
druntime improved significantly since then). To my surprise the
concurrent GC did better for a real application (Dil, the only one that
was fairly maintained for D1 back then), and my guess is because the
program could keep working while the GC was working in another core. The
fork overhead is almost zero (unless you work with a very big heap,
basically the fork time is proportional to the page table). The COW
copying of course depends on the type of application, if the application
is writing on every single page very fast (before the mark phase is
done) you'll end up duplicating the whole memory. But I don't think
that's the most common case for regular application (even when there
might be a test case stressing this).

So, even when the current implementation could be buggy and suboptimal,
I really don't think the algorithm is inherently slow, and certainly not
slower than the basic one (in terms of CPU cycles, yes, is less
efficient, it definitely needs to do more work, but that work is
naturally parallelized in the concurrent GC, while in the basic one is
serialized with the applications work).

@MartinNowak
Copy link
Member

but that work is naturally parallelized in the concurrent GC, while in the basic one is
serialized with the applications work

True that, there is a good chance that many real world apps benefit from that and the low latency.
Will anyone work on fixing the bugs? I'd really like to have this as an optional GC for 2.067.

@mihails-strasuns-sociomantic
Copy link
Contributor Author

I'd really like to have this as an optional GC for 2.067

This is very unlikely. I have switched to other tasks related to D2 porting and was planning to get back to CDGC only when we have at least one internal service completely switched to D2. Until then investing more time into it would have been very impractical.

Of course if anyone else wants to contribute in the meanwhile, there will be no objections :P

@quickfur
Copy link
Member

Wow, totally missed this one! Looking forward to having this merge in the (not-so-near?) future!

@MartinNowak MartinNowak removed this from the 2.067 milestone Jan 23, 2015
@MartinNowak MartinNowak added the GC garbage collector label Mar 7, 2015
@DemiMarie
Copy link

Any chance on getting this merged?

@mihails-strasuns-sociomantic
Copy link
Contributor Author

Pretty much zero. I have explained it during last DConf talk but should have written a not here too. There are several issues with this PR:

  • GC in this PR is based on old GC implementation from Tango runtime, while existing upstream one have received many performance improvements
  • It isn't widely applicable as it traded worse throughput for better latency (by design) which is not advantageous for non real-time cases
  • There is still issue with global libc mutex which makes it almost unusable in multi-threaded code

So despite the fact I did this port as proof of concept to ensure it won't become migration blocker, it is now almost certain this port won't be used at all. Depending on performance profiling of our ported applications one of two likely approaches will be taken instead:

  • either reimplement similar concurrent fork-based behaviour on top of existing upstream GC instead
  • or throw away CDGC completely, switching real-time applications to fully manual std.allocator management and using stock GC for non-critical apps

This PR mostly remains as a reference of anyone curious for now.

@schveiguy
Copy link
Member

Per @mihails-strasuns-sociomantic, this won't be updated/merged, closing so the auto tester doesn't bother with it. Please reopen if things change.

@schveiguy schveiguy closed this Nov 23, 2015
@jacob-carlborg
Copy link
Contributor

Why not upstream this as an additional GC? Does Java have multiple GC to choose from?

@mihails-strasuns-sociomantic
Copy link
Contributor Author

Because there is no point in upstreaming something we don't use and thus won't maintain, it will quickly die from the bitrot (or waste someone else to maintain it). That would be impolite at best in my opinion.

@leandro-lucarella-sociomantic
Copy link
Contributor

It isn't widely applicable as it traded worse throughput for better latency (by design) which is not advantageous for non real-time cases

This is not entirely true, not always. If your application is not using all the cores, then the concurrent GC might speed up your application, because now the mark phase of the collection is using a core that wasn't used before, so your application get some free panellization.

@leandro-lucarella-sociomantic
Copy link
Contributor

But I generally agree this PR should stay closed for now.

@mihails-strasuns-sociomantic
Copy link
Contributor Author

Thanks for clarification.

llucax referenced this pull request in FraMecca/CDGC Aug 28, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
GC garbage collector
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants