Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Track Chapel Bugs #9

Open
3 of 6 tasks
e-kayrakli opened this issue Jun 19, 2017 · 18 comments
Open
3 of 6 tasks

Track Chapel Bugs #9

e-kayrakli opened this issue Jun 19, 2017 · 18 comments

Comments

@e-kayrakli
Copy link
Collaborator

e-kayrakli commented Jun 19, 2017

This is a meta-issue to keep a TODO list of Chapel implementation bugs we found/reported:

Issue now reported: chapel-lang/chapel#6536

Issue: chapel-lang/chapel#6538

Issue: chapel-lang/chapel#6542

  • Class instances passed with explicit ref intent causes incorrect types during code generation.

Issue: chapel-lang/chapel#6642 -- only adresses the problem partially
SHA: 205021e
Notes: Run make FCHQueue

  • --cache-remote fails assertion with FIFO queue

SHA: 51dd86e
Notes: Run with make Benchmark EXTRAFLAGS='--cache-remote -sisFIFO=true' then bin/main.exe -nl 2 --nElements=1000
Output:

ATP Stack walkback for Rank 1 starting:
  [empty]@0xffffffffffffffff
  qthread_wrapper$$CFE_id_58403e1c_qlib@qthread.c:2299
  chapel_wrapper$$CFE_id_70d80183_ed8f0e2b@tasks-qthreads.c:814
  wrapcoforall_fn_chpl4$$CFE_id_569285a2_9b0c5384@benchmark.chpl:122
  coforall_fn_chpl4$$CFE_id_569285a2_9b0c5384@benchmark.chpl:126
  chpl_gen_comm_get$$CFE_id_569285a2_9b0c5384@chpl-comm-compiler-macros.h:54
  chpl_cache_comm_get@chpl-cache.c:2836
  cache_get$$CFE_id_2835e7b6_d541bb58@chpl-cache.c:2297
  find_in_tree$$CFE_id_2835e7b6_d541bb58@chpl-cache.c:1434
  bottom_list_search$$CFE_id_2835e7b6_d541bb58@chpl-cache.c:1011
  list_search$$CFE_id_2835e7b6_d541bb58@chpl-cache.c:988
ATP Stack walkback for Rank 1 done
Process died with signal 11: 'Segmentation fault'
View application merged backtrace tree with: stat-view atpMergedBT.dot
You may need to: module load stat
  • Returning arrays results in an internal/syntax error

TIO

Recommended reading:
https://github.com/chapel-lang/chapel/blob/master/doc/rst/developer/bestPractices/TestSystem.rst

The whole document is important as you'll be developing correctness tests that can be used in Chapel's testing infrastructure. Sub-header "Futures" some way down in the document is particularly useful for reporting bugs: you create an issue and ideally create a PR adding a "future" to the test system to track down the issue in nightly testing)

@e-kayrakli
Copy link
Collaborator Author

None of these issues at this point are pressing for us or in general for Chapel, I think. I can take care of them when I am looking for something lightweight. Feel free to do it yourself, if you wish. Could be a good practice in contributing to Chapel.

@LouisJenkinsCS
Copy link
Owner

LouisJenkinsCS commented Jun 19, 2017

@e-kayrakli

Uh... Engin... you might want to have a look at the first one again. I got the same issue with initFn but this time, in one of my files. I pinpointed the issue to enum being improperly eliminated. The MWE is as such:

EmptyEnum.chpl

enum CausesA { CRASH }

DriverProgram.chpl

proc main() {}

I apologize if it isn't in the format specified in the document you linked (felt I had to immediately report it). This bug seems rather significant. I'm also assuming in BigInteger, it is caused by this being eliminated.

Edit: According to this, it doesn't make any sense. You can add all you want to the module (in fact it triggered from a module with like ~100 lines of code), and it even triggers in BigInteger (presumably), and so it doesn't make sense for this to be true: mod->block->body.length == 1. It'd be interesting to look more into it, so if you still think its easy enough for me to do, I can see about doing so. It'd be interesting and helpful anyway.

@e-kayrakli
Copy link
Collaborator Author

This bug seems rather significant.

Probably more significant than I thought, but still one thing that is not clear to me is that why you would compile these two files together? i.e. useing the module with single enum from the main module seems to cause compiler to add the init function, getting rid of the bug.

I think you are still right that this behavior is awkward, but still seems to be a corner case that has no implication on real world use. Am I missing something?

so it doesn't make sense for this to be true: mod->block->body.length == 1

Not so sure about that. Compiler may be creating a file scope module that wraps the BigInteger module. In which case, the block of the file scope module would have a single AST node (module BigInteger)

It'd be interesting to look more into it, so if you still think its easy enough for me to do, I can see about doing so. It'd be interesting and helpful anyway.

I am not sure if I am understanding right. But I wouldn't want you to try to fix this. What you can do is to create an issue/future for the team to solve this (or any similar future problem). Compiler work is never as easy as it may seem. Further, for this case, the comment before the assertion shows that this assertion was inserted there to close another edge case bug which triggers a segfault.

Two questions:

  • Does this block you in any way in your work? Remember that you are trying to compile a file without using it -- that leads me to believe that this can be worked around with simple Makefile modification.
  • Do you want to create an issue/future for this, do you want me to do it?

@LouisJenkinsCS
Copy link
Owner

Does this block you in any way in your work? Remember that you are trying to compile a file without using it -- that leads me to believe that this can be worked around with simple Makefile modification.

This more or less comes up as an obstacle/road-block but its easy enough to work around it, just requires a bit more effort. I can create a makefile which only builds specific files, but I liked the feature of compiling every single file in my repository and just have the compiler discard the ones I'm currently not using. I was debating on whether or not having a makefile per sub-directory would be best and just call it recursively or have a single global makefile which builds only specific files based on arguments passed, but when I move/rename/add files I have to make the proper adjustments. Which do you think I should use?

Do you want to create an issue/future for this, do you want me to do it?

I think if you submitted it (but linked me to it) so I can see how to do it myself next time, it'd be best.

@e-kayrakli
Copy link
Collaborator Author

This more or less comes up as an obstacle/road-block but its easy enough to work around it, just requires a bit more effort. I can create a makefile which only builds specific files, but I liked the feature of compiling every single file in my repository and just have the compiler discard the ones I'm currently not using.

It looks like this is something you probably need to accept :( Even if we open the issue, there is no guarantee it will be solved right away.

I think if you submitted it (but linked me to it) so I can see how to do it myself next time, it'd be best.

Will try to do it this week.

@LouisJenkinsCS
Copy link
Owner

LouisJenkinsCS commented Jun 22, 2017

Hey @e-kayrakli , I found another bug that relates to type inference from the compiler... Although to be sure, I'd need to figure out how to query more information about the type. Currently, it seems much more like undefined behavior and seems difficult to produce a MWE for. It has to do with generic inheritance and polymorphism/dynamic dispatch. I'm not sure precisely why.

I'm wondering if by a glance you can pinpoint the issue...

So, here is the declaration of the field. If I change it to the following:

var localQueues : [cyclicDomain] Queue(eltType);

It will produce undefined behavior. In one case, it ended up looping infinitely here, which if this occurs then uncommenting this line, it works fine. This was however when I had use my global makefile script.

Later, when I decided to make the necessary changes so that I can run only the required files, seen here, the behavior changed... this time, the returned value is garbage. It produces random values, which definitely is undefined behavior. As before, uncommenting this line fixes the issue.

I kind of need to make this change so I can inject different kind of queues into benchmarks and tests, the base class seen here.

Finally, to put the final nail in the coffin... here, if we change it to...

var boardQueue : Queue(26 * int) = new DistributedFIFOQueue(26 * int);

It says:

queues/Queue.chpl:5: error: halt reached - Enqueue Not Overloaded

To Build Run: make NQueens

Note: Bug occurs with and without --fast flag

@e-kayrakli
Copy link
Collaborator Author

For the race condition:

Are you sure you don't have a race in trunk as of now. I checked the line initialize a CCQueue. When I run it on my machine within a bash loop with 10 iterations or so it always deadlocks..

Looking into dynamic dispatch issue...

@e-kayrakli
Copy link
Collaborator Author

Dynamic dispatch:

Your formals are not identical. See 15.2.5 in specs:

If a method in a derived class is declared with a signature identical to that of a method in a base class, then it is said to override the base class’s method. Such a method is a candidate for dynamic dispatch in the event that a variable that has the base class type references an object that has the derived class type.
The identical signature requires that the names, types, and order of the formal arguments be identical. The return type of the overriding method must be the same as the return type of the base class’s method, or must be a subclass of the base class method’s return type.

@LouisJenkinsCS
Copy link
Owner

LouisJenkinsCS commented Jun 22, 2017

Are you sure you don't have a race in trunk as of now. I checked the line initialize a CCQueue. When I run it on my machine within a bash loop with 10 iterations or so it always deadlocks..

You mean using the main program for CCQueue? I've reran the program multiple times. Furthermore, it is as identical to the publication's CC-Synch as you can get, and I had no issues with the billions of operations being run with it up until now.

If a method in a derived class is declared with a signature identical to that of a method in a base class, then it is said to override the base class’s method.

CCQueue.enqueue matches Queue.enqueue. CCQueue.dequeue matches Queue.dequeue.

Furthermore, it does not make sense for as to why inside of DistributedFIFOQueue.dequeue it causes undefined behavior. You can rerun NQueens with --logBoard 1 flag to see the boards being returned. My mistake, realized I didn't wrap it in logBoard conditional block, so you'd see it anyway.

@LouisJenkinsCS
Copy link
Owner

LouisJenkinsCS commented Jun 22, 2017

For emphasis, if you see deadlock, try it both when you change the type of the queue to Queue and then to CCQueue, furthermore when its Queue and writeln(retval) is uncommented. Again as well, if it failed dynamic dispatch, it would halt, but instead it deadlocks.

@LouisJenkinsCS
Copy link
Owner

Also, I realized now that cclock is still using first class functions. I never changed it because I need to hard-code it to a specific use-case (like CCQueue) and it would take a while so I just left it as is since it seemed to work well enough. Still possible its causing potential issues if it recursively checks each field of a class and fails due to it or something.

@e-kayrakli
Copy link
Collaborator Author

You mean using the main program for CCQueue?

No, I mean NQueens.

make NQueens && for t in `seq 1 10`; do ./a.out; done

Deadlocks on trunk

There are two separate issues: mismatching function signatures causing dynamic dispatch to not work, and a possible deadlock in NQueens. Uncommenting writeln also uncomments multiple synchronizations which make your code go through.

CCQueue.enqueue matches Queue.enqueue. CCQueue.dequeue matches Queue.dequeue.

Aren't you trying to do dynamic dispatch between Queue and DistributedFIFOQueue?

@LouisJenkinsCS
Copy link
Owner

NQueens itself does not contain a deadlock, as it only spins like that inside of DistributedFIFOQueue, meaning the deadlock would be inside DistributedFIFOQueue. I do see now that there is a mismatch in method names for the method parameter names, and I fixed that, which does fix one potential issue.

Uncommenting writeln also uncomments multiple synchronizations which make your code go through.

I'm certain it isn't a race condition at all. If you uncomment the line, the line after it states that it is spinning, it works. This added synchronization from writeln would never be reached since it falls through correctly (meaning the line to write that it is spinning is never reached), so it wouldn't factor in at runtime (unless the compiler does something that does effect it at runtime). As well, this only happens if you have the type of localQueues as [cyclicDomain] Queue(eltType) instead of [cyclicDomain] CCQueue(eltType).

@LouisJenkinsCS
Copy link
Owner

Hey Engin, in terms of removing first class functions in it's entirety (and to get it out of the way, with hopeful benefit of eliminating that undefined behavior bug), I had an idea...

Classes as First Class Function Objects

This is separate from the idea of functors in functional programming (that requires working first class functions :P), but more like C++'s functional objects. The idea is to have something similar to Java's functional interface, where you define an implementation of an interface (although I'm curious: are anonymous objects supported? Are they stable to use if so?).

(I'm just teasing btw, Chapel is an amazing language, can't wait to help it grow more!)

class BaseFunctor {
  // Needed because it is created before it is passed the actual, so can't have
  // compiler infer type.
  type dataType;

  // Perform some operation in mutual exclusion (guaranteed by CCLock).
  // Is there a better way to manage types? The return type can vary
  // from operation to operation; I.E: Enqueue returns void normally,
  // a Dequeue return void normally. However, CCNode requires a dummy node, and so
  // we don't know the actual type at initialization time...
  proc op(data : dataType);
}

As noted above, this is the 'Chicken-Or-The-Egg' problem. The return type is determined by the implementer, but we require a dummy node with space allocated of the same type. Any generic function requires us to know the type, we can't just query it (which the prime issue, what if we don't know? Why can't we just ask it knowing it has to know it???).

For CCLock, we want to take a request functor that handles returning whatever it wants, including primitives and tuples, and store it in a node for return. I know why we can't just do what we want because the size of the node depends on the type of the node, and so since we need to know the actual type of the node we need to know the type of the next node, and so on and so forth... falls apart fast

Usage

var lock$ : cclock(dataType);
// Custom operator! Pretty cool looking! This would make cclock a higher higher order function!
lock$ >>= new BaseFunctor(dataType) { proc op(data) { ... } };

Where lock$ would call BaseFunctor(dataType).op(data) etc.

Alternatives

Inline Logic

An example of the equivalent code that is inlined...

// Code used to do action as combiner thread goes here...
...
// Code that is specific to our logic...
doSomethingWith(data);
// Rest of code to complete action, etc.
...

Imagine that copy-pasted all over the place... Bleh!

Push idea back until after evaluations

This is something of which, at this point, I guess I can do if the above isn't possible... I'd need to rerun and re-tune benchmarks again to collect data, but its saves a ton of time.

@LouisJenkinsCS
Copy link
Owner

LouisJenkinsCS commented Jun 22, 2017

Hm, actually perhaps I should just go ahead and use sync vars for now. CCLock would be an optimization in and of itself, so I can insert/inject that in after I know I have everything else.

Edit:

Okay, so replacing it with SyncQueue does not improve it at all... so I know it isn't from first class functions, etc. It also goes away if I specify it as child type, SyncQueue, but is present (like before) if it is defined with parent type Queue

Edit2:

I see it now... it is halting! Wow... unexpected... This is actually... really bad... I can't believe there is no way to determine exactly where all tasks locally are. Its not a sync var deadlock as I used -b flag, and nothing. I'm analyzing all while loops, I'll try to see if I can pinpoint it tomorrow.

Edit3:

Installing Windows Driver Kit overnight so I can analyze the program itself. Might also have to know how to generate C the code as well (can I have a reminder?). Going to have to do a lot of debugging this weekend if it can't be solved by tomorrow.

Edit4:

I have an exam soon but could not resist to test again... confirmed it is not NQueens. I literally have a busy loop and one of the issues (where I declared as parent type in DistributedFIFOQueue) shows up, and a new one (where I get internal segfault during compilation when I assign DistributedFIFOQueue to a variable with parent type Queue). More testing results as I go on...

Edit5:

While the original issues are present (regarding dynamic dispatch), the deadlock issue definitely seems to be specific to NQueens.

@LouisJenkinsCS
Copy link
Owner

LouisJenkinsCS commented Jun 24, 2017

Seems the issue has been fixed by changing the domain to be 0-based, Thanks Engin!

Hm, nevermind, race condition is still there...

Edit:

So, now I get a compiler internal segfault trying to compile it... might have to spend rest of day trying to figure out a decent MWE.

Output:

LouisJenkinsCS@MSI:/mnt/c/Github/Distributed_Queue$ make Benchmark
chpl --devel --fast queues/Queue.chpl queues/local/SyncQueue.chpl queues/local/CCQueue.chpl \
        benchmark/cclock.chpl queues/distributed/DistributedFIFOQueue.chpl queues/distributed/DistributedQueue.chpl \
        benchmark/benchmark.chpl misc/LocalAtomicObject.chpl \
        -o bin/main.exe --main-module benchmark
internal error: seg fault [misc.cpp:558]
make: *** [Benchmark] Error 1

It happens on both 1.16 pre-release (master) and 1.15 (release).

This is running the original benchmark. The only change I made was added Queue as parent type. Okay this didn't cause it... gotta investigate to find out what.

Edit 2:

Issue seemed to resolve itself when I removed one of the queues that was derived from the same parent.. at least I know where to start.

@LouisJenkinsCS
Copy link
Owner

I'm thinking... I know you mentioned a lot I should try to keep everything self-contained (especially for CCLock, of which I'll agree 100%), and I'm thinking that... maybe I should keep all of the queues self contained? I know you mentioned this before, but I also mean for shared data structures. With the exception of LocalAtomicObject, I was thinking of having an implementation of SyncQueue and CCQueue for each Distributed*Queue, which feels redundant and counter-intuitive, but if it will reduce the chances of bugs like these occurring and with a potential performance boost, I'll do it.

@e-kayrakli
Copy link
Collaborator Author

Does this latest bug still persist? Let me know if you want me to take a closer look.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants