Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

check_254 "Core" failing during "M2 --check 2" on i386 #1834

Open
d-torrance opened this issue Jan 18, 2021 · 19 comments
Open

check_254 "Core" failing during "M2 --check 2" on i386 #1834

d-torrance opened this issue Jan 18, 2021 · 19 comments
Labels
build issue platform specific issues involving compiling M2, generating examples, or running tests

Comments

@d-torrance
Copy link
Member

This has happened twice so far (here and here) on continuous integration tests of the 1.17.1 package in Debian unstable for i386:

 -- capturing check(252, "Core")                                             -- 0.21414 seconds elapsed
 -- capturing check(253, "Core")                                             -- 3.12982 seconds elapsed
 -- capturing check(254, "Core")                                            

 *** out of memory trying to allocate 10000 bytes, exiting ***

 cd /tmp/M2-2214-0/1-rundir/; GC_MAXIMUM_HEAP_SIZE=400M "/usr/bin/M2-binary" -q --int --no-randomize --no-readline --silent --stop --print-width 129 <"/tmp/M2-2214-0/0.m2" >>"/tmp/M2-2214-0/0.tmp"
/tmp/M2-2214-0/0.tmp:0:1: (output file) error: Macaulay2 exited with status code 1
/tmp/M2-2214-0/0.m2:0:1: (input file)
M2: *** Error 1

It doesn't appear to be a problem when just running the test by itself, so I think it's probably just a result of having just run 254 other tests right before it! From an i386 chroot:

i1 : check_254 "Core"
 -- capturing check(254, "Core")                                             -- 0.000356972 seconds elapsed
@d-torrance
Copy link
Member Author

I tried adding -* no-capture-flag *- to that test, and it just ran out of memory in the next test!

 -- capturing check(254, "Core")                                             -- 0.137893 seconds elapsed
 -- capturing check(255, "Core")                                            

 *** out of memory trying to allocate 10000 bytes, exiting ***

 cd /tmp/M2-2006253-0/1-rundir/; GC_MAXIMUM_HEAP_SIZE=400M "/usr/bin/M2-binary" -q --int --no-randomize --no-readline --silent --stop --print-width 129 --srcdir "macaulay2-1.17.1+ds/M2/" <"/tmp/M2-2006253-0/0.m2" >>"/tmp/M2-2006253-0/0.tmp"
/tmp/M2-2006253-0/0.tmp:0:1: (output file) error: Macaulay2 exited with status code 1
/tmp/M2-2006253-0/0.m2:0:1: (input file)

Everything works if I add Core to the list of packages not to capture.

d-torrance added a commit to d-torrance/M2 that referenced this issue Jan 18, 2021
Otherwise, we run out of memory on i386.

Closes: Macaulay2#1834
@mahrud
Copy link
Member

mahrud commented Jan 18, 2021

Everything works if I add Core to the list of packages not to capture.

No, don't do this. It defies the point of having capture, because now a new M2 instance is loaded each time. After capture there's a call to collectGarbage, so this either has to do with those tests or there's a memory leak.

@mahrud
Copy link
Member

mahrud commented Jan 18, 2021

Also, what is the test? I can't seem to have any:

i3 : check "Core"
 -- warning: Core has no tests

i4 : check_254 "Core"
stdio:4:6:(3): error: key not found in hash table

@mahrud
Copy link
Member

mahrud commented Jan 18, 2021

Never mind, the tests directory isn't installed with the brew package, that's why. I'll fix it with the next release.

@mahrud
Copy link
Member

mahrud commented Jan 18, 2021

Does the autotools build install the tests directory? If so, where does it do that?

@d-torrance
Copy link
Member Author

Everything works if I add Core to the list of packages not to capture.

No, don't do this. It defies the point of having capture, because now a new M2 instance is loaded each time. After capture there's a call to collectGarbage, so this either has to do with those tests or there's a memory leak.

Fair enough. I've closed the PR. I may patch the Debian package if this ends up preventing migration to testing, but I don't think it will.

Also, what is the test?

i1 : check_254 "Core"
 -- capturing check(254, "Core")                                             -- 0.00545433 seconds elapsed

i2 : first Core#"test inputs"#254

o2 = /usr/share/doc/Macaulay2/Core/tests/rings.m2

However, I think the problem may be a bit sooner. This just seemed to be the straw that broke the camel's back.

@d-torrance
Copy link
Member Author

Does the autotools build install the tests directory? If so, where does it do that?

Yes:

@pre_docdir@:; @INSTALL@ -d $@
@pre_docdir@/Core: @pre_docdir@; @INSTALL@ -d $@
@pre_docdir@/Core/tests: @pre_docdir@/Core; @INSTALL@ -d $@
@pre_docdir@/Core/tests/%: % @pre_docdir@/Core/tests
@INSTALL_DATA@ $< $@
all: $(patsubst $(SRCDIR)/%.m2, @pre_docdir@/Core/tests/%.m2, $(wildcard $(SRCDIR)/*.m2))

@d-torrance
Copy link
Member Author

This was my attempt to make it work for the CMake build:

# install normal tests for "check Core"
file(MAKE_DIRECTORY ${M2_DIST_PREFIX}/${M2_INSTALL_DOCDIR}/Core/tests)
file(GLOB CORE_TESTS "${CMAKE_CURRENT_SOURCE_DIR}/normal/*.m2")
foreach(CORE_TEST ${CORE_TESTS})
file(COPY ${CORE_TEST} DESTINATION ${M2_DIST_PREFIX}/${M2_INSTALL_DOCDIR}/Core/tests/)
endforeach()

It seemed to work in my test builds, but I wouldn't be surprised if there's a better way!

@d-torrance
Copy link
Member Author

This also ended up happening for the armhf package. One test later, at check_255 "Core". It's also 32-bit.

@DanGrayson
Copy link
Member

I tried adding -* no-capture-flag *- to that test, and it just ran out of memory in the next test!

That's most easily explained by a memory leak in some of the tests. Eventually the memory gets full.

@d-torrance
Copy link
Member Author

I realized that 254 isn't actually the correct number for this test. Debian compresses large files that get installed to /usr/share/doc/, and so something like 30 of the Core tests are getting gzipped and therefore not included in check "Core" in the Debian package. I'll fix this on the next upload.

@mahrud
Copy link
Member

mahrud commented Jan 19, 2021

Well, repeatedly capturing 254 also has a memory leak. I ran it 1500 times, then killed it:
image

At the end I ran clearAll and collectGarbage, neither of them changed anything. I don't think this should happen. That test is barely doing anything.

@d-torrance
Copy link
Member Author

The same thing is happening for Cremona in an i386 chroot:

Cremona
*******
 -- capturing check(0, "Cremona")                                            -- 0.820287 seconds elapsed
 -- capturing check(1, "Cremona")                                            -- 17.4592 seconds elapsed
 -- capturing check(2, "Cremona")                                            -- 10.6633 seconds elapsed
 -- capturing check(3, "Cremona")                                            -- 8.34091 seconds elapsed
 -- capturing check(4, "Cremona")                                            -- 6.65764 seconds elapsed
 -- capturing check(5, "Cremona")                                            -- 1.31674 seconds elapsed
 -- capturing check(6, "Cremona")                                           

 *** out of memory trying to allocate 131108 bytes, exiting ***

 cd /tmp/M2-3992497-0/59-rundir/; GC_MAXIMUM_HEAP_SIZE=400M "/usr/bin/M2-binary" -q --int --no-randomize --no-readline --silent --stop --print-width 129 <"/tmp/M2-3992497-0/58.m2" >>"/tmp/M2-3992497-0/58.tmp"
/tmp/M2-3992497-0/58.tmp:0:1: (output file) error: Macaulay2 exited with status code 1
/tmp/M2-3992497-0/58.m2:0:1: (input file)
M2: *** Error 1

make check is fine, as is running check "Cremona" by itself inside Macaulay2. I'm only seeing this with M2 --check 3.

@mahrud
Copy link
Member

mahrud commented Jan 20, 2021

Ahh, you're not setting UserMode => false when calling capture. The default value of [check, UserMode] is null, and unless you set it to something, it'll read it from noinitfile, which is set based on whether -q is given. You set usermode to be false as input to runString, but somehow that's not making it all the way to check. I tested this by adding if usermode then error 0; in check:

[mahrud@noether build]$ ./M2 --check 2
Macaulay2, version 1.17.1.1
with packages: ConwayPolynomials, Elimination, IntegralClosure, InverseSystems,
               LLLBases, MinimalPrimes, PrimaryDecomposition, ReesAlgebra,
               Saturation, TangentCone
stdio:1:1:(3): error: 0

The way the --check option is implemented is too complicated, calling capture inside check inside runString, so it's hard to tell what's going wrong, but this change seems to fix it, and it might be good to do anyway since it also fixes #1787:

diff --git a/M2/Macaulay2/m2/testing.m2 b/M2/Macaulay2/m2/testing.m2
index c3967fe2f..2631b8bc0 100644
--- a/M2/Macaulay2/m2/testing.m2
+++ b/M2/Macaulay2/m2/testing.m2
@@ -60,7 +60,7 @@ loadTestDir := pkg -> (
     ) else pkg#"test directory loaded" = false;
 )
 
-check = method(Options => {UserMode => null, Verbose => false})
+check = method(Options => {UserMode => false, Verbose => false})
 check String  := opts -> pkg -> check(-1, pkg, opts)
 check Package := opts -> pkg -> check(-1, pkg, opts)
 
@@ -68,7 +68,6 @@ check(ZZ, String)  := opts -> (n, pkg) -> check(n, needsPackage (pkg, LoadDocume
 check(ZZ, Package) := opts -> (n, pkg) -> (
     if not pkg.Options.OptionalComponentsPresent then (
        printerr("warning: skipping tests; ", toString pkg, " requires optional components"); return);
-    usermode := if opts.UserMode === null then not noinitfile else opts.UserMode;
     --
     use pkg;
     if pkg#?"documentation not loaded" then pkg = loadPackage(pkg#"pkgname", LoadDocumentation => true, Reload => true);
@@ -81,7 +80,7 @@ check(ZZ, Package) := opts -> (n, pkg) -> (
     scan(tests, k -> (
            (filename, lineno, teststring) := pkg#"test inputs"#k;
            desc := "check(" | toString k | ", " | format pkg#"pkgname" | ")";
-           ret := elapsedTime captureTestResult(desc, teststring, pkg, usermode);
+           ret := elapsedTime captureTestResult(desc, teststring, pkg, opts.UserMode);
            if not ret then errorList = append(errorList,
                 (k, temporaryFilenameCounter - 2))));
     outfile := k -> temporaryDirectory() | toString k | ".tmp";

@mahrud
Copy link
Member

mahrud commented Jan 20, 2021

Hmm, I see that you pass -q on the command line, so maybe this isn't relevant. (why do you use M2 for printing to stderr there?)

@d-torrance
Copy link
Member Author

(why do you use M2 for printing to stderr there?)

Lol, laziness mostly. I wanted nice boxed headers to delineate the three stages of testing, and using netList seemed easier than doing it by hand.

@mahrud
Copy link
Member

mahrud commented Jan 23, 2021

I think we should move those alarm tests to a separate file and add no-capture-flag to it for now. Or maybe no-check-flag.

@d-torrance
Copy link
Member Author

I think we should move those alarm tests to a separate file and add no-capture-flag to it for now. Or maybe no-check-flag.

We're already avoiding capture for alarm:

and not match("(alarm|exec|exit|quit|restart|run)\\b", inputs) -- these commands interrupt the interpreter

d-torrance added a commit to d-torrance/M2 that referenced this issue Jan 23, 2021
@mahrud mahrud added the build issue platform specific issues involving compiling M2, generating examples, or running tests label Nov 19, 2021
@jkyang92
Copy link
Contributor

So I ran one of the tests from Core 100 times (probably Core/tests/resultant.m2:9), and I got the following graph showing the number of RawRingCell from engine.d that are allocated as a function of the total number of allocations in the interpreter. I used some code from https://github.com/jkyang92/M2/tree/memlog to generate the data for this graph and then used gnuplot to plot it. Also this is with a temporary hack in place to fix one of the possible culprits of #2330

core_test_ring_leak

This could in theory be a false positive if RawRingCell somehow ends up with a different finalizer on it, but there's a bunch of other types leaking, and so it's likely a real leak. In particular, we are somehow also leaking FunctionClosure and Symbol objects at a similar rate.

Unfortunately, my code isn't setup to determine why something is leaking, only that it is. With some patience I could give a complete list of all the types that are leaking.

Note: I was originally trying to run the test from this bug report but I didn't realize that the order of tests had changed. So I think this is what should be check_254 Core now but even this seems not entirely consistent.

Note also: The branch I linked above has unrelated tbb changes right now, because I need to be able to build it on my desktop

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
build issue platform specific issues involving compiling M2, generating examples, or running tests
Projects
None yet
Development

No branches or pull requests

4 participants