fix: Guarantee one boosterPtr is allocated and freed per LightGBMBooster instance #792

JoanFM · 2020-01-31T16:18:36Z

I am not sure this is a fix, and not sure about the potential performance implications.

I think these changes could be helpful in solving memory issues that seem related to concurrency in spark such as microsoft/LightGBM#2392.

Please feel free to comment, is just a proposal but I am not 100% sure that it fixes these issues.

imatiach-msft · 2020-01-31T16:35:25Z

/azp run

azure-pipelines · 2020-01-31T16:35:36Z

Azure Pipelines successfully started running 1 pipeline(s).

codecov · 2020-01-31T16:45:38Z

Codecov Report

Merging #792 into master will decrease coverage by 13.34%.
The diff coverage is 0%.

@@             Coverage Diff             @@
##           master     #792       +/-   ##
===========================================
- Coverage   31.47%   18.13%   -13.35%     
===========================================
  Files         240      184       -56     
  Lines        9699     8483     -1216     
  Branches      528      528               
===========================================
- Hits         3053     1538     -1515     
- Misses       6646     6945      +299

Impacted Files	Coverage Δ
...crosoft/ml/spark/lightgbm/LightGBMClassifier.scala	`0% <ø> (ø)`	⬆️
.../microsoft/ml/spark/lightgbm/LightGBMBooster.scala	`0% <0%> (ø)`	⬆️
...crosoft/ml/spark/core/schema/SchemaConstants.scala	`100% <0%> (ø)`	⬆️
...ain/scala/org/apache/spark/ml/param/UDFParam.scala	`100% <0%> (ø)`	⬆️
...microsoft/ml/spark/core/schema/SparkBindings.scala	`0% <0%> (-100%)`	⬇️
...cala/org/apache/spark/ml/param/DataTypeParam.scala	`0% <0%> (-100%)`	⬇️
...n/scala/org/apache/spark/ml/param/UDPyFParam.scala	`100% <0%> (ø)`	⬆️
...ft/ml/spark/core/serialize/ConstructorWriter.scala	`100% <0%> (ø)`	⬆️
...cala/com/microsoft/ml/spark/io/binary/Binary.scala	`0% <0%> (-100%)`	⬇️
...spark/sql/types/injections/MetadataUtilities.scala	`100% <0%> (ø)`	⬆️
... and 113 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9c61053...9742891. Read the comment docs.

imatiach-msft · 2020-01-31T17:16:18Z

interesting, all of the regular tests are passing but the pyspark notebooks are all failing with errors like:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 956.0 failed 4 times, most recent failure: Lost task 0.3 in stage 956.0 (TID 12257, 10.139.64.7, executor 1): java.lang.NoClassDefFoundError: Could not initialize class com.microsoft.ml.lightgbm.lightgbmlibConstants

Py4JJavaError Traceback (most recent call last)
in ()
12 lg_predictions = lg_model.transform(lr_test_data)
13
---> 14 display(lg_predictions.limit(10).toPandas())

/databricks/spark/python/pyspark/sql/dataframe.py in toPandas(self)
2189
2190 # Below is toPandas without Arrow optimization.
-> 2191 pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
2192
2193 dtype = {}

/databricks/spark/python/pyspark/sql/dataframe.py in collect(self)
546 # Default path used in OSS Spark / for non-DF-ACL clusters:
547 with SCCallSiteSync(self._sc) as css:
--> 548 sock_info = self._jdf.collectToPython()
549 return list(_load_from_socket(sock_info, BatchedSerializer(PickleSerializer())))

imatiach-msft · 2020-01-31T17:16:35Z

seems to be some pickle issue

imatiach-msft · 2020-01-31T17:18:50Z

oooooh I see the issue. We are doing the init only on getting booster:

  lazy val boosterPtr: SWIGTYPE_p_void = {
    LightGBMUtils.initializeNativeLibrary()
    getBoosterPtrFromModelString(model)
  }

however it seems the init also needs to be done before we call anything that uses lightgbmlibConstants

JoanFM · 2020-01-31T17:19:56Z

oooooh I see the issue. We are doing the init only on getting booster:
  lazy val boosterPtr: SWIGTYPE_p_void = {
    LightGBMUtils.initializeNativeLibrary()
    getBoosterPtrFromModelString(model)
  }
however it seems the init also needs to be done before we call anything that uses lightgbmlibConstants

Where was it done before then?

imatiach-msft · 2020-01-31T17:24:57Z

we were just doing it at top of methods before using lightgbmlibConstants:

def score(features: Vector, raw: Boolean, classification: Boolean): Array[Double] = {
    // Reload booster on each node
    if (boosterPtr == null) {
      LightGBMUtils.initializeNativeLibrary()
      boosterPtr = getBoosterPtrFromModelString(model)
    }
    val kind =
      if (raw) lightgbmlibConstants.C_API_PREDICT_RAW_SCORE
      else lightgbmlibConstants.C_API_PREDICT_NORMAL

Note where we use lightgbmlibConstants.C_API_PREDICT_RAW_SCORE, lightgbmlibConstants.C_API_PREDICT_NORMAL

imatiach-msft · 2020-01-31T17:26:53Z

now we are potentially accessing lightgbmlibConstants after the booster was loaded

JoanFM · 2020-01-31T17:32:49Z

I see, I can try to find a way around this. But first I would like to be sure that you guys think this PR is a good contribution and if it could be the root of some memory issues me and some others have been experiencing.

…nly loaded once

JoanFM · 2020-01-31T22:02:32Z

Did some changes to try to guarantee that native library is loaded any time is needed

imatiach-msft · 2020-01-31T22:28:55Z

/azp run

azure-pipelines · 2020-01-31T22:29:06Z

Azure Pipelines successfully started running 1 pipeline(s).

src/main/scala/com/microsoft/ml/spark/lightgbm/LightGBMBooster.scala

…r changes. By boosterHandler not being serializable, it forces Serializable objects to declare it as transient

…rder of finalization so it is dangerous. I wonder how is it guaranteed that the memory will be cleared though

JoanFM · 2020-02-01T11:39:49Z

See benchmarking failure in tests for LightGBMRegressor and Classifier, It seems that the model trained differs, but I do not see how it can affect since LigthGBMBooster is created with a given trained model.

JoanFM · 2020-02-01T15:12:14Z

In the last commit I removed this finalize method because according to scalastyle and Java recommendation they should be avoided, but how is it that in current master there is a finalize for LightGBMBooster?.

override def finalize(): Unit = { if (scoredDataLengthLongPtr != null) lightgbmlib.delete_int64_tp(scoredDataLengthLongPtr) if (scoredDataOutPtr != null) lightgbmlib.delete_doubleArray(scoredDataOutPtr) if (leafIndexDataLengthLongPtr != null) lightgbmlib.delete_int64_tp(leafIndexDataLengthLongPtr) if (leafIndexDataOutPtr != null) lightgbmlib.delete_doubleArray(leafIndexDataOutPtr) if (boosterPtr != null) { LightGBMUtils.validate(lightgbmlib.LGBM_BoosterFree(boosterPtr), "Finalize Booster") } }

Some issues come to my mind, please correct me if some of these reflections are wrong or inaccurate, I am not very experienced with the JVM:

In relation to BoosterHandler, I think it should be us responsible from deleting it since LightGBM explicitly releases the unique_ptr when returning it in the c_api in LGBM_BoosterCreate. Also I do not think the JVM can have information about the undelrying type since it just sees a pointer to void.
When it comes to the dataPtr I also think we may need to make sure they are deleted. But I am not sure, maybe JVM has the capability to see all the instances. Also if mmlspark code is responsible fot this memory freeing, why are some other allocations not being taken care of? For instance in: saveBoosterToString.
Are JNI swigCPtr deleted by JVM? Isn't there a swigCMemOwn parameter to be set that can parametrize who is responsible of deletion of these swigCPtrs?
I think that if we apply this finalize protecting for a double free should work I think, but how could I pass the scalastyle issue?
Is there any other way to make sure JVM will clean it all?

Thank you very much for your help

chris-smith-zocdoc · 2020-02-01T17:02:14Z

In the last commit I removed this finalize method because according to scalastyle and Java recommendation they should be avoided

As a general practice yes, but they're perfect for working with native memory. I'd agree with this answer from stackoverflow

"You can use the finalizer to cleanup as a last resort, but, i would not use solely to manage the actual lifetime for the reasons you actually pointed out in your own question. As such, let the finalizer be the final attempt, but not the first."

Oracle also has a doc on the subject

but how is it that in current master there is a finalize for LightGBMBooster?.
I'm not sure about how scalastyle checks are setup in this repo, but you should be able to disable the check by adding a comment to disable the check on the finalize method http://www.scalastyle.org/configuration.html

In relation to BoosterHandler, I think it should be us responsible from deleting it since LightGBM explicitly releases the unique_ptr when returning it in the c_api in LGBM_BoosterCreate.

Yes, it is our job to call the methods to free any memory we have allocated that LightGBM has given us pointers for.

Also I do not think the JVM can have information about the undelrying type since it just sees a pointer to void.

The JVM is completely unaware of the unmanaged memory that LightGBM allocates. The JVM only tracks memory that it allocates itself (java objects)

why are some other allocations not being taken care of?

I don't know this for sure, but it might have to do with the design of the spark apis. If we need to return objects to spark for it to serialize between tasks, we wouldn't be able to free it beforehand. This is also why finalizers are useful.

Are JNI swigCPtr deleted by JVM? Isn't there a swigCMemOwn parameter to be set that can parametrize who is responsible of deletion of these swigCPtrs?

I'm not very familiar with swig, Ilya might have a better answer for you

I think that if we apply this finalize protecting for a double free should work I think, but how could I pass the scalastyle issue?

I think general best practice is to implement a dispose() and finalize() method, where finalize just calls dispose. This allows both the end user, and the GC to cleanup native memory. After the memory is freed, the pointers should be set to null so we cannot attempt to free the memory again.

For scalastyle, check the link for examples on how to disable

JoanFM · 2020-02-02T11:38:09Z

Thanks you very much @chris-smith-zocdoc and @imatiach-msft

I added back the finalize method calling a private freeNativeMemory. I did it private for now, because at this point I do not know if there is need for external calls to this method.

At the same time, I did not cover in this scope, I consider a little bit incoherent to ensure it is not null at delete time, but not when we try to use these pointers, although maybe I am unaware of some checks done under the hood.

I still get the same errors during test that find difficult to understand.

chris-smith-zocdoc · 2020-02-02T14:04:24Z

Which error are you seeing? Can you post the stacktrace?

JoanFM · 2020-02-02T14:33:55Z

I mean, test errors, but I do not see why

JoanFM · 2020-02-02T15:36:04Z

Which error are you seeing? Can you post the stacktrace?

I was seeing errors in the tests, but now I found the origin, wrongly ported a lightCBMLibConstant

imatiach-msft · 2020-02-03T15:47:37Z

/azp run

azure-pipelines · 2020-02-03T15:47:48Z

Azure Pipelines successfully started running 1 pipeline(s).

imatiach-msft · 2020-02-03T16:25:11Z

ping @mhamilton723 the cognitive tests still seem to be failing, even on other PRs

JoanFM · 2020-02-08T08:23:48Z

Hello @imatiach-msft, any news about the investigation of these issues? Thanks

mhamilton723

Really nice work @JoanFM :)

…ter instance (microsoft#792) * Move transient var to transient lazy val * Try guaranteeing NativeLibrary is loaded everytime it is needed but only loaded once * Add freeing of resources, handle all memory in BoosterHandler * Remove serialization from BoosterHandler, some documentation and minor changes. By boosterHandler not being serializable, it forces Serializable objects to declare it as transient * Style forbigs implementing finalize method, Java does not guarantee order of finalization so it is dangerous. I wonder how is it guaranteed that the memory will be cleared though * Add finalize method to call free for Native C++ allocated memory * predict normal correct constant

Move transient var to transient lazy val

a6d69eb

JoanFM requested a review from imatiach-msft as a code owner January 31, 2020 16:18

Try guaranteeing NativeLibrary is loaded everytime it is needed but o…

37da33b

…nly loaded once

JoanFM mentioned this pull request Jan 31, 2020

[mmlspark] Memory Corruption microsoft/LightGBM#2392

Closed

imatiach-msft previously approved these changes Jan 31, 2020

View reviewed changes

chris-smith-zocdoc reviewed Jan 31, 2020

View reviewed changes

src/main/scala/com/microsoft/ml/spark/lightgbm/LightGBMBooster.scala Outdated Show resolved Hide resolved

Forgot some lightGBMlibConstants

13059a7

JoanFM dismissed imatiach-msft’s stale review via 13059a7 January 31, 2020 23:38

JoanFM added 3 commits February 1, 2020 00:56

Add freeing of resources, handle all memory in BoosterHandler

9b15b73

Remove serialization from BoosterHandler, some documentation and mino…

b9b2b6e

…r changes. By boosterHandler not being serializable, it forces Serializable objects to declare it as transient

Style forbigs implementing finalize method, Java does not guarantee o…

64f6ada

…rder of finalization so it is dangerous. I wonder how is it guaranteed that the memory will be cleared though

Add finalize method to call free for Native C++ allocated memory

41fcd47

JoanFM changed the title ~~fix: Move transient var to transient lazy val~~ fix: Guarantee one boosterPtr is allocated and freed per LightGBMBooster instance Feb 2, 2020

predict normal correct constant

9742891

imatiach-msft approved these changes Feb 3, 2020

View reviewed changes

mhamilton723 approved these changes Feb 10, 2020

View reviewed changes

mhamilton723 merged commit 7b8efa5 into microsoft:master Feb 10, 2020

JoanFM deleted the concurrency branch February 10, 2020 18:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: Guarantee one boosterPtr is allocated and freed per LightGBMBooster instance #792

fix: Guarantee one boosterPtr is allocated and freed per LightGBMBooster instance #792

JoanFM commented Jan 31, 2020

imatiach-msft commented Jan 31, 2020

azure-pipelines bot commented Jan 31, 2020

codecov bot commented Jan 31, 2020 •

edited

imatiach-msft commented Jan 31, 2020

imatiach-msft commented Jan 31, 2020

imatiach-msft commented Jan 31, 2020

JoanFM commented Jan 31, 2020

imatiach-msft commented Jan 31, 2020

imatiach-msft commented Jan 31, 2020

JoanFM commented Jan 31, 2020

JoanFM commented Jan 31, 2020

imatiach-msft commented Jan 31, 2020

azure-pipelines bot commented Jan 31, 2020

JoanFM commented Feb 1, 2020

JoanFM commented Feb 1, 2020 •

edited

chris-smith-zocdoc commented Feb 1, 2020

JoanFM commented Feb 2, 2020

chris-smith-zocdoc commented Feb 2, 2020

JoanFM commented Feb 2, 2020

JoanFM commented Feb 2, 2020

imatiach-msft commented Feb 3, 2020

azure-pipelines bot commented Feb 3, 2020

imatiach-msft commented Feb 3, 2020

JoanFM commented Feb 8, 2020

mhamilton723 left a comment

fix: Guarantee one boosterPtr is allocated and freed per LightGBMBooster instance #792

fix: Guarantee one boosterPtr is allocated and freed per LightGBMBooster instance #792

Conversation

JoanFM commented Jan 31, 2020

imatiach-msft commented Jan 31, 2020

azure-pipelines bot commented Jan 31, 2020

codecov bot commented Jan 31, 2020 • edited

Codecov Report

imatiach-msft commented Jan 31, 2020

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 956.0 failed 4 times, most recent failure: Lost task 0.3 in stage 956.0 (TID 12257, 10.139.64.7, executor 1): java.lang.NoClassDefFoundError: Could not initialize class com.microsoft.ml.lightgbm.lightgbmlibConstants

imatiach-msft commented Jan 31, 2020

imatiach-msft commented Jan 31, 2020

JoanFM commented Jan 31, 2020

imatiach-msft commented Jan 31, 2020

imatiach-msft commented Jan 31, 2020

JoanFM commented Jan 31, 2020

JoanFM commented Jan 31, 2020

imatiach-msft commented Jan 31, 2020

azure-pipelines bot commented Jan 31, 2020

JoanFM commented Feb 1, 2020

JoanFM commented Feb 1, 2020 • edited

chris-smith-zocdoc commented Feb 1, 2020

JoanFM commented Feb 2, 2020

chris-smith-zocdoc commented Feb 2, 2020

JoanFM commented Feb 2, 2020

JoanFM commented Feb 2, 2020

imatiach-msft commented Feb 3, 2020

azure-pipelines bot commented Feb 3, 2020

imatiach-msft commented Feb 3, 2020

JoanFM commented Feb 8, 2020

mhamilton723 left a comment

Choose a reason for hiding this comment

codecov bot commented Jan 31, 2020 •

edited

JoanFM commented Feb 1, 2020 •

edited