Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Guarantee one boosterPtr is allocated and freed per LightGBMBooster instance #792

Merged
merged 8 commits into from Feb 10, 2020

Conversation

JoanFM
Copy link
Contributor

@JoanFM JoanFM commented Jan 31, 2020

I am not sure this is a fix, and not sure about the potential performance implications.

I think these changes could be helpful in solving memory issues that seem related to concurrency in spark such as microsoft/LightGBM#2392.

Please feel free to comment, is just a proposal but I am not 100% sure that it fixes these issues.

@imatiach-msft
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@codecov
Copy link

codecov bot commented Jan 31, 2020

Codecov Report

Merging #792 into master will decrease coverage by 13.34%.
The diff coverage is 0%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master     #792       +/-   ##
===========================================
- Coverage   31.47%   18.13%   -13.35%     
===========================================
  Files         240      184       -56     
  Lines        9699     8483     -1216     
  Branches      528      528               
===========================================
- Hits         3053     1538     -1515     
- Misses       6646     6945      +299
Impacted Files Coverage Δ
...crosoft/ml/spark/lightgbm/LightGBMClassifier.scala 0% <ø> (ø) ⬆️
.../microsoft/ml/spark/lightgbm/LightGBMBooster.scala 0% <0%> (ø) ⬆️
...crosoft/ml/spark/core/schema/SchemaConstants.scala 100% <0%> (ø) ⬆️
...ain/scala/org/apache/spark/ml/param/UDFParam.scala 100% <0%> (ø) ⬆️
...microsoft/ml/spark/core/schema/SparkBindings.scala 0% <0%> (-100%) ⬇️
...cala/org/apache/spark/ml/param/DataTypeParam.scala 0% <0%> (-100%) ⬇️
...n/scala/org/apache/spark/ml/param/UDPyFParam.scala 100% <0%> (ø) ⬆️
...ft/ml/spark/core/serialize/ConstructorWriter.scala 100% <0%> (ø) ⬆️
...cala/com/microsoft/ml/spark/io/binary/Binary.scala 0% <0%> (-100%) ⬇️
...spark/sql/types/injections/MetadataUtilities.scala 100% <0%> (ø) ⬆️
... and 113 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9c61053...9742891. Read the comment docs.

@imatiach-msft
Copy link
Contributor

interesting, all of the regular tests are passing but the pyspark notebooks are all failing with errors like:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 956.0 failed 4 times, most recent failure: Lost task 0.3 in stage 956.0 (TID 12257, 10.139.64.7, executor 1): java.lang.NoClassDefFoundError: Could not initialize class com.microsoft.ml.lightgbm.lightgbmlibConstants

Py4JJavaError Traceback (most recent call last)
in ()
12 lg_predictions = lg_model.transform(lr_test_data)
13
---> 14 display(lg_predictions.limit(10).toPandas())

/databricks/spark/python/pyspark/sql/dataframe.py in toPandas(self)
2189
2190 # Below is toPandas without Arrow optimization.
-> 2191 pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
2192
2193 dtype = {}

/databricks/spark/python/pyspark/sql/dataframe.py in collect(self)
546 # Default path used in OSS Spark / for non-DF-ACL clusters:
547 with SCCallSiteSync(self._sc) as css:
--> 548 sock_info = self._jdf.collectToPython()
549 return list(_load_from_socket(sock_info, BatchedSerializer(PickleSerializer())))

@imatiach-msft
Copy link
Contributor

seems to be some pickle issue

@imatiach-msft
Copy link
Contributor

oooooh I see the issue. We are doing the init only on getting booster:

  lazy val boosterPtr: SWIGTYPE_p_void = {
    LightGBMUtils.initializeNativeLibrary()
    getBoosterPtrFromModelString(model)
  }

however it seems the init also needs to be done before we call anything that uses lightgbmlibConstants

@JoanFM
Copy link
Contributor Author

JoanFM commented Jan 31, 2020

oooooh I see the issue. We are doing the init only on getting booster:

  lazy val boosterPtr: SWIGTYPE_p_void = {
    LightGBMUtils.initializeNativeLibrary()
    getBoosterPtrFromModelString(model)
  }

however it seems the init also needs to be done before we call anything that uses lightgbmlibConstants

Where was it done before then?

@imatiach-msft
Copy link
Contributor

we were just doing it at top of methods before using lightgbmlibConstants:

def score(features: Vector, raw: Boolean, classification: Boolean): Array[Double] = {
    // Reload booster on each node
    if (boosterPtr == null) {
      LightGBMUtils.initializeNativeLibrary()
      boosterPtr = getBoosterPtrFromModelString(model)
    }
    val kind =
      if (raw) lightgbmlibConstants.C_API_PREDICT_RAW_SCORE
      else lightgbmlibConstants.C_API_PREDICT_NORMAL

Note where we use lightgbmlibConstants.C_API_PREDICT_RAW_SCORE, lightgbmlibConstants.C_API_PREDICT_NORMAL

@imatiach-msft
Copy link
Contributor

now we are potentially accessing lightgbmlibConstants after the booster was loaded

@JoanFM
Copy link
Contributor Author

JoanFM commented Jan 31, 2020

I see, I can try to find a way around this. But first I would like to be sure that you guys think this PR is a good contribution and if it could be the root of some memory issues me and some others have been experiencing.

@JoanFM
Copy link
Contributor Author

JoanFM commented Jan 31, 2020

Did some changes to try to guarantee that native library is loaded any time is needed

@imatiach-msft
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

imatiach-msft
imatiach-msft previously approved these changes Jan 31, 2020
…r changes. By boosterHandler not being serializable, it forces Serializable objects to declare it as transient
…rder of finalization so it is dangerous. I wonder how is it guaranteed that the memory will be cleared though
@JoanFM
Copy link
Contributor Author

JoanFM commented Feb 1, 2020

See benchmarking failure in tests for LightGBMRegressor and Classifier, It seems that the model trained differs, but I do not see how it can affect since LigthGBMBooster is created with a given trained model.

@JoanFM
Copy link
Contributor Author

JoanFM commented Feb 1, 2020

In the last commit I removed this finalize method because according to scalastyle and Java recommendation they should be avoided, but how is it that in current master there is a finalize for LightGBMBooster?.

override def finalize(): Unit = { if (scoredDataLengthLongPtr != null) lightgbmlib.delete_int64_tp(scoredDataLengthLongPtr) if (scoredDataOutPtr != null) lightgbmlib.delete_doubleArray(scoredDataOutPtr) if (leafIndexDataLengthLongPtr != null) lightgbmlib.delete_int64_tp(leafIndexDataLengthLongPtr) if (leafIndexDataOutPtr != null) lightgbmlib.delete_doubleArray(leafIndexDataOutPtr) if (boosterPtr != null) { LightGBMUtils.validate(lightgbmlib.LGBM_BoosterFree(boosterPtr), "Finalize Booster") } }

Some issues come to my mind, please correct me if some of these reflections are wrong or inaccurate, I am not very experienced with the JVM:

  • In relation to BoosterHandler, I think it should be us responsible from deleting it since LightGBM explicitly releases the unique_ptr when returning it in the c_api in LGBM_BoosterCreate. Also I do not think the JVM can have information about the undelrying type since it just sees a pointer to void.

  • When it comes to the dataPtr I also think we may need to make sure they are deleted. But I am not sure, maybe JVM has the capability to see all the instances. Also if mmlspark code is responsible fot this memory freeing, why are some other allocations not being taken care of? For instance in: saveBoosterToString.

  • Are JNI swigCPtr deleted by JVM? Isn't there a swigCMemOwn parameter to be set that can parametrize who is responsible of deletion of these swigCPtrs?

  • I think that if we apply this finalize protecting for a double free should work I think, but how could I pass the scalastyle issue?

  • Is there any other way to make sure JVM will clean it all?

Thank you very much for your help

@chris-smith-zocdoc
Copy link
Contributor

In the last commit I removed this finalize method because according to scalastyle and Java recommendation they should be avoided

As a general practice yes, but they're perfect for working with native memory. I'd agree with this answer from stackoverflow

"You can use the finalizer to cleanup as a last resort, but, i would not use solely to manage the actual lifetime for the reasons you actually pointed out in your own question. As such, let the finalizer be the final attempt, but not the first."

Oracle also has a doc on the subject

but how is it that in current master there is a finalize for LightGBMBooster?.
I'm not sure about how scalastyle checks are setup in this repo, but you should be able to disable the check by adding a comment to disable the check on the finalize method http://www.scalastyle.org/configuration.html

In relation to BoosterHandler, I think it should be us responsible from deleting it since LightGBM explicitly releases the unique_ptr when returning it in the c_api in LGBM_BoosterCreate.

Yes, it is our job to call the methods to free any memory we have allocated that LightGBM has given us pointers for.

Also I do not think the JVM can have information about the undelrying type since it just sees a pointer to void.

The JVM is completely unaware of the unmanaged memory that LightGBM allocates. The JVM only tracks memory that it allocates itself (java objects)

why are some other allocations not being taken care of?

I don't know this for sure, but it might have to do with the design of the spark apis. If we need to return objects to spark for it to serialize between tasks, we wouldn't be able to free it beforehand. This is also why finalizers are useful.

Are JNI swigCPtr deleted by JVM? Isn't there a swigCMemOwn parameter to be set that can parametrize who is responsible of deletion of these swigCPtrs?

I'm not very familiar with swig, Ilya might have a better answer for you

I think that if we apply this finalize protecting for a double free should work I think, but how could I pass the scalastyle issue?

I think general best practice is to implement a dispose() and finalize() method, where finalize just calls dispose. This allows both the end user, and the GC to cleanup native memory. After the memory is freed, the pointers should be set to null so we cannot attempt to free the memory again.

For scalastyle, check the link for examples on how to disable

@JoanFM JoanFM changed the title fix: Move transient var to transient lazy val fix: Guarantee one boosterPtr is allocated and freed per LightGBMBooster instance Feb 2, 2020
@JoanFM
Copy link
Contributor Author

JoanFM commented Feb 2, 2020

Thanks you very much @chris-smith-zocdoc and @imatiach-msft

I added back the finalize method calling a private freeNativeMemory. I did it private for now, because at this point I do not know if there is need for external calls to this method.

At the same time, I did not cover in this scope, I consider a little bit incoherent to ensure it is not null at delete time, but not when we try to use these pointers, although maybe I am unaware of some checks done under the hood.

I still get the same errors during test that find difficult to understand.

@chris-smith-zocdoc
Copy link
Contributor

Which error are you seeing? Can you post the stacktrace?

@JoanFM
Copy link
Contributor Author

JoanFM commented Feb 2, 2020

I mean, test errors, but I do not see why

@JoanFM
Copy link
Contributor Author

JoanFM commented Feb 2, 2020

Which error are you seeing? Can you post the stacktrace?

I was seeing errors in the tests, but now I found the origin, wrongly ported a lightCBMLibConstant

@imatiach-msft
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@imatiach-msft
Copy link
Contributor

ping @mhamilton723 the cognitive tests still seem to be failing, even on other PRs

@JoanFM
Copy link
Contributor Author

JoanFM commented Feb 8, 2020

Hello @imatiach-msft, any news about the investigation of these issues? Thanks

Copy link
Collaborator

@mhamilton723 mhamilton723 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice work @JoanFM :)

@mhamilton723 mhamilton723 merged commit 7b8efa5 into microsoft:master Feb 10, 2020
@JoanFM JoanFM deleted the concurrency branch February 10, 2020 18:55
ocworld pushed a commit to AhnLab-OSS/mmlspark that referenced this pull request Mar 24, 2020
…ter instance (microsoft#792)

* Move transient var to transient lazy val

* Try guaranteeing NativeLibrary is loaded everytime it is needed but only loaded once

* Add freeing of resources, handle all memory in BoosterHandler

* Remove serialization from BoosterHandler, some documentation and minor changes. By boosterHandler not being serializable, it forces Serializable objects to declare it as transient

* Style forbigs implementing finalize method, Java does not guarantee order of finalization so it is dangerous. I wonder how is it guaranteed that the memory will be cleared though

* Add finalize method to call free for Native C++ allocated memory

* predict normal correct constant
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants