Conversation

@mptrepanier (Contributor) commented Feb 17, 2020

Description

Prefixes the Spark NLP settings with "jsl."

Motivation and Context

The current sparknlp.settings. ... keys cause an exception to be thrown by Spark's validateSettings when the Spark NLP configuration is provided as a -D option in either spark.driver.extraJavaOptions or spark.executor.extraJavaOptions, as shown here:

https://github.com/apache/spark/blob/d8613571bc1847775dd5c1945757279234cb388c/core/src/main/scala/org/apache/spark/SparkConf.scala#L529

    getOption(executorOptsKey).foreach { javaOpts =>
      if (javaOpts.contains("-Dspark")) {
        val msg = s"$executorOptsKey is not allowed to set Spark options (was '$javaOpts'). " +
          "Set them directly on a SparkConf or in a properties file when using ./bin/spark-submit."
        throw new Exception(msg)
      }
      if (javaOpts.contains("-Xmx")) {
        val msg = s"$executorOptsKey is not allowed to specify max heap memory settings " +
          s"(was '$javaOpts'). Use spark.executor.memory instead."
        throw new Exception(msg)
      }
    }

Spark simply checks whether the options string contains -Dspark (which -Dsparknlp does) and errors out.
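Because the check is a plain substring match, any option starting with -Dspark trips it, while the jsl.-prefixed form passes. A minimal shell sketch of the same substring logic (the check_opts function here is illustrative, not Spark's actual API):

```shell
# Mimic Spark's validateSettings substring check on an extraJavaOptions string.
check_opts() {
  case "$1" in
    *-Dspark*) echo "rejected" ;;   # matches -Dspark... and also -Dsparknlp...
    *)         echo "accepted" ;;
  esac
}

check_opts '-Dsparknlp.settings.license=KEY'      # rejected: "-Dsparknlp" contains "-Dspark"
check_opts '-Djsl.sparknlp.settings.license=KEY'  # accepted: no "-Dspark" substring
```

This is why renaming the prefix from sparknlp. to jsl.sparknlp. is enough to dodge the validation without any Spark-side change.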

How Has This Been Tested?

Against both a GCP Dataproc cluster and a Zeppelin notebook running on top of Dataproc. Example settings config shown below:

-Djsl.sparknlp.settings.license="LICENSE_KEY" -Djsl.sparknlp.settings.pretrained.credentials.access_key_id="ACCESS_KEY" -Djsl.sparknlp.settings.pretrained.credentials.secret_access_key="SECRET_ACCESS_KEY" -Djsl.sparknlp.settings.pretrained.s3_bucket=auxdata.johnsnowlabs.com -Djsl.sparknlp.settings.pretrained.s3_socket_timeout=3600000

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • Code improvements with no or little impact
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING page.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

Additional Info:

IMPORTANT The sparknlp.settings.license references are not public. I believe these reside in the spark-nlp-license project and will need to be updated there.

@maziyarpanahi (Contributor)

@rohit-nlp I know you already fixed this in Enterprise. Do we really have to change everything in the public repo as well? That validation code in Apache Spark has been there for the past 4 years; does this mean that none of our configs has ever worked? Or does this only affect setting them with -D or the Spark config?

@rohit-nlp

@maziyarpanahi our configs have always worked because we used -DconfigFile= and the settings were loaded using Typesafe Config, not Spark. But when you pass a configuration parameter not through a properties or application conf file but directly as a Java property, such as -Dsparknlp.something, the Spark code reading these params will throw an exception at this line:

https://github.com/apache/spark/blob/d8613571bc1847775dd5c1945757279234cb388c/core/src/main/scala/org/apache/spark/SparkConf.scala#L529

Now, as the code looks for the "-Dspark" keyword, -Dsparknlp will also get caught by this check.

@maziyarpanahi (Contributor)

@rohit-nlp So there is no need to rename anything here, as users are supposed to pass those through -DconfigFile or application.conf?

@mptrepanier (Contributor, Author) commented Feb 17, 2020

It comes down to whether or not you want to support the -D config options path. In the latest release, that was highlighted as one of the methods for setting the Spark NLP config, but it currently errors out.

The main advantage of the -D config path is that it gives users an option to programmatically set the Spark NLP config without needing to host an external configuration file. If there are any distributed storage systems from which Spark NLP cannot read config files, this provides those users a workaround. As well, the --files method provided by Spark does not distribute the application.conf file to the driver node, so the method:

spark-submit --files "$LOCAL_PATH/application.conf" --conf spark.driver.extraJavaOptions="-Dconfig.file=file:application.conf" --conf spark.executor.extraJavaOptions="-Dconfig.file=file:application.conf"

will fail in cluster mode as JSL's access key validation occurs in the driver.

(For local deployments --master local[*], --files misleadingly will work as the driver and the executors are one.)
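With the jsl. prefix, the -D path works without distributing any file at all. A hedged sketch of what such a submit could look like, using the license key from the tested config above (app.jar and the single-key invocation are placeholders for illustration, not part of this PR):

```shell
# Pass a Spark NLP setting directly as a JVM option on both driver and
# executors; with the jsl. prefix, Spark's validateSettings no longer
# rejects the string as a forbidden "-Dspark" option.
spark-submit \
  --deploy-mode cluster \
  --conf spark.driver.extraJavaOptions="-Djsl.sparknlp.settings.license=LICENSE_KEY" \
  --conf spark.executor.extraJavaOptions="-Djsl.sparknlp.settings.license=LICENSE_KEY" \
  app.jar
```

No --files flag and no application.conf are needed, which sidesteps the driver-node distribution problem described above.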

@rohit-nlp

I agree with @mptrepanier: though there is a workaround with the application conf, we should support the -D option directly for all properties too. That --files does not work in cluster mode is really surprising to me, though I have not had time to set up a YARN cluster and check the issue. @maziyarpanahi, we have already changed the property in Enterprise from sparknlp.settings.* to jsl.settings.*; we can make the same change in public to be consistent.

@maziyarpanahi (Contributor)

That makes sense. @rohit-nlp, do you approve this PR then?

@mptrepanier (Contributor, Author)

> As well, the --files method provided by Spark does not distribute the application.conf file to the driver node, so the method:
>
> spark-submit --files "$LOCAL_PATH/application.conf" --conf spark.driver.extraJavaOptions="-Dconfig.file=file:application.conf" --conf spark.executor.extraJavaOptions="-Dconfig.file=file:application.conf"
>
> will fail in cluster mode as JSL's access key validation occurs in the driver.
>
> (For local deployments --master local[*], --files misleadingly will work as the driver and the executors are one.)

@rohit-nlp unable to replicate this.


4 participants