Conversation

@apiltamang (Contributor) commented Jun 12, 2018

Description

We recently tried johnsnowlabs-1.5.4 in a small proof of concept: running a pretrained model for NER tagging, with the data loaded from an Amazon S3 bucket (URL format: s3://bucket-name/bucket-file).

Expected Behavior

The data should load from the S3 servers, and model evaluation should happen as expected.

Current Behavior

The code errors out with the following stack trace:

java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManager.<init>(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecutor;)V
  at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:287)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
  at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:44)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
  at com.lucidworks.spark.job.sql.SparkSQLLoader$.loadInputDataFrame(SparkSQLLoader.scala:401)
  at com.lucidworks.spark.job.sql.SparkSQLLoader$.runLoadSave(SparkSQLLoader.scala:153)
  at com.lucidworks.spark.job.sql.SparkSQLLoader$.runLoaderJob(SparkSQLLoader.scala:135)
  ... 58 elided

Possible Solution

Our runtime Spark environment is 2.3.0, which depends on hadoop-2.7, which in turn depends on aws-java-sdk:1.7.4. However, spark-nlp references aws-java-sdk-s3:1.11.313. hadoop-aws 2.7 was compiled against the TransferManager(AmazonS3, ThreadPoolExecutor) constructor, which the 1.11.x SDK no longer provides, hence the NoSuchMethodError above. There appear to be at least two other open issues caused by this same version conflict.

I've rewritten the parts of spark-nlp that stream data from Amazon S3 so that they work with the aws-java-sdk-1.7.4 version.
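One quick way to confirm the conflict on a given classpath is a small reflection probe (a hedged sketch, not part of this PR; the SdkProbe name and its output strings are illustrative). It checks whether the constructor that hadoop-aws 2.7.x links against actually exists in whatever AWS SDK got loaded:

```scala
object SdkProbe {
  // Diagnose whether `className` is on the classpath and, if so, whether it
  // exposes a public constructor with exactly these parameter types.
  def probe(className: String, ctorParamNames: String*): String =
    try {
      val cls = Class.forName(className)
      val found = cls.getConstructors.exists(
        _.getParameterTypes.map(_.getName).toSeq == ctorParamNames)
      if (found) "constructor present"
      else "class found, but constructor missing (NoSuchMethodError at runtime)"
    } catch {
      case _: ClassNotFoundException => "class not on classpath"
    }

  def main(args: Array[String]): Unit =
    // hadoop-aws 2.7.x calls TransferManager(AmazonS3, ThreadPoolExecutor),
    // which exists in aws-java-sdk 1.7.4 but not in the 1.11.x S3 SDK.
    println(probe(
      "com.amazonaws.services.s3.transfer.TransferManager",
      "com.amazonaws.services.s3.AmazonS3",
      "java.util.concurrent.ThreadPoolExecutor"))
}
```

Pasting the probe into the affected spark-shell should report the constructor as missing when the 1.11.x SDK wins the classpath race.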

To reproduce

#1 Export your AWS credentials in your shell:

export AWS_ACCESS_KEY_ID=[your access key]
export AWS_SECRET_ACCESS_KEY=[your secret key]

#2 Start Spark with the following command (note: loading hadoop-aws 2.7.5, because Spark 2.3.0 comes prebuilt for Hadoop 2.7.x):

bin/spark-shell --packages JohnSnowLabs:spark-nlp:1.5.4,org.apache.hadoop:hadoop-aws:2.7.5

#3 Run the following command to try to download the parquet file:

var data = spark.read.format("parquet").load("s3a://[bucket-name]/[object-name]")

#4 Now build off of this PR and repeat. The data load should succeed.
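Why the s3a:// scheme matters in step #3: Hadoop picks the FileSystem implementation from the path's URI scheme via the config key fs.&lt;scheme&gt;.impl, and for s3a that resolves to org.apache.hadoop.fs.s3a.S3AFileSystem, the class at the top of the stack trace above. A minimal sketch of that lookup rule (the implConfigKey helper is illustrative, not a Hadoop API):

```scala
import java.net.URI

// Hadoop derives the FileSystem binding from the path scheme: it consults
// the configuration key "fs.<scheme>.impl" (e.g. fs.s3a.impl -> S3AFileSystem).
def implConfigKey(path: String): String =
  s"fs.${URI.create(path).getScheme}.impl"

println(implConfigKey("s3a://my-bucket/my-object.parquet"))  // fs.s3a.impl
```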

Your Environment

  • Version used: 1.5.4
  • Browser Name and version: Chrome 67.0
  • Operating System and version (desktop or mobile): OS X High Sierra

@saif-ellafi (Contributor) commented:
This is great news. Glad you found the reason. Let's see if someone in such environment can confirm the fix just to be sure.

@thelabdude commented:
To clarify the issue: if a Spark job wants to use the hadoop-aws dependency (i.e., S3AFileSystem) to load datasets from S3, it can't do that and use spark-nlp in the same job. You don't really need all the reproduction steps above ^ ...

@abatilo commented Jul 2, 2018

I just wanted to bump this issue. It's pretty problematic that I can't load data in from S3. It would be great if this could be merged soon.

@saif-ellafi (Contributor) commented:
Merged master into https://github.com/JohnSnowLabs/spark-nlp/tree/refactor_use_aws-java-sdk-1.7.4. Testing for merge.

@saif-ellafi saif-ellafi added this to the 1.5.5 milestone Jul 4, 2018
@saif-ellafi saif-ellafi merged commit 3bd99b0 into JohnSnowLabs:master Jul 4, 2018