Refactor library to use AWS-sdk-java 1.7.4 #222

apiltamang · 2018-06-12T21:42:48Z

Description

We recently tried to use johnsnowlabs-1.5.4 for a small proof of concept with a pretrained model for NER tagging. We were also trying to load the data from an Amazon S3 bucket (URL format: s3://bucket-name/bucket-file).

Expected Behavior

The data should load from the S3 servers, and model evaluation should happen as expected.

Current Behavior

The code fails with the following stack trace:

java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManager.<init>(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecutor;)V
  at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:287)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
  at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:44)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
  at com.lucidworks.spark.job.sql.SparkSQLLoader$.loadInputDataFrame(SparkSQLLoader.scala:401)
  at com.lucidworks.spark.job.sql.SparkSQLLoader$.runLoadSave(SparkSQLLoader.scala:153)
  at com.lucidworks.spark.job.sql.SparkSQLLoader$.runLoaderJob(SparkSQLLoader.scala:135)
  ... 58 elided

Possible Solution

Our runtime Spark environment is 2.3.0, which depends on hadoop-2.7, which in turn depends on aws-java-sdk:1.7.4. However, johnsnow references aws-java-sdk-s3:1.11.313, which results in the above NoSuchMethodError. There seem to be at least two other open issues with the same cause.

I've rewritten the parts of Johnsnow that allow it to stream data from Amazon S3 so that they use aws-java-sdk 1.7.4.
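
One way to confirm the version clash is to list the AWS SDK jars that end up next to each other. The snippet below is only a sketch: list_aws_sdk_jars is a hypothetical helper, not part of Spark or spark-nlp, and it assumes the jars sit directly in the directory you pass in (for example $SPARK_HOME/jars, or ~/.ivy2/jars, where --packages places downloaded jars by default).

```shell
# Hypothetical helper: print the file names of AWS SDK jars in a directory.
# Usage: list_aws_sdk_jars /path/to/jars
list_aws_sdk_jars() {
  # The leading * also matches Ivy-style names such as
  # com.amazonaws_aws-java-sdk-1.7.4.jar.
  find "$1" -maxdepth 1 -type f -name '*aws-java-sdk*.jar' 2>/dev/null \
    | sed 's|.*/||' \
    | LC_ALL=C sort
}

# Example (paths are assumptions about a default setup):
# list_aws_sdk_jars "$SPARK_HOME/jars"
# list_aws_sdk_jars "$HOME/.ivy2/jars"
```

If the listing shows both a 1.7.4 jar and 1.11.x jars, the two SDK generations are being loaded together, which matches the NoSuchMethodError above.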

To reproduce

#1 Export your AWS credentials to your shell.

export AWS_ACCESS_KEY_ID=[your access key]
export AWS_SECRET_ACCESS_KEY=[your secret key]

#2 Start Spark with the following command (note: this loads hadoop-aws 2.7.5 because spark:2.3.0 comes prebuilt for hadoop:2.7.x).

bin/spark-shell start --packages JohnSnowLabs:spark-nlp:1.5.4,org.apache.hadoop:hadoop-aws:2.7.5

#3 Run the following command to try to load the Parquet file.

var data = spark.read.format("parquet").load("s3a://[bucket-name]/[object-name]")

#4 Now build from this PR and repeat the steps. The data loading should succeed.

Your Environment

Version used: 1.5.4
Browser Name and version: Chrome 67.0
Operating System and version (desktop or mobile): OS X High Sierra

saif-ellafi · 2018-06-13T15:02:01Z

This is great news. Glad you found the reason. Let's see if someone in such an environment can confirm the fix, just to be sure.

thelabdude · 2018-06-14T15:34:17Z

To clarify the issue: if a Spark job wants to use the hadoop-aws dependency to load datasets from S3, i.e. S3AFileSystem, then it can't do that and use spark-nlp in the same job. You don't really need to do all the reproduction steps above ^ ...

abatilo · 2018-07-02T18:32:14Z

I just wanted to bump this issue. It's pretty problematic that I can't load data from S3. It would be great if this could be merged soon.

saif-ellafi · 2018-07-04T13:33:55Z

Merged master into https://github.com/JohnSnowLabs/spark-nlp/tree/refactor_use_aws-java-sdk-1.7.4
Testing before merging.

Refactor library to use AWS-sdk-java 1.7.4

a10df1a

saif-ellafi mentioned this pull request Jun 13, 2018

Pretrained Model not working due to missing Amazon dependencies #210

Closed

saif-ellafi added this to the 1.5.5 milestone Jul 4, 2018

saif-ellafi merged commit 3bd99b0 into JohnSnowLabs:master Jul 4, 2018
