Refactor library to use AWS-sdk-java 1.7.4 #222
Merged
Description
We recently tried to use johnsnowlabs-1.5.4 for a small proof-of-concept with a pretrained model for NER tagging. We were also trying to load the data from an Amazon S3 bucket (URL format: s3://bucket-name/bucket-file).
Expected Behavior
The data should load from the S3 servers, and model evaluation should happen as expected.
Current Behavior
The code errors out with the following stacktrace:
Possible Solution
Our runtime Spark environment is 2.3.0, which depends on hadoop-2.7, which in turn depends on aws-java-sdk:1.7.4. However, johnsnow references aws-sdk-java-s3:1.11.313, which results in the above NoSuchMethodError. There seem to be at least two other open issues caused by precisely the same version mismatch.
I've rewritten the parts of Johnsnow that allow it to stream data from Amazon S3 so that they use aws-java-sdk 1.7.4.
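When a NoSuchMethodError like this comes from two versions of the same SDK on the classpath, it can help to check which jar a class is actually loaded from. The following is a minimal diagnostic sketch, not part of this PR: `JarLocator` and `locate` are hypothetical names, and the default class name `com.amazonaws.services.s3.AmazonS3Client` is only an example of a class present in both SDK versions.

```java
import java.security.CodeSource;

// Hypothetical helper (not part of this PR): reports which jar a class is loaded from.
public class JarLocator {

    // Returns the code source location of the named class, or a short note
    // when the class is missing or has no code source (JDK classes).
    static String locate(String className) {
        try {
            // 'false' avoids running static initializers of the inspected class.
            Class<?> cls = Class.forName(className, false, JarLocator.class.getClassLoader());
            CodeSource src = cls.getProtectionDomain().getCodeSource();
            if (src == null) {
                return "(JDK/bootstrap class, no code source)";
            }
            return String.valueOf(src.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        // Assumed default: an S3 client class that exists in both SDK versions.
        String name = args.length > 0 ? args[0] : "com.amazonaws.services.s3.AmazonS3Client";
        System.out.println(name + " -> " + locate(name));
    }
}
```

Run with the same classpath as the Spark job, the printed location should show whether the class comes from the 1.7.4 jar pulled in by Hadoop or from a 1.11.x jar.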
To reproduce
#1 Export your AWS credentials to your shell.
#2 Start Spark with the following command (note: this loads hadoop-aws 2.7.5, because Spark 2.3.0 comes prebuilt for Hadoop 2.7.x)
#3 Run the following command to try to download the parquet file.
#4 Now build from this PR and repeat the steps above. The data should load successfully.
Your Environment