Having problems with running this code under PySpark 2.3.1 #4

Closed
alexlusher opened this issue Aug 24, 2018 · 7 comments

@alexlusher commented Aug 24, 2018

Hi Alex,

Thanks a lot for creating this example. I cloned your repository and tried to run it on my laptop (macOS Sierra, pyspark==2.3.1, py4j==0.10.7) and ran into the following problem:

Traceback (most recent call last):
  File "/Users/al/Desktop/dev/Spark/etl/etl_job.py", line 237, in <module>
    main()
  File "/Users/al/Desktop/dev/Spark/etl/etl_job.py", line 56, in main
    files=['etl_config.json'])
  File "/Users/al/Desktop/dev/Spark/etl/etl_job.py", line 209, in start_spark
    spark_sess = spark_builder.getOrCreate()
  File "/usr/local/Cellar/apache-spark/2.3.1/libexec/python/lib/pyspark.zip/pyspark/sql/session.py", line 173, in getOrCreate
  File "/usr/local/Cellar/apache-spark/2.3.1/libexec/python/lib/pyspark.zip/pyspark/context.py", line 343, in getOrCreate
  File "/usr/local/Cellar/apache-spark/2.3.1/libexec/python/lib/pyspark.zip/pyspark/context.py", line 115, in __init__
  File "/usr/local/Cellar/apache-spark/2.3.1/libexec/python/lib/pyspark.zip/pyspark/context.py", line 292, in _ensure_initialized
  File "/usr/local/Cellar/apache-spark/2.3.1/libexec/python/lib/pyspark.zip/pyspark/java_gateway.py", line 120, in launch_gateway
TypeError: __init__() got an unexpected keyword argument 'auth_token'
2018-08-24 15:24:22 INFO ShutdownHookManager:54 - Shutdown hook called

What needs to be modified in your source code in order to run it?

Thanks!

@pwrose commented Aug 26, 2018

The code works with pyspark 2.3.1 if you don't specify the --py-files dependencies.zip option. I'm not sure why it works with pyspark 2.2.1. It may have something to do with the directory structure or files in the dependencies directory.
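In practice that means submitting without the zip, e.g. with the same flags used in the runs further down this thread:

    spark-submit --master local[*] --files etl_config.json etl_job.py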

@pwrose commented Aug 26, 2018

Ok, I found the issue. There is a copy of py4j inside dependencies.zip. If you remove the py4j directory from dependencies.zip, then it works with pyspark 2.3.1.

Also, note that the contents of the dependencies directory don't match dependencies.zip.
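If it helps, one way to strip py4j out of the existing archive without rebuilding it from scratch is something like the following (this assumes the entries are stored under a top-level py4j/ folder inside the zip):

    # delete every py4j entry from the archive in place
    zip -d dependencies.zip "py4j/*"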

@alexlusher (author) commented Aug 27, 2018

Good morning, Peter.

Many thanks for trying to help me. I followed your suggestion and completely removed the py4j directory from the archive. Now I am facing the error 'module' object has no attribute 'Logger' (see below). Any ideas on what might be causing this?

(py2715env) al$ spark-submit --master local[*] --files etl_config.json etl_job.py
2018-08-27 10:21:27 WARN Utils:66 - Your hostname, al resolves to a loopback address: 127.0.0.1; using 172.21.80.16 instead (on interface en0)
2018-08-27 10:21:27 WARN Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
2018-08-27 10:21:28 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Traceback (most recent call last):
  File "/Users/al/Desktop/dev/Spark/etl/etl_job.py", line 41, in <module>
    from pyspark import SparkFiles
  File "/usr/local/Cellar/apache-spark/2.3.1/libexec/python/lib/pyspark.zip/pyspark/__init__.py", line 46, in <module>
  File "/usr/local/Cellar/apache-spark/2.3.1/libexec/python/lib/pyspark.zip/pyspark/context.py", line 31, in <module>
  File "/usr/local/Cellar/apache-spark/2.3.1/libexec/python/lib/pyspark.zip/pyspark/accumulators.py", line 97, in <module>
  File "/usr/local/Cellar/apache-spark/2.3.1/libexec/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 221, in <module>
  File "/usr/local/Cellar/apache-spark/2.3.1/libexec/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 885, in CloudPickler
AttributeError: 'module' object has no attribute 'Logger'
2018-08-27 10:21:28 INFO ShutdownHookManager:54 - Shutdown hook called
2018-08-27 10:21:28 INFO ShutdownHookManager:54 - Deleting directory /private/var/folders/2k/1p2m03494rz7b16rp96zwf8m_86r_h/T/spark-f3a68ca0-d982-4390-b24a-30cbaa674ba4

@pwrose commented Aug 27, 2018 via email

@alexlusher (author)

Thanks, Peter.

I did what you suggested, but the error is still there (see below). What else should I check?

(py2715env) al$ spark-submit --master local[*] --py-files dependencies.zip --files etl_config.json etl_job.py
2018-08-27 12:41:04 WARN Utils:66 - Your hostname, al resolves to a loopback address: 127.0.0.1; using 172.21.80.16 instead (on interface en0)
2018-08-27 12:41:04 WARN Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
2018-08-27 12:41:04 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Traceback (most recent call last):
  File "/Users/al/Desktop/dev/Spark/etl/etl_job.py", line 41, in <module>
    from pyspark import SparkFiles
  File "/usr/local/Cellar/apache-spark/2.3.1/libexec/python/lib/pyspark.zip/pyspark/__init__.py", line 46, in <module>
  File "/usr/local/Cellar/apache-spark/2.3.1/libexec/python/lib/pyspark.zip/pyspark/context.py", line 31, in <module>
  File "/usr/local/Cellar/apache-spark/2.3.1/libexec/python/lib/pyspark.zip/pyspark/accumulators.py", line 97, in <module>
  File "/usr/local/Cellar/apache-spark/2.3.1/libexec/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 221, in <module>
  File "/usr/local/Cellar/apache-spark/2.3.1/libexec/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 885, in CloudPickler
AttributeError: 'module' object has no attribute 'Logger'
2018-08-27 12:41:05 INFO ShutdownHookManager:54 - Shutdown hook called
2018-08-27 12:41:05 INFO ShutdownHookManager:54 - Deleting directory /private/var/folders/2k/1p2m03494rz7b16rp96zwf8m_86r_h/T/spark-85c1f328-9ddf-4130-bce2-479378ab0312
(py2715env) al$

@AlexIoannides (owner)

Hello.

I have replicated the (original) error on my side. I'll try and fix it today or later this week.

@AlexIoannides (owner)

Okay, delete dependencies.zip and then re-build it on your system using:

    ./build_dependencies.sh dependencies venv

Assuming you've named the folders in the same way that I have, it should then work. I will probably remove dependencies.zip from the repo, as it really ought to be built locally and not source controlled.
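For reference, the end-to-end sequence is then something like the following (assuming the folder names above and the spark-submit flags used earlier in this thread):

    # rebuild the zipped Python dependencies from the local virtual environment
    ./build_dependencies.sh dependencies venv

    # re-submit the job with the freshly built archive
    spark-submit --master local[*] --py-files dependencies.zip --files etl_config.json etl_job.py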
