
pip install Download pyspark as default & fails to work #3

Open
yelled1 opened this issue Mar 26, 2021 · 4 comments

yelled1 commented Mar 26, 2021

When I pip install ceja, it automatically downloads
pyspark-3.1.1.tar.gz (212.3 MB)
which is a problem because it's the wrong version (I'm using 3.0.0 on both EMR and WSL).
Even after I uninstall it, I still get errors on EMR.
Can this behavior be stopped?

```
[hadoop@ip-172-31-89-109 ~]$ sudo /usr/bin/pip3 install ceja
WARNING: Running pip install with root privileges is generally not a good idea. Try `pip3 install --user` instead.
Collecting ceja
  Downloading https://files.pythonhosted.org/packages/c6/80/f372c62a83175f4c54229474f543aeca3344f4c64aab4bcfe7cf05f50cbf/ceja-0.2.0-py3-none-any.whl
Collecting pyspark>2.0.0 (from ceja)
  Downloading https://files.pythonhosted.org/packages/45/b0/9d6860891ab14a39d4bddf80ba26ce51c2f9dc4805e5c6978ac0472c120a/pyspark-3.1.1.tar.gz (212.3MB)
    100% |████████████████████████████████| 212.3MB 6.3kB/s
Collecting jellyfish<0.9.0,>=0.8.2 (from ceja)
  Downloading https://files.pythonhosted.org/packages/04/3f/d03cb056f407ef181a45569255348457b1a0915fc4eb23daeceb930a68a4/jellyfish-0.8.2.tar.gz (134kB)
    100% |████████████████████████████████| 143kB 9.1MB/s
Collecting py4j==0.10.9 (from pyspark>2.0.0->ceja)
  Downloading https://files.pythonhosted.org/packages/9e/b6/6a4fb90cd235dc8e265a6a2067f2a2c99f0d91787f06aca4bcf7c23f3f80/py4j-0.10.9-py2.py3-none-any.whl (198kB)
    100% |████████████████████████████████| 204kB 6.5MB/s
Installing collected packages: py4j, pyspark, jellyfish, ceja
  Running setup.py install for pyspark ... done
  Running setup.py install for jellyfish ... done
Successfully installed ceja-0.2.0 jellyfish-0.8.2 py4j-0.10.9 pyspark-3.1.1


[hadoop@ip-172-31-89-109 ~]$ sudo /usr/bin/pip3 uninstall pyspark
Proceed (y/n)? y
..(snip)..
  Successfully uninstalled pyspark-3.1.1
```

After doing the above and attempting to use it:

```
>>> df_m.columns
['guid_consumer_hashed_df10', 'guid_customer_hashed_df10', 'guidr_m', 'jws_fnm_m', 'jws_lnm_m', 'gender_m', 'state_m', 'zip3_m', 'soundex_fnm_m', 'lev_gender_m', 'lev_state_m', 'lev_zip3_m', 'lev_soundex_fnm_m']
```

The jws_???_m columns are created with:

```
...     .withColumn(
...         "jws_fnm_m",
...         ceja.jaro_winkler_similarity(f.col("firstname_df10"), f.col("firstname_df4")),
...     )
```
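For reference, here is a minimal, self-contained sketch of the same pattern on a toy DataFrame (the data and session setup are made up for illustration; it assumes ceja and jellyfish are importable both on the driver and inside the executors' Python workers):

```
import pyspark.sql.functions as f
from pyspark.sql import SparkSession
import ceja

spark = SparkSession.builder.appName("ceja-demo").getOrCreate()

# Toy data standing in for the real firstname columns.
df = spark.createDataFrame(
    [("jonathan", "jonathon"), ("maria", "mariah")],
    ["firstname_df10", "firstname_df4"],
)

# ceja.jaro_winkler_similarity is a jellyfish-backed UDF, so jellyfish must be
# importable in the Python worker on every executor, not just on the driver.
df_m = df.withColumn(
    "jws_fnm_m",
    ceja.jaro_winkler_similarity(f.col("firstname_df10"), f.col("firstname_df4")),
)
df_m.show()
```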

I can see the columns, but show() fails:

```
>>> df_m.show()
21/03/26 06:01:50 WARN TaskSetManager: Lost task 0.0 in stage 14.0 (TID 40007, ip-172-31-80-99.ec2.internal, executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616735627370_0001/container_1616735627370_0001_01_000002/pyspark.zip/pyspark/worker.py", line 589, in main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616735627370_0001/container_1616735627370_0001_01_000002/pyspark.zip/pyspark/worker.py", line 447, in read_udfs
    udfs.append(read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=i))
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616735627370_0001/container_1616735627370_0001_01_000002/pyspark.zip/pyspark/worker.py", line 254, in read_single_udf
    f, return_type = read_command(pickleSer, infile)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616735627370_0001/container_1616735627370_0001_01_000002/pyspark.zip/pyspark/worker.py", line 74, in read_command
    command = serializer._read_with_length(file)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616735627370_0001/container_1616735627370_0001_01_000002/pyspark.zip/pyspark/serializers.py", line 172, in _read_with_leng
th
    return self.loads(obj)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616735627370_0001/container_1616735627370_0001_01_000002/pyspark.zip/pyspark/serializers.py", line 458, in loads
    return pickle.loads(obj, encoding=encoding)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616735627370_0001/container_1616735627370_0001_01_000002/pyspark.zip/pyspark/cloudpickle.py", line 1110, in subimport
    __import__(name)
ModuleNotFoundError: No module named 'jellyfish'
```

Attempting to install it fails:
```
$ sudo /usr/bin/pip3 install jellifish
WARNING: Running pip install with root privileges is generally not a good idea. Try `pip3 install --user` instead.
Collecting jellifish
  Could not find a version that satisfies the requirement jellifish (from versions: )
No matching distribution found for jellifish
```
yelled1 changed the title from "Download pyspark as default" to "pip install Download pyspark as default & fails to work" on Mar 26, 2021
MrPowers (Owner) commented

@yelled1 - I removed the hard dependency on PySpark; hopefully that will solve the issue.

The hard PySpark dependency caused an issue on another project as well.

I just published ceja v0.3.0. It should be on PyPI.

Can you try again and let me know if the new version solves your issue?
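
For context, one common way to drop a hard PySpark requirement is to move it into an extras_require entry so pip no longer resolves and downloads its own pyspark on clusters that already ship Spark (EMR, Databricks, etc.). This is only an illustrative setuptools sketch, not necessarily how ceja 0.3.0 is actually packaged:

```
# Illustrative setup.py sketch (not ceja's actual packaging).
from setuptools import setup, find_packages

setup(
    name="ceja",
    version="0.3.0",
    packages=find_packages(),
    # jellyfish stays as a hard requirement (same pin as in the pip log above).
    install_requires=["jellyfish>=0.8.2,<0.9.0"],
    extras_require={
        # PySpark becomes opt-in: `pip install ceja[pyspark]` pulls it,
        # while a plain `pip install ceja` does not.
        "pyspark": ["pyspark>2.0.0"],
    },
)
```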


yelled1 commented Mar 28, 2021

@MrPowers, pyspark did not download, which is great (thanks a bunch), but I got the jellyfish error below.
Still, I do have it installed:

```
[hadoop@ip-172-31-83-44 ~]$ sudo /usr/bin/pip3 install jellyfish
WARNING: Running pip install with root privileges is generally not a good idea. Try `pip3 install --user` instead.
Requirement already satisfied: jellyfish in /usr/local/lib/python3.7/site-packages
```

Also note that this is NOT an issue on WSL2, but it is on EMR (WSL2 was a reinstall and EMR was a fresh cluster).
I'm using findspark.py on both before starting Spark from vim.
import jellyfish works on its own.
I just confirmed that the same error happens under spark-submit.

```
>>> df_m.show()
21/03/28 15:11:48 WARN TaskSetManager: Lost task 0.0 in stage 7.0 (TID 30005, ip-172-31-86-169.ec2.internal, executor 3): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/worker.py", line 589, in main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/worker.py", line 447, in read_udfs
    udfs.append(read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=i))
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/worker.py", line 254, in read_single_udf
    f, return_type = read_command(pickleSer, infile)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/worker.py", line 74, in read_command
    command = serializer._read_with_length(file)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/serializers.py", line 172, in _read_with_leng
th
    return self.loads(obj)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/serializers.py", line 458, in loads
    return pickle.loads(obj, encoding=encoding)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/cloudpickle.py", line 1110, in subimport
    __import__(name)
ModuleNotFoundError: No module named 'jellyfish'

        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:503)
        at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:81)
        at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:64)
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:456)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:127)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

21/03/28 15:11:48 ERROR TaskSetManager: Task 0 in stage 7.0 failed 4 times; aborting job
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 441, in show
    print(self._jdf.showString(n, 20, vertical))
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 134, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.PythonException:
  An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/worker.py", line 589, in main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/worker.py", line 447, in read_udfs
    udfs.append(read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=i))
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/worker.py", line 254, in read_single_udf
    f, return_type = read_command(pickleSer, infile)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/worker.py", line 74, in read_command
    command = serializer._read_with_length(file)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/serializers.py", line 172, in _read_with_leng
th
    return self.loads(obj)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/serializers.py", line 458, in loads
    return pickle.loads(obj, encoding=encoding)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/cloudpickle.py", line 1110, in subimport
    __import__(name)
ModuleNotFoundError: No module named 'jellyfish'
```


yelled1 commented Mar 29, 2021

The workaround below works, but that means it only works for spark-submit, while the VSCode or vi REPL will NOT.

```
mkdir $HOME/lib
pip3 install ceja -t $HOME/lib/
cd $HOME/lib/
zip -r ~/include_py_modules.zip .
cd $HOME/

/usr/bin/nohup spark-submit --packages io.delta:delta-core_2.12:0.7.0 --py-files $HOME/include_py_modules.zip --driver-memory 8g --executor-memory 8g my_python_script.py > ~/output.log 2>&1 &
```
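
For an interactive session, a possible (untested here) equivalent is to ship the same archive through SparkContext.addPyFile instead of --py-files; the path below assumes $HOME resolves to /home/hadoop as in the shell snippet above:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Distribute the zip built above to every executor so its modules become
# importable inside the Python workers (this mainly helps for pure-Python code
# in the archive; compiled extensions may not import from a zip).
spark.sparkContext.addPyFile("/home/hadoop/include_py_modules.zip")

import ceja  # import after the archive has been added
```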

MrPowers (Owner) commented

Yeah, perhaps vendoring jellyfish is the best path forward to avoid the transitive dependency issue. Python packaging is difficult, and it gets even harder when Spark is added to the mix.
