MMLSpark on Cloudera #311

moyanojv opened this issue Jun 1, 2018 · 16 comments

@moyanojv commented Jun 1, 2018

We are trying to use mmlspark in a Cloudera environment using Hue pyspark notebooks through Livy.
All our efforts have failed so far, and we wonder whether this is possible at all. The only way we have gotten it to work is to use pyspark without YARN.

Tested but not working:
We modified the Spark 2 Client Advanced Configuration Snippet (Safety Valve) in Cloudera Manager to add --packages Azure:mmlspark:0.12 (spark.jars.packages=Azure:mmlspark:0.12). With this property our Livy session downloads the package and its dependencies, but we don't see anything related to mmlspark in the session property spark.submit.pyFiles.
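
For reference, a minimal sketch of what that safety-valve entry amounts to in the rendered spark-defaults.conf (this is an assumption about the generated config; the Cloudera Manager field itself may differ by CDH version):

# Added via the Spark 2 Client safety valve. Spark resolves these
# coordinates from the spark-packages repository at session launch.
spark.jars.packages=Azure:mmlspark:0.12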

Here are the Spark properties of the environment of a Livy session created using the approach described above:
livy-session-7 - Environment.pdf

And here is a screenshot of a working environment of a pyspark2 session using a different approach (pyspark2 --master local --deploy-mode client --packages Azure:mmlspark:0.12):
pyspark - Environment.pdf

So, here is my question: is it possible to use mmlspark in a Cloudera environment using Hue pyspark notebooks through Livy?

Thanks in advance.

@mhamilton723 (Collaborator)

Hey @moyanojv, thanks for reaching out! MMLSpark should be entirely compatible with YARN, as we do not rely on a particular scheduler. Are you able to install other Spark packages on your system? Do you get a particular error message?

@moyanojv (Author)

Thanks @mhamilton723 for your help.

Right now I'm a little lost. As far as I can see, this package contains Python code, so I'm not sure how to install it. Do I have to install it as a Python package in my environment?

@mhamilton723 (Collaborator)

@moyanojv To add a Python+Scala library to Spark you use "Spark packages": when you create or spin up your Spark session, you pass the --packages flag to attach our Maven coordinates. If you are using pyspark, attaching the package will automatically load the Python bindings into your interpreter. Here are the sections in the readme that describe the process:

https://github.com/Azure/mmlspark#spark-package

https://github.com/Azure/mmlspark#python

Hope this helps!
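
For concreteness, a minimal sketch of both routes (assuming the 0.12 coordinates used in this thread):

# From a shell: jars and Python bindings are fetched at launch.
pyspark --packages Azure:mmlspark:0.12

# From plain Python, before any SparkSession exists. Note that setting
# spark.jars.packages on an already-running session has no effect.
import pyspark
spark = (pyspark.sql.SparkSession.builder
         .appName("MyApp")
         .config("spark.jars.packages", "Azure:mmlspark:0.12")
         .getOrCreate())
import mmlspark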

@moyanojv (Author) commented Jun 14, 2018

@mhamilton723 I ran this command on my Cloudera cluster:

pyspark2 --master yarn --deploy-mode client --packages Azure:mmlspark:0.12

And the shell comes up:

Python 3.6.1 (default, Sep 22 2017, 14:27:40) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-11)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Azure#mmlspark added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
	confs: [default]
	found Azure#mmlspark;0.12 in spark-packages
	found io.spray#spray-json_2.11;1.3.2 in central
	found com.microsoft.cntk#cntk;2.4 in central
	found org.openpnp#opencv;3.2.0-1 in central
	found com.microsoft.ml.lightgbm#lightgbmlib;2.0.120 in central
:: resolution report :: resolve 1178ms :: artifacts dl 45ms
	:: modules in use:
	Azure#mmlspark;0.12 from spark-packages in [default]
	com.microsoft.cntk#cntk;2.4 from central in [default]
	com.microsoft.ml.lightgbm#lightgbmlib;2.0.120 from central in [default]
	io.spray#spray-json_2.11;1.3.2 from central in [default]
	org.openpnp#opencv;3.2.0-1 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   5   |   0   |   0   |   0   ||   5   |   0   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
	confs: [default]
	0 artifacts copied, 5 already retrieved (0kB/26ms)
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/06/14 10:42:35 WARN yarn.Client: Same path resource file:/root/.ivy2/jars/Azure_mmlspark-0.12.jar added multiple times to distributed cache.
18/06/14 10:42:35 WARN yarn.Client: Same path resource file:/root/.ivy2/jars/io.spray_spray-json_2.11-1.3.2.jar added multiple times to distributed cache.
18/06/14 10:42:35 WARN yarn.Client: Same path resource file:/root/.ivy2/jars/com.microsoft.cntk_cntk-2.4.jar added multiple times to distributed cache.
18/06/14 10:42:35 WARN yarn.Client: Same path resource file:/root/.ivy2/jars/org.openpnp_opencv-3.2.0-1.jar added multiple times to distributed cache.
18/06/14 10:42:35 WARN yarn.Client: Same path resource file:/root/.ivy2/jars/com.microsoft.ml.lightgbm_lightgbmlib-2.0.120.jar added multiple times to distributed cache.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0.cloudera1
      /_/

Using Python version 3.6.1 (default, Sep 22 2017 14:27:40)
SparkSession available as 'spark'.
>>> 

As you can see, the package is downloaded and appears to be correctly installed. But when I follow the tutorial:

>>> import pyspark
>>> spark = pyspark.sql.SparkSession.builder.appName("MyApp").config("spark.jars.packages", "Azure:mmlspark:0.12").getOrCreate()
>>> import mmlspark
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'mmlspark'
>>>

Am I doing something wrong?

Thanks for your help.

@mhamilton723 (Collaborator)

Hmm, the first line looks right, but when you launch pyspark as a command you don't need to recreate the spark object, as it already exists. Try just

import mmlspark

and see if that works.

mhamilton723 changed the title from "mmlspark with Spark on Yarn. (Cloudera)" to "MMLSpark on Cloudera" on Jun 18, 2018
@moyanojv (Author)

@mhamilton723 here is the result:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0.cloudera1
      /_/

Using Python version 3.6.1 (default, Sep 22 2017 14:27:40)
SparkSession available as 'spark'.
>>> import mmlspark
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'mmlspark'
>>> 

I have attached the spark environment information.

PySparkShell - Environment.pdf

Thanks for your help.

@mhamilton723 (Collaborator) commented Jun 20, 2018

Thanks for the quick reply! Is it possible to try this out with Spark 2.2? That's what our package was built against.

@moyanojv (Author)

I'm sorry, but right now that is not possible.
Anyway, we will try to upgrade our Spark version as soon as possible to test your suggestion. If we change the version, we will post our results here.

@mhamilton723 many thanks for your help.

Regards

@mhamilton723 (Collaborator)

@moyanojv Perhaps also try installing the pip package directly, as it seems your spark-submit is not installing the Python bits as anticipated:

https://mmlspark.azureedge.net/pip/mmlspark-0.12-py2.py3-none-any.whl
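
A minimal sketch of that direct install (pip can install straight from a wheel URL; this assumes it runs in the same Python environment that the pyspark driver uses):

pip install https://mmlspark.azureedge.net/pip/mmlspark-0.12-py2.py3-none-any.whl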

@hanzigs commented Jul 1, 2019

I am new to mmlspark. Can I have some help with this, please?

from mmlspark import TrainClassifier
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    from mmlspark import TrainClassifier
  File "mmlspark.py", line 30, in <module>
    from mmlspark.TrainClassifier import TrainClassifier
ModuleNotFoundError: No module named 'mmlspark.TrainClassifier'; 'mmlspark' is not a package

@imatiach-msft (Contributor)

@apremgeorge It looks like you are running into a similar issue; can you try installing the latest pip package, for the v0.17 version, from here:

https://mmlspark.azureedge.net/pip/mmlspark-0.17-py2.py3-none-any.whl

@imatiach-msft (Contributor)

@apremgeorge Also, how did you install the package on Cloudera? Did you specify the Spark package Maven coordinates somewhere? And do you know whether the Scala bindings are working and you are only having trouble with the pyspark Python bindings?

@hanzigs commented Jul 1, 2019

@imatiach-msft Thank you very much for the reply.
It was not a Cloudera install; I installed it from pyspark (pip install).
I am now trying to install it on an HDInsight Spark cluster, using the configuration settings for mmlspark and its dependencies, and running from a Python IDE. Thanks.
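
On HDInsight, Livy-backed Jupyter notebooks are usually configured with a sparkmagic %%configure cell; a minimal sketch, assuming the v0.17 coordinates mentioned above (the cell must run before the session starts, or the session must be restarted for it to take effect):

%%configure -f
{
  "name": "mmlspark",
  "conf": {
    "spark.jars.packages": "Azure:mmlspark:0.17"
  }
}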

@Jinqiao commented Oct 16, 2019

@imatiach-msft Hi~ I ran into the same problem with 0.18.1. Where can I get a 0.18.1 wheel file? Thank you!

@njgerner
Is there a reference anywhere to what wheel files are available at https://mmlspark.azureedge.net/pip/** ?

@rusonding commented Oct 16, 2020

(mmlspark) [root@hadoop51]# spark2-submit --master yarn --conf spark.pyspark.python=/usr/lib/anaconda2/envs/mmlspark/bin/python --num-executors 10 --executor-memory 15G test_mmlspark.py
Traceback (most recent call last):
  File "/root/test/test_mmlspark.py", line 13, in <module>
    from mmlspark.lightgbm import LightGBMClassifier
  File "/usr/lib/anaconda2/envs/mmlspark/lib/python3.6/site-packages/mmlspark/lightgbm/LightGBMClassifier.py", line 11, in <module>
    from mmlspark.lightgbm._LightGBMClassifier import _LightGBMClassifier
ModuleNotFoundError: No module named 'mmlspark.lightgbm._LightGBMClassifier'

certifi 2016.2.28
future 0.18.2
mmlspark 0.0.11111111
numpy 1.19.2
pip 20.2.3
py4j 0.10.7
PyHive 0.6.1
pyspark 2.4.5
python-dateutil 2.8.1
setuptools 36.4.0
six 1.15.0
wheel 0.29.0
