MMLSpark on Cloudera #311

moyanojv opened this issue Jun 1, 2018 · 16 comments

@moyanojv commented Jun 1, 2018

We are trying to use mmlspark in a Cloudera environment using Hue pyspark notebooks through Livy.
All our efforts have failed so far, and we wonder whether this is possible at all. The only way we have gotten it to work is to use pyspark without YARN.

Tested but not working:
We modified the Spark 2 Client Advanced Configuration Snippet (Safety Valve) in Cloudera Manager to add --packages Azure:mmlspark:0.12 (spark.jars.packages=Azure:mmlspark:0.12). With this property our Livy session downloads the package and its dependencies, but we don't see anything related to mmlspark in the session property spark.submit.pyFiles.
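
For reference, a minimal sketch of what that safety-valve entry amounts to in the rendered spark-defaults.conf (this is an assumption about the generated config; the Cloudera Manager field itself may differ by CDH version):

# Added via the Spark 2 Client safety valve. Spark resolves these
# coordinates from the spark-packages repository at session launch.
spark.jars.packages=Azure:mmlspark:0.12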

Here are the Spark properties of the environment of a Livy session created using the approach described above:
livy-session-7 - Environment.pdf

And here is a screenshot of a working environment of a pyspark2 session using a different approach (pyspark2 --master local --deploy-mode client --packages Azure:mmlspark:0.12):
pyspark - Environment.pdf

So, here is my question: is it possible to use mmlspark in a Cloudera environment using Hue pyspark notebooks through Livy?

Thanks in advance.

@mhamilton723 (Collaborator)

Hey @moyanojv, thanks for reaching out! MMLSpark should be entirely compatible with YARN, as we do not rely on a particular scheduler. Are you able to install other Spark packages on your system? Do you get a particular error message?

@moyanojv (Author)

Thanks @mhamilton723 for your help.

Right now I'm a little lost. As far as I can see, this package contains Python code, so I'm not sure how to install it. Do I have to install it as a Python package in my environment?

@mhamilton723 (Collaborator)

@moyanojv To add a Python+Scala library to Spark you use "Spark packages": when you create or spin up your Spark session, you pass the --packages flag to attach our Maven coordinates. If you are using pyspark, attaching the package will automatically load the Python bindings into your interpreter. Here are the sections in the readme that describe the process:

https://github.com/Azure/mmlspark#spark-package

https://github.com/Azure/mmlspark#python

Hope this helps!
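
For concreteness, a minimal sketch of both routes (assuming the 0.12 coordinates used in this thread):

# From a shell: jars and Python bindings are fetched at launch.
pyspark --packages Azure:mmlspark:0.12

# From plain Python, before any SparkSession exists. Note that setting
# spark.jars.packages on an already-running session has no effect.
import pyspark
spark = (pyspark.sql.SparkSession.builder
         .appName("MyApp")
         .config("spark.jars.packages", "Azure:mmlspark:0.12")
         .getOrCreate())
import mmlspark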

@moyanojv (Author) commented Jun 14, 2018

@mhamilton723 I ran this command on my Cloudera cluster:

pyspark2 --master yarn --deploy-mode client --packages Azure:mmlspark:0.12

And the shell comes up:

Python 3.6.1 (default, Sep 22 2017, 14:27:40) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-11)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Azure#mmlspark added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
	confs: [default]
	found Azure#mmlspark;0.12 in spark-packages
	found io.spray#spray-json_2.11;1.3.2 in central
	found com.microsoft.cntk#cntk;2.4 in central
	found org.openpnp#opencv;3.2.0-1 in central
	found com.microsoft.ml.lightgbm#lightgbmlib;2.0.120 in central
:: resolution report :: resolve 1178ms :: artifacts dl 45ms
	:: modules in use:
	Azure#mmlspark;0.12 from spark-packages in [default]
	com.microsoft.cntk#cntk;2.4 from central in [default]
	com.microsoft.ml.lightgbm#lightgbmlib;2.0.120 from central in [default]
	io.spray#spray-json_2.11;1.3.2 from central in [default]
	org.openpnp#opencv;3.2.0-1 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   5   |   0   |   0   |   0   ||   5   |   0   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
	confs: [default]
	0 artifacts copied, 5 already retrieved (0kB/26ms)
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/06/14 10:42:35 WARN yarn.Client: Same path resource file:/root/.ivy2/jars/Azure_mmlspark-0.12.jar added multiple times to distributed cache.
18/06/14 10:42:35 WARN yarn.Client: Same path resource file:/root/.ivy2/jars/io.spray_spray-json_2.11-1.3.2.jar added multiple times to distributed cache.
18/06/14 10:42:35 WARN yarn.Client: Same path resource file:/root/.ivy2/jars/com.microsoft.cntk_cntk-2.4.jar added multiple times to distributed cache.
18/06/14 10:42:35 WARN yarn.Client: Same path resource file:/root/.ivy2/jars/org.openpnp_opencv-3.2.0-1.jar added multiple times to distributed cache.
18/06/14 10:42:35 WARN yarn.Client: Same path resource file:/root/.ivy2/jars/com.microsoft.ml.lightgbm_lightgbmlib-2.0.120.jar added multiple times to distributed cache.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0.cloudera1
      /_/

Using Python version 3.6.1 (default, Sep 22 2017 14:27:40)
SparkSession available as 'spark'.
>>> 

As you can see, the package is downloaded and appears to be correctly installed. But when I follow the tutorial:

>>> import pyspark
>>> spark = pyspark.sql.SparkSession.builder.appName("MyApp").config("spark.jars.packages", "Azure:mmlspark:0.12").getOrCreate()
>>> import mmlspark
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'mmlspark'
>>>

Am I doing something wrong?

Thanks for your help.

@mhamilton723 (Collaborator)

Hmm, the first line looks right, but when you launch pyspark as a command you don't need to recreate the spark object, as it already exists. Try just

import mmlspark

and see if that works.

mhamilton723 changed the title from "mmlspark with Spark on Yarn. (Cloudera)" to "MMLSpark on Cloudera" on Jun 18, 2018
@moyanojv (Author)

@mhamilton723 here is the result:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0.cloudera1
      /_/

Using Python version 3.6.1 (default, Sep 22 2017 14:27:40)
SparkSession available as 'spark'.
>>> import mmlspark
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'mmlspark'
>>> 

I have attached the spark environment information.

PySparkShell - Environment.pdf

Thanks for your help.

@mhamilton723 (Collaborator) commented Jun 20, 2018

Thanks for the quick reply! Is it possible to try this out with Spark 2.2? That's what our package was built against.

@moyanojv (Author)

I'm sorry, but right now that is not possible.
Anyway, we will try to upgrade our Spark version as soon as possible to test your suggestion. If we change the version, we will post our results here.

@mhamilton723 many thanks for your help.

Regards

@mhamilton723 (Collaborator)

@moyanojv Perhaps also try installing the pip package directly, as it seems your spark-submit is not installing the Python bits as anticipated:

https://mmlspark.azureedge.net/pip/mmlspark-0.12-py2.py3-none-any.whl
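
A minimal sketch of that direct install (pip can install straight from a wheel URL; this assumes it runs in the same Python environment that the pyspark driver uses):

pip install https://mmlspark.azureedge.net/pip/mmlspark-0.12-py2.py3-none-any.whl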

@hanzigs commented Jul 1, 2019

I am new to mmlspark. Can I have some help with this, please?

from mmlspark import TrainClassifier
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    from mmlspark import TrainClassifier
  File "mmlspark.py", line 30, in <module>
    from mmlspark.TrainClassifier import TrainClassifier
ModuleNotFoundError: No module named 'mmlspark.TrainClassifier'; 'mmlspark' is not a package

@imatiach-msft (Contributor)

@apremgeorge It looks like you are running into a similar issue; can you try installing the latest pip package, for the v0.17 version, from here:

https://mmlspark.azureedge.net/pip/mmlspark-0.17-py2.py3-none-any.whl

@imatiach-msft (Contributor)

@apremgeorge Also, how did you install the package on Cloudera? Did you specify the Spark package Maven coordinates somewhere? And do you know whether the Scala bindings are working and you are only having trouble with the pyspark Python bindings?

@hanzigs commented Jul 1, 2019

@imatiach-msft Thank you very much for the reply.
It was not a Cloudera install; I installed it from pyspark (pip install).
I am now trying to install it on an HDInsight Spark cluster, using the configuration settings for mmlspark and its dependencies, and running from a Python IDE. Thanks.
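
On HDInsight, Livy-backed Jupyter notebooks are usually configured with a sparkmagic %%configure cell; a minimal sketch, assuming the v0.17 coordinates mentioned above (the cell must run before the session starts, or the session must be restarted for it to take effect):

%%configure -f
{
  "name": "mmlspark",
  "conf": {
    "spark.jars.packages": "Azure:mmlspark:0.17"
  }
}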

@Jinqiao commented Oct 16, 2019

@imatiach-msft Hi~ I ran into the same problem with 0.18.1. Where can I get a 0.18.1 wheel file? Thank you!

@njgerner
Is there a reference anywhere to what wheel files are available at https://mmlspark.azureedge.net/pip/** ?

@rusonding commented Oct 16, 2020

(mmlspark) [root@hadoop51]# spark2-submit --master yarn --conf spark.pyspark.python=/usr/lib/anaconda2/envs/mmlspark/bin/python --num-executors 10 --executor-memory 15G test_mmlspark.py
Traceback (most recent call last):
  File "/root/test/test_mmlspark.py", line 13, in <module>
    from mmlspark.lightgbm import LightGBMClassifier
  File "/usr/lib/anaconda2/envs/mmlspark/lib/python3.6/site-packages/mmlspark/lightgbm/LightGBMClassifier.py", line 11, in <module>
    from mmlspark.lightgbm._LightGBMClassifier import _LightGBMClassifier
ModuleNotFoundError: No module named 'mmlspark.lightgbm._LightGBMClassifier'

certifi 2016.2.28
future 0.18.2
mmlspark 0.0.11111111
numpy 1.19.2
pip 20.2.3
py4j 0.10.7
PyHive 0.6.1
pyspark 2.4.5
python-dateutil 2.8.1
setuptools 36.4.0
six 1.15.0
wheel 0.29.0
