# Pixiedust on Openshift/Daikon

Start a pod with `oc new-app pixiedust-notebook[-2.0.2].yaml`,
then browse to the route that is created as a result.

# Initial Pixiedust Import

The first time Pixiedust is imported in a Jupyter PySpark instance after installation (or pod instantiation), it requests a kernel restart. If a restart is requested, select Kernel--->Restart from the Jupyter menu bar and be certain to run this cell again once the kernel is back. If no restart is requested, run this cell and those after it without a restart.

In [1]:
import pixiedust

ERROR:root:An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line string', (651, 72))



TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'

# Data Import

Pixiedust's main import method is `pixiedust.sampleData()`, which returns a Spark DataFrame. Pixiedust also ships a number of sample data sets that provide a quick start to data science with the library. Its visualization tools do not depend on its data import tools, so legacy DataFrame code can remain intact.

```python
# Standard approaches
sqlContext.createDataFrame(<Data values>)
spark.read.csv(<path>)  # spark.read is a property, not a method; call a reader such as csv() or json() on it
# Additional Pixiedust approaches
pixiedust.sampleData(<URL>)
pixiedust.sampleData({1-7})
```

# Spark Package Install/Management
Pixiedust advertises a lower-overhead way to install Spark packages. Be aware that the example code in some of the Pixiedust sample notebooks is flawed because it omits a key step. The full flow is as follows:

1. Use `pixiedust.installPackage(...)` with a spark-packages.org string, Maven repo info, or a URL to a jar to install the package/jar.
2. Restart the notebook kernel.
3. Run `import pixiedust`.
4. Call `installPackage()` again to load the installed package.

Note: Results are mixed. For instance, the GraphFrames example that Pixiedust uses does not function even after it loads properly. The following example, however, does function (as of 12 June 2017, at least).

In [2]:
pixiedust.installPackage("TargetHolding:pyspark-cassandra:0.3.5")

Package already installed: TargetHolding:pyspark-cassandra:0.3.5


<pixiedust.packageManager.package.Package at 0x1126c9128>

**NOTE:** For the initial call of the above, be sure to select Kernel--->Restart **AND** run the initial `import pixiedust` cell and the above cell again before proceeding to the cells below.

In [3]:
pixiedust.printAllPackages()

julioasotodv:spark-df-profiling:1.1.2 => /Users/jschless/pixiedust/data/libs/spark-df-profiling-1.1.2.jar
TargetHolding:pyspark-cassandra:0.3.5 => /Users/jschless/pixiedust/data/libs/pyspark-cassandra-0.3.5.jar
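
The listing above pairs each package coordinate with the jar it resolved to. As a minimal sketch, assuming the `coordinate => path` line format shown in this output and assuming the text has been copied out of the cell output by hand, the hypothetical helper `parse_package_listing` below (not part of Pixiedust) turns that text into a dictionary, which makes it easy to check whether a package is already present before calling `installPackage()` again.

```python
def parse_package_listing(listing):
    """Map each 'group:artifact:version' coordinate to its jar path.

    Hypothetical helper: it only understands the 'coordinate => path'
    lines shown in the printAllPackages() output above.
    """
    packages = {}
    for line in listing.splitlines():
        # Skip blank lines and anything that is not 'coordinate => path'.
        if "=>" not in line:
            continue
        coordinate, jar_path = (part.strip() for part in line.split("=>", 1))
        packages[coordinate] = jar_path
    return packages


# Text copied from the cell output above.
listing = """
julioasotodv:spark-df-profiling:1.1.2 => /Users/jschless/pixiedust/data/libs/spark-df-profiling-1.1.2.jar
TargetHolding:pyspark-cassandra:0.3.5 => /Users/jschless/pixiedust/data/libs/pyspark-cassandra-0.3.5.jar
"""

installed = parse_package_listing(listing)
print("TargetHolding:pyspark-cassandra:0.3.5" in installed)
```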


In [4]:
import pixiedust
import pyspark_cassandra

In [5]:
pyspark_cassandra.conf

<module 'pyspark_cassandra.conf' from '/private/var/folders/kk/lc4tj92149zfrd6ky3hrllfr0000gn/T/spark-9ca78303-0303-4d51-8ab6-e7f8d972c316/userFiles-75073fb0-5c74-4482-bb5e-2f155819b956/pyspark-cassandra-0.3.5.jar/pyspark_cassandra/conf.py'>

# Spark Job Monitor

Pixiedust advertises a built-in Spark job monitor that displays job progress _in situ_, so there is no need to tail logfiles or otherwise. This is a neat feature; however, it is the only one that absolutely **does not** work beyond Spark 2.0. The monitor can be enabled irrespective of version, but beyond 2.0 an ugly error message appears after the "successfully enabled" message (**sigh**). The rest of the notebook should otherwise run fine.
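
One way to avoid the error is to gate the call on the Spark version. The sketch below is only that, a sketch: `job_monitor_supported` and `enable_job_monitor_if_supported` are hypothetical helpers, not Pixiedust functions, and the cut-off at 2.0 simply encodes the observation in the paragraph above rather than anything Pixiedust documents. It assumes the version string starts with `major.minor`, as `sc.version` does in PySpark.

```python
def job_monitor_supported(spark_version):
    """Return True for Spark 2.0.x or older, per the observation above.

    Hypothetical helper; the 2.0 cut-off is this notebook's finding,
    not a documented Pixiedust limit.
    """
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    return (major, minor) <= (2, 0)


def enable_job_monitor_if_supported(pixiedust_module, spark_version):
    """Call enableJobMonitor() only where it is expected to work."""
    if job_monitor_supported(spark_version):
        pixiedust_module.enableJobMonitor()
        return True
    return False


# In a notebook cell, after `import pixiedust`:
#     enable_job_monitor_if_supported(pixiedust, sc.version)
```

Passing the module in as an argument keeps the helper usable outside a notebook, for example when testing it with a stand-in object.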

In [6]:
pixiedust.enableJobMonitor()

Succesfully enabled Spark Job Progress Monitor


Exception in thread Thread-5:
Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/jschless/.virtualenvs/testp3/lib/python3.6/site-packages/pixiedust/utils/sparkJobProgressMonitor.py", line 47, in startSparkJobProgressMonitor
    progressMonitor = SparkJobProgressMonitor()
  File "/Users/jschless/.virtualenvs/testp3/lib/python3.6/site-packages/pixiedust/utils/sparkJobProgressMonitor.py", line 174, in __init__
    self.addSparkListener()
  File "/Users/jschless/.virtualenvs/testp3/lib/python3.6/site-packages/pixiedust/utils/sparkJobProgressMonitor.py", line 203, in addSparkListener
    _env.getTemplate("sparkJobProgressMonitor/addSparkListener.scala").render()
  File 

In [None]:
df3 = pixiedust.sampleData("https://github.com/ibm-cds-labs/open-data/raw/master/cars/cars.csv")

# Visualizations

This is where Pixiedust truly shines. Much like Tableau, Pixiedust takes much of the heavy lifting out of exploratory data analysis by providing a means for trying out different visualizations on a DataFrame. Indeed, one need merely call `display(<DataFrame>)` and be on one's way. No visualization library imports, no advanced settings, just call it and run with it.

## Visualizing a Two-column DataFrame

In [None]:
# Tutorial paste: create a Spark dataframe, passing in some data, and assign it to a variable 
df = sqlContext.createDataFrame(
[("Black", 87),
 ("Red", 13)],
["Colors","%"])

In [None]:
display(df)

## Visualizing an N-column DataFrame

In [None]:
# Slightly modified Tutorial paste
df2 = sqlContext.createDataFrame(
[(2010, 'Air Hockey', 10),
 (2010, 'Curling', 20),
 (2010, 'Kendo', 1),
 (2010, 'Iaido', 2),
 (2010, 'Ninjitsu', 1),
 (2010, 'Ping Pong', 50),
 (2011, 'Air Hockey', 15),
 (2011, 'Curling', 30),
 (2011, 'Kendo', 5),
 (2011, 'Iaido', 10),
 (2011, 'Ninjitsu', 2),
 (2011, 'Ping Pong', 45),
 (2012, 'Air Hockey', 19),
 (2012, 'Curling', 34),
 (2012, 'Kendo', 6),
 (2012, 'Iaido', 11),
 (2012, 'Ninjitsu', 3),
 (2012, 'Ping Pong', 40)],
["year","sport","unique_fans"])

display(df2)
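
When trying out charts on this DataFrame, it helps to know what numbers to expect. The plain-Python sketch below works on the same rows and totals `unique_fans` per `year`. It assumes a chart configured with `year` as the key, `unique_fans` as the value, and a sum aggregation. Those settings are assumptions about the options chosen in the `display()` UI, not something this notebook records.

```python
from collections import defaultdict

# Same rows as df2 above: (year, sport, unique_fans).
rows = [
    (2010, "Air Hockey", 10), (2010, "Curling", 20), (2010, "Kendo", 1),
    (2010, "Iaido", 2), (2010, "Ninjitsu", 1), (2010, "Ping Pong", 50),
    (2011, "Air Hockey", 15), (2011, "Curling", 30), (2011, "Kendo", 5),
    (2011, "Iaido", 10), (2011, "Ninjitsu", 2), (2011, "Ping Pong", 45),
    (2012, "Air Hockey", 19), (2012, "Curling", 34), (2012, "Kendo", 6),
    (2012, "Iaido", 11), (2012, "Ninjitsu", 3), (2012, "Ping Pong", 40),
]

# Sum unique_fans for each year, as a sum-aggregated bar chart would.
fans_by_year = defaultdict(int)
for year, _sport, unique_fans in rows:
    fans_by_year[year] += unique_fans

for year in sorted(fans_by_year):
    print(year, fans_by_year[year])
```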

## Using sample Pixiedust data set to explore car data

In [None]:
# Another tutorial paste, interesting cars data set
df3 = pixiedust.sampleData("https://github.com/ibm-cds-labs/open-data/raw/master/cars/cars.csv")
display(df3)

## Using sample Pixiedust data set to explore Boston crime data

In [3]:
# Not a tutorial paste, fun dataset
bostonCrime = pixiedust.sampleData(7)
display(bostonCrime)

# Scala Bridge

Pixiedust provides a means for sharing Python variables and data with Scala, and Scala variables and data with Python. Scala code is entered after issuing the Pixiedust cell magic `%%scala`.

## Python ---> Scala
In the example below, borrowed from a tutorial, Python variables are created in one cell and then printed from a Scala cell. Strings must be defined using double quotes.

In [None]:
dog="Weechee"
person="Jason"

In [None]:
%%scala
println(s"$person has a dog named $dog")

## Scala ---> Python

In this example, a DataFrame is created within the Scala magic and the resulting DataFrame is shown.

In [None]:
%%scala
// Slightly modified Tutorial paste
//Reuse the sqlContext object available in the python scope
val c = sqlContext.asInstanceOf[org.apache.spark.sql.SQLContext]
import c.implicits._

val __dfFromScala = Seq(
    (2010, "Air Hockey", 10),
    (2010, "Curling", 20),
    (2010, "Kendo", 1),
    (2010, "Iaido", 2),
    (2010, "Ninjitsu", 1),
    (2010, "Ping Pong", 50),
    (2011, "Air Hockey", 15),
    (2011, "Curling", 30),
    (2011, "Kendo", 5),
    (2011, "Iaido", 10),
    (2011, "Ninjitsu", 2),
    (2011, "Ping Pong", 45),
    (2012, "Air Hockey", 19),
    (2012, "Curling", 34),
    (2012, "Kendo", 6),
    (2012, "Iaido", 11),
    (2012, "Ninjitsu", 3),
    (2012, "Ping Pong", 40)).toDF("year","sport","unique_fans")
     
__dfFromScala.show

Finally, the DataFrame from Scala is converted to a Python DataFrame. Note: the Pixiedust tutorials do not include this conversion; however, it appears to be necessary.

In [None]:
# _java2py is a private PySpark helper; here it wraps the JVM DataFrame as a Python DataFrame
from pyspark.mllib.common import _java2py

pythonDF = _java2py(sc, __dfFromScala)

display(pythonDF)