## Load Sparkmagic

In [1]:
%load_ext sparkmagic.magics

## Manage Spark

    1) Add an endpoint: enter the Livy server URL and ignore the username and password fields
    2) Create a session. The session parameters are taken from ~/.sparkmagic/config.json

Example session properties:

```json
{"executorCores": 4, 
 "proxyUser": "bernhard", 
 "conf": {
   "spark.master": "yarn-cluster", 
   "spark.jars.packages": "com.databricks:spark-csv_2.10:1.5.0"
 }, 
 "driverMemory": "2G"
}
```
This ensures that jobs are executed on the Hadoop cluster as user "bernhard", with 4 cores per executor and 2G of driver memory.
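
As a rough illustration of how these properties could be kept in the configuration file, the sketch below builds the same properties as a Python dictionary and serializes them to JSON. It assumes that sparkmagic reads session defaults from a `session_configs` key in `~/.sparkmagic/config.json`; check this against your installed version before relying on it. The sketch only prints the fragment and does not write any file.

```python
import json

# Session properties copied from the example above.
session_properties = {
    "executorCores": 4,            # cores per executor, not the number of executors
    "proxyUser": "bernhard",       # user the job runs as on the cluster
    "conf": {
        "spark.master": "yarn-cluster",
        "spark.jars.packages": "com.databricks:spark-csv_2.10:1.5.0",
    },
    "driverMemory": "2G",
}

# Assumption: session defaults live under the "session_configs" key.
config_fragment = {"session_configs": session_properties}

# Serialize and parse back to confirm the fragment is valid JSON.
config_text = json.dumps(config_fragment, indent=2)
roundtrip = json.loads(config_text)

print(config_text)
```

Round-tripping through `json.loads` is a cheap way to catch a stray comma or quote before pasting the fragment into the real configuration file.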

In [2]:
%manage_spark

Added endpoint http://banach.local:8998
Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
14,,spark,idle,,,✔


SparkContext available as 'sc'.
HiveContext available as 'sqlContext'.


## Using Spark with Scala

In [3]:
sc.version

res2: String = 1.6.2

In [4]:
val df = sqlContext.read.
                    format("com.databricks.spark.csv").
                    option("header", "true").
                    option("inferSchema", "true").
                    load("/tmp/iris.csv").
                    cache
df.registerTempTable("iris")

In [5]:
df.show()

+-----------+----------+-----------+----------+-----------+
|sepalLength|spealWidth|petalLength|petalWidth|    species|
+-----------+----------+-----------+----------+-----------+
|        5.1|       3.5|        1.4|       0.2|Iris-setosa|
|        4.9|       3.0|        1.4|       0.2|Iris-setosa|
|        4.7|       3.2|        1.3|       0.2|Iris-setosa|
|        4.6|       3.1|        1.5|       0.2|Iris-setosa|
|        5.0|       3.6|        1.4|       0.2|Iris-setosa|
|        5.4|       3.9|        1.7|       0.4|Iris-setosa|
|        4.6|       3.4|        1.4|       0.3|Iris-setosa|
|        5.0|       3.4|        1.5|       0.2|Iris-setosa|
|        4.4|       2.9|        1.4|       0.2|Iris-setosa|
|        4.9|       3.1|        1.5|       0.1|Iris-setosa|
|        5.4|       3.7|        1.5|       0.2|Iris-setosa|
|        4.8|       3.4|        1.6|       0.2|Iris-setosa|
|        4.8|       3.0|        1.4|       0.1|Iris-setosa|
|        4.3|       3.0|        1.1|    

## Use the session to issue SQL commands

In [6]:
%%spark -s jupyter-1 -c sql
select min(sepalLength) as min, max(sepalLength) as max 
from iris 
group by species
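
To make clear what this query computes, here is a minimal plain-Python sketch of the same aggregation. It uses only the first five `sepalLength` values shown by `df.show()` above, so a single species appears; the function name `min_max_by_species` is made up for this illustration. Note that the dictionary keeps the species label as its key, whereas the SQL query itself does not select `species`.

```python
# (sepalLength, species) pairs copied from the first five rows of df.show() above.
rows = [
    (5.1, "Iris-setosa"),
    (4.9, "Iris-setosa"),
    (4.7, "Iris-setosa"),
    (4.6, "Iris-setosa"),
    (5.0, "Iris-setosa"),
]


def min_max_by_species(rows):
    """Return {species: (min, max)} of sepalLength, mirroring GROUP BY species."""
    result = {}
    for sepal_length, species in rows:
        # Start each group with the first value seen for that species.
        lo, hi = result.get(species, (sepal_length, sepal_length))
        result[species] = (min(lo, sepal_length), max(hi, sepal_length))
    return result


summary = min_max_by_species(rows)
print(summary)  # {'Iris-setosa': (4.6, 5.1)}
```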

## Make the result usable in Python as a Pandas DataFrame

The `-o` flag stores the result of the Spark query in a Pandas DataFrame called `irisDf`; `--maxrows 150` caps the number of rows pulled back at 150.

In [8]:
%%spark -s jupyter-1 -c sql -o irisDf --maxrows 150
select * from iris

## Working in the local Python environment
Use `%%local` to run Python code in the local notebook environment ...

In [9]:
%%local
type(irisDf)

pandas.core.frame.DataFrame

... and visualize the data using Bokeh

In [10]:
%%local

from bokeh.io import push_notebook, show, output_notebook
from bokeh.layouts import row
from bokeh.plotting import figure
output_notebook()

In [11]:
%%local
colormap = {'Iris-setosa': 'red', 'Iris-versicolor': 'green', 'Iris-virginica': 'blue'}
colors = [colormap[x] for x in irisDf['species']]

p = figure(title = "Iris Morphology")
p.xaxis.axis_label = 'Petal Length'
p.yaxis.axis_label = 'Petal Width'

p.circle(irisDf["petalLength"], irisDf["petalWidth"],
         color=colors, fill_alpha=0.2, size=10)

show(p)
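
The color lookup `colormap[x]` in the cell above raises a `KeyError` if a species label is missing from `colormap`. A minimal sketch of a more forgiving lookup follows; the helper name `species_colors`, the fallback color `gray`, and the label `Iris-unknown` are hypothetical and serve only to illustrate the idea.

```python
colormap = {'Iris-setosa': 'red', 'Iris-versicolor': 'green', 'Iris-virginica': 'blue'}


def species_colors(species, default="gray"):
    """Map each species label to a color, using `default` for unknown labels."""
    return [colormap.get(name, default) for name in species]


# The last label is hypothetical and not part of the iris data.
labels = ['Iris-setosa', 'Iris-virginica', 'Iris-unknown']
colors = species_colors(labels)
print(colors)  # ['red', 'blue', 'gray']
```

In the notebook, `species_colors(irisDf['species'])` could replace the list comprehension, since iterating over a Pandas Series yields its values.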