# Spark Notebook
## Interactive Data Analysis backed by Apache Spark

The Spark Notebook is the open source notebook focused on improving the productivity of data analytics and data science in enterprise environments.

To that end, it is built solely on JVM components and has no other dependencies.

It uses Scala as its main language to benefit from the JVM ecosystem blended with a functional approach and full access to all Spark features. The Spark Notebook enriches the data analytics experience with reactive charts and plotting elements that provide visual insights into the data.

---
## Multiple Independent Spark Contexts

One of the most useful features of the Spark Notebook is the isolation between running notebooks.

Each started notebook spawns a new JVM with its own `SparkSession` instance. This gives maximal flexibility to:
* use dependencies without clashes
* access different clusters
* tune each notebook differently
* schedule notebooks externally (on the roadmap)

You can easily spot the spawned processes using `ps` (*unix* only): search for the main class `ChildProcessMain` and verify that the process command line contains the name of the started notebook.

In [ ]:
import sys.process._
import scala.language.postfixOps
"ps aux" #| "grep ChildProcessMain" lineStream_!

import sys.process._
import scala.language.postfixOps
res5: Stream[String] = Stream(maasg     5360  0.5  8.5 9478024 1379380 pts/1 Sl+  Sep25   4:57 /usr/lib/jvm/java-8-oracle/jre/bin/java -Xmx3674210304 -XX:MaxPermSize=1073741824 -server notebook.kernel.pfork.ChildProcessMain notebook.kernel.remote.RemoteActorProcess 35989 info 94cbf08e-1547-418e-b5f1-d0fcd36cd3eb ngrams/language-detection-letter-freq.snb NONE kernel, ?)
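
The same filtering can be sketched on plain strings, which helps when you want the notebook names rather than the raw `ps` lines. This is a minimal illustration: the object name, the helper and the shortened sample lines are assumptions made for the example, not part of the Spark Notebook API.

```scala
object ChildProcessLines {
  // Sample lines in the shape of the `ps aux` output shown above (columns shortened).
  val sample: List[String] = List(
    "maasg 5360 0.5 8.5 java -server notebook.kernel.pfork.ChildProcessMain " +
      "notebook.kernel.remote.RemoteActorProcess 35989 info " +
      "94cbf08e-1547-418e-b5f1-d0fcd36cd3eb ngrams/language-detection-letter-freq.snb NONE kernel",
    "maasg 1234 0.0 0.1 bash"
  )

  // Keep only the notebook kernels, then extract the token that ends in ".snb".
  def notebookNames(psLines: Seq[String]): Seq[String] =
    psLines
      .filter(_.contains("ChildProcessMain"))
      .flatMap(_.split("\\s+").find(_.endsWith(".snb")))
}
```

In a notebook cell, the lines could come from `"ps aux".lineStream_!` instead of the hard-coded sample.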


Each notebook declares the variables `sparkSession` and `sparkContext` (with its alias `sc`).

In [ ]:
val context = (sparkSession, sparkContext, sc)

context: (org.apache.spark.sql.SparkSession, org.apache.spark.SparkContext, org.apache.spark.SparkContext) = (org.apache.spark.sql.SparkSession@1988ee55,org.apache.spark.SparkContext@662a71ba,org.apache.spark.SparkContext@662a71ba)


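Internally, these are declared as `var`s. Since Scala only allows imports from a stable identifier, we need a little trick when we want to import implicits (typical when using `DataFrame`s and `Dataset`s): assign the session to a `val` first.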

In [ ]:
val spark = sparkSession
import spark.implicits._

spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@1988ee55
import spark.implicits._
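
The reason for the extra `val` can be reproduced without Spark. The sketch below uses a stand-in `Session` class (hypothetical, only mimicking the shape of `SparkSession.implicits`) to show that importing from a `var` is rejected while importing from a `val` works.

```scala
object StableImportDemo {
  // Stand-in for SparkSession: a class exposing an `implicits` object.
  class Session {
    object implicits {
      implicit val defaultGreeting: String = "hello"
    }
  }

  // Like the notebook's `sparkSession`, this is declared as a `var`.
  var session: Session = new Session

  // import session.implicits._   // rejected by the compiler: stable identifier required

  // Binding the current value to a `val` gives a stable path to import from.
  val stable: Session = session
  import stable.implicits._

  // Needs an implicit String in scope, which the import above provides.
  def greet(implicit greeting: String): String = greeting

  val result: String = greet
}
```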


### Spark Configuration

The notebook metadata can define a JSON object (String to String only!) that declares extra configuration for Spark.

In [ ]:
sparkSession.conf.get("spark.default.parallelism")

res12: String = 4


### Managing Library Dependencies
#### Enrich your notebook with any library available out there

This notebook has injected a few dependencies from the [datastax cassandra connector](https://github.com/datastax/spark-cassandra-connector/).

Hence, the following import compiles.

In [ ]:
import com.datastax.spark.connector._                                    

import com.datastax.spark.connector._


It also includes the Kafka external module for the current Scala version (using `%%`) and the current Spark version (using `_`).

In [ ]:
import org.apache.spark.streaming.kafka

import org.apache.spark.streaming.kafka


Dependencies can also be removed by prepending `-` to their definition. Here this avoids downloading any extra Scala language libraries.
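
To make the two placeholders concrete, here is a rough sketch of how such a definition could be expanded. The `resolve` helper, the `Coordinate` class and the version numbers used in the usage note are hypothetical; this is not the notebook's actual resolver, only an illustration of the convention that `%%` appends the Scala binary version to the artifact name and `_` stands for the running Spark version.

```scala
object DependencyCoordinates {
  final case class Coordinate(group: String, artifact: String, version: String)

  // Hypothetical expansion of "group %% artifact % version":
  //  - `%%` appends the Scala binary version to the artifact name
  //  - a `_` version is replaced by the current Spark version
  def resolve(spec: String, scalaBinary: String, sparkVersion: String): Coordinate =
    spec.trim.split("\\s+").toList match {
      case group :: sep :: artifact :: "%" :: version :: Nil if sep == "%" || sep == "%%" =>
        val fullArtifact = if (sep == "%%") s"${artifact}_$scalaBinary" else artifact
        val fullVersion  = if (version == "_") sparkVersion else version
        Coordinate(group, fullArtifact, fullVersion)
      case _ =>
        throw new IllegalArgumentException(s"Unsupported dependency spec: $spec")
    }
}
```

For example, with an assumed Scala binary version `2.11` and Spark version `2.0.1`, `resolve("org.apache.spark %% spark-streaming-kafka % _", "2.11", "2.0.1")` yields the artifact `spark-streaming-kafka_2.11` at version `2.0.1`.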

### Change the metadata

Several metadata fields are available; you can edit them from the menu: _Edit > Edit Notebook Metadata_.

---
## Plotting

There are many predefined `Chart`s that you can use directly on any kind of **Scala** container that can be iterated.

If the last statement of a cell isn't an assignment or a definition, the Spark Notebook will automatically try to plot it the best way it can.

In [ ]:
case class Example(id:Int, category:String, value:Long, advanced:Boolean)
import scala.util.Random._

val categories = List("drama", "musical", "thriller", "comedy", "horror", "action")
def category:String = shuffle(categories).head
val examples = List.fill(100)(Example(nextInt(2000), category, nextInt(2000).toLong, nextBoolean))

defined class Example
import scala.util.Random._
categories: List[String] = List(drama, musical, thriller, comedy, horror, action)
category: String
examples: List[Example] = List(Example(440,thriller,313,false), Example(1209,action,1288,false), Example(1066,horror,632,false), Example(1344,comedy,840,false), Example(1442,horror,413,false), Example(104,musical,648,false), Example(1453,musical,449,true), Example(509,thriller,1103,true), Example(1399,action,333,true), Example(363,drama,839,true), Example(188,drama,1770,true), Example(1566,thriller,135,true), Example(184,thriller,895,true), Example(1459,horror,1208,true), Example(714,thriller,1574,false), Example(1672,action,882,true), Example(108,comedy,1126,false), Example(1447,action,575,false), Example(1675,drama,1545,false), Example(62,...

The above cell doesn't plot anything since it ends with an assignment.

In [ ]:
examples

res16: List[Example] = List(Example(440,thriller,313,false), Example(1209,action,1288,false), Example(1066,horror,632,false), Example(1344,comedy,840,false), Example(1442,horror,413,false), Example(104,musical,648,false), Example(1453,musical,449,true), Example(509,thriller,1103,true), Example(1399,action,333,true), Example(363,drama,839,true), Example(188,drama,1770,true), Example(1566,thriller,135,true), Example(184,thriller,895,true), Example(1459,horror,1208,true), Example(714,thriller,1574,false), Example(1672,action,882,true), Example(108,comedy,1126,false), Example(1447,action,575,false), Example(1675,drama,1545,false), Example(62,comedy,1510,false), Example(229,action,1495,true), Example(1465,comedy,1808,true), Example(1428,comedy,680,true), Example(532,musical,1654,false), Exam...

Now we have `TableChart` and `PivotChart` tabs for the data, which we can use to get a better feel for it.

We can of course create them ourselves:

In [ ]:
TableChart(examples)

res18: notebook.front.widgets.charts.TableChart[List[Example]] = <TableChart widget>


### Grouping plots

The available charts include common ones such as:
* `LineChart`
* `ScatterChart`
* `BarChart`

These accept at least two other parameters:
* `fields`: the two field names to plot
* `groupField`: the field used to group the data

In [ ]:
val hitsPerCategory = spark.createDataset(examples).groupBy($"category").agg(sum("value") as "hits")
BarChart(hitsPerCategory, fields=Some(("category", "hits")))

hitsPerCategory: org.apache.spark.sql.DataFrame = [category: string, hits: bigint]
res34: notebook.front.widgets.charts.BarChart[org.apache.spark.sql.DataFrame] = <BarChart widget>
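
For intuition, the aggregation fed to the `BarChart` above can be reproduced on plain Scala collections. This is only a sketch on a small hand-written sample (not the random data generated earlier), and the object and method names are illustrative.

```scala
object HitsPerCategory {
  final case class Example(id: Int, category: String, value: Long, advanced: Boolean)

  // Small hand-written sample, chosen so the totals are easy to check by eye.
  val sample: List[Example] = List(
    Example(1, "drama", 10L, advanced = true),
    Example(2, "comedy", 5L, advanced = false),
    Example(3, "drama", 7L, advanced = false)
  )

  // Group by category, then sum `value` within each group.
  def hits(xs: Seq[Example]): Map[String, Long] =
    xs.groupBy(_.category).map { case (category, group) =>
      category -> group.map(_.value).sum
    }
}
```

Here `HitsPerCategory.hits(HitsPerCategory.sample)` gives 17 hits for `drama` and 5 for `comedy`, which is the kind of (category, hits) pair the `fields` parameter selects.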


In [ ]:

ScatterChart(examples, fields=Some(("id", "value")), groupField=Some("advanced"))

res22: notebook.front.widgets.charts.ScatterChart[List[Example]] = <ScatterChart widget>


In [ ]:
BarChart(examples, fields=Some(("id", "value")), groupField=Some("category"))

res40: notebook.front.widgets.charts.BarChart[List[Example]] = <BarChart widget>


---
## Graphs

Graphs are a common way to represent data where connections matter. Hence the Spark Notebook defines an API that eases the definition of `Node`s and `Edge`s.

* `Graph[T]`: abstract class defining a graph component with an id of type `T`, a value of type `Any` and a color
* `Node[T]`: defines a node as a circle with a configurable radius and position ($x$, $y$), which is either initial or static if the node is fixed
* `Edge[T]`: defines an edge using the ids of both ends

In [ ]:
import scala.util.Random._
case class GraphExample(id:Int, cluster:Char, value:Long)
val clusters = ('A' to 'D').toList
val cluCol = clusters.zip(List("#000", "#478", "#127", "#984", "#F5A")).toMap
val gexamples = List.tabulate(10, 4)((i,j) => GraphExample(i*4+j, clusters(j), nextLong)).flatten

val nodes = gexamples.map(e => notebook.front.widgets.magic.Node(e.id, e, cluCol(e.cluster), 5))

val clustered = gexamples.groupBy(_.cluster).toList
val connectedClusters = clustered.flatMap { case (c, cl) =>
  for {
    a <- cl
    b <- cl if a != b
  } yield notebook.front.widgets.magic.Edge[Int](400+nextInt(400)+a.id+b.id, (a.id, b.id), "intra", "red")
}

import scala.util.Random._
defined class GraphExample
clusters: List[Char] = List(A, B, C, D)
cluCol: scala.collection.immutable.Map[Char,String] = Map(A -> #000, B -> #478, C -> #127, D -> #984)
gexamples: List[GraphExample] = List(GraphExample(0,A,3912371022920081814), GraphExample(1,B,6412170878345702211), GraphExample(2,C,9204147210451903646), GraphExample(3,D,-5469523725007552212), GraphExample(4,A,1400166045456678603), GraphExample(5,B,117433689862059653), GraphExample(6,C,-2565296159209085778), GraphExample(7,D,-566587921581094860), GraphExample(8,A,6628864983314821656), GraphExample(9,B,-2115657522211612322), GraphExample(10,C,-2017054813330680589), GraphExample(11,D,5255636672555945057), GraphExample(12,A,855834809839472974), GraphExample(13,B,6543978274141533960), GraphExample...
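
As a quick sanity check on the comprehension above (a sketch on plain integers, independent of the notebook API): the 40 examples fall into 4 clusters of 10, and each cluster contributes one edge per ordered pair of distinct members, so `connectedClusters` should hold 4 × 10 × 9 = 360 edges. The object and method names below are illustrative.

```scala
object IntraClusterEdges {
  // Closed form: ordered pairs (a, b) with a != b inside each cluster.
  def count(clusterSizes: Seq[Int]): Int =
    clusterSizes.map(n => n * (n - 1)).sum

  // The same count obtained by enumerating the pairs, as the for-comprehension does.
  def enumerate(clusterSizes: Seq[Int]): Int =
    clusterSizes.map { n =>
      val members = (0 until n).toList
      (for { a <- members; b <- members if a != b } yield (a, b)).size
    }.sum
}
```

Both directions of each pair are counted, which matches the comprehension: it emits an edge for (a, b) and another for (b, a).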

In [ ]:
val singleConnections = {
  val s = gexamples.take(4)
  
  
  for (a <- s; b <- s if a != b) 
    yield Edge(800+nextInt(400)+a.id+b.id, (a.id, b.id), "inter", "green")
}

val all = nodes ::: connectedClusters ::: singleConnections

singleConnections: List[notebook.front.widgets.magic.Edge[Int]] = List(Edge(938,(0,1),inter,green), Edge(819,(0,2),inter,green), Edge(1105,(0,3),inter,green), Edge(837,(1,0),inter,green), Edge(909,(1,2),inter,green), Edge(933,(1,3),inter,green), Edge(1056,(2,0),inter,green), Edge(822,(2,1),inter,green), Edge(810,(2,3),inter,green), Edge(1018,(3,0),inter,green), Edge(880,(3,1),inter,green), Edge(1187,(3,2),inter,green))
all: List[Product with Serializable with notebook.front.widgets.magic.Graph[Int]] = List(Node(0,GraphExample(0,A,3912371022920081814),#000,5,None,false), Node(1,GraphExample(1,B,6412170878345702211),#478,5,None,false), Node(2,GraphExample(2,C,9204147210451903646),#127,5,None,false), Node(3,GraphExample(3,D,-5469523725007552212),#984,5,None,false), Node(4,GraphExample(4,A,...

In [ ]:
GraphChart(all, maxPoints = 1000, sizes=(600, 600))

res46: notebook.front.widgets.charts.GraphChart[List[Product with Serializable with notebook.front.widgets.magic.Graph[Int]]] = <GraphChart widget>


---
## Geo charts

There are two types of geo charts:
* `GeoPointsChart` for simple latitude/longitude points
* `GeoChart` for _GeoJSON_ or _opengis_ data


### GeoPointsChart

Let's load some airport data with latitude and longitude coordinates.

In [ ]:
val root = "/home/maasg/projects/data-fellas/scala-for-data-science"
val airportsDF = sparkSession.read.json(s"$root/notebooks/airports.json")
airportsDF.cache
airportsDF

root: String = /home/maasg/projects/data-fellas/scala-for-data-science
airportsDF: org.apache.spark.sql.DataFrame = [airport: string, city: string ... 5 more fields]
res62: org.apache.spark.sql.DataFrame = [airport: string, city: string ... 5 more fields]


In [ ]:
val statsDF = airportsDF.groupBy("state").count.orderBy($"count".desc).limit(5)

statsDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [state: string, count: bigint]


Convert to a `Dataset` (just for fun).

In [ ]:
case class StateStat(state:String, count:Long)

defined class StateStat


In [ ]:
statsDF.as[StateStat]

res66: org.apache.spark.sql.Dataset[StateStat] = [state: string, count: bigint]


Plot the `DataFrame` with a dedicated color and radius for each state.

In [ ]:
import org.apache.spark.sql.functions._

import org.apache.spark.sql.functions._


In [ ]:
def forStates[A](xs:List[A]) = when($"state" === "AK", xs(0))
                               .when($"state" === "TX", xs(1))
                               .when($"state" === "CA", xs(2))
                               .when($"state" === "OK", xs(3))
                               .when($"state" === "OH", xs(4))
                               .otherwise(xs(5))
val airportsDFWithStyles = airportsDF.withColumn("r", forStates(List(10,9,8,7,6,1)))
                                     .withColumn("c", forStates(List("red","orange","blue","green","yellow","white")))
GeoPointsChart(airportsDFWithStyles, latLonFields=Some(("lat", "long")), rField = Some("r"), colorField = Some("c"))

forStates: [A](xs: List[A])org.apache.spark.sql.Column
airportsDFWithStyles: org.apache.spark.sql.DataFrame = [airport: string, city: string ... 7 more fields]
res69: notebook.front.widgets.charts.GeoPointsChart[org.apache.spark.sql.DataFrame] = <GeoPointsChart widget>
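
The `when`/`otherwise` chain above amounts to a positional lookup with a default in the last slot. A plain-Scala sketch of the same rule follows; it is not Spark code, and `StateStyles` and `forState` are names introduced only for this illustration.

```scala
object StateStyles {
  // States in the same order as the `when` clauses above.
  private val ranked = List("AK", "TX", "CA", "OK", "OH")

  // Pick the style at the state's position, or the last entry for any other state.
  def forState[A](state: String, styles: List[A]): A = {
    require(styles.size == ranked.size + 1, "expected one style per listed state plus a default")
    val idx = ranked.indexOf(state)
    if (idx >= 0) styles(idx) else styles.last
  }
}
```

With the radius list `List(10,9,8,7,6,1)`, `"TX"` maps to `9` and any state outside the five listed falls back to `1`.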


### GeoChart

Fetch some data from the web about parks and gardens.

In [ ]:
:sh wget http://data.cyc.opendata.arcgis.com/datasets/57fa576e5e8149b0b744f768e01e5ce1_0.geojson -O Parks_and_Gardens.geojson

--2016-10-03 20:23:24--  http://data.cyc.opendata.arcgis.com/datasets/57fa576e5e8149b0b744f768e01e5ce1_0.geojson
Resolving data.cyc.opendata.arcgis.com (data.cyc.opendata.arcgis.com)... 52.71.54.58, 52.45.55.59
Connecting to data.cyc.opendata.arcgis.com (data.cyc.opendata.arcgis.com)|52.71.54.58|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/json]
Saving to: ‘Parks_and_Gardens.geojson’

     0K .......... .......... .......... .......... .......     149K=0,3s

2016-10-03 20:23:25 (149 KB/s) - ‘Parks_and_Gardens.geojson’ saved [48157]


import sys.process._




Parse it as GeoJSON using the provided `widgets.parseGeoJSON`.

In [ ]:
val geoJSONRepr = widgets.parseGeoJSON(scala.io.Source.fromFile("Parks_and_Gardens.geojson").getLines.mkString(""))

geoJSONRepr: org.wololo.geojson.GeoJSON = {"type":"FeatureCollection","features":[{"type":"Feature","geometry":{"type":"Polygon","coordinates":[[[-1.0586561844469051,53.95603309272234],[-1.05854399466353,53.95603230358679],[-1.0583760626436196,53.956040512630246],[-1.0581615796381862,53.9560514114069],[-1.0578934080246933,53.956069347443375],[-1.0575324639452257,53.95610587354096],[-1.0572491469247918,53.95613797567649],[-1.0569805467357019,53.956174865720776],[-1.0567978185678244,53.95620398210952],[-1.0566095948767782,53.956237188881765],[-1.0563513871859314,53.95628682886303],[-1.055997890932263,53.956366180073815],[-1.0557105892387082,53.95643544719732],[-1.0554643721163497,53.95651042797886],[-1.054881757917829,53.95669741874451],[-1.0548229580333113,53.95661375072628],[-1.05479500...

Fetch some more vector data for the same area.

In [ ]:
:sh wget http://data.cyc.opendata.arcgis.com/datasets/9b212b7af275438ca9088ff868bda139_9.geojson -O airqual.geojson

--2016-10-03 20:23:27--  http://data.cyc.opendata.arcgis.com/datasets/9b212b7af275438ca9088ff868bda139_9.geojson
Resolving data.cyc.opendata.arcgis.com (data.cyc.opendata.arcgis.com)... 52.71.54.58, 52.45.55.59
Connecting to data.cyc.opendata.arcgis.com (data.cyc.opendata.arcgis.com)|52.71.54.58|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/json]
Saving to: ‘airqual.geojson’

     0K .......... .......... .......... .......... ..........  118K
    50K .......... .......... .......... .......... ..........  128K
   100K .......... .......... .......... .......... .......... 1,10M
   150K .......... .......... .......... .......... ..........  301K
   200K .......... .......... .......... .......... ..........  248K
   250K .......... .......... .......                          1,18M=1,3s

2016-10-03 20:23:28 (222 KB/s) - ‘airqual.geojson’ saved [284153]


import sys.process._




And parse it...

In [ ]:
val ng = widgets.parseGeoJSON(scala.io.Source.fromFile("airqual.geojson").getLines.mkString(""))

ng: org.wololo.geojson.GeoJSON = {"type":"FeatureCollection","features":[{"type":"Feature","geometry":{"type":"Polygon","coordinates":[[[-1.0993634390145945,53.94953467420566],[-1.0993632127017887,53.94953761160725],[-1.0993626606334983,53.949540533984006],[-1.0993617861083074,53.94954342967616],[-1.0993605924828298,53.949546284327965],[-1.09935908608327,53.94954908720113],[-1.0993572702275516,53.94955182573725],[-1.0993551558316914,53.94955448833359],[-1.0993529937136752,53.94955681981139],[-1.0980501642550011,53.95086841417698],[-1.098049919747874,53.95086865861476],[-1.0980472332209674,53.95087113533479],[-1.0980456879200573,53.95087241532106],[-1.0976919863244108,53.95115453694566],[-1.0976915894502317,53.95115485124434],[-1.0966111506989786,53.95200106454321],[-1.0966106547656387,5...

Create a `GeoChart` instance from the `GeoJSON` representation of the first dataset.

In [ ]:
val gc = GeoChart(Seq(geoJSONRepr), sizes=(800, 800))
gc

gc: notebook.front.widgets.charts.GeoChart[Seq[org.wololo.geojson.GeoJSON]] = <GeoChart widget>
res77: notebook.front.widgets.charts.GeoChart[Seq[org.wololo.geojson.GeoJSON]] = <GeoChart widget>


We can now add the features of the second dataset to the same chart using `addAndApply`, which adds data to an existing chart.

In [ ]:
gc.addAndApply(Seq(ng))

---
## Fancy charts

### Radar

Let's grab some data from http://www.basketball-reference.com/teams/SAS/2016.html (31st May 2016).

In [ ]:
case class TeamMember(Player:String, Age:Int, FG_pc:Double, _3P_pc:Double, _2P_pc:Double, eFG_pc:Double, FT_pc:Double)
val team = 
      s"""
        1	Kawhi Leonard	24	72	72	2380	551	1090	.506	129	291	.443	422	799	.528	.565	292	334	.874	95	398	493	186	128	71	105	133	1523
        2	LaMarcus Aldridge	30	74	74	2261	536	1045	.513	0	16	.000	536	1029	.521	.513	259	302	.858	176	456	632	110	38	81	99	151	1331
        3	Danny Green	28	79	79	2062	211	561	.376	116	349	.332	95	212	.448	.480	34	46	.739	48	255	303	141	79	64	75	141	572
        4	Tony Parker	33	72	72	1980	350	710	.493	27	65	.415	323	645	.501	.512	130	171	.760	17	159	176	379	54	11	131	114	857
        5	Patrick Mills	27	81	3	1662	260	612	.425	123	320	.384	137	292	.469	.525	47	58	.810	27	131	158	226	59	6	76	102	690
        6	Tim Duncan	39	61	60	1536	215	441	.488	0	2	.000	215	439	.490	.488	92	131	.702	115	332	447	163	47	78	90	125	522
        7	David West	35	78	19	1404	244	448	.545	3	7	.429	241	441	.546	.548	63	80	.788	72	237	309	143	44	55	68	142	554
        8	Boris Diaw	33	76	4	1386	202	383	.527	25	69	.362	177	314	.564	.560	56	76	.737	58	175	233	176	26	21	97	102	485
        9	Kyle Anderson	22	78	11	1245	138	295	.468	12	37	.324	126	258	.488	.488	62	83	.747	25	219	244	123	60	29	59	97	350
        10	Manu Ginobili	38	58	0	1134	197	435	.453	70	179	.391	127	256	.496	.533	91	112	.813	26	120	146	177	66	11	99	99	555
        11	Jonathon Simmons	26	55	2	813	122	242	.504	18	47	.383	104	195	.533	.541	69	92	.750	16	80	96	58	24	5	53	103	331
        12	Boban Marjanovic	27	54	4	508	105	174	.603	0	0	.0	105	174	.603	.603	87	114	.763	73	121	194	21	12	23	29	54	297
        13	Rasual Butler	36	46	0	432	49	104	.471	15	49	.306	34	55	.618	.543	11	16	.688	3	53	56	24	13	23	8	11	124
        14	Kevin Martin	32	16	1	261	30	85	.353	11	33	.333	19	52	.365	.418	28	30	.933	4	25	29	12	9	2	13	15	99
        15	Ray McCallum	24	31	3	256	27	67	.403	5	16	.313	22	51	.431	.440	9	10	.900	6	25	31	33	5	4	11	14	68
        16	Matt Bonner	35	30	2	206	29	57	.509	15	34	.441	14	23	.609	.640	3	4	.750	3	24	27	9	6	1	3	16	76
        17	Andre Miller	39	13	4	181	23	48	.479	1	4	.250	22	44	.500	.490	9	13	.692	6	21	27	29	7	0	12	14	56
     """.trim.split("\n").map(s => s.trim.split("\t").drop(1).toList).map(x => (x.head, x(1).trim.toInt) → x.drop(2).filter(_.startsWith("."))
                                                                     .map(_.trim.toDouble * 100)).map { case ((p, a), stats)  =>
        TeamMember(p, a, stats(0), stats(1), stats(2), stats(3), stats(4))
      }

defined class TeamMember
team: Array[TeamMember] = Array(TeamMember(Kawhi Leonard,24,50.6,44.3,52.800000000000004,56.49999999999999,87.4), TeamMember(LaMarcus Aldridge,30,51.300000000000004,0.0,52.1,51.300000000000004,85.8), TeamMember(Danny Green,28,37.6,33.2,44.800000000000004,48.0,73.9), TeamMember(Tony Parker,33,49.3,41.5,50.1,51.2,76.0), TeamMember(Patrick Mills,27,42.5,38.4,46.9,52.5,81.0), TeamMember(Tim Duncan,39,48.8,0.0,49.0,48.8,70.19999999999999), TeamMember(David West,35,54.50000000000001,42.9,54.6,54.800000000000004,78.8), TeamMember(Boris Diaw,33,52.7,36.199999999999996,56.39999999999999,56.00000000000001,73.7), TeamMember(Kyle Anderson,22,46.800000000000004,32.4,48.8,48.8,74.7), TeamMember(Manu Ginobili,38,45.300000000000004,39.1,49.6,53.300000000000004,81.3), TeamMember...
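
The one-liner above is dense, so here is the same parsing idea applied to a single row and split into named steps. It is a sketch under the assumption that each row is tab-separated, starts with a rank column, and writes ratio columns with a leading dot; `TeamRowParser` and `parse` are names introduced for the example.

```scala
object TeamRowParser {
  final case class TeamMember(Player: String, Age: Int, FG_pc: Double, _3P_pc: Double,
                              _2P_pc: Double, eFG_pc: Double, FT_pc: Double)

  // First row of the table above, with explicit tab escapes.
  val sampleRow: String =
    "1\tKawhi Leonard\t24\t72\t72\t2380\t551\t1090\t.506\t129\t291\t.443\t422\t799\t.528\t.565" +
      "\t292\t334\t.874\t95\t398\t493\t186\t128\t71\t105\t133\t1523"

  def parse(row: String): TeamMember =
    // Drop the leading rank column, then read name and age.
    row.trim.split("\t").toList.drop(1) match {
      case name :: age :: rest =>
        // Ratio columns are the ones written with a leading dot, e.g. ".506".
        val pcts = rest.filter(_.startsWith(".")).map(_.trim.toDouble * 100)
        TeamMember(name, age.trim.toInt, pcts(0), pcts(1), pcts(2), pcts(3), pcts(4))
      case _ =>
        throw new IllegalArgumentException(s"Unexpected row: $row")
    }
}
```

One limit of selecting columns by their leading dot: a ratio written as `1.000` would be skipped, shifting the remaining percentages, so a position-based selection may be safer on other data.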

In [ ]:
RadarChart(shuffle(team.toList).take(5), labelField=Some("Player"), sizes=(800, 600))

res82: notebook.front.widgets.charts.RadarChart[List[TeamMember]] = <RadarChart widget>


### Pivot

In [ ]:
PivotChart(team)

res84: notebook.front.widgets.charts.PivotChart[Array[TeamMember]] = <PivotChart widget>


### Parallel coordinates

In [ ]:
ParallelCoordChart(team, sizes=(800, 500))

res86: notebook.front.widgets.charts.ParallelCoordChart[Array[TeamMember]] = <ParallelCoordChart widget>


### Timeseries

In [ ]:
:sh wget http://www.ncdc.noaa.gov/cag/time-series/global/globe/land_ocean/p12/12/1880-2015.csv -O /tmp/1880-2015.csv

--2016-10-03 20:23:34--  http://www.ncdc.noaa.gov/cag/time-series/global/globe/land_ocean/p12/12/1880-2015.csv
Resolving www.ncdc.noaa.gov (www.ncdc.noaa.gov)... 205.167.25.171, 205.167.25.172, 2610:20:8040:2::171, ...
Connecting to www.ncdc.noaa.gov (www.ncdc.noaa.gov)|205.167.25.171|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: ‘/tmp/1880-2015.csv’

     0K .......... .......... ...                               124K=0,2s

2016-10-03 20:23:35 (124 KB/s) - ‘/tmp/1880-2015.csv’ saved [23819]


import sys.process._




In [ ]:
import java.util.Calendar
import java.util.Calendar._
val cal = Calendar.getInstance
cal.set(DAY_OF_MONTH, 0)
cal.set(HOUR, 0)
cal.set(MINUTE, 0)
cal.set(SECOND, 0)
cal.set(MILLISECOND, 0)
val ts = scala.io.Source.fromFile("/tmp/1880-2015.csv").getLines.drop(4)
                .map(_.split(",").toList.map(_.trim))
                .map{case List(y,c) => (y.take(4).toInt, y.drop(4).take(2).dropWhile(_ == '0').toInt-1, c.toDouble)}
                .map{ case (y, m, c) => 
                  cal.set(YEAR, y)
                  cal.set(MONTH, m)
                  cal.getTime → c
                }.toList

import java.util.Calendar
import java.util.Calendar._
cal: java.util.Calendar = java.util.GregorianCalendar[time=1449054000000,areFieldsSet=true,areAllFieldsSet=false,lenient=true,zone=sun.util.calendar.ZoneInfo[id="Europe/Brussels",offset=3600000,dstSavings=3600000,useDaylight=true,transitions=184,lastRule=java.util.SimpleTimeZone[id=Europe/Brussels,offset=3600000,dstSavings=3600000,useDaylight=true,startYear=0,startMode=2,startMonth=2,startDay=-1,startDayOfWeek=1,startTime=3600000,startTimeMode=2,endMode=2,endMonth=9,endDay=-1,endDayOfWeek=1,endTime=3600000,endTimeMode=2]],firstDayOfWeek=1,minimalDaysInFirstWeek=1,ERA=1,YEAR=2015,MONTH=11,WEEK_OF_YEAR=49,WEEK_OF_MONTH=1,DAY_OF_MONTH=2,DAY_OF_YEAR=336,DAY_OF_WEEK=4,DAY_OF_WEEK_IN_MONTH=1,AM_PM=1,HOUR=0,HOUR_OF_DAY=12,MINUTE=0,SECOND=0,...
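
The cell above reuses one mutable `Calendar` for every row. As an alternative sketch, `java.time` can build each date independently of the previous one. The line format (`"188001,-0.003"`, a `yyyyMM` stamp followed by a value) is an assumption inferred from the parsing code above, and `MonthlySeries` is a name introduced for this example.

```scala
object MonthlySeries {
  import java.time.{YearMonth, ZoneId}
  import java.util.Date

  // Parse a line assumed to look like "188001,-0.003" into (month, value).
  def parseLine(line: String): (YearMonth, Double) =
    line.split(",").map(_.trim) match {
      case Array(ym, value) =>
        (YearMonth.of(ym.take(4).toInt, ym.slice(4, 6).toInt), value.toDouble)
      case _ =>
        throw new IllegalArgumentException(s"Unexpected line: $line")
    }

  // First instant of the month in the given zone, as a java.util.Date.
  def toDate(ym: YearMonth, zone: ZoneId = ZoneId.systemDefault()): Date =
    Date.from(ym.atDay(1).atStartOfDay(zone).toInstant)
}
```

Mapping each parsed pair through `toDate` gives `(java.util.Date, Double)` tuples, the same element type as `ts`.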

In [ ]:
val tc = TimeseriesChart(ts)

tc: notebook.front.widgets.charts.TimeseriesChart[List[(java.util.Date, Double)]] = <TimeseriesChart widget>


In [ ]:
tc

res92: notebook.front.widgets.charts.TimeseriesChart[List[(java.util.Date, Double)]] = <TimeseriesChart widget>


In [ ]:

val pairs = Seq((49,39,"foo"),(49,22,"foo"),(33,0,"bum"))

pairs: Seq[(Int, Int, String)] = List((49,39,foo), (49,22,foo), (33,0,bum))


---
## Everything is Dynamic and Reactive

Data may arrive live in a system, you may want to log some events visually, or you may need two visual components to interact. What you don't want is to write the HTML, JS and server code yourself, plus whatever else you would need to master.

For that, the Spark Notebook makes charts dynamic, and most (if not all) components can be listened to and can react to events.

### Dynamic Line Chart

In [ ]:
val tsH :: tsS = ts.sliding(100, 100).toList
val dynTC = TimeseriesChart(tsH, maxPoints = ts.size)
dynTC

tsH: List[(java.util.Date, Double)] = List((Wed Dec 31 12:00:00 CET 1879,-0.003), (Tue Mar 02 12:00:00 CET 1880,-0.1286), (Tue Mar 02 12:00:00 CET 1880,-0.1398), (Fri Apr 02 12:00:00 CET 1880,-0.0552), (Sun May 02 12:00:00 CET 1880,-0.0775), (Wed Jun 02 12:00:00 CET 1880,-0.1718), (Fri Jul 02 12:00:00 CET 1880,-0.1545), (Mon Aug 02 12:00:00 CET 1880,-0.0772), (Thu Sep 02 12:00:00 CET 1880,-0.0851), (Sat Oct 02 12:00:00 CET 1880,-0.1827), (Tue Nov 02 12:00:00 CET 1880,-0.2709), (Thu Dec 02 12:00:00 CET 1880,-0.0823), (Sun Jan 02 12:00:00 CET 1881,-0.0239), (Wed Feb 02 12:00:00 CET 1881,-0.0327), (Wed Mar 02 12:00:00 CET 1881,0.03), (Sat Apr 02 12:00:00 CET 1881,0.0696), (Mon May 02 12:00:00 CET 1881,0.0225), (Thu Jun 02 12:00:00 CET 1881,-0.0968), (Sat Jul 02 12:00:00 CET 1881,-0.0401), ...
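
The `sliding(100, 100)` call deserves a note: when the step equals the window size, the windows do not overlap, so the list is cut into consecutive batches, with a shorter last batch if the length is not a multiple of the size. A small sketch on integers (the `Batches` helper is illustrative):

```scala
object Batches {
  // With step == size, `sliding` yields consecutive non-overlapping windows.
  // Returns the first batch and the remaining ones; an empty input gives (Nil, Nil).
  def split[A](xs: List[A], size: Int): (List[A], List[List[A]]) =
    xs.sliding(size, size).toList match {
      case head :: tail => (head, tail)
      case Nil          => (Nil, Nil)
    }
}
```

Unlike the pattern `val tsH :: tsS = ...`, which throws a `MatchError` on an empty list, this version handles the empty case explicitly. On non-empty input, `xs.grouped(size)` produces the same batches.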

In [ ]:
var cont = true
new Thread() {
  // Push one batch per second to the chart until `cont` is set to false
  override def run =
    tsS.foreach { l =>
      if (cont) {
        Thread.sleep(1000)
        dynTC.addAndApply(l)
      }
    }
}.start

cont: Boolean = true


In [ ]:
cont = false

cont: Boolean = false
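
The start/stop pattern used in the two cells above can be packaged as a small helper. The sketch below is one possible shape, not part of the notebook API: `Feeder.start` is a hypothetical name, and it uses an `AtomicBoolean` instead of a plain `var` so that the change made by the stopping cell is guaranteed to be visible to the feeding thread.

```scala
object Feeder {
  import java.util.concurrent.atomic.AtomicBoolean

  // Start a daemon thread that pushes each batch to `sink`, pausing before each one,
  // and stops as soon as `keepGoing` is set to false.
  def start[A](batches: List[A], pauseMillis: Long, keepGoing: AtomicBoolean)(sink: A => Unit): Thread = {
    val t = new Thread(new Runnable {
      def run(): Unit = {
        val it = batches.iterator
        while (keepGoing.get && it.hasNext) {
          Thread.sleep(pauseMillis)
          // Re-check after the pause so a stop request takes effect immediately.
          if (keepGoing.get) sink(it.next())
        }
      }
    })
    t.setDaemon(true)
    t.start()
    t
  }
}
```

In the notebook, the sink would be something like `dynTC.addAndApply(_)`, and stopping becomes `flag.set(false)` on the `AtomicBoolean` passed in.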


### Components

In [ ]:
val rteam = shuffle(team.toList).take(5)
val dd = new DropDown("All" :: rteam.map(_.Player))
val rc = RadarChart(rteam, labelField=Some("Player"), sizes=(800, 600))
val bout = out

dd.selected --> Connection.fromObserver { p =>
  bout(p + " is selected")
  rc.applyOn(rc.originalData.filter(_.Player == p || p == "All"))
}

dd ++ bout ++ rc

rteam: List[TeamMember] = List(TeamMember(Boris Diaw,33,52.7,36.199999999999996,56.39999999999999,56.00000000000001,73.7), TeamMember(Tim Duncan,39,48.8,0.0,49.0,48.8,70.19999999999999), TeamMember(Patrick Mills,27,42.5,38.4,46.9,52.5,81.0), TeamMember(Rasual Butler,36,47.099999999999994,30.599999999999998,61.8,54.300000000000004,68.8), TeamMember(Jonathon Simmons,26,50.4,38.3,53.300000000000004,54.1,75.0))
dd: notebook.front.widgets.DropDown[String] = <DropDown widget>
rc: notebook.front.widgets.charts.RadarChart[List[TeamMember]] = <RadarChart widget>
bout: notebook.front.widgets.OutDiv = <OutDiv widget>
res104: notebook.front.Widget = <widget>


---
## Synchronization

Oh... notebooks are synchronized!

Open another browser window and relaunch the timeseries example.

---
## Create new chart type... live

If you want or need to exercise your JS-fu, you can always use the `Chart` API (for instance) to create new dynamic widget types.

In the following, we'll create a widget that plots duration bars for given operations (identified only by a name). It requires:
* a `js` string holding the JavaScript to execute for the new chart. This code:
  * has to be a function with 3 parameters:
    * `dataO`: a Knockout observable which can be listened to for new incoming data (see the `subscribe` call)
    * `container`: the div element where you can add new elements
    * `options`: an extra object passed to the widget which defines additional configuration options (like a width or a specific color)
  * has a `this` object containing:
    * `dataInit`: the JSON representation of the Scala data, as an array of objects having the same schema as the Scala type
    * `genId`: a unique id that you can use, for instance, for a top-level element

In [ ]:
val js = """
function progressgraph (dataO, container, options) {
  var css = 'div.prog {position: relative; overflow: hidden; } span.pp {display: inline-block; position: absolute; height: 16px;} span.prog {display: inline-block; position: absolute; height: 16px; }' +
            '.progs {border: solid 1px #ccc; background: #eee; } .progs .pv {background: #3182bd; }',
      head = document.head || document.getElementsByTagName('head')[0],
      style = document.createElement('style');

  style.type = 'text/css';
  if (style.styleSheet){
    style.styleSheet.cssText = css;
  } else {
    style.appendChild(document.createTextNode(css));
  }

  head.appendChild(style);


  var width = options.width||600
  var height = options.height||400
  
  function create(name, duration) {
    var div =  d3.select(container).append("div").attr("class", "prog");

    div.append("span").attr("class", "pp prog")
        .style("width", "74px")
        .style("text-align", "right")
        .style("z-index", "2000")
        .text(name);

    div.append("span")
        .attr("class", "progs")
        .style("width", "240px")
        .style("left", "80px")
      .append("span")
        .attr("class", "pp pv")
      .transition()
        .duration(duration)
        .ease("linear")
        .style("width", "350px");

    div.transition()
        .style("height", "20px")
      .transition()
        .delay(duration)
        .style("height", "0px")
        .remove();

  }

  function onData(data) {
    _.each(data, function(d) {
      create(d[options.name], 5000 + d[options.duration])
    });
  }

  onData(this.dataInit);
  dataO.subscribe(onData);
}
""".trim

js: String = 
function progressgraph (dataO, container, options) {
  var css = 'div.prog {position: relative; overflow: hidden; } span.pp {display: inline-block; position: absolute; height: 16px;} span.prog {display: inline-block; position: absolute; height: 16px; }' +
            '.progs {border: solid 1px #ccc; background: #eee; } .progs .pv {background: #3182bd; }',
      head = document.head || document.getElementsByTagName('head')[0],
      style = document.createElement('style');

  style.type = 'text/css';
  if (style.styleSheet){
    style.styleSheet.cssText = css;
  } else {
    style.appendChild(document.createTextNode(css));
  }

  head.appendChild(style);


  var width = options.width||600
  var height = options.height||400
  
  function create(name, duration) {
    var div ...

Now we can create the widget by extending `notebook.front.widgets.charts.Chart[C]`, where `C` is any Scala type; it will be converted to JS using the implicit instance of `ToPoints`.

It has to declare the original dataset, which needs to be a wrapper (`List`, `Array`, ...) of the `C` instances we want to plot. It can also define other things, as shown below:
* `sizes`: the $w \times h$ dimensions of the chart
* `maxPoints`: the number of points to plot; the way they are selected is defined by the implicitly available instance of `Sampler`
* `scripts`: a list of references to existing JavaScript scripts
* `snippets`: a list of strings representing snippets to execute in JS; each takes the form of a JSON object with:
  * `f`: the function to call when the snippet is executed
  * `o`: a JSON object that is passed to the above function at execution time. Here we define which fields are used for the name and the duration.

In [ ]:
import notebook.front.widgets._
import notebook.front.widgets.magic._
import notebook.front.widgets.magic.Implicits._
import notebook.front.widgets.magic.SamplerImplicits._
case class ProgChart[C:ToPoints:Sampler](
  originalData:C,
  override val sizes:(Int, Int)=(600, 400),
  maxPoints:Int = 1000,
  name:String,
  duration:String
) extends notebook.front.widgets.charts.Chart[C](originalData, maxPoints) {
  def mToSeq(t:MagicRenderPoint):Seq[(String, Any)] = t.data.toSeq


  override val snippets = List(s"""|{
                                   |  "f": $js, 
                                   |  "o": {
                                   |    "name": "$name",
                                   |    "duration": "$duration"
                                   |  }
                                   |}
                                  """.stripMargin)
  
  override val scripts = Nil
}

import notebook.front.widgets._
import notebook.front.widgets.magic._
import notebook.front.widgets.magic.Implicits._
import notebook.front.widgets.magic.SamplerImplicits._
defined class ProgChart


We can now define the type of data we'll use for this example.

In [ ]:
case class ProgData(n:String, v:Int)

defined class ProgData


Here we generate some data split into buckets of 10, and we create an instance of the new widget, giving it the first bucket of data and specifying the right field names for `name` and `duration`.

In [ ]:
import scala.util.Random
val pdata = for {
  c1 <- 'a' to 'e'
  c2 <- 'a' to 'e'
} yield ProgData(""+c1+c2, (Random.nextDouble * 10000).toInt)
val pdataH :: pdataS = pdata.toList.sliding(10, 10).toList

val pc = ProgChart(pdataH, name = "n", duration = "v")
pc

import scala.util.Random
pdata: scala.collection.immutable.IndexedSeq[ProgData] = Vector(ProgData(aa,1556), ProgData(ab,504), ProgData(ac,3887), ProgData(ad,4116), ProgData(ae,8454), ProgData(ba,9980), ProgData(bb,1191), ProgData(bc,1007), ProgData(bd,9926), ProgData(be,3612), ProgData(ca,8337), ProgData(cb,8970), ProgData(cc,1357), ProgData(cd,6919), ProgData(ce,1689), ProgData(da,3236), ProgData(db,4788), ProgData(dc,5859), ProgData(dd,8475), ProgData(de,489), ProgData(ea,6193), ProgData(eb,4423), ProgData(ec,7244), ProgData(ed,8974), ProgData(ee,7681))
pdataH: List[ProgData] = List(ProgData(aa,1556), ProgData(ab,504), ProgData(ac,3887), ProgData(ad,4116), ProgData(ae,8454), ProgData(ba,9980), ProgData(bb,1191), ProgData(bc,1007), ProgData(bd,9926), ProgData(be,3612))
pdataS: List[Lis...

We update the chart by passing it the remaining buckets with `addAndApply`.

In [ ]:
var pcont = true
new Thread() {
  // Push the next bucket every 9 seconds until `pcont` is set to false
  override def run =
    pdataS.foreach { l =>
      if (pcont) {
        Thread.sleep(9000)
        pc.addAndApply(l, true)
      }
    }
}.start

pcont: Boolean = true


---
## Contexts with interpolation

In [ ]:
:sh ls ${sys.env("NOTEBOOKS_DIR")}

In [ ]:
val ok = "$\\LaTeX$ interpolated in Scala is $\\Re$"

In [ ]:
:markdown 
Yup, **$ok** in Spark Notebook

In [ ]:
:javascript
alert("I am ${("whoami".!!).trim}")

---
## Metadata

A notebook has a context enriched via its metadata; here are a few important ones.

---
## Logs

Checking logs is always painful when using a notebook, since it is simply a web client for a remote REPL on the server.

Hence the logs are far away or, even worse, inaccessible!

So the Spark Notebook forwards **all logs using slf4j** to the browser console → go check it: press the `F12` key and open the _console_ tab!