# Spark to DocumentDB Connector

Connecting Apache Spark to Azure DocumentDB accelerates your ability to solve fast-moving data science problems, because data can be quickly persisted to and retrieved from Azure DocumentDB. With the Spark to DocumentDB connector, you can more easily handle scenarios including (but not limited to) fast IoT workloads, updatable columns when performing analytics, push-down predicate filtering, and advanced analytics and data science against fast-changing data in a geo-replicated, managed document store with guaranteed SLAs for consistency, availability, low latency, and throughput.

The Spark to DocumentDB connector uses the [Azure DocumentDB Java SDK](https://github.com/Azure/azure-documentdb-java) and follows the flow shown in this diagram:

<img style="align: left;" src="https://raw.githubusercontent.com/dennyglee/notebooks/master/images/Azure-DocumentDB-Spark_Connector_600x266.png">



The data flow is as follows:

1. A connection is made from the Spark master node to the DocumentDB gateway node to obtain the partition map. Note that the user only specifies the Spark and DocumentDB connections; the fact that these resolve to the respective master and gateway nodes is transparent to the user.
2. The partition map is returned to the Spark master node. At this point, the query can be parsed to determine which DocumentDB partitions (and their locations) need to be accessed.
3. This information is transmitted to the Spark worker nodes.
4. The Spark worker nodes can then connect directly to the DocumentDB partitions, extract the data that is needed, and bring it back into the Spark partitions on the worker nodes.
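
The four steps above can be modelled in a few lines of plain Scala. This is only an illustrative sketch, not the connector's implementation: `DocDbPartition`, `relevantPartitions`, and `assign` are hypothetical names, and describing each partition by the set of partition-key values it holds is a simplifying assumption.

```scala
// Hypothetical model of the flow above; none of these types come from the connector.
final case class DocDbPartition(id: Int, host: String, partitionKeys: Set[String])

object PartitionPlanSketch {
  // Steps 1-2: the driver receives the partition map and keeps only the
  // partitions that can contain rows matching the query predicate.
  def relevantPartitions(partitionMap: Seq[DocDbPartition], key: String): Seq[DocDbPartition] =
    partitionMap.filter(_.partitionKeys.contains(key))

  // Steps 3-4: hand the surviving partitions to the workers round-robin, so each
  // worker knows which DocumentDB partitions it should read from directly.
  def assign(partitions: Seq[DocDbPartition], workers: Seq[String]): Map[String, Seq[DocDbPartition]] = {
    require(workers.nonEmpty, "at least one worker is required")
    partitions.zipWithIndex
      .groupBy { case (_, index) => workers(index % workers.size) }
      .map { case (worker, assigned) => worker -> assigned.map(_._1) }
  }
}
```

The point of the sketch is that only the driver talks to the gateway, while the bulk data transfer happens in parallel between workers and partitions.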


In [1]:
%%configure
{ "jars": ["wasb:///example/jars/azure-documentdb-1.9.6.jar","wasb:///example/jars/azure-documentdb-spark-0.0.1.jar"],
  "conf": {
    "spark.jars.excludes": "org.scala-lang:scala-reflect"
   }
}

In [2]:
// Import Time libraries
import org.joda.time._
import org.joda.time.format._

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
2,application_1489125951706_0001,spark,idle,Link,Link,✔


SparkSession available as 'spark'.
import org.joda.time.format._

In [3]:
// Import Spark to DocumentDB Connector
import com.microsoft.azure.documentdb.spark.schema._
import com.microsoft.azure.documentdb.spark._
import com.microsoft.azure.documentdb.spark.config.Config

// Connect to DocumentDB Database
val readConfig2 = Config(Map("Endpoint" -> "https://doctorwho.documents.azure.com:443/",
"Masterkey" -> "le1n99i1w5l7uvokJs3RT5ZAH8dc3ql7lx2CG0h0kK4lVWPkQnwpRLyAN0nwS1z4Cyd1lJgvGUfMWR3v8vkXKA==",
"Database" -> "DepartureDelays",
"preferredRegions" -> "Central US;East US 2;",
"Collection" -> "flights_pcoll", 
"SamplingRatio" -> "1.0"))

readConfig2: com.microsoft.azure.documentdb.spark.config.Config = com.microsoft.azure.documentdb.spark.config.ConfigBuilder$$anon$1@4848afe
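
The cell above embeds the endpoint and master key directly in the notebook. One possible alternative, sketched here under the assumption that the credentials are exported as environment variables, is to build the same settings map from the environment. `ReadSettingsSketch`, `DOCDB_ENDPOINT`, and `DOCDB_MASTERKEY` are hypothetical names chosen for this example; the remaining keys and values are copied from the cell above.

```scala
object ReadSettingsSketch {
  // Builds the same key/value settings used above, but takes the endpoint and
  // master key from the environment instead of the notebook source.
  // DOCDB_ENDPOINT and DOCDB_MASTERKEY are hypothetical variable names.
  def fromEnv(env: Map[String, String] = sys.env): Either[String, Map[String, String]] =
    for {
      endpoint <- env.get("DOCDB_ENDPOINT").toRight("DOCDB_ENDPOINT is not set")
      key      <- env.get("DOCDB_MASTERKEY").toRight("DOCDB_MASTERKEY is not set")
    } yield Map(
      "Endpoint"         -> endpoint,
      "Masterkey"        -> key,
      "Database"         -> "DepartureDelays",
      "preferredRegions" -> "Central US;East US 2;",
      "Collection"       -> "flights_pcoll",
      "SamplingRatio"    -> "1.0"
    )
}
```

A `Right` result holds a map that could be passed to `Config(...)` in the same way as the literal map above; a `Left` names the missing variable.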

In [4]:
// Create collection connection 
val coll = spark.sqlContext.read.DocumentDB(readConfig2)
coll.createOrReplaceTempView("c")

## Query 1: Flights departing from Seattle (Top 100)

In [7]:
// Run, get row count, and time query
var query = "SELECT c.date, c.delay, c.distance, c.origin, c.destination FROM c WHERE c.origin = 'SEA' LIMIT 100"
val start = new DateTime()
val df = spark.sql(query)
df.count()
val end = new DateTime()
val duration = new Duration(start, end)

// Register the DataFrame as a temporary view named "df"
df.createOrReplaceTempView("df")

// Print out duration of query
PeriodFormat.getDefault().print(duration.toPeriod())

res30: String = 1 second and 195 milliseconds
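
The start/end timing pattern in this cell is repeated for each query in this notebook. As a minimal sketch, it could be factored into a small helper; `TimingSketch.timed` is a hypothetical name and uses `System.nanoTime` rather than Joda-Time, so it is a substitute for the approach above and not the notebook's own method.

```scala
object TimingSketch {
  // Runs `body` once and returns its result together with the elapsed
  // wall-clock time in milliseconds.
  def timed[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    val elapsedMs = (System.nanoTime() - start) / 1000000L
    (result, elapsedMs)
  }
}
```

In the notebook this might be used as `val (rows, ms) = TimingSketch.timed(spark.sql(query).count())`. The timed expression needs to include an action such as `count()`, because `spark.sql` alone only builds a lazy query plan.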

In [8]:
%%sql
select * from df limit 10

## Query 2: Flights departing from Seattle

In [9]:
// Run, get row count, and time query
var query = "SELECT c.date, c.delay, c.distance, c.origin, c.destination FROM c WHERE c.origin = 'SEA'"
val start = new DateTime()
val df = spark.sql(query)
df.count()
val end = new DateTime()
val duration = new Duration(start, end)

// Register the DataFrame as a temporary view named "df"
df.createOrReplaceTempView("df")

// Print out duration of query
PeriodFormat.getDefault().print(duration.toPeriod())

res39: String = 1 second and 465 milliseconds

### Determine the number of flights departing from Seattle (in this dataset)

In [10]:
%%sql
select count(1) from df

### Total delay grouped by destination
With Spark SQL and DocumentDB you can go beyond simple `count` queries and easily run `GROUP BY` aggregations.

In [11]:
%%sql
select destination, sum(delay) as TotalDelay from df group by destination order by sum(delay) desc limit 10
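
To make the semantics of the query above explicit, here is a sketch of the same aggregation over an in-memory Scala collection. `Flight` and `TotalDelaySketch` are hypothetical names that mirror two of the selected columns; this runs without Spark or DocumentDB and is only meant to show what the `GROUP BY`, `ORDER BY ... DESC`, and `LIMIT` clauses compute.

```scala
// Hypothetical row type mirroring two of the columns selected above.
final case class Flight(destination: String, delay: Int)

object TotalDelaySketch {
  // Equivalent of:
  //   select destination, sum(delay) ... group by destination
  //   order by sum(delay) desc limit n
  def topByTotalDelay(flights: Seq[Flight], n: Int): Seq[(String, Int)] =
    flights
      .groupBy(_.destination)                                      // GROUP BY destination
      .map { case (dest, rows) => dest -> rows.map(_.delay).sum }  // sum(delay)
      .toSeq
      .sortBy { case (_, total) => -total }                        // ORDER BY ... DESC
      .take(n)                                                     // LIMIT n
}
```

Note that negative delays (early departures) reduce a destination's total, exactly as they do in the SQL version.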

### Get distinct ordered destination airports departing from Seattle

In [15]:
%%sql
select distinct destination from df order by destination limit 5

### Top 5 delayed destination cities departing from Seattle (by Total Delay)

In [17]:
%%sql
select destination, sum(delay) as TotalDelay
from df 
where delay > 0 
group by destination 
order by sum(delay) desc limit 5

### Calculate median delays by destination cities departing from Seattle

In [22]:
%%sql
select destination, percentile_approx(delay, 0.5) as median_delay 
from df 
where delay < 0 
group by destination 
order by percentile_approx(delay, 0.5)

## Query 3: Access all data (1.4M rows)

In [23]:
// Run, get row count, and time query
var query = "SELECT c.date, c.delay, c.distance, c.origin, c.destination FROM c"
val start = new DateTime()
val df = spark.sql(query)
df.count()
val end = new DateTime()
val duration = new Duration(start, end)

// Register the DataFrame as a temporary view named "df"
df.createOrReplaceTempView("df")

// Print out duration of query
PeriodFormat.getDefault().print(duration.toPeriod())

res59: String = 10 seconds and 424 milliseconds