<span style="color:blue">Thanks for using Drogon for your interactive Spark application. We update Drogon/SparkMagic as often as possible to make it easier, faster and more reliable for you. Have a question or feedback? Ping us on [uChat](https://uchat.uberinternal.com/uber/channels/spark).</span>

What's New
- Now you can use `%%configure` and `%%spark` magics to configure and start a Spark session (deprecating hard-to-use `%load_ext sparkmagic.magics` and `manage_spark` magics). Check out [this example](https://workbench.uberinternal.com/explore/knowledge/localfile/cwang/sparkmagic_python2_example.ipynb) for more details.
- Improved `%%configure` magic. You can now use it to set all Spark and Drogon configurations from within the notebook itself. Check out our [latest documentation & examples](https://docs.google.com/document/d/1mkYtDHquh4FjqTeA0Fxii8lyV-P6qzmoABhmmRwm_00/edit#heading=h.xn14pmoorsn0) for more details.
- Bug fixes and performance updates.


In [None]:
%%configure -f
{
  "pyFiles": [], 
  "kind": "pyspark", 
  "proxyUser": "dhruven.vora", 
  "sparkEnv": "SPARK_24",
  "queue": "maps-popularity-routing",
  "numExecutors": 1000,
  "driverMemory": "12g",
  "executorMemory": "12g",
  "driverCores": 4,
  "executorCores": 1, 
  "jars": [], 
  "conf": {
    "spark.executor.memoryOverhead": "3g",
    "spark.driver.memoryOverhead": "3g",
    "spark.driver.maxResultSize": "10g",
    "hive.exec.dynamic.partition": "true",
    "hive.exec.dynamic.partition.mode": "nonstrict",
    "spark.locality.wait": "6s",
    "spark.maxRemoteBlockSizeFetchToMem": "200m",
    "spark.driver.extraJavaOptions": "-XX:+UseCompressedOops -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps",
    "spark.executor.extraJavaOptions": "-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=6 -XX:+UseCompressedOops -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC",
    "spark.sql.autoBroadcastJoinThreshold":-1
    }, 
  "drogonHeaders": {
    "X-DROGON-CLUSTER": "PHX2/Secure"  
  }    
}
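
As a rough sanity check on the configuration above, the sketch below totals the memory and cores the session asks for. It assumes each container is sized as heap plus `memoryOverhead` (the usual YARN-style accounting) and ignores any cluster-side rounding; `to_gib` and `requested_resources` are hypothetical helper names, not part of SparkMagic or Drogon.

```python
import re


def to_gib(size):
    """Convert a Spark-style size string such as '12g' or '200m' to GiB."""
    match = re.fullmatch(r"(\d+)([mg])", size.strip().lower())
    if match is None:
        raise ValueError(f"unsupported size string: {size!r}")
    value, unit = int(match.group(1)), match.group(2)
    return value if unit == "g" else value / 1024


def requested_resources(cfg):
    """Total memory (GiB) and cores requested, assuming heap + overhead per container."""
    conf = cfg["conf"]
    per_executor = to_gib(cfg["executorMemory"]) + to_gib(conf["spark.executor.memoryOverhead"])
    driver = to_gib(cfg["driverMemory"]) + to_gib(conf["spark.driver.memoryOverhead"])
    return {
        "memory_gib": cfg["numExecutors"] * per_executor + driver,
        "cores": cfg["numExecutors"] * cfg["executorCores"] + cfg["driverCores"],
    }


# Only the fields relevant to sizing, copied from the %%configure cell above.
cfg = {
    "numExecutors": 1000,
    "driverMemory": "12g",
    "executorMemory": "12g",
    "driverCores": 4,
    "executorCores": 1,
    "conf": {
        "spark.executor.memoryOverhead": "3g",
        "spark.driver.memoryOverhead": "3g",
    },
}

totals = requested_resources(cfg)
print(totals)
```

Under these assumptions the request comes to 1000 executors at 15 GiB each plus a 15 GiB driver, which is worth comparing against the queue's capacity before starting the session.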

In [None]:
%%spark

In [None]:
## Load segment speeds for bicycles on TomTom data
## start date: 2022-04-04
## end date: 2022-05-02
segments = spark.sql("""
 with all_data as (

 SELECT 
        msg.classification, 
        msg.durationms*1.0/1000 as durations,
        msg.graphsegment.segmentuuid,
        msg.graphsegment.startjunctionuuid,
        msg.graphsegment.endjunctionuuid,
        msg.segmentfeature.roadclass,
        msg.latitude,
        msg.longitude,
        msg.jobuuid,
        msg.supplyuuid,
        msg.tasktype,
        (case when msg.speedkmph < 1 then 1 when msg.speedkmph > 150 then 150 else msg.speedkmph end)*1.0/3.6 as speedms,
        msg.lengthmeters,
        msg.cityid

    FROM 
        rawdata_user.kafka_hp_maps_historical_streaks_tomtom_nodedup
        where 1=1
        and datestr between '2022-04-04' and '2022-05-02'
        and msg.classification = 'valid'
        and msg.vehicletype = 'BICYCLE'
        and msg.lengthmeters > 0
        and msg.speedkmph > 0
)

, segments as (
SELECT 
startjunctionuuid
,segmentuuid
,endjunctionuuid
,avg(latitude) as latitude
,avg(longitude) as longitude
,LEAST(72,GREATEST(1,count(*)/sum(1/speedms))) as harmonic_mean_ms
from all_data
group by segmentuuid,startjunctionuuid,endjunctionuuid
having count(*) >= 5
)
select * from segments
""")
segments.cache()
segments.createOrReplaceTempView("segments")
spark.sql("""
insert overwrite table tmp.bike_model_tomtom_global select * from segments
"""
)
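
The per-segment aggregation in the query above can be sketched in plain Python to make the arithmetic explicit. Speeds are clipped to 1..150 km/h and converted to m/s, segments with fewer than 5 observations are dropped, and the harmonic mean `count / sum(1 / speed)` is clamped to 1..72. For repeated traversals of the same segment, the harmonic mean of the speeds equals the segment length divided by the mean travel time, which is presumably why it is used here instead of the arithmetic mean. `clip` and `segment_speed_ms` are hypothetical names used only for illustration.

```python
def clip(value, low, high):
    """Clamp value into the closed interval [low, high]."""
    return max(low, min(high, value))


def segment_speed_ms(speeds_kmph, min_obs=5):
    """Mirror the SQL aggregation for one segment's speed observations.

    Returns None when the segment has fewer than min_obs observations,
    matching the `having count(*) >= 5` filter.
    """
    if len(speeds_kmph) < min_obs:
        return None
    # Same clipping and unit conversion as the `speedms` column.
    speeds_ms = [clip(s, 1, 150) / 3.6 for s in speeds_kmph]
    # Harmonic mean: count(*) / sum(1 / speedms).
    harmonic = len(speeds_ms) / sum(1.0 / s for s in speeds_ms)
    # Same bounds as LEAST(72, GREATEST(1, ...)).
    return clip(harmonic, 1, 72)


steady = segment_speed_ms([18, 18, 18, 18, 18])       # five rides at 18 km/h
mixed = segment_speed_ms([10, 20, 20, 20, 20])        # one slow ride pulls the mean down
too_few = segment_speed_ms([18, 18, 18])              # below the observation threshold
print(steady, mixed, too_few)
```

Five observations at 18 km/h come out at about 5 m/s, and a single slow traversal lowers the result more than it would lower an arithmetic mean.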

In [None]:
# show segments snippet and count
segments.show()
segments.count()

In [None]:
# write the historical model to HDFS location
segments.write.format("parquet").mode("overwrite").save("hdfs:///user/dhruven.vora/tomtom_bicycle_20220504_parquet")

In [None]:
## Load segment speeds for bicycles on OSM data
## start date: 2022-04-04
## end date: 2022-05-02
segments = spark.sql("""
 with all_data as (

 SELECT 
        msg.classification, 
        msg.durationms*1.0/1000 as durations,
        msg.graphsegment.segmentuuid,
        msg.graphsegment.startjunctionuuid,
        msg.graphsegment.endjunctionuuid,
        msg.segmentfeature.roadclass,
        msg.latitude,
        msg.longitude,
        msg.jobuuid,
        msg.supplyuuid,
        msg.tasktype,
        (case when msg.speedkmph < 1 then 1 when msg.speedkmph > 150 then 150 else msg.speedkmph end)*1.0/3.6 as speedms,
        msg.lengthmeters,
        msg.cityid

    FROM 
        rawdata_user.kafka_hp_maps_historical_streaks_osm_nodedup
        where 1=1
        and datestr between '2022-04-04' and '2022-05-02'
        and msg.classification = 'valid'
        and msg.vehicletype = 'BICYCLE'
        and msg.lengthmeters > 0
        and msg.speedkmph > 0
)

, segments as (
SELECT 
startjunctionuuid
,segmentuuid
,endjunctionuuid
,avg(latitude) as latitude
,avg(longitude) as longitude
,LEAST(72,GREATEST(1,count(*)/sum(1/speedms))) as harmonic_mean_ms
from all_data
group by segmentuuid,startjunctionuuid,endjunctionuuid
having count(*) >= 5
)
select * from segments
""")
segments.cache()
segments.createOrReplaceTempView("segments")
spark.sql("""
insert overwrite table tmp.bike_model_osm_global select * from segments
"""
)

In [None]:
# show segments snippet and count
segments.show()
segments.count()

In [None]:
# write the historical model to HDFS location
segments.write.format("parquet").mode("overwrite").save("hdfs:///user/dhruven.vora/osm_bicycle_20220504_parquet")
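
The TomTom and OSM cells above differ only in the source table, the target table, and the output path, so the query could be built from one template. The sketch below is one possible way to do that, not how the notebook is currently written. It keeps only the columns the aggregation actually reads (an assumption that the unused columns in `all_data` can be dropped without changing the result), and `build_segments_query` is a hypothetical helper name.

```python
import re
from datetime import date

QUERY_TEMPLATE = """
with all_data as (
    select
        msg.graphsegment.segmentuuid,
        msg.graphsegment.startjunctionuuid,
        msg.graphsegment.endjunctionuuid,
        msg.latitude,
        msg.longitude,
        (case when msg.speedkmph < 1 then 1
              when msg.speedkmph > 150 then 150
              else msg.speedkmph end) * 1.0 / 3.6 as speedms
    from {source_table}
    where datestr between '{start}' and '{end}'
      and msg.classification = 'valid'
      and msg.vehicletype = 'BICYCLE'
      and msg.lengthmeters > 0
      and msg.speedkmph > 0
)
select
    startjunctionuuid,
    segmentuuid,
    endjunctionuuid,
    avg(latitude) as latitude,
    avg(longitude) as longitude,
    least(72, greatest(1, count(*) / sum(1 / speedms))) as harmonic_mean_ms
from all_data
group by segmentuuid, startjunctionuuid, endjunctionuuid
having count(*) >= 5
"""


def build_segments_query(source_table, start, end):
    """Return the segment-speed query for one source table and date range."""
    # Accept only plain dotted identifiers so the table name cannot inject SQL.
    if re.fullmatch(r"\w+(\.\w+)*", source_table) is None:
        raise ValueError(f"unexpected table name: {source_table!r}")
    # Parsing the dates rejects malformed input and lets us check the order.
    start_d, end_d = date.fromisoformat(start), date.fromisoformat(end)
    if start_d > end_d:
        raise ValueError("start date must not be after end date")
    return QUERY_TEMPLATE.format(
        source_table=source_table, start=start_d.isoformat(), end=end_d.isoformat()
    )


osm_query = build_segments_query(
    "rawdata_user.kafka_hp_maps_historical_streaks_osm_nodedup",
    "2022-04-04",
    "2022-05-02",
)
# In the notebook this would be passed to Spark: segments = spark.sql(osm_query)
print(osm_query)
```

The output column order matches the original cells, which matters because the `insert overwrite ... select *` statements map columns by position.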