<span style="color:blue">Thanks for using Drogon for your interactive Spark application. We update Drogon/SparkMagic as often as possible to make it easier, faster and more reliable for you. Have a question or feedback? Ping us on [uChat](https://uchat.uberinternal.com/uber/channels/spark).</span>

What's New
- Now you can use `%%configure` and `%%spark` magics to configure and start a Spark session (deprecating hard-to-use `%load_ext sparkmagic.magics` and `manage_spark` magics). Check out [this example](https://workbench.uberinternal.com/explore/knowledge/localfile/cwang/sparkmagic_python2_example.ipynb) for more details.
- Improved `%%configure` magic. You can now use it to set all Spark and Drogon configurations from within the notebook itself. Check out our [latest documentation & examples](https://docs.google.com/document/d/1mkYtDHquh4FjqTeA0Fxii8lyV-P6qzmoABhmmRwm_00/edit#heading=h.xn14pmoorsn0) for more details.
- Bug fixes and performance updates.


In [107]:
%%configure -f
{
  "pyFiles": [],
  "kind": "spark",
  "proxyUser": "radhesh",
  "sparkEnv": "SPARK_24",
  "driverMemory": "8g",
  "queue": "uber_eats_ml",
  "conf": {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.initialExecutors": 100,
    "spark.dynamicAllocation.minExecutors": 100,
    "spark.dynamicAllocation.maxExecutors": 200,
    "spark.executor.memory": "12g",
    "spark.executor.memoryOverhead": "4g",
    "spark.driver.memory": "12g",
    "spark.driver.memoryOverhead": "4g",
    "spark.hadoop.hadoop.security.authentication": "simple",
    "spark.shuffle.service.enabled": true,
    "spark.sql.shuffle.partitions": 500
  },
  "executorCores": 2,
  "driverCores": 2,
  "executorMemory": "12g",
  "jars": [
    "/user/radhesh/spark-corenlp-0.4.0-spark2.4-scala2.11.jar",
    "/user/radhesh/stanford-corenlp-3.9.1-models.jar"
  ],
  "drogonHeaders": {
    "X-DROGON-CLUSTER": "phx2/secure"
  }
}

Starting Spark application (can take 60s or more)...
Starting heartbeat thread...done.
Waiting for Drogon session to be ready......................................................
Drogon session is ready.


Drogon Session ID,Spark Application ID,Kind,State,Spark UI,Driver log
569853823,application_1674066407532_1916582,spark,idle,Link,Link


SparkSession available as 'spark'.


Cell execution took 109 seconds.


In [108]:
spark

res1: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@22f49911

In [109]:
var data = spark.sql("SELECT * FROM uber_eats.semantic_tags_grocery_data");

data: org.apache.spark.sql.DataFrame = [item_name: string, semantic_tags: string]

In [None]:
data.show()

In [111]:
data.count()

res3: Long = 2377850

In [112]:
// Split semantic_tags on ":" and emit one row per individual tag.
data = data.withColumn("tag", explode(split(col("semantic_tags"), ":")))

data: org.apache.spark.sql.DataFrame = [item_name: string, semantic_tags: string ... 1 more field]

In [113]:
data.show()

+--------------------+--------------------+--------------------+
|           item_name|       semantic_tags|                 tag|
+--------------------+--------------------+--------------------+
|Nimbus Sauvignon ...|     wine:white wine|                wine|
|Nimbus Sauvignon ...|     wine:white wine|          white wine|
|Dempster's Hambur...|dairy deli egg br...|dairy deli egg bread|
|Dempster's Hambur...|dairy deli egg br...|              bakery|
|Dempster's Hambur...|dairy deli egg br...|           bread bun|
|Dempster's Hambur...|dairy deli egg br...|               bread|
|Dempster's Hambur...|dairy deli egg br...|      milk bread egg|
|Dempster's Hambur...|dairy deli egg br...|             grocery|
|   Munchos Chips 60g|snack:chip:salty ...|               snack|
|   Munchos Chips 60g|snack:chip:salty ...|                chip|
|   Munchos Chips 60g|snack:chip:salty ...|         salty snack|
|   Munchos Chips 60g|snack:chip:salty ...|           candy bag|
|   Munchos Chips 60g|sna

In [114]:
data.count()

res5: Long = 23150151
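
The jump in row count from 2377850 to 23150151 follows from how `explode(split(...))` works: every input row is repeated once per `:`-separated tag. A minimal plain-Scala sketch of the same reshaping, with no Spark dependency, is shown below. The `Item` and `Tagged` case classes and the sample rows are hypothetical stand-ins for illustration, not data read from the table.

```scala
// Hypothetical sample rows shaped like the table above; not real table data.
case class Item(itemName: String, semanticTags: String)
case class Tagged(itemName: String, semanticTags: String, tag: String)

val items = Seq(
  Item("sample white wine", "wine:white wine"),
  Item("sample chips", "snack:chip:salty snack")
)

// Plain-collection analogue of
//   withColumn("tag", explode(split(col("semantic_tags"), ":")))
// Each input row yields one output row per ":"-separated tag.
val tagged = items.flatMap { item =>
  item.semanticTags.split(":").toSeq.map { tag =>
    Tagged(item.itemName, item.semanticTags, tag)
  }
}

tagged.foreach(println)
```

Two input rows with two and three tags produce five output rows, which is the same fan-out that multiplies the table's row count.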

In [115]:
import org.apache.spark.sql.functions._
import com.databricks.spark.corenlp.functions._

import com.databricks.spark.corenlp.functions._

In [116]:
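// Lemmatize each tag with the spark-corenlp lemma function; the result is an array of lemmas per tag.
var lemmas = data.withColumn("tag_lemmas", lemma(col("tag")))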

lemmas: org.apache.spark.sql.DataFrame = [item_name: string, semantic_tags: string ... 2 more fields]

In [117]:
lemmas.count()

res6: Long = 23150151

In [118]:
lemmas.show()

+--------------------+--------------------+--------------------+--------------------+
|           item_name|       semantic_tags|                 tag|          tag_lemmas|
+--------------------+--------------------+--------------------+--------------------+
|Dollarama Xmas-EN...|paper tableware:c...|     paper tableware|  [paper, tableware]|
|Dollarama Xmas-EN...|paper tableware:c...|           christmas|         [christmas]|
|         Turkey Subs|dog:poultry:meat ...|                 dog|               [dog]|
|         Turkey Subs|dog:poultry:meat ...|             poultry|           [poultry]|
|         Turkey Subs|dog:poultry:meat ...|meat seafood plan...|[meat, seafood, p...|
|         Turkey Subs|dog:poultry:meat ...|    homemade pot pie|[homemade, pot, pie]|
|         Turkey Subs|dog:poultry:meat ...|             healthy|           [healthy]|
|         Turkey Subs|dog:poultry:meat ...|salami pepperoni ...|[salami, pepperon...|
|         Turkey Subs|dog:poultry:meat ...|           

In [119]:
lemmas = lemmas.select(col("tag"),col("item_name"),col("tag_lemmas"))

lemmas: org.apache.spark.sql.DataFrame = [tag: string, item_name: string ... 1 more field]

In [120]:
lemmas.show()

+--------------------+--------------------+--------------------+
|                 tag|           item_name|          tag_lemmas|
+--------------------+--------------------+--------------------+
|              coffee|Frappuccino Mocha...|            [coffee]|
|            beverage|Frappuccino Mocha...|          [beverage]|
|  medicine treatment|Walgreens Waterpr...|[medicine, treatm...|
|       support brace|Walgreens Waterpr...|    [support, brace]|
|         honey syrup|Clover natural honey|      [honey, syrup]|
|               casey|Casey's Sour Wate...|             [casey]|
|             maynard|Casey's Sour Wate...|           [maynard]|
|               candy|Casey's Sour Wate...|             [candy]|
|     chocolate sweet|Casey's Sour Wate...|  [chocolate, sweet]|
|maynard jolly ran...|Casey's Sour Wate...|[maynard, jolly, ...|
|                baby|Enfamil - Poly - ...|              [baby]|
|vitamin mineral s...|Enfamil - Poly - ...|[vitamin, mineral...|
|          healthcare|Enf

In [121]:
lemmas.count()

res9: Long = 23150151

In [122]:
lemmas = lemmas.withColumn("size", size($"tag_lemmas"))

lemmas: org.apache.spark.sql.DataFrame = [tag: string, item_name: string ... 2 more fields]

In [123]:
// Keep only tags with more than 5 lemmas, longest first.
var long_tags = lemmas.filter(col("size") > 5).sort(col("size").desc)

long_tags: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [tag: string, item_name: string ... 2 more fields]

In [124]:
long_tags.show()

+--------------------+--------------------+--------------------+----+
|                 tag|           item_name|          tag_lemmas|size|
+--------------------+--------------------+--------------------+----+
|cold sinus heartb...|Monistat 3 Combo ...|[cold, sinus, hea...|  14|
|cold sinus heartb...|Boiron's Oscilloc...|[cold, sinus, hea...|  14|
|cold sinus heartb...|   Motrin IB (6 Tab)|[cold, sinus, hea...|  14|
|cold sinus heartb...|Boiron's Oscilloc...|[cold, sinus, hea...|  14|
|cold sinus heartb...|Boiron's Oscilloc...|[cold, sinus, hea...|  14|
|cold sinus heartb...|        Monistat - 1|[cold, sinus, hea...|  14|
|cold sinus heartb...|   Motrin IB 100 ct |[cold, sinus, hea...|  14|
|cold sinus heartb...|MONISTAT 7 DAY CO...|[cold, sinus, hea...|  14|
|cold sinus heartb...|         Rosti bites|[cold, sinus, hea...|  14|
|cold sinus heartb...|          Motrin 6ct|[cold, sinus, hea...|  14|
|cold sinus heartb...|Motrin - I b u p ...|[cold, sinus, hea...|  14|
|cold sinus heartb..

In [125]:
long_tags.count()

res11: Long = 116738
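
The filter that leaves 116738 rows keeps only tags whose lemma array has more than five elements, then orders them by length. The sketch below mimics that logic on plain Scala collections. It makes two loud assumptions: the rows are hypothetical, and whitespace tokenization stands in for the CoreNLP `lemma` function, which it does not reproduce.

```scala
// Hypothetical rows. Whitespace splitting is only a stand-in for the
// CoreNLP lemma() function used in the notebook.
case class TagRow(tag: String, itemName: String) {
  val tagLemmas: Seq[String] = tag.split(" ").toSeq
  val size: Int = tagLemmas.length
}

val rows = Seq(
  TagRow("coffee", "item A"),
  TagRow("cold sinus heartburn allergy pain relief", "item B"),
  TagRow("honey syrup", "item C"),
  TagRow("vitamin mineral supplement for baby and toddler care", "item D")
)

// Plain-collection analogue of
//   filter(col("size") > 5).sort(col("size").desc)
val longTags = rows.filter(_.size > 5).sortBy(r => -r.size)

longTags.foreach(r => println(s"${r.size}\t${r.itemName}\t${r.tag}"))
```

Only the six-token and eight-token tags survive the filter, and the longer one sorts first.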

In [128]:
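// Deduplicate (item_name, tag) pairs; distinct() is not guaranteed to preserve the earlier sort order.
long_tags = long_tags.select(col("item_name"), col("tag")).distinct()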

long_tags: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [item_name: string, tag: string]

In [129]:
long_tags

res14: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [item_name: string, tag: string]

In [130]:
long_tags.count()

org.apache.spark.SparkException: Job 14 cancelled
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2017)
  at org.apache.spark.scheduler.DAGScheduler.handleJobCancellation(DAGScheduler.scala:1952)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2222)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2205)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2194)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:752)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2250)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2271)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2290)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2315)
  at org.apache.

In [None]:
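// insertInto requires an existing table and matches columns by position, not by name,
// so the target is expected to have its columns in the order (item_name, tag).
long_tags.write.mode("overwrite").insertInto("tmp.eats_semantic_tags_long")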