

# Frequent Pattern Mining - Semantic Tags


### Spark configuration



In [1]:
%%configure
{
  "pyFiles": [],
  "kind": "spark",
  "proxyUser": "radhesh",
  "sparkEnv": "SPARK_24",
  "driverMemory": "8g",
  "queue": "uber_eats",
  "conf": {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.initialExecutors": 100,
    "spark.dynamicAllocation.minExecutors": 100,
    "spark.dynamicAllocation.maxExecutors": 200,
    "spark.executor.memory": "12g",
    "spark.executor.memoryOverhead": "4g",
    "spark.driver.memory": "12g",
    "spark.driver.memoryOverhead": "4g",
    "spark.hadoop.hadoop.security.authentication": "simple",
    "spark.shuffle.service.enabled": true,
    "spark.sql.shuffle.partitions": 500
  },
  "executorCores": 2,
  "driverCores": 2,
  "executorMemory": "12g",
  "jars": [
    "/user/radhesh/spark-corenlp-0.4.0-spark2.4-scala2.11.jar",
    "/user/radhesh/stanford-corenlp-3.9.1-models.jar"
  ],
  "drogonHeaders": {
    "X-DROGON-CLUSTER": "phx2/secure"
  }
}

In [2]:
spark

Starting Spark application (can take 60s or more)...
Starting heartbeat thread...done.
Waiting for Drogon session to be ready...................................................
Drogon session is ready.


| Drogon Session ID | Spark Application ID | Kind | State | Spark UI | Driver log |
|---|---|---|---|---|---|
| 543342203 | application_1670874807583_3070696 | spark | idle | | |


SparkSession available as 'spark'.


Cell execution took 103 seconds.
res1: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@20afbf90

In [24]:
var input = spark.sql("SELECT DISTINCT semantic_tags FROM uber_eats.semantic_tags_grocery_data")

input: org.apache.spark.sql.DataFrame = [semantic_tags: string]

In [25]:
input.show()

+--------------------+
|       semantic_tags|
+--------------------+
|       spirit:liquor|
|liquor:spirit:alc...|
|default:ale stout...|
|chocolate gum can...|
|cooking sauce kit...|
|granola:energy gr...|
|coffee:tag:unreal...|
|vegetable:packet:...|
|jam jelly:condime...|
|tequila:alcohol:r...|
|alcoholic drink:s...|
|miniature shot:mi...|
|alcohol:spirit:bo...|
|fast noodle:froze...|
|drink pop juice:s...|
|essential:condime...|
|lucky convenience...|
|snack:chip:chip s...|
|white wine:bevera...|
|bourbon:booze:whi...|
+--------------------+
only showing top 20 rows

In [26]:
input.count()

res17: Long = 640677

### Split semantic tags into separate rows, then split each tag into tokens, one token per row

In [27]:
import org.apache.spark.sql.functions.{col, explode, split}
input = input.withColumn("word", explode(split(col("semantic_tags"), ":")))

input: org.apache.spark.sql.DataFrame = [semantic_tags: string, word: string]

In [28]:
input.show()

+--------------------+--------------------+
|       semantic_tags|                word|
+--------------------+--------------------+
|        meat seafood|        meat seafood|
|white wine:gallo ...|          white wine|
|white wine:gallo ...|          gallo wine|
|white wine:gallo ...|      wine wite rise|
|white wine:gallo ...|    chill white wine|
|white wine:gallo ...|   alcholic beverage|
|white wine:gallo ...|   convenience store|
|white wine:gallo ...|     krystal express|
|white wine:gallo ...|spirit champagne ...|
|white wine:gallo ...|     white rise wine|
|white wine:gallo ...|    californian wine|
|white wine:gallo ...|wine champagne pr...|
|white wine:gallo ...|   price weight live|
|white wine:gallo ...|    wine californium|
|white wine:gallo ...|             alcohol|
|white wine:gallo ...|                wine|
|white wine:gallo ...|   white wine bottle|
|white wine:gallo ...|     wine white wine|
|white wine:gallo ...|        alcohol wine|
|white wine:gallo ...|chill whit

In [29]:
input.count()

res19: Long = 3764850

In [30]:
input = input.withColumn("token", explode(split(col("word")," ")))

input: org.apache.spark.sql.DataFrame = [semantic_tags: string, word: string ... 1 more field]

In [31]:
input.show()

+--------------------+---------------+---------+
|       semantic_tags|           word|    token|
+--------------------+---------------+---------+
|beverage:booze:sp...|       beverage| beverage|
|beverage:booze:sp...|          booze|    booze|
|beverage:booze:sp...|         spirit|   spirit|
|beverage:booze:sp...|         liquor|   liquor|
|beverage:booze:sp...|alcoholic drink|alcoholic|
|beverage:booze:sp...|alcoholic drink|    drink|
|beverage:booze:sp...|   cream liquor|    cream|
|beverage:booze:sp...|   cream liquor|   liquor|
|beverage:booze:sp...|liqueur cordial|  liqueur|
|beverage:booze:sp...|liqueur cordial|  cordial|
|beverage:booze:sp...|        alcohol|  alcohol|
|beverage:booze:sp...|          drink|    drink|
|        product:bean|        product|  product|
|        product:bean|           bean|     bean|
|tequila:blanco si...|        tequila|  tequila|
|tequila:blanco si...|  blanco silver|   blanco|
|tequila:blanco si...|  blanco silver|   silver|
|tequila:blanco si..

In [38]:
input.count()

res25: Long = 6240898
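
To make the two `explode(split(...))` steps above concrete, here is a minimal plain-Scala sketch of what they do to a single `semantic_tags` value. It runs without Spark. The object name `TagSplitSketch`, its functions, and the sample string are made up for illustration. It passes `-1` as the split limit because, to my understanding, that matches the default behavior of Spark's `split`, which keeps trailing empty strings.

```scala
object TagSplitSketch {
  // Mirrors explode(split(col, sep)): every input string yields one output
  // element per piece. A limit of -1 keeps trailing empty strings
  // (assumed to match the default of Spark's split).
  def explodeSplit(values: Seq[String], sep: String): Seq[String] =
    values.flatMap(_.split(sep, -1).toSeq)

  // First split on ":" to get the tag phrases ("word" in the notebook),
  // then split each phrase on a single space to get the tokens.
  def tokens(semanticTags: String): Seq[String] = {
    val words = explodeSplit(Seq(semanticTags), ":")
    explodeSplit(words, " ")
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample value, not taken from the table.
    println(tokens("white wine:alcohol:chill white wine"))
  }
}
```

For the sample string this yields six tokens (`white`, `wine`, `alcohol`, `chill`, `white`, `wine`), which is why the row count grows at each step: one row per tag phrase after the first split, one row per token after the second.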

In [39]:
input = input.drop("semantic_tags", "word")

input: org.apache.spark.sql.DataFrame = [token: string]

In [41]:
input

res26: org.apache.spark.sql.DataFrame = [token: string]

In [42]:
input.show()

+---------+
|    token|
+---------+
|  alcohol|
|    booze|
|   brandy|
|   cognac|
| beverage|
|   liquor|
|alcoholic|
|    drink|
|   spirit|
|    drink|
|   cognac|
|   frozen|
|   frozen|
|vegetable|
|     wine|
|   spirit|
|     beer|
|      wet|
|      cat|
|      pet|
+---------+
only showing top 20 rows

In [43]:
input.count()

res28: Long = 6240898

### Mine Frequent Tokens


In [45]:
input = input.groupBy("token").count().sort(col("count").desc)

input: org.apache.spark.sql.DataFrame = [token: string, count: bigint]

In [48]:
input.show(1000,false)

+-----------------+------+
|token            |count |
+-----------------+------+
|drink            |397235|
|wine             |235200|
|beverage         |195946|
|alcohol          |193792|
|spirit           |188976|
|liquor           |178225|
|snack            |173984|
|alcoholic        |154472|
|booze            |142165|
|beer             |138674|
|candy            |71022 |
|chocolate        |68065 |
|cookie           |61248 |
|frozen           |60577 |
|sweet            |56168 |
|red              |48215 |
|grocery          |47993 |
|whiskey          |46401 |
|meat             |45031 |
|chip             |42517 |
|cream            |42477 |
|health           |41429 |
|dairy            |40331 |
|product          |38583 |
|fruit            |38305 |
|ice              |37815 |
|cracker          |35333 |
|juice            |34975 |
|crisp            |34331 |
|personal         |34222 |
|cider            |33403 |
|fresh            |32178 |
|beauty           |32146 |
|white            |31673 |
|

In [49]:
input.count()

res32: Long = 10230
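
The steps in this section and the previous one can also be written as a single chained expression, which avoids reassigning `input` to DataFrames with different schemas. The following is a sketch under assumptions, not the code that produced the output above: the names `TokenCounts` and `tokenCounts` are hypothetical, and the function assumes its argument has a string column named `semantic_tags`.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, desc, explode, split}

object TokenCounts {
  // Takes a DataFrame with a `semantic_tags` string column and returns
  // (token, count) rows sorted by descending count.
  def tokenCounts(tags: DataFrame): DataFrame =
    tags
      // One row per ":"-separated tag phrase.
      .select(explode(split(col("semantic_tags"), ":")).as("word"))
      // One row per space-separated token within each phrase.
      .select(explode(split(col("word"), " ")).as("token"))
      // Count occurrences of each token, most frequent first.
      .groupBy("token")
      .count()
      .orderBy(desc("count"))
}
```

In this notebook it could be called as `TokenCounts.tokenCounts(spark.sql("SELECT DISTINCT semantic_tags FROM uber_eats.semantic_tags_grocery_data")).show(1000, false)`. Selecting only the exploded column at each step also makes the later `drop` calls unnecessary.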

### Mine Frequent Item Names

In [50]:
var AllItems = spark.sql("SELECT item_name FROM eds.menu_main_items")

AllItems: org.apache.spark.sql.DataFrame = [item_name: string]

In [51]:
AllItems.show()

+------------------------+
|               item_name|
+------------------------+
|    Sally Hansen Inst...|
|    Wexford Wooden Ru...|
|    Cheez-It Duos Che...|
|      rice crispy treats|
|               Item_3109|
|    Aquaphor Advanced...|
|    AXE Body Spray De...|
|    SMARTSWEETS SOUR ...|
|    Noz pecã carameli...|
|    Strawberry Banana...|
|(量)御衣坊純水濕紙巾80抽|
|                  Kulcha|
|    Shalom · Hair Bru...|
|    Breakfast Crunchw...|
|    La Esencia Sangri...|
|    Olay Regenerist M...|
|    Red Bull Red Edit...|
|    Tide Ultra Oxi wi...|
|               Item_3271|
|            Small Fanta®|
+------------------------+
only showing top 20 rows

In [52]:
AllItems.count()

res34: Long = 590717486

In [53]:
AllItems = AllItems.withColumn("word", explode(split(col("item_name")," ")))

AllItems: org.apache.spark.sql.DataFrame = [item_name: string, word: string]

In [54]:
AllItems.show()

+--------------------+---------+
|           item_name|     word|
+--------------------+---------+
|           Item_4904|Item_4904|
|Old Spice High En...|      Old|
|Old Spice High En...|    Spice|
|Old Spice High En...|     High|
|Old Spice High En...|Endurance|
|Old Spice High En...|Deodorant|
|Old Spice High En...|    Fresh|
|Old Spice High En...|       (3|
|Old Spice High En...|      oz)|
|Pan de queso crem...|      Pan|
|Pan de queso crem...|       de|
|Pan de queso crem...|    queso|
|Pan de queso crem...|    crema|
|Pan de queso crem...|        y|
|Pan de queso crem...|    fresa|
|Bud Light, 6pk-12...|      Bud|
|Bud Light, 6pk-12...|   Light,|
|Bud Light, 6pk-12...| 6pk-12oz|
|Bud Light, 6pk-12...|   bottle|
|Bud Light, 6pk-12...|     beer|
+--------------------+---------+
only showing top 20 rows

In [55]:
AllItems = AllItems.drop("item_name")

AllItems: org.apache.spark.sql.DataFrame = [word: string]

In [56]:
AllItems.count()

res36: Long = 3443261288

In [57]:
AllItems.show()

+--------------+
|          word|
+--------------+
|           Chi|
|          Silk|
|Reconstructing|
|       Complex|
|            (2|
|           oz)|
|        Olives|
|        vertes|
|   dénoyautées|
|        Salmón|
|   Gunkan-Maki|
|         Betty|
|       Crocker|
|          Bulb|
|        Baster|
|          (##)|
|       Organic|
|      Valencia|
|        Orange|
|        simple|
+--------------+
only showing top 20 rows

In [58]:
AllItems = AllItems.groupBy("word").count().sort(col("count").desc)

AllItems: org.apache.spark.sql.DataFrame = [word: string, count: bigint]

In [59]:
AllItems.count()

res38: Long = 7986459
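
The raw counts that follow are dominated by punctuation and unit fragments such as `-`, `·`, and `oz)`, and case variants such as `oz` and `OZ` are counted separately, because splitting on a single space keeps each fragment exactly as written. One possible cleanup, sketched here as an assumption rather than as part of the original analysis, is to lowercase the name and replace every run of characters that are neither letters nor digits with a space before splitting. The names `ItemWordCounts` and `normalizedWordCounts` are hypothetical.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, desc, explode, lower, regexp_replace, split, trim}

object ItemWordCounts {
  // Takes a DataFrame with an `item_name` string column and returns
  // (word, count) rows sorted by descending count.
  def normalizedWordCounts(items: DataFrame): DataFrame = {
    // Lowercase, then collapse each run of characters that are not Unicode
    // letters or digits into one space, then trim the ends.
    val cleaned = trim(regexp_replace(lower(col("item_name")), "[^\\p{L}\\p{N}]+", " "))
    items
      .select(explode(split(cleaned, " ")).as("word"))
      // A name made only of punctuation becomes an empty string; drop it.
      .filter(col("word") =!= "")
      .groupBy("word")
      .count()
      .orderBy(desc("count"))
  }
}
```

With this rule, `Bud Light, 6pk-12oz` would contribute `bud`, `light`, `6pk`, and `12oz`. The trade-off is that values containing punctuation are split too, so `1.0` becomes the two words `1` and `0`; whether that is acceptable depends on how the counts are used.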

In [61]:
AllItems.show(100,false)

+----------+--------+
|word      |count   |
+----------+--------+
|-         |82644075|
|·         |67058532|
|oz)       |58769306|
|oz        |40642638|
|&         |34621550|
|de        |28722141|
|ea        |23287944|
|fl        |22367774|
|x         |19752670|
|Chicken   |19253804|
|          |17065040|
|with      |16287117|
|Chocolate |14737330|
|ct)       |14532150|
|Burger    |13693104|
|1.0       |11691011|
|Cheese    |11335451|
|Best      |11252488|
|and       |10871636|
|OZ        |10305329|
|Original  |9908045 |
|(1        |9714373 |
|1         |8391352 |
|Cream     |8098390 |
|Bottle    |7967209 |
|2         |7456202 |
|Walgreens |7100809 |
|in        |7074603 |
|Milk      |6831353 |
|Free      |6607388 |
|White     |6563911 |
|Pizza     |6518177 |
|Pack      |6395949 |
|4         |6368865 |
|Beef      |5981937 |
|Sauce     |5927163 |
|Black     |5900104 |
|750ml     |5797619 |
|Extra     |5497344 |
|6         |5463557 |
|Size      |5431088 |
|Signature |5393588 |
|Water    