# Semantic Tags Grocery Tag Lemmatizer

## Spark Config

In [1]:
%%configure -f
{
  "pyFiles": [],
  "kind": "spark",
  "proxyUser": "radhesh",
  "sparkEnv": "SPARK_24",
  "driverMemory": "8g",
  "queue": "uber_eats_ml",
  "conf": {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.initialExecutors": 100,
    "spark.dynamicAllocation.minExecutors": 100,
    "spark.dynamicAllocation.maxExecutors": 200,
    "spark.executor.memory": "12g",
    "spark.executor.memoryOverhead": "4g",
    "spark.driver.memory": "12g",
    "spark.driver.memoryOverhead": "4g",
    "spark.hadoop.hadoop.security.authentication": "simple",
    "spark.shuffle.service.enabled": true,
    "spark.sql.shuffle.partitions": 500
  },
  "executorCores": 2,
  "driverCores": 2,
  "executorMemory": "12g",
  "jars": [
    "/user/radhesh/spark-corenlp-0.4.0-spark2.4-scala2.11.jar",
    "/user/radhesh/stanford-corenlp-3.9.1-models.jar"
  ],
  "drogonHeaders": {
    "X-DROGON-CLUSTER": "phx2/secure"
  }
}

In [2]:
spark

Starting Spark application (can take 60s or more)...
Starting heartbeat thread...done.
Waiting for Drogon session to be ready.....................................
Drogon session is ready.


Drogon Session ID,Spark Application ID,Kind,State,Spark UI,Driver log
543414791,application_1670874807583_3088162,spark,idle,,


SparkSession available as 'spark'.


Cell execution took 71 seconds.
res1: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@4ffcfee2

## Load Data 

In [4]:
var input = spark.sql("SELECT DISTINCT semantic_tags FROM uber_eats.semantic_tags_grocery_data").filter(row => !(row.mkString("").isEmpty && row.length>0))

input: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [semantic_tags: string]

In [5]:
input.show()

+--------------------+
|       semantic_tags|
+--------------------+
|               water|
|spice pickle:spic...|
|liquor:drink:alco...|
|dairy egg:concess...|
|    candy gum:market|
|popcorn pretzel n...|
|spirit:spirit gin...|
|alcohlic beverage...|
|snack:cake biscui...|
|liquor:booze:beve...|
|import drink:exot...|
|flour:frozen pizz...|
|essential:milk:mi...|
|spice season:froz...|
|bottle beer ale l...|
|toy:board game:adult|
|clean:laundry:cle...|
|health beauty:ski...|
|bone chew:pet:dog...|
|drink:beverage:li...|
+--------------------+
only showing top 20 rows

In [6]:
import org.apache.spark.sql.functions.{col, explode, split}
input = input.withColumn("tag_name", explode(split(col("semantic_tags"), ":")))

input: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [semantic_tags: string, tag_name: string]
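
The cell above turns each colon-separated `semantic_tags` string into one row per tag. A minimal local sketch of the same idea on plain Scala collections follows. It assumes Spark's `split` keeps trailing empty strings, which `String.split` only does when given a limit of -1. The names `explodeTags` and `sampleRows` and the sample data are illustrative and not part of the notebook.

```scala
// Local stand-in for explode(split(col("semantic_tags"), ":")):
// every input string yields one output element per colon-separated tag.
// The limit of -1 keeps trailing empty strings (assumed to match Spark's split).
def explodeTags(semanticTags: Seq[String]): Seq[String] =
  semanticTags.flatMap(_.split(":", -1))

// Hypothetical sample rows, not taken from the table.
val sampleRows = Seq("wine:white wine:alcohol", "water")

val exploded = explodeTags(sampleRows)
// exploded == Seq("wine", "white wine", "alcohol", "water")
```

Under that assumption a tag string ending in `:` produces an empty tag, which is the kind of row the blank-tag filter in cell 16 guards against.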

In [7]:
input.show()

+--------------------+--------------------+
|       semantic_tags|            tag_name|
+--------------------+--------------------+
|wine white wine:w...|     wine white wine|
|wine white wine:w...|          white wine|
|wine white wine:w...|        pinot grigio|
|wine white wine:w...|white wine pinot ...|
|wine white wine:w...|   wine pinot grigio|
|wine white wine:w...|                wine|
|wine white wine:w...|             alcohol|
|liquor market:sco...|       liquor market|
|liquor market:sco...|              scotch|
|liquor market:sco...|              liquor|
|liquor market:sco...|             whiskey|
|liquor market:sco...|               sprit|
|liquor market:sco...|      spirit alcohol|
|liquor market:sco...|        blend scotch|
|liquor market:sco...|bob discount liqu...|
|liquor market:sco...|      alcohol bottle|
|liquor market:sco...|       whiskey buddy|
|liquor market:sco...|   alcohol miniature|
|liquor market:sco...|             alcohol|
|liquor market:sco...|     whisk

In [9]:
input = input.select("tag_name")

input: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [tag_name: string]

In [10]:
input.show()

+--------------------+
|            tag_name|
+--------------------+
|      american drink|
|            beverage|
|american snack drink|
|          drink soda|
|              cereal|
|          cough cold|
|          frozenfood|
|             instant|
|              frozen|
|            red wine|
|                wine|
|          beer cider|
|wine sparkling wi...|
|   bottle beer cider|
|         beer larger|
|     alesstout lager|
|         bottle beer|
|  lager stout bottle|
|              seller|
|          beer cider|
+--------------------+
only showing top 20 rows

In [11]:
input.count()

res5: Long = 3764850

## Loading Stop Words

In [12]:
import org.apache.spark.sql.functions._
import com.databricks.spark.corenlp.functions._

import com.databricks.spark.corenlp.functions._

In [13]:
var stopWordsDf = spark.sql("SELECT DISTINCT stop_word from kirby_external_data.semantic_tag_stop_words_master")

stopWordsDf: org.apache.spark.sql.DataFrame = [stop_word: string]

In [14]:
var stopWords = stopWordsDf.collect.map(row=>row.getString(0)) 

stopWords: Array[String] = Array(doesn, thanx, wholl, everyday, highest, presents, parts, slightly, hundred, indicated, items, 7, en, whats, bo, nl, we'd, ye, dishes, doubtful, forward, ad, hows, parted, quickly, states, thought, sub, l, twice, got, needing, extra, eg, herself, mug, ``, find, appreciate, ll, anywhere, box, regarding, al, except, always, becoming, thatve, there're, thereto, combination, entrante, gu, name, see, downs, tn, whys, april, beforehand, causes, not, twenty, i'd, certain, cv, show, 50g, couldn't, mt, sorry, there'd, jp, mv, sec, Wednesday, ain, oz., sensible, 350, ao, must, needn't, much, opposite, ref, di, hasn, make, namely, new, th, ups, cant, hadn, neither, u, what, widely, further, work, hour, omitted, desayuno, wasn, neednt, whichever, eight, fx, resulted,...

In [15]:
stopWords

res6: Array[String] = Array(doesn, thanx, wholl, everyday, highest, presents, parts, slightly, hundred, indicated, items, 7, en, whats, bo, nl, we'd, ye, dishes, doubtful, forward, ad, hows, parted, quickly, states, thought, sub, l, twice, got, needing, extra, eg, herself, mug, ``, find, appreciate, ll, anywhere, box, regarding, al, except, always, becoming, thatve, there're, thereto, combination, entrante, gu, name, see, downs, tn, whys, april, beforehand, causes, not, twenty, i'd, certain, cv, show, 50g, couldn't, mt, sorry, there'd, jp, mv, sec, Wednesday, ain, oz., sensible, 350, ao, must, needn't, much, opposite, ref, di, hasn, make, namely, new, th, ups, cant, hadn, neither, u, what, widely, further, work, hour, omitted, desayuno, wasn, neednt, whichever, eight, fx, resulted, they...

In [16]:
val finalInput = input.select(col("tag_name")).where(length(regexp_replace($"tag_name", " ","")) > 0)

finalInput: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [tag_name: string]

In [17]:
finalInput.count()

res7: Long = 3764850

In [18]:
finalInput.show()

+--------------------+
|            tag_name|
+--------------------+
|          coffee tea|
|  bonisoir laprairie|
|               drink|
|            starbuck|
|              coffee|
|coffee coffee filter|
|    coffee tea drink|
|         coffee milk|
|            beverage|
|           jus drink|
|                 tea|
|        energy drink|
|depanneur quartie...|
|                soda|
|          soda drink|
|        frozen snack|
|             sto dep|
|         freshly fry|
|              frozen|
|             grocery|
+--------------------+
only showing top 20 rows

In [19]:
val lemmas = finalInput.withColumn("tag_lemma",lemma('tag_name))

lemmas: org.apache.spark.sql.DataFrame = [tag_name: string, tag_lemma: array<string>]

In [20]:
lemmas.count()

res9: Long = 3764850

In [21]:
lemmas.show()

+--------------------+--------------------+
|            tag_name|           tag_lemma|
+--------------------+--------------------+
|             grocery|           [grocery]|
|         honey syrup|      [honey, syrup]|
|                 tag|               [tag]|
|condiment spice bake|[condiment, spice...|
|          consumable|        [consumable]|
|         noodle soup|      [noodle, soup]|
|             grocery|           [grocery]|
|        asian import|     [asian, import]|
|              noodle|            [noodle]|
|               snack|             [snack]|
|             instant|           [instant]|
|     krystal express|  [krystal, express]|
|             product|           [product]|
|                soup|              [soup]|
|                 pet|               [pet]|
|              yogurt|            [yogurt]|
|           dairy egg|        [dairy, egg]|
|           ice cream|        [ice, cream]|
|          toiletries|        [toiletries]|
|            haircare|          

In [22]:
var patternString =
        "^\\d+[A-Za-z]{1,2}$|^\\d+pcs$|^\\d+pc$|^\\d+g$|^\\d+gm$|^\\d+ml$|^\\d+kg$|^\\d+oz$|^\\d+oz\\.$|^\\d+mg$|^\\d+lb$|^\\d+”$|^\\d+’$|^\\d+cm$|^\\d+gms$|^\\d+pk$|^\\d+mm$|^\\d+lt$|";

patternString: String = ^\d+[A-Za-z]{1,2}$|^\d+pcs$|^\d+pc$|^\d+g$|^\d+gm$|^\d+ml$|^\d+kg$|^\d+oz$|^\d+oz\.$|^\d+mg$|^\d+lb$|^\d+”$|^\d+’$|^\d+cm$|^\d+gms$|^\d+pk$|^\d+mm$|^\d+lt$|

In [23]:
patternString += "^\\d+g\\.$|^\\d+gm\\.$|^\\d+ml\\.$|^\\d+kg\\.$|^\\d+mg\\.$|^\\d+lb\\.$|^\\d+”\\.$|^\\d+’\\.$|^\\d+cm\\.$|^\\d+gms\\.$|^\\d+pk\\.$|^\\d+mm\\.$|^\\d+lt\\.$|^\\d+$|^\\d*\\.?\\d$|";

In [24]:
patternString += "^\\d+cl$|^\\d+am$|^\\d+pm$|^([01][0-9]|2[0-3]):[0-5][0-9]$";

In [25]:
patternString

res13: String = ^\d+[A-Za-z]{1,2}$|^\d+pcs$|^\d+pc$|^\d+g$|^\d+gm$|^\d+ml$|^\d+kg$|^\d+oz$|^\d+oz\.$|^\d+mg$|^\d+lb$|^\d+”$|^\d+’$|^\d+cm$|^\d+gms$|^\d+pk$|^\d+mm$|^\d+lt$|^\d+g\.$|^\d+gm\.$|^\d+ml\.$|^\d+kg\.$|^\d+mg\.$|^\d+lb\.$|^\d+”\.$|^\d+’\.$|^\d+cm\.$|^\d+gms\.$|^\d+pk\.$|^\d+mm\.$|^\d+lt\.$|^\d+$|^\d*\.?\d$|^\d+cl$|^\d+am$|^\d+pm$|^([01][0-9]|2[0-3]):[0-5][0-9]$

In [26]:
import org.apache.spark.ml.feature.StopWordsRemover

import org.apache.spark.ml.feature.StopWordsRemover

In [27]:
val remover = new StopWordsRemover().setStopWords(stopWords).setInputCol("tag_lemma").setOutputCol("test")

remover: org.apache.spark.ml.feature.StopWordsRemover = stopWords_bbdd82e3ceda

In [28]:
var df = remover.transform(lemmas)

df: org.apache.spark.sql.DataFrame = [tag_name: string, tag_lemma: array<string> ... 1 more field]
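
`StopWordsRemover` drops every token of the input array that appears in the stop-word list, ignoring case by default. A rough local equivalent on plain Scala collections is sketched below. `removeStopWords` and `sampleStopWords` are illustrative names, and the sample list is a small hypothetical subset rather than the contents of the stop-word table.

```scala
// Case-insensitive stop-word filter over an already tokenised tag,
// mirroring StopWordsRemover with its default caseSensitive = false.
def removeStopWords(tokens: Seq[String], stopWords: Set[String]): Seq[String] = {
  val lowered = stopWords.map(_.toLowerCase)
  tokens.filterNot(token => lowered.contains(token.toLowerCase))
}

// Hypothetical subset of stop words, for illustration only.
val sampleStopWords = Set("top", "items", "Wednesday")

val kept = removeStopWords(Seq("top", "condiment"), sampleStopWords)
// kept == Seq("condiment")
```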

In [29]:
df.count()

res14: Long = 3764850

In [30]:
df.show()

+--------------------+--------------------+--------------------+
|            tag_name|           tag_lemma|                test|
+--------------------+--------------------+--------------------+
|        meat seafood|     [meat, seafood]|     [meat, seafood]|
|          white wine|       [white, wine]|       [white, wine]|
|          gallo wine|       [gallo, wine]|       [gallo, wine]|
|      wine wite rise|  [wine, wite, rise]|  [wine, wite, rise]|
|    chill white wine|[chill, white, wine]|[chill, white, wine]|
|   alcholic beverage|[alcholic, beverage]|[alcholic, beverage]|
|   convenience store|[convenience, store]|[convenience, store]|
|     krystal express|  [krystal, express]|  [krystal, express]|
|spirit champagne ...|[spirit, champagn...|[spirit, champagn...|
|     white rise wine| [white, rise, wine]| [white, rise, wine]|
|    californian wine| [californian, wine]| [californian, wine]|
|wine champagne pr...|[wine, champagne,...|[wine, champagne,...|
|   price weight live|[pr

In [31]:
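import org.apache.spark.sql.functions.udf
import java.util.regex.Pattern

// Compile the pattern once; java.util.regex.Pattern is serializable, so the UDF closure can carry it.
val pattern = Pattern.compile(patternString)

// Keep only tokens of at least 3 characters that do not match the quantity/time pattern.
val removeRegex = udf { (array: Seq[String]) =>
  array.filter(text => text.length >= 3 && !pattern.matcher(text).find())
}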

removeRegex: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(StringType,true),Some(List(ArrayType(StringType,true))))
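
The UDF keeps a token only if it has at least three characters and does not match the quantity/time pattern. The same rule can be tried locally on a handful of tokens, as in the sketch below. It uses a deliberately reduced pattern (`samplePattern`) rather than the full `patternString`, and `dropNoise` and the sample tokens are illustrative assumptions.

```scala
import java.util.regex.Pattern

// Illustrative subset of the notebook's pattern, not the full string:
// digits followed by 1-2 letters, digits followed by "ml", or digits only.
val samplePattern = Pattern.compile("^\\d+[A-Za-z]{1,2}$|^\\d+ml$|^\\d+$")

// Same rule as the UDF: keep tokens with length >= 3 that do not match the pattern.
def dropNoise(tokens: Seq[String]): Seq[String] =
  tokens.filter(token => token.length >= 3 && !samplePattern.matcher(token).find())

val cleaned = dropNoise(Seq("milk", "500ml", "12", "oz", "350", "egg"))
// cleaned == Seq("milk", "egg")
```

Here `500ml` and `350` are removed by the pattern, while `12` and `oz` are removed by the length check alone.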

In [32]:
var regexdf = df.withColumn("removeregex",removeRegex(df.col("test")))

regexdf: org.apache.spark.sql.DataFrame = [tag_name: string, tag_lemma: array<string> ... 2 more fields]

In [33]:
regexdf.show()

+--------------------+--------------------+--------------------+--------------------+
|            tag_name|           tag_lemma|                test|         removeregex|
+--------------------+--------------------+--------------------+--------------------+
|                milk|              [milk]|              [milk]|              [milk]|
|           dairy egg|        [dairy, egg]|        [dairy, egg]|        [dairy, egg]|
|   topping condiment|    [top, condiment]|         [condiment]|         [condiment]|
|             grocery|           [grocery]|           [grocery]|           [grocery]|
|           condiment|         [condiment]|         [condiment]|         [condiment]|
|             barcode|           [barcode]|           [barcode]|           [barcode]|
|               pasta|             [pasta]|             [pasta]|             [pasta]|
|           dry pasta|        [dry, pasta]|        [dry, pasta]|        [dry, pasta]|
|           soda shop|        [soda, shop]|        [so

In [34]:
regexdf.show(10,false)

+--------------------+------------------------+------------------------+------------------------+
|tag_name            |tag_lemma               |test                    |removeregex             |
+--------------------+------------------------+------------------------+------------------------+
|alcoholic drink     |[alcoholic, drink]      |[alcoholic, drink]      |[alcoholic, drink]      |
|booze               |[booze]                 |[booze]                 |[booze]                 |
|alcohol             |[alcohol]               |[alcohol]               |[alcohol]               |
|liquor              |[liquor]                |[liquor]                |[liquor]                |
|cocktail            |[cocktail]              |[cocktail]              |[cocktail]              |
|spirit              |[spirit]                |[spirit]                |[spirit]                |
|cocktail soda       |[cocktail, soda]        |[cocktail, soda]        |[cocktail, soda]        |
|monaco             

In [35]:
regexdf.count()

res18: Long = 3764850

In [36]:
val outputDf = regexdf.withColumn("final_tag_name",concat_ws(" ", $"removeregex"))

outputDf: org.apache.spark.sql.DataFrame = [tag_name: string, tag_lemma: array<string> ... 3 more fields]

In [37]:
val finalDf = outputDf.select(col("tag_name"),col("final_tag_name")).where(length(regexp_replace($"final_tag_name", " ","")) > 0)

finalDf: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [tag_name: string, final_tag_name: string]
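
Cells 36 and 37 join the surviving tokens back into a single string and drop tags that end up blank, which presumably accounts for the row count falling from 3764850 to 3763079 in the next count. A minimal local sketch of that step, with the illustrative name `rejoin` and hypothetical sample data:

```scala
// Join each token list with single spaces (like concat_ws(" ", ...)) and
// drop results that contain nothing but spaces (like the length(...) > 0 filter).
def rejoin(tokenLists: Seq[Seq[String]]): Seq[String] =
  tokenLists
    .map(_.mkString(" "))
    .filter(_.replace(" ", "").nonEmpty)

// Hypothetical token lists; the empty one stands for a tag whose tokens were all removed.
val rejoined = rejoin(Seq(Seq("dairy", "egg"), Seq.empty, Seq("milk")))
// rejoined == Seq("dairy egg", "milk")
```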

In [38]:
finalDf.show(100)

+--------------------+--------------------+
|            tag_name|      final_tag_name|
+--------------------+--------------------+
|   drink drink mixer|   drink drink mixer|
|               juice|               juice|
|                 wet|                 wet|
|                 pet|                 pet|
|chocolatesweetsca...|chocolatesweetsca...|
|    import chocolate|    import chocolate|
|           chocolate|           chocolate|
|chocolate snack c...|chocolate snack c...|
|            choclate|            choclate|
|               sweet|               sweet|
|                cake|                cake|
|  ice cream popsicle|  ice cream popsicle|
|              frozen|              frozen|
|     brew tea coffee|     brew tea coffee|
|            brew tea|            brew tea|
|            beverage|            beverage|
|     alcoholic drink|     alcoholic drink|
|              liquor|              liquor|
|             alcohol|             alcohol|
|              spirit|          

In [39]:
finalDf.count()

res20: Long = 3763079

In [40]:
finalDf.select(col("final_tag_name")).show(50,false)

+------------------------+
|final_tag_name          |
+------------------------+
|bacon sausage hot dog   |
|roller grill            |
|meat                    |
|sausage                 |
|deli                    |
|sausage bacon           |
|concession              |
|beef                    |
|meat seafood            |
|meat poultry seafood    |
|grocery                 |
|meat seafood plant base |
|deli meat cold cut      |
|vegetable               |
|scotch                  |
|prepared                |
|dessert                 |
|international           |
|indian                  |
|beverage                |
|liquor                  |
|super                   |
|liqueur                 |
|alcoholic drink         |
|booze                   |
|spirit                  |
|drink                   |
|alcohol                 |
|beer domestic import    |
|liquor                  |
|beverage                |
|spirit                  |
|booze                   |
|rise dry hard cider beer|
|

In [41]:
var frequentItem = finalDf.select(col("final_tag_name"))

frequentItem: org.apache.spark.sql.DataFrame = [final_tag_name: string]

In [42]:
frequentItem = frequentItem.withColumn("word", explode(split(col("final_tag_name")," ")))

frequentItem: org.apache.spark.sql.DataFrame = [final_tag_name: string, word: string]

In [43]:
frequentItem = frequentItem.drop("final_tag_name")

frequentItem: org.apache.spark.sql.DataFrame = [word: string]
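
Cells 41 to 44 split each cleaned tag into words and count how often each word occurs. The same computation on a small in-memory list can be sketched as below. `wordCounts` and the sample tags are illustrative, and the alphabetical tie-break is an assumption added to make the order deterministic; Spark's `sort` on `count` alone does not guarantee any order among ties.

```scala
// Split tags on spaces, count each word, and sort by descending count.
// Ties are broken alphabetically so the result is deterministic.
def wordCounts(tags: Seq[String]): Seq[(String, Int)] =
  tags
    .flatMap(_.split(" "))
    .groupBy(identity)
    .map { case (word, occurrences) => (word, occurrences.size) }
    .toSeq
    .sortBy { case (word, count) => (-count, word) }

// Hypothetical sample tags, not taken from the table.
val counts = wordCounts(Seq("white wine", "red wine", "wine"))
// counts == Seq(("wine", 3), ("red", 1), ("white", 1))
```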

In [44]:
frequentItem.groupBy("word").count().sort(col("count").desc).show(50,false)

+---------+------+
|word     |count |
+---------+------+
|drink    |397235|
|wine     |235236|
|beverage |195946|
|alcohol  |193792|
|spirit   |188976|
|liquor   |178225|
|snack    |173984|
|alcoholic|154472|
|booze    |142165|
|beer     |138674|
|candy    |71022 |
|chocolate|68065 |
|cookie   |61248 |
|frozen   |59246 |
|sweet    |56168 |
|red      |48215 |
|grocery  |47993 |
|whiskey  |46401 |
|meat     |45031 |
|chip     |42517 |
|cream    |42477 |
|health   |41429 |
|dairy    |40331 |
|product  |38583 |
|fruit    |38305 |
|ice      |37815 |
|cracker  |35341 |
|juice    |34975 |
|crisp    |34331 |
|personal |34222 |
|cider    |33403 |
|fresh    |32178 |
|beauty   |32146 |
|white    |31673 |
|sauce    |29751 |
|vegetable|28937 |
|mixer    |28414 |
|tea      |27721 |
|household|27667 |
|ready    |26900 |
|american |26600 |
|spice    |26148 |
|vodka    |26112 |
|bakery   |25450 |
|gum      |24857 |
|coffee   |24535 |
|sparkling|24224 |
|water    |23922 |
|nut      |23608 |
|bread    |2