
# Semantic Tags Grocery Item Lemmatizer


### Spark Config

In [1]:
%%configure
{
  "pyFiles": [],
  "kind": "spark",
  "proxyUser": "radhesh",
  "sparkEnv": "SPARK_24",
  "driverMemory": "8g",
  "queue": "uber_eats_ml",
  "conf": {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.initialExecutors": 100,
    "spark.dynamicAllocation.minExecutors": 100,
    "spark.dynamicAllocation.maxExecutors": 200,
    "spark.executor.memory": "12g",
    "spark.executor.memoryOverhead": "4g",
    "spark.driver.memory": "12g",
    "spark.driver.memoryOverhead": "4g",
    "spark.hadoop.hadoop.security.authentication": "simple",
    "spark.shuffle.service.enabled": true,
    "spark.sql.shuffle.partitions": 500
  },
  "executorCores": 2,
  "driverCores": 2,
  "executorMemory": "12g",
  "jars": [
    "/user/radhesh/spark-corenlp-0.4.0-spark2.4-scala2.11.jar",
    "/user/radhesh/stanford-corenlp-3.9.1-models.jar"
  ],
  "drogonHeaders": {
    "X-DROGON-CLUSTER": "phx2/secure"
  }
}

In [2]:
spark

Starting Spark application (can take 60s or more)...
Starting heartbeat thread...done.
Waiting for Drogon session to be ready.......................................................................
Drogon session is ready.


Drogon Session ID,Spark Application ID,Kind,State,Spark UI,Driver log
543380559,application_1670874807583_3080506,spark,idle,Link,Link


SparkSession available as 'spark'.


Cell execution took 148 seconds.
res1: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@222ceb40

In [5]:
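// Load grocery items whose names have at least 3 characters,
// then drop any row whose concatenated field values form an empty string.
val input = spark.sql(
  "SELECT * FROM uber_eats.semantic_tag_distinct_grocery_us_uk_canada_item_data WHERE length(item_name) >= 3"
).filter(row => !(row.mkString("").isEmpty && row.length > 0))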

input: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [item_name: string, item_uuid: string]

In [9]:
input.show(50,false)

+----------------------------------------------------------+------------------------------------+
|item_name                                                 |item_uuid                           |
+----------------------------------------------------------+------------------------------------+
|david original salted   roasted sunflower seeds       oz  |c19ed7cc-2bfc-4ecb-84b6-dd0f1ef0deb4|
|paul masson v s    ml brandy        abv                   |42f96070-4a84-478b-a36b-151860012ff0|
|listerine fresh burst frontend sleeve       ct            |4f814246-fede-475b-abef-a0cc2cb7b758|
|western son    ml                                         |ac097dff-8479-4938-b578-69998dde08f3|
|beringer chardonnay    l     abv                          |647d7b96-12dc-47ad-8721-04769a2044c8|
|export a ashtray one                                      |2c15fa57-a8ad-4b5c-a2c0-66fcd18a81de|
|oud hollandsche koffie wafel coffee wafer     oz          |f84a3f76-2476-56a8-a214-adecfd5b6547|
|lagunitas ipa      

In [10]:
input.count()

res6: Long = 3275251

### Adding Imports

In [11]:
import org.apache.spark.sql.functions._
import com.databricks.spark.corenlp.functions._

import com.databricks.spark.corenlp.functions._

### Load Stop Words

In [12]:
var stopWordsDf = spark.sql("SELECT DISTINCT stop_word from kirby_external_data.semantic_tag_stop_words_master")

stopWordsDf: org.apache.spark.sql.DataFrame = [stop_word: string]

In [13]:
stopWordsDf.count()

res7: Long = 1556

In [17]:
stopWordsDf.show(50,false)

+-----------+
|stop_word  |
+-----------+
|ll         |
|anywhere   |
|himseÓ     |
|latenight  |
|for        |
|gone       |
|build      |
|gu         |
|name       |
|see        |
|elsewhere  |
|him        |
|backward   |
|il         |
|mustnt     |
|men        |
|regardless |
|somehow    |
|ten        |
|toward     |
|theyd      |
|highest    |
|presents   |
|parts      |
|slightly   |
|itseÓ      |
|parting    |
|sv         |
|soft       |
|act        |
|uses       |
|lunch      |
|al         |
|except     |
|always     |
|becoming   |
|thatve     |
|there're   |
|thereto    |
|therefore  |
|with       |
|world      |
|younger    |
|group      |
|speciality |
|appropriate|
|call       |
|mightn     |
|ni         |
|pk         |
+-----------+
only showing top 50 rows

In [18]:
val stopWords = stopWordsDf.collect().map(row => row.getString(0))

stopWords: Array[String] = Array(doesn, thanx, wholl, everyday, highest, presents, parts, slightly, hundred, indicated, items, 7, en, whats, bo, nl, we'd, ye, dishes, doubtful, forward, ad, hows, parted, quickly, states, thought, sub, l, twice, got, needing, extra, eg, herself, mug, ``, find, appreciate, ll, anywhere, box, regarding, al, except, always, becoming, thatve, there're, thereto, combination, entrante, gu, name, see, downs, tn, whys, april, beforehand, causes, not, twenty, i'd, certain, cv, show, 50g, couldn't, mt, sorry, there'd, jp, mv, sec, Wednesday, ain, oz., sensible, 350, ao, must, needn't, much, opposite, ref, di, hasn, make, namely, new, th, ups, cant, hadn, neither, u, what, widely, further, work, hour, omitted, desayuno, wasn, neednt, whichever, eight, fx, resulted,...

In [19]:
stopWords.length

res11: Int = 1556

In [20]:
val finalInput = input.select(col("item_name"),col("item_uuid")).where(length(regexp_replace($"item_name", " ","")) > 0)

finalInput: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [item_name: string, item_uuid: string]

In [22]:
finalInput.show(20,false)

+-------------------------------------------------------------------------+------------------------------------+
|item_name                                                                |item_uuid                           |
+-------------------------------------------------------------------------+------------------------------------+
|barr d b  l bottle                                                       |5b80d4a6-eac2-467c-8e6d-f2ec8146869f|
|beefeater gin   tonic pink strawberry can    ml                          |123e8d82-10da-44c8-a3ec-ad0c90e2968a|
|mark west pinot noir  mlx   ml bottle    ml beer     abv                 |b59ac2c0-443e-46bf-b0e8-59873e4ac68e|
|o'keeffe's seal n heal lip repair twin pack                              |fdfb585e-649d-583a-8030-5402d9438f34|
|splat pure sapphire color   bleach kit                                   |fe9908a2-1ced-54e3-8f12-d8887a9c318e|
|skittles crazy sours     g                                               |9a8d7354-e4f1-401d-8c

In [23]:
finalInput.count()

res14: Long = 3274944

In [24]:
val lemmas = finalInput.withColumn("item_lemma",lemma('item_name))

lemmas: org.apache.spark.sql.DataFrame = [item_name: string, item_uuid: string ... 1 more field]

In [25]:
lemmas.count()

res15: Long = 3274944

In [26]:
lemmas.show()

+--------------------+--------------------+--------------------+
|           item_name|           item_uuid|          item_lemma|
+--------------------+--------------------+--------------------+
|nestle milky bar ...|84a4ff81-a98e-59e...|[nestle, milky, b...|
|new amsterdam glu...|3e77b369-98f3-4e1...|[new, amsterdam, ...|
|e e ground almond...|c1c26196-1f08-41c...|[e, e, ground, al...|
|uncle rays ripple...|27690021-40fe-49d...|[uncle, ray, ripp...|
|snapple spiked wa...|a4ee393a-384d-4bc...|[snapple, spike, ...|
|missha   all arou...|2be2ff9e-8aeb-4de...|[missha, all, aro...|
|saint laurent luc...|865d1df3-b56b-5ec...|[saint, laurent, ...|
|marquis de bel ai...|e1984657-88f2-52d...|[marquis, de, bel...|
|smoking loon pino...|d7888af5-15de-402...|[smoking, loon, p...|
|pesquera reserva ...|74e128e0-0528-562...|[pesquera, reserv...|
|cherryade   litre...|1075b4a0-5f5e-4b6...|[cherryade, litre...|
|e x tra s mint gu...|dabdf025-a2e8-4a6...|[e, x, tra, be, m...|
|golden state cide...|959

### Load Regex Patterns

In [27]:
// Tokens that are only a quantity, or a quantity plus a unit (e.g. 500ml, 12oz), are removed later.
var patternString =
        "^\\d+[A-Za-z]{1,2}$|^\\d+pcs$|^\\d+pc$|^\\d+g$|^\\d+gm$|^\\d+ml$|^\\d+kg$|^\\d+oz$|^\\d+oz.$|^\\d+mg$|^\\d+lb$|^\\d+”$|^\\d+’$|^\\d+cm$|^\\d+gms$|^\\d+pk$|^\\d+mm$|^\\d+lt$|";

patternString: String = ^\d+[A-Za-z]{1,2}$|^\d+pcs$|^\d+pc$|^\d+g$|^\d+gm$|^\d+ml$|^\d+kg$|^\d+oz$|^\d+oz.$|^\d+mg$|^\d+lb$|^\d+”$|^\d+’$|^\d+cm$|^\d+gms$|^\d+pk$|^\d+mm$|^\d+lt$|

In [28]:
patternString += "^\\d+g.$|^\\d+gm.$|^\\d+ml.$|^\\d+kg.$|^\\d+mg.$|^\\d+lb.$|^\\d+”.$|^\\d+’.$|^\\d+cm.$|^\\d+gms.$|^\\d+pk.$|^\\d+mm.$|^\\d+lt.$|^\\d+$|^\\d*\\.?\\d$|";

In [29]:
// Centilitres, am/pm times, and 24-hour clock times (00:00 to 23:59).
patternString += "^\\d+cl$|^\\d+am|^\\d+pm|^([01][0-9]|2[0-3]):[0-5][0-9]$";

In [30]:
patternString

res19: String = ^\d+[A-Za-z]{1,2}$|^\d+pcs$|^\d+pc$|^\d+g$|^\d+gm$|^\d+ml$|^\d+kg$|^\d+oz$|^\d+oz.$|^\d+mg$|^\d+lb$|^\d+”$|^\d+’$|^\d+cm$|^\d+gms$|^\d+pk$|^\d+mm$|^\d+lt$|^\d+g.$|^\d+gm.$|^\d+ml.$|^\d+kg.$|^\d+mg.$|^\d+lb.$|^\d+”.$|^\d+’.$|^\d+cm.$|^\d+gms.$|^\d+pk.$|^\d+mm.$|^\d+lt.$|^\d+$|^\d*\.?\d$|^\d+cl$|^\d+am|^\d+pm|^([01][0-9]|2[0-3]):[0-5][0-9]$
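
A long hand-written alternation like `patternString` is easy to mistype. As a possible alternative, the quantity-plus-unit part could be generated from a list of unit suffixes. The sketch below is not the notebook's pattern: the `units` list and the `isUnitToken` helper are hypothetical names, the list covers only some of the units above, and the trailing period is matched literally (`\.?`) rather than as "any character".

```scala
import java.util.regex.Pattern

// Hypothetical unit list: a subset of the suffixes used in patternString.
val units = Seq("pcs", "pc", "g", "gm", "ml", "kg", "oz", "mg", "lb", "cm", "mm", "lt", "cl")

// Builds ^\d+(?:pcs|pc|g|...)\.?$ : digits, one unit, then an optional literal period.
val unitPattern: Pattern = Pattern.compile(units.mkString("^\\d+(?:", "|", ")\\.?$"))

// True when the whole token is a quantity followed by a known unit.
def isUnitToken(token: String): Boolean = unitPattern.matcher(token).matches()
```

Under these assumptions, `isUnitToken("500ml")` and `isUnitToken("12oz.")` hold, while `isUnitToken("almond")` does not. Adding a unit then means adding one list entry instead of editing two string literals.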

In [31]:
import org.apache.spark.ml.feature.StopWordsRemover

import org.apache.spark.ml.feature.StopWordsRemover

### Remove StopWords

In [32]:
val remover = new StopWordsRemover().setStopWords(stopWords).setInputCol("item_lemma").setOutputCol("test")

remover: org.apache.spark.ml.feature.StopWordsRemover = stopWords_f348afd27aa3

In [33]:
var df = remover.transform(lemmas)

df: org.apache.spark.sql.DataFrame = [item_name: string, item_uuid: string ... 2 more fields]
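
To make the effect of this step concrete, here is a minimal plain-Scala sketch of what stop word removal does to one token array. It only approximates `StopWordsRemover`, which compares case-insensitively by default (`caseSensitive` defaults to `false`). The `removeStopWords` helper and the sample stop words are hypothetical and are not taken from `kirby_external_data.semantic_tag_stop_words_master`.

```scala
// Hypothetical stop words, for illustration only.
val sampleStopWords: Set[String] = Set("mr", "whole", "oz")

// Drops every token that equals a stop word, ignoring case.
def removeStopWords(tokens: Seq[String], stopWords: Set[String]): Seq[String] = {
  val lowered = stopWords.map(_.toLowerCase)
  tokens.filterNot(token => lowered.contains(token.toLowerCase))
}

// "Mr" is dropped despite its capital letter; the other tokens are kept in order.
val kept = removeStopWords(Seq("Mr", "goodbar", "chocolate"), sampleStopWords)
```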

In [34]:
df.count()

res20: Long = 3274944

In [35]:
df.show()

+--------------------+--------------------+--------------------+--------------------+
|           item_name|           item_uuid|          item_lemma|                test|
+--------------------+--------------------+--------------------+--------------------+
|del monte whole k...|ffe2fd56-18fc-5a4...|[del, monte, whol...|[del, monte, kern...|
| american salad   oz|aea4c508-af74-552...|[american, salad,...|   [american, salad]|
|      frita sandwich|d4924c59-4870-42f...|   [frita, sandwich]|   [frita, sandwich]|
|scooby doo  honey...|c47a3569-9aa1-556...|[scooby, doo, hon...|[scooby, doo, hon...|
|        white candle|fe8be5ea-8130-568...|     [white, candle]|     [white, candle]|
|perla miodowa  x ...|e148d2df-ce1d-49b...|[perla, miodowa, ...|    [perla, miodowa]|
|karma water  rasp...|349ea704-011d-4b0...|[karma, water, ra...|[karma, water, ra...|
|mr goodbar chocol...|f913223a-cb95-507...|[mr, goodbar, cho...|[goodbar, chocola...|
|caribbean style t...|e7c3a794-9d17-577...|[caribbean,

### Remove Lemmas Matching Unwanted Regex Patterns

In [36]:
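import org.apache.spark.sql.functions.udf
import java.util.regex.Pattern

// Compile once on the driver; java.util.regex.Pattern is serializable, so the UDF closure can capture it.
val pattern = Pattern.compile(patternString)

// Keep only tokens of at least 3 characters that match none of the alternatives in patternString.
val removeRegex = udf { (tokens: Seq[String]) =>
  tokens.filter(text => text.length >= 3 && !pattern.matcher(text).find())
}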

removeRegex: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(StringType,true),Some(List(ArrayType(StringType,true))))
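
The filter inside the UDF can be checked locally before running it on the full dataset. The following is a sketch under stated assumptions: `samplePattern` is a shortened stand-in for `patternString`, `cleanTokens` is a hypothetical helper, and the null guard is an addition that the notebook's UDF does not have.

```scala
import java.util.regex.Pattern

// Shortened stand-in for patternString: bare numbers and a few number+unit forms.
val samplePattern: Pattern = Pattern.compile("^\\d+$|^\\d+ml$|^\\d+oz$|^\\d+[A-Za-z]{1,2}$")

// Same rule as the UDF: keep tokens with at least 3 characters that match no pattern.
// A null array is treated as empty (an assumption; the UDF above would throw instead).
def cleanTokens(tokens: Seq[String]): Seq[String] =
  Option(tokens).getOrElse(Seq.empty)
    .filter(token => token.length >= 3 && !samplePattern.matcher(token).find())

val cleaned = cleanTokens(Seq("beringer", "chardonnay", "750ml", "12", "l", "abv"))
```

Here `750ml` is dropped by the unit pattern, `12` and `l` are dropped by the length check, and `beringer`, `chardonnay`, and `abv` survive.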

In [37]:
var regexdf = df.withColumn("removeregex",removeRegex(df.col("test")))

regexdf: org.apache.spark.sql.DataFrame = [item_name: string, item_uuid: string ... 3 more fields]

In [38]:
regexdf.show(10,false)

+-------------------------------------------------------------------------+------------------------------------+-------------------------------------------------------------------------------+-------------------------------------------------------------------------------+-------------------------------------------------------------------------------+
|item_name                                                                |item_uuid                           |item_lemma                                                                     |test                                                                           |removeregex                                                                    |
+-------------------------------------------------------------------------+------------------------------------+-------------------------------------------------------------------------------+-------------------------------------------------------------------------------+----------------------

In [39]:
regexdf.count()

res23: Long = 3274944

In [40]:
val outputDf = regexdf.withColumn("final_item_name",concat_ws(" ", $"removeregex"))

outputDf: org.apache.spark.sql.DataFrame = [item_name: string, item_uuid: string ... 4 more fields]

In [41]:
outputDf.count()

res24: Long = 3274944

In [42]:
outputDf.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|           item_name|           item_uuid|          item_lemma|                test|         removeregex|     final_item_name|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|nestle milky bar ...|84a4ff81-a98e-59e...|[nestle, milky, b...|[nestle, milky, c...|[nestle, milky, c...|  nestle milky count|
|new amsterdam glu...|3e77b369-98f3-4e1...|[new, amsterdam, ...|[amsterdam, glute...|[amsterdam, glute...|amsterdam gluten ...|
|e e ground almond...|c1c26196-1f08-41c...|[e, e, ground, al...|    [ground, almond]|    [ground, almond]|       ground almond|
|uncle rays ripple...|27690021-40fe-49d...|[uncle, ray, ripp...|[uncle, ray, ripple]|[uncle, ray, ripple]|    uncle ray ripple|
|snapple spiked wa...|a4ee393a-384d-4bc...|[snapple, spike, ...|[snapple, spike, ...|[snapple, spike, ..

In [49]:
val finalDf = outputDf.select(col("item_uuid"),col("item_name"),col("final_item_name")).where(length(regexp_replace($"final_item_name", " ","")) > 0)

finalDf: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [item_uuid: string, item_name: string ... 1 more field]

In [51]:
finalDf.show()

+--------------------+--------------------+--------------------+
|           item_uuid|           item_name|     final_item_name|
+--------------------+--------------------+--------------------+
|fefdfc33-c660-553...|hallmark jingle c...|hallmark jingle c...|
|fff2c653-2ccc-59f...|luc belaire luxe ...|luc belaire luxe ...|
|6d99aabe-977a-498...|    fruit tella    g|         fruit tella|
|ba1c142d-a6ef-486...|       sanmiguel can|           sanmiguel|
|d54ac12e-bc57-4f2...|the kurayoshi mal...|kurayoshi malt wh...|
|e042d686-dd06-43d...|ka still black gr...|  black grape carton|
|fdbe99de-2637-528...|broccoli   fontin...|broccoli fontina ...|
|f3aeb4f9-2a3f-401...|r whites lemonade...|     whites lemonade|
|8658668e-a8d2-47e...|starburst origina...|starburst origina...|
|611c6c16-ac70-4b5...|itoen unsweetened...|itoen unsweetened...|
|a32fca77-8f74-4e6...|wellness complete...|wellness complete...|
|fb90acaf-57de-5a3...|cuvee numero     ...| cuvee numero bottle|
|1579107b-61c2-4b4...|lou

In [53]:
finalDf.count()

res32: Long = 3269353
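
The row counts recorded in the cell outputs above show how many items each filter removed. The short sketch below only redoes that arithmetic; the variable names are illustrative and the values are copied from the outputs of `input.count()`, `finalInput.count()`, and `finalDf.count()`.

```scala
// Counts copied from the notebook outputs above.
val inputRows = 3275251L    // input.count()
val nonBlankRows = 3274944L // finalInput.count(), after dropping blank item names
val finalRows = 3269353L    // finalDf.count(), after dropping names left empty by cleaning

// Rows removed by each filter.
val droppedBlank = inputRows - nonBlankRows         // 307
val droppedAfterCleaning = nonBlankRows - finalRows // 5591

// Share of non-blank items whose name became empty after lemmatization and filtering.
val droppedShare = droppedAfterCleaning.toDouble / nonBlankRows
```

So 307 rows had blank names to begin with, and a further 5591 rows (under 0.2% of the remainder) were reduced to an empty `final_item_name` by stop word and regex removal.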