## Query Tag Hive to Docstore Conversion

In [4]:
%%configure -f
{
  "name": "qu_data_docstore_semantic_tags",
  "proxyUser": "radhesh", 
  "sparkEnv": "SPARK_24", 
  "driverMemory": "12G",
  "executorMemory": "12G",
  "queue": "/uber-eats/eater/ml_training_experiments",
  "numExecutors": 50, 
  "conf": {
    "spark.default.parallelism": 200,
    "spark.yarn.executor.memoryOverhead": 1024
  }, 
  "executorCores": 8, 
  "driverCores": 8, 
  "jars": [], 
  "drogonHeaders": {
    "X-DROGON-CLUSTER": "phx4/Peloton02Secure"
  }
}

Starting Spark application (can take 60s or more)...
Starting heartbeat thread...done.
Waiting for Drogon session to be ready......................
Drogon session is ready.


Drogon Session ID,Spark Application ID,Kind,State,Spark UI,Driver log
578493103,86d937a3-9acd-4930-b07d-7b783cedc4c3,pyspark3,idle,Link,Link


SparkSession available as 'spark'.


Cell execution took 36 seconds.


In [5]:
%%spark

In [6]:
import pyspark.sql.functions as F
from pyspark.sql.functions import col
from pyspark.sql.types import StringType, ArrayType
from pyspark.sql.window import Window
import re
from collections import defaultdict
from datetime import datetime

In [7]:
def spark_sql(sql):
    return spark.sql(sql)

In [23]:
query = """
SELECT * from tmp.semantic_tags_query_tag_array
"""
# Table Creation Query -> https://querybuilder-ea.uberinternal.com/r/q4gloGNUN/run/ERomdzfNv/edit

In [17]:
df = spark_sql(query)

In [18]:
df.show() #test sample

+-------------------+--------------------+
|              query|       semantic_tags|
+-------------------+--------------------+
|         wash water|beverage-0.206896...|
|       gin and soda|beverage-0.203125...|
|    pringles puzzle|salty snack-0.106...|
|           gallon o|drink-0.177489177...|
|     orange chicjen|fruit vegetable-0...|
|ice cream cheescake|consumable-0.2154...|
| liqs wine cocktail|drink-0.122807017...|
|          baged ice|ice cream-0.20375...|
|               tenn|whiskey-0.1738299...|
|  simply watermelon|beverage-0.207407...|
| martini rossi asti|wine-0.1578947368...|
|     organic walnut|fruit vegetable-0...|
|          rapid tes|pregnancy fertili...|
|              shank|meat seafood plan...|
|              gopuf|alcohol-0.1294623...|
|           wal dram|medicine treatmen...|
|       choolate bar|candy chocolate-0...|
|          casa rosa|tequila-0.1774193...|
|    whistle pig rye|spirit-0.1625:liq...|
|          sprite le|beverage-0.178217...|
+-------------------+--------------------+

In [19]:
print((df.count(), len(df.columns))) #check number of rows and columns

(623566, 2)

In [58]:
import json

def jsonify(text):
    # Input looks like "tag-score:tag-score:..."; rsplit on the last '-' keeps hyphenated tags intact.
    pairs = [tag_score.rsplit('-', 1) for tag_score in text.split(':')]
    data = {"predicted_semantic_tags": [{"semantic_tag": tag, "score": score} for tag, score in pairs]}
    return json.dumps(data)  # valid JSON (double-quoted), unlike str(data)

In [59]:
#testing jsonify
jsonify("tequila-0.3225806451612903:liquor-0.25806451612903225:reposado-0.22580645161290322:spirit tequila-0.1935483870967742")

'{"predicted_semantic_tags": [{"semantic_tag": "tequila", "score": "0.3225806451612903"}, {"semantic_tag": "liquor", "score": "0.25806451612903225"}, {"semantic_tag": "reposado", "score": "0.22580645161290322"}, {"semantic_tag": "spirit tequila", "score": "0.1935483870967742"}]}'

In [60]:
jsonify_UDF = F.udf(jsonify, StringType())  # wrap jsonify as a Spark UDF returning a string
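
The `tag-score:tag-score` strings in `semantic_tags` can also be parsed and checked outside Spark before trusting the UDF on 600K+ rows. The following is a minimal sketch in plain Python, not the notebook's own code: `parse_tag_scores` is a hypothetical helper, and it assumes that `:` never appears inside a tag and that the score is whatever follows the last `-`.

```python
import json

def parse_tag_scores(text):
    """Parse 'tag-score:tag-score' into a list of (tag, float score) tuples.

    Hypothetical helper: assumes ':' separates pairs and the score is the
    part after the LAST '-', so hyphenated tags survive.
    """
    pairs = []
    for item in text.split(':'):
        tag, sep, score = item.rpartition('-')
        if not sep:
            raise ValueError("missing '-' separator in %r" % item)
        pairs.append((tag, float(score)))  # float() raises on a malformed score
    return pairs

sample = "tequila-0.3225806451612903:liquor-0.25806451612903225"
parsed = parse_tag_scores(sample)

# Serialize and read back to confirm the payload is valid JSON.
payload = json.dumps(
    {"predicted_semantic_tags": [{"semantic_tag": t, "score": s} for t, s in parsed]}
)
roundtrip = json.loads(payload)
```

Converting the score to `float` here is a design choice for the check only; it surfaces malformed scores early, whereas the notebook keeps scores as strings.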

### Create Table

In [53]:
TABLE_NAME = 'uber_eats.query_semantic_tags_grocery'


In [54]:
TTL_FACTOR_TABLES_DAYS = 90
datestr = datetime.today().strftime('%Y-%m-%d') 
print(datestr)

2023-02-09
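
To make the 90-day retention concrete, the sketch below computes when a partition stamped with a given `datestr` would age out. It assumes the TTL is counted from the partition date; how the platform actually enforces the TTL is not stated in this notebook, and `partition_expiry` is a hypothetical helper.

```python
from datetime import datetime, timedelta

TTL_DAYS = 90  # mirrors TTL_FACTOR_TABLES_DAYS in the notebook

def partition_expiry(datestr, ttl_days=TTL_DAYS):
    """Date on which a partition stamped `datestr` would age out.

    Assumption: the TTL window starts at the partition date itself.
    """
    partition_date = datetime.strptime(datestr, "%Y-%m-%d").date()
    return partition_date + timedelta(days=ttl_days)

expiry = partition_expiry("2023-02-09")
print(expiry.isoformat())  # prints 2023-05-10
```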

In [55]:
spark.sql(""" CREATE TABLE IF NOT EXISTS {table_name} (
                     query string,
                     predicted_semantic_tags string
                     )
                     PARTITIONED BY(datestr string)
                     tblproperties ('dc_replication'='true', 'ttl.partion'='{ttl}d')
                     ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
                     STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
                     OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
                     """.format(table_name=TABLE_NAME, ttl=TTL_FACTOR_TABLES_DAYS))

spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

DataFrame[key: string, value: string]

In [64]:
df = df.withColumn("predicted_semantic_tags",jsonify_UDF(F.col("semantic_tags"))).select("query", "predicted_semantic_tags")


In [65]:
df.show(truncate = False)

+-------------------+-----------------------+
|query              |predicted_semantic_tags|

In [70]:
ingest_data = True 


In [71]:
if ingest_data:
    (df
     .withColumn("datestr", F.lit(datestr))  # insertInto matches columns by position, so the partition column goes last
     .write
     .insertInto(TABLE_NAME, overwrite=True))
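
Before writing, it may be worth confirming that each `predicted_semantic_tags` value parses as JSON with the expected keys. The sketch below is plain Python and is not part of the original notebook; `validate_payload` is a hypothetical helper, and the key names are taken from the `jsonify` output shown earlier.

```python
import json

def validate_payload(payload):
    """Return True if payload is a JSON object holding a 'predicted_semantic_tags'
    list whose items each carry 'semantic_tag' and 'score' keys."""
    try:
        data = json.loads(payload)
    except (TypeError, ValueError):  # None, non-strings, or invalid JSON
        return False
    if not isinstance(data, dict):
        return False
    tags = data.get("predicted_semantic_tags")
    if not isinstance(tags, list):
        return False
    return all(
        isinstance(t, dict) and "semantic_tag" in t and "score" in t
        for t in tags
    )

# A JSON string with double quotes passes; a Python dict repr with single quotes does not.
good = '{"predicted_semantic_tags": [{"semantic_tag": "tequila", "score": "0.32"}]}'
bad = "{'predicted_semantic_tags': [{'semantic_tag': 'tequila', 'score': '0.32'}]}"
```

On the cluster, one way to apply this would be to a small sample, for example `[r for r in df.limit(100).collect() if not validate_payload(r["predicted_semantic_tags"])]`, where the sample size of 100 is arbitrary.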