# Data pre-processing for Azure Data Explorer

<img src="https://github.com/Azure/azure-kusto-spark/raw/master/kusto_spark.png" style="border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa"/>

We often see customer scenarios where historical data has to be migrated to Azure Data Explorer (ADX). Although ADX has very powerful data-transformation capabilities via [update policies](https://docs.microsoft.com/azure/data-explorer/kusto/management/updatepolicy), some data engineering tasks, simple or complex, must be done up front. This is the case when the original data structure is too complex, or when single data elements are too big and hit ADX limits: 1 MB for dynamic columns, or a maximum ingest file size of 1 GB for uncompressed data (see also [Comparing ingestion methods and tools](https://docs.microsoft.com/azure/data-explorer/ingest-data-overview#comparing-ingestion-methods-and-tools)).

Let's think about an Industrial Internet-of-Things (IIoT) use case where you get data from several production lines. In each production line, several devices read humidity, pressure, etc. The following example shows a scenario where a one-to-many relationship is implemented within an array. With this you might get very large columns (with millions of device readings per production line) that might exceed the 1 MB limit in Azure Data Explorer for dynamic columns.
In this case you need to do some pre-processing.
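
To get a feeling for when an array column becomes a problem, the following plain-Python sketch estimates the serialized size of a value and compares it with the 1 MB limit. It is only an approximation: the limit is assumed here to be 1024 * 1024 bytes of compact UTF-8 JSON, which may differ from how ADX measures it, and the reading layout (`deviceId`, `humidity`) is hypothetical.

```python
import json

# Assumption: the 1 MB limit is treated as 1024 * 1024 bytes of UTF-8 JSON.
DYNAMIC_LIMIT_BYTES = 1024 * 1024


def dynamic_size_bytes(value):
    """Return the size of a value serialized as compact UTF-8 JSON."""
    return len(json.dumps(value, separators=(",", ":")).encode("utf-8"))


def exceeds_dynamic_limit(value, limit=DYNAMIC_LIMIT_BYTES):
    """Return True if the serialized value is larger than the limit."""
    return dynamic_size_bytes(value) > limit


# Hypothetical readings, for illustration only.
small = [{"deviceId": "d1", "humidity": 41.5}]
large = [{"deviceId": f"d{i}", "humidity": 41.5} for i in range(100_000)]

print(exceeds_dynamic_limit(small))  # a single reading stays far below the limit
print(exceeds_dynamic_limit(large))  # 100,000 readings in one array do not
```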


The data has already been uploaded to Azure Storage. You will start by reading the JSON data into a dataframe:

In [0]:
inputpath = "wasbs://synapsework@kustosamplefiles.blob.core.windows.net/*.json"

# optional, for the output to Azure Storage:
#outputpath = "<your-storage-path>"

df = spark.read.format("json").load(inputpath)

The notebook has a parameter, *IngestDate*, which is used to set the extentsCreationTime. You can call this notebook from Azure Data Factory for every day you want to load into Azure Data Explorer.
Alternatively, you can make use of a partitioning policy.

In [0]:
dbutils.widgets.text("wIngestDate", "2021-08-06T00:00:00.000Z", "Ingestion Date")
IngestDate = dbutils.widgets.get("wIngestDate")
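
When the notebook is called once per day, the caller needs one parameter value per day. The following sketch generates such values in the same format as the widget default. The function name `ingest_dates` and the three-day range are hypothetical, and how the values are handed to the notebook depends on the orchestration tool.

```python
from datetime import datetime, timedelta, timezone


def ingest_dates(start, days):
    """Yield one UTC timestamp string per day, formatted like the widget default."""
    for offset in range(days):
        day = start + timedelta(days=offset)
        yield day.strftime("%Y-%m-%dT%H:%M:%S.000Z")


# Hypothetical range: three consecutive days starting at the widget default.
start = datetime(2021, 8, 6, tzinfo=timezone.utc)
dates = list(ingest_dates(start, 3))

for value in dates:
    print(value)
```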

In [0]:
display(df)

We see that the dataframe has some complex data types. The only thing we want to change here is to get rid of the array, so that the resulting dataset has one row for every entry in the measurement array.

*How can we achieve this?*

PySpark SQL has some very powerful functions for transforming complex data types. We will make use of the [explode function](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.explode.html). In this case `explode("measurement")` gives us a dataframe with one row per array element. Finally, we only have to drop the original measurement column (it still holds the original array structure):

In [0]:
# Import only what is used instead of a wildcard import
from pyspark.sql.functions import explode

df_explode = df.select("*", explode("measurement").alias("device")).drop("measurement")
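
To make the effect of the explode step easier to follow, here is a minimal plain-Python sketch (not Spark) of the same idea on a list of dictionaries. The helper `explode_rows` and the sample record are hypothetical. Like Spark's `explode`, it produces no output row for an input row whose array is empty or missing.

```python
def explode_rows(rows, array_key, alias):
    """Return one output row per element of each row's array, without the array itself."""
    result = []
    for row in rows:
        # Keep every field except the array that is being exploded.
        base = {k: v for k, v in row.items() if k != array_key}
        for element in row.get(array_key) or []:
            result.append({**base, alias: element})
    return result


# Hypothetical input: one production line with two readings, one with none.
rows = [
    {"line": "A", "measurement": [{"humidity": 40.0}, {"humidity": 42.5}]},
    {"line": "B", "measurement": []},
]

exploded = explode_rows(rows, "measurement", "device")
for row in exploded:
    print(row)
```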

With this we have already done the necessary data transformation in one line of code. Let's do some final "prettifying".
As we are already pre-processing the data and want to get rid of the complex data types, we select the struct elements to get a simplified table:

In [0]:
df_all_in_column = df_explode.select("header.*", "device.header.*", "device.*", "ProdLineData.*").drop("header")
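
Selecting `"<struct>.*"` lifts the fields of a struct to top-level columns. The following plain-Python sketch (not Spark) shows the same idea on a nested dictionary. The helper `flatten_one_level` and the sample row are hypothetical, and raising an error on a duplicate field name is a design choice of this sketch, not something the notebook does.

```python
def flatten_one_level(row, struct_keys):
    """Lift the fields of the named nested dicts to the top level of the row."""
    flat = {k: v for k, v in row.items() if k not in struct_keys}
    for key in struct_keys:
        for field, value in (row.get(key) or {}).items():
            if field in flat:
                # Two structs contributing the same field name would be ambiguous.
                raise ValueError(f"duplicate column name: {field}")
            flat[field] = value
    return flat


# Hypothetical exploded row, for illustration only.
row = {
    "header": {"ProductionLineId": "line-1"},
    "device": {"deviceId": "d1", "humidity": 41.5},
}

flat = flatten_one_level(row, ["header", "device"])
print(flat)
```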

In [0]:
display(df_all_in_column)

We set the extentsCreationTime to the notebook parameter *IngestDate*. For other ingestion properties, see the [pyKusto.py sample](https://github.com/Azure/azure-kusto-spark/blob/master/samples/src/main/python/pyKusto.py).

In [0]:
extentsCreationTime = sc._jvm.org.joda.time.DateTime.parse(IngestDate)
sp = sc._jvm.com.microsoft.kusto.spark.datasink.SparkIngestionProperties(
        False, None, None, None, None, extentsCreationTime, None, None)
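
Because the parameter arrives as free text, it can help to validate it before it is parsed on the JVM side. The sketch below is one possible check in plain Python. The helper `validate_ingest_date` is hypothetical, and it only accepts the exact format of the widget default, which is probably stricter than what the parser used in the notebook accepts.

```python
from datetime import datetime


def validate_ingest_date(value):
    """Parse a timestamp shaped like the widget default; raise ValueError otherwise."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")


parsed = validate_ingest_date("2021-08-06T00:00:00.000Z")
print(parsed.date())

try:
    validate_ingest_date("06/08/2021")
except ValueError as err:
    print(f"rejected: {err}")
```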

Finally, we write the resulting dataframe back to Azure Data Explorer. The prerequisites for this are:
* the target table has been created in the target database (`.create table measurement (ProductionLineId:string, deviceId:string, enqueuedTime:datetime, humidity:real, humidity_unit:string, temperature:real, temperature_unit:string, pressure:real, pressure_unit:string, reading:dynamic)`)
* a service principal has been created for ADX access
* the service principal (AAD application) accessing ADX has sufficient permissions (add the ingestor and viewer roles)
* the latest Kusto library has been installed from Maven, see also the [Azure Data Explorer Connector for Apache Spark documentation](https://github.com/Azure/azure-kusto-spark#usage)
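
If the target schema is maintained in code, the table-creation command from the first prerequisite can be generated rather than typed by hand. The following is a small sketch: the column names and types are copied from that command, while the helper `create_table_command` is hypothetical. The resulting string still has to be run against the ADX database by whatever means you normally use.

```python
# Column names and types copied from the .create table command in the prerequisites.
columns = {
    "ProductionLineId": "string",
    "deviceId": "string",
    "enqueuedTime": "datetime",
    "humidity": "real",
    "humidity_unit": "string",
    "temperature": "real",
    "temperature_unit": "string",
    "pressure": "real",
    "pressure_unit": "string",
    "reading": "dynamic",
}


def create_table_command(table, columns):
    """Build a .create table command string from a name-to-type mapping."""
    cols = ", ".join(f"{name}:{kql_type}" for name, kql_type in columns.items())
    return f".create table {table} ({cols})"


command = create_table_command("measurement", columns)
print(command)
```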

In [0]:
df_all_in_column.write. \
  format("com.microsoft.kusto.spark.datasource"). \
  option("kustoCluster", "https://<yourcluster>"). \
  option("kustoDatabase", "<your-database>"). \
  option("kustoTable", "<your-table>"). \
  option("sparkIngestionPropertiesJson", sp.toString()). \
  option("kustoAadAppId", "<app-id>"). \
  option("kustoAadAppSecret", dbutils.secrets.get(scope="<scope-name>", key="<service-credential-key-name>")). \
  option("kustoAadAuthorityID", "<tenant-id>"). \
  mode("Append"). \
  save()

You might also consider writing the data to Azure Storage (this can also make sense as an intermediate staging step in more complex transformation pipelines):

In [0]:
# df_all_in_column.write.mode('overwrite').json(outputpath) 