# Spark Diagnostic: Recurrent Application Analytics 
In this notebook, you can pick an arbitrary Spark application (the selected application) to
+ Detect applications that recur with the selected application
+ Select a comparison application and compare it with the selected one in `application`, `job` and `stage` modes
## Prerequisites
+ To use this notebook in a workspace that has [Managed Virtual Network](https://docs.microsoft.com/en-us/azure/synapse-analytics/security/synapse-workspace-managed-vnet) enabled, or that blocks outbound traffic to Azure services, 
you need to run the additional code (shown as Step 1) in an environment outside Synapse, such as a VM or Azure Cloud Shell.
+ To get the Spark application list and application details, you need to be assigned one of the `Synapse Administrator`, `Synapse Contributor`, `Synapse Compute User` or `ApacheSparkSuperUser` roles on the Synapse workspace where your application runs ([How-to](https://docs.microsoft.com/en-us/azure/synapse-analytics/security/how-to-set-up-access-control)). You can check your roles under `Manage -> Access Control`.
+ To download Spark event logs, you need to be assigned the `Storage Blob Data Contributor` role on the Blob storage account that the logs will be downloaded to ([How-to](https://docs.microsoft.com/en-us/azure/synapse-analytics/security/how-to-set-up-access-control)).
+ You can learn more about [Azure Role Based Access Control (RBAC)](https://docs.microsoft.com/en-us/azure/role-based-access-control/) and [Synapse Access Control](https://docs.microsoft.com/en-us/azure/azure-sql/database/logins-create-manage?toc=%2Fazure%2Fsynapse-analytics%2Ftoc.json&bc=%2Fazure%2Fsynapse-analytics%2Fbreadcrumb%2Ftoc.json0).
## Limits
+ This notebook is built on top of the [Azure Synapse Analytics REST API](https://docs.microsoft.com/en-us/rest/api/synapse/), so it needs network access to that API. 
Please check the [firewall rules](https://docs.microsoft.com/en-us/azure/synapse-analytics/security/synapse-workspace-ip-firewall) and [managed virtual network settings](https://docs.microsoft.com/en-us/azure/synapse-analytics/security/synapse-workspace-managed-vnet) of your workspace.


# Step 0: Select the application to diagnose
- `developmentEndpoint`: The development endpoint of the workspace.
- `webHost`: The web host of the workspace.
- `tenantId`: The tenant id of the workspace.
- `selectedApplicationName`: The name of your selected application.
- `selectedApplicationId`: The Spark application ID of the selected application.
- `recurrentAppScanStartTime`: The start of the scan window. It can be at most 90 days in the past; an earlier value is forcibly reset to `now - 90 days`.
- `recurrentAppScanEndTime`: The end of the scan window. Should be later than `recurrentAppScanStartTime`.
- `sparkEventOutputBaseFolder`: This notebook downloads [Spark event logs](https://spark.apache.org/docs/latest/monitoring.html) into this ADLS Gen2 path. Please make sure you have been granted the `Storage Blob Data Contributor` role.
You can also use a full [Azure Blob File System URI](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction-abfs-uri#uri-syntax) when you are not using the current workspace's primary ADLS Gen2 file system.
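
The 90-day rule for the scan window can be sketched in plain Scala as follows. This is only an illustration of the behavior described above, not the notebook's own implementation: the `ScanWindow` object and its `resolve` method are hypothetical names, and the sketch assumes that "90 days" means exactly 90 × 24 hours before the current time.

```scala
import java.time.{Duration, Instant}

object ScanWindow {
  // Assumption: the lookback limit is exactly 90 * 24 hours before `now`.
  val MaxLookback: Duration = Duration.ofDays(90)

  /** Returns the (start, end) instants that would effectively be scanned. */
  def resolve(start: String, end: String, now: Instant): (Instant, Instant) = {
    val earliest = now.minus(MaxLookback)
    // Both values are expected in ISO-8601 form, e.g. 2020-08-30T10:21:36Z.
    val requestedStart = Instant.parse(start)
    val requestedEnd = Instant.parse(end)
    // A start time older than the limit is moved forward to `now - 90 days`.
    val effectiveStart =
      if (requestedStart.isBefore(earliest)) earliest else requestedStart
    require(
      requestedEnd.isAfter(effectiveStart),
      s"recurrentAppScanEndTime ($requestedEnd) must be later than the start ($effectiveStart)")
    (effectiveStart, requestedEnd)
  }
}
```

Calling `ScanWindow.resolve(recurrentAppScanStartTime, recurrentAppScanEndTime, Instant.now())` before Step 1 would surface an invalid window early instead of after the scan has started.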


You can find the values of `developmentEndpoint`, `webHost`, `tenantId`, `gen2StorageAccount` and `gen2Container` in the [Azure Portal](https://ms.portal.azure.com/#home).

Here is an example.
```
val developmentEndpoint = "https://chayangwestus2.dev.azuresynapse.net"
val webHost = "https://web.azuresynapse.net"
val tenantId = "72f988bf-86f1-41af-91ab-2d7cd011db47"
val selectedApplicationName = "SparkJobDefinition_432290f8-ac3c-4eec-a984-24dc2db92b20"
val selectedApplicationId = "application_1598611528702_0071"
val recurrentAppScanStartTime = "2020-08-30T10:21:36Z"
val recurrentAppScanEndTime = "2020-08-30T18:21:36Z"
val sparkEventOutputBaseFolder = "/diagnostic/spark-events"
```

In [4]:
%%spark

val developmentEndpoint = "$$YourWorkspaceDevEndpoint$$"
val webHost = "$$YourWorkspaceWebHost$$"
val tenantId = "$$YourWorkspaceTenantId$$"
val selectedApplicationName = "$$YourSelectedApplicationName$$"
val selectedApplicationId = "$$YourSelectedApplicationId$$"
val recurrentAppScanStartTime = "$$RecurrentApplicationScanStartTime$$"
val recurrentAppScanEndTime = "$$RecurrentApplicationScanEndTime$$"
val sparkEventOutputBaseFolder = "$$SparkEventOutputBaseDirectory$$"

# Step 1: Detect recurrent applications
Applications sharing the same normalized name are treated as recurrent ones.
- This step calls the [Azure Synapse Analytics REST API](https://docs.microsoft.com/en-us/rest/api/synapse/) to download the recurrent applications' Spark events to `sparkEventOutputBaseFolder`. It may take several minutes to complete, depending on the number of recurrent applications detected and the size of the Spark events.
- If you encounter network connection problems, please check the [firewall rule](https://docs.microsoft.com/en-us/azure/synapse-analytics/security/synapse-workspace-ip-firewall) and [managed virtual network settings](https://docs.microsoft.com/en-us/azure/synapse-analytics/security/synapse-workspace-managed-vnet) of your workspace. 
If your workspace blocks outbound traffic to the Azure Synapse Analytics API server, run this step in an environment outside Synapse, as described in the prerequisites.
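
The idea of grouping applications by a normalized name can be illustrated with a small, self-contained sketch. The notebook does not state which normalization rule `RecurrentJobDetector` applies, so the rule below (trim, lower-case, drop a trailing numeric run counter) is purely an assumption for illustration, and `RecurrentNames` is a hypothetical helper.

```scala
object RecurrentNames {
  // Hypothetical rule, NOT the notebook's actual one: trim, lower-case,
  // and drop a trailing "_<digits>" or "-<digits>" run counter.
  def normalize(name: String): String =
    name.trim.toLowerCase.replaceAll("[_-]\\d+$", "")

  /** Groups application names so that each group holds one recurrent set. */
  def groupRecurrent(names: Seq[String]): Map[String, Seq[String]] =
    names.groupBy(normalize)
}
```

Under this assumed rule, `DailyLoad_001` and `dailyload_002` would land in the same recurrent set, while an application with a different base name would form its own group.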

In [10]:
%%spark
import org.apache.spark.diagnostic.recurrent._
import org.apache.spark.diagnostic.util._
import shapeless._
import syntax.std.traversable._
import scala.collection.mutable.ListBuffer

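// Find applications that share the selected application's normalized name within the scan window.
val applications = RecurrentJobDetector.detect(developmentEndpoint, selectedApplicationName, recurrentAppScanStartTime, recurrentAppScanEndTime)
// Keep only applications whose latest attempt id is positive; `exists` also skips entries without an attempt id.
val validApps = applications.sparkJobs.filter(_.latestAttemptId.exists(_ > 0))
val validApplications = Applications(validApps.size, validApps)
// Download each application's event log and remember where it was stored.
validApplications.sparkJobs.foreach(app => {
    app.eventLogFile = Some(LogDownloader.fetchLog(developmentEndpoint, app.sparkPoolName, app.livyId.toInt, app.sparkApplicationId, app.latestAttemptId.get, sparkEventOutputBaseFolder))
})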
displayHTML(ApplicationProvider.outputHtml(validApplications))

# Step 2: Build metrics for recurrent applications
By replaying Spark events, this notebook builds metrics for each application in the recurrent set. This step may take several minutes to complete, depending on the size of the Spark events.

- `TrendHelper.getApplicationsMetricsTrend`: Build application level metrics for each application.
- `TrendHelper.getJobMetricsTrend(jobId: Int, validApplications: Applications)`: Build job level metrics for each application and each job.
- `TrendHelper.getStageMetricsTrend(stageId: Int, validApplications: Applications)`: Build stage level metrics for each application and each stage. 

In [11]:
%%spark
import org.apache.spark.diagnostic.recurrent.helper._
import org.apache.spark.diagnostic.recurrent.ApplicationProvider.getApplicationMetrics

// Read the metric names from the first application's event log; they become the column titles.
val metricsInfo = getApplicationMetrics(validApplications.sparkJobs(0).eventLogFile.get)
val metricsTitle = new ListBuffer[String]()
metricsTitle += "submitTime"
metricsInfo.applicationMetrics.keySet.foreach(key => {
   metricsTitle += key.name
})

## Step 2.1: Display application mode trend
The line chart contains many metric types. You can view different dimensions by selecting metrics that belong to different groups.
- Records group: inputRecords, outputRecords, shuffleReadRecords, shuffleWriteRecords.
- Throughput group: shuffleReadSizeBytes, shuffleWriteSizeBytes, memSpillsBytes, diskSpillsBytes.
- Time group: totalRuntimeMs, submissionToFirstLaunchDelayMs, firstLaunchToCompletedMs, netIoTimeMs, executorRunTimeLessShuffleMs, executorRunTimeMs, executorCpuTimeMs, shuffleReadFetchWaitTimeMs, shuffleWriteTimeMs.
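
Picking the columns of one group can also be done in code rather than in the chart settings. The following is a minimal sketch in plain Scala: the group membership is copied from the list above, while the `MetricGroups` object, the group keys (`records`, `throughput`, `time`) and the `columnsFor` helper are hypothetical names introduced for illustration.

```scala
object MetricGroups {
  // Group membership copied from the list above; the keys are made up for this sketch.
  val groups: Map[String, Seq[String]] = Map(
    "records" -> Seq("inputRecords", "outputRecords", "shuffleReadRecords", "shuffleWriteRecords"),
    "throughput" -> Seq("shuffleReadSizeBytes", "shuffleWriteSizeBytes", "memSpillsBytes", "diskSpillsBytes"),
    "time" -> Seq(
      "totalRuntimeMs", "submissionToFirstLaunchDelayMs", "firstLaunchToCompletedMs",
      "netIoTimeMs", "executorRunTimeLessShuffleMs", "executorRunTimeMs",
      "executorCpuTimeMs", "shuffleReadFetchWaitTimeMs", "shuffleWriteTimeMs")
  )

  /** The x-axis column followed by the group's metrics that are actually available. */
  def columnsFor(group: String, available: Seq[String], xAxis: String = "submitTime"): Seq[String] = {
    val wanted = groups.getOrElse(group, Seq.empty).toSet
    xAxis +: available.filter(wanted.contains)
  }
}
```

In the notebook, the result could be applied to the trend DataFrame with `df.select(cols.head, cols.tail: _*)`, where `cols` is the output of `MetricGroups.columnsFor("time", metricsTitle.toList)`, so that the chart only offers metrics of comparable scale.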

In [4]:
%%spark
val applicationMetricsMap = TrendHelper.getApplicationsMetricsTrend(validApplications)
// Each row is the submit time followed by 15 metric values, so `metricsTitle` must hold exactly 16 names.
val df = applicationMetricsMap.toSeq.map(data => {
    (data._1, data._2(0), data._2(1), data._2(2), data._2(3), data._2(4), data._2(5), data._2(6), data._2(7), data._2(8), data._2(9), data._2(10), data._2(11), data._2(12), data._2(13), data._2(14))
}).toDF(metricsTitle.toList: _*)

display(df)

## Step 2.2: Display job mode trend
The following cell displays a table of `jobId` and `jobName` for each recurrent application.
Clicking the link on a `jobName` opens the corresponding page in the Spark history server.

In [5]:
%%spark
// job - application matrix
displayHTML(TrendHelper.displayApplicationJobsDetailHTMLTable(webHost, developmentEndpoint, tenantId, validApplications))

The following cell shows the job mode trend for a `jobId` taken from the above cell's output.

In [6]:
%%spark
// select jobId from the above cell's output
// `getJobMetricsTrend` takes an Int job id, so the substituted value is converted.
val jobId = "$$TheJobIdOfTheJobMerticTrend$$".toInt
val jobMetricsMap = TrendHelper.getJobMetricsTrend(jobId, validApplications)
val df = jobMetricsMap.toSeq.map(data => {
    (data._1, data._2(0), data._2(1), data._2(2), data._2(3), data._2(4), data._2(5), data._2(6), data._2(7), data._2(8), data._2(9), data._2(10), data._2(11), data._2(12), data._2(13), data._2(14))
}).toDF(metricsTitle.toList: _*)

display(df)

## Step 2.3: Display stage mode trend
The following cell displays a table of `stageId` and `stageName` for each recurrent application.
Clicking the link on a `stageName` opens the corresponding page in the Spark history server.

In [7]:
%%spark
displayHTML(TrendHelper.displayApplicationStagesDetailHTMLTable(webHost, developmentEndpoint, tenantId, validApplications))

The following cell shows the stage mode trend for a `stageId` taken from the above cell's output.

In [8]:
%%spark
// select stageId from the above cell's output
// `getStageMetricsTrend` takes an Int stage id, so the substituted value is converted.
val stageId = "$$TheStageIdOfTheStageMerticTrend$$".toInt
val stageMetricsMap = TrendHelper.getStageMetricsTrend(stageId, validApplications)
val df = stageMetricsMap.toSeq.map(data => {
    (data._1, data._2(0), data._2(1), data._2(2), data._2(3), data._2(4), data._2(5), data._2(6), data._2(7), data._2(8), data._2(9), data._2(10), data._2(11), data._2(12), data._2(13), data._2(14))
}).toDF(metricsTitle.toList: _*)

display(df)

# Step 3: Select an application to compare
Select a comparison application, and compare it with the selected application.
+ `comparisonApplicationId`: The spark application id of the comparison application. Please select from Step 1's output table.
+ `thresholdLowToMedium`, `thresholdMediumToHigh`: The following code compares each metric between the selected and the comparison application and calculates the relative deviation, which is a decimal between 0 and 1. 
Based on the deviation, each metric is classified into one of 3 `level`s (`DiagnosticDiffLevel.LOW`, `DiagnosticDiffLevel.MEDIUM` and `DiagnosticDiffLevel.HIGH`). 
These 2 parameters set the thresholds between adjacent levels.
+ `selectedLevels`: Only `level`s in this list will be presented in the output. Available elements are `DiagnosticDiffLevel.LOW`, `DiagnosticDiffLevel.MEDIUM` and `DiagnosticDiffLevel.HIGH`.
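
To make the role of the two thresholds concrete, here is a minimal, self-contained sketch of how a relative deviation could be computed and mapped to a level. The notebook does not document the exact formula or whether the thresholds are inclusive, so both are assumptions here: the deviation is taken as `|a - b| / max(|a|, |b|)`, and each threshold is treated as the inclusive lower bound of the next level. `DeviationLevels` and its members are hypothetical names and are unrelated to the real `DiagnosticDiffLevel` type.

```scala
object DeviationLevels {
  sealed trait Level
  case object Low extends Level
  case object Medium extends Level
  case object High extends Level

  // Assumed formula (not confirmed by the notebook): |a - b| / max(|a|, |b|).
  // It stays within [0, 1] when both values have the same sign; two zeros give 0.
  def relativeDeviation(selected: Double, comparison: Double): Double = {
    val scale = math.max(math.abs(selected), math.abs(comparison))
    if (scale == 0.0) 0.0 else math.abs(selected - comparison) / scale
  }

  // Assumption: each threshold is the inclusive lower bound of the next level.
  def classify(deviation: Double, lowToMedium: Double, mediumToHigh: Double): Level = {
    require(
      0.0 <= lowToMedium && lowToMedium <= mediumToHigh && mediumToHigh <= 1.0,
      "thresholds must satisfy 0 <= lowToMedium <= mediumToHigh <= 1")
    if (deviation >= mediumToHigh) High
    else if (deviation >= lowToMedium) Medium
    else Low
  }
}
```

For example, with thresholds `0.1` and `0.5`, a runtime of 100 s in the selected application against 80 s in the comparison application gives a deviation of 0.2 under this formula, which falls into the medium level.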

In [14]:
%%spark
import org.apache.spark.diagnostic.appdiff._
import collection.JavaConverters._
// select the comparison application id from the output of step 1.
val comparisonApplicationId = "$$YourComparisonApplicationId$$"
val thresholdLowToMedium = "$$ThresholdLowToMedium$$"
val thresholdMediumToHigh = "$$ThresholdMediumToHigh$$"
val selectedLevels = List(DiagnosticDiffLevel.LOW, DiagnosticDiffLevel.MEDIUM, DiagnosticDiffLevel.HIGH).asJava

In [22]:
%%spark
import com.fasterxml.jackson.core.JsonParser
import com.fasterxml.jackson.databind.ObjectMapper
import org.json4s._
import org.json4s.jackson.Json4sScalaModule
implicit val formats = DefaultFormats

val mapper = new ObjectMapper()
mapper.configure(JsonParser.Feature.ALLOW_NON_NUMERIC_NUMBERS, true)
mapper.registerModule(new Json4sScalaModule)

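// Look up the downloaded event log of each application and fail with a clear message if it is missing.
val selectedSparkEvent = applications.sparkJobs.find(_.sparkApplicationId == selectedApplicationId).flatMap(_.eventLogFile).getOrElse(sys.error(s"No event log found for $selectedApplicationId"))
val comparisonSparkEvent = applications.sparkJobs.find(_.sparkApplicationId == comparisonApplicationId).flatMap(_.eventLogFile).getOrElse(sys.error(s"No event log found for $comparisonApplicationId"))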

# Step 4: Compare applications
We support comparing the selected and the comparison application in application, job and stage modes. 
In application mode, any 2 applications can be compared. 

But in job and stage modes, application `A` and application `B` can be compared if and only if:
- `A` and `B` have the same number of jobs. Jobs that have the same id in `A` as in `B` share the same job name. The 2 corresponding jobs are called a `job pair`.
- The execution graphs (DAGs) of each `job pair` are isomorphic.
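
The first condition can be checked with a few lines of plain Scala. The sketch below is only an illustration under simplifying assumptions: `Job` is a hypothetical minimal model holding just an id and a name, job ids are assumed to be unique within an application, and the DAG isomorphism check of the second condition is deliberately left out.

```scala
object JobPairing {
  // Hypothetical minimal model of a job: only its id and name.
  final case class Job(id: Int, name: String)

  /**
   * Returns the job pairs, ordered by job id, if both applications have the same
   * number of jobs and every id in `a` maps to a job with the same name in `b`;
   * otherwise None. Assumes ids are unique within each application.
   */
  def jobPairs(a: Seq[Job], b: Seq[Job]): Option[Seq[(Job, Job)]] = {
    if (a.size != b.size) None
    else {
      val bById = b.map(j => j.id -> j).toMap
      val pairs = for {
        ja <- a.sortBy(_.id)
        jb <- bById.get(ja.id).toSeq
        if jb.name == ja.name
      } yield (ja, jb)
      // Every job of `a` must have found its partner.
      if (pairs.size == a.size) Some(pairs) else None
    }
  }
}
```

A `None` result corresponds to the situation in which the job and stage mode comparisons are rejected; a `Some` result would still need the DAG check before the applications count as comparable.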

## Step 4.1: Compare applications on Application mode
In application mode, any 2 applications can be compared.

In [23]:
%%spark
val comparisonApp = DiffInNotebook.getAckAndFlattenedResult(DiagnosticDiffMode.APPLICATION, selectedSparkEvent, comparisonSparkEvent, thresholdLowToMedium, thresholdMediumToHigh, selectedLevels)
val jsonResApp = mapper.readValue(comparisonApp, classOf[JValue])

displayHTML((jsonResApp \ "html").extract[String])

## Step 4.2: Compare applications on Job mode
In job mode, only 2 applications that meet all the conditions above can be compared. If the comparison is not accepted in job mode, you may get the following error:
```
The comparison failed.
applications cannot match in job comparison.
```


In [24]:
%%spark

val comparisonJob = DiffInNotebook.getAckAndFlattenedResult(DiagnosticDiffMode.JOB, selectedSparkEvent, comparisonSparkEvent, thresholdLowToMedium, thresholdMediumToHigh, selectedLevels)
val jsonResJob = mapper.readValue(comparisonJob, classOf[JValue])

print((jsonResJob \ "message").extract[String])
if ((jsonResJob \ "code").extract[Int] != -1) {
    displayHTML((jsonResJob \ "html").extract[String])
}

## Step 4.3: Compare applications on Stage mode
In stage mode, only 2 applications that meet all the conditions above can be compared. If the comparison is not accepted in stage mode, you may get the following error:
```
The comparison failed.
applications cannot match in stage mode comparison.
```

In [25]:
%%spark

val comparisonStage = DiffInNotebook.getAckAndFlattenedResult(DiagnosticDiffMode.STAGE, selectedSparkEvent, comparisonSparkEvent, thresholdLowToMedium, thresholdMediumToHigh, selectedLevels)
val jsonResStage = mapper.readValue(comparisonStage, classOf[JValue])

print((jsonResStage \ "message").extract[String])
if ((jsonResStage \ "code").extract[Int] != -1) {
    displayHTML((jsonResStage \ "html").extract[String])
}