# Anomaly Detection with Azure Synapse Link for Cosmos DB and MMLSpark
### Business Scenario
The hypothetical scenario is a power plant where IoT devices monitor [steam turbines](https://en.wikipedia.org/wiki/Steam_turbine). The IoTSignals collection holds revolutions per minute (RPM) and megawatt (MW) readings for each turbine.

The data contains outliers that occur at random intervals. During these events, RPM values go up and MW output goes down, for circuit protection. The idea is to see both signals change at the same time, but in different directions. Suggested analytics scenarios are [Predictive Maintenance](https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/predictive-maintenance-playbook) and [Anomaly Detection](https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/apps-anomaly-detection-api).

In this notebook, we'll:

1. Load the data in the [Cosmos DB Analytical store](https://docs.microsoft.com/en-us/azure/cosmos-db/analytical-store-introduction) collection into a DataFrame
2. Perform data exploration using pyplot
3. Perform anomaly detection using [Azure Cognitive Services on Spark](https://mmlspark.blob.core.windows.net/website/index.html)
4. Visualize the anomalies using pyplot


>**Did you know?**  [MMLSpark](https://github.com/Azure/mmlspark) (Microsoft Machine Learning for Apache Spark) is an ecosystem of tools that builds on the distributed computing capability of Apache Spark. It integrates Spark Machine Learning pipelines with Azure Cognitive Services, the Microsoft Cognitive Toolkit (CNTK), LightGBM, and OpenCV.


### 1. Load the data in Cosmos DB Analytical store collection into a Dataframe using Synapse Link
>**Did you know?**  "cosmos.olap" is the Synapse Link Spark format that enables connection to the Cosmos DB Analytical store.

>**Did you know?**  To select a preferred list of regions in a multi-region Cosmos DB account, add `.option("spark.cosmos.preferredRegions", "<Region1>,<Region2>")`.


In [None]:
df_IoTSignals = spark.read\
                    .format("cosmos.olap")\
                    .option("spark.synapse.linkedService", "CosmosDBIoTDemo")\
                    .option("spark.cosmos.container", "IoTSignals")\
                    .load()
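
The read options above, including the optional preferred-regions setting, can also be collected in a dictionary and passed in one call. The following is a minimal sketch: the helper name `cosmos_olap_options` and the region names are hypothetical placeholders, while the option keys are the ones used in this notebook.

```python
def cosmos_olap_options(linked_service, container, preferred_regions=None):
    """Build the option map for a cosmos.olap read (hypothetical helper)."""
    options = {
        "spark.synapse.linkedService": linked_service,
        "spark.cosmos.container": container,
    }
    if preferred_regions:
        # The connector expects a single comma-separated string
        options["spark.cosmos.preferredRegions"] = ",".join(preferred_regions)
    return options


# Region names below are placeholders; use the regions of your own account
options = cosmos_olap_options("CosmosDBIoTDemo", "IoTSignals", ["West US 2", "East US"])
print(options)

# In a Synapse notebook, where `spark` is predefined:
# df_IoTSignals = spark.read.format("cosmos.olap").options(**options).load()
```

Leaving `preferred_regions` unset produces the same two options as the cell above.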

### 2. Data exploration using pyplot


In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Collect the data to the driver as a pandas DataFrame and drop incomplete rows
df_IoTSignals_pd = df_IoTSignals.toPandas().dropna()
# Cast the readings to integers for plotting
df_IoTSignals_pd['measureValue'] = df_IoTSignals_pd['measureValue'].astype(int)

# Megawatt output of turbine dev-1 over time
df_MW = df_IoTSignals_pd[(df_IoTSignals_pd['deviceId'] == 'dev-1') & (df_IoTSignals_pd['unitSymbol'] == 'MW')]
df_MW.plot(x='dateTime', y='measureValue', color='green', figsize=(20,5), label='Output MW')
plt.title('MW TimeSeries')
plt.show()

# Revolutions per minute of turbine dev-1 over time
df_RPM = df_IoTSignals_pd[(df_IoTSignals_pd['deviceId'] == 'dev-1') & (df_IoTSignals_pd['unitSymbol'] == 'RPM')]
df_RPM.plot(x='dateTime', y='measureValue', color='black', figsize=(20,5), label='Output RPM')
plt.title('RPM TimeSeries')
plt.show()

<img src="https://revin.blob.core.windows.net/synapselinknotebooks/MWRPMSeries.PNG" width="1400" style="float: center;"/>

### 3. Perform anomaly detection using Microsoft Machine Learning for Spark (MMLSpark)
* Create a Cognitive Services API account with access to the Anomaly Detector API. After creating the account, you can get your subscription key from the Azure portal.
* Search for 'paste-your-key-here' and replace the text with your subscription key.
* See the [SimpleDetectAnomalies reference](https://mmlspark.blob.core.windows.net/docs/1.0.0-rc1/pyspark/mmlspark.cognitive.html#module-mmlspark.cognitive.SimpleDetectAnomalies) to learn more about the parameters used in the SimpleDetectAnomalies module.
* See the [EntireDetectResponse reference](https://docs.microsoft.com/en-us/python/api/azure-cognitiveservices-anomalydetector/azure.cognitiveservices.anomalydetector.models.entiredetectresponse?view=azure-python) to learn more about the response class returned by the Anomaly Detector.
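
As an alternative to pasting the key into the notebook, the key can be read from the environment and the endpoint URL built from the region. This is only a sketch: the environment variable name `ANOMALY_DETECTOR_KEY` and both helper functions are assumptions made for illustration, and the URL pattern is taken from the cell below.

```python
import os


def anomaly_detector_url(region):
    """Return the entire-series detection endpoint for a region (hypothetical helper)."""
    return (f"https://{region}.api.cognitive.microsoft.com"
            "/anomalydetector/v1.0/timeseries/entire/detect")


def load_subscription_key(env_var="ANOMALY_DETECTOR_KEY"):
    """Read the key from an environment variable (the variable name is an assumption)."""
    key = os.environ.get(env_var, "")
    if not key or key == "paste-your-key-here":
        raise ValueError(f"Set {env_var} to your Anomaly Detector subscription key")
    return key


print(anomaly_detector_url("westus2"))
```

The two results could then be passed to `setUrl` and `setSubscriptionKey` in place of the literal strings.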


In [None]:
from pyspark.sql.functions import col
from mmlspark.cognitive import SimpleDetectAnomalies
# Importing FluentAPI adds the mlTransform method to Spark DataFrames
from mmlspark.core.spark import FluentAPI

# Configure the detector; the region in the URL (westus2 here) must match your Cognitive Services resource
anomaly_detector = (SimpleDetectAnomalies()
                            .setSubscriptionKey("paste-your-key-here")
                            .setUrl("https://westus2.api.cognitive.microsoft.com/anomalydetector/v1.0/timeseries/entire/detect")
                            .setOutputCol("anomalies")
                            .setGroupbyCol("grouping")
                            .setSensitivity(95)
                            .setGranularity("secondly"))

# Keep the RPM signal, shape the columns for the detector, and score one series per device
df_anomaly = (df_IoTSignals
                    .where(col("unitSymbol") == 'RPM')
                    .withColumnRenamed("dateTime", "timestamp")
                    .withColumn("value", col("measureValue").cast("double"))
                    .withColumn("grouping", col("deviceId"))
                    .mlTransform(anomaly_detector))

# Register the result so it can be queried with Spark SQL
df_anomaly.createOrReplaceTempView('df_anomaly')
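
To make the role of the `grouping` column concrete, the sketch below shows in plain Python how rows could be split into one time series per device and wrapped in a request body. It assumes the request shape of the Anomaly Detector v1.0 `entire/detect` endpoint (`series`, `granularity`, `sensitivity`). It illustrates the idea only and is not MMLSpark's implementation; the helper name and the sample rows are made up.

```python
from collections import defaultdict

# Synthetic rows with the same column names as the detector input above
rows = [
    {"grouping": "dev-1", "timestamp": "2020-01-01T00:00:01Z", "value": 3010.0},
    {"grouping": "dev-2", "timestamp": "2020-01-01T00:00:00Z", "value": 2990.0},
    {"grouping": "dev-1", "timestamp": "2020-01-01T00:00:00Z", "value": 3000.0},
]


def build_detect_payloads(rows, granularity="secondly", sensitivity=95):
    """Group rows by device and build one request body per group (hypothetical helper)."""
    series_by_group = defaultdict(list)
    for row in rows:
        series_by_group[row["grouping"]].append(
            {"timestamp": row["timestamp"], "value": row["value"]}
        )
    payloads = {}
    for group, series in series_by_group.items():
        # ISO 8601 timestamps in the same format sort correctly as strings
        series.sort(key=lambda point: point["timestamp"])
        payloads[group] = {
            "series": series,
            "granularity": granularity,
            "sensitivity": sensitivity,
        }
    return payloads


payloads = build_detect_payloads(rows)
print(sorted(payloads))
```

Each device gets its own series, so readings from one turbine are never scored against the history of another.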


### 4a. Format the dataframe for visualization


In [None]:
df_anomaly_single_device = spark.sql("""
    select timestamp
         , measureValue
         , anomalies.expectedValue
         , anomalies.expectedValue + anomalies.upperMargin as expectedUpperValue
         , anomalies.expectedValue - anomalies.lowerMargin as expectedLowerValue
         , case when anomalies.isAnomaly = true then 1 else 0 end as isAnomaly
    from df_anomaly
    where deviceId = 'dev-1'
    order by timestamp
    limit 400
""")

display(df_anomaly_single_device)
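
The query derives an upper and a lower bound from the expected value and the two margins, and turns the boolean flag into 0 or 1. The sketch below repeats that arithmetic in plain Python on made-up values. The field names follow the query; the function name and the sample numbers are assumptions for illustration.

```python
def to_band(row):
    """Flatten one scored row the same way the SQL query does (hypothetical helper)."""
    result = row["anomalies"]
    return {
        "timestamp": row["timestamp"],
        "measureValue": row["measureValue"],
        "expectedValue": result["expectedValue"],
        "expectedUpperValue": result["expectedValue"] + result["upperMargin"],
        "expectedLowerValue": result["expectedValue"] - result["lowerMargin"],
        "isAnomaly": 1 if result["isAnomaly"] else 0,
    }


# Made-up example: a reading far above the expected value, flagged by the detector
sample = {
    "timestamp": "2020-01-01T00:00:00Z",
    "measureValue": 3400.0,
    "anomalies": {
        "expectedValue": 3000.0,
        "upperMargin": 50.0,
        "lowerMargin": 50.0,
        "isAnomaly": True,
    },
}

band = to_band(sample)
print(band)
```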

### 4b. Visualize the anomalies using pyplot
* Plot Expected value, Upper Value, Lower Value and Actual Value along with Anomaly flag


In [None]:
import matplotlib.pyplot as plt

# Collect the formatted results to pandas and pick out the rows flagged as anomalies
adf = df_anomaly_single_device.toPandas()
adf_subset = adf[adf['isAnomaly'] == 1]

plt.figure(figsize=(23,8))
# Expected band (thin lines), expected value, and actual readings
plt.plot(adf['timestamp'],adf['expectedUpperValue'], color='darkred', linestyle='solid', linewidth=0.25)
plt.plot(adf['timestamp'],adf['expectedValue'], color='darkgreen', linestyle='solid', linewidth=2)
plt.plot(adf['timestamp'],adf['measureValue'], color='royalblue', linestyle='dotted', linewidth=2)
plt.plot(adf['timestamp'],adf['expectedLowerValue'], color='black', linestyle='solid', linewidth=0.25)
# Mark the anomalous readings with red dots
plt.plot(adf_subset['timestamp'],adf_subset['measureValue'], 'ro')
plt.legend(['RPM-UpperMargin', 'RPM-ExpectedValue', 'RPM-ActualValue', 'RPM-LowerMargin', 'RPM-Anomaly'])
plt.title('RPM Anomalies with Expected, Actual, Upper and Lower Values')
plt.show()

<img src="https://revin.blob.core.windows.net/synapselinknotebooks/RPMAnomaly.PNG" width="1800" style="float: center;"/>
