# Streaming ingestion into Azure Cosmos DB collection using Structured Streaming

In this notebook, we'll:

1. Simulate streaming data generation using Rate streaming source
2. Format the stream dataframe as per the IoTSignals schema
3. Write the streaming dataframe to the Azure Cosmos DB collection

>**Did you know?** Azure Cosmos DB is a great fit for IoT predictive maintenance and anomaly detection use cases. [Click here](https://docs.microsoft.com/en-us/azure/cosmos-db/synapse-link-use-cases#iot-predictive-maintenance) to learn more about an IoT architecture leveraging HTAP capabilities of Azure Synapse Link for Azure Cosmos DB.

>**Did you know?**  [Azure Synapse Link for Azure Cosmos DB](https://docs.microsoft.com/en-us/azure/cosmos-db/synapse-link) is a hybrid transactional and analytical processing (HTAP) capability that enables you to run near real-time analytics over operational data in Azure Cosmos DB.

>**Did you know?**  [Azure Cosmos DB analytical store](https://docs.microsoft.com/en-us/azure/cosmos-db/analytical-store-introduction?branch=release-build-cosmosdb) is a fully isolated column store that enables large-scale analytics against operational data in Azure Cosmos DB without affecting your transactional workloads.

### 1. Simulate streaming data generation using Rate streaming source
* The Rate streaming source is used here to keep the solution simple. It can be replaced with any supported streaming source, such as [Azure Event Hubs](https://azure.microsoft.com/en-us/services/event-hubs/) or [Apache Kafka](https://docs.microsoft.com/en-us/azure/hdinsight/kafka/apache-kafka-introduction).

* [Click here](https://github.com/Azure-Samples/streaming-at-scale) to learn more about the possible ways to implement an end-to-end streaming solution using a choice of different Azure technologies.

>**Did you know?**  The Rate streaming source generates data at the specified number of rows per second. Each output row contains a `timestamp` and a `value`.
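
To make the shape of those rows concrete, the following is a plain-Python imitation of what the Rate source emits. It is only a sketch and not Spark's implementation: the helper name `simulate_rate_rows` is hypothetical, and the even spacing of timestamps within each second is an assumption made for illustration.

```python
from datetime import datetime, timedelta

def simulate_rate_rows(rows_per_second, seconds, start_time):
    """Yield (timestamp, value) pairs shaped like Rate source rows (illustrative only)."""
    # Assumption: rows are spread evenly across each second
    step = timedelta(seconds=1) / rows_per_second
    for value in range(rows_per_second * seconds):
        # value is a running row counter that starts at 0
        yield (start_time + value * step, value)

rows = list(simulate_rate_rows(rows_per_second=10, seconds=2,
                               start_time=datetime(2024, 1, 1)))
print(len(rows))    # 20 rows for 2 seconds at 10 rows per second
print(rows[0][1])   # first value is 0
print(rows[-1][1])  # last value is 19
```
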

In [3]:
dfStream = (spark
                .readStream
                .format("rate")
                .option("rowsPerSecond", 10)
                .load()
            )

### 2. Format the stream dataframe as per the IoTSignals schema


In [None]:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType
import uuid

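numberOfDevices = 10  # number of distinct simulated devices
# Each call must return a fresh UUID, so mark the UDF as non-deterministic
generate_uuid = F.udf(lambda: str(uuid.uuid4()), StringType()).asNondeterministic()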
              
dfIoTSignals = (dfStream
                    .withColumn("id", generate_uuid())
                    .withColumn("dateTime", dfStream["timestamp"].cast(StringType()))
                    .withColumn("deviceId", F.concat(F.lit("device-id-"), F.expr("mod(value, %d)" % numberOfDevices)))
                    .withColumn("measureType", F.expr("CASE WHEN rand() < 0.5 THEN 'Rotation Speed' ELSE 'Output' END"))
                    .withColumn("unitSymbol", F.expr("CASE WHEN rand() < 0.5 THEN 'RPM' ELSE 'MW' END"))
                    .withColumn("unit", F.expr("CASE WHEN rand() < 0.5 THEN 'Revolutions per Minute' ELSE 'MegaWatts' END"))
                    .withColumn("measureValue", F.expr("CASE WHEN rand() > 0.9 THEN value * 2 WHEN rand() < 0.1 THEN value div 2 ELSE value END"))
                    .drop("timestamp")
                )
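
As a rough illustration of what one formatted record looks like, the sketch below mirrors the column expressions above in plain Python. The helper `to_iot_signal` is hypothetical and exists only for this example; the exact string form of `dateTime` produced by Spark's cast may differ from Python's `str()`. Note that `measureType`, `unitSymbol`, and `unit` are each drawn independently, as in the Spark code, so mixed combinations such as 'Rotation Speed' with 'MW' can occur.

```python
import random
import uuid
from datetime import datetime

NUMBER_OF_DEVICES = 10  # same value as numberOfDevices in the notebook

def to_iot_signal(timestamp, value, rng=random):
    """Build one IoTSignals-shaped dict from a Rate source row (illustrative only)."""
    # Mirrors: CASE WHEN rand() > 0.9 THEN value * 2
    #               WHEN rand() < 0.1 THEN value div 2 ELSE value END
    if rng.random() > 0.9:
        measure_value = value * 2
    elif rng.random() < 0.1:
        measure_value = value // 2
    else:
        measure_value = value
    return {
        "id": str(uuid.uuid4()),
        "value": value,  # the original counter column is kept; only timestamp is dropped
        "dateTime": str(timestamp),
        "deviceId": "device-id-%d" % (value % NUMBER_OF_DEVICES),
        "measureType": "Rotation Speed" if rng.random() < 0.5 else "Output",
        "unitSymbol": "RPM" if rng.random() < 0.5 else "MW",
        "unit": "Revolutions per Minute" if rng.random() < 0.5 else "MegaWatts",
        "measureValue": measure_value,
    }

sample = to_iot_signal(datetime(2024, 1, 1, 12, 0, 0), 23, random.Random(0))
print(sample["deviceId"])  # device-id-3, because 23 mod 10 is 3
```
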

### 3. Write the streaming dataframe to the Azure Cosmos DB collection
>**Did you know?** `cosmos.oltp` is the Spark format that connects to the Cosmos DB transactional store.

>**Did you know?** Ingestion into the Cosmos DB collection always goes through the transactional store, whether or not the analytical store is enabled.

In [None]:
streamQuery = dfIoTSignals\
                    .writeStream\
                    .format("cosmos.oltp")\
                    .outputMode("append")\
                    .option("spark.cosmos.connection.mode", "gateway")\
                    .option("spark.synapse.linkedService", "CosmosDBIoTDemo")\
                    .option("spark.cosmos.container", "IoTSignals")\
                    .option("checkpointLocation", "/writeCheckpointDir")\
                    .start()

streamQuery.awaitTermination()
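
One possible variation, shown here only as a sketch: the same writer options can be collected in a dictionary and passed in one call through `options(**...)`, which keeps the configuration in a single place. The names `cosmos_write_options` and `start_cosmos_stream` are hypothetical; the option keys and values are the ones used in the cell above.

```python
# Same option keys and values as in the write cell, gathered in one place
cosmos_write_options = {
    "spark.cosmos.connection.mode": "gateway",
    "spark.synapse.linkedService": "CosmosDBIoTDemo",
    "spark.cosmos.container": "IoTSignals",
    "checkpointLocation": "/writeCheckpointDir",
}

def start_cosmos_stream(df, options):
    """Start an append-mode streaming write through the cosmos.oltp format."""
    return (df.writeStream
              .format("cosmos.oltp")
              .outputMode("append")
              .options(**options)
              .start())
```

In the notebook this would be called as `streamQuery = start_cosmos_stream(dfIoTSignals, cosmos_write_options)`, followed by `streamQuery.awaitTermination()` as before.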