Skip to content
Branch: master
Find file History
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
..
Failed to load latest commit information.
arm set max size to 100GB May 22, 2019
01-create-event-hub.sh make sure auto-inflate is disabled May 24, 2019
02-create-azure-sql.sh improved check for server and db existance May 23, 2019
03-create-stream-analytics.sh added columnstore support May 20, 2019
04-run-clients.sh
README.md
create-solution.sh updated RU Jun 7, 2019
streaming.bacpac added columnstore support May 20, 2019

README.md

Streaming at Scale with Azure Event Hubs, Stream Analytics and Azure SQL

This sample uses Stream Analytics to process streaming data from EventHub and uses Azure SQL as a sink to store processed data. This is especially useful when you need to create a Near-Real Time Operational Analytics, where streaming data has to be ingested at scale and, at the same time, also queried to execute analytical queries. The ability to ingest data into a columnstore is vital to have expected query performances:

Get started with Columnstore for real time operational analytics

The provided scripts will create an end-to-end solution complete with load test client.

Purpose of this solution

Provide a complete, end-to-end, balanced baseline and starting point for creating a stream processing solution. Everything is done in the simplest and easiest way possible so that you can use it to build up your custom solution.

For example if you know you have to process 5000 messages per second, you can start using the provided configuration option that will make sure you can reach that performances. Once you are confident you can start to add your own business logic that will likely increase the resource usage, depending on how much complex the business logic is, but at least you have a consistent and solid starting point that helps to significantly reduce development and testing times.

Running the Scripts

Please note that the scripts have been tested on Ubuntu 18 LTS, so make sure to use that environment to run the scripts. You can run it using Docker, WSL or a VM:

The following tools/languages are also needed:

  • Azure CLI
    • Install: sudo apt install azure-cli
  • jq
    • Install: sudo apt install jq

Azure Cloud Shell is also supported and already has all the required dependencies:

Azure Cloud Shell

Just do a git clone of the repo and you'll be good to go.

Setup Solution

Make sure you are logged into your Azure account (you are automatically logged it if you're using Azure Cloud Shell):

az login

and also make sure you have the subscription you want to use selected

az account list

if you want to select a specific subscription use the following command

az account set --subscription <subscription_name>

once you have selected the subscription you want to use just execute the following command

./create-solution.sh -d <solution_name>

then solution_name value will be used to create a resource group that will contain all resources created by the script. It will also be used as a prefix for all resource create so, in order to help to avoid name duplicates that will break the script, you may want to generated a name using a unique prefix. Please also use only lowercase letters and numbers only, since the solution_name is also used to create a storage account, which has several constraints on characters usage:

Storage Naming Conventions and Limits

to have an overview of all the supported arguments just run

./create-solution.sh

Note

To make sure that name collisions will be unlikely, you should use a random string to give name to your solution. The following script will generated a 7 random lowercase letter name for you:

../_common/generate-solution-name.sh

Created resources

The script will create the following resources:

  • Azure Container Instances to host Locust Load Test Clients: by default four Locust nodes (master + 3 slaves) will be created, generating a load of 1000 events/second
  • Event Hubs Namespace, Hub and Consumer Group: to ingest data incoming from test clients
  • Stream Analytics to process analytics on streaming data
  • Azure SQL Server and Database: to store and serve processed data

Solution customization

If you want to change some setting of the solution, like number of load test clients, Azure SQL SKU and so on, you can do it right in the create-solution.sh script, by changing any of these values:

export EVENTHUB_PARTITIONS=2
export EVENTHUB_CAPACITY=2
export PROC_JOB_NAME=streamingjob
export PROC_STREAMING_UNITS=3 # must be 1, 3, 6 or a multiple or 6
export SQL_SKU=S3
export SQL_TABLE_KIND="rowstore" # or "columnstore"
export TEST_CLIENTS=3

The above settings has been chosen to sustain a 1000 msg/sec stream.

Scaling the solution

In the create-solution.sh script values to test

  • 1000 msgs/sec
  • 5500 msgs/sec
  • 10000 msgs/sec

are already set, just uncomment what you what to test, and then run the script.

Streamed Data

Streamed data simulates an IoT device sending the following JSON data:

{
    "eventId": "b81d241f-5187-40b0-ab2a-940faf9757c0",
    "complexData": {
        "moreData8": 51.3068118685458,
        "moreData9": 44.44672606436184,
        "moreData0": 57.739726013343247,
        "moreData1": 52.230732688620829,
        "moreData2": 57.497518587807189,
        "moreData3": 81.32211656749469,
        "moreData4": 54.412361539409427,
        "moreData5": 75.36416309399911,
        "moreData6": 71.53407865773488,
        "moreData7": 45.34076957651598,
        [...]
    },
    "value": 49.02278128887753,
    "deviceId": "contoso://device-id-1554",
    "type": "CO2",
    "createdAt": "2019-05-16T17:16:40.000003Z"
}

Monitor performances

Please use Metrics pane in Stream Analytics for "Input/Output Events", "Watermark Delay" metrics. This last is especially important to understand if the processing engine is keeping up with the incoming data or is falling behind:

New metric in Azure Stream Analytics tracks latency of your streaming pipeline

You can also use Event Hub "Metrics" pane:

Understand Stream Analytics job monitoring and how to monitor queries

Stream Analytics

At present time Azure Stream Analytics cannot send record data types to Azure SQL, as documented here: Type mapping when writing to structured data stores

Azure SQL

The solution allows you to test both row-store and column-store options. The deployed database has two tables

  • rawdata
  • rawdata_cs

The rawdata_cs table is then one using a clustered column-store index. Both tables also have a non-clustered primary key on the eventId column. Use the -k option and set it to rowstore or columnstore to run the solution against the table you are interested in testing.

Be aware that database log backup happens every 10 minutes circa, as described here: Automated backups. This means that additional IO overhead needs to be taken into account, which is proportional to the amount of ingested rows. That's why to move from 5000 msgs/sec to 10000 msgs/sec a bump from P4 to P6 is needed. The Premium level provides much more I/Os which are needed to allow backup to happen without impacting performances.

Additional References

Clean up

To remove all the created resource, you can just delete the related resource group

az group delete -n <resource-group-name>
You can’t perform that action at this time.