A metal-to-alerts example of how to build an IoT-enabled monitoring solution using only AWS PaaS offerings.
We have been building this workflow for 5+ years now, much longer if you count fully custom tools. So have countless others. Yet, if you ask a new engineer to build this, you can wave goodbye to a few weeks of their effort as they navigate obtuse documentation and outdated StackOverflow answers. This is our attempt to document what's possible with PaaS tools in 2020.
To achieve this, we will:
- Write a Python script that monitors system metrics (CPU, Memory, Temperature, Fan)
  - You could replace this with actual hardware, perhaps run this script on a Raspberry Pi or even send FreeRTOS metrics from an ESP32, but we will save that for another day.
- Create multiple `things` on AWS IoT Core.
- Send these metrics as shadow updates to AWS IoT every 10s (configurable)
- Configure AWS IoT to route shadow updates to a database
- Set up a visualisation tool and create dashboards using these updates
- Add a few simple alerts on our visualiation tool to send notifications if system metrics cross a threshold
- `master` = final code + AWS IoT configuration + Grafana dashboard JSON
- `1_python_script` = Python script without AWS IoT integration (print to console)
- `2_aws_iot` = Python script with AWS IoT integration (shadow updates)
We are going to adapt this excellent blog post to create our system monitor script.
python3 -m venv venv
source venv/bin/activate
echo psutil==5.8.0 > requirements.txt
pip install -r requirements.txt
touch sysmon.py
- Edit `sysmon.py` in your preferred text editor and add in the following functions from the blog post:
  - `get_cpu_usage_pct`
  - `get_cpu_frequency`
  - `get_cpu_temp`
  - `get_ram_usage`
  - `get_ram_total`
- Next, create a `main` function that calls each of these functions, populates a dictionary `payload` and prints it.
  - We will also add a `timestamp` to the payload for use in visualisations later.
- Add a `while(1)` loop that calls this `main` function every 10 seconds.
- Add an argument parser so we can pass the `interval` and a `device_id` as command line arguments.
- Run the script with `python sysmon.py 10 my_iot_device_1` (a condensed sketch of the full script follows below).
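For reference, here is a rough sketch of how `sysmon.py` could come together. The helper bodies are simplified stand-ins for the blog post's versions (the temperature lookup in particular varies by platform), so treat this as illustrative rather than canonical:

```python
# sysmon.py - illustrative sketch; metric helpers are simplified stand-ins
import argparse
import json
import time

import psutil


def get_cpu_usage_pct():
    # CPU usage as a percentage over a short sampling interval
    return psutil.cpu_percent(interval=0.5)


def get_cpu_frequency():
    # Current CPU frequency in MHz (psutil.cpu_freq() can be None on some platforms)
    freq = psutil.cpu_freq()
    return int(freq.current) if freq else 0


def get_cpu_temp():
    # CPU temperature in degrees C; sensor support and naming vary by platform
    if not hasattr(psutil, "sensors_temperatures"):
        return 0.0
    for entries in psutil.sensors_temperatures().values():
        if entries:
            return entries[0].current
    return 0.0


def get_ram_usage():
    # RAM currently in use, in bytes
    return psutil.virtual_memory().used


def get_ram_total():
    # Total RAM, in bytes
    return psutil.virtual_memory().total


def main():
    payload = {
        "timestamp": int(time.time()),
        "cpu_usage": get_cpu_usage_pct(),
        "cpu_freq": get_cpu_frequency(),
        "cpu_temp": get_cpu_temp(),
        "ram_usage": get_ram_usage(),
        "ram_total": get_ram_total(),
    }
    print(json.dumps(payload))
    return payload


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Simple system metrics monitor")
    parser.add_argument("interval", type=int, help="Seconds between readings")
    # device_id is unused for now; it will identify the thing once we wire up AWS IoT
    parser.add_argument("device_id", help="Identifier for this device")
    args = parser.parse_args()
    while True:
        main()
        time.sleep(args.interval)
```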
You should see output similar to:
Now, we will add in the ability to send our metrics to AWS IoT. But first, we need to register our devices, or `things` as AWS calls them.
We will register the devices individually via the AWS Console. However, if you have a large number of devices to register, you may want to script it or use Bulk Registration via `aws-cli` or the AWS IoT Core Console.
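If you do go the scripted route, a boto3 sketch along these lines should work. It assumes the demo `PubSubToAny` policy shown a little further down has already been created in AWS IoT, and the certificate file names are just a convention:

```python
# register_things.py - illustrative boto3 alternative to the console steps below
import os

import boto3

iot = boto3.client("iot", region_name="us-east-1")
os.makedirs("certs", exist_ok=True)

for thing_name in ["my_iot_device_1", "my_iot_device_2", "my_iot_device_3"]:
    iot.create_thing(thingName=thing_name)

    # Generate and activate a certificate + key pair for this thing
    cert = iot.create_keys_and_certificate(setAsActive=True)

    # Attach the (overly permissive, demo-only) policy and link the cert to the thing
    iot.attach_policy(policyName="PubSubToAny", target=cert["certificateArn"])
    iot.attach_thing_principal(thingName=thing_name, principal=cert["certificateArn"])

    # Save the credentials locally, named after the thing for easier scripting later.
    # You still need to download the Amazon root CA certificate separately.
    with open(f"certs/{thing_name}-certificate.pem.crt", "w") as f:
        f.write(cert["certificatePem"])
    with open(f"certs/{thing_name}-private.pem.key", "w") as f:
        f.write(cert["keyPair"]["PrivateKey"])
```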
We are using `us-east-1`, aka N. Virginia, for later integration with Amazon Timestream, which is not yet available in all regions.
- Click on `Create a single thing`
- Give your thing a name, e.g. `my_iot_device_1`
- You can skip `Thing Type` and `Group` for this demo.
- Create the thing
- Use the `One-click certificate creation (recommended)` option to generate the certificates.
- Download the generated certificates and the root CA certificate.
- Activate the certificates.
- Attach a policy and register the `Thing`.
  - Because we are cavalier and this is a demo, we are using the following `PubSubToAny` policy.
  - DO NOT use this in production!
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "iot:*",
"Resource": "*"
}
]
}
Now repeat this a couple more times so we have a few things. I am setting up 3 devices with the imaginative names: `my_iot_device_1`, `my_iot_device_2`, `my_iot_device_3`.
Finally, we will rename our certificates to match our thing names so that they're easier to script against later. For instance, I am using the `rename` utility to bulk rename my certificates:
Because AWS IoT supports MQTT, we could use any MQTT client that supports X.509 certificates. However, to keep things simple, we will use the official Python SDK from AWS IoT. Specifically, we will adapt the `basicShadowUpdater.py` sample.
- Please inspect `aws_shadow_updater.py` for the changes we are making. Primarily, we are wrapping the functionality into 2 functions (a condensed sketch follows after this list):
  - `init_device_shadow_handler` that takes AWS IoT specific config parameters and returns a `deviceShadowHandler` specific to our configuration and thing.
  - `update_device_shadow` that takes our system metrics payload and wraps it into a `json` structure that AWS IoT expects for `device shadows`.
- We will also take this opportunity to modularise our code a bit by moving the `main` function from `sysmon.py` into its own separate file.
- Within `main.py` we are reading our AWS configuration from a combination of environment variables and the local certificates.
  - We only need the following: `export AWS_IOT_HOST=YOUR_AWS_IOT_ENDPOINT.amazonaws.com` and `export CERTS_DIR=certs`, assuming you are keeping your certificates in `certs/`.
  - You will probably want to create a script or `.env` file to set these environment variables.
  - For good measure, we are also verifying that the certificates actually exist.
- With this done, we stitch our two modules `sysmon.py` and `aws_shadow_updater.py` together and start publishing updates. If all goes well, you should see the following in your terminal and your AWS Console (go to Thing -> Shadows -> Classic Shadow).
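As promised, here is a condensed sketch of what the two wrappers could look like, adapted loosely from the SDK sample. The certificate file names assume you renamed them to match the thing names earlier, and the callback is deliberately minimal; tune timeouts and reconnect settings to taste:

```python
# aws_shadow_updater.py - condensed, illustrative adaptation of basicShadowUpdater.py
import json
import os

from AWSIoTPythonSDK.MQTTLib import AWSIoTMQTTShadowClient


def init_device_shadow_handler(host, certs_dir, thing_name):
    # Build an MQTT-over-TLS shadow client from the endpoint and per-thing certificates
    shadow_client = AWSIoTMQTTShadowClient(thing_name)
    shadow_client.configureEndpoint(host, 8883)
    shadow_client.configureCredentials(
        os.path.join(certs_dir, "AmazonRootCA1.pem"),
        os.path.join(certs_dir, f"{thing_name}-private.pem.key"),
        os.path.join(certs_dir, f"{thing_name}-certificate.pem.crt"),
    )
    shadow_client.configureConnectDisconnectTimeout(10)
    shadow_client.configureMQTTOperationTimeout(5)
    shadow_client.connect()
    # Persistent subscription to this thing's shadow topics
    return shadow_client.createShadowHandlerWithName(thing_name, True)


def _on_shadow_update(payload, response_status, token):
    # Minimal callback: just log whether AWS IoT accepted the update
    print(f"Shadow update {response_status}")


def update_device_shadow(device_shadow_handler, payload):
    # Device shadows expect the metrics under state.reported
    shadow_doc = json.dumps({"state": {"reported": payload}})
    device_shadow_handler.shadowUpdate(shadow_doc, _on_shadow_update, 5)
```

`main.py` then simply calls `init_device_shadow_handler` once at start-up using `AWS_IOT_HOST` and `CERTS_DIR`, and `update_device_shadow` on every interval tick.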
We are done with almost all of the coding needed to get this working.
This is an easy one: open up multiple terminals/tabs and start a separate process for updating the shadow for each device. Something like this:
In order to visualise, and perhaps analyse, these metrics, we need to persist them in some form of database. Thankfully, AWS IoT has a Rules Engine designed for just this purpose. The Rules Engine is essentially a message router with the ability to filter messages using an SQL syntax and send them to various destinations.
Go to `AWS IoT Core -> Act -> Rules` to get started.
There are 2 steps to enabling rules:
- Filter: Select the messages we want to act on.
- Act: Select the action(s) we want to run for each filtered message.
AWS IoT uses a reduced SQL syntax for filtering messages. Points to note:
- The shadow topic we are interested in is `$aws/things/thingName/shadow/update`, where we need to replace `thingName` with the wildcard `+`. Follow this reference on topics and wildcards.
- The content of each message contains the entire `state` with `desired` and `reported` properties as well as other metadata. We will need to unpack the `reported` property to get the data we need. Follow this reference for more details.
Our SQL filter will look essentially like this:
SELECT
state.reported.cpu_usage as cpu_usage,
state.reported.cpu_freq as cpu_freq,
state.reported.cpu_temp as cpu_temp,
state.reported.ram_usage as ram_usage,
state.reported.ram_total as ram_total
FROM '$aws/things/+/shadow/update'
Before we can save this rule, we will also need to add an `action`. Actions define what to do with the filtered messages. This depends on our choice of database.
AWS IoT supports a large range of actions out of the box including CloudWatch, DynamoDB, ElasticSearch, Timestream DB and custom HTTP endpoints. See the full list here.
To confirm that our messages are coming through and we are able to store them, we will use the shiny, new time series database from AWS - Timestream. We will also enable the `CloudWatch` action in case of errors.
As of this writing, Timestream is only available in 4 regions.
It's essential to create the DB in the same region as your AWS IoT endpoint, as the Rules Engine does not yet support multiple regions for the built-in actions. You could use a Lambda function to do this for you, but that's more management and cost.
We will create a `Standard` (empty) DB with the name `aws_iot_demo`:
We will also need a `table` to store our data, so let's do that too:
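If you would rather script these two steps than click through the console, a boto3 sketch along these lines should do it (the retention values are arbitrary demo defaults):

```python
# create_timestream.py - illustrative alternative to the console steps above
import boto3

ts = boto3.client("timestream-write", region_name="us-east-1")

ts.create_database(DatabaseName="aws_iot_demo")
ts.create_table(
    DatabaseName="aws_iot_demo",
    TableName="aws_iot_demo",
    # Keep recent data in memory for a day, then in magnetic storage for a year
    RetentionProperties={
        "MemoryStoreRetentionPeriodInHours": 24,
        "MagneticStoreRetentionPeriodInDays": 365,
    },
)
```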
Once this is done, we can return to the rule we started creating earlier and add a new Action.
Notes:
- The AWS IoT Rule Action for Timestream needs at least one `dimension` to be specified. Dimensions can be used for grouping and filtering incoming data.
- I used the following `key`:`value` pair using a substitution template - `device_id`:`${clientId()}`
- We are sending the device timestamp as part of the shadow update. If we include it as part of the `SELECT` query in the rule, Timestream will assume that `timestamp` is a measurement metric too.
  - Instead, we will ignore the device `timestamp` and use `${timestamp()}` as the time parameter within the Rule Action. This generates a server timestamp.
- You will also need to create or select an appropriate IAM role that lets AWS IoT write to Timestream.
- Timestream creates a separate row for each metric, so each shadow update creates 5 rows.
This action is triggered if/when there is an error while processing our rule. Again, follow the guided wizard to create a new `Log Group` and assign permissions.
At the end, your rule should look something like this:
Assuming we have started our simulators again, we should start to see data being stored in Timestream. Go over to AWS Console -> Timestream -> Tables -> `aws_iot_demo` -> Query Table. Type in the following query:
-- Get the 20 most recently added data points in the past 15 minutes. You can change the time period if you're not continuously ingesting data
SELECT * FROM "aws_iot_demo"."aws_iot_demo" WHERE time between ago(15m) and now() ORDER BY time DESC LIMIT 20
You should see output similar to the one below:
You will notice the separate rows for each metric. We will need a different query in order to combine the metrics into a single view, for instance for use with visualisation or analytics tools.
SELECT device_id, BIN(time, 1m) AS time_bin,
AVG(CASE WHEN measure_name = 'cpu_usage' THEN measure_value::double ELSE NULL END) AS avg_cpu_usage,
AVG(CASE WHEN measure_name = 'cpu_freq' THEN measure_value::bigint ELSE NULL END) AS avg_cpu_freq,
AVG(CASE WHEN measure_name = 'cpu_temp' THEN measure_value::double ELSE NULL END) AS avg_cpu_temp,
AVG(CASE WHEN measure_name = 'ram_usage' THEN measure_value::bigint ELSE NULL END) AS avg_ram_usage,
AVG(CASE WHEN measure_name = 'ram_total' THEN measure_value::bigint ELSE NULL END) AS avg_ram_total
FROM "aws_iot_demo"."aws_iot_demo"
WHERE time between ago(15m) and now()
GROUP BY BIN(time, 1m), device_id
ORDER BY time_bin desc
Your output should look something like this -
If you do see similar output, you are in business and we can continue to visualisation. If you don't:
- Check the CloudWatch Logs for errors
- Verify that your SQL syntax is correct - especially the topic
- Ensure your Rule action has the right table and an appropriate IAM Role
- Verify that your Device Shadow is getting updated by going over to AWS IoT -> Things -> my_iot_device_1 -> Shadow
- Look for errors, if any, on the terminal where you are running the script.
We have covered a lot of ground. So, let's pause and reflect. Here's what we have done so far:
- Created a Python script to monitor common system metrics.
- Hooked up this script to AWS IoT using the SDK and `Thing` certificates.
- Simulated running multiple such devices, each sending `Shadow` updates.
- Created a rule to persist these device shadows to `Timestream` and errors to `CloudWatch`.
- Verified that we are actually getting our data.
Now, we only have the small matter of visualising our data and setting up alerts in case any of our metrics cross critical thresholds.
With a fresh cup of coffee, onwards...
Storage and visualisation are, in fact, two separate operations that need two different software tools. However, these are often so tightly coupled that choice of one often dictates choice of the other. Here's a handy table that illustrates this.
| Storage | Visualisation | Comments |
| --- | --- | --- |
| Timestream | AWS QuickSight | See demo below |
| Timestream | Grafana | See demo below |
| DynamoDB | AWS QuickSight | Needs CSV export to S3 first |
| DynamoDB | Redash | Works but with limitations, demo in future post |
| ElasticSearch | Kibana | Works well, demo in future post |
| ElasticSearch | Grafana | Simpler to just use Kibana |
| InfluxDB | InfluxDB UI | Works well, demo in future post |
| InfluxDB | Grafana | Simpler to just use the built-in UI |
There are, of course, numerous other ways to do this. We will focus on the first two in that table.
QuickSight is a managed BI tool from AWS. The official documentation to integrate Timestream with QuickSight is a little dense. However, it's pretty straightforward if you are using `us-east-1` as your region.
- Within QuickSight, click on the user icon at the top right and then on `Manage QuickSight`.
- Here, go to `Security & Permissions` -> `QuickSight access to AWS services` and enable `Timestream` (see image below).
- Next, within QuickSight, click on `New Dataset` and select `Timestream`. Click on `Validate Connection` to ensure you have given the permissions and confirm.
- Upon confirmation, select `aws_iot_demo` from the discovered databases and select `aws_iot_demo` from the tables.
- Click on `Visualise`.
So far so good. I had to struggle for a while to understand how to get QuickSight to unpack the metrics from Timestream. Turns out, this is surprisingly easy if you follow this tutorial video from AWS. Essentially,
- Create multiple visualisations.
- For each visualisation, add a filter on `measure_name`.
- Click on `time` to add it to the X-axis. Change the period to `Aggregate: Minute`.
- Click on `measure_value::bigint` or `measure_value::double`, depending on the metric, to add it to the Y-axis. Change to `Aggregate: Average`.
  - In our case, only `cpu_usage` is a `double`.
- Click on `device_id` to add separate lines for each device. This is added to `Color` in QuickSight.
That's it! My dashboard looks like this -
QuickSight is a full-fledged business intelligence (BI) tool with the ability to integrate with multiple data sources. QuickSight also has built-in anomaly detection. This makes it an incredibly powerful tool to use for IoT visualisations and analysis. We could even bring in non-IoT data such as that from an ERP. More on this in a later post!
However, QuickSight:
- Does not support alerts
- Does not support calculations
- Has limited visualisations
- Does not support dashboard embeds, e.g. in a webpage
- Charges per user
Mind you, the AWS IoT rules engine can be used quite easily for alerts, so you don't really need alerts in a separate tool. Having said that...
Grafana has long been a favourite of anyone looking to create beautiful, lightweight dashboards. Grafana integrates with a zillion data sources via input plugins and has numerous visualisations via output plugins. Grafana only does visualisations and alerts but does it really well. There are open source and enterprise editions available.
AWS has an upcoming managed Grafana service. Until then, we will use the managed service from Grafana Cloud. You could also spin up Grafana locally or on a VM somewhere with the docker image.
There's a video tutorial available but you will need to adapt a fair bit to our example. Assuming you have either signed up for Grafana Cloud or installed it locally, you should now:
- Install the Amazon Timestream plugin
- Back in Grafana, add a new `Data Source` and search for `Timestream`.
- For authentication, we will use Access Key and Secret for a new IAM User.
  - Back in AWS, create a new user with ~~the `AmazonTimestreamReadOnlyAccess` policy attached~~ admin rights. For some reason, Grafana would not connect to Timestream even with the `AmazonTimestreamFullAccess` policy attached.
- Once the keys are in place, click on `Save & Test`.
- Select `aws_iot_demo` in the `$__database` field to set up the default DB. Try as I might, I could not get the dropdown for `$__table` to populate.
Now, click on `+ -> Dashboard` and `+ Add new panel` to get started.
Unlike QuickSight, Grafana allows you to build queries using SQL. So, for our first panel, let's create a CPU Usage chart with the following query:
SELECT device_id,
CREATE_TIME_SERIES(time, measure_value::double) as avg_cpu_usage
FROM $__database.$__table
WHERE $__timeFilter
AND measure_name = '$__measure'
GROUP BY device_id
We are essentially doing the same as in QuickSight by defining a `where` clause to filter by metric and creating a time series that is grouped by `device_id`. The one big difference is that Grafana allows you to add multiple such queries to a single visualisation chart (panel in Grafana speak). Duplicating this panel and making the necessary mods, we end up with a dashboard very similar to the one in QuickSight.
Notice that we get dashboard-wide time controls for free!
Creating alerts with Grafana is surprisingly easy. Alerts use the same query as the panel and are created in the same UI.
By default, the alert is triggered on the average of the metric, but you can change it to a different calculation.
If you have multiple queries on a panel, you can even use a combination of queries!
Alerts can trigger notifications to various `channels`, with built-in options for all the major chat apps, email and webhooks. For instance, if you wanted to trigger a notification within a mobile app, you would set up an API somewhere that will be triggered by a webhook configuration within Grafana. Your API is then responsible for notifying the mobile app.
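As an illustration, a bare-bones version of such a webhook receiver might look like the sketch below. The endpoint path, port and `send_push_notification` helper are hypothetical placeholders, and the payload fields assume Grafana's (legacy) alert notification format:

```python
# alert_webhook.py - hypothetical receiver for Grafana alert notifications
from flask import Flask, request

app = Flask(__name__)


@app.route("/grafana-alert", methods=["POST"])
def grafana_alert():
    alert = request.get_json(force=True)
    # Legacy Grafana alerts include the rule name and state in the notification body
    title = alert.get("ruleName", "Grafana alert")
    state = alert.get("state", "unknown")
    # Hand off to whatever actually notifies the mobile app (push service, SNS, FCM, ...)
    send_push_notification(f"{title}: {state}")
    return "", 204


def send_push_notification(message):
    # Placeholder - integrate with your push provider of choice
    print(f"Would notify mobile app: {message}")


if __name__ == "__main__":
    app.run(port=8080)
```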
For more on Grafana alerts, check out the docs.
Once you get around the verbose documentation, and refine your search skills, it's quite straightforward to create an end-to-end flow for most IoT use cases using purely platform-as-a-service offerings.
We have been running a deployment on AWS for a customer with ~46000 devices for 2+ years now, handling 15-20M messages monthly. All this for a fraction of the cost and attention this would need if we ran the infrastructure ourselves.
That said, I do have a few reservations and will explore those in future posts.
Good luck and reach out to us on hello@iotready.co if you have any questions!
- Add LICENSE
- Add screenshots
- Add motivation