AWS Demo: Air Quality Analysis

This was a freestyle project where I could choose any combination of AWS services to solve a problem.

The Situation

You work for a US-based, non-profit organization that promotes better environmental conditions in communities most in need.

Currently, your organization is preparing an awareness campaign about ozone (O3) as a pollutant.

You have been asked to identify the top regions in the US with the highest average Ozone levels for use in a regionally-targeted social media awareness campaign.

Requirements

1. Identify an open-source data repository for air quality readings.

2. Set up a way to query that data via AWS services and tools.

3. Determine which city had the highest average ozone reading on October 9, 2018.

My Solution

Included in this repository is a Terraform script that automatically sets up the solution resources in AWS. If you don't already use it, I highly recommend Terraform as an Infrastructure as Code tool for automating the setup of your cloud environments. It's compatible with hundreds of providers, including Microsoft Azure, Google Cloud Platform, VMware vSphere, and Kubernetes, and can be downloaded from HashiCorp's official website.
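Applying the script follows the standard Terraform workflow. A minimal sketch, assuming the .tf files sit at the root of this repository and your AWS credentials are already configured:

$ cd awsdemo-airquality   # wherever you cloned this repository
$ terraform init          # download the AWS provider plugin
$ terraform plan          # preview the S3 buckets, Athena workgroup, and saved queries to be created
$ terraform apply         # create the resources in your AWS account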

Requirement 1.

After searching online, I found that the United States Environmental Protection Agency (EPA) has public air quality data available across the US at the state, county, and city levels. It can be accessed live on the website, via an API, or by downloading CSV files.

I chose to download the Daily Summary Data CSV for Ozone in 2018 for two reasons:

  1. The API has a daily limit on the number of times it can be called, which could be problematic if we need to query it many times a day. If we keep our own copy of the data, we can run as many queries as we want against it.
  2. We need daily-resolution data for 2018 to answer requirement 3.
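As a concrete sketch of how to fetch that file: the EPA's AirData site publishes pre-generated annual files as zipped CSVs. The URL and file name below are my assumption based on the EPA's usual naming scheme, where 44201 is the AQS parameter code for ozone; check the download page if they've changed:

$ curl -O https://aqs.epa.gov/aqsweb/airdata/daily_44201_2018.zip
$ unzip daily_44201_2018.zip   # extracts the daily summary CSV for ozone, 2018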

Requirement 2.

For AWS services, I chose Amazon S3 to store the CSV, and Amazon Athena to query it.

Athena has native support for running SQL queries against CSV files in S3, and since the CSV file was only 122 MB, it would be very cheap to query the data. At Athena's standard rate of $5 per TB of data scanned, even a full scan of the 122 MB file costs a fraction of a cent, so we could run 1500+ Athena queries per month for less than a dollar.

The file is also small enough to stay within the S3 Free Tier, which is why I chose S3 to store the data.

Both services are also very easy to set up, manage, and run. All we need to do is upload the data to S3, and run simple SQL queries.
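Getting the data in place is a single upload. The bucket name below is a placeholder; use whichever awsdemo-aqa-data-prefixed bucket the Terraform script actually creates in your account. Once the table has been created (via the createTable saved query described under "Try out the Solution!"), a quick sanity-check query might look like this:

$ aws s3 cp daily_44201_2018.csv s3://awsdemo-aqa-data-example/   # replace with your bucket name
$ aws athena start-query-execution \
>     --query-string "SELECT COUNT(*) FROM dailyairquality;" \
>     --work-group awsdemo-airqualityanalysis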

Alternatives

Alternatives include database services such as RDS and DynamoDB. The problem with those, however, is that we would need to configure the databases before we could run queries: creating the database instance (in the case of RDS), tables, schemas, and other service-specific settings. We would then also need to populate the databases with the air quality data.

We already have a full, structured data set in the CSV files provided by the EPA, so we shouldn't need to burden ourselves with unnecessary work. We can just download them, drop them into S3, and start running queries immediately with Athena.

Requirement 3.

I ran the query with the AWS CLI, and the answer was Kelso, California, with an average ozone reading of 0.060353 on October 9, 2018. If you want to try out the query yourself, remember to replace the workgroup with your own:

$ aws athena start-query-execution --query-string \
> "SELECT cityname, statename, arithmeticmean \
> FROM dailyairquality \
> WHERE datelocal = '2018-10-09' \
>         AND arithmeticmean = \
>     (SELECT MAX(arithmeticmean) \
>     FROM dailyairquality \
>     WHERE datelocal = '2018-10-09');" \
> --work-group awsdemo-airqualityanalysis
{
    "QueryExecutionId": "8dccc3a4-1b41-4aaf-a625-ecdc186639af"
}

$ aws athena get-query-results --query-execution-id "8dccc3a4-1b41-4aaf-a625-ecdc186639af" | jq .ResultSet.Rows
[
  {
    "Data": [
      {
        "VarCharValue": "cityname"
      },
      {
        "VarCharValue": "statename"
      },
      {
        "VarCharValue": "arithmeticmean"
      }
    ]
  },
  {
    "Data": [
      {
        "VarCharValue": "Kelso"
      },
      {
        "VarCharValue": "California"
      },
      {
        "VarCharValue": "0.060353"
      }
    ]
  }
]

Try out the Solution!

The Terraform script will set up the S3 buckets, Athena database, and saved Athena queries in AWS. Before you test out the solution, make sure of a few things:

  • Download the Daily Summary Data for Ozone for 2018
  • Upload the CSV file to the S3 bucket prefixed by awsdemo-aqa-data, which will have been created by Terraform
  • If you are running the Athena queries from the AWS CLI, make sure to include the argument --work-group awsdemo-airqualityanalysis
    • This workgroup is also created by Terraform
  • There are two saved Athena queries: createTable and testQuery (see the CLI sketch after this list)
    • Run createTable first to initialize the table
    • Run testQuery afterwards to answer requirement 3
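If you'd rather drive the whole flow from the CLI instead of the Athena console, it looks roughly like this. This is a sketch: the saved-query IDs are placeholders you'd copy from the list-named-queries output, and the SQL for each step is the QueryString returned by get-named-query:

$ aws athena list-named-queries --work-group awsdemo-airqualityanalysis
$ aws athena get-named-query --named-query-id "<createTable-query-id>" | jq -r .NamedQuery.QueryString
$ aws athena start-query-execution --query-string "<SQL from the previous step>" \
>     --work-group awsdemo-airqualityanalysis
$ # repeat the two steps above for testQuery, then fetch its results:
$ aws athena get-query-results --query-execution-id "<QueryExecutionId from start-query-execution>" | jq .ResultSet.Rows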
