Guardian imports Amazon S3 access logs into a local Postgres database. Useful for reporting on usage and keeping an eye on transfer costs.
Additional information on how Guardian fits into CloudApp's S3 analysis can be found here.
Sign in to S3 and enable logging for the bucket whose access you want to track. The Target Bucket is the bucket where Amazon will deliver the access logs; it can be a completely separate bucket from the one whose access is being logged.
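If you'd rather not click through the console, the AWS CLI can flip the same switch. A minimal sketch, assuming placeholder bucket names and that the target bucket already grants Amazon's log delivery group write access:

$ aws s3api put-bucket-logging --bucket my-bucket \
    --bucket-logging-status '{"LoggingEnabled":
      {"TargetBucket": "my-log-bucket", "TargetPrefix": "logs/"}}'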
Clone Guardian and deploy it. Heroku makes this simple. If deploying to Heroku, note that Guardian requires at least two processes: one to run clockwork and another to process jobs (see the Procfile sketch after the deploy commands below).
$ git clone https://github.com/cloudapp/guardian
$ cd guardian
$ heroku create
$ git push heroku master
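Heroku reads those process definitions from a Procfile at the repo root. A sketch of what one might look like, assuming clockwork drives scheduling and a Rake task works the job queue; the repo's actual Procfile is authoritative:

# Sketch only; clock.rb and jobs:work are placeholders
clock: bundle exec clockwork clock.rb
worker: bundle exec rake jobs:work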
The development database Heroku provides by default allows up to 10,000 rows. Upgrade to Basic for 10 million rows or Crane for 1TB of storage.
$ heroku addons:add heroku-postgresql:dev
Adding heroku-postgresql:dev to sushi... done, v69 (free)
Attached as HEROKU_POSTGRESQL_RED
$ heroku pg:promote HEROKU_POSTGRESQL_RED_URL
Promoting HEROKU_POSTGRESQL_RED_URL to DATABASE_URL... done
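The dev plan above is enough to kick the tires. Moving to a paid plan works the same way; a sketch, with <COLOR> standing in for whatever name Heroku assigns to the new database:

$ heroku addons:add heroku-postgresql:basic
$ heroku pg:promote HEROKU_POSTGRESQL_<COLOR>_URL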
Guardian depends on 3 environment variables in order to read access logs: AWS_BUCKET_NAME, AWS_ACCESS_KEY_ID, and AWS_SECRET_ACCESS_KEY. The bucket name is the bucket configured as the Target Bucket in the first step. The Access Key ID and Secret Access Key can be found on the AWS Access Credentials page.
$ heroku config:add AWS_BUCKET_NAME=my-bucket \
AWS_ACCESS_KEY_ID=ABC123 \
AWS_SECRET_ACCESS_KEY=DEF456
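When running Guardian outside Heroku, export the same three variables in your shell before starting it (the values here are placeholders, as above):

$ export AWS_BUCKET_NAME=my-bucket
$ export AWS_ACCESS_KEY_ID=ABC123
$ export AWS_SECRET_ACCESS_KEY=DEF456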
If you're using Heroku, kickstart Guardian using script/rebuild, passing it the Heroku app's name. This will scale the clock and worker processes to 0, rebuild the database, and scale clock to 1 and worker to 15 in order to churn through the backlog.
$ script/rebuild my-app
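If you'd rather drive those steps by hand, they map onto standard Heroku commands. Roughly, assuming the database rebuild amounts to a pg:reset followed by running migrations (the script itself is authoritative):

$ heroku ps:scale clock=0 worker=0 --app my-app
$ heroku pg:reset DATABASE_URL --app my-app --confirm my-app
$ heroku run rake db:migrate --app my-app
$ heroku ps:scale clock=1 worker=15 --app my-app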
Heroku Dataclips make it easy to generate a bucket activity report. Here's a query that shows the top 50 most trafficked files from the past day.
WITH most_trafficked AS (
  SELECT coalesce(sum(bytes_sent), 0) AS transfer, key
  FROM requests
  WHERE key IS NOT NULL
    AND time > current_timestamp - interval '1 day'
  GROUP BY key
  ORDER BY transfer DESC)
SELECT pg_size_pretty(transfer), key
FROM most_trafficked
LIMIT 50;
Note: Don't assume the access logs Amazon provides will be accurate up to the minute. Because of the way Amazon delivers access logs, the previous two hours may not be fully represented.
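To keep a report stable despite that lag, a query can simply exclude the most recent two hours. A sketch against the same requests table, summing transfer per day over the past week:

SELECT date_trunc('day', time) AS day,
       pg_size_pretty(coalesce(sum(bytes_sent), 0)) AS transfer
FROM requests
WHERE time < current_timestamp - interval '2 hours'
  AND time > current_timestamp - interval '7 days'
GROUP BY day
ORDER BY day;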