Process S3 logs from CloudFront

Convert CloudFront logs from their custom CSV format to JSON, merging the separate date and time fields into a single date-time value.
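The conversion can be illustrated with a minimal sketch. It is not the code in process.py and may differ from it. It assumes the standard CloudFront log layout (a "#Version" line, a "#Fields:" header naming the columns, then tab-separated rows with UTC times). The function name convert_log and the output key "timestamp" are hypothetical names chosen for the example.

```python
import json
from datetime import datetime, timezone


def convert_log(text):
    """Convert CloudFront standard log text into a list of JSON strings."""
    fields = []
    records = []
    for line in text.splitlines():
        if line.startswith("#Fields:"):
            # The header names the columns, separated by spaces.
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#"):
            continue  # skip "#Version" and blank lines
        row = dict(zip(fields, line.split("\t")))
        # Merge the separate date and time columns into one UTC timestamp.
        stamp = datetime.strptime(
            f"{row.pop('date')} {row.pop('time')}", "%Y-%m-%d %H:%M:%S"
        ).replace(tzinfo=timezone.utc)
        # "timestamp" is a hypothetical key name for the merged field.
        row["timestamp"] = stamp.isoformat()
        records.append(json.dumps(row))
    return records
```

Each returned string is one JSON object per log row, which is the line-oriented shape that Athena's JSON readers generally expect.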

Necessary environment

Environment: Python 3.8
Event source: S3 bucket, event "ObjectCreated"
Environment variables:
- S3_DEST_BUCKET to define the destination S3 bucket
Permissions:
- Read permissions for the S3 bucket with logs to read the content of the source key.
- Write permissions for the S3_DEST_BUCKET to write the modified file.

Please see detailed IAM policies in the CloudFormation template.
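As a rough sketch of how these pieces fit together, the function receives an S3 "ObjectCreated" event naming the source bucket and key, and reads the destination bucket from S3_DEST_BUCKET. The helper names source_objects and destination_bucket are hypothetical and are not taken from process.py; only the event layout and the environment variable name come from S3 and this README.

```python
import os
from urllib.parse import unquote_plus


def source_objects(event):
    """Yield (bucket, key) for every record in an S3 notification event."""
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 event notifications.
        yield s3["bucket"]["name"], unquote_plus(s3["object"]["key"])


def destination_bucket():
    """Read the destination bucket name from the environment."""
    return os.environ["S3_DEST_BUCKET"]
```

The read permission applies to each (bucket, key) pair yielded by source_objects, and the write permission applies to the bucket returned by destination_bucket.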

Testing locally

Set up the Poetry environment and install the dependencies:

poetry install

Run tests:

poetry run pytest

Installing to make it work

Create a new S3 bucket to store the results of the processing.
Deploy a new Lambda function by running make deploy (staging, the default environment) or make ENV=production deploy (production). The command creates the Lambda function and all required resources. Please see the CloudFormation template for details.
Test that the function works as expected. For valid input, it must create a new object in the destination S3 bucket.

Deploying a new version

To deploy a new version of the function, run make deploy again, or make ENV=production deploy for the production environment.

Making it usable for Athena

To make it usable for AWS Athena, once some JSON files are created, create a new AWS Glue Crawler to process the content. Then set up a so-called partition projection to make Athena recognize new partitions without manually adding them daily.

Go to "Edit table details" in AWS Glue catalog and add the following new properties:

projection.date.type: date
projection.date.format: yyyy-MM-dd
projection.date.range: 2020-07-01,NOW (replace "2020-07-01" with the date of your first record)
projection.enabled: true
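If you prefer to set these properties with a query instead of the Glue console, the sketch below builds an Athena ALTER TABLE ... SET TBLPROPERTIES statement from the same four values. The helper name projection_ddl and the table name cloudfront_logs are hypothetical; use the table that your crawler created, and treat this as an untested alternative to the console steps above.

```python
def projection_ddl(table, first_date):
    """Build an Athena statement that sets the partition projection properties."""
    props = {
        "projection.enabled": "true",
        "projection.date.type": "date",
        "projection.date.format": "yyyy-MM-dd",
        # NOW keeps the range open so new daily partitions are picked up.
        "projection.date.range": f"{first_date},NOW",
    }
    pairs = ",\n  ".join(f"'{key}' = '{value}'" for key, value in props.items())
    return f"ALTER TABLE {table} SET TBLPROPERTIES (\n  {pairs}\n)"


# "cloudfront_logs" is a hypothetical table name.
print(projection_ddl("cloudfront_logs", "2020-07-01"))
```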

Invoking the function manually to process existing logfiles

There is a script that you can run locally to process existing log files. The script expects you to provide the bucket name with the CloudFront logs and the name of the lambda function to execute.

Once invoked, the script lists the keys in the S3 bucket and invokes the Lambda function asynchronously for each batch of keys. The default batch size is 20, which means that each invocation processes up to twenty keys.

The script has two modes: "print" and "index". The "print" mode is a dry run: it only reads the contents of the S3 bucket and dumps the keys to stdout. The "index" mode performs the actual indexing.
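The batching and the two modes can be sketched as follows. This is an illustration and not the contents of scripts/process_all.py: the function names and the payload shape are hypothetical, and the client argument is assumed to be a boto3 Lambda client (boto3.client("lambda")), whose invoke call accepts InvocationType="Event" for asynchronous execution.

```python
import json


def batches(keys, size=20):
    """Split a list of keys into consecutive batches of at most `size`."""
    for start in range(0, len(keys), size):
        yield keys[start:start + size]


def process(keys, action, bucket, function_name, client=None, size=20):
    """Print the keys (dry run) or invoke the function once per batch."""
    if action == "print":
        for key in keys:
            print(key)
        return 0
    invoked = 0
    for batch in batches(keys, size):
        client.invoke(
            FunctionName=function_name,
            InvocationType="Event",  # asynchronous: do not wait for the result
            # Hypothetical payload shape; the real script may differ.
            Payload=json.dumps({"bucket": bucket, "keys": batch}).encode(),
        )
        invoked += 1
    return invoked
```

With 45 keys and the default batch size, this sketch would issue three invocations covering 20, 20, and 5 keys.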

Example:

Print to stdout all S3 keys with CloudFront logs for 16 July 2020.

poetry run ./scripts/process_all.py \
    --bucket=cloudfront-logs \
    --action=print \
    --prefix=XXXX.2020-07-16 \
    --function-name=process_cloudfront_logs

Index the contents of 16 July 2020 asynchronously.

poetry run ./scripts/process_all.py \
    --bucket=cloudfront-logs \
    --action=index \
    --prefix=XXXX.2020-07-16 \
    --function-name=process_cloudfront_logs
