A Django command that scans files at rest #2705
Conversation
Some questions that came up during cowork:

Q: Is there any additional risk to doing this scan from inside the Django app? If a potentially risky file exists, aren't we guaranteeing that it touches the app by scanning every file?

Q: How should we handle inevitable ClamAV timeouts?

Q: Once we start getting a lot more audits, roughly how much compute/memory/etc. are we going to need? If it takes 1 second to scan a file (which is likely a vast underestimate), and there are ~2.5 million seconds in a month, we would outgrow a "monthly scan" cadence once we hit 2.5 million media files in S3 (which includes ~500k historic PDF files). That gives me about ~5 months (napkin math sketched after this list).
A1: We can do that twice a year!

Q: Maybe we need to consider purging workbook files either annually or when a submission is complete.

Q: What would ingress/egress costs look like to scan every file?

Q: Could scanning files on the way out, every time, address the threat vector in question?

Q: Could we just scan any time ClamAV or its definitions are updated? Would this qualify as a "periodic" scan?
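For reference, a minimal sketch of that napkin math. The 1-second-per-file figure is the assumption from the question above, not a measurement:

```python
# Back-of-the-envelope scan capacity, assuming 1 s per file (likely a
# vast underestimate, per the question above).
SECONDS_PER_MONTH = 60 * 60 * 24 * 30   # ~2.6 million
SCAN_SECONDS_PER_FILE = 1

files_scannable_per_month = SECONDS_PER_MONTH // SCAN_SECONDS_PER_FILE
print(files_scannable_per_month)        # ceiling of ~2.6 million files/month
```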
else:
    for r in results:
        if is_stringlike(r):
            logger.error(f"SCAN FAIL: {r}")
As this stands, this is just a Django command with no associated workflow (and even if it had one, the question still applies when it is run via cf tasks): how do we get the output back to us for consumption?
Could we write failures to a file and send it off somewhere?
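One hedged sketch of that "write fails to a file and send it off" idea, reusing the boto3 client the command already has. The bucket name and key scheme here are hypothetical, not anything in this PR:

```python
import json
from datetime import datetime, timezone

def ship_scan_failures(s3, failures, bucket="example-scan-results"):
    """Write the list of scan failures to a timestamped JSON object in S3.

    `bucket` and the key layout are illustrative assumptions only.
    """
    key = f"scan-failures/{datetime.now(timezone.utc).isoformat()}.json"
    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(failures).encode("utf-8"))
```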
I thought that the logger calls, since this runs as a Django command, would be picked up by logshipper/NR. Is that incorrect?
If so, then yes: I need to bundle up the results and send them somewhere.
According to https://docs.cloudfoundry.org/devguide/using-tasks.html#-task-logging-and-execution-history :

"Any data or messages the task outputs to stdout or stderr is available in the firehose logs of the app. A syslog drain attached to the app receives the task log output. The task execution history is retained for one month."

So it looks like that logger.info line would be picked up by logshipper/NR.
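If it turned out the logger output was not reaching stdout/stderr, a minimal Django LOGGING sketch (illustrative names, not the app's actual config) would route it there, which is all Cloud Foundry needs per the docs quoted above:

```python
# In Django settings; a sketch, not the app's real logging config.
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        # StreamHandler writes to stderr by default, which Cloud Foundry
        # captures and forwards to any attached syslog drain.
        "console": {"class": "logging.StreamHandler"},
    },
    "root": {"handlers": ["console"], "level": "INFO"},
}
```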
Given a path, it will scan everything under it; given a single object, it will scan just that one object.
I was passing params incorrectly to the SimpleUploadedFile creation. Fixed.
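For reference, the shape of Django's SimpleUploadedFile constructor; the filename and bytes below are made up for illustration:

```python
from django.core.files.uploadedfile import SimpleUploadedFile

# Signature: SimpleUploadedFile(name, content, content_type=None);
# content must be bytes, not str.
fake_file = SimpleUploadedFile(
    "test.pdf", b"%PDF-1.4 not a real PDF", content_type="application/pdf"
)
```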
Branch updated from commit e969025 to a50d76e.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding some inline comments following conversation with @tadhg-ohiggins.
backend/dissemination/management/commands/scan_bucket_files_for_viruses.py
for object_summary in objects["Contents"]:
    object_name = object_summary["Key"]
    result = scan_file_in_s3(bucket, object_name)
    results.append(result)
This is likely to get bogged down by the volume of files. We should log each time ClamAV finds a suspicious file instead of creating one large results object.
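A sketch of that streaming approach — log at the point of each scan rather than building one large results list. scan_file_in_s3 and is_stringlike are the helpers from this PR; the string-result-means-failure convention is assumed from the snippet above:

```python
def scan_and_log(bucket, object_name):
    # Scan one object and log any failure immediately, so memory use
    # stays flat no matter how many objects are in the bucket.
    result = scan_file_in_s3(bucket, object_name)
    if is_stringlike(result):  # string results are treated as failures
        logger.error(f"SCAN FAIL: {result}")
```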
def scan_files_at_path_in_s3(bucket, path):
    s3 = get_s3_client()
    objects = s3.list_objects(Bucket=bucket, Prefix=path)
list_objects can only return up to 1000 objects in a single request, so we'll need to handle pagination or figure something else out.
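For example, boto3's paginator handles the 1000-object limit transparently. A sketch keeping this PR's function shape and helpers:

```python
def scan_files_at_path_in_s3(bucket, path):
    s3 = get_s3_client()
    # get_paginator issues as many list requests as needed, so the
    # 1000-object-per-response limit no longer truncates the scan.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=path):
        for object_summary in page.get("Contents", []):
            scan_file_in_s3(bucket, object_summary["Key"])
```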
…o instead log errors at the point of each scan.
Contents of this PR moved to #3285 due to Git commit signing issues; closing.
#2693
This is a Django command that can do one of two things:
- given a path, scan every object under that prefix, or
- given a single object, scan just that object.
Common calling patterns will be:
fac scan_bucket_files_for_viruses --bucket gsa-fac-private-s3 --path singleauditreport
fac scan_bucket_files_for_viruses --bucket gsa-fac-private-s3 --path excel
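Those calling patterns imply a command interface roughly like the sketch below. The argument handling is inferred from the flags above, not copied from the PR, and the --object flag name in particular is an assumption:

```python
from django.core.management.base import BaseCommand

class Command(BaseCommand):
    help = "Scan files at rest in an S3 bucket for viruses."

    def add_arguments(self, parser):
        parser.add_argument("--bucket", required=True)
        parser.add_argument("--path", help="scan every object under this prefix")
        parser.add_argument("--object", help="scan a single object")  # flag name assumed

    def handle(self, *args, **options):
        ...  # dispatch to scan_files_at_path_in_s3 or scan_file_in_s3
```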
This should be testable locally, and once merged, we can test it in dev. If it can be run manually in dev, we can then build an associated GH Action to trigger it.

This does not write to any files, and therefore is not a risk to data. It does generate log messages.
(Demo video attachment: sfv.mp4)
PR checklist: submitters
- Merge main into your branch shortly before creating the PR. (You should also be merging main into your branch regularly during development.)
- Run git status | grep migrations. If there are any results, you probably need to add them to the branch for the PR. Your PR should have only one new migration file for each of the component apps, except in rare circumstances; you may need to delete some and re-run python manage.py makemigrations to reduce the number to one. (Also, unless in exceptional circumstances, your PR should not delete any migration files.)

PR checklist: reviewers
- Run make docker-clean; make docker-first-run && docker compose up; then run docker compose exec web /bin/bash -c "python manage.py test".

The larger the PR, the stricter we should be about these points.