
A Django command that scans files at rest #2705

Closed
wants to merge 7 commits

Conversation

@jadudm (Contributor) commented Nov 2, 2023

#2693

This is a Django command that can do one of two things:

  1. Given a --bucket and --object, it will scan that object.
  2. Given a --bucket and --path, it will scan all the files at that path.

Common calling patterns will be:

  • fac scan_bucket_files_for_viruses --bucket gsa-fac-private-s3 --path singleauditreport
  • fac scan_bucket_files_for_viruses --bucket gsa-fac-private-s3 --path excel
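The two modes above might dispatch roughly like this. This is a sketch, not the PR's code: the real logic lives in a Django BaseCommand subclass, and `scan_one`/`scan_path` stand in for the helpers (`scan_file_in_s3`, `scan_files_at_path_in_s3`) shown later in the diff:

```python
# Sketch of the command's mode dispatch; hypothetical simplification of
# the Django management command's handle() method.
def handle(options, scan_one, scan_path):
    bucket = options["bucket"]
    if options.get("object"):
        # Mode 1: scan a single object
        return [scan_one(bucket, options["object"])]
    if options.get("path"):
        # Mode 2: scan every file under a prefix
        return scan_path(bucket, options["path"])
    raise ValueError("must pass --object or --path along with --bucket")
```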

This should be testable locally, and once merged, we can test it in dev. If it can be run manually in dev, we can then build an associated GH Action to trigger it.

This command does not write to any files, and therefore poses no risk to the data. It does generate log messages.

sfv.mp4

PR checklist: submitters

  • Link to an issue if possible. If there’s no issue, describe what your branch does. Even if there is an issue, a brief description in the PR is still useful.
  • List any special steps reviewers have to follow to test the PR. For example, adding a local environment variable, creating a local test file, etc.
  • For extra credit, submit a screen recording like this one.
  • Make sure you’ve merged main into your branch shortly before creating the PR. (You should also be merging main into your branch regularly during development.)
  • Make sure you’ve accounted for any migrations. When you’re about to create the PR, bring up the application locally and then run git status | grep migrations. If there are any results, you probably need to add them to the branch for the PR. Your PR should have only one new migration file for each of the component apps, except in rare circumstances; you may need to delete some and re-run python manage.py makemigrations to reduce the number to one. (Also, unless in exceptional circumstances, your PR should not delete any migration files.)
  • Make sure that whatever feature you’re adding has tests that cover the feature. This includes test coverage to make sure that the previous workflow still works, if applicable.
  • Make sure the full-submission.cy.js Cypress test passes, if applicable.
  • Do manual testing locally. Our tests are not good enough yet to allow us to skip this step. If that’s not applicable for some reason, check this box.
  • Verify that no Git surgery was necessary, or, if it was necessary at any point, repeat the testing after it’s finished.
  • Once a PR is merged, keep an eye on it until it’s deployed to dev, and do enough testing on dev to verify that it deployed successfully, the feature works as expected, and the happy path for the broad feature area (such as submission) still works.

PR checklist: reviewers

  • Pull the branch to your local environment and run make docker-clean; make docker-first-run && docker compose up; then run docker compose exec web /bin/bash -c "python manage.py test"
  • Manually test out the changes locally, or check this box to verify that it wasn’t applicable in this case.
  • Check that the PR has appropriate tests. Look out for changes in HTML/JS/JSON Schema logic that may need to be captured in Python tests even though the logic isn’t in Python.
  • Verify that no Git surgery is necessary at any point (such as during a merge party), or, if it was, repeat the testing after it’s finished.

The larger the PR, the stricter we should be about these points.

@jadudm jadudm temporarily deployed to dev November 2, 2023 14:51 — with GitHub Actions Inactive
@jadudm jadudm temporarily deployed to meta November 2, 2023 14:51 — with GitHub Actions Inactive
github-actions bot commented Nov 2, 2023

Terraform plan for dev

No changes. Your infrastructure matches the configuration.

Terraform has compared your real infrastructure against your configuration
and found no differences, so no changes are needed.

Warning: Argument is deprecated

  with module.dev.module.cg-logshipper.module.s3-logshipper-storage.cloudfoundry_service_instance.bucket,
  on /tmp/terraform-data-dir/modules/dev.cg-logshipper.s3-logshipper-storage/s3/main.tf line 14, in resource "cloudfoundry_service_instance" "bucket":
  14:   recursive_delete = var.recursive_delete

Since CF API v3, recursive delete is always done on the cloudcontroller side.
This will be removed in future releases

(and 3 more similar warnings elsewhere)

📝 Plan generated in Pull Request Checks #2173

github-actions bot commented Nov 2, 2023

Terraform plan for meta

No changes. Your infrastructure matches the configuration.

Terraform has compared your real infrastructure against your configuration
and found no differences, so no changes are needed.

Warning: Argument is deprecated

  with module.s3-backups.cloudfoundry_service_instance.bucket,
  on /tmp/terraform-data-dir/modules/s3-backups/s3/main.tf line 14, in resource "cloudfoundry_service_instance" "bucket":
  14:   recursive_delete = var.recursive_delete

Since CF API v3, recursive delete is always done on the cloudcontroller side.
This will be removed in future releases

📝 Plan generated in Pull Request Checks #2173

github-actions bot commented Nov 2, 2023

Coverage report (file · coverage % · missing lines):
All files: 88%
api/serializers.py 88% 177-178 183 188
api/test_views.py 95% 103
api/uei.py 88% 87 118-119 163 167-168
api/views.py 98% 198-199 337-338
audit/forms.py 60% 31-38 108-115
audit/intake_to_dissemination.py 87% 71-72 205-211 261 305-313
audit/test_commands.py 87%
audit/test_manage_submission_access_view.py 98% 15 19
audit/test_mixins.py 90% 112-113 117-119 184-185 189-191
audit/test_validators.py 95% 436 440 608-609 848 855 862 869
audit/test_views.py 95% 410-442 451-482 491-519
audit/test_workbooks_should_fail.py 88% 58 87-88 92
audit/test_workbooks_should_pass.py 87% 59 74-76
audit/utils.py 91% 33-35 38
audit/validators.py 94% 137 189 288-289 304-305 486-490 495-499 515-524
audit/cross_validation/additional_ueis.py 93% 33
audit/cross_validation/check_award_ref_declaration.py 90%
audit/cross_validation/check_award_reference_uniqueness.py 93%
audit/cross_validation/check_certifying_contacts.py 87%
audit/cross_validation/check_findings_count_consistency.py 91%
audit/cross_validation/check_ref_number_in_cap.py 90%
audit/cross_validation/check_ref_number_in_findings_text.py 90%
audit/cross_validation/errors.py 78% 30 69
audit/cross_validation/naming.py 93% 197
audit/cross_validation/submission_progress_check.py 95% 80
audit/cross_validation/tribal_data_sharing_consent.py 81% 33 36 40
audit/cross_validation/validate_general_information.py 93% 28-29
audit/fixtures/single_audit_checklist.py 55% 146-183 229-238
audit/intakelib/exceptions.py 71% 7-9 12
audit/intakelib/intermediate_representation.py 91% 27-28 73 91 129 200-203 212-213
audit/intakelib/mapping_audit_findings.py 97% 55
audit/intakelib/mapping_audit_findings_text.py 97% 52
audit/intakelib/mapping_federal_awards.py 93% 90
audit/intakelib/mapping_util.py 81% 21 25 29 99 104-105 114-120 130 145 150
audit/intakelib/checks/check_all_unique_award_numbers.py 79% 24
audit/intakelib/checks/check_cluster_names.py 75% 21-26
audit/intakelib/checks/check_cluster_total.py 95% 73
audit/intakelib/checks/check_has_all_the_named_ranges.py 84% 52
audit/intakelib/checks/check_is_a_workbook.py 69% 20
audit/intakelib/checks/check_loan_balance_entries.py 83% 29
audit/intakelib/checks/check_loan_balance_present.py 88% 30
audit/intakelib/checks/check_look_for_empty_rows.py 91% 18
audit/intakelib/checks/check_no_major_program_no_type.py 76% 18 27
audit/intakelib/checks/check_no_repeat_findings.py 76% 21 30
audit/intakelib/checks/check_other_cluster_names.py 81% 23 33
audit/intakelib/checks/check_passthrough_name_when_no_direct.py 88% 9 47
audit/intakelib/checks/check_sequential_award_numbers.py 76% 14 22
audit/intakelib/checks/check_start_and_end_rows_of_all_columns_are_same.py 89% 14
audit/intakelib/checks/check_state_cluster_names.py 81% 23 33
audit/intakelib/checks/check_version_number.py 73% 21 31-32
audit/intakelib/checks/runners.py 96% 147
audit/intakelib/common/util.py 90% 22 39
audit/intakelib/transforms/xform_rename_additional_notes_sheet.py 81% 14
audit/management/commands/load_fixtures.py 46% 39-45
audit/models/models.py 83% 57 59 64 66 216 228-231 249 426 444-445 453 475 573-574 578 586 595 601
audit/views/audit_info_form_view.py 27% 25-74 77-117 120-137
audit/views/manage_submission.py 86% 73-80
audit/views/manage_submission_access.py 98% 113-114
audit/views/pre_dissemination_download_view.py 76% 15-18 23-26 33-43
audit/views/submission_progress_view.py 96% 179-180
audit/views/tribal_data_consent.py 34% 23-41 44-79
audit/views/unlock_after_certification.py 57% 28-51 73-87
audit/views/upload_report_view.py 26% 32-35 44 91-117 120-170 178-209
audit/views/views.py 39% 69 76-95 118-119 193-194 239-240 251-252 254-258 305-318 321-335 340-353 370-376 381-401 404-432 437-466 469-513 518-538 541-569 574-603 606-650 655-667 670-680 685-697 724-725
census_historical_migration/change_record.py 95% 30 34 46
census_historical_migration/test_federal_awards_xforms.py 99% 194-195
census_historical_migration/sac_general_lib/audit_information.py 91% 24 78
census_historical_migration/sac_general_lib/cognizant_oversight.py 68% 11
census_historical_migration/sac_general_lib/general_information.py 92% 150-151 159-160 202
census_historical_migration/sac_general_lib/sac_creator.py 90% 34
census_historical_migration/sac_general_lib/utils.py 83% 33 60-69
census_historical_migration/transforms/xform_retrieve_uei.py 67% 10
census_historical_migration/workbooklib/additional_eins.py 84% 58-60 67-77
census_historical_migration/workbooklib/additional_ueis.py 77% 27-29 36-46
census_historical_migration/workbooklib/excel_creation_utils.py 61% 98 107-112 117-124 128-146 159-163 177-180
census_historical_migration/workbooklib/federal_awards.py 62% 139 180-188 198-223 326-408
census_historical_migration/workbooklib/notes_to_sefa.py 65% 34-38 116-122 130-136 144-146 153-185
config/test_settings.py 92% 33-34 49-50
config/urls.py 71% 83
dissemination/file_downloads.py 73% 35-53 83-85
dissemination/models.py 99% 469
dissemination/search.py 83% 58 88 94 123 125 128-136
dissemination/summary_reports.py 68% 268-270 274-278 355-399 424 460-462 477-484
dissemination/views.py 84% 166 170 200 248 250 252 293-297
dissemination/migrations/0002_general_fac_accepted_date.py 47% 10-12
djangooidc/backends.py 78% 32 57-63
djangooidc/exceptions.py 66% 19 21 23 28
djangooidc/oidc.py 16% 32-35 45-51 64-70 92-149 153-199 203-226 230-275 280-281 286
djangooidc/views.py 80% 22 43 114
djangooidc/tests/common.py 96%
report_submission/forms.py 92% 35
report_submission/views.py 76% 83 215-216 218 240-241 260-261 287-396 399-409
report_submission/templatetags/get_attr.py 76% 8 11-14 18
support/admin.py 88% 76 79 84 91-97 100-102
support/cog_over.py 91% 30-33 93 145
support/test_admin_api.py 80% 22 146-147 236-237 316-317
support/test_cog_over.py 98% 134-135 224
support/management/commands/seed_cog_baseline.py 98% 20-21
support/models/cog_over.py 89% 103-104
tools/update_program_data.py 89% 96
users/admin.py 99% 27
users/auth.py 96% 58-59
users/models.py 96% 18 74-75
users/fixtures/user_fixtures.py 91%

Minimum allowed coverage is 85%

Generated by 🐒 cobertura-action against f20fb66

@danswick (Contributor) commented Nov 2, 2023

Some questions that came up during cowork:

  • Is there any additional risk to doing this scan from inside the Django app? If a potentially risky file exists, aren't we guaranteeing that it touches the app by scanning every file?
  • How should we handle inevitable ClamAV timeouts?
  • Once we start getting a lot more audits, ~how much compute/memory/etc are we going to need? If it takes 1 second to scan a file (which is likely a vast underestimate), and there are 2.5 million seconds in a month, we would outgrow a "monthly scan" cadence once we hit 2.5 million media files in S3 (which includes ~500k historic PDF files).
    • Maybe we need to consider purging workbook files either annually or when a submission is complete.
  • What would ingress/egress costs look like to scan every file?
  • Could scanning files on the way out, every time, address the threat vector in question?
  • Could we just scan any time ClamAV or its definitions are updated? Would this qualify as a "periodic" scan?

@jadudm (Contributor, author) commented Nov 3, 2023

Q: Is there any additional risk to doing this scan from inside the Django app? If a potentially risky file exists, aren't we guaranteeing that it touches the app by scanning every file?
A: I'm not sure what tooling would be "safer" to load a file and pass it to ClamAV via HTTPS POST. No matter how you slice it, the file has to pass through RAM on its way to the "remote" ClamAV instance.
A: No matter what, some code has to touch the production bucket to read the files and pass them to Clam.
A: This would run as an action, so it wouldn't exactly be running in the same space as the app. But, it has to touch our data.
A: I may be missing something here.

Q: How should we handle inevitable ClamAV timeouts?
A: It's a failed scan. Either we add a backoff and retry, or we call the file "no good" and quarantine it?
A: If the connection doesn't time out naturally, we need to add a timeout.
A: This was a first step. We need something that can run and log, and that we can catch in NR. A documented mitigation plan, for the moment, would suffice.
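A documented mitigation could start as small as a retry wrapper with exponential backoff before declaring the scan failed. This is a sketch under assumptions, not the PR's code: `scan_fn` stands in for whatever performs the HTTPS POST to ClamAV, and the exception type would depend on the HTTP client used:

```python
import time

def scan_with_retries(scan_fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call scan_fn, retrying on timeout with exponential backoff.

    Returns scan_fn()'s result, or re-raises after the final attempt so
    the caller can log the file as a failed scan and quarantine it.
    """
    for attempt in range(attempts):
        try:
            return scan_fn()
        except TimeoutError:
            if attempt == attempts - 1:
                raise
            # Back off: 1x, 2x, 4x... the base delay between attempts
            sleep(base_delay * (2 ** attempt))
```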

Q: Once we start getting a lot more audits, ~how much compute/memory/etc are we going to need? If it takes 1 second to scan a file (which is likely a vast underestimate), and there are 2.5 million seconds in a month, we would outgrow a "monthly scan" cadence once we hit 2.5 million media files in S3 (which includes ~500k historic PDF files).
A: Hm.

(define pdfs 700000)
(define xlsx (* pdfs 4))
(define files (+ pdfs xlsx))
(define time-per-file 4)
(define total-time (* files time-per-file))
(define hours (/ (/ total-time 60) 60))
(define days (/ hours 24))
(define months (/ days 30))

gives me roughly 5 months.
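The same back-of-the-envelope arithmetic in Python, with the same assumed inputs (700k PDFs, four workbooks per PDF, 4 seconds per scan):

```python
pdfs = 700_000
xlsx = 4 * pdfs           # assume four workbooks per PDF
files = pdfs + xlsx       # 3.5 million files
time_per_file = 4         # seconds per scan, assumed
total_seconds = files * time_per_file
months = total_seconds / 60 / 60 / 24 / 30
# months comes out to about 5.4: a full pass takes roughly five months
```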

A1: We can do that twice a year!
A2: I have no idea how we address this.

Q: Maybe we need to consider purging workbook files either annually or when a submission is complete.
A: Or, monthly.
A: Or, zip them all every month, so there's only one file to scan per month.
A: Or, can we scan our backups? Those are compressed. Would that work? But, I have no idea how large a file you can pass to ClamAV... oh. Around 4GB. So... perhaps we create 2GB zips, and scan those?
A: Or, yep.

Q: What would ingress/egress costs look like to scan every file?
A: $0? Ingress/egress is to/from the internet? So, I think this would be S3 to our app, both of which are within the same GovCloud region. If this is true, then I think there is no ingress/egress between the app (running as an action) and the S3 storage.
A: We would want to check with cloud.gov.

Q: Could scanning files on the way out, every time, address the threat vector in question?
A: We could try and make the case. We'd have to convince GSA leadership/our CISO that this is the preferable solution to running virus scans all year long.

Q: Could we just scan any time ClamAV or its definitions are updated? Would this qualify as a "periodic" scan?
A: That would likely be more often. Meaning... we couldn't run the scans fast enough to keep up with this. It would count as periodic, I think, but we couldn't run all the files.

@jadudm jadudm marked this pull request as ready for review November 8, 2023 13:26
@jadudm jadudm requested a review from a team November 8, 2023 13:26
else:
    for r in results:
        if is_stringlike(r):
            logger.error(f"SCAN FAIL: {r}")
A reviewer (Contributor):

As it stands, this is just a Django command with no associated workflow (and even if it had one, the question still applies when run with cf tasks): how do we get the output back to us for consumption?

Could we write fails to a file and send it off somewhere?

The PR author (Contributor):

I thought that the logger actions, as a Django command, would be picked up by logshipper/NR.

Is that incorrect?

If so, then yes: I need to bundle up results and send them somewhere.

A reviewer (Contributor):

According to https://docs.cloudfoundry.org/devguide/using-tasks.html#-task-logging-and-execution-history :

Any data or messages the task outputs to stdout or stderr is available in the firehose logs of the app. A syslog drain attached to the app receives the task log output. The task execution history is retained for one month.

So it looks like that logger.info line would be picked up by logshipper/NR.

Commit messages:

  • Given a path, it will scan everything. Given a single object, it will scan one thing.
  • I was passing params wrong to the SimpleUploadedFile creation. Fixed.
@danswick (Contributor) left a comment:

Adding some inline comments following a conversation with @tadhg-ohiggins.

for object_summary in objects["Contents"]:
    object_name = object_summary["Key"]
    result = scan_file_in_s3(bucket, object_name)
    results.append(result)
A reviewer (Contributor):

This is likely to get bogged down by the volume of files. We should log each time ClamAV finds a suspicious file instead of creating one large results object.


def scan_files_at_path_in_s3(bucket, path):
    s3 = get_s3_client()
    objects = s3.list_objects(Bucket=bucket, Prefix=path)
A reviewer (Contributor):

list_objects can only return up to 1000 objects in a single request, so we'll need to handle pagination or figure something else out.

Boto3 list_objects docs.

Boto3 built-in paginator docs
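The two review comments above could be addressed together: use the built-in paginator to handle the 1000-object limit, and log each failure as it happens instead of building one large results object. A sketch under assumptions (it takes a boto3 S3 client, whose `list_objects_v2` paginator yields pages containing a `Contents` list; `scan_file` stands in for the ClamAV POST):

```python
import logging

logger = logging.getLogger(__name__)

def scan_files_at_path_in_s3(s3, scan_file, bucket, path):
    """Scan every object under a prefix, logging failures as we go.

    s3 is a boto3 S3 client; scan_file(bucket, key) is whatever posts
    the object to ClamAV and returns a truthy "clean" result.
    """
    scanned = 0
    # The paginator transparently issues follow-up requests past the
    # 1000-object limit of a single list_objects call.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=path):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if not scan_file(bucket, key):
                # Log per failure instead of accumulating results
                logger.error("SCAN FAIL: %s", key)
            scanned += 1
    return scanned
```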

…o instead log errors at the point of each scan.
@tadhg-ohiggins (Contributor):

Contents of this PR moved to #3285 due to Git commit signing issues; closing.

4 participants