
A Django command that scans files at rest #2705

Closed
wants to merge 7 commits

Conversation

@jadudm (Contributor) commented Nov 2, 2023

#2693

This is a Django command that can do one of two things:

  1. Given a --bucket and --object, it will scan that object.
  2. Given a --bucket and --path, it will scan all the files at that path.

Common calling patterns will be:

  • fac scan_bucket_files_for_viruses --bucket gsa-fac-private-s3 --path singleauditreport
  • fac scan_bucket_files_for_viruses --bucket gsa-fac-private-s3 --path excel
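The two modes above might dispatch roughly like this. This is a sketch, not the PR's code: the real logic lives in a Django BaseCommand subclass, and `scan_one`/`scan_path` stand in for the helpers (`scan_file_in_s3`, `scan_files_at_path_in_s3`) shown later in the diff:

```python
# Sketch of the command's mode dispatch; hypothetical simplification of
# the Django management command's handle() method.
def handle(options, scan_one, scan_path):
    bucket = options["bucket"]
    if options.get("object"):
        # Mode 1: scan a single object
        return [scan_one(bucket, options["object"])]
    if options.get("path"):
        # Mode 2: scan every file under a prefix
        return scan_path(bucket, options["path"])
    raise ValueError("must pass --object or --path along with --bucket")
```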

This should be testable locally, and once merged, we can test it in dev. If it can be run manually in dev, we can then build an associated GH Action to trigger it.

This command does not write to any files, and therefore poses no risk to the data. It does generate log messages.

sfv.mp4

PR checklist: submitters

  • Link to an issue if possible. If there’s no issue, describe what your branch does. Even if there is an issue, a brief description in the PR is still useful.
  • List any special steps reviewers have to follow to test the PR. For example, adding a local environment variable, creating a local test file, etc.
  • For extra credit, submit a screen recording like this one.
  • Make sure you’ve merged main into your branch shortly before creating the PR. (You should also be merging main into your branch regularly during development.)
  • Make sure you’ve accounted for any migrations. When you’re about to create the PR, bring up the application locally and then run git status | grep migrations. If there are any results, you probably need to add them to the branch for the PR. Your PR should have only one new migration file for each of the component apps, except in rare circumstances; you may need to delete some and re-run python manage.py makemigrations to reduce the number to one. (Also, unless in exceptional circumstances, your PR should not delete any migration files.)
  • Make sure that whatever feature you’re adding has tests that cover the feature. This includes test coverage to make sure that the previous workflow still works, if applicable.
  • Make sure the full-submission.cy.js Cypress test passes, if applicable.
  • Do manual testing locally. Our tests are not good enough yet to allow us to skip this step. If that’s not applicable for some reason, check this box.
  • Verify that no Git surgery was necessary, or, if it was necessary at any point, repeat the testing after it’s finished.
  • Once a PR is merged, keep an eye on it until it’s deployed to dev, and do enough testing on dev to verify that it deployed successfully, the feature works as expected, and the happy path for the broad feature area (such as submission) still works.

PR checklist: reviewers

  • Pull the branch to your local environment and run make docker-clean; make docker-first-run && docker compose up; then run docker compose exec web /bin/bash -c "python manage.py test"
  • Manually test out the changes locally, or check this box to verify that it wasn’t applicable in this case.
  • Check that the PR has appropriate tests. Look out for changes in HTML/JS/JSON Schema logic that may need to be captured in Python tests even though the logic isn’t in Python.
  • Verify that no Git surgery is necessary at any point (such as during a merge party), or, if it was, repeat the testing after it’s finished.

The larger the PR, the stricter we should be about these points.

@jadudm jadudm temporarily deployed to dev November 2, 2023 14:51 — with GitHub Actions Inactive
@jadudm jadudm temporarily deployed to meta November 2, 2023 14:51 — with GitHub Actions Inactive
github-actions bot commented Nov 2, 2023

Terraform plan for dev

No changes. Your infrastructure matches the configuration.

Terraform has compared your real infrastructure against your configuration
and found no differences, so no changes are needed.

Warning: Argument is deprecated

  with module.dev.module.cg-logshipper.module.s3-logshipper-storage.cloudfoundry_service_instance.bucket,
  on /tmp/terraform-data-dir/modules/dev.cg-logshipper.s3-logshipper-storage/s3/main.tf line 14, in resource "cloudfoundry_service_instance" "bucket":
  14:   recursive_delete = var.recursive_delete

Since CF API v3, recursive delete is always done on the cloudcontroller side.
This will be removed in future releases

(and 3 more similar warnings elsewhere)

📝 Plan generated in Pull Request Checks #2173

github-actions bot commented Nov 2, 2023

Terraform plan for meta

No changes. Your infrastructure matches the configuration.

Terraform has compared your real infrastructure against your configuration
and found no differences, so no changes are needed.

Warning: Argument is deprecated

  with module.s3-backups.cloudfoundry_service_instance.bucket,
  on /tmp/terraform-data-dir/modules/s3-backups/s3/main.tf line 14, in resource "cloudfoundry_service_instance" "bucket":
  14:   recursive_delete = var.recursive_delete

Since CF API v3, recursive delete is always done on the cloudcontroller side.
This will be removed in future releases

📝 Plan generated in Pull Request Checks #2173

github-actions bot commented Nov 2, 2023

Coverage report (file · coverage % · missing lines):
All files: 88%
api/serializers.py 88% 177-178 183 188
api/test_views.py 95% 103
api/uei.py 88% 87 118-119 163 167-168
api/views.py 98% 198-199 337-338
audit/forms.py 60% 31-38 108-115
audit/intake_to_dissemination.py 87% 71-72 205-211 261 305-313
audit/test_commands.py 87%
audit/test_manage_submission_access_view.py 98% 15 19
audit/test_mixins.py 90% 112-113 117-119 184-185 189-191
audit/test_validators.py 95% 436 440 608-609 848 855 862 869
audit/test_views.py 95% 410-442 451-482 491-519
audit/test_workbooks_should_fail.py 88% 58 87-88 92
audit/test_workbooks_should_pass.py 87% 59 74-76
audit/utils.py 91% 33-35 38
audit/validators.py 94% 137 189 288-289 304-305 486-490 495-499 515-524
audit/cross_validation/additional_ueis.py 93% 33
audit/cross_validation/check_award_ref_declaration.py 90%
audit/cross_validation/check_award_reference_uniqueness.py 93%
audit/cross_validation/check_certifying_contacts.py 87%
audit/cross_validation/check_findings_count_consistency.py 91%
audit/cross_validation/check_ref_number_in_cap.py 90%
audit/cross_validation/check_ref_number_in_findings_text.py 90%
audit/cross_validation/errors.py 78% 30 69
audit/cross_validation/naming.py 93% 197
audit/cross_validation/submission_progress_check.py 95% 80
audit/cross_validation/tribal_data_sharing_consent.py 81% 33 36 40
audit/cross_validation/validate_general_information.py 93% 28-29
audit/fixtures/single_audit_checklist.py 55% 146-183 229-238
audit/intakelib/exceptions.py 71% 7-9 12
audit/intakelib/intermediate_representation.py 91% 27-28 73 91 129 200-203 212-213
audit/intakelib/mapping_audit_findings.py 97% 55
audit/intakelib/mapping_audit_findings_text.py 97% 52
audit/intakelib/mapping_federal_awards.py 93% 90
audit/intakelib/mapping_util.py 81% 21 25 29 99 104-105 114-120 130 145 150
audit/intakelib/checks/check_all_unique_award_numbers.py 79% 24
audit/intakelib/checks/check_cluster_names.py 75% 21-26
audit/intakelib/checks/check_cluster_total.py 95% 73
audit/intakelib/checks/check_has_all_the_named_ranges.py 84% 52
audit/intakelib/checks/check_is_a_workbook.py 69% 20
audit/intakelib/checks/check_loan_balance_entries.py 83% 29
audit/intakelib/checks/check_loan_balance_present.py 88% 30
audit/intakelib/checks/check_look_for_empty_rows.py 91% 18
audit/intakelib/checks/check_no_major_program_no_type.py 76% 18 27
audit/intakelib/checks/check_no_repeat_findings.py 76% 21 30
audit/intakelib/checks/check_other_cluster_names.py 81% 23 33
audit/intakelib/checks/check_passthrough_name_when_no_direct.py 88% 9 47
audit/intakelib/checks/check_sequential_award_numbers.py 76% 14 22
audit/intakelib/checks/check_start_and_end_rows_of_all_columns_are_same.py 89% 14
audit/intakelib/checks/check_state_cluster_names.py 81% 23 33
audit/intakelib/checks/check_version_number.py 73% 21 31-32
audit/intakelib/checks/runners.py 96% 147
audit/intakelib/common/util.py 90% 22 39
audit/intakelib/transforms/xform_rename_additional_notes_sheet.py 81% 14
audit/management/commands/load_fixtures.py 46% 39-45
audit/models/models.py 83% 57 59 64 66 216 228-231 249 426 444-445 453 475 573-574 578 586 595 601
audit/views/audit_info_form_view.py 27% 25-74 77-117 120-137
audit/views/manage_submission.py 86% 73-80
audit/views/manage_submission_access.py 98% 113-114
audit/views/pre_dissemination_download_view.py 76% 15-18 23-26 33-43
audit/views/submission_progress_view.py 96% 179-180
audit/views/tribal_data_consent.py 34% 23-41 44-79
audit/views/unlock_after_certification.py 57% 28-51 73-87
audit/views/upload_report_view.py 26% 32-35 44 91-117 120-170 178-209
audit/views/views.py 39% 69 76-95 118-119 193-194 239-240 251-252 254-258 305-318 321-335 340-353 370-376 381-401 404-432 437-466 469-513 518-538 541-569 574-603 606-650 655-667 670-680 685-697 724-725
census_historical_migration/change_record.py 95% 30 34 46
census_historical_migration/test_federal_awards_xforms.py 99% 194-195
census_historical_migration/sac_general_lib/audit_information.py 91% 24 78
census_historical_migration/sac_general_lib/cognizant_oversight.py 68% 11
census_historical_migration/sac_general_lib/general_information.py 92% 150-151 159-160 202
census_historical_migration/sac_general_lib/sac_creator.py 90% 34
census_historical_migration/sac_general_lib/utils.py 83% 33 60-69
census_historical_migration/transforms/xform_retrieve_uei.py 67% 10
census_historical_migration/workbooklib/additional_eins.py 84% 58-60 67-77
census_historical_migration/workbooklib/additional_ueis.py 77% 27-29 36-46
census_historical_migration/workbooklib/excel_creation_utils.py 61% 98 107-112 117-124 128-146 159-163 177-180
census_historical_migration/workbooklib/federal_awards.py 62% 139 180-188 198-223 326-408
census_historical_migration/workbooklib/notes_to_sefa.py 65% 34-38 116-122 130-136 144-146 153-185
config/test_settings.py 92% 33-34 49-50
config/urls.py 71% 83
dissemination/file_downloads.py 73% 35-53 83-85
dissemination/models.py 99% 469
dissemination/search.py 83% 58 88 94 123 125 128-136
dissemination/summary_reports.py 68% 268-270 274-278 355-399 424 460-462 477-484
dissemination/views.py 84% 166 170 200 248 250 252 293-297
dissemination/migrations/0002_general_fac_accepted_date.py 47% 10-12
djangooidc/backends.py 78% 32 57-63
djangooidc/exceptions.py 66% 19 21 23 28
djangooidc/oidc.py 16% 32-35 45-51 64-70 92-149 153-199 203-226 230-275 280-281 286
djangooidc/views.py 80% 22 43 114
djangooidc/tests/common.py 96%
report_submission/forms.py 92% 35
report_submission/views.py 76% 83 215-216 218 240-241 260-261 287-396 399-409
report_submission/templatetags/get_attr.py 76% 8 11-14 18
support/admin.py 88% 76 79 84 91-97 100-102
support/cog_over.py 91% 30-33 93 145
support/test_admin_api.py 80% 22 146-147 236-237 316-317
support/test_cog_over.py 98% 134-135 224
support/management/commands/seed_cog_baseline.py 98% 20-21
support/models/cog_over.py 89% 103-104
tools/update_program_data.py 89% 96
users/admin.py 99% 27
users/auth.py 96% 58-59
users/models.py 96% 18 74-75
users/fixtures/user_fixtures.py 91%

Minimum allowed coverage is 85%

Generated by 🐒 cobertura-action against f20fb66

@danswick (Contributor) commented Nov 2, 2023

Some questions that came up during cowork:

  • Is there any additional risk to doing this scan from inside the Django app? If a potentially risky file exists, aren't we guaranteeing that it touches the app by scanning every file?
  • How should we handle inevitable ClamAV timeouts?
  • Once we start getting a lot more audits, ~how much compute/memory/etc are we going to need? If it takes 1 second to scan a file (which is likely a vast underestimate), and there are 2.5 million seconds in a month, we would outgrow a "monthly scan" cadence once we hit 2.5 million media files in S3 (which includes ~500k historic PDF files).
    • Maybe we need to consider purging workbook files either annually or when a submission is complete.
  • What would ingress/egress costs look like to scan every file?
  • Could scanning files on the way out, every time, address the threat vector in question?
  • Could we just scan any time ClamAV or its definitions are updated? Would this qualify as a "periodic" scan?

@jadudm (Contributor, author) commented Nov 3, 2023

Q: Is there any additional risk to doing this scan from inside the Django app? If a potentially risky file exists, aren't we guaranteeing that it touches the app by scanning every file?
A: I'm not sure what tooling would be "safer" to load a file and pass it to ClamAV via HTTPS POST. No matter how you slice it, the file has to pass through RAM on its way to the "remote" ClamAV instance.
A: No matter what, some code has to touch the production bucket to read the files and pass them to Clam.
A: This would run as an action, so it wouldn't exactly be running in the same space as the app. But, it has to touch our data.
A: I may be missing something here.

Q: How should we handle inevitable ClamAV timeouts?
A: It's a failed scan. Either we add a backoff and retry, or we call the file "no good" and quarantine it?
A: If the connection doesn't time out naturally, we need to add a timeout.
A: This was a first step. We need something that can run and log, and that we can catch in NR. A documented mitigation plan, for the moment, would suffice.
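A documented mitigation could start as small as a retry wrapper with exponential backoff before declaring the scan failed. This is a sketch under assumptions, not the PR's code: `scan_fn` stands in for whatever performs the HTTPS POST to ClamAV, and the exception type would depend on the HTTP client used:

```python
import time

def scan_with_retries(scan_fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call scan_fn, retrying on timeout with exponential backoff.

    Returns scan_fn()'s result, or re-raises after the final attempt so
    the caller can log the file as a failed scan and quarantine it.
    """
    for attempt in range(attempts):
        try:
            return scan_fn()
        except TimeoutError:
            if attempt == attempts - 1:
                raise
            # Back off: 1x, 2x, 4x... the base delay between attempts
            sleep(base_delay * (2 ** attempt))
```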

Q: Once we start getting a lot more audits, ~how much compute/memory/etc are we going to need? If it takes 1 second to scan a file (which is likely a vast underestimate), and there are 2.5 million seconds in a month, we would outgrow a "monthly scan" cadence once we hit 2.5 million media files in S3 (which includes ~500k historic PDF files).
A: Hm.

(define pdfs 700000)
(define xlsx (* pdfs 4))
(define files (+ pdfs xlsx))
(define time-per-file 4)
(define total-time (* files time-per-file))
(define hours (/ (/ total-time 60) 60))
(define days (/ hours 24))
(define months (/ days 30))

gives me roughly 5 months.
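The same back-of-the-envelope arithmetic in Python, with the same assumed inputs (700k PDFs, four workbooks per PDF, 4 seconds per scan):

```python
pdfs = 700_000
xlsx = 4 * pdfs           # assume four workbooks per PDF
files = pdfs + xlsx       # 3.5 million files
time_per_file = 4         # seconds per scan, assumed
total_seconds = files * time_per_file
months = total_seconds / 60 / 60 / 24 / 30
# months comes out to about 5.4: a full pass takes roughly five months
```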

A1: We can do that twice a year!
A2: I have no idea how we address this.

Q: Maybe we need to consider purging workbook files either annually or when a submission is complete.
A: Or, monthly.
A: Or, zip them all every month, so there's only one file to scan per month.
A: Or, can we scan our backups? Those are compressed. Would that work? But, I have no idea how large a file you can pass to ClamAV... oh. Around 4GB. So... perhaps we create 2GB zips, and scan those?
A: Or, yep.

Q: What would ingress/egress costs look like to scan every file?
A: $0? Ingress/egress is to/from the internet? So, I think this would be S3 to our app, both of which are within the same GovCloud region. If this is true, then I think there is no ingress/egress between the app (running as an action) and the S3 storage.
A: We would want to check with cloud.gov.

Q: Could scanning files on the way out, every time, address the threat vector in question?
A: We could try and make the case. We'd have to convince GSA leadership/our CISO that this is the preferable solution to running virus scans all year long.

Q: Could we just scan any time ClamAV or its definitions are updated? Would this qualify as a "periodic" scan?
A: That would likely be more often. Meaning... we couldn't run the scans fast enough to keep up with this. It would count as periodic, I think, but we couldn't run all the files.

@jadudm jadudm marked this pull request as ready for review November 8, 2023 13:26
@jadudm jadudm requested a review from a team November 8, 2023 13:26
else:
    for r in results:
        if is_stringlike(r):
            logger.error(f"SCAN FAIL: {r}")
A reviewer (Contributor):

As it stands, this is just a Django command with no associated workflow (and even if it had one, the question still applies when run with cf tasks): how do we get the output back to us for consumption?

Could we write fails to a file and send it off somewhere?

The PR author (Contributor):

I thought that the logger actions, as a Django command, would be picked up by logshipper/NR.

Is that incorrect?

If so, then yes: I need to bundle up results and send them somewhere.

A reviewer (Contributor):

According to https://docs.cloudfoundry.org/devguide/using-tasks.html#-task-logging-and-execution-history :

Any data or messages the task outputs to stdout or stderr is available in the firehose logs of the app. A syslog drain attached to the app receives the task log output. The task execution history is retained for one month.

So it looks like that logger.info line would be picked up by logshipper/NR.

Commit messages:

  • Given a path, it will scan everything. Given a single object, it will scan one thing.
  • I was passing params wrong to the SimpleUploadedFile creation. Fixed.
@danswick (Contributor) left a comment:

Adding some inline comments following a conversation with @tadhg-ohiggins.

for object_summary in objects["Contents"]:
    object_name = object_summary["Key"]
    result = scan_file_in_s3(bucket, object_name)
    results.append(result)
A reviewer (Contributor):

This is likely to get bogged down by the volume of files. We should log each time ClamAV finds a suspicious file instead of creating one large results object.


def scan_files_at_path_in_s3(bucket, path):
    s3 = get_s3_client()
    objects = s3.list_objects(Bucket=bucket, Prefix=path)
A reviewer (Contributor):

list_objects can only return up to 1000 objects in a single request, so we'll need to handle pagination or figure something else out.

Boto3 list_objects docs.

Boto3 built-in paginator docs
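The two review comments above could be addressed together: use the built-in paginator to handle the 1000-object limit, and log each failure as it happens instead of building one large results object. A sketch under assumptions (it takes a boto3 S3 client, whose `list_objects_v2` paginator yields pages containing a `Contents` list; `scan_file` stands in for the ClamAV POST):

```python
import logging

logger = logging.getLogger(__name__)

def scan_files_at_path_in_s3(s3, scan_file, bucket, path):
    """Scan every object under a prefix, logging failures as we go.

    s3 is a boto3 S3 client; scan_file(bucket, key) is whatever posts
    the object to ClamAV and returns a truthy "clean" result.
    """
    scanned = 0
    # The paginator transparently issues follow-up requests past the
    # 1000-object limit of a single list_objects call.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=path):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if not scan_file(bucket, key):
                # Log per failure instead of accumulating results
                logger.error("SCAN FAIL: %s", key)
            scanned += 1
    return scanned
```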

…o instead log errors at the point of each scan.
@tadhg-ohiggins (Contributor):

Contents of this PR moved to #3285 due to Git commit signing issues; closing.

4 participants