
DevSecOps Architecture

The architecture of this project lends itself directly to fulfilling the principles found in https://tech.gsa.gov/guides/dev_sec_ops_guide/. The overall idea is to create a template that people can clone, add their application to, and provision some GCP Projects for; CircleCI then automatically provisions dev/staging/production environments in those GCP Projects with good secrets management, backups, tests, and other controls. Projects built from the template thus follow DevSecOps best practices automatically, which makes it easier to get an ATO.

We believe that developing a project with DevSecOps best practices will result in a much better experience for the users, developers, and owners of projects derived from this template.

Platform

The platform that people will be deploying to is Google Cloud Platform (GCP). This project does not directly address how GCP is provisioned and set up, but we have had discussions with the people in the GSA Enterprise Systems Support Division (ICE) who are developing the process. The rough plan is:

  • There is a GSA GCP Organization run by the GSA ICE people.
  • The GSA ICE group will receive requests for GCP Projects from the Project Owner.
  • The request will require some sort of approval.
  • Once approval is granted, GSA ICE will create GCP Project(s) for the Project Owner and give them access.

GCP Projects are fully compartmentalized subdivisions within a GCP Organization that are meant to be used for individual application environments like dev/test/prod. Access to one GCP Project does not give you access to any other Project. Billing is managed on a per-Project basis as well.

(Diagram: GCP Organization, Projects, apps, and services)

Project

The GCP Project is where the Project Owner will build and deploy their infrastructure and application. The Project Owner will usually request three GCP Projects: dev, staging, and production, though they may request more if they have special deployment-pipeline needs. A terraform service account will be created, and each GCP Project will then have its infrastructure configured and deployed with terraform via the CircleCI CI/CD system. Applications will be configured and deployed as well, using gcloud tools within CircleCI. CircleCI will watch a set of branches (usually dev/staging/master) and deploy the code in each branch into the appropriate GCP Project whenever there are changes.

(Diagram: a CircleCI workflow)

Currently, our project uses Google App Engine to deploy apps which use Google Cloud SQL and KMS. It also stores the terraform state in an encrypted Cloud Storage bucket. All Google App Engine apps are deployed on instances that are automatically scaled according to load, are automatically updated weekly with patched images by GCP, and are exposed to the world only through GCP-supplied load balancers, which ensure that services are accessed over HTTPS.
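
As a concrete sketch of that state-storage arrangement, terraform's gcs backend only needs a bucket and a prefix; Cloud Storage encrypts objects at rest by default. The bucket and project names here are hypothetical, and the template's actual configuration may differ:

    # Keep terraform state in a Cloud Storage bucket rather than on disk.
    terraform {
      backend "gcs" {
        bucket = "my-project-tfstate"   # hypothetical bucket name
        prefix = "env/dev"              # one prefix per environment
      }
    }

    # The terraform service account credentials are supplied to CircleCI
    # out of band (e.g. via an environment variable), not checked in.
    provider "google" {
      project = "my-gcp-project-id"     # hypothetical project ID
      region  = "us-east1"
    }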

App

The applications in this project are extremely simple; they are meant as examples of how apps are deployed and how they can be configured to do basic or OIDC authentication. They are not in any way useful except as examples. When others use this template, they will delete the example apps and create their own.

These apps are deployed by the CircleCI CI/CD system whenever particular branches in GitHub change. Whenever such a workflow is triggered, CircleCI will configure and deploy the app using the gcloud tool, run tests to verify that the app is functional, run an OWASP ZAP scan against the app to check for obvious security vulnerabilities, and then route traffic to the new version.

One example app has been configured to run with an oauth2_proxy in front of it that is configured to authenticate users from gsa.gov using login.gov. The workflow here adds a deploy/test of the oauth2_proxy after the app is up, but is otherwise the same. This seems like it could be useful for people who do not want to implement OIDC in their app, but it does require you to be smarter about restricting access on the backend too. In this example, the app checks whether the proxy has supplied a signed authentication header, and rejects the connection if it has not.

General features that apps deployed into this environment should have are:

  • Apps should be relatively simple Twelve-Factor Apps: an app with a database backend and maybe Redis or some other GCP-managed service, for example. More complex apps with microservices or complex backend services (Elasticsearch, for example) are probably more suited to GRACE for now.
  • Secrets should be generated by terraform, so that they are assured to be of good strength/quality and are easily rotated (see the sketch after this list).
  • Apps should be configured entirely through environment variables. This means that secrets and other config data should be ephemeral and never hit the disk.
  • Authentication should sit in front of most applications. This is a requirement pre-ATO, and will probably be important to almost every other application that stores data.
  • Apps should be able to let the OWASP ZAP scanner in so that it can do a full scan.
  • Apps should either use GCP logging-enabled libraries, or log to stdout/stderr.
  • Apps should implement comprehensive health checks.
  • Apps should have comprehensive tests written for them that can be executed during the CircleCI deployment pipeline.
  • All data that the apps store should go into a database or other storage service. Local files are ephemeral at best, and may not be permitted at all for some deployment types.
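
As a sketch of the secrets bullet above: terraform's random provider can mint a credential that no human ever sets or sees, and pass it directly to the service that needs it. Resource names are hypothetical, and the Cloud SQL instance is assumed to be defined elsewhere in the same configuration:

    # Generate a strong database password entirely within terraform.
    resource "random_password" "db" {
      length  = 32
      special = false
    }

    # The password goes straight to the database user; it never appears
    # anywhere except the encrypted terraform state.
    resource "google_sql_user" "app" {
      name     = "app"
      instance = google_sql_database_instance.db.name   # assumed defined elsewhere
      password = random_password.db.result
    }

Rotation then amounts to running "terraform taint random_password.db" and re-applying.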

Supporting Systems

There are a number of supporting systems which are used to manage the project which are not part of the Google Cloud Platform.

GitHub

GitHub is the primary interface for interacting with the project once it has been deployed. It is a service that hosts git repos that are used to manage the code for projects. We are using a gitops-style workflow that treats GitHub as the single source of truth for what is deployed to which environment. Developers do not manipulate anything in GCP directly; instead, they develop code and open Pull Requests (which must be approved) to integrate code into specific protected branches, which triggers deploys into the appropriate environment.

CircleCI

CircleCI is a CI/CD service that is used as the automation engine for our project. When changes to particular branches in GitHub happen, CircleCI will kick off jobs that will build, deploy, test, scan, and promote the code to the appropriate environment (dev/staging/production, for example).

DevSecOps Requirements

Summarizing the DevSecOps guide somewhat, we came up with a list of requirements, and under each requirement, we listed how this project fulfills it on GCP.

Application developers have a pipeline that they can use to deploy software which is considerate of security and visible to operations.

  • CircleCI automates all deployment/testing/scanning/promotion tasks in a pipeline that automatically deploys app and infrastructure changes into the dev/staging/production environments.
  • All changes to the environment and apps are done through a defined gitops-style process with code and automation, so managing the infrastructure should not require most people to have more than read-only access to anything once the infrastructure is bootstrapped.
  • Terraform generates all secrets used in the system, so there is generally no opportunity for people to store this data insecurely, and this also makes it easy to rotate secrets.
  • All applications are configured through environment variables, so secrets and other config are generally ephemeral. The only place where secrets hit the disk is in the encrypted Cloud Storage bucket where terraform stores its state.
  • All apps/services deployed in Google App Engine are given an SSL cert in the appspot.com domain by default. There are provisions for getting a cert for a custom domain as well, which most people will want for production (see the sketch after this list).
  • CircleCI can provide logs of what the app/infrastructure deployment automation is doing.
  • GCP Log Viewer provides logs from the other side of what CircleCI is doing with deploys and infrastructure changes, though perhaps without the context that CircleCI access might give you.
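
For the custom-domain item above, a hedged sketch of what that could look like in terraform, assuming the google provider's google_app_engine_domain_mapping resource and a domain that has already been verified (the domain name is hypothetical):

    # Map a custom domain onto the App Engine app with a Google-managed cert.
    resource "google_app_engine_domain_mapping" "prod" {
      domain_name = "myapp.example.gov"   # hypothetical; must be verified first

      ssl_settings {
        ssl_management_type = "AUTOMATIC"   # Google issues and renews the cert
      }
    }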

Application developers have a clear, self-service intake onto the platform.

  • This project does not directly address how GCP is provisioned and set up, but from discussions with the people within GSA ICE who are developing the process, the rough plan is to have a GSA GCP Organization run by the ICE folks. Upon receiving an approved request from a Project Owner, they will set up GCP Project(s) for the Project Owner with a minimal set of permissions and services provisioned.

The platform services are centralized in its infrastructure and pipeline implementation.

  • The creation of GCP Projects will be controlled by GSA ICE.
  • Users and service accounts in the GCP Project and their permissions are currently managed by GSA ICE.
  • Billing is configured by GSA ICE.
  • Users are required to use two-factor authentication (2FA) to get into the GCP Console and use GCP resources. The current second factor that must be used is a U2F/FIDO key.
  • The infrastructures created in the GCP Projects are fully automated by CircleCI and the code in this project.

Application developers are provided base OS images and images that provide component-level functionality that has also been hardened (e.g., standard images pre-packaged with hardened components such as databases or web/application servers).

  • App Engine instances and containers are provided and updated by Google and are not configurable by the developers.
  • App Engine instances are automatically updated/patched/restarted with zero downtime on a weekly basis.
  • Services used by the application (such as Cloud SQL databases) are also updated and managed by Google to ensure their security.

Application owners have full access to their application event information with monitoring and alerting flexibility for their own use. An enterprise-wide application logging and monitoring system is available.

  • All logs generated by the apps/services will be pulled into the read-only Google Stackdriver Logs Viewer.
  • App events can be alerted on with Google Stackdriver.
  • Performance data for app instances can be viewed in the GCP Console.
  • CircleCI can provide logs of what the app/infrastructure deployment automation is doing.
  • GCP Log Viewer provides logs from the other side of what CircleCI is doing with deploys and infrastructure changes, though perhaps without the context that CircleCI access might give you.
  • Google Stackdriver provides customizable alerts so that IAM changes or changes to infrastructure by something other than the terraform system account can raise alarms.
  • GCP has a security console that does anomaly and vulnerability detection, scanning, alerting, and other interesting services which should provide good visibility into the security posture of the project.
  • We believe that GSA ICE will be implementing some level of audit logging and alerting for users and permissions changes across the organization, but do not know specifics.
  • This project attempts to turn on our own audit logs within the project so that we can listen and alert on unusual activity (see the terraform sketch after this list).
  • GCP allows billing alerts to be configured, but we do not know if GSA ICE will be taking advantage of that.
  • There is a billing dashboard that can be accessed by Project Owners as well as GSA ICE.
  • Read-only log viewing access can be granted to anybody in the project, so temporary access could be granted to a developer or security engineer and then taken away. We would presume that log access would probably be granted at the organization level already, but don't know for sure.
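
For the audit-log item above, a minimal sketch of turning on Data Access audit logs project-wide, assuming the google provider's google_project_iam_audit_config resource (the project ID is hypothetical):

    # Enable all classes of audit logs for every service in the project.
    resource "google_project_iam_audit_config" "all" {
      project = "my-gcp-project-id"   # hypothetical project ID
      service = "allServices"

      audit_log_config {
        log_type = "ADMIN_READ"
      }
      audit_log_config {
        log_type = "DATA_READ"
      }
      audit_log_config {
        log_type = "DATA_WRITE"
      }
    }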

Images and components undergo automatic testing and are pre-approved by security and operations groups.

  • Google is continuously updating their App Engine instances and the containers they build the applications in to ensure the latest security patches are applied. Instances are updated and relaunched weekly with zero downtime.
  • Apps are not fully rolled out unless they pass a suite of tests that are defined by the developer.
  • Google runs security scans on deployed applications, and this project is also set up to run authenticated OWASP ZAP scans.

The platform automatically tests new patches on applications which run on it, informing the appropriate parties if decision points are reached (e.g., if a CVE is raised on an existing piece of software, the platform can automatically update that software, test it, and inform the application developers of the change if the tests pass or indicate that the patch needs to be applied in a particular timeframe). No downtime for patching.

  • Google is continuously updating their App Engine instances and the containers they build the applications in to ensure the latest security patches are applied. Instances are updated and relaunched weekly with zero downtime.
  • Apps are not fully rolled out unless they pass a suite of tests that are defined by the developer.
  • Google runs security scans on deployed applications, and this project is also set up to run authenticated OWASP ZAP scans.

Platform change is conducted through strictly defined processes with clear criteria defined that allow for rapid change; the platform automates changes and endeavors to impact the minimum number of application developers through that automation.

  • Changes to the apps and infrastructure are made through a defined gitops-style process, where particular branches are deployed automatically once an approved change lands.
  • Health checks are standard for every deployment. If health checks fail, traffic is routed to other healthy instances, and the failing instance is eventually relaunched.
  • Deploys are zero-downtime for production.

Version control is a key method of managing application lifecycle, and has well-defined standards for use, such that any user of the platform has a baseline that is shared across all applications.

  • All code for the projects deployed using this template resides in GitHub.
  • Changes to the apps and infrastructure are made through a defined gitops-style process, where particular branches are deployed automatically once an approved change lands.
  • All changes to the code will be required to have approvals on them with GitHub Protected Branches, so no unilateral unauthorized changes can happen, and all changes that are approved can be reviewed later on in case there is a problem.
  • GitHub verified (signed) commits could also be required for an additional level of assurance. We would expect most projects to allow anybody on the project to push into the dev branch, but require approvals on pull requests into the staging and master branches.

Development and operational environments are identical and immutable. Environments can be stood up and torn down via automation. All changes to the running system are logged and broadly conducted through scripting rather than actual access to the running system. All necessary tests, including security tests, are run as part of the deployment process. Development environments may be instantiated and torn down as needed.

  • Separate dev/staging/production GCP Projects are requested by default.
  • Separate GCP Projects allow for fine-grained access controls, so dev/staging/prod can each have an appropriate set of users with anywhere from full to read-only access.
  • Separate GCP Projects provide logical/physical resource separation, so there is no opportunity to move laterally from a dev environment into prod.
  • The dev/staging/production environments are generally identical, since they are deployed from the same code. The only real differences are the data in their databases, and perhaps some additional access granted for debugging in the dev environment.
  • App Engine instances and images are immutable. Once launched, an instance can be logged into for debugging purposes by authorized users, and the running instance may be changed during that session. This is not a normal workflow, and such changes are ephemeral, since instances are relaunched on every deploy and during the Google-managed weekly update.

The only manual steps to deployment are those explicitly designed to meet application expectations (e.g., not every push to the master branch necessarily indicates a release, but a product that could be released, if there is a business reason to not automatically update).

  • The only manual steps in this project are environment bootstrapping, change approvals, and infrastructure deployment approvals.
  • We depart from this directive a bit in that we automatically deploy whenever a branch changes. Our take is that if we don't want to deploy to production, those changes should queue up in the staging environment rather than being merged into production.

User management is self-service with appropriate security limitations. Secrets are created/shared between parts of the platform, without people needing to set/interact with them.

  • Users of the platform are currently managed by ICE. We hope to have Project Owner IAM roles given to the project owners, so that they can grant access to the different environments in a self-service way.
  • Users of the platform are required to use two-factor authentication (2FA) to get into the GCP Console and use GCP resources. The current second factor that must be used is a U2F/FIDO key.
  • Users of the applications will be managed by the application. We have an example of how to use login.gov to ensure that users are from gsa.gov, but application developers are free to use whatever system they need to fulfill their mission.
  • Terraform generates all secrets used in the system, so there is generally no opportunity for people to store this data insecurely, and this also makes it easy to rotate secrets.
  • All applications are configured through environment variables, so secrets and other config are generally ephemeral. The only place where secrets hit the disk is in the encrypted Cloud Storage bucket where terraform stores its state.

The platform manages availability for the application owners through automation based on application need. The platform provides direct insight into application health and performance. Applications can be seamlessly moved between hosting regions/zones in reaction to DR or threat activity.

  • App Engine automatically scales up and down the number of instances according to load.
  • Health checks are standard for every deployment. If health checks fail, traffic is routed to other healthy instances and the failing instance is eventually relaunched, so all apps are automatically HA.
  • The production databases have daily backups automatically scheduled for them, and are also configured for HA and failover (see the sketch after this list).
  • Because the infrastructure is all code, Disaster Recovery into another region should be simple, requiring only a region change in the terraform code and a restore of a database backup from the old region.
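
A sketch of the database side of this, assuming Cloud SQL managed through terraform's google_sql_database_instance resource (names, versions, and tiers are hypothetical). The region is just another input, which is what makes the DR story a small code change:

    variable "region" {
      default = "us-east1"   # change and re-apply to move regions for DR
    }

    resource "google_sql_database_instance" "db" {
      name             = "app-db"        # hypothetical
      database_version = "POSTGRES_9_6"
      region           = var.region

      settings {
        tier              = "db-f1-micro"
        availability_type = "REGIONAL"   # HA with automatic failover

        backup_configuration {
          enabled    = true
          start_time = "03:00"           # daily backup window (UTC)
        }
      }
    }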

The platform governs the overarching infrastructure supporting the applications, with defined and assessed separation of network concerns. Application owners can make limited changes to their network environment, sufficient to self-manage the deployment of their applications and the creation of new application components, on their own, with appropriate compliance checks.

  • This is currently a bit of a weak spot: Google App Engine does not provide very good networking controls.
  • We hope to leverage GSA IT Security's relationship with the GCP folks to learn what they recommend and/or have coming down the pipe.
  • What we hope to show is fine-grained networking controls applied to both inbound and outbound connections.
  • We expect that networking controls will be applied through terraform, and thus be easily auditable for compliance; a sketch of what that could look like follows this list.
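
For what terraform-managed controls look like today, the google provider's google_app_engine_firewall_rule resource can at least pin down inbound sources; outbound control remains the open question described above. The CIDR below is a hypothetical example:

    # Allow inbound traffic only from a trusted network range. The built-in
    # default rule (priority 2147483647) can then be set to DENY.
    resource "google_app_engine_firewall_rule" "allow_agency" {
      project      = "my-gcp-project-id"   # hypothetical project ID
      priority     = 100
      action       = "ALLOW"
      source_range = "203.0.113.0/24"      # hypothetical trusted range
    }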

ATO processes are highly automated. Compliant code and process is reused by multiple teams. All ATOs take the same amount of time for the same system and frequent deployment is only interrupted when specific risk triggers are raised through automation. Controls can be continuously monitored and measured with automation.

  • We are using OpenControl and Compliance Masonry to automate the collection of most of the information required to get a GSA LATO.
  • The compliance documentation is meant to be created along with the code, which means that the compliance information ought to be easier to keep up to date because there are tools that you can use to collect and audit it.
  • It should be easy to tag an ATO'ed revision in GitHub and then review diffs between the tag and what is in staging, to see on an ongoing basis whether anything requires an SCR.
  • More documentation on this can be found in the Compliance documentation for the template.

Backup and data lifecycle management allows application developers to ensure that their data is maintained over time and, in the case of failure of any subsystems, that it can be recovered with potentially some gap in transactional data. Lifecycle management of the data includes capabilities to archive and manage data over a long lifetime.

  • The production databases have daily backups automatically scheduled for them, and also are configured for HA and failover.
  • All changes to the code are retained in GitHub.

Onboarding is largely self-service (within appropriate legal limits), and application owners have full access to their expenditures at any time. Application owners can set triggers on expenditures to manage their costs appropriately.

  • User/Role management is done by the Project Owners, who will request role changes related to offboarding from the GSA ICE group.
  • Billing is configured by GSA ICE.
  • GCP allows billing alerts to be configured, but we do not know if GSA ICE will be taking advantage of this feature (a sketch of a project-level budget follows this list).
  • There is a billing dashboard that can be accessed by Project Owners as well as GSA ICE.
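
If Project Owners are eventually allowed to manage this themselves, a budget alert can also be expressed in terraform. This is a hedged sketch assuming the google-beta provider's google_billing_budget resource; the billing account, project ID, and amounts are all hypothetical:

    # Alert at 50% and 90% of a $1000 monthly budget.
    resource "google_billing_budget" "monthly" {
      provider        = google-beta
      billing_account = "000000-000000-000000"      # hypothetical
      display_name    = "monthly-budget"

      budget_filter {
        projects = ["projects/my-gcp-project-id"]   # hypothetical project ID
      }

      amount {
        specified_amount {
          currency_code = "USD"
          units         = "1000"
        }
      }

      threshold_rules {
        threshold_percent = 0.5
      }
      threshold_rules {
        threshold_percent = 0.9
      }
    }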

Known Issues

There are a few known issues with the project:

  • The GSA ICE team issues GCP Projects such that the Project Owner is only granted specific roles like roles/appengine.appAdmin. This means that they will be unable to do things like activate Stackdriver alerting, create service accounts, or access the Security Command Center. As a result, most projects will be unable to deploy without getting GSA ICE to run commands on the Project Owner's behalf.

    It seems like it would be better to set up good auditing/logging of access-control events and let the Project Owner set up the environment and manage their users, or at least to let them bootstrap the environment and get everything set up, and then lower their access once things are running.

    We hope that these requirements will be relaxed down the road as everybody gets more experience with the platform. In the meantime, we have written some scripts for the ICE people to run to enable access in the gcp-appengine-template/gcp_setup directory.

  • Networking is not very customizable in App Engine:

    • Limiting outbound access seems close to impossible with App Engine: the App Engine firewall operates only on inbound traffic. We hope to set up some sort of anomaly detection that kills apps if they make requests outside of the project, but the best thing to do would be to ask the GCP people what they expect people to do.
    • Limiting access from the outside world to services protected by the oauth2_proxy seems to be hard. Inbound filtering does not seem selective enough to apply to a single service, so app implementers need to check for a properly signed GAP-Authentication header themselves. That is an easy thing to forget or do improperly, so it would be good to talk with GCP to understand whether better networking controls are coming.
  • Logs are retained on this schedule:

    • Admin Activity audit logs: 400 days
    • Data Access audit logs: 30 days
    • System Event audit logs: 400 days
    • Access Transparency logs: 400 days
    • Logs other than audit logs or Access Transparency logs: 30 days

    If this retention is not enough, we may need to set up a storage bucket for log archival and automate that process; a terraform sketch of that follows.
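
If longer retention turns out to be necessary, a minimal sketch of that archival process could look like this (bucket and sink names are hypothetical):

    # A bucket to hold archived logs.
    resource "google_storage_bucket" "log_archive" {
      name     = "my-project-log-archive"   # hypothetical; must be globally unique
      location = "US"
    }

    # Export the project's logs to the bucket via a project-level sink.
    resource "google_logging_project_sink" "archive" {
      name                   = "log-archive"
      destination            = "storage.googleapis.com/${google_storage_bucket.log_archive.name}"
      unique_writer_identity = true
    }

    # The sink writes as its own service account, which needs permission
    # to create objects in the bucket.
    resource "google_storage_bucket_iam_member" "sink_writer" {
      bucket = google_storage_bucket.log_archive.name
      role   = "roles/storage.objectCreator"
      member = google_logging_project_sink.archive.writer_identity
    }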