# **Chapter 14: Securing Cloud Infrastructure**

## Introduction: Defense in Depth for the Cloud Native Era

While Chapter 13 established identity as the primary security perimeter, robust cloud security requires comprehensive protection of the underlying infrastructure—compute instances, container orchestration platforms, storage systems, and network configurations that host applications and data. Identity controls determine *who* can access resources, but infrastructure security ensures *what* they access is hardened against exploitation, monitored for anomalies, and resilient against attack.

The cloud's programmatic nature creates both opportunities and challenges for infrastructure security. On one hand, infrastructure as code lets security policies be codified, tested, and enforced before deployment, which greatly reduces the configuration drift that plagues traditional data centers. On the other hand, the speed of cloud provisioning and the complexity of distributed architectures create expansive attack surfaces where misconfigurations (exposed storage buckets, overly permissive security groups, or unpatched container images) can be exploited within minutes of deployment.

This chapter implements defense-in-depth strategies across the infrastructure stack. We will architect network security controls that move beyond simple perimeter defense to micro-segmentation, harden compute resources against emerging threats through automated vulnerability management and configuration baselines, protect sensitive data through comprehensive encryption and secrets management, and establish observable security postures through centralized logging, behavioral analytics, and automated remediation. These controls transform the theoretical security architecture of Chapter 12 into operational reality.

---

## 14.1 Network Security Architecture: Beyond the Perimeter

Traditional network security relied on a hard exterior shell around a soft interior: once past the firewall, attackers could move laterally with little resistance. Cloud-native network security instead applies zero-trust principles, in which traffic is filtered at every hop, connections are authenticated, and segmentation occurs at the workload level.

### 14.1.1 Virtual Private Cloud (VPC) Design Patterns

**Segmented VPC Architecture:**
Isolating workloads by function and sensitivity level prevents lateral movement and limits blast radius.

**Terraform Implementation: Multi-Tier VPC with Network Segmentation**

```hcl
# Comprehensive VPC architecture with security zones
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "production-vpc"
  cidr = "10.0.0.0/16"

  azs             = ["us-east-1a", "us-east-1b", "us-east-1c"]
  
  # Public tier: Load balancers and bastion hosts only
  public_subnets  = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  
  # Private tier: Application workloads
  private_subnets = ["10.0.4.0/24", "10.0.5.0/24", "10.0.6.0/24"]
  
  # Database tier: Highly restricted
  database_subnets = ["10.0.7.0/24", "10.0.8.0/24", "10.0.9.0/24"]

  # Tier tags used for policy targeting and audit (NACLs are defined separately)
  public_subnet_tags = {
    Tier = "public"
    Compliance = "standard"
  }
  
  private_subnet_tags = {
    Tier = "private"
    Compliance = "sensitive"
  }

  database_subnet_tags = {
    Tier = "restricted"
    Compliance = "critical"
  }

  # VPC Flow Logs for traffic analysis
  enable_flow_log                      = true
  create_flow_log_cloudwatch_iam_role  = true
  create_flow_log_cloudwatch_log_group = true
  flow_log_max_aggregation_interval    = 60
  
  # DNS security
  enable_dns_hostnames = true
  enable_dns_support   = true
  
  # VPC endpoints for private AWS service access (no internet traversal).
  # Version 5.x of this module no longer accepts per-service flags such as
  # enable_ec2_endpoint; define endpoints with the module's vpc-endpoints
  # submodule or with aws_vpc_endpoint resources instead.
}

# Transit Gateway for multi-VPC connectivity with inspection
resource "aws_ec2_transit_gateway" "main" {
  description                     = "Centralized routing hub"
  auto_accept_shared_attachments  = "disable"
  default_route_table_association = "disable"
  default_route_table_propagation = "disable"
  
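  # ECMP across VPN attachments is not needed here. Appliance mode for stateful
  # inspection is enabled on the firewall VPC attachment (appliance_mode_support).
  vpn_ecmp_support = "disable"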
  
  tags = {
    Name = "security-tgw"
  }
}

# Transit Gateway Route Table for traffic inspection
resource "aws_ec2_transit_gateway_route_table" "inspection" {
  transit_gateway_id = aws_ec2_transit_gateway.main.id
  
  tags = {
    Name = "inspection-rt"
  }
}

# Route all inter-VPC traffic through Network Firewall
resource "aws_ec2_transit_gateway_route" "to_firewall" {
  destination_cidr_block         = "10.0.0.0/8"
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.inspection.id
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.firewall.id
}
```

**Security Controls Implemented:**
- **Subnet Isolation:** Database subnets have no internet gateway route, requiring all access through application tier
- **DNS Security:** Private DNS ensures internal service names resolve without public internet exposure
- **VPC Endpoints:** Private connectivity to AWS services prevents data from traversing public internet
- **Transit Gateway:** Centralized routing enables traffic inspection between VPCs without complex peering
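
Subnet isolation depends on the tier CIDR ranges being disjoint and contained in the VPC range. The following is a minimal sketch of a pre-deployment check using only Python's standard `ipaddress` module. The function name `validate_subnets` and the `TIERS` dictionary are illustrative assumptions; the CIDR values are copied from the Terraform listing above.

```python
import ipaddress

# Values copied from the Terraform module above
VPC_CIDR = "10.0.0.0/16"
TIERS = {
    "public": ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"],
    "private": ["10.0.4.0/24", "10.0.5.0/24", "10.0.6.0/24"],
    "database": ["10.0.7.0/24", "10.0.8.0/24", "10.0.9.0/24"],
}


def validate_subnets(vpc_cidr, tiers):
    """Return a list of problems; an empty list means the plan is consistent."""
    vpc = ipaddress.ip_network(vpc_cidr)
    problems = []
    seen = []  # (tier, network) pairs that were already checked
    for tier, cidrs in tiers.items():
        for cidr in cidrs:
            net = ipaddress.ip_network(cidr)
            if not net.subnet_of(vpc):
                problems.append(f"{tier} subnet {net} is outside {vpc}")
            for other_tier, other in seen:
                if net.overlaps(other):
                    problems.append(
                        f"{tier} subnet {net} overlaps {other_tier} subnet {other}"
                    )
            seen.append((tier, net))
    return problems


if __name__ == "__main__":
    print(validate_subnets(VPC_CIDR, TIERS) or "subnet plan is consistent")
```

A check like this could run in CI before `terraform plan`, so that an overlapping or misplaced subnet is caught before any resource is created.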

### 14.1.2 Security Groups and Network ACLs: Stateful vs. Stateless

**Security Groups (Stateful):**
Operate at the instance (network interface) level and are stateful: return traffic for an allowed connection is permitted automatically. They serve as virtual firewalls controlling inbound and outbound traffic.

**Network ACLs (Stateless):**
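Operate at the subnet level, providing an additional layer of defense. Because they are stateless, they require explicit rules for both request and response traffic, and rules are evaluated in ascending rule-number order until one matches.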

**Defense in Depth Implementation:**

```hcl
# Application tier security group - restrictive by default
resource "aws_security_group" "application_tier" {
  name_prefix = "app-tier-"
  description = "Security group for application servers"
  vpc_id      = module.vpc.vpc_id

  # Only accept HTTPS from load balancer
  ingress {
    description     = "HTTPS from ALB only"
    from_port       = 443
    to_port         = 443
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }

  # Application health checks
  ingress {
    description = "Health checks from ALB"
    from_port   = 8080
    to_port     = 8080
    protocol    = "tcp"
    security_groups = [aws_security_group.alb.id]
  }

  # Outbound to database tier only
  egress {
    description     = "PostgreSQL to DB tier"
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    cidr_blocks     = [module.vpc.database_subnets_cidr_blocks[0]]  # Specific subnet only
  }

  # Outbound to AWS APIs via VPC endpoint
  egress {
    description     = "HTTPS to VPC endpoints"
    from_port       = 443
    to_port         = 443
    protocol        = "tcp"
    prefix_list_ids = [aws_vpc_endpoint.s3.prefix_list_id]
  }

  tags = {
    Name = "application-tier-sg"
    Tier = "application"
  }

  lifecycle {
    create_before_destroy = true
  }
}

# Network ACL for database subnets - additional subnet-level protection
resource "aws_network_acl" "database" {
  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.database_subnets

  # Explicit deny of common attack ports at subnet level
  egress {
    protocol   = "tcp"
    rule_no    = 100
    action     = "deny"
    cidr_block = "0.0.0.0/0"
    from_port  = 22    # SSH should never leave database tier
    to_port    = 22
  }

  # Allow PostgreSQL from application tier only
  ingress {
    protocol   = "tcp"
    rule_no    = 100
    action     = "allow"
    cidr_block = module.vpc.private_subnets_cidr_blocks[0]  # First app subnet only; repeat this rule (and the return rule) per app subnet
    from_port  = 5432
    to_port    = 5432
  }

  # Allow return traffic (stateless requirement)
  egress {
    protocol   = "tcp"
    rule_no    = 200
    action     = "allow"
    cidr_block = module.vpc.private_subnets_cidr_blocks[0]
    from_port  = 1024
    to_port    = 65535  # Ephemeral ports
  }

  # Explicit deny all other ingress
  ingress {
    protocol   = "-1"
    rule_no    = 32766
    action     = "deny"
    cidr_block = "0.0.0.0/0"
    from_port  = 0
    to_port    = 0
  }

  tags = {
    Name = "database-nacl"
  }
}
```

**Key Security Principles:**
- **Security Group References:** Instead of CIDR blocks, reference other security group IDs—if the ALB changes IP, the rule remains valid
- **Minimal Outbound:** Application servers can only reach the database subnet on port 5432 and AWS services via VPC endpoints—no general internet access
- **NACL as Safety Net:** Even if a security group is misconfigured to allow 0.0.0.0/0, the NACL provides subnet-level blocking
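
To make the stateless behavior concrete, here is a simplified model of NACL evaluation: rules are checked in ascending rule-number order, the first match decides, and unmatched traffic is denied. This is an illustrative sketch, not AWS's implementation. The `NaclRule` class and `evaluate` function are assumed names, and the rules mirror the egress side of the database NACL above (with `10.0.4.0/24` standing in for the first application subnet).

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class NaclRule:
    rule_no: int
    action: str  # "allow" or "deny"
    cidr: str
    from_port: int
    to_port: int


def evaluate(rules, ip, port):
    """First matching rule (lowest rule number) wins; no match means deny."""
    addr = ipaddress.ip_address(ip)
    for rule in sorted(rules, key=lambda r: r.rule_no):
        in_cidr = addr in ipaddress.ip_network(rule.cidr)
        in_ports = rule.from_port <= port <= rule.to_port
        if in_cidr and in_ports:
            return rule.action
    return "deny"


# Egress rules mirroring the database NACL above
EGRESS = [
    NaclRule(100, "deny", "0.0.0.0/0", 22, 22),
    NaclRule(200, "allow", "10.0.4.0/24", 1024, 65535),
]

if __name__ == "__main__":
    # A PostgreSQL response goes back to the client's ephemeral port
    print(evaluate(EGRESS, "10.0.4.15", 49152))
    # Without rule 200 the same response is dropped, because NACLs keep no state
    print(evaluate(EGRESS[:1], "10.0.4.15", 49152))
```

The second call shows why the ephemeral-port rule exists: a security group would let the response through automatically, but a NACL treats it as new outbound traffic.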

### 14.1.3 Web Application Firewall (WAF) and DDoS Protection

**AWS WAFv2 with Managed Rules:**

```hcl
resource "aws_wafv2_web_acl" "main" {
  name        = "production-protection"
  description = "WAF rules for production application"
  scope       = "REGIONAL"

  default_action {
    allow {}  # Allow by default; the rules below block known-bad traffic
  }

  # AWS Managed Rules - Core Rule Set (OWASP Top 10)
  rule {
    name     = "AWSManagedRulesCommonRuleSet"
    priority = 1

    override_action {
      none {}
    }

    statement {
      managed_rule_group_statement {
        name        = "AWSManagedRulesCommonRuleSet"
        vendor_name = "AWS"
        
        rule_action_override {
          action_to_use {
            count {}  # Monitor only for this rule initially
          }
          name = "SizeRestrictions_BODY"
        }
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = true
      metric_name                = "AWSManagedRulesCommonRuleSetMetric"
      sampled_requests_enabled   = true
    }
  }

  # Rate Limiting - Prevent brute force and scraping
  rule {
    name     = "RateLimitRule"
    priority = 2

    action {
      block {}
    }

    statement {
      rate_based_statement {
        limit              = 2000  # Requests per 5 minutes per IP
        aggregate_key_type = "IP"
        
        scope_down_statement {
          not_statement {
            statement {
              ip_set_reference_statement {
                arn = aws_wafv2_ip_set.whitelist.arn
              }
            }
          }
        }
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = true
      metric_name                = "RateLimitRuleMetric"
      sampled_requests_enabled   = true
    }
  }

  # Custom Rule - Block specific attack patterns
  rule {
    name     = "BlockBadBots"
    priority = 3

    action {
      block {}
    }

    statement {
      regex_pattern_set_reference_statement {
        arn = aws_wafv2_regex_pattern_set.bad_bots.arn
        field_to_match {
          single_header {
            name = "user-agent"
          }
        }
        text_transformation {
          priority = 0
          type     = "LOWERCASE"
        }
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = true
      metric_name                = "BlockBadBotsMetric"
      sampled_requests_enabled   = true
    }
  }

  # Geographic blocking
  rule {
    name     = "GeoBlockRule"
    priority = 4

    action {
      block {}
    }

    statement {
      geo_match_statement {
        country_codes = ["KP", "IR", "SY", "CU"]  # Sanctioned countries
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = true
      metric_name                = "GeoBlockRuleMetric"
      sampled_requests_enabled   = true
    }
  }

  visibility_config {
    cloudwatch_metrics_enabled = true
    metric_name                = "production-waf-metric"
    sampled_requests_enabled   = true
  }

  tags = {
    Environment = "production"
    Compliance  = "pci-dss"
  }
}

# Associate WAF with ALB
resource "aws_wafv2_web_acl_association" "main" {
  resource_arn = aws_lb.application.arn
  web_acl_arn  = aws_wafv2_web_acl.main.arn
}

# AWS Shield Advanced for DDoS protection
resource "aws_shield_protection" "alb" {
  name         = "alb-shield-protection"
  resource_arn = aws_lb.application.arn
  
  tags = {
    Purpose = "DDoS-Protection"
  }
}
```

**Security Capabilities:**
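- **Virtual Patching:** Block common exploits such as XSS at the load balancer before they reach application code (SQL injection coverage additionally requires the `AWSManagedRulesSQLiRuleSet` managed rule group)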
- **Rate Limiting:** Prevent credential stuffing attacks by limiting requests per IP
- **Bot Filtering:** Block requests whose `User-Agent` header matches known malicious scanners while leaving legitimate crawlers untouched
- **Geofencing:** Block traffic from high-risk geographic regions
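
The rate-limiting rule can be understood as a per-IP counter over a sliding time window. The sketch below is a simplified mental model only. AWS does not publish its internal counting algorithm, and the class name `SlidingWindowLimiter` is an assumption made for illustration.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per key within `window_seconds`."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self._hits = defaultdict(deque)  # key -> timestamps of accepted requests

    def allow(self, key, now):
        hits = self._hits[key]
        # Drop timestamps that have fallen out of the window
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


if __name__ == "__main__":
    # Same shape as the WAF rule: 2000 requests per 5 minutes per IP
    limiter = SlidingWindowLimiter(limit=2000, window_seconds=300)
    accepted = sum(limiter.allow("203.0.113.7", now=0.0) for _ in range(2500))
    print(accepted)
```

Keying the counter by source IP matches `aggregate_key_type = "IP"` in the rule above; a client behind a shared NAT address would share one budget with its neighbors, which is worth considering when choosing the limit.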

---

## 14.2 Compute Hardening and Vulnerability Management

Compute instances—whether EC2, Azure VMs, or GCE instances—require continuous hardening to mitigate vulnerabilities in operating systems, middleware, and application runtimes.

### 14.2.1 CIS Benchmark Compliance

The Center for Internet Security (CIS) provides prescriptive configuration guidelines (benchmarks) for secure operating system and container configurations.

**AWS Systems Manager (SSM) for Automated Hardening:**

```yaml
# SSM Association to apply CIS Level 1 benchmarks
Resources:
  CISComplianceAssociation:
    Type: AWS::SSM::Association
    Properties:
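      # Document name as given in this example; confirm that it exists in your
      # account and Region, or substitute your own hardening document
      Name: AWS-RunCISBenchmark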
      Parameters:
        benchmark:
          - cis
        level:
          - level-1
      Targets:
        - Key: tag:ComplianceLevel
          Values:
            - required
      ScheduleExpression: rate(7 days)  # Weekly compliance checks
      ComplianceSeverity: HIGH

  # Remediation automation for non-compliant instances
  RemediationRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: ssm.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonSSMAutomationRole

  # Automation document to remediate common findings
  HardeningAutomation:
    Type: AWS::SSM::Document
    Properties:
      DocumentType: Automation
      Content:
        schemaVersion: '0.3'
        description: Remediate CIS violations
        parameters:
          InstanceId:
            type: String
        mainSteps:
          - name: DisablePasswordAuthentication
            action: 'aws:runCommand'
            inputs:
              DocumentName: AWS-RunShellScript
              InstanceIds:
                - '{{ InstanceId }}'
              Parameters:
                commands:
                  - sudo sed -i -E 's/^#?[[:space:]]*PasswordAuthentication[[:space:]]+.*/PasswordAuthentication no/' /etc/ssh/sshd_config
                  - sudo systemctl restart sshd
          
          - name: EnableAutomaticUpdates
            action: 'aws:runCommand'
            inputs:
              DocumentName: AWS-RunShellScript
              InstanceIds:
                - '{{ InstanceId }}'
              Parameters:
                commands:
                  - sudo yum install -y yum-cron
                  - sudo systemctl enable yum-cron
                  - sudo systemctl start yum-cron
```
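
The SSH hardening step edits `sshd_config` in place. A remediation of this kind should cope with the directive being commented out, explicitly enabled, or missing altogether. The following sketch shows that logic in Python so it can be unit-tested before being rolled out. The function name `disable_password_auth` is an assumption, and appending at the end of the file is a simplification that does not account for `Match` blocks.

```python
import re

# Matches the directive whether or not it is commented out
_PATTERN = re.compile(
    r"^[ \t]*#?[ \t]*PasswordAuthentication\b.*$",
    re.MULTILINE | re.IGNORECASE,
)


def disable_password_auth(config_text):
    """Return config text in which password authentication is disabled."""
    replaced, count = _PATTERN.subn("PasswordAuthentication no", config_text)
    if count == 0:
        # Directive absent: append it (simplification, ignores Match blocks)
        replaced = config_text.rstrip("\n") + "\nPasswordAuthentication no\n"
    return replaced


if __name__ == "__main__":
    sample = "Port 22\n#PasswordAuthentication yes\n"
    print(disable_password_auth(sample))
```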

**Terraform for AMI Hardening Pipeline:**

```hcl
# EC2 Image Builder component for golden AMI creation with CIS hardening
resource "aws_imagebuilder_component" "cis_hardening" {
  name        = "cis-hardening-component"
  platform    = "Linux"
  version     = "1.0.0"
  description = "Apply CIS Level 1 hardening"

  data = yamlencode({
    schemaVersion = 1.0
    phases = [{
      name = "build"
      steps = [
        {
          name = "InstallCISBenchmark"
          action = "ExecuteBash"
          inputs = {
            commands = [
              "yum install -y scap-security-guide",
              "oscap xccdf eval --profile xccdf_org.ssgproject.content_profile_cis_level1_server --remediate /usr/share/xml/scap/ssg/content/ssg-amazonlinux2-ds.xml",
              "oscap xccdf eval --profile xccdf_org.ssgproject.content_profile_cis_level1_server --results /var/log/cis-scan.xml /usr/share/xml/scap/ssg/content/ssg-amazonlinux2-ds.xml"
            ]
          }
        },
        {
          name = "ConfigureCloudWatchAgent"
          action = "ExecuteBash"
          inputs = {
            commands = [
              "cat > /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json << 'EOF'\n{\n  \"metrics\": {\n    \"namespace\": \"CISHardening\",\n    \"metrics_collected\": {\n      \"disk\": {\n        \"measurement\": [\"used_percent\"],\n        \"resources\": [\"*\"]\n      }\n    }\n  }\n}\nEOF",
              "/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -s -c file:/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json"
            ]
          }
        }
      ]
    }]
  })
}

# Image Recipe combining base AMI with hardening
resource "aws_imagebuilder_image_recipe" "hardened" {
  name         = "cis-hardened-amazon-linux"
  parent_image = "arn:aws:imagebuilder:us-east-1:aws:image/amazon-linux-2-x86/x.x.x"
  version      = "1.0.0"

  component {
    component_arn = aws_imagebuilder_component.cis_hardening.arn
  }

  block_device_mapping {
    device_name = "/dev/xvda"

    ebs {
      delete_on_termination = true
      volume_size           = 20
      volume_type           = "gp3"
      encrypted             = true
      kms_key_id            = aws_kms_key.ebs_encryption.arn
    }
  }
}
```

### 14.2.2 Vulnerability Scanning and Patch Management

**Continuous Vulnerability Assessment:**

```python
# Lambda function triggered by Inspector findings
import boto3
import json
from datetime import datetime

def remediate_vulnerability(event, context):
    """
    Auto-remediate critical vulnerabilities found by Amazon Inspector
    """
    # EventBridge delivers `detail` as an already-parsed dict, not a JSON string
    finding = event['detail']
    
    if finding['severity'] != 'CRITICAL':
        return {"status": "skipped", "reason": "Not critical severity"}
    
    instance_id = finding['resources'][0]['id']
    vulnerability = finding['title']
    # Inspector finding fields use camelCase names
    cve_id = finding['packageVulnerabilityDetails']['vulnerabilityId']
    
    ssm = boto3.client('ssm')
    ec2 = boto3.client('ec2')
    
    try:
        # Create snapshot before patching
        volumes = ec2.describe_volumes(
            Filters=[
                {'Name': 'attachment.instance-id', 'Values': [instance_id]},
                {'Name': 'attachment.device', 'Values': ['/dev/xvda']}
            ]
        )
        
        if volumes['Volumes']:
            snapshot = ec2.create_snapshot(
                VolumeId=volumes['Volumes'][0]['VolumeId'],
                Description=f"Pre-patch snapshot for {cve_id} remediation"
            )
            
            # Wait for snapshot completion logic here...
        
        # Apply patch via SSM Run Command
        response = ssm.send_command(
            InstanceIds=[instance_id],
            DocumentName="AWS-RunPatchBaseline",
            Parameters={
                "Operation": ["Install"],
                "RebootOption": ["RebootIfNeeded"]
            },
            Comment=f"Auto-remediation for {cve_id}"
        )
        
        # Tag instance for tracking
        ec2.create_tags(
            Resources=[instance_id],
            Tags=[
                {'Key': 'LastPatched', 'Value': datetime.utcnow().isoformat()},
                {'Key': 'VulnerabilityRemediated', 'Value': cve_id}
            ]
        )
        
        return {
            "status": "remediated",
            "instance": instance_id,
            "command_id": response['Command']['CommandId']
        }
        
    except Exception as e:
        # Notify security team for manual remediation
        sns = boto3.client('sns')
        sns.publish(
            TopicArn='arn:aws:sns:us-east-1:123456789012:vulnerability-alerts',
            Subject=f"Failed auto-remediation for {instance_id}",
            Message=json.dumps({
                'instance': instance_id,
                'vulnerability': cve_id,
                'error': str(e)
            })
        )
        raise
```
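
The handler above leaves the snapshot wait as a placeholder. In boto3 the usual approach is the built-in waiter, `ec2.get_waiter("snapshot_completed")`. The sketch below shows the underlying polling logic in a form that can be tested without AWS access. The function `wait_for_snapshot` and its `get_state` callback are hypothetical names; in a real handler `get_state` would wrap `describe_snapshots` and return the snapshot's `State` field.

```python
import time


def wait_for_snapshot(get_state, snapshot_id, timeout_s=600, interval_s=15,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll until the snapshot completes. Returns False on timeout."""
    deadline = clock() + timeout_s
    while True:
        state = get_state(snapshot_id)
        if state == "completed":
            return True
        if state == "error":
            raise RuntimeError(f"snapshot {snapshot_id} failed")
        if clock() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    # Fake state source standing in for the EC2 API
    states = iter(["pending", "pending", "completed"])
    done = wait_for_snapshot(lambda _id: next(states), "snap-example",
                             sleep=lambda _s: None)
    print(done)
```

Because a Lambda invocation has a hard execution limit, the timeout here should stay well below the function's configured timeout; for large volumes, patching only after a separate snapshot-completed event may be a better fit than waiting inline.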

---

## 14.3 Container and Serverless Security

Containerized workloads require specialized security controls addressing image vulnerabilities, runtime threats, and orchestration platform security.

### 14.3.1 Container Image Security

**Vulnerability Scanning with Amazon ECR:**

```hcl
resource "aws_ecr_repository" "application" {
  name                 = "production-application"
  image_tag_mutability = "IMMUTABLE"  # Prevent tag overwriting

  image_scanning_configuration {
    scan_on_push = true  # Automatic scanning on every push
  }

  encryption_configuration {
    encryption_type = "KMS"
    kms_key         = aws_kms_key.ecr_encryption.arn
  }

  force_delete = false
}

# Lifecycle policy to cap the number of retained production images
resource "aws_ecr_lifecycle_policy" "application" {
  repository = aws_ecr_repository.application.name

  # Lifecycle rules select images by tag, count, or age only; they cannot
  # filter on scan findings. A catch-all rule with countNumber = 1 would expire
  # every image except the newest, so removal of vulnerable images is left to
  # a Lambda that checks scan results (see the scan enforcer below).
  policy = jsonencode({
    rules = [
      {
        rulePriority = 1
        description  = "Keep last 30 production images"
        selection = {
          tagStatus     = "tagged"
          tagPrefixList = ["prod"]
          countType     = "imageCountMoreThan"
          countNumber   = 30
        }
        action = {
          type = "expire"
        }
      }
    ]
  })
}

# Lambda to block deployment of vulnerable images
resource "aws_lambda_function" "image_scan_remediation" {
  filename      = "scan_remediation.zip"
  function_name = "ecr-scan-enforcer"
  role          = aws_iam_role.lambda_role.arn
  handler       = "index.handler"
  runtime       = "python3.11"

  environment {
    variables = {
      SEVERITY_THRESHOLD = "HIGH"
      SLACK_WEBHOOK_URL  = "https://hooks.slack.com/services/..."
    }
  }
}

# EventBridge rule to trigger on scan completion
resource "aws_cloudwatch_event_rule" "scan_completion" {
  name        = "ecr-scan-completion"
  description = "Trigger on ECR image scan completion"

  event_pattern = jsonencode({
    source      = ["aws.ecr"]
    detail-type = ["ECR Image Scan"]
    detail = {
      scan-status = ["COMPLETE"]
    }
  })
}

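resource "aws_cloudwatch_event_target" "lambda_target" {
  rule      = aws_cloudwatch_event_rule.scan_completion.name
  target_id = "ScanRemediation"
  arn       = aws_lambda_function.image_scan_remediation.arn
}

# Allow EventBridge to invoke the function; without this the target never fires
resource "aws_lambda_permission" "allow_eventbridge" {
  statement_id  = "AllowExecutionFromEventBridge"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.image_scan_remediation.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.scan_completion.arn
}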
```
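
The Terraform above wires the scan-completion event to a Lambda function but does not show the decision logic. A minimal sketch of that logic follows, assuming the function compares the event's `finding-severity-counts` map against the `SEVERITY_THRESHOLD` environment variable. The function name `should_block` and the sample event values are illustrative, not taken from a real scan.

```python
# Ordered from least to most severe
SEVERITY_ORDER = ["INFORMATIONAL", "LOW", "MEDIUM", "HIGH", "CRITICAL"]


def should_block(event, threshold="HIGH"):
    """True if the scan reports any finding at or above the threshold."""
    counts = event.get("detail", {}).get("finding-severity-counts", {})
    min_rank = SEVERITY_ORDER.index(threshold)
    return any(
        count > 0
        and severity in SEVERITY_ORDER
        and SEVERITY_ORDER.index(severity) >= min_rank
        for severity, count in counts.items()
    )


# Illustrative event in the shape of an ECR image scan notification
SAMPLE_EVENT = {
    "source": "aws.ecr",
    "detail-type": "ECR Image Scan",
    "detail": {
        "scan-status": "COMPLETE",
        "repository-name": "production-application",
        "finding-severity-counts": {"CRITICAL": 1, "MEDIUM": 4},
    },
}

if __name__ == "__main__":
    print(should_block(SAMPLE_EVENT, threshold="HIGH"))
```

Severities outside the ordered list are ignored here rather than treated as blocking; a stricter policy could fail closed on unknown values instead.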

**Falco for Runtime Threat Detection (EKS):**

```yaml
# Falco rules for EKS runtime security
apiVersion: v1
kind: ConfigMap
metadata:
  name: falco-rules
  namespace: falco
data:
  custom_rules.yaml: |
    - rule: Unauthorized K8s API Access
      desc: Detect attempts to access Kubernetes API from unauthorized pods
      condition: >
        spawned_process and
        (proc.name in (kubectl, helm) or
         proc.cmdline contains "kubernetes.default")
      output: >
        Unauthorized Kubernetes API access
        (user=%user.name command=%proc.cmdline pod=%k8s.pod.name namespace=%k8s.ns.name)
      priority: CRITICAL

    - rule: Sensitive File Access
      desc: Detect access to sensitive files (/etc/shadow, /etc/passwd)
      condition: >
        open_read and
        (fd.name contains "/etc/shadow" or
         fd.name contains "/etc/passwd" or
         fd.name contains "/etc/kubernetes/pki")
        and not proc.name in (passwd, shadow)
      output: >
        Sensitive file accessed
        (user=%user.name file=%fd.name command=%proc.cmdline pod=%k8s.pod.name)
      priority: ERROR

    - rule: Outbound Connection from Database Pod
      desc: Database pods should not make outbound connections (data exfiltration risk)
      condition: >
        outbound and
        k8s.pod.label.app in (postgres, mysql, mongodb) and
        not (fd.snet in ("10.0.0.0/8", "172.16.0.0/12"))
      output: >
        Database pod initiated external connection
        (connection=%fd.name pod=%k8s.pod.name namespace=%k8s.ns.name)
      priority: EMERGENCY

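    # Exceptions to the web shell rule. Left empty (never_true comes from the
    # default Falco ruleset) so that every shell spawned by a web server alerts;
    # add narrowly scoped known-good command lines here if needed.
    - macro: allowed_web_shell_commands
      condition: (never_true)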

    - rule: Web Shell Execution
      desc: Detect reverse shells or web shells in application containers
      condition: >
        spawned_process and
        shell_procs and
        proc.pname in (nginx, apache, httpd) and
        not allowed_web_shell_commands
      output: >
        Potential web shell execution
        (parent=%proc.pname command=%proc.cmdline pod=%k8s.pod.name)
      priority: CRITICAL
```

### 14.3.2 Pod Security Standards

**Kubernetes Pod Security Admission (replacing Pod Security Policies):**

```yaml
# Pod Security Standard: Restricted
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted

---
# Example compliant deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: secure-application
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: secure-app
  template:
    metadata:
      labels:
        app: secure-app
    spec:
      securityContext:
        runAsNonRoot: true
        seccompProfile:
          type: RuntimeDefault
      
      containers:
        - name: application
          image: myapp:v1.2.3
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop: ["ALL"]
            runAsUser: 1000
            runAsGroup: 1000
          
          resources:
            limits:
              memory: "512Mi"
              cpu: "500m"
            requests:
              memory: "256Mi"
              cpu: "250m"
          
          volumeMounts:
            - name: tmp
              mountPath: /tmp
            - name: cache
              mountPath: /var/cache
      
      volumes:
        - name: tmp
          emptyDir: {}
        - name: cache
          emptyDir: {}
```

**Security Controls:**
- **Non-root execution:** Container runs as UID 1000; `runAsNonRoot` rejects containers that would start as root, and `allowPrivilegeEscalation: false` blocks setuid-based escalation
- **Read-only root filesystem:** Prevents attackers from writing malware to container filesystem
- **Capability dropping:** Removes all Linux capabilities (CAP_SYS_ADMIN, etc.) that could be exploited for container escape
- **Resource limits:** Prevents DoS via resource exhaustion
- **seccomp:** Restricts available syscalls to the runtime default profile
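
Pod Security Admission enforces these controls in the cluster, but it can be useful to catch violations earlier, for example in CI. The sketch below checks a pod spec (as a Python dict) against a small subset of the restricted profile. It is not a complete implementation of the standard, and the function name `restricted_violations` is an assumption.

```python
def restricted_violations(pod_spec):
    """Return violations of a few 'restricted' profile checks."""
    pod_ctx = pod_spec.get("securityContext") or {}
    problems = []
    for container in pod_spec.get("containers", []):
        ctx = container.get("securityContext") or {}
        name = container.get("name", "<unnamed>")

        # Container-level settings take precedence over pod-level ones
        run_as_non_root = ctx.get("runAsNonRoot", pod_ctx.get("runAsNonRoot"))
        if run_as_non_root is not True:
            problems.append(f"{name}: runAsNonRoot must be true")

        if ctx.get("allowPrivilegeEscalation") is not False:
            problems.append(f"{name}: allowPrivilegeEscalation must be false")

        dropped = (ctx.get("capabilities") or {}).get("drop") or []
        if "ALL" not in dropped:
            problems.append(f"{name}: must drop ALL capabilities")

        seccomp = ctx.get("seccompProfile") or pod_ctx.get("seccompProfile") or {}
        if seccomp.get("type") not in ("RuntimeDefault", "Localhost"):
            problems.append(f"{name}: seccompProfile must be RuntimeDefault or Localhost")
    return problems


# Pod template from the Deployment above, reduced to its security fields
COMPLIANT = {
    "securityContext": {
        "runAsNonRoot": True,
        "seccompProfile": {"type": "RuntimeDefault"},
    },
    "containers": [{
        "name": "application",
        "securityContext": {
            "allowPrivilegeEscalation": False,
            "capabilities": {"drop": ["ALL"]},
        },
    }],
}

if __name__ == "__main__":
    print(restricted_violations(COMPLIANT))
```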

---

## 14.4 Secrets Management and Certificate Rotation

Hardcoded credentials in source code or configuration files represent critical vulnerabilities. Cloud-native secrets management provides centralized, auditable, and automatically rotating credential storage.

### 14.4.1 AWS Secrets Manager Implementation

```hcl
# Database credentials with automatic rotation
resource "aws_secretsmanager_secret" "db_credentials" {
  name                    = "production/database/app-credentials"
  description             = "RDS database credentials for application"
  kms_key_id              = aws_kms_key.secrets_encryption.arn
  recovery_window_in_days = 30  # Prevent accidental deletion

  tags = {
    Environment = "production"
    Rotation    = "enabled"
  }
}

resource "aws_secretsmanager_secret_version" "db_credentials" {
  secret_id     = aws_secretsmanager_secret.db_credentials.id
  secret_string = jsonencode({
    username = "app_user"
    password = random_password.db_password.result
    host     = aws_db_instance.main.address
    port     = 5432
    dbname   = "production"
  })
}

# Automatic rotation every 30 days
resource "aws_secretsmanager_secret_rotation" "db_credentials" {
  secret_id           = aws_secretsmanager_secret.db_credentials.id
  rotation_lambda_arn = aws_lambda_function.rotation_lambda.arn

  rotation_rules {
    automatically_after_days = 30
  }
}

# Lambda function for rotation (managed by AWS, but can be customized)
resource "aws_lambda_function" "rotation_lambda" {
  filename      = "rotation_lambda.zip"
  function_name = "secrets-rotation-postgres"
  role          = aws_iam_role.rotation_role.arn
  handler       = "lambda_function.lambda_handler"
  runtime       = "python3.11"
  timeout       = 30

  vpc_config {
    subnet_ids         = module.vpc.private_subnets
    security_group_ids = [aws_security_group.lambda_rotation.id]
  }
}

```

**Application Code with Caching:**

```python
import json
import logging

import boto3
from botocore.exceptions import ClientError

logger = logging.getLogger()

class SecretsManagerCache:
    """
    Local cache for secrets to reduce API calls and latency
    Implements best practices from AWS Secrets Manager documentation
    """
    
    def __init__(self):
        self._cache = {}
        self._client = boto3.client('secretsmanager')
    
    def get_secret(self, secret_arn, version_stage='AWSCURRENT'):
        """
        Retrieve secret with local caching
        Falls back to API call if not cached or cache expired
        """
        cache_key = f"{secret_arn}:{version_stage}"
        
        # Check local cache (in production, implement TTL)
        if cache_key in self._cache:
            return self._cache[cache_key]
        
        try:
            response = self._client.get_secret_value(
                SecretId=secret_arn,
                VersionStage=version_stage
            )
            
            if 'SecretString' in response:
                secret = json.loads(response['SecretString'])
            else:
                # Binary secret (e.g., for certificates)
                import base64
                secret = base64.b64decode(response['SecretBinary'])
            
            # Cache locally (in Lambda, cache persists between invocations in the same execution environment)
            self._cache[cache_key] = secret
            return secret
            
        except ClientError as e:
            error_code = e.response['Error']['Code']
            if error_code == 'ResourceNotFoundException':
                logger.error(f"Secret {secret_arn} not found")
            elif error_code == 'InvalidRequestException':
                logger.error(f"Invalid request for secret {secret_arn}")
            raise

# Usage in application
cache = SecretsManagerCache()

def get_database_connection():
    creds = cache.get_secret("arn:aws:secretsmanager:us-east-1:123456789012:secret:production/database/app-credentials")
    
    import psycopg2
    conn = psycopg2.connect(
        host=creds['host'],
        database=creds['dbname'],
        user=creds['username'],
        password=creds['password'],
        port=creds['port'],
        sslmode='require'  # Enforce TLS
    )
    return conn
```
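
The cache above never expires entries, so after a rotation the application could keep using the old password until the process restarts. The sketch below adds a time-to-live, independent of any AWS SDK. `TTLCache` and its `fetch` callback are assumed names; in the application, `fetch` would wrap the `get_secret_value` call shown earlier.

```python
import time


class TTLCache:
    """Cache values from `fetch(key)` and refetch after `ttl_seconds`."""

    def __init__(self, fetch, ttl_seconds=300, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, expires_at)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]
        value = self._fetch(key)
        self._entries[key] = (value, now + self._ttl)
        return value

    def invalidate(self, key):
        """Drop an entry, for example after an authentication failure."""
        self._entries.pop(key, None)


if __name__ == "__main__":
    # Fake fetcher standing in for the Secrets Manager call
    cache = TTLCache(fetch=lambda key: {"password": "example"}, ttl_seconds=300)
    print(cache.get("production/database/app-credentials"))
```

Calling `invalidate` when the database rejects a login lets the application pick up a freshly rotated credential immediately instead of waiting for the TTL to lapse.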

### 14.4.2 Certificate Management with ACM and Cert Manager

**AWS Certificate Manager with DNS Validation:**

```hcl
resource "aws_acm_certificate" "main" {
  domain_name               = "app.company.com"
  subject_alternative_names = ["api.company.com", "*.company.com"]
  validation_method         = "DNS"

  lifecycle {
    create_before_destroy = true  # Ensure new cert is ready before old one is destroyed
  }

  tags = {
    Environment = "production"
  }
}

# DNS validation records
resource "aws_route53_record" "cert_validation" {
  for_each = {
    for dvo in aws_acm_certificate.main.domain_validation_options : dvo.domain_name => {
      name   = dvo.resource_record_name
      record = dvo.resource_record_value
      type   = dvo.resource_record_type
    }
  }

  allow_overwrite = true
  name            = each.value.name
  records         = [each.value.record]
  ttl             = 60
  type            = each.value.type
  zone_id         = aws_route53_zone.main.zone_id
}

# Certificate validation
resource "aws_acm_certificate_validation" "main" {
  certificate_arn         = aws_acm_certificate.main.arn
  validation_record_fqdns = [for record in aws_route53_record.cert_validation : record.fqdn]
}

# TLS 1.2+ enforcement on ALB (TLS 1.3 is negotiated when the client supports it)
resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.application.arn
  port              = "443"
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-TLS13-1-2-2021-06"  # Forces TLS 1.2+
  certificate_arn   = aws_acm_certificate_validation.main.certificate_arn

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.application.arn
  }
}

# HTTP to HTTPS redirect
resource "aws_lb_listener" "http_redirect" {
  load_balancer_arn = aws_lb.application.arn
  port              = "80"
  protocol          = "HTTP"

  default_action {
    type = "redirect"

    redirect {
      port        = "443"
      protocol    = "HTTPS"
      status_code = "HTTP_301"
    }
  }
}
```
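
To confirm from the client side that the listener only negotiates TLS 1.2 or newer, a small probe built on Python's standard `ssl` module can help. This is a minimal sketch rather than a full scanner: `build_strict_context` and `negotiated_tls_version` are hypothetical helper names, and the hostname in the usage comment is simply the one from the certificate above.

```python
import socket
import ssl


def build_strict_context() -> ssl.SSLContext:
    """Client context that refuses to negotiate anything below TLS 1.2."""
    context = ssl.create_default_context()  # verifies certificate chain and hostname
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


def negotiated_tls_version(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect to host:port and return the negotiated protocol, e.g. 'TLSv1.3'."""
    context = build_strict_context()
    with socket.create_connection((host, port), timeout=timeout) as raw_sock:
        with context.wrap_socket(raw_sock, server_hostname=host) as tls_sock:
            return tls_sock.version()


# Usage (requires network access to the assumed endpoint):
#   print(negotiated_tls_version("app.company.com"))
```

Running the probe from a CI job after each deployment gives an independent check that the `ssl_policy` on the listener matches what clients actually negotiate.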

---

## 14.5 Security Monitoring and Logging Architecture

Comprehensive observability is essential for detecting, investigating, and responding to security incidents. Cloud-native security monitoring aggregates logs, analyzes behavioral patterns, and correlates events across distributed systems.

### 14.5.1 Centralized Logging with CloudTrail and Config

**Organization-Wide CloudTrail:**

```hcl
resource "aws_cloudtrail" "organization" {
  name           = "organization-security-trail"
  s3_bucket_name = aws_s3_bucket.cloudtrail_logs.id
  
  is_organization_trail = true  # Apply to all accounts in AWS Org
  is_multi_region_trail = true
  
  enable_logging = true
  
  event_selector {
    read_write_type                 = "All"
    include_management_events       = true
    exclude_management_event_sources = []  # Log everything
    
    data_resource {
      type   = "AWS::S3::Object"
      values = ["arn:aws:s3:::"]  # Log all S3 object-level operations
    }
    
    data_resource {
      type   = "AWS::Lambda::Function"
      values = ["arn:aws:lambda"]  # Log Lambda function invocations
    }
  }
  
  insight_selector {
    insight_type = "ApiCallRateInsight"  # Detect unusual API activity
  }
  
  kms_key_id = aws_kms_key.cloudtrail_encryption.arn

  # Log file validation (integrity checking) is an argument on the trail itself,
  # not a separate resource; CloudTrail then delivers signed digest files.
  enable_log_file_validation = true

  tags = {
    Purpose    = "SecurityAudit"
    Compliance = "SOC2"
  }
}

# S3 bucket for logs with strict security
resource "aws_s3_bucket" "cloudtrail_logs" {
  bucket        = "org-cloudtrail-logs-${data.aws_caller_identity.current.account_id}"
  force_destroy = false
  
  tags = {
    Security = "critical"
  }
}

resource "aws_s3_bucket_policy" "cloudtrail" {
  bucket = aws_s3_bucket.cloudtrail_logs.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "AWSCloudTrailAclCheck"
        Effect = "Allow"
        Principal = {
          Service = "cloudtrail.amazonaws.com"
        }
        Action   = "s3:GetBucketAcl"
        Resource = aws_s3_bucket.cloudtrail_logs.arn
      },
      {
        Sid    = "AWSCloudTrailWrite"
        Effect = "Allow"
        Principal = {
          Service = "cloudtrail.amazonaws.com"
        }
        Action   = "s3:PutObject"
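        # Organization trails also deliver under AWSLogs/<organization-id>/, so both prefixes are allowed.
        # Assumes a `data "aws_organizations_organization" "current" {}` data source is declared elsewhere.
        Resource = [
          "${aws_s3_bucket.cloudtrail_logs.arn}/AWSLogs/${data.aws_caller_identity.current.account_id}/*",
          "${aws_s3_bucket.cloudtrail_logs.arn}/AWSLogs/${data.aws_organizations_organization.current.id}/*"
        ]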
        Condition = {
          StringEquals = {
            "s3:x-amz-acl" = "bucket-owner-full-control"
          }
        }
      },
      {
        Sid    = "DenyInsecureTransport"
        Effect = "Deny"
        Principal = "*"
        Action = "s3:*"
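        # Cover both the bucket and its objects; the bucket ARN alone does not match object-level actions
        Resource = [
          aws_s3_bucket.cloudtrail_logs.arn,
          "${aws_s3_bucket.cloudtrail_logs.arn}/*"
        ]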
        Condition = {
          Bool = {
            "aws:SecureTransport" = "false"
          }
        }
      }
    ]
  })
}
```
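
A deny-insecure-transport statement only protects log objects when its `Resource` covers both the bucket ARN and the `/*` object ARN. The sketch below checks a rendered policy document for that property. It is an illustrative helper, not an AWS API: `denies_insecure_transport` is a hypothetical name, and it deliberately ignores `Principal` and `Action`, which a fuller check would also validate.

```python
def _as_list(value):
    """IAM policy fields may be a single value or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def denies_insecure_transport(policy: dict, bucket_arn: str) -> bool:
    """Return True if some Deny statement conditioned on aws:SecureTransport=false
    lists both the bucket ARN and the bucket's object ARN as resources."""
    required = {bucket_arn, f"{bucket_arn}/*"}
    for statement in _as_list(policy.get("Statement", [])):
        if statement.get("Effect") != "Deny":
            continue
        bool_condition = statement.get("Condition", {}).get("Bool", {})
        if str(bool_condition.get("aws:SecureTransport")).lower() != "false":
            continue
        if required.issubset(set(_as_list(statement.get("Resource", [])))):
            return True
    return False
```

A check like this could run against the output of `terraform show -json` or `aws s3api get-bucket-policy` as part of a compliance test suite.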

### 14.5.2 GuardDuty for Threat Detection

**Intelligent Threat Detection:**

```hcl
resource "aws_guardduty_detector" "main" {
  enable = true
  
  datasources {
    s3_logs {
      enable = true
    }
    kubernetes {
      audit_logs {
        enable = true
      }
    }
    malware_protection {
      scan_ec2_instance_with_findings {
        ebs_volumes {
          enable = true
        }
      }
    }
  }
  
  finding_publishing_frequency = "FIFTEEN_MINUTES"
}

# Auto-remediation for high-severity findings
resource "aws_cloudwatch_event_rule" "guardduty_high_severity" {
  name        = "guardduty-high-severity"
  description = "Capture high and critical GuardDuty findings"

  event_pattern = jsonencode({
    source      = ["aws.guardduty"]
    detail-type = ["GuardDuty Finding"]
    detail = {
      severity = [{ "numeric" = [">=", 7] }]  # High (7-8.9) and Critical (9+)
    }
  })
}

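resource "aws_cloudwatch_event_target" "remediation_lambda" {
  rule      = aws_cloudwatch_event_rule.guardduty_high_severity.name
  target_id = "GuardDutyRemediation"
  arn       = aws_lambda_function.guardduty_remediation.arn
}

# Allow EventBridge to invoke the remediation function; without this the rule matches but the target never runs
resource "aws_lambda_permission" "allow_eventbridge" {
  statement_id  = "AllowGuardDutyEventRule"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.guardduty_remediation.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.guardduty_high_severity.arn
}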
```
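
The numeric match in the event pattern can be mirrored in plain Python, which is handy for unit-testing remediation logic against sample events. This is a sketch under stated assumptions: the function names are hypothetical, the band boundaries follow GuardDuty's documented severity ranges (Low 1.0-3.9, Medium 4.0-6.9, High 7.0-8.9, Critical 9.0-10.0), and the event dictionaries are simplified.

```python
def severity_label(score: float) -> str:
    """Map a GuardDuty numeric severity to its documented band."""
    if score >= 9.0:
        return "CRITICAL"
    if score >= 7.0:
        return "HIGH"
    if score >= 4.0:
        return "MEDIUM"
    return "LOW"


def should_remediate(event: dict, threshold: float = 7.0) -> bool:
    """Approximate the EventBridge rule: GuardDuty findings with severity >= threshold."""
    if event.get("source") != "aws.guardduty":
        return False
    if event.get("detail-type") != "GuardDuty Finding":
        return False
    severity = float(event.get("detail", {}).get("severity", 0))
    return severity >= threshold
```

Keeping the threshold as a parameter makes it straightforward to test a more conservative policy, for example paging on Medium findings in production accounts only.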

**Remediation Lambda for Compromised Instances:**

```python
import boto3
import json

def isolate_compromised_instance(event, context):
    """
    Automatically isolate EC2 instances flagged by GuardDuty
    """
    detail = event['detail']
    finding_type = detail['type']
    severity = detail['severity']
    
    # Only EC2 instance findings can be isolated; skip IAM, S3, and other resource types
    resource = detail['resource']
    if resource.get('resourceType') != 'Instance':
        return
    
    # Extract instance ID from finding
    instance_id = resource['instanceDetails']['instanceId']
    
    ec2 = boto3.client('ec2')
    sns = boto3.client('sns')
    
    try:
        # 1. Create isolation security group (no inbound/outbound)
        isolation_sg = create_isolation_security_group(ec2, instance_id)
        
        # 2. Replace instance's security groups with isolation group
        original_sgs = ec2.describe_instances(InstanceIds=[instance_id])['Reservations'][0]['Instances'][0]['SecurityGroups']
        
        ec2.modify_instance_attribute(
            InstanceId=instance_id,
            Groups=[isolation_sg['GroupId']]
        )
        
        # 3. Create a snapshot of every attached volume for forensics
        volumes = ec2.describe_volumes(
            Filters=[{'Name': 'attachment.instance-id', 'Values': [instance_id]}]
        )
        
        snapshot_ids = []  # Collect all snapshot IDs, not just the last one
        for vol in volumes['Volumes']:
            snapshot = ec2.create_snapshot(
                VolumeId=vol['VolumeId'],
                Description=f"Forensic snapshot for compromised instance {instance_id}",
                TagSpecifications=[{
                    'ResourceType': 'snapshot',
                    'Tags': [
                        {'Key': 'Incident', 'Value': finding_type},
                        {'Key': 'Severity', 'Value': str(severity)},
                        {'Key': 'InstanceId', 'Value': instance_id}
                    ]
                }]
            )
            snapshot_ids.append(snapshot['SnapshotId'])
        
        # 4. Tag instance
        ec2.create_tags(
            Resources=[instance_id],
            Tags=[
                {'Key': 'SecurityStatus', 'Value': 'ISOLATED'},
                {'Key': 'IsolationTime', 'Value': event['time']},
                {'Key': 'OriginalSGs', 'Value': json.dumps([sg['GroupId'] for sg in original_sgs])}
            ]
        )
        
        # 5. Notify security team
        sns.publish(
            TopicArn='arn:aws:sns:us-east-1:123456789012:security-incidents',
            Subject=f'CRITICAL: Instance {instance_id} isolated due to {finding_type}',
            Message=json.dumps({
                'instance_id': instance_id,
                'finding_type': finding_type,
                'severity': severity,
                'isolation_sg': isolation_sg['GroupId'],
                'snapshots': snapshot_ids
            })
        )
        
    except Exception as e:
        # If automation fails, page on-call immediately
        sns.publish(
            TopicArn='arn:aws:sns:us-east-1:123456789012:security-escalation',
            Subject=f'URGENT: Failed to isolate compromised instance {instance_id}',
            Message=str(e)
        )
        raise

def create_isolation_security_group(ec2, instance_id):
    """Create quarantine security group with no ingress/egress"""
    vpc_id = ec2.describe_instances(InstanceIds=[instance_id])['Reservations'][0]['Instances'][0]['VpcId']
    
    sg = ec2.create_security_group(
        GroupName=f'isolation-{instance_id}',
        Description=f'Isolation group for compromised instance {instance_id}',
        VpcId=vpc_id,
        TagSpecifications=[{
            'ResourceType': 'security-group',
            'Tags': [{'Key': 'Purpose', 'Value': 'IncidentResponse'}]
        }]
    )
    
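    # A new VPC security group has no ingress rules but DOES include a default
    # allow-all egress rule; revoke it so traffic is blocked in both directions.
    ec2.revoke_security_group_egress(
        GroupId=sg['GroupId'],
        IpPermissions=[{'IpProtocol': '-1', 'IpRanges': [{'CidrIp': '0.0.0.0/0'}]}]
    )
    return sg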
```
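
Because the isolation step records the original security groups in the `OriginalSGs` tag, releasing an instance after investigation can reverse the change from that tag alone. The following is a minimal sketch of such a release step; `original_security_groups` and `release_instance` are hypothetical helpers, the `RELEASED` status value is an assumption, and `ec2` is expected to be a boto3 EC2 client (or a test double exposing the same three methods).

```python
import json


def original_security_groups(tags: list) -> list:
    """Read the JSON list of group IDs stored in the OriginalSGs tag, if present."""
    for tag in tags:
        if tag["Key"] == "OriginalSGs":
            return json.loads(tag["Value"])
    return []


def release_instance(ec2, instance_id: str) -> list:
    """Reattach the pre-isolation security groups and mark the instance as released."""
    tags = ec2.describe_tags(
        Filters=[{"Name": "resource-id", "Values": [instance_id]}]
    )["Tags"]
    group_ids = original_security_groups(tags)
    if not group_ids:
        raise ValueError(f"No OriginalSGs tag found on {instance_id}")

    ec2.modify_instance_attribute(InstanceId=instance_id, Groups=group_ids)
    ec2.create_tags(
        Resources=[instance_id],
        Tags=[{"Key": "SecurityStatus", "Value": "RELEASED"}],
    )
    return group_ids
```

Release is intentionally a separate, manually triggered function: isolation can be automated safely, but reconnecting a previously compromised host generally warrants a human decision.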

### 14.5.3 SIEM Integration

**Centralized Security Analytics:**

```hcl
# Kinesis Data Firehose delivery to a SIEM HTTP endpoint (Splunk HEC shown here)
resource "aws_kinesis_firehose_delivery_stream" "security_logs" {
  name        = "security-log-stream"
  destination = "http_endpoint"

  http_endpoint_configuration {
    url                = "https://http-inputs-splunk.company.com:443/services/collector/event"
    name               = "Splunk"
    access_key         = var.splunk_hec_token
    buffering_size     = 5
    buffering_interval = 300
    retry_duration     = 300
    
    request_configuration {
      content_encoding = "GZIP"
      
      common_attributes {
        name  = "environment"
        value = "production"
      }
    }
    
    s3_configuration {
      role_arn           = aws_iam_role.firehose.arn
      bucket_arn         = aws_s3_bucket.backup_logs.arn
      buffering_size     = 10
      buffering_interval = 400
      compression_format = "GZIP"
    }
  }
}

# Subscribe CloudWatch Logs to Firehose
resource "aws_cloudwatch_log_subscription_filter" "vpc_flow_logs" {
  name            = "vpc-flow-to-siem"
  log_group_name  = aws_cloudwatch_log_group.vpc_flow.name
  # Filter patterns compare strings, not CIDR ranges, so a wildcard is used to drop sources in 10.0.0.0/8
  filter_pattern  = "[version, account_id, interface_id, srcaddr != 10.*, dstaddr, srcport, dstport, protocol, packets, bytes, start, end, action, log_status]"
  destination_arn = aws_kinesis_firehose_delivery_stream.security_logs.arn
  role_arn        = aws_iam_role.cloudwatch_to_firehose.arn
}
```
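
Subscription filter patterns match fields as strings, so precise network-range logic is easier to apply downstream, for example in a Firehose transformation function or in the SIEM itself. The sketch below parses a default-format (version 2) VPC flow log line and uses Python's `ipaddress` module to decide whether the source is external. The helper names and the `INTERNAL_NETWORKS` list are assumptions chosen for illustration.

```python
import ipaddress

# Field order of the default (version 2) VPC flow log format
FLOW_LOG_FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]

# Assumed internal address space; adjust to the VPC CIDRs actually in use
INTERNAL_NETWORKS = [ipaddress.ip_network("10.0.0.0/8")]


def parse_flow_record(line: str) -> dict:
    """Split one space-delimited flow log line into named fields."""
    values = line.split()
    if len(values) != len(FLOW_LOG_FIELDS):
        raise ValueError(f"Expected {len(FLOW_LOG_FIELDS)} fields, got {len(values)}")
    return dict(zip(FLOW_LOG_FIELDS, values))


def is_external_source(record: dict, internal_networks=INTERNAL_NETWORKS) -> bool:
    """True when srcaddr is a valid IP outside every internal network."""
    try:
        address = ipaddress.ip_address(record["srcaddr"])
    except ValueError:
        return False  # NODATA/SKIPDATA records carry "-" instead of an address
    return not any(address in network for network in internal_networks)
```

Doing the range check in code also makes it easy to extend the internal list with additional CIDR blocks, which a single wildcard filter cannot express.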

---

## 14.6 Infrastructure as Code Security

Security must shift left into the development pipeline. IaC scanning tools detect misconfigurations before they reach production.

### 14.6.1 Policy as Code with Checkov

```yaml
# .github/workflows/iac-security.yml
name: Infrastructure Security Scan
on:
  pull_request:
    paths:
      - 'terraform/**'
      - 'cloudformation/**'

jobs:
  security-scan:
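    runs-on: ubuntu-latest
    permissions:
      contents: read
      security-events: write  # Required to upload SARIF results to code scanning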
    steps:
      - uses: actions/checkout@v4
      
      - name: Run Checkov
        id: checkov
        uses: bridgecrewio/checkov-action@master  # Consider pinning to a release tag or commit SHA
        with:
          directory: .
          framework: terraform
          output_format: sarif
          output_file_path: reports/checkov.sarif
          soft_fail: false  # Fail the build on violations
          skip_check: CKV_AWS_18  # Skip specific checks if justified
          
      - name: Upload SARIF to GitHub Security
        if: always()  # Upload findings even when Checkov fails the job
        uses: github/codeql-action/upload-sarif@v3
        with:
          sarif_file: reports/checkov.sarif
          
      - name: Terraform Compliance
        uses: terraform-compliance/github_action@main
        with:
          plan: terraform/tfplan.out
        env:
          TF_DIRS: terraform/
```

**Custom Checkov Policy:**

```python
# custom_policies/s3_encryption.py
from checkov.common.models.enums import CheckResult, CheckCategories
from checkov.terraform.checks.resource.base_resource_check import BaseResourceCheck

class S3BucketEncryptionCheck(BaseResourceCheck):
    def __init__(self):
        name = "Ensure S3 bucket has encryption enabled with customer managed key"
        id = "CKV_CUSTOM_001"
        supported_resources = ['aws_s3_bucket']
        categories = [CheckCategories.ENCRYPTION]
        super().__init__(name=name, id=id, categories=categories, supported_resources=supported_resources)

    def scan_resource_conf(self, conf):
        """
        Looks for server_side_encryption_configuration with KMS
        """
        if 'server_side_encryption_configuration' in conf.keys():
            sse_config = conf['server_side_encryption_configuration'][0]
            if 'rule' in sse_config:
                rule = sse_config['rule'][0]
                if 'apply_server_side_encryption_by_default' in rule:
                    default = rule['apply_server_side_encryption_by_default'][0]
                    if 'sse_algorithm' in default and default['sse_algorithm'][0] == 'aws:kms':
                        if 'kms_master_key_id' in default:
                            return CheckResult.PASSED
        return CheckResult.FAILED

scanner = S3BucketEncryptionCheck()
```

### 14.6.2 Terraform Sentinel (HashiCorp Enterprise)

```hcl
# enforce-mandatory-tags.sentinel
import "tfplan/v2" as tfplan

# Get all AWS instances
aws_instances = filter tfplan.resource_changes as _, rc {
    rc.type is "aws_instance" and
    (rc.change.actions contains "create" or rc.change.actions contains "update")
}

# Mandatory tags
mandatory_tags = [
    "Environment",
    "Owner",
    "DataClassification",
    "CostCenter"
]

# Rule to check for mandatory tags
mandatory_tags_rule = rule {
    all aws_instances as _, instance {
        all mandatory_tags as tag {
            keys(instance.change.after.tags) contains tag
        }
    }
}

# Main rule
main = rule {
    mandatory_tags_rule
}
```
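
Teams without access to Sentinel can approximate the same mandatory-tag rule against the JSON plan produced by `terraform show -json tfplan.out`. The sketch below is one way to do that, not a replacement for a policy engine: `missing_mandatory_tags` is a hypothetical helper, and it reads only the `resource_changes` entries of the plan representation.

```python
MANDATORY_TAGS = {"Environment", "Owner", "DataClassification", "CostCenter"}


def missing_mandatory_tags(plan: dict, resource_type: str = "aws_instance") -> dict:
    """Return {resource address: sorted missing tag names} for created or updated resources."""
    violations = {}
    for change in plan.get("resource_changes", []):
        if change.get("type") != resource_type:
            continue
        actions = change.get("change", {}).get("actions", [])
        if "create" not in actions and "update" not in actions:
            continue
        # "after" and "tags" may be null in the plan JSON; treat both as empty
        after = change["change"].get("after") or {}
        tags = after.get("tags") or {}
        missing = MANDATORY_TAGS - set(tags)
        if missing:
            violations[change["address"]] = sorted(missing)
    return violations
```

A CI step could load the plan with `json.load`, call this function, and fail the build when the returned dictionary is non-empty.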

---

## 14.7 Chapter Summary and Transition

This chapter has implemented comprehensive infrastructure security controls that operationalize the architectural principles established in previous chapters. We architected defense-in-depth network security utilizing VPC segmentation, security groups, NACLs, and Web Application Firewalls to create multiple barriers against attack, moving beyond perimeter-based models to zero-trust micro-segmentation where every packet is inspected and every connection verified.

Compute hardening strategies demonstrated automated compliance with CIS benchmarks through Systems Manager and golden AMI pipelines, coupled with vulnerability management workflows that automatically remediate critical findings or trigger incident response procedures. Container security addressed the unique challenges of ephemeral workloads through image scanning, runtime threat detection with Falco, and Pod Security Standards that enforce least privilege at the Kubernetes level.

Data protection implementation covered centralized secrets management with automatic rotation, certificate lifecycle management with TLS 1.2+ enforcement on load balancers, and encryption key governance. Security monitoring architectures aggregated logs across distributed systems, implemented intelligent threat detection with GuardDuty, and established automated incident response workflows that isolate compromised resources within minutes of detection.

Finally, we embedded security into the development lifecycle through Infrastructure as Code scanning, ensuring that misconfigurations are caught during pull request review rather than production deployment.

However, even the most sophisticated preventive and detective controls cannot guarantee security in the face of determined adversaries or sophisticated supply chain attacks. When prevention fails—and statistically, it eventually will—organizations must be prepared to respond with speed and precision. Incident response in cloud environments presents unique challenges: ephemeral resources may disappear before forensic capture, cross-account compromises require rapid isolation of entire organizational units, and compliance obligations mandate specific breach notification timelines.

In **Chapter 15: Cloud Security Operations and Incident Response**, we will shift from preventive architecture to reactive capability. You will learn to build Security Operations Centers (SOC) optimized for cloud telemetry, implement automated incident response playbooks that orchestrate across multiple accounts and regions, conduct forensic investigations in ephemeral serverless and container environments, and navigate the compliance and legal implications of cloud breaches. We will explore chaos engineering techniques for validating security controls under adversarial conditions and establish the metrics and KPIs that demonstrate security program maturity to stakeholders and regulators.

