# Chapter 6: Infrastructure as Code (IaC)

In the preceding chapters, we manually provisioned cloud resources through web consoles and command-line interfaces. That approach works for learning and experimentation, but it breaks down at scale. Manual configuration is error-prone, hard to repeat, and creates "snowflake" environments where development, staging, and production diverge in undocumented ways.

Infrastructure as Code (IaC) is the practice of managing and provisioning computing infrastructure through machine-readable definition files rather than physical hardware configuration or interactive configuration tools. It is the cornerstone of modern cloud operations, enabling version control, automated testing, and consistent, repeatable deployments. This chapter will transform you from a cloud user into a cloud engineer, teaching you to define entire data centers in code.

## 6.1 What Is IaC? Concepts and Paradigms

### The Imperative vs. Declarative Divide
IaC tools generally follow one of two approaches:

**Imperative (Procedural):** You write scripts that specify *how* to achieve the desired state—step-by-step instructions.
```python
# Imperative: "How" to create infrastructure
import boto3
ec2 = boto3.client('ec2')

# Step 1: Check if VPC exists
vpcs = ec2.describe_vpcs(Filters=[{'Name': 'tag:Name', 'Values': ['main']}])
if not vpcs['Vpcs']:
    # Step 2: Create VPC if it doesn't exist
    vpc = ec2.create_vpc(CidrBlock='10.0.0.0/16')
    ec2.create_tags(Resources=[vpc['Vpc']['VpcId']], Tags=[{'Key': 'Name', 'Value': 'main'}])
```
*Challenges:* Handling failures mid-script, updating existing resources, and maintaining idempotency (running the script twice shouldn't create duplicates) becomes complex.

**Declarative:** You define the *desired end state*—what the infrastructure should look like—and the tool determines how to achieve it.
```hcl
# Declarative: "What" the infrastructure should be
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
  
  tags = {
    Name = "main"
  }
}
```
*Benefits:* The tool works out whether each resource must be created, updated, or left alone, and it tracks what it manages in state. Most mainstream IaC tools for cloud infrastructure take this approach.
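
To make the "what, not how" idea concrete, here is a minimal Python sketch of a reconciliation step. It is a toy model under simplifying assumptions, not Terraform's actual algorithm: resources are plain dictionaries, and the `plan` and `apply` functions are hypothetical names chosen for illustration.

```python
# Toy reconciliation: compare desired state with actual state and derive the steps.
# This illustrates the declarative idea only; it is not how Terraform is implemented.

def plan(desired: dict, actual: dict) -> list:
    """Return the (action, name) steps needed to move `actual` to `desired`."""
    steps = []
    for name, attrs in desired.items():
        if name not in actual:
            steps.append(("create", name))
        elif actual[name] != attrs:
            steps.append(("update", name))
    for name in actual:
        if name not in desired:
            steps.append(("delete", name))
    return steps


def apply(desired: dict, actual: dict) -> dict:
    """Carry out the plan and return the new actual state."""
    new_state = dict(actual)
    for action, name in plan(desired, actual):
        if action == "delete":
            del new_state[name]
        else:
            new_state[name] = dict(desired[name])
    return new_state


desired = {"aws_vpc.main": {"cidr_block": "10.0.0.0/16", "tags": {"Name": "main"}}}
state = apply(desired, {})       # first run creates the VPC
print(plan(desired, {}))         # [('create', 'aws_vpc.main')]
print(plan(desired, state))      # [] because a second run has nothing left to do
```

The empty second plan is the idempotency property in miniature: once actual state matches the definition, re-applying is a no-op.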

### Core Benefits of IaC
1.  **Version Control:** Infrastructure definitions live in Git, enabling code review, rollback, and audit trails.
2.  **Idempotency:** Applying the same configuration multiple times produces the same result without side effects.
3.  **Consistency:** Eliminates "works on my machine" between environments.
4.  **Speed:** Provision entire environments in minutes rather than days.
5.  **Cost Optimization:** Easily destroy and recreate ephemeral environments (dev/test) to minimize costs.

## 6.2 Terraform: The Cross-Cloud Industry Standard

Terraform, developed by HashiCorp, has emerged as the de facto standard for IaC across multi-cloud environments. It uses HashiCorp Configuration Language (HCL), a declarative language designed to be both human-readable and machine-parseable.

### 6.2.1 Terraform Architecture and Workflow

**Key Components:**
*   **Providers:** Plugins that interface with cloud APIs (AWS, Azure, GCP, Kubernetes, etc.).
*   **Resources:** The fundamental building blocks (VPCs, VMs, databases).
*   **State:** Terraform maintains a `terraform.tfstate` file that maps real-world resources to your configuration. It is Terraform's record of what it manages; the real infrastructure can still drift from it when someone changes resources outside Terraform.

**Standard Workflow:**
```bash
terraform init      # Initialize working directory, download providers
terraform plan      # Preview changes (dry run)
terraform apply     # Execute changes
terraform destroy   # Remove all resources (use with caution)
```
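
In automation, the same workflow is usually driven by a script that needs to know whether a plan contains changes. The sketch below assumes the `terraform` binary is on `PATH` and relies on the `-detailed-exitcode` flag of `terraform plan` (0 means no changes, 1 means error, 2 means changes are pending). The function names and the `tfplan` file name are illustrative choices.

```python
import subprocess

# Exit codes returned by `terraform plan -detailed-exitcode`.
PLAN_EXIT_CODES = {0: "no-changes", 1: "error", 2: "changes-pending"}


def interpret_plan_exit_code(code: int) -> str:
    """Translate a plan exit code into a readable status."""
    return PLAN_EXIT_CODES.get(code, "unknown")


def run_plan(workdir: str) -> str:
    """Run init and plan non-interactively in `workdir` and report the outcome."""
    subprocess.run(["terraform", "init", "-input=false"], cwd=workdir, check=True)
    result = subprocess.run(
        ["terraform", "plan", "-input=false", "-detailed-exitcode", "-out=tfplan"],
        cwd=workdir,
    )
    return interpret_plan_exit_code(result.returncode)


# Example (requires Terraform and a configuration directory):
# status = run_plan("./infrastructure")
```

A pipeline can then skip the apply stage when the status is `no-changes` and stop immediately on `error`.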

### 6.2.2 Writing Your First Terraform Configuration

Let us build the network foundation for a three-tier architecture: Web tier (public), Application tier (private), and Database tier (private).

**Code Snippet: Project Structure**
Organize code for maintainability:
```
infrastructure/
├── main.tf          # Entry point, provider configuration
├── variables.tf     # Input variables
├── outputs.tf       # Output values
├── terraform.tfvars # Variable values (gitignore if it holds secrets)
└── modules/
    ├── vpc/
    │   ├── main.tf
    │   ├── variables.tf
    │   └── outputs.tf
    └── compute/
        ├── main.tf
        ├── variables.tf
        └── outputs.tf
```

**Code Snippet: Provider Configuration (`main.tf`)**
```hcl
terraform {
  required_version = ">= 1.5.0"
  
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
    random = {
      source  = "hashicorp/random"
      version = "~> 3.0"
    }
  }
  
  # Remote state backend (critical for team collaboration)
  backend "s3" {
    bucket         = "terraform-state-prod-123456"
    key            = "infrastructure/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"  # State locking
  }
}

provider "aws" {
  region = var.aws_region
  
  default_tags {
    tags = {
      Environment = var.environment
      ManagedBy   = "Terraform"
      Project     = "cloud-handbook"
    }
  }
}
```
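
The `~>` operator in the version constraints above is the "pessimistic" constraint: only the rightmost listed component may increase. The following Python sketch approximates that rule for plain numeric versions. It is an illustration only, it ignores pre-release suffixes, and `satisfies_pessimistic` is a hypothetical helper, not part of any Terraform tooling.

```python
def parse(version: str) -> tuple:
    """Turn '5.31.0' into (5, 31, 0)."""
    return tuple(int(part) for part in version.split("."))


def satisfies_pessimistic(version: str, constraint: str) -> bool:
    """Check `version` against a '~> X.Y' or '~> X.Y.Z' constraint.

    Assumes the constraint has at least two numeric components.
    """
    base = parse(constraint.replace("~>", "").strip())
    # Upper bound: drop the last component and bump the one before it.
    upper = base[:-2] + (base[-2] + 1,)
    return base <= parse(version) < upper


print(satisfies_pessimistic("5.31.0", "~> 5.0"))   # True: any 5.x release
print(satisfies_pessimistic("6.0.0", "~> 5.0"))    # False: next major version
print(satisfies_pessimistic("5.1.9", "~> 5.1.2"))  # True: patch releases only
print(satisfies_pessimistic("5.2.0", "~> 5.1.2"))  # False: minor bump excluded
```

So `~> 5.0` accepts new minor releases of the AWS provider but stops short of 6.0, where breaking changes would be expected.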

**Code Snippet: Variables (`variables.tf`)**
Variables make code reusable across environments:
```hcl
variable "aws_region" {
  description = "AWS region for resources"
  type        = string
  default     = "us-east-1"
}

variable "environment" {
  description = "Deployment environment"
  type        = string
  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Environment must be dev, staging, or prod."
  }
}

variable "vpc_cidr" {
  description = "CIDR block for VPC"
  type        = string
  default     = "10.0.0.0/16"
}

variable "instance_types" {
  description = "Map of instance types per environment"
  type        = map(string)
  default = {
    dev     = "t3.micro"
    staging = "t3.small"
    prod    = "t3.medium"
  }
}
```

**Code Snippet: VPC Module (`modules/vpc/main.tf`)**
```hcl
data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_vpc" "this" {
  cidr_block           = var.cidr_block
  enable_dns_hostnames = true
  enable_dns_support   = true
  
  tags = {
    Name = "${var.environment}-vpc"
  }
}

# Internet Gateway for public subnets
resource "aws_internet_gateway" "this" {
  vpc_id = aws_vpc.this.id
  
  tags = {
    Name = "${var.environment}-igw"
  }
}

# Public Subnets (for Load Balancers and Bastion hosts)
resource "aws_subnet" "public" {
  count                   = 2
  vpc_id                  = aws_vpc.this.id
  cidr_block              = cidrsubnet(var.cidr_block, 8, count.index)
  availability_zone       = data.aws_availability_zones.available.names[count.index]
  map_public_ip_on_launch = true
  
  tags = {
    Name = "${var.environment}-public-${count.index + 1}"
    Type = "Public"
  }
}

# Private Subnets (for application tier)
resource "aws_subnet" "private" {
  count             = 2
  vpc_id            = aws_vpc.this.id
  cidr_block        = cidrsubnet(var.cidr_block, 8, count.index + 10)
  availability_zone = data.aws_availability_zones.available.names[count.index]
  
  tags = {
    Name = "${var.environment}-private-${count.index + 1}"
    Type = "Private"
  }
}

# Database Subnets (for RDS)
resource "aws_subnet" "database" {
  count             = 2
  vpc_id            = aws_vpc.this.id
  cidr_block        = cidrsubnet(var.cidr_block, 8, count.index + 20)
  availability_zone = data.aws_availability_zones.available.names[count.index]
  
  tags = {
    Name = "${var.environment}-db-${count.index + 1}"
    Type = "Database"
  }
}

# NAT Gateways (one per AZ for high availability)
resource "aws_eip" "nat" {
  count  = 2
  domain = "vpc"
  
  tags = {
    Name = "${var.environment}-nat-${count.index + 1}"
  }
}

resource "aws_nat_gateway" "this" {
  count         = 2
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id
  
  tags = {
    Name = "${var.environment}-nat-${count.index + 1}"
  }
  
  depends_on = [aws_internet_gateway.this]
}

# Route Tables
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.this.id
  
  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.this.id
  }
  
  tags = {
    Name = "${var.environment}-public-rt"
  }
}

resource "aws_route_table" "private" {
  count  = 2
  vpc_id = aws_vpc.this.id
  
  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.this[count.index].id
  }
  
  tags = {
    Name = "${var.environment}-private-rt-${count.index + 1}"
  }
}

# Route Table Associations
resource "aws_route_table_association" "public" {
  count          = length(aws_subnet.public)
  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "private" {
  count          = length(aws_subnet.private)
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private[count.index].id
}
```
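
The subnet layout above hinges on `cidrsubnet(var.cidr_block, 8, N)`, which adds 8 bits to the prefix and picks the N-th resulting network. The Python sketch below reproduces that arithmetic with the standard `ipaddress` module, assuming the default `10.0.0.0/16` block. The local `cidrsubnet` function is a rough equivalent written for illustration.

```python
import ipaddress


def cidrsubnet(prefix: str, newbits: int, netnum: int) -> str:
    """Rough Python equivalent of Terraform's cidrsubnet()."""
    network = ipaddress.ip_network(prefix)
    # Split the parent network into subnets that are `newbits` longer, then index.
    subnets = list(network.subnets(prefixlen_diff=newbits))
    return str(subnets[netnum])


vpc_cidr = "10.0.0.0/16"
layout = {
    "public": [cidrsubnet(vpc_cidr, 8, i) for i in range(2)],
    "private": [cidrsubnet(vpc_cidr, 8, i + 10) for i in range(2)],
    "database": [cidrsubnet(vpc_cidr, 8, i + 20) for i in range(2)],
}

for tier, cidrs in layout.items():
    print(tier, cidrs)
# public ['10.0.0.0/24', '10.0.1.0/24']
# private ['10.0.10.0/24', '10.0.11.0/24']
# database ['10.0.20.0/24', '10.0.21.0/24']
```

The offsets of 10 and 20 leave gaps between tiers, so each tier can grow by several `/24` subnets without renumbering the others.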

**Code Snippet: Security Groups (`modules/vpc/security.tf`)**
```hcl
resource "aws_security_group" "alb" {
  name_prefix = "${var.environment}-alb-"
  vpc_id      = aws_vpc.this.id
  
  ingress {
    description = "HTTPS from Internet"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  
  egress {
    description = "Allow all outbound"
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
  
  tags = {
    Name = "${var.environment}-alb-sg"
  }
}

resource "aws_security_group" "app" {
  name_prefix = "${var.environment}-app-"
  vpc_id      = aws_vpc.this.id
  
  ingress {
    description     = "HTTP from ALB"
    from_port       = 8080
    to_port         = 8080
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }
  
  egress {
    description = "Allow all outbound"
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
  
  tags = {
    Name = "${var.environment}-app-sg"
  }
}
```

### 6.2.3 State Management: The Critical Foundation

Terraform state is a JSON file mapping resource addresses in your configuration (such as `aws_vpc.this`) to real resource IDs in the cloud. **Losing state means Terraform loses track of what it created**, potentially leading to orphaned resources or duplicate creations.
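
To show what that mapping looks like, the sketch below walks a heavily trimmed, made-up sample shaped like a version 4 state file and prints address-to-ID pairs. The resource IDs are invented, and the real file carries many more fields. The state layout is an internal format, so treat this as a reading aid rather than a supported interface.

```python
import json

# Trimmed, invented sample in the shape of a version 4 state file.
SAMPLE_STATE = json.loads("""
{
  "version": 4,
  "resources": [
    {
      "mode": "managed", "type": "aws_vpc", "name": "this",
      "module": "module.networking",
      "instances": [{"attributes": {"id": "vpc-0abc123"}}]
    },
    {
      "mode": "managed", "type": "aws_subnet", "name": "public",
      "module": "module.networking",
      "instances": [
        {"index_key": 0, "attributes": {"id": "subnet-0aaa"}},
        {"index_key": 1, "attributes": {"id": "subnet-0bbb"}}
      ]
    }
  ]
}
""")


def address_to_id(state: dict) -> dict:
    """Map each managed resource address to its real-world ID."""
    mapping = {}
    for res in state.get("resources", []):
        if res.get("mode") != "managed":
            continue  # skip data sources
        base = f'{res["type"]}.{res["name"]}'
        if "module" in res:
            base = f'{res["module"]}.{base}'
        for inst in res.get("instances", []):
            key = inst.get("index_key")
            addr = base if key is None else f"{base}[{json.dumps(key)}]"
            mapping[addr] = inst["attributes"].get("id")
    return mapping


for address, real_id in address_to_id(SAMPLE_STATE).items():
    print(address, "->", real_id)
```

If this file disappears, the configuration still declares `aws_vpc.this`, but nothing links it to `vpc-0abc123` any more, which is why a fresh apply would try to create a second VPC.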

**State Storage Options:**
*   **Local:** Default `terraform.tfstate` file (unsuitable for teams).
*   **Remote:** S3 (AWS), Azure Blob Storage, GCS (Google), Terraform Cloud.

**State Locking:**
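When multiple engineers run Terraform simultaneously, state locking prevents corruption by ensuring only one operation proceeds at a time. With the S3 backend this has traditionally been a DynamoDB table (Terraform 1.10 and later can also lock natively in S3 via `use_lockfile`); other backends provide their own locking mechanisms.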
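
The locking behavior can be pictured as a conditional write: a lock record is created only if none exists for that state path. The in-memory Python sketch below models just that rule. `LockTable` and `LockHeldError` are hypothetical names, and a real backend adds details such as lock metadata and forced unlocks.

```python
class LockHeldError(Exception):
    """Raised when another operation already holds the lock."""


class LockTable:
    """In-memory stand-in for a lock table keyed by state path."""

    def __init__(self):
        self._items = {}

    def acquire(self, lock_id: str, owner: str) -> None:
        # Mirrors a conditional write: succeed only if no lock item exists yet.
        if lock_id in self._items:
            raise LockHeldError(f"{lock_id} is locked by {self._items[lock_id]}")
        self._items[lock_id] = owner

    def release(self, lock_id: str, owner: str) -> None:
        # Only the current holder may release the lock.
        if self._items.get(lock_id) == owner:
            del self._items[lock_id]


locks = LockTable()
state_path = "company-terraform-state/prod/infrastructure.tfstate"

locks.acquire(state_path, "alice")       # Alice starts an apply
try:
    locks.acquire(state_path, "bob")     # Bob's concurrent apply is refused
except LockHeldError as err:
    print("blocked:", err)
locks.release(state_path, "alice")
locks.acquire(state_path, "bob")         # now Bob can proceed
```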

**State Security Best Practices:**
1.  **Encryption:** Always encrypt state files at rest (S3 server-side encryption).
2.  **Access Control:** Restrict access to state files; they may contain sensitive data (passwords, keys).
3.  **Versioning:** Enable versioning on state buckets to recover from accidental deletions.
4.  **State Separation:** Use separate state files per environment (dev, staging, prod) to prevent accidental destruction of production while working on development.

**Code Snippet: State Backend Configuration**
```hcl
# Backend configuration (typically in main.tf or backend.tf)
terraform {
  backend "s3" {
    bucket = "company-terraform-state"
    key    = "prod/infrastructure.tfstate"
    region = "us-east-1"
    
    # Encryption
    encrypt = true
    
    # State locking with DynamoDB
    dynamodb_table = "terraform-state-lock"
    
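    # Role assumption for security (AWS account IDs are 12 digits).
    # Terraform 1.6+ deprecates top-level role_arn in favor of an assume_role block.
    role_arn = "arn:aws:iam::123456789012:role/TerraformStateAccess"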
  }
}
```

### 6.2.4 Modularization and Reusability

Modules are self-contained packages of Terraform configurations that are managed as a group. They enable DRY (Don't Repeat Yourself) principles and standardization across the organization.

**Module Calling Example:**
```hcl
module "networking" {
  source = "./modules/vpc"
  
  environment = var.environment
  cidr_block  = var.vpc_cidr
  
  # Explicit dependency
  depends_on = [aws_kms_key.main]
}

module "web_servers" {
  source = "./modules/compute"
  
  environment    = var.environment
  instance_count = var.environment == "prod" ? 3 : 1
  instance_type  = var.instance_types[var.environment]
  subnet_ids     = module.networking.private_subnet_ids
  security_group_ids = [module.networking.app_security_group_id]
  
  tags = {
    Tier = "Web"
  }
}
```
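
The `web_servers` call combines three language features: the `validation` rule on `environment`, the conditional expression for `instance_count`, and the map lookup for `instance_type`. As a sketch of how those inputs resolve per environment, here is the same logic in Python, using the defaults from `variables.tf`. `resolve_compute_inputs` is a hypothetical helper for illustration.

```python
ALLOWED_ENVIRONMENTS = ["dev", "staging", "prod"]
INSTANCE_TYPES = {"dev": "t3.micro", "staging": "t3.small", "prod": "t3.medium"}


def resolve_compute_inputs(environment: str) -> dict:
    """Mirror the validation, conditional, and map lookup from the HCL."""
    # validation { condition = contains([...], var.environment) }
    if environment not in ALLOWED_ENVIRONMENTS:
        raise ValueError("Environment must be dev, staging, or prod.")
    return {
        # var.environment == "prod" ? 3 : 1
        "instance_count": 3 if environment == "prod" else 1,
        # var.instance_types[var.environment]
        "instance_type": INSTANCE_TYPES[environment],
    }


for env in ALLOWED_ENVIRONMENTS:
    print(env, resolve_compute_inputs(env))
# dev {'instance_count': 1, 'instance_type': 't3.micro'}
# staging {'instance_count': 1, 'instance_type': 't3.small'}
# prod {'instance_count': 3, 'instance_type': 't3.medium'}
```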

**Module Versioning (for remote modules):**
```hcl
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.1.2"  # Pin to specific version
  
  name = "my-vpc"
  cidr = "10.0.0.0/16"
  
  # ... other vars
}
```

## 6.3 Cloud-Native IaC Tools

While Terraform is the multi-cloud standard, each major cloud provider offers native IaC tools optimized for their platforms.

### AWS CloudFormation
Native declarative IaC for AWS using JSON or YAML.

**Code Snippet: CloudFormation Template (YAML)**
```yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: 'Simple EC2 instance'

Parameters:
  InstanceType:
    Type: String
    Default: t3.micro
    AllowedValues:
      - t3.micro
      - t3.small
      - t3.medium

Resources:
  WebServer:
    Type: AWS::EC2::Instance
    Properties:
      ImageId: ami-0c02fb55956c7d316  # AMI IDs are region-specific; replace with one valid in your region
      InstanceType: !Ref InstanceType
      SecurityGroups:
        - !Ref InstanceSecurityGroup
      UserData:
        Fn::Base64: |
          #!/bin/bash
          yum install -y httpd
          systemctl start httpd

  InstanceSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Enable HTTP access
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 80
          ToPort: 80
          CidrIp: 0.0.0.0/0

Outputs:
  PublicIP:
    Description: 'Public IP address of the web server'
    Value: !GetAtt WebServer.PublicIp
```

**AWS CDK (Cloud Development Kit):**
For developers preferring TypeScript, Python, or Java over YAML:
```python
# AWS CDK v2 Python example (CDK v1's `core` module is no longer supported)
from aws_cdk import Stack, aws_ec2 as ec2
from constructs import Construct

class WebServerStack(Stack):
    def __init__(self, scope: Construct, id: str, **kwargs):
        super().__init__(scope, id, **kwargs)
        
        vpc = ec2.Vpc(self, "VPC", max_azs=2)
        
        ec2.Instance(
            self, "Instance",
            instance_type=ec2.InstanceType("t3.micro"),
            machine_image=ec2.AmazonLinuxImage(),
            vpc=vpc
        )
```

### Azure Resource Manager (ARM) and Bicep
**ARM Templates:** JSON-based declarative syntax for Azure resources.

**Bicep:** Domain-specific language (DSL) transpiling to ARM, offering cleaner syntax:
```bicep
param location string = resourceGroup().location
param storageAccountName string

resource storageAccount 'Microsoft.Storage/storageAccounts@2023-01-01' = {
  name: storageAccountName
  location: location
  sku: {
    name: 'Standard_LRS'
  }
  kind: 'StorageV2'
  properties: {
    accessTier: 'Hot'
  }
}

output storageEndpoint string = storageAccount.properties.primaryEndpoints.blob
```

### Google Cloud: Modern IaC Approaches
Google Cloud's native **Deployment Manager** is being phased out in favor of other tooling, some of it maintained by Google itself. Google instead points users toward:

1.  **Terraform:** The preferred choice for GCP infrastructure (Google is a major Terraform contributor).
2.  **Config Connector:** Kubernetes-native approach where you manage GCP resources as Kubernetes custom resources (defined by CRDs).
3.  **Google Cloud CLI (gcloud):** For imperative scripting, though less suitable for production state management.

**Config Connector Example:**
```yaml
# IAMServiceAccount as a Kubernetes resource
apiVersion: iam.cnrm.cloud.google.com/v1beta1
kind: IAMServiceAccount
metadata:
  name: my-service-account
  namespace: config-control
spec:
  displayName: "Application Service Account"
---
# Storage Bucket as a Kubernetes resource
apiVersion: storage.cnrm.cloud.google.com/v1beta1
kind: StorageBucket
metadata:
  name: my-unique-bucket-name
  namespace: config-control
  annotations:
    cnrm.cloud.google.com/project-id: "my-project"
spec:
  location: US
  uniformBucketLevelAccess: true
  versioning:
    enabled: true
```

## 6.4 IaC Best Practices and Testing

Writing infrastructure code requires the same discipline as application code—testing, linting, and security scanning are non-negotiable in production environments.

### 6.4.1 Static Analysis and Security Scanning
Tools that analyze Terraform code without executing it:

**Checkov:** Scans for misconfigurations and compliance violations (CIS benchmarks, SOC2).
```bash
# Install and run Checkov
pip install checkov
checkov --directory ./terraform --framework terraform

# Output highlights issues like:
# - CKV_AWS_20: S3 bucket has public access
# - CKV_AWS_24: Security group allows ingress from 0.0.0.0/0 to port 22
# - CKV_AWS_63: IAM policy document allows "*" as a statement's actions
```
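
To give a feel for what such a scanner checks, here is a small Python sketch of one rule: flag ingress rules open to `0.0.0.0/0` on ports other than an allow-list. It operates on hand-written dictionaries that loosely mirror the security group blocks from earlier. It does not use Checkov's API, and `find_open_ingress` and the `bastion` group are invented for illustration.

```python
def find_open_ingress(security_groups: dict, allowed_ports=(443,)) -> list:
    """Return findings for ingress rules open to the world on non-allowed ports."""
    findings = []
    for name, group in security_groups.items():
        for rule in group.get("ingress", []):
            open_to_world = "0.0.0.0/0" in rule.get("cidr_blocks", [])
            ports = range(rule["from_port"], rule["to_port"] + 1)
            if open_to_world and any(port not in allowed_ports for port in ports):
                findings.append(
                    f"{name}: ports {rule['from_port']}-{rule['to_port']} open to 0.0.0.0/0"
                )
    return findings


security_groups = {
    "alb": {"ingress": [{"from_port": 443, "to_port": 443, "cidr_blocks": ["0.0.0.0/0"]}]},
    "app": {"ingress": [{"from_port": 8080, "to_port": 8080, "security_groups": ["alb"]}]},
    # Hypothetical misconfiguration added to trigger a finding:
    "bastion": {"ingress": [{"from_port": 22, "to_port": 22, "cidr_blocks": ["0.0.0.0/0"]}]},
}

for finding in find_open_ingress(security_groups):
    print(finding)
# bastion: ports 22-22 open to 0.0.0.0/0
```

Real scanners ship hundreds of such rules and parse HCL directly, which is why they are preferable to hand-rolled checks for anything beyond a single custom policy.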

**TFSec (now part of Trivy):** Security scanner specifically for Terraform:
```bash
trivy config --severity HIGH,CRITICAL ./terraform
```

**Terraform Validate:**
```bash
terraform fmt -check -recursive  # Ensure consistent formatting
terraform validate               # Check syntax and internal consistency (run after `terraform init`)
```

### 6.4.2 Testing Strategies

**1. Terratest (Go-based):**
Write automated tests for infrastructure using Go, deploying real resources in isolated environments and verifying behavior.

```go
// test/vpc_test.go
package test

import (
    "testing"
    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
)

func TestVpcCreation(t *testing.T) {
    terraformOptions := &terraform.Options{
        TerraformDir: "../examples/vpc",
        Vars: map[string]interface{}{
            "environment": "test",
            "cidr_block": "10.0.0.0/16",
        },
    }
    
    // Clean up at the end
    defer terraform.Destroy(t, terraformOptions)
    
    // Deploy
    terraform.InitAndApply(t, terraformOptions)
    
    // Validate outputs
    vpcId := terraform.Output(t, terraformOptions, "vpc_id")
    assert.NotEmpty(t, vpcId)
    
    // AWS SDK validation
    // Verify VPC actually exists, has correct CIDR, etc.
}
```

**2. Kitchen-Terraform (Ruby-based):**
Integration testing framework using Test Kitchen and InSpec for compliance verification.

**3. Plan-based Validation:**
In CI/CD, always run `terraform plan` and require human approval or automated policy checks (using Sentinel for Terraform Cloud/Enterprise, or OPA - Open Policy Agent) before apply.

```bash
# Generate plan file for review
terraform plan -out=tfplan
terraform show -json tfplan > plan.json

# Policy check (example with OPA)
opa eval --data policy.rego --input plan.json "data.terraform.deny"
```
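
When OPA is not available, a similar gate can be approximated with a short script over `plan.json`. The sketch below assumes the documented JSON plan layout, where each entry in `resource_changes` carries an `address`, a `type`, and `change.actions`. The protected-type list is a hypothetical policy choice, and the sample plan is hand-written rather than real Terraform output.

```python
# Resource types that must never be destroyed without manual sign-off
# (an illustrative policy, not a recommendation).
PROTECTED_TYPES = {"aws_vpc", "aws_db_instance"}


def denied_changes(plan: dict) -> list:
    """List planned changes that would destroy a protected resource."""
    violations = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # A replacement shows up as both "delete" and "create", so it is caught too.
        if "delete" in actions and rc.get("type") in PROTECTED_TYPES:
            violations.append(f"{rc['address']} would be destroyed")
    return violations


# Hand-written sample in the shape of `terraform show -json tfplan` output.
sample_plan = {
    "resource_changes": [
        {"address": "aws_subnet.public[0]", "type": "aws_subnet",
         "change": {"actions": ["update"]}},
        {"address": "aws_vpc.this", "type": "aws_vpc",
         "change": {"actions": ["delete", "create"]}},
    ]
}

problems = denied_changes(sample_plan)
print(problems)  # ['aws_vpc.this would be destroyed']
# In CI, exit non-zero when `problems` is non-empty to block the apply stage.
```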

### 6.4.3 GitOps for Infrastructure
GitOps applies version control and automated synchronization to infrastructure:

1.  **Declarative:** Infrastructure stored in Git as the single source of truth.
2.  **Versioned:** All changes are pull requests with audit trails.
3.  **Automated:** Agents (like Flux or ArgoCD for Kubernetes, or Terraform Cloud) automatically apply approved changes.
4.  **Observed:** Continuous monitoring ensures actual state matches desired state.

**Workflow:**
```
Developer → PR to Git → CI Validation (lint, security scan) 
→ Peer Review → Merge to Main → Terraform Cloud Trigger 
→ Plan & Apply → State Updated
```

## 6.5 Tool Selection Guide

| Criteria | Terraform | CloudFormation/ARM/Bicep | Pulumi |
|----------|-----------|-------------------------|---------|
| **Multi-cloud** | Native support | Single cloud only | Native support |
| **Language** | HCL (declarative) | JSON/YAML/Bicep | Python/TypeScript/Go |
| **State Management** | Self-managed or Terraform Cloud | Managed by cloud provider | Self-managed or Pulumi Cloud |
| **Drift Detection** | `terraform plan -refresh-only` or Terraform Cloud | CloudFormation drift detection; `what-if` for ARM/Bicep | `pulumi refresh` or Pulumi Cloud |
| **Modularity** | Excellent (modules registry) | Nested stacks | Packages/libraries |
| **Testing** | Terratest, Kitchen | TaskCat (CFN only) | Standard unit testing |

**Recommendation:**
-   **Production Multi-cloud:** Terraform is the industry standard.
-   **AWS-only with tight integration:** CloudFormation or CDK for native service integrations.
-   **Developer-centric teams:** Pulumi or CDK if your team prefers general-purpose programming languages over HCL.
-   **Kubernetes-heavy:** Consider Crossplane or Config Connector for unified control planes.

---

### Summary

In this chapter, we transformed infrastructure management from manual, error-prone clicking into automated, version-controlled code. You learned the declarative paradigm that separates "what" from "how," enabling idempotent, repeatable infrastructure deployments. We built a production-grade three-tier architecture using Terraform, mastering state management, remote backends, and modular design patterns that allow teams to collaborate safely. You explored cloud-native alternatives—CloudFormation, CDK, ARM Bicep, and Config Connector—understanding when each is appropriate. Finally, we established critical best practices: static security scanning with Checkov, automated testing with Terratest, and GitOps workflows that ensure your infrastructure remains compliant, observable, and aligned with your codebase.

Infrastructure as Code is the foundation of DevOps and platform engineering. With these skills, you can now provision entire environments in minutes, destroy them just as quickly to save costs, and maintain complete audit trails of infrastructure changes. However, provisioning infrastructure is only half the battle—the applications running on that infrastructure must be built, tested, and deployed automatically to realize the full promise of cloud agility.

**Next Up: Chapter 7 - CI/CD Pipelines in the Cloud**
In the next chapter, we will bridge the gap between code and deployment, learning to construct Continuous Integration and Continuous Delivery pipelines that automate the testing and release of your applications. You will build pipelines that trigger on every code commit, run automated tests, scan for vulnerabilities, and deploy to your newly codified infrastructure—achieving the full software delivery lifecycle automation that defines modern cloud-native engineering.

