# Chapter 67: Infrastructure as Code

## Learning Objectives

By the end of this chapter, you will be able to:

- Understand the concept of Infrastructure as Code (IaC) and its benefits for managing ML systems
- Distinguish between declarative and imperative infrastructure management
- Use Terraform to define and provision cloud resources for the NEPSE prediction system
- Write CloudFormation templates for AWS resources
- Automate server configuration with Ansible
- Define Kubernetes resources using YAML manifests and manage them with Helm charts
- Store IaC definitions in Git and integrate with CI/CD pipelines
- Manage secrets securely using tools like HashiCorp Vault or cloud secrets managers
- Apply best practices for IaC: modularity, versioning, state management, and testing

---

## Introduction

In earlier chapters, we deployed the NEPSE prediction system on various cloud platforms by manually clicking through web consoles or running one‑off scripts. As the system grows (more services, more environments such as dev, staging, and prod, and more team members), manual management becomes error‑prone, inconsistent, and hard to reproduce. **Infrastructure as Code (IaC)** solves these problems by treating infrastructure the same way we treat application code: defined in files, versioned in Git, tested, and automatically deployed.

IaC is the practice of managing and provisioning infrastructure through machine‑readable definition files, rather than physical hardware configuration or interactive configuration tools. It brings several benefits:

- **Consistency**: The same configuration can be applied repeatedly with the same result, which reduces configuration drift between environments.
- **Reproducibility**: Entire environments can be recreated from scratch.
- **Versioning**: Infrastructure changes are tracked in Git, enabling rollbacks and audits.
- **Automation**: Infrastructure can be provisioned as part of CI/CD pipelines.
- **Documentation**: The code itself documents the infrastructure.

For the NEPSE system, IaC allows us to define our cloud resources (S3 buckets, databases, Kubernetes clusters) in code, and spin up identical environments for development, testing, and production with a single command.

In this chapter, we will explore the main IaC tools and apply them to the NEPSE project. We'll use **Terraform** and **AWS CloudFormation** for cloud resource provisioning, **Ansible** for server configuration, **Kubernetes manifests** for container orchestration, and **Helm** for packaging. We'll also discuss secrets management, CI/CD integration, and best practices.

---

## 67.1 Infrastructure as Code Fundamentals

### 67.1.1 Declarative vs. Imperative

- **Declarative** approach: You specify the desired end state, and the tool figures out how to achieve it. Examples: Terraform, CloudFormation, Kubernetes YAML.
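- **Imperative** approach: You specify the step‑by‑step commands needed to reach the desired state. Examples: shell scripts and ad hoc CLI calls. Ansible sits in between: a playbook runs its tasks in order, but most modules describe a desired state (for example, `state: present`) and are idempotent.

Declarative IaC is generally preferred because applying the same definition twice yields the same result (idempotency), and because the files describe what the infrastructure should look like, which makes changes easier to review.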
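To make the declarative idea concrete, here is a minimal Python sketch of what a declarative tool does conceptually: compare a desired state with the current state and derive the actions. The resource names and the `plan`/`apply` helpers are hypothetical and only illustrate the principle; they do not reflect how Terraform or CloudFormation are implemented.

```python
from typing import Dict, List, Tuple

Resource = Dict[str, str]


def plan(desired: Dict[str, Resource], current: Dict[str, Resource]) -> List[Tuple[str, str]]:
    """Compare desired and current state and return the actions needed."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in current:
            actions.append(("create", name))
        elif current[name] != spec:
            actions.append(("update", name))
    for name in sorted(current):
        if name not in desired:
            actions.append(("delete", name))
    return actions


def apply(desired: Dict[str, Resource]) -> Dict[str, Resource]:
    """Pretend to carry out the plan; the new current state equals the desired state."""
    return {name: dict(spec) for name, spec in desired.items()}


# Hypothetical example resources
desired = {
    "raw_bucket": {"versioning": "enabled"},
    "models_bucket": {"versioning": "enabled"},
}
current = {"raw_bucket": {"versioning": "disabled"}, "old_bucket": {}}

first_plan = plan(desired, current)   # create, update, and delete actions
current = apply(desired)
second_plan = plan(desired, current)  # empty: applying again changes nothing
```

The second plan is empty, which is the idempotency property: once the real state matches the definition, re-applying it is a no-op.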

### 67.1.2 Core Concepts

- **Resource**: A discrete piece of infrastructure (e.g., an EC2 instance, an S3 bucket).
- **Provider**: A plugin that interacts with a cloud API (e.g., AWS, Azure, GCP).
- **State**: The current state of your infrastructure as recorded by the tool. Used to plan changes.
- **Module**: A reusable group of resources (like a Terraform module or a CloudFormation nested stack).

### 67.1.3 IaC in the ML Lifecycle

For the NEPSE system, IaC can manage:

- **Data storage**: S3 buckets for raw data, processed features, and model artifacts.
- **Compute**: SageMaker instances for training, EC2 for batch jobs, EKS for serving.
- **Networking**: VPCs, subnets, security groups.
- **Databases**: RDS for metadata, Redis for online feature store.
- **Serverless functions**: Lambda for lightweight inference.
- **CI/CD infrastructure**: Build servers, artifact repositories.

All of these can be defined in code and versioned alongside the application code.

---

## 67.2 Terraform

Terraform by HashiCorp is one of the most widely used IaC tools. It uses a declarative language, HCL (HashiCorp Configuration Language), and supports many cloud providers through provider plugins.

### 67.2.1 Installing Terraform

```bash
# On macOS
brew install terraform

# On Debian/Ubuntu
wget -O- https://apt.releases.hashicorp.com/gpg | gpg --dearmor | sudo tee /usr/share/keyrings/hashicorp-archive-keyring.gpg > /dev/null
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt update && sudo apt install terraform
```

### 67.2.2 Basic Terraform Configuration for NEPSE

Let's create a simple Terraform configuration that sets up an S3 bucket for storing NEPSE data and model artifacts.

**File: `main.tf`**

```hcl
# Configure the AWS provider
provider "aws" {
  region = "us-east-1"
}

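# Create an S3 bucket for raw data
# (new buckets are private by default, so no ACL is set)
resource "aws_s3_bucket" "nepse_raw_data" {
  bucket = "nepse-raw-data-${random_string.suffix.result}"

  tags = {
    Name        = "NEPSE Raw Data"
    Environment = "Production"
  }
}

# Versioning is configured as a separate resource in AWS provider v4 and later
resource "aws_s3_bucket_versioning" "nepse_raw_data" {
  bucket = aws_s3_bucket.nepse_raw_data.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Create an S3 bucket for model artifacts
resource "aws_s3_bucket" "nepse_models" {
  bucket = "nepse-models-${random_string.suffix.result}"

  tags = {
    Name        = "NEPSE Models"
    Environment = "Production"
  }
}

resource "aws_s3_bucket_versioning" "nepse_models" {
  bucket = aws_s3_bucket.nepse_models.id

  versioning_configuration {
    status = "Enabled"
  }
}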

# Generate a random suffix to ensure globally unique bucket names
resource "random_string" "suffix" {
  length  = 8
  special = false
  upper   = false
}

# Output the bucket names
output "raw_data_bucket" {
  value = aws_s3_bucket.nepse_raw_data.bucket
}

output "models_bucket" {
  value = aws_s3_bucket.nepse_models.bucket
}
```

**Explanation:**  
- We specify the AWS provider and region.
- We define two S3 buckets, each with versioning enabled (important for data and model lineage).
- Because S3 bucket names must be globally unique, we use a random suffix generated by the `random_string` resource.
- Finally, we output the bucket names so they can be used elsewhere (e.g., in training scripts).
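As an illustration of how a training script might consume these outputs, the following sketch parses the JSON that `terraform output -json` prints. The sample bucket names are made up, and the helper name `parse_terraform_outputs` is hypothetical.

```python
import json
from typing import Dict


def parse_terraform_outputs(output_json: str) -> Dict[str, str]:
    """Map each Terraform output name to its value.

    `terraform output -json` wraps every output in an object that has
    "sensitive", "type", and "value" keys; only "value" is kept here.
    """
    raw = json.loads(output_json)
    return {name: entry["value"] for name, entry in raw.items()}


# Sample text in the shape printed by `terraform output -json`
# (the bucket names are invented for this example).
sample = """
{
  "raw_data_bucket": {"sensitive": false, "type": "string", "value": "nepse-raw-data-ab12cd34"},
  "models_bucket": {"sensitive": false, "type": "string", "value": "nepse-models-ab12cd34"}
}
"""

buckets = parse_terraform_outputs(sample)
print(buckets["models_bucket"])
```

In a real script, the JSON text would typically come from something like `subprocess.run(["terraform", "output", "-json"], capture_output=True, text=True, check=True).stdout`, run in the directory that holds the configuration.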

### 67.2.3 Initialising and Applying

```bash
terraform init      # Downloads provider plugins and sets up backend
terraform plan      # Shows what changes will be made
terraform apply     # Creates the resources (prompts for confirmation)
```

After applying, Terraform writes a `terraform.tfstate` file that records the resources it manages. This file is critical: it can contain sensitive values, so keep it out of Git, and for team use store it remotely (e.g., in an S3 bucket with locking via DynamoDB).
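For automation, the same three steps are usually run non‑interactively, with the plan saved to a file so that exactly the reviewed plan is applied. The sketch below only assembles the commands; the helper names and the `tfplan` file name are assumptions for illustration.

```python
import subprocess
from typing import Callable, List


def terraform_workflow(plan_file: str = "tfplan") -> List[List[str]]:
    """Return the init/plan/apply commands for a non-interactive run."""
    return [
        ["terraform", "init", "-input=false"],
        # Save the plan so that apply executes exactly what was reviewed
        ["terraform", "plan", "-input=false", f"-out={plan_file}"],
        # Applying a saved plan file does not prompt for confirmation
        ["terraform", "apply", "-input=false", plan_file],
    ]


def run_workflow(run: Callable = subprocess.run) -> None:
    """Run each command in order and stop at the first failure."""
    for command in terraform_workflow():
        run(command, check=True)


if __name__ == "__main__":
    for command in terraform_workflow():
        print(" ".join(command))
```

Passing the runner in as a parameter keeps the sketch easy to test without Terraform installed; calling `run_workflow()` with the default runner would execute the real commands.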

### 67.2.4 Remote State Management

For teams, store the state file remotely:

**File: `backend.tf`**

```hcl
terraform {
  backend "s3" {
    bucket         = "my-terraform-state-bucket"
    key            = "nepse/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
```

**Explanation:**  
This configures Terraform to store the state in an S3 bucket and use DynamoDB for state locking to prevent concurrent modifications. You must create the S3 bucket and DynamoDB table separately (or bootstrap them with a separate Terraform configuration).
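The purpose of state locking can be illustrated with a toy model: a lock is taken with a conditional write that succeeds only when no lock exists for that state key. The `LockTable` class below is an in‑memory stand‑in written for this explanation, not the actual mechanism Terraform or DynamoDB uses.

```python
from typing import Dict


class LockTable:
    """In-memory stand-in for a lock table keyed by state path."""

    def __init__(self) -> None:
        self._locks: Dict[str, str] = {}

    def acquire(self, state_key: str, owner: str) -> bool:
        # Conditional write: succeed only if nobody holds the lock.
        if state_key in self._locks:
            return False
        self._locks[state_key] = owner
        return True

    def release(self, state_key: str, owner: str) -> bool:
        # Only the current holder may release the lock.
        if self._locks.get(state_key) == owner:
            del self._locks[state_key]
            return True
        return False


table = LockTable()
key = "nepse/terraform.tfstate"

alice_got_lock = table.acquire(key, "alice")  # first apply takes the lock
bob_got_lock = table.acquire(key, "bob")      # concurrent apply is refused
table.release(key, "alice")
bob_retry = table.acquire(key, "bob")         # succeeds once the lock is free
```

Without such a lock, two engineers applying at the same time could each write a state file that overwrites the other's changes.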

### 67.2.5 Modules

Terraform modules allow you to package and reuse infrastructure components. For example, we could create a module for a standard NEPSE environment (VPC, subnets, S3 buckets, EKS cluster) and reuse it for dev, staging, and prod.

**Example module usage:**

```hcl
module "nepse_dev_env" {
  source      = "./modules/nepse_environment"
  environment = "dev"
  vpc_cidr    = "10.0.0.0/16"
  # ... other variables
}
```
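To show how one module can serve several environments, the sketch below generates equivalent module blocks in Terraform's JSON configuration syntax (files ending in `.tf.json`). The staging and prod CIDR ranges are made‑up values, and in practice many teams keep one root configuration per environment rather than generating files like this.

```python
import json
from typing import Dict

# Hypothetical per-environment settings; only the dev CIDR comes from the text.
ENVIRONMENTS: Dict[str, str] = {
    "dev": "10.0.0.0/16",
    "staging": "10.1.0.0/16",
    "prod": "10.2.0.0/16",
}


def module_blocks(environments: Dict[str, str]) -> dict:
    """Build one module call per environment, all sharing the same source."""
    return {
        "module": {
            f"nepse_{env}_env": {
                "source": "./modules/nepse_environment",
                "environment": env,
                "vpc_cidr": cidr,
            }
            for env, cidr in environments.items()
        }
    }


config = module_blocks(ENVIRONMENTS)
print(json.dumps(config, indent=2))  # could be saved as e.g. environments.tf.json
```

The point is that the module source is written once, and each environment differs only in the variables passed to it.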

### 67.2.6 Managing Kubernetes with Terraform

Terraform can also provision Kubernetes resources via the Kubernetes provider. For example, after creating an EKS cluster, we can deploy our prediction service:

```hcl
resource "kubernetes_deployment" "nepse_predictor" {
  metadata {
    name      = "nepse-predictor"
    namespace = "default"
  }

  spec {
    replicas = 3

    selector {
      match_labels = {
        app = "nepse-predictor"
      }
    }

    template {
      metadata {
        labels = {
          app = "nepse-predictor"
        }
      }

      spec {
        container {
          image = "myregistry/nepse-predictor:v1"
          name  = "predictor"
          port {
            container_port = 8000
          }
        }
      }
    }
  }
}
```

**Explanation:**  
This is the Terraform (HCL) equivalent of a Kubernetes Deployment manifest. It defines a deployment with three replicas, using the Docker image `myregistry/nepse-predictor:v1`. Terraform applies it to the cluster once the Kubernetes provider is configured with credentials for that cluster (for example, via a kubeconfig file).

---

## 67.3 AWS CloudFormation

CloudFormation is AWS's native IaC service. Templates are written in JSON or YAML. It is deeply integrated with AWS and supports almost all AWS resources.

### 67.3.1 Basic CloudFormation Template

**File: `nepse-bucket.yaml`**

```yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: 'NEPSE S3 buckets'

Parameters:
  Environment:
    Type: String
    Default: prod
    AllowedValues: [dev, staging, prod]

Resources:
  RawDataBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Sub 'nepse-raw-${Environment}-${AWS::AccountId}'
      VersioningConfiguration:
        Status: Enabled
      Tags:
        - Key: Name
          Value: !Sub 'NEPSE Raw Data ${Environment}'

  ModelsBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Sub 'nepse-models-${Environment}-${AWS::AccountId}'
      VersioningConfiguration:
        Status: Enabled
      Tags:
        - Key: Name
          Value: !Sub 'NEPSE Models ${Environment}'

Outputs:
  RawDataBucketName:
    Value: !Ref RawDataBucket
  ModelsBucketName:
    Value: !Ref ModelsBucket
```

**Explanation:**  
- `Parameters` allow us to pass in the environment name.
- `Resources` define the two S3 buckets. The bucket names include the environment and account ID to ensure uniqueness.
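- `!Sub` is the short form of the CloudFormation intrinsic function `Fn::Sub`, which substitutes parameters and pseudo parameters such as `${AWS::AccountId}` into a string.
- `Outputs` expose the bucket names so that scripts can read them from the stack. To reference them from another stack with `Fn::ImportValue`, each output also needs an `Export` name.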
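The substitution performed on the bucket names can be mimicked in a few lines of Python. This is a simplified illustration only: the `sub` helper is hypothetical, and the real `Fn::Sub` supports more (for example, resource attributes and literal escapes). The account ID is a placeholder.

```python
import re
from typing import Dict


def sub(template: str, values: Dict[str, str]) -> str:
    """Replace each ${Name} placeholder with its value from `values`."""
    def replace(match: "re.Match[str]") -> str:
        return values[match.group(1)]

    return re.sub(r"\$\{([A-Za-z0-9:]+)\}", replace, template)


values = {
    "Environment": "dev",               # the template parameter
    "AWS::AccountId": "123456789012",   # pseudo parameter (placeholder value)
}

bucket_name = sub("nepse-raw-${Environment}-${AWS::AccountId}", values)
print(bucket_name)  # nepse-raw-dev-123456789012
```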

### 67.3.2 Deploying with CloudFormation

You can deploy via AWS CLI:

```bash
aws cloudformation deploy \
  --template-file nepse-bucket.yaml \
  --stack-name nepse-storage \
  --parameter-overrides Environment=dev
```

CloudFormation manages the state for you, and you can update or delete the stack later.
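Once a stack is deployed, scripts often need its outputs. The sketch below turns a `describe_stacks` response into a plain dictionary; the sample response is hand‑written in the shape the API returns, and the helper name `stack_outputs` is hypothetical.

```python
from typing import Dict


def stack_outputs(describe_stacks_response: dict) -> Dict[str, str]:
    """Extract OutputKey -> OutputValue pairs from the first stack in the response."""
    stack = describe_stacks_response["Stacks"][0]
    return {
        output["OutputKey"]: output["OutputValue"]
        for output in stack.get("Outputs", [])
    }


# Hand-written sample in the shape of a describe_stacks response
# (account ID and names are placeholders).
sample_response = {
    "Stacks": [
        {
            "StackName": "nepse-storage",
            "Outputs": [
                {"OutputKey": "RawDataBucketName", "OutputValue": "nepse-raw-dev-123456789012"},
                {"OutputKey": "ModelsBucketName", "OutputValue": "nepse-models-dev-123456789012"},
            ],
        }
    ]
}

outputs = stack_outputs(sample_response)
print(outputs["ModelsBucketName"])
```

With boto3 installed and credentials configured, the response would come from `boto3.client("cloudformation").describe_stacks(StackName="nepse-storage")`.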

---

## 67.4 Ansible

Ansible is an automation tool that can configure servers, install software, and deploy applications. It is agentless (uses SSH) and uses YAML playbooks.

### 67.4.1 Ansible Playbook for Setting Up a Prediction Server

Suppose we have an EC2 instance that will serve our NEPSE model. Ansible can install dependencies, copy the model, and start the service.

**File: `nepse-server.yml`**

```yaml
---
- name: Configure NEPSE prediction server
  hosts: prediction_servers
  become: yes
  vars:
    model_version: v1.2
    repo_url: https://github.com/org/nepse-predictor.git

  tasks:
    - name: Update apt cache
      apt:
        update_cache: yes

    - name: Install Python and pip
      apt:
        name:
          - python3
          - python3-pip
          - python3-venv
        state: present

    - name: Create app directory
      file:
        path: /opt/nepse
        state: directory
        owner: ubuntu
        group: ubuntu

    - name: Clone repository
      git:
        repo: "{{ repo_url }}"
        dest: /opt/nepse/app
        version: "{{ model_version }}"

    - name: Install Python dependencies into a virtual environment
      pip:
        requirements: /opt/nepse/app/requirements.txt
        virtualenv: /opt/nepse/venv
        virtualenv_command: python3 -m venv

    # Requires the amazon.aws collection and boto3 on the target host
    - name: Copy model file (from S3)
      amazon.aws.s3_object:
        bucket: nepse-models-prod
        object: "models/{{ model_version }}/model.pkl"
        dest: /opt/nepse/app/model.pkl
        mode: get

    - name: Create systemd service
      template:
        src: nepse-predictor.service.j2
        dest: /etc/systemd/system/nepse-predictor.service
      notify: restart nepse

    - name: Start and enable service
      systemd:
        name: nepse-predictor
        state: started
        enabled: yes
        daemon_reload: yes  # pick up the newly written unit file

  handlers:
    - name: restart nepse
      systemd:
        name: nepse-predictor
        state: restarted
        daemon_reload: yes
```

**Explanation:**  
- The playbook runs on hosts in the `prediction_servers` group.
- It installs dependencies, clones the code from Git, sets up a virtual environment, downloads the model from S3, and installs a systemd service.
- The `template` module uses a Jinja2 template (`nepse-predictor.service.j2`) to generate the service file.
- Handlers restart the service if the template changes.

Ansible is great for configuration management, but for provisioning cloud resources, Terraform or CloudFormation are more suitable. They are often used together: Terraform provisions the servers, and Ansible configures them.
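One way to connect the two tools is to render an Ansible inventory from the addresses of the servers that were provisioned. The sketch below assumes the IP addresses are already available (for example, from a Terraform output); the addresses and the `render_inventory` helper are hypothetical.

```python
from typing import List


def render_inventory(group: str, hosts: List[str], user: str = "ubuntu") -> str:
    """Render an INI-style Ansible inventory with a single host group."""
    lines = [f"[{group}]"]
    lines += [f"{host} ansible_user={user}" for host in hosts]
    return "\n".join(lines) + "\n"


# Placeholder addresses; in practice these would come from the provisioning step.
server_ips = ["10.0.1.10", "10.0.1.11"]

inventory = render_inventory("prediction_servers", server_ips)
print(inventory)
```

Saved as `inventory.ini`, this file could then be used with `ansible-playbook -i inventory.ini nepse-server.yml`, since the group name matches the `hosts:` line of the playbook.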

---

## 67.5 Kubernetes Manifests

For containerised applications (like our NEPSE prediction API), Kubernetes is the de facto orchestration platform. Kubernetes resources are defined in YAML manifests.

### 67.5.1 Basic Deployment and Service

**File: `nepse-deployment.yaml`**

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nepse-predictor
  labels:
    app: nepse-predictor
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nepse-predictor
  template:
    metadata:
      labels:
        app: nepse-predictor
    spec:
      containers:
      - name: predictor
        image: myregistry/nepse-predictor:v1.2
        ports:
        - containerPort: 8000
        env:
        - name: MODEL_PATH
          value: /app/model.pkl
        - name: REDIS_HOST
          value: redis-service
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "1Gi"
            cpu: "1000m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8000
          initialDelaySeconds: 5
          periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: nepse-predictor
spec:
  selector:
    app: nepse-predictor
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8000
  type: LoadBalancer
```

**Explanation:**  
- The Deployment defines the desired state: 3 replicas of the container, with resource requests/limits, environment variables, and health probes.
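- The Service gives the pods a stable address. A `ClusterIP` Service is reachable only inside the cluster, while a `LoadBalancer` Service asks the cloud provider to provision an external load balancer. Here, we use `LoadBalancer` to get a public endpoint.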
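A frequent mistake in manifests like these is a selector that does not match the pod labels, in which case the Service routes to nothing. The following sketch checks this on manifests that have already been loaded into Python dictionaries; the `selectors_match` helper is a hypothetical illustration, not part of any Kubernetes client library.

```python
def selectors_match(deployment: dict, service: dict) -> bool:
    """Check that the Deployment and Service selectors both select the pod template."""
    pod_labels = deployment["spec"]["template"]["metadata"]["labels"]
    deployment_selector = deployment["spec"]["selector"]["matchLabels"]
    service_selector = service["spec"]["selector"]

    def selects_pods(selector: dict) -> bool:
        # Every key/value pair in the selector must appear in the pod labels.
        return all(pod_labels.get(key) == value for key, value in selector.items())

    return selects_pods(deployment_selector) and selects_pods(service_selector)


# Trimmed-down versions of the manifests, as dictionaries
deployment = {
    "spec": {
        "selector": {"matchLabels": {"app": "nepse-predictor"}},
        "template": {"metadata": {"labels": {"app": "nepse-predictor"}}},
    }
}
service = {"spec": {"selector": {"app": "nepse-predictor"}}}
broken_service = {"spec": {"selector": {"app": "nepse-predictr"}}}  # typo in label

ok = selectors_match(deployment, service)
broken = selectors_match(deployment, broken_service)
```

A check of this kind could run in CI before `kubectl apply`, after parsing the YAML with a library of your choice.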

### 67.5.2 Managing with `kubectl`

```bash
kubectl apply -f nepse-deployment.yaml
```

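To update the image (e.g., to version v1.3), edit the file and reapply so that Git stays the source of truth. For a quick imperative change you can also use the command below, but the cluster will then differ from the manifest until the file is updated: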

```bash
kubectl set image deployment/nepse-predictor predictor=myregistry/nepse-predictor:v1.3
```

---

## 67.6 Helm Charts

Helm is a package manager for Kubernetes. It allows you to define, install, and upgrade even the most complex Kubernetes applications as a single unit, called a chart. Charts are reusable and can be versioned.

### 67.6.1 Chart Structure

A Helm chart has a standard directory layout:

```
nepse-chart/
  Chart.yaml          # Metadata
  values.yaml         # Default configuration values
  templates/          # Kubernetes manifest templates
    deployment.yaml
    service.yaml
    _helpers.tpl
    ...
```

### 67.6.2 Example `values.yaml`

```yaml
# Default values for nepse-chart.
replicaCount: 3
image:
  repository: myregistry/nepse-predictor
  tag: v1.2
  pullPolicy: IfNotPresent

service:
  type: LoadBalancer
  port: 80

resources:
  requests:
    memory: 512Mi
    cpu: 500m
  limits:
    memory: 1Gi
    cpu: 1000m

env:
  MODEL_PATH: /app/model.pkl
  REDIS_HOST: redis-service
```

### 67.6.3 Templated Deployment

**File: `templates/deployment.yaml`**

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "nepse-chart.fullname" . }}
  labels:
    app.kubernetes.io/name: {{ include "nepse-chart.name" . }}
    helm.sh/chart: {{ include "nepse-chart.chart" . }}
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      app.kubernetes.io/name: {{ include "nepse-chart.name" . }}
  template:
    metadata:
      labels:
        app.kubernetes.io/name: {{ include "nepse-chart.name" . }}
    spec:
      containers:
        - name: predictor
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          ports:
            - containerPort: 8000
          env:
            {{- range $key, $value := .Values.env }}
            - name: {{ $key }}
              value: {{ $value | quote }}
            {{- end }}
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
```

**Explanation:**  
Helm templates use Go templating. Values from `values.yaml` are injected. This allows the same chart to be used for different environments by overriding values.
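Overriding works by merging the override file on top of the defaults: nested maps are merged key by key, and other values are replaced. The sketch below imitates that behaviour in Python; the `merge_values` helper and the dev override values are assumptions for illustration, not Helm's own code.

```python
def merge_values(defaults: dict, overrides: dict) -> dict:
    """Recursively merge `overrides` into a copy of `defaults`."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = value
    return merged


# A subset of the chart defaults
defaults = {
    "replicaCount": 3,
    "image": {"repository": "myregistry/nepse-predictor", "tag": "v1.2"},
    "service": {"type": "LoadBalancer", "port": 80},
}

# Hypothetical overrides for a dev environment
dev_overrides = {
    "replicaCount": 1,
    "image": {"tag": "v1.3-rc1"},
    "service": {"type": "ClusterIP"},
}

dev_values = merge_values(defaults, dev_overrides)
print(dev_values["image"])
```

Only the keys named in the override change: the image repository and the service port keep their default values.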

### 67.6.4 Installing with Helm

```bash
helm install nepse-prod ./nepse-chart --values prod-values.yaml
```

To upgrade:

```bash
helm upgrade nepse-prod ./nepse-chart --values prod-values.yaml
```

---

## 67.7 Secrets Management

Never store secrets (passwords, API keys) in plain text in your IaC files. Use a secrets management tool.

### 67.7.1 HashiCorp Vault

Vault can store and control access to secrets. Applications can retrieve secrets at runtime.

### 67.7.2 Cloud Secrets Managers

- **AWS Secrets Manager**
- **AWS Systems Manager Parameter Store** (for less sensitive data)
- **Google Cloud Secret Manager**
- **Azure Key Vault**

### 67.7.3 Using Secrets in Terraform

Terraform can retrieve secrets from AWS Secrets Manager and use them in resource definitions.

```hcl
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "nepse-db-password"
}

resource "aws_db_instance" "nepse_db" {
  # ...
  password = data.aws_secretsmanager_secret_version.db_password.secret_string
}
```

**Explanation:**  
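The `data` source fetches the secret value when Terraform plans and applies. Note that Terraform still records the value in its state file. Encrypting the backend protects the state at rest, but anyone who can read the state can read the secret, so restrict access to the state accordingly.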
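Applications can also read the secret themselves at runtime instead of receiving it through configuration. The sketch below shows the shape of such a lookup against a Secrets Manager client; the client is passed in so the example runs without AWS access, and `FakeSecretsClient` is a stand‑in written only for this illustration.

```python
def load_db_password(client, secret_id: str = "nepse-db-password") -> str:
    """Fetch a plain-string secret through a Secrets Manager style client."""
    response = client.get_secret_value(SecretId=secret_id)
    return response["SecretString"]


class FakeSecretsClient:
    """Stand-in for boto3.client("secretsmanager"), for local illustration only."""

    def __init__(self, secrets: dict) -> None:
        self._secrets = secrets

    def get_secret_value(self, SecretId: str) -> dict:
        return {"Name": SecretId, "SecretString": self._secrets[SecretId]}


fake_client = FakeSecretsClient({"nepse-db-password": "example-only-password"})
password = load_db_password(fake_client)
```

With boto3 installed and credentials configured, the same function could be called with `boto3.client("secretsmanager")` in place of the fake client.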

### 67.7.4 Using Secrets in Kubernetes

For Kubernetes, you can use:

- **Secrets** objects (base64 encoded, but not encrypted by default). For production, enable encryption at rest.
- **External Secrets Operator** to sync secrets from Vault/AWS Secrets Manager into Kubernetes Secrets.
- **Sealed Secrets** for GitOps: encrypt secrets into a SealedSecret resource that can be stored in Git.
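The remark about base64 is worth seeing in code: encoding is reversible by anyone and needs no key, so it offers no confidentiality. A minimal demonstration with a made‑up password:

```python
import base64

plaintext = "s3cr3t-password"  # made-up example value

# What ends up in the `data` field of a Kubernetes Secret manifest
encoded = base64.b64encode(plaintext.encode("utf-8")).decode("ascii")

# Anyone who can read the manifest can reverse it without any key
decoded = base64.b64decode(encoded).decode("utf-8")

print(encoded)
print(decoded == plaintext)
```

This is why a plain Secret manifest should not be committed to Git, and why encryption at rest or a tool such as Sealed Secrets is needed.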

---

## 67.8 CI/CD Integration

IaC should be integrated into your CI/CD pipelines. For example, you can run `terraform plan` on every pull request and, after approval and merge to the main branch, run `terraform apply`.

### 67.8.1 GitHub Actions for Terraform

```yaml
name: Terraform CI/CD

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  terraform:
    runs-on: ubuntu-latest
    # Credentials are needed by init, plan, and apply, so set them once per job
    env:
      AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
      AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    steps:
    - uses: actions/checkout@v4

    - name: Setup Terraform
      uses: hashicorp/setup-terraform@v3
      with:
        terraform_version: 1.3.0

    - name: Terraform Init
      run: terraform init -input=false

    # A failed plan fails the job, so apply never runs on a broken configuration
    - name: Terraform Plan
      id: plan
      run: terraform plan -no-color -input=false

    - name: Terraform Apply (on main)
      if: github.ref == 'refs/heads/main' && github.event_name == 'push'
      run: terraform apply -auto-approve -input=false
```

**Explanation:**  
- The workflow runs on every push to `main` and on every pull request that targets `main`.
- It initialises Terraform and runs a plan. On a push to `main` (for example, after a merge), it also applies the changes automatically with `-auto-approve`. In a production setup, you might want a manual approval step.
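A pipeline can decide whether an approval step is needed by running `terraform plan -detailed-exitcode`, which returns 0 when there are no changes, 1 on error, and 2 when changes are pending. The helper below, whose name and return labels are hypothetical, interprets that exit code.

```python
def interpret_plan_exit_code(code: int) -> str:
    """Translate the exit code of `terraform plan -detailed-exitcode`."""
    if code == 0:
        return "no-changes"       # plan succeeded, nothing to apply
    if code == 2:
        return "changes-pending"  # plan succeeded, there is a diff to review
    return "error"                # 1 (or anything else) means the plan failed


for exit_code in (0, 1, 2):
    print(exit_code, interpret_plan_exit_code(exit_code))
```

A CI job could skip the apply stage on `no-changes`, request approval on `changes-pending`, and fail the build on `error`.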

---

## 67.9 Best Practices for IaC

1. **Use version control**: All IaC files should be in Git.
2. **Modularise**: Break large configurations into reusable modules.
3. **Use remote state**: Store state remotely with locking.
4. **Tag resources**: Apply consistent tags for cost tracking and management.
5. **Validate changes**: Run `terraform fmt -check`, `terraform validate`, and `terraform plan` in CI to catch errors before they reach production.
6. **Manage secrets securely**: Never hard‑code secrets.
7. **Pin provider versions**: Specify provider versions to avoid unexpected changes.
8. **Test infrastructure**: Use tools like Terratest to write automated tests for your infrastructure.
9. **Document**: Explain why certain resources are created, not just how.
10. **Destroy unused resources**: Regularly review and remove unused infrastructure to save costs.
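Some of these practices can be enforced automatically. As one example, the sketch below checks a Terraform plan (exported with `terraform plan -out=tfplan` followed by `terraform show -json tfplan`) for resources that lack required tags. The required tag set, the sample plan, and the helper name are assumptions for illustration.

```python
from typing import Dict, Iterable, List

REQUIRED_TAGS = ("Name", "Environment")  # hypothetical tagging policy


def untagged_resources(plan: dict, required: Iterable[str] = REQUIRED_TAGS) -> Dict[str, List[str]]:
    """Return {resource address: missing tags} for taggable resources in a plan."""
    problems: Dict[str, List[str]] = {}
    for change in plan.get("resource_changes", []):
        after = change.get("change", {}).get("after") or {}
        if "tags" not in after:
            continue  # this resource type has no tags attribute
        tags = after.get("tags") or {}
        missing = sorted(set(required) - set(tags))
        if missing:
            problems[change["address"]] = missing
    return problems


# Hand-written excerpt in the shape of `terraform show -json` plan output
sample_plan = {
    "resource_changes": [
        {
            "address": "aws_s3_bucket.nepse_raw_data",
            "change": {"after": {"tags": {"Name": "NEPSE Raw Data", "Environment": "Production"}}},
        },
        {
            "address": "aws_s3_bucket.scratch",
            "change": {"after": {"tags": {"Name": "Scratch"}}},
        },
        {
            "address": "random_string.suffix",
            "change": {"after": {"length": 8}},
        },
    ]
}

violations = untagged_resources(sample_plan)
print(violations)
```

Failing the CI job when `violations` is non‑empty turns the tagging guideline into a rule that is checked on every change.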

---

## Chapter Summary

In this chapter, we explored Infrastructure as Code and its application to the NEPSE prediction system. We covered:

- The principles of IaC and its benefits: consistency, reproducibility, versioning, automation.
- Terraform for provisioning cloud resources (S3 buckets, databases, etc.) with examples.
- AWS CloudFormation as an alternative.
- Ansible for configuring servers.
- Kubernetes manifests and Helm charts for deploying containerised applications.
- Secrets management using cloud services and tools.
- Integrating IaC into CI/CD pipelines.
- Best practices for maintaining IaC at scale.

By adopting IaC, the NEPSE system becomes fully reproducible and manageable. The same code can spin up development, staging, and production environments, reducing errors and increasing confidence. Infrastructure is no longer a black box but a versioned, auditable part of the project.

In the next chapter, we will discuss **Monitoring and Alerting**, ensuring that once our infrastructure is in place, we can keep track of its health and performance.

---

**End of Chapter 67**