# Chapter 69: Cost Management

## Learning Objectives

By the end of this chapter, you will be able to:

- Understand the key drivers of cloud costs in a machine learning system
- Implement cost optimisation strategies for compute, storage, and networking
- Choose the right pricing models (on‑demand, spot, reserved) for different workloads
- Use auto‑scaling to match resource consumption with demand
- Apply tagging and cost allocation to track spending by project, team, or environment
- Set up budgets and alerts to avoid unexpected bills
- Analyse cost data using cloud provider tools (AWS Cost Explorer, GCP Cost Management)
- Adopt FinOps principles to foster a culture of cost awareness
- Apply these practices to the NEPSE prediction system to keep it affordable

---

## Introduction

Running a production machine learning system like the NEPSE predictor incurs costs: compute instances for training and serving, storage for data and models, networking for data transfer, and managed services (databases, message queues). Without careful management, these costs can spiral out of control, especially as the system scales.

**Cost management** is the practice of understanding, controlling, and optimising cloud spending. It is not a one‑time activity but an ongoing process that involves technical decisions (e.g., choosing instance types) and organisational practices (e.g., tagging resources). The goal is to maximise the business value delivered per dollar spent.

In this chapter, we will explore cost management strategies tailored to the NEPSE prediction system. We'll cover how to choose the right pricing models, right‑size resources, use auto‑scaling, and monitor costs with cloud provider tools. We'll also introduce **FinOps**, a cultural shift that brings together engineering, finance, and business to manage cloud costs collaboratively.

---

## 69.1 Cost Optimisation Strategies

Cost optimisation in the cloud can be approached from several angles:

1. **Right‑sizing**: Matching instance types and sizes to actual workload requirements.
2. **Pricing models**: Using spot/preemptible instances for fault‑tolerant workloads, reserved instances for steady state.
3. **Auto‑scaling**: Dynamically adjusting resources to meet demand, avoiding over‑provisioning.
4. **Storage optimisation**: Choosing appropriate storage tiers, deleting unused data, compressing data.
5. **Network optimisation**: Minimising data transfer costs, using content delivery networks (CDNs).
6. **Architecture optimisation**: Using serverless where appropriate, eliminating idle resources.

For the NEPSE system, we will apply these strategies to:

- Training jobs (batch, can use spot instances)
- Prediction API (needs consistent performance, may use on‑demand or reserved)
- Data storage (S3 with lifecycle policies)
- Feature store (Redis on appropriate instance types)
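For the data storage item, a lifecycle policy might be sketched as follows. The `raw/` prefix, the 30/90/365‑day thresholds, and the helper names are illustrative assumptions rather than values from the NEPSE system; `put_bucket_lifecycle_configuration` is the boto3 S3 call that applies such a configuration.

```python
def build_lifecycle_rules(prefix="raw/", ia_after_days=30,
                          glacier_after_days=90, expire_after_days=365):
    """Build an S3 lifecycle configuration (all thresholds are illustrative)."""
    if not ia_after_days < glacier_after_days < expire_after_days:
        raise ValueError("thresholds must increase: IA < Glacier < expiry")
    return {
        "Rules": [
            {
                "ID": f"tier-and-expire-{prefix.strip('/')}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Move objects to cheaper tiers as they age
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
                # Delete objects once they are no longer needed
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


def apply_lifecycle(bucket_name, rules):
    """Apply the rules to a bucket (requires AWS credentials)."""
    import boto3  # imported here so the builder works without boto3 installed

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket_name, LifecycleConfiguration=rules
    )
```

Keeping the rule builder separate from the AWS call makes the policy easy to review and test before it touches a real bucket.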

---

## 69.2 Cloud Cost Management Tools

Each cloud provider offers tools to monitor and manage costs.

### 69.2.1 AWS Cost Management

- **AWS Cost Explorer**: Visualise and analyse costs and usage.
- **AWS Budgets**: Set custom budgets and receive alerts.
- **AWS Trusted Advisor**: Provides optimisation recommendations (e.g., underutilised instances).
- **AWS Compute Optimizer**: Recommends optimal instance types based on utilisation.

### 69.2.2 Google Cloud Cost Management

- **Cloud Billing reports**: Detailed cost breakdowns.
- **Budget alerts**: Notify when spending exceeds thresholds.
- **Recommender**: Provides rightsizing and discount recommendations.
- **Committed use discounts (CUDs)**: Equivalent to reserved instances.

### 69.2.3 Azure Cost Management

- **Cost Analysis**: Explore and aggregate costs.
- **Budgets**: Set spending limits and alerts.
- **Advisor**: Provides optimisation recommendations.

All providers support **tagging** to allocate costs to different projects, teams, or environments.

---

## 69.3 Resource Right‑Sizing

Right‑sizing means selecting the most cost‑effective instance type that still meets performance requirements. It involves monitoring utilisation and adjusting.

### 69.3.1 Monitoring Utilisation

For the NEPSE prediction API deployed on EC2 or EKS, we can monitor CPU, memory, and network using CloudWatch (AWS) or Prometheus. If utilisation is consistently low (e.g., CPU < 20%), the instance is oversized. If it's consistently high (e.g., >80%), it may be undersized.

**Example: Checking CPU utilisation in CloudWatch**

```python
import boto3

cloudwatch = boto3.client('cloudwatch')

response = cloudwatch.get_metric_statistics(
    Namespace='AWS/EC2',
    MetricName='CPUUtilization',
    Dimensions=[{'Name': 'InstanceId', 'Value': 'i-1234567890abcdef0'}],
    StartTime='2025-03-01T00:00:00Z',
    EndTime='2025-03-08T00:00:00Z',
    Period=3600,
    Statistics=['Average']
)

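# Datapoints are not guaranteed to arrive in time order, so sort before printing
for datapoint in sorted(response['Datapoints'], key=lambda d: d['Timestamp']):
    print(datapoint['Timestamp'], datapoint['Average'])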
```

**Explanation:**  
This script prints the hourly average CPU utilisation over one week. If the averages stay below the 20% threshold, consider a smaller size or a cheaper family (e.g., from `m5.large` to `t3.medium`). If they stay above 80%, consider a larger size (e.g., from `m5.large` to `m5.xlarge`).
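The 20% and 80% thresholds from Section 69.3.1 can be turned into a simple classification. This is a minimal sketch: the function name is hypothetical, and judging by the mean alone is a simplification, since a low average can hide short peaks.

```python
from statistics import mean


def classify_utilisation(cpu_averages, low=20.0, high=80.0):
    """Classify an instance from its hourly average CPU values (in percent)."""
    if not cpu_averages:
        raise ValueError("no datapoints to classify")
    avg = mean(cpu_averages)
    if avg < low:
        return "oversized"   # consistently idle: candidate for a smaller size
    if avg > high:
        return "undersized"  # consistently busy: candidate for a larger size
    return "ok"
```

With the CloudWatch response from the example above, the input would be `[dp['Average'] for dp in response['Datapoints']]`.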

### 69.3.2 Choosing Instance Families

- **General purpose** (e.g., AWS `m5`, `t3`): Balanced CPU/memory. Good for prediction APIs.
- **Compute optimised** (e.g., `c5`): Higher CPU, for compute‑intensive training.
- **Memory optimised** (e.g., `r5`): For in‑memory databases like Redis.
- **Storage optimised** (e.g., `i3`): For large local disk requirements.

For the NEPSE system, the prediction API likely fits a general‑purpose instance (e.g., `t3.medium` for low traffic, `m5.large` for higher). Training jobs may benefit from compute‑optimised instances.

### 69.3.3 Rightsizing Recommendations

AWS Compute Optimizer can automatically analyse utilisation and recommend instance types. Enable it in the AWS Console.

---

## 69.4 Spot and Preemptible Instances

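Spot instances (AWS) and preemptible VMs (GCP, now also offered as Spot VMs) offer significant discounts (60‑90%) in exchange for the risk of termination at short notice: AWS gives a two‑minute warning, GCP about 30 seconds. They are ideal for fault‑tolerant workloads that are stateless or can checkpoint their progress.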

### 69.4.1 Use Cases for NEPSE

- **Model training**: Training jobs can be interrupted and resumed. Use spot instances for training with checkpointing.
- **Batch inference**: If you run batch predictions overnight, spot instances are suitable.
- **Development and testing**: Non‑production environments can run on spot.

**Example: Requesting a Spot Instance with AWS CLI**

```bash
aws ec2 request-spot-instances \
    --spot-price "0.05" \
    --instance-count 1 \
    --type "one-time" \
    --launch-specification file://spot-spec.json
```

`spot-spec.json` defines the AMI, instance type, etc.

### 69.4.2 Handling Interruptions

Your application must handle interruptions gracefully. For training, save checkpoints frequently (e.g., every few minutes to S3). When the instance is terminated, you can resume from the last checkpoint on a new instance.
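The checkpoint‑and‑resume pattern can be sketched as follows. This is a toy example under stated assumptions: a local JSON file stands in for S3, the training step is a placeholder, and `stop_after` exists only to simulate an interruption.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so an interruption cannot leave a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0, "loss": None}


def train(path, total_epochs, checkpoint_every=1, stop_after=None):
    """Toy training loop that resumes from the last checkpoint."""
    state = load_checkpoint(path)
    for epoch in range(state["epoch"], total_epochs):
        if stop_after is not None and epoch >= stop_after:
            return state  # simulated spot interruption
        # Placeholder for a real training step
        state = {"epoch": epoch + 1, "loss": 1.0 / (epoch + 1)}
        if (epoch + 1) % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state
```

A first call such as `train("ckpt.json", 5, stop_after=3)` stops after three epochs; calling `train("ckpt.json", 5)` afterwards continues from epoch 3 instead of starting over. In a real job, the checkpoint file would also be uploaded to durable storage such as S3, because the instance's local disk disappears with the instance.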

**Example: Using Spot Instances with AWS Batch**

AWS Batch can automatically provision spot instances for jobs and handle retries if instances are reclaimed.

### 69.4.3 Spot Instance Best Practices

- Use a mix of instance types (e.g., `c5.large`, `c5a.large`) to increase capacity availability.
- Set a maximum price (optional, but if you set it too low, you may never get capacity).
- Use Spot Fleet or EC2 Fleet to diversify.
- For critical workloads, have a fallback to on‑demand.

---

## 69.5 Reserved Instances and Savings Plans

For steady, predictable workloads, reserved instances (RIs) or savings plans provide significant discounts (up to 72%) in exchange for a commitment of 1 or 3 years.

### 69.5.1 When to Use RIs

- **Prediction API**: If traffic is stable 24/7, RIs are cost‑effective.
- **Databases**: Long‑running databases (e.g., RDS) benefit from RIs.
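- **Development servers**: If a dev environment runs only 8x5 (about 24% of the week), a commitment rarely pays off; stopping the instances outside working hours is usually cheaper. AWS no longer sells Scheduled Reserved Instances.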
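Whether a commitment pays off depends on how many hours the workload actually runs. A rough break‑even check, assuming a single flat discount (the 40% figure in the usage note is hypothetical, not a quoted AWS price):

```python
def breakeven_utilisation(discount):
    """Fraction of hours a workload must run for a commitment to beat on-demand.

    A commitment bills every hour at (1 - discount) of the on-demand rate,
    while on-demand bills only the hours actually used.
    """
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return 1 - discount


def commitment_pays_off(hours_per_week, discount):
    """True if running `hours_per_week` is above the break-even utilisation."""
    utilisation = hours_per_week / 168  # 168 hours in a week
    return utilisation > breakeven_utilisation(discount)
```

At a 40% discount the break‑even point is 60% utilisation, so `commitment_pays_off(168, 0.4)` holds for a 24/7 API, while `commitment_pays_off(40, 0.4)` does not for an 8x5 environment.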

### 69.5.2 Savings Plans

Savings Plans are more flexible than RIs: you commit to a certain dollar amount per hour of compute usage, and the discount applies automatically to eligible usage. There are two types:

- **Compute Savings Plans**: Apply to EC2, Fargate, and Lambda usage regardless of instance family, size, or region.
- **EC2 Instance Savings Plans**: Apply to a specific instance family in a specific region, in exchange for a larger discount.

For the NEPSE system, if you use a mix of instance types and also use Fargate or Lambda, Savings Plans offer flexibility.

### 69.5.3 Example: Purchasing a Savings Plan via AWS CLI

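```bash
# The term and payment option are properties of the offering, so pick an
# offering ID that already matches (e.g., 1 year, All Upfront). List offerings with:
#   aws savingsplans describe-savings-plans-offerings
aws savingsplans create-savings-plan \
    --savings-plan-offering-id <offering-id> \
    --commitment 0.50
```

**Explanation:**  
This commits to spending $0.50 per hour for the term of the chosen offering (here, one year, paid upfront). Any eligible compute usage up to that amount is charged at the discounted rate; beyond that, you pay on‑demand. The commitment is billed even in hours when usage falls below it.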
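The billing rule can be made concrete with a simplified model. It assumes one uniform discount across all usage, which is an approximation: actual Savings Plan rates differ per instance type and region.

```python
def hourly_bill(on_demand_usage, commitment, discount):
    """Hourly charge under a Savings Plan (simplified, single discount rate).

    on_demand_usage: eligible usage that hour, valued at on-demand rates (USD)
    commitment:      committed spend per hour at discounted rates (USD)
    discount:        Savings Plan discount as a fraction, e.g. 0.30
    """
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    # On-demand value of usage that the commitment absorbs
    covered = commitment / (1 - discount)
    # Usage beyond the covered amount is billed at on-demand rates
    overflow = max(0.0, on_demand_usage - covered)
    # The commitment itself is charged even when usage is lower
    return commitment + overflow
```

For instance, with a $0.50 commitment and a hypothetical 50% discount, $1.00 of on‑demand‑equivalent usage is fully covered, and $2.00 of usage costs $0.50 plus $1.00 of overflow.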

---

## 69.6 Auto‑scaling Strategies

Auto‑scaling ensures you only pay for what you need by dynamically adjusting resources based on demand.

### 69.6.1 Horizontal Scaling for Prediction API

In Kubernetes, the Horizontal Pod Autoscaler (HPA) can scale the number of replicas based on CPU/memory or custom metrics (e.g., requests per second).

**Example HPA configuration:**

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nepse-predictor-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nepse-predictor
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: requests_per_second
      target:
        type: AverageValue
        averageValue: 100
```

**Explanation:**  
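The HPA adds replicas when average CPU utilisation across pods exceeds 70% of the requested CPU, or when the per‑pod average of `requests_per_second` exceeds 100; it follows whichever metric calls for more replicas. The `requests_per_second` metric is not built in: it must be exposed through a custom metrics adapter (for example, the Prometheus Adapter).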
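The Kubernetes documentation gives the HPA's core calculation as `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`, evaluated per metric, with the largest result taken. The sketch below reproduces that arithmetic with the default 10% tolerance; it deliberately ignores stabilisation windows and pod readiness, and the function name is hypothetical.

```python
import math


def desired_replicas(current_replicas, metrics, min_replicas=2,
                     max_replicas=10, tolerance=0.1):
    """Approximate the HPA decision.

    metrics: list of (current_value, target_value) pairs, one per HPA metric.
    """
    if not metrics:
        raise ValueError("at least one metric is required")
    proposals = []
    for current, target in metrics:
        ratio = current / target
        if abs(ratio - 1.0) <= tolerance:
            proposals.append(current_replicas)  # close enough: no change
        else:
            proposals.append(math.ceil(current_replicas * ratio))
    wanted = max(proposals)  # the metric demanding the most replicas wins
    return max(min_replicas, min(max_replicas, wanted))
```

With 4 replicas at 90% CPU (target 70) and 80 requests per second per pod (target 100), the CPU metric asks for 6 replicas and the request metric for 4, so the result is 6.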

### 69.6.2 Vertical Scaling

Vertical scaling (changing instance size) is less common in Kubernetes, but for stateful services like Redis, you might need to move to a larger instance. This often requires downtime.

### 69.6.3 Scaling to Zero

For development or batch workloads, you can scale to zero when not in use. For example, using **KEDA** (Kubernetes Event‑Driven Autoscaling) you can scale a deployment to zero when no messages are in a queue, then scale up when a message arrives.

**Example: KEDA ScaledObject for a batch processor**

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: nepse-batch-scaler
spec:
  scaleTargetRef:
    name: nepse-batch-processor
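  minReplicaCount: 0   # allow scaling to zero (also KEDA's default)
  triggers:
  # AWS credentials (e.g., via a TriggerAuthentication) are omitted here
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/nepse-batch
      queueLength: "5"        # target number of messages per replica
      awsRegion: "us-east-1"  # required by the aws-sqs-queue scaler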
```

**Explanation:**  
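KEDA activates the deployment as soon as messages appear in the SQS queue and then targets roughly one replica per 5 messages (`queueLength`). When the queue is empty, it scales the deployment back to zero after a cooldown period, saving costs.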
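As a rough illustration of how a per‑replica queue target translates into replica counts, consider the sketch below. It is an approximation of the behaviour, not KEDA's implementation: it ignores the cooldown period and the HPA's stabilisation behaviour, and the `max_replicas` cap is an assumed parameter.

```python
import math


def queue_replicas(visible_messages, queue_length_target=5, max_replicas=100):
    """Estimate replicas for a queue-driven workload that can scale to zero."""
    if visible_messages < 0:
        raise ValueError("message count cannot be negative")
    if visible_messages == 0:
        return 0  # nothing to process: scale to zero
    # One replica per `queue_length_target` messages, rounded up and capped
    return min(max_replicas, math.ceil(visible_messages / queue_length_target))
```

An empty queue yields 0 replicas, 3 messages yield 1, and 12 messages yield 3.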

---

## 69.7 Cost Allocation and Tagging

Tagging resources is essential for understanding where money is spent. Without tags, costs are lumped together, making optimisation difficult.

### 69.7.1 Defining a Tagging Strategy

Common tags:

- `Environment`: dev, staging, prod
- `Project`: nepse (tag values are case‑sensitive, so pick one spelling and keep it)
- `Owner`: team or individual
- `CostCenter`: accounting code
- `AutoStop`: true/false (for resources that can be stopped overnight)

For the NEPSE system, we might tag:

- EC2 instances for the prediction API: `Environment=prod`, `Project=nepse`, `Service=predictor`
- S3 buckets: `Environment=prod`, `Project=nepse`, `Data=raw`
- Training jobs: `Environment=dev`, `Project=nepse`, `JobType=training`
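A tagging strategy only helps if it is enforced. One way to check resources against it, sketched with a hypothetical policy (the required keys and allowed values below are assumptions drawn from the list of common tags, not a fixed standard):

```python
# Required tag keys mapped to allowed values; None accepts any non-empty value
REQUIRED_TAGS = {
    "Environment": {"dev", "staging", "prod"},
    "Project": None,
    "Owner": None,
}


def tag_problems(tags, required=REQUIRED_TAGS):
    """Return a list of human-readable problems with a resource's tags."""
    problems = []
    for key, allowed in required.items():
        value = tags.get(key)
        if not value:
            problems.append(f"missing tag: {key}")
        elif allowed is not None and value not in allowed:
            problems.append(f"invalid value for {key}: {value!r}")
    return problems
```

A check like this could run in CI against planned infrastructure changes, so that untagged resources are caught before they start accruing cost.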

### 69.7.2 Implementing Tags with Terraform

```hcl
resource "aws_instance" "predictor" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.medium"

  tags = {
    Name        = "nepse-predictor-prod"
    Environment = "prod"
    Project     = "nepse"
    Service     = "predictor"
    Terraform   = "true"
  }
}
```

### 69.7.3 Using Tags in Cost Explorer

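Once tags are applied, activate them as cost allocation tags in the Billing console; until then they do not appear in Cost Explorer, and activation can take up to 24 hours. You can then filter costs by tag. For example, view all costs for `Project=nepse` and break them down by `Environment`. This shows how much each environment costs and helps you identify anomalies.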
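The same breakdown is available programmatically through the Cost Explorer API. The sketch below uses boto3's `get_cost_and_usage`; the helper names are hypothetical, pagination via `NextPageToken` is omitted, and the call requires AWS credentials with Cost Explorer enabled.

```python
from collections import defaultdict


def fetch_costs_by_environment(start, end, project="nepse"):
    """Query monthly costs for one project, grouped by the Environment tag."""
    import boto3  # imported lazily so the parsing helper works without boto3

    ce = boto3.client("ce")
    return ce.get_cost_and_usage(
        TimePeriod={"Start": start, "End": end},  # 'YYYY-MM-DD'; End is exclusive
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        Filter={"Tags": {"Key": "Project", "Values": [project]}},
        GroupBy=[{"Type": "TAG", "Key": "Environment"}],
    )


def totals_by_tag_value(response):
    """Sum UnblendedCost per tag value across all returned periods."""
    totals = defaultdict(float)
    for period in response["ResultsByTime"]:
        for group in period["Groups"]:
            # Group keys look like 'Environment$prod'; untagged spend is 'Environment$'
            _, _, value = group["Keys"][0].partition("$")
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            totals[value or "(untagged)"] += amount
    return dict(totals)
```

A typical use would be `totals_by_tag_value(fetch_costs_by_environment("2025-01-01", "2025-03-01"))`, which returns a dictionary mapping each environment to its total cost.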

---

## 69.8 Monitoring and Forecasting Costs

### 69.8.1 Setting Budgets

AWS Budgets can alert you when costs exceed a threshold. For example, set a monthly budget of $500 for the NEPSE project, with alerts at 80% and 100%.

```bash
aws budgets create-budget \
    --account-id 123456789012 \
    --budget file://budget.json \
    --notifications-with-subscribers file://subscribers.json
```

`budget.json`:

```json
{
    "BudgetName": "NEPSE Monthly Budget",
    "BudgetLimit": {
        "Amount": "500",
        "Unit": "USD"
    },
    "TimeUnit": "MONTHLY",
    "BudgetType": "COST",
    "CostFilters": {
        "TagKeyValue": [ "user:Project$nepse" ]
    }
}
```

`subscribers.json`:

```json
[
    {
        "Notification": {
            "NotificationType": "ACTUAL",
            "ComparisonOperator": "GREATER_THAN",
            "Threshold": 80,
            "ThresholdType": "PERCENTAGE"
        },
        "Subscribers": [
            {
                "SubscriptionType": "EMAIL",
                "Address": "team@example.com"
            }
        ]
    }
]
```

**Explanation:**  
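This budget tracks costs for resources tagged with `Project=nepse` and sends an email alert when actual costs exceed 80% of the $500 monthly budget. To add the 100% alert as well, append a second entry to `subscribers.json` with `"Threshold": 100`.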

### 69.8.2 Forecasting

AWS Cost Explorer provides forecasts based on historical usage. You can use this to plan future budgets and identify trends.
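For a quick sanity check between reviews, a naive run‑rate projection is often enough. This sketch simply extrapolates the average daily spend to the end of the month; it is not the model Cost Explorer uses, and it assumes spending is roughly even across days.

```python
import calendar
from datetime import date


def month_end_forecast(spend_to_date, today):
    """Project month-end spend from the average daily run rate so far."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    days_elapsed = today.day  # assumes spend_to_date includes today
    return spend_to_date / days_elapsed * days_in_month
```

For example, $150 spent by 10 March projects to $465 for the month, which is close enough to a $500 budget to warrant a look.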

### 69.8.3 Anomaly Detection

AWS also offers **Cost Anomaly Detection** that uses machine learning to detect unusual spending patterns (e.g., a sudden spike in EC2 costs) and alert you.
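To illustrate the idea behind anomaly detection, here is a deliberately simple stand‑in: a z‑score test against a trailing window of daily costs. This is not the algorithm AWS uses; the window size, threshold, and `min_std` floor are all assumed parameters.

```python
from statistics import mean, stdev


def find_anomalies(daily_costs, window=7, threshold=3.0, min_std=1.0):
    """Return indices of days whose cost spikes above the trailing window.

    A day is flagged when it exceeds the trailing mean by more than
    `threshold` standard deviations. `min_std` (in USD) stops a very
    flat history from turning tiny changes into anomalies.
    """
    anomalies = []
    for i in range(window, len(daily_costs)):
        history = daily_costs[i - window:i]
        mu = mean(history)
        sigma = max(stdev(history), min_std)
        if (daily_costs[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies
```

Given a week of costs around $10 to $12 followed by a $40 day, only the $40 day is flagged; the day after it is not, because the spike has widened the trailing window's spread.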

---

## 69.9 FinOps Principles

FinOps (Financial Operations) is a cultural shift that brings together engineering, finance, and business to manage cloud costs collaboratively. Key principles:

- **Teams take ownership**: Engineers are responsible for their cloud usage.
- **Centralised visibility**: Costs are visible to everyone.
- **Business value**: Decisions are based on cost vs. value.
- **Continuous optimisation**: Cost management is ongoing, not a one‑time exercise.

### 69.9.1 Implementing FinOps for NEPSE

- **Tagging**: As discussed, enables cost allocation.
- **Regular reviews**: Hold a monthly cost review meeting to discuss trends, anomalies, and optimisation opportunities.
- **Showback/Chargeback**: If multiple teams use the NEPSE system, show each team their costs.
- **Unit economics**: Track cost per prediction or per training run. This helps assess the value of the system.

**Example: Calculating cost per prediction**

If the monthly cost for the prediction API is $200 and you serve 100,000 predictions, cost per prediction is $0.002. This metric can guide decisions: if you optimise, you can lower the cost, or you can decide to increase prices if you're selling the service.
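The arithmetic above is easy to wrap in a helper so the metric can be tracked month over month. The function name is hypothetical.

```python
def cost_per_prediction(monthly_cost, predictions):
    """Unit cost in USD: total serving cost divided by predictions served."""
    if predictions <= 0:
        raise ValueError("predictions must be positive")
    return monthly_cost / predictions
```

With the figures from the example, `cost_per_prediction(200, 100_000)` gives $0.002; if an optimisation brought the monthly cost down to $150 at the same volume, the unit cost would fall to $0.0015.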

---

## 69.10 Best Practices for Cost Management

1. **Tag everything**: Start tagging from day one.
2. **Use auto‑scaling**: Don't pay for idle resources.
3. **Choose the right pricing model**: Spot for flexible workloads, reserved for steady state.
4. **Monitor and alert**: Set budgets and anomaly detection.
5. **Delete unused resources**: Orphaned storage volumes, old snapshots, unused load balancers.
6. **Use infrastructure as code**: Makes it easier to track and manage resources.
7. **Educate the team**: Ensure everyone understands cost implications.
8. **Review regularly**: Schedule monthly cost reviews.
9. **Leverage cloud‑native services**: Managed services (e.g., Aurora Serverless) can be more cost‑effective than self‑managed.
10. **Consider multi‑cloud or hybrid**: Sometimes another cloud or on‑premises is cheaper for certain workloads.
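Practice 5 lends itself to automation. As one example, the sketch below lists EBS volumes that are not attached to any instance, using boto3's `describe_volumes` paginator; the helper names are hypothetical, and whether a volume is safe to delete still needs a human decision.

```python
def unattached_volumes(volumes):
    """Return (VolumeId, size in GiB) for volumes in the 'available' state.

    'available' means the volume exists but is not attached to an instance.
    """
    return [(v["VolumeId"], v["Size"]) for v in volumes if v["State"] == "available"]


def list_volumes():
    """Fetch all EBS volumes in the current region (requires AWS credentials)."""
    import boto3  # imported lazily so the filter above works without boto3

    ec2 = boto3.client("ec2")
    paginator = ec2.get_paginator("describe_volumes")
    volumes = []
    for page in paginator.paginate():
        volumes.extend(page["Volumes"])
    return volumes
```

Running `unattached_volumes(list_volumes())` on a schedule and posting the result to the team channel is a low‑effort way to keep orphaned storage visible.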

---

## Chapter Summary

In this chapter, we explored cost management for the NEPSE prediction system. We covered:

- The key cost drivers in cloud‑based ML systems.
- Resource right‑sizing and instance selection.
- Using spot/preemptible instances for flexible workloads.
- Reserved instances and savings plans for steady state.
- Auto‑scaling strategies to match demand.
- Tagging and cost allocation for visibility.
- Setting budgets and monitoring costs with cloud tools.
- FinOps principles for cultural alignment.
- Best practices to keep costs under control.

By applying these strategies, the NEPSE system can remain affordable as it scales. Cost management is not a one‑time task but an ongoing discipline that pays dividends in lower bills and better resource utilisation.

This chapter concludes **Part XII: Industry Best Practices and Standards**. In the final part, we will discuss **User Interfaces and Visualization**, covering dashboards and tools to interact with the NEPSE predictions.

---

**End of Chapter 69**

