
# FOLDER: 06-operational-and-deployment
**Generated:** 2025-12-15 12:32

**Contains:** 4 files | **Total Size:** 0.02 MB

## 📂 `06-operational-and-deployment/`

#### 📄 `06-operational-and-deployment/23-blue-green-deployment.md`


# 23\. Blue-Green Deployment

## 1\. The Concept

Blue-Green Deployment is a release strategy that reduces downtime and risk by running two identical production environments, called "Blue" and "Green."

  * **Blue:** The currently live version (v1) handling 100% of user traffic.
  * **Green:** The new version (v2), currently idle or accessible only to internal testers.

To release, you deploy v2 to Green, test it thoroughly, and then switch the Load Balancer to route all traffic from Blue to Green. If anything goes wrong, you switch back instantly.

## 2\. The Problem

  * **Scenario:** You are deploying a critical update to a banking app.
  * **The "In-Place" Risk:** You stop the server, unzip the new jar file, and restart the server.
      * **Downtime:** The user sees a "502 Bad Gateway" for 2 minutes.
      * **The Panic:** The new version crashes on startup. You now have to scramble to find the old jar file and redeploy it. The system is down for 15 minutes.
      * **The Consequence:** Deployment becomes a scary event that teams avoid. "Don't deploy on Fridays\!"

## 3\. The Solution

Decouple the "Deployment" (installing bits) from the "Release" (serving traffic).

1.  **Deployment:** You spin up the Green environment. The public cannot see it yet. You run smoke tests against it.
2.  **Cutover:** You change the Router/Load Balancer configuration. Traffic flows to Green. Blue is now idle.
3.  **Rollback:** If Green throws errors, you just flip the switch back to Blue. It is instantaneous because Blue is still running.
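The three steps above can be sketched as a single pointer swap. This is a minimal in-memory model, not a real load balancer: the `Router` class, the `blue_v1`/`green_v2` handlers, and the smoke test are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Router:
    """Stands in for the load balancer: one pointer decides who gets traffic."""
    environments: Dict[str, Callable[[str], str]]
    live: str = "blue"

    def handle(self, request: str) -> str:
        return self.environments[self.live](request)

    def cut_over(self, target: str, smoke_test: Callable[[Callable[[str], str]], bool]) -> bool:
        """Switch traffic only if the idle environment passes its smoke test."""
        if not smoke_test(self.environments[target]):
            return False          # Release aborted; live pointer untouched
        self.live = target
        return True

    def roll_back(self, previous: str) -> None:
        """Rollback is just moving the pointer back; the old stack never stopped."""
        self.live = previous


def blue_v1(request: str) -> str:
    return f"v1 handled {request}"


def green_v2(request: str) -> str:
    return f"v2 handled {request}"


router = Router({"blue": blue_v1, "green": green_v2})
before = router.handle("GET /balance")

# Deployment already happened (green_v2 exists); this is the release step.
switched = router.cut_over("green", smoke_test=lambda env: env("GET /health").startswith("v2"))
after = router.handle("GET /balance")

router.roll_back("blue")
restored = router.handle("GET /balance")
```

The point of the sketch is that `cut_over` and `roll_back` touch only the pointer, never the environments themselves, which is why both are fast.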

### Junior vs. Senior View

| Perspective | Approach | Outcome |
| :--- | :--- | :--- |
| **Junior** | "I'll use `rsync` to overwrite the files on the live server. It's fast and easy." | **Maintenance Windows.** "The site will be down from 2 AM to 4 AM." If the deploy fails, you are stuck debugging live in production. |
| **Senior** | "Infrastructure is disposable. Spin up a completely new stack (Green). Verify it. Switch the pointer. Kill the old stack (Blue) only when we are 100% sure." | **Zero Downtime.** Deployments are boring and safe. Rollback is a single button press. We can deploy at 2 PM on a Friday. |

## 4\. Visual Diagram

## 5\. The Hard Part: The Database

The infrastructure part is easy (especially with Kubernetes). **The Database is the bottleneck.**

  * You usually have **one** shared database for both Blue and Green (syncing two databases in real-time is too complex).
  * **The Constraint:** The database schema must be compatible with *both* v1 (Blue) and v2 (Green) at the same time.

### The "Expand-Contract" Pattern

If you need to rename a column from `address` to `full_address`:

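1.  **Migration 1 (Expand):** Add `full_address` column. Copy data from `address`. Keep `address`.
      * *Result:* DB has both. Blue uses `address`. Green uses `full_address`.
      * *Caveat:* While both versions can be live, writes to one column must also reach the other (dual writes in the application or a database trigger). Otherwise a rollback to Blue would miss data written by Green.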
2.  **Deploy:** Blue-Green Switch.
3.  **Migration 2 (Contract):** Once Green is stable and you no longer need to roll back to Blue, drop the `address` column.
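As a rough illustration of the two migrations, here is a sketch against an in-memory SQLite database. The `users` table and its sample row are assumptions made for the example; the contract step rebuilds the table rather than dropping the column so that it also runs on older SQLite versions.

```python
import sqlite3


def expand(conn: sqlite3.Connection) -> None:
    """Migration 1: add the new column and backfill it; the old column stays."""
    conn.execute("ALTER TABLE users ADD COLUMN full_address TEXT")
    conn.execute("UPDATE users SET full_address = address")


def contract(conn: sqlite3.Connection) -> None:
    """Migration 2: rebuild the table without the old column."""
    conn.executescript("""
        CREATE TABLE users_new (id INTEGER PRIMARY KEY, full_address TEXT);
        INSERT INTO users_new (id, full_address) SELECT id, full_address FROM users;
        DROP TABLE users;
        ALTER TABLE users_new RENAME TO users;
    """)


def columns(conn: sqlite3.Connection) -> list:
    return [row[1] for row in conn.execute("PRAGMA table_info(users)")]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, address TEXT)")
conn.execute("INSERT INTO users (address) VALUES ('1 Main St')")

expand(conn)
# Between the two migrations, both versions can query the same table.
blue_view = conn.execute("SELECT address FROM users").fetchone()[0]
green_view = conn.execute("SELECT full_address FROM users").fetchone()[0]
after_expand = columns(conn)

contract(conn)
after_contract = columns(conn)
```

The sketch leaves out keeping the two columns in sync while both versions accept writes; a production migration would need to handle that before the contract step.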

## 6\. When to Use It (and When NOT to)

  * ✅ **Use when:**
      * **Critical Uptime:** You cannot afford 5 minutes of downtime.
      * **Instant Rollback:** You need a safety net.
      * **Monoliths:** It is often easier to Blue/Green a monolith than to do rolling updates.
  * ❌ **Avoid when:**
      * **Stateful Apps:** If users have active WebSocket connections or in-memory sessions on Blue, switching them to Green cuts them off. (Requires sticky sessions or external session stores like Redis).
      * **Destructive DB Changes:** If the new version drops a table, you cannot roll back to Blue (Blue will crash querying the missing table).

## 7\. Implementation Example (Kubernetes)

In Kubernetes, this is often done using `Service` selectors.

### Step 1: The Current State (Blue)

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app-service
spec:
  selector:
    version: v1  # POINTS TO BLUE
  ports:
    - port: 80
```

### Step 2: Deploy Green (v2)

We deploy a new Deployment named `app-v2`. It starts up, but receives NO traffic because the Service is still looking for `version: v1`.

  * We can port-forward to `app-v2` to test it manually.

### Step 3: The Switch

We patch the Service to look for `v2`.

```bash
kubectl patch service my-app-service -p '{"spec":{"selector":{"version":"v2"}}}'
```

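  * **Result:** The Service routes new connections to the v2 pods as soon as its endpoints update (typically within seconds). The v1 pods stop receiving new traffic, although long-lived connections already open to v1 can persist until they close.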
  * **Cleanup:** After 1 hour, delete the `app-v1` deployment.
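To see why changing one label value moves all traffic, it may help to model how a selector picks pods. The sketch below is a simplified stand-in for Kubernetes label matching, not its implementation; the pod names are hypothetical, and only the `version` label comes from the example above.

```python
from typing import Dict, List

Labels = Dict[str, str]


def matches(selector: Labels, pod_labels: Labels) -> bool:
    """A pod is selected when every selector key/value appears in its labels."""
    return all(pod_labels.get(key) == value for key, value in selector.items())


def endpoints(selector: Labels, pods: Dict[str, Labels]) -> List[str]:
    return sorted(name for name, labels in pods.items() if matches(selector, labels))


# Hypothetical pod names; only the `version` label comes from the example above.
pods = {
    "app-v1-a": {"version": "v1"},
    "app-v1-b": {"version": "v1"},
    "app-v2-a": {"version": "v2"},
    "app-v2-b": {"version": "v2"},
}

service_selector = {"version": "v1"}
before_patch = endpoints(service_selector, pods)

service_selector["version"] = "v2"   # What the kubectl patch changes
after_patch = endpoints(service_selector, pods)
```

Both sets of pods keep running throughout; only the set of endpoints behind the Service changes.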

## 8\. Blue-Green vs. Canary

  * **Blue-Green:** Instant switch. 100% of traffic moves at once. Great for simple applications.
  * **Canary:** Gradual shift. 1% -\> 10% -\> 50% -\> 100%. Better for high-scale systems where a bug affecting 100% of users instantly would be catastrophic.

## 9\. Strategic Note on Cost

Blue-Green implies running **double the infrastructure** during the deployment window.

  * If your production cluster costs $10k/month, you need capacity to spike to $20k/month temporarily.
  * **Senior Tip:** In the Cloud, this is cheap (you only pay for the extra hour). On-premise, this is hard (you need double the physical servers).
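The arithmetic behind the "extra hour" claim can be checked with a back-of-the-envelope calculation. This sketch assumes purely hourly billing and an average month of 730 hours; both are simplifications, and real cloud pricing will differ.

```python
HOURS_PER_MONTH = 730  # Assumption: average month (8,760 hours / 12)


def deployment_surcharge(monthly_cost: float, overlap_hours: float) -> float:
    """Extra spend from running a second full-size stack for `overlap_hours`."""
    hourly_cost = monthly_cost / HOURS_PER_MONTH
    return hourly_cost * overlap_hours


one_hour = deployment_surcharge(10_000, overlap_hours=1)
full_month = deployment_surcharge(10_000, overlap_hours=HOURS_PER_MONTH)
```

Under these assumptions a one-hour overlap on a $10k/month cluster adds roughly $13.70, while keeping both stacks up for the whole month adds the full $10k.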

#### 📄 `06-operational-and-deployment/24-canary-release.md`


# 24\. Canary Release

## 1\. The Concept

A Canary Release is a technique to reduce the risk of introducing a new software version in production by slowly rolling out the change to a small subset of users before making it available to everyone. It is named after the "canary in a coal mine"—if the canary (the small subset of users) stops singing (encounters errors), you evacuate the mine (rollback) before the miners (the rest of your user base) get hurt.

## 2\. The Problem

  * **Scenario:** You have 1 million active users. You deploy version 2.0 using a standard "Rolling Update" or "Blue-Green" switch.
  * **The Bug:** Version 2.0 has a subtle memory leak that only appears under high load, or a UI bug that breaks the "Checkout" button for users on iPads.
  * **The Impact:** Because you switched 100% of traffic to the new version, **all 1 million users** are affected instantly. Support lines are flooded, revenue drops to zero, and your reputation takes a hit.

## 3\. The Solution

Instead of switching 0% to 100%, you switch gradually: 0% -\> 1% -\> 10% -\> 50% -\> 100%.

1.  **Phase 1:** Deploy v2 to a small capacity. Route 1% of live traffic to it.
2.  **Verification:** Monitor Error Rates, Latency, and Business Metrics (e.g., "Orders per minute").
3.  **Expansion:** If metrics are healthy, increase traffic to 10%.
4.  **Completion:** Continue until 100% of traffic is on v2. Then decommission v1.

### Junior vs. Senior View

| Perspective | Approach | Outcome |
| :--- | :--- | :--- |
| **Junior** | "We tested it in Staging. It works. Just deploy it to all servers." | **High Risk.** Staging is never exactly like Production. Real users do weird things that QA didn't predict. |
| **Senior** | "Staging is a rehearsal. Production is the show. Let 500 random users try the new code first. If they don't complain, let 5,000 try it." | **Blast Radius Containment.** If v2 is broken, only 1% of users had a bad day. The other 99% never noticed. We roll back the 1% instantly. |

## 4\. Visual Diagram

## 5\. When to Use It (and When NOT to)

  * ✅ **Use when:**
      * **High Scale:** You have enough traffic that "1%" is statistically significant.
      * **Critical Business Flows:** Changing the Payment Gateway or Login logic.
      * **Cloud Native:** You are using Kubernetes, Istio, or AWS ALB, which make weighted routing easy.
  * ❌ **Avoid when:**
      * **Low Traffic:** If you get 1 request per minute, "1% traffic" means waiting 100 minutes on average for a single data point. Just do Blue-Green.
      * **Client-Side Apps:** It is harder (though not impossible) to do Canary releases for Mobile Apps (App Store delays) or Desktop software.
      * **Database Schema Changes:** Like Blue-Green, Canary requires the database to support *both* versions simultaneously.

## 6\. Implementation Example (Kubernetes/Istio)

In a standard Kubernetes setup, you can do a rough Canary by scaling replicas (1 pod v2, 9 pods v1 = 10% traffic).
For precise control, you use a Service Mesh like **Istio** or an Ingress Controller like **Nginx**.

### Istio `VirtualService` Configuration

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: payment-service
spec:
  hosts:
  - payment-service
  http:
  - route:
    # Subsets v1 and v2 must be defined in a matching DestinationRule.
    - destination:
        host: payment-service
        subset: v1  # The Stable Version
      weight: 90
    - destination:
        host: payment-service
        subset: v2  # The Canary Version
      weight: 10
```
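Conceptually, the weights above mean each request is sent to a subset with probability proportional to its weight. The following sketch simulates that idea with Python's `random.choices`; it is an illustration of weighted selection, not how Istio's proxy is implemented, and the seed is only there to make the run repeatable.

```python
import random
from collections import Counter
from typing import Dict


def pick_subset(weights: Dict[str, int], rng: random.Random) -> str:
    """Choose a destination with probability proportional to its weight."""
    subsets = list(weights)
    return rng.choices(subsets, weights=[weights[s] for s in subsets], k=1)[0]


weights = {"v1": 90, "v2": 10}      # Same split as the VirtualService above
rng = random.Random(42)             # Seeded so the run is repeatable

counts = Counter(pick_subset(weights, rng) for _ in range(10_000))
canary_share = counts["v2"] / 10_000
```

Over many requests `canary_share` settles near 0.10, but any single user can land on either version, which is why session affinity matters for canaries.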

### The Rollout Strategy (Automated)

Manual Canary updates are tedious. Tools like **Flagger** or **Argo Rollouts** automate this:

1.  **09:00 AM:** Deploy v2. Flagger sets traffic to 5%.
2.  **09:05 AM:** Flagger checks Prometheus: "Is HTTP 500 rate \< 1%?".
3.  **09:06 AM:** Success. Flagger increases traffic to 20%.
4.  **09:10 AM:** Failure detected (Latency spiked \> 500ms). Flagger automatically reverts traffic to 0% and sends a Slack alert.
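The timeline above boils down to a loop: raise the weight, check the metrics, abort to 0% on the first failure. Here is a minimal sketch of that loop using the thresholds from the timeline (5xx rate below 1%, latency no higher than 500ms). It does not use Flagger's or Argo Rollouts' actual API; `Metrics`, `run_rollout`, and `fake_metrics` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Metrics:
    error_rate: float      # Fraction of requests returning HTTP 5xx
    p99_latency_ms: float


def healthy(m: Metrics, max_error_rate: float = 0.01, max_latency_ms: float = 500) -> bool:
    return m.error_rate < max_error_rate and m.p99_latency_ms <= max_latency_ms


def run_rollout(steps: List[int], read_metrics: Callable[[int], Metrics]) -> Tuple[int, List[int]]:
    """Walk through traffic steps; return (final weight, weights that were tried)."""
    tried = []
    for weight in steps:
        tried.append(weight)
        if not healthy(read_metrics(weight)):
            return 0, tried            # Abort: all traffic back to stable
    return steps[-1], tried


def fake_metrics(weight: int) -> Metrics:
    # Hypothetical canary that slows down once it carries 20% of traffic.
    latency = 120 if weight < 20 else 650
    return Metrics(error_rate=0.002, p99_latency_ms=latency)


final_weight, tried = run_rollout([5, 20, 50, 100], fake_metrics)
```

With the simulated latency spike, the rollout stops at the 20% step and returns the canary to 0%, mirroring the 09:10 AM entry.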

## 7\. What to Monitor (The Canary Analysis)

It is not enough to just check "Is the server up?" You must compare the **Baseline (v1)** vs. the **Canary (v2)**.

1.  **Technical Metrics:**
      * HTTP Error Rate (5xx).
      * Latency (p99).
      * CPU/Memory Saturation.
2.  **Business Metrics (The Senior level):**
      * "Add to Cart" conversion rate.
      * "Ad Impressions" count.
      * *Why?* v2 might be technically "stable" (no crashes), but if a CSS bug hides the "Buy" button, revenue drops. Only business metrics catch this.
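One way to picture canary analysis is as a per-metric comparison against the baseline with a tolerance. The sketch below is illustrative only: the metric names, the 10% tolerance, and all sample numbers are invented for the example.

```python
from typing import Dict, List

# Metrics where a higher value is worse vs. metrics where a lower value is worse.
HIGHER_IS_WORSE = {"error_rate", "p99_latency_ms"}
LOWER_IS_WORSE = {"add_to_cart_rate"}


def regressions(baseline: Dict[str, float], canary: Dict[str, float],
                tolerance: float = 0.10) -> List[str]:
    """Return the metrics where the canary is more than `tolerance` worse than baseline."""
    failed = []
    for name, base in baseline.items():
        value = canary[name]
        if name in HIGHER_IS_WORSE and value > base * (1 + tolerance):
            failed.append(name)
        elif name in LOWER_IS_WORSE and value < base * (1 - tolerance):
            failed.append(name)
    return failed


baseline_v1 = {"error_rate": 0.004, "p99_latency_ms": 310.0, "add_to_cart_rate": 0.082}
# Hypothetical canary: technically fine, but the "Buy" button is hidden.
canary_v2 = {"error_rate": 0.004, "p99_latency_ms": 305.0, "add_to_cart_rate": 0.031}

failed = regressions(baseline_v1, canary_v2)
```

In this made-up case the technical metrics pass and only the business metric flags the canary, which is the scenario described above.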

## 8\. Sticky Sessions

A common challenge: A user hits the site and gets the Canary (v2). They refresh the page and get the Stable (v1). This is jarring.
**Solution:** Enable **Session Affinity** (Sticky Sessions) based on a Cookie or User ID. Once a user is assigned to the Canary group, they should stay there until the deployment finishes.
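A common way to get stable assignment without storing state is to hash the user ID into a bucket. The sketch below assumes a string user ID and 100 buckets; the function names are hypothetical and the hashing scheme is one reasonable choice among several.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_canary(user_id: str, canary_percent: int) -> bool:
    """Users in buckets below the threshold get the canary."""
    return bucket(user_id) < canary_percent


users = [f"user-{n}" for n in range(1000)]
canary_at_10 = {u for u in users if in_canary(u, 10)}
canary_at_50 = {u for u in users if in_canary(u, 50)}
```

Because the bucket depends only on the user ID, a user who saw the canary at 10% still sees it at 50%; nobody flips back to v1 as the rollout widens.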

## 9\. Canary vs. Blue-Green vs. Rolling

  * **Rolling Update:** Update server 1, then server 2, etc. (Easiest, but hard to roll back).
  * **Blue-Green:** Switch 100% traffic at once. (Safest for rollback, but risky impact).
  * **Canary:** Switch traffic gradually. (Safest for impact, but most complex setup).

#### 📄 `06-operational-and-deployment/25-immutable-infrastructure.md`



# 25\. Immutable Infrastructure

## 1\. The Concept

Immutable Infrastructure is an approach where servers are never modified after they are deployed. If you need to update an application, fix a bug, or apply a security patch, you do not SSH into the server to run `apt-get update`. Instead, you build a completely new machine image (or container), deploy the new instance, and destroy the old one.

## 2\. The Problem

  * **Scenario:** You have 20 servers running your application. They were all set up 2 years ago.
  * **The Configuration Drift:** Over time, sysadmins have logged in to tweak settings:
      * Server 1 has `Java 8u101` and a hotfix for Log4j.
      * Server 2 has `Java 8u102` but is missing the hotfix.
      * Server 3 has a random cron job installed by an employee who quit last year.
  * **The "Snowflake" Server:** Each server is unique (a snowflake). If Server 5 crashes, nobody knows exactly how to recreate it because the manual changes weren't documented.
  * **The Fear:** "Don't touch Server 1\! If you reboot it, it might not come back up."
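Drift of this kind can be made visible by diffing each server's actual state against the intended one. The sketch below uses a hypothetical inventory modeled on the three servers described above; the keys and values are illustrative, not output from a real tool.

```python
from typing import Dict

Config = Dict[str, str]


def drift(expected: Config, actual: Config) -> Dict[str, tuple]:
    """Return {key: (expected, actual)} for every setting that differs."""
    keys = expected.keys() | actual.keys()
    return {
        key: (expected.get(key), actual.get(key))
        for key in sorted(keys)
        if expected.get(key) != actual.get(key)
    }


# Hypothetical inventory modeled on the three servers described above.
expected = {"java": "8u101", "log4j_hotfix": "applied"}
fleet = {
    "server-1": {"java": "8u101", "log4j_hotfix": "applied"},
    "server-2": {"java": "8u102"},
    "server-3": {"java": "8u101", "log4j_hotfix": "applied", "cron": "unknown-job"},
}

report = {name: drift(expected, actual) for name, actual in fleet.items()}
```

A report like this only detects drift after the fact; the approach in the next section aims to make it impossible in the first place.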

## 3\. The Solution

Treat servers like cattle, not pets.

1.  **Bake:** Define your server configuration in code (Dockerfile, Packer). Build an image (AMI / Docker Image). This image is now "frozen" and immutable.
2.  **Deploy:** Launch 20 instances of this exact image.
3.  **Update:** To change a configuration, update the code, bake a *new* image (v2), and replace the old instances.
4.  **Prohibit SSH:** In extreme implementations, SSH access is disabled. No human *can* change the live server.
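The bake/deploy/update cycle can be modeled with frozen data classes: an image cannot be edited once built, so a change always means a new image and new instances. This is an analogy in code rather than real provisioning; `Image`, `Instance`, and `roll_out` are hypothetical names.

```python
from dataclasses import dataclass, FrozenInstanceError, replace
from typing import List


@dataclass(frozen=True)
class Image:
    """A baked image: once built, its contents cannot change."""
    version: str
    nginx_conf: str


@dataclass(frozen=True)
class Instance:
    name: str
    image: Image


def roll_out(fleet: List[Instance], new_image: Image) -> List[Instance]:
    """Build replacement instances from the new image; the old ones are discarded."""
    return [Instance(name=old.name, image=new_image) for old in fleet]


v1 = Image(version="v1", nginx_conf="worker_processes 2;")
fleet = [Instance(name=f"web-{n}", image=v1) for n in range(3)]

try:
    fleet[0].image.nginx_conf = "worker_processes 4;"   # The in-place "SSH fix"
    patched_in_place = True
except FrozenInstanceError:
    patched_in_place = False                            # Mutation is rejected

v2 = replace(v1, version="v2", nginx_conf="worker_processes 4;")  # Bake a new image
fleet = roll_out(fleet, v2)
versions = {instance.image.version for instance in fleet}
```

After `roll_out`, every instance references the same `v2` image, and `v1` is still available unchanged if a rollback is needed.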

### Junior vs. Senior View

| Perspective | Approach | Outcome |
| :--- | :--- | :--- |
| **Junior** | "I'll use Ansible to loop through all 100 servers and update the config file in place." | **Drift & Decay.** If the script fails on server \#42, that server is now inconsistent. The state of the fleet is unknown. |
| **Senior** | "I'll build a new Docker image with the new config. Kubernetes will roll out the new pods and terminate the old ones." | **Consistency.** We know exactly what is running in production because it is binary-identical to what we tested in staging. |

## 4\. Visual Diagram

## 5\. When to Use It (and When NOT to)

  * ✅ **Use when:**
      * **Cloud / Virtualization:** It requires the ability to provision and destroy VMs/Containers instantly (AWS, Azure, Kubernetes).
      * **Scaling:** Auto-scaling groups need a "Golden Image" to launch new instances from automatically.
      * **Compliance:** You can prove to auditors exactly what software version was running at any point in time by showing the image hash.
  * ❌ **Avoid when:**
      * **Physical Hardware:** You cannot throw away a physical Dell server every time you update Nginx. (Though you can re-image it via PXE boot, it's slow).
      * **Stateful Databases:** You generally *do* patch database servers in place (or rely on managed services like RDS) because moving terabytes of data to a new instance takes too long.

## 6\. Implementation Example (Packer & Terraform)

### Step 1: Define the Image (Packer)

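Create a definition that builds the OS + App dependencies. The `region`, `ssh_username`, `source_ami`, and the local `my-app.conf` path below are placeholders to replace with your own values.

```json
{
  "builders": [{
    "type": "amazon-ebs",
    "region": "us-east-1",
    "ami_name": "my-app-v1.0-{{timestamp}}",
    "instance_type": "t2.micro",
    "source_ami": "ami-12345678",
    "ssh_username": "ubuntu"
  }],
  "provisioners": [{
    "type": "file",
    "source": "my-app.conf",
    "destination": "/tmp/my-app.conf"
  }, {
    "type": "shell",
    "inline": [
      "sudo apt-get update",
      "sudo apt-get install -y nginx",
      "sudo cp /tmp/my-app.conf /etc/nginx/nginx.conf"
    ]
  }]
}
```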

*Run `packer build` -\> Output: `ami-0abc123`*

### Step 2: Deploy the Image (Terraform)

Update your infrastructure code to use the new AMI ID.

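```hcl
resource "aws_launch_configuration" "app_conf" {
  name_prefix   = "my-app-"      # Unique name per revision
  image_id      = "ami-0abc123"  # The new immutable image
  instance_type = "t2.micro"

  lifecycle {
    create_before_destroy = true # Create the new config before removing the old one
  }
}

resource "aws_autoscaling_group" "app_asg" {
  launch_configuration = aws_launch_configuration.app_conf.name
  availability_zones   = ["us-east-1a"] # Placeholder; or set vpc_zone_identifier
  min_size             = 3
  max_size             = 10

  # Without this block, existing instances keep running the old AMI.
  # An instance refresh replaces them gradually when the launch configuration changes.
  instance_refresh {
    strategy = "Rolling"
  }
}
```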

## 7\. The Golden Image vs. Base Image

  * **Golden Image:** Includes the OS, dependencies, AND the application code.
      * *Pros:* Fastest startup (machine is ready to serve traffic immediately).
      * *Cons:* Slow build time (every code change requires baking a full VM image).
  * **Base Image (Hybrid):** Includes OS + Dependencies (Java/Node). The Application code is downloaded at boot time (User Data).
      * *Pros:* Faster CI/CD pipeline.
      * *Cons:* Slower startup/scaling time.
      * *Senior Choice:* Use **Docker**. The "Golden Image" build time for a container is seconds, giving you the best of both worlds.

## 8\. Troubleshooting (The "Debug Container" Pattern)

If you can't SSH into production, how do you debug a crash?

1.  **Centralized Logging:** Logs must be shipped to ELK/Splunk immediately. You debug via logs, not `tail -f`.
2.  **Metrics:** Prometheus/Datadog provides the health vitals.
3.  **The Ephemeral Debug Container:** In Kubernetes, `kubectl debug` can attach a temporary container (with curl, netstat, etc.) to a running pod so you can inspect it without rebuilding the image or restarting the pod.

## 9\. Key Benefits Summary

1.  **Predictability:** Works in Prod exactly like it worked in Dev.
2.  **Security:** If a hacker compromises a server, you don't "clean" it. You kill it. The persistence of the malware is limited to the life of that instance.
3.  **Rollback:** Switch the Auto Scaling Group back to the previous AMI ID. Done.



#### 📄 `06-operational-and-deployment/README.md`


# 🚢 Group 6: Operational & Deployment

## Overview

**"It works on my machine" is not a deployment strategy.**

Writing code is the easy part. Getting that code into production reliably, without downtime, and ensuring it runs consistently across 100 servers is the hard part. This module shifts focus from *Code Architecture* to *Infrastructure Architecture*.

These patterns move you away from "Pet" servers (hand-crafted, fragile) to "Cattle" servers (automated, disposable). They introduce safety nets that allow you to deploy at 2 PM on a Friday without fear.

## 📜 Pattern Index

| Pattern | Goal | Senior "Soundbite" |
| :--- | :--- | :--- |
| **[23. Blue-Green Deployment](./23-blue-green-deployment.md)** | **Zero Downtime** | "Spin up the new version next to the old one. Switch the traffic instantly. If it breaks, switch back." |
| **[24. Canary Release](./24-canary-release.md)** | **Risk Reduction** | "Don't give the new update to everyone. Give it to 1% of users and see if they survive." |
| **[25. Immutable Infrastructure](./25-immutable-infrastructure.md)** | **Consistency** | "Never patch a running server. If you need to change a config, build a new image and replace the server." |

## 🧠 The Operational Checklist

Before approving a deployment strategy, a Senior Architect asks:

1.  **The "Undo" Test:** If the deployment fails 30 seconds after go-live, can we revert to the previous version in under 1 minute? (Blue-Green allows this).
2.  **The "Blast Radius" Test:** If we ship a critical bug, does it take down the entire platform, or just affect a small group? (Canary limits this).
3.  **The "Drift" Test:** Are the servers running in production exactly the same as the ones we tested in staging? Or has someone manually tweaked the `nginx.conf` on Prod-Server-05? (Immutable Infrastructure prevents this).
4.  **The "Database" Test:** Does the database schema support *both* the old code and the new code running simultaneously? (Required for all zero-downtime patterns).

## ⚠️ Common Pitfalls in This Module

  * **Infrastructure as ClickOps:** Manually clicking around the AWS Console to create servers. This is unrepeatable and dangerous. Use Terraform/CloudFormation.
  * **Ignoring the Database:** Implementing fancy Blue-Green deployments for the code but forgetting that a database migration locks the table for 10 minutes, causing downtime anyway.
  * **Lack of Observability:** Doing a Canary release without having the dashboards to actually tell if the Canary is failing.


