# 🧠 cloudChronicles Lab #001: Disaster Recovery Detective

**Lab Type:** Idea  
**Estimated Time:** 30–45 mins  
**Skill Level:** Beginner

```python
# Let's begin by printing your name to personalize the notebook
your_name = ""
print(f"Welcome to the lab, {your_name}!")
```

## 🔍 STAR Method Lab Prompt

**Situation:**  
A regional outage has occurred in Google Cloud's **us-central1** region. The services hosted in this region, including databases, storage, and application services, are impacted. Critical business functions depend on these services, so rapid recovery is required to minimize downtime and prevent data loss.

**Task:**  
Implement a failover and disaster recovery process to restore services as quickly as possible with minimal data loss and downtime. This involves:
- Ensuring that replicated database instances can be promoted to the new active region.
- Switching to a secondary Cloud Storage location that contains the same data in another region.
- Configuring Pub/Sub to reroute traffic for messaging and event streaming.
- Ensuring application instances fail over and continue running in another region.

**Action:**  
### Step 1: Cloud SQL Replicas for Database Failover

**Pre-Outage Setup:**
- Ensure that **Cloud SQL replicas** are deployed in at least one other region, such as **us-east1** or **europe-west1**, to provide geographic redundancy.
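- Use **cross-region replication** to keep the replicas in step with the primary database. Replication is asynchronous, so monitor replication lag: writes that have not reached the replica when the outage begins may be lost.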

**During Outage:**
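- Promote the **Cloud SQL read replica** to be the primary database in the secondary region. Promotion is irreversible: the promoted replica becomes a standalone instance and cannot be turned back into a replica.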
- Update application configurations to point to the new Cloud SQL instance in the secondary region.

**Post-Outage Recovery:**
- Once the **us-central1 region is restored**, evaluate the consistency of data between the original primary and the promoted replica.
- If necessary, perform a manual reconciliation of data, and reconfigure the replication back to us-central1 after the outage is resolved.
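
The during-outage steps above can be sketched as two small helpers: one assembles the `gcloud sql instances promote-replica` command so it can be reviewed before anyone runs it, and one repoints an application setting at the promoted instance. The instance name, project ID, and the `DB_INSTANCE` config key are hypothetical placeholders, not values from this lab.

```python
import shlex
from typing import Dict, List


def build_promote_command(replica_name: str, project_id: str) -> List[str]:
    """Return the argv for promoting a Cloud SQL read replica."""
    return [
        "gcloud", "sql", "instances", "promote-replica", replica_name,
        "--project", project_id,
    ]


def repoint_app_config(config: Dict[str, str], new_instance: str) -> Dict[str, str]:
    """Return a copy of the app config that points at the promoted instance."""
    updated = dict(config)  # copy so the original config is left untouched
    updated["DB_INSTANCE"] = new_instance  # hypothetical config key
    return updated


# Hypothetical names, for illustration only.
command = build_promote_command("orders-replica-east", "my-project")
print(shlex.join(command))

old_config = {"DB_INSTANCE": "my-project:us-central1:orders-primary"}
new_config = repoint_app_config(old_config, "my-project:us-east1:orders-replica-east")
print(new_config["DB_INSTANCE"])
```

Building the command as a list rather than executing it keeps the sketch safe to run in a notebook; in a real runbook the list could be passed to `subprocess.run` after a human confirms it.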

### Step 2: Multi-Region Cloud Storage

**Pre-Outage Setup:**
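- Store critical data in **dual-region or multi-region Cloud Storage buckets**, which keep objects redundantly in more than one region. A bucket's location is chosen when the bucket is created.
- Enable **bucket versioning** to guard against accidental overwrites and deletions. If you rely on separate regional buckets instead, copy critical data between them on a schedule (for example with Storage Transfer Service).
- Use **Lifecycle Policies** to control retention and storage class for compliance. Lifecycle rules do not change where data is stored, so redundancy must come from the bucket location or from explicit copies.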

**During Outage:**
- If the data lives in a dual-region or multi-region bucket, applications can keep using the same bucket name.
- If the data lives in separate regional buckets, update storage paths and API calls to point to the bucket in a secondary region (such as **europe-west1** or **us-east1**).

**Post-Outage Recovery:**
- Once the us-central1 region is restored, perform a **data sync** to ensure that both buckets (primary and secondary) are consistent.
- Use **gsutil rsync** or similar tools to sync data back if any discrepancies occur.
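
Before syncing data back, it helps to know what actually differs. The sketch below compares two object listings, each a plain `{object_name: checksum}` dictionary, and reports what is missing or mismatched on each side. The listings and checksum strings are made-up illustration data; in practice they would come from listing each bucket.

```python
from typing import Dict, List


def diff_buckets(primary: Dict[str, str], secondary: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two {object_name: checksum} listings and report discrepancies."""
    return {
        # Objects written to the secondary during the outage.
        "missing_in_primary": sorted(n for n in secondary if n not in primary),
        # Objects that never reached the secondary.
        "missing_in_secondary": sorted(n for n in primary if n not in secondary),
        # Objects present on both sides whose contents differ.
        "mismatched": sorted(
            n for n in primary if n in secondary and primary[n] != secondary[n]
        ),
    }


# Made-up listings, for illustration only.
primary_listing = {"a.csv": "c1", "b.csv": "c2", "c.csv": "c3"}
secondary_listing = {"a.csv": "c1", "b.csv": "c2-new", "d.csv": "c4"}

report = diff_buckets(primary_listing, secondary_listing)
for category, names in report.items():
    print(category, names)
```

An empty report in all three categories suggests the buckets already agree and no sync is needed.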

### Step 3: Pub/Sub Event Routing and Failover

**Pre-Outage Setup:**
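- Pub/Sub is a global service, so topics and subscriptions are not tied to a single region and need no separate replication setup. Check each topic's **message storage policy** so that message storage is not restricted to **us-central1** alone.
- Deploy publishers, subscribers, and **push endpoints** in more than one region (e.g., **us-east1** or **europe-west1**) so messages can still be produced and handled during an outage.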

**During Outage:**
- Use **Pub/Sub dead-letter topics** and **retry policies** to manage message delivery during the outage.
- Reroute Pub/Sub traffic to the healthy region by running publishers and subscribers there; because Pub/Sub is global, they can keep using the same topics and subscriptions.

**Post-Outage Recovery:**
- Monitor messages for any delayed processing or errors due to the region failover.
- Once the us-central1 region is online, reinstate the traffic flow to the primary region.
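
To reason about how retry policies and dead-letter topics interact during the outage, here is a minimal simulation. It assumes a simple doubling backoff capped at a maximum, which is an illustrative assumption and not necessarily the exact schedule Pub/Sub uses. The parameter names mirror the retry policy's minimum and maximum backoff and the dead-letter policy's maximum delivery attempts.

```python
from typing import List


def backoff_schedule(min_backoff_s: int, max_backoff_s: int, attempts: int) -> List[int]:
    """Return per-attempt delays: doubling from the minimum, capped at the maximum."""
    delays = []
    delay = min_backoff_s
    for _ in range(attempts):
        delays.append(min(delay, max_backoff_s))
        delay *= 2  # assumed doubling, for illustration only
    return delays


def route_after_failure(delivery_attempt: int, max_delivery_attempts: int) -> str:
    """Decide whether a failed message is retried or sent to the dead-letter topic."""
    if delivery_attempt >= max_delivery_attempts:
        return "dead-letter"
    return "retry"


# Illustrative values: 10 s minimum, 600 s maximum, 8 attempts.
print(backoff_schedule(10, 600, 8))
print(route_after_failure(delivery_attempt=3, max_delivery_attempts=5))
print(route_after_failure(delivery_attempt=5, max_delivery_attempts=5))
```

Summing the schedule gives a rough upper bound on how long a message may wait before it reaches the dead-letter topic, which is useful when sizing the post-outage backlog.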

### Step 4: Compute Engine / Kubernetes Engine Failover

**Pre-Outage Setup:**
- Set up **instance groups** or **GKE clusters** in a secondary region.
- Use **Cloud Load Balancing** with a **global frontend** to distribute traffic to healthy instances across multiple regions.
- For Kubernetes, register the clusters to a fleet and configure **Multi-Cluster Ingress** so that traffic can be served by clusters in different regions. (Kubernetes **Cluster Federation** is not a managed GKE feature.)

**During Outage:**
- If Compute Engine instances are impacted, quickly spin up VMs or deploy containers to the secondary region’s infrastructure.
- Reconfigure **Cloud Load Balancer** and **DNS** to route traffic to the secondary region.

**Post-Outage Recovery:**
- Once the **us-central1 region is operational**, gradually move workloads back to the primary region.
- After each shift, confirm that applications are healthy in us-central1, and monitor for any post-migration issues.
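
The routing decisions in this step can be written down as plain logic for a runbook. A global load balancer makes the failover choice automatically from its health checks; the sketch below only mirrors that decision and generates a stepwise traffic split for the gradual move back. The region priority list and the number of steps are assumptions chosen for illustration.

```python
from typing import Dict, List, Optional, Tuple


def pick_serving_region(priority: List[str], health: Dict[str, bool]) -> Optional[str]:
    """Return the first healthy region in priority order, or None if all are down."""
    for region in priority:
        if health.get(region, False):  # unknown regions count as unhealthy
            return region
    return None


def failback_weights(steps: int) -> List[Tuple[int, int]]:
    """Return (primary_pct, secondary_pct) pairs for a gradual shift back."""
    weights = []
    for i in range(1, steps + 1):
        primary_pct = round(100 * i / steps)
        weights.append((primary_pct, 100 - primary_pct))
    return weights


# Illustrative scenario: us-central1 is down, us-east1 is healthy.
regions = ["us-central1", "us-east1", "europe-west1"]
status = {"us-central1": False, "us-east1": True, "europe-west1": True}
print(pick_serving_region(regions, status))

# Four equal steps back to the primary once it has recovered.
print(failback_weights(4))
```

Pausing between steps to check error rates and latency gives a natural rollback point if the restored region misbehaves.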

**Expected Result:**  
The implementation of this DR plan results in a smooth and reliable failover with minimal disruption. By leveraging **Cloud SQL replicas**, **multi-region Cloud Storage**, and **Pub/Sub event routing**, your infrastructure can automatically or manually switch to a backup region with minimal downtime.

### Key Outcomes:
- **Minimal Data Loss:** Cross-region replication keeps databases and storage close to current in the secondary region. Because database replication is asynchronous, data loss is bounded by replication lag rather than guaranteed to be zero.
- **Continuous Application Availability:** The use of **global load balancing** and **multi-cluster configurations** lets application workloads fail over with little or no user-visible downtime.
- **Clear Recovery Path:** Once the us-central1 region is restored, the system can return to normal with minimal effort. A rehearsed failover and failback procedure also helps keep the recovery time objective (RTO) low and the user experience consistent.
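
These outcomes are easier to verify with numbers. The sketch below computes the achieved recovery point (time between the last replicated write and the outage) and the achieved recovery time (time from outage to restored service) from drill timestamps, then compares them with targets. All timestamps and targets are hypothetical drill values.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_replicated_write: datetime, outage_start: datetime) -> timedelta:
    """Window of writes that may be lost: outage start minus last replicated write."""
    return outage_start - last_replicated_write


def achieved_rto(outage_start: datetime, service_restored: datetime) -> timedelta:
    """Downtime: time from outage start until service is restored."""
    return service_restored - outage_start


def meets_objective(achieved: timedelta, objective: timedelta) -> bool:
    """True when the achieved value is within the stated objective."""
    return achieved <= objective


# Hypothetical drill timestamps and targets.
last_write = datetime(2024, 1, 1, 13, 59, 30)
outage = datetime(2024, 1, 1, 14, 0, 0)
restored = datetime(2024, 1, 1, 14, 12, 0)

rpo = achieved_rpo(last_write, outage)
rto = achieved_rto(outage, restored)
print("RPO met:", meets_objective(rpo, timedelta(minutes=1)))
print("RTO met:", meets_objective(rto, timedelta(minutes=15)))
```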

This disaster recovery plan keeps the infrastructure highly available, resilient, and fault-tolerant across regions, reducing the risk of significant business impact during a regional failure in Google Cloud.

## ✍️ Your Assignment

_Use this section to complete your deliverable:_

```markdown
(Example Format)

- **Primary Region**: us-central1  
- **Backup Location**: us-east1  
- **Failover Trigger**: Load balancer health check + Pub/Sub alert  
- **Redundancy Services**:  
   - Cloud SQL with failover  
   - Cloud Storage versioning  
   - Cloud Functions for health monitoring  
- **Backup Schedule**: Every 6 hours, daily export to multi-region bucket  
```