# Disaster Recovery: Restore Cosmos DB Data from OneLake Mirrored Storage

This notebook demonstrates how to restore Cosmos DB data in a recovery region by reading mirrored data from OneLake and ingesting it into a recovered Cosmos DB artifact. This is part of a complete disaster recovery strategy using Git integration for artifact configuration and OneLake mirroring for data recovery.

## 📋 Prerequisites

Before starting, complete these **disaster recovery preparation steps**:

### In the Primary Region (Before Disaster):
- ✅ **Git integration enabled** for your Workspace and Cosmos DB artifact
- ✅ **Cosmos DB mirroring to OneLake** enabled (automatic, RPO < 15 minutes)
- ✅ **Lakehouse created** with shortcuts to OneLake mirrored data
- ✅ **OneLake shortcut configured** to access mirrored Cosmos DB containers

### In the Recovery Region (After Disaster):
- ✅ **New Fabric workspace** created in the recovery region
- ✅ **Cosmos DB artifact recovered** from Git repository (configuration only, no data)
- ✅ **Custom Spark environment** with Cosmos DB Spark Connector libraries
- ✅ **Access to OneLake** in the disaster region (automatic replication)

> **Recovery Point Objective (RPO):** < 15 minutes (based on OneLake mirroring)  
> **Recovery Time Objective (RTO):** Minutes to hours (depends on data volume)

## 🎯 What This Notebook Does

This notebook will:
1. ✅ Read mirrored Cosmos DB data from OneLake Delta tables
2. ✅ Connect to the recovered Cosmos DB artifact in the recovery region
3. ✅ Ingest data into recovered containers using ItemOverwrite strategy
4. ✅ Process multiple containers in a single execution
5. ✅ Provide progress logging and record counts

## 🚀 Getting Started

### Step 1: Import This Notebook to Your Workspace

1. Download the `ingest-from-mirror.ipynb` file from the repository
2. In your **Fabric workspace** (recovery region), select **Import** → **Notebook**
3. Upload the downloaded `.ipynb` file
4. The notebook will be imported and ready to configure

### Step 2: Configure Your Spark Environment

Before running this notebook, you must create a custom Spark environment with the required Cosmos DB libraries:

1. In your workspace, select **+ New item** → **Environment**
2. Name it (e.g., `CosmosRecoveryEnvironment`)
3. Under **Custom libraries**, select **Upload** and add these JAR files:

**Required Libraries (Spark 3.5):**
- [`azure-cosmos-spark_3-5_2-12-4.41.0.jar`](https://repo1.maven.org/maven2/com/azure/cosmos/spark/azure-cosmos-spark_3-5_2-12/4.41.0/)
- [`fabric-cosmos-spark-auth_3-1.0.0.jar`](https://repo1.maven.org/maven2/com/azure/cosmos/spark/fabric-cosmos-spark-auth_3/1.0.0/)

4. **Publish** the environment
5. In this notebook, select **Settings** → **Environment** → Select your custom environment
6. Wait for the environment to attach (may take a few minutes)
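
Once the environment is attached, a quick classpath check can confirm that the uploaded JARs are visible to the session before any data is read. The following is a minimal sketch, not part of the original notebook: the `ConnectorClasspathCheck` object and its `missing` helper are hypothetical names, and the only class it looks for is the resolver class name that the write configuration in this notebook refers to.

```scala
import scala.util.Try

object ConnectorClasspathCheck {
  // Class name taken from the write configuration used later in this notebook.
  val requiredClasses: Seq[String] = Seq(
    "com.azure.cosmos.spark.fabric.FabricAccountDataResolver"
  )

  // Returns the class names that cannot be located, without initializing them.
  def missing(classNames: Seq[String]): Seq[String] = {
    val loader = Thread.currentThread().getContextClassLoader
    classNames.filter(name => Try(Class.forName(name, false, loader)).isFailure)
  }
}

// Report anything that is not on the classpath yet.
val notFound = ConnectorClasspathCheck.missing(ConnectorClasspathCheck.requiredClasses)
if (notFound.isEmpty) println("Connector libraries found on the classpath")
else println(s"Missing classes (is the environment attached?): ${notFound.mkString(", ")}")
```

If the list is not empty, re-check that the environment was published and selected for this notebook before continuing.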

### Step 3: Gather Required Information

Before running the code, collect the following information:

| Parameter | Description | Example |
|-----------|-------------|---------|
| **disasterRegionWorkspaceName** | Workspace name where OneLake mirrored data resides | `MyWorkspacePrimary` |
| **disasterRegionLakehouseName** | Lakehouse name with shortcuts to mirrored data | `CosmosBackupLakehouse` |
| **recoveryRegionCosmosAccountEndpoint** | Cosmos DB endpoint in recovery region | `https://abc123.eastus2.sql.cosmos.fabric.microsoft.com:443/` |
| **recoveryRegionCosmosDatabase** | Database name in recovered Cosmos DB artifact | `RecoveredDatabase` |
| **containerNamesToRecover** | List of container names to restore | `Seq("Products", "Orders", "Customers")` |

**To find your Cosmos DB endpoint:**
1. Open your **recovered Cosmos DB artifact** in the recovery region
2. Go to **Settings** → **Account Endpoint**
3. Copy the endpoint URL (includes the artifact ID and region)
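
Because the notebook ships with `<...>` placeholders, it can help to fail fast if any of them were left in place. The sketch below is illustrative only: `RecoveryConfigCheck`, `isPlaceholder`, and `problems` are hypothetical helpers, and the checks (non-empty, not wrapped in angle brackets, endpoint starting with `https://`) are assumptions about what a reasonable value looks like rather than rules enforced by Fabric.

```scala
object RecoveryConfigCheck {
  // True when a value is blank or still looks like "<SOME_PLACEHOLDER>".
  def isPlaceholder(value: String): Boolean = {
    val v = value.trim
    v.isEmpty || (v.startsWith("<") && v.endsWith(">"))
  }

  // Returns human-readable problems; an empty result means the values look usable.
  def problems(named: Seq[(String, String)], endpoint: String): Seq[String] = {
    val all = named :+ ("recoveryRegionCosmosAccountEndpoint" -> endpoint)
    val unfilled = all.collect {
      case (name, value) if isPlaceholder(value) => s"$name is empty or still a placeholder"
    }
    val badScheme =
      if (!isPlaceholder(endpoint) && !endpoint.trim.startsWith("https://"))
        Seq("recoveryRegionCosmosAccountEndpoint should start with https://")
      else Nil
    unfilled ++ badScheme
  }
}

// Example values from the parameter table above.
val issues = RecoveryConfigCheck.problems(
  Seq(
    "disasterRegionWorkspaceName" -> "MyWorkspacePrimary",
    "disasterRegionLakehouseName" -> "CosmosBackupLakehouse",
    "recoveryRegionCosmosDatabase" -> "RecoveredDatabase"
  ),
  "https://abc123.eastus2.sql.cosmos.fabric.microsoft.com:443/"
)
issues.foreach(println)
```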

### Step 4: Update Configuration and Run

1. Scroll to the **first code cell** below
2. Replace all placeholder values with your actual configuration
3. Select the **Spark (Scala)** kernel if not already selected
4. Run the notebook cells in order

## ⚠️ Important Notes

### Data Consistency
- **Schema drift**: If property data types changed over time, OneLake may upcast or store nulls
- **Hierarchical data**: Arrays and objects are serialized as JSON strings during mirroring
- **Deserialization**: You may need to deserialize JSON strings back to structured types
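
Where a mirrored column holds a JSON string, one way to turn it back into a structured value before writing is Spark's `from_json` function. This is a sketch under assumptions: `JsonColumnRestore` and `parseJsonColumn` are hypothetical names, and the `tags` column with an array-of-strings schema is invented for illustration; the real column names and schemas depend on your containers.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{ArrayType, DataType, StringType}

object JsonColumnRestore {
  // Replaces a JSON-string column with its parsed, structured value.
  // Strings that do not match `schema` typically come back as null,
  // so it is worth checking null counts after parsing.
  def parseJsonColumn(df: DataFrame, column: String, schema: DataType): DataFrame =
    df.withColumn(column, from_json(col(column), schema))
}

// Hypothetical usage inside the recovery loop, before the write:
// val restoredDF = JsonColumnRestore.parseJsonColumn(recoveryDF, "tags", ArrayType(StringType))
```

Applying this only to the columns you know were arrays or objects keeps plain string properties untouched.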

### Recovery Strategy
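- This notebook uses the `ItemOverwrite` write strategy, which upserts each item, so re-running the notebook overwrites existing items instead of creating duplicates
- Each container is processed sequentially
- Progress is logged to help monitor large ingestion operations
- Each DataFrame is cached with `persist()` so that the record count and the Cosmos DB write do not both re-read the Delta table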
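
Since containers are processed one after another, a failure in one of them stops the whole loop unless it is caught. The following sketch shows one possible way to keep going and collect a per-container outcome. It is not how the notebook cell below is written: `SequentialRecovery`, `recoverAll`, and the `recoverOne` callback are hypothetical names introduced here.

```scala
import scala.util.Try

object SequentialRecovery {
  // Runs `recoverOne` for each container in order and records either the
  // record count it returns (Right) or the error message (Left), without
  // stopping at the first failure.
  def recoverAll(containers: Seq[String])(recoverOne: String => Long): Seq[(String, Either[String, Long])] =
    containers.map { name =>
      val outcome: Either[String, Long] =
        Try(recoverOne(name)).toEither.left.map(e => String.valueOf(e.getMessage))
      name -> outcome
    }
}

// Stand-in callback for illustration; in practice it would wrap the
// read-and-write steps for one container and return the record count.
val summary = SequentialRecovery.recoverAll(Seq("Products", "Orders")) { name =>
  if (name == "Orders") throw new RuntimeException("write failed") else 120L
}
summary.foreach {
  case (name, Right(count)) => println(s"$name: recovered $count records")
  case (name, Left(error))  => println(s"$name: FAILED ($error)")
}
```

Because `ItemOverwrite` upserts, containers that failed can be retried on their own afterwards.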

### Performance Considerations
- Large containers may take significant time to process
- Monitor Spark executor logs for progress and errors
- Consider batching very large datasets if memory constraints occur
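
One way to batch a very large container is to split the DataFrame into disjoint slices by hashing a key column and to write the slices one at a time. This is a sketch under assumptions, not the notebook's own method: `BatchedWrite` and `inBatches` are hypothetical names, and using `id` as the key column assumes the mirrored table exposes the Cosmos DB item `id`.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, hash, lit, pmod}

object BatchedWrite {
  // Splits `df` into `numBatches` disjoint slices by hashing `keyColumn`,
  // then passes each slice to `writeBatch` in turn. Every row falls into
  // exactly one slice because the bucket is a deterministic function of the key.
  def inBatches(df: DataFrame, keyColumn: String, numBatches: Int)(writeBatch: DataFrame => Unit): Unit = {
    require(numBatches > 0, "numBatches must be positive")
    val bucket = pmod(hash(col(keyColumn)), lit(numBatches))
    (0 until numBatches).foreach { i =>
      writeBatch(df.filter(bucket === i))
    }
  }
}

// Hypothetical usage with the write configuration from this notebook:
// BatchedWrite.inBatches(recoveryDF, "id", 8) { batch =>
//   batch.write.format("cosmos.oltp").mode("Append").options(writeCfg).save()
// }
```

Each slice re-evaluates the filter against the source, so this trades extra reads of the Delta table for a smaller working set per write.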

---

## 📖 Related Documentation

- [Full Disaster Recovery Guide](./README.md) - Complete BCDR procedures and best practices
- [Cosmos DB Spark Connector Documentation](https://github.com/Azure/azure-sdk-for-java/tree/main/sdk/cosmos/azure-cosmos-spark_3_2-12)
- [OneLake Shortcuts](https://docs.microsoft.com/fabric/onelake/onelake-shortcuts)
- [Git Integration in Fabric](https://docs.microsoft.com/fabric/cicd/git-integration/intro-to-git-integration)

---

## 🔧 Configuration Section

Update the values below with your specific environment details, then run the cell to begin the recovery process.

In [None]:
// ======== USER INPUTS ========

// Workspace name in the disaster region
val disasterRegionWorkspaceName = "<DISASTER_REGION_WORKSPACE_NAME>"  

// Lakehouse name in the disaster region
val disasterRegionLakehouseName = "<DISASTER_REGION_LAKEHOUSE_NAME>"     

// Cosmos DB AccountEndpoint in the recovery region 
// Example: https://<ARTIFACT_ID>.<REGION>.sql.cosmos.fabric.microsoft.com:443/
val recoveryRegionCosmosAccountEndpoint = "<RECOVERY_REGION_COSMOS_ACCOUNT_ENDPOINT>"  

// Cosmos DB Database name in the recovery region 
val recoveryRegionCosmosDatabase = "<RECOVERY_REGION_COSMOS_DATABASE>"  

// List of container names to recover
val containerNamesToRecover = Seq("<CONTAINER_NAME_1>", "<CONTAINER_NAME_2>", "<CONTAINER_NAME_3>")

containerNamesToRecover.foreach { containerName =>
  println(s"--- Starting BCDR recovery for container: $containerName ---")

  // Construct the OneLake Delta path
  val onelakePath =
    s"abfss://${disasterRegionWorkspaceName}@onelake.dfs.fabric.microsoft.com/${disasterRegionLakehouseName}.Lakehouse/Tables/${containerName}"

  println(s"Reading Delta table from: $onelakePath")

  // Read from Delta Lakehouse table
  val recoveryDF = spark.read.format("delta").load(onelakePath)
  recoveryDF.persist()

  println(s"Number of records read: ${recoveryDF.count()}")

  // Cosmos write configuration
  val writeCfg = Map(
    "spark.cosmos.auth.type" -> "AccessToken",
    "spark.cosmos.accountEndpoint" -> recoveryRegionCosmosAccountEndpoint,
    "spark.cosmos.accountDataResolverServiceName" -> "com.azure.cosmos.spark.fabric.FabricAccountDataResolver",
    "spark.cosmos.useGatewayMode" -> "true",
    "spark.cosmos.auth.aad.audience" -> "https://cosmos.azure.com/.default",
    "spark.cosmos.database" -> recoveryRegionCosmosDatabase,
    "spark.cosmos.container" -> containerName,
    "spark.cosmos.read.consistencyStrategy" -> "LOCAL_COMMITTED",
    "spark.cosmos.diagnostics" -> "sampled",
    "spark.cosmos.write.strategy" -> "ItemOverwrite"
  )

  // Write to Cosmos DB Artifact in Recovery Region
  println(s"Writing data to Cosmos container: ${containerName}")
  recoveryDF
    .write
    .format("cosmos.oltp")
    .mode("Append")
    .options(writeCfg)
    .save()

  // Release the cached DataFrame before moving on to the next container
  recoveryDF.unpersist()

  println(s"Completed write for container: ${containerName}\n")
}
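
The OneLake path built inside the loop follows a fixed pattern, so it can be factored into a small function that is easy to check on its own. This is a sketch only: `OneLakePaths` and `tablePath` are hypothetical names, and the host is passed in explicitly rather than assumed.

```scala
object OneLakePaths {
  // Builds the abfss path to a Lakehouse table:
  // abfss://<workspace>@<host>/<lakehouse>.Lakehouse/Tables/<table>
  def tablePath(host: String, workspace: String, lakehouse: String, table: String): String = {
    require(Seq(host, workspace, lakehouse, table).forall(_.trim.nonEmpty), "all path parts must be non-empty")
    s"abfss://${workspace}@${host}/${lakehouse}.Lakehouse/Tables/${table}"
  }
}

// Example with the sample names from the parameter table.
val examplePath = OneLakePaths.tablePath(
  "onelake.dfs.fabric.microsoft.com", "MyWorkspacePrimary", "CosmosBackupLakehouse", "Products")
println(examplePath)
```

Printing the path for each container before reading makes a mistyped workspace or Lakehouse name easier to spot.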
