# Disaster Recovery in Azure

*Disaster recovery* is a critical aspect of any organization's data strategy. It refers to the set of processes and tools implemented to safeguard data, applications, and systems, enabling rapid recovery and continuity of operations in the face of disruptive events. By having a robust disaster recovery plan, businesses can minimize downtime, ensure data integrity, and maintain their ability to serve customers without significant interruptions.

## Motivation

Disasters can strike unexpectedly, causing data loss, system downtime, and business interruptions. In today's technology-driven world, organizations heavily rely on data and applications to function efficiently and serve customers. The consequences of such incidents can be catastrophic, leading to financial losses, damage to reputation, and a loss of customer trust.

Disaster recovery is essential for business continuity and mitigating the impact of unforeseen events. By having a robust disaster recovery plan, organizations can:

- **Minimize Downtime**: Rapidly restore critical systems to reduce downtime and maintain operations

- **Protect Data**: Implement backup and replication mechanisms to safeguard valuable data

- **Ensure Compliance**: Adhere to regulatory requirements for data protection and business continuity

- **Safeguard Reputation**: Maintain high availability to retain customer confidence and loyalty

- **Adapt to Threats**: Mitigate the impact of cyber attacks and secure critical systems and information

## Azure's Disaster Recovery Capabilities

Microsoft's cloud computing platform, Azure, offers a wide range of disaster recovery capabilities designed to protect data, applications, and services running on the platform. Some key features include:

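- **Azure Backup**: Azure Backup allows you to regularly back up virtual machines, SQL Server databases running on Azure VMs, file shares, and other resources. It provides an automated and scalable solution for data protection. Azure SQL Database itself relies on its own built-in automated backups, which are covered later in this lesson.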

- **Geo-Replication for Azure SQL Database**: Geo-replication enables you to asynchronously replicate your Azure SQL Database to a secondary region, providing data redundancy and disaster recovery options

- **Azure Site Recovery**: Azure Site Recovery (ASR) is a comprehensive disaster recovery solution that replicates and orchestrates the failover of on-premises and virtual machine workloads to Azure or between Azure regions

- **Azure Virtual Machine Scale Sets**: Virtual Machine Scale Sets allow you to deploy and manage a group of identical, load-balanced VMs to ensure high availability and automatic scaling

> In this lesson we will focus on different disaster recovery scenarios for Azure SQL Database.

## Types of Backups in Azure SQL Database

In Azure SQL Database, various backup options are available to meet different data protection needs. Understanding these backup types allows you to make informed decisions about which approach best suits your organization's requirements. Let's explore the three primary types of backups in Azure SQL Database:

### 1. Automated Backups

*Automated backups* are managed by Azure SQL Database and are enabled by default for all databases. The service takes three kinds of backups on a fixed schedule:

- Full backups are taken every week
- Differential backups are taken every 12 or 24 hours
- Transaction log backups are taken approximately every 10 minutes

Together, these backups make it possible to restore a database to any point in time within the retention period.

#### Advantages of Automated Backups

- Simple and hassle-free setup, as Azure SQL Database handles the backup process
- Regular and frequent backups ensure minimal data loss in case of a disaster
- No additional cost for automated backups, as they are included in the service tier pricing

#### Limitations of Automated Backups

- Limited retention period: Automated backups are retained for 7 days by default. You can configure the retention period from 1 to 35 days, except in the Basic service tier, where the maximum is 7 days.
- Lack of flexibility in scheduling: You cannot customize the timing or frequency of automated backups
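
The retention limit can be reasoned about as a simple date calculation. The following is a minimal Python sketch, not an Azure API: the function name `is_within_retention` and the sample dates are made up for illustration, and the retention period is passed in as a parameter rather than read from a real database.

```python
from datetime import datetime, timedelta


def is_within_retention(restore_point: datetime, now: datetime, retention_days: int) -> bool:
    """Return True if restore_point lies inside the retention window that ends at now."""
    if restore_point > now:
        return False  # a future point in time can never be restored
    earliest_restorable = now - timedelta(days=retention_days)
    return restore_point >= earliest_restorable


# Hypothetical example: today is 31 January, retention is 7 days
now = datetime(2024, 1, 31, 12, 0)
print(is_within_retention(datetime(2024, 1, 28, 9, 0), now, retention_days=7))  # True
print(is_within_retention(datetime(2024, 1, 10, 9, 0), now, retention_days=7))  # False
```

A check like this is useful when deciding whether an incident can still be recovered with a point-in-time restore, or whether a longer-lived backup is needed.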

### 2. Manual Backups

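In contrast to automated backups, *manual backups* provide more control over the backup process. With manual backups, you initiate the backup operations and choose when and where to store the backup files. In Azure SQL Database this typically means exporting the database to a BACPAC file or creating a database copy, because the native `BACKUP DATABASE` command is not supported. This approach is particularly useful when you need on-demand backups at specific points in time.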

#### Advantages of Manual Backups

- Customizable backup schedule: You can decide when to take backups based on your organization's needs
- Longer retention periods: You can retain manual backups for an extended duration, depending on your storage capacity
- Flexibility in storage location: Manual backups can be stored in Azure Blob Storage or locally

#### Limitations of Manual Backups

- Manual setup and management: The responsibility for initiating and managing backups lies with the user
- Potential data loss risk: If you forget to schedule or perform backups regularly, there is a higher risk of data loss in case of a disaster

### 3. Long-Term Retention Backups

Long-term retention backups allow you to keep historical backups for an extended period, beyond the standard retention period provided by automated backups. This feature is useful for compliance requirements or when you need to retain backups for a more extended duration.

#### Advantages of Long-Term Retention Backups

- Compliance support: Long-term retention helps organizations meet regulatory and compliance requirements
- Extended backup history: You can keep backups for up to 10 years, depending on the retention policy you configure

#### Limitations of Long-Term Retention Backups

- Additional cost: Long-term retention backups incur additional storage costs
- Not enabled by default: You must configure a long-term retention policy for each database before any backups are kept beyond the standard retention period


## Mimicking Data Loss in a Production Environment

Disaster recovery procedures are crucial for ensuring business continuity and data protection in the event of unforeseen incidents. However, relying solely on theoretical plans without testing them in a controlled environment can be risky. Testing disaster recovery procedures allows organizations to:

- Validate recovery strategies and identify weaknesses
- Build confidence in the effectiveness of the plan
- Reduce downtime by practicing efficient recovery
- Identify gaps and continuously improve the process
- Comply with regulations and industry standards

### Methods to Mimic Data Loss in Azure SQL Database

Azure SQL Database provides various methods to mimic data loss scenarios in a controlled environment. Here are some common techniques:

- **Intentional Deletion of Data**: You can simulate data loss by intentionally deleting critical data from the database. This could be individual records, entire tables, or specific data ranges.

- **Data Corruption**: Introduce data corruption into the database by modifying specific records or fields, either directly in the database or through application transactions

- **Accidental Updates**: Introduce erroneous updates to the data, either through direct manipulation or by running faulty scripts

- **Service Outages**: Simulate service outages or connectivity issues to the database to test failover and disaster recovery readiness

- **Simulated Regional Outages**: If you have implemented geo-replication, simulate a regional outage in the primary region to trigger a failover to the secondary region. This scenario is covered in a later section.

### Precautions When Mimicking Data Loss

When mimicking data loss scenarios, exercise caution to avoid unintended consequences. Follow these precautions:

- **Use Non-Production Environments**: Perform data loss testing in non-production environments to prevent any impact on live data and operations

- **Backup the Database**: Take a backup of the database before initiating any data loss scenarios, ensuring you can revert to a known good state if needed

- **Involve Stakeholders**: Inform relevant stakeholders about the testing process to avoid misunderstandings and maintain transparency

- **Document the Process**: Record the steps followed and the outcomes during testing for documentation and analysis

By mimicking data loss in a controlled environment and testing disaster recovery procedures regularly, organizations can ensure that they are well-prepared to handle real-life incidents and minimize the impact on critical business operations.

### Hands-On: Mimicking Data Loss in a Production Azure SQL Database

Before continuing with this hands-on, make sure you have completed the necessary hands-on exercises from the previous lessons in this pathway. If you have, you should have a production Windows VM with Azure Data Studio installed. Azure Data Studio should have a connection established to a production Azure SQL Database, which hosts the `sales_database`.

To mimic data loss in the `sales_database` Azure SQL Database, follow these steps:

- Open Azure Data Studio on the production VM and locate the previously created connection to the Azure SQL Database

- Open a query window by right-clicking on the production database and selecting **New Query**. Let's examine the data before mimicking the data loss. For example, let's select all the entries in the `dbo.dim_products` table. Once you have written the query, press **Run** to execute it.

<p align="center">
    <img src="images/FirstQuery.png" height="600" width="1000"/>
</p>

- Scroll all the way down in the **Results** pane to observe the number of entries in the table. In this example, there are 1000 entries in the `dbo.dim_products` table.

- Write and execute SQL queries to intentionally delete or corrupt specific data. For example:

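```sql
-- Intentional deletion: remove 100 rows (TOP without ORDER BY picks arbitrary rows)
DELETE TOP (100)
FROM dbo.dim_products;

-- Data corruption: blank out the price on 100 of the remaining rows
UPDATE TOP (100) dbo.dim_products
SET product_price = NULL;
```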

- Verify the data loss by querying the affected table to confirm that the data has been deleted or corrupted. For example, if you ran the query that deletes 100 rows from the `dbo.dim_products` table, you should now see only 900 entries in the table.

<p align="center">
    <img src="images/CorruptedData.png" height="600" width="1000"/>
</p>
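
If you would like to rehearse the same delete-and-verify cycle without touching any Azure resource, the sketch below reproduces it against a throwaway in-memory SQLite database using Python's standard library. It is only an approximation of the hands-on above: SQLite stands in for Azure SQL Database, the `product_id` column and the sample price are invented for illustration, and SQLite has no `DELETE TOP (n)`, so a `LIMIT` subquery is used instead.

```python
import sqlite3

# Throwaway in-memory database: nothing real is affected
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_products (product_id INTEGER PRIMARY KEY, product_price REAL)"
)
conn.executemany(
    "INSERT INTO dim_products (product_id, product_price) VALUES (?, ?)",
    [(i, 9.99) for i in range(1, 1001)],  # 1000 hypothetical products
)


def row_count(connection: sqlite3.Connection) -> int:
    """Count the rows currently in dim_products."""
    return connection.execute("SELECT COUNT(*) FROM dim_products").fetchone()[0]


before = row_count(conn)

# Intentional deletion: remove 100 rows
conn.execute(
    "DELETE FROM dim_products WHERE product_id IN "
    "(SELECT product_id FROM dim_products ORDER BY product_id LIMIT 100)"
)
after_delete = row_count(conn)

# Data corruption: null the price on 100 of the remaining rows
conn.execute(
    "UPDATE dim_products SET product_price = NULL WHERE product_id IN "
    "(SELECT product_id FROM dim_products ORDER BY product_id LIMIT 100)"
)
corrupted = conn.execute(
    "SELECT COUNT(*) FROM dim_products WHERE product_price IS NULL"
).fetchone()[0]

print(before, after_delete, corrupted)  # 1000 900 100
```

Capturing the row count before and after the change gives you a concrete number to compare against once the database has been restored.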

## Hands-On: Restoring a Database from an Azure SQL Database Backup

This section provides step-by-step instructions for restoring a database from an Azure SQL Database backup after experiencing data loss. For this example, we will use our recently corrupted `sales_database` Azure SQL Database:

- Navigate to the Azure portal and from the Azure SQL Database dashboard, locate and select the target database that needs restoration

- From the SQL Database Home Page select the **Restore** option at the top bar on the page

<p align="center">
    <img src="images/RestoreOption.png" height="450" width="950"/>
</p>

- This will open a **Restore database** window. First, select the **Restore point**: choose a point in time before the data loss occurred. In this example, I will select a restore point on the same day, two hours before the data loss, because I know no data was added to the database during that period. For an active database that is updated in real time, choose the closest point in time before the data loss occurred to minimize the amount of lost data.

<p align="center">
    <img src="images/RestorePoint.png" height="450" width="900"/>
</p>

- Next, I will choose the new **Database name**, which in this example is the production database name with `-restored` appended. You are now ready to restore the database: click **Review + create** and then click **Create**. Azure will begin restoring the selected backup to the specified destination. This will take a couple of minutes to complete.

- Once the new database has been deployed, we will be able to see it in the resource list on the Azure SQL Database page. To verify that the database has been restored to the correct point in time, before the data loss occurred, we will establish a connection to it using Azure Data Studio.

<p align="center">
    <img src="images/RestoredDatabase.png" height="500" width="1000"/>
</p>

We can see that our restored database now has the correct number of entries (1000) in the `dbo.dim_products` table, indicating that the point-in-time restore worked as expected.
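
The two decisions made in the portal, which restore point to pick and what to call the restored database, can be written down as a small helper. This is a minimal Python sketch under stated assumptions: the function names, the incident timestamp, and the source name `production-database` are illustrative, and no Azure call is made.

```python
from datetime import datetime, timedelta


def choose_restore_point(incident_time: datetime, safety_margin: timedelta) -> datetime:
    """Pick a point in time shortly before the incident occurred."""
    return incident_time - safety_margin


def restored_name(source_name: str, suffix: str = "-restored") -> str:
    """Derive the name of the restored database from the source database name."""
    return f"{source_name}{suffix}"


# Hypothetical incident at 14:30, restored to two hours earlier as in the walkthrough
incident = datetime(2024, 3, 5, 14, 30)
restore_point = choose_restore_point(incident, timedelta(hours=2))

print(restore_point.isoformat())             # 2024-03-05T12:30:00
print(restored_name("production-database"))  # production-database-restored
```

For a busy database, the safety margin would be as small as you can justify, since every minute between the restore point and the incident is data you give up.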

## Geo-Replication in Azure SQL Database

*Geo-Replication* is a disaster recovery feature in Azure SQL Database that provides data redundancy and high availability by asynchronously replicating your primary database to a secondary region. This secondary region is often in a different geographic location, ensuring data durability in case of a regional outage or disaster.

### How Geo-Replication Works in Azure SQL Database

- **Primary Database**: The primary database is the original database that serves your application and handles read and write operations

- **Secondary Database**: The secondary database is a read-only copy of the primary database located in a different Azure region. It is synchronized asynchronously with the primary database to minimize any performance impact on the primary.

- **Data Replication**: As data changes occur in the primary database, these changes are asynchronously replicated to the secondary database through Azure's reliable infrastructure

- **Automatic Failover**: In the event of a regional outage or planned maintenance, you can manually initiate a failover. If the databases belong to a failover group, you can instead rely on the group's automatic failover policy, which promotes the secondary database to the primary role to minimize downtime.
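
To make the asynchronous part concrete, here is a toy Python model of the roles described above. It is a sketch for intuition only and not how Azure implements replication: the `Database` and `GeoReplicationLink` classes and the sample keys are invented for this example.

```python
class Database:
    """A database reduced to a name and a key-value store."""

    def __init__(self, name: str):
        self.name = name
        self.rows = {}


class GeoReplicationLink:
    """Toy model: writes commit on the primary and reach the secondary later."""

    def __init__(self, primary: Database, secondary: Database):
        self.primary = primary
        self.secondary = secondary
        self.pending = []  # changes committed on the primary but not yet replicated

    def write(self, key, value):
        self.primary.rows[key] = value     # the primary commits immediately
        self.pending.append((key, value))  # the change is queued for replication

    def replicate(self, max_changes=None):
        """Apply up to max_changes queued changes to the secondary (all if None)."""
        batch = self.pending[:max_changes]
        for key, value in batch:
            self.secondary.rows[key] = value
        del self.pending[:len(batch)]
        return len(batch)

    @property
    def lag(self) -> int:
        """Number of changes the secondary is behind the primary."""
        return len(self.pending)


link = GeoReplicationLink(Database("primary"), Database("secondary"))
link.write("order-1", "shipped")
link.write("order-2", "pending")
print(link.lag)                                  # 2
link.replicate(max_changes=1)
print(link.secondary.rows)                       # {'order-1': 'shipped'}
link.replicate()
print(link.secondary.rows == link.primary.rows)  # True
```

The window in which `lag` is greater than zero is exactly the data that could be lost if the primary region failed at that moment.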

### Configuring and Setting up Geo-Replication

Follow these steps to configure and set up geo-replication for your Azure SQL Database:

- Navigate to the Azure portal and from the Azure SQL Database dashboard, select the primary database you want to replicate. We will use the previously restored production database here. 

- In the left-hand menu of the primary database blade, click on **Replicas** under **Data Management**

<p align="center">
    <img src="images/ReplicationBlade.png" height="600" width="900"/>
</p>

- Click on **Create replica** to begin the setup process. In the **Geo Replica** menu, we first have to create a new SQL Server using the **Create new** button under the **Server** pane. This server should be located in a region geographically distant from the primary region to ensure data redundancy. It represents the secondary region where the database will be replicated.

- In this example, we are creating a new server called `my-replication-server` that is located in **(US) East US**. We will select **Use SQL authentication** as the authentication method and provide the SQL login credentials for this server.

<p align="center">
    <img src="images/ReplicationServer.png" height="600" width="900"/>
</p>

- Press **OK** to create the new SQL Server. This will redirect us back to the **Geo Replica** window where now we should have the new server details under the **Server** field in the **Database details** pane.

- Click **Review + create** and then click on **Create** to initiate the replication process. Azure will start the initial data synchronization between the primary and secondary databases. This might take a couple of minutes to complete.

Once the resource has been provisioned, go to the resource page, where you should see the following information:

<p align="center">
    <img src="images/GeoReplicaDatabase.png" height="350" width="650"/>
</p>

This indicates that the database is a replica (as we can see under the **Replica type** field), with `production-database-restored` as its primary database. Congratulations, you have successfully created a geo-replica of your primary database!

## Test Failover and Failback

> *Failover* is the process of switching the workload from the primary region to the secondary region in a geo-replicated environment.

It is typically performed during planned maintenance or in response to a disaster in the primary region. Failover ensures high availability and business continuity by allowing applications to continue running from the secondary region.

> *Failback* is the process of reverting the workload back to the primary region after a successful failover.

Once the primary region is restored, the workload is switched back from the secondary region to the primary region, ensuring that the original production environment is reinstated.

### Initiating a Test Failover to the Secondary Region

A test failover is a non-disruptive way to verify the functionality of the failover environment without impacting the production workload. Follow these steps to initiate a test failover to the secondary region:

- Navigate to the Azure portal and, from the Azure SQL Database dashboard, select the SQL server that hosts the primary database you want to fail over. In this example, that is the server hosting our primary database `production-database-restored` in the UK region, not the US server that hosts the geo-replica.

- Select **Failover groups** under the **Data management** pane, and then select **Add group** to create a new failover group

<p align="center">
    <img src="images/AddFailoverGroup.png" height="550" width="1000"/>
</p>

- On the **Failover group** page, enter a name for the failover group. Then, for the **Server**, select your secondary server, not the server on which the primary database resides. For example, this could be the replication server created earlier. Leave the rest of the settings as default and click **Create**.

- Now navigate to the Azure SQL Server on which your secondary (replica) database resides. In this example, that is `my-replication-server`.

- Here under **Failover groups** you should see the newly created failover group

<p align="center">
    <img src="images/FailoverGroup.png" height="350" width="950"/>
</p>

- Open the failover group page. It lists the two servers in the failover group and shows which one is the primary and which is the secondary:

<p align="center">
    <img src="images/BeforeFailover.png" height="450" width="950"/>
</p>

- To initiate a planned failover (without data loss), select **Failover** from the task pane to fail over the failover group containing your database. You will likely receive a warning about switching the secondary database to the primary role. Click **Yes** to continue.

- Wait for the failover to complete and review which server is now primary and which server is secondary. If failover succeeded, the two servers should have swapped roles. Here, you can also use the connection details of the secondary server to connect to the new primary database, and run tests to validate the database's functionality in the secondary region, ensuring that it is working as expected.

<p align="center">
    <img src="images/AfterFailover.png" height="500" width="900"/>
</p>

- Select **Failover** again to fail the servers back to their original roles

By following these steps, you can safely perform a test failover to verify your disaster recovery environment's functionality and then perform a failback to revert to the primary region. Testing these processes regularly ensures the readiness of your disaster recovery plan and boosts confidence in your organization's ability to handle unexpected incidents.
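
The role swap performed in the portal can be pictured with a few lines of Python. This is an illustrative model only: the `FailoverGroup` class is invented for this sketch, and `uk-server` is a placeholder for the primary server, whose actual name is not given in this lesson.

```python
from dataclasses import dataclass


@dataclass
class FailoverGroup:
    """Toy failover group that tracks which server currently holds each role."""

    primary: str
    secondary: str
    pending_changes: int = 0  # changes not yet replicated to the secondary

    def planned_failover(self) -> None:
        # A planned failover waits for the secondary to catch up, so nothing is lost
        self.pending_changes = 0
        # The two servers then swap roles
        self.primary, self.secondary = self.secondary, self.primary


group = FailoverGroup(primary="uk-server", secondary="my-replication-server", pending_changes=3)

group.planned_failover()
print(group.primary, group.secondary)  # my-replication-server uk-server

group.planned_failover()  # fail the servers back to their original roles
print(group.primary, group.secondary)  # uk-server my-replication-server
```

Because the operation is symmetric, failing back is simply a second failover, which matches the final step of the walkthrough.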


## Performing a Planned Failover to a Secondary Region

A *planned failover* is typically executed when you need to perform maintenance or planned downtime in the primary region. It allows you to proactively move your workload to the secondary region, ensuring minimal disruption to your applications and users during the maintenance window. Some common circumstances requiring a planned failover include:

- **Planned Maintenance**: When you need to apply updates, patches, or configuration changes to the primary region's infrastructure or services

- **Disaster Recovery Drills**: To validate the effectiveness of your disaster recovery strategy and ensure the secondary region can effectively handle the workload

- **Geographic Load Balancing**: For load balancing purposes, you might decide to temporarily move the workload to the secondary region to distribute traffic evenly

Performing a planned failover requires careful planning and consideration to minimize any impact on applications and users. Here are some important considerations:

- **Downtime Window**: Plan the failover during a scheduled maintenance window or during periods of low user activity to minimize disruptions

- **Application Awareness**: Ensure that your applications are designed to handle failover scenarios, and any necessary configuration changes are made to point to the secondary region after the planned failover

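- **Data Synchronization**: A planned failover fully synchronizes the secondary with the primary before switching roles, so no data is lost, but a large replication lag makes the failover take longer. A forced (unplanned) failover skips this synchronization and can lose recent changes, so make sure you understand the replication lag and its implications.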

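- **Read-Only Access**: After the planned failover, the database in the secondary region becomes the read-write primary, and the database in the original primary region becomes a read-only secondary. Adjust application behavior accordingly so that write operations are sent to the new primary.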

- **Reverse Failover Plan**: Plan for a reverse failover (failback) to bring the workload back to the primary region after the maintenance is complete
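
Several of these considerations can be turned into a simple pre-flight check that runs before anyone clicks **Failover**. The Python sketch below is one possible shape for such a check, not an Azure feature: the function name, the parameters, and the thresholds are all assumptions, and it assumes a maintenance window that does not cross midnight.

```python
from datetime import datetime, time


def planned_failover_blockers(
    now: datetime,
    window_start: time,
    window_end: time,
    lag_seconds: float,
    max_lag_seconds: float,
    app_failover_ready: bool,
) -> list:
    """Return the reasons a planned failover should not start yet (empty if none)."""
    blockers = []
    # Downtime window: assumes the window starts and ends on the same day
    if not (window_start <= now.time() < window_end):
        blockers.append("outside maintenance window")
    # Data synchronization: a large lag means a slower failover
    if lag_seconds > max_lag_seconds:
        blockers.append("replication lag too high")
    # Application awareness: clients must be able to reconnect to the new primary
    if not app_failover_ready:
        blockers.append("application not ready to reconnect")
    return blockers


# Hypothetical maintenance window from 01:00 to 03:00
print(planned_failover_blockers(
    datetime(2024, 6, 1, 2, 15), time(1, 0), time(3, 0), 2.0, 5.0, True
))  # []
print(planned_failover_blockers(
    datetime(2024, 6, 1, 14, 0), time(1, 0), time(3, 0), 30.0, 5.0, True
))  # ['outside maintenance window', 'replication lag too high']
```

Returning a list of reasons instead of a single boolean makes the result easy to include in the documentation you keep for each failover drill.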

## Key Takeaways

- Disaster recovery is essential for business continuity, and Azure offers various capabilities to ensure high availability and data protection
- Azure SQL Database provides automated, manual, and long-term retention backups to cater to different data protection needs
- Testing disaster recovery procedures in a controlled environment allows you to prepare for real data loss scenarios, such as intentional deletions or data corruption
- Setting up geo-replication in Azure SQL Database ensures data redundancy and high availability by replicating the primary database to a secondary region
- Performing test failovers and failbacks helps validate the effectiveness of disaster recovery strategies and ensures a smooth transition between regions
- A planned failover is performed during maintenance or planned downtime and allows you to proactively move the workload to the secondary region