# Distributed Database

Reference: [Link](https://phoenixnap.com/kb/distributed-database)

## Introduction

Distributed databases are used for horizontal scaling. They are designed to meet workload requirements without having to:

1. change the database application, or
2. vertically scale a single machine.

Distributed databases address various issues, such as:

1. availability,
2. fault tolerance,
3. throughput,
4. latency,
5. scalability, and many other __*problems that can arise from using a single machine and a single database*__.

In this article, you'll learn what distributed databases are and their advantages and disadvantages.

## Definition

A distributed database is a collection of multiple interconnected databases that are:

1. spread out across several sites, and
2. connected by a network.

Since the databases are all connected, they __appear as a single database__ to the users.

Distributed databases use multiple nodes. They scale horizontally and form a distributed system. More nodes in the system:

1. provide more computing power,
2. offer greater availability, and
3. resolve the single point of failure issue.

Different parts of the distributed database are stored in several physical locations, and the processing requirements are distributed among processors on multiple database nodes.

A __centralized distributed database management system (DDBMS)__ manages the distributed data as if it were stored in one physical location. 

The DDBMS synchronizes all data operations among the databases and ensures that updates in one database are automatically reflected in the databases at other sites.

## Distributed Database Features
Some general features of distributed databases are:

1. Location independence - Data is physically stored at multiple sites and managed by an independent DDBMS.

2. Distributed query processing - Distributed databases answer queries in a distributed environment that manages data at multiple sites. High-level queries are transformed into a query execution plan for simpler management.

3. Distributed transaction management - Keeps the distributed database consistent through commit protocols, distributed concurrency control techniques, and distributed recovery methods, even with many concurrent transactions and failures.

4. Seamless integration - Databases in a collection usually represent a single logical database, and they are interconnected.

5. Network linking - All databases in a collection are linked by a network and communicate with each other.

6. Transaction processing - Distributed databases incorporate transaction processing. A transaction is a program unit that includes a collection of one or more database operations. It is atomic: it is either executed entirely or not at all.
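
The all-or-nothing behavior described under transaction processing, together with the commit protocols mentioned under distributed transaction management, can be sketched with a simplified two-phase commit. Two-phase commit is one common commit protocol, chosen here for illustration; the source does not name a specific one. This is a minimal in-memory sketch: the `Participant` class, the `run_transaction` function, and the site names are hypothetical, and a real DDBMS would add durable logs, timeouts, and recovery.

```python
class Participant:
    """A hypothetical site that holds part of the distributed database."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # simulates whether this site can apply changes
        self.data = {}                # committed, visible data
        self.staged = None            # changes waiting for the final decision

    def prepare(self, changes):
        # Phase 1: stage the changes and vote yes (True) or no (False).
        if not self.can_commit:
            return False
        self.staged = dict(changes)
        return True

    def commit(self):
        # Phase 2, success path: make the staged changes visible.
        if self.staged is not None:
            self.data.update(self.staged)
        self.staged = None

    def rollback(self):
        # Phase 2, failure path: discard the staged changes.
        self.staged = None


def run_transaction(participants, changes):
    """Apply `changes` on every participant, or on none of them."""
    votes = [p.prepare(changes) for p in participants]
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False


# Every site votes yes, so the change is committed everywhere.
sites = [Participant("site_a"), Participant("site_b")]
ok = run_transaction(sites, {"balance": 100})

# One site votes no, so no site applies the change.
faulty = [Participant("site_c"), Participant("site_d", can_commit=False)]
failed = run_transaction(faulty, {"balance": 50})
```

In the second run, `site_c` had already staged the change, but it is rolled back because `site_d` voted no. That is what keeps the transaction atomic across sites.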

## Distributed Database Types
There are two types of distributed databases:

1. Homogeneous - databases with an identical schema stored in different locations.
2. Heterogeneous - databases with different schemas stored in different locations.

## Distributed Database Storage
Distributed database storage is managed in two ways:

1. Replication
2. Fragmentation

### Replication
In database replication, the systems store copies of data on different sites. If an entire database is available on multiple sites, it is a fully redundant database.

The __advantage of database replication__ is that:

1. it increases data availability on different sites and 
2. allows for parallel query requests to be processed.

However, database replication also has drawbacks:

1. Data requires constant updates and synchronization with other sites to maintain an exact database copy. Any change made on one site must be recorded on the other sites, or else inconsistencies occur.
2. Constant updates cause significant server overhead and complicate concurrency control, because many concurrent queries must be checked at all available sites.
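
The trade-off of replication can be illustrated with a small sketch: every write must reach all sites, while a read can be served by any single site. The `ReplicatedStore` class, its methods, and the site names below are hypothetical and exist only for illustration. The sketch propagates writes synchronously to in-memory dictionaries, which is a simplifying assumption.

```python
class ReplicatedStore:
    """A hypothetical fully redundant store: each site holds a complete copy."""

    def __init__(self, site_names):
        self.sites = {name: {} for name in site_names}

    def write(self, key, value):
        # A write must be recorded on every site to keep the copies identical.
        for replica in self.sites.values():
            replica[key] = value

    def read(self, key, site):
        # A read only needs one site, so different sites can serve reads in parallel.
        return self.sites[site][key]

    def is_consistent(self):
        # The copies are consistent when every site holds exactly the same data.
        replicas = list(self.sites.values())
        return all(replica == replicas[0] for replica in replicas)


store = ReplicatedStore(["paris", "tokyo", "lima"])
store.write("user:1", "Ada")
local_value = store.read("user:1", site="tokyo")

# Simulate an update that reaches only one site: the copies now disagree.
store.sites["paris"]["user:1"] = "Grace"
diverged = not store.is_consistent()
```

The cost of `write` grows with the number of sites, which reflects the synchronization overhead described above.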

### Fragmentation
With fragmentation, the relations are split into smaller parts called fragments. Each fragment is stored at the site where it is required.

The prerequisite for fragmentation is to make sure that the fragments can later be reconstructed into the original relation without losing data.

The advantage of fragmentation is that there are no data copies, which prevents data inconsistency.

There are two types of fragmentation:

1. Horizontal fragmentation - The relation is split into groups of rows (tuples), and each group is assigned to one fragment.
2. Vertical fragmentation - The relation schema is split into smaller schemas, and each fragment contains a common candidate key to guarantee a lossless join.
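
Both fragmentation types, and the requirement that fragments can be reconstructed without losing data, can be sketched on a small in-memory relation. The `employees` data and the helper functions below are hypothetical examples. Horizontal fragments are recombined with a union of rows, and vertical fragments with a join on the shared key `id`.

```python
# A hypothetical relation, represented as a list of rows.
employees = [
    {"id": 1, "name": "Ada", "city": "Paris", "salary": 5000},
    {"id": 2, "name": "Lin", "city": "Tokyo", "salary": 6000},
    {"id": 3, "name": "Sam", "city": "Paris", "salary": 5500},
]


def horizontal_fragments(rows, column):
    """Group rows by the value of `column`; each group is one fragment."""
    fragments = {}
    for row in rows:
        fragments.setdefault(row[column], []).append(row)
    return fragments


def vertical_fragment(rows, key, columns):
    """Keep only `columns`, plus the candidate key needed to rejoin later."""
    return [{key: row[key], **{c: row[c] for c in columns}} for row in rows]


def join_on_key(left, right, key):
    """Rebuild full rows by matching fragments on the shared key."""
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]} for row in left]


# Horizontal: one fragment per city, rebuilt with a union of all rows.
by_city = horizontal_fragments(employees, "city")
rebuilt_rows = sorted(
    (row for fragment in by_city.values() for row in fragment),
    key=lambda row: row["id"],
)

# Vertical: two column subsets that both carry `id`, rebuilt with a join.
personal = vertical_fragment(employees, "id", ["name", "city"])
payroll = vertical_fragment(employees, "id", ["salary"])
rebuilt_columns = join_on_key(personal, payroll, "id")
```

Both `rebuilt_rows` and `rebuilt_columns` equal the original `employees` relation, which is the lossless property that fragmentation requires.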


## Distributed Database Advantages and Disadvantages
Below are some key advantages and disadvantages of distributed databases:

| Advantages | Disadvantages |
|------------|---------------|
| Modular development | Costly software |
| Reliability | Large overhead |
| Lower communication cost | Data integrity |
| Better response | Improper data distribution |

### Advantages

1. Modular Development. Modular development of a distributed database implies that a system can be expanded to new locations or units by adding new servers and data to the existing setup and connecting them to the distributed system. This type of expansion does not interrupt the functioning of the distributed database.
2. Reliability. Distributed databases offer greater reliability in contrast to centralized databases. In case of a database failure in a centralized database, the system comes to a complete stop. In a distributed database, the system functions even when failures occur, only delivering reduced performance until the issue is resolved.
3. Lower Communication Cost. Storing data locally, close to where it is used, reduces communication costs for data manipulation in distributed databases. In a centralized database, all data is stored at the central site, so local storage is not possible for remote users.
4. Better Response. Efficient data distribution in a distributed database system provides a faster response when user requests are met locally. In centralized databases, all user requests pass through the central machine, which processes every request. The result is an increase in response time, especially under a heavy query load.

### Disadvantages

1. Costly Software. Ensuring data transparency and coordination across multiple sites often requires using expensive software in a distributed database system.
2. Large Overhead. Many operations on multiple sites require numerous calculations and, when database replication is used, constant synchronization, which causes a lot of processing overhead.
3. Data Integrity. A possible issue when using database replication is data integrity, which can be compromised when the same data is updated at multiple sites.
4. Improper Data Distribution. Responsiveness to user requests largely depends on proper data distribution. That means responsiveness can be reduced if data is not correctly distributed across multiple sites.