# Week 2 Notes: Parallel Databases and Search

## Part 1: Introduction to Parallel Databases (Volume I)

This section covers the foundational concepts of parallel database processing, including the obstacles to performance and the high-level architectures.

### 🎯 Parallelism Objectives & Obstacles

The main objectives of parallel query processing are to improve performance through **Speed Up** (running a task faster with more resources) and **Scale Up** (handling more data in the same time with more resources).
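
The two measures can be written as simple ratios of elapsed times. The sketch below assumes the usual definitions (speed up compares the same job on one versus many processors; scale up compares a small job on a small system with a proportionally larger job on a larger system). The function names and timings are made up for illustration.

```python
def speed_up(time_one_processor: float, time_n_processors: float) -> float:
    """Same workload, more resources: ratio of the two elapsed times."""
    return time_one_processor / time_n_processors


def scale_up(time_small_on_small: float, time_large_on_large: float) -> float:
    """Workload and resources grow together; 1.0 means linear scale up."""
    return time_small_on_small / time_large_on_large


# Hypothetical timings in seconds.
print(speed_up(100.0, 25.0))   # 4.0 -> linear speed up if 4 processors were used
print(scale_up(100.0, 125.0))  # 0.8 -> below 1.0, so scale up is sub-linear
```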

However, achieving perfect linear performance is hindered by several obstacles:

* **Start-up and Consolidation**: Every parallel task has "serial parts".
    * **Start-up** is the initial cost of initiating all the parallel processes.
    * **Consolidation** is the final cost of collecting and combining the results from all processors.
* **Interference and Communication**:
    * **Interference** occurs when parallel processes compete for shared resources (like a bus or disk).
    * **Communication** introduces overhead, as processes may need to wait for data or signals from other processes.
* **Skew**: This refers to the unevenness or imbalance of the workload among processors. A skewed workload is undesirable because the total job time is limited by the most overloaded processor.
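
A toy cost model makes the effect of these obstacles concrete. It is only a sketch: it assumes start-up and consolidation are purely serial, that the parallel phase ends when the slowest processor finishes, and it ignores interference and communication entirely. All names and numbers are illustrative.

```python
def elapsed_time(startup: float, per_processor_work: list[float], consolidation: float) -> float:
    # Serial start-up, then a parallel phase that lasts as long as the
    # most heavily loaded processor, then serial consolidation.
    return startup + max(per_processor_work) + consolidation


# Both workloads total 100 units of work spread over 4 processors.
balanced = elapsed_time(2.0, [25.0, 25.0, 25.0, 25.0], 3.0)
skewed = elapsed_time(2.0, [55.0, 20.0, 15.0, 10.0], 3.0)

print(balanced)  # 30.0
print(skewed)    # 60.0
```

With the same total work, the skewed run takes twice as long because one processor holds more than half of it.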

### 🌪️ Modeling Skew

Skew is often modeled using the **Zipf distribution**. This model helps estimate the data distribution's unevenness.

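* The formula is: $|R_{i}|=\frac{|R|}{i^{\theta}\times\sum_{j=1}^{N}\frac{1}{j^{\theta}}}$, where $|R|$ is the total number of records, $N$ is the number of processors, and $|R_{i}|$ is the number of records on processor $i$.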
* The parameter **$\theta$ (theta)** denotes the degree of skewness.
    * **$\theta = 0$**: Indicates no skew (a uniform distribution).
    * **$\theta = 1$**: Indicates a highly skewed distribution.
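
As a quick check of the Zipf model, the sketch below evaluates $|R_i| = |R| / (i^{\theta} \times \sum_{j=1}^{N} 1/j^{\theta})$ for both extremes of $\theta$. The function name and the table size of 100,000 records are illustrative assumptions.

```python
def zipf_partition_sizes(total_records: int, n_processors: int, theta: float) -> list[float]:
    # Normalizing term: sum over j = 1..N of 1 / j^theta.
    normalizer = sum(1.0 / (j ** theta) for j in range(1, n_processors + 1))
    # Size of the i-th partition, for i = 1..N.
    return [total_records / ((i ** theta) * normalizer) for i in range(1, n_processors + 1)]


uniform = zipf_partition_sizes(100_000, 4, theta=0.0)
skewed = zipf_partition_sizes(100_000, 4, theta=1.0)

print([round(size) for size in uniform])  # [25000, 25000, 25000, 25000]
print([round(size) for size in skewed])   # [48000, 24000, 16000, 12000]
```

With $\theta = 1$ the first processor holds almost half of the records, so it dominates the total job time.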


### 🏛️ Parallel Database Architectures

There are four main architectures for parallel databases:

1.  **Shared-Memory**: All processors share a common main memory and all disks.
    * **Pro**: Load balancing is relatively easy.
    * **Con**: Suffers from memory and bus contention as more processors are added.
2.  **Shared-Disk**: Each processor has its own private main memory, but all processors share all disks.
3.  **Shared-Nothing**: This is the most scalable architecture. Each processor has its own private main memory *and* its own private disks. Processors communicate over an interconnected network.
    * **Con**: Load balancing is more difficult.
4.  **Shared-Something (Cluster)**: A hybrid model that is common in practice. It consists of multiple "nodes" connected in a shared-nothing network. Each individual node is a shared-memory (SMP) machine.

### ⚡ Forms of Parallelism

Parallelism can be applied at different levels to speed up database processing:

* **Interquery Parallelism**: Different queries are executed in parallel. This is the primary way to scale up online transaction processing (OLTP) systems.
* **Intraquery Parallelism**: A single, complex query is broken down and its parts are executed in parallel. This is used to speed up long-running queries, and it can be done in two ways:
    * **Intraoperation Parallelism**: A single operation (like a sort or a join) is parallelized by partitioning the data.
    * **Interoperation Parallelism**: *Different* operations within the same query are executed concurrently. This includes:
        * **Pipeline Parallelism**: The output of one operation is immediately fed as input to the next, like an assembly line.
        * **Independent Parallelism**: Operations that do not depend on each other are executed at the same time.

In practice, a query will use a **Mixed Parallelism** approach, combining all these forms.

---

## Part 2: Parallel Search (Volume II)

This section applies parallel processing concepts specifically to search (selection) operations.

### ❓ Types of Search Queries

A search query is a "selection" operation that retrieves a horizontal subset (records) from a table.

* **Exact-Match Search**: Uses an exact value, like `WHERE Sid = 23`.
* **Range Search**: Covers a range of values.
    * **Continuous**: `WHERE Sgpa > 3.50`.
    * **Discrete**: `WHERE Sdegree IN ('BCS', 'BInfSys')`.
* **Multiattribute Search**: Involves more than one attribute, using `AND` or `OR`.
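
The query types above can be mirrored as filters over an in-memory list of records. This is only an illustration: the three student rows are invented, and the column names follow the `WHERE` examples.

```python
# Hypothetical sample rows; only the column names come from the notes.
students = [
    {"Sid": 23, "Sgpa": 3.8, "Sdegree": "BCS"},
    {"Sid": 41, "Sgpa": 3.2, "Sdegree": "BInfSys"},
    {"Sid": 57, "Sgpa": 3.6, "Sdegree": "BEng"},
]

# Exact-match search: WHERE Sid = 23
exact = [s["Sid"] for s in students if s["Sid"] == 23]

# Continuous range search: WHERE Sgpa > 3.50
continuous = [s["Sid"] for s in students if s["Sgpa"] > 3.50]

# Discrete range search: WHERE Sdegree IN ('BCS', 'BInfSys')
discrete = [s["Sid"] for s in students if s["Sdegree"] in ("BCS", "BInfSys")]

# Multiattribute search: WHERE Sgpa > 3.50 AND Sdegree = 'BCS'
multi = [s["Sid"] for s in students if s["Sgpa"] > 3.50 and s["Sdegree"] == "BCS"]

print(exact, continuous, discrete, multi)  # [23] [23, 57] [23, 41] [23]
```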

### 🗂️ Data Partitioning Strategies

Data partitioning is the act of distributing data across multiple processing elements to enable parallelism.

#### Basic Partitioning Methods
First, data can be partitioned **vertically** (splitting by columns/attributes) or **horizontally** (splitting by rows/records). Horizontal partitioning is more common for parallel databases.

Key horizontal partitioning methods include:

* **Round-Robin**: Each record is allocated to the next processor in turn.
    * **Pros**: Guarantees even data distribution and perfect load balance.
    * **Cons**: Data is not grouped semantically. An exact-match query must run on *all* processors, as there's no way to know where the data is.
* **Hash Partitioning**: A hash function is applied to an attribute to determine which processor stores the record.
    * **Pros**: Data is grouped semantically. Very efficient for exact-match searches, as the query can be sent to exactly *one* processor.
    * **Cons**: Can cause data skew. Inefficient for range searches, as the hash values are not ordered, requiring *all* processors to be activated.
* **Range Partitioning**: Records are distributed based on a range of values for an attribute (e.g., A-C to P1, D-G to P2).
    * **Pros**: Excellent for range searches, as the query can be localized to only the "selected" processors that hold that range.
    * **Cons**: Can easily result in data skew if data is not uniformly distributed.
* **Random-Unequal Partitioning**: The partitioning method is unknown or based on a non-retrieval attribute. This is common for temporary data that results from a previous operation.
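
The three deterministic methods can be sketched in a few lines each. This is a minimal illustration, assuming integer keys, a plain modulo as the hash function, and range boundaries given as a sorted list; none of these choices come from the notes.

```python
from bisect import bisect_right


def round_robin_partition(keys, n):
    parts = [[] for _ in range(n)]
    for position, key in enumerate(keys):
        parts[position % n].append(key)  # next processor in turn
    return parts


def hash_partition(keys, n):
    parts = [[] for _ in range(n)]
    for key in keys:
        parts[key % n].append(key)  # assumed hash function: key modulo n
    return parts


def range_partition(keys, boundaries):
    # boundaries [20, 40] means: P0 gets key < 20, P1 gets 20 <= key < 40, P2 gets the rest.
    parts = [[] for _ in range(len(boundaries) + 1)]
    for key in keys:
        parts[bisect_right(boundaries, key)].append(key)
    return parts


ids = [3, 18, 20, 25, 41, 44, 47, 50]
print(round_robin_partition(ids, 3))    # [[3, 25, 47], [18, 41, 50], [20, 44]]
print(hash_partition(ids, 3))           # [[3, 18], [25], [20, 41, 44, 47, 50]]
print(range_partition(ids, [20, 40]))   # [[3, 18], [20, 25], [41, 44, 47, 50]]
```

On this sample, round-robin is balanced but scatters neighboring keys, while hash and range partitioning both end up skewed.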

#### Complex Partitioning Methods
These methods are based on multiple attributes or combine basic methods.

* **HRPS (Hybrid-Range Partitioning Strategy)**: Combines range and round-robin. It first divides the data into many small *range* fragments, then distributes those fragments in a *round-robin* fashion. This strategy provides the range localization benefits of range partitioning while also achieving the load-balancing properties of round-robin.
* **MAGIC (Multiattribute Grid Declustering)**: Partitions data based on multiple attributes. It creates a grid where each attribute is an axis, and each cell in the grid is assigned to a processor. This allows queries on *any* of the partitioning attributes to be localized to a subset of processors.
* **BERD (Bubba's Extended Range Declustering)**: A two-level multiattribute method. It first applies range partitioning on a *primary* attribute. It then creates an "auxiliary" table based on a *secondary* attribute, which is also range partitioned.
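
The two HRPS steps (range fragments, then round-robin placement) can be sketched as follows. The fixed `fragment_size` parameter is an assumption made for illustration. The actual strategy derives the fragment cardinality itself, which this sketch does not attempt. The helper `processors_for_range` is also hypothetical and simply reports which processors hold a fragment overlapping a given range.

```python
def hrps_partition(keys, n_processors, fragment_size):
    ordered = sorted(keys)
    # Step 1: cut the ordered data into small range fragments.
    fragments = [ordered[i:i + fragment_size] for i in range(0, len(ordered), fragment_size)]
    # Step 2: hand the fragments out to processors in round-robin order.
    placement = [[] for _ in range(n_processors)]
    for index, fragment in enumerate(fragments):
        placement[index % n_processors].append(fragment)
    return placement


def processors_for_range(placement, low, high):
    # A processor is involved if any of its fragments overlaps [low, high].
    return [
        processor
        for processor, fragments in enumerate(placement)
        if any(fragment[0] <= high and fragment[-1] >= low for fragment in fragments)
    ]


placement = hrps_partition(range(1, 13), n_processors=3, fragment_size=2)
print(placement[0])                           # [[1, 2], [7, 8]]
print(processors_for_range(placement, 3, 6))  # [1, 2]
```

Neighboring fragments land on different processors, so a range that spans several fragments is served by several processors at once.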

### 🏃 Parallel Search Algorithms

A parallel search algorithm has three main components:

1.  **Processor Activation (or Involvement)**
    This determines how many processors need to be activated for a query. It depends entirely on the **partitioning method** and the **query type**.
    * **Example**: An *exact-match* query on *hash-partitioned* data only needs **1** processor. The same query on *round-robin* data needs **All** processors.
    * **Example**: A *continuous range* query on *range-partitioned* data only needs **Selected** processors. The same query on *hash-partitioned* data needs **All** processors.

2.  **Local Searching Method**
    This is the search algorithm used *within* each activated processor.
    * If the local data is **Ordered**, use **Binary Search**.
    * If the local data is **Unordered**, use **Linear Search**.

3.  **Key Comparison**
    This determines whether to stop searching after finding a match.
    * You can **Stop** after the first match *only if* the query is an **Exact Match** and the attribute's values are **Unique**.
    * In all other cases (range queries, or if duplicates are possible), you must **Continue** searching to find all possible matches.
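
The processor activation examples above amount to a lookup on the pair (partitioning method, query type). The sketch below encodes only the four combinations stated in these notes and assumes the query is on the partitioning attribute. The fallback to `"all"` is an assumption: activating every processor is always correct, just not always minimal.

```python
# (partitioning method, query type) -> processors that must be activated
ACTIVATION = {
    ("hash", "exact_match"): "one",
    ("round_robin", "exact_match"): "all",
    ("range", "continuous_range"): "selected",
    ("hash", "continuous_range"): "all",
}


def processors_to_activate(partitioning: str, query_type: str) -> str:
    # Assumed fallback: when the combination is not listed, search everywhere.
    return ACTIVATION.get((partitioning, query_type), "all")


print(processors_to_activate("hash", "exact_match"))        # one
print(processors_to_activate("range", "continuous_range"))  # selected
print(processors_to_activate("random_unequal", "exact_match"))  # all
```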

# Parallel Computing
Obstacles causing sub-linear speedups:  
- Startup and consolidation costs
    - Startup: Initiation of multiple processes
    - Consolidation: Cost of the host processor collecting the results obtained by each processor
- Interference and communication
    - Interference: Competing to access shared resources
    - Communication: One process communicating with other processes, and often one has to 
    wait for others to be ready for communication (i.e. waiting time) (**bottleneck**)
- Skew
    - Unevenness of workload: requires load balancing
    - Measure of skew: 
$$
|R_i| = \frac{|R|}{i^{\theta} \times \sum_{j=1}^{N} \frac{1}{j^{\theta}}}
\quad \text{where } 0 \leq \theta \leq 1
$$



# Forms of Parallelism
Forms of parallelism for database processing:  
- Interquery parallelism  
- Intraquery parallelism  
- Interoperation parallelism  
- Intraoperation parallelism  
- Mixed parallelism  


# Parallel Database Architectures
Parallel computers are no longer the exclusive domain of supercomputers.  
Parallel computers are available in many forms:  
- Shared-memory architecture
- Shared-disk architecture
- Shared-nothing architecture
- Shared-something architecture


# Parallel Search
## Search Queries
3 kinds of search queries:  
- Exact-match search  
- Range search  
- Multiattribute search  

## Data Partitioning
- Distributes data over a number of processing elements
- The processing elements then execute in parallel, each on its own partition
- Can be physical or logical data partitioning

### Basic Data Partitioning
- Vertical vs horizontal data partitioning  
    - Vertical: splits a table by columns (attributes) across processors  
        - Used in distributed database systems  
    - Horizontal: each processor holds a partial number of complete records  
        - Used in parallel relational database systems  
- Round robin data partitioning  
    - Sequential equal partitioning  
    - Even distribution of items, but data is not grouped semantically  
- Hash data partitioning  
    - A hash function (math formula) partitions the data  
    - Data grouped semantically, easy for exact match search but not range search  
    - Initial data allocation may be skewed  
- Range data partitioning   
    - Spread the records based on given range of partitioning attribute e.g. based on gpa  
    - Initial data allocation skewed  
- Random unequal data partitioning   
    - The partitioning method is unknown, or is based on a non-retrieval attribute, so partition sizes are unequal  

### Complex Data Partitioning
- Partitioning done based on multiple attributes or single attribute but multiple partitioning methods  
- Single attribute with multiple partitioning methods:  
    - Hybrid-Range Partitioning Strategy (HRPS)  
        - Partition to fragments using range, then distribute fragments by round robin  
        - Cannot localize a range query search  

        - Support for small tables: if the number of fragments of a table is less than the number of processors, the table is automatically partitioned across a subset of the processors.  
        - Support for tables with nonuniform distributions of the partitioning attribute values: because the cardinality of each fragment is not based on the partitioning attribute value, once HRPS determines the cardinality of each fragment, it partitions the table based on that cardinality.  

- Multiple attributes:  
    - Multiattribute Grid Declustering (MAGIC)  
    - Bubba's Extended Range Declustering (BERD)  
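
The grid idea behind MAGIC can be sketched with two partitioning attributes. Everything concrete here is an assumption made for illustration: the boundary values, the two attributes (an id and a GPA), and in particular the rule that numbers cells row by row and wraps them onto processors. The real method chooses the cell-to-processor assignment more carefully.

```python
from bisect import bisect_right


def grid_cell(value_a, value_b, bounds_a, bounds_b):
    """Map a record's two attribute values to a (row, column) grid cell."""
    return bisect_right(bounds_a, value_a), bisect_right(bounds_b, value_b)


def cell_processor(row, col, n_cols, n_processors):
    # Assumed assignment rule: number the cells row by row and wrap around.
    return (row * n_cols + col) % n_processors


def processors_for_attribute_a(value_a, bounds_a, bounds_b, n_processors):
    """Processors holding any cell in the grid row that contains value_a."""
    row = bisect_right(bounds_a, value_a)
    n_cols = len(bounds_b) + 1
    return sorted({cell_processor(row, col, n_cols, n_processors) for col in range(n_cols)})


bounds_id = [100, 200]   # 3 rows on the first attribute
bounds_gpa = [2.5, 3.5]  # 3 columns on the second attribute

print(grid_cell(150, 3.8, bounds_id, bounds_gpa))                              # (1, 2)
print(processors_for_attribute_a(150, bounds_id, bounds_gpa, n_processors=9))  # [3, 4, 5]
```

An exact-match query on the first attribute alone touches one grid row, here 3 of the 9 processors. A query on the second attribute alone would touch one grid column in the same way.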


## Search Algorithms
Serial search algorithms:  
- Linear search  
- Binary search  
 
Parallel search algorithms:  
- Processor activation or involvement  
- Local searching method (linear or binary)  
- Key comparison  
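
The local searching method and key comparison steps can be combined in one routine for an exact-match query on a single processor's data. This is a sketch under stated assumptions: the data is a plain Python list of keys, `ordered` and `unique` are flags the caller knows to be true of that data, and the function name is made up.

```python
from bisect import bisect_left


def local_search(values, target, ordered, unique):
    """Return the positions of all matches for an exact-match query."""
    matches = []
    if ordered:
        # Binary search finds the first candidate; duplicates sit right after it.
        i = bisect_left(values, target)
        while i < len(values) and values[i] == target:
            matches.append(i)
            if unique:
                break  # unique attribute: stop after the first match
            i += 1
        return matches
    # Unordered data: linear search over every element.
    for i, value in enumerate(values):
        if value == target:
            matches.append(i)
            if unique:
                break  # unique attribute: stop after the first match
    return matches


print(local_search([3, 7, 7, 9], 7, ordered=True, unique=False))   # [1, 2]
print(local_search([9, 7, 3, 7], 7, ordered=False, unique=False))  # [1, 3]
print(local_search([9, 7, 3, 5], 3, ordered=False, unique=True))   # [2]
```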

`SparkSession` is a higher-level entry point that wraps `SparkContext`.  
Spark SQL is Spark's module for structured data processing.