# Big Data Lecture Notes: Stream Join Processing

## Part 1: Introduction to Stream Joins

Joining data is a fundamental operation, but it becomes much more complex when dealing with continuous data streams. A **stream join** is the process of combining data from two or more streams based on a common attribute.

### Types of Stream Joins

We can categorize stream joins based on whether the streams are finite or infinite:

* **Bounded Stream Join**: This occurs when all streams in the join have a defined start and end. Since the total amount of data is finite, this process is similar to a traditional database join.
* **Unbounded Stream Join**: This occurs when at least one of the streams is infinite (it has a start but no end). This is the more challenging scenario.

#### Challenges of Unbounded Joins
1.  **How do you join infinite streams?** Since you can't store an entire infinite stream, you can't perform a traditional join. You need a way to limit the scope of the join.
2.  **How do you handle late or out-of-order data?** Due to network latency, a tuple from one stream might arrive long after its matching tuple from another stream has already been processed and discarded.

---

## Part 2: Joining Unbounded Streams

To handle infinite data, we perform joins on a limited portion of the stream using **windows**. The join is only applied to the tuples currently inside the window for each stream.


### The Problem with Simple Hash Joins

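A standard hash join builds a hash table from one dataset (S) and then probes it with tuples from the other dataset (R). This assumes that S is fully available before probing begins, which does not hold for streams, where tuples from both inputs arrive interleaved over time.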

* **Question**: What's the problem with this approach for streams?
    * **Answer**: It's all about timing. If a tuple `r` from stream R arrives *before* its matching tuple `s` from stream S, the join will be missed. When `r` probes the hash table for S, `s` won't be there yet.
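The timing problem can be reproduced with a small sketch. The function name `one_sided_hash_join` and the `(stream, key, payload)` tuple layout are assumptions made for illustration; they are not part of the lecture material.

```python
from collections import defaultdict

def one_sided_hash_join(arrivals):
    """Naive streaming hash join: only S is indexed, R only probes.

    `arrivals` is a list of (stream, key, payload) in arrival order
    (a hypothetical layout chosen for this sketch).
    """
    table_s = defaultdict(list)
    results = []
    for stream, key, payload in arrivals:
        if stream == "S":
            table_s[key].append(payload)            # build side
        else:
            for s_payload in table_s.get(key, []):  # probe side
                results.append((key, payload, s_payload))
    return results

# s arrives first: the match is found.
print(one_sided_hash_join([("S", 1, "s1"), ("R", 1, "r1")]))  # [(1, 'r1', 's1')]
# r arrives first: the probe finds nothing and the match is lost.
print(one_sided_hash_join([("R", 1, "r1"), ("S", 1, "s1")]))  # []
```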

### Symmetric Hash Join: The Solution

To solve this timing issue, we use a **Symmetric Hash Join**. The key idea is to maintain **two hash tables**, one for each stream (e.g., Hash Table R and Hash Table S).

The process for a newly arriving tuple `r` from Stream R is as follows:
1.  **Probe**: `r` is checked against the *other* stream's hash table (Hash Table S). If a match is found, a join result is produced.
2.  **Hash**: `r` is then inserted into its *own* hash table (Hash Table R), making it available for future tuples arriving on Stream S to find.

The same process happens in reverse when a tuple `s` arrives on Stream S. This symmetry ensures that a join can be made regardless of which matching tuple arrives first.
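The probe-then-insert steps can be sketched as follows. This is a minimal illustration without windows, assuming an equi-join on a single key; the class name `SymmetricHashJoin` and its `insert` method are hypothetical.

```python
from collections import defaultdict

class SymmetricHashJoin:
    """Sketch of a two-stream symmetric hash join (no expiry)."""

    def __init__(self):
        # One hash table per stream, keyed by the join attribute.
        self.tables = {"R": defaultdict(list), "S": defaultdict(list)}

    def insert(self, stream, key, payload):
        other = "S" if stream == "R" else "R"
        # Step 1 (probe): look up the key in the OTHER stream's table.
        matches = self.tables[other].get(key, [])
        if stream == "R":
            results = [(key, payload, m) for m in matches]
        else:
            results = [(key, m, payload) for m in matches]
        # Step 2 (hash): store the tuple in its OWN table for later arrivals.
        self.tables[stream][key].append(payload)
        return results

join = SymmetricHashJoin()
print(join.insert("R", 1, "r1"))  # []  (nothing in S yet, but r1 is stored)
print(join.insert("S", 1, "s1"))  # [(1, 'r1', 's1')]
print(join.insert("R", 1, "r2"))  # [(1, 'r2', 's1')]
```

Results are always reported as `(key, r_payload, s_payload)`, whichever side arrived last. That ordering is a design choice of this sketch, not something the algorithm requires.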

### Window Sliding Mechanisms for Joins

In a time-based window join, the window must "slide" forward. This can be triggered in two ways:

1.  **Tuple Slide**: The window moves forward **every time a new tuple arrives**. When a new tuple `r` arrives, any tuples that are now outside the new time window are expired and removed from their respective hash tables.
2.  **Time Slide**: The window moves forward based on a **fixed time interval** (e.g., every 10 seconds). This is often managed by breaking the window into smaller "basic windows" and sliding by one or more of these basic units at each interval.
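A tuple slide can be sketched by expiring old entries on every arrival, before probing. The sketch below assumes that tuples arrive in non-decreasing timestamp order and that the window covers the interval `(now - window_size, now]`; the class name `TupleSlideJoin` and both of those conventions are assumptions, not definitions from the lecture.

```python
from collections import defaultdict, deque

class TupleSlideJoin:
    """Sketch: symmetric hash join with a time window that slides per tuple."""

    def __init__(self, window_size):
        self.window_size = window_size
        # Hash tables: key -> deque of (timestamp, payload), oldest first.
        self.tables = {"R": defaultdict(deque), "S": defaultdict(deque)}
        # Arrival order per stream, used to find expired tuples cheaply.
        self.order = {"R": deque(), "S": deque()}

    def _expire(self, now):
        cutoff = now - self.window_size
        for name in ("R", "S"):
            order, table = self.order[name], self.tables[name]
            while order and order[0][0] <= cutoff:
                _, key = order.popleft()
                table[key].popleft()  # oldest entry for this key has expired
                if not table[key]:
                    del table[key]

    def insert(self, stream, timestamp, key, payload):
        self._expire(timestamp)  # the slide: triggered by this arrival
        other = "S" if stream == "R" else "R"
        matches = [p for _, p in self.tables[other].get(key, ())]
        self.tables[stream][key].append((timestamp, payload))
        self.order[stream].append((timestamp, key))
        if stream == "R":
            return [(key, payload, m) for m in matches]
        return [(key, m, payload) for m in matches]

join = TupleSlideJoin(window_size=10)
print(join.insert("R", 0, 1, "r1"))   # []
print(join.insert("S", 5, 1, "s1"))   # [(1, 'r1', 's1')]
print(join.insert("S", 12, 1, "s2"))  # []  (r1 expired: 0 <= 12 - 10)
print(join.insert("R", 14, 1, "r2"))  # [(1, 'r2', 's1'), (1, 'r2', 's2')]
```

For a time slide, the same `_expire` step would be driven by a periodic timer instead of by each arrival. That variant is not shown here.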

#### M-Join: Joining More Than Two Streams
The **M-Join** is a multi-way stream join algorithm based on the Symmetric Hash Join. It's used when you need to join three or more streams (e.g., R, S, and T).
* Each stream maintains its own hash table (Hash Table R, S, T).
* When a new tuple `r` arrives on Stream R, it is probed against the hash tables of **all other streams** (S and T) before being inserted into its own hash table (R).
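The multi-way probing can be sketched as below. It assumes that all streams join on one shared key and that a result is emitted only when every other stream holds at least one match. The class name `MJoin` and the use of `itertools.product` to combine matches are choices made for this illustration; windows and expiry are left out.

```python
from collections import defaultdict
from itertools import product

class MJoin:
    """Sketch of a multi-way symmetric hash join on a single shared key."""

    def __init__(self, stream_names):
        self.names = list(stream_names)
        self.tables = {name: defaultdict(list) for name in self.names}

    def insert(self, stream, key, payload):
        # Probe every OTHER stream's table; the new tuple stands in for its own.
        per_stream = []
        for name in self.names:
            if name == stream:
                per_stream.append([payload])
            else:
                per_stream.append(self.tables[name].get(key, []))
        # product() yields nothing if any other stream has no match yet.
        results = [(key,) + combo for combo in product(*per_stream)]
        # Only after probing is the tuple inserted into its own table.
        self.tables[stream][key].append(payload)
        return results

mjoin = MJoin(["R", "S", "T"])
print(mjoin.insert("R", 1, "r1"))  # []
print(mjoin.insert("S", 1, "s1"))  # []  (T has no match yet)
print(mjoin.insert("T", 1, "t1"))  # [(1, 'r1', 's1', 't1')]
print(mjoin.insert("S", 1, "s2"))  # [(1, 'r1', 's2', 't1')]
```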

---

## Part 3: Other Join Types

### Challenges of Tuple-Based Window Joins

While time-based windows are common, **tuple-based (count-based) windows** are rarely used for joins due to several unresolved problems:
* **Inconsistent Windows**: If Stream R's window is defined as "100 tuples," what happens if Stream S has a much slower arrival rate? By the time 100 tuples arrive for R, only a few might have arrived for S. This means you'd be joining new data from R with very old data from S.
* **Ambiguous Semantics**: It's unclear how to define a meaningful join across streams with different velocities using a fixed number of tuples. Because of this ambiguity, tuple-based window joins are not a well-researched or practical approach.
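The mismatch can be made concrete with a small simulation. The arrival periods (one R tuple per second, one S tuple every 50 seconds) and the function name `window_time_spans` are hypothetical values picked only to show the effect.

```python
from collections import deque

def window_time_spans(period_r, period_s, count, duration):
    """Return the time span covered by each stream's count-based window.

    A tuple arrives on R every `period_r` seconds and on S every
    `period_s` seconds; each window keeps the last `count` tuples.
    """
    win_r = deque(maxlen=count)  # deque drops the oldest entry when full
    win_s = deque(maxlen=count)
    for t in range(duration + 1):
        if t % period_r == 0:
            win_r.append(t)
        if t % period_s == 0:
            win_s.append(t)
    return (win_r[-1] - win_r[0], win_s[-1] - win_s[0])

# Both windows hold 100 tuples, but they cover very different time ranges.
print(window_time_spans(period_r=1, period_s=50, count=100, duration=10_000))
# (99, 4950)
```

Under these assumed rates, R's window spans 99 seconds while S's window spans 4950 seconds, so fresh R tuples are joined against S tuples that are over an hour old.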

### Bounded Stream Joins (Pipelining Join)

This is the simplest case, used when both streams have a defined end.
* **No Window Needed**: Since the datasets are finite, there's no need to expire old tuples.
* **Process**: A Symmetric Hash Join can be used, but with a key difference: tuples are **never removed** from the hash tables. The hash tables grow until they contain all tuples from both streams, at which point the streams end and the join is complete.
* This is also known as a **Pipelining Join** because it processes the data as it flows through the "pipeline" without needing to wait for the streams to end.
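The pipelined behaviour can be sketched with a Python generator, which emits each result as soon as it is found instead of waiting for the input to end. The function name `pipelining_join` and the `(stream, key, payload)` layout are assumptions made for this sketch.

```python
from collections import defaultdict

def pipelining_join(arrivals):
    """Sketch: symmetric hash join over bounded streams, with no expiry.

    `arrivals` is any iterable of (stream, key, payload) in arrival order.
    Results are yielded one at a time, as soon as a match is found.
    """
    tables = {"R": defaultdict(list), "S": defaultdict(list)}
    for stream, key, payload in arrivals:
        other = "S" if stream == "R" else "R"
        for match in tables[other].get(key, []):
            yield (key, payload, match) if stream == "R" else (key, match, payload)
        tables[stream][key].append(payload)  # never removed

arrivals = [("R", 1, "r1"), ("S", 1, "s1"), ("S", 2, "s2"),
            ("R", 2, "r2"), ("R", 1, "r3")]
for result in pipelining_join(arrivals):
    print(result)
# (1, 'r1', 's1')
# (2, 'r2', 's2')
# (1, 'r3', 's1')
```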

### Week 10 - Stream Join Processing

* **Bounded Stream Join**: There is a start and an end.
* **Unbounded Stream Join**: There is a start but no end.
* Due to network latency, some data arrives late, so tuples that match each other might be missed by a join operation.
* Use two symmetric hash tables to avoid missing the join of any incoming tuple.