# FIT5202: Data Processing for Big Data (Revision Notes)

This unit is divided into three main pillars: Volume, Complexity, and Velocity.

---

## 1. Volume: Processing Large-Scale Data (Sessions 1-4)

This section focuses on the strategies and challenges of processing massive, stored datasets using parallel systems.

### 📈 Key Performance Metrics

* **Speed Up**: Measures how much faster a task runs on a multiprocessor system. The goal is to run a given task in less time by adding more processors.
    * **Formula**: $Speed~Up = \frac{Elapsed~Time~(Uniprocessor)}{Elapsed~Time~(Multiprocessors)}$ 
    * **Linear Speed Up**: Performance scales perfectly with resources (e.g., 4 processors = 4x faster).
    * **Sub-Linear Speed Up**: Performance gain is *less* than the resources added. This is the most common outcome.
    * **Super-Linear Speed Up**: Performance gain is *more* than the resources added (rare).

* **Scale Up**: Measures the ability to handle a larger task in the *same amount of time* by proportionally increasing resources.
    * **Formula**: $Scale~Up = \frac{Elapsed~Time~(Small~System,~Small~Data)}{Elapsed~Time~(Large~System,~Large~Data)}$ 
    * **Linear Scale Up**: Achieved if Scale Up = 1. This means you can double the data, double the resources, and the time stays the same.

### 🚧 Obstacles to Parallelism

Achieving perfect linear speed up is difficult due to several overheads:

* **Start-up & Consolidation**: The cost of initiating all parallel processes and the cost of collecting the final results from all processors.
* **Interference & Communication**: Processors compete for shared resources (interference) or must wait for other processors to be ready (communication).

#### Example: Calculating Sub-Linear Speed Up
A job takes 1 hour (60 min) on 1 processor. The job has a 10% serial part and a 90% parallel part. We use 4 processors, and the parallel part incurs a 20% overhead (e.g., waiting time).

1.  **Serial Time**: 60 min * 10% = **6 min** (This cannot be parallelized).
2.  **Parallel Time (Ideal)**: 60 min * 90% = 54 min. With 4 processors: 54 / 4 = **13.5 min**.
3.  **Parallel Time (Actual)**: Add 20% overhead: 13.5 min * 1.20 = **16.2 min**.
4.  **Total Time**: 6 min (Serial) + 16.2 min (Parallel) = **22.2 min**.
5.  **Speed Up**: 60 min / 22.2 min ≈ **2.7**.
6.  **Result**: Since 2.7 is less than the 4 processors used, this is **Sub-Linear Speed Up**.
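
The arithmetic above can be checked with a short Python sketch. The function name `speed_up` and its parameters are illustrative and not part of the unit material; the sketch assumes the overhead applies only to the parallel portion, as in the worked example.

```python
def speed_up(uniprocessor_minutes, serial_fraction, processors, overhead):
    """Return (elapsed_minutes, speed_up) for a job with a fixed serial part."""
    # The serial part cannot be parallelized, so it is paid in full.
    serial_part = uniprocessor_minutes * serial_fraction
    # The parallel part is divided evenly, then inflated by the overhead.
    ideal_parallel = uniprocessor_minutes * (1 - serial_fraction) / processors
    actual_parallel = ideal_parallel * (1 + overhead)
    elapsed = serial_part + actual_parallel
    return elapsed, uniprocessor_minutes / elapsed


elapsed, ratio = speed_up(60, 0.10, 4, 0.20)
print(f"elapsed = {elapsed:.1f} min, speed up = {ratio:.2f}")
print("sub-linear" if ratio < 4 else "linear or better")
```

Setting `overhead=0` and `serial_fraction=0` in this sketch gives a speed up equal to the number of processors, which is the linear case.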

### 🌪️ Data Skew

Skew is the uneven distribution of data or processing time across processors. It is a major obstacle because the total job time is determined by the single most-overloaded processor.

* This is often modeled by the **Zipf distribution**.
* A skew degree of $\theta=0$ is a perfectly uniform distribution (no skew), while $\theta=1$ is highly skewed.
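
To see what the skew degree does to partition sizes, here is a minimal sketch using one common formulation of the Zipf model, $|R_i| = \frac{|R|}{i^{\theta} \sum_{j=1}^{N} 1/j^{\theta}}$. The function name and the example numbers are assumptions made for illustration.

```python
def zipf_partition_sizes(total_records, processors, theta):
    """Expected records per processor under a Zipf skew of degree theta."""
    # Normalizing constant so that the partition sizes sum to total_records.
    norm = sum(1 / j ** theta for j in range(1, processors + 1))
    return [total_records / (i ** theta * norm) for i in range(1, processors + 1)]


uniform = zipf_partition_sizes(100_000, 4, theta=0)
skewed = zipf_partition_sizes(100_000, 4, theta=1)
print([round(x) for x in uniform])
print([round(x) for x in skewed])
```

With $\theta=1$ the first processor holds almost half of the records, so it alone sets the elapsed time of the job.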

### 🔍 Parallel Search

Parallel search involves two steps: partitioning the data and then searching it.

**1. Data Partitioning Methods** 
* **Round-Robin**: Distributes records evenly. Pro: Good load balancing. Con: No semantic grouping (a query might need all processors).
* **Hash Partitioning**: Groups data using a hash function. Pro: Good for exact match queries (only 1 processor needed). Con: Can cause data skew.
* **Range Partitioning**: Groups data based on a range of values. Pro: Efficient for range queries (only selected processors needed). Con: Can easily cause data skew.
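
The three methods can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, keys are assumed to be integers, and the hash function is assumed to be a simple modulo.

```python
from bisect import bisect_right


def round_robin_partition(records, n):
    """Deal records to n processors in turn, ignoring their values."""
    parts = [[] for _ in range(n)]
    for i, rec in enumerate(records):
        parts[i % n].append(rec)
    return parts


def hash_partition(records, n):
    """Place each record on processor (key mod n); equal keys land together."""
    parts = [[] for _ in range(n)]
    for rec in records:
        parts[rec % n].append(rec)
    return parts


def range_partition(records, boundaries):
    """boundaries are sorted lower bounds of partitions 1..n-1."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for rec in records:
        parts[bisect_right(boundaries, rec)].append(rec)
    return parts


records = [3, 12, 25, 7, 18, 30, 1, 14]
print(round_robin_partition(records, 3))
print(hash_partition(records, 3))
print(range_partition(records, [10, 20]))
```

On this sample the round-robin partitions differ in size by at most one record, while the modulo hash puts four records on one processor and one on another, which is the skew risk noted above.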

**2. Parallel Search Algorithm Components** 
* **Processor Activation**: How many processors to use. This depends on the partitioning and query type (e.g., an exact-match query on hash-partitioned data only needs 1 processor).
* **Local Search Method**: The algorithm used on each processor. Use **Binary Search** for ordered data, **Linear Search** for unordered data.
* **Key Comparison**: When to stop searching. Stop at the first match only if the query is an **Exact Match** and the attribute values are **Unique**; otherwise, continue to the end of the partition.
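
The local search and key comparison rules can be illustrated with a small sketch. The function names and the comparison counter are assumptions added for illustration, and each list stands in for one processor's partition.

```python
from bisect import bisect_left


def local_linear_search(partition, target, unique):
    """Scan an unordered partition; stop early only if values are unique."""
    matches, comparisons = [], 0
    for value in partition:
        comparisons += 1
        if value == target:
            matches.append(value)
            if unique:
                break  # exact match on a unique attribute: nothing more to find
    return matches, comparisons


def local_binary_search(sorted_partition, target):
    """Binary search on an ordered partition; returns True if target exists."""
    i = bisect_left(sorted_partition, target)
    return i < len(sorted_partition) and sorted_partition[i] == target


print(local_linear_search([5, 9, 2, 9, 7], 9, unique=False))
print(local_linear_search([5, 9, 2, 7], 9, unique=True))
print(local_binary_search([2, 5, 7, 9], 7))
```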

### ⛓️ Parallel Join

Parallel joins also consist of a data partitioning phase and a local join phase.

* **Partitioning (Divide & Broadcast)**: The *larger* table is divided (partitioned) among the processors. The *smaller* table is broadcast (replicated) to *all* processors.
* **Local Join**: Each processor joins its partition of the large table with its full copy of the small table. The most common local join is the **Hash Join** (build a hash table with one table, probe it with the other).
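
A minimal single-machine sketch of divide and broadcast with a local hash join follows. Each loop iteration stands in for one processor; the function names, the `(key, value)` tuple layout, and the round-robin division of the large table are assumptions for illustration.

```python
def hash_join(small, large_partition):
    """Build a hash table on the small table, then probe with the large one."""
    table = {}
    for key, value in small:
        table.setdefault(key, []).append(value)
    return [
        (key, s_val, l_val)
        for key, l_val in large_partition
        for s_val in table.get(key, [])
    ]


def divide_and_broadcast_join(small, large, n):
    """Divide the large table into n parts; every part sees the full small table."""
    partitions = [large[i::n] for i in range(n)]  # divide
    results = []
    for part in partitions:  # one iteration per simulated processor
        results.extend(hash_join(small, part))  # broadcast copy of `small`
    return results


small = [(1, "a"), (2, "b")]
large = [(1, "x"), (2, "y"), (3, "z"), (1, "w")]
joined = divide_and_broadcast_join(small, large, n=2)
print(sorted(joined))
```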

**Parallel Outer Joins** 
* **ROJA (Redistribution)**: Reshuffles both tables based on the join attribute, then performs a local outer join.
* **DOJA (Duplication)**: Duplicates the small table, performs a local *inner* join, then redistributes the results to perform the outer join logic.
* **DER (Duplication & Efficient Redistribution)**: Duplicates the left table, performs a local inner join, but then *only* redistributes the ROW IDs of unmatched tuples (more efficient).

### 📊 Parallel Sort & GroupBy

**Parallel External Sort**
Used when data is too large to fit in memory.

* **Serial External Sort-Merge**:
    1.  **Pass 0 (Sort)**: Read data in chunks that fit in memory (buffers), sort these chunks, and write them back to disk as sorted "subfiles".
    2.  **Pass 1+ (Merge)**: Perform a "k-way merge" on the subfiles. The number 'k' is (buffers - 1). Repeat until only one sorted file remains.
* **Parallel Partitioned Sort**: The most effective method.
    1.  **Partitioning**: Perform a "Range Redistribution" to send all data within a certain range to a specific processor.
    2.  **Local Sort**: Each processor sorts its local data.
    3.  **Result**: The data is now globally sorted. **No final merge is needed**. The main problem is that the partitioning step can cause skew.
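
Both sorting approaches can be sketched on in-memory lists. This is a simplified illustration, not a disk-based implementation: each buffer is assumed to hold one record, lists stand in for subfiles and processors, and the function names are hypothetical.

```python
import heapq
from bisect import bisect_right


def external_sort_merge(records, buffers):
    """Pass 0 builds sorted runs; later passes do (buffers - 1)-way merges."""
    if buffers < 3:
        raise ValueError("need at least 3 buffers for a 2-way merge")
    # Pass 0: sort chunks that fit in the available buffers.
    runs = [sorted(records[i:i + buffers]) for i in range(0, len(records), buffers)]
    k = buffers - 1  # one buffer is reserved for output
    merge_passes = 0
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + k])) for i in range(0, len(runs), k)]
        merge_passes += 1
    return (runs[0] if runs else []), merge_passes


def parallel_partitioned_sort(records, boundaries):
    """Range-redistribute, then sort locally; concatenation is globally sorted."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for rec in records:
        partitions[bisect_right(boundaries, rec)].append(rec)
    return [sorted(p) for p in partitions]  # one local sort per processor


records = [9, 4, 7, 1, 8, 2, 6, 3, 5, 0]
serial_result, merge_passes = external_sort_merge(records, buffers=3)
parts = parallel_partitioned_sort(records, boundaries=[3, 7])
print(serial_result, merge_passes)
print(parts)
```

In the partitioned version the per-processor lists only need to be read in processor order, which is why no final merge step appears.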

**Parallel GroupBy**
* **Two-Phase Method**: 1. Perform local aggregation on each processor. 2. Redistribute the local results. 3. Perform a final global aggregation.
* **Redistribution Method**: 1. Redistribute the *raw* records based on the GroupBy attribute. 2. Perform local aggregation. This is simpler but can be skewed.
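
The two methods can be compared with a small `COUNT(*) GROUP BY key` sketch. Integer keys, a modulo redistribution function, and the function names are assumptions for illustration; each inner list is one processor's local data.

```python
from collections import Counter


def two_phase_count(partitions, n):
    """Aggregate locally, redistribute the partial counts, then combine them."""
    local = [Counter(p) for p in partitions]  # phase 1: local aggregation
    final = [Counter() for _ in range(n)]
    for counts in local:
        for key, count in counts.items():
            final[key % n][key] += count  # redistribute, then global aggregation
    return final


def redistribution_count(partitions, n):
    """Redistribute raw records by key, then aggregate once on each processor."""
    received = [[] for _ in range(n)]
    for p in partitions:
        for key in p:
            received[key % n].append(key)
    return [Counter(r) for r in received]


partitions = [[1, 2, 1, 3], [2, 2, 1], [3, 3, 4]]
print(two_phase_count(partitions, 2))
print(redistribution_count(partitions, 2))
```

Both functions return the same per-processor counts here; the difference is that the two-phase version ships partial counts between processors while the redistribution version ships every raw record.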

---

## 2. Complexity: Machine Learning (Sessions 5-8)

This section focuses on applying machine learning algorithms to big data.

### 🤖 ML Pipeline

A typical ML process follows these steps: **Training Data** $\rightarrow$ **Featurization** $\rightarrow$ **Training** $\rightarrow$ **Model** $\rightarrow$ **Model Evaluation**.

**Featurization** is the process of converting raw data into numerical features:
* **Extraction**: Creating features (e.g., TF-IDF, Word2Vec).
* **Transformation**: Modifying features (e.g., Tokenization, Stop Words Removal, One Hot Encoding).
* **Selection**: Choosing a subset of features (e.g., Vector Slicer).

### 🏷️ Supervised Learning

The data has associated **labels**, and the goal is to predict the label for new data.
* **Classification**: Predicts a category (e.g., "dog" or "not dog").
* **Regression**: Predicts a continuous value.

#### Decision Trees (ID3)
This algorithm builds a tree by repeatedly splitting the data.
1.  At each node, calculate the **Information Gain (IG)** for every attribute.
2.  **IG** is the change in **Entropy** (measure of uncertainty) from splitting on that attribute.
3.  The attribute with the **highest IG** is chosen as the splitting node.
4.  This process repeats until all data in a leaf node belongs to the same class, or no attributes remain to split on.
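
Steps 1 to 3 can be sketched as follows. The tiny dataset, its attribute names, and the function names are made up for illustration; entropy is computed in bits (log base 2).

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attribute, label):
    """Entropy before the split minus the weighted entropy after it."""
    before = entropy([r[label] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[label])
    after = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return before - after


# Hypothetical training data: "outlook" fully determines "play".
rows = [
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
]
gains = {a: information_gain(rows, a, "play") for a in ("outlook", "windy")}
best = max(gains, key=gains.get)
print(gains, best)
```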

#### Parallel Decision Trees
* **Data Parallelism (Intra-Node)**: All processors work on the *same node*. Each processor computes the IG for a *subset of attributes* (vertical partitioning).
* **Result Parallelism (Inter-Node)**: Different processors work on *different nodes* at the same level of the tree concurrently.

### 🧩 Unsupervised Learning

The data has **no labels**. The goal is to find hidden structure.
* **Clustering**: Groups data into clusters (e.g., K-Means).
* **Association**: Finds relationships (e.g., "people who buy X also buy Y").

#### K-Means Clustering
An iterative algorithm to group data into *k* clusters.
1.  **Initialize**: Randomly choose *k* initial cluster centroids.
2.  **Assignment Step**: Assign each data point to its closest centroid.
3.  **Update Step**: Recalculate each centroid to be the mean of all points assigned to it.
4.  **Repeat**: Continue steps 2 and 3 until the cluster memberships stop changing.
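
A minimal pure-Python sketch of the loop above is shown here. To keep the output reproducible, the initial centroids are passed in rather than chosen randomly; that choice, the function names, and the sample points are assumptions for illustration.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, initial_centroids, max_iter=100):
    """Alternate assignment and update steps until memberships stop changing."""
    centroids = list(initial_centroids)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: index of the closest centroid for every point.
        new_assignment = [
            min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # cluster memberships are stable
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignment


points = [(1, 1), (1.5, 2), (1, 0), (8, 8), (9, 9), (8, 9)]
centroids, assignment = kmeans(points, [(0, 0), (10, 10)])
print(centroids, assignment)
```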

#### Parallel K-Means
* **Data Parallelism**: Each processor clusters its own partition of the data against the same set of centroids. The local clusters from all processors are then united to form the final clusters.
* **Result Parallelism**: Each processor is responsible for *one* of the *k* target clusters. This requires data movement between processors as points change cluster membership.

### 🤝 Collaborative Filtering (CF)

A common method for building recommendation systems.

* **User-Based CF**: Recommends items based on users with similar tastes.
    1.  Calculate the **similarity** (e.g., Cosine Similarity) between the target user and all other users.
    2.  Predict the target user's rating for an item based on a **weighted average** of the ratings from the *most similar* users.
    3.  Recommend the items with the highest predicted ratings.

* **Model-Based CF (ALS)**: Uses **Matrix Factorization** to learn "latent factors" (hidden preferences) for users and items.
    * It factors the large **Rating Matrix (R)** into a smaller **User Matrix (U)** and **Item Matrix (V)**.
    * **Alternating Least Squares (ALS)**: An algorithm that finds U and V by minimizing the error between the *original* ratings and the *predicted* ratings (from U * V). It "alternates" by fixing U to solve for V, then fixing V to solve for U, repeating until stable.
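
The user-based steps can be sketched in plain Python. Several details here are assumptions rather than part of the notes: cosine similarity is computed over co-rated items only, the prediction is normalized by the sum of absolute similarities, and the users, items, and ratings are invented.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity of two rating dicts over the items both users rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(a[i] ** 2 for i in common))
    norm_b = sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)


def predict_rating(ratings, target, item, k):
    """Similarity-weighted average of the k most similar users' ratings."""
    neighbours = [
        (cosine_similarity(ratings[target], other), other[item])
        for user, other in ratings.items()
        if user != target and item in other
    ]
    neighbours.sort(key=lambda pair: pair[0], reverse=True)
    top = neighbours[:k]
    weight = sum(abs(sim) for sim, _ in top)
    if weight == 0:
        return None  # no usable neighbours
    return sum(sim * rating for sim, rating in top) / weight


ratings = {
    "alice": {"A": 5, "B": 3},
    "bob": {"A": 5, "B": 3, "C": 4},
    "carol": {"A": 1, "B": 5, "C": 2},
}
pred_k1 = predict_rating(ratings, "alice", "C", k=1)
pred_k2 = predict_rating(ratings, "alice", "C", k=2)
print(pred_k1, pred_k2)
```

With `k=1` only the most similar user (bob) contributes, so the prediction equals his rating; with `k=2` the less similar user pulls the prediction down.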

---

## 3. Velocity: Processing Fast Data (Sessions 9-11)

This section focuses on handling real-time, continuous, and unbounded data streams.

### 🌊 Stream Processing Basics

* **Data Stream**: A real-time, continuous, ordered, and unbounded sequence of items.
* **Challenge**: Data is infinite, so we can't store it all. We must process it in one pass using **sliding windows**.
* **Window Types**:
    * **Time-Based**: A fixed time duration (e.g., "all data from the last 5 seconds").
    * **Tuple-Based**: A fixed number of items (e.g., "the last 100 tuples").
* **Window Movement**:
    * **Overlapping (Sliding)**: The window slides by an increment *less than* its size.
    * **Non-Overlapping (Tumbling)**: The window slides by an increment *equal to* its size.
* **Event Time vs. Processing Time**:
    * **Event Time**: The timestamp when the data was generated at the source.
    * **Processing Time**: The timestamp when the data arrived at the processing server.
    * In practice, **Event Time precedes Processing Time** because of network delays, so data can arrive late or out of order.
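
The window types can be illustrated on a finite list that stands in for a stream. The function names and sample data are assumptions; timestamps in the time-based example are taken to be event times in seconds.

```python
def tuple_windows(items, size, slide):
    """Tuple-based windows: slide < size overlaps, slide == size tumbles."""
    return [items[i:i + size] for i in range(0, len(items) - size + 1, slide)]


def tumbling_time_windows(events, duration):
    """Time-based tumbling windows over (timestamp, value) pairs."""
    buckets = {}
    for timestamp, value in events:
        buckets.setdefault(timestamp // duration, []).append(value)
    return [buckets[b] for b in sorted(buckets)]


stream = [1, 2, 3, 4, 5, 6, 7, 8]
print(tuple_windows(stream, size=4, slide=2))  # overlapping (sliding)
print(tuple_windows(stream, size=4, slide=4))  # non-overlapping (tumbling)

events = [(0, "a"), (3, "b"), (5, "c"), (9, "d"), (12, "e")]
print(tumbling_time_windows(events, duration=5))
```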

### ⚡ Stream Joins

Joining two or more unbounded streams is difficult due to timing.

* **Symmetric Hash Join**: The solution to out-of-order arrivals.
    * A simple hash join fails if a tuple `r` arrives before its matching tuple `s`.
    * A symmetric join maintains **two hash tables**, one for each stream (R and S).
    * When `r` arrives, it **probes** Table S for matches, then is **inserted** into Table R for future `s` tuples to find.
* **M-Join**: An extension of the symmetric hash join for **more than two streams**. An arriving tuple probes the hash tables of *all other* streams before being inserted into its *own* table.
* **Handshake Join**: A conceptual join where two streams "handshake" as they pass. The main problem is that tuples can "miss" each other. Solutions involve adding empty "slots" or performing multiple handshakes before moving.
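
The probe-then-insert behaviour of the symmetric hash join can be sketched as a small class. The class and method names are hypothetical, and this sketch keeps every tuple forever; a real stream join would bound the state, for example with windows.

```python
from collections import defaultdict


class SymmetricHashJoin:
    """Two hash tables, one per stream; each arrival probes, then inserts."""

    def __init__(self):
        self.tables = {"R": defaultdict(list), "S": defaultdict(list)}

    def on_arrival(self, stream, key, value):
        other = "S" if stream == "R" else "R"
        # Probe the other stream's table for tuples that arrived earlier.
        matches = [
            (value, seen) if stream == "R" else (seen, value)
            for seen in self.tables[other].get(key, [])
        ]
        # Insert into this stream's own table for future arrivals to find.
        self.tables[stream][key].append(value)
        return matches


join = SymmetricHashJoin()
out1 = join.on_arrival("R", 1, "r1")  # nothing in S yet
out2 = join.on_arrival("S", 1, "s1")  # finds r1
out3 = join.on_arrival("R", 1, "r2")  # finds s1
out4 = join.on_arrival("S", 2, "s2")  # no R tuple with key 2
print(out1, out2, out3, out4)
```

Because every arrival both probes and inserts, a match is produced regardless of which side of the pair arrives first.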

### 🔬 Granularity Reduction

This is the process of aggregating data to a lower level of detail (e.g., from raw data "level-0" to an aggregated "level-1").

* **Moving Average vs. Granularity Reduction**:
    * **No Reduction (Rolling Mean)**: Using an *overlapped window* where the slide is 1 record (e.g., a 6-month window sliding 1 month at a time). The number of data points remains the same; the data is just smoothed. This is **Case A: Overlapped Windows - No granularity reduction**.
    * **With Reduction**: Using a *non-overlapped (tumbling) window* or an *overlapped window with a slide > 1*. This results in fewer data points.

* **Mixed Levels of Granularity**: Combining different granularities, often to allow "drill-down" analysis.
    * **Temporal-based**: Based on time (e.g., 1-hour granularity at night, 10-minute during the day).
    * **Spatial-based**: Based on location (e.g., average by state, but drill-down to see individual cities).
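
The difference between smoothing and reduction shows up in the number of output points. In this sketch the function name and values are illustrative, and incomplete windows at the end of the series are dropped, so the rolling mean returns slightly fewer points than the input.

```python
def windowed_mean(values, size, slide):
    """Mean of each window of `size` values, advancing by `slide` values."""
    return [
        sum(values[i:i + size]) / size
        for i in range(0, len(values) - size + 1, slide)
    ]


monthly = [10, 20, 30, 40, 50, 60]
rolling = windowed_mean(monthly, size=3, slide=1)  # smoothing, no reduction
tumbling = windowed_mean(monthly, size=3, slide=3)  # granularity reduction
print(rolling)
print(tumbling)
```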

### 📡 Sensor Arrays

A group of distributed sensors working together.

* **Category 1: Measuring the SAME Thing** (e.g., 3 weather stations for one city).
    * **Method 1: Reduce then Merge**: First, find the 1-hour average for each station. Second, average those averages.
    * **Method 2: Merge then Reduce**: First, average the raw data from all 3 stations. Second, find the 1-hour average of that merged stream.

* **Category 2: Measuring DIFFERENT Things** (e.g., indoor sensors for Air Quality, Temperature, and Humidity).
    * These streams **must be normalized** to a common scale (e.g., a "Room Quality Score" from 1-5) before they can be merged.
    * **Method 1: Reduce, Normalize, then Merge**.
    * **Method 2: Normalize, Merge, then Reduce**.
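
The two orderings for Category 1, plus a simple normalization for Category 2, can be sketched as below. The readings are invented, the stations are assumed to report aligned readings at the same time steps, and the linear rescale to a 1-5 score (with hypothetical minimum and maximum values) is only one possible normalization.

```python
def mean(values):
    return sum(values) / len(values)


def reduce_then_merge(stations):
    """Average each station over the hour, then average those averages."""
    return mean([mean(readings) for readings in stations])


def merge_then_reduce(stations):
    """Average across stations at each time step, then average over the hour."""
    merged = [mean(step) for step in zip(*stations)]
    return mean(merged)


def normalize_to_score(value, lowest, highest):
    """Linearly rescale a reading from [lowest, highest] to a 1-5 score."""
    return 1 + 4 * (value - lowest) / (highest - lowest)


# Three weather stations, three aligned temperature readings each.
stations = [[20, 22, 24], [21, 23, 25], [19, 21, 23]]
print(reduce_then_merge(stations), merge_then_reduce(stations))
print(normalize_to_score(25, lowest=15, highest=35))
```

With aligned readings both orderings give the same value; they may differ when stations report different numbers of readings in the window.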