# Big Data Lecture Notes: Stream Data Processing

## Part 1: Fundamentals of Data Streams

### What is a Data Stream? 🏞️

A **data stream** is a real-time, continuous, and ordered sequence of data items that is potentially unbounded (infinite). Unlike a traditional database where data is stored before being queried, stream data is processed as it arrives.

Think of it like a river: the water (data) is always flowing, and you can't store the entire river. You can only analyze the water that is passing by you right now.

#### Key Characteristics of Streaming Data
* **Unbounded Data**: The data arrives continuously and has no defined end.
* **Real-time Processing**: Queries are run continuously over the data as it arrives, and quick responses are needed.
* **Continuous Arrival**: Data can arrive at a uniform rate (e.g., a sensor reading every second) or in sudden bursts (e.g., a surge in social media posts).
* **Focus on Events/Trends**: We are often interested in detecting events or tracking trends over time.


#### Database vs. Stream Processing

| Feature | Database (Batch Processing) | Stream Processing |
| :--- | :--- | :--- |
| **Data Scope** | Bounded (entire dataset) | Unbounded (infinite stream) |
| **Data State** | Relatively static | Dynamic and constantly changing |
| **Queries** | Complex, ad-hoc queries | Simple, continuous queries |
| **Answers** | Exact and precise | Often approximate |
| **Data Access** | Can access data multiple times | Single-pass operation (no backtracking) |

### Querying Data Streams

Since we can't store an entire infinite stream, we must process data on the fly, often using a **synopsis** or **accumulator** to maintain a summary.

* **Question**: *How to find the total number of attendees up to now?*
    * **Answer**: Use a single accumulator variable. When a new data point arrives, add its value to the accumulator. There's no need to store the individual data points.
* **Question**: *How to find the average number of attendees so far?*
    * **Answer**: Use two accumulators: one for the sum and another for the count. The average can be calculated at any time by dividing the sum by the count.
* **Question**: *How to predict if the next number will be higher or lower?*
    * **Answer**: This is a predictive query. It requires learning from past data. Since we have limited memory, we can't store the whole stream. Instead, we use a model that gets updated as new data arrives.
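
The first two answers can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the class name `RunningStats` and the attendee numbers are made up for the example. The point is that memory use stays constant no matter how many items arrive.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Constant-memory synopsis of a numeric stream (hypothetical helper)."""
    total: float = 0.0   # accumulator 1: running sum
    count: int = 0       # accumulator 2: number of items seen

    def update(self, value: float) -> None:
        # Fold the new item into the accumulators; the item itself is not stored.
        self.total += value
        self.count += 1

    @property
    def mean(self) -> float:
        if self.count == 0:
            raise ValueError("no data seen yet")
        return self.total / self.count


stats = RunningStats()
for attendees in [120, 95, 143, 110]:   # made-up attendee counts
    stats.update(attendees)

print(stats.total, stats.mean)  # 468.0 117.0
```

The total and the average can be read at any moment between arrivals, which is what makes this a continuous query.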

---

## Part 2: Windowing

Windowing is the most common technique for handling unbounded streams. It allows us to run computations over a finite chunk, or "window," of the stream.

### Types of Windows

1.  **Time-Based Windows**: The window is defined by a fixed time duration.
    * **Example**: "Calculate the average traffic volume over the last 5 minutes."
    * The number of data points in each window can vary, especially if the data arrives in bursts.
2.  **Tuple-Based (Count-Based) Windows**: The window is defined by a fixed number of data points.
    * **Example**: "Calculate the average price over the last 100 stock trades."
    * The time duration covered by each window can vary.
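
As a rough sketch of the difference, the two classes below keep a running average over a tuple-based window and a time-based window. The class names and sample values are assumptions made for illustration. The time-based version assumes timestamps arrive in non-decreasing order and that `duration` is positive.

```python
from collections import deque


class CountWindowAverage:
    """Average over the last `size` values (tuple-based window)."""

    def __init__(self, size):
        # A bounded deque drops the oldest value automatically when full.
        self.values = deque(maxlen=size)

    def add(self, value):
        self.values.append(value)
        return sum(self.values) / len(self.values)


class TimeWindowAverage:
    """Average over values with timestamp in (now - duration, now]."""

    def __init__(self, duration):
        self.duration = duration
        self.items = deque()  # (timestamp, value) pairs in arrival order

    def add(self, timestamp, value):
        self.items.append((timestamp, value))
        # Evict items that have fallen out of the window.
        while self.items[0][0] <= timestamp - self.duration:
            self.items.popleft()
        return sum(v for _, v in self.items) / len(self.items)


by_count = CountWindowAverage(size=3)
count_avgs = [by_count.add(p) for p in [10, 20, 30, 40]]
print(count_avgs)  # [10.0, 15.0, 20.0, 30.0]

by_time = TimeWindowAverage(duration=5)
time_avgs = [by_time.add(t, v) for t, v in [(0, 10), (1, 20), (7, 30), (8, 50)]]
print(time_avgs)   # [10.0, 15.0, 30.0, 40.0]
```

Note how the tuple-based window always holds at most three values, while the time-based window holds however many values happened to arrive in the last five time units.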

### Window Movement

Windows can move over the stream in two primary ways:

1.  **Tumbling Windows (Non-Overlapping)**: The window "tumbles" forward in increments equal to its size. Each data point belongs to exactly one window.
2.  **Sliding Windows (Overlapping)**: The window "slides" forward in increments smaller than its size. This means the windows overlap, and a single data point can belong to multiple windows. This is useful for getting more frequent updates.
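
The two movement styles differ only in the step size, which the following sketch makes explicit. The helper `count_windows` is a hypothetical name, and it works on a finite list purely for illustration; a real stream would be consumed incrementally.

```python
def count_windows(items, size, slide):
    """Yield full windows of `size` items, advancing `slide` items each time."""
    for start in range(0, len(items) - size + 1, slide):
        yield items[start:start + size]


data = list("abcdef")

# Tumbling: slide == size, so every item appears in exactly one window.
tumbling = list(count_windows(data, size=2, slide=2))
print(tumbling)  # [['a', 'b'], ['c', 'd'], ['e', 'f']]

# Sliding: slide < size, so consecutive windows overlap.
sliding = list(count_windows(data, size=3, slide=1))
print(sliding)   # [['a', 'b', 'c'], ['b', 'c', 'd'], ['c', 'd', 'e'], ['d', 'e', 'f']]
```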

* **Practice Question**: A stream has tuples in the format `(eventTime, value, processingTime)`. We use a **time-based window** with a **size of 3 seconds** and a **slide of 2 seconds**, starting at processing time 1. What is in the third window?
    * **Data**: `{1,a,1}, {2,b,4}, {3,c,4}, {5,e,6}, {6,f,8}, {7,g,8}, ...`
    * **Answer**: We trace the windows based on *processing time*.
        1.  **Window 1**: Covers the time interval `[1, 4)`, i.e. processing times 1, 2, and 3. Only `{1,a,1}` qualifies; `{2,b,4}` and `{3,c,4}` have processing time 4, which falls outside the half-open interval.
        2.  **Window 2**: The window slides by 2 seconds, so it starts at time `1+2=3`. It covers the interval `[3, 6)`, i.e. processing times 3, 4, and 5. The tuples are: `{2,b,4}, {3,c,4}`.
        3.  **Window 3**: The window slides by another 2 seconds, starting at time `3+2=5`. It covers the interval `[5, 8)`, i.e. processing times 5, 6, and 7. Of the tuples shown, only `{5,e,6}` qualifies; `{6,f,8}` and `{7,g,8}` have processing time 8. (If the windows were assigned by *event time* instead, the third window would contain `{5,e,6}, {6,f,8}, {7,g,8}`.)
    * The correct answer is **C**.
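
A trace like this is easy to get wrong by hand, so here is a small sketch that recomputes it. It assumes half-open intervals `[start, start + size)` and takes the field used for assignment as a parameter, so the processing-time and event-time results can be compared. The function name `time_windows` is hypothetical.

```python
def time_windows(tuples, size, slide, start, n_windows, key):
    """Return the first `n_windows` windows; `key` picks the timestamp field."""
    windows = []
    for i in range(n_windows):
        lo = start + i * slide      # window start (inclusive)
        hi = lo + size              # window end (exclusive)
        windows.append([t for t in tuples if lo <= key(t) < hi])
    return windows


# (eventTime, value, processingTime), as in the practice question
stream = [(1, "a", 1), (2, "b", 4), (3, "c", 4),
          (5, "e", 6), (6, "f", 8), (7, "g", 8)]

by_processing = time_windows(stream, size=3, slide=2, start=1, n_windows=3,
                             key=lambda t: t[2])
by_event = time_windows(stream, size=3, slide=2, start=1, n_windows=3,
                        key=lambda t: t[0])

print(by_processing[2])  # [(5, 'e', 6)]
print(by_event[2])       # [(5, 'e', 6), (6, 'f', 8), (7, 'g', 8)]
```

The two results differ because several tuples were processed later than they were generated, which is exactly why it matters which timestamp a window is defined on.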

---

## Part 3: Stream Processing Technologies

### Real-Time Streaming Architecture

Modern streaming architectures typically involve a few key components working together. A common pattern is integrating **Apache Kafka** with **Apache Spark Streaming**.


* **Producers**: Applications that generate and send the data streams (e.g., IoT devices, web servers).
* **Kafka**: A messaging system that ingests and stores these streams in a fault-tolerant way.
* **Spark Streaming**: A processing engine that consumes the streams from Kafka in small batches, processes them, and sends the results to a destination.
* **Consumers/Destinations**: Applications or systems that use the processed data, such as a database, a dashboard, or another consumer application.

### Apache Kafka 📬

Kafka is a distributed, publish-subscribe messaging system that acts as the central nervous system for real-time data. It's designed to be scalable, durable, and fault-tolerant.

#### Key Kafka Concepts
* **Producer**: Publishes streams of records to Kafka topics.
* **Consumer**: Subscribes to topics to read and process the streams of records.
* **Topic**: A category or feed name to which records are published. Think of it as a table in a database, but for a stream. A topic is stored as a **commit log**.
* **Broker**: A Kafka server. Kafka is run as a cluster of one or more brokers that manage the data.
* **Partition**: A topic is split into multiple partitions. Each partition is an ordered, append-only log. Splitting a topic into partitions allows for parallelism, as multiple consumers can read from different partitions at the same time. Partitions are also replicated across brokers for fault tolerance.
* **Offset**: A sequential ID, unique within its partition, assigned to each record as it is appended. Consumers are responsible for tracking the offset they have read up to, which gives them full control over how they consume messages (for example, they can rewind and re-read older records).
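
To make partitions and offsets concrete, here is a toy in-memory model. It is illustrative only and is **not** Kafka's client API: the class `ToyTopic`, its methods, and the CRC32-based key hashing are all assumptions chosen to keep the sketch self-contained.

```python
import zlib


class ToyTopic:
    """In-memory stand-in for a partitioned topic (not Kafka's API)."""

    def __init__(self, num_partitions):
        # Each partition is an ordered, append-only list of records.
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):
        # Hash the key so records with the same key land in the same partition.
        # (CRC32 is a stand-in hash chosen for this sketch.)
        p = zlib.crc32(key.encode()) % len(self.partitions)
        self.partitions[p].append((key, value))
        offset = len(self.partitions[p]) - 1   # position within the partition
        return p, offset

    def read(self, partition, offset):
        # Return every record from `offset` onward; nothing is deleted.
        return self.partitions[partition][offset:]


topic = ToyTopic(num_partitions=3)
p0, o0 = topic.publish("sensor-1", 20.5)
p1, o1 = topic.publish("sensor-1", 21.0)
p2, o2 = topic.publish("sensor-1", 21.7)

# The consumer, not the topic, remembers how far it has read.
committed = 0
batch = topic.read(p0, committed)
committed += len(batch)

replay = topic.read(p0, 1)   # rewinding is just reading from an older offset
```

Because reading does not remove records, several independent consumers could each keep their own `committed` position over the same partition.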

### Stream Data Processing
- A data stream is a real-time, continuous, time-ordered sequence of items.
- Because of memory constraints, we store accumulated statistics about the data rather than every data item.
- The stream window size determines how much processing is needed.

**Database**
- Bounded data
- Relatively static data
- Complex, ad-hoc queries
- Can backtrack during processing
- Exact answer to a query
- Low tuple arrival rate

**Stream processing**
- Unbounded data
- Dynamic data
- Simple, continuous queries
- No backtracking; single-pass operation
- Approximate answer to a query
- High tuple arrival rate

### Run Week 9 lab. New Docker container.