# Big Data Lecture Notes: Granularity and Sensors

## Part 1: Granularity in Data Streams

### What is Granularity? 🔍

**Granularity** refers to the **level of detail** at which data is stored or analyzed. Think of it like looking at a map: you can view the entire world (low granularity) or zoom in to a specific street (high granularity).

* **High Granularity** (Fine-grained): This is the raw, detailed, unaggregated data. It's complete but can be very complex and overwhelming. For example, collecting air quality data every single second.
* **Low Granularity** (Coarse-grained): This is summarized or aggregated data. It's less detailed but simpler, making it easier to spot trends and manage. For example, averaging the second-by-second air quality data into a single daily value.

The main reason for using different levels of granularity is **efficiency and complexity management**. Querying massive, high-granularity streams is slow. By pre-aggregating data into lower granularities, we can perform faster analysis and make complex information easier to understand.
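
To make the size difference concrete, the short sketch below counts how many data points a single sensor produces per day at a few sampling intervals. The interval labels and the choice of intervals are illustrative assumptions, not values prescribed by these notes.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Illustrative sampling intervals, in seconds (assumed for this example).
intervals_seconds = {
    "every second": 1,
    "every 10 minutes": 10 * 60,
    "hourly": 60 * 60,
    "daily": SECONDS_PER_DAY,
}

# Number of data points one sensor emits per day at each granularity.
points_per_day = {
    label: SECONDS_PER_DAY // seconds
    for label, seconds in intervals_seconds.items()
}

for label, count in points_per_day.items():
    print(f"{label:>16}: {count} points/day")
```

Going from per-second readings to a daily value shrinks the stream by a factor of 86,400 per sensor, which is why queries over coarse data are so much cheaper.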


---

### Granularity Reduction with Windows

In stream processing, we use **time-based windows** to aggregate data and reduce its granularity. The way the window moves (slides) determines how the granularity is affected.

#### Non-Overlapped (Tumbling) Windows
Here, the window slide interval is **equal to the window size**. Each data point belongs to exactly one window.
* **Effect**: This directly reduces granularity. For example, a 6-hour non-overlapped window on data collected every 10 minutes collapses each block of 36 readings into one value, so the stream becomes a single data point every 6 hours.
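
A minimal sketch of a tumbling window, assuming the aggregate is a mean and the readings arrive at a fixed 10-minute interval. The function name `tumbling_mean` and the sample data are hypothetical.

```python
from statistics import mean

def tumbling_mean(values, window_size):
    """Average consecutive, non-overlapping chunks of `window_size` readings.

    A trailing chunk shorter than `window_size` is dropped.
    """
    return [
        mean(values[start:start + window_size])
        for start in range(0, len(values) - window_size + 1, window_size)
    ]

# One day of made-up readings taken every 10 minutes: 144 values.
readings = [float(i % 36) for i in range(144)]

# A 6-hour window holds 6 * 60 / 10 = 36 ten-minute readings.
six_hourly = tumbling_mean(readings, window_size=36)

print(len(readings), "->", len(six_hourly))  # 144 -> 4
```

Because the step equals the window size, every reading contributes to exactly one output value.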

#### Overlapped (Sliding) Windows
Here, the window slide interval is **less than the window size**.

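* **Rolling Mean (No Granularity Reduction)**: If the window slides by the smallest possible amount (e.g., one record or one time unit), you are calculating a **moving average**. This produces roughly one output per input (only the positions where the window is not yet full may be missing), so it **does not reduce the number of data points** in any meaningful way; it only *smooths* the data to make underlying trends more visible.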
* **Sliding with a Larger Step (Granularity Reduction)**: If the window slides by more than one time unit (e.g., a 6-hour window that slides every 3 hours), you get one output per step rather than one per input, so you produce fewer data points than you started with. This both smooths the data *and* reduces its granularity.
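
The sketch below compares the window types on the same data, assuming a mean aggregate and only emitting full windows. The function `sliding_mean` and the hourly sample values are hypothetical.

```python
from statistics import mean

def sliding_mean(values, window_size, step=1):
    """Mean of each full window of `window_size` readings, advancing by `step`."""
    return [
        mean(values[start:start + window_size])
        for start in range(0, len(values) - window_size + 1, step)
    ]

# Made-up hourly readings for one day: 0.0, 1.0, ..., 23.0.
hourly = [float(h) for h in range(24)]

rolling = sliding_mean(hourly, window_size=6, step=1)   # rolling mean
stepped = sliding_mean(hourly, window_size=6, step=3)   # 6-hour window, 3-hour slide
tumbling = sliding_mean(hourly, window_size=6, step=6)  # step equals window size

print(len(hourly), len(rolling), len(stepped), len(tumbling))  # 24 19 7 4
```

With a step of 1 the output has `n - window_size + 1` points (19 of 24 here), so the stream is smoothed but barely shortened. Larger steps shorten it further, and a step equal to the window size reproduces the tumbling case.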

---

### Mixed Levels of Granularity

This technique involves combining different levels of granularity into a single, dynamic view. It allows you to see a general trend but also **"drill down"** into more detail when something interesting happens.

There are two main types:

1.  **Temporal-based Mixed Granularity**: This is based on **time**. You might use a coarse granularity during periods of low activity but switch to a fine granularity during peak hours to see more detail. For example, if activity peaks at night, you could use 6-hour averages during the day and 1-hour averages at night.
2.  **Spatial-based Mixed Granularity**: This is based on **location**. For example, a public transport dashboard might show the average number of passengers per suburb (low granularity). If you see a spike in a particular suburb, you can drill down to see the passenger numbers for each individual stop within that suburb (high granularity).
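
As a rough illustration of the temporal case, the sketch below emits one point per hour during a peak period and one averaged point per 6-hour block otherwise. The function `mixed_granularity`, the peak period (18:00 to midnight), and the data are all assumptions made for this example.

```python
from statistics import mean

def mixed_granularity(hourly, is_peak, coarse=6):
    """Return (start_hour, end_hour_exclusive, value) tuples.

    Peak hours are emitted individually; off-peak hours are averaged
    in runs of up to `coarse` consecutive hours.
    """
    points, buffer, start = [], [], 0
    for hour, value in enumerate(hourly):
        if is_peak(hour):
            if buffer:  # close any open coarse window first
                points.append((start, hour, mean(buffer)))
                buffer = []
            points.append((hour, hour + 1, value))
        else:
            if not buffer:
                start = hour
            buffer.append(value)
            if len(buffer) == coarse:
                points.append((start, hour + 1, mean(buffer)))
                buffer = []
    if buffer:  # flush a partial coarse window at the end
        points.append((start, len(hourly), mean(buffer)))
    return points

# Made-up hourly readings for one day.
hourly = [float(h) for h in range(24)]

# Assumption: the period of interest is 18:00 to midnight.
points = mixed_granularity(hourly, is_peak=lambda h: h >= 18)

for start, end, value in points:
    print(f"{start:02d}:00-{end:02d}:00  {value}")
```

The 24 hourly readings become 9 points: three 6-hour averages for the quiet part of the day and six 1-hour values for the peak. A spatial version would follow the same idea, grouping by suburb by default and by individual stop where more detail is wanted.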

---

## Part 2: Sensor Arrays

A **sensor array** is a group of distributed sensors that work together to provide a more complete picture of an environment. There are two main categories of sensor arrays.

### 1. Multiple Sensors Measuring the SAME Thing

This is done to improve accuracy over a large area. For example, using three different weather stations to get a more accurate average temperature for an entire city.

There are two ways to process this data:

* **Method 1: Reduce then Merge**
    1.  **Reduce**: Lower the granularity of each sensor's data stream individually (e.g., calculate the 3-hour average for each station).
    2.  **Merge**: Combine the reduced-granularity streams into a single stream (e.g., by averaging the 3-hour averages from all stations).

* **Method 2: Merge then Reduce**
    1.  **Merge**: Combine the raw, high-granularity data from all sensors into a single stream (e.g., by averaging the raw readings at each timestamp).
    2.  **Reduce**: Lower the granularity of the newly merged stream.

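With plain averaging and no missing readings, both methods give the same result, because a mean of means over equally sized groups equals the overall mean. They diverge when some sensors have gaps or when the aggregate is not a mean (e.g., a median or maximum). In those cases **Merge then Reduce** works from the raw readings and may preserve slightly more detail from the original data.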
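
The two orderings can be sketched as follows, assuming both the reduce step and the merge step are means that skip missing readings (represented as `None`). The helper names, station data, and gap are hypothetical.

```python
from statistics import mean

def reduce_stream(values, window_size):
    """Tumbling-window mean that ignores missing (None) readings."""
    out = []
    for start in range(0, len(values) - window_size + 1, window_size):
        present = [v for v in values[start:start + window_size] if v is not None]
        out.append(mean(present) if present else None)
    return out

def merge_streams(streams):
    """Average across sensors at each position, ignoring missing readings."""
    out = []
    for group in zip(*streams):
        present = [v for v in group if v is not None]
        out.append(mean(present) if present else None)
    return out

# Made-up hourly temperatures from three stations; station C has a gap.
station_a = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
station_b = [11.0, 13.0, 15.0, 17.0, 19.0, 21.0]
station_c = [12.0, None, None, 18.0, 20.0, 22.0]
stations = [station_a, station_b, station_c]

# Method 1: reduce each stream to 3-hour averages, then merge.
reduce_then_merge = merge_streams([reduce_stream(s, 3) for s in stations])

# Method 2: merge the raw readings, then reduce to 3-hour averages.
merge_then_reduce = reduce_stream(merge_streams(stations), 3)

print(reduce_then_merge)
print(merge_then_reduce)
```

The second window has no gaps, and both methods return 19.0 for it. In the first window, station C's 3-hour average rests on a single reading, so Method 1 lets that one reading carry a full third of the merged value, and the two methods give different results.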

### 2. Multiple Sensors Measuring DIFFERENT Things

This involves a group of sensors measuring different variables in the same environment, such as an indoor sensor array with sensors for air quality (ppm), temperature (°C), and humidity (%).

* **Question**: Why do we need to normalize this data?
    * **Answer**: You can't directly compare or average values from different types of sensors (e.g., 25°C and 1000 ppm are on completely different scales). **Normalization** is required to convert all measurements to a common scale (like a quality score from 1 to 5) before they can be merged.
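
One common way to normalize is linear min-max scaling. The sketch below maps each reading onto a 1 to 5 scale; the function `min_max_scale` and the expected range for each sensor are assumptions chosen for illustration, and real ranges depend on the deployment.

```python
def min_max_scale(value, low, high, new_low=1.0, new_high=5.0):
    """Linearly map `value` from [low, high] onto [new_low, new_high].

    Values outside [low, high] are clamped to the nearest bound.
    """
    clamped = max(low, min(high, value))
    return new_low + (clamped - low) * (new_high - new_low) / (high - low)

# Hypothetical expected range for each sensor (assumed, not from the notes).
RANGES = {
    "air_ppm": (400.0, 2000.0),
    "temp_c": (10.0, 40.0),
    "humidity_pct": (0.0, 100.0),
}

reading = {"air_ppm": 1000.0, "temp_c": 25.0, "humidity_pct": 50.0}

scores = {
    name: min_max_scale(value, *RANGES[name])
    for name, value in reading.items()
}
print(scores)  # {'air_ppm': 2.5, 'temp_c': 3.0, 'humidity_pct': 3.0}
```

After scaling, 1000 ppm and 25°C sit on the same axis and can be averaged. Note that this only expresses where a reading falls within its range. Turning it into a true quality score would also require deciding, per sensor, which direction counts as "good".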

The processing methods mirror the two used for same-type sensors, but with an added normalization step:

* **Method 1: Reduce, Normalize, then Merge**
    1.  **Reduce** the granularity of each sensor stream.
    2.  **Normalize** the reduced data to a common scale.
    3.  **Merge** the normalized streams.

* **Method 2: Normalize, Merge, then Reduce**
    1.  **Normalize** the raw data from each sensor.
    2.  **Merge** the normalized raw streams.
    3.  **Reduce** the granularity of the final merged stream.
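
Both pipelines can be sketched end to end as below, assuming mean-based reduce and merge steps and linear min-max normalization onto [0, 1]. The helper names, sensor ranges, and readings are hypothetical.

```python
from statistics import mean

def reduce_stream(values, window_size):
    """Tumbling-window mean over full windows."""
    return [
        mean(values[start:start + window_size])
        for start in range(0, len(values) - window_size + 1, window_size)
    ]

def normalize(values, low, high):
    """Min-max scale onto [0, 1] using an assumed sensor range."""
    return [(v - low) / (high - low) for v in values]

def merge_streams(streams):
    """Average the streams position by position."""
    return [mean(group) for group in zip(*streams)]

# Made-up readings and assumed ranges for an indoor sensor array.
raw = {
    "air_ppm": [400.0, 800.0, 1200.0, 1600.0],
    "temp_c": [20.0, 22.0, 24.0, 26.0],
    "humidity_pct": [40.0, 50.0, 60.0, 70.0],
}
ranges = {
    "air_ppm": (400.0, 2000.0),
    "temp_c": (10.0, 40.0),
    "humidity_pct": (0.0, 100.0),
}
WINDOW = 2

# Method 1: Reduce, Normalize, then Merge.
method_1 = merge_streams(
    [normalize(reduce_stream(raw[k], WINDOW), *ranges[k]) for k in raw]
)

# Method 2: Normalize, Merge, then Reduce.
method_2 = reduce_stream(
    merge_streams([normalize(raw[k], *ranges[k]) for k in raw]), WINDOW
)

print(method_1)
print(method_2)
```

Under these assumptions the two methods agree up to floating-point rounding, because a linear scaling commutes with averaging. They would generally differ if the normalization were non-linear (for example, bucketing readings into discrete 1 to 5 scores) or if some readings were missing.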