**Transformations in Apache Spark:**

A transformation refers to an operation that produces a new RDD (Resilient Distributed Dataset) or DataFrame from an existing one. Transformations are lazy operations, meaning that Spark doesn't immediately execute the transformation when it is called. Instead, Spark builds an execution plan and only executes the transformations when an action is performed on the RDD or DataFrame.

Transformations let you define the computation steps on your data, such as filtering, mapping, aggregating by key, and joining.

**Key Characteristics of Transformations in Spark:**

*   **Lazy Execution:** Transformations do not execute immediately. Instead, Spark constructs a DAG (Directed Acyclic Graph) that represents the lineage of transformations applied to the data. The actual computation is triggered only when an action is called (e.g., `collect()`, `count()`, `saveAsTextFile()`).

* **Immutable:** Each transformation produces a new RDD or DataFrame; the original dataset remains unchanged.

* **Transformations can be Narrow or Wide:** Based on how data is shuffled across partitions, transformations are categorized into narrow and wide transformations.


**Narrow Transformation:**

In **PySpark**, a narrow transformation is one where each input partition contributes to at most one output partition. No data is shuffled across the cluster, and every partition of the input RDD (Resilient Distributed Dataset) or DataFrame can be processed on its own. The number of elements may still change: `map()` produces one output element per input element, while `filter()` can drop elements and `flatMap()` can add them.

**Key Characteristics of Narrow Transformations:**

* **No Data Shuffling:** Data does not need to be moved between different nodes in the cluster. This makes narrow transformations more efficient than wide transformations, which involve data shuffling.
* **One-to-One Partition Mapping:** Each output partition is computed from a single input partition, so elements are processed independently of the data in other partitions.
* **Local Computation:** Operations are carried out locally within each partition without needing data from other partitions.

**Examples of Narrow Transformations in PySpark:**

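-   **map():** Transforms each element of an RDD by applying a function. PySpark DataFrames have no `map()` method; the equivalent column-wise operations are `select()` and `withColumn()`.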
-   **filter():** Filters the data by applying a condition and returns the elements that satisfy it.


Example:

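```python
import findspark
findspark.init()  # locate the local Spark installation so pyspark can be imported

from pyspark.sql import SparkSession

# Create (or reuse) the SparkSession used by the examples below
spark = SparkSession.builder \
    .appName("PySparkExample") \
    .getOrCreate()
```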

```python
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
result_rdd = rdd.map(lambda x: x * 2)  # narrow transformation: there is no shuffle
print(result_rdd.collect())            # action: prints [2, 4, 6, 8, 10]
```

In this case, the `map()` operation applies a function to each element, but all computations are done within each partition without any need for shuffling the data between nodes.

**Benefits:**

*   **Performance:** Narrow transformations are typically faster because they avoid expensive data shuffling.
*   **Less Network I/O:** Since there’s no data movement between partitions, the operations are more efficient and less taxing on network resources.

**Wide Transformation:**

A wide transformation refers to a type of operation where data from one partition can end up being sent to multiple partitions. These transformations require shuffling of data between different nodes in the cluster, as data needs to be reorganized for the computation.

**Key Characteristics of Wide Transformations:**

*   **Shuffle operation:** Wide transformations cause a shuffle, meaning that Spark has to redistribute the data across different partitions.
*   **More expensive:** Since shuffling involves disk and network I/O, wide transformations can be costly in terms of time and resources.
*   **Group and aggregate data:** Wide transformations typically involve operations where data from multiple partitions needs to be grouped, aggregated, or redistributed based on some key.

**Examples of Wide Transformations in PySpark:**

*   **groupByKey():** Groups the data based on the key, requiring a shuffle to reorganize data.
*   **reduceByKey():** Reduces the data by key and requires shuffling because data with the same key needs to be brought together.
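*   **join():** When you join two datasets (e.g., RDDs or DataFrames) based on a key, Spark usually needs to shuffle the data so that matching keys end up in the same partition. The shuffle can be avoided in some cases, for example when a small DataFrame is broadcast or when both RDDs already share the same partitioner.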


**Why Are Wide Transformations Costly?**

*   **Data Movement:** Wide transformations require moving data across the cluster. For example, in a groupByKey() operation, all the values with the same key must be co-located, potentially requiring the movement of data across multiple nodes.
*   **Disk I/O:** Shuffle output is written to local disk so that other tasks can fetch it, and Spark may additionally spill data to disk during the shuffle if there is not enough memory.
*   **Network I/O:** The data shuffle can also result in significant network overhead, especially in large clusters or when working with large datasets.
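
To make the data movement concrete, here is a toy model of a shuffle in plain Python. It is only a sketch of the idea and not Spark's implementation: the `partition_for`, `shuffle_by_key`, and `reduce_within_partitions` helpers and the CRC-based partitioner are all hypothetical names chosen for illustration.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    # Deterministic stand-in for a hash partitioner (illustrative, not Spark's own).
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle_by_key(partitions, num_partitions):
    """Redistribute (key, value) records so equal keys share one output partition."""
    output = [[] for _ in range(num_partitions)]
    moved = 0
    for source_index, records in enumerate(partitions):
        for key, value in records:
            target_index = partition_for(key, num_partitions)
            if target_index != source_index:
                moved += 1  # on a real cluster this record may cross the network
            output[target_index].append((key, value))
    return output, moved


def reduce_within_partitions(partitions):
    """After the shuffle, each partition can sum its keys without outside data."""
    reduced = []
    for records in partitions:
        totals = defaultdict(int)
        for key, value in records:
            totals[key] += value
        reduced.append(dict(totals))
    return reduced


input_partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5), ("a", 6)],
]
shuffled, moved = shuffle_by_key(input_partitions, num_partitions=3)
totals_per_partition = reduce_within_partitions(shuffled)

print(shuffled)
print("records that changed partition:", moved)
print(totals_per_partition)
```

Every record whose target partition differs from its source partition stands for data that a real cluster would have to serialize, write out, and transfer, which is where the cost of a wide transformation comes from.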

Example:

```python
rdd = spark.sparkContext.parallelize([('a', 1), ('b', 2), ('a', 3)])
reduced = rdd.reduceByKey(lambda x, y: x + y)  # wide transformation: this involves a shuffle
print(reduced.collect())  # e.g. [('a', 4), ('b', 2)]; the order of keys may vary
```

**Summary:**

Wide transformations require a shuffle and typically involve operations that redistribute data across partitions based on some key, such as `groupByKey()`, `reduceByKey()`, `join()`, and `distinct()`.
These operations are more expensive due to the need for data movement, network I/O, and possible disk spills.

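*   **Spark History Server:**

    -   A web UI to monitor and visualize the jobs, stages, and tasks of completed applications, reconstructed from their event logs.
    -   The web UI is accessible on port 18080 by default. (The UI of a running application is served separately, on port 4040 by default.)
*   **Job:**

    -   A complete unit of work initiated by an action, consisting of one or more stages.
    -   Rule of thumb: Number of Jobs = Number of Actions called. Some operations launch extra jobs, for example `sortByKey()`, which samples the data to choose partition boundaries.
*   **Stage:**
    -   A subset of a job, bounded by shuffles (wide transformations). Each stage consists of multiple tasks.
    -   Rule of thumb for a linear chain of transformations: Number of Stages = Number of Wide Transformations + 1. Joins and reused shuffle output (skipped stages) can change the count.
*   **Tasks:**
    -   The smallest unit of execution in Spark, operating on a single partition of the data.
    -   Number of Tasks in a Stage = Number of Partitions processed by that stage.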

Narrow Transformation Spark UI:

![local image](./screenshots/narrow-trans-job.png)

![local Image](./screenshots/narrow-trans-stages-tasks.png)

Wide Transformation Spark UI:

![local image](./screenshots/wide-trans-job.png)

![Local Image](./screenshots/wide-trans-stages-tasks.png)