reduceByKey() vs groupByKey() in PySpark

| **Aspect**               | **`reduceByKey()`**                                         | **`groupByKey()`**                                         |
|--------------------------|------------------------------------------------------------|-----------------------------------------------------------|
| **Purpose**              | Combines values with the same key using a specified reduce function. | Groups values with the same key into an iterable.          |
| **Operation Type**       | Transformation that reduces the values of each key.        | Transformation that groups values of each key.            |
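| **Efficiency**           | More efficient on large datasets: values are combined within each partition before the shuffle, so less data is transferred. | Less efficient: every key-value pair is shuffled, and all values for a key must be held together on one executor. |
| **Shuffle Behavior**     | Causes a shuffle, but applies the reduce function map-side first and again when merging the partial results after the shuffle. | Causes a shuffle with no map-side combine; every record is sent across the network. |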
| **Output Format**        | Returns an RDD of key-value pairs, where each key is associated with a reduced value. | Returns an RDD of key-value pairs, where each key is associated with an iterable of values. |
| **Use Case**             | Use when you want to perform an aggregation (e.g., sum, max, etc.) on values for each key. | Use when you need to collect all values for each key, without aggregation. |
| **Memory Usage**         | More memory efficient because only one partial result per key and partition is shuffled and merged. | Can consume much more memory because all values for a key are materialized together, which is risky when keys are skewed. |
| **Typical Function**     | `reduceByKey(func)` where `func` is a commutative and associative function (e.g., `lambda x, y: x + y`). | `groupByKey()` takes no reduce function; it returns each key with an iterable of its values. |
| **Example**              | `rdd.reduceByKey(lambda x, y: x + y)`                      | `rdd.groupByKey()`                                          |

**Key Takeaways:**
- **`reduceByKey()`** is typically preferred for any aggregation (sum, max, count, etc.): it combines values within each partition before the shuffle, so less data crosses the network and less memory is needed.
- **`groupByKey()`** is for cases where you need all the values for each key without reducing them. It is usually less efficient because it shuffles every record and holds the full set of values for each key in memory.
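To make the shuffle difference concrete, here is a minimal plain-Python sketch (not Spark itself) that counts how many records would cross a simulated shuffle under each strategy. The helper names `simulate_group_by_key` and `simulate_reduce_by_key` and the sample partitions are hypothetical and exist only for illustration; real Spark behaviour also depends on partitioning and serialization.

```python
from collections import defaultdict


def simulate_group_by_key(partitions):
    """Send every (key, value) record through the shuffle, then group."""
    shuffled = [pair for part in partitions for pair in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    return dict(grouped), len(shuffled)


def simulate_reduce_by_key(partitions, func):
    """Combine inside each partition first, then merge the partial results."""
    shuffled = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        # Only one partial result per key leaves each partition.
        shuffled.extend(local.items())
    merged = {}
    for key, value in shuffled:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged, len(shuffled)


# Hypothetical sample: two partitions of (status, 1) pairs.
partitions = [
    [("CLOSED", 1), ("CLOSED", 1), ("PENDING", 1)],
    [("CLOSED", 1), ("PENDING", 1), ("PENDING", 1)],
]

grouped, grouped_shuffled = simulate_group_by_key(partitions)
reduced, reduced_shuffled = simulate_reduce_by_key(partitions, lambda x, y: x + y)

print("records shuffled (groupByKey style): ", grouped_shuffled)
print("records shuffled (reduceByKey style):", reduced_shuffled)
print("reduced result:", reduced)
```

In this toy input the grouping strategy moves all six records, while the reducing strategy moves four partial results (one per key per partition). The gap widens as the number of records per key grows.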

Example:

reduceByKey()

In [None]:
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PySparkExample") \
    .getOrCreate()

In [2]:
base_rdd = spark.sparkContext.textFile("/Users/sugumarsrinivasan/Documents/data/orders_4gb.csv")

In [None]:
header = base_rdd.first()

In [4]:
data_without_header = base_rdd.filter(lambda line: line != header)

In [None]:
data_without_header.take(5)

In [6]:
mapped_rdd = data_without_header.map(lambda x: (x.split(",")[3],1))

In [7]:
reduced_rdd = mapped_rdd.reduceByKey(lambda x,y: x+y)

In [None]:
reduced_rdd.collect()

groupByKey()

In [9]:
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PySparkExample") \
    .getOrCreate()

In [10]:
base_rdd = spark.sparkContext.textFile("/Users/sugumarsrinivasan/Documents/data/orders.csv")

In [11]:
header = base_rdd.first()

In [12]:
data_without_header = base_rdd.filter(lambda line: line != header)

In [13]:
# Key by column 3 and keep column 2 as the value to be grouped
mapped_rdd = data_without_header.map(lambda x: (x.split(",")[3], x.split(",")[2]))

In [14]:
grouped_rdd = mapped_rdd.groupByKey()

In [15]:
result = grouped_rdd.map(lambda x: (x[0],len(x[1])))

In [None]:
result.collect()
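Both pipelines above produce a count per key: the first sums a `1` for each record, the second groups the values and takes the length of each group. The plain-Python sketch below mirrors those steps on a few in-memory lines to show that the results agree. The sample rows and column names are assumptions made for illustration, not the contents of the actual CSV files.

```python
from collections import defaultdict

# Hypothetical sample lines; the column layout is an assumption.
lines = [
    "order_id,order_date,customer_id,order_status",
    "1,2013-07-25,11599,CLOSED",
    "2,2013-07-25,256,PENDING_PAYMENT",
    "3,2013-07-25,12111,COMPLETE",
    "4,2013-07-25,8827,CLOSED",
]

# Drop the header, as the filter step does.
header = lines[0]
rows = [line for line in lines if line != header]

# reduceByKey style: map each row to (key, 1) and add the ones up.
counts_by_reduce = defaultdict(int)
for line in rows:
    counts_by_reduce[line.split(",")[3]] += 1

# groupByKey style: collect column 2 per key, then count each group.
groups = defaultdict(list)
for line in rows:
    fields = line.split(",")
    groups[fields[3]].append(fields[2])
counts_by_group = {key: len(values) for key, values in groups.items()}

print(dict(counts_by_reduce))
print(counts_by_group)
```

The grouped version builds a full list of values for each key only to count them, which is why the summing approach is the better fit when a count is all that is needed.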