# Skewed data in Spark

## Issues with skewed data in processing

- Imbalanced workload: some partitions may hold significantly more data than others, so their tasks take much longer to complete and slow down the entire job.
- Out-of-memory errors: an oversized partition may not fit in executor memory, which is particularly problematic if the data is cached in memory for iterative processing.
- Uneven resource usage: a skewed partition may consume a disproportionate amount of resources (such as CPU or memory), leading to inefficient resource utilization.
- Slow processing times: skewed data slows down operations such as joins and aggregations, which require shuffling data between partitions.
- Job failures: skewed partitions can cause out-of-memory errors or long-running tasks that exceed the maximum allotted time, either of which can fail the job.
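
The imbalance described above can be illustrated without a cluster. The following is a minimal pure-Python sketch, not Spark code: `partition_of` is a hypothetical stand-in for a hash partitioner, and the dataset (one hot key with 9,000 records plus 100 small keys), the key names, and the partition count of 8 are all assumptions chosen for illustration.

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 8  # assumed partition count, for illustration only


def partition_of(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Hypothetical stand-in for a hash partitioner: same key, same partition."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


def partition_sizes(keys, num_partitions: int = NUM_PARTITIONS):
    """Count how many records land in each partition."""
    counts = Counter(partition_of(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


# Assumed dataset: one hot key with 9,000 records, 100 other keys with 10 each.
keys = ["hot_customer"] * 9_000 + [
    f"customer_{i}" for i in range(100) for _ in range(10)
]

sizes = partition_sizes(keys)
mean_size = sum(sizes) / len(sizes)
skew_ratio = max(sizes) / mean_size  # 1.0 would mean perfectly balanced

print("records per partition:", sizes)
print("largest partition vs. mean:", round(skew_ratio, 2))
```

Because every record with the hot key hashes to the same partition, that partition holds at least 9,000 of the 10,000 records, and the task processing it dominates the runtime of the stage.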

## How to handle skewed data?

- Salting: add a random prefix to the key of each record so that records sharing a hot key are spread across several partitions instead of piling up in one.
- Bucketing: partition data based on the values of a key into a fixed number of buckets.
- Broadcast join:
    - Applies to join operations in which one table is small enough to fit in memory on each executor.
    - Broadcast the small table to every executor so that the large table is joined in place, reducing the amount of data that needs to be shuffled.

        &rarr; Reduces the impact of skewed data.
- Co-partitioning: partition two tables using the same partitioning scheme so that rows with matching keys land in corresponding partitions and a join between them needs no additional shuffle.
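
Salting is the least obvious of the techniques above, so here is a minimal pure-Python sketch of the idea, again not Spark code. It assumes a hypothetical hash partitioner (`partition_of`), 8 partitions, 16 salt values, and an invented dataset with one hot key. It also shows the two-stage aggregation that salting requires: aggregate per salted key first, then strip the salt and merge the partial results.

```python
import hashlib
import random
from collections import Counter

NUM_PARTITIONS = 8  # assumed partition count
NUM_SALTS = 16      # assumed number of distinct salt values


def partition_of(key: str) -> int:
    """Hypothetical stand-in for a hash partitioner."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS


def partition_sizes(keys):
    """Count how many records land in each partition."""
    counts = Counter(partition_of(k) for k in keys)
    return [counts.get(p, 0) for p in range(NUM_PARTITIONS)]


def add_salt(key: str, rng: random.Random) -> str:
    """Prefix the key with a random salt, e.g. 'hot_customer' -> '7_hot_customer'."""
    return f"{rng.randrange(NUM_SALTS)}_{key}"


def strip_salt(salted_key: str) -> str:
    """Remove the salt prefix to recover the original key."""
    return salted_key.split("_", 1)[1]


# Assumed dataset: one hot key with 9,000 records, 100 other keys with 10 each.
keys = ["hot_customer"] * 9_000 + [
    f"customer_{i}" for i in range(100) for _ in range(10)
]

rng = random.Random(42)  # fixed seed so the sketch is reproducible
salted_keys = [add_salt(k, rng) for k in keys]

plain_sizes = partition_sizes(keys)
salted_sizes = partition_sizes(salted_keys)

# Stage 1: partial counts per salted key (the skew-free, parallel step).
partial_counts = Counter(salted_keys)

# Stage 2: strip the salt and merge the partial counts per original key.
final_counts = Counter()
for salted_key, n in partial_counts.items():
    final_counts[strip_salt(salted_key)] += n

print("largest partition without salt:", max(plain_sizes))
print("largest partition with salt:   ", max(salted_sizes))
print("count for hot key after merge: ", final_counts["hot_customer"])
```

The trade-off is an extra aggregation step. For joins, the other table additionally has to be replicated once per salt value so that every salted key still finds its match.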