# Data Skew
<p align='center'><img src='.././images/data_skew.webp' height=300><br>Figure: evenly distributed (left) vs. skewed (right) data. </p>

References:
- [Solving Data Skewness in Spark](https://www.junaideffendi.com/blog/solving-data-skewness-in-spark/)
- [Apache Spark Core – Practical Optimization, Daniel Tomes (Databricks)](https://www.youtube.com/watch?v=_ArCesElWp8)

In [4]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

In [None]:
spark = SparkSession.builder\
        .master("local")\
        .appName("data-skew")\
        .config("spark.ui.port", "4050")\
        .enableHiveSupport()\
        .getOrCreate()

In [3]:
spark

In [2]:
df = spark.read.option("header", "true")\
               .option("nullValue", "?") \
               .option("inferSchema", "true")\
               .csv("../data/linkage/block_*.csv")

                                                                                

## Spark Shuffle Partition
- Depending on your dataset size, number of cores, and memory, the number of shuffle partitions can help or hurt your jobs. When you are dealing with a small amount of data, you should typically reduce the shuffle partitions; otherwise you end up with many partitions that each hold only a few records, which means running many tasks with very little data to process.

- On the other hand, when you have a lot of data and too few partitions, you get fewer, longer-running tasks, and sometimes an out-of-memory error.

- Getting the number of shuffle partitions right is tricky and often takes several runs with different values. It is one of the key properties to check when a Spark job has performance issues.
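As a starting point for that trial and error, the trade-off above can be turned into a rough sizing rule. The sketch below is plain Python and only illustrative: the function name `suggest_shuffle_partitions`, the 128 MB target per partition, and the rounding to a multiple of the core count are all assumptions, not Spark defaults.

```python
import math

def suggest_shuffle_partitions(shuffle_bytes, target_partition_bytes=128 * 1024**2, total_cores=1):
    """Rule-of-thumb partition count (hypothetical helper).

    Aims for roughly target_partition_bytes per partition, then rounds up
    to a multiple of total_cores. Both defaults are assumptions to tune.
    """
    by_size = max(1, math.ceil(shuffle_bytes / target_partition_bytes))
    # Round up to a multiple of the core count so no core sits idle in a wave of tasks.
    return math.ceil(by_size / total_cores) * total_cores

small = suggest_shuffle_partitions(500 * 1024**2, total_cores=4)   # ~500 MB shuffled
large = suggest_shuffle_partitions(200 * 1024**3, total_cores=16)  # ~200 GB shuffled
print(small, large)

# In a live session the chosen value could then be applied with:
# spark.conf.set("spark.sql.shuffle.partitions", str(small))
```

The result is only a first guess; the actual shuffle size and task durations in the Spark UI remain the better guide.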

In [15]:
spark.conf.get("spark.sql.shuffle.partitions") #default 200 

'200'

In [11]:
df.count()

                                                                                

5749132

In [20]:
df.show(2, vertical=True)



-RECORD 0-------------------------
 id_1         | 15879             
 id_2         | 77381             
 cmp_fname_c1 | 1.0               
 cmp_fname_c2 | null              
 cmp_lname_c1 | 0.142857142857143 
 cmp_lname_c2 | null              
 cmp_sex      | 1                 
 cmp_bd       | 1                 
 cmp_bm       | 0                 
 cmp_by       | 0                 
 cmp_plz      | 0                 
 is_match     | false             
-RECORD 1-------------------------
 id_1         | 22735             
 id_2         | 51381             
 cmp_fname_c1 | 1.0               
 cmp_fname_c2 | null              
 cmp_lname_c1 | 0.0               
 cmp_lname_c2 | null              
 cmp_sex      | 1                 
 cmp_bd       | 0                 
 cmp_bm       | 0                 
 cmp_by       | 1                 
 cmp_plz      | 0                 
 is_match     | false             
only showing top 2 rows



                                                                                

In [22]:
df = df.repartition(5, 'cmp_sex')
df.rdd.getNumPartitions()



5

In [23]:
# to get the counts of rows per partition.
df.withColumn("partition", F.spark_partition_id()).groupBy("partition").count().show() 



+---------+-------+
|partition|  count|
+---------+-------+
|        1| 258703|
|        3|5490429|
+---------+-------+
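
Only two of the five partitions hold any rows, which is what hash partitioning on a column with very few distinct values would produce. The sketch below is a plain-Python stand-in: it uses Python's built-in `hash` rather than Spark's own partitioner, so the partition ids differ from the ones above, and the helper names `assign_partitions` and `skew_ratio` are hypothetical. It also computes a simple skew ratio from the counts printed above.

```python
from collections import Counter

def assign_partitions(keys, num_partitions):
    """Rows per partition under a toy hash partitioner: hash(key) % num_partitions."""
    return Counter(hash(k) % num_partitions for k in keys)

def skew_ratio(counts, num_partitions):
    """Largest partition divided by the ideal (perfectly even) partition size."""
    total = sum(counts)
    return max(counts) / (total / num_partitions)

# A column with only two distinct values can fill at most two partitions,
# no matter how many partitions are requested.
toy_keys = [0] * 5 + [1] * 95
toy_counts = assign_partitions(toy_keys, 5)
print(len(toy_counts))  # number of non-empty partitions

# Row counts copied from the notebook output above (partitions 1 and 3).
observed = [258703, 5490429]
ratio = skew_ratio(observed, 5)
print(round(ratio, 2))
```

A ratio of 1.0 would mean a perfectly even split; the further above 1.0, the longer the slowest task runs relative to the ideal.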



                                                                                

## Data Skew Correction
- In the real world, perfect data distributions are rare. Often when reading data, we are pulling from pre-partitioned files or ETL pipelines whose output is not automatically evenly distributed.

### 1. Repartition by Column(s)
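- The first solution is to repartition your data logically, based on the transformations you are about to run.
    - If you’re grouping or joining, partitioning by the groupBy/join columns can improve shuffle efficiency.
    - This only helps when the chosen columns have enough distinct, evenly spread values; the `cmp_sex` example above shows the opposite case.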

`df = df.repartition(<n_partitions>, '<col_1>', '<col_2>',...)`

### 2. Salt
- Salting is a common technique for fixing data skew.
- The idea is to add a random key so that rows are spread evenly across partitions rather than piling up on a few hot join keys.
- Create a column with a random value, then partition by that column.

In [24]:
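# Add a random value in [0, 1) to every row to use as the salt.
df = df.withColumn('salt', F.rand())
# Hash-partition on the salt so rows spread roughly evenly across 8 partitions.
df = df.repartition(8, 'salt')
# Count rows per partition to verify the distribution.
df.groupBy(F.spark_partition_id()).count().show()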




+--------------------+------+
|SPARK_PARTITION_ID()| count|
+--------------------+------+
|                   0|717940|
|                   1|717987|
|                   2|718592|
|                   3|720004|
|                   4|719440|
|                   5|717955|
|                   6|718421|
|                   7|718793|
+--------------------+------+
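
Repartitioning on a random column evens out the partitions, but a skewed `groupBy` or join would still gather every row of a hot key in one place. A common variant of salting therefore attaches the salt to the key and aggregates in two stages. The sketch below shows that idea in plain Python rather than Spark; `salted_sum`, `num_salts`, and the toy rows are hypothetical. In PySpark the salt would typically be an integer column, for example built with `F.floor(F.rand() * n)`, though that wiring is not shown here.

```python
import random
from collections import defaultdict

def salted_sum(rows, num_salts, seed=0):
    """Two-stage sum per key: partial sums per (key, salt), then merged per key."""
    rng = random.Random(seed)
    partial = defaultdict(int)
    for key, value in rows:
        salt = rng.randrange(num_salts)   # random salt in [0, num_salts)
        partial[(key, salt)] += value     # stage 1: aggregate on the salted key
    final = defaultdict(int)
    for (key, _salt), subtotal in partial.items():
        final[key] += subtotal            # stage 2: drop the salt and merge
    return dict(final), dict(partial)

# One "hot" key dominates the data, as in a skewed groupBy.
rows = [("hot", 1)] * 10_000 + [("cold", 1)] * 100
totals, partial = salted_sum(rows, num_salts=8)
print(totals)

# Largest single group in stage 1; without salting it would hold all 10,000 "hot" rows.
print(max(partial.values()))
```

The final totals match an unsalted sum, while the work for the hot key is split across up to `num_salts` smaller groups in the first stage.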



                                                                                