## Partitioning Spark DataFrame

In [1]:
from pyspark.sql import SparkSession
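# Import only what is used; a wildcard import would shadow Python builtins such as sum and max
from pyspark.sql.functions import date_format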

In [2]:
userName = 'CodeInDNA'
spark = SparkSession. \
        builder. \
        appName(f'{userName} - JoinSparkDF'). \
        getOrCreate()

In [20]:
ordersDF = spark.read.json('../data/orders.json')

In [22]:
ordersDF.show(2)

+-----------------+--------------------+--------+---------------+
|order_customer_id|          order_date|order_id|   order_status|
+-----------------+--------------------+--------+---------------+
|            11599|2013-07-25 00:00:...|       1|         CLOSED|
|              256|2013-07-25 00:00:...|       2|PENDING_PAYMENT|
+-----------------+--------------------+--------+---------------+
only showing top 2 rows



In [23]:
help(ordersDF.write.partitionBy)

Help on method partitionBy in module pyspark.sql.readwriter:

partitionBy(*cols) method of pyspark.sql.readwriter.DataFrameWriter instance
    Partitions the output by the given columns on the file system.
    
    If specified, the output is laid out on the file system similar
    to Hive's partitioning scheme.
    
    :param cols: name of columns
    
    >>> df.write.partitionBy('year', 'month').parquet(os.path.join(tempfile.mkdtemp(), 'data'))
    
    .. versionadded:: 1.4



In [24]:
# json() has no partitionBy keyword argument; to partition JSON output, call write.partitionBy(...) before json()
help(ordersDF.write.json)

Help on method json in module pyspark.sql.readwriter:

json(path, mode=None, compression=None, dateFormat=None, timestampFormat=None, lineSep=None, encoding=None, ignoreNullFields=None) method of pyspark.sql.readwriter.DataFrameWriter instance
    Saves the content of the :class:`DataFrame` in JSON format
    (`JSON Lines text format or newline-delimited JSON <http://jsonlines.org/>`_) at the
    specified path.
    
    :param path: the path in any Hadoop supported file system
    :param mode: specifies the behavior of the save operation when data already exists.
    
        * ``append``: Append contents of this :class:`DataFrame` to existing data.
        * ``overwrite``: Overwrite existing data.
        * ``ignore``: Silently ignore this operation if data already exists.
        * ``error`` or ``errorifexists`` (default case): Throw an exception if data already exists.
    :param compression: compression codec to use when saving to file. This can be one of the
     

In [25]:
# parquet() accepts a keyword argument called partitionBy
help(ordersDF.write.parquet)

Help on method parquet in module pyspark.sql.readwriter:

parquet(path, mode=None, partitionBy=None, compression=None) method of pyspark.sql.readwriter.DataFrameWriter instance
    Saves the content of the :class:`DataFrame` in Parquet format at the specified path.
    
    :param path: the path in any Hadoop supported file system
    :param mode: specifies the behavior of the save operation when data already exists.
    
        * ``append``: Append contents of this :class:`DataFrame` to existing data.
        * ``overwrite``: Overwrite existing data.
        * ``ignore``: Silently ignore this operation if data already exists.
        * ``error`` or ``errorifexists`` (default case): Throw an exception if data already exists.
    :param partitionBy: names of partitioning columns
    :param compression: compression codec to use when saving to file. This can be one of the
                        known case-insensitive shorten names (none, uncompressed, snappy, gzip,
     

In [26]:
help(ordersDF.write.orc)

Help on method orc in module pyspark.sql.readwriter:

orc(path, mode=None, partitionBy=None, compression=None) method of pyspark.sql.readwriter.DataFrameWriter instance
    Saves the content of the :class:`DataFrame` in ORC format at the specified path.
    
    :param path: the path in any Hadoop supported file system
    :param mode: specifies the behavior of the save operation when data already exists.
    
        * ``append``: Append contents of this :class:`DataFrame` to existing data.
        * ``overwrite``: Overwrite existing data.
        * ``ignore``: Silently ignore this operation if data already exists.
        * ``error`` or ``errorifexists`` (default case): Throw an exception if data already exists.
    :param partitionBy: names of partitioning columns
    :param compression: compression codec to use when saving to file. This can be one of the
                        known case-insensitive shorten names (none, snappy, zlib, and lzo).
                     

#### Partition data by date

In [27]:
ordersDF.show(2)

+-----------------+--------------------+--------+---------------+
|order_customer_id|          order_date|order_id|   order_status|
+-----------------+--------------------+--------+---------------+
|            11599|2013-07-25 00:00:...|       1|         CLOSED|
|              256|2013-07-25 00:00:...|       2|PENDING_PAYMENT|
+-----------------+--------------------+--------+---------------+
only showing top 2 rows



In [28]:
ordersDF.printSchema()

root
 |-- order_customer_id: long (nullable = true)
 |-- order_date: string (nullable = true)
 |-- order_id: long (nullable = true)
 |-- order_status: string (nullable = true)



In [30]:
(ordersDF.
withColumn('order_date', date_format('order_date', 'yyyyMMdd')).
show(5))

+-----------------+----------+--------+---------------+
|order_customer_id|order_date|order_id|   order_status|
+-----------------+----------+--------+---------------+
|            11599|  20130725|       1|         CLOSED|
|              256|  20130725|       2|PENDING_PAYMENT|
|            12111|  20130725|       3|       COMPLETE|
|             8827|  20130725|       4|         CLOSED|
|            11318|  20130725|       5|       COMPLETE|
+-----------------+----------+--------+---------------+
only showing top 5 rows



In [31]:
(ordersDF.
withColumn('order_date', date_format('order_date', 'yyyyMMdd')).
coalesce(1).
write.
partitionBy('order_date').
parquet('../data/orders_partitioned_by_date'))

In [32]:
ordersDF.count()

68883

In [33]:
spark.read.parquet('../data/orders_partitioned_by_date/').count()

68883

#### Partition data by month

In [34]:
(ordersDF.
withColumn('order_month', date_format('order_date', 'yyyyMM')).
coalesce(1).
write.
partitionBy('order_month').
parquet('../data/orders_partitioned_by_month'))

In [36]:
spark.read.parquet('../data/orders_partitioned_by_month/').count()

68883

#### Partition by year, month and then day of month

In [39]:
(ordersDF.
withColumn('year', date_format('order_date', 'yyyy')).
withColumn('month', date_format('order_date', 'MM')).
withColumn('day_of_month', date_format('order_date', 'dd')).
show(5))

+-----------------+--------------------+--------+---------------+----+-----+------------+
|order_customer_id|          order_date|order_id|   order_status|year|month|day_of_month|
+-----------------+--------------------+--------+---------------+----+-----+------------+
|            11599|2013-07-25 00:00:...|       1|         CLOSED|2013|   07|          25|
|              256|2013-07-25 00:00:...|       2|PENDING_PAYMENT|2013|   07|          25|
|            12111|2013-07-25 00:00:...|       3|       COMPLETE|2013|   07|          25|
|             8827|2013-07-25 00:00:...|       4|         CLOSED|2013|   07|          25|
|            11318|2013-07-25 00:00:...|       5|       COMPLETE|2013|   07|          25|
+-----------------+--------------------+--------+---------------+----+-----+------------+
only showing top 5 rows



In [43]:
(ordersDF.
withColumn('year', date_format('order_date', 'yyyy')).
withColumn('month', date_format('order_date', 'MM')).
withColumn('day_of_month', date_format('order_date', 'dd')).
coalesce(1).
write.
partitionBy('year', 'month', 'day_of_month').
parquet('../data/orders_paritioned_by_ymd'))

In [44]:
spark.read.parquet('../data/orders_paritioned_by_ymd/').count()

68883

In [85]:
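# Note: this schema string is defined but never passed to the reader,
# so the column types shown below are the ones Spark infers on its own
schema = """order_id INT, order_customer_id INT, order_status STRING, order_date STRING"""
df = spark.read.parquet('../data/orders_partitioned_by_date/')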

In [86]:
df.dtypes

[('order_customer_id', 'bigint'),
 ('order_id', 'bigint'),
 ('order_status', 'string'),
 ('order_date', 'int')]

In [88]:
df.show(5)

+-----------------+--------+---------------+----------+
|order_customer_id|order_id|   order_status|order_date|
+-----------------+--------+---------------+----------+
|             6471|   15793|       COMPLETE|  20131103|
|             5323|   15794|     PROCESSING|  20131103|
|            10096|   15795|         CLOSED|  20131103|
|            11665|   15796|       COMPLETE|  20131103|
|             6249|   15797|PENDING_PAYMENT|  20131103|
+-----------------+--------+---------------+----------+
only showing top 5 rows



#### Partition Pruning

In [89]:
sum_df = (df.
where("order_date=='20130725'"))

In [90]:
# Without partition pruning, Spark would scan every partition directory and only then apply the filter, which wastes I/O.
# Because order_date is the partition column, the filter appears under PartitionFilters in the plan below, so only the matching directory is read.
# explain(mode='formatted') requires Spark 3.0 or later.
sum_df.explain(mode='formatted')

== Physical Plan ==
* ColumnarToRow (2)
+- Scan parquet  (1)


(1) Scan parquet 
Output [4]: [order_customer_id#931L, order_id#932L, order_status#933, order_date#934]
Batched: true
Location: InMemoryFileIndex [file:/E:/Practice/PySpark/data/orders_partitioned_by_date]
PartitionFilters: [isnotnull(order_date#934), (order_date#934 = 20130725)]
ReadSchema: struct<order_customer_id:bigint,order_id:bigint,order_status:string>

(2) ColumnarToRow [codegen id : 1]
Input [4]: [order_customer_id#931L, order_id#932L, order_status#933, order_date#934]
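
The idea behind the `PartitionFilters` entry can be illustrated without Spark. The following is a toy model only, not Spark's implementation: the helper names `parse_partition_dir` and `prune_partitions` are hypothetical, and the directory list is made up. It shows that the filter can be decided from directory names alone, before any data file is opened.

```python
def parse_partition_dir(dir_name):
    """Split a Hive-style directory name such as 'order_date=20130725'."""
    column, _, value = dir_name.partition("=")
    return column, value


def prune_partitions(dir_names, column, wanted_value):
    """Keep only the directories whose encoded value matches the filter."""
    selected = []
    for dir_name in dir_names:
        dir_column, dir_value = parse_partition_dir(dir_name)
        if dir_column == column and dir_value == wanted_value:
            selected.append(dir_name)
    return selected


# Made-up listing of a dataset partitioned by order_date
partition_dirs = [
    "order_date=20130725",
    "order_date=20130726",
    "order_date=20131103",
]

to_scan = prune_partitions(partition_dirs, "order_date", "20130725")
print(to_scan)  # ['order_date=20130725']
```

A filter on a non-partition column such as `order_status` cannot be resolved this way, because its values are stored inside the data files rather than in the path.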




In [92]:
sum_df.show(5)

+-----------------+--------+---------------+----------+
|order_customer_id|order_id|   order_status|order_date|
+-----------------+--------+---------------+----------+
|            11599|       1|         CLOSED|  20130725|
|              256|       2|PENDING_PAYMENT|  20130725|
|            12111|       3|       COMPLETE|  20130725|
|             8827|       4|         CLOSED|  20130725|
|            11318|       5|       COMPLETE|  20130725|
+-----------------+--------+---------------+----------+
only showing top 5 rows

