## PySpark partitionBy() – Write to Disk Example

PySpark partitionBy() is a method of the pyspark.sql.DataFrameWriter class. It splits a large dataset (DataFrame) into smaller files, organized by the values of one or more columns, when the data is written to disk.

In [2]:
from pyspark.sql import SparkSession
spark=SparkSession.builder.appName("spark sql").master("local[*]").getOrCreate()



In [6]:
"""
Partitioning the data on the file system is a way to improve the performance of the query when
dealing with a large dataset in the Data lake. A Data Lake is a centralized repository of structured,
semi-structured, unstructured, and binary data that allows you to store a large amount of data as-is
in its original raw format.
"""

## 2. Partition Advantages
PySpark is designed to process large datasets up to 100x faster than traditional processing, which would not be possible without partitioning. Below are some of the advantages of using PySpark partitions in memory or on disk.

- Fast access to the data.
- The ability to perform an operation on a smaller dataset.
- Partitioning at rest (on disk) is a feature of many databases and data processing frameworks, and it is key to making jobs work at scale.



In [8]:
df=spark.read.option("header",True) \
        .csv("../data/simple-zipcodes.csv")
df.printSchema()
df.count()

root
 |-- RecordNumber: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Zipcode: string (nullable = true)
 |-- State: string (nullable = true)



20

In [9]:
#partitionBy()
df.write.option("header",True) \
        .partitionBy("state") \
        .mode("overwrite") \
        .csv("../data/zipcodes-state")
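
To make the resulting folder layout concrete without a Spark session, here is a minimal plain-Python imitation of what a write partitioned by one column produces: one `state=<value>` folder per distinct value, with the partition column left out of the files themselves. This is a sketch, not Spark's implementation. The function name `write_partitioned`, the file name `part-00000.csv`, and the `XX` row are assumptions made for illustration; the three AL rows are taken from the notebook output.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path

HEADER = ["RecordNumber", "Country", "City", "Zipcode", "State"]
ROWS = [
    ["54356", "US", "SPRUCE PINE", "35585", "AL"],
    ["54354", "US", "SPRING GARDEN", "36275", "AL"],
    ["54355", "US", "SPRINGVILLE", "35146", "AL"],
    ["0", "US", "PLACEHOLDER CITY", "00000", "XX"],  # hypothetical row
]


def write_partitioned(rows, header, key, out_dir):
    """Write one <key>=<value> folder per distinct value of the key column."""
    # Match the column name case-insensitively, as the notebook does
    # when it passes "state" for the "State" column.
    key_idx = [h.lower() for h in header].index(key.lower())
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_idx]].append(row)

    # The partition column lives in the folder name, not in the file.
    kept_header = [c for i, c in enumerate(header) if i != key_idx]
    for value, group in groups.items():
        part_dir = Path(out_dir) / f"{key}={value}"
        part_dir.mkdir(parents=True, exist_ok=True)
        with open(part_dir / "part-00000.csv", "w", newline="") as fh:
            writer = csv.writer(fh)
            writer.writerow(kept_header)
            for row in group:
                writer.writerow([v for i, v in enumerate(row) if i != key_idx])
    return sorted(p.name for p in Path(out_dir).iterdir())


out_dir = tempfile.mkdtemp()
folders = write_partitioned(ROWS, HEADER, "state", out_dir)
print(folders)  # ['state=AL', 'state=XX']
```

Because the state value is stored only in the folder name, a reader that opens a single folder sees four columns rather than five.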


In [10]:
df.rdd.getNumPartitions()

1

In [13]:
"""
In each partition directory you may see one or more part files (our dataset is small, so
all records for each state end up in a single part file). You can change this by calling repartition()
on the data in memory before writing. The number passed to repartition() is the maximum number of
part files written for each state.
"""
#partitionBy()
df.write.option("header",True) \
        .partitionBy("state","city") \
        .mode("overwrite") \
        .csv("../data/zipcodes-state")


"""
This creates a folder hierarchy that follows the order of the partition columns. Because state is
listed first and city second, each state folder contains one city folder for every city in
that state.
"""

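The nesting order can be sketched in plain Python. The helper `partition_path` below is hypothetical and only builds the relative folder path for one record; it ignores any escaping of special characters in values that a real writer may apply.

```python
from pathlib import PurePosixPath


def partition_path(record, partition_cols):
    """Build the nested <col>=<value> path in the order the columns are given."""
    parts = [f"{col}={record[col]}" for col in partition_cols]
    return str(PurePosixPath(*parts))


record = {"state": "AL", "city": "SPRINGVILLE"}

nested = partition_path(record, ["state", "city"])
swapped = partition_path(record, ["city", "state"])
print(nested)   # state=AL/city=SPRINGVILLE
print(swapped)  # city=SPRINGVILLE/state=AL
```

Swapping the column order changes which folder is the parent, so the column you filter on most often is usually the better choice for the first level.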


In [21]:
"""
6. Using repartition() and partitionBy() together

To split the data for each partition column value into several files,
use repartition() and partitionBy() together, as in the example below.

repartition() creates the specified number of partitions in memory. partitionBy()
then writes one file to disk for each combination of memory partition and partition column value.
"""

#Use repartition() and partitionBy() together
df.repartition(2) \
        .write.option("header",True) \
        .partitionBy("state") \
        .mode("overwrite") \
        .csv("../data/zipcodes-state-more")
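
As a rough way to reason about how many part files such a write can produce, the sketch below counts the distinct (memory partition, state) pairs. It assumes rows are dealt to memory partitions round-robin, which only stands in for whatever distribution repartition() actually produces, and it ignores other settings that can split files further. The function name `count_part_files` and the `XX` values are hypothetical.

```python
from itertools import cycle


def count_part_files(partition_values, num_memory_partitions):
    """Count distinct (memory partition, partition value) pairs."""
    pairs = set()
    memory_ids = cycle(range(num_memory_partitions))
    for memory_id, value in zip(memory_ids, partition_values):
        pairs.add((memory_id, value))
    return len(pairs)


# Three AL rows as in the notebook output, plus two hypothetical XX rows.
states = ["AL", "AL", "AL", "XX", "XX"]

files_with_one = count_part_files(states, 1)
files_with_two = count_part_files(states, 2)
print(files_with_one)  # 2
print(files_with_two)  # 4
```

The count never exceeds the number of memory partitions multiplied by the number of distinct states, and it is lower whenever a memory partition holds no rows for some state.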

In [22]:
"""
7. Data Skew – Control Number of Records per Partition File
"""

# Limit the number of records written to each part file with maxRecordsPerFile
df.write.option("header",True) \
        .option("maxRecordsPerFile", 2) \
        .partitionBy("state") \
        .mode("overwrite") \
        .csv("../data/zipcodes-state")
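
The effect of a per-file record limit can be illustrated in plain Python: the records of one partition are cut into consecutive chunks of at most the given size, and each chunk would become its own part file. This is a sketch of the idea, not Spark's writer; `split_into_files` is a hypothetical name, and treating a non-positive limit as "no limit" is an assumption of this sketch.

```python
def split_into_files(records, max_records_per_file):
    """Split one partition's records into chunks of at most the given size."""
    if max_records_per_file <= 0:
        # Assumption for this sketch: a non-positive limit means no limit.
        return [list(records)]
    return [
        list(records[start:start + max_records_per_file])
        for start in range(0, len(records), max_records_per_file)
    ]


# The three AL cities from the notebook output.
al_cities = ["SPRUCE PINE", "SPRING GARDEN", "SPRINGVILLE"]

chunks = split_into_files(al_cities, 2)
print([len(chunk) for chunk in chunks])  # [2, 1]
```

With a limit of 2, the three AL records end up in two files, which is why the state=AL folder holds more than one part file after this write.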

In [24]:

# 8. Read a Specific Partition
dfSinglePart=spark.read.option("header",True) \
            .csv("../data/zipcodes-state/state=AL")
dfSinglePart.printSchema()
dfSinglePart.show()



root
 |-- RecordNumber: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Zipcode: string (nullable = true)

+------------+-------+-------------+-------+
|RecordNumber|Country|         City|Zipcode|
+------------+-------+-------------+-------+
|       54356|     US|  SPRUCE PINE|  35585|
|       54354|     US|SPRING GARDEN|  36275|
|       54355|     US|  SPRINGVILLE|  35146|
+------------+-------+-------------+-------+
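
The schema above has no State column because the read starts inside the state=AL folder, so the folder name that carries the state value is not part of what is scanned. The plain-Python sketch below illustrates the idea: only `key=value` folders below the read root are turned into columns. The function name `discovered_columns` and the file name `part-00000.csv` are hypothetical, and this is not Spark's partition discovery code.

```python
from pathlib import PurePosixPath


def discovered_columns(read_root, file_path):
    """Collect key=value folder names that lie between the read root and the file."""
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(read_root))
    columns = {}
    for segment in relative.parts[:-1]:  # skip the file name itself
        key, sep, value = segment.partition("=")
        if sep:
            columns[key] = value
    return columns


part_file = "../data/zipcodes-state/state=AL/part-00000.csv"  # hypothetical file name

from_parent = discovered_columns("../data/zipcodes-state", part_file)
from_partition = discovered_columns("../data/zipcodes-state/state=AL", part_file)
print(from_parent)     # {'state': 'AL'}
print(from_partition)  # {}
```

In Spark itself, reading the parent folder and filtering on the state column, or setting the `basePath` read option to the parent folder, are the usual ways to keep the partition column in the result.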

