
# PySpark PartitionBy()

PySpark partitionBy() is a method that splits a DataFrame into separate directories of files, one per distinct value of one or more key columns, while writing to disk.

PySpark supports partitioning in two ways: **partitioning in memory** (DataFrame) and **partitioning on disk** (file system).


**Partition in memory:** You can change the number of in-memory partitions of a DataFrame by calling the repartition() or coalesce() transformations.

**Partition on disk:** While writing the PySpark DataFrame back to disk, you can choose how to partition the data based on columns using partitionBy().

**Note:** partitionBy() is a method of the **DataFrameWriter** class, so it only applies when writing a DataFrame to disk.

In [None]:
#Syntax of partitionBy(); it is called on df.write (a DataFrameWriter)
#df.write.partitionBy("key column name")

In [None]:
#Example usage of partitionBy()

df.write.option("header",True) \
        .partitionBy("state") \
        .mode("overwrite") \
        .csv("/tmp/zipcodes-state")

The above code writes the data partitioned by the key column 'state', creating one subdirectory per distinct state value under /tmp/zipcodes-state.

In [None]:
#partitionBy() multiple columns
df.write.option("header",True) \
        .partitionBy("state","city") \
        .mode("overwrite") \
        .csv("/tmp/zipcodes-state")

The above code writes the data partitioned by the key columns 'state' and 'city', with a 'city' subdirectory nested inside each 'state' directory.
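
To make the on-disk layout concrete, here is a small plain-Python sketch (not Spark code) that builds the Hive-style `column=value` directory a row would land in. The helper `partition_dir` and the sample rows are hypothetical, and the sketch ignores details such as escaping of special characters and null values.

```python
from pathlib import PurePosixPath


def partition_dir(base, partition_cols, row):
    """Return the column=value directory path for one row (toy model)."""
    path = PurePosixPath(base)
    # Directories nest in the same order as the columns passed to partitionBy().
    for col in partition_cols:
        path = path / f"{col}={row[col]}"
    return str(path)


# Hypothetical sample rows, used only to show the naming pattern.
rows = [
    {"state": "AL", "city": "SPRINGVILLE", "zipcode": 35146},
    {"state": "NC", "city": "ASHEBORO", "zipcode": 27203},
]

for row in rows:
    print(partition_dir("/tmp/zipcodes-state", ["state", "city"], row))
```

The partition columns are encoded in the directory names, and the data files inside each directory hold the remaining columns.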



**Data Skew: Control Number of Records per Partition File**

Use the **maxRecordsPerFile** option to cap the number of records written to each file within a partition. This is particularly helpful when your data is skewed (some partitions hold very few records while others hold many), because a large partition is then split into several smaller files.

In [None]:
#partitionBy() with a cap on the number of records per file
df.write.option("header",True) \
        .option("maxRecordsPerFile", 2) \
        .partitionBy("state") \
        .mode("overwrite") \
        .csv("/tmp/zipcodes-state")
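
The effect of the cap can be estimated with simple arithmetic. The plain-Python sketch below computes a lower bound on the number of files per partition; the helper name and the record counts are hypothetical. It is only a lower bound because each write task that holds rows for a given key writes its own file or files.

```python
import math


def min_files_per_partition(records_per_key, max_records_per_file):
    """Lower bound on file count per partition value under maxRecordsPerFile."""
    return {
        key: math.ceil(count / max_records_per_file)
        for key, count in records_per_key.items()
    }


# Hypothetical record counts per state, chosen to show a skewed distribution.
counts = {"AL": 2, "NC": 5, "TX": 1}

# With maxRecordsPerFile = 2, the five NC records need at least three files.
print(min_files_per_partition(counts, 2))
```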

**How does partitioning affect query performance?**\
Partitioning can significantly improve query performance when queries filter on the partition columns. Spark can then skip the directories that do not match the filter (partition pruning), which reduces the amount of data that has to be read and processed.

**How is partitionBy() different from groupBy() in PySpark?**\
partitionBy() is used for physically organizing data on disk when writing to a file system, while groupBy() is used for the logical grouping of data within a DataFrame.

# Bucketing in Spark - BucketBy()

Bucketing is a way to assign rows of a dataset to specific buckets and collocate them on disk.\
This makes wide transformations on the bucketing column, such as joins and aggregations, cheaper, because rows with the same key are already collocated and Spark can avoid moving them between executors.\
**Wide transformations** are operations that require shuffling data across partitions, which can be a costly operation.

In Spark, bucketing is implemented by the **.bucketBy()** method of the **DataFrameWriter** class. To bucket a dataset, you need to provide the method with the number of buckets you want to create and the column to bucket by.
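
Conceptually, a row's bucket is chosen by hashing the bucketing column and taking the result modulo the number of buckets. The plain-Python sketch below illustrates that idea only: Spark uses its own Murmur3-based hash, so `zlib.crc32` is a stand-in and the bucket numbers printed here will not match Spark's. The helper `bucket_for` and the sample ids are hypothetical.

```python
import zlib


def bucket_for(key, num_buckets):
    """Toy bucket assignment: hash the key, then take it modulo the bucket count."""
    # zlib.crc32 is a stand-in hash, not the function Spark uses.
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


ids = [101, 102, 103, 101]  # hypothetical "id" values
buckets = [bucket_for(i, 10) for i in ids]
print(buckets)
```

The property that matters is that equal keys always map to the same bucket, which is what lets rows with the same "id" end up in the same file.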

Here is an example of how to bucket a dataset in Spark:

In [None]:
from pyspark.sql import SparkSession


# Create a SparkSession
spark = SparkSession.builder.appName("BucketingExample").getOrCreate()


# Load a dataset
df = spark.read.format("csv").option("header", "true").load("path/to/dataset")


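# Bucket the dataset by the "id" column into 10 buckets.
# bucketBy() must end with saveAsTable(); calling save() raises an AnalysisException.
# The table name "bucketed_dataset" is a placeholder.
df.write.bucketBy(10, "id").sortBy("id") \
        .format("parquet") \
        .option("path", "path/to/bucketed/dataset") \
        .saveAsTable("bucketed_dataset")

In the above example, we loaded a dataset and bucketed it by the "id" column into 10 buckets using the .bucketBy() method. Rows within each bucket are sorted by the "id" column, and the result is saved as a Parquet table whose files live in the specified directory. The write ends with saveAsTable() rather than save() because the bucket metadata has to be recorded in the table catalog.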

**When to use partitioning and bucketing?**\
If you will often perform filtering on a given column and it is of low cardinality, partition on that column. If you will be performing complex operations like joins, groupBys, and windowing and the column is of high cardinality, consider bucketing on that column.\
However, bucketing is complicated and requires careful consideration of nuances and caveats. For example, there are conditions that need to be met between two datasets for bucketing to have the desired effect: a join typically benefits only when both datasets are bucketed on the join column with the same number of buckets. Additionally, bucketing can only be used when the data is saved as a table, because the bucket metadata needs to be stored somewhere, usually in a Hive metastore.