Optimizing how PySpark reads CSV files can significantly improve performance, especially on large datasets. The tips below summarize the main techniques; the cells after the list run several of them against a small sample file.


In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.appName("CSV Operation").getOrCreate()

In [0]:
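# Default read: no header option, so columns are named _c0.._c3 and the header row is returned as data.
spark.read.format("csv").load("dbfs:/FileStore/Customer_Updated.csv").display()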

_c0,_c1,_c2,_c3
customer_id,customer_name,join_date,location
105,Eva,2022-01-01,Ohio
106,Frank,2022-02-01,Nevada
107,Grace,2022-03-01,Colorado
108,Henry,2022-04-01,Utah


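1. Specify Schema (An explicit schema avoids the extra pass over the file that inferSchema needs. The schema object used here is defined in a later cell.)
spark.read.format("csv").option("header","True").load("dbfs:/FileStore/Customer_Updated.csv",schema=schema).display()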

2. Adjust Partition Size (Increasing the number of partitions can help parallel processing, but repartition() triggers a full shuffle, so it mainly pays off on large inputs.)
spark.read.format("csv").option("header","True").load("dbfs:/FileStore/Customer_Updated.csv",schema=schema).repartition(10).display()

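3. Use Appropriate Data Types (Declare numeric and date columns with their real types in the schema instead of leaving every column as a string.)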
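
As an illustration of this tip, the sketch below declares join_date as a DATE rather than a string by using a DDL-style schema string. The helper name read_customers and the dateFormat value are assumptions made for this example; the column names come from the sample file shown earlier. It expects an existing SparkSession, such as the notebook's spark.

```python
# DDL-style schema: customer_id as INT and join_date as DATE instead of strings.
CUSTOMER_SCHEMA_DDL = (
    "customer_id INT, "
    "customer_name STRING, "
    "join_date DATE, "
    "location STRING"
)


def read_customers(spark, path):
    """Read the customer CSV with typed columns (hypothetical helper).

    `spark` is an existing SparkSession; `path` points at the CSV file.
    """
    return (
        spark.read.format("csv")
        .option("header", "true")
        # Assumed date layout, matching values such as 2022-01-01.
        .option("dateFormat", "yyyy-MM-dd")
        .schema(CUSTOMER_SCHEMA_DDL)
        .load(path)
    )
```

In the notebook this could be called as read_customers(spark, "dbfs:/FileStore/Customer_Updated.csv"); with a typed join_date, date comparisons and date functions work without a later cast.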

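4. Use Compression
(If the CSV files are large, consider storing them compressed with a codec such as gzip, bzip2, or snappy. Spark picks the codec from the file extension. Note that gzip files are not splittable, so each .gz file is read by a single task.)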
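
The compression tip can be sketched as follows. The first helper writes the sample rows to a gzip-compressed CSV using only the Python standard library; the second shows that the Spark read call is the same as for an uncompressed file. The helper names and the temporary local path are assumptions made for this example, and read_gzip_csv expects an existing SparkSession.

```python
import csv
import gzip
import os
import tempfile

# Sample rows taken from the customer file shown earlier.
ROWS = [
    ["customer_id", "customer_name", "join_date", "location"],
    ["105", "Eva", "2022-01-01", "Ohio"],
    ["106", "Frank", "2022-02-01", "Nevada"],
    ["107", "Grace", "2022-03-01", "Colorado"],
    ["108", "Henry", "2022-04-01", "Utah"],
]


def write_gzip_csv(path, rows):
    """Write rows to a gzip-compressed CSV file."""
    # Text mode ("wt") lets csv.writer emit str; newline="" avoids blank lines.
    with gzip.open(path, "wt", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)


def read_gzip_csv(spark, path, schema):
    """Read a .csv.gz file; no extra option is needed for decompression."""
    return (
        spark.read.format("csv")
        .option("header", "true")
        .schema(schema)
        .load(path)
    )


# Hypothetical local path used only for this illustration.
gz_path = os.path.join(tempfile.mkdtemp(), "Customer_Updated.csv.gz")
write_gzip_csv(gz_path, ROWS)
```

Because a gzip file cannot be split, a single large .gz file limits read parallelism; several smaller compressed files, or a splittable codec such as bzip2, keep more tasks busy.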

5. Filter Data Early (Apply filters right after the read so that later stages process fewer rows.)
spark.read.format("csv").option("header","True").load("dbfs:/FileStore/Customer_Updated.csv",schema=schema).filter(col("customer_id")>101).display()

6. Optimize File Handling (coalesce() reduces the number of partitions without a full shuffle, which gives fewer, larger files when the result is written out.)
spark.read.format("csv").option("header","True").load("dbfs:/FileStore/Customer_Updated.csv",schema=schema).coalesce(4)

7. Prune Unnecessary Columns
selected_columns = ["customer_id", "location"]
spark.read.format("csv").option("header","True").load("dbfs:/FileStore/Customer_Updated.csv",schema=schema).select(selected_columns).display()

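8. Optimize Spark Configurations (Tune settings such as spark.sql.files.maxPartitionBytes, which caps how many bytes are packed into one input partition.)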
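
A minimal sketch of this tip is shown below. The three configuration keys are standard Spark SQL settings; the values are illustrative only, and the names CSV_READ_SETTINGS and apply_settings are assumptions introduced for this example. The helper expects an existing SparkSession.

```python
# Illustrative values only; suitable settings depend on the cluster and data.
CSV_READ_SETTINGS = {
    # Upper bound on bytes packed into one input partition (Spark default: 128 MB).
    "spark.sql.files.maxPartitionBytes": str(64 * 1024 * 1024),
    # Estimated cost of opening a file, in bytes (Spark default: 4 MB).
    "spark.sql.files.openCostInBytes": str(4 * 1024 * 1024),
    # Number of partitions used after a shuffle (Spark default: 200).
    "spark.sql.shuffle.partitions": "64",
}


def apply_settings(spark, settings):
    """Apply runtime SQL settings to an existing SparkSession (hypothetical helper)."""
    for key, value in settings.items():
        spark.conf.set(key, value)
    return spark
```

In the notebook this would be called as apply_settings(spark, CSV_READ_SETTINGS) before the read. A smaller maxPartitionBytes produces more, smaller input partitions, which may help when a few large files leave most cores idle.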




In [0]:
# Explicit schema for the customer file.
schema = StructType(
    [
        StructField("customer_id", IntegerType(), True),
        StructField("customer_name", StringType(), True),
        StructField("join_date", StringType(), True),
        StructField("location", StringType(), True),
    ]
)
spark.read.format("csv").option("header","True").load("dbfs:/FileStore/Customer_Updated.csv",schema=schema).repartition(10).display()

customer_id,customer_name,join_date,location
105,Eva,2022-01-01,Ohio
107,Grace,2022-03-01,Colorado
106,Frank,2022-02-01,Nevada
108,Henry,2022-04-01,Utah


In [0]:
# Adjust Partition Size (increasing the number of partitions can help parallel processing)
spark.read.format("csv").option("header","True").load("dbfs:/FileStore/Customer_Updated.csv",schema=schema).repartition(10).display()

customer_id,customer_name,join_date,location
105,Eva,2022-01-01,Ohio
107,Grace,2022-03-01,Colorado
106,Frank,2022-02-01,Nevada
108,Henry,2022-04-01,Utah


In [0]:
# Filter Data Early (every sample row has customer_id > 101, so no rows are dropped here)
spark.read.format("csv").option("header","True").load("dbfs:/FileStore/Customer_Updated.csv",schema=schema).filter(col("customer_id")>101).display()


customer_id,customer_name,join_date,location
105,Eva,2022-01-01,Ohio
106,Frank,2022-02-01,Nevada
107,Grace,2022-03-01,Colorado
108,Henry,2022-04-01,Utah


In [0]:

# Optimize File Handling (coalesce into fewer partitions; no action is called, so only the DataFrame schema is shown)
spark.read.format("csv").option("header","True").load("dbfs:/FileStore/Customer_Updated.csv",schema=schema).coalesce(4)

Out[20]: DataFrame[customer_id: int, customer_name: string, join_date: string, location: string]

In [0]:
# Prune Unnecessary Columns
selected_columns = ["customer_id", "location"]
spark.read.format("csv").option("header","True").load("dbfs:/FileStore/Customer_Updated.csv",schema=schema).select(selected_columns).display()

customer_id,location
105,Ohio
106,Nevada
107,Colorado
108,Utah
