PySpark provides flexible APIs for reading and writing data in a variety of formats and sources, including CSV, JSON, Parquet, and ORC files as well as databases, all through the Spark DataFrame interface. These operations form the backbone of most ETL (Extract, Transform, Load) pipelines, enabling you to process data at scale in a distributed environment.

In this tutorial, you’ll learn the general patterns for reading and writing files in PySpark, understand the meaning of common parameters, and see examples for different data formats. By the end, you’ll be comfortable handling input and output operations in PySpark using clean, reusable code.

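Let's start with reading files. The general pattern is spark.read.format(...), followed by zero or more .option(...) calls and a final .load(path). The same chain works for every built-in file format; only the format name and the format-specific options change.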

In [1]:
from pyspark.sql import SparkSession

# Step 1: Initialize SparkSession
spark = SparkSession.builder \
    .appName("Read File Example") \
    .getOrCreate()






In [8]:
df = spark.read \
    .format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("../data/boston.csv")

df.show(5)
df.printSchema()

+-------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+-----+
|   CRIM|  ZN|INDUS|CHAS|  NOX|   RM| AGE|   DIS|RAD|TAX|PTRATIO|     B|LSTAT|Price|
+-------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+-----+
|0.00632|18.0| 2.31|   0|0.538|6.575|65.2|  4.09|  1|296|   15.3| 396.9| 4.98| 24.0|
|0.02731| 0.0| 7.07|   0|0.469|6.421|78.9|4.9671|  2|242|   17.8| 396.9| 9.14| 21.6|
|0.02729| 0.0| 7.07|   0|0.469|7.185|61.1|4.9671|  2|242|   17.8|392.83| 4.03| 34.7|
|0.03237| 0.0| 2.18|   0|0.458|6.998|45.8|6.0622|  3|222|   18.7|394.63| 2.94| 33.4|
|0.06905| 0.0| 2.18|   0|0.458|7.147|54.2|6.0622|  3|222|   18.7| 396.9| 5.33| 36.2|
+-------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+-----+
only showing top 5 rows

root
 |-- CRIM: double (nullable = true)
 |-- ZN: double (nullable = true)
 |-- INDUS: double (nullable = true)
 |-- CHAS: integer (nullable = true)
 |-- NOX: double (nullable = true)
 |-- RM: double (nullable 

In [6]:
# Parquet files store their own schema, so the CSV-only options
# "header" and "inferSchema" are not needed here.
df = spark.read \
    .format("parquet") \
    .load("../data/titanic.parquet")

df.show(5)
df.printSchema()

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| NULL|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| NULL|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| NULL|       S|
+-----------+--------+------+--------------------+------+----+-----+-----+------

Writing Files in PySpark
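The general pattern mirrors reading: df.write.format(...), optional .option(...) calls, a .mode(...) that controls what happens when the target already exists, and a final .save(path). Spark writes the output as a directory of part files at that path, not as a single file.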



In [7]:
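# Write the DataFrame as CSV with a header row,
# replacing any existing output at the target path.
df.write \
    .format("csv") \
    .option("header", "true") \
    .mode("overwrite") \
    .save("../data/titanic.csv")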


In [9]:
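# Per the cell numbers, this cell ran after the Boston CSV read (In [8]),
# so df refers to the Boston DataFrame here.
df.write \
    .format("parquet") \
    .mode("overwrite") \
    .save("../data/boston.parquet")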


In [11]:
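# Shorthand: the format-specific writer methods accept the same
# settings as keyword arguments instead of .option()/.mode() calls.
df.write.csv(path="../data/test.csv", header=True, mode="overwrite")
df.write.parquet(path="../data/test.parquet", mode="overwrite")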