# Module 2 - Reading and Writing Data

## Introduction

One of the most common tasks in data engineering is reading data from various file formats and writing processed data back. PySpark supports many file formats, including CSV, JSON, plain text, and Parquet.

## What You'll Learn

- Reading CSV files
- Reading JSON files
- Reading text files
- Writing DataFrames to files
- Understanding file format options
- Best practices for reading and writing


In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
import os

# Create data directory if it doesn't exist
os.makedirs("data", exist_ok=True)

# Create SparkSession
spark = SparkSession.builder \
    .appName("Reading and Writing Files") \
    .master("local[*]") \
    .getOrCreate()


## Reading CSV Files

CSV (Comma-Separated Values) is one of the most common file formats. PySpark can read CSV files with various options.


In [2]:
# First, create a sample CSV file for demonstration
csv_content = """Name,Age,City,Salary
Alice,25,New York,50000
Bob,30,London,60000
Charlie,35,Tokyo,70000
Diana,28,Paris,55000
Eve,32,Sydney,65000"""

# Write to file
with open("data/sample_data.csv", "w") as f:
    f.write(csv_content)

print("Sample CSV file created!")


Sample CSV file created!


In [3]:
# Read CSV file - basic usage
df_csv = spark.read.csv("data/sample_data.csv", header=True, inferSchema=True)

print("CSV file read successfully!")
df_csv.show()
df_csv.printSchema()


CSV file read successfully!
+-------+---+--------+------+
|   Name|Age|    City|Salary|
+-------+---+--------+------+
|  Alice| 25|New York| 50000|
|    Bob| 30|  London| 60000|
|Charlie| 35|   Tokyo| 70000|
|  Diana| 28|   Paris| 55000|
|    Eve| 32|  Sydney| 65000|
+-------+---+--------+------+

root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- City: string (nullable = true)
 |-- Salary: integer (nullable = true)



## CSV Reading Options

**Common Options:**
- `header=True`: First row contains column names
- `inferSchema=True`: Automatically detect data types (slower but convenient)
- `sep`: Delimiter (default: comma)
- `nullValue`: String to treat as null
- `dateFormat`: Date format string

**Note**: `inferSchema=True` makes Spark read the data an extra time to work out the column types, which can be slow for large files. For production, define the schema explicitly.


## Two Ways to Enforce Schema

When reading data, you can enforce schema in two ways:

### 1. Schema Option - Schema DDL (String Format)

You can specify schema as a DDL (Data Definition Language) string directly in the `.schema()` method:

```python
df = spark.read \
    .format("csv") \
    .schema("Name STRING, Age INT, City STRING, Salary INT") \
    .load("path/to/file")
```

**Advantages:**
- Simple and concise
- Easy to read and write
- Good for quick prototyping

### 2. StructType (Programmatic Schema Definition)

You can define schema using `StructType` and `StructField` from `pyspark.sql.types`:

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("City", StringType(), True),
    StructField("Salary", IntegerType(), True)
])

df = spark.read \
    .format("csv") \
    .schema(schema) \
    .load("path/to/file")
```

**Advantages:**
- More explicit and type-safe
- Better for complex schemas
- Preferred for production code
- Allows for more control over nullable fields and metadata

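**Note:** Both methods enforce the schema at read time. This is faster than `inferSchema=True` because Spark skips the extra pass over the data, and more reliable because the column types do not depend on the file's contents.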


In [4]:
# Method 1: Using Schema DDL (String Format)
df_csv_ddl = spark.read \
    .format("csv") \
    .option("header", "true") \
    .schema("Name STRING, Age INT, City STRING, Salary INT") \
    .load("data/sample_data.csv")

print("Method 1: Schema DDL")
df_csv_ddl.show()
df_csv_ddl.printSchema()

print("\n" + "="*50 + "\n")

# Method 2: Using StructType (Programmatic Schema)
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("City", StringType(), True),
    StructField("Salary", IntegerType(), True)
])

df_csv_struct = spark.read \
    .format("csv") \
    .option("header", "true") \
    .schema(schema) \
    .load("data/sample_data.csv")

print("Method 2: StructType")
df_csv_struct.show()
df_csv_struct.printSchema()


Method 1: Schema DDL
+-------+---+--------+------+
|   Name|Age|    City|Salary|
+-------+---+--------+------+
|  Alice| 25|New York| 50000|
|    Bob| 30|  London| 60000|
|Charlie| 35|   Tokyo| 70000|
|  Diana| 28|   Paris| 55000|
|    Eve| 32|  Sydney| 65000|
+-------+---+--------+------+

root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- City: string (nullable = true)
 |-- Salary: integer (nullable = true)



Method 2: StructType
+-------+---+--------+------+
|   Name|Age|    City|Salary|
+-------+---+--------+------+
|  Alice| 25|New York| 50000|
|    Bob| 30|  London| 60000|
|Charlie| 35|   Tokyo| 70000|
|  Diana| 28|   Paris| 55000|
|    Eve| 32|  Sydney| 65000|
+-------+---+--------+------+

root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- City: string (nullable = true)
 |-- Salary: integer (nullable = true)



## Handling Date Types

### Default Date Format in Spark

Spark's default date format is **`yyyy-MM-dd`** (e.g., `2024-12-28`). If the dates in your data use a different format and you read the column as `DATE` without telling Spark the format, the values cannot be parsed.

**Important Note**: In the default read mode (`PERMISSIVE`), Spark does not raise an error for these values; each value that cannot be parsed becomes `null`. If every row uses the unexpected format, the **entire date column shows up as `null`**, so the failure is easy to miss.

### Two Ways to Handle Different Date Formats

#### Method 1: Use `dateFormat` Option While Reading

You can specify the date format explicitly using the `.option("dateFormat", "...")` parameter when reading the DataFrame:

```python
df = spark.read \
    .format("csv") \
    .schema("Name STRING, BirthDate DATE, Salary INT") \
    .option("header", "true") \
    .option("dateFormat", "dd/MM/yyyy") \
    .load("path/to/file")
```

**Advantages:**
- Date is parsed directly during read
- Column is already of `DATE` type
- No additional transformation needed

#### Method 2: Load Date as String and Convert Later

If you're unsure about the date format or need more flexibility, you can:
1. Load the date column as a `STRING` type
2. Apply transformation to convert it to `DATE` type using `to_date()` function

```python
from pyspark.sql.functions import to_date

# Read date as string
df = spark.read \
    .format("csv") \
    .schema("Name STRING, BirthDate STRING, Salary INT") \
    .option("header", "true") \
    .load("path/to/file")

# Convert string to date
df = df.withColumn("BirthDate", to_date("BirthDate", "dd/MM/yyyy"))
```

**Advantages:**
- More flexible - can handle multiple date formats
- Can inspect the string values before conversion
- Useful when date format is inconsistent in the data


In [5]:
# Create a sample CSV file with dates in non-standard format (dd/MM/yyyy)
csv_with_dates = """Name,Age,BirthDate,Salary
Alice,25,15/03/1999,50000
Bob,30,22/07/1994,60000
Charlie,35,10/11/1989,70000
Diana,28,05/01/1996,55000
Eve,32,18/09/1992,65000"""

with open("data/sample_data_with_dates.csv", "w") as f:
    f.write(csv_with_dates)

print("Sample CSV file with dates created!")
print("Date format: dd/MM/yyyy (e.g., 15/03/1999)")


Sample CSV file with dates created!
Date format: dd/MM/yyyy (e.g., 15/03/1999)


In [6]:
# Example 1: What happens when date format doesn't match (all dates become null)
print("=" * 60)
print("Example 1: Reading without specifying dateFormat")
print("=" * 60)

# Try to read with DATE type but without dateFormat option
# Since format is dd/MM/yyyy but Spark expects yyyy-MM-dd, all dates will be null
df_wrong_format = spark.read \
    .format("csv") \
    .schema("Name STRING, Age INT, BirthDate DATE, Salary INT") \
    .option("header", "true") \
    .load("data/sample_data_with_dates.csv")

print("\nNotice: All BirthDate values are null due to parse error!")
df_wrong_format.show()
df_wrong_format.printSchema()


Example 1: Reading without specifying dateFormat

Notice: All BirthDate values are null due to parse error!
+-------+---+---------+------+
|   Name|Age|BirthDate|Salary|
+-------+---+---------+------+
|  Alice| 25|     NULL| 50000|
|    Bob| 30|     NULL| 60000|
|Charlie| 35|     NULL| 70000|
|  Diana| 28|     NULL| 55000|
|    Eve| 32|     NULL| 65000|
+-------+---+---------+------+

root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- BirthDate: date (nullable = true)
 |-- Salary: integer (nullable = true)



In [7]:
# Example 2: Method 1 - Using dateFormat option while reading
print("=" * 60)
print("Method 1: Using dateFormat option")
print("=" * 60)

df_method1 = spark.read \
    .format("csv") \
    .schema("Name STRING, Age INT, BirthDate DATE, Salary INT") \
    .option("header", "true") \
    .option("dateFormat", "dd/MM/yyyy") \
    .load("data/sample_data_with_dates.csv")

print("\nSuccessfully parsed dates using dateFormat option!")
df_method1.show()
df_method1.printSchema()


Method 1: Using dateFormat option

Successfully parsed dates using dateFormat option!
+-------+---+----------+------+
|   Name|Age| BirthDate|Salary|
+-------+---+----------+------+
|  Alice| 25|1999-03-15| 50000|
|    Bob| 30|1994-07-22| 60000|
|Charlie| 35|1989-11-10| 70000|
|  Diana| 28|1996-01-05| 55000|
|    Eve| 32|1992-09-18| 65000|
+-------+---+----------+------+

root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- BirthDate: date (nullable = true)
 |-- Salary: integer (nullable = true)



In [9]:
# Example 3: Method 2 - Load as string and convert later
print("=" * 60)
print("Method 2: Load as string and convert using to_date()")
print("=" * 60)

from pyspark.sql.functions import to_date

# Step 1: Read date column as STRING
df_method2_string = spark.read \
    .format("csv") \
    .schema("Name STRING, Age INT, BirthDate STRING, Salary INT") \
    .option("header", "true") \
    .load("data/sample_data_with_dates.csv")

print("\nStep 1: Date column read as STRING")
df_method2_string.show()
df_method2_string.printSchema()

# Step 2: Convert string to date using to_date()
df_method2 = df_method2_string.withColumn(
    "BirthDate", 
    to_date("BirthDate", "dd/MM/yyyy")
)

print("\nStep 2: After converting STRING to DATE using to_date()")
df_method2.show()
df_method2.printSchema()


Method 2: Load as string and convert using to_date()

Step 1: Date column read as STRING
+-------+---+----------+------+
|   Name|Age| BirthDate|Salary|
+-------+---+----------+------+
|  Alice| 25|15/03/1999| 50000|
|    Bob| 30|22/07/1994| 60000|
|Charlie| 35|10/11/1989| 70000|
|  Diana| 28|05/01/1996| 55000|
|    Eve| 32|18/09/1992| 65000|
+-------+---+----------+------+

root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- BirthDate: string (nullable = true)
 |-- Salary: integer (nullable = true)


Step 2: After converting STRING to DATE using to_date()
+-------+---+----------+------+
|   Name|Age| BirthDate|Salary|
+-------+---+----------+------+
|  Alice| 25|1999-03-15| 50000|
|    Bob| 30|1994-07-22| 60000|
|Charlie| 35|1989-11-10| 70000|
|  Diana| 28|1996-01-05| 55000|
|    Eve| 32|1992-09-18| 65000|
+-------+---+----------+------+

root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- BirthDate: date (nullable = true)
 |-- Salary: integer (nullable = true)

### Additional Notes on Date Formats

**Common Date Format Patterns:**
- `yyyy-MM-dd` - Default Spark format (e.g., `2024-12-28`)
- `dd/MM/yyyy` - European format (e.g., `28/12/2024`)
- `MM/dd/yyyy` - US format (e.g., `12/28/2024`)
- `dd-MM-yyyy` - Alternative format (e.g., `28-12-2024`)
- `yyyy/MM/dd` - Alternative format (e.g., `2024/12/28`)

**Using StructType with dateFormat:**

You can also use `StructType` schema definition with the `dateFormat` option:

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType

schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("BirthDate", DateType(), True),  # DateType() for dates
    StructField("Salary", IntegerType(), True)
])

df = spark.read \
    .format("csv") \
    .schema(schema) \
    .option("header", "true") \
    .option("dateFormat", "dd/MM/yyyy") \
    .load("path/to/file")
```

**Key Takeaway**: Always specify the `dateFormat` option when your data doesn't match Spark's default `yyyy-MM-dd` format; otherwise the values that fail to parse are silently read as `null`.


In [10]:
# Example 4: Using StructType with dateFormat option
print("=" * 60)
print("Example: Using StructType schema with dateFormat")
print("=" * 60)

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType

schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("BirthDate", DateType(), True),  # DateType() for dates
    StructField("Salary", IntegerType(), True)
])

df_struct_date = spark.read \
    .format("csv") \
    .schema(schema) \
    .option("header", "true") \
    .option("dateFormat", "dd/MM/yyyy") \
    .load("data/sample_data_with_dates.csv")

print("\nUsing StructType with DateType and dateFormat option:")
df_struct_date.show()
df_struct_date.printSchema()


Example: Using StructType schema with dateFormat

Using StructType with DateType and dateFormat option:
+-------+---+----------+------+
|   Name|Age| BirthDate|Salary|
+-------+---+----------+------+
|  Alice| 25|1999-03-15| 50000|
|    Bob| 30|1994-07-22| 60000|
|Charlie| 35|1989-11-10| 70000|
|  Diana| 28|1996-01-05| 55000|
|    Eve| 32|1992-09-18| 65000|
+-------+---+----------+------+

root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- BirthDate: date (nullable = true)
 |-- Salary: integer (nullable = true)



## Reading JSON Files

JSON (JavaScript Object Notation) is another common format, especially for APIs and semi-structured data.


In [11]:
# Create a sample JSON file
json_content = """{"name": "Alice", "age": 25, "city": "New York", "salary": 50000}
{"name": "Bob", "age": 30, "city": "London", "salary": 60000}
{"name": "Charlie", "age": 35, "city": "Tokyo", "salary": 70000}
{"name": "Diana", "age": 28, "city": "Paris", "salary": 55000}
{"name": "Eve", "age": 32, "city": "Sydney", "salary": 65000}"""

# Write to file (JSON Lines format - one JSON object per line)
with open("data/sample_data.json", "w") as f:
    f.write(json_content)

print("Sample JSON file created!")


Sample JSON file created!


In [12]:
# Read JSON file
df_json = spark.read.json("data/sample_data.json")

print("JSON file read successfully!")
df_json.show()
df_json.printSchema()


JSON file read successfully!
+---+--------+-------+------+
|age|    city|   name|salary|
+---+--------+-------+------+
| 25|New York|  Alice| 50000|
| 30|  London|    Bob| 60000|
| 35|   Tokyo|Charlie| 70000|
| 28|   Paris|  Diana| 55000|
| 32|  Sydney|    Eve| 65000|
+---+--------+-------+------+

root
 |-- age: long (nullable = true)
 |-- city: string (nullable = true)
 |-- name: string (nullable = true)
 |-- salary: long (nullable = true)



## Reading Text Files

Text files can be read as DataFrames where each line becomes a row in a single column named `value`. This is useful for unstructured data or when you need to process raw text.


In [13]:
# Create a sample text file
text_content = """This is line 1
This is line 2
This is line 3
This is line 4
This is line 5"""

with open("data/sample_data.txt", "w") as f:
    f.write(text_content)

print("Sample text file created!")


Sample text file created!


In [14]:
# Read text file as DataFrame (each line becomes a row)
df_text = spark.read.text("data/sample_data.txt")

print("Text file read as DataFrame:")
print(f"Number of lines: {df_text.count()}")
print("\nContent:")
df_text.show(truncate=False)


Text file read as DataFrame:
Number of lines: 5

Content:
+--------------+
|value         |
+--------------+
|This is line 1|
|This is line 2|
|This is line 3|
|This is line 4|
|This is line 5|
+--------------+



## Reading Multiple Files

PySpark can read multiple files at once. Pass a directory path, a wildcard pattern, or a list of paths to the reader.


In [None]:
# Create multiple CSV files
csv1 = """Name,Age
Alice,25
Bob,30"""

csv2 = """Name,Age
Charlie,35
Diana,28"""

with open("data/data_part1.csv", "w") as f:
    f.write(csv1)
    
with open("data/data_part2.csv", "w") as f:
    f.write(csv2)

print("Multiple CSV files created!")


Multiple CSV files created!


In [None]:
# Read multiple files using wildcard pattern
df_multiple = spark.read.csv("data/data_part*.csv", header=True, inferSchema=True)

print("Reading multiple files:")
df_multiple.show()


## Writing DataFrames to Files

After processing data, you often need to write it back to files. PySpark supports writing to various formats.


In [16]:
# Write DataFrame to CSV
output_path = "data/output_data.csv"

# Note: This creates a directory with part files (Spark writes in parallel)
df_csv.write.csv(output_path, header=True, mode="overwrite")

print(f"DataFrame written to {output_path}")
print("Note: Spark creates a directory with part files for parallel writing")


DataFrame written to data/output_data.csv
Note: Spark creates a directory with part files for parallel writing


In [17]:
# Write DataFrame to JSON
output_json_path = "data/output_data.json"

df_json.write.json(output_json_path, mode="overwrite")

print(f"DataFrame written to {output_json_path}")


DataFrame written to data/output_data.json


## Write Modes

When writing files, you can specify the mode to control how Spark handles existing data:

### 1. `overwrite`

If the output folder already exists, its contents are deleted and replaced by the new data (**overwritten**).

**Use case**: When you want to replace existing data completely.

```python
df.write.mode("overwrite").csv("path/to/output")
```

### 2. `ignore`

If the folder already exists, the write is silently skipped and the existing data is left unchanged (**ignored**: no error, no write).

**Use case**: When you want to skip writing if data already exists, without throwing an error.

```python
df.write.mode("ignore").csv("path/to/output")
```

### 3. `append`

If the folder already exists, **new files will be appended** to the existing folder.

**Use case**: When you want to add new data to existing data (e.g., incremental loads).

```python
df.write.mode("append").csv("path/to/output")
```

### 4. `errorIfExists` (default)

If the folder already exists, the write operation will **throw an error**.

**Use case**: When you want to ensure you don't accidentally overwrite existing data.

```python
df.write.mode("errorIfExists").csv("path/to/output")
# or simply
df.write.csv("path/to/output")  # errorIfExists is the default
```

**Summary Table:**

| Mode | Behavior if folder exists |
|------|---------------------------|
| `overwrite` | Overwrites existing data |
| `ignore` | Ignores write (no error) |
| `append` | Appends new files to existing folder |
| `errorIfExists` | Throws an error (default behavior) |


In [18]:
# Example: Write with different modes
df_csv.write.csv("data/output_mode_example.csv", header=True, mode="overwrite")
print("Written with 'overwrite' mode")

# Append mode
df_csv.write.csv("data/output_mode_example.csv", header=True, mode="append")
print("Appended with 'append' mode")


Written with 'overwrite' mode
Appended with 'append' mode


## Writing Single File (Coalesce)

By default, Spark writes multiple part files (one per partition). To get a single part file, call `coalesce(1)` before writing. The output path is still a directory; it just contains one part file.


In [19]:
# Write as single file using coalesce
df_csv.coalesce(1).write.csv("data/output_single_file.csv", header=True, mode="overwrite")

print("Written as single file using coalesce(1)")
print("Note: Use coalesce(1) only for small datasets. For large data, multiple files are better for parallel processing.")


Written as single file using coalesce(1)
Note: Use coalesce(1) only for small datasets. For large data, multiple files are better for parallel processing.


## Best Practices

1. **Define Schema Explicitly**: For production, always define schema instead of using `inferSchema=True`
2. **Use Parquet for Large Data**: Parquet is columnar and compressed (we'll learn this later)
3. **Multiple Files are OK**: Spark writes multiple part files - this is normal and efficient
4. **Avoid coalesce(1) for Large Data**: Only use for small datasets that need to be single files
5. **Use Appropriate Write Modes**: Choose the right mode based on your use case


## Summary

In this notebook, you learned:

1. **Reading CSV Files**: Using `spark.read.csv()` with options like `header`, `inferSchema`, and explicit schema
2. **Reading JSON Files**: Using `spark.read.json()` for JSON data
3. **Reading Text Files**: Using `spark.read.text()` for unstructured text
4. **Reading Multiple Files**: Using wildcard patterns to read multiple files
5. **Writing DataFrames**: Using `write.csv()`, `write.json()` with different modes
6. **Write Modes**: `overwrite`, `append`, `ignore`, `errorIfExists` (the default, also accepted as `error`)
7. **Single File Output**: Using `coalesce(1)` for small datasets

**Key Takeaway**: PySpark can read and write various file formats. Always define schemas explicitly for production code, and understand that Spark writes multiple files by default for parallel processing.

**Next Steps**: In Module 3, we'll learn about basic DataFrame operations like filtering, selecting columns, and sorting data.
