# Module 04: Data Loading and Saving

**Difficulty**: ⭐⭐

**Estimated Time**: 75-90 minutes

**Prerequisites**: 
- Module 00: Introduction to Big Data and Spark Ecosystem
- Module 01: PySpark Setup and SparkSession
- Module 03: DataFrames and Datasets

## Learning Objectives

By the end of this notebook, you will be able to:
1. Read data from various file formats (CSV, JSON, Parquet, text files)
2. Write DataFrames to different file formats with appropriate options
3. Work with partitioned data for better performance
4. Handle schema inference, schema evolution, and data quality issues
5. Use DataFrameReader and DataFrameWriter options effectively

## Setup

In [None]:
# Import required libraries
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, 
    DoubleType, DateType, TimestampType
)
from pyspark.sql.functions import col, lit, current_timestamp
import os
import json
from datetime import datetime, date

# Create SparkSession
spark = SparkSession.builder \
    .appName("Module 04: Data Loading and Saving") \
    .master("local[*]") \
    .config("spark.driver.memory", "2g") \
    .getOrCreate()

print(f"✓ SparkSession created: {spark.sparkContext.appName}")
print(f"  Spark version: {spark.version}")
print(f"  Spark UI: {spark.sparkContext.uiWebUrl}")

# Create data directory for examples
os.makedirs('sample_data', exist_ok=True)
print("\n✓ Sample data directory created")

## 1. File Format Overview

### Common Formats in Big Data

| Format | Type | Schema | Compression | Use Case |
|--------|------|--------|-------------|----------|
| **CSV** | Text | No | Medium | Simple data exchange, human-readable |
| **JSON** | Text | No | Medium | Semi-structured data, APIs |
| **Parquet** | Binary | Yes | Excellent | Analytics, production systems |
| **ORC** | Binary | Yes | Excellent | Hive tables, analytics |
| **Avro** | Binary | Yes | Good | Data serialization, streaming |
| **Text** | Text | No | Poor | Log files, unstructured text |

### Format Recommendations

**Use CSV when**:
- Data exchange with non-technical users
- Excel compatibility needed
- Simple, flat data structures

**Use JSON when**:
- Nested/hierarchical data
- API responses
- Semi-structured data

**Use Parquet when** (RECOMMENDED for Spark):
- Large datasets in production
- Need columnar storage benefits
- Analytics workloads
- Best compression and performance

### Columnar vs Row-Based Storage

```
Row-Based (CSV, JSON):
┌───────────────────────────────────┐
│ Row 1: id=1, name=Alice, age=25   │
│ Row 2: id=2, name=Bob, age=30     │
│ Row 3: id=3, name=Charlie, age=35 │
└───────────────────────────────────┘
  ✓ Good for: Reading full rows
  ✗ Bad for: Reading specific columns

Columnar (Parquet, ORC):
┌─────────┬──────────────┬──────────┐
│ id:     │ name:        │ age:     │
│ 1, 2, 3 │ Alice, Bob,  │ 25,30,35 │
│         │ Charlie      │          │
└─────────┴──────────────┴──────────┘
  ✓ Good for: Analytics (SELECT specific columns)
  ✓ Better compression (similar values together)
  ✓ Predicate pushdown optimization
```
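
To make the diagram concrete, here is a minimal plain-Python sketch (no Spark involved) of the two layouts. The names `rows` and `columns` are illustrative only; real Parquet and ORC files add encoding, compression, and metadata on top of this basic idea.

```python
# Row-based layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "name": "Alice", "age": 25},
    {"id": 2, "name": "Bob", "age": 30},
    {"id": 3, "name": "Charlie", "age": 35},
]

# Averaging one field still walks every full record.
avg_age_from_rows = sum(row["age"] for row in rows) / len(rows)

# Columnar layout: one contiguous list per column.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# The same query now touches only the "age" list.
avg_age_from_columns = sum(columns["age"]) / len(columns["age"])

print(columns["age"])
print(avg_age_from_rows == avg_age_from_columns)
```

Both computations give the same answer; the difference is how much data has to be read to get there, which is what matters once the table has hundreds of columns.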

## 2. Reading CSV Files

CSV is the most common format for data exchange, but has challenges:
- No schema information
- Must infer or specify data types
- Various delimiters, quote characters, null values
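
Because a CSV file carries only text, any type information has to be guessed from the values. The following plain-Python sketch shows the general idea behind type inference; `infer_type` is a hypothetical helper, and Spark's actual inference considers more types (long, boolean, timestamp, and others) than this simplified version.

```python
def infer_type(values):
    """Return the narrowest of 'int', 'double', 'string' that fits every non-empty value."""
    non_empty = [v for v in values if v != ""]
    for type_name, caster in (("int", int), ("double", float)):
        try:
            for v in non_empty:
                caster(v)          # raises ValueError if the value does not fit
            return type_name
        except ValueError:
            continue               # try the next, wider type
    return "string"

print(infer_type(["75000", "65000", ""]))   # empty values are treated as nulls
print(infer_type(["1.5", "2"]))
print(infer_type(["25", "thirty", "35"]))   # one bad value widens the whole column
```

The last call illustrates why inference can hide data quality problems: a single unparseable value turns the whole column into strings instead of flagging the row.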

In [None]:
# Create sample CSV file
csv_data = """id,name,department,salary,hire_date
1,Alice,Engineering,75000,2020-01-15
2,Bob,Sales,65000,2019-06-10
3,Charlie,Engineering,80000,2021-03-22
4,Diana,Marketing,70000,2020-08-05
5,Eve,Engineering,85000,2018-11-30
6,Frank,,72000,2022-02-14
7,Grace,Sales,,2021-09-01"""

with open('sample_data/employees.csv', 'w') as f:
    f.write(csv_data)

print("✓ Sample CSV file created")

### Basic CSV Reading

In [None]:
# Basic read with schema inference
df_csv_basic = spark.read.csv(
    'sample_data/employees.csv',
    header=True,        # First row is header
    inferSchema=True    # Automatically infer data types
)

print("Basic CSV read:")
df_csv_basic.show()
print("\nInferred schema:")
df_csv_basic.printSchema()

### CSV Reading with Options

In [None]:
# Read with explicit options
df_csv_options = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("nullValue", "") \
    .option("dateFormat", "yyyy-MM-dd") \
    .option("mode", "DROPMALFORMED") \
    .csv('sample_data/employees.csv')

print("CSV with options:")
df_csv_options.show()

# Alternative syntax using format()
df_csv_format = spark.read \
    .format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load('sample_data/employees.csv')

print("\n✓ Both syntaxes produce same result")

### CSV Reading with Explicit Schema

In [None]:
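# Define explicit schema (better performance, no inference needed)
# Note: file sources typically report every column as nullable on read,
# so nullable=False here documents intent rather than enforcing it.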
employee_schema = StructType([
    StructField("id", IntegerType(), nullable=False),
    StructField("name", StringType(), nullable=False),
    StructField("department", StringType(), nullable=True),
    StructField("salary", DoubleType(), nullable=True),
    StructField("hire_date", DateType(), nullable=True)
])

df_csv_schema = spark.read \
    .schema(employee_schema) \
    .option("header", "true") \
    .option("dateFormat", "yyyy-MM-dd") \
    .csv('sample_data/employees.csv')

print("CSV with explicit schema:")
df_csv_schema.show()
df_csv_schema.printSchema()

print("\n💡 An explicit schema skips the inference pass and gives predictable column types!")

### Important CSV Options

| Option | Default | Description |
|--------|---------|-------------|
| `header` | false | First row is header |
| `inferSchema` | false | Infer column types (requires extra pass) |
| `sep` | "," | Field delimiter |
| `quote` | "\"" | Quote character |
| `escape` | "\\" | Escape character |
| `nullValue` | "" | String representing null |
| `dateFormat` | yyyy-MM-dd | Date format pattern |
| `timestampFormat` | yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX] | Timestamp format pattern |
| `mode` | PERMISSIVE | Error handling mode |

### Error Handling Modes

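- **PERMISSIVE** (default): Keep every row; fields that cannot be parsed are set to null, and the raw line can be captured in a corrupt-record column
- **DROPMALFORMED**: Drop rows that contain malformed data
- **FAILFAST**: Throw an exception on the first malformed row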
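
The three modes can be sketched in plain Python with the standard `csv` module. This is only an emulation of the behavior described above, not Spark's parser; `parse_ages` is a hypothetical helper that treats a non-integer age as malformed.

```python
import csv
import io

def parse_ages(text, mode="PERMISSIVE"):
    """Parse 'name,age' lines, treating a non-integer age as malformed."""
    if mode not in ("PERMISSIVE", "DROPMALFORMED", "FAILFAST"):
        raise ValueError(f"unknown mode: {mode}")
    parsed = []
    for name, raw_age in csv.reader(io.StringIO(text)):
        try:
            parsed.append((name, int(raw_age)))
        except ValueError:
            if mode == "PERMISSIVE":
                parsed.append((name, None))   # keep the row, null the bad field
            elif mode == "DROPMALFORMED":
                continue                      # silently skip the row
            else:
                raise                         # FAILFAST: abort on the first bad row
    return parsed

sample = "Alice,25\nBob,thirty\nEve,28\n"
permissive = parse_ages(sample, "PERMISSIVE")
dropped = parse_ages(sample, "DROPMALFORMED")
print(permissive)
print(dropped)
```

PERMISSIVE keeps all three rows with a null for Bob's age, DROPMALFORMED keeps two rows, and FAILFAST would raise on Bob's row. Which one is appropriate depends on whether losing rows silently is acceptable for the pipeline.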

## 3. Reading JSON Files

JSON is great for nested/hierarchical data from APIs and web services.
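
By default Spark's JSON reader expects JSON Lines input, meaning one complete JSON object per line. The plain-Python sketch below (standard `json` module only, with made-up sample records) shows that layout and what happens when a nested field is missing from a record.

```python
import json

# Two records in JSON Lines form; the second has no "address" object.
json_lines = (
    '{"id": 1, "name": "Alice", "address": {"city": "New York"}}\n'
    '{"id": 4, "name": "Diana"}\n'
)

# Each non-empty line is parsed independently.
records = [json.loads(line) for line in json_lines.splitlines() if line.strip()]

# A missing nested field becomes None, much like a null in a DataFrame.
cities = [(r["name"], r.get("address", {}).get("city")) for r in records]
print(cities)
```

Because every line stands alone, a large JSON Lines file can be split and parsed in parallel, which a single multi-line JSON document does not allow as easily.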

In [None]:
# Create sample JSON file (one JSON object per line - JSON Lines format)
json_data = [
    {"id": 1, "name": "Alice", "age": 25, "address": {"city": "New York", "country": "USA"}},
    {"id": 2, "name": "Bob", "age": 30, "address": {"city": "London", "country": "UK"}},
    {"id": 3, "name": "Charlie", "age": 35, "address": {"city": "Tokyo", "country": "Japan"}},
    {"id": 4, "name": "Diana", "age": 28},  # Missing address
    {"id": 5, "name": "Eve", "age": 32, "address": {"city": "Paris", "country": "France"}}
]

with open('sample_data/users.json', 'w') as f:
    for record in json_data:
        f.write(json.dumps(record) + '\n')

print("✓ Sample JSON file created")

In [None]:
# Read JSON file
df_json = spark.read.json('sample_data/users.json')

print("JSON DataFrame:")
df_json.show(truncate=False)
print("\nJSON schema (note nested structure):")
df_json.printSchema()

In [None]:
# Access nested fields
print("Accessing nested fields:")
df_json.select(
    "name",
    "age",
    col("address.city").alias("city"),
    col("address.country").alias("country")
).show()

# Alternative: Use getField()
df_json.select(
    "name",
    col("address").getField("city").alias("city")
).show()

## 4. Reading Parquet Files

**Parquet is the recommended format** for Spark because:
- Columnar storage (read only needed columns)
- Built-in compression
- Schema included in file
- Excellent performance

In [None]:
# First, write a DataFrame to Parquet (we'll learn more about writing soon)
df_csv_schema.write.mode("overwrite").parquet('sample_data/employees.parquet')

# Read Parquet file
df_parquet = spark.read.parquet('sample_data/employees.parquet')

print("Parquet DataFrame:")
df_parquet.show()
print("\nParquet schema (automatically preserved):")
df_parquet.printSchema()

print("\n✓ No need to specify schema - it's stored in the file!")

### Parquet Advantages Demo

In [None]:
# Columnar storage benefit: Read only specific columns
import time

# Create larger dataset for demonstration
large_data = [(i, f"Name{i}", f"Dept{i%5}", float(50000 + i*100)) 
              for i in range(100000)]
df_large = spark.createDataFrame(large_data, ["id", "name", "department", "salary"])

# Write to CSV and Parquet
df_large.write.mode("overwrite").csv('sample_data/large_data.csv')
df_large.write.mode("overwrite").parquet('sample_data/large_data.parquet')

# Compare file sizes
import os

def get_dir_size(path):
    total = 0
    for entry in os.scandir(path):
        if entry.is_file():
            total += entry.stat().st_size
        elif entry.is_dir():
            total += get_dir_size(entry.path)
    return total

csv_size = get_dir_size('sample_data/large_data.csv')
parquet_size = get_dir_size('sample_data/large_data.parquet')

print(f"CSV size: {csv_size / 1024 / 1024:.2f} MB")
print(f"Parquet size: {parquet_size / 1024 / 1024:.2f} MB")
print(f"Compression ratio: {csv_size / parquet_size:.2f}x")

print("\n✓ Parquet is typically much smaller thanks to columnar encoding and compression!")

## 5. Writing DataFrames

### Write Modes

| Mode | Behavior |
|------|----------|
| `overwrite` | Delete existing data and write |
| `append` | Add to existing data |
| `ignore` | Skip the write silently if data already exists |
| `error` / `errorifexists` (default) | Throw an error if data already exists |
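
The decision each mode makes can be summarized in a few lines of plain Python. `plan_write` is a hypothetical helper written for illustration; it only models the decision in the table and does not touch any files.

```python
def plan_write(target_exists, mode="error"):
    """Return the action a writer would take for a given save mode."""
    if mode not in ("overwrite", "append", "ignore", "error", "errorifexists"):
        raise ValueError(f"unknown save mode: {mode}")
    if not target_exists:
        return "write"        # every mode writes when nothing is there yet
    if mode == "overwrite":
        return "replace"      # delete existing data, then write
    if mode == "append":
        return "add"          # keep existing data, add new files
    if mode == "ignore":
        return "skip"         # leave existing data untouched, no error
    raise FileExistsError("target already exists")  # error / errorifexists

for mode in ("overwrite", "append", "ignore"):
    print(mode, "->", plan_write(target_exists=True, mode=mode))
```

The modes differ only when the target already exists, which is why a job that works on its first run can fail on the second if the default mode is left in place.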

### Writing CSV

In [None]:
# Write to CSV
df_csv_schema.write \
    .mode("overwrite") \
    .option("header", "true") \
    .csv('sample_data/output_csv')

print("✓ Data written to CSV")
print("\nFiles created:")
for file in os.listdir('sample_data/output_csv'):
    print(f"  {file}")

# Note: Spark writes to a directory, not a single file
# This enables parallel writing across partitions

### Writing JSON

In [None]:
# Write to JSON
df_json.write \
    .mode("overwrite") \
    .json('sample_data/output_json')

print("✓ Data written to JSON")

### Writing Parquet

In [None]:
# Write to Parquet with compression
df_csv_schema.write \
    .mode("overwrite") \
    .option("compression", "snappy") \
    .parquet('sample_data/output_parquet')

print("✓ Data written to Parquet")

# Parquet compression options: snappy (default), gzip, lzo, none
# snappy: Fast compression/decompression (recommended)
# gzip: Better compression ratio, slower

### Writing to a Single File

In [None]:
# Use coalesce(1) to write to single file
# ⚠️ Only for small datasets!
df_csv_schema.coalesce(1).write \
    .mode("overwrite") \
    .option("header", "true") \
    .csv('sample_data/single_file')

print("✓ Data written to single file (using coalesce)")
print("\nFiles created:")
for file in os.listdir('sample_data/single_file'):
    print(f"  {file}")

print("\n⚠️ Warning: coalesce(1) reduces parallelism!")
print("   Only use for small final outputs.")

## 6. Partitioned Data

### What is Partitioning?

Partitioning organizes data into subdirectories based on column values.

**Benefits**:
- **Partition pruning**: Skip irrelevant data
- **Better performance**: Read only needed partitions
- **Organized storage**: Data grouped logically

**Example**: Partition by date
```
sales/
  ├── year=2023/
  │   ├── month=01/
  │   │   └── data.parquet
  │   └── month=02/
  │       └── data.parquet
  └── year=2024/
      └── month=01/
          └── data.parquet
```
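
The `column=value` directory naming above is what makes partition pruning possible: the partition values can be read from the path without opening any data file. A minimal plain-Python sketch of that idea follows; `partition_dir` and `parse_partition` are hypothetical helpers, not Spark APIs.

```python
from pathlib import PurePosixPath

def partition_dir(base, **partition_values):
    """Build a key=value directory path, one level per partition column."""
    path = PurePosixPath(base)
    for column, value in partition_values.items():
        path = path / f"{column}={value}"
    return path

def parse_partition(path):
    """Recover {column: value} from the key=value segments of a path."""
    return dict(part.split("=", 1) for part in PurePosixPath(path).parts if "=" in part)

dirs = [
    partition_dir("sales", year=2023, month="01"),
    partition_dir("sales", year=2023, month="02"),
    partition_dir("sales", year=2024, month="01"),
]

# Pruning: keep only directories whose partition values match the filter.
selected = [d for d in dirs if parse_partition(d)["year"] == "2024"]
print([str(d) for d in selected])
```

Only one of the three directories survives the filter, so a reader following this approach would never list or open the files under `year=2023`.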

In [None]:
# Create sample sales data with dates
import random
from datetime import timedelta

sales_data = []
base_date = date(2024, 1, 1)
products = ['Product A', 'Product B', 'Product C']
regions = ['North', 'South', 'East', 'West']

for i in range(1000):
    sale_date = base_date + timedelta(days=random.randint(0, 90))
    sales_data.append((
        i,
        sale_date,
        sale_date.year,
        sale_date.month,
        random.choice(products),
        random.choice(regions),
        random.randint(1, 100),
        round(random.uniform(100, 1000), 2)
    ))

df_sales = spark.createDataFrame(
    sales_data,
    ["sale_id", "sale_date", "year", "month", "product", "region", "quantity", "revenue"]
)

print("Sales DataFrame:")
df_sales.show(10)

### Writing Partitioned Data

In [None]:
# Write partitioned by year and month
df_sales.write \
    .mode("overwrite") \
    .partitionBy("year", "month") \
    .parquet('sample_data/sales_partitioned')

print("✓ Data written with partitions")
print("\nPartition structure:")
for root, dirs, files in os.walk('sample_data/sales_partitioned'):
    level = root.replace('sample_data/sales_partitioned', '').count(os.sep)
    indent = ' ' * 2 * level
    print(f"{indent}{os.path.basename(root)}/")
    if level < 2:  # Only show first 2 levels
        sub_indent = ' ' * 2 * (level + 1)
        for file in files[:1]:  # Show only first file
            print(f"{sub_indent}{file}")

### Reading Partitioned Data

In [None]:
# Read partitioned data
df_sales_read = spark.read.parquet('sample_data/sales_partitioned')

print("Read partitioned data:")
df_sales_read.show(10)
print("\n✓ Partition columns automatically included!")

# Demonstrate partition pruning
print("\nFiltering by partition column (year=2024, month=1):")
df_filtered = df_sales_read.filter((col("year") == 2024) & (col("month") == 1))
df_filtered.show(10)

print("\n💡 Spark only reads year=2024/month=1 partition!")
print("   This is MUCH faster for large datasets.")

### Partitioning Best Practices

**DO**:
- Partition by columns frequently used in filters (date, region, category)
- Aim for partitions of 100MB-1GB each
- Use 1-3 partition columns maximum

**DON'T**:
- Partition by high-cardinality columns (user_id, transaction_id)
- Create too many small partitions (< 1MB)
- Partition by columns rarely used in queries
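
A quick back-of-the-envelope calculation shows why cardinality matters. The sketch below assumes a hypothetical 50 GB dataset spread evenly across partitions; `avg_partition_mb` is an illustrative helper and the numbers are made up for the example.

```python
def avg_partition_mb(total_gb, partition_count):
    """Average data per partition directory, in MB, assuming an even spread."""
    return total_gb * 1024 / partition_count

total_gb = 50  # hypothetical dataset size

by_day = avg_partition_mb(total_gb, 90)           # 90 distinct dates
by_user = avg_partition_mb(total_gb, 1_000_000)   # 1,000,000 distinct user_ids

print(f"by day:  {by_day:.0f} MB per partition")
print(f"by user: {by_user:.4f} MB per partition")
```

Partitioning by day lands inside the 100MB-1GB target, while partitioning by user ID produces a million directories holding a few kilobytes each, which is the small-files problem the guidelines warn about.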

## 7. Handling Data Quality Issues

Real-world data is messy. Spark provides options to handle common issues.

In [None]:
# Create CSV with data quality issues
messy_csv = """id,name,age,salary
1,Alice,25,50000
2,Bob,thirty,60000
3,Charlie,35,not_a_number
4,Diana,,55000
5,Eve,28,65000
,Frank,32,70000
7,Grace,29,"""

with open('sample_data/messy_data.csv', 'w') as f:
    f.write(messy_csv)

print("✓ Messy CSV created with quality issues")

### Mode: PERMISSIVE (Default)

In [None]:
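# PERMISSIVE: fields that cannot be parsed as the declared type become null.
# An explicit schema is needed to see this: with inferSchema, values such as
# "thirty" would simply make the age and salary columns strings.
df_permissive = spark.read \
    .option("header", "true") \
    .schema("id INT, name STRING, age INT, salary INT") \
    .option("mode", "PERMISSIVE") \
    .csv('sample_data/messy_data.csv')

print("PERMISSIVE mode (sets malformed to null):")
df_permissive.show()
df_permissive.printSchema()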

### Mode: DROPMALFORMED

In [None]:
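# DROPMALFORMED: Drop rows that do not match the schema.
# An explicit schema is needed: with inferSchema, age and salary would be
# inferred as strings and no row would count as malformed.
df_dropmalformed = spark.read \
    .option("header", "true") \
    .schema("id INT, name STRING, age INT, salary INT") \
    .option("mode", "DROPMALFORMED") \
    .csv('sample_data/messy_data.csv')

print("DROPMALFORMED mode (drops bad rows):")
df_dropmalformed.show()
# collect() parses every column; a bare count() may skip parsing and keep all rows
rows_kept = len(df_dropmalformed.collect())
print(f"\nRows kept: {rows_kept} out of {df_permissive.count()}")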

### Capturing Malformed Records

In [None]:
# Capture malformed records in a separate column
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema_with_corrupt = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("salary", IntegerType(), True),
    StructField("_corrupt_record", StringType(), True)
])

df_with_corrupt = spark.read \
    .option("header", "true") \
    .option("mode", "PERMISSIVE") \
    .option("columnNameOfCorruptRecord", "_corrupt_record") \
    .schema(schema_with_corrupt) \
    .csv('sample_data/messy_data.csv')

print("With corrupt record column:")
df_with_corrupt.show(truncate=False)

print("\nOnly corrupt records:")
df_with_corrupt.filter(col("_corrupt_record").isNotNull()).show(truncate=False)

## Exercises

### Exercise 1: Multi-Format Data Pipeline

Create a data pipeline that:
1. Reads the CSV file `sample_data/employees.csv`
2. Adds a column `loaded_at` with current timestamp
3. Writes the result to Parquet format
4. Reads back the Parquet and displays it

In [None]:
# Exercise 1: Your code here

# Step 1: Read CSV
# Your code here

# Step 2: Add timestamp column
# Your code here

# Step 3: Write to Parquet
# Your code here

# Step 4: Read back and display
# Your code here

### Exercise 2: Working with Partitions

Using the `df_sales` DataFrame:
1. Write the data partitioned by `region` and `product`
2. Read back only the data for region="North"
3. Count records for each product in the North region
4. Calculate total revenue by product

In [None]:
# Exercise 2: Your code here

# Step 1: Write partitioned data
# Your code here

# Step 2: Read only North region
# Your code here

# Step 3: Count by product
# Your code here

# Step 4: Total revenue by product
# Your code here

### Exercise 3: Schema Enforcement

Create an explicit schema for customer data with:
- `customer_id` (integer, not null)
- `email` (string, not null)
- `registration_date` (date, nullable)
- `total_purchases` (double, nullable)

Then:
1. Create sample data (5 customers)
2. Create DataFrame with this schema
3. Write to Parquet
4. Read back and verify schema is preserved

In [None]:
# Exercise 3: Your code here

# Step 1: Define schema
# Your code here

# Step 2: Create sample data
# Your code here

# Step 3: Create DataFrame
# Your code here

# Step 4: Write to Parquet
# Your code here

# Step 5: Read and verify
# Your code here

### Exercise 4: Format Comparison

Create a large DataFrame (10,000 rows) with columns:
- `id`, `product`, `category`, `price`, `timestamp`

Then:
1. Write to CSV, JSON, and Parquet
2. Compare file sizes
3. Measure read times for each format
4. Which format is smallest? Which is fastest to read?

In [None]:
# Exercise 4: Your code here

# Step 1: Create large DataFrame
# Your code here

# Step 2: Write to all formats
# Your code here

# Step 3: Compare sizes
# Your code here

# Step 4: Measure read times
# Your code here

## Solutions

### Exercise 1 Solution

In [None]:
# Solution 1: Multi-Format Data Pipeline

# Step 1: Read CSV
df_pipeline = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv('sample_data/employees.csv')

# Step 2: Add timestamp
df_pipeline = df_pipeline.withColumn("loaded_at", current_timestamp())

print("Data with timestamp:")
df_pipeline.show(truncate=False)

# Step 3: Write to Parquet
df_pipeline.write.mode("overwrite").parquet('sample_data/pipeline_output')

# Step 4: Read back
df_result = spark.read.parquet('sample_data/pipeline_output')
print("\nRead from Parquet:")
df_result.show(truncate=False)
print("\n✓ Pipeline completed successfully!")

### Exercise 2 Solution

In [None]:
# Solution 2: Working with Partitions

# Step 1: Write partitioned data
df_sales.write \
    .mode("overwrite") \
    .partitionBy("region", "product") \
    .parquet('sample_data/sales_by_region_product')

# Step 2: Read only North region
df_north = spark.read \
    .parquet('sample_data/sales_by_region_product') \
    .filter(col("region") == "North")

print("North region sales:")
df_north.show(10)

# Step 3: Count by product
print("\nCount by product in North:")
df_north.groupBy("product").count().show()

# Step 4: Total revenue by product
print("\nTotal revenue by product in North:")
df_north.groupBy("product") \
    .sum("revenue") \
    .withColumnRenamed("sum(revenue)", "total_revenue") \
    .show()

### Exercise 3 Solution

In [None]:
# Solution 3: Schema Enforcement

# Step 1: Define schema
customer_schema = StructType([
    StructField("customer_id", IntegerType(), nullable=False),
    StructField("email", StringType(), nullable=False),
    StructField("registration_date", DateType(), nullable=True),
    StructField("total_purchases", DoubleType(), nullable=True)
])

# Step 2: Create sample data
customer_data = [
    (1, "alice@example.com", date(2024, 1, 15), 1250.50),
    (2, "bob@example.com", date(2024, 2, 20), 890.75),
    (3, "charlie@example.com", date(2024, 3, 10), 2340.00),
    (4, "diana@example.com", date(2024, 1, 5), 567.25),
    (5, "eve@example.com", date(2024, 2, 28), 1890.80)
]

# Step 3: Create DataFrame
df_customers = spark.createDataFrame(customer_data, schema=customer_schema)

print("Customer DataFrame:")
df_customers.show()
df_customers.printSchema()

# Step 4: Write to Parquet
df_customers.write.mode("overwrite").parquet('sample_data/customers')

# Step 5: Read and verify
df_customers_read = spark.read.parquet('sample_data/customers')
print("\nRead from Parquet:")
df_customers_read.printSchema()
print("\n✓ Schema preserved in Parquet!")

### Exercise 4 Solution

In [None]:
# Solution 4: Format Comparison

import time
import random
from datetime import datetime, timedelta

# Step 1: Create large DataFrame
large_data = []
products = ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Keyboard']
categories = ['Electronics', 'Accessories']
base_time = datetime(2024, 1, 1)

for i in range(10000):
    large_data.append((
        i,
        random.choice(products),
        random.choice(categories),
        round(random.uniform(50, 2000), 2),
        base_time + timedelta(hours=random.randint(0, 1000))
    ))

df_large_test = spark.createDataFrame(
    large_data,
    ["id", "product", "category", "price", "timestamp"]
)

print(f"Created DataFrame with {df_large_test.count()} rows")

# Step 2: Write to all formats
df_large_test.write.mode("overwrite").csv('sample_data/compare_csv')
df_large_test.write.mode("overwrite").json('sample_data/compare_json')
df_large_test.write.mode("overwrite").parquet('sample_data/compare_parquet')

# Step 3: Compare sizes
csv_size = get_dir_size('sample_data/compare_csv')
json_size = get_dir_size('sample_data/compare_json')
parquet_size = get_dir_size('sample_data/compare_parquet')

print("\n=== File Size Comparison ===")
print(f"CSV:     {csv_size / 1024:.2f} KB")
print(f"JSON:    {json_size / 1024:.2f} KB")
print(f"Parquet: {parquet_size / 1024:.2f} KB")
print(f"\nParquet is {csv_size/parquet_size:.2f}x smaller than CSV")

# Step 4: Measure read times
print("\n=== Read Time Comparison ===")

start = time.time()
spark.read.csv('sample_data/compare_csv').count()
csv_time = time.time() - start
print(f"CSV:     {csv_time:.4f} seconds")

start = time.time()
spark.read.json('sample_data/compare_json').count()
json_time = time.time() - start
print(f"JSON:    {json_time:.4f} seconds")

start = time.time()
spark.read.parquet('sample_data/compare_parquet').count()
parquet_time = time.time() - start
print(f"Parquet: {parquet_time:.4f} seconds")

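print("\n=== Conclusion ===")
# Derive the winners from the measurements instead of assuming them
sizes = {"CSV": csv_size, "JSON": json_size, "Parquet": parquet_size}
times = {"CSV": csv_time, "JSON": json_time, "Parquet": parquet_time}
smallest = min(sizes, key=sizes.get)
fastest = min(times, key=times.get)
print(f"Smallest: {smallest} ({sizes[smallest] / 1024:.2f} KB)")
print(f"Fastest:  {fastest} ({times[fastest]:.4f} seconds)")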

## Summary

### Key Concepts Covered

✅ **File Formats**: CSV, JSON, Parquet - each with specific use cases

✅ **Reading Data**: DataFrameReader with various options and modes

✅ **Writing Data**: DataFrameWriter with modes (overwrite, append, ignore, error)

✅ **Partitioning**: Organizing data for better query performance

✅ **Data Quality**: Handling malformed data with PERMISSIVE, DROPMALFORMED, FAILFAST

### Format Recommendations

**For Production Spark Applications**: Use **Parquet**
- Best compression
- Fastest performance
- Schema included
- Columnar benefits

**For Data Exchange**: Use **CSV** or **JSON**
- Human-readable
- Tool compatibility
- Simple structure

### Important Methods

**Reading**:
- `spark.read.csv()` / `json()` / `parquet()`
- `spark.read.format().load()`
- `.option()` for configuration
- `.schema()` for explicit schema

**Writing**:
- `df.write.csv()` / `json()` / `parquet()`
- `df.write.format().save()`
- `.mode()` for write behavior
- `.partitionBy()` for partitioning
- `.option()` for configuration

### Best Practices

1. **Use Parquet** for production Spark workloads
2. **Define explicit schemas** instead of inferring (faster, type-safe)
3. **Partition wisely** by columns used in filters (date, region, category)
4. **Handle errors** appropriately with mode settings
5. **Monitor file sizes** - avoid too many small files
6. **Use compression** (snappy for Parquet)

### Partitioning Guidelines

✅ **DO**:
- Partition by date (year, month, day)
- Use low-cardinality columns (region, category, status)
- Aim for partition sizes of 100MB-1GB

❌ **DON'T**:
- Partition by high-cardinality (user_id, transaction_id)
- Create thousands of tiny partitions
- Partition by columns not used in queries

### What's Next?

In **Module 05: DataFrame Operations**, you will:
- Perform powerful transformations (select, filter, groupBy)
- Join multiple DataFrames
- Aggregate data with built-in functions
- Use window functions for advanced analytics

### Additional Resources

- [Spark Data Sources](https://spark.apache.org/docs/latest/sql-data-sources.html)
- [Parquet Format](https://parquet.apache.org/)
- [Best Practices for File Formats](https://spark.apache.org/docs/latest/sql-performance-tuning.html)

In [None]:
# Cleanup
spark.stop()
print("SparkSession stopped. ✓")