# PySpark Comprehensive Tutorial - Module 2: Data Ingestion & I/O Operations

## 🎯 Learning Objectives
- Master **file format operations** (CSV, JSON, Parquet, Avro, Delta Lake)
- Implement **database connectivity** with JDBC and cloud databases
- Configure **cloud storage integration** (AWS S3, GCS, Azure Blob)
- Understand **streaming data ingestion** patterns
- Apply **schema management** and data validation techniques
- Optimize **I/O performance** for large-scale data processing

## 🏗 Module Focus
**Building on Module 1 Foundation:**
- Apply SparkSession configuration for I/O operations
- Use DataFrame APIs for data ingestion and export
- Implement production-ready data pipelines
- Handle real-world data quality challenges

**Real-World Applications:**
- **ETL Pipelines**: Extract, Transform, Load workflows
- **Data Lake Integration**: Multi-format data processing
- **Analytics Preparation**: Optimized data formats for queries
- **Cross-Platform Data Exchange**: Compatible data formats

## 📋 Prerequisites
- ✅ **Module 1 Complete**: Foundation & Setup knowledge
- ✅ **Environment**: `pyspark_env` with PySpark 4.0.0
- ✅ **Local Setup**: 6-core macOS optimization
- ✅ **Datasets**: < 10GB for local development

---

## 2.1 I/O Operations Overview

### Data Sources & Formats
PySpark provides unified APIs for reading and writing various data sources:

**File Formats:**
- **CSV**: Comma-separated values, universal compatibility
- **JSON**: JavaScript Object Notation, semi-structured data
- **Parquet**: Columnar format, optimized for analytics
- **Avro**: Schema evolution, cross-language compatibility
- **Delta Lake**: ACID transactions, versioning, time travel

**Database Sources:**
- **JDBC**: Relational databases (PostgreSQL, MySQL, SQL Server)
- **NoSQL**: MongoDB, Cassandra, HBase
- **Cloud Databases**: BigQuery, Redshift, Snowflake

**Cloud Storage:**
- **AWS S3**: Simple Storage Service
- **Google Cloud Storage**: GCS buckets
- **Azure Blob Storage**: Blob containers and Azure Data Lake Storage

### Key I/O Concepts
1. **Schema Inference vs Explicit Schema**: Performance trade-offs
2. **Partitioning**: Organize data for optimal query performance  
3. **Compression**: Balance storage size vs processing speed
4. **Error Handling**: Malformed data, missing files, connection issues
5. **Performance Optimization**: Parallel I/O, caching strategies
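
Item 4 (error handling) can be made concrete for malformed rows. The sketch below is not part of the original notebook: the file name `orders_bad.csv`, its columns, and both helper functions are hypothetical, and a SparkSession is assumed to be passed in as `spark`. It relies on the CSV reader's `mode` and `columnNameOfCorruptRecord` options.

```python
from pathlib import Path

# Hypothetical sample: the second data row has a non-numeric amount.
SAMPLE_ROWS = [
    "order_id,amount",
    "1,19.99",
    "2,not_a_number",
    "3,42.50",
]

# The corrupt-record column has to be declared in the schema as a STRING.
ORDERS_DDL = "order_id INT, amount DOUBLE, _corrupt_record STRING"


def write_sample_orders(directory):
    """Write the sample CSV into `directory` and return its path as a string."""
    path = Path(directory) / "orders_bad.csv"
    path.write_text("\n".join(SAMPLE_ROWS) + "\n", encoding="utf-8")
    return str(path)


def read_orders_permissive(spark, path):
    """Read the CSV in PERMISSIVE mode, keeping unparseable rows for inspection."""
    df = (
        spark.read
        .option("header", "true")
        .option("mode", "PERMISSIVE")  # alternatives: DROPMALFORMED, FAILFAST
        .option("columnNameOfCorruptRecord", "_corrupt_record")
        .schema(ORDERS_DDL)
        .csv(path)
    )
    # Caching sidesteps Spark's restriction on queries that reference only
    # the corrupt-record column of a raw CSV or JSON source.
    return df.cache()
```

With the notebook's session, `read_orders_permissive(spark, write_sample_orders(temp_dir)).filter("_corrupt_record IS NOT NULL")` should return the row whose `amount` failed to parse. Switching `mode` to `FAILFAST` makes Spark raise on a bad row instead of keeping it.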

## 2.2 Environment Setup for I/O Operations

In [1]:
# Environment verification and imports for Module 2: Data Ingestion & I/O
import os
import sys
import json
import tempfile
from pathlib import Path
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

print("🔍 Module 2: Data Ingestion & I/O Setup")
print("=" * 50)

# Check conda environment 
conda_env = os.environ.get('CONDA_DEFAULT_ENV', 'Not activated')
print(f"Conda Environment: {conda_env}")

# Verify required directories exist
project_root = Path.cwd().parent if Path.cwd().name == 'notebooks' else Path.cwd()
data_dir = project_root / "data"
temp_dir = project_root / "temp"

# Create directories if they don't exist
data_dir.mkdir(exist_ok=True)
temp_dir.mkdir(exist_ok=True)

print(f"📁 Project Root: {project_root}")
print(f"📁 Data Directory: {data_dir}")
print(f"📁 Temp Directory: {temp_dir}")

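# Import PySpark components for I/O operations
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
# Caution: the wildcard imports below are convenient in a notebook, but
# pyspark.sql.functions shadows Python built-ins such as round().
from pyspark.sql.types import *
from pyspark.sql.functions import *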

# Data manipulation libraries
import pandas as pd
import numpy as np
import random

# Set seeds for reproducibility
random.seed(42)
np.random.seed(42)

print("\n✅ Environment ready for I/O operations!")
print(f"🎯 Ready to explore: CSV, JSON, Parquet, and more!")

# Store paths for later use
print(f"\n📋 Directory Setup Complete:")
print(f"   • data_dir: {data_dir}")
print(f"   • temp_dir: {temp_dir}")

🔍 Module 2: Data Ingestion & I/O Setup
Conda Environment: pyspark_env
📁 Project Root: /Users/sanjeevadodlapati/Downloads/Repos/GeoSpatialAI/projects/project_pyspark_comprehensive_tutorial
📁 Data Directory: /Users/sanjeevadodlapati/Downloads/Repos/GeoSpatialAI/projects/project_pyspark_comprehensive_tutorial/data
📁 Temp Directory: /Users/sanjeevadodlapati/Downloads/Repos/GeoSpatialAI/projects/project_pyspark_comprehensive_tutorial/temp

✅ Environment ready for I/O operations!
🎯 Ready to explore: CSV, JSON, Parquet, and more!

📋 Directory Setup Complete:
   • data_dir: /Users/sanjeevadodlapati/Downloads/Repos/GeoSpatialAI/projects/project_pyspark_comprehensive_tutorial/data
   • temp_dir: /Users/sanjeevadodlapati/Downloads/Repos/GeoSpatialAI/projects/project_pyspark_comprehensive_tutorial/temp


In [2]:
# Create SparkSession optimized for I/O operations
print("🚀 Creating SparkSession for I/O Operations")
print("=" * 50)

# Stop any existing SparkSession
try:
    spark.stop()
    print("🧹 Stopped existing SparkSession")
except NameError:
    # `spark` has not been defined yet in this kernel
    print("🆕 No existing SparkSession to stop")

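# Configuration optimized for I/O operations and file processing
# Note: in local mode everything runs inside the driver JVM, so the
# spark.executor.* settings below have no effect; local[6] sets the parallelism.
# Note: spark.sql.json.compression.codec does not appear to be a built-in Spark SQL
# setting; JSON output compression is normally set per write with
# .option("compression", "gzip").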
spark = SparkSession.builder \
    .appName("PySpark-Tutorial-Module2-IO") \
    .master("local[6]") \
    .config("spark.driver.memory", "3g") \
    .config("spark.driver.maxResultSize", "2g") \
    .config("spark.executor.memory", "2g") \
    .config("spark.executor.cores", "2") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.minPartitionSize", "16MB") \
    .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.sql.files.maxPartitionBytes", "128MB") \
    .config("spark.sql.files.openCostInBytes", "4MB") \
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic") \
    .config("spark.sql.parquet.compression.codec", "snappy") \
    .config("spark.sql.json.compression.codec", "gzip") \
    .config("spark.dynamicAllocation.enabled", "false") \
    .getOrCreate()

# Get SparkContext
sc = spark.sparkContext

print("\n✅ SparkSession created successfully!")
print(f"📱 Application Name: {spark.sparkContext.appName}")
print(f"🔢 Spark Version: {spark.version}")
print(f"🎯 Master: {spark.sparkContext.master}")
print(f"💾 Driver Memory: {spark.conf.get('spark.driver.memory')}")
print(f"⚡ Default Parallelism: {spark.sparkContext.defaultParallelism}")

# I/O specific configurations
print(f"\n🔧 I/O Optimizations:")
print(f"   • Parquet Compression: {spark.conf.get('spark.sql.parquet.compression.codec')}")
print(f"   • JSON Compression: {spark.conf.get('spark.sql.json.compression.codec')}")
print(f"   • Max Partition Bytes: {spark.conf.get('spark.sql.files.maxPartitionBytes')}")
print(f"   • Arrow Optimization: {spark.conf.get('spark.sql.execution.arrow.pyspark.enabled')}")

if spark.sparkContext.uiWebUrl:
    print(f"\n🌐 Spark UI: {spark.sparkContext.uiWebUrl}")

print(f"\n🎯 Optimized for:")
print(f"   • File format I/O operations")
print(f"   • Multi-format data processing")
print(f"   • Compression and performance")
print(f"   • Local development with 6 cores")

🚀 Creating SparkSession for I/O Operations
🆕 No existing SparkSession to stop


Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/08/25 19:32:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/08/25 19:32:51 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.



✅ SparkSession created successfully!
📱 Application Name: PySpark-Tutorial-Module2-IO
🔢 Spark Version: 4.0.0
🎯 Master: local[6]
💾 Driver Memory: 3g
⚡ Default Parallelism: 6

🔧 I/O Optimizations:
   • Parquet Compression: snappy
   • JSON Compression: gzip
   • Max Partition Bytes: 128MB
   • Arrow Optimization: true

🌐 Spark UI: http://192.168.12.128:4041

🎯 Optimized for:
   • File format I/O operations
   • Multi-format data processing
   • Compression and performance
   • Local development with 6 cores


## 2.3 CSV File Operations

CSV (Comma-Separated Values) is one of the most common data formats. PySpark's CSV reader and writer provide options for headers, data types, delimiters, and schema inference.

**Key CSV Concepts:**
- **Schema Inference**: Automatically detect column types
- **Custom Schema**: Define column types explicitly for better performance
- **Headers**: Handle files with/without column headers
- **Delimiters**: Support for different separators (comma, semicolon, tab, etc.)
- **Null Values**: Custom null value representations
- **Escape Characters**: Handle special characters and quotes
- **Multiline**: Support for records spanning multiple lines
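
The multiline case is easy to get wrong, so here is a minimal sketch of it. It is an illustration rather than part of the original notebook: the file name `multiline_notes.csv`, the two records, and both helper functions are hypothetical, and `spark` is assumed to be an existing SparkSession. The file is written with Python's `csv` module so that one quoted field really contains a line break, and the reader uses Spark's `multiLine` CSV option.

```python
import csv
from pathlib import Path

# Hypothetical records; the second note contains a real line break.
RECORDS = [
    {"id": "1", "note": "single line"},
    {"id": "2", "note": "first line\nsecond line"},
]


def write_multiline_csv(directory):
    """Write a CSV whose quoted fields may contain newlines; return its path."""
    path = Path(directory) / "multiline_notes.csv"
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f,
            fieldnames=["id", "note"],
            quoting=csv.QUOTE_ALL,
            lineterminator="\n",
        )
        writer.writeheader()
        writer.writerows(RECORDS)
    return str(path)


def read_multiline_csv(spark, path):
    """Parse quoted fields that span several physical lines as one record."""
    return (
        spark.read
        .option("header", "true")
        .option("multiLine", "true")
        .option("escape", '"')  # doubled quotes inside quoted fields
        .csv(path)
    )
```

The file has four physical lines but only two records. Without `multiLine`, Spark splits the input on line breaks before parsing, so the second record would be broken apart.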

In [3]:
# Create sample CSV datasets for demonstration
print("📁 Creating Sample CSV Datasets")
print("=" * 40)

# Sample 1: Employee data with headers
employee_csv_content = """employee_id,name,department,salary,hire_date,is_active
1001,John Doe,Engineering,85000,2022-01-15,true
1002,Jane Smith,Marketing,72000,2021-03-22,true
1003,Bob Johnson,Engineering,92000,2020-11-08,true
1004,Alice Brown,Sales,68000,2023-02-14,true
1005,Charlie Wilson,HR,75000,2021-09-05,true
1006,Diana Davis,Engineering,88000,2022-07-12,false
1007,Eve Miller,Marketing,69000,2023-01-30,true
1008,Frank Garcia,Sales,71000,2020-12-03,true
1009,Grace Lee,Engineering,95000,2019-08-17,true
1010,Henry Taylor,HR,73000,2022-04-25,true"""

# Sample 2: Sales data with different delimiter and null values
sales_csv_content = """product_id;product_name;category;price;quantity_sold;sale_date;discount
P001;Laptop Pro;Electronics;1299.99;45;2024-01-15;0.1
P002;Wireless Mouse;Electronics;29.99;120;;0.05
P003;Office Chair;Furniture;299.99;30;2024-01-18;
P004;Coffee Maker;Appliances;89.99;75;2024-01-20;0.15
P005;Smartphone;Electronics;;200;2024-01-22;0.08
P006;Desk Lamp;Furniture;45.99;60;2024-01-25;0.0
P007;Tablet;Electronics;599.99;35;2024-01-28;0.12
P008;Ergonomic Keyboard;Electronics;79.99;85;2024-01-30;0.07"""

# Sample 3: Complex CSV with quotes and special characters
complex_csv_lines = [
    'id,description,tags,notes,created_at',
    '1,"Product with quotes","tag1,tag2,tag3","Note with commas","2024-01-01"',
    '2,"Simple product","electronics","Normal note","2024-01-02"',
    '3,"Product description","tag1","Multiline note","2024-01-03"',
    '4,"Product with commas","home,garden","Simple note","2024-01-04"'
]
complex_csv_content = '\n'.join(complex_csv_lines)

# Write CSV files
csv_files = {
    'employees.csv': employee_csv_content,
    'sales_semicolon.csv': sales_csv_content,
    'complex_data.csv': complex_csv_content
}

created_files = []
for filename, content in csv_files.items():
    file_path = data_dir / filename
    with open(file_path, 'w', encoding='utf-8') as f:
        f.write(content)
    created_files.append(filename)
    print(f"✅ Created: {filename}")

# Display file sizes and row counts
print(f"\n📊 CSV Files Summary:")
for filename in created_files:
    file_path = data_dir / filename
    file_size = file_path.stat().st_size
    with open(file_path, 'r') as f:
        line_count = len(f.readlines())
    print(f"   • {filename}: {file_size} bytes, {line_count} lines")

print(f"\n📂 Files location: {data_dir}")
print(f"🎯 Ready for CSV operations!")

📁 Creating Sample CSV Datasets
✅ Created: employees.csv
✅ Created: sales_semicolon.csv
✅ Created: complex_data.csv

📊 CSV Files Summary:
   • employees.csv: 529 bytes, 11 lines
   • sales_semicolon.csv: 486 bytes, 9 lines
   • complex_data.csv: 295 bytes, 5 lines

📂 Files location: /Users/sanjeevadodlapati/Downloads/Repos/GeoSpatialAI/projects/project_pyspark_comprehensive_tutorial/data
🎯 Ready for CSV operations!


In [4]:
# Basic CSV Reading with Schema Inference
print("📖 Reading CSV Files with Schema Inference")
print("=" * 45)

# Read employee data with default settings
employee_file = str(data_dir / "employees.csv")
print(f"📁 Reading: {employee_file}")

# Basic read with header and schema inference
df_employees = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv(employee_file)

print(f"\n✅ Successfully loaded employee data")
print(f"📊 Rows: {df_employees.count()}")
print(f"📋 Columns: {len(df_employees.columns)}")

# Display schema
print(f"\n🔍 Inferred Schema:")
df_employees.printSchema()

# Show sample data
print(f"\n📄 Sample Data (first 5 rows):")
df_employees.show(5, truncate=False)

# Display data types
print(f"\n🏷️  Column Data Types:")
for column, dtype in df_employees.dtypes:
    print(f"   • {column}: {dtype}")

# Performance note
print(f"\n⚡ Performance Note:")
print(f"   Schema inference makes an extra pass over the data (the whole file by default)")
print(f"   For better performance on large files, define the schema explicitly")
print(f"   Current partitions: {df_employees.rdd.getNumPartitions()}")

📖 Reading CSV Files with Schema Inference
📁 Reading: /Users/sanjeevadodlapati/Downloads/Repos/GeoSpatialAI/projects/project_pyspark_comprehensive_tutorial/data/employees.csv

✅ Successfully loaded employee data

📊 Rows: 10
📋 Columns: 6

🔍 Inferred Schema:
root
 |-- employee_id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- salary: integer (nullable = true)
 |-- hire_date: date (nullable = true)
 |-- is_active: boolean (nullable = true)


📄 Sample Data (first 5 rows):
+-----------+--------------+-----------+------+----------+---------+
|employee_id|name          |department |salary|hire_date |is_active|
+-----------+--------------+-----------+------+----------+---------+
|1001       |John Doe      |Engineering|85000 |2022-01-15|true     |
|1002       |Jane Smith    |Marketing  |72000 |2021-03-22|true     |
|1003       |Bob Johnson   |Engineering|92000 |2020-11-08|true     |
|1004       |Al

In [5]:
# CSV Reading with Custom Delimiters and Null Handling
print("🔧 CSV Reading with Custom Options")
print("=" * 38)

# Read sales data with semicolon delimiter
sales_file = str(data_dir / "sales_semicolon.csv")
print(f"📁 Reading: {sales_file}")

# Read with custom delimiter and null value handling
df_sales = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("sep", ";") \
    .option("nullValue", "") \
    .option("emptyValue", "") \
    .csv(sales_file)

print(f"\n✅ Successfully loaded sales data with custom options")
print(f"📊 Rows: {df_sales.count()}")
print(f"📋 Columns: {len(df_sales.columns)}")

# Display schema
print(f"\n🔍 Schema with null handling:")
df_sales.printSchema()

# Show data with null values
print(f"\n📄 Sales Data (showing null values):")
df_sales.show(10, truncate=False)

# Check for null values in each column
print(f"\n🔍 Null Value Analysis:")
null_counts = {}
for col_name in df_sales.columns:
    null_count = df_sales.filter(df_sales[col_name].isNull()).count()
    null_counts[col_name] = null_count
    print(f"   • {col_name}: {null_count} null values")

# Custom read options examples
print(f"\n⚙️  Other Common CSV Options:")
print(f"   • sep=';'           → Use semicolon as delimiter")
print(f"   • nullValue=''      → Treat empty strings as null")
print(f"   • dateFormat='yyyy-MM-dd' → Custom date format")
print(f"   • timestampFormat   → Custom timestamp format")
print(f"   • quote='\"'         → Quote character")
print(f"   • escape='\\\\'      → Escape character")
print(f"   • ignoreLeadingWhiteSpace=true")
print(f"   • ignoreTrailingWhiteSpace=true")

🔧 CSV Reading with Custom Options
📁 Reading: /Users/sanjeevadodlapati/Downloads/Repos/GeoSpatialAI/projects/project_pyspark_comprehensive_tutorial/data/sales_semicolon.csv

✅ Successfully loaded sales data with custom options
📊 Rows: 8
📋 Columns: 7

🔍 Schema with null handling:
root
 |-- product_id: string (nullable = true)
 |-- product_name: string (nullable = true)
 |-- category: string (nullable = true)
 |-- price: double (nullable = true)
 |-- quantity_sold: integer (nullable = true)
 |-- sale_date: date (nullable = true)
 |-- discount: double (nullable = true)


📄 Sales Data (showing null values):
+----------+------------------+-----------+-------+-------------+----------+--------+
|product_id|product_name      |category   |price  |quantity_sold|sale_date |discount|
+----------+------------------+-----------+-------+-------------+----------+--------+
|P001      |Laptop Pro        |Electronics|1299.99|45           |2024-01-15|0.1     |
|P002      |Wireless Mouse    |Electronics|29.

In [6]:
# CSV Reading with Explicit Schema (Better Performance)
print("⚡ CSV Reading with Explicit Schema")
print("=" * 37)

# Define explicit schema for employee data
employee_schema = StructType([
    StructField("employee_id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("department", StringType(), True),
    StructField("salary", IntegerType(), True),
    StructField("hire_date", DateType(), True),
    StructField("is_active", BooleanType(), True)
])

print("📝 Defined explicit schema:")
print(employee_schema)

# Read with explicit schema (no inference needed)
import time
start_time = time.time()

df_employees_schema = spark.read \
    .option("header", "true") \
    .option("dateFormat", "yyyy-MM-dd") \
    .schema(employee_schema) \
    .csv(employee_file)

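# Note: spark.read is lazy. With an explicit schema the file contents are not
# scanned yet, so this timing covers only the DataFrame definition; the actual
# read happens when count() runs below.
read_time = time.time() - start_time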

print(f"\n✅ Successfully loaded with explicit schema")
print(f"⏱️  Read time: {read_time:.4f} seconds")
print(f"📊 Rows: {df_employees_schema.count()}")

# Compare schemas
print(f"\n🔍 Schema Comparison:")
print(f"Inferred schema types: {[dtype for _, dtype in df_employees.dtypes]}")
print(f"Explicit schema types: {[dtype for _, dtype in df_employees_schema.dtypes]}")

# Verify data is identical
print(f"\n✅ Data verification:")
if df_employees.collect() == df_employees_schema.collect():
    print("   ✓ Data is identical between inferred and explicit schema")
else:
    print("   ⚠️ Data differs between methods")

# Show performance benefits
print(f"\n🚀 Benefits of Explicit Schema:")
print(f"   • Faster reading (no schema inference pass)")
print(f"   • Consistent data types across reads")
print(f"   • Better error handling for malformed data")
print(f"   • Required for streaming applications")
print(f"   • Enables better query optimization")

⚡ CSV Reading with Explicit Schema
📝 Defined explicit schema:
StructType([StructField('employee_id', IntegerType(), True), StructField('name', StringType(), True), StructField('department', StringType(), True), StructField('salary', IntegerType(), True), StructField('hire_date', DateType(), True), StructField('is_active', BooleanType(), True)])

✅ Successfully loaded with explicit schema
⏱️  Read time: 0.0220 seconds
📊 Rows: 10

🔍 Schema Comparison:
Inferred schema types: ['int', 'string', 'string', 'int', 'date', 'boolean']
Explicit schema types: ['int', 'string', 'string', 'int', 'date', 'boolean']

✅ Data verification:
   ✓ Data is identical between inferred and explicit schema

🚀 Benefits of Explicit Schema:
   • Faster reading (no schema inference pass)
   • Consistent data types across reads
   • Better error handling for malformed data
   • Required for streaming applications
   • Enables better query optimization
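
One caveat with explicit schemas: for CSV the schema is applied by position, and by default the header names in the file are not checked against it. The sketch below shows two hedged ways to guard against a reordered or renamed header. The helper names are hypothetical, `EXPECTED_COLUMNS` simply repeats the field names of `employee_schema`, and `spark` is assumed to be an existing SparkSession. The second helper uses the CSV reader's `enforceSchema` option.

```python
import csv
from itertools import zip_longest

# Field names of the employee schema defined in this section.
EXPECTED_COLUMNS = (
    "employee_id", "name", "department", "salary", "hire_date", "is_active",
)


def header_mismatches(csv_path, expected=EXPECTED_COLUMNS, sep=","):
    """Return (position, expected, found) tuples where the header differs."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f, delimiter=sep), [])
    return [
        (i, want, got)
        for i, (want, got) in enumerate(zip_longest(expected, header))
        if want != got
    ]


def read_with_header_validation(spark, csv_path, schema):
    """Ask Spark to validate header names against the schema while reading."""
    return (
        spark.read
        .option("header", "true")
        .option("enforceSchema", "false")  # the default, true, ignores header names
        .schema(schema)
        .csv(csv_path)
    )
```

For example, `header_mismatches(employee_file)` should return an empty list for the sample file. With `enforceSchema` set to `false`, Spark is expected to report an error when the header does not conform to the schema rather than silently loading columns under the wrong names.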

In [7]:
# CSV Writing with Various Options
print("💾 CSV Writing Operations")
print("=" * 26)

# Create a sample DataFrame for writing
sample_data = [
    (1, "Product A", 29.99, "2024-01-01"),
    (2, "Product B", 45.50, "2024-01-02"),
    (3, "Product C", 15.75, "2024-01-03"),
    (4, "Product D", None, "2024-01-04"),  # Null value
    (5, "Product E", 89.99, "2024-01-05")
]

sample_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("product", StringType(), True),
    StructField("price", DoubleType(), True),
    StructField("date", StringType(), True)
])

df_sample = spark.createDataFrame(sample_data, sample_schema)

print("📋 Sample DataFrame to write:")
df_sample.show()

# 1. Basic CSV write with header
output_path_basic = str(temp_dir / "output_basic_csv")
print(f"\n1️⃣ Writing basic CSV to: {output_path_basic}")

df_sample.coalesce(1) \
    .write \
    .mode("overwrite") \
    .option("header", "true") \
    .csv(output_path_basic)

print("   ✅ Basic CSV written successfully")

# 2. CSV write with custom delimiter and null handling
output_path_custom = str(temp_dir / "output_custom_csv")
print(f"\n2️⃣ Writing custom CSV to: {output_path_custom}")

df_sample.coalesce(1) \
    .write \
    .mode("overwrite") \
    .option("header", "true") \
    .option("sep", "|") \
    .option("nullValue", "N/A") \
    .option("dateFormat", "yyyy-MM-dd") \
    .csv(output_path_custom)

print("   ✅ Custom CSV written with pipe delimiter and custom null values")

# 3. CSV write with partitioning
output_path_partitioned = str(temp_dir / "output_partitioned_csv")
print(f"\n3️⃣ Writing partitioned CSV to: {output_path_partitioned}")

# Add a partition column
df_with_partition = df_sample.withColumn("year", lit("2024"))

df_with_partition \
    .write \
    .mode("overwrite") \
    .option("header", "true") \
    .partitionBy("year") \
    .csv(output_path_partitioned)

print("   ✅ Partitioned CSV written successfully")

# Verify written files
print(f"\n📁 Verification - Reading back written files:")

# Read back basic CSV
df_read_basic = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv(output_path_basic)

print(f"\n   Basic CSV read back ({df_read_basic.count()} rows):")
df_read_basic.show(truncate=False)

# Show write options summary
print(f"\n📝 Common CSV Write Options:")
print(f"   • mode('overwrite/append/ignore/error')")
print(f"   • option('header', 'true/false')")
print(f"   • option('sep', 'delimiter')")
print(f"   • option('nullValue', 'custom_null')")
print(f"   • option('dateFormat', 'yyyy-MM-dd')")
print(f"   • option('timestampFormat', 'pattern')")
print(f"   • partitionBy('column1', 'column2')")
print(f"   • coalesce(1) → single output file")

💾 CSV Writing Operations
📋 Sample DataFrame to write:
+---+---------+-----+----------+
| id|  product|price|      date|
+---+---------+-----+----------+
|  1|Product A|29.99|2024-01-01|
|  2|Product B| 45.5|2024-01-02|
|  3|Product C|15.75|2024-01-03|
|  4|Product D| NULL|2024-01-04|
|  5|Product E|89.99|2024-01-05|
+---+---------+-----+----------+


1️⃣ Writing basic CSV to: /Users/sanjeevadodlapati/Downloads/Repos/GeoSpatialAI/projects/project_pyspark_comprehensive_tutorial/temp/output_basic_csv
   ✅ Basic CSV written successfully

2️⃣ Writing custom CSV to: /Users/sanjeevadodlapati/Downloads/Repos/GeoSpatialAI/projects/project_pyspark_comprehensive_tutorial/temp/output_custom_csv
   ✅ Custom CSV written with pipe delimiter and custom null values

3️⃣ Writing partitioned CSV to: /Users/sanjeevadodlapati/Downloads/Repos/GeoSpatialAI/projects/project_pyspark_comprehensive_tutorial/temp/output_partitioned_csv

## 2.4 JSON File Operations

JSON (JavaScript Object Notation) is a popular format for semi-structured data. PySpark reads and writes JSON files, including nested objects and arrays, and can infer a schema from the documents.

**Key JSON Concepts:**
- **Semi-structured Data**: JSON can contain nested objects and arrays
- **Schema Flexibility**: Different records can have different structures
- **Nested Schema Inference**: PySpark infers nested structs and arrays; flattening is explicit (dot notation, `explode`)
- **Multi-line JSON**: Support for both single-line JSON and pretty-printed JSON
- **Schema Evolution**: Handle changing schemas over time
- **Complex Data Types**: Support for arrays, maps, and nested structures
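
The difference between line-delimited and multi-line JSON is worth seeing side by side. The following sketch is an illustration, not part of the original notebook: the `USERS` records, the file names, and the helper functions are hypothetical, and `spark` is assumed to be an existing SparkSession. It writes the same records in both layouts and reads the pretty-printed one with the JSON reader's `multiLine` option.

```python
import json
from pathlib import Path

# Hypothetical records used for both layouts.
USERS = [
    {"id": 1, "name": "Alice", "tags": ["a", "b"]},
    {"id": 2, "name": "Bob", "tags": []},
]


def write_json_lines(directory):
    """One JSON object per line: the layout spark.read.json expects by default."""
    path = Path(directory) / "users.jsonl"
    with open(path, "w", encoding="utf-8") as f:
        for record in USERS:
            f.write(json.dumps(record) + "\n")
    return str(path)


def write_pretty_json(directory):
    """A single pretty-printed JSON array that spans many lines."""
    path = Path(directory) / "users_pretty.json"
    path.write_text(json.dumps(USERS, indent=2), encoding="utf-8")
    return str(path)


def read_json_lines(spark, path):
    """Default mode: every line is parsed as its own JSON document."""
    return spark.read.json(path)


def read_pretty_json(spark, path):
    """multiLine mode: each file is parsed as one JSON document."""
    return spark.read.option("multiLine", "true").json(path)
```

Reading the pretty-printed file without `multiLine` typically yields a `_corrupt_record` column instead of the expected fields, because no single line is a complete JSON object.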

In [8]:
# Create Sample JSON Files and Demonstrate Reading
print("📄 Creating and Reading JSON Files")
print("=" * 36)

import json

# 1. Simple JSON (line-delimited JSON)
simple_json_data = [
    {"id": 1, "name": "Alice", "age": 25, "city": "New York"},
    {"id": 2, "name": "Bob", "age": 30, "city": "San Francisco"},
    {"id": 3, "name": "Charlie", "age": 35, "city": "Chicago"},
    {"id": 4, "name": "Diana", "age": 28, "city": "Boston"}
]

# Write line-delimited JSON
simple_json_file = data_dir / "simple_users.json"
with open(simple_json_file, 'w') as f:
    for record in simple_json_data:
        f.write(json.dumps(record) + '\n')

print(f"✅ Created simple JSON: {simple_json_file.name}")

# 2. Complex nested JSON
complex_json_data = [
    {
        "customer_id": "C001",
        "name": "John Doe",
        "contact": {
            "email": "john@example.com",
            "phone": {"home": "555-1234", "work": "555-5678"}
        },
        "orders": [
            {"order_id": "O001", "amount": 150.00, "items": ["laptop", "mouse"]},
            {"order_id": "O002", "amount": 75.50, "items": ["keyboard"]}
        ],
        "preferences": {"newsletter": True, "sms": False}
    },
    {
        "customer_id": "C002", 
        "name": "Jane Smith",
        "contact": {
            "email": "jane@example.com",
            "phone": {"home": "555-9999"}
        },
        "orders": [
            {"order_id": "O003", "amount": 200.00, "items": ["tablet", "case", "stylus"]}
        ],
        "preferences": {"newsletter": False, "sms": True}
    }
]

# Write complex JSON
complex_json_file = data_dir / "complex_customers.json"
with open(complex_json_file, 'w') as f:
    for record in complex_json_data:
        f.write(json.dumps(record) + '\n')

print(f"✅ Created complex JSON: {complex_json_file.name}")

# 3. Read simple JSON
print(f"\n📖 Reading Simple JSON")
df_simple = spark.read.json(str(simple_json_file))

print(f"Schema for simple JSON:")
df_simple.printSchema()

print(f"\nData preview:")
df_simple.show(truncate=False)

# 4. Read complex JSON
print(f"\n📖 Reading Complex Nested JSON")
df_complex = spark.read.json(str(complex_json_file))

print(f"Schema for complex JSON (note nested structures):")
df_complex.printSchema()

print(f"\nComplex data preview:")
df_complex.show(truncate=False)

# 5. Extracting nested fields
print(f"\n🔍 Extracting Nested Fields")

# Extract email from nested structure
df_emails = df_complex.select(
    "customer_id",
    "name", 
    col("contact.email").alias("email"),
    col("contact.phone.home").alias("home_phone")
)

print(f"Extracted nested fields:")
df_emails.show(truncate=False)

print(f"\n📊 JSON Files Summary:")
print(f"   • Simple JSON: {df_simple.count()} records")
print(f"   • Complex JSON: {df_complex.count()} records")
print(f"   • Nested field extraction: ✅ Successful")

📄 Creating and Reading JSON Files
✅ Created simple JSON: simple_users.json
✅ Created complex JSON: complex_customers.json

📖 Reading Simple JSON
Schema for simple JSON:
root
 |-- age: long (nullable = true)
 |-- city: string (nullable = true)
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)


Data preview:
+---+-------------+---+-------+
|age|city         |id |name   |
+---+-------------+---+-------+
|25 |New York     |1  |Alice  |
|30 |San Francisco|2  |Bob    |
|35 |Chicago      |3  |Charlie|
|28 |Boston       |4  |Diana  |
+---+-------------+---+-------+


📖 Reading Complex Nested JSON
Schema for complex JSON (note nested structures):
root
 |-- contact: struct (nullable = true)
 |    |-- email: string (nullable = true)
 |    |-- phone: struct (nullable = true)
 |    |    |-- home: string (nullable = true)
 |    |    |-- work: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- orders: array (nullable = true)

## 2.5 Parquet File Operations

Parquet is a columnar storage format optimized for analytics workloads. It is widely used for big data processing because of its compression, query performance, and support for schema evolution.

**Key Parquet Benefits:**
- **Columnar Storage**: Only read columns you need
- **Compression**: Built-in compression algorithms (Snappy, GZIP, LZ4, BROTLI)
- **Schema Evolution**: Add/remove columns without breaking compatibility
- **Predicate Pushdown**: Filter data at the storage layer
- **Statistics**: Built-in min/max statistics for efficient querying
- **Cross-Platform**: Works across different big data tools
- **Type Safety**: Preserves data types accurately
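
Column pruning and predicate pushdown only help when the query is written so Spark can apply them. A minimal sketch, assuming an existing SparkSession passed in as `spark` and a Parquet directory containing the `product_id`, `category`, and `price` columns of the e-commerce dataset generated in the next cell; the function name and the default threshold are hypothetical.

```python
# Columns to read; any others in the file can be skipped by the scan.
COLUMNS = ("product_id", "category", "price")


def read_expensive_products(spark, parquet_path, min_price=500.0):
    """Select a few columns and filter early so the Parquet scan can skip data."""
    return (
        spark.read.parquet(parquet_path)
        .select(*COLUMNS)
        # A simple comparison on a column is a filter Spark can push to the reader.
        .where(f"price >= {float(min_price)}")
    )
```

Once a Parquet directory such as `temp_dir / "ecommerce_parquet_snappy"` has been written, calling `.explain()` on the returned DataFrame should show the scan's `PushedFilters` and a `ReadSchema` limited to the selected columns.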

In [9]:
# Parquet Writing with Different Compression Options
print("💾 Parquet Writing Operations")
print("=" * 29)

# Create a larger dataset for compression comparison
import os
import random
import time
from datetime import datetime, timedelta

# Generate sample e-commerce data
def generate_ecommerce_data(num_records=1000):
    categories = ["Electronics", "Clothing", "Home", "Books", "Sports", "Toys"]
    brands = ["BrandA", "BrandB", "BrandC", "BrandD", "BrandE"]
    
    data = []
    base_date = datetime(2024, 1, 1)
    
    for i in range(num_records):
        # Truncate to fixed decimals by hand: the wildcard import of
        # pyspark.sql.functions shadows Python's built-in round()
        price = int(random.uniform(10.0, 1000.0) * 100) / 100.0
        rating = int(random.uniform(0.0, 5.0) * 10) / 10.0
        
        data.append((
            f"P{i:06d}",  # product_id
            f"Product {i}",  # product_name
            random.choice(categories),  # category
            random.choice(brands),  # brand
            price,  # price
            random.randint(0, 500),  # stock_quantity
            (base_date + timedelta(days=random.randint(0, 365))).strftime('%Y-%m-%d'),  # created_date
            random.choice([True, False]),  # is_active
            rating  # rating
        ))
    
    return data

# Generate sample data
print("🔄 Generating sample e-commerce data...")
sample_data = generate_ecommerce_data(1000)

# Define schema
ecommerce_schema = StructType([
    StructField("product_id", StringType(), True),
    StructField("product_name", StringType(), True),
    StructField("category", StringType(), True),
    StructField("brand", StringType(), True),
    StructField("price", DoubleType(), True),
    StructField("stock_quantity", IntegerType(), True),
    StructField("created_date", StringType(), True),
    StructField("is_active", BooleanType(), True),
    StructField("rating", DoubleType(), True)
])

# Create DataFrame
df_ecommerce = spark.createDataFrame(sample_data, ecommerce_schema)

print(f"✅ Generated {df_ecommerce.count()} records")
print(f"📋 Schema:")
df_ecommerce.printSchema()

# Test different compression algorithms
compression_codecs = ["snappy", "gzip", "lz4", "uncompressed"]
compression_results = {}

print(f"\n💾 Writing Parquet with different compression algorithms:")

for codec in compression_codecs:
    output_path = str(temp_dir / f"ecommerce_parquet_{codec}")
    
    print(f"\n🔧 Writing with {codec.upper()} compression...")
    
    # Measure write time (single run; the first codec's timing may include one-time warm-up cost)
    start_time = time.time()
    
    df_ecommerce.coalesce(1) \
        .write \
        .mode("overwrite") \
        .option("compression", codec) \
        .parquet(output_path)
    
    write_time = time.time() - start_time
    
    # Check file size: Spark writes a directory, so sum every Parquet part file in it
    import os
    total_size = sum(
        os.path.getsize(os.path.join(root, name))
        for root, _dirs, names in os.walk(output_path)
        for name in names
        if name.endswith('.parquet')
    )
    
    compression_results[codec] = {
        'write_time': write_time,
        'file_size_mb': total_size / (1024 * 1024)
    }
    
    print(f"   ✅ {codec}: {write_time:.3f}s, {total_size / (1024 * 1024):.2f} MB")

# Display compression comparison
print(f"\n📊 Compression Comparison:")
print(f"{'Codec':<12} {'Write Time (s)':<15} {'File Size (MB)':<15} {'Compression Ratio':<18}")
print("-" * 65)

uncompressed_size = compression_results['uncompressed']['file_size_mb']
for codec, results in compression_results.items():
    ratio = uncompressed_size / results['file_size_mb'] if results['file_size_mb'] > 0 else 0
    # Format the ratio and its "x" suffix together so the suffix is not pushed to the end of the column
    print(f"{codec:<12} {results['write_time']:<15.3f} {results['file_size_mb']:<15.2f} {ratio:.2f}x")

print(f"\n🎯 Recommendations:")
print(f"   • SNAPPY: Best balance of speed and compression")
print(f"   • GZIP: Better compression, slower write/read")
print(f"   • LZ4: Fastest compression/decompression")
print(f"   • Choose based on your read vs write frequency")

💾 Parquet Writing Operations
🔄 Generating sample e-commerce data...
✅ Generated 1000 records
📋 Schema:
root
 |-- product_id: string (nullable = true)
 |-- product_name: string (nullable = true)
 |-- category: string (nullable = true)
 |-- brand: string (nullable = true)
 |-- price: double (nullable = true)
 |-- stock_quantity: integer (nullable = true)
 |-- created_date: string (nullable = true)
 |-- is_active: boolean (nullable = true)
 |-- rating: double (nullable = true)


💾 Writing Parquet with different compression algorithms:

🔧 Writing with SNAPPY compression...
   ✅ snappy: 0.959s, 0.02 MB

🔧 Writing with GZIP compression...
   ✅ gzip: 0.162s, 0.02 MB

🔧 Writing with LZ4 compression...
   ✅ lz4: 0.167s, 0.02 MB

🔧 Writing with UNCOMPRESSED compression...
   ✅ uncompressed: 0.151s, 0.05 MB

📊 Compression Comparison:
Codec        Write Time (s)  File Size (MB)  Compression Ratio
-----------------------------------------------------------------
snappy       0.959           0.02            1.96x
gzip         0.162           0.02            2.91x
lz4          0.167           0.02            2.04x
uncompressed 0.151           0.05            1.00x

🎯 Recommendations:
   • SNAPPY: Best balance of speed and compression
   • GZIP: Better compression, slower write/read
   • LZ4: Fastest compression/decompression
   • Choose based on your read vs write frequency
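
The timings above come from a single run per codec on 1,000 rows, and the first write (Snappy, 0.959s) is several times slower than the others, which points to one-time start-up cost rather than codec speed. Below is a minimal sketch of a fairer measurement, assuming a hypothetical helper `time_runs` that is not part of the notebook: it runs an action several times, discards the first run as warm-up, and reports the median. The workload here is a stand-in so the block runs without Spark.

```python
import statistics
import time


def time_runs(action, repeats=5, warmup=1):
    """Run `action` warmup + repeats times; return the median duration of the timed runs."""
    for _ in range(warmup):
        action()  # discarded: absorbs one-time start-up cost
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


# Stand-in workload (hypothetical); in the notebook this would be one codec's Parquet write.
calls = []
median_seconds = time_runs(lambda: calls.append(sum(range(10_000))), repeats=5, warmup=1)
print(f"median: {median_seconds:.6f}s over 5 timed runs ({len(calls)} calls in total)")
```

In the codec loop, `action` could wrap the existing `df_ecommerce.coalesce(1).write...parquet(output_path)` call, so each codec is compared on its median rather than on a single run.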

In [10]:
# Parquet Reading with Optimization Features
print("📖 Parquet Reading with Advanced Features")
print("=" * 42)

# Use the snappy compressed file for reading demonstrations
parquet_file = str(temp_dir / "ecommerce_parquet_snappy")

print(f"📁 Reading from: {parquet_file}")

# 1. Basic Parquet read
print(f"\n1️⃣ Basic Parquet Read")
df_parquet = spark.read.parquet(parquet_file)

print(f"✅ Loaded {df_parquet.count()} records")
print(f"📋 Full schema preserved:")
df_parquet.printSchema()

# 2. Column selection (columnar advantage)
print(f"\n2️⃣ Column Selection (Columnar Advantage)")
start_time = time.time()

df_selected = spark.read.parquet(parquet_file) \
    .select("product_id", "product_name", "price", "category")

df_selected.show(5, truncate=False)
read_time = time.time() - start_time

print(f"⚡ Column selection read time: {read_time:.4f}s")
print(f"📊 Only selected columns loaded from storage")

# 3. Predicate pushdown (filter at storage level)
print(f"\n3️⃣ Predicate Pushdown (Storage-Level Filtering)")

# Filter expensive electronics
start_time = time.time()

df_filtered = spark.read.parquet(parquet_file) \
    .filter((col("category") == "Electronics") & (col("price") > 500))

expensive_electronics = df_filtered.collect()
filter_time = time.time() - start_time

print(f"🔍 Found {len(expensive_electronics)} expensive electronics")
print(f"⚡ Predicate pushdown time: {filter_time:.4f}s")
print(f"📋 Sample results:")
df_filtered.show(5, truncate=False)

# 4. Combined optimization: column selection + filtering
print(f"\n4️⃣ Combined Optimization")

start_time = time.time()

df_optimized = spark.read.parquet(parquet_file) \
    .select("product_name", "price", "rating") \
    .filter(col("rating") >= 4.0)

high_rated_count = df_optimized.count()
combined_time = time.time() - start_time

print(f"🌟 Found {high_rated_count} high-rated products")
print(f"⚡ Combined optimization time: {combined_time:.4f}s")
print(f"📋 High-rated products:")
df_optimized.orderBy(col("rating").desc()).show(5, truncate=False)

# 5. Schema evolution example
print(f"\n5️⃣ Schema Evolution")

# Add a new column to existing data
df_with_new_column = df_parquet.withColumn("discount_percent", lit(0.1))

# Write with new schema
evolved_path = str(temp_dir / "ecommerce_evolved")
df_with_new_column.coalesce(1).write.mode("overwrite").parquet(evolved_path)

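# Read back the dataset with the added column. To combine files that were written
# with different schemas, read them together with .option("mergeSchema", "true")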
df_evolved = spark.read.parquet(evolved_path)
print(f"✅ Schema evolution successful")
print(f"📋 New schema with added column:")
df_evolved.printSchema()

# Performance summary
print(f"\n🚀 Parquet Performance Benefits:")
print(f"   • Columnar storage: Only read needed columns")
print(f"   • Predicate pushdown: Filter data at storage layer")
print(f"   • Compression: Reduce I/O with efficient codecs")
print(f"   • Schema evolution: Add/remove columns safely")
print(f"   • Statistics: Built-in min/max for query optimization")
print(f"   • Cross-platform: Works with all big data tools")

📖 Parquet Reading with Advanced Features
📁 Reading from: /Users/sanjeevadodlapati/Downloads/Repos/GeoSpatialAI/projects/project_pyspark_comprehensive_tutorial/temp/ecommerce_parquet_snappy

1️⃣ Basic Parquet Read
✅ Loaded 1000 records
📋 Full schema preserved:
root
 |-- product_id: string (nullable = true)
 |-- product_name: string (nullable = true)
 |-- category: string (nullable = true)
 |-- brand: string (nullable = true)
 |-- price: double (nullable = true)
 |-- stock_quantity: integer (nullable = true)
 |-- created_date: string (nullable = true)
 |-- is_active: boolean (nullable = true)
 |-- rating: double (nullable = true)


2️⃣ Column Selection (Columnar Advantage)
+----------+------------+------+-----------+
|product_id|product_name|price |category   |
+----------+------------+------+-----------+
|P000000   |Product 0   |643.03|Home       |
|P000001   |Product 1   |679.93|Electronics|
|P000002   |Product 2   |102.75|Sports     |
|P000003   |Product 3   |228.23|Electronics|
|P000

In [11]:
# Partitioned Parquet Operations
print("🗂️  Partitioned Parquet Operations")
print("=" * 34)

# Create partitioned dataset for better query performance
print("📊 Creating partitioned Parquet dataset...")

# Write data partitioned by category
partitioned_path = str(temp_dir / "ecommerce_partitioned")

df_parquet.write \
    .mode("overwrite") \
    .partitionBy("category") \
    .option("compression", "snappy") \
    .parquet(partitioned_path)

print(f"✅ Created partitioned dataset at: {partitioned_path}")

# Explore partition structure
import os
partitions = []
for item in os.listdir(partitioned_path):
    if item.startswith("category="):
        partitions.append(item)

print(f"\n📁 Partition Structure:")
for partition in sorted(partitions):
    partition_path = os.path.join(partitioned_path, partition)
    file_count = len([f for f in os.listdir(partition_path) if f.endswith('.parquet')])
    print(f"   • {partition}: {file_count} parquet file(s)")

# Read specific partitions (partition pruning)
print(f"\n🔍 Partition Pruning Example")

start_time = time.time()

# Read only Electronics partition
df_electronics = spark.read.parquet(partitioned_path) \
    .filter(col("category") == "Electronics")

electronics_count = df_electronics.count()
pruning_time = time.time() - start_time

print(f"⚡ Electronics partition read time: {pruning_time:.4f}s")
print(f"📊 Found {electronics_count} electronics products")

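# Compare with filtering the unpartitioned dataset, where no directories can be skipped
start_time = time.time()

df_all_then_filter = spark.read.parquet(parquet_file) \
    .filter(col("category") == "Electronics")

# Spark is lazy: without an action such as count(), nothing is read and the timing is meaningless
all_then_filter_count = df_all_then_filter.count()
all_then_filter_time = time.time() - start_time

print(f"⚡ Read-all-then-filter time: {all_then_filter_time:.4f}s")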

# Multiple partition read
print(f"\n📚 Reading Multiple Partitions")

df_multiple = spark.read.parquet(partitioned_path) \
    .filter(col("category").isin(["Electronics", "Books"]))

print(f"📊 Electronics + Books: {df_multiple.count()} products")
df_multiple.groupBy("category").count().show()

# Write with multiple partition columns
print(f"\n🗂️  Multi-Level Partitioning")

# Add year partition from created_date (a 'YYYY-MM-DD' string, so the first 4 characters are the year)
df_with_year = df_parquet.withColumn("year", col("created_date").substr(1, 4))

multi_partitioned_path = str(temp_dir / "ecommerce_multi_partitioned")

df_with_year.write \
    .mode("overwrite") \
    .partitionBy("category", "year") \
    .option("compression", "snappy") \
    .parquet(multi_partitioned_path)

print(f"✅ Created multi-level partitioned dataset")

# Demonstrate efficient querying
print(f"\n🎯 Efficient Partitioned Query")

df_specific = spark.read.parquet(multi_partitioned_path) \
    .filter((col("category") == "Electronics") & (col("year") == "2024")) \
    .select("product_name", "price", "rating") \
    .filter(col("price") > 100)

print(f"📋 High-value electronics in 2024:")
df_specific.orderBy(col("price").desc()).show(5, truncate=False)

# Partitioning best practices
print(f"\n💡 Partitioning Best Practices:")
print(f"   • Partition by columns frequently used in WHERE clauses")
print(f"   • Avoid tiny files inside partitions (< 1 MB each)")
print(f"   • Avoid too many partitions (creates small files)")
print(f"   • Consider cardinality: 100-10,000 partitions is typical")
print(f"   • Use date/time partitioning for time-series data")
print(f"   • Partition pruning works with =, IN, and range filters on partition columns")
print(f"   • Multi-level partitioning: category/year/month/day")

🗂️  Partitioned Parquet Operations
📊 Creating partitioned Parquet dataset...
✅ Created partitioned dataset at: /Users/sanjeevadodlapati/Downloads/Repos/GeoSpatialAI/projects/project_pyspark_comprehensive_tutorial/temp/ecommerce_partitioned

📁 Partition Structure:
   • category=Books: 1 parquet file(s)
   • category=Clothing: 1 parquet file(s)
   • category=Electronics: 1 parquet file(s)
   • category=Home: 1 parquet file(s)
   • category=Sports: 1 parquet file(s)
   • category=Toys: 1 parquet file(s)

🔍 Partition Pruning Example
⚡ Electronics partition read time: 0.1744s
📊 Found 160 electronics products

## 2.6 Module 2 Summary

🎉 **Congratulations!** You've completed Module 2: Data Ingestion & I/O Operations

### What We Covered:

✅ **Environment Setup**
- Optimized SparkSession for I/O operations
- Project directory structure
- Import dependencies

✅ **CSV Operations**
- Schema inference vs explicit schema definition
- Custom delimiters and null value handling
- Writing with various options and partitioning
- Performance considerations

✅ **JSON Operations**
- Line-delimited JSON processing
- Nested JSON structure handling
- Field extraction from complex schemas
- Schema inference for semi-structured data

✅ **Parquet Operations**
- Compression algorithm comparison (Snappy, GZIP, LZ4)
- Columnar storage advantages
- Predicate pushdown optimization
- Schema evolution capabilities
- Partitioned datasets for query performance
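
Partitioned datasets speed up queries because the partition value is encoded in the directory name (for example `category=Electronics`), so a filter on that column can be answered by choosing directories before any file is opened. Here is a conceptual sketch in plain Python, assuming two hypothetical helpers, `parse_partition_path` and `prune`. It illustrates the idea only and is not how Spark implements pruning.

```python
def parse_partition_path(relative_path):
    """Extract {column: value} from Hive-style segments such as 'category=Books/year=2024'."""
    values = {}
    for segment in relative_path.split("/"):
        if "=" in segment:
            column, value = segment.split("=", 1)
            values[column] = value
    return values


def prune(partition_dirs, column, wanted):
    """Keep only the directories whose partition value for `column` is in `wanted`."""
    return [d for d in partition_dirs if parse_partition_path(d).get(column) in wanted]


# Directory layout modeled on the multi-level partitioning in this module.
partition_dirs = [
    "category=Books/year=2024",
    "category=Clothing/year=2024",
    "category=Electronics/year=2024",
    "category=Home/year=2024",
    "category=Sports/year=2024",
    "category=Toys/year=2024",
]
selected = prune(partition_dirs, "category", {"Electronics", "Books"})
print(selected)
```

Only two of the six directories survive the filter, which mirrors what the `isin(["Electronics", "Books"])` query in this module asks Spark to read.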

### Key Performance Insights:

🚀 **Parquet Benefits:**
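- **Columnar Storage**: Reads only the columns a query selects; a common rule of thumb is 2-5x faster analytical queries
- **Compression**: Roughly 50-65% size reduction with Snappy/GZIP in our test (compression ratios of 1.96x and 2.91x)
- **Predicate Pushdown**: Filters are applied at the storage layer, before rows reach Spark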
- **Schema Evolution**: Safe column additions/removals
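
A compression ratio and a size reduction describe the same measurement in two ways: reduction = 1 − 1/ratio. This short worked sketch uses the ratios printed in this module's compression comparison (1.96x, 2.91x, 2.04x); the helper name `size_reduction_percent` is an assumption introduced for illustration.

```python
def size_reduction_percent(compression_ratio):
    """Convert a ratio (uncompressed size / compressed size) into percent saved."""
    if compression_ratio <= 0:
        raise ValueError("compression ratio must be positive")
    return (1 - 1 / compression_ratio) * 100


# Ratios taken from the compression comparison output in this module.
measured_ratios = {"snappy": 1.96, "gzip": 2.91, "lz4": 2.04}
reductions = {codec: size_reduction_percent(r) for codec, r in measured_ratios.items()}

for codec, pct in reductions.items():
    print(f"{codec}: {pct:.1f}% smaller than uncompressed")
```

On this 1,000-row sample that works out to about 49% for Snappy, 66% for GZIP, and 51% for LZ4; larger or differently shaped data may compress differently.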

📊 **Best Practices Learned:**
- Use explicit schemas for production workloads
- Choose compression based on read vs write frequency
- Partition by frequently queried columns
- Avoid over-partitioning (keep files > 1MB)
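
One way to check the last point is to scan a written dataset for undersized files. The sketch below assumes a hypothetical helper `find_small_parquet_files` and uses the 1 MB rule of thumb from this module as its default threshold. It relies only on the standard library and demonstrates itself on a throwaway directory, not on the tutorial's data.

```python
import os
import tempfile


def find_small_parquet_files(dataset_path, min_bytes=1024 * 1024):
    """Return (relative_path, size_in_bytes) for every .parquet file smaller than min_bytes."""
    small = []
    for root, _dirs, names in os.walk(dataset_path):
        for name in names:
            if name.endswith(".parquet"):
                full_path = os.path.join(root, name)
                size = os.path.getsize(full_path)
                if size < min_bytes:
                    small.append((os.path.relpath(full_path, dataset_path), size))
    return sorted(small)


# Demo: fake a partitioned layout with one tiny file and one 2 MB file.
with tempfile.TemporaryDirectory() as demo_dir:
    for category, size in [("Books", 10), ("Toys", 2 * 1024 * 1024)]:
        part_dir = os.path.join(demo_dir, f"category={category}")
        os.makedirs(part_dir)
        with open(os.path.join(part_dir, "part-00000.parquet"), "wb") as fh:
            fh.write(b"\0" * size)
    small_files = find_small_parquet_files(demo_dir)

print(small_files)
```

Pointing the helper at `partitioned_path` from this module would flag every partition, since the whole 1,000-row dataset is far below 1 MB; that is expected for a tutorial-sized sample.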

### Next Steps:

The next modules in our PySpark tutorial will cover:
- **Module 3**: Data Transformations & Operations
- **Module 4**: SQL & DataFrame API
- **Module 5**: Performance Optimization
- **Module 6**: Machine Learning with MLlib
- **Module 7**: Streaming Data Processing
- **Module 8**: Production Deployment

Ready to continue your PySpark journey! 🚀