# 💥 Explode, Arrays & Maps: Complex Data Types in DataFrames

**Time to complete:** 35 minutes  
**Difficulty:** Intermediate  
**Prerequisites:** DataFrame basics, column expressions

---

## 🎯 Learning Objectives

By the end of this notebook, you will master:
- ✅ **`explode()`** - Convert arrays to rows
- ✅ **Array operations** - Manipulate array columns
- ✅ **Map operations** - Work with key-value pairs
- ✅ **Struct operations** - Handle nested data
- ✅ **JSON data handling** - Parse and process JSON
- ✅ **Complex data transformations** - Nested operations

**Essential for handling real-world JSON and nested data!**

---

## 🔍 Understanding Complex Data Types

**Spark DataFrames support complex data types beyond simple strings and numbers:**

- **Arrays**: Ordered collections `[item1, item2, item3]`
- **Maps**: Key-value pairs `{"key1": "value1", "key2": "value2"}`
- **Structs**: Named tuples with fields `{field1: value1, field2: value2}`

### Why Complex Types Matter:
- **JSON data** often contains nested structures
- **Real-world data** is rarely flat
- **Performance** - Nested, pre-joined data can avoid expensive joins at query time
- **Flexibility** - Handle variable-length data

**`explode()` is your gateway to flattening complex data!**

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, explode_outer, posexplode, arrays_zip
from pyspark.sql.functions import array, array_contains, array_distinct, array_sort
from pyspark.sql.functions import map_keys, map_values, map_from_arrays
from pyspark.sql.functions import struct, col, when, lit
from pyspark.sql.types import ArrayType, MapType, StructType, StructField, StringType, IntegerType
import pyspark.sql.functions as F

spark = SparkSession.builder \
    .appName("Explode_Arrays_Maps") \
    .master("local[*]") \
    .getOrCreate()

print(f"✅ Spark ready - Version: {spark.version}")

# Create sample data with complex types
complex_data = [
    ("Alice", ["Python", "Java", "SQL"], {"experience": 5, "level": "Senior"}, 
     {"city": "New York", "country": "USA"}),
    ("Bob", ["Scala", "Python"], {"experience": 3, "level": "Mid"}, 
     {"city": "London", "country": "UK"}),
    ("Charlie", ["Java", "C++", "JavaScript"], {"experience": 7, "level": "Senior"}, 
     {"city": "Tokyo", "country": "Japan"}),
    ("Diana", ["R", "Python", "SQL"], {"experience": 4, "level": "Mid"}, 
     {"city": "Berlin", "country": "Germany"})
]

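# Python dicts are inferred as maps with a single value type, and the mixed
# int/str values in "profile" would fail that inference. Declaring the schema
# makes "profile" and "location" proper structs.
complex_schema = (
    "name string, skills array<string>, "
    "profile struct<experience:int, level:string>, "
    "location struct<city:string, country:string>"
)
df = spark.createDataFrame(complex_data, schema=complex_schema)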

print("📊 Complex Data Dataset:")
df.show(truncate=False)
df.printSchema()

## 💥 Explode Operations

### Basic Explode: Arrays to Rows

In [None]:
# Explode operations
print("💥 EXPLODE OPERATIONS")
print("=" * 50)

# Basic explode - convert array elements to separate rows
exploded_df = df.withColumn("skill", explode("skills"))

print("Original data:")
df.select("name", "skills").show(truncate=False)

print("\nAfter explode (skills become separate rows):")
exploded_df.select("name", "skill").show()

print(f"\nOriginal rows: {df.count()}")
print(f"Exploded rows: {exploded_df.count()}")
print(f"Distinct skills: {exploded_df.select('skill').distinct().count()}")

### Explode Outer: Handle Null Arrays

In [None]:
# Create data with null/empty arrays
null_data = [
    ("Alice", ["Python", "Java"]),
    ("Bob", []),  # Empty array
    ("Charlie", None),  # Null array
    ("Diana", ["SQL"])
]

null_df = spark.createDataFrame(null_data, ["name", "skills"])
print("Data with null/empty arrays:")
null_df.show()

# explode() does not raise on null/empty arrays: it silently drops those rows
print("\n❌ explode() drops Bob (empty array) and Charlie (null array):")
null_df.withColumn("skill", explode("skills")).show()

# explode_outer() keeps those rows and emits a null skill instead
print("\n✅ explode_outer() keeps rows with null/empty arrays:")
null_df.withColumn("skill", explode_outer("skills")).show()

# posexplode() returns two columns (position and value), so it needs
# select() with two aliases rather than withColumn()
print("\n📍 posexplode() with position:")
null_df.select("name", posexplode("skills").alias("pos", "skill")).show()
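
To make the difference between the three generators concrete, here is a plain-Python model of their row semantics. It is only a sketch, not Spark code: the helper names `explode_rows`, `explode_outer_rows`, and `posexplode_rows` are made up for illustration, and each row is modeled as a `(name, skills)` tuple.

```python
# Plain-Python model of explode semantics (illustrative only, not Spark's implementation).
# Rows are (name, skills) tuples; skills may be a list, an empty list, or None.

def explode_rows(rows):
    """Model of explode(): one output row per element; null/empty arrays yield nothing."""
    return [(name, skill) for name, skills in rows for skill in (skills or [])]

def explode_outer_rows(rows):
    """Model of explode_outer(): null/empty arrays yield a single row with a None value."""
    out = []
    for name, skills in rows:
        if skills:
            out.extend((name, skill) for skill in skills)
        else:
            out.append((name, None))
    return out

def posexplode_rows(rows):
    """Model of posexplode(): like explode(), plus the 0-based position of each element."""
    return [
        (name, pos, skill)
        for name, skills in rows
        for pos, skill in enumerate(skills or [])
    ]

rows = [("Alice", ["Python", "Java"]), ("Bob", []), ("Charlie", None), ("Diana", ["SQL"])]

print(explode_rows(rows))        # Bob and Charlie are gone
print(explode_outer_rows(rows))  # Bob and Charlie survive with a None skill
print(posexplode_rows(rows))     # position restarts at 0 for every input row
```

The model makes the trade-off visible: `explode()` shrinks the set of names, while `explode_outer()` always keeps at least one row per input row.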

### Multiple Explodes and Complex Flattening

In [None]:
# Complex explode scenarios
print("🔀 COMPLEX EXPLODE SCENARIOS")
print("=" * 50)

# Create nested array data
nested_data = [
    ("Project_A", [["Alice", "Bob"], ["Charlie"]], ["backend", "frontend"]),
    ("Project_B", [["Diana"], ["Eve", "Frank"]], ["database", "api"])
]

nested_df = spark.createDataFrame(nested_data, 
    ["project", "teams", "components"])

print("Nested array data:")
nested_df.show(truncate=False)

# Single explode
single_explode = nested_df.withColumn("team", explode("teams"))
print("\nSingle explode (teams):")
single_explode.select("project", "team").show(truncate=False)

# Double explode (flatten nested arrays)
double_explode = single_explode.withColumn("member", explode("team"))
print("\nDouble explode (team members):")
double_explode.select("project", "member").show()

# Pair arrays element-wise with arrays_zip (positional pairing, not a cross product)
zipped_df = nested_df.withColumn(
    "team_component", arrays_zip("teams", "components")
).withColumn(
    "zipped", explode("team_component")
)

print("\nZipped arrays (team ↔ component):")
zipped_df.select("project", "zipped").show(truncate=False)
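
`arrays_zip` pairs elements by position and pads shorter arrays with null. The plain-Python sketch below contrasts that with a true cross product. The helper `arrays_zip_model` is a hypothetical name used only for this illustration, and the code models the semantics rather than calling Spark.

```python
from itertools import product, zip_longest

def arrays_zip_model(*arrays):
    """Pair elements by position; shorter arrays are padded with None (null)."""
    return list(zip_longest(*arrays, fillvalue=None))

teams = [["Alice", "Bob"], ["Charlie"]]
components = ["backend", "frontend"]

zipped = arrays_zip_model(teams, components)
crossed = list(product(teams, components))

print(zipped)        # one pair per position
print(len(crossed))  # every team combined with every component

# Arrays of different lengths: the missing positions become None
print(arrays_zip_model([1, 2, 3], ["a"]))
```

If every team really should be combined with every component, two successive explodes (one per array) produce the cross product instead.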

## 📊 Array Operations

### Creating and Manipulating Arrays

In [None]:
# Array operations
print("📊 ARRAY OPERATIONS")
print("=" * 50)

# Create arrays from columns
array_data = [
    ("Alice", "Python", "Java", "SQL"),
    ("Bob", "Scala", "Python", None),
    ("Charlie", "Java", "C++", "JavaScript")
]

array_df = spark.createDataFrame(array_data, 
    ["name", "skill1", "skill2", "skill3"])

# Create array column from existing columns
array_created = array_df.withColumn(
    "skills_array", array("skill1", "skill2", "skill3")
)

print("Created array from columns:")
array_created.select("name", "skills_array").show(truncate=False)

# Array operations
array_ops = array_created.withColumn(
    "array_size", F.size("skills_array")
).withColumn(
    "has_python", array_contains("skills_array", "Python")
).withColumn(
    "distinct_skills", array_distinct("skills_array")
).withColumn(
    "sorted_skills", array_sort("skills_array")
).withColumn(
    "first_skill", col("skills_array")[0]
).withColumn(
    "last_two", F.slice("skills_array", -2, 2)
)

print("\nArray operations:")
array_ops.select(
    "name", "skills_array", "array_size", "has_python", 
    "distinct_skills", "first_skill", "last_two"
).show(truncate=False)

### Advanced Array Transformations

In [None]:
# Advanced array transformations
print("🔧 ADVANCED ARRAY TRANSFORMATIONS")
print("=" * 50)

# Transform array elements
transformed_arrays = array_created.withColumn(
    "upper_skills", F.transform("skills_array", lambda x: F.upper(x))
).withColumn(
    "skill_lengths", F.transform("skills_array", lambda x: F.length(x))
).withColumn(
    "filtered_skills", F.filter("skills_array", lambda x: x.isNotNull())
).withColumn(
    "skill_exists", F.exists("skills_array", lambda x: x == "Python")
)

print("Array element transformations:")
transformed_arrays.select(
    "name", "skills_array", "upper_skills", "skill_lengths", 
    "filtered_skills", "skill_exists"
).show(truncate=False)

# Array aggregation
array_agg_df = array_created.withColumn(
    "concat_skills", F.array_join("skills_array", ", ")
).withColumn(
    "max_length", F.array_max(F.transform("skills_array", lambda x: F.length(x)))
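).withColumn(
    # aggregate() expects a Column as its initial value, hence lit(0).
    # A null element (Bob's missing third skill) makes the whole sum null.
    "total_length", F.aggregate(
        F.transform("skills_array", lambda x: F.length(x)),
        lit(0),
        lambda acc, x: acc + x
    )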
)

print("\nArray aggregations:")
array_agg_df.select(
    "name", "concat_skills", "max_length", "total_length"
).show(truncate=False)
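
The `aggregate` call above is a left fold over the array. A plain-Python sketch of the same fold, assuming SQL-style null propagation, shows why a single null element turns the whole total into null. The helpers `sql_add`, `length_or_none`, and `total_length` are hypothetical names for this illustration only.

```python
from functools import reduce

def sql_add(acc, x):
    """SQL-style addition: any null operand makes the result null."""
    return None if acc is None or x is None else acc + x

def length_or_none(s):
    """Model of length(): null in, null out."""
    return None if s is None else len(s)

def total_length(skills):
    """Model of aggregate(transform(skills, length), 0, acc + x)."""
    return reduce(sql_add, [length_or_none(s) for s in skills], 0)

print(total_length(["Python", "Java", "SQL"]))  # 6 + 4 + 3
print(total_length(["Scala", "Python", None]))  # the null element poisons the sum

# Dropping nulls first (what filter(..., isNotNull) does) restores a numeric total
non_null = [s for s in ["Scala", "Python", None] if s is not None]
print(total_length(non_null))
```

In the DataFrame version, the equivalent remedy would be to aggregate over the filtered array rather than the raw one.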

## 🗺️ Map Operations

### Creating and Using Maps

In [None]:
# Map operations
print("🗺️ MAP OPERATIONS")
print("=" * 50)

# Create maps from data
map_data = [
    ("Alice", ["Python", "Java"], [4, 3]),
    ("Bob", ["Scala", "Python"], [2, 4]),
    ("Charlie", ["Java", "C++"], [5, 2])
]

map_df = spark.createDataFrame(map_data, 
    ["name", "skills", "experience_years"])

# Create map from arrays
map_created = map_df.withColumn(
    "skill_experience", map_from_arrays("skills", "experience_years")
)

print("Created maps from arrays:")
map_created.select("name", "skill_experience").show(truncate=False)

# Map operations
map_ops = map_created.withColumn(
    "map_keys", map_keys("skill_experience")
).withColumn(
    "map_values", map_values("skill_experience")
).withColumn(
    "map_size", F.size("skill_experience")
).withColumn(
    "python_exp", col("skill_experience")["Python"]
).withColumn(
    "has_java", F.map_contains_key("skill_experience", "Java")
)

print("\nMap operations:")
map_ops.select(
    "name", "skill_experience", "map_keys", "map_values", 
    "map_size", "python_exp", "has_java"
).show(truncate=False)

## 🏗️ Struct Operations

### Working with Nested Structures

In [None]:
# Struct operations
print("🏗️ STRUCT OPERATIONS")
print("=" * 50)

# Create struct columns
struct_df = df.withColumn(
    "profile_struct", struct(
        col("profile.experience").alias("years"),
        col("profile.level").alias("seniority")
    )
).withColumn(
    "location_struct", struct(
        col("location.city").alias("city"),
        col("location.country").alias("country")
    )
)

print("Created struct columns:")
struct_df.select("name", "profile_struct", "location_struct").show(truncate=False)

# Access struct fields
struct_access = struct_df.withColumn(
    "years_exp", col("profile_struct.years")
).withColumn(
    "seniority", col("profile_struct.seniority")
).withColumn(
    "city", col("location_struct.city")
).withColumn(
    "country", col("location_struct.country")
)

print("\nAccessing struct fields:")
struct_access.select(
    "name", "years_exp", "seniority", "city", "country"
).show()

# Complex struct operations
complex_struct = struct_df.withColumn(
    "employee_summary", struct(
        col("name"),
        col("profile_struct"),
        col("location_struct"),
        F.size("skills").alias("skill_count")
    )
)

print("\nNested struct with summary:")
complex_struct.select("employee_summary").show(truncate=False)

## 📋 JSON Data Processing

### Parsing JSON Strings

In [None]:
# JSON processing
print("📋 JSON DATA PROCESSING")
print("=" * 50)

# Create JSON data
json_strings = [
    '{"name": "Alice", "skills": ["Python", "Java"], "profile": {"experience": 5, "level": "Senior"}}',
    '{"name": "Bob", "skills": ["Scala", "Python"], "profile": {"experience": 3, "level": "Mid"}}',
    '{"name": "Charlie", "skills": ["Java", "C++"], "profile": {"experience": 7, "level": "Senior"}}'
]

json_df = spark.createDataFrame([(json_str,) for json_str in json_strings], ["json_data"])

print("JSON strings:")
json_df.show(truncate=False)

# Parse JSON
parsed_df = json_df.withColumn(
    "parsed", F.from_json("json_data", 
        "name string, skills array<string>, profile struct<experience int, level string>"
    )
)

print("\nParsed JSON:")
parsed_df.select("parsed").show(truncate=False)

# Extract from parsed JSON
extracted_df = parsed_df.withColumn(
    "employee_name", col("parsed.name")
).withColumn(
    "skill_list", col("parsed.skills")
).withColumn(
    "experience", col("parsed.profile.experience")
).withColumn(
    "level", col("parsed.profile.level")
)

print("\nExtracted JSON fields:")
extracted_df.select(
    "employee_name", "skill_list", "experience", "level"
).show(truncate=False)

# Explode JSON arrays
final_df = extracted_df.withColumn("skill", explode_outer("skill_list"))
print("\nExploded skills from JSON:")
final_df.select("employee_name", "skill", "experience", "level").show()
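
As a cross-check on what the parse-then-explode pipeline should produce, the same flattening can be sketched with the standard `json` module. This is a plain-Python model, not Spark code: `flatten_record` is a hypothetical helper, and the third record (without a `skills` field) is invented here to show the `explode_outer`-style case where the row is kept with a null skill.

```python
import json

json_strings = [
    '{"name": "Alice", "skills": ["Python", "Java"], "profile": {"experience": 5, "level": "Senior"}}',
    '{"name": "Bob", "skills": ["Scala", "Python"], "profile": {"experience": 3, "level": "Mid"}}',
    # Hypothetical record without a "skills" field
    '{"name": "Erin", "profile": {"experience": 1, "level": "Junior"}}',
]

def flatten_record(json_str):
    """Parse one JSON string and emit one (name, skill, experience, level) row per skill."""
    record = json.loads(json_str)
    profile = record.get("profile") or {}
    # Missing or empty skills still produce one row, mirroring explode_outer()
    skills = record.get("skills") or [None]
    return [
        (record.get("name"), skill, profile.get("experience"), profile.get("level"))
        for skill in skills
    ]

flat_rows = [row for s in json_strings for row in flatten_record(s)]
for row in flat_rows:
    print(row)
```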

## 🎨 Real-World Use Cases

### Processing Nested Customer Data

In [None]:
# Real-world example: Customer analytics
print("🎨 REAL-WORLD USE CASES")
print("=" * 50)

# Simulate customer data with complex structures
customer_data = [
    {
        "customer_id": "C001",
        "name": "Alice Johnson",
        "orders": [
            {"order_id": "O001", "amount": 150.00, "items": ["laptop", "mouse"]},
            {"order_id": "O002", "amount": 75.50, "items": ["book"]}
        ],
        "preferences": {"category": "electronics", "notifications": True},
        "addresses": [
            {"type": "home", "city": "New York", "country": "USA"},
            {"type": "work", "city": "Newark", "country": "USA"}
        ]
    },
    {
        "customer_id": "C002",
        "name": "Bob Smith",
        "orders": [
            {"order_id": "O003", "amount": 200.00, "items": ["phone", "case"]}
        ],
        "preferences": {"category": "mobile", "notifications": False},
        "addresses": [
            {"type": "home", "city": "London", "country": "UK"}
        ]
    }
]

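# Create DataFrame from nested data. An explicit schema is needed here:
# inferring types from nested dicts yields maps with a single value type,
# which cannot represent the mixed fields of an order.
customer_schema = (
    "customer_id string, name string, "
    "orders array<struct<order_id:string, amount:double, items:array<string>>>, "
    "preferences struct<category:string, notifications:boolean>, "
    "addresses array<struct<type:string, city:string, country:string>>"
)
customer_df = spark.createDataFrame(customer_data, schema=customer_schema)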

print("Customer data with nested structures:")
customer_df.show(truncate=False)
customer_df.printSchema()

# Extract order details (explode orders)
order_details = customer_df.withColumn(
    "order", explode("orders")
).select(
    "customer_id", "name",
    col("order.order_id").alias("order_id"),
    col("order.amount").alias("order_amount"),
    col("order.items").alias("order_items")
)

print("\nFlattened order details:")
order_details.show()

# Extract item-level details (double explode)
item_details = order_details.withColumn(
    "item", explode("order_items")
).select("customer_id", "name", "order_id", "order_amount", "item")

print("\nItem-level details:")
item_details.show()

# Customer analytics
analytics = customer_df.withColumn(
    "total_orders", F.size("orders")
).withColumn(
    "total_spent", F.aggregate(
        F.transform("orders", lambda x: x.amount),
        lit(0.0),  # the initial value must be a Column, not a Python float
        lambda acc, x: acc + x
    )
).withColumn(
    "avg_order_value", col("total_spent") / col("total_orders")
).withColumn(
    "preferred_category", col("preferences.category")
).withColumn(
    "home_city", F.filter("addresses", lambda x: x.type == "home")[0].city
)

print("\nCustomer analytics summary:")
analytics.select(
    "customer_id", "name", "total_orders", "total_spent", 
    "avg_order_value", "preferred_category", "home_city"
).show()
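
When building analytics like these, it can help to compute the expected numbers independently. The sketch below does that in plain Python over a trimmed copy of the sample customers. `summarize` is a hypothetical helper and the code is a model of the intended result, not of how Spark evaluates it.

```python
customers = [
    {
        "customer_id": "C001",
        "orders": [{"amount": 150.00}, {"amount": 75.50}],
        "addresses": [
            {"type": "home", "city": "New York"},
            {"type": "work", "city": "Newark"},
        ],
    },
    {
        "customer_id": "C002",
        "orders": [{"amount": 200.00}],
        "addresses": [{"type": "home", "city": "London"}],
    },
]

def summarize(customer):
    """Compute order count, total, average, and home city for one customer."""
    amounts = [order["amount"] for order in customer["orders"]]
    total = sum(amounts, 0.0)
    # First address of type "home", or None when there is none
    home = next((a["city"] for a in customer["addresses"] if a["type"] == "home"), None)
    return {
        "customer_id": customer["customer_id"],
        "total_orders": len(amounts),
        "total_spent": total,
        "avg_order_value": total / len(amounts) if amounts else None,
        "home_city": home,
    }

summaries = [summarize(c) for c in customers]
for s in summaries:
    print(s)
```

Comparing these values with the `analytics` DataFrame output is a quick way to catch schema or aggregation mistakes.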

## 🚨 Common Mistakes and Solutions

In [None]:
# Common mistakes
print("🚨 COMMON MISTAKES WITH COMPLEX DATA TYPES")
print("=" * 60)

# Mistake 1: explode() on null arrays (no error is raised, rows just disappear)
print("❌ Mistake: explode() silently drops rows with null arrays")
null_array_df = spark.createDataFrame([("Alice", None), ("Bob", ["Java"])], ["name", "skills"])
null_array_df.withColumn("skill", explode("skills")).show()  # Alice is missing

print("\n✅ Solution: Use explode_outer()")
null_array_df.withColumn("skill", explode_outer("skills")).show()

# Mistake 2: Accessing non-existent struct fields
print("\n❌ Mistake: Wrong struct field access")
# Declare "info" as a struct; an inferred dict would become a single-typed map
test_struct = spark.createDataFrame(
    [("Alice", {"age": 25, "city": "NY"})],
    "name string, info struct<age:int, city:string>"
)
try:
    test_struct.withColumn("wrong_field", col("info.nonexistent")).show()
except Exception as e:
    print(f"Error: {str(e)[:80]}...")

print("\n✅ Solution: Check schema first")
test_struct.printSchema()
print("Correct access:")
test_struct.withColumn("age", col("info.age")).withColumn("city", col("info.city")).show()

# Mistake 3: Array index out of bounds
# With ANSI mode off, an out-of-range index returns null instead of raising;
# with spark.sql.ansi.enabled=true the same access raises an error.
print("\n❌ Mistake: Array index out of bounds")
array_bounds = spark.createDataFrame([("Alice", ["Java"]), ("Bob", ["Python", "Scala"])], ["name", "skills"])
try:
    array_bounds.withColumn("third_skill", col("skills")[2]).show()
except Exception as e:
    print(f"Error: {str(e)[:80]}...")

print("\n✅ Solution: Check array size first")
array_bounds.withColumn("array_size", F.size("skills")) \
    .withColumn("second_skill", 
                when(F.size("skills") > 1, col("skills")[1])
                .otherwise("N/A")) \
    .show()
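
The two behaviours of out-of-range indexing, and the size check that avoids depending on either, can be modeled in a few lines of plain Python. This is a sketch under the assumption that ANSI mode is the only switch involved. `element_at_index` and `second_skill_or_default` are hypothetical helpers, not Spark functions.

```python
def element_at_index(arr, index, ansi=False):
    """Model of arr[index] on an array column.

    ansi=False: a null array or an out-of-range index gives None.
    ansi=True:  an out-of-range index raises, mirroring ANSI mode.
    """
    if arr is None:
        return None
    if 0 <= index < len(arr):
        return arr[index]
    if ansi:
        raise IndexError(f"index {index} is out of bounds for array of size {len(arr)}")
    return None

def second_skill_or_default(skills, default="N/A"):
    """Model of when(size(skills) > 1, skills[1]).otherwise(default)."""
    return skills[1] if skills is not None and len(skills) > 1 else default

print(element_at_index(["Java"], 2))                 # None when ANSI mode is off
print(second_skill_or_default(["Java"]))             # falls back to the default
print(second_skill_or_default(["Python", "Scala"]))  # second element exists
```

Guarding with the size check gives the same result in both modes, which keeps the pipeline portable across configurations.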

## 🎯 Key Takeaways

### What You Learned:
- ✅ **`explode()` & `explode_outer()`** - Convert arrays to rows
- ✅ **Array operations** - create, manipulate, transform arrays
- ✅ **Map operations** - work with key-value pairs
- ✅ **Struct operations** - handle nested data structures
- ✅ **JSON processing** - parse and extract from JSON data
- ✅ **Complex transformations** - nested operations and flattening

### Essential Functions:
- 🔸 `explode(col)` - Array to rows (drops rows with null or empty arrays)
- 🔸 `explode_outer(col)` - Array to rows (keeps null or empty arrays as a null row)
- 🔸 `posexplode(col)` - Array to rows with position
- 🔸 `arrays_zip(*cols)` - Zip multiple arrays
- 🔸 `array(*cols)` - Create array from columns
- 🔸 `array_contains(arr, value)` - Check if array contains value
- 🔸 `map_from_arrays(keys, values)` - Create map from arrays
- 🔸 `struct(*cols)` - Create struct from columns
- 🔸 `from_json(col, schema)` - Parse JSON strings

### Performance Considerations:
- 🔸 `explode()` can significantly increase row count
- 🔸 Complex nested operations may require multiple passes
- 🔸 Use `explode_outer()` to avoid silently dropping rows with null or empty arrays
- 🔸 Consider caching intermediate results for complex pipelines
- 🔸 Monitor data skew after explode operations
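
A rough way to reason about the first and last points is to look at array sizes before exploding: the exploded row count is the sum of the sizes, and one very large array can dominate the result. The sketch below is plain Python with a hypothetical helper `explode_fanout`; in a real pipeline the sizes would come from something like `F.size()` on the array column.

```python
def explode_fanout(array_sizes):
    """Return (rows after explode, largest array, share of rows from that array)."""
    total_rows = sum(array_sizes)
    largest = max(array_sizes, default=0)
    share = largest / total_rows if total_rows else 0.0
    return total_rows, largest, share

# Skills per person in the sample dataset: Alice 3, Bob 2, Charlie 3, Diana 3
balanced = explode_fanout([3, 2, 3, 3])
print(balanced)

# Made-up skewed case: one row carries almost all of the elements
skewed = explode_fanout([1, 1, 1, 997])
print(skewed)
```

A share close to 1.0 suggests that a single input row will produce most of the exploded output, which is the situation where skew is worth checking.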

### Common Patterns:
- 🔸 **Flattening JSON**: `from_json() + explode() + struct access`
- 🔸 **Array filtering**: `filter(array_col, lambda x: condition)`
- 🔸 **Map transformations**: `transform_values(map_col, lambda k, v: operation)`
- 🔸 **Nested aggregations**: `aggregate(array_col, initial, merge_func)`
- 🔸 **Struct field access**: `col("struct_col.field_name")`

---

## 🚀 Next Steps

Now that you have mastered complex data types, you're ready for:

1. **DataFrame Joins** - Combining multiple DataFrames
2. **Spark SQL Integration** - SQL interface for DataFrames
3. **Advanced Analytics** - Machine learning and streaming
4. **Production Data Processing** - Handling real-world data at scale

**Complex data types are essential for modern data processing!**

---

**🎉 Congratulations! You now have the power to wrangle the most complex nested data structures in Spark!**