# Module 7 - Advanced Operations: Complex Data Types

## Introduction

PySpark supports complex data types beyond simple strings and numbers. This notebook covers Arrays, Structs, and Maps - essential for handling nested and semi-structured data.

## What You'll Learn

- Working with ArrayType
- Working with StructType (nested structures)
- Working with MapType
- Accessing and manipulating complex data
- Common operations on complex types


In [5]:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
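# create_map builds MapType columns; it is unrelated to Python's built-in map()
from pyspark.sql.functions import col, explode, array, struct, create_map, size, array_contains
import os

# Open JDK-internal packages to Spark; JAVA_TOOL_OPTIONS is read when the JVM starts
os.environ["JAVA_TOOL_OPTIONS"] = (
    "--add-opens=java.base/java.lang=ALL-UNNAMED "
    "--add-opens=java.base/java.nio=ALL-UNNAMED "
    "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED"
)
# Python interpreter Spark should launch. These are read when the Spark context
# is created, so restart the kernel if a session is already running.
os.environ["PYSPARK_PYTHON"] = "python"
os.environ["PYSPARK_DRIVER_PYTHON"] = "python"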

# Create SparkSession
spark = SparkSession.builder \
    .appName("Complex Data Types") \
    .master("local[*]") \
    .getOrCreate()

print("SparkSession successfully created!")

SparkSession successfully created!


## Arrays

Arrays store multiple values of the same type in a single column. They are useful for tags, categories, or other lists.


In [2]:
# Create DataFrame with Array column
data = [
    ("Alice", ["Python", "SQL", "Spark"]),
    ("Bob", ["Java", "Scala"]),
    ("Charlie", ["Python", "R", "SQL", "Spark"]),
    ("Diana", ["SQL"])
]

schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Skills", ArrayType(StringType()), True)
])

df_arrays = spark.createDataFrame(data, schema)
df_arrays.show(truncate=False)
df_arrays.printSchema()


                                                                                

+-------+-----------------------+
|Name   |Skills                 |
+-------+-----------------------+
|Alice  |[Python, SQL, Spark]   |
|Bob    |[Java, Scala]          |
|Charlie|[Python, R, SQL, Spark]|
|Diana  |[SQL]                  |
+-------+-----------------------+

root
 |-- Name: string (nullable = true)
 |-- Skills: array (nullable = true)
 |    |-- element: string (containsNull = true)



In [3]:
# Access array elements by index
from pyspark.sql.functions import col

df_with_first_skill = df_arrays.withColumn("FirstSkill", col("Skills")[0])
df_with_first_skill.show(truncate=False)

# Get array size
df_with_size = df_arrays.withColumn("SkillCount", size(col("Skills")))
df_with_size.show(truncate=False)


+-------+-----------------------+----------+
|Name   |Skills                 |FirstSkill|
+-------+-----------------------+----------+
|Alice  |[Python, SQL, Spark]   |Python    |
|Bob    |[Java, Scala]          |Java      |
|Charlie|[Python, R, SQL, Spark]|Python    |
|Diana  |[SQL]                  |SQL       |
+-------+-----------------------+----------+

+-------+-----------------------+----------+
|Name   |Skills                 |SkillCount|
+-------+-----------------------+----------+
|Alice  |[Python, SQL, Spark]   |3         |
|Bob    |[Java, Scala]          |2         |
|Charlie|[Python, R, SQL, Spark]|4         |
|Diana  |[SQL]                  |1         |
+-------+-----------------------+----------+



In [4]:
# Check if array contains a value
df_has_python = df_arrays.withColumn("KnowsPython", array_contains(col("Skills"), "Python"))
df_has_python.show(truncate=False)


+-------+-----------------------+-----------+
|Name   |Skills                 |KnowsPython|
+-------+-----------------------+-----------+
|Alice  |[Python, SQL, Spark]   |true       |
|Bob    |[Java, Scala]          |false      |
|Charlie|[Python, R, SQL, Spark]|true       |
|Diana  |[SQL]                  |false      |
+-------+-----------------------+-----------+



In [5]:
# Explode array - create one row per array element
df_exploded = df_arrays.select("Name", explode("Skills").alias("Skill"))
df_exploded.show()


+-------+------+
|   Name| Skill|
+-------+------+
|  Alice|Python|
|  Alice|   SQL|
|  Alice| Spark|
|    Bob|  Java|
|    Bob| Scala|
|Charlie|Python|
|Charlie|     R|
|Charlie|   SQL|
|Charlie| Spark|
|  Diana|   SQL|
+-------+------+
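
One detail the example above does not exercise: `explode` drops rows whose array is empty or null, while `explode_outer` keeps them with a null element. The following is a plain-Python model of that behaviour, not Spark code; the helper names `explode_rows` and `explode_outer_rows` and the rows for "Eve" and "Frank" are made up for illustration.

```python
# Plain-Python model of explode() and explode_outer() semantics.
rows = [
    ("Alice", ["Python", "SQL", "Spark"]),
    ("Bob", ["Java", "Scala"]),
    ("Eve", []),      # hypothetical row with an empty array
    ("Frank", None),  # hypothetical row with a null array
]


def explode_rows(rows):
    """Yield one (name, skill) pair per element; empty or null arrays yield nothing."""
    for name, skills in rows:
        for skill in skills or []:
            yield (name, skill)


def explode_outer_rows(rows):
    """Like explode_rows, but keep empty or null arrays as a single (name, None) row."""
    for name, skills in rows:
        if not skills:
            yield (name, None)
        else:
            for skill in skills:
                yield (name, skill)


exploded = list(explode_rows(rows))
exploded_outer = list(explode_outer_rows(rows))
print(exploded)
print(exploded_outer)
```

If every person must appear in the result even without skills, `explode_outer` is the variant to reach for in PySpark.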



## Structs

Structs represent nested structures (like JSON objects). Each struct has named fields with their own types.


In [6]:
# Create DataFrame with Struct column
data = [
    ("Alice", ("New York", "USA", "10001")),
    ("Bob", ("London", "UK", "SW1A")),
    ("Charlie", ("Tokyo", "Japan", "100-0001"))
]

schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Address", StructType([
        StructField("City", StringType(), True),
        StructField("Country", StringType(), True),
        StructField("PostalCode", StringType(), True)
    ]), True)
])

df_structs = spark.createDataFrame(data, schema)
df_structs.show(truncate=False)
df_structs.printSchema()


+-------+------------------------+
|Name   |Address                 |
+-------+------------------------+
|Alice  |{New York, USA, 10001}  |
|Bob    |{London, UK, SW1A}      |
|Charlie|{Tokyo, Japan, 100-0001}|
+-------+------------------------+

root
 |-- Name: string (nullable = true)
 |-- Address: struct (nullable = true)
 |    |-- City: string (nullable = true)
 |    |-- Country: string (nullable = true)
 |    |-- PostalCode: string (nullable = true)



In [7]:
# Access struct fields using dot notation
df_city = df_structs.select("Name", col("Address.City").alias("City"))
df_city.show()

# Access multiple struct fields
df_address = df_structs.select(
    "Name",
    col("Address.City").alias("City"),
    col("Address.Country").alias("Country")
)
df_address.show()


+-------+--------+
|   Name|    City|
+-------+--------+
|  Alice|New York|
|    Bob|  London|
|Charlie|   Tokyo|
+-------+--------+

+-------+--------+-------+
|   Name|    City|Country|
+-------+--------+-------+
|  Alice|New York|    USA|
|    Bob|  London|     UK|
|Charlie|   Tokyo|  Japan|
+-------+--------+-------+
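
Dot notation walks the struct one field at a time, and a null struct yields null for every field beneath it. Below is a plain-Python model of that lookup using nested dicts, not Spark code; `get_path` and the row for "Dan" are hypothetical and exist only for this sketch.

```python
def get_path(record, path):
    """Follow a dotted path such as "Address.City" through nested dicts.

    Returns None as soon as a level is missing or is not a dict, mirroring
    how a null struct produces null for its nested fields.
    """
    value = record
    for part in path.split("."):
        if not isinstance(value, dict):
            return None
        value = value.get(part)
    return value


row = {"Name": "Alice", "Address": {"City": "New York", "Country": "USA", "PostalCode": "10001"}}
no_address = {"Name": "Dan", "Address": None}  # hypothetical row with a null struct

city = get_path(row, "Address.City")
missing_city = get_path(no_address, "Address.City")
print(city, missing_city)
```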



## Different Ways of Handling Nested Schema

When working with nested data structures (like JSON with nested objects), you need to define schemas that include nested StructTypes. PySpark provides multiple ways to handle nested schemas, each with its own advantages.

### Method 1: Using StructType with Nested StructType (Programmatic Schema)

This is the most explicit and type-safe way to define nested schemas. You define the schema using `StructType` and `StructField`, with nested `StructType` for nested structures.

**Advantages:**
- Most explicit and clear
- Type-safe
- Better for production code
- Full control over data types and nullability

**When to Use:**
- Production code
- When you need explicit control over schema
- When working with complex nested structures


In [8]:
# Method 1: Using StructType with Nested StructType
from pyspark.sql.types import StructType, StructField, StringType, LongType

# Define nested schema using StructType
customer_schema = StructType([
    StructField("customer_id", LongType()),
    StructField("fullname", StructType([
        StructField("firstname", StringType()),
        StructField("lastname", StringType())
    ])),
    StructField("city", StringType())
])

# Create sample data matching the nested schema
customer_data = [
    (1, ("Sumit", "Mittal"), "Bangalore"),
    (2, ("Ram", "Kumar"), "Hyderabad"),
    (3, ("Vijay", "Kumar"), "Pune")
]

# Create DataFrame with nested schema
df_nested_method1 = spark.createDataFrame(customer_data, customer_schema)

print("Method 1: Using StructType with nested StructType")
print("="*60)
df_nested_method1.show(truncate=False)
df_nested_method1.printSchema()

print("\nNote: This is the most explicit way to define nested schemas.")
print("Best for production code when you need full control over data types.")


Method 1: Using StructType with nested StructType
+-----------+---------------+---------+
|customer_id|fullname       |city     |
+-----------+---------------+---------+
|1          |{Sumit, Mittal}|Bangalore|
|2          |{Ram, Kumar}   |Hyderabad|
|3          |{Vijay, Kumar} |Pune     |
+-----------+---------------+---------+

root
 |-- customer_id: long (nullable = true)
 |-- fullname: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- city: string (nullable = true)


Note: This is the most explicit way to define nested schemas.
Best for production code when you need full control over data types.


### Method 2: Using DDL Schema String (Simple String Format)

You can define nested schemas using a DDL (Data Definition Language) string format. This is more concise and easier to read for simple schemas.

**Advantages:**
- Simple and concise
- Easy to read and write
- Good for quick prototyping
- Less verbose than StructType

**When to Use:**
- Quick prototyping
- Simple nested structures
- When you prefer string-based schema definition

**Syntax**: Use `struct<field1:type1,field2:type2>` for nested structures in DDL format.
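
To make the syntax concrete, here is a small sketch that renders a nested Python dict as a DDL string. The helper `to_ddl` is hypothetical and not part of PySpark. It relies on the convention visible in the example schema: top-level fields are written as `name type`, while fields inside `struct<...>` are written as `name:type`.

```python
def to_ddl(fields):
    """Render a {name: type} dict as a DDL schema string.

    A nested dict becomes struct<...>. Leaf types are passed through as
    plain strings, so "array<string>" or "map<string,int>" also work.
    """
    def render_type(t):
        if isinstance(t, dict):
            inner = ",".join(f"{k}:{render_type(v)}" for k, v in t.items())
            return f"struct<{inner}>"
        return t

    return ", ".join(f"{name} {render_type(t)}" for name, t in fields.items())


customer_fields = {
    "customer_id": "long",
    "fullname": {"firstname": "string", "lastname": "string"},
    "city": "string",
}

ddl = to_ddl(customer_fields)
print(ddl)
```

The resulting string is the kind of value that can be passed as the schema argument to `createDataFrame()` or `.schema()`.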


In [9]:
# Method 2: Using DDL Schema String
# Define schema as a DDL string
ddl_schema = "customer_id long, fullname struct<firstname:string,lastname:string>, city string"

# Same data as before
customer_data = [
    (1, ("Sumit", "Mittal"), "Bangalore"),
    (2, ("Ram", "Kumar"), "Hyderabad"),
    (3, ("Vijay", "Kumar"), "Pune")
]

# Create DataFrame with DDL schema
df_nested_method2 = spark.createDataFrame(customer_data, ddl_schema)

print("Method 2: Using DDL Schema String")
print("="*60)
df_nested_method2.show(truncate=False)
df_nested_method2.printSchema()

print("\nNote: DDL schema is more concise and easier to read.")
print("Good for quick prototyping and simple nested structures.")
print("\nDDL Syntax for nested struct: struct<field1:type1,field2:type2>")


Method 2: Using DDL Schema String
+-----------+---------------+---------+
|customer_id|fullname       |city     |
+-----------+---------------+---------+
|1          |{Sumit, Mittal}|Bangalore|
|2          |{Ram, Kumar}   |Hyderabad|
|3          |{Vijay, Kumar} |Pune     |
+-----------+---------------+---------+

root
 |-- customer_id: long (nullable = true)
 |-- fullname: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- city: string (nullable = true)


Note: DDL schema is more concise and easier to read.
Good for quick prototyping and simple nested structures.

DDL Syntax for nested struct: struct<field1:type1,field2:type2>


### Method 3: Reading from Files with Nested Schema

When reading nested data from files (especially JSON), you can specify the nested schema using either StructType or DDL format.

**Common Use Case**: Reading JSON files with nested objects where you want to enforce a specific schema rather than relying on schema inference.

**Advantages:**
- Ensures data matches expected structure
- Faster reads, because Spark skips the schema-inference pass over the data
- Type safety
- Handles nested structures correctly
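
By default, Spark's JSON reader expects JSON Lines: one complete JSON object per line. A pretty-printed file or a top-level array needs the `multiLine` option instead. The sketch below uses only the standard library to write such a file and to check each record against the expected nested shape before handing it to Spark. The `expected` structure and the `matches` helper are assumptions made for this illustration, and the check is stricter than what Spark itself enforces when it applies a schema.

```python
import json
import os
import tempfile

records = [
    {"customer_id": 1, "fullname": {"firstname": "Sumit", "lastname": "Mittal"}, "city": "Bangalore"},
    {"customer_id": 2, "fullname": {"firstname": "Ram", "lastname": "Kumar"}, "city": "Hyderabad"},
]

# Expected shape, mirroring the nested schema (hypothetical structure for this sketch)
expected = {"customer_id": int, "fullname": {"firstname": str, "lastname": str}, "city": str}


def matches(value, shape):
    """Return True if value has exactly the fields and Python types described by shape."""
    if isinstance(shape, dict):
        return (
            isinstance(value, dict)
            and set(value) == set(shape)
            and all(matches(value[k], shape[k]) for k in shape)
        )
    return isinstance(value, shape)


# Write JSON Lines: one object per line, no enclosing array
path = os.path.join(tempfile.mkdtemp(), "customer_nested.json")
with open(path, "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the file back line by line and validate every record
with open(path) as f:
    loaded = [json.loads(line) for line in f if line.strip()]

all_valid = all(matches(r, expected) for r in loaded)
print(len(loaded), all_valid)
```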


In [10]:
# Method 3: Reading from JSON file with nested schema
import os
import json

# Create sample JSON file with nested structure
os.makedirs("data", exist_ok=True)

json_data = [
    {"customer_id": 1, "fullname": {"firstname": "Sumit", "lastname": "Mittal"}, "city": "Bangalore"},
    {"customer_id": 2, "fullname": {"firstname": "Ram", "lastname": "Kumar"}, "city": "Hyderabad"},
    {"customer_id": 3, "fullname": {"firstname": "Vijay", "lastname": "Kumar"}, "city": "Pune"}
]

# Write JSON file (one JSON object per line - JSON Lines format)
with open("data/customer_nested.json", "w") as f:
    for record in json_data:
        f.write(json.dumps(record) + "\n")

# Method 3a: Read with StructType schema
customer_schema = StructType([
    StructField("customer_id", LongType()),
    StructField("fullname", StructType([
        StructField("firstname", StringType()),
        StructField("lastname", StringType())
    ])),
    StructField("city", StringType())
])

df_from_json_struct = spark.read \
    .format("json") \
    .schema(customer_schema) \
    .load("data/customer_nested.json")

print("Method 3a: Reading JSON with StructType schema")
print("="*60)
df_from_json_struct.show(truncate=False)
df_from_json_struct.printSchema()

# Method 3b: Read with DDL schema
ddl_schema = "customer_id long, fullname struct<firstname:string,lastname:string>, city string"

df_from_json_ddl = spark.read \
    .format("json") \
    .schema(ddl_schema) \
    .load("data/customer_nested.json")

print("\nMethod 3b: Reading JSON with DDL schema")
print("="*60)
df_from_json_ddl.show(truncate=False)
df_from_json_ddl.printSchema()

print("\nNote: Both methods work when reading from files.")
print("Choose based on your preference: StructType (explicit) or DDL (concise).")


Method 3a: Reading JSON with StructType schema
+-----------+---------------+---------+
|customer_id|fullname       |city     |
+-----------+---------------+---------+
|1          |{Sumit, Mittal}|Bangalore|
|2          |{Ram, Kumar}   |Hyderabad|
|3          |{Vijay, Kumar} |Pune     |
+-----------+---------------+---------+

root
 |-- customer_id: long (nullable = true)
 |-- fullname: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- city: string (nullable = true)


Method 3b: Reading JSON with DDL schema
+-----------+---------------+---------+
|customer_id|fullname       |city     |
+-----------+---------------+---------+
|1          |{Sumit, Mittal}|Bangalore|
|2          |{Ram, Kumar}   |Hyderabad|
|3          |{Vijay, Kumar} |Pune     |
+-----------+---------------+---------+

root
 |-- customer_id: long (nullable = true)
 |-- fullname: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
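 |    |-- lastname: string (nullable = true)
 |-- city: string (nullable = true)


Note: Both methods work when reading from files.
Choose based on your preference: StructType (explicit) or DDL (concise).
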

### Summary: Handling Nested Schemas

| Method | Format | Best For | Example |
|--------|--------|----------|---------|
| **StructType** | Programmatic | Production code, complex schemas | `StructType([StructField("nested", StructType([...]))])` |
| **DDL String** | String-based | Quick prototyping, simple schemas | `"nested struct<field1:type1,field2:type2>"` |
| **Reading from Files** | Either format | When reading nested JSON/data | `.schema(schema).load("file.json")` |

**Key Points:**
1. **StructType Method**: Most explicit, type-safe, best for production
2. **DDL Method**: More concise, easier to read, good for prototyping
3. **Both work** with `createDataFrame()` and `spark.read.format().schema()`
4. **Nested structures** use `StructType` within `StructType` or `struct<...>` in DDL

**Best Practice**: 
- Use **StructType** for production code when you need explicit control
- Use **DDL** for quick prototyping or when you prefer string-based schemas
- Prefer explicit schemas when reading files: Spark skips the inference pass over the data, and column types stay predictable


In [11]:
# Create struct from columns
from pyspark.sql.functions import struct

data_simple = [
    ("Alice", "New York", "USA"),
    ("Bob", "London", "UK"),
    ("Charlie", "Tokyo", "Japan")
]

df_simple = spark.createDataFrame(data_simple, ["Name", "City", "Country"])

# Create struct column
df_with_struct = df_simple.withColumn("Address", struct("City", "Country"))
df_with_struct.show(truncate=False)
df_with_struct.printSchema()


+-------+--------+-------+---------------+
|Name   |City    |Country|Address        |
+-------+--------+-------+---------------+
|Alice  |New York|USA    |{New York, USA}|
|Bob    |London  |UK     |{London, UK}   |
|Charlie|Tokyo   |Japan  |{Tokyo, Japan} |
+-------+--------+-------+---------------+

root
 |-- Name: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- Address: struct (nullable = false)
 |    |-- City: string (nullable = true)
 |    |-- Country: string (nullable = true)



## Maps

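Maps store key-value pairs. Within one map column, all keys share a single type and all values share a single type; the key type and the value type may differ. Keys cannot be null.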


### Important: Python `map()` vs PySpark MapType

**Don't confuse these two!**

- **Python's `map()` function**: A built-in Python function that applies a function to each element of an iterable. It's used for transformations in regular Python code.
- **PySpark's MapType**: A complex data type that stores key-value pairs in a DataFrame column, similar to a Python dictionary.

Let's see the difference with examples:


In [12]:
# Python's map() function - applies a function to each element
# This is regular Python, NOT PySpark

numbers = [1, 2, 3, 4, 5]
squared = list(map(lambda x: x ** 2, numbers))
print("Python map() example:")
print(f"Original: {numbers}")
print(f"Squared: {squared}")

# Python's map() works on Python collections (lists, tuples, etc.)
names = ["alice", "bob", "charlie"]
uppercase = list(map(str.upper, names))
print(f"\nNames: {names}")
print(f"Uppercase: {uppercase}")

print("\n" + "="*60)
print("NOTE: Python's map() is NOT the same as PySpark's MapType!")
print("="*60)
print("\nPySpark MapType is a DATA TYPE for storing key-value pairs")
print("in DataFrame columns, similar to Python dictionaries.")
print("\nBelow, we'll see PySpark's MapType (the complex data type).")


Python map() example:
Original: [1, 2, 3, 4, 5]
Squared: [1, 4, 9, 16, 25]

Names: ['alice', 'bob', 'charlie']
Uppercase: ['ALICE', 'BOB', 'CHARLIE']

NOTE: Python's map() is NOT the same as PySpark's MapType!

PySpark MapType is a DATA TYPE for storing key-value pairs
in DataFrame columns, similar to Python dictionaries.

Below, we'll see PySpark's MapType (the complex data type).


In [17]:
# Create DataFrame with Map column
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, MapType

data = [
    ("Alice", {"Python": 5, "SQL": 4, "Spark": 3}),
    ("Bob", {"Java": 4, "Scala": 5}),
    ("Charlie", {"Python": 5, "R": 3})
]

print(data)

# A Python dict maps directly onto a MapType column when the schema declares one.
# (create_map(key_col, value_col, ...) builds a map from existing columns instead,
# but every key must be non-null, so it suits rows that all have the same keys.)
map_schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Skills", MapType(StringType(), IntegerType()), True)
])

df_with_map = spark.createDataFrame(data, map_schema)

df_with_map.show(truncate=False)


[('Alice', {'Python': 5, 'SQL': 4, 'Spark': 3}), ('Bob', {'Java': 4, 'Scala': 5}), ('Charlie', {'Python': 5, 'R': 3})]


Py4JJavaError: An error occurred while calling o525.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 1 times, most recent failure: Lost task 0.0 in stage 12.0 (TID 12) (LPNI5CD1207KJH.igglobal.com executor driver): java.io.IOException: Cannot run program "python3": CreateProcess error=2, The system cannot find the file specified
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1143)
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1073)
	at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:181)
	at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109)
	at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124)
	...
Caused by: java.io.IOException: CreateProcess error=2, The system cannot find the file specified
	at java.base/java.lang.ProcessImpl.create(Native Method)
	at java.base/java.lang.ProcessImpl.<init>(ProcessImpl.java:499)
	at java.base/java.lang.ProcessImpl.start(ProcessImpl.java:158)
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1110)
	... 34 more


In [18]:
# Access map values by key
df_map_access = df_with_map.withColumn("PythonLevel", col("Skills")["Python"])
df_map_access.show(truncate=False)


Py4JJavaError: An error occurred while calling o531.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 13.0 failed 1 times, most recent failure: Lost task 0.0 in stage 13.0 (TID 13) (LPNI5CD1207KJH.igglobal.com executor driver): java.io.IOException: Cannot run program "python3": CreateProcess error=2, The system cannot find the file specified
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1143)
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1073)
	at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:181)
	at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109)
	at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124)
	at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:174)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:67)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
	at org.apache.spark.scheduler.Task.run(Task.scala:141)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:842)
Caused by: java.io.IOException: CreateProcess error=2, The system cannot find the file specified
	at java.base/java.lang.ProcessImpl.create(Native Method)
	at java.base/java.lang.ProcessImpl.<init>(ProcessImpl.java:499)
	at java.base/java.lang.ProcessImpl.start(ProcessImpl.java:158)
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1110)
	... 34 more

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2856)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2792)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2791)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2791)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1247)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1247)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1247)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3060)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2994)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2983)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:989)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2398)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2419)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2438)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:530)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:483)
	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:61)
	at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:4332)
	at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:3314)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:4322)
	at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:546)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:4320)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:201)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:108)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:66)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:4320)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:3314)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:3537)
	at org.apache.spark.sql.Dataset.getRows(Dataset.scala:280)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:315)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:842)
Caused by: java.io.IOException: Cannot run program "python3": CreateProcess error=2, The system cannot find the file specified
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1143)
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1073)
	at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:181)
	at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109)
	at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124)
	at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:174)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:67)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
	at org.apache.spark.scheduler.Task.run(Task.scala:141)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	... 1 more
Caused by: java.io.IOException: CreateProcess error=2, The system cannot find the file specified
	at java.base/java.lang.ProcessImpl.create(Native Method)
	at java.base/java.lang.ProcessImpl.<init>(ProcessImpl.java:499)
	at java.base/java.lang.ProcessImpl.start(ProcessImpl.java:158)
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1110)
	... 34 more


## Nested Complex Types

You can nest complex types - arrays of structs, structs containing arrays, etc.


In [15]:
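# Example: Array of Structs (common in JSON data)
# Each person has a list of (skill, level) pairs
data_nested = [
    ("Alice", [("Python", 5), ("SQL", 4)]),
    ("Bob", [("Java", 4), ("Scala", 5)])
]

# Describe the nesting explicitly: an array whose elements are structs.
# The field names "Skill" and "Level" are illustrative.
schema_nested = StructType([
    StructField("Name", StringType(), True),
    StructField("Skills", ArrayType(StructType([
        StructField("Skill", StringType(), True),
        StructField("Level", IntegerType(), True)
    ])), True)
])

# Build the DataFrame from the nested data so Skills is array<struct<Skill, Level>>
df_nested = spark.createDataFrame(data_nested, schema_nested)

# In practice, you'd usually get this shape by reading JSON rather than building it by hand
print("Nested structures are common when reading JSON data.")
print("We'll see more examples when working with JSON sources.")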


Nested structures are common when reading JSON data.
We'll see more examples when working with JSON sources.


## Common Operations Summary

**Arrays:**
- Access element: `col("array_col")[index]` (0-based index)
- Size: `size(col("array_col"))`
- Contains: `array_contains(col("array_col"), value)`
- Explode: `explode(col("array_col"))`

**Structs:**
- Access field: `col("struct_col.field_name")`
- Create: `struct(col1, col2, ...)`

**Maps:**
- Access value: `col("map_col")[key]`
- Create: `create_map(key1, val1, key2, val2, ...)`


## Summary

In this notebook, you learned:

1. **Arrays**: Store multiple values of the same type, access by index, check size, explode
2. **Structs**: Nested structures with named fields, access using dot notation
3. **Maps**: Key-value pairs, access values by key
4. **Nested Types**: Complex types can be nested (arrays of structs, etc.)

**Key Takeaway**: Complex data types are essential for handling semi-structured data like JSON. Understanding how to access and manipulate these types is crucial for real-world data engineering tasks.

**Next Steps**: In Module 8, we'll learn about performance optimization techniques including partitioning, caching, and bucketing.
