# Module 07a - Advanced Operations - Complex Types - Exercises

## Instructions

This notebook contains exercises based on the concepts learned in Module 07a.

- Complete each exercise in the provided code cells
- Run the data setup cells first to create the necessary data
- Test your solutions by running the verification cells (if provided)
- Refer back to the main module notebook if you need help


## Data Setup

Run the cells below to set up the data needed for the exercises.


In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import col, when, lit
import os

# Create SparkSession
spark = SparkSession.builder \
    .appName("Module Exercises") \
    .master("local[*]") \
    .getOrCreate()

# Set data directory
data_dir = "../data"
os.makedirs(data_dir, exist_ok=True)

print("SparkSession created successfully!")
print(f"Data directory: {os.path.abspath(data_dir)}")

# Create simple employee DataFrame
df_employees = spark.createDataFrame([
    ("Alice", 25, "Sales", 50000),
    ("Bob", 30, "IT", 60000),
    ("Charlie", 35, "Sales", 70000),
    ("Diana", 28, "IT", 55000),
    ("Eve", 32, "HR", 65000)
], ["Name", "Age", "Department", "Salary"])

print("Employee DataFrame created:")
df_employees.show()


## Exercises

Complete the following exercises based on the concepts from Module 07a.


### Exercise 1: Basic Operation

Select columns from `df_employees`, then work through the array, struct, and map operations from Module 07a in the cells that follow.

In [0]:
# Select specific columns
df_employees.select("Name", "Salary").show()


In [0]:
# Build a DataFrame with an array column (Skills) for the array exercises
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

data = [
    ("Alice", ["Python", "SQL", "Spark"]),
    ("Bob", ["Java", "Scala"]),
    ("Charlie", ["Python", "R", "SQL"])
]

df = spark.createDataFrame(data, ["Name", "Skills"])
df.show(truncate=False)


In [0]:
# Access the first array element (array indices start at 0)
df.select(
    "Name",
    col("Skills").getItem(0).alias("First_Skill")
).show()


In [0]:
# Index into the array using SQL expression syntax
df.selectExpr(
    "Name",
    "Skills[0] as First_Skill",
    "Skills[1] as Second_Skill"
).show()


In [0]:
# Keep only rows whose first skill is "Python"
df.filter(col("Skills").getItem(0) == "Python").show()


In [0]:
# Keep rows whose Skills array contains "Python" at any position
from pyspark.sql.functions import array_contains, col

df.filter(array_contains(col("Skills"), "Python")).show()


In [0]:
# Register the Skills DataFrame as a temp view and run the same filter in SQL
df.createOrReplaceTempView("employees")

spark.sql("""
    SELECT Name, Skills
    FROM employees
    WHERE array_contains(Skills, 'Python')
""").show()


In [0]:
# Explode the Skills array into one row per skill
from pyspark.sql.functions import explode

df_exploded = df.select(
    "Name",
    explode("Skills").alias("Skill")
)

df_exploded.show()

In [0]:
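# Build a struct column; alias() sets each field's name inside the struct
from pyspark.sql.functions import col, struct

df_with_struct = df_employees.withColumn(
    "Details",
    struct(
        col("Age").alias("experience"),    # Age is stored as Details.experience
        col("Salary").alias("projects")    # Salary is stored as Details.projects
    )
)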

df_with_struct.printSchema()


In [0]:
# Without aliases, the struct fields keep the source column names (Age, Salary)
from pyspark.sql.functions import struct, col

df_struct = df_employees.withColumn(
    "Details",
    struct(
        col("Age"),
        col("Salary")
    )
)

df_struct.show(truncate=False)
df_struct.printSchema()


In [0]:
# Build a DataFrame with a map column; Python dicts are inferred as MapType.
# Note: this rebinds df, replacing the Skills DataFrame used in earlier cells.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

data = [
    ("Alice", {"experience": 5, "projects": 10}),
    ("Bob", {"experience": 3, "projects": 5})
]

df = spark.createDataFrame(
    data,
    ["Name", "Details"]
)

df.show(truncate=False)
df.printSchema()


In [0]:
# Check whether a key exists in the map (map_contains_key requires PySpark 3.4+)
from pyspark.sql.functions import map_contains_key

df.select(
    "Name",
    map_contains_key(col("Details"), "projects").alias("Has_Projects")
).show()


In [0]:
# Explode the map into one row per key-value pair
from pyspark.sql.functions import explode

df.select(
    "Name",
    explode("Details").alias("Key", "Value")
).show()


## Summary

Great job completing the exercises! Review your solutions and compare them with the solutions notebook if needed.
