# Module 04 - Data Transformations & Aggregations - Exercises

## Instructions

This notebook contains exercises based on the concepts learned in Module 04.

- Complete each exercise in the provided code cells
- Run the data setup cells first to generate/create necessary data
- Test your solutions by running the verification cells (if provided)
- Refer back to the main module notebook if you need help

## Data Setup

Run the cells below to set up the data needed for the exercises.


In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, DateType
from pyspark.sql.functions import col, when, lit
import os
import pandas as pd
import numpy as np

# Create SparkSession
spark = SparkSession.builder \
    .appName("Module 4 Exercises") \
    .master("local[*]") \
    .getOrCreate()

# Set data directory
data_dir = "../data"
os.makedirs(data_dir, exist_ok=True)

print("SparkSession created successfully!")
print(f"Data directory: {os.path.abspath(data_dir)}")

# Generate a larger dataset for aggregations (roughly 100 MB as CSV)
print("Generating large dataset for aggregations (this may take a minute)...")
n_records = 2_000_000  # about 50 bytes per CSV row

large_data = {
    "employee_id": range(1, n_records + 1),
    "name": [f"Employee_{i}" for i in range(1, n_records + 1)],
    "department": np.random.choice(["Sales", "IT", "HR", "Finance", "Marketing"], n_records),
    "salary": np.random.randint(40000, 150000, n_records),
    "age": np.random.randint(22, 65, n_records),
    "city": np.random.choice(
        ["NYC", "LA", "Chicago", "Houston", "Phoenix", "Philadelphia"], n_records
    )
}

df_large = pd.DataFrame(large_data)
df_large.to_csv(f"{data_dir}/large_employees.csv", index=False)



print(f"Created large CSV file: {data_dir}/large_employees.csv ({len(df_large)} records)")

# Also create a smaller DataFrame for quick exercises
small_data = [
    ("Alice", 25, "Sales", 50000, "NYC"),
    ("Bob", 30, "IT", 60000, "LA"),
    ("Charlie", 35, "Sales", 70000, "Chicago"),
    ("Diana", 28, "IT", 55000, "Houston"),
    ("Eve", 32, "HR", 65000, "Phoenix"),
    ("Frank", 27, "Sales", 52000, "NYC"),
    ("Grace", 29, "IT", 58000, "LA"),
    ("Henry", 31, "HR", 62000, "Chicago")
]

small_schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Department", StringType(), True),
    StructField("Salary", IntegerType(), True),
    StructField("City", StringType(), True)
])

df_employees = spark.createDataFrame(small_data, small_schema)

print("\nSmall DataFrame for quick exercises:")
df_employees.show()
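
The on-disk size of the generated CSV depends on the average row length. As a rough sketch, the snippet below writes a small sample of rows with the same layout into memory and extrapolates to the full row count. `SAMPLE_SIZE` and the fixed seed are assumptions made for illustration, and the standard-library `random` module stands in for NumPy, so the result is an estimate and not a measurement of the real file.

```python
import csv
import io
import random

N_RECORDS = 2_000_000   # same row count as the setup cell
SAMPLE_SIZE = 10_000    # hypothetical sample size used only for this estimate

departments = ["Sales", "IT", "HR", "Finance", "Marketing"]
cities = ["NYC", "LA", "Chicago", "Houston", "Phoenix", "Philadelphia"]

rng = random.Random(0)  # fixed seed so the estimate is repeatable
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")

for _ in range(SAMPLE_SIZE):
    # Sample ids across the full range so the id and name lengths are realistic
    employee_id = rng.randint(1, N_RECORDS)
    writer.writerow([
        employee_id,
        f"Employee_{employee_id}",
        rng.choice(departments),
        rng.randrange(40000, 150000),
        rng.randrange(22, 65),
        rng.choice(cities),
    ])

bytes_per_row = len(buffer.getvalue().encode("utf-8")) / SAMPLE_SIZE
estimated_mb = bytes_per_row * N_RECORDS / 1_000_000

print(f"~{bytes_per_row:.0f} bytes per row, ~{estimated_mb:.0f} MB for {N_RECORDS:,} rows")
```

If the file has already been written, `os.path.getsize(f"{data_dir}/large_employees.csv")` gives the actual size in bytes.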


## Exercises

Complete the following exercises based on the concepts from Module 04.

### Exercise 1: GroupBy and Aggregate

Group df_employees by Department and calculate:

- Count of employees
- Average salary
- Maximum salary

In [0]:
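# Note: importing max from pyspark.sql.functions shadows Python's built-in max in this notebook
from pyspark.sql.functions import avg, count, max
# Your code here
# Count employees and compute the average and maximum salary per department
df_1 = df_employees.groupBy("Department").agg(
    count("*").alias("Total Count"),
    avg("Salary").alias("Average Salary"),
    max("Salary").alias("Maximum Salary")
)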
df_1.show()
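
One way to sanity-check the grouped result is to compute the same three aggregates in plain Python over the eight rows from the setup cell and compare them with what Spark prints. This is only a sketch: the `rows` list is copied by hand from `small_data`, and `builtins.max` is used on the assumption that the exercise cell has already imported PySpark's `max` under the same name.

```python
import builtins
from collections import defaultdict

# (Name, Age, Department, Salary, City) rows copied from the setup cell
rows = [
    ("Alice", 25, "Sales", 50000, "NYC"),
    ("Bob", 30, "IT", 60000, "LA"),
    ("Charlie", 35, "Sales", 70000, "Chicago"),
    ("Diana", 28, "IT", 55000, "Houston"),
    ("Eve", 32, "HR", 65000, "Phoenix"),
    ("Frank", 27, "Sales", 52000, "NYC"),
    ("Grace", 29, "IT", 58000, "LA"),
    ("Henry", 31, "HR", 62000, "Chicago"),
]

# Collect the salaries of each department
salaries_by_department = defaultdict(list)
for _name, _age, department, salary, _city in rows:
    salaries_by_department[department].append(salary)

# Count, average and maximum per department
expected = {
    department: {
        "count": len(salaries),
        "average": sum(salaries) / len(salaries),
        "maximum": builtins.max(salaries),  # built-in max, not pyspark's
    }
    for department, salaries in salaries_by_department.items()
}

for department, stats in sorted(expected.items()):
    print(department, stats["count"], round(stats["average"], 2), stats["maximum"])
```

The row order of `df_1.show()` is not guaranteed, so compare per department, not line by line.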

### Exercise 2: Add a Column

Add a new column 'Bonus' to df_employees that is 10% of the Salary.

In [0]:
# Your code here
# Bonus is 10% of Salary, so multiply by 0.1
df_2 = df_employees.withColumn("Bonus", col("Salary") * 0.1)
df_2.show()
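
The expected Bonus values are easy to work out by hand, which makes this exercise simple to verify. The sketch below computes them in plain Python from the salaries in the setup cell; `BONUS_RATE` and the `salaries` dictionary are names introduced here for illustration only.

```python
# Salaries copied from the small DataFrame in the setup cell
salaries = {
    "Alice": 50000,
    "Bob": 60000,
    "Charlie": 70000,
    "Diana": 55000,
    "Eve": 65000,
    "Frank": 52000,
    "Grace": 58000,
    "Henry": 62000,
}

BONUS_RATE = 0.10  # 10% of Salary, as the exercise asks

# Multiplying by 1.1 instead would give salary plus bonus, not the bonus itself
expected_bonus = {name: salary * BONUS_RATE for name, salary in salaries.items()}

for name, bonus in expected_bonus.items():
    print(f"{name}: {bonus:.2f}")
```

Each value in the Bonus column should be one tenth of the Salary in the same row.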

### Exercise 3: Handle Null Values

Fill null values in the 'Age' column with 0 (if any exist).

In [0]:
# Your code here
df_3 = df_employees.fillna({'Age': 0})
df_3.show()

### Exercise 4: Large Dataset Aggregation

Read the large_employees.csv file and:

1. Group by department
2. Calculate average salary per department
3. Show results

In [0]:
# Your code here
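# Without inferSchema every CSV column, including salary, is read as a string
data = spark.read \
       .format("csv") \
       .option("header", True) \
       .option("inferSchema", True) \
       .load(f"{data_dir}/large_employees.csv")

result = data.groupBy("department").agg(
    avg("salary").alias("Average Salary")
)
result.show()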

# Note: This may take a few minutes due to dataset size
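
Because the setup cell draws salaries uniformly with `np.random.randint(40000, 150000, ...)` and picks departments uniformly from five names, every department's average should land close to the midpoint of the salary range. The sketch below works out that midpoint and a rough standard error, assuming about one fifth of the rows fall into each department. Treat it as a plausibility check for your output, not as an exact expected result.

```python
import math

LOW, HIGH = 40_000, 150_000   # np.random.randint(LOW, HIGH): HIGH is exclusive
N_RECORDS = 2_000_000         # row count from the setup cell
N_DEPARTMENTS = 5

# Mean of a uniform integer distribution on [LOW, HIGH - 1]
expected_mean = (LOW + HIGH - 1) / 2   # 94999.5

# Standard deviation of a discrete uniform distribution with n possible values
n_values = HIGH - LOW
std_dev = math.sqrt((n_values ** 2 - 1) / 12)

# Assumption: rows are split roughly evenly across departments
rows_per_department = N_RECORDS / N_DEPARTMENTS
standard_error = std_dev / math.sqrt(rows_per_department)

print(f"Expected average salary: {expected_mean}")
print(f"Approximate standard error per department: {standard_error:.1f}")
```

A department average that differs from the midpoint by many standard errors suggests a problem, for example a column read with the wrong type or a grouping on the wrong field.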

## Summary

Great job completing the exercises! Review your solutions and compare them with the solutions notebook if needed.
