# Module 03 - Basic DataFrame Operations - Exercises

## Instructions

This notebook contains exercises based on the concepts learned in Module 03.

- Complete each exercise in the provided code cells
- Run the data setup cells first to generate/create necessary data
- Test your solutions by running the verification cells (if provided)
- Refer back to the main module notebook if you need help

## Data Setup

Run the cells below to set up the data needed for the exercises.


In [3]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, DateType
from pyspark.sql.functions import col, when, lit
import os

# -----------------------------
# Create SparkSession
# -----------------------------
spark = SparkSession.builder \
    .appName("Module 03 Exercises") \
    .master("local[*]") \
    .getOrCreate()

# Set data directory
data_dir = "../data"
os.makedirs(data_dir, exist_ok=True)

print("SparkSession created successfully!")
print(f"Data directory: {os.path.abspath(data_dir)}")

# -----------------------------
# Create sample data with duplicates and nulls
# -----------------------------
data = [
    ("Alice", 25, "Sales", 50000, "NYC"),
    ("Bob", 30, "IT", 60000, "LA"),
    ("Alice", 25, "Sales", 50000, "NYC"),          # Duplicate
    ("Charlie", None, "Sales", 70000, "Chicago"),  # Null age
    ("Diana", 28, "IT", 55000, None),              # Null city
    ("Eve", 32, "HR", 65000, "Houston"),
    ("Bob", 30, "IT", 60000, "LA"),                # Duplicate
    ("Frank", 27, None, 52000, "Phoenix")          # Null department
]

# Define schema
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Department", StringType(), True),
    StructField("Salary", IntegerType(), True),
    StructField("City", StringType(), True)
])

# Create DataFrame
df_employees = spark.createDataFrame(data, schema)

print("Employee DataFrame created:")
df_employees.show()


SparkSession created successfully!
Data directory: /data
Employee DataFrame created:
+-------+----+----------+------+-------+
|   Name| Age|Department|Salary|   City|
+-------+----+----------+------+-------+
|  Alice|  25|     Sales| 50000|    NYC|
|    Bob|  30|        IT| 60000|     LA|
|  Alice|  25|     Sales| 50000|    NYC|
|Charlie|NULL|     Sales| 70000|Chicago|
|  Diana|  28|        IT| 55000|   NULL|
|    Eve|  32|        HR| 65000|Houston|
|    Bob|  30|        IT| 60000|     LA|
|  Frank|  27|      NULL| 52000|Phoenix|
+-------+----+----------+------+-------+



## Exercises

Complete the following exercises based on the concepts from Module 03.

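from pyspark.sql.functions import expr

# Add a SalaryCategory column using a SQL CASE expression.
# Branches are checked top to bottom; the first match wins.
df_with_category = df_employees.withColumn(
    "SalaryCategory",
    expr("""
      CASE
          WHEN Salary > 65000 THEN 'High'
          WHEN Salary > 55000 THEN 'Medium'
          ELSE 'Low'
      END
    """)
)
print("DataFrame with SalaryCategory:")
df_with_category.show()
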
### Exercise 1: Filter Data

Filter the df_employees DataFrame to show only employees from the 'Sales' department.

In [4]:
# Your code here
df_1 = df_employees.filter(col("Department")=="Sales")
df_1.show()

+-------+----+----------+------+-------+
|   Name| Age|Department|Salary|   City|
+-------+----+----------+------+-------+
|  Alice|  25|     Sales| 50000|    NYC|
|  Alice|  25|     Sales| 50000|    NYC|
|Charlie|NULL|     Sales| 70000|Chicago|
+-------+----+----------+------+-------+



### Exercise 2: Select Columns

Select only 'Name', 'Age', and 'Salary' columns from df_employees.

In [5]:
# Your code here
df_2 = df_employees.select("Name","Age","Salary")
df_2.show()

+-------+----+------+
|   Name| Age|Salary|
+-------+----+------+
|  Alice|  25| 50000|
|    Bob|  30| 60000|
|  Alice|  25| 50000|
|Charlie|NULL| 70000|
|  Diana|  28| 55000|
|    Eve|  32| 65000|
|    Bob|  30| 60000|
|  Frank|  27| 52000|
+-------+----+------+



### Exercise 3: Remove Duplicates

Remove duplicate rows from df_employees based on all columns.

In [7]:
# Your code here
df_3 = df_employees.dropDuplicates()
df_3.show()

+-------+----+----------+------+-------+
|   Name| Age|Department|Salary|   City|
+-------+----+----------+------+-------+
|  Alice|  25|     Sales| 50000|    NYC|
|    Bob|  30|        IT| 60000|     LA|
|    Eve|  32|        HR| 65000|Houston|
|Charlie|NULL|     Sales| 70000|Chicago|
|  Diana|  28|        IT| 55000|   NULL|
|  Frank|  27|      NULL| 52000|Phoenix|
+-------+----+----------+------+-------+



### Exercise 4: Sort Data

Sort df_employees by Salary in descending order.

In [12]:
# Your code here
df_4 = df_employees.orderBy(col("Salary").desc())
df_4.show()

+-------+----+----------+------+-------+
|   Name| Age|Department|Salary|   City|
+-------+----+----------+------+-------+
|Charlie|NULL|     Sales| 70000|Chicago|
|    Eve|  32|        HR| 65000|Houston|
|    Bob|  30|        IT| 60000|     LA|
|    Bob|  30|        IT| 60000|     LA|
|  Diana|  28|        IT| 55000|   NULL|
|  Frank|  27|      NULL| 52000|Phoenix|
|  Alice|  25|     Sales| 50000|    NYC|
|  Alice|  25|     Sales| 50000|    NYC|
+-------+----+----------+------+-------+



## Summary

Great job completing the exercises! Review your solutions and compare them with the solutions notebook if needed.
