# DataFrame Operations - Practice Notebook

This notebook focuses on **Untyped Dataset Operations** (DataFrame operations) as covered in the [Spark SQL Getting Started Guide](https://spark.apache.org/docs/latest/sql-getting-started.html).

## Learning Objectives
- Master DataFrame transformations and actions
- Understand lazy evaluation in Spark
- Practice column operations and expressions
- Work with different data types and functions

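## Sections
The first cell sets up the Spark session and the sample data. The numbered sections that follow are:
1. **Column Selection and Expressions**
2. **Filtering and Conditional Operations**
3. **Transformations vs Actions**
4. **Working with Built-in Functions**
5. **Advanced DataFrame Operations**
6. **Practice Exercises**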

---


In [None]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import *

spark = SparkSession.builder.appName("DataFrame Operations").getOrCreate()

data = [
    ("Alice", 25, "Engineer", 75000, "2020-01-15"),
    ("Bob", 30, "Manager", 85000, "2019-03-20"),
    ("Charlie", 35, "Engineer", 80000, "2018-06-10"),
    ("Diana", 28, "Analyst", 65000, "2021-02-28"),
    ("Eve", 32, "Manager", 90000, "2017-11-05"),
    ("Frank", 29, "Engineer", 78000, "2020-09-12")
]

columns = ["name", "age", "job_title", "salary", "hire_date"]
df = spark.createDataFrame(data, columns)
df.show()

25/07/13 12:25:24 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


+-------+---+---------+------+----------+
|   name|age|job_title|salary| hire_date|
+-------+---+---------+------+----------+
|  Alice| 25| Engineer| 75000|2020-01-15|
|    Bob| 30|  Manager| 85000|2019-03-20|
|Charlie| 35| Engineer| 80000|2018-06-10|
|  Diana| 28|  Analyst| 65000|2021-02-28|
|    Eve| 32|  Manager| 90000|2017-11-05|
|  Frank| 29| Engineer| 78000|2020-09-12|
+-------+---+---------+------+----------+



## 1. Column Selection and Expressions

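Columns can be selected by name string, by attribute (`df.name`), or by index (`df["name"]`). The three forms below return the same result, and `"*"` selects every column.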


In [104]:
df.select("name").show()

+-------+
|   name|
+-------+
|  Alice|
|    Bob|
|Charlie|
|  Diana|
|    Eve|
|  Frank|
+-------+



In [105]:
df.select("name", "age").show()

+-------+---+
|   name|age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 35|
|  Diana| 28|
|    Eve| 32|
|  Frank| 29|
+-------+---+



In [106]:
df.select(df.name, df.age).show()

+-------+---+
|   name|age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 35|
|  Diana| 28|
|    Eve| 32|
|  Frank| 29|
+-------+---+



In [107]:
df.select(df["name"], df["age"]).show()

+-------+---+
|   name|age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 35|
|  Diana| 28|
|    Eve| 32|
|  Frank| 29|
+-------+---+



In [108]:
df.select("*").show()

+-------+---+---------+------+----------+
|   name|age|job_title|salary| hire_date|
+-------+---+---------+------+----------+
|  Alice| 25| Engineer| 75000|2020-01-15|
|    Bob| 30|  Manager| 85000|2019-03-20|
|Charlie| 35| Engineer| 80000|2018-06-10|
|  Diana| 28|  Analyst| 65000|2021-02-28|
|    Eve| 32|  Manager| 90000|2017-11-05|
|  Frank| 29| Engineer| 78000|2020-09-12|
+-------+---+---------+------+----------+



## 2. Filtering and Conditional Operations

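`filter` accepts a boolean `Column` expression. Conditions can be combined with `&` (and) and `|` (or), and column methods such as `contains` and `isin` build substring and membership tests.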


In [109]:
df.show()

+-------+---+---------+------+----------+
|   name|age|job_title|salary| hire_date|
+-------+---+---------+------+----------+
|  Alice| 25| Engineer| 75000|2020-01-15|
|    Bob| 30|  Manager| 85000|2019-03-20|
|Charlie| 35| Engineer| 80000|2018-06-10|
|  Diana| 28|  Analyst| 65000|2021-02-28|
|    Eve| 32|  Manager| 90000|2017-11-05|
|  Frank| 29| Engineer| 78000|2020-09-12|
+-------+---+---------+------+----------+



In [110]:
df.filter(df["age"]>30).show()

+-------+---+---------+------+----------+
|   name|age|job_title|salary| hire_date|
+-------+---+---------+------+----------+
|Charlie| 35| Engineer| 80000|2018-06-10|
|    Eve| 32|  Manager| 90000|2017-11-05|
+-------+---+---------+------+----------+



In [111]:
df.filter(df["job_title"]=="Engineer").show()

+-------+---+---------+------+----------+
|   name|age|job_title|salary| hire_date|
+-------+---+---------+------+----------+
|  Alice| 25| Engineer| 75000|2020-01-15|
|Charlie| 35| Engineer| 80000|2018-06-10|
|  Frank| 29| Engineer| 78000|2020-09-12|
+-------+---+---------+------+----------+



In [112]:
mask1 = df["age"]>28
mask2 = df["salary"]>75000

df.filter(mask1 & mask2).show()

+-------+---+---------+------+----------+
|   name|age|job_title|salary| hire_date|
+-------+---+---------+------+----------+
|    Bob| 30|  Manager| 85000|2019-03-20|
|Charlie| 35| Engineer| 80000|2018-06-10|
|    Eve| 32|  Manager| 90000|2017-11-05|
|  Frank| 29| Engineer| 78000|2020-09-12|
+-------+---+---------+------+----------+



In [113]:
mask1 = df["job_title"] == "Engineer"
mask2 = df["salary"] > 85000

df.filter(mask1 | mask2).show()

+-------+---+---------+------+----------+
|   name|age|job_title|salary| hire_date|
+-------+---+---------+------+----------+
|  Alice| 25| Engineer| 75000|2020-01-15|
|Charlie| 35| Engineer| 80000|2018-06-10|
|    Eve| 32|  Manager| 90000|2017-11-05|
|  Frank| 29| Engineer| 78000|2020-09-12|
+-------+---+---------+------+----------+



In [114]:
df.filter(df["name"].contains("a")).show()

+-------+---+---------+------+----------+
|   name|age|job_title|salary| hire_date|
+-------+---+---------+------+----------+
|Charlie| 35| Engineer| 80000|2018-06-10|
|  Diana| 28|  Analyst| 65000|2021-02-28|
|  Frank| 29| Engineer| 78000|2020-09-12|
+-------+---+---------+------+----------+



In [115]:
df.filter(df["job_title"].isin(["Engineer", "Manager"])).show()

+-------+---+---------+------+----------+
|   name|age|job_title|salary| hire_date|
+-------+---+---------+------+----------+
|  Alice| 25| Engineer| 75000|2020-01-15|
|    Bob| 30|  Manager| 85000|2019-03-20|
|Charlie| 35| Engineer| 80000|2018-06-10|
|    Eve| 32|  Manager| 90000|2017-11-05|
|  Frank| 29| Engineer| 78000|2020-09-12|
+-------+---+---------+------+----------+



## 3. Transformations vs Actions

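Transformations such as `filter`, `select`, `orderBy`, and `limit` are lazy: each returns a new DataFrame and only extends the query plan. Actions such as `show`, `count`, `collect`, `first`, and `take` trigger execution of that plan.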


In [116]:
filtered_df = df.filter(df["age"]>25)
selected_df = filtered_df.select("name", "age", "salary")
sorted_df = selected_df.orderBy("salary", ascending=False)

In [117]:
type(sorted_df)

pyspark.sql.classic.dataframe.DataFrame

In [118]:
sorted_df.show()

+-------+---+------+
|   name|age|salary|
+-------+---+------+
|    Eve| 32| 90000|
|    Bob| 30| 85000|
|Charlie| 35| 80000|
|  Frank| 29| 78000|
|  Diana| 28| 65000|
+-------+---+------+



In [119]:
sorted_df.count()

5

In [120]:
collected_data = sorted_df.collect()
print(type(collected_data))
collected_data[0]

<class 'list'>

Row(name='Eve', age=32, salary=90000)

In [121]:
collected_data

[Row(name='Eve', age=32, salary=90000),
 Row(name='Bob', age=30, salary=85000),
 Row(name='Charlie', age=35, salary=80000),
 Row(name='Frank', age=29, salary=78000),
 Row(name='Diana', age=28, salary=65000)]

In [122]:
sorted_df.first()

Row(name='Eve', age=32, salary=90000)

In [123]:
sorted_df.take(2)

[Row(name='Eve', age=32, salary=90000), Row(name='Bob', age=30, salary=85000)]

In [124]:
df.limit(1).show()

+-----+---+---------+------+----------+
| name|age|job_title|salary| hire_date|
+-----+---+---------+------+----------+
|Alice| 25| Engineer| 75000|2020-01-15|
+-----+---+---------+------+----------+



## 4. Working with Built-in Functions

The `pyspark.sql.functions` module (imported as `F`) provides built-in string, mathematical, and date functions that operate on columns.


In [125]:
print("=== STRING FUNCTIONS ===")
df.select(
    df["name"],
    F.upper(df["name"]).alias("name_upper"),
    F.lower(df["name"]).alias("name_lower"),
    F.length(df["name"]).alias("name_length"),
    F.substring(df["name"],1,3).alias("first_3_chars")
).show()

=== STRING FUNCTIONS ===
+-------+----------+----------+-----------+-------------+
|   name|name_upper|name_lower|name_length|first_3_chars|
+-------+----------+----------+-----------+-------------+
|  Alice|     ALICE|     alice|          5|          Ali|
|    Bob|       BOB|       bob|          3|          Bob|
|Charlie|   CHARLIE|   charlie|          7|          Cha|
|  Diana|     DIANA|     diana|          5|          Dia|
|    Eve|       EVE|       eve|          3|          Eve|
|  Frank|     FRANK|     frank|          5|          Fra|
+-------+----------+----------+-----------+-------------+



In [126]:
print("\n=== MATHEMATICAL FUNCTIONS ===")
df.select(
    df["name"],
    df["salary"],
    F.round(df["salary"] / 12, 2).alias("monthly_salary"),
    F.sqrt(df["age"]).alias("sqrt_age"),
    F.abs(df["age"] - 30).alias("age_diff_from_30")
).show()


=== MATHEMATICAL FUNCTIONS ===
+-------+------+--------------+-----------------+----------------+
|   name|salary|monthly_salary|         sqrt_age|age_diff_from_30|
+-------+------+--------------+-----------------+----------------+
|  Alice| 75000|        6250.0|              5.0|               5|
|    Bob| 85000|       7083.33|5.477225575051661|               0|
|Charlie| 80000|       6666.67|5.916079783099616|               5|
|  Diana| 65000|       5416.67|5.291502622129181|               2|
|    Eve| 90000|        7500.0|5.656854249492381|               2|
|  Frank| 78000|        6500.0|5.385164807134504|               1|
+-------+------+--------------+-----------------+----------------+



In [127]:
df.limit(1).show()

+-----+---+---------+------+----------+
| name|age|job_title|salary| hire_date|
+-----+---+---------+------+----------+
|Alice| 25| Engineer| 75000|2020-01-15|
+-----+---+---------+------+----------+



In [128]:
# Date functions (convert string to date first)
print("\n=== DATE FUNCTIONS ===")
df_with_date = df.withColumn("hire_date", F.to_date(df["hire_date"], "yyyy-MM-dd"))

df_with_date.select(
    df_with_date["name"],
    df_with_date["hire_date"],
    F.year(df_with_date["hire_date"]).alias("hire_year"),
    F.month(df_with_date["hire_date"]).alias("hire_month"),
    F.datediff(F.current_date(), df_with_date["hire_date"]).alias("days_since_hire")
).show()


=== DATE FUNCTIONS ===
+-------+----------+---------+----------+---------------+
|   name| hire_date|hire_year|hire_month|days_since_hire|
+-------+----------+---------+----------+---------------+
|  Alice|2020-01-15|     2020|         1|           2006|
|    Bob|2019-03-20|     2019|         3|           2307|
|Charlie|2018-06-10|     2018|         6|           2590|
|  Diana|2021-02-28|     2021|         2|           1596|
|    Eve|2017-11-05|     2017|        11|           2807|
|  Frank|2020-09-12|     2020|         9|           1765|
+-------+----------+---------+----------+---------------+



## 5. Advanced DataFrame Operations

Explore more advanced operations like grouping, aggregations, and window functions.


In [129]:
df.show()

+-------+---+---------+------+----------+
|   name|age|job_title|salary| hire_date|
+-------+---+---------+------+----------+
|  Alice| 25| Engineer| 75000|2020-01-15|
|    Bob| 30|  Manager| 85000|2019-03-20|
|Charlie| 35| Engineer| 80000|2018-06-10|
|  Diana| 28|  Analyst| 65000|2021-02-28|
|    Eve| 32|  Manager| 90000|2017-11-05|
|  Frank| 29| Engineer| 78000|2020-09-12|
+-------+---+---------+------+----------+



In [130]:
print("=== GROUPING AND AGGREGATION ===")
print("1. Group by job title and calculate statistics:")

df.groupBy("job_title").agg(
    F.count("*").alias("count"),
    F.avg("salary").alias("avg_salary"),
    F.min("age").alias("min_age"),
    F.max("age").alias("max_age")
).show()

=== GROUPING AND AGGREGATION ===
1. Group by job title and calculate statistics:
+---------+-----+-----------------+-------+-------+
|job_title|count|       avg_salary|min_age|max_age|
+---------+-----+-----------------+-------+-------+
| Engineer|    3|77666.66666666667|     25|     35|
|  Manager|    2|          87500.0|     30|     32|
|  Analyst|    1|          65000.0|     28|     28|
+---------+-----+-----------------+-------+-------+



In [131]:
print("\n2. Multiple grouping columns:")

(
    df
    .withColumn("age_group", F.when(df["age"] < 30, "Young").otherwise("Experienced"))
    .groupBy("job_title", "age_group")
    .agg(
        F.count("*").alias("count"),
        F.avg("salary").alias("avg_salary")
        )
    .orderBy("job_title", "age_group")
).show()


2. Multiple grouping columns:
+---------+-----------+-----+----------+
|job_title|  age_group|count|avg_salary|
+---------+-----------+-----+----------+
|  Analyst|      Young|    1|   65000.0|
| Engineer|Experienced|    1|   80000.0|
| Engineer|      Young|    2|   76500.0|
|  Manager|Experienced|    2|   87500.0|
+---------+-----------+-----+----------+



In [151]:
print("\n=== WINDOW FUNCTIONS ===")
from pyspark.sql.window import Window

window_spec = Window.partitionBy("job_title").orderBy("salary")
window_all = Window.orderBy("salary")

df.select(
    df["name"],
    df["job_title"],
    df["salary"],
    F.row_number().over(window_spec).alias("rank_in_job"),
    F.rank().over(window_all).alias("overall_rank"),
    F.lag(df["salary"], 1).over(window_all).alias("prev_salary")
).show()



=== WINDOW FUNCTIONS ===
+-------+---------+------+-----------+------------+-----------+
|   name|job_title|salary|rank_in_job|overall_rank|prev_salary|
+-------+---------+------+-----------+------------+-----------+
|  Diana|  Analyst| 65000|          1|           1|       NULL|
|  Alice| Engineer| 75000|          1|           2|      65000|
|  Frank| Engineer| 78000|          2|           3|      75000|
|Charlie| Engineer| 80000|          3|           4|      78000|
|    Bob|  Manager| 85000|          1|           5|      80000|
|    Eve|  Manager| 90000|          2|           6|      85000|
+-------+---------+------+-----------+------------+-----------+



25/07/13 12:45:15 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.


In [135]:
df.orderBy("job_title").show()

+-------+---+---------+------+----------+
|   name|age|job_title|salary| hire_date|
+-------+---+---------+------+----------+
|  Diana| 28|  Analyst| 65000|2021-02-28|
|  Alice| 25| Engineer| 75000|2020-01-15|
|Charlie| 35| Engineer| 80000|2018-06-10|
|  Frank| 29| Engineer| 78000|2020-09-12|
|    Bob| 30|  Manager| 85000|2019-03-20|
|    Eve| 32|  Manager| 90000|2017-11-05|
+-------+---+---------+------+----------+



## 6. Practice Exercises

Complete these exercises to test your understanding of DataFrame operations.


In [152]:
sales_data = [
    ("Product A", "Electronics", 1500, 10),
    ("Product B", "Clothing", 800, 25),
    ("Product C", "Electronics", 2000, 5),
    ("Product D", "Books", 300, 50),
    ("Product E", "Clothing", 1200, 15),
    ("Product F", "Electronics", 1800, 8)
]

sales_df = spark.createDataFrame(sales_data, ["product_name", "category", "price", "quantity"])
sales_df.show()

+------------+-----------+-----+--------+
|product_name|   category|price|quantity|
+------------+-----------+-----+--------+
|   Product A|Electronics| 1500|      10|
|   Product B|   Clothing|  800|      25|
|   Product C|Electronics| 2000|       5|
|   Product D|      Books|  300|      50|
|   Product E|   Clothing| 1200|      15|
|   Product F|Electronics| 1800|       8|
+------------+-----------+-----+--------+



In [166]:
print("Exercise 1: Select product_name and calculate total_value (price * quantity)")
# Your code here
sales_df.select(
    F.col("product_name"),
    (F.col("price") * F.col("quantity")).alias("total_value")
).show()

Exercise 1: Select product_name and calculate total_value (price * quantity)
+------------+-----------+
|product_name|total_value|
+------------+-----------+
|   Product A|      15000|
|   Product B|      20000|
|   Product C|      10000|
|   Product D|      15000|
|   Product E|      18000|
|   Product F|      14400|
+------------+-----------+



In [171]:
print("\nExercise 2: Filter products with price > 1000")

mask1 = sales_df["price"] > 1000
sales_df.filter(mask1).show()


Exercise 2: Filter products with price > 1000
+------------+-----------+-----+--------+
|product_name|   category|price|quantity|
+------------+-----------+-----+--------+
|   Product A|Electronics| 1500|      10|
|   Product C|Electronics| 2000|       5|
|   Product E|   Clothing| 1200|      15|
|   Product F|Electronics| 1800|       8|
+------------+-----------+-----+--------+



In [173]:
print("\nExercise 3: Add a price_category column: 'Expensive' if price > 1500, else 'Affordable'")

sales_df.withColumn(
    "price_category",
    F.when(sales_df["price"]>1500, "Expensive").otherwise("Affordable")
).show()


Exercise 3: Add a price_category column: 'Expensive' if price > 1500, else 'Affordable'
+------------+-----------+-----+--------+--------------+
|product_name|   category|price|quantity|price_category|
+------------+-----------+-----+--------+--------------+
|   Product A|Electronics| 1500|      10|    Affordable|
|   Product B|   Clothing|  800|      25|    Affordable|
|   Product C|Electronics| 2000|       5|     Expensive|
|   Product D|      Books|  300|      50|    Affordable|
|   Product E|   Clothing| 1200|      15|    Affordable|
|   Product F|Electronics| 1800|       8|     Expensive|
+------------+-----------+-----+--------+--------------+



In [177]:
print("\nExercise 4: Group by category and calculate average price and total quantity")
# Your code here
sales_df.groupBy("category").agg(
    F.avg("price").alias("avg_price"),
    F.sum("quantity").alias("total_quantity")
).show()


Exercise 4: Group by category and calculate average price and total quantity
+-----------+------------------+--------------+
|   category|         avg_price|total_quantity|
+-----------+------------------+--------------+
|Electronics|1766.6666666666667|            23|
|   Clothing|            1000.0|            40|
|      Books|             300.0|            50|
+-----------+------------------+--------------+



In [181]:
print("\nExercise 5: Find the most expensive product in each category")
window_spec = Window.partitionBy("category").orderBy(F.desc("price"))

# Rank products within each category by descending price, then keep the top-ranked row(s)
(
    sales_df
    .withColumn("price_rank", F.rank().over(window_spec))
    .filter(F.col("price_rank") == 1)
    .select("product_name", "category", "price")
).show()


Exercise 5: Find the most expensive product in each category
+------------+-----------+-----+
|product_name|   category|price|
+------------+-----------+-----+
|   Product D|      Books|  300|
|   Product E|   Clothing| 1200|
|   Product C|Electronics| 2000|
+------------+-----------+-----+


