# DataFrame Operations - Practice Notebook

This notebook focuses on **Untyped Dataset Operations** (DataFrame operations) as covered in the [Spark SQL Getting Started Guide](https://spark.apache.org/docs/latest/sql-getting-started.html).

## Learning Objectives
- Master DataFrame transformations and actions
- Understand lazy evaluation in Spark
- Practice column operations and expressions
- Work with different data types and functions

## Sections
A setup cell first creates the sample DataFrame; the numbered sections follow.

1. **Column Selection and Expressions**
2. **Filtering and Conditional Operations**
3. **Transformations vs Actions**
4. **Working with Built-in Functions**
5. **Advanced DataFrame Operations**
6. **Practice Exercises**

---


In [3]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import *

spark = SparkSession.builder.appName("DataFrame Operations").getOrCreate()

# Create sample data for practice
data = [
    ("Alice", 25, "Engineer", 75000, "2020-01-15"),
    ("Bob", 30, "Manager", 85000, "2019-03-20"),
    ("Charlie", 35, "Engineer", 80000, "2018-06-10"),
    ("Diana", 28, "Analyst", 65000, "2021-02-28"),
    ("Eve", 32, "Manager", 90000, "2017-11-05"),
    ("Frank", 29, "Engineer", 78000, "2020-09-12"),
]

columns = ["name", "age", "job_title", "salary", "hire_date"]
df = spark.createDataFrame(data, columns)
df.show()
df.printSchema()


+-------+---+---------+------+----------+
|   name|age|job_title|salary| hire_date|
+-------+---+---------+------+----------+
|  Alice| 25| Engineer| 75000|2020-01-15|
|    Bob| 30|  Manager| 85000|2019-03-20|
|Charlie| 35| Engineer| 80000|2018-06-10|
|  Diana| 28|  Analyst| 65000|2021-02-28|
|    Eve| 32|  Manager| 90000|2017-11-05|
|  Frank| 29| Engineer| 78000|2020-09-12|
+-------+---+---------+------+----------+

root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- job_title: string (nullable = true)
 |-- salary: long (nullable = true)
 |-- hire_date: string (nullable = true)



## 1. Column Selection and Expressions

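Columns can be selected by name (`"name"`), by attribute (`df.name`), or by index (`df["name"]`). Arithmetic, built-in functions, and `F.when` applied to these columns produce derived columns, which `alias` names.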


In [4]:
df.select("name").show()

+-------+
|   name|
+-------+
|  Alice|
|    Bob|
|Charlie|
|  Diana|
|    Eve|
|  Frank|
+-------+



In [5]:
df.select("name", "age", "salary").show()

+-------+---+------+
|   name|age|salary|
+-------+---+------+
|  Alice| 25| 75000|
|    Bob| 30| 85000|
|Charlie| 35| 80000|
|  Diana| 28| 65000|
|    Eve| 32| 90000|
|  Frank| 29| 78000|
+-------+---+------+



In [6]:
df.select(df.name, df.age).show()

+-------+---+
|   name|age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 35|
|  Diana| 28|
|    Eve| 32|
|  Frank| 29|
+-------+---+



In [7]:
df.select(df["age"]).show()

+---+
|age|
+---+
| 25|
| 30|
| 35|
| 28|
| 32|
| 29|
+---+



In [8]:
df.select("*").show()

+-------+---+---------+------+----------+
|   name|age|job_title|salary| hire_date|
+-------+---+---------+------+----------+
|  Alice| 25| Engineer| 75000|2020-01-15|
|    Bob| 30|  Manager| 85000|2019-03-20|
|Charlie| 35| Engineer| 80000|2018-06-10|
|  Diana| 28|  Analyst| 65000|2021-02-28|
|    Eve| 32|  Manager| 90000|2017-11-05|
|  Frank| 29| Engineer| 78000|2020-09-12|
+-------+---+---------+------+----------+



In [9]:
# Note: "bonus" here is 10% of age, not of salary; alias() names the derived column.
df.select(df["name"], df["salary"], (df["age"] * 0.1).alias("bonus")).show()

+-------+------+------------------+
|   name|salary|             bonus|
+-------+------+------------------+
|  Alice| 75000|               2.5|
|    Bob| 85000|               3.0|
|Charlie| 80000|               3.5|
|  Diana| 65000|2.8000000000000003|
|    Eve| 90000|               3.2|
|  Frank| 78000|2.9000000000000004|
+-------+------+------------------+



In [10]:
df.select(df["name"], F.upper(df["name"]).alias("name_upper")).show()

+-------+----------+
|   name|name_upper|
+-------+----------+
|  Alice|     ALICE|
|    Bob|       BOB|
|Charlie|   CHARLIE|
|  Diana|     DIANA|
|    Eve|       EVE|
|  Frank|     FRANK|
+-------+----------+



In [11]:
df.select(
    df["name"],
    df["age"],
    (df["age"] + 5).alias("age_in_5_years"),
    (df["salary"] / 12).alias("monthly_salary"),
).show()

+-------+---+--------------+-----------------+
|   name|age|age_in_5_years|   monthly_salary|
+-------+---+--------------+-----------------+
|  Alice| 25|            30|           6250.0|
|    Bob| 30|            35|7083.333333333333|
|Charlie| 35|            40|6666.666666666667|
|  Diana| 28|            33|5416.666666666667|
|    Eve| 32|            37|           7500.0|
|  Frank| 29|            34|           6500.0|
+-------+---+--------------+-----------------+



In [12]:
df.select(
    df["name"],
    df["age"],
    F.when(df["age"] >= 30, "Senior").otherwise("Junior").alias("seniority"),
).show()

+-------+---+---------+
|   name|age|seniority|
+-------+---+---------+
|  Alice| 25|   Junior|
|    Bob| 30|   Senior|
|Charlie| 35|   Senior|
|  Diana| 28|   Junior|
|    Eve| 32|   Senior|
|  Frank| 29|   Junior|
+-------+---+---------+



## 2. Filtering and Conditional Operations

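Rows are filtered with `filter`, which takes a boolean column expression. Conditions combine with `&` (and) and `|` (or); each comparison must be wrapped in parentheses because these operators bind more tightly than `>` or `==` in Python.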


In [13]:
df.filter(df["age"] > 30).show()

+-------+---+---------+------+----------+
|   name|age|job_title|salary| hire_date|
+-------+---+---------+------+----------+
|Charlie| 35| Engineer| 80000|2018-06-10|
|    Eve| 32|  Manager| 90000|2017-11-05|
+-------+---+---------+------+----------+



In [14]:
df.filter(df["job_title"] == "Engineer").show()

+-------+---+---------+------+----------+
|   name|age|job_title|salary| hire_date|
+-------+---+---------+------+----------+
|  Alice| 25| Engineer| 75000|2020-01-15|
|Charlie| 35| Engineer| 80000|2018-06-10|
|  Frank| 29| Engineer| 78000|2020-09-12|
+-------+---+---------+------+----------+



In [15]:
df.filter((df["age"] > 28) & (df["salary"] > 75000)).show()

+-------+---+---------+------+----------+
|   name|age|job_title|salary| hire_date|
+-------+---+---------+------+----------+
|    Bob| 30|  Manager| 85000|2019-03-20|
|Charlie| 35| Engineer| 80000|2018-06-10|
|    Eve| 32|  Manager| 90000|2017-11-05|
|  Frank| 29| Engineer| 78000|2020-09-12|
+-------+---+---------+------+----------+



In [16]:
df.filter((df["job_title"] == "Manager") | (df["salary"] > 85000)).show()

+----+---+---------+------+----------+
|name|age|job_title|salary| hire_date|
+----+---+---------+------+----------+
| Bob| 30|  Manager| 85000|2019-03-20|
| Eve| 32|  Manager| 90000|2017-11-05|
+----+---+---------+------+----------+



In [17]:
df.filter(df["name"].contains("Diana")).show()

+-----+---+---------+------+----------+
| name|age|job_title|salary| hire_date|
+-----+---+---------+------+----------+
|Diana| 28|  Analyst| 65000|2021-02-28|
+-----+---+---------+------+----------+



In [18]:
df.filter(df["job_title"].isin(["Engineer", "Manager"])).show()

+-------+---+---------+------+----------+
|   name|age|job_title|salary| hire_date|
+-------+---+---------+------+----------+
|  Alice| 25| Engineer| 75000|2020-01-15|
|    Bob| 30|  Manager| 85000|2019-03-20|
|Charlie| 35| Engineer| 80000|2018-06-10|
|    Eve| 32|  Manager| 90000|2017-11-05|
|  Frank| 29| Engineer| 78000|2020-09-12|
+-------+---+---------+------+----------+



## 3. Transformations vs Actions

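Transformations such as `filter`, `select`, and `orderBy` are lazy: they only describe a query and return a new DataFrame. Actions such as `show`, `count`, `collect`, `first`, and `take` trigger execution of the accumulated transformations.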


In [19]:
filtered_df = df.filter(df["age"] > 25)
selected_df = filtered_df.select("name", "age", "salary")
sorted_df = selected_df.orderBy("salary", ascending=False)

In [20]:
type(sorted_df)

pyspark.sql.classic.dataframe.DataFrame

In [21]:
sorted_df.show()

+-------+---+------+
|   name|age|salary|
+-------+---+------+
|    Eve| 32| 90000|
|    Bob| 30| 85000|
|Charlie| 35| 80000|
|  Frank| 29| 78000|
|  Diana| 28| 65000|
+-------+---+------+



In [22]:
sorted_df.count()

5

In [23]:
collected_data = sorted_df.collect()

In [24]:
type(collected_data)

list

In [25]:
collected_data[0]

Row(name='Eve', age=32, salary=90000)

In [26]:
sorted_df.first()

Row(name='Eve', age=32, salary=90000)

In [27]:
sorted_df.take(2)

[Row(name='Eve', age=32, salary=90000), Row(name='Bob', age=30, salary=85000)]

In [28]:
# Transformations (lazy - don't execute immediately)
print("=== TRANSFORMATIONS (Lazy) ===")
print("These operations don't execute until an action is called")

# Create a series of transformations
filtered_df = df.filter(df["age"] > 25)
selected_df = filtered_df.select("name", "age", "salary")
sorted_df = selected_df.orderBy("salary", ascending=False)

print("Transformations created, but not executed yet...")
print("Type of result:", type(sorted_df))

# Actions (eager - trigger execution)
print("\n=== ACTIONS (Eager) ===")
print("These operations trigger execution of all transformations")

print("\n1. show() - Display data:")
sorted_df.show()

print("\n2. count() - Count rows:")
print(f"Number of rows: {sorted_df.count()}")

print("\n3. collect() - Collect all data to driver:")
collected_data = sorted_df.collect()
print(f"Collected data type: {type(collected_data)}")
print(f"First row: {collected_data[0]}")

print("\n4. first() - Get first row:")
first_row = sorted_df.first()
print(f"First row: {first_row}")

print("\n5. take(n) - Take first n rows:")
first_two = sorted_df.take(2)
print(f"First two rows: {first_two}")

=== TRANSFORMATIONS (Lazy) ===
These operations don't execute until an action is called
Transformations created, but not executed yet...
Type of result: <class 'pyspark.sql.classic.dataframe.DataFrame'>

=== ACTIONS (Eager) ===
These operations trigger execution of all transformations

1. show() - Display data:
+-------+---+------+
|   name|age|salary|
+-------+---+------+
|    Eve| 32| 90000|
|    Bob| 30| 85000|
|Charlie| 35| 80000|
|  Frank| 29| 78000|
|  Diana| 28| 65000|
+-------+---+------+


2. count() - Count rows:
Number of rows: 5

3. collect() - Collect all data to driver:
Collected data type: <class 'list'>
First row: Row(name='Eve', age=32, salary=90000)

4. first() - Get first row:
First row: Row(name='Eve', age=32, salary=90000)

5. take(n) - Take first n rows:
First two rows: [Row(name='Eve', age=32, salary=90000), Row(name='Bob', age=30, salary=85000)]


## 4. Working with Built-in Functions

The `pyspark.sql.functions` module (imported here as `F`) provides built-in string, math, and date functions that operate on columns.


In [29]:
df.show()

+-------+---+---------+------+----------+
|   name|age|job_title|salary| hire_date|
+-------+---+---------+------+----------+
|  Alice| 25| Engineer| 75000|2020-01-15|
|    Bob| 30|  Manager| 85000|2019-03-20|
|Charlie| 35| Engineer| 80000|2018-06-10|
|  Diana| 28|  Analyst| 65000|2021-02-28|
|    Eve| 32|  Manager| 90000|2017-11-05|
|  Frank| 29| Engineer| 78000|2020-09-12|
+-------+---+---------+------+----------+



In [30]:
print("=== STRING FUNCTIONS ===")
df.select(
    df["name"],
    F.upper(df["name"]).alias("name_upper"),
    F.lower(df["name"]).alias("name_lower"),
    F.length(df["name"]).alias("name_length"),
    F.substring(df["name"], 1, 3).alias("first_3_chars"),
).show()

=== STRING FUNCTIONS ===
+-------+----------+----------+-----------+-------------+
|   name|name_upper|name_lower|name_length|first_3_chars|
+-------+----------+----------+-----------+-------------+
|  Alice|     ALICE|     alice|          5|          Ali|
|    Bob|       BOB|       bob|          3|          Bob|
|Charlie|   CHARLIE|   charlie|          7|          Cha|
|  Diana|     DIANA|     diana|          5|          Dia|
|    Eve|       EVE|       eve|          3|          Eve|
|  Frank|     FRANK|     frank|          5|          Fra|
+-------+----------+----------+-----------+-------------+



In [31]:
df.select(
    df["name"],
    df["salary"],
    F.round(df["salary"] / 12, 2).alias("monthly_salary"),
    F.sqrt(df["age"]).alias("sqrt_age"),
    F.abs(df["age"] - 30).alias("age_diff_from_30"),
).show()

+-------+------+--------------+-----------------+----------------+
|   name|salary|monthly_salary|         sqrt_age|age_diff_from_30|
+-------+------+--------------+-----------------+----------------+
|  Alice| 75000|        6250.0|              5.0|               5|
|    Bob| 85000|       7083.33|5.477225575051661|               0|
|Charlie| 80000|       6666.67|5.916079783099616|               5|
|  Diana| 65000|       5416.67|5.291502622129181|               2|
|    Eve| 90000|        7500.0|5.656854249492381|               2|
|  Frank| 78000|        6500.0|5.385164807134504|               1|
+-------+------+--------------+-----------------+----------------+



In [32]:
df_with_date = df.withColumn("hire_date", F.to_date(df["hire_date"], "yyyy-MM-dd"))

df_with_date.select(
    df_with_date["name"],
    df_with_date["hire_date"],
    F.year(df_with_date["hire_date"]).alias("hire_year"),
    F.month(df_with_date["hire_date"]).alias("hire_month"),
    F.datediff(F.current_date(), df_with_date["hire_date"]).alias("days_since_hire"),
).show()

+-------+----------+---------+----------+---------------+
|   name| hire_date|hire_year|hire_month|days_since_hire|
+-------+----------+---------+----------+---------------+
|  Alice|2020-01-15|     2020|         1|           2065|
|    Bob|2019-03-20|     2019|         3|           2366|
|Charlie|2018-06-10|     2018|         6|           2649|
|  Diana|2021-02-28|     2021|         2|           1655|
|    Eve|2017-11-05|     2017|        11|           2866|
|  Frank|2020-09-12|     2020|         9|           1824|
+-------+----------+---------+----------+---------------+



## 5. Advanced DataFrame Operations

Explore more advanced operations like grouping, aggregations, and window functions.


In [37]:
print("=== GROUPING AND AGGREGATION ===")
print("1. Group by job title and calculate statistics:")
df.groupBy("job_title").agg(
    F.count("*").alias("count"),
    F.avg("salary").alias("avg_salary"),
    F.min("age").alias("min_age"),
    F.max("age").alias("max_age"),
).show()

=== GROUPING AND AGGREGATION ===
1. Group by job title and calculate statistics:
+---------+-----+-----------------+-------+-------+
|job_title|count|       avg_salary|min_age|max_age|
+---------+-----+-----------------+-------+-------+
| Engineer|    3|77666.66666666667|     25|     35|
|  Manager|    2|          87500.0|     30|     32|
|  Analyst|    1|          65000.0|     28|     28|
+---------+-----+-----------------+-------+-------+



In [43]:
df.withColumn(
    "age_group", F.when(df["age"] < 30, "Young").otherwise("Experienced")
).groupBy("age_group").agg(
    F.count("*").alias("count"), F.avg("salary").alias("avg_salary")
).show()

+-----------+-----+-----------------+
|  age_group|count|       avg_salary|
+-----------+-----+-----------------+
|      Young|    3|72666.66666666667|
|Experienced|    3|          85000.0|
+-----------+-----+-----------------+



In [44]:
df.show()

+-------+---+---------+------+----------+
|   name|age|job_title|salary| hire_date|
+-------+---+---------+------+----------+
|  Alice| 25| Engineer| 75000|2020-01-15|
|    Bob| 30|  Manager| 85000|2019-03-20|
|Charlie| 35| Engineer| 80000|2018-06-10|
|  Diana| 28|  Analyst| 65000|2021-02-28|
|    Eve| 32|  Manager| 90000|2017-11-05|
|  Frank| 29| Engineer| 78000|2020-09-12|
+-------+---+---------+------+----------+



In [49]:
from pyspark.sql.window import Window

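# Number rows within each job title, lowest salary first (orderBy is ascending by default).
window_spec = Window.partitionBy("job_title").orderBy("salary")
# No partitionBy: all rows form a single window, which triggers Spark's single-partition warning.
window_all = Window.orderBy("salary")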

df.select(
    df["name"],
    df["job_title"],
    df["salary"],
    F.row_number().over(window_spec).alias("rank_in_job"),
    F.rank().over(window_all).alias("overall_rank"),
    F.lag(df["salary"], 1).over(window_all).alias("prev_salary"),
).show()

+-------+---------+------+-----------+------------+-----------+
|   name|job_title|salary|rank_in_job|overall_rank|prev_salary|
+-------+---------+------+-----------+------------+-----------+
|  Diana|  Analyst| 65000|          1|           1|       NULL|
|  Alice| Engineer| 75000|          1|           2|      65000|
|  Frank| Engineer| 78000|          2|           3|      75000|
|Charlie| Engineer| 80000|          3|           4|      78000|
|    Bob|  Manager| 85000|          1|           5|      80000|
|    Eve|  Manager| 90000|          2|           6|      85000|
+-------+---------+------+-----------+------------+-----------+



25/09/10 06:00:46 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.


## 6. Practice Exercises

Complete these exercises to test your understanding of DataFrame operations.
