# Module 3 - Basic DataFrame Operations

## Introduction

DataFrame operations are the core of PySpark data processing. This module covers filtering rows, selecting columns, sorting, and removing duplicates: the building blocks for working with PySpark DataFrames.

## What You'll Learn

- Filtering data (WHERE clause equivalent)
- Selecting columns (SELECT clause equivalent)
- Sorting data (ORDER BY equivalent)
- Handling duplicates
- Understanding different ways to access columns
- Basic column operations


In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
# Import only what this module uses; importing sum, max, and min by name would shadow Python's built-ins
from pyspark.sql.functions import col

# Create SparkSession
spark = SparkSession.builder \
    .appName("DataFrame Transformations") \
    .master("local[*]") \
    .getOrCreate()

# Create sample DataFrame
data = [
    ("Alice", 25, "Sales", 50000, "New York"),
    ("Bob", 30, "IT", 60000, "London"),
    ("Charlie", 35, "Sales", 70000, "Tokyo"),
    ("Diana", 28, "IT", 55000, "Paris"),
    ("Eve", 32, "HR", 65000, "Sydney"),
    ("Frank", 27, "Sales", 52000, "New York"),
    ("Grace", 29, None, 58000, "London")  # Department is null
]

schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Department", StringType(), True),
    StructField("Salary", IntegerType(), True),
    StructField("City", StringType(), True)
])

df = spark.createDataFrame(data, schema)
print("Sample DataFrame:")
df.show()




Sample DataFrame:


                                                                                

+-------+---+----------+------+--------+
|   Name|Age|Department|Salary|    City|
+-------+---+----------+------+--------+
|  Alice| 25|     Sales| 50000|New York|
|    Bob| 30|        IT| 60000|  London|
|Charlie| 35|     Sales| 70000|   Tokyo|
|  Diana| 28|        IT| 55000|   Paris|
|    Eve| 32|        HR| 65000|  Sydney|
|  Frank| 27|     Sales| 52000|New York|
|  Grace| 29|      NULL| 58000|  London|
+-------+---+----------+------+--------+



## Filtering Data

Filtering keeps only the rows that satisfy a condition, like SQL's WHERE clause.


In [2]:
# Filter using column expression
df_filtered = df.filter(df.Age > 28)
print("Employees older than 28:")
df_filtered.show()

# Alternative syntax using col()
df_filtered2 = df.filter(col("Age") > 28)
print("\nSame result using col():")
df_filtered2.show()


Employees older than 28:
+-------+---+----------+------+------+
|   Name|Age|Department|Salary|  City|
+-------+---+----------+------+------+
|    Bob| 30|        IT| 60000|London|
|Charlie| 35|     Sales| 70000| Tokyo|
|    Eve| 32|        HR| 65000|Sydney|
|  Grace| 29|      NULL| 58000|London|
+-------+---+----------+------+------+


Same result using col():
+-------+---+----------+------+------+
|   Name|Age|Department|Salary|  City|
+-------+---+----------+------+------+
|    Bob| 30|        IT| 60000|London|
|Charlie| 35|     Sales| 70000| Tokyo|
|    Eve| 32|        HR| 65000|Sydney|
|  Grace| 29|      NULL| 58000|London|
+-------+---+----------+------+------+



In [3]:
# Multiple conditions using & (and) and | (or); wrap each condition in parentheses
df_complex_filter = df.filter((df.Age > 28) & (df.Salary > 60000))
print("Age > 28 AND Salary > 60000:")
df_complex_filter.show()

# Using OR
df_or_filter = df.filter((df.Department == "Sales") | (df.Department == "IT"))
print("\nDepartment is Sales OR IT:")
df_or_filter.show()


Age > 28 AND Salary > 60000:
+-------+---+----------+------+------+
|   Name|Age|Department|Salary|  City|
+-------+---+----------+------+------+
|Charlie| 35|     Sales| 70000| Tokyo|
|    Eve| 32|        HR| 65000|Sydney|
+-------+---+----------+------+------+


Department is Sales OR IT:
+-------+---+----------+------+--------+
|   Name|Age|Department|Salary|    City|
+-------+---+----------+------+--------+
|  Alice| 25|     Sales| 50000|New York|
|    Bob| 30|        IT| 60000|  London|
|Charlie| 35|     Sales| 70000|   Tokyo|
|  Diana| 28|        IT| 55000|   Paris|
|  Frank| 27|     Sales| 52000|New York|
+-------+---+----------+------+--------+



In [4]:
# Filter using where() - same as filter()
df_where = df.where(df.City == "New York")
print("Employees in New York:")
df_where.show()


Employees in New York:
+-----+---+----------+------+--------+
| Name|Age|Department|Salary|    City|
+-----+---+----------+------+--------+
|Alice| 25|     Sales| 50000|New York|
|Frank| 27|     Sales| 52000|New York|
+-----+---+----------+------+--------+



## Selecting Columns

`select()` returns a new DataFrame containing only the columns you ask for, like SQL's SELECT clause.

### Accessing Columns in PySpark

PySpark provides multiple ways to access columns in a DataFrame. Understanding these methods is crucial for writing effective PySpark code.

#### 1. String Notation

The simplest way to reference a column is by using a string with the column name.

```python
df.select("Name", "Age", "Salary")  # strings are column names
df.filter("Age > 30")               # filter() parses a string as a SQL condition
```

**When to use**: Simple column references in `select()`, `filter()`, `groupBy()`, etc.

#### 2. Prefixing Column Name with DataFrame

You can access columns using dot notation by prefixing the column name with the DataFrame name.

```python
df.select(df.Name, df.Age)
df.filter(df.Age > 30)
df.filter(df.Department == "IT")
```

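**When to use**: When a column must be tied to a specific DataFrame, for example in a join condition. Dot notation only works for column names that are valid Python identifiers and do not clash with a DataFrame attribute or method such as `count`.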

#### 3. Array Notation

You can access columns using bracket notation (similar to dictionary access).

```python
df.select(df["Name"], df["Age"])
df.filter(df["Age"] > 30)
```

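**When to use**: In the same situations as dot notation, and also when the column name contains spaces or special characters or clashes with a DataFrame attribute or method.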

#### 4. Column Object Notation

Using the `col()` function from `pyspark.sql.functions` to create a Column object.

```python
from pyspark.sql.functions import col

df.select(col("Name"), col("Age"))
df.filter(col("Age") > 30)
```

**When to use**: When you need a Column object that is not tied to a DataFrame variable, for example in chained transformations or reusable expressions.

#### 5. Column Expression

Using `expr()` function to write SQL-like expressions as strings.

```python
from pyspark.sql.functions import expr

df.select(expr("Name"), expr("Age"))
df.select(expr("Salary * 1.1 as NewSalary"))
df.filter(expr("Age > 30"))
```

**When to use**: When you need to write complex SQL-like expressions or when working with calculated columns.

### Why Are There So Many Ways of Accessing Columns?

Each method serves different purposes and has specific use cases:

#### Prefixing Column Name with DataFrame

**Purpose**: Resolve ambiguity when multiple DataFrames have columns with the same name.

**Example**: If two different DataFrames have columns with the same name, prefixing helps the system identify which DataFrame's column to use.

```python
# Both orders_df and customer_df have a 'cust_id' column
joined = orders_df.join(customer_df, orders_df.cust_id == customer_df.cust_id)

# Without prefixing - AMBIGUOUS (raises AnalysisException):
# joined.select("cust_id")  # Which DataFrame's cust_id?

# With prefixing - CLEAR:
joined.select(orders_df.cust_id)
joined.select(customer_df.cust_id)
```

**When to use**: When joining DataFrames with common column names, or when working with multiple DataFrames in the same context.

#### Column Expression

**Purpose**: Lets you write a computation as a SQL expression string, which Spark parses into a Column.

**Example**: When you need to perform calculations or transformations that are easier to express in SQL syntax.

```python
# Increment customer ID and create a new customer ID
df.select(expr("cust_id + 1 as new_cust_id"))
```

**When to use**: 
- Complex calculations that are easier to write in SQL syntax
- When you want to leverage SQL's expression capabilities
- Creating aliased columns with calculations

#### Column Object

**Purpose**: Exposes Column methods so that conditions and transformations can be built in Python code instead of SQL strings.

**Example**: Using column methods for filtering and transformations.

```python
# Using Column Object methods
orders_df.select("*").where(col('order_status').like('PENDING%')).show()

# Equivalent using string expression
orders_df.select("*").where("order_status like 'PENDING%'").show()
```

**When to use**:
- When you need to use column methods (`.like()`, `.isNull()`, `.isNotNull()`, etc.)
- Programmatic column manipulation
- Building conditions from Python variables without string formatting
- Chaining column transformations


### Understanding `select()` vs `selectExpr()`

Both `select()` and `selectExpr()` are used to select columns, but they handle expressions differently:

**Key Difference:**

- **`select()`**: Treats every plain string as a column name. To pass an expression, wrap it in `expr()` or build it from Column operations.
- **`selectExpr()`**: Parses every string argument as a SQL expression, so a bare column name and a calculation such as `Salary * 1.1 as NewSalary` can both be passed directly.

**When to Use:**
- Use `select()` when you are working with column names or Column objects
- Use `selectExpr()` when you prefer SQL-like syntax and want Spark to parse the expressions for you


In [5]:
# Example: select() vs selectExpr()

from pyspark.sql.functions import expr

print("="*60)
print("Using select() - Must explicitly use expr() for expressions")
print("="*60)

# With select(), you need to explicitly use expr() for SQL expressions
df_select = df.select(
    "Name",                    # Column name - direct
    "Age",                     # Column name - direct
    expr("Salary * 1.1 as NewSalary")  # Expression - must use expr()
)

print("Using select() with expr():")
df_select.show()

print("\n" + "="*60)
print("Using selectExpr() - Automatically handles expressions")
print("="*60)

# With selectExpr(), you can write SQL-like expressions directly
df_selectExpr = df.selectExpr(
    "Name",                    # Column name - works
    "Age",                     # Column name - works
    "Salary * 1.1 as NewSalary"  # Expression - automatically parsed!
)

print("Using selectExpr() - no need for expr():")
df_selectExpr.show()

print("\n" + "="*60)
print("Key Takeaway:")
print("="*60)
print("select(): Must explicitly use expr() for SQL expressions")
print("selectExpr(): Automatically parses SQL expressions from strings")
print("\nBoth produce the same result, but selectExpr() is more convenient")
print("for SQL-like expressions!")


Using select() - Must explicitly use expr() for expressions
Using select() with expr():
+-------+---+---------+
|   Name|Age|NewSalary|
+-------+---+---------+
|  Alice| 25|  55000.0|
|    Bob| 30|  66000.0|
|Charlie| 35|  77000.0|
|  Diana| 28|  60500.0|
|    Eve| 32|  71500.0|
|  Frank| 27|  57200.0|
|  Grace| 29|  63800.0|
+-------+---+---------+


Using selectExpr() - Automatically handles expressions
Using selectExpr() - no need for expr():
+-------+---+---------+
|   Name|Age|NewSalary|
+-------+---+---------+
|  Alice| 25|  55000.0|
|    Bob| 30|  66000.0|
|Charlie| 35|  77000.0|
|  Diana| 28|  60500.0|
|    Eve| 32|  71500.0|
|  Frank| 27|  57200.0|
|  Grace| 29|  63800.0|
+-------+---+---------+


Key Takeaway:
select(): Must explicitly use expr() for SQL expressions
selectExpr(): Automatically parses SQL expressions from strings

Both produce the same result, but selectExpr() is more convenient
for SQL-like expressions!


## Handling Duplicates in DataFrames

Duplicate rows can occur in DataFrames for various reasons. PySpark provides two methods to handle duplicates:

### 1. `distinct()` - Remove All Duplicate Rows

**Usage**: `df.distinct()` or `df.dropDuplicates()`

**Behavior**: Removes duplicate rows when **all columns** are considered. Two rows are considered duplicates only if all their column values are identical.

**When to Use**: When you want to remove rows that are completely identical across all columns.

### 2. `dropDuplicates()` - Remove Duplicates Based on Subset of Columns

**Usage**: `df.dropDuplicates([column1, column2, ...])`

**Behavior**: Removes duplicate rows when only a **subset of columns** are considered. You can specify which columns to use for duplicate detection.

**When to Use**: When you want to remove duplicates based on specific columns (e.g., keep only one row per customer_id, even if other columns differ).

**Key Difference:**
- `distinct()`: Considers all columns
- `dropDuplicates()`: Can consider a subset of columns (or all columns if no subset specified)


In [6]:
# Example: Handling Duplicates

# Create a DataFrame with duplicate rows
data_with_duplicates = [
    ("Alice", 25, "Sales", 50000, "New York"),
    ("Bob", 30, "IT", 60000, "London"),
    ("Alice", 25, "Sales", 50000, "New York"),  # Complete duplicate
    ("Charlie", 35, "Sales", 70000, "Tokyo"),
    ("Bob", 30, "IT", 60000, "London"),  # Complete duplicate
    ("Alice", 28, "Sales", 55000, "Boston"),  # Same name, different other values
]

df_duplicates = spark.createDataFrame(data_with_duplicates, ["Name", "Age", "Department", "Salary", "City"])

print("Original DataFrame with duplicates:")
df_duplicates.show()

print("\n" + "="*60)
print("Method 1: distinct() - Removes duplicates when ALL columns match")
print("="*60)

# distinct() removes rows where ALL columns are identical
df_distinct = df_duplicates.distinct()

print("After distinct() - only complete duplicates removed:")
df_distinct.show()

print("\n" + "="*60)
print("Method 2: dropDuplicates() - Remove duplicates based on ALL columns")
print("="*60)

# dropDuplicates() without arguments works like distinct()
df_drop_all = df_duplicates.dropDuplicates()

print("After dropDuplicates() (all columns):")
df_drop_all.show()

print("\n" + "="*60)
print("Method 3: dropDuplicates([columns]) - Remove duplicates based on SUBSET")
print("="*60)

# dropDuplicates() with column subset - removes duplicates based on Name only
# Keeps one row per Name; which row survives is not guaranteed
df_drop_subset = df_duplicates.dropDuplicates(["Name"])

print("After dropDuplicates(['Name']) - keeps one row per Name:")
df_drop_subset.show()

print("\n" + "="*60)
print("Summary:")
print("="*60)
print("1. df.distinct() → Removes duplicates when ALL columns match")
print("2. df.dropDuplicates() → Same as distinct() (all columns)")
print("3. df.dropDuplicates(['col1', 'col2']) → Removes duplicates based on specified columns")
print("\nNote: When using dropDuplicates() with a subset, Spark keeps ONE row per key.")
print("Which row is kept is not guaranteed, so do not rely on it being the first.")


Original DataFrame with duplicates:
+-------+---+----------+------+--------+
|   Name|Age|Department|Salary|    City|
+-------+---+----------+------+--------+
|  Alice| 25|     Sales| 50000|New York|
|    Bob| 30|        IT| 60000|  London|
|  Alice| 25|     Sales| 50000|New York|
|Charlie| 35|     Sales| 70000|   Tokyo|
|    Bob| 30|        IT| 60000|  London|
|  Alice| 28|     Sales| 55000|  Boston|
+-------+---+----------+------+--------+


Method 1: distinct() - Removes duplicates when ALL columns match
After distinct() - only complete duplicates removed:
+-------+---+----------+------+--------+
|   Name|Age|Department|Salary|    City|
+-------+---+----------+------+--------+
|  Alice| 25|     Sales| 50000|New York|
|    Bob| 30|        IT| 60000|  London|
|Charlie| 35|     Sales| 70000|   Tokyo|
|  Alice| 28|     Sales| 55000|  Boston|
+-------+---+----------+------+--------+


Method 2: dropDuplicates() - Remove duplicates based on ALL columns
After dropDuplicates() (all columns)

In [7]:
# Select specific columns
df_selected = df.select("Name", "Age", "Salary")
print("Selected columns:")
df_selected.show()

# Select using col()
df_selected2 = df.select(col("Name"), col("Age"))
df_selected2.show()


Selected columns:
+-------+---+------+
|   Name|Age|Salary|
+-------+---+------+
|  Alice| 25| 50000|
|    Bob| 30| 60000|
|Charlie| 35| 70000|
|  Diana| 28| 55000|
|    Eve| 32| 65000|
|  Frank| 27| 52000|
|  Grace| 29| 58000|
+-------+---+------+

+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 35|
|  Diana| 28|
|    Eve| 32|
|  Frank| 27|
|  Grace| 29|
+-------+---+



In [8]:
# Select with expressions
df_with_expr = df.select("Name", "Age", (col("Salary") * 1.1).alias("NewSalary"))
print("Select with calculated column:")
df_with_expr.show()


Select with calculated column:
+-------+---+-----------------+
|   Name|Age|        NewSalary|
+-------+---+-----------------+
|  Alice| 25|55000.00000000001|
|    Bob| 30|          66000.0|
|Charlie| 35|          77000.0|
|  Diana| 28|60500.00000000001|
|    Eve| 32|          71500.0|
|  Frank| 27|57200.00000000001|
|  Grace| 29|63800.00000000001|
+-------+---+-----------------+



## Sorting Data

`orderBy()` sorts rows by one or more columns, like SQL's ORDER BY clause.


In [9]:
# Sort by single column (ascending)
df_sorted = df.orderBy("Salary")
print("Sorted by Salary (ascending):")
df_sorted.show()

# Sort descending
df_sorted_desc = df.orderBy(col("Salary").desc())
print("\nSorted by Salary (descending):")
df_sorted_desc.show()


Sorted by Salary (ascending):
+-------+---+----------+------+--------+
|   Name|Age|Department|Salary|    City|
+-------+---+----------+------+--------+
|  Alice| 25|     Sales| 50000|New York|
|  Frank| 27|     Sales| 52000|New York|
|  Diana| 28|        IT| 55000|   Paris|
|  Grace| 29|      NULL| 58000|  London|
|    Bob| 30|        IT| 60000|  London|
|    Eve| 32|        HR| 65000|  Sydney|
|Charlie| 35|     Sales| 70000|   Tokyo|
+-------+---+----------+------+--------+


Sorted by Salary (descending):
+-------+---+----------+------+--------+
|   Name|Age|Department|Salary|    City|
+-------+---+----------+------+--------+
|Charlie| 35|     Sales| 70000|   Tokyo|
|    Eve| 32|        HR| 65000|  Sydney|
|    Bob| 30|        IT| 60000|  London|
|  Grace| 29|      NULL| 58000|  London|
|  Diana| 28|        IT| 55000|   Paris|
|  Frank| 27|     Sales| 52000|New York|
|  Alice| 25|     Sales| 50000|New York|
+-------+---+----------+------+--------+



In [10]:
# Sort by multiple columns
df_multi_sort = df.orderBy("Department", col("Salary").desc())
print("Sorted by Department, then Salary (descending):")
df_multi_sort.show()


Sorted by Department, then Salary (descending):
+-------+---+----------+------+--------+
|   Name|Age|Department|Salary|    City|
+-------+---+----------+------+--------+
|  Grace| 29|      NULL| 58000|  London|
|    Eve| 32|        HR| 65000|  Sydney|
|    Bob| 30|        IT| 60000|  London|
|  Diana| 28|        IT| 55000|   Paris|
|Charlie| 35|     Sales| 70000|   Tokyo|
|  Frank| 27|     Sales| 52000|New York|
|  Alice| 25|     Sales| 50000|New York|
+-------+---+----------+------+--------+



## Summary

In this module, you learned:

1. **Filtering Data**: Using `filter()` and `where()` to select rows based on conditions (SQL WHERE equivalent)
2. **Selecting Columns**: Using `select()` and `selectExpr()` to choose specific columns (SQL SELECT equivalent)
3. **Sorting Data**: Using `orderBy()` to sort data (SQL ORDER BY equivalent)
4. **Handling Duplicates**: Using `distinct()` and `dropDuplicates()`
5. **Column Access Methods**: Understanding different ways to reference columns (string, dot notation, bracket notation, `col()`, `expr()`)

**Key Takeaway**: These are the fundamental operations for working with PySpark DataFrames. `filter()`, `select()`, `orderBy()`, `distinct()`, and `dropDuplicates()` are lazy transformations: they only build a query plan, and nothing runs until an action such as `show()` is called.

**Next Steps**: In Module 4, we'll learn about data transformations including grouping, aggregations, adding/renaming columns, and handling null values.
