# 🎯 Column Expressions: Advanced DataFrame Operations

**Time to complete:** 35 minutes  
**Difficulty:** Intermediate  
**Prerequisites:** DataFrame basics, SQL knowledge

---

## 🎯 Learning Objectives

By the end of this notebook, you will master:
- ✅ **Column operations** - select, withColumn, drop
- ✅ **Column expressions** - arithmetic, string, date operations
- ✅ **Conditional logic** - when/otherwise, case statements
- ✅ **Type casting** - cast, data type conversions
- ✅ **User Defined Functions (UDFs)** - custom column logic
- ✅ **Performance optimization** - efficient column operations

**Column expressions are the heart of DataFrame transformations!**

---

## 🔍 Understanding Column Expressions

**Column expressions** are the building blocks of DataFrame operations. Every transformation you perform on columns creates expressions that Spark optimizes and executes.

### Expression Types:
- **Arithmetic expressions**: `col("price") * 0.8`
- **String expressions**: `upper(col("name"))`
- **Conditional expressions**: `when(col("age") > 18, "Adult").otherwise("Minor")`
- **Type cast expressions**: `col("price").cast("double")`
- **UDF expressions**: Custom functions applied to columns

**All expressions are lazily evaluated and optimized by Catalyst!**

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, lit, expr, udf
from pyspark.sql.types import StringType, DoubleType, IntegerType
import pyspark.sql.functions as F

spark = SparkSession.builder \
    .appName("Column_Expressions") \
    .master("local[*]") \
    .getOrCreate()

print(f"✅ Spark ready - Version: {spark.version}")

# Create sample data
data = [
    (1, "Alice", 25, "Engineering", 75000.0, "2023-01-15"),
    (2, "Bob", 30, "Sales", 65000.0, "2023-02-20"),
    (3, "Charlie", 35, "Engineering", 85000.0, "2023-03-10"),
    (4, "Diana", 28, "HR", 55000.0, "2023-04-05"),
    (5, "Eve", 32, "Sales", 70000.0, "2023-05-12")
]

columns = ["id", "name", "age", "department", "salary", "hire_date"]
df = spark.createDataFrame(data, columns)

print("📊 Sample DataFrame:")
df.show()
df.printSchema()

## 🎯 Basic Column Operations

### Selecting Columns

In [None]:
# Different ways to select columns
print("🎯 COLUMN SELECTION METHODS")
print("=" * 50)

# Method 1: Column names as strings
basic_select = df.select("name", "department", "salary")
print("Basic column selection:")
basic_select.show()

# Method 2: Using col() function
col_select = df.select(col("name"), col("age"), col("salary"))
print("\nUsing col() function:")
col_select.show()

# Method 3: Column expressions
expr_select = df.select(
    "name",
    (col("salary") * 1.1).alias("salary_with_bonus"),
    (col("age") + 1).alias("age_next_year")
)
print("\nWith column expressions:")
expr_select.show()

### Adding and Modifying Columns

In [None]:
# Adding new columns with expressions
print("➕ ADDING NEW COLUMNS")
print("=" * 50)

# withColumn() - add or replace columns
df_enhanced = df \
    .withColumn("annual_bonus", col("salary") * 0.1) \
    .withColumn("total_compensation", col("salary") + col("annual_bonus")) \
    .withColumn("experience_level", 
                when(col("age") < 30, "Junior")
                .when(col("age") < 35, "Mid-level")
                .otherwise("Senior"))

print("Enhanced DataFrame:")
df_enhanced.select("name", "age", "salary", "annual_bonus", "total_compensation", "experience_level").show()

# withColumnRenamed() - rename columns
df_renamed = df.withColumnRenamed("salary", "annual_salary")
print("\nRenamed column:")
df_renamed.select("name", "annual_salary").show()

### Removing Columns

In [None]:
# Removing columns
print("🗑️ REMOVING COLUMNS")
print("=" * 50)

# drop() method
df_reduced = df.drop("hire_date", "id")
print("After dropping columns:")
df_reduced.show()

# Drop columns named in a Python list by unpacking it with *
df_minimal = df.drop(*["id", "hire_date", "department"])
print("\nMinimal DataFrame:")
df_minimal.show()

## 🧮 Arithmetic Expressions

### Basic Arithmetic Operations

In [None]:
# Arithmetic expressions
print("🧮 ARITHMETIC EXPRESSIONS")
print("=" * 50)

# Create sample sales data
sales_data = [
    ("Product_A", 100, 25.50, 0.15),
    ("Product_B", 200, 15.75, 0.20),
    ("Product_C", 150, 30.00, 0.10),
    ("Product_D", 300, 12.50, 0.25)
]

sales_df = spark.createDataFrame(sales_data, ["product", "quantity", "price", "discount_rate"])

# Calculate total revenue with expressions
revenue_df = sales_df.withColumn(
    "total_before_discount", col("quantity") * col("price")
).withColumn(
    "discount_amount", col("total_before_discount") * col("discount_rate")
).withColumn(
    "final_revenue", col("total_before_discount") - col("discount_amount")
).withColumn(
    "avg_price_per_unit", col("final_revenue") / col("quantity")
)

print("Revenue calculations:")
revenue_df.select(
    "product", "quantity", "price", "discount_rate",
    "total_before_discount", "discount_amount", "final_revenue", "avg_price_per_unit"
).show()

# Round results for cleaner display
rounded_df = revenue_df.withColumn("avg_price_per_unit", F.round(col("avg_price_per_unit"), 2))
print("\nRounded results:")
rounded_df.select("product", "avg_price_per_unit").show()

### Mathematical Functions

In [None]:
# Mathematical functions
print("🔢 MATHEMATICAL FUNCTIONS")
print("=" * 50)

# Sample numerical data
nums_df = spark.createDataFrame(
    [(1.5,), (2.7,), (-3.2,), (4.8,), (0.0,)], 
    ["value"]
)

# Apply mathematical functions
math_df = nums_df \
    .withColumn("abs_value", F.abs(col("value"))) \
    .withColumn("ceil_value", F.ceil(col("value"))) \
    .withColumn("floor_value", F.floor(col("value"))) \
    .withColumn("round_value", F.round(col("value"))) \
    .withColumn("sqrt_value", F.sqrt(F.abs(col("value")))) \
    .withColumn("exp_value", F.exp(col("value"))) \
    .withColumn("log_value", F.log(F.abs(col("value")) + 1))  # log(0) is undefined

print("Mathematical transformations:")
math_df.show()

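# Statistical functions on our employee data
from pyspark.sql import Window  # window specs for the .over() calls below

all_rows = Window.partitionBy()        # a single window spanning every row
by_salary = Window.orderBy("salary")   # rows ordered by ascending salary

stats_df = df.withColumn(
    "salary_zscore",
    (col("salary") - F.avg("salary").over(all_rows)) / F.stddev("salary").over(all_rows)
).withColumn(
    "salary_percentile", F.percent_rank().over(by_salary)
)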

print("\nStatistical calculations:")
stats_df.select("name", "salary", "salary_zscore", "salary_percentile").show()
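
To make the two window expressions concrete, here is a plain-Python check of the same arithmetic on the five sample salaries. It assumes `stddev` is the sample standard deviation (Spark documents `stddev` as an alias for `stddev_samp`) and that `percent_rank` is `(rank - 1) / (rows - 1)`. It is a worked illustration, not a replacement for the Spark code.

```python
import statistics

# Salaries from the sample DataFrame, keyed by employee name
salaries = {"Alice": 75000.0, "Bob": 65000.0, "Charlie": 85000.0,
            "Diana": 55000.0, "Eve": 70000.0}
values = list(salaries.values())

mean = statistics.mean(values)
std = statistics.stdev(values)  # sample standard deviation (divides by n - 1)

# z-score: how many standard deviations a salary lies from the mean
zscores = {name: (s - mean) / std for name, s in salaries.items()}

# percent_rank: (rank - 1) / (n - 1) in ascending salary order.
# The sample data has no ties, so the sorted position can serve as rank - 1.
ordered = sorted(salaries, key=salaries.get)
percent_rank = {name: i / (len(ordered) - 1) for i, name in enumerate(ordered)}

for name in salaries:
    print(f"{name:8s} z={zscores[name]:+.3f} pct={percent_rank[name]:.2f}")
```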

## 📝 String Expressions

### String Manipulation Functions

In [None]:
# String expressions
print("📝 STRING EXPRESSIONS")
print("=" * 50)

# Sample string data
text_data = [
    ("john doe", "engineer"),
    ("JANE SMITH", "manager"),
    ("bob johnson", "analyst"),
    ("Alice Cooper", "director")
]

text_df = spark.createDataFrame(text_data, ["name", "title"])

# String transformations
string_df = text_df \
    .withColumn("name_upper", F.upper(col("name"))) \
    .withColumn("name_lower", F.lower(col("name"))) \
    .withColumn("name_initcap", F.initcap(col("name"))) \
    .withColumn("name_length", F.length(col("name"))) \
    .withColumn("first_name", F.split(col("name"), " ")[0]) \
    .withColumn("last_name", F.split(col("name"), " ")[1]) \
    .withColumn("full_title", F.concat(col("first_name"), F.lit(" "), col("last_name"), F.lit(" - "), col("title")))

print("String transformations:")
string_df.select("name", "name_upper", "name_initcap", "first_name", "last_name", "full_title").show(truncate=False)

# String search and replace
search_df = text_df \
    .withColumn("contains_john", F.instr(col("name"), "john")) \
    .withColumn("starts_with_j", F.col("name").startswith("j")) \
    .withColumn("ends_with_n", F.col("name").endswith("n")) \
    .withColumn("replaced", F.regexp_replace(col("name"), "john", "JAKE"))

print("\nString search and replace:")
search_df.select("name", "contains_john", "starts_with_j", "ends_with_n", "replaced").show()

## 🎛️ Conditional Logic

### When/Otherwise Expressions

In [None]:
# Conditional expressions
print("🎛️ CONDITIONAL LOGIC")
print("=" * 50)

# Using when/otherwise for categorization
conditional_df = df \
    .withColumn("salary_category",
                when(col("salary") >= 80000, "High")
                .when(col("salary") >= 60000, "Medium")
                .otherwise("Low")) \
    .withColumn("age_group",
                when(col("age") < 30, "Young")
                .when(col("age") < 35, "Mid-age")
                .otherwise("Experienced")) \
    .withColumn("performance_rating",
                when((col("age") >= 30) & (col("salary") >= 70000), "Excellent")
                .when((col("age") >= 25) | (col("salary") >= 65000), "Good")
                .otherwise("Needs Improvement"))

print("Conditional categorization:")
conditional_df.select("name", "age", "salary", "salary_category", "age_group", "performance_rating").show()

# Complex business logic
business_df = df.withColumn(
    "promotion_eligible",
    when(
        (col("age") >= 30) & 
        (col("department") == "Engineering") & 
        (col("salary") >= 75000),
        "Eligible for Senior Role"
    ).when(
        (col("age") >= 28) & 
        (col("department") == "Sales") & 
        (col("salary") >= 65000),
        "Eligible for Manager Role"
    ).otherwise("Not Eligible")
)

print("\nBusiness logic example:")
business_df.select("name", "department", "age", "salary", "promotion_eligible").show()

## 🔄 Type Casting and Conversion

### Data Type Conversions

In [None]:
# Type casting
print("🔄 TYPE CASTING")
print("=" * 50)

# Create mixed-type data
mixed_data = [
    ("100", "25.5", "1"),
    ("200", "30.7", "0"),
    ("150", "28.2", "1")
]

mixed_df = spark.createDataFrame(mixed_data, ["str_num", "str_float", "str_bool"])
print("Original string data:")
mixed_df.show()
mixed_df.printSchema()

# Type casting
casted_df = mixed_df \
    .withColumn("int_value", col("str_num").cast(IntegerType())) \
    .withColumn("double_value", col("str_float").cast(DoubleType())) \
    .withColumn("bool_value", col("str_bool").cast("boolean")) \
    .withColumn("date_value", F.to_date(F.lit("2023-01-01"))) \
    .withColumn("timestamp_value", F.to_timestamp(F.lit("2023-01-01 12:30:45")))

print("\nAfter type casting:")
casted_df.show()
casted_df.printSchema()

# Safe casting with error handling
safe_cast_df = mixed_df.withColumn(
    "safe_int", 
    when(F.col("str_num").rlike("^\\d+$"), F.col("str_num").cast(IntegerType()))
    .otherwise(F.lit(None))
)

print("\nSafe casting (handles invalid data):")
safe_cast_df.select("str_num", "safe_int").show()

## 🛠️ User Defined Functions (UDFs)

### Creating and Using UDFs

In [None]:
# User Defined Functions
print("🛠️ USER DEFINED FUNCTIONS (UDFS)")
print("=" * 50)

# Define Python functions
def calculate_tax(salary):
    """Calculate tax based on salary brackets"""
    if salary >= 80000:
        return salary * 0.25
    elif salary >= 60000:
        return salary * 0.20
    else:
        return salary * 0.15

def format_currency(amount):
    """Format number as currency string"""
    return f"${amount:,.2f}"

# Register UDFs
tax_udf = udf(calculate_tax, DoubleType())
currency_udf = udf(format_currency, StringType())

# Apply UDFs
udf_df = df \
    .withColumn("tax_amount", tax_udf(col("salary"))) \
    .withColumn("net_salary", col("salary") - col("tax_amount")) \
    .withColumn("salary_formatted", currency_udf(col("salary"))) \
    .withColumn("tax_formatted", currency_udf(col("tax_amount"))) \
    .withColumn("net_formatted", currency_udf(col("net_salary")))

print("UDF applications:")
udf_df.select(
    "name", "salary_formatted", "tax_formatted", "net_formatted"
).show()

# UDF with conditional logic
def performance_category(salary, age):
    """Categorize employee performance"""
    score = (salary / 1000) + age
    if score >= 110:
        return "Outstanding"
    elif score >= 90:
        return "Excellent"
    elif score >= 70:
        return "Good"
    else:
        return "Needs Improvement"

performance_udf = udf(performance_category, StringType())

performance_df = df.withColumn(
    "performance", 
    performance_udf(col("salary"), col("age"))
)

print("\nPerformance categorization with UDF:")
performance_df.select("name", "age", "salary", "performance").show()

### UDF Performance Considerations

In [None]:
# UDF performance comparison
import time

print("⚡ UDF PERFORMANCE CONSIDERATIONS")
print("=" * 50)

# Create larger dataset
large_data = [(i, f"name_{i}", 25000 + (i * 100)) for i in range(1, 10001)]
large_df = spark.createDataFrame(large_data, ["id", "name", "salary"])

print(f"Large dataset: {large_df.count():,} rows")

# Aggregate over the new column so Spark has to compute it. A bare count()
# can let the optimizer prune the unused "tax" column and skip the work.

# Method 1: Built-in functions (fast)
start_time = time.time()
builtin_result = large_df.withColumn("tax", col("salary") * 0.2).agg(F.sum("tax")).collect()
builtin_time = time.time() - start_time

# Method 2: UDF (slower)
def calculate_tax_udf(salary):
    return salary * 0.2

tax_udf_func = udf(calculate_tax_udf, DoubleType())

start_time = time.time()
udf_result = large_df.withColumn("tax", tax_udf_func(col("salary"))).agg(F.sum("tax")).collect()
udf_time = time.time() - start_time

print(f"Built-in functions: {builtin_time:.3f} seconds")
print(f"UDF approach: {udf_time:.3f} seconds")
if builtin_time > 0:
    print(f"UDF is {udf_time/builtin_time:.1f}x slower")

print("\n💡 Performance Tips:")
print("1. Use built-in functions when possible")
print("2. Avoid UDFs for simple arithmetic")
print("3. Consider pandas UDFs for complex operations")
print("4. Test UDF performance on realistic data sizes")

## 🎯 Complex Expressions with expr()

### Using SQL-like Expressions

In [None]:
# Complex expressions using expr()
print("🎯 COMPLEX EXPRESSIONS")
print("=" * 50)

# expr() allows SQL-like syntax
complex_df = df.withColumn(
    "complex_calc",
    expr("salary * 1.1 + (age * 100) - CASE WHEN department = 'Engineering' THEN 5000 ELSE 2000 END")
).withColumn(
    "sql_expression",
    expr("CASE "
         "WHEN salary > 70000 THEN 'High Earner' "
         "WHEN salary > 60000 THEN 'Mid Earner' "
         "ELSE 'Entry Level' END")
)

print("Complex expressions with expr():")
complex_df.select("name", "salary", "age", "department", "complex_calc", "sql_expression").show()

# Combining expr() with regular column operations
hybrid_df = df.withColumn(
    "bonus_calculation",
    expr("salary * (CASE WHEN department = 'Sales' THEN 1.15 WHEN department = 'Engineering' THEN 1.10 ELSE 1.05 END)")
).withColumn(
    "final_package",
    col("bonus_calculation") + col("age") * 100  # Mix expr and col
)

print("\nHybrid expressions:")
hybrid_df.select("name", "department", "salary", "bonus_calculation", "final_package").show()

## 🚨 Common Mistakes and Debugging

In [None]:
# Common mistakes and solutions
print("🚨 COMMON MISTAKES")
print("=" * 50)

# Mistake 1: Column name typos
try:
    bad_column = df.select("nonexistent_column")
    bad_column.show()
except Exception as e:
    print(f"❌ Column name error: {str(e)[:100]}...")

# Correct approach
print("\n✅ Correct column names:")
df.select("name", "salary").show()

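# Mistake 2: Type mismatch in operations
# With ANSI mode off (the default before Spark 4.0) this does not raise:
# Spark casts "name" to double and bad_result is silently null.
# With ANSI mode on, the failed cast raises an error instead.
try:
    bad_math = df.withColumn("bad_result", col("name") + col("salary"))
    bad_math.show()
except Exception as e:
    print(f"\n❌ Type mismatch error: {str(e)[:100]}...")

# Correct approach
print("\n✅ Proper type handling:")
df.withColumn("salary_str", F.concat(col("name"), F.lit(" earns "), col("salary").cast("string"))).show()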

# Mistake 3: Forgetting to handle nulls
null_data = [(1, "Alice", None), (2, "Bob", 30000)]
null_df = spark.createDataFrame(null_data, ["id", "name", "salary"])

# Built-in arithmetic does not raise on null; the null propagates silently
print("\n❌ Without null handling (Alice's bonus is null):")
null_df.withColumn("bonus", col("salary") * 0.1).show()

print("\n✅ With null handling:")
null_df.withColumn("bonus", F.coalesce(col("salary"), F.lit(0)) * 0.1).show()

## 🎯 Key Takeaways

### What You Learned:
- ✅ **`select()`** - Choose specific columns
- ✅ **`withColumn()`** - Add/modify columns with expressions
- ✅ **Arithmetic expressions** - Mathematical operations on columns
- ✅ **String functions** - Text manipulation (upper, lower, concat, etc.)
- ✅ **Conditional logic** - `when/otherwise` for complex conditions
- ✅ **Type casting** - Convert between data types safely
- ✅ **UDFs** - Custom functions (but prefer built-ins when possible)
- ✅ **`expr()`** - SQL-like expressions in DataFrames

### Performance Best Practices:
- 🔸 **Prefer built-in functions** over UDFs for speed
- 🔸 **Use `expr()`** for complex SQL-like logic
- 🔸 **Keep `withColumn()` chains short** - each call adds a projection to the plan, so derive many columns in one `select()` where practical
- 🔸 **Handle nulls explicitly** - built-in expressions propagate null silently, and Python UDFs can raise on `None`
- 🔸 **Check data types** before operations

### Common Patterns:
- 🔸 `col("column")` - Reference columns in expressions
- 🔸 `when(condition, value).otherwise(default)` - Conditional logic
- 🔸 `F.function_name()` - Access PySpark SQL functions
- 🔸 `expr("SQL expression")` - SQL syntax in DataFrames
- 🔸 `udf(function, return_type)` - Register custom functions

---

## 🚀 Next Steps

Now that you have mastered column expressions, you're ready for:

1. **DataFrame Aggregations** - GroupBy and statistical operations
2. **Window Functions** - Advanced analytical operations
3. **Joins** - Combining multiple DataFrames
4. **Complex Data Types** - Arrays, maps, and structs

**Column expressions are fundamental to all DataFrame operations!**

---

**🎉 Congratulations! You now wield the power of DataFrame column expressions like a Spark expert!**