# 📊 DataFrame Operations & Transformations

**Phase 3: DataFrame Mastery - Structured Data Processing**

**Prerequisites**: [02_rdd_mastery/RDD_Transformations_Practice.ipynb](../02_rdd_mastery/RDD_Transformations_Practice.ipynb)

**Estimated time**: 60-75 minutes

---

## 🎯 Learning Goals

By the end of this notebook, you'll be able to:
- ✅ Create DataFrames from various data sources (CSV, JSON, RDDs)
- ✅ Perform column operations and expressions
- ✅ Apply filtering, sorting, and aggregation operations
- ✅ Handle missing data and data types
- ✅ Use window functions for advanced analytics
- ✅ Optimize DataFrame operations for performance

---

## ⚙️ Setup & Data Sources

**Initialize Spark and explore different DataFrame creation methods.**

In [None]:
// Import Spark libraries
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

In [None]:
// Create SparkSession
val spark = SparkSession.builder()
  .appName("DataFrame Operations Practice")
  .master("local[*]")
  .getOrCreate()

// Import implicits only after `spark` exists; this enables toDF() and the $"column" syntax
import spark.implicits._

println("🚀 DataFrame Operations Practice Session Started")
println(s"Spark Version: ${spark.version}")
println(s"Scala Version: ${scala.util.Properties.versionString}")

In [None]:
// Create sample datasets
val employees = Seq(
  (1, "Alice", "Engineering", 75000, "2020-01-15"),
  (2, "Bob", "Engineering", 80000, "2019-03-22"),
  (3, "Charlie", "Sales", 65000, "2021-07-10"),
  (4, "Diana", "Engineering", 90000, "2018-11-05"),
  (5, "Eve", "Marketing", 70000, "2020-09-18"),
  (6, "Frank", "Sales", 72000, "2019-12-08"),
  (7, "Grace", "Engineering", 85000, "2021-02-28"),
  (8, "Henry", "Marketing", 68000, "2020-06-14")
)

println("üìä Sample datasets created for practice")

## 🏗️ DataFrame Creation

**Multiple ways to create DataFrames from different data sources.**

In [None]:
// Method 1: From Seq with toDF()
println("üèóÔ∏è DataFrame Creation Methods:")

val employeesDF = employees.toDF("id", "name", "department", "salary", "hire_date")
println("\n1. Created from Seq with toDF():")
employeesDF.show(5)
employeesDF.printSchema()

In [None]:
// Method 2: From RDD
val employeesRDD = spark.sparkContext.parallelize(employees)
val employeesFromRDD = spark.createDataFrame(employeesRDD)
  .toDF("id", "name", "department", "salary", "hire_date")

println("\n2. Created from RDD:")
employeesFromRDD.show(3)
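
The learning goals also mention file-based sources such as CSV. The cell below is a minimal sketch of that path, assuming a throwaway temporary directory (the `employees_csv` prefix and the names `employeeSchema`, `fromSchemaDF`, and `fromCsvDF` are illustrative, not part of the original notebook). It builds a DataFrame from `Row` objects with an explicit `StructType`, writes it out as CSV, and reads it back with the same schema so that no type inference is needed. It is self-contained, so it calls `getOrCreate()`, which returns the already running session if there is one.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder()
  .appName("Explicit Schema and CSV Sketch")
  .master("local[*]")
  .getOrCreate()

// Explicit schema: column names and types are stated up front
val employeeSchema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType),
  StructField("department", StringType),
  StructField("salary", IntegerType),
  StructField("hire_date", StringType)
))

// Two rows copied from the sample data above
val rows = Seq(
  Row(1, "Alice", "Engineering", 75000, "2020-01-15"),
  Row(3, "Charlie", "Sales", 65000, "2021-07-10")
)

// Method 3 (sketch): RDD[Row] plus an explicit schema
val fromSchemaDF = spark.createDataFrame(spark.sparkContext.parallelize(rows), employeeSchema)

// Write to a fresh path inside a temporary directory (illustrative location)
val csvPath = Files.createTempDirectory("employees_csv").resolve("data").toString
fromSchemaDF.write.option("header", "true").csv(csvPath)

// Method 4 (sketch): read the CSV back, supplying the schema instead of inferring it
val fromCsvDF = spark.read
  .option("header", "true")
  .schema(employeeSchema)
  .csv(csvPath)

fromCsvDF.printSchema()
fromCsvDF.orderBy("id").show()
```

Supplying the schema avoids the extra pass over the file that `inferSchema` would need and keeps `salary` an integer rather than a string.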

## 📊 Column Operations & Expressions

**Working with DataFrame columns using expressions and functions.**

In [None]:
// Column selection and basic operations
println("üìä Column Operations:")

val basicOps = employeesDF.select(
  $"name",
  $"department",
  $"salary",
  ($"salary" * 1.1).as("salary_with_bonus"),
  year(to_date($"hire_date")).as("hire_year")
)

println("Basic column operations:")
basicOps.show(5)

In [None]:
// String operations
println("\nüî§ String Operations:")
val stringOps = employeesDF.select(
  $"name",
  length($"name").as("name_length"),
  upper($"name").as("name_upper"),
  lower($"department").as("dept_lower"),
  concat($"name", lit(" - "), $"department").as("name_dept")
)

stringOps.show(5)
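
Column expressions can also be conditional. As a small sketch, the cell below derives a `salary_band` column with `when` / `otherwise` and `withColumn`. The band thresholds (85000 and 70000), the `staff` DataFrame, and the `salary_band` name are assumptions made for illustration only; the three rows are taken from the sample data.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .appName("Conditional Column Sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Small subset of the sample employees
val staff = Seq(
  ("Alice", 75000),
  ("Charlie", 65000),
  ("Diana", 90000)
).toDF("name", "salary")

// Conditions are checked in order; the first match wins.
// Thresholds are hypothetical and chosen only for this example.
val withBand = staff.withColumn(
  "salary_band",
  when($"salary" >= 85000, "high")
    .when($"salary" >= 70000, "mid")
    .otherwise("low")
)

withBand.show()
```

`withColumn` keeps all existing columns and appends the new one, whereas `select` returns only the columns you list.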

## 🔍 Filtering, Sorting & Selection

**Powerful operations for data filtering and ordering.**

In [None]:
// Filtering operations
println("üîç Filtering Operations:")

// Simple filters
val engineers = employeesDF.filter($"department" === "Engineering")
println("Engineering employees:")
engineers.show()

val highEarners = employeesDF.filter($"salary" > 75000)
println("\nHigh earners (>75k):")
highEarners.show()
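
Simple filters like the ones above can be combined. The following sketch, which assumes a small hypothetical `staff` DataFrame built from four of the sample rows, shows three common forms: a compound condition with `&&`, set membership with `isin`, and a SQL-style string predicate passed to `where`.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Compound Filter Sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val staff = Seq(
  ("Alice", "Engineering", 75000),
  ("Bob", "Engineering", 80000),
  ("Charlie", "Sales", 65000),
  ("Eve", "Marketing", 70000)
).toDF("name", "department", "salary")

// Both conditions must hold (logical AND)
val wellPaidEngineers = staff.filter($"department" === "Engineering" && $"salary" > 75000)

// Membership test against a list of values
val salesOrMarketing = staff.filter($"department".isin("Sales", "Marketing"))

// The same kind of predicate written as a SQL expression string
val notEngineering = staff.where("department <> 'Engineering'")

wellPaidEngineers.show()
salesOrMarketing.show()
notEngineering.show()
```

`filter` and `where` are aliases, so the choice between them is a matter of style.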

In [None]:
// Sorting operations
println("\nüîÑ Sorting Operations:")

// Sort by single column
val sortedBySalary = employeesDF.orderBy($"salary".desc)
println("Sorted by salary (descending):")
sortedBySalary.select("name", "salary").show(5)
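
Sorting by more than one column follows the same pattern. This sketch assumes a hypothetical four-row `staff` DataFrame drawn from the sample data and orders it by department ascending, then by salary descending within each department.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Multi-Column Sort Sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val staff = Seq(
  ("Alice", "Engineering", 75000),
  ("Bob", "Engineering", 80000),
  ("Charlie", "Sales", 65000),
  ("Frank", "Sales", 72000)
).toDF("name", "department", "salary")

// Primary key: department (A to Z); tie-breaker: salary (highest first)
val sortedByDeptThenSalary = staff.orderBy($"department".asc, $"salary".desc)

sortedByDeptThenSalary.show()
```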

## 📈 Aggregation Operations

**Group and aggregate data for analytics and reporting.**

In [None]:
// Grouped aggregations
println("üìä Grouped Aggregations:")

val deptStats = employeesDF.groupBy("department")
  .agg(
    count("id").as("employee_count"),
    sum("salary").as("total_salary"),
    avg("salary").as("avg_salary"),
    max("salary").as("max_salary"),
    min("salary").as("min_salary")
  )
  .orderBy("department")

println("Department statistics:")
deptStats.show()
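
`groupBy` collapses each department to a single row. The learning goals also list window functions, which compute per-group values while keeping every input row. The cell below is a rough sketch of that idea, assuming a hypothetical five-row `staff` DataFrame taken from the sample data; the column names `salary_rank` and `dept_avg_salary` are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .appName("Window Function Sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val staff = Seq(
  ("Alice", "Engineering", 75000),
  ("Bob", "Engineering", 80000),
  ("Diana", "Engineering", 90000),
  ("Charlie", "Sales", 65000),
  ("Frank", "Sales", 72000)
).toDF("name", "department", "salary")

// Ordered window: rows of one department, highest salary first
val byDeptSalary = Window.partitionBy("department").orderBy($"salary".desc)
// Unordered window: all rows of one department
val byDept = Window.partitionBy("department")

val ranked = staff
  .withColumn("salary_rank", rank().over(byDeptSalary))   // position within the department
  .withColumn("dept_avg_salary", avg("salary").over(byDept)) // department average on every row

ranked.orderBy("department", "salary_rank").show()
```

Unlike the grouped aggregation, the result still has one row per employee, so each salary can be compared directly with its department average.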

## 🏆 Practice Exercises

**Apply DataFrame operations to solve real-world problems.**

In [None]:
// Exercise 1: Employee Analytics
// FIXME: Implement employee analytics
// 1. Department with highest average salary
// 2. Employees hired in 2021 or later
// 3. Salary distribution by department
// 4. Top 3 highest paid employees
// 5. Department with most employees

println("💼 Employee Analytics Exercise:")

// 1. Department with highest average salary
// The explicit type lets this cell compile while ??? is still in place.
// Until you replace ???, running the cell throws NotImplementedError.
val deptAvgSalary: org.apache.spark.sql.DataFrame = ??? // Hint: groupBy + agg + orderBy
println("\n1. Department with highest average salary:")
deptAvgSalary.show(1)

## 🛑 Cleanup

**Always stop your SparkSession to free resources!**

In [None]:
// Stop SparkSession
spark.stop()
println("üõë Spark Session Stopped")
println("‚úÖ All resources cleaned up!")

## 📚 What Next?

**🎉 Congratulations!** You've mastered DataFrame operations!

**You've learned:**
- ✅ Creating DataFrames from various sources
- ✅ Column operations and expressions
- ✅ Filtering, sorting, and aggregation
- ✅ Join operations between DataFrames

**Next Steps:**
1. Complete all exercises with your own implementations
2. Move to **04_spark_sql/** for SQL-based processing
3. Explore **05_performance_optimization/** for tuning

**Remember:** DataFrames make Spark accessible to SQL users! ⚡