# ⚡ Spark Fundamentals

**Phase 2 (Intermediate) - Module 5 of 6**

**Estimated time**: 90-120 minutes

**Prerequisites**: [04_Testing_with_ScalaTest.ipynb](04_Testing_with_ScalaTest.ipynb)

## 🎯 Learning Goals

- Understand Apache Spark architecture and principles
- Work with RDDs, DataFrames, and Datasets
- Master Spark SQL and DataFrame operations
- Implement distributed transformations and actions
- Handle data partitioning and caching
- Build complete Spark applications

---

## 📋 Table of Contents

1. [Spark Overview](#overview)
2. [RDD Operations](#rdd)
3. [DataFrames & Spark SQL](#dataframe)
4. [Datasets](#datasets)
5. [Data Partitioning](#partitioning)
6. [Caching & Persistence](#caching)
7. [Exercises](#exercises)
8. [What Next](#next)

## 🚀 Why Apache Spark?

**Data Processing Evolution:**

- **Traditional:** Single machine → Memory & storage limits
- **Hadoop MapReduce:** Cluster processing → But slow (disk I/O)
- **Apache Spark:** In-memory processing → **up to 100x faster** than MapReduce on some in-memory workloads

**Spark solves:**
- **Big Data processing** at scale
- **Real-time analytics** and batch processing
- **Complex data workflows** (ML, SQL, streaming)
- **Fault tolerance** with automatic recovery
- **Multi-language support** (Scala, Python, Java, R)

**Key Features:**
- **In-memory computing**: Keep data in RAM
- **Lazy evaluation**: Optimize execution plans
- **DAG execution**: Efficient task scheduling
- **Rich APIs**: Functional programming with collections
- **Unified platform**: Spark SQL, MLlib, Streaming, GraphX

---

## ⚡ Spark Setup

Setting up the Spark environment and context.

In [None]:
// In a real Spark application, add dependencies:
// libraryDependencies += "org.apache.spark" %% "spark-core" % "3.3.2"
// libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.3.2"

// For this notebook, we'll simulate Spark concepts
println("=== Spark Environment Setup ===")
println("In a real project:")
println("1. Add Spark dependencies to build.sbt")
println("2. Configure spark-defaults.conf")
println("3. Set up master URL (local[*], yarn, k8s, etc.)")
println("4. Create SparkSession")

println("\nSpark Architecture:")
println("✓ Driver Program: Main application")
println("✓ Cluster Manager: Allocate resources (YARN, Kubernetes, Mesos)")
println("✓ Worker Nodes: Host executors")
println("✓ Executors: Run tasks on worker nodes")
println()

## 🏗️ RDD Fundamentals

Resilient Distributed Datasets - Spark's core abstraction.

In [None]:
// Simulate RDD concepts (in real Spark, these would be distributed)
println("=== RDD Basics ===")

// Simulate creating RDDs
object SimRDD {
  def makeRDD[T](data: Seq[T]): SimRDD[T] = new SimRDD(data)

  def textFile(path: String): SimRDD[String] = {
    val simulatedData = List(
      "Spark is fast",
      "RDDs are immutable",
      "Transformations are lazy",
      "Actions trigger execution"
    )
    new SimRDD(simulatedData)
  }
}

class SimRDD[T](private val data: Seq[T]) {
  // Transformations (lazy in real Spark; this simulation applies them immediately)
  def map[U](f: T => U): SimRDD[U] = {
    println(s"MAP transformation (lazy): $f")
    new SimRDD(data.map(f))
  }

  def filter(f: T => Boolean): SimRDD[T] = {
    println(s"FILTER transformation (lazy): $f")
    new SimRDD(data.filter(f))
  }

  def flatMap[U](f: T => Seq[U]): SimRDD[U] = {
    println(s"FLATMAP transformation (lazy): $f")
    new SimRDD(data.flatMap(f))
  }

  // Actions (trigger computation)
  def collect(): Seq[T] = {
    println("COLLECT action (executes all transformations)")
    data
  }

  def count(): Int = {
    println("COUNT action (executes transformations)")
    data.size
  }
}

// Demonstrate RDD operations
val numbersRDD = SimRDD.makeRDD(Seq(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))

println("Creating pipeline (lazy - no execution yet):")
val resultRDD = numbersRDD
  .filter(_ % 2 == 0)    // Even numbers only
  .map(_ * 2)            // Double them
  .filter(_ > 5)         // Greater than 5

println("\nAction triggers execution:")
val result = resultRDD.collect()
println(s"Final result: ${result.mkString(", ")}")
println()
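
`SimRDD` applies each transformation as soon as it is called, so it only imitates the vocabulary of lazy evaluation. The sketch below shows one way to actually defer the work in plain Scala: every transformation wraps the previous computation in a new function, and only an action runs the chain. `LazyRDD` and `evaluations` are hypothetical names used for illustration; they are not part of the Spark API.

```scala
// A minimal sketch of deferred execution; LazyRDD is a hypothetical name, not a Spark class.
class LazyRDD[T](compute: () => Seq[T]) {
  // Transformations: build a new recipe on top of the old one, do no work yet
  def map[U](f: T => U): LazyRDD[U] = new LazyRDD(() => compute().map(f))
  def filter(p: T => Boolean): LazyRDD[T] = new LazyRDD(() => compute().filter(p))

  // Action: run the whole chain of recipes now
  def collect(): Seq[T] = compute()
}

// Counter used only to observe when the mapping function actually runs
var evaluations = 0

val lazyPipeline = new LazyRDD(() => Seq(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
  .filter(_ % 2 == 0)                       // Even numbers only
  .map { n => evaluations += 1; n * 2 }     // Double them (and count the calls)
  .filter(_ > 5)                            // Greater than 5

val evaluationsBeforeAction = evaluations   // nothing has executed yet
val lazyResult = lazyPipeline.collect()     // the action runs filter, map, filter
val evaluationsAfterAction = evaluations    // one call per even number
```

The pipeline mirrors the one above, so `lazyResult` holds the same values, but the counter shows that the mapping function is not called until `collect()` runs.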

## 📊 Word Count Example

Classic Spark example showing transformations and actions.

In [None]:
// Word count with simulated RDD
println("=== Word Count Example ===")

val linesRDD = SimRDD.textFile("sample.txt")

// Word count pipeline
val wordPairs = linesRDD
  .flatMap(_.split("\\s+").toSeq)  // Split into words
  .map(_.toLowerCase)              // Normalize case
  .map(word => (word, 1))          // Create pairs

println("Word count pipeline created")

// SimRDD has no groupBy, so collect the pairs and sum per word locally
// (in real Spark this step would be reduceByKey(_ + _))
val counts = wordPairs.collect()
  .groupBy(_._1)                   // Group by word
  .map { case (word, pairs) =>     // Sum counts
    (word, pairs.map(_._2).sum)
  }
  .toSeq
println("\nWord counts:")
counts.sortBy(-_._2).foreach { case (word, count) =>
  println(f"  $word%-10s : $count")
}

println(f"\nTotal unique words: ${counts.size}")
println(f"Total words: ${counts.map(_._2).sum}")
println()
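
In real Spark, the pair-RDD method `reduceByKey` is the usual way to sum counts per word, because it merges values for each key instead of materializing whole groups. The helper below imitates that merge on a local `Seq`. `reduceByKeyLocal` is a hypothetical name, and the sample lines are the ones returned by the simulated `textFile`.

```scala
// Hypothetical local stand-in for pair-RDD reduceByKey: merges values that share a key.
def reduceByKeyLocal[K, V](pairs: Seq[(K, V)])(merge: (V, V) => V): Map[K, V] =
  pairs.foldLeft(Map.empty[K, V]) { case (acc, (key, value)) =>
    // Merge with the running value if the key was seen before, otherwise start from this value
    val merged = acc.get(key).map(previous => merge(previous, value)).getOrElse(value)
    acc.updated(key, merged)
  }

val sampleLines = Seq(
  "Spark is fast",
  "RDDs are immutable",
  "Transformations are lazy",
  "Actions trigger execution"
)

// Same steps as the pipeline above: split, normalize, pair each word with 1
val pairsToReduce = sampleLines
  .flatMap(_.split("\\s+").toSeq)
  .map(word => (word.toLowerCase, 1))

val localCounts = reduceByKeyLocal(pairsToReduce)(_ + _)
```

Only a running total per key is kept while folding, which is the property that lets Spark combine values on each partition before moving data across the network.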

## 📋 DataFrames & Spark SQL

Higher-level abstraction with SQL-like operations.

In [None]:
// Simulated DataFrame operations
println("=== DataFrames & Spark SQL ===")

case class Employee(id: Int, name: String, department: String, salary: Double)

// Simulate Spark DataFrame-like operations
object SimDataFrame {
  def fromData[T](data: Seq[T]): SimDataFrame[T] = new SimDataFrame[T](data)
}

class SimDataFrame[T](private val data: Seq[T]) {
  def filter(f: T => Boolean): SimDataFrame[T] = {
    println(s"FILTER: $f")
    new SimDataFrame(data.filter(f))
  }

  def map[U](f: T => U): SimDataFrame[U] = {
    println(s"MAP: $f")
    new SimDataFrame(data.map(f))
  }

  def select[U](f: T => U): SimDataFrame[U] = map(f)

  def groupBy[K](f: T => K): Map[K, Seq[T]] = {
    println(s"GROUPBY: $f")
    data.groupBy(f)
  }

  def agg[K, V: Numeric](groupByF: T => K)(aggF: Seq[T] => V): Map[K, V] = {
    println(s"AGGREGATE via $groupByF")
    val grouped = data.groupBy(groupByF)
    grouped.map { case (k, vs) => k -> aggF(vs) }
  }

  def collect(): Seq[T] = {
    println("COLLECT action")
    data
  }

  def show(): Unit = {
    println("=== DataFrame Preview ===")
    data.take(5).foreach(println)
    if (data.size > 5) println(s"... (${data.size - 5} more rows)")
  }
}

// Employee data
val employees = Seq(
  Employee(1, "Alice", "Engineering", 75000),
  Employee(2, "Bob", "Engineering", 80000),
  Employee(3, "Charlie", "Sales", 65000),
  Employee(4, "Diana", "Sales", 70000),
  Employee(5, "Eve", "HR", 60000),
  Employee(6, "Frank", "Engineering", 85000)
)

val employeeDF = SimDataFrame.fromData(employees)

// DataFrame operations
employeeDF.show()

println("\n=== Analysis Operations ===")

// Filter
val engineers = employeeDF.filter(_.department == "Engineering")
println(f"\nEngineers: ${engineers.collect().size}")

// Group and aggregate
val deptSalaries = employeeDF.agg(_.department)(_.map(_.salary).sum)
println("\nDepartment salaries:")
deptSalaries.foreach { case (dept, total) =>
  println(f"  $dept%-12s : $$$total%,.0f")
}

// Average salary by department
val deptAvg = employeeDF.agg(_.department) { employees =>
  employees.map(_.salary).sum / employees.size
}
println("\nDepartment average salaries:")
deptAvg.foreach { case (dept, avg) =>
  println(f"  $dept%-12s : $$$avg%,.0f")
}
println()

## 🏛️ Datasets

Type-safe operations with case classes and implicit encoders.

In [None]:
// Datasets - strongly typed API
println("=== Dataset Operations ===")

// Type-safe Dataset simulation
class SimDataset[T: Manifest](private val data: Seq[T]) {
  def filter(f: T => Boolean): SimDataset[T] = {
    new SimDataset(data.filter(f))
  }

  def map[U: Manifest](f: T => U): SimDataset[U] = {
    new SimDataset(data.map(f))
  }

  def flatMap[U: Manifest](f: T => Seq[U]): SimDataset[U] = {
    new SimDataset(data.flatMap(f))
  }

  // Type-safe aggregation
  def groupByKey[K: Manifest](keyFunc: T => K): SimGroupedDataset[K, T] = {
    new SimGroupedDataset(data.groupBy(keyFunc))
  }

  def collect(): Seq[T] = data
  def count(): Long = data.size
}

// K needs a Manifest because the results are wrapped in SimDataset[(K, ...)]
class SimGroupedDataset[K: Manifest, T](private val grouped: Map[K, Seq[T]]) {
  def count(): SimDataset[(K, Long)] = {
    new SimDataset(grouped.map { case (k, v) => (k, v.size.toLong) }.toSeq)
  }

  // U also needs a Manifest so that SimDataset[(K, U)] can be constructed
  def sum[U: Numeric: Manifest](valueFunc: T => U): SimDataset[(K, U)] = {
    new SimDataset(grouped.map { case (k, vs) =>
      (k, vs.map(valueFunc).sum(implicitly[Numeric[U]]))
    }.toSeq)
  }

  def avg[U: Numeric](valueFunc: T => U): SimDataset[(K, Double)] = {
    val num = implicitly[Numeric[U]]
    new SimDataset(grouped.map { case (k, vs) =>
      val values = vs.map(valueFunc)
      // Convert through the Numeric instance; a generic U has no toDouble method of its own
      val sum = num.toDouble(values.sum(num))
      val count = values.size.toDouble
      (k, sum / count)
    }.toSeq)
  }
}

// Using Datasets
val employeeDS = new SimDataset(employees)

println("Dataset Operations:")

// Type-safe filtering
val highPaidDS = employeeDS.filter(_.salary > 70000)
println(s"\nHigh-paid employees: ${highPaidDS.count()}")

// Type-safe aggregation
val deptCount = employeeDS.groupByKey(_.department).count()
println("\nEmployee count by department:")
deptCount.collect().foreach { case (dept, count) =>
  println(s"  $dept: $count")
}

val deptTotalSalary = employeeDS.groupByKey(_.department).sum(_.salary)
println("\nTotal salary by department:")
deptTotalSalary.collect().foreach { case (dept, total) =>
  println(f"  $dept: $$$total%,.0f")
}

println("\nDatasets provide compile-time type safety!")
println()
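
To make the compile-time claim concrete, the sketch below contrasts an untyped row (a `Map[String, Any]`, standing in for a DataFrame row addressed by column name) with a case class. `Staff`, `untypedRows` and the misspelled column `salry` are illustrative assumptions. In real Spark a missing column surfaces as an analysis error when the query is built, not as the exception caught here.

```scala
import scala.util.Try

// Untyped rows: column names are strings, so a typo is only detected when the code runs
val untypedRows: Seq[Map[String, Any]] = Seq(
  Map("name" -> "Alice", "salary" -> 75000.0),
  Map("name" -> "Bob", "salary" -> 80000.0)
)

// "salry" is a deliberate typo: this compiles, then fails at runtime
val untypedAttempt = Try(untypedRows.map(row => row("salry").asInstanceOf[Double]).sum)

// Typed records: the same typo would be rejected by the compiler
case class Staff(name: String, salary: Double)
val typedRows = Seq(Staff("Alice", 75000.0), Staff("Bob", 80000.0))
// typedRows.map(_.salry)   // does not compile: salry is not a member of Staff
val typedTotal = typedRows.map(_.salary).sum
```

`untypedAttempt` ends up as a `Failure`, while the typed version cannot even be written with the wrong field name.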

## 🔀 Data Partitioning

Controlling data distribution for performance optimization.

In [None]:
// Data partitioning concepts
println("=== Data Partitioning ===")

// Partition simulator
class SimPartitioner(private val partitions: Seq[Seq[String]]) {
  def getPartition(key: String): Int = {
    // floorMod keeps the index non-negative even when hashCode is negative
    Math.floorMod(key.hashCode, partitions.size)
  }

  def repartition(newNumPartitions: Int, data: Seq[String]): SimPartitioner = {
    println(s"REPARTITION: ${partitions.size} -> $newNumPartitions partitions")
    val newPartitions = (0 until newNumPartitions).map { i =>
      // With plain %, items whose hashCode is negative would match no partition and be lost
      data.filter(item => Math.floorMod(item.hashCode, newNumPartitions) == i)
    }
    new SimPartitioner(newPartitions)
  }

  def analyze(): Unit = {
    println("Partition Analysis:")
    partitions.zipWithIndex.foreach { case (data, idx) =>
      println(f"  Partition $idx: ${data.size}%3d items")
    }
    val totalItems = partitions.map(_.size).sum
    val avgItems = totalItems.toDouble / partitions.size
    val maxItems = partitions.map(_.size).max
    val minItems = partitions.map(_.size).min
    println(f"  Total items: $totalItems")
    println(f"  Average per partition: $avgItems%.1f")
    println(f"  Data skew: ${maxItems - minItems} (max $maxItems, min $minItems)")
  }
}

// Create some test data
val testData = Seq("apple", "banana", "cherry", "date",
                   "apple", "banana", "cherry", "date",
                   "fig", "grape", "honeydew", "kiwi")

// Initial partitioning (floorMod avoids dropping items whose hashCode is negative)
val initialPartitions = (0 until 3).map { i =>
  testData.filter(item => Math.floorMod(item.hashCode, 3) == i)
}

val partitioner = new SimPartitioner(initialPartitions)
println("Initial partitioning:")
partitioner.analyze()

println("\nRepartitioning to 4 partitions:")
val repartitioned = partitioner.repartition(4, testData)
repartitioned.analyze()

println("\n=== Partitioning Strategies ===")
println("✓ Hash Partitioning: non-negative key.hashCode mod numPartitions")
println("✓ Range Partitioning: sorted ranges of keys")
println("✓ Custom Partitioning: business logic based")
println()
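
Hash partitioning is simulated above; range partitioning can be sketched in the same spirit. The functions below pick split points from the sorted distinct keys and send each key to the first range whose upper bound is not smaller than the key. `rangeBounds` and `rangePartition` are hypothetical helpers. Spark's own `RangePartitioner` samples the data to choose its bounds rather than sorting every key.

```scala
// Hypothetical helper: choose (numPartitions - 1) upper bounds from the sorted distinct keys
def rangeBounds(keys: Seq[String], numPartitions: Int): Seq[String] = {
  require(keys.nonEmpty && numPartitions > 0, "need at least one key and one partition")
  val sorted = keys.distinct.sorted
  (1 until numPartitions).map { i =>
    // Last key of the i-th roughly equal-sized slice of the sorted keys
    val idx = math.max(0, i * sorted.size / numPartitions - 1)
    sorted(idx)
  }
}

// Hypothetical helper: index of the first bound >= key; larger keys go to the last partition
def rangePartition(key: String, bounds: Seq[String]): Int = {
  val idx = bounds.indexWhere(bound => key.compareTo(bound) <= 0)
  if (idx >= 0) idx else bounds.size
}

val fruitKeys = Seq("apple", "banana", "cherry", "date",
                    "apple", "banana", "cherry", "date",
                    "fig", "grape", "honeydew", "kiwi")

val bounds = rangeBounds(fruitKeys, 3)
val byRange = fruitKeys.groupBy(key => rangePartition(key, bounds))
```

Because neighbouring keys land in the same partition, range partitioning suits sorted output and range queries, whereas hash partitioning scatters neighbouring keys.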

## 🏆 Exercises

### Exercise 1: RDD Operations

Implement common RDD operations and transformations.

In [None]:
// Exercise 1: RDD Operations
// Reference solution shown below; try implementing each method yourself first

class EnhancedRDD[T](data: Seq[T]) extends SimRDD[T](data) {
  def distinct: EnhancedRDD[T] = {
    new EnhancedRDD(data.distinct)
  }

  def union(other: EnhancedRDD[T]): EnhancedRDD[T] = {
    new EnhancedRDD(data ++ other.collect())
  }

  def intersection(other: EnhancedRDD[T]): EnhancedRDD[T] = {
    new EnhancedRDD(data.intersect(other.collect()))
  }

  def sortBy[U: Ordering](f: T => U): EnhancedRDD[T] = {
    new EnhancedRDD(data.sortBy(f))
  }

  def take(n: Int): Seq[T] = data.take(n)

  def first(): Option[T] = data.headOption

  def zipWithIndex: EnhancedRDD[(T, Long)] = {
    new EnhancedRDD(data.zipWithIndex.map { case (t, i) => (t, i.toLong) })
  }
}

// Test RDD operations
println("RDD Operations Exercise:")
println("=" * 25)

val numbers = new EnhancedRDD(Seq(3, 1, 4, 1, 5, 9, 2, 6))
val moreNumbers = new EnhancedRDD(Seq(1, 2, 7, 8))

println("Original: " + numbers.collect().mkString(", "))
println("Distinct: " + numbers.distinct.collect().mkString(", "))
println("Union: " + numbers.union(moreNumbers).collect().mkString(", "))
println("Intersection: " + numbers.intersection(moreNumbers).collect().mkString(", "))
println("Sorted: " + numbers.sortBy(identity).collect().mkString(", "))
println("First 3: " + numbers.take(3).mkString(", "))
println("First: " + numbers.first().getOrElse("empty"))

val withIndices = numbers.zipWithIndex.collect()
println("With indices: " + withIndices.map{case (n,i)=>s"$n->$i"}.mkString(", "))

println()

// Word processing operations
val words = new EnhancedRDD(Seq("hello", "world", "hello", "spark", "big", "data"))
val moreWords = new EnhancedRDD(Seq("hello", "distributed", "computing"))

println("Word analysis:")
println("Words: " + words.collect().mkString(", "))
println("Distinct words: " + words.distinct.collect().mkString(", "))
println("Combined unique: " + words.union(moreWords).distinct.collect().sorted.mkString(", "))
println("Common words: " + words.intersection(moreWords).collect().mkString(", "))
println("Alphabetical: " + words.sortBy(identity).collect().mkString(", "))
println()

### Exercise 2: DataFrame Analytics

Perform data analysis operations on employee dataset.

In [None]:
// Exercise 2: DataFrame Analytics
// Reference solution shown below; try writing each analysis yourself first

val analyticsDF = SimDataFrame.fromData(employees)

println("DataFrame Analytics Exercise:")
println("=" * 32)

// Basic statistics
val totalEmployees = analyticsDF.collect().size
val totalSalary = analyticsDF.collect().map(_.salary).sum
val avgSalary = totalSalary / totalEmployees
val highestSalary = analyticsDF.collect().map(_.salary).max
val lowestSalary = analyticsDF.collect().map(_.salary).min

println("Overall Statistics:")
println(f"  Total Employees: $totalEmployees")
println(f"  Total Salary Budget: $$$totalSalary%,.0f")
println(f"  Average Salary: $$$avgSalary%.0f")
println(f"  Salary Range: $$$lowestSalary%,.0f - $$$highestSalary%,.0f")
println()

// Department statistics
// agg requires a Numeric result, so group first and build the tuple per department
val deptStats = analyticsDF
  .groupBy(_.department)
  .map { case (dept, emps) =>
    val salaries = emps.map(_.salary)
    (dept, (emps.size, salaries.sum, salaries.max, salaries.min, salaries.sum / emps.size))
  }

println("Department Analytics:")
deptStats.foreach { case (dept, (count, total, max, min, avg)) =>
  // Format flags come before the width: %,7.0f rather than %7,.0f
  println(f"  $dept%-12s: $count%2d employees, total $$$total%,7.0f, avg $$$avg%6.0f")
  println(f"                   Salary range: $$$min%,.0f - $$$max%,.0f")
}

println()
println("Highest paid by department:")
deptStats.keys.foreach { dept =>
  val deptEmps = analyticsDF.collect().filter(_.department == dept)
  val highest = deptEmps.maxBy(_.salary)
  println(f"  $dept%-12s: ${highest.name} ($$${highest.salary}%,.0f)")
}

println()
println("Spark DataFrames enable complex analytics with simple APIs!")
println()

## 📝 What Next?

🎉 **Congratulations!** You've mastered Spark Fundamentals!

**You've learned:**
- Apache Spark architecture and principles
- RDDs: Spark's core abstraction
- DataFrames: Higher-level operations
- Datasets: Type-safe operations
- Data partitioning for performance
- Distributed computing concepts

**Key Concepts:**
- **RDD**: Immutable, distributed collections
- **Transformations**: Lazy operations
- **Actions**: Trigger execution
- **DataFrames**: SQL-like operations
- **Datasets**: Type-safe DataFrames
- **Partitioning**: Data distribution strategy

**Next Steps:**
1. Complete exercises - experiment with all APIs
2. Move to **06: Macros & Metaprogramming**
3. Set up real Spark cluster for hands-on experience
4. Explore Spark Streaming and MLlib

**Production Tips:**
- Monitor partition balance (data skew hurts performance)
- Cache frequently used data
- Use appropriate storage levels
- Tune memory allocation
- Monitor job execution in Spark UI
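
The caching tip can be illustrated without a cluster. In Spark, an uncached RDD or DataFrame is recomputed from its lineage for every action, while `cache()` or `persist()` keeps the computed data for reuse. The sketch below mimics that behaviour with a memoized computation. `CacheableRDD`, `expensiveSource` and `loads` are hypothetical names, not Spark API.

```scala
// Hypothetical stand-in for an RDD whose lineage is re-run on every action unless cached
class CacheableRDD[T](compute: () => Seq[T]) {
  private var stored: Option[Seq[T]] = None
  private var cachingEnabled = false

  // Like Spark's cache(): only marks the data for reuse; nothing is computed yet
  def cache(): this.type = { cachingEnabled = true; this }

  private def materialize(): Seq[T] = stored.getOrElse {
    val data = compute()
    if (cachingEnabled) stored = Some(data)   // keep the result for later actions
    data
  }

  // Actions
  def count(): Long = materialize().size.toLong
  def collect(): Seq[T] = materialize()
}

var loads = 0   // counts how often the expensive source is evaluated
def expensiveSource(): Seq[Int] = { loads += 1; (1 to 1000).map(_ * 2) }

val uncached = new CacheableRDD(() => expensiveSource())
uncached.count(); uncached.collect()
val loadsWithoutCache = loads          // each action recomputed the source

loads = 0
val cachedRdd = new CacheableRDD(() => expensiveSource()).cache()
cachedRdd.count(); cachedRdd.collect()
val loadsWithCache = loads             // the second action reused the stored data
```

Caching pays off only when the same data feeds more than one action; for a single pass it just costs memory.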

**Real-world Spark projects:**
- ETL pipelines
- Real-time analytics
- Machine learning at scale
- Graph processing

---

*"Spark is the Tesla of Big Data: faster, smarter, and more beautiful."*