# 🎯 Spark Coding Interview Questions

**Master the Most Common Spark Coding Interview Questions**

**Prerequisites**: All previous modules (01-05)

**Estimated time**: 90-120 minutes

---

## 🎯 Interview Question Categories

This notebook covers the most frequently asked Spark coding questions:
- ✅ **RDD Operations**: Transformations, actions, key-value operations
- ✅ **DataFrame Operations**: Filtering, aggregations, joins, window functions
- ✅ **Performance Optimization**: Caching, partitioning, shuffle minimization
- ✅ **SQL Queries**: Complex queries, window functions, CTEs
- ✅ **Real-World Scenarios**: ETL pipelines, data processing patterns

---

## ⚙️ Setup

**Initialize Spark for interview practice.**

In [None]:
// Import Spark libraries
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window // needed for the window-function questions
import org.apache.spark.rdd.RDD

In [None]:
// Create SparkSession
val spark = SparkSession.builder()
  .appName("Spark Interview Questions")
  .master("local[*]")
  .getOrCreate()

// Import implicits only after `spark` exists; toDF and the $"col" syntax depend on them
import spark.implicits._

val sc = spark.sparkContext

println("🚀 Spark Interview Practice Session Started")
println(s"Spark Version: ${spark.version}")

## 🔢 Question 1: Word Count (Classic Interview Question)

**Implement word count with multiple optimizations.**

**Difficulty**: ⭐⭐⭐
**Frequency**: Very High

**Requirements**:
- Count word frequencies in a text
- Handle case sensitivity
- Remove punctuation
- Return top N words

In [None]:
// Sample text data
val textData = Seq(
  "Spark is a powerful big data framework!",
  "Scala and Spark work great together.",
  "Big data processing with Spark is efficient.",
  "Learning Spark and Scala is rewarding!",
  "Spark provides excellent performance."
)

// FIXME: Implement optimized word count
def wordCount(texts: Seq[String], topN: Int = 5): Array[(String, Int)] = {
  // Your implementation here
  // 1. Create RDD from texts
  // 2. Split into words (remove punctuation)
  // 3. Convert to lowercase
  // 4. Count frequencies
  // 5. Return top N words
  
  ??? // Replace with your solution
}

// Test your implementation
val result = wordCount(textData, 3)
println("Top 3 words:")
result.foreach { case (word, count) =>
  println(f"  $word%-10s: $count")
}

// Expected: spark: 5, is: 3, then one of the words tied at 2 (and, big, data, scala)
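
One possible reference solution is sketched below. It is a sketch under assumptions, not the only acceptable answer: the helper name `wordCountSketch` and the sample sentences are hypothetical, and the alphabetical tie-break is a choice made here so the result is deterministic.

```scala
import org.apache.spark.sql.SparkSession

// getOrCreate() reuses the notebook's session when one is already running.
val spark = SparkSession.builder()
  .appName("WordCountSketch")
  .master("local[*]")
  .getOrCreate()

// Hypothetical helper name, chosen so it does not clash with the exercise's wordCount.
def wordCountSketch(texts: Seq[String], topN: Int = 5): Array[(String, Int)] =
  spark.sparkContext
    .parallelize(texts)
    // Lowercase first, then split on every run of characters that is not a letter or digit.
    .flatMap(_.toLowerCase.split("[^a-z0-9]+"))
    .filter(_.nonEmpty)
    .map(word => (word, 1))
    // reduceByKey combines counts inside each partition before shuffling.
    .reduceByKey(_ + _)
    // Highest count first; ties are broken alphabetically.
    .takeOrdered(topN)(Ordering.by[(String, Int), (Int, String)] { case (word, count) => (-count, word) })

// Hypothetical sample input, not the notebook's textData.
val sampleLines = Seq("Spark is fast!", "Spark and Scala.", "Scala is fun; Spark is fun.")
val topTwo = wordCountSketch(sampleLines, topN = 2)
topTwo.foreach { case (word, count) => println(s"$word: $count") }
```

`takeOrdered` avoids sorting the full result when only the top N entries are needed, which is usually worth mentioning in an interview.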

## 📊 Question 2: Employee Salary Analysis

**Analyze employee data using DataFrame operations.**

**Difficulty**: ⭐⭐⭐⭐
**Frequency**: High

**Requirements**:
- Find highest paid employee per department
- Calculate department statistics
- Identify employees above average salary
- Show salary distribution

In [None]:
// Employee data
val employees = Seq(
  (1, "Alice", "Engineering", 75000, "2020-01-15"),
  (2, "Bob", "Engineering", 80000, "2019-03-22"),
  (3, "Charlie", "Sales", 65000, "2021-07-10"),
  (4, "Diana", "Engineering", 90000, "2018-11-05"),
  (5, "Eve", "Marketing", 70000, "2020-09-18"),
  (6, "Frank", "Sales", 72000, "2019-12-08"),
  (7, "Grace", "Engineering", 85000, "2021-02-28"),
  (8, "Henry", "Marketing", 68000, "2020-06-14")
)

val employeesDF = employees.toDF("id", "name", "dept", "salary", "hire_date")

// FIXME: Implement salary analysis
// Type annotations keep the placeholders compiling until you replace ???
import org.apache.spark.sql.DataFrame

// 1. Highest paid employee per department
println("1. Highest paid employee per department:")
val highestPaid: DataFrame = ??? // Window function with row_number
highestPaid.show()

// 2. Department statistics
println("\n2. Department statistics:")
val deptStats: DataFrame = ??? // Group by with aggregations
deptStats.show()

// 3. Employees above department average
println("\n3. Employees above department average:")
val aboveAvg: DataFrame = ??? // Join with subquery or window function
aboveAvg.show()

// 4. Salary distribution
println("\n4. Salary distribution:")
val salaryDist: DataFrame = ??? // Bucket salaries into ranges
salaryDist.show()
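
The four parts above could be approached roughly as follows. This is a sketch, not a definitive answer: the `staff` DataFrame is a reduced, hypothetical sample, and the names ending in `Sketch`, the output column names, and the salary band boundaries are all assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("SalarySketch").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical reduced sample with the same leading columns as the exercise data.
val staff = Seq(
  (1, "Alice", "Engineering", 75000),
  (2, "Bob", "Engineering", 80000),
  (3, "Charlie", "Sales", 65000),
  (4, "Frank", "Sales", 72000)
).toDF("id", "name", "dept", "salary")

// 1. Highest paid per department: number rows within each dept by salary, keep the first.
val byDeptSalaryDesc = Window.partitionBy("dept").orderBy(col("salary").desc)
val highestPaidSketch = staff
  .withColumn("rn", row_number().over(byDeptSalaryDesc))
  .filter(col("rn") === 1)
  .drop("rn")

// 2. Department statistics in a single groupBy.
val deptStatsSketch = staff.groupBy("dept").agg(
  count("*").as("headcount"),
  avg("salary").as("avg_salary"),
  min("salary").as("min_salary"),
  max("salary").as("max_salary"))

// 3. Above department average: an unordered window gives each row its department's average.
val aboveAvgSketch = staff
  .withColumn("dept_avg", avg("salary").over(Window.partitionBy("dept")))
  .filter(col("salary") > col("dept_avg"))

// 4. Salary distribution: bucket with when/otherwise (band edges are arbitrary here).
val salaryDistSketch = staff
  .withColumn("band",
    when(col("salary") < 70000, "under 70k")
      .when(col("salary") < 80000, "70k-79k")
      .otherwise("80k and up"))
  .groupBy("band")
  .count()

highestPaidSketch.show()
aboveAvgSketch.show()
```

Note that `row_number` returns exactly one row per department even when salaries tie; `rank` or `dense_rank` would keep all tied rows, which is a common follow-up question.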

## 🔑 Question 3: Key-Value RDD Operations

**Implement complex key-value transformations.**

**Difficulty**: ⭐⭐⭐⭐
**Frequency**: High

**Requirements**:
- Group records by key
- Calculate aggregations per group
- Find top N per group
- Join with another dataset

In [None]:
// Sales data: (customer_id, product, amount, date)
val salesData = Seq(
  (1, "Laptop", 1200.0, "2023-01"),
  (1, "Mouse", 25.0, "2023-01"),
  (2, "Laptop", 1200.0, "2023-01"),
  (1, "Keyboard", 75.0, "2023-02"),
  (2, "Monitor", 300.0, "2023-02"),
  (3, "Laptop", 1200.0, "2023-02"),
  (2, "Mouse", 25.0, "2023-03"),
  (3, "Keyboard", 75.0, "2023-03")
)

// FIXME: Implement key-value operations
// Type annotations keep the placeholders compiling until you replace ???
val salesRDD = sc.parallelize(salesData)

// 1. Total sales per customer
println("1. Total sales per customer:")
val customerTotals: RDD[(Int, Double)] = ??? // reduceByKey
customerTotals.collect().foreach(println)

// 2. Customer purchase history
println("\n2. Customer purchase history:")
val customerHistory: RDD[(Int, Iterable[(String, Double, String)])] = ??? // groupByKey
customerHistory.take(2).foreach { case (cust, purchases) =>
  println(s"Customer $cust: ${purchases.mkString(", ")}")
}

// 3. Top product per customer by amount
println("\n3. Top product per customer:")
val topProducts: RDD[(Int, (String, Double))] = ??? // Complex transformation
topProducts.collect().foreach(println)
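
A minimal sketch of these key-value operations follows. The `purchases` sample, the `customerNames` lookup used for the join, and every name ending in `Sketch` are hypothetical; the exercise data has no customer-name dataset, so the join target is an assumption made purely for illustration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("KeyValueSketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Hypothetical sample: (customer_id, product, amount, month)
val purchases = sc.parallelize(Seq(
  (1, "Laptop", 1200.0, "2023-01"),
  (1, "Mouse", 25.0, "2023-01"),
  (2, "Monitor", 300.0, "2023-02"),
  (2, "Mouse", 25.0, "2023-03")
))

// Key every record by customer so the pair-RDD operations become available.
val byCustomer = purchases.map { case (cust, product, amount, _) => (cust, (product, amount)) }

// 1. Totals: reduceByKey sums within each partition before shuffling.
val totalsSketch = byCustomer.mapValues(_._2).reduceByKey(_ + _)

// 2. History: groupByKey is reasonable here because every record is needed downstream.
val historySketch = byCustomer.groupByKey().mapValues(_.toList)

// 3. Top product per customer: keep the larger of two purchases, no grouping required.
val topProductSketch = byCustomer.reduceByKey((a, b) => if (a._2 >= b._2) a else b)

// 4. Join with another dataset (hypothetical customer names).
val customerNames = sc.parallelize(Seq((1, "Alice"), (2, "Bob")))
val namedTotalsSketch = totalsSketch.join(customerNames) // (id, (total, name))

namedTotalsSketch.collect().foreach(println)
```

For a top N larger than one, `aggregateByKey` with a bounded collection per key would follow the same idea while still avoiding a full `groupByKey`.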

## ⚡ Question 4: Performance Optimization

**Optimize a slow Spark job.**

**Difficulty**: ⭐⭐⭐⭐⭐
**Frequency**: Very High

**Scenario**: A Spark job is running slowly. Identify and fix performance issues.

In [None]:
// Simulate a slow job
val largeData = sc.parallelize(1 to 100000).map(i => (s"Group${i % 100}", i))

// FIXME: Optimize this slow implementation
println("Original slow implementation:")
val start1 = System.nanoTime()

// Inefficient: groupByKey + mapValues
val result1 = largeData
  .groupByKey()
  .mapValues(_.sum)
  .filter(_._2 > 5000000)
  .collect()

val time1 = (System.nanoTime() - start1) / 1e6
println(f"Original approach: $time1%.2f ms, Results: ${result1.length}")

// FIXME: Implement optimized version
println("\nOptimized implementation:")
val start2 = System.nanoTime()

// Optimized: reduceByKey + coalesce
// The type annotation keeps result2.length compiling until you replace ???
val result2: Array[(String, Int)] = ??? // Your optimized implementation
  // Hint: Use reduceByKey, coalesce, cache strategically

val time2 = (System.nanoTime() - start2) / 1e6
println(f"Optimized approach: $time2%.2f ms, Results: ${result2.length}")

println(f"\nPerformance improvement: ${time1/time2}%.1fx faster")

// Explain your optimizations:
println("\n💡 Optimizations applied:")
println("   1. replace groupByKey with reduceByKey")
println("   2. use coalesce to reduce partitions")
println("   3. cache intermediate results if needed")
println("   4. minimize shuffles")
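
The core of the optimization, replacing `groupByKey` with `reduceByKey`, can be sketched as below. The variable names are hypothetical, and timing is deliberately left out: on a small local dataset like this one, wall-clock differences are mostly noise, so the sketch only checks that both approaches agree.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ShuffleSketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Same shape as the exercise data: 100 keys with 1000 values each.
val grouped = sc.parallelize(1 to 100000).map(i => (s"Group${i % 100}", i))

// Slow pattern: every single value crosses the shuffle, then gets summed.
val viaGroupByKey = grouped.groupByKey().mapValues(_.sum).collectAsMap()

// Faster pattern: values are summed per partition first, so only one
// partial sum per key and partition crosses the shuffle.
val viaReduceByKey = grouped
  .reduceByKey(_ + _)
  .filter(_._2 > 5000000)
  .collectAsMap()

println(s"Same result: ${viaGroupByKey == viaReduceByKey}, keys: ${viaReduceByKey.size}")
```

In an interview it is worth explaining the reason as well as the fix: `reduceByKey` applies a map-side combine, which shrinks the shuffled data, while `groupByKey` must materialize every value for a key on one executor.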

## 🔗 Question 5: DataFrame Joins & Window Functions

**Complex DataFrame operations with joins and analytics.**

**Difficulty**: ⭐⭐⭐⭐⭐
**Frequency**: High

**Requirements**:
- Join multiple DataFrames
- Use window functions for rankings
- Calculate running totals
- Handle complex business logic

In [None]:
// Create multiple related datasets
val orders = Seq(
  (1001, "Alice", "2023-01-15", 150.0),
  (1002, "Bob", "2023-01-16", 200.0),
  (1003, "Alice", "2023-01-17", 75.0),
  (1004, "Charlie", "2023-01-18", 300.0),
  (1005, "Bob", "2023-01-19", 125.0)
)

val customers = Seq(
  ("Alice", "Premium"),
  ("Bob", "Standard"),
  ("Charlie", "Premium"),
  ("Diana", "Basic")
)

val ordersDF = orders.toDF("order_id", "customer", "order_date", "amount")
val customersDF = customers.toDF("customer", "tier")

// FIXME: Implement complex DataFrame operations
// Type annotations keep the placeholders compiling until you replace ???
import org.apache.spark.sql.DataFrame

// 1. Join orders with customer tiers
println("1. Orders with customer tiers:")
val ordersWithTiers: DataFrame = ??? // Join DataFrames
ordersWithTiers.show()

// 2. Customer ranking by total spend
println("\n2. Customer ranking by total spend:")
val customerRanking: DataFrame = ??? // Window function with rank
customerRanking.show()

// 3. Running total per customer
println("\n3. Running total per customer:")
val runningTotals: DataFrame = ??? // Window function with sum over order by date
runningTotals.show()

// 4. Premium customer analysis
println("\n4. Premium customer analysis:")
val premiumAnalysis: DataFrame = ??? // Filter + aggregations
premiumAnalysis.show()
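
One way to approach the four parts is sketched below, under stated assumptions: the `orderRows` and `tierRows` samples are reduced and hypothetical, the `Sketch` names and output column names are invented for illustration, and a left join is chosen so that orders without a known tier would still appear.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("JoinWindowSketch").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical reduced samples with the same columns as the exercise data.
val orderRows = Seq(
  (1001, "Alice", "2023-01-15", 150.0),
  (1002, "Bob", "2023-01-16", 200.0),
  (1003, "Alice", "2023-01-17", 75.0)
).toDF("order_id", "customer", "order_date", "amount")
val tierRows = Seq(("Alice", "Premium"), ("Bob", "Standard")).toDF("customer", "tier")

// 1. Join on the shared column name; Seq("customer") avoids a duplicated join column.
val withTiersSketch = orderRows.join(tierRows, Seq("customer"), "left")

// 2. Rank customers by total spend. A window without partitionBy pulls all rows into
//    one partition, which is acceptable only because the aggregate is small.
val rankingSketch = orderRows
  .groupBy("customer")
  .agg(sum("amount").as("total_spend"))
  .withColumn("spend_rank", rank().over(Window.orderBy(col("total_spend").desc)))

// 3. Running total per customer. ISO yyyy-MM-dd strings sort chronologically.
val runningWindow = Window
  .partitionBy("customer")
  .orderBy("order_date")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)
val runningSketch = orderRows.withColumn("running_total", sum("amount").over(runningWindow))

// 4. Premium customers only: filter first, then aggregate.
val premiumSketch = withTiersSketch
  .filter(col("tier") === "Premium")
  .groupBy("customer")
  .agg(count("*").as("orders"), sum("amount").as("total_spend"), avg("amount").as("avg_order"))

rankingSketch.show()
runningSketch.show()
```

The explicit `rowsBetween` frame makes the running total unambiguous when two orders share the same date; the default frame for an ordered window is range-based and would treat such rows as peers.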

## 🏆 Question 6: ETL Pipeline Design

**Design and implement a complete ETL pipeline.**

**Difficulty**: ⭐⭐⭐⭐⭐
**Frequency**: Very High

**Scenario**: Build an ETL pipeline that processes raw data, applies transformations, and loads results.

In [None]:
// Raw data simulation
val rawLogs = Seq(
  "2023-01-15 10:30:00,user1,login,success",
  "2023-01-15 10:35:00,user2,login,failed",
  "2023-01-15 10:40:00,user1,purchase,299.99",
  "2023-01-15 10:45:00,user3,login,success",
  "2023-01-15 10:50:00,user2,purchase,149.99",
  "2023-01-15 10:55:00,user1,logout,success"
)

// FIXME: Implement ETL pipeline
// Extract: Parse raw logs
// Transform: Clean, validate, enrich data
// Load: Aggregate and save results

println("ETL Pipeline Implementation:")

// Extract phase
val rawRDD = sc.parallelize(rawLogs)
println("1. Extract - Raw data loaded")

// Transform phase
val parsedData = ??? // Parse CSV-like data into structured format
println("2. Transform - Data parsed and cleaned")

val enrichedData = ??? // Add derived columns (hour, day, etc.)
println("3. Transform - Data enriched with derived fields")

val validatedData = ??? // Filter invalid records
println("4. Transform - Data validated")

// Load phase
val summaryStats = ??? // Calculate summary statistics
println("5. Load - Summary statistics calculated")

// Display results
println("\nFinal Results:")
summaryStats.collect().foreach(println)

println("\n💡 ETL Best Practices Demonstrated:")
println("   - Modular design (Extract/Transform/Load)")
println("   - Data validation and cleaning")
println("   - Efficient transformations")
println("   - Summary aggregations")
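
A compact sketch of the Extract/Transform/Load stages is shown below. It rests on several assumptions: the `logLines` sample (including the deliberately malformed line) is hypothetical, the column names are invented, the fourth field is treated as an amount only for `purchase` events, and the load step simply collects a per-user summary because the exercise names no output destination.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("EtlSketch").master("local[*]").getOrCreate()
import spark.implicits._

// Extract: hypothetical raw lines in the same format as the exercise data.
val logLines = Seq(
  "2023-01-15 10:30:00,user1,login,success",
  "2023-01-15 10:40:00,user1,purchase,299.99",
  "2023-01-15 10:50:00,user2,purchase,149.99",
  "not a valid line"
)

// Transform (parse): keep only lines with exactly four comma-separated fields.
val parsedSketch = spark.sparkContext
  .parallelize(logLines)
  .flatMap { line =>
    line.split(",", -1).map(_.trim) match {
      case Array(ts, user, action, detail) => Some((ts, user, action, detail))
      case _ => None // malformed line, dropped
    }
  }
  .toDF("ts", "user", "action", "detail")

// Transform (enrich): typed timestamp, hour of day, and an amount for purchases only.
val enrichedSketch = parsedSketch
  .withColumn("event_time", to_timestamp(col("ts"), "yyyy-MM-dd HH:mm:ss"))
  .withColumn("hour", hour(col("event_time")))
  .withColumn("amount", when(col("action") === "purchase", col("detail").cast("double")))

// Transform (validate): require a parseable timestamp and a non-empty user.
val validatedSketch = enrichedSketch
  .filter(col("event_time").isNotNull && col("user") =!= "")

// Load: per-user summary; users without purchases get a total of 0.0.
val summarySketch = validatedSketch
  .groupBy("user")
  .agg(count("*").as("events"), coalesce(sum("amount"), lit(0.0)).as("total_spent"))

summarySketch.show()
```

Keeping each stage in its own value makes it easy to inspect intermediate results and to count rejected records, for example by comparing the input size with `validatedSketch.count()`.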

## 🛑 Cleanup

**Clean up resources.**

In [None]:
// Stop SparkSession
spark.stop()
println("🛑 Interview Practice Session Stopped")
println("✅ All resources cleaned up!")

## 📚 Interview Preparation Tips

**🎯 Key Points for Spark Interviews:**

### **Technical Concepts:**
- **RDD vs DataFrame vs Dataset**: Know when to use each
- **Transformations vs Actions**: Lazy evaluation understanding
- **Shuffle Operations**: groupByKey vs reduceByKey performance
- **Caching Strategies**: When and how to cache data
- **Partitioning**: repartition vs coalesce
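
Two of the concepts above, lazy evaluation and `repartition` versus `coalesce`, can be observed directly with a small sketch. The accumulator and variable names are hypothetical; the sketch builds its own session through `getOrCreate()` because the notebook's session has already been stopped at this point.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ConceptsSketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// An accumulator lets the driver observe how many elements were actually processed.
val rowsSeen = sc.longAccumulator("rowsSeen")

val numbers = sc.parallelize(1 to 1000, numSlices = 8)

// Transformation: only recorded in the lineage, nothing executes yet.
val doubled = numbers.map { n => rowsSeen.add(1); n * 2 }
val seenBeforeAction: Long = rowsSeen.value

// Action: triggers the job, so the map function finally runs.
val total = doubled.sum()
val seenAfterAction: Long = rowsSeen.value

// coalesce merges existing partitions without a full shuffle (suited to reducing the count);
// repartition always shuffles and can either raise or lower the count.
val narrowed = doubled.coalesce(2)
val widened = doubled.repartition(16)

println(s"before action: $seenBeforeAction, after action: $seenAfterAction, total: $total")
println(s"partitions: ${doubled.getNumPartitions} -> ${narrowed.getNumPartitions} / ${widened.getNumPartitions}")
```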

### **Common Questions:**
1. **Word Count**: The classic warm-up question
2. **Performance Issues**: How to optimize slow jobs
3. **Data Skew**: Causes and solutions
4. **Window Functions**: Ranking and analytics
5. **ETL Design**: Pipeline architecture
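
For the data skew item, one commonly discussed mitigation is key salting: spread a hot key across several temporary sub-keys, aggregate those, then merge. The sketch below illustrates the two-stage pattern only; the dataset, the bucket count of 8, and all names are hypothetical. For a plain sum, `reduceByKey` already combines on the map side, so salting tends to matter more for joins and for aggregations that cannot be pre-combined.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Random

val spark = SparkSession.builder().appName("SaltingSketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Hypothetical skewed data: one "hot" key dominates.
val skewed = sc.parallelize(Seq.fill(1000)(("hot", 1)) ++ Seq(("cold", 1), ("cold", 1)))

val saltBuckets = 8 // arbitrary choice for illustration

// Stage 1: append a random salt so the hot key is spread over up to 8 sub-keys.
val partialSums = skewed
  .map { case (key, value) => ((key, Random.nextInt(saltBuckets)), value) }
  .reduceByKey(_ + _)

// Stage 2: drop the salt and merge the partial sums per original key.
val unsaltedTotals = partialSums
  .map { case ((key, _), partial) => (key, partial) }
  .reduceByKey(_ + _)

unsaltedTotals.collect().foreach(println)
```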

### **Behavioral Questions:**
- **Problem-Solving**: How do you debug Spark issues?
- **Architecture**: How do you design scalable pipelines?
- **Performance**: How do you handle large datasets?
- **Best Practices**: Code organization and testing

### **Preparation Strategy:**
1. **Master Fundamentals**: RDD operations, DataFrame API
2. **Practice Coding**: Implement common algorithms
3. **Study Performance**: Understand optimization techniques
4. **Learn Patterns**: Common ETL and analytics patterns
5. **Mock Interviews**: Practice explaining your solutions

**Remember**: Spark interviews test both coding skills and system understanding. Be ready to explain WHY you chose certain approaches!

**🚀 Good luck with your Spark interviews!**