# 🎯 Aggregations & Analytics - Interview Scenarios

## **Module Overview**
**Master powerful aggregation techniques, window functions, and analytical operations commonly asked in PySpark interviews.**

**🎯 Difficulty**: 🟡 Intermediate
**⏱️ Estimated Time**: 2-3 hours
**🎓 Learning Focus**: Advanced analytics and window operations

---


## 🎯 What You Will Learn

By completing this module, you will master:
- ✅ **Advanced aggregation functions** (GROUP BY, HAVING)
- ✅ **Window functions and ranking** (ROW_NUMBER, DENSE_RANK)
- ✅ **Complex analytical operations** (conditional columns, self-joins, array explode)
- ✅ **UDF implementation** for custom business logic
- ✅ **Performance optimization** techniques

---


In [None]:
# Scenario 1: Employee Designation Based on Salary
# Assumes an active SparkSession named `spark`.
from pyspark.sql.functions import col, when

employee_data = [
    ("1", "a", 10000),
    ("2", "b", 5000),
    ("3", "c", 15000),
    ("4", "d", 25000),
    ("5", "e", 50000),
    ("6", "f", 7000)
]
df_employees = spark.createDataFrame(employee_data, ["empid", "name", "salary"])
df_employees.show()

# Method 1: SQL Approach
df_employees.createOrReplaceTempView("emptab")
spark.sql("""
    SELECT *,
           CASE WHEN salary > 10000 THEN 'Manager' ELSE 'Employee' END AS designation
    FROM emptab
""").show()

# Method 2: DSL Approach
df_with_designation = df_employees.withColumn(
    "designation",
    when(col("salary") > 10000, "Manager").otherwise("Employee")
)
df_with_designation.show()

## 💡 Solution Explanation

### **🔍 Approach Analysis:**
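Each employee is labelled `Manager` when `salary` exceeds 10000 and `Employee` otherwise. The SQL version uses `CASE WHEN`; the DSL version uses the equivalent `when(...).otherwise(...)`. The comparison is strict, so a salary of exactly 10000 yields `Employee`.

### **⚡ Performance Considerations:**
- **Built-in expressions**: `CASE WHEN` and `when`/`otherwise` are native column expressions, so no UDF is needed
- **Narrow transformation**: the new column is computed row by row and does not trigger a shuffle
- **SQL vs DSL**: both forms go through the same optimizer, so choose whichever reads better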

---


In [None]:
# Scenario 2: Top Quantity Sales per Year
# Assumes an active SparkSession named `spark`.
from pyspark.sql import Window
from pyspark.sql.functions import col, dense_rank

sales_data = [
    (1, 100, 2010, 25, 5000),
    (2, 100, 2011, 16, 5000),
    (3, 100, 2012, 8, 5000),
    (4, 200, 2010, 10, 9000),
    (5, 200, 2011, 15, 9000),
    (6, 200, 2012, 20, 7000),
    (7, 300, 2010, 20, 7000),
    (8, 300, 2011, 18, 7000),
    (9, 300, 2012, 20, 7000)
]
df_sales = spark.createDataFrame(sales_data, ["sale_id", "product_id", "year", "quantity", "price"])
df_sales.show()

# Method 1: SQL Approach
df_sales.createOrReplaceTempView("salestab")
spark.sql("""
    SELECT *
    FROM (SELECT *, DENSE_RANK() OVER (PARTITION BY year ORDER BY quantity DESC) AS rank
          FROM salestab) ranked_sales
    WHERE rank = 1
    ORDER BY sale_id
""").show()

# Method 2: DSL Approach
window_spec = Window.partitionBy("year").orderBy(col("quantity").desc())
df_ranked = df_sales.withColumn("rank", dense_rank().over(window_spec))
df_ranked.show()

# Keep only the top-ranked rows per year, then drop the helper column
df_top_sales = df_ranked.filter(col("rank") == 1).drop("rank").orderBy("sale_id")
df_top_sales.show()

## 💡 Solution Explanation

### **🔍 Approach Analysis:**
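For each year, return the sale or sales with the highest quantity. `DENSE_RANK()` over a window partitioned by `year` and ordered by `quantity` descending gives rank 1 to every row tied for the top quantity, so ties are all kept (2012 has two sales with quantity 20). `ROW_NUMBER()` would keep only one of the tied rows.

### **⚡ Performance Considerations:**
- **Window Functions**: the window needs a shuffle so that all rows of a `year` are co-located and sorted
- **Partition key**: a `partitionBy` column with few distinct values or heavy skew concentrates work on a few tasks
- **Filter after ranking**: the `rank = 1` filter can only run once the rank is computed, so it does not reduce the shuffled data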
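
To check the Spark output by hand, the dense-rank logic can be reproduced in plain Python. This is a minimal sketch that does not use PySpark; the helper name `dense_rank_by_year` is hypothetical, and the rows are the scenario's `sales_data`.

```python
from collections import defaultdict

# Same rows as the scenario: (sale_id, product_id, year, quantity, price)
sales_data = [
    (1, 100, 2010, 25, 5000), (2, 100, 2011, 16, 5000), (3, 100, 2012, 8, 5000),
    (4, 200, 2010, 10, 9000), (5, 200, 2011, 15, 9000), (6, 200, 2012, 20, 7000),
    (7, 300, 2010, 20, 7000), (8, 300, 2011, 18, 7000), (9, 300, 2012, 20, 7000),
]

def dense_rank_by_year(rows):
    """Append a dense rank (by quantity, descending) computed within each year."""
    by_year = defaultdict(list)
    for row in rows:
        by_year[row[2]].append(row)
    ranked = []
    for group in by_year.values():
        # Distinct quantities, highest first; dense rank is the 1-based position
        distinct = sorted({r[3] for r in group}, reverse=True)
        rank_of = {qty: pos + 1 for pos, qty in enumerate(distinct)}
        ranked.extend(r + (rank_of[r[3]],) for r in group)
    return ranked

ranked = dense_rank_by_year(sales_data)
top_sale_ids = sorted(r[0] for r in ranked if r[-1] == 1)
print(top_sale_ids)  # [1, 6, 8, 9]
```

Sales 6 and 9 both appear because they tie at quantity 20 in 2012, which is the behaviour that distinguishes `DENSE_RANK` from `ROW_NUMBER` here.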

---


In [None]:
# Scenario 3: Generate Cricket Match Combinations
# Assumes an active SparkSession named `spark`.
from pyspark.sql.functions import col, concat, lit

teams_data = [("India",), ("Pakistan",), ("SriLanka",)]
df_teams = spark.createDataFrame(teams_data, ["teams"])
df_teams.show()

# Method 1: SQL Approach
df_teams.createOrReplaceTempView("crickettab")
spark.sql("""
    SELECT CONCAT(a.teams, ' Vs ', b.teams) AS matches
    FROM crickettab a
    INNER JOIN crickettab b ON a.teams < b.teams
""").show()

# Method 2: DSL Approach
# Parentheses let the method chain span several lines
df_matches = (
    df_teams.alias("a")
    .join(df_teams.alias("b"), col("a.teams") < col("b.teams"), "inner")
    .select(concat(col("a.teams"), lit(" Vs "), col("b.teams")).alias("matches"))
)
df_matches.show()

## 💡 Solution Explanation

### **🔍 Approach Analysis:**
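Every pair of distinct teams should appear exactly once. A self-join on the condition `a.teams < b.teams` keeps one ordering of each pair and excludes a team playing itself, so 3 teams produce 3 matches.

### **⚡ Performance Considerations:**
- **Non-equi join**: a `<` condition cannot use a hash or sort-merge join, so Spark falls back to a nested-loop or cartesian strategy
- **Quadratic growth**: n teams produce n(n-1)/2 pairs, which is fine for a short team list but expensive for large inputs
- **Self-join aliases**: aliasing both sides (`a`, `b`) keeps the column references unambiguous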
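
The same pairing rule can be checked in plain Python with `itertools.combinations`. This is a sketch for verifying the expected rows, not a PySpark replacement; sorting first mirrors the `a.teams < b.teams` ordering.

```python
from itertools import combinations

teams = ["India", "Pakistan", "SriLanka"]

# combinations() over the sorted names yields each unordered pair once,
# with the alphabetically smaller team first (the a.teams < b.teams rule).
matches = [f"{a} Vs {b}" for a, b in combinations(sorted(teams), 2)]

for match in matches:
    print(match)

# n teams give n * (n - 1) / 2 matches
expected_count = len(teams) * (len(teams) - 1) // 2
```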

---


In [None]:
# Scenario 4: Find Name with Most Rank 1 Occurrences
# Assumes an active SparkSession named `spark`.
from pyspark.sql.functions import col, count, explode

rank_data = [
    ("a", [1, 1, 1, 3]),
    ("b", [1, 2, 3, 4]),
    ("c", [1, 1, 1, 1, 4]),
    ("d", [3])
]
df_ranks = spark.createDataFrame(rank_data, ["name", "rank"])
df_ranks.show()

# Explode the array and count occurrences of rank 1
df_exploded = df_ranks.withColumn("rank", explode(col("rank")))
df_exploded.show()

df_filtered = df_exploded.filter(col("rank") == 1)
df_filtered.show()

df_counts = df_filtered.groupBy("name").agg(count("*").alias("count"))
df_counts.show()

# Get the name with maximum count
max_count_row = df_counts.orderBy(col("count").desc()).first()
print(f"Name with most rank 1s: {max_count_row['name']}")

## 💡 Solution Explanation

### **🔍 Approach Analysis:**
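Each name carries an array of ranks. `explode` turns every array element into its own row, the filter keeps only the rows where the rank is 1, and a grouped count gives the number of rank-1 results per name. Ordering by that count and taking `first()` returns a single winner, so a tie would be broken arbitrarily. Names with no rank 1 at all (here `d`) drop out at the filter step.

### **⚡ Performance Considerations:**
- **Explode**: produces one row per array element, so long arrays multiply the row count before the aggregation
- **Filter early**: filtering to rank 1 before `groupBy` reduces the data that has to be shuffled
- **`first()` after `orderBy`**: sorts the aggregated result on the driver side of the plan; this is cheap here because there is one row per name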
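
As a quick cross-check of the expected answer, the explode, filter, and count steps can be collapsed into a per-name count in plain Python. This sketch does not use PySpark and reuses the scenario's `rank_data`.

```python
rank_data = [
    ("a", [1, 1, 1, 3]),
    ("b", [1, 2, 3, 4]),
    ("c", [1, 1, 1, 1, 4]),
    ("d", [3]),
]

# Count how many times rank 1 appears in each name's list.
# Names with zero occurrences are dropped, as the filter step does in Spark.
rank1_counts = {name: ranks.count(1) for name, ranks in rank_data if 1 in ranks}

# Pick the name with the highest count
winner = max(rank1_counts, key=rank1_counts.get)
print(rank1_counts)
print(f"Name with most rank 1s: {winner}")
```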

---


In [None]:
# Scenario 5: Get Latest Commission for Each Employee
# Assumes an active SparkSession named `spark`.
from pyspark.sql import functions as F

commission_data = [
    (1, 300, "31-Jan-2021"),
    (1, 400, "28-Feb-2021"),
    (1, 200, "31-Mar-2021"),
    (2, 1000, "31-Oct-2021"),
    (2, 900, "31-Dec-2021")
]
df_commissions = spark.createDataFrame(commission_data, ["empid", "commissionamt", "monthlastdate"])
df_commissions.show()

# Parse the dd-MMM-yyyy strings into dates first: comparing the raw strings
# would rank "31-Oct-2021" above "31-Dec-2021".
df_dated = df_commissions.withColumn("month_date", F.to_date("monthlastdate", "dd-MMM-yyyy"))

# Get max date per employee (F.max avoids shadowing Python's built-in max)
df_max_dates = df_dated.groupBy("empid").agg(F.max("month_date").alias("month_date"))
df_max_dates.show()

# Join back on both keys to get the latest commission row
df_latest_commissions = (
    df_dated.join(df_max_dates, on=["empid", "month_date"], how="inner")
    .drop("month_date")
)
df_latest_commissions.show()

## 💡 Solution Explanation

### **🔍 Approach Analysis:**
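The approach finds each employee's latest month-end date with a grouped max, then joins back to the original rows to retrieve the matching commission. The `monthlastdate` values are `dd-MMM-yyyy` strings, so they need to be parsed into dates before taking the max: compared as text, "31-Oct-2021" sorts after "31-Dec-2021".

### **⚡ Performance Considerations:**
- **Aggregate then join**: the grouped max and the join back typically each need a shuffle on `empid`
- **Window alternative**: `row_number()` over a window partitioned by `empid` and ordered by date descending, filtered to 1, returns the latest row without the join
- **Ties**: joining on the max date returns every row that shares it, while `row_number()` returns exactly one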
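
The difference between comparing date strings and comparing parsed dates is easy to see in plain Python. The sketch below does not use PySpark; `parse_month_end` and the `MONTHS` lookup are hypothetical helpers written for this illustration, and the rows are the scenario's `commission_data`.

```python
from datetime import date

MONTH_NAMES = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
MONTHS = {name: number for number, name in enumerate(MONTH_NAMES, start=1)}

def parse_month_end(text):
    """Parse a 'dd-MMM-yyyy' string such as '31-Dec-2021' into a date."""
    day, mon, year = text.split("-")
    return date(int(year), MONTHS[mon], int(day))

commission_data = [
    (1, 300, "31-Jan-2021"),
    (1, 400, "28-Feb-2021"),
    (1, 200, "31-Mar-2021"),
    (2, 1000, "31-Oct-2021"),
    (2, 900, "31-Dec-2021"),
]

# Keep, per employee, the row whose parsed date is the latest
latest = {}
for empid, amount, month_end in commission_data:
    current = latest.get(empid)
    if current is None or parse_month_end(month_end) > parse_month_end(current[1]):
        latest[empid] = (amount, month_end)

# Comparing the raw strings picks the wrong month for employee 2
string_max = max(d for e, _, d in commission_data if e == 2)

print(latest)       # {1: (200, '31-Mar-2021'), 2: (900, '31-Dec-2021')}
print(string_max)   # 31-Oct-2021
```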

---


In [None]:
# Scenario 6: Employee Grade Based on Salary
# Assumes an active SparkSession named `spark`.
from pyspark.sql.functions import col, when

emp_salary_data = [
    (1, "Jhon", 4000),
    (2, "Tim David", 12000),
    (3, "Json Bhrendroff", 7000),
    (4, "Jordon", 8000),
    (5, "Green", 14000),
    (6, "Brewis", 6000)
]
df_emp_salary = spark.createDataFrame(emp_salary_data, ["emp_id", "emp_name", "salary"])
df_emp_salary.show()

# Method 1: SQL Approach
df_emp_salary.createOrReplaceTempView("emptab")
spark.sql("""
    SELECT *,
           CASE WHEN salary < 5000 THEN 'C'
                WHEN salary BETWEEN 5000 AND 10000 THEN 'B'
                ELSE 'A' END AS grade
    FROM emptab
""").show()

# Method 2: DSL Approach
df_with_grades = df_emp_salary.withColumn(
    "grade",
    when(col("salary") < 5000, "C")
    .when((col("salary") >= 5000) & (col("salary") <= 10000), "B")
    .otherwise("A")
)
df_with_grades.show()

## 💡 Solution Explanation

### **🔍 Approach Analysis:**
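Salaries are bucketed into three grades: `C` below 5000, `B` from 5000 to 10000 inclusive, and `A` above 10000. `BETWEEN` is inclusive at both ends, which the DSL version spells out with `>=` and `<=`. Branches are evaluated in order, so the second branch is only reached when the salary is already at least 5000.

### **⚡ Performance Considerations:**
- **Built-in expressions**: chained `when` calls compile to a single `CASE WHEN` expression, so no UDF is needed
- **Narrow transformation**: the grade is computed per row without a shuffle
- **Branch order**: put the conditions in the order that makes each later branch simplest to state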
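
Boundary values are where grading rules usually go wrong, so it helps to write them down explicitly. The following plain-Python sketch (the `grade` function is a hypothetical stand-in for the `CASE WHEN` expression, not part of PySpark) mirrors the scenario's thresholds.

```python
def grade(salary):
    """Mirror the scenario's rule: <5000 is C, 5000..10000 inclusive is B, else A."""
    if salary < 5000:
        return "C"
    if salary <= 10000:  # already known to be >= 5000 at this point
        return "B"
    return "A"

# Boundary checks: 5000 and 10000 both fall in grade B
for salary in (4999, 5000, 10000, 10001):
    print(salary, grade(salary))

# The scenario's employees
emp_salary_data = [
    (1, "Jhon", 4000), (2, "Tim David", 12000), (3, "Json Bhrendroff", 7000),
    (4, "Jordon", 8000), (5, "Green", 14000), (6, "Brewis", 6000),
]
grades = {emp_id: grade(salary) for emp_id, _, salary in emp_salary_data}
```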

---


In [None]:
# Scenario 7: Data Masking Using UDFs
# Assumes an active SparkSession named `spark`.
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

def mask_email(email):
    # Keep the first character and the domain; hide the rest of the local part
    return email[0] + "**********" + email[email.find("@"):]

def mask_mobile(mobile):
    # Keep the first two and the last three digits
    return mobile[:2] + "*****" + mobile[-3:]

contact_data = [
    ("Renuka1992@gmail.com", "9856765434"),
    ("anbu.arasu@gmail.com", "9844567788")
]
df_contacts = spark.createDataFrame(contact_data, ["email", "mobile"])
df_contacts.show()

# Register UDFs
mask_email_udf = udf(mask_email, StringType())
mask_mobile_udf = udf(mask_mobile, StringType())

# Apply masking
df_masked = (
    df_contacts
    .withColumn("masked_email", mask_email_udf(col("email")))
    .withColumn("masked_mobile", mask_mobile_udf(col("mobile")))
)
df_masked.show()

## 💡 Solution Explanation

### **🔍 Approach Analysis:**
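Two Python functions are wrapped as UDFs and applied column-wise. `mask_email` keeps the first character and the domain and replaces the rest of the local part with a fixed run of asterisks; `mask_mobile` keeps the first two and last three digits. A null input would raise inside these functions (for example at `email[0]`), so add a null guard if the columns are nullable.

### **⚡ Performance Considerations:**
- **UDFs**: Python UDFs move each row between the JVM and a Python worker, which is slower than built-in functions
- **Built-in alternative**: the same masks can be expressed with built-in string functions such as `regexp_replace`, `substring`, and `concat`
- **Optimizer visibility**: Spark treats a UDF as a black box and cannot optimize through it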
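
The masking functions can be tested as ordinary Python before they are wrapped in `udf(...)`. The sketch below adds guards for `None` and for an email without `@`; masking such an email completely is an assumption made for this example, not behaviour taken from the scenario.

```python
def mask_email(email):
    """Keep the first character and the domain; hide the rest of the local part."""
    if email is None:
        return None
    at = email.find("@")
    if at < 1:
        # Assumption: with no usable local part, mask the whole value
        return "*" * len(email)
    return email[0] + "*" * 10 + email[at:]

def mask_mobile(mobile):
    """Keep the first two and the last three digits."""
    if mobile is None:
        return None
    return mobile[:2] + "*" * 5 + mobile[-3:]

print(mask_email("Renuka1992@gmail.com"))  # R**********@gmail.com
print(mask_mobile("9856765434"))           # 98*****434
print(mask_email(None))                    # None
```

Because the functions are plain Python, the same definitions can be passed to `udf(...)` unchanged once they behave as expected.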

---


In [None]:
# Scenario 8: Employee Count per Department
# Assumes an active SparkSession named `spark`.
from pyspark.sql.functions import count

dept_data = [
    (1, "Jhon", "Development"),
    (2, "Tim", "Development"),
    (3, "David", "Testing"),
    (4, "Sam", "Testing"),
    (5, "Green", "Testing"),
    (6, "Miller", "Production"),
    (7, "Brevis", "Production"),
    (8, "Warner", "Production"),
    (9, "Salt", "Production")
]
df_dept = spark.createDataFrame(dept_data, ["emp_id", "emp_name", "dept"])
df_dept.show()

# Method 1: SQL Approach
df_dept.createOrReplaceTempView("emptab")
spark.sql("SELECT dept, COUNT(*) AS total FROM emptab GROUP BY dept").show()

# Method 2: DSL Approach
df_dept_counts = df_dept.groupBy("dept").agg(count("*").alias("total"))
df_dept_counts.show()

## 💡 Solution Explanation

### **🔍 Approach Analysis:**
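Employees are grouped by `dept` and counted. `GROUP BY dept` with `COUNT(*)` in SQL and `groupBy("dept").agg(count("*"))` in the DSL produce the same result: one row per department with its headcount.

### **⚡ Performance Considerations:**
- **Aggregations**: counts are partially aggregated within each partition before the shuffle, so only one row per department per partition is exchanged
- **Group cardinality**: the shuffle cost depends on the number of distinct `dept` values, not on the number of employees
- **Filtering groups**: a condition on the aggregated value (SQL `HAVING`) is applied after the count, as a filter on the result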
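
The expected counts can be confirmed in plain Python with `collections.Counter`. The sketch also shows the equivalent of a `HAVING` filter on the aggregated value; the threshold of 3 is an arbitrary choice for illustration.

```python
from collections import Counter

dept_data = [
    (1, "Jhon", "Development"), (2, "Tim", "Development"),
    (3, "David", "Testing"), (4, "Sam", "Testing"), (5, "Green", "Testing"),
    (6, "Miller", "Production"), (7, "Brevis", "Production"),
    (8, "Warner", "Production"), (9, "Salt", "Production"),
]

# GROUP BY dept, COUNT(*)
dept_counts = Counter(dept for _, _, dept in dept_data)

# HAVING COUNT(*) >= 3 (threshold chosen only for this example)
large_depts = {dept: total for dept, total in dept_counts.items() if total >= 3}

print(dict(dept_counts))
print(large_depts)
```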

---


In [None]:
# Scenario 9: Calculate Total Marks
# Assumes an active SparkSession named `spark`.
from pyspark.sql.functions import col

marks_data = [(203040, "rajesh", 10, 20, 30, 40, 50)]
df_marks = spark.createDataFrame(
    marks_data,
    ["rollno", "name", "telugu", "english", "maths", "science", "social"]
)
df_marks.show()

# Method 1: SQL Approach
df_marks.createOrReplaceTempView("marks")
spark.sql("""
    SELECT *, (telugu + english + maths + science + social) AS total
    FROM marks
""").show()

# Method 2: DSL Approach
df_with_total = df_marks.withColumn(
    "total",
    col("telugu") + col("english") + col("maths") + col("science") + col("social")
)
df_with_total.show()

## 💡 Solution Explanation

### **🔍 Approach Analysis:**
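The total is a row-wise sum across five subject columns, not an aggregation over rows, so it is expressed as a column expression with `+` rather than with `groupBy`. If any subject is null, the `+` expression returns null for the whole total; wrap each column in `coalesce(col, 0)` if missing marks should count as zero.

### **⚡ Performance Considerations:**
- **Narrow transformation**: the sum is computed within each row and needs no shuffle
- **Built-in expressions**: arithmetic on columns is evaluated natively, so no UDF is needed
- **Many columns**: for a long list of subjects, build the expression programmatically from the column names instead of typing each term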
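
Null handling is the main thing to watch in a row-wise sum. The plain-Python sketch below contrasts SQL-style null propagation with a coalesced sum; both function names are hypothetical, and the second row with a missing mark is made up for illustration.

```python
def sql_total(marks):
    """Mimic SQL `+`: any NULL (None) operand makes the whole sum NULL."""
    return None if any(m is None for m in marks) else sum(marks)

def coalesced_total(marks):
    """Treat missing marks as 0, like wrapping each column in COALESCE(col, 0)."""
    return sum(0 if m is None else m for m in marks)

scenario_marks = [10, 20, 30, 40, 50]        # the scenario's single row
with_missing = [10, None, 30, 40, 50]        # illustrative row, not from the scenario

print(sql_total(scenario_marks))       # 150
print(sql_total(with_missing))         # None
print(coalesced_total(with_missing))   # 130
```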

---


# 🎉 Module Complete!

## **🏆 Congratulations!**

You have successfully completed the **Aggregations & Analytics** interview scenarios module!

### **📊 What You Mastered:**
- **Window functions** for advanced analytics
- **Complex aggregations** and ranking operations
- **UDF implementation** for custom logic
- **Performance optimization** for analytical queries

### **🚀 Next Steps:**
Ready for data quality challenges? Try the **Data Quality & Joins** module!

---
