In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F, Window as W
from pyspark.sql.types import StructType, StructField, StringType, DateType, IntegerType
from datetime import datetime

In [2]:
spark = SparkSession.builder.appName("DailyCodingProblem-day-11").getOrCreate()

## **Problem Scenario: Online Learning Engagement Analysis**

### **Background**

You are working for an online learning platform (like Coursera or Udemy).
The platform tracks student activities such as **logins, watching videos, attempting quizzes, completing courses, and forum participation** across multiple courses and devices.

However, the raw data contains:

* Missing device/location information
* Multiple activity types per student per day
* Varying engagement durations for different activities

The company wants **daily engagement analytics** and **course-level insights** for product and academic teams.

---

## **Business Objectives**

1. **Daily Engagement Metrics**

   * Total active students per day
   * Total activities per day
   * Average session duration per day
   * Device and location distribution

2. **Course-Level Analytics**

   * Course completion rates
   * Top 5 courses by engagement time
   * Unique students per course per day

3. **Student Behavior Metrics**

   * Average time spent per student per day
   * Number of quizzes attempted per student
   * Students completing at least 1 course per week

---

## **Sample Input Table (Simplified)**

| activity\_id | student\_id | course\_id | activity\_type   | activity\_time      | device  | location | duration\_minutes |
| ------------ | ----------- | ---------- | ---------------- | ------------------- | ------- | -------- | ----------------- |
| A123456      | S1          | C2         | Login            | 2025-08-20 09:15:00 | Mobile  | USA      |                   |
| A234567      | S1          | C2         | Watch\_Video     | 2025-08-20 09:30:00 | Mobile  | USA      | 20                |
| A345678      | S2          | C3         | Attempt\_Quiz    | 2025-08-20 10:00:00 | Desktop | India    | 15                |
| A456789      | S3          | C5         | Complete\_Course | 2025-08-20 11:00:00 | Tablet  | UK       |                   |
| A567890      | S1          | C2         | Forum\_Post      | 2025-08-20 11:20:00 | Mobile  | USA      |                   |

---

## **Expected Output 1 – Daily Engagement Metrics**

| date       | total\_activities | unique\_students | avg\_duration\_per\_activity | top\_device | top\_location |
| ---------- | ----------------- | ---------------- | ---------------------------- | ----------- | ------------- |
| 2025-08-20 | 400               | 120              | 18.5 mins                    | Mobile      | India         |
| 2025-08-21 | 390               | 118              | 17.2 mins                    | Desktop     | USA           |

---

## **Expected Output 2 – Course-Level Analytics**

| date       | course\_id | completion\_rate(%) | total\_time\_spent | unique\_students |
| ---------- | ---------- | ------------------- | ------------------ | ---------------- |
| 2025-08-20 | C2         | 35%                 | 1200 mins          | 45               |
| 2025-08-20 | C3         | 28%                 | 850 mins           | 38               |
| 2025-08-20 | C5         | 40%                 | 600 mins           | 20               |

---

## **Expected Output 3 – Student Behavior Metrics**

| student\_id | date       | total\_time\_spent | quizzes\_attempted | courses\_completed |
| ----------- | ---------- | ------------------ | ------------------ | ------------------ |
| S1          | 2025-08-20 | 250 mins           | 2                  | 1                  |
| S2          | 2025-08-20 | 180 mins           | 1                  | 0                  |
| S3          | 2025-08-20 | 75 mins            | 0                  | 1                  |

---

This scenario mimics **real-world complexity** with:

* Missing values handling
* Per-day, per-course, and per-student aggregations
* Multiple dimensions like device, location, and activity types

---

In [3]:
df = spark.read.csv(
    "/home/jupyter/work/data/sources/csv/day-12/online_learning_activity.csv", header=True, inferSchema=True
)

In [4]:
df.printSchema()

root
 |-- activity_id: string (nullable = true)
 |-- student_id: string (nullable = true)
 |-- course_id: string (nullable = true)
 |-- activity_type: string (nullable = true)
 |-- activity_time: timestamp (nullable = true)
 |-- device: string (nullable = true)
 |-- location: string (nullable = true)
 |-- duration_minutes: double (nullable = true)



In [6]:
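# Replace missing device/location values with the placeholder "NA".
df = df.fillna({"device": "NA", "location": "NA"})
# Split the timestamp into a calendar date (used for daily grouping)
# and a time-of-day string.
df = df.withColumn(
    "date",
    F.to_date("activity_time")
).withColumn(
    "time",
    F.date_format("activity_time", format="HH:mm:ss")
)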

df.show()

+-----------+----------+---------+---------------+-------------------+-------+---------+----------------+----------+--------+
|activity_id|student_id|course_id|  activity_type|      activity_time| device| location|duration_minutes|      date|    time|
+-----------+----------+---------+---------------+-------------------+-------+---------+----------------+----------+--------+
|    A274850|       S27|      C13|Complete_Course|2025-08-21 08:20:00| Tablet|   Canada|            NULL|2025-08-21|08:20:00|
|    A110285|       S25|       C8|     Forum_Post|2025-08-21 06:09:00| Tablet|   Canada|            NULL|2025-08-21|06:09:00|
|    A490154|       S72|      C15|    Watch_Video|2025-08-26 13:36:00|Desktop|      USA|            33.0|2025-08-26|13:36:00|
|    A618340|      S100|      C18|          Login|2025-08-25 13:38:00| Mobile|       UK|            NULL|2025-08-25|13:38:00|
|    A245486|      S126|      C10|          Login|2025-08-25 10:14:00|Desktop|    India|            NULL|2025-08-25|10

In [8]:
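daily_engagement_df = (
    df.groupBy("date")
      .agg(
          # Every row is one activity.
          F.count("*").alias("total_activities"),
          F.countDistinct("student_id").alias("unique_students"),
          # avg() skips NULL durations, so this is the mean over timed activities only.
          F.avg("duration_minutes").alias("avg_duration_per_activity")
      )
)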

In [10]:
device_window = W.partitionBy("date").orderBy(F.desc("device_count"))
location_window = W.partitionBy("date").orderBy(F.desc("location_count"))

In [11]:
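daily_device_location_df = (
    # Count activities per (date, device, location) combination. The three
    # aliases hold the same value, so both ranks below are identical.
    df.groupBy("date", "device", "location")
      .agg(
          F.count("*").alias("activity_count"),
          F.count("*").alias("device_count"),
          F.count("*").alias("location_count")
      )
      .withColumn("device_rank", F.rank().over(device_window))
      .withColumn("location_rank", F.rank().over(location_window))
      # Keep the most frequent combination(s) for each date.
      .filter((F.col("device_rank") == 1) | (F.col("location_rank") == 1))
      .groupBy("date")
      .agg(
          # If several combinations tie, first() picks one of them arbitrarily.
          F.first(F.col("device")).alias("top_device"),
          F.first(F.col("location")).alias("top_location")
      )
)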

In [13]:
final_daily = daily_engagement_df.join(daily_device_location_df, "date")

final_daily.show(truncate=False)

+----------+----------------+---------------+-------------------------+----------+------------+
|date      |total_activities|unique_students|avg_duration_per_activity|top_device|top_location|
+----------+----------------+---------------+-------------------------+----------+------------+
|2025-08-25|400             |141            |31.7816091954023         |Tablet    |Canada      |
|2025-08-26|399             |141            |27.466257668711656       |Mobile    |USA         |
|2025-08-20|400             |138            |31.251851851851853       |Desktop   |USA         |
|2025-08-22|400             |141            |30.58219178082192        |Desktop   |USA         |
|2025-08-23|400             |139            |27.770114942528735       |Tablet    |USA         |
|2025-08-21|400             |139            |29.388235294117646       |Mobile    |UK          |
|2025-08-27|1               |1              |9.0                      |Tablet    |USA         |
|2025-08-24|400             |136        

In [14]:
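course_stats_df = (
    df.groupBy("date", "course_id")
      .agg(
          # Students with any activity in the course on that date.
          F.countDistinct("student_id").alias("unique_students"),
          # when() without otherwise() yields NULL for non-matching rows, and
          # countDistinct ignores NULLs, so only completers are counted.
          F.countDistinct(
              F.when(F.col("activity_type") == "Complete_Course", F.col("student_id"))
          ).alias("completed_students"),
          # sum() skips rows with a NULL duration.
          F.sum("duration_minutes").alias("total_time_spent")
      )
      # Share of that day's active students who completed the course, in percent.
      .withColumn("completion_rate",
                  (F.col("completed_students") / F.col("unique_students")) * 100)
)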

In [19]:
course_stats_df.show()

+----------+---------+---------------+------------------+----------------+------------------+
|      date|course_id|unique_students|completed_students|total_time_spent|   completion_rate|
+----------+---------+---------------+------------------+----------------+------------------+
|2025-08-23|      C15|             23|                 5|           362.0| 21.73913043478261|
|2025-08-25|      C10|             27|                 6|           491.0| 22.22222222222222|
|2025-08-20|      C14|             22|                 7|           344.0|31.818181818181817|
|2025-08-20|       C8|             15|                 5|           186.0| 33.33333333333333|
|2025-08-22|       C5|             22|                 6|           248.0| 27.27272727272727|
|2025-08-20|       C4|             19|                 4|           182.0|21.052631578947366|
|2025-08-22|      C11|             24|                 7|           262.0|29.166666666666668|
|2025-08-22|      C13|             23|                 5|   

In [16]:
w = W.partitionBy("date").orderBy(F.desc("total_time_spent"))

In [20]:
top_5_courses_df = (
    # rank() gives tied courses the same rank, so a date can return more than five rows.
    course_stats_df.withColumn("rank", F.rank().over(w))
                .filter(F.col("rank") <= 5)
)
top_5_courses_df.show()

+----------+---------+---------------+------------------+----------------+------------------+----+
|      date|course_id|unique_students|completed_students|total_time_spent|   completion_rate|rank|
+----------+---------+---------------+------------------+----------------+------------------+----+
|2025-08-20|       C5|             24|                 7|           454.0|29.166666666666668|   1|
|2025-08-20|      C14|             22|                 7|           344.0|31.818181818181817|   2|
|2025-08-20|       C6|             16|                 4|           312.0|              25.0|   3|
|2025-08-20|       C3|             24|                 8|           310.0| 33.33333333333333|   4|
|2025-08-20|      C15|             26|                10|           286.0| 38.46153846153847|   5|
|2025-08-21|       C3|             21|                 4|           420.0|19.047619047619047|   1|
|2025-08-21|      C20|             21|                 5|           338.0|23.809523809523807|   2|
|2025-08-2

In [22]:
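student_behavior_df = df.groupBy("student_id", "date").agg(
    # Treat a missing duration as 0 so the total is 0.0 rather than NULL
    # for students with no timed activity that day.
    F.sum(F.coalesce("duration_minutes", F.lit(0))).alias("total_time_spent"),
    # Count one per Attempt_Quiz row.
    F.sum(F.when(F.col("activity_type") == "Attempt_Quiz", 1).otherwise(0)).alias("quizzes_attempted"),
    # Distinct courses with a Complete_Course event; non-matching rows become NULL and are ignored.
    F.countDistinct(F.when(F.col("activity_type") == "Complete_Course", F.col("course_id"))).alias("courses_completed")
)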

student_behavior_df.show()

+----------+----------+----------------+-----------------+-----------------+
|student_id|      date|total_time_spent|quizzes_attempted|courses_completed|
+----------+----------+----------------+-----------------+-----------------+
|      S132|2025-08-24|             0.0|                0|                2|
|      S144|2025-08-26|            30.0|                0|                1|
|       S60|2025-08-23|            39.0|                1|                1|
|       S54|2025-08-25|            40.0|                0|                0|
|       S73|2025-08-22|             0.0|                0|                0|
|       S99|2025-08-24|            31.0|                2|                2|
|       S42|2025-08-21|            35.0|                2|                1|
|       S45|2025-08-23|             0.0|                0|                1|
|      S126|2025-08-26|            55.0|                0|                0|
|       S89|2025-08-25|            46.0|                1|                0|