## **Problem Scenario: Online Learning Engagement Analysis**

### **Background**

You are working for an online learning platform (like Coursera or Udemy).
The platform tracks student activities such as **logins, watching videos, attempting quizzes, completing courses, and forum participation** across multiple courses and devices.

However, the raw data contains:

* Missing device/location information
* Multiple activity types per student per day
* Varying engagement durations for different activities

The company wants **daily engagement analytics** and **course-level insights** for product and academic teams.

---

## **Business Objectives**

1. **Daily Engagement Metrics**

   * Total active students per day
   * Total activities per day
   * Average session duration per day
   * Device and location distribution

2. **Course-Level Analytics**

   * Course completion rates
   * Top 5 courses by engagement time
   * Unique students per course per day

3. **Student Behavior Metrics**

   * Average time spent per student per day
   * Number of quizzes attempted per student
   * Students completing at least 1 course per week

---

## **Sample Input Table (Simplified)**

| activity\_id | student\_id | course\_id | activity\_type   | activity\_time      | device  | location | duration\_minutes |
| ------------ | ----------- | ---------- | ---------------- | ------------------- | ------- | -------- | ----------------- |
| A123456      | S1          | C2         | Login            | 2025-08-20 09:15:00 | Mobile  | USA      |                   |
| A234567      | S1          | C2         | Watch\_Video     | 2025-08-20 09:30:00 | Mobile  | USA      | 20                |
| A345678      | S2          | C3         | Attempt\_Quiz    | 2025-08-20 10:00:00 | Desktop | India    | 15                |
| A456789      | S3          | C5         | Complete\_Course | 2025-08-20 11:00:00 | Tablet  | UK       |                   |
| A567890      | S1          | C2         | Forum\_Post      | 2025-08-20 11:20:00 | Mobile  | USA      |                   |

---

## **Expected Output 1 – Daily Engagement Metrics**

| date       | total\_activities | unique\_students | avg\_duration\_per\_activity | top\_device | top\_location |
| ---------- | ----------------- | ---------------- | ---------------------------- | ----------- | ------------- |
| 2025-08-20 | 400               | 120              | 18.5 mins                    | Mobile      | India         |
| 2025-08-21 | 390               | 118              | 17.2 mins                    | Desktop     | USA           |

---

## **Expected Output 2 – Course-Level Analytics**

| date       | course\_id | completion\_rate(%) | total\_time\_spent | unique\_students |
| ---------- | ---------- | ------------------- | ------------------ | ---------------- |
| 2025-08-20 | C2         | 35%                 | 1200 mins          | 45               |
| 2025-08-20 | C3         | 28%                 | 850 mins           | 38               |
| 2025-08-20 | C5         | 40%                 | 600 mins           | 20               |

---

## **Expected Output 3 – Student Behavior Metrics**

| student\_id | date       | total\_time\_spent | quizzes\_attempted | courses\_completed |
| ----------- | ---------- | ------------------ | ------------------ | ------------------ |
| S1          | 2025-08-20 | 250 mins           | 2                  | 1                  |
| S2          | 2025-08-20 | 180 mins           | 1                  | 0                  |
| S3          | 2025-08-20 | 75 mins            | 0                  | 1                  |

---

This scenario mimics **real-world complexity** with:

* Missing values handling
* Per-day, per-course, and per-student aggregations
* Multiple dimensions like device, location, and activity types

---

In [7]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.appName('Daily-Day12').getOrCreate()

In [8]:
df = spark.read.option('header','true').option('inferSchema','true').csv('/content/sample_data/data/learning.csv')

In [5]:
df.printSchema()

root
 |-- activity_id: string (nullable = true)
 |-- student_id: string (nullable = true)
 |-- course_id: string (nullable = true)
 |-- activity_type: string (nullable = true)
 |-- activity_time: timestamp (nullable = true)
 |-- device: string (nullable = true)
 |-- location: string (nullable = true)
 |-- duration_minutes: double (nullable = true)



In [11]:
null_counts = df.select([F.count(F.when(F.col(c).isNull(),c)).alias(c) for c in df.columns])
null_counts.show()

+-----------+----------+---------+-------------+-------------+------+--------+----------------+
|activity_id|student_id|course_id|activity_type|activity_time|device|location|duration_minutes|
+-----------+----------+---------+-------------+-------------+------+--------+----------------+
|          0|         0|        0|            0|            0|    65|      63|            1677|
+-----------+----------+---------+-------------+-------------+------+--------+----------------+



In [13]:
df_filled = df.fillna({
    "activity_id": 0,
    "student_id": 0,
    "course_id": 0,
    "activity_type": "unknown",
    "device": "unknown",
    "location": "unknown",
    "duration_minutes": 0
})


In [14]:
df_filled = df_filled.withColumn('activity_time' , F.coalesce(F.col("activity_time"), F.current_timestamp()))
null_counts = df_filled.select([F.count(F.when(F.col(c).isNull(),c)).alias(c) for c in df.columns])
null_counts.show()

+-----------+----------+---------+-------------+-------------+------+--------+----------------+
|activity_id|student_id|course_id|activity_type|activity_time|device|location|duration_minutes|
+-----------+----------+---------+-------------+-------------+------+--------+----------------+
|          0|         0|        0|            0|            0|     0|       0|               0|
+-----------+----------+---------+-------------+-------------+------+--------+----------------+



In [15]:
df_filled = df_filled.withColumn('date',F.col('activity_time').cast('date'))

+-----------+----------+---------+---------------+-------------------+-------+---------+----------------+----------+
|activity_id|student_id|course_id|  activity_type|      activity_time| device| location|duration_minutes|      date|
+-----------+----------+---------+---------------+-------------------+-------+---------+----------------+----------+
|    A274850|       S27|      C13|Complete_Course|2025-08-21 08:20:00| Tablet|   Canada|             0.0|2025-08-21|
|    A110285|       S25|       C8|     Forum_Post|2025-08-21 06:09:00| Tablet|   Canada|             0.0|2025-08-21|
|    A490154|       S72|      C15|    Watch_Video|2025-08-26 13:36:00|Desktop|      USA|            33.0|2025-08-26|
|    A618340|      S100|      C18|          Login|2025-08-25 13:38:00| Mobile|       UK|             0.0|2025-08-25|
|    A245486|      S126|      C10|          Login|2025-08-25 10:14:00|Desktop|    India|             0.0|2025-08-25|
|    A847712|        S7|      C17|          Login|2025-08-26 15:

In [18]:
daily_metrics = df_filled.groupBy('date').agg(
    F.count('activity_type').alias('total_activities'),
    F.countDistinct('student_id').alias('unique_students'),
    F.sum('duration_minutes').alias('total_duration'),
    F.expr('mode(device)').alias('device'),
    F.expr('mode(location)').alias('location')
).withColumn('avg_duration_per_activity' , F.col('total_duration')/F.col('total_activities'))
daily_metrics.show()

+----------+----------------+---------------+--------------+-------+--------+-------------------------+
|      date|total_activities|unique_students|total_duration| device|location|avg_duration_per_activity|
+----------+----------------+---------------+--------------+-------+--------+-------------------------+
|2025-08-25|             400|            141|        5530.0| Tablet|   India|                   13.825|
|2025-08-26|             399|            141|        4477.0| Tablet|  Canada|       11.220551378446116|
|2025-08-20|             400|            138|        4219.0|Desktop|      UK|                  10.5475|
|2025-08-22|             400|            141|        4465.0| Tablet|     USA|                  11.1625|
|2025-08-23|             400|            139|        4832.0| Tablet|     USA|                    12.08|
|2025-08-21|             400|            139|        4996.0| Tablet|      UK|                    12.49|
|2025-08-27|               1|              1|           9.0| Tab

In [20]:
course_metrics = df_filled.groupBy('date','course_id').agg(
    F.countDistinct('student_id').alias('unique_students'),
    F.sum('duration_minutes').alias('total_time_spent'),
    F.countDistinct(F.when(F.col("activity_type") == "Complete_Course", "student_id")).alias("completed_students"),
    F.countDistinct("student_id").alias("total_students")
).withColumn("completion_rate", (F.col("completed_students") / F.col("total_students")) * 100).orderBy(F.col('total_time_spent').desc()).drop('completed_students','total_students')
course_metrics.show()

+----------+---------+---------------+----------------+------------------+
|      date|course_id|unique_students|total_time_spent|   completion_rate|
+----------+---------+---------------+----------------+------------------+
|2025-08-25|      C10|             27|           491.0|3.7037037037037033|
|2025-08-25|       C5|             27|           477.0|3.7037037037037033|
|2025-08-22|      C18|             25|           466.0|               4.0|
|2025-08-20|       C5|             24|           454.0| 4.166666666666666|
|2025-08-24|       C4|             22|           449.0| 4.545454545454546|
|2025-08-25|      C17|             25|           447.0|               4.0|
|2025-08-22|      C10|             24|           442.0| 4.166666666666666|
|2025-08-26|      C20|             22|           428.0| 4.545454545454546|
|2025-08-21|       C3|             21|           420.0| 4.761904761904762|
|2025-08-25|      C15|             18|           390.0| 5.555555555555555|
|2025-08-23|      C17|   

student_id 	date 	total_time_spent 	quizzes_attempted 	courses_completed


In [27]:
student_metrics = df_filled.groupBy('student_id','date').agg(
    sum('duration_minutes').alias('total_time_spent'),
    count(F.col('activity_type') == 'Attempt_Quiz').alias('quizzes_attempted'),
    count(F.col('activity_type')== 'Complete_Course').alias('courses_completed')
).orderBy('total_time_spent',ascending=False)
student_metrics.show()

+----------+----------+----------------+-----------------+-----------------+
|student_id|      date|total_time_spent|quizzes_attempted|courses_completed|
+----------+----------+----------------+-----------------+-----------------+
|       S45|2025-08-25|           188.0|                6|                6|
|       S35|2025-08-22|           173.0|                8|                8|
|      S117|2025-08-26|           169.0|                6|                6|
|      S148|2025-08-21|           163.0|                7|                7|
|       S16|2025-08-20|           154.0|               10|               10|
|       S64|2025-08-23|           150.0|                4|                4|
|       S40|2025-08-25|           149.0|                3|                3|
|       S11|2025-08-23|           144.0|                7|                7|
|       S80|2025-08-21|           144.0|                4|                4|
|      S120|2025-08-21|           141.0|                5|                5|