**Week 3 - PySpark for Course Analysis**

**Tools: PySpark**

**Tasks:**

- Load large enrollment and progress data using PySpark
- Join tables to get course-wise progress
- Group by course to count total enrolled and completed students

**Deliverables:**
- PySpark script with joins and group aggregations
- Output showing top completed and dropped-out courses

**Load large enrollment and progress data using PySpark**

In [1]:
from google.colab import files
uploaded = files.upload()

Saving courses_cleaned.csv to courses_cleaned.csv
Saving enrollments_cleaned.csv to enrollments_cleaned.csv
Saving merged_data.csv to merged_data.csv
Saving progress_cleaned.csv to progress_cleaned.csv
Saving students_cleaned.csv to students_cleaned.csv


In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('CourseAnalysis').getOrCreate()


In [3]:
students_df = spark.read.csv('students_cleaned.csv', header=True, inferSchema=True)
courses_df = spark.read.csv('courses_cleaned.csv', header=True, inferSchema=True)
enrollments_df = spark.read.csv('enrollments_cleaned.csv', header=True, inferSchema=True)
progress_df = spark.read.csv('progress_cleaned.csv', header=True, inferSchema=True)

**Join tables to get course-wise progress**

In [4]:
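# Start from enrollments and use left joins so that every enrollment is kept,
# even when it has no matching progress, student, or course row.
enroll_progress_df = enrollments_df.join(progress_df, 'enroll_id', 'left') \
                                   .join(students_df, 'student_id', 'left') \
                                   .join(courses_df, 'course_id', 'left')
# Cache the joined result because the aggregations below reuse it.
enroll_progress_df.cache()

enroll_progress_df.show(5)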

+---------+----------+---------+-----------+---------+-----------+-----------------+-----------+---------+----------------+----+-----------------+--------------------+--------------------+---------------+-------------+--------------+----------+
|course_id|student_id|enroll_id|enroll_date|   status|progress_id|modules_completed|last_update|     name|           email| age|registration_date|         course_name|         description|instructor_name|No_Of_modules|duration_weeks|created_at|
+---------+----------+---------+-----------+---------+-----------+-----------------+-----------+---------+----------------+----+-----------------+--------------------+--------------------+---------------+-------------+--------------+----------+
|      501|       1.0|     1001| 2024-10-07|Completed|       2001|              4.0| 2024-07-21|   Harish|harish@gmail.com|21.0|       2024-01-07|Foundation of Python|Basics of Python ...|       Santhosh|            4|             2|2024-01-06|
|      501|       2.
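
The left joins above keep every enrollment even when no progress row matches; the progress columns are then null for that row. A minimal plain-Python sketch of that behaviour follows. The rows are made up for illustration (only the column names come from the notebook), and it assumes `enroll_id` is unique in the progress table.

```python
# Hypothetical rows; only the column names come from the notebook.
enrollments = [
    {"enroll_id": 1001, "course_id": 501, "status": "Completed"},
    {"enroll_id": 1002, "course_id": 501, "status": "In Progress"},
    {"enroll_id": 1003, "course_id": 502, "status": "Dropped"},
]
progress = [
    {"enroll_id": 1001, "modules_completed": 4},
    {"enroll_id": 1002, "modules_completed": 2},
    # deliberately no progress row for enroll_id 1003
]


def left_join(left, right, key):
    """Keep every left row; fill right-hand columns with None when unmatched.

    Assumes `key` is unique in `right`.
    """
    right_by_key = {row[key]: row for row in right}
    right_cols = {c for row in right for c in row if c != key}
    joined = []
    for row in left:
        match = right_by_key.get(row[key], {})
        joined.append({**row, **{c: match.get(c) for c in right_cols}})
    return joined


joined = left_join(enrollments, progress, "enroll_id")
for row in joined:
    print(row["enroll_id"], row["status"], row["modules_completed"])
```

On the Spark side, comparing `enroll_progress_df.count()` with `enrollments_df.count()` is one quick way to check that the joins neither dropped nor duplicated enrollments.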

**Group by course to count total enrolled and completed students**

In [5]:
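from pyspark.sql.functions import col, when, avg, count

# One row per course. when() without otherwise() yields null for non-matching
# rows and count() skips nulls, so each count(when(...)) is a conditional count.
summary_df = enroll_progress_df.groupBy('course_id', 'course_name') \
    .agg(
        count('*').alias('total_enrollments'),
        count(when(col('status') == 'Completed', True)).alias('completed'),
        count(when(col('status').like('Drop%'), True)).alias('dropped'),
        avg('modules_completed').alias('avg_modules_completed'),
        avg('No_Of_modules').alias('avg_total_modules')
    )

# Completion rate: average modules completed relative to the average number
# of modules in the course, expressed as a percentage.
summary_df = summary_df.withColumn(
    'completion_rate_percent',
    (col('avg_modules_completed') / col('avg_total_modules')) * 100
)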

summary_df.orderBy('total_enrollments', ascending=False).show(truncate=False)


+---------+--------------------+-----------------+---------+-------+---------------------+-----------------+-----------------------+
|course_id|course_name         |total_enrollments|completed|dropped|avg_modules_completed|avg_total_modules|completion_rate_percent|
+---------+--------------------+-----------------+---------+-------+---------------------+-----------------+-----------------------+
|504      |Data Analysis       |3                |1        |1      |2.6666666666666665   |5.0              |53.333333333333336     |
|507      |Databases Intro     |3                |1        |0      |2.0                  |4.0              |50.0                   |
|502      |Advanced Python     |3                |0        |2      |2.0                  |6.0              |33.33333333333333      |
|501      |Foundation of Python|3                |1        |0      |2.3333333333333335   |4.0              |58.333333333333336     |
|505      |Web Dev Basics      |3                |1        |0      |5
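
`completion_rate_percent` is a ratio of two averages rather than an average of per-student percentages. Because `No_Of_modules` is a course attribute, it should be the same on every row of a course, and in that case the two calculations agree up to floating-point rounding. A small worked example in plain Python, using hypothetical module counts for one five-module course:

```python
from statistics import mean

# Hypothetical per-student values for one five-module course.
modules_completed = [1, 2, 5]
total_modules = [5, 5, 5]  # constant within a course

# Ratio of averages, which is what the notebook cell computes.
ratio_of_avgs = mean(modules_completed) / mean(total_modules) * 100

# Average of per-student percentages, for comparison.
avg_of_ratios = mean(
    done / total * 100 for done, total in zip(modules_completed, total_modules)
)

print(round(ratio_of_avgs, 2), round(avg_of_ratios, 2))
```

If the module count ever differed between rows of the same course, the two figures would diverge and the choice between them would have to be made explicitly.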

If the merged CSV from Week 2 (merged_data.csv) is used instead, the other CSV files are not needed: the tables are already joined in it, so the joins do not have to be repeated.

In [6]:

merged_df = spark.read.csv("merged_data.csv", header=True, inferSchema=True)

course_summary_df = merged_df.groupBy("course_id", "course_name").agg(
    count("*").alias("total_enrolled"),
    count(when(col("status") == "Completed", True)).alias("completed"),
    count(when(col("status").rlike("(?i)drop"), True)).alias("dropped"),
    avg("completion_percent").alias("avg_completion_percent")
)

course_summary_df = course_summary_df.orderBy(col("total_enrolled").desc())
course_summary_df.show(truncate=False)

+---------+--------------------+--------------+---------+-------+----------------------+
|course_id|course_name         |total_enrolled|completed|dropped|avg_completion_percent|
+---------+--------------------+--------------+---------+-------+----------------------+
|504      |Data Analysis       |3             |1        |1      |53.333333333333336    |
|507      |Databases Intro     |3             |1        |0      |50.0                  |
|502      |Advanced Python     |3             |0        |2      |33.333333333333336    |
|501      |Foundation of Python|3             |1        |0      |58.333333333333336    |
|505      |Web Dev Basics      |3             |1        |0      |88.88888888888887     |
|509      |Cloud Fundamentals  |2             |0        |1      |50.0                  |
|515      |Mobile Dev          |2             |2        |0      |90.0                  |
|514      |Security 101        |2             |0        |0      |50.0                  |
|506      |Machine Le
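
The two summaries detect drop-outs with different predicates: the earlier cell uses `like('Drop%')` and this one uses `rlike('(?i)drop')`. `like` is a case-sensitive match against the whole value (here, "starts with Drop"), while `rlike` succeeds when the regular expression matches anywhere in the value, in this case ignoring case. A plain-Python approximation of both predicates, tried on hypothetical status strings that may not all occur in the real data:

```python
import re


def like_drop_prefix(status):
    """Approximates col('status').like('Drop%'): case-sensitive prefix match."""
    return status.startswith("Drop")


def rlike_drop_anywhere(status):
    """Approximates col('status').rlike('(?i)drop'): case-insensitive search."""
    return re.search(r"(?i)drop", status) is not None


# Hypothetical status values, chosen to show where the predicates differ.
for status in ["Dropped", "dropped", "Student Dropped Out", "Completed"]:
    print(
        f"{status!r}: like={like_drop_prefix(status)}, "
        f"rlike={rlike_drop_anywhere(status)}"
    )
```

For the rows visible in both output tables the `dropped` counts agree, which suggests the status values in this data start with a capitalised "Drop"; the regular-expression form is the more forgiving of the two if the spelling varies.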

In [7]:
summary_df.orderBy('completed', ascending=False).show()
summary_df.orderBy('dropped', ascending=False).show()

+---------+--------------------+-----------------+---------+-------+---------------------+-----------------+-----------------------+
|course_id|         course_name|total_enrollments|completed|dropped|avg_modules_completed|avg_total_modules|completion_rate_percent|
+---------+--------------------+-----------------+---------+-------+---------------------+-----------------+-----------------------+
|      515|          Mobile Dev|                2|        2|      0|                  4.5|              5.0|                   90.0|
|      513|              DevOps|                2|        2|      0|                  4.0|              6.0|      66.66666666666666|
|      512|   Networking Basics|                2|        2|      0|                  4.0|              5.0|                   80.0|
|      504|       Data Analysis|                3|        1|      1|   2.6666666666666665|              5.0|     53.333333333333336|
|      507|     Databases Intro|                3|        1|      0| 

Saving the course summary to a CSV file

In [8]:
summary_df.toPandas().to_csv('course_summary_week3.csv', index=False)

Downloading the course summary file

In [9]:
files.download('course_summary_week3.csv')
