# Online Course Enrollment and Progress Tracker

## Week 2 - Data Preprocessing in Python

**Tools**: Python (pandas, NumPy)
**Tasks**:
- Load enrollment and progress data from CSV or a mock API
- Clean invalid entries (missing names, dates, percentages)
- Use NumPy to calculate average progress
- Use pandas to group by course and compute average completion rates

**Deliverables**:
- Cleaned data showing student progress
- Python report of course-level averages
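
The notebook below loads both tables from CSV. For the mock API option in the task list, a minimal sketch might look like the following, assuming the API returns a list of JSON-like records. The function `fetch_mock_progress` and its sample values are hypothetical and only illustrate the shape of the data.

```python
import pandas as pd

def fetch_mock_progress():
    """Hypothetical stand-in for an API call; returns JSON-like records."""
    return [
        {"progress_id": 1, "student_id": 10, "course_id": 1, "completion_percentage": 80.0},
        {"progress_id": 2, "student_id": 11, "course_id": 1, "completion_percentage": None},
        {"progress_id": 3, "student_id": None, "course_id": 2, "completion_percentage": 55.0},
    ]

# A list of dicts converts directly to a DataFrame; None becomes NaN in numeric columns
mock_progress_df = pd.DataFrame(fetch_mock_progress())
print(mock_progress_df.isnull().sum())
```

From this point the frame can go through the same cleaning steps as the CSV data.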

In [2]:
from google.colab import drive
drive.mount('/content/mydrive')

Mounted at /content/mydrive


**1. DATA EXTRACTION**

In [3]:
import pandas as pd
import numpy as np

enrollment_path = (
    "/content/mydrive/MyDrive/Hexware_Training_DataEngineering/Project/"
    "Online_Course_Enrollment_And_Progress_Tracker/Week 2/Enrollment.csv"
)

progress_path = (
    "/content/mydrive/MyDrive/Hexware_Training_DataEngineering/Project/"
    "Online_Course_Enrollment_And_Progress_Tracker/Week 2/Progress.csv"
)

enrollment_df = pd.read_csv(enrollment_path)
progress_df = pd.read_csv(progress_path)

In [None]:
print("Enrollment data:\n", enrollment_df)
print("\nProgress data:\n", progress_df)

Enrollment data:
     enrollment_id  student_id  course_id enrollment_date
0             302       101.0      202.0      2025-07-02
1             303       102.0      201.0      2025-07-03
2             304       103.0      203.0      2025-07-04
3             305       104.0      204.0      2025-07-05
4             306       101.0      201.0      2025-05-10
5             307       104.0      202.0      2025-07-10
6             308       105.0      203.0      2025-07-11
7             309       101.0      203.0      2025-07-12
8             310       103.0      201.0      2025-07-13
9             311         NaN      203.0      2024-05-01
10            312       104.0        NaN      2024-06-15
11            313       105.0      201.0             NaN
12            314       104.0      202.0    invalid-date
13            315       106.0      204.0      2024-07-01

Progress data:
     progress_id  student_id  course_id  completion_percentage last_updated
0           401       101.0        

In [7]:
print("Enrollment Info:\n")
enrollment_df.info()
print("\nProgress Info:\n")
progress_df.info()

Enrollment Info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   enrollment_id    14 non-null     int64  
 1   student_id       13 non-null     float64
 2   course_id        13 non-null     float64
 3   enrollment_date  13 non-null     object 
dtypes: float64(2), int64(1), object(1)
memory usage: 580.0+ bytes

Progress Info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 5 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   progress_id            15 non-null     int64  
 1   student_id             14 non-null     float64
 2   course_id              15 non-null     int64  
 3   completion_percentage  14 non-null     float64
 4   last_updated           14 non-null     object 
dtypes: float64(2), int64(2), object(1)
memory usage: 

**2. DATA CLEANING**

In [4]:
# Data Cleaning
# Checking for missing Values in both tables

print("Missing values in Enrollment:\n", enrollment_df.isnull().sum())
print("\nMissing values in Progress:\n", progress_df.isnull().sum())

Missing values in Enrollment:
 enrollment_id      0
student_id         1
course_id          1
enrollment_date    1
dtype: int64

Missing values in Progress:
 progress_id              0
student_id               1
course_id                0
completion_percentage    1
last_updated             1
dtype: int64
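
`isnull().sum()` counts only true missing values, so an entry such as `invalid-date` in the enrollment table is not included in these totals. A minimal sketch of how such entries could be counted separately is shown here; the sample values are illustrative and the variable names are assumptions, not part of the notebook.

```python
import pandas as pd

# Illustrative values only; "invalid-date" mirrors the kind of entry seen in Enrollment.csv
dates = pd.Series(["2025-07-02", "invalid-date", None, "2024-07-01"])

parsed = pd.to_datetime(dates, errors="coerce")  # unparseable strings become NaT
missing = dates.isna()                           # truly absent values
invalid = dates.notna() & parsed.isna()          # present, but not a valid date

print("missing:", int(missing.sum()))
print("present but unparseable:", int(invalid.sum()))
```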


In [9]:
# Dropping rows with missing IDs
enrollment_cleaned = enrollment_df.dropna(subset=['student_id', 'course_id'])
print("Successfully Enrollment missing values dropped")
progress_cleaned = progress_df.dropna(subset=['student_id'])
print("\nSuccessfully Progress missing values dropped")


Successfully Enrollment missing values dropped

Successfully Progress missing values dropped


In [13]:
# Drop rows with missing IDs

enrollment_cleaned = enrollment_df.dropna(subset=['student_id', 'course_id']).copy()
progress_cleaned = progress_df.dropna(subset=['student_id']).copy()

print("Successfully Enrollment & Progress missing values dropped")

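# Converting date columns
# Note: assigning through .loc[:, col] writes into the existing object-dtype
# column, so the parsed values keep dtype object (see the final dtype check);
# plain assignment, df[col] = pd.to_datetime(...), would give a datetime64 column.
enrollment_cleaned.loc[:, 'enrollment_date'] = pd.to_datetime(
    enrollment_cleaned['enrollment_date'], errors='coerce'
)

progress_cleaned.loc[:, 'last_updated'] = pd.to_datetime(
    progress_cleaned['last_updated'], errors='coerce'
)

# Drop rows with failed date conversion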
enrollment_cleaned = enrollment_cleaned.dropna(subset=['enrollment_date'])
progress_cleaned = progress_cleaned.dropna(subset=['last_updated'])

print("\nSuccessfully cleaned invalid date entries")


Successfully Enrollment & Progress missing values dropped

Successfully cleaned invalid date entries
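
The final dtype check later in this notebook reports the date columns as `object`. If a true datetime dtype is wanted, one option is plain column assignment instead of `.loc[:, ...]`. The sketch below uses a small illustrative frame, not the project's data, and is one possible approach rather than the notebook's method.

```python
import pandas as pd

# Illustrative frame only
df = pd.DataFrame({"enrollment_date": ["2025-07-02", "invalid-date", "2024-07-01"]})

# Plain column assignment replaces the whole column, so the datetime dtype is kept
df["enrollment_date"] = pd.to_datetime(df["enrollment_date"], errors="coerce")

# Rows whose date could not be parsed now hold NaT and can be dropped
df = df.dropna(subset=["enrollment_date"])

print(df.dtypes)
print(pd.api.types.is_datetime64_any_dtype(df["enrollment_date"]))
```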


In [17]:
print("Enrollment data:\n", enrollment_cleaned)
print("\nProgress data:\n", progress_cleaned)

print("\nMissing values in Enrollment:\n", enrollment_cleaned.isnull().sum())
print("\nMissing values in Progress:\n", progress_cleaned.isnull().sum())

Enrollment data:
     enrollment_id  student_id  course_id      enrollment_date
0             302       101.0      202.0  2025-07-02 00:00:00
1             303       102.0      201.0  2025-07-03 00:00:00
2             304       103.0      203.0  2025-07-04 00:00:00
3             305       104.0      204.0  2025-07-05 00:00:00
4             306       101.0      201.0  2025-05-10 00:00:00
5             307       104.0      202.0  2025-07-10 00:00:00
6             308       105.0      203.0  2025-07-11 00:00:00
7             309       101.0      203.0  2025-07-12 00:00:00
8             310       103.0      201.0  2025-07-13 00:00:00
13            315       106.0      204.0  2024-07-01 00:00:00

Progress data:
     progress_id  student_id  course_id  completion_percentage  \
0           401       101.0        201                   85.0   
1           402       101.0        202                   40.0   
2           403       102.0        201                   60.0   
3           404       1

In [18]:
# The only remaining missing value is in completion_percentage
progress_cleaned = progress_cleaned.dropna(subset=['completion_percentage'])

**3. DATA TRANSFORMATION**

In [26]:
# Converting data types
# Enrollment table
enrollment_cleaned['student_id'] = enrollment_cleaned['student_id'].astype('Int64')
enrollment_cleaned['course_id'] = enrollment_cleaned['course_id'].astype('Int64')
print("Type changed in Enrollment Data")

# Progress table
progress_cleaned['student_id'] = progress_cleaned['student_id'].astype('Int64')
progress_cleaned['course_id'] = progress_cleaned['course_id'].astype('Int64')

# Coerce percentages to numeric (non-numeric values become 0), then limit to 0-100
progress_cleaned['completion_percentage'] = pd.to_numeric(
    progress_cleaned['completion_percentage'], errors='coerce'
).fillna(0)
progress_cleaned['completion_percentage'] = np.clip(progress_cleaned['completion_percentage'], 0, 100)
print("Progress table percentage type changed")


Type changed in Enrollment Data
Progress table percentage type changed
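
Clipping silently turns out-of-range percentages into 0 or 100. A minimal sketch of counting such values before clipping, so they can be reviewed, is shown below. The sample values are illustrative only and are not taken from Progress.csv.

```python
import pandas as pd

# Illustrative values only, not taken from Progress.csv
raw = pd.Series(["85", "abc", "120", "-5", "40.5"])

pct = pd.to_numeric(raw, errors="coerce")   # "abc" becomes NaN
out_of_range = (pct < 0) | (pct > 100)      # comparisons with NaN evaluate to False
clipped = pct.clip(lower=0, upper=100)      # NaN stays NaN after clipping

print("non-numeric:", int(pct.isna().sum()))
print("out of range:", int(out_of_range.sum()))
print(clipped.tolist())
```

Whether out-of-range values should be clipped, dropped, or corrected at the source is a judgment call that depends on how the data was collected.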


**Final Check**

1. Checking for missing values in both tables
2. Checking the data type of each column
3. Showing the datasets

In [28]:
# 1
print("Missing values in Cleaned Enrollment:\n", enrollment_cleaned.isnull().sum())
print("\nMissing values in Cleaned Progress:\n", progress_cleaned.isnull().sum())

# 2
print("\nCleaned Enrollment data types:\n", enrollment_cleaned.dtypes)
print("\nCleaned Progress data types:\n", progress_cleaned.dtypes)

# 3
print("\nCleaned Enrollment data:\n", enrollment_cleaned)
print("\nCleaned Progress data:\n", progress_cleaned)

Missing values in Cleaned Enrollment:
 enrollment_id      0
student_id         0
course_id          0
enrollment_date    0
dtype: int64

Missing values in Cleaned Progress:
 progress_id              0
student_id               0
course_id                0
completion_percentage    0
last_updated             0
dtype: int64

Cleaned Enrollment data types:
 enrollment_id       int64
student_id          Int64
course_id           Int64
enrollment_date    object
dtype: object

Cleaned Progress data types:
 progress_id                int64
student_id                 Int64
course_id                  Int64
completion_percentage    float64
last_updated              object
dtype: object

Cleaned Enrollment data:
     enrollment_id  student_id  course_id      enrollment_date
0             302         101        202  2025-07-02 00:00:00
1             303         102        201  2025-07-03 00:00:00
2             304         103        203  2025-07-04 00:00:00
3             305         104        204  

**Average Progress**

In [29]:
avg_total = np.mean(progress_cleaned['completion_percentage'])
print(f"Overall Average Progress: {round(avg_total, 2)}%")

Overall Average Progress: 63.75%
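
The average above is computed after rows with a missing `completion_percentage` were dropped. The sketch below, on an illustrative NumPy array rather than the project's data, shows why that order matters: `np.mean` on a plain array propagates NaN, while `np.nanmean` skips it.

```python
import numpy as np

# Illustrative values only
values = np.array([85.0, 40.0, np.nan, 60.0])

print(np.mean(values))     # nan: a single NaN propagates through the mean
print(np.nanmean(values))  # mean of the three non-NaN values
```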


**Group by Course**

In [30]:
course_avg = progress_cleaned.groupby('course_id')['completion_percentage'].mean().reset_index()
course_avg.rename(columns={'completion_percentage': 'avg_completion_percentage'}, inplace=True)
print("\nCourse-level Averages:")
print(course_avg)



Course-level Averages:
   course_id  avg_completion_percentage
0        201                       72.5
1        202                       42.5
2        203                       58.0
3        204                      100.0
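
An average alone does not show how many records it rests on. As a sketch, named aggregation can report the record count next to the mean. The frame below is illustrative, and the column name `records` is an assumption introduced for this example.

```python
import pandas as pd

# Illustrative rows only, not the project's data
sample = pd.DataFrame({
    "course_id": [1, 1, 2, 2, 2],
    "completion_percentage": [80.0, 60.0, 100.0, 50.0, 30.0],
})

# Named aggregation: each keyword becomes an output column
summary = (
    sample.groupby("course_id")["completion_percentage"]
    .agg(avg_completion_percentage="mean", records="count")
    .reset_index()
)
print(summary)
```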


**4. LOADING DATA**

Save the cleaned datasets and the course-level average progress to Google Drive.

In [31]:
# Saving Cleaned data
clean_enrollment_path = (
    "/content/mydrive/MyDrive/Hexware_Training_DataEngineering/Project/"
    "Online_Course_Enrollment_And_Progress_Tracker/Week 2/Cleaned_Enrollment.csv"
)

clean_progress_path = (
    "/content/mydrive/MyDrive/Hexware_Training_DataEngineering/Project/"
    "Online_Course_Enrollment_And_Progress_Tracker/Week 2/Cleaned_Progress.csv"
)

course_avg_path = (
    "/content/mydrive/MyDrive/Hexware_Training_DataEngineering/Project/"
    "Online_Course_Enrollment_And_Progress_Tracker/Week 2/Course_Level_Average_Progress.csv"
)
enrollment_cleaned.to_csv(clean_enrollment_path, index=False)
progress_cleaned.to_csv(clean_progress_path, index=False)
course_avg.to_csv(course_avg_path, index=False)
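
After saving, a quick round-trip check can confirm that a file reads back as written. The sketch below uses a temporary directory and an illustrative frame standing in for `course_avg`; the file name `demo_course_avg.csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Illustrative frame standing in for course_avg
demo = pd.DataFrame({"course_id": [1, 2], "avg_completion_percentage": [70.0, 60.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "demo_course_avg.csv")  # hypothetical file name
    demo.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# Compare the reloaded frame with the original
print(reloaded.equals(demo))
```

Note that CSV does not store dtypes, so nullable `Int64` or datetime columns would need `dtype` or `parse_dates` arguments in `pd.read_csv` to be restored on reload.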