Week 2 - Data Preprocessing in Python

Tools: Python (Pandas, Numpy) Tasks:

1. Load enrollment and progress data from CSV or mock API

2. Clean invalid entries (missing names, dates, percentages)

3. Use numpy to calculate average progress

4. Use pandas to group by course and compute average completion rates

**Deliverables:**

Cleaned data showing student progress

Python report of course-level averages

1. Load enrollment and progress data from CSV or mock API

In [1]:
from google.colab import files
uploaded = files.upload()

Saving courses.csv to courses.csv
Saving enrollments.csv to enrollments.csv
Saving progress.csv to progress.csv
Saving students.csv to students.csv


In [2]:
import pandas as pd
import numpy as np

students = pd.read_csv("students.csv")
courses = pd.read_csv("courses.csv")
enrollments = pd.read_csv("enrollments.csv")
progress = pd.read_csv("progress.csv")

In [3]:
print("Students:\n", students.head())
print("\nCourses:\n", courses.head())
print("\nEnrollments:\n", enrollments.head())
print("\nProgress:\n", progress.head())

Students:
    student_id       name             email   age registration_date
0           1     Harish  harish@gmail.com  21.0        01-07-2024
1           2  saravanan   saran@gmail.com  28.0        02-07-2024
2           3      Valar   valar@gmail.com  30.0        03-07-2024
3           4       Mary    mary@gmail.com  27.0        04-07-2024
4           5      Leena   leena@gmail.com  22.0        05-07-2024

Courses:
    course_id           course_name                      description  \
0        501  Foundation of Python    Basics of Python - data types   
1        502       Advanced Python                        Functions   
2        503           Java Basics  Core Java Syntax and Structures   
3        504         Data Analysis          Intro to Data Analytics   
4        505        Web Dev Basics                      HTML CSS JS   

  instructor_name  No_Of_modules  duration_weeks  created_at  
0        Santhosh              4               2  01-06-2024  
1         Vasanth      

**2. Clean invalid entries (missing names, dates, percentages)**

Cleaning each tables

In [4]:
# cleaning the students
students['student_id'] = pd.to_numeric(students['student_id'], errors='coerce')
students['age'] = pd.to_numeric(students['age'], errors='coerce').fillna(students['age'].mean())
students['name']=students['name'].fillna('Unknown')
students['email']=students['email'].fillna('Unknown')
students['registration_date']=pd.to_datetime(students['registration_date'], errors='coerce').fillna('2024-01-01')

# Cleaning the courses
courses['course_id'] = pd.to_numeric(courses['course_id'], errors='coerce')
courses['instructor_name'] = courses['instructor_name'].fillna('Unknown')
courses['description'] = courses['description'].fillna('No description available')
courses['duration_weeks'] = courses['duration_weeks'].fillna(2)
courses['No_Of_modules'] = courses['No_Of_modules'].fillna(4)
courses['course_name'] = courses['course_name'].fillna('Unknown')
courses['created_at'] = pd.to_datetime(courses['created_at'], errors='coerce').fillna('2024-01-01')
courses = courses.loc[:, ~courses.columns.str.contains('^Unnamed')]

# cleaning the enrollments
enrollments['enroll_id'] = pd.to_numeric(enrollments['enroll_id'], errors='coerce')
enrollments['course_id'] = pd.to_numeric(enrollments['course_id'], errors='coerce')
enrollments['student_id'] = pd.to_numeric(enrollments['student_id'], errors='coerce')
enrollments['enroll_date'] = pd.to_datetime(enrollments['enroll_date'], errors='coerce').fillna('2024-01-01')
enrollments['status'] = enrollments['status'].fillna('Active')
enrollments= enrollments.dropna(subset=['student_id','course_id','enroll_id']).reset_index(drop=True)

# cleaning the progress
progress['enroll_id'] = pd.to_numeric(progress['enroll_id'], errors='coerce')
progress['progress_id'] = pd.to_numeric(progress['progress_id'], errors='coerce')
progress['last_update'] = pd.to_datetime(progress['last_update'], errors='coerce').fillna('2024-01-01')
progress['modules_completed'] = pd.to_numeric(progress['modules_completed'], errors='coerce').fillna(0)
progress=progress.dropna(subset=['enroll_id']).reset_index(drop=True)


  progress['last_update'] = pd.to_datetime(progress['last_update'], errors='coerce').fillna('2024-01-01')


Checking the null values

In [5]:
print("Course\n",courses.isnull().sum())
print("\nEnrollments\n",enrollments.isnull().sum())
print("\nProgress\n",progress.isnull().sum())
print("\nStudents\n",students.isnull().sum())

Course
 course_id          0
course_name        0
description        0
instructor_name    0
No_Of_modules      0
duration_weeks     0
created_at         0
dtype: int64

Enrollments
 enroll_id      0
student_id     0
course_id      0
enroll_date    0
status         0
dtype: int64

Progress
 progress_id          0
enroll_id            0
modules_completed    0
last_update          0
dtype: int64

Students
 student_id           0
name                 0
email                0
age                  0
registration_date    0
dtype: int64


Merging the all dataframe to get all details in single dataframe

In [6]:
# Merge enrollments with students
data = enrollments.merge(students, on='student_id', how='left')

# Merge with progress
data = data.merge(progress, on='enroll_id', how='left')

# Merge with courses
data = data.merge(courses, on='course_id', how='left')

**Again cleaning the data to avoid null values occur during merging**

Droping the null values in the key elements (cloumns)

In [7]:
data = data.dropna(subset=['enroll_id', 'course_id', 'student_id', 'progress_id']).reset_index(drop=True)

Changing the date columns to date data type and filling the default date as **'2024-01-01'**

In [8]:
data['enroll_date']=pd.to_datetime(data['enroll_date'], errors='coerce').fillna('2024-01-01')
data['last_update']=pd.to_datetime(data['last_update'], errors='coerce').fillna('2024-01-01')
data['registration_date']=pd.to_datetime(data['registration_date'], errors='coerce').fillna('2024-01-01')
data['created_at']=pd.to_datetime(data['created_at'], errors='coerce').fillna('2024-01-01')

Giving the unknown value to null columns in name, email, course_name, instructor_name, description, status (Active)

In [9]:
columns_to_fill = ['name', 'email', 'course_name', 'instructor_name']
data[columns_to_fill] = data[columns_to_fill].fillna('Unknown')
data['description']=data['description'].fillna('No description available')
data['status']=data['status'].fillna('Active')

Filling the null values with default value , mean

In [10]:
data['modules_completed']=data['modules_completed'].fillna(0)
data['age']=data['age'].mean()
data['duration_weeks']=data['duration_weeks'].fillna(2)
data['No_Of_modules']=data['No_Of_modules'].fillna(4)
data['completion_percent'] = (data['modules_completed'] / data['No_Of_modules']) * 100
data['completion_percent'] = data['completion_percent'].clip(0, 100).fillna(0)

Checking Null records

In [11]:
print(data.isnull().sum())

enroll_id             0
student_id            0
course_id             0
enroll_date           0
status                0
name                  0
email                 0
age                   0
registration_date     0
progress_id           0
modules_completed     0
last_update           0
course_name           0
description           0
instructor_name       0
No_Of_modules         0
duration_weeks        0
created_at            0
completion_percent    0
dtype: int64


In [12]:
data.head()

Unnamed: 0,enroll_id,student_id,course_id,enroll_date,status,name,email,age,registration_date,progress_id,modules_completed,last_update,course_name,description,instructor_name,No_Of_modules,duration_weeks,created_at,completion_percent
0,1001,1.0,501,2024-10-07,Completed,Harish,harish@gmail.com,26.706494,2024-01-07,2001.0,4.0,2024-07-21,Foundation of Python,Basics of Python - data types,Santhosh,4,2,2024-01-06,100.0
1,1002,2.0,501,2024-12-07,Active,saravanan,saran@gmail.com,26.706494,2024-02-07,2002.0,2.0,2024-07-14,Foundation of Python,Basics of Python - data types,Santhosh,4,2,2024-01-06,50.0
2,1003,3.0,502,2024-01-01,Dropped,Valar,valar@gmail.com,26.706494,2024-03-07,2003.0,0.0,2024-07-15,Advanced Python,Functions,Vasanth,6,4,2024-05-06,0.0
3,1004,4.0,502,2024-01-01,Active,Mary,mary@gmail.com,26.706494,2024-04-07,2004.0,3.0,2024-07-16,Advanced Python,Functions,Vasanth,6,4,2024-05-06,50.0
4,1006,6.0,503,2024-01-01,Active,Ajay,ajay@gmail.com,26.706494,2024-06-07,2006.0,4.0,2024-07-18,Java Basics,Core Java Syntax and Structures,Praveen,5,3,2024-10-06,80.0


**3. Use NumPy to calculate average progress**

In [13]:
average_progress = np.mean(data['completion_percent'])
print("Average Progress:", average_progress)

Average Progress: 60.2312925170068


4. Group by course and compute average completion rates


In [14]:
avg_progress_per_course = data.groupby('course_name')['completion_percent'].agg(np.mean).reset_index()
avg_progress_per_course

  avg_progress_per_course = data.groupby('course_name')['completion_percent'].agg(np.mean).reset_index()


Unnamed: 0,course_name,completion_percent
0,Advanced Python,33.333333
1,Big Data Hadoop,85.714286
2,Cloud Fundamentals,50.0
3,Data Analysis,53.333333
4,Data Visualization,12.5
5,Databases Intro,50.0
6,DevOps,66.666667
7,Foundation of Python,58.333333
8,Frontend Frameworks,83.333333
9,Java Basics,60.0


Saving all cleaned data

In [15]:
students.to_csv('students_cleaned.csv', index=False)
courses.to_csv('courses_cleaned.csv', index=False)
enrollments.to_csv('enrollments_cleaned.csv', index=False)
progress.to_csv('progress_cleaned.csv', index=False)
data.to_csv('merged_data.csv', index=False)

dowloading all csv

In [16]:
files.download('students_cleaned.csv')
files.download('courses_cleaned.csv')
files.download('enrollments_cleaned.csv')
files.download('progress_cleaned.csv')
files.download('merged_data.csv')

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>