# Student Performance Dataset - Data Cleaning Pipeline

This notebook demonstrates the comprehensive data cleaning process for student survey responses.

## Overview

This pipeline processes raw student survey data exported from Google Forms, applying comprehensive cleaning and standardization procedures. The dataset contains responses from 12,000+ students across various academic departments.


**Primary Tools:** `pandas` for data manipulation, `numpy` for numerical operations

### Key Processing Steps:

- **Data validation and type conversion**
- **Outlier detection and removal**
- **Missing value imputation**
- **Text standardization and normalization**
- **Quality assurance and export**

In [2]:
"""
Data Loading and Initial Setup
==============================
Importing required libraries and loading the raw dataset for processing.
"""

import pandas as pd

import numpy as np 

std = pd.read_csv('../data/raw/forms_responses_12955.csv')
print(f"Dataset loaded successfully: {len(std):,} records")

Dataset loaded successfully: 16,000 records


## Initial Data Exploration

Examining the structure, quality, and characteristics of the raw dataset to inform our cleaning strategy.

In [3]:
# Display random sample to understand data structure

print("Random Sample (10 records):")
std.sample(10)

Random Sample (10 records):


| | Timestamp | Student ID | Age | Gender | Department | GPA | Satisfaction (1-5) | Comments |
|---|---|---|---|---|---|---|---|---|
| 14013 | 09/04/2023 06:02:00 | STUD6791 | 22 | Other | Zoo | 0.21 | 3.0 | NaN |
| 1825 | 09/26/2023 08:23:00 | STUD5035 | 22 | NaN | Geophysics | 1.51 | 5.0 | Comment 1825: The course was great! |
| 1284 | 09/25/2023 06:37:00 | STUD7836 | 22 | Other | Microbiology | 3.59 | 4.0 | This is spam... ignore me |
| 11761 | 09/02/2023 22:32:00 | STUD8387 | 21 | Malee | Chemistry | 3.46 | 1.0 | Comment 11761: The course was great! |
| 7503 | 09/04/2023 09:52:00 | STUD5707 | 23 | Other | Comp Sci | 0.93 | 1.0 | This is spam... ignore me |
| 12768 | 09/19/2023 09:36:00 | STUD3679 | 22 | Other | Microbiology | 2.87 | 3.0 | This is spam... ignore me |
| 6439 | 09/20/2023 11:06:00 | STUD4286 | 18 | Male | Microbio | 3.27 | 2.0 | Comment 6439: The course was great! |
| 9054 | 09/25/2023 08:36:00 | STUD6226 | 19 | Male | Biochemistry | 3.96 | 5.0 | This is spam... ignore me |
| 5852 | 09/03/2023 00:19:00 | STUD4790 | 22 | Male | Biochemistry | 0.74 | 5.0 | Comment 5852: The course was great! |
| 4468 | 09/29/2023 14:42:00 | STUD6119 | 15 | Other | Cell Biology and Genetics | 1.01 | 1.0 | Comment 4468: The course was great! |


In [4]:
# Examine dataset structure and data types

print("Dataset Information:")
std.info()

Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16000 entries, 0 to 15999
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Timestamp           16000 non-null  object 
 1   Student ID          16000 non-null  object 
 2   Age                 14769 non-null  object 
 3   Gender              15016 non-null  object 
 4   Department          15215 non-null  object 
 5   GPA                 14896 non-null  object 
 6   Satisfaction (1-5)  14845 non-null  float64
 7   Comments            11174 non-null  object 
dtypes: float64(1), object(7)
memory usage: 1000.1+ KB


In [None]:
# Statistical summary of numerical columns

print("Statistical Summary:")
std.describe()

| | Satisfaction (1-5) |
|---|---|
| count | 14845.0 |
| mean | 3.103941 |
| std | 1.6607 |
| min | 0.0 |
| 25% | 2.0 |
| 50% | 3.0 |
| 75% | 4.0 |
| max | 7.0 |


In [5]:
# Review column names for cleaning requirements

print("Dataset Columns:")    
for i, col in enumerate(std.columns, 1):
    print(f"{i:2d}. {col}")

Dataset Columns:
 1. Timestamp
 2. Student ID
 3. Age
 4. Gender
 5. Department
 6. GPA
 7. Satisfaction (1-5)
 8. Comments


In [6]:
# Analyze current data types for conversion planning

print("Current Data Types:")
print(f"\nObject columns requiring type conversion: {sum(std.dtypes == 'object')}")
print(std.dtypes)

Current Data Types:

Object columns requiring type conversion: 7
Timestamp              object
Student ID             object
Age                    object
Gender                 object
Department             object
GPA                    object
Satisfaction (1-5)    float64
Comments               object
dtype: object
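
A per-column count of missing values can make the structure summary above easier to act on. The sketch below is illustrative only: `toy` is a small hypothetical frame, not the survey data, and the column names are borrowed from the dataset purely for readability.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the raw survey data
toy = pd.DataFrame({
    "Age": ["22", None, "abc", "19"],
    "GPA": ["3.5", "A", None, None],
    "Satisfaction (1-5)": [3.0, np.nan, 5.0, 1.0],
})

# Count and percentage of missing values per column
missing = pd.DataFrame({
    "missing": toy.isna().sum(),
    "percent": (toy.isna().mean() * 100).round(1),
})
print(missing)
```

Note that `isna()` only catches true nulls; non-numeric strings such as `"abc"` surface as missing only after a conversion with `errors='coerce'`.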


## Data Cleaning Pipeline


### 1. Timestamp Standardization

Converting timestamp data to proper datetime format and removing invalid entries to ensure temporal consistency.


In [7]:
# Convert timestamp column to datetime format with error handling

original_count = len(std)
std['Timestamp'] = pd.to_datetime(std['Timestamp'], errors='coerce')

# Remove records with invalid timestamps
std.dropna(subset=['Timestamp'], inplace=True)
cleaned_count = len(std)

print(f"  Original records: {original_count:,}")
print(f"  Valid timestamps: {cleaned_count:,}")
print(f"  Records removed: {original_count - cleaned_count:,}")

  Original records: 16,000
  Valid timestamps: 15,672
  Records removed: 328


In [8]:
# Re-check the structure after the datetime conversion and dropna
std.info()

<class 'pandas.core.frame.DataFrame'>
Index: 15672 entries, 0 to 15999
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   Timestamp           15672 non-null  datetime64[ns]
 1   Student ID          15672 non-null  object        
 2   Age                 14470 non-null  object        
 3   Gender              14706 non-null  object        
 4   Department          14895 non-null  object        
 5   GPA                 14596 non-null  object        
 6   Satisfaction (1-5)  14539 non-null  float64       
 7   Comments            10927 non-null  object        
dtypes: datetime64[ns](1), float64(1), object(6)
memory usage: 1.1+ MB


✅ **Timestamp Processing Complete** - Successfully converted to datetime64[ns] format with temporal validation.
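
To illustrate how `errors='coerce'` behaves, here is a minimal sketch on hypothetical values. The explicit `format` string is an assumption based on the month/day/year pattern visible in the random sample; the notebook itself lets pandas infer the format.

```python
import pandas as pd

# Hypothetical timestamps: two valid, one malformed, one missing
raw = pd.Series(["09/04/2023 06:02:00", "not a date", None, "09/26/2023 08:23:00"])

# With errors="coerce", anything that does not match the format becomes NaT
# (assumed format: month/day/year, 24-hour time)
parsed = pd.to_datetime(raw, format="%m/%d/%Y %H:%M:%S", errors="coerce")

print(parsed.isna().sum())             # number of entries that failed to parse
print(parsed.dropna().dt.day.tolist()) # day-of-month of the valid entries
```

Passing an explicit format is one way to avoid ambiguous day/month parsing, at the cost of rejecting rows that use any other layout.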

### 2. Student ID Normalization

Standardizing student identification format for consistency and data integrity.

In [9]:
"""
Student ID Standardization:
- Rename column for programmatic access
- Convert to lowercase for consistency  
- Remove leading/trailing whitespace
"""

print("Student ID Processing:")
print(f"  Sample IDs: {list(std['Student ID'].dropna().head(3))}")

# Rename column to remove the space (simplifies programmatic access)
std.rename(columns={'Student ID': 'Student_ID'}, inplace=True)
print(f"  Unique IDs: {std['Student_ID'].nunique():,}")

# Standardize format: lowercase and trimmed
std['Student_ID'] = std['Student_ID'].str.lower().str.strip()

Student ID Processing:
  Sample IDs: ['STUD2979', 'STUD8686', 'STUD4395']
  Unique IDs: 7,397
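
The effect of trimming and lowercasing on uniqueness can be seen on a small example. The IDs below are hypothetical variants made up for illustration; whether the real data contains such variants is not shown in the notebook.

```python
import pandas as pd

# Hypothetical IDs with inconsistent case and stray whitespace
ids = pd.Series(["STUD2979", " stud2979", "STUD8686 ", "STUD4395"])
raw_unique = ids.nunique()

# Same normalization as in the cell above: trim, then lowercase
normalized = ids.str.strip().str.lower()
clean_unique = normalized.nunique()

# Flag every row whose normalized ID appears more than once
repeated = normalized.duplicated(keep=False)

print(raw_unique, clean_unique)
print(normalized[repeated].tolist())
```

A check like this may help decide whether repeated IDs are genuine resubmissions or just formatting variants of the same ID.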


### 3. Age Data Validation

Converting age to numeric format, removing outliers, and ensuring realistic age ranges for academic participants.

In [10]:
"""
Age Data Processing:
- Convert to numeric format
- Apply realistic age constraints (18-60 years)
- Remove invalid/missing entries
- Ensure integer type for age values
"""

# Convert to numeric, handling non-numeric entries
original_count = len(std)
std['Age'] = pd.to_numeric(std['Age'], errors='coerce')

# Remove records with missing or non-numeric age data
std.dropna(subset=['Age'], inplace=True)
after_numeric = len(std)

print("Age Processing Statistics:")
print(f"  Records with valid ages: {after_numeric:,}")
print(f"  Age range before filtering: {std['Age'].min():.0f} - {std['Age'].max():.0f} years")

# Convert to integer type (ages are whole numbers)
std['Age'] = std['Age'].astype(int)

# Apply realistic age constraints for students (18-60 years);
# .copy() avoids chained-assignment warnings in later cells
std = std[(std['Age'] >= 18) & (std['Age'] <= 60)].copy()
final_count = len(std)
print(f"  Records after age filtering: {final_count:,}")

Age Processing Statistics:
  Records with valid ages: 14,090
  Age range before filtering: 15 - 100 years
  Records after age filtering: 13,376
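
The coerce-then-filter pattern used for ages can be sketched on a handful of hypothetical values. `Series.between` is used here as a compact alternative to the two comparisons in the cell; it is inclusive on both ends by default.

```python
import pandas as pd

# Hypothetical raw ages: numeric strings, junk text, and a missing value
ages = pd.Series(["22", "15", "abc", None, "100", "18", "60"])

# Non-numeric and missing entries become NaN
numeric = pd.to_numeric(ages, errors="coerce")

# Keep ages within 18-60 inclusive; NaN never satisfies between()
valid = numeric[numeric.between(18, 60)].astype(int)

print(valid.tolist())
```

Because NaN fails the range test, the filter drops missing ages as well, so a separate `dropna` is only needed when the intermediate count is of interest.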


### 4. Gender Data Standardization

Correcting typographical errors and standardizing gender categories with consistent formatting.

In [12]:
"""
Gender Data Standardization:
- Correct common typographical errors
- Standardize missing values to 'Other'
- Apply consistent title case formatting
- Convert to string data type
"""

# Display original unique values for reference
print("Original Gender Values:")
print(std['Gender'].value_counts(dropna=False))

# Standardize gender categories and correct typos
gender_mapping = {
    'Femal': 'Female',
    'Malee': 'Male',
    'Othr': 'Other'
}

std['Gender'] = std['Gender'].replace(gender_mapping)
# Missing values are grouped under 'Other'
std['Gender'] = std['Gender'].fillna('Other')

# Apply consistent formatting and data type
std['Gender'] = std['Gender'].str.title().astype('string')
print(f"\nData type: {std['Gender'].dtype}")

print("\nStandardized Gender Distribution:")
print(std['Gender'].value_counts())

Original Gender Values:
Gender
Other     4035
Male      4003
Female    3997
NaN        839
Othr       179
Malee      167
Femal      156
Name: count, dtype: int64

Data type: string

Standardized Gender Distribution:
Gender
Other     5053
Male      4170
Female    4153
Name: count, dtype: Int64
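
If the raw values also varied in case or whitespace, normalizing before applying the typo mapping would let one mapping cover all variants. The sketch below assumes such variants exist, which the value counts above do not show; the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical raw values with case and whitespace variants
gender = pd.Series(["Male", "malee ", "FEMAL", None, "Othr", "Female"])

typo_map = {"Femal": "Female", "Malee": "Male", "Othr": "Other"}

cleaned = (
    gender.str.strip()      # drop stray whitespace
    .str.title()            # normalize case so the mapping keys match
    .replace(typo_map)      # correct known typos
    .fillna("Other")        # group missing values under 'Other'
    .astype("string")
)

print(cleaned.value_counts())
```

Folding missing values into 'Other' follows the notebook's choice; keeping a separate 'Unknown' category would be an alternative if the distinction matters for analysis.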


### 5. Department Name Standardization

Expanding abbreviated department names to full, standardized formats for better data clarity.

In [13]:
"""
Department Standardization:
- Expand common abbreviations to full names
- Handle missing values with 'Undeclared' category
- Ensure consistent naming conventions
"""

# Department name standardization mapping
department_mapping = {
    np.nan: 'Undeclared',
    'Marine Sci': 'Marine Sciences',
    'Geo': 'Geosciences', 
    'Biochem': 'Biochemistry',
    'Maths': 'Mathematics',
    'Phys': 'Physics',
    'Bio': 'Biology',
    'Cell Bio': 'Cell Biology and Genetics',
    'Chem': 'Chemistry',
    'Geophy': 'Geophysics',
    'Zoo': 'Zoology', 
    'Microbio': 'Microbiology',
    'Comp Sci': 'Computer Science'
}

# Apply standardization
std['Department'] = std['Department'].replace(department_mapping)
std['Department'] = std['Department'].fillna('Undeclared').astype('string')

print("Department Distribution:")

print(std['Department'].value_counts())
print(f"\nTotal departments: {std['Department'].nunique()}")

Department Distribution:
Department
Zoology                      1093
Geophysics                   1090
Computer Science             1088
Mathematics                  1077
Microbiology                 1053
Biochemistry                 1050
Chemistry                    1040
Cell Biology and Genetics    1040
Physics                      1038
Marine Sciences              1036
Botany                        975
Geology                       921
Undeclared                    666
Geosciences                   108
Biology                       101
Name: count, dtype: Int64

Total departments: 15
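
After applying a mapping like this, it can be useful to check which resulting labels fall outside an expected list. In the sketch below, both the sample values and the `expected` set are hypothetical; the notebook does not define an official department list.

```python
import pandas as pd

# Hypothetical department values, including abbreviations and a missing entry
dept = pd.Series(["Zoo", "Zoology", "Comp Sci", None, "Geo", "Physics"])

mapping = {"Zoo": "Zoology", "Comp Sci": "Computer Science", "Geo": "Geosciences"}

# Hypothetical canonical list, used only to demonstrate the check
expected = {"Zoology", "Computer Science", "Physics", "Geology", "Undeclared"}

standardized = dept.replace(mapping).fillna("Undeclared")

# Labels produced by the mapping that are not in the canonical list
unexpected = sorted(set(standardized) - expected)
print(unexpected)
```

Any label reported here is a candidate for review: either the mapping target or the canonical list may need adjusting.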


### 6. GPA Conversion and Validation

Converting letter grades to numeric values and applying data imputation for missing GPA entries.

In [14]:
"""
GPA Processing Pipeline:
- Convert letter grades to numeric equivalents
- Apply linear interpolation for missing values
- Remove outliers below minimum threshold
- Round to two decimal places for consistency
"""

# Letter grade to numeric GPA conversion (5.0 scale)
grade_mapping = {
    'A': 4.5, 'B': 3.5, 'C': 3.0, 'D': 2.5, 'F': 1.5
}

print("GPA Processing:")
print(f"  Unique values before conversion: {sorted(std['GPA'].dropna().unique())}")

# Apply grade mapping and convert to numeric
std['GPA'] = std['GPA'].replace(grade_mapping)
std['GPA'] = pd.to_numeric(std['GPA'], errors='coerce')

# Handle missing values with linear interpolation, then round
missing_before = std['GPA'].isna().sum()
std['GPA'] = std['GPA'].interpolate(method='linear').round(2)
print(f"  Missing values imputed: {missing_before}")

# Remove outliers below the minimum threshold
before_filter = len(std)
std = std[std['GPA'] >= 1.0].copy()
after_filter = len(std)
print(f"  Records removed (GPA < 1.0): {before_filter - after_filter}")

print(f"  Final GPA range: {std['GPA'].min()} - {std['GPA'].max()}")
print(f"  Mean GPA: {std['GPA'].mean():.2f}")

GPA Processing:
  Unique values before conversion: ['-1.2', '0.0', '0.01', '0.02', '0.03', '0.04', '0.05', '0.06', '0.07', '0.08', '0.09', '0.1', '0.11', '0.12', '0.13', '0.14', '0.15', '0.16', '0.17', '0.18', '0.19', '0.2', '0.21', '0.22', '0.23', '0.24', '0.25', '0.26', '0.27', '0.28', '0.29', '0.3', '0.31', '0.32', '0.33', '0.34', '0.35', '0.36', '0.37', '0.38', '0.39', '0.4', '0.41', '0.42', '0.43', '0.44', '0.45', '0.46', '0.47', '0.48', '0.49', '0.5', '0.51', '0.52', '0.53', '0.54', '0.55', '0.56', '0.57', '0.58', '0.59', '0.6', '0.61', '0.62', '0.63', '0.64', '0.65', '0.66', '0.67', '0.68', '0.69', '0.7', '0.71', '0.72', '0.73', '0.74', '0.75', '0.76', '0.77', '0.78', '0.79', '0.8', '0.81', '0.82', '0.83', '0.84', '0.85', '0.86', '0.87', '0.88', '0.89', '0.9', '0.91', '0.92', '0.93', '0.94', '0.95', '0.96', '0.97', '0.98', '0.99', '1.0', '1.01', '1.02', '1.03', '1.04', '1.05', '1.06', '1.07', '1.08', '1.09', '1.1', '1.11', '1.12', '1.13', '1.14', '1.15', '1.16', '1.17', '1.18', 
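
To make the imputation step concrete, the following sketch applies linear interpolation to a short hypothetical series. It uses the same `interpolate(method='linear')` call as the cell, with pandas defaults for everything else.

```python
import numpy as np
import pandas as pd

# Hypothetical GPA values with gaps at the start, middle, and end
gpa = pd.Series([np.nan, 2.0, np.nan, np.nan, 3.5, np.nan])

# Interior gaps are filled on a straight line between the neighbouring values
filled = gpa.interpolate(method="linear")

print(filled.tolist())
```

With the default settings a leading gap stays NaN, since there is no earlier value to interpolate from. Because values are filled from neighbouring rows, the result also depends on row order, so students who happen to be adjacent in the file influence each other's imputed GPA. A group-wise mean or median would be one alternative if that is a concern.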

### 7. Satisfaction Score Processing

Applying data imputation and validation to satisfaction ratings on a 1-5 scale.

In [None]:
"""
Satisfaction Score Validation:
- Apply linear interpolation for missing values
- Ensure scores fall within valid range (1-5)
- Round to two decimal places
"""

print("Satisfaction Score Processing:")
print(f"  Mean satisfaction: {std['Satisfaction (1-5)'].mean():.2f}")
print(f"  Score range: {std['Satisfaction (1-5)'].min()} - {std['Satisfaction (1-5)'].max()}")

# Fill missing values using linear interpolation
missing_satisfaction = std['Satisfaction (1-5)'].isna().sum()
std['Satisfaction (1-5)'] = std['Satisfaction (1-5)'].interpolate(method='linear')
print(f"  Missing values imputed: {missing_satisfaction}")

# Round to two decimal places
std['Satisfaction (1-5)'] = std['Satisfaction (1-5)'].round(2)

# Remove invalid scores (below minimum threshold).
# Note: only the lower bound is enforced here; scores above 5 are not filtered.
before_filter = len(std)
std = std[std['Satisfaction (1-5)'] >= 1.0]
after_filter = len(std)
print(f"  Records removed (score < 1.0): {before_filter - after_filter}")

<class 'pandas.core.frame.DataFrame'>
Index: 9908 entries, 1 to 15999
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   Timestamp           9908 non-null   datetime64[ns]
 1   Student_ID          9908 non-null   object        
 2   Age                 9908 non-null   int64         
 3   Gender              9908 non-null   string        
 4   Department          9908 non-null   string        
 5   GPA                 9908 non-null   float64       
 6   Satisfaction (1-5)  9908 non-null   float64       
 7   Comments            6894 non-null   object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(2), string(2)
memory usage: 696.7+ KB
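
The difference between a lower-bound filter and a full 1-5 range check can be shown on a few hypothetical scores. This is a sketch of an alternative, not the filter the notebook applies.

```python
import pandas as pd

# Hypothetical scores, including values outside the 1-5 scale
scores = pd.Series([0.0, 1.0, 3.5, 5.0, 7.0])

# Lower bound only, as in the cell above
lower_only = scores[scores >= 1.0]

# Both bounds, inclusive
in_range = scores[scores.between(1.0, 5.0)]

print(lower_only.tolist())
print(in_range.tolist())
```

Since the raw summary statistics report a maximum of 7.0, a two-sided check like `between(1.0, 5.0)` may be worth considering if the scale is meant to be strict.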


### 8. Comment Text Standardization

Normalizing comment format and handling spam detection for consistent text data structure.

In [16]:
"""
Comment Standardization Pipeline:
- Handle missing values with default positive comment
- Detect and filter spam content
- Apply consistent formatting with sequential IDs
- Sort by timestamp for chronological ordering
"""

# Sort by timestamp for consistent comment indexing
std = std.sort_values('Timestamp', ascending=True).reset_index(drop=True)

# Fill missing comments with the default positive comment
std['Comments'] = std['Comments'].fillna("The course was great!")

def standardize_comment(idx, text):
    """
    Standardize comment format and handle spam detection.
    
    Args:
        idx: Sequential comment index
        text: Original comment text
    
    Returns:
        Formatted comment string with consistent structure
    """
    if isinstance(text, str) and text.startswith('This is spam'):
        return f'Comment {idx}: The course was great!'  # Replace spam
    elif isinstance(text, str) and text.startswith('Comment'):
        # Extract existing comment content
        parts = text.split(':', 1)
        content = parts[1].strip() if len(parts) > 1 else 'No comment'
        return f'Comment {idx}: {content}'

    elif isinstance(text, str) and text.strip():
        return f'Comment {idx}: {text.strip()}'  # Format new comment

    else:
        return f'Comment {idx}: No comment'  # Handle empty/invalid

print("Comment Processing Summary:")

# Apply standardization and convert to string type

std['Comments'] = [standardize_comment(i, txt) for i, txt in enumerate(std['Comments'])]
std['Comments'] = std['Comments'].astype('string')

Comment Processing Summary:
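
The branching logic is easier to follow on a few sample inputs. The function below, `format_comment`, is a hypothetical compact variant written for this illustration. It differs slightly from `standardize_comment`: it strips whitespace before checking prefixes and treats an empty body after the colon as "No comment".

```python
def format_comment(idx, text, default="The course was great!"):
    """Compact, illustrative variant of the comment standardization logic."""
    # Missing, non-string, or whitespace-only input
    if not isinstance(text, str) or not text.strip():
        return f"Comment {idx}: No comment"
    text = text.strip()
    # Spam is replaced by the default comment
    if text.startswith("This is spam"):
        return f"Comment {idx}: {default}"
    # Already-formatted comments keep their body but get a new index
    if text.startswith("Comment"):
        _, _, content = text.partition(":")
        return f"Comment {idx}: {content.strip() or 'No comment'}"
    # Free-text comments are wrapped in the standard format
    return f"Comment {idx}: {text}"


samples = [
    "This is spam... ignore me",
    "Comment 1825: The course was great!",
    "  Loved the labs ",
    None,
]
formatted = [format_comment(i, s) for i, s in enumerate(samples)]
for line in formatted:
    print(line)
```

Replacing spam with a positive default, as the notebook does, changes the sentiment of those rows; flagging them in a separate column would be an alternative if comment text is analysed later.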


## Data Export and Finalization

Exporting the cleaned dataset in multiple formats for analysis and distribution.

In [17]:
"""
Final Data Export:
- Save cleaned dataset as CSV for compatibility
- Export Excel format with descriptive sheet name
- Provide comprehensive cleaning summary
"""

output_csv = '../data/cleaned/cleaned_student_data.csv'
output_excel = '../data/cleaned/cleaned_student_data.xlsx'

std.to_csv(output_csv, index=False)
# Note: to_excel needs an Excel writer engine such as openpyxl installed
std.to_excel(output_excel, index=False, sheet_name='Cleaned Student Data')

print("=" * 60)
print("           DATA CLEANING PIPELINE COMPLETE")
print("=" * 60)

print("\nFinal dataset statistics:")
print(f"  Total records: {len(std):,}")
print(f"  Columns processed: {len(std.columns)}")
print(f"  Date range: {std['Timestamp'].min().date()} to {std['Timestamp'].max().date()}")
print(f"  Age range: {std['Age'].min()}-{std['Age'].max()} years")
print(f"  Departments: {std['Department'].nunique()}")
print(f"  Average GPA: {std['GPA'].mean():.2f}")
print(f"  Average satisfaction: {std['Satisfaction (1-5)'].mean():.2f}")

print("\nFiles exported:")
print(f"  ✓ {output_csv}")
print(f"  ✓ {output_excel}")
print("\nDataset ready for analysis! 🚀")

============================================================
           DATA CLEANING PIPELINE COMPLETE
============================================================

Final dataset statistics:
  Total records: 10,188
  Columns processed: 8
  Date range: 2023-09-01 to 2023-09-30
  Age range: 18-25 years
  Departments: 15
  Average GPA: 2.64
  Average satisfaction: 3.11

Files exported:
  ✓ ../data/cleaned/cleaned_student_data.csv
  ✓ ../data/cleaned/cleaned_student_data.xlsx

Dataset ready for analysis! 🚀
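
CSV does not store dtypes, so a quick round-trip check can confirm that an export reloads as expected. The sketch below writes a small hypothetical frame to an in-memory buffer rather than to the notebook's output paths; the column names mirror the cleaned dataset for readability only.

```python
import io

import pandas as pd

# Hypothetical miniature of the cleaned dataset
cleaned = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2023-09-01 08:00:00", "2023-09-30 17:30:00"]),
    "Student_ID": ["stud0001", "stud0002"],
    "Age": [18, 25],
    "GPA": [2.64, 3.11],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

# Timestamps come back as text unless they are parsed again on load
reloaded = pd.read_csv(buffer, parse_dates=["Timestamp"])

print(reloaded.dtypes)
print(reloaded.shape == cleaned.shape)
```

The same `parse_dates` argument would apply when reading the exported CSV from disk in a later analysis notebook.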
