A comprehensive data processing pipeline built from scratch using Python's standard libraries to clean, validate, and analyze user demographic data.
This was my first major Python project as an MSc AI & Data Science student at the University of Hull. The challenge? Build a complete data processing system using only Python's standard libraries - no pandas, no seaborn for the core processing logic. This constraint forced me to understand data manipulation at a fundamental level.
The project processes user demographic data (ACW dataset), validates credit card information, identifies data quality issues, segments users by employment status, and generates insights through statistical analysis and visualization.
Built from the ground up - Rather than relying on pandas or other high-level libraries, I implemented:
- Custom CSV parsing and processing
- Manual data validation and cleaning logic
- Object-oriented architecture for code organization
- File I/O operations for multiple output formats
This foundational approach gave me deep insight into how data processing libraries work under the hood.
- CSV Reading: Robust file parsing with error handling
- Data Validation: Credit card format verification using regex
- Data Cleaning: Identification and handling of problematic rows (invalid dependants values)
- Data Segmentation: Automatic separation of employed vs. retired users
- Retired Users: Filtered users with "Retired" employment status
- Employed Users: Extracted active workforce data
- Flagged Users: Identified users with invalid credit card formats
- Salary Analysis: Calculated statistics (mean, median, mode) for different user groups
- Processed JSON: Clean, validated data in JSON format
- Segmented Files: Separate JSON files for retired and employed users
- Flagged Users Report: CSV of users requiring data correction
- Statistical Reports: Text files with salary analysis
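The segmentation and JSON export steps above can be sketched with only the standard library. This is a minimal illustration, not the project's actual code; the field names and values are invented for the example:

```python
import json

# Illustrative records; the real pipeline reads these from the ACW CSV
users = [
    {"name": "Ada", "employment": "Retired", "salary": 28000},
    {"name": "Ben", "employment": "Engineer", "salary": 52000},
    {"name": "Cleo", "employment": "Retired", "salary": 31000},
]

# Segment by employment status using list comprehensions
retired = [u for u in users if u["employment"] == "Retired"]
employed = [u for u in users if u["employment"] != "Retired"]

# Write each segment to its own JSON file
with open("retired_users.json", "w") as f:
    json.dump(retired, f, indent=4)
with open("employed_users.json", "w") as f:
    json.dump(employed, f, indent=4)
```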
Built comprehensive visualizations to understand user demographics:
- Age distribution (univariate analysis)
- Dependants distribution
- Age vs. Marital Status (conditional plots)
- Commute Distance vs. Salary relationship
- Age vs. Salary scatter plots
- Multi-dimensional analysis (Age vs. Salary conditioned by Dependants)
- Object-Oriented Programming: Custom `CSVConverter` class with multiple methods
- File I/O: CSV reading, JSON writing, text file operations
- Data Structures: Lists, dictionaries, sets for efficient data handling
- Regular Expressions: Pattern matching for credit card validation
- Exception Handling: Try-except blocks for robust error management
- List Comprehensions: Efficient data filtering and transformation
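The regex validation can be sketched as follows. The 16-digit rule shown here is an assumption for illustration; the actual ACW validation criteria may differ:

```python
import re

# Assumed rule for this sketch: a valid card number is exactly 16 digits
CARD_PATTERN = re.compile(r"^\d{16}$")

def is_valid_card(number: str) -> bool:
    """Return True if the card number matches the expected format."""
    return bool(CARD_PATTERN.match(number.replace(" ", "")))

# Flag malformed entries with a list comprehension
flagged = [n for n in ["4111 1111 1111 1111", "1234-bad-card"] if not is_valid_card(n)]
```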
Data Processing (Standard Library):
- `csv` - Reading and parsing CSV files
- `json` - Writing structured output data
- `os` - File system operations
- `sys` - System-specific parameters
Visualization & Analysis:
- `matplotlib` - Creating and saving plots
- `seaborn` - Statistical visualizations
- `pandas` - Data analysis and statistical calculations
Employment Patterns:
- Separated users into employed and retired categories for targeted analysis
- Identified employment-related salary patterns
Data Quality:
- Flagged invalid credit card formats for correction
- Identified rows with problematic dependants data
- Created audit trail of data quality issues
Demographic Correlations:
- Analyzed relationship between age and salary
- Examined how dependants affect income patterns
- Studied commute distance vs. salary relationships
Calculated comprehensive statistics including:
- Mean, median, and mode salary by employment status
- Age distribution metrics
- Dependants distribution patterns
```
python-data-processing/
│
├── Formative_Assignment.ipynb    # Main notebook with complete pipeline
├── README.md                     # Project documentation
├── requirements.txt              # Python dependencies
│
├── outputs/
│   ├── processed_data.json       # Cleaned and validated data
│   ├── retired_users.json        # Retired user segment
│   ├── employed_users.json       # Employed user segment
│   ├── flagged_users.csv         # Users with invalid credit cards
│   └── salary_statistics.txt     # Statistical analysis results
│
└── visualizations/
    ├── age_univariate.png        # Age distribution
    ├── dependants_dist.png       # Dependants distribution
    ├── age_cond_marital_stat.png # Age by marital status
    ├── commute_vs_salary.png     # Commute-salary relationship
    ├── age_vs_salary.png         # Age-salary correlation
    └── age_salary_dep_cond.png   # Multi-dimensional analysis
```
- Python 3.11 or higher
- Jupyter Notebook

```bash
# Clone the repository
git clone https://github.com/yourusername/python-data-processing.git

# Navigate to the project directory
cd python-data-processing

# Install visualization dependencies (core processing uses only the standard library)
pip install matplotlib seaborn pandas jupyter

# Launch Jupyter Notebook
jupyter notebook Formative_Assignment.ipynb
```

The notebook executes its stages in sequential order:
- Data Loading: Reads the CSV file using the standard `csv` library
- Validation: Checks credit card formats and data integrity
- Processing: Cleans and structures data
- Segmentation: Separates users by employment status
- Analysis: Calculates statistics and generates insights
- Visualization: Creates multiple plots for data exploration
- Export: Saves processed data in multiple formats
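The data loading step can be sketched with the standard `csv` module and basic error handling. The function name and behavior here are illustrative, not the notebook's actual code:

```python
import csv

def read_csv_rows(path):
    """Read a CSV file into a list of dicts, with basic error handling."""
    try:
        with open(path, newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))
    except FileNotFoundError:
        print(f"File not found: {path}")
        return []
```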
```python
class CSVConverter:
    """
    Processes ACW user data: cleans it, structures it, and saves
    various modified versions.
    """

    def __init__(self, csv_file, processed_json):
        self.csv_file = csv_file
        self.processed_json = processed_json
        self.acw_data = None
        self.data = []
        self.problematic_rows = []
        self.retired_data = []
        self.employed_data = []
        self.flagged_users = []
```

Implemented regex pattern matching to identify invalid credit card formats and flag users for data correction.
Created methods to automatically filter and export user segments based on employment status, enabling targeted analysis.
Built custom functions to calculate mean, median, and mode without relying on pandas for core processing logic.
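Such mean/median/mode functions can be written in a few lines of pure Python. This is a minimal sketch under that constraint (the salary values are invented for the example):

```python
from collections import Counter

def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

def median(values):
    """Middle value of the sorted list; average of the two middles if even-length."""
    s = sorted(values)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2

def mode(values):
    """Most common value; ties resolved by first encountered."""
    return Counter(values).most_common(1)[0][0]

salaries = [30000, 42000, 42000, 55000, 61000]  # illustrative values
```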
The project produces six comprehensive visualizations:
- Age Distribution: Understanding user age demographics
- Dependants Distribution: Family structure analysis
- Age by Marital Status: Conditional demographic analysis
- Commute vs Salary: Geographic-economic relationships
- Age vs Salary: Income progression patterns
- Multi-dimensional: Age-Salary conditioned by Dependants
All plots are automatically saved as PNG files for reporting purposes.
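Saving a plot as a PNG follows the same pattern for each chart. A minimal sketch for the age histogram, using an assumed sample of ages and the headless `Agg` backend so it runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # render without a display so plots can be saved headlessly
import matplotlib.pyplot as plt

ages = [23, 31, 35, 42, 42, 51, 58, 64]  # illustrative values

fig, ax = plt.subplots()
ax.hist(ages, bins=5)
ax.set_xlabel("Age")
ax.set_ylabel("Count")
ax.set_title("Age Distribution")
fig.savefig("age_univariate.png", dpi=150)
plt.close(fig)
```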
This project taught me:
Foundational Skills:
- How data processing works at a fundamental level
- The importance of data validation and cleaning
- Object-oriented design for data pipelines
- File format conversions and data serialization
Problem-Solving:
- Working within constraints (standard library only)
- Debugging data quality issues
- Structuring code for maintainability
Best Practices:
- Documenting code with docstrings
- Error handling and edge case management
- Creating reusable, modular functions
- Separating concerns (processing vs. visualization)
- Add command-line interface for batch processing
- Implement unit tests for validation logic
- Add support for multiple file formats (Excel, TSV)
- Create interactive dashboard with Plotly
- Implement logging for audit trail
- Add data anonymization features
- ✅ Python fundamentals (no high-level library dependencies)
- ✅ Object-oriented programming
- ✅ Data validation and quality assurance
- ✅ File I/O and format conversion
- ✅ Regular expressions for pattern matching
- ✅ Statistical analysis and calculation
- ✅ Data visualization
- ✅ Code documentation
- ✅ Problem decomposition and modular design
Course: Python Programming Fundamentals
Institution: University of Hull, MSc Artificial Intelligence & Data Science
Trimester: 1 (First Semester)
Constraint: Build using Python standard libraries only (no pandas/seaborn for core processing)
This constraint was intentional - it forced us to understand data manipulation fundamentals before using higher-level abstractions.
Joy Ofuje Balogun
- LinkedIn: linkedin.com/in/joy-ofuje-balogun
- GitHub: @BuildwithOfuje
- Email: joyofujebalogun@outlook.com
This project is open source and available under the MIT License.
This project represents my journey from zero Python experience to building complete data processing pipelines. It's a testament to what's possible when you commit to learning fundamentals properly.
From complete beginner to data processing pipeline builder in one trimester!