My first Python project: Data Validation and Processing system built from scratch using standard libraries | MSc Artificial Intelligence and Data Science @ University of Hull


Python Data Processing & Validation System

A comprehensive data processing pipeline built from scratch using Python's standard libraries to clean, validate, and analyze user demographic data.

🎯 Project Overview

This was my first major Python project as an MSc AI & Data Science student at the University of Hull. The challenge? Build a complete data processing system using only Python's standard libraries - no pandas, no seaborn for the core processing logic. This constraint forced me to understand data manipulation at a fundamental level.

The project processes user demographic data (ACW dataset), validates credit card information, identifies data quality issues, segments users by employment status, and generates insights through statistical analysis and visualization.

💡 What Makes This Project Special

Built from the ground up - Rather than relying on pandas or other high-level libraries, I implemented:

  • Custom CSV parsing and processing
  • Manual data validation and cleaning logic
  • Object-oriented architecture for code organization
  • File I/O operations for multiple output formats

This foundational approach gave me deep insight into how data processing libraries work under the hood.

๐Ÿ” Key Features

1. Custom Data Processing Pipeline

  • CSV Reading: Robust file parsing with error handling
  • Data Validation: Credit card format verification using regex
  • Data Cleaning: Identification and handling of problematic rows (invalid dependants values)
  • Data Segmentation: Automatic separation of employed vs. retired users
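The reading step above can be sketched with the standard `csv` module; the file name and column headers in this example are illustrative, not the actual ACW schema:

```python
import csv

def read_rows(csv_path):
    """Read a CSV file into a list of dicts, with basic error handling."""
    rows = []
    try:
        with open(csv_path, newline="", encoding="utf-8") as f:
            # DictReader maps each row to {header: value}
            for row in csv.DictReader(f):
                rows.append(row)
    except FileNotFoundError:
        print(f"Could not find input file: {csv_path}")
    return rows
```

Returning an empty list on a missing file keeps the rest of the pipeline from crashing while still surfacing the problem.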

2. User Segmentation & Analysis

  • Retired Users: Filtered users with "Retired" employment status
  • Employed Users: Extracted active workforce data
  • Flagged Users: Identified users with invalid credit card formats
  • Salary Analysis: Calculated statistics (mean, median, mode) for different user groups

3. Multiple Output Formats

  • Processed JSON: Clean, validated data in JSON format
  • Segmented Files: Separate JSON files for retired and employed users
  • Flagged Users Report: CSV of users requiring data correction
  • Statistical Reports: Text files with salary analysis
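A minimal sketch of the export step, using only `json` and `csv` from the standard library (the helper names here are mine, not the notebook's):

```python
import csv
import json

def save_json(records, path):
    """Serialise a list of record dicts to pretty-printed JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=4)

def save_flagged_csv(flagged, path):
    """Write flagged users back out as CSV for manual correction."""
    if not flagged:
        return
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(flagged[0].keys()))
        writer.writeheader()
        writer.writerows(flagged)
```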

4. Data Visualization

Built comprehensive visualizations to understand user demographics:

  • Age distribution (univariate analysis)
  • Dependants distribution
  • Age vs. Marital Status (conditional plots)
  • Commute Distance vs. Salary relationship
  • Age vs. Salary scatter plots
  • Multi-dimensional analysis (Age vs. Salary conditioned by Dependants)
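One of the scatter plots above could be produced along these lines (a sketch, not the notebook's exact code; the output file name matches the project structure below):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: save figures without a display
import matplotlib.pyplot as plt

def plot_age_vs_salary(ages, salaries, out_path="age_vs_salary.png"):
    """Scatter plot of age against salary, saved as a PNG."""
    fig, ax = plt.subplots()
    ax.scatter(ages, salaries, alpha=0.6)
    ax.set_xlabel("Age")
    ax.set_ylabel("Salary")
    ax.set_title("Age vs. Salary")
    fig.savefig(out_path)
    plt.close(fig)  # free the figure once it has been written to disk
```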

๐Ÿ› ๏ธ Technologies & Techniques

Core Python Skills Demonstrated

  • Object-Oriented Programming: Custom CSVConverter class with multiple methods
  • File I/O: CSV reading, JSON writing, text file operations
  • Data Structures: Lists, dictionaries, sets for efficient data handling
  • Regular Expressions: Pattern matching for credit card validation
  • Exception Handling: Try-except blocks for robust error management
  • List Comprehensions: Efficient data filtering and transformation

Libraries Used

Data Processing (Standard Library):

  • csv - Reading and parsing CSV files
  • json - Writing structured output data
  • os - File system operations
  • sys - System-specific parameters

Visualization & Analysis:

  • matplotlib - Creating and saving plots
  • seaborn - Statistical visualizations
  • pandas - Data analysis and statistical calculations

📊 Analysis Results

Key Insights Discovered

Employment Patterns:

  • Separated users into employed and retired categories for targeted analysis
  • Identified employment-related salary patterns

Data Quality:

  • Flagged invalid credit card formats for correction
  • Identified rows with problematic dependants data
  • Created audit trail of data quality issues

Demographic Correlations:

  • Analyzed relationship between age and salary
  • Examined how dependants affect income patterns
  • Studied commute distance vs. salary relationships

Statistical Analysis

Calculated comprehensive statistics including:

  • Mean, median, and mode salary by employment status
  • Age distribution metrics
  • Dependants distribution patterns

๐Ÿ—๏ธ Project Structure

python-data-processing/
│
├── Formative_Assignment.ipynb    # Main notebook with complete pipeline
├── README.md                      # Project documentation
├── requirements.txt               # Python dependencies
│
├── outputs/
│   ├── processed_data.json       # Cleaned and validated data
│   ├── retired_users.json        # Retired user segment
│   ├── employed_users.json       # Employed user segment
│   ├── flagged_users.csv         # Users with invalid credit cards
│   └── salary_statistics.txt     # Statistical analysis results
│
└── visualizations/
    ├── age_univariate.png        # Age distribution
    ├── dependants_dist.png       # Dependants distribution
    ├── age_cond_marital_stat.png # Age by marital status
    ├── commute_vs_salary.png     # Commute-salary relationship
    ├── age_vs_salary.png         # Age-salary correlation
    └── age_salary_dep_cond.png   # Multi-dimensional analysis

🚀 Getting Started

Prerequisites

  • Python 3.11 or higher
  • Jupyter Notebook

Installation & Usage

```shell
# Clone the repository
git clone https://github.com/BuildwithOfuje/python-data-processing-pipeline.git

# Navigate to the project directory
cd python-data-processing-pipeline

# Install visualization dependencies (core processing uses only the standard library)
pip install matplotlib seaborn pandas jupyter

# Launch Jupyter Notebook
jupyter notebook Formative_Assignment.ipynb
```

Running the Pipeline

The notebook executes in sequential order:

  1. Data Loading: Reads CSV file using standard csv library
  2. Validation: Checks credit card formats and data integrity
  3. Processing: Cleans and structures data
  4. Segmentation: Separates users by employment status
  5. Analysis: Calculates statistics and generates insights
  6. Visualization: Creates multiple plots for data exploration
  7. Export: Saves processed data in multiple formats

💻 Code Highlights

Custom CSV Converter Class

```python
class CSVConverter:
    """
    Processes ACW User Data, cleans it, structures it, and saves various
    modified versions.
    """
    def __init__(self, csv_file, processed_json):
        self.csv_file = csv_file
        self.processed_json = processed_json  # path for the cleaned JSON output
        self.acw_data = None
        self.data = []
        self.problematic_rows = []
        self.retired_data = []
        self.employed_data = []
        self.flagged_users = []
```

Credit Card Validation

Implemented regex pattern matching to identify invalid credit card formats and flag users for data correction.
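A sketch of what such a check can look like; the 16-digit hyphenated format here is an assumption for illustration, since the assignment's actual rule may differ:

```python
import re

# Assumed format: four groups of four digits separated by hyphens,
# e.g. "1234-5678-9012-3456" (illustrative, not the ACW dataset's rule)
CARD_PATTERN = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{4}$")

def is_valid_card(card_number):
    """Return True if the card number matches the expected format."""
    return bool(CARD_PATTERN.match(card_number))
```

Anchoring the pattern with `^` and `$` ensures the whole string is checked, so extra digits or missing hyphens are flagged rather than silently accepted.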

Data Segmentation

Created methods to automatically filter and export user segments based on employment status, enabling targeted analysis.
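The filtering logic can be sketched with list comprehensions; the `"Employment"` field name is an assumption about the dataset's schema:

```python
def segment_by_employment(records):
    """Split user records into retired and employed groups."""
    retired = [r for r in records if r.get("Employment") == "Retired"]
    employed = [r for r in records if r.get("Employment") != "Retired"]
    return retired, employed
```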

Statistical Analysis

Built custom functions to calculate mean, median, and mode without relying on pandas for core processing logic.
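A minimal version of such a function, using only `sorted`, `sum`, and `collections.Counter` from the standard library (the function name is mine, not the notebook's):

```python
from collections import Counter

def salary_stats(salaries):
    """Mean, median, and mode computed without pandas."""
    ordered = sorted(salaries)
    n = len(ordered)
    mean = sum(ordered) / n
    mid = n // 2
    # Even-length lists take the average of the two middle values
    median = ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2
    # most_common(1) returns the single most frequent value
    mode = Counter(ordered).most_common(1)[0][0]
    return mean, median, mode
```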

📈 Visualizations Generated

The project produces six comprehensive visualizations:

  1. Age Distribution: Understanding user age demographics
  2. Dependants Distribution: Family structure analysis
  3. Age by Marital Status: Conditional demographic analysis
  4. Commute vs Salary: Geographic-economic relationships
  5. Age vs Salary: Income progression patterns
  6. Multi-dimensional: Age-Salary conditioned by Dependants

All plots are automatically saved as PNG files for reporting purposes.

🎓 Learning Outcomes

This project taught me:

Foundational Skills:

  • How data processing works at a fundamental level
  • The importance of data validation and cleaning
  • Object-oriented design for data pipelines
  • File format conversions and data serialization

Problem-Solving:

  • Working within constraints (standard library only)
  • Debugging data quality issues
  • Structuring code for maintainability

Best Practices:

  • Documenting code with docstrings
  • Error handling and edge case management
  • Creating reusable, modular functions
  • Separating concerns (processing vs. visualization)

🔮 Future Enhancements

  • Add command-line interface for batch processing
  • Implement unit tests for validation logic
  • Add support for multiple file formats (Excel, TSV)
  • Create interactive dashboard with Plotly
  • Implement logging for audit trail
  • Add data anonymization features

๐Ÿ“ Skills Demonstrated

  • ✅ Python fundamentals (no high-level library dependencies)
  • ✅ Object-oriented programming
  • ✅ Data validation and quality assurance
  • ✅ File I/O and format conversion
  • ✅ Regular expressions for pattern matching
  • ✅ Statistical analysis and calculation
  • ✅ Data visualization
  • ✅ Code documentation
  • ✅ Problem decomposition and modular design

🎯 Project Context

Course: Python Programming Fundamentals
Institution: University of Hull, MSc Artificial Intelligence & Data Science
Trimester: 1 (First Semester)
Constraint: Build using Python standard libraries only (no pandas/seaborn for core processing)

This constraint was intentional - it forced us to understand data manipulation fundamentals before using higher-level abstractions.

📧 Contact

Joy Ofuje Balogun

📄 License

This project is open source and available under the MIT License.


This project represents my journey from zero Python experience to building complete data processing pipelines. It's a testament to what's possible when you commit to learning fundamentals properly.

From complete beginner to data processing pipeline builder in one trimester! 🚀
