A Python script that automatically cleans and standardizes messy CSV data, handling common data quality issues like duplicates, inconsistent formatting, and missing values.
- Removes duplicate rows - Identifies and eliminates exact duplicates
- Standardizes text formatting - Converts column values to appropriate casing
- Standardizes date formats - Handles multiple date formats (YYYY-MM-DD, MM/DD/YYYY, YYYY/MM/DD) and converts all to YYYY-MM-DD
- Handles missing values - Intelligently fills missing data with appropriate defaults
- Removes currency symbols - Strips $ and € from price fields and converts to numeric
- Generates summary report - Provides detailed statistics about the cleaning process
- Flexible input options - Accepts command-line arguments or interactive prompts
Requires Python 3.7+ and pandas:
pip install pandas --break-system-packagespython data_cleaner.py input_file.csv output_file.csvExample:
python data_cleaner.py messy_sales_data.csv cleaned_sales_data.csvpython data_cleaner.pyThe script will prompt you for the input filename.
- Customer names:
john smith→John Smith - Products:
LAPTOP→Laptop - Emails:
USER@EMAIL.COM→user@email.com - Status:
completed→Completed
01/16/2024→2024-01-162024/01/18→2024-01-18- Handles multiple formats automatically
- Missing products →
Unknown Product - Missing prices →
0.00 - Missing emails →
no-email@provided.com - Missing customer names →
Unknown Customer - Missing status →
Pending Review - Missing quantities →
1 - Rows with missing Order IDs are removed (invalid data)
- Removes currency symbols:
$25.50→25.50 - Converts to numeric format
- Handles European formats with
€
Before (messy_sales_data.csv):
Order_ID,Customer_Name,Product,Quantity,Price,Order_Date,Status,Email
1001,john smith,Laptop,1,899.99,2024-01-15,Completed,john@email.com
1002,SARAH JONES,Mouse,2,25.50,01/16/2024,completed,SARAH@EMAIL.COM
1007,David Lee,,1,149.99,2024-01-21,Completed,david@email.comAfter (cleaned_sales_data.csv):
Order_ID,Customer_Name,Product,Quantity,Price,Order_Date,Status,Email
1001,John Smith,Laptop,1,899.99,2024-01-15,Completed,john@email.com
1002,Sarah Jones,Mouse,2,25.5,2024-01-16,Completed,sarah@email.com
1007,David Lee,Unknown Product,1,149.99,2024-01-21,Completed,david@email.comThe script generates:
- Cleaned CSV file - Your data with all issues resolved
- Console summary - Statistics about what was cleaned:
- Number of rows processed
- Duplicates removed
- Missing values handled
- Sample of cleaned data
Built with:
- Python 3.x
- pandas for data manipulation
- datetime for date parsing
- Built-in error handling for robustness
Key capabilities:
- Handles multiple date format variations
- Graceful error handling for malformed data
- Preserves data types (integers, floats, strings)
- Zero missing values in output
- Fast processing (handles thousands of rows efficiently)
Perfect for:
- Sales data cleanup
- E-commerce order processing
- Customer database standardization
- Data migration preparation
- Report generation preprocessing
- Any CSV data that needs consistent formatting
MIT License - Free to use, modify, and distribute.
Built by Morvyr W. as part of a Python automation portfolio.
Connect: X: @MorvyrWinds GitHub: Morvyr
Portfolio project demonstrating Python data processing, pandas manipulation, and automated data quality improvement.