An end-to-end data analysis project covering the full pipeline, from database creation and data cleaning to machine learning and visualization. It analyzes customer transaction data using MySQL, Python, and Power BI.
- Create and populate a MySQL database with transaction data
- Clean and preprocess messy data using Python and Pandas
- Perform exploratory data analysis
- Build predictive models with scikit-learn
- Visualize insights using Power BI dashboards
- Duplicate records
- Special characters and stray digits in names (?, ., /, _, 11)
- Inconsistent capitalization
- Missing values in critical fields
- Typos (e.g., "expence" instead of "expense")
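
The data quality issues listed above could be handled along these lines with Pandas. This is a minimal sketch, not the project's actual cleaning script: the column names (`customer_name`, `transaction_type`, `amount`), the sample rows, and the `clean_name` helper are all hypothetical.

```python
import re

import pandas as pd

# Hypothetical sample rows; column names are assumptions, not the project's schema.
raw = pd.DataFrame({
    "customer_name": ["john_doe", "John Doe", "MARY/ann?", "Li Wei11", None],
    "transaction_type": ["expence", "expense", "Income", "income", "expense"],
    "amount": [120.0, 120.0, 300.0, None, 45.5],
})


def clean_name(name):
    """Turn separators into spaces, drop digits and punctuation, then title-case."""
    if pd.isna(name):
        return name
    name = re.sub(r"[_/.]", " ", name)       # separators become spaces
    # Keeps ASCII letters only; accented names would need a wider pattern.
    name = re.sub(r"[^A-Za-z ]", "", name)   # drops ?, digits, other stray symbols
    return " ".join(name.split()).title()    # collapse whitespace, fix capitalization


df = raw.copy()
df["customer_name"] = df["customer_name"].map(clean_name)
df["transaction_type"] = (
    df["transaction_type"]
    .str.strip()
    .str.lower()
    .replace({"expence": "expense"})         # known typo from the issue list
)
df = df.dropna(subset=["customer_name", "amount"])  # missing critical fields
df = df.drop_duplicates().reset_index(drop=True)    # duplicates, after normalization

print(df)
```

Duplicates are dropped last on purpose: rows such as `john_doe` and `John Doe` only become identical once names and capitalization have been normalized.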
Classification Model
- Algorithm: Random Forest Classifier
- Target Variable: Category (bonus, rent, utilities)
- Features: Amount, transaction_type, currency
- Train/Test Split: 80/20
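
A setup like the one described above might look as follows in scikit-learn. This is a sketch under stated assumptions: the data is synthetic and randomly generated, the rule that bonuses are income is invented for the example, and the one-hot encoding step and hyperparameters are illustrative choices rather than the project's confirmed configuration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data; every value here is made up for illustration.
rng = np.random.default_rng(0)
n = 150
category = rng.choice(["bonus", "rent", "utilities"], size=n)
df = pd.DataFrame({
    "amount": rng.uniform(50, 1000, size=n).round(2),
    # Assumption for the example: bonuses are income, everything else is expense.
    "transaction_type": np.where(category == "bonus", "income", "expense"),
    "currency": rng.choice(["EUR", "USD"], size=n),
    "category": category,
})

X = df[["amount", "transaction_type", "currency"]]
y = df["category"]

# 80/20 split, stratified so each category appears in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# One-hot encode the two categorical features; pass the numeric amount through.
preprocess = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["transaction_type", "currency"])],
    remainder="passthrough",
)
model = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestClassifier(n_estimators=100, random_state=42)),
])
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Wrapping the encoder and the classifier in one `Pipeline` keeps the encoding fitted on the training split only, which avoids leaking information from the test rows.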
- Total Amount: €5,020
- Year: 2024
- Standard Deviation: 397
- Total Customers: 9
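
Summary figures like the ones above can be cross-checked in Pandas before they are built in Power BI. The sketch below uses hypothetical rows and column names (`customer_id`, `amount`, `date`). It also computes both the sample and the population standard deviation, since the README does not state which one the dashboard reports.

```python
import pandas as pd

# Hypothetical transactions; not the project's data.
tx = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "amount": [100.0, 200.0, 300.0, 400.0],
    "date": pd.to_datetime(["2024-01-15", "2024-02-01", "2024-05-20", "2024-11-03"]),
})

tx_2024 = tx[tx["date"].dt.year == 2024]              # year filter

total_amount = tx_2024["amount"].sum()
std_sample = tx_2024["amount"].std()                  # pandas default: ddof=1
std_population = tx_2024["amount"].std(ddof=0)        # population form
total_customers = tx_2024["customer_id"].nunique()    # distinct customers

print(total_amount, round(std_sample, 2), round(std_population, 2), total_customers)
```

The two standard deviations differ noticeably on small datasets, so it is worth checking which variant a dashboard measure uses before comparing numbers.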
- Amount by Quarter: bar chart showing transaction trends
- Category Distribution: pie chart breakdown
- Customer Details: interactive table with filters
- Transaction Type Filter: income vs. expense analysis
- Bonuses account for nearly half of all transactions (44.58%)
- Utilities represent a third of transactions (33.33%)
- Rent payments are the smallest category (22.08%)
- Q1 and Q2 show the highest transaction volumes
- Q3 has significantly lower activity
- Q4 shows a moderate recovery
- EUR is the dominant currency (7 out of 9 customers)
- USD is used by 2 customers
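
The breakdowns above (category share, quarterly totals, customers per currency) can be reproduced with a few `groupby` calls. This is a sketch on hypothetical rows and column names. It assumes the percentages are shares of the total amount; if they are shares of the transaction count, `tx["category"].value_counts(normalize=True)` would be the equivalent.

```python
import pandas as pd

# Hypothetical transactions; not the project's data.
tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 4],
    "category": ["bonus", "rent", "utilities", "bonus", "utilities", "rent"],
    "currency": ["EUR", "EUR", "EUR", "USD", "USD", "EUR"],
    "amount": [400.0, 100.0, 200.0, 100.0, 100.0, 100.0],
    "date": pd.to_datetime([
        "2024-01-10", "2024-02-05", "2024-04-12",
        "2024-05-30", "2024-08-19", "2024-11-02",
    ]),
})

# Share of the total amount per category, in percent.
category_share = (
    tx.groupby("category")["amount"].sum() / tx["amount"].sum() * 100
).round(2)

# Total amount per calendar quarter (1 to 4).
quarterly_total = tx.groupby(tx["date"].dt.quarter)["amount"].sum()

# Number of distinct customers using each currency.
customers_per_currency = tx.groupby("currency")["customer_id"].nunique()

print(category_share, quarterly_total, customers_per_currency, sep="\n\n")
```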
- MySQL 8.0: relational database management
- Python 3.11
- pandas: data manipulation and cleaning
- numpy: numerical computations
- scikit-learn: machine learning models
- Power BI Desktop: interactive dashboards