🌍 Read in French | 📓 View the notebook: Open Notebook
This project analyzes a retail store sales dataset using Python. The dataset comes from Kaggle and is intentionally "dirty": it contains missing and inconsistent values that simulate real-world data challenges.
The project covers the full data analysis workflow:
- Data loading
- Data exploration
- Data cleaning
- Exploratory Data Analysis (EDA)
- Insights generation
The dataset represents transactional sales data from a retail store, including:
- Transaction ID
- Customer ID
- Category
- Item
- Price per Unit
- Quantity
- Total Spent
- Payment Method
- Location (Online / In-store)
It contains:
- 8 product categories
- 25 items per category
- Multiple customers and transactions
- Missing and inconsistent values
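A first look at a dataset like this usually starts with its shape, column types, and missing-value counts. The sketch below uses a tiny made-up sample rather than the real Kaggle file. The column names mirror the fields listed above but are an assumption about the actual CSV headers, and the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample rows; column names are assumed to match the fields listed above.
df = pd.DataFrame({
    "Transaction ID": ["T1", "T2", "T3", "T4"],
    "Customer ID": ["C1", "C2", "C1", "C3"],
    "Category": ["Food", "Beverages", "Food", None],
    "Item": ["Item_1", "Item_2", "Item_1", None],
    "Price per Unit": [5.0, np.nan, 5.0, 12.5],
    "Quantity": [2.0, 3.0, np.nan, 1.0],
    "Total Spent": [10.0, 9.0, np.nan, 12.5],
    "Payment Method": ["Cash", "Credit Card", "Cash", "Digital Wallet"],
    "Location": ["Online", "In-store", "Online", "In-store"],
})

# Structure: number of rows/columns and the inferred type of each column.
print(df.shape)
print(df.dtypes)

# Missing values and distinct values per column.
missing_per_column = df.isna().sum()
unique_per_column = df.nunique()
print(missing_per_column)
print(unique_per_column)
```

With the real data, the `DataFrame` would typically come from `pd.read_csv` on the downloaded file instead of being built inline.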
The main goals of this project are:
- Clean a messy real-world dataset
- Understand customer purchasing behavior
- Analyze sales performance across categories
- Identify trends and patterns in transactions
- Practice end-to-end data analysis in Python
Several data quality issues were addressed:
- Handling missing values in key columns (Price, Quantity, Total Spent)
- Ensuring consistency between related variables
- Removing or handling duplicates
- Validating calculated fields (e.g., Total Spent = Price × Quantity)
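One possible way to carry out these steps is sketched below. It relies on the relationship Total Spent = Price × Quantity to recover a missing value when the other two are present. The sample rows and exact column names are assumptions for illustration, and the notebook may handle these cases differently.

```python
import numpy as np
import pandas as pd

# Made-up rows: one exact duplicate (T3) and one missing value per financial column.
raw = pd.DataFrame({
    "Transaction ID": ["T1", "T2", "T3", "T3", "T4"],
    "Price per Unit": [5.0, np.nan, 4.0, 4.0, 12.5],
    "Quantity": [2.0, 3.0, np.nan, np.nan, 1.0],
    "Total Spent": [10.0, 9.0, 8.0, 8.0, np.nan],
})

# Remove fully identical rows.
clean = raw.drop_duplicates().copy()

# Recover each missing value from the other two columns.
clean["Price per Unit"] = clean["Price per Unit"].fillna(
    clean["Total Spent"] / clean["Quantity"]
)
clean["Quantity"] = clean["Quantity"].fillna(
    clean["Total Spent"] / clean["Price per Unit"]
)
clean["Total Spent"] = clean["Total Spent"].fillna(
    clean["Price per Unit"] * clean["Quantity"]
)

# Validate the calculated field, allowing for floating-point rounding.
expected_total = clean["Price per Unit"] * clean["Quantity"]
is_consistent = np.isclose(clean["Total Spent"], expected_total)
print(clean)
print("All rows consistent:", is_consistent.all())
```

Rows where two or more of the three values are missing cannot be recovered this way and would need a separate decision, such as dropping them or looking up the item's price elsewhere.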
The analysis includes:
- Dataset structure exploration (shape, types, unique values)
- Customer and product analysis
- Category-level performance
- Payment method distribution
- Online vs in-store behavior
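The category, payment method, and location breakdowns can be sketched with pandas `groupby` and `value_counts`. The rows below are invented and the column names are assumed, so the numbers say nothing about the real dataset.

```python
import pandas as pd

# Made-up, already-cleaned transactions.
sales = pd.DataFrame({
    "Category": ["Food", "Food", "Beverages", "Furniture"],
    "Payment Method": ["Cash", "Credit Card", "Cash", "Digital Wallet"],
    "Location": ["Online", "In-store", "Online", "In-store"],
    "Total Spent": [10.0, 20.0, 9.0, 50.0],
})

# Category-level performance: total revenue per category, highest first.
revenue_by_category = (
    sales.groupby("Category")["Total Spent"].sum().sort_values(ascending=False)
)

# Payment method distribution as a share of transactions.
payment_share = sales["Payment Method"].value_counts(normalize=True)

# Online vs in-store: transaction count, total, and average spend.
by_location = sales.groupby("Location")["Total Spent"].agg(["count", "sum", "mean"])

print(revenue_by_category)
print(payment_share)
print(by_location)
```

Each result is a regular pandas object, so it can be passed to Matplotlib or Seaborn, for example with `revenue_by_category.plot(kind="bar")`.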
The project uses the following tools:
- Python
- Pandas (data manipulation)
- Matplotlib & Seaborn (visualization)
- Google Colab
Key outcomes of the analysis:
- Identification of the number of unique customers and products
- Clear distinction between online and in-store transactions
- Detection of missing-value patterns in key financial variables
- Understanding of category distribution and customer behavior
👤 Author: Robin Rubangura