This project is an intelligent e-commerce analytics system that uses machine learning to perform customer segmentation, churn prediction, and Customer Lifetime Value (CLV) prediction. The system automatically detects data changes and retrains models when necessary.
- Data Ingestion: Automatic detection of data file changes with hash-based verification (see the sketch after this list)
- Data Processing: Cleaning and feature engineering (RFM metrics, return rates, etc.)
- Customer Segmentation: K-Means clustering for customer grouping
- Churn Prediction: Classification model with GridSearch optimization
- CLV Prediction: Regression model for lifetime value estimation
- REST API: FastAPI-based endpoints for predictions and data retrieval
- Database Integration: SQLite for data storage and retrieval
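As an illustration, hash-based change detection can be as simple as comparing a fresh digest of the raw CSV against the one recorded in `data/data_state.json`. The sketch below is a minimal version of that idea; the function names and state-file layout are assumptions, not the project's actual code:

```python
import hashlib
import json
from pathlib import Path

STATE_FILE = Path("data/data_state.json")

def file_hash(path: Path) -> str:
    """Compute a SHA-256 digest of the file contents, reading in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def data_changed(csv_path: Path) -> bool:
    """Return True if the CSV's hash differs from the stored state."""
    new_hash = file_hash(csv_path)
    old_hash = None
    if STATE_FILE.exists():
        old_hash = json.loads(STATE_FILE.read_text()).get("hash")
    if new_hash != old_hash:
        # Record the new hash so the next run sees the file as unchanged.
        STATE_FILE.write_text(json.dumps({"hash": new_hash}))
        return True
    return False
```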
The system follows a modular pipeline architecture:
- Data Ingestion (`data_ingestion.py`): Loads and validates CSV data
- Data Processing (`data_processor.py`): Cleans data and creates features
- Model Training (`model_trainer.py`): Trains ML models and saves artifacts
- API Service (`api.py`): Provides REST endpoints for predictions
- Main Orchestrator (`main.py`): Coordinates the entire pipeline
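A simplified sketch of how `main.py` might wire these stages together; the function names and signatures below are hypothetical, based only on the module responsibilities above:

```python
# Hypothetical imports; the actual module APIs may differ.
from src.data_ingestion import data_changed, load_raw_data
from src.data_processor import clean_and_engineer
from src.model_trainer import train_all_models

def run_pipeline() -> None:
    """Run ingestion, processing, and training only when the data changed."""
    if not data_changed():
        print("Data unchanged; skipping retraining.")
        return
    raw = load_raw_data()
    features = clean_and_engineer(raw)
    train_all_models(features)

if __name__ == "__main__":
    run_pipeline()
```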
Prerequisites:
- Python 3.8 or higher
- pip package manager
1. Clone or download the project:

   ```bash
   git clone <repository-url>
   cd OnlineRetailMachineLearningProject
   ```

2. Create a virtual environment:

   ```bash
   python -m venv .venv
   ```

3. Activate the virtual environment:

   - Windows: `.venv\Scripts\activate`
   - Linux/Mac: `source .venv/bin/activate`

4. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

5. Place your data file:

   - Copy your `online_retail_II.csv` file to the `data/raw/` directory
   - The system supports both 'Invoice' and 'InvoiceNo' column names
Run the main script to process data and train models:

```bash
python main.py
```

The system will:

- Check for data changes
- Process and clean data if needed
- Train models (segmentation, churn, CLV)
- Save models to the `artifacts/` directory
- Display a status report
After training, start the FastAPI server:

```bash
uvicorn src.api:app --reload
```

The API will be available at http://localhost:8000.
Visit http://localhost:8000/docs for interactive API documentation.
- `GET /`: Welcome message
- `GET /customer/{customer_id}`: Get predictions for a specific customer
- `POST /predict/live`: Make live predictions with custom data
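A minimal sketch of how the live-prediction endpoint could be declared in `api.py`; the field names are taken from the request example below, while the model-loading and response logic are assumptions:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class LiveFeatures(BaseModel):
    """Request body for POST /predict/live (fields from the example below)."""
    recency: int
    frequency: int
    monetary: float
    avg_basket: float
    return_rate: float

@app.post("/predict/live")
def predict_live(features: LiveFeatures) -> dict:
    # In the real service, the trained models from artifacts/ would be
    # loaded and applied here; this placeholder just echoes the input.
    return {"received": features.dict()}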
Get customer predictions:

```bash
curl http://localhost:8000/customer/12345
```

Live prediction:

```bash
curl -X POST "http://localhost:8000/predict/live" \
  -H "Content-Type: application/json" \
  -d '{
    "recency": 30,
    "frequency": 5,
    "monetary": 1500.0,
    "avg_basket": 300.0,
    "return_rate": 0.1
  }'
```

The system expects CSV data with the following columns:
- Customer ID (numeric)
- Invoice/InvoiceNo (transaction identifier)
- Quantity (numeric)
- Price/UnitPrice (numeric)
- InvoiceDate (datetime)
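From these columns, the feature-engineering step derives the RFM metrics used by all three models. A rough pandas sketch, assuming the 'Invoice' and 'Price' column variants; the helper below is illustrative, not the project's actual code:

```python
import pandas as pd

def build_rfm(df: pd.DataFrame, snapshot_date: pd.Timestamp) -> pd.DataFrame:
    """Derive per-customer RFM features from raw transactions (sketch)."""
    df = df.copy()
    df["Revenue"] = df["Quantity"] * df["Price"]
    rfm = df.groupby("Customer ID").agg(
        # Days since the customer's most recent purchase.
        Recency=("InvoiceDate", lambda s: (snapshot_date - s.max()).days),
        # Number of distinct transactions.
        Frequency=("Invoice", "nunique"),
        # Total spend across all transactions.
        Monetary=("Revenue", "sum"),
    )
    rfm["AvgBasketSize"] = rfm["Monetary"] / rfm["Frequency"]
    return rfm
```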
Customer Segmentation:
- Algorithm: K-Means Clustering
- Features: Recency, Frequency, Monetary, AvgBasketSize
- Number of segments: 4 (configurable)
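A hedged sketch of what this segmentation step might look like with scikit-learn; the feature scaling and random seed are assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def segment_customers(rfm, n_segments: int = 4):
    """Cluster customers on scaled RFM features (illustrative sketch)."""
    features = rfm[["Recency", "Frequency", "Monetary", "AvgBasketSize"]]
    # Scale features so no single metric dominates the distance measure.
    scaled = StandardScaler().fit_transform(features)
    model = KMeans(n_clusters=n_segments, n_init=10, random_state=42)
    return rfm.assign(Segment=model.fit_predict(scaled)), model
```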
Churn Prediction:
- Algorithm: Random Forest Classifier
- Target: Customers with Recency > 90 days
- Features: Frequency, Monetary, AvgBasketSize, return_rate
- Optimization: GridSearchCV for hyperparameter tuning
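A sketch of how the churn classifier might be tuned; the churn target follows the Recency > 90 days rule above, while the hyperparameter grid and scoring metric are assumptions:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def train_churn_model(rfm):
    """Fit a GridSearch-tuned Random Forest churn classifier (sketch)."""
    X = rfm[["Frequency", "Monetary", "AvgBasketSize", "return_rate"]]
    # Churned = no purchase in more than 90 days.
    y = (rfm["Recency"] > 90).astype(int)
    param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10]}
    search = GridSearchCV(RandomForestClassifier(random_state=42),
                          param_grid, cv=5, scoring="f1")
    search.fit(X, y)
    return search.best_estimator_
```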
CLV Prediction:
- Algorithm: Random Forest Regressor
- Target: Total Monetary value
- Features: Recency, Frequency, AvgBasketSize, return_rate
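And a corresponding sketch for the CLV regressor; the hyperparameters are assumptions:

```python
from sklearn.ensemble import RandomForestRegressor

def train_clv_model(rfm):
    """Fit a Random Forest regressor on total monetary value (sketch)."""
    X = rfm[["Recency", "Frequency", "AvgBasketSize", "return_rate"]]
    y = rfm["Monetary"]  # target: total spend per customer
    model = RandomForestRegressor(n_estimators=300, random_state=42)
    model.fit(X, y)
    return model
```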
```
OnlineRetailMachineLearningProject/
├── main.py                 # Main orchestrator
├── requirements.txt        # Python dependencies
├── .gitignore              # Git ignore rules
├── src/
│   ├── __init__.py
│   ├── config.py           # Configuration settings
│   ├── data_ingestion.py   # Data loading and validation
│   ├── data_processor.py   # Data cleaning and feature engineering
│   ├── model_trainer.py    # ML model training
│   └── api.py              # FastAPI endpoints
├── data/
│   ├── raw/                # Raw data files
│   └── data_state.json     # Data change tracking
├── db/                     # SQLite database files
├── artifacts/              # Trained model files
└── notebooks/              # Jupyter notebooks (for analysis)
```
Modify `src/config.py` to customize:
- Database paths
- Model artifacts location
- Table names
- File encodings
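These settings might look roughly like this; the names and values below are illustrative, not the actual contents of `config.py`:

```python
from pathlib import Path

# Database paths (hypothetical defaults)
DB_PATH = Path("db/retail.db")

# Model artifacts location
ARTIFACTS_DIR = Path("artifacts")

# Table names
CUSTOMERS_TABLE = "customer_features"

# File encodings for reading the raw CSV
CSV_ENCODING = "ISO-8859-1"
```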
The system provides detailed console output for debugging. Check the terminal output for error messages and processing status.