E-Commerce Analytics System

This project implements an intelligent e-commerce analytics system that uses machine learning for customer segmentation, churn prediction, and Customer Lifetime Value (CLV) prediction. The system automatically detects changes in the underlying data and retrains its models when necessary.

Features

  • Data Ingestion: Automatic detection of data file changes with hash-based verification (see the sketch after this list)
  • Data Processing: Cleaning and feature engineering (RFM metrics, return rates, etc.)
  • Customer Segmentation: K-Means clustering for customer grouping
  • Churn Prediction: Classification model with GridSearch optimization
  • CLV Prediction: Regression model for lifetime value estimation
  • REST API: FastAPI-based endpoints for predictions and data retrieval
  • Database Integration: SQLite for data storage and retrieval
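
The hash-based change detection could look roughly like the following. This is a minimal sketch, not the actual contents of data_ingestion.py; the state-file layout ({"hash": ...} in data/data_state.json) is an assumption based on the project structure.

import hashlib
import json
from pathlib import Path

STATE_FILE = Path("data/data_state.json")  # change-tracking file from the project layout

def file_hash(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def data_changed(csv_path: Path) -> bool:
    """Compare the file's current hash against the last recorded one."""
    current = file_hash(csv_path)
    previous = None
    if STATE_FILE.exists():
        previous = json.loads(STATE_FILE.read_text()).get("hash")
    if current != previous:
        STATE_FILE.write_text(json.dumps({"hash": current}))
        return True
    return False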

Architecture

The system follows a modular pipeline architecture:

  1. Data Ingestion (data_ingestion.py): Loads and validates CSV data
  2. Data Processing (data_processor.py): Cleans data and creates features
  3. Model Training (model_trainer.py): Trains ML models and saves artifacts
  4. API Service (api.py): Provides REST endpoints for predictions
  5. Main Orchestrator (main.py): Coordinates the entire pipeline (sketched below)
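
A minimal sketch of how main.py might wire these stages together. The stage functions below are stubs with hypothetical names; the real logic lives in the src/ modules listed above.

def data_changed(path: str) -> bool:            # real check: src/data_ingestion.py
    return True

def build_features(path: str) -> dict:          # real logic: src/data_processor.py
    return {"features": "..."}

def train_all_models(features: dict) -> None:   # real logic: src/model_trainer.py
    print("training segmentation, churn, and CLV models")

def main() -> None:
    csv = "data/raw/online_retail_II.csv"
    if not data_changed(csv):
        print("Data unchanged; skipping retraining.")
        return
    train_all_models(build_features(csv))
    print("Models saved to artifacts/")

if __name__ == "__main__":
    main()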

Installation

Prerequisites

  • Python 3.8 or higher
  • pip package manager

Setup Steps

  1. Clone or download the project:

    git clone <repository-url>
    cd OnlineRetailMachineLearningProject
  2. Create a virtual environment:

    python -m venv .venv
  3. Activate the virtual environment:

    • Windows: .venv\Scripts\activate
    • Linux/Mac: source .venv/bin/activate
  4. Install dependencies:

    pip install -r requirements.txt
  5. Place your data file:

    • Copy your online_retail_II.csv file to the data/raw/ directory
    • The system supports both 'Invoice' and 'InvoiceNo' column names

Usage

Training Models

Run the main script to process data and train models:

python main.py

The system will:

  • Check for data changes
  • Process and clean data if needed
  • Train models (segmentation, churn, CLV)
  • Save models to artifacts/ directory
  • Display a status report

Starting the API Server

After training, start the FastAPI server:

uvicorn src.api:app --reload

The API will be available at http://localhost:8000

API Documentation

Visit http://localhost:8000/docs for interactive API documentation.

Endpoints

  • GET /: Welcome message
  • GET /customer/{customer_id}: Get predictions for a specific customer
  • POST /predict/live: Make live predictions with custom data
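
A minimal sketch of how these endpoints might be declared with FastAPI. The request schema mirrors the live-prediction example below; the placeholder responses stand in for the real model inference and SQLite lookups in src/api.py.

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="E-Commerce Analytics API")

class LiveFeatures(BaseModel):
    recency: float
    frequency: float
    monetary: float
    avg_basket: float
    return_rate: float

@app.get("/")
def root():
    return {"message": "Welcome to the E-Commerce Analytics API"}

@app.get("/customer/{customer_id}")
def get_customer(customer_id: int):
    # Real service: look up stored predictions for this customer in SQLite.
    return {"customer_id": customer_id, "segment": None, "churn_risk": None}

@app.post("/predict/live")
def predict_live(features: LiveFeatures):
    # Real service: feed these features to the saved churn and CLV models.
    return {"churn_prediction": None, "clv_prediction": None, "inputs": dict(features)}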

Example API Usage

Get customer predictions:

curl http://localhost:8000/customer/12345

Live prediction:

curl -X POST "http://localhost:8000/predict/live" \
     -H "Content-Type: application/json" \
     -d '{
       "recency": 30,
       "frequency": 5,
       "monetary": 1500.0,
       "avg_basket": 300.0,
       "return_rate": 0.1
     }'

Data Format

The system expects CSV data with the following columns:

  • Customer ID (numeric)
  • Invoice/InvoiceNo (transaction identifier)
  • Quantity (numeric)
  • Price/UnitPrice (numeric)
  • InvoiceDate (datetime)
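
From these columns, the RFM-style features used by the models can be derived roughly as follows. This is a sketch assuming pandas and the 'Invoice'/'Price' column variants; data_processor.py also handles the 'InvoiceNo'/'UnitPrice' names.

import pandas as pd

df = pd.read_csv("data/raw/online_retail_II.csv", parse_dates=["InvoiceDate"])
df["TotalPrice"] = df["Quantity"] * df["Price"]

# Recency is measured from the day after the last transaction in the data.
snapshot = df["InvoiceDate"].max() + pd.Timedelta(days=1)
rfm = df.groupby("Customer ID").agg(
    Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),
    Frequency=("Invoice", "nunique"),
    Monetary=("TotalPrice", "sum"),
)
rfm["AvgBasketSize"] = rfm["Monetary"] / rfm["Frequency"]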

Model Details

Customer Segmentation

  • Algorithm: K-Means Clustering
  • Features: Recency, Frequency, Monetary, AvgBasketSize
  • Number of segments: 4 (configurable)
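
A sketch of the segmentation step under these settings. Feature scaling is assumed (standard practice for K-Means, since it is distance-based); the tiny frame stands in for the RFM table built above.

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# `rfm` as built in the Data Format sketch; a tiny stand-in here.
rfm = pd.DataFrame({
    "Recency":       [5, 40, 200, 10, 95, 300],
    "Frequency":     [12, 3, 1, 8, 2, 1],
    "Monetary":      [2500.0, 400.0, 50.0, 1800.0, 220.0, 30.0],
    "AvgBasketSize": [208.3, 133.3, 50.0, 225.0, 110.0, 30.0],
})

scaled = StandardScaler().fit_transform(rfm)  # K-Means is scale-sensitive
rfm["Segment"] = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(scaled)
print(rfm)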

Churn Prediction

  • Algorithm: Random Forest Classifier
  • Target: Churn flag (customers with Recency > 90 days are labeled churned)
  • Features: Frequency, Monetary, AvgBasketSize, return_rate
  • Optimization: GridSearchCV for hyperparameter tuning
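
A sketch of this training step. The parameter grid and the synthetic data are assumptions for illustration; model_trainer.py defines the actual grid and uses the real customer features.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# X columns: Frequency, Monetary, AvgBasketSize, return_rate
# y: 1 if Recency > 90 days, else 0 (synthetic stand-in here)
rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = rng.integers(0, 2, 200)

param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10]}  # assumed grid
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)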

CLV Prediction

  • Algorithm: Random Forest Regressor
  • Target: Total Monetary value
  • Features: Recency, Frequency, AvgBasketSize, return_rate
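
A sketch of the regression step with the feature set above; the synthetic data is a stand-in for the real customer features.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# X columns: Recency, Frequency, AvgBasketSize, return_rate
# y: total Monetary value per customer (synthetic stand-in here)
rng = np.random.default_rng(1)
X = rng.random((200, 4))
y = rng.random(200) * 1000

clv_model = RandomForestRegressor(random_state=42).fit(X, y)
print(clv_model.predict(X[:3]))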

Project Structure

OnlineRetailMachineLearningProject/
├── main.py                # Main orchestrator
├── requirements.txt       # Python dependencies
├── .gitignore             # Git ignore rules
├── src/
│   ├── __init__.py
│   ├── config.py          # Configuration settings
│   ├── data_ingestion.py  # Data loading and validation
│   ├── data_processor.py  # Data cleaning and feature engineering
│   ├── model_trainer.py   # ML model training
│   └── api.py             # FastAPI endpoints
├── data/
│   ├── raw/               # Raw data files
│   └── data_state.json    # Data change tracking
├── db/                    # SQLite database files
├── artifacts/             # Trained model files
└── notebooks/             # Jupyter notebooks (for analysis)

Configuration

Modify src/config.py to customize:

  • Database paths
  • Model artifacts location
  • Table names
  • File encodings
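
A config module covering these options might look like the following. All names and values here are illustrative, not the actual contents of src/config.py.

# src/config.py (illustrative sketch)
from pathlib import Path

BASE_DIR = Path(__file__).resolve().parent.parent

DB_PATH = BASE_DIR / "db" / "retail.db"        # database path (name assumed)
ARTIFACTS_DIR = BASE_DIR / "artifacts"         # trained model files
RAW_DATA_PATH = BASE_DIR / "data" / "raw" / "online_retail_II.csv"

TABLE_NAME = "customer_features"               # table name (assumed)
CSV_ENCODING = "ISO-8859-1"                    # encoding commonly needed for this dataset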

Logs

The system provides detailed console output for debugging. Check the terminal output for error messages and processing status.
