## Real-World Insights: Optimizing Logistics and Supply Chain Data for SwiftChain Analytics

![SwiftChain](https://drive.google.com/uc?export=view&id=1Yx6xN9JUJu5mqbU6T8HZ81sfnVjYyO_0)

SwiftChain Analytics is a pioneering logistics analytics multinational founded in 2010. With its headquarters in Chicago, SwiftChain Analytics specializes in using data-driven insights to optimize supply chain operations for businesses globally. The organization combines advanced machine learning techniques with domain expertise to deliver innovative solutions for challenges such as delivery delays, inventory management, and transportation optimization.

SwiftChain Analytics’ mission is to empower businesses with actionable intelligence, enabling them to reduce costs, improve customer satisfaction, and achieve operational excellence. By collaborating with leading e-commerce companies, manufacturers, and logistics providers, SwiftChain Analytics has established itself as a trusted partner in the logistics industry.

In 2024, SwiftChain Analytics embarked on a project to develop a predictive model for delivery delay classification using historical logistics data. Insights derived from this project can drive decisions to enhance delivery systems and improve customer experience.

## Where you come in

Having proven your expertise as a data scientist with an excellent track record, you have been contracted by the company, in line with its 2024 project, to solve one of the most common yet challenging problems in supply chain management: **predicting delivery delays.** Using real-world logistics data, you will analyze delivery patterns and develop a predictive model to classify and anticipate delays, enabling proactive solutions to enhance operational efficiency.


## Objectives

- Data Understanding and Exploration: Analyze the dataset to identify trends and outliers in logistics data.
- Data Preprocessing: Clean and prepare data, addressing issues such as missing values and inconsistent formats.
- Feature Engineering: Create and select relevant features to improve model performance.
- Model Development: Build, evaluate, and fine-tune classification models to predict delivery delays.
- Insights and Recommendations: Provide actionable recommendations for reducing delivery delays.

## Data

**About the Dataset:**

This dataset simulates real-world logistics data, containing 41 variables across multiple categories, including:

- Customer demographics
- Order details
- Shipping modes
- Product information
- Delivery outcomes (target label)

**Data Dictionary**

| Column Name                 | Description                                                                 |
|-----------------------------|-----------------------------------------------------------------------------|
| `payment_type`              | Payment method used for the order (e.g., DEBIT, TRANSFER).                 |
| `profit_per_order`          | Profit generated per order.                                                |
| `sales_per_customer`        | Total sales made by a customer.                                            |
| `category_id`               | Unique identifier for the product category.                                |
| `category_name`             | Name of the product category (e.g., Water Sports, Cleats).                |
| `customer_city`             | City of the customer.                                                     |
| `customer_country`          | Country of the customer.                                                  |
| `customer_id`               | Unique identifier for the customer.                                       |
| `customer_segment`          | Segment classification of the customer (e.g., Consumer, Corporate).       |
| `customer_state`            | State of the customer.                                                    |
| `customer_zipcode`          | Zipcode of the customer.                                                  |
| `department_id`             | Unique identifier for the department.                                     |
| `department_name`           | Name of the department.                                                   |
| `latitude`                  | Latitude coordinates of the customer location.                            |
| `longitude`                 | Longitude coordinates of the customer location.                           |
| `market`                    | Market classification for the order.                                      |
| `order_city`                | City where the order was placed.                                          |
| `order_country`             | Country where the order was placed.                                       |
| `order_customer_id`         | Identifier linking the order to a customer.                               |
| `order_date`                | Date when the order was placed.                                           |
| `order_id`                  | Unique identifier for the order.                                          |
| `order_item_cardprod_id`    | Identifier linking the item to a product.                                 |
| `order_item_discount`       | Discount applied to the order item.                                       |
| `order_item_discount_rate`  | Discount rate applied to the order item.                                  |
| `order_item_id`             | Unique identifier for the order item.                                     |
| `order_item_product_price`  | Price of the product in the order item.                                   |
| `order_item_profit_ratio`   | Profit ratio of the order item.                                           |
| `order_item_quantity`       | Quantity of the product in the order item.                                |
| `sales`                     | Total sales value of the order.                                           |
| `order_item_total_amount`   | Total amount for the order item.                                          |
| `order_profit_per_order`    | Profit for the entire order.                                              |
| `order_region`              | Region where the order was placed.                                        |
| `order_state`               | State where the order was placed.                                         |
| `order_status`              | Current status of the order (e.g., COMPLETE, PENDING).                   |
| `product_card_id`           | Identifier for the product card.                                          |
| `product_category_id`       | Identifier for the product category.                                      |
| `product_name`              | Name of the product.                                                     |
| `product_price`             | Price of the product.                                                    |
| `shipping_date`             | Date when the order was shipped.                                          |
| `shipping_mode`             | Shipping mode used (e.g., Standard Class, Second Class).                 |
| `label`                     | Delivery outcome label: `-1` for late, `0` for on-time, `1` for early.   |


**You will find the dataset in the project folder named "[SwiftChain](https://drive.google.com/drive/folders/1RUTpI2Gw6uOu2mfgHE0K4nn5gl6dLUuX?usp=drive_link)". The folder contains the `logistics.csv` and `feature_description.csv` files for this problem. Use them accordingly.**

**Problem Statement:**

Develop a machine learning model to predict the delivery delay status (the `label` column), where:

- `-1` indicates a late delivery
- `0` indicates an on-time delivery
- `1` indicates an early delivery

**Evaluation Metric:**

Accuracy and F1 Score will be used to evaluate the performance of the delay prediction model.
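Because `label` has three classes, the F1 score needs an averaging strategy, which the brief does not specify. The sketch below computes accuracy alongside macro and weighted F1 with scikit-learn, using a handful of made-up labels purely for illustration. The choice of averaging is an assumption to confirm with the project owner.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels for illustration only (not taken from logistics.csv).
# -1 = late, 0 = on-time, 1 = early
y_true = [-1, -1, 0, 1, 1, 0]
y_pred = [-1, 0, 0, 1, -1, 0]

# Share of predictions that match the true label.
accuracy = accuracy_score(y_true, y_pred)

# Macro F1 averages the per-class F1 scores equally,
# so a rare class counts as much as a common one.
macro_f1 = f1_score(y_true, y_pred, average="macro")

# Weighted F1 averages the per-class F1 scores by class frequency.
weighted_f1 = f1_score(y_true, y_pred, average="weighted")

print(f"accuracy={accuracy:.3f} macro_f1={macro_f1:.3f} weighted_f1={weighted_f1:.3f}")
```

Macro F1 is often preferred when the classes are imbalanced and late deliveries matter as much as on-time ones; weighted F1 tracks overall volume more closely.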

## Tasks

**Phase 1: Understanding the Problem**

- Write a summary of the potential business impact of delivery delays.
- Examine and describe key variables in the dataset.

**Phase 2: Exploratory Data Analysis (EDA)**

- Load the dataset and inspect its structure.
- Generate descriptive statistics for numerical and categorical variables.
- Visualize key relationships (e.g., shipping mode vs. delivery status).
- Identify missing values and propose handling strategies.
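
The EDA steps above can be sketched with pandas as follows. The rows are invented stand-ins so the snippet runs on its own; with the real data you would load `logistics.csv` instead (the relative path in the comment is an assumption).

```python
import pandas as pd

# Toy rows standing in for logistics.csv (values invented for illustration).
# With the real file: df = pd.read_csv("logistics.csv")  # path is an assumption
df = pd.DataFrame({
    "shipping_mode": ["Standard Class", "Second Class", "Standard Class", "First Class", None],
    "profit_per_order": [12.5, None, 30.0, -4.2, 8.1],
    "label": [-1, 0, 1, -1, 0],
})

# Structure: number of rows/columns and the type of each column.
print(df.shape)
print(df.dtypes)

# Descriptive statistics for numerical columns.
numeric_summary = df.describe()

# Frequency table for a categorical column, keeping missing values visible.
mode_counts = df["shipping_mode"].value_counts(dropna=False)

# Missing values per column.
missing_counts = df.isna().sum()

# Share of each delivery outcome within each shipping mode
# (rows with a missing shipping_mode are dropped by crosstab).
delay_by_mode = pd.crosstab(df["shipping_mode"], df["label"], normalize="index")
print(delay_by_mode)
```

If matplotlib is installed, `delay_by_mode.plot(kind="bar", stacked=True)` is one way to turn the cross-tabulation into a chart of shipping mode vs. delivery status.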

**Phase 3: Data Preprocessing**

- Encode categorical variables such as `customer_segment` and `shipping_mode`.
- Handle missing values and outliers in critical columns.
- Standardize or normalize numerical features, including `profit_per_order` and `sales_per_customer`.
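
One possible way to combine imputation, encoding, and scaling is a scikit-learn `ColumnTransformer`, sketched below on invented rows. The imputation strategies (median, most frequent) and the use of one-hot encoding are assumptions, not requirements of the brief, and outlier handling is not shown.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows for illustration only (not taken from logistics.csv).
df = pd.DataFrame({
    "customer_segment": ["Consumer", "Corporate", "Consumer", np.nan],
    "shipping_mode": ["Standard Class", "Second Class", "Standard Class", "Second Class"],
    "profit_per_order": [10.0, 20.0, np.nan, 30.0],
    "sales_per_customer": [100.0, 150.0, 200.0, 250.0],
})

categorical_cols = ["customer_segment", "shipping_mode"]
numeric_cols = ["profit_per_order", "sales_per_customer"]

# Numeric columns: fill gaps with the median, then standardize.
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill gaps with the most frequent value, then one-hot encode.
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# sparse_threshold=0 forces a dense NumPy array as output.
preprocess = ColumnTransformer(
    [("num", numeric_pipe, numeric_cols), ("cat", categorical_pipe, categorical_cols)],
    sparse_threshold=0,
)

X = preprocess.fit_transform(df)
print(X.shape)
```

Fitting the transformer on the training split only, and reusing it on the test split, keeps test-set statistics out of the scaling and imputation steps.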

**Phase 4: Feature Engineering**

- Create new features such as `shipping_duration` (difference between order and shipping dates).
- Engineer features from `product_category_id` and `customer_country`.
- Perform feature selection to identify the most predictive variables.
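
As a minimal sketch of the first two bullets, the snippet below derives `shipping_duration` in days and a frequency encoding of `customer_country`. The date strings and country names are invented, the date format of the real columns is an assumption, and `order_weekday` and `country_freq` are hypothetical feature names introduced only for this example.

```python
import pandas as pd

# Toy rows for illustration only; real date formats may differ.
df = pd.DataFrame({
    "order_date": ["2017-01-01", "2017-01-05", "2017-02-10"],
    "shipping_date": ["2017-01-03", "2017-01-05", "2017-02-16"],
    "customer_country": ["Country A", "Country B", "Country A"],
})

# Parse dates; unparseable values become NaT instead of raising.
order_ts = pd.to_datetime(df["order_date"], errors="coerce")
ship_ts = pd.to_datetime(df["shipping_date"], errors="coerce")

# Days between placing the order and shipping it.
df["shipping_duration"] = (ship_ts - order_ts).dt.days

# Day of week the order was placed (Monday=0 ... Sunday=6).
df["order_weekday"] = order_ts.dt.dayofweek

# Frequency encoding: share of rows belonging to each country.
country_share = df["customer_country"].value_counts(normalize=True)
df["country_freq"] = df["customer_country"].map(country_share)

print(df[["shipping_duration", "order_weekday", "country_freq"]])
```

Frequency encoding is one option for high-cardinality columns such as country or category identifiers; one-hot encoding remains reasonable when the number of distinct values is small.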

**Phase 5: Model Development**

- Split the dataset into training and testing sets.
- Train and evaluate multiple classification algorithms (e.g., Logistic Regression, Random Forest, XGBoost).
- Use hyperparameter tuning to optimize the best-performing model.
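
The workflow above could look roughly like the following. It runs on synthetic data from `make_classification` rather than the real dataset, compares only Logistic Regression and Random Forest, and tunes the Random Forest by assumption; the parameter grid is illustrative. XGBoost is left out because it is a separate package, and recent versions of its scikit-learn wrapper expect class labels starting at 0, so `-1/0/1` would likely need remapping first.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the prepared feature matrix and labels.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)
y = y - 1  # shift 0/1/2 to -1/0/1 to mirror the label column

# Stratify so each split keeps the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Fit each candidate and score it on the held-out test set.
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    scores[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "macro_f1": f1_score(y_test, pred, average="macro"),
    }
print(scores)

# Small illustrative grid search with 3-fold cross-validation on the training set.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid={"max_depth": [3, None], "min_samples_leaf": [1, 5]},
    scoring="f1_macro",
    cv=3,
)
search.fit(X_train, y_train)
best_model = search.best_estimator_
print(search.best_params_)
```

In practice, the model chosen for tuning should be whichever candidate scores best on the real data, and the test set should be touched only once the tuning is finished.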

**Phase 6: Insights and Recommendations**

- Analyze feature importance and describe factors contributing to delivery delays.
- Write a summary report highlighting actionable recommendations to reduce delays.
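
For the feature-importance analysis, tree ensembles in scikit-learn expose an impurity-based `feature_importances_` attribute. The sketch below ranks features on synthetic data; the `feature_0` ... `feature_5` names are hypothetical placeholders for the real column names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature names; replace with the real column names.
feature_names = [f"feature_{i}" for i in range(6)]

# Synthetic three-class data for illustration only.
X, y = make_classification(
    n_samples=200,
    n_features=6,
    n_informative=3,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=0,
)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; sort from most to least important.
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor high-cardinality features, so `sklearn.inspection.permutation_importance` on held-out data is a useful cross-check before drawing business conclusions.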

**Deliverables**

- An Exploratory Data Analysis (EDA) notebook with visualizations and data cleaning steps. (3 weeks)
- An organized Jupyter Notebook detailing the necessary project phases. (2 weeks)
- A final report summarizing findings, model performance, and recommendations. (2 weeks)

**Timeline - 7 weeks**