# 📦 E-Commerce Shipping Data

*Was the Product Shipment Delivered on Time?*

## Dataset Overview

This dataset is provided by an *international e-commerce company selling electronic products*, aiming to uncover insights from its customer database.

### Dataset Description

- **Source**: [Kaggle](https://www.kaggle.com/datasets/prachi13/customer-analytics)
- **Purpose**: To analyze factors affecting timely product delivery and improve service.

### Key Variables

- **ID**: Unique customer identifier.
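- **Warehouse Block**: Section of the warehouse where the product is stored. The dataset description lists blocks A to E, but the rows shown in this notebook contain the labels A, B, C, D, and F.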
- **Mode of Shipment**: Shipping method (e.g., Ship, Flight, Road).
- **Customer Care Calls**: The number of calls made by customers to inquire about the status of their shipments.
- **Customer Rating**: Rating from 1 (worst) to 5 (best).
- **Cost of the Product**: Price in US dollars.
- **Prior Purchases**: Number of prior purchases by the customer.
- **Product Importance**: Categorized as low, medium, or high.
- **Gender**: Customer's gender (Male/Female).
- **Discount Offered**: Discount offered on the product.
- **Weight in Grams**: Product weight in grams.
- **Reached on Time**: Target variable. Note the inverted encoding: 1 = the product did **not** arrive on time, 0 = it arrived on time.
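Because the target encoding is easy to misread (1 means late), a readable label column can help when plotting or tabulating. The following is a minimal sketch, assuming the column has already been renamed to `Reached_on_Time`; the `sample` frame, its values, and the `Delivery_status` column name are illustrative and not part of the dataset.

```python
import pandas as pd

# Tiny illustrative frame; the values are hypothetical stand-ins for the real column.
sample = pd.DataFrame({"Reached_on_Time": [1, 1, 0, 0, 1]})

# Per the dataset description: 1 = NOT on time, 0 = on time.
status_labels = {1: "late", 0: "on time"}

# Hypothetical helper column that makes the encoding explicit.
sample["Delivery_status"] = sample["Reached_on_Time"].map(status_labels)

print(sample["Delivery_status"].value_counts())
```

Keeping the original 0/1 column alongside the label column leaves the target ready for modeling while making charts easier to read.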

### Project Goals

In this notebook, I will first focus on **exploratory data analysis (EDA)** to gain a comprehensive understanding of the dataset and the relationships between different variables. This step is crucial for preparing the dataset for the second part, which will be dedicated to the **development of machine learning models**.

The aim is to **predict whether an item will arrive on time** based on the characteristics listed above. Identifying trends through visualizations and statistical analysis can help the company improve its operations and serve customers better.

## Getting Started

In [1]:
# Import necessary libraries for EDA
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('ggplot')
%matplotlib inline

In [2]:
# Load the data
data = pd.read_csv('/Users/arnaudrivat/Documents/Porfolio Arnaud Rivat/Projet n°1/Train.csv')
print("The first 5 rows of the dataframe") 
data.head()

The first 5 rows of the dataframe


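| | ID | Warehouse_block | Mode_of_Shipment | Customer_care_calls | Customer_rating | Cost_of_the_Product | Prior_purchases | Product_importance | Gender | Discount_offered | Weight_in_gms | Reached.on.Time_Y.N |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | D | Flight | 4 | 2 | 177 | 3 | low | F | 44 | 1233 | 1 |
| 1 | 2 | F | Flight | 4 | 5 | 216 | 2 | low | M | 59 | 3088 | 1 |
| 2 | 3 | A | Flight | 2 | 2 | 183 | 4 | low | M | 48 | 3374 | 1 |
| 3 | 4 | B | Flight | 3 | 3 | 176 | 4 | medium | M | 10 | 1177 | 1 |
| 4 | 5 | C | Flight | 2 | 2 | 184 | 3 | medium | F | 46 | 2484 | 1 |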


## Data Wrangling

In [3]:
# Dropping the 'ID' column as it is not relevant for the analysis or modeling
data.drop('ID', axis=1, inplace=True)

# Rename the column 'Reached.on.Time_Y.N' to 'Reached_on_Time'
data.rename(columns={'Reached.on.Time_Y.N': 'Reached_on_Time'}, inplace=True)

data

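| | Warehouse_block | Mode_of_Shipment | Customer_care_calls | Customer_rating | Cost_of_the_Product | Prior_purchases | Product_importance | Gender | Discount_offered | Weight_in_gms | Reached_on_Time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | D | Flight | 4 | 2 | 177 | 3 | low | F | 44 | 1233 | 1 |
| 1 | F | Flight | 4 | 5 | 216 | 2 | low | M | 59 | 3088 | 1 |
| 2 | A | Flight | 2 | 2 | 183 | 4 | low | M | 48 | 3374 | 1 |
| 3 | B | Flight | 3 | 3 | 176 | 4 | medium | M | 10 | 1177 | 1 |
| 4 | C | Flight | 2 | 2 | 184 | 3 | medium | F | 46 | 2484 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10994 | A | Ship | 4 | 1 | 252 | 5 | medium | F | 1 | 1538 | 1 |
| 10995 | B | Ship | 4 | 1 | 232 | 5 | medium | F | 6 | 1247 | 0 |
| 10996 | C | Ship | 5 | 4 | 242 | 5 | low | F | 4 | 1155 | 0 |
| 10997 | F | Ship | 5 | 2 | 223 | 6 | medium | M | 2 | 1210 | 0 |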


- After dropping `ID`, the dataset contains 10,999 observations and 11 variables (10 features plus the target `Reached_on_Time`).
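The same wrangling step can also be written without `inplace=True`, which returns a new frame and leaves the loaded data untouched. This is only a sketch: `raw` is a three-row stand-in built from a few columns of the preview above rather than the real `Train.csv`, and the names `raw` and `clean` are illustrative.

```python
import pandas as pd

# Stand-in for the CSV: three rows and four columns taken from the preview above.
raw = pd.DataFrame({
    "ID": [1, 2, 3],
    "Warehouse_block": ["D", "F", "A"],
    "Mode_of_Shipment": ["Flight", "Flight", "Flight"],
    "Reached.on.Time_Y.N": [1, 1, 1],
})

# Chain drop and rename; each call returns a new DataFrame, so `raw` is unchanged.
clean = (
    raw.drop(columns="ID")
       .rename(columns={"Reached.on.Time_Y.N": "Reached_on_Time"})
)

# Confirm the resulting dimensions explicitly.
n_rows, n_cols = clean.shape
print(f"{n_rows} observations, {n_cols} variables")
```

Keeping `raw` intact makes it easy to re-run a cell without reloading the file, at the cost of holding two frames in memory.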

In [4]:
# Data type of each column and Missing values
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10999 entries, 0 to 10998
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Warehouse_block      10999 non-null  object
 1   Mode_of_Shipment     10999 non-null  object
 2   Customer_care_calls  10999 non-null  int64 
 3   Customer_rating      10999 non-null  int64 
 4   Cost_of_the_Product  10999 non-null  int64 
 5   Prior_purchases      10999 non-null  int64 
 6   Product_importance   10999 non-null  object
 7   Gender               10999 non-null  object
 8   Discount_offered     10999 non-null  int64 
 9   Weight_in_gms        10999 non-null  int64 
 10  Reached_on_Time      10999 non-null  int64 
dtypes: int64(7), object(4)
memory usage: 945.4+ KB


- All 11 columns have 10,999 non-null values, so **there are no missing values**.
- **The data types are appropriate** for exploratory analysis: the four categorical columns are stored as `object` and the seven numerical columns as `int64`.
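A complementary check is to count missing values per column directly and to list the distinct levels of a categorical column. The sketch below uses three columns from the five preview rows as a stand-in for the full dataset; `sample`, `missing_per_column`, and `block_levels` are illustrative names.

```python
import pandas as pd

# Stand-in data: three columns from the five preview rows shown earlier.
sample = pd.DataFrame({
    "Warehouse_block": ["D", "F", "A", "B", "C"],
    "Customer_rating": [2, 5, 2, 3, 2],
    "Weight_in_gms": [1233, 3088, 3374, 1177, 2484],
})

# Number of missing values in each column.
missing_per_column = sample.isna().sum()
total_missing = int(missing_per_column.sum())
print(missing_per_column)

# Distinct levels of a categorical column, useful for spotting unexpected labels.
block_levels = sorted(sample["Warehouse_block"].unique())
print(block_levels)
```

Applied to the full frame, the same two calls would confirm the `info()` result and show which warehouse labels actually occur.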

In [5]:
# Count the number of numerical and categorical variables
num = data.select_dtypes(include='number').shape[1] 
cat = data.select_dtypes(include='object').shape[1]  

print(f'Number of categorical variables: {cat}')
print(f'Number of numerical variables: {num}')

Number of categorical variables: 4
Number of numerical variables: 7


- The four categorical variables will be encoded numerically later, when I develop the machine learning models.
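The notebook does not say which encoding will be used. As one possible approach, the sketch below one-hot encodes a nominal column with `pd.get_dummies` and maps an ordered column to integers. The toy rows and the assumed ordering low < medium < high are illustrative choices, not taken from the modeling part.

```python
import pandas as pd

# Toy rows using category labels that appear in the dataset description.
sample = pd.DataFrame({
    "Mode_of_Shipment": ["Flight", "Ship", "Road"],
    "Product_importance": ["low", "medium", "high"],
})

# Assumed ordinal mapping: the levels have a natural order.
importance_order = {"low": 0, "medium": 1, "high": 2}
sample["Product_importance"] = sample["Product_importance"].map(importance_order)

# One-hot encode the nominal column; dtype=int yields 0/1 instead of booleans.
encoded = pd.get_dummies(sample, columns=["Mode_of_Shipment"], dtype=int)
print(encoded)
```

One-hot encoding avoids implying an order between shipment modes, while the integer mapping preserves the order that product importance does have.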

In [6]:
# Check for duplicate rows
data.duplicated().sum()

np.int64(0)

- Each row is unique; **there are no duplicates.**
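Had duplicates been present, they could have been counted and removed as follows. This is a sketch on a toy frame with one deliberately repeated row; `toy` and `deduplicated` are illustrative names.

```python
import pandas as pd

# Toy frame in which the second row repeats the first.
toy = pd.DataFrame({
    "Mode_of_Shipment": ["Flight", "Flight", "Ship"],
    "Weight_in_gms": [1233, 1233, 1538],
})

# duplicated() flags every repeat of an earlier row (the first occurrence is kept).
n_duplicates = int(toy.duplicated().sum())
print(f"Duplicate rows: {n_duplicates}")

# Drop the repeats and rebuild a clean 0..n-1 index.
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(deduplicated)
```

Note that `duplicated()` compares all columns by default, so dropping `ID` beforehand means two different orders with identical attributes would count as duplicates.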