# HW7
### Business Context
You are working as a data analyst for an e-commerce logistics company that aims to improve its delivery performance and customer satisfaction.  
Recently, management noticed an increasing number of **late shipments**, which negatively impacts repeat purchases and customer loyalty.  

To address this, the analytics team collected operational data for roughly 11,000 shipments, including details such as:
- Shipping mode (e.g., road, flight, ship)
- Warehouse block
- Product cost and discount
- Customer service calls
- Package weight and prior purchase history

The goal is to use **predictive analytics** to understand what factors contribute to on-time delivery and to **build a simple, interpretable classification model** using **Naive Bayes**.



### Business Objective
Develop a model that predicts whether a product shipment will arrive **on time** or **delayed**, based on its shipping and customer features.  
The model will support:
- Early identification of risky shipments  
- Operational optimization (e.g., route selection, discount policies)  
- Customer experience improvement



### Data Dictionary

| Column | Description |
|---------|-------------|
| `ID` | Unique shipment identifier |
| `Warehouse_block` | Distribution center label (A, B, C, D, F) |
| `Mode_of_Shipment` | Mode of transportation (Ship, Flight, Road) |
| `Customer_care_calls` | Number of customer care calls during transit |
| `Customer_rating` | Rating given by the customer (1–5) |
| `Cost_of_the_Product` | Cost of the product in local currency |
| `Prior_purchases` | Number of previous purchases by the same customer |
| `Product_importance` | Importance level of the product (low, medium, high) |
| `Gender` | Customer gender |
| `Discount_offered` | Percentage discount applied to the product |
| `Weight_in_gms` | Product weight in grams |
| `Reached.on.Time_Y.N` | **Target variable** : 1 = On time, 0 = Delayed |

### Load and Preview the Data

In [11]:
import pandas as pd

df = pd.read_csv("shipping.csv")

print("Shape:", df.shape)
df.head()

Shape: (10999, 12)


Unnamed: 0,ID,Warehouse_block,Mode_of_Shipment,Customer_care_calls,Customer_rating,Cost_of_the_Product,Prior_purchases,Product_importance,Gender,Discount_offered,Weight_in_gms,Reached.on.Time_Y.N
0,1,D,Flight,4,2,177,3,low,F,44,1233,1
1,2,F,Flight,4,5,216,2,low,M,59,3088,1
2,3,A,Flight,2,2,183,4,low,M,48,3374,1
3,4,B,Flight,3,3,176,4,medium,M,10,1177,1
4,5,C,Flight,2,2,184,3,medium,F,46,2484,1


In [12]:
df.info()
df.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10999 entries, 0 to 10998
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   ID                   10999 non-null  int64 
 1   Warehouse_block      10999 non-null  object
 2   Mode_of_Shipment     10999 non-null  object
 3   Customer_care_calls  10999 non-null  int64 
 4   Customer_rating      10999 non-null  int64 
 5   Cost_of_the_Product  10999 non-null  int64 
 6   Prior_purchases      10999 non-null  int64 
 7   Product_importance   10999 non-null  object
 8   Gender               10999 non-null  object
 9   Discount_offered     10999 non-null  int64 
 10  Weight_in_gms        10999 non-null  int64 
 11  Reached.on.Time_Y.N  10999 non-null  int64 
dtypes: int64(8), object(4)
memory usage: 1.0+ MB


ID                     0
Warehouse_block        0
Mode_of_Shipment       0
Customer_care_calls    0
Customer_rating        0
Cost_of_the_Product    0
Prior_purchases        0
Product_importance     0
Gender                 0
Discount_offered       0
Weight_in_gms          0
Reached.on.Time_Y.N    0
dtype: int64

### Q1 Some EDA
1. Give descriptive statistics (the five-number summary) for all numerical independent variables.
2. Use `groupby()` to compute the **average on-time delivery rate** for each category in `Warehouse_block`, `Mode_of_Shipment`, and `Product_importance`. This helps identify which operational or logistical segments perform better or worse. Then answer:
- Which `Warehouse_block` has the *lowest* on-time rate?  
- Which `Mode_of_Shipment` performs *best* (highest on-time rate)?  
- Are *high-importance* products delivered faster or slower on average?  
- Based on these patterns, suggest one operational reason (e.g., congestion, route choice, product handling priority).

In [13]:
import pandas as pd

#1
num_cols = [
    "Customer_care_calls",
    "Customer_rating",
    "Cost_of_the_Product",
    "Prior_purchases",
    "Discount_offered",
    "Weight_in_gms"
]

summary = df[num_cols].describe().T[
    ["min", "25%", "50%", "75%", "max"]
]

print(summary)


#2
print("\n On-time Rate by Warehouse_block")
wb_rate = df.groupby("Warehouse_block")["Reached.on.Time_Y.N"].mean()
print(wb_rate)

print("\n On-time Rate by Mode_of_Shipment")
mos_rate = df.groupby("Mode_of_Shipment")["Reached.on.Time_Y.N"].mean()
print(mos_rate)

print("\n On-time Rate by Product_importance")
pi_rate = df.groupby("Product_importance")["Reached.on.Time_Y.N"].mean()
print(pi_rate)

                        min     25%     50%     75%     max
Customer_care_calls     2.0     3.0     4.0     5.0     7.0
Customer_rating         1.0     2.0     3.0     4.0     5.0
Cost_of_the_Product    96.0   169.0   214.0   251.0   310.0
Prior_purchases         2.0     3.0     3.0     4.0    10.0
Discount_offered        1.0     4.0     7.0    10.0    65.0
Weight_in_gms        1001.0  1839.5  4149.0  5050.0  7846.0

 On-time Rate by Warehouse_block
Warehouse_block
A    0.586470
B    0.602291
C    0.596836
D    0.597601
F    0.598472
Name: Reached.on.Time_Y.N, dtype: float64

 On-time Rate by Mode_of_Shipment
Mode_of_Shipment
Flight    0.601576
Road      0.588068
Ship      0.597561
Name: Reached.on.Time_Y.N, dtype: float64

 On-time Rate by Product_importance
Product_importance
high      0.649789
low       0.592788
medium    0.590450
Name: Reached.on.Time_Y.N, dtype: float64


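Warehouse block A has the lowest on-time rate (0.586). Flight performs best among the shipping modes (0.602). High-importance products have the highest on-time rate (0.650, versus about 0.59 for low and medium importance), so they arrive on time more often on average.

Based on these results, Warehouse A may be slower due to internal congestion, and flight shipments may perform best because air transport has shorter transit times and fewer disruptions, although the gaps between blocks and between modes are under two percentage points. High-importance products also arrive more reliably, possibly because they are typically given prioritized handling.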
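As a quick check on the answers above, the group rates can be compared programmatically. This is a minimal sketch that rebuilds the three rate tables from the printed output (rather than reloading `shipping.csv`) and reports the extreme categories and the spread within each grouping; the variable names for the spreads are illustrative.

```python
import pandas as pd

# On-time rates copied from the groupby output above.
wb_rate = pd.Series(
    {"A": 0.586470, "B": 0.602291, "C": 0.596836, "D": 0.597601, "F": 0.598472}
)
mos_rate = pd.Series({"Flight": 0.601576, "Road": 0.588068, "Ship": 0.597561})
pi_rate = pd.Series({"high": 0.649789, "low": 0.592788, "medium": 0.590450})

# Category with the lowest / highest on-time rate.
lowest_block = wb_rate.idxmin()
best_mode = mos_rate.idxmax()

# Spread (max minus min) shows how much each grouping actually separates outcomes.
block_spread = wb_rate.max() - wb_rate.min()
mode_spread = mos_rate.max() - mos_rate.min()
importance_spread = pi_rate.max() - pi_rate.min()

print("Lowest block:", lowest_block)
print("Best mode:", best_mode)
print("Spreads:", round(block_spread, 4), round(mode_spread, 4), round(importance_spread, 4))
```

The spread for `Product_importance` is several times larger than for the other two groupings, which suggests it is the more informative of the three categorical features.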

### Q2 Data Preprocessing and Feature Binning
Before training a Naive Bayes classifier, we need to:
1. Prepare the data for modeling by encoding categorical variables.  
2. Bin continuous numeric features into discrete intervals. (Briefly explain the binning method/logic for each variable.)

In [14]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# 1. Categorical Encoding

cat_cols = ["Warehouse_block", "Mode_of_Shipment", "Product_importance", "Gender"]

categorical_encoder = OneHotEncoder(handle_unknown="ignore")

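# 2. Binning Numerical Features

# Binning logic:
# - Customer_care_calls, Prior_purchases, Discount_offered: fixed edges via pd.cut,
#   set by hand to cover the observed ranges (2-7 calls, 2-10 purchases, 1-65% discount).
# - Customer_rating: already discrete (1-5), so each rating is kept as its own category.
# - Cost_of_the_Product, Weight_in_gms: tertiles via pd.qcut (three roughly equal-sized bins).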
df["CareCalls_bin"] = pd.cut(
    df["Customer_care_calls"],
    bins=[1, 3, 5, 10],
    labels=["low", "medium", "high"],
    include_lowest=True
)

df["Rating_bin"] = df["Customer_rating"].astype(str)

df["Cost_bin"] = pd.qcut(
    df["Cost_of_the_Product"],
    q=3,
    labels=["low_cost", "mid_cost", "high_cost"]
)

df["Purch_bin"] = pd.cut(
    df["Prior_purchases"],
    bins=[1, 3, 5, 15],
    labels=["low", "medium", "high"],
    include_lowest=True
)

df["Discount_bin"] = pd.cut(
    df["Discount_offered"],
    bins=[0, 5, 10, 20, 70],
    labels=["very_low", "low", "medium", "high"],
    include_lowest=True
)

df["Weight_bin"] = pd.qcut(
    df["Weight_in_gms"],
    q=3,
    labels=["light", "medium", "heavy"]
)

bin_cols = [
    "CareCalls_bin",
    "Rating_bin",
    "Cost_bin",
    "Purch_bin",
    "Discount_bin",
    "Weight_bin"
]

X = df[cat_cols + bin_cols]
y = df["Reached.on.Time_Y.N"]

df.head()

Unnamed: 0,ID,Warehouse_block,Mode_of_Shipment,Customer_care_calls,Customer_rating,Cost_of_the_Product,Prior_purchases,Product_importance,Gender,Discount_offered,Weight_in_gms,Reached.on.Time_Y.N,CareCalls_bin,Rating_bin,Cost_bin,Purch_bin,Discount_bin,Weight_bin
0,1,D,Flight,4,2,177,3,low,F,44,1233,1,medium,2,low_cost,low,high,light
1,2,F,Flight,4,5,216,2,low,M,59,3088,1,medium,5,mid_cost,low,high,medium
2,3,A,Flight,2,2,183,4,low,M,48,3374,1,low,2,low_cost,medium,high,medium
3,4,B,Flight,3,3,176,4,medium,M,10,1177,1,low,3,low_cost,medium,low,light
4,5,C,Flight,2,2,184,3,medium,F,46,2484,1,low,2,low_cost,low,high,medium
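The cell above mixes two binning styles: `pd.cut` with fixed edges and `pd.qcut` with quantile edges. The sketch below contrasts them on a small, hypothetical set of weights (the values and the fixed edges are made up for illustration and are not taken from `shipping.csv`).

```python
import pandas as pd

labels = ["light", "medium", "heavy"]

# Hypothetical toy weights in grams (illustrative only).
weights = pd.Series([1000, 1200, 1500, 2000, 4000, 4500, 5000, 5500, 6000])

# Fixed edges: intervals are (0, 2000], (2000, 5000], (5000, 8000].
fixed = pd.cut(weights, bins=[0, 2000, 5000, 8000], labels=labels)

# Quantile edges: three bins holding roughly equal numbers of rows.
quantile = pd.qcut(weights, q=3, labels=labels)

# Count rows per label for each method.
fixed_counts = {label: int((fixed == label).sum()) for label in labels}
quantile_counts = {label: int((quantile == label).sum()) for label in labels}

print("pd.cut :", fixed_counts)
print("pd.qcut:", quantile_counts)
```

Fixed edges keep bins interpretable (for example, "discount above 20%") but can leave some bins sparse, while quantile bins stay balanced at the cost of data-dependent cut points.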


### Q3 Model Training and 5-Fold Cross-Validation
- Combine the preprocessing pipeline (`preprocessor`) with a **Multinomial Naive Bayes** model.  
- Use `Pipeline` from scikit-learn to connect them.  
- Set the smoothing parameter `alpha = 1.0` (Laplace smoothing).
- Use `StratifiedKFold` to ensure balanced class proportions in each fold.  
- Compute accuracy for each fold using `cross_val_score`.  
- Report:
  - Accuracy per fold  
  - Mean accuracy  

In [15]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
import numpy as np

#1. Preprocessing
all_cat_cols = [
    "Warehouse_block",
    "Mode_of_Shipment",
    "Product_importance",
    "Gender",
    "CareCalls_bin",
    "Rating_bin",
    "Cost_bin",
    "Purch_bin",
    "Discount_bin",
    "Weight_bin"
]

preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), all_cat_cols)
    ],
    remainder="drop"
)

#2. Pipeline
model = Pipeline([
    ("preprocessor", preprocessor),
    ("clf", MultinomialNB(alpha=1.0))
])

#3. Cross-validation
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(model, X, y, cv=kf, scoring="accuracy")

print("Accuracy per fold:", np.round(scores, 5))
print("Mean accuracy:", round(scores.mean(), 5))

# Confusion matrix from out-of-fold predictions (used for the error-type question)
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

y_pred_cv = cross_val_predict(model, X, y, cv=kf)

cm = confusion_matrix(y, y_pred_cv, labels=[0, 1])
print(cm)


Accuracy per fold: [0.66545 0.65045 0.65773 0.64364 0.64347]
Mean accuracy: 0.65215
[[3075 1361]
 [2465 4098]]


Now please answer
- What is the average model accuracy?  
- Does performance vary much across folds (stable or inconsistent)?  
- Which type of error seems more common — predicting on-time when delayed, or vice versa?  

1. The average model accuracy is about 0.652.
2. The fold accuracies range from about 0.643 to 0.665, which is a small spread. This indicates that the model's performance is fairly stable across the folds.
3. Treating on-time (1) as the positive class, false negatives are more common: the model predicted delayed for 2,465 shipments that actually arrived on time, versus 1,361 false positives (predicted on-time when actually delayed).
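The error counts can be read directly from the confusion matrix printed above. The following sketch assumes the scikit-learn layout (rows are true labels, columns are predicted labels, ordered `[0, 1]`) and also derives a majority-class baseline for comparison; the baseline is an added reference point, not part of the original question.

```python
import numpy as np

# Confusion matrix copied from the output above: rows = true, columns = predicted.
cm = np.array([[3075, 1361],
               [2465, 4098]])

# With labels=[0, 1], ravel() yields TN, FP, FN, TP (positive class = 1 = on time).
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()

# Accuracy of always predicting the more frequent true class.
majority_baseline = cm.sum(axis=1).max() / cm.sum()

# Per-class recall: share of each true class that was predicted correctly.
recall_on_time = tp / (tp + fn)
recall_delayed = tn / (tn + fp)

print("FP (predicted on-time, actually delayed):", fp)
print("FN (predicted delayed, actually on-time):", fn)
print("Accuracy:", round(accuracy, 5))
print("Majority baseline:", round(majority_baseline, 5))
print("Recall on-time / delayed:", round(recall_on_time, 4), round(recall_delayed, 4))
```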

### Q4 Predict Delivery Outcome for Three Customer Profiles

#### Objective
Apply your trained Naive Bayes model to **three realistic customer profiles**, each representing a different business segment.  
You will predict the likelihood of on-time delivery and propose targeted actions to improve service quality.



#### Scenario
Your logistics company serves many types of customers.  
To help management design better policies, you will analyze three typical personas:

1. **Ada – Premium Frequent Buyer**  
   - Long purchase history, usually high-value items  
   - Expects fast, reliable delivery  
   - Typically chooses *Flight* shipping  

2. **Bob – Discount Seeker**  
   - Focused on low price and high discounts  
   - Buys heavier, lower-importance items  
   - Often uses *Road* shipment  

3. **Cindy – Occasional Shopper**  
   - Few past purchases  
   - Mid-price items, moderate importance  
   - Uses *Ship* mode occasionally  


**Tasks:**

Create a sample profile for each persona as a data frame (this part is open-ended, so answers will vary) and use your model to predict whether their next order will arrive on time.
For each of the three customer types, list the predicted outcome, the factors that seem to drive the prediction (the key business risk factor), a recommended business incentive, and an operational action plan.

Unnamed: 0,Predicted_OnTime_Prob,Predicted_Label
Ada,0.638454,1
Bob,0.999058,1
Cindy,0.615144,1
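
The table above comes from the original notebook, whose Q4 code is not shown. As a rough illustration of the workflow only, the sketch below fits the same kind of pipeline on a tiny, hypothetical training set and scores three hypothetical persona rows. The training rows, the persona feature values, and the reduced column list are all assumptions; the real pipeline expects all ten columns in `all_cat_cols`, and this sketch does not reproduce the probabilities shown above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Reduced, illustrative feature set (the notebook uses ten columns).
cols = ["Mode_of_Shipment", "Product_importance", "Purch_bin", "Discount_bin", "Weight_bin"]
target = "Reached.on.Time_Y.N"

# Hypothetical toy training rows standing in for the binned shipping data.
train = pd.DataFrame(
    [
        ["Flight", "high", "high", "very_low", "light", 1],
        ["Flight", "medium", "medium", "low", "medium", 1],
        ["Road", "low", "medium", "high", "heavy", 1],
        ["Road", "low", "low", "very_low", "heavy", 0],
        ["Ship", "medium", "low", "low", "medium", 0],
        ["Ship", "low", "low", "very_low", "heavy", 0],
        ["Flight", "high", "high", "high", "light", 1],
        ["Ship", "medium", "medium", "medium", "medium", 0],
    ],
    columns=cols + [target],
)

model = Pipeline([
    ("preprocessor", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), cols)]
    )),
    ("clf", MultinomialNB(alpha=1.0)),
])
model.fit(train[cols], train[target])

# Hypothetical persona rows, loosely following the scenario descriptions.
profiles = pd.DataFrame(
    [
        ["Flight", "high", "high", "very_low", "light"],   # Ada
        ["Road", "low", "medium", "high", "heavy"],        # Bob
        ["Ship", "medium", "low", "low", "medium"],        # Cindy
    ],
    columns=cols,
    index=["Ada", "Bob", "Cindy"],
)

# Locate the probability column that corresponds to class 1 (on time).
on_time_col = list(model.named_steps["clf"].classes_).index(1)

result = pd.DataFrame(
    {
        "Predicted_OnTime_Prob": model.predict_proba(profiles)[:, on_time_col],
        "Predicted_Label": model.predict(profiles),
    },
    index=profiles.index,
)
print(result)
```

Looking up the class-1 column through `classes_` avoids assuming that the second probability column is always the on-time class.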
