# Decision Support System in Smart Supply Chains for Predicting Late Deliveries

This notebook presents a Decision Support System (DSS) implemented as a Python-based Bayesian Network, built with pgmpy and applied to the “DataCo SMART SUPPLY CHAIN FOR BIG DATA ANALYSIS” dataset. The primary goal of the DSS is to predict which orders will be delivered late. To achieve this, the conditional probability distributions (CPDs) are estimated from the data rather than assumed or specified a priori. The performance of the DSS is compared to that of a multi-class classification machine learning model. The DSS can also generate insights and run what-if analyses to support decision-making.

_Keywords:_  Decision Support System (DSS), Bayesian Network, Supply Chain Management (SCM), decision-making


In [2]:
import pandas as pd

from pgmpy.estimators import MaximumLikelihoodEstimator, BayesianEstimator
from pgmpy.estimators import ParameterEstimator
from pgmpy.factors.discrete import TabularCPD, DiscreteFactor
from pgmpy.inference import VariableElimination
from pgmpy.models import BayesianNetwork

from sklearn.metrics import classification_report, confusion_matrix


## 1.1 Data Preparation & Cleaning

### 1.1.1 Load Source Data

In [3]:
df_source = pd.read_csv('data/raw/DataCoSupplyChainDataset.csv', encoding='unicode_escape')
df_source.drop_duplicates(inplace=True)

### 1.1.2 Data Description

In [6]:
df_description = pd.read_csv('data/raw/DescriptionDataCoSupplyChain.csv')
df_description.DESCRIPTION = df_description.DESCRIPTION.str.replace(':', '')

#df_description.style.set_properties(**{'text-align': 'left'})
df_description

| Unnamed: 0 | FIELDS | DESCRIPTION |
|---|---|---|
| 0 | Type | Type of transaction made |
| 1 | Days for shipping (real) | Actual shipping days of the purchased product |
| 2 | Days for shipment (scheduled) | Days of scheduled delivery of the purchased ... |
| 3 | Benefit per order | Earnings per order placed |
| 4 | Sales per customer | Total sales per customer made per customer |
| 5 | Delivery Status | Delivery status of orders Advance shipping ,... |
| 6 | Late_delivery_risk | Categorical variable that indicates if sendi... |
| 7 | Category Id | Product category code |
| 8 | Category Name | Description of the product category |
| 9 | Customer City | City where the customer made the purchase |


## 1.2 Data Visualization & Analysis

In [None]:
# payment type
df_source['Type'].value_counts().plot(kind='bar')

In [None]:
# shipping mode
df_source['Shipping Mode'].value_counts().plot(kind='bar')

In [None]:
# customer segment
df_source['Customer Segment'].value_counts().plot(kind='bar')

In [None]:
# Days for shipment (scheduled)
df_source['Days for shipment (scheduled)'].value_counts().plot(kind='bar')

In [None]:
# Late_delivery_risk
df_source['Late_delivery_risk'].value_counts().plot(kind='bar')

In [None]:
# Customer State
customer_state_vc = df_source['Customer State'].value_counts()

print(f'Number of states: {len(customer_state_vc)}')
print(customer_state_vc)

In [None]:
# market
df_source['Market'].value_counts().plot(kind='bar')

> - Which store is most likely to have a late delivery risk?
> - Which store does not deliver to a given market?

In [None]:
# Customer State (state of the store where the purchase is registered) for the Africa market
df_source[df_source['Market'] == 'Africa']['Customer State'].value_counts()

In [None]:
# customer city
customer_city_vc = df_source['Customer City'].value_counts()
customer_city_vc

In [None]:
# delivery status
df_source['Delivery Status'].value_counts().plot(kind='bar')

## 1.3 Feature Engineering

Select the columns that will be used as the network nodes, together with `Order Id` to identify each order, and remove duplicate records so that there is a single record per order.

In [None]:
nodes = [
    'Order Id',
    'Shipping Mode',
    'Customer Segment',
    'Days for shipment (scheduled)',
    'Delivery Status',
    'Customer State',
    'Market',
]

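# 'Type' is not among the selected columns, so only 'Customer State' needs renaming;
# it holds the state of the store where the purchase was registered
df_data = df_source[nodes] \
    .rename(columns={'Customer State': 'Store State'}) \
    .drop_duplicates() \
    .reset_index(drop=True)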


print(df_data.shape)
with pd.option_context('display.max_columns', None):
    display(df_data.head())

In [None]:
# find columns with missing values
df_data.isnull().sum()

### 1.3.1 Create a Training and Test Set

In [None]:
random_state = 98421

In [None]:
# hold out 30% of the orders as the test set and train on the remaining 70%
df_test = df_data.sample(frac=0.3, random_state=random_state)
df_train = df_data.drop(df_test.index)

# reset the index of both datasets
df_train.reset_index(drop=True, inplace=True)
df_test.reset_index(drop=True, inplace=True)

print(f'Training dataset shape : {df_train.shape}')
print(f'Test dataset shape     : {df_test.shape}')

## 1.4 Model Training

### 1.4.1 Model Definition

Define the model nodes and edges.

In [None]:
# Create the Bayesian network model
model = BayesianNetwork()

# Add the nodes to the model
model.add_node('Shipping Mode')
model.add_node('Customer Segment')
model.add_node('Days for shipment (scheduled)')
model.add_node('Delivery Status')
model.add_node('Store State')
model.add_node('Market')


# Add the edges between the nodes to the model
model.add_edge('Delivery Status', 'Shipping Mode')
model.add_edge('Delivery Status', 'Customer Segment')
model.add_edge('Delivery Status', 'Days for shipment (scheduled)')
model.add_edge('Delivery Status', 'Store State')
model.add_edge('Delivery Status', 'Market')
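
Because every edge above points from `Delivery Status` to one of the other five nodes, the network has the structure of a naive Bayes classifier: the five features are conditionally independent of each other given the delivery status. As a sketch, writing $D$ for `Delivery Status` and $X_1, \dots, X_5$ for the remaining nodes (symbols introduced here only for illustration), the joint distribution and the posterior used for prediction take the following form:

```latex
P(D, X_1, \dots, X_5) = P(D) \prod_{i=1}^{5} P(X_i \mid D)
\qquad\Longrightarrow\qquad
P(D \mid x_1, \dots, x_5) \propto P(D) \prod_{i=1}^{5} P(x_i \mid D)
```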


### 1.4.2 Compute the Conditional Probability Distributions (CPDs)

In [None]:
pe = ParameterEstimator(model, df_train)
pe.state_counts('Shipping Mode')
#pe.state_counts('Delivery Status')

In [None]:
est = BayesianEstimator(model, df_train)

print(est.estimate_cpd('Customer Segment', prior_type='BDeu', equivalent_sample_size=10))
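
To make the effect of `prior_type='BDeu'` and `equivalent_sample_size=10` more concrete, the following is a minimal pure-Python sketch of how a BDeu prior smooths one column of a CPD. A BDeu prior spreads the equivalent sample size uniformly over all cells of the CPD, so each cell receives the pseudo-count `equivalent_sample_size / (number of node states * number of parent configurations)`. The helper name `bdeu_cpd_column`, the toy observations, and the assumption of four delivery statuses are illustrative only and are not taken from the notebook's data.

```python
from collections import Counter


def bdeu_cpd_column(observations, states, equivalent_sample_size, n_parent_configs):
    """Estimate P(node | one parent configuration) with a BDeu prior.

    Every cell of the CPD receives the same pseudo-count:
    equivalent_sample_size / (len(states) * n_parent_configs).
    """
    pseudo_count = equivalent_sample_size / (len(states) * n_parent_configs)
    counts = Counter(observations)
    total = len(observations) + pseudo_count * len(states)
    return {state: (counts[state] + pseudo_count) / total for state in states}


# Hypothetical toy data: customer segments observed for a single delivery status.
segments = ['Consumer', 'Corporate', 'Home Office']
observed = ['Consumer'] * 6 + ['Corporate'] * 3 + ['Home Office'] * 1

# Assumption: the parent node (Delivery Status) has four states.
cpd = bdeu_cpd_column(observed, segments, equivalent_sample_size=10, n_parent_configs=4)

for state, probability in cpd.items():
    print(f'{state:<12} {probability:.4f}')
```

With only ten observations the prior visibly pulls the estimates toward a uniform distribution. On a training set with many thousands of orders, a pseudo-count of this size changes the estimates very little.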

In [None]:
# # Create a MaximumLikelihoodEstimator object
# estimator = MaximumLikelihoodEstimator(model, df_train)

# estimator.get_parameters(weighted=True)

In [None]:
model.fit(
    data=df_train, 
    estimator=BayesianEstimator,
    prior_type='BDeu',
    equivalent_sample_size=10)

In [None]:
# model.fit(
#     data=df_train, 
#     estimator=MaximumLikelihoodEstimator)

In [None]:
# Check if the model is valid
model.check_model()

In [None]:
model_cpds = model.get_cpds()
model_cpds

In [None]:
for cpd in model_cpds:
    print(f'--- {cpd.variable} ---')
    print(cpd.values, end='\n\n')

## 1.5 Model Details

In [None]:
model.active_trail_nodes('Delivery Status')

In [None]:
model.get_independencies()

## 1.6 Model Queries

In [None]:
infer = VariableElimination(model)
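
For a network in which `Delivery Status` is the only parent of every other node, a query for the delivery status given some evidence reduces to multiplying the prior by the matching likelihood of each observed feature and normalizing. The sketch below reproduces that calculation by hand. It is not how pgmpy implements variable elimination in general, and the function name `posterior` and all probabilities are hypothetical values chosen for illustration.

```python
def posterior(prior, likelihoods, evidence):
    """Return P(class | evidence) for a naive Bayes structure.

    prior:       {class_state: P(class_state)}
    likelihoods: {feature: {class_state: {feature_state: P(feature_state | class_state)}}}
    evidence:    {feature: observed feature_state}
    """
    unnormalized = {}
    for class_state, p_class in prior.items():
        score = p_class
        for feature, value in evidence.items():
            score *= likelihoods[feature][class_state][value]
        unnormalized[class_state] = score
    total = sum(unnormalized.values())
    return {state: score / total for state, score in unnormalized.items()}


# Hypothetical numbers for illustration only; they are not taken from the dataset.
prior = {'Late delivery': 0.5, 'Shipping on time': 0.5}
likelihoods = {
    'Shipping Mode': {
        'Late delivery': {'First Class': 0.3, 'Standard Class': 0.7},
        'Shipping on time': {'First Class': 0.1, 'Standard Class': 0.9},
    },
    'Market': {
        'Late delivery': {'Europe': 0.4, 'Pacific Asia': 0.6},
        'Shipping on time': {'Europe': 0.5, 'Pacific Asia': 0.5},
    },
}

result = posterior(prior, likelihoods, {'Shipping Mode': 'First Class', 'Market': 'Europe'})
print(result)
```

Features that are absent from the evidence can simply be left out: each of their conditional distributions sums to one, so they do not change the posterior.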

In [None]:
def get_ratios(probabilities: DiscreteFactor, col_name: str='Ratio') -> pd.DataFrame:
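    """
    Show the probabilities of a single-variable factor as a sorted DataFrame.

    Parameters
    ----------
    probabilities : DiscreteFactor
        Query result over a single variable.
    col_name : str
        Name of the column that holds the probabilities.

    Returns
    -------
    pd.DataFrame
        One row per state, sorted by probability in descending order.
    """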
    # Get the probabilities for each value
    value_probabilities = probabilities.values
    variable_name = list(probabilities.state_names.keys())[0]
    state_names = probabilities.state_names[variable_name]

    # create a dataframe with the probabilities
    return pd.DataFrame(
        data=value_probabilities,
        index=state_names,
        columns=[col_name]) \
        .sort_values(by=col_name, ascending=False)

### 1.6.1 What is the probability of a late delivery?

In [None]:
late_delivery = infer.query(
    variables=['Delivery Status'],
    joint=False)

print(late_delivery['Delivery Status'])

### 1.6.2 Which states handle the most late deliveries?

We can see that these ratios are close to each state's share of the total number of orders, given that we know that 54.7% of the orders are late.
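
This similarity follows from Bayes' rule, P(state | late) = P(late | state) · P(state) / P(late): when the late rate is roughly the same in every state, the factor P(late | state) / P(late) is close to 1 and the posterior stays close to the state's share of orders. The sketch below illustrates this with hypothetical order shares. The function name `state_given_late` and the per-state figures are assumptions for illustration, and only the 54.7% late rate comes from the text above.

```python
def state_given_late(p_state, p_late_given_state):
    """Apply Bayes' rule to obtain P(state | late) from P(state) and P(late | state)."""
    p_late = sum(p_state[s] * p_late_given_state[s] for s in p_state)
    return {s: p_state[s] * p_late_given_state[s] / p_late for s in p_state}


# Hypothetical order shares per state (not taken from the dataset).
p_state = {'PR': 0.40, 'CA': 0.35, 'NY': 0.25}

# Case 1: identical late rate in every state, so the posterior equals the order share.
uniform = state_given_late(p_state, {'PR': 0.547, 'CA': 0.547, 'NY': 0.547})

# Case 2: one state is late more often, so its share among late orders grows.
skewed = state_given_late(p_state, {'PR': 0.547, 'CA': 0.547, 'NY': 0.8})

print(uniform)
print(skewed)
```

States whose posterior differs noticeably from their order share are therefore the ones worth a closer look.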

In [None]:
# Query the distribution of store states among late deliveries
state_late_delivery = infer.query(
    variables=['Store State'],
    evidence={'Delivery Status': 'Late delivery'})

get_ratios(state_late_delivery).head(5)

In [None]:
df_train['Store State'].value_counts(normalize=True).head(5)

#### 1.6.2.1 What is the probability of a late delivery per store?

In [None]:
df_store_late_delivery = pd.DataFrame()

for state_name in list(state_late_delivery.state_names.values())[0]:
    # get the delivery status probabilities for the state
    df_state = get_ratios(
        infer.query(
            variables=['Delivery Status'],
            evidence={'Store State': state_name})
    )

    # add the state name to the dataframe
    df_state['Store State'] = state_name

    # append the dataframe to the main dataframe
    # (DataFrame.append was removed in pandas 2.0, so use pd.concat)
    df_store_late_delivery = pd.concat([
        df_store_late_delivery,
        df_state
            .reset_index(drop=False)
            .rename(columns={'index': 'Delivery Status'})
    ])

# show the stores with the highest probability of late delivery
df_store_late_delivery \
    .query('`Delivery Status` == "Late delivery"') \
    .drop(columns=['Delivery Status']) \
    .reindex(columns=['Store State', 'Ratio']) \
    .sort_values(by='Ratio', ascending=False) \
    .reset_index(drop=True) \
    .head(5)

> We have discovered that the [state](https://www.scouting.org/resources/los/states/) of Delaware (DE) has the highest probability of late delivery, as reflected in the data below. This is an example of a finding that warrants further investigation.

In [None]:
df_train[df_train['Store State'] == 'DE']

### 1.6.3 Which Market has the most late deliveries?

In [None]:
# Distribution of markets among late deliveries
market_late_delivery = infer.query(
    variables=['Market'],
    evidence={'Delivery Status': 'Late delivery'})

get_ratios(market_late_delivery).head(5)


#### 1.6.3.1 What is the probability of a late delivery in Europe?

In [None]:
get_ratios(
    infer.query(
        variables=['Delivery Status'],
        evidence={'Market': 'Europe'}),
    col_name='Probability')

#### 1.6.3.2 Which shipping method has the most late deliveries in the Pacific Asia Market?

In [None]:
shipping_mode_late_delivery = infer.query(
    variables=['Shipping Mode'],
    evidence={
        'Delivery Status': 'Late delivery',
        'Market': 'Pacific Asia'
    })

get_ratios(shipping_mode_late_delivery).head(5)

#### 1.6.3.3 What is the probability of a late delivery for orders in the Pacific Asia Market shipped using Standard Class?

In [None]:
get_ratios(
    infer.query(
        variables=['Delivery Status'],
        evidence={
            'Shipping Mode': 'Standard Class',
            'Market': 'Pacific Asia'
        }),
    col_name='Probability'
).loc['Late delivery']

### 1.6.4 Which customer segment has the most late deliveries?

In [None]:
get_ratios(
    infer.query(
        variables=['Customer Segment'],
        evidence={
            'Delivery Status': 'Late delivery',
        })
)

In [None]:
df_train['Customer Segment'].value_counts(normalize=True).head(5)

#### 1.6.4.1 Which shipping method is most likely to have a late delivery in the Corporate segment?

In [None]:
get_ratios(
    infer.query(
        variables=['Shipping Mode'],
        evidence={
            'Delivery Status': 'Late delivery',
            'Customer Segment': 'Corporate'
        })
)

In [None]:
# customer segment is corporate and delivery status is late delivery
df_train[
    (df_train['Customer Segment'] == 'Corporate') &
    (df_train['Delivery Status'] == 'Late delivery')
]['Shipping Mode'].value_counts(normalize=True).head(5)

## 1.7 Model Evaluation

In [None]:
# prepare the evaluation data: drop duplicate rows and reset the index
df_eval = df_test \
    .drop_duplicates() \
    .reset_index(drop=True)

df_eval.head()

In [None]:
# predict the label
df_predict = model.predict(df_eval.drop(columns=['Order Id', 'Delivery Status']))
df_predict.rename(columns={
    'Delivery Status': 'y_pred'}, inplace=True)

# join the prediction back to the evaluation data
df_eval = df_eval.join(df_predict)

df_eval.head()

In [None]:
# show the confusion matrix
confusion_matrix(
    y_true=df_eval['Delivery Status'],
    y_pred=df_eval['y_pred'])

In [None]:
# show the classification report
print(classification_report(
    y_true=df_eval['Delivery Status'],
    y_pred=df_eval['y_pred']))
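
The per-class figures in the classification report can be derived directly from the true and predicted labels. The following sketch computes precision and recall per class by hand. The function name `per_class_metrics` and the toy labels are hypothetical, and returning 0.0 when a class has no predictions or no true samples is an assumption made here for simplicity.

```python
def per_class_metrics(y_true, y_pred):
    """Compute precision and recall per class from paired label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    metrics = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)  # true positives
        fp = sum(t != label and p == label for t, p in pairs)  # false positives
        fn = sum(t == label and p != label for t, p in pairs)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        metrics[label] = {'precision': precision, 'recall': recall}
    return metrics


# Hypothetical labels for illustration only.
y_true = ['Late delivery', 'Late delivery', 'Shipping on time', 'Advance shipping']
y_pred = ['Late delivery', 'Shipping on time', 'Shipping on time', 'Late delivery']

metrics = per_class_metrics(y_true, y_pred)
for label, values in metrics.items():
    print(label, values)
```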

### 1.7.1 Create an AutoML Model for Comparison

In [None]:
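# import only the PyCaret functions used below instead of a wildcard import
from pycaret.classification import setup, compare_models, plot_model, predict_model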

In [None]:
df_train

In [None]:
classifier = setup(
    data=df_train.drop(columns=['Order Id']),
    target='Delivery Status',
    train_size=0.7,
    session_id=random_state,
    verbose=False)

In [None]:
# perform a model comparison
models = compare_models(n_select=3)
top_model = models[0]

In [None]:
plot_model(top_model, plot='feature')

In [None]:
# Unseen data prediction ('Order Id' is dropped to match the columns used in setup)
df_predicted = predict_model(estimator=top_model, data=df_test.drop(columns=['Order Id']))

In [None]:
# show the confusion matrix
confusion_matrix(
    y_true=df_predicted['Delivery Status'],
    y_pred=df_predicted['prediction_label'])

In [None]:
# show the classification report
print(classification_report(
    y_true=df_predicted['Delivery Status'],
    y_pred=df_predicted['prediction_label']))