# End-to-End Pricing and Profitability Analysis  
## Superstore Global Sales Dataset

### Project Overview
This project focuses on analysing pricing, discounting, and profitability patterns using a real-world retail dataset.  
The goal is to understand how discounts, pricing decisions, and operational factors impact profit outcomes, while explicitly addressing data quality limitations and business constraints.

### Key Objectives
- Analyse the impact of discounts on revenue and profit
- Identify loss-making orders and products
- Build transparent, decision-ready features for downstream SQL and Power BI analysis
- Document all data quality issues and analytical assumptions


## Business Context and Core Questions

### Business Context
Despite strong sales performance, retail businesses often struggle with inconsistent profitability.  
High revenue does not necessarily translate into healthy margins, especially when discounts are applied without clear rules or evidence-based evaluation.

This project is framed around a common business concern:
**Are discounts genuinely driving profitable growth, or are they masking underlying pricing and cost issues?**

---

### Core Business Questions

#### Q1. Sales vs Profit Alignment
Where do sales and profit diverge across products, categories, regions, and customer segments?  
Which areas generate high revenue but consistently erode profit?

#### Q2. Discount Impact on Profitability
How do different discount levels affect profit and sales volume?  
At what point does discounting become harmful rather than beneficial?

#### Q3. Decision-Oriented Pricing Rules
Based on historical data, what practical discount and pricing rules can be defined to reduce loss-making orders and improve margin stability?

---

### Analytical Focus
The analysis prioritises **decision-making insight** over descriptive reporting.  
Each question is designed to lead to a clear business action rather than a standalone metric.


## Load Libraries
Import the core Python libraries required for data manipulation and numerical operations.


In [2]:
import pandas as pd
import numpy as np

## Load Dataset
Load the raw Superstore orders dataset into a DataFrame for initial inspection.


In [3]:
df = pd.read_csv("E:/Project/GITHub/Project01/SuperStoreOrders.csv/SuperStoreOrders.csv")
df.head()

Unnamed: 0,order_id,order_date,ship_date,ship_mode,customer_name,segment,state,country,market,region,...,category,sub_category,product_name,sales,quantity,discount,profit,shipping_cost,order_priority,year
0,AG-2011-2040,1/1/2011,6/1/2011,Standard Class,Toby Braunhardt,Consumer,Constantine,Algeria,Africa,Africa,...,Office Supplies,Storage,"Tenex Lockers, Blue",408,2,0.0,106.14,35.46,Medium,2011
1,IN-2011-47883,1/1/2011,8/1/2011,Standard Class,Joseph Holt,Consumer,New South Wales,Australia,APAC,Oceania,...,Office Supplies,Supplies,"Acme Trimmer, High Speed",120,3,0.1,36.036,9.72,Medium,2011
2,HU-2011-1220,1/1/2011,5/1/2011,Second Class,Annie Thurman,Consumer,Budapest,Hungary,EMEA,EMEA,...,Office Supplies,Storage,"Tenex Box, Single Width",66,4,0.0,29.64,8.17,High,2011
3,IT-2011-3647632,1/1/2011,5/1/2011,Second Class,Eugene Moren,Home Office,Stockholm,Sweden,EU,North,...,Office Supplies,Paper,"Enermax Note Cards, Premium",45,3,0.5,-26.055,4.82,High,2011
4,IN-2011-47883,1/1/2011,8/1/2011,Standard Class,Joseph Holt,Consumer,New South Wales,Australia,APAC,Oceania,...,Furniture,Furnishings,"Eldon Light Bulb, Duo Pack",114,5,0.1,37.77,4.7,Medium,2011


## Data Quality Assessment
Check data types, missing values, duplicates, and basic data integrity issues.


In [4]:
df.dtypes


order_id           object
order_date         object
ship_date          object
ship_mode          object
customer_name      object
segment            object
state              object
country            object
market             object
region             object
product_id         object
category           object
sub_category       object
product_name       object
sales              object
quantity            int64
discount          float64
profit            float64
shipping_cost     float64
order_priority     object
year                int64
dtype: object



Initial inspection showed that several columns, including the order and shipping dates and `sales`, were stored as `object` types. To support time-based analysis and prevent downstream calculation issues, the date fields were converted to datetime with `errors="coerce"`, which turns any value that cannot be parsed into `NaT` instead of raising an error.


In [5]:
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df["ship_date"] = pd.to_datetime(df["ship_date"], errors="coerce")
df[["order_date", "ship_date"]].dtypes


order_date    datetime64[ns]
ship_date     datetime64[ns]
dtype: object
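
Because `errors="coerce"` silently converts unparseable strings to `NaT`, it can be worth separating parse failures from genuinely empty fields before interpreting any missing-value count. The sketch below is a minimal illustration on made-up values, not the project's data; the explicit `format` strings and all variable names are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical raw date strings in day/month/year order, plus one empty field
raw = pd.Series(["1/1/2011", "6/1/2011", "13/1/2011", None])

# Parse the same strings under two explicit format assumptions
month_first = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")
day_first = pd.to_datetime(raw, format="%d/%m/%Y", errors="coerce")

# A parse failure is a value that was present in the raw data but became NaT
failed_month_first = int((raw.notna() & month_first.isna()).sum())
failed_day_first = int((raw.notna() & day_first.isna()).sum())

print("Parse failures (month-first):", failed_month_first)
print("Parse failures (day-first):", failed_day_first)
print(pd.DataFrame({"raw": raw, "month_first": month_first, "day_first": day_first}))
```

If the failure count drops sharply under one format, the missing dates are more likely a parsing artefact than a gap in data capture.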

### Missing Values Assessment

In [6]:
key_cols = [
    "order_id",
    "order_date",
    "product_id",
    "customer_name",
    "sales",
    "quantity",
    "discount",
    "profit"
]

df[key_cols].isna().sum().sort_values(ascending=False)


order_date       31223
order_id             0
product_id           0
customer_name        0
sales                0
quantity             0
discount             0
profit               0
dtype: int64

In [7]:
(df[key_cols].isna().mean() * 100).round(2).sort_values(ascending=False)


order_date       60.88
order_id          0.00
product_id        0.00
customer_name     0.00
sales             0.00
quantity          0.00
discount          0.00
profit            0.00
dtype: float64

After identifying the key business-critical columns, missing values were assessed to understand potential risks to downstream analysis.

The inspection points to significant gaps in two critical fields:

- **Order Date (~60.9%)**  
  31,223 records have no usable order date after conversion. Because the column was parsed with `errors="coerce"`, this count includes values that failed to parse as well as genuinely empty fields, so it may reflect a date-format problem rather than an issue in the order capture process or data migration. The previews are consistent with this reading: a Standard Class order placed on `1/1/2011` with ship date `6/1/2011` is parsed as shipping on 2011-06-01, which suggests day-first dates being read as month-first. Whatever the cause, missing order dates directly affect time-based analysis such as trend evaluation, seasonality, and delivery performance metrics.

- **Sales (~5.1%)**  
  A smaller but non-negligible portion of records has no usable sales value. This gap does not appear in the output above, where `sales` is still stored as `object` and reports zero missing values; it surfaces only once the column is coerced to numeric in the validation step. These records represent transactions with incomplete financial information and pose a direct risk to revenue, profitability, and pricing analyses.

All other key business columns showed no missing values, indicating that the core transactional structure remains largely intact.

Importantly, no records were removed at this stage. The focus was placed on understanding the root causes of missing data and defining controlled, documented remediation strategies before applying any corrective action, so that every later correction can be traced and justified.
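
For columns stored as `object`, a missing-value count taken before numeric conversion can understate the problem. The following sketch shows one way to measure how many values are lost during coercion. The sample strings are hypothetical, and the thousands separator is only an assumed cause of conversion failure, not a confirmed property of this dataset.

```python
import pandas as pd

# Hypothetical raw sales strings; "1,234" stands in for any non-numeric text
raw_sales = pd.Series(["408", "1,234", "66", None])

# Missing values as seen before conversion (only truly empty fields)
missing_before = int(raw_sales.isna().sum())

# Missing values after coercion (empty fields plus unparseable text)
coerced = pd.to_numeric(raw_sales, errors="coerce")
missing_after = int(coerced.isna().sum())

# Values that were present as text but could not be converted
lost_in_coercion = raw_sales[raw_sales.notna() & coerced.isna()]

print("Missing before coercion:", missing_before)
print("Missing after coercion:", missing_after)
print("Lost in coercion:", lost_in_coercion.tolist())
```

Inspecting the values listed in `lost_in_coercion` usually shows whether they can be repaired (for example by stripping separators) or must be excluded.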

In [13]:
# Create a single effective order date using ship_date as a fallback
df["effective_order_date"] = df["order_date"].fillna(df["ship_date"])

# Record which date source is used for each record
df["order_date_source"] = np.where(
    df["order_date"].notna(),
    "order_date",
    "ship_date"
)

# Preview original and derived date fields to validate the logic
df[[
    "order_date",
    "ship_date",
    "effective_order_date",
    "order_date_source"
]].head(10)



Unnamed: 0,order_date,ship_date,effective_order_date,order_date_source
0,2011-01-01,2011-06-01,2011-01-01,order_date
1,2011-01-01,2011-08-01,2011-01-01,order_date
2,2011-01-01,2011-05-01,2011-01-01,order_date
3,2011-01-01,2011-05-01,2011-01-01,order_date
4,2011-01-01,2011-08-01,2011-01-01,order_date
5,2011-01-01,2011-08-01,2011-01-01,order_date
6,2011-02-01,2011-06-01,2011-02-01,order_date
7,2011-03-01,2011-03-01,2011-03-01,order_date
8,2011-03-01,2011-09-01,2011-03-01,order_date
9,2011-03-01,2011-07-01,2011-03-01,order_date


In [14]:
# Check the percentage distribution of order date sources
df["order_date_source"].value_counts(normalize=True) * 100

order_date_source
ship_date     60.883816
order_date    39.116184
Name: proportion, dtype: float64

### Duplicate Records Check and Removal

To identify potential duplicate records, a composite business key was defined using  
`order_id`, `product_id`, and `customer_name`.  
This combination represents the closest available grain of an order line in the absence of a unique line identifier.

In [10]:
dup_key = ["order_id", "product_id", "customer_name"]

df.duplicated(subset=dup_key).sum()


35

An initial duplicate check showed **35 duplicated rows** based on this key.

Since no additional fields such as `order_line_id` or ingestion timestamp were available to distinguish records, duplicates were assumed to originate from data reprocessing or partial updates.

The dataset was therefore deduplicated by keeping the first occurrence of each composite key.

- Rows before deduplication: 51,290
- Rows after deduplication: 51,255
- Duplicates removed: 35

This decision was documented as an analytical assumption, and the limitation was noted for downstream pricing and profitability analysis.

In [11]:
dup_key = ["order_id", "product_id", "customer_name"]

before = df.shape[0]

df = df.drop_duplicates(subset=dup_key, keep="first").copy()

after = df.shape[0]

print("Rows before:", before)
print("Rows after:", after)
print("Removed duplicates:", before - after)


Rows before: 51290
Rows after: 51255
Removed duplicates: 35
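
Before relying on `keep="first"`, it can help to check whether rows sharing the composite key are identical in every column or differ in some field such as quantity. The sketch below illustrates that check on a small made-up table; the sample rows and the names `key_dups`, `exact_dups`, and `conflicting` are assumptions for the example.

```python
import pandas as pd

# Hypothetical order lines: A-1 is an exact repeat, B-2 repeats with a different quantity
orders = pd.DataFrame({
    "order_id": ["A-1", "A-1", "B-2", "B-2", "C-3"],
    "product_id": ["P1", "P1", "P2", "P2", "P3"],
    "customer_name": ["Ann", "Ann", "Bob", "Bob", "Cy"],
    "quantity": [2, 2, 1, 4, 3],
})

dup_key = ["order_id", "product_id", "customer_name"]

# All rows involved in a key collision (keep=False marks every member of the group)
key_dups = orders[orders.duplicated(subset=dup_key, keep=False)]

# Among those, rows that are identical across all columns
exact_dups = key_dups[key_dups.duplicated(keep=False)]

# Rows that share the key but disagree on at least one other column
conflicting = key_dups.drop(exact_dups.index)

print("Rows in key collisions:", len(key_dups))
print("Exact repeats:", len(exact_dups))
print("Conflicting rows:", len(conflicting))
```

Exact repeats are safe to drop, while conflicting rows may be separate order lines for the same product and deserve a closer look before removal.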


### Data Validation and Feature Engineering

In this step, core numeric columns including sales, profit, discount, and quantity
were explicitly converted to numeric types to ensure reliable comparisons
and prevent type-related errors.

Logical sanity checks were performed to identify:
- Negative sales values
- Negative profit values
- Discounts outside the valid range (0 to 1)
- Invalid quantities (zero or negative)

After validating data integrity, several analytical features were created
to support pricing, discount, profitability, and delivery performance analysis.


In [17]:
# Ensure numeric columns are truly numeric
num_cols = ["sales", "profit", "discount", "quantity"]
for col in num_cols:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Sanity checks
print("sales < 0:", (df["sales"] < 0).sum())
print("profit < 0:", (df["profit"] < 0).sum())
print("discount < 0:", (df["discount"] < 0).sum())
print("discount > 1:", (df["discount"] > 1).sum())
print("quantity <= 0:", (df["quantity"] <= 0).sum())

# Pricing and profitability features
df["sales_before_discount"] = np.where(
    df["discount"] < 1,
    df["sales"] / (1 - df["discount"]),
    np.nan
)

df["profit_margin"] = np.where(
    df["sales"] != 0,
    df["profit"] / df["sales"],
    np.nan
)

df["unit_price"] = np.where(
    df["quantity"] > 0,
    df["sales"] / df["quantity"],
    np.nan
)

# Flags
df["has_discount"] = np.where(df["discount"] > 0, 1, 0)
df["is_loss"] = np.where(df["profit"] < 0, 1, 0)

# Delivery metrics
df["delivery_delay_days"] = (df["ship_date"] - df["effective_order_date"]).dt.days

df["is_delay_reliable"] = np.where(
    df["order_date_source"] == "order_date",
    1,
    0
)

df[
    [
        "sales",
        "discount",
        "sales_before_discount",
        "profit",
        "profit_margin",
        "quantity",
        "unit_price",
        "has_discount",
        "is_loss",
        "delivery_delay_days",
        "is_delay_reliable",
    ]
].head(10)


sales < 0: 0
profit < 0: 12540
discount < 0: 0
discount > 1: 0
quantity <= 0: 0


Unnamed: 0,sales,discount,sales_before_discount,profit,profit_margin,quantity,unit_price,has_discount,is_loss,delivery_delay_days,is_delay_reliable
0,408.0,0.0,408.0,106.14,0.260147,2,204.0,0,0,151.0,1
1,120.0,0.1,133.333333,36.036,0.3003,3,40.0,1,0,212.0,1
2,66.0,0.0,66.0,29.64,0.449091,4,16.5,0,0,120.0,1
3,45.0,0.5,90.0,-26.055,-0.579,3,15.0,1,1,120.0,1
4,114.0,0.1,126.666667,37.77,0.331316,5,22.8,1,0,212.0,1
5,55.0,0.1,61.111111,15.342,0.278945,2,27.5,1,0,212.0,1
6,314.0,0.0,314.0,3.12,0.009936,1,314.0,0,0,120.0,1
7,276.0,0.1,306.666667,110.412,0.400043,1,276.0,1,0,0.0,1
8,912.0,0.4,1520.0,-319.464,-0.350289,4,228.0,1,1,184.0,1
9,667.0,0.0,667.0,253.32,0.37979,4,166.75,0,0,122.0,1




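All range checks passed: no negative sales values, out-of-range discount rates,
or non-positive quantities were detected.

12,540 transactions show negative profit.
These records are treated as genuine loss-making orders rather than data quality issues.
Common drivers of negative profit include high discount rates, low unit prices,
and cost structures that exceed revenue at the order level.

Negative profit observations are retained intentionally, as they provide
critical signals for pricing strategy, discount effectiveness,
and loss-driving product or customer segments.

Two caveats carry forward. Rows whose `sales` value could not be converted to a number
have missing `sales_before_discount`, `profit_margin`, and `unit_price`.
The `delivery_delay_days` values in the preview (120 to 212 days for most rows)
are implausibly long, which points to a day/month ambiguity in the parsed dates,
so delivery metrics should be treated with caution even where `is_delay_reliable` is 1.

With these caveats noted, the dataset is ready for deeper
pricing, discount impact, and profitability analysis.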
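
The derived pricing features can be checked by hand against a single row of the preview. The sketch below recomputes them for the fourth preview row (sales 45, discount 0.5, profit -26.055, quantity 3); the helper name `derive_features` is hypothetical and simply mirrors the formulas used in the cell above.

```python
def derive_features(sales, discount, profit, quantity):
    """Recompute the pricing features for one order line."""
    # List value before the discount was applied (undefined at a 100% discount)
    sales_before_discount = sales / (1 - discount) if discount < 1 else None
    # Profit as a share of realised sales (undefined when sales is zero)
    profit_margin = profit / sales if sales != 0 else None
    # Realised price per unit (undefined for non-positive quantities)
    unit_price = sales / quantity if quantity > 0 else None
    return {
        "sales_before_discount": sales_before_discount,
        "profit_margin": profit_margin,
        "unit_price": unit_price,
    }


row = derive_features(sales=45.0, discount=0.5, profit=-26.055, quantity=3)
print(row)
```

The results (90.0, about -0.579, and 15.0) match the values shown for that row in the preview table.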


---

### Discount vs Profit Relationship Analysis

This step explores the relationship between discount levels and profitability.
The objective is to understand whether higher discounts are associated with
lower profit and increased loss-making orders.

The analysis focuses on profit, profit margin, and discount rate,
while preserving loss-making transactions as valid business signals.


In [19]:
# Focus on rows with valid discount, sales, and profit values
analysis_df = df[
    (df["discount"].notna()) &
    (df["sales"].notna()) &
    (df["profit"].notna())
].copy()

# Create discount buckets for clearer interpretation
analysis_df["discount_bucket"] = pd.cut(
    analysis_df["discount"],
    bins=[-0.01, 0, 0.1, 0.2, 0.3, 0.5, 1],
    labels=["0%", "0-10%", "10-20%", "20-30%", "30-50%", "50%+"]
)

# Aggregate profit metrics by discount bucket (observed=True removes the FutureWarning)
discount_profit_summary = (
    analysis_df
    .groupby("discount_bucket", observed=True)
    .agg(
        order_count=("profit", "count"),
        avg_profit=("profit", "mean"),
        avg_profit_margin=("profit_margin", "mean"),
        loss_rate=("is_loss", "mean")
    )
    .reset_index()
)

# Display summary table
discount_profit_summary


Unnamed: 0,discount_bucket,order_count,avg_profit,avg_profit_margin,loss_rate
0,0%,27533,38.926899,0.265709,0.0
1,0-10%,4193,45.204942,0.171075,0.194849
2,10-20%,5879,16.110843,0.140078,0.230992
3,20-30%,883,-15.95888,-0.037732,0.621744
4,30-50%,6009,-48.003841,-0.339006,0.874022
5,50%+,4129,-82.324612,-1.151515,1.0


### Key Findings

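Profitability declines as discount levels increase.
Average profit margin falls from 26.6% on undiscounted order lines to 17.1% and 14.0%
in the 0-10% and 10-20% buckets, turns negative (-3.8%) at 20-30%,
and reaches -33.9% and -115.2% in the 30-50% and 50%+ buckets.
Average profit per order line is slightly higher in the 0-10% bucket (45.20) than at 0% (38.93),
but it drops steadily beyond that point.

The loss rate rises in step: 0%, 19.5%, 23.1%, 62.2%, 87.4%, and 100% across the six buckets.
The break point sits around 20 percent. Order lines discounted above 20 percent lose money on average,
and those above 30 percent are overwhelmingly loss-making,
suggesting that discounting at these levels does not generate
sufficient uplift to offset margin erosion, although volume effects are not measured directly here.

No undiscounted order line in the sample is loss-making, so losses are strongly associated
with discounting rather than with data quality issues.

These results suggest that discount thresholds should be carefully controlled,
and that high-discount campaigns require tighter cost and pricing governance.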
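
When reading thresholds off the bucket table, it matters which bucket a boundary value falls into. `pd.cut` uses right-inclusive intervals by default, so a discount of exactly 0.2 lands in the 10-20% bucket, not the 20-30% one. The short sketch below confirms this with the same bins and labels on a few made-up discount values.

```python
import pandas as pd

# Hypothetical discount rates, including values that sit exactly on bin edges
discounts = pd.Series([0.0, 0.1, 0.2, 0.25, 0.5, 0.7])

buckets = pd.cut(
    discounts,
    bins=[-0.01, 0, 0.1, 0.2, 0.3, 0.5, 1],
    labels=["0%", "0-10%", "10-20%", "20-30%", "30-50%", "50%+"],
)

# Show each discount next to the bucket it was assigned to
assignment = pd.DataFrame({"discount": discounts, "bucket": buckets.astype(str)})
print(assignment)
```

Under this convention, the three highest buckets together contain exactly the order lines with a discount strictly greater than 20 percent.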


---

### Loss-Making Orders Deep Dive

This step focuses on identifying which products contribute most to overall losses.
Rather than treating all loss-making orders equally, the analysis isolates
products with repeated negative profit patterns.

The objective is to distinguish between occasional losses
and structurally loss-driving products.


In [20]:
# Filter only loss-making orders
loss_df = df[df["is_loss"] == 1].copy()

# Aggregate loss metrics by product
product_loss_summary = (
    loss_df
    .groupby("product_id", observed=True)
    .agg(
        loss_order_count=("profit", "count"),
        total_loss=("profit", "sum"),
        avg_loss_per_order=("profit", "mean"),
        avg_discount=("discount", "mean")
    )
    .reset_index()
)

# Sort products by total loss contribution
product_loss_summary = product_loss_summary.sort_values(
    by="total_loss",
    ascending=True
)

# Display top loss-driving products
product_loss_summary.head(10)


Unnamed: 0,product_id,loss_order_count,total_loss,avg_loss_per_order,avg_discount
5472,TEC-MA-10000418,2,-9239.9692,-4619.9846,0.7
2523,OFF-BI-10004995,3,-6859.3896,-2286.4632,0.766667
5487,TEC-MA-10000822,3,-5269.969,-1756.656333,0.5
2112,OFF-BI-10000545,6,-5098.566,-849.761,0.716667
1610,OFF-AP-10001623,4,-5071.443,-1267.86075,0.3
5933,TEC-PH-10002991,4,-4574.6439,-1143.660975,0.355
5680,TEC-MOT-10003050,2,-4493.352,-2246.676,0.65
1285,FUR-TA-10001889,8,-4201.5993,-525.199912,0.4025
2196,OFF-BI-10001359,4,-4162.0336,-1040.5084,0.725
5625,TEC-MA-10004125,1,-3839.9904,-3839.9904,0.5


In [22]:
# Filter loss-making orders that are valid for pricing analysis
# Exclude rows with missing sales, as unit economics cannot be computed reliably
loss_df_valid_sales = df[
    (df["is_loss"] == 1) &
    (df["sales"].notna())
].copy()

# Aggregate loss metrics by product using only sales-valid records
product_loss_summary = (
    loss_df_valid_sales
    .groupby("product_id", observed=True)
    .agg(
        loss_order_count=("profit", "count"),
        total_loss=("profit", "sum"),
        avg_loss_per_order=("profit", "mean"),
        avg_discount=("discount", "mean")
    )
    .reset_index()
)

# Sort products by total loss contribution (most negative first)
product_loss_summary = product_loss_summary.sort_values(
    by="total_loss",
    ascending=True
)

# Display top loss-driving products with reliable pricing data
product_loss_summary.head(10)


Unnamed: 0,product_id,loss_order_count,total_loss,avg_loss_per_order,avg_discount
1285,FUR-TA-10002885,3,-2798.488,-932.829333,0.7
4026,OFF-ST-10000872,13,-2430.918,-186.993692,0.292308
1336,FUR-TA-10003963,1,-1924.542,-1924.542,0.85
2117,OFF-BI-10001359,2,-1892.6489,-946.32445,0.75
2419,OFF-BI-10004632,8,-1882.31,-235.28875,0.65
1079,FUR-IKE-10000649,4,-1867.536,-466.884,0.65
5579,TEC-NOK-10001283,4,-1816.59,-454.1475,0.675
1315,FUR-TA-10003522,2,-1780.356,-890.178,0.7
1281,FUR-TA-10002833,3,-1726.1304,-575.3768,0.503333
25,FUR-BO-10000038,3,-1672.596,-557.532,0.4


### Interpreting Initial Loss Concentration Results

The initial loss ranking showed a small number of products with very large total losses,
sometimes driven by only one or two loss-making orders.
At first glance, this raised concerns about potential calculation issues.

A closer inspection revealed that this behaviour was not caused by aggregation errors,
but by a known data quality limitation identified earlier during data assessment.
Approximately 5 percent of records contain missing sales values while still reporting profit and discount.

Because unit economics and relative profitability cannot be reliably evaluated without sales,
these records were excluded from pricing-focused analysis rather than being removed entirely.

After filtering to loss-making orders with valid sales values only,
a second pattern became clear.

Even after removing the affected records, several products still exhibit
severe losses driven by a small number of very large orders combined with aggressive discounting.
This indicates that order size and ticket value play a critical role in loss magnitude,
and that a single heavily discounted high-value order can dominate total loss figures.

This insight motivated the shift from absolute loss analysis
to normalised metrics such as profit margin,
allowing structurally loss-making products to be distinguished
from products affected by isolated outlier orders.


In [23]:
# Use only loss-making orders with valid sales for normalized analysis
loss_df_valid_sales = df[
    (df["is_loss"] == 1) &
    (df["sales"].notna()) &
    (df["profit_margin"].notna())
].copy()

# Aggregate normalized loss metrics by product
product_loss_normalized = (
    loss_df_valid_sales
    .groupby("product_id", observed=True)
    .agg(
        loss_order_count=("profit", "count"),
        total_loss=("profit", "sum"),
        avg_loss_per_order=("profit", "mean"),
        avg_discount=("discount", "mean"),
        avg_profit_margin=("profit_margin", "mean"),
        median_profit_margin=("profit_margin", "median")
    )
    .reset_index()
)

# Sort by normalized profitability instead of raw loss
product_loss_normalized = product_loss_normalized.sort_values(
    by="avg_profit_margin"
)

# Display products with the lowest average margin across their loss-making orders
product_loss_normalized.head(10)


Unnamed: 0,product_id,loss_order_count,total_loss,avg_loss_per_order,avg_discount,avg_profit_margin,median_profit_margin
1247,FUR-TA-10001935,2,-733.158,-366.579,0.8,-3.863871,-3.863871
1542,OFF-AP-10001634,1,-3.7584,-3.7584,0.8,-3.7584,-3.7584
1336,FUR-TA-10003963,1,-1924.542,-1924.542,0.85,-3.467643,-3.467643
1182,FUR-TA-10000147,1,-344.412,-344.412,0.8,-3.44412,-3.44412
1185,FUR-TA-10000207,1,-1218.384,-1218.384,0.8,-3.249024,-3.249024
1639,OFF-AP-10004249,1,-6.3441,-6.3441,0.8,-3.17205,-3.17205
1360,FUR-TA-10004544,2,-1391.208,-695.604,0.725,-2.938509,-2.938509
1631,OFF-AP-10004052,1,-8.532,-8.532,0.8,-2.844,-2.844
1622,OFF-AP-10003849,1,-393.602,-393.602,0.8,-2.752462,-2.752462
1656,OFF-AP-10004868,1,-24.7086,-24.7086,0.8,-2.7454,-2.7454
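
Most products in the ranking above have only one or two loss-making orders, so a low average margin alone does not establish a structural pattern. One possible refinement, sketched below on made-up data, is to require a minimum number of loss-making order lines before labelling a product as recurring. The threshold `MIN_LOSS_ORDERS` and the labels `recurring` and `isolated` are hypothetical choices, not part of the original analysis.

```python
import pandas as pd

# Hypothetical loss-making order lines for three products
loss_lines = pd.DataFrame({
    "product_id": ["P1", "P1", "P1", "P2", "P3", "P3"],
    "profit": [-10.0, -12.0, -8.0, -900.0, -5.0, -7.0],
    "sales": [50.0, 60.0, 40.0, 300.0, 50.0, 70.0],
})
loss_lines["profit_margin"] = loss_lines["profit"] / loss_lines["sales"]

# Assumed cut-off: at least this many loss lines to count as a recurring pattern
MIN_LOSS_ORDERS = 3

summary = (
    loss_lines
    .groupby("product_id")
    .agg(
        loss_order_count=("profit", "count"),
        total_loss=("profit", "sum"),
        median_profit_margin=("profit_margin", "median"),
    )
    .reset_index()
)

# Label each product by how often it loses money, not by how much
summary["pattern"] = (
    summary["loss_order_count"]
    .ge(MIN_LOSS_ORDERS)
    .map({True: "recurring", False: "isolated"})
)

print(summary)
```

In this toy table, P2 has by far the largest total loss but is labelled isolated, while P1 loses less in total but does so repeatedly.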


----

### Pricing Rule Simulation: Discount Cap Scenario

This step simulates a simple pricing control rule to estimate potential loss reduction.
The scenario assumes that discounts above a fixed threshold are capped,
while keeping cost structure unchanged.

The objective is not to predict exact future profit,
but to quantify how much loss could have been avoided
by applying a conservative discount governance rule.


In [24]:
# Define discount cap threshold
DISCOUNT_CAP = 0.20

# Work on a copy to avoid mutating original data
scenario_df = df.copy()

# Use only rows with valid sales_before_discount and profit
scenario_df = scenario_df[
    scenario_df["sales_before_discount"].notna() &
    scenario_df["profit"].notna()
].copy()

# Infer cost from original data (cost = sales - profit)
scenario_df["cost"] = scenario_df["sales"] - scenario_df["profit"]

# Apply discount cap
scenario_df["discount_capped"] = np.where(
    scenario_df["discount"] > DISCOUNT_CAP,
    DISCOUNT_CAP,
    scenario_df["discount"]
)

# Recalculate sales under capped discount
scenario_df["sales_capped"] = (
    scenario_df["sales_before_discount"] * (1 - scenario_df["discount_capped"])
)

# Recalculate profit under capped discount (cost assumed unchanged)
scenario_df["profit_capped"] = scenario_df["sales_capped"] - scenario_df["cost"]

# Compare original vs capped profit
scenario_df["profit_delta"] = scenario_df["profit_capped"] - scenario_df["profit"]

# Aggregate impact summary
pricing_rule_impact = scenario_df.agg(
    original_total_profit=("profit", "sum"),
    capped_total_profit=("profit_capped", "sum"),
    profit_improvement=("profit_delta", "sum"),
    affected_order_count=("profit_delta", lambda x: (x > 0).sum())
)

pricing_rule_impact


Unnamed: 0,profit,profit_capped,profit_delta
original_total_profit,713569.19068,,
capped_total_profit,,1494000.0,
profit_improvement,,,780431.023306
affected_order_count,,,22087.0


In [28]:
# Define discount cap threshold
DISCOUNT_CAP = 0.20

# Work on a copy to avoid mutating original data
scenario_df = df.copy()

# Keep only rows where the scenario is computable
# sales_before_discount is needed to reconstruct sales under different discount rates
scenario_df = scenario_df[
    scenario_df["sales_before_discount"].notna() &
    scenario_df["sales"].notna() &
    scenario_df["profit"].notna() &
    scenario_df["discount"].notna()
].copy()

# Infer cost from original data (cost = sales - profit)
scenario_df["cost"] = scenario_df["sales"] - scenario_df["profit"]

# Apply discount cap
scenario_df["discount_capped"] = np.where(
    scenario_df["discount"] > DISCOUNT_CAP,
    DISCOUNT_CAP,
    scenario_df["discount"]
)

# Recalculate sales under capped discount (assume list price stays the same)
scenario_df["sales_capped"] = (
    scenario_df["sales_before_discount"] * (1 - scenario_df["discount_capped"])
)

# Recalculate profit under capped discount (assume cost unchanged)
scenario_df["profit_capped"] = scenario_df["sales_capped"] - scenario_df["cost"]

# Profit delta (how much profit improves under the rule)
scenario_df["profit_delta"] = scenario_df["profit_capped"] - scenario_df["profit"]

# Clean, metric-first summary (avoid DataFrame.agg NaN layout issue)
pricing_rule_impact = pd.Series({
    "original_total_profit": scenario_df["profit"].sum(),
    "capped_total_profit": scenario_df["profit_capped"].sum(),
    "profit_improvement": scenario_df["profit_delta"].sum(),
    "affected_order_count": (scenario_df["profit_delta"] > 0).sum(),
    "total_orders_in_scope": scenario_df.shape[0],
    "share_orders_affected": (scenario_df["profit_delta"] > 0).mean()
})

pricing_rule_impact


original_total_profit    7.135692e+05
capped_total_profit      1.494000e+06
profit_improvement       7.804310e+05
affected_order_count     2.208700e+04
total_orders_in_scope    4.862600e+04
share_orders_affected    4.542220e-01
dtype: float64

In [29]:

# Convert the summary output into a clean, readable table
impact_df = pricing_rule_impact.to_frame(name="value").reset_index()
impact_df.columns = ["metric", "value"]

# Format numbers: currency-like for large numbers, percent for shares
impact_df["value"] = impact_df.apply(
    lambda r: f"{r['value']:.2%}" if r["metric"] == "share_orders_affected" else f"{r['value']:,.0f}",
    axis=1
)

# Friendly metric names for reporting
metric_labels = {
    "original_total_profit": "Original Total Profit",
    "capped_total_profit": "Total Profit with 20% Discount Cap",
    "profit_improvement": "Estimated Profit Improvement",
    "affected_order_count": "Orders Positively Affected",
    "total_orders_in_scope": "Orders Evaluated",
    "share_orders_affected": "Share of Orders Affected"
}

impact_df["metric"] = impact_df["metric"].map(metric_labels)

impact_df


Unnamed: 0,metric,value
0,Original Total Profit,713569
1,Total Profit with 20% Discount Cap,1494000
2,Estimated Profit Improvement,780431
3,Orders Positively Affected,22087
4,Orders Evaluated,48626
5,Share of Orders Affected,45.42%


### Interpreting the Pricing Rule Simulation Results

This table summarises the impact of a hypothetical pricing rule
that caps discounts at 20 percent for all orders included in the analysis.

The simulation compares the original observed profit
with a counterfactual scenario in which discounts above 20 percent
are reduced to the cap, while keeping sales volume and cost structure unchanged.
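
To make the counterfactual concrete, the calculation can be traced for a single order line. The sketch below uses the preview row with sales 912, discount 0.4, and profit -319.464; the function name `cap_discount_profit` is hypothetical and follows the same steps as the simulation cell (cost inferred as sales minus profit, list value and cost held constant).

```python
def cap_discount_profit(sales, discount, profit, cap=0.20):
    """Return (profit under the capped discount, change versus observed profit)."""
    # Reconstruct the pre-discount list value of the order line
    list_value = sales / (1 - discount)
    # Infer cost from the observed figures; assumed unchanged by the rule
    cost = sales - profit
    # Apply the cap only when the observed discount exceeds it
    capped_discount = min(discount, cap)
    sales_capped = list_value * (1 - capped_discount)
    profit_capped = sales_capped - cost
    return profit_capped, profit_capped - profit


profit_capped, profit_delta = cap_discount_profit(912.0, 0.4, -319.464)
print(round(profit_capped, 3), round(profit_delta, 3))
```

For this line the list value is 1,520, capped sales are 1,216, and the loss shrinks from about -319.46 to about -15.46, an improvement of roughly 304. The order still loses money, which shows that a cap reduces losses without necessarily eliminating them.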

**Original Total Profit** represents the total realised profit
from all orders that were eligible for the simulation.

**Total Profit with 20% Discount Cap** shows the recalculated profit
under the capped discount scenario.

**Estimated Profit Improvement** is the difference between the capped
and original profit, indicating how much profit could have been preserved
by applying the discount rule.

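**Orders Positively Affected** counts the number of order lines
for which `profit_delta > 0`. This figure (22,087) should be read as an upper bound:
the discount bucket table shows only 11,021 order lines with a discount above 20 percent
(883 + 6,009 + 4,129), and only those lines are changed by the cap.
The remainder are lines whose recomputed profit differs from the original
by floating-point rounding, not by the rule.

**Orders Evaluated** represents the total number of order lines
that had sufficient data to be included in the simulation.

**Share of Orders Affected** indicates the proportion of evaluated order lines
with `profit_delta > 0` and inherits the same overcount.
On the bucket counts, about 22.7 percent of evaluated lines (11,021 of 48,626)
actually exceed the cap.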
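
Counting affected orders with `profit_delta > 0` relies on the recomputed profit being exactly equal to the original for every line the rule leaves untouched, which floating-point arithmetic does not guarantee. The sketch below shows two more robust alternatives on a few rows taken from the preview tables: counting by the rule itself, or using a small tolerance. The `TOLERANCE` value is an assumption chosen for illustration.

```python
import pandas as pd

DISCOUNT_CAP = 0.20
TOLERANCE = 1e-6  # assumed threshold below which a delta is treated as rounding noise

# A few order lines from the preview; two of them exceed the cap
lines = pd.DataFrame({
    "sales": [120.0, 55.0, 45.0, 912.0, 408.0],
    "discount": [0.1, 0.1, 0.5, 0.4, 0.0],
    "profit": [36.036, 15.342, -26.055, -319.464, 106.14],
})

# Same steps as the simulation: list value, inferred cost, capped discount
list_value = lines["sales"] / (1 - lines["discount"])
cost = lines["sales"] - lines["profit"]
discount_capped = lines["discount"].clip(upper=DISCOUNT_CAP)
profit_capped = list_value * (1 - discount_capped) - cost
profit_delta = profit_capped - lines["profit"]

# Option 1: count the lines the rule actually changes
affected_by_rule = int((lines["discount"] > DISCOUNT_CAP).sum())

# Option 2: count deltas that are clearly above rounding noise
affected_by_tolerance = int((profit_delta > TOLERANCE).sum())

# Original approach, which may also pick up tiny positive rounding residues
affected_by_sign = int((profit_delta > 0).sum())

print(affected_by_rule, affected_by_tolerance, affected_by_sign)
```

The first two counts agree by construction, while the sign-based count can only be equal or larger.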

This simulation is not a forecast of future performance.
It is a directional analysis designed to quantify the potential downside risk
of aggressive discounting and to illustrate the value of basic discount governance.
