### Price Optimization Machine Learning Model

##### Objective
We want to use a machine learning model to help us set optimal prices. The aim is to increase revenue and/or margin while accounting for market conditions and preserving customer trust.

##### Why Now?
I believe we are now well-positioned to design a price optimization model. We have complete access to all our current and historical Brightpearl data, including the pricing information we need.

##### Expected Benefits
- Revenue uplift: optimized pricing should improve revenue performance per product
- Margin protection: an optimized model could help us avoid underpricing
- Insights: a clear understanding of demand elasticity by product and segment
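
To make "demand elasticity" concrete: one common way to estimate it is to regress log quantity on log price, where the slope is the elasticity. The sketch below uses synthetic, noiseless numbers and assumes a constant-elasticity demand curve; neither the data nor that assumption comes from our Brightpearl data.

```python
import numpy as np

# Synthetic price/quantity observations for a single SKU (illustrative only).
prices = np.array([10.0, 11.0, 12.0, 13.0, 14.0])
true_elasticity = -1.5
quantities = 1000.0 * prices ** true_elasticity  # constant-elasticity demand, no noise

# Fit log(q) = intercept + slope * log(p); the slope is the elasticity estimate.
slope, intercept = np.polyfit(np.log(prices), np.log(quantities), deg=1)
elasticity = slope

print(f"Estimated elasticity: {elasticity:.2f}")
```

An elasticity below -1 means a price increase reduces units sold by proportionally more than the price rise, so revenue falls; between -1 and 0, revenue rises with price.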

##### Scope
TBC

##### Data Needed

- Historical prices & sales (SKU × date/time × channel)
- Product costs
- Inventory & stockouts
- Promotions & discounts
- Competitor prices (open question: is this achievable for us?)
- External demand drivers (seasonality, events)
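
As a rough sketch of the SKU × date × channel grain listed above, order lines could be rolled up as follows. The column names match the ones this notebook loads, the sample rows are made up, and treating "Price of Product" as a per-unit price is an assumption.

```python
import pandas as pd

# Made-up sample rows using the notebook's column names.
sample = pd.DataFrame({
    "Tax Date": pd.to_datetime(
        ["2025-06-25 09:00", "2025-06-25 15:30", "2025-06-26 10:00"]
    ),
    "SKU": ["A1", "A1", "A1"],
    "Channel Id": [2, 2, 7],
    "Quantity": [1.0, 2.0, 1.0],
    "Price of Product": [10.0, 10.0, 12.0],  # ASSUMPTION: price per unit
})

# Revenue per order line, then one row per SKU x day x channel.
sample["Revenue"] = sample["Quantity"] * sample["Price of Product"]
panel = (
    sample.assign(Date=sample["Tax Date"].dt.normalize())
    .groupby(["SKU", "Date", "Channel Id"], as_index=False)
    .agg(units=("Quantity", "sum"), revenue=("Revenue", "sum"))
)

# Quantity-weighted average selling price for each SKU/day/channel cell.
panel["avg_price"] = panel["revenue"] / panel["units"]
```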

##### Resources

- Tools: Data warehouse (Perceptium), Python ML stack, Tableau BI dashboard.

### Breakdown of Data from Tables

##### Historical prices & sales (SKU × date/time × channel) / Product costs
Order Table:
- ord_id - Order ID
- ord_invoicetaxDate - Tax Date
- ord_channelId - Channel ID
- ord_orderTypeCode - Type code (used for filtering; for example, PC or SC may indicate a refund, to be confirmed)

Orderline Table:
- orl_ord_id - Order ID (Number for overall order)
- orl_id - OrderLine ID (Number for orderline, used to show individual lines inside of an order)
- orl_productSku - product SKU
- orl_productId - Product ID
- orl_nominalCode - For filtering (Not needed as a column)
- orl_itemCostValue - Cost (cost price for single unit of product)
- orl_quantity - Quantity (number of items purchased)
- orl_productPriceValue - Price (price of the product at the time the order is placed)
- DO NOT USE - orl_discountPercentage - discount percent on row (not dependable) 
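
A minimal sketch of how these fields could be combined once loaded: filter on the type code, then derive margin from price and cost. The type codes used here are an assumption ("SO" as a standard sales order, "SC" as a credit) and still need confirming, and the sample rows are made up.

```python
import pandas as pd

# Made-up order lines using the notebook's column names.
lines = pd.DataFrame({
    "Type Code": ["SO", "SO", "SC"],
    "Quantity": [2.0, 1.0, 1.0],
    "Price of Product": [15.0, 20.0, 15.0],
    "Cost of Product": [9.0, 14.0, 9.0],
})

# ASSUMPTION: "SO" marks a normal sale; confirm the codes before relying on this.
SALES_TYPE_CODES = {"SO"}
sales = lines[lines["Type Code"].isin(SALES_TYPE_CODES)].copy()

# Margin per unit, per order line, and as a share of the selling price.
sales["Unit Margin"] = sales["Price of Product"] - sales["Cost of Product"]
sales["Line Margin"] = sales["Unit Margin"] * sales["Quantity"]
sales["Margin %"] = sales["Unit Margin"] / sales["Price of Product"]
```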


**02/09/2025** - To test and train the model, we will potentially use the 5 to 10 top-selling products, which will make it easier to source competitor pricing for the time being.
Products:

In [13]:
# Import libraries (remove any that turn out to be unneeded)

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import warnings
import pyodbc
warnings.filterwarnings('ignore')

from pathlib import Path

from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose  

from sklearn.model_selection import train_test_split

##### Loading Dataset

In [14]:
#Load datasets - Original Dataset use sql below for full model
#orders = pd.read_csv('Order.csv')

# --- Step 1: Read the credentials from the text file ---
credentials = {}
try:
    with open('credentials.txt', 'r') as file:
        for line in file:
            line = line.strip()
            # Skip blank lines and lines that are not key=value pairs
            if not line or '=' not in line:
                continue
            # Split at the first '=' so that values may themselves contain '='
            key, value = line.split('=', 1)
            credentials[key.strip()] = value.strip()
except FileNotFoundError:
    # Raise rather than call exit(), which would stop the notebook kernel
    raise FileNotFoundError("The 'credentials.txt' file was not found.")

# Assign credentials to variables
server_name = credentials.get('server')
database_name = credentials.get('database')
username = credentials.get('username')
password = credentials.get('password')
driver = '{ODBC Driver 17 for SQL Server}'

# Check for missing credentials
if not all([server_name, database_name, username, password]):
    raise ValueError("One or more credentials are missing from the file.")

# --- Step 2: Establish the connection ---
try:
    conn_string = (
        f'DRIVER={driver};'
        f'SERVER={server_name};'
        f'DATABASE={database_name};'
        f'UID={username};'
        f'PWD={password};'
    )
    conn = pyodbc.connect(conn_string)
    print("Connection to Azure SQL Database successful!")

except pyodbc.Error as ex:
    print(f"Error connecting to the database: {ex.args[0]}")
    conn = None

# --- Step 3: Fetch merged data and load into a single DataFrame ---

#points to sql file
sql_path = Path("C:/Users/Devin Ferko/Desktop/Codes/Machine Learning Projects/Price Optimization/HistoricalPricesSales.sql")

with open(sql_path, 'r', encoding='utf-8') as file:
    sql_query = file.read()
    print("SQL query loaded from file.")

if conn:
    try:
        # Reads SQL query from file
        merged_query = sql_query
        
        # Load the joined data directly into a single DataFrame
        orders = pd.read_sql(merged_query, conn)
        print(f"Successfully loaded {len(orders)} rows from the merged query.")
        #print("\nMerged DataFrame Head:")
        #print(orders.head())

    except Exception as e:
        print(f"An error occurred while fetching data: {e}")

    finally:
        conn.close()
        print("Database connection closed.")
else:
    print("Cannot proceed with data fetching. Database connection failed.")

Connection to Azure SQL Database successful!
SQL query loaded from file.
Successfully loaded 6359 rows from the merged query.
Database connection closed.
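
The SQL file and the CSVs in this notebook are read from absolute paths on one machine, so the notebook will not run elsewhere without edits. One possible approach, sketched below, is to resolve every file against a single project directory. `PRICE_OPT_DIR` and `project_file` are hypothetical names introduced for illustration, not part of the existing notebook.

```python
import os
from pathlib import Path

# ASSUMPTION: PRICE_OPT_DIR is a hypothetical environment variable pointing at
# the project folder; it falls back to the current working directory.
PROJECT_DIR = Path(os.environ.get("PRICE_OPT_DIR", Path.cwd()))
CSV_DIR = PROJECT_DIR / "CSVs"


def project_file(name: str, folder: Path = PROJECT_DIR) -> Path:
    """Return the path to a project file, failing early with a clear message."""
    path = folder / name
    if not path.exists():
        raise FileNotFoundError(f"Expected file not found: {path}")
    return path
```

With this in place, a load would look like `project_file("HistoricalPricesSales.sql")` or `project_file("Competitor Pricing.csv", CSV_DIR)`.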


In [15]:
orders = orders.rename(columns={"Product SKU": "SKU"})
#Info
orders.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6359 entries, 0 to 6358
Data columns (total 16 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   Order ID           6359 non-null   int64         
 1   Tax Date           6359 non-null   datetime64[ns]
 2   Net                6359 non-null   float64       
 3   Total              6359 non-null   float64       
 4   Channel Id         6359 non-null   int64         
 5   Type Code          6359 non-null   object        
 6   Orderline ID       6359 non-null   int64         
 7   Product Id         6359 non-null   int64         
 8   SKU                6359 non-null   object        
 9   Product Name       6359 non-null   object        
 10  Quantity           6359 non-null   float64       
 11  Product Value      6359 non-null   float64       
 12  Product Tax Value  6359 non-null   float64       
 13  Price of Product   6359 non-null   float64       
 14  Cost of 

In [16]:
#Null values per columns
orders.isnull().sum()

Order ID             0
Tax Date             0
Net                  0
Total                0
Channel Id           0
Type Code            0
Orderline ID         0
Product Id           0
SKU                  0
Product Name         0
Quantity             0
Product Value        0
Product Tax Value    0
Price of Product     0
Cost of Product      0
Nominal Code         0
dtype: int64
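
No nulls is a good start, but it does not rule out duplicated order lines or implausible values. The sketch below shows a few extra checks that might be worth running. The rules are assumptions about what counts as suspicious (for example, refunds may legitimately carry negative quantities), and the sample rows are made up.

```python
import pandas as pd


def basic_quality_report(df: pd.DataFrame) -> dict:
    """Count rows that usually deserve a closer look before modelling."""
    return {
        "duplicate_orderlines": int(df["Orderline ID"].duplicated().sum()),
        "non_positive_quantity": int((df["Quantity"] <= 0).sum()),
        "non_positive_price": int((df["Price of Product"] <= 0).sum()),
        "price_below_cost": int(
            (df["Price of Product"] < df["Cost of Product"]).sum()
        ),
    }


# Made-up rows using the notebook's column names.
sample = pd.DataFrame({
    "Orderline ID": [1, 2, 2, 3],
    "Quantity": [1.0, 0.0, 2.0, 1.0],
    "Price of Product": [10.0, 12.0, 0.0, 5.0],
    "Cost of Product": [6.0, 7.0, 1.0, 8.0],
})
report = basic_quality_report(sample)
print(report)
```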

### Promotion Data

We have a couple of CSVs with promotional data that would be beneficial to apply in this notebook.

In [17]:
# Summer Sale 2025 - 25th of June to 27th of August
summerSale25 = pd.read_csv("C:/Users/Devin Ferko/Desktop/Codes/Machine Learning Projects/Price Optimization/CSVs/Summer Sale 2025 - Prepped.csv")

#Uncomment the below if needed
#summerSale25.head()

In [18]:
# Spring Sale 2025 - March 5th to April 7th
springSale25 = pd.read_csv("C:/Users/Devin Ferko/Desktop/Codes/Machine Learning Projects/Price Optimization/CSVs/Spring Sale 2025 - prepped.csv")

#Uncomment the below if needed
#springSale25.head()

### Indicate if sale was present and apply discount percentages accordingly

In [19]:
# Add Sale Boolean Column and the discount percentage applied

# --- SUMMER SALE 2025 ---
# Define date windows
sum_start_date = pd.to_datetime('2025-06-25')
sum_end_date = pd.to_datetime('2025-08-27')

#Ensure clean dtypes - types for SKU are string and no trailing space
orders["SKU"] = orders["SKU"].astype(str).str.strip()
summerSale25["SKU"] = summerSale25["SKU"].astype(str).str.strip()

#Add summer sale 2025 column
orders["Summer_Sale"] = (
    orders["Tax Date"].between(sum_start_date, sum_end_date) #True if date falls in between
    & orders["SKU"].isin(summerSale25["SKU"]) #True if SKU matches
).astype("int8") # stores the True/False result as a 1/0 flag

#Bring in discount columns from the sale sheet
#Keep only the columns we need from the sale table
discount_cols = ["% off list TW", "% off list DR", "% off list OR"]
orders = orders.merge(
    summerSale25[["SKU"] + discount_cols],
    on="SKU",
    how="left"
)


# Map Channel Id -> the corresponding discount column.
channel_discount_map = {
    "2": "% off list TW",
    "7": "% off list DR",
    "8": "% off list OR"
}

channel_key = orders["Channel Id"].astype(str).str.strip() # Ensure Channel Id's are strings
chosen_col = channel_key.map(channel_discount_map)  # per-row column name to use

# Vectorized pick of the right discount per row
disc_df = orders[discount_cols] # separates the discount values into their own DataFrame
col_indexer = pd.Index(discount_cols).get_indexer(chosen_col) # converts column names in chosen_col into numeric indices so we can index the DataFrame efficiently.
row_indexer = np.arange(len(orders)) #array of row numbers [0, 1, 2, ..., n-1]

result = np.full(len(orders), np.nan, dtype=float) #empty array to hold discount values
in_window = orders["Tax Date"].between(sum_start_date, sum_end_date) #True if in window
valid_choice = col_indexer >= 0 #True if valid discount column exists for channel
mask = in_window & valid_choice #only select discount for orders in window and with valid channel

# Pull the values only where in window and with a valid channel/discount
result[mask] = disc_df.to_numpy()[row_indexer[mask], col_indexer[mask]]
orders["sumsale25_discount_percent"] = result

#Use 0 instead of NaN where no discount applies (comment out to keep NaN)
orders["sumsale25_discount_percent"] = orders["sumsale25_discount_percent"].fillna(0)

#Drops unwanted columns
orders = orders.drop(['% off list TW', '% off list DR', '% off list OR'], axis=1)

# --- SPRING SALE 2025 ---
# Define date windows
spr_start_date = pd.to_datetime('2025-03-05')
spr_end_date = pd.to_datetime('2025-04-07')

#Ensure clean dtypes - types for SKU are string and no trailing space
orders["SKU"] = orders["SKU"].astype(str).str.strip()
springSale25["SKU"] = springSale25["SKU"].astype(str).str.strip()

#Add spring sale 2025 column
orders["Spring_Sale"] = (
    orders["Tax Date"].between(spr_start_date, spr_end_date) #True if date falls in between
    & orders["SKU"].isin(springSale25["SKU"]) #True if SKU matches
).astype("int8") # stores the True/False result as a 1/0 flag

#Bring in discount columns from the sale sheet
#Keep only the columns we need from the sale table
discount_cols = ["% off list TW", "% off list DR", "% off list OR"]
orders = orders.merge(
    springSale25[["SKU"] + discount_cols],
    on="SKU",
    how="left"
)


# Map Channel Id -> the corresponding discount column.
channel_discount_map = {
    "2": "% off list TW",
    "7": "% off list DR",
    "8": "% off list OR"
}

channel_key = orders["Channel Id"].astype(str).str.strip() # Ensure Channel Id's are strings
chosen_col = channel_key.map(channel_discount_map)  # per-row column name to use

# Vectorized pick of the right discount per row
disc_df = orders[discount_cols] # separates the discount values into their own DataFrame
col_indexer = pd.Index(discount_cols).get_indexer(chosen_col) # converts column names in chosen_col into numeric indices so we can index the DataFrame efficiently.
row_indexer = np.arange(len(orders)) #array of row numbers [0, 1, 2, ..., n-1]

result = np.full(len(orders), np.nan, dtype=float) #empty array to hold discount values
in_window = orders["Tax Date"].between(spr_start_date, spr_end_date) #True if in window
valid_choice = col_indexer >= 0 #True if valid discount column exists for channel
mask = in_window & valid_choice #only select discount for orders in window and with valid channel

# Pull the values only where in window and with a valid channel/discount
result[mask] = disc_df.to_numpy()[row_indexer[mask], col_indexer[mask]]
orders["sprsale25_discount_percent"] = result

#Use 0 instead of NaN where no discount applies (comment out to keep NaN)
orders["sprsale25_discount_percent"] = orders["sprsale25_discount_percent"].fillna(0)

#Drops unwanted columns
orders = orders.drop(['% off list TW', '% off list DR', '% off list OR'], axis=1)

# --- TAX MONTH AND SEASONS ---
# Extract the month number from Tax Date
orders["TaxMonth"] = orders["Tax Date"].dt.month

# Flag the season each order falls in, based on TaxMonth (e.g. summer = June, July, August)
orders["Winter"] = orders["TaxMonth"].isin([12, 1, 2]).astype(int)
orders["Spring"] = orders["TaxMonth"].isin([3, 4, 5]).astype(int)
orders["Summer"] = orders["TaxMonth"].isin([6, 7, 8]).astype(int)
orders["Fall"] = orders["TaxMonth"].isin([9, 10, 11]).astype(int)


In [20]:
# Uncomment the below if needed
#orders.info() 
#orders.to_csv('out.csv') 

### Product Attributes - Akeneo data

In [21]:
# Reads akeneo product attribute dataset
prdAttr = pd.read_csv("C:/Users/Devin Ferko/Desktop/Codes/Machine Learning Projects/Price Optimization/CSVs/Akeneo Product Attributes - Sheet1.csv")

#Uncomment the below if needed
#prdAttr.head()

In [22]:
# Merge attribute data
orders = orders.merge(prdAttr, on='SKU', how='left')

#Uncomment the below if needed
#orders.info()
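
A left merge on SKU silently duplicates order lines if the attribute sheet contains the same SKU more than once. A minimal sketch of a guard, using the `validate` argument of `DataFrame.merge`; `safe_left_merge` is a hypothetical helper name, and the "Colour" column and sample rows are made up.

```python
import pandas as pd


def safe_left_merge(left: pd.DataFrame, right: pd.DataFrame,
                    key: str = "SKU") -> pd.DataFrame:
    """Left-merge that raises if `key` is not unique in `right`."""
    # validate="many_to_one" makes pandas raise MergeError on duplicate keys
    # in `right`, so the row count of `left` is preserved.
    return left.merge(right, on=key, how="left", validate="many_to_one")


# Made-up example data.
demo_lines = pd.DataFrame({"SKU": ["A1", "A1", "B2"], "Quantity": [1, 2, 3]})
demo_attrs = pd.DataFrame({"SKU": ["A1", "B2"], "Colour": ["red", "blue"]})
merged_demo = safe_left_merge(demo_lines, demo_attrs)
```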

In [23]:
# Reads competitor pricing dataset
compPrice = pd.read_csv("C:/Users/Devin Ferko/Desktop/Codes/Machine Learning Projects/Price Optimization/CSVs/Competitor Pricing.csv")

#Uncomment the below if needed
#compPrice.head()

In [25]:
# Merge competitor pricing data
orders = orders.merge(compPrice, on='SKU', how='left')

#Inspect the merged DataFrame
orders.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6359 entries, 0 to 6358
Data columns (total 81 columns):
 #   Column                              Non-Null Count  Dtype         
---  ------                              --------------  -----         
 0   Order ID                            6359 non-null   int64         
 1   Tax Date                            6359 non-null   datetime64[ns]
 2   Net                                 6359 non-null   float64       
 3   Total                               6359 non-null   float64       
 4   Channel Id                          6359 non-null   int64         
 5   Type Code                           6359 non-null   object        
 6   Orderline ID                        6359 non-null   int64         
 7   Product Id                          6359 non-null   int64         
 8   SKU                                 6359 non-null   object        
 9   Product Name_x                      6359 non-null   object        
 10  Quantity                
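
The `Product Name_x` column in the output above suggests that one of the merged sheets also carries a `Product Name` column, so pandas added suffixes to tell the two apart. One possible way to avoid this, sketched below, is to merge only the columns that `orders` does not already have. `merge_new_columns` is a hypothetical helper, and the "Brand" column and sample rows are made up.

```python
import pandas as pd


def merge_new_columns(left: pd.DataFrame, right: pd.DataFrame,
                      key: str = "SKU") -> pd.DataFrame:
    """Left-merge only the columns of `right` that `left` lacks."""
    new_cols = [c for c in right.columns if c not in left.columns]
    return left.merge(right[[key] + new_cols], on=key, how="left")


# Made-up example data.
demo_left = pd.DataFrame({"SKU": ["A1"], "Product Name": ["Widget"]})
demo_right = pd.DataFrame(
    {"SKU": ["A1"], "Product Name": ["Widget (Akeneo)"], "Brand": ["Acme"]}
)
demo_merged = merge_new_columns(demo_left, demo_right)
```

This keeps the name that came from the order data; if the other sheet's name is preferred, the overlapping column would need to be dropped from `orders` instead.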

### Next Steps

This dataset is a good base, but it would benefit from adding:

- Customer Data