# Smart Substitution Model

**DataBytes - DiscountMate Project - ML Team**

**Author:**
Bailey Mulcahy (224534798), s224534798@deakin.edu.au <br>

---
## Overview

This document outlines a suggested approach for the smart substitution model, including the following sections:

* **Introduction**
* **Loading the Dataset**
* **Preprocessing the Data**
* **Building the Model**
* **Evaluating the Model**
* **Conclusion**

---

## Introduction

The smart substitution model aims to suggest items that are similar in quality and size to the original but cheaper. For example, it may suggest a \$4 carton of milk in place of a \$5 carton of the same size.

This document builds the model and generates a suggested substitute for each item in the dataset for which a cheaper, similar item is found. The result is a new dataset containing the original information alongside the information for the suggested substitute item.

---

## Loading the Dataset

This first section loads the dataset as a pandas DataFrame and prints the first row to inspect.

In [4]:
# Import required libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.neighbors import NearestNeighbors
from scipy.sparse import hstack
import ast
from scipy.sparse import csr_matrix

# Load the dataset as a pandas DataFrame and print the first row
df = pd.read_csv("smart_substitution_dataset.csv")
print(df.head(1))

   product_code                                    name  brand  \
0       8371390  coles hot cross buns traditional fruit  coles   

  brand_confidence brand_tier category subcategory  original_price  \
0      store_brand      store   easter      easter             4.4   

   sale_price  std_item_size std_item_size_unit  item_size  price_per_unit  \
0         3.0            6.0               pack     6.0274            0.73   

  unit_type size_band                                 tags  similarity_score  
0      each     mixed  ['coles', 'cross', 'buns', 'fruit']          0.179609  


---

## Preprocessing the Data

This section fixes some issues that may lead to poor predictions and prepares the dataset for TF-IDF. This includes filling missing values in numeric columns and combining the text features used for similarity.

In [7]:
# Convert tags from string to list
if isinstance(df['tags'].iloc[0], str):             # If 'tags' is a string,
    df['tags'] = df['tags'].apply(ast.literal_eval) # Convert to list

# Combine tags into a single string for TF-IDF (space separated)
df['tags_str'] = df['tags'].apply(lambda x: ' '.join(x))

# Fill missing numeric columns
df['item_size'] = df['item_size'].fillna(1) # Default to 1
df['price_per_unit'] = df['price_per_unit'].fillna(df['sale_price'] / df['item_size']) # Fill missing values with sale price divided by item size

# Combine features for similarity into one string (tags + category + subcategory)
df['combined_features'] = df['tags_str'] + ' ' + df['category'] + ' ' + df['subcategory']

print(f"Data loaded and preprocessed. Total rows: {len(df)}")

Data loaded and preprocessed. Total rows: 24897
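To make the preprocessing steps easier to follow, the sketch below applies the same transformations to a three-row toy DataFrame. The rows and the `toy` name are invented for illustration and are not taken from the dataset.

```python
import ast
import pandas as pd

# Hypothetical toy rows (not from the real dataset)
toy = pd.DataFrame({
    'tags': ["['coles', 'milk', 'full', 'cream']", "['milk', 'lite']", "['bread', 'white']"],
    'category': ['dairy', 'dairy', 'bakery'],
    'subcategory': ['milk', 'milk', 'bread'],
    'sale_price': [4.0, 5.0, 3.0],
    'item_size': [2.0, None, 1.0],
    'price_per_unit': [None, None, 3.0],
})

# Parse the stringified lists, then join each list into a space-separated string
toy['tags'] = toy['tags'].apply(ast.literal_eval)
toy['tags_str'] = toy['tags'].apply(' '.join)

# Missing item_size defaults to 1; missing price_per_unit falls back to sale_price / item_size
toy['item_size'] = toy['item_size'].fillna(1)
toy['price_per_unit'] = toy['price_per_unit'].fillna(toy['sale_price'] / toy['item_size'])

# Text used for TF-IDF: tags + category + subcategory
toy['combined_features'] = toy['tags_str'] + ' ' + toy['category'] + ' ' + toy['subcategory']
print(toy[['item_size', 'price_per_unit', 'combined_features']])
```

Note that when `item_size` is missing, the default of 1 makes the filled `price_per_unit` equal to the sale price, as in the second toy row.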


---

## Building the Model

This section builds a nearest neighbors model and generates a substitute for each item, printing a progress update every 1000 rows.

**How does this model work?**

* The model begins by converting text (the combined string of tags + category + subcategory) into a TF-IDF matrix, where each product is represented as a numerical vector of word/term weights. stop_words='english' ensures common words like "the", "and", or "of" are ignored.
* Numeric features (price_per_unit and item_size) are scaled using StandardScaler to ensure they are comparable with other features, and categorical features (size_band, category, and subcategory) are one-hot encoded.
* These text, numeric, and categorical vectors are combined into a hybrid feature matrix for each product, allowing the model to capture both semantic and structured feature similarity.
* A NearestNeighbors model is created using cosine distance, which measures dissimilarity between the hybrid vectors (similarity is 1 minus the distance). n_neighbors=6 retrieves the item itself plus its 5 nearest neighbors, assuming the item is returned as its own closest match.
* The dataset is prepared to store suggested items by creating new columns for the recommendations.
* The model loops through every item to generate substitutes. For each item, it finds the 6 nearest neighbors in the hybrid feature space and converts cosine distance to similarity (closer to 1 = more similar).
* Cost-focused filtering is applied: only neighbors with a lower sale price than the current item are considered.
* Among the cheaper candidates, the most similar one is selected. If no cheaper neighbor exists, no recommendation is made for that item.
* Savings are calculated as the difference between the original item's sale price and the suggested substitute’s sale price, and similarity is recorded.
* Finally, the information of the selected substitute is added to the dataset, and a new CSV file is created with all recommendations.
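
The steps in the list above can be sketched on a toy example before running them on the full dataset. The four products below are invented for illustration, only `size_band` is one-hot encoded to keep the example short, and `n_neighbors=3` is chosen only because the toy set is so small.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical products (not from the real dataset)
toy = pd.DataFrame({
    'name': ['milk A 2L', 'milk B 2L', 'bread white', 'bread wholemeal'],
    'combined_features': ['milk full cream dairy', 'milk full cream dairy',
                          'bread white bakery', 'bread wholemeal bakery'],
    'size_band': ['large', 'large', 'medium', 'medium'],
    'price_per_unit': [2.5, 2.0, 4.0, 4.5],
    'item_size': [2.0, 2.0, 0.7, 0.7],
    'sale_price': [5.0, 4.0, 2.8, 3.2],
})

# Text, numeric, and categorical blocks, stacked into one hybrid matrix
text = TfidfVectorizer(stop_words='english').fit_transform(toy['combined_features'])
numeric = csr_matrix(StandardScaler().fit_transform(toy[['price_per_unit', 'item_size']]))
categorical = OneHotEncoder(handle_unknown='ignore').fit_transform(toy[['size_band']])
hybrid = hstack([text, numeric, categorical], format='csr')

# Query the neighbours of the first product (milk A, sale price 5.0)
nn = NearestNeighbors(n_neighbors=3, metric='cosine').fit(hybrid)
distances, indices = nn.kneighbors(hybrid[0])
similarities = 1 - distances[0]          # cosine distance -> cosine similarity

keep = indices[0] != 0                   # drop the query item itself
cand, sims = indices[0][keep], similarities[keep]

# Keep only strictly cheaper candidates, then take the most similar one
cheaper = toy['sale_price'].to_numpy()[cand] < toy['sale_price'].iloc[0]
best = cand[cheaper][sims[cheaper].argmax()]
print(toy.loc[best, 'name'])
```

Here the cheaper milk is chosen over the even cheaper breads because it is far more similar in the hybrid feature space, which is the behaviour the cost filter plus similarity ranking is meant to produce.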

In [10]:
# -----------------------------
# Build NearestNeighbors Model
# -----------------------------

# Convert text (combined feature string) to TF-IDF vectors
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(df['combined_features'])

# Scale numeric features
numeric_features = df[['price_per_unit', 'item_size']].fillna(0)
scaler = StandardScaler()
numeric_scaled = scaler.fit_transform(numeric_features)
numeric_sparse = csr_matrix(numeric_scaled)

# Encode categorical features
categorical_features = df[['size_band', 'category', 'subcategory']].fillna("unknown")
encoder = OneHotEncoder(handle_unknown='ignore')
categorical_encoded = encoder.fit_transform(categorical_features)

# Combine TF-IDF, numeric features, and categorical features into a hybrid matrix
hybrid_matrix = hstack([tfidf_matrix, numeric_sparse, categorical_encoded], format='csr')  # CSR format so rows can be sliced below

# Fit NearestNeighbors model (cosine distance)
nn_model = NearestNeighbors(n_neighbors=6, metric='cosine', n_jobs=-1)  # Retrieves 5 similar products + itself
nn_model.fit(hybrid_matrix) # Fit NearestNeighbors model on the hybrid feature matrix

print("Hybrid NearestNeighbors model built (TF-IDF + numeric + categorical).")

# --------------------------
# Generate Substitute Items
# --------------------------

# List of all original columns to copy for suggested items
original_cols = [
    'product_code', 'name', 'brand', 'brand_confidence', 'brand_tier',
    'category', 'subcategory', 'original_price', 'sale_price',
    'std_item_size', 'std_item_size_unit', 'item_size', 'price_per_unit',
    'unit_type', 'size_band', 'tags', 'similarity_score'
]

# Prepare new columns with 'suggested_' prefix
for col in original_cols:
    df[f'suggested_{col}'] = None

# Explicitly create suggested_savings and suggested_similarity
df['suggested_savings'] = None
df['suggested_similarity'] = None

# Loop through each item
total_rows = len(df)
for idx in range(total_rows):
    
    # Find nearest neighbors
    distances, indices = nn_model.kneighbors(hybrid_matrix[idx:idx+1]) # Finds 6 nearest neighbors for current item
    similarities = 1 - distances[0] # Convert cosine distance to similarity (closer to 1 = more similar)
    candidate_idxs = indices[0][1:] # Skip self (assumes the query item is returned first, at distance 0)
    similarities_candidates = similarities[1:] # Skip self for similarities

    # Select the most similar candidate
    if len(candidate_idxs) > 0:
        # Identify all cheaper candidates
        cheaper_mask = df.iloc[candidate_idxs]['sale_price'].values < df.iloc[idx]['sale_price']
        
        if cheaper_mask.any():
            # Select the most similar among cheaper candidates
            cheaper_candidates = candidate_idxs[cheaper_mask]
            cheaper_sims = similarities_candidates[cheaper_mask]
            best_pos = cheaper_sims.argmax()
            best_idx = cheaper_candidates[best_pos]
            best_similarity = cheaper_sims[best_pos]

            # Save all suggested item details
            for col in original_cols:
                df.at[idx, f'suggested_{col}'] = df.iloc[best_idx][col]
            
            # Calculate savings
            savings = df.iloc[idx]['sale_price'] - df.iloc[best_idx]['sale_price']
            
            # Save extra info
            df.at[idx, 'suggested_savings'] = round(savings, 2)
            df.at[idx, 'suggested_similarity'] = round(best_similarity, 2)

    # Progress update every 1000 rows
    if (idx + 1) % 1000 == 0 or idx == total_rows - 1:
        print(f"Processed {idx + 1}/{total_rows} rows.")

# Save results to CSV
df.to_csv('smart_substitution_with_recommendations.csv', index=False)
print('File saved.')

Hybrid NearestNeighbors model built (TF-IDF + numeric + categorical).
Processed 1000/24897 rows.
Processed 2000/24897 rows.
Processed 3000/24897 rows.
Processed 4000/24897 rows.
Processed 5000/24897 rows.
Processed 6000/24897 rows.
Processed 7000/24897 rows.
Processed 8000/24897 rows.
Processed 9000/24897 rows.
Processed 10000/24897 rows.
Processed 11000/24897 rows.
Processed 12000/24897 rows.
Processed 13000/24897 rows.
Processed 14000/24897 rows.
Processed 15000/24897 rows.
Processed 16000/24897 rows.
Processed 17000/24897 rows.
Processed 18000/24897 rows.
Processed 19000/24897 rows.
Processed 20000/24897 rows.
Processed 21000/24897 rows.
Processed 22000/24897 rows.
Processed 23000/24897 rows.
Processed 24000/24897 rows.
Processed 24897/24897 rows.
File saved.
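
The loop above queries the model once per row. As a possible alternative, `kneighbors` can be called once on the whole matrix and the cheaper-candidate filter applied with NumPy. The helper below is a hypothetical sketch, not the code that produced the results in this document: the name `pick_cheaper_substitutes` and its arguments are assumptions, and it excludes the query item by index rather than by dropping the first neighbor, so results could differ slightly where products have identical feature vectors.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def pick_cheaper_substitutes(features, prices, n_neighbors=6):
    """Hypothetical helper: for each row, return the index and similarity of the
    most similar strictly cheaper neighbour, or (-1, nan) when none exists."""
    prices = np.asarray(prices, dtype=float)
    k = min(n_neighbors, len(prices))
    model = NearestNeighbors(n_neighbors=k, metric='cosine').fit(features)
    distances, indices = model.kneighbors(features)   # one batched query for all rows

    similarities = 1 - distances
    row_ids = np.arange(len(prices))
    # A candidate must be a different row and strictly cheaper than the query row
    valid = (indices != row_ids[:, None]) & (prices[indices] < prices[:, None])
    masked = np.where(valid, similarities, -np.inf)

    best_pos = masked.argmax(axis=1)
    has_match = valid.any(axis=1)
    best_idx = np.where(has_match, indices[row_ids, best_pos], -1)
    best_sim = np.where(has_match, similarities[row_ids, best_pos], np.nan)
    return best_idx, best_sim

# Toy check with invented vectors and prices
toy_features = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
toy_prices = [5.0, 4.0, 3.0, 3.5]
best_idx, best_sim = pick_cheaper_substitutes(toy_features, toy_prices, n_neighbors=3)
print(best_idx)
```

In the notebook this could be called as `pick_cheaper_substitutes(hybrid_matrix, df['sale_price'])`, with the returned indices then used to copy the `suggested_` columns in one step.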


---

## Evaluating the Model

A few key metrics give an overview of how the model is performing. We want to know, first, whether the model saves money by suggesting cheaper alternatives and, second, how similar those suggestions actually are to the original items. These metrics must be read together, because none of them gives a clear picture of model performance on its own. For example, if 'money saved' were the only metric, a model that simply suggested the cheapest item available would beat all other models, which is not what we want. We want a model that suggests cheaper alternatives that *are very similar* to the original item. Adding a metric such as swap acceptance rate (the percentage of suggested items that meet specified conditions) gives a combined view of both the money saved and the similarity of the suggestions to the original items.

**Number of substitutions suggested:**

In [73]:
# Load the CSV with recommendations
df_results = pd.read_csv("smart_substitution_with_recommendations.csv")

# Count null and non-null values
missing_count = df_results['suggested_product_code'].isnull().sum()
non_missing_count = df_results['suggested_product_code'].notnull().sum()

# Calculate percentage of items with a suggestion
percent_with_suggestion = (non_missing_count / len(df_results)) * 100

# Print results
print(f"Items with no suggestion: {missing_count}")
print(f"Items with a suggestion: {non_missing_count}")
print(f"Percentage of items with a suggestion: {percent_with_suggestion:.2f}%")

Items with no suggestion: 6623
Items with a suggestion: 18274
Percentage of items with a suggestion: 73.40%


As shown above, 6623 original items did not receive a substitution suggestion, meaning that none of their 5 nearest neighbors was cheaper. The model deliberately does not recommend items that are not cheaper than the original, in keeping with the 'smart substitution' idea. This could be changed so that every item receives a recommendation, but those recommendations would not always be cheaper. Overall, 73.40% of items received a suggested substitute at a cheaper price.

**Average savings per item:**

In [76]:
# Ensure suggested_savings is numeric
df_results['suggested_savings'] = pd.to_numeric(df_results['suggested_savings'], errors='coerce')

# Calculate average and total savings
avg_savings = df_results['suggested_savings'].dropna().mean()
avg_savings_including_empty = df_results['suggested_savings'].fillna(0).mean()
total_savings = df_results['suggested_savings'].dropna().sum()

print(f"Average savings per item (with substitute): ${avg_savings:.2f}")
print(f"Average savings per item (overall): ${avg_savings_including_empty:.2f}")
print(f"Total savings across all items: ${total_savings:.2f}")

Average savings per item (with substitute): $4.07
Average savings per item (overall): $2.99
Total savings across all items: $74399.30


As shown above, the average savings per item where a substitute was found is \$4.07. Averaging across all items, with items that have no suggestion counted as \$0 savings, gives \$2.99. In total, the suggested substitutions would save a combined \$74,399.30 if each suggested swap were made once.
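
As a quick consistency check, the two averages can be recomputed from the totals printed above. This is plain arithmetic on the reported figures; the variable names are chosen here for illustration.

```python
# Figures copied from the printed outputs in this document
total_items = 24897
items_with_suggestion = 18274
total_savings = 74399.30

avg_with_substitute = total_savings / items_with_suggestion  # savings per suggested item
avg_overall = total_savings / total_items                    # items without a suggestion count as 0
coverage = items_with_suggestion / total_items               # share of items with a suggestion

print(f"{avg_with_substitute:.2f} {avg_overall:.2f} {coverage:.2%}")
```

The overall average is simply the per-suggestion average multiplied by the coverage, which is why it is lower.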

**Swap acceptance rate:**

Another key metric is the percentage of suggested items that meet our criteria for a successful swap. These criteria can be adjusted based on the nature of the items we want the model to suggest. In this evaluation, a swap is accepted if there is a suggested item, it has the same subcategory, size band, and unit type as the original, and its similarity is at least 0.5.

In [79]:
# Define swap acceptance rules
# Valid if:
# 1. There is a suggested item
# 2. Same subcategory
# 3. Same size_band
# 4. Same unit_type
# 5. Similarity at or above threshold (0.5)
valid_swaps_mask = (
    df_results['suggested_product_code'].notna() &
    (df_results['subcategory'] == df_results['suggested_subcategory']) &
    (df_results['size_band'] == df_results['suggested_size_band']) &
    (df_results['unit_type'] == df_results['suggested_unit_type']) &
    (df_results['suggested_similarity'] >= 0.5)
)

accepted_swaps = valid_swaps_mask.sum()
total_swaps_suggested = df_results['suggested_product_code'].notna().sum()
total_items = len(df_results)

# Swap acceptance rate out of items that had suggestions
acceptance_rate_suggested = accepted_swaps / total_swaps_suggested if total_swaps_suggested > 0 else 0

# Swap acceptance rate out of all items
acceptance_rate_all = accepted_swaps / total_items if total_items > 0 else 0

print(f"Accepted swaps: {accepted_swaps}")
print(f"Total swaps suggested: {total_swaps_suggested}")
print(f"Swap Acceptance Rate (out of suggested items): {acceptance_rate_suggested:.2%}")
print(f"Swap Acceptance Rate (out of all items): {acceptance_rate_all:.2%}")

Accepted swaps: 16610
Total swaps suggested: 18274
Swap Acceptance Rate (out of suggested items): 90.89%
Swap Acceptance Rate (out of all items): 66.71%


As shown above, the model performs well against these criteria. Of the suggested items, 90.89% were accepted under the specified conditions. Measured against all items, including those without a suggestion, the rate is a more moderate 66.71%.
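
Because the acceptance rate depends on the chosen similarity threshold, it can be useful to see how it changes as the threshold moves. The sketch below wraps the same rules in a hypothetical helper, `acceptance_rate`, and runs it on an invented four-row frame. The numbers it prints describe the toy data only, not the model.

```python
import pandas as pd

def acceptance_rate(results, threshold=0.5):
    """Hypothetical helper: share of suggested swaps that meet the acceptance
    rules at the given similarity threshold."""
    suggested = results['suggested_product_code'].notna()
    accepted = (
        suggested
        & (results['subcategory'] == results['suggested_subcategory'])
        & (results['size_band'] == results['suggested_size_band'])
        & (results['unit_type'] == results['suggested_unit_type'])
        & (results['suggested_similarity'] >= threshold)
    )
    return accepted.sum() / suggested.sum() if suggested.any() else 0.0

# Invented rows: two clean matches, one subcategory mismatch, one item with no suggestion
toy_results = pd.DataFrame({
    'subcategory': ['milk', 'milk', 'bread', 'eggs'],
    'size_band': ['large', 'large', 'medium', 'small'],
    'unit_type': ['litre', 'litre', 'each', 'each'],
    'suggested_product_code': [101.0, 102.0, 103.0, None],
    'suggested_subcategory': ['milk', 'milk', 'rolls', None],
    'suggested_size_band': ['large', 'large', 'medium', None],
    'suggested_unit_type': ['litre', 'litre', 'each', None],
    'suggested_similarity': [0.95, 0.60, 0.80, None],
})

for t in (0.5, 0.7, 0.99):
    print(f"threshold {t}: {acceptance_rate(toy_results, t):.2%}")
```

On the real results, calling `acceptance_rate(df_results, t)` for a few values of `t` would show how sensitive the reported rate is to the 0.5 cut-off.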

## Conclusion

Overall, the model performed well. With a swap acceptance rate of 90.89%, roughly 9 in 10 suggested items met the criteria for an effective swap, among the 73.40% of items that received a suggested substitution. Counting items with no suggestion, the swap acceptance rate was still 66.71%. Across all suggested substitutions, the model saved a total of \$74,399.30, an average of \$4.07 per suggested item. This is a meaningful saving for shoppers, and the swap acceptance rate indicates that the suggested items are similar as well as cheaper.