The biggest weakness in the competitive analysis was its 2.6% comparability: exact name matching leaves very few products that can be compared across retailers. We will address this with a more sophisticated data science technique.

The goal: move from exact name matching to canonical product matching. We want to group "Coca-Cola 500ml", "Coke Classic 500ml Bottle", and "Coca-Cola Original Taste 0.5L" into a single canonical product: "coca-cola-500ml". This should substantially expand our comparable product universe.
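
To make the goal concrete, here is a minimal sketch of how comparability can be counted before and after grouping. The three product names come from the paragraph above; the retailers attached to them and the hand-written `canonical` dictionary are assumptions made purely for illustration, standing in for the mapping that the notebook learns later.

```python
from collections import defaultdict

# Toy listings (retailer, product name). The retailers are assigned arbitrarily.
listings = [
    ("Tesco", "Coca-Cola 500ml"),
    ("ASDA", "Coke Classic 500ml Bottle"),
    ("Morrisons", "Coca-Cola Original Taste 0.5L"),
]

# Hypothetical hand-written mapping, standing in for the learned one.
canonical = {
    "Coca-Cola 500ml": "coca-cola-500ml",
    "Coke Classic 500ml Bottle": "coca-cola-500ml",
    "Coca-Cola Original Taste 0.5L": "coca-cola-500ml",
}

def comparable_products(listings, key=lambda name: name):
    """Return the product keys that are stocked by at least two retailers."""
    retailers = defaultdict(set)
    for retailer, name in listings:
        retailers[key(name)].add(retailer)
    return {k for k, shops in retailers.items() if len(shops) >= 2}

exact = comparable_products(listings)                      # exact names
grouped = comparable_products(listings, key=canonical.get)  # canonical names
print(len(exact), len(grouped))
```

With exact names no product is shared between retailers, while the canonical key makes all three listings comparable.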

## Using the e5-large Transformer to Create Canonical Products

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from thefuzz import process, fuzz
import sys
import os

sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))
from src.data_processing import normalise_product_name

sns.set_theme(style="whitegrid")
plt.rcParams["figure.figsize"] = (12, 6)

In [2]:
INTERIM_PATH = "C:/Project/UK store analysis/data/01_interim/cleaned_supermarket_data.parquet"
df_comp = pd.read_parquet(INTERIM_PATH)

print("Cleaned data loaded successfully")
df_comp.head(3)

Cleaned data loaded successfully


Unnamed: 0,supermarket,prices,prices_unit,unit,names,date,category,own_brand
0,Aldi,3.09,0.14,unit,Mamia Ultra-fit Peppa Pig Nappy Pants 22 Pack/...,2024-04-13,baby_products,False
1,Aldi,3.09,0.17,unit,Mamia Ultra-fit Peppa Pig Nappy Pants 18 Pack/...,2024-04-13,baby_products,False
2,Aldi,3.59,0.09,unit,Mamia Ultra-fit Nappy Pants 40 Pack/Size 4,2024-04-13,baby_products,False


In [3]:
df_comp['normalized_name'] = df_comp['names'].apply(normalise_product_name)

# Check a few results
df_comp[['names', 'normalized_name']].head(10)


Unnamed: 0,names,normalized_name
0,Mamia Ultra-fit Peppa Pig Nappy Pants 22 Pack/...,mamia ultrafit peppa pig nappy pants size 6
1,Mamia Ultra-fit Peppa Pig Nappy Pants 18 Pack/...,mamia ultrafit peppa pig nappy pants size 7
2,Mamia Ultra-fit Nappy Pants 40 Pack/Size 4,mamia ultrafit nappy pants size 4
3,Mamia Boy's Night Pants 15 Pack,mamia boys night pants
4,Mamia Girl's Night Pants 15 Pack,mamia girls night pants
5,Mamia Newborn Nappies 24 Pack/Size 1 2-5kg/4-1...,mamia newborn nappies size 1 2411lbs
6,Mamia Nappies Ultra Dry Air System 22 Pack/Siz...,mamia nappies ultra dry air system size 7 xxl
7,Mamia Newborn Nappies 56 Pack/Size 3 Midi 4-9k...,mamia newborn nappies size 3 midi 4920lbs
8,Mamia Newborn Nappies 60 Pack/Size 2 Mini 3-6k...,mamia newborn nappies size 2 mini 3613lbs
9,Mamia Ultra-fit Nappy Pants 28 Pack/Size 7,mamia ultrafit nappy pants size 7


In [4]:
# Extract unique normalised names
unique_names = df_comp["normalized_name"].dropna().unique()

print(f"Unique product names: {len(unique_names)}")

Unique product names: 116229


In [7]:
import torch
from sentence_transformers import SentenceTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("intfloat/e5-large", device=device)
# Prefix inputs with "query: " as the e5 models expect (symmetric similarity tasks use "query: " on both sides)
formatted_names = ["query: " + name for name in unique_names]

print(device)
print(model)

cuda
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)


In [8]:
# Compute embeddings in batches
embeddings = model.encode(
    formatted_names,
    batch_size=128,
    show_progress_bar=True,
    convert_to_numpy=True
)


In [9]:
import faiss

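# L2-normalise in place so that inner product equals cosine similarity
faiss.normalize_L2(embeddings)

dimension = embeddings.shape[1]
index = faiss.IndexFlatIP(dimension)  # exact (brute-force) inner-product index
index.add(embeddings)

# Each name is normally its own nearest neighbour, so k = 5 leaves about 4 other candidates
k = 5
# `distances` holds inner-product similarities here (higher means closer)
distances, indices = index.search(embeddings, k)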
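
For intuition, the search above can be pictured as a brute-force comparison of every vector against every other. The sketch below is a pure-Python stand-in for that idea, not the FAISS API; `normalise`, `top_k_inner_product` and the toy vectors are illustrative names and data.

```python
import math

def normalise(vec):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k_inner_product(vectors, k):
    """Brute-force top-k search by inner product, every vector against every vector."""
    results = []
    for q in vectors:
        scores = [(sum(a * b for a, b in zip(q, v)), j) for j, v in enumerate(vectors)]
        # Highest similarity first; ties are broken by the lower index.
        scores.sort(key=lambda s: (-s[0], s[1]))
        results.append(scores[:k])
    return results

# Three toy 2-d "embeddings": the first two point in nearly the same direction.
toy = [normalise(v) for v in ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0])]
neighbours = top_k_inner_product(toy, k=2)

for i, hits in enumerate(neighbours):
    print(i, [(j, round(score, 3)) for score, j in hits])
```

Every vector finds itself first with a similarity of about 1, which is why the first column of the search result is usually the query itself and only the remaining columns are real candidates.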

In [10]:
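# Canonical mapping: map each close neighbour to the alphabetically smaller name of the pair
similarity_threshold = 0.85
mapping = {}

for i, name in enumerate(unique_names):
    # Keep only the neighbours whose cosine similarity clears the threshold
    similar_indices = [idx for idx, dist in zip(indices[i], distances[i]) if dist >= similarity_threshold]
    for idx in similar_indices:
        neighbour_name = unique_names[idx]
        canonical = min(name, neighbour_name)
        # Later iterations overwrite earlier assignments, so the result depends on row order
        mapping[neighbour_name] = canonical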

# Apply the mapping
df_comp["canonical_name"] = df_comp["normalized_name"].map(mapping)
df_comp["canonical_name"] = df_comp["canonical_name"].fillna(df_comp["normalized_name"])
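
Because the loop above assigns names pairwise, two names in the same neighbourhood can end up with different canonical labels. One possible alternative, sketched here and not what the notebook ran, is to treat every above-threshold pair as an edge and label each connected group with its alphabetically smallest member using a union-find structure. `build_canonical_mapping` and the toy inputs are hypothetical; the `j >= 0` check assumes the search may pad missing neighbours with -1.

```python
def build_canonical_mapping(names, neighbour_ids, similarities, threshold=0.85):
    """Map each name to the alphabetically smallest name in its connected group."""
    parent = list(range(len(names)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, (ids, sims) in enumerate(zip(neighbour_ids, similarities)):
        for j, sim in zip(ids, sims):
            # Skip self-matches, padding entries and weak matches.
            if j != i and j >= 0 and sim >= threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    # Choose one representative per group: the alphabetically smallest member.
    smallest = {}
    for i, name in enumerate(names):
        root = find(i)
        if root not in smallest or name < smallest[root]:
            smallest[root] = name
    return {name: smallest[find(i)] for i, name in enumerate(names)}

# Toy neighbour lists in the same row-per-name layout as the search output.
names = ["coke classic 500ml", "coca cola 500ml", "coca cola original 500ml", "pepsi max 500ml"]
neighbour_ids = [[0, 1], [1, 2], [2, 1], [3, 0]]
similarities = [[1.0, 0.90], [1.0, 0.88], [1.0, 0.88], [1.0, 0.60]]
toy_mapping = build_canonical_mapping(names, neighbour_ids, similarities)
print(toy_mapping)
```

The trade-off is that connected groups are transitive: a chain of individually close names can pull loosely related products into one group, so the threshold may need to be stricter than with pairwise assignment.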

In [11]:
OUTPUT_PATH = "C:/Project/UK store analysis/data/02_processed/canonical_products_e5.parquet"
df_comp.to_parquet(OUTPUT_PATH)

print("✅ Canonical product mapping complete and saved.")


✅ Canonical product mapping complete and saved.


In [12]:
df_comp.head(10)

Unnamed: 0,supermarket,prices,prices_unit,unit,names,date,category,own_brand,normalized_name,canonical_name
0,Aldi,3.09,0.14,unit,Mamia Ultra-fit Peppa Pig Nappy Pants 22 Pack/...,2024-04-13,baby_products,False,mamia ultrafit peppa pig nappy pants size 6,mamia ultrafit peppa pig nappy pants size 6
1,Aldi,3.09,0.17,unit,Mamia Ultra-fit Peppa Pig Nappy Pants 18 Pack/...,2024-04-13,baby_products,False,mamia ultrafit peppa pig nappy pants size 7,mamia ultrafit nappy pants size 7
2,Aldi,3.59,0.09,unit,Mamia Ultra-fit Nappy Pants 40 Pack/Size 4,2024-04-13,baby_products,False,mamia ultrafit nappy pants size 4,mamia ultrafit nappy pants size 4
3,Aldi,4.79,0.32,unit,Mamia Boy's Night Pants 15 Pack,2024-04-13,baby_products,False,mamia boys night pants,mamia boys night pants
4,Aldi,4.79,0.32,unit,Mamia Girl's Night Pants 15 Pack,2024-04-13,baby_products,False,mamia girls night pants,mamia bed time bath
5,Aldi,0.85,0.04,unit,Mamia Newborn Nappies 24 Pack/Size 1 2-5kg/4-1...,2024-04-13,baby_products,False,mamia newborn nappies size 1 2411lbs,mamia eco nappies size 2 mini 3613lbs
6,Aldi,2.79,0.13,unit,Mamia Nappies Ultra Dry Air System 22 Pack/Siz...,2024-04-13,baby_products,False,mamia nappies ultra dry air system size 7 xxl,mamia nappies ultra dry air system size 7 xxl
7,Aldi,2.89,0.05,unit,Mamia Newborn Nappies 56 Pack/Size 3 Midi 4-9k...,2024-04-13,baby_products,False,mamia newborn nappies size 3 midi 4920lbs,mamia eco nappies size 2 mini 3613lbs
8,Aldi,2.25,0.04,unit,Mamia Newborn Nappies 60 Pack/Size 2 Mini 3-6k...,2024-04-13,baby_products,False,mamia newborn nappies size 2 mini 3613lbs,mamia newborn nappies size 2 mini 3613lbs
9,Aldi,3.59,0.13,unit,Mamia Ultra-fit Nappy Pants 28 Pack/Size 7,2024-04-13,baby_products,False,mamia ultrafit nappy pants size 7,mamia ultrafit nappy pants size 7
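
Rows 5 and 7 above show nappies of size 1 and size 3 being mapped to a size 2 product, which would distort any price comparison. A possible guard, sketched under the assumption that size information survives normalisation as digit-led tokens, is to allow a merge only when both names carry the same size tokens. `SIZE_TOKEN`, `size_signature` and `safe_to_merge` are hypothetical helpers, not part of the notebook.

```python
import re

# Tokens that carry quantity or size information: bare numbers, or numbers
# fused with a unit or multiplier such as "500ml", "4x100g" or "3613lbs".
SIZE_TOKEN = re.compile(r"^\d[\dx.]*[a-z]*$")

def size_signature(name):
    """Return the set of size-like tokens in a normalised product name."""
    return frozenset(tok for tok in name.split() if SIZE_TOKEN.match(tok))

def safe_to_merge(name_a, name_b):
    """Allow a merge only when both names carry identical size tokens."""
    return size_signature(name_a) == size_signature(name_b)

# Pairs taken from the table above.
print(safe_to_merge("mamia newborn nappies size 1 2411lbs",
                    "mamia eco nappies size 2 mini 3613lbs"))
print(safe_to_merge("mamia ultrafit peppa pig nappy pants size 7",
                    "mamia ultrafit nappy pants size 7"))
```

This check only catches size mismatches. A merge such as "mamia girls night pants" into "mamia bed time bath" contains no size tokens at all and would need a stricter similarity threshold or a further check.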


## The Payoff: Basket Level Analysis

In [14]:
sns.set_theme(style="whitegrid")
PROCESSED_DATA_PATH = "C:/Project/UK store analysis/data/02_processed/canonical_products_e5.parquet"
df_canonical = pd.read_parquet(PROCESSED_DATA_PATH)

print("Canonical product data loaded successfully.")

Canonical product data loaded successfully.


In [15]:
# Create the competitive pivot table
latest_date = df_canonical["date"].max()
df_latest = df_canonical[df_canonical["date"] == latest_date].copy()

# Drop duplicates in case a retailer lists the same canonical product more than once on the same date
df_latest = df_latest.drop_duplicates(subset=["canonical_name", "supermarket"])

# Pivot on the new canonical_name column
price_pivot_canonical = df_latest.pivot_table(
    index = "canonical_name",
    columns="supermarket",
    values="prices"
)

print(f"Created a competitive pivot table with {price_pivot_canonical.shape[0]} unique canonical products")
price_pivot_canonical.head()

Created a competitive pivot table with 67341 unique canonical products


canonical_name,ASDA,Aldi,Morrisons,Sains,Tesco
0 fat greek style,,,1.45,,
0 fat greek style yogurt,0.85,,1.1,,1.95
0 fat greek style yogurt 4x100g,1.15,,,,1.15
0 fat greek style yogurt strawberry,,,1.45,,
0 fat natural yogurt,1.1,,1.09,,1.7


In [19]:
# Your product index
product_index = price_pivot_canonical.index.tolist()

def find_products_by_keywords_filtered(keywords, product_index, max_items=15):
    matches = []
    for product in product_index:
        # Skip if product name starts with a digit (likely pack size)
        if product[0].isdigit():
            continue
        name = product.lower()
        if any(keyword.lower() in name for keyword in keywords):
            matches.append(product)
        if len(matches) >= max_items:
            break
    return matches

# Broader keywords for each basket
essentials_keywords = [
    "milk", "bread", "egg", "cheese", "potato", "butter", "bean", "banana",
    "yogurt", "tomato", "onion", "apple"
]

big_brand_keywords = [
    "coca-cola", "heinz", "kelloggs", "walkers", "nescafe", "hovis"
]

healthy_choice_keywords = [
    "almond", "avocado", "pasta", "spinach", "chicken", "yogurt", "rice", "broccoli", "carrot"
]

# Find filtered products
essentials = find_products_by_keywords_filtered(essentials_keywords, product_index, max_items=15)
big_brand = find_products_by_keywords_filtered(big_brand_keywords, product_index, max_items=15)
healthy_choice = find_products_by_keywords_filtered(healthy_choice_keywords, product_index, max_items=15)

print("Essentials basket:\n", essentials)
print("\nBig Brand Shop basket:\n", big_brand)
print("\nHealthy Choice basket:\n", healthy_choice)

# Build baskets dict
baskets = {
    "The Essentials": essentials,
    "The Big Brand Shop": big_brand,
    "The Healthy Choice": healthy_choice,
}


Essentials basket:
 ['abracadebora buttermilk pancakes', 'acti leaf almond unsweetened uht milk', 'actimel 0 fat original yogurt drinks', 'actimel blueberry cultured yogurt drink 8x100g', 'actimel blueberry yogurt drinks', 'actimel coconut yogurt drinks', 'actimel coconut yogurt drinks 8x100g', 'actimel dairy free almond mango yogurt drink alternative 6x100g', 'actimel kids strawberry banana yoghurt drink', 'actimel kids strawberry banana yoghurt drink 4x100g', 'actimel kids strawberry banana yoghurt drink multipack 4x100g', 'actimel multifruit cultured yogurt drink 12x100g', 'actimel raspberry 0 added sugar fat free yogurt drink 8x100g', 'actimel strawberry 0 added sugar fat free yogurt drink 12x100g', 'actimel strawberry blueberry cultured yogurt drink 12x100g']

Big Brand Shop basket:
 ['costa nescafe dolce gusto c', 'costa nescafe dolce gusto compa', 'costa nescafe dolce gusto compatible one pod latte', 'heinz 30 less fat salad cream', 'heinz 4 mths original farleys rusks', 'heinz 

In [20]:
baskets

{'The Essentials': ['abracadebora buttermilk pancakes',
  'acti leaf almond unsweetened uht milk',
  'actimel 0 fat original yogurt drinks',
  'actimel blueberry cultured yogurt drink 8x100g',
  'actimel blueberry yogurt drinks',
  'actimel coconut yogurt drinks',
  'actimel coconut yogurt drinks 8x100g',
  'actimel dairy free almond mango yogurt drink alternative 6x100g',
  'actimel kids strawberry banana yoghurt drink',
  'actimel kids strawberry banana yoghurt drink 4x100g',
  'actimel kids strawberry banana yoghurt drink multipack 4x100g',
  'actimel multifruit cultured yogurt drink 12x100g',
  'actimel raspberry 0 added sugar fat free yogurt drink 8x100g',
  'actimel strawberry 0 added sugar fat free yogurt drink 12x100g',
  'actimel strawberry blueberry cultured yogurt drink 12x100g'],
 'The Big Brand Shop': ['costa nescafe dolce gusto c',
  'costa nescafe dolce gusto compa',
  'costa nescafe dolce gusto compatible one pod latte',
  'heinz 30 less fat salad cream',
  'heinz 4 mth
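
The Essentials basket above is dominated by one brand because substring matching accepts "buttermilk" for "butter" and the search stops at the first 15 alphabetical hits. A possible refinement, sketched below, matches whole words and takes a small quota per keyword. `find_products_by_whole_word` is a hypothetical helper, and apart from the first two names the toy product list is invented for illustration.

```python
import re

def find_products_by_whole_word(keywords, product_names, per_keyword=2):
    """Pick up to `per_keyword` products for each keyword, matching whole words only."""
    basket = []
    for keyword in keywords:
        # \b stops "butter" from matching "buttermilk"; the optional plural
        # ending still lets "egg" match "eggs" and "potato" match "potatoes".
        pattern = re.compile(rf"\b{re.escape(keyword.lower())}(?:e?s)?\b")
        hits = [p for p in sorted(product_names)
                if p not in basket and pattern.search(p.lower())]
        basket.extend(hits[:per_keyword])
    return basket

toy_products = [
    "abracadebora buttermilk pancakes",
    "actimel coconut yogurt drinks",
    "british semi skimmed milk",
    "free range eggs",
    "salted butter",
    "pineapple chunks",
]
toy_basket = find_products_by_whole_word(["milk", "egg", "butter", "apple"], toy_products)
print(toy_basket)
```

Keywords would also need to go through the same normalisation as the product names. The normalised names shown earlier appear to have punctuation stripped ("ultrafit", "kelloggs"), so a keyword such as "coca-cola" probably cannot match anything as written.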

In [21]:
category_counts = df_canonical['category'].value_counts()
print(category_counts)

# Then select top categories as baskets


category
food_cupboard      2112803
health_products    1608050
fresh_food         1339306
home               1337377
drinks             1086120
household           544690
free-from           348226
frozen              332651
pets                296586
baby_products       290635
bakery              232798
Name: count, dtype: int64


In [24]:
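# Product index (all canonical product names)
# A set gives fast membership checks, but its iteration order is arbitrary,
# so the keyword baskets built from it can differ between runs
product_index = set(price_pivot_canonical.index.tolist())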

# Function to filter product list by keywords and limit number of items
def find_products_by_keywords_filtered(keywords, product_list, max_items=15):
    matches = []
    for product in product_list:
        if product[0].isdigit():  # skip pack sizes
            continue
        name = product.lower()
        if any(keyword.lower() in name for keyword in keywords):
            matches.append(product)
        if len(matches) >= max_items:
            break
    return matches

# --- Keyword-based baskets (your existing baskets) ---

essentials_keywords = [
    "milk", "bread", "egg", "cheese", "potato", "butter", "bean", "banana",
    "yogurt", "tomato", "onion", "apple"
]

big_brand_keywords = [
    "coca-cola", "heinz", "kelloggs", "walkers", "nescafe", "hovis"
]

healthy_choice_keywords = [
    "almond", "avocado", "pasta", "spinach", "chicken", "yogurt", "rice", "broccoli", "carrot"
]

essentials = find_products_by_keywords_filtered(essentials_keywords, product_index, max_items=15)
big_brand = find_products_by_keywords_filtered(big_brand_keywords, product_index, max_items=15)
healthy_choice = find_products_by_keywords_filtered(healthy_choice_keywords, product_index, max_items=15)

# --- Category-based baskets ---

selected_categories = [
    "food_cupboard",
    "health_products",
    "fresh_food",
    "drinks",
    "household",
    "free-from",
    "frozen",
    "pets",
    "baby_products",
    "bakery"
]

baskets_from_categories = {}
max_products_per_category = 20

for cat in selected_categories:
    # Filter df_canonical to get products in category
    products_in_cat = df_canonical[df_canonical['category'] == cat]['canonical_name'].unique()
    # Keep only those products present in product_index (to ensure consistency)
    filtered_products = [p for p in products_in_cat if p in product_index and not p[0].isdigit()]
    # Limit the number of products per category
    baskets_from_categories[cat] = filtered_products[:max_products_per_category]

# --- Combine all baskets into one dictionary ---

baskets = {
    "The Essentials": essentials,
    "The Big Brand Shop": big_brand,
    "The Healthy Choice": healthy_choice,
}

# Add category-based baskets with nicer names (optional)
category_basket_names = {
    "food_cupboard": "Food Cupboard",
    "health_products": "Health Products",
    "fresh_food": "Fresh Food",
    "drinks": "Drinks",
    "household": "Household",
    "free-from": "Free From",
    "frozen": "Frozen Foods",
    "pets": "Pets",
    "baby_products": "Baby Products",
    "bakery": "Bakery",
}

for cat, products in baskets_from_categories.items():
    baskets[category_basket_names.get(cat, cat)] = products

# --- Print basket summaries ---

for basket_name, products in baskets.items():
    print(f"Basket: {basket_name} - {len(products)} products")
    print(products)
    print()


Basket: The Essentials - 15 products
['the best all butter chocola', 'inspired cuisine pulled beef bean chilli', 'galbani maxi italian mozzarella cheese', 'sainsburys light soft cheese', 'bernard matthews ham cheese turkey escalopes', 'emporium british spicy cheddar cheese slices with chill', 'cut green beans', 'sainsburys lemon cheesecake summer edition', 'cadbury dark milk giant buttons chocolate bag', 'cow gate 4 growing up milk powder 2 years', 'heinz oat banana multigrain 7 months', 'napolina double concentrate tomato puree', 'mcvities blissfuls belgian milk chocolate caramel biscuits', 'jacobs biscuits for cheese', 'tropical sun black beans in salted water']

Basket: The Big Brand Shop - 15 products
['heinz 4 mths original farleys rusks', 'heinz oat banana multigrain 7 months', 'heinz cream of tomato cup packet soup x4', 'kelloggs rice krispies cereal milk bars 6x20g', 'hovis seeded', 'walkers baked cheese onion multipack crisps 6x22g', 'heinz no added sugar cream of tomato soup'
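
These baskets differ from the earlier run because the product index is now a set, whose iteration order is not guaranteed to be the same between sessions. If a varied but reproducible selection is wanted, one option is to sort the candidates and sample them with a seeded generator. `pick_basket` and its default seed are hypothetical choices for this sketch.

```python
import random

def pick_basket(candidates, max_items=15, seed=42):
    """Pick a reproducible random sample from an unordered collection of names."""
    ordered = sorted(candidates)   # fix the order before sampling
    rng = random.Random(seed)      # private generator, global random state untouched
    return rng.sample(ordered, min(max_items, len(ordered)))

# The same members in a different container and order give the same basket.
first = pick_basket({"milk", "bread", "eggs", "butter"}, max_items=2)
second = pick_basket(["butter", "eggs", "bread", "milk"], max_items=2)
print(first == second)
```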

In [27]:
baskets

{'The Essentials': ['the best all butter chocola',
  'inspired cuisine pulled beef bean chilli',
  'galbani maxi italian mozzarella cheese',
  'sainsburys light soft cheese',
  'bernard matthews ham cheese turkey escalopes',
  'emporium british spicy cheddar cheese slices with chill',
  'cut green beans',
  'sainsburys lemon cheesecake summer edition',
  'cadbury dark milk giant buttons chocolate bag',
  'cow gate 4 growing up milk powder 2 years',
  'heinz oat banana multigrain 7 months',
  'napolina double concentrate tomato puree',
  'mcvities blissfuls belgian milk chocolate caramel biscuits',
  'jacobs biscuits for cheese',
  'tropical sun black beans in salted water'],
 'The Big Brand Shop': ['heinz 4 mths original farleys rusks',
  'heinz oat banana multigrain 7 months',
  'heinz cream of tomato cup packet soup x4',
  'kelloggs rice krispies cereal milk bars 6x20g',
  'hovis seeded',
  'walkers baked cheese onion multipack crisps 6x22g',
  'heinz no added sugar cream of tomato s
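
Once the baskets are fixed, a natural next step is to price each one per retailer. The sketch below is one way to do that, assuming a basket should only count items that every retailer in the table stocks, so that the totals stay like-for-like. `basket_totals` is a hypothetical helper; the toy prices are copied from the pivot table shown earlier, restricted to three retailers.

```python
def has_price(value):
    """True for a real number; False for None or NaN (NaN is not equal to itself)."""
    return value is not None and value == value

def basket_totals(prices, basket):
    """Sum basket prices per retailer over the items that every retailer stocks."""
    retailers = sorted({r for item in prices.values() for r in item})
    common = [p for p in basket
              if p in prices and all(has_price(prices[p].get(r)) for r in retailers)]
    totals = {r: round(sum(prices[p][r] for p in common), 2) for r in retailers}
    return common, totals

# Prices per canonical product and retailer; a missing retailer means no listing.
toy_prices = {
    "0 fat greek style yogurt": {"ASDA": 0.85, "Morrisons": 1.10, "Tesco": 1.95},
    "0 fat natural yogurt": {"ASDA": 1.10, "Morrisons": 1.09, "Tesco": 1.70},
    "0 fat greek style yogurt 4x100g": {"ASDA": 1.15, "Tesco": 1.15},
}
common, totals = basket_totals(toy_prices, list(toy_prices))
print(common)
print(totals)
```

In the notebook, a dictionary of this shape could probably be obtained with `price_pivot_canonical.to_dict(orient="index")`, where missing prices appear as NaN and are handled by `has_price`.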