---

# **Moore MKW - Anomaly Detection in Product Prices**
### **A pilot project using crawled Keessmit product data**

### Moni IT & Consultancy

### Gert Wijnalda, M.M.J.D.T. Voeten

---

# **1.0 - Project Goal**

---

- Detect anomalies in product prices.
- Determine whether automating anomaly detection, in addition to randomized checks, can improve confidence in audits while reducing auditor workload.

---

# **2.0 - Dataset Description**

---

<style>
  table {
    margin: auto;
    width: 100%;
  }
  td {
    text-align: justify;
    padding: 8px; 
  }
  th
  {
    text-align: center;
  }
</style>

---

<table>
  <tr> <th> Feature </th> <th> Description </th> </tr>
  <tr> <td> productID </td> <td> Unique identifier assigned to each product. </td> </tr>
  <tr> <td> productName </td> <td> Full descriptive name of the product, often including specifications or value (e.g., "Shadowline Parasol beschermhoes Small"). </td> </tr>
  <tr> <td> productPrice </td> <td> Listed retail price of the product, expressed in local currency. </td> </tr>
  <tr> <td> productVAT </td> <td> Value Added Tax (VAT) applicable to the product, if available. </td> </tr>
  <tr> <td> productDiscount </td> <td> Discount offered on the product, typically expressed as a percentage reduction (e.g., "-15%"). </td> </tr>
  <tr> <td> productColor </td> <td> Color or combination of colors of the product (e.g., "Grijs - Antraciet"). </td> </tr>
  <tr> <td> productCategory </td> <td> Product category or grouping, indicating the type of item (e.g., "Parasols", "Tuinstoelen"). </td> </tr>
  <tr> <td> productBrand </td> <td> Brand or manufacturer associated with the product. </td> </tr>
  <tr> <td> productURL </td> <td> Website link to the product’s detail page. </td> </tr>
  <tr> <td> currentDate </td> <td> Timestamp indicating when the product information was collected or recorded. </td> </tr>
</table><br>

---

# **3.0 - Methodology**

---

TODO

---

# **4.0 - Imports & Preliminaries**

---

In [None]:
#%pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


In [105]:
# General
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import re

# Visualisations
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import matplotlib.ticker as mtick

# Machine Learning
from sklearn.model_selection import train_test_split, cross_val_score, KFold, cross_val_predict
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import LabelEncoder, StandardScaler
from scipy.stats import ttest_rel, wilcoxon, pearsonr
from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import hstack

# Warnings
import warnings
warnings.filterwarnings('ignore')

In [106]:
magic_number = sum([ord(x) for x in "Make Moni Flow"])

SEED = magic_number
TEST_SIZE = 0.2
RANDOM_STATE = magic_number
print(magic_number)

1257


In [107]:
raw_data = "./src/Keessmit_raw.xlsx"
data = pd.read_excel(raw_data)
data.head()

Unnamed: 0,productID,productName,productPrice,productVAT,productDiscount,productColor,productCategory,productBrand,productURL,currentDate,url
0,1000.0,"Cadeaubon t.w.v. €20,- Kees Smit Tuinmeubelen",,,,Groen,\n Tuinaccessoires ...,Kees Smit Collectie,,2025-08-12T09:38:53.862Z,https://www.keessmit.nl/cadeaubon-t-w-v-20-kee...
1,100365.0,Hartman Da Vinci standenstoel,"295,-",,(-15%),Grijs - Antraciet,\n Tuinstoelen,Hartman tuinmeubelen,,2025-08-12T09:38:53.982Z,https://www.keessmit.nl/hartman-da-vinci-stand...
2,10060.0,Shadowline Parasol beschermhoes Small,29.95,,(-19%),Grijs - Antraciet,\n Parasols,Shadowline,,2025-08-12T09:38:54.099Z,https://www.keessmit.nl/shadowline-parasol-bes...
3,10061.0,Shadowline Parasol beschermhoes Medium,33.95,,(-14%),Grijs - Antraciet,\n Parasols,Shadowline,,2025-08-12T09:38:54.214Z,https://www.keessmit.nl/shadowline-parasol-bes...
4,10062.0,Shadowline Parasol beschermhoes Large,37.95,,(-10%),Grijs - Antraciet,\n Parasols,Shadowline,,2025-08-12T09:38:54.308Z,https://www.keessmit.nl/shadowline-parasol-bes...


---

# **5.0 - Dataset Cleaning**

---

In [108]:
data.describe()

Unnamed: 0,productID,productVAT,productURL
count,4519.0,0.0,0.0
mean,121903.757911,,
std,8118.953139,,
min,1000.0,,
25%,118988.5,,
50%,124318.0,,
75%,126379.0,,
max,128388.0,,


In [109]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4520 entries, 0 to 4519
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   productID        4519 non-null   float64
 1   productName      4520 non-null   object 
 2   productPrice     4135 non-null   object 
 3   productVAT       0 non-null      float64
 4   productDiscount  3873 non-null   object 
 5   productColor     2800 non-null   object 
 6   productCategory  4519 non-null   object 
 7   productBrand     4510 non-null   object 
 8   productURL       0 non-null      float64
 9   currentDate      4520 non-null   object 
 10  url              4520 non-null   object 
dtypes: float64(3), object(8)
memory usage: 388.6+ KB


In [110]:
data = data.drop(columns=["productVAT", "productURL", "currentDate"])
data.head()

Unnamed: 0,productID,productName,productPrice,productDiscount,productColor,productCategory,productBrand,url
0,1000.0,"Cadeaubon t.w.v. €20,- Kees Smit Tuinmeubelen",,,Groen,\n Tuinaccessoires ...,Kees Smit Collectie,https://www.keessmit.nl/cadeaubon-t-w-v-20-kee...
1,100365.0,Hartman Da Vinci standenstoel,"295,-",(-15%),Grijs - Antraciet,\n Tuinstoelen,Hartman tuinmeubelen,https://www.keessmit.nl/hartman-da-vinci-stand...
2,10060.0,Shadowline Parasol beschermhoes Small,29.95,(-19%),Grijs - Antraciet,\n Parasols,Shadowline,https://www.keessmit.nl/shadowline-parasol-bes...
3,10061.0,Shadowline Parasol beschermhoes Medium,33.95,(-14%),Grijs - Antraciet,\n Parasols,Shadowline,https://www.keessmit.nl/shadowline-parasol-bes...
4,10062.0,Shadowline Parasol beschermhoes Large,37.95,(-10%),Grijs - Antraciet,\n Parasols,Shadowline,https://www.keessmit.nl/shadowline-parasol-bes...


In [111]:
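def clean_text(text):
    """Lowercase a product name and replace punctuation with spaces."""
    if pd.isna(text):
        return ""
    text = text.lower()
    text = re.sub(r'[^\w\s]', ' ', text)  # punctuation -> space
    text = re.sub(r'\s+', ' ', text)      # collapse runs of whitespace
    return text.strip()

data['productName_clean'] = data['productName'].apply(clean_text)
# Drop single-character tokens (e.g. the "s" left over from "ROUGH-S" or the "7" from "7-delig")
data['productName_clean'] = data['productName_clean'].str.replace(r'\b\w\b', '', regex=True)
# Collapse the whitespace left behind by the removed tokens
data['productName_clean'] = data['productName_clean'].str.replace(r'\s+', ' ', regex=True).str.strip()
data['productName_clean'] = data['productName_clean'].astype(str)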



In [None]:
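def clean_price(price):
    """Parse a Dutch-formatted price such as "295,-" or "1.234,56" into a float."""
    if pd.isna(price):
        return None
    price_str = str(price).strip()
    # Strip trailing "," and "-" characters: "295,-" -> "295"
    price_str = re.sub(r"[,-]+$", "", price_str)
    if re.match(r"^\d{1,3}(\.\d{3})*,\d{1,2}$", price_str):
        # Thousands dots plus decimal comma: "1.234,56" -> "1234.56"
        price_str = price_str.replace(".", "").replace(",", ".")
    elif "," in price_str and "." not in price_str:
        # Decimal comma only: "1234,5" -> "1234.5"
        price_str = price_str.replace(",", ".")
    elif re.match(r"^\d{1,3}(\.\d{3})*$", price_str):
        # Thousands dots only: "1.369" -> "1369"
        price_str = price_str.replace(".", "")
    try:
        return round(float(price_str), 2)
    except ValueError:
        return None

# Missing or unparseable prices become 0
data['productPrice'] = data['productPrice'].apply(clean_price).fillna(0)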
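
As a quick sanity check of the parsing rules in the cell above, the sketch below runs a standalone copy of the same logic on the two price formats visible in the raw data (`"295,-"` and `29.95`) plus a few assumed variants. The helper name `parse_price` and the extra sample values are illustrative assumptions, and the copy relies only on the standard library so it can run outside the notebook.

```python
import math
import re


def parse_price(price):
    """Standalone copy of the clean_price rules (hypothetical helper name)."""
    if price is None or (isinstance(price, float) and math.isnan(price)):
        return None
    price_str = str(price).strip()
    price_str = re.sub(r"[,-]+$", "", price_str)  # "295,-" -> "295"
    if re.match(r"^\d{1,3}(\.\d{3})*,\d{1,2}$", price_str):
        # Thousands dots plus decimal comma
        price_str = price_str.replace(".", "").replace(",", ".")
    elif "," in price_str and "." not in price_str:
        # Decimal comma only
        price_str = price_str.replace(",", ".")
    elif re.match(r"^\d{1,3}(\.\d{3})*$", price_str):
        # Thousands dots only
        price_str = price_str.replace(".", "")
    try:
        return round(float(price_str), 2)
    except ValueError:
        return None


# The first two samples appear in the raw data; the rest are assumed variants.
samples = ["295,-", 29.95, "1.369,-", "1.234,56", "12,5", "op aanvraag", None]
parsed = {repr(s): parse_price(s) for s in samples}
for raw, value in parsed.items():
    print(f"{raw:>15} -> {value}")
```

One caveat of these rules: a value that Excel has already stored as the number `1.369` is converted to the string `"1.369"` and would then be read as 1369 by the thousands-separator branch.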


In [113]:
data['productDiscount'] = data['productDiscount'].str.replace(r'[^\d.]', '', regex=True)
data['productDiscount'] = data['productDiscount'].replace('', '0').astype(float) / 100
data['productDiscount'] = data['productDiscount'].fillna(0)
data['productDiscount'].head(20)

0     0.00
1     0.15
2     0.19
3     0.14
4     0.10
5     0.10
6     0.15
7     0.25
8     0.21
9     0.17
10    0.13
11    0.17
12    0.18
13    0.00
14    0.00
15    0.00
16    0.48
17    0.00
18    0.00
19    0.00
Name: productDiscount, dtype: float64

In [114]:
data['productColor'] = data['productColor'].str.strip()                  # remove leading/trailing whitespace
data['productColor'] = data['productColor'].replace(r'\s+', ' ', regex=True)  # replace multiple spaces/newlines with single space
print(data['productColor'].unique())
print(len(data['productColor'].unique()))

['Groen' 'Grijs - Antraciet' 'Zwart' 'Taupe - Naturel - Bruin' nan
 'Old Teak Greywash' 'Wit - Ecru' 'Grijs - Antraciet, Zwart' 'Zwart, Rood'
 'Grijs - Antraciet, Wit - Ecru' 'Groen, Taupe - Naturel - Bruin'
 'Zwart, Natural Teak' 'Grijs - Antraciet, Natural Teak'
 'Wit - Ecru, Natural Teak' 'Rood' 'Blauw, Grijs - Antraciet' 'Geel'
 'Roze' 'Grijs - Antraciet, Old Teak Greywash'
 'Wit - Ecru, Old Teak Greywash' 'Natural Teak' 'Groen, Natural Teak'
 'Taupe - Naturel - Bruin, Old Teak Greywash' 'Zwart, Old Teak Greywash'
 'Taupe - Naturel - Bruin, Zwart'
 'Grijs - Antraciet, Taupe - Naturel - Bruin, Zwart' 'Blauw'
 'Taupe - Naturel - Bruin, Zwart, Natural Teak' 'Groen, Old Teak Greywash'
 'Oranje, Rood' 'Grijs - Antraciet, Groen, Natural Teak'
 'Grijs - Antraciet, Wit - Ecru, Natural Teak' 'Blauw, Groen'
 'Taupe - Naturel - Bruin, Wit - Ecru, Natural Teak'
 'Grijs - Antraciet, Roze' 'Blauw, Taupe - Naturel - Bruin, Wit - Ecru'
 'Roze, Rood' 'Blauw, Zwart' 'Wit - Ecru, Zwart'
 'Blauw, Old 

In [115]:
data['productCategory'] = data['productCategory'].str.replace(r'\s+', ' ', regex=True).str.strip()
data['productCategory'].head(20)

0     Tuinaccessoires
1         Tuinstoelen
2            Parasols
3            Parasols
4            Parasols
5            Parasols
6     Tuinaccessoires
7     Tuinaccessoires
8           Ligbedden
9            Parasols
10           Parasols
11           Parasols
12           Parasols
13           Parasols
14        Tuinkussens
15        Tuinkussens
16        Tuinkussens
17        Tuinkussens
18        Tuinkussens
19        Tuinkussens
Name: productCategory, dtype: object

In [116]:
data['productBrand'] = data['productBrand'].str.lower().str.strip().str.replace("&amp;", "&")
data['productBrand'].head(20)

0      kees smit collectie
1     hartman tuinmeubelen
2               shadowline
3               shadowline
4               shadowline
5               shadowline
6     hartman tuinmeubelen
7     hartman tuinmeubelen
8     hartman tuinmeubelen
9      kees smit collectie
10     kees smit collectie
11     kees smit collectie
12     kees smit collectie
13                   glatz
14                 madison
15                 madison
16                 madison
17                 madison
18                 madison
19                 madison
Name: productBrand, dtype: object

In [117]:
data.head()

Unnamed: 0,productID,productName,productPrice,productDiscount,productColor,productCategory,productBrand,url,productName_clean
0,1000.0,"Cadeaubon t.w.v. €20,- Kees Smit Tuinmeubelen",0.0,0.0,Groen,Tuinaccessoires,kees smit collectie,https://www.keessmit.nl/cadeaubon-t-w-v-20-kee...,cadeaubon 20 kees smit tuinmeubelen
1,100365.0,Hartman Da Vinci standenstoel,295.0,0.15,Grijs - Antraciet,Tuinstoelen,hartman tuinmeubelen,https://www.keessmit.nl/hartman-da-vinci-stand...,hartman da vinci standenstoel
2,10060.0,Shadowline Parasol beschermhoes Small,29.95,0.19,Grijs - Antraciet,Parasols,shadowline,https://www.keessmit.nl/shadowline-parasol-bes...,shadowline parasol beschermhoes small
3,10061.0,Shadowline Parasol beschermhoes Medium,33.95,0.14,Grijs - Antraciet,Parasols,shadowline,https://www.keessmit.nl/shadowline-parasol-bes...,shadowline parasol beschermhoes medium
4,10062.0,Shadowline Parasol beschermhoes Large,37.95,0.1,Grijs - Antraciet,Parasols,shadowline,https://www.keessmit.nl/shadowline-parasol-bes...,shadowline parasol beschermhoes large


---

# **6.0 - Feature Engineering**

---

## Product Name Comparisons (TF-IDF + Clustering)

In [118]:
vectorizer = TfidfVectorizer()
vectorizer.fit(data['productName_clean'])
X_name = vectorizer.transform(data['productName_clean'])

In [119]:
k_name = 100  # TODO: find a more principled way to choose the number of clusters; 100 is only a rough estimate of the number of distinct product categories.
kmeans_name = KMeans(n_clusters=k_name, random_state=RANDOM_STATE)
data['name_cluster'] = kmeans_name.fit_predict(X_name)
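
The value `k_name = 100` above is marked as a placeholder. One common way to compare candidate values of k is the silhouette score; the sketch below tries a handful of values on a small sample of cleaned names taken from this notebook. The candidate range, the `n_init` setting, and the helper name `silhouette_by_k` are assumptions for illustration, not a tuned recommendation for the full dataset.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# A small sample of cleaned product names that appear in this notebook
names = [
    "shadowline parasol beschermhoes small",
    "shadowline parasol beschermhoes medium",
    "shadowline parasol beschermhoes large",
    "shadowline zweefparasol beschermhoes boog",
    "madison tuinkussen kuip 48x46cm zit",
    "madison kuip tuinkussens hoog 96x45cm",
    "madison tuinkussens kuip 48x46cm zit",
    "il tempo cira lounge tussenmodule 70cm",
    "il tempo cira lounge hoekmodule 90cm",
    "il tempo cira lounge voetenbank 90x90cm",
    "il tempo giulia lounge tuinstoel",
    "hartman da vinci standenstoel",
    "hartman da vinci voetenbank standen",
    "hartman garden voetenbank",
]


def silhouette_by_k(X, candidate_ks, seed=1257):
    """Fit KMeans for each candidate k and return {k: silhouette score}."""
    scores = {}
    for k in candidate_ks:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores


X_sample = TfidfVectorizer().fit_transform(names)
scores = silhouette_by_k(X_sample, range(2, 7))
best_k = max(scores, key=scores.get)
print(scores)
print("best k on this sample:", best_k)
```

On the full dataset, the `sample_size` argument of `silhouette_score` can keep the pairwise-distance computation manageable.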

In [120]:
data.head(100)

Unnamed: 0,productID,productName,productPrice,productDiscount,productColor,productCategory,productBrand,url,productName_clean,name_cluster
0,1000.0,"Cadeaubon t.w.v. €20,- Kees Smit Tuinmeubelen",0.00,0.00,Groen,Tuinaccessoires,kees smit collectie,https://www.keessmit.nl/cadeaubon-t-w-v-20-kee...,cadeaubon 20 kees smit tuinmeubelen,70
1,100365.0,Hartman Da Vinci standenstoel,295.00,0.15,Grijs - Antraciet,Tuinstoelen,hartman tuinmeubelen,https://www.keessmit.nl/hartman-da-vinci-stand...,hartman da vinci standenstoel,9
2,10060.0,Shadowline Parasol beschermhoes Small,29.95,0.19,Grijs - Antraciet,Parasols,shadowline,https://www.keessmit.nl/shadowline-parasol-bes...,shadowline parasol beschermhoes small,9
3,10061.0,Shadowline Parasol beschermhoes Medium,33.95,0.14,Grijs - Antraciet,Parasols,shadowline,https://www.keessmit.nl/shadowline-parasol-bes...,shadowline parasol beschermhoes medium,9
4,10062.0,Shadowline Parasol beschermhoes Large,37.95,0.10,Grijs - Antraciet,Parasols,shadowline,https://www.keessmit.nl/shadowline-parasol-bes...,shadowline parasol beschermhoes large,9
...,...,...,...,...,...,...,...,...,...,...
95,105737.0,ROUGH-S Adirondack schommelbank 110cm,669.95,0.20,Old Teak Greywash,Tuinbanken,rough furniture,https://www.keessmit.nl/rough-s-adirondack-sch...,rough adirondack schommelbank 110cm,9
96,105742.0,ROUGH-O tuinbank 150cm,799.95,0.20,Old Teak Greywash,Tuinbanken,rough furniture,https://www.keessmit.nl/rough-o-tuinbank-150cm...,rough tuinbank 150cm,24
97,105902.0,ROUGH-X 180cm barset 7-delig,0.00,0.00,,Tuinsets,rough furniture,https://www.keessmit.nl/rough-x-180cm-barset-7...,rough 180cm barset delig,60
98,105911.0,ROUGH-S/ROUGH-X 180cm dining tuinset 5-delig s...,1369.00,0.26,,Tuinsets,rough furniture,https://www.keessmit.nl/rough-s-rough-x-180cm-...,rough rough 180cm dining tuinset delig stapelbaar,78


In [121]:
cluster_products = data[data['name_cluster'] == 3]
cluster_products['productName_clean']

21        madison tuinkussen kuip 48x46cm zit
22      madison kuip tuinkussens hoog 96x45cm
391     madison kuip tuinkussens hoog 96x45cm
392     madison kuip tuinkussens hoog 96x45cm
393     madison kuip tuinkussens hoog 96x45cm
394     madison kuip tuinkussens hoog 96x45cm
395     madison kuip tuinkussens hoog 96x45cm
416      madison tuinkussens kuip 48x46cm zit
417       madison tuinkussen kuip 48x46cm zit
3843      madison tuinkussen kuip 48x46cm zit
3844      madison tuinkussen kuip 48x46cm zit
3845      madison tuinkussen kuip 48x46cm zit
3846      madison tuinkussen kuip 48x46cm zit
3847      madison tuinkussen kuip 48x46cm zit
Name: productName_clean, dtype: object

In [122]:
cluster_counts = data['name_cluster'].value_counts().sort_index()
cluster_table = cluster_counts.reset_index()
cluster_table.columns = ['cluster_id', 'num_products']
print(cluster_table)


    cluster_id  num_products
0            0            74
1            1            34
2            2            18
3            3            14
4            4            25
..         ...           ...
95          95            54
96          96            39
97          97            15
98          98            23
99          99            38

[100 rows x 2 columns]


## Product Name Comparisons (Word Count + Clustering)

In [123]:
count_vectorizer = CountVectorizer()
X_count = count_vectorizer.fit_transform(data['productName_clean'])

In [124]:

k_count = 100  
kmeans_count = KMeans(n_clusters=k_count, random_state=RANDOM_STATE)
data['name_cluster_count'] = kmeans_count.fit_predict(X_count)

In [125]:
cluster_counts = data['name_cluster_count'].value_counts().sort_index()
cluster_table = cluster_counts.reset_index()
cluster_table.columns = ['cluster_id', 'num_products']
print(cluster_table)


    cluster_id  num_products
0            0           120
1            1            57
2            2           120
3            3            75
4            4            41
..         ...           ...
95          95            16
96          96            25
97          97            10
98          98            13
99          99            15

[100 rows x 2 columns]


## Product Name Comparisons (TF-IDF + Word Count + Clustering)

In [126]:
count_vectorizer = CountVectorizer()
X_count2 = count_vectorizer.fit_transform(data['productName_clean'])
vectorizer = TfidfVectorizer()
vectorizer.fit(data['productName_clean'])
X_name2 = vectorizer.transform(data['productName_clean'])
X_combined = hstack([X_name2, X_count2])


In [127]:
k_combined = 100
kmeans_combined = KMeans(n_clusters=k_combined, random_state=RANDOM_STATE)
data['name_cluster_combined'] = kmeans_combined.fit_predict(X_combined)

In [128]:
cluster_counts = data['name_cluster_combined'].value_counts().sort_index()
cluster_table = cluster_counts.reset_index()
cluster_table.columns = ['cluster_id', 'num_products']
print(cluster_table)


    cluster_id  num_products
0            0            97
1            1           105
2            2            93
3            3            33
4            4            96
..         ...           ...
95          95            10
96          96            21
97          97            26
98          98            30
99          99            28

[100 rows x 2 columns]


In [129]:
cluster_products = data[data['name_cluster_combined'] == 95]
cluster_products['productName_clean']

1074    il tempo di sotto lounge voetenbank 66x66cm
1075    il tempo del capo lounge voetenbank 66x66cm
2292         il tempo cira lounge tussenmodule 70cm
2293         il tempo cira lounge tussenmodule 70cm
2294           il tempo cira lounge hoekmodule 90cm
2295           il tempo cira lounge hoekmodule 90cm
2296        il tempo cira lounge voetenbank 90x90cm
2297        il tempo cira lounge voetenbank 90x90cm
2298               il tempo giulia lounge tuinstoel
3106    il tempo giulia lounge tuintafel 100cm 40cm
Name: productName_clean, dtype: object

In [130]:
data.head(20)

Unnamed: 0,productID,productName,productPrice,productDiscount,productColor,productCategory,productBrand,url,productName_clean,name_cluster,name_cluster_count,name_cluster_combined
0,1000.0,"Cadeaubon t.w.v. €20,- Kees Smit Tuinmeubelen",0.0,0.0,Groen,Tuinaccessoires,kees smit collectie,https://www.keessmit.nl/cadeaubon-t-w-v-20-kee...,cadeaubon 20 kees smit tuinmeubelen,70,51,82
1,100365.0,Hartman Da Vinci standenstoel,295.0,0.15,Grijs - Antraciet,Tuinstoelen,hartman tuinmeubelen,https://www.keessmit.nl/hartman-da-vinci-stand...,hartman da vinci standenstoel,9,13,23
2,10060.0,Shadowline Parasol beschermhoes Small,29.95,0.19,Grijs - Antraciet,Parasols,shadowline,https://www.keessmit.nl/shadowline-parasol-bes...,shadowline parasol beschermhoes small,9,6,6
3,10061.0,Shadowline Parasol beschermhoes Medium,33.95,0.14,Grijs - Antraciet,Parasols,shadowline,https://www.keessmit.nl/shadowline-parasol-bes...,shadowline parasol beschermhoes medium,9,6,6
4,10062.0,Shadowline Parasol beschermhoes Large,37.95,0.1,Grijs - Antraciet,Parasols,shadowline,https://www.keessmit.nl/shadowline-parasol-bes...,shadowline parasol beschermhoes large,9,6,6
5,10063.0,Shadowline Zweefparasol beschermhoes Boog,44.95,0.1,Grijs - Antraciet,Parasols,shadowline,https://www.keessmit.nl/shadowline-zweefparaso...,shadowline zweefparasol beschermhoes boog,9,18,26
6,100819.0,Hartman Da Vinci voetenbank 2 standen,295.0,0.15,Grijs - Antraciet,Tuinaccessoires,hartman tuinmeubelen,https://www.keessmit.nl/hartman-da-vinci-voete...,hartman da vinci voetenbank standen,53,13,23
7,100822.0,Hartman Garden voetenbank,149.95,0.25,Grijs - Antraciet,Tuinaccessoires,hartman tuinmeubelen,https://www.keessmit.nl/hartman-garden-voetenb...,hartman garden voetenbank,53,13,23
8,100825.0,Hartman Da Vinci ligbed met wiel,54.0,0.21,Grijs - Antraciet,Ligbedden,hartman tuinmeubelen,https://www.keessmit.nl/hartman-da-vinci-ligbe...,hartman da vinci ligbed met wiel,39,67,23
9,100848.0,Parasolvoet 25kg beton,49.0,0.17,Grijs - Antraciet,Parasols,kees smit collectie,https://www.keessmit.nl/parasolvoet-25kg-beton...,parasolvoet 25kg beton,9,13,23


## Clustering for Anomaly Detection

In [131]:
features = [
    'productPrice',
    'productDiscount',
    'name_cluster',
    'name_cluster_count',
    'name_cluster_combined'
]

X = data[features]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
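
KMeans labels are arbitrary integers, so scaling `name_cluster` and its sibling columns as numeric features treats cluster 9 as closer to cluster 13 than to cluster 70, even though the IDs carry no order. A possible alternative, sketched below on a four-row toy frame built from values shown in this notebook, is to one-hot encode the label column and scale only the genuinely numeric columns. This illustrates the idea under those assumptions; it is not the method used to produce the results here.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame: four rows with values copied from the notebook output
toy = pd.DataFrame({
    "productPrice": [295.0, 29.95, 33.95, 799.95],
    "productDiscount": [0.15, 0.19, 0.14, 0.20],
    "name_cluster": [9, 9, 9, 24],
})

# Scale only the columns that are real quantities
numeric_scaled = StandardScaler().fit_transform(toy[["productPrice", "productDiscount"]])

# Encode the cluster label as one indicator column per cluster ID
cluster_onehot = pd.get_dummies(toy["name_cluster"], prefix="name_cluster").astype(float)

X_alt = np.hstack([numeric_scaled, cluster_onehot.to_numpy()])
print(cluster_onehot.columns.tolist())
print(X_alt.shape)
```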


In [132]:
k_final = 100
kmeans_final = KMeans(n_clusters=k_final, random_state=RANDOM_STATE)
data['final_cluster'] = kmeans_final.fit_predict(X_scaled)


In [133]:

distances = kmeans_final.transform(X_scaled)
data['anomaly_score'] = np.min(distances, axis=1)


In [134]:
threshold = data['anomaly_score'].mean() + 3 * data['anomaly_score'].std()
data['is_anomaly'] = data['anomaly_score'] > threshold
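
To make the scoring rule concrete, the sketch below reproduces the two steps above with plain NumPy on synthetic 2-D points. The anomaly score is the distance to the nearest centroid, which is what `kmeans_final.transform(...)` followed by a row-wise minimum yields, and a point is flagged when its score exceeds the mean plus three standard deviations. The centroids, the points, and the planted outlier are made-up values; `ddof=1` is passed so the standard deviation matches the pandas default.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed centroids, as if a KMeans model had already been fitted
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])

# 15 ordinary points around each centroid, plus one planted outlier
normal_points = np.vstack([
    centroids[0] + rng.normal(scale=0.3, size=(15, 2)),
    centroids[1] + rng.normal(scale=0.3, size=(15, 2)),
])
outlier = np.array([[12.0, 0.0]])
points = np.vstack([normal_points, outlier])

# Distance from every point to every centroid, shape (n_points, n_centroids)
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)

# Anomaly score: distance to the nearest centroid
anomaly_score = distances.min(axis=1)

# Flag scores more than three standard deviations above the mean
threshold = anomaly_score.mean() + 3 * anomaly_score.std(ddof=1)
is_anomaly = anomaly_score > threshold

print("threshold:", round(float(threshold), 3))
print("flagged indices:", np.flatnonzero(is_anomaly).tolist())
```

Because a single extreme score inflates both the mean and the standard deviation, this rule tends to be conservative when the sample is small.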


In [135]:
data.head(100)

Unnamed: 0,productID,productName,productPrice,productDiscount,productColor,productCategory,productBrand,url,productName_clean,name_cluster,name_cluster_count,name_cluster_combined,final_cluster,anomaly_score,is_anomaly
0,1000.0,"Cadeaubon t.w.v. €20,- Kees Smit Tuinmeubelen",0.00,0.00,Groen,Tuinaccessoires,kees smit collectie,https://www.keessmit.nl/cadeaubon-t-w-v-20-kee...,cadeaubon 20 kees smit tuinmeubelen,70,51,82,47,0.261398,False
1,100365.0,Hartman Da Vinci standenstoel,295.00,0.15,Grijs - Antraciet,Tuinstoelen,hartman tuinmeubelen,https://www.keessmit.nl/hartman-da-vinci-stand...,hartman da vinci standenstoel,9,13,23,41,0.577274,False
2,10060.0,Shadowline Parasol beschermhoes Small,29.95,0.19,Grijs - Antraciet,Parasols,shadowline,https://www.keessmit.nl/shadowline-parasol-bes...,shadowline parasol beschermhoes small,9,6,6,0,0.834130,False
3,10061.0,Shadowline Parasol beschermhoes Medium,33.95,0.14,Grijs - Antraciet,Parasols,shadowline,https://www.keessmit.nl/shadowline-parasol-bes...,shadowline parasol beschermhoes medium,9,6,6,0,1.084326,False
4,10062.0,Shadowline Parasol beschermhoes Large,37.95,0.10,Grijs - Antraciet,Parasols,shadowline,https://www.keessmit.nl/shadowline-parasol-bes...,shadowline parasol beschermhoes large,9,6,6,25,0.916995,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,105737.0,ROUGH-S Adirondack schommelbank 110cm,669.95,0.20,Old Teak Greywash,Tuinbanken,rough furniture,https://www.keessmit.nl/rough-s-adirondack-sch...,rough adirondack schommelbank 110cm,9,55,41,82,0.614271,False
96,105742.0,ROUGH-O tuinbank 150cm,799.95,0.20,Old Teak Greywash,Tuinbanken,rough furniture,https://www.keessmit.nl/rough-o-tuinbank-150cm...,rough tuinbank 150cm,24,55,41,82,0.714381,False
97,105902.0,ROUGH-X 180cm barset 7-delig,0.00,0.00,,Tuinsets,rough furniture,https://www.keessmit.nl/rough-x-180cm-barset-7...,rough 180cm barset delig,60,68,67,39,0.624515,False
98,105911.0,ROUGH-S/ROUGH-X 180cm dining tuinset 5-delig s...,1369.00,0.26,,Tuinsets,rough furniture,https://www.keessmit.nl/rough-s-rough-x-180cm-...,rough rough 180cm dining tuinset delig stapelbaar,78,59,32,54,0.873935,False


In [136]:
anomalies = data[data['is_anomaly']]
print(len(anomalies))

34


In [137]:
output_path = "./output/anomalies.xlsx"

with pd.ExcelWriter(output_path, engine="openpyxl") as writer:
    anomalies.to_excel(writer, sheet_name="Anomalies", index=False)