# 1. Anomaly Detection with Isolation Forest

In [1]:
import pandas as pd

In [3]:
FINAL_DATA_PATH = "C:/Project/UK store analysis/data/02_processed/feature_engineered_data.parquet"
df = pd.read_parquet(FINAL_DATA_PATH)
df.head(3)

Unnamed: 0,supermarket,prices,prices_unit,unit,names,date,category,own_brand,normalized_name,canonical_name,...,market_std_price,market_min_price,market_max_price,price_vs_market_avg,price_rank,is_cheapest,day_of_week,day_of_month,week_of_year,month
0,ASDA,1.29,3.9,l,,2024-01-11,food_cupboard,False,,,...,,1.29,1.29,0.0,1.0,1,3,11,2,1
1,ASDA,,,,,2024-01-12,fresh_food,False,,,...,,,,,,0,4,12,2,1
2,ASDA,,,,,2024-01-13,fresh_food,False,,,...,,,,,,0,5,13,2,1


In [None]:
from sklearn.ensemble import IsolationForest

# Use a subset of features relevant for detecting unusual pricing
anomaly_features = [
    'prices',
    'price_diff_1d',
    'price_rol_std_7d',
    'price_vs_market_avg'
]

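# Fit only on rows where every selected feature is present.
# .copy() avoids pandas' SettingWithCopyWarning when a column is added below.
df_anomaly = df[anomaly_features].dropna().copy()

iso_forest = IsolationForest(n_estimators=100, contamination=0.01, random_state=123)
iso_forest.fit(df_anomaly)

# Despite the column name, predict() returns labels rather than scores:
# 1 for inliers and -1 for anomalies.
df_anomaly['anomaly_score'] = iso_forest.predict(df_anomaly[anomaly_features])

# Label the full dataset. Rows with missing features are zero-filled here,
# so they are scored on values the model did not see during fitting.
df['anomaly'] = iso_forest.predict(df[anomaly_features].fillna(0))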

anomalies = df[df['anomaly'] == -1]
print(f"Detected {len(anomalies)} potential anomalies.")
print("\nExamples of detected price anomalies:")
print(anomalies[['supermarket', 'canonical_name', 'date', 'prices', 'price_lag_1d', 'price_diff_1d']].head(10))


Detected 95245 potential anomalies.

Examples of detected price anomalies:
      supermarket  canonical_name       date  prices  price_lag_1d  \
26913        ASDA  3 tier steamer 2024-01-09    23.0          12.0   
26914        ASDA  3 tier steamer 2024-01-10    12.0          23.0   
26916        ASDA  3 tier steamer 2024-01-11    12.0          23.0   
26918        ASDA  3 tier steamer 2024-01-12    12.0          23.0   
26920        ASDA  3 tier steamer 2024-01-13     8.0          23.0   
26921        ASDA  3 tier steamer 2024-01-13    23.0           8.0   
26922        ASDA  3 tier steamer 2024-01-14     8.0          23.0   
26923        ASDA  3 tier steamer 2024-01-14    23.0           8.0   
26924        ASDA  3 tier steamer 2024-01-15     8.0          23.0   
26925        ASDA  3 tier steamer 2024-01-15    23.0           8.0   

       price_diff_1d  
26913           11.0  
26914          -11.0  
26916          -11.0  
26918          -11.0  
26920          -15.0  
26921           

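Detected Anomalies: The model flags 95,245 potential anomalies, roughly 1% of the dataset. That share follows from contamination=0.01, which sets the decision threshold so that about 1% of the training rows are labelled -1. The count therefore reflects the chosen threshold, not an independent estimate of how many prices are truly anomalous.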
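To illustrate how the contamination setting drives the flagged share, here is a minimal sketch on synthetic data. The array `X` is a random stand-in for the four pricing features, not the project data; the model parameters mirror the cell above.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the four pricing features (assumption: not the real data)
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))

model = IsolationForest(n_estimators=100, contamination=0.01, random_state=123)
model.fit(X)

labels = model.predict(X)            # 1 = inlier, -1 = flagged as anomaly
scores = model.decision_function(X)  # continuous score; negative values are flagged

flagged_share = float((labels == -1).mean())
print(f"Flagged share of training rows: {flagged_share:.2%}")
```

The continuous output of `decision_function` can be used to rank the flagged rows by severity, which the -1/1 labels alone do not allow.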

Interpreting the Example: "3 tier steamer" at ASDA
On 2024-01-09 the price is £23.00, while the previous day's price (price_lag_1d) was £12.00. That is an £11.00 increase (price_diff_1d = 11.0), close to double the earlier price, so the model flags it as an anomaly.

On 2024-01-10 the price drops back to £12.00, a change of the same size in the opposite direction (price_diff_1d = -11.0). The model flags this too.
The pattern repeats: the price flips between £23.00 and £12.00, and from 2024-01-13 between £23.00 and £8.00. From 2024-01-13 onward each date appears twice in the output, once at each price, so the two prices are recorded on the same day and not only on alternating days.
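One way to find other products with the same back-and-forth behaviour is to count how often the direction of the price change reverses. The sketch below uses a small synthetic frame with the notebook's column names; `count_sign_flips` is a hypothetical helper, and it assumes one row per product per day, which the real data does not always satisfy.

```python
import numpy as np
import pandas as pd

# Synthetic example rows (assumption: illustrative values, not the real dataset)
toy = pd.DataFrame({
    "supermarket": ["ASDA"] * 12,
    "canonical_name": ["3 tier steamer"] * 6 + ["kettle"] * 6,
    "date": list(pd.date_range("2024-01-09", periods=6)) * 2,
    "prices": [23.0, 12.0, 23.0, 12.0, 23.0, 12.0,
               20.0, 20.0, 20.0, 18.0, 18.0, 18.0],
})

def count_sign_flips(prices: pd.Series) -> int:
    """Count reversals in the direction of consecutive non-zero price changes."""
    diffs = prices.diff().dropna()
    signs = np.sign(diffs[diffs != 0])
    # A flip occurs when a change has the opposite sign to the previous change
    return int((signs.shift() * signs < 0).sum())

flips = (
    toy.sort_values(["supermarket", "canonical_name", "date"])
       .groupby(["supermarket", "canonical_name"])["prices"]
       .agg(count_sign_flips)
)
print(flips.sort_values(ascending=False))
```

A product with a single permanent price cut scores zero, while one that alternates between two prices scores close to the number of observed days.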

Possible explanations for this pattern:

Promotional Pricing: This could be a "deal of the day" type of promotion that is being turned on and off.

A/B Price Testing: The retailer might be actively testing price sensitivity by showing different prices to different user groups or at different times.

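Data Scraping Error: The scraper may be hitting two different versions of the product page that carry different prices, which would produce this flip-flop. The duplicate rows for the same date at different prices (for example, 2024-01-13 at both £8.00 and £23.00) are consistent with this explanation.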
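The scraping explanation can be checked by looking for days on which the same product at the same supermarket has more than one recorded price. A minimal sketch, using a hand-built frame that copies a few rows from the printed output above (the grouping key is an assumption about what identifies a product listing):

```python
import pandas as pd

# Rows copied from the printed "3 tier steamer" output above
toy = pd.DataFrame({
    "supermarket": ["ASDA"] * 5,
    "canonical_name": ["3 tier steamer"] * 5,
    "date": pd.to_datetime(
        ["2024-01-12", "2024-01-13", "2024-01-13", "2024-01-14", "2024-01-14"]
    ),
    "prices": [12.0, 8.0, 23.0, 8.0, 23.0],
})

# Assumed key for "one listing on one day"
key = ["supermarket", "canonical_name", "date"]

# Number of distinct prices recorded for each listing on each day
prices_per_day = toy.groupby(key)["prices"].nunique()

# Days with more than one distinct price point to duplicate or conflicting records
conflicting_days = prices_per_day[prices_per_day > 1]
print(conflicting_days)
```

If many flagged products show conflicting same-day prices, deduplicating the data before computing price_lag_1d and price_diff_1d would likely remove a large share of these anomalies.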