# FIT5196 Assessment 2


Table of Contents

1. Data Cleansing

    1.1 Import Data

    1.2 Dirty Data

      - 1.2.1 Dirty Data EDA

        - 1.2.1.1 Individual Column EDA

        - 1.2.1.2 Cross-Column EDA

        - 1.2.1.3 Dirty Data Fix
    
    1.3 Outlier Data
    
    1.4 Missing Data
    



## Task 1. Data Cleansing

### 1.1 Import Data

In [201]:
from google.colab import drive

drive.mount('/content/drive')
base = "/content/drive/MyDrive/FIT5196Assignment2/" # for colab

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [202]:
# Begin here if running locally

# base = ""

In [203]:
# --- Library, data and helper functions preparation ---
# --- Import libraries ---
import os
import pandas as pd
import numpy as np
import ast
from math import radians, cos, sin, asin, sqrt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import r2_score
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk
nltk.download('vader_lexicon')

# --- Load data ---
# Warehouse data
warehouse_data = pd.read_csv(base + 'warehouses.csv')

# Dirty data
dirty_data = pd.read_csv(base + 'Group_035_dirty_data.csv')

# Missing data
missing_data = pd.read_csv(base + 'Group_035_missing_data.csv')

# Outlier data
outlier_data = pd.read_csv(base + 'Group_035_outlier_data.csv')

# --- Prepare Sentiment analyzer ---
sia = SentimentIntensityAnalyzer()

# --- Prepare Haversine helper ---
def haversine_dist(lat1, lon1, lat2, lon2):
    """
    Calculate the Haversine distance between two points on the Earth's surface.
    """
    R = 6378  # Earth radius in KM
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    c = 2 * np.arcsin(np.sqrt(a))
    return R * c

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


### 1.2 Dirty Data

#### 1.2.1 Dirty Data EDA

In [204]:
# Check all the data type
print(dirty_data.dtypes)

order_id                          object
customer_id                       object
date                              object
nearest_warehouse                 object
shopping_cart                     object
order_price                        int64
delivery_charges                 float64
customer_lat                     float64
customer_long                    float64
coupon_discount                    int64
order_total                      float64
season                            object
is_expedited_delivery               bool
distance_to_nearest_warehouse    float64
latest_customer_review            object
is_happy_customer                   bool
dtype: object


**Data Type Inspection - Interpretation:**

The data type inspection confirms that each column in the dataset has a generally appropriate data type. Identifiers such as order_id and customer_id are stored as object, indicating alphanumeric strings. Numerical attributes such as order_price, delivery_charges, order_total, and the geographic attributes customer_lat, customer_long, and distance_to_nearest_warehouse are correctly stored as either integer or float. Boolean attributes such as is_expedited_delivery and is_happy_customer are properly represented as bool. However, date and shopping_cart are stored as object, so date needs conversion to datetime and shopping_cart needs parsing.


**Justification:**

Verifying data types is a crucial first step in EDA because incorrect data types may lead to misinterpretation or computational errors during validation and cleaning.

In [205]:
# Summary of dirty data
dirty_data.describe(include = "all")

Unnamed: 0,order_id,customer_id,date,nearest_warehouse,shopping_cart,order_price,delivery_charges,customer_lat,customer_long,coupon_discount,order_total,season,is_expedited_delivery,distance_to_nearest_warehouse,latest_customer_review,is_happy_customer
count,500,500,500,500,500,500.0,500.0,500.0,500.0,500.0,500.0,500,500,500.0,499,500
unique,500,493,304,6,460,,,,,,,8,2,,499,2
top,ORD089659,ID2621587173,2019-05-25,Thompson,"[('Lucent 330S', 1), ('iAssist Line', 1)]",,,,,,,Summer,True,,great value hard to beat for the price.,True
freq,1,2,6,201,4,,,,,,,128,258,,1,372
mean,,,,,,13864.58,78.03112,-27.942558,135.095643,10.74,12598.74034,,,1.06467,,
std,,,,,,7781.162606,14.400599,41.353147,41.351948,8.449567,6961.477582,,,0.492255,,
min,,,,,,1375.0,46.4,-37.831769,-37.827219,0.0,1450.55,,,0.0319,,
25%,,,,,,7900.0,66.695,-37.81862,144.94883,5.0,7405.7,,,0.72735,,
50%,,,,,,12260.0,76.82,-37.812261,144.962138,10.0,11111.125,,,1.02575,,
75%,,,,,,19260.0,85.8025,-37.805632,144.978927,15.0,16934.4225,,,1.3592,,


**Descriptive Summary - Interpretation:**

The descriptive summary indicates that the dataset contains 500 records, with complete counts in every column except latest_customer_review, which has 499. The identifiers order_id and customer_id show 500 and 493 unique values respectively, suggesting that some customers placed multiple orders. The date column holds 304 unique values spanning several seasons, with Summer the most frequent season at 128 orders. There are 8 distinct values in season and 6 in nearest_warehouse, more than the 4 seasons and 3 warehouses expected, which suggests potential errors in both columns. For numerical columns, order_price has a mean of approximately 13,864 AUD and ranges from 1,375 to 39,330 AUD, indicating a wide variety of order quantities and item prices. Delivery charges average around 78 AUD, consistent with varying distances and expedited options. Most customer_lat and customer_long values sit around Melbourne's coordinates, but the similarity in the minimum (\~ -37.8) and maximum (\~ 145.0) of both attributes suggests that some coordinates are swapped. The order_total is generally lower than order_price, reflecting the net effect of discounts and delivery charges. Of the Boolean columns, is_expedited_delivery is roughly balanced with 258 expedited deliveries, whereas is_happy_customer leans towards True with 372 happy customers.


**Justification:**

Using describe() provides a simple yet efficient high-level assessment of both categorical and numerical attributes. This summary helps verify completeness, detect outliers, and identify naming inconsistencies or duplication. It also provides a clear understanding of the dataset's structure, supporting targeted anomaly detection in later steps, and provides evidence of data integrity across multiple attribute types.

##### 1.2.1.1 Individual Column EDA

In [206]:
# date
dirty_data["date"] = pd.to_datetime(dirty_data["date"], errors="coerce")

# Invalid date
print(f"Invalid date: {dirty_data['date'].isna().sum()}")
invalid_date = dirty_data[dirty_data["date"].isna()]
print(invalid_date[["order_id", "customer_id", "date", "season"]].to_string(index=False))

# Date distribution
valid_dates = dirty_data["date"].dropna()
print(f"Min date: {valid_dates.min()}")
print(f"Max date: {valid_dates.max()}")
print(f"Number of valid dates: {valid_dates.count()}")
print(f"Number of unique dates: {valid_dates.nunique()}")

Invalid date: 27
 order_id  customer_id date season
ORD164387 ID0289597227  NaT Summer
ORD066446 ID0145235264  NaT Winter
ORD312565 ID0638050574  NaT Summer
ORD181051 ID0709970691  NaT Summer
ORD046408 ID5402876538  NaT Spring
ORD219265 ID0055722470  NaT Winter
ORD006455 ID0634777174  NaT Spring
ORD084861 ID3094966833  NaT Winter
ORD438655 ID2705184152  NaT Summer
ORD234563 ID2383211199  NaT Spring
ORD199817 ID0650275823  NaT Summer
ORD113549 ID0634780047  NaT Autumn
ORD489756 ID0846548135  NaT Spring
ORD273300 ID0052599838  NaT Spring
ORD194653 ID0575428932  NaT Summer
ORD480775 ID4655129040  NaT Spring
ORD160619 ID0746917821  NaT Winter
ORD402436 ID2621587173  NaT Autumn
ORD311888 ID1463620717  NaT Autumn
ORD491911 ID2237521759  NaT Winter
ORD265708 ID1889073821  NaT Summer
ORD060082 ID0576834725  NaT Summer
ORD469475 ID0767665017  NaT Summer
ORD461231 ID0126934555  NaT Winter
ORD047863 ID0638044384  NaT Autumn
ORD499923 ID6197211200  NaT Spring
ORD036565 ID0616939377  NaT Spring
Min

**Date Formatting - Interpretation:**

The date column was converted from object to datetime format using pd.to_datetime(), where errors are coerced by automatically assigning NaT to invalid or unrecognisable entries. The result shows 27 missing date values, but these rows have valid season values, suggesting that the order period can be inferred later through seasonal mapping. Among valid entries, the date range spans from 2019-01-01 to 2019-12-30, covering an entire calendar year. There are 473 valid dates with 278 unique dates, indicating multiple orders occurred on the same dates.

**Justification:**

Validating and converting date formats ensures temporal consistency across the dataset. The use of pd.to_datetime(..., errors="coerce") standardises all date entries while safely identifying anomalies. Detecting 27 invalid dates highlights missing temporal information that must be recovered using related attributes such as season. The date range confirms that the dataset represents a continuous operational year, which supports seasonal trend analysis. Confirming that no row contains more than one error requires cross-column EDA with season.
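The coercion behaviour described above can be illustrated without pandas. The sketch below is a plain-Python analogue and not the notebook's own code: `parse_or_none` is a hypothetical helper, the ISO `%Y-%m-%d` format is an assumption, and the invalid sample string is made up for illustration.

```python
from datetime import datetime

def parse_or_none(text, fmt="%Y-%m-%d"):
    """Return a date for well-formed input, or None (the analogue of NaT)."""
    try:
        return datetime.strptime(text, fmt).date()
    except (TypeError, ValueError):
        return None

# "2019-05-25" appears in the data summary; "2019-25-05" is a made-up invalid value.
samples = ["2019-05-25", "2019-25-05"]
parsed = [parse_or_none(s) for s in samples]

# Count the entries that could not be parsed, mirroring isna().sum()
invalid_count = sum(p is None for p in parsed)
```

Returning a sentinel instead of raising keeps every row in place, so the invalid entries can be listed and repaired later.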

In [207]:
# nearest_warehouse
# Missing nearest_warehouse values
invalid_warehouse_name = dirty_data[dirty_data["nearest_warehouse"].isna()]
print("Missing nearest_warehouse found:", len(invalid_warehouse_name))

# Unique values
print(dirty_data["nearest_warehouse"].unique())

# Invalid nearest_warehouse value
invalid_warehouses = dirty_data[~dirty_data["nearest_warehouse"].isin(warehouse_data["names"])]
print("Number of invalid nearest warehouse names:", len(invalid_warehouses))
print(dirty_data["nearest_warehouse"].value_counts(dropna=False))
print(invalid_warehouses[["order_id", "nearest_warehouse"]].to_string(index=False))

Missing nearest_warehouse found: 0
['Thompson' 'Nickolson' 'Bakers' 'thompson' 'nickolson' 'bakers']
Number of invalid nearest warehouse names: 20
nearest_warehouse
Thompson     201
Nickolson    180
Bakers        99
thompson       9
nickolson      7
bakers         4
Name: count, dtype: int64
 order_id nearest_warehouse
ORD256861          thompson
ORD014442          thompson
ORD052629         nickolson
ORD166717          thompson
ORD461915         nickolson
ORD461426            bakers
ORD015774         nickolson
ORD169718         nickolson
ORD120949            bakers
ORD138742            bakers
ORD025929            bakers
ORD393258          thompson
ORD201338         nickolson
ORD224300         nickolson
ORD016571         nickolson
ORD471229          thompson
ORD164906          thompson
ORD118183          thompson
ORD300512          thompson
ORD099672          thompson


**Nearest_warehouse column - Interpretation:**

The nearest_warehouse attribute shows a total of six distinct values, but it is confirmed that there are only three valid warehouses (Thompson, Nickolson, Bakers) from the given warehouses.csv dataset. This reveals that there are 20 records with inconsistent naming. The frequency confirms that Thompson is the most common warehouse, followed by Nickolson and Bakers.

**Justification:**

Warehouse names are categorical attributes that should match the predefined list in the warehouses.csv reference file. Ensuring exact spelling and consistent capitalisation is necessary because these names are used to compute distance_to_nearest_warehouse and to assess logistics accuracy. Confirming that no row contains more than one error requires cross-column EDA with distance_to_nearest_warehouse, customer_lat, and customer_long.
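As a minimal sketch of matching names against the reference list, the plain-Python snippet below flags names that are not an exact match and maps them back to their canonical spelling by ignoring case. `canonical_warehouse` is a hypothetical helper, and the three valid names are hard-coded here instead of being read from warehouses.csv.

```python
VALID_WAREHOUSES = ("Thompson", "Nickolson", "Bakers")

def canonical_warehouse(name):
    """Map a name to its canonical spelling, ignoring case; None if unknown."""
    lookup = {w.lower(): w for w in VALID_WAREHOUSES}
    return lookup.get(str(name).strip().lower())

observed = ["Thompson", "thompson", "nickolson", "bakers"]

# Exact-match check, analogous to ~Series.isin(valid_names)
needs_fix = [n for n in observed if n not in VALID_WAREHOUSES]

# Case-insensitive repair back to the canonical spelling
fixed = [canonical_warehouse(n) for n in observed]
```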


In [208]:
# shopping_cart (item ordered)
# Extract the necessary columns
shopping_cart_check = dirty_data[["order_id", "shopping_cart", "order_total"]].copy()

# Parse the shopping_cart value to get the item ordered
shopping_cart_check["shopping_cart_parsed"] = shopping_cart_check["shopping_cart"].apply(ast.literal_eval)
branded_items = [item for cart in shopping_cart_check["shopping_cart_parsed"] for (item, qty) in cart]

# Invalid value
invalid_cart = shopping_cart_check[shopping_cart_check["shopping_cart"].isna() | shopping_cart_check["shopping_cart"].eq("[]")]
print("Number of invalid values:", len(invalid_cart))
invalid_items = [item for item in branded_items if pd.isna(item) or item is None]
print("Number of invalid items:", len(invalid_items))
invalid_duplicates = shopping_cart_check[shopping_cart_check["shopping_cart_parsed"].apply(lambda cart: len({item for item, qty in cart}) != len(cart))]
print("Number of duplicated items:", len(invalid_duplicates))

# Count and value of unique branded items
unique_branded_items = pd.Series(branded_items).unique()
print("Number of unique branded items:", len(unique_branded_items))
print("Unique branded items:", unique_branded_items)

# Frequency distribution
print(pd.Series(branded_items).value_counts(dropna=False))

Number of invalid values: 0
Number of invalid items: 0
Number of duplicated items: 0
Number of unique branded items: 10
Unique branded items: ['iAssist Line' 'Alcon 10' 'Universe Note' 'pearTV' 'Toshika 750'
 'Candle Inferno' 'Thunder line' 'iStream' 'Olivia x460' 'Lucent 330S']
iAssist Line      169
Toshika 750       163
Lucent 330S       154
Alcon 10          153
Thunder line      151
pearTV            147
Candle Inferno    147
Olivia x460       146
iStream           141
Universe Note     134
Name: count, dtype: int64


**Shopping_cart column - Interpretation:**  
The shopping_cart attribute was parsed from its string representation into Python list format using ast.literal_eval(), allowing each tuple to be separated into item names and quantities. The result shows that there are no invalid or missing cart entries and a total of 10 unique branded items were identified including iAssist Line, Alcon 10, Universe Note, pearTV, Toshika 750, Candle Inferno, Thunder line, iStream, Olivia x460, and Lucent 330S. This matches the specified criteria for the number of unique items in shopping_cart column.

**Justification:**

Parsing and validating the shopping_cart attribute is essential as it directly influences the accuracy of order_price and order_total. The absence of invalid items, together with consistent naming and the correct number of branded items, confirms that the shopping_cart entries adhere to the business rules. Confirming that this column contains no errors requires cross-column EDA with order_price, delivery_charges, coupon_discount, and order_total.
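The parsing and duplicate check can be sketched on a single cart string. The example uses the most frequent cart from the descriptive summary; the variable names are illustrative only.

```python
import ast
from collections import Counter

# Most frequent cart value reported by describe()
raw_cart = "[('Lucent 330S', 1), ('iAssist Line', 1)]"

# literal_eval safely turns the string into a list of (item, quantity) tuples
cart = ast.literal_eval(raw_cart)

items = [item for item, qty in cart]

# A cart is suspicious if the same item appears in more than one tuple
has_duplicates = len(set(items)) != len(items)

# Quantities should be positive integers
all_positive = all(isinstance(qty, int) and qty > 0 for _, qty in cart)

# Per-item frequency, the single-cart analogue of value_counts()
item_counts = Counter(items)
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for strings read from a file.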

In [209]:
# Coordinates: customer_lat, customer_long
invalid_coords = dirty_data[dirty_data["customer_lat"].isna() |
                            dirty_data["customer_long"].isna() |
                            ~dirty_data["customer_lat"].between(-90, 90) |
                            ~dirty_data["customer_long"].between(-180, 180)]
print("Invalid coordinates found:", len(invalid_coords))
print(invalid_coords[["order_id", "customer_id", "customer_lat", "customer_long"]].to_string(index=False))

Invalid coordinates found: 27
 order_id  customer_id  customer_lat  customer_long
ORD091929 ID0145235237    144.959364     -37.815878
ORD299508 ID0260907252    145.009445     -37.823816
ORD074143 ID6167247310    144.960234     -37.819701
ORD392203 ID0588197234    144.973944     -37.812101
ORD208957 ID6245731092    144.977354     -37.810785
ORD090831 ID1519470918    144.993262     -37.797624
ORD062280 ID6167489480    144.961790     -37.816990
ORD155978 ID0452381032    144.949411     -37.824991
ORD493957 ID0361227457    144.976899     -37.801182
ORD083244 ID0577458190    144.977813     -37.818479
ORD373348 ID1888340704    144.985178     -37.793879
ORD055195 ID2399230968    144.983469     -37.806415
ORD125480 ID2190483590    144.961303     -37.810368
ORD285476 ID2141904233    144.935968     -37.802954
ORD349254 ID0387153047    144.947587     -37.805420
ORD048052 ID0207093528    145.004517     -37.801171
ORD479919 ID1449297346    144.977507     -37.815613
ORD326763 ID0581709069    144.9282

**Geographical coordinates columns - Interpretation:**

A total of 27 invalid geographical coordinates were identified in the customer_lat and customer_long columns. In each of these rows the latitude value (around 144 to 145) falls outside the valid range of -90 to 90, while the longitude value (around -37.8) is within the valid range of -180 to 180 but is implausible for Melbourne. This indicates that the two coordinates are swapped.

**Justification:**

Validating geographical coordinates is crucial to ensure spatial integrity. Customer locations directly determine the nearest_warehouse and distance_to_nearest_warehouse through the Haversine formula. Hence, swapped coordinates would lead to computational errors.
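A minimal sketch of detecting and undoing a swap for a single coordinate pair is shown below. `fix_swapped` is a hypothetical helper, and the rule it applies (swap only when the latitude is out of range and the swapped pair is valid) is an assumption that suits this dataset, not a general-purpose geocoding fix.

```python
def fix_swapped(lat, lon):
    """Swap the pair when lat is out of range but the swapped pair is valid."""
    lat_ok = -90 <= lat <= 90
    swap_ok = -90 <= lon <= 90 and -180 <= lat <= 180
    if not lat_ok and swap_ok:
        return lon, lat
    return lat, lon

# ORD091929 as recorded: latitude and longitude are swapped
swapped_row = (144.959364, -37.815878)
corrected = fix_swapped(*swapped_row)

# A correctly ordered pair is returned unchanged
unchanged = fix_swapped(-37.815878, 144.959364)
```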


In [210]:
# distance_to_nearest_warehouse
# Parenthesise the comparison: | binds more tightly than <=
invalid_distance = dirty_data[dirty_data["distance_to_nearest_warehouse"].isna() |
                              (dirty_data["distance_to_nearest_warehouse"] <= 0)]
print("Invalid distance found:", len(invalid_distance))

Invalid distance found: 0


**Distance_to_nearest_warehouse column - Interpretation:**

The validation check for distance_to_nearest_warehouse column found no invalid or missing distance values. All records contain positive numerical values, confirming that each order has an assigned and non-zero delivery distance.

**Justification:**

Validating distance_to_nearest_warehouse ensures spatial and logistical consistency. Negative or zero distances would indicate errors in warehouse assignment or coordinate mismatches, while missing values may disrupt calculations. Confirming that this column contains no errors requires cross-column EDA with nearest_warehouse, customer_lat, and customer_long.
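One detail worth checking when a "missing or non-positive" condition is written in pandas: Python's `|` binds more tightly than comparison operators, so each comparison needs its own parentheses. The sketch below uses only the standard `ast` module to show how the two spellings are parsed; the names `a` and `b` are placeholders for the two column expressions.

```python
import ast

# Parse both spellings as expressions and look at the top-level node
unparenthesised = ast.parse("a | b <= 0", mode="eval").body
parenthesised = ast.parse("a | (b <= 0)", mode="eval").body

# Without parentheses the top-level node is the comparison: (a | b) <= 0
top_without = type(unparenthesised).__name__

# With parentheses the top-level node is the OR, as intended
top_with = type(parenthesised).__name__
```

Because the unparenthesised form compares the result of `a | b` with zero, it does not test the condition the filter is meant to express.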

In [211]:
# order_total
# Parenthesise each condition: | binds more tightly than <=
invalid_total = dirty_data[dirty_data["order_total"].isna() |
                           (dirty_data["order_total"] <= 0) |
                           (dirty_data["order_total"].apply(lambda x: len(str(x).split('.')[-1]) > 2 if '.' in str(x) else False))]
print("Invalid order total found:", len(invalid_total))

Invalid order total found: 0


**Order_total column - Interpretation:**

No invalid or missing values were detected in the order_total column, where all entries are positive and non-zero numerical values with at most two decimal places, indicating that order_total is in valid monetary format.

**Justification:**

The validation of order_total is essential because it represents the final transaction amount after applying discounts and delivery charges. All totals should be positive and correctly formatted to avoid computational errors later on. Confirming that this column contains no errors requires cross-column EDA with shopping_cart, order_price, delivery_charges, and coupon_discount.


In [212]:
# season
# Invalid season names
invalid_season = dirty_data[dirty_data["season"].isna()]
print("Invalid season found:", len(invalid_season))

# Unique values
print(dirty_data["season"].unique())

# Invalid letter case
invalid_case_season = dirty_data[~dirty_data["season"].str.match(r"^[A-Z][a-z]+$", na=False)]
print("Number of seasons not capitalised correctly:", len(invalid_case_season))
print(dirty_data["season"].value_counts(dropna=False))
print(invalid_case_season[["order_id", "season"]].to_string(index=False))

Invalid season found: 0
['Winter' 'Summer' 'spring' 'Autumn' 'Spring' 'autumn' 'winter' 'summer']
Number of seasons not capitalised correctly: 21
season
Summer    128
Winter    124
Spring    120
Autumn    107
spring      9
summer      6
autumn      3
winter      3
Name: count, dtype: int64
 order_id season
ORD209240 spring
ORD277275 spring
ORD123382 autumn
ORD209230 winter
ORD132264 spring
ORD192930 spring
ORD172628 spring
ORD103002 summer
ORD344249 summer
ORD222038 summer
ORD026633 summer
ORD455600 summer
ORD468030 spring
ORD478782 winter
ORD345384 summer
ORD493849 spring
ORD492808 winter
ORD385926 spring
ORD462038 autumn
ORD434426 spring
ORD387883 autumn


**Season column - Interpretation:**

The season column contains no null entries. However, inspection of the unique values shows both correctly and incorrectly capitalised season names, such as Winter vs winter. In total, 21 values were identified with incorrect letter casing.

**Justification:**

Consistent categorical formatting is critical for accurate grouping, filtering, and model training. Inconsistent letter casing can cause the same season to be treated as separate categories, leading to incorrect seasonal aggregations. Using str.match() with a title-case regex pattern flags every entry that is not capitalised correctly, so that these entries can later be standardised to title case. Confirming that no row contains more than one error in this column requires cross-column EDA with date.
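The title-case check can be sketched in plain Python with the same regex pattern used in the cell above. The sample list is a small subset of the observed values, and the standardisation step is shown only as an illustration of the later fix.

```python
import re

# Same pattern as the notebook: one uppercase letter followed by lowercase letters
TITLE_CASE = re.compile(r"^[A-Z][a-z]+$")

seasons = ["Winter", "Summer", "spring", "Autumn", "autumn"]

# Flag entries that do not match the title-case pattern
flagged = [s for s in seasons if not TITLE_CASE.match(s)]

# Standardise every entry to title case
standardised = [s.strip().capitalize() for s in seasons]
```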

In [213]:
# is_expedited_delivery
invalid_expedited = dirty_data[dirty_data["is_expedited_delivery"].isna()]
print("Invalid expedited delivery found:", len(invalid_expedited))
print(dirty_data["is_expedited_delivery"].value_counts(dropna=False))

Invalid expedited delivery found: 0
is_expedited_delivery
True     258
False    242
Name: count, dtype: int64


**Is_expedited_delivery column - Interpretation:**

All records in the is_expedited_delivery column contain only Boolean values, with no missing entries detected. The frequency distribution shows a near-balanced split with 258 expedited deliveries and 242 standard deliveries.

**Justification:**

Validating the is_expedited_delivery column is critical for verifying delivery charge calculations, since expedited orders should incur higher costs according to seasonal pricing models. Confirming that all entries are valid Boolean values ensures the integrity of this categorical variable and supports accurate modelling. Confirming that this column contains no errors requires training a linear regression model that achieves a good R2 score (over 0.97) and using it to validate the column.

In [214]:
# is_happy_customer
invalid_happy = dirty_data[dirty_data["is_happy_customer"].isna()]
print("Invalid positive customer response found:", len(invalid_happy))
print(dirty_data["is_happy_customer"].value_counts(dropna=False))

Invalid positive customer response found: 0
is_happy_customer
True     372
False    128
Name: count, dtype: int64


**Is_happy_customer column - Interpretation:**

All records in is_happy_customer column contain only Boolean values, with no missing entries detected. The frequency distribution shows 372 happy customers and 128 unhappy customers, indicating that the majority of customers reported positive experiences with their orders.

**Justification:**

Validating is_happy_customer column is critical for linear regression modelling that predicts delivery charges, where customer satisfaction can influence business rules. Ensuring all entries are valid Boolean values prevents logical errors during sentiment-based analysis.

##### 1.2.1.2 Cross-Column EDA

In [215]:
# Check date and season pairing
# Extract the necessary columns and exclude NA values
date_season = dirty_data[dirty_data["date"].notna()][["order_id", "date", "season"]].copy()

# Standardise season name
date_season["season_clean"] = date_season["season"].str.strip().str.lower()

# Get the months from date column
date_season["month"] = date_season["date"].dt.month

# Mapping of months to seasons
season_months = {"summer": [12, 1, 2],
                 "autumn": [3, 4, 5],
                 "winter": [6, 7, 8],
                 "spring": [9, 10, 11]}

# Valid season check
def valid_season (row):
  season = row["season_clean"]
  month = row["month"]
  if season in season_months:
    return month in season_months[season]
  else:
    return False

# Apply valid season check to the months and season
date_season["season_match"] = date_season.apply(valid_season, axis=1)

# Invalid date-season pairs
invalid_date_season = date_season[~date_season["season_match"]]
invalid_case = invalid_date_season[~invalid_date_season["season"].str.match(r"^[A-Z][a-z]+$", na=False)]

print("Number of mismatched date-season pairs:", len(invalid_date_season))
print("Number of seasons not capitalised correctly in date-season pairs:", len(invalid_case))
print(invalid_date_season[["order_id", "date", "month", "season"]].to_string(index=False))

Number of mismatched date-season pairs: 21
Number of seasons not capitalised correctly in date-season pairs: 15
 order_id       date  month season
ORD209240 2019-02-07      2 spring
ORD419503 2019-09-21      9 Autumn
ORD277275 2019-12-26     12 spring
ORD123382 2019-01-06      1 autumn
ORD209230 2019-09-03      9 winter
ORD192930 2019-08-30      8 spring
ORD172628 2019-03-30      3 spring
ORD103002 2019-10-17     10 summer
ORD040501 2019-10-17     10 Summer
ORD344249 2019-10-23     10 summer
ORD222038 2019-11-04     11 summer
ORD377228 2019-09-24      9 Autumn
ORD468030 2019-05-06      5 spring
ORD363647 2019-12-01     12 Winter
ORD345384 2019-11-25     11 summer
ORD127439 2019-07-25      7 Spring
ORD493849 2019-04-24      4 spring
ORD495324 2019-07-10      7 Summer
ORD462038 2019-07-10      7 autumn
ORD434426 2019-06-14      6 spring
ORD387883 2019-10-20     10 autumn


**Date-season pairing - Interpretation:**

The date-season cross-column check revealed 21 mismatched pairs where the recorded season does not align with the month derived from the date column. Of these 21 mismatched pairs, 15 also have season names that are not in title case.

**Justification:**

The date and season columns are paired to determine which type of error occurred in each row. Mismatched pairs were detected by mapping each Australian season to its corresponding months (Summer: December - February, Autumn: March - May, Winter: June - August, Spring: September - November) and comparing the month extracted from each order's date with its recorded season. A str.match() check with a title-case regex pattern then identified which of the mismatched rows also have improperly capitalised season names. This shows that improper capitalisation may not be the actual error in these rows; the actual error may be the mismatched date-season pair.
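The month-to-season mapping can be sketched as a small lookup. `season_matches` is a hypothetical helper; the two mismatched examples are rows from the output above, while the matching example is made up for illustration.

```python
from datetime import date

# Australian seasons and their months, as described above
SEASON_MONTHS = {
    "Summer": [12, 1, 2],
    "Autumn": [3, 4, 5],
    "Winter": [6, 7, 8],
    "Spring": [9, 10, 11],
}

# Invert the mapping so each month points to its season
MONTH_TO_SEASON = {m: s for s, months in SEASON_MONTHS.items() for m in months}

def season_matches(order_date, recorded_season):
    """Compare the season implied by the month with the recorded one, ignoring case."""
    expected = MONTH_TO_SEASON[order_date.month]
    return expected == recorded_season.strip().capitalize()

# ORD209240: February order recorded as spring
mismatch_lowercase = season_matches(date(2019, 2, 7), "spring")

# ORD419503: September order recorded as Autumn
mismatch_titlecase = season_matches(date(2019, 9, 21), "Autumn")

# Made-up example of a consistent pair
consistent = season_matches(date(2019, 1, 15), "Summer")
```

Normalising the case before comparing separates the two error types: a row that still fails after normalisation has a genuine date-season mismatch, not just a capitalisation issue.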
  

In [216]:
# Check if distance_to_nearest_warehouse is correct using Haversine distance
# Extract the necessary columns
warehouse_dist = dirty_data[["order_id", "nearest_warehouse", "distance_to_nearest_warehouse", "customer_lat", "customer_long"]].copy()

# Standardise warehouses name
warehouse_dist["nearest_warehouse_clean"] = warehouse_dist["nearest_warehouse"].str.strip().str.title()

# Swap the invalid customer_lat and customer_long
invalid_coords_tmp = warehouse_dist["customer_lat"].between(-180, 180) & warehouse_dist["customer_long"].between(-90, 90)
warehouse_dist.loc[invalid_coords_tmp, ["customer_lat", "customer_long"]] = warehouse_dist.loc[invalid_coords_tmp, ["customer_long", "customer_lat"]].values

# Check warehouse_data data types
print(warehouse_data.dtypes)

# Merge both datasets
merge_warehouse_dist = warehouse_dist.merge(warehouse_data,
                                            left_on="nearest_warehouse_clean",
                                            right_on="names",
                                            how="left")

# Apply Haversine distance function to the coordinates
merge_warehouse_dist["haversine_dist"] = merge_warehouse_dist.apply(lambda row:
                                                                    haversine_dist(row["customer_lat"],
                                                                             row["customer_long"],
                                                                             row["lat"],
                                                                             row["lon"]),
                                                                    axis=1)

# Compare the distances
merge_warehouse_dist["dist_diff"] = (merge_warehouse_dist["haversine_dist"] - merge_warehouse_dist["distance_to_nearest_warehouse"]).abs()

# Invalid distance_to_nearest_warehouse
invalid_distance = merge_warehouse_dist[merge_warehouse_dist["dist_diff"] > 0.1]
print("Number of mismatched distances:", len(invalid_distance))
print(invalid_distance[["order_id", "nearest_warehouse_clean", "customer_lat", "customer_long", "lat", "lon", "distance_to_nearest_warehouse", "haversine_dist", "dist_diff"]].to_string(index=False))

names     object
lat      float64
lon      float64
dtype: object
Number of mismatched distances: 43
 order_id nearest_warehouse_clean  customer_lat  customer_long        lat        lon  distance_to_nearest_warehouse  haversine_dist  dist_diff
ORD418280               Nickolson    -37.822991     144.976257 -37.818595 144.969551                         1.9037        0.766301   1.137399
ORD237879               Nickolson    -37.800113     144.935310 -37.818595 144.969551                         1.7391        3.647097   1.907997
ORD020055                  Bakers    -37.817347     145.008734 -37.809996 144.995232                         0.6764        1.442062   0.765662
ORD483341                Thompson    -37.820155     144.952118 -37.812673 144.947069                         0.7995        0.943856   0.144356
ORD106152               Nickolson    -37.820522     144.982869 -37.818595 144.969551                         0.7663        1.190577   0.424277
ORD052629               Nickolson    -37.8

**Distance_to_nearest_warehouse Validation - Interpretation:**

The comparison between the recorded distance_to_nearest_warehouse and the recalculated Haversine distance identified 43 mismatched records where the absolute difference exceeded 0.1 km. These discrepancies indicate that the stored distances were either incorrectly computed or linked to the wrong warehouse coordinates. The mismatches are distributed across all three warehouses (Nickolson, Thompson, and Bakers), so the issue is not isolated to a single location. Because the swapped coordinates were corrected before the comparison and the 0.1 km tolerance exceeds rounding noise, the likely causes are an incorrectly recorded distance or an incorrect warehouse assignment in the raw data.

**Justification:**

To verify distance accuracy, the Haversine formula was applied to recompute the true distance between each customer's coordinates and their assigned warehouse's coordinates. This method detects logical inconsistencies by comparing the physically calculated distance to the recorded value. Rows with deviations greater than 0.1 km were flagged as invalid since such variance exceeds expected rounding or measurement noise in real-world data.
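The recomputation can be sketched for a single order using only the standard library. `haversine_km` is a scalar stand-in for the notebook's NumPy helper with the same radius of 6378 km, and the coordinates and recorded distance are taken from the ORD418280 row printed above.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6378  # same radius as the notebook's helper

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# ORD418280: customer coordinates, then Nickolson warehouse coordinates
recomputed = haversine_km(-37.822991, 144.976257, -37.818595, 144.969551)

# Recorded distance_to_nearest_warehouse for the same order
recorded = 1.9037

# Flag the row when the difference exceeds the 0.1 km tolerance
is_mismatch = abs(recomputed - recorded) > 0.1
```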

In [217]:
# Check if nearest_warehouse is correct using Haversine distance
# Calculate the distance to all three warehouses
for _, wh_row in warehouse_data.iterrows():
  wh = wh_row["names"]
  lat = wh_row["lat"]
  lon = wh_row["lon"]

  # Use a distinct name (r) for the order row to avoid shadowing the warehouse row
  merge_warehouse_dist[f"haversine_dist_{wh}"] = merge_warehouse_dist.apply(lambda r:
                                                                    haversine_dist(r["customer_lat"],
                                                                             r["customer_long"],
                                                                             lat,
                                                                             lon),
                                                                  axis=1)

# Get the nearest warehouse name
distance_cols = [f"haversine_dist_{wh}" for wh in warehouse_data["names"]]
merge_warehouse_dist[distance_cols] = merge_warehouse_dist[distance_cols].apply(lambda x: np.around(x, 4))
merge_warehouse_dist["actual_nearest_distance"] = merge_warehouse_dist[distance_cols].min(axis=1)
merge_warehouse_dist["actual_nearest_warehouse"] = (merge_warehouse_dist[distance_cols].idxmin(axis=1).str.replace("haversine_dist_", ""))

# Check if the nearest_warehouse is correct
merge_warehouse_dist["is_correct"] = (merge_warehouse_dist["nearest_warehouse_clean"] == merge_warehouse_dist["actual_nearest_warehouse"])

# Invalid nearest_warehouse
invalid_nearest_warehouse = merge_warehouse_dist[~merge_warehouse_dist["is_correct"]]
print("Number of invalid nearest warehouse:", len(invalid_nearest_warehouse))
print(invalid_nearest_warehouse[["order_id", "nearest_warehouse_clean", "distance_to_nearest_warehouse", "actual_nearest_warehouse", "actual_nearest_distance", "haversine_dist_Thompson", "haversine_dist_Nickolson", "haversine_dist_Bakers"]].to_string(index=False))

# Among the mislabelled rows, count those whose distance_to_nearest_warehouse differs from actual_nearest_distance
distance_check = invalid_nearest_warehouse["distance_to_nearest_warehouse"] == invalid_nearest_warehouse["actual_nearest_distance"]
print("Number of incorrect nearest warehouse distance:", len(distance_check[~distance_check]))

Number of invalid nearest warehouse: 20
 order_id nearest_warehouse_clean  distance_to_nearest_warehouse actual_nearest_warehouse  actual_nearest_distance  haversine_dist_Thompson  haversine_dist_Nickolson  haversine_dist_Bakers
ORD237879               Nickolson                         1.7391                 Thompson                   1.7391                   1.7391                    3.6471                 5.3839
ORD052629               Nickolson                         1.3297                   Bakers                   1.3297                   4.6427                    2.5795                 1.3297
ORD461915               Nickolson                         0.2395                 Thompson                   0.2395                   0.2395                    1.8613                 4.0736
ORD069936                Thompson                         1.3319                Nickolson                   1.3319                   1.7608                    1.3319                 2.6129
ORD461426      

**Nearest_warehouse Validation - Interpretation:**

By recalculating the Haversine distance from each customer's coordinates to all three warehouses, 20 orders were identified with incorrect nearest_warehouse assignments. The recorded warehouse name does not match the actual geographically nearest warehouse, even though the distance_to_nearest_warehouse value itself was correct. The consistency in distance_to_nearest_warehouse values suggests that the error lies in the categorical warehouse name rather than in the numeric distance calculation.

**Justification:**

This ensures geospatial consistency between the warehouse name and the computed minimum distance. The method calculates the distance from each customer to every warehouse using the Haversine formula and determines the true nearest location by identifying the minimum value across all warehouses. Rows where the stored nearest_warehouse does not match this computed result are flagged as incorrect. The comparison between distance_to_nearest_warehouse and actual_nearest_distance confirms that it is a categorical mislabelling rather than a distance computation error.
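
As a sketch of the same idea, the distance from every customer to every warehouse can also be computed in one vectorised step with NumPy broadcasting, followed by a row-wise minimum. The function name `nearest_warehouse`, the warehouse names, and all coordinates below are assumptions made for illustration; only the Haversine formula and the 6378 km radius come from the notebook.

```python
import numpy as np

def nearest_warehouse(cust_lat, cust_lon, wh_names, wh_lat, wh_lon, radius_km=6378):
    """Return (nearest warehouse name, distance in km) for every customer."""
    # Customers form a column and warehouses a row, so broadcasting
    # yields an (n_customers, n_warehouses) distance matrix.
    c_lat = np.radians(np.asarray(cust_lat, dtype=float))[:, None]
    c_lon = np.radians(np.asarray(cust_lon, dtype=float))[:, None]
    w_lat = np.radians(np.asarray(wh_lat, dtype=float))[None, :]
    w_lon = np.radians(np.asarray(wh_lon, dtype=float))[None, :]

    a = (np.sin((w_lat - c_lat) / 2) ** 2
         + np.cos(c_lat) * np.cos(w_lat) * np.sin((w_lon - c_lon) / 2) ** 2)
    dist = 2 * radius_km * np.arcsin(np.sqrt(a))

    nearest_idx = dist.argmin(axis=1)
    nearest_dist = dist[np.arange(len(nearest_idx)), nearest_idx]
    return np.asarray(wh_names)[nearest_idx], np.round(nearest_dist, 4)

# Hypothetical warehouse table (names and coordinates are illustrative)
wh_names = ["WH_A", "WH_B", "WH_C"]
wh_lat = [-37.80, -37.81, -37.82]
wh_lon = [144.95, 144.97, 144.99]

# Two hypothetical customers placed exactly on WH_A and WH_C
names, dists = nearest_warehouse([-37.80, -37.82], [144.95, 144.99],
                                 wh_names, wh_lat, wh_lon)
recorded = np.array(["WH_A", "WH_B"])  # second label is deliberately wrong
print(names != recorded)               # boolean mask of mislabelled rows
```

Compared with a row-wise `apply`, this avoids one Python-level function call per row, which may matter on larger datasets.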

In [218]:
# Check if is_happy_customer is correct using SentimentIntensityAnalyzer
# Extract the necessary columns and replace missing values with empty string
is_happy_check = dirty_data[["order_id", "customer_id", "is_happy_customer", "latest_customer_review"]].copy()
is_happy_check["latest_customer_review"] = is_happy_check["latest_customer_review"].fillna("").astype(str)

# Apply sentiment analysis to each latest_customer_review and return the compound score, or None for an empty string
is_happy_check["compound_score"] = is_happy_check["latest_customer_review"].apply(lambda row: sia.polarity_scores(row)["compound"]
                                                                                  if row.strip() != ""
                                                                                  else None)

# Create a column to store True when compound_score is at least 0.05 or the review is an empty string
is_happy_check["happy_prediction"] = is_happy_check.apply(lambda row: True
                                                          if row["latest_customer_review"].strip() == ""
                                                          else row["compound_score"] >= 0.05,
                                                          axis=1)

# Check if is_happy_customer aligns with latest_customer_review
is_happy_check["is_correct"] = is_happy_check["happy_prediction"] == is_happy_check["is_happy_customer"]

# Invalid is_happy_customer
invalid_sentiment = is_happy_check[~is_happy_check["is_correct"]]
print("Number of invalid sentiment labels:", len(invalid_sentiment))
print(invalid_sentiment[["order_id", "latest_customer_review", "is_happy_customer", "compound_score", "happy_prediction", "is_correct"]].to_string(index=False))

Number of invalid sentiment labels: 27
 order_id                                                                                                                                                                                                                                                      latest_customer_review  is_happy_customer  compound_score  happy_prediction  is_correct
ORD216249                                                                                                          battery runs low fast the phone works fine, although the battery runs out fast and you have to charge it a lot .i like the phone but wish the battery didn't suck.              False          0.8052              True       False
ORD412492                                                                                                                                                                                                                                      nice i don't see any problems with i

**Is_happy_customer Validation - Interpretation:**

The sentiment analysis comparison revealed 27 mismatched rows where the recorded is_happy_customer label does not align with the sentiment predicted from latest_customer_review. The compound sentiment scores generated by the SentimentIntensityAnalyzer ranged from 0.2 to 0.9 for positive reviews and fell below -0.4 for negative ones. These inconsistencies suggest errors during manual or automated label assignment in the raw dataset.

**Justification:**

To verify the correctness of the is_happy_customer field, each customer review was analysed using the VADER SentimentIntensityAnalyzer, which assigns a compound_score representing the polarity of the review text. A score of at least 0.05 indicates positive sentiment (True), while lower scores indicate negative sentiment (False). Empty reviews were assumed positive, following the dataset’s business rule that customers without complaints are considered satisfied. Detecting these mismatches is crucial for any modelling that relates customer happiness to delivery performance and product quality.
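
The labelling rule itself is independent of the sentiment library and can be sketched as a small pure function. The name `predict_is_happy` is hypothetical; the function assumes the compound score has already been computed (for example by VADER) and that missing reviews have been replaced with an empty string or None, as in the cell above.

```python
def predict_is_happy(review, compound_score, threshold=0.05):
    """Apply the rule: empty review -> happy, otherwise compare the compound score to the threshold."""
    if review is None or str(review).strip() == "":
        return True                      # no review is treated as a satisfied customer
    return compound_score >= threshold   # positive sentiment at or above the threshold

def label_is_consistent(recorded_label, review, compound_score):
    """True when the recorded is_happy_customer flag agrees with the predicted one."""
    return bool(recorded_label) == predict_is_happy(review, compound_score)

# Compound score 0.8052 with a recorded label of False is the first mismatch listed in the output above
print(label_is_consistent(False, "battery runs low fast ...", 0.8052))
```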


#### 1.2.1.3 Dirty Data Fix

In [219]:
# Create a duplicate of dirty_data to clean
clean_data = dirty_data.copy()

# Create fix log
fix_log = pd.DataFrame(columns=["order_id", "error_type"])

**Backup - Justification:**

Storing the raw data preserves a reference point for all subsequent data-cleaning operations. Maintaining an unaltered copy ensures that any modifications can be traced back to the original records, supporting transparency and reproducibility. A fix log is created to systematically record each detected anomaly and the corresponding correction applied. Every fixed row is logged in the fix log, which tracks the order_id and the corresponding error_type. Rows listed in the fix log are excluded from further processing to maintain the single-fault assumption.
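
The log-then-exclude pattern used by every fix cell can be sketched with two small helpers. `log_fixes` and `unfixed_rows` are hypothetical names introduced here for illustration, and the three-row `orders` frame is invented data.

```python
import pandas as pd

def log_fixes(fix_log, order_ids, error_type):
    """Return a new fix log with one row appended per corrected order."""
    new_rows = pd.DataFrame({"order_id": list(order_ids), "error_type": error_type})
    return pd.concat([fix_log, new_rows], ignore_index=True)

def unfixed_rows(data, fix_log):
    """Rows not yet touched by any fix (single-fault assumption)."""
    return data[~data["order_id"].isin(fix_log["order_id"])].copy()

# Hypothetical miniature dataset
orders = pd.DataFrame({"order_id": ["ORD1", "ORD2", "ORD3"],
                       "season": ["summer", "Winter", "Spring"]})

fix_log = pd.DataFrame(columns=["order_id", "error_type"])
fix_log = log_fixes(fix_log, ["ORD1"], "Incorrect season")

remaining = unfixed_rows(orders, fix_log)
print(remaining["order_id"].tolist())
```

Returning a `.copy()` from `unfixed_rows` keeps later column assignments on the subset from raising pandas' SettingWithCopyWarning.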

In [220]:
# Fix season and casing based on date
# Extract the unfixed data
unfixed_data = clean_data[~clean_data["order_id"].isin(fix_log["order_id"])].copy()  # copy so the new columns below do not write to a view

# Parse date to datetime data type and extract month
unfixed_data["date"] = pd.to_datetime(unfixed_data["date"], errors="coerce")
unfixed_data["month"]= unfixed_data["date"].dt.month

# Mapping for month to season
season_map = {12: "Summer", 1: "Summer", 2: "Summer",
              3: "Autumn", 4: "Autumn", 5: "Autumn",
              6: "Winter", 7: "Winter", 8: "Winter",
              9: "Spring", 10: "Spring", 11: "Spring"}

unfixed_data["season_clean"] = unfixed_data["month"].map(season_map)

unfixed_data["season_clean"] = np.where(
    unfixed_data["month"].isna(),
    unfixed_data["season"],
    unfixed_data["season_clean"]
)

# Mismatched season
season_mismatch = (unfixed_data["month"].notna() &
                   (unfixed_data["season"] != unfixed_data["season_clean"]))
print("Number of mismatched seasons:", season_mismatch.sum())

# Replace the invalid rows with the correct values
clean_data.loc[season_mismatch, "season"] = unfixed_data.loc[season_mismatch, "season_clean"]

# Log the order_id for fixed rows
fixed_ids = unfixed_data.loc[season_mismatch, "order_id"]
fix_log = pd.concat([fix_log,
                     pd.DataFrame({"order_id": fixed_ids,
                                   "error_type": "Incorrect season"})],
                    ignore_index=True)

print(f"Fixed {len(fixed_ids)} rows.")
print(clean_data.loc[unfixed_data.loc[season_mismatch].index, ["order_id", "date", "season"]].to_string(index=False))

Number of mismatched seasons: 27
Fixed 27 rows.
 order_id       date season
ORD209240 2019-02-07 Summer
ORD419503 2019-09-21 Spring
ORD277275 2019-12-26 Summer
ORD123382 2019-01-06 Summer
ORD209230 2019-09-03 Spring
ORD132264 2019-10-14 Spring
ORD192930 2019-08-30 Winter
ORD172628 2019-03-30 Autumn
ORD103002 2019-10-17 Spring
ORD040501 2019-10-17 Spring
ORD344249 2019-10-23 Spring
ORD222038 2019-11-04 Spring
ORD026633 2019-01-16 Summer
ORD377228 2019-09-24 Spring
ORD455600 2019-02-18 Summer
ORD468030 2019-05-06 Autumn
ORD478782 2019-07-26 Winter
ORD363647 2019-12-01 Summer
ORD345384 2019-11-25 Spring
ORD127439 2019-07-25 Winter
ORD493849 2019-04-24 Autumn
ORD492808 2019-08-07 Winter
ORD495324 2019-07-10 Winter
ORD385926 2019-09-10 Spring
ORD462038 2019-07-10 Winter
ORD434426 2019-06-14 Winter
ORD387883 2019-10-20 Spring


**Season fix - Justification:**

The correct season is derived directly from the date column using the Australian calendar. Dates were converted to the datetime format, and months were mapped to their respective seasons. A total of 21 rows with mismatched date-season pairs were corrected using this mapping. The same comparison also flagged 6 rows whose season had improper text casing; these were standardised to proper capitalisation. A total of 27 rows were fixed and logged as "Incorrect season" in the fix_log.
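
For illustration, the month-to-season mapping can also be written arithmetically rather than as a dictionary. This is only a sketch of an alternative; `season_from_date` and `season_needs_fix` are hypothetical names, and the dates used are taken from the output above.

```python
from datetime import date

# Southern-hemisphere seasons, starting from December
SEASONS = ("Summer", "Autumn", "Winter", "Spring")

def season_from_date(iso_date):
    """Map an ISO date string to its Australian season."""
    month = date.fromisoformat(iso_date).month
    # December wraps to 0, so Dec-Feb -> 0, Mar-May -> 1, Jun-Aug -> 2, Sep-Nov -> 3
    return SEASONS[(month % 12) // 3]

def season_needs_fix(iso_date, recorded_season):
    """True when the recorded season (including its casing) differs from the derived one."""
    return recorded_season != season_from_date(iso_date)

print(season_from_date("2019-12-26"))
print(season_needs_fix("2019-02-07", "summer"))  # wrong casing also counts as a mismatch
```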

In [221]:
# Fix date
# Extract the unfixed data
unfixed_data = clean_data[~clean_data["order_id"].isin(fix_log["order_id"])].copy()

# Mapping for season to date
date_map = {
    "Summer": "2019-01-15",
    "Autumn": "2019-04-15",
    "Winter": "2019-07-15",
    "Spring": "2019-10-15"
}

# Missing date value
missing_date = unfixed_data["date"].isna()
print("Number of missing dates:", missing_date.sum())

# Extract the missing date index
date_index = unfixed_data.index[missing_date]

# Map the missing date to date_map
correct_date = pd.to_datetime(unfixed_data.loc[date_index, "season"].map(date_map), errors="coerce")

# Replace the invalid rows with the correct values
clean_data.loc[date_index, "date"] = correct_date

# Log the order_id for fixed rows
fixed_ids = unfixed_data.loc[date_index, "order_id"]
fix_log = pd.concat([fix_log,
                     pd.DataFrame({"order_id": fixed_ids,
                                   "error_type": "Missing date"})],
                    ignore_index=True)

print(f"Fixed {len(fixed_ids)} rows.")
print(clean_data.loc[date_index, ["order_id", "date", "season"]].to_string(index=False))

Number of missing dates: 27
Fixed 27 rows.
 order_id       date season
ORD164387 2019-01-15 Summer
ORD066446 2019-07-15 Winter
ORD312565 2019-01-15 Summer
ORD181051 2019-01-15 Summer
ORD046408 2019-10-15 Spring
ORD219265 2019-07-15 Winter
ORD006455 2019-10-15 Spring
ORD084861 2019-07-15 Winter
ORD438655 2019-01-15 Summer
ORD234563 2019-10-15 Spring
ORD199817 2019-01-15 Summer
ORD113549 2019-04-15 Autumn
ORD489756 2019-10-15 Spring
ORD273300 2019-10-15 Spring
ORD194653 2019-01-15 Summer
ORD480775 2019-10-15 Spring
ORD160619 2019-07-15 Winter
ORD402436 2019-04-15 Autumn
ORD311888 2019-04-15 Autumn
ORD491911 2019-07-15 Winter
ORD265708 2019-01-15 Summer
ORD060082 2019-01-15 Summer
ORD469475 2019-01-15 Summer
ORD461231 2019-07-15 Winter
ORD047863 2019-04-15 Autumn
ORD499923 2019-10-15 Spring
ORD036565 2019-10-15 Spring


**Date fix - Justification:**

The date column was fixed by filling missing values based on each record's season. Since the dataset represents 2019 transactions, a mapping (Summer → 2019-01-15, Autumn → 2019-04-15, Winter → 2019-07-15, Spring → 2019-10-15) was applied to assign a representative mid-season date for each missing entry. This approach ensures temporal consistency and preserves the logical relationship between date and season when the original date was unavailable. A total of 27 rows were fixed and logged as "Missing date" in the fix_log.

In [222]:
# Fix customer_lat and customer_long
# Extract the unfixed data
unfixed_data = clean_data[~clean_data["order_id"].isin(fix_log["order_id"])].copy()

# Swapped coordinates
swapped_coords = (unfixed_data["customer_lat"].between(-180, 180) &
                  unfixed_data["customer_long"].between(-90, 90))
print("Number of swapped coordinates:", swapped_coords.sum())

swapped_coords_index = unfixed_data.index[swapped_coords]

# Replace the invalid rows with the correct values
clean_data.loc[swapped_coords_index, ["customer_lat", "customer_long"]] = (clean_data.loc[swapped_coords_index, ["customer_long", "customer_lat"]].to_numpy())

# Log the order_id for fixed rows
fixed_ids = clean_data.loc[swapped_coords_index, "order_id"].tolist()
fix_log = pd.concat([fix_log,
                     pd.DataFrame({"order_id": fixed_ids,
                                   "error_type": "Swapped coordinates"})],
                    ignore_index=True)

print(f"Fixed {len(fixed_ids)} rows.")
print(clean_data.loc[swapped_coords_index, ["order_id", "customer_lat", "customer_long"]].to_string(index=False))

Number of swapped coordinates: 27
Fixed 27 rows.
 order_id  customer_lat  customer_long
ORD091929    -37.815878     144.959364
ORD299508    -37.823816     145.009445
ORD074143    -37.819701     144.960234
ORD392203    -37.812101     144.973944
ORD208957    -37.810785     144.977354
ORD090831    -37.797624     144.993262
ORD062280    -37.816990     144.961790
ORD155978    -37.824991     144.949411
ORD493957    -37.801182     144.976899
ORD083244    -37.818479     144.977813
ORD373348    -37.793879     144.985178
ORD055195    -37.806415     144.983469
ORD125480    -37.810368     144.961303
ORD285476    -37.802954     144.935968
ORD349254    -37.805420     144.947587
ORD048052    -37.801171     145.004517
ORD479919    -37.815613     144.977507
ORD326763    -37.805420     144.928230
ORD442562    -37.820067     144.968618
ORD063341    -37.827219     144.988143
ORD012510    -37.799791     144.954950
ORD070213    -37.801354     144.946328
ORD481618    -37.816652     144.988204
ORD202182    -3

**Geographical Coordinates Fix - Justification:**

The customer_lat and customer_long columns were corrected by identifying rows where latitude and longitude values were likely swapped. Normally, latitude values should range between -90 and 90, while longitude values fall between -180 and 180. Rows where latitude was within the longitude range (-180 to 180) and longitude within the latitude range (-90 to 90) were flagged as swapped. This test is sufficient here because valid longitudes in this dataset (about 145) lie outside the latitude range, so correctly ordered rows are never flagged. For these records, the coordinates were exchanged to their correct positions, ensuring geographical validity and consistency with expected coordinate ranges. A total of 27 rows were fixed and logged as “Swapped coordinates” in the fix_log.
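
A slightly stricter variant of the swap test, which additionally requires the stored latitude to be out of range, might look like the sketch below. `fix_swapped_coords` is a hypothetical helper; the swapped input is reconstructed by reversing the first corrected pair shown in the output above.

```python
def fix_swapped_coords(lat, lon):
    """Swap the pair back when latitude is out of range but longitude would be a valid latitude."""
    if abs(lat) > 90 and abs(lon) <= 90:
        return lon, lat
    return lat, lon

print(fix_swapped_coords(144.959364, -37.815878))  # swapped pair is exchanged
print(fix_swapped_coords(-37.815878, 144.959364))  # valid pair is returned unchanged
```

Requiring `abs(lat) > 90` keeps the rule safe for datasets whose longitudes fall inside the latitude range, where the range-overlap test alone would flag valid rows.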

In [223]:
# Fix nearest_warehouse
# 1. Actual nearest warehouse
# Extract the unfixed data
unfixed_data = clean_data[~clean_data["order_id"].isin(fix_log["order_id"])].copy()
unfixed_index = unfixed_data.index

# Extract the original nearest_warehouse and the actual nearest_warehouse in stripped string format
ori_wh = merge_warehouse_dist.loc[unfixed_index, "nearest_warehouse_clean"].astype(str).str.strip()
actual_wh = merge_warehouse_dist.loc[unfixed_index, "actual_nearest_warehouse"].astype(str).str.strip()

incorrect_wh = (ori_wh != actual_wh)

# Extract rows with incorrect nearest_warehouse but correct distance_to_nearest_warehouse
correct_dist = ((merge_warehouse_dist.loc[unfixed_index, "distance_to_nearest_warehouse"] -
                 merge_warehouse_dist.loc[unfixed_index, "actual_nearest_distance"]).abs() <= 0)

# Invalid nearest_warehouse
wh_fix = incorrect_wh & correct_dist
incorrect_wh_index = merge_warehouse_dist.loc[unfixed_index].index[wh_fix]
print("Number of incorrect nearest warehouse name:", len(incorrect_wh_index))

# Replace the invalid rows with the correct values
clean_data.loc[incorrect_wh_index, "nearest_warehouse"] = (merge_warehouse_dist.loc[incorrect_wh_index, "actual_nearest_warehouse"].values)

# Log the order_id for fixed rows
fixed_ids = clean_data.loc[incorrect_wh_index, "order_id"].values
fix_log = pd.concat([fix_log,
                     pd.DataFrame({"order_id": fixed_ids,
                                   "error_type": "Incorrect nearest_warehouse"})],
                    ignore_index=True)

print(f"Fixed {len(fixed_ids)} rows.")
print(clean_data.loc[incorrect_wh_index, ["order_id", "nearest_warehouse"]].to_string(index=False))

Number of incorrect nearest warehouse name: 20
Fixed 20 rows.
 order_id nearest_warehouse
ORD237879          Thompson
ORD052629            Bakers
ORD461915          Thompson
ORD069936         Nickolson
ORD461426          Thompson
ORD015774          Thompson
ORD120949          Thompson
ORD216010            Bakers
ORD138742          Thompson
ORD334316         Nickolson
ORD081635         Nickolson
ORD025929          Thompson
ORD201338          Thompson
ORD436885            Bakers
ORD224300          Thompson
ORD471229         Nickolson
ORD164906         Nickolson
ORD118183            Bakers
ORD300512            Bakers
ORD426908          Thompson


**Nearest_warehouse Fix - Justification:**

The nearest_warehouse column was corrected by verifying each customer's recorded warehouse against the actual nearest warehouse computed using the Haversine distance. Rows where the nearest_warehouse name differed from the calculated nearest one were identified as incorrect, but only those with a correct distance_to_nearest_warehouse value were updated to prevent cascading errors. The warehouse names were then replaced with the accurate ones derived from distance calculations. A total of 20 rows were corrected and logged under “Incorrect nearest_warehouse”, ensuring each order now correctly reflects its true nearest warehouse location.

In [224]:
# Fix distance_to_nearest_warehouse
# Haversine distance
# Extract the unfixed data
unfixed_data = clean_data[~clean_data["order_id"].isin(fix_log["order_id"])].copy()
unfixed_index = unfixed_data.index

# Extract the original nearest_warehouse and the actual nearest_warehouse in stripped string format
ori_wh = merge_warehouse_dist.loc[unfixed_index, "nearest_warehouse_clean"].astype(str).str.strip()
actual_wh = merge_warehouse_dist.loc[unfixed_index, "actual_nearest_warehouse"].astype(str).str.strip()

correct_wh = (ori_wh == actual_wh)

# Extract the actual nearest_warehouse distance in unfixed data only
actual_nearest_distance = np.around(merge_warehouse_dist.loc[unfixed_index, "actual_nearest_distance"].values, 4)

# Extract the original distance_to_nearest_warehouse
ori_distance = merge_warehouse_dist.loc[unfixed_index, "distance_to_nearest_warehouse"].values

# Calculate the distance difference
mismatch_dist = np.abs(ori_distance - actual_nearest_distance) > 0.1

# Invalid distance_to_nearest_warehouse
dist_fix = correct_wh & mismatch_dist
incorrect_dist_index = merge_warehouse_dist.loc[unfixed_index].index[dist_fix]
print("Number of incorrect distance to nearest warehouse:", len(incorrect_dist_index))

# Replace the invalid rows with the correct values
clean_data.loc[incorrect_dist_index, "distance_to_nearest_warehouse"] = (merge_warehouse_dist.loc[incorrect_dist_index, "actual_nearest_distance"].values)

# Log the order_id for fixed rows
fixed_ids = clean_data.loc[incorrect_dist_index, "order_id"].values
fix_log = pd.concat([fix_log,
                     pd.DataFrame({"order_id": fixed_ids,
                                   "error_type": "Incorrect distance_to_nearest_warehouse"})],
                    ignore_index=True)

print(f"Fixed {len(fixed_ids)} rows.")
print(clean_data.loc[incorrect_dist_index, ["order_id", "distance_to_nearest_warehouse"]].to_string(index=False))

Number of incorrect distance to nearest warehouse: 23
Fixed 23 rows.
 order_id  distance_to_nearest_warehouse
ORD418280                         0.7663
ORD020055                         1.4421
ORD483341                         0.9439
ORD106152                         1.1906
ORD177750                         1.8275
ORD452867                         0.8989
ORD393540                         1.6016
ORD135624                         1.7146
ORD471922                         1.8650
ORD385052                         0.8499
ORD307211                         1.1409
ORD227802                         0.8424
ORD041333                         0.6523
ORD378423                         0.9532
ORD321986                         1.1383
ORD205213                         0.3711
ORD440634                         0.7465
ORD416698                         2.7735
ORD049973                         0.8999
ORD005273                         0.8712
ORD048269                         0.9092
ORD175303                    

**Distance_to_nearest_warehouse Fix - Justification:**

The distance_to_nearest_warehouse column was fixed by recalculating the true Haversine distance between each customer's coordinates and their nearest warehouse. For each record where the warehouse name was already verified as correct, the stored distance was compared with the computed actual distance. Rows showing a deviation greater than 0.1 km were flagged as inaccurate and updated with the precise recalculated values. A total of 23 rows were corrected and recorded in the fix_log under “Incorrect distance_to_nearest_warehouse.”


In [225]:
# Fix nearest_warehouse
# 2. Naming inconsistency
# Extract the unfixed data
unfixed_data = clean_data[~clean_data["order_id"].isin(fix_log["order_id"])].copy()

# Extract the warehouse names from warehouses.csv
valid_names = warehouse_data["names"].str.strip().str.title()

# Invalid nearest_warehouse
invalid_names = ~unfixed_data["nearest_warehouse"].astype(str).str.strip().isin(valid_names)
invalid_name_index = unfixed_data.index[invalid_names]
print("Number of invalid warehouse names:", len(invalid_name_index))

# Replace the invalid rows with correct values
clean_data.loc[invalid_name_index, "nearest_warehouse"] = (clean_data.loc[invalid_name_index, "nearest_warehouse"].astype(str).str.strip().str.title())

# Log the order_id for fixed rows
fixed_ids = clean_data.loc[invalid_name_index, "order_id"].values
fix_log = pd.concat([fix_log,
                     pd.DataFrame({"order_id": fixed_ids,
                                   "error_type": "Inconsistent nearest_warehouse naming"})],
                    ignore_index=True)

print(f"Fixed {len(fixed_ids)} rows.")
print(clean_data.loc[invalid_name_index, ["order_id", "nearest_warehouse"]].to_string(index=False))

Number of invalid warehouse names: 7
Fixed 7 rows.
 order_id nearest_warehouse
ORD256861          Thompson
ORD014442          Thompson
ORD166717          Thompson
ORD169718         Nickolson
ORD393258          Thompson
ORD016571         Nickolson
ORD099672          Thompson


**Warehouse Name Fix - Justification:**

All warehouse names were compared against the valid list extracted from warehouses.csv. Names not found in that list were converted to a consistent title-cased format using .str.title() after stripping extra spaces. This correction ensures uniform naming conventions across all records. A total of 7 rows were corrected and logged as “Inconsistent nearest_warehouse naming” in the fix_log.


In [226]:
# Individual item price
# Extract the fixed data
fixed_data = clean_data[clean_data["order_id"].isin(set(fix_log["order_id"]))].copy()

# Map the order id from fixed data to the parsed shopping_cart value
cart_map = shopping_cart_check.set_index("order_id")["shopping_cart_parsed"]
fixed_data["cart_parsed"] = fixed_data["order_id"].map(cart_map)

# Extract all the item and the item names in shopping_cart as a list
all_items = [item for cart in fixed_data["cart_parsed"] for (item, qty) in cart]
item_names = pd.Series(all_items).value_counts().index.tolist()

# Lookup for item name with column index
index = {n: i for i, n in enumerate(item_names)}
K = len(item_names)

# Build numerical matrices: one row of item quantities per order
rows, targets = [], []
for _, r in fixed_data.iterrows():
    v = np.zeros(K)
    for (name, qty) in r["cart_parsed"]:
        if name in index:
            v[index[name]] += float(qty)
    # Append once per order, after the whole cart has been counted
    rows.append(v)
    targets.append(float(r["order_price"]))

A = np.vstack(rows) # Item quantities per order
b = np.array(targets, dtype=float) # order_price

# Calculate the individual item prices with linalg
unit_prices, *_ = np.linalg.lstsq(A, b, rcond=None)
price_map = pd.Series(unit_prices, index=item_names).round(2)

print(price_map.sort_index())

Alcon 10          8950.0
Candle Inferno     430.0
Lucent 330S       1230.0
Olivia x460       1225.0
Thunder line      2180.0
Toshika 750       4320.0
Universe Note     3450.0
iAssist Line      2225.0
iStream            150.0
pearTV            6310.0
dtype: float64
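
The unit-price recovery can be illustrated on a small system. Each order contributes one equation, quantities times unknown unit prices equals order_price, and least squares solves the stacked system. The carts below are hypothetical; their order prices were constructed from three of the unit prices printed above so that the expected solution is known.

```python
import numpy as np

item_names = ["Candle Inferno", "iStream", "pearTV"]
col = {name: i for i, name in enumerate(item_names)}

# Hypothetical carts; order prices are consistent with the unit prices printed above
orders = [
    ([("Candle Inferno", 2), ("iStream", 1)], 1010.0),
    ([("iStream", 1), ("pearTV", 1)], 6460.0),
    ([("Candle Inferno", 1), ("pearTV", 2)], 13050.0),
    ([("iStream", 3)], 450.0),
]

# A holds item quantities (one row per order), b holds the recorded order prices
A = np.zeros((len(orders), len(item_names)))
b = np.zeros(len(orders))
for i, (cart, order_price) in enumerate(orders):
    for name, qty in cart:
        A[i, col[name]] += qty
    b[i] = order_price

# Solve A @ unit_prices = b in the least-squares sense
unit_prices, *_ = np.linalg.lstsq(A, b, rcond=None)
recovered = dict(zip(item_names, np.round(unit_prices, 2)))
print(recovered)
```

The solution is only unique when the quantity matrix has full column rank, that is, when the orders used jointly cover every item with enough independent combinations.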


In [227]:
# Check shopping_cart, order_price, order_total is correct
# Extract the unfixed data
unfixed_data = clean_data[~clean_data["order_id"].isin(set(fix_log["order_id"]))].copy()
unfixed_data["cart_parsed"] = unfixed_data["order_id"].map(shopping_cart_check.set_index("order_id")["shopping_cart_parsed"])

# Calculate expected order_price with individual item price
def order_price_check(cart):
  '''
  Calculates the order_price based on individual item price
  '''
  return round(sum(price_map[item]*float(qty) for item, qty in cart), 2)

# Compute expected values based on individual item price
unfixed_data["expected_order_price"] = unfixed_data["cart_parsed"].apply(order_price_check)
unfixed_data["expected_order_total"] = (unfixed_data["expected_order_price"] * (1 - unfixed_data["coupon_discount"]/100) + unfixed_data["delivery_charges"]).round(2)

# Compute the difference for order_price and order_total
unfixed_data["price_diff"] = (unfixed_data["expected_order_price"] - unfixed_data["order_price"]).abs().round(2)
unfixed_data["total_diff"] = (unfixed_data["expected_order_total"] - unfixed_data["order_total"]).abs().round(2)

# Compare the expected order_price and order_total with the record and check if it matches
unfixed_data["price_match"] = unfixed_data["price_diff"] <= 0
unfixed_data["total_match"] = unfixed_data["total_diff"] <= 0

# Mismatched rows
mismatch_order = unfixed_data[(~unfixed_data["price_match"]) | (~unfixed_data["total_match"])].copy()

# Output
output = ["order_id", "shopping_cart", "order_price",
        "expected_order_price", "price_diff", "price_match",
        "order_total", "expected_order_total", "total_diff", "total_match"]

print("Rows mismatch order:", len(mismatch_order))
print(mismatch_order[output].to_string(index=False))

Rows mismatch order: 81
 order_id                                                                        shopping_cart  order_price  expected_order_price  price_diff  price_match  order_total  expected_order_total  total_diff  total_match
ORD012542                                            [('Thunder line', 1), ('Toshika 750', 2)]         6740               10820.0      4080.0        False      9809.79               9809.79        0.00         True
ORD136268                                               [('Alcon 10', 2), ('iAssist Line', 2)]         2600               22350.0     19750.0        False     21331.40              21331.40        0.00         True
ORD396615                              [('Olivia x460', 1), ('Lucent 330S', 1), ('pearTV', 2)]        14010               15075.0      1065.0        False     15151.80              15151.80        0.00         True
ORD038061                               [('Candle Inferno', 2), ('pearTV', 1), ('iStream', 1)]         8920         

**Shopping_cart, Order_price, Order_total Check - Justification:**

Using the fixed unit prices in price_map, the code recalculates the expected pre-discount total (expected_order_price) and applies the business rule order_total = order_price * (1 - coupon_discount/100) + delivery_charges to obtain the expected final total. Any differences indicate data anomalies such as incorrect item prices, misapplied discounts, or wrongly added delivery charges.
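
The business rule can be checked on its own with a short function. The sketch below uses invented numbers, and `expected_order_total` and `totals_match` are hypothetical helper names rather than functions from the notebook.

```python
def expected_order_total(order_price, coupon_discount, delivery_charges):
    """order_total = order_price * (1 - coupon_discount/100) + delivery_charges, rounded to cents."""
    return round(order_price * (1 - coupon_discount / 100) + delivery_charges, 2)

def totals_match(recorded_total, order_price, coupon_discount, delivery_charges):
    """True when the recorded total equals the recomputed one after rounding to cents."""
    expected = expected_order_total(order_price, coupon_discount, delivery_charges)
    return round(abs(recorded_total - expected), 2) == 0

# Hypothetical order: 1000 before discount, 10% coupon, 50 delivery
print(expected_order_total(1000, 10, 50))
print(totals_match(950.0, 1000, 10, 50))
print(totals_match(1000.0, 1000, 10, 50))  # discount was not applied
```

Note that the discount applies to order_price only; delivery_charges are added afterwards and are not discounted.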

In [228]:
# Classify the mismatched rows into shopping_cart error, order_price error, order_total error
# List out the item names
item_names = list(price_map.index)

# Function to check if row will be correct by swapping item
def swap_item(row):
  '''
  Check if order_price and order_total can be correct by swapping between each item in the cart
  '''
  cart = row["cart_parsed"]

  price_diff = float(row["order_price"]  - row["expected_order_price"])
  total_diff = float(row["order_total"]  - row["expected_order_total"])
  discount  = 1 - float(row["coupon_discount"])/100

  for old_name, qty in cart:
    old_p = float(price_map[old_name])
    for new_name in item_names:
      if new_name == old_name:
        continue
      new_p = float(price_map[new_name])
      delta_price = (new_p - old_p) * float(qty)
      if (round(abs(delta_price - price_diff), 2)<= 0 and
          round(abs(delta_price * discount - total_diff), 2) <= 0):
        return True, {"incorrect item": old_name, "correct item": new_name, "qty": float(qty)}
  return False, {}

# Classify the mismatched rows to their error type
# If the item can be swapped for the row to be correct -> shopping_cart_error
# If the order_price calculated with individual item price matches order_total -> order_price_error
# If the order_price calculated with individual item price matches the original order_price, but order_total is incorrect -> order_total_error
error_cat = []
error_details = []

for _, r in mismatch_order.iterrows():
    # shopping_cart error
    can_swap, swap_info = swap_item(r)
    if can_swap:
        error_cat.append("shopping_cart_error")
        error_details.append(swap_info)
        continue

    # order_price error
    incorrect_price = abs(float(r["expected_order_price"] - r["order_price"])) > 0
    correct_total = abs(float(r["expected_order_total"] - r["order_total"])) <= 0
    if incorrect_price and correct_total:
        error_cat.append("order_price_error")
        error_details.append({})
        continue

    # order_total error
    correct_price  = not incorrect_price
    incorrect_total = not correct_total
    if correct_price and incorrect_total:
        error_cat.append("order_total_error")
        error_details.append({})
        continue

    # unknown error
    error_cat.append("unknown_error")
    error_details.append({})

# Create new columns to store error_type and error_details
mismatch_order["error_cat"] = error_cat
mismatch_order["error_details"]  = error_details

# Summary for error type
error_summary = (mismatch_order["error_cat"]
                 .value_counts()
                 .rename_axis("error category")
                 .reset_index(name="rows"))

# Print out the order_id and error_details for shopping_cart error
print(error_summary.to_string(index=False))
print("Detected shopping_cart error:")
print(mismatch_order.loc[mismatch_order["error_cat"]=="shopping_cart_error", ["order_id","error_details"]].to_string(index=False))

     error category  rows
  order_price_error    27
shopping_cart_error    27
  order_total_error    27
Detected shopping_cart error:
 order_id                                                                     error_details
ORD038061   {'incorrect item': 'Candle Inferno', 'correct item': 'Lucent 330S', 'qty': 2.0}
ORD157781       {'incorrect item': 'iStream', 'correct item': 'Candle Inferno', 'qty': 2.0}
ORD135183     {'incorrect item': 'Lucent 330S', 'correct item': 'Thunder line', 'qty': 1.0}
ORD108147        {'incorrect item': 'pearTV', 'correct item': 'Candle Inferno', 'qty': 2.0}
ORD292260    {'incorrect item': 'Lucent 330S', 'correct item': 'Universe Note', 'qty': 2.0}
ORD286146   {'incorrect item': 'Toshika 750', 'correct item': 'Candle Inferno', 'qty': 1.0}
ORD258960         {'incorrect item': 'Olivia x460', 'correct item': 'Alcon 10', 'qty': 2.0}
ORD489080        {'incorrect item': 'iAssist Line', 'correct item': 'Alcon 10', 'qty': 1.0}
ORD063030  {'incorrect item': 'Thunder

**Shopping_cart, Order_price, Order_total Classification - Justification:**

Classification proceeds in order. First, a shopping_cart error is tested by checking whether swapping exactly one item in the parsed cart, with its quantity kept fixed, makes both order_price and order_total match after the coupon is applied. A successful one-item swap implies the cart contains a wrong item. If no swap explains the mismatch, arithmetic consistency is checked: when the recomputed expected_order_price (unit prices * quantities) disagrees with the recorded order_price but the final order_total still matches, the row is labeled order_price_error. When order_price matches but order_total does not, the row is labeled order_total_error. Anything else is labeled unknown_error for manual review.
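
The one-item swap test can be sketched in isolation as follows. This is a minimal illustration, not the notebook's `swap_item` function: the name `find_single_swap`, the `tol` tolerance, the rule that the replacement must not already be in the cart, and the catalogue values are all assumptions made for the example, and only the order_price condition is checked.

```python
from itertools import product

def find_single_swap(cart, price_map, recorded_price, tol=0.01):
    """Return (old_item, new_item) if replacing one item name, with its
    quantity kept, makes the cart price equal recorded_price; else None."""
    expected = sum(price_map[name] * qty for name, qty in cart)
    price_diff = recorded_price - expected
    in_cart = {name for name, _ in cart}
    for (old_name, qty), new_name in product(cart, price_map):
        if new_name in in_cart:
            continue  # assumed rule: the replacement is not already in the cart
        delta = (price_map[new_name] - price_map[old_name]) * qty
        if abs(delta - price_diff) <= tol:
            return old_name, new_name
    return None

# Hypothetical catalogue and order, for illustration only
prices = {"A": 100.0, "B": 250.0, "C": 400.0}
cart = [("A", 2), ("B", 1)]  # priced at 450 from the catalogue
swap = find_single_swap(cart, prices, recorded_price=1050.0)
print(swap)
```

Comparing against a small tolerance instead of exact zero is a design choice for the sketch; it avoids false mismatches caused by floating-point rounding.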

In [229]:
# Fix shopping_cart error
# Extract the unfixed data
unfixed_data = clean_data[~clean_data["order_id"].isin(fix_log["order_id"])].copy()
unfixed_index = unfixed_data.index

# Extract the rows with shopping_cart_error
cart_error = (mismatch_order["error_cat"] == "shopping_cart_error")
cart_error_rows = mismatch_order.loc[cart_error, ["order_id", "error_details"]].copy()

# Mapping of the order_id with invalid shopping_cart value to its error_details
cart_error_swap = dict(zip(cart_error_rows["order_id"], cart_error_rows["error_details"]))

# Function to swap item name
def swap_shopping_cart(cart_list, old_name, new_name, qty):
  new_list = []
  swapped = False
  for name, q in cart_list:
    if (not swapped) and (name == old_name) and (float(q) == float(qty)):
      new_list.append((new_name, q))
      swapped = True
    else:
      new_list.append((name, q))
  if not swapped:
    for i, (n, q) in enumerate(new_list):
      if n == old_name:
        new_list[i] = (new_name , q)
        swapped = True
        break
  return new_list, swapped

# Invalid shopping_cart (item name)
cart_fix = clean_data["order_id"].isin(cart_error_swap.keys())
cart_index = clean_data.index[cart_fix & clean_data.index.isin(unfixed_index)]

# Swap the invalid item name to the correct item name in shopping_cart
for i in cart_index:
  old_id = clean_data.at[i, "order_id"]
  info = cart_error_swap[old_id]
  old_name = info["incorrect item"]
  new_name = info["correct item"]
  qty = info["qty"]

  raw = clean_data.at[i, "shopping_cart"]

  cart_list = list(ast.literal_eval(raw))

  new_cart, did_swap = swap_shopping_cart(cart_list, old_name, new_name, qty)
  if did_swap:
    clean_data.at[i, "shopping_cart"] = repr(new_cart)

# Log the order_id for fixed rows
fixed_ids = clean_data.loc[cart_index, "order_id"].values
fix_log = pd.concat([fix_log,
                     pd.DataFrame({"order_id": fixed_ids,
                                   "error_type": "Incorrect shopping_cart"})],
                    ignore_index=True)

print(f"Fixed {len(fixed_ids)} rows.")
print(clean_data.loc[cart_index, ["order_id", "shopping_cart"]].to_string(index=False))

Fixed 27 rows.
 order_id                                                                       shopping_cart
ORD038061                                 [('Lucent 330S', 2), ('pearTV', 1), ('iStream', 1)]
ORD157781                                         [('Candle Inferno', 2), ('Olivia x460', 2)]
ORD135183           [('pearTV', 2), ('Olivia x460', 2), ('Alcon 10', 1), ('Thunder line', 1)]
ORD108147                                       [('Universe Note', 2), ('Candle Inferno', 2)]
ORD292260         [('Universe Note', 2), ('Toshika 750', 2), ('iStream', 1), ('Alcon 10', 2)]
ORD286146                                       [('Candle Inferno', 1), ('Universe Note', 2)]
ORD258960                              [('Alcon 10', 2), ('pearTV', 1), ('Universe Note', 2)]
ORD489080                             [('Alcon 10', 1), ('Candle Inferno', 2), ('pearTV', 1)]
ORD063030                                            [('Candle Inferno', 2), ('Alcon 10', 2)]
ORD035025                           [('Thunde

**Shopping_cart Fix - Justification:**

A mapping order_id → error_details is built from the classifier output. For each affected order, the stored cart is parsed, a single item name is swapped in place with its quantity unchanged, and the corrected cart is written back. Each of these orders is appended to fix_log as "Incorrect shopping_cart".
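
The parse, swap, and write-back cycle can be sketched on a single cart string. The helper name `swap_first_match` is hypothetical, and the input string is an assumed example modelled on the swap logged for ORD038061.

```python
import ast

def swap_first_match(cart_repr, old_name, new_name):
    """Parse a stringified cart, rename the first matching item, re-serialise."""
    cart = list(ast.literal_eval(cart_repr))
    for i, (name, qty) in enumerate(cart):
        if name == old_name:
            cart[i] = (new_name, qty)  # quantity is kept unchanged
            return repr(cart), True
    return cart_repr, False  # nothing matched, so the cart is left as-is

# Illustrative cart string (assumed input)
raw = "[('Candle Inferno', 2), ('pearTV', 1), ('iStream', 1)]"
fixed, changed = swap_first_match(raw, "Candle Inferno", "Lucent 330S")
print(fixed, changed)
```

Returning the `changed` flag alongside the cart would let a caller log only the rows that were actually modified.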

In [230]:
# Fix order_price
# Extract the unfixed data
unfixed_data = clean_data[~clean_data["order_id"].isin(fix_log["order_id"])].copy()

# Extract the rows with order_price_error
price_error = (mismatch_order["error_cat"] == "order_price_error")
price_error_rows = set(mismatch_order.loc[price_error, "order_id"])

# Extract the correct order_price
price_error_unfixed = mismatch_order.loc[mismatch_order["order_id"].isin(price_error_rows), ["order_id", "expected_order_price"]]

# Mapping of the order_id with the correct order_price value
expected_price_map = dict(zip(price_error_unfixed["order_id"], price_error_unfixed["expected_order_price"]))

# Replace the invalid rows with the correct values
price_fix = clean_data["order_id"].isin(expected_price_map.keys())
price_index = clean_data.index[price_fix]
clean_data.loc[price_index, "order_price"] = (clean_data.loc[price_index, "order_id"].map(expected_price_map))

# Log the order_id for fixed rows
fixed_ids = clean_data.loc[price_index, "order_id"].values
fix_log = pd.concat([fix_log,
                     pd.DataFrame({"order_id": fixed_ids,
                                   "error_type": "Incorrect order_price"})],
                    ignore_index=True)

print(f"Fixed {len(fixed_ids)} rows.")
print(clean_data.loc[price_index, ["order_id", "order_price"]].to_string(index=False))

Fixed 27 rows.
 order_id  order_price
ORD012542        10820
ORD136268        22350
ORD396615        15075
ORD276861        22380
ORD097415        22010
ORD282385        20045
ORD368942         7130
ORD181929        21520
ORD377946        10210
ORD431838        11700
ORD189464        10315
ORD221277        19275
ORD294135         9500
ORD007117        10165
ORD354953         2330
ORD455667        20470
ORD203938        13870
ORD267973        25710
ORD486573        24225
ORD411297         5550
ORD470970        11400
ORD363756        15685
ORD113541         7970
ORD291184        23655
ORD199032        15970
ORD208094         7810
ORD023631        11990


**Order_price Fix - Justification:**

This step fixes rows labeled order_price_error by overwriting the recorded order_price with the recomputed expected_order_price (unit prices * quantities). An order_id → expected_order_price map is built from the mismatch table, and only those rows are updated. Because this error category requires that the recorded order_total already agrees with the recomputed pre-discount price, order_total is left untouched. Each fix is appended to fix_log as "Incorrect order_price".

In [231]:
# Fix order_total
# Extract the unfixed data
unfixed_data = clean_data[~clean_data["order_id"].isin(fix_log["order_id"])].copy()

# Extract the rows with order_total_error
total_error = (mismatch_order["error_cat"] == "order_total_error")
total_error_rows = set(mismatch_order.loc[total_error, "order_id"])

# Extract the correct order_total
total_error_unfixed = mismatch_order.loc[mismatch_order["order_id"].isin(total_error_rows), ["order_id", "expected_order_total"]]

# Mapping of the order_id with the correct order_total value
expected_total_map = dict(zip(total_error_unfixed["order_id"], total_error_unfixed["expected_order_total"]))

# Replace the invalid rows with the correct values
total_fix = clean_data["order_id"].isin(expected_total_map.keys())
total_index = clean_data.index[total_fix]
clean_data.loc[total_index, "order_total"] = (clean_data.loc[total_index, "order_id"].map(expected_total_map))

# Log the order_id for fixed rows
fixed_ids = clean_data.loc[total_index, "order_id"].values
fix_log = pd.concat([fix_log,
                     pd.DataFrame({"order_id": fixed_ids,
                                   "error_type": "Incorrect order_total"})],
                    ignore_index=True)

print(f"Fixed {len(fixed_ids)} rows.")
print(clean_data.loc[total_index, ["order_id", "order_total"]].to_string(index=False))

Fixed 27 rows.
 order_id  order_total
ORD159650      3028.94
ORD105180     11957.17
ORD390068     11024.52
ORD098805     13718.93
ORD371831      7534.92
ORD115991      9139.41
ORD460642      3745.31
ORD293767     11196.19
ORD330007     22399.00
ORD319958     22753.58
ORD069151     24451.51
ORD185952      6195.12
ORD052541     23890.43
ORD372162     16792.73
ORD200848     20273.34
ORD405673      8228.93
ORD455855     31816.16
ORD022783     13428.47
ORD153936     12820.31
ORD421787     23579.36
ORD034671      5499.14
ORD421351      2243.80
ORD187672     15613.34
ORD153201     14470.76
ORD496180     14523.18
ORD230545     12202.81
ORD089659      2051.85


**Order_total Fix - Justification:**

This step corrects order_total_error by recalculating the final total using the business rule (apply coupon to the pre-discount order_price, then add delivery_charges) and overwriting the recorded order_total with the computed value. All corrected order_ids are appended to fix_log as "Incorrect order_total".
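
As a worked example of the business rule, the sketch below treats coupon_discount as a percentage, in line with the imputation code later in this notebook. The function name, the rounding to two decimals, and the input values are assumptions made for illustration.

```python
def expected_order_total(order_price, coupon_discount, delivery_charges):
    """Apply a percentage coupon to the pre-discount price, then add delivery."""
    discounted = order_price * (100 - coupon_discount) / 100
    return round(discounted + delivery_charges, 2)

# Hypothetical order: 10000 before discount, 10% coupon, 75.5 delivery
total = expected_order_total(order_price=10000, coupon_discount=10, delivery_charges=75.5)
print(total)
```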

In [232]:
# Fix is_happy_customer
# Extract the unfixed data
unfixed_data = clean_data[~clean_data["order_id"].isin(fix_log["order_id"])].copy()
unfixed_index = unfixed_data.index

# Invalid is_happy_customer
incorrect_happy_rows = is_happy_check.loc[~is_happy_check["is_correct"], ["order_id", "happy_prediction"]]
incorrect_happy_id = incorrect_happy_rows["order_id"]
happy_index = clean_data.index[clean_data["order_id"].isin(incorrect_happy_id) & clean_data.index.isin(unfixed_index)]

# Replace the invalid rows with the correct values
happy_target = clean_data.loc[happy_index, "order_id"]
happy_fix = (is_happy_check.set_index("order_id").reindex(happy_target)["happy_prediction"].values)
clean_data.loc[happy_index, "is_happy_customer"] = happy_fix

# Log the order_id for fixed rows
fixed_ids = clean_data.loc[happy_index, "order_id"].values
fix_log = pd.concat([fix_log,
                     pd.DataFrame({"order_id": fixed_ids,
                                   "error_type": "Incorrect is_happy_customer"})],
                    ignore_index=True)

print(f"Fixed {len(fixed_ids)} rows.")
print(clean_data.loc[happy_index, ["order_id", "is_happy_customer"]].to_string(index=False))

Fixed 27 rows.
 order_id  is_happy_customer
ORD216249               True
ORD412492               True
ORD457652               True
ORD066764               True
ORD352239               True
ORD268941               True
ORD478343               True
ORD064373               True
ORD370767               True
ORD252102              False
ORD330702               True
ORD408565               True
ORD480194               True
ORD494528               True
ORD083198              False
ORD241933              False
ORD246197               True
ORD208028               True
ORD251878               True
ORD405488               True
ORD115461               True
ORD363854               True
ORD435481               True
ORD256544               True
ORD319183               True
ORD102139               True
ORD366292              False


**Is_happy_customer Fix - Justification:**

For every not-yet-fixed row whose recorded label disagrees with the model, the model's happy_prediction is looked up by order_id and written over is_happy_customer. Each fix is logged in fix_log as "Incorrect is_happy_customer".

In [233]:
# Check if is_expedited_delivery is correct with linear model
# Extract the unfixed data
unfixed_data = clean_data[~clean_data["order_id"].isin(fix_log["order_id"])].copy()

# Extract cleaned data as train_data
train_data = clean_data[clean_data["order_id"].isin(fix_log["order_id"])].copy()

features = ["distance_to_nearest_warehouse", "is_expedited_delivery", "is_happy_customer"]
target = "delivery_charges"

# Dictionary of models and r2 score by season
models = {}
r2_scores = {}

# Train model based on season
for season in train_data["season"].unique():
  season_data = train_data[train_data["season"] == season].copy()

  x = season_data[features]
  y = season_data[target]

  model = LinearRegression()
  model.fit(x,y)

  r2 = model.score(x, y)

  models[season] = model
  r2_scores[season] = r2

  print(f"R2 score for {season} model trained = {r2:.5f}")

# Function to apply linear model for prediction
def predict_delivery(row):
    model = models[row["season"]]
    x_row = pd.DataFrame(
        [[row["distance_to_nearest_warehouse"],
          int(row["is_expedited_delivery"]),
          int(row["is_happy_customer"])]],
        columns = features)
    return model.predict(x_row)[0]

# Predict delivery_charges with the seasonal linear model and compute the residual
unfixed_data["predicted_delivery_charges"] = unfixed_data.apply(predict_delivery, axis=1)
unfixed_data["delivery_error"] = unfixed_data["delivery_charges"] - unfixed_data["predicted_delivery_charges"]

# Calculate the tolerance for delivery_error using Median Absolute Deviation (MAD)
tolerance_dict = {}

for season, model in models.items():
    season_data = train_data[train_data["season"] == season]
    preds = model.predict(season_data[features])
    residuals = season_data[target] - preds
    resid_mad = np.median(np.abs(residuals - np.median(residuals)))
    tolerance = 3 * 1.4826 * resid_mad
    tolerance_dict[season] = tolerance
    print(f"Tolerance for {season} model: ±{tolerance:.3f}")

# Check for invalid is_expedited_delivery based on the calculated tolerance
def check_expedited(row):
    tol = tolerance_dict[row["season"]]
    error = row["delivery_error"]

    if error > tol:
        return True
    elif error < -tol:
        return False
    else:
        return row["is_expedited_delivery"]

unfixed_data["correct_expedited"] = unfixed_data.apply(check_expedited, axis=1)

# Invalid is_expedited_delivery
invalid_expedited = unfixed_data[unfixed_data["correct_expedited"] != unfixed_data["is_expedited_delivery"]]
print("Number of invalid is_expedited_delivery:", len(invalid_expedited))
print(invalid_expedited[["order_id", "season", "delivery_error", "is_expedited_delivery", "correct_expedited"]].to_string(index=False))

R2 score for Winter model trained = 0.98377
R2 score for Summer model trained = 0.99552
R2 score for Autumn model trained = 0.98718
R2 score for Spring model trained = 0.99378
Tolerance for Winter model: ±3.143
Tolerance for Summer model: ±2.575
Tolerance for Autumn model: ±2.593
Tolerance for Spring model: ±3.536
Number of invalid is_expedited_delivery: 54
 order_id season  delivery_error  is_expedited_delivery  correct_expedited
ORD316891 Winter       13.560236                  False               True
ORD092985 Summer       21.000244                  False               True
ORD337660 Autumn      -14.224409                   True              False
ORD398696 Summer       21.087023                  False               True
ORD232419 Summer       19.701900                  False               True
ORD013805 Summer       20.113138                  False               True
ORD333779 Autumn      -14.797708                   True              False
ORD403640 Spring      -26.648642        

**Is_expedited_delivery Check - Justification:**

To validate is_expedited_delivery, one linear regression model per season is trained on the already-fixed rows to predict delivery_charges from distance_to_nearest_warehouse, is_expedited_delivery, and is_happy_customer. For each unfixed row, the residual (recorded minus predicted charge) is compared against a robust MAD-based tolerance of 3 * 1.4826 * MAD of that season's training residuals. A residual above the tolerance means the charge is higher than the recorded flag explains, so the order is treated as expedited. A residual below the negative tolerance means the charge is lower than expected, so the order is treated as not expedited. Otherwise the original flag is retained.
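
The tolerance and the decision rule can be sketched without the regression itself. The function names and the residual values below are made up for illustration; the factor 1.4826 scales the MAD so that it is comparable to a standard deviation for normally distributed residuals.

```python
from statistics import median

def mad_tolerance(residuals, k=3.0):
    """Return k * 1.4826 * MAD of the residuals."""
    centre = median(residuals)
    mad = median([abs(r - centre) for r in residuals])
    return k * 1.4826 * mad

def corrected_flag(residual, tolerance, recorded_flag):
    """Infer the expedited flag from the sign of an out-of-tolerance residual."""
    if residual > tolerance:
        return True   # charged more than the recorded flag explains
    if residual < -tolerance:
        return False  # charged less than the recorded flag explains
    return recorded_flag

train_residuals = [-2.0, -1.0, 0.0, 1.0, 2.0]  # made-up training residuals
tol = mad_tolerance(train_residuals)
print(tol, corrected_flag(13.56, tol, recorded_flag=False))
```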

In [234]:
# Fix is_expedited_delivery
# Extract the unfixed data
unfixed_data = clean_data[~clean_data["order_id"].isin(fix_log["order_id"])].copy()
unfixed_index = unfixed_data.index

# Mapping of the order_id with the correct is_expedited_delivery value

expedited_map = (invalid_expedited
                 .loc[:, ["order_id", "correct_expedited"]]
                 .set_index("order_id")["correct_expedited"])

# Invalid is_expedited_delivery
incorrect_expedited = unfixed_data["order_id"].isin(expedited_map.index)
expedited_index = unfixed_data.index[incorrect_expedited]

# Replace the invalid rows with the correct values
clean_data.loc[expedited_index, "is_expedited_delivery"] = (clean_data.loc[expedited_index, "order_id"].map(expedited_map).astype(bool).values)

# Log the order_id for fixed_rows
fixed_ids = clean_data.loc[expedited_index, "order_id"].values
fix_log = pd.concat([fix_log,
                     pd.DataFrame({"order_id": fixed_ids,
                                   "error_type": "Incorrect is_expedited_delivery"})],
                    ignore_index=True)

print(f"Fixed {len(fixed_ids)} rows.")
print(clean_data.loc[expedited_index, ["order_id", "is_expedited_delivery"]].to_string(index=False))

Fixed 54 rows.
 order_id  is_expedited_delivery
ORD316891                   True
ORD092985                   True
ORD337660                  False
ORD398696                   True
ORD232419                   True
ORD013805                   True
ORD333779                  False
ORD403640                  False
ORD423616                  False
ORD205988                   True
ORD068755                   True
ORD041144                  False
ORD161558                   True
ORD331515                   True
ORD482432                  False
ORD227689                  False
ORD387260                   True
ORD474989                   True
ORD303124                  False
ORD176250                   True
ORD377178                   True
ORD400642                  False
ORD231763                   True
ORD175829                   True
ORD066841                  False
ORD397774                  False
ORD172403                   True
ORD451073                   True
ORD297470                  F

**Is_expedited_delivery Fix - Justification:**

This step corrects is_expedited_delivery for the rows that the residual check flagged as inconsistent. Each affected order_id is mapped to its correct_expedited value, the flag is overwritten, and the fix is appended to fix_log as "Incorrect is_expedited_delivery".

In [235]:
# Summary of fixed rows
print("Summary of dirty_data.csv:")
print("- Number of fixed rows:", len(fix_log))

duplicate_order_id = fix_log["order_id"].duplicated().any()
print("- Only one error per row fixed:", not duplicate_order_id)

print("- Types of error in dirty_data.csv:")
error_count = fix_log["error_type"].value_counts()
for error_type, count in error_count.items():
  print(f"  * {error_type}: {count}")

Summary of dirty_data.csv:
- Number of fixed rows: 293
- Only one error per row fixed: True
- Types of error in dirty_data.csv:
  * Incorrect is_expedited_delivery: 54
  * Missing date: 27
  * Incorrect season: 27
  * Swapped coordinates: 27
  * Incorrect shopping_cart: 27
  * Incorrect order_price: 27
  * Incorrect order_total: 27
  * Incorrect is_happy_customer: 27
  * Incorrect distance_to_nearest_warehouse: 23
  * Incorrect nearest_warehouse: 20
  * Inconsistent nearest_warehouse naming: 7


In [236]:
# Export to csv
file_name = "Group035_dirty_data_solution.csv"
output_path = os.path.join(base, file_name)
clean_data.to_csv(output_path, index=False, na_rep="NaN")

### 1.3 Outlier Data

In [237]:
# Detect and remove outlier data by fitting linear model and computing robust z-score (Median + MAD) from residuals
def remove_delivery_charge_outliers(df):
    """
    Detect and remove delivery charge outliers using a linear model with
    robust z-score residual analysis. Returns (filtered_df, outliers_df).
    """
    df = df.copy()  # work on a copy to avoid side effects

    # --- Step 1: Prepare features ---
    # Take an explicit copy so the integer casts below do not raise SettingWithCopyWarning
    X = df[['distance_to_nearest_warehouse', 'is_expedited_delivery', 'is_happy_customer', 'season']].copy()
    y = df['delivery_charges']

    # Convert boolean to integer
    X['is_expedited_delivery'] = X['is_expedited_delivery'].astype(int)
    X['is_happy_customer'] = X['is_happy_customer'].astype(int)

    # One-hot encode season (drop first to avoid multicollinearity)
    encoder = OneHotEncoder(drop='first', sparse_output=False)
    season_encoded = encoder.fit_transform(X[['season']])
    season_encoded_df = pd.DataFrame(season_encoded, columns=encoder.get_feature_names_out(['season']))

    # Combine base features and season dummies
    X_base = X.drop(columns=['season']).reset_index(drop=True)
    X_encoded = pd.concat([X_base, season_encoded_df], axis=1)

    # --- Step 2: Add interaction terms between season and numeric predictors ---
    for season_col in season_encoded_df.columns:
        for feature in ['distance_to_nearest_warehouse', 'is_expedited_delivery', 'is_happy_customer']:
            interaction_name = f"{feature}_x_{season_col}"
            X_encoded[interaction_name] = X_base[feature] * season_encoded_df[season_col]

    # --- Step 3: Fit linear model ---
    model = LinearRegression()
    model.fit(X_encoded, y)

    # --- Step 4: Compute predictions and residuals (temporary only) ---
    predicted = model.predict(X_encoded)
    residuals = df['delivery_charges'] - predicted

    # --- Step 5: Robust Z-score (Median + MAD) method ---
    median_resid = np.median(residuals)
    mad_resid = np.median(np.abs(residuals - median_resid))
    robust_z = 0.6745 * (residuals - median_resid) / mad_resid

    outlier_mask = np.abs(robust_z) > 3.5
    outliers_df = df[outlier_mask].copy()
    filtered_df = df[~outlier_mask].reset_index(drop=True)

    print(f"Removed {outliers_df.shape[0]} outliers (robust-based), remaining: {filtered_df.shape[0]}")

    return filtered_df, outliers_df

filtered_data, outliers = remove_delivery_charge_outliers(outlier_data)

print("\nDetected Outliers:")
print(outliers[['order_id', 'delivery_charges']].head())

# Export to csv
file_name = "Group035_outlier_data_solution.csv"
output_path = os.path.join(base, file_name)
filtered_data.to_csv(output_path, index=False, na_rep="NaN")

Removed 31 outliers (robust-based), remaining: 469

Detected Outliers:
      order_id  delivery_charges
17   ORD089831           144.450
28   ORD322216            71.475
93   ORD276933            39.570
98   ORD142753            39.010
104  ORD317663            73.815


**Outlier data - Methodology**  
To detect and remove outliers, we fit a single linear model that predicts delivery_charges from distance_to_nearest_warehouse, is_expedited_delivery, is_happy_customer, and season, including season interaction terms. We then compute the residuals (actual minus predicted delivery charge) and convert them to robust z-scores based on the median and the median absolute deviation (MAD). Rows whose absolute robust z-score exceeds 3.5 are treated as outliers, removed from the data, and displayed for identification.
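
The robust z-score step can be sketched on a plain list of residuals. The function name, the guard against a zero MAD, and the residual values are assumptions added for the example; the 0.6745 factor and the 3.5 cut-off are the ones used in the cell above.

```python
from statistics import median

def robust_z_scores(values):
    """Return 0.6745 * (x - median) / MAD for each value."""
    centre = median(values)
    mad = median([abs(v - centre) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; robust z-scores are undefined")
    return [0.6745 * (v - centre) / mad for v in values]

residuals = [-2.0, -1.0, 0.0, 1.0, 2.0, 30.0]  # made-up residuals
scores = robust_z_scores(residuals)
flagged = [r for r, z in zip(residuals, scores) if abs(z) > 3.5]
print(flagged)
```

Because the median and MAD are barely affected by the extreme value, the large residual stands out clearly, which is the reason for preferring them over the mean and standard deviation here.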

### 1.4 Missing Data

In [238]:
# Show how much missing data there is
print(missing_data.isna().sum())

order_id                          0
customer_id                       0
date                              0
nearest_warehouse                55
shopping_cart                     0
order_price                      15
delivery_charges                 40
customer_lat                      0
customer_long                     0
coupon_discount                   0
order_total                      15
season                            0
is_expedited_delivery             0
distance_to_nearest_warehouse    31
latest_customer_review            0
is_happy_customer                40
dtype: int64


In [239]:
# Impute missing data by logical rule-based imputation
# --- Imputation logic ---
# Each missing value is computed from the columns it depends on.
# This is rule-based imputation; it is exact provided the relationships and the other columns are correct.
def impute_row(row, warehouses, price_map):
    # Parse shopping_cart into list of tuples locally (in-place, no extra column)
    try:
        parsed_cart = ast.literal_eval(row['shopping_cart']) if pd.notna(row['shopping_cart']) else []
    except Exception:
        parsed_cart = []

    # --- Impute nearest_warehouse (and its distance) from the customer's latitude and longitude ---
    if pd.isna(row['nearest_warehouse']) and pd.notna(row['customer_lat']) and pd.notna(row['customer_long']):
        distances = warehouses.apply(
            lambda wh: haversine_dist(row['customer_lat'], row['customer_long'], wh['lat'], wh['lon']), axis=1
        )
        nearest_idx = distances.idxmin()
        row['nearest_warehouse'] = warehouses.loc[nearest_idx, 'names']
        row['distance_to_nearest_warehouse'] = distances.min()

    # --- Impute order_price from order_total, delivery_charges and coupon_discount, or from the shopping cart if needed ---
    if pd.isna(row['order_price']):
        if pd.notna(row['order_total']) and pd.notna(row['delivery_charges']) and pd.notna(row['coupon_discount']):
            denom = (100 - row['coupon_discount']) / 100
            if denom != 0:
                row['order_price'] = (row['order_total'] - row['delivery_charges']) / denom
        elif pd.isna(row['order_total']) and parsed_cart:  # fallback to catalog
            items, qtys = zip(*parsed_cart)
            prices = price_map.reindex(items).fillna(0).values
            row['order_price'] = np.dot(prices, qtys)

    # --- Impute delivery_charges from order total, order price and coupon discount ---
    if pd.isna(row['delivery_charges']):
        if pd.notna(row['order_total']) and pd.notna(row['order_price']) and pd.notna(row['coupon_discount']):
            denom = (100 - row['coupon_discount']) / 100
            row['delivery_charges'] = row['order_total'] - row['order_price'] * denom

    # --- Impute order_total ---
    if pd.isna(row['order_total']):
        if pd.notna(row['order_price']) and pd.notna(row['coupon_discount']) and pd.notna(row['delivery_charges']):
            denom = (100 - row['coupon_discount']) / 100
            row['order_total'] = row['order_price'] * denom + row['delivery_charges']

    # --- Impute distance_to_nearest_warehouse from customer lat long---
    if pd.isna(row['distance_to_nearest_warehouse']):
        if pd.notna(row['nearest_warehouse']) and pd.notna(row['customer_lat']) and pd.notna(row['customer_long']):
            wh = warehouses.loc[warehouses['names'] == row['nearest_warehouse']].iloc[0]
            row['distance_to_nearest_warehouse'] = haversine_dist(row['customer_lat'], row['customer_long'], wh['lat'], wh['lon'])

    # --- Impute is_happy_customer from sentiment---
    if pd.isna(row['is_happy_customer']) and pd.notna(row['latest_customer_review']):
        sentiment = sia.polarity_scores(str(row['latest_customer_review']))
        row['is_happy_customer'] = sentiment['compound'] >= 0.05

    return row

# Apply
missing_data_imputed = missing_data.apply(lambda r: impute_row(r, warehouse_data, price_map), axis=1)

# Export to csv
file_name = "Group035_missing_data_solution.csv"
output_path = os.path.join(base, file_name)
missing_data_imputed.to_csv(output_path, index=False, na_rep="NaN")

In [240]:
# Verify that imputation caught all missing data
if sum(missing_data_imputed.isna().sum()) == 0:
    print("No missing data remaining\n")
else:
    print(missing_data_imputed.isna().sum())
    rows_with_nulls = missing_data_imputed[missing_data_imputed.isnull().any(axis=1)]
    print(rows_with_nulls)

No missing data remaining



**Missing Data Imputation - Methodology**  
Only a few columns contain missing values, and each of them can be derived from other columns that are stated to be error-free. We therefore impute each missing value by rule from the columns it depends on. For example, order_total is the order_price after the coupon discount is applied, plus delivery_charges.

Given the assumption that the other columns are error-free, this rule-based imputation is deterministic: it reproduces the value implied by the remaining columns instead of estimating it.
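
The order_total relationship can be inverted to recover whichever of the three monetary fields is missing. The sketch below is a simplified stand-alone version, not the `impute_row` function above: the function name and sample values are hypothetical, and it assumes at most one of the three fields is NaN.

```python
import math

def impute_order_fields(order_price, coupon_discount, delivery_charges, order_total):
    """Fill the single NaN field using
    order_total = order_price * (100 - coupon_discount) / 100 + delivery_charges."""
    factor = (100 - coupon_discount) / 100
    if math.isnan(order_total):
        order_total = order_price * factor + delivery_charges
    elif math.isnan(delivery_charges):
        delivery_charges = order_total - order_price * factor
    elif math.isnan(order_price) and factor != 0:
        order_price = (order_total - delivery_charges) / factor
    return order_price, delivery_charges, order_total

nan = float("nan")
# Hypothetical order: price 1000, 10% coupon, delivery 50, total 950
print(impute_order_fields(1000.0, 10, 50.0, nan))
print(impute_order_fields(nan, 10, 50.0, 950.0))
print(impute_order_fields(1000.0, 10, nan, 950.0))
```

A 100% coupon makes the factor zero, in which case order_price cannot be recovered from the total and is left missing.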