## **Notebook Content**

> Project: Customer Prioritization Under Constraints

File: 01_data_cleaning.ipynb <br>
Author: Bryan Melvida

Purpose:
- Ingest raw transactional data
- Assess data quality, consistency, and anomalies
- Apply targeted data corrections prior to analysis

Input: [`Online Retail.xlsx`](../data/raw/Online%20Retail.xlsx) <br>
Related Documentation: [`customer_raw_dictionary.md`](../docs/raw/customer_raw_dictionary.md)

Output: <br>
Related Documentation: [`preprocessing_log.md`](../docs/preprocessed/preprocessing_log.md)

---

In [None]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
from pathlib import Path

import sys
sys.path.append('../')
import src.assessment_views as av

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

**Load Dataset**

In [None]:
# Prevent reading xlsx if parquet file format exists
PARQUET_PATH = Path("../data/raw/customer_raw.parquet")
XLSX_PATH = Path("../data/raw/Online Retail.xlsx")

if PARQUET_PATH.exists() and PARQUET_PATH.stat().st_size > 0:
    df = pd.read_parquet(PARQUET_PATH)
    loaded_from = "parquet"
else:
    df = pd.read_excel(XLSX_PATH)
    loaded_from = "xlsx"

print(f"Dataset loaded from: {loaded_from}")

**Export to Parquet**

In [None]:
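if loaded_from == "xlsx":
    # Cast to str before writing: mixed-type object columns can fail to serialize with pyarrow.
    # Note: astype(str) can turn missing values into the literal string "nan".
    df["InvoiceNo"] = df["InvoiceNo"].astype(str)
    df["StockCode"] = df["StockCode"].astype(str)
    df["Description"] = df["Description"].astype(str)

    df.to_parquet(PARQUET_PATH, engine="pyarrow")

# Reload from Parquet so later cells always work from the cached file
df = pd.read_parquet(PARQUET_PATH, engine="pyarrow")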
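
The string casts above make mixed-type columns writable, but they can also replace missing values with the text "nan" (the exact behavior may depend on the pandas version). A minimal sketch of one way to avoid that, assuming a hypothetical helper `to_str_keep_missing` that is not part of this project and a toy column in place of the real data:

```python
import numpy as np
import pandas as pd


def to_str_keep_missing(series: pd.Series) -> pd.Series:
    """Cast non-missing values to str and leave missing values untouched.

    Hypothetical helper for illustration only; not part of `src`.
    """
    as_text = series.astype(str)
    # Keep the original (missing) entry where the source was missing,
    # otherwise take the string-cast value.
    return series.where(series.isna(), as_text)


# Toy column mixing integers, strings, and a missing value
raw = pd.Series([536365, "C536379", np.nan], dtype=object)
cleaned = to_str_keep_missing(raw)
print(cleaned.tolist())
```

With this approach the null count of `Description` would continue to reflect the missing entries after the cast.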

---

## **Data Assessment**
Evaluate overall data readiness to surface structural issues and risks before analysis proceeds

**Data Profiling & Structure Audit**

In [111]:
# Convenience dataset summary for visual inspection
av.df_summary(df)

Total Rows: 541,909
Total Columns: 8
Total Null Values: 135,080

                  data type  # unique  # non-null  # null     % null
InvoiceNo            object     25900      541909       0   0.000000
StockCode            object      4070      541909       0   0.000000
Description          object      4224      541909       0   0.000000
Quantity              int64       722      541909       0   0.000000
InvoiceDate  datetime64[ns]     23260      541909       0   0.000000
UnitPrice           float64      1630      541909       0   0.000000
CustomerID          float64      4372      406829  135080  24.926694
Country              object        38      541909       0   0.000000
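
The implementation of `av.df_summary` is not shown in this notebook. As a rough sketch of how a comparable per-column summary could be assembled, the hypothetical function `summarize_columns` below (an assumption, not the project's code) reproduces the same columns on a toy frame:

```python
import numpy as np
import pandas as pd


def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with dtype, cardinality, and null counts."""
    n_rows = len(frame)
    summary = pd.DataFrame({
        "data type": frame.dtypes.astype(str),
        "# unique": frame.nunique(),        # missing values are not counted
        "# non-null": frame.notna().sum(),
        "# null": frame.isna().sum(),
    })
    # Guard against division by zero on an empty frame
    summary["% null"] = summary["# null"] / n_rows * 100 if n_rows else 0.0
    return summary


toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "C536379"],
    "CustomerID": [17850.0, np.nan, 14527.0],
})
summary = summarize_columns(toy)
print(summary)
```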


**Continuous Variable Validation**

In [110]:
df.describe(include=['number']).T

               count         mean          std        min       25%       50%       75%      max
Quantity    541909.0      9.55225   218.081158  -80995.00       1.0       3.0      10.0  80995.0
UnitPrice   541909.0     4.611114    96.759853  -11062.06      1.25      2.08      4.13  38970.0
CustomerID  406829.0  15287.69057  1713.600303   12346.00   13953.0   15152.0   16791.0  18287.0
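
The negative minimums for `Quantity` and `UnitPrice` raise the question of how many rows sit at or below zero. A small sketch on toy data, assuming zero is the relevant cutoff (the notebook does not state one):

```python
import pandas as pd

toy = pd.DataFrame({
    "Quantity": [6, -1, 12, 0],
    "UnitPrice": [2.55, 4.25, -11062.06, 0.0],
})

# Count rows at or below zero for each numeric field
non_positive = (toy[["Quantity", "UnitPrice"]] <= 0).sum()
print(non_positive)
```

Running the same comparison on `df` would show whether the extremes are isolated rows or a larger group.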


**Duplicate Check**

In [109]:
duplicates = df.loc[df.duplicated(keep=False), :]

print(f'Total Duplicated Records: {len(duplicates):,}')
print(f'Unique InvoiceNo Duplicated Records: {len(set(duplicates["InvoiceNo"])):,}')
print(f'Unique CustomerID Duplicated Records: {len(set(duplicates["CustomerID"])):,}')

Total Duplicated Records: 10,147
Unique InvoiceNo Duplicated Records: 1,933
Unique CustomerID Duplicated Records: 1,045
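
Because the check uses `keep=False`, every member of a duplicate group is counted, including the first occurrence. The toy sketch below contrasts that with `keep="first"`, which counts only the rows a de-duplication would drop. It also uses `nunique()` for the customer count, since `set()` over a float column can count each missing value separately. The data here is invented for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "InvoiceNo": ["536409", "536409", "536412", "536412", "536415"],
    "StockCode": ["21866", "21866", "22327", "22327", "22900"],
    "CustomerID": [17850.0, 17850.0, np.nan, np.nan, 13047.0],
})

all_copies = int(toy.duplicated(keep=False).sum())      # every row in a duplicate group
extra_copies = int(toy.duplicated(keep="first").sum())  # rows a de-duplication would drop

dupes = toy.loc[toy.duplicated(keep=False)]
n_customers = dupes["CustomerID"].nunique()  # missing IDs are not counted

print(all_copies, extra_copies, n_customers)
```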


**Cancelled Invoices Check**

Reference: <br>
`InvoiceNo` data dictionary states that codes starting with letter "C" indicate cancelled transactions.

In [108]:
cancelled_invoices = df.loc[df['InvoiceNo'].str.startswith('C'), 'InvoiceNo']

n_unique_cancelled_invoice = cancelled_invoices.nunique()
n_unique_invoice = df['InvoiceNo'].nunique()
cancelled_invoice_pct = n_unique_cancelled_invoice / n_unique_invoice * 100

print(f'Cancelled Invoice Count: {n_unique_cancelled_invoice:,}',
      f'out of {n_unique_invoice:,}',
      f'or {cancelled_invoice_pct:.2f}%')


Cancelled Invoice Count: 3,836 out of 25,900 or 14.81%
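
One possible follow-up is to check whether the "C" prefix and negative quantities agree row by row. The sketch below uses invented toy rows and assumes, without confirmation from the data dictionary, that cancellations should carry negative `Quantity`:

```python
import pandas as pd

toy = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "C536383", "536390"],
    "Quantity": [6, -1, -24, -3],
})

# na=False treats a missing InvoiceNo as "not cancelled"
is_cancelled = toy["InvoiceNo"].str.startswith("C", na=False)
is_negative = toy["Quantity"] < 0

# Rows where the two signals disagree deserve a closer look
mismatch = toy.loc[is_cancelled != is_negative]
print(mismatch)
```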


**Field Relationship Validation**

Reference:<br>
The data dictionary states that `StockCode` and `Description` are expected to have a one-to-one mapping relationship.

In [191]:
unique_stockcode_description = df.groupby('StockCode')['Description'].nunique()
inconsistent_stockcode = unique_stockcode_description[unique_stockcode_description > 1].index

print(f'Total Inconsistent StockCodes: {len(inconsistent_stockcode)}')
print(f'Mapped `Descriptions` from inconsistent StockCode: {inconsistent_stockcode[0]}')

set(df.loc[df['StockCode'] == inconsistent_stockcode[0], 'Description'])

Total Inconsistent StockCodes: 1324
Mapped `Descriptions` from inconsistent StockCode: 10002


{'INFLATABLE POLITICAL GLOBE ', 'nan'}

Assess the presence of literal *"nan"* values across columns

In [180]:
# Count entries equal to the literal string "nan" in each column
nan_str_dict = {column: int((df[column] == 'nan').sum()) for column in df.columns}

nan_str_dict

{'InvoiceNo': 0,
 'StockCode': 0,
 'Description': 1454,
 'Quantity': 0,
 'InvoiceDate': 0,
 'UnitPrice': 0,
 'CustomerID': 0,
 'Country': 0}
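
Since the literal "nan" strings are confined to `Description`, one option is to turn them back into real missing values so that null counts include them. This is a sketch on toy rows; whether the correction step takes this route is not stated in the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "StockCode": ["10002", "10002", "10080"],
    "Description": ["INFLATABLE POLITICAL GLOBE ", "nan", "check"],
})

# Blank out entries that equal the literal string "nan"
toy["Description"] = toy["Description"].mask(toy["Description"] == "nan")

n_missing = int(toy["Description"].isna().sum())
print(n_missing)
```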

Assess whether observed `StockCode` inconsistencies are associated with literal *"nan"* values

In [197]:
clean_unique_stockcode_description = df.loc[df['Description'] != 'nan'].groupby('StockCode')['Description'].nunique()
clean_inconsistent_stockcode = clean_unique_stockcode_description[clean_unique_stockcode_description > 1].index

print(f'Mapped `Description` from inconsistent StockCode: {clean_inconsistent_stockcode[0]}')

set(df.loc[(df['StockCode'] == clean_inconsistent_stockcode[0]) & (df['Description'] != 'nan'), 'Description'])

Mapped `Description` from inconsistent StockCode: 10080


{'GROOVY CACTUS INFLATABLE', 'check'}
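
To review every conflicting mapping rather than only the first one, the distinct descriptions can be collected per `StockCode`. A minimal sketch on invented rows modeled on the values shown above:

```python
import pandas as pd

toy = pd.DataFrame({
    "StockCode": ["10002", "10002", "10080", "10080", "10080"],
    "Description": [
        "INFLATABLE POLITICAL GLOBE ",
        "INFLATABLE POLITICAL GLOBE ",
        "GROOVY CACTUS INFLATABLE",
        "check",
        "GROOVY CACTUS INFLATABLE",
    ],
})

# Distinct descriptions per StockCode, sorted for stable display
variants = toy.groupby("StockCode")["Description"].agg(lambda s: sorted(set(s)))
# Keep only codes that map to more than one description
conflicts = variants[variants.map(len) > 1]
print(conflicts)
```

Entries such as 'check' look like operator notes rather than product names, so a listing like this could help decide which value to keep for each code.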

---

## **Assessment Findings**
Consolidated summary of data quality issues identified during the assessment

<u>Missing Values</u>
- `CustomerID` contains 135,080 null values, representing 24.9% of all records.

<u>Extreme Value Ranges</u>
- `Quantity` ranges from -80,995 to 80,995 and `UnitPrice` from -11,062.06 to 38,970.00, far outside the interquartile ranges (1 to 10 and 1.25 to 4.13, respectively).

<u>Duplicate Records</u>
- 10,147 rows belong to exact-duplicate groups (counted with `keep=False`, so each group's first occurrence is included).
    - Associated with 1,933 unique `InvoiceNo`
    - Involving 1,045 unique `CustomerID`

<u>Cancelled Transactions</u>
- 3,836 cancelled invoices, accounting for 14.81% of total invoices.

<u>Field Relationship Inconsistencies</u>
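- `StockCode` does not consistently map to a single `Description`: 1,324 codes have more than one distinct value.
- The literal string `"nan"` appears 1,454 times in `Description`, most likely introduced by the `astype(str)` cast before the Parquet export.
- Inconsistencies persist after excluding `"nan"`; for example, `StockCode` 10080 maps to both 'GROOVY CACTUS INFLATABLE' and 'check'.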

---

## **Data Correction**

---

## **End of Notebook**
- Data has been cleaned, validated, and is ready for feature engineering.
- Key data issues identified and resolutions are documented in: [`preprocessing_log.md`](../docs/preprocessed/preprocessing_log.md)
- Ready for downstream notebook: [`02_feature_engineering.ipynb`](../notebooks/)