## **Notebook Content**

> Project: Customer Prioritization Under Constraints

File: 01_data_cleaning.ipynb <br>
Author: Bryan Melvida

Purpose:
- Ingest raw transactional data
- Assess data quality, consistency, and anomalies
- Apply targeted data corrections prior to analysis

Input: [`Online Retail.xlsx`](../data/raw/Online%20Retail.xlsx) <br>
Related Documentation: [`dataset.md`](../docs/dataset.md), [`preprocessing_log.md`](../docs/preprocessing_log.md)

Output:

---

In [1]:
import sys
import warnings
from pathlib import Path

warnings.filterwarnings("ignore", category=FutureWarning)

# Add the parent directory to the import path so the local `src` package resolves
sys.path.append("../")
import src.assessment_views as av

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

**Load Dataset**

In [2]:
# Read the cached Parquet file when it exists and is non-empty; otherwise fall back to the raw Excel file
PARQUET_PATH = Path("../data/preprocessed/customer_data.parquet")
XLSX_PATH = Path("../data/raw/Online Retail.xlsx")

if PARQUET_PATH.exists() and PARQUET_PATH.stat().st_size > 0:
    df = pd.read_parquet(PARQUET_PATH)
    loaded_from = "parquet"
else:
    df = pd.read_excel(XLSX_PATH)
    loaded_from = "xlsx"

print(f"Dataset loaded from: {loaded_from}")

Dataset loaded from: parquet


**Export to Parquet**

In [3]:
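if loaded_from == "xlsx":
    # Cast identifier and text columns to str before writing.
    # Caveat: on pandas < 3.0, astype(str) also turns missing values into the literal string "nan".
    df["InvoiceNo"] = df["InvoiceNo"].astype(str)
    df["StockCode"] = df["StockCode"].astype(str)
    df["Description"] = df["Description"].astype(str)

    # Create the output folder on first run so the write cannot fail on a missing directory
    PARQUET_PATH.parent.mkdir(parents=True, exist_ok=True)
    df.to_parquet(PARQUET_PATH, engine="pyarrow")

# Reload from Parquet so later cells always work from the cached copy
df = pd.read_parquet(PARQUET_PATH, engine="pyarrow")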
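The `astype(str)` casts in the export cell deserve a quick check, because on pandas versions before 3.0 a plain `str` cast turns missing values into the text `"nan"`, which then no longer counts as null. The sketch below uses a made-up toy Series, not the dataset. It shows two options this notebook does not currently use: pandas' nullable `"string"` dtype, and `mask()` for restoring missing values in a column that was already cast. Note that the `mask()` step would also blank out a genuine value spelled `"nan"`, so treat it as an assumption to verify against the data.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (illustrative data, not from the dataset)
desc = pd.Series(["WHITE METAL LANTERN", np.nan, "HAND WARMER"], dtype=object)

# Nullable string dtype: the missing entry stays missing
as_string = desc.astype("string")
n_missing = int(as_string.isna().sum())

# If a column already holds the text "nan" from an earlier str cast,
# mask() can turn those entries back into missing values
already_cast = pd.Series(["WHITE METAL LANTERN", "nan", "HAND WARMER"], dtype=object)
restored = already_cast.mask(already_cast == "nan")
n_restored = int(restored.isna().sum())

print(n_missing, n_restored)
```

Comparing null counts before and after the cast is a cheap way to confirm whether any missing values were absorbed into strings.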

## **Data Assessment**
Evaluate overall data readiness and surface structural issues and risks before analysis proceeds.

**Data Profiling & Structure Audit**

In [4]:
# Convenience dataset summary for visual inspection
av.df_summary(df)

Total Rows: 541,909
Total Columns: 8
Total Null Values: 135,080

                  data type  # unique  # non-null  # null     % null
InvoiceNo            object     25900      541909       0   0.000000
StockCode            object      4070      541909       0   0.000000
Description          object      4224      541909       0   0.000000
Quantity              int64       722      541909       0   0.000000
InvoiceDate  datetime64[ns]     23260      541909       0   0.000000
UnitPrice           float64      1630      541909       0   0.000000
CustomerID          float64      4372      406829  135080  24.926694
Country              object        38      541909       0   0.000000
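The implementation of `av.df_summary` lives in `src/assessment_views.py` and is not shown in this notebook. For readers without that module, the per-column table above could be approximated with plain pandas along these lines. The function name `summarize_frame` and the toy frame are hypothetical, and the real helper may compute or format things differently.

```python
import numpy as np
import pandas as pd


def summarize_frame(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, cardinality, and null counts (hypothetical helper)."""
    n_rows = len(frame)
    summary = pd.DataFrame({
        "data type": frame.dtypes.astype(str),
        "# unique": frame.nunique(),      # NaN is excluded from the unique count
        "# non-null": frame.count(),
        "# null": frame.isna().sum(),
    })
    # Guard against division by zero on an empty frame
    summary["% null"] = summary["# null"] / n_rows * 100 if n_rows else 0.0
    return summary


# Toy frame for illustration only
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536366"],
    "CustomerID": [17850.0, np.nan, 13047.0],
})
result = summarize_frame(toy)
print(result)
```

Keeping the null percentage next to the raw counts makes it easier to spot columns, such as `CustomerID` here, where missing values are concentrated.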


**Duplicate Checks**

In [45]:
# keep=False marks every row in a duplicate group, including the first occurrence
duplicates = df.loc[df.duplicated(keep=False), :]

print(f'Total Duplicated Records: {len(duplicates):,}')
print(f'Unique InvoiceNo Duplicated Records: {len(set(duplicates["InvoiceNo"])):,}')
print(f'Unique CustomerID Duplicated Records: {len(set(duplicates["CustomerID"])):,}')

Total Duplicated Records: 10,147
Unique InvoiceNo Duplicated Records: 1,933
Unique CustomerID Duplicated Records: 1,045
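Two details affect how these counts should be read. First, `duplicated(keep=False)` counts every member of a duplicate group, so the total is larger than the number of rows a later `drop_duplicates()` would remove. Second, building a `set()` from a float column that contains `NaN` can count missing values more than once, because `NaN` does not compare equal to itself, whereas `nunique()` skips `NaN` by default. A minimal sketch on a made-up toy frame (the values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy transactions: rows 0-1 and rows 3-4 are exact duplicates of each other
toy = pd.DataFrame({
    "InvoiceNo": ["536409", "536409", "536409", "536412", "536412", "536500"],
    "StockCode": ["21866", "21866", "22111", "22327", "22327", "85123A"],
    "CustomerID": [17908.0, 17908.0, 17908.0, np.nan, np.nan, 13047.0],
})

all_members = toy[toy.duplicated(keep=False)]     # every row in a duplicate group
extra_copies = toy[toy.duplicated(keep="first")]  # only the repeats that would be dropped

n_members = len(all_members)
n_extra = len(extra_copies)
n_invoices = all_members["InvoiceNo"].nunique()
n_customers = all_members["CustomerID"].nunique()  # NaN is excluded by default

print(n_members, n_extra, n_invoices, n_customers)
```

Reporting both the group-member count and the removable-copy count keeps the later correction step easy to reconcile with this assessment.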


## **Data Correction**

## **End of Notebook**
- Data has been cleaned, validated, and is ready for feature engineering.
- Key data issues and their resolutions are documented in `preprocessing_log.md`
- Ready for downstream notebook: `02_feature_engineering.ipynb`