# Phase 3: Data Cleaning and Production Readiness

## 1. Business Problem Statement
Before building predictive models or dashboards, the business must ensure that its
data is accurate, consistent, and reliable. Poor data quality can lead to incorrect
forecasts, flawed customer insights, and misguided executive decisions.

The goal of this phase is to prepare production-ready data that can be safely used
for modeling and decision-making.

## 2. Why This Matters to the Business
Data quality issues translate directly into financial risk. Inaccurate dates, duplicate
orders, or inconsistent customer identifiers distort forecasts, customer segmentation,
and profit optimization, and the decisions built on those outputs become costly.

In [12]:
import sys
import pandas as pd
from pathlib import Path

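# The notebook is assumed to run from a subfolder of the project,
# so the parent of the working directory is the project root
PROJECT_ROOT = Path().resolve().parents[0]
sys.path.append(str(PROJECT_ROOT))
# Ensure the processed-data directory exists under the project root,
# which is the same location as the "../data/processed" path used when saving
processed_dir = PROJECT_ROOT / "data" / "processed"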
processed_dir.mkdir(parents=True, exist_ok=True)


In [13]:
from src.data_loader import load_raw_data

df = load_raw_data()
df.head()


⚠️ UTF-8 failed. Data loaded using LATIN-1 encoding


| | Row ID | Order ID | Order Date | Ship Date | Ship Mode | Customer ID | Customer Name | Segment | Country | City | ... | Postal Code | Region | Product ID | Category | Sub-Category | Product Name | Sales | Quantity | Discount | Profit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | CA-2016-152156 | 11-08-2016 | 11-11-2016 | Second Class | CG-12520 | Claire Gute | Consumer | United States | Henderson | ... | 42420 | South | FUR-BO-10001798 | Furniture | Bookcases | Bush Somerset Collection Bookcase | 261.96 | 2 | 0.0 | 41.9136 |
| 1 | 2 | CA-2016-152156 | 11-08-2016 | 11-11-2016 | Second Class | CG-12520 | Claire Gute | Consumer | United States | Henderson | ... | 42420 | South | FUR-CH-10000454 | Furniture | Chairs | Hon Deluxe Fabric Upholstered Stacking Chairs,... | 731.94 | 3 | 0.0 | 219.582 |
| 2 | 3 | CA-2016-138688 | 06-12-2016 | 6/16/2016 | Second Class | DV-13045 | Darrin Van Huff | Corporate | United States | Los Angeles | ... | 90036 | West | OFF-LA-10000240 | Office Supplies | Labels | Self-Adhesive Address Labels for Typewriters b... | 14.62 | 2 | 0.0 | 6.8714 |
| 3 | 4 | US-2015-108966 | 10-11-2015 | 10/18/2015 | Standard Class | SO-20335 | Sean O'Donnell | Consumer | United States | Fort Lauderdale | ... | 33311 | South | FUR-TA-10000577 | Furniture | Tables | Bretford CR4500 Series Slim Rectangular Table | 957.5775 | 5 | 0.45 | -383.031 |
| 4 | 5 | US-2015-108966 | 10-11-2015 | 10/18/2015 | Standard Class | SO-20335 | Sean O'Donnell | Consumer | United States | Fort Lauderdale | ... | 33311 | South | OFF-ST-10000760 | Office Supplies | Storage | Eldon Fold 'N Roll Cart System | 22.368 | 2 | 0.2 | 2.5164 |
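
The warning above indicates that the loader fell back from UTF-8 to Latin-1. The body of `load_raw_data` is not shown in this notebook, so the following is only a minimal sketch of how such a fallback might look; the function name `read_csv_with_fallback` and the sample file are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "latin-1")):
    """Hypothetical helper: try each encoding in order and report which one worked."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc), enc
        except UnicodeDecodeError as exc:
            last_error = exc  # remember the failure and try the next encoding
    raise last_error


# Demo on a throwaway file whose bytes are valid Latin-1 but invalid UTF-8
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_bytes("Customer Name,Sales\nJos\xe9,10.5\n".encode("latin-1"))
    demo_df, used_encoding = read_csv_with_fallback(sample)
```

Returning the encoding alongside the frame makes the fallback visible to the caller, so it can be logged rather than silently accepted. Latin-1 decodes any byte sequence without error, so it only makes sense as the last option in the list.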


## 3. Data Quality Assessment

In [14]:
df.isnull().sum()

Row ID           0
Order ID         0
Order Date       0
Ship Date        0
Ship Mode        0
Customer ID      0
Customer Name    0
Segment          0
Country          0
City             0
State            0
Postal Code      0
Region           0
Product ID       0
Category         0
Sub-Category     0
Product Name     0
Sales            0
Quantity         0
Discount         0
Profit           0
dtype: int64

In [15]:
df.duplicated().sum()

np.int64(0)
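
Because `Row ID` is different on every row, a full-row `duplicated()` check cannot flag an order line that was loaded twice under two row IDs. The sketch below shows a check restricted to a business key instead. It uses toy data, and treating `Order ID` plus `Product ID` as the key is an assumption that should be confirmed against the business rules for this dataset.

```python
import pandas as pd

# Toy data: rows 1 and 4 describe the same order line under different Row IDs
orders = pd.DataFrame({
    "Row ID": [1, 2, 3, 4],
    "Order ID": ["CA-1", "CA-1", "CA-2", "CA-1"],
    "Product ID": ["P-1", "P-2", "P-1", "P-1"],
    "Sales": [10.0, 20.0, 5.0, 10.0],
})

# Full-row check: the unique Row ID hides the repeat
full_row_dupes = int(orders.duplicated().sum())

# Key-based check (assumed key): keep=False returns every member of a duplicate group
key_cols = ["Order ID", "Product ID"]
key_dupes = orders[orders.duplicated(subset=key_cols, keep=False)]
```

Here `full_row_dupes` is 0 while `key_dupes` contains the two rows that share a key, which is the case the full-row check misses.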

In [16]:
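# The raw file mixes date styles (e.g. "11-08-2016" and "6/16/2016"), so the
# format is inferred per value; dayfirst=False reads ambiguous dates month-first.
# Note: format='mixed' requires pandas 2.0 or later.
df['Order Date'] = pd.to_datetime(
    df['Order Date'],
    format='mixed',
    dayfirst=False
)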

df['Ship Date'] = pd.to_datetime(
    df['Ship Date'],
    format='mixed',
    dayfirst=False
)
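
Per-value format inference is convenient, but it does not report which values were hard to interpret. As a hedged alternative, the sketch below parses the two styles visible in the preview (`MM-DD-YYYY` and `M/D/YYYY`) with explicit formats and collects anything that matches neither. The sample values are illustrative, and the month-first reading is an assumption consistent with `dayfirst=False` above.

```python
import pandas as pd

# Illustrative values in the two styles seen in the raw data, plus one bad entry
raw = pd.Series(["11-08-2016", "6/16/2016", "10/18/2015", "not a date"])

# Parse each style explicitly; values that do not match become NaT
dash = pd.to_datetime(raw, format="%m-%d-%Y", errors="coerce")
slash = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")

# Combine the two passes and list whatever neither format could parse
parsed = dash.fillna(slash)
unparsed = raw[parsed.isna()]
```

Inspecting `unparsed` before overwriting the column gives an explicit list of rows to investigate instead of a silent best guess.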

In [17]:
(df['Ship Date'] < df['Order Date']).sum()

np.int64(0)

In [18]:
(df['Sales'] <= 0).sum(), (df['Profit'] < 0).sum()

(np.int64(0), np.int64(1871))

No rows have zero or negative sales. The 1,871 rows with negative profit are retained because they represent legitimate business losses rather than data errors.
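
One way to keep loss-making rows while still making them easy to isolate downstream is to flag them instead of filtering them. The sketch below uses three rows taken from the preview above; the column name `is_loss` is a hypothetical addition, not part of the original pipeline.

```python
import pandas as pd

# Three Sales/Profit pairs from the preview rows
sales = pd.DataFrame({
    "Sales": [261.96, 957.5775, 22.368],
    "Profit": [41.9136, -383.031, 2.5164],
})

# Flag losses rather than dropping them (is_loss is an illustrative name)
sales["is_loss"] = sales["Profit"] < 0

# Share of rows that are loss-making
loss_share = sales["is_loss"].mean()
```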

### Standardize Categorical Data

In [19]:
df['Category'] = df['Category'].str.strip().str.title()
df['Region'] = df['Region'].str.strip().str.title()
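
To confirm that the strip-and-title-case step actually merges variant spellings, the number of distinct labels can be compared before and after. This is a small sketch on made-up values, not on the project data.

```python
import pandas as pd

# Made-up labels with stray whitespace and inconsistent casing
raw = pd.Series([
    "Furniture", " furniture", "OFFICE SUPPLIES", "Office Supplies ", "Technology",
])

# Same transformation as applied to Category and Region
clean = raw.str.strip().str.title()

# Distinct label counts before and after standardization
before, after = raw.nunique(), clean.nunique()
```

If `before` and `after` are equal on the real columns, the data was already consistent and the step is a harmless safeguard. Title-casing would not suit columns containing acronyms or codes, since it lowercases every letter after the first in each word.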

In [20]:
from src.preprocessing import clean_data

clean_df = clean_data(df)

In [21]:
clean_df.to_csv("../data/processed/cleaned_data.csv", index=False)
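
CSV files do not store dtypes, so parsed dates come back as plain strings unless the reader is told otherwise. The sketch below shows a round-trip check on a small stand-in frame written to a temporary directory. It does not touch the project's `cleaned_data.csv`, and the frame contents are illustrative.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small stand-in for the cleaned frame
clean = pd.DataFrame({
    "Order ID": ["CA-2016-152156", "US-2015-108966"],
    "Order Date": pd.to_datetime(["2016-11-08", "2015-10-11"]),
    "Sales": [261.96, 957.5775],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "cleaned_data.csv"
    clean.to_csv(out_path, index=False)
    # parse_dates restores the datetime dtype that the CSV format does not keep
    reloaded = pd.read_csv(out_path, parse_dates=["Order Date"])

same_shape = reloaded.shape == clean.shape
dates_restored = pd.api.types.is_datetime64_any_dtype(reloaded["Order Date"])
```

Downstream notebooks that read the saved file would need the same `parse_dates` argument for both date columns.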