# Data Quality & Validation Report

## Purpose
This notebook validates the reliability of the sales dataset
after ingestion and type enforcement.

It provides transparent checks that demonstrate:
- Data completeness
- Business rule consistency
- Revenue integrity
- Temporal coverage

This report is intended for stakeholders, reviewers, and clients.


## 1. Load Raw Data

The raw dataset is loaded without transformation.
All validation compares raw values to derived or expected behavior.


In [27]:
import pandas as pd
import numpy as np

pd.set_option("display.max_columns", None)

df = pd.read_csv("../data/raw/sales_data.csv")


## 2. Dataset Overview

We first validate basic structural properties:
- Row count
- Column count
- Memory footprint


In [28]:
df.shape


(185950, 13)

In [29]:
df.columns

Index(['Unnamed: 0', 'Order ID', 'Product Category', 'Product',
       'Quantity Ordered', 'Price Each', 'Order Date', 'Purchase Address',
       'Month', 'Sales', 'City', 'Hour', 'Time of Day'],
      dtype='object')

In [30]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 185950 entries, 0 to 185949
Data columns (total 13 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Unnamed: 0        185950 non-null  int64  
 1   Order ID          185950 non-null  int64  
 2   Product Category  185950 non-null  object 
 3   Product           185950 non-null  object 
 4   Quantity Ordered  185950 non-null  int64  
 5   Price Each        185950 non-null  float64
 6   Order Date        185950 non-null  object 
 7   Purchase Address  185950 non-null  object 
 8   Month             185950 non-null  int64  
 9   Sales             185950 non-null  float64
 10  City              185950 non-null  object 
 11  Hour              185950 non-null  int64  
 12  Time of Day       185950 non-null  object 
dtypes: float64(2), int64(5), object(6)
memory usage: 18.4+ MB


In [31]:
df.describe(include="all")


| statistic | Unnamed: 0 | Order ID | Product Category | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | Month | Sales | City | Hour | Time of Day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 185950.0 | 185950.0 | 185950 | 185950 | 185950.0 | 185950.0 | 185950 | 185950 | 185950.0 | 185950.0 | 185950 | 185950.0 | 185950 |
| unique | | | 8 | 19 | | | 142395 | 140787 | | | 9 | | 4 |
| top | | | Audio Devices | USB-C Charging Cable | | | 15-12-2019 20:16 | 193 Forest St, San Francisco, CA 94016 | | | San Francisco | | Afternoon |
| freq | | | 47756 | 21903 | | | 8 | 9 | | | 44732 | | 67158 |
| mean | 8340.388475 | 230417.569379 | | | 1.124383 | 184.399735 | | | 7.05914 | 185.490917 | | 14.413305 | |
| std | 5450.554093 | 51512.73711 | | | 0.442793 | 332.73133 | | | 3.502996 | 332.919771 | | 5.423416 | |
| min | 0.0 | 141234.0 | | | 1.0 | 2.99 | | | 1.0 | 2.99 | | 0.0 | |
| 25% | 3894.0 | 185831.25 | | | 1.0 | 11.95 | | | 4.0 | 11.95 | | 11.0 | |
| 50% | 7786.0 | 230367.5 | | | 1.0 | 14.95 | | | 7.0 | 14.95 | | 15.0 | |
| 75% | 11872.0 | 275035.75 | | | 1.0 | 150.0 | | | 10.0 | 150.0 | | 19.0 | |


## 3. Missing Values Check

Missing data can break aggregations and BI measures.
We verify completeness across all columns.


In [32]:
missing_summary = df.isna().sum().to_frame("missing_count")
missing_summary["missing_pct"] = missing_summary["missing_count"] / len(df)

missing_summary


| column | missing_count | missing_pct |
|---|---|---|
| Unnamed: 0 | 0 | 0.0 |
| Order ID | 0 | 0.0 |
| Product Category | 0 | 0.0 |
| Product | 0 | 0.0 |
| Quantity Ordered | 0 | 0.0 |
| Price Each | 0 | 0.0 |
| Order Date | 0 | 0.0 |
| Purchase Address | 0 | 0.0 |
| Month | 0 | 0.0 |
| Sales | 0 | 0.0 |
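`isna()` only counts true nulls (`NaN`/`None`); empty or whitespace-only strings pass as present. The sketch below shows one way to count both on a small made-up frame. The helper name `blank_or_missing_counts` and the toy data are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for the sales data.
toy = pd.DataFrame({
    "Product": ["USB-C Charging Cable", "", "  ", None],
    "City": ["Boston", "Austin", "Dallas", "Seattle"],
    "Quantity Ordered": [1, 2, 1, 3],
})


def blank_or_missing_counts(frame: pd.DataFrame) -> pd.Series:
    """Count cells per column that are null or whitespace-only strings."""
    counts = {}
    for col in frame.columns:
        values = frame[col]
        # True where the cell is a string that is empty after stripping.
        blank = values.map(lambda v: isinstance(v, str) and v.strip() == "").astype(bool)
        counts[col] = int((values.isna() | blank).sum())
    return pd.Series(counts, name="blank_or_missing")


summary = blank_or_missing_counts(toy)
print(summary)
```

Applied to the real data, the same helper could be called as `blank_or_missing_counts(df)` and compared with `missing_summary`.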


## 4. Business Rule Validation

We validate core transactional logic:
- Quantity must be positive
- Price must be positive
- Sales must equal Quantity × Price


In [33]:
df["calculated_sales"] = df["Quantity Ordered"] * df["Price Each"]

df["valid_quantity"] = df["Quantity Ordered"] > 0
df["valid_price"] = df["Price Each"] > 0
df["valid_sales"] = np.isclose(
    df["Sales"],
    df["calculated_sales"],
    atol=0.01
)


### Validation Results (share of records passing; 1.0 = 100%)


In [34]:
validation_results = pd.DataFrame({
    "quantity_valid_pct": [df["valid_quantity"].mean()],
    "price_valid_pct": [df["valid_price"].mean()],
    "sales_match_pct": [df["valid_sales"].mean()]
})

validation_results


| | quantity_valid_pct | price_valid_pct | sales_match_pct |
|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 |
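A pass rate alone does not show which records fail when a rule is broken. The following sketch, on made-up rows, counts failures per rule and extracts the offending rows. The toy values are assumptions for illustration. Note that `np.isclose` adds a relative tolerance by default (`rtol=1e-05`); setting `rtol=0` keeps the tolerance at exactly `atol`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the last two deliberately break a rule.
toy = pd.DataFrame({
    "Quantity Ordered": [1, 2, 0, 1],
    "Price Each": [11.95, 150.0, 2.99, 14.95],
    "Sales": [11.95, 300.0, 0.0, 99.0],
})

# One boolean column per business rule.
checks = pd.DataFrame({
    "valid_quantity": toy["Quantity Ordered"] > 0,
    "valid_price": toy["Price Each"] > 0,
    "valid_sales": np.isclose(
        toy["Sales"],
        toy["Quantity Ordered"] * toy["Price Each"],
        rtol=0,
        atol=0.01,
    ),
})

# Number of failing records per rule.
failure_counts = (~checks).sum()

# Rows that fail at least one rule, for manual inspection.
failing_rows = toy[~checks.all(axis=1)]

print(failure_counts)
print(failing_rows)
```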


## 5. Order ID Duplication Analysis

Order IDs are expected to repeat because
one order can contain multiple products.

We confirm this assumption explicitly by measuring the share of distinct orders that span more than one row.


In [35]:
order_counts = df["Order ID"].value_counts()

(order_counts > 1).mean()


np.float64(0.039991705756093184)
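Repeated Order IDs are expected, but a row that repeats an earlier row on every business column may be an accidental duplicate. The sketch below separates the two cases on made-up data. The toy rows and the choice of `key_cols` are assumptions; on the real dataset the key would likely include more columns.

```python
import pandas as pd

# Hypothetical toy data: order 1001 has two different products (legitimate),
# order 1003 has the same line twice (possible accidental duplicate).
toy = pd.DataFrame({
    "Order ID": [1001, 1001, 1002, 1003, 1003],
    "Product": ["Phone", "Cable", "Batteries", "Headphones", "Headphones"],
    "Quantity Ordered": [1, 1, 2, 1, 1],
})

# Share of distinct orders with more than one row (same metric as the cell above).
multi_item_share = (toy["Order ID"].value_counts() > 1).mean()

# Rows that repeat an earlier row on all key columns.
key_cols = ["Order ID", "Product", "Quantity Ordered"]
exact_duplicates = toy[toy.duplicated(subset=key_cols, keep="first")]

print(multi_item_share)
print(exact_duplicates)
```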

## 6. Revenue Reconciliation

We validate total revenue consistency between:
- Reported sales
- Calculated sales


In [36]:
reported_revenue = df["Sales"].sum()
calculated_revenue = df["calculated_sales"].sum()

pd.DataFrame({
    "reported_revenue": [reported_revenue],
    "calculated_revenue": [calculated_revenue],
    "difference": [reported_revenue - calculated_revenue]
})


| | reported_revenue | calculated_revenue | difference |
|---|---|---|---|
| 0 | 34492035.97 | 34492035.97 | 0.0 |
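A zero difference in the grand total could in principle hide errors that offset each other. One way to guard against that is to reconcile per group, for example per month. This is a minimal sketch on made-up rows; the toy values and the 0.01 threshold are assumptions.

```python
import pandas as pd

# Hypothetical toy data; the last row's Sales is 4.85 below Quantity x Price.
toy = pd.DataFrame({
    "Month": [1, 1, 2, 2],
    "Quantity Ordered": [1, 2, 1, 3],
    "Price Each": [11.95, 2.99, 150.0, 14.95],
    "Sales": [11.95, 5.98, 150.0, 40.00],
})
toy["calculated_sales"] = toy["Quantity Ordered"] * toy["Price Each"]

# Reported vs. calculated revenue per month.
by_month = toy.groupby("Month")[["Sales", "calculated_sales"]].sum()
by_month["difference"] = (by_month["Sales"] - by_month["calculated_sales"]).round(2)

# Months whose totals disagree by more than one cent.
mismatched_months = by_month.index[by_month["difference"].abs() > 0.01].tolist()

print(by_month)
print(mismatched_months)
```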


## 7. Temporal Coverage

We verify that the dataset spans expected months and hours.
This prevents incomplete or truncated reporting.



In [37]:
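# Dates are stored day-first (e.g. "15-12-2019 20:16"), so the format is given explicitly.
df["Order Date"] = pd.to_datetime(df["Order Date"], format="%d-%m-%Y %H:%M", errors="coerce")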

df["Order Date"].min(), df["Order Date"].max()



(Timestamp('2019-01-01 03:07:00'), Timestamp('2020-01-01 05:13:00'))

In [38]:
df["Month"].value_counts().sort_index()


Month
1      9709
2     11975
3     15153
4     18279
5     16566
6     13554
7     14293
8     11961
9     11621
10    20282
11    17573
12    24984
Name: count, dtype: int64

In [39]:
df["Hour"].value_counts().sort_index()


Hour
0      3910
1      2350
2      1243
3       831
4       854
5      1321
6      2482
7      4011
8      6256
9      8748
10    10944
11    12411
12    12587
13    12129
14    10984
15    10175
16    10384
17    10899
18    12280
19    12905
20    12228
21    10921
22     8822
23     6275
Name: count, dtype: int64
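Reading coverage off the counts works for a report, but the same checks can be made explicit. The sketch below counts dates that failed to parse (which `errors="coerce"` silently turns into `NaT`) and lists any months or hours with no records. The toy rows are made up, and the day-first format string is an assumption based on the sample value `15-12-2019 20:16` shown in the overview.

```python
import pandas as pd

# Hypothetical toy data with one unparseable date and sparse coverage.
toy = pd.DataFrame({
    "Order Date": ["15-12-2019 20:16", "01-01-2019 03:07", "not a date"],
    "Month": [12, 1, 3],
    "Hour": [20, 3, 9],
})

# Unparseable values become NaT under errors="coerce"; count them.
parsed = pd.to_datetime(toy["Order Date"], format="%d-%m-%Y %H:%M", errors="coerce")
unparsed_count = int(parsed.isna().sum())

# Months (1-12) and hours (0-23) that have no records at all.
missing_months = sorted(set(range(1, 13)) - set(toy["Month"].tolist()))
missing_hours = sorted(set(range(24)) - set(toy["Hour"].tolist()))

print(unparsed_count)
print(missing_months)
print(missing_hours)
```

On the full dataset, all three results would be expected to be zero or empty given the counts above.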

## 8. Categorical Coverage

We ensure product categories and cities are populated
and reasonably distributed.


In [40]:
df["Product Category"].value_counts(normalize=True)


Product Category
Audio Devices             0.256822
Charging Cables           0.234262
Batterie                  0.221662
Monitors                  0.129169
Phones and Accessories    0.077612
Laptops and Computers     0.047604
Entertainment Devices     0.025813
Home Appliances           0.007056
Name: proportion, dtype: float64

In [41]:
df["City"].nunique()


9
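"Populated and reasonably distributed" can be checked more concretely. As a rough sketch on made-up labels, the code below flags labels that differ only by case or surrounding whitespace, and categories whose share falls under a threshold. The toy labels and the `MIN_SHARE` value are hypothetical; a suitable threshold depends on the business context.

```python
import pandas as pd

# Hypothetical toy labels; "monitors " is a case/whitespace variant of "Monitors".
toy = pd.Series(
    ["Monitors", "Monitors", "monitors ", "Audio Devices", "Audio Devices", "Home Appliances"],
    name="Product Category",
)

# Group raw labels by their normalized form; more than one raw variant
# per normalized label indicates inconsistent spelling.
normalized = toy.str.strip().str.casefold()
variants_per_label = toy.groupby(normalized).nunique()
inconsistent_labels = variants_per_label[variants_per_label > 1].index.tolist()

# Categories whose share of records is below a (hypothetical) minimum.
MIN_SHARE = 0.20
shares = toy.value_counts(normalize=True)
rare_categories = shares[shares < MIN_SHARE].index.tolist()

print(inconsistent_labels)
print(rare_categories)
```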

## 9. Final Data Quality Assessment

### Summary
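- Dataset is structurally complete: 185,950 rows, 13 columns, no missing values
- Business rules are satisfied by 100% of records (positive quantity, positive price, Sales = Quantity × Price within 0.01)
- Reported and calculated revenue reconcile exactly (34,492,035.97 each)
- All 12 months and 24 hours are represented; 8 product categories and 9 cities are populated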

### Conclusion
The dataset is suitable for downstream normalization,
SQL modeling, and Power BI analytics.
