## Phase 1 – Final Summary

- Completed profiling of all raw datasets.
- Identified table grain, keys, and relationships.
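- Detected data quality issues such as:
  - Date columns stored as strings (sale_date, Launch_Date, claim_date)
  - repair_status values in warranty.csv that need standardization
- Found no fully duplicated rows in any table and no invalid product_id or store_id references in sales.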
- These findings will guide data cleaning and ETL design in Phase 2.


### Duplicate & Integrity Check Summary

- sales.csv:
  - No fully duplicated rows observed (df.duplicated().sum() returned 0).
  - Foreign key checks on product_id and store_id returned 0 invalid references.

- Dimension tables:
  - No fully duplicated rows observed.
  - The integrity checks shown cover only sales; Category_ID in products and sale_id in warranty still need foreign key validation.
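The remaining foreign keys (products.Category_ID → category.category_id and warranty.sale_id → sales.sale_id) could be checked with the same isin pattern used for sales. The following is a minimal sketch: the helper name count_orphans, the *_demo frames, and all ID values are made up for illustration and are not taken from the real datasets.

```python
import pandas as pd


def count_orphans(child: pd.DataFrame, child_key: str,
                  parent: pd.DataFrame, parent_key: str) -> int:
    """Count child rows whose key value has no match in the parent table."""
    return int((~child[child_key].isin(parent[parent_key])).sum())


# Toy frames for illustration only (hypothetical IDs, not the real data).
products_demo = pd.DataFrame({
    "Product_ID": ["P-1", "P-2"],
    "Category_ID": ["CAT-1", "CAT-9"],  # CAT-9 does not exist in categories_demo
})
categories_demo = pd.DataFrame({"category_id": ["CAT-1", "CAT-2"]})
sales_demo = pd.DataFrame({"sale_id": ["S-1", "S-2"]})
warranty_demo = pd.DataFrame({
    "claim_id": ["CL-1", "CL-2"],
    "sale_id": ["S-1", "S-2"],
})

orphan_categories = count_orphans(products_demo, "Category_ID",
                                  categories_demo, "category_id")
orphan_claims = count_orphans(warranty_demo, "sale_id", sales_demo, "sale_id")

print("Products with unknown category:", orphan_categories)
print("Claims with unknown sale:", orphan_claims)
```

On the real tables, the same helper would be called with products, categories, warranty, and sales in place of the toy frames.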


In [8]:
# Check product_id integrity in sales
invalid_products = sales[~sales["product_id"].isin(products["Product_ID"])]

# Check store_id integrity in sales
invalid_stores = sales[~sales["store_id"].isin(stores["Store_ID"])]

print("Invalid product references in sales:", invalid_products.shape[0])
print("Invalid store references in sales:", invalid_stores.shape[0])


Invalid product references in sales: 0
Invalid store references in sales: 0


In [7]:
for name, df in datasets.items():
    print(f"\n{name.upper()} - DUPLICATE ROWS")
    print(df.duplicated().sum())



SALES - DUPLICATE ROWS
0

PRODUCTS - DUPLICATE ROWS
0

CATEGORIES - DUPLICATE ROWS
0

STORES - DUPLICATE ROWS
0

WARRANTY - DUPLICATE ROWS
0
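Note that df.duplicated() with no arguments flags only rows that match on every column, so a result of 0 does not by itself prove that the primary keys are unique. A minimal sketch of a key-level check, using a toy frame with invented values rather than the real sales data:

```python
import pandas as pd

# Toy frame: two rows share a sale_id but differ in quantity (invented values).
sales_demo = pd.DataFrame({
    "sale_id": ["S-1", "S-2", "S-2"],
    "quantity": [1, 2, 3],
})

# Full-row duplicates: rows identical across all columns.
full_row_dupes = int(sales_demo.duplicated().sum())

# Key-level duplicates: rows that repeat an earlier sale_id.
key_dupes = int(sales_demo.duplicated(subset=["sale_id"]).sum())

print("Full-row duplicates:", full_row_dupes)
print("Duplicate sale_id values:", key_dupes)
```

Running the subset variant against each table's primary key (sale_id, Product_ID, category_id, Store_ID, claim_id) would confirm the keys identified in the observations.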


### Data Type & Missing Value Summary

- sales.csv:
  - sale_date is stored as object and must be converted to date.
  - No missing values observed.

- products.csv:
  - Launch_Date needs date conversion.
  - Price has no missing values.

- category.csv:
  - No missing values or data type issues.

- stores.csv:
  - Data types look correct.
  - No missing values observed.

- warranty.csv:
  - claim_date is stored as object and needs date conversion.
  - repair_status needs standardization.
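The date conversion and status standardization noted above might look like the sketch below. It assumes an ISO-style %Y-%m-%d date format and uses invented repair_status labels; the actual format and values in the raw files should be inspected first.

```python
import pandas as pd

# Toy frame for illustration only; dates and status labels are invented.
warranty_demo = pd.DataFrame({
    "claim_date": ["2024-01-15", "2024-02-30", "2024-03-01"],  # 2nd date is invalid
    "repair_status": [" completed", "Pending ", "IN PROGRESS"],
})

# Convert strings to datetimes; unparseable values become NaT instead of raising.
warranty_demo["claim_date"] = pd.to_datetime(
    warranty_demo["claim_date"], format="%Y-%m-%d", errors="coerce"
)
unparsed = int(warranty_demo["claim_date"].isna().sum())

# Normalize whitespace and casing so equivalent labels collapse to one value.
warranty_demo["repair_status"] = (
    warranty_demo["repair_status"].str.strip().str.title()
)

print("Unparseable dates:", unparsed)
print(warranty_demo["repair_status"].tolist())
```

Counting the NaT values after conversion gives a quick measure of how many rows need manual review before loading.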


In [6]:
for name, df in datasets.items():
    print(f"\n{name.upper()} - MISSING VALUES")
    print(df.isnull().sum())



SALES - MISSING VALUES
sale_id       0
sale_date     0
store_id      0
product_id    0
quantity      0
dtype: int64

PRODUCTS - MISSING VALUES
Product_ID      0
Product_Name    0
Category_ID     0
Launch_Date     0
Price           0
dtype: int64

CATEGORIES - MISSING VALUES
category_id      0
category_name    0
dtype: int64

STORES - MISSING VALUES
Store_ID      0
Store_Name    0
City          0
Country       0
dtype: int64

WARRANTY - MISSING VALUES
claim_id         0
claim_date       0
sale_id          0
repair_status    0
dtype: int64


In [5]:
for name, df in datasets.items():
    print(f"\n{name.upper()} - DATA TYPES")
    print(df.dtypes)



SALES - DATA TYPES
sale_id       object
sale_date     object
store_id      object
product_id    object
quantity       int64
dtype: object

PRODUCTS - DATA TYPES
Product_ID      object
Product_Name    object
Category_ID     object
Launch_Date     object
Price            int64
dtype: object

CATEGORIES - DATA TYPES
category_id      object
category_name    object
dtype: object

STORES - DATA TYPES
Store_ID      object
Store_Name    object
City          object
Country       object
dtype: object

WARRANTY - DATA TYPES
claim_id         object
claim_date       object
sale_id          object
repair_status    object
dtype: object


### warranty.csv – Observations

- Contains warranty claim information linked to sales transactions.
- Grain: 1 row = 1 warranty claim.
- Primary Key: claim_id
- Foreign Key:
  - sale_id → sales.sale_id
- Data Quality Observations:
  - claim_date should be converted to date format.
  - repair_status values should be standardized.
  - Some sales may have multiple warranty claims.
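Whether a sale really carries more than one claim can be confirmed by counting claims per sale_id. A minimal sketch on a toy frame (the IDs are invented and do not come from warranty.csv):

```python
import pandas as pd

# Toy frame for illustration only.
warranty_demo = pd.DataFrame({
    "claim_id": ["CL-1", "CL-2", "CL-3"],
    "sale_id": ["S-1", "S-1", "S-2"],
})

# Number of claims attached to each sale.
claims_per_sale = warranty_demo.groupby("sale_id")["claim_id"].count()

# Keep only sales with more than one claim.
multi_claim_sales = claims_per_sale[claims_per_sale > 1]

print(multi_claim_sales)
```

If this comes back non-empty on the real data, joins from sales to warranty will fan out, which matters when aggregating quantities or revenue.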


### stores.csv – Observations

- Contains store-level information including location details.
- Grain: 1 row = 1 store.
- Primary Key: Store_ID
- Foreign Keys: None
- Data Quality Observations:
  - Store and location details appear consistent.
  - Can be used for regional and country-level analysis.


### category.csv – Observations

- Contains product category lookup information.
- Grain: 1 row = 1 product category.
- Primary Key: category_id
- Foreign Keys: None
- Data Quality Observations:
  - Data appears clean with no missing values.
  - Used mainly for grouping and reporting purposes.


### products.csv – Observations

- Contains master data for products sold by the company.
- Grain: 1 row = 1 unique product.
- Primary Key: Product_ID
- Foreign Key:
  - Category_ID → category.category_id
- Data Quality Observations:
  - Launch_Date should be converted to proper date format.
  - Price column should be validated for zero or negative values.
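The price validation could be sketched as a simple filter. The frame below is a toy example with invented prices, not the real products table:

```python
import pandas as pd

# Toy frame for illustration only.
products_demo = pd.DataFrame({
    "Product_ID": ["P-1", "P-2", "P-3"],
    "Price": [999, 0, -50],
})

# Rows whose price is zero or negative and therefore needs review.
bad_prices = products_demo[products_demo["Price"] <= 0]

print("Products with non-positive price:", len(bad_prices))
print(bad_prices)
```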


### sales.csv – Observations

- Represents individual sales transactions for products sold in stores.
- Grain: 1 row = 1 product sold in a single sale.
- Primary Key: sale_id
- Foreign Keys:
  - store_id → stores.Store_ID
  - product_id → products.Product_ID
- Data Quality Observations:
  - sale_date is stored as a string and needs conversion to date format.
  - quantity values need validation for negative or unusually high numbers.
  - Large volume of data (1M+ rows), so performance should be considered during analysis.
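As a rough sketch of the quantity validation and one possible performance measure, the example below flags negative and unusually high quantities and stores the repeated key columns as pandas category dtype, which usually reduces memory for low-cardinality string columns. The threshold MAX_REASONABLE_QTY is a hypothetical value chosen for illustration, and the frame contains invented rows.

```python
import pandas as pd

# Toy frame for illustration only.
sales_demo = pd.DataFrame({
    "sale_id": ["S-1", "S-2", "S-3", "S-4"],
    "store_id": ["ST-1", "ST-1", "ST-2", "ST-2"],
    "product_id": ["P-1", "P-2", "P-1", "P-2"],
    "quantity": [1, -2, 3, 500],
})

MAX_REASONABLE_QTY = 100  # hypothetical cutoff; tune after inspecting the data

# Flag suspicious quantities rather than dropping them outright.
negative_qty = sales_demo[sales_demo["quantity"] < 0]
too_high_qty = sales_demo[sales_demo["quantity"] > MAX_REASONABLE_QTY]

# Store repeated key values as categories instead of Python strings.
for col in ["store_id", "product_id"]:
    sales_demo[col] = sales_demo[col].astype("category")

print("Negative quantities:", len(negative_qty))
print("Quantities above threshold:", len(too_high_qty))
print(sales_demo.dtypes)
```

Comparing sales.memory_usage(deep=True) before and after the conversion on the real table would show whether the saving is worthwhile.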


In [4]:
for name, df in datasets.items():
    print(f"\n{name.upper()}")
    print("Rows:", df.shape[0])
    print("Columns:", df.shape[1])
    print("Column Names:", df.columns.tolist())



SALES
Rows: 1040200
Columns: 5
Column Names: ['sale_id', 'sale_date', 'store_id', 'product_id', 'quantity']

PRODUCTS
Rows: 89
Columns: 5
Column Names: ['Product_ID', 'Product_Name', 'Category_ID', 'Launch_Date', 'Price']

CATEGORIES
Rows: 10
Columns: 2
Column Names: ['category_id', 'category_name']

STORES
Rows: 75
Columns: 4
Column Names: ['Store_ID', 'Store_Name', 'City', 'Country']

WARRANTY
Rows: 30000
Columns: 4
Column Names: ['claim_id', 'claim_date', 'sale_id', 'repair_status']


In [3]:
sales = pd.read_csv("../data/raw/sales.csv")
products = pd.read_csv("../data/raw/products.csv")
categories = pd.read_csv("../data/raw/category.csv")
stores = pd.read_csv("../data/raw/stores.csv")
warranty = pd.read_csv("../data/raw/warranty.csv")

datasets = {
    "sales": sales,
    "products": products,
    "categories": categories,
    "stores": stores,
    "warranty": warranty
}


In [2]:
import pandas as pd
import numpy as np

# Phase 1: Data Understanding & Profiling

- Objective: Understand the structure, relationships, and data quality of the raw retail datasets before designing ETL and analytics.

