In [1]:
import pandas as pd
import numpy as np

## Load Data

In [2]:
df_raw = pd.read_csv("sales_raw.csv")

## Dataset Overview

This section provides a high-level understanding of the raw sales dataset,
including its size, structure, and column composition.


In [None]:
df_raw.shape

In [None]:
df_raw.info()

## Column Name Inspection

Inspecting column names to identify:
- Inconsistent casing
- Special characters
- Non-standard naming conventions


In [None]:
df_raw.columns
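
The three checks listed above can also be applied programmatically rather than by eye. The following is a minimal sketch, not part of the original notebook: `flag_column_names` is a hypothetical helper, and treating lowercase snake_case as the target convention is an assumption. The sample names are copied from the profiling output in this notebook.

```python
import re

import pandas as pd


def flag_column_names(columns):
    """Return one row per column name with simple naming-convention flags.

    Hypothetical helper; lowercase snake_case is assumed to be the target style.
    """
    rows = []
    for name in columns:
        rows.append({
            "column": name,
            "has_uppercase": any(ch.isupper() for ch in name),
            "has_space": " " in name,
            # Anything other than letters, digits, underscores, or spaces.
            "has_special_char": bool(re.search(r"[^0-9A-Za-z_ ]", name)),
            "is_snake_case": bool(re.fullmatch(r"[a-z][a-z0-9_]*", name)),
        })
    return pd.DataFrame(rows).set_index("column")


# A few column names as they appear in this notebook's outputs.
sample_columns = ["ORDER ID", "Customer_id", "payment mode", "discount%", "category"]
name_flags = flag_column_names(sample_columns)
print(name_flags)
```

On the real data, the same call would be `flag_column_names(df_raw.columns)`.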

## Null Value Analysis

This step identifies missing values across columns to understand
data completeness and potential cleaning requirements.


In [12]:
null_summary = df_raw.isna().sum().to_frame("null_count")
null_summary["null_pct"] = (null_summary["null_count"] / len(df_raw) * 100).round(2)
null_summary.sort_values("null_pct", ascending=False)

| column | null_count | null_pct |
|---|---|---|
| Subcategory | 11710 | 49.75 |
| order_source | 6743 | 28.64 |
| Gender | 5933 | 25.2 |
| payment mode | 4748 | 20.17 |
| ORDER ID | 0 | 0.0 |
| Order Date | 0 | 0.0 |
| Customer_id | 0 | 0.0 |
| STATE | 0 | 0.0 |
| Region | 0 | 0.0 |
| category | 0 | 0.0 |


## Duplicate Order ID Analysis

An order can appear on multiple rows because of split shipments or data issues.
This section counts the rows that repeat an earlier ORDER ID value and lists all rows sharing a repeated ID.


In [13]:
df_raw["ORDER ID"].duplicated().sum()

np.int64(1540)

In [None]:
duplicate_orders = df_raw[df_raw["ORDER ID"].duplicated(keep=False)]
duplicate_orders.sort_values("ORDER ID")
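
To separate the two explanations mentioned above (split shipments versus data issues), one option is to compare rows that are identical in every column with rows that only share an order ID. The sketch below uses a small made-up frame, not rows from `sales_raw.csv`, and the assumption that fully identical rows indicate a data issue is mine rather than the notebook's.

```python
import pandas as pd

# Toy data for illustration only; these rows are not from sales_raw.csv.
toy_orders = pd.DataFrame({
    "ORDER ID": ["A1", "A1", "B2", "B2", "C3"],
    "category": ["Shoes", "Shoes", "Bags", "Belts", "Hats"],
    "Quantity": [1, 1, 2, 1, 3],
})

# Rows repeated in every column: assumed here to be ingestion or entry errors.
exact_dupes = toy_orders[toy_orders.duplicated(keep=False)]

# Rows whose ORDER ID repeats but whose other columns differ:
# these may be legitimate multi-line or split-shipment orders.
repeated_ids = toy_orders[toy_orders["ORDER ID"].duplicated(keep=False)]
differing_rows = repeated_ids[~repeated_ids.duplicated(keep=False)]

print("Exact duplicate order IDs:", list(exact_dupes["ORDER ID"].unique()))
print("Repeated IDs with differing rows:", list(differing_rows["ORDER ID"].unique()))
```

The same two filters can be applied to `df_raw` directly; which group should be dropped is a cleaning-phase decision.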

## Numerical Distribution Analysis

Analyzing numerical columns to detect:
- Outliers
- Skewness
- Unrealistic values


In [None]:
df_raw[["Quantity", "unit price", "discount%", "FINAL AMOUNT"]].describe()
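
`describe()` gives the raw quantiles, but the outlier, skewness, and unrealistic-value checks can be made explicit. The following is a rough sketch under stated assumptions: `iqr_outlier_summary` is a hypothetical helper, the 1.5 × IQR fence is a common rule of thumb rather than something this notebook prescribes, and negative values are used as a stand-in for "unrealistic". The toy frame is illustrative and not drawn from the dataset.

```python
import pandas as pd


def iqr_outlier_summary(df, columns, k=1.5):
    """Summarise skewness, IQR fences, and outlier counts per numeric column.

    Hypothetical helper; k=1.5 is the conventional Tukey fence multiplier.
    """
    rows = []
    for col in columns:
        values = df[col].dropna()
        q1, q3 = values.quantile(0.25), values.quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        rows.append({
            "column": col,
            "skew": values.skew(),
            "lower_fence": lower,
            "upper_fence": upper,
            "n_outliers": int(((values < lower) | (values > upper)).sum()),
            # Negative quantities, prices, or discounts are treated as unrealistic.
            "n_negative": int((values < 0).sum()),
        })
    return pd.DataFrame(rows).set_index("column")


# Toy data for illustration only; not taken from sales_raw.csv.
toy_numeric = pd.DataFrame({
    "Quantity": [1, 2, 2, 3, 3, 4, 50],
    "discount%": [0, 5, 10, 10, 15, 20, -5],
})
outlier_summary = iqr_outlier_summary(toy_numeric, ["Quantity", "discount%"])
print(outlier_summary)
```

On the real data, the column list would be the four numeric columns passed to `describe()` above.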

## Categorical Value Consistency Check

This section inspects unique values in categorical columns to identify:
- Inconsistent casing (e.g., DELHI vs delhi)
- Abbreviations vs full forms (e.g., TN vs Tamil Nadu)
- Duplicate semantic values
- Missing values (NaN)

These inconsistencies will be standardized during the data cleaning phase.


In [None]:
df_raw["Gender"].unique()

In [None]:
df_raw["STATE"].unique()

In [None]:
df_raw["Region"].unique()

In [None]:
df_raw["Subcategory"].unique()

In [None]:
df_raw["payment mode"].unique()
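
Scanning `unique()` output by eye works for short lists, but casing and whitespace variants can also be surfaced automatically. The sketch below is an assumption-laden illustration: `find_case_variants` is a hypothetical helper, and the sample values are made up to mirror the examples in the list above. It only catches values that match after trimming and case-folding, so abbreviations such as TN versus Tamil Nadu would still need a manual mapping.

```python
import pandas as pd


def find_case_variants(series):
    """Group raw values that collapse to the same key after strip + casefold.

    Hypothetical helper; returns only keys with more than one raw spelling.
    """
    values = series.dropna().astype(str)
    keys = values.str.strip().str.casefold()
    variants = values.groupby(keys).unique()
    return variants[variants.map(len) > 1]


# Made-up values mirroring the examples in the list above.
toy_state = pd.Series(["DELHI", "delhi", "Delhi ", "TN", "Tamil Nadu", None])
state_variants = find_case_variants(toy_state)
print(state_variants)
```

Applied to the real data, the call would be, for example, `find_case_variants(df_raw["STATE"])`.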

## Key Data Quality Issues Identified

Based on the profiling analysis, the following issues were identified:

- Inconsistent column naming conventions
- Missing values in Subcategory (49.75%), order_source (28.64%), Gender (25.2%), and payment mode (20.17%)
- Duplicate ORDER ID entries (1,540 rows repeat an earlier ORDER ID)
- Categorical value inconsistencies (state abbreviations, casing)
- Presence of extreme discount values

These issues will be addressed in the data cleaning phase.

