In [None]:
import pandas as pd 
import numpy as np
pd.set_option('display.max_columns',None)
pd.set_option('display.float_format','{:.2f}'.format)

#### Data Cleaning & Feature Engineering

This notebook prepares retail transactional data for business analysis by performing
data quality checks, cleaning, and feature engineering.


In [None]:
df = pd.read_csv(r"C:\Users\shubu\Documents\Business-Performance-Analytics\Data\superstore_raw.csv")
df.head()

In [None]:
df.shape # rows and columns

In [None]:
df.info()

In [None]:
df.head(10)

In [None]:
df.tail(5)

In [None]:
df.columns = (
    df.columns
      .str.strip()
      .str.lower()
      .str.replace(' ', '_', regex=False)
      .str.replace('-', '_', regex=False)
      .str.replace('.', '_', regex=False)
)

df.columns


In [None]:
df.duplicated().sum()

In [None]:
df.isnull().sum().sort_values(ascending = False)

In [None]:
df.dtypes

#### Data Quality Observations

- Column names have been standardized to snake_case for consistency and SQL compatibility
- No missing values were observed across any columns
- No fully duplicated rows were detected in the dataset
- Date-related columns (order_date, ship_date) are currently in string format and require conversion to datetime
- Numerical columns such as sales, profit, and discount appear correctly populated but require data type validation
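
The checks behind these observations can be bundled into small helpers so they are easy to rerun after each cleaning step. The sketch below is illustrative only: `standardize_columns`, `quality_report`, and the `toy` DataFrame are hypothetical names and made-up data, not part of the Superstore file.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def standardize_columns(columns: pd.Index) -> pd.Index:
    """Lower-case names and replace spaces, hyphens and dots with underscores."""
    return (
        columns.str.strip()
        .str.lower()
        .str.replace(" ", "_", regex=False)
        .str.replace("-", "_", regex=False)
        .str.replace(".", "_", regex=False)
    )


def quality_report(frame: pd.DataFrame) -> dict:
    """Count duplicate rows and missing values, and list non-numeric columns."""
    return {
        "duplicate_rows": int(frame.duplicated().sum()),
        "missing_values": int(frame.isnull().sum().sum()),
        "non_numeric_columns": [
            col for col in frame.columns if not is_numeric_dtype(frame[col])
        ],
    }


# Hypothetical toy data standing in for the raw file
toy = pd.DataFrame({
    "Order Date": ["2014-01-03", "2014-01-04"],
    "Sub-Category": ["Paper", "Binders"],
    "Sales": [16.45, 3.54],
})
toy.columns = standardize_columns(toy.columns)
report = quality_report(toy)

print(toy.columns.tolist())
print(report)
```

Date columns stored as strings show up under `non_numeric_columns`, which is one way to spot columns that still need conversion.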

In [None]:
df.columns

In [None]:
# Date conversion (errors='coerce' turns unparseable values into NaT instead of raising)
df['order_date'] = pd.to_datetime(df['order_date'], errors='coerce')
df['ship_date'] = pd.to_datetime(df['ship_date'], errors='coerce')


In [None]:
# Verify
df[['order_date','ship_date']].dtypes

In [None]:
# Numeric columns validation
numeric_columns = ['sales', 'profit', 'discount', 'quantity', 'shipping_cost']
df[numeric_columns].dtypes

In [None]:
df[numeric_columns].describe()

#### Data Type Validation

- Column names standardized using underscores for SQL and BI compatibility
- order_date and ship_date converted to datetime format; because errors='coerce' is used, any unparseable value becomes NaT rather than raising an error
- Sales, profit, discount, quantity, and shipping_cost validated as numeric
- No critical data conversion issues observed
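
Since `errors='coerce'` replaces unparseable dates with `NaT` silently, a datetime dtype alone does not show that every value parsed. The following is a minimal sketch of one way to check for coerced values and confirm numeric dtypes, using hypothetical toy data with one deliberately malformed date.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical toy data; the second date is deliberately malformed
toy = pd.DataFrame({
    "order_date": ["2014-01-03", "not a date", "2014-02-10"],
    "sales": [16.45, 3.54, 272.74],
    "quantity": [2, 1, 3],
})

parsed = pd.to_datetime(toy["order_date"], errors="coerce")

# Values present before parsing but NaT afterwards failed to parse
failed = toy.loc[parsed.isna() & toy["order_date"].notna(), "order_date"]
print(f"{len(failed)} unparseable date(s): {failed.tolist()}")

# Columns expected to be numeric that pandas did not read as numeric
non_numeric = [col for col in ["sales", "quantity"] if not is_numeric_dtype(toy[col])]
print(f"non-numeric columns: {non_numeric}")
```

Running the same comparison on `order_date` and `ship_date` before overwriting the original string columns would show whether any rows were coerced.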


In [None]:
# Revenue & Profitability Metrics
df['revenue'] = df['sales']
# Margin is set to 0 where sales is 0 to avoid division by zero
df['profit_margin'] = np.where(df['sales'] == 0, 0, df['profit'] / df['sales'])


In [None]:
# Discount amount
df['discount_amount'] = df['sales'] * df['discount']

In [None]:
# Time based features
df['order_year'] = df['order_date'].dt.year
df['order_month'] = df['order_date'].dt.month
df['order_month_name'] = df['order_date'].dt.month_name()

In [None]:
# Business Flags
df['loss_making_flag'] = np.where(df['profit'] < 0, 1, 0)
df['high_discount_flag'] = np.where(df['discount'] > 0.25, 1, 0)

In [None]:
df[['revenue','profit','profit_margin','discount_amount','loss_making_flag','high_discount_flag']].head()

In [None]:
df[['loss_making_flag','high_discount_flag']].value_counts()

#### Feature Engineering

The following business metrics and flags were created to support profitability and performance analysis:

- Revenue, taken directly from the sales column
- Profit margin, calculated as profit divided by sales (set to 0 where sales is 0)
- Discount amount, calculated as sales × discount, to quantify revenue impact
- Time-based features (year, month number, month name) extracted from order_date
- Loss-making flag (1 where profit is negative) to identify unprofitable transactions
- High discount flag (1 where discount exceeds 0.25) to identify aggressive discounting
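
To make the behavior of these features concrete, the sketch below applies the same calculations to a three-row toy DataFrame, including a zero-sales row. The `add_features` wrapper, its `high_discount` parameter, and the sample values are hypothetical and only for illustration.

```python
import numpy as np
import pandas as pd


def add_features(frame: pd.DataFrame, high_discount: float = 0.25) -> pd.DataFrame:
    """Return a copy of frame with the revenue, margin, time and flag columns added."""
    out = frame.copy()
    out["revenue"] = out["sales"]
    # Margin is 0 where sales is 0, otherwise profit / sales
    out["profit_margin"] = np.where(out["sales"] == 0, 0, out["profit"] / out["sales"])
    out["discount_amount"] = out["sales"] * out["discount"]
    out["order_year"] = out["order_date"].dt.year
    out["order_month"] = out["order_date"].dt.month
    out["order_month_name"] = out["order_date"].dt.month_name()
    out["loss_making_flag"] = np.where(out["profit"] < 0, 1, 0)
    out["high_discount_flag"] = np.where(out["discount"] > high_discount, 1, 0)
    return out


# Hypothetical toy transactions: profitable, zero-sales, and discounted at a loss
toy = pd.DataFrame({
    "order_date": pd.to_datetime(["2014-01-03", "2014-02-10", "2014-03-15"]),
    "sales": [100.0, 0.0, 50.0],
    "profit": [20.0, 0.0, -5.0],
    "discount": [0.0, 0.0, 0.4],
})
features = add_features(toy)

print(features[["profit_margin", "discount_amount",
                "loss_making_flag", "high_discount_flag"]])
```

Keeping the threshold as a parameter makes it easy to test how sensitive the high discount flag is to the 0.25 cutoff.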
