# Data Cleaning Notebook
__This notebook handles all preprocessing steps required to prepare our churn dataset for analysis and modeling.__

**Objectives:**
- Handle data bugs and ambiguity
- Clean and standardize categorical values
- Handle missing or inconsistent data
- Rename poorly labeled columns for clarity
- Detect and fix duplicates or outliers
- Prepare data for encoding and scaling

In [105]:
import pandas as pd
import numpy as np
import pickle

__Loading the raw dataset and creating a working copy__

In [106]:
churn_df = pd.read_excel("../data/raw/E Commerce Dataset.xlsx", sheet_name="E Comm")
churn_df_copy = churn_df.copy()

In [107]:
churn_df_copy.head()

| | CustomerID | Churn | Tenure | PreferredLoginDevice | CityTier | WarehouseToHome | PreferredPaymentMode | Gender | HourSpendOnApp | NumberOfDeviceRegistered | PreferedOrderCat | SatisfactionScore | MaritalStatus | NumberOfAddress | Complain | OrderAmountHikeFromlastYear | CouponUsed | OrderCount | DaySinceLastOrder | CashbackAmount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50001 | 1 | 4.0 | Mobile Phone | 3 | 6.0 | Debit Card | Female | 3.0 | 3 | Laptop & Accessory | 2 | Single | 9 | 1 | 11.0 | 1.0 | 1.0 | 5.0 | 159.93 |
| 1 | 50002 | 1 | NaN | Phone | 1 | 8.0 | UPI | Male | 3.0 | 4 | Mobile | 3 | Single | 7 | 1 | 15.0 | 0.0 | 1.0 | 0.0 | 120.9 |
| 2 | 50003 | 1 | NaN | Phone | 1 | 30.0 | Debit Card | Male | 2.0 | 4 | Mobile | 3 | Single | 6 | 1 | 14.0 | 0.0 | 1.0 | 3.0 | 120.28 |
| 3 | 50004 | 1 | 0.0 | Phone | 3 | 15.0 | Debit Card | Male | 2.0 | 4 | Laptop & Accessory | 5 | Single | 8 | 0 | 23.0 | 0.0 | 1.0 | 3.0 | 134.07 |
| 4 | 50005 | 1 | 0.0 | Phone | 1 | 12.0 | CC | Male | NaN | 3 | Mobile | 5 | Single | 3 | 0 | 11.0 | 1.0 | 1.0 | 3.0 | 129.6 |


In [108]:
churn_df_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5630 entries, 0 to 5629
Data columns (total 20 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   CustomerID                   5630 non-null   int64  
 1   Churn                        5630 non-null   int64  
 2   Tenure                       5366 non-null   float64
 3   PreferredLoginDevice         5630 non-null   object 
 4   CityTier                     5630 non-null   int64  
 5   WarehouseToHome              5379 non-null   float64
 6   PreferredPaymentMode         5630 non-null   object 
 7   Gender                       5630 non-null   object 
 8   HourSpendOnApp               5375 non-null   float64
 9   NumberOfDeviceRegistered     5630 non-null   int64  
 10  PreferedOrderCat             5630 non-null   object 
 11  SatisfactionScore            5630 non-null   int64  
 12  MaritalStatus                5630 non-null   object 
 13  NumberOfAddress   
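
The non-null counts above show that some columns have missing values (for example, `Tenure` has 5366 non-null entries out of 5630). A minimal sketch of how the missing counts per column could be listed directly; `toy_df` and its values are hypothetical stand-ins for `churn_df_copy`, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for churn_df_copy.
toy_df = pd.DataFrame({
    "Tenure": [4.0, np.nan, np.nan, 0.0],
    "HourSpendOnApp": [3.0, 3.0, 2.0, np.nan],
    "CityTier": [3, 1, 1, 3],
})

# Count missing values per column, then keep only columns that have any.
missing_counts = toy_df.isna().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)

print(missing_counts)
```

On the real dataset the same two lines could be run against `churn_df_copy` to decide which columns need imputation.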

### Renaming Columns
We rename the misspelled column `PreferedOrderCat` to `PreferredOrderCat`, keeping the dataset's existing naming convention.

In [109]:
churn_df_copy.rename(columns={
    "PreferedOrderCat" : "PreferredOrderCat"
}, inplace=True)
# churn_df_copy.head()
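
By default, `DataFrame.rename` silently ignores labels it cannot find, so a typo in the old name would go unnoticed. One possible safeguard, sketched here on a hypothetical `toy_df`, is to pass `errors="raise"` so a missing label fails loudly.

```python
import pandas as pd

# Hypothetical toy frame standing in for churn_df_copy.
toy_df = pd.DataFrame({"PreferedOrderCat": ["Mobile", "Laptop & Accessory"]})

# errors="raise" raises a KeyError if the old label is not present,
# instead of returning the frame unchanged.
renamed_df = toy_df.rename(
    columns={"PreferedOrderCat": "PreferredOrderCat"},
    errors="raise",
)

print(renamed_df.columns.tolist())
```

This variant returns a new frame rather than using `inplace=True`; either style works, the choice here is only for illustration.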

<hr/>

### Standardizing Value Naming in Categorical Columns

__Importing variables from our exploratory_analysis notebook__

In [110]:

with open("eda_variables.pkl", "rb") as var:
    variables = pickle.load(var)

cat_cols = variables["cat_cols"]
num_cols = variables["num_cols"]

print(f"Categorical columns : {cat_cols}")
print(f"Numerical columns : {num_cols}")

Categorical columns : ['PreferredLoginDevice', 'PreferredPaymentMode', 'Gender', 'PreferedOrderCat', 'MaritalStatus']
Numerical columns : ['Tenure', 'CityTier', 'WarehouseToHome', 'HourSpendOnApp', 'NumberOfDeviceRegistered', 'SatisfactionScore', 'NumberOfAddress', 'Complain', 'OrderAmountHikeFromlastYear', 'CouponUsed', 'OrderCount', 'DaySinceLastOrder', 'CashbackAmount']
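
Note that the printed `cat_cols` still contains the old label `PreferedOrderCat`, because the list was pickled before the column was renamed. A minimal sketch of one way to bring the list in line with the renamed DataFrame; `rename_map` is a hypothetical helper name that mirrors the mapping passed to `rename`.

```python
# cat_cols as loaded from the pickle (copied from the output above).
cat_cols = ['PreferredLoginDevice', 'PreferredPaymentMode', 'Gender',
            'PreferedOrderCat', 'MaritalStatus']

# Hypothetical helper: the same mapping that was passed to DataFrame.rename.
rename_map = {"PreferedOrderCat": "PreferredOrderCat"}

# Swap in the new label where one exists; leave all other names untouched.
cat_cols = [rename_map.get(col, col) for col in cat_cols]

print(cat_cols)
```

Without this step, any later loop over `cat_cols` that indexes `churn_df_copy` would raise a `KeyError` on the old name.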


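Some categorical columns use more than one label for the same category (for example, `Mobile Phone` and `Phone`). We merge these into a single standard label per category. Cases like these may require clarification from the upstream data source.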

In [115]:
# Preferred Login Device
churn_df_copy = churn_df_copy.replace({
    'PreferredLoginDevice': {'Mobile Phone': 'Phone'}
})

# Preferred Payment Mode
churn_df_copy = churn_df_copy.replace({
    "PreferredPaymentMode": {
        "CC": "Credit Card",
        "COD": "Cash on Delivery",
        "UPI": "Unified Payments Interface",
    }
})

# Preferred Order Category
churn_df_copy = churn_df_copy.replace({
    'PreferredOrderCat': {'Mobile': 'Mobile Phone'}
})
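
To confirm that the labels were merged as intended, the unique values of each affected column can be inspected after the replacement. The sketch below uses a hypothetical `toy_df` with made-up rows (the `Computer` value in particular is an assumption, not taken from the dataset) and the same nested-dictionary form of `DataFrame.replace` used above.

```python
import pandas as pd

# Hypothetical toy frame; the values are illustrative only.
toy_df = pd.DataFrame({
    "PreferredLoginDevice": ["Mobile Phone", "Phone", "Computer"],
    "PreferredPaymentMode": ["CC", "Credit Card", "COD"],
})

# Outer keys are column names, inner dicts map old label -> new label.
mapping = {
    "PreferredLoginDevice": {"Mobile Phone": "Phone"},
    "PreferredPaymentMode": {"CC": "Credit Card", "COD": "Cash on Delivery"},
}

cleaned_df = toy_df.replace(mapping)

# Print the remaining labels per column to check for leftovers.
for col in mapping:
    print(col, sorted(cleaned_df[col].unique()))
```

Because the nested form matches whole cell values within the named column only, `Mobile` in one column can be mapped independently of `Mobile Phone` in another.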




<hr/>