# E-commerce Customer Churn Automation: Data Cleaning
**Project:** E-commerce Customer Churn Prediction  
**Goal:** Audit the data, handle missing values, fix inconsistencies, and prepare the data for the EDA stage.  
**Output:** `data_churn_cleaned.csv`, saved to the `data/processed/` folder.

In [7]:
import pandas as pd
import numpy as np
import os

# Show all columns when a DataFrame is displayed
pd.set_option('display.max_columns', None)

In [8]:
# Relative path, matching the project's folder structure
input_path = '../data/raw/data_ecommerce_customer_churn.csv'

if os.path.exists(input_path):
    # The dataset uses ';' as its delimiter
    df = pd.read_csv(input_path, sep=';')
    print("✅ Dataset loaded successfully")
    print(f"Rows: {df.shape[0]} | Columns: {df.shape[1]}")
    display(df.head())
else:
    print("❌ Error: File not found. Make sure the file is in the data/raw/ folder")

✅ Dataset loaded successfully
Rows: 3941 | Columns: 11


|   | Tenure | WarehouseToHome | NumberOfDeviceRegistered | PreferedOrderCat | SatisfactionScore | MaritalStatus | NumberOfAddress | Complain | DaySinceLastOrder | CashbackAmount | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15.00 | 29.00.00 | 4 | Laptop & Accessory | 3 | Single | 2 | 0 | 7.0 | 143.32.00 | 0 |
| 1 | 07.00 | 25.00.00 | 4 | Mobile | 1 | Married | 2 | 0 | 7.0 | 129.29.00 | 0 |
| 2 | 27.00.00 | 13.00 | 3 | Laptop & Accessory | 1 | Married | 5 | 0 | 7.0 | 168.54.00 | 0 |
| 3 | 20.00 | 25.00.00 | 4 | Fashion | 3 | Divorced | 7 | 0 | NaN | 230.27.00 | 0 |
| 4 | 30.00.00 | 15.00 | 4 | Others | 4 | Single | 8 | 0 | 8.0 | 322.17.00 | 0 |
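
The loading cell above only prints a message when the file is missing, so later cells would fail with a `NameError` because `df` was never created. A minimal sketch of a loader that raises instead is shown below. `load_raw` is a hypothetical helper name, not part of the original notebook, and the demo file content is made up.

```python
import os
import tempfile
from pathlib import Path

import pandas as pd


def load_raw(path, sep=";"):
    """Read the raw CSV, raising early if the file is missing (hypothetical helper)."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Raw dataset not found: {path}")
    return pd.read_csv(path, sep=sep)


# Demo on a throwaway file (made-up content), so it runs without the real dataset
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    Path(demo_path).write_text("Tenure;Churn\n15;0\n7;1\n")
    demo = load_raw(demo_path)

    try:
        load_raw(os.path.join(tmp, "missing.csv"))
        missing_raised = False
    except FileNotFoundError:
        missing_raised = True
```

In the notebook this would be called as `df = load_raw(input_path)`, which stops execution at the loading step rather than several cells later.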


In [9]:
print("--- 1. Check Data Types ---")
# df.info() prints its report directly and returns None, so it is not wrapped in print()
df.info()

print("\n--- 2. Check Missing Values (Percentage) ---")
missing_perc = df.isnull().mean() * 100
print(missing_perc[missing_perc > 0])

print("\n--- 3. Descriptive Statistics ---")
display(df.describe())

--- 1. Check Data Types ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3941 entries, 0 to 3940
Data columns (total 11 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Tenure                    3747 non-null   object
 1   WarehouseToHome           3772 non-null   object
 2   NumberOfDeviceRegistered  3941 non-null   int64 
 3   PreferedOrderCat          3941 non-null   object
 4   SatisfactionScore         3941 non-null   int64 
 5   MaritalStatus             3941 non-null   object
 6   NumberOfAddress           3941 non-null   int64 
 7   Complain                  3941 non-null   int64 
 8   DaySinceLastOrder         3728 non-null   object
 9   CashbackAmount            3941 non-null   object
 10  Churn                     3941 non-null   int64 
dtypes: int64(5), object(6)
memory usage: 338.8+ KB

--- 2. Check Missing Values (Percentage) ---
Tenure               4.922608
WarehouseToHome      4.288252
DaySi

|   | NumberOfDeviceRegistered | SatisfactionScore | NumberOfAddress | Complain | Churn |
|---|---|---|---|---|---|
| count | 3941.0 | 3941.0 | 3941.0 | 3941.0 | 3941.0 |
| mean | 3.679269 | 3.088302 | 4.237757 | 0.282416 | 0.171023 |
| std | 1.013938 | 1.381832 | 2.626699 | 0.450232 | 0.376576 |
| min | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 25% | 3.0 | 2.0 | 2.0 | 0.0 | 0.0 |
| 50% | 4.0 | 3.0 | 3.0 | 0.0 | 0.0 |
| 75% | 4.0 | 4.0 | 6.0 | 1.0 | 0.0 |
| max | 6.0 | 5.0 | 22.0 | 1.0 | 1.0 |
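
The type check above reports `Tenure`, `WarehouseToHome`, `DaySinceLastOrder`, and `CashbackAmount` as `object`, which suggests that some entries are not plain numbers. One way to measure this before cleaning is to count the non-null entries that `pd.to_numeric` cannot parse. The sketch below assumes a hypothetical helper `count_unparseable` and uses a made-up sample that mimics the formats seen in the preview.

```python
import pandas as pd


def count_unparseable(df, columns):
    """Count non-null entries per column that fail numeric parsing (hypothetical helper)."""
    counts = {}
    for col in columns:
        parsed = pd.to_numeric(df[col], errors="coerce")
        # Present in the raw column but NaN after parsing: not a valid number
        counts[col] = int((df[col].notna() & parsed.isna()).sum())
    return pd.Series(counts, name="unparseable")


# Made-up sample that mimics the formats seen in the preview
sample = pd.DataFrame({
    "Tenure": ["15.00", "27.00.00", None],
    "WarehouseToHome": ["29.00.00", "25.00.00", "13.00"],
})
report = count_unparseable(sample, ["Tenure", "WarehouseToHome"])
print(report)
```

A non-zero count means that coercing the column to numeric would create additional missing values on top of the ones already reported.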


In [10]:
# Columns that contain missing values in this dataset
cols_to_fix = ['Tenure', 'WarehouseToHome', 'DaySinceLastOrder']

print("--- Imputation ---")
for col in cols_to_fix:
    if col in df.columns:
        # Step 1: Force the column to numeric (avoids a TypeError in median()).
        # Note: errors='coerce' also turns unparseable strings, such as the
        # '29.00.00' seen in the preview, into NaN, so those rows are imputed too.
        df[col] = pd.to_numeric(df[col], errors='coerce')

        # Step 2: Compute the median
        median_val = df[col].median()

        # Step 3: Fill the missing values
        df[col] = df[col].fillna(median_val)
        print(f"✅ Column '{col}' filled with median: {median_val}")

# Check whether any missing values remain
print(f"\nRemaining missing values: {df.isnull().sum().sum()}")

--- Imputation ---
✅ Column 'Tenure' filled with median: 8.0
✅ Column 'WarehouseToHome' filled with median: 11.0
✅ Column 'DaySinceLastOrder' filled with median: 3.0

Remaining missing values: 0
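
The imputation above relies on `errors='coerce'`, which converts strings such as `29.00.00` to NaN before the median is computed. If those strings are ordinary numbers carrying a spurious trailing `.00` group, they could be repaired first so that real values are not replaced by the median. That is an assumption: the source does not say how the values arose. The helper name `repair_numeric` and the sample values are hypothetical.

```python
import pandas as pd


def repair_numeric(series):
    """Parse numbers, dropping one trailing '.00' group (assumed export artifact)."""

    def _fix(value):
        if pd.isna(value):
            return value
        text = str(value).strip()
        # Only touch values with exactly two dots, e.g. '143.32.00' -> '143.32'
        if text.count(".") == 2 and text.endswith(".00"):
            text = text[:-3]
        return text

    # Anything still unparseable becomes NaN and is left for imputation
    return pd.to_numeric(series.map(_fix), errors="coerce")


# Made-up sample in the same style as the preview rows
raw = pd.Series(["15.00", "29.00.00", "143.32.00", None, "07.00"])
clean = repair_numeric(raw)

# Median imputation now only fills the genuinely missing entry
filled = clean.fillna(clean.median())
print(filled)
```

Whether this repair is appropriate depends on confirming against the raw file what a value like `29.00.00` was meant to represent.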


In [11]:
# 1. Strip invisible leading/trailing whitespace
cat_cols = df.select_dtypes(include=['object']).columns
for col in cat_cols:
    df[col] = df[col].str.strip()

# 2. Standardize categories (example: merging Phone and Mobile Phone)
# Note: the preview shows the label 'Mobile'; this replace leaves that label unchanged.
if 'PreferedOrderCat' in df.columns:
    df['PreferedOrderCat'] = df['PreferedOrderCat'].replace('Mobile Phone', 'Phone')
    print("✅ 'PreferedOrderCat' categories standardized.")

# 3. Handle duplicate rows
duplicates = df.duplicated().sum()
if duplicates > 0:
    df = df.drop_duplicates()
    print(f"✅ Removed {duplicates} duplicate rows.")
else:
    print("✅ No duplicate rows found.")

✅ 'PreferedOrderCat' categories standardized.
✅ Removed 675 duplicate rows.
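
The standardization step above replaces a single label. A mapping-based variant makes every merge explicit and lets the label counts be compared before and after. In the sketch below, the mapping (treating `Mobile` and `Mobile Phone` as one category) is an assumption about the intended grouping, and the sample rows are made up.

```python
import pandas as pd

# Assumed grouping: treat both spellings as one category
CATEGORY_MAP = {"Mobile": "Mobile Phone"}

sample = pd.DataFrame({
    "PreferedOrderCat": [" Mobile", "Mobile Phone", "Fashion", "Mobile Phone"],
    "Churn": [0, 0, 1, 0],
})

# Label counts after trimming whitespace but before merging
before = sample["PreferedOrderCat"].str.strip().value_counts()

sample["PreferedOrderCat"] = sample["PreferedOrderCat"].str.strip().replace(CATEGORY_MAP)
after = sample["PreferedOrderCat"].value_counts()

# Rows that only became identical after standardization are counted here too
n_duplicates = int(sample.duplicated().sum())
deduped = sample.drop_duplicates().reset_index(drop=True)

print(before, after, n_duplicates, sep="\n")
```

Because merging labels can make previously distinct rows identical, standardization should run before the duplicate check, as it does in the cell above.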


In [12]:
# Create the processed folder if it does not exist yet
output_folder = '../data/processed'
os.makedirs(output_folder, exist_ok=True)

output_file = os.path.join(output_folder, 'data_churn_cleaned.csv')

# Save to CSV
df.to_csv(output_file, index=False, sep=';')

print("🚀 DATA CLEANING COMPLETE!")
print(f"File saved to: {output_file}")
print(f"Final Shape: {df.shape}")

🚀 DATA CLEANING COMPLETE!
File saved to: ../data/processed\data_churn_cleaned.csv
Final Shape: (3266, 11)
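
Because the file is written with `sep=';'`, any later step (such as the EDA notebook) has to read it with the same delimiter. A minimal round-trip check is sketched below. `verify_roundtrip` is a hypothetical helper, and the demo frame is made up and written to a temporary folder.

```python
import os
import tempfile

import pandas as pd


def verify_roundtrip(df, path, sep=";"):
    """Write df to CSV and confirm it reads back with the same shape and columns."""
    df.to_csv(path, index=False, sep=sep)
    reloaded = pd.read_csv(path, sep=sep)
    if reloaded.shape != df.shape or list(reloaded.columns) != list(df.columns):
        raise ValueError(f"Round-trip mismatch: {df.shape} vs {reloaded.shape}")
    return reloaded


# Made-up frame written to a temporary folder, so no project files are touched
with tempfile.TemporaryDirectory() as tmp:
    demo = pd.DataFrame({"Tenure": [8.0, 15.0], "Churn": [0, 1]})
    checked = verify_roundtrip(demo, os.path.join(tmp, "demo.csv"))
```

Reading the file back with the default comma delimiter would collapse every row into a single column, which this check would catch as a shape mismatch.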
