# **1. Dataset Introduction**

The dataset used in this experiment is **Telco Customer Churn** from Kaggle.

**Dataset Description:**
- **Source:** Kaggle - WA_Fn-UseC_-Telco-Customer-Churn.csv
- **Problem Type:** Binary Classification
- **Target Variable:** Churn (Yes/No)
- **Number of Columns:** 21 (1 customerID, 19 features, 1 target)
- **Description:** This dataset contains customer information from a telecommunications company, including demographics, subscribed services, and churn status.

**Objective:**
Predict whether a customer will churn (cancel their subscription) based on customer characteristics and the services they use.

# **2. Import Libraries**

Import all libraries needed for data analysis and preprocessing.

In [None]:
# Data manipulation
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# **3. Loading the Dataset**

Load the CSV dataset and display basic information about the data.

In [None]:
# Load dataset
df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')

# Display basic information
print("Dataset Shape:", df.shape)
print("\n" + "="*50)
print("Sample Data (first 5 rows):")
print("="*50)
df.head()

In [None]:
# Display column names and data types
print("Column Information:")
print("="*50)
df.info()

# **4. Exploratory Data Analysis (EDA)**

Perform exploratory analysis to understand the characteristics of the dataset.

## 4.1 Descriptive Statistics

In [None]:
# Descriptive statistics for numerical features
print("Descriptive Statistics - Numerical Features:")
print("="*50)
df.describe()

In [None]:
# Descriptive statistics for categorical features
print("Descriptive Statistics - Categorical Features:")
print("="*50)
df.describe(include='object')

## 4.2 Identifying Missing Values

In [None]:
# Check for missing values
print("Missing Values:")
print("="*50)
missing_values = df.isnull().sum()
missing_percentage = (missing_values / len(df)) * 100
missing_df = pd.DataFrame({
    'Missing Count': missing_values,
    'Percentage': missing_percentage
})
print(missing_df[missing_df['Missing Count'] > 0])

if missing_df['Missing Count'].sum() == 0:
    print("\nNo missing values detected by isnull()")
    print("Blank or whitespace-only values are checked next")

In [None]:
# Check for empty strings or whitespace in object columns
print("\nEmpty String / Whitespace Check:")
print("="*50)
for col in df.select_dtypes(include='object').columns:
    empty_count = df[col].str.strip().eq('').sum()
    if empty_count > 0:
        print(f"{col}: {empty_count} empty values")
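
The two checks above can disagree because a blank string is still a valid string, not a missing value. The following is a minimal sketch on a hypothetical toy column (the values are illustrative, not taken from the dataset) showing how `isnull()` misses blanks until they are coerced to numeric.

```python
import pandas as pd

# Toy frame (hypothetical values) mimicking a numeric column stored as text
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "108.15", ""]})

# isnull() sees strings, not missing values, so it finds nothing
null_count = int(toy["TotalCharges"].isnull().sum())

# Stripping whitespace and comparing with '' exposes the blank entries
blank_count = int(toy["TotalCharges"].str.strip().eq("").sum())

# Coercing to numeric turns the blanks into NaN, which isnull() can detect
coerced = pd.to_numeric(toy["TotalCharges"], errors="coerce")
nan_count = int(coerced.isnull().sum())

print(null_count, blank_count, nan_count)
```

Only after coercion do the blank entries show up as missing, which is why the whitespace check is needed before any imputation step.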

## 4.3 Target Variable Distribution (Churn)

In [None]:
# Target distribution
print("Target Variable Distribution (Churn):")
print("="*50)
churn_counts = df['Churn'].value_counts()
churn_percentage = df['Churn'].value_counts(normalize=True) * 100

print(f"No:  {churn_counts['No']} ({churn_percentage['No']:.2f}%)")
print(f"Yes: {churn_counts['Yes']} ({churn_percentage['Yes']:.2f}%)")

# Visualization
plt.figure(figsize=(10, 5))

plt.subplot(1, 2, 1)
churn_counts.plot(kind='bar', color=['#2ecc71', '#e74c3c'])
plt.title('Churn Distribution (Count)')
plt.xlabel('Churn')
plt.ylabel('Count')
plt.xticks(rotation=0)

plt.subplot(1, 2, 2)
plt.pie(churn_counts, labels=churn_counts.index, autopct='%1.1f%%',
        colors=['#2ecc71', '#e74c3c'], startangle=90)
plt.title('Churn Distribution (Percentage)')

plt.tight_layout()
plt.show()

## 4.4 Numerical Feature Analysis

In [None]:
# Identify numerical features
# Note: TotalCharges is still stored as text at this point; it is converted in section 5.1
numerical_features = ['tenure', 'MonthlyCharges', 'TotalCharges']

print("Numerical Features:")
print(numerical_features)

In [None]:
# Distribution of numerical features
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for idx, col in enumerate(numerical_features):
    # Coerce to numeric so that TotalCharges (read as text) is binned by value
    values = pd.to_numeric(df[col], errors='coerce').dropna()
    axes[idx].hist(values, bins=30, edgecolor='black', alpha=0.7)
    axes[idx].set_title(f'Distribution of {col}')
    axes[idx].set_xlabel(col)
    axes[idx].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

In [None]:
# Correlation between numerical features
plt.figure(figsize=(8, 6))

# Convert TotalCharges to numeric first for correlation
df_corr = df.copy()
df_corr['TotalCharges'] = pd.to_numeric(df_corr['TotalCharges'], errors='coerce')

corr_matrix = df_corr[numerical_features].corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0,
            square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Matrix - Numerical Features')
plt.tight_layout()
plt.show()

print("\nCorrelation Matrix:")
print(corr_matrix)

## 4.5 Categorical Feature Analysis

In [None]:
# Identify categorical features (exclude customerID, target, and TotalCharges)
# TotalCharges is read as text but holds numeric values, so it is excluded here
excluded = {'customerID', 'Churn', 'TotalCharges'}
categorical_features = [
    col for col in df.select_dtypes(include='object').columns
    if col not in excluded
]

print(f"Number of Categorical Features: {len(categorical_features)}")
print("\nList of Categorical Features:")
for i, col in enumerate(categorical_features, 1):
    print(f"{i}. {col}")

In [None]:
# Distribution of key categorical features
key_categorical = ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 
                   'Contract', 'InternetService', 'PaymentMethod']

fig, axes = plt.subplots(3, 3, figsize=(15, 12))
axes = axes.ravel()

for idx, col in enumerate(key_categorical):
    if idx < len(axes):
        df[col].value_counts().plot(kind='bar', ax=axes[idx], color='steelblue')
        axes[idx].set_title(f'Distribution of {col}')
        axes[idx].set_xlabel(col)
        axes[idx].set_ylabel('Count')
        axes[idx].tick_params(axis='x', rotation=45)

# Hide unused subplots
for idx in range(len(key_categorical), len(axes)):
    axes[idx].axis('off')

plt.tight_layout()
plt.show()

In [None]:
# Churn rate by categorical features
print("Churn Rate by Categorical Feature:")
print("="*50)

for col in key_categorical:
    churn_rate = df.groupby(col)['Churn'].apply(lambda x: (x == 'Yes').sum() / len(x) * 100)
    print(f"\n{col}:")
    print(churn_rate.round(2))
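
The per-group churn rate computed above is the share of `'Yes'` rows in each group. As a sketch on hypothetical toy data, the same figure can be obtained from the mean of a boolean Series, or from `pd.crosstab` with `normalize='index'`, which also reports the complementary `'No'` share.

```python
import pandas as pd

# Toy data (hypothetical rows, using column names from the dataset)
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "Two year", "Two year"],
    "Churn": ["Yes", "No", "No", "No"],
})

# The mean of a boolean Series is the fraction of True values
rate_by_mean = toy["Churn"].eq("Yes").groupby(toy["Contract"]).mean() * 100

# normalize='index' converts each row of the table to percentages of that row
rate_table = pd.crosstab(toy["Contract"], toy["Churn"], normalize="index") * 100

print(rate_by_mean)
print(rate_table)
```

The crosstab form is convenient when the table is passed straight to a stacked bar plot, while the boolean mean keeps the result as a single Series per feature.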

# **5. Data Preprocessing**

Preprocess the data to prepare it for modeling.

## 5.1 Converting TotalCharges to Numeric

In [None]:
# Check TotalCharges data type
print("TotalCharges data type before conversion:")
print(df['TotalCharges'].dtype)
print("\nSample values:")
print(df['TotalCharges'].head(10))

In [None]:
# Convert TotalCharges to numeric
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

print("TotalCharges data type after conversion:")
print(df['TotalCharges'].dtype)

# Check for NaN values after conversion
nan_count = df['TotalCharges'].isnull().sum()
print(f"\nNumber of NaN values after conversion: {nan_count}")

if nan_count > 0:
    print("\nRows with NaN TotalCharges:")
    print(df[df['TotalCharges'].isnull()][['customerID', 'tenure', 'MonthlyCharges', 'TotalCharges']].head())

## 5.2 Handle Missing Values

In [None]:
# Handle missing values in TotalCharges
# Strategy: fill every missing TotalCharges value with the column median

print("Handling missing values...")
print("="*50)

# Check if there are missing values
if df['TotalCharges'].isnull().sum() > 0:
    # Assign the result back instead of using inplace=True on a column selection,
    # which pandas deprecates because it may act on a temporary copy
    median_total_charges = df['TotalCharges'].median()
    df['TotalCharges'] = df['TotalCharges'].fillna(median_total_charges)
    print(f"Missing TotalCharges values filled with the median: {median_total_charges:.2f}")
else:
    print("No missing values need to be handled")

# Verify no missing values remain
total_missing = df.isnull().sum().sum()
print("\nMissing Value Verification:")
print(f"Total missing values: {total_missing}")
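
Median imputation is one option among several. The sketch below compares it with a zero fill on hypothetical toy rows, under the assumption that a missing `TotalCharges` belongs to a customer with `tenure` 0 who has not been billed yet. That assumption should be checked against the rows printed in section 5.1 before the alternative is adopted.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical values); the first customer has no billing history
toy = pd.DataFrame({
    "tenure": [0, 12, 24],
    "MonthlyCharges": [50.0, 20.0, 80.0],
    "TotalCharges": [np.nan, 240.0, 1920.0],
})

# Median imputation, as used in the cell above
median_filled = toy["TotalCharges"].fillna(toy["TotalCharges"].median())

# Alternative (assumption): tenure 0 means nothing has been charged yet
is_new_customer = toy["TotalCharges"].isna() & toy["tenure"].eq(0)
zero_filled = toy["TotalCharges"].mask(is_new_customer, 0.0)

print(median_filled.tolist())
print(zero_filled.tolist())
```

The median gives the new customer a total equal to a typical long-term customer, whereas the zero fill keeps `TotalCharges` consistent with `tenure`. Which is preferable depends on how the downstream model uses the column.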

## 5.3 Drop CustomerID

In [None]:
# Drop customerID as it's not useful for modeling
print("Dropping customerID column...")
df_processed = df.drop('customerID', axis=1)
print(f"Shape after dropping customerID: {df_processed.shape}")

## 5.4 Encoding Categorical Features

In [None]:
# Create a copy for encoding
df_encoded = df_processed.copy()

# Identify categorical columns (exclude target)
categorical_cols = df_encoded.select_dtypes(include='object').columns.tolist()
categorical_cols.remove('Churn')  # Remove target from encoding list

print(f"Number of categorical columns to encode: {len(categorical_cols)}")
print("\nCategorical columns:")
for col in categorical_cols:
    print(f"- {col}: {df_encoded[col].nunique()} unique values")

In [None]:
# Apply Label Encoding to all categorical features
label_encoders = {}

for col in categorical_cols:
    le = LabelEncoder()
    df_encoded[col] = le.fit_transform(df_encoded[col])
    label_encoders[col] = le
    
print("Label Encoding complete!")
print("\nSample encoded data:")
print(df_encoded.head())
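
Label encoding replaces each category with an integer based on sorted class order, which introduces a ranking that nominal features such as `PaymentMethod` do not have. Tree-based models are usually insensitive to this, but linear or distance-based models may not be. The following sketch uses hypothetical toy values and swaps in one-hot encoding via `pd.get_dummies` as an alternative. It is not what the notebook does.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column (hypothetical category values)
toy = pd.DataFrame(
    {"PaymentMethod": ["Mailed check", "Electronic check", "Credit card"]}
)

# LabelEncoder numbers the classes in sorted order: an artificial ranking
codes = LabelEncoder().fit_transform(toy["PaymentMethod"])

# One-hot encoding creates one 0/1 indicator column per category instead
one_hot = pd.get_dummies(toy, columns=["PaymentMethod"], dtype=int)

print(codes)
print(one_hot.columns.tolist())
```

One-hot encoding widens the feature matrix by one column per category, so the trade-off is dimensionality against the absence of an implied order.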

In [None]:
# Encode target variable (Churn)
print("Encoding target variable (Churn)...")
le_target = LabelEncoder()
df_encoded['Churn'] = le_target.fit_transform(df_encoded['Churn'])

print("Mapping:")
for i, label in enumerate(le_target.classes_):
    print(f"  {label} -> {i}")

print("\nTarget distribution after encoding:")
print(df_encoded['Churn'].value_counts().sort_index())
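
The `No -> 0`, `Yes -> 1` mapping follows from `LabelEncoder` sorting its classes, not from any meaning attached to the labels. As a small sketch on hypothetical toy labels, the mapping can be read from `classes_`, or made explicit with `Series.map` so that it does not depend on sort order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy target (hypothetical labels)
churn = pd.Series(["No", "Yes", "No"])

# classes_ is sorted, so the position of each label is its encoded value
le = LabelEncoder().fit(churn)
mapping = {label: code for code, label in enumerate(le.classes_)}

# Explicit alternative: the positive class is fixed by hand
explicit = churn.map({"No": 0, "Yes": 1})

print(mapping)
print(explicit.tolist())
```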

## 5.5 Scaling Numerical Features

In [None]:
# Prepare data for scaling
df_scaled = df_encoded.copy()

# Define numerical features to scale
numerical_features_to_scale = ['tenure', 'MonthlyCharges', 'TotalCharges']

print("Numerical features to scale:")
print(numerical_features_to_scale)
print("\nStatistics before scaling:")
print(df_scaled[numerical_features_to_scale].describe())

In [None]:
# Apply StandardScaler
scaler = StandardScaler()
df_scaled[numerical_features_to_scale] = scaler.fit_transform(df_scaled[numerical_features_to_scale])

print("Scaling complete!")
print("\nStatistics after scaling:")
print(df_scaled[numerical_features_to_scale].describe())
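
`StandardScaler` transforms each column as `z = (x - mean) / std`, where `std` is the population standard deviation (`ddof=0`). The sketch below checks this by hand on a hypothetical toy column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy column (hypothetical values), shaped (n_samples, n_features)
x = np.array([[1.0], [2.0], [3.0], [4.0]])

scaled = StandardScaler().fit_transform(x)

# Same result by hand: subtract the mean, divide by the population std
# (np.std uses ddof=0 by default)
manual = (x - x.mean(axis=0)) / x.std(axis=0)

print(np.allclose(scaled, manual))
```

Because `DataFrame.describe()` reports the sample standard deviation (`ddof=1`), the `std` row printed after scaling is expected to be slightly above 1 and not exactly 1.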

## 5.6 Separating Features and Target

In [None]:
# Separate features and target
X = df_scaled.drop('Churn', axis=1)
y = df_scaled['Churn']

print("Feature and target separation complete!")
print("="*50)
print(f"Shape X (features): {X.shape}")
print(f"Shape y (target): {y.shape}")
print(f"\nNumber of features: {X.shape[1]}")
print(f"Number of samples: {X.shape[0]}")

In [None]:
# Verify all data is numeric
print("\nData type verification:")
print("="*50)
print("Features (X):")
print(X.dtypes.value_counts())
print("\nTarget (y):")
print(y.dtype)

In [None]:
# Final check - no missing values
print("\nFinal verification - Missing values:")
print("="*50)
print(f"Missing values in X: {X.isnull().sum().sum()}")
print(f"Missing values in y: {y.isnull().sum()}")

if X.isnull().sum().sum() == 0 and y.isnull().sum() == 0:
    print("\nData is ready for modeling!")

## 5.7 Saving the Preprocessed Data

In [None]:
# Save preprocessed data for modeling
df_final = pd.concat([X, y], axis=1)
df_final.to_csv('preprocessed_data.csv', index=False)

print("Preprocessed data saved successfully!")
print("File: preprocessed_data.csv")
print(f"Shape: {df_final.shape}")
print("\nSample data:")
print(df_final.head())
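
The CSV stores only the transformed values. To apply the same transformation to new customers later, the fitted `scaler` and `label_encoders` would also need to be persisted. The following is a minimal sketch using `joblib` with a toy scaler. The file name `scaler.joblib` and the temporary directory are assumptions for illustration, not part of the notebook.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy scaler fitted on hypothetical values (mean 5, population std 5)
scaler = StandardScaler().fit(np.array([[0.0], [10.0]]))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "scaler.joblib")  # hypothetical file name
    joblib.dump(scaler, path)       # save the fitted object
    restored = joblib.load(path)    # reload it, e.g. in a serving script

# The restored scaler applies the statistics learned at fit time
new_value = restored.transform(np.array([[5.0]]))
print(new_value)
```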

# **6. Summary**

## Dataset Summary:
- Original dataset: 7043 rows, 21 columns
- Target: Churn (binary classification)
- Numerical features: tenure, MonthlyCharges, TotalCharges
- Categorical features: 16 columns (15 text columns plus the binary SeniorCitizen flag)

## Preprocessing Steps:
1. Convert TotalCharges from object to numeric
2. Handle missing values (fill with the median)
3. Drop customerID (not relevant for modeling)
4. Encode categorical features using Label Encoding
5. Encode the target variable (No=0, Yes=1)
6. Scale numerical features using StandardScaler
7. Separate features (X) and target (y)

## Output:
- Final data: all numeric, no missing values
- Saved file: preprocessed_data.csv
- Data is ready for the modeling stage with MLflow