# Customer Churn EDA
**Author**: Muhammad Ali Syed  
**Date**: 02/08/2025  
**Purpose**: Initial data exploration for churn prediction model

## Business Context
Marketing needs to identify customers likely to churn within 30 days for retention campaigns.


In [23]:
# Industry practice: Consistent imports at the top
import sys
sys.path.append('..')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Our project modules
from src.config import RAW_DATA_PATH, RANDOM_SEED

# Configuration
pd.set_option('display.max_columns', None)
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

# Set random seed for reproducibility
np.random.seed(RANDOM_SEED)


## Data Loading and Initial Inspection

In [24]:
# Load data with error handling
data_path = RAW_DATA_PATH / "WA_Fn-UseC_-Telco-Customer-Churn.csv"

try:
    df = pd.read_csv(data_path)
    print(f"✓ Data loaded successfully: {df.shape[0]:,} rows, {df.shape[1]} columns")
except FileNotFoundError:
    print("✗ Data file not found. Please run: python src/data_downloader.py")
    raise


✓ Data loaded successfully: 7,043 rows, 21 columns


## Data Overview

In [25]:
print("=== Data Overview ===")
print(f"Shape: {df.shape}")
print(f"\nColumn Types:\n{df.dtypes.value_counts()}")
print(f"\nTarget:\n{df['Churn'].value_counts()}")
print(f"\nTarget Distribution:\n{df['Churn'].value_counts(normalize=True)}")

=== Data Overview ===
Shape: (7043, 21)

Column Types:
object     18
int64       2
float64     1
Name: count, dtype: int64

Target:
Churn
No     5174
Yes    1869
Name: count, dtype: int64

Target Distribution:
Churn
No     0.73463
Yes    0.26537
Name: proportion, dtype: float64
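
The split above sets a baseline worth keeping in mind: a model that always predicts "No" would already be right roughly 73.5% of the time, so accuracy alone is a weak yardstick for this target. A minimal sketch of that arithmetic, using the counts printed above (the variable names are illustrative and not taken from the project code):

```python
# Class counts copied from the value_counts() output above
n_retained = 5174  # Churn == "No"
n_churned = 1869   # Churn == "Yes"
n_total = n_retained + n_churned

# Accuracy of a trivial classifier that always predicts the majority class
majority_baseline = n_retained / n_total

# Retained-to-churned ratio, one possible starting point for class weights
imbalance_ratio = n_retained / n_churned

print(f"Total customers: {n_total:,}")
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
print(f"Retained-to-churned ratio: {imbalance_ratio:.2f}")
```

Metrics such as recall or precision on the "Yes" class are likely to be more informative than accuracy when the retention campaign targets the minority class.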


## Missing Values Analysis

In [26]:
missing = df.isnull().sum()
if missing.sum() > 0:
    print("\nColumns with missing values:")
    print(missing[missing > 0])
else:
    print("\nNo missing values")


No missing values
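
Note that `isnull()` only counts true `NaN`/`None` values. The column-type summary above lists a single `float64` column, which suggests `TotalCharges` was read as `object`, and blank or whitespace-only strings in such a column would not be counted as missing. A minimal sketch of a check for that case, run on a small made-up frame (the toy data and the helper name `count_blank_strings` are assumptions for illustration, not part of the project):

```python
import pandas as pd


def count_blank_strings(frame: pd.DataFrame) -> pd.Series:
    """Count empty or whitespace-only entries per column."""
    # Cast every column to string so the check works regardless of dtype
    blank_mask = frame.apply(lambda col: col.astype(str).str.strip().eq(""))
    return blank_mask.sum()


# Toy data: one blank TotalCharges entry that isnull() would not flag
toy = pd.DataFrame({
    "customerID": ["A1", "B2", "C3"],
    "TotalCharges": ["29.85", " ", "108.15"],
})

blank_counts = count_blank_strings(toy)
print(blank_counts[blank_counts > 0])

# Coercing to numeric turns the blank into NaN, so it then counts as missing
total_charges = pd.to_numeric(toy["TotalCharges"], errors="coerce")
print(f"Missing after coercion: {total_charges.isnull().sum()}")
```

If the same check on `df` reports any blanks, those rows would need an explicit decision (drop, impute, or treat as zero) before modelling.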


## Business-Relevant Questions

In [27]:
df.head()

,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


What's the average tenure of churned vs retained customers?

In [28]:
churned_tenure = df.loc[df['Churn'] == 'Yes', 'tenure'].mean()
retained_tenure = df.loc[df['Churn'] == 'No', 'tenure'].mean()

print(f"\nMean tenure of churned customers: {churned_tenure:.2f}")
print(f"\nMean tenure of retained customers: {retained_tenure:.2f}")


Mean tenure of churned customers: 17.98

Mean tenure of retained customers: 37.57
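
Means can hide skew, so it may help to report a median and a group size alongside each mean. A minimal sketch of the same comparison with `groupby`, shown on a small made-up frame rather than the real data (the toy values are assumptions for illustration):

```python
import pandas as pd

# Toy stand-in for the Telco frame; the values are made up
toy = pd.DataFrame({
    "Churn": ["Yes", "Yes", "Yes", "No", "No", "No"],
    "tenure": [1, 2, 30, 10, 40, 70],
})

# One row per Churn group with count, mean, and median tenure
tenure_summary = toy.groupby("Churn")["tenure"].agg(["count", "mean", "median"])
print(tenure_summary)
```

In the toy data the churned group has a mean of 11 but a median of 2, which shows how a single long-tenure customer can pull the mean upward.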


Is there a relationship between monthly charges and churn?

In [29]:
churned_monthly_charge = df.loc[df['Churn'] == 'Yes', 'MonthlyCharges'].mean()
retained_monthly_charge = df.loc[df['Churn'] == 'No', 'MonthlyCharges'].mean()

print(f"\nMean monthly charge for churned customers: {churned_monthly_charge:.2f}")
print(f"\nMean monthly charge for retained customers: {retained_monthly_charge:.2f}")


Mean monthly charge for churned customers: 74.44

Mean monthly charge for retained customers: 61.27
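
The two means above differ by 13.17 (74.44 minus 61.27), which points to a relationship but does not say how strong it is. One way to put a number on it is the point-biserial correlation, which is simply the Pearson correlation between a numeric column and a 0/1 flag. A minimal sketch on made-up data (the toy values are assumptions for illustration, and this is one possible measure rather than the notebook's chosen method):

```python
import pandas as pd

# Toy stand-in for the Telco frame; the values are made up
toy = pd.DataFrame({
    "Churn": ["Yes", "Yes", "No", "No", "No", "Yes"],
    "MonthlyCharges": [80.0, 95.5, 20.0, 55.0, 60.0, 70.0],
})

# Encode the target as 1 (churned) / 0 (retained)
churn_flag = (toy["Churn"] == "Yes").astype(int)

# Pearson correlation with a binary variable is the point-biserial correlation
r = toy["MonthlyCharges"].corr(churn_flag)

# Difference in group means, in the same units as MonthlyCharges
group_means = toy.groupby("Churn")["MonthlyCharges"].mean()
mean_gap = group_means["Yes"] - group_means["No"]

print(f"Point-biserial correlation: {r:.3f}")
print(f"Mean gap (churned - retained): {mean_gap:.2f}")
```

A positive `r` means higher charges go with churn; neither number on its own shows that charges cause churn, since contract type or service mix may drive both.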


Which services are churned customers most likely to have?
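
A caveat for this question: the share of churned customers who hold a service partly reflects how common that service is among all customers. A complementary view is the churn rate within each category of a service. A minimal sketch on made-up data, using `pd.crosstab` with `normalize="index"` (the toy values are assumptions for illustration):

```python
import pandas as pd

# Toy stand-in for the Telco frame; the values are made up
toy = pd.DataFrame({
    "InternetService": ["Fiber optic", "Fiber optic", "Fiber optic", "DSL", "DSL", "No"],
    "Churn": ["Yes", "Yes", "No", "No", "Yes", "No"],
})

# Row-normalised crosstab: each row sums to 1, so the "Yes" column
# is the churn rate within that service category
churn_rate = pd.crosstab(
    toy["InternetService"], toy["Churn"], normalize="index"
)["Yes"]

print(churn_rate.sort_values(ascending=False))
```

Reading both views together helps separate "most churners have this service" from "customers with this service churn more often".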

In [34]:
services_cols = [
    'PhoneService', 'MultipleLines', 'InternetService',
    'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
    'TechSupport', 'StreamingTV', 'StreamingMovies',
]
churned_df = df[df['Churn'] == 'Yes']

# Share of churned customers in each category of every service column
for col in services_cols:
    print(f"\n===== {col} =====")
    print(churned_df[col].value_counts(normalize=True))


===== PhoneService =====
PhoneService
Yes    0.909042
No     0.090958
Name: proportion, dtype: float64

===== MultipleLines =====
MultipleLines
Yes                 0.454789
No                  0.454254
No phone service    0.090958
Name: proportion, dtype: float64

===== InternetService =====
InternetService
Fiber optic    0.693954
DSL            0.245586
No             0.060460
Name: proportion, dtype: float64

===== OnlineSecurity =====
OnlineSecurity
No                     0.781701
Yes                    0.157838
No internet service    0.060460
Name: proportion, dtype: float64

===== OnlineBackup =====
OnlineBackup
No                     0.659711
Yes                    0.279829
No internet service    0.060460
Name: proportion, dtype: float64

===== DeviceProtection =====
DeviceProtection
No                     0.64794
Yes                    0.29160
No internet service    0.06046
Name: proportion, dtype: float64

===== TechSupport =====
TechSupport
No                     0.773676
Yes