# Telco Customer Churn Analysis - Part 1: Data Exploration

**Project**: SpecSailor - Telco Customer Churn Prediction

**Author**: SpecSailor Team

**Date**: November 2025

## Overview
This notebook performs exploratory data analysis on the Telco Customer Churn dataset to understand:
- Dataset structure and size
- Missing values and data quality
- Feature distributions
- Churn patterns and correlations

## Dataset Information
The Telco Customer Churn dataset contains **7,043 customers** from a telecommunications company in 21 columns (a customer ID, 19 features, and the target), including:
- Customer demographics (gender, senior citizen, dependents)
- Account information (tenure, contract type, payment method)
- Services subscribed (phone, internet, streaming, etc.)
- Charges (monthly charges, total charges)
- **Target**: Churn (Yes/No)

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("Libraries loaded successfully!")

In [None]:
# Load the Telco Customer Churn dataset
# Note: Download from https://www.kaggle.com/datasets/blastchar/telco-customer-churn
# Place in ../data/raw/WA_Fn-UseC_-Telco-Customer-Churn.csv

df = pd.read_csv('../data/raw/WA_Fn-UseC_-Telco-Customer-Churn.csv')

print(f"Dataset shape: {df.shape}")
print(f"Total customers: {len(df):,}")
print("\nFirst few rows:")
df.head()

In [None]:
# Basic dataset information
print("=" * 60)
print("DATASET INFO")
print("=" * 60)
df.info()

In [None]:
# Statistical summary (numeric columns only; TotalCharges is read as text, so describe() skips it)
print("\n" + "=" * 60)
print("STATISTICAL SUMMARY")
print("=" * 60)
df.describe()

In [None]:
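# Check for missing values
# TotalCharges is read as text and marks missing entries with blank strings,
# not NaN, so isnull() alone reports nothing for it. Treat whitespace-only
# strings as missing before counting.
print("\n" + "=" * 60)
print("MISSING VALUES")
print("=" * 60)
missing = df.replace(r'^\s*$', np.nan, regex=True).isnull().sum()
missing_pct = (missing / len(df)) * 100
missing_df = pd.DataFrame({
    'Missing Count': missing,
    'Percentage': missing_pct
})
print(missing_df[missing_df['Missing Count'] > 0].sort_values('Missing Count', ascending=False))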

In [None]:
# Target variable distribution
print("\n" + "=" * 60)
print("CHURN DISTRIBUTION")
print("=" * 60)

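# Fix the order explicitly so the 'Retained'/'Churned' labels below always match
churn_counts = df['Churn'].value_counts().reindex(['No', 'Yes'])
churn_pct = df['Churn'].value_counts(normalize=True).reindex(['No', 'Yes']) * 100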

print(f"\nChurn = No:  {churn_counts['No']:,} ({churn_pct['No']:.1f}%)")
print(f"Churn = Yes: {churn_counts['Yes']:,} ({churn_pct['Yes']:.1f}%)")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar plot
churn_counts.plot(kind='bar', ax=axes[0], color=['#2ecc71', '#e74c3c'])
axes[0].set_title('Churn Distribution (Count)', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Churn')
axes[0].set_ylabel('Number of Customers')
axes[0].set_xticklabels(['Retained', 'Churned'], rotation=0)

# Pie chart
axes[1].pie(churn_pct, labels=['Retained', 'Churned'], autopct='%1.1f%%',
            colors=['#2ecc71', '#e74c3c'], startangle=90)
axes[1].set_title('Churn Distribution (Percentage)', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

print(f"\nClass Imbalance Ratio: {churn_pct['No'] / churn_pct['Yes']:.2f}:1")
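
One common way to account for this imbalance later, assuming the XGBoost classifier planned for model training, is to weight the positive class by the ratio of negative to positive examples. The sketch below computes that ratio on a small made-up label series. `imbalance_weight` is a hypothetical helper name, and using the result as XGBoost's `scale_pos_weight` is a suggestion, not a decision this notebook has made.

```python
import pandas as pd


def imbalance_weight(labels: pd.Series, positive: str = "Yes") -> float:
    """Return (# negative) / (# positive) for a binary label series."""
    n_pos = int((labels == positive).sum())
    n_neg = int(len(labels) - n_pos)
    if n_pos == 0:
        raise ValueError("no positive examples in labels")
    return n_neg / n_pos


# Toy labels for illustration only (not the real dataset): 3 retained, 1 churned
toy_labels = pd.Series(["No", "No", "Yes", "No"])
weight = imbalance_weight(toy_labels)
print(f"scale_pos_weight candidate: {weight:.2f}")
```

On the real data the same call would be `imbalance_weight(df['Churn'])`, which should match the ratio printed above.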

In [None]:
# Numeric features distribution
numeric_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']

fig, axes = plt.subplots(1, 3, figsize=(16, 4))

for idx, col in enumerate(numeric_cols):
    # TotalCharges is read as text because some rows hold blank strings;
    # coerce it to numeric so those rows become NaN and are dropped below
    if col == 'TotalCharges':
        data = pd.to_numeric(df[col], errors='coerce')
    else:
        data = df[col]
    
    axes[idx].hist(data.dropna(), bins=30, edgecolor='black', alpha=0.7)
    axes[idx].set_title(f'{col} Distribution', fontweight='bold')
    axes[idx].set_xlabel(col)
    axes[idx].set_ylabel('Frequency')

plt.tight_layout()
plt.show()
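
The histograms show how tenure is distributed but not how churn varies with it. A minimal sketch of that comparison, assuming tenure is bucketed with `pd.cut`, follows. The helper name `churn_rate_by_tenure`, the bucket edges, and the toy rows are all illustrative assumptions, not values taken from the dataset.

```python
import pandas as pd


def churn_rate_by_tenure(frame: pd.DataFrame, edges, labels) -> pd.Series:
    """Percentage of churned customers within each tenure bucket."""
    # include_lowest=True keeps tenure == 0 inside the first bucket
    buckets = pd.cut(frame["tenure"], bins=edges, labels=labels, include_lowest=True)
    churned = frame["Churn"].eq("Yes")
    return churned.groupby(buckets, observed=False).mean() * 100


# Toy rows for illustration only
toy = pd.DataFrame({
    "tenure": [1, 3, 5, 20, 30, 70],
    "Churn": ["Yes", "Yes", "No", "No", "Yes", "No"],
})
rates = churn_rate_by_tenure(toy, edges=[0, 6, 24, 72], labels=["0-6", "7-24", "25-72"])
print(rates.round(1))
```

Passing `df` instead of `toy` would give the per-bucket churn rate for the real customers.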

In [None]:
# Categorical features overview
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
# Drop the ID, the target, and TotalCharges (numeric, but read as text because of blank strings)
for col in ('customerID', 'Churn', 'TotalCharges'):
    if col in categorical_cols:
        categorical_cols.remove(col)

print("\n" + "=" * 60)
print("CATEGORICAL FEATURES")
print("=" * 60)
print(f"\nTotal categorical features: {len(categorical_cols)}")
print(f"\nFeatures: {', '.join(categorical_cols)}")
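
Beyond listing the column names, it can help to see how many distinct values each categorical feature has, since that affects the encoding choice later. The sketch below is one way to tabulate this. `summarize_categories` is a hypothetical helper, and the toy frame stands in for `df`.

```python
import pandas as pd


def summarize_categories(frame: pd.DataFrame, columns) -> pd.DataFrame:
    """One row per column: number of distinct values and the most common one."""
    rows = []
    for col in columns:
        counts = frame[col].value_counts()
        rows.append({
            "feature": col,
            "n_unique": int(frame[col].nunique()),
            "top_value": counts.index[0],
            "top_share_pct": round(100 * counts.iloc[0] / len(frame), 1),
        })
    return pd.DataFrame(rows).set_index("feature")


# Toy rows for illustration only
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "Two year", "One year"],
    "PaperlessBilling": ["Yes", "No", "Yes", "Yes"],
})
summary = summarize_categories(toy, ["Contract", "PaperlessBilling"])
print(summary)
```

In the notebook this would be called as `summarize_categories(df, categorical_cols)`.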

In [None]:
# Churn by Contract Type
contract_churn = pd.crosstab(df['Contract'], df['Churn'], normalize='index') * 100

contract_churn.plot(kind='bar', figsize=(10, 6), color=['#2ecc71', '#e74c3c'])
plt.title('Churn Rate by Contract Type', fontsize=14, fontweight='bold')
plt.xlabel('Contract Type')
plt.ylabel('Percentage (%)')
plt.legend(['Retained', 'Churned'])
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

print("\nChurn Rate by Contract Type:")
print(contract_churn)
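
The same breakdown can be repeated for any categorical feature. The sketch below generalizes it and also reports group sizes, because a percentage alone hides how many customers sit behind it. `churn_by_category` is a hypothetical helper, shown on made-up rows.

```python
import pandas as pd


def churn_by_category(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Churn rate (%) and group size for each value of `column`."""
    churned = frame["Churn"].eq("Yes")
    grouped = churned.groupby(frame[column])
    result = pd.DataFrame({
        "customers": grouped.size(),
        "churn_rate_pct": grouped.mean() * 100,
    })
    # Highest-churn categories first
    return result.sort_values("churn_rate_pct", ascending=False)


# Toy rows for illustration only
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "Two year", "Two year"],
    "Churn": ["Yes", "No", "No", "No"],
})
table = churn_by_category(toy, "Contract")
print(table)
```

For example, `churn_by_category(df, 'PaymentMethod')` would rank payment methods by churn rate.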

In [None]:
# Key insights from exploration
print("\n" + "=" * 60)
print("KEY FINDINGS FROM EXPLORATORY ANALYSIS")
print("=" * 60)
print("""
1. Dataset Size: 7,043 customers
   - Churned: ~26.5%
   - Retained: ~73.5%

2. Class Imbalance: Moderate (2.8:1 ratio)
   - Will need to consider in model training

3. Contract Type Impact:
   - Month-to-month contracts show HIGHEST churn (~43%)
   - Two-year contracts show LOWEST churn (~3%)
   - Clear indicator of customer commitment

4. Tenure Distribution:
   - Bimodal distribution
   - Many new customers (<6 months) and long-term (>60 months)
   - Early-stage churn may be critical; churn by tenure is not yet measured here

5. Data Quality:
   - TotalCharges has 11 blank entries stored as text (need cleaning)
   - Most features are categorical (will need encoding)
""")

## Next Steps

Based on this exploration, we will proceed to:
1. **Data Cleaning** (Notebook 02): Handle missing values, fix data types
2. **Feature Engineering** (Notebook 03): Create predictive features
3. **Model Training** (Notebook 04): Train XGBoost classifier
4. **Model Evaluation** (Notebook 05): Assess performance and tune
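
As a preview of the data cleaning step, the sketch below shows one possible way to fix the `TotalCharges` type and encode the target. It is an assumption about what Notebook 02 might do, not its actual contents. `basic_clean` is a hypothetical name, and filling blank totals with 0 is a labeled assumption; dropping those rows is an equally valid choice.

```python
import pandas as pd


def basic_clean(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with TotalCharges as float and Churn as 0/1."""
    out = frame.copy()
    # Blank strings become NaN instead of raising an error
    out["TotalCharges"] = pd.to_numeric(out["TotalCharges"], errors="coerce")
    # Assumption: a blank total means nothing has been billed yet, so use 0
    out["TotalCharges"] = out["TotalCharges"].fillna(0.0)
    out["Churn"] = out["Churn"].map({"No": 0, "Yes": 1})
    return out


# Toy rows for illustration only
toy = pd.DataFrame({
    "tenure": [0, 12],
    "TotalCharges": [" ", "350.5"],
    "Churn": ["No", "Yes"],
})
cleaned = basic_clean(toy)
print(cleaned.dtypes)
```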