# Customer Churn Prediction - Exploratory Data Analysis and Preprocessing

This notebook provides a comprehensive analysis of the Telco Customer Churn dataset and implements a preprocessing pipeline for machine learning models.

## Table of Contents
1. [Import Required Libraries](#1-import-required-libraries)
2. [Load and Explore the Dataset](#2-load-and-explore-the-dataset)
3. [Exploratory Data Analysis](#3-exploratory-data-analysis)
4. [Data Preprocessing and Feature Engineering](#4-data-preprocessing-and-feature-engineering)
5. [Summary](#summary)

## 1. Import Required Libraries

Import the libraries used in this project: pandas, numpy, matplotlib, seaborn, plotly, scikit-learn, xgboost, and joblib.

In [None]:
# Data manipulation and analysis
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Machine learning libraries
import sklearn
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    classification_report, confusion_matrix, roc_auc_score, 
    roc_curve, precision_recall_curve, accuracy_score
)

# XGBoost
import xgboost as xgb
from xgboost import XGBClassifier

# Model persistence
import joblib

# System and file operations
import os
import sys
from pathlib import Path

# Warnings
import warnings
warnings.filterwarnings('ignore')

# Set up plotting parameters
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

print("✅ All libraries imported successfully!")
print(f"📊 Pandas version: {pd.__version__}")
print(f"🤖 Scikit-learn version: {sklearn.__version__}")
print(f"🚀 XGBoost version: {xgb.__version__}")

## 2. Load and Explore the Dataset

Load the Telco Customer Churn dataset (published on Kaggle) from a local CSV file and perform initial exploration: shape, column info, and basic statistics.

In [None]:
# Load the dataset
data_path = 'data/WA_Fn-UseC_-Telco-Customer-Churn.csv'
try:
    df = pd.read_csv(data_path)
    print(f"Dataset loaded successfully from {data_path}")
except FileNotFoundError:
    # Every later cell needs df, so stop here instead of failing later with a NameError
    print(f"Dataset not found at {data_path}")
    raise

# Display information about the dataset
print(f"\nDataset Shape: {df.shape}")
print(f"Number of customers: {df.shape[0]:,}")
print(f"Number of columns: {df.shape[1]}")

In [None]:
# Display first few rows
display(df.head())

In [None]:
# Dataset information
df.info()

In [None]:
# Basic statistics
display(df.describe())

In [None]:
# Check for missing values
print("Missing Values Analysis:")
missing_values = df.isnull().sum()
missing_percentage = (missing_values / len(df)) * 100

missing_df = pd.DataFrame({
    'Missing Values': missing_values,
    'Percentage': missing_percentage
})

missing_df = missing_df[missing_df['Missing Values'] > 0].sort_values('Missing Values', ascending=False)

if len(missing_df) > 0:
    display(missing_df)
else:
    print("No missing values found!")
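
`isnull()` only counts true missing values. If `TotalCharges` is read as text, blank strings in that column would not show up in the check above. A minimal sketch on a made-up frame (the values are illustrative, not taken from the dataset) of how coercing to numeric exposes them:

```python
import pandas as pd

# Toy frame with illustrative values: one TotalCharges entry is a blank string
toy = pd.DataFrame({
    "tenure": [0, 12, 24],
    "TotalCharges": [" ", "350.5", "1200.0"],
})

# isnull() sees a string here, not a missing value
missing_before = int(toy["TotalCharges"].isnull().sum())

# Coercing to numeric turns unparseable entries into NaN
total_numeric = pd.to_numeric(toy["TotalCharges"], errors="coerce")
missing_after = int(total_numeric.isnull().sum())

print(f"Missing before coercion: {missing_before}")
print(f"Missing after coercion:  {missing_after}")
```

Running the same two counts on `df['TotalCharges']` would show whether the real column is affected.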

In [None]:
# Target variable distribution
print("Target Variable (Churn) Distribution:")
churn_counts = df['Churn'].value_counts()
churn_percentage = df['Churn'].value_counts(normalize=True) * 100

churn_summary = pd.DataFrame({
    'Count': churn_counts,
    'Percentage': churn_percentage
})

display(churn_summary)

# Visualize target distribution
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Count plot
sns.countplot(data=df, x='Churn', ax=ax1)
ax1.set_title('Churn Distribution (Count)')
ax1.set_ylabel('Count')

# Pie chart
ax2.pie(churn_counts.values, labels=churn_counts.index, autopct='%1.1f%%', startangle=90)
ax2.set_title('Churn Distribution (Percentage)')

plt.tight_layout()
plt.show()

churn_rate = (df['Churn'] == 'Yes').mean() * 100
print(f"\nOverall Churn Rate: {churn_rate:.2f}%")

## 3. Exploratory Data Analysis

Visualize data distributions, correlations, and churn patterns using matplotlib and seaborn. Analyze categorical and numerical features.

In [None]:
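# TotalCharges may be read as text (object dtype); convert it first so it is
# classified as numerical. Unparseable entries become NaN.
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

# Separate categorical and numerical features
categorical_features = df.select_dtypes(include=['object']).columns.tolist()
numerical_features = df.select_dtypes(include=['int64', 'float64']).columns.tolist()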

# Remove customerID from categorical features
if 'customerID' in categorical_features:
    categorical_features.remove('customerID')

# Remove target from categorical features
if 'Churn' in categorical_features:
    categorical_features.remove('Churn')

print(f"Categorical Features ({len(categorical_features)}): {categorical_features}")
print(f"Numerical Features ({len(numerical_features)}): {numerical_features}")

In [None]:
# Analyze categorical features
print("🔍 Categorical Features Analysis:")

for feature in categorical_features[:6]:  # Display first 6 features
    print(f"\n{feature}:")
    value_counts = df[feature].value_counts()
    print(value_counts)
    
    # Calculate churn rate for each category
    churn_rate_by_category = df.groupby(feature)['Churn'].apply(lambda x: (x == 'Yes').mean() * 100)
    print(f"Churn rate by {feature}:")
    print(churn_rate_by_category.round(2))
    print("-" * 50)

In [None]:
# Visualize categorical features vs churn
fig, axes = plt.subplots(3, 3, figsize=(20, 15))
axes = axes.ravel()

for idx, feature in enumerate(categorical_features[:9]):
    # Create cross-tabulation
    crosstab = pd.crosstab(df[feature], df['Churn'], normalize='index') * 100
    
    crosstab.plot(kind='bar', ax=axes[idx], color=['skyblue', 'salmon'])
    axes[idx].set_title(f'Churn Rate by {feature}')
    axes[idx].set_ylabel('Percentage')
    axes[idx].legend(['No Churn', 'Churn'])
    axes[idx].tick_params(axis='x', rotation=45)

# Hide any subplots that were not used
n_plotted = min(len(categorical_features), len(axes))
for idx in range(n_plotted, len(axes)):
    axes[idx].set_visible(False)

plt.tight_layout()
plt.show()

In [None]:
# Analyze numerical features
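print("Numerical Features Analysis:")

# Convert TotalCharges to numeric; unparseable strings (e.g. blanks) become NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

# Keep the feature lists consistent with the column's numeric dtype
if 'TotalCharges' in categorical_features:
    categorical_features.remove('TotalCharges')
if 'TotalCharges' not in numerical_features:
    numerical_features.append('TotalCharges')

# Statistical summary for numerical features
numerical_summary = df[numerical_features].describe()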
display(numerical_summary)

In [None]:
# Visualize numerical features distributions
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
axes = axes.ravel()

for idx, feature in enumerate(numerical_features[:4]):
    # Drop rows where this feature is missing, then plot its distribution split by churn
    sns.histplot(data=df.dropna(subset=[feature]), x=feature, hue='Churn',
                 kde=True, ax=axes[idx], alpha=0.7)
    axes[idx].set_title(f'Distribution of {feature} by Churn')
    # seaborn builds the hue legend itself; relabeling it by hand can mismatch the colors

plt.tight_layout()
plt.show()

In [None]:
# Box plots for numerical features
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
axes = axes.ravel()

for idx, feature in enumerate(numerical_features[:4]):
    sns.boxplot(data=df, x='Churn', y=feature, ax=axes[idx])
    axes[idx].set_title(f'{feature} by Churn Status')

plt.tight_layout()
plt.show()

In [None]:
# Correlation analysis
print("🔗 Correlation Analysis:")

# Create a copy for correlation analysis
df_corr = df.copy()

# Remove customerID column if it exists (it's just an identifier, not useful for correlation)
if 'customerID' in df_corr.columns:
    df_corr = df_corr.drop('customerID', axis=1)

# Encode categorical variables for correlation analysis
label_encoders = {}
for col in categorical_features + ['Churn']:
    if col in df_corr.columns:  # Check if column exists
        le = LabelEncoder()
        df_corr[col] = le.fit_transform(df_corr[col].astype(str))
        label_encoders[col] = le

# Handle missing values
df_corr = df_corr.fillna(0)

# Calculate correlation matrix
correlation_matrix = df_corr.corr()

# Plot correlation heatmap
plt.figure(figsize=(16, 12))
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))
sns.heatmap(correlation_matrix, mask=mask, annot=True, cmap='coolwarm', center=0,
            square=True, linewidths=0.5, cbar_kws={"shrink": .8}, fmt='.2f')
plt.title('Correlation Matrix of All Features')
plt.tight_layout()
plt.show()

In [None]:
# Feature correlation with target (Churn)
churn_correlation = correlation_matrix['Churn'].abs().sort_values(ascending=False)
churn_correlation = churn_correlation.drop('Churn')  # Remove self-correlation

print("Features Most Correlated with Churn:")
print(churn_correlation.head(15))

# Visualize top correlations with churn
plt.figure(figsize=(10, 8))
top_15_corr = churn_correlation.head(15)
sns.barplot(x=top_15_corr.values, y=top_15_corr.index, palette='viridis')
plt.title('Top 15 Features Correlated with Churn')
plt.xlabel('Absolute Correlation with Churn')
plt.tight_layout()
plt.show()
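
Label encoding assigns integer codes to nominal categories such as contract type, so a Pearson correlation on those codes depends on the alphabetical order of the labels. One hedged alternative, sketched here on a made-up frame (the rows are illustrative), is to one-hot encode with `pd.get_dummies` and correlate each indicator column with the binary target:

```python
import pandas as pd

# Toy data with illustrative values only
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Two year", "Month-to-month",
                 "One year", "Month-to-month", "Two year"],
    "Churn": ["Yes", "No", "Yes", "No", "No", "No"],
})

# Binary target: 1 for churned customers, 0 otherwise
target = (toy["Churn"] == "Yes").astype(int)

# One indicator column per category, with no implied ordering
indicators = pd.get_dummies(toy[["Contract"]], dtype=int)

# Pearson correlation of each indicator with the target
corr_with_churn = indicators.corrwith(target).sort_values(ascending=False)
print(corr_with_churn)
```

Each category gets its own signed correlation, which is easier to read than a single coefficient for an arbitrarily ordered code.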

## 4. Data Preprocessing and Feature Engineering

Clean the data and handle missing values, engineer new features, encode categorical variables, and scale numerical features. Then split the data into training and testing sets.

In [None]:
# Import our custom preprocessing module
sys.path.append('src')
from preprocessing import ChurnDataPreprocessor

# Initialize preprocessor
preprocessor = ChurnDataPreprocessor()

print("Preprocessor initialized successfully!")

In [None]:
# Load data using preprocessor
df_processed = preprocessor.load_data('data/WA_Fn-UseC_-Telco-Customer-Churn.csv')
print(f"Original dataset shape: {df_processed.shape}")

# Basic data cleaning
df_clean = preprocessor.basic_cleaning(df_processed)
print(f"After basic cleaning: {df_clean.shape}")

# Check for any changes
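print("\nChanges made during cleaning:")
print(f"- Columns removed: {set(df_processed.columns) - set(df_clean.columns)}")
# Report the actual state of TotalCharges instead of asserting it
print(f"- TotalCharges dtype: {df_clean['TotalCharges'].dtype}")
print(f"- Missing values remaining in TotalCharges: {df_clean['TotalCharges'].isnull().sum()}")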

In [None]:
# Feature engineering
df_engineered = preprocessor.feature_engineering(df_clean)
print(f"After feature engineering: {df_engineered.shape}")

# Display new features created
new_features = set(df_engineered.columns) - set(df_clean.columns)
print(f"\nNew features created: {new_features}")

# Show sample of new features
if new_features:
    print("\nSample of new features:")
    display(df_engineered[list(new_features)].head())
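
The internals of `ChurnDataPreprocessor.feature_engineering` are not shown in this notebook. Purely as an illustration of the kinds of features such a step could produce (tenure groups, a service count, an average monthly charge), here is a sketch on a toy frame. The bin edges, labels, and new column names are assumptions, not the module's actual choices:

```python
import numpy as np
import pandas as pd

# Toy data with illustrative values only
toy = pd.DataFrame({
    "tenure": [0, 5, 30, 70],
    "TotalCharges": [0.0, 250.0, 2100.0, 7000.0],
    "OnlineSecurity": ["No", "Yes", "No", "Yes"],
    "TechSupport": ["No", "No", "Yes", "Yes"],
})

# Hypothetical tenure buckets (edges and labels are assumptions)
toy["tenure_group"] = pd.cut(
    toy["tenure"],
    bins=[-1, 12, 48, np.inf],
    labels=["0-12", "13-48", "49+"],
)

# Hypothetical service count: number of add-on services set to "Yes"
service_cols = ["OnlineSecurity", "TechSupport"]
toy["service_count"] = (toy[service_cols] == "Yes").sum(axis=1)

# Hypothetical average monthly charge; tenure 0 becomes NaN to avoid
# dividing by zero, and the resulting NaN is filled with 0
toy["avg_monthly_charge"] = (
    toy["TotalCharges"] / toy["tenure"].replace(0, np.nan)
).fillna(0.0)

print(toy)
```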

In [None]:
# Identify feature types
numeric_features, categorical_features = preprocessor.identify_feature_types(df_engineered)

print(f"\nNumeric features ({len(numeric_features)}): {numeric_features}")
print(f"Categorical features ({len(categorical_features)}): {categorical_features}")

In [None]:
# Encode categorical features
df_encoded = preprocessor.encode_categorical_features(df_engineered, fit=True)
print(f"After encoding categorical features: {df_encoded.shape}")

# Show encoding examples
print("\nEncoding examples:")
for feature in categorical_features[:3]:
    if feature in preprocessor.label_encoders:
        encoder = preprocessor.label_encoders[feature]
        print(f"\n{feature}: {dict(zip(encoder.classes_, encoder.transform(encoder.classes_)))}")
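
Assuming `preprocessor.label_encoders` holds scikit-learn `LabelEncoder` objects (the `classes_` attribute used above suggests so), each encoder maps the sorted class labels to consecutive integers. A small standalone sketch with illustrative values:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(["Two year", "Month-to-month", "One year", "Month-to-month"])

# classes_ is sorted, so the integer codes follow alphabetical order
mapping = {
    str(label): int(code)
    for label, code in zip(encoder.classes_, encoder.transform(encoder.classes_))
}
print(mapping)

# inverse_transform recovers the original labels from the codes
decoded = [str(label) for label in encoder.inverse_transform([0, 2])]
print(decoded)
```

Because the codes reflect alphabetical order and not any business meaning, distance-based or linear models may read an ordering into them that is not there.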

In [None]:
# Scale numeric features
df_scaled = preprocessor.scale_numeric_features(df_encoded, fit=True)
print(f"After scaling numeric features: {df_scaled.shape}")

# Show scaling statistics
print("\nScaling statistics (mean and std of scaled features):")
scaling_stats = pd.DataFrame({
    'Feature': numeric_features,
    'Mean_After_Scaling': df_scaled[numeric_features].mean().values,
    'Std_After_Scaling': df_scaled[numeric_features].std().values
})
display(scaling_stats)
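
If the preprocessor standardizes with scikit-learn's `StandardScaler` (an assumption; the class is imported above but the module's internals are not shown), the standard deviations in this table will sit slightly above 1. `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas `.std()` defaults to the sample estimate (`ddof=1`). A tiny worked example:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Five illustrative values: mean 3, population std sqrt(2)
values = pd.DataFrame({"tenure": [1.0, 2.0, 3.0, 4.0, 5.0]})
scaled = pd.DataFrame(
    StandardScaler().fit_transform(values), columns=values.columns
)

mean_after = scaled["tenure"].mean()
std_sample = scaled["tenure"].std()            # pandas default, ddof=1
std_population = scaled["tenure"].std(ddof=0)  # matches StandardScaler

print(f"mean={mean_after:.3f}, std(ddof=1)={std_sample:.3f}, std(ddof=0)={std_population:.3f}")
```

With n rows the two differ by a factor of sqrt(n / (n - 1)), which is negligible for a few thousand customers.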

In [None]:
# Prepare features and target
X, y = preprocessor.prepare_features_target(df_scaled, target_col='Churn')

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"Target distribution: {np.bincount(y)}")
print(f"Target classes: {preprocessor.target_encoder.classes_}")

# Display feature names
feature_names = X.columns.tolist()
print(f"\nFinal feature list ({len(feature_names)} features):")
for i, feature in enumerate(feature_names, 1):
    print(f"{i:2d}. {feature}")

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")
print(f"Training target distribution: {np.bincount(y_train)}")
print(f"Testing target distribution: {np.bincount(y_test)}")

# Churn rate in each split (stratification should keep the two nearly equal)
train_churn_rate = (y_train == 1).mean() * 100
test_churn_rate = (y_test == 1).mean() * 100

print(f"\nTraining set churn rate: {train_churn_rate:.2f}%")
print(f"Testing set churn rate: {test_churn_rate:.2f}%")

print("\nData preprocessing completed successfully!")
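
In the cells above, the encoders and the scaler are fitted on the full dataset before the split, so test-set statistics influence the training features. One common way to avoid that, sketched here on synthetic data with scikit-learn's `StandardScaler` in place of the notebook's `ChurnDataPreprocessor`, is to split first and fit only on the training portion:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 rows, 3 numeric features, 25% positives
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = np.array([0] * 150 + [1] * 50)

# Split first, keeping the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fit on training rows only, then apply the same transform to both parts
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # reuses the training mean and std

print(f"Train positive rate: {y_tr.mean():.2f}")
print(f"Test positive rate:  {y_te.mean():.2f}")
```

With a `fit` flag like the one the preprocessor methods accept, the analogous pattern would be `fit=True` on the training rows and `fit=False` on the test rows, assuming the flag works that way.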

In [None]:
# Save the preprocessor for later use
os.makedirs('models', exist_ok=True)
preprocessor.save_preprocessor('models/preprocessor.pkl')

# Also save the processed data
np.save('models/X_train.npy', X_train.values)
np.save('models/X_test.npy', X_test.values)
np.save('models/y_train.npy', y_train)
np.save('models/y_test.npy', y_test)

# Save feature names
with open('models/feature_names.txt', 'w') as f:
    for feature in feature_names:
        f.write(f"{feature}\n")

print("Preprocessor and processed data saved successfully!")
print("Files saved:")
print("   - models/preprocessor.pkl")
print("   - models/X_train.npy")
print("   - models/X_test.npy")
print("   - models/y_train.npy")
print("   - models/y_test.npy")
print("   - models/feature_names.txt")
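
The `.npy` files store bare arrays, so the column names live only in `feature_names.txt`. A sketch of a save and reload round trip, using a temporary directory and made-up arrays in place of the notebook's `models/` folder:

```python
import tempfile
from pathlib import Path

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)

    # Made-up stand-ins for X_train.values and feature_names
    X_demo = np.arange(6, dtype=float).reshape(3, 2)
    names = ["tenure", "MonthlyCharges"]

    # Save the array and the names, one name per line
    np.save(out_dir / "X_train.npy", X_demo)
    (out_dir / "feature_names.txt").write_text("\n".join(names) + "\n")

    # Reload both and check that they still line up
    X_loaded = np.load(out_dir / "X_train.npy")
    names_loaded = (out_dir / "feature_names.txt").read_text().splitlines()

print(X_loaded.shape, names_loaded)
```

Checking that the array's column count equals the number of names is a cheap guard against pairing arrays with a stale feature list.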

## Summary

In this notebook, we have:

1. ✅ **Loaded and explored** the Telco Customer Churn dataset
2. ✅ **Performed comprehensive EDA** with visualizations
3. ✅ **Analyzed feature relationships** and correlations with churn
4. ✅ **Implemented data preprocessing** pipeline including:
   - Data cleaning and missing value handling
   - Feature engineering (new features creation)
   - Categorical encoding
   - Numerical feature scaling
   - Train-test split
5. ✅ **Saved preprocessor and data** for model training

### Key Insights:
- **Churn Rate**: ~27% of customers churn
- **High Risk Factors**: Month-to-month contracts, Fiber optic internet, Electronic check payments
- **Low Risk Factors**: Long-term contracts, Multiple services, Automatic payments
- **Feature Engineering**: Created tenure groups, charge levels, service count, and average monthly charges

### Next Steps:
Move to the model training notebook (`02_Model_Training.ipynb`) to:
- Train multiple machine learning models
- Perform hyperparameter tuning
- Evaluate and compare model performance
- Select the best model for deployment