# Telecom Customer Churn - Data Exploration & Modeling

This notebook contains the complete pipeline for our Churn Prediction System, built directly from our Python codebase. You can run these cells interactively to explore the dataset, visualize customer segments, and understand the model's explanations.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
import shap
import warnings

warnings.filterwarnings('ignore')
plt.style.use('ggplot')

## 1. Data Loading & Inspection

In [None]:
# Load the raw dataset
df = pd.read_csv('../data/raw/customer_churn_dataset.csv')
df.head()
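
Beyond `df.head()`, a quick look at dtypes, missing values, and class balance can catch problems early. The sketch below runs on a toy frame, not the real dataset. The `Churn` column with `Yes`/`No` values, the blank `TotalCharges` entry, and the `inspect_frame` helper are all assumptions made for illustration.

```python
import pandas as pd

def inspect_frame(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, and unique count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "n_unique": frame.nunique(),
    })

# Toy stand-in for the real dataset (values are illustrative only)
toy = pd.DataFrame({
    "tenure": [1, 34, 2, 45],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
    "TotalCharges": ["29.85", "1889.5", " ", "1840.75"],  # text column, one blank
    "Churn": ["No", "No", "Yes", "No"],                   # assumed target column
})

summary = inspect_frame(toy)

# Share of rows labelled as churned
churn_rate = (toy["Churn"] == "Yes").mean()

# A blank string is not NaN, so isna() misses it; coercion to numeric exposes it
hidden_missing = pd.to_numeric(toy["TotalCharges"], errors="coerce").isna().sum()
```

A text-typed numeric column like this is the kind of issue the `pd.to_numeric(..., errors='coerce')` step in the next section guards against.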

## 2. Customer Segmentation Analysis
We previously fit a KMeans model on `tenure`, `MonthlyCharges`, and `TotalCharges` to group customers into three segments. Here we load the saved scaler and model and apply them to the cleaned data.
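
The fitting code for the saved artifacts is not part of this notebook. As a rough sketch of how a scaler and a 3-cluster model of this kind might be produced, assuming scikit-learn's `StandardScaler` and `KMeans`: the synthetic data, `n_init`, and `random_state` below are illustrative choices, not the settings used for the saved models.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: three loose groups of (tenure, MonthlyCharges, TotalCharges)
centers = np.array([[3, 30, 90], [30, 70, 2100], [65, 100, 6500]], dtype=float)
X = np.vstack([c + rng.normal(scale=[2, 5, 50], size=(50, 3)) for c in centers])

# Standardize first so the large TotalCharges values do not dominate distances
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# Fit three clusters on the scaled features, then assign each row a segment id
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
labels = kmeans.predict(X_scaled)
```

The same fitted scaler has to be applied before `predict`, which is why both the scaler and the model are loaded together.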

In [None]:
df_clean = df.copy()
df_clean['TotalCharges'] = pd.to_numeric(df_clean['TotalCharges'], errors='coerce')
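# Assign the result back: inplace fillna on a column selection is deprecated
# and does not update df_clean under pandas copy-on-write
df_clean['TotalCharges'] = df_clean['TotalCharges'].fillna(df_clean['MonthlyCharges'])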

# Load our trained clustering models
kmeans_scaler = joblib.load('../models/kmeans_scaler.pkl')
kmeans_model = joblib.load('../models/kmeans_segmentation.pkl')

seg_features = df_clean[['tenure', 'MonthlyCharges', 'TotalCharges']]
scaled_seg = kmeans_scaler.transform(seg_features)
df_clean['Segment'] = kmeans_model.predict(scaled_seg)

sns.scatterplot(x='tenure', y='MonthlyCharges', hue='Segment', data=df_clean, palette='viridis')
plt.title('Customer Segments by Tenure and Charges')
plt.show()
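
A scatter plot shows where the segments fall, but a per-segment summary table makes them easier to name. A minimal sketch on a toy frame with the same column names; the values are made up and say nothing about the real segments.

```python
import pandas as pd

# Toy frame with an already-assigned Segment column (illustrative values)
toy = pd.DataFrame({
    "Segment": [0, 0, 1, 1, 2, 2],
    "tenure": [2, 4, 30, 34, 60, 70],
    "MonthlyCharges": [30.0, 40.0, 70.0, 80.0, 100.0, 110.0],
    "TotalCharges": [60.0, 160.0, 2100.0, 2720.0, 6000.0, 7700.0],
})

grouped = toy.groupby("Segment")

# Mean of each clustering feature per segment, plus the segment size
profile = grouped[["tenure", "MonthlyCharges", "TotalCharges"]].mean()
profile["n_customers"] = grouped.size()
```

Applied to `df_clean`, the same `groupby('Segment')` call would give the average tenure and charges of each segment.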

## 3. Model Explainability with SHAP
Let's load up our best predictive model (Logistic Regression) and use SHAP to understand which features drive churn.

In [None]:
# Load our classification models
best_model = joblib.load('../models/best_model.pkl')
# Background sample that shap.LinearExplainer uses as its reference distribution
background_data = joblib.load('../models/shap_background.pkl')
features = joblib.load('../models/model_columns.pkl')

print(f"Loaded Model: {type(best_model).__name__}")

# Set up explainer depending on model type
if type(best_model).__name__ in ['RandomForestClassifier', 'XGBClassifier', 'XGBRegressor']:
    explainer = shap.TreeExplainer(best_model)
else:
    explainer = shap.LinearExplainer(best_model, background_data)

# Compute SHAP values for the first 100 rows of the background sample
sample_to_explain = background_data[:100]
shap_values = explainer(sample_to_explain)

# Plot Global Feature Importance Summary
shap.summary_plot(shap_values, sample_to_explain, feature_names=features)
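
For a linear model, if features are treated as independent, the SHAP value of feature `i` for a row `x` reduces to `coef[i] * (x[i] - mean[i])`, where the mean is taken over the background data. For logistic regression these contributions are in log-odds units. The NumPy sketch below checks this by hand; the coefficients, intercept, and rows are made up and are not taken from the trained model.

```python
import numpy as np

# Hypothetical logistic-regression parameters (log-odds scale)
coef = np.array([-0.8, 0.5, 0.1])
intercept = -0.2

# Hypothetical background rows acting as the reference distribution
background = np.array([
    [0.0, 1.0, 2.0],
    [2.0, 3.0, 0.0],
    [1.0, 2.0, 4.0],
])
x = np.array([3.0, 1.0, 2.0])  # row to explain

feature_means = background.mean(axis=0)

# Expected model output (log-odds) over the background data
base_value = intercept + feature_means @ coef

# Per-feature contribution: coefficient times deviation from the background mean
phi = coef * (x - feature_means)

# Additivity check: base value plus contributions equals the log-odds for x
log_odds = intercept + x @ coef
```

Because each contribution is measured against the background mean, a different background sample shifts both the base value and every SHAP value.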