This project uses the classic "Telco Customer Churn" dataset. It suits the task because it combines demographic, account, and service usage information with a clear "Churn" label.

# Project: Customer Churn Prediction Model

Business Goal: To build a machine learning model that accurately predicts which customers are likely to churn, enabling the Customer Success team to take proactive retention measures.

Methodology:
1.  Data Loading & Cleaning: Load the Telco dataset and handle inconsistencies.
2.  Exploratory Data Analysis (EDA): Visualize the differences between churning and non-churning customers.
3.  Feature Engineering & Preprocessing: Convert categorical data to numerical format and prepare it for modeling.
4.  Model Training: Build and train a Random Forest Classifier.
5.  Model Evaluation: Assess performance using a confusion matrix, classification report, and feature importance.
6.  Export for BI: Generate a final predictions file for visualization.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_auc_score

# Set plot style
sns.set(style="whitegrid")
pd.set_option('display.max_columns', None)

In [None]:
df = pd.read_csv('telco_churn.csv')

print("Data Head:")
print(df.head())
print("\nData Info:")
df.info()

# --- Data Cleaning ---
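# TotalCharges is stored as object (string) dtype because some entries are blank (' ').
# errors='coerce' converts those blanks to NaN, which we then impute with the median.
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
median_total_charges = df['TotalCharges'].median()
# Assign the result back: fillna(inplace=True) on a selected column is deprecated
# in recent pandas versions and may not update df.
df['TotalCharges'] = df['TotalCharges'].fillna(median_total_charges)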

# Drop customerID as it's not a predictive feature
df.drop('customerID', axis=1, inplace=True)

# Convert our target variable 'Churn' to binary (1 = 'Yes', 0 = anything else)
df['Churn'] = (df['Churn'] == 'Yes').astype(int)

print("\nData after cleaning:")
print(df.head())
print(f"\nNumber of missing values in TotalCharges after imputation: {df['TotalCharges'].isnull().sum()}")
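The coerce-then-impute step above can be checked on a toy column. This is a minimal sketch on made-up values (the `toy` frame is illustrative, not taken from the Telco file):

```python
import pandas as pd

# Toy column mimicking the issue: numbers stored as strings, plus one blank entry.
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "108.15", "1889.5"]})

# errors="coerce" turns anything unparseable (here the blank string) into NaN.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
n_missing = int(toy["TotalCharges"].isna().sum())

# The median skips NaN by default, so it is computed from the three valid values.
median_value = toy["TotalCharges"].median()
toy["TotalCharges"] = toy["TotalCharges"].fillna(median_value)

print(n_missing, median_value)
```

Counting the NaN values before imputing, as done here with `n_missing`, is a cheap way to confirm how many rows were affected.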

### EDA: Understanding Churn Drivers
Let's visualize how different features relate to churn.

In [None]:
plt.figure(figsize=(6, 4))
sns.countplot(data=df, x='Churn')
plt.title('Churn Distribution')
plt.show()

print(df['Churn'].value_counts(normalize=True))

# Visualize churn across key categorical features
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
sns.countplot(data=df, x='Contract', hue='Churn', ax=axes[0, 0]).set_title('Churn by Contract Type')
sns.countplot(data=df, x='TechSupport', hue='Churn', ax=axes[0, 1]).set_title('Churn by Tech Support')
sns.countplot(data=df, x='PaymentMethod', hue='Churn', ax=axes[1, 0]).set_title('Churn by Payment Method')
axes[1, 0].tick_params(axis='x', rotation=30)
sns.countplot(data=df, x='InternetService', hue='Churn', ax=axes[1, 1]).set_title('Churn by Internet Service')
plt.tight_layout()
plt.show()

### Data for Modeling
Machine learning models require all input features to be numeric. We will use One-Hot Encoding for this: each categorical column is replaced by one 0/1 indicator column per category.

In [None]:
# Separate features (X) and target (y)
X = df.drop('Churn', axis=1)
y = df['Churn']

# Convert categorical variables into dummy/indicator variables
X_encoded = pd.get_dummies(X, drop_first=True) # drop_first=True to avoid multicollinearity

print("Shape of encoded features:", X_encoded.shape)
print("\nEncoded Features Head:")
print(X_encoded.head())
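To make the effect of `drop_first=True` concrete, here is a small sketch on a toy frame (the values are illustrative, not rows from the dataset). Numeric columns pass through unchanged, and the first category of each categorical column, in sorted order, becomes the implicit baseline:

```python
import pandas as pd

# Toy frame with one categorical and one numeric column.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "Month-to-month"],
    "tenure": [1, 24, 60, 5],
})

# 'Month-to-month' sorts first, so its indicator column is dropped;
# a row with both remaining indicators at 0 means 'Month-to-month'.
encoded = pd.get_dummies(toy, drop_first=True)
encoded_columns = list(encoded.columns)

print(encoded_columns)
```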

In [None]:
# Split the data into training and testing sets (80/20 split)
# stratify=y ensures the churn distribution is the same in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, 
    test_size=0.2, 
    random_state=42, 
    stratify=y
)

print(f"Training set size: {X_train.shape[0]}")
print(f"Testing set size: {X_test.shape[0]}")
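The effect of `stratify=y` can be verified directly by comparing class rates in the two splits. The sketch below uses synthetic labels with an assumed 25% positive rate rather than the real churn column:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 250 positives out of 1000 rows (an assumed 25% rate).
y_toy = np.array([1] * 250 + [0] * 750)
X_toy = np.arange(1000).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification, both splits keep the overall positive rate.
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train: {train_rate:.3f}, test: {test_rate:.3f}")
```

Running the same check on `y_train` and `y_test` in the notebook is a quick sanity test that the split behaved as intended.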

In [None]:
# Initialize and train the Random Forest Classifier
# class_weight='balanced' helps the model handle the imbalanced churn data
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced', n_jobs=-1)

print("Training the model...")
rf_classifier.fit(X_train, y_train)
print("Training complete.")
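For intuition about what `class_weight='balanced'` does, scikit-learn weights each class by `n_samples / (n_classes * class_count)`, so the rarer class counts for more. A minimal sketch on synthetic labels (the 75/25 split is an assumption for illustration):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic imbalanced labels: 75 negatives, 25 positives.
y_toy = np.array([0] * 75 + [1] * 25)

# Weights as scikit-learn computes them for class_weight="balanced".
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_toy
)

# The same formula written out by hand.
manual = len(y_toy) / (2 * np.bincount(y_toy))

print(weights, manual)
```

Here the minority class receives three times the weight of the majority class, which mirrors the 3:1 imbalance in the toy labels.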

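### Evaluating Model Performance
We will use the test set, which the model has never seen, to estimate how well it generalizes. **Recall** for the 'Churn' class (1) is our key metric: it is the share of actual churners the model flags, and each churner it misses is a customer the Customer Success team cannot reach in time.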
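Because recall depends on the probability cutoff used to call a customer a churner, it can help to see how it moves with the threshold. The sketch below uses hand-made labels and scores, not model output, and the `recall_at` helper is hypothetical:

```python
import numpy as np
from sklearn.metrics import recall_score

# Hand-made example: 4 actual churners (1) and 6 non-churners (0).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
proba = np.array([0.9, 0.6, 0.4, 0.2, 0.7, 0.3, 0.3, 0.1, 0.1, 0.05])

def recall_at(threshold):
    """Recall for class 1 when scores at or above `threshold` are flagged."""
    y_hat = (proba >= threshold).astype(int)
    return recall_score(y_true, y_hat)

recall_default = recall_at(0.5)  # flags 2 of the 4 churners
recall_lower = recall_at(0.3)    # flags 3 of the 4 churners

print(recall_default, recall_lower)
```

Lowering the threshold raises recall at the cost of more false alarms, so the right cutoff depends on how expensive a retention offer is relative to a lost customer.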

In [None]:
# Make predictions on the test set
y_pred = rf_classifier.predict(X_test)
y_pred_proba = rf_classifier.predict_proba(X_test)[:, 1] # Probability for the 'Churn' class

# --- Performance Metrics ---
print("--- Model Evaluation ---")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print(f"AUC Score: {roc_auc_score(y_test, y_pred_proba):.2f}\n")

print("Classification Report:")
print(classification_report(y_test, y_pred))

print("Confusion Matrix:")
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Not Churned', 'Churned'], yticklabels=['Not Churned', 'Churned'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

### Identifying Key Churn Drivers
Let's see what features the model found most predictive.
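A random forest's `feature_importances_` are impurity-based: they are computed from the training data and normalized to sum to 1. One common cross-check, shown here as an optional sketch on synthetic data rather than the Telco features, is permutation importance on held-out rows, which measures how much a chosen score drops when a single feature is shuffled:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the Telco dataset): 6 features, 3 informative.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Impurity-based importances, as used in the notebook.
impurity_importance = model.feature_importances_

# Permutation importance: mean drop in held-out recall over 5 shuffles per feature.
perm = permutation_importance(
    model, X_te, y_te, scoring="recall", n_repeats=5, random_state=0
)
perm_importance = perm.importances_mean

print(impurity_importance.round(3))
print(perm_importance.round(3))
```

If the two rankings disagree sharply on the real data, that is a hint to treat the bar chart of top drivers with some caution.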

In [None]:
# Get feature importances
importances = rf_classifier.feature_importances_
feature_names = X_encoded.columns
feature_importance_df = pd.DataFrame({'feature': feature_names, 'importance': importances})
feature_importance_df = feature_importance_df.sort_values(by='importance', ascending=False).head(10) # Top 10

# Plot feature importances
plt.figure(figsize=(10, 6))
sns.barplot(x='importance', y='feature', data=feature_importance_df, palette='viridis')
plt.title('Top 10 Churn Driver Features')
plt.show()

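# --- Create the Final Output File for the Business Team ---
# Score the ENTIRE dataset to get a churn probability for every customer.
# Caveat: about 80% of these rows were used to train the model, so their scores are
# optimistic; only the test-set rows reflect performance on unseen customers.
full_predictions_proba = rf_classifier.predict_proba(X_encoded)[:, 1]

# Create a final dataframe with customer info and churn score.
# df holds the cleaned, pre-encoded data for readability; note that customerID was
# dropped during cleaning, so rows in this file are identified by position only.
final_df = df.copy()
final_df['Churn_Probability'] = full_predictions_proba
final_df['Predicted_Churn'] = rf_classifier.predict(X_encoded)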

# Add a risk tier for easy prioritization
def assign_risk_tier(score):
    if score > 0.75:
        return 'High Risk'
    elif score > 0.50:
        return 'Medium Risk'
    else:
        return 'Low Risk'

final_df['Risk_Tier'] = final_df['Churn_Probability'].apply(assign_risk_tier)

# Save to CSV for BI tool import
final_df.to_csv('churn_predictions_with_scores.csv', index=False)

print("\nFinal predictions with scores saved to 'churn_predictions_with_scores.csv'")
print(final_df[['tenure', 'Contract', 'MonthlyCharges', 'Churn_Probability', 'Risk_Tier']].head())
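The risk tiers above can also be assigned in a vectorized way with `pd.cut`. This is an alternative sketch, not a change to the notebook's logic. It uses made-up scores and reproduces the same boundaries as `assign_risk_tier` (above 0.75 is high, above 0.50 is medium, otherwise low):

```python
import numpy as np
import pandas as pd

# Made-up scores chosen to sit on and around the tier boundaries.
scores = pd.Series([0.10, 0.50, 0.51, 0.75, 0.76, 1.00])

# Bins are right-inclusive: (-inf, 0.50], (0.50, 0.75], (0.75, inf).
tiers = pd.cut(
    scores,
    bins=[-np.inf, 0.50, 0.75, np.inf],
    labels=["Low Risk", "Medium Risk", "High Risk"],
)
tier_list = tiers.astype(str).tolist()

print(tier_list)
```

A side benefit is that `pd.cut` returns an ordered categorical, so the tiers sort in risk order rather than alphabetically.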