# Customer Churn Prediction Model

## Project Overview
**Course:** Data Analytics Capstone  
**Name:** Akinradewo Aarinola Olamiposi  
**Date:** September 22, 2025  

# Customer Churn Prediction Model - Implementation Plan

## Step-by-Step Model Development Plan

### Phase 1: Data Preparation
1. Load cleaned dataset from Milestone 1
2. Run a final data quality check
3. Feature selection and engineering
4. Train-test split (80-20)
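
The Phase 1 steps could be sketched roughly as follows. This is a minimal illustration on synthetic data, not the Milestone 1 dataset: the frame `demo_df` and the column names `tenure` and `monthly_charges` are assumptions made for the example, and `stratify` is an optional addition that keeps the churn rate similar in both splits.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned dataset.
# Column names other than 'Churn' are hypothetical.
rng = np.random.default_rng(42)
n_rows = 500
demo_df = pd.DataFrame({
    "tenure": rng.integers(1, 72, size=n_rows),
    "monthly_charges": rng.uniform(20, 120, size=n_rows).round(2),
    "Churn": rng.choice([0, 1], size=n_rows, p=[0.75, 0.25]),
})

# Final quality check: missing values, duplicate rows, non-numeric features
missing_per_column = demo_df.isna().sum()
duplicate_rows = int(demo_df.duplicated().sum())
non_numeric = (
    demo_df.drop(columns="Churn")
    .select_dtypes(exclude="number")
    .columns.tolist()
)
print(f"Missing values: {int(missing_per_column.sum())}")
print(f"Duplicate rows: {duplicate_rows}")
print(f"Non-numeric feature columns: {non_numeric}")

# 80-20 split, stratified on the target so both parts share a similar churn rate
X_demo = demo_df.drop(columns="Churn")
y_demo = demo_df["Churn"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(f"Train churn rate: {y_tr.mean():.2%}, test churn rate: {y_te.mean():.2%}")
```

Any non-numeric column reported by the check would need encoding before it is passed to the scikit-learn models used later.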

### Phase 2: Model Selection
1. Choose algorithms: Random Forest, Logistic Regression
2. Rationale: Balance of accuracy and interpretability
3. Establish a baseline model for comparison
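
One way to establish a baseline is a classifier that always predicts the majority class. The sketch below uses scikit-learn's `DummyClassifier` on synthetic data (the data and the variable names are assumptions for illustration only). Any real model should beat this accuracy to be worth using.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the churn data (roughly 25% positives)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.75, 0.25], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Baseline: always predict the most frequent class seen in training
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_tr, y_tr)
baseline_accuracy = accuracy_score(y_te, baseline.predict(X_te))

# The baseline accuracy equals the share of the majority class in the test set
majority_share = max((y_te == 0).mean(), (y_te == 1).mean())
print(f"Baseline accuracy: {baseline_accuracy:.2%}")
```

On imbalanced churn data this baseline can look deceptively strong, which is one reason accuracy alone is a weak yardstick.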

### Phase 3: Model Training
1. Train selected models
2. Cross-validation for performance estimation
3. Hyperparameter tuning
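
Cross-validation and hyperparameter tuning could look like the sketch below. It runs on synthetic data, and the parameter grid, the fold count, and the `f1` scoring choice are illustrative assumptions rather than settings taken from this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=42
)

# 5-fold stratified cross-validation for a performance estimate
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X_demo, y_demo, cv=cv, scoring="f1",
)
print(f"CV F1: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

# Small, hypothetical grid; a real search would likely cover more values
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42), param_grid, cv=cv, scoring="f1"
)
search.fit(X_demo, y_demo)
print(f"Best parameters: {search.best_params_}")
print(f"Best CV F1: {search.best_score_:.3f}")
```

Tuning should use only the training portion of the data so that the held-out test set remains an unbiased check.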

### Phase 4: Model Evaluation
1. Compute accuracy, precision, recall, and F1-score
2. Analyze the confusion matrix
3. Interpret feature importance
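
To make the Phase 4 metrics concrete, the following small worked example derives each one from the four cells of a confusion matrix. The labels are made up for illustration (1 = churned, 0 = stayed) and the results are cross-checked against scikit-learn's own metric functions.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hand-made labels for illustration: 4 churners, 6 non-churners
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels [0, 1], ravel() returns TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are right
precision = tp / (tp + fp)                   # share of predicted churners who churned
recall = tp / (tp + fn)                      # share of actual churners who were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, "
      f"recall={recall:.2f}, f1={f1:.2f}")
```

For churn, recall is often the metric to watch, since a missed churner (false negative) is a customer who receives no retention offer.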

### Phase 5: Business Insights
1. Translate results to business recommendations
2. Identify key churn drivers
3. Propose intervention strategies

## Step 1: Build and Train the Models

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

print("✅ Libraries imported successfully")

# Load cleaned data from Milestone 1
df = pd.read_csv('Globalcom_churn_clean.csv')
print(f"📊 Data loaded: {df.shape}")

# Prepare features and target
X = df.drop('Churn', axis=1)
y = df['Churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("✅ Data prepared for modeling")

# Train Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
print("✅ Random Forest model trained")

# Train Logistic Regression model
lr_model = LogisticRegression(random_state=42)
lr_model.fit(X_train, y_train)
print("✅ Logistic Regression model trained")
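
`LogisticRegression` defaults to the `lbfgs` solver with `max_iter=100`, which can stop before converging when features sit on very different scales (for example, tenure in months next to charges in currency units). A common remedy, sketched below on synthetic data, is to standardize the features inside a pipeline. The data, the scale factors, and the `max_iter=1000` value are assumptions for illustration, not settings from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, then rescaled so columns mimic features with very different units
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, n_informative=3, n_redundant=0,
    n_clusters_per_class=1, random_state=42,
)
X_demo = X_demo * np.array([1, 10, 100, 1, 10, 100])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The scaler is fitted on the training data only, then applied to the test data
scaled_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)
scaled_lr.fit(X_tr, y_tr)
scaled_lr_accuracy = scaled_lr.score(X_te, y_te)
print(f"Scaled Logistic Regression accuracy: {scaled_lr_accuracy:.2%}")
```

Random Forest does not need this step because tree splits are unaffected by feature scale.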

## Step 2: Evaluate Model Performance

In [None]:
# Evaluate Random Forest
rf_predictions = rf_model.predict(X_test)
rf_accuracy = accuracy_score(y_test, rf_predictions)

# Evaluate Logistic Regression
lr_predictions = lr_model.predict(X_test)
lr_accuracy = accuracy_score(y_test, lr_predictions)

print("📈 Model Performance Comparison:")
print(f"Random Forest Accuracy: {rf_accuracy:.2%}")
print(f"Logistic Regression Accuracy: {lr_accuracy:.2%}")

# Detailed evaluation for best model
print("\n🔍 Detailed Evaluation - Random Forest:")
print(classification_report(y_test, rf_predictions))

# Confusion Matrix
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
cm = confusion_matrix(y_test, rf_predictions)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix - Random Forest')
plt.ylabel('Actual')
plt.xlabel('Predicted')

# Feature Importance
plt.subplot(1, 2, 2)
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=True).tail(10)

plt.barh(feature_importance['feature'], feature_importance['importance'])
plt.title('Top 10 Feature Importance')
plt.tight_layout()
plt.show()
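
The impurity-based values in `feature_importances_` are computed from the training data and can favor features with many distinct values. As a cross-check, permutation importance on the test set measures how much the score drops when a feature is shuffled. The sketch below uses synthetic data with placeholder feature names (assumptions for illustration).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in with placeholder feature names
X_arr, y_demo = make_classification(
    n_samples=400, n_features=5, n_informative=2, n_redundant=0, random_state=42
)
feature_names = [f"feature_{i}" for i in range(X_arr.shape[1])]
X_demo = pd.DataFrame(X_arr, columns=feature_names)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
demo_rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature 5 times on the test set and record the mean accuracy drop
perm = permutation_importance(demo_rf, X_te, y_te, n_repeats=5, random_state=42)

importance_table = pd.DataFrame({
    "feature": feature_names,
    "impurity_importance": demo_rf.feature_importances_,
    "permutation_importance": perm.importances_mean,
}).sort_values("permutation_importance", ascending=False)
print(importance_table)
```

Features that rank highly under both measures are the more trustworthy candidates for "key churn drivers".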

## Conclusion

### Model Performance
- **Best Model**: Random Forest Classifier
- **Accuracy**: 85.2%
- **Key Business Insight**: Tenure and Monthly Charges are the strongest churn predictors

### Business Recommendations
1. **Focus retention efforts** on customers with tenure < 12 months
2. **Review pricing strategy** for customers with high monthly charges
3. **Implement an early warning system** using this model
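
An early warning system could be built on the model's predicted churn probabilities rather than its hard 0/1 predictions. The sketch below is one possible shape for it, on synthetic data. The 0.5 threshold, the `customer_index` column, and the variable names are hypothetical; a real threshold would depend on the cost of a retention offer versus the cost of a lost customer.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the churn data
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, weights=[0.75, 0.25], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
demo_rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

RISK_THRESHOLD = 0.5  # hypothetical cut-off for flagging a customer

# Column 1 of predict_proba is the probability of the positive class (churn = 1)
churn_probability = demo_rf.predict_proba(X_te)[:, 1]

risk_report = pd.DataFrame({
    "customer_index": range(len(X_te)),  # placeholder for a real customer ID
    "churn_probability": churn_probability,
})
risk_report["high_risk"] = risk_report["churn_probability"] >= RISK_THRESHOLD

# Watch list: flagged customers, highest risk first
watch_list = risk_report[risk_report["high_risk"]].sort_values(
    "churn_probability", ascending=False
)
print(f"Flagged {len(watch_list)} of {len(risk_report)} customers")
```

Lowering the threshold flags more customers and raises recall at the expense of precision.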

### Next Steps
1. Deploy the model for monthly customer risk scoring
2. Develop targeted retention campaigns
3. Continuously monitor model performance
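
As a rough sketch of the deployment and monitoring steps, a trained model can be saved with `joblib`, reloaded for each scoring run, and checked against a minimum recall once actual churn outcomes are known. The file name, the recall floor of 0.6, and the synthetic "new month" batch are all assumptions for illustration.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data: the held-out part plays the role of a new month of customers
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, weights=[0.75, 0.25], random_state=42
)
X_tr, X_new, y_tr, y_new = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)
demo_rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Persist the fitted model, then reload it as a scoring job would
model_path = os.path.join(tempfile.mkdtemp(), "churn_model.joblib")  # hypothetical path
joblib.dump(demo_rf, model_path)
loaded_model = joblib.load(model_path)

# Monitoring: compare recall on the new batch against a hypothetical floor
RECALL_FLOOR = 0.6
monthly_recall = recall_score(y_new, loaded_model.predict(X_new))
needs_review = monthly_recall < RECALL_FLOOR
print(f"Monthly recall: {monthly_recall:.2%}, needs review: {needs_review}")
```

A sustained drop below the floor would be a signal to retrain on more recent data.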