# Churn Prediction - Hands-On Exercise

**Based on: DataCamp Supervised Learning Chapter 1**

This notebook walks you through applying classification concepts step-by-step.

---

## 🎯 Learning Objectives

By completing this notebook, you will:
1. Apply train-test split properly
2. Train multiple classification models
3. Evaluate using multiple metrics
4. Compare model performance
5. Make predictions on new data

---

## Step 1: Import Libraries

**Task**: Import all necessary libraries

In [None]:
# Standard libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn imports
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# Metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_auc_score, roc_curve
)

# Settings
plt.style.use('seaborn-v0_8-darkgrid')
np.random.seed(42)

print("✅ Libraries imported successfully!")

## Step 2: Load Data

**Task**: Create a synthetic churn dataset for practice

In [None]:
from sklearn.datasets import make_classification

# Create synthetic churn data
X, y = make_classification(
    n_samples=2000,
    n_features=10,
    n_informative=8,
    n_redundant=2,
    n_classes=2,
    weights=[0.75, 0.25],  # roughly 25% churn rate (default flip_y adds slight label noise)
    random_state=42
)

# Create DataFrame
feature_names = [
    'account_length', 'international_plan', 'voice_mail_plan',
    'num_voice_messages', 'total_day_minutes', 'total_day_calls',
    'total_eve_minutes', 'total_night_minutes', 'total_intl_calls',
    'customer_service_calls'
]

df = pd.DataFrame(X, columns=feature_names)
df['churn'] = y

print(f"Dataset shape: {df.shape}")
print("\nFirst 5 rows:")
df.head()

## Step 3: Exploratory Data Analysis

**Task**: Understand the dataset before modeling

### 3.1 Check class distribution

In [None]:
# YOUR CODE HERE:
# Calculate the percentage of churned vs non-churned customers
churn_counts = df['churn'].value_counts()
churn_pct = df['churn'].value_counts(normalize=True) * 100

print("Churn Distribution:")
print(f"No Churn (0): {churn_counts[0]} ({churn_pct[0]:.1f}%)")
print(f"Churn (1): {churn_counts[1]} ({churn_pct[1]:.1f}%)")

# Visualize
plt.figure(figsize=(8, 5))
churn_counts.plot(kind='bar', color=['green', 'red'])
plt.title('Churn Distribution')
plt.xlabel('Churn (0=No, 1=Yes)')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.show()

# QUESTION: Is this dataset balanced or imbalanced? Why does it matter?
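
One way to explore the question above is to score a model that always predicts the majority class. The sketch below is not part of the original exercise: it uses scikit-learn's `DummyClassifier` on data regenerated with the Step 2 settings, and the names `X_demo`, `y_demo`, and `baseline` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Regenerate data with the same settings as Step 2 (illustrative copy)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, n_informative=8, n_redundant=2,
    n_classes=2, weights=[0.75, 0.25], random_state=42
)

# A "model" that ignores the features and always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
y_base = baseline.predict(X_demo)

baseline_accuracy = accuracy_score(y_demo, y_base)
baseline_recall = recall_score(y_demo, y_base)  # share of churners caught

print(f"Baseline accuracy: {baseline_accuracy:.3f}")
print(f"Baseline recall:   {baseline_recall:.3f}")
```

The baseline never predicts churn, so its recall is zero even though its accuracy looks respectable. Any real model should be judged against this floor rather than against 50%.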

### 3.2 Feature Statistics

In [None]:
# YOUR CODE HERE:
# Display summary statistics for all features
df.describe().round(2)

## Step 4: Prepare Data

**Concept from DataCamp**: Always split the data before fitting anything, including preprocessing steps such as scalers, so the test set stays unseen.

### 4.1 Separate features and target

In [None]:
# YOUR CODE HERE:
# Separate X (features) and y (target)
X = df.drop('churn', axis=1)
y = df['churn']

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")

### 4.2 Train-Test Split

**Key Parameters**:
- `test_size`: Proportion for test set (typically 0.2 or 0.3)
- `random_state`: For reproducibility
- `stratify`: Maintains class distribution in both sets

In [None]:
# YOUR CODE HERE:
# Split the data with test_size=0.2, random_state=42, stratify=y
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {len(X_train)} samples ({len(X_train)/len(X)*100:.1f}%)")
print(f"Test set: {len(X_test)} samples ({len(X_test)/len(X)*100:.1f}%)")

# Verify stratification worked
print(f"\nTraining set churn rate: {y_train.mean()*100:.1f}%")
print(f"Test set churn rate: {y_test.mean()*100:.1f}%")

# QUESTION: Why is stratification important?
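
To make the effect of `stratify` concrete, here is a small sketch on a toy target with exactly 25% churners. The arrays `X_toy` and `y_toy` are made up for illustration and are not part of the exercise data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 75 non-churners, 25 churners (illustrative only)
y_toy = np.array([0] * 75 + [1] * 25)
X_toy = np.arange(100).reshape(-1, 1)  # placeholder feature column

# Without stratification, the test-set churn rate depends on the shuffle
_, _, _, y_test_plain = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

# With stratification, both splits keep the original class proportions
_, _, y_train_strat, y_test_strat = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

print(f"Plain split, test churn rate:       {y_test_plain.mean():.2f}")
print(f"Stratified split, train churn rate: {y_train_strat.mean():.2f}")
print(f"Stratified split, test churn rate:  {y_test_strat.mean():.2f}")
```

With small or heavily imbalanced datasets, an unstratified split can leave the test set with too few churners to evaluate recall reliably.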

### 4.3 Feature Scaling

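**Why scale?** Some algorithms are sensitive to feature magnitudes: KNN relies on distances between points, and regularized Logistic Regression penalizes coefficients whose size depends on the feature scale.

**Important**: Fit the scaler on the training data only, then use it to transform both the training and test sets. Fitting on the full dataset would leak test-set statistics into training.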

In [None]:
# YOUR CODE HERE:
# Create StandardScaler and fit_transform training data, transform test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("✅ Features scaled")
print("\nOriginal feature ranges:")
print(X_train.describe().loc[['min', 'max']].round(2))
print("\nScaled feature means (should be ~0):")
print(pd.DataFrame(X_train_scaled, columns=X.columns).mean().round(4))
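
An alternative way to enforce the "fit on train only" rule is scikit-learn's `Pipeline`, which refits the scaler whenever the model is fitted. This is a sketch under assumptions, not the notebook's method: the data is regenerated here, and the names `scaled_lr`, `X_tr`, and `X_te` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small illustrative dataset with the same structure as Step 2
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=8, n_redundant=2,
    weights=[0.75, 0.25], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline scales and then classifies in a single estimator
scaled_lr = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
scaled_lr.fit(X_tr, y_tr)  # scaler statistics come from X_tr only

# Means learned by the scaler match the training-set means
scaler_means = scaled_lr.named_steps["scaler"].mean_
test_accuracy = scaled_lr.score(X_te, y_te)  # test data is scaled automatically

print(f"Scaler fitted on train only: {np.allclose(scaler_means, X_tr.mean(axis=0))}")
print(f"Test accuracy: {test_accuracy:.3f}")
```

Because the scaler lives inside the estimator, the same object can be passed to cross-validation without leaking statistics across folds.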

## Step 5: Train Models

**Task**: Train 4 different classification models

### 5.1 Logistic Regression

In [None]:
# YOUR CODE HERE:
# Train Logistic Regression on SCALED data
lr_model = LogisticRegression(random_state=42, max_iter=1000)
lr_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred_lr = lr_model.predict(X_test_scaled)
y_pred_proba_lr = lr_model.predict_proba(X_test_scaled)[:, 1]

# Calculate accuracy
accuracy_lr = accuracy_score(y_test, y_pred_lr)
print(f"Logistic Regression Accuracy: {accuracy_lr:.3f}")

### 5.2 K-Nearest Neighbors

In [None]:
# YOUR CODE HERE:
# Train KNN with n_neighbors=5 on SCALED data
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train_scaled, y_train)

y_pred_knn = knn_model.predict(X_test_scaled)
y_pred_proba_knn = knn_model.predict_proba(X_test_scaled)[:, 1]

accuracy_knn = accuracy_score(y_test, y_pred_knn)
print(f"KNN Accuracy: {accuracy_knn:.3f}")

### 5.3 Decision Tree

In [None]:
# YOUR CODE HERE:
# Train Decision Tree with max_depth=5 on UNSCALED data (trees don't need scaling)
dt_model = DecisionTreeClassifier(max_depth=5, random_state=42)
dt_model.fit(X_train, y_train)

y_pred_dt = dt_model.predict(X_test)
y_pred_proba_dt = dt_model.predict_proba(X_test)[:, 1]

accuracy_dt = accuracy_score(y_test, y_pred_dt)
print(f"Decision Tree Accuracy: {accuracy_dt:.3f}")
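
The claim that trees do not need scaling can be checked empirically. The sketch below, on regenerated illustrative data, trains the same tree on raw and standardized features and measures how often the predictions agree. Names such as `tree_raw` and `tree_scaled` are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, n_informative=8, n_redundant=2,
    weights=[0.75, 0.25], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fit the scaler on training rows only
scaler_demo = StandardScaler().fit(X_tr)

# Same tree settings, once on raw features and once on standardized features
tree_raw = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_tr, y_tr)
tree_scaled = DecisionTreeClassifier(max_depth=5, random_state=42).fit(
    scaler_demo.transform(X_tr), y_tr
)

pred_raw = tree_raw.predict(X_te)
pred_scaled = tree_scaled.predict(scaler_demo.transform(X_te))

# Fraction of test rows where both trees give the same prediction
agreement = np.mean(pred_raw == pred_scaled)
print(f"Prediction agreement: {agreement:.1%}")
```

Tree splits depend only on the ordering of values within each feature, and standardization preserves that ordering, so the two trees are expected to agree on essentially every row.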

### 5.4 Random Forest

In [None]:
# YOUR CODE HERE:
# Train Random Forest with n_estimators=100, max_depth=5
rf_model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf_model.fit(X_train, y_train)

y_pred_rf = rf_model.predict(X_test)
y_pred_proba_rf = rf_model.predict_proba(X_test)[:, 1]

accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest Accuracy: {accuracy_rf:.3f}")

## Step 6: Evaluate Models

**Concept**: Accuracy alone is not enough! Use multiple metrics.

### 6.1 Calculate All Metrics

In [None]:
def evaluate_model(y_true, y_pred, y_pred_proba, model_name):
    """Calculate all metrics for a model"""
    return {
        'Model': model_name,
        'Accuracy': accuracy_score(y_true, y_pred),
        'Precision': precision_score(y_true, y_pred),
        'Recall': recall_score(y_true, y_pred),
        'F1-Score': f1_score(y_true, y_pred),
        'ROC-AUC': roc_auc_score(y_true, y_pred_proba)
    }

# Evaluate all models
results = [
    evaluate_model(y_test, y_pred_lr, y_pred_proba_lr, 'Logistic Regression'),
    evaluate_model(y_test, y_pred_knn, y_pred_proba_knn, 'K-Nearest Neighbors'),
    evaluate_model(y_test, y_pred_dt, y_pred_proba_dt, 'Decision Tree'),
    evaluate_model(y_test, y_pred_rf, y_pred_proba_rf, 'Random Forest')
]

# Create comparison DataFrame
results_df = pd.DataFrame(results)
print("\n" + "="*70)
print("MODEL COMPARISON")
print("="*70)
print(results_df.to_string(index=False))

# QUESTIONS:
# 1. Which model has the highest accuracy?
# 2. Which model has the highest recall? (important for catching churners)
# 3. Is there a model with high precision but low recall?
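
Precision and recall also depend on the decision threshold, not only on the model. The sketch below uses made-up labels and probabilities (`y_true_demo`, `proba_demo`) to show how lowering the threshold from the default 0.5 trades precision for recall.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Illustrative labels and predicted churn probabilities (invented for this sketch)
y_true_demo = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
proba_demo = np.array([0.05, 0.20, 0.35, 0.45, 0.40, 0.55, 0.60, 0.32, 0.10, 0.90])

def metrics_at(threshold):
    """Return (precision, recall) when predicting churn for proba >= threshold."""
    y_hat = (proba_demo >= threshold).astype(int)
    return precision_score(y_true_demo, y_hat), recall_score(y_true_demo, y_hat)

precision_default, recall_default = metrics_at(0.5)
precision_low, recall_low = metrics_at(0.3)

print(f"Threshold 0.5: precision={precision_default:.2f}, recall={recall_default:.2f}")
print(f"Threshold 0.3: precision={precision_low:.2f}, recall={recall_low:.2f}")
```

Lowering the threshold can never reduce recall, because every customer flagged at 0.5 is still flagged at 0.3. The cost is more false positives, which is why the right threshold depends on what a missed churner costs compared with an unnecessary retention offer.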

### 6.2 Confusion Matrix for Best Model

In [None]:
# Find best model by ROC-AUC
best_model_name = results_df.loc[results_df['ROC-AUC'].idxmax(), 'Model']
print(f"Best Model: {best_model_name}\n")

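# Look up the predictions that belong to the best model
predictions_by_model = {
    'Logistic Regression': y_pred_lr,
    'K-Nearest Neighbors': y_pred_knn,
    'Decision Tree': y_pred_dt,
    'Random Forest': y_pred_rf,
}
y_pred_best = predictions_by_model[best_model_name]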

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred_best)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['No Churn', 'Churn'],
            yticklabels=['No Churn', 'Churn'])
plt.title(f'Confusion Matrix - {best_model_name}')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

# Interpretation
tn, fp, fn, tp = cm.ravel()
print("\nConfusion Matrix Breakdown:")
print(f"True Negatives (correctly predicted no churn): {tn}")
print(f"False Positives (incorrectly predicted churn): {fp}")
print(f"False Negatives (missed churners): {fn}")
print(f"True Positives (correctly predicted churn): {tp}")
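
The four confusion-matrix counts are enough to recompute precision, recall, and F1 by hand. The sketch below does so on a small invented example (`y_true_demo`, `y_pred_demo`) and compares the results with scikit-learn's functions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Invented labels and predictions for illustration
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)  # of predicted churners, how many really churned
recall = tp / (tp + fn)     # of actual churners, how many were caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"tn={tn}, fp={fp}, fn={fn}, tp={tp}")
print(f"precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")
print("Matches sklearn:",
      np.isclose(precision, precision_score(y_true_demo, y_pred_demo)),
      np.isclose(recall, recall_score(y_true_demo, y_pred_demo)),
      np.isclose(f1, f1_score(y_true_demo, y_pred_demo)))
```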

### 6.3 ROC Curve

In [None]:
# Plot ROC curves for all models
plt.figure(figsize=(10, 8))

models_data = [
    ('Logistic Regression', y_pred_proba_lr),
    ('K-Nearest Neighbors', y_pred_proba_knn),
    ('Decision Tree', y_pred_proba_dt),
    ('Random Forest', y_pred_proba_rf)
]

for name, y_proba in models_data:
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    auc = roc_auc_score(y_test, y_proba)
    plt.plot(fpr, tpr, label=f'{name} (AUC = {auc:.3f})')

plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves - Model Comparison')
plt.legend()
plt.grid(alpha=0.3)
plt.show()

# QUESTION: What does the area under the ROC curve represent?
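
One common interpretation of ROC-AUC is the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. The sketch below checks this on four invented scores by counting pairs directly and comparing with `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Invented labels and scores for illustration
y_true_demo = np.array([0, 0, 1, 1])
scores_demo = np.array([0.1, 0.4, 0.35, 0.8])

pos = scores_demo[y_true_demo == 1]  # scores of actual churners
neg = scores_demo[y_true_demo == 0]  # scores of actual non-churners

# Compare every (churner, non-churner) pair; a tie counts as half a win
wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
pairwise_auc = wins / (len(pos) * len(neg))

sklearn_auc = roc_auc_score(y_true_demo, scores_demo)

print(f"Pairwise estimate: {pairwise_auc:.2f}")
print(f"roc_auc_score:     {sklearn_auc:.2f}")
```

Three of the four pairs are ranked correctly, so both calculations give 0.75. A value of 0.5 corresponds to random ranking, which is the dashed diagonal in the plot.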

## Step 7: Make Predictions on New Data

**Task**: Use the best model to predict churn for new customers

In [None]:
# Create sample new customers
new_customers = pd.DataFrame({
    'account_length': [100, 150],
    'international_plan': [1, 0],
    'voice_mail_plan': [0, 1],
    'num_voice_messages': [20, 10],
    'total_day_minutes': [200, 150],
    'total_day_calls': [100, 80],
    'total_eve_minutes': [150, 120],
    'total_night_minutes': [100, 90],
    'total_intl_calls': [5, 2],
    'customer_service_calls': [3, 1]
})

print("New Customers:")
print(new_customers)

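# Make predictions (assumes Random Forest is the best model; Logistic Regression
# or KNN would need scaler.transform(new_customers) first).
# Note: the synthetic training features are not in real-world units, so these
# hand-written values fall outside the training range and the output is illustrative only.
predictions = rf_model.predict(new_customers)
probabilities = rf_model.predict_proba(new_customers)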

print("\nPredictions:")
for i, (pred, prob) in enumerate(zip(predictions, probabilities)):
    print(f"\nCustomer {i+1}:")
    print(f"  Prediction: {'CHURN' if pred == 1 else 'NO CHURN'}")
    print(f"  Churn Probability: {prob[1]:.1%}")
    print(f"  Confidence: {'High' if max(prob) > 0.8 else 'Medium' if max(prob) > 0.6 else 'Low'}")
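
If the best model turns out to be one trained on scaled data, new rows must pass through the same fitted scaler before prediction. The following is a minimal sketch on regenerated data, where the last two rows stand in for new customers. The names `scaler_demo`, `lr_demo`, and `X_new` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=8, n_redundant=2,
    weights=[0.75, 0.25], random_state=42
)

# Hold out the last two rows to play the role of unseen customers
X_fit, y_fit = X_demo[:-2], y_demo[:-2]
X_new = X_demo[-2:]

# Fit the scaler and the model on the known customers only
scaler_demo = StandardScaler().fit(X_fit)
lr_demo = LogisticRegression(max_iter=1000).fit(scaler_demo.transform(X_fit), y_fit)

# Reuse the fitted scaler (transform, not fit_transform) on the new rows
X_new_scaled = scaler_demo.transform(X_new)
new_pred = lr_demo.predict(X_new_scaled)
new_proba = lr_demo.predict_proba(X_new_scaled)

for i, (pred, prob) in enumerate(zip(new_pred, new_proba), start=1):
    print(f"Customer {i}: prediction={pred}, churn probability={prob[1]:.1%}")
```

Calling `fit_transform` on the new rows instead would rescale them using their own statistics, which produces inputs the model was never trained on.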

## Step 8: Reflection Questions

Answer these to solidify your understanding:

1. **Why did we use train-test split?**
   - Answer: [Write your answer]

2. **Which metric is most important for churn prediction and why?**
   - Answer: [Write your answer]

3. **What does a high false negative rate mean for the business?**
   - Answer: [Write your answer]

4. **When would you choose Logistic Regression over Random Forest?**
   - Answer: [Write your answer]

5. **How can we improve model performance?**
   - Answer: [Write your answer]

---

## 🎯 Next Steps

1. ✅ Complete all cells in this notebook
2. ✅ Try changing hyperparameters (max_depth, n_neighbors, etc.)
3. ✅ Experiment with different train-test split ratios
4. ✅ Add cross-validation to the evaluation
5. ✅ Save the best model using pickle
6. ✅ Create a simple Flask API for the model
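
As a starting point for the cross-validation and pickle items above, here is a hedged sketch on regenerated data. It scores a Random Forest with 5-fold stratified cross-validation, then round-trips the fitted model through `pickle` in memory. The variable names and the choice of `roc_auc` scoring are assumptions for illustration.

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=8, n_redundant=2,
    weights=[0.75, 0.25], random_state=42
)

rf_demo = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=42)

# 5-fold stratified cross-validation, scored by ROC-AUC
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_auc = cross_val_score(rf_demo, X_demo, y_demo, cv=cv, scoring="roc_auc")
print(f"CV ROC-AUC: {cv_auc.mean():.3f} +/- {cv_auc.std():.3f}")

# cross_val_score fits clones, so fit the model itself before saving it
rf_demo.fit(X_demo, y_demo)

# Serialize to bytes and restore (pickle.dump / pickle.load do the same with a file)
model_bytes = pickle.dumps(rf_demo)
restored = pickle.loads(model_bytes)

same_predictions = np.array_equal(rf_demo.predict(X_demo), restored.predict(X_demo))
print(f"Restored model reproduces predictions: {same_predictions}")
```

Only unpickle files from sources you trust, and load them with the same scikit-learn version that saved them where possible.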

---

## 💡 Key Takeaways

- Always split data before training
- Use multiple metrics, not just accuracy
- Understand the confusion matrix
- Choose metrics based on business context
- Different algorithms have different strengths
- Feature scaling matters for some algorithms

---

**Great job! You've applied all concepts from DataCamp Chapter 1!** 🎉