# TeleChurn Predictor: Model Training and Evaluation

This notebook demonstrates training and evaluation for telecom customer churn prediction using our custom model modules. We train three models (XGBoost, LightGBM, and a neural network), each with a different strategy for handling class imbalance, and then compare them.

## 1. Setup and Data Loading

In [None]:
# Import standard libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import warnings
import os
import sys

# Configure visualizations
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12
sns.set_palette('viridis')

# Ignore warnings
warnings.filterwarnings('ignore')

# Display all columns
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', '{:.2f}'.format)

In [None]:
# Add scripts directory to path
sys.path.insert(0, '../scripts')  # This ensures scripts directory is first in path

# Import our custom modules
from base_model import BaseModel, ModelFactory
from gradient_boosting import GradientBoostingModel
from neural_network import NeuralNetworkModel
from training_pipeline import ModelTrainer, compare_models
from feature_engineering import FeatureEngineer

In [None]:
# Load the preprocessed data
# Notebooks do not define __file__, so take the parent of the working directory as the project root
base_dir = os.path.dirname(os.getcwd())
processed_data_dir = os.path.join(base_dir, "data", "processed")
train_file = "preprocessed_cell2celltrain.csv"
holdout_file = "preprocessed_cell2cellholdout.csv"

train_data = pd.read_csv(os.path.join(processed_data_dir, train_file))
holdout_data = pd.read_csv(os.path.join(processed_data_dir, holdout_file))

print(f"Training data shape: {train_data.shape}")
print(f"Holdout data shape: {holdout_data.shape}")

In [None]:
# Display the first few rows of the training data
train_data.head()

## 2. Data Preprocessing and Feature Engineering

In [None]:
# Convert Churn to numeric if it's a string
if train_data['Churn'].dtype == 'object':
    print("Converting Churn from string to numeric...")
    # Map 'Yes'/'No' to 1/0
    train_data['Churn'] = train_data['Churn'].map({'Yes': 1, 'No': 0})
    
if 'Churn' in holdout_data.columns and holdout_data['Churn'].dtype == 'object':
    holdout_data['Churn'] = holdout_data['Churn'].map({'Yes': 1, 'No': 0})

In [None]:
# Initialize the feature engineer
feature_eng = FeatureEngineer(
    remove_correlated=True, 
    correlation_threshold=0.85,
    id_columns=['CustomerID'],  # Explicitly exclude CustomerID from feature engineering
    selection_method='model_based'  # Use model-based feature selection for better results
)

# Apply feature engineering to training data
train_featured = feature_eng.fit_transform(train_data.copy())

# Apply feature engineering to holdout data
holdout_featured = feature_eng.transform(holdout_data.copy())

# Print shape comparison
print(f"Original training data shape: {train_data.shape}")
print(f"Engineered training data shape: {train_featured.shape}")
print(f"\nOriginal holdout data shape: {holdout_data.shape}")
print(f"Engineered holdout data shape: {holdout_featured.shape}")
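
The internals of `FeatureEngineer` are not shown in this notebook. As a rough illustration of what `remove_correlated=True` with `correlation_threshold=0.85` might do, the sketch below greedily drops any feature whose absolute Pearson correlation with an already kept feature exceeds the threshold. The function `drop_correlated` and the toy columns are hypothetical and are not part of our modules.

```python
import numpy as np

def drop_correlated(X, names, threshold=0.85):
    """Return the names of columns kept by a greedy correlation filter."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        # Keep column j only if it is not highly correlated with a kept column
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return [names[j] for j in keep]

# Toy data (hypothetical): 'charges' is almost a rescaled copy of 'minutes'
rng = np.random.default_rng(42)
minutes = rng.normal(500, 100, size=200)
X_toy = np.column_stack([
    minutes,
    minutes * 0.1 + rng.normal(0, 1, size=200),
    rng.normal(0, 1, size=200),  # independent feature
])
kept = drop_correlated(X_toy, ['minutes', 'charges', 'noise'])
print(kept)
```

Which column of a correlated pair is dropped depends on column order here; a real implementation may instead keep the one more strongly related to the target.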

In [None]:
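# Prepare data for modeling
def prepare_data_for_modeling(df, encoders=None):
    """Label-encode categorical columns.

    Pass the encoders fitted on the training data when preparing the holdout
    set so that both sets share the same category-to-integer mapping.
    """
    df_model = df.copy()
    encoders = {} if encoders is None else encoders
    
    # Check for categorical columns (the target is handled separately)
    categorical_cols = [col for col in df_model.columns 
                       if col != 'Churn' and
                       (df_model[col].dtype == 'object' or 
                        df_model[col].dtype.name == 'category')]
    
    # Encode categorical columns, fitting an encoder only the first time a column is seen
    for col in categorical_cols:
        values = df_model[col].astype(str)
        if col not in encoders:
            encoders[col] = LabelEncoder().fit(values)
        # Categories not seen during fitting are encoded as -1
        df_model[col] = pd.Categorical(values, categories=encoders[col].classes_).codes
    
    # Ensure target is binary numeric
    if 'Churn' in df_model.columns and df_model['Churn'].dtype == 'object':
        df_model['Churn'] = df_model['Churn'].map({'Yes': 1, 'No': 0})
    
    return df_model, encoders

# Prepare data: fit encoders on the training set, then reuse them for the holdout set
train_featured_model, label_encoders = prepare_data_for_modeling(train_featured)
holdout_featured_model, _ = prepare_data_for_modeling(holdout_featured, label_encoders)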

In [None]:
# Split data into features and target
X = train_featured_model.drop('Churn', axis=1) if 'Churn' in train_featured_model.columns else train_featured_model
y = train_featured_model['Churn'] if 'Churn' in train_featured_model.columns else None

# Split into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Prepare holdout set
X_holdout = holdout_featured_model.drop('Churn', axis=1) if 'Churn' in holdout_featured_model.columns else holdout_featured_model
y_holdout = holdout_featured_model['Churn'] if 'Churn' in holdout_featured_model.columns else None

print(f"Training set: {X_train.shape[0]} samples")
print(f"Validation set: {X_val.shape[0]} samples")
print(f"Holdout set: {X_holdout.shape[0]} samples")

In [None]:
# Check class distribution
print("Class distribution in training set:")
print(y_train.value_counts(normalize=True) * 100)

print("\nClass distribution in validation set:")
print(y_val.value_counts(normalize=True) * 100)

if y_holdout is not None:
    print("\nClass distribution in holdout set:")
    print(y_holdout.value_counts(normalize=True) * 100)

## 3. Model Training and Evaluation

We'll train and evaluate multiple models using our custom modules:
1. Gradient Boosting with XGBoost
2. Gradient Boosting with LightGBM
3. Neural Network

Each model is trained with class weighting and paired with a different resampling strategy (SMOTE, undersampling, or a hybrid of the two) to handle class imbalance.
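
To make the class-weighting idea concrete, here is a minimal sketch of the common "balanced" heuristic, which weights each class by `n_samples / (n_classes * class_count)`, together with the negative-to-positive ratio often used for XGBoost's `scale_pos_weight`. Whether `class_weight="balanced"` in our modules uses exactly this formula is an assumption; the helper `balanced_class_weights` and the 80/20 toy labels are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Toy labels (hypothetical 80/20 split, not the notebook's data)
y_toy = [0] * 80 + [1] * 20
weights = balanced_class_weights(y_toy)

# Ratio of negative to positive samples, as commonly used for scale_pos_weight
scale_pos_weight = y_toy.count(0) / y_toy.count(1)

print(weights)
print(scale_pos_weight)
```

Note that class weighting and resampling both compensate for imbalance, so applying them together corrects more strongly than either one alone.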

### 3.1 Gradient Boosting with XGBoost

In [None]:
# Create XGBoost model
xgb_model = GradientBoostingModel(
    model_name="XGBoost_Churn_Predictor",
    implementation="xgboost",
    params={
        'n_estimators': 200,
        'learning_rate': 0.1,
        'max_depth': 5,
        'min_child_weight': 1,
        'gamma': 0.1,
        'subsample': 0.8,
        'colsample_bytree': 0.8,
        'objective': 'binary:logistic',
        'eval_metric': 'auc',
        'scale_pos_weight': 1,  # Will be adjusted automatically for class imbalance
        'random_state': 42
    },
    class_weight="balanced",  # Handle class imbalance
    random_state=42
)

# Create model trainer with SMOTE resampling
xgb_trainer = ModelTrainer(
    model=xgb_model,
    resampling_strategy="smote",  # Use SMOTE to handle class imbalance
    resampling_ratio=0.5,  # Target 1:2 ratio of minority to majority class
    random_state=42,
    model_dir="../models"
)

In [None]:
# Run the training pipeline for XGBoost
xgb_results = xgb_trainer.run_training_pipeline(
    X_train, y_train,
    tune_hyperparameters=False,  # We're using predefined hyperparameters
    tune_threshold=True,  # Find optimal classification threshold
    cross_validate=True,  # Perform cross-validation
    cv=5,
    save_model=True,
    save_history=True,
    plot_cm=True,
    plot_roc=True,
    plot_pr=True,
    plot_prob_dist=True,
    plot_importance=True,
    importance_top_n=20,
    threshold_metric='f1'  # Optimize threshold for F1 score
)
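
The trainer above uses `resampling_strategy="smote"` with `resampling_ratio=0.5`. The sketch below illustrates the idea behind SMOTE under simplifying assumptions: synthetic minority rows are created by interpolating between a minority sample and one of its nearest minority neighbors until the minority-to-majority ratio reaches the target. It is a simplified illustration, not the implementation used by `ModelTrainer` (whose internals are not shown here), and `smote_like_oversample` is a hypothetical name.

```python
import numpy as np

def smote_like_oversample(X_min, n_majority, ratio=0.5, k=5, seed=42):
    """Create synthetic minority rows until minority/majority reaches `ratio`."""
    rng = np.random.default_rng(seed)
    n_needed = int(ratio * n_majority) - len(X_min)
    if n_needed <= 0:
        return np.empty((0, X_min.shape[1]))
    k = min(k, len(X_min) - 1)
    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]
    # Pick a random base sample and one of its k nearest neighbors
    base = rng.integers(0, len(X_min), size=n_needed)
    picked = neighbors[base, rng.integers(0, k, size=n_needed)]
    gap = rng.random((n_needed, 1))
    # Each synthetic point lies on the segment between a sample and its neighbor
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy example (hypothetical): 20 minority rows, 100 majority rows
X_min_toy = np.random.default_rng(0).normal(size=(20, 3))
synthetic = smote_like_oversample(X_min_toy, n_majority=100)
print(synthetic.shape)  # 30 new rows bring the minority class to 50 (1:2)
```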

### 3.2 Gradient Boosting with LightGBM

In [None]:
# Create LightGBM model
lgb_model = GradientBoostingModel(
    model_name="LightGBM_Churn_Predictor",
    implementation="lightgbm",
    params={
        'n_estimators': 200,
        'learning_rate': 0.1,
        'max_depth': 5,
        'num_leaves': 31,
        'min_child_samples': 20,
        'subsample': 0.8,
        'colsample_bytree': 0.8,
        'objective': 'binary',
        'metric': 'auc',
        'is_unbalance': True,  # Handle class imbalance
        'random_state': 42
    },
    class_weight="balanced",
    random_state=42
)

# Create model trainer with undersampling
lgb_trainer = ModelTrainer(
    model=lgb_model,
    resampling_strategy="undersample",  # Use undersampling to handle class imbalance
    resampling_ratio=0.5,  # Target 1:2 ratio of minority to majority class
    random_state=42,
    model_dir="../models"
)

In [None]:
# Run the training pipeline for LightGBM
lgb_results = lgb_trainer.run_training_pipeline(
    X_train, y_train,
    tune_hyperparameters=False,
    tune_threshold=True,
    cross_validate=True,
    cv=5,
    save_model=True,
    save_history=True,
    plot_cm=True,
    plot_roc=True,
    plot_pr=True,
    plot_prob_dist=True,
    plot_importance=True,
    importance_top_n=20,
    threshold_metric='f1'
)
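
For comparison with SMOTE, random undersampling reaches the same 1:2 target by discarding majority rows instead of creating minority ones. The following is a minimal sketch that assumes churners (label 1) are the minority class; `undersample_indices` is a hypothetical helper, not part of `ModelTrainer`.

```python
import numpy as np

def undersample_indices(y, ratio=0.5, seed=42):
    """Keep all minority rows and a random subset of majority rows."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    minority_idx = np.flatnonzero(y == 1)
    majority_idx = np.flatnonzero(y == 0)
    # Number of majority rows needed for minority/majority == ratio
    n_keep = min(len(majority_idx), int(len(minority_idx) / ratio))
    kept_majority = rng.choice(majority_idx, size=n_keep, replace=False)
    return np.sort(np.concatenate([minority_idx, kept_majority]))

# Toy labels (hypothetical): 80 non-churners, 20 churners
y_toy = np.array([0] * 80 + [1] * 20)
idx = undersample_indices(y_toy)
print(len(idx), int(y_toy[idx].sum()))
```

Undersampling keeps only real observations but throws away data, which is the trade-off against SMOTE's synthetic rows.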

### 3.3 Neural Network

In [None]:
# Create Neural Network model
nn_model = NeuralNetworkModel(
    model_name="NeuralNetwork_Churn_Predictor",
    hidden_layers=[128, 64, 32],  # Three hidden layers
    activations="relu",
    dropout_rate=0.3,
    learning_rate=0.001,
    batch_size=64,
    epochs=100,
    early_stopping_patience=10,
    class_weight="balanced",  # Handle class imbalance
    random_state=42
)

# Create model trainer with hybrid resampling
nn_trainer = ModelTrainer(
    model=nn_model,
    resampling_strategy="hybrid",  # Use hybrid approach (undersampling + SMOTE)
    resampling_ratio=0.5,
    random_state=42,
    model_dir="../models"
)

In [None]:
# Run the training pipeline for Neural Network
nn_results = nn_trainer.run_training_pipeline(
    X_train, y_train,
    tune_hyperparameters=False,
    tune_threshold=True,
    cross_validate=True,
    cv=5,
    save_model=True,
    save_history=True,
    plot_cm=True,
    plot_roc=True,
    plot_pr=True,
    plot_prob_dist=True,
    plot_importance=True,  # Will use permutation importance
    importance_top_n=20,
    threshold_metric='f1',
    train_params={
        'validation_split': 0.2,
        'verbose': 1
    }
)

In [None]:
# Plot training history for Neural Network
nn_model.plot_training_history()
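
The `early_stopping_patience=10` setting is easier to read against the training history with the stopping rule in mind. The sketch below shows the usual patience logic: training stops once the validation loss has failed to improve for `patience` consecutive epochs. It assumes `NeuralNetworkModel` follows this convention, which this notebook does not confirm; the function and loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return the 1-based epoch at which training would stop."""
    best_loss = float('inf')
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)  # patience never exhausted

# Toy validation losses (hypothetical): best value at epoch 3
toy_losses = [0.70, 0.60, 0.55, 0.56, 0.57, 0.58, 0.59, 0.60]
stop_epoch = early_stopping_epoch(toy_losses, patience=3)
print(stop_epoch)
```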

## 4. Model Comparison

In [None]:
# Compare all models
trainers = [xgb_trainer, lgb_trainer, nn_trainer]
comparison_df = compare_models(
    trainers=trainers,
    X=X_val,
    y=y_val,
    test_size=0.0,  # Use the entire validation set
    metrics=['accuracy', 'precision', 'recall', 'f1', 'auc'],
    plot=True,
    figsize=(14, 10)
)

In [None]:
# Display comparison results
comparison_df

## 5. Evaluate Best Model on Holdout Set

In [None]:
# Determine the best model based on F1 score
best_model_name = comparison_df.loc['f1'].idxmax()
print(f"Best model based on F1 score: {best_model_name}")

# Get the corresponding trainer
if best_model_name == 'XGBoost_Churn_Predictor':
    best_trainer = xgb_trainer
elif best_model_name == 'LightGBM_Churn_Predictor':
    best_trainer = lgb_trainer
else:
    best_trainer = nn_trainer

In [None]:
# Evaluate the best model on the holdout set
if y_holdout is not None:
    # Use the threshold tuned during the training pipeline (falls back to 0.5 if none was stored)
    optimal_threshold = best_trainer.training_history.get('optimal_threshold', {}).get('value', 0.5)
    
    print(f"Using optimal threshold: {optimal_threshold:.4f}")
    
    # Evaluate on holdout set
    holdout_metrics = best_trainer.evaluate_model(X_holdout, y_holdout, threshold=optimal_threshold)
    
    # Plot confusion matrix
    best_trainer.plot_confusion_matrix(X_holdout, y_holdout, threshold=optimal_threshold)
    
    # Plot ROC curve
    best_trainer.plot_roc_curve(X_holdout, y_holdout)
    
    # Plot Precision-Recall curve
    best_trainer.plot_precision_recall_curve(X_holdout, y_holdout)
else:
    print("Holdout set does not have target labels for evaluation.")
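
The threshold applied above comes from the training pipeline, which was run with `tune_threshold=True` and `threshold_metric='f1'`. As a minimal sketch of what such a search could look like, the code below scans a grid of thresholds and keeps the one with the highest F1. The grid, the helper name `best_f1_threshold`, and the toy probabilities are assumptions for illustration; `ModelTrainer` may search differently.

```python
import numpy as np

def best_f1_threshold(y_true, y_prob, thresholds=None):
    """Return the (threshold, F1) pair that maximizes F1 over a grid."""
    if thresholds is None:
        thresholds = np.linspace(0.05, 0.95, 19)
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    best_t, best_f1 = 0.5, -1.0
    for t in thresholds:
        y_pred = (y_prob >= t).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:  # ties keep the lowest threshold
            best_t, best_f1 = t, f1
    return best_t, best_f1

# Toy scores (hypothetical): two churners receive fairly low probabilities
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 0, 1]
y_prob_toy = [0.07, 0.12, 0.18, 0.22, 0.27, 0.31, 0.33, 0.42, 0.56, 0.81]
threshold, f1 = best_f1_threshold(y_true_toy, y_prob_toy)
print(f"threshold={threshold:.2f}, F1={f1:.2f}")
```

On imbalanced data the F1-optimal threshold is often below 0.5, which is why the holdout evaluation reuses the tuned value rather than the default.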

## 6. Feature Importance Analysis

In [None]:
# Analyze feature importance for the best model
if hasattr(best_trainer.model, 'plot_feature_importance'):
    importance_df = best_trainer.model.plot_feature_importance(top_n=20)
    
    # Display top features
    print("\nTop 20 features by importance:")
    display(importance_df.head(20))
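
Tree-based models expose importances directly, while the neural network pipeline in Section 3.3 notes that it falls back to permutation importance. The sketch below shows the general idea under simple assumptions: shuffle one column at a time and record how much accuracy drops. The function, the toy data, and the stand-in predictor are hypothetical and do not reflect how our modules compute it.

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=5, seed=42):
    """Mean drop in accuracy when each column is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = np.mean(predict(X) == y)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling breaks the link between column j and the target
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops.append(baseline - np.mean(predict(X_perm) == y))
        importances[j] = np.mean(drops)
    return importances

def toy_predict(X):
    """Stand-in for a fitted model: predicts from column 0 only."""
    return (X[:, 0] > 0).astype(int)

# Toy data (hypothetical): only column 0 drives the label
X_toy = np.random.default_rng(0).normal(size=(500, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)
importances = permutation_importance(toy_predict, X_toy, y_toy)
print(importances)
```

Columns the model ignores get an importance of zero, while the informative column loses roughly half its accuracy when shuffled.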

## 7. Hyperparameter Tuning (Optional)

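For demonstration, we run a randomized search (10 sampled combinations, 3-fold cross-validation, ROC AUC scoring) over the best model's hyperparameters to see whether performance improves. Tuning is skipped if the neural network is the best model.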
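
Each parameter grid in this section has seven parameters with three candidate values, which gives 3^7 = 2187 combinations. An exhaustive search with 3-fold cross-validation would need 6561 fits, whereas 10 random draws need 30. The sketch below shows one way to draw distinct combinations from a grid; it is an illustration only, `sample_param_combinations` is a hypothetical helper, and the sampling inside `tune_hyperparameters` may differ.

```python
import random

def sample_param_combinations(param_grid, n_iter=10, seed=42):
    """Draw up to n_iter distinct parameter combinations from a grid."""
    rng = random.Random(seed)
    keys = sorted(param_grid)
    total = 1
    for key in keys:
        total *= len(param_grid[key])
    n_iter = min(n_iter, total)  # cannot draw more than the grid contains
    combos, seen = [], set()
    while len(combos) < n_iter:
        combo = tuple(rng.choice(param_grid[key]) for key in keys)
        if combo not in seen:
            seen.add(combo)
            combos.append(dict(zip(keys, combo)))
    return combos, total

# Reduced toy grid (hypothetical subset of the parameters tuned below)
toy_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [100, 200, 300],
}
candidates, total = sample_param_combinations(toy_grid, n_iter=10)
print(f"{len(candidates)} of {total} combinations sampled")
```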

In [None]:
# Create a fresh instance of the best model for tuning
if best_model_name == 'XGBoost_Churn_Predictor':
    tuning_model = GradientBoostingModel(
        model_name="XGBoost_Tuned",
        implementation="xgboost",
        class_weight="balanced",
        random_state=42
    )
    
    # Define parameter grid for XGBoost
    param_grid = {
        'n_estimators': [100, 200, 300],
        'max_depth': [3, 5, 7],
        'learning_rate': [0.01, 0.05, 0.1],
        'subsample': [0.7, 0.8, 0.9],
        'colsample_bytree': [0.7, 0.8, 0.9],
        'min_child_weight': [1, 3, 5],
        'gamma': [0, 0.1, 0.2]
    }
    
elif best_model_name == 'LightGBM_Churn_Predictor':
    tuning_model = GradientBoostingModel(
        model_name="LightGBM_Tuned",
        implementation="lightgbm",
        class_weight="balanced",
        random_state=42
    )
    
    # Define parameter grid for LightGBM
    param_grid = {
        'n_estimators': [100, 200, 300],
        'max_depth': [3, 5, 7],
        'learning_rate': [0.01, 0.05, 0.1],
        'num_leaves': [15, 31, 63],
        'subsample': [0.7, 0.8, 0.9],
        'colsample_bytree': [0.7, 0.8, 0.9],
        'min_child_samples': [10, 20, 30]
    }
    
else:  # Neural Network
    print("Hyperparameter tuning for Neural Network is more complex and time-consuming.")
    print("Skipping tuning for this demonstration.")
    tuning_model = None
    param_grid = None

In [None]:
# Tune hyperparameters if we have a model to tune
if tuning_model is not None and param_grid is not None:
    # Build the model
    tuning_model.build()
    
    # Tune hyperparameters
    print("Tuning hyperparameters... (this may take a while)")
    tuning_model.tune_hyperparameters(
        X_train, y_train,
        param_grid=param_grid,
        cv=3,  # Use 3-fold CV for faster tuning
        scoring='roc_auc',
        n_iter=10,  # Try 10 random combinations
        method='random'  # Use random search for faster tuning
    )
    
    # Create trainer for tuned model
    tuned_trainer = ModelTrainer(
        model=tuning_model,
        resampling_strategy=best_trainer.resampling_strategy,
        resampling_ratio=best_trainer.resampling_ratio,
        random_state=42,
        model_dir="../models"
    )
    
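    # Evaluate tuned model with evaluate_model's default threshold
    # (assumes tune_hyperparameters leaves tuning_model refit with the best parameters found)
    tuned_metrics = tuned_trainer.evaluate_model(X_val, y_val)
    
    # Evaluate the original best model the same way for a like-for-like comparison
    best_metrics = best_trainer.evaluate_model(X_val, y_val)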
    
    # Create comparison dataframe
    metrics_comparison = pd.DataFrame({
        'Original Model': [best_metrics[m] for m in ['accuracy', 'precision', 'recall', 'f1', 'auc']],
        'Tuned Model': [tuned_metrics[m] for m in ['accuracy', 'precision', 'recall', 'f1', 'auc']]
    }, index=['accuracy', 'precision', 'recall', 'f1', 'auc'])
    
    # Display comparison
    display(metrics_comparison)
    
    # Plot comparison
    metrics_comparison.plot(kind='bar', figsize=(12, 8))
    plt.title('Performance Comparison: Original vs Tuned Model')
    plt.ylabel('Score')
    plt.ylim(0, 1)
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    plt.tight_layout()
    plt.show()

## 8. Conclusions

### Key Findings

1. **Model Performance:**
   - We trained and evaluated three models: XGBoost, LightGBM, and a neural network
   - Each model was paired with a different resampling strategy (SMOTE, undersampling, and a hybrid approach, respectively), so the comparison reflects each model-plus-resampling combination rather than the model alone
   - The best model was selected by F1 score, the harmonic mean of precision and recall

2. **Class Imbalance Handling:**
   - We demonstrated multiple approaches to handle class imbalance:
     - Resampling techniques (SMOTE, undersampling, hybrid)
     - Class weighting in the models
     - Threshold optimization to balance precision and recall

3. **Feature Importance:**
   - We identified the most important features for churn prediction
   - This provides actionable insights for business stakeholders

4. **Hyperparameter Tuning:**
   - We demonstrated how to tune hyperparameters to potentially improve model performance

### Next Steps

1. **Model Deployment:**
   - Deploy the best model in a production environment
   - Implement monitoring to track model performance over time

2. **Feature Engineering Refinement:**
   - Further refine feature engineering based on feature importance analysis
   - Explore additional domain-specific features

3. **Ensemble Methods:**
   - Explore ensemble methods combining multiple models for potentially better performance

4. **Explainability:**
   - Implement SHAP values or other explainability techniques to provide more detailed insights into model predictions

5. **Business Integration:**
   - Develop a system to translate model predictions into actionable business interventions
   - Create dashboards for business users to monitor churn risk and take preventive actions