# Feature Selection for AI Business Coach

This notebook demonstrates the process of selecting important features for predicting customer churn.
We preprocess the data, explore correlations, apply feature selection techniques, and prepare the dataset
for model training. The goal is to improve model performance and interpretability.


# Import Libraries

In this section, we import the libraries used for data processing, statistical testing, resampling, model building, and evaluation.

- **pandas, numpy**: Data manipulation and numerical computations  
- **scipy.stats**: Statistical analysis, including skewness, t-tests, Mann-Whitney U, and Chi-square tests  
- **sklearn.preprocessing**: Standardization of features (`StandardScaler`)  
- **sklearn.model_selection**: Train-test splitting  
- **sklearn.ensemble**: Random Forest classifier  
- **sklearn.metrics**: Model evaluation metrics  
- **imblearn.over_sampling**: SMOTE oversampling of the minority class  
- **xgboost**: Gradient-boosted tree classifier  
- **torch**: Neural network model, tensor datasets, and data loaders  
- **warnings**: Suppress warnings for clean output



In [117]:
# ================================================================
# 0. IMPORT LIBRARIES
# ================================================================
import pandas as pd
import numpy as np
import warnings
from scipy import stats
from scipy.stats import chi2_contingency, ttest_ind, mannwhitneyu

# Machine Learning
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (roc_auc_score, average_precision_score, accuracy_score,
                             classification_report, confusion_matrix)
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
import xgboost as xgb

# Deep Learning (PyTorch)
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

warnings.filterwarnings('ignore')

## Load Dataset

In this step, we load the preprocessed Telco customer churn dataset from a CSV file.  
The dataset contains customer information and features that will be used for feature selection and churn prediction.


In [118]:
# Load dataset
file_path = '/content/sample_data/preprocessed_telco_churn.csv'
df = pd.read_csv(file_path)

df.head()

Unnamed: 0,gender,SeniorCitizen,tenure,MonthlyCharges,TotalCharges,Churn,Partner_Yes,Dependents_Yes,PhoneService_Yes,MultipleLines_No phone service,...,StreamingTV_No internet service,StreamingTV_Yes,StreamingMovies_No internet service,StreamingMovies_Yes,Contract_One year,Contract_Two year,PaperlessBilling_Yes,PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
0,0,0,1,29.85,29.85,0,1,0,0,1,...,0,0,0,0,0,0,1,0,1,0
1,1,0,34,56.95,1889.5,0,0,0,1,0,...,0,0,0,0,1,0,0,0,0,1
2,1,0,2,53.85,108.15,1,0,0,1,0,...,0,0,0,0,0,0,1,0,0,1
3,1,0,45,42.3,1840.75,0,0,0,0,1,...,0,0,0,0,1,0,0,0,0,0
4,0,0,2,70.7,151.65,1,0,0,1,0,...,0,0,0,0,0,0,1,0,1,0


# Model Evaluation

We evaluate model performance using ROC-AUC, PR-AUC, confusion matrix, and classification report.


# Comprehensive Statistical Analysis - Telco Churn Dataset
This section performs an in-depth statistical overview of the Telco Churn dataset, including:
- Dataset shape and memory usage
- Basic dataset info
- First few rows preview
- Missing values and duplicate rows


In [119]:
# Dataset Overview
print("="*80)
print("COMPREHENSIVE STATISTICAL ANALYSIS - TELCO CHURN DATASET")
print("="*80)
print(f"Dataset Shape: {df.shape}")
print(f"Memory Usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

print("\n" + "="*50)
print("1. DATASET OVERVIEW")
print("="*50)

print("\nDataset Info:")
print(df.info())

print("\nFirst 5 rows:")
print(df.head())

print("\nMissing Values:")
missing_data = df.isnull().sum()
if missing_data.sum() > 0:
    print(missing_data[missing_data > 0])
else:
    print("No missing values found!")

print("\nDuplicate Rows:", df.duplicated().sum())


COMPREHENSIVE STATISTICAL ANALYSIS - TELCO CHURN DATASET
Dataset Shape: (7043, 31)
Memory Usage: 1.67 MB

1. DATASET OVERVIEW

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 31 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   gender                                 7043 non-null   int64  
 1   SeniorCitizen                          7043 non-null   int64  
 2   tenure                                 7043 non-null   int64  
 3   MonthlyCharges                         7043 non-null   float64
 4   TotalCharges                           7043 non-null   float64
 5   Churn                                  7043 non-null   int64  
 6   Partner_Yes                            7043 non-null   int64  
 7   Dependents_Yes                         7043 non-null   int64  
 8   PhoneService_Yes                       7043 non-null   int64  
 9  

### Dataset Overview

*The Telco Churn dataset contains 7043 rows and 31 columns (30 features plus the `Churn` target). The first five rows show a mix of continuous variables and binary-encoded categorical variables. No missing values were found. The cell above only counts duplicate rows; the 22 duplicates are dropped later, inside the pipeline, with `drop_duplicates()`.*


# Target Variable Analysis
Identify the target column (churn or binary target), display distribution, proportions,
and check for class imbalance.


In [148]:
# Identify Target Variable
target_cols = [col for col in df.columns if 'churn' in col.lower() or 'target' in col.lower()]
if not target_cols:
    binary_cols = [col for col in df.columns if df[col].nunique() == 2]
    target_col = binary_cols[0] if binary_cols else df.columns[-1]
else:
    target_col = target_cols[0]

print(f"Target Variable: {target_col}")
print("\nTarget Distribution:")
target_counts = df[target_col].value_counts()
print(target_counts)
print(f"\nTarget Proportions:")
print(df[target_col].value_counts(normalize=True))

# Class imbalance check
minority_class_ratio = min(target_counts) / sum(target_counts)
print(f"\nClass Imbalance Ratio: {minority_class_ratio:.3f}")
if minority_class_ratio < 0.3:
    print("⚠️  SIGNIFICANT CLASS IMBALANCE DETECTED!")


Target Variable: Churn

Target Distribution:
Churn
0    5174
1    1869
Name: count, dtype: int64

Target Proportions:
Churn
0    0.73463
1    0.26537
Name: proportion, dtype: float64

Class Imbalance Ratio: 0.265
⚠️  SIGNIFICANT CLASS IMBALANCE DETECTED!


### Target Variable Analysis

*The target variable `Churn` has 5174 non-churned and 1869 churned customers, yielding a proportion of approximately 73% to 27%. This significant class imbalance motivates the use of SMOTE to generate synthetic samples for the minority class, improving model learning and preventing bias toward the majority class.*
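
The imbalance figures above follow directly from the two class counts. As a minimal sketch, the arithmetic can be reproduced as shown here. The negative-to-positive ratio is included only as an illustration of the value one might pass to a positive-class weight (for example XGBoost's `scale_pos_weight`) if weighting were used instead of resampling; the notebook itself does not set that parameter.

```python
# Class counts taken from the target distribution printed above.
n_retained = 5174  # Churn == 0
n_churned = 1869   # Churn == 1

# Minority share: the quantity the notebook prints as "Class Imbalance Ratio".
minority_ratio = n_churned / (n_retained + n_churned)

# Negative-to-positive ratio. This is an illustrative alternative to
# resampling and is NOT used anywhere in the notebook's pipeline.
neg_pos_ratio = n_retained / n_churned

print(f"Minority ratio: {minority_ratio:.3f}")
print(f"Negative/positive ratio: {neg_pos_ratio:.2f}")
```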


# Identifying Numerical and Categorical Features

This step separates the dataset features into numerical and categorical types. The target column is excluded from both lists. This distinction is important for preprocessing steps such as scaling numerical features for neural networks and encoding categorical features for tree-based models.


In [121]:
numerical_features = df.select_dtypes(include=[np.number]).columns.tolist()
if target_col in numerical_features:
    numerical_features.remove(target_col)

categorical_features = df.select_dtypes(include=['object', 'category']).columns.tolist()
if target_col in categorical_features:
    categorical_features.remove(target_col)

print(f"Numerical Features ({len(numerical_features)}): {numerical_features}")
print(f"Categorical Features ({len(categorical_features)}): {categorical_features}")


Numerical Features (30): ['gender', 'SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges', 'Partner_Yes', 'Dependents_Yes', 'PhoneService_Yes', 'MultipleLines_No phone service', 'MultipleLines_Yes', 'InternetService_Fiber optic', 'InternetService_No', 'OnlineSecurity_No internet service', 'OnlineSecurity_Yes', 'OnlineBackup_No internet service', 'OnlineBackup_Yes', 'DeviceProtection_No internet service', 'DeviceProtection_Yes', 'TechSupport_No internet service', 'TechSupport_Yes', 'StreamingTV_No internet service', 'StreamingTV_Yes', 'StreamingMovies_No internet service', 'StreamingMovies_Yes', 'Contract_One year', 'Contract_Two year', 'PaperlessBilling_Yes', 'PaymentMethod_Credit card (automatic)', 'PaymentMethod_Electronic check', 'PaymentMethod_Mailed check']
Categorical Features (0): []


### Feature Types

*Every column is already numeric: a few continuous variables (`tenure`, `MonthlyCharges`, `TotalCharges`) plus binary indicators produced by encoding the categorical columns, so no unencoded categorical features remain. Models such as Random Forest, XGBoost, and neural networks can therefore consume the data without additional encoding. The features kept later in the notebook cover customer demographics, contract type, internet services, and payment methods.*
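
Because every column has a numeric dtype, `select_dtypes` cannot tell continuous measurements apart from 0/1 indicators. One possible way to separate them, sketched here on a small hypothetical frame (the rows are made up for illustration), is to check whether a column contains only the values 0 and 1.

```python
import pandas as pd

# Hypothetical miniature of the encoded dataset; values are illustrative only.
toy = pd.DataFrame({
    "tenure": [1, 34, 2, 45],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
    "Partner_Yes": [1, 0, 0, 0],
    "Contract_Two year": [0, 0, 0, 1],
})

# A column is treated as a binary indicator if all of its values are 0 or 1.
binary_cols = [c for c in toy.columns if toy[c].dropna().isin([0, 1]).all()]
continuous_cols = [c for c in toy.columns if c not in binary_cols]

print("Binary:", binary_cols)
print("Continuous:", continuous_cols)
```

With such a split, skewness checks, outlier checks, and scaling can be restricted to the continuous columns, where they are most informative.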



# Numerical Features Analysis
- Display descriptive statistics
- Analyze skewness
- Detect outliers using the IQR method


In [122]:
if numerical_features:
    print("\nDescriptive Statistics:")
    print(df[numerical_features].describe())

    # Skewness Analysis
    print("\nSkewness Analysis:")
    skewness_data = []
    for col in numerical_features:
        skew_val = stats.skew(df[col].dropna())
        skewness_data.append({
            'Feature': col,
            'Skewness': skew_val,
            'Interpretation': 'Highly Skewed' if abs(skew_val) > 1 else
                           'Moderately Skewed' if abs(skew_val) > 0.5 else 'Approximately Symmetric'
        })
    skew_df = pd.DataFrame(skewness_data)
    print(skew_df)

    # Outlier Detection
    print("\nOutlier Analysis (IQR Method):")
    outlier_data = []
    for col in numerical_features:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)][col]
        outlier_data.append({
            'Feature': col,
            'Outliers_Count': len(outliers),
            'Outliers_Percentage': (len(outliers) / len(df)) * 100,
            'Lower_Bound': lower_bound,
            'Upper_Bound': upper_bound
        })
    outlier_df = pd.DataFrame(outlier_data)
    print(outlier_df)



Descriptive Statistics:
            gender  SeniorCitizen       tenure  MonthlyCharges  TotalCharges  \
count  7043.000000    7043.000000  7043.000000     7043.000000   7043.000000   
mean      0.504756       0.162147    32.371149       64.761692   2283.300441   
std       0.500013       0.368612    24.559481       30.090047   2265.000258   
min       0.000000       0.000000     0.000000       18.250000     18.800000   
25%       0.000000       0.000000     9.000000       35.500000    402.225000   
50%       1.000000       0.000000    29.000000       70.350000   1400.550000   
75%       1.000000       0.000000    55.000000       89.850000   3786.600000   
max       1.000000       1.000000    72.000000      118.750000   8684.800000   

       Partner_Yes  Dependents_Yes  PhoneService_Yes  \
count  7043.000000     7043.000000       7043.000000   
mean      0.483033        0.299588          0.903166   
std       0.499748        0.458110          0.295752   
min       0.000000        0.00

### Descriptive Statistics, Skewness, and Outlier Analysis

*This output summarizes the numerical features of the Telco Churn dataset. The descriptive statistics report counts, means, standard deviations, minimums, maximums, and quartiles for each feature. The skewness analysis measures asymmetry: values near 0 indicate a roughly symmetric distribution (which is not the same as a normal one), and |skewness| > 1 indicates strong skew. The outlier analysis applies the IQR rule and reports the count and percentage of flagged values per feature. Both measures are most meaningful for the continuous features (`tenure`, `MonthlyCharges`, `TotalCharges`). For a lopsided binary indicator such as `PhoneService_Yes` (mean 0.90), Q1 and Q3 coincide, the IQR is 0, and every minority value is flagged as an "outlier". These analyses guide feature selection, preprocessing, and transformation decisions.*
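
The IQR rule behaves differently on 0/1 indicator columns than on continuous ones. As a small sketch, the hypothetical column below has roughly the same balance as `PhoneService_Yes`; it is constructed for illustration and is not read from the dataset.

```python
import pandas as pd

# Hypothetical indicator column: 90 ones and 10 zeros.
indicator = pd.Series([1] * 90 + [0] * 10)

q1 = indicator.quantile(0.25)
q3 = indicator.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# With IQR == 0 the bounds collapse to a single value, so every 0 is flagged.
n_flagged = int(((indicator < lower) | (indicator > upper)).sum())

print(f"Q1={q1}, Q3={q3}, IQR={iqr}, flagged={n_flagged}")
```

The flagged rows here are simply the less common category, which is why outlier counts for indicator columns should not be read as data-quality problems.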



# Correlation Analysis
- Compute correlation matrix for numerical features
- Identify highly correlated pairs (|corr| > 0.7)
- Check for multicollinearity


In [123]:
if len(numerical_features) > 1:
    corr_matrix = df[numerical_features].corr()
    print("\nHighly Correlated Feature Pairs (|correlation| > 0.7):")
    high_corr_pairs = []
    for i in range(len(corr_matrix.columns)):
        for j in range(i+1, len(corr_matrix.columns)):
            corr_val = corr_matrix.iloc[i, j]
            if abs(corr_val) > 0.7:
                high_corr_pairs.append({
                    'Feature_1': corr_matrix.columns[i],
                    'Feature_2': corr_matrix.columns[j],
                    'Correlation': corr_val
                })
    if high_corr_pairs:
        high_corr_df = pd.DataFrame(high_corr_pairs)
        print(high_corr_df)
        print("\n⚠️  High multicollinearity detected!")
    else:
        print("No highly correlated feature pairs found.")



Highly Correlated Feature Pairs (|correlation| > 0.7):
                               Feature_1  \
0                                 tenure   
1                         MonthlyCharges   
2                         MonthlyCharges   
3                         MonthlyCharges   
4                         MonthlyCharges   
5                         MonthlyCharges   
6                         MonthlyCharges   
7                         MonthlyCharges   
8                         MonthlyCharges   
9                       PhoneService_Yes   
10                    InternetService_No   
11                    InternetService_No   
12                    InternetService_No   
13                    InternetService_No   
14                    InternetService_No   
15                    InternetService_No   
16    OnlineSecurity_No internet service   
17    OnlineSecurity_No internet service   
18    OnlineSecurity_No internet service   
19    OnlineSecurity_No internet service   
20    OnlineSecurity

### Highly Correlated Feature Pairs

*This output shows feature pairs with strong correlations (|corr| > 0.7), highlighting redundant or highly related features in the dataset. High multicollinearity is detected, which may impact model performance and interpretability.*
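
The nested loop in the cell above can also be written with a mask over the upper triangle of the correlation matrix. The following is a sketch on a hypothetical three-column frame (column names and values are invented), not a replacement for the notebook's code.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: "b" is a linear function of "a", "c" is nearly unrelated.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [3, 5, 7, 9, 11, 13],
    "c": [1, -1, 1, -1, 1, -1],
})

corr = toy.corr()

# Keep only the strict upper triangle so each pair appears once
# and the diagonal (self-correlation) is excluded.
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper_mask).stack()

# Filter by absolute correlation; NaN entries never pass the comparison.
high_pairs = pairs[pairs.abs() > 0.7]
print(high_pairs)
```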



# Target-Feature Relationship Analysis
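- Numerical features: Mann-Whitney U test when either group has more than 5000 rows, otherwise an independent-samples t-test
- Categorical features: Chi-square test of independence
- Compute significance (p < 0.05) and effect sizes (Cohen's d for numerical features, Cramér's V for categorical features)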


In [124]:
# Numerical
target_relationship_data = []
if numerical_features:
    for feature in numerical_features:
        group_0 = df[df[target_col] == df[target_col].unique()[0]][feature].dropna()
        group_1 = df[df[target_col] == df[target_col].unique()[1]][feature].dropna()
        if len(group_0) > 5000 or len(group_1) > 5000:
            statistic, p_value = mannwhitneyu(group_0, group_1, alternative='two-sided')
            test_type = "Mann-Whitney U"
        else:
            statistic, p_value = ttest_ind(group_0, group_1)
            test_type = "T-test"
        target_relationship_data.append({
            'Feature': feature,
            'Test_Type': test_type,
            'P_Value': p_value,
            'Significant': 'Yes' if p_value < 0.05 else 'No',
            'Effect_Size': abs(group_0.mean() - group_1.mean()) / np.sqrt((group_0.var() + group_1.var()) / 2)
        })
    target_rel_df = pd.DataFrame(target_relationship_data)
    print(target_rel_df)

# Categorical
cat_target_data = []
if categorical_features:
    for feature in categorical_features:
        contingency_table = pd.crosstab(df[feature], df[target_col])
        if contingency_table.shape[0] > 1 and contingency_table.shape[1] > 1:
            chi2, p_value, dof, expected = chi2_contingency(contingency_table)
            n = contingency_table.sum().sum()
            cramers_v = np.sqrt(chi2 / (n * (min(contingency_table.shape) - 1)))
            cat_target_data.append({
                'Feature': feature,
                'Chi2_Statistic': chi2,
                'P_Value': p_value,
                'Significant': 'Yes' if p_value < 0.05 else 'No',
                'Cramers_V': cramers_v,
                'Effect_Size': 'Large' if cramers_v > 0.25 else 'Medium' if cramers_v > 0.15 else 'Small'
            })
    if cat_target_data:
        cat_target_df = pd.DataFrame(cat_target_data)
        print(cat_target_df)


                                  Feature       Test_Type        P_Value  \
0                                  gender  Mann-Whitney U   4.698702e-01   
1                           SeniorCitizen  Mann-Whitney U   9.588357e-37   
2                                  tenure  Mann-Whitney U  2.419636e-208   
3                          MonthlyCharges  Mann-Whitney U   3.311628e-54   
4                            TotalCharges  Mann-Whitney U   1.085274e-84   
5                             Partner_Yes  Mann-Whitney U   1.536583e-36   
6                          Dependents_Yes  Mann-Whitney U   3.321176e-43   
7                        PhoneService_Yes  Mann-Whitney U   3.162867e-01   
8          MultipleLines_No phone service  Mann-Whitney U   3.162867e-01   
9                       MultipleLines_Yes  Mann-Whitney U   7.648058e-04   
10            InternetService_Fiber optic  Mann-Whitney U  2.564395e-147   
11                     InternetService_No  Mann-Whitney U   1.600498e-81   
12     Onlin

# Feature Significance (Telco Churn)

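- Because the non-churned group has more than 5000 rows, every feature is tested with the Mann-Whitney U test, which checks whether values differ between churned and non-churned customers.  
- **Significant features (p < 0.05)** include `tenure`, `MonthlyCharges`, `TotalCharges`, and indicators derived from `Contract`, `PaymentMethod`, and `InternetService`.  
- **Non-significant features** include `gender` (p ≈ 0.47) and the phone-service indicators `PhoneService_Yes` and `MultipleLines_No phone service` (p ≈ 0.32).  
- **Effect size** is the absolute standardized mean difference (Cohen's d). It is not capped at 1; by the usual convention, about 0.2 is small, 0.5 is medium, and 0.8 or more is large.  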
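
The effect size used for the numerical features is the absolute difference of group means divided by the square root of the average of the two group variances. The sketch below restates that formula as a standalone function; the helper names `cohens_d` and `label_effect` and the sample values are hypothetical, and the labels follow the common rule-of-thumb thresholds rather than anything computed in the notebook.

```python
import numpy as np

def cohens_d(group_a, group_b):
    """Absolute standardized mean difference, using the same form as the
    notebook: mean gap over sqrt of the average of the sample variances."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return abs(a.mean() - b.mean()) / pooled_sd

def label_effect(d):
    """Rule-of-thumb labels for Cohen's d (0.2 / 0.5 / 0.8 thresholds)."""
    if d >= 0.8:
        return "large"
    if d >= 0.5:
        return "medium"
    if d >= 0.2:
        return "small"
    return "negligible"

# Hypothetical values for two groups (not taken from the dataset).
retained = [1, 2, 3, 4, 5]
churned = [3, 4, 5, 6, 7]
d = cohens_d(retained, churned)
print(f"d = {d:.3f} ({label_effect(d)})")
```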


# Feature Importance Ranking
- Combine numerical and categorical significance
- Sort features by effect size or Cramer's V
- Provide recommendations for modeling




In [125]:
important_features = []

if target_relationship_data:
    sig_num_features = [f for f in target_relationship_data if f['Significant'] == 'Yes']
    sig_num_features.sort(key=lambda x: x['Effect_Size'], reverse=True)
    for feature in sig_num_features:
        important_features.append({
            'Feature': feature['Feature'],
            'Type': 'Numerical',
            'Importance_Score': feature['Effect_Size'],
            'P_Value': feature['P_Value']
        })

if cat_target_data:
    sig_cat_features = [f for f in cat_target_data if f['Significant'] == 'Yes']
    sig_cat_features.sort(key=lambda x: x['Cramers_V'], reverse=True)
    for feature in sig_cat_features:
        important_features.append({
            'Feature': feature['Feature'],
            'Type': 'Categorical',
            'Importance_Score': feature['Cramers_V'],
            'P_Value': feature['P_Value']
        })

if important_features:
    importance_df = pd.DataFrame(important_features).sort_values('Importance_Score', ascending=False)
    print("Top Features by Statistical Significance:")
    print(importance_df)

    print(f"\n🎯 RECOMMENDATIONS:")
    print("="*50)
    top_features = importance_df.head(10)['Feature'].tolist()
    print(f"1. Focus on these top features: {top_features}")

    if minority_class_ratio < 0.3:
        print("2. Address class imbalance using SMOTE")
        print("3. Use class weights in models")


Top Features by Statistical Significance:
                                  Feature       Type  Importance_Score  \
0                                  tenure  Numerical          0.892829   
1                       Contract_Two year  Numerical          0.841158   
2             InternetService_Fiber optic  Numerical          0.738767   
3          PaymentMethod_Electronic check  Numerical          0.694543   
4                      InternetService_No  Numerical          0.595047   
5      OnlineSecurity_No internet service  Numerical          0.595047   
6        OnlineBackup_No internet service  Numerical          0.595047   
7    DeviceProtection_No internet service  Numerical          0.595047   
8         TechSupport_No internet service  Numerical          0.595047   
9         StreamingTV_No internet service  Numerical          0.595047   
10    StreamingMovies_No internet service  Numerical          0.595047   
11                           TotalCharges  Numerical          0.482495

# Highly Correlated Features
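*Features with |correlation| > 0.7 may be redundant and cause multicollinearity. In the ranking above, `InternetService_No` and the six `*_No internet service` indicators share the same score (0.595047) because they encode the same fact and are perfectly correlated.*  
*Keep only one feature from each such group, or handle them carefully in modeling.*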
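
When several indicator columns are exact copies of one another, they can be detected without a correlation threshold. The following is a minimal sketch on a hypothetical frame that borrows a few column names from the dataset; the values are invented.

```python
import pandas as pd

# Hypothetical rows: the first three columns carry identical values.
toy = pd.DataFrame({
    "InternetService_No": [0, 1, 0, 1],
    "OnlineSecurity_No internet service": [0, 1, 0, 1],
    "TechSupport_No internet service": [0, 1, 0, 1],
    "tenure": [1, 34, 2, 45],
})

# Transpose so columns become rows, then mark rows that repeat an earlier one.
is_duplicate = toy.T.duplicated()
redundant = is_duplicate[is_duplicate].index.tolist()

# Drop the repeats and keep the first column of each identical group.
deduplicated = toy.drop(columns=redundant)
print("Dropped:", redundant)
print("Kept:", deduplicated.columns.tolist())
```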


# Model Evaluation Helper
This helper function evaluates a trained model on the test set.  
It prints key metrics (Accuracy, ROC-AUC, PR-AUC), confusion matrix, and classification report.  
It also returns a dictionary with the main scores for later use.
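
ROC-AUC and PR-AUC answer slightly different questions, and their chance-level baselines differ. The sketch below uses four hypothetical predictions to show both scores; the labels and probabilities are invented for illustration.

```python
from sklearn.metrics import average_precision_score, roc_auc_score

# Hypothetical labels and predicted churn probabilities for four customers.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Probability that a random positive is ranked above a random negative.
roc_auc = roc_auc_score(y_true, y_score)

# Average precision, the PR-AUC estimate used by the helper.
pr_auc = average_precision_score(y_true, y_score)

# A random classifier scores about 0.5 on ROC-AUC, but its PR-AUC is about
# the share of positives in the evaluation set.
pr_baseline = sum(y_true) / len(y_true)

print(f"ROC-AUC: {roc_auc:.4f}, PR-AUC: {pr_auc:.4f}, PR baseline: {pr_baseline:.2f}")
```

The PR baseline matters when comparing runs: on a test set with the original 27% churn share it is about 0.27, whereas on a balanced test set it is about 0.5.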


In [149]:


def evaluate_model(model, X_test, y_test, model_name):
    """
    Evaluate a trained classification model on test data.
    """
    print(f"\n{'='*50}")
    print(f"EVALUATION: {model_name}")
    print(f"{'='*50}")

    # Predictions and probabilities
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)[:, 1]

    # Metrics
    accuracy = accuracy_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_pred_proba)
    pr_auc = average_precision_score(y_test, y_pred_proba)

    # Display results
    print(f"Accuracy: {accuracy:.4f}")
    print(f"ROC-AUC Score: {roc_auc:.4f}")
    print(f"PR-AUC Score: {pr_auc:.4f}")

    print("\nConfusion Matrix:")
    print(confusion_matrix(y_test, y_pred))

    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))

    return {'accuracy': accuracy, 'roc_auc': roc_auc, 'pr_auc': pr_auc}


# Neural Network Model (PyTorch)
A simple feedforward neural network for churn prediction:  
- Input → Fully connected (64) → ReLU → Dropout  
- Hidden → Fully connected (32) → ReLU → Dropout  
- Output → Single node (binary classification)  
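
To get a feel for the size of this network, the number of trainable parameters can be worked out by hand. The sketch assumes 11 input features (the length of the recommended feature list used later); `linear_params` is a hypothetical helper, not part of PyTorch. ReLU and Dropout layers add no parameters.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

# Assumption: 11 input features, matching the recommended feature list.
n_inputs = 11
layer_sizes = [(n_inputs, 64), (64, 32), (32, 1)]

total_params = sum(linear_params(n_in, n_out) for n_in, n_out in layer_sizes)
print(f"Trainable parameters: {total_params}")
```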



In [151]:
import torch
import torch.nn as nn

# Neural network class for churn prediction
class ChurnModel(nn.Module):
    def __init__(self, input_features):
        super(ChurnModel, self).__init__()
        self.layer_1 = nn.Linear(input_features, 64)
        self.relu = nn.ReLU()
        self.layer_2 = nn.Linear(64, 32)
        self.dropout = nn.Dropout(0.2)
        self.layer_out = nn.Linear(32, 1)

    def forward(self, x):
        x = self.relu(self.layer_1(x))
        x = self.dropout(x)
        x = self.relu(self.layer_2(x))
        x = self.dropout(x)
        return self.layer_out(x)



# 4. Main Pipeline Function: Comprehensive Analysis and Modeling

This function, `run_complete_pipeline(df)`, executes the full end-to-end workflow for churn prediction, including:

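1. **Comprehensive statistical analysis** via `comprehensive_data_analysis(df)`, which returns the target column and the statistically important features.
2. **Data preparation**: dropping duplicate rows and keeping the entries of a fixed `recommended_features` list that exist in the dataset. The list is hard-coded rather than taken from the ranking returned in step 1, and no missing-value handling is performed because the dataset has none.
3. **Class imbalance handling** using SMOTE, applied to the full dataset before it is split.
4. **Stratified 80/20 train-test split** of the resampled data for model evaluation.
5. **Model training & evaluation**:
   - Random Forest
   - XGBoost
   - Neural Network (PyTorch)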
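
In `run_complete_pipeline`, resampling happens before the train-test split, so the held-out set also contains synthetic rows and the reported scores may be optimistic compared with unseen real customers. A common alternative is to split first and resample only the training portion. The sketch below illustrates that ordering on synthetic data from `make_classification`; it uses plain random oversampling via `sklearn.utils.resample` as a stand-in for SMOTE, and all variable names are hypothetical. With imbalanced-learn available, `SMOTE().fit_resample(X_train, y_train)` would take the place of the oversampling step.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in for the churn data (roughly 73% / 27% classes).
X, y = make_classification(n_samples=1000, n_features=11,
                           weights=[0.73, 0.27], random_state=42)

# 1. Split first, so the test set keeps the original class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 2. Oversample the minority class in the training portion only.
#    (Random oversampling here; SMOTE would be applied at this same point.)
majority = X_train[y_train == 0]
minority = X_train[y_train == 1]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)

X_train_bal = np.vstack([majority, minority_up])
y_train_bal = np.concatenate([
    np.zeros(len(majority), dtype=int),
    np.ones(len(minority_up), dtype=int),
])

print("Balanced training counts:", np.bincount(y_train_bal))
print("Untouched test counts:", np.bincount(y_test))
```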



In [159]:
# ================================================================
# 4. MAIN PIPELINE FUNCTION
# ================================================================

def run_complete_pipeline(df):
    """
    Run the complete analysis and modeling pipeline
    """

    # Step 1: Comprehensive Statistical Analysis
    target_col, important_features = comprehensive_data_analysis(df)

    # Step 2: Data Preparation
    print("\n" + "="*80)
    print("DATA PREPARATION & MODELING")
    print("="*80)

    # Remove duplicates
    df_clean = df.drop_duplicates()
    print(f"Removed {len(df) - len(df_clean)} duplicate rows")

    # Use recommended features
    recommended_features = [
        'tenure', 'MonthlyCharges', 'InternetService_Fiber optic',
        'Contract_One year', 'Contract_Two year', 'PaymentMethod_Electronic check',
        'InternetService_No', 'Partner_Yes',
        'Dependents_Yes', 'PaperlessBilling_Yes', 'PaymentMethod_Credit card (automatic)'
    ]

    # Filter features that exist in the dataset
    available_features = [f for f in recommended_features if f in df_clean.columns]
    print(f"Using {len(available_features)} features: {available_features}")

    X = df_clean[available_features]
    y = df_clean[target_col]

    # Step 3: Handle Class Imbalance
    print(f"\nOriginal class distribution:\n{y.value_counts()}")
    smote = SMOTE(random_state=42)
    X_res, y_res = smote.fit_resample(X, y)
    print(f"\nResampled class distribution:\n{pd.Series(y_res).value_counts()}")

    # Step 4: Train-Test Split
    X_train, X_test, y_train, y_test = train_test_split(
        X_res, y_res, test_size=0.2, random_state=42, stratify=y_res
    )

    # Step 5: Model Training & Evaluation
    results = {}

    # Random Forest
    print("\n" + "="*60)
    print("TRAINING MODELS")
    print("="*60)

    rf_model = RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced')
    rf_model.fit(X_train, y_train)
    results['Random Forest'] = evaluate_model(rf_model, X_test, y_test, "Random Forest")

    # XGBoost
    xgb_model = xgb.XGBClassifier(n_estimators=100, use_label_encoder=False,
                                 eval_metric='logloss', random_state=42)
    xgb_model.fit(X_train, y_train)
    results['XGBoost'] = evaluate_model(xgb_model, X_test, y_test, "XGBoost")

    # Neural Network (PyTorch)
    print(f"\n{'='*50}")
    print("TRAINING NEURAL NETWORK")
    print(f"{'='*50}")

    # Prepare data for PyTorch
    scaler = StandardScaler()
    X_train_scaled = X_train.copy()
    X_test_scaled = X_test.copy()

    # Scale numerical features
    numerical_cols = X_train.select_dtypes(include=[np.number]).columns
    if len(numerical_cols) > 0:
        X_train_scaled[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])
        X_test_scaled[numerical_cols] = scaler.transform(X_test[numerical_cols])

    # Convert to tensors
    X_train_tensor = torch.tensor(X_train_scaled.values, dtype=torch.float32)
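    # Convert the labels to a NumPy array first so a pandas Series index does not interfere
    y_train_tensor = torch.tensor(np.asarray(y_train), dtype=torch.float32).unsqueeze(1)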
    X_test_tensor = torch.tensor(X_test_scaled.values, dtype=torch.float32)

    # Create DataLoader
    train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
    train_loader = DataLoader(dataset=train_dataset, batch_size=32, shuffle=True)

    # Initialize model
    input_features = X_train.shape[1]
    nn_model = ChurnModel(input_features)
    loss_fn = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(nn_model.parameters(), lr=0.001)

    # Training loop
    epochs = 100
    for epoch in range(epochs):
        nn_model.train()
        for batch_X, batch_y in train_loader:
            optimizer.zero_grad()
            y_pred_logits = nn_model(batch_X)
            loss = loss_fn(y_pred_logits, batch_y)
            loss.backward()
            optimizer.step()

        if (epoch + 1) % 20 == 0:
            print(f'Epoch [{epoch+1}/{epochs}], Loss: {loss.item():.4f}')

    # Neural Network Evaluation
    nn_model.eval()
    with torch.no_grad():
        y_pred_logits = nn_model(X_test_tensor)
        y_pred_proba = torch.sigmoid(y_pred_logits).numpy().flatten()
        y_pred_class = (y_pred_proba > 0.5).astype(int)

    # Manual evaluation for neural network
    print(f"\n{'='*50}")
    print("EVALUATION: Neural Network (PyTorch)")
    print(f"{'='*50}")

    accuracy = accuracy_score(y_test, y_pred_class)
    roc_auc = roc_auc_score(y_test, y_pred_proba)
    pr_auc = average_precision_score(y_test, y_pred_proba)

    print(f"Accuracy: {accuracy:.4f}")
    print(f"ROC-AUC Score: {roc_auc:.4f}")
    print(f"PR-AUC Score: {pr_auc:.4f}")
    print("\nConfusion Matrix:")
    print(confusion_matrix(y_test, y_pred_class))
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred_class))

    results['Neural Network'] = {'accuracy': accuracy, 'roc_auc': roc_auc, 'pr_auc': pr_auc}

    # Step 6: Final Results Summary
    print("\n" + "="*80)
    print("FINAL RESULTS SUMMARY")
    print("="*80)

    results_df = pd.DataFrame(results).T
    print(results_df.round(4))

    best_model = results_df['roc_auc'].idxmax()
    print(f"\n🏆 Best Model: {best_model} (ROC-AUC: {results_df.loc[best_model, 'roc_auc']:.4f})")

    return results_df


# 5. Execute the Pipeline

This section runs the `run_complete_pipeline(df)` function on the loaded dataset (`df`).

- Checks if the dataset is available in the environment.  
- Executes the full pipeline, including data preparation, model training, and evaluation.  
- Prints a final summary of all model performances.  
- Provides instructions if the dataset is not found.


In [160]:
# ================================================================
# 5. EXECUTE THE PIPELINE
# ================================================================

# Run the complete pipeline
if 'df' in locals() or 'df' in globals():
    print("Starting comprehensive analysis and modeling pipeline...")
    results = run_complete_pipeline(df)

    print("\n" + "="*80)
    print("🎉 PIPELINE EXECUTION COMPLETED!")
    print("="*80)
    print("All models have been trained and evaluated.")
    print("Check the results summary above for model performance comparison.")
else:
    print("="*80)
    print("⚠️  DATASET NOT FOUND")
    print("="*80)
    print("Please ensure your DataFrame is loaded as 'df' before running this code.")
    print("Example: df = pd.read_csv('your_dataset.csv')")
    print("\nAlternatively, you can run the pipeline manually:")
    print("results = run_complete_pipeline(your_dataframe)")

Starting comprehensive analysis and modeling pipeline...
COMPREHENSIVE STATISTICAL ANALYSIS - TELCO CHURN DATASET
Dataset Shape: (7043, 31)
Memory Usage: 1.67 MB

1. DATASET OVERVIEW

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 31 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   gender                                 7043 non-null   int64  
 1   SeniorCitizen                          7043 non-null   int64  
 2   tenure                                 7043 non-null   int64  
 3   MonthlyCharges                         7043 non-null   float64
 4   TotalCharges                           7043 non-null   float64
 5   Churn                                  7043 non-null   int64  
 6   Partner_Yes                            7043 non-null   int64  
 7   Dependents_Yes                         7043 non-null   int64  
 8   PhoneServi

# Telco Churn Analysis Pipeline - Summary

**Dataset Overview**  
- 7043 rows, 31 columns (30 features plus the `Churn` target)  
- No missing values; 22 duplicate rows dropped before modeling  
- Target: `Churn` (0: 73%, 1: 27%) → significant class imbalance  

**Feature Analysis**  
- Mostly binary features + continuous (`tenure`, `MonthlyCharges`, `TotalCharges`)  
- Skewed / Outliers: `PhoneService_Yes`, `TotalCharges`, `_No internet service` features  
- High correlations: `tenure ↔ TotalCharges` (0.82), `_No internet service` features ↔ each other (1.0) → multicollinearity  

**Target-Feature Significance**  
- Most features significant (Mann-Whitney U test)  
- Top features by effect size: `tenure`, `Contract_Two year`, `InternetService_Fiber optic`, `PaymentMethod_Electronic check`  
- Class imbalance handled via SMOTE  

**Model Performance**

| Model          | Accuracy | Precision (0 / 1) | Recall (0 / 1) | F1-Score | ROC-AUC | PR-AUC |
|----------------|---------|-----------------|----------------|---------|---------|--------|
| Random Forest  | 0.8243  | 0.83 / 0.81      | 0.81 / 0.84    | 0.82    | 0.8992  | 0.8914 |
| XGBoost        | 0.8166  | 0.85 / 0.79      | 0.77 / 0.87    | 0.81    | 0.8887  | 0.8751 |
| Neural Network | 0.7933  | 0.83 / 0.76      | 0.73 / 0.85    | 0.79    | 0.8709  | 0.8497 |

🏆 **Best Model:** Random Forest (ROC-AUC: 0.8992)


## Feature Selection Explanation

The following features were chosen based on the comprehensive statistical analysis, feature importance, and model evaluation:

```python
recommended_features = [
    'tenure', 'MonthlyCharges', 'InternetService_Fiber optic',
    'Contract_One year', 'Contract_Two year', 'PaymentMethod_Electronic check',
    'InternetService_No', 'Partner_Yes',
    'Dependents_Yes', 'PaperlessBilling_Yes', 'PaymentMethod_Credit card (automatic)'
]
```

**Why these features were selected:**

- **Tenure & MonthlyCharges:** Both differ strongly between churned and retained customers; `tenure` has the largest effect size (0.89), with longer-tenured customers less likely to churn.  

- **InternetService (Fiber optic / No):** Customers with fiber-optic or no internet show distinct churn patterns.  

- **Contract type (One year / Two year):** Longer contracts reduce churn risk; `Contract_Two year` has the second-largest effect size (0.84).  

- **Payment Method (Electronic check / Credit card automatic):** Payment type correlates with churn likelihood; electronic check users tend to churn more.  

- **Partner / Dependents:** Reflect household context, which impacts customer retention.  

- **PaperlessBilling:** Associated with churn behavior, as users opting for paperless billing show higher likelihood of leaving.
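
Directional statements like these can be checked by comparing churn rates across the two values of an indicator. The sketch below does this on a hypothetical eight-row frame; in the notebook the same `groupby` would be run on `df`, and the numbers here are invented.

```python
import pandas as pd

# Hypothetical rows; the real analysis would use the notebook's `df`.
toy = pd.DataFrame({
    "PaymentMethod_Electronic check": [1, 1, 1, 1, 0, 0, 0, 0],
    "Churn":                          [1, 1, 1, 0, 0, 0, 0, 1],
})

# The mean of a 0/1 target within each group is that group's churn rate.
churn_rate = toy.groupby("PaymentMethod_Electronic check")["Churn"].mean()
print(churn_rate)
```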



# Saving Selected Features

In [165]:
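# Keep only the recommended features.
# `recommended_features` must be defined at notebook level (the list shown above);
# inside run_complete_pipeline it exists only as a local variable.
# Note: the target column `Churn` is not part of this selection.
df_selected = df[recommended_features]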

# Save to CSV
df_selected.to_csv('/content/sample_data/selected-features.csv', index=False)

print("Selected features saved with data to 'selected-features.csv'")

Selected features saved with data to 'selected-features.csv'
