# Telecom Churn — End-to-End Analysis and Modeling

**Contents:**

1. EDA & Initial Cleaning
2. Feature Engineering
3. Preprocessing & Pipeline (ColumnTransformer)
4. Train multiple models (LogisticRegression, RandomForest, GradientBoosting, XGBoost)
5. Hyperparameter tuning with GridSearchCV
6. Handling class imbalance (SMOTE or class_weight)
7. Final evaluation and interpretation

**Instructions:** Run the notebook cells sequentially. The dataset `telecom_churn.csv` is expected at `/mnt/data/telecom_churn.csv`.


In [None]:
# Cell: imports and data load
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import ticker
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, classification_report, confusion_matrix, roc_curve, auc
import warnings
warnings.filterwarnings('ignore')

DATA_PATH = '/mnt/data/telecom_churn.csv'
print('Data exists?:', os.path.exists(DATA_PATH))
df = pd.read_csv(DATA_PATH)
print('\nDataset shape:', df.shape)
print('\nColumns:\n', df.columns.tolist())
df.head()

In [None]:
# Cell: Initial cleaning - TotalCharges conversion and missing values overview
if 'TotalCharges' in df.columns:
    print('Original dtype of TotalCharges:', df['TotalCharges'].dtype)

    # Convert TotalCharges to numeric; blanks and other non-numeric strings become NaN
    total_clean = pd.to_numeric(df['TotalCharges'].astype(str).str.strip(), errors='coerce')

    # Show rows where conversion failed although the original value was not missing
    failed_conv = df[total_clean.isna() & df['TotalCharges'].notna()]
    print('Number of rows where TotalCharges could not be converted to numeric (and original not NA):', len(failed_conv))
    if len(failed_conv) > 0:
        display(failed_conv.head(10))

    # Replace the original TotalCharges with the cleaned numeric values
    df['TotalCharges'] = total_clean
else:
    print('TotalCharges column not found; skipping conversion.')

# Missing values overview
missing = df.isna().sum().sort_values(ascending=False)
missing = missing[missing > 0]
print('\nColumns with missing values and counts:')
print(missing)

# Percentage missing
missing_pct = (df.isna().sum()/len(df)).sort_values(ascending=False)
missing_pct = missing_pct[missing_pct > 0]
print('\nPercentage missing:')
print((missing_pct*100).round(2))

In [None]:
# Cell: EDA Visualizations
# 1) Distribution of a numerical feature (TotalCharges) for churn vs non-churn
if 'Churn' in df.columns and 'TotalCharges' in df.columns:
    plt.figure(figsize=(8,5))
    churn_yes = df[df['Churn'] == 'Yes']['TotalCharges'].dropna()
    churn_no = df[df['Churn'] == 'No']['TotalCharges'].dropna()
    plt.hist(churn_no, bins=50, alpha=0.6, label='No Churn')
    plt.hist(churn_yes, bins=50, alpha=0.6, label='Churn')
    plt.title('Distribution of TotalCharges: Churn vs No Churn')
    plt.xlabel('TotalCharges')
    plt.ylabel('Count')
    plt.legend()
    plt.tight_layout()
    plt.show()
else:
    print('Churn or TotalCharges column not found for this plot.')

# 2) Relationship between a categorical feature and churn (e.g., PaymentMethod)
cat_col = None
for c in ['PaymentMethod', 'PaymentMethod_Credit card', 'PaymentMethod_Electronic check', 'MultipleLines']:
    if c in df.columns:
        cat_col = c
        break

if cat_col:
    ct = pd.crosstab(df[cat_col].fillna('Missing'), df['Churn'])
    ct_pct = ct.div(ct.sum(axis=1), axis=0)
    print('\nCounts by {} and Churn:\n'.format(cat_col))
    display(ct)
    print('\nPercentage by {} and Churn (row-wise):\n'.format(cat_col))
    display((ct_pct*100).round(2))
    ct.plot(kind='bar', stacked=False, figsize=(10,5))
    plt.title(f'Counts by {cat_col} and Churn')
    plt.tight_layout()
    plt.show()
else:
    print('No known categorical column (PaymentMethod/MultipleLines) found for this plot.')

# 3) Correlation heatmap for numeric features (simple)
num_df = df.select_dtypes(include=['int64','float64'])
if num_df.shape[1] > 1:
    corr = num_df.corr()
    plt.figure(figsize=(10,8))
    plt.imshow(corr, interpolation='none')
    plt.colorbar()
    plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
    plt.yticks(range(len(corr.columns)), corr.columns)
    plt.title('Numeric feature correlation matrix (visual)')
    plt.tight_layout()
    plt.show()

## Feature Engineering

Create at least two meaningful features from existing data. The two candidates implemented below, with the reasoning behind each:

1. **AvgMonthlyCharge** = TotalCharges / tenure, computed where tenure is nonzero. This is the average revenue per month over the customer's lifetime, which helps separate customers who paid a lot early from long-term low spenders.
2. **MultiService**: a binary flag for customers with more than one active service, derived from a count of "Yes" values across service columns such as PhoneService, MultipleLines, and TechSupport. Customers with several services often behave differently with respect to churn, for example by being stickier.

(If the specific service columns are absent from the dataset, the notebook falls back to any column whose name contains "service".)
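
As a quick sanity check of the two definitions, the sketch below applies them to a small made-up table. The column names mirror those used in this notebook, but the values are invented for illustration, and the vectorized `yes_flags` step is just one possible way to count services.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are invented for illustration only
toy = pd.DataFrame({
    'tenure': [10, 0, 24],
    'TotalCharges': [500.0, 20.0, 1200.0],
    'PhoneService': ['Yes', 'Yes', 'No'],
    'MultipleLines': ['Yes', 'No', 'No'],
    'TechSupport': ['No', 'No', 'Yes'],
})

# AvgMonthlyCharge: divide by tenure, treating zero tenure as missing,
# then fall back to TotalCharges for those rows
avg = toy['TotalCharges'] / toy['tenure'].replace(0, np.nan)
toy['AvgMonthlyCharge'] = avg.fillna(toy['TotalCharges'])

# MultiService: count 'Yes' flags across the service columns, then threshold
service_cols = ['PhoneService', 'MultipleLines', 'TechSupport']
yes_flags = toy[service_cols].apply(
    lambda s: s.astype(str).str.strip().str.lower().eq('yes')
)
toy['MultiServiceCount'] = yes_flags.sum(axis=1)
toy['MultiService'] = (toy['MultiServiceCount'] > 1).astype(int)

print(toy[['AvgMonthlyCharge', 'MultiServiceCount', 'MultiService']])
```

Only the first toy customer has more than one service, so only that row gets `MultiService = 1`. The zero-tenure row keeps its `TotalCharges` value as the monthly figure.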

In [None]:
# Cell: Feature engineering - create two new features
df_fe = df.copy()
# AvgMonthlyCharge: guard against division by zero or missing Tenure
if 'TotalCharges' in df_fe.columns and 'tenure' in df_fe.columns:
    df_fe['AvgMonthlyCharge'] = df_fe['TotalCharges'] / (df_fe['tenure'].replace(0, np.nan))
    # if tenure is zero or NaN, fill AvgMonthlyCharge with TotalCharges (one-off)
    df_fe['AvgMonthlyCharge'] = df_fe['AvgMonthlyCharge'].fillna(df_fe['TotalCharges'])
else:
    # fallback: if 'MonthlyCharges' exists, use it directly
    if 'MonthlyCharges' in df_fe.columns:
        df_fe['AvgMonthlyCharge'] = df_fe['MonthlyCharges']
    else:
        df_fe['AvgMonthlyCharge'] = np.nan

# MultiService: check some common service columns
service_cols = [c for c in df_fe.columns if c.lower() in ['phoneservice','multiplelines','internetservice','onlinebackup','techsupport','streamingtv','streamingmovies']]
# also include columns that contain 'Service' in name
if not service_cols:
    service_cols = [c for c in df_fe.columns if 'Service' in c or 'service' in c]

if service_cols:
    # Define MultiService as count of 'Yes' across known service indicator columns
    def count_yes(row):
        cnt = 0
        for c in service_cols:
            val = str(row.get(c)).strip().lower()
            if val in ['yes','true','1']:
                cnt += 1
        return cnt
    df_fe['MultiServiceCount'] = df_fe.apply(count_yes, axis=1)
    df_fe['MultiService'] = (df_fe['MultiServiceCount'] > 1).astype(int)
else:
    df_fe['MultiServiceCount'] = 0
    df_fe['MultiService'] = 0

print('Feature engineering done. New columns added:', [c for c in ['AvgMonthlyCharge','MultiServiceCount','MultiService'] if c in df_fe.columns])
df_fe[['AvgMonthlyCharge','MultiServiceCount','MultiService']].head()

## Preprocessing & Pipeline

Rules implemented:
- Numeric features: median imputation + StandardScaler
- Categorical features: most frequent imputation + OneHotEncoder (handle_unknown='ignore')
- Boolean features: impute constant 'Unknown' then treat as categorical

We drop irrelevant columns like customerID if present.

In [None]:
# Cell: Build ColumnTransformer preprocessor and a sample pipeline with LogisticRegression
from sklearn.pipeline import Pipeline

# Prepare feature list: pick columns for modeling using provided top features if available
provided_top = [
    "IsHeavyDataUser",
    "PaymentMethod_Credit card",
    "PaymentMethod_Bank transfer",
    "IsHighValueCustomer",
    "MultipleLines",
    "SeniorCitizen",
    "PaymentMethod_Electronic check",
    "PhoneService",
    "RevenuePerGB",
    "TotalCharges"
]

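# Use available columns intersecting with provided_top, else auto-select
features = [c for c in provided_top if c in df_fe.columns]
if not features:
    # use all except target and ID-like columns
    exclude = ['Churn','customerID','customerId','CustomerID']
    features = [c for c in df_fe.columns if c not in exclude]

# Make sure the engineered features from the previous cell are part of the model input
for c in ['AvgMonthlyCharge', 'MultiServiceCount', 'MultiService']:
    if c in df_fe.columns and c not in features:
        features.append(c)

print('Features used for modeling (sample):', features[:20])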

X = df_fe[features].copy()
# Build target y
if 'Churn' in df_fe.columns:
    y = df_fe['Churn'].apply(lambda v: 1 if str(v).strip().lower() in ['yes','true','1'] else 0)
else:
    raise ValueError('Churn column not found in dataset')

# Identify dtypes
# 'number' covers every integer and float width (not only int64/float64) and excludes bool
numeric_features = X.select_dtypes(include='number').columns.tolist()
categorical_features = X.select_dtypes(include=['object','category']).columns.tolist()
boolean_features = X.select_dtypes(include=['bool']).columns.tolist()

# Treat boolean as categorical by converting to object (after imputation step in pipeline)
print('Numeric features:', numeric_features)
print('Categorical features:', categorical_features)
print('Boolean features:', boolean_features)

# Imputers and transformers
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Note: `sparse_output` replaces the `sparse` argument, which was deprecated in
# scikit-learn 1.2 and later removed; on older versions use `sparse=False` instead.
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# For booleans: impute constant 'Unknown' then onehot encode
boolean_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='Unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features),
    ('bool', boolean_transformer, boolean_features)
])

# Build a pipeline with a placeholder classifier (LogisticRegression)
clf_lr = Pipeline(steps=[('preprocessor', preprocessor), ('clf', LogisticRegression(max_iter=1000))])

# Quick train-test split and fit
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print('Training LogisticRegression pipeline...')
clf_lr.fit(X_train, y_train)
print('Done. Sample predictions:')
print(clf_lr.predict(X_test)[:10])

# Evaluate basic metrics
y_pred = clf_lr.predict(X_test)
print('\nLogistic Regression basic evaluation:')
print('Precision:', precision_score(y_test, y_pred))
print('Recall   :', recall_score(y_test, y_pred))
print('F1       :', f1_score(y_test, y_pred))

## Model Selection: Logistic Regression, Random Forest, Gradient Boosting, XGBoost

Justification:
- Logistic Regression: interpretable baseline.
- Random Forest: robust, handles non-linearities and interactions.
- Gradient Boosting (sklearn): strong predictive performance.
- XGBoost: powerful gradient boosting implementation (if available).

In [None]:
# Cell: Train multiple models and compare metrics
models = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'RandomForest': RandomForestClassifier(n_estimators=200, random_state=42),
    'GradientBoosting': GradientBoostingClassifier(n_estimators=200, random_state=42)
}

# Try to add XGBoost if available
try:
    from xgboost import XGBClassifier
    # use_label_encoder is omitted: it is deprecated in recent XGBoost releases and y is already 0/1
    models['XGBoost'] = XGBClassifier(n_estimators=200, eval_metric='logloss', random_state=42)
except Exception as e:
    print('XGBoost not available or import failed:', e)

results = []
for name, model in models.items():
    pipe = Pipeline([('preprocessor', preprocessor), ('clf', model)])
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    res = {
        'model': name,
        'precision': precision_score(y_test, y_pred),
        'recall': recall_score(y_test, y_pred),
        'f1': f1_score(y_test, y_pred),
        'roc_auc': roc_auc_score(y_test, pipe.predict_proba(X_test)[:,1]) if hasattr(pipe, 'predict_proba') else float('nan')
    }
    results.append(res)

results_df = pd.DataFrame(results)
display(results_df)

## Hyperparameter Tuning (GridSearchCV)

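We run GridSearchCV on the entire pipeline (including the preprocessor), so imputation, scaling, and encoding are refit on the training folds of each split and no information leaks from the validation fold. RandomForest hyperparameters are tuned as an example. The cell below scores with `f1_weighted`; because that metric weights each class's F1 by its support, it is dominated by the majority (non-churn) class, so `roc_auc` or the churn-class `f1` are reasonable alternatives when the minority class matters most.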
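
To make the difference between the scoring options concrete, the sketch below computes the churn-class F1 and the support-weighted F1 (what scikit-learn's `f1_weighted` reports) from made-up confusion counts. The counts are hypothetical and the helper `f1_from_counts` is illustrative, not part of the notebook.

```python
def f1_from_counts(tp, fp, fn):
    """F1 for one class from its true positives, false positives, false negatives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical confusion counts for 1000 customers, 200 of whom churn
tn, fp, fn, tp = 760, 40, 120, 80

f1_churn = f1_from_counts(tp, fp, fn)   # churn treated as the positive class
f1_stay = f1_from_counts(tn, fn, fp)    # non-churn treated as the positive class
support_churn, support_stay = tp + fn, tn + fp

# Weighted F1: per-class F1 scores averaged with class support as weights
f1_weighted = (f1_churn * support_churn + f1_stay * support_stay) / (support_churn + support_stay)

print(round(f1_churn, 3), round(f1_stay, 3), round(f1_weighted, 3))
# prints: 0.5 0.905 0.824
```

With these counts the model misses 120 of 200 churners, yet the weighted score still looks healthy because the non-churn class carries 80% of the weight.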

In [None]:
# Cell: GridSearchCV on pipeline (RandomForest)
from sklearn.model_selection import GridSearchCV

pipe_rf = Pipeline([('preprocessor', preprocessor), ('clf', RandomForestClassifier(random_state=42))])

param_grid = {
    'clf__n_estimators': [100, 200],
    'clf__max_depth': [None, 10, 20],
    'clf__class_weight': [None, 'balanced']
}

grid = GridSearchCV(pipe_rf, param_grid, cv=3, scoring='f1_weighted', n_jobs=-1, verbose=1)
print('Running GridSearchCV (this may take some time)...')
grid.fit(X_train, y_train)
print('Best params:', grid.best_params_)
print('Best score (f1_weighted):', grid.best_score_)

best_rf = grid.best_estimator_

## Handling Class Imbalance

We add a resampling step (SMOTE) inside the pipeline using `imblearn`'s `SMOTE`, so that oversampling is applied only when fitting and never to the test data. If imblearn is not installed, we fall back to `class_weight='balanced'` in the classifier.

We then compare the chosen strategy against a baseline with no imbalance handling.
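
For intuition about the `class_weight='balanced'` fallback, the sketch below reproduces the weighting formula documented by scikit-learn, `n_samples / (n_classes * count_of_class)`, in plain Python. The helper name `balanced_class_weights` and the 80/20 target are hypothetical. SMOTE works differently: it adds synthetic minority samples instead of reweighting.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Per-class weight: n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical target with 80% non-churn (0) and 20% churn (1)
y_toy = [0] * 800 + [1] * 200
weights = balanced_class_weights(y_toy)
print(weights)  # prints: {0: 0.625, 1: 2.5}
```

Each churn example therefore counts four times as much as a non-churn example, which matches the inverse of the 1:4 class ratio.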

In [None]:
# Cell: Compare imbalance strategies
use_smote = False
try:
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline as ImbPipeline
    use_smote = True
    print('imblearn available: SMOTE will be used in resampling pipeline')
except Exception as e:
    print('imblearn not available:', e)

if use_smote:
    # Build imbalanced pipeline with SMOTE
    imb_pipe = ImbPipeline(steps=[('preprocessor', preprocessor), ('smote', SMOTE(random_state=42)), ('clf', RandomForestClassifier(n_estimators=200, random_state=42))])
    imb_pipe.fit(X_train, y_train)
    y_pred_smote = imb_pipe.predict(X_test)
    print('\nWith SMOTE - RandomForest metrics:')
    print('Precision:', precision_score(y_test, y_pred_smote))
    print('Recall   :', recall_score(y_test, y_pred_smote))
    print('F1       :', f1_score(y_test, y_pred_smote))
else:
    # Fallback: use class_weight='balanced' in classifier
    pipe_cw = Pipeline([('preprocessor', preprocessor), ('clf', RandomForestClassifier(n_estimators=200, class_weight='balanced', random_state=42))])
    pipe_cw.fit(X_train, y_train)
    y_pred_cw = pipe_cw.predict(X_test)
    print('\nWith class_weight=balanced - RandomForest metrics:')
    print('Precision:', precision_score(y_test, y_pred_cw))
    print('Recall   :', recall_score(y_test, y_pred_cw))
    print('F1       :', f1_score(y_test, y_pred_cw))

# Also print baseline (no imbalance handling) for comparison
pipe_nom = Pipeline([('preprocessor', preprocessor), ('clf', RandomForestClassifier(n_estimators=200, random_state=42))])
pipe_nom.fit(X_train, y_train)
y_pred_nom = pipe_nom.predict(X_test)
print('\nNo imbalance handling - RandomForest metrics:')
print('Precision:', precision_score(y_test, y_pred_nom))
print('Recall   :', recall_score(y_test, y_pred_nom))
print('F1       :', f1_score(y_test, y_pred_nom))

In [None]:
# Cell: Final evaluation
# Choose final_model in order of preference: best_rf (from grid) -> imb_pipe (SMOTE) -> pipe_cw
final_model = None
try:
    final_model = best_rf
    print('Using best_rf from GridSearchCV as final model.')
except Exception:
    if 'use_smote' in globals() and use_smote:
        final_model = imb_pipe
        print('Using SMOTE pipeline as final model.')
    else:
        final_model = pipe_cw
        print('Using class_weight-balanced pipeline as final model.')

# Evaluate
y_pred_final = final_model.predict(X_test)
if hasattr(final_model, 'predict_proba'):
    y_proba = final_model.predict_proba(X_test)[:,1]
else:
    y_proba = None

print('\nFinal model metrics on test set:')
print('Precision:', precision_score(y_test, y_pred_final))
print('Recall   :', recall_score(y_test, y_pred_final))
print('F1       :', f1_score(y_test, y_pred_final))
if y_proba is not None:
    print('ROC AUC  :', roc_auc_score(y_test, y_proba))

print('\nClassification Report:\n')
print(classification_report(y_test, y_pred_final))

cm = confusion_matrix(y_test, y_pred_final)
print('\nConfusion Matrix:\n', cm)

# Visualize top 10 feature importances if tree-based
try:
    clf_step = final_model.named_steps['clf'] if isinstance(final_model, Pipeline) else final_model
    importances = None
    if hasattr(clf_step, 'feature_importances_'):
        importances = clf_step.feature_importances_
    if importances is not None:
        # Attempt to reconstruct feature names
        def get_feature_names_from_column_transformer(ct, input_features):
            output_features = []
            # Use the fitted transformers_ so that the encoder's categories_ are available
            for name, trans, cols in ct.transformers_:
                if name == 'remainder':
                    continue
                if hasattr(trans, 'named_steps') and 'onehot' in trans.named_steps:
                    ohe = trans.named_steps['onehot']
                    try:
                        cats = ohe.categories_
                        for i, c in enumerate(cols):
                            for cat in cats[i]:
                                output_features.append(f"{c}__{cat}")
                    except Exception:
                        output_features.extend(cols)
                else:
                    output_features.extend(cols)
            return output_features

        feature_names = get_feature_names_from_column_transformer(final_model.named_steps['preprocessor'], X.columns)
        feat_imp = pd.Series(importances, index=feature_names).sort_values(ascending=False).head(10)
        print('\nTop 10 feature importances:')
        display(feat_imp)
        plt.figure(figsize=(8,5))
        plt.barh(feat_imp.index[::-1], feat_imp.values[::-1])
        plt.title('Top 10 feature importances')
        plt.tight_layout()
        plt.show()
except Exception as e:
    print('Could not extract feature importances:', e)

# Business interpretation
print('\nBusiness interpretation of confusion matrix:')
print('Rows = Actual [0=no churn, 1=churn], Columns = Predicted [0=no churn, 1=churn]')
print('cm =', cm)
print('\n- False Positive (predict churn when customer stays): costs include unnecessary retention offers or resource allocation.')
print('- False Negative (predict non-churn when customer actually churns): costs include lost revenue and missed retention opportunity.')
print('\nTypically, False Negatives are more costly for churn tasks because you fail to retain a customer who will leave. Therefore recall for the churn class is often prioritized.')

----

### Notes & Next Steps

- The notebook is ready to run. Depending on package availability (xgboost, imblearn), certain sections may need those packages installed. If a package is missing, you can either install it (`pip install xgboost imbalanced-learn`) or skip the optional parts.
- You can extend the hyperparameter grid, add cross-validation strategies, or calibrate probabilities for better business decisions.
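
One way to connect predicted probabilities to business decisions is to pick the decision threshold that minimizes the total cost of errors. The sketch below does this for a handful of made-up predictions. The labels, probabilities, and the two cost figures (`COST_FP`, `COST_FN`) are assumptions for illustration only, and `total_cost` is a hypothetical helper.

```python
def total_cost(y_true, y_proba, threshold, cost_fp, cost_fn):
    """Sum the cost of false positives and false negatives at a given threshold."""
    cost = 0.0
    for actual, p in zip(y_true, y_proba):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and actual == 0:
            cost += cost_fp   # retention offer sent to a customer who would stay
        elif predicted == 0 and actual == 1:
            cost += cost_fn   # churner missed entirely
    return cost

# Hypothetical labels, predicted churn probabilities, and costs
y_true_toy = [0, 0, 0, 0, 1, 1, 0, 1]
y_proba_toy = [0.10, 0.20, 0.35, 0.45, 0.40, 0.55, 0.60, 0.90]
COST_FP = 10.0    # assumed cost of an unnecessary retention offer
COST_FN = 100.0   # assumed revenue lost when a churner is missed

thresholds = [0.3, 0.5, 0.7]
costs = {t: total_cost(y_true_toy, y_proba_toy, t, COST_FP, COST_FN) for t in thresholds}
best_threshold = min(costs, key=costs.get)
print(costs, best_threshold)
# prints: {0.3: 30.0, 0.5: 110.0, 0.7: 200.0} 0.3
```

Because a missed churner is assumed to cost ten times more than a wasted offer, the lowest threshold wins here. With real data, `y_proba` from the final model and cost estimates from the business would replace the toy values.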

**Generated by ChatGPT — file saved to `/mnt/data/telecom_churn_analysis.ipynb`.**