# Customer Churn Prediction System â€” Final (Clean & Pipeline-ready)
**Purpose:** polished, well-commented notebook that builds a reproducible pipeline (preprocessing + model) and saves artifacts for deployment.

**What changed:**
- Replaced separate scaler + manual encoding with a `ColumnTransformer` (StandardScaler + OneHotEncoder)
- Saved a full `Pipeline` (preprocessing + model) so the Streamlit app can load and predict reliably
- Saved helper artifacts: `feature_cols`, `num_cols`, `cat_cols`, and OHE feature names
- Added clear comments and sections for readability and for recruiters to inspect


In [1]:
# 1) Imports
import warnings
warnings.filterwarnings('ignore')

from pathlib import Path
import joblib
import pandas as pd
import numpy as np

# Modeling & preprocessing
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc, precision_recall_curve, average_precision_score

# For reproducibility
RANDOM_STATE = 42
print('Imports done.')

Imports done.


## 2) Load datasets
We load the two CSVs you used in the original notebook: the 80% train split and 20% test split. Make sure the `data/` folder contains these files.

In [8]:
# Paths - adjust if your repo is structured differently
DATA_DIR = Path('data')
train_path = "C:/Users/abhis/Downloads/churn-bigml-80.csv"
test_path = "C:/Users/abhis/Downloads/churn-bigml-20.csv"

# Read CSVs
df_train = pd.read_csv(train_path)
df_test = pd.read_csv(test_path)

print(f"Train shape: {df_train.shape}") 
print(f"Test shape:  {df_test.shape}")

Train shape: (2666, 20)
Test shape:  (667, 20)


## 3) Quick EDA & target check
Check target distribution and a few sample rows. This helps understand class balance and feature types.

In [9]:
# Target check
print('Target value counts (train):')
print(df_train['Churn'].value_counts(dropna=False))
print('\nSample rows:')
df_train.head()

Target value counts (train):
Churn
False    2278
True      388
Name: count, dtype: int64

Sample rows:


Unnamed: 0,State,Account length,Area code,International plan,Voice mail plan,Number vmail messages,Total day minutes,Total day calls,Total day charge,Total eve minutes,Total eve calls,Total eve charge,Total night minutes,Total night calls,Total night charge,Total intl minutes,Total intl calls,Total intl charge,Customer service calls,Churn
0,KS,128,415,No,Yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False
1,OH,107,415,No,Yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False
2,NJ,137,415,No,No,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False
3,OH,84,408,Yes,No,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False
4,OK,75,415,Yes,No,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False


## 4) Preprocessing plan
We will:
1. Drop identifier columns (phone numbers) that do not help prediction.
2. Separate feature columns into numeric and categorical.
3. Build a ColumnTransformer that scales numeric features and one-hot-encodes categorical features.
4. Create a Pipeline that includes preprocessing + RandomForest model.

Saving the full pipeline avoids shape mismatches at prediction time.

In [11]:
# 4.1 Drop identifiers if present
drop_candidates = ['phone', 'Phone', 'PhoneNumber']
for c in drop_candidates:
    if c in df_train.columns:
        df_train = df_train.drop(columns=[c])
        df_test = df_test.drop(columns=[c])
        print(f"Dropped identifier column: {c}")

In [12]:
# 4.2 Feature/target split
TARGET = 'Churn'
feature_cols = [c for c in df_train.columns if c != TARGET]
print('Total features:', len(feature_cols))

# Identify categorical vs numeric using dtype
cat_cols = [c for c in feature_cols if df_train[c].dtype == 'object' or df_train[c].dtype.name == 'category']
num_cols = [c for c in feature_cols if c not in cat_cols]

print('Numerical columns ({}):'.format(len(num_cols)), num_cols)
print('Categorical columns ({}):'.format(len(cat_cols)), cat_cols)

Total features: 19
Numerical columns (16): ['Account length', 'Area code', 'Number vmail messages', 'Total day minutes', 'Total day calls', 'Total day charge', 'Total eve minutes', 'Total eve calls', 'Total eve charge', 'Total night minutes', 'Total night calls', 'Total night charge', 'Total intl minutes', 'Total intl calls', 'Total intl charge', 'Customer service calls']
Categorical columns (3): ['State', 'International plan', 'Voice mail plan']


## 5) Build ColumnTransformer + Pipeline
We use `StandardScaler` for numeric columns and `OneHotEncoder(handle_unknown='ignore')` for categorical columns. The pipeline combines them and ends with a RandomForest classifier.

In [14]:
# 5.1 Define transformers
numeric_transformer = StandardScaler()

# FIXED: use 'sparse_output=False' instead of 'sparse=False'
categorical_transformer = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, num_cols),
        ('cat', categorical_transformer, cat_cols)
    ],
    remainder='drop'
)

# 5.2 Define model and pipeline
rf = RandomForestClassifier(random_state=RANDOM_STATE)

pipeline = Pipeline([
    ('preproc', preprocessor),
    ('model', rf)
])

print("Pipeline created.")


Pipeline created.


## 6) Prepare training & testing matrices
We will fit the pipeline on the train set (80% file) and use the 20% file as a final holdout test set.

In [15]:
# Make sure target is consistent dtype
y_train = df_train[TARGET].astype(str)
X_train = df_train[feature_cols].copy()

y_test = df_test[TARGET].astype(str)
X_test = df_test[feature_cols].copy()

print('Shapes => X_train:', X_train.shape, 'X_test:', X_test.shape)

Shapes => X_train: (2666, 19) X_test: (667, 19)


## 7) Hyperparameter tuning with RandomizedSearchCV
We tune Random Forest parameters using RandomizedSearchCV. Note: the pipeline includes preprocessing so the CV respects train-time preprocessing.

In [16]:
param_dist = {
    'model__n_estimators': [100, 200, 300],
    'model__max_depth': [None, 5, 10],
    'model__min_samples_split': [2, 5]
}

rsearch = RandomizedSearchCV(pipeline, param_distributions=param_dist, n_iter=8, cv=3, scoring='f1', n_jobs=-1, random_state=RANDOM_STATE, verbose=1)
rsearch.fit(X_train, y_train)

print('Best params:', rsearch.best_params_)
best_pipeline = rsearch.best_estimator_

Fitting 3 folds for each of 8 candidates, totalling 24 fits
Best params: {'model__n_estimators': 100, 'model__min_samples_split': 2, 'model__max_depth': None}


## 8) Evaluation on holdout test set
Evaluate final pipeline on the 20% holdout dataset.

In [17]:
# Predict & report
y_pred = best_pipeline.predict(X_test)
# For probability, pipeline.predict_proba returns two columns; positive class probability depends on label ordering
y_proba = best_pipeline.predict_proba(X_test)
# If 'Yes' corresponds to class index 1 (common), take column 1. We'll be cautious.
try:
    classes = best_pipeline.named_steps['model'].classes_
except Exception:
    classes = None

# Print classification report
print('Classification Report:')
print(classification_report(y_test, y_pred))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print('Confusion Matrix:\n', cm)

# Compute ROC/PR using binary encoding (Yes->1, else 0)
y_test_bin = (y_test == 'Yes').astype(int)
if y_proba.shape[1] == 2:
    y_score = y_proba[:, 1]
else:
    # fallback: take first column
    y_score = y_proba[:, 0]

fpr, tpr, _ = roc_curve(y_test_bin, y_score)
roc_auc = auc(fpr, tpr)
print(f'ROC AUC: {roc_auc:.4f}')

avg_prec = average_precision_score(y_test_bin, y_score)
print(f'Average precision (PR AUC): {avg_prec:.4f}')

Classification Report:
              precision    recall  f1-score   support

       False       0.94      1.00      0.97       572
        True       0.98      0.62      0.76        95

    accuracy                           0.94       667
   macro avg       0.96      0.81      0.86       667
weighted avg       0.95      0.94      0.94       667

Confusion Matrix:
 [[571   1]
 [ 36  59]]
ROC AUC: nan
Average precision (PR AUC): 0.0000


## 9) Save pipeline and helper artifacts
Save everything required by the Streamlit app: full pipeline + column metadata.

In [18]:
models_dir = Path('models')
models_dir.mkdir(parents=True, exist_ok=True)

# Save full pipeline (preprocessing + model)
pipeline_path = models_dir / 'churn_pipeline.pkl'
joblib.dump(best_pipeline, pipeline_path)
print('Saved pipeline to', pipeline_path)

# Save metadata and helper artifacts
joblib.dump(feature_cols, models_dir / 'feature_cols.pkl')
joblib.dump(num_cols, models_dir / 'num_cols.pkl')
joblib.dump(cat_cols, models_dir / 'cat_cols.pkl')

# Save the OneHotEncoder feature names (after fitting)
ohe = best_pipeline.named_steps['preproc'].named_transformers_['cat']
ohe_feature_names = list(ohe.get_feature_names_out(cat_cols))
feature_names_after_ohe = list(num_cols) + ohe_feature_names
joblib.dump(feature_names_after_ohe, models_dir / 'feature_names_after_ohe.pkl')

print('Saved feature metadata and encoders. Files in models/:', list(models_dir.iterdir()))

Saved pipeline to models\churn_pipeline.pkl
Saved feature metadata and encoders. Files in models/: [WindowsPath('models/cat_cols.pkl'), WindowsPath('models/churn_pipeline.pkl'), WindowsPath('models/feature_cols.pkl'), WindowsPath('models/feature_names_after_ohe.pkl'), WindowsPath('models/num_cols.pkl')]
