# Customer Churn — Feature Scaling & Pipelines

## Objective
Apply scaling correctly using train-only fitting and pipelines.

This notebook is part of an end-to-end customer churn classification project.
All preprocessing, modeling, and evaluation steps are designed to be:
- Leakage-safe
- Reproducible
- Interview-defensible


## Customer Churn Prediction — Feature Scaling & Model Improvement

In this notebook, we improve baseline churn models by applying feature scaling.
We analyze how scaling affects distance-based and linear models and compare
performance before and after scaling.


### Why Feature Scaling Matters

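Many machine learning models are sensitive to feature magnitude.
Distance-based models such as kNN compute distances over the raw values,
so a feature with a large numeric range (for example, total charges in
dollars) can outweigh a feature with a small range (for example, tenure in
months). Regularized linear models such as Logistic Regression are also
affected, because the penalty and the solver treat all coefficients on the
same scale.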
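
To make the effect concrete, here is a minimal sketch using made-up numbers. The customers, the feature names (`tenure`, `total charges`), and the small reference population are all hypothetical and are not taken from the churn dataset. It compares Euclidean distances before and after standardization.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical reference population: [tenure in months, total charges in dollars].
population = np.array([
    [1.0, 100.0],
    [12.0, 900.0],
    [24.0, 2000.0],
    [36.0, 3500.0],
    [60.0, 6000.0],
    [72.0, 8000.0],
])

# Three hypothetical customers to compare.
a = np.array([[2.0, 2000.0]])   # new customer
b = np.array([[60.0, 2010.0]])  # very different tenure, almost the same charges
c = np.array([[3.0, 2100.0]])   # almost the same tenure, slightly higher charges


def euclidean(u, v):
    """Euclidean distance between two 1-row arrays."""
    return float(np.linalg.norm(u - v))


# Raw distances: the dollar-valued feature dominates.
raw_ab = euclidean(a, b)
raw_ac = euclidean(a, c)

# Standardize with statistics learned from the reference population.
scaler = StandardScaler().fit(population)
scaled_ab = euclidean(scaler.transform(a), scaler.transform(b))
scaled_ac = euclidean(scaler.transform(a), scaler.transform(c))

print(f"raw:    d(A,B)={raw_ab:.2f}  d(A,C)={raw_ac:.2f}")
print(f"scaled: d(A,B)={scaled_ab:.3f}  d(A,C)={scaled_ac:.3f}")
```

On the raw values, B looks closer to A than C does, even though B's tenure is very different. After standardization the ordering reverses, because a 58-month tenure gap is large relative to the spread of tenure, while a 100-dollar gap is small relative to the spread of charges.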


## Importing Libraries

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

## Reading Data and Separating Features X from Target y

In [3]:
df = pd.read_csv("../data/churn_preprocessed.csv")

# Target is the 'Churn' column; features are all remaining columns.
y = df['Churn']
X = df.drop(columns=['Churn'])

## Train/Test Splitting

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=True)

## Applying Feature Scaling

In [8]:
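scaler = StandardScaler()
# Learn the mean and standard deviation from the training set only.
X_train_scaled = scaler.fit_transform(X_train)
# Reuse the training statistics on the test set; never refit on test data.
X_test_scaled = scaler.transform(X_test)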
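
The train-only fitting above can be checked on a small example. This is a sketch on synthetic data (the array `X_demo` and all `*_demo` names are assumptions made for illustration, not part of the churn dataset). It shows that the scaler's statistics come from the training rows alone, so the transformed test set is close to, but not exactly, zero-mean.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a numeric feature matrix (hypothetical data).
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

X_train_demo, X_test_demo = train_test_split(X_demo, test_size=0.2, random_state=42)

# Fit on the training rows only, then apply the same transform to both sets.
scaler_demo = StandardScaler().fit(X_train_demo)
X_train_demo_scaled = scaler_demo.transform(X_train_demo)
X_test_demo_scaled = scaler_demo.transform(X_test_demo)

# The learned means are the training means, not the full-data means.
print("learned means:   ", scaler_demo.mean_)
print("train-set means: ", X_train_demo.mean(axis=0))
print("full-data means: ", X_demo.mean(axis=0))

# Training features are exactly standardized; test features are only approximately so.
print("scaled train mean:", X_train_demo_scaled.mean(axis=0))
print("scaled test mean: ", X_test_demo_scaled.mean(axis=0))
```

Fitting the scaler on the full dataset instead would let test-set statistics influence the training features, which is the leakage this notebook is designed to avoid.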

## Retrain Logistic Regression (Scaled Data)

In [9]:
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train_scaled, y_train)
y_pred = logreg.predict(X_test_scaled)

## Evaluate Scaled Logistic Regression

In [14]:
logreg_accuracy = accuracy_score(y_test, y_pred)
logreg_cm = confusion_matrix(y_test, y_pred)
logreg_cr = classification_report(y_test, y_pred)


print(f"Accuracy: {logreg_accuracy}")
print(f"Confusion Matrix:\n{logreg_cm}")
print(f"Classification Report:\n{logreg_cr}")

Accuracy: 0.7842441447835344
Confusion Matrix:
[[915 121]
 [183 190]]
Classification Report:
              precision    recall  f1-score   support

           0       0.83      0.88      0.86      1036
           1       0.61      0.51      0.56       373

    accuracy                           0.78      1409
   macro avg       0.72      0.70      0.71      1409
weighted avg       0.77      0.78      0.78      1409



## Retrain kNN (Scaled Data)

In [12]:
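# The pipeline fits the scaler on X_train only, then reuses it at predict time.
pipeline_steps = [('scaler', StandardScaler()), ('knn', KNeighborsClassifier())]

knn_pipeline = Pipeline(pipeline_steps)
knn_pipeline.fit(X_train, y_train)
# X_test is passed unscaled; the pipeline applies the fitted scaler internally.
knn_pipeline_y_pred = knn_pipeline.predict(X_test)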

## Evaluate Scaled kNN

In [13]:
knn_pipeline_accuracy = accuracy_score(y_test, knn_pipeline_y_pred)
knn_pipeline_cm = confusion_matrix(y_test, knn_pipeline_y_pred)
knn_pipeline_cr = classification_report(y_test, knn_pipeline_y_pred)

print(f"Accuracy : {knn_pipeline_accuracy}")
print(f"Confusion Matrix:\n{knn_pipeline_cm}")
print(f"Classification Report:\n{knn_pipeline_cr}")

Accuracy : 0.7338537970191625
Confusion Matrix:
[[1032    4]
 [ 371    2]]
Classification Report:
              precision    recall  f1-score   support

           0       0.74      1.00      0.85      1036
           1       0.33      0.01      0.01       373

    accuracy                           0.73      1409
   macro avg       0.53      0.50      0.43      1409
weighted avg       0.63      0.73      0.63      1409



### Model Performance Comparison

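On the scaled data, Logistic Regression reaches 0.78 accuracy with 0.51
recall on the churn class. Scaled kNN reaches 0.73 accuracy but only 0.01
churn recall (2 of 373 churners identified), which is no better than always
predicting "no churn" (1036/1409 ≈ 0.735) and warrants investigation before
drawing conclusions. Unscaled baseline scores are not shown in this notebook,
so the outputs above do not by themselves demonstrate an improvement from
scaling. For Logistic Regression, standardized inputs typically help the
solver converge more reliably.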
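
The confusion matrices printed above can be checked against a simple reference point: the accuracy of always predicting the majority class. The sketch below copies the two matrices from the outputs in this notebook; the helper name `summarize` is introduced here for illustration.

```python
import numpy as np

# Confusion matrices copied from the printed outputs above
# (rows = actual class, columns = predicted class; class 1 = churn).
logreg_cm = np.array([[915, 121], [183, 190]])
knn_cm = np.array([[1032, 4], [371, 2]])


def summarize(cm):
    """Derive accuracy, churn recall, and a majority-class baseline from a 2x2 matrix."""
    tn, fp, fn, tp = cm.ravel()
    total = cm.sum()
    return {
        "accuracy": (tn + tp) / total,
        "churn_recall": tp / (tp + fn),
        "false_negatives": int(fn),
        # Accuracy of always predicting class 0 (the majority class here).
        "majority_baseline": (tn + fp) / total,
    }


logreg_summary = summarize(logreg_cm)
knn_summary = summarize(knn_cm)

for name, summary in [("Logistic Regression", logreg_summary), ("kNN", knn_summary)]:
    print(name)
    for key, value in summary.items():
        print(f"  {key}: {value:.4f}" if isinstance(value, float) else f"  {key}: {value}")
```

Comparing each model's accuracy with `majority_baseline` shows whether the model adds anything beyond predicting "no churn" for every customer.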


### Business Interpretation

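Reducing false negatives is critical in churn prediction: every churner the
model misses is a customer who leaves without receiving a retention offer,
which is a direct revenue loss. In the results above, Logistic Regression
misses 183 of 373 churners and kNN misses 371. Feature scaling puts all
inputs on a comparable scale, so that no variable dominates a model purely
because of its units.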
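
One common way to trade false negatives against false positives is to lower the decision threshold applied to predicted churn probabilities. The following is a sketch on synthetic data, not a result from this project: the dataset from `make_classification`, the 0.3 threshold, and the helper `errors_at` are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for churn data (roughly 27% positives).
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, weights=[0.73, 0.27], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=42
)

clf = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=1000)),
])
clf.fit(X_tr, y_tr)

# Probability of the positive (churn) class for each held-out row.
churn_proba = clf.predict_proba(X_te)[:, 1]


def errors_at(threshold):
    """Return (false negatives, false positives) when flagging proba >= threshold."""
    y_hat = (churn_proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_te, y_hat, labels=[0, 1]).ravel()
    return int(fn), int(fp)


fn_default, fp_default = errors_at(0.5)
fn_low, fp_low = errors_at(0.3)  # 0.3 is an arbitrary illustrative threshold

print(f"threshold 0.5: FN={fn_default}, FP={fp_default}")
print(f"threshold 0.3: FN={fn_low}, FP={fp_low}")
```

Lowering the threshold can only keep or reduce the number of missed churners, at the cost of more false alarms. In a real project the threshold would be chosen on a validation set, not the test set, and weighed against the cost of each retention offer.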


### Key Takeaways

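- Distance-based models such as kNN generally need scaled features to give meaningful results.
- Scaling prevents features with large numeric ranges from dominating predictions.
- Pipelines fit preprocessing on the training data only, which reduces data leakage risk.
- Model comparison must be systematic and fair: same split, same metrics, and a majority-class baseline.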
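
As one way to make a comparison systematic, the sketch below evaluates kNN with and without scaling on identical cross-validation folds. It runs on synthetic data (from `make_classification`, with some columns multiplied by 1000 to mimic mismatched units), so the scores say nothing about the churn dataset; the candidate names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; inflate half of the columns to imitate features in larger units.
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=8, n_informative=4, random_state=42
)
X_syn = X_syn * np.array([1, 1, 1, 1, 1000, 1000, 1000, 1000])

# One fixed set of folds, shared by every candidate model.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "knn_unscaled": Pipeline([("knn", KNeighborsClassifier())]),
    "knn_scaled": Pipeline([
        ("scaler", StandardScaler()),
        ("knn", KNeighborsClassifier()),
    ]),
}

# Score each candidate on recall for the positive class, using the same folds.
results = {
    name: cross_val_score(model, X_syn, y_syn, cv=cv, scoring="recall")
    for name, model in candidates.items()
}

for name, scores in results.items():
    print(f"{name}: mean recall={scores.mean():.3f} (std={scores.std():.3f})")
```

Because the scaler sits inside the pipeline, it is refit on the training portion of each fold, so the comparison stays leakage-safe.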
