# Task 2: End-to-End ML Pipeline using Scikit-learn

## Problem Statement

Customer churn prediction is critical for telecom companies to retain customers. The objective of this project is to build a reusable and production-ready machine learning pipeline to predict whether a customer will churn.

## Objective

- Build an end-to-end ML pipeline using Scikit-learn
- Apply preprocessing using Pipeline and ColumnTransformer
- Train Logistic Regression and Random Forest models
- Tune hyperparameters using GridSearchCV
- Export the final model using joblib

In [1]:
# Import Libraries

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, classification_report
import joblib

In [2]:
# Load Dataset

df = pd.read_csv("Telco-Customer-Churn.csv")
df.head()


In [None]:
# Data Cleaning & Basic Preprocessing

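# Remove customerID (an identifier, not useful for prediction)
df = df.drop(columns="customerID")

# Convert TotalCharges to numeric; non-numeric entries become NaN
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# Fill missing values by assignment: calling fillna(inplace=True) on a selected
# column triggers a FutureWarning in pandas 2.x and has no effect under Copy-on-Write.
# Note: this median is computed on the full dataset, before the train-test split.
df["TotalCharges"] = df["TotalCharges"].fillna(df["TotalCharges"].median())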

# Convert target to binary
df["Churn"] = df["Churn"].map({"Yes": 1, "No": 0})
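
The `errors="coerce"` argument is what turns unparseable entries into missing values that can then be filled. A minimal sketch of that behavior, using a made-up series (the values below are illustrative assumptions, not rows from the dataset):

```python
import pandas as pd

# Hypothetical raw column: two valid numbers, one blank, one text token.
raw_charges = pd.Series(["29.85", " ", "1889.5", "n/a"])

# errors="coerce" replaces anything that cannot be parsed with NaN
# instead of raising an exception.
converted = pd.to_numeric(raw_charges, errors="coerce")

print(converted)
print("Missing after conversion:", converted.isna().sum())
```

Counting the NaN values right after conversion is a quick way to see how many rows the later fill step will affect.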

In [None]:
# Feature & Target Split

X = df.drop("Churn", axis=1)
y = df["Churn"]

In [None]:
# Train-Test Split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
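
Passing `stratify=y` keeps the churn rate roughly equal in the training and test sets. The following sketch checks this on synthetic labels; the 27% positive rate and the toy feature are assumptions chosen for illustration, not figures taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1000 rows, 270 of them positive (illustrative only).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"feature": rng.normal(size=1000)})
y_demo = pd.Series([1] * 270 + [0] * 730, name="Churn")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification, both splits keep close to the original class proportions.
print("Train positive rate:", y_tr.mean())
print("Test positive rate:", y_te.mean())
```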

In [None]:
# Define Preprocessing Pipeline

# Separate numerical & categorical features:

numeric_features = X.select_dtypes(include=["int64", "float64"]).columns
categorical_features = X.select_dtypes(include=["object"]).columns

# Create transformers

numeric_transformer = Pipeline(steps=[
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)
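
Missing values can also be handled inside the preprocessing step, so that the fill value is learned from training data only. One possible variant, sketched below on a made-up frame, adds scikit-learn's `SimpleImputer` to the numeric transformer. The names `demo`, `numeric_with_imputer`, and `demo_preprocessor` are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame with one missing numeric value.
demo = pd.DataFrame({
    "tenure": [1, 12, 24, 36],
    "TotalCharges": [29.85, np.nan, 1500.0, 3000.0],
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
})

numeric_cols = ["tenure", "TotalCharges"]
categorical_cols = ["Contract"]

# Impute first, then scale; both steps are fit only on the data passed to fit().
numeric_with_imputer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

demo_preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_with_imputer, numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

transformed = demo_preprocessor.fit_transform(demo)
print(transformed.shape)
```

When such a preprocessor sits inside a pipeline passed to `GridSearchCV`, the median is recomputed on the training portion of every fold.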

In [None]:
# Logistic Regression Pipeline

logreg_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000))
])

In [None]:
# Hyperparameter Tuning (GridSearchCV)

param_grid = {
    "classifier__C": [0.01, 0.1, 1, 10],
    "classifier__penalty": ["l2"]
}

grid_logreg = GridSearchCV(
    logreg_pipeline,
    param_grid,
    cv=5,
    scoring="f1",
    n_jobs=-1
)

grid_logreg.fit(X_train, y_train)
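
After fitting, `GridSearchCV` stores the cross-validated score of every candidate in `cv_results_`, which helps judge whether the best setting is clearly ahead or within noise of the others. A small sketch on synthetic data from `make_classification` (the data and the names `demo_pipeline` and `demo_grid` are assumptions for illustration):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for the churn features.
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)

demo_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

demo_grid = GridSearchCV(
    demo_pipeline,
    {"classifier__C": [0.01, 0.1, 1, 10]},
    cv=5,
    scoring="f1",
)
demo_grid.fit(X_syn, y_syn)

# One row per candidate, sorted so the best-ranked setting comes first.
results = pd.DataFrame(demo_grid.cv_results_)[
    ["param_classifier__C", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results)
```

Comparing `mean_test_score` differences against `std_test_score` gives a rough sense of how much the choice of `C` matters.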

In [None]:
# Evaluate Logistic Regression

best_logreg = grid_logreg.best_estimator_

y_pred = best_logreg.predict(X_test)

print("Best Parameters:", grid_logreg.best_params_)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

In [None]:
# Random Forest Pipeline

rf_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(random_state=42))
])

In [None]:
# Hyperparameter Tuning (Random Forest)

param_grid_rf = {
    "classifier__n_estimators": [100, 200],
    "classifier__max_depth": [None, 10, 20],
    "classifier__min_samples_split": [2, 5]
}

grid_rf = GridSearchCV(
    rf_pipeline,
    param_grid_rf,
    cv=5,
    scoring="f1",
    n_jobs=-1
)

grid_rf.fit(X_train, y_train)

# Evaluate Random Forest

best_rf = grid_rf.best_estimator_

y_pred_rf = best_rf.predict(X_test)

print("Best Parameters:", grid_rf.best_params_)
print("Accuracy:", accuracy_score(y_test, y_pred_rf))
print("F1 Score:", f1_score(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))

In [None]:
# Select Best Model

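# Compare mean cross-validated F1 from the two grid searches rather than
# test-set F1, so the test set remains an unbiased check of the chosen model.
if grid_rf.best_score_ > grid_logreg.best_score_:
    final_model = best_rf
    print("Random Forest selected.")
else:
    final_model = best_logreg
    print("Logistic Regression selected.")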

# Export Final Pipeline

joblib.dump(final_model, "churn_pipeline.pkl")
print("Pipeline saved successfully.")

In [None]:
# Load & Use Saved Pipeline

loaded_model = joblib.load("churn_pipeline.pkl")

sample_prediction = loaded_model.predict(X_test.iloc[:5])
print(sample_prediction)
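
Beyond hard labels, a reloaded pipeline can return churn probabilities through `predict_proba`, which is often more useful for ranking customers by risk. A minimal round-trip sketch, assuming a small stand-in pipeline trained on synthetic data and saved to a temporary directory (`demo_model` and `demo_pipeline.pkl` are hypothetical names):

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in pipeline trained on synthetic data.
X_syn, y_syn = make_classification(n_samples=200, n_features=6, random_state=0)
demo_model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
]).fit(X_syn, y_syn)

# Save and reload inside a temporary directory so no file is left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "demo_pipeline.pkl"
    joblib.dump(demo_model, model_path)
    reloaded = joblib.load(model_path)

# Column 1 of predict_proba holds the probability of the positive class (label 1).
churn_probability = reloaded.predict_proba(X_syn[:5])[:, 1]
print(churn_probability)

# The reloaded pipeline should reproduce the original predictions exactly.
same_predictions = np.array_equal(demo_model.predict(X_syn), reloaded.predict(X_syn))
print("Predictions match after reload:", same_predictions)
```

Saved pipelines are best loaded with the same scikit-learn version that produced them, and pickle files should only be loaded from trusted sources.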

## Final Summary / Insights

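Logistic Regression and Random Forest models were each trained inside a complete Scikit-learn Pipeline that scales numeric features and one-hot encodes categorical features through a ColumnTransformer. GridSearchCV selected the hyperparameters for each model using 5-fold cross-validation with F1 as the scoring metric. The better-scoring pipeline was exported using joblib, so preprocessing and classifier load together as a single reusable object. Because the scaler and encoder are fit inside the Pipeline, each cross-validation fold fits them on its training portion only, which keeps preprocessing consistent between training and inference and reduces the risk of data leakage.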