## Task 5
# End-to-End ML Pipeline with Scikit-learn Pipeline API

# Objective
The objective of this task was to build a reusable and production-ready machine learning pipeline for predicting customer churn. This included automating preprocessing, training multiple models, tuning hyperparameters, and exporting the final optimized pipeline for future use.

# Introduction
The task involved using the Telco Customer Churn dataset, which contains customer demographics, account details, and service usage information, along with the target label indicating whether a customer churned. The goal was to design a complete machine learning pipeline that could handle data preprocessing, model training, evaluation, and deployment in a streamlined and reusable way using Scikit-learn’s Pipeline API.

# Overview
The process began with loading and cleaning the dataset: the TotalCharges column was converted to numeric values (invalid entries became NaN), the customerID field was dropped because it carries no predictive signal, and rows with missing values were removed. The features were then split into numerical and categorical variables, each with its own preprocessing pipeline: numerical features were median-imputed and standardized, while categorical features were imputed with a constant 'missing' label and one-hot encoded. Because incomplete rows were already dropped during cleaning, the imputers mainly act as a safeguard for new data at prediction time. Two classifiers were tested: Logistic Regression as a baseline linear model and Random Forest as a nonlinear ensemble model, both with balanced class weights. Each classifier was embedded in a pipeline containing both the preprocessing and the model. Hyperparameters were tuned with GridSearchCV using 5-fold cross-validation and the F1 score as the selection metric. The pipeline with the higher test-set F1 was then selected, evaluated, and saved with joblib for reuse.
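The cleaning step described above can be illustrated on a toy frame. This is a minimal sketch, assuming the invalid TotalCharges entries are blank strings; the three sample rows and the variable names `raw`, `n_missing`, and `cleaned` are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample: a text column where one entry is a blank string.
raw = pd.DataFrame({
    "customerID": ["A-1", "B-2", "C-3"],
    "TotalCharges": ["29.85", " ", "108.15"],
})

# errors="coerce" converts unparseable strings to NaN instead of raising.
raw["TotalCharges"] = pd.to_numeric(raw["TotalCharges"], errors="coerce")

# Count the values that could not be parsed.
n_missing = int(raw["TotalCharges"].isna().sum())

# Drop the identifier column, then drop any row that still has a NaN.
cleaned = raw.drop(columns="customerID").dropna()

print(f"missing after coercion: {n_missing}, rows kept: {len(cleaned)}")
```

Coercing first and dropping afterwards keeps the column numeric, so it is later picked up by the numerical branch of the preprocessor rather than being one-hot encoded as text.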

# Evaluation
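On the held-out test set (1,407 customers, 374 of whom churned), the tuned Logistic Regression pipeline reached an F1 score of 0.6069 (best C = 1), while the tuned Random Forest pipeline reached a higher F1 score of 0.6204 (max_depth = 10, n_estimators = 100). The Random Forest model also achieved an accuracy of 76.69% and a ROC AUC of 0.8320. The classification report indicated strong performance for non-churned customers (precision 0.88, recall 0.79, F1 0.83) and moderate performance for churned customers (precision 0.55, recall 0.72, F1 0.62): the model recovers most churners at the cost of a fair number of false alarms. Based on these results, the Random Forest model was selected as the best pipeline. Because the final choice between the two models was made on the test-set F1, the reported scores may be slightly optimistic.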
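The per-class F1 values can be checked against the reported precision and recall, since F1 is their harmonic mean. The helper name `f1_from` below is hypothetical, and the inputs are the rounded figures from the classification report, so the result only matches to two decimals.

```python
def f1_from(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded precision/recall pairs taken from the classification report.
churn_f1 = f1_from(0.55, 0.72)
non_churn_f1 = f1_from(0.88, 0.79)

print(f"churn F1: {churn_f1:.2f}, non-churn F1: {non_churn_f1:.2f}")
```

Both values agree with the report (0.62 and 0.83), which confirms the table is internally consistent.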

# Summary
This task successfully demonstrated the construction of an end-to-end machine learning pipeline using Scikit-learn. The workflow included preprocessing, training, hyperparameter tuning, and model export, ensuring the pipeline is reusable and production-ready. The Random Forest pipeline achieved the best performance and was saved for future deployment.
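Reusing an exported pipeline amounts to loading the file and calling `predict`. The sketch below shows that round trip on a tiny synthetic stand-in rather than the real churn pipeline: the data, the two-step pipeline, and the file name `demo_pipeline.joblib` are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the Telco dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = (X[:, 0] > 0).astype(int)

# A small pipeline with the same shape of idea: preprocessing + classifier.
pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])
pipe.fit(X, y)

# Save to a temporary file and load it back, as a deployment script would.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo_pipeline.joblib"
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored pipeline should reproduce the original predictions.
same_predictions = bool((restored.predict(X) == pipe.predict(X)).all())
print(f"predictions identical after reload: {same_predictions}")
```

For the real artifact, the equivalent call would be `joblib.load('telco_churn_pipeline.joblib')`, followed by `predict` on a DataFrame with the same feature columns used in training. Loading is generally safest with the same scikit-learn version that produced the file.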

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, f1_score, roc_auc_score
import joblib

# Step 1: Load and preprocess data
def load_and_preprocess_data():
    """
     Loads the telco customer churn dataset and performs initial cleaning
     Converts totalcharges to numeric to handle errors by invalid values
     Removes the customer id column (not useful for prediction).
     Drops rows with missing values.
     Converts churn column from Yes/No to 1/0.
    """
    url = "https://raw.githubusercontent.com/IBM/telco-customer-churn-on-icp4d/master/data/Telco-Customer-Churn.csv"
    df = pd.read_csv(url)

    df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
    df = df.drop('customerID', axis=1)
    df = df.dropna()
    df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0})

    return df

# Step 2: Create preprocessing pipeline
def create_preprocessing_pipeline(X):
    """
    Creates a preprocessing pipeline for numerical and categorical features
    
    """
    categorical_features = X.select_dtypes(include=['object']).columns.tolist()
    numerical_features = X.select_dtypes(include=[np.number]).columns.tolist()

    # Pipeline for numerical features
    numerical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ])

    # Pipeline for categorical features
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
    ])

    # Combine both transformers into a single ColumnTransformer
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numerical_transformer, numerical_features),
            ('cat', categorical_transformer, categorical_features)
        ])

    return preprocessor

# Step 3: Main execution
def main():
    # Load and clean data
    df = load_and_preprocess_data()
    X = df.drop('Churn', axis=1)
    y = df['Churn']

    # Split into training and test sets (stratified to preserve the churn ratio in both)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    # Create preprocessing pipeline
    preprocessor = create_preprocessing_pipeline(X)

    # Define models and their hyperparameter grids for tuning
    models = {
        'logistic_regression': {
            'model': LogisticRegression(random_state=42, class_weight='balanced', max_iter=1000),
            'params': {'classifier__C': [0.1, 1, 10], 'classifier__solver': ['liblinear']}
        },
        'random_forest': {
            'model': RandomForestClassifier(random_state=42, class_weight='balanced'),
            'params': {'classifier__n_estimators': [100, 200], 'classifier__max_depth': [None, 10, 20]}
        }
    }

    best_score = 0
    best_model = None
    best_model_name = ""

    # Train and evaluate each model using GridSearchCV
    for model_name, model_info in models.items():
        print(f"\nTraining {model_name}")

        # Create full pipeline (preprocessing + model)
        pipeline = Pipeline(steps=[
            ('preprocessor', preprocessor),
            ('classifier', model_info['model'])
        ])

        # Perform hyperparameter tuning
        grid_search = GridSearchCV(
            pipeline,
            model_info['params'],
            cv=5,
            scoring='f1',
            n_jobs=-1,
            verbose=0
        )

        # Fit model
        grid_search.fit(X_train, y_train)

        # Evaluate on test set
        y_pred = grid_search.predict(X_test)
        f1 = f1_score(y_test, y_pred)

        print(f"{model_name} - Best F1 Score: {f1:.4f}")
        print(f"Best parameters: {grid_search.best_params_}")

        # Keep track of best performing model
        if f1 > best_score:
            best_score = f1
            best_model = grid_search.best_estimator_
            best_model_name = model_name

    # Final evaluation of best model
    print(f"\nBest model selected: {best_model_name}")
    y_pred = best_model.predict(X_test)
    y_pred_proba = best_model.predict_proba(X_test)[:, 1]

    print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
    print(f"F1 Score: {f1_score(y_test, y_pred):.4f}")
    print(f"ROC AUC: {roc_auc_score(y_test, y_pred_proba):.4f}")
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))

    # Save final pipeline to disk for reuse
    joblib.dump(best_model, 'telco_churn_pipeline.joblib')
    print("\nPipeline saved as telco_churn_pipeline.joblib")

if __name__ == "__main__":
    main()



Training logistic_regression
logistic_regression - Best F1 Score: 0.6069
Best parameters: {'classifier__C': 1, 'classifier__solver': 'liblinear'}

Training random_forest
random_forest - Best F1 Score: 0.6204
Best parameters: {'classifier__max_depth': 10, 'classifier__n_estimators': 100}

Best model selected: random_forest
Accuracy: 0.7669
F1 Score: 0.6204
ROC AUC: 0.8320

Classification Report:
              precision    recall  f1-score   support

           0       0.88      0.79      0.83      1033
           1       0.55      0.72      0.62       374

    accuracy                           0.77      1407
   macro avg       0.72      0.75      0.73      1407
weighted avg       0.79      0.77      0.78      1407


Pipeline saved as telco_churn_pipeline.joblib
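
The "Best F1 Score" lines in the output above are computed on the test set, and the same scores decide which model wins. One alternative, which is not what the code above does, is to compare models by their cross-validated score (`best_score_` on the fitted `GridSearchCV`) and keep the test set for a single final report. The sketch below shows the two numbers side by side on synthetic data from `make_classification`; the data, the grid, and the names `cv_f1` and `test_f1` are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, moderately imbalanced binary problem (not the Telco dataset).
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.75, 0.25], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Step-prefixed parameter names route each value to the named pipeline step.
search = GridSearchCV(
    pipeline, {"classifier__C": [0.1, 1, 10]}, cv=5, scoring="f1"
)
search.fit(X_train, y_train)

# Mean F1 over the validation folds for the best parameter setting.
cv_f1 = search.best_score_
# F1 of the refit best estimator on data never seen during tuning.
test_f1 = f1_score(y_test, search.predict(X_test))

print(f"cross-validated F1: {cv_f1:.4f}")
print(f"test-set F1:        {test_f1:.4f}")
```

Selecting on `cv_f1` and reporting `test_f1` once keeps the test score an unbiased estimate, at the cost of a slightly noisier comparison when the candidate models score very close to each other.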
