# Objective:
Build a reusable and production-ready machine learning pipeline for predicting customer churn.  

Installing Libraries

In [None]:
!pip install pandas scikit-learn joblib




Loading Dataset and Initial Inspection

In [None]:
import pandas as pd

# Load dataset
df = pd.read_csv("/content/drive/MyDrive/telco.csv")
df.head()


Unnamed: 0,Customer ID,Gender,Age,Under 30,Senior Citizen,Married,Dependents,Number of Dependents,Country,State,...,Total Extra Data Charges,Total Long Distance Charges,Total Revenue,Satisfaction Score,Customer Status,Churn Label,Churn Score,CLTV,Churn Category,Churn Reason
0,8779-QRDMV,Male,78,No,Yes,No,No,0,United States,California,...,20,0.0,59.65,3,Churned,Yes,91,5433,Competitor,Competitor offered more data
1,7495-OOKFY,Female,74,No,Yes,Yes,Yes,1,United States,California,...,0,390.8,1024.1,3,Churned,Yes,69,5302,Competitor,Competitor made better offer
2,1658-BYGOY,Male,71,No,Yes,No,Yes,3,United States,California,...,0,203.94,1910.88,2,Churned,Yes,81,3179,Competitor,Competitor made better offer
3,4598-XLKNJ,Female,78,No,Yes,Yes,Yes,1,United States,California,...,0,494.0,2995.07,2,Churned,Yes,88,5337,Dissatisfaction,Limited range of services
4,4846-WHAFZ,Female,80,No,Yes,Yes,Yes,1,United States,California,...,0,234.21,3102.36,2,Churned,Yes,67,2793,Price,Extra data charges


In [None]:
df.isnull().sum()

Unnamed: 0,0
Customer ID,0
Gender,0
Age,0
Under 30,0
Senior Citizen,0
Married,0
Dependents,0
Number of Dependents,0
Country,0
State,0


Dropping Unnecessary Columns

In [None]:
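# Identifier, duplicate-target, and location columns to remove.
# Note: errors="ignore" (below) silently skips any name that is not a column.
# The dataset's identifier column is "Customer ID", so "CustomerID" has no effect.
drop_cols = [
    "CustomerID",
    "Churn Label",
    "Churn Score",
    "Churn Reason",
    "Lat Long",
    "Count",
    "Zip Code"
]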

df.drop(columns=drop_cols, inplace=True, errors='ignore')
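
Because `errors='ignore'` skips unknown names without warning, a misspelled column name goes unnoticed. The sketch below shows one way to list the requested names that are not actually present before dropping. The helper `missing_columns` and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd


def missing_columns(frame: pd.DataFrame, names: list[str]) -> list[str]:
    """Return the requested names that are not columns of `frame`."""
    return [name for name in names if name not in frame.columns]


# Toy frame standing in for the real dataset (made-up values).
toy = pd.DataFrame(
    {"Customer ID": ["a", "b"], "Churn Score": [91, 69], "Age": [78, 74]}
)
requested = ["CustomerID", "Churn Score"]

# Names that drop(..., errors="ignore") would silently skip.
not_found = missing_columns(toy, requested)
print(not_found)

toy = toy.drop(columns=requested, errors="ignore")
print(list(toy.columns))
```

In this toy case the misspelled identifier stays in the frame, while the correctly spelled column is removed.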

Encoding Target Variable and Splitting Features/Target

In [None]:
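# Encode the target: 1 = churned, 0 = retained.
# Any status value not listed in the mapping becomes NaN;
# those rows are filtered out later, before the train/test split.
df["Customer Status"] = df["Customer Status"].map({
    "Churned": 1,
    "Stayed": 0,
    "Active": 0
})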

y = df["Customer Status"]
X = df.drop(columns=["Customer Status"])
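
`Series.map` with a dictionary returns NaN for every value that has no key in the dictionary, and the presence of NaN turns the result into a float column (which is consistent with the class labels appearing as `0.0` and `1.0` in the reports further down). A minimal sketch on a toy series, where `"Other"` is a made-up label used only for illustration:

```python
import pandas as pd

# Toy status column; "Other" is a hypothetical label with no mapping entry.
status = pd.Series(["Churned", "Stayed", "Other"])
encoded = status.map({"Churned": 1, "Stayed": 0})

# Count every outcome of the mapping, including NaN.
counts = encoded.value_counts(dropna=False)
print(counts)

# List the original labels that were left unmapped.
unmapped = status[encoded.isna()].unique().tolist()
print(unmapped)
```

Running the same two checks on the real `df["Customer Status"]` before and after the mapping would show how many rows end up without a target.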


Displaying DataFrame Head After Column Dropping

In [None]:
df.head()

Unnamed: 0,Customer ID,Gender,Age,Under 30,Senior Citizen,Married,Dependents,Number of Dependents,Country,State,...,Monthly Charge,Total Charges,Total Refunds,Total Extra Data Charges,Total Long Distance Charges,Total Revenue,Satisfaction Score,Customer Status,CLTV,Churn Category
0,8779-QRDMV,Male,78,No,Yes,No,No,0,United States,California,...,39.65,39.65,0.0,20,0.0,59.65,3,Churned,5433,Competitor
1,7495-OOKFY,Female,74,No,Yes,Yes,Yes,1,United States,California,...,80.65,633.3,0.0,0,390.8,1024.1,3,Churned,5302,Competitor
2,1658-BYGOY,Male,71,No,Yes,No,Yes,3,United States,California,...,95.45,1752.55,45.61,0,203.94,1910.88,2,Churned,3179,Competitor
3,4598-XLKNJ,Female,78,No,Yes,Yes,Yes,1,United States,California,...,98.5,2514.5,13.43,0,494.0,2995.07,2,Churned,5337,Dissatisfaction
4,4846-WHAFZ,Female,80,No,Yes,Yes,Yes,1,United States,California,...,76.5,2868.15,0.0,0,234.21,3102.36,2,Churned,2793,Price


### Model Performance Comparison

**Random Forest Classifier** (held-out test set, 1,318 rows):

```
              precision    recall  f1-score   support

         0.0       0.95      1.00      0.98       944
         1.0       1.00      0.87      0.93       374

    accuracy                           0.96      1318
   macro avg       0.97      0.94      0.95      1318
weighted avg       0.97      0.96      0.96      1318
```

**Logistic Regression Classifier** (held-out test set, 1,318 rows):

```
              precision    recall  f1-score   support

         0.0       0.95      0.99      0.97       944
         1.0       0.97      0.88      0.92       374

    accuracy                           0.96      1318
   macro avg       0.96      0.93      0.95      1318
weighted avg       0.96      0.96      0.96      1318
```

Defining Categorical Features

In [None]:
categorical_features = [
    "Gender",
    "Under 30",
    "Senior Citizen",
    "Married",
    "Dependents",
    "Country",
    "State"
]


Handling Numerical Features and Missing Values

In [None]:
numerical_features = [
    "Age",
    "Number of Dependents",
    "Monthly Charge",
    "Total Charges",
    "Total Refunds",
    "Total Extra Data Charges",
    "Total Long Distance Charges",
    "Total Revenue",
    "Satisfaction Score",
    "CLTV"
]

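# Coerce to numeric; values that cannot be parsed become NaN.
for col in numerical_features:
    X[col] = pd.to_numeric(X[col], errors="coerce")

# Fill missing values with each column's median.
# Note: the medians are computed on the full dataset, before the train/test split.
X[numerical_features] = X[numerical_features].fillna(
    X[numerical_features].median()
)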
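
The imputation above computes medians over all rows, including those that later land in the test set. One alternative, shown here only as a sketch and not as the notebook's method, is to put a `SimpleImputer` inside the numeric branch of the pipeline so the medians are learned from the training data alone. The toy frames and the name `numeric_steps` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Numeric branch: impute with the training median, then standardize.
numeric_steps = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)

# Toy data (made-up values); the test frame contains a missing value.
train = pd.DataFrame({"Monthly Charge": [10.0, 20.0, np.nan, 40.0]})
test = pd.DataFrame({"Monthly Charge": [np.nan, 1000.0]})

numeric_steps.fit(train)

# The median comes from the training rows only (10, 20, 40).
learned_median = numeric_steps.named_steps["imputer"].statistics_[0]
print(learned_median)

# The missing test value is filled with the training median before scaling.
test_scaled = numeric_steps.transform(test)
print(test_scaled)
```

A branch like this could replace `StandardScaler()` as the `"num"` transformer, in which case the saved pipeline would also handle missing numeric values at prediction time.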


Setting Up Preprocessing Pipeline

In [None]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numerical_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features)
    ]
)


Splitting Data into Training and Test Sets

In [None]:
from sklearn.model_selection import train_test_split

# Filter out rows where y is NaN to prepare for stratification
valid_indices = y.dropna().index
X_cleaned = X.loc[valid_indices]
y_cleaned = y.loc[valid_indices]

X_train, X_test, y_train, y_test = train_test_split(
    X_cleaned, y_cleaned,
    test_size=0.2,
    random_state=42,
    stratify=y_cleaned
)
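
With `stratify`, the class proportions in the training and test sets match those of the full target as closely as possible. A quick way to confirm this, sketched on toy data with an assumed 30% positive rate:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 30 positives (made-up values).
toy_X = pd.DataFrame({"feature": range(100)})
toy_y = pd.Series([1] * 30 + [0] * 70)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# The mean of a 0/1 target is the positive-class rate.
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(train_rate, test_rate)
```

The same check on `y_train` and `y_test` should give nearly identical churn rates; in the reports below, the test set holds 374 churned customers out of 1,318 rows.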

Importing Modeling and Evaluation Libraries

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

Training Logistic Regression Model with GridSearchCV

In [None]:
from sklearn.model_selection import GridSearchCV

logreg_pipeline = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("classifier", LogisticRegression(max_iter=1000))
    ]
)

logreg_params = {
    "classifier__C": [0.01, 0.1, 1, 10],
    "classifier__solver": ["liblinear"]
}

logreg_grid = GridSearchCV(
    logreg_pipeline,
    logreg_params,
    cv=5,
    scoring="f1",
    n_jobs=-1
)

logreg_grid.fit(X_train, y_train)
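
After fitting, the search object exposes the winning hyperparameters and the mean cross-validated score through `best_params_` and `best_score_`. The sketch below runs the same parameter grid on synthetic data from `make_classification`, so the printed values say nothing about the churn dataset; all `toy_*` names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn data.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

toy_pipeline = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("classifier", LogisticRegression(max_iter=1000)),
    ]
)

toy_params = {
    "classifier__C": [0.01, 0.1, 1, 10],
    "classifier__solver": ["liblinear"],
}

toy_grid = GridSearchCV(toy_pipeline, toy_params, cv=5, scoring="f1")
toy_grid.fit(X_toy, y_toy)

# Best hyperparameters and their mean cross-validated F1 score.
print(toy_grid.best_params_)
print(round(toy_grid.best_score_, 3))
```

Calling the same two attributes on `logreg_grid` would show which value of `C` was selected for the churn model.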

Training Random Forest Model with GridSearchCV

In [None]:
rf_pipeline = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("classifier", RandomForestClassifier(random_state=42))
    ]
)

rf_params = {
    "classifier__n_estimators": [100, 200],
    "classifier__max_depth": [None, 20, 35],
    "classifier__min_samples_split": [2, 5]
}

rf_grid = GridSearchCV(
    rf_pipeline,
    rf_params,
    cv=5,
    scoring="f1",
    n_jobs=-1
)

rf_grid.fit(X_train, y_train)
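
The cost of this search grows with the product of the grid sizes. As a rough sketch, `ParameterGrid` can be used to count the candidate combinations; multiplying by the number of folds gives the number of cross-validation fits (the default `refit=True` adds one final fit on the whole training set).

```python
from sklearn.model_selection import ParameterGrid

rf_params = {
    "classifier__n_estimators": [100, 200],
    "classifier__max_depth": [None, 20, 35],
    "classifier__min_samples_split": [2, 5],
}

# 2 * 3 * 2 hyperparameter combinations.
n_candidates = len(ParameterGrid(rf_params))

# Each candidate is fitted once per fold (cv=5).
n_fits = n_candidates * 5

print(n_candidates, n_fits)
```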



Evaluating Random Forest Model

In [None]:
best_model = rf_grid.best_estimator_

y_pred1 = best_model.predict(X_test)

print(classification_report(y_test, y_pred1))


              precision    recall  f1-score   support

         0.0       0.95      1.00      0.98       944
         1.0       1.00      0.87      0.93       374

    accuracy                           0.96      1318
   macro avg       0.97      0.94      0.95      1318
weighted avg       0.97      0.96      0.96      1318



Evaluating Logistic Regression Model

In [None]:
best_model1 = logreg_grid.best_estimator_

y_pred = best_model1.predict(X_test)

print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

         0.0       0.95      0.99      0.97       944
         1.0       0.97      0.88      0.92       374

    accuracy                           0.96      1318
   macro avg       0.96      0.93      0.95      1318
weighted avg       0.96      0.96      0.96      1318
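
The classification reports round every figure to two decimals. To see the underlying counts, a confusion matrix can be computed alongside the report. The sketch below uses small made-up label lists rather than the notebook's predictions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 1 = churned, 0 = stayed.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 1, 1, 1, 1, 0]

# For binary labels, ravel() yields the counts in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

# Precision and recall for class 1, computed from the raw counts.
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(tn, fp, fn, tp)
print(precision, recall)
```

Applied to `y_test` with `y_pred1` or `y_pred`, the counts would show exactly how many false positives and false negatives sit behind the rounded scores.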



Saving Logistic Regression Model

In [None]:
import joblib
joblib.dump(best_model1, "churn_lr_pipeline_v2.pkl")


['churn_lr_pipeline_v2.pkl']

Saving Random Forest Model

In [None]:
import joblib
joblib.dump(best_model, "churn_rf_pipeline_v2.pkl")


['churn_rf_pipeline_v2.pkl']
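
Because each saved object is the full pipeline (preprocessing plus classifier), it can be reloaded with `joblib.load` and applied directly to raw feature rows. The round trip is sketched below with a small synthetic pipeline and a temporary file; the file name `toy_pipeline.pkl` and the data are assumptions.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic pipeline standing in for the churn pipelines.
X_toy, y_toy = make_classification(n_samples=100, n_features=4, random_state=0)
toy_pipeline = Pipeline(
    steps=[("scaler", StandardScaler()), ("classifier", LogisticRegression())]
).fit(X_toy, y_toy)

# Save to a temporary location, then load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.pkl")
    joblib.dump(toy_pipeline, path)
    restored = joblib.load(path)

# The restored pipeline reproduces the original predictions.
same_predictions = bool((restored.predict(X_toy) == toy_pipeline.predict(X_toy)).all())
print(same_predictions)
```

Pickled scikit-learn objects are generally only safe to load with the same library version that wrote them, so it helps to record the installed versions next to the `.pkl` files.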

### Key Observations:

Both models achieved an overall accuracy of **0.96** on the 1,318-row test set.

#### For Class 0 (Customer Stayed):
*   Both models reach a precision of 0.95 and a recall of 0.99 to 1.00, giving f1-scores of 0.97 (Logistic Regression) and 0.98 (Random Forest).
*   The Random Forest model has a slightly higher recall for this class (1.00) than Logistic Regression (0.99). Because the report rounds to two decimals, this means it identified nearly all, though not necessarily every one, of the customers who stayed.

#### For Class 1 (Customer Churned):
*   **Random Forest** has a precision of **1.00** and a recall of **0.87**, resulting in an f1-score of **0.93**.
*   **Logistic Regression** has a precision of **0.97** and a recall of **0.88**, resulting in an f1-score of **0.92**.
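
The f1-scores quoted above follow from the precision and recall values, since F1 is their harmonic mean. A short arithmetic check, using the rounded figures from the reports (so the results are approximate):

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded Class 1 values taken from the classification reports.
rf_f1 = f1_from(1.00, 0.87)
lr_f1 = f1_from(0.97, 0.88)

print(round(rf_f1, 2), round(lr_f1, 2))
```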

#### Conclusion:
*   Both models predict customer churn well on this test set, with nearly identical overall performance.
*   The **Random Forest Classifier** has a Class 1 precision that rounds to 1.00, meaning that on this test set almost every customer it flagged as churned had in fact churned. Its Class 1 recall (0.87) is marginally lower than that of Logistic Regression, so it misses slightly more actual churn cases.
*   The **Logistic Regression** model has a marginally higher Class 1 recall (0.88), so it catches a slightly larger share of actual churn cases, at the cost of lower precision (0.97).

The choice depends on the business objective. If wrongly flagging a staying customer as a churner is costly (false positives), Random Forest is marginally preferable because of its higher Class 1 precision (1.00 vs. 0.97). If catching as many churners as possible is the priority (minimizing false negatives), Logistic Regression has a slight edge in Class 1 recall (0.88 vs. 0.87). With 374 churned customers in the test set, a 0.01 difference in recall amounts to only a handful of customers, so neither advantage is large.
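
The precision/recall trade-off can also be adjusted within a single model by changing the decision threshold applied to `predict_proba`, rather than by switching models. The sketch below illustrates the idea on synthetic data; the thresholds 0.3, 0.5, and 0.7 are arbitrary choices, and none of the numbers relate to the churn dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the churn data (about 30% positives).
X_toy, y_toy = make_classification(
    n_samples=500, n_features=6, weights=[0.7, 0.3], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Probability of the positive class (column 1).
churn_probability = model.predict_proba(X_te)[:, 1]

# Lower thresholds flag more customers: recall rises, precision tends to fall.
results = {}
for threshold in (0.3, 0.5, 0.7):
    flagged = (churn_probability >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_te, flagged, zero_division=0),
        recall_score(y_te, flagged),
    )

for threshold, (precision, recall) in results.items():
    print(threshold, round(precision, 2), round(recall, 2))
```

Since both saved pipelines expose `predict_proba`, the same loop could be run on `X_test` to pick a threshold that matches the cost of a missed churner versus a wrongly flagged customer.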