# Data Exploration and Training
* This notebook explores our structured data. The data source is Kaggle (https://www.kaggle.com/datasets/yeanzc/telco-customer-churn-ibm-dataset).
* An XGBoost classifier is the model used to give each customer a "Risk Score".
* The final model is saved to churn_model_v2.pkl to be containerized in Docker.

In [1]:
# Import Libraries
import pandas as pd
import numpy as np
import joblib

# Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Model
from xgboost import XGBClassifier

# Evaluation
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

print("Libraries imported successfully.")

Libraries imported successfully.


In [2]:
# Load historical dataset
data_path = "../data/Telco_customer_churn.csv"
try:
    df = pd.read_csv(data_path)
    print("Data loaded successfully.")
except FileNotFoundError:
    print(f"Error: Data file not found at {data_path}")
    raise  # Stop here: every later cell needs df

# Display the first few rows and column info
df.head()

Data loaded successfully.


Unnamed: 0,CustomerID,Count,Country,State,City,Zip Code,Lat Long,Latitude,Longitude,Gender,...,Contract,Paperless Billing,Payment Method,Monthly Charges,Total Charges,Churn Label,Churn Value,Churn Score,CLTV,Churn Reason
0,3668-QPYBK,1,United States,California,Los Angeles,90003,"33.964131, -118.272783",33.964131,-118.272783,Male,...,Month-to-month,Yes,Mailed check,53.85,108.15,Yes,1,86,3239,Competitor made better offer
1,9237-HQITU,1,United States,California,Los Angeles,90005,"34.059281, -118.30742",34.059281,-118.30742,Female,...,Month-to-month,Yes,Electronic check,70.7,151.65,Yes,1,67,2701,Moved
2,9305-CDSKC,1,United States,California,Los Angeles,90006,"34.048013, -118.293953",34.048013,-118.293953,Female,...,Month-to-month,Yes,Electronic check,99.65,820.5,Yes,1,86,5372,Moved
3,7892-POOKP,1,United States,California,Los Angeles,90010,"34.062125, -118.315709",34.062125,-118.315709,Female,...,Month-to-month,Yes,Electronic check,104.8,3046.05,Yes,1,84,5003,Moved
4,0280-XJGEX,1,United States,California,Los Angeles,90015,"34.039224, -118.266293",34.039224,-118.266293,Male,...,Month-to-month,Yes,Bank transfer (automatic),103.7,5036.3,Yes,1,89,5340,Competitor had better devices


In [3]:
# Get a quick overview of our data
df.info()

# Cleaning
# 'Total Charges' is often read as object because it contains blank strings.
# Convert to numeric; values that cannot be parsed become NaN (errors='coerce').
df['Total Charges'] = pd.to_numeric(df['Total Charges'], errors='coerce')

# Drop columns that are "too good" or are variations of the answer.
# 'Churn Label' is the text version of our target 'Churn Value'.
# 'Churn Score' is a pre-calculated score (what we're trying to make!).
# 'CLTV' is often calculated using churn, so it's a leaky predictor.
# 'Churn Reason' is also leaky. It stays in df_cleaned but is never used as a feature.
columns_to_drop = ['Churn Score', 'CLTV', 'Churn Label', 'CustomerID', 'Lat Long', 'Latitude', 'Longitude', 'Country', 'State', 'City', 'Zip Code', 'Count']
df_cleaned = df.drop(columns=columns_to_drop, errors='ignore') # 'errors=ignore' in case a column isn't found
print("Dropped leaky/unusable columns.")


# Check the distribution of our target variable
print("\nChurn Distribution:")
print(df_cleaned['Churn Value'].value_counts(normalize=True))

df_cleaned.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 33 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   CustomerID         7043 non-null   object 
 1   Count              7043 non-null   int64  
 2   Country            7043 non-null   object 
 3   State              7043 non-null   object 
 4   City               7043 non-null   object 
 5   Zip Code           7043 non-null   int64  
 6   Lat Long           7043 non-null   object 
 7   Latitude           7043 non-null   float64
 8   Longitude          7043 non-null   float64
 9   Gender             7043 non-null   object 
 10  Senior Citizen     7043 non-null   object 
 11  Partner            7043 non-null   object 
 12  Dependents         7043 non-null   object 
 13  Tenure Months      7043 non-null   int64  
 14  Phone Service      7043 non-null   object 
 15  Multiple Lines     7043 non-null   object 
 16  Internet Service   7043 

Unnamed: 0,Gender,Senior Citizen,Partner,Dependents,Tenure Months,Phone Service,Multiple Lines,Internet Service,Online Security,Online Backup,...,Tech Support,Streaming TV,Streaming Movies,Contract,Paperless Billing,Payment Method,Monthly Charges,Total Charges,Churn Value,Churn Reason
0,Male,No,No,No,2,Yes,No,DSL,Yes,Yes,...,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,1,Competitor made better offer
1,Female,No,No,Yes,2,Yes,No,Fiber optic,No,No,...,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,1,Moved
2,Female,No,No,Yes,8,Yes,Yes,Fiber optic,No,No,...,No,Yes,Yes,Month-to-month,Yes,Electronic check,99.65,820.5,1,Moved
3,Female,No,Yes,Yes,28,Yes,Yes,Fiber optic,No,No,...,Yes,Yes,Yes,Month-to-month,Yes,Electronic check,104.8,3046.05,1,Moved
4,Male,No,No,Yes,49,Yes,Yes,Fiber optic,No,Yes,...,No,Yes,Yes,Month-to-month,Yes,Bank transfer (automatic),103.7,5036.3,1,Competitor had better devices
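
The `errors='coerce'` conversion in the cleaning step can be illustrated on a tiny series. This is a minimal sketch: the blank entry is a hypothetical stand-in for the empty values mentioned in the comment, not a row taken from the dataset.

```python
import pandas as pd

# Two values from the df.head() output plus one hypothetical blank entry.
raw = pd.Series(["108.15", " ", "820.5"], name="Total Charges")

# Strings that cannot be parsed as numbers become NaN instead of raising.
clean = pd.to_numeric(raw, errors="coerce")

print(clean)
print(f"Missing after conversion: {clean.isna().sum()}")
```

Any NaN values created this way are later filled by the median imputer in the numeric branch of the pipeline.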


In [4]:
# Define Features and Target

# Our target variable (y)
target = 'Churn Value'

# Our features (X)
numeric_features = ['Tenure Months', 'Monthly Charges', 'Total Charges']

categorical_features = [
    'Gender',
    'Senior Citizen',
    'Partner',
    'Dependents',
    'Phone Service',
    'Multiple Lines',
    'Internet Service',
    'Online Security',
    'Online Backup',
    'Device Protection',
    'Tech Support',
    'Streaming TV',
    'Streaming Movies',
    'Contract',
    'Paperless Billing',
    'Payment Method'
]

# Use only the explicitly listed feature columns (this leaves out the target and 'Churn Reason')
all_features = numeric_features + categorical_features
X = df_cleaned[all_features]
y = df_cleaned[target]

print("Features and target defined.")
print(f"Features being used ({len(all_features)}): {all_features}")

Features and target defined.
Features being used (19): ['Tenure Months', 'Monthly Charges', 'Total Charges', 'Gender', 'Senior Citizen', 'Partner', 'Dependents', 'Phone Service', 'Multiple Lines', 'Internet Service', 'Online Security', 'Online Backup', 'Device Protection', 'Tech Support', 'Streaming TV', 'Streaming Movies', 'Contract', 'Paperless Billing', 'Payment Method']


In [5]:
# MLOps setup. Bundle all steps into one object.

# Create a pipeline for numeric features
# Impute missing values (if any) with median and scale them
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Create a pipeline for categorical features
# Impute missing values (if any) and then one-hot encode them.
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine these transformers using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ],
    remainder='passthrough' # Pass through any columns not specified
)

# Create the final, full pipeline
# This pipeline first runs the preprocessor, then fits the model.
# We use XGBoost
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', XGBClassifier(use_label_encoder=False, eval_metric='logloss'))
])

print("Full model pipeline created successfully.")
model_pipeline

Full model pipeline created successfully.


[Pipeline diagram (interactive HTML display), summarized:]
Pipeline
  preprocessor: ColumnTransformer(remainder='passthrough')
    num: SimpleImputer(strategy='median') -> StandardScaler()
    cat: SimpleImputer(strategy='most_frequent') -> OneHotEncoder(handle_unknown='ignore')
  classifier: XGBClassifier(objective='binary:logistic')
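
To make the preprocessing steps concrete, here is a rough pure-Python sketch of what the two branches compute. It is not the scikit-learn implementation; the helper names (`impute_median`, `standardize`, `one_hot`) and the sample values are hypothetical, chosen only to show the arithmetic.

```python
import math
import statistics


def impute_median(values):
    """Replace NaN with the median of the observed values (like strategy='median')."""
    observed = [v for v in values if not math.isnan(v)]
    median = statistics.median(observed)
    return [median if math.isnan(v) else v for v in values]


def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def one_hot(values, categories):
    """One indicator column per known category; unknown values become all zeros."""
    return [[1.0 if v == c else 0.0 for c in categories] for v in values]


# Hypothetical numeric column with one missing value.
imputed = impute_median([10.0, float("nan"), 20.0, 30.0])
scaled = standardize(imputed)

# 'Prepaid' is a hypothetical category that was not seen during fitting.
categories = ["Month-to-month", "One year", "Two year"]
encoded = one_hot(["Two year", "Prepaid"], categories)

print(imputed)
print(scaled)
print(encoded)
```

The all-zero row for the unseen category mirrors `handle_unknown='ignore'`, which lets the saved pipeline score new data without raising an error.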


In [6]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"Training data shape: {X_train.shape}")
print(f"Testing data shape: {X_test.shape}")

# Train the entire pipeline on our training data
print("\nTraining model...")
model_pipeline.fit(X_train, y_train)
print("Model training complete.")

Training data shape: (5634, 19)
Testing data shape: (1409, 19)

Training model...
Model training complete.


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


In [7]:
# Get predictions on the test set
y_pred = model_pipeline.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.4f}\n")

# Show a detailed classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Not Churn (0)', 'Churn (1)']))

# Show the confusion matrix
print("Confusion Matrix:")
cm = confusion_matrix(y_test, y_pred)
print(pd.DataFrame(cm, index=['Actual Not Churn', 'Actual Churn'], columns=['Predicted Not Churn', 'Predicted Churn']))

Model Accuracy: 0.7913

Classification Report:
               precision    recall  f1-score   support

Not Churn (0)       0.84      0.88      0.86      1035
    Churn (1)       0.62      0.55      0.58       374

     accuracy                           0.79      1409
    macro avg       0.73      0.72      0.72      1409
 weighted avg       0.78      0.79      0.79      1409

Confusion Matrix:
                  Predicted Not Churn  Predicted Churn
Actual Not Churn                  908              127
Actual Churn                      167              207


# Model 1: Not going to cut it
* This baseline model gives us an overall accuracy of 79%, which is decent, but the classification report shows that recall for the churn class is only 55%. The model misses 167 of the 374 customers who churn, almost half.
* A False Negative (predicting no churn when the customer does churn) means losing the customer and their lifetime value (CLTV). High and unknown cost.
* A False Positive (predicting churn when the customer won't churn) means giving out an unnecessary discount. Low and known cost.

* We need to optimize the model for higher recall, accepting a trade-off of lower precision.
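
The churn-class metrics can be re-derived by hand from the Model 1 confusion matrix. A minimal sketch, using only the four counts printed in the output above:

```python
# Counts from the Model 1 confusion matrix (rows = actual, columns = predicted).
tn, fp = 908, 127   # actual not churn
fn, tp = 167, 207   # actual churn

recall = tp / (tp + fn)        # share of real churners the model catches
precision = tp / (tp + fp)     # share of churn predictions that are correct
accuracy = (tp + tn) / (tn + fp + fn + tp)
f1 = 2 * precision * recall / (precision + recall)

print(f"Recall:    {recall:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Accuracy:  {accuracy:.4f}")
print(f"F1:        {f1:.2f}")
```

The values agree with the classification report: recall 0.55, precision 0.62, F1 0.58, accuracy 0.7913.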

In [8]:
# Model 2: Tuned for High Recall

# Calculate the imbalance ratio.
# We're telling the model to "care" about the minority class (Churn=1)
# The ratio is: (count of majority class) / (count of minority class)
# Computed on the training labels only, so the test set plays no part in tuning.
y_for_ratio = y_train.value_counts()
imbalance_ratio = y_for_ratio[0] / y_for_ratio[1]
print(f"Imbalance Ratio (0s / 1s): {imbalance_ratio:.2f}")

# Create the new, weighted pipeline
# We use the 'scale_pos_weight' parameter in XGBClassifier to apply this ratio.
model_2_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor), # We can reuse the same preprocessor
    ('classifier', XGBClassifier(
        use_label_encoder=False,
        eval_metric='logloss',
        scale_pos_weight=imbalance_ratio  # key change
    ))
])

# Train the new model
print("\nTraining Model 2 (weighted)...")
model_2_pipeline.fit(X_train, y_train)
print("Model 2 training complete.")

Imbalance Ratio (0s / 1s): 2.77

Training Model 2 (weighted)...
Model 2 training complete.


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
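
The printed ratio can be sanity-checked without the dataset. Because the split is stratified, the test set has nearly the same class balance, so this sketch rebuilds a label list from the test-set supports in the classification report (1035 and 374). The `imbalance_ratio` helper is a hypothetical name.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count (0s / 1s)."""
    counts = Counter(labels)
    return counts[0] / counts[1]


# Label list rebuilt from the test-set supports in the classification report.
y_example = [0] * 1035 + [1] * 374
ratio = imbalance_ratio(y_example)

print(f"Imbalance Ratio (0s / 1s): {ratio:.2f}")
```

Passing this value as `scale_pos_weight` makes each churn example count roughly 2.77 times as much as a non-churn example in the training loss, which pushes the model toward predicting churn more often.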


In [10]:
# Get predictions on the test set from Model 2
y_pred_2 = model_2_pipeline.predict(X_test)

# Calculate accuracy
accuracy_2 = accuracy_score(y_test, y_pred_2)
print(f"Model 2 Accuracy: {accuracy_2:.4f}\n")

# Show a detailed classification report
print("Model 2 Classification Report (Tuned):")
print(classification_report(y_test, y_pred_2, target_names=['Not Churn (0)', 'Churn (1)']))

# Show the confusion matrix
print("Model 2 Confusion Matrix (Tuned):")
cm_2 = confusion_matrix(y_test, y_pred_2)
print(pd.DataFrame(cm_2, index=['Actual Not Churn', 'Actual Churn'], columns=['Predicted Not Churn', 'Predicted Churn']))

Model 2 Accuracy: 0.7644

Model 2 Classification Report (Tuned):
               precision    recall  f1-score   support

Not Churn (0)       0.88      0.79      0.83      1035
    Churn (1)       0.54      0.69      0.61       374

     accuracy                           0.76      1409
    macro avg       0.71      0.74      0.72      1409
 weighted avg       0.79      0.76      0.77      1409

Model 2 Confusion Matrix (Tuned):
                  Predicted Not Churn  Predicted Churn
Actual Not Churn                  819              216
Actual Churn                      116              258


# Model 2 Analysis: The Trade-off

Retrained with `scale_pos_weight`, the model is much better aligned with our business goal of catching churners.

* **Recall for Churn (1):** 69% (up from 55%)
* **Precision for Churn (1):** 54% (down from 62%)

The number of **False Negatives** (missed churners) dropped from 167 to 116. The price is more **False Positives** (127 to 216) and slightly lower overall accuracy (79% to 76%).

* Note: This model most likely would not be approved for deployment as-is. Further feature engineering, more extensive model tuning, and richer data could improve both accuracy and recall. For this project's goal, it will suffice.

We will proceed with **Model 2** as our final model.
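
Whether the trade-off pays off depends on how much a missed churner costs relative to an unnecessary discount. The sketch below compares the two confusion matrices under that framing. The cost values passed to `total_cost` are hypothetical units, not figures from the dataset.

```python
# False negatives and false positives from the two confusion matrices.
model_1 = {"fn": 167, "fp": 127}
model_2 = {"fn": 116, "fp": 216}


def total_cost(errors, cost_fn, cost_fp):
    """Total error cost for given per-error costs (hypothetical units)."""
    return errors["fn"] * cost_fn + errors["fp"] * cost_fp


avoided_fn = model_1["fn"] - model_2["fn"]   # churners no longer missed
extra_fp = model_2["fp"] - model_1["fp"]     # additional unnecessary offers

# Model 2 is cheaper once a false negative costs more than this many false positives.
break_even = extra_fp / avoided_fn

print(f"Avoided false negatives: {avoided_fn}")
print(f"Extra false positives:   {extra_fp}")
print(f"Break-even cost ratio (FN / FP): {break_even:.2f}")
```

Under these counts, Model 2 comes out ahead whenever losing a customer costs more than about 1.75 times the cost of one unneeded discount.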

In [11]:
# This is the final and most important step for our project.
# We save the *entire* 'model_2_pipeline' object (preprocessing + classifier).

model_path = "../ml_model_api/churn_model_v2.pkl"
joblib.dump(model_2_pipeline, model_path)

print(f"\nModel pipeline saved successfully to: {model_path}")


Model pipeline saved successfully to: ../ml_model_api/churn_model_v2.pkl
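
Because the saved object contains both preprocessing and classifier, a consumer only has to load it and pass in raw feature columns. The sketch below shows that round trip. To stay self-contained it trains a small stand-in pipeline (LogisticRegression on hypothetical toy rows with two of the notebook's column names) instead of loading churn_model_v2.pkl, and `risk_scores` is a hypothetical helper name.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def risk_scores(pipeline, customers: pd.DataFrame) -> pd.Series:
    """Return the predicted probability of class 1 (churn) for each row."""
    proba = pipeline.predict_proba(customers)[:, 1]
    return pd.Series(proba, index=customers.index, name="risk_score")


# Hypothetical toy data; the real pipeline expects all 19 feature columns.
toy = pd.DataFrame({
    "Tenure Months": [2, 49, 8, 60, 1, 72],
    "Contract": ["Month-to-month", "Two year", "Month-to-month",
                 "One year", "Month-to-month", "Two year"],
})
labels = [1, 0, 1, 0, 1, 0]

# Stand-in for model_2_pipeline (a different, simpler classifier).
stand_in = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("num", StandardScaler(), ["Tenure Months"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
    ])),
    ("classifier", LogisticRegression()),
]).fit(toy, labels)

# Save and reload, as an API container would do with the real .pkl file.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "stand_in_pipeline.pkl"
    joblib.dump(stand_in, path)
    loaded = joblib.load(path)

new_customers = pd.DataFrame({
    "Tenure Months": [3, 65],
    "Contract": ["Month-to-month", "Two year"],
})
scores = risk_scores(loaded, new_customers)
print(scores)
```

With the real file, the same `joblib.load` call followed by `predict_proba` should work, provided the loading environment has compatible scikit-learn and XGBoost versions installed.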
