### Project Overview

Customer churn, where customers discontinue their subscription to a service, poses a significant challenge for subscription-based businesses. Accurately predicting which customers are likely to leave lets a company engage at-risk customers proactively, optimize retention strategies, and ultimately improve profitability. This project aims to develop a predictive model using machine learning techniques such as logistic regression and random forests, combined with thoughtful feature engineering, to identify customers at high risk of churn. The resulting insights will inform actionable business recommendations and a cost-benefit analysis to guide retention efforts.

### Business Understanding
For subscription-based businesses, retaining existing customers is often more cost-effective than acquiring new ones. High churn rates can signal underlying issues with customer satisfaction, product fit, or competitive pressures, directly impacting revenue and growth. By understanding the drivers of churn and identifying at-risk customers, the business can:

- Target retention campaigns more effectively (e.g., personalized offers, improved customer support)
- Allocate resources efficiently to maximize return on investment in retention
- Reduce lost revenue and improve customer lifetime value

The key business questions addressed in this project are:

- Which customers are most likely to leave the service in the near future?
- What are the main factors contributing to customer churn?
- How can the business intervene to reduce churn, and what is the expected financial impact of these interventions?

The project will use historical customer data (demographics, service usage, account information, and previous churn behavior) to build and evaluate predictive models. The final deliverables will include actionable recommendations and a cost-benefit analysis to support data-driven decision-making for customer retention.


### Import Libraries and Dataset

In [39]:
import pandas as pd
df = pd.read_csv("C:/Users/Administrator/Documents/Customer churn prediction project/WA_Fn-UseC_-Telco-Customer-Churn.csv")
print(df.head())

   customerID  gender  SeniorCitizen Partner Dependents  tenure PhoneService  \
0  7590-VHVEG  Female              0     Yes         No       1           No   
1  5575-GNVDE    Male              0      No         No      34          Yes   
2  3668-QPYBK    Male              0      No         No       2          Yes   
3  7795-CFOCW    Male              0      No         No      45           No   
4  9237-HQITU  Female              0      No         No       2          Yes   

      MultipleLines InternetService OnlineSecurity  ... DeviceProtection  \
0  No phone service             DSL             No  ...               No   
1                No             DSL            Yes  ...              Yes   
2                No             DSL            Yes  ...               No   
3  No phone service             DSL            Yes  ...              Yes   
4                No     Fiber optic             No  ...               No   

  TechSupport StreamingTV StreamingMovies        Contract Pape

### Exploratory Data Analysis (EDA)

In [40]:
# Check for missing values
df.isna().sum()

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64
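
`isna()` only counts true nulls. `TotalCharges` is later passed through `pd.to_numeric(..., errors='coerce')`, which suggests it is stored as text and may hold entries that are not numbers, such as blank strings. A minimal sketch of how such hidden gaps could be detected, using a hypothetical toy frame (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame: a numeric-looking column stored as text, with one blank entry
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "108.15"]})

# isna() finds nothing, because " " is a valid (non-null) string
n_null = int(toy["TotalCharges"].isna().sum())

# Counting whitespace-only strings exposes the hidden gap
n_blank = int(toy["TotalCharges"].str.strip().eq("").sum())

print(n_null, n_blank)
```

Running the same blank-string count on each `object` column of the real data would show whether a zero from `isna()` can be trusted.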

In [41]:
# Check data types
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [42]:
df.describe()

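|       | SeniorCitizen | tenure    | MonthlyCharges |
|-------|---------------|-----------|----------------|
| count | 7043.0        | 7043.0    | 7043.0         |
| mean  | 0.162147      | 32.371149 | 64.761692      |
| std   | 0.368612      | 24.559481 | 30.090047      |
| min   | 0.0           | 0.0       | 18.25          |
| 25%   | 0.0           | 9.0       | 35.5           |
| 50%   | 0.0           | 29.0      | 70.35          |
| 75%   | 0.0           | 55.0      | 89.85          |
| max   | 1.0           | 72.0      | 118.75         |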


In [43]:
df.duplicated().sum()

0

In [44]:
df.columns

Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')

### Feature Engineering For Segmentation

In [45]:
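# Tenure grouping
# include_lowest=True keeps tenure == 0 in the first bin; by default pd.cut excludes the lowest edge,
# so customers with tenure 0 would otherwise get NaN
df['tenure_group'] = pd.cut(df['tenure'], bins=[0, 12, 24, 36, 48, 60, 72], labels=['0-12', '12-24', '24-36', '36-48', '48-60', '60-72'], include_lowest=True)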
print(df['tenure_group'])

0        0-12
1       24-36
2        0-12
3       36-48
4        0-12
        ...  
7038    12-24
7039    60-72
7040     0-12
7041     0-12
7042    60-72
Name: tenure_group, Length: 7043, dtype: category
Categories (6, object): ['0-12' < '12-24' < '24-36' < '36-48' < '48-60' < '60-72']
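
`pd.cut` builds right-closed intervals by default, so a value equal to the lowest bin edge falls outside the first bin unless `include_lowest=True` is passed. Since the minimum tenure in `df.describe()` is 0, this matters here. A small sketch on hypothetical tenure values:

```python
import pandas as pd

tenure = pd.Series([0, 1, 12, 13, 72])  # hypothetical values, not taken from the dataset
bins = [0, 12, 24, 36, 48, 60, 72]
labels = ['0-12', '12-24', '24-36', '36-48', '48-60', '60-72']

# Default behaviour: intervals are (0, 12], (12, 24], ... so 0 is not binned
default_cut = pd.cut(tenure, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left as well
inclusive_cut = pd.cut(tenure, bins=bins, labels=labels, include_lowest=True)

n_unbinned_default = int(default_cut.isna().sum())
n_unbinned_inclusive = int(inclusive_cut.isna().sum())
print(n_unbinned_default, n_unbinned_inclusive)
```

Note that a tenure of exactly 12 lands in `'0-12'`, not `'12-24'`, because each bin includes its right edge.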


In [46]:
# Service count
# Count how many services a customer subscribes to.
service_cols = ['PhoneService', 'MultipleLines', 'OnlineSecurity', 'OnlineBackup', 
                'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']
# Vectorized comparison: count the 'Yes' entries in each row
df['num_services'] = (df[service_cols] == 'Yes').sum(axis=1)
print(df['num_services'])

0       1
1       3
2       3
3       3
4       1
       ..
7038    7
7039    6
7040    1
7041    2
7042    6
Name: num_services, Length: 7043, dtype: int64


In [47]:
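# Calculate average monthly spend per customer
# TotalCharges is stored as text; entries that cannot be parsed as numbers become NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
# Replace tenure 0 with 1 to avoid division by zero
df['avg_monthly_spend'] = df['TotalCharges'] / df['tenure'].replace(0, 1)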

print(df['avg_monthly_spend'])

0        29.850000
1        55.573529
2        54.075000
3        40.905556
4        75.825000
           ...    
7038     82.937500
7039    102.262500
7040     31.495455
7041     76.650000
7042    103.704545
Name: avg_monthly_spend, Length: 7043, dtype: float64


In [48]:
# Contract and Payment Method 
# Combine Contract and PaymentMethod into a new feature
df['Contract_PaymentMethod'] = df['Contract'] + '_' + df['PaymentMethod']

# Display the first few values to check
print(df['Contract_PaymentMethod'].head())

0       Month-to-month_Electronic check
1                 One year_Mailed check
2           Month-to-month_Mailed check
3    One year_Bank transfer (automatic)
4       Month-to-month_Electronic check
Name: Contract_PaymentMethod, dtype: object


Customers on month-to-month contracts who pay by electronic check may be more likely to churn than those on two-year contracts with automatic payments.
This combined feature allows the model to learn such patterns, which may not be obvious when using the features separately.
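
One way to check whether a combined category actually separates churners is to compare churn rates per group. The sketch below uses a hypothetical four-row toy frame (not the Telco data), so the rates it prints only illustrate the mechanics:

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "Two year", "Two year"],
    "PaymentMethod": ["Electronic check", "Electronic check",
                      "Bank transfer (automatic)", "Bank transfer (automatic)"],
    "Churn": ["Yes", "No", "No", "No"],
})
toy["Contract_PaymentMethod"] = toy["Contract"] + "_" + toy["PaymentMethod"]

# Share of churners within each combined category, highest first
churn_rate = (
    toy["Churn"].eq("Yes")
    .groupby(toy["Contract_PaymentMethod"])
    .mean()
    .sort_values(ascending=False)
)
print(churn_rate)
```

Applied to the full dataset, a large spread between groups would support keeping the interaction feature.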

### 1. Binary Encoding of Service Features

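Convert the Yes/No service columns to binary (1/0), treating "No internet service" and "No phone service" as 0. InternetService is one-hot encoded instead, because its values are service types, not Yes/No flags.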

In [49]:
# List of service columns
service_cols = [
    'PhoneService', 'MultipleLines', 'InternetService',
    'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
    'StreamingTV', 'StreamingMovies'
]

# Map Yes/No/No internet service/No phone service to 1/0
binary_map = {'Yes': 1, 'No': 0, 'No internet service': 0, 'No phone service': 0}

for col in service_cols:
    if col == 'InternetService':
        # For InternetService, create dummies for each type
        dummies = pd.get_dummies(df[col], prefix=col)
        df = pd.concat([df, dummies], axis=1)
    else:
        df[col + '_bin'] = df[col].map(binary_map)


print(service_cols)

['PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']
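
`Series.map` returns NaN for any value missing from the mapping dictionary, so an unexpected label (a typo or a new category) would silently become a missing value. A small check, sketched on a hypothetical column with a deliberately misspelled entry:

```python
import pandas as pd

binary_map = {'Yes': 1, 'No': 0, 'No internet service': 0, 'No phone service': 0}

# Hypothetical column; the lowercase 'yes' is a deliberate typo
col = pd.Series(['Yes', 'No', 'No internet service', 'yes'])
encoded = col.map(binary_map)

# Values that the map did not recognise
unmapped = col[encoded.isna()].unique().tolist()
print(unmapped)
```

An empty list for every `_bin` column would confirm that the mapping covered all labels.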


### 2. Total Number Of Services Subscribed

Count how many services a customer subscribes to (excluding InternetService type).

In [50]:
# Count number of services (excluding InternetService type)
service_bin_cols = [
    'PhoneService_bin', 'MultipleLines_bin', 'OnlineSecurity_bin',
    'OnlineBackup_bin', 'DeviceProtection_bin', 'TechSupport_bin',
    'StreamingTV_bin', 'StreamingMovies_bin'
]
df['num_services'] = df[service_bin_cols].sum(axis=1)

print(df['num_services'])

0       1
1       3
2       3
3       3
4       1
       ..
7038    7
7039    6
7040    1
7041    2
7042    6
Name: num_services, Length: 7043, dtype: int64


### 3. Bundled Services

Create features indicating whether a customer has all streaming services or all security/backup services.

In [51]:
# Has all streaming services
df['all_streaming'] = ((df['StreamingTV_bin'] == 1) & (df['StreamingMovies_bin'] == 1)).astype(int)

# Has all security/backup services
df['all_security'] = (
    (df['OnlineSecurity_bin'] == 1) &
    (df['OnlineBackup_bin'] == 1) &
    (df['DeviceProtection_bin'] == 1) &
    (df['TechSupport_bin'] == 1)
).astype(int)
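
The chained `&` comparisons can also be written with a row-wise `all`, which may be easier to extend if more columns join a bundle. A minimal sketch on hypothetical toy values:

```python
import pandas as pd

# Hypothetical toy rows: only the first customer has all four services
toy = pd.DataFrame({
    "OnlineSecurity_bin": [1, 1, 0],
    "OnlineBackup_bin": [1, 0, 0],
    "DeviceProtection_bin": [1, 1, 0],
    "TechSupport_bin": [1, 1, 0],
})
security_cols = list(toy.columns)

# 1 only when every column in the bundle equals 1 for that row
toy["all_security"] = toy[security_cols].eq(1).all(axis=1).astype(int)
print(toy["all_security"].tolist())
```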




### 4. Check the Engineered Features

In [52]:
print(df[['num_services', 'all_streaming', 'all_security', 
          'InternetService_DSL', 'InternetService_Fiber optic', 'InternetService_No']].head())

   num_services  all_streaming  all_security  InternetService_DSL  \
0             1              0             0                 True   
1             3              0             0                 True   
2             3              0             0                 True   
3             3              0             0                 True   
4             1              0             0                False   

   InternetService_Fiber optic  InternetService_No  
0                        False               False  
1                        False               False  
2                        False               False  
3                        False               False  
4                         True               False  
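
The three `InternetService_*` indicator columns sum to 1 in every row, so any one of them is implied by the other two. For a linear model with an intercept this makes one column redundant. The sketch below, on a hypothetical toy series, shows the redundancy and one possible remedy (`drop_first=True`); whether to drop a level is a modeling choice, not something the notebook prescribes:

```python
import pandas as pd

# Hypothetical toy values using the same three categories
internet = pd.Series(['DSL', 'Fiber optic', 'No', 'DSL'], name='InternetService')

full = pd.get_dummies(internet, prefix='InternetService')
reduced = pd.get_dummies(internet, prefix='InternetService', drop_first=True)

# Every row has exactly one active indicator
row_sums = full.sum(axis=1)
print(row_sums.tolist())
print(list(reduced.columns))
```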


## Summary of Engineered Features

1. Binary columns for each service (e.g., PhoneService_bin)
2. One-hot encoding for InternetService type
3. num_services: total number of services subscribed
4. all_streaming: has both StreamingTV and StreamingMovies
5. all_security: has all four security/backup services


## Recommended Features for Modeling
Based on the feature engineering above and common practice for churn prediction, the following features are recommended:

## Demographics & Account Info

SeniorCitizen,
Partner (encoded as 1/0),
Dependents (encoded as 1/0),
tenure or tenure_group (if using bins, one-hot encode),
Contract (one-hot encode),
PaperlessBilling (encoded as 1/0),
PaymentMethod (one-hot encode).

## Service Usage

PhoneService_bin,
MultipleLines_bin,
InternetService_DSL,
InternetService_Fiber optic,
InternetService_No,
OnlineSecurity_bin,
OnlineBackup_bin,
DeviceProtection_bin,
TechSupport_bin,
StreamingTV_bin,
StreamingMovies_bin,
num_services,
all_streaming,
all_security.

## Financial

MonthlyCharges,
TotalCharges (make sure it’s numeric),
avg_monthly_spend.

## Interaction Features

Contract_PaymentMethod

### Feature Selection Code

In [53]:
# Encode categorical features
df['Partner_bin'] = df['Partner'].map({'Yes': 1, 'No': 0})
df['Dependents_bin'] = df['Dependents'].map({'Yes': 1, 'No': 0})
df['PaperlessBilling_bin'] = df['PaperlessBilling'].map({'Yes': 1, 'No': 0})

# One-hot encode Contract, PaymentMethod, tenure_group (if used)
df = pd.get_dummies(df, columns=['Contract', 'PaymentMethod', 'tenure_group', 'Contract_PaymentMethod'], drop_first=True)

# List of best features
feature_cols = [
    'SeniorCitizen', 'Partner_bin', 'Dependents_bin', 'tenure',
    'PaperlessBilling_bin', 'MonthlyCharges', 'TotalCharges', 'avg_monthly_spend',
    'PhoneService_bin', 'MultipleLines_bin', 'InternetService_DSL', 'InternetService_Fiber optic', 'InternetService_No',
    'OnlineSecurity_bin', 'OnlineBackup_bin', 'DeviceProtection_bin', 'TechSupport_bin',
    'StreamingTV_bin', 'StreamingMovies_bin', 'num_services', 'all_streaming', 'all_security'
]

# Add one-hot columns for Contract, PaymentMethod, tenure_group, and Contract_PaymentMethod
# (the 'Contract_' prefix also matches the 'Contract_PaymentMethod_' columns)
feature_cols += [col for col in df.columns if col.startswith(('Contract_', 'PaymentMethod_', 'tenure_group_'))]

# Target variable
y = df['Churn'].map({'Yes': 1, 'No': 0})

# Final feature matrix
X = df[feature_cols]


print(X.head())
print(y.head())

   SeniorCitizen  Partner_bin  Dependents_bin  tenure  PaperlessBilling_bin  \
0              0            1               0       1                     1   
1              0            0               0      34                     0   
2              0            0               0       2                     1   
3              0            0               0      45                     0   
4              0            0               0       2                     1   

   MonthlyCharges  TotalCharges  avg_monthly_spend  PhoneService_bin  \
0           29.85         29.85          29.850000                 0   
1           56.95       1889.50          55.573529                 1   
2           53.85        108.15          54.075000                 1   
3           42.30       1840.75          40.905556                 0   
4           70.70        151.65          75.825000                 1   

   MultipleLines_bin  ...  \
0                  0  ...   
1                  0  ...   
2    

## Train-Test Split

In [54]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
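
Passing `stratify=y` keeps the share of churners roughly equal in the training and test sets, which matters when the classes are imbalanced. A small sketch with hypothetical labels (30% positives, chosen for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 30 positives, 70 negatives
y_toy = np.array([1] * 30 + [0] * 70)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# Positive-class share in each split
train_rate, test_rate = y_tr.mean(), y_te.mean()
print(train_rate, test_rate)
```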

## Model Building: Logistic Regression

In [58]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, classification_report, roc_curve
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer

# Impute missing values. TotalCharges (and therefore avg_monthly_spend) is NaN
# wherever the original TotalCharges entry could not be parsed as a number.
# Fit the imputer on the training split only, so test-set statistics do not leak into training.
imputer = SimpleImputer(strategy='mean')
X_train_imp = pd.DataFrame(imputer.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
X_test_imp = pd.DataFrame(imputer.transform(X_test), columns=X_test.columns, index=X_test.index)

# Fit logistic regression model on the imputed data.
# Fitting on the un-imputed X_train raises "ValueError: Input X contains NaN."
# because LogisticRegression does not accept missing values natively.
logreg = LogisticRegression(max_iter=1000, solver='liblinear')  # 'liblinear' is good for small/medium datasets
logreg.fit(X_train_imp, y_train)

# Predict on test set
y_pred = logreg.predict(X_test_imp)
y_proba = logreg.predict_proba(X_test_imp)[:, 1]
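
The scikit-learn message suggests an imputer inside a pipeline, which keeps imputation and model fitting in one object and fits the imputer on training data only. Below is a minimal sketch of that idea on synthetic stand-in data; the data-generating rule, the column subset, and the names `X_syn` and `y_syn` are assumptions made for illustration, not the notebook's actual features or results:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (NOT the Telco dataset)
rng = np.random.default_rng(42)
n = 400
tenure = rng.integers(1, 73, size=n).astype(float)
monthly = rng.uniform(20, 120, size=n)
# Hypothetical rule: short tenure and high charges raise churn probability
logit = -0.05 * tenure + 0.02 * monthly
y_syn = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)
X_syn = pd.DataFrame({"tenure": tenure, "MonthlyCharges": monthly})
# Knock out a few values to mimic missing entries
X_syn.loc[rng.choice(n, size=10, replace=False), "MonthlyCharges"] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42, stratify=y_syn
)

# The imputer is fitted on the training split inside fit(), then reused at predict time
model = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("logreg", LogisticRegression(max_iter=1000, solver="liblinear")),
])
model.fit(X_tr, y_tr)

y_pred_syn = model.predict(X_te)
y_proba_syn = model.predict_proba(X_te)[:, 1]

acc = accuracy_score(y_te, y_pred_syn)
auc = roc_auc_score(y_te, y_proba_syn)
cm = confusion_matrix(y_te, y_pred_syn)
print("accuracy:", acc)
print("ROC AUC :", auc)
print(cm)
```

The same pipeline could be fitted on the notebook's `X_train` and `y_train`; the metrics printed here describe only the synthetic data.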