## Problem Statement
A telecommunications company provides subscription-based services including mobile phone connectivity, internet access, and digital entertainment services such as streaming television and movies. Revenue is primarily generated through recurring customer subscriptions, with customers billed monthly or annually depending on their contract type and payment arrangement. As with most subscription-based businesses, long-term profitability depends heavily on retaining existing customers rather than continuously acquiring new ones.

In this context, customer churn refers to customers who discontinue their relationship with the company. Within the provided dataset, churn is explicitly captured as a binary target variable (Churn), where a value of “Yes” indicates that a customer has left the service and “No” indicates that the customer remains active. This framing positions churn prediction as a supervised classification problem, where historical customer data is used to learn patterns associated with customer attrition.

Churn is a critical business concern because acquiring new customers is significantly more expensive than retaining existing ones. When a customer churns, the company not only loses future subscription revenue but may also incur additional marketing and promotional costs to replace that customer. Early identification of customers who are at risk of churning allows the business to intervene proactively through targeted retention strategies such as contract adjustments, service improvements, or personalized incentives.

The objective of this analysis is to develop a predictive model that estimates the likelihood of customer churn based on demographic attributes, service usage patterns, contract details, and billing information. By leveraging these features, the model aims to identify high-risk customers early enough for the business to take preventive action, thereby reducing churn rates and protecting long-term revenue.


In [None]:
# import libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score
from sklearn.linear_model import LogisticRegression

In [None]:
#load dataset

df = pd.read_csv("/content/WA_Fn-UseC_-Telco-Customer-Churn 2.csv")
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [None]:
df.columns

Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')

In [None]:
# On close inspection, "TotalCharges" has dtype object and should be converted to float.
# It contains 11 blank-string entries. Checking them against tenure shows a pattern
# (they belong to customers with tenure 0), so rather than dropping those rows
# I impute zero.


df['TotalCharges'] = df['TotalCharges'].replace(' ', 0).astype(float)
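
The check described in the comment above can be reproduced before imputing. The following is a minimal sketch on a toy frame (the values are made up for illustration, not taken from the dataset); it uses `pd.to_numeric` with `errors="coerce"` so that blank strings surface as NaN and can be inspected against `tenure`.

```python
import pandas as pd

# Toy stand-in for the real data (illustrative values only)
toy = pd.DataFrame({
    "tenure": [0, 1, 34, 0],
    "TotalCharges": [" ", "29.85", "1889.5", " "],
})

# Coerce to numeric: blank strings become NaN instead of raising an error
total = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Inspect the rows with missing charges and their tenure
missing = toy.loc[total.isna(), ["tenure", "TotalCharges"]]
print(missing)

# Impute zero for customers who have not been billed yet
toy["TotalCharges"] = total.fillna(0.0)
print(toy.dtypes)
```

If every missing row has `tenure == 0`, zero is a defensible fill value, since those customers have not accumulated any charges.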


In [None]:
# Class distribution in percent (divide by the actual number of rows, 7043)
df["Churn"].value_counts() / len(df) * 100

Churn,count
No,73.463013
Yes,26.536987


The dataset is imbalanced: roughly 73% of customers stayed and 27% churned.
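
Imbalance matters because a model that never predicts churn already looks accurate. The sketch below illustrates this with a majority-class baseline; the label array is synthetic, built from class counts consistent with the output above (5174 "No" and 1869 "Yes" out of 7043 rows), and the feature matrix is a placeholder.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels with the same class counts as the dataset
y_demo = np.array(["No"] * 5174 + ["Yes"] * 1869)
X_demo = np.zeros((len(y_demo), 1))  # features are irrelevant for this baseline

# Always predict the most frequent class ("No")
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
pred = baseline.predict(X_demo)

baseline_acc = accuracy_score(y_demo, pred)
churn_recall = recall_score(y_demo, pred, pos_label="Yes")
print(f"accuracy: {baseline_acc:.3f}, churn recall: {churn_recall:.3f}")
```

The baseline reaches about 73% accuracy while catching no churners at all, which is why recall and ROC-AUC are more informative than accuracy here.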


In [None]:
df.info()

In [None]:
df.drop(["customerID", "gender"], axis = 1, inplace = True)

In [None]:
df.head()

Unnamed: 0,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


## Feature Engineering

In [None]:
# Encode binary columns as 1/0 ("No internet service" is treated as 0)

binary_cols = [ "Partner", "Dependents", "PhoneService", "OnlineSecurity", "OnlineBackup", "DeviceProtection", "TechSupport", "StreamingTV", "StreamingMovies", 'PaperlessBilling']

for col in binary_cols:
  df[col] = df[col].map({"Yes": 1, "No" : 0, 'No internet service': 0 })




In [None]:
df.head()

Unnamed: 0,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,0,1,0,1,0,No phone service,DSL,0,1,0,0,0,0,Month-to-month,1,Electronic check,29.85,29.85,No
1,0,0,0,34,1,No,DSL,1,0,1,0,0,0,One year,0,Mailed check,56.95,1889.5,No
2,0,0,0,2,1,No,DSL,1,1,0,0,0,0,Month-to-month,1,Mailed check,53.85,108.15,Yes
3,0,0,0,45,0,No phone service,DSL,1,0,1,1,0,0,One year,0,Bank transfer (automatic),42.3,1840.75,No
4,0,0,0,2,1,No,Fiber optic,0,0,0,0,0,0,Month-to-month,1,Electronic check,70.7,151.65,Yes


In [None]:
# Encode categorical columns

cat_col = ["MultipleLines", "InternetService", "Contract", "PaymentMethod"]
df = pd.get_dummies(df, columns = cat_col, drop_first = True)
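
To make the effect of `drop_first=True` concrete, here is a small sketch on a toy column (the rows are made up). The first category in sorted order is dropped and becomes the implicit baseline, so a row with all dummies set to False belongs to that category.

```python
import pandas as pd

# Toy column with the same three contract types as the dataset
toy = pd.DataFrame({"Contract": ["Month-to-month", "One year", "Two year", "One year"]})

encoded = pd.get_dummies(toy, columns=["Contract"], drop_first=True)
print(encoded.columns.tolist())

# Row 0 ("Month-to-month") is the dropped baseline: all dummy columns are False
print(encoded.iloc[0].tolist())
```

Dropping one level avoids perfectly collinear dummy columns, which helps keep logistic regression coefficients interpretable.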

In [None]:
df

Unnamed: 0,SeniorCitizen,Partner,Dependents,tenure,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,...,Churn,MultipleLines_No phone service,MultipleLines_Yes,InternetService_Fiber optic,InternetService_No,Contract_One year,Contract_Two year,PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
0,0,1,0,1,0,0,1,0,0,0,...,No,True,False,False,False,False,False,False,True,False
1,0,0,0,34,1,1,0,1,0,0,...,No,False,False,False,False,True,False,False,False,True
2,0,0,0,2,1,1,1,0,0,0,...,Yes,False,False,False,False,False,False,False,False,True
3,0,0,0,45,0,1,0,1,1,0,...,No,True,False,False,False,True,False,False,False,False
4,0,0,0,2,1,0,0,0,0,0,...,Yes,False,False,True,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,0,1,1,24,1,1,0,1,1,1,...,No,False,True,False,False,True,False,False,False,True
7039,0,1,1,72,1,0,1,1,0,1,...,No,False,True,True,False,True,False,True,False,False
7040,0,1,1,11,0,1,0,0,0,0,...,No,True,False,False,False,False,False,False,True,False
7041,1,1,0,4,1,0,0,0,0,0,...,Yes,False,True,True,False,False,False,False,False,True


In [None]:
# Flag customers who have just joined (tenure == 0); new customers are often at higher risk of early churn

df["is_new_customer"] = (df["tenure"] == 0).astype(int)

In [None]:
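# include_lowest=True puts tenure == 0 in the first bin instead of leaving it as NaN
df['tenure_group'] = pd.cut(df['tenure'], bins=[0, 12, 24, 48, 60, 72],
                            labels=['0-12','13-24','25-48','49-60','61-72'],
                            include_lowest=True)
# One-hot encode tenure_group ('0-12' is the dropped baseline)
df = pd.get_dummies(df, columns=['tenure_group'], drop_first=True)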
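
`pd.cut` builds right-closed intervals by default, so the first bin is (0, 12] and a tenure of exactly 0 falls outside every bin unless `include_lowest=True` is passed. A short sketch on made-up tenure values shows the difference:

```python
import pandas as pd

tenure = pd.Series([0, 1, 12, 13, 72])
bins = [0, 12, 24, 48, 60, 72]
labels = ["0-12", "13-24", "25-48", "49-60", "61-72"]

# Default: intervals are (0, 12], (12, 24], ... so tenure 0 is not assigned a bin
default_cut = pd.cut(tenure, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left: [0, 12]
inclusive_cut = pd.cut(tenure, bins=bins, labels=labels, include_lowest=True)

print(default_cut.tolist())
print(inclusive_cut.tolist())
```

Because `'0-12'` is the baseline dropped by `drop_first=True`, an unassigned row and a `'0-12'` row end up with identical dummy values; making the bin explicit mainly keeps the intermediate column free of NaN.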


In [None]:
services = ['PhoneService', 'MultipleLines_Yes', 'InternetService_Fiber optic',
            'InternetService_No', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
            'TechSupport', 'StreamingTV', 'StreamingMovies']

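# Note: this count is approximate. 'InternetService_No' marks the absence of internet,
# and DSL is the dropped dummy baseline, so DSL subscriptions are not counted.
df['num_services'] = df[services].sum(axis=1)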


In [None]:
df.head()

Unnamed: 0,SeniorCitizen,Partner,Dependents,tenure,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,...,Contract_Two year,PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check,is_new_customer,tenure_group_13-24,tenure_group_25-48,tenure_group_49-60,tenure_group_61-72,num_services
0,0,1,0,1,0,0,1,0,0,0,...,False,False,True,False,0,False,False,False,False,1
1,0,0,0,34,1,1,0,1,0,0,...,False,False,False,True,0,False,True,False,False,3
2,0,0,0,2,1,1,1,0,0,0,...,False,False,False,True,0,False,False,False,False,3
3,0,0,0,45,0,1,0,1,1,0,...,False,False,False,False,0,False,True,False,False,3
4,0,0,0,2,1,0,0,0,0,0,...,False,False,True,False,0,False,False,False,False,2


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 30 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   SeniorCitizen                          7043 non-null   int64  
 1   Partner                                7043 non-null   int64  
 2   Dependents                             7043 non-null   int64  
 3   tenure                                 7043 non-null   int64  
 4   PhoneService                           7043 non-null   int64  
 5   OnlineSecurity                         7043 non-null   int64  
 6   OnlineBackup                           7043 non-null   int64  
 7   DeviceProtection                       7043 non-null   int64  
 8   TechSupport                            7043 non-null   int64  
 9   StreamingTV                            7043 non-null   int64  
 10  StreamingMovies                        7043 non-null   int64  
 11  Pape

In [None]:
#Define feature and target

X = df.drop("Churn", axis = 1)
y = df["Churn"]

In [None]:
y

Unnamed: 0,Churn
0,No
1,No
2,Yes
3,No
4,Yes
...,...
7038,No
7039,No
7040,No
7041,Yes


In [None]:
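# Encode the target column in df as 1/0.
# Note: y was extracted in the earlier cell, before this mapping, so it still holds
# the original "Yes"/"No" labels (these are the labels shown in the classification report).

df["Churn"] = df["Churn"].map({"Yes": 1, "No":0})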


In [None]:


X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
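
The `stratify=y` argument keeps the churn rate the same in both splits, which matters with an imbalanced target. A minimal sketch with synthetic labels (same class counts as the dataset, placeholder features) illustrates this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic target with 5174 non-churners (0) and 1869 churners (1)
y_demo = np.array([0] * 5174 + [1] * 1869)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder features

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,
    random_state=42,
    stratify=y_demo,
)

# Churn rate in each split should be nearly identical
print(f"train churn rate: {y_tr.mean():.4f}")
print(f"test churn rate:  {y_te.mean():.4f}")
print("test size:", len(y_te))
```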


In [None]:
from sklearn.preprocessing import StandardScaler

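# Scale only the numeric columns; fit the scaler on the training set to avoid leakage
num_cols = ['tenure', 'MonthlyCharges', 'TotalCharges', 'num_services']
scaler = StandardScaler()
# Work on explicit copies to avoid pandas SettingWithCopyWarning
X_train, X_test = X_train.copy(), X_test.copy()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])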
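
An alternative to scaling columns in place is to bundle the scaler and the model in a scikit-learn `Pipeline`, so the scaler is always fitted on whatever data the model is fitted on. The sketch below is one possible arrangement, not the notebook's method; the data frame and its label are synthetic and only borrow column names from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "tenure": rng.integers(0, 73, size=200),
    "MonthlyCharges": rng.uniform(20, 120, size=200),
    "Partner": rng.integers(0, 2, size=200),
})
target = (demo["tenure"] < 12).astype(int)  # synthetic label: short tenure "churns"

# Scale the continuous columns, pass the remaining columns through unchanged
preprocess = ColumnTransformer(
    [("scale", StandardScaler(), ["tenure", "MonthlyCharges"])],
    remainder="passthrough",
)

pipe = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

pipe.fit(demo, target)
train_acc = pipe.score(demo, target)
print(f"training accuracy on synthetic data: {train_acc:.2f}")
```

With this layout, calling `pipe.predict` on new data applies the already-fitted scaler automatically, which also makes cross-validation safer.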


In [None]:

model = LogisticRegression(class_weight='balanced', solver='liblinear', max_iter=1000)
model.fit(X_train, y_train)


In [None]:


y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print("ROC-AUC:", roc_auc_score(y_test, y_prob))


[[737 298]
 [ 77 297]]
              precision    recall  f1-score   support

          No       0.91      0.71      0.80      1035
         Yes       0.50      0.79      0.61       374

    accuracy                           0.73      1409
   macro avg       0.70      0.75      0.71      1409
weighted avg       0.80      0.73      0.75      1409

ROC-AUC: 0.8426903304141157
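
The figures in the classification report for the "Yes" class can be recomputed by hand from the confusion matrix, which helps when reading the report. The sketch below uses the matrix printed above, with rows as true classes (No, Yes) and columns as predicted classes:

```python
import numpy as np

# Confusion matrix from the output above: rows = actual, columns = predicted
cm = np.array([[737, 298],
               [ 77, 297]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # share of predicted churners who actually churned
recall = tp / (tp + fn)      # share of actual churners the model caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

The rounded values agree with the "Yes" row and the accuracy line of the report.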


## Business Interpretation

The model identifies most churners: recall for the "Yes" class is 79% (297 of 374), which makes it useful for targeting retention campaigns.

Precision for the "Yes" class is 50%, so about half of the customers flagged as churners would not actually have churned. That is acceptable if retention incentives are cheap relative to the revenue lost when a customer leaves.

The 298 false positives are non-churners flagged as churners; contacting them may result in unnecessary marketing spend.
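
Whether 50% precision is acceptable depends on the cost of an offer relative to the value of a retained customer. The sketch below works through that arithmetic using the confusion-matrix counts reported above; the offer cost, customer value, and save rate are hypothetical placeholders, not figures from the dataset.

```python
# Counts from the confusion matrix in the model output
tp, fp, fn = 297, 298, 77

# Hypothetical business inputs (assumptions for illustration only)
offer_cost = 10.0        # cost of sending one retention offer
customer_value = 500.0   # revenue kept when a churner is retained
save_rate = 0.30         # share of contacted churners who decide to stay

contacted = tp + fp                                  # everyone flagged as a churner
campaign_cost = contacted * offer_cost
revenue_saved = tp * save_rate * customer_value      # only true churners can be "saved"
net_benefit = revenue_saved - campaign_cost

# Offer cost at which the campaign exactly breaks even
break_even_offer_cost = revenue_saved / contacted

print(f"campaign cost: {campaign_cost:.2f}")
print(f"revenue saved: {revenue_saved:.2f}")
print(f"net benefit:   {net_benefit:.2f}")
print(f"break-even offer cost: {break_even_offer_cost:.2f}")
```

Under these assumed inputs the campaign pays for itself despite the false positives; with a more expensive offer or a lower save rate the conclusion could reverse.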

## Observations

Recall for churners is high (Yes = 0.79): the model misses only 77 of the 374 churners in the test set, which is the error the business most wants to avoid.

Precision for churners is lower (0.50): about half of the predicted churners are false alarms, so some retention offers will go to customers who would have stayed. Overall accuracy is 0.73 and ROC-AUC is 0.84.
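
One way to trade recall against precision is to move the decision threshold applied to the predicted probabilities instead of using the default 0.5. The following is a minimal sketch on synthetic data (`make_classification` stands in for the Telco features, and the three thresholds are arbitrary choices):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data, roughly 73% negative / 27% positive
X_syn, y_syn = make_classification(
    n_samples=2000, weights=[0.73, 0.27], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
prob = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

# Evaluate precision and recall at several thresholds
results = []
for threshold in (0.3, 0.5, 0.7):
    pred = (prob >= threshold).astype(int)
    results.append((
        threshold,
        precision_score(y_te, pred, zero_division=0),
        recall_score(y_te, pred),
    ))

for threshold, precision, recall in results:
    print(f"threshold={threshold:.1f} precision={precision:.2f} recall={recall:.2f}")
```

Raising the threshold can only shrink the set of customers flagged as churners, so recall never increases; precision typically improves in exchange. The right operating point depends on the cost of an offer versus the cost of a missed churner.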