# TELCO CUSTOMER CHURN PREDICTION ANALYSIS

# 📘 Business Understanding

## 🎯 Objective
The goal of this project is to **predict whether a customer will churn (i.e., leave the telecom company)** using historical customer data. This enables the company to identify at-risk customers early and take action to retain them.

---

## 💼 Business Context
Customer churn is a major concern in subscription-based industries like telecommunications. It is **more cost-effective to retain existing customers than to acquire new ones**.

By analyzing customer data and predicting churn, telecom companies can:
- Understand why customers leave.
- Identify customers at risk of churning.
- Implement strategies to improve retention.

---

## ❓ Key Business Questions
- Which customers are likely to churn?
- What are the key factors driving customer churn?
- Can we predict churn early enough to take preventive action?
- What customer segments are more loyal, and why?

---

## 📊 Success Criteria
- Build a model with strong **predictive performance**, especially high **recall** (so we don’t miss likely churners).
- Provide **interpretable** results to guide business decisions.
- Generate **actionable insights** to inform retention strategies.
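
Since recall is singled out above, a small sketch of how it is computed may help. The labels below are made-up toy values (not taken from the Telco data), with 1 standing for a churner; `y_true` and `y_pred` are illustrative names only.

```python
# Toy labels, invented for illustration: 1 = churned, 0 = stayed
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # churners correctly flagged
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # churners the model missed
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # loyal customers flagged by mistake

# Recall: share of actual churners that the model catches
recall = tp / (tp + fn)
# Precision: share of flagged customers who really churn
precision = tp / (tp + fp)

print(f"recall={recall:.2f}, precision={precision:.2f}")
```

A missed churner (false negative) lowers recall, which is why recall is the metric to watch when the cost of losing a customer outweighs the cost of an unnecessary retention offer. `recall_score` from `sklearn.metrics` should give the same value on these labels.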

---

## 🏷️ Target Variable
- `Churn`: A binary variable indicating whether a customer has left the company (Yes/No).
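
Most classifiers expect a numeric target, so the Yes/No labels usually need to be mapped to 1/0 before modelling. The snippet below is a minimal sketch on a toy column (values invented for illustration); `churn_flag` is a hypothetical name, not one used elsewhere in this notebook.

```python
import pandas as pd

# Toy column, invented for illustration
churn = pd.Series(["No", "No", "Yes", "No", "Yes"], name="Churn")

# Map the text labels to integers: 1 = churned, 0 = stayed
churn_flag = churn.map({"No": 0, "Yes": 1})

# The mean of a 0/1 flag is the churn rate of the sample
churn_rate = churn_flag.mean()
print(churn_flag.tolist(), churn_rate)
```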

---

## 📥 Input Features
- **Demographics**: `gender`, `SeniorCitizen`, `Partner`, `Dependents`.
- **Account Information**: `tenure`, `Contract`, `PaperlessBilling`, `PaymentMethod`.
- **Service Usage**: `InternetService`, `StreamingTV`, `OnlineSecurity`, etc.
- **Billing Info**: `MonthlyCharges`, `TotalCharges`.

---

## 🧠 Business Value
A successful churn prediction model allows the company to:
- **Reduce churn** by targeting interventions toward high-risk customers.
- **Improve customer retention** and satisfaction.
- **Increase profitability** by maximizing customer lifetime value (CLV).


In [1]:
# Imports

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from category_encoders import OneHotEncoder
from ipywidgets import Dropdown, FloatSlider, IntSlider, interact
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.metrics import mean_absolute_error, roc_curve, auc
from sklearn.pipeline import make_pipeline
# Note: this OneHotEncoder import shadows the category_encoders one imported above
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score, ConfusionMatrixDisplay, confusion_matrix
import joblib

import warnings
warnings.filterwarnings('ignore')


%matplotlib inline

## DATA UNDERSTANDING

In [2]:
# Loading My Dataset and Creating a DataFrame

df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
df.head()

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


In [3]:
df.shape

(7043, 21)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [5]:
# Checking for Null Values

df.isna().sum()

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

#### - The column `TotalCharges` has dtype `object` but should be numeric (`float64`), as it holds charge amounts
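
A numeric column stored as `object` usually means some entries are not parseable as numbers, and `isna()` will not flag them because they are still valid strings. The sketch below shows one way to locate such entries on a toy frame; the data is invented, and the blank string is only an assumption about what a non-numeric entry might look like.

```python
import pandas as pd

# Toy frame, invented for illustration; " " stands in for a non-numeric entry
toy = pd.DataFrame({
    "tenure": [1, 34, 0],
    "TotalCharges": ["29.85", "1889.5", " "],
})

# isna() reports nothing missing, because " " is a perfectly valid string
n_missing_before = int(toy["TotalCharges"].isna().sum())

# Coercing to numeric turns unparseable strings into NaN
converted = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Rows whose value could not be parsed
bad_rows = toy[converted.isna()]

print(n_missing_before)
print(bad_rows)
```

Inspecting `bad_rows` before dropping anything makes it easier to decide whether the affected rows can be removed safely or should be imputed instead.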

In [6]:
# Define wrangle function

def wrangle(df):
    # Work on a copy so the caller's DataFrame is not modified in place
    df = df.copy()

    # Convert "TotalCharges" from string to float; unparseable entries become NaN
    df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

    # Drop rows with missing "TotalCharges": "TotalCharges" is roughly
    # "tenure" * "MonthlyCharges", and all NaN rows have a tenure of 0
    df = df.dropna(subset=["TotalCharges"])

    return df
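
The comment in `wrangle` relies on `TotalCharges` being the product of `tenure` and `MonthlyCharges`. As a quick sanity check, the sketch below recomputes that product for the five rows shown in the `df.head()` output; `sample`, `estimate` and `ratio` are illustrative names, and five rows are of course not proof for the whole dataset.

```python
import pandas as pd

# The five rows displayed by df.head() earlier in this notebook
sample = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 2],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 70.70],
    "TotalCharges": [29.85, 1889.50, 108.15, 1840.75, 151.65],
})

# Estimate total charges from tenure and the current monthly charge
sample["estimate"] = sample["tenure"] * sample["MonthlyCharges"]

# A ratio close to 1 means the estimate is close to the recorded total
sample["ratio"] = sample["TotalCharges"] / sample["estimate"]

print(sample.round(3))
```

On these rows the ratio stays within roughly 10% of 1, so the relationship looks approximate rather than exact, presumably because monthly charges change over a customer's lifetime. That is still consistent with a customer at tenure 0 having no accumulated charges.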

In [7]:
# Wrangle Data

df = wrangle(df)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7032 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7032 non-null   object 
 1   gender            7032 non-null   object 
 2   SeniorCitizen     7032 non-null   int64  
 3   Partner           7032 non-null   object 
 4   Dependents        7032 non-null   object 
 5   tenure            7032 non-null   int64  
 6   PhoneService      7032 non-null   object 
 7   MultipleLines     7032 non-null   object 
 8   InternetService   7032 non-null   object 
 9   OnlineSecurity    7032 non-null   object 
 10  OnlineBackup      7032 non-null   object 
 11  DeviceProtection  7032 non-null   object 
 12  TechSupport       7032 non-null   object 
 13  StreamingTV       7032 non-null   object 
 14  StreamingMovies   7032 non-null   object 
 15  Contract          7032 non-null   object 
 16  PaperlessBilling  7032 non-null   object 
 17  

In [8]:
# Summary statistics for numeric columns

numeric_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']
df[numeric_cols].describe()

| | tenure | MonthlyCharges | TotalCharges |
|---|---|---|---|
| count | 7032.0 | 7032.0 | 7032.0 |
| mean | 32.421786 | 64.798208 | 2283.300441 |
| std | 24.54526 | 30.085974 | 2266.771362 |
| min | 1.0 | 18.25 | 18.8 |
| 25% | 9.0 | 35.5875 | 401.45 |
| 50% | 29.0 | 70.35 | 1397.475 |
| 75% | 55.0 | 89.8625 | 3794.7375 |
| max | 72.0 | 118.75 | 8684.8 |


In [16]:
# Count unique values in each string (object) column, highest first
counts = {col: [df[col].nunique()] for col in df.select_dtypes(include='object')}
string_cols = pd.DataFrame(counts).T.rename(columns={0: 'Unique Count'}).sort_values('Unique Count', ascending=False)
string_cols

| | Unique Count |
|---|---|
| customerID | 7032 |
| PaymentMethod | 4 |
| DeviceProtection | 3 |
| Contract | 3 |
| StreamingMovies | 3 |
| StreamingTV | 3 |
| TechSupport | 3 |
| OnlineBackup | 3 |
| OnlineSecurity | 3 |
| InternetService | 3 |


## EXPLORATORY DATA ANALYSIS (EDA)