# Predicting Customer Churn & Revenue Risk in Telecom

Customer churn is a major driver of revenue loss in the telecom industry.
This project aims to build a predictive model that identifies customers
at high risk of churning so the business can take proactive retention actions.

We use the Telco Customer Churn dataset and compare multiple machine learning
models with a focus on interpretability and business value.


## Business Problem

From a business perspective, retaining an existing customer is often cheaper
than acquiring a new one. The key challenge is identifying *which customers*
are most likely to churn before they leave.

### Business Question:
Can we predict customer churn using historical customer data in order to
identify high-risk segments and reduce revenue loss?


## Data Loading and Cleaning

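We begin by loading the Telco Customer Churn dataset. Cleaning consists of
converting `TotalCharges` to a numeric type (entries that cannot be parsed
become missing values), dropping rows with missing values, and removing the
`customerID` identifier column, which carries no predictive signal.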


In [3]:
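import pandas as pd

# Load the raw dataset (local path; adjust to your environment)
df = pd.read_csv(r"C:\Users\thobi\Downloads\archive (1)\WA_Fn-UseC_-Telco-Customer-Churn.csv")
# Coerce non-numeric TotalCharges entries to NaN, then drop incomplete rows
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df = df.dropna()
df = df.drop(columns=['customerID'])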


## Feature Engineering & Encoding

Machine learning models require numerical input. Since this dataset contains
multiple categorical variables, we apply one-hot encoding.

The target variable `Churn` is converted into a binary outcome where:
- 1 = Customer churned
- 0 = Customer did not churn
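
As a minimal sketch of the encoding step described above, the snippet below applies `pd.get_dummies` with `drop_first=True` to a small toy frame. The rows are hypothetical and only the column names mirror the Telco dataset; the variable names `toy` and `encoded` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy frame with hypothetical values; column names mirror the Telco dataset.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "Month-to-month"],
    "MonthlyCharges": [70.35, 56.95, 42.30, 99.65],
    "Churn": ["Yes", "No", "No", "Yes"],
})

# drop_first=True drops the first level (in sorted order) of each categorical
# column, so "Month-to-month" and "No" become the implicit baseline levels.
encoded = pd.get_dummies(toy, drop_first=True)

# The surviving Churn_Yes column is the binary target:
# 1 = customer churned, 0 = customer did not churn.
y = encoded["Churn_Yes"].astype(int)
X = encoded.drop(columns="Churn_Yes")

print(list(X.columns))
print(y.tolist())
```

Dropping the first level avoids perfectly collinear dummy columns, which matters for the logistic regression coefficients interpreted later.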


## Model 1: Logistic Regression

Logistic regression is used as a baseline model due to its:
- Interpretability
- Ability to quantify the impact of individual features
- Common use in churn and risk modeling
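
One common way to quantify the impact of an individual feature is to exponentiate its coefficient, which gives an odds ratio. The sketch below uses made-up coefficients purely for illustration; they are not the fitted values from this notebook, and `example_coefs` is a hypothetical name.

```python
import math

# Hypothetical coefficients for illustration only;
# these are NOT the fitted values from this notebook.
example_coefs = {
    "Contract_Two year": -1.0,
    "tenure": -0.05,
    "PaperlessBilling_Yes": 0.3,
}

# exp(coefficient) is the multiplicative change in the odds of churn for a
# one-unit increase in the feature, holding the other features fixed.
odds_ratios = {name: math.exp(beta) for name, beta in example_coefs.items()}

# Print from the strongest churn-reducing feature to the strongest churn-increasing one.
for name, ratio in sorted(odds_ratios.items(), key=lambda kv: kv[1]):
    print(f"{name}: odds ratio {ratio:.2f}")
```

An odds ratio below 1 lowers the odds of churn and a ratio above 1 raises them. Because the features in this notebook are not standardized, coefficient magnitudes are not directly comparable between features measured on different scales.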


In [7]:
# Logistic regression: one-hot encode, then separate features and target
df = pd.get_dummies(df, drop_first=True)
X = df.drop('Churn_Yes', axis=1)
y = df['Churn_Yes']


In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)


In [13]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=10000)
lr.fit(X_train, y_train)


In [15]:
from sklearn.metrics import classification_report, roc_auc_score

y_pred = lr.predict(X_test)
y_prob = lr.predict_proba(X_test)[:,1]

print(classification_report(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, y_prob))


              precision    recall  f1-score   support

       False       0.85      0.89      0.87      1291
        True       0.66      0.57      0.61       467

    accuracy                           0.81      1758
   macro avg       0.75      0.73      0.74      1758
weighted avg       0.80      0.81      0.80      1758

ROC AUC: 0.8403914764876919


**Interpretation:**

- The model achieves a ROC-AUC of approximately 0.84, indicating strong
  ability to rank customers by churn risk.
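- Recall for churned customers is 0.57 at the default 0.5 decision threshold,
  so roughly 43% of actual churners are missed. Capturing more at-risk
  customers would mean lowering the threshold, at the cost of precision.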
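
If capturing more churners is the goal, one common lever is the probability cutoff used to turn scores into labels. The sketch below shows the precision/recall trade-off on hypothetical labels and scores; `precision_recall_at`, `toy_true`, and `toy_prob` are illustrative names, and the numbers are not outputs of the model above.

```python
def precision_recall_at(threshold, y_true, y_prob):
    """Return (precision, recall) for the positive class at a given cutoff."""
    predicted = [p >= threshold for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, predicted) if t and p)
    fp = sum(1 for t, p in zip(y_true, predicted) if not t and p)
    fn = sum(1 for t, p in zip(y_true, predicted) if t and not p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels (1 = churned) and predicted churn probabilities.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_prob = [0.9, 0.7, 0.45, 0.35, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]

# Lowering the cutoff flags more customers: recall rises, precision falls.
for threshold in (0.5, 0.3):
    precision, recall = precision_recall_at(threshold, toy_true, toy_prob)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

With the notebook's own objects, the same idea would amount to comparing `y_prob` against a cutoff other than 0.5 instead of calling `lr.predict`. The appropriate cutoff depends on the relative cost of a missed churner versus an unnecessary retention offer.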


In [17]:
# Coefficients sorted from most negative (lower churn odds) to most positive (higher churn odds)
coeffs = pd.Series(lr.coef_[0], index=X.columns).sort_values()


## Model 2: Decision Tree

A decision tree is trained to provide interpretable, rule-based insights
into customer churn behavior. While decision trees may not always outperform
logistic regression, they are valuable for identifying high-risk segments.
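
To illustrate what "rule-based insights" can look like in practice, the sketch below fits a tiny tree on hypothetical toy data and prints its splits with scikit-learn's `export_text`. The data, the feature names, and the variable names `X_toy`, `y_toy`, and `rules` are assumptions made for this example only.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: [tenure in months, month-to-month contract flag].
X_toy = [[1, 1], [2, 1], [3, 1], [5, 1], [24, 0], [36, 0], [48, 0], [60, 0]]
y_toy = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = churned, 0 = did not churn

tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(X_toy, y_toy)

# export_text renders the fitted tree as nested if/else style rules.
rules = export_text(tree, feature_names=["tenure", "Contract_Month-to-month"])
print(rules)
```

Applied to the fitted `dt` model with `feature_names=list(X.columns)`, the same call would list the splits that define each leaf, which can then be read as candidate customer segments.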


In [19]:
# Decision tree: shallow, with large leaves, to keep the rules readable
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(max_depth=4, min_samples_leaf=50, random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
y_prob_dt = dt.predict_proba(X_test)[:,1]

print(classification_report(y_test, y_pred_dt))
print("ROC AUC:", roc_auc_score(y_test, y_prob_dt))


              precision    recall  f1-score   support

       False       0.83      0.88      0.86      1291
        True       0.61      0.52      0.56       467

    accuracy                           0.78      1758
   macro avg       0.72      0.70      0.71      1758
weighted avg       0.77      0.78      0.78      1758

ROC AUC: 0.8216843009668318


## Model Comparison

Logistic regression outperforms the decision tree on ROC-AUC (0.84 vs. 0.82)
and on recall for churned customers (0.57 vs. 0.52). As a result, logistic
regression is selected as the primary model for churn risk prediction, while
the decision tree is used to support business interpretation.


## Key Business Insights

- Customers on month-to-month contracts are significantly more likely to churn
- Short-tenure customers represent a high-risk segment
- Higher monthly charges are associated with increased churn probability
- Long-term contracts and support services reduce churn risk
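
Segment-level statements like these can be checked directly by computing the churn rate per group. The sketch below does so on a hypothetical toy frame; the rows are invented for illustration and only the column names follow the raw Telco dataset.

```python
import pandas as pd

# Hypothetical rows for illustration; column names follow the raw Telco dataset.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "Month-to-month",
                 "One year", "Two year", "Two year"],
    "Churn": ["Yes", "Yes", "No", "No", "No", "No"],
})

# Share of churned customers within each contract type, highest first.
churn_rate = (
    toy.assign(churned=toy["Churn"].eq("Yes"))
       .groupby("Contract")["churned"]
       .mean()
       .sort_values(ascending=False)
)
print(churn_rate)
```

On the real data this would need to run on the frame before one-hot encoding, since `pd.get_dummies` replaces the raw `Contract` and `Churn` columns with indicator columns.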
