<a href="https://colab.research.google.com/github/RizaRafeek/Telecom-Churn-Prediction/blob/main/03_Churn_XGBoost.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 3: High-Performance Churn Prediction (XGBoost)
**Domain**: Telecom Industry

**Objective**: To leverage Gradient Boosting to maximize the detection of at-risk customers (Recall).

**Context**: This notebook serves as the "Industry Standard" benchmark against our earlier Random Forest and Neural Network models.

In [1]:
import os

# 1. Set up Kaggle Credentials
os.environ['KAGGLE_USERNAME'] = "your_username"  # Replace with your Kaggle username
os.environ['KAGGLE_KEY'] = "your_key"            # Replace with your Kaggle API key

# 2. Download and Unzip
!kaggle datasets download -d blastchar/telco-customer-churn
!unzip telco-customer-churn.zip

Dataset URL: https://www.kaggle.com/datasets/blastchar/telco-customer-churn
License(s): copyright-authors
telco-customer-churn.zip: Skipping, found more recently modified local copy (use --force to force download)
Archive:  telco-customer-churn.zip
replace WA_Fn-UseC_-Telco-Customer-Churn.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: 

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1. Load and basic clean
df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df.dropna(inplace=True)
df.drop(columns=['customerID'], inplace=True)

# 2. Encode categorical variables
df = pd.get_dummies(df, drop_first=True)

# 3. Define Features and Target
X = df.drop(columns=['Churn_Yes'])
y = df['Churn_Yes']

# 4. Split (80% Train, 20% Test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# 5. Scaling (Standardization)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

I am using XGBoost because it uses Gradient Boosting, a sequential learning process where each new tree corrects the specific errors made by the previous ones. This often leads to higher accuracy on tabular data compared to standard Random Forests.
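To make the "each new tree corrects the previous errors" idea concrete, here is a minimal sketch of boosting with one-split trees (stumps) on a synthetic regression problem. It is an illustration only: it uses plain squared error and NumPy, whereas XGBoost adds regularization, second-order gradient information, and a logistic loss for classification. All function names and the toy data are hypothetical and do not come from this project.

```python
import numpy as np

def fit_stump(x, residuals):
    """Find the single threshold on x that best fits the residuals (squared error)."""
    best = None
    # Dropping the largest value guarantees the right-hand side is never empty.
    for threshold in np.unique(x)[:-1]:
        left = residuals[x <= threshold]
        right = residuals[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def boost(x, y, n_rounds=20, learning_rate=0.1):
    """Sequentially fit stumps to the residuals left by the current ensemble."""
    pred = np.full(len(y), y.mean())          # start from a constant prediction
    mse_history = [float(np.mean((y - pred) ** 2))]
    stumps = []
    for _ in range(n_rounds):
        residuals = y - pred                  # the errors the next tree should correct
        stump = fit_stump(x, residuals)
        pred = pred + learning_rate * predict_stump(stump, x)
        stumps.append(stump)
        mse_history.append(float(np.mean((y - pred) ** 2)))
    return stumps, mse_history

# Synthetic data (assumption for illustration, unrelated to the churn dataset)
rng = np.random.default_rng(0)
x_toy = rng.uniform(-3, 3, size=200)
y_toy = np.sin(x_toy) + rng.normal(scale=0.1, size=200)

stumps, mse_history = boost(x_toy, y_toy)
print(f"training MSE: {mse_history[0]:.3f} -> {mse_history[-1]:.3f}")
```

The `learning_rate` here plays the same role as the XGBoost parameter of that name: each tree contributes only a fraction of its correction, so the ensemble improves gradually over many rounds.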

In [3]:
import xgboost as xgb
from sklearn.metrics import classification_report

# Initialize the model
model_xgb = xgb.XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    scale_pos_weight=3, # Up-weights churners; approximates the ~2.8:1 class imbalance
    eval_metric='logloss'
)

# Train
model_xgb.fit(X_train_scaled, y_train)

# Evaluate
y_pred_xgb = model_xgb.predict(X_test_scaled)
print(classification_report(y_test, y_pred_xgb))

              precision    recall  f1-score   support

       False       0.90      0.71      0.79      1033
        True       0.49      0.79      0.61       374

    accuracy                           0.73      1407
   macro avg       0.70      0.75      0.70      1407
weighted avg       0.80      0.73      0.74      1407
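The `scale_pos_weight=3` value used above is a rounded figure. A common heuristic, suggested in the XGBoost documentation, is the ratio of negative to positive instances. The sketch below computes that ratio from the support counts printed in the report (1033 non-churners, 374 churners). Those counts come from the test split; because the split is stratified, the training ratio should be close, but in practice the value would normally be computed from `y_train`. The helper name is hypothetical.

```python
import numpy as np

def negative_to_positive_ratio(y):
    """Return count(negative) / count(positive) for a boolean or 0/1 label array."""
    y = np.asarray(y).astype(bool)
    n_pos = int(y.sum())
    n_neg = int(y.size - n_pos)
    if n_pos == 0:
        raise ValueError("no positive samples; ratio is undefined")
    return n_neg / n_pos

# Labels rebuilt from the support column of the report above (test split)
y_example = np.array([False] * 1033 + [True] * 374)
ratio = negative_to_positive_ratio(y_example)
print(round(ratio, 2))  # 2.76
```

With real data this would be called as `negative_to_positive_ratio(y_train)` and the result passed to `scale_pos_weight`. Rounding up to 3 pushes the model slightly further toward recall at the cost of precision.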



### **Final Model Comparison**

| Model | Technology | Recall for Churn | Verdict |
| :--- | :--- | :---: | :--- |
| **Random Forest** | Classic ML | ~80% | **Winner (Best Balance)** |
| **Neural Network** | Deep Learning | 73% | High Complexity/Lower Recall |
| **XGBoost** | Gradient Boosting | **79%** | **Strongest Industry Rival** |
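
The "Recall for Churn" column is the share of actual churners that a model flags: true positives divided by true positives plus false negatives. A minimal sketch of that calculation on made-up labels follows. The function and the toy arrays are illustrative assumptions, not results from any of the three models.

```python
def recall(y_true, y_pred):
    """Fraction of actual positives that were predicted positive."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    false_neg = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if true_pos + false_neg == 0:
        raise ValueError("no positive samples; recall is undefined")
    return true_pos / (true_pos + false_neg)

# Toy example: 4 actual churners, 3 of them caught by the model
y_true_toy = [True, True, True, True, False, False]
y_pred_toy = [True, True, True, False, True, False]
print(recall(y_true_toy, y_pred_toy))  # 0.75
```

Applied to the XGBoost report above, a recall of 0.79 on 374 churners corresponds to roughly 295 churners detected. The figure is approximate because the reported recall is rounded.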