# *Final Report:*

# Client Churn Prediction for Interconnect Telecom

## Introduction :
The telecom operator Interconnect wants to forecast client churn. Clients identified as likely to leave will be offered promotional codes and special plan options. Interconnect's marketing team has collected some of its clientele's personal data, including information about their plans and contracts.

Interconnect mainly provides two types of services:

- Landline communication - The telephone can be connected to several lines simultaneously.
- Internet - The network can be set up via a telephone line (DSL, digital subscriber line) or through a fiber optic cable.

Some other services the company provides include:

- Internet security: antivirus software (DeviceProtection) and a malicious website blocker (OnlineSecurity)
- A dedicated technical support line (TechSupport)
- Cloud file storage and data backup (OnlineBackup)
- TV streaming (StreamingTV) and a movie directory (StreamingMovies)

The clients can choose either a monthly payment or sign a 1- or 2-year contract. They can use various payment methods and receive an electronic invoice after a transaction.

## Business Problem Statement :
Interconnect wants to predict which clients are likely to churn.

## Business Value :
Forecasting churn lets Interconnect proactively retain customers by offering promotions or customized plans.

## Data Overview :
The dataset included a comprehensive view of customer attributes:
- Demographics (e.g., age, senior citizen status, gender)
- Contract details (e.g., begin date, end date, payment method, contract type)
- Services used (e.g., internet, tech support, device protection)
- Multiple line connections

All datasets were merged on CustomerID to create a unified customer-level dataset.
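
The merge step can be sketched roughly as follows. This is a minimal illustration with pandas, not the project's actual code: the toy tables, the `customerID` spelling, and the other column names are assumptions, and left joins are just one plausible choice for keeping customers who lack an optional service.

```python
import pandas as pd

# Toy stand-ins for the source files; table and column names are hypothetical.
contract = pd.DataFrame({
    "customerID": ["A1", "B2", "C3"],
    "Type": ["Month-to-month", "Two year", "One year"],
})
personal = pd.DataFrame({
    "customerID": ["A1", "B2", "C3"],
    "SeniorCitizen": [0, 1, 0],
})
internet = pd.DataFrame({
    "customerID": ["A1", "C3"],  # B2 has no internet service
    "InternetService": ["DSL", "Fiber optic"],
})

# Left joins keep every customer from the contract table, even those
# without a row in an optional service table (their values become NaN).
merged = contract
for table in (personal, internet):
    merged = merged.merge(table, on="customerID", how="left", validate="one_to_one")

print(merged)
```

The `validate="one_to_one"` argument makes pandas raise an error if a customer ID appears more than once in either table, which guards against silently duplicated rows.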

## Completed vs Skipped Steps

### Completed Steps
- Data collection and merging: All datasets merged using CustomerID.
- Data preprocessing:
    - Missing values handled
    - Corrected data types
    - Converted column names to snake_case
- EDA (Exploratory Data Analysis):
    - Churn vs. non-churn analysis
    - Demographics, services, payment method trends
    - Class imbalance identified
- Feature Engineering:
    - One-hot encoding for categorical features
    - Normalization for numeric features
- Model Training:
    - Logistic Regression (baseline)
    - Tree-based models: Decision Tree, Random Forest, XGBoost, CatBoost, and LightGBM
- Model Evaluation: AUC-ROC and accuracy
- Model Interpretation & Deployment Recommendation

### Skipped or Adjusted Steps:
Outlier handling: Outliers were reviewed, but given the nature of business metrics (e.g., high charges for long-tenured customers), no extreme values were removed as they reflected genuine variability.



## Difficulties Encountered and Resolutions
|Challenge|Resolution|
|---------|----------|
|Missing or inconsistent data (e.g., TotalCharges stored as object)|Converted to numeric after identifying and handling blanks|
|Class imbalance (26.6% churn rate)|Used class weighting in models and focused on the AUC-ROC metric|
|Overfitting in complex models|Used cross-validation and hyperparameter tuning to improve generalization|
|Long training and tuning times, particularly with ensemble models (XGBoost, CatBoost)|Reduced `cv` and `n_iter` and set `n_jobs=-1` in the tuning parameters|
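
As an illustration of the class-weighting and tuning-cost resolutions, here is a hedged sketch using scikit-learn's `RandomizedSearchCV`. The synthetic data, the choice of a random forest, and the search space are all assumptions made for the example; the project's actual estimators and grids are not shown in this report.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data with roughly the report's class balance (assumption).
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.734, 0.266], random_state=0
)

# class_weight="balanced" reweights classes inversely to their frequency.
model = RandomForestClassifier(class_weight="balanced", random_state=0)

# Hypothetical search space, kept small on purpose.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5, 10],
}

search = RandomizedSearchCV(
    model,
    param_distributions,
    n_iter=5,           # fewer sampled configurations
    cv=3,               # fewer cross-validation folds
    scoring="roc_auc",  # optimize the metric the report focuses on
    n_jobs=-1,          # use all available CPU cores
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Total fitting cost scales with `n_iter * cv`, so lowering either one shortens tuning, at the price of a coarser search and noisier score estimates.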

## Key Steps That Solved the Task
### Data Preprocessing
- Merging datasets: Unified customer-level data was created by merging multiple files on CustomerID.
- Data Cleaning:
    - Missing values handled appropriately.
    - Converted data types (e.g., numeric conversion for charges, datetime conversion for begin and end dates).
    - Standardized column names to snake_case for consistency.
- Target variable: Defined as:
    - churn = 1 for customers who discontinued the service
    - churn = 0 for customers who remained

This binary target enabled the use of supervised classification algorithms for churn prediction.
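
The cleaning steps and target definition might look like the sketch below. It rests on several assumptions that the report does not state: that active customers carry a placeholder such as `"No"` in the end-date column, that blank charges are filled with 0, and that the original columns are named as shown. The helper `to_snake_case` is hypothetical.

```python
import re
import pandas as pd

def to_snake_case(name: str) -> str:
    """Insert '_' before a capital that follows a lowercase letter or digit, then lowercase."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

# Toy rows; column names and the "No" placeholder are assumptions.
raw = pd.DataFrame({
    "customerID": ["A1", "B2", "C3"],
    "BeginDate": ["2019-01-01", "2018-05-01", "2020-01-01"],
    "EndDate": ["No", "2019-11-01", "No"],
    "TotalCharges": ["120.5", "830.0", " "],  # blank string instead of a number
})

df = raw.rename(columns=to_snake_case)

# Blank strings become NaN; filling with 0 is one possible choice (assumption).
df["total_charges"] = pd.to_numeric(df["total_charges"], errors="coerce").fillna(0.0)

# Parse dates; the "No" placeholder cannot be parsed and becomes NaT.
df["begin_date"] = pd.to_datetime(df["begin_date"], format="%Y-%m-%d")
df["end_date"] = pd.to_datetime(df["end_date"], format="%Y-%m-%d", errors="coerce")

# churn = 1 when an end date exists, 0 otherwise.
df["churn"] = df["end_date"].notna().astype(int)
print(df)
```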

## Exploratory Data Analysis (EDA)
- Approximately 26.58% of customers have churned, confirming a moderate class imbalance, with churn being the minority class. This imbalance should be addressed during modeling (e.g., using class weights or resampling).
- Senior citizens have a higher churn rate across both genders.
- Gender alone does not show a significant difference in churn behavior.
- Month-to-month contracts have the highest churn rate (42.7%), while two-year contracts have the lowest (2.8%), indicating longer commitments reduce churn.
- Electronic check users have the highest churn (45.3%), while automatic payments (bank transfer/credit card) are linked with much lower churn rates (15–20%).
- Customers not subscribed to online security, tech support, or online backup are more likely to churn, suggesting these services contribute to retention.
- Median total charges for churned customers are significantly lower than for those who stayed.
- Median monthly charges are higher for churned customers.
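
Segment-level churn rates like those above are typically computed with a group-by on the binary target. The sketch below uses a tiny made-up table, so its numbers are purely illustrative and do not reproduce the report's figures; the column names are assumptions.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "contract": ["Month-to-month", "Month-to-month", "Two year",
                 "Two year", "One year", "Month-to-month"],
    "churn": [1, 0, 0, 0, 0, 1],
})

# The mean of a 0/1 target is the churn rate.
overall_rate = toy["churn"].mean()

# Churn rate per contract type, highest first.
rate_by_contract = (
    toy.groupby("contract")["churn"].mean().sort_values(ascending=False)
)

print(f"overall churn rate: {overall_rate:.1%}")
print(rate_by_contract)
```

The same pattern applies to payment method or any service flag by swapping the grouping column.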

## Feature Engineering

- Feature extraction
- Encoding categorical variables: one-hot encoding
- Normalizing/scaling numerical features
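
One common way to combine the encoding and scaling steps is scikit-learn's `ColumnTransformer`, sketched below. The report does not say which scaler was used, so `StandardScaler` is an assumption, as are the column names and toy values.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy feature table; names and values are hypothetical.
features = pd.DataFrame({
    "contract": ["Month-to-month", "One year", "Two year", "Month-to-month"],
    "monthly_charges": [70.0, 55.5, 20.0, 99.9],
    "total_charges": [140.0, 666.0, 480.0, 99.9],
})
categorical = ["contract"]
numeric = ["monthly_charges", "total_charges"]

preprocessor = ColumnTransformer(
    [
        # Unknown categories seen later are encoded as all zeros.
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
        # Rescale each numeric column to zero mean and unit variance.
        ("scale", StandardScaler(), numeric),
    ],
    sparse_threshold=0,  # always return a dense array
)

X = preprocessor.fit_transform(features)
print(X.shape)
```

In a real workflow the transformer would be fitted on the training split only and then applied to the test split, so that test-set statistics do not leak into the scaling.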

## Model selection and tuning: 
Trying multiple models and focusing on AUC-ROC guided us to a high-performing solution.
* Logistic Regression (sanity check)
* Decision Tree Classifier
* Random Forest Classifier
* XGBClassifier
* CatBoostClassifier
* LightGBM Classifier

### Results:
|Model|AUC-ROC|Accuracy|
|-----|-------|--------|
|LightGBM|0.931942|0.896473|
|XGBoost|0.929942|0.889078|
|CatBoost|0.926445|0.884528|
|Decision Tree|0.861139|0.819113|
|Random Forest|0.856675|0.812287|
|Logistic Regression|0.830799|0.799772|

- LightGBM outperformed all other models in both AUC-ROC and accuracy, making it the top choice for deployment.
- XGBoost and CatBoost also delivered strong performance, with AUC-ROC values above 0.92, showing excellent discrimination between churn and non-churn classes.
- Logistic regression, while less accurate, provided a valuable baseline and interpretability.
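
To make the AUC-ROC column concrete: the score equals the probability that a randomly chosen churner receives a higher predicted score than a randomly chosen non-churner. The pure-Python sketch below computes it by direct pair counting on made-up labels and scores. It is an illustration of the metric, not the evaluation code used in the project, and the function name `auc_roc` is hypothetical.

```python
def auc_roc(labels, scores):
    """Fraction of (churner, non-churner) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC-ROC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up example: two churners (label 1) and three non-churners (label 0).
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.1]
print(auc_roc(labels, scores))
```

Because the metric depends only on how the scores rank customers, it is unaffected by the class ratio or by the decision threshold, which is why it suits an imbalanced problem like this one better than accuracy alone.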

### Final Model and Evaluation

#### Final Model: LightGBM (Light Gradient Boosting Machine)
The LightGBM model demonstrated the best overall performance and is recommended for deployment to support churn mitigation strategies.

Test Set Performance:
- AUC-ROC Score of 0.932 indicates excellent discrimination between churners and non-churners.
- Accuracy Score of 0.896 implies the model correctly classified nearly 90% of test instances.
- The confusion matrix shows:
    - True Negatives (TN): 1238 non-churners correctly predicted
    - False Positives (FP): 53 non-churners incorrectly predicted as churners
    - False Negatives (FN): 129 churners missed by the model
    - True Positives (TP): 338 churners correctly identified
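
The confusion-matrix counts above determine several other metrics. The short sketch below derives them from the reported counts using the standard definitions; only accuracy is reported in this document, so the remaining values are derived here rather than taken from the original analysis.

```python
# Counts reported in the confusion matrix above.
tn, fp, fn, tp = 1238, 53, 129, 338

total = tn + fp + fn + tp
accuracy = (tp + tn) / total        # share of all predictions that are correct
precision = tp / (tp + fp)          # share of predicted churners who actually churned
recall = tp / (tp + fn)             # share of actual churners the model caught
specificity = tn / (tn + fp)        # share of non-churners correctly left alone
f1 = 2 * precision * recall / (precision + recall)

for name, value in [
    ("accuracy", accuracy),
    ("precision", precision),
    ("recall", recall),
    ("specificity", specificity),
    ("f1", f1),
]:
    print(f"{name}: {value:.3f}")
```

The accuracy computed this way matches the reported 0.896. Recall comes out lower than accuracy, which is worth weighing if missed churners are more costly than unnecessary promotions.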

### Strategic Recommendations
- Encourage long-term contracts through onboarding promotions.
- Offer bundled services (e.g., tech support, online backup) as value-adds.
- Promote automatic payments for ease and retention.
- Monitor newly onboarded clients closely for the first few months (early churn window).

### Conclusion
This project developed a robust churn prediction model for Interconnect. With LightGBM achieving an AUC-ROC of 0.932 on the test set, the company can target at-risk customers and improve retention through informed, data-driven marketing strategies.



<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> Very nice report! Congratulations!
<a class="tocSkip"></a>