# **Solution Report**

---

## **Introduction**

The goal of this project was to develop a machine learning model that can predict customer churn for a telecom provider. The provided dataset included customer contract details, personal information, internet and phone service usage, and billing information. After data preparation, multiple models were trained, evaluated, and compared to determine the most effective churn prediction approach.

---

## **Part 1 — Steps Performed and Skipped**

### **Performed Steps:**

* Loaded and merged multiple CSV files into a unified dataset.
* Cleaned column names and corrected data types.
* Handled missing values in both numeric and categorical features.
* Created a new feature `contract_length` from date columns.
* Applied appropriate encoding: binary mapping for two-class variables, one-hot encoding for multi-class categorical variables.
* Dropped irrelevant columns and duplicates.
* Split the dataset into training, validation, and test subsets.
* Established a Dummy Classifier as a baseline model.
* Trained multiple models: Logistic Regression, Random Forest, and CatBoost.
* Performed hyperparameter tuning with GridSearchCV for CatBoost.
* Visualized feature importances.
* Plotted ROC curves and computed AUC-ROC for model evaluation.
* Evaluated final model performance on the independent test set.
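
The feature-engineering and encoding steps above can be illustrated with a small, library-free sketch. It is not the project's actual code: the column names other than `begin_date`, the snapshot date used for still-active contracts, the use of days as the unit, and the sample rows are all assumptions made for illustration.

```python
from datetime import date

# Hypothetical snapshot date applied to contracts with no end date (assumption).
SNAPSHOT = date(2020, 2, 1)


def contract_length(begin_date, end_date, snapshot=SNAPSHOT):
    """Days between contract start and end; open contracts run to the snapshot."""
    start = date.fromisoformat(begin_date)
    end = date.fromisoformat(end_date) if end_date else snapshot
    return (end - start).days


def binary_map(value, positive="Yes"):
    """Map a two-class variable to 1/0."""
    return 1 if value == positive else 0


def one_hot(value, categories, prefix):
    """One-hot encode a multi-class value against a fixed category list."""
    return {f"{prefix}_{c}": int(value == c) for c in categories}


# Illustrative rows; the column names and values are hypothetical.
rows = [
    {"begin_date": "2019-01-01", "end_date": "2019-03-01",
     "paperless_billing": "Yes", "payment_method": "Mailed check"},
    {"begin_date": "2019-12-01", "end_date": None,
     "paperless_billing": "No", "payment_method": "Electronic check"},
]

# Fix the category list on the full dataset so every later subset
# ends up with the same encoded columns.
methods = sorted({r["payment_method"] for r in rows})

encoded = []
for r in rows:
    features = {
        "contract_length": contract_length(r["begin_date"], r["end_date"]),
        "paperless_billing": binary_map(r["paperless_billing"]),
    }
    features.update(one_hot(r["payment_method"], methods, "payment_method"))
    encoded.append(features)
```

Deriving the category list once, before any split, is one way to keep the encoded columns identical across the training, validation, and test subsets.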

## **Part 2 — Solution Summary**

### **Difficulties Encountered:**

* Normalizing inconsistent column names (e.g. `begindate` vs `begin_date`).
* Applying categorical encoding before splitting the dataset, so that the training and validation sets end up with the same encoded columns.
* Implementing proper three-way dataset splitting (train, validation, test).
* Addressing class imbalance with appropriate class weights in models.
* Finding optimal hyperparameters for CatBoost through grid search tuning.
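
Two of these difficulties, the three-way split and the class weights, can be sketched without any library. The split fractions, the random seed, and the toy labels below are assumptions, since the report does not state them. The weight formula is the one behind scikit-learn's `"balanced"` option; whether the project used exactly this scheme is not stated.

```python
import random
from collections import Counter, defaultdict


def stratified_three_way_split(labels, valid_frac=0.2, test_frac=0.2, seed=42):
    """Return (train, valid, test) index lists that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, valid, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_valid = round(len(indices) * valid_frac)
        n_test = round(len(indices) * test_frac)
        valid += indices[:n_valid]
        test += indices[n_valid:n_valid + n_test]
        train += indices[n_valid + n_test:]
    return train, valid, test


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    return {cls: n_samples / (len(counts) * cnt) for cls, cnt in counts.items()}


# Toy labels with a 75/25 imbalance (assumption, not the real data).
labels = [0] * 75 + [1] * 25
train_idx, valid_idx, test_idx = stratified_three_way_split(labels)

# Weights are computed on the training subset only.
weights = balanced_class_weights([labels[i] for i in train_idx])
```

With these toy labels the minority class receives a weight three times larger than the majority class, which offsets its lower frequency during training.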

### **Key Steps to Solving the Task:**

* Careful data cleaning and feature engineering improved model quality.
* The engineered `contract_length` feature was among the most influential predictors of churn.
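* A Dummy Classifier established a baseline for performance comparison.
* AUC-ROC was selected as the main evaluation metric because, unlike accuracy, it is not inflated by the majority class: the Dummy Classifier reaches 73.41% accuracy with an AUC-ROC of only 0.5000.
* Hyperparameter tuning with GridSearchCV raised the CatBoost model's validation AUC-ROC from 0.9314 to 0.9433 and its validation accuracy from 86.70% to 89.33%.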
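
The reason for preferring AUC-ROC over accuracy on imbalanced data can be shown with a minimal sketch. The labels and scores below are toy values chosen for illustration, not project data, and the helper functions are hypothetical stand-ins for library metrics.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(y_true, scores, threshold=0.5):
    """Share of correct predictions after thresholding the scores."""
    preds = [int(s >= threshold) for s in scores]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)


# Toy data: 6 negatives and 2 positives (75% majority class).
y_true = [0, 0, 0, 1, 0, 0, 0, 1]

# A constant predictor, like a majority-class dummy baseline.
dummy_scores = [0.0] * len(y_true)

# Scores from a hypothetical model that ranks most positives above negatives.
model_scores = [0.1, 0.2, 0.3, 0.8, 0.2, 0.4, 0.1, 0.35]

dummy_auc = roc_auc(y_true, dummy_scores)
dummy_acc = accuracy(y_true, dummy_scores)
model_auc = roc_auc(y_true, model_scores)
model_acc = accuracy(y_true, model_scores)
```

The constant predictor scores 75% accuracy purely from the class ratio, yet its AUC-ROC is 0.5 because it cannot rank any positive above any negative. The same pattern appears in the Dummy Classifier row of the results table.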

### **Final Model and Quality Scores:**

The best-performing model is **CatBoost Classifier** with tuned hyperparameters:

* `learning_rate`: 0.2
* `iterations`: 400
* `depth`: 4
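
The search that produces such a configuration can be sketched as an exhaustive loop over a parameter grid. This is a simplified, library-free illustration of the idea behind grid search, without the cross-validation that GridSearchCV adds. The candidate values are assumptions (only 0.2, 400, and 4 are reported), and `toy_evaluate` is a hypothetical stand-in that returns no real score; in the actual project the scoring step would fit CatBoost and return a cross-validated AUC-ROC.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Score every parameter combination and return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical candidate values; the report only states the winning ones.
param_grid = {
    "learning_rate": [0.05, 0.1, 0.2],
    "iterations": [200, 400],
    "depth": [4, 6],
}


def toy_evaluate(params):
    """Stand-in for a model score; it peaks at the reported settings by construction."""
    return -(
        abs(params["learning_rate"] - 0.2)
        + abs(params["iterations"] - 400) / 1000
        + abs(params["depth"] - 4) / 10
    )


best_params, best_score = grid_search(param_grid, toy_evaluate)
n_combinations = len(list(product(*param_grid.values())))
```

The cost grows multiplicatively with each added parameter (12 combinations here), which is why grids for boosted models are usually kept small.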

**Performance Summary:**

| Model                | AUC-ROC (Validation) | Accuracy (Validation) |
| -------------------- | -------------------- | --------------------- |
| Dummy Classifier     | 0.5000               | 73.41%                |
| Logistic Regression  | 0.8417               | 75.60%                |
| Random Forest        | 0.8853               | 84.64%                |
| CatBoost (initial)   | 0.9314               | 86.70%                |
| **CatBoost (tuned)** | **0.9433**           | **89.33%**            |

**Final Test Performance:**

* Test AUC-ROC: **0.9404**
* Test Accuracy: **88.41%**

---

## **General Conclusion**

The project delivered a robust machine learning model for predicting telecom customer churn. After hyperparameter tuning, the CatBoost model achieved an AUC-ROC of 0.9433 on the validation set and 0.9404 on the test set, exceeding the expected AUC-ROC threshold. The small gap between the two scores suggests the model generalizes well to unseen data. Key features influencing churn included total charges, contract length, monthly charges, and service subscription types. The telecom provider can use this model to proactively identify high-risk customers and implement retention strategies.