# SOLUTION REPORT

### Explanation of steps performed and steps that were skipped

Most of the planned steps were executed, with several improvements added along the way. My initial work plan was based on a dataset keyed by a user_id column, which did not match the structure of the other files. After reassessing the data, I switched to the version containing customer_id, which allowed all datasets to be merged accurately on a shared key. This correction eliminated the need for a workaround and preserved the integrity of the merged dataset.
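The merge on the shared key can be sketched as follows. This is a minimal illustration rather than the exact notebook code: the three small tables and every column other than customer_id are hypothetical stand-ins for the real files, and the left join onto the contract table is an assumption.

```python
from functools import reduce

import pandas as pd

# Hypothetical miniature tables; the real files are larger and have more columns.
contract = pd.DataFrame({
    "customer_id": ["a1", "b2", "c3"],
    "type": ["month-to-month", "two year", "one year"],
})
internet = pd.DataFrame({
    "customer_id": ["a1", "c3"],
    "internet_service": ["dsl", "fiber optic"],
})
phone = pd.DataFrame({
    "customer_id": ["a1", "b2"],
    "multiple_lines": ["no", "yes"],
})

# Left-join each table onto the contract table so no customer is dropped.
# validate="one_to_one" raises an error if customer_id is duplicated in either side.
merged = reduce(
    lambda left, right: left.merge(
        right, on="customer_id", how="left", validate="one_to_one"
    ),
    [internet, phone],
    contract,
)
print(merged)
```

Customers missing from a service table (for example b2 in the internet table) end up with NaN in that table's columns, which is what makes the later context-based filling necessary.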

From there, I standardized all column names to snake_case, converted all text to lowercase, and removed whitespace. These preprocessing steps made the data consistent and simplified later transformations and visualizations.
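A minimal sketch of this cleanup, assuming the raw columns use a mix of camelCase and spaced names and that "removing whitespace" means stripping it from the ends of text values. The helper names to_snake_case and clean_frame and the sample column names are hypothetical.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Convert names such as 'customerID' or ' Payment Method ' to snake_case."""
    name = name.strip()
    # Insert an underscore where a lowercase letter or digit meets an uppercase letter.
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    # Collapse runs of spaces and hyphens into a single underscore.
    name = re.sub(r"[\s\-]+", "_", name)
    return name.lower()


def clean_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with snake_case column names and stripped, lowercased text."""
    out = df.copy()
    out.columns = [to_snake_case(col) for col in out.columns]
    for col in out.select_dtypes(include=["object", "string"]).columns:
        out[col] = out[col].str.strip().str.lower()
    return out


raw = pd.DataFrame({
    "customerID": [" A1 ", "B2"],
    " Payment Method ": ["Electronic Check", " Mailed Check"],
    "StreamingTV": ["Yes", "No "],
})
cleaned = clean_frame(raw)
print(cleaned)
```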

Although I initially planned to fill all missing values with 'unknown', further exploration of the dataset revealed that many missing values were not truly unknown. For example, missing entries in the multiple_lines column actually indicated customers with 'no_phone_service', and similarly, gaps in internet service columns corresponded to 'no_internet_service'. As a result, I refined my approach and filled those columns with more meaningful values based on context.
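The context-based filling can be sketched like this. Only multiple_lines and the two fill values come from the report; the other column names and the sample rows are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical merged data: NaN appears where a customer was absent from a service table.
merged = pd.DataFrame({
    "customer_id": ["a1", "b2", "c3"],
    "multiple_lines": ["yes", np.nan, "no"],
    "internet_service": ["dsl", "fiber optic", np.nan],
    "online_security": ["no", "yes", np.nan],
    "streaming_tv": ["yes", "no", np.nan],
})

phone_cols = ["multiple_lines"]
internet_cols = ["internet_service", "online_security", "streaming_tv"]

# A missing value here means the service was never purchased, not that it is unknown.
merged[phone_cols] = merged[phone_cols].fillna("no_phone_service")
merged[internet_cols] = merged[internet_cols].fillna("no_internet_service")
print(merged)
```

Keeping these as explicit categories lets the models distinguish "declined an add-on" from "could not have the add-on at all", which a generic 'unknown' label would blur.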

### Difficulties encountered and how they were managed

One major challenge was that the initial plan relied on the wrong identifier column (user_id instead of customer_id), which would have prevented a successful merge. Once I switched to customer_id, the merge went smoothly.

Another hurdle was achieving strong model performance. At first, I used get_dummies() across the board and applied scaling inappropriately. Logistic regression performance dropped when the features were scaled, and none of the models initially crossed an AUC-ROC score of 0.84.

To overcome this, I re-evaluated which encoding methods were best suited for each model. I applied One-Hot Encoding for Logistic Regression, Ordinal Encoding for Random Forest, and CatBoost-style Target Encoding for XGBoost and LightGBM. This model-specific encoding, combined with more meaningful feature engineering, significantly improved model quality.
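The three encoding styles can be sketched as below. This is an illustration under assumptions, not the notebook's code: the sample data is made up, the scikit-learn encoder settings are my choice, and ordered_target_encode is a hypothetical hand-written helper that mimics the CatBoost idea of encoding each row only from the targets of earlier rows in the same category.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical sample: two categorical features and a binary churn target.
X = pd.DataFrame({
    "contract": ["month-to-month", "month-to-month", "two year",
                 "month-to-month", "two year"],
    "payment": ["card", "check", "card", "card", "check"],
})
y = pd.Series([1, 0, 0, 1, 0], name="churn")
cat_cols = ["contract", "payment"]

# One-hot encoding (used here for logistic regression): one 0/1 column per category.
one_hot = OneHotEncoder(handle_unknown="ignore").fit_transform(X[cat_cols])

# Ordinal encoding (used here for random forest): one integer code per category.
ordinal = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
).fit_transform(X[cat_cols])


def ordered_target_encode(categories: pd.Series, target: pd.Series,
                          prior_weight: float = 1.0) -> pd.Series:
    """Encode each row from the targets of earlier rows in its category only."""
    prior = target.mean()
    grouped = target.groupby(categories)
    seen_sum = grouped.cumsum() - target  # target sum before the current row
    seen_count = grouped.cumcount()       # number of earlier rows in the category
    return (seen_sum + prior * prior_weight) / (seen_count + prior_weight)


# Target encoding (used here for the boosted-tree models).
contract_te = ordered_target_encode(X["contract"], y)
print(one_hot.shape, ordinal.shape)
print(contract_te)
```

Because the ordered encoding depends on row order, the training rows would normally be shuffled first, and validation or test rows would be encoded from training-set statistics only.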

### Key steps in solving the task

The most critical step was refining the feature engineering. I created several new columns that captured interactions and patterns within the data:

- 'internet_tv_combo' to identify customers using both internet and streaming services

- 'contract_support_combo' to link long-term contracts with tech support

- 'avg_monthly_charge' to reflect how much customers spend monthly over their service duration

- 'service_count' to show how many services each customer subscribed to

- 'charge_per_service' to reflect the cost per service a customer is using

These engineered features added important nuance to the dataset and helped models pick up deeper patterns that were not visible through raw features alone.
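The features listed above could be built roughly as follows. The five new column names come from this report, but the input column names, the sample rows, and the exact formulas (for example, treating one- and two-year contracts as long-term, or counting internet access itself as a service) are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical cleaned data with lowercase text values.
df = pd.DataFrame({
    "internet_service": ["dsl", "no_internet_service", "fiber optic"],
    "streaming_tv": ["yes", "no_internet_service", "no"],
    "tech_support": ["yes", "no_internet_service", "no"],
    "online_security": ["no", "no_internet_service", "yes"],
    "type": ["two year", "month-to-month", "one year"],
    "monthly_charges": [80.0, 20.0, 60.0],
    "total_charges": [1600.0, 40.0, 0.0],
    "tenure_months": [20, 2, 0],
})

has_internet = df["internet_service"] != "no_internet_service"

# Customers with both an internet connection and a streaming service.
df["internet_tv_combo"] = (has_internet & (df["streaming_tv"] == "yes")).astype(int)

# Long-term contract combined with tech support.
long_contract = df["type"].isin(["one year", "two year"])
df["contract_support_combo"] = (
    long_contract & (df["tech_support"] == "yes")
).astype(int)

# Average spend per month of tenure; brand-new customers fall back to the monthly charge.
tenure = df["tenure_months"].replace(0, np.nan)
df["avg_monthly_charge"] = (df["total_charges"] / tenure).fillna(df["monthly_charges"])

# Number of subscribed services, counting the internet connection itself.
service_cols = ["streaming_tv", "tech_support", "online_security"]
df["service_count"] = (df[service_cols] == "yes").sum(axis=1) + has_internet.astype(int)

# Cost per service; clip avoids division by zero for customers with no counted services.
df["charge_per_service"] = df["monthly_charges"] / df["service_count"].clip(lower=1)
print(df.iloc[:, -5:])
```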

Another key improvement was switching from generic encoding (like get_dummies) to model-specific strategies that improved both computational efficiency and predictive accuracy.

### Final model and quality score

The final model is a Logistic Regression classifier with balanced class weights:

`log_model = LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42)`

This model performed strongly without scaling and achieved the highest AUC-ROC score among all models tested.

- Validation Accuracy: 94%

- Validation AUC-ROC: 0.9787

- Test Accuracy: 91.7%

- Test AUC-ROC: 0.9760

The ROC curve confirmed the model's ability to separate churners from non-churners with high confidence, showing a steep and well-formed curve that hugged the top-left corner.
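For reference, the way these metrics are computed can be sketched as follows. The model settings match the ones above, but the data is synthetic (generated with make_classification), the 60/20/20 split is an assumption, and evaluate is a hypothetical helper, so the printed numbers will not match the scores in this report.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the encoded churn features.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.73, 0.27], random_state=42,
)

# Stratified split into train, validation, and test sets.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42
)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42
)

log_model = LogisticRegression(class_weight="balanced", max_iter=1000, random_state=42)
log_model.fit(X_train, y_train)


def evaluate(model, features, labels):
    """Return (accuracy, AUC-ROC); AUC uses the predicted churn probability."""
    churn_proba = model.predict_proba(features)[:, 1]
    return accuracy_score(labels, model.predict(features)), roc_auc_score(labels, churn_proba)


valid_acc, valid_auc = evaluate(log_model, X_valid, y_valid)
test_acc, test_auc = evaluate(log_model, X_test, y_test)
print(f"validation: accuracy={valid_acc:.3f}, auc_roc={valid_auc:.4f}")
print(f"test:       accuracy={test_acc:.3f}, auc_roc={test_auc:.4f}")
```

AUC-ROC is computed from predicted probabilities rather than hard labels, which is why it is less sensitive than accuracy to the class imbalance that class_weight='balanced' is meant to address.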

### Conclusion

By merging the correct datasets on customer_id, applying meaningful feature engineering, and using encoding strategies tailored to each model type, I was able to significantly improve model performance. The final logistic regression model not only achieved the best AUC-ROC but also remained stable across the validation and test sets (0.9787 and 0.9760, respectively). These results highlight its effectiveness and reliability for predicting customer churn.

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> This is a very nicely written report! Great job on this and best of luck in your job search!
<a class="tocSkip"></a>