# Solution Report
---

We identified the problem at hand: our shareholder is losing customers. Our job is to develop a model that identifies churned customers so that we can detect future customers who are likely to terminate their service. We analyzed the data and developed a work plan to solve this task.

---
## Work Report Review
---
In our initial work report we highlighted a few key tasks. We implemented the following:

1) We found the missing values for `total_charges` and filled them with 0, since they all belonged to brand-new accounts. The other missing values were kept as `missing` and treated as their own category for model training. We chose to keep them as missing because there is no way to accurately infer a value without introducing a false bias or pattern.

2) We created the feature `service_length` for each account, as the length of service could carry some weight in predicting the observation's target value.

3) We changed the data type of the categorical features so the models would recognize them accordingly.

4) We planned to remove outliers, but after filtering for observations with values more than 3 standard deviations above the mean we did not find any, so we skipped this task.

5) The dataset was split into training, validation, and test sets for model evaluation.

6) SMOTE was our preferred method for handling the class imbalance, but we could not use it because the Jupyter notebook within the TripleTen platform did not permit installing the module. We instead wrote a function to upsample the minority class for a better class distribution in our training set.

7) We originally wanted to use `StandardScaler` from the sklearn library, but after switching to `OrdinalEncoder` we found `MaxAbsScaler` to be a better fit, as each categorical feature has a different range of values.

8) `OneHotEncoder` was the planned encoder for our categorical features, but after some deliberation we found that `OrdinalEncoder` worked better with decision-tree-based models (XGBoost). We skipped the encoding for our CatBoost model as it has its own internal encoder.

9) We chose 3 models to solve our task:
- __CatBoost__: Performs well on data with many categorical features. Compared to other models, CatBoost does particularly well with smaller datasets. Our set holds about 15,000 observations after upsampling.
- __XGBoost__: One of the most highly regarded gradient boosting models. It does well with complex, large datasets. It has a large community, which suggests a high chance of continued maintenance; this is key when thinking about long-term model lifespan.
- __KNeighbors__: We chose this algorithm after evaluating the results from the CatBoost and XGBoost models. We noticed those models performed best with very high learning rates. This typically leads to overfitting but instead led to improved performance. We took this as an indication that observations that are near each other likely share the same target values.
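As an illustration of items 5, 7 and 8 above, the following is a minimal sketch of a three-way split followed by ordinal encoding and max-abs scaling. The toy data, the column names, and the 60/20/20 split ratio are assumptions made for the example; the report does not state the actual ratio. The encoder and scaler are fitted on the training set only, so no information from the validation or test sets leaks into preprocessing.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler, OrdinalEncoder

# Toy data; column names and values are hypothetical.
df = pd.DataFrame({
    "contract_type": ["monthly", "yearly", "monthly", "missing", "two_year"] * 4,
    "payment_method": ["card", "check", "missing", "card", "transfer"] * 4,
    "churned": [1, 0, 1, 0, 0] * 4,
})
features = df.drop(columns="churned")
target = df["churned"]

# Assumed 60/20/20 split into training, validation and test sets.
features_train, features_rest, target_train, target_rest = train_test_split(
    features, target, test_size=0.4, random_state=0, stratify=target
)
features_valid, features_test, target_valid, target_test = train_test_split(
    features_rest, target_rest, test_size=0.5, random_state=0, stratify=target_rest
)

# Ordinal-encode the categories, then scale each column by its maximum
# absolute value. Categories unseen during fitting are encoded as -1.
preprocess = make_pipeline(
    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
    MaxAbsScaler(),
)
train_scaled = preprocess.fit_transform(features_train)  # fit on training data only
valid_scaled = preprocess.transform(features_valid)
test_scaled = preprocess.transform(features_test)
```

Because `MaxAbsScaler` divides each column by its largest absolute value, every encoded feature ends up in the range [-1, 1] regardless of how many categories it has.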

---
## Troubleshooting
---
We encountered a few problems while solving this task. Each one took some time to work out the approach we felt best addressed it.

1) __No price points__<br>
To accurately determine the missing values in the categorical features, we would have needed price points for each plan and for the features within each customer's plan; with those we could have written a function to fill in the values. We did not have the price points, so we overcame this issue by deciding to leave those values missing and assign them to their own `missing` category.

2) __Class Balancing__<br>
We tried multiple times to install the imblearn library in order to use SMOTE. Since we could not install it, we used a familiar function we had written previously to upsample the minority class for model training.

3) __XGBooster__<br>
We originally wanted to tune the hyperparameters of an XGBoost booster to maximize its potential performance, but we kept getting errors when attempting to train with a `DMatrix`. To circumvent this issue we used the `XGBClassifier` class instead of a booster. Results were still impressive.

4) __Choosing a Winner__<br>
Because the top 2 models had similar scores, we needed a better way to compare their performance. We chose to visualize each model's predictions against the true values with confusion matrix heatmaps. This gave us a clear look at each model's errors and their potential business implications.
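The comparison described in item 4 can be sketched as follows. The labels and predictions are made-up values chosen so that both models have the same accuracy but make different kinds of errors; `describe_errors` is a hypothetical helper, not the function used in the project.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions; 1 = churned, 0 = stayed.
target_test = np.array([0, 0, 0, 0, 1, 1, 1, 1])
pred_model_a = np.array([0, 0, 0, 1, 1, 1, 1, 1])  # one false positive
pred_model_b = np.array([0, 0, 0, 0, 1, 1, 1, 0])  # one false negative


def describe_errors(y_true, y_pred):
    """Return the two error counts from a binary confusion matrix."""
    # For labels [0, 1], ravel() yields tn, fp, fn, tp in that order.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {"false_positives": int(fp), "false_negatives": int(fn)}


errors_a = describe_errors(target_test, pred_model_a)
errors_b = describe_errors(target_test, pred_model_b)
print("model A:", errors_a)
print("model B:", errors_b)
```

Both toy models are right 7 times out of 8, yet model A flags a loyal customer while model B misses a churner. The matrix returned by `confusion_matrix` can be passed to a plotting function such as `seaborn.heatmap` to produce the heatmap view.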


---
## Key Steps
---

A few of the steps we took were key to our models' performance.

__Engineered Feature__<br>
We engineered a feature, `service_length`, which gives each customer/observation a value indicating the length of service. The features containing dates were dropped, and `service_length` allowed us to minimize the information lost by dropping them.
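A minimal sketch of how such a feature could be derived is shown here. The column names `begin_date` and `end_date`, the sample dates, and the use of a fixed snapshot date for still-active customers are all assumptions for illustration; the report does not describe the exact calculation.

```python
import pandas as pd

# Hypothetical column names and dates.
contracts = pd.DataFrame({
    "begin_date": ["2019-01-01", "2019-11-01", "2020-02-01"],
    "end_date": ["2019-12-01", None, None],  # None = customer still active
})
snapshot_date = pd.Timestamp("2020-02-01")  # assumed date the data was extracted

begin = pd.to_datetime(contracts["begin_date"])
# Active customers have no end date, so measure up to the snapshot date.
end = pd.to_datetime(contracts["end_date"]).fillna(snapshot_date)

# Length of service in days, then drop the raw date columns.
contracts["service_length"] = (end - begin).dt.days
contracts = contracts.drop(columns=["begin_date", "end_date"])
```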

__Keeping it Honest__<br>
The decision to keep the missing values as missing and assign them to their own category kept the data honest, allowing the model to identify any potential patterns from an unadulterated dataset.
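The handling of missing values can be sketched as below. Only `total_charges` is a column name taken from this report; the other column names and all values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data; only `total_charges` is a name from the report.
df = pd.DataFrame({
    "total_charges": [29.85, np.nan, 108.15],
    "internet_service": ["DSL", None, "Fiber optic"],
    "multiple_lines": [None, "No", "Yes"],
})

# Brand-new accounts have not been charged yet, so their total is 0.
df["total_charges"] = df["total_charges"].fillna(0)

# Keep the remaining gaps as an explicit category instead of guessing a value.
categorical_cols = ["internet_service", "multiple_lines"]
for col in categorical_cols:
    df[col] = df[col].fillna("missing").astype("category")
```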

__Balanced Opportunity__<br>
Our classes were heavily imbalanced. We chose to balance our training dataset so our models could get a balanced look at instances of each class. Keeping the class balance of the validation and test sets as is allowed us to measure each model's ability to learn any potential patterns.
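The upsampling function itself is not shown in this report. The sketch below is one common way to write it, assuming the minority class is labeled 1 and that balancing is done by repeating minority rows a fixed number of times; the function name, the `repeat` parameter, and the toy data are illustrative.

```python
import pandas as pd
from sklearn.utils import shuffle


def upsample(features, target, repeat, random_state=0):
    """Repeat the minority class (assumed to be label 1) `repeat` times."""
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_up = pd.concat([features_zeros] + [features_ones] * repeat)
    target_up = pd.concat([target_zeros] + [target_ones] * repeat)

    # Shuffle features and target together so rows stay aligned.
    return shuffle(features_up, target_up, random_state=random_state)


# Toy training set: 8 majority rows and 2 minority rows.
features_train = pd.DataFrame({"x": range(10)})
target_train = pd.Series([0] * 8 + [1] * 2)

features_up, target_up = upsample(features_train, target_train, repeat=4)
```

The function is applied to the training set only; the validation and test sets keep their original class distribution.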

---
## Final Model Quality
---
Our final model is the KNeighbors model with the `n_neighbors` hyperparameter set to 1 (the single closest neighbor) and `p` set to 1. The `p` value is the power parameter for the Minkowski metric. With `p` set to 1, the model measures the Manhattan distance between points, as opposed to the Euclidean distance.
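To make the distance choice concrete, the sketch below computes both distances for a pair of points and then fits scikit-learn's `KNeighborsClassifier` with the hyperparameters named above. The points and the tiny training set are made-up values, not project data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Minkowski distance for two example points.
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
manhattan = np.abs(a - b).sum()            # p=1: |3| + |4| = 7
euclidean = np.sqrt(((a - b) ** 2).sum())  # p=2: sqrt(9 + 16) = 5

# Toy training set (hypothetical values); 1 = churned, 0 = stayed.
features_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
target_train = np.array([0, 0, 1, 1])

# One nearest neighbor, measured with the Manhattan distance.
model = KNeighborsClassifier(n_neighbors=1, p=1)
model.fit(features_train, target_train)

# Each new point takes the label of its single closest training point.
predictions = model.predict(np.array([[0.5, 0.2], [5.2, 5.1]]))
```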

The model obtained the following scores:
- __Accuracy__: 0.997
- __F1 score__: 0.994
- __ROC-AUC__: 0.999

This passes our shareholder's threshold with flying colors. We highly recommend the model for use in mitigating customer churn, as it successfully identified 99% of churned customers on both the validation and test sets.
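For reference, the metrics listed above, along with recall (the share of churned customers the model catches), can be computed with scikit-learn as sketched here. The arrays are small made-up examples and do not reproduce the project's scores.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score, roc_auc_score

# Hypothetical labels, hard predictions and predicted churn probabilities.
target_test = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
predictions = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])
churn_probability = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.6, 0.9, 0.8, 0.7, 0.4])

accuracy = accuracy_score(target_test, predictions)
f1 = f1_score(target_test, predictions)
recall = recall_score(target_test, predictions)
# ROC-AUC is computed from scores or probabilities, not from hard labels.
roc_auc = roc_auc_score(target_test, churn_probability)

print(f"Accuracy: {accuracy:.3f}")
print(f"F1 score: {f1:.3f}")
print(f"Recall:   {recall:.3f}")
print(f"ROC-AUC:  {roc_auc:.3f}")
```

For a fitted classifier, the probabilities would typically come from `model.predict_proba(features)[:, 1]`.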