Project by:
- Name: Julius Kinyua Njeri
- Email: juliusczar36@gmail.com
- Date: June 2024
- Github Link: https://github.com/CzarProCoder/SyriaTel_Customer_Churn_ML
- LinkedIn: https://www.linkedin.com/in/julius-kinyua
- Twitter(X): https://x.com/Juliuskczar
- Website: https://lyonec.com/
SyriaTel, a telecommunications company, is concerned about customer churn, where customers stop using their services. To address this, the company has gathered data on customer behavior to identify those likely to leave and implement strategies to retain them, as losing customers is costly.
The term “churn” refers to customers leaving the company. The current churn rate is approximately 14%, and SyriaTel aims to reduce it to about 7%.
The project utilized the provided dataset to address key questions:
- Identifying the main features that determine customer churn
- Uncovering any predictable patterns
- Exploring how SyriaTel can leverage these insights to implement cost-effective solutions.
The project aims to develop a classification model to predict customer churn using machine learning techniques. Following the CRISP-DM methodology, the project involves six stages: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment. By analyzing the dataset, we aim to uncover patterns and factors driving customer churn and create a predictive model to help reduce customer attrition.
Problem Statement
SyriaTel, a telecommunications company, is experiencing high customer churn as many customers switch to competitors. To address this, the company aims to develop a churn prediction model to identify factors associated with churn and improve customer retention, ultimately boosting profitability.
Objectives and Success Metrics
The project aims to:
- Identify key factors leading to customer churn.
- Develop an accurate churn prediction model.
- Implement strategies to retain at-risk customers.
Success will be measured by:
- Achieving a recall score of 0.8 with the prediction model.
- Identifying significant features contributing to churn.
- Providing actionable recommendations to reduce churn and enhance retention.
- Demonstrating the value of proactive retention strategies in reducing revenue losses.
For this analysis, the SyriaTel churn data was sourced from Kaggle.
The dataset contains data on the customers of a telecom company. Each row represents a customer, and the columns contain customer attributes such as minutes, number of calls, and charges for each time of day, as well as international usage. In addition, we also have information about the customer's voicemail and customer-service-call behavior.
- The dataset has no missing values, and most columns are numeric
- Of the 3,333 customers in this dataset, 483 terminated their contract with SyriaTel
- Transforming the target `churn`, as well as `international_plan` and `voice_mail_plan`, to binary 0/1 is needed
- `state` appears numeric but is actually categorical and needs to be transformed as well
- `phone_number` can be dropped as there are no duplicate entries
This is an imbalanced dataset: with only 14.5% of customers lost, balancing will be necessary.
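A quick check makes the imbalance concrete. This is a minimal sketch; the CSV path is hypothetical and stands in for the Kaggle download:

```python
import pandas as pd

# Hypothetical local path to the Kaggle SyriaTel churn CSV
df = pd.read_csv("syriatel_churn.csv")

# Share of retained vs. churned customers; expect roughly 85.5% / 14.5%
print(df["churn"].value_counts(normalize=True))
```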
`total_intl_charge`, `total_day_charge`, `total_eve_charge`, and `total_night_charge` are perfectly correlated with `total_intl_minutes`, `total_day_minutes`, `total_eve_minutes`, and `total_night_minutes`, respectively. This makes sense, since the company charges by the minute.
If we need to, we can confidently drop the 'charge' column from each category (day, eve, night, and intl) and keep the 'minutes' columns, especially since it is unclear what currency the 'charge' values are in.
In addition, there is a near-perfect correlation between `number_vmail_messages` and `voice_mail_plan`. This makes sense: much like 'charge' and 'minutes', these two columns are telling us the same thing.
If we need to, we can drop `number_vmail_messages`.
Lastly, there are a couple of weak correlations associated with our target `churn` variable: `customer_service_calls`, `international_plan`, and `total_day_minutes` each have a slight positive correlation with churn. While these correlations are weak, we will want to consider including these features in our models.
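A short correlation check illustrates both observations. This is a minimal sketch; the CSV path is hypothetical, and the underscore renaming (described under data preparation below) is repeated inline so the snippet is self-contained:

```python
import pandas as pd

df = pd.read_csv("syriatel_churn.csv")         # hypothetical path
df.columns = df.columns.str.replace(" ", "_")  # space -> underscore names
df["churn"] = df["churn"].astype(int)          # bool -> 0/1 so corr() includes it

corr = df.select_dtypes("number").corr()

# 'charge' mirrors 'minutes' almost exactly (charged by the minute)
print(corr.loc["total_day_charge", "total_day_minutes"])  # ~1.0

# Weak positive correlations with the target
print(corr["churn"].sort_values(ascending=False).head())
```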
To prepare the data for modeling, several steps had to be taken, as described below. All train/test splits maintain the default .75/.25 proportion. Since we know we have class imbalance, we set `stratify=y` so our class proportions stay the same for both our train and test data. Given our selected approach to these LogisticRegression models, slightly different steps were applied depending on the model.
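A minimal sketch of the split (the CSV path and `random_state` are illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("syriatel_churn.csv")  # hypothetical path
X = df.drop(columns="churn")
y = df["churn"]

# Default split is .75/.25; stratify=y preserves the ~14.5% churn rate
# in both the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)
```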
Before splitting our data between train and test, we performed some simple processing:
- Since the column names were formatted with a space between words, we transformed them to use an underscore, per standard column-name formatting
- `LabelEncoder` was used to transform the following categorical columns: `churn` (originally in binary True/False format) and `international_plan` and `voice_mail_plan` (in binary Yes/No format)
- Dropped the `phone_number` column, as there were no duplicate entries (as mentioned previously)
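A minimal sketch of these pre-split steps (the CSV path is hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("syriatel_churn.csv")  # hypothetical path

# Standardize column names: spaces -> underscores
df.columns = df.columns.str.replace(" ", "_")

# Encode the binary categorical columns as 0/1
for col in ["churn", "international_plan", "voice_mail_plan"]:
    df[col] = LabelEncoder().fit_transform(df[col])

# phone_number is a unique identifier (no duplicates), so drop it
df = df.drop(columns="phone_number")
```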
After splitting our data into train and test sets, we performed a couple of other transformations, depending on the model criteria.
For the first two models we used `OneHotEncoder` to transform the `area_code` and `state` categorical columns to numerical format. This left us with an `X_train` containing 69 features. In addition, we used SMOTE to resample our data and handle class imbalance.
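A sketch of this step, assuming `X_train`, `X_test`, and `y_train` from the stratified split above (SMOTE comes from the imbalanced-learn package):

```python
from imblearn.over_sampling import SMOTE
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the two categorical columns, pass the rest through
ct = ColumnTransformer(
    [("ohe", OneHotEncoder(handle_unknown="ignore"), ["area_code", "state"])],
    remainder="passthrough",
)
X_train_enc = ct.fit_transform(X_train)
X_test_enc = ct.transform(X_test)

# Oversample the minority (churn) class on the training data only
X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train_enc, y_train)
```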
For Model 2, we called on the `SelectFromModel` meta-transformer to aid with important-feature selection and reduce our number of features.
For our 3rd model, we took a manual approach and redefined the DataFrame criteria, which led us to conduct a fresh train/test split.
We decided to include only the variables correlated with churn, dropping the features we had previously flagged as extremely highly correlated with each other. For this model, we were left with only the features seen below in a heatmap that no longer demonstrates any collinearity.
Since transforming column names applies to the DataFrame as a whole, we did not have to repeat that step; it was already complete as the first pre-processing step.
Since we redefined a new X and y, we also applied SMOTE to this fresh dataset, creating reduced versions of our training data.
Since we eliminated any categorical columns in need of transformation, the `OneHotEncoder` was not necessary for this model.
Focusing on predicting churn, we will finetune to the recall metric, to ensure we predict as many True Positives (customers predicted to churn who do churn) and reduce as many False Negatives (customers predicted to be retained who churn) as possible.
Our business initiatives are not high-risk, so a somewhat disproportionate number of False Positives (customers predicted to churn who are actually retained) is tolerated; our approach to this group of customers is addressed in the evaluation and recommendations.
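To make the terminology concrete, here is a toy example of recall and precision computed from a confusion matrix (the labels are made up purely for illustration):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels: 1 = churned, 0 = retained
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 0, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}")

# recall = TP / (TP + FN): share of actual churners we catch
print("recall:", recall_score(y_true, y_pred))        # 3 / (3 + 1) = 0.75
# precision = TP / (TP + FP): share of flagged customers who truly churn
print("precision:", precision_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
```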
We build our base model with DummyClassifier using the stratified strategy since we have an imbalanced dataset skewed in the direction of class 0 when we are interested in predicting class 1. We do not apply SMOTE here to get truly baseline results.
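A minimal sketch of the baseline, assuming the encoded (pre-SMOTE) training data from above:

```python
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Predicts by sampling from the observed class frequencies; no SMOTE,
# so the scores reflect a true do-nothing baseline
dummy = DummyClassifier(strategy="stratified", random_state=42)
print(cross_val_score(dummy, X_train_enc, y_train, scoring="accuracy").mean())
```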
From the get-go, our base model produced an average accuracy score of ~0.75. This is a good start and gives us confidence to proceed with improving our recall while still maintaining fairly balanced results.
To start, we evaluate a basic LogisticRegression model before applying SMOTE. We use the default L2 penalty with this initial model.
Comparing our first LogisticRegression model with our base, we can see that our LogisticRegression model does somewhat better at predicting churn with a higher True Positive Rate than our base.
To choose the right penalty, we run this model with both the L1 and L2 penalties. It looks like the Logistic L1 model does better than both previous models, but only slightly. However, our class imbalance makes it difficult to assess accurately and needs to be addressed.
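A sketch of the comparison, again on the encoded (pre-SMOTE) training data; `liblinear` is one solver that supports the L1 penalty:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Compare L1 vs. L2 penalties with solvers that support them
for penalty, solver in [("l1", "liblinear"), ("l2", "lbfgs")]:
    logreg = LogisticRegression(penalty=penalty, solver=solver, max_iter=1000)
    recall = cross_val_score(logreg, X_train_enc, y_train, scoring="recall").mean()
    print(f"{penalty}: mean recall = {recall:.3f}")
```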
After applying SMOTE for an even 0:1 class balance, we cross-validate our model with the `ModCrossVal` class (created to make cross validation an easier process), specifying `scoring='recall'`. Our model performs nearly the same on the train and test (validation) data. We can probably get this even higher after we simplify our model some more.
Finetuning C with Cross Validation: Creating a loop to test C values [0.0001, 0.001, 0.01, 0.1, 1], we find that the lowest C yields the highest recall.
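A sketch of the loop, using plain `cross_val_score` in place of the project's `ModCrossVal` helper and the SMOTE-resampled training data from above:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Smaller C = stronger regularization; sweep and compare mean recall
for C in [0.0001, 0.001, 0.01, 0.1, 1]:
    logreg = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    recall = cross_val_score(logreg, X_train_res, y_train_res, scoring="recall").mean()
    print(f"C={C}: mean recall = {recall:.3f}")
```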
Our optimized results after finetuning C look pretty good, though slightly lower than before optimizing C. Once we attempt to simplify some more, we will want to look at other scores, such as accuracy and precision, to make sure our results are balanced enough for the business problem at hand.
As previously stated, we know that there are features that are highly correlated. We use `SelectFromModel` to select the most important features for us. After this additional preprocessing, we run and cross-validate using the same `ModCrossVal` class.
We will use the default threshold to start and identify which features meet the threshold requirement. Since we are still using our L1 Logistic model, the default threshold will be 1e-5 (scikit-learn's default for L1-penalized estimators).
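A sketch of the selection step, again on the resampled training data from above:

```python
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# With an L1-penalized estimator and threshold=None, scikit-learn applies
# a default threshold of 1e-5, dropping features with near-zero coefficients
selector = SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear"))
X_train_sel = selector.fit_transform(X_train_res, y_train_res)
print(X_train_sel.shape[1], "of", X_train_res.shape[1], "features kept")
```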
Before finetuning, our selected-feature model did around the same as our Logistic L1 model before finetuning. It is worth noting that this is a simpler model, as it has fewer features.
Finetuning C with Cross Validation: Just like our Logreg L1 model, and using the same test C values, the Logreg Select model does best with smaller C values, so we will use the smallest value in our optimized model.
Our Logistic Select model did pretty well, though it performed around the same as our first Logistic model after optimization.
For our final iteration of the LogisticRegression model, we try manual feature selection with features we know to be highly correlated with churn.
Here we redefined our DataFrame:
Before finetuning, and after performing a new split and re-applying SMOTE to the fresh data, we run our results. Our model performs slightly worse than our previous two.
Finetuning C with Cross Validation: Using a different set of test C values [0.00015, 0.0002, 0.0015, 0.002, 0.015], the smallest C value gives us the best results. We will again use the smallest value for our optimized results.
We get a pretty good recall score after optimizing! We will definitely want to make sure we balance accuracy within our decision making process. All in all, it seems like our manual feature selection yields the best recall.
Comparing confusion matrices of all 3 LogisticRegression models, our most recent Logistic Reduced model does best at predicting True Positives (customers going to churn) and reducing False Negatives (customers appearing to be retained but who actually churn).
This can provide valuable intervention insights to our stakeholders, given a strategic approach to addressing the high number of False Positives (customers who appear likely to churn but actually end up retained).
We will now run our models on the test data and evaluate each model's classification report. As expected, our 3rd model produces the highest recall.
However, we can also clearly see that all of our LogisticRegression models are underfitting our training data, resulting in a higher recall score on the test data. We may want to try a DecisionTreeClassifier to balance this detriment.
As stated, we will now construct and run a DecisionTreeClassifier on the dataset defined in our most recent model. We will also call on GridSearchCV to help us find the parameters that yield the best recall score for our decision tree. No other preparation is necessary, as it was addressed when prepping our previous model.
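A sketch of the search; the parameter grid is illustrative rather than the exact grid used, and `X_train_res`/`y_train_res` stand for the reduced, SMOTE-resampled training data from the previous model:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Illustrative hyperparameter grid
param_grid = {
    "max_depth": [3, 5, 7, None],
    "min_samples_leaf": [1, 5, 10],
    "criterion": ["gini", "entropy"],
}

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    scoring="recall",  # optimize for recall, per the business goal
    cv=5,
)
grid.fit(X_train_res, y_train_res)
print(grid.best_params_, grid.best_score_)
```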
Our results look great! These recall scores are the highest we've seen, even after optimizing the other models. We are also not seeing any overfitting or underfitting, since the train and test (validation) scores are balanced. However, this was also the case with our other models, so we have to run the test to be certain that this model doesn't pose the same issue.
Run Final Model on Test:
To confirm that our model is not underfitting the training data, we now run the DecisionTreeClassifier on our test data. Our results look good! Our recall on the test set is slightly lower than on the train set, but overall they look balanced. Our DecisionTreeClassifier model has outperformed all of our LogisticRegression models.
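A sketch of the final check, assuming `grid` from the search above and the corresponding reduced-feature test set:

```python
from sklearn.metrics import classification_report, recall_score

final_tree = grid.best_estimator_    # best tree found by the grid search
y_pred = final_tree.predict(X_test)  # reduced-feature test set

print("test recall:", recall_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=["retained", "churned"]))
```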
After becoming aware of the underfitting issues with our LogisticRegression models and running a DecisionTreeClassifier, it is clear that the latter is the right choice for Phase 1 of the business initiative. This model provides the highest recall (True Positive Rate) and most closely satisfies the goals. Below we go into detail regarding this decision, including additional recommendations on the intervention approach.
As this is Phase 1 of the project, we are hyper focused on identifying True Positive cases while reducing False Negative instances. Therefore, we are primarily focused on recall or true positive rate.
To account for our recall-focused path, a variety of low-touch to high-touch engagement models is recommended, to account for the high number of False Positives within these models. Starting with an automated low-touch model to gather data on the satisfaction of customers predicted to churn will yield the best results. Acting on the collected feedback with a scaled approach will be crucial and will create a positive customer experience for all.
Positive Implications:
Customer Retention: High recall means that your model is effective at identifying customers who are likely to churn. This allows the business to proactively intervene and take steps to retain these customers, such as offering incentives, personalized promotions, or improved customer service.
Reduced Churn: By effectively targeting at-risk customers, you may be able to reduce the overall churn rate, leading to increased customer retention and long-term profitability.
Negative Implications:
Costs: A low precision score means that there may be a significant number of false positives, leading to unnecessary costs associated with retaining customers who were not actually at risk of churning. These costs may include incentives or discounts offered to retain customers.
Customer Experience: Misclassifying customers who were not actually at risk of churning as "churners" may lead to unnecessary interventions or communications, potentially impacting the customer experience negatively.
Data Limitation and Future Considerations:
In Phase 2 of the business initiative, when looking to optimize our results and produce the most accurate prediction of customers likely to churn, it may be best to use a combination of classifier models to balance precision and recall. However, given the need to edit the training data, this posed an issue in the current phase.
We would also recommend gathering additional data to account for class imbalance and revising which features hold importance in relation to churn. Obtaining a larger dataset will also help resolve the underfitting issues we saw in our LogisticRegression models.
By simplifying the data before modeling, we are more likely to yield positive results and open up options to combine models using the same training data for a more balanced learning mechanism.