# Predicting Customer Churn

"Customer churn" is a term used to describe when a company loses customers.  The example in this notebook involves a scenario familiar to many: switching mobile phone providers.  If my provider could identify my intent to switch early, it could entice me to stay with timely offers, such as a phone upgrade or a new service feature.

Machine learning can help us identify patterns in the data for customers who left in the past, thus helping us prevent the same churn in the future.

The data for this notebook is included in `churn.csv`.

# 1. Setup

Import the necessary libraries.  NOTE: If you're using an environment other than Google Colab, you may need to first install some of these using `!pip`.
*   **pandas**: A library for organizing and manipulating data, making it easy to work with tables.
*   **numpy**: A library for numerical computing in Python, providing support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays.
*   **matplotlib**: A plotting library for Python, used to create static, animated, and interactive graphs and charts.
*   **seaborn**: A statistical data visualization library built on top of matplotlib, offering a higher-level interface for drawing statistical graphics.
*   **xgboost**: An optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable, used for building machine learning models, especially for solving data science challenges involving structured data.





In [None]:
# If you need to install libraries, you can do so using this syntax
# !pip install xgboost

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb

# 2. Data Cleansing and Exploratory Data Analysis

It all starts with data.  Load the data from `churn.csv` into a DataFrame, then explore and visualize it to understand what we're working with and which features will matter most when we train the model.

In [None]:
# LOAD THE DATA

# We'll use pandas to load the CSV data into a DataFrame
# If using Google Colab, the file should be uploaded to /content folder
churn = pd.read_csv('churn.csv')

# Display the first five rows of the dataset to understand its structure
churn.head()

In [None]:
# CLEAN THE DATA

# Check for missing values
# If there are any, you can fill them with a placeholder or drop the affected rows, depending on the context
# This dataset has no missing values, so we just print the per-column counts to confirm
print(churn.isnull().sum())

# Standardize data for boolean columns, making everything "True" or "False"
churn["Churn?"] = churn["Churn?"].map({'False.': False, 'True.': True})
churn["Int'l Plan"] = churn["Int'l Plan"].map({'no': False, 'yes': True})
churn["VMail Plan"] = churn["VMail Plan"].map({'no': False, 'yes': True})

In [None]:
# EXPLORE AND VISUALIZE THE DATA

# Visualize the distribution of customer churn
sns.countplot(x='Churn?', data=churn)
plt.title('Distribution of Customer Churn')
plt.show()

# Calculate the percentage of each class (False = stayed, True = churned)
churn_percentage = churn['Churn?'].value_counts(normalize=True) * 100
# Print the percentage of customers who churned; look it up by the label True rather than by position
print(f"Percentage of customers who churned: {churn_percentage.loc[True]:.2f}%")

There is some imbalance in the data, as only 14.49% of customers churned, but this is not extreme imbalance.
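
Even moderate imbalance makes plain accuracy misleading. As a quick worked example using the 14.49% figure above, a hypothetical baseline that predicts "no churn" for every customer would still be right most of the time while identifying no churners at all. The variable names below are illustrative only.

```python
# Worked example: accuracy of a naive "nobody churns" baseline.
# churn_rate is the percentage reported above; everything else is illustrative.
churn_rate = 14.49  # percent of customers who churned

# Predicting "no churn" for everyone is correct for every non-churner...
baseline_accuracy = 100 - churn_rate
# ...but it identifies none of the customers who actually churned.
baseline_recall = 0.0

print(f"Baseline accuracy: {baseline_accuracy:.2f}%")
print(f"Churners identified: {baseline_recall:.0f}%")
```

This is one reason a ranking-based metric such as AUC tends to be more informative than accuracy for a problem like this.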

In [None]:
# Summary statistics and histograms for each numeric feature, to see how the values are distributed
display(churn.describe())
%matplotlib inline
hist = churn.hist(bins=30, sharey=True, figsize=(10, 10))

Most numeric features are well-distributed, showing a bell-shaped (or Gaussian) distribution.  Many machine learning algorithms perform best when features are roughly normal, although tree-based models such as XGBoost are less sensitive to this.  The exceptions are `VMail Message` and `Area Code`, which we'll revisit below.

In [None]:
# Frequency tables for each categorical feature
for column in churn.select_dtypes(include=['object', 'bool']).columns:
    # Using 'display' to show the DataFrame and 'pd.crosstab()' to create the frequency table
    display(column, pd.crosstab(index=churn[column], columns='% observations', normalize='columns') * 100)  # Multiplying by 100 to convert to percentage

This exploration tells us:
*   `State`: Fairly well distributed; we will check correlations later
*   `Phone`: With all the unique values, it will be hard to use this as a feature.  We should drop this, along with `Area Code`, as it goes with `Phone`.
*   `Int'l Plan`: We will check correlations later
*   `VMail Plan`: We will check correlations later
*   `Churn?`: The target; included here because it's categorical (meaning it can take on just a fixed number of possible values)

In [None]:
# Drop Area Code and Phone from the dataset
churn = churn.drop(['Phone', 'Area Code'], axis=1)

In [None]:
# Two ways to visualize the correlation between features and Churn (heatmap and bar chart)
# Compute the correlation matrix once; numeric_only=True (pandas 1.5+) skips the text column State
correlation_matrix = churn.corr(numeric_only=True)

# Correlation heatmap of the numeric and boolean features
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f")
plt.title('Correlation Heatmap')
plt.show()

# Focus on the Churn column, sorted from most positive to most negative
churn_correlations = correlation_matrix["Churn?"].sort_values(ascending=False)

# Plotting the correlations with Churn
plt.figure(figsize=(8, 10))
sns.barplot(x=churn_correlations.values, y=churn_correlations.index)
plt.title("Correlation with Churn")
plt.xlabel("Correlation Coefficient")
plt.show()

churn_correlations

This exploration tells us:

*   `Int'l Plan`: There's a positive correlation (0.26) between having an international plan and churn, suggesting customers with international plans are more likely to churn. This could be due to various factors such as cost, satisfaction with international service, or other reasons that warrant further investigation.

*   `CustServ Calls`: The number of customer service calls is positively correlated (0.21) with churn, indicating that customers who contact customer service more frequently are more likely to leave. This might reflect issues with service satisfaction or unresolved problems.

*   `Day Mins` and `Day Charge`: Both of these features show a positive correlation (about 0.21) with churn, suggesting that higher day time usage (and the associated charges) could be a factor in customers' decision to churn.

*   `VMail Plan` and `VMail Message`: These features are negatively correlated with churn (-0.10 and -0.09, respectively), indicating that customers who use voicemail services are less likely to churn. This could be interpreted as an indicator of customer engagement or satisfaction with the service.

*   `Intl Calls`: Interestingly, the number of international calls is negatively correlated (-0.05) with churn, which might suggest that customers making more international calls are less likely to leave, contrasting with the positive correlation seen with having an international plan.
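
One way to read these coefficients: `Churn?` is a True/False column, which is treated as 1/0 when the correlation is computed. A positive value therefore means the feature's average is higher among churners than among non-churners, and a negative value means it is lower. The sketch below uses made-up numbers (not taken from `churn.csv`) to show that the covariance, which determines the sign of the correlation, equals the gap between the two group means scaled by `p * (1 - p)`, where `p` is the share of churners.

```python
# Hypothetical toy data: customer service calls and whether the customer churned
calls = [1, 0, 2, 5, 1, 4, 3, 0]
churned = [False, False, False, True, False, True, True, False]

n = len(calls)
flags = [int(c) for c in churned]  # True/False become 1/0
mean_calls = sum(calls) / n
p = sum(flags) / n  # share of customers who churned

# Covariance between the feature and the 1/0 churn flag
covariance = sum((x - mean_calls) * (f - p) for x, f in zip(calls, flags)) / n

# Average number of calls within each group
mean_churned = sum(x for x, f in zip(calls, flags) if f) / sum(flags)
mean_stayed = sum(x for x, f in zip(calls, flags) if not f) / (n - sum(flags))

print(f"covariance: {covariance:.2f}")
print(f"mean calls (churned): {mean_churned:.2f}, mean calls (stayed): {mean_stayed:.2f}")
print(f"p * (1 - p) * (difference in means): {p * (1 - p) * (mean_churned - mean_stayed):.2f}")
```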


In [None]:
# Scatter matrix to visualize pairwise relationships between the numeric features (the boolean target "Churn?" is excluded)
pd.plotting.scatter_matrix(churn.select_dtypes(include=[np.number]), figsize=(12, 12), alpha=0.2)
plt.show()

This exploration tells us:


*   Some pairs of features (the ones whose scatter plot forms a straight diagonal line) are perfectly correlated with one another, such as `Day Charge` and `Day Mins`, or `Night Charge` and `Night Mins`
*   Redundant features like these add no new information and can cause problems when we train the model later, so we'll remove the "Charge" features
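
A perfect correlation is what you would expect if each charge is simply the minutes multiplied by a fixed rate. That is an assumption here, since the dataset's documentation is not part of this notebook. The sketch below uses made-up columns and a hypothetical rate of 0.17 per minute to show how such pairs could be detected programmatically instead of by eye; `pearson` and `perfectly_correlated_pairs` are illustrative helpers, not library functions.

```python
from itertools import combinations
from math import sqrt

# Hypothetical toy columns; the 0.17 per-minute rate is made up for illustration
day_mins = [120.0, 250.5, 180.2, 95.0, 310.7]
features = {
    "Day Mins": day_mins,
    "Day Charge": [0.17 * m for m in day_mins],  # exact linear function of Day Mins
    "Day Calls": [110, 95, 102, 88, 120],
}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def perfectly_correlated_pairs(columns, tolerance=1e-9):
    """Return the pairs of column names whose correlation is +1 or -1 (within tolerance)."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        if abs(abs(pearson(a, b)) - 1.0) <= tolerance:
            pairs.append((name_a, name_b))
    return pairs

redundant = perfectly_correlated_pairs(features)
print(redundant)  # [('Day Mins', 'Day Charge')]
```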

In [None]:
# Drop features for Day Charge, Eve Charge, Night Charge, Intl Charge to remove the 100% correlation issue
churn = churn.drop(["Day Charge", "Eve Charge", "Night Charge", "Intl Charge"], axis=1)

In [None]:
# One-hot encode the State column so we have 0s and 1s rather than string data
churn_encoded = pd.get_dummies(churn, columns=['State'])

# 3. Building and Training the Model

In this section, we split the data into training, validation and test sets.  Then we build the model, specifying its parameters.  Finally, we train the model on the training dataset, evaluating it with the validation dataset.

In [None]:
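# Split the data into training, validation and test sets

# Randomly shuffle the dataset, then slice it into training (70%), validation (20%), and testing (10%) sets
shuffled = churn_encoded.sample(frac=1, random_state=1729)
train_end = int(0.7 * len(shuffled))  # End of the training rows
val_end = int(0.9 * len(shuffled))    # End of the validation rows (70% + 20%)
train_data = shuffled.iloc[:train_end]
validation_data = shuffled.iloc[train_end:val_end]
test_data = shuffled.iloc[val_end:]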

In [None]:
# "Churn?" is the target variable; every other column is a feature
# Separate the features and the target variable for each dataset
X_train = train_data.drop("Churn?", axis=1)
y_train = train_data["Churn?"]
X_val = validation_data.drop("Churn?", axis=1)
y_val = validation_data["Churn?"]
X_test = test_data.drop("Churn?", axis=1)
y_test = test_data["Churn?"]

# Convert the datasets into DMatrix format for XGBoost
# XGBoost works well with DMatrix, a data structure optimized for both memory efficiency and training speed
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)
dtest = xgb.DMatrix(X_test, label=y_test)

In [None]:
# Specify parameters for the model
params = {
    'max_depth': 3,  # The maximum depth of the decision tree; can be adjusted (typically ranges 3-10)
    'eta': 0.1,  # The learning rate (how much the model adjusts itself in response to errors); can be adjusted
    'objective': 'binary:logistic',  # We're choosing this because it's a binary problem (customers either churn or they don't)
                                     # Logistic means it will output a probability (the probability that a customer will leave)
    'eval_metric': 'auc',  # Evaluation metrics for validation data, e.g., "auc" or "Area Under the Curve" for binary classification
}
num_rounds = 100  # The number of rounds for boosting


In [None]:
# Train the model using the training dataset and evaluate it using the validation dataset
evallist = [(dval, 'eval'), (dtrain, 'train')]
bst = xgb.train(params, dtrain, num_rounds, evallist)

AUC values range from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0.
AUC values between 0.5 and 1.0 indicate useful models. A value of 0.5 suggests a model that does no better than random chance.
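
A helpful way to interpret AUC is as the probability that the model gives a randomly chosen churner a higher score than a randomly chosen non-churner. The sketch below computes it that way on a few made-up labels and scores. The function name `pairwise_auc` is illustrative, and this pairwise loop is meant for building intuition on small inputs rather than as a replacement for a library implementation.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y]
    negatives = [s for y, s in zip(labels, scores) if not y]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up labels: the last two customers churned
labels = [False, False, True, True]

print(pairwise_auc(labels, [0.1, 0.4, 0.35, 0.8]))  # 0.75: one of four pairs is ranked wrongly
print(pairwise_auc(labels, [0.1, 0.2, 0.8, 0.9]))   # 1.0: every churner outranks every non-churner
print(pairwise_auc(labels, [0.9, 0.8, 0.2, 0.1]))   # 0.0: every pair is ranked wrongly
print(pairwise_auc(labels, [0.5, 0.5, 0.5, 0.5]))   # 0.5: the scores carry no information
```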

Higher AUC values indicate better model performance. In our model, the AUC for both the evaluation set (eval-auc) and the training set (train-auc) increases over the iterations, suggesting the model is learning and improving its ability to distinguish between the classes over time.


# 4. Evaluating the Model



Initially, the gap between training and evaluation AUC is moderate, but it widens as training progresses. For instance, towards the end (around iteration 99), the train-auc is 0.97169 while the eval-auc is 0.87690. This growing gap suggests that the model is beginning to overfit: it keeps getting better at predicting the training data, but not the evaluation data.  In other words, it is fitting the idiosyncrasies of the training data rather than capturing generalizable patterns that apply to unseen data.

**Actions to Consider**:

**Early Stopping**: Halt training once the evaluation metric has not improved for a specified number of rounds. XGBoost supports this through the `early_stopping_rounds` argument.

**Regularization**: Increase the regularization parameters (`lambda`, `alpha`) to penalize more complex models and thus mitigate overfitting.

**Parameter Tuning**: Adjust other hyperparameters, like `max_depth`, `min_child_weight`, and `subsample`, to help control model complexity and fit.

**Cross-Validation**: Use XGBoost's built-in cross-validation function (`xgb.cv`) to assess model performance more robustly. This can give a better indication of how the model will perform on unseen data.

**Feature Engineering**: Revisit your features to ensure they're relevant and not leading to overfitting. Removing irrelevant or highly correlated features can sometimes improve model generalizability.
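
The early-stopping idea listed above can be sketched in a few lines. This is a simplified illustration of the logic (keep the best round seen so far, and stop once a set number of rounds pass without improvement), not XGBoost's actual implementation. The function name, the `patience` parameter, and the AUC values are all hypothetical.

```python
def best_round_with_patience(eval_scores, patience):
    """Return (best_round, stopped_round) for a metric where higher is better."""
    best_round = 0
    best_score = eval_scores[0]
    for current_round, score in enumerate(eval_scores):
        if score > best_score:
            # New best: remember it and keep training
            best_round, best_score = current_round, score
        elif current_round - best_round >= patience:
            # No improvement for `patience` rounds: stop here
            return best_round, current_round
    # Ran out of rounds without triggering early stopping
    return best_round, len(eval_scores) - 1

# Hypothetical validation AUC after each boosting round
eval_auc = [0.80, 0.84, 0.86, 0.87, 0.869, 0.868, 0.8695, 0.866]

best, stopped = best_round_with_patience(eval_auc, patience=3)
print(f"Best round: {best}, training stopped at round: {stopped}")
```

In this toy run the best score occurs at round 3, and training stops at round 6 after three rounds without improvement, so the final round in the list is never reached.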

In [None]:
# Implementing early stopping (i.e., train the model to learn from data until it stops getting better)

bst = xgb.train(
    params,
    dtrain,
    num_boost_round=1000,  # Set to a high number intentionally; training may stop much earlier
    evals=[(dval, "eval")],
    early_stopping_rounds=10  # Stops after 10 rounds of no improvement on the eval dataset
)

# After training, you can access the best iteration
# This is the number of the training round where the model performed the best on the validation dataset before it stopped improving
# We'll use this in the next code block
print(f"Best iteration: {bst.best_iteration}")


In [None]:
# Make predictions with the test dataset on the final (best) model

# Predicting the probabilities for the positive class ("Churn")
y_pred_proba = bst.predict(dtest, iteration_range=(0, bst.best_iteration + 1))

# The scikit-learn library is great for calculating AUC and other metrics, though this could also be calculated manually
from sklearn.metrics import roc_auc_score

# Print the final AUC score
auc_score = roc_auc_score(y_test, y_pred_proba)
print(f"Final AUC on Test Dataset: {auc_score:.3f}")

An AUC score above 0.9 is often considered excellent. It indicates that the model separates the positive and negative classes well, suggesting good predictive ability for the task at hand.