# How to create a churn prediction strategy

No single churn prediction model is a perfect fit for every business. So how do you pick a model and a strategy for churn prediction that reflects the realities and nuances of your business? Here's how to approach this process:

* **Get to know your business and customers**: Develop a thorough knowledge of your business and customers. What makes your customers stay with the business, and what might drive them away?

* **Choose the right data**: Identify the types of data most relevant to your business and customer behavior. This might include transaction history, customer service interactions, or social media activity. Make sure the data is accessible and usable.

* **Involve your team**: Solicit input from teams within your organization, such as sales, customer service, and IT. Their insights can help you learn about different aspects of customer interactions and technical feasibility.

* **Select an appropriate model**: There's no one-size-fits-all model for churn prediction. Your choice should depend on the nature of your business and the type of data you have. Examine your options closely and determine what might work best for you.

* **Prepare and clean your data**: Before you can use your data, it needs to be cleaned and organized.

* **Build and test the model**: Once your data is ready, build your predictive model. Test it thoroughly to verify its accuracy and effectiveness.

* **Regularly update and refine your model**: Customer behaviors and market conditions change over time. Updating and refining your model regularly is necessary to keep it relevant and effective.

* **Turn insights into action**: The final step is to use the insights from your churn prediction model to inform your business strategies. This might involve adjusting your marketing strategy, improving customer service, or making changes to your product.
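
For the "build and test" step, one common way to test a model is stratified k-fold cross-validation. The sketch below is illustrative only: it uses a synthetic dataset from `make_classification` as a stand-in for real customer data, and logistic regression as an arbitrary example model. All variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for customer data: roughly 25% of rows are "churners".
X_cv, y_cv = make_classification(
    n_samples=1000, n_features=10, weights=[0.75, 0.25], random_state=0
)

# Stratified folds keep the churn rate similar in every fold.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Recall on the churn class: the share of actual churners the model catches.
recall_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_cv, y_cv, cv=folds, scoring="recall"
)
print("Recall per fold:", recall_scores.round(3))
print("Mean recall:", round(float(recall_scores.mean()), 3))
```

Recall is used here instead of accuracy because, in churn problems, missing a customer who is about to leave is usually the costlier mistake.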

[Learn more about choosing the best prediction model for your business](https://stripe.com/resources/more/churn-prediction-101-how-to-choose-the-best-prediction-model-for-your-business#neural-networks)

<center><font size="4">In this notebook, I propose building a churn prediction model with ensemble methods, despite the limited amount of data available.<br>
As with any model, more data and deeper insight bring us closer to why customers churn, so that the problem can be solved.</font></center>

In [26]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [27]:
file_path = '/content/drive/MyDrive/Datasets/predict_churn.csv'

In [28]:
import pandas as pd
df = pd.read_csv(file_path)

In [29]:
df.head()

Unnamed: 0.1,Unnamed: 0,SeniorCitizen,Partner,Dependents,PhoneService,TechSupport,StreamingTV,StreamingMovies,PaperlessBilling,MonthlyCharges,...,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check,tenure_group_1 - 12,tenure_group_13 - 24,tenure_group_25 - 36,tenure_group_37 - 48,tenure_group_49 - 60,tenure_group_61 - 72
0,0,0,1,0,0,0,0,0,1,29.85,...,0,0,1,0,1,0,0,0,0,0
1,1,0,0,0,1,0,0,0,0,56.95,...,0,0,0,1,0,0,1,0,0,0
2,2,0,0,0,1,0,0,0,1,53.85,...,0,0,0,1,1,0,0,0,0,0
3,3,0,0,0,0,1,0,0,0,42.3,...,1,0,0,0,0,0,0,1,0,0
4,4,0,0,0,1,0,0,0,1,70.7,...,0,0,1,0,1,0,0,0,0,0


In [30]:
df=df.drop('Unnamed: 0',axis=1)

In [31]:
X=df.drop('Churn',axis=1)
X

Unnamed: 0,SeniorCitizen,Partner,Dependents,PhoneService,TechSupport,StreamingTV,StreamingMovies,PaperlessBilling,MonthlyCharges,TotalCharges,...,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check,tenure_group_1 - 12,tenure_group_13 - 24,tenure_group_25 - 36,tenure_group_37 - 48,tenure_group_49 - 60,tenure_group_61 - 72
0,0,1,0,0,0,0,0,1,29.85,29.85,...,0,0,1,0,1,0,0,0,0,0
1,0,0,0,1,0,0,0,0,56.95,1889.50,...,0,0,0,1,0,0,1,0,0,0
2,0,0,0,1,0,0,0,1,53.85,108.15,...,0,0,0,1,1,0,0,0,0,0
3,0,0,0,0,1,0,0,0,42.30,1840.75,...,1,0,0,0,0,0,0,1,0,0
4,0,0,0,1,0,0,0,1,70.70,151.65,...,0,0,1,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7027,0,1,1,1,1,1,1,1,84.80,1990.50,...,0,0,0,1,0,1,0,0,0,0
7028,0,1,1,1,0,1,1,1,103.20,7362.90,...,0,1,0,0,0,0,0,0,0,1
7029,0,1,1,0,0,0,0,1,29.60,346.45,...,0,0,1,0,1,0,0,0,0,0
7030,1,1,0,1,0,0,0,1,74.40,306.60,...,0,0,0,1,1,0,0,0,0,0


In [32]:
y=df['Churn']
y

0       0
1       0
2       1
3       0
4       1
       ..
7027    0
7028    0
7029    0
7030    1
7031    0
Name: Churn, Length: 7032, dtype: int64
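
Before modeling, it can help to measure how skewed the target is. The sketch below rebuilds a target column with the same class counts that `Counter(y)` reports later in this notebook (5,163 zeros and 1,869 ones) and derives the churn rate and the accuracy of a trivial "always predict no churn" baseline. The names `y_demo`, `class_share`, `churn_rate`, and `baseline_accuracy` are hypothetical.

```python
import pandas as pd

# Rebuild a target column with the same class counts as this dataset
# (5,163 non-churn rows and 1,869 churn rows).
y_demo = pd.Series([0] * 5163 + [1] * 1869, name="Churn")

# Share of rows in each class.
class_share = y_demo.value_counts(normalize=True)
churn_rate = class_share.loc[1]

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = class_share.max()

print(f"Churn rate: {churn_rate:.4f}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

Any accuracy figure on the original data is best read against this baseline, since a model that never predicts churn already gets most rows right.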

<center><font size="7">Train, Test, Split</font></center>

In [34]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from xgboost import XGBClassifier
from imblearn.combine import SMOTEENN
from collections import Counter




# Print original class distribution
print("Original class distribution:", Counter(y))

# Preprocess the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply SMOTEENN
smoteenn = SMOTEENN(random_state=42)
X_resampled, y_resampled = smoteenn.fit_resample(X_scaled, y)

# Print resampled class distribution
print("Resampled class distribution:", Counter(y_resampled))

# Split the resampled data
x_train, x_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)

# Initialize and train models
models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    # `estimator` replaces `base_estimator`, which was deprecated in scikit-learn 1.2 and removed in 1.4
    "Bagging": BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=100, max_samples=0.25, bootstrap=False, random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(random_state=42),
    "XGBoost": XGBClassifier(random_state=42)
}

# Train models and get predictions
predictions = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    predictions[name] = model.predict(x_test)

# Calculate accuracies
accuracies = {name: accuracy_score(y_test, pred) for name, pred in predictions.items()}

# Combine predictions
all_preds = np.array(list(predictions.values())).T
combined_pred = np.apply_along_axis(lambda x: np.bincount(x).argmax(), axis=1, arr=all_preds)
accuracies["Combined Model"] = accuracy_score(y_test, combined_pred)

# Print individual model accuracies
for name, acc in accuracies.items():
    print(f"{name} Accuracy: {acc:.4f}")

# Find the best model
best_model_name = max(accuracies, key=accuracies.get)
print(f"\nThe best performing model is: {best_model_name} with an accuracy of {accuracies[best_model_name]:.4f}")

# Print classification report for the best individual model
print("\nClassification Report for the Best Model:")
if best_model_name != "Combined Model":
    print(classification_report(y_test, predictions[best_model_name]))
else:
    print(classification_report(y_test, combined_pred))

# Print classification report for the combined model
print("\nClassification Report for the Combined Model:")
print(classification_report(y_test, combined_pred))

Original class distribution: Counter({0: 5163, 1: 1869})
Resampled class distribution: Counter({1: 3624, 0: 2830})




Random Forest Accuracy: 0.9690
Bagging Accuracy: 0.9520
Decision Tree Accuracy: 0.9442
Gradient Boosting Accuracy: 0.9527
Logistic Regression Accuracy: 0.9179
XGBoost Accuracy: 0.9667
Combined Model Accuracy: 0.9651

The best performing model is: Random Forest with an accuracy of 0.9690

Classification Report for the Best Model:
              precision    recall  f1-score   support

           0       0.97      0.96      0.96       566
           1       0.97      0.98      0.97       725

    accuracy                           0.97      1291
   macro avg       0.97      0.97      0.97      1291
weighted avg       0.97      0.97      0.97      1291


Classification Report for the Combined Model:
              precision    recall  f1-score   support

           0       0.96      0.96      0.96       566
           1       0.97      0.97      0.97       725

    accuracy                           0.97      1291
   macro avg       0.96      0.96      0.96      1291
weighted avg       0.97
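
In the cell above, the scaler and SMOTEENN are applied to the full dataset before `train_test_split`, so the test set contains resampled rows and the reported scores may be optimistic. The sketch below shows one alternative ordering: split first, then fit the scaler and rebalance on the training rows only. It is not the author's method. It runs on synthetic data, and simple random oversampling via `sklearn.utils.resample` stands in for SMOTEENN so the example depends only on scikit-learn; the same ordering would apply if `SMOTEENN.fit_resample` were called on the training rows instead. All names ending in `_demo`, `_tr`, `_te`, or `_bal` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

# Synthetic, imbalanced stand-in for the churn data (roughly 27% positives).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, weights=[0.73, 0.27], random_state=42
)

# 1. Split first, so the test set holds only real, untouched rows.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# 2. Fit the scaler on the training rows only, then apply it to both sets.
scaler_demo = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler_demo.transform(X_tr), scaler_demo.transform(X_te)

# 3. Rebalance the training rows only (random oversampling of the minority class).
majority, minority = X_tr_s[y_tr == 0], X_tr_s[y_tr == 1]
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
X_bal = np.vstack([majority, minority_up])
y_bal = np.concatenate([
    np.zeros(len(majority), dtype=int),
    np.ones(len(minority_up), dtype=int),
])

# 4. Train on the balanced rows, evaluate on the original-distribution test set.
clf = RandomForestClassifier(random_state=42).fit(X_bal, y_bal)
pred = clf.predict(X_te_s)
print(f"Accuracy on untouched test set: {accuracy_score(y_te, pred):.4f}")
print(f"Churn recall on untouched test set: {recall_score(y_te, pred):.4f}")
```

With this ordering, the test set keeps the original class balance, so its scores are closer to what the model would see on new customers.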

In [35]:
from sklearn.decomposition import PCA

# Print original class distribution
print("Original class distribution:", Counter(y))

# Preprocess the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA(n_components=0.95)  # Keeps 95% of the variance
X_pca = pca.fit_transform(X_scaled)

print(f"Number of components after PCA: {pca.n_components_}")
print(f"Explained variance ratio: {sum(pca.explained_variance_ratio_):.4f}")

# Apply SMOTEENN
smoteenn = SMOTEENN(random_state=42)
X_resampled, y_resampled = smoteenn.fit_resample(X_pca, y)

# Print resampled class distribution
print("Resampled class distribution:", Counter(y_resampled))

# Split the resampled data
x_train, x_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)

# Initialize and train models
models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    # `estimator` replaces `base_estimator`, which was deprecated in scikit-learn 1.2 and removed in 1.4
    "Bagging": BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=100, max_samples=0.25, bootstrap=False, random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(random_state=42),
    "XGBoost": XGBClassifier(random_state=42)
}

# Train models and get predictions
predictions = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    predictions[name] = model.predict(x_test)

# Calculate accuracies
accuracies = {name: accuracy_score(y_test, pred) for name, pred in predictions.items()}

# Combine predictions
all_preds = np.array(list(predictions.values())).T
combined_pred = np.apply_along_axis(lambda x: np.bincount(x).argmax(), axis=1, arr=all_preds)
accuracies["Combined Model"] = accuracy_score(y_test, combined_pred)

# Print individual model accuracies
for name, acc in accuracies.items():
    print(f"{name} Accuracy: {acc:.4f}")

# Find the best model
best_model_name = max(accuracies, key=accuracies.get)
print(f"\nThe best performing model is: {best_model_name} with an accuracy of {accuracies[best_model_name]:.4f}")

# Print classification report for the best individual model
print("\nClassification Report for the Best Model:")
if best_model_name != "Combined Model":
    print(classification_report(y_test, predictions[best_model_name]))
else:
    print(classification_report(y_test, combined_pred))

# Print classification report for the combined model
print("\nClassification Report for the Combined Model:")
print(classification_report(y_test, combined_pred))

Original class distribution: Counter({0: 5163, 1: 1869})
Number of components after PCA: 22
Explained variance ratio: 0.9646
Resampled class distribution: Counter({1: 3469, 0: 2871})




Random Forest Accuracy: 0.9716
Bagging Accuracy: 0.9479
Decision Tree Accuracy: 0.9338
Gradient Boosting Accuracy: 0.9330
Logistic Regression Accuracy: 0.9125
XGBoost Accuracy: 0.9700
Combined Model Accuracy: 0.9574

The best performing model is: Random Forest with an accuracy of 0.9716

Classification Report for the Best Model:
              precision    recall  f1-score   support

           0       0.97      0.97      0.97       596
           1       0.97      0.97      0.97       672

    accuracy                           0.97      1268
   macro avg       0.97      0.97      0.97      1268
weighted avg       0.97      0.97      0.97      1268


Classification Report for the Combined Model:
              precision    recall  f1-score   support

           0       0.94      0.97      0.96       596
           1       0.97      0.95      0.96       672

    accuracy                           0.96      1268
   macro avg       0.96      0.96      0.96      1268
weighted avg       0.96

# Insights from Churn Prediction Models

## Insights from the code without PCA:

- The original class distribution is imbalanced: 5,163 non-churn (0) customers versus 1,869 churn (1) customers, so about 27% of customers churned.
- After applying SMOTEENN, the classes are closer in size (3,624 churn versus 2,830 non-churn), which keeps the models from simply favoring the majority class.
- The best performing individual model is "Random Forest" with an accuracy of 0.9690, followed by "XGBoost" at 0.9667.
- The combined model, a majority vote over the six individual models, achieves an accuracy of 0.9651, slightly below "Random Forest" and "XGBoost".
- The classification reports show precision and recall of 0.96 or higher for both classes. Because the train/test split is made after resampling, the test set contains SMOTEENN-generated samples, so these scores are likely optimistic compared with performance on untouched customer data.
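
The combined model is a majority vote: for each customer, the label predicted by most models wins. The small example below applies the same `np.bincount` rule the notebook uses to three hypothetical customers and six hypothetical model outputs, including a 3-3 tie. The array values are made up for illustration.

```python
import numpy as np

# Hypothetical predictions: one row per customer, one column per model.
all_preds_demo = np.array([
    [1, 1, 1, 1, 0, 0],  # four of six models predict churn
    [0, 0, 0, 0, 1, 0],  # five of six models predict no churn
    [1, 1, 1, 0, 0, 0],  # 3-3 tie
])

# Count the votes per label for each customer.
# minlength=2 guarantees a slot for both labels even if one gets no votes.
votes = np.apply_along_axis(
    lambda row: np.bincount(row, minlength=2), 1, all_preds_demo
)

# The label with the most votes wins; argmax returns the first maximum.
combined_demo = votes.argmax(axis=1)

print(votes)
print(combined_demo)
```

Because `argmax` returns the first maximum, a tie among an even number of models resolves to label 0 (no churn). Using an odd number of models, or breaking ties toward churn, are possible alternatives when missed churners are costly.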

## Insights from the code with PCA:

- After applying PCA with a 95% variance threshold, the features are reduced to 22 components that together explain 96.46% of the variance.
- The performance of the models remains relatively consistent after applying PCA. As in the run without PCA, the best performing individual model is "Random Forest", now with an accuracy of 0.9716, followed by "XGBoost" at 0.9700.
- The combined model achieves an accuracy of 0.9574 after applying PCA, slightly lower than the 0.9651 it reached without PCA.
- The classification reports show precision and recall of 0.94 or higher for both classes after applying PCA. As before, these scores are measured on a test set drawn from the resampled data.
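
Passing a float such as `0.95` as `n_components` asks scikit-learn's `PCA` to keep just enough components to reach that share of explained variance. The sketch below illustrates this on synthetic data (20 correlated features generated from 5 hidden factors, an assumption made purely for the demo) and compares the result against the cumulative variance of a full PCA. All names are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 correlated features driven by 5 hidden factors plus noise.
rng = np.random.default_rng(0)
factors = rng.normal(size=(500, 5))
X_demo = factors @ rng.normal(size=(5, 20)) + 0.3 * rng.normal(size=(500, 20))
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# A float n_components keeps enough components to reach that variance share.
pca_95 = PCA(n_components=0.95, svd_solver="full")
X_demo_reduced = pca_95.fit_transform(X_demo_scaled)

# Cumulative explained variance over all components, for comparison.
full_pca = PCA(svd_solver="full").fit(X_demo_scaled)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

kept = pca_95.n_components_
print("Components kept:", kept)
print("Variance explained by kept components:", round(float(cumulative[kept - 1]), 4))
```

The explained variance usually lands a little above the threshold, as it does in this notebook (96.46% for a 95% target), because components are added one at a time until the threshold is passed.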

## Overall insights:

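- Applying PCA does not significantly impact the performance of the models in this case: "Random Forest" moves from 0.9690 to 0.9716, and the combined model from 0.9651 to 0.9574.
- "Random Forest" achieves the highest accuracy in both runs, with "XGBoost" and the combined model close behind, suggesting that ensemble methods can be effective for churn prediction even with imbalanced data and limited features.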
- Further analysis and tuning of the models may be necessary to improve their performance and gain deeper insights into the factors contributing to customer churn.

In [36]:
import pickle

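# Save the fitted models behind the combined prediction. combined_pred is only
# an array of test-set predictions, so pickling it would not save a model.
# The scaler and PCA are stored too, since these models were trained on
# scaled, PCA-transformed features.
with open('/content/drive/MyDrive/Datasets/combined_model.pkl', 'wb') as f:
    pickle.dump({'scaler': scaler, 'pca': pca, 'models': models}, f)

# Save the best individual model (skipped if the majority vote scored best,
# because the vote has no single model object)
if best_model_name in models:
    best_model = models[best_model_name]
    with open('/content/drive/MyDrive/Datasets/best_model.pkl', 'wb') as f:
        pickle.dump(best_model, f)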
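
The combined prediction in this notebook is computed by hand from the individual models' outputs. One alternative, not used in the notebook, is scikit-learn's `VotingClassifier`, which wraps several estimators in a single fitted object that can be saved and reloaded like any other model. The sketch below is a minimal illustration on synthetic data with three arbitrary estimators; the names `vote_model` and `restored` are hypothetical.

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the churn features and labels.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

# voting="hard" takes the majority class label across the estimators.
vote_model = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=42)),
        ("tree", DecisionTreeClassifier(random_state=42)),
        ("logreg", LogisticRegression(max_iter=1000)),
    ],
    voting="hard",
)
vote_model.fit(X_demo, y_demo)

# The fitted ensemble is one object, so it can be serialized and restored.
restored = pickle.loads(pickle.dumps(vote_model))
same = np.array_equal(restored.predict(X_demo), vote_model.predict(X_demo))
print("Restored model reproduces predictions:", same)
```

In a real deployment, the same idea would be applied with `pickle.dump` to a file, and any preprocessing (scaling, PCA) would need to be saved alongside the ensemble.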


<font size="8">Let's Deploy the model</font>