#### Project Overview

Customer churn occurs when customers discontinue their relationship or subscription with a company or service provider. The churn rate is the proportion of customers who stop using a company's products or services within a specific period. Churn is an important business metric because it directly affects revenue, growth, and customer retention.

This project aims to develop a predictive model that identifies customers at high risk of churning. The model will identify the key indicators of churn and help inform the retention strategies that can be implemented.
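
As a small illustration of the churn rate defined above, the sketch below computes it from a toy `Churn` column. The five-row frame is made up for illustration and is not drawn from the project's dataset.

```python
import pandas as pd

# Hypothetical toy data: one row per customer at the start of the period
toy = pd.DataFrame({"Churn": ["No", "Yes", "No", "No", "Yes"]})

# Churn rate = customers lost during the period / customers at the start
churn_rate = (toy["Churn"] == "Yes").mean()
print(f"Churn rate: {churn_rate:.1%}")  # prints "Churn rate: 40.0%"
```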

##### Import Required Libraries

In [1]:
import pandas as pd
import numpy as np

# Data visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use("ggplot")

# Splitting, preprocessing, and scaling
from sklearn.model_selection import train_test_split, KFold, GridSearchCV, cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler, MinMaxScaler

# Class rebalancing
from imblearn.over_sampling import SMOTE

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

# Evaluation
from sklearn import metrics
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Saving models
import pickle
import joblib

import warnings
warnings.filterwarnings("ignore")

In [2]:
# Import the cleaned dataset

df = pd.read_csv(r"C:\Users\Harrison\Downloads\Customer churn\data\cleaned_churn_data.csv")

df.head(3)

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes


#### Feature Engineering

- Feature engineering involves creating new features or transforming existing ones to improve the performance of machine-learning models.

In [3]:
df['PhoneService'].value_counts()

PhoneService
Yes    6361
No      682
Name: count, dtype: int64

In [4]:
df['MultipleLines'].value_counts()

MultipleLines
No                  3390
Yes                 2971
No phone service     682
Name: count, dtype: int64

In [5]:
# Change 'No phone service' to 'No' in the 'MultipleLines' column

df['MultipleLines'] = df['MultipleLines'].replace('No phone service', 'No')

In [6]:
# Columns to change 'No internet service' to 'No'
columns_to_change = ['OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']  # Add more columns as needed

for column in columns_to_change:
    df[column] = df[column].replace('No internet service', 'No')
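
The loop above can also be written as a single vectorized call, since `DataFrame.replace` works on several columns at once. A minimal sketch on a made-up two-column frame (the toy values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical toy frame mimicking two of the service columns
toy = pd.DataFrame({
    "OnlineSecurity": ["Yes", "No internet service", "No"],
    "StreamingTV": ["No internet service", "Yes", "No"],
})
cols = ["OnlineSecurity", "StreamingTV"]

# Replace the placeholder in every listed column in one call
toy[cols] = toy[cols].replace("No internet service", "No")
print(toy)
```

Both forms give the same result; the vectorized version simply avoids the explicit loop.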

In [7]:
# Confirm the replacement of 'No internet service' with 'No':
# print the value counts for each updated column
for column in columns_to_change:
    print(f"Value counts for {column}:")
    print(df[column].value_counts())
    print("\n")

Value counts for OnlineSecurity:
OnlineSecurity
No     5024
Yes    2019
Name: count, dtype: int64


Value counts for OnlineBackup:
OnlineBackup
No     4614
Yes    2429
Name: count, dtype: int64


Value counts for DeviceProtection:
DeviceProtection
No     4621
Yes    2422
Name: count, dtype: int64


Value counts for TechSupport:
TechSupport
No     4999
Yes    2044
Name: count, dtype: int64


Value counts for StreamingTV:
StreamingTV
No     4336
Yes    2707
Name: count, dtype: int64


Value counts for StreamingMovies:
StreamingMovies
No     4311
Yes    2732
Name: count, dtype: int64




In [8]:
df.sample(5)

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
5262,Female,0,No,No,59,Yes,Yes,DSL,Yes,No,No,Yes,No,Yes,One year,No,Bank transfer (automatic),68.7,4070.95,No
6296,Female,1,Yes,No,31,No,No,DSL,Yes,Yes,Yes,No,No,Yes,One year,Yes,Credit card (automatic),50.4,1580.1,No
3812,Male,0,No,No,17,Yes,No,No,No,No,No,No,No,No,One year,Yes,Bank transfer (automatic),19.65,351.55,No
4081,Female,0,No,No,1,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,69.6,69.6,Yes
2260,Female,0,No,No,3,Yes,No,Fiber optic,No,No,Yes,Yes,No,No,Month-to-month,No,Credit card (automatic),80.1,217.55,Yes


In [9]:
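# Creating new numerical features
# Note: any row with TotalCharges == 0 or tenure == 0 would produce inf or NaN here;
# such values are replaced before modeling (see the preprocessing cell)
df['MonthlyChargesRatio'] = df['MonthlyCharges'] / df['TotalCharges']
df['AverageMonthlyCharges'] = df['TotalCharges'] / df['tenure']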

# new categorical features
df['HasSecurityAndBackup'] = ((df['OnlineSecurity'] == 'Yes') & (df['OnlineBackup'] == 'Yes')).astype(int)

additional_services = ['DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']
df['AdditionalServices'] = df[additional_services].apply(lambda x: (x == 'Yes').sum(), axis=1)

# Define bins for tenure categorization
bins = [0, 6, 24, float('inf')]
labels = ['New', 'Established', 'Long-term']
df['TenureCategory'] = pd.cut(df['tenure'], bins=bins, labels=labels, include_lowest=True)

df.head(3)

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,...,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn,MonthlyChargesRatio,AverageMonthlyCharges,HasSecurityAndBackup,AdditionalServices,TenureCategory
0,Female,0,Yes,No,1,No,No,DSL,No,Yes,...,Yes,Electronic check,29.85,29.85,No,1.0,29.85,0,0,New
1,Male,0,No,No,34,Yes,No,DSL,Yes,No,...,No,Mailed check,56.95,1889.5,No,0.03014,55.573529,0,1,Long-term
2,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,...,Yes,Mailed check,53.85,108.15,Yes,0.49792,54.075,1,0,New
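
The two ratio features divide by `TotalCharges` and `tenure`, so any row where either is zero would yield `inf` or `NaN`. One way to make that explicit at creation time is to treat a zero denominator as missing. The sketch below uses made-up rows, and the helper name `safe_ratio` is hypothetical rather than part of this notebook:

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Divide two Series, returning NaN wherever the denominator is zero."""
    return numerator / denominator.replace(0, np.nan)

# Hypothetical rows; the first one has zero tenure and zero total charges
toy = pd.DataFrame({
    "tenure": [0, 2, 34],
    "MonthlyCharges": [20.0, 53.85, 56.95],
    "TotalCharges": [0.0, 108.15, 1889.5],
})
toy["AverageMonthlyCharges"] = safe_ratio(toy["TotalCharges"], toy["tenure"])
toy["MonthlyChargesRatio"] = safe_ratio(toy["MonthlyCharges"], toy["TotalCharges"])
print(toy)
```

The resulting `NaN` values can then be filled by an imputer during preprocessing instead of appearing as infinities.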


In [10]:
df.shape

(7043, 25)

The final dataset for modeling contains 7,043 rows and 25 columns.

In [11]:
# Check the data type
df.dtypes

gender                     object
SeniorCitizen               int64
Partner                    object
Dependents                 object
tenure                      int64
PhoneService               object
MultipleLines              object
InternetService            object
OnlineSecurity             object
OnlineBackup               object
DeviceProtection           object
TechSupport                object
StreamingTV                object
StreamingMovies            object
Contract                   object
PaperlessBilling           object
PaymentMethod              object
MonthlyCharges            float64
TotalCharges              float64
Churn                      object
MonthlyChargesRatio       float64
AverageMonthlyCharges     float64
HasSecurityAndBackup        int32
AdditionalServices          int64
TenureCategory           category
dtype: object

#### Building Machine Learning Predictive Models

Before modeling, the categorical variables are converted to numeric form by one-hot encoding, and the numerical features are scaled with StandardScaler.

In [12]:
# Define the features and target
features = ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'HasSecurityAndBackup', 'MonthlyChargesRatio',
            'AdditionalServices', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'tenure', 'AverageMonthlyCharges']
target = 'Churn'

# Encode the target feature
df[target] = LabelEncoder().fit_transform(df[target])

# Handle infinite values and NaN values
numeric_cols = df.select_dtypes(include=[np.number]).columns
df[numeric_cols] = df[numeric_cols].replace([np.inf, -np.inf], np.nan)
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# Define the preprocessing steps
numeric_features = ['MonthlyCharges', 'TotalCharges', 'tenure', 'MonthlyChargesRatio', 'AverageMonthlyCharges']
categorical_features = ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService',
                        'AdditionalServices', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'HasSecurityAndBackup']

# Preprocessing pipeline for numeric features
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),  # Impute missing values
    ('scaler', StandardScaler())  # Scale numeric values
])

# Preprocessing pipeline for categorical features
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),  # Impute missing values
    ('onehot', OneHotEncoder(handle_unknown='ignore'))  # One-hot encode categorical values
])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

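# Apply preprocessing to the data
# Note: the preprocessor is fit on the full dataset, before the train/test split
X = preprocessor.fit_transform(df[features])
y = df[target]

# Apply SMOTE to the data
# Note: resampling before the split places synthetic samples in the test set,
# so the scores reported below are likely optimistic for real, imbalanced data
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X, y)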

# Generic function for making a classification model and assessing performance
def classification_model(model, X, y, model_name):
    # Split the data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

    # Fit the model
    model.fit(X_train, y_train)
  
    # Make predictions on the training set
    train_predictions = model.predict(X_train)

    # Print training accuracy (arguments in y_true, y_pred order)
    train_accuracy = accuracy_score(y_train, train_predictions)
    print(f"Training Accuracy : {train_accuracy:.3%}")

    # Make predictions on the test set
    test_predictions = model.predict(X_test)

    # Print test accuracy
    test_accuracy = accuracy_score(y_test, test_predictions)
    print(f"Test Accuracy : {test_accuracy:.3%}")

    # Perform k-fold cross-validation with 5 folds
    from sklearn.base import clone  # local import: each fold uses a fresh copy of the model
    kf = KFold(n_splits=5, shuffle=True, random_state=1)
    y_arr = np.asarray(y)  # positional indexing works for Series and arrays alike
    scores = []
    for train_index, test_index in kf.split(X):
        # Select the training and validation rows for this fold
        X_train_kf, X_test_kf = X[train_index], X[test_index]
        y_train_kf, y_test_kf = y_arr[train_index], y_arr[test_index]

        # Fit a clone so the model trained above (and saved below) is not overwritten
        fold_model = clone(model)
        fold_model.fit(X_train_kf, y_train_kf)

        # Record the accuracy from each cross-validation run
        scores.append(fold_model.score(X_test_kf, y_test_kf))

    print(f"Cross-Validation Score : {np.mean(scores):.3%}")

    # Print classification report and confusion matrix for the test set
    print("\nClassification Report:")
    print(classification_report(y_test, test_predictions))

    print("\nConfusion Matrix:")
    print(confusion_matrix(y_test, test_predictions))

    # Save the model to disk
    joblib.dump(model, f'{model_name}.pkl')
    print(f"{model_name} model saved to disk.")

# Create and evaluate models
models = {
    "Logistic Regression": LogisticRegression(random_state=42),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=42),
    "XGBoost": XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='logloss'),
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "Support Vector Classifier": SVC(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42)
}

for model_name, model in models.items():
    print(f"\nModel: {model_name}")
    classification_model(model, X_res, y_res, model_name)



Model: Logistic Regression
Training Accuracy : 77.615%
Test Accuracy : 76.908%
Cross-Validation Score : 77.252%

Classification Report:
              precision    recall  f1-score   support

           0       0.80      0.71      0.76      1035
           1       0.74      0.83      0.78      1035

    accuracy                           0.77      2070
   macro avg       0.77      0.77      0.77      2070
weighted avg       0.77      0.77      0.77      2070


Confusion Matrix:
[[737 298]
 [180 855]]
Logistic Regression model saved to disk.

Model: K-Nearest Neighbors
Training Accuracy : 85.878%
Test Accuracy : 80.628%
Cross-Validation Score : 79.938%

Classification Report:
              precision    recall  f1-score   support

           0       0.88      0.71      0.79      1035
           1       0.76      0.90      0.82      1035

    accuracy                           0.81      2070
   macro avg       0.82      0.81      0.80      2070
weighted avg       0.82      0.81      0.80 