# Predict churning customers

The goal of this project is to predict which customers are likely to cancel their credit card, so that action can be taken to retain them.

The top priority is to identify the customers who are about to churn. Flagging a non-churning customer as churned does little harm to the business, whereas missing a customer who does churn is costly. Recall, defined as True positives / (True positives + False negatives), must therefore be high.

The dataset is strongly imbalanced: only 16% of customers churned.
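
To illustrate why recall rather than accuracy is the metric to watch on data like this, here is a minimal sketch. The counts are hypothetical, chosen only to mirror the 16% churn rate, and the two "models" are invented for the comparison.

```python
# Hypothetical confusion-matrix counts for 1000 customers, 160 of whom churned.

def recall(tp, fn):
    """Recall = TP / (TP + FN): share of actual churners that were flagged."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def accuracy(tp, tn, fp, fn):
    """Accuracy = share of all customers that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

# Model A predicts "not churned" for everyone.
model_a = dict(tp=0, fn=160, fp=0, tn=840)
# Model B flags most churners, at the cost of some false alarms.
model_b = dict(tp=144, fn=16, fp=100, tn=740)

acc_a, rec_a = accuracy(**model_a), recall(model_a["tp"], model_a["fn"])
acc_b, rec_b = accuracy(**model_b), recall(model_b["tp"], model_b["fn"])

print(f"Model A: accuracy={acc_a:.3f}, recall={rec_a:.3f}")
print(f"Model B: accuracy={acc_b:.3f}, recall={rec_b:.3f}")
```

Model A reaches 84% accuracy while missing every churner, which is why accuracy alone is misleading here.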

The notebook is organized as follows:

+ In Section 1, the dataset is explored and checked for null values.

+ In Section 2, feature engineering is performed as follows:

    + The categorical target feature (the Attrition_Flag) is converted to numerical.
    + Other categorical features are one-hot encoded.
    + The dataset is divided into train and test, using stratified sampling.
    + The outliers in the training dataset are identified.
    + A data preprocessing pipeline is built, capping the outliers identified in the previous step and standardizing each numerical feature.
    + Highly correlated features are removed by identifying highly correlated feature pairs.
    + Highly multicollinear features are removed by estimating the variance inflation factor.

+ In Section 3, XGBoost is used as a model for predicting churned/not churned customers. Model hyperparameters are searched as follows:

    + An objective function is defined. The objective function computes the average value of the cross-validation score on the training dataset, using the negative log loss as a scoring metric.
    + The maximum value of the objective function is searched using the Bayesian optimization framework Optuna (https://optuna.org/).
    + On the test dataset, the recall value is 0.927.

In [None]:
import os

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# List the input files available in the read-only "/kaggle/input" directory
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Common imports

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore') # silence noisy warnings (from sklearn and seaborn)

%matplotlib inline

# 1. Explore dataset

In [None]:
bankchurners = pd.read_csv("/kaggle/input/credit-card-customers/BankChurners.csv")
bankchurners.head(10)

In [None]:
bankchurners.columns

## Ignore the last 2 columns (as suggested by the data description section)

In [None]:
bankchurners.drop(['Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1',
       'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2'], axis=1, inplace=True)

## How many NaN values are present?

In [None]:
bankchurners.isnull().sum()

In [None]:
bankchurners.describe()

# 2. Feature Engineering

## CLIENTNUM is an identifier and has no predictive power

In [None]:
bankchurners.drop(['CLIENTNUM'],axis=1,inplace=True)

## Which columns are numerical and which are categorical?

In [None]:
numerical_features = bankchurners.select_dtypes(include='number').columns
numerical_features

In [None]:
# Categorical Columns
categorical_features = bankchurners.select_dtypes(include='object').columns
categorical_features

## Convert the income category into an ordinal feature, since its levels have a natural order

In [None]:
sns.countplot(x='Income_Category', data=bankchurners)

In [None]:
bankchurners['Income_Category'] = bankchurners['Income_Category'].replace({
    'Unknown':0,
    'Less than $40K':1,
    '$40K - $60K':2,
    '$60K - $80K':3,
    '$80K - $120K':4,
    '$120K +':5
})

## Convert Attrition_Flag to numerical

In [None]:
sns.countplot(x='Attrition_Flag', data=bankchurners)

In [None]:
bankchurners['Attrition_Flag'] = bankchurners['Attrition_Flag'].replace({
    'Existing Customer':0,
    'Attrited Customer':1,
})

## For the other categorical variables, use dummies (one-hot encoding)

In [None]:
categorical = bankchurners.select_dtypes(include='object').columns
categorical

In [None]:
# dtype=int keeps the dummy columns numeric (0/1) for the correlation and VIF steps below
bankchurners = pd.get_dummies(bankchurners, columns=categorical, drop_first=True, dtype=int)
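
As a small illustration of what `drop_first=True` does, here is a sketch on a toy column. The values are made up for the example, and `dtype=int` is a choice made here so that the output prints as 0/1.

```python
import pandas as pd

# Toy data: one categorical column with three levels.
toy = pd.DataFrame({"Card_Category": ["Blue", "Gold", "Silver", "Blue"]})

# drop_first=True drops the first level (alphabetically "Blue"),
# which becomes the implicit baseline: a row of all zeros.
encoded = pd.get_dummies(toy, columns=["Card_Category"], drop_first=True, dtype=int)
print(encoded)
```

With k levels, k-1 dummy columns are enough to recover the category, and dropping one avoids a perfectly collinear set of columns.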

In [None]:
bankchurners.info()

## Rename columns to remove white spaces and other characters incompatible with patsy (used in the next section)

In [None]:
bankchurners.rename(columns={
    "Education_Level_High School": "Education_Level_High_School",
    "Education_Level_Post-Graduate": "Education_Level_Post_Graduate",
},inplace=True)
bankchurners.columns.values

## Split the dataset into train and test sets before any other analysis

In [None]:
from sklearn.model_selection import train_test_split
y = bankchurners['Attrition_Flag']
X = bankchurners.drop(['Attrition_Flag'], axis=1)
# stratify=y keeps the churn rate (about 16%) the same in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)

In [None]:
train = pd.concat([X_train, y_train],axis=1)
test = pd.concat([X_test, y_test],axis=1)
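
The effect of stratified sampling can be checked on synthetic labels. The sketch below assumes a toy target with a 16% positive rate; the array names are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic target: 160 positives out of 1000 samples (16%).
y_toy = np.array([1] * 160 + [0] * 840)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42, stratify=y_toy
)

# Both splits should show a positive rate close to the overall 16%.
print(f"train positive rate: {y_tr.mean():.3f}")
print(f"test positive rate:  {y_te.mean():.3f}")
```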

## Check outliers on numerical features

In [None]:
sns.set(font_scale=1.5)
def box_plot(key):
    """Box plot of one feature, split by churned / not churned customers."""
    plt.figure(figsize=(30, 20))
    sns.boxplot(x='Attrition_Flag', y=key, data=train[['Attrition_Flag', key]])

In [None]:
box_plot('Customer_Age')

In [None]:
box_plot('Dependent_count')

In [None]:
box_plot('Months_on_book')

In [None]:
box_plot('Total_Relationship_Count')

In [None]:
box_plot('Months_Inactive_12_mon')

In [None]:
box_plot('Contacts_Count_12_mon')

In [None]:
box_plot('Credit_Limit')

In [None]:
box_plot('Total_Revolving_Bal')

In [None]:
box_plot('Avg_Open_To_Buy')

In [None]:
box_plot('Total_Amt_Chng_Q4_Q1')

In [None]:
box_plot('Total_Trans_Amt')

In [None]:
box_plot('Total_Trans_Ct')

In [None]:
box_plot('Total_Ct_Chng_Q4_Q1')

In [None]:
box_plot('Avg_Utilization_Ratio')

## Define a pre-processing pipeline

### Write a custom transformer to clip outliers

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
class RemoveOutliers(BaseEstimator, TransformerMixin):
    """Clip the values of one feature to the range [min_value, max_value]."""

    def __init__(self, feature, min_value, max_value):
        self.feature = feature
        self.min_value = min_value
        self.max_value = max_value

    def fit(self, X, y=None):
        return self # stateless: the bounds are given at construction time

    def transform(self, X, y=None):
        # work on a copy so that the caller's DataFrame is not modified
        X = X.copy()
        X[self.feature] = X[self.feature].clip(lower=self.min_value, upper=self.max_value)
        return X
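
The transformer above caps values at the given bounds rather than dropping rows. A minimal sketch of the same idea with plain pandas, on made-up ages:

```python
import pandas as pd

# Toy data: one value (73) lies above the upper bound of 70.
ages = pd.DataFrame({"Customer_Age": [26, 45, 73, 68]})

# Clip on a copy so the original frame stays untouched.
clipped = ages.copy()
clipped["Customer_Age"] = clipped["Customer_Age"].clip(lower=0, upper=70)

print(clipped["Customer_Age"].tolist())
```

Capping keeps the number of rows unchanged, so the features stay aligned with the target vector.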

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import VarianceThreshold

numerical_pipeline = Pipeline([
    ('age_outliers',RemoveOutliers('Customer_Age',0,70)),
    ('varianceThreshold', VarianceThreshold()),
    ('std_scaler', StandardScaler())])

full_pipeline = ColumnTransformer([("num", numerical_pipeline, numerical_features)])

X_train_values = full_pipeline.fit_transform(X_train)
X_test_values = full_pipeline.transform(X_test)
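
Calling `fit_transform` on the training set and only `transform` on the test set means that the scaling statistics come from the training data alone. The sketch below shows this with `StandardScaler` on toy values; the arrays are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_col = np.array([[1.0], [2.0], [3.0]])
test_col = np.array([[4.0]])

# The mean and standard deviation are learned from the training data only.
scaler = StandardScaler().fit(train_col)
train_scaled = scaler.transform(train_col)
test_scaled = scaler.transform(test_col)

print("learned mean:", scaler.mean_[0])
print("scaled test value:", test_scaled[0, 0])
```

The test value is expressed in units of the training distribution, so no information from the test set leaks into the preprocessing.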

In [None]:
X_train[numerical_features] = X_train_values
X_test[numerical_features] = X_test_values

In [None]:
train = pd.concat([X_train, y_train],axis=1)
test = pd.concat([X_test, y_test],axis=1)

## Helper function to remove columns from both the train and test sets

In [None]:
def remove_columns(df, columns):
    for c in columns:
        if c in df.columns:
            df.drop(c, axis=1, inplace=True)

## Remove highly correlated columns

In [None]:
from scipy.stats import pearsonr
def highly_correleted_columns(df, columns_to_preserve, threshold):
    corr_columns=[]
    for c in df.columns:
        # skip columns already marked for removal
        if c in corr_columns:
            continue
        # Pearson correlation (and its p-value) against every other column
        for cc in df.columns:
            if cc == c:
                continue
            if cc in columns_to_preserve:
                continue
            if cc in corr_columns:
                continue
            corrtest = pearsonr(df[c], df[cc])
            corr = corrtest[0]
            pval = corrtest[1]
            if abs(corr) > threshold and pval < 0.05:
                corr_columns.append(cc)
    return corr_columns

In [None]:
columns_to_remove = highly_correleted_columns(train,['Attrition_Flag'], 0.70)
columns_to_remove

In [None]:
remove_columns(train, columns_to_remove)
remove_columns(test, columns_to_remove)
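
For comparison, a common alternative is to work from the correlation matrix directly. The sketch below is not the function used above: it skips the p-value check and runs on synthetic columns whose names are invented for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": 2 * a + rng.normal(scale=0.01, size=200),  # nearly a linear copy of "a"
    "b": rng.normal(size=200),                           # independent of "a"
})

# Absolute correlations, keeping only the strict upper triangle so that
# each pair is examined once and the diagonal is ignored.
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# For each highly correlated pair, drop the column that appears later.
to_drop = [col for col in upper.columns if (upper[col] > 0.70).any()]
print(to_drop)
```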

## Remove highly collinear features

In [None]:
from patsy import dmatrices
from statsmodels.stats.outliers_influence import variance_inflation_factor
def compute_variance_inflation_factor(df, column_to_predict):

    feature_columns = list(df.columns.values)
    # always remove the column to predict
    feature_columns.remove(column_to_predict)
    features = "+".join(feature_columns)

    # get y and X dataframes based on this regression:
    y, X = dmatrices(column_to_predict + '~' + features, data=df, return_type='dataframe')

    # Calculate VIF Factors, for each X, calculate VIF and save in dataframe
    vif = pd.DataFrame()
    vif["VIF_Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif["features"] = X.columns

    # Inspect VIF Factors
    print(vif.sort_values('VIF_Factor'))
    return vif

In [None]:
vif = compute_variance_inflation_factor(train, 'Attrition_Flag')

In [None]:
nans_columns = vif[vif.isin([np.nan, np.inf, -np.inf]).any(axis=1)].features.values
remove_columns(train, nans_columns)
remove_columns(test, nans_columns)

In [None]:
highly_collinear = vif.loc[vif.VIF_Factor > 5.0].features.values
remove_columns(train, highly_collinear)
remove_columns(test, highly_collinear)
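
The two filters above drop columns from `train` and `test`, whereas the modelling cells read `X_train` and `X_test`. If the reduced feature set is meant to reach the model, the surviving columns have to be carried over. A minimal sketch on toy frames, with hypothetical names, assuming that is the intent:

```python
import pandas as pd

# Toy stand-ins: "f2" was removed from the combined frame but is still in the feature frame.
train_toy = pd.DataFrame({"f1": [0.1, 0.2], "f3": [1.0, 2.0], "Attrition_Flag": [0, 1]})
X_train_toy = pd.DataFrame({"f1": [0.1, 0.2], "f2": [5.0, 6.0], "f3": [1.0, 2.0]})

# Keep only the feature columns that survived the filtering.
kept = [c for c in train_toy.columns if c != "Attrition_Flag"]
X_train_reduced = X_train_toy[kept]

print(list(X_train_reduced.columns))
```

The same list of kept columns would be applied to the test features so that both sets stay aligned.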

# 3. Modelling

In [None]:
from sklearn.metrics import confusion_matrix, recall_score
import optuna

## Plot confusion matrix

In [None]:
import itertools
def plot_confusion_matrix(cm, classes, normalize=False, title='Confusion matrix', cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
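
With `normalize=True`, each row of the confusion matrix is divided by its row sum, so the diagonal holds the recall of each class. A small NumPy sketch with hypothetical counts:

```python
import numpy as np

# Rows are true labels, columns are predicted labels (hypothetical counts).
cm = np.array([[800, 40],    # existing customers
               [16, 144]])   # attrited customers

# Divide each row by its total, as done inside plot_confusion_matrix.
cm_norm = cm.astype(float) / cm.sum(axis=1)[:, np.newaxis]
per_class_recall = np.diag(cm_norm)

print(cm_norm.round(3))
print("recall per class:", per_class_recall.round(3))
```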

## XGBoost classifier

## Use BorderlineSMOTE and RandomUnderSampler to reduce imbalance

In [None]:
import xgboost as xgb
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from imblearn.over_sampling import BorderlineSMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as imblearn_pipeline

def objectiveXGBoost(trial):

    over = BorderlineSMOTE(sampling_strategy=0.3)
    under = RandomUnderSampler(sampling_strategy=0.6)

    # hyperparameters explored by Optuna
    gamma = trial.suggest_float('gamma', 0.01, 10, log=True)
    max_depth = trial.suggest_int('max_depth', 1, 5)
    clf = xgb.XGBClassifier(n_jobs=3, random_state=42, gamma=gamma, max_depth=max_depth)

    # inside the pipeline, resampling is applied only when fitting on the training folds,
    # so each validation fold keeps the original, imbalanced class distribution
    model_pipeline = imblearn_pipeline([('oversampling', over), ('undersampling', under), ('model', clf)])

    # repeated stratified k-fold cross-validation
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)

    # mean negative log loss over all folds (higher is better)
    return cross_val_score(model_pipeline, X_train, y_train, n_jobs=3, cv=cv, scoring='neg_log_loss').mean()
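
The `neg_log_loss` score is the log loss with its sign flipped, so that a larger value means a better model. A hand-written sketch of the binary log loss, using made-up probabilities (the helper `binary_log_loss` is illustrative, not the scikit-learn implementation):

```python
import math

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y_true = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]    # correct and confident predictions
hesitant = [0.6, 0.4, 0.55, 0.45]   # correct but close to 0.5

neg_confident = -binary_log_loss(y_true, confident)
neg_hesitant = -binary_log_loss(y_true, hesitant)
print(neg_confident, neg_hesitant)
```

Both sets of predictions are on the right side of 0.5, yet the confident one scores higher, which is the behaviour a maximizing study rewards.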

In [None]:
import optuna
study = optuna.create_study(direction='maximize')
study.optimize(objectiveXGBoost, n_trials=30)
trial = study.best_trial
print(trial.params)

## The dataset used for fitting should be re-balanced

In [None]:
over = BorderlineSMOTE(sampling_strategy=0.3)
under = RandomUnderSampler(sampling_strategy=0.6)
steps = [('oversampling', over), ('undersampling', under)]
sampling_pipeline = imblearn_pipeline(steps=steps)
X_train_fit, y_train_fit = sampling_pipeline.fit_resample(X_train, y_train)
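
For a binary target, a float `sampling_strategy` is the desired minority/majority ratio after resampling. The arithmetic below sketches what the two steps do to the class counts. The starting counts are hypothetical, the helper name is invented, and the rounding is approximate rather than a reproduction of imbalanced-learn internals.

```python
def resampled_counts(n_majority, n_minority, over_ratio, under_ratio):
    """Approximate class counts after oversampling, then undersampling."""
    # Oversampling grows the minority class until minority / majority == over_ratio.
    minority_after = max(n_minority, round(over_ratio * n_majority))
    # Undersampling shrinks the majority class until minority / majority == under_ratio.
    majority_after = min(n_majority, round(minority_after / under_ratio))
    return majority_after, minority_after

# Hypothetical training set: 1000 existing customers, 190 attrited customers.
n_maj, n_min = resampled_counts(1000, 190, over_ratio=0.3, under_ratio=0.6)
print(n_maj, n_min)
```

With these numbers the minority class grows from 190 to 300 and the majority class shrinks from 1000 to 500, so the data used for fitting is much less imbalanced without being fully balanced.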

In [None]:
clf = xgb.XGBClassifier()
clf.set_params(**study.best_trial.params)
clf.fit(X_train_fit,y_train_fit)

In [None]:
yhat = clf.predict(X_test)

In [None]:
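cnf_matrix = confusion_matrix(y_test, yhat, labels=[0, 1])
plot_confusion_matrix(cnf_matrix, classes=['Existing Customer', 'Attrited Customer'], normalize=True, title='Confusion matrix')
# average='macro' is the unweighted mean of the recall of both classes
print('Macro-averaged recall is ', recall_score(y_test, yhat, average='macro'))
# recall on the churned class alone: TP / (TP + FN)
print('Recall on churned customers is ', recall_score(y_test, yhat, pos_label=1))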
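
Note that `average='macro'` returns the unweighted mean of the recall of each class, which is not the same number as the recall on the churned class alone. A short sketch on toy labels, invented for the example:

```python
from sklearn.metrics import recall_score

y_true_toy = [0, 0, 0, 0, 1, 1]
y_pred_toy = [0, 0, 0, 1, 1, 0]

# Class 0 recall is 3/4 and class 1 recall is 1/2.
macro = recall_score(y_true_toy, y_pred_toy, average="macro")
churn_only = recall_score(y_true_toy, y_pred_toy, pos_label=1)

print("macro recall:", macro)
print("recall on class 1:", churn_only)
```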