# Churn prediction

### Author: Rasmus Davidsen
### Copenhagen, 02-12-2022


### Steps 
1) Import and preprocess data

2) Profile data

3) Handle imbalance

4) Split data into train, validation and test sets [70 20 10]

5) 


In [8]:
# for dataframes. options are set so the entire table can be inspected
import pandas as pd
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 1000)

# for convenient plotting
import seaborn as sns

# for a very detailed data profile
from pandas_profiling import ProfileReport
import ipywidgets

# cool library for data quality checks
#import great_expectations as ge



import numpy as np
from sklearn import set_config
set_config(display='diagram')

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier


from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold

## Import data and inspect

In [9]:
# import semicolon separated .csv file and inspect the shape
df = pd.read_csv('data/case_data_set.csv', delimiter=";")
df.shape

(5374, 21)

In [10]:
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,"29,85","29,85",No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,"56,95","1889,5",No
2,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),"42,3","1840,75",No
3,1452-KIOVK,Male,0,No,Yes,22,Yes,Yes,Fiber optic,No,Yes,No,No,Yes,No,Month-to-month,Yes,Credit card (automatic),"89,1","1949,4",No
4,6713-OKOMC,Female,0,No,No,10,No,No phone service,DSL,Yes,No,No,No,No,No,Month-to-month,No,Mailed check,"29,75","301,9",No


In [11]:
df.dtypes

customerID          object
gender              object
SeniorCitizen        int64
Partner             object
Dependents          object
tenure               int64
PhoneService        object
MultipleLines       object
InternetService     object
OnlineSecurity      object
OnlineBackup        object
DeviceProtection    object
TechSupport         object
StreamingTV         object
StreamingMovies     object
Contract            object
PaperlessBilling    object
PaymentMethod       object
MonthlyCharges      object
TotalCharges        object
Churn               object
dtype: object

## Change data types
The "TotalCharges" and "MonthlyCharges" columns use the Danish "," decimal separator, so pandas reads them as strings (dtype object). We replace "," with "." and then cast both columns from object to float.

In [12]:
# replace , by . and change to float data type
df['TotalCharges'] = df['TotalCharges'].str.replace(',','.').astype(float)
df['MonthlyCharges'] = df['MonthlyCharges'].str.replace(',','.').astype(float)
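
An alternative to replacing the separator by hand is to let the parser handle it at import time. The following is a minimal sketch, assuming "," is only ever used as the decimal mark in the numeric columns; `decimal` is a standard `pd.read_csv` parameter, and the two sample rows are taken from the table above rather than read from the real file.

```python
import io

import pandas as pd

# Two rows in the same layout as the case file: semicolon-separated
# fields with a comma as decimal separator (values copied from df.head()).
sample = io.StringIO(
    "customerID;MonthlyCharges;TotalCharges\n"
    "7590-VHVEG;29,85;29,85\n"
    "5575-GNVDE;56,95;1889,5\n"
)

# decimal="," makes the parser read "29,85" as the float 29.85,
# so no later str.replace / astype step is needed.
demo = pd.read_csv(sample, sep=";", decimal=",")
print(demo.dtypes)
```

With this variant the charge columns arrive as float64 directly, which also avoids a second pass over the data.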

In [13]:
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
3,1452-KIOVK,Male,0,No,Yes,22,Yes,Yes,Fiber optic,No,Yes,No,No,Yes,No,Month-to-month,Yes,Credit card (automatic),89.1,1949.4,No
4,6713-OKOMC,Female,0,No,No,10,No,No phone service,DSL,Yes,No,No,No,No,No,Month-to-month,No,Mailed check,29.75,301.9,No


## Data set profiling 
To inspect the columns for missing values, distributions and correlations, the pandas_profiling package is used to generate a detailed profile report of the data.

In [14]:
#profile = ProfileReport(df, title ="churn data profile", html={'style':{'full_width':True}})

In [15]:
#profile

### Data profile conclusions
* No missing data -> no need for imputation
* The target variable "Churn" is highly imbalanced
* We have only unique customers, as the customerID column has 100 % distinct values
* tenure and TotalCharges are highly correlated
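
Since the profile report cell is commented out, the conclusions above can also be checked with a few plain pandas calls. This is a rough sketch on a made-up four-row frame; the helper name `quick_profile` and the toy values are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd


def quick_profile(df, id_col, target_col, num_a, num_b):
    """Return the few numbers the profile conclusions rest on."""
    return {
        # total number of missing cells in the whole frame
        "missing_cells": int(df.isna().sum().sum()),
        # relative frequency of each target class
        "target_share": df[target_col].value_counts(normalize=True).to_dict(),
        # True if every id occurs exactly once
        "ids_unique": bool(df[id_col].is_unique),
        # Pearson correlation between two numeric columns
        "pearson_corr": float(df[num_a].corr(df[num_b])),
    }


# Toy data (hypothetical values), shaped like the churn table
toy = pd.DataFrame({
    "customerID": ["a", "b", "c", "d"],
    "tenure": [1, 10, 20, 40],
    "TotalCharges": [30.0, 300.0, 610.0, 1200.0],
    "Churn": ["No", "No", "No", "Yes"],
})

summary = quick_profile(toy, "customerID", "Churn", "tenure", "TotalCharges")
print(summary)
```

On the real data the call would be `quick_profile(df, "customerID", "Churn", "tenure", "TotalCharges")`.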

## Encode categorical features
For this we use the one-hot encoder (`OneHotEncoder`) from sklearn, applied inside the modelling pipeline.
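
The pipeline further down encodes the categorical columns with `OneHotEncoder(handle_unknown="ignore")`. A small sketch of what that setting does, using toy contract values chosen for illustration: a category that was not present when the encoder was fitted is encoded as a row of zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy training column with two categories
train = pd.DataFrame({"Contract": ["Month-to-month", "One year", "Month-to-month"]})

# New data: "Two year" was never seen during fit
new = pd.DataFrame({"Contract": ["One year", "Two year"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# transform returns a sparse matrix; convert to a dense array for printing
encoded = encoder.transform(new).toarray()
print(encoder.categories_)
print(encoded)
```

The first row gets a one in the "One year" column, while the unseen "Two year" row is all zeros.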

## Define inputs and target and create a hold-out partition
* Define which columns of the data are features and which column is the target.

* Create a hold-out partition in order to get an unbiased estimate of the final selected model's performance.

* Define the numeric and categorical features.

In [16]:
from sklearn.model_selection import train_test_split

# input features
X = df.drop(columns=['customerID', 'Churn'])

# target
y = df.Churn

# split data in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, test_size = 0.2, random_state = 42)

# the numeric features to the model
numeric_features = ['MonthlyCharges', 'TotalCharges']

# the categorical features to the model
# 'tenure' is an integer column but is listed here, so it is one-hot encoded
# 'PaymentMethod' is in neither list; ColumnTransformer's default
# remainder='drop' therefore leaves it out of the model
categorical_features = ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'PhoneService',
            'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
            'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling']
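
The `stratify = y` argument in the split above matters because the target is imbalanced: it keeps the class proportions the same in both partitions. A minimal sketch on made-up labels (90 "No", 10 "Yes"; these counts are an assumption for illustration, not the real class counts):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy target with 10 % positives and a dummy single-column feature matrix
y_toy = np.array(["No"] * 90 + ["Yes"] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, test_size=0.2, random_state=42
)

# Share of the minority class in each partition
train_share = np.mean(y_tr == "Yes")
test_share = np.mean(y_te == "Yes")
print(f"share of 'Yes' in train: {train_share:.2f}, in test: {test_share:.2f}")
```

Without stratification a small test partition could by chance contain very few churners, which would make the hold-out estimate unstable.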



In [21]:
k_vals = [20, 30, 40, 50, 60, 100, 200, 300]

best_auc = 0
best_k = 0
for k in k_vals:

    numeric_transformer = Pipeline(
        steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
    )


    categorical_transformer = OneHotEncoder(handle_unknown="ignore")

    preprocessor = ColumnTransformer(
        transformers=[
            ("num", numeric_transformer, numeric_features),
            ("cat", categorical_transformer, categorical_features),
        ]
    )

    over = SMOTE(sampling_strategy=0.1)
    under = RandomUnderSampler(sampling_strategy=0.5)

    # Append the resamplers and the classifier to the preprocessing pipeline.
    # Now we have a full prediction pipeline.
    pipeline = Pipeline(
        steps=[
            ("preprocessor", preprocessor),
            ("over", over),
            ("under", under),
            ("classifier", GradientBoostingClassifier(n_estimators=k)),
        ]
    )
    
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=1, random_state=1)
    scores = cross_val_score(pipeline, X_train, y_train, scoring='roc_auc', cv=cv, n_jobs=-1)
    
    mean_score = np.mean(scores)
    
    print(f"Mean ROC AUC for {k} estimators: {mean_score}")
    
    # keep the number of estimators with the best mean ROC AUC so far
    if mean_score > best_auc:
        best_k = k
        best_auc = mean_score



print("best number of estimators: ", best_k)
pipeline

Mean ROC AUC for 20 estimators: 0.8310440572691862
Mean ROC AUC for 30 estimators: 0.8286161518463931
Mean ROC AUC for 40 estimators: 0.8331168631785802
Mean ROC AUC for 50 estimators: 0.8312342636944241
Mean ROC AUC for 60 estimators: 0.8345439322560269
Mean ROC AUC for 100 estimators: 0.8270387379958125
Mean ROC AUC for 200 estimators: 0.8213818742616181
Mean ROC AUC for 300 estimators: 0.816537669169854
best number of estimators:  60
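
The manual loop above can also be expressed with scikit-learn's `GridSearchCV`, which tracks the best parameter and score itself. The sketch below is an assumption-laden stand-in, not the notebook's pipeline: it uses synthetic data from `make_classification`, a plain `sklearn.pipeline.Pipeline` without the SMOTE and under-sampling steps, and a reduced grid to keep it quick. The `classifier__n_estimators` key follows the usual `step__parameter` naming, which the imblearn `Pipeline` shares because it extends the scikit-learn one.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the churn data (about 10 % positives)
X_syn, y_syn = make_classification(
    n_samples=400, n_features=8, weights=[0.9, 0.1], random_state=0
)

pipe = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("classifier", GradientBoostingClassifier(random_state=0)),
    ]
)

# One cross-validated fit per candidate value, scored by ROC AUC
search = GridSearchCV(
    pipe,
    param_grid={"classifier__n_estimators": [20, 60]},
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=1),
)
search.fit(X_syn, y_syn)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` is already refit on all the data passed to `fit`, so a separate final fit is not needed in this variant.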


In [30]:
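from sklearn import metrics
from sklearn.metrics import RocCurveDisplay

# refit on the full training partition with the best number of estimators;
# preprocessor, over and under are reused from the cell above
final_pipeline = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("over", over),
        ("under", under),
        ("classifier", GradientBoostingClassifier(n_estimators=best_k)),
    ]
)

final_pipeline.fit(X_train, y_train)

# hard class predictions on the hold-out partition
pred = np.array(final_pipeline.predict(X_test))
print(pred)

# true hold-out labels for comparison
y_test_arr = np.array(y_test)
print(y_test_arr)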

['No' 'No' 'No' ... 'No' 'No' 'No']
['No' 'No' 'No' ... 'No' 'No' 'No']
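
Printing the two arrays side by side says little about model quality on an imbalanced target. One possible next step, sketched here on synthetic data with a bare `GradientBoostingClassifier` (both are assumptions standing in for `final_pipeline` and the real hold-out set), is to score the predicted probabilities with ROC AUC and to look at the confusion matrix.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; labels are strings like the notebook's "Yes"/"No"
X_syn, y_num = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=0
)
y_syn = np.where(y_num == 1, "Yes", "No")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, stratify=y_syn, test_size=0.2, random_state=42
)

model = GradientBoostingClassifier(n_estimators=60, random_state=0)
model.fit(X_tr, y_tr)

# Column of predict_proba that belongs to the positive class "Yes"
yes_col = list(model.classes_).index("Yes")
proba_yes = model.predict_proba(X_te)[:, yes_col]

# ROC AUC is computed from scores, not hard labels; truth is encoded as booleans
auc = roc_auc_score(y_te == "Yes", proba_yes)

# Rows are true classes, columns are predicted classes, both ordered ["No", "Yes"]
cm = confusion_matrix(y_te, model.predict(X_te), labels=["No", "Yes"])

print(f"hold-out ROC AUC: {auc:.3f}")
print(cm)
```

In the notebook itself the same calls would use `final_pipeline.predict_proba(X_test)` and `y_test`, which makes the hold-out score comparable to the cross-validated ROC AUC values printed earlier.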
