**Goal of the project**

In this project, we’ll build a churn model for a contractual setting, where customers hold an ongoing subscription and churn is observed when they cancel.

**Load the packages**

In [76]:
# Importing library
import pandas as pd

**Load the data**

The dataset we’re using is a [customer churn dataset](https://www.kaggle.com/datasets/krleee/churn) from Kaggle. 

In [77]:
# Load dataset
df = pd.read_csv('../input/churn/churn.csv')

In [78]:
# Convert Pandas column names to lowercase
df = df.rename(columns = str.lower)

In [79]:
# Examine the data
df.head()

Unnamed: 0,state,account_length,area_code,international_plan,voice_mail_plan,number_vmail_messages,total_day_minutes,total_day_calls,total_day_charge,total_eve_minutes,total_eve_calls,total_eve_charge,total_night_minutes,total_night_calls,total_night_charge,total_intl_minutes,total_intl_calls,total_intl_charge,number_customer_service_calls,churn
0,KS,128,area_code_415,no,yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,no
1,OH,107,area_code_415,no,yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,no
2,NJ,137,area_code_415,no,no,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,no
3,OH,84,area_code_408,yes,no,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,no
4,OK,75,area_code_415,yes,no,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,no


In [80]:
# Examine the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 20 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   state                          5000 non-null   object 
 1   account_length                 5000 non-null   int64  
 2   area_code                      5000 non-null   object 
 3   international_plan             5000 non-null   object 
 4   voice_mail_plan                5000 non-null   object 
 5   number_vmail_messages          5000 non-null   int64  
 6   total_day_minutes              5000 non-null   float64
 7   total_day_calls                5000 non-null   int64  
 8   total_day_charge               5000 non-null   float64
 9   total_eve_minutes              5000 non-null   float64
 10  total_eve_calls                5000 non-null   int64  
 11  total_eve_charge               5000 non-null   float64
 12  total_night_minutes            5000 non-null   f

**Define the target variable**

Before we start, we’ll tidy the data by converting the churn column from yes/no strings to 1/0 integers, so it can be used as the model’s target.

In [81]:
# Encode the target: 'yes' -> 1 (churned), 'no' -> 0 (retained)
df['churn'] = df['churn'].replace(('yes', 'no'), (1, 0))
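
The same kind of encoding can be sketched on a toy Series (the values below are made up for illustration). This variant uses `Series.map` with an explicit dictionary, so any unexpected label becomes `NaN` and can be caught by a check rather than passing through silently.

```python
import pandas as pd

# Toy stand-in for the churn column (hypothetical values)
churn = pd.Series(['no', 'yes', 'no', 'no', 'yes'], name='churn')

# Map the two labels to integers; any label not in the dict becomes NaN
encoded = churn.map({'yes': 1, 'no': 0})

# Fail fast if an unexpected label slipped through
assert encoded.notna().all(), 'unexpected label in churn column'

encoded = encoded.astype(int)
print(encoded.tolist())
```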

**Examine the data**

As we’ll see from the normalized value counts of the target column, this dataset is imbalanced. The positive class (customers who churned) comprises about 14.1% of the dataset.

In [82]:
round(df['churn'].value_counts(normalize = True) * 100, 1)

0    85.9
1    14.1
Name: churn, dtype: float64
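
To make the imbalance concrete, here is a small sketch of what a trivial model that always predicts "no churn" would score on data with the same 85.9% / 14.1% split. The counts are illustrative (1,000 hypothetical customers) rather than taken from the dataset itself.

```python
# Hypothetical sample of 1,000 customers with the class split reported above
n_retained, n_churned = 859, 141

# A baseline that always predicts class 0 (no churn)
true_negatives = n_retained   # every retained customer is predicted correctly
false_negatives = n_churned   # every churner is missed

accuracy = true_negatives / (n_retained + n_churned)
churn_recall = 0 / n_churned  # no churner is ever flagged

print(f'accuracy: {accuracy:.3f}, churn recall: {churn_recall:.1f}')
```

A model can therefore reach roughly 86% accuracy without identifying a single churner, which is why accuracy alone is a weak yardstick for this problem.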

**Create the test and train datasets**

Next, we’ll assign all the columns apart from the churn column to our X feature set and the churn column to our y target variable.

In [83]:
X = df.drop(['churn'], axis = 1)

In [84]:
y = df['churn']

We’ll then use the train_test_split() function to create our train and test data, holding out 20% for testing and stratifying on y so that the train and test sets keep the same proportion of churners as the full dataset.

In [85]:
from sklearn.model_selection import train_test_split

In [86]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)
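
The effect of `stratify` can be checked on a toy target. This is a minimal sketch with made-up data (1,000 rows, 14% positives); the names `X_toy` and `y_toy` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy target with 14% positives (hypothetical data, 1,000 rows)
y_toy = np.array([1] * 140 + [0] * 860)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratify, both splits keep the 14% positive rate
print(y_tr.mean(), y_te.mean())
```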

**Create our feature engineering functions**

We’ll create several functions that derive new features from existing columns. Each one will later be wrapped in a FunctionTransformer so it can run as a pipeline step.

In [87]:
def get_total_net_minutes(df):
    
    df['total_net_minutes'] = df['total_day_minutes'] + df['total_eve_minutes'] + df['total_night_minutes']
    
    return df

In [88]:
def get_total_net_calls(df):
    
    df['total_net_calls'] = df['total_day_calls'] + df['total_eve_calls'] + df['total_night_calls']
    
    return df

In [89]:
def get_total_net_charge(df):
    
    df['total_net_charge'] = df['total_day_charge'] + df['total_eve_charge'] + df['total_night_charge']
    
    return df

In [90]:
def cs_calls_per_month(df):
    
    # Note: the numerator also includes voicemail messages, and the rate is per unit of account_length
    df['cs_calls_per_month'] = (df['number_customer_service_calls'] + df['number_vmail_messages']) / df['account_length']
    
    return df
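
The functions above add a column to the DataFrame they receive, which modifies that object in place. A possible non-mutating variant, sketched here on a toy frame with made-up values, uses `DataFrame.assign` to return a new frame instead. The name `add_total_net_minutes` is hypothetical and not part of the original notebook.

```python
import pandas as pd

def add_total_net_minutes(df):
    """Return a copy of df with a total_net_minutes column added."""
    return df.assign(
        total_net_minutes=df['total_day_minutes']
        + df['total_eve_minutes']
        + df['total_night_minutes']
    )

# Toy data (made-up values)
toy = pd.DataFrame({
    'total_day_minutes': [100.0, 200.0],
    'total_eve_minutes': [50.0, 25.0],
    'total_night_minutes': [10.0, 5.0],
})

result = add_total_net_minutes(toy)
print(result['total_net_minutes'].tolist())  # [160.0, 230.0]
print('total_net_minutes' in toy.columns)    # False: the input is untouched
```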

**Create a feature engineering pipeline**

Next, we’ll use a ColumnTransformer to run each feature engineering function, wrapped in a FunctionTransformer, on just the columns it needs.

In [91]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

In [92]:
feature_engineering = ColumnTransformer([('total_net_minutes', FunctionTransformer(get_total_net_minutes, validate = False),
                                          ['total_day_minutes', 'total_eve_minutes', 'total_night_minutes']),
                                         ('total_net_calls', FunctionTransformer(get_total_net_calls, validate = False),
                                          ['total_day_calls', 'total_eve_calls', 'total_night_calls']),
                                         ('total_net_charge', FunctionTransformer(get_total_net_charge, validate = False),
                                          ['total_day_charge', 'total_eve_charge', 'total_night_charge']),
                                         ('cs_calls_per_month', FunctionTransformer(cs_calls_per_month, validate = False),
                                          ['account_length', 'number_customer_service_calls', 'number_vmail_messages'])])
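
To see what a transformer of this shape emits, here is a minimal sketch on a toy frame. The helper `add_total_calls` and the data are hypothetical. Because the helper returns the whole frame it was given plus one new column, the output contains the three input columns as well as the total, and any column not listed is dropped by default.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

def add_total_calls(df):
    # Hypothetical helper: returns the input columns plus their row-wise sum
    return df.assign(total_net_calls=df.sum(axis=1))

toy = pd.DataFrame({
    'total_day_calls': [10, 20],
    'total_eve_calls': [1, 2],
    'total_night_calls': [3, 4],
    'state': ['KS', 'OH'],  # not listed below, so dropped by default
})

ct = ColumnTransformer([
    ('total_net_calls', FunctionTransformer(add_total_calls),
     ['total_day_calls', 'total_eve_calls', 'total_night_calls']),
])

out = ct.fit_transform(toy)
print(out.shape)   # (2, 4): three inputs plus the new total
print(out[:, -1])  # the row totals
```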

**Define the model and bundle the pipeline**

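Next we’ll define a simple CatBoostClassifier model, then use Pipeline to bundle our steps together. The bundled pipeline preprocesses the data, scales it with StandardScaler, and then fits the model. Note that Pipeline is imported from imblearn here; it behaves like scikit-learn’s Pipeline but also accepts resampling steps, although none are used in this project.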

In [93]:
from sklearn.preprocessing import OneHotEncoder
from catboost import CatBoostClassifier
from imblearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

In [94]:
numeric_columns = X.select_dtypes(exclude = ['object']).columns.tolist()
categorical_columns = X.select_dtypes(include = ['object']).columns.tolist()
# scikit-learn 1.2 renamed `sparse` to `sparse_output`; use `sparse = False` on older versions
categorical_transformer = OneHotEncoder(sparse_output = False, handle_unknown = 'ignore')
    
preprocessor = ColumnTransformer(transformers = [('feature_engineering', feature_engineering, numeric_columns),
                                                 ('numeric_transformers', 'passthrough', numeric_columns),
                                                 ('categorical_transformers', categorical_transformer, categorical_columns)])

model = CatBoostClassifier(silent = True, random_state = 42)

bundled_pipeline = Pipeline(steps = [('preprocessor', preprocessor),
                                     ('scaler', StandardScaler()),
                                     ('model', model)])
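
The overall structure (preprocessor, then scaler, then model) can be sketched using scikit-learn alone. This is an illustration under stated assumptions, not the project’s pipeline: `LogisticRegression` and scikit-learn’s own `Pipeline` stand in for CatBoostClassifier and the imblearn Pipeline, and the data are made up.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (made-up values)
X_toy = pd.DataFrame({
    'plan': ['yes', 'no', 'no', 'yes', 'no', 'yes'],
    'minutes': [300.0, 120.0, 150.0, 280.0, 100.0, 310.0],
})
y_toy = [1, 0, 0, 1, 0, 1]

# sparse_threshold=0 makes the ColumnTransformer always return a dense array
preprocessor = ColumnTransformer([
    ('numeric', 'passthrough', ['minutes']),
    ('categorical', OneHotEncoder(handle_unknown='ignore'), ['plan']),
], sparse_threshold=0)

# LogisticRegression is a stand-in here, not the model used in this project
pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression()),
])

pipe.fit(X_toy, y_toy)
print(list(pipe.named_steps))
print(pipe.predict(X_toy).tolist())
```

Each step is fitted on the output of the previous one, so the encoder categories and scaling statistics are learned from the training data only.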

**Preprocess the training data and fit the model**

Now we can call the fit() method on our bundled_pipeline and pass in our X_train and y_train data. This runs the steps in our preprocessor, applies the scaler, and fits the model, all in a single line of code.

In [95]:
bundled_pipeline.fit(X_train, y_train)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('feature_engineering',
                                                  ColumnTransformer(transformers=[('total_net_minutes',
                                                                                   FunctionTransformer(func=<function get_total_net_minutes at 0x7fd5a034ed40>),
                                                                                   ['total_day_minutes',
                                                                                    'total_eve_minutes',
                                                                                    'total_night_minutes']),
                                                                                  ('total_net_calls',
                                                                                   FunctionTransformer(func=<function get_total_net_call...
                                                   'total_night_char

**Generate predictions**

To generate predictions from our X_test data, which hasn’t been through the above steps, we can call predict() on the fitted bundled_pipeline. The pipeline applies the preprocessing it learned from the training data before passing the result to the model, so the test data is transformed in exactly the same way as the train data.

In [96]:
y_pred = bundled_pipeline.predict(X_test)

In [97]:
from sklearn.metrics import f1_score

In [98]:
print('F1 score: ', f1_score(y_test, y_pred, average = 'macro'))

F1 score:  0.9416482426395741


**Evaluate the model**

There are a couple of scikit-learn functions we can use to evaluate the model. The first is f1_score, used above with average = 'macro', which returns the unweighted mean of the per-class F1 scores. The second is classification_report, which returns the precision, recall, and F1 score for each class. As we can see, the base CatBoostClassifier performs well without any tuning: the macro F1 is about 0.94, with precision of 0.95 and recall of 0.85 on the churn class.

In [99]:
from sklearn.metrics import classification_report

In [100]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.98      0.99      0.98       859
           1       0.95      0.85      0.90       141

    accuracy                           0.97      1000
   macro avg       0.96      0.92      0.94      1000
weighted avg       0.97      0.97      0.97      1000
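
To show what the macro average means, the sketch below computes the per-class F1 scores by hand on a small set of made-up labels and compares their unweighted mean with `f1_score(..., average='macro')`. The helper `f1_for_class` is hypothetical and written only for this illustration.

```python
from sklearn.metrics import f1_score

# Toy labels (made up): 6 negatives, 4 positives
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

def f1_for_class(y_true, y_hat, label):
    # F1 = 2*TP / (2*TP + FP + FN), treating `label` as the positive class
    tp = sum(t == label and p == label for t, p in zip(y_true, y_hat))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_hat))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_hat))
    return 2 * tp / (2 * tp + fp + fn)

# Macro F1 is the plain mean of the per-class scores, ignoring class sizes
manual_macro = (f1_for_class(y_true, y_hat, 0) + f1_for_class(y_true, y_hat, 1)) / 2
sklearn_macro = f1_score(y_true, y_hat, average='macro')

print(manual_macro, sklearn_macro)
```

Because each class counts equally, the macro score is pulled down more by mistakes on the minority (churn) class than the weighted average is.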

