## Model Training & Feature Relationships

This workbook outlines a repeatable process for viewing feature relationships in a binary classification problem; the code is reasonably easy to adapt to other scenarios. Here we use a simple (non-cross-validated) LightGBM model, but the model itself is easily replaceable with, say, a random forest or XGBoost.

Note: All data preparation, including the handling of categorical variables, was performed in a previous workbook. This script is purely for model and output creation.


### Package import

Import all the packages required for the script.

In [1]:
###
# Import necessary packages
###
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import lightgbm as lgb
from sklearn.metrics import roc_auc_score, roc_curve, confusion_matrix, accuracy_score


### Define the variables

Here we outline all the variables used to run the script:

path = the path to the folder that holds the dataset

dataset_name = the file name of the dataset used to train and test the model

target_name = the name of the column that is your binary target variable

pred_prob_name = the name of the column that will hold the predicted probabilities

test_frac = the fraction of the dataset that will be set aside for testing

In [2]:
# Define path to data & name of the file
path = "C:/Users/andrew.davidson/OneDrive - Concentra Consulting Limited/Documents/Projects/GBT Feature Relationships/KK Box/data/"

dataset_name = "Prepped Data.csv"

# The column name of the target variable
target_name = "Churn"
pred_prob_name = target_name + " Probability"

# Define the size of the test set
test_frac = 0.25

### Dataset import
Using the pre-defined path, import the dataset of your choice.

Here, the dataset contains both the train & test data, which we split later.

In [3]:
# Read in df
df = pd.read_csv(path + dataset_name)
df.head()

Unnamed: 0,Churn,bd,transaction_count,total_payment_plan_days,avg_payment_plan_days,plan_net_worth,mean_payment_each_transaction,total_actual_payment,auto_renew_times,cancel_times,...,no_transactions_flag,city_mean,normal_payment_method_id_mean,gender_female,gender_male,registered_via_3.0,registered_via_4.0,registered_via_7.0,registered_via_9.0,registered_via_13.0
0,1,28.0,0.0,,,,,,,,...,1,0.131997,,0,1,1,0,0,0,0
1,1,20.0,1.0,30.0,30.0,180.0,180.0,180.0,0.0,0.0,...,0,0.123023,0.085326,0,1,1,0,0,0,0
2,1,18.0,2.0,115.93624,67.36879,300.0,150.0,300.0,0.0,0.0,...,0,0.123023,0.92179,0,1,1,0,0,0,0
3,1,,10.0,115.93624,30.0,517.9374,149.0,514.92377,10.0,0.0,...,0,0.064056,0.033369,0,0,0,0,1,0,0
4,1,35.0,8.0,115.93624,30.0,517.9374,99.0,514.92377,8.0,1.0,...,0,0.123023,0.033369,1,0,0,0,1,0,0
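
Before splitting, a couple of quick checks can catch problems early. The sketch below uses a small made-up frame in place of the real `df`; the values and the helper name `summarise_frame` are illustrative assumptions, not part of the original workbook. It reports whether the target is binary and what share of each feature is missing. LightGBM handles missing values natively, so the NaNs visible above do not have to be imputed for this model.

```python
import numpy as np
import pandas as pd

def summarise_frame(df, target_name):
    """Return (is_binary_target, missing_share per feature) for a prepared dataset."""
    # The target should only ever contain 0 and 1
    is_binary = set(df[target_name].dropna().unique()) <= {0, 1}
    # Share of missing values in each feature column, largest first
    missing_share = (df.drop(columns=target_name)
                       .isna()
                       .mean()
                       .sort_values(ascending=False))
    return is_binary, missing_share

# Made-up data standing in for the real dataset
toy_df = pd.DataFrame({
    "Churn": [1, 0, 0, 1],
    "bd": [28.0, 20.0, np.nan, 35.0],
    "transaction_count": [0.0, 1.0, 2.0, 10.0],
})

is_binary, missing_share = summarise_frame(toy_df, "Churn")
print(is_binary)
print(missing_share)
```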


### Split the data into train and test sets


In [4]:
X = df.drop(target_name, axis = 1)
y = df[target_name]

# Split into train & test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = test_frac)
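
The split above is random on every run and does not preserve the class balance. As an optional variation (not what this workbook does), `train_test_split` accepts `stratify` and `random_state`, which keep the churn rate similar in both sets and make the split reproducible. The sketch below uses synthetic data, and the seed values are arbitrary assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real features and target
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({"feature": rng.normal(size=1000)})
toy_y = pd.Series((rng.random(1000) < 0.1).astype(int), name="Churn")

# stratify keeps the positive rate similar in both sets;
# random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.25, stratify=toy_y, random_state=42)

print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```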

### Define the model

Define the model & fit it to the training data


In [5]:
# Define the model:

# Define the model parameters
# Note: we do not cross-validate in this script for simplicity
params = {'num_leaves': 32,
          'max_depth': 3,
          'learning_rate': 0.01,
          'n_estimators': 300,
          'subsample_for_bin': 500,
          'min_child_samples': 20,
          'subsample': 0.5}

# Build the classifier from the parameters above
model = lgb.LGBMClassifier(**params,
                           #n_jobs = -1,
                           boosting_type = 'dart',
                           # Set the feature importance type to "gain"
                           importance_type = "gain")


# Fit the model
model.fit(X_train, y_train)

LGBMClassifier(boosting_type='dart', importance_type='gain', learning_rate=0.01,
               max_depth=3, n_estimators=300, num_leaves=32, subsample=0.5,
               subsample_for_bin=500)
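
Two of the parameters above may not behave as intended. A tree limited to `max_depth = 3` can have at most 2^3 = 8 leaves, so `num_leaves = 32` is never the binding limit. Also, according to the LightGBM parameter documentation, row subsampling (`subsample`) only takes effect when `subsample_freq` is set to a non-zero value, which it is not here. The helper below is a hypothetical sanity check written for this note, not a LightGBM function.

```python
def check_lgbm_params(params):
    """Return notes about parameters that may not behave as intended."""
    notes = []
    # Defaults mirror the LGBMClassifier defaults
    max_depth = params.get("max_depth", -1)
    num_leaves = params.get("num_leaves", 31)
    if max_depth > 0 and num_leaves > 2 ** max_depth:
        notes.append(
            f"num_leaves={num_leaves} exceeds the {2 ** max_depth} leaves "
            f"possible at max_depth={max_depth}")
    if params.get("subsample", 1.0) < 1.0 and params.get("subsample_freq", 0) <= 0:
        notes.append("subsample < 1 has no effect unless subsample_freq > 0")
    return notes

params = {'num_leaves': 32, 'max_depth': 3, 'learning_rate': 0.01,
          'n_estimators': 300, 'subsample_for_bin': 500,
          'min_child_samples': 20, 'subsample': 0.5}

for note in check_lgbm_params(params):
    print(note)
```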

### Predictions and prediction probabilities
Extract the predictions and prediction probabilities of the model on the test set

In [6]:
# Extract test predictions from the model
preds = model.predict(X_test)
# Extract the test prediction probabilities from the model
test_probs = model.predict_proba(X_test)[:,1]
# Calculate the ROC AUC, accuracy and confusion matrix for the test set (0.5 threshold)
print('ROC: ', round(roc_auc_score(y_test, test_probs),2))
print('Accuracy: ', round(accuracy_score(y_test, (test_probs>0.5).astype(int)),2)*100)
print(confusion_matrix(y_test, (test_probs>0.5).astype(int)))

# Extract the predicted probability from the test set
probs_series = pd.Series(test_probs, index = X_test.index).rename(pred_prob_name)

# Add the target column & the predictions to the test set
test_output = pd.concat([X_test, y_test, probs_series], axis=1, sort=False)


ROC:  0.95
Accuracy:  97.0
[[217964   3000]
 [  4868  16908]]
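
Because churners are the minority class, accuracy alone flatters the model. The sketch below recomputes accuracy from the confusion matrix printed above and adds precision, recall and the accuracy of always predicting "no churn". It relies on the scikit-learn convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class
cm = np.array([[217964, 3000],
               [4868, 16908]])
tn, fp, fn, tp = cm.ravel()
total = cm.sum()

accuracy = (tp + tn) / total
precision = tp / (tp + fp)      # share of predicted churners who did churn
recall = tp / (tp + fn)         # share of actual churners that were caught
baseline = (tn + fp) / total    # accuracy of always predicting the majority class

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"Baseline:  {baseline:.3f}")
```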


### Model feature importances

Extract the overall feature importances from the model. The importance type was set to "gain" when the model was defined, so each value is the total gain of the splits that use the feature.

In [7]:
# Create a pandas series containing all the feature importances
feat_importance = pd.Series(model.feature_importances_, index = X_train.columns)
# Create a dataframe for feature importances with normalised importance & rank
feat_df = pd.DataFrame(feat_importance, columns = ["Importance"]).reset_index().rename({'index' : "Feature"}, axis = 1)
feat_df["Importance_Norm"] = feat_df["Importance"]/ max(feat_df["Importance"])
feat_df["ImportanceRank"] = feat_df.Importance.rank(ascending = False)
feat_df.sort_values("ImportanceRank").head()

Unnamed: 0,Feature,Importance,Importance_Norm,ImportanceRank
2,total_payment_plan_days,17209450.0,1.0,1.0
8,cancel_times,6145042.0,0.357074,2.0
7,auto_renew_times,724769.2,0.042115,3.0
28,day_diff_last_listen__first_listen,710550.1,0.041288,4.0
27,day_diff_membership_expire__last_listen,557841.3,0.032415,5.0
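
Dividing by the maximum, as above, scales the top feature to 1. An alternative that some readers find easier to interpret is to divide by the sum, so that each value reads as a share of the total gain. The sketch below shows this on made-up importance values; the numbers and the column name `Importance_Share` are illustrative assumptions.

```python
import pandas as pd

# Made-up gain importances for three features
toy_importance = pd.Series({"feature_a": 600.0,
                            "feature_b": 300.0,
                            "feature_c": 100.0})

toy_feat_df = (toy_importance.rename("Importance")
                             .rename_axis("Feature")
                             .reset_index())
# Share of total gain: the column sums to 1
toy_feat_df["Importance_Share"] = (toy_feat_df["Importance"]
                                   / toy_feat_df["Importance"].sum())
print(toy_feat_df)
```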


### Model Feature Contributions

Derive the feature contributions for every row in the test set. We do this by calling the "predict_proba" method with pred_contrib=True, which returns one contribution per feature plus a final bias (expected value) column instead of the class probabilities. Contributions could be extracted for the training data as well; we use only the test set in this example because of the size of the data.

In [17]:
####
# Get the feature contributions for every row
drivers_df = model.predict_proba(X_test, pred_contrib=True)

# Create a dataframe using the contributions
drivers_df = pd.DataFrame(drivers_df,
                          columns = list(X_test.columns) + ['<BIAS>'],
                          index = X_test.index)

# Reshape the contributions & feature values to long format
contribs = drivers_df.reset_index().melt(id_vars = "index")
values = X_test.reset_index().melt(id_vars = "index")

# Join the values & their contributions (the <BIAS> rows have no value)
driver_output = contribs.merge(values, how = 'left',
                               on = ['index', 'variable'])

driver_output = driver_output.rename({'variable' : "Feature",
                                      'value_x' : 'Contribution',
                                      'value_y' : 'Value'}, axis=1)
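
As far as we understand the LightGBM behaviour, the contributions returned with `pred_contrib=True` are SHAP values expressed in raw-score (log-odds) space for a binary objective, not in probability space. Summing a row, bias included, and applying the sigmoid should therefore approximately recover the predicted probability for that row. The sketch below illustrates the arithmetic with made-up contribution values; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up contributions for two rows (log-odds units)
toy_contribs = pd.DataFrame(
    {"total_payment_plan_days": [1.2, -0.8],
     "cancel_times": [0.5, -0.1],
     "<BIAS>": [-2.3, -2.3]},
    index=[101, 102])

# Raw score = sum of all feature contributions plus the bias
raw_score = toy_contribs.sum(axis=1)
# The sigmoid maps log-odds to a probability
prob = 1.0 / (1.0 + np.exp(-raw_score))

print(prob.round(4))
```

On real output, comparing this value with `model.predict_proba(X_test)[:, 1]` is a useful check that the contribution columns line up with the feature names.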


### ROC Curve

Create the output needed to plot the ROC curve in Power BI.

In [18]:
fpr, tpr, thresh = roc_curve(test_output[target_name], test_output[pred_prob_name])
auc = roc_auc_score(test_output[target_name], test_output[pred_prob_name])

roc_data = pd.DataFrame({'FalsePositiveRate': fpr, 'TruePositiveRate': tpr, "Threshold": thresh})
roc_data["Model"] = "Model"
roc_data["auc_score"] = auc

# Generate a no-skill prediction (always predict the majority class, 0)
ns_probs = [0 for _ in range(len(test_output))]
ns_fpr, ns_tpr, ns_thresh = roc_curve(test_output[target_name], ns_probs)
ns_roc_data = pd.DataFrame({'FalsePositiveRate': ns_fpr, 'TruePositiveRate': ns_tpr, "Threshold": ns_thresh})
ns_roc_data["Model"] = "Naive Prediction"
roc_output = pd.concat([roc_data, ns_roc_data])
roc_output.head()

Unnamed: 0,FalsePositiveRate,TruePositiveRate,Threshold,Model,auc_score
0,0.0,0.0,1.775619,Model,0.948991
1,0.0,0.001286,0.775619,Model,0.948991
2,0.0,0.002526,0.77442,Model,0.948991
3,0.0,0.003352,0.771652,Model,0.948991
4,5e-06,0.005006,0.770439,Model,0.948991
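
The first threshold in the table (1.775619) is not a real probability: `roc_curve` prepends a sentinel above the highest score (the maximum score plus one in the version used here; newer scikit-learn versions use infinity) so that the curve starts at (0, 0). The same output can also be used to pick an operating threshold. The sketch below uses Youden's J statistic (TPR minus FPR), which is one common choice and not something the original workbook does, on synthetic scores.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Synthetic labels and scores standing in for the test set output
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
scores = np.clip(0.3 * y_true + rng.normal(0.4, 0.15, size=500), 0, 1)

fpr, tpr, thresh = roc_curve(y_true, scores)

# Youden's J: the threshold with the largest gap between TPR and FPR
best = np.argmax(tpr - fpr)
best_threshold = thresh[best]

print(best_threshold, tpr[best], fpr[best])
```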


### Final output
Write the outputs as CSV files into the data folder.

In [10]:
test_output.to_csv(path + "Test Set Predictions.csv")
feat_df.to_csv(path + "Feature Importances.csv", index = False)
driver_output.to_csv(path + "Feature Contributions.csv", index = False)
roc_output.to_csv(path + "ROC Curve.csv", index = False)
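
Note that "Test Set Predictions.csv" is written with its index, which is the row identifier that matches the `index` column in "Feature Contributions.csv". The sketch below shows a round trip on a made-up frame written to a temporary folder; reading with `index_col=0` restores the identifier rather than creating an extra unnamed column.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up predictions with a non-contiguous index, as produced by a random split
toy_output = pd.DataFrame({"Churn": [1, 0],
                           "Churn Probability": [0.8, 0.1]},
                          index=[101, 205])

with tempfile.TemporaryDirectory() as tmp_dir:
    out_file = Path(tmp_dir) / "Test Set Predictions.csv"
    toy_output.to_csv(out_file)                        # index is written as the first column
    restored = pd.read_csv(out_file, index_col=0)      # and restored as the index here

print(restored.index.tolist())
```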