# Training Churn Predictors

Before deploying models on the Aqueduct platform, we first need some models to deploy. This notebook does not use any Aqueduct features; it simply builds three simple classifier models on some synthetic data.

Aqueduct has no opinion about how you train your models or which tools you use. We use scikit-learn here, but you could just as easily use TensorFlow, XGBoost, PyTorch, or anything else.

In [1]:
import pandas as pd
import numpy as np
import pickle
from sklearn.metrics import confusion_matrix

# Set up a path for us to store our models.
from pathlib import Path

Path("models").mkdir(parents=True, exist_ok=True)

## Load Synthetic Data

Here is some synthetic data about hypothetical Aqueduct customers. It captures features such as how many users a company has on Aqueduct, how many workflows it has created, and whether it uses dbt.

A second dataset records, for each customer ID, whether that customer has churned.

In [2]:
churn = pd.read_csv("../data/churn_data.csv")
churn.head()

Unnamed: 0,cust_id,churn
0,0,False
1,1,True
2,2,False
3,3,False
4,4,False


In [3]:
cust = pd.read_csv("../data/customers.csv")
cust.head()

Unnamed: 0,cust_id,n_workflows,n_rows,n_users,company_size,n_integrations,n_support_tickets,duration_months,using_deep_learning,n_data_eng,using_dbt
0,0,4,2007,2,29,5,3.0,1.0,False,2.0,True
1,1,3,8538,1,31,4,1.0,1.0,False,3.0,True
2,2,4,7548,1,29,3,1.0,3.0,False,1.0,True
3,3,3,4286,1,33,4,1.0,4.0,False,3.0,True
4,4,2,2136,1,28,3,0.0,1.0,False,2.0,True
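
The training cells later in this notebook pass features from `cust` and labels from `churn` to `fit` as separate objects, which only gives correct results if both tables list customers in the same order. A minimal sketch of how that assumption could be checked, using two small hand-made frames in place of the real CSVs (the `*_demo` names and their values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Tiny stand-ins for the two CSVs; values are made up for illustration.
cust_demo = pd.DataFrame({"cust_id": [0, 1, 2], "n_workflows": [4, 3, 4]})
churn_demo = pd.DataFrame({"cust_id": [0, 1, 2], "churn": [False, True, False]})

# Row-wise alignment: the cust_id columns must match position by position.
aligned = cust_demo["cust_id"].equals(churn_demo["cust_id"])

# If the order were not guaranteed, an explicit join on cust_id would be safer.
# validate="one_to_one" raises if either side contains duplicate customer IDs.
joined = cust_demo.merge(churn_demo, on="cust_id", how="inner", validate="one_to_one")

print(aligned, joined.shape)
```

If `aligned` were `False` on the real data, fitting on the `joined` frame would keep each customer's features paired with that customer's label.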


## Feature Cleanup

The first thing we're going to do is clean up our data. There are many kinds of featurization we could apply, but to keep things simple, the `log_featurize` function below adds a `log_`-prefixed column containing `log(x + 1)` for each numerical feature. It leaves `cust_id` and the two boolean features, `using_deep_learning` and `using_dbt`, untouched, and it keeps the original columns alongside the new ones.

In [4]:
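def log_featurize(cust: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `cust` with a log(x + 1) column added per numeric feature."""
    cust = cust.copy()  # avoid mutating the caller's DataFrame
    # The ID and the boolean flags are not log-transformed.
    skip_cols = ["cust_id", "using_deep_learning", "using_dbt"]
    for col in cust.columns.difference(skip_cols):
        # Adding 1 keeps zero-valued features finite.
        cust["log_" + col] = np.log(cust[col] + 1.0)
    return cust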


log_featurize(cust)

Unnamed: 0,cust_id,n_workflows,n_rows,n_users,company_size,n_integrations,n_support_tickets,duration_months,using_deep_learning,n_data_eng,using_dbt,log_company_size,log_duration_months,log_n_data_eng,log_n_integrations,log_n_rows,log_n_support_tickets,log_n_users,log_n_workflows
0,0,4,2007,2,29,5,3.0,1.0,False,2.0,True,3.401197,0.693147,1.098612,1.791759,7.604894,1.386294,1.098612,1.609438
1,1,3,8538,1,31,4,1.0,1.0,False,3.0,True,3.465736,0.693147,1.386294,1.609438,9.052399,0.693147,0.693147,1.386294
2,2,4,7548,1,29,3,1.0,3.0,False,1.0,True,3.401197,1.386294,0.693147,1.386294,8.929170,0.693147,0.693147,1.609438
3,3,3,4286,1,33,4,1.0,4.0,False,3.0,True,3.526361,1.609438,1.386294,1.609438,8.363342,0.693147,0.693147,1.386294
4,4,2,2136,1,28,3,0.0,1.0,False,2.0,True,3.367296,0.693147,1.098612,1.386294,7.667158,0.000000,0.693147,1.098612
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,495,2,3108,5,30,5,3.0,3.0,False,4.0,True,3.433987,1.386294,1.609438,1.791759,8.042056,1.386294,1.791759,1.098612
496,496,7,4240,1,30,1,2.0,1.0,False,2.0,False,3.433987,0.693147,1.098612,0.693147,8.352554,1.098612,0.693147,2.079442
497,497,3,2520,1,27,4,1.0,1.0,False,2.0,True,3.332205,0.693147,1.098612,1.609438,7.832411,0.693147,0.693147,1.386294
498,498,16,1972,1,33,1,3.0,3.0,False,3.0,True,3.526361,1.386294,1.386294,0.693147,7.587311,1.386294,0.693147,2.833213
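
The `+ 1.0` inside the logarithm keeps zero counts finite, since `log(0)` is undefined. NumPy also provides `np.log1p`, which computes the same quantity. The sketch below compares the two forms on a few sample values (the array is illustrative; 2007 is the `n_rows` value of the first customer shown above):

```python
import numpy as np

values = np.array([0.0, 1.0, 3.0, 2007.0])

manual = np.log(values + 1.0)  # the form used in log_featurize
builtin = np.log1p(values)     # NumPy's dedicated log(1 + x)

# Both forms agree to floating-point tolerance.
print(np.allclose(manual, builtin))
print(manual)
```

Either form could be used in `log_featurize`; `np.log1p` is mainly preferable when inputs can be very close to zero, where it is more accurate.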


## Model Training

Finally, we're going to train three simple models on this data using three common classification techniques: logistic regression, a decision tree, and a support vector machine (SVM). We use scikit-learn for all three. After training each model, we print its confusion matrix on the training data and serialize the model to disk.

### Logistic Regression Model

In [5]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

In [6]:
linear_model = LogisticRegression(max_iter=10000)
linear_model.fit(log_featurize(cust).drop(columns="cust_id"), churn["churn"])
churn["m1_pred"] = linear_model.predict(log_featurize(cust).drop(columns="cust_id"))
cm = confusion_matrix(churn["churn"], churn["m1_pred"])
cm

array([[339,  21],
       [116,  24]])
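
scikit-learn's `confusion_matrix` puts true labels on the rows and predicted labels on the columns, so with `False`/`True` labels the layout is `[[TN, FP], [FN, TP]]`. As a quick worked example, the sketch below derives accuracy, precision, and recall for the churn class from the matrix printed above:

```python
import numpy as np

# Confusion matrix of the logistic regression model, copied from the output above.
cm = np.array([[339, 21],
               [116, 24]])

# For a 2x2 matrix, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all customers classified correctly
precision = tp / (tp + fp)        # share of predicted churners who did churn
recall = tp / (tp + fn)           # share of actual churners the model caught

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

Read this way, the model identifies only 24 of the 140 customers who churned, even though its overall accuracy looks reasonable.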

In [7]:
with open("linear_model.pkl", "wb") as f:
    pickle.dump(linear_model, f)

### Decision Tree

In [8]:
decision_tree_model = DecisionTreeClassifier(
    max_depth=10,
    min_samples_split=3,
)
decision_tree_model.fit(log_featurize(cust).drop(columns="cust_id"), churn["churn"])
churn["m2_pred"] = decision_tree_model.predict(log_featurize(cust).drop(columns="cust_id"))
cm = confusion_matrix(churn["churn"], churn["m2_pred"])
cm

array([[349,  11],
       [ 29, 111]])

In [9]:
with open("decision_tree_model.pkl", "wb") as f:
    pickle.dump(decision_tree_model, f)

### Support Vector Machine

In [10]:
svm_model = SVC(C=1e7)
svm_model.fit(log_featurize(cust).drop(columns="cust_id"), churn["churn"])
churn["m3_pred"] = svm_model.predict(log_featurize(cust).drop(columns="cust_id"))
cm = confusion_matrix(churn["churn"], churn["m3_pred"])
cm

array([[344,  16],
       [104,  36]])
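
All three confusion matrices above are computed on the same rows the models were fit on, so they likely overstate how well each model generalizes. A minimal sketch of a held-out evaluation follows, using `make_classification` as a stand-in for the churn data (the synthetic dataset, the 80/20 split, and the random seeds are assumptions for illustration, not part of the original notebook):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: 500 rows and 8 features (illustrative only).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Hold out 20% of the rows; stratify keeps the class balance similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = DecisionTreeClassifier(max_depth=10, min_samples_split=3, random_state=0)
model.fit(X_train, y_train)

# Compare performance on rows the model has seen against rows it has not.
cm_train = confusion_matrix(y_train, model.predict(X_train))
cm_test = confusion_matrix(y_test, model.predict(X_test))
print(cm_train)
print(cm_test)
```

Comparing `cm_train` with `cm_test` gives a rough sense of how much a model is overfitting.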

In [11]:
with open("svm_model.pkl", "wb") as f:
    pickle.dump(svm_model, f)
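
To use a saved model later, it can be read back with `pickle.load`. The sketch below round-trips a small model through a temporary file and checks that its predictions are unchanged (the toy data and the `demo_model.pkl` file name are illustrative). Pickle files should only be loaded from trusted sources, and a pickled scikit-learn model is generally expected to be loaded with the same library version that saved it.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy, linearly separable data (illustrative only).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([False, False, True, True])
model = LogisticRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo_model.pkl"
    # Serialize the fitted model to disk.
    with open(path, "wb") as f:
        pickle.dump(model, f)
    # Read it back into a new object.
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should predict exactly as the original does.
same = np.array_equal(model.predict(X), restored.predict(X))
print(same)
```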

Okay, great! We now have three models that predict whether our customers might churn. Next, we'd like to create an ensemble of all three models and deploy it to Aqueduct.

To see how we do that, go to the `ensemble-workflow` notebook in this same directory!