# PDPilot Churn Example

This notebook demonstrates how to use PDPilot to analyze a model trained on the [churn dataset](https://epistasislab.github.io/pmlb/profile/churn.html). Each row in the dataset represents a customer of a telephone service provider. The goal is to predict whether the customer will churn, that is, switch to a different provider.

First, we import [pmlb](https://epistasislab.github.io/pmlb/) to load the dataset, our chosen model class from scikit-learn, and the `partial_dependence` function and `PDPilotWidget` class from PDPilot.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from pmlb import fetch_data
from pdpilot import partial_dependence, PDPilotWidget

Next, we load the dataset into a pandas dataframe and train a random forest model on it.

In [None]:
df = fetch_data('churn')

In [None]:
df

In [None]:
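# Separate the inputs from the label: drop the target, the 'state' and
# 'phone number' columns, and three of the charge columns.
df_X = df.drop(columns=[
    'target', 'state', 'phone number',
    'total day charge', 'total night charge',
    'total eve charge'
])
# Ground-truth labels as a NumPy array.
y = df['target'].values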

In [None]:
model = RandomForestClassifier(n_estimators=100, max_features='sqrt')
model.fit(df_X, y)

Next, we get a list of the names of features that we want to compute plots for.

In [None]:
features = list(df_X.columns)

PDPilot can support up to a few thousand instances. Here we randomly sample 1000 instances from our dataset and get the corresponding ground truth labels.

In [None]:
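# Draw 1000 rows at random (a different sample on every run).
subset = df_X.sample(1000)
# Indexing y with subset.index assumes df_X has the default 0..n-1 integer
# index, so that index labels double as row positions in y.
labels = y[subset.index].tolist()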
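
If the dataframe does not have the default integer index, or if the sample needs to be repeatable, the sampling step could be written more defensively. The following is a minimal sketch on toy data, not part of the original workflow: `toy_X`, `toy_y`, and the seed value are hypothetical stand-ins for `df_X`, `y`, and a seed of your choosing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a non-default index, standing in for df_X and y.
toy_X = pd.DataFrame(
    {"minutes": [10.0, 20.0, 30.0, 40.0, 50.0]},
    index=[100, 101, 102, 103, 104],
)
toy_y = np.array([0, 1, 0, 1, 1])

# A fixed random_state makes the sample repeatable across runs.
toy_subset = toy_X.sample(3, random_state=0)

# Translate index labels into row positions before indexing the label array.
positions = toy_X.index.get_indexer(toy_subset.index)
toy_labels = toy_y[positions].tolist()
```

With the default index used in this notebook, `positions` would be identical to `subset.index`, so the simpler indexing above gives the same labels.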

Now we are ready to compute the data needed by the widget. For classification, the function passed to the `predict` parameter must take a pandas DataFrame of instances as input and return a 1D NumPy array of predicted probabilities, one per instance. Since this is a binary classification dataset, we use the probabilities of the positive class. For a multi-class dataset, we would need to choose one class to compute the plots for.

In [None]:
def predict(X):
    # Column 1 of predict_proba holds the probability of the positive class.
    return model.predict_proba(X)[:, 1]
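
For the multi-class case mentioned above, one way to pick a class is to look up its column through the fitted model's `classes_` attribute instead of hard-coding an index. This is a minimal sketch, not something PDPilot provides: `make_class_predict` is a hypothetical helper, and it assumes a scikit-learn-style classifier that exposes `classes_` and `predict_proba`.

```python
def make_class_predict(model, class_label):
    """Build a predict function for one class of a fitted classifier.

    Hypothetical helper: assumes `model` exposes scikit-learn's `classes_`
    attribute and `predict_proba` method.
    """
    # Columns of predict_proba follow the order of model.classes_.
    class_index = list(model.classes_).index(class_label)

    def predict(X):
        # Keep only the probability column of the chosen class.
        return model.predict_proba(X)[:, class_index]

    return predict
```

For the binary model in this notebook, `make_class_predict(model, model.classes_[1])` would be equivalent to the `predict` function defined above.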

With the `ordinal_features` and `nominal_features` parameters, we can override the default feature type inference.

In [None]:
pd_data = partial_dependence(
    predict=predict,
    df=subset,
    features=features,
    ordinal_features={'international plan', 'voice mail plan', 'number customer service calls'},
    nominal_features={'area code'},
    resolution=20,
    n_jobs=4,
)
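
For intuition about the curves the widget displays, a one-way partial dependence curve can be sketched by hand: fix one feature at each value on a grid for every instance, then average the predictions. The code below is an illustrative sketch on toy data and is not PDPilot's implementation. `one_way_pd`, `toy`, and `toy_predict` are hypothetical names, and treating `resolution` as the number of evenly spaced grid points is an assumption made here for illustration.

```python
import numpy as np
import pandas as pd

def one_way_pd(predict, df, feature, resolution=20):
    """Average prediction as `feature` is swept over a grid (illustrative only)."""
    # Assumption: an evenly spaced grid between the observed min and max.
    grid = np.linspace(df[feature].min(), df[feature].max(), resolution)
    means = []
    for value in grid:
        modified = df.copy()
        modified[feature] = value  # set the feature to the same value in every row
        means.append(float(np.mean(predict(modified))))
    return grid, np.array(means)

# Hypothetical toy data and model: prediction = 0.1 * a + 0.2 * b.
toy = pd.DataFrame({"a": [0.0, 1.0, 2.0, 3.0], "b": [1.0, 1.0, 2.0, 2.0]})

def toy_predict(X):
    return (0.1 * X["a"] + 0.2 * X["b"]).to_numpy()

grid, curve = one_way_pd(toy_predict, toy, "a", resolution=4)
```

Because the toy model is additive, the resulting curve for `a` is a straight line with slope 0.1, shifted by the average contribution of `b`.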

Finally, we create the widget and display it.

In [None]:
w = PDPilotWidget(
    predict=predict,
    df=subset,
    labels=labels,
    pd_data=pd_data,
    height=650
)

w