# Client Explore

In this notebook, we explore client metadata and aggregate stats related to client activity with BigBank.  The goal is to gain insight into what type of client is likely to leave BigBank, so that the business can identify these clients before they leave, and take measures to keep them as customers.

## Setup

In [None]:
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

In [None]:
import declarativewidgets
from declarativewidgets import channel

declarativewidgets.init()

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from pymongo import MongoClient
from bson.objectid import ObjectId

sns.set(style="whitegrid")

In [None]:
churn_labels = ['Did not churn', 'Did churn']

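def filter_outliers(d, by_col=None):
    """Keep only values within 3 standard deviations of the mean."""
    if isinstance(d, pd.Series):
        return d[(d - d.mean()).abs() <= 3 * d.std()]
    elif isinstance(d, pd.DataFrame):
        if not by_col:
            raise ValueError('by_col is required for DataFrame')
        # Filter rows by how far the by_col value lies from that column's mean
        col = d[by_col]
        return d[(col - col.mean()).abs() <= 3 * col.std()]
    raise TypeError('expected a pandas Series or DataFrame')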
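
As a rough illustration of the three-standard-deviation rule that `filter_outliers` applies, the sketch below reproduces it in plain Python on made-up values (the `incomes` list and the `within_three_sigma` helper are hypothetical and not part of the notebook). The rule needs a reasonably large sample: among ten or fewer points, no value can lie more than three sample standard deviations from the mean, so nothing would be filtered.

```python
import statistics

def within_three_sigma(values):
    """Return the values within 3 sample standard deviations of the mean."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return [v for v in values if abs(v - mean) <= 3 * std]

# Hypothetical incomes (in thousands): twenty typical values and one extreme value
incomes = list(range(40, 60)) + [1000]
kept = within_three_sigma(incomes)

# The extreme value is dropped; the twenty typical values are kept
print(len(incomes), len(kept))
```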

In [None]:
%%html
<link rel="import" href="urth_components/paper-dropdown-menu/paper-dropdown-menu.html" 
    is='urth-core-import' package='PolymerElements/paper-dropdown-menu'>
<link rel="import" href="urth_components/paper-menu/paper-menu.html"
    is='urth-core-import' package='PolymerElements/paper-menu'>
<link rel="import" href="urth_components/paper-item/paper-item.html"
    is='urth-core-import' package='PolymerElements/paper-item'>
<link rel="import" href="urth_components/paper-card/paper-card.html"
    is='urth-core-import' package='PolymerElements/paper-card'>
<link rel="import" href="urth_components/paper-checkbox/paper-checkbox.html"
    is='urth-core-import' package='PolymerElements/paper-checkbox'>
<link rel="import" href="urth_components/iron-flex-layout/classes/iron-flex-layout.html" 
    is='urth-core-import' package='PolymerElements/iron-flex-layout'>

## Load client data

Load information about BigBank clients.  The data consists of client metadata, such as age, gender, etc., as well as aggregate statistics about each client's banking activity (e.g., number of credit/debit card transactions, total transaction amount).

The data also include a `churn` label, which indicates whether or not the client left BigBank.

To load the data, modify `mongo_configs` with the appropriate IP address, port, username and password.

In [None]:
MONGO_HOST = 'mongodb'
MONGO_PORT = 27017

In [None]:
mongo_configs = {
    "local": {
        "host": MONGO_HOST,
        "port": MONGO_PORT, 
        "db": "demo",
        "collection": "client_features"
    },
    "remote": {
        "host": MONGO_HOST,
        "port": MONGO_PORT, 
        "user": "mongo_user", 
        "password": "mongo_pass", 
        "db": "demo",
        "collection": "client_features"
    }
}

These are helper functions to load the data from MongoDB and query a collection.

In [None]:
def get_mongo_uri(**kwargs):
    if all([x in kwargs for x in ['user','password']]):
        return 'mongodb://{user}:{password}@{host}:{port}'.format(**kwargs)
    return 'mongodb://{host}:{port}'.format(**kwargs)

def query_collection(db, collection, limit=0):
    collection = db[collection]
    cursor = collection.find({}).limit(limit)
    df = pd.DataFrame(list(cursor))
    # Remove the MongoDB _id column (absent when the query returns no documents)
    if '_id' in df.columns:
        del df['_id']
    return df

def load_data_from_mongo(uri, db_name, collection):
    client = MongoClient(uri)
    db = client[db_name]
    return query_collection(db, collection)

def load_data(location=None):
    loc = location or 'local'
    config = mongo_configs[loc]
    return load_data_from_mongo(
        get_mongo_uri(**config),
        config['db'],
        config['collection']
    )
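
If a username or password contains characters such as `@` or `:`, they need to be percent-encoded before being placed in a MongoDB URI. The following is a minimal sketch of that step, assuming a hypothetical `build_mongo_uri` helper and a made-up password; neither is part of this notebook.

```python
from urllib.parse import quote_plus

def build_mongo_uri(host, port, user=None, password=None):
    """Build a MongoDB URI, percent-encoding the credentials if given."""
    if user is not None and password is not None:
        return 'mongodb://{}:{}@{}:{}'.format(
            quote_plus(user), quote_plus(password), host, port)
    return 'mongodb://{}:{}'.format(host, port)

# Without credentials
uri_plain = build_mongo_uri('mongodb', 27017)

# With a hypothetical password containing reserved characters
uri_auth = build_mongo_uri('mongodb', 27017, 'mongo_user', 'p@ss:word')

print(uri_plain)
print(uri_auth)
```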

In [None]:
client_df = load_data()
client_df.head()

## Plot X vs. Y

We begin our exploration of the data set by plotting pairs of columns against each other as scatterplots with marginal distributions.

In [None]:
def jointplot(x, y, data, **kwargs):
    size = kwargs.pop('size', 10)
    alpha = kwargs.pop('alpha', 0.3)
    return sns.jointplot(x=x, y=y, data=data, 
                         alpha=alpha,
                         size=size,
                         **kwargs)

# for widget
def w_jointplot(x, y):
    g = jointplot(x, y, filter_outliers(client_df, by_col=y))
    plt.close()
    return g.fig

In [None]:
ax = jointplot('age_years', 'annual_income', filter_outliers(client_df, by_col='annual_income'))

We can use a widget to make exploration a bit easier.  Instead of having to type the column names and re-run the cell above, we can create drop-down menus that let us select which two columns to plot.  We then bind another widget to invoke the `w_jointplot` function defined above, which generates the plot for the widget to display.

In [None]:
channel('clients').set('columns', list(client_df.columns))
channel('clients').set('x', 'age_years')
channel('clients').set('y', 'annual_income')

In [None]:
%%html
<template is="urth-core-bind" channel="clients">
    <div class="card-content">
        <paper-dropdown-menu label="Select x" 
                selected-item-label="{{ x }}" noink>
            <paper-menu class="dropdown-content" selected="[[ x ]]" 
                attr-for-selected="label">
                <template is="dom-repeat" items="[[ columns ]]">
                    <paper-item label="[[ item ]]">[[item]]</paper-item>
                </template>
            </paper-menu>
        </paper-dropdown-menu>
        <paper-dropdown-menu label="Select y" 
                selected-item-label="{{ y }}" noink>
            <paper-menu class="dropdown-content" selected="[[ y ]]" 
                attr-for-selected="label">
                <template is="dom-repeat" items="[[ columns ]]">
                    <paper-item label="[[ item ]]">[[item]]</paper-item>
                </template>
            </paper-menu>
        </paper-dropdown-menu>
    </div>
    <urth-core-function
        ref="w_jointplot"
        arg-x="{{ x }}"
        arg-y="{{ y }}"
        result="{{ jointplot }}" 
        auto></urth-core-function>
    <img src="{{ jointplot }}">
</template>

## Correlations

Next, we compute the correlation coefficient between each pair of variables.

In [None]:
corr = client_df.corr()

# mask the upper triangle (including the diagonal) so only the lower triangle is shown
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

f, ax = plt.subplots(figsize=(12,12))
ax = sns.heatmap(corr, mask=mask, square=True, annot=True, fmt='.2f',
                 cbar=True,
                 ax=ax)
title = ax.set_title('Correlations', size=14)

The data show that the total and average transaction amounts are very highly correlated with annual income.

We are most concerned with churn, however, which appears to be negatively correlated with both age and activity level (a measure of how actively a client uses the bank).  Since churn is either 0 (did not churn) or 1 (did churn), a negative correlation indicates that clients who churned tended to be younger and less active.
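
To make the link between the sign of the correlation and the group means concrete, here is a small sketch in plain Python. The `ages` and `churn` values are invented for illustration, and the `pearson` helper is hypothetical. With a 0/1 variable, a negative Pearson coefficient goes hand in hand with a lower mean in the group coded as 1.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Invented example data: 1 = did churn, 0 = did not churn
ages = [22, 27, 31, 45, 52, 60]
churn = [1, 1, 1, 0, 0, 0]

r = pearson(ages, churn)

churned_ages = [a for a, c in zip(ages, churn) if c == 1]
stayed_ages = [a for a, c in zip(ages, churn) if c == 0]
mean_age_churned = sum(churned_ages) / len(churned_ages)
mean_age_stayed = sum(stayed_ages) / len(stayed_ages)

print(r < 0, mean_age_churned < mean_age_stayed)
```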

## Churn

We plot the distributions of clients who churned and those who did not on the same axes.

In [None]:
def plot_churn_by(df, col, **kwargs):
    f, ax = plt.subplots(figsize=(12,10), sharex=True)
    kde = kwargs.get('kde', False)
    hist = kwargs.get('hist', False)
    for churn in df.churn.unique():
        sns.distplot(df[df.churn == churn][col], 
                     label=churn_labels[churn], 
                     kde_kws={'shade': (kde and not hist)},
                     ax=ax, 
                     **kwargs)

    ax.set_title('Client Churn by {}'.format(col))
    label = ax.set_xlabel('{}'.format(col))
    return f, ax

def w_plot_churn_by(column, hist=True, kde=False, norm_hist=False):
    df = filter_outliers(client_df, by_col=column)
    f, ax = plot_churn_by(df, column, hist=hist, kde=kde, norm_hist=norm_hist)
    plt.legend()
    plt.close()
    return f

f, ax = plot_churn_by(client_df, 'age_years')
ax = plt.legend()

Once again, we use a widget to make it easier to generate distributions over different client features.

In [None]:
channel('clients').set('churn_dist_hist', True)

In [None]:
%%html
<template is="urth-core-bind" channel="clients">
    <div class="layout horizontal justified">
        <div class="card-content">
            <paper-dropdown-menu label="Select column" 
                    selected-item-label="{{ churn_dist_col }}" noink>
                <paper-menu class="dropdown-content" selected="[[ churn_dist_col ]]" 
                    attr-for-selected="label">
                    <template is="dom-repeat" items="[[ columns ]]">
                        <paper-item label="[[ item ]]">[[ item ]]</paper-item>
                    </template>
                </paper-menu>
            </paper-dropdown-menu>
        </div>
        <div><paper-checkbox checked="{{ churn_dist_hist }}" noink>histogram</paper-checkbox></div>
        <div><paper-checkbox checked="{{ churn_dist_norm_hist }}" noink>normalized</paper-checkbox></div>
        <div><paper-checkbox checked="{{ churn_dist_kde }}" noink>KDE</paper-checkbox></div>
    </div>
    <urth-core-function
        ref="w_plot_churn_by"
        arg-column="{{ churn_dist_col }}"
        arg-hist="{{ churn_dist_hist }}"
        arg-norm_hist="{{ churn_dist_norm_hist }}"
        arg-kde="{{ churn_dist_kde }}"
        result="{{ churn_dist }}" 
        auto></urth-core-function>
    <img src="{{ churn_dist }}">
</template>

When we plot the **age** distributions of clients who have churned and those who did not churn, we can see that clients who have churned are generally younger.

In [None]:
churn_age_stats = client_df.groupby('churn')['age_years'].describe().unstack().T
churn_age_stats.columns = churn_labels
churn_age_stats

The two features that showed a negative correlation with churn were age and activity level.  Here we generate a boxplot with those two features as the axes, and churn as the category.

The plot shows that clients who churn tend to be younger across all levels of activity.

In [None]:
col = 'age_years'
data = filter_outliers(client_df, by_col=col)

f, ax = plt.subplots(figsize=(12,8))
ax = sns.boxplot(x='activity_level', y=col, hue="churn", data=data, 
                 palette='muted', ax=ax)
title = ax.set_title('Client Churn by Activity Level')
label = ax.set_ylabel('Age (Years)')
label = ax.set_xlabel('Activity Level')
handles, labels = ax.get_legend_handles_labels()
legend = ax.legend(handles, churn_labels)

This beeswarm plot shows a random sample of 2,000 clients grouped by the level of activity they maintain with the bank.  Clients who churned are concentrated in the lower activity levels (0-2), and within those levels, younger clients churned more often than older ones.

In [None]:
f, ax = plt.subplots(figsize=(10,8))
ax = sns.swarmplot(x='activity_level', y='age_years', hue='churn', 
                   data=data.sample(n=2000, random_state=51), 
                   palette='muted', ax=ax)
title = ax.set_title('Client Churn by Activity Level')
label = ax.set_ylabel('Age (Years)')
label = ax.set_xlabel('Activity Level')
handles, labels = ax.get_legend_handles_labels()
legend = ax.legend(handles, churn_labels)
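
A visual impression like this can be cross-checked numerically by computing the share of churned clients at each activity level. The sketch below does so in plain Python on a handful of invented `(activity_level, churn)` records; the `records` list and the `churn_rate_by_level` helper are hypothetical and stand in for the real client data.

```python
from collections import defaultdict

def churn_rate_by_level(records):
    """Map each activity level to the fraction of its clients who churned."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    for level, did_churn in records:
        totals[level] += 1
        churned[level] += did_churn
    return {level: churned[level] / totals[level] for level in sorted(totals)}

# Invented records: (activity_level, churn), where churn is 0 or 1
records = [
    (0, 1), (0, 1), (0, 0),
    (1, 1), (1, 0),
    (2, 1), (2, 0), (2, 0),
    (3, 0), (3, 0),
    (4, 0),
]

rates = churn_rate_by_level(records)
for level, rate in rates.items():
    print(level, round(rate, 2))
```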

## Select data source

This is a simple drop-down widget to allow the notebook user to select from multiple backend MongoDB hosts.  When the user selects a new MongoDB location, it invokes a handler that reloads the client data from that location.  The handler also tickles variables on the widget channel, which triggers the widgets to refresh.

In [None]:
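def on_host_selected(old, new):
    # Rebind the module-level DataFrame; without `global`, the
    # assignment would only create a local variable.
    global client_df
    client_df = load_data(new)
    # Re-set the column list on the widget channel for the new data
    channel('clients').set('columns', list(client_df.columns))

channel('db').set('hosts', list(mongo_configs.keys()))
channel('db').watch('selected_host', on_host_selected)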
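
One detail worth keeping in mind for handlers like this: an assignment inside a function binds a local name unless the name is declared `global`. The stand-alone sketch below shows the difference using hypothetical names (`data`, `reload_local`, `reload_global`) that are not part of this notebook.

```python
data = ['old']

def reload_local(new_value):
    # This creates a local variable named `data`;
    # the module-level `data` is left untouched.
    data = [new_value]

def reload_global(new_value):
    # `global` makes the assignment rebind the module-level name.
    global data
    data = [new_value]

reload_local('a')
after_local = list(data)

reload_global('b')
after_global = list(data)

print(after_local, after_global)
```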

In [None]:
%%html
<template is="urth-core-bind" channel="db">
    <div class="card-content">
        <paper-dropdown-menu label="Select MongoDB host" 
                selected-item-label="{{ selected_host }}" noink>
            <paper-menu class="dropdown-content" selected="[[ selected_host ]]" 
                attr-for-selected="label">
                <template is="dom-repeat" items="[[ hosts ]]">
                    <paper-item label="[[ item ]]">[[ item ]]</paper-item>
                </template>
            </paper-menu>
        </paper-dropdown-menu>
    </div>
</template>