# OH-19
#### Favio Vázquez

## The problem: Customer Churn

![](https://www.insideselfstorage.com/sites/insideselfstorage.com/files/styles/article_featured_retina/public/Sad-Customer-Service.jpg?itok=S9sd0R3T)
Credit: https://www.insideselfstorage.com/customer-service/7-deadly-customer-service-situations-self-storage-and-how-handle-them

Customer churn occurs when customers or subscribers stop doing business with a firm or service.

Each row represents a customer, and each column contains one of the customer’s attributes.

The data set includes information about:

- Customers who left within the last month – the column is called **Churn**
- Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
- Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
- Demographic info about customers – gender, age range, and whether they have partners and dependents

## Understand the business context and problem

Before spending time trying to solve a business problem, we have to be sure that we have a problem. For that, we need to meet with the people close to the business problem and with the stakeholders.

We had two meetings, one with HR and the other with the main executives. This is what we heard:

- Customers are leaving but we don't know why.
- We have 1 month of data for customers where we know which ones stayed and which ones left.
- Customer churn can't surpass 15% per year, according to our calculations.
- We don't know the financial impact of losing a customer.
- We can give a \$500 voucher to customers identified as churn.
- The estimated lifetime value of a customer is \$7500.

After those meetings we have to check the existing data in the company and find useful information in it. Let's assume we did it and after a data integration process we created a comprehensive dataset for our customers and their information. Remember that we are working with a telco company. 

## Libraries

In [None]:
!pip install datatable

In [None]:
!pip install plotly

In [None]:
import pandas as pd
import datatable as dt
from datatable import f  # only `f` is used below; importing min/max/mean would shadow the Python builtins
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go
import plotly.io as pio
import scipy.stats as stats
import warnings
# Silence library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

## Load data

In [None]:
df = dt.fread("data/churn-data.csv")

In [None]:
df.head()

In [None]:
df.shape

In the `df.head()` output above, the colour under each column name indicates its data type: red denotes string, green denotes int, and blue denotes float.

## How many customers have left? 

In [None]:
df[f.Churn == "Yes", dt.count()]

In [None]:
1869/7043

1869 customers have left, which is about 26.5% of our customer base. The business told us that churn can't surpass 15% per year, so we have a problem.
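
The same check can be written without typing the ratio by hand. This is a minimal sketch in plain Python: the helper `churn_rate` and the constant `MAX_ANNUAL_CHURN` are illustrative names that do not appear elsewhere in the notebook, and the counts are the ones computed above. Keep in mind that the business limit is annual while our data covers a single month, so the comparison is only indicative.

```python
# Illustrative helper (hypothetical name): churn rate from raw counts.
def churn_rate(churned: int, total: int) -> float:
    if total <= 0:
        raise ValueError("total must be positive")
    return churned / total

MAX_ANNUAL_CHURN = 0.15  # limit quoted by the business in the meetings

rate = churn_rate(churned=1869, total=7043)
print(f"Churn rate: {rate:.1%}")
print("Above the business limit:", rate > MAX_ANNUAL_CHURN)
```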

## How much money have we lost due to the loss of customers?

In [None]:
df[:, dt.count(), dt.by(dt.f.Churn)]

In [None]:
df[dt.f.Churn == 'Yes', 'TotalCharges'].sum1()

Churned customers account for \$2,862,926 in total charges, which we take as the money lost due to customer churn. So let's try to solve this problem.

## Data exploration

In [None]:
df_pandas = df.to_pandas()

In [None]:
df_pandas.head()

In [None]:
def diagnostic_plots(df_pandas, variable):
    
    plt.figure(figsize=(20, 9))

    plt.subplot(1, 3, 1)
    sns.histplot(data = df_pandas, x=variable, bins=30, kde=True)
    plt.title('Histogram')
    
    plt.subplot(1, 3, 2)
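    # Q-Q plot against a normal distribution; drop NaNs so the fitted line is defined
    stats.probplot(df_pandas[variable].dropna(), dist="norm", plot=plt)
    plt.ylabel(f'{variable} quantiles')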

    plt.subplot(1, 3, 3)
    sns.boxplot(x=df_pandas[variable])
    plt.title('Boxplot')
    
    plt.show()

In [None]:
num_columns=df_pandas.select_dtypes(include=["number"]).columns
num_columns

In [None]:
for i in num_columns:
    diagnostic_plots(df_pandas,i)

In [None]:
sns.pairplot(df_pandas.drop("SeniorCitizen",axis=1),hue="Churn",aspect=3);

In [None]:
fig = px.histogram(df_pandas, x="Churn")
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()

In [None]:
fig = px.histogram(df_pandas, x="Churn", color="SeniorCitizen")
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()

In [None]:
fig = px.histogram(df_pandas, x="Churn", color="OnlineSecurity", barmode="group")
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()

In [None]:
fig = px.box(df_pandas, x='Churn', y = 'tenure')
fig.show()

In [None]:
# `fill` replaces the deprecated `shade` argument of kdeplot
ax = sns.kdeplot(df_pandas.MonthlyCharges[(df_pandas["Churn"] == 'No') ],
                color="Red", fill=True);
ax = sns.kdeplot(df_pandas.MonthlyCharges[(df_pandas["Churn"] == 'Yes') ],
                ax =ax, color="Blue", fill=True);
ax.legend(["Not Churn","Churn"],loc='upper right');
ax.set_ylabel('Density');
ax.set_xlabel('Monthly Charges');
ax.set_title('Distribution of monthly charges by churn');

In [None]:
corr = df_pandas.apply(lambda x: pd.factorize(x)[0]).corr()
mask = np.triu(np.ones_like(corr, dtype=bool))

heat = go.Heatmap(
    z=corr.mask(mask),
    x=corr.columns,
    y=corr.columns,
    colorscale=px.colors.diverging.RdBu,
    zmin=-1,
    zmax=1
)

pio.templates.default = "plotly_white"

# Create the figure first; otherwise the updates below would be applied
# to the `fig` left over from the previous cell
fig = go.Figure(data=[heat])

fig.update_xaxes(side="bottom")

fig.update_layout(
    title_text='Heatmap', 
    title_x=0.5, 
    width=1000, 
    height=1000,
    xaxis_showgrid=False,
    yaxis_showgrid=False,
    xaxis_zeroline=False,
    yaxis_zeroline=False,
    yaxis_autorange='reversed',
    template='plotly_white'
)

fig.show()

## Data cleaning

Check this great source for learning more about datatable by my friend Rohan Rao:

https://github.com/vopani/datatableton

In [None]:
df.names

In [None]:
df.stypes

In [None]:
## missing values
dt.math.isna(df).sum()

We only have 11 missing values in the TotalCharges column. 

In [None]:
## Delete missing rows
df = df[dt.rowall(dt.f[:] != None), :]

In [None]:
# Delete customerID
del df[:, "customerID"]

In [None]:
df.head()

In [None]:
# Encode Churn as integers
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df[:, 'Churn'] = dt.Frame(le.fit_transform(np.ravel(df[:, 'Churn'])))
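
To make the encoding explicit, here is a small plain-Python sketch of what a label encoder does: it sorts the distinct class names and replaces each value with its position in that order. The function `encode_labels` is a hypothetical helper written for illustration, not scikit-learn's code. Because `"No"` sorts before `"Yes"`, churned customers end up as `1`.

```python
# Illustration only: map sorted class names to integers, as a label encoder does.
def encode_labels(values):
    classes = sorted(set(values))  # ["No", "Yes"] for the Churn column
    mapping = {label: idx for idx, label in enumerate(classes)}
    return [mapping[v] for v in values], mapping

codes, mapping = encode_labels(["No", "Yes", "No", "No", "Yes"])
print(mapping)  # {'No': 0, 'Yes': 1}
print(codes)    # [0, 1, 0, 0, 1]
```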

In [None]:
# Function for OHE
def ohe_columns(columns,df):
    df_work = df.copy()
    for column in columns:
        df_ohe = dt.str.split_into_nhot(df_work[column])
        df_ohe.names = [f'{column}_{col}' for col in df_ohe.names]
        df_work.cbind(df_ohe)
    return df_work
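
For readers new to one-hot encoding, the sketch below shows the idea on a single column in plain Python, using the same `<column>_<value>` naming as `ohe_columns`. The helper `one_hot` and the sample values are made up for illustration; the notebook itself relies on datatable for this step.

```python
# Illustration only: one-hot encode one categorical column in plain Python.
def one_hot(column_name, values):
    categories = sorted(set(values))
    # One 0/1 indicator list per distinct category
    return {
        f"{column_name}_{cat}": [1 if v == cat else 0 for v in values]
        for cat in categories
    }

# Made-up sample values for a contract column
encoded = one_hot("Contract", ["Month-to-month", "One year", "Month-to-month"])
for name, flags in encoded.items():
    print(name, flags)
```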

In [None]:
# Select categorical columns
categorical_columns = df[:, str].names

In [None]:
# Get final df after OHE
df_final = ohe_columns(categorical_columns,df)

In [None]:
# Delete original columns
del df_final[:, categorical_columns]

In [None]:
df_final.head()

In [None]:
# Save under data/ so the path matches the one read in the Modeling section
df_final.to_csv("data/churn_data_cleaned.csv")

## Modeling

In [None]:
data = pd.read_csv("data/churn_data_cleaned.csv")

In [None]:
data.shape

In [None]:
data.dtypes

In [None]:
data.head()

### 1. H2O AutoML

In [None]:
!pip install h2o

In [None]:
import h2o
from h2o.automl import H2OAutoML

In [None]:
h2o.init()

In [None]:
dataset = h2o.import_file("data/churn_data_cleaned.csv")

In [None]:
dataset.head()

In [None]:
train, test = dataset.split_frame([0.8], seed=42)

In [None]:
print("train:%d test:%d" % (train.nrows, test.nrows))

In [None]:
# Identify predictors and response
x = train.columns
y = "Churn"
x.remove(y)

# For binary classification, response should be a factor
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()

In [None]:
aml = H2OAutoML(max_runtime_secs = 900, 
                max_models = 25,  
                seed = 42, 
                project_name='classification_1',
                sort_metric = "AUC")

%time aml.train(x = x, y = y, training_frame = train)

In [None]:
lb = aml.leaderboard
lb.head(rows = lb.nrows)

In [None]:
aml.leader

In [None]:
aml.leader.model_performance(test_data=test)

In [None]:
aml.leader.model_performance(test_data=test).plot()

In [None]:
aml.predict(test)

In [None]:
aml.explain(test)

In [None]:
aml.leader.model_performance(test_data=test).confusion_matrix()

This confusion matrix is computed on the test set, which holds 20% of our data (1400 rows). We have 211 True Positives (15%): these are the customers whose lifetime value we will be able to extend. Without the prediction, there would have been no opportunity for intervention.

We also have 195 (14%) False Positives, where we will lose money because the promotion offered to these customers is just an extra cost.

596 (42%) are True Negatives (good customers) and 68 (5%) are False Negatives (missed opportunities).
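
Two ratios summarize these counts for the retention campaign: precision (how many of the flagged customers really churn) and recall (how many of the churners we catch). A minimal sketch using the counts quoted above; the variable names are illustrative.

```python
# Counts quoted from the confusion matrix discussed above
tp, fp, tn, fn = 211, 195, 596, 68

precision = tp / (tp + fp)  # share of flagged customers who actually churn
recall = tp / (tp + fn)     # share of actual churners the model catches

print(f"precision={precision:.2f} recall={recall:.2f}")
```

Roughly half of the vouchers would go to customers who were not going to leave, while about three out of four churners would be reached.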

In a churn model, the reward of a true positive is often very different from the cost of a false positive. Let’s use the following assumptions:

- A \$500 voucher will be offered to all the customers identified as churn (True Positive + False Positive).
- If we are able to stop the churn, we will gain \$7500 in customer lifetime value.

| Description                    | Customers | Value (\$) | Total (\$)    |
|--------------------------------|-----------|------------|---------------|
| True Positive                  | 211       | 7500       | 1,582,500     |
| True Positive + False Positive | 406       | 500        | -203,000      |
| Net value                      |           |            | **1,379,500** |
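
The arithmetic behind this estimate can be reproduced in a few lines. The sketch below assumes, as the text does, that every true positive accepts the voucher and stays; the function `campaign_value` and the constant names are illustrative.

```python
VOUCHER_COST = 500     # offered to every customer flagged as churn
LIFETIME_VALUE = 7500  # gained for each churner we manage to retain

def campaign_value(true_pos: int, false_pos: int) -> int:
    """Net value, assuming every true positive accepts the voucher and stays."""
    gain = true_pos * LIFETIME_VALUE
    cost = (true_pos + false_pos) * VOUCHER_COST
    return gain - cost

net = campaign_value(true_pos=211, false_pos=195)
print(f"Net campaign value: {net:,}")
```

Changing the two constants makes it easy to see how sensitive the result is to the voucher size or to a more conservative lifetime value.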

### 2. GBM with H2O

In [None]:
from h2o.estimators import H2OGradientBoostingEstimator
from h2o.grid import H2OGridSearch

In [None]:
train, valid, test = dataset.split_frame([0.7, 0.15], seed=42)

In [None]:
# Identify predictors and response
x = train.columns
y = "Churn"
x.remove(y)

# For binary classification, response should be a factor
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()
valid[y] = valid[y].asfactor()

In [None]:
gbm = H2OGradientBoostingEstimator(seed = 42, 
                                   model_id = 'default_gbm')

%time gbm.train(x = x, y = y, training_frame = train, validation_frame = valid)

In [None]:
gbm

In [None]:
gbm.predict(valid)

In [None]:
default_gbm_per = gbm.model_performance(test)

In [None]:
default_gbm_per

In [None]:
# Hyperparameter tuning with a random grid search

gbm = H2OGradientBoostingEstimator(ntrees = 500,
                                   learn_rate = 0.05,
                                   seed = 42,
                                   model_id = 'grid_gbm')

# Use `v` as the loop variable so it is not confused with the predictor list `x`
hyper_params_tune = {'max_depth' : [4, 5, 6, 7, 8],
                     'sample_rate': [v/100. for v in range(20,101)],
                     'col_sample_rate' : [v/100. for v in range(20,101)],
                     'col_sample_rate_per_tree': [v/100. for v in range(20,101)],
                     'col_sample_rate_change_per_level': [v/100. for v in range(90,111)]}

search_criteria_tune = {'strategy': "RandomDiscrete",
                        'max_runtime_secs': 900,  
                        'max_models': 100,  ## build no more than 100 models
                        'seed' : 42}

random_grid = H2OGridSearch(model = gbm, hyper_params = hyper_params_tune,
                            grid_id = 'random_grid',
                            search_criteria = search_criteria_tune)

%time random_grid.train(x = x, y = y, training_frame = train, validation_frame = valid)
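
The full grid above has 5 × 81 × 81 × 81 × 21 = 55,801,305 combinations, far too many to train exhaustively, which is why a random strategy with `max_models` and `max_runtime_secs` limits is used. The sketch below illustrates the idea on a reduced grid in plain Python. It is a stand-in written for this notebook, not H2O's actual sampling code, and `sample_candidates` is a hypothetical helper.

```python
import random

# Illustration only: reduced version of the grid used above
grid = {
    "max_depth": [4, 5, 6, 7, 8],
    "sample_rate": [v / 100 for v in range(20, 101)],
    "col_sample_rate": [v / 100 for v in range(20, 101)],
}

def sample_candidates(grid, max_models, seed):
    """Draw up to `max_models` distinct random combinations from a discrete grid."""
    rng = random.Random(seed)
    total = 1
    for values in grid.values():
        total *= len(values)
    seen, candidates = set(), []
    while len(candidates) < min(max_models, total):
        combo = tuple(rng.choice(values) for values in grid.values())
        if combo not in seen:  # skip combinations that were already drawn
            seen.add(combo)
            candidates.append(dict(zip(grid.keys(), combo)))
    return candidates

candidates = sample_candidates(grid, max_models=5, seed=42)
for params in candidates:
    print(params)
```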

In [None]:
sorted_random_search = random_grid.get_grid(sort_by = 'auc',decreasing = True)
sorted_random_search.sorted_metric_table()

In [None]:
tuned_gbm = sorted_random_search.models[0]

In [None]:
tuned_gbm_per = tuned_gbm.model_performance(test)
print(tuned_gbm_per.auc())
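
As a reminder of what this number means: AUC can be read as the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. The sketch below computes it that way in plain Python on made-up labels and scores; `auc_score` is a hypothetical helper for illustration and is not how H2O computes the metric internally.

```python
# Illustration only: pairwise definition of AUC (ties count as half a win).
def auc_score(labels, scores):
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Made-up example: 1 = churn, 0 = no churn
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.1]
print(auc_score(labels, scores))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of churners from non-churners.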

In [None]:
tuned_gbm.explain(test)

In [None]:
tuned_gbm.explain_row(test, row_index=0)