# Customer Churn Prediction


Heba El-Shimy  
IBM **Cloud** Developer Advocate

-------------------

- Customers are considered one of the most important assets for a business

- In a competitive market, customers have numerous service providers to choose from, so they can easily switch services or even providers.

- Such customers are referred to as churned customers.<sup>[1](#first)</sup>

### Churned Customer

Customers or subscribers who stop using a company's service.

### Significance and Reasons of Customer Churn <sup>[2](#second), [3](#third)</sup>

Customer churn is more often due to **bad brand experiences** rather than bad products.

![customer-churn](../doc/source/images/82.png)

Each year, 

# $62 billion

is lost by U.S. companies following a bad customer experience

![cost](../doc/source/images/dollars.jpeg)

For every dollar invested in improving the customer experience, businesses see

# 3-5x

return

Finding new customers costs

# 5x

more than keeping them

Repeat customers spend

# 3x

more than new ones.

And just **20%** of them account for **80%** of a company’s future profits

By reducing churn by **5%**, businesses can increase profits by anywhere from

# 25% - 125%

![rise](../doc/source/images/rise.jpg)

# Pipeline

### 1. Loading Libraries

In [None]:
!pip install --upgrade pixiedust

In [None]:
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing, svm
from itertools import combinations
from sklearn.preprocessing import PolynomialFeatures, LabelEncoder, StandardScaler
import sklearn.feature_selection
from sklearn.model_selection import train_test_split
from collections import defaultdict
from sklearn import metrics
import pixiedust

### The Dataset

The dataset comes from a telecommunications company. It includes information about:  
- Customers who left within the last month – the column is called Churn

- Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies

- Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges

- Demographic info about customers – gender, age range, and if they have partners and dependents

Link for getting the dataset: [https://community.watsonanalytics.com/wp-content/uploads/2015/03/WA_Fn-UseC_-Telco-Customer-Churn.csv](https://community.watsonanalytics.com/wp-content/uploads/2015/03/WA_Fn-UseC_-Telco-Customer-Churn.csv)

Link for other datasets: [https://www.ibm.com/communities/analytics/watson-analytics-blog/guide-to-sample-datasets/](https://www.ibm.com/communities/analytics/watson-analytics-blog/guide-to-sample-datasets/)

### 2. Loading Our Dataset

Click on the cell below to highlight it, making sure your cursor is above the line:
`customer_data = pd.read_csv(body)`

Then go to the `Files` section to the right of this notebook and click `Insert to code` for the data you have uploaded. Choose `Insert pandas DataFrame`.

Make sure that the last line is:
`customer_data = pd.read_csv(body)`

In [None]:
# Place cursor below and insert the Pandas DataFrame for your uploaded data

customer_data = pd.read_csv(body)

In [None]:
# Checking that everything is correct
pd.set_option('display.max_columns', 30)
customer_data.head(10)

### 3. Get some info about our Dataset and whether we have missing values

In [None]:
# After running this cell we will see that we have no missing values
customer_data.info()

In [None]:
# Drop customerID column
customer_data = customer_data.drop('customerID', axis=1)
customer_data.head(5)

In [None]:
# Convert TotalCharges column to numeric as it is detected as object
new_col = pd.to_numeric(customer_data.iloc[:, 18], errors='coerce')
new_col

In [None]:
# Modify our dataframe to reflect the new datatype
# (assigning by column name replaces the object column with the numeric one)
customer_data['TotalCharges'] = new_col
customer_data
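
To see what `errors='coerce'` does, here is a minimal sketch on a made-up Series (the values are illustrative, not taken from the dataset). Entries that cannot be parsed as numbers, such as a blank string, become `NaN`, which is why missing values can show up only after the conversion.

```python
import pandas as pd

# Hypothetical values; the blank string stands in for an unparseable entry
raw = pd.Series(["29.85", " ", "1889.5"])

# Unparseable entries are coerced to NaN instead of raising an error
converted = pd.to_numeric(raw, errors="coerce")

print(converted.dtype)           # numeric dtype after conversion
print(converted.isnull().sum())  # number of entries that became NaN
```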

In [None]:
# Check if we have any NaN values
customer_data.isnull().values.any()

In [None]:
# Handle missing values
# Note: sklearn.preprocessing.Imputer was removed in scikit-learn 0.22; SimpleImputer replaces it
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy="mean")

# Replace NaN entries in TotalCharges with the column mean
total_charges = customer_data[['TotalCharges']].astype(float)
customer_data['TotalCharges'] = imp.fit_transform(total_charges).ravel()

In [None]:
# Check if we have any NaN values
customer_data.isnull().values.any()
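
Mean imputation can be illustrated on a toy column. This is only a sketch: the numbers are made up, and it assumes a scikit-learn version that provides `sklearn.impute.SimpleImputer`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column with one missing entry (illustrative values only)
col = np.array([[10.0], [np.nan], [30.0]])

# The mean is computed from the observed values (10 and 30) and fills the gap
imp_demo = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imp_demo.fit_transform(col)

print(filled.ravel())
```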

In [None]:
customer_data.info()

### 4. Descriptive analytics for our data

In [None]:
# Describe columns with numerical values
pd.set_option('display.precision', 3)
customer_data.describe()

In [None]:
# Describe columns with objects
customer_data.describe(exclude=np.number)

In [None]:
# Find correlations between the numerical columns
customer_data.select_dtypes(include=np.number).corr(method='pearson')

### 5. Visualize our Data to understand it better

#### Plot Relationships

In [None]:
# Using Pixiedust for visualization
display(customer_data)

In [None]:
# Plot Tenure Frequency count
sns.set(style="darkgrid")
sns.set_palette("hls", 3)
fig, ax = plt.subplots(figsize=(20,10))
ax = sns.countplot(x="tenure", hue="Churn", data=customer_data)

In [None]:
# Plot Contract Frequency count
sns.set(style="darkgrid")
sns.set_palette("hls", 3)
fig, ax = plt.subplots(figsize=(20,10))
ax = sns.countplot(x="Contract", hue="Churn", data=customer_data)

In [None]:
# Plot Tech Support Frequency count
sns.set(style="darkgrid")
sns.set_palette("hls", 3)
fig, ax = plt.subplots(figsize=(20,10))
ax = sns.countplot(x="TechSupport", hue="Churn", data=customer_data)

In [None]:
# Create Grid for pairwise relationships
gr = sns.PairGrid(customer_data, height=5, hue="Churn")  # `size` was renamed to `height` in seaborn 0.9
gr = gr.map_diag(plt.hist)
gr = gr.map_offdiag(plt.scatter)
gr = gr.add_legend()

#### Understand Data Distribution

In [None]:
# Set up plot size
fig, ax = plt.subplots(figsize=(6,6))

# Attribute distribution
a = sns.boxplot(orient="v", palette="hls", data=customer_data.iloc[:, 18], fliersize=14)

In [None]:
# Tenure data distribution
histogram = sns.distplot(customer_data.iloc[:, 4], hist=True)
plt.show()

In [None]:
# Monthly Charges data distribution
histogram = sns.distplot(customer_data.iloc[:, 17], hist=True)
plt.show()

In [None]:
# Total Charges data distribution
histogram = sns.distplot(customer_data.iloc[:, 18], hist=True)
plt.show()

### 6. Encode string values in data into numerical values

In [None]:
# Use pandas get_dummies
customer_data_encoded = pd.get_dummies(customer_data)
customer_data_encoded.head(10)
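
As a small illustration of what `pd.get_dummies` produces, the sketch below encodes a toy frame. The rows are made up; only the column names echo the dataset. Numeric columns pass through unchanged, and each string column is expanded into one indicator column per category.

```python
import pandas as pd

# Toy frame with one categorical and one numerical column (illustrative rows)
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
    "MonthlyCharges": [29.85, 56.95, 53.85],
})

# String columns become <column>_<category> indicator columns
encoded = pd.get_dummies(toy)

print(list(encoded.columns))
```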

### 7. Create Training Set and Labels 

In [None]:
# Create training data for non-preprocessed approach
X_npp = customer_data.iloc[:, :-1].apply(LabelEncoder().fit_transform)
pd.DataFrame(X_npp).head(5)

In [None]:
# Create training data that will undergo preprocessing (the last two columns are the encoded Churn labels)
X = customer_data_encoded.iloc[:, :-2]
X.head()

In [None]:
# Extract labels
y_unenc = customer_data['Churn']

In [None]:
# Convert strings of 'yes' and 'no' to binary values of 0 or 1
le = preprocessing.LabelEncoder()
le.fit(y_unenc)

y_le = le.transform(y_unenc)
pd.DataFrame(y_le)
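
The mapping that `LabelEncoder` learns can be inspected through `classes_`. A minimal sketch on hand-written labels (`le_demo` is a name introduced here for illustration):

```python
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()

# Classes are sorted, so 'No' maps to 0 and 'Yes' maps to 1
encoded_labels = le_demo.fit_transform(["No", "Yes", "No", "Yes"])

print(list(le_demo.classes_))
print(encoded_labels.tolist())

# inverse_transform recovers the original strings from the integer codes
print(le_demo.inverse_transform([1, 0]).tolist())
```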

### 8. Detect outliers in numerical values

In [None]:
# Calculate the Z-score using median value and median absolute deviation for more robust calculations
# Working on Monthly Charges column
threshold = 3

median = np.median(X['MonthlyCharges'])
median_absolute_deviation = np.median([np.abs(x - median) for x in X['MonthlyCharges']])
modified_z_scores = [0.6745 * (x - median) / median_absolute_deviation
                         for x in X['MonthlyCharges']]
results = np.abs(modified_z_scores) > threshold

print(np.any(results))
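
The modified z-score used above is $M_i = 0.6745\,(x_i - \tilde{x}) / \mathrm{MAD}$, where $\tilde{x}$ is the median and MAD is the median absolute deviation. The same calculation can be written with vectorized NumPy operations. The sketch below uses a hypothetical helper name and made-up values, and assumes the MAD is non-zero.

```python
import numpy as np

def modified_z_scores(values):
    """Hypothetical helper: modified z-scores based on the median and MAD."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))  # assumed non-zero here
    return 0.6745 * (values - median) / mad

# Made-up sample with one obvious outlier
sample = [10, 12, 11, 13, 12, 100]
scores = modified_z_scores(sample)

# Same threshold as in the cell above
flags = np.abs(scores) > 3
print(flags)
```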

In [None]:
# Do the same for Total Charges column but using the interquartile method

quartile_1, quartile_3 = np.percentile(X['TotalCharges'], [25, 75])
iqr = quartile_3 - quartile_1
lower_bound = quartile_1 - (iqr * 1.5)
upper_bound = quartile_3 + (iqr * 1.5)

print(np.where((X['TotalCharges'] > upper_bound) | (X['TotalCharges'] < lower_bound)))
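
The interquartile rule flags values more than 1.5 × IQR below the first quartile or above the third. A minimal sketch on made-up numbers, wrapped in a hypothetical helper:

```python
import numpy as np

def iqr_outlier_indexes(values):
    """Hypothetical helper: positions of values outside the 1.5 * IQR fences."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return np.where((values < lower) | (values > upper))[0]

# Made-up sample; only the last value lies far outside the fences
sample_iqr = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
outliers = iqr_outlier_indexes(sample_iqr)
print(outliers)
```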

### 9. Feature Engineering

In [None]:
# Find interactions between current features and append them to the dataframe
def add_interactions(dataset):
    # Get feature names
    comb = list(combinations(list(dataset.columns), 2))
    col_names = list(dataset.columns) + ['_'.join(x) for x in comb]
    
    # Find interactions
    poly = PolynomialFeatures(interaction_only=True, include_bias=False)
    dataset = poly.fit_transform(dataset)
    dataset = pd.DataFrame(dataset)
    dataset.columns = col_names
    
    # Remove interactions with 0 values
    no_inter_indexes = [i for i, x in enumerate(list((dataset ==0).all())) if x]
    dataset = dataset.drop(dataset.columns[no_inter_indexes], axis=1)
    
    return dataset

In [None]:
X_inter = add_interactions(X)
X_inter.head(15)
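
To make the interaction step concrete, the sketch below applies the same idea to a toy frame with made-up columns `x`, `d1` and `d2`. Because `d1` and `d2` are mutually exclusive indicators, their product is zero in every row, so that interaction column is dropped.

```python
from itertools import combinations

import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Toy frame: one numeric column and two mutually exclusive indicator columns
toy = pd.DataFrame({"x": [2.0, 3.0], "d1": [1.0, 0.0], "d2": [0.0, 1.0]})

# Pairwise names in the same order PolynomialFeatures emits the products
names = list(toy.columns) + ["_".join(pair) for pair in combinations(toy.columns, 2)]

poly = PolynomialFeatures(interaction_only=True, include_bias=False)
inter = pd.DataFrame(poly.fit_transform(toy), columns=names)

# Drop columns that are zero in every row (here: d1_d2)
inter = inter.loc[:, ~(inter == 0).all()]
print(list(inter.columns))
```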

In [None]:
# Select best features
select = sklearn.feature_selection.SelectKBest(k=25)
selected_features = select.fit(X_inter, y_le)
indexes = selected_features.get_support(indices=True)
col_names_selected = [X_inter.columns[i] for i in indexes]

X_selected = X_inter[col_names_selected]
X_selected.head(10)
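
`SelectKBest` scores each feature against the labels (with its default scoring function, the ANOVA F-test `f_classif`) and keeps the `k` highest-scoring ones. A minimal sketch on synthetic data, where only the last column is built to track the label:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest

rng = np.random.RandomState(0)

# Synthetic labels and features: three noise columns plus one informative column
y_demo = np.array([0, 1] * 50)
noise = rng.normal(size=(100, 3))
informative = y_demo + rng.normal(scale=0.1, size=100)
X_demo = np.column_stack([noise, informative])

# Keep the single best-scoring feature
selector = SelectKBest(k=1).fit(X_demo, y_demo)
selected = selector.get_support(indices=True)
print(selected)
```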

### 10. Split our dataset into train and test datasets

#### Split non-preprocessed data

In [None]:
X_train_npp, X_test_npp, y_train_npp, y_test_npp = train_test_split(X_npp, y_le,\
                                                    test_size=0.33, random_state=42)
print(X_train_npp.shape, y_train_npp.shape)
print(X_test_npp.shape, y_test_npp.shape)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_selected, y_le,\
                                                    test_size=0.33, random_state=42)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

In [None]:
X_test.head()

#### Trying to send data to the endpoint will return predictions with probabilities

### 11. Scale our data

In [None]:
# Use StandardScaler
scaler = preprocessing.StandardScaler().fit(X_train, y_train)
X_train_scaled = scaler.transform(X_train)

pd.DataFrame(X_train_scaled, columns=X_train.columns).head()
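
The scaler learns the mean and standard deviation from the training data only; the same statistics are later applied to the test data. A small sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature train and test sets
train_demo = np.array([[1.0], [2.0], [3.0]])
test_demo = np.array([[4.0]])

# Fit on the training data only
scaler_demo = StandardScaler().fit(train_demo)
train_demo_scaled = scaler_demo.transform(train_demo)

# The test data is scaled with the training mean and standard deviation
test_demo_scaled = scaler_demo.transform(test_demo)

print(train_demo_scaled.ravel())
print(test_demo_scaled.ravel())
```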

In [None]:
pd.DataFrame(y_train).head()

### 12. Start building a classifier

#### Support Vector Machines on non-preprocessed data

In [None]:
from sklearn.svm import SVC

# Run classifier
clf_svc_npp = svm.SVC(random_state=42)
clf_svc_npp.fit(X_train_npp, y_train_npp)

#### Support Vector Machines on preprocessed data

In [None]:
# Run classifier
clf_svc = svm.SVC(random_state=42)
clf_svc.fit(X_train_scaled, y_train)

#### Logistic Regression on preprocessed data

In [None]:
from sklearn.linear_model import LogisticRegression

clf_lr = LogisticRegression()
model = clf_lr.fit(X_train_scaled, y_train)
model

#### Multilayer Perceptron (Neural Network) on preprocessed data

In [None]:
from sklearn.neural_network import MLPClassifier

clf_mlp = MLPClassifier(verbose=0)
clf_mlp.fit(X_train_scaled, y_train)

# Note: as a neural network, the MLP can use data without the feature engineering step, since the network can learn feature interactions on its own

### 13. Evaluate our model

In [None]:
# Use the scaler fitted on the training data to scale our test data
X_test_scaled = scaler.transform(X_test)
pd.DataFrame(X_test_scaled, columns=X_train.columns).head()

#### Evaluate SVC on non-preprocessed data

In [None]:
# Predict confidence scores for data
y_score_svc_npp = clf_svc_npp.decision_function(X_test_npp)
pd.DataFrame(y_score_svc_npp)

In [None]:
# Get accuracy score
from sklearn.metrics import accuracy_score
y_pred_svc_npp = clf_svc_npp.predict(X_test_npp)
acc_svc_npp = accuracy_score(y_test_npp, y_pred_svc_npp)
print(acc_svc_npp)

In [None]:
# Get Precision vs. Recall score
from sklearn.metrics import average_precision_score
average_precision_svc_npp = average_precision_score(y_test_npp, y_score_svc_npp)

print('Average precision-recall score: {0:0.2f}'.format(
      average_precision_svc_npp))
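
`average_precision_score` takes continuous scores (such as the output of `decision_function` or `predict_proba`), not hard class predictions. It summarizes the precision-recall curve as a weighted mean of the precision at each threshold. A minimal sketch on four made-up examples:

```python
from sklearn.metrics import average_precision_score

# Made-up true labels and model scores
y_true_demo = [0, 0, 1, 1]
y_score_demo = [0.1, 0.4, 0.35, 0.8]

# Ranked by score: 0.8 (pos), 0.4 (neg), 0.35 (pos), 0.1 (neg)
# Precision is 1/1 at recall 0.5 and 2/3 at recall 1.0
ap_demo = average_precision_score(y_true_demo, y_score_demo)

print('Average precision-recall score: {0:0.2f}'.format(ap_demo))
```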

#### Evaluate SVC on preprocessed data

In [None]:
# Get model confidence of predictions
y_score_svc = clf_svc.decision_function(X_test_scaled)
y_score_svc

In [None]:
# Get accuracy score
y_pred_svc = clf_svc.predict(X_test_scaled)
acc_svc = accuracy_score(y_test, y_pred_svc)
print(acc_svc)

In [None]:
# Get Precision vs. Recall score
average_precision_svc = average_precision_score(y_test, y_score_svc)

print('Average precision-recall score: {0:0.2f}'.format(
      average_precision_svc))

#### Evaluate Logistic Regression on preprocessed data

In [None]:
y_score_lr = clf_lr.decision_function(X_test_scaled)
y_score_lr

In [None]:
y_pred_lr = clf_lr.predict(X_test_scaled)
acc_lr = accuracy_score(y_test, y_pred_lr)
print(acc_lr)

In [None]:
average_precision_lr = average_precision_score(y_test, y_score_lr)

print('Average precision-recall score: {0:0.2f}'.format(
      average_precision_lr))

#### Evaluate MLP on preprocessed data

In [None]:
y_score_mlp = clf_mlp.predict_proba(X_test_scaled)[:, 1]
y_score_mlp

In [None]:
y_pred_mlp = clf_mlp.predict(X_test_scaled)
acc_mlp = accuracy_score(y_test, y_pred_mlp)
print(acc_mlp)

In [None]:
average_precision_mlp = average_precision_score(y_test, y_score_mlp)

print('Average precision-recall score: {0:0.2f}'.format(
      average_precision_mlp))

### 14. ROC Curves and model comparison

In [None]:
# Plot ROC curves for all four models on one figure
plt.figure(0, figsize=(20,15)).clf()

fpr_svc_npp, tpr_svc_npp, thresh_svc_npp = metrics.roc_curve(y_test_npp, y_score_svc_npp)
auc_svc_npp = metrics.roc_auc_score(y_test_npp, y_score_svc_npp)
plt.plot(fpr_svc_npp, tpr_svc_npp, label="SVC Non-Processed, auc=" + str(auc_svc_npp))

fpr_svc, tpr_svc, thresh_svc = metrics.roc_curve(y_test, y_score_svc)
auc_svc = metrics.roc_auc_score(y_test, y_score_svc)
plt.plot(fpr_svc, tpr_svc, label="SVC Processed, auc=" + str(auc_svc))

fpr_mlp, tpr_mlp, thresh_mlp = metrics.roc_curve(y_test, y_score_mlp)
auc_mlp = metrics.roc_auc_score(y_test, y_score_mlp)
plt.plot(fpr_mlp, tpr_mlp, label="MLP, auc=" + str(auc_mlp))

fpr_lr, tpr_lr, thresh_lr = metrics.roc_curve(y_test, y_score_lr)
auc_lr = metrics.roc_auc_score(y_test, y_score_lr)
plt.plot(fpr_lr, tpr_lr, label="Logistic Regression, auc=" + str(auc_lr))

plt.legend(loc=0)
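
The AUC reported in each legend entry can be read as the fraction of (positive, negative) pairs that the model ranks correctly. A minimal sketch on four made-up examples:

```python
from sklearn import metrics

# Made-up true labels and model scores
y_true_demo = [0, 0, 1, 1]
y_score_demo = [0.1, 0.4, 0.35, 0.8]

# roc_curve returns one (fpr, tpr) point per retained threshold
fpr_demo, tpr_demo, thresh_demo = metrics.roc_curve(y_true_demo, y_score_demo)

# 3 of the 4 positive/negative pairs are ordered correctly
auc_demo = metrics.roc_auc_score(y_true_demo, y_score_demo)
print(auc_demo)
```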

#### Bonus: Sending the trained model to the cloud and scoring through a web app

In [None]:
# This cell contains Watson Machine Learning service credentials,
#  please replace the stars with your own credentials

credentials = {
    "url": "****",
    "access_key": "***",
    "username": "****",
    "password": "****",
    "instance_id": "*****"
}

In [None]:
# To work with the Watson Machine Learning REST API you must generate a Bearer access token

import urllib3, requests, json

headers = urllib3.util.make_headers(basic_auth='{}:{}'.format(credentials['username'], credentials['password']))
url = '{}/v3/identity/token'.format(credentials['url'])
response = requests.get(url, headers=headers)
ml_token = 'Bearer ' + json.loads(response.text).get('token')

In [None]:
# Create an online scoring endpoint

endpoint_instance = credentials['url'] + "/v3/wml_instances/" + credentials['instance_id']
header = {'Content-Type': 'application/json', 'Authorization': ml_token}

response_get_instance = requests.get(endpoint_instance, headers=header)
print(response_get_instance)
print(response_get_instance.text)

In [None]:
# Create API client

from watson_machine_learning_client import WatsonMachineLearningAPIClient

client = WatsonMachineLearningAPIClient(credentials)

In [None]:
# Publish model in Watson Machine Learning repository on Cloud

model_props = {client.repository.ModelMetaNames.AUTHOR_NAME: "Heba El-Shimy", 
               client.repository.ModelMetaNames.NAME: "Customer Churn Prediction Model"}

In [None]:
published_model = client.repository.store_model(model=model, meta_props=model_props, \
                                                training_data=X_train, training_target=y_train)

In [None]:
models_details = client.repository.list_models()

In [None]:
# Create model deployment

published_model_uid = client.repository.get_model_uid(published_model)
created_deployment = client.deployments.create(published_model_uid, "Deployment of Customer Churn Prediction Model")

In [None]:
# Get Scoring URL
scoring_endpoint = client.deployments.get_scoring_url(created_deployment)

print(scoring_endpoint)

In [None]:
# Get model details and expected input
model_details = client.repository.get_details(published_model_uid)
print(json.dumps(model_details, indent=2))

### Test the model

In [None]:
# Prepare the payload to be sent to the model
payload = {
    "fields": [
        "tenure",
        "OnlineSecurity_No",
        "TechSupport_No",
        "Contract_Month-to-month",
        "MonthlyCharges_OnlineSecurity_No",
        "MonthlyCharges_TechSupport_No",
        "MonthlyCharges_Contract_Month-to-month",
        "Dependents_No_OnlineSecurity_No",
        "Dependents_No_TechSupport_No",
        "Dependents_No_Contract_Month-to-month",
        "PhoneService_Yes_Contract_Month-to-month",
        "InternetService_Fiber optic_OnlineSecurity_No",
        "InternetService_Fiber optic_TechSupport_No",
        "InternetService_Fiber optic_Contract_Month-to-month",
        "InternetService_Fiber optic_PaymentMethod_Electronic check",
        "OnlineSecurity_No_OnlineBackup_No",
        "OnlineSecurity_No_TechSupport_No",
        "OnlineSecurity_No_Contract_Month-to-month",
        "OnlineSecurity_No_PaymentMethod_Electronic check",
        "OnlineBackup_No_Contract_Month-to-month",
        "DeviceProtection_No_Contract_Month-to-month",
        "TechSupport_No_Contract_Month-to-month",
        "TechSupport_No_PaymentMethod_Electronic check",
        "Contract_Month-to-month_PaperlessBilling_Yes",
        "Contract_Month-to-month_PaymentMethod_Electronic check"
 ],
    "values": [
        [20.0, 0.0, 1.0, 0.0, 60.55, 10.0, 15.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]	
 ]
}

In [None]:
# Send data to the model and print results
predictions = client.deployments.score(scoring_endpoint, payload)
print(json.dumps(predictions, indent=2))

#### Sending data to the model

New data (for example, collected from a web or mobile app) is sent in the format the model is expecting, as shown above.  
We get back a response with the predicted class (1: the customer with the sent data will churn)  
and the probabilities of both classes. Class 0 (No Churn) has a probability of 1.2567231699733838e-9, which is very small, while class 1 (Churn) has a probability of 0.9999999987432768, which means the model is confident in its prediction.

![postman](../doc/source/images/sample_output.png)
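
Rather than typing the payload by hand, the `fields`/`values` layout shown above can be built from a DataFrame. The sketch below uses a hypothetical helper, `build_payload`, and a made-up three-column row. It only constructs the dictionary and does not call the scoring service.

```python
import json

import pandas as pd

def build_payload(frame):
    """Hypothetical helper: pack DataFrame rows into the fields/values layout."""
    return {"fields": list(frame.columns), "values": frame.values.tolist()}

# Made-up row using three of the selected feature names
rows = pd.DataFrame({
    "tenure": [20.0],
    "OnlineSecurity_No": [0.0],
    "TechSupport_No": [1.0],
})

demo_payload = build_payload(rows)
print(json.dumps(demo_payload, indent=2))
```

In practice the frame would need to contain all selected features, in the same order the model was trained on.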

## References:

#### <a name="first" id="first"></a><sub>[1] https://www.sciencedirect.com/science/article/abs/pii/S0148296318301231 "Customer churn prediction in telecommunication industry using data certainty"</sub>  
#### <a name="second" id="second"></a><sub>[2] https://www.signal.co/blog/understanding-customer-churn/ "10 Stats Expose the Real Connection Between Customer Experience and Customer Churn"</sub>  
#### <a name="third" id="third"></a><sub>[3] https://www.pinterest.com/pin/456904324667676431/ "Mobile Telco Churn Infographic"</sub>  
#### <sub>[4] https://pandas.pydata.org/pandas-docs/stable/ "Pandas Documentation"</sub>  
#### <sub>[5] http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html "Scikit-Learn Imputer"</sub>  
#### <sub>[6] https://github.com/ibm-watson-data-lab/pixiedust/wiki/Tutorial:-Extending-the-PixieDust-Visualization "PixieDust Documentation"</sub>
#### <sub>[7] https://seaborn.pydata.org/ "Seaborn Documentation"</sub>
#### <sub>[8] http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder "Scikit-Learn LabelEncoder"</sub>
#### <sub>[9] http://colingorrie.github.io/outlier-detection.html "Outlier Detection Methods"</sub>
#### <sub>[10] http://scikit-learn.org/stable/auto_examples/linear_model/plot_polynomial_interpolation.html#sphx-glr-auto-examples-linear-model-plot-polynomial-interpolation-py "Scikit-Learn Polynomial"</sub>
#### <sub>[11] http://scikit-learn.org/stable/modules/feature_selection.html "Scikit-Learn Feature Selection"</sub>
#### <sub>[12] http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler "Scikit-Learn StandardScaler"</sub>
#### <sub>[13] http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC "Scikit-Learn SVC"</sub>
#### <sub>[14] http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression "Scikit-Learn Logistic Regression"</sub>
#### <sub>[15] http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html "Scikit-Learn MLP Classifier"</sub>
#### <sub>[16] http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score "Scikit-Learn Accuracy Score"</sub>
#### <sub>[17] http://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html#sklearn.metrics.average_precision_score "Scikit-Learn Average Precision Score"</sub>
#### <sub>[18] https://www.sciencedirect.com/science/article/pii/S016786550500303X "An introduction to ROC analysis"</sub>
#### <sub>[19] https://wml-api-pyclient.mybluemix.net/ "Watson Machine Learning Client Documentation"</sub>
#### <sub>[20] https://dataplatform.ibm.com/docs/content/analyze-data/ml-deploy-notebook.html?context=analytics "IBM Watson Studio Documentation-Deploy a model from a notebook"</sub>