In [1]:
# Churn Modelling Dataset 

In [3]:
# Environment setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

print("Setup complete")

Setup complete


In [5]:
file_path = '../datasets/churnmodeling.csv'
data = pd.read_csv(file_path)
data.tail()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
9995,9996,15606229,Obijiaku,771,France,Male,39,5,0.0,2,1,0,96270.64,0
9996,9997,15569892,Johnstone,516,France,Male,35,10,57369.61,1,1,1,101699.77,0
9997,9998,15584532,Liu,709,France,Female,36,7,0.0,1,0,1,42085.58,1
9998,9999,15682355,Sabbatini,772,Germany,Male,42,3,75075.31,2,1,0,92888.52,1
9999,10000,15628319,Walker,792,France,Female,28,4,130142.79,1,1,0,38190.78,0


# Questions 

In [6]:
# determine country spread 
# determine creditworthiness
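
A minimal sketch of how these two questions might be approached. To stay runnable on its own, it rebuilds the `Geography` and `CreditScore` values of the first ten rows of the dataset (the rows displayed by `data.head(10)`) in a small stand-in frame named `sample`; on the real data you would use `data` directly. The credit score band edges and labels are illustrative assumptions, not something defined by the dataset.

```python
import pandas as pd

# Stand-in for `data`: Geography and CreditScore of the first ten rows.
sample = pd.DataFrame({
    "Geography": ["France", "Spain", "France", "France", "Spain",
                  "Spain", "France", "Germany", "France", "France"],
    "CreditScore": [619, 608, 502, 699, 850, 645, 822, 376, 501, 684],
})

# Country spread: share of customers per country.
country_share = sample["Geography"].value_counts(normalize=True)
print(country_share)

# Creditworthiness: bucket the scores into bands.
# ASSUMPTION: these band edges and labels are illustrative only.
band_edges = [300, 580, 670, 740, 800, 850]
band_labels = ["poor", "fair", "good", "very good", "excellent"]
sample["CreditBand"] = pd.cut(
    sample["CreditScore"], bins=band_edges, labels=band_labels, include_lowest=True
)
band_counts = sample["CreditBand"].value_counts().sort_index()
print(band_counts)
```

Ten rows are far too few to draw conclusions from; the point is only the shape of the computation.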

## Customer churn is a problem 
Customer attrition, also known as customer churn, customer turnover, or customer defection, is the loss of clients or customers.

SaaS companies are relying on their customer success professionals to mitigate the problem of customer churn. This proactive approach to customer churn necessarily relies on the analysis of data. In fact, in our recent study, we found that two-thirds of companies with formal customer programs (including customer success programs) are leveraging data scientists to help them make sense of their data.

## Data Science
Data science is a method of extracting insights from data. The goal of a data scientist is to derive empirically-based insights that augment and enhance human decisions and algorithms. Rather than conceptualizing data science as a body of knowledge or facts, consider it as a way to think about the world. The data science approach brings three broad skills to bear on problems, including 1) subject matter expertise, 2) technology and programming skills and 3) statistics. In data-intensive projects, the application of these three skills helps you ask the right questions, access the right data and analyze the data to answer the questions, respectively.

## Descriptive and Predictive Analytics Power Insights
Broadly speaking, data scientists use two types of analytics to get value from their data: 1) descriptive and 2) predictive. The purpose of descriptive analytics is to summarize historical trends in the data that inform you about the current state of the world. The purpose of predictive analytics, on the other hand, is to predict future outcomes or to estimate data that we don’t have.

Equipped with these two broad types of analytics, data scientists are able to answer a variety of different types of questions. While data scientists can utilize descriptive analytics for simple questions, their primary value is seen when they answer questions that are predictive in nature. Here are five ways data scientists extract insights from data. Data scientists can:

1. identify the current state of affairs (means/percentages)
2. predict what will happen in the future (regression)
3. use algorithms to score people (classification)
4. identify naturally occurring groups (clustering)
5. identify treatments that are more/most effective (experimentation; A/B testing)
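
As a rough illustration of item 1 (means/percentages), the sketch below computes a churn rate overall and per country. It assumes that `Exited` = 1 marks a customer who left, and it uses a small stand-in frame named `sample`, built from the first ten rows of the dataset, so that it runs on its own.

```python
import pandas as pd

# Stand-in for `data`: Geography and Exited of the first ten rows.
sample = pd.DataFrame({
    "Geography": ["France", "Spain", "France", "France", "Spain",
                  "Spain", "France", "Germany", "France", "France"],
    "Exited": [1, 0, 1, 0, 0, 1, 0, 1, 0, 0],
})

# Exited is a 0/1 flag, so its mean is the share of customers who left.
churn_rate = sample["Exited"].mean()
print(f"Churn rate: {churn_rate:.1%}")

# The same percentage, broken down by country.
churn_by_country = sample.groupby("Geography")["Exited"].mean()
print(churn_by_country)
```

On the full dataset the same expression, `data["Exited"].mean()`, corresponds to the `mean` row of `data.describe()` for `Exited` (0.2037, roughly a 20% churn rate).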

## Questions to ask 
1. What is our current customer churn rate?
- Computing your churn rate is a good first step toward understanding the health of the customer relationship and what you need to do to improve (or maintain) that health. Calculating your current churn rate helps you understand if you have a churn problem. Additionally, you can look at historical trends to gauge whether your current churn rate has improved or worsened. Finally, to know if your churn management efforts are effective tomorrow, you need to know where your churn stands today.
2. What are the main causes of customer churn?
- To fix a customer churn problem, you first need to identify the underlying reasons why your customers are leaving you. Data scientists, using historical data, build models to predict customer churn. These statistical models help quantify the degree to which different factors are responsible for customer churn. These factors could include: 1) how customers use the product, 2) customers’ interactions with the company – via support, 3) customer sentiment and more.
3. Which specific customers are at risk for churning?
- Answering this question helps customer success leaders take a proactive approach to dealing with customer churn. Data scientists apply the model (question 2) to new customers to classify/identify customers who are likely to churn in the future. The model uses each customer’s information (e.g., product usage, interactions with the company, web visits) to classify each customer into one of two groups: 1) likely to stay and 2) likely to churn. Customer success leaders can use this information to reach out to at-risk customers to save those relationships.
4. How should we segment our customers?
- Traditionally, companies segment their customers based on demographic information, including geographical location, age and gender, to name a few. Now, data scientists can leverage existing customer data to create customer segments based on their underlying value to the company. By applying sophisticated analytics (e.g., cluster analysis) to the data, data scientists can group customers into smaller, homogeneous segments, each defined by specific characteristics (their values, product usage, social media, etc.) as they relate to their likelihood of churning. By identifying high-value customers and what makes them tick, customer success leaders can then target prospects that match their characteristics. The hope is that bringing on new customers who have the same characteristics as your most valued, current customers will lead to higher future retention rates.
5. Which marketing campaign is more effective at reducing churn?
- Marketing collateral can be used to attract new customers or save/incentivize existing relationships. After customer segments have been created or after the at-risk (of churning) customers have been identified, the next step is to proactively reach out to them to either attract specific types of customers or save existing relationships, respectively. Different marketing campaigns can be tested against each other to identify which one results in the lowest churn rates. Actual experiments (e.g., A/B tests) are conducted to test the efficacy of different campaigns.
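
For question 5, one common way to compare two campaigns is a two-proportion z-test on their churn rates. The sketch below is only one possible choice of test, not a method prescribed here; the function name `two_proportion_z_test` is hypothetical, and the campaign counts are made up purely for illustration. It assumes customers were randomly assigned to the campaigns and that both groups are large enough for the normal approximation.

```python
from math import erfc, sqrt


def two_proportion_z_test(churned_a, total_a, churned_b, total_b):
    """Return (z, two-sided p-value) for the difference in churn rates."""
    rate_a = churned_a / total_a
    rate_b = churned_b / total_b
    # Pooled churn rate under the null hypothesis that both campaigns perform equally.
    pooled = (churned_a + churned_b) / (total_a + total_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value


# HYPOTHETICAL counts, made up for illustration only.
z, p_value = two_proportion_z_test(
    churned_a=200, total_a=1000, churned_b=160, total_b=1000
)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the difference in churn rates between the two campaigns is unlikely to be due to chance alone; it says nothing about whether the difference is large enough to matter commercially.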

## Summary
Customer success leaders rely on the practice of data science to help answer important questions about their customers. These insights help determine how best to manage customers to reduce customer churn, from understanding the current state of customer churn to identifying the key reasons why customers leave.

While data scientists can investigate these problems manually, they often have to deal with many different types and sources of data. Instead of relying on data scientists, customer professionals are now turning to the power of machine learning in their customer data platforms that generate insights automatically. These machine learning-derived insights are then incorporated into existing marketing, sales and support workflows to help you retain, engage and grow customer relationships.

Although the current post focused on the problem of customer churn, similar questions can be asked about other customer metrics, including average revenue per customer, conversions (from free to paid) and recommendations.

In [6]:
data.head(10)

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0
5,6,15574012,Chu,645,Spain,Male,44,8,113755.78,2,1,0,149756.71,1
6,7,15592531,Bartlett,822,France,Male,50,7,0.0,2,1,1,10062.8,0
7,8,15656148,Obinna,376,Germany,Female,29,4,115046.74,4,1,0,119346.88,1
8,9,15792365,He,501,France,Male,44,4,142051.07,2,0,1,74940.5,0
9,10,15592389,H?,684,France,Male,27,2,134603.88,1,1,1,71725.73,0


In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB


In [8]:
data.shape

(10000, 14)

In [9]:
data.describe()

Unnamed: 0,RowNumber,CustomerId,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,5000.5,15690940.0,650.5288,38.9218,5.0128,76485.889288,1.5302,0.7055,0.5151,100090.239881,0.2037
std,2886.89568,71936.19,96.653299,10.487806,2.892174,62397.405202,0.581654,0.45584,0.499797,57510.492818,0.402769
min,1.0,15565700.0,350.0,18.0,0.0,0.0,1.0,0.0,0.0,11.58,0.0
25%,2500.75,15628530.0,584.0,32.0,3.0,0.0,1.0,0.0,0.0,51002.11,0.0
50%,5000.5,15690740.0,652.0,37.0,5.0,97198.54,1.0,1.0,1.0,100193.915,0.0
75%,7500.25,15753230.0,718.0,44.0,7.0,127644.24,2.0,1.0,1.0,149388.2475,0.0
max,10000.0,15815690.0,850.0,92.0,10.0,250898.09,4.0,1.0,1.0,199992.48,1.0


In [12]:
# From the summary statistics above, the mean customer age is about 39 years,
# with a standard deviation of about 10.5 years.

In [15]:
# Working on demographics
data.Age.unique()

array([42, 41, 39, 43, 44, 50, 29, 27, 31, 24, 34, 25, 35, 45, 58, 32, 38,
       46, 36, 33, 40, 51, 61, 49, 37, 19, 66, 56, 26, 21, 55, 75, 22, 30,
       28, 65, 48, 52, 57, 73, 47, 54, 72, 20, 67, 79, 62, 53, 80, 59, 68,
       23, 60, 70, 63, 64, 18, 82, 69, 74, 71, 76, 77, 88, 85, 84, 78, 81,
       92, 83], dtype=int64)

In [19]:
min_age = data.Age.min()
print("Minimum Age:", min_age)

Minimum Age: 18


In [21]:
max_age = data.Age.max()
print("Maximum Age:", max_age)

Maximum Age: 92
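
The separate minimum and maximum lookups can also be done in a single aggregation, and ages can be grouped into bands for later demographic breakdowns. This is a sketch on a stand-in series holding the ages of the first ten rows of the dataset, so it runs on its own; with the real data you would use `data.Age`. The age band edges and labels are illustrative assumptions (chosen to span the observed 18 to 92 range).

```python
import pandas as pd

# Stand-in for `data.Age`: ages of the first ten rows.
ages = pd.Series([42, 41, 42, 39, 43, 44, 50, 29, 44, 27], name="Age")

# One call instead of separate min() and max() cells.
age_stats = ages.agg(["min", "max", "mean"])
print(age_stats)

# ASSUMPTION: these band edges and labels are illustrative only.
age_group = pd.cut(
    ages,
    bins=[17, 30, 40, 50, 92],
    labels=["18-30", "31-40", "41-50", "51-92"],
)
print(age_group.value_counts().sort_index())
```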
