# Introduction to the Churn Business Problem

In this project, we will predict customer churn in the banking industry. Consider the following scenario:

- Bank XYZ has been observing a lot of customers closing their accounts or switching to competitor banks over the past couple of quarters.

- This has put a huge dent in quarterly revenues and might drastically affect annual revenues for the ongoing financial year, causing stocks to plunge and market cap to reduce by X%.

- Consequently, the leadership team has sprung into action by building a team of people from business, product, engineering and data science to arrest this slide. This means that some interventions will be put in place to reduce the number of customers churning.

- However, the organization doesn't have unlimited resources to apply these interventions to everyone, so the first step is to identify the customers at risk so that the interventions can be targeted.

- Hence, the question to the data science team is: **Can we build a model to predict with reasonable accuracy the customers who are going to churn in the near future?**

## Definitions

- **Churn**: a consumer is said to have churned in our scenario if they have closed all of their active accounts with the bank.

- However, keep in mind that churn can be characterized in a variety of ways depending on the situation and what is most appropriate for the organization. For example, in some cases a customer who has not transacted for 90 days/6 months/1 year can be said to have churned.
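
As a rough illustration of the inactivity-based definition, the sketch below flags customers whose last transaction is more than 90 days old. The transaction log, its column names (`customer_id`, `txn_date`), the snapshot date and the 90-day threshold are all assumptions made for this example; they are not part of the dataset used later.

```python
import pandas as pd

# Hypothetical transaction log: one row per customer transaction.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "txn_date": pd.to_datetime(
        ["2023-01-05", "2023-06-20", "2023-02-10", "2023-06-01"]
    ),
})

snapshot_date = pd.Timestamp("2023-06-30")  # assumed "as of" date
inactivity_days = 90                        # assumed business threshold

# Most recent transaction per customer, then days elapsed since it.
last_txn = transactions.groupby("customer_id")["txn_date"].max()
days_inactive = (snapshot_date - last_txn).dt.days

# A customer counts as churned under this definition if inactive too long.
churned = days_inactive > inactivity_days
print(churned)
```

Changing `inactivity_days` is all it takes to switch between the 90-day, 6-month and 1-year variants of the definition.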

## Data Science Workflow

Note that when solving this kind of problem in a real-world setting within an organisation, apart from the general modelling task, a data science team would also have to collaborate with:
- (Business or Product) teams to define the (problem statement, metrics, etc)
- (Engineering) teams to get the (data)
- (DevOps) teams to monitor the (model) when launched to production

### With (Business and Product) teams
1. **Defining the business goal** => Arresting the slide in revenues caused by loss of active bank customers
2. **Identifying the data source** => transactional systems as event-based logs. This data can be stored in data warehouses (MySQL DBs, AWS Redshift), Data Lakes, NoSQL DBs, etc
3. **Perform auditing for data quality**, including aspects such as:
    - Deleting duplicate events/transactions (deduplication)
    - Handling absence of data for chunks of time in between (handling missing values)
    - Obscuring PII (personally identifiable information). Data science problems often require some private customer features, and leaving them unobscured can lead to privacy issues.
4. **Defining metrics**. There are two types of metrics:
    - Business metrics => responsibility of both (business and data science) teams to combine their opinions to decide on the relevant metrics. In our case, business metrics could be:
        - churn rate, which can be tracked over time (at a weekly, monthly or quarterly level)
            (we want this metric to decrease)
        - trend of average number of products per customer tracked over time
            (we want this metric to increase)
        - percentage of dormant customers tracked over time
            (we want this metric to decrease)
        - other such descriptive metrics tracked over time
    - Data-related metrics => responsibility of the data science team to define the relevant metrics
         - Recall = TP/(TP + FN)
         - Precision = TP/(TP + FP)
         - F1-Score = Harmonic mean of Recall and Precision
         - (where: TP=True Positive, FP=False Positive, FN=False Negative, TN=True Negative)
         - (we're not using Accuracy because we will most likely have an imbalanced dataset)
5. **Decide on prediction model output format** => Since this isn't going to be an online model, it doesn't require deployment. Instead, periodic (ex: monthly) model runs could output a list of customers with their propensity to churn, to be shared with business (Sales/Marketing/Product) teams
    - Note: it's important to decide right at the beginning on the output format that the data science team will hand to the sales/marketing team, so that they can take the relevant interventions
6. **Decide the actions to be taken based on model's output/insights**. Based on the output from the data science team, which would be the list of customers with a high propensity to churn in the near future, various business interventions can be made to keep those customers from churning, for example:
    - Customer-centric bank offers
    - Getting in touch with customers to address any grievances
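
The data-quality audit in step 3 can be sketched roughly as follows. This is only an illustration on a made-up event table: the column names (`event_id`, `customer_email`, `amount`) and the salt are hypothetical, and salted hashing is just one simple way to pseudonymize a field, not a complete privacy solution.

```python
import hashlib

import pandas as pd

# Hypothetical event log with one duplicated event and one missing amount.
events = pd.DataFrame({
    "event_id": [101, 101, 102, 103],
    "customer_email": ["a@example.com", "a@example.com",
                       "b@example.com", "a@example.com"],
    "amount": [50.0, 50.0, None, 20.0],
})

# 1. Deduplication: keep the first row for each event_id.
deduped = events.drop_duplicates(subset="event_id").reset_index(drop=True)

# 2. Missing values: count them per column before deciding how to handle them.
missing_per_column = deduped.isna().sum()

# 3. PII: replace the raw email with a salted SHA-256 digest so that rows
#    can still be grouped per customer without exposing the address.
SALT = "replace-with-a-secret-salt"  # assumption: managed outside the notebook


def pseudonymize(value: str) -> str:
    """Return a hex digest that stands in for the raw value."""
    return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()


deduped["customer_key"] = deduped["customer_email"].map(pseudonymize)
deduped = deduped.drop(columns=["customer_email"])

print(missing_per_column)
print(deduped)
```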
    
    
(PUT IN THE DIAGRAM OF THE WORKFLOW BTW BUSINESS/DS/DEVOPS/ENGINEERING)

### Data-related metrics: Intuition behind TP/FP/TN/FN

Fun explanation:
- True Positive (TP): Reality: A wolf threatened. Shepherd said: "Wolf." Outcome: Shepherd is a hero.
- True Negative (TN): Reality: No wolf threatened. Shepherd said: "No wolf." Outcome: Everyone is fine.
- False Positive (FP): Reality: No wolf threatened. Shepherd said: "Wolf." Outcome: Villagers are angry at shepherd for waking them up.
- False Negative (FN): Reality: A wolf threatened. Shepherd said: "No wolf." Outcome: The wolf ate all the sheep.

Concrete example: say we have a hot-news classifier:
- True Positive (TP): Reality: a piece of hot news. Classifier predicts: hot.
- True Negative (TN): Reality: not a piece of hot news. Classifier predicts: not hot.
- False Positive (FP): Reality: not a piece of hot news. Classifier predicts: hot.
- False Negative (FN): Reality: a piece of hot news. Classifier predicts: not hot.
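
To make the four outcomes concrete, here is a small sketch that counts them from a list of true labels and a list of predictions. The label vectors are made up for illustration, with 1 standing for the positive class (hot news, or churn in our case).

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels where 1 is the positive class."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1  # positive in reality, predicted positive
        elif actual == 0 and predicted == 1:
            fp += 1  # negative in reality, predicted positive
        elif actual == 1 and predicted == 0:
            fn += 1  # positive in reality, predicted negative
        else:
            tn += 1  # negative in reality, predicted negative
    return tp, fp, fn, tn


# Made-up labels and predictions for eight samples.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)  # 3 1 1 3
```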

## Part 1: Setting up the target/goal for the metrics

#### Data-related metrics:
- **Recall** = TP/(TP + FN) => out of all the samples in the positive class, how many did we predict correctly?
- (so, out of all the customers who actually go on to churn, how many did we identify correctly?)
- (if we could identify 50% of them correctly, then recall would be 50%)
- **Precision** = TP/(TP + FP) => out of all the positive predictions we have made, how many of them were correct?
- (if we predict 100 customers as likely to churn, we need to check how many of them actually churn)
- (if only 30 of them actually churn, our precision is 30/100 => 30% - we predicted 100 customers to churn but only 30 of them actually churn)
- **F1-Score** = Harmonic mean of Recall and Precision
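
As a minimal sketch, the three formulas can be written as one small function. The counts below are made up to match the examples above: 100 customers flagged as likely to churn, 30 of whom actually churn, plus an assumed 30 churners that were missed (so recall is 50%).

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion counts.

    Returns 0.0 for a metric whose denominator would be zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# 100 flagged customers: 30 real churners (TP), 70 false alarms (FP).
# Assumption for illustration: 30 more churners were not flagged (FN).
precision, recall, f1 = precision_recall_f1(tp=30, fp=70, fn=30)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.300 recall=0.500 f1=0.375
```

Note how the F1-Score (0.375) sits closer to the lower of the two values, which is the point of using the harmonic mean rather than the arithmetic mean.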


Although we don't know the maximum/minimum we can get on this dataset without exploring the data samples, we can set a rough conservative estimate. A good approach to setting these targets is:

First, find minimum and maximum values (create a range for these)
- To find the minimum value, suppose we predict every row as 1 (churn). Recall would then be 100%, but precision would equal the share of positive samples in the data. For example, if 20% of customers in the dataset have actually churned, then precision would be 20%. The F1-Score, which is the harmonic mean of Recall and Precision, would be about 33% (2 × 0.2 × 1.0 / 1.2). So, not a great score at all.
- The maximum value would ideally be 100%, but we know that is not realistically achievable.

So, a conservative estimate would be around 70%.
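
The lower end of this range can be checked numerically. The sketch below computes the scores of a naive "predict everyone as churn" baseline for a given churn rate; the 20% figure is the illustrative value used above, not a measured one.

```python
def all_positive_baseline(churn_rate):
    """Scores obtained by predicting churn (1) for every customer."""
    precision = churn_rate  # only the true churners are correct predictions
    recall = 1.0            # every true churner is flagged
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = all_positive_baseline(0.20)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.3f}")
# precision=0.20 recall=1.00 f1=0.333
```

Any model worth shipping has to beat this baseline F1 of roughly one third, which is why a target well above it is a sensible starting point.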


#### Business metrics:

Actual values/thresholds for business metrics usually come from the leadership team. So, we should try and achieve the given target values. But, at the same time we should ensure that that value/threshold isn't something improbable.

For example:
- if we take the recall target to be 70%, we correctly identify 70% of the customers who are going to churn in the near future
- we can expect business interventions (offers, getting in touch with customers) to save, say, 50% of those identified customers from churning
- which means roughly a 35% improvement in churn rate (0.70 × 0.50 = 0.35 of would-be churners retained)
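
The arithmetic behind this estimate can be written out as a short sketch. The baseline churn rate of 20% is an assumption for illustration, and the calculation ignores the cost of interventions spent on false positives.

```python
baseline_churn_rate = 0.20  # assumed share of customers who would churn
recall = 0.70               # share of future churners the model identifies
save_rate = 0.50            # assumed share of identified churners retained

# Fraction of all would-be churners that end up being retained.
share_of_churners_retained = recall * save_rate

# Churn rate after the interventions, under the assumptions above.
new_churn_rate = baseline_churn_rate * (1 - share_of_churners_retained)

print(f"churners retained: {share_of_churners_retained:.0%}")
print(f"churn rate: {baseline_churn_rate:.0%} -> {new_churn_rate:.0%}")
# churners retained: 35%
# churn rate: 20% -> 13%
```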

## Part 2: Exploring the dataset

Now that we understand the problem statement, have decided on the metrics and have set their target values/thresholds, let's dive directly into the code to explore the dataset.

#### Importing libraries

In [3]:
%matplotlib inline

# Importing the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#### Quick setup of settings

In [6]:
# Ensure that multiple outputs get displayed in the same cell (just from a convenience perspective)
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Ignore all warnings
import warnings
warnings.filterwarnings("ignore")
warnings.filterwarnings(action="ignore", category=DeprecationWarning)

In [8]:
# Display all rows and columns of a dataframe instead of a truncated version
from IPython.display import display
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

#### Exploring the dataset

In [11]:
# Reading the dataset
df = pd.read_csv("../data/Churn_Modelling.csv")

In our case, we have our dataset stored in a CSV file.

In some circumstances, the dataset may be stored in cloud object storage (e.g. AWS S3) or retrieved using SQL queries against a database (e.g. AWS Redshift). This is because the data is often too large to keep as a local copy, so we read directly from a connected data store instead.
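
As a hedged sketch of the database route, the example below loads query results straight into a DataFrame with `pd.read_sql_query`. An in-memory SQLite database stands in for a real warehouse such as Redshift, and the table name `customers` and its contents are made up for illustration; a real setup would use the connection details of the actual database.

```python
import sqlite3

import pandas as pd

# Stand-in for a warehouse connection (assumption: a real project would
# connect to its own database instead of an in-memory SQLite instance).
conn = sqlite3.connect(":memory:")

# Create a tiny hypothetical table so the query has something to read.
pd.DataFrame({
    "CustomerId": [1, 2, 3],
    "Balance": [0.0, 1500.5, 320.0],
    "Exited": [1, 0, 0],
}).to_sql("customers", conn, index=False)

# Pull only the rows and columns needed, using a parameterized query.
query = "SELECT CustomerId, Balance, Exited FROM customers WHERE Balance > ?"
df_sql = pd.read_sql_query(query, conn, params=(0,))
conn.close()

print(df_sql)
```

Filtering in the query rather than after loading keeps the amount of data transferred small, which matters when the source table is large.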

In [12]:
df.shape

# (10000, 14)
# There're 10,000 rows and 14 columns

(10000, 14)

In [13]:
# Let's look at a sample of the first 10 rows of this dataset
df.head(10).T

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| RowNumber | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| CustomerId | 15634602 | 15647311 | 15619304 | 15701354 | 15737888 | 15574012 | 15592531 | 15656148 | 15792365 | 15592389 |
| Surname | Hargrave | Hill | Onio | Boni | Mitchell | Chu | Bartlett | Obinna | He | H? |
| CreditScore | 619 | 608 | 502 | 699 | 850 | 645 | 822 | 376 | 501 | 684 |
| Geography | France | Spain | France | France | Spain | Spain | France | Germany | France | France |
| Gender | Female | Female | Female | Female | Female | Male | Male | Female | Male | Male |
| Age | 42 | 41 | 42 | 39 | 43 | 44 | 50 | 29 | 44 | 27 |
| Tenure | 2 | 1 | 8 | 1 | 2 | 8 | 7 | 4 | 4 | 2 |
| Balance | 0.0 | 83807.86 | 159660.8 | 0.0 | 125510.82 | 113755.78 | 0.0 | 115046.74 | 142051.07 | 134603.88 |
| NumOfProducts | 1 | 1 | 3 | 2 | 1 | 2 | 2 | 4 | 2 | 1 |


#### Features (independent variables) (Xs)
- **CustomerId** => is a _unique identifier_ of each of our customers


- **Surname** => is a _categorical feature_, which indicates the surname of the customer


- **CreditScore** => is a _numerical feature_, which generally lies between 300 and 900


- **Geography** => is a _categorical feature_, which indicates where the bank of the customer is based (as this bank is a multinational company (MNC) with multiple offices around the world)


- **Gender** => is a _categorical feature_, which can either be (Male or Female)


- **Age** => is a _numerical feature_, which can range from 18 (the minimum eligibility age to open a bank account) to 100


- **Tenure** => is a _numerical feature_, which indicates the number of years the customer has been associated with the bank


- **Balance** => is a _numerical feature_, which indicates the amount of money in the customer's bank account (just before the account was closed, if the customer closed it, or at this particular time otherwise)


- **NumOfProducts** => is a _numerical feature_, which indicates the number of products the customer has used from our bank (as our bank has various products (ex: Savings Bank Account, Deposit, Mortgages, Loans))


- **HasCrCard** => is a _categorical feature_ which indicates whether the customer has a credit card or not


- **IsActiveMember** => is a _categorical feature_ which indicates whether the customer is an active member or not; this is probably based on some tracked threshold of activity


- **EstimatedSalary** => is a _numerical feature_ which indicates the salary of our customer

#### Target (dependent variable) (y)
- **Exited** => is a _categorical variable_ which indicates whether or not the customer churned
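
A minimal sketch of how this feature/target split might look in code is shown below. It uses a three-row stand-in frame with a subset of the columns; the `Exited` values are made-up labels for illustration. Dropping `RowNumber` and `CustomerId` as non-predictive identifiers is an assumption of this sketch, not a step taken in the notebook so far.

```python
import pandas as pd

# Small stand-in for the full dataset (Exited values are illustrative).
sample = pd.DataFrame({
    "RowNumber": [1, 2, 3],
    "CustomerId": [15634602, 15647311, 15619304],
    "Surname": ["Hargrave", "Hill", "Onio"],
    "CreditScore": [619, 608, 502],
    "Geography": ["France", "Spain", "France"],
    "Exited": [1, 0, 1],
})

target_column = "Exited"
id_columns = ["RowNumber", "CustomerId"]  # identifiers, not real features

# Features (X): everything except the identifiers and the target.
X = sample.drop(columns=id_columns + [target_column])
# Target (y): the churn flag.
y = sample[target_column]

print(X.columns.tolist())  # ['Surname', 'CreditScore', 'Geography']
print(y.tolist())          # [1, 0, 1]
```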

## Part 3: Performing EDA (Exploratory Data Analysis)

Now, we will analyse statistical information about each of our variables.
We must split our dataframe into categorical and numeric columns, which can be done with the following method:
`df.select_dtypes()` => returns a subset of the DataFrame’s columns based on the column dtypes

Calculating statistical data of our numeric variables

In [16]:
df.select_dtypes(exclude=["object"]).describe()

| | RowNumber | CustomerId | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 5000.5 | 15690940.0 | 650.5288 | 38.9218 | 5.0128 | 76485.889288 | 1.5302 | 0.7055 | 0.5151 | 100090.239881 | 0.2037 |
| std | 2886.89568 | 71936.19 | 96.653299 | 10.487806 | 2.892174 | 62397.405202 | 0.581654 | 0.45584 | 0.499797 | 57510.492818 | 0.402769 |
| min | 1.0 | 15565700.0 | 350.0 | 18.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 11.58 | 0.0 |
| 25% | 2500.75 | 15628530.0 | 584.0 | 32.0 | 3.0 | 0.0 | 1.0 | 0.0 | 0.0 | 51002.11 | 0.0 |
| 50% | 5000.5 | 15690740.0 | 652.0 | 37.0 | 5.0 | 97198.54 | 1.0 | 1.0 | 1.0 | 100193.915 | 0.0 |
| 75% | 7500.25 | 15753230.0 | 718.0 | 44.0 | 7.0 | 127644.24 | 2.0 | 1.0 | 1.0 | 149388.2475 | 0.0 |
| max | 10000.0 | 15815690.0 | 850.0 | 92.0 | 10.0 | 250898.09 | 4.0 | 1.0 | 1.0 | 199992.48 | 1.0 |


Calculating statistical data of our categorical variables

In [15]:
df.select_dtypes(include=["object"]).describe()

| | Surname | Geography | Gender |
|---|---|---|---|
| count | 10000 | 10000 | 10000 |
| unique | 2932 | 3 | 2 |
| top | Smith | France | Male |
| freq | 32 | 5014 | 5457 |


#### Insights

1. Since the count values for all the columns match the number of rows in our dataset (10,000), we have no missing values in our dataset: `count` ignores missing values and does not count them.
2. By observing (mean, std, 25%, 50%, 75%) we can get an idea of the distribution of each of our individual variables.
   - ex: the 25th percentile of Balance is 0, so at least 25% of the customers have a zero balance => it's an important insight to have
   - ex: mean of HasCrCard is 0.7055, so about 71% of the customers have a credit card
   - ex: mean of IsActiveMember is 0.5151, so about 52% of the customers are active
   - ex: mean of Exited (target variable) is 0.2037 and q1=q2=q3=0, so customers have churned in only about 20% of the cases
3. By observing (min, max) we can get an idea of the range of each of our individual variables.
    - ex: age varies from 18 to 92
    - ex: tenure varies from 0 to 10
    - ex: number of products varies from 1 to 4
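
The same insights can also be checked directly rather than read off the `describe()` tables. The sketch below runs the checks on a five-row stand-in frame with made-up values; on the real data, the same expressions would be applied to `df`.

```python
import pandas as pd

# Tiny stand-in frame (values are illustrative, not the real dataset).
toy = pd.DataFrame({
    "Balance": [0.0, 83807.86, 159660.80, 0.0, 125510.82],
    "HasCrCard": [1, 0, 1, 1, 0],
    "Exited": [1, 0, 0, 0, 0],
})

# Insight 1: explicit missing-value count per column.
missing_counts = toy.isna().sum()

# Insight 2: shares read from binary columns and from a zero-balance check.
share_zero_balance = (toy["Balance"] == 0).mean()
share_with_card = toy["HasCrCard"].mean()

# Class balance of the target: fraction of rows per class.
class_balance = toy["Exited"].value_counts(normalize=True)

print(missing_counts)
print(f"zero balance: {share_zero_balance:.0%}, has card: {share_with_card:.0%}")
print(class_balance)
```

Checking the class balance explicitly is useful here because it confirms the imbalance that motivated using Recall, Precision and F1-Score instead of Accuracy.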