# Churn Prediction Project

<p align="left">
  <img src="https://uruit.com/blog/wp-content/uploads/2020/11/Churn1-1024x724.jpg" width="600">
</p>

**Churn** is a phenomenon where customers stop using the services of a company. Therefore, churn prediction involves identifying customers who are most likely to terminate their contracts in the near future. If a company can do this, it can offer discounts or special deals on its services in order to retain those customers.

Of course, we can apply machine learning to this problem: using historical data about customers who have already left and building a model to identify current customers who are likely to leave. This is a **binary classification** task. The target variable we want to predict is categorical and has only two possible outcomes: **will leave** or **will not leave**.


## Project Context and Goals

A telecommunications company has a problem: some of its customers are churning and switching to competitors.  
Our aim is to develop a system that identifies such customers so that we can offer them incentives to stay.  
  
We want to target these customers with our marketing messages and provide discounts. We would also like to understand why the model believes that certain customers are about to leave, and for that we need to be able to interpret its predictions.

We have collected a dataset that contains certain information about our customers: which services they used, how much they paid, and how long they stayed with us. We also know which customers terminated their contracts and stopped using our services (as a result of churn). We will use this information as the target variable in a machine learning model and predict it using all the other available information.

## Dataset


According to the description, the dataset contains the following information:

- **Customer services** — telephone service; multiple lines; Internet; technical support; and additional services such as online security, backup, device protection, and streaming TV;

- **Account information** — how long the customer has been with the company, contract type, and payment method;

- **Charges** — how much the customer paid for the last month and in total;

- **Demographic information** — gender, age, whether the customer has dependents or a partner;

- **Churn** — yes/no, whether the customer left the company during the last month.


## Packages

In [2]:
import numpy as np
import pandas as pd

import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split

pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)


<a name = '1'></a>
## 1 - Preliminary Data Exploration and Modifications 


In [3]:
df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
display(df.head())
print(f"{len(df)} rows and {df.shape[1]} columns")

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | No | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


7043 rows and 21 columns


#### Column definitions

- **CustomerID** — customer identifier;

- **Gender** — male/female;

- **SeniorCitizen** — whether the customer is a senior citizen (0/1);

- **Partner** — whether the customer lives with a partner (yes/no);

- **Dependents** — whether the customer has dependents (yes/no);

- **Tenure** — number of months since the contract started;

- **PhoneService** — whether the customer has phone service (yes/no);

- **MultipleLines** — whether the customer has multiple phone lines (yes/no/no phone service);

- **InternetService** — type of internet service (no/DSL/fiber optic);

- **OnlineSecurity** — whether online security is enabled (yes/no/no internet);

- **OnlineBackup** — whether online backup service is enabled (yes/no/no internet);

- **DeviceProtection** — whether device protection service is enabled (yes/no/no internet);

- **TechSupport** — whether the customer has technical support (yes/no/no internet);

- **StreamingTV** — whether streaming TV service is enabled (yes/no/no internet);

- **StreamingMovies** — whether streaming movie service is enabled (yes/no/no internet);

- **Contract** — type of contract (month-to-month/one year/two year);

- **PaperlessBilling** — whether paperless billing is enabled (yes/no);

- **PaymentMethod** — payment method (electronic check, mailed check, bank transfer, credit card);

- **MonthlyCharges** — monthly amount charged (numeric);

- **TotalCharges** — total amount charged (numeric);

- **Churn** — whether the customer terminated the contract (yes/no).


In [4]:
df.dtypes

customerID           object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object

In [5]:
# TotalCharges was read as object: convert it to numeric and list the rows that fail to parse
total_charges = pd.to_numeric(df.TotalCharges, errors = 'coerce')
df[total_charges.isnull()][['customerID','TotalCharges']]

| | customerID | TotalCharges |
|---|---|---|
| 488 | 4472-LVYGI | |
| 753 | 3115-CZMZD | |
| 936 | 5709-LVOEQ | |
| 1082 | 4367-NUYAO | |
| 1340 | 1371-DWPAZ | |
| 3331 | 7644-OMVMY | |
| 3826 | 3213-VVOLG | |
| 4380 | 2520-SGTTA | |
| 5218 | 2923-ARZLG | |
| 6670 | 4075-WKNIU | |


`TotalCharges` was read as an object column because some rows contain a blank string (a space) instead of a number. With `errors='coerce'`, `pd.to_numeric` turns such values into `NaN`, which is why these rows show up as missing above.

We will fill them with zeroes.

In [6]:
df.TotalCharges = pd.to_numeric(df.TotalCharges, errors='coerce')
df.TotalCharges = df.TotalCharges.fillna(0)
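
To see what `errors='coerce'` does in isolation, here is a minimal sketch on a toy Series. The values are made up for illustration and only mimic the shape of the `TotalCharges` column.

```python
import pandas as pd

# Toy column: numbers stored as strings, with one blank entry (a single space)
raw = pd.Series(["29.85", "1889.5", " ", "108.15"])

# errors='coerce' converts anything that cannot be parsed into NaN
parsed = pd.to_numeric(raw, errors="coerce")
print(parsed.isnull().sum())   # how many values could not be parsed

# Replace the missing values with zero, mirroring the cell above
filled = parsed.fillna(0)
print(filled.tolist())
```

Filling with zero is only one possible choice; it keeps the rows in the dataset at the cost of treating an unknown total as "nothing charged yet".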


Next, let's normalize the column names and the string values: we lowercase everything and replace spaces with underscores.

In [7]:
# Normalize column names: lowercase, spaces replaced with underscores
df.columns = df.columns.str.lower().str.replace(' ', '_')

# Apply the same normalization to the values of every string column
string_columns = df.select_dtypes(include='object').columns
df[string_columns] = df[string_columns].apply(
    lambda s: s.str.lower().str.replace(' ', '_')
)


Next, let us turn to our target variable: **churn**. At the moment, it is categorical and takes two values: **yes** and **no**.

In the case of binary classification, most models usually expect numerical values: **0** for *no* and **1** for *yes*. Therefore, we will convert these values into numbers.


In [8]:
df.churn = (df.churn == 'yes').astype(int)

Now we split the full data frame into `df_train_full` (which will be split further into training and validation sets) and `df_test`.

In [9]:
from sklearn.model_selection import train_test_split
df_train_full, df_test = train_test_split(df, test_size=0.2, random_state=1)

In [10]:
df_train, df_val = train_test_split(df_train_full, test_size=0.33,
     random_state=11)
y_train = df_train.churn.values 
y_val = df_val.churn.values
del df_train['churn'] 
del df_val['churn']
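
Because the second split is applied to what remains after the first one, the final proportions are not simply 0.2 and 0.33. A short sketch of the arithmetic, using only the two `test_size` values from the cells above:

```python
# Fractions used in the two train_test_split calls
test_frac = 0.2            # share of the full data held out for testing
val_frac_of_rest = 0.33    # share of the remaining data used for validation

rest = 1 - test_frac
val_frac = rest * val_frac_of_rest
train_frac = rest * (1 - val_frac_of_rest)

print(f"train: {train_frac:.1%}, validation: {val_frac:.1%}, test: {test_frac:.1%}")
```

So roughly half of the data is used for training, about a quarter for validation, and a fifth for testing.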

<a name = '2'></a>
## 2 - EDA

In [11]:
df_train_full.isnull().sum()

customerid          0
gender              0
seniorcitizen       0
partner             0
dependents          0
tenure              0
phoneservice        0
multiplelines       0
internetservice     0
onlinesecurity      0
onlinebackup        0
deviceprotection    0
techsupport         0
streamingtv         0
streamingmovies     0
contract            0
paperlessbilling    0
paymentmethod       0
monthlycharges      0
totalcharges        0
churn               0
dtype: int64

In [12]:
df_train_full.churn.value_counts()

0    4113
1    1521
Name: churn, dtype: int64

In [13]:
round(df_train_full.churn.mean(),3)

0.27

Our churn dataset is an example of a so-called **imbalanced dataset**. There are about 2.7 times as many customers who did **not** churn (4113) as customers who did (1521), so the **non-churn** class dominates the **churn** class.

The churn rate confirms this: only **0.27** of the customers in our data churned, well below the 0.5 we would see with perfectly balanced classes.
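
One practical consequence of the imbalance can be sketched with the class counts printed above. The "always predict no churn" baseline is an illustration added here, not a model used in this project.

```python
# Class counts copied from the value_counts() output above
n_stay, n_churn = 4113, 1521
total = n_stay + n_churn

churn_rate = n_churn / total          # share of churned customers
imbalance_ratio = n_stay / n_churn    # non-churners per churner
baseline_accuracy = n_stay / total    # accuracy of always predicting "no churn"

print(f"churn rate: {churn_rate:.3f}")
print(f"non-churn to churn ratio: {imbalance_ratio:.2f}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

A model that never predicts churn would already be right for about 73% of customers, so accuracy alone can be a misleading measure on this dataset.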

Let us create two lists:
- **`categorical`** — which will contain the names of categorical variables;
- **`numerical`** — which will similarly contain the names of numerical variables.


In [14]:
categorical = ['gender', 'seniorcitizen', 'partner', 'dependents',
               'phoneservice', 'multiplelines', 'internetservice',
               'onlinesecurity', 'onlinebackup', 'deviceprotection',
               'techsupport', 'streamingtv', 'streamingmovies',
               'contract', 'paperlessbilling', 'paymentmethod']
numerical = ['tenure', 'monthlycharges', 'totalcharges']

In [15]:
df_train_full[categorical].nunique()

gender              2
seniorcitizen       2
partner             2
dependents          2
phoneservice        2
multiplelines       3
internetservice     3
onlinesecurity      3
onlinebackup        3
deviceprotection    3
techsupport         3
streamingtv         3
streamingmovies     3
contract            3
paperlessbilling    2
paymentmethod       4
dtype: int64

Conveniently, every categorical variable has only a few unique values, so we do not need to spend extra time preparing them.  
  
The next important step of EDA is to understand which features matter.

<a name = '2.1'></a>
### 2.1 - Feature Importance

Understanding how other variables influence the target variable (churn) is the key to understanding the data and building a good model. This process is called **feature importance analysis**, and it is often performed as part of exploratory data analysis to determine which variables are useful for the model.

It also provides us with additional insights into the dataset and helps answer questions such as **“What causes customer churn?”** and **“What are the characteristics of customers who leave?”**


`Churn Rate`

Let us start by looking at the **categorical variables**. The first thing we can do is examine the **churn rate for each variable**.

We can look at all the distinct values of a variable. Each value corresponds to a group of customers — all customers who have that particular value. For each such group, we can calculate the churn rate, which is the **group-specific churn rate**.

Once we have it, we can compare it with the **global churn rate**, calculated across all observations in the dataset.

If the difference between the group churn rate and the global churn rate is small, then this value is not very important for predicting churn, since this group of customers does not really differ from the rest. On the other hand, if the difference is significant, then something within this group distinguishes it from others.


In [22]:
# Start with the gender variable: churn rate within each group
female_mean = df_train_full[df_train_full.gender == 'female'].churn.mean()
male_mean = df_train_full[df_train_full.gender == 'male'].churn.mean()
print(f"Churn Rate in male group: {male_mean*100:.1f}% \n \t female group {female_mean*100:.1f}%")

Churn Rate in male group: 26.3% 
 	 female group 27.7%


The two churn rates are nearly identical, so gender is not a very useful variable for churn prediction.  
We will repeat this calculation for all categorical variables using the code below.  
  
Before that, it is important to introduce the `Risk Ratio`.  
In addition to looking at the difference between the group-specific and global churn rates, it is also interesting to examine the **ratio between them**. In statistics, the ratio between probabilities in different groups is called the **risk ratio**, where *risk* refers to the probability of an event occurring.

In our case, the event is churn, so the **churn risk** is defined as:

$$
\text{risk} = \frac{\text{group churn rate}}{\text{global churn rate}}
$$




##### Interpreting the Results

- **RR = 1**: The group churns at the same rate as customers overall.

- **RR > 1**: The group is more likely to churn than average  
  *(for example, RR = 2 means twice the risk)*.

- **RR < 1**: The group is less likely to churn than average  
  *(for example, RR = 0.5 means half the risk: the churn rate in that group is half the overall churn rate)*.

In our case, the **risk** is the probability of `Churn`.
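
To make the definitions concrete, here is a small sketch on toy data. The ten customers and the `contract` values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data: 10 customers with a contract type and a churn flag
toy = pd.DataFrame({
    "contract": ["monthly"] * 6 + ["yearly"] * 4,
    "churn":    [1, 1, 1, 0, 0, 0, 1, 0, 0, 0],
})

global_rate = toy.churn.mean()                      # 4 churners out of 10
group_rate = toy.groupby("contract").churn.mean()   # churn rate per group

summary = pd.DataFrame({
    "rate": group_rate,
    "diff": group_rate - global_rate,   # difference from the global rate
    "risk": group_rate / global_rate,   # risk ratio as defined above
})
print(summary)
```

In this toy example the monthly group churns at 0.5 against a global rate of 0.4, which gives a risk ratio above 1, while the yearly group churns at 0.25 and lands below 1.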


In [27]:
global_churn = df_train_full.churn.mean()

for col in categorical:
    df_group = df_train_full.groupby(by=col).churn.agg(['mean'])
    df_group['diff'] = df_group['mean'] - global_churn
    df_group['rate'] = df_group['mean'] / global_churn
    display(df_group)



| gender | mean | diff | rate |
|---|---|---|---|
| female | 0.276824 | 0.006856 | 1.025396 |
| male | 0.263214 | -0.006755 | 0.97498 |

| seniorcitizen | mean | diff | rate |
|---|---|---|---|
| 0 | 0.24227 | -0.027698 | 0.897403 |
| 1 | 0.413377 | 0.143409 | 1.531208 |

| partner | mean | diff | rate |
|---|---|---|---|
| no | 0.329809 | 0.059841 | 1.221659 |
| yes | 0.205033 | -0.064935 | 0.759472 |

| dependents | mean | diff | rate |
|---|---|---|---|
| no | 0.31376 | 0.043792 | 1.162212 |
| yes | 0.165666 | -0.104302 | 0.613651 |

| phoneservice | mean | diff | rate |
|---|---|---|---|
| no | 0.241316 | -0.028652 | 0.89387 |
| yes | 0.273049 | 0.003081 | 1.011412 |

| multiplelines | mean | diff | rate |
|---|---|---|---|
| no | 0.257407 | -0.012561 | 0.953474 |
| no_phone_service | 0.241316 | -0.028652 | 0.89387 |
| yes | 0.290742 | 0.020773 | 1.076948 |

| internetservice | mean | diff | rate |
|---|---|---|---|
| dsl | 0.192347 | -0.077621 | 0.712482 |
| fiber_optic | 0.425171 | 0.155203 | 1.574895 |
| no | 0.077805 | -0.192163 | 0.288201 |

| onlinesecurity | mean | diff | rate |
|---|---|---|---|
| no | 0.420921 | 0.150953 | 1.559152 |
| no_internet_service | 0.077805 | -0.192163 | 0.288201 |
| yes | 0.153226 | -0.116742 | 0.56757 |

| onlinebackup | mean | diff | rate |
|---|---|---|---|
| no | 0.404323 | 0.134355 | 1.497672 |
| no_internet_service | 0.077805 | -0.192163 | 0.288201 |
| yes | 0.217232 | -0.052736 | 0.80466 |

| deviceprotection | mean | diff | rate |
|---|---|---|---|
| no | 0.395875 | 0.125907 | 1.466379 |
| no_internet_service | 0.077805 | -0.192163 | 0.288201 |
| yes | 0.230412 | -0.039556 | 0.85348 |

| techsupport | mean | diff | rate |
|---|---|---|---|
| no | 0.418914 | 0.148946 | 1.551717 |
| no_internet_service | 0.077805 | -0.192163 | 0.288201 |
| yes | 0.159926 | -0.110042 | 0.59239 |

| streamingtv | mean | diff | rate |
|---|---|---|---|
| no | 0.342832 | 0.072864 | 1.269897 |
| no_internet_service | 0.077805 | -0.192163 | 0.288201 |
| yes | 0.302723 | 0.032755 | 1.121328 |

| streamingmovies | mean | diff | rate |
|---|---|---|---|
| no | 0.338906 | 0.068938 | 1.255358 |
| no_internet_service | 0.077805 | -0.192163 | 0.288201 |
| yes | 0.307273 | 0.037305 | 1.138182 |

| contract | mean | diff | rate |
|---|---|---|---|
| month-to-month | 0.431701 | 0.161733 | 1.599082 |
| one_year | 0.120573 | -0.149395 | 0.446621 |
| two_year | 0.028274 | -0.241694 | 0.10473 |

| paperlessbilling | mean | diff | rate |
|---|---|---|---|
| no | 0.172071 | -0.097897 | 0.637375 |
| yes | 0.338151 | 0.068183 | 1.25256 |

| paymentmethod | mean | diff | rate |
|---|---|---|---|
| bank_transfer_(automatic) | 0.168171 | -0.101797 | 0.622928 |
| credit_card_(automatic) | 0.164339 | -0.10563 | 0.608733 |
| electronic_check | 0.45589 | 0.185922 | 1.688682 |
| mailed_check | 0.19387 | -0.076098 | 0.718121 |


##### Quick Summary of categorical variables
- **Senior Citizen**: Senior customers churn more often than younger ones, with a risk ratio of 1.53 versus 0.90.
- **Tech Support**: Customers without tech support are more likely to churn (risk ratio 1.55), while customers with tech support churn less often (0.59).
- **Contract**: Customers on month-to-month contracts have a risk ratio of nearly 1.6, while customers on two-year contracts churn very rarely (0.10).

Simply glancing over the differences and risk ratios is enough to spot the features that stand out and may be useful for the classification model.

##### Mutual Information

The differences we have just examined are useful for our analysis and important for understanding the data, but they are difficult to use to determine which feature is the most important and whether, for example, the technical support variable is more useful than the contract type.

Fortunately, **feature importance metrics** come to our aid: we can measure the degree of dependence between a categorical variable and the target variable. If two variables are dependent, then knowing the value of one variable gives us some information about the other. On the other hand, if a variable is completely independent of the target variable, then it is useless and can be safely removed from the dataset.


For categorical variables, one such metric is **mutual information**, which shows how much information we gain about one variable if we know the value of another. This concept comes from information theory, and in machine learning we often use it to measure the **dependence between two variables**.

Higher values of mutual information indicate a stronger degree of dependence: if the mutual information between a categorical variable and the target variable is high, then the categorical variable can be used to predict the target. On the other hand, if the mutual information is close to zero, then the categorical variable and the target are nearly independent, and the variable will not be very useful for predicting the target.
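
To show what this score measures, the sketch below computes mutual information by hand from the joint distribution of two toy variables and compares it with `mutual_info_score` from scikit-learn, which uses the natural logarithm. The toy values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mutual_info_score

# Toy data: a categorical feature x and a binary target y
x = pd.Series(["a", "a", "a", "b", "b", "b", "b", "b"])
y = pd.Series([1, 1, 0, 0, 0, 0, 0, 1])

# Joint distribution p(x, y) and the two marginals p(x), p(y)
joint = pd.crosstab(x, y, normalize=True).to_numpy()
px = joint.sum(axis=1, keepdims=True)   # column vector of p(x)
py = joint.sum(axis=0, keepdims=True)   # row vector of p(y)

# MI = sum over cells of p(x,y) * log(p(x,y) / (p(x) * p(y))), skipping empty cells
nonzero = joint > 0
expected = px @ py                      # joint distribution under independence
mi_manual = np.sum(joint[nonzero] * np.log(joint[nonzero] / expected[nonzero]))

print(mi_manual, mutual_info_score(x, y))
```

The two numbers should agree up to floating-point error. The score is zero only when the joint distribution equals the product of the marginals, that is, when the variables are independent.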


In [28]:
from sklearn.metrics import mutual_info_score
def calculate_mi(series):
    return mutual_info_score(series, df_train_full.churn)
df_mi = df_train_full[categorical].apply(calculate_mi)
df_mi = df_mi.sort_values(ascending=False).to_frame(name='MI')
df_mi

| | MI |
|---|---|
| contract | 0.09832 |
| onlinesecurity | 0.063085 |
| techsupport | 0.061032 |
| internetservice | 0.055868 |
| onlinebackup | 0.046923 |
| deviceprotection | 0.043453 |
| paymentmethod | 0.04321 |
| streamingtv | 0.031853 |
| streamingmovies | 0.031581 |
| paperlessbilling | 0.017589 |


As we can see, `contract`, `onlinesecurity`, and `techsupport` are the most important features according to this metric.

##### Correlation Coefficient

Mutual information is a way to quantitatively assess the degree of dependence between two **categorical variables**, but it does not work when one of the features is numerical. Therefore, we cannot apply it to the three numerical variables that we have.

However, we can measure the dependence between a **binary target variable** and a **numerical variable**. We can pretend that the binary variable is numerical (containing only the values 0 and 1) and then use classical statistical methods to check whether there is any dependence between these variables.

One such method is the **correlation coefficient** (sometimes called the **Pearson correlation coefficient**). This value ranges from **–1 to 1**:


- **Positive correlation** means that when one variable increases, the other also tends to increase. In the case of a binary target, when the values of the variable are high, we observe ones more often than zeros. When the values are low, zeros occur more frequently than ones.

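- **Zero correlation** means there is no linear relationship between the two variables. This alone does not guarantee that they are independent, because the correlation coefficient only captures linear dependence.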

- **Negative correlation** occurs when one variable increases while the other decreases. In the case of a binary target, at high values of the variable we observe more zeros than ones in the target variable. When the values are low, we see more ones.
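
As a minimal sketch of the idea, the code below treats a binary target as a numerical column and computes its Pearson correlation with a numerical feature, both with pandas and directly from the definition. The toy `tenure` and `churn` values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: customers with longer tenure churn less often
toy = pd.DataFrame({
    "tenure": [1, 2, 3, 5, 12, 24, 36, 48, 60, 72],
    "churn":  [1, 1, 1, 0, 1, 0, 0, 0, 0, 0],
})

# Pearson correlation between the numerical column and the 0/1 target
r = toy["tenure"].corr(toy["churn"])
print(round(r, 3))

# The same value computed from the definition of the coefficient
x = toy["tenure"] - toy["tenure"].mean()
y = toy["churn"] - toy["churn"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())
print(round(r_manual, 3))
```

The coefficient is negative here because high tenure values mostly come with zeros in the target. For a whole set of numerical columns, pandas also offers `DataFrame.corrwith`, which returns one coefficient per column.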
