# Churn Prediction Project

<p align="left">
  <img src="https://uruit.com/blog/wp-content/uploads/2020/11/Churn1-1024x724.jpg" width="600">
</p>

**Churn** is a phenomenon where customers stop using the services of a company. Therefore, churn prediction involves identifying customers who are most likely to terminate their contracts in the near future. If a company can do this, it can offer discounts or special deals on its services in order to retain those customers.

Of course, we can apply machine learning to this problem: using historical data about customers who have already left and building a model to identify current customers who are likely to leave. This is a **binary classification** task. The target variable we want to predict is categorical and has only two possible outcomes: **will leave** or **will not leave**.


## Project Context and Goals

A telecommunications company is facing a problem: some of its customers are churning and switching to competitors.  
Our aim is to develop a system to identify such users and offer them incentives that will encourage them to stay.  
  
We want to target these customers with our marketing messages and provide discounts. We would also like to understand why the model believes that certain customers are about to leave, and for that we need to be able to interpret its predictions.

We have collected a dataset that contains certain information about our customers: which services they used, how much they paid, and how long they stayed with us. We also know which customers terminated their contracts and stopped using our services, that is, which customers churned. We will use this information as the target variable in a machine learning model and predict it using all the other available information.

## Dataset


According to the description, the dataset contains the following information:

- **Customer services** — telephone service; multiple lines; Internet; technical support; and additional services such as online security, backup, device protection, and streaming TV;

- **Account information** — how long the customer has been with the company, contract type, and payment method;

- **Charges** — how much the customer paid for the last month and in total;

- **Demographic information** — gender, age, whether the customer has dependents or a partner;

- **Churn** — yes/no, whether the customer left the company during the last month.


## Packages

In [16]:
import numpy as np
import pandas as pd

import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split

pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)


<a name = '1'></a>
## 1 - Preliminary Data Exploration and Modifications 


In [17]:
df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
display(df.head())
print(f"{len(df)} rows and {df.shape[1]} columns")

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


7043 rows and 21 columns


#### Column definitions

- **CustomerID** — customer identifier;

- **Gender** — male/female;

- **SeniorCitizen** — whether the customer is a senior citizen (0/1);

- **Partner** — whether the customer lives with a partner (yes/no);

- **Dependents** — whether the customer has dependents (yes/no);

- **Tenure** — number of months since the contract started;

- **PhoneService** — whether the customer has phone service (yes/no);

- **MultipleLines** — whether the customer has multiple phone lines (yes/no/no phone service);

- **InternetService** — type of internet service (no/DSL/fiber optic);

- **OnlineSecurity** — whether online security is enabled (yes/no/no internet);

- **OnlineBackup** — whether online backup service is enabled (yes/no/no internet);

- **DeviceProtection** — whether device protection service is enabled (yes/no/no internet);

- **TechSupport** — whether the customer has technical support (yes/no/no internet);

- **StreamingTV** — whether streaming TV service is enabled (yes/no/no internet);

- **StreamingMovies** — whether streaming movie service is enabled (yes/no/no internet);

- **Contract** — type of contract (month-to-month/one year/two year);

- **PaperlessBilling** — whether paperless billing is enabled (yes/no);

- **PaymentMethod** — payment method (electronic check, mailed check, bank transfer, credit card);

- **MonthlyCharges** — monthly amount charged (numeric);

- **TotalCharges** — total amount charged (numeric);

- **Churn** — whether the customer terminated the contract (yes/no).


In [18]:
df.dtypes

customerID           object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object

In [19]:
# TotalCharges was read as object: parse it as numeric and list the rows that fail to parse
total_charges = pd.to_numeric(df.TotalCharges, errors = 'coerce')
df[total_charges.isnull()][['customerID','TotalCharges']]

Unnamed: 0,customerID,TotalCharges
488,4472-LVYGI,
753,3115-CZMZD,
936,5709-LVOEQ,
1082,4367-NUYAO,
1340,1371-DWPAZ,
3331,7644-OMVMY,
3826,3213-VVOLG,
4380,2520-SGTTA,
5218,2923-ARZLG,
6670,4075-WKNIU,


TotalCharges was read as an object column because some rows contain a space instead of a number. With `errors='coerce'`, `pd.to_numeric` turns those entries into `NaN`, which is why they appear as missing in the output above.  

We will fill them with zeros.

In [20]:
df.TotalCharges = pd.to_numeric(df.TotalCharges, errors='coerce')
df.TotalCharges = df.TotalCharges.fillna(0)
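
To make the effect of `errors='coerce'` concrete, here is a minimal sketch on a toy series. The values are made up for illustration and are not taken from the dataset; only the `pd.to_numeric` and `fillna` calls mirror the cell above.

```python
import pandas as pd

# Toy stand-in for the TotalCharges column (illustrative values only)
raw = pd.Series(["29.85", " ", "1889.5"])

# Entries that cannot be parsed as numbers (here, the single space) become NaN
parsed = pd.to_numeric(raw, errors="coerce")
n_missing = int(parsed.isnull().sum())

# Same fill strategy as in the notebook: replace NaN with 0
filled = parsed.fillna(0)

print(n_missing)
print(filled.tolist())
```

Filling with zero is a modelling choice rather than the only option; dropping the affected rows would be another reasonable one.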


Now let's harmonize the column names and the values of the string (object) columns: everything is lowercased and spaces are replaced with underscores.

In [21]:
# Harmonize column names: lowercase, spaces replaced with underscores
df.columns = df.columns.str.lower().str.replace(' ', '_')

# Apply the same normalization to the values of every string (object) column
string_columns = df.select_dtypes(include='object').columns
df[string_columns] = df[string_columns].apply(
    lambda s: s.str.lower().str.replace(' ', '_')
)
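
As a small illustration of what this normalization does, the sketch below applies the same string operations to a couple of names and values taken from the output shown earlier. The variable names `columns` and `values` are only used here for illustration.

```python
import pandas as pd

# Two column names and two payment-method values as they appear in the raw data
columns = pd.Index(["customerID", "PaymentMethod"])
values = pd.Series(["Electronic check", "Bank transfer (automatic)"])

# Same operations as in the cell above
columns = columns.str.lower().str.replace(" ", "_")
values = values.str.lower().str.replace(" ", "_")

print(list(columns))
print(values.tolist())
```

Because the original column names contain no spaces, they are only lowercased (`customerID` becomes `customerid`), while the string values also gain underscores.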


Next, let us turn to our target variable: **churn**. At the moment, it is categorical and takes two values: **yes** and **no**.

In the case of binary classification, most models usually expect numerical values: **0** for *no* and **1** for *yes*. Therefore, we will convert these values into numbers.


In [22]:
df.churn = (df.churn == 'yes').astype(int)

Now we split the main data frame into two parts: `df_train_full` (80%), which will be split further into training and validation frames, and `df_test` (20%), which is held out for testing.

In [23]:
# train_test_split was already imported in the Packages cell
df_train_full, df_test = train_test_split(df, test_size=0.2, random_state=1)

In [24]:
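# Split train_full further: 67% for training, 33% for validation
df_train, df_val = train_test_split(df_train_full, test_size=0.33, random_state=11)

# Extract the target as NumPy arrays
y_train = df_train.churn.values
y_val = df_val.churn.values

# Remove the target from the feature frames so it cannot be used as a feature
del df_train['churn']
del df_val['churn']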
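
The two-step split can be worked out by hand. The sketch below is plain arithmetic, assuming that for a fractional `test_size` scikit-learn rounds the test part up to the next whole row; the row count 7043 comes from the output above.

```python
import math

n_total = 7043  # rows in the full dataset

# Assumption: the held-out part is ceil(test_size * n) rows
n_test = math.ceil(0.2 * n_total)
n_train_full = n_total - n_test

n_val = math.ceil(0.33 * n_train_full)
n_train = n_train_full - n_val

# Share of the full dataset that ends up in each part
shares = {
    "train": round(n_train / n_total, 3),
    "val": round(n_val / n_total, 3),
    "test": round(n_test / n_total, 3),
}

print(n_train, n_val, n_test)
print(shares)
```

Under that assumption the parts come out at roughly 54% train, 26% validation, and 20% test, and `n_train_full` is 5634, which agrees with the total of the `value_counts()` output in the EDA section.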

<a name = '2'></a>
## 2 - EDA

In [26]:
df_train_full.isnull().sum()

customerid          0
gender              0
seniorcitizen       0
partner             0
dependents          0
tenure              0
phoneservice        0
multiplelines       0
internetservice     0
onlinesecurity      0
onlinebackup        0
deviceprotection    0
techsupport         0
streamingtv         0
streamingmovies     0
contract            0
paperlessbilling    0
paymentmethod       0
monthlycharges      0
totalcharges        0
churn               0
dtype: int64

In [27]:
df_train_full.churn.value_counts()

0    4113
1    1521
Name: churn, dtype: int64

In [29]:
round(df_train_full.churn.mean(),3)

0.27

Our churn dataset is an example of a so-called **imbalanced dataset**. In our data, there are roughly 2.7 times as many customers who did **not** churn (4113) as customers who did (1521), so the **non-churn** class dominates the **churn** class.

The churn rate of **0.27** says the same thing: only slightly more than a quarter of the customers in `df_train_full` churned.
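
The imbalance can be quantified directly from the counts printed above. The following is a small worked calculation; the "always predict no churn" baseline is added here only as an illustration of why imbalance matters and is not part of the original analysis.

```python
# Class counts from df_train_full.churn.value_counts()
n_stay, n_churn = 4113, 1521
n = n_stay + n_churn

churn_rate = n_churn / n   # share of customers who churned
ratio = n_stay / n_churn   # non-churners per churner

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = max(n_stay, n_churn) / n

print(round(churn_rate, 3), round(ratio, 2), round(baseline_accuracy, 2))
```

A model that ignores every feature would already be right about 73% of the time, so plain accuracy should be read with that baseline in mind.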

Let us create two lists:
- **`categorical`** — which will contain the names of categorical variables;
- **`numerical`** — which will similarly contain the names of numerical variables.


In [30]:
categorical = ['gender', 'seniorcitizen', 'partner', 'dependents',
               'phoneservice', 'multiplelines', 'internetservice',
               'onlinesecurity', 'onlinebackup', 'deviceprotection',
               'techsupport', 'streamingtv', 'streamingmovies',
               'contract', 'paperlessbilling', 'paymentmethod']
numerical = ['tenure', 'monthlycharges', 'totalcharges']
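
A quick sanity check, sketched below, confirms that the two lists together cover every column except the identifier and the target. The `all_columns` list is written out by hand from the `isnull().sum()` output above; in the notebook itself one could use `df_train_full.columns` instead.

```python
# Column names as they appear after harmonization
all_columns = [
    "customerid", "gender", "seniorcitizen", "partner", "dependents",
    "tenure", "phoneservice", "multiplelines", "internetservice",
    "onlinesecurity", "onlinebackup", "deviceprotection", "techsupport",
    "streamingtv", "streamingmovies", "contract", "paperlessbilling",
    "paymentmethod", "monthlycharges", "totalcharges", "churn",
]

categorical = ["gender", "seniorcitizen", "partner", "dependents",
               "phoneservice", "multiplelines", "internetservice",
               "onlinesecurity", "onlinebackup", "deviceprotection",
               "techsupport", "streamingtv", "streamingmovies",
               "contract", "paperlessbilling", "paymentmethod"]
numerical = ["tenure", "monthlycharges", "totalcharges"]

# The identifier and the target are deliberately left out of both lists
excluded = {"customerid", "churn"}

missing = set(all_columns) - set(categorical) - set(numerical) - excluded
overlap = set(categorical) & set(numerical)

print(missing, overlap)
```

Note that `seniorcitizen` is stored as 0/1 integers but is treated as categorical here, since it encodes a yes/no attribute.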

In [31]:
df_train_full[categorical].nunique()

gender              2
seniorcitizen       2
partner             2
dependents          2
phoneservice        2
multiplelines       3
internetservice     3
onlinesecurity      3
onlinebackup        3
deviceprotection    3
techsupport         3
streamingtv         3
streamingmovies     3
contract            3
paperlessbilling    2
paymentmethod       4
dtype: int64

Conveniently, the categorical variables have only a few unique values each, so they need little additional preparation.  
  
The next important step of EDA is to understand which features matter most for predicting churn.

<a name = '2.1'></a>
### 2.1 - Feature Importance