# Intro

## Context

"Predict behavior to retain customers. You can analyze all relevant customer data and develop focused customer retention programs." [IBM Sample Data Sets]

## Content

Each row represents a customer; each column contains one of the customer’s attributes, as described in the column metadata.

The data set includes information about:

- Customers who left within the last month – the column is called `Churn`
- Services that each customer has signed up for:
    - phone (`PhoneService`)
    - multiple lines (`MultipleLines`)
    - internet (`InternetService`)
    - online security (`OnlineSecurity`)
    - online backup (`OnlineBackup`)
    - device protection (`DeviceProtection`)
    - tech support (`TechSupport`)
    - streaming TV (`StreamingTV`)
    - streaming movies (`StreamingMovies`)
- Customer account information:
    - how long they’ve been a customer (`tenure`)
    - contract (`Contract`)
    - payment method (`PaymentMethod`)
    - paperless billing (`PaperlessBilling`)
    - monthly charges (`MonthlyCharges`)
    - total charges (`TotalCharges`)
- Demographic info about customers:
    - gender (`gender`)
    - whether they are a senior citizen (`SeniorCitizen`)
    - if they have partners (`Partner`)
    - dependents (`Dependents`)

# Resources

- https://github.com/IBM/telco-customer-churn-on-icp4d

# Import libraries

First, let's import all libraries that we might need for our analysis.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
from scipy import stats

# set seaborn theme
sns.set_style(style="whitegrid")

from cleaning_pipeline import *

# Load data

Let's load the *CSV* data:

In [2]:
df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")

In [3]:
df.head()

,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [4]:
df.columns

Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')

In [5]:
df.shape

(7043, 21)

# Getting to know the data

The data consists of 21 columns and 7,043 rows, one row per customer.

In this section, we'll *get to know* the data we're dealing with and answer questions such as:
- Do columns have the correct `dtype`?
- What are the types of variables (Nominal, Ordinal, Discrete, Continuous, Binary)?
- Are there any missing values?
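
The questions above can be explored with a few pandas calls. The sketch below is only an illustration: it uses a tiny stand-in frame named `sample` with made-up values, so it runs without the CSV. The `TotalCharges` values are an assumption chosen to show how numbers stored as text can hide blanks; whether the real column behaves this way should be checked on the loaded `df`.

```python
import pandas as pd

# Tiny stand-in frame (hypothetical values) so the sketch runs without the CSV.
sample = pd.DataFrame(
    {
        "gender": ["Female", "Male", "Male"],
        "SeniorCitizen": [0, 0, 1],
        "tenure": [1, 34, 2],
        "MonthlyCharges": [29.85, 56.95, 53.85],
        "TotalCharges": ["29.85", "1889.5", " "],  # numbers stored as text, one blank
    }
)

# 1. Which dtype did pandas infer for each column?
dtypes = sample.dtypes

# 2. How many values are reported as missing per column?
missing = sample.isna().sum()

# 3. A blank string is not NaN, so isna() does not count it.
#    Coercing the text column to numeric turns unparseable entries into NaN.
total_numeric = pd.to_numeric(sample["TotalCharges"], errors="coerce")
hidden_missing = int(total_numeric.isna().sum())

print(dtypes)
print(missing)
print("blank TotalCharges entries:", hidden_missing)
```

On the real data, the same three steps would be applied to `df` instead of `sample`.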


In [13]:
demographic_cols = [
    "gender",
    "SeniorCitizen",
    "Partner",
    "Dependents",
]

In [14]:
account_cols = [
    "tenure",
    "Contract",
    "PaymentMethod",
    "PaperlessBilling",
    "MonthlyCharges",
    "TotalCharges",
]

In [15]:
services_cols = [
    "PhoneService",
    "MultipleLines",
    "InternetService",
    "OnlineSecurity",
    "OnlineBackup",
    "DeviceProtection",
    "TechSupport",
    "StreamingTV",
    "StreamingMovies",
]
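
As a quick sanity check, it may help to confirm that the three groups together cover every column exactly once. The sketch below assumes the column names from the `df.columns` output shown earlier; `all_columns`, `grouped`, `ungrouped`, `unknown`, and `duplicates` are names introduced here for illustration.

```python
# Column names copied from the df.columns output shown earlier.
all_columns = [
    "customerID", "gender", "SeniorCitizen", "Partner", "Dependents",
    "tenure", "PhoneService", "MultipleLines", "InternetService",
    "OnlineSecurity", "OnlineBackup", "DeviceProtection", "TechSupport",
    "StreamingTV", "StreamingMovies", "Contract", "PaperlessBilling",
    "PaymentMethod", "MonthlyCharges", "TotalCharges", "Churn",
]

demographic_cols = ["gender", "SeniorCitizen", "Partner", "Dependents"]
account_cols = [
    "tenure", "Contract", "PaymentMethod", "PaperlessBilling",
    "MonthlyCharges", "TotalCharges",
]
services_cols = [
    "PhoneService", "MultipleLines", "InternetService", "OnlineSecurity",
    "OnlineBackup", "DeviceProtection", "TechSupport", "StreamingTV",
    "StreamingMovies",
]

grouped = demographic_cols + account_cols + services_cols

# Columns that belong to no group (expected: the identifier and the target).
ungrouped = [c for c in all_columns if c not in grouped]
# Names in a group that do not exist in the data (typos would show up here).
unknown = [c for c in grouped if c not in all_columns]
# Names that were assigned to more than one group.
duplicates = {c for c in grouped if grouped.count(c) > 1}

print(len(grouped))  # 19
print(ungrouped)     # ['customerID', 'Churn']
print(unknown, duplicates)
```

Only `customerID` and the target `Churn` are left outside the groups, which accounts for all 21 columns.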

## Demographic attributes

### `gender`

In [10]:
df.gender.head()

0    Female
1      Male
2      Male
3      Male
4    Female
Name: gender, dtype: object

In [8]:
df.gender.value_counts()

Male      3555
Female    3488
Name: gender, dtype: int64

The `gender` variable is binary and is stored as a `string` (pandas `object` dtype).

### `SeniorCitizen`

In [11]:
df.SeniorCitizen.head()

0    0
1    0
2    0
3    0
4    0
Name: SeniorCitizen, dtype: int64

In [12]:
df.SeniorCitizen.value_counts()

0    5901
1    1142
Name: SeniorCitizen, dtype: int64

The `SeniorCitizen` variable is binary, but it's stored as an integer (`int64`).
For visualization, it would be better to convert it to a *categorical* variable (or `string`).
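
One possible way to do this conversion is sketched below. It runs on a small stand-in series with made-up values rather than on `df["SeniorCitizen"]`, and the `"No"`/`"Yes"` labels are an assumption, chosen only to match the style of `Partner` and `Dependents`.

```python
import pandas as pd

# Stand-in for df["SeniorCitizen"] (hypothetical values) so the sketch runs on its own.
senior = pd.Series([0, 0, 1, 0, 1], name="SeniorCitizen")

# Map the 0/1 flag to readable labels, then store the result as a categorical
# with an explicit category order so plots list "No" before "Yes".
label_dtype = pd.CategoricalDtype(categories=["No", "Yes"])
senior_labeled = senior.map({0: "No", 1: "Yes"}).astype(label_dtype)

print(senior_labeled.value_counts())
```

Keeping the result as a categorical rather than a plain string fixes the order of the levels, which tends to make bar charts and legends consistent across plots.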

### `Partner`

In [17]:
df.Partner.head()

0    Yes
1     No
2     No
3     No
4     No
Name: Partner, dtype: object

In [18]:
df.Partner.value_counts()

No     3641
Yes    3402
Name: Partner, dtype: int64

The `Partner` variable is a binary variable, and its data type is `string`.

### `Dependents`

In [20]:
df.Dependents.head()

0    No
1    No
2    No
3    No
4    No
Name: Dependents, dtype: object

In [21]:
df.Dependents.value_counts()

No     4933
Yes    2110
Name: Dependents, dtype: int64

The `Dependents` variable is a binary variable, and its data type is `string`.
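
The per-column checks above can also be run in one pass. The following is a minimal sketch on a stand-in frame with made-up rows (`sample` is a hypothetical name); on the real data, `df[demographic_cols]` would take its place.

```python
import pandas as pd

# Stand-in rows (hypothetical values) with the four demographic columns.
sample = pd.DataFrame(
    {
        "gender": ["Female", "Male", "Male", "Male", "Female"],
        "SeniorCitizen": [0, 0, 0, 1, 0],
        "Partner": ["Yes", "No", "No", "No", "No"],
        "Dependents": ["No", "No", "Yes", "No", "No"],
    }
)
demographic_cols = ["gender", "SeniorCitizen", "Partner", "Dependents"]

# Number of distinct values per column; a binary variable has exactly two.
n_unique = sample[demographic_cols].nunique()
is_binary = n_unique == 2
print(is_binary)

# Share of each level, which is easier to compare across columns than raw counts.
for col in demographic_cols:
    print(sample[col].value_counts(normalize=True))
```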

## Account attributes