# Intro

## Context

"Predict behavior to retain customers. You can analyze all relevant customer data and develop focused customer retention programs." [IBM Sample Data Sets]

## Content

Each row represents a customer, and each column contains a customer attribute described in the column metadata.

The data set includes information about:

- Customers who left within the last month – the column is called `Churn`
- Services that each customer has signed up for:
    - phone (`PhoneService`)
    - multiple lines (`MultipleLines`)
    - internet (`InternetService`)
    - online security (`OnlineSecurity`)
    - online backup (`OnlineBackup`)
    - device protection (`DeviceProtection`)
    - tech support (`TechSupport`)
    - streaming TV (`StreamingTV`)
    - streaming movies (`StreamingMovies`)
- Customer account information:
    - how long they’ve been a customer (`tenure`)
    - contract (`Contract`)
    - payment method (`PaymentMethod`)
    - paperless billing (`PaperlessBilling`)
    - monthly charges (`MonthlyCharges`)
    - total charges (`TotalCharges`)
- Demographic info about customers:
    - gender (`gender`)
    - age range (`SeniorCitizen`)
    - if they have partners (`Partner`)
    - dependents (`Dependents`)

# Resources

- https://github.com/IBM/telco-customer-churn-on-icp4d

# Import libraries

First, let's import all libraries that we might need for our analysis.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
from scipy import stats

# set seaborn theme
sns.set_style(style="whitegrid")

from cleaning_pipeline import *

# Load data

Let's load the *CSV* data:

In [None]:
df = pd.read_csv("data/WA_Fn-UseC_-Telco-Customer-Churn.csv")

In [None]:
df.head()

In [None]:
df.columns

In [None]:
df.shape

# Getting to know the data

The data consists of 21 attributes and about 7,000 samples.

In this section, we'll try to *get to know* the data we're dealing with, and answer questions like the following:
- Do columns have the correct `dtype`?
- What are the types of variables (Nominal, Ordinal, Discrete, Continuous, Binary)?
- Are there any missing values?
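
A first pass at these three questions can be made with a few pandas calls. The sketch below runs on a tiny hypothetical frame named `sample`, which borrows three column names from the dataset; its values are invented for illustration and are not taken from the real file.

```python
import pandas as pd

# Hypothetical miniature of the dataset; values are made up for illustration.
sample = pd.DataFrame(
    {
        "gender": ["Female", "Male", "Male"],
        "SeniorCitizen": [0, 1, 0],
        "MonthlyCharges": [29.85, 56.95, None],
    }
)

# Question 1: which dtype did pandas infer for each column?
dtypes = sample.dtypes

# Question 2: how many distinct values does each column take?
# Exactly two distinct values suggests a binary variable.
n_unique = sample.nunique()

# Question 3: how many missing values does each column have?
n_missing = sample.isna().sum()

print(pd.DataFrame({"dtype": dtypes, "n_unique": n_unique, "n_missing": n_missing}))
```

Note that `isna()` only counts values pandas already treats as missing, so blank strings in a text column would not show up here.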


In [None]:
demographic_cols = [
    "gender",
    "SeniorCitizen",
    "Partner",
    "Dependents",
]

In [None]:
account_cols = [
    "tenure",
    "Contract",
    "PaymentMethod",
    "PaperlessBilling",
    "MonthlyCharges",
    "TotalCharges",
]

In [None]:
services_cols = [
    "PhoneService",
    "MultipleLines",
    "InternetService",
    "OnlineSecurity",
    "OnlineBackup",
    "DeviceProtection",
    "TechSupport",
    "StreamingTV",
    "StreamingMovies",
]

## Demographic attributes

### `gender`

In [None]:
df.gender.head()

In [None]:
df.gender.value_counts()

The `gender` variable is a binary variable, and it's represented as a `string`.

### `SeniorCitizen`

In [None]:
df.SeniorCitizen.head()

In [None]:
df.SeniorCitizen.value_counts()

The `SeniorCitizen` variable is a binary variable, but it's represented as an integer.
It would be better (for visualization) to convert it to a *categorical* variable (or `string`).
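
One possible way to do this conversion is sketched below. It uses a hypothetical stand-in series instead of the real column, and the `"No"`/`"Yes"` labels (with `1` meaning senior) are an assumption chosen to resemble the other binary columns.

```python
import pandas as pd

# Hypothetical stand-in for df.SeniorCitizen (assumed: 0 = not senior, 1 = senior).
senior = pd.Series([0, 1, 0, 0, 1], name="SeniorCitizen")

# Map the integer codes to labels, then store the result as a categorical.
# The "No"/"Yes" labels are an assumption, not something the dataset defines.
senior_cat = senior.map({0: "No", 1: "Yes"}).astype("category")

print(senior_cat.value_counts())
```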

### `Partner`

In [None]:
df.Partner.head()

In [None]:
df.Partner.value_counts()

The `Partner` variable is a binary variable, and its data type is `string`.

### `Dependents`

In [None]:
df.Dependents.head()

In [None]:
df.Dependents.value_counts()

The `Dependents` variable is a binary variable, and its data type is `string`.

## Account attributes

### `tenure`

In [None]:
df.tenure.head()

In [None]:
df.tenure.value_counts()

The `tenure` variable is a discrete (numeric) variable, and its data type is integer.

### `Contract`

In [None]:
df.Contract.head()

In [None]:
df.Contract.value_counts()

The `Contract` variable is an ordinal variable, and its data type is `string`.
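
Because the variable is ordinal, its order can be made explicit with an ordered categorical. This is a minimal sketch on a hypothetical series; the three contract labels and their ordering (shortest to longest commitment) are assumptions and should be checked against the output of `value_counts()`.

```python
import pandas as pd

# Hypothetical stand-in for df.Contract; the three labels are assumed.
contract = pd.Series(["One year", "Month-to-month", "Two year", "Month-to-month"])

# Declare the categories from shortest to longest commitment.
contract_dtype = pd.CategoricalDtype(
    categories=["Month-to-month", "One year", "Two year"], ordered=True
)
contract_ord = contract.astype(contract_dtype)

# Ordered categoricals sort by category order and support comparisons.
print(contract_ord.sort_values().tolist())
print((contract_ord >= "One year").tolist())
```

With an ordered dtype, plots and group-by results follow the contract length, not alphabetical order.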

### `PaymentMethod`

In [None]:
df.PaymentMethod.head()

In [None]:
df.PaymentMethod.value_counts()

The `PaymentMethod` variable is a nominal variable, and its data type is `string`.

### `MonthlyCharges`

In [None]:
df.MonthlyCharges.head()

The `MonthlyCharges` variable is a continuous variable, and its data type is `float`.

### `TotalCharges`

In [None]:
df.TotalCharges.head()

This variable is continuous, but its data type is `string` instead of `float`.

Let's see if there are any rows which have non-numeric values (letters, symbols, etc.).

Pandas string methods such as `isalnum`, `isdecimal`, `isdigit` and `isnumeric` won't work here, because they reject the decimal point in a floating-point number. Instead, we'll use a regular expression.

The following pattern will match floating point numbers: `\d+(\.\d*)?`
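
To see what this pattern accepts and rejects, here is a small sketch using Python's `re` module on hypothetical sample strings (none are taken from the dataset). The name `FLOAT_PATTERN` is introduced only for this example.

```python
import re

# The pattern from the text: one or more digits, optionally followed by
# a decimal point and zero or more digits.
FLOAT_PATTERN = re.compile(r"\d+(\.\d*)?")

# Hypothetical sample values, not taken from the dataset.
candidates = ["29.85", "1889", "108.", " ", "", "abc", "12.5x"]

# fullmatch requires the whole string to fit the pattern, so "12.5x" fails
# even though it starts with a valid number.
for value in candidates:
    print(repr(value), bool(FLOAT_PATTERN.fullmatch(value)))
```

The pattern is deliberately simple: it does not accept signs, exponents, or numbers that start with a decimal point such as `.5`.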

Let's display all rows which **don't** match this pattern:

In [None]:
# Negate a full-string match so that only the non-numeric entries are kept
df.loc[~df.TotalCharges.str.fullmatch(r"\d+(\.\d*)?"), "TotalCharges"]

In [None]:
df.loc[~df.TotalCharges.str.fullmatch(r"\d+(\.\d*)?"), "TotalCharges"].values

It seems that those are missing entries, represented as empty strings.

We should replace them with `NaN` values, so we can deal with all missing values later.

Then we should convert the variable's data type from `string` to `float`.
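
The two steps could look like the sketch below. It works on a hypothetical stand-in series, not the real column, and assumes the missing entries are either empty or whitespace-only strings.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df.TotalCharges, with two blank entries.
total = pd.Series(["29.85", " ", "1889.5", ""], name="TotalCharges")

# Strip whitespace so " " and "" look the same, then mark blanks as NaN.
cleaned = total.str.strip().replace("", np.nan)

# Convert the remaining strings to floats; NaN stays NaN.
total_float = pd.to_numeric(cleaned)

print(total_float)
print(total_float.isna().sum())
```

A shorter alternative is `pd.to_numeric(total, errors="coerce")`, but it silently turns every unparseable value into `NaN`, which can hide entries that are malformed and not simply blank.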