# Resources

- https://en.wikipedia.org/wiki/Customer_attrition
- https://public.tableau.com/profile/sandeep7769#!/vizhome/CustomerChurnAnalysis_4/TelecomCustomerChurnDataExploration
- https://developer.ibm.com/technologies/data-science/patterns/predict-customer-churn-using-watson-studio-and-jupyter-notebooks/#

# Import libraries

In [None]:
import pandas as pd
import numpy as np

import plotly.express as px
import seaborn as sns
from statsmodels.graphics.mosaicplot import mosaic

sns.set_style("ticks")

from cleaning_pipeline import *

# Load data

In [None]:
df = pd.read_csv("data/WA_Fn-UseC_-Telco-Customer-Churn.csv")

In [None]:
df.head()

# Clean data

In [None]:
cleaned_df = (
    df.pipe(start_pipeline)
    .pipe(drop_noisy_columns, cols=["customerID"])
    .pipe(replace_empty_strings_with_nan)
)

In [None]:
cleaned_df.head()

In [None]:
cleaned_df.isna().mean()

Less than 1% of the values in the `TotalCharges` column are missing, and no other column has missing values, so we can drop the affected rows.
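The cell below is a minimal sketch of how such a missing-value share can arise. It assumes that the blank entries are whitespace-only strings and that `replace_empty_strings_with_nan` converts them with a regex replace; the helper's real implementation is not shown in this notebook, and the toy frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (made up): one whitespace-only TotalCharges entry.
toy = pd.DataFrame(
    {
        "tenure": [0, 12, 24, 36],
        "TotalCharges": [" ", "350.5", "1200.0", "2500.75"],
    }
)

# Assumption: empty or whitespace-only strings count as missing values.
toy = toy.replace(r"^\s*$", np.nan, regex=True)

# isna().mean() gives the fraction of missing values per column.
missing_share = toy.isna().mean()
print(missing_share)
```

Multiplying `missing_share` by 100 gives the value as a percentage.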

In [None]:
cleaned_df = cleaned_df.pipe(drop_missing_values)

In [None]:
cleaned_df

Now we can convert each column to the appropriate `dtype`.

In [None]:
# `np.float` was removed in NumPy 1.24; the built-in `float` is equivalent.
cleaned_df = cleaned_df.pipe(
    convert_column_dtypes, {"SeniorCitizen": "str", "TotalCharges": float}
)

In [None]:
cleaned_df.head()

In [None]:
cleaned_df.dtypes

Putting the steps together, the final cleaning and processing pipeline is:

In [None]:
cleaned_df = (
    df.pipe(start_pipeline)
    .pipe(drop_noisy_columns, cols=["customerID"])
    .pipe(replace_empty_strings_with_nan)
    .pipe(drop_missing_values)
    .pipe(map_column_values, col="SeniorCitizen", mapping_dict={0: "No", 1: "Yes"})
    .pipe(convert_column_dtypes, dtypes_mapping={"TotalCharges": float})
)

In [None]:
cleaned_df.head()
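The helpers imported from `cleaning_pipeline` are not shown in this notebook. The following is only a sketch of what they might look like, inferred from their names and the way they are called; every function body here is an assumption, and the small `raw` frame is made up for illustration.

```python
import numpy as np
import pandas as pd


def start_pipeline(df):
    # Work on a copy so the raw frame is never modified.
    return df.copy()


def drop_noisy_columns(df, cols):
    return df.drop(columns=cols)


def replace_empty_strings_with_nan(df):
    # Assumption: empty or whitespace-only strings count as missing.
    return df.replace(r"^\s*$", np.nan, regex=True)


def drop_missing_values(df):
    return df.dropna()


def map_column_values(df, col, mapping_dict):
    return df.assign(**{col: df[col].map(mapping_dict)})


def convert_column_dtypes(df, dtypes_mapping):
    return df.astype(dtypes_mapping)


# Made-up stand-in for the raw Telco data.
raw = pd.DataFrame(
    {
        "customerID": ["a", "b", "c"],
        "SeniorCitizen": [0, 1, 0],
        "TotalCharges": ["29.85", " ", "108.15"],
    }
)

cleaned = (
    raw.pipe(start_pipeline)
    .pipe(drop_noisy_columns, cols=["customerID"])
    .pipe(replace_empty_strings_with_nan)
    .pipe(drop_missing_values)
    .pipe(map_column_values, col="SeniorCitizen", mapping_dict={0: "No", 1: "Yes"})
    .pipe(convert_column_dtypes, dtypes_mapping={"TotalCharges": float})
)
print(cleaned)
```

Because each step takes a DataFrame and returns a new one, the steps can be reordered or tested in isolation without touching the raw data.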

# Exploratory Data Analysis

In [None]:
demographic_cols = [
    "gender",
    "SeniorCitizen",
    "Partner",
    "Dependents",
]

In [None]:
account_cols = [
    "tenure",
    "Contract",
    "PaymentMethod",
    "PaperlessBilling",
    "MonthlyCharges",
    "TotalCharges",
]

In [None]:
services_cols = [
    "PhoneService",
    "MultipleLines",
    "InternetService",
    "OnlineSecurity",
    "OnlineBackup",
    "DeviceProtection",
    "TechSupport",
    "StreamingTV",
    "StreamingMovies",
]

## Target variable

In [None]:
temp_df = cleaned_df.groupby(by="Churn", as_index=False).agg(
    count=pd.NamedAgg(column="Churn", aggfunc="count")
)

# Convert raw counts to the share of all customers.
temp_df["count"] = temp_df["count"] / len(cleaned_df)

In [None]:
fig = px.bar(data_frame=temp_df, x="Churn", y="count", color="Churn")

fig.show()

## Demographic attributes

### `gender`

In [None]:
fig = px.histogram(data_frame=cleaned_df, x="gender", color="gender")

fig.show()

### `SeniorCitizen`

In [None]:
fig = px.histogram(data_frame=cleaned_df, x="SeniorCitizen", color="SeniorCitizen")

fig.show()

### `Partner`

In [None]:
fig = px.histogram(data_frame=cleaned_df, x="Partner", color="Partner")

fig.show()

### `Dependents`

In [None]:
fig = px.histogram(data_frame=cleaned_df, x="Dependents", color="Dependents")

fig.show()

## Account attributes

### `tenure`

In [None]:
cleaned_df.tenure.head()

In [None]:
cleaned_df.tenure.describe()

In [None]:
cleaned_df.tenure.skew()

In [None]:
fig = px.histogram(data_frame=cleaned_df, x="tenure", marginal="box")

fig.show()

In [None]:
fig = px.histogram(
    data_frame=cleaned_df, x="tenure", color="Churn", barmode="group", marginal="box"
)

fig.show()

This plot suggests several insights:
- New customers, with about 3 months of tenure, are the most likely to churn.
- The higher a customer's tenure, the less likely they are to churn.
- A few *churning* customers have a very high tenure (about 70 months, almost 6 years) yet still leave the company. These may be *outliers*, or their churn may be related to other factors.
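One way to check these observations numerically is to compute the churn rate per tenure band. The sketch below uses made-up toy data rather than the Telco dataset, and the band edges are an arbitrary choice for illustration.

```python
import pandas as pd

# Toy data (made up for illustration, not taken from the Telco dataset).
toy = pd.DataFrame(
    {
        "tenure": [1, 2, 3, 5, 14, 20, 30, 45, 60, 70],
        "Churn": ["Yes", "Yes", "No", "Yes", "No", "Yes", "No", "No", "No", "Yes"],
    }
)

# Group tenure (in months) into bands; the edges are an arbitrary choice.
bins = [0, 12, 24, 48, 72]
labels = ["0-12", "13-24", "25-48", "49-72"]
toy["tenure_band"] = pd.cut(
    toy["tenure"], bins=bins, labels=labels, include_lowest=True
)

# Share of churned customers within each band.
churn_rate = (
    toy.assign(churned=toy["Churn"].eq("Yes"))
    .groupby("tenure_band", observed=True)["churned"]
    .mean()
)
print(churn_rate)
```

Applied to `cleaned_df`, the same steps would show whether the churn rate actually falls as tenure grows, rather than relying on bar heights alone.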

### `Contract`

In [None]:
cleaned_df.Contract.head()

In [None]:
temp_df = (
    cleaned_df.groupby(by="Contract", as_index=False)
    .agg(count=pd.NamedAgg(column="Contract", aggfunc="count"))
    .sort_values(by="count", ascending=False)
)

In [None]:
temp_df

In [None]:
fig = px.bar(data_frame=temp_df, x="Contract", y="count", color="Contract")

fig.show()

The majority of customers are on *month-to-month* contracts.

Let's look at the relationship between contract type and whether customers churn:

In [None]:
temp_df = (
    cleaned_df.groupby(by=["Contract", "Churn"], as_index=False)
    .size()
    .sort_values(by="size", ascending=False)
)

In [None]:
fig = px.bar(data_frame=temp_df, x="Contract", y="size", color="Churn", barmode="group")

fig.show()

This chart suggests that customers on short-term (month-to-month) contracts are far more likely to leave the service.
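The grouped bars show absolute counts, and the contract types do not have the same number of customers, so a row-normalized table makes the comparison more direct. This is a minimal sketch on made-up toy data; the contract labels are assumed to match those in the dataset.

```python
import pandas as pd

# Toy data (made up for illustration, not taken from the Telco dataset).
toy = pd.DataFrame(
    {
        "Contract": ["Month-to-month"] * 4 + ["One year"] * 4 + ["Two year"] * 2,
        "Churn": ["Yes", "Yes", "No", "No", "No", "No", "No", "Yes", "No", "No"],
    }
)

# normalize="index" turns each row into shares that sum to 1,
# so the "Yes" column is the churn rate per contract type.
churn_share = pd.crosstab(toy["Contract"], toy["Churn"], normalize="index")
print(churn_share)
```

Running the same `pd.crosstab` call on `cleaned_df["Contract"]` and `cleaned_df["Churn"]` would give the actual churn rate for each contract type.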

### `PaymentMethod`