# 📊 Exploratory Data Analysis: Telco Customer Churn

This notebook explores the **IBM Telco Customer Churn** dataset. The goal is to load and clean the data, then look for patterns and relationships that may influence customer satisfaction and, in turn, customer churn.

In [None]:
from pathlib import Path

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

## Load the Dataset

In [None]:
data_dir = Path().resolve().parent / "data"
raw_data_path = data_dir / "raw" / "Telco_customer_churn.xlsx"
df = pd.read_excel(raw_data_path)

In [None]:
df.head().T

## Clean Data

In [None]:
df.info()

In [None]:
# Standardize column names
df.columns = df.columns.str.lower().str.replace(" ", "_")
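
As a small illustration of the renaming step, the sketch below applies the same chain to a hypothetical empty frame. The three header names are assumed examples of the raw spreadsheet style, not taken from the file itself.

```python
import pandas as pd

# Hypothetical empty frame with headers in the raw spreadsheet style.
toy = pd.DataFrame(columns=["CustomerID", "Tenure Months", "Total Charges"])

# Lowercase every header, then replace spaces with underscores.
toy.columns = toy.columns.str.lower().str.replace(" ", "_")

print(list(toy.columns))
```

Headers without spaces, such as `CustomerID`, are only lowercased, which is why the cleaned frame has `customerid` rather than `customer_id`.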

In [None]:
df.head().T

In [None]:
# Convert total_charges to numeric, coercing errors to NaN
df.total_charges = pd.to_numeric(df.total_charges, errors="coerce")
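
To make the coercion concrete, here is a minimal sketch on a hypothetical series. It assumes the unparseable entries are whitespace-only strings; the sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample: the blank string stands in for an unparseable entry.
raw = pd.Series(["29.85", "1889.5", " ", "108.15"])

# errors="coerce" turns anything that cannot be parsed into NaN
# instead of raising a ValueError.
parsed = pd.to_numeric(raw, errors="coerce")

print(parsed.isna().sum())  # number of entries that failed to parse
print(parsed.dtype)
```

Counting the NaN values right after the conversion shows how many rows were affected before deciding how to fill them.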

In [None]:
df.total_charges.info()

In [None]:
df.isnull().sum()

In [None]:
df[df.total_charges.isnull()]

In [None]:
# Fill missing total_charges with monthly_charges * tenure_months
df.total_charges = df.total_charges.fillna(df.monthly_charges * df.tenure_months)
df.isnull().sum()
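
The fill above can be sketched on a toy frame to show that only the missing entries change. The rows are hypothetical; only the column names come from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; only the column names come from the notebook.
toy = pd.DataFrame(
    {
        "tenure_months": [0, 12, 2],
        "monthly_charges": [52.55, 20.0, 70.7],
        "total_charges": [np.nan, 240.0, np.nan],
    }
)

# fillna with a Series aligns on the index, so each missing value gets
# the estimate computed from its own row; existing values are untouched.
estimate = toy.monthly_charges * toy.tenure_months
toy["total_charges"] = toy.total_charges.fillna(estimate)

print(toy)
```

A customer with zero months of tenure gets a total of 0 under this rule, and the value already present in the second row is left untouched.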

In [None]:
df[df.churn_reason.isnull()].churn_value.value_counts()

In [None]:
df.churn_reason = df.churn_reason.fillna("No Churn")
df.isnull().sum()

## Analysis

In [23]:
df.nunique()

customerid           7043
count                   1
country                 1
state                   1
city                 1129
zip_code             1652
lat_long             1652
latitude             1652
longitude            1651
gender                  2
senior_citizen          2
partner                 2
dependents              2
tenure_months          73
phone_service           2
multiple_lines          3
internet_service        3
online_security         3
online_backup           3
device_protection       3
tech_support            3
streaming_tv            3
streaming_movies        3
contract                3
paperless_billing       2
payment_method          4
monthly_charges      1585
total_charges        6531
churn_label             2
churn_value             2
churn_score            85
cltv                 3438
churn_reason           21
dtype: int64

In [24]:
df.groupby(["country", "state"]).size()

country        state     
United States  California    7043
dtype: int64

In [25]:
# Pie chart of churn rate
churn_counts = df.churn_label.value_counts()
fig = px.pie(
    names=churn_counts.keys(),
    values=churn_counts.values,
    title="Churn Label Distribution",
)
fig.show()

**26.5%** of customers churned, so the classes are mildly imbalanced. The goal is to identify which types of customers are more likely to churn, so that future churn can be predicted and acted on before it happens.
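
One possible way to start asking which customers churn more is to compare churn rates across segments. The sketch below uses a hypothetical toy frame and assumes `churn_value` is coded as 0/1; grouping by `contract` is an illustrative choice, not a finding from this notebook.

```python
import pandas as pd

# Hypothetical sample; the column names follow the cleaned dataframe.
toy = pd.DataFrame(
    {
        "contract": [
            "Month-to-month", "Month-to-month", "One year",
            "Two year", "Month-to-month", "Two year",
        ],
        "churn_value": [1, 0, 0, 0, 1, 0],
    }
)

# Overall churn rate: the mean of a 0/1 flag is the share of churners.
overall_rate = toy.churn_value.mean()

# Churn rate per segment, highest first.
rate_by_contract = (
    toy.groupby("contract")["churn_value"].mean().sort_values(ascending=False)
)

print(f"Overall churn rate: {overall_rate:.1%}")
print(rate_by_contract)
```

The same pattern works for any categorical column, for example `payment_method` or `internet_service`.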

In [None]:
# Count churn reasons, excluding customers who did not churn
reason_counts = df.churn_reason[df.churn_reason != "No Churn"].value_counts()

fig = px.bar(
    x=reason_counts.index,
    y=reason_counts.values,
    color=reason_counts.values,
    text=reason_counts.values,
)

# Reasons are on the x-axis and counts on the y-axis
fig.update_layout(xaxis_title="Churn Reason", yaxis_title="Count")
fig.show()

## Feature Selection

## Save Processed Data

In [29]:
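# Save cleaned data (create the output folder if it does not exist yet)
processed_dir = data_dir / "processed"
processed_dir.mkdir(parents=True, exist_ok=True)
df.to_excel(processed_dir / "telco_customer_churn_cleaned.xlsx", index=False)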