# 01 – Data Overview: IBM Telco Customer Churn

This notebook performs initial data understanding for the IBM Telco Customer Churn dataset.
We will inspect the shape, data types, missing values and some key categorical distributions.


In [1]:
import pandas as pd

In [2]:
# Path is relative to the notebook's location; reading .xlsx files requires the openpyxl package.
file_path = "../data/raw/Telco_customer_churn.xlsx"
df = pd.read_excel(file_path)

df.head()

Unnamed: 0,CustomerID,Count,Country,State,City,Zip Code,Lat Long,Latitude,Longitude,Gender,...,Contract,Paperless Billing,Payment Method,Monthly Charges,Total Charges,Churn Label,Churn Value,Churn Score,CLTV,Churn Reason
0,3668-QPYBK,1,United States,California,Los Angeles,90003,"33.964131, -118.272783",33.964131,-118.272783,Male,...,Month-to-month,Yes,Mailed check,53.85,108.15,Yes,1,86,3239,Competitor made better offer
1,9237-HQITU,1,United States,California,Los Angeles,90005,"34.059281, -118.30742",34.059281,-118.30742,Female,...,Month-to-month,Yes,Electronic check,70.7,151.65,Yes,1,67,2701,Moved
2,9305-CDSKC,1,United States,California,Los Angeles,90006,"34.048013, -118.293953",34.048013,-118.293953,Female,...,Month-to-month,Yes,Electronic check,99.65,820.5,Yes,1,86,5372,Moved
3,7892-POOKP,1,United States,California,Los Angeles,90010,"34.062125, -118.315709",34.062125,-118.315709,Female,...,Month-to-month,Yes,Electronic check,104.8,3046.05,Yes,1,84,5003,Moved
4,0280-XJGEX,1,United States,California,Los Angeles,90015,"34.039224, -118.266293",34.039224,-118.266293,Male,...,Month-to-month,Yes,Bank transfer (automatic),103.7,5036.3,Yes,1,89,5340,Competitor had better devices


In [3]:
df.shape

(7043, 33)

## Dataset Size

The dataset has `7043` rows and `33` columns (exact values from `df.shape`).
Each row represents a single telecom customer.


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 33 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   CustomerID         7043 non-null   object 
 1   Count              7043 non-null   int64  
 2   Country            7043 non-null   object 
 3   State              7043 non-null   object 
 4   City               7043 non-null   object 
 5   Zip Code           7043 non-null   int64  
 6   Lat Long           7043 non-null   object 
 7   Latitude           7043 non-null   float64
 8   Longitude          7043 non-null   float64
 9   Gender             7043 non-null   object 
 10  Senior Citizen     7043 non-null   object 
 11  Partner            7043 non-null   object 
 12  Dependents         7043 non-null   object 
 13  Tenure Months      7043 non-null   int64  
 14  Phone Service      7043 non-null   object 
 15  Multiple Lines     7043 non-null   object 
 16  Internet Service   7043 

In [5]:
df.isna().sum().sort_values(ascending=False)

Churn Reason         5174
CustomerID              0
Count                   0
State                   0
Country                 0
Zip Code                0
Lat Long                0
Latitude                0
City                    0
Gender                  0
Senior Citizen          0
Partner                 0
Dependents              0
Tenure Months           0
Phone Service           0
Multiple Lines          0
Longitude               0
Internet Service        0
Online Security         0
Device Protection       0
Online Backup           0
Streaming TV            0
Streaming Movies        0
Contract                0
Tech Support            0
Paperless Billing       0
Payment Method          0
Total Charges           0
Monthly Charges         0
Churn Label             0
Churn Value             0
Churn Score             0
CLTV                    0
dtype: int64

## Structure and Missing Values

- Key columns include: `CustomerID`, `Churn Label`, `Tenure Months`, `Monthly Charges`, `Total Charges`, `CLTV`, `Churn Score`.
- `Total Charges` appears as a non-numeric (`object`) column and will need conversion to numeric in the cleaning step. Because `isna()` only counts true nulls, its count of 0 for this column does not rule out blank or otherwise non-numeric strings.
- Missing values occur only in `Churn Reason`: 5174 rows, which equals the number of customers with `Churn Label` = `No`. The reason is therefore most likely blank simply because those customers did not churn.
- All other columns, including `CustomerID`, `Churn Label`, `Tenure Months`, `Monthly Charges` and `CLTV`, have no missing values.
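
The `Total Charges` point can be illustrated with a minimal sketch. The toy frame and its blank-string entry are assumptions made for illustration; this notebook has not yet checked what actually makes the real column non-numeric.

```python
import pandas as pd

# Toy frame standing in for the real data (assumption). The blank string
# mimics a non-numeric entry that forces the column to the object dtype.
toy = pd.DataFrame({"Total Charges": [108.15, 151.65, " ", 820.5]})

# isna() does not flag the blank string, so the column looks complete.
missing_before = int(toy["Total Charges"].isna().sum())

# errors="coerce" turns anything unparseable into NaN instead of raising.
total_numeric = pd.to_numeric(toy["Total Charges"], errors="coerce")
missing_after = int(total_numeric.isna().sum())

print(toy["Total Charges"].dtype, "->", total_numeric.dtype)
print(f"missing before: {missing_before}, after: {missing_after}")
```

Comparing the missing count before and after conversion is a quick way to see how many entries were not parseable as numbers.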


In [6]:
df.describe()

Unnamed: 0,Count,Zip Code,Latitude,Longitude,Tenure Months,Monthly Charges,Churn Value,Churn Score,CLTV
count,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0
mean,1.0,93521.964646,36.282441,-119.79888,32.371149,64.761692,0.26537,58.699418,4400.295755
std,0.0,1865.794555,2.455723,2.157889,24.559481,30.090047,0.441561,21.525131,1183.057152
min,1.0,90001.0,32.555828,-124.301372,0.0,18.25,0.0,5.0,2003.0
25%,1.0,92102.0,34.030915,-121.815412,9.0,35.5,0.0,40.0,3469.0
50%,1.0,93552.0,36.391777,-119.730885,29.0,70.35,0.0,61.0,4527.0
75%,1.0,95351.0,38.224869,-118.043237,55.0,89.85,1.0,75.0,5380.5
max,1.0,96161.0,41.962127,-114.192901,72.0,118.75,1.0,100.0,6500.0


Tenure ranges from 0 to 72 months (median 29). Monthly Charges have a mean of about 64.76 and a median of 70.35, with a maximum of 118.75.

In [7]:
df['Churn Label'].value_counts()

Churn Label
No     5174
Yes    1869
Name: count, dtype: int64

In [8]:
df['Contract'].value_counts()

Contract
Month-to-month    3875
Two year          1695
One year          1473
Name: count, dtype: int64

In [9]:
# Churn rate calculation
churn_counts = df['Churn Label'].value_counts()
churn_rate = churn_counts['Yes'] / churn_counts.sum()

print("Churn counts:")
print(churn_counts)
print()
print(f"Overall churn rate: {churn_rate:.3%}")

# Contract distribution (counts and proportions)
contract_counts = df['Contract'].value_counts()
contract_props = df['Contract'].value_counts(normalize=True)

print("\nContract counts:")
print(contract_counts)
print("\nContract proportions:")
print((contract_props * 100).round(1).astype(str) + " %")


Churn counts:
Churn Label
No     5174
Yes    1869
Name: count, dtype: int64

Overall churn rate: 26.537%

Contract counts:
Contract
Month-to-month    3875
Two year          1695
One year          1473
Name: count, dtype: int64

Contract proportions:
Contract
Month-to-month    55.0 %
Two year          24.1 %
One year          20.9 %
Name: proportion, dtype: object


## Churn Rate and Contract Mix

### Churn Rate
The overall churn rate in the dataset is **26.5%** (1869 of 7043 customers).  
This means roughly **1 out of 4** customers has churned.

This indicates that:
- The dataset is **moderately imbalanced**: “No” labels outnumber “Yes” labels by about 2.8 to 1.
- Churn is a significant problem but not extremely rare, which is common in telecom churn data.
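
As a rough worked example of what this imbalance implies, the sketch below uses the two label counts printed earlier. The "always predict No" baseline is an illustrative yardstick added here, not something the notebook computes.

```python
# Counts taken from the Churn Label value_counts() output.
n_no, n_yes = 5174, 1869
total = n_no + n_yes

churn_rate = n_yes / total          # share of churned customers
baseline_accuracy = n_no / total    # accuracy of always predicting "No"
imbalance_ratio = n_no / n_yes      # non-churners per churner

print(f"churn rate:        {churn_rate:.1%}")
print(f"baseline accuracy: {baseline_accuracy:.1%}")
print(f"No : Yes ratio:    {imbalance_ratio:.2f} : 1")
```

Any later churn model would need to beat that baseline accuracy to be useful, which is why accuracy alone can mislead on imbalanced labels.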

### Contract Distribution
From the contract proportions:

- **55.0%** of customers are on **Month-to-month** contracts  
- **24.1%** are on **Two year** contracts  
- **20.9%** are on **One year** contracts  

Interpretation:
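- Month-to-month is the **largest group**. Customers without a long-term commitment can leave most easily, so this group is expected to show the highest churn; the counts above do not yet confirm this.
- One-year and two-year customers represent **more stable, long-term relationships** and are expected to churn less, which a churn-by-contract breakdown should verify.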

These insights help guide later segmentation and retention strategy in Tableau.
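
The contract interpretation could be checked with a cross-tabulation of contract type against churn label. The sketch below runs on a small hypothetical frame (`toy`), so its numbers are not results from the dataset; on the real data the same call would take `df["Contract"]` and `df["Churn Label"]`.

```python
import pandas as pd

# Hypothetical toy rows; in the notebook this would be the loaded `df`.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "Month-to-month",
                 "One year", "One year", "Two year"],
    "Churn Label": ["Yes", "Yes", "No", "No", "Yes", "No"],
})

# normalize="index" makes each contract row sum to 1, so the "Yes" column
# is the churn rate within that contract type.
churn_by_contract = pd.crosstab(
    toy["Contract"], toy["Churn Label"], normalize="index"
)
print((churn_by_contract * 100).round(1))
```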


In [10]:
df['Internet Service'].value_counts()

Internet Service
Fiber optic    3096
DSL            2421
No             1526
Name: count, dtype: int64

In [11]:
df['Payment Method'].value_counts()

Payment Method
Electronic check             2365
Mailed check                 1612
Bank transfer (automatic)    1544
Credit card (automatic)      1522
Name: count, dtype: int64

In [12]:
df['Churn Reason'].value_counts().head(10)

Churn Reason
Attitude of support person                   192
Competitor offered higher download speeds    189
Competitor offered more data                 162
Don't know                                   154
Competitor made better offer                 140
Attitude of service provider                 135
Competitor had better devices                130
Network reliability                          103
Product dissatisfaction                      102
Price too high                                98
Name: count, dtype: int64

In [13]:
df['CustomerID'].is_unique

True

## First Impressions on Customer Segments

In addition to churn behavior, several categorical variables show useful segmentation patterns:

- **Internet Service**: customers are split across DSL, Fiber optic, and No-internet groups. These categories will likely differ in churn behavior due to service quality and pricing differences.
- **Payment Method**: Electronic check appears frequently and is often associated with higher churn in telecom datasets. Other methods (credit card, bank transfer, mailed check) may show more stable customers.
- **Tech Support and Online Security** (if explored later): these features often correlate with churn, as customers without support bundles may perceive lower value.

These observations will guide deeper churn-driver analysis in Tableau.
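
Before moving to Tableau, these segment hypotheses could be checked with a small reusable helper. `churn_rate_by` is a hypothetical function name, and the toy frame below is invented for illustration, so its output says nothing about the real dataset. The helper relies on the 0/1 `Churn Value` column, whose mean within a group is that group's churn rate.

```python
import pandas as pd

def churn_rate_by(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Customer count and churn rate per category of `column` (hypothetical helper)."""
    return (
        frame.groupby(column)["Churn Value"]
        .agg(customers="size", churn_rate="mean")
        .sort_values("churn_rate", ascending=False)
    )

# Invented rows for illustration only; replace `toy` with `df` in the notebook.
toy = pd.DataFrame({
    "Internet Service": ["Fiber optic", "Fiber optic", "DSL", "DSL", "No"],
    "Payment Method": ["Electronic check", "Electronic check", "Mailed check",
                       "Credit card (automatic)", "Mailed check"],
    "Churn Value": [1, 1, 0, 1, 0],
})

for col in ["Internet Service", "Payment Method"]:
    print(churn_rate_by(toy, col), end="\n\n")
```

Reporting the group size next to the rate helps avoid over-reading a high churn rate in a very small segment.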


## Next Steps

Based on this initial data understanding, the following actions will be performed in the next notebook:

1. **Data Cleaning**
   - Convert `Total Charges` from `object` to numeric.
   - Handle missing values in `Churn Reason`.
   - Ensure numeric columns (`Monthly Charges`, `CLTV`, etc.) are typed correctly.

2. **Feature Engineering**
   - Create `ChurnFlag` (0/1).
   - Build tenure cohorts (0–6, 7–12, etc.).
   - Create CLTV segments (Low / Medium / High).
   - Build churn-risk buckets based on `Churn Score`.

3. **Export Analytic Dataset**
   - Save cleaned and engineered dataset to `data/processed/telco_churn_clean.csv` for Tableau.

This prepares the foundation for building churn dashboards and retention insights.
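
The planned steps could look roughly like the sketch below. Several details are assumptions rather than decisions made in this notebook: the function name `engineer`, the new column names other than `ChurnFlag`, the cohort edges beyond 0–6 and 7–12, the tercile split for CLTV, the 40/70 thresholds for `Churn Score`, and the `"Not churned"` placeholder. The toy rows are invented so the block runs on its own.

```python
import pandas as pd

def engineer(frame: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning and feature steps to a copy of `frame` (sketch)."""
    out = frame.copy()
    # Cleaning: unparseable charges become NaN; missing reasons get a placeholder.
    out["Total Charges"] = pd.to_numeric(out["Total Charges"], errors="coerce")
    out["Churn Reason"] = out["Churn Reason"].fillna("Not churned")
    # Binary target derived from the text label.
    out["ChurnFlag"] = (out["Churn Label"] == "Yes").astype(int)
    # Tenure cohorts: right-closed bins, with 0 included in the first one.
    out["TenureCohort"] = pd.cut(
        out["Tenure Months"], bins=[0, 6, 12, 24, 48, 72],
        labels=["0-6", "7-12", "13-24", "25-48", "49-72"], include_lowest=True,
    )
    # CLTV segments: three equal-sized groups (assumed tercile split).
    out["CLTVSegment"] = pd.qcut(out["CLTV"], q=3, labels=["Low", "Medium", "High"])
    # Risk buckets: assumed thresholds at 40 and 70.
    out["RiskBucket"] = pd.cut(
        out["Churn Score"], bins=[0, 40, 70, 100],
        labels=["Low risk", "Medium risk", "High risk"], include_lowest=True,
    )
    return out

# Invented rows for illustration only.
toy = pd.DataFrame({
    "Churn Label": ["Yes", "No", "No", "Yes", "No", "No"],
    "Tenure Months": [0, 5, 9, 20, 40, 72],
    "Total Charges": [" ", 100.5, 700.0, 1500.0, 3000.0, 6000.0],
    "CLTV": [2003, 3000, 4000, 4500, 5400, 6500],
    "Churn Score": [86, 20, 45, 90, 60, 5],
    "Churn Reason": ["Moved", None, None, "Price too high", None, None],
})

engineered = engineer(toy)
print(engineered[["ChurnFlag", "TenureCohort", "CLTVSegment", "RiskBucket"]])
# Export step, left commented out so the sketch has no side effects:
# engineered.to_csv("data/processed/telco_churn_clean.csv", index=False)
```

Keeping the steps in one function makes it easy to rerun them unchanged if the raw file is refreshed.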
