# Customer Retention Analysis: Telco Churn Dataset

### Understanding customer behavior and identifying churn drivers for a telecommunications company

---

## Notebook 02: Data Preparation
This notebook focuses on preparing the Telco Churn dataset for analysis. Key steps include handling missing values, standardizing feature names, and engineering meaningful variables.

---

## Table of Contents
- [2.0 Data Preparation](#20-data-preparation)
- [2.1 Handling Missing Values](#21-handling-missing-values)
- [2.2 Column Name Normalization](#22-column-name-normalization)
- [2.3 Exploring Data (Feature Balance & Summaries)](#23-exploring-data-feature-balance--summaries)
    - [2.3.1 SQL Exploration](#231-sql-exploration)
- [2.4 Discretizing Variables](#24-discretizing-variables)
    - [2.4.1 Handling the Tenure Feature](#241-handling-the-tenure-feature)
    - [2.4.2 Handling the Monthly Charges Feature](#242-handling-the-monthly-charges-feature)
    - [2.4.3 Handling the Total Charges Feature](#243-handling-the-totalcharges-feature)
- [2.5 Additional Feature Engineering](#25-additional-feature-engineering)
    - [2.5.1 Customer Tenure Features](#251-customer-tenure-features)
    - [2.5.2 Revenue Features](#252-revenue-features)
    - [2.5.3 Service Combination Features](#253-service-combination-features)
    - [2.5.4 Contract & Payment Features](#254-contract--payment-features)
    - [2.5.5 Interaction Features](#255-interaction-features)
- [2.6 Summary](#26-summary)

## 2.0 Data Preparation <a class="anchor" id="20-data-preparation"></a>

Prior to exploratory data analysis and modeling, the data will have to be cleaned and processed to remove missing values, homogenize naming conventions, and ensure type accuracy. 

#### Import Libraries

In [1]:
# Setup project root path
from setup_paths import add_project_root
add_project_root()

In [2]:
# Import libraries
import pandas as pd
import numpy as np
import sqlite3

#### Load Data
Load the raw dataset for cleaning.

In [3]:
# Load the data
df = pd.read_csv('../data/raw/WA_Fn-UseC_-Telco-Customer-Churn.csv')

## 2.1 Handling Missing Values <a class="anchor" id="21-handling-missing-values"></a>

Before performing any modeling or deeper analysis, it's important to ensure the dataset does not contain missing or invalid values. We've already identified missing `TotalCharges` values, but we should confirm that no other columns contain missing data.


In [4]:
# Force type change to numeric; convert empty strings to NaN values
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

# Replace NaN values with 0
df['TotalCharges'] = df['TotalCharges'].fillna(0)

In [5]:
# Check missing values by percentage
missing = df.isna().mean().sort_values(ascending=False)
missing[missing > 0]

Series([], dtype: float64)

**Observation:** The only missing values were in `TotalCharges`, which correspond to new customers (tenure = 0). These have been set to 0 to reflect that no charges have been billed yet.
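
The link between blank `TotalCharges` values and brand-new customers can be checked before filling. The sketch below uses a toy frame (its three rows are made up for illustration, not taken from the dataset) and assumes blanks arrive as whitespace strings:

```python
import pandas as pd

# Toy rows standing in for the raw CSV; a blank string marks an unbilled customer
toy = pd.DataFrame({
    'tenure': [0, 5, 0],
    'TotalCharges': [' ', '350.50', ' '],
})

# Coerce to numeric so blanks become NaN, as in the cell above
toy['TotalCharges'] = pd.to_numeric(toy['TotalCharges'], errors='coerce')

# Every missing value should sit on a tenure == 0 row before filling with 0
missing_rows = toy[toy['TotalCharges'].isna()]
all_new = bool((missing_rows['tenure'] == 0).all())

toy['TotalCharges'] = toy['TotalCharges'].fillna(0)
```

If `all_new` were ever `False`, filling with 0 would hide genuinely missing billing data, so the check is worth running on the real frame first.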

## 2.2 Column Name Normalization <a class="anchor" id="22-column-name-normalization"></a>

The original dataset uses inconsistent casing in feature names
(e.g., `customerID`, `SeniorCitizen`, `gender`).
To ensure readability and consistency across SQL and Python,
we will convert all feature names to lowercase snake_case format.

This improves maintainability without altering data semantics.

In [6]:
# Transform column headers to lowercase snake_case
df.columns = (
    df.columns
      .str.strip() # Remove leading/trailing spaces
      .str.replace(' ', '_') # Replace spaces with underscores
      .str.replace(r'(?<!^)(?=[A-Z][a-z])', '_', regex=True) # Underscore before a capital followed by a lowercase letter (SeniorCitizen)
      .str.replace(r'(?<!^)(?=[A-Z][A-Z])', '_', regex=True) # Underscore before a pair of capitals (customerID, StreamingTV)
      .str.lower() # Lowercase the result
)

# Inspect transformed column headers
df.columns

Index(['customer_id', 'gender', 'senior_citizen', 'partner', 'dependents',
       'tenure', 'phone_service', 'multiple_lines', 'internet_service',
       'online_security', 'online_backup', 'device_protection', 'tech_support',
       'streaming_tv', 'streaming_movies', 'contract', 'paperless_billing',
       'payment_method', 'monthly_charges', 'total_charges', 'churn'],
      dtype='object')
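
To see what each regular expression contributes, the same two patterns can be applied to individual names with the standard `re` module. `to_snake_case` is a hypothetical helper written for this illustration, not part of the project:

```python
import re

def to_snake_case(name: str) -> str:
    """Hypothetical helper mirroring the chained str.replace calls above."""
    name = name.strip().replace(' ', '_')
    # Underscore before a capital that starts a new word, e.g. SeniorCitizen
    name = re.sub(r'(?<!^)(?=[A-Z][a-z])', '_', name)
    # Underscore before a pair of capitals, e.g. customerID, StreamingTV
    name = re.sub(r'(?<!^)(?=[A-Z][A-Z])', '_', name)
    return name.lower()

converted = {
    name: to_snake_case(name)
    for name in ['customerID', 'SeniorCitizen', 'StreamingTV', 'gender']
}
```

The patterns fit this dataset's headers; names with runs of three or more capitals would likely need a different rule.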

## 2.3 Exploring Data (Feature Balance & Summaries) <a class="anchor" id="23-exploring-data-feature-balance--summaries"></a>
Binary features, and the target label in particular, are easiest to model when their classes are roughly balanced. When one class dominates, a model can reach high accuracy simply by favoring the majority class, so headline metrics may overstate its real predictive ability.

In [7]:
# Check churn feature balance
churn_counts = df['churn'].value_counts(normalize=True) * 100

# Check gender feature balance
gender_counts = df['gender'].value_counts(normalize=True) * 100

# Display feature balance
print(f"Churn distribution:\n\tNo: {churn_counts['No']:.1f}%\n\tYes: {churn_counts['Yes']:.1f}%\n")
print(f"Gender distribution:\n\tMale: {gender_counts['Male']:.1f}%\n\tFemale: {gender_counts['Female']:.1f}%")

Churn distribution:
	No: 73.5%
	Yes: 26.5%

Gender distribution:
	Male: 50.5%
	Female: 49.5%


**Observation:** The gender feature is fairly evenly split, with approximately 50.5% Male and 49.5% Female customers. The churn label is imbalanced, with only 26.5% churned customers, and this imbalance will have to be accounted for in future modeling and model evaluation.
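
A quick way to see why the imbalance matters is to score a "model" that always predicts the majority class. The sketch below uses a made-up sample of 1,000 labels that mirrors the 73.5% / 26.5% split; it is an illustration, not the dataset itself:

```python
from collections import Counter

# Illustrative labels mirroring the observed churn split
labels = ['No'] * 735 + ['Yes'] * 265

counts = Counter(labels)
majority_label, majority_count = counts.most_common(1)[0]

# Accuracy of always predicting the majority class
baseline_accuracy = majority_count / len(labels)

# Share of actual churners that such a model would catch
churn_recall = 0.0 if majority_label == 'No' else 1.0
```

Any churn model therefore needs to beat roughly 73.5% accuracy to be useful, and metrics such as recall or precision on the `Yes` class are likely to be more informative than accuracy alone.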

### 2.3.1 SQL Exploration <a class="anchor" id="231-sql-exploration"></a>
To demonstrate SQL querying, we’ll use SQLite to explore customer churn distribution and summarize key categorical features. SQL is particularly useful for quick, repeatable aggregations and filters.

In [8]:
# Connect to SQLite (creates a file if it doesn't exist)
conn = sqlite3.connect('../db/telco_churn.db')

# Write the DataFrame to SQLite
df.to_sql('telco_churn', conn, if_exists='replace', index=False)

7043

In [9]:
# Check data schema
schema = pd.read_sql_query('PRAGMA table_info(telco_churn);', conn)
schema

Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,customer_id,TEXT,0,,0
1,1,gender,TEXT,0,,0
2,2,senior_citizen,INTEGER,0,,0
3,3,partner,TEXT,0,,0
4,4,dependents,TEXT,0,,0
5,5,tenure,INTEGER,0,,0
6,6,phone_service,TEXT,0,,0
7,7,multiple_lines,TEXT,0,,0
8,8,internet_service,TEXT,0,,0
9,9,online_security,TEXT,0,,0


In [10]:
# Display summary table with missing values count.
# Counts are recomputed from the renamed columns and mapped by column name,
# so each count stays aligned with its schema row.
summary = schema[['name', 'type']].rename(columns={'name': 'column'})
summary['missing_count'] = summary['column'].map(df.isna().sum())
summary

Unnamed: 0,column,type,missing_count
0,customer_id,TEXT,0
1,gender,TEXT,0
2,senior_citizen,INTEGER,0
3,partner,TEXT,0
4,dependents,TEXT,0
5,tenure,INTEGER,0
6,phone_service,TEXT,0
7,multiple_lines,TEXT,0
8,internet_service,TEXT,0
9,online_security,TEXT,0


In [11]:
# Inspect contract distribution
contract_query = 'SELECT contract, COUNT(*) FROM telco_churn GROUP BY contract;'
pd.read_sql(contract_query, conn)

Unnamed: 0,contract,COUNT(*)
0,Month-to-month,3875
1,One year,1473
2,Two year,1695


In [12]:
# Examine label column balance by contract and internet service
balance_query = """
    SELECT 
        contract,
        internet_service,
        COUNT(*) as total_customers,
        SUM(CASE WHEN churn = 'Yes' THEN 1 ELSE 0 END) as churned_customers,
        ROUND(AVG(CASE WHEN churn = 'Yes' THEN 1.0 ELSE 0 END) * 100, 2) AS churn_rate
    FROM telco_churn
    GROUP BY contract, internet_service
    ORDER BY churn_rate DESC;
"""
pd.read_sql(balance_query, conn)

Unnamed: 0,contract,internet_service,total_customers,churned_customers,churn_rate
0,Month-to-month,Fiber optic,2128,1162,54.61
1,Month-to-month,DSL,1223,394,32.22
2,One year,Fiber optic,539,104,19.29
3,Month-to-month,No,524,99,18.89
4,One year,DSL,570,53,9.3
5,Two year,Fiber optic,429,31,7.23
6,One year,No,364,9,2.47
7,Two year,DSL,628,12,1.91
8,Two year,No,638,5,0.78


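**Observations:** Churn is highest among month-to-month customers with internet service: 54.61% for fiber optic and 32.22% for DSL. It falls sharply with contract length, staying below 8% in every two-year segment.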
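
The same aggregation can be filtered to a single segment with a parameterized query. The sketch below runs against an in-memory database with four made-up rows, so the table name `demo_churn` and its contents are assumptions for illustration; `contextlib.closing` makes sure the connection is closed afterwards:

```python
import sqlite3
from contextlib import closing

# Made-up rows: (contract, churn)
rows = [
    ('Month-to-month', 'Yes'),
    ('Month-to-month', 'No'),
    ('Two year', 'No'),
    ('Two year', 'No'),
]

query = """
    SELECT contract,
           ROUND(AVG(CASE WHEN churn = 'Yes' THEN 1.0 ELSE 0 END) * 100, 2) AS churn_rate
    FROM demo_churn
    WHERE contract = ?
    GROUP BY contract;
"""

with closing(sqlite3.connect(':memory:')) as demo_conn:
    demo_conn.execute('CREATE TABLE demo_churn (contract TEXT, churn TEXT)')
    demo_conn.executemany('INSERT INTO demo_churn VALUES (?, ?)', rows)
    # The ? placeholder keeps the filter value out of the SQL string
    result = demo_conn.execute(query, ('Month-to-month',)).fetchall()
```

Passing the filter as a parameter avoids string formatting in SQL, which matters once queries are reused with user-supplied values.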

## 2.4 Discretizing Variables <a class="anchor" id="24-discretizing-variables"></a>
The `tenure`, `monthly_charges`, and `total_charges` columns are numeric and can be grouped into bins. Binning makes visualization and interpretation easier, but it discards detail and can introduce bias if the bin edges are poorly chosen. To limit that risk, the column labels are split into EDA-only features (for business insights and exploratory data analysis) and model features by the `get_eda_only_features()` and `get_model_features()` functions in the `src` folder.

### 2.4.1 Handling the Tenure Feature <a class="anchor" id="241-handling-the-tenure-feature"></a>

In [13]:
# Display where each year lands in the dataset's percentiles
for months in range(12, df['tenure'].max() + 1, 12):
    print(f"{months} Month Percentile: {100 * (df['tenure'] <= months).mean():.0f}")

12 Month Percentile: 31
24 Month Percentile: 46
36 Month Percentile: 57
48 Month Percentile: 68
60 Month Percentile: 80
72 Month Percentile: 100


We can create three roughly equal-sized bins: 0-12 months (~31%), 13-48 months (~37%), and 49+ months (~32%). Equal-frequency bins like these are better for statistical comparison and help maintain balance in modeling, but they are harder to interpret in business terms.

We can also create equal-width bins that each span 12 months: 0-12, 13-24, 25-36, 37-48, 49-60, and 61-72. Consistent time periods are intuitive and helpful for trend visualizations, but the unequal sample sizes and small bins (25-36 and 37-48 each hold only ~11% of customers) may hinder statistical modeling.


Alternatively, we could try to mix the two to maintain interpretability and bin-size balance:
- **New Customers (0-12 months):** The second-largest bin, with roughly 31% of customers having used Telco's service for 1 year or less.

- **Established Customers (13-24 months):** Customers in a transition period between new and loyal customers; approximately 15% of customers fall in this bin, making it the smallest.

- **Mid-term Customers (25-48 months):** Customers who have been with the company for a reasonably long time, likely to be more stable; approximately 22% of customers fall into this bin.

- **Long-term Customers (49+ months):** Loyal customers who have been using Telco's service for over 4 years; the largest bin, with ~32% of customers.
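
The difference between equal-frequency and equal-width binning can be seen on a small, right-skewed sample. The twelve tenure values below are made up for illustration:

```python
import pandas as pd

# Made-up, right-skewed tenure values in months (not drawn from the dataset)
tenure = pd.Series([1, 1, 2, 3, 5, 8, 12, 20, 35, 50, 65, 72])

# Equal-frequency: each bin holds about the same number of customers
equal_freq = pd.qcut(tenure, q=3)

# Equal-width: each bin spans the same number of months
equal_width = pd.cut(tenure, bins=3)

# Customers per bin, ordered from the lowest bin to the highest
freq_counts = equal_freq.value_counts().sort_index().tolist()
width_counts = equal_width.value_counts().sort_index().tolist()
```

On this sample the equal-frequency bins hold 4 customers each, while the equal-width bins hold 8, 1, and 3. The mixed scheme described above is a compromise between the two.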

In [14]:
bins = [0, 12, 24, 48, 72] # Create bins for each range of months
labels = ['0-12m', '13-24m', '25-48m', '49-72m']
# Create tenure_group column with tenure sorted into corresponding bins
df['tenure_group'] = pd.cut(df['tenure'], bins=bins, labels=labels, right=True, include_lowest=True)

### 2.4.2 Handling the Monthly Charges Feature <a class="anchor" id="242-handling-the-monthly-charges-feature"></a>

In [15]:
# Test potential monthly_charges bins
# Print how much of the dataset is contained by each $10 increment
for charges in range(10, int(df['monthly_charges'].max()) + 5, 10):
    print(f"${charges} Monthly Charge Percentile: {100 * (df['monthly_charges'] <= charges).mean():.0f}")

$10 Monthly Charge Percentile: 0
$20 Monthly Charge Percentile: 9
$30 Monthly Charge Percentile: 23
$40 Monthly Charge Percentile: 26
$50 Monthly Charge Percentile: 33
$60 Monthly Charge Percentile: 41
$70 Monthly Charge Percentile: 49
$80 Monthly Charge Percentile: 62
$90 Monthly Charge Percentile: 75
$100 Monthly Charge Percentile: 87
$110 Monthly Charge Percentile: 97
$120 Monthly Charge Percentile: 100


`monthly_charges` values are fairly evenly distributed with a few outliers. They can be split into 4 brackets, with the first starting at $10 since the lowest value in the dataset is $18.25: $10-$40 (~26% of customers), $40-$70 (~23%), $70-$100 (~38%), and $100-$120 (~13%). Changes in future pricing will require adjusting these edges to better discretize the data.
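
The share of customers in each bracket follows from the cumulative percentages printed above: subtract the cumulative value at the previous edge. A short worked version, using the rounded figures from that output:

```python
# Cumulative share of customers at or below each bin edge (from the output above)
cumulative_pct = {40: 26, 70: 49, 100: 87, 120: 100}

bin_share = {}
previous = 0
for upper_edge, pct in cumulative_pct.items():
    # Share inside this bracket = cumulative share minus everything below it
    bin_share[upper_edge] = pct - previous
    previous = pct
```

Because the inputs are rounded to whole percentages, the resulting shares are approximate.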

In [16]:
bins = [10, 40, 70, 100, 120] # Create bins for each monthly spending bracket
labels = ['Low(<40$)' ,'Medium ($40-$70)', 'High ($70-$100)', 'Very High (>$100)']

# Create monthly_charge_group feature with charges sorted into distinct bins
df['monthly_charge_group'] = pd.cut(df['monthly_charges'], bins=bins, labels=labels)

# Inspect new feature
df[['monthly_charge_group', 'monthly_charges']].head()

Unnamed: 0,monthly_charge_group,monthly_charges
0,Low(<40$),29.85
1,Medium ($40-$70),56.95
2,Medium ($40-$70),53.85
3,Medium ($40-$70),42.3
4,High ($70-$100),70.7


### 2.4.3 Handling the Total Charges Feature <a class="anchor" id="243-handling-the-totalcharges-feature"></a>

`total_charges` represents the cumulative revenue per customer and is closely correlated with both `tenure` and `monthly_charges`.  
We retain it as a continuous variable for quantitative analysis but also create percentile-based value tiers to visualize churn across customer segments. The tiers form 6 groups with edges at the 10th, 25th, 50th, 75th, and 90th percentiles, so the bottom and top 10% of customers each get their own tier. These groups are suited to business analysis.

In [17]:
percentiles = [0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0] # Create bins for each percentile
bins = df['total_charges'].quantile(percentiles).values
labels = ['P0-10', 'P10-25', 'P25-50', 'P50-75', 'P75-90', 'P90-100']

# Create total_charge_group feature with total charges sorted by percentile
df['total_charge_group'] = pd.cut(df['total_charges'], bins=bins, labels=labels, include_lowest=True)

# Inspect new feature
df[['total_charge_group', 'total_charges']].head()

Unnamed: 0,total_charge_group,total_charges
0,P0-10,29.85
1,P50-75,1889.5
2,P10-25,108.15
3,P50-75,1840.75
4,P10-25,151.65
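
The quantile-then-cut steps above can also be written as a single `pd.qcut` call that takes the percentile list directly. The sketch below compares the two on stand-in values 1 to 100 (an assumption for illustration; the real column is `df['total_charges']`):

```python
import pandas as pd

# Stand-in values 1..100 in place of the real total_charges column
charges = pd.Series(range(1, 101), dtype='float64')

percentiles = [0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0]
labels = ['P0-10', 'P10-25', 'P25-50', 'P50-75', 'P75-90', 'P90-100']

# Two-step version: compute quantile edges, then cut on them
via_cut = pd.cut(charges, bins=charges.quantile(percentiles).values,
                 labels=labels, include_lowest=True)

# One-step version: qcut accepts the quantile list directly
via_qcut = pd.qcut(charges, q=percentiles, labels=labels)

same_tiers = bool((via_cut.astype(str) == via_qcut.astype(str)).all())
tier_sizes = via_qcut.value_counts().sort_index().tolist()
```

Both approaches fail if two quantile edges coincide, which can happen when many customers share the same value, so the edges are worth inspecting on the real data.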


## 2.5 Additional Feature Engineering <a class="anchor" id="25-additional-feature-engineering"></a>

### 2.5.1 Customer Tenure Features <a class="anchor" id="251-customer-tenure-features"></a>

We add three tenure features: `tenure_years`, which expresses tenure in years for readability, and two binary flags marking new customers (12 months or less with Telco) and long-term customers (48 months or more). Note that a customer at exactly 48 months is flagged as long-term here but falls in the `25-48m` bin of `tenure_group`. How long a customer has been with the company is often strongly correlated with churn, so encoding these features should aid future models.

In [18]:
# Create tenure in years measure
df['tenure_years'] = df['tenure']/12

# Add new customer column (tenure of 12 months or less)
df['new_customer'] = (df['tenure'] <= 12).astype(int)

# Add long-term customer column (tenure of 48 months or more)
df['long_term_customer'] = (df['tenure'] >= 48).astype(int)

# Inspect newly created columns
df[['tenure', 'tenure_years', 'new_customer', 'long_term_customer']].tail()

Unnamed: 0,tenure,tenure_years,new_customer,long_term_customer
7038,24,2.0,0,0
7039,72,6.0,0,1
7040,11,0.916667,1,0
7041,4,0.333333,1,0
7042,66,5.5,0,1
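
The flag thresholds and the `tenure_group` bin edges can be compared at the boundary months. The four tenure values below are chosen for illustration:

```python
import pandas as pd

# Boundary months chosen for illustration
tenure = pd.Series([12, 13, 48, 49])

# Same bin edges and labels as tenure_group in section 2.4.1
tenure_group = pd.cut(tenure, bins=[0, 12, 24, 48, 72],
                      labels=['0-12m', '13-24m', '25-48m', '49-72m'],
                      right=True, include_lowest=True)

# Same threshold as the long_term_customer flag
long_term_customer = (tenure >= 48).astype(int)

check = pd.DataFrame({
    'tenure': tenure,
    'tenure_group': tenure_group.astype(str),
    'long_term_customer': long_term_customer,
})
```

At 48 months the customer sits in the `25-48m` group while carrying the long-term flag. If the two should agree, one option (an assumption, not something the notebook specifies) would be to use `tenure > 48` for the flag.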


### 2.5.2 Revenue Features <a class="anchor" id="252-revenue-features"></a>

We can better understand the customer revenue relationship through average monthly charges and revenue per year. This can help us quantify the revenues generated by each customer and appropriate investment levels to help prevent churn. 

In [19]:
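# Calculate average monthly charges for each customer.
# Customers with tenure == 0 would give 0/0 = NaN, so their ratios are set to 0,
# mirroring the total_charges fill above (an assumption about the intended handling).
tenure_months = df['tenure'].replace(0, np.nan)
df['avg_monthly_charges'] = (df['total_charges'] / tenure_months).fillna(0)

# Calculate revenue per year by customer
df['revenue_per_year'] = (df['total_charges'] / (tenure_months / 12)).fillna(0)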
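
Rows with `tenure == 0` deserve care in these ratios, because dividing 0 by 0 yields `NaN` in pandas and would reintroduce missing values. A minimal sketch on three made-up rows; the choice to fall back to 0 is an assumption about the desired behavior:

```python
import numpy as np
import pandas as pd

# Made-up rows: one brand-new customer and two established ones
toy = pd.DataFrame({'tenure': [0, 2, 34],
                    'total_charges': [0.0, 108.15, 1889.50]})

# Plain division: the tenure == 0 row becomes NaN
naive = toy['total_charges'] / toy['tenure']

# One possible guard: mask zero tenure, divide, then fall back to 0
guarded = (toy['total_charges'] / toy['tenure'].replace(0, np.nan)).fillna(0)
```

Another reasonable fallback would be the customer's `monthly_charges`; which one fits best depends on how the feature is used downstream.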

### 2.5.3 Service Combination Features <a class="anchor" id="253-service-combination-features"></a>

Customers who use more services from a telecommunications provider like Telco may be "stickier" and less likely to churn. Having a count of the number of services each customer is using will help to identify possible correlations.

In [20]:
# Create summary has_streaming feature ('Yes' if the customer streams TV or movies)
streams_any = (df['streaming_tv'] == 'Yes') | (df['streaming_movies'] == 'Yes')
df['has_streaming'] = np.where(streams_any, 'Yes', 'No')

# Create summary has_internet feature ('Yes' for any internet service type)
df['has_internet'] = np.where(df['internet_service'] != 'No', 'Yes', 'No')

# Check new features
df[['internet_service', 'streaming_tv', 'streaming_movies', 'has_streaming', 'has_internet']].tail(10)

Unnamed: 0,internet_service,streaming_tv,streaming_movies,has_streaming,has_internet
7033,Fiber optic,No,No,No,Yes
7034,Fiber optic,Yes,No,Yes,Yes
7035,Fiber optic,Yes,No,Yes,Yes
7036,DSL,Yes,Yes,Yes,Yes
7037,No,No internet service,No internet service,No,No
7038,DSL,Yes,Yes,Yes,Yes
7039,Fiber optic,Yes,Yes,Yes,Yes
7040,DSL,No,No,No,Yes
7041,Fiber optic,No,No,No,Yes
7042,Fiber optic,Yes,Yes,Yes,Yes


In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 31 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   customer_id           7043 non-null   object  
 1   gender                7043 non-null   object  
 2   senior_citizen        7043 non-null   int64   
 3   partner               7043 non-null   object  
 4   dependents            7043 non-null   object  
 5   tenure                7043 non-null   int64   
 6   phone_service         7043 non-null   object  
 7   multiple_lines        7043 non-null   object  
 8   internet_service      7043 non-null   object  
 9   online_security       7043 non-null   object  
 10  online_backup         7043 non-null   object  
 11  device_protection     7043 non-null   object  
 12  tech_support          7043 non-null   object  
 13  streaming_tv          7043 non-null   object  
 14  streaming_movies      7043 non-null   object  
 15  cont

In [22]:
# Count number of services used ('Yes' values across the service columns)
service_cols = [
    'phone_service',
    'has_internet',
    'online_security',
    'online_backup',
    'device_protection',
    'tech_support',
    'streaming_tv',
    'streaming_movies'
]
df['num_services'] = df[service_cols].eq('Yes').sum(axis=1)

# Check feature values
df[['num_services', 'phone_service', 'has_internet', 'online_security', 'device_protection', 'tech_support', 'streaming_tv', 'streaming_movies']].tail(10)

Unnamed: 0,num_services,phone_service,has_internet,online_security,device_protection,tech_support,streaming_tv,streaming_movies
7033,2,Yes,Yes,No,No,No,No,No
7034,6,Yes,Yes,Yes,Yes,No,Yes,No
7035,3,Yes,Yes,No,No,No,Yes,No
7036,6,No,Yes,No,Yes,Yes,Yes,Yes
7037,1,Yes,No,No internet service,No internet service,No internet service,No internet service,No internet service
7038,7,Yes,Yes,Yes,Yes,Yes,Yes,Yes
7039,6,Yes,Yes,No,Yes,No,Yes,Yes
7040,2,No,Yes,Yes,No,No,No,No
7041,2,Yes,Yes,No,No,No,No,No
7042,7,Yes,Yes,Yes,Yes,Yes,Yes,Yes


Classifying customers by whether they have internet service or streaming will help in understanding which services are more closely correlated with churn. This will also simplify internet-related feature values like `online_security` or `tech_support`.
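
Once `has_internet` exists, the three-level add-on columns could be collapsed to plain Yes/No values. The sketch below shows one way to do that on three made-up rows; whether to apply it to the real columns is a modeling choice the notebook leaves open:

```python
import pandas as pd

# Made-up rows covering the three possible values
toy = pd.DataFrame({
    'online_security': ['Yes', 'No', 'No internet service'],
    'tech_support': ['No', 'Yes', 'No internet service'],
})

# has_internet already records whether the customer has internet at all,
# so 'No internet service' can be folded into 'No'
simplified = toy.replace('No internet service', 'No')

# Count 'Yes' answers per row
num_yes = simplified.eq('Yes').sum(axis=1)
```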

### 2.5.4 Contract & Payment Features <a class="anchor" id="254-contract--payment-features"></a>

Short-term contracts and manual payment methods may be related to higher churn rates, as they are easier to cancel and signal less commitment. Pairing payment and contract features by level of commitment or time to set up may reveal new insights about churn rates. An automatic-payment indicator is another feature that may correlate with churn.

In [23]:
# Flag month-to-month customers who pay by any method other than mailed check
not_mailed = ~df['payment_method'].str.contains('Mailed')
df['paperless_and_monthly'] = np.where(
    not_mailed & (df['contract'] == 'Month-to-month'), 'Yes', 'No'
)

# Check features
df[['paperless_and_monthly', 'payment_method', 'contract']].head()

Unnamed: 0,paperless_and_monthly,payment_method,contract
0,Yes,Electronic check,Month-to-month
1,No,Mailed check,One year
2,No,Mailed check,Month-to-month
3,No,Bank transfer (automatic),One year
4,Yes,Electronic check,Month-to-month


In [24]:
# Flag customers on an automatic payment method (bank transfer or credit card)
df['automatic_payments'] = np.where(
    df['payment_method'].str.contains('automatic'), 'Yes', 'No'
)

# Check newly created feature
df[['automatic_payments', 'payment_method']].head()

Unnamed: 0,automatic_payments,payment_method
0,No,Electronic check
1,No,Mailed check
2,No,Mailed check
3,Yes,Bank transfer (automatic)
4,No,Electronic check


### 2.5.5 Interaction Features <a class="anchor" id="255-interaction-features"></a>

Much of the value in this analysis lies in the relationships between features. Charges per service show whether customers are paying a premium for each service, a cost calculation that may influence their decision to leave Telco. The most valuable customer segment, identified by both high tenure and high monthly charges, is also worth examining.

In [25]:
# Total charges per service (the +1 keeps the denominator nonzero)
df['total_charges_per_service'] = (df['total_charges'] / (df['num_services'] + 1)).round(2)

# Monthly charges per service (same +1 denominator)
df['monthly_charges_per_service'] = (df['monthly_charges'] / (df['num_services'] + 1)).round(2)

# Check new features
df[['total_charges_per_service', 'total_charges', 'monthly_charges', 'monthly_charges_per_service', 'num_services']].head()

Unnamed: 0,total_charges_per_service,total_charges,monthly_charges,monthly_charges_per_service,num_services
0,9.95,29.85,29.85,9.95,2
1,377.9,1889.5,56.95,11.39,4
2,21.63,108.15,53.85,10.77,4
3,368.15,1840.75,42.3,8.46,4
4,50.55,151.65,70.7,23.57,2
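
Because the denominator is `num_services + 1`, the per-service values are lower than a plain division would give. A worked check on the first row of the table above:

```python
monthly_charges = 29.85  # first row of the table above
num_services = 2

# Formula used in the cell above
per_service_plus_one = round(monthly_charges / (num_services + 1), 2)

# Plain division by the service count, for comparison
per_service_plain = monthly_charges / num_services
```

The first value matches the table (9.95); the plain version would be about 14.93. The shifted denominator is worth keeping in mind when reading these columns as a price per service.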


In [26]:
# Loyal high spenders (tenure of at least 48 months, monthly charges of at least $100)
is_loyal_high_spender = (df['tenure'] >= 48) & (df['monthly_charges'] >= 100)
df['loyal_high_spender'] = np.where(is_loyal_high_spender, 'Yes', 'No')

# Check features
df[['loyal_high_spender', 'tenure', 'monthly_charges']].tail()

Unnamed: 0,loyal_high_spender,tenure,monthly_charges
7038,No,24,84.8
7039,Yes,72,103.2
7040,No,11,29.6
7041,No,4,74.4
7042,Yes,66,105.65


In [27]:
df.head()

Unnamed: 0,customer_id,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,internet_service,online_security,...,avg_monthly_charges,revenue_per_year,has_streaming,has_internet,num_services,paperless_and_monthly,automatic_payments,total_charges_per_service,monthly_charges_per_service,loyal_high_spender
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,29.85,358.2,No,Yes,2,Yes,No,9.95,9.95,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,55.573529,666.882353,No,Yes,4,No,No,377.9,11.39,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,54.075,648.9,No,Yes,4,No,No,21.63,10.77,No
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,40.905556,490.866667,No,Yes,4,No,Yes,368.15,8.46,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,75.825,909.9,No,Yes,2,Yes,No,50.55,23.57,No


In [28]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 37 columns):
 #   Column                       Non-Null Count  Dtype   
---  ------                       --------------  -----   
 0   customer_id                  7043 non-null   object  
 1   gender                       7043 non-null   object  
 2   senior_citizen               7043 non-null   int64   
 3   partner                      7043 non-null   object  
 4   dependents                   7043 non-null   object  
 5   tenure                       7043 non-null   int64   
 6   phone_service                7043 non-null   object  
 7   multiple_lines               7043 non-null   object  
 8   internet_service             7043 non-null   object  
 9   online_security              7043 non-null   object  
 10  online_backup                7043 non-null   object  
 11  device_protection            7043 non-null   object  
 12  tech_support                 7043 non-null   object  
 13  str

**Observation:** The data appear to be correctly processed, so we can now save them and move on to EDA.

In [29]:
# Save transformed data
df.to_csv('../data/cleaned/telco_churn_clean.csv', index=False)
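
One thing to keep in mind when reloading this file: CSV stores only text, so the categorical columns created with `pd.cut` come back without their category dtype or ordering. The sketch below shows the round trip on a made-up three-row frame, using an in-memory buffer instead of a file:

```python
import io
import pandas as pd

labels = ['0-12m', '13-24m', '25-48m', '49-72m']

# Made-up frame with a binned column like tenure_group
toy = pd.DataFrame({'tenure': [3, 20, 60]})
toy['tenure_group'] = pd.cut(toy['tenure'], bins=[0, 12, 24, 48, 72],
                             labels=labels, include_lowest=True)

# Round-trip through CSV in memory
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The categorical dtype does not survive the round trip
lost_category = not isinstance(reloaded['tenure_group'].dtype, pd.CategoricalDtype)

# Restore the ordered categorical after loading
reloaded['tenure_group'] = pd.Categorical(reloaded['tenure_group'],
                                          categories=labels, ordered=True)
```

Restoring the categories in the next notebook keeps plots and group-bys in bin order rather than alphabetical order.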

---

## 2.6 Summary <a class="anchor" id="26-summary"></a>

In this notebook, we prepared the Telco Churn dataset for analysis and future modeling by addressing missing values, standardizing feature names, and engineering new features. 
The only missing values were in `total_charges` for new customers, which were set to zero. Additional features were created to capture customer tenure, service usage, revenue contribution, and contract/payment characteristics, helping to quantify factors likely related to churn. 
With the dataset cleaned, standardized, and enriched with new features, it is now ready for exploratory analysis to uncover patterns and insights.

