## Telco Customer Churn – Data Cleaning Notebook

### Description
This notebook runs a reproducible data-cleaning pipeline for the Telco Customer Churn dataset.
It loads the data, inspects it, removes duplicate rows, drops unwanted columns, converts Total Charges to numeric, handles missing values, and saves a cleaned CSV ready for modeling.

Target column: Churn Value (0 = No, 1 = Yes)
Dropped columns: Churn Label, Count

### Objectives

- Load the Telco dataset.
- Inspect structure, types, and missing values.
- Remove duplicate rows and the specified columns.
- Convert Total Charges to numeric.
- Handle missing values: fill missing Churn Reason with "No churn" and drop rows where Total Charges is missing.
- Export the cleaned CSV: cleaned_customer_churn.csv

In [28]:
import pandas as pd
import numpy as np
INPUT_PATH = "RawData/Telco_customer_churn.csv"
OUTPUT_PATH = "RawData/cleaned_customer_churn.csv"


In [29]:
df = pd.read_csv(INPUT_PATH)
print("Loaded dataset shape:", df.shape)
display(df.head())

Loaded dataset shape: (7043, 33)


Unnamed: 0,CustomerID,Count,Country,State,City,Zip Code,Lat Long,Latitude,Longitude,Gender,...,Contract,Paperless Billing,Payment Method,Monthly Charges,Total Charges,Churn Label,Churn Value,Churn Score,CLTV,Churn Reason
0,3668-QPYBK,1,United States,California,Los Angeles,90003,"33.964131, -118.272783",33.964131,-118.272783,Male,...,Month-to-month,Yes,Mailed check,53.85,108.15,Yes,1,86,3239,Competitor made better offer
1,9237-HQITU,1,United States,California,Los Angeles,90005,"34.059281, -118.30742",34.059281,-118.30742,Female,...,Month-to-month,Yes,Electronic check,70.7,151.65,Yes,1,67,2701,Moved
2,9305-CDSKC,1,United States,California,Los Angeles,90006,"34.048013, -118.293953",34.048013,-118.293953,Female,...,Month-to-month,Yes,Electronic check,99.65,820.5,Yes,1,86,5372,Moved
3,7892-POOKP,1,United States,California,Los Angeles,90010,"34.062125, -118.315709",34.062125,-118.315709,Female,...,Month-to-month,Yes,Electronic check,104.8,3046.05,Yes,1,84,5003,Moved
4,0280-XJGEX,1,United States,California,Los Angeles,90015,"34.039224, -118.266293",34.039224,-118.266293,Male,...,Month-to-month,Yes,Bank transfer (automatic),103.7,5036.3,Yes,1,89,5340,Competitor had better devices


In [30]:
# Check the unique values in each column
for col in df.columns:
    print(f"Column: {col}")
    print(df[col].unique())
    print("-" * 50)

Column: CustomerID
['3668-QPYBK' '9237-HQITU' '9305-CDSKC' ... '2234-XADUH' '4801-JZAZL'
 '3186-AJIEK']
--------------------------------------------------
Column: Count
[1]
--------------------------------------------------
Column: Country
['United States']
--------------------------------------------------
Column: State
['California']
--------------------------------------------------
Column: City
['Los Angeles' 'Beverly Hills' 'Huntington Park' ... 'Standish' 'Tulelake'
 'Olympic Valley']
--------------------------------------------------
Column: Zip Code
[90003 90005 90006 ... 96128 96134 96146]
--------------------------------------------------
Column: Lat Long
['33.964131, -118.272783' '34.059281, -118.30742' '34.048013, -118.293953'
 ... '40.346634, -120.386422' '41.813521, -121.492666'
 '39.191797, -120.212401']
--------------------------------------------------
Column: Latitude
[33.964131 34.059281 34.048013 ... 40.346634 41.813521 39.191797]
-----------------------------------
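
The output above shows that Count, Country, and State each hold a single value. A more compact way to spot such constant columns is to count distinct values per column instead of printing them all. The sketch below uses a small toy frame (the `toy` data and its values are illustrative assumptions, not rows from the dataset):

```python
import pandas as pd

# Toy frame standing in for the Telco data (illustrative values only).
toy = pd.DataFrame({
    "CustomerID": ["A1", "B2", "C3"],
    "Count": [1, 1, 1],
    "Country": ["United States"] * 3,
    "State": ["California"] * 3,
    "City": ["Los Angeles", "Beverly Hills", "Los Angeles"],
})

# Number of distinct values per column; NaN is counted as its own value.
unique_counts = toy.nunique(dropna=False)

# Columns with exactly one distinct value carry no signal for a model.
constant_cols = unique_counts[unique_counts == 1].index.tolist()

print(unique_counts)
print("Constant columns:", constant_cols)
```

On the real `df`, the same two lines (`df.nunique(dropna=False)` and the filter) would list candidates for dropping; whether to drop them remains a modeling decision.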

### Initial inspection

In [31]:
print("Columns:\n", df.columns.tolist())
print("\nData types:\n", df.dtypes.value_counts())
print("\nMissing values (count and percent):")
missing = pd.concat([df.isnull().sum(), (df.isnull().mean()*100)], axis=1)
missing.columns = ["missing_count", "missing_percent"]
display(missing.sort_values(by="missing_count", ascending=False))



Columns:
 ['CustomerID', 'Count', 'Country', 'State', 'City', 'Zip Code', 'Lat Long', 'Latitude', 'Longitude', 'Gender', 'Senior Citizen', 'Partner', 'Dependents', 'Tenure Months', 'Phone Service', 'Multiple Lines', 'Internet Service', 'Online Security', 'Online Backup', 'Device Protection', 'Tech Support', 'Streaming TV', 'Streaming Movies', 'Contract', 'Paperless Billing', 'Payment Method', 'Monthly Charges', 'Total Charges', 'Churn Label', 'Churn Value', 'Churn Score', 'CLTV', 'Churn Reason']

Data types:
 object     24
int64       6
float64     3
Name: count, dtype: int64

Missing values (count and percent):


Unnamed: 0,missing_count,missing_percent
Churn Reason,5174,73.463013
CustomerID,0,0.0
Count,0,0.0
State,0,0.0
Country,0,0.0
Zip Code,0,0.0
Lat Long,0,0.0
Latitude,0,0.0
City,0,0.0
Gender,0,0.0


In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 33 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   CustomerID         7043 non-null   object 
 1   Count              7043 non-null   int64  
 2   Country            7043 non-null   object 
 3   State              7043 non-null   object 
 4   City               7043 non-null   object 
 5   Zip Code           7043 non-null   int64  
 6   Lat Long           7043 non-null   object 
 7   Latitude           7043 non-null   float64
 8   Longitude          7043 non-null   float64
 9   Gender             7043 non-null   object 
 10  Senior Citizen     7043 non-null   object 
 11  Partner            7043 non-null   object 
 12  Dependents         7043 non-null   object 
 13  Tenure Months      7043 non-null   int64  
 14  Phone Service      7043 non-null   object 
 15  Multiple Lines     7043 non-null   object 
 16  Internet Service   7043 

In [33]:
print("Duplicates:", df.duplicated().sum())


Duplicates: 0


In [34]:
# Remove duplicate rows
df = df.drop_duplicates()
print("Duplicates:", df.duplicated().sum())


Duplicates: 0
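
`df.duplicated()` only flags rows that match on every column. A customer who appears twice with different values would not be caught, so it can also be worth checking the identifier on its own. A minimal sketch on toy data (the frame below is an illustrative assumption):

```python
import pandas as pd

# Toy frame: the same CustomerID appears twice with different charges.
toy = pd.DataFrame({
    "CustomerID": ["A1", "B2", "A1"],
    "Monthly Charges": [53.85, 70.70, 99.65],
})

# Rows identical across all columns.
full_row_dupes = int(toy.duplicated().sum())

# Rows that repeat an earlier CustomerID, regardless of the other columns.
id_dupes = int(toy.duplicated(subset=["CustomerID"]).sum())

print("Full-row duplicates:", full_row_dupes)
print("Repeated CustomerID values:", id_dupes)
```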


In [35]:
# Drop unneeded columns (errors="ignore" skips any that are already absent)
cols_to_drop = ["Churn Label", "Count"]
df = df.drop(columns=cols_to_drop, errors="ignore")
print(df.shape)
df.head()



(7043, 31)


Unnamed: 0,CustomerID,Country,State,City,Zip Code,Lat Long,Latitude,Longitude,Gender,Senior Citizen,...,Streaming Movies,Contract,Paperless Billing,Payment Method,Monthly Charges,Total Charges,Churn Value,Churn Score,CLTV,Churn Reason
0,3668-QPYBK,United States,California,Los Angeles,90003,"33.964131, -118.272783",33.964131,-118.272783,Male,No,...,No,Month-to-month,Yes,Mailed check,53.85,108.15,1,86,3239,Competitor made better offer
1,9237-HQITU,United States,California,Los Angeles,90005,"34.059281, -118.30742",34.059281,-118.30742,Female,No,...,No,Month-to-month,Yes,Electronic check,70.7,151.65,1,67,2701,Moved
2,9305-CDSKC,United States,California,Los Angeles,90006,"34.048013, -118.293953",34.048013,-118.293953,Female,No,...,Yes,Month-to-month,Yes,Electronic check,99.65,820.5,1,86,5372,Moved
3,7892-POOKP,United States,California,Los Angeles,90010,"34.062125, -118.315709",34.062125,-118.315709,Female,No,...,Yes,Month-to-month,Yes,Electronic check,104.8,3046.05,1,84,5003,Moved
4,0280-XJGEX,United States,California,Los Angeles,90015,"34.039224, -118.266293",34.039224,-118.266293,Male,No,...,Yes,Month-to-month,Yes,Bank transfer (automatic),103.7,5036.3,1,89,5340,Competitor had better devices


In [36]:
# Convert Total Charges to numeric; values that cannot be parsed become NaN
if "Total Charges" in df.columns:
    df["Total Charges"] = pd.to_numeric(df["Total Charges"], errors="coerce")
    print("Converted 'Total Charges' to numeric. NaNs after conversion:", df["Total Charges"].isnull().sum())

Converted 'Total Charges' to numeric. NaNs after conversion: 11
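
The column had no missing values before the conversion, so the 11 NaNs come from entries that `pd.to_numeric` could not parse. To see which raw values were coerced, compare the original strings with the converted result. The sketch below uses a toy series; its values, including the whitespace and `"n/a"` entries, are assumptions for illustration and not taken from the dataset:

```python
import pandas as pd

# Toy stand-in for the raw "Total Charges" column (illustrative values only).
raw = pd.Series(["108.15", " ", "820.5", "n/a"], name="Total Charges")

# errors="coerce" turns anything unparseable into NaN instead of raising.
converted = pd.to_numeric(raw, errors="coerce")

# Entries that were present in the raw data but became NaN after coercion.
failed = raw[converted.isna() & raw.notna()]

# repr() makes whitespace-only strings visible when printed.
print(failed.map(repr).tolist())
```

Running the same comparison on the real column before overwriting it would show what the 11 unparseable entries actually contain.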


In [37]:
print(df.isnull().sum())

CustomerID              0
Country                 0
State                   0
City                    0
Zip Code                0
Lat Long                0
Latitude                0
Longitude               0
Gender                  0
Senior Citizen          0
Partner                 0
Dependents              0
Tenure Months           0
Phone Service           0
Multiple Lines          0
Internet Service        0
Online Security         0
Online Backup           0
Device Protection       0
Tech Support            0
Streaming TV            0
Streaming Movies        0
Contract                0
Paperless Billing       0
Payment Method          0
Monthly Charges         0
Total Charges          11
Churn Value             0
Churn Score             0
CLTV                    0
Churn Reason         5174
dtype: int64


In [38]:
# Customers who did not churn have no reason recorded; label them "No churn"
df["Churn Reason"] = df["Churn Reason"].fillna("No churn")
# Drop the rows where Total Charges could not be converted to a number
df = df.dropna(subset=["Total Charges"])

In [39]:
print(df.isnull().sum())

CustomerID           0
Country              0
State                0
City                 0
Zip Code             0
Lat Long             0
Latitude             0
Longitude            0
Gender               0
Senior Citizen       0
Partner              0
Dependents           0
Tenure Months        0
Phone Service        0
Multiple Lines       0
Internet Service     0
Online Security      0
Online Backup        0
Device Protection    0
Tech Support         0
Streaming TV         0
Streaming Movies     0
Contract             0
Paperless Billing    0
Payment Method       0
Monthly Charges      0
Total Charges        0
Churn Value          0
Churn Score          0
CLTV                 0
Churn Reason         0
dtype: int64
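
Filling Churn Reason with "No churn" assumes that a reason is missing only for customers who did not churn. That assumption can be checked before filling by comparing the missing-value mask with `Churn Value == 0`. A minimal sketch on toy data (the frame and its rows are illustrative assumptions):

```python
import pandas as pd

# Toy frame: reasons are recorded only for churned customers.
toy = pd.DataFrame({
    "Churn Value": [1, 0, 0, 1],
    "Churn Reason": ["Moved", None, None, "Competitor made better offer"],
})

missing_reason = toy["Churn Reason"].isna()
not_churned = toy["Churn Value"] == 0

# True only if the reason is missing for exactly the non-churned rows.
consistent = bool((missing_reason == not_churned).all())
print("Missing reasons match non-churned rows:", consistent)

# Safe to fill once the check passes.
toy["Churn Reason"] = toy["Churn Reason"].fillna("No churn")
```

If the check failed on the real data, some churned customers would be mislabeled as "No churn" by the fill.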


In [40]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7032 entries, 0 to 7042
Data columns (total 31 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   CustomerID         7032 non-null   object 
 1   Country            7032 non-null   object 
 2   State              7032 non-null   object 
 3   City               7032 non-null   object 
 4   Zip Code           7032 non-null   int64  
 5   Lat Long           7032 non-null   object 
 6   Latitude           7032 non-null   float64
 7   Longitude          7032 non-null   float64
 8   Gender             7032 non-null   object 
 9   Senior Citizen     7032 non-null   object 
 10  Partner            7032 non-null   object 
 11  Dependents         7032 non-null   object 
 12  Tenure Months      7032 non-null   int64  
 13  Phone Service      7032 non-null   object 
 14  Multiple Lines     7032 non-null   object 
 15  Internet Service   7032 non-null   object 
 16  Online Security    7032 non-n

In [41]:
# Save the cleaned data
df.to_csv(OUTPUT_PATH, index=False)
print(f"✅ Cleaned dataset saved to: {OUTPUT_PATH}")

✅ Cleaned dataset saved to: RawData/cleaned_customer_churn.csv
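
A quick way to confirm the export is to read the file back and compare it with the in-memory frame. The sketch below writes a toy frame to a temporary directory; the frame and the temporary path are assumptions for illustration, and the real notebook would use `df` and `OUTPUT_PATH` instead:

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the cleaned frame (illustrative values only).
cleaned = pd.DataFrame({
    "CustomerID": ["A1", "B2"],
    "Total Charges": [108.15, 820.5],
    "Churn Value": [1, 0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_customer_churn.csv")
    cleaned.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# The reloaded file should have the same shape and no missing values.
same_shape = reloaded.shape == cleaned.shape
no_missing = int(reloaded.isnull().sum().sum()) == 0
print("Same shape:", same_shape, "| No missing values:", no_missing)
```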


In [None]:
# Column Name | Description | Notes
# CustomerID | Unique identifier for each customer |
# Count | Likely a count of connections or linked services (always 1 in this dataset) |
# Country / State / City / Zip Code | Geographic location of the customer |
# Lat Long / Latitude / Longitude | Precise geographic coordinates |
# Gender | Male / Female |
# Senior Citizen | Indicates if the customer is a senior (e.g., age 65+) | Seniors may churn differently than younger customers
# Partner | Whether the customer has a partner/spouse |
# Dependents | Whether others rely on the customer (e.g., children) |
# Tenure Months | Number of months the customer has stayed with the company |
# Phone Service | Whether the customer subscribed to basic phone service |
# Multiple Lines | Indicates if the customer has more than one phone line | Customers with multiple services are more likely to stay
# Internet Service | Type of internet (DSL / Fiber / None) |
# Online Security / Online Backup / Device Protection / Tech Support / Streaming TV / Streaming Movies | Optional add-on services |
# Contract | Month-to-month / One year / Two years |
# Paperless Billing | Yes / No: digital or paper invoices |
# Payment Method | Electronic check, credit card, bank transfer, etc. |
# Monthly Charges | Amount billed per month | Customers with high monthly costs might be more likely to leave
# Total Charges | Lifetime revenue from the customer |
# Churn Label | Yes / No: did the customer leave? | Text-based target column
# Churn Value | 0 / 1: numerical version of churn | Numeric target column (better for modeling)
# Churn Score | Predefined churn probability score from another system |
# CLTV (Customer Lifetime Value) | Estimated long-term business value of this customer |
# Churn Reason | Stated reason for leaving (if churned) |