# Globalcom Telecom Customer Churn Project: Data Preparation & EDA

**Author:** Akinradewo Aarinola Olamiposi  
**Date:** 21 September 2025  
**Description:** This notebook outlines the data cleaning, transformation, and initial exploratory data analysis (EDA) for the hypothetical Telco customer churn project.


In [1]:
# Cell 1: Data Manipulation
import pandas as pd
import numpy as np

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Print all outputs from a cell, not just the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Load each dataset and take a first look at its structure.

In [3]:
# Cell 2: Load and Inspect Raw Datasets

# 1. Load the data from CSV files
df_demographics = pd.read_csv('../data/raw/customer_demographics.csv')
df_services = pd.read_csv('../data/raw/customer_services.csv')
df_support = pd.read_csv('../data/raw/support_tickets.csv')

# 2. Inspect each DataFrame one by one
print("DEMOGRAPHICS DATA:")
print("Shape:", df_demographics.shape)
df_demographics.head(2)
print("\n") # Adds a space for readability

print("SERVICES DATA:")
print("Shape:", df_services.shape)
df_services.head(2)
print("\n")

print("SUPPORT TICKETS DATA:")
print("Shape:", df_support.shape)
df_support.head(2)

FileNotFoundError: [Errno 2] No such file or directory: '../data/raw/customer_demographics.csv'

Combine the separate DataFrames from Cell 2 into one master dataset.

In [4]:
# Cell 3: Merge and Integrate Data

# 1. Merge demographics and services data on customerID
df_merged = pd.merge(df_demographics, df_services, on='customerID', how='inner')
print("Shape after merging demographics and services:", df_merged.shape)

# 2. Aggregate support tickets: count tickets per customer
# We need to transform the support log into a customer-level summary
df_support_agg = df_support.groupby('customerID').size().reset_index(name='number_of_support_tickets')
print("Unique customers in support logs:", df_support_agg.shape[0])

# 3. Merge the support ticket count into the main dataframe
df_main = pd.merge(df_merged, df_support_agg, on='customerID', how='left')

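# 4. Fill NaN values for customers with no support tickets
# (assign the result back; fillna(inplace=True) on a selected column is deprecated in recent pandas)
df_main['number_of_support_tickets'] = df_main['number_of_support_tickets'].fillna(0).astype(int)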

# 5. Inspect the final merged dataset
print("\nFINAL MERGED DATASET:")
print("Shape:", df_main.shape)
print("\nFirst 3 rows:")
df_main.head(3)

# Show info to check for missing values after merge
print("\nInfo:")
df_main.info()

NameError: name 'df_demographics' is not defined
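
Because the cell above could not run, here is a small self-contained sketch of the same merge-and-aggregate pattern on toy data. The three toy tables and their values are invented for illustration. The `validate="one_to_one"` argument is an optional safeguard, not part of the original cell.

```python
import pandas as pd

# Toy stand-ins for the three raw tables (values are invented).
demo = pd.DataFrame({"customerID": ["A", "B", "C"], "gender": ["F", "M", "F"]})
services = pd.DataFrame({"customerID": ["A", "B", "C"], "tenure": [1, 24, 60]})
tickets = pd.DataFrame({"customerID": ["A", "A", "C"], "ticket_id": [101, 102, 103]})

# validate= raises a MergeError if customerID is duplicated on either side.
merged = pd.merge(demo, services, on="customerID", how="inner", validate="one_to_one")

# Collapse the ticket log to one row per customer.
ticket_counts = (
    tickets.groupby("customerID").size().reset_index(name="number_of_support_tickets")
)

# A left join keeps customers without tickets; their count is NaN until filled.
main = pd.merge(merged, ticket_counts, on="customerID", how="left")
main["number_of_support_tickets"] = (
    main["number_of_support_tickets"].fillna(0).astype(int)
)
print(main)
```

Customer `B` has no rows in the ticket log, so the left join leaves a missing count that is then filled with 0.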

Check the newly merged DataFrame (`df_main`) for data quality issues.

In [5]:
# Cell 4: Initial Data Assessment
print("MISSING VALUES:")
print(df_main.isnull().sum())
print("\nDATA TYPES:")
print(df_main.dtypes)
print(f"\nDUPLICATE CUSTOMER IDs: {df_main.duplicated(subset=['customerID']).sum()}")
df_main.describe()

MISSING VALUES:


NameError: name 'df_main' is not defined
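
As an illustration of what these checks can surface, the sketch below runs them on a toy frame. The helper name `quality_report` and the toy values are hypothetical. The column names `MonthlyCharges` and `TotalCharges` are borrowed from Cell 5.

```python
import pandas as pd

def quality_report(df, key):
    """Summarise missing values, non-numeric columns and duplicate keys."""
    return {
        "missing": df.isnull().sum().to_dict(),
        "non_numeric_columns": [
            col for col in df.columns
            if not pd.api.types.is_numeric_dtype(df[col])
        ],
        "duplicate_keys": int(df.duplicated(subset=[key]).sum()),
    }

toy = pd.DataFrame({
    "customerID": ["A", "B", "B"],              # "B" appears twice
    "MonthlyCharges": [29.85, 56.95, 56.95],
    "TotalCharges": ["29.85", " ", "1889.5"],   # numbers stored as text
})

report = quality_report(toy, key="customerID")
print(report)
```

Note that the blank string in `TotalCharges` is not counted as missing by `isnull()`; it only shows up because the column is reported as non-numeric.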

Fix the issues found in Cell 4.

In [None]:
# Cell 5: Data Cleaning
# Convert TotalCharges to numeric, forcing errors to NaN
df_main['TotalCharges'] = pd.to_numeric(df_main['TotalCharges'], errors='coerce')
# Check for the new missing values created by conversion
print("Customers with missing TotalCharges after conversion:")
print(df_main[df_main['TotalCharges'].isnull()][['customerID', 'tenure', 'MonthlyCharges', 'TotalCharges']])
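# Impute missing TotalCharges with 0; the rows printed above are expected to be customers with tenure=0
df_main['TotalCharges'] = df_main['TotalCharges'].fillna(0)
# Standardize categorical values (assign back instead of using inplace on a selected column)
df_main['PaymentMethod'] = df_main['PaymentMethod'].replace({'e-check': 'Electronic check'})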
print("Missing values after cleaning:", df_main.isnull().sum().sum())
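
The conversion step can be sketched on toy data to show how `errors='coerce'` behaves. The values below are invented, and the check that every missing value belongs to a `tenure == 0` customer is an added safeguard rather than part of the original cell.

```python
import pandas as pd

toy = pd.DataFrame({
    "tenure": [0, 2, 34],
    "TotalCharges": [" ", "108.15", "1889.5"],  # blank string for the new customer
})

# errors="coerce" turns anything unparseable (here the blank string) into NaN.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
n_missing = int(toy["TotalCharges"].isnull().sum())

# Confirm that every missing value belongs to a tenure-0 customer before filling with 0.
missing_rows = toy["TotalCharges"].isnull()
only_new_customers = bool((toy.loc[missing_rows, "tenure"] == 0).all())

toy["TotalCharges"] = toy["TotalCharges"].fillna(0)
print(n_missing, only_new_customers, toy["TotalCharges"].tolist())
```

If `only_new_customers` were `False`, filling with 0 would understate charges for established customers, and a different imputation would be worth considering.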

Create new, more informative features from the cleaned data.

In [None]:
# Cell 6: Feature Engineering
# Create TenureGroup feature
def tenure_group(tenure):
    # Upper bounds are inclusive so each label matches its range (e.g. 12 -> '0-12')
    if tenure <= 12:
        return '0-12'
    elif tenure <= 24:
        return '13-24'
    elif tenure <= 36:
        return '25-36'
    elif tenure <= 48:
        return '37-48'
    else:
        return '49+'
df_main['TenureGroup'] = df_main['tenure'].apply(tenure_group)
# Display new features
df_main[['customerID', 'tenure', 'TenureGroup']].head()
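
An alternative to a hand-written `if`/`elif` chain is `pd.cut`, which states the bin edges explicitly. The sketch below is one possible formulation, assuming tenure is a non-negative number of months; the sample values are chosen to sit on the bin boundaries.

```python
import pandas as pd

tenure = pd.Series([0, 12, 13, 24, 25, 48, 49, 72], name="tenure")

# Right-closed bins: (-1, 12], (12, 24], (24, 36], (36, 48], (48, inf]
bins = [-1, 12, 24, 36, 48, float("inf")]
labels = ["0-12", "13-24", "25-36", "37-48", "49+"]

tenure_group = pd.cut(tenure, bins=bins, labels=labels)
print(pd.DataFrame({"tenure": tenure, "TenureGroup": tenure_group}))
```

The lower edge of -1 is there so that a tenure of 0 falls inside the first bin. The result is an ordered categorical, which keeps the groups in their natural order in plots and tables.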

Visualize the data and start generating insights.

In [None]:
# Cell 7: Exploratory Data Analysis (EDA)
# Set visual style
sns.set_style("whitegrid")
# 1. Churn Count Plot
plt.figure(figsize=(6,4))
df_main['Churn'].value_counts().plot(kind='bar', color=['skyblue', 'salmon'])
plt.title('Overall Customer Churn Count')
plt.ylabel('Number of Customers')
plt.show()
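
Counts alone can hide differences between segments, so a natural next step is to look at churn rates. The sketch below uses toy data and assumes `Churn` holds the strings `"Yes"` and `"No"`, which the notebook does not confirm; the `TenureGroup` labels follow Cell 6.

```python
import pandas as pd

toy = pd.DataFrame({
    "TenureGroup": ["0-12", "0-12", "0-12", "49+", "49+", "49+", "49+"],
    "Churn": ["Yes", "Yes", "No", "No", "No", "No", "Yes"],  # assumed encoding
})

# Share of churned vs retained customers overall.
overall_rate = toy["Churn"].value_counts(normalize=True)

# Row-normalised crosstab: churn rate within each tenure group.
by_group = pd.crosstab(toy["TenureGroup"], toy["Churn"], normalize="index")

print(overall_rate)
print(by_group)
```

Calling `by_group.plot(kind="bar", stacked=True)` would give a chart comparable to the count plot above, but on a per-group rate scale.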

Export the cleaned and processed data for the next stage.

In [None]:
# Cell 8: Save the Final Dataset
df_main.to_csv('../data/processed/telco_churn_clean.csv', index=False)
print("Clean dataset saved successfully. Ready for modeling!")
print(f"Final Dataset Shape: {df_main.shape}")
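
`to_csv` fails if the target directory does not exist, so it can help to create it first. The sketch below shows one way to do that and to verify the file by reading it back. The helper name `save_processed` is hypothetical, and a temporary directory stands in for `../data/processed/`.

```python
import tempfile
from pathlib import Path

import pandas as pd

def save_processed(df, path):
    """Write df to CSV, creating parent directories if they are missing."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(out, index=False)
    return out

toy = pd.DataFrame({"customerID": ["A", "B"], "TotalCharges": [0.0, 108.15]})

with tempfile.TemporaryDirectory() as tmp:
    target = save_processed(toy, Path(tmp) / "processed" / "telco_churn_clean.csv")
    # Read the file back to confirm that the shape survived the round trip.
    round_trip_shape = pd.read_csv(target).shape

print("Round-trip shape:", round_trip_shape)
```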