Project stage 1 review remark 1: Handling NaN values
The cleaned dataset has 11 customers with null TotalCharges.
All 11 customers with missing TotalCharges have a tenure of 0, meaning they are new customers who likely haven’t been billed yet.
Filling the missing TotalCharges with 0 is therefore appropriate: it reflects that these customers have not accumulated any charges yet.
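The claim that every row with a missing TotalCharges has a tenure of 0 can be checked before filling. The following is a minimal sketch on a toy frame; the rows and values in `toy` are hypothetical stand-ins, not taken from the actual dataset, and the fill is only applied when the check passes.

```python
import pandas as pd

# Toy stand-in for the Telco data (hypothetical values, not the real file)
toy = pd.DataFrame({
    "tenure": [0, 0, 12, 24],
    "MonthlyCharges": [52.55, 20.25, 70.00, 99.65],
    "TotalCharges": [" ", None, "840.0", "2391.6"],
})

# Coerce blanks and other non-numeric strings to NaN before inspecting them
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Rows where TotalCharges is missing, and whether all of them are new customers
missing = toy[toy["TotalCharges"].isna()]
all_new_customers = bool((missing["tenure"] == 0).all())

print(f"Rows with missing TotalCharges: {len(missing)}")
print(f"All of them have tenure == 0: {all_new_customers}")

# Only fill with 0 when the assumption actually holds
if all_new_customers:
    toy["TotalCharges"] = toy["TotalCharges"].fillna(0)
```

Running the same check on `df` before the fill would document the justification directly in the notebook.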

In [1]:
import pandas as pd

# Load the CSV file
df = pd.read_csv("Telco-Customer-Churn-Cleaned.csv")

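# Ensure 'TotalCharges' is numeric first; blank or non-numeric strings become NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

# Fill missing values in 'TotalCharges' with 0 (done after the conversion so that
# values coerced to NaN are filled as well)
df['TotalCharges'] = df['TotalCharges'].fillna(0)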

# Save the updated DataFrame back to the same or new CSV file
df.to_csv("Telco-Customer-Churn-Cleaned-Updated.csv", index=False)


Project stage 1 review remark 2: Formatting issues, such as typing errors or multiple ways to identify the same variable, were not checked. In the section below I check for formatting and consistency issues.


In [8]:
# 1. Get all object (categorical) columns
cat_columns = df.select_dtypes(include='object').columns

# 2. Check for inconsistent string values (capitalisation, whitespace, variants)
print("=== Unique Values in Categorical Columns ===")
for col in cat_columns:
    unique_vals = df[col].unique()
    print(f"{col}: {unique_vals}")

# 3. Check for known variants that should be standardized
columns_to_check_variants = [
    'MultipleLines', 'OnlineSecurity', 'OnlineBackup',
    'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies'
]

print("\n=== Columns Likely to Contain Inconsistent 'No <service>' Values ===")
for col in columns_to_check_variants:
    if col in df.columns:
        values = df[col].unique()
        if any("No " in str(v) for v in values):
            print(f"{col} → {values}")

# 4. Check if 'SeniorCitizen' is numeric but only has 0 or 1
print("\n=== Checking if 'SeniorCitizen' is Binary Categorical ===")
if df['SeniorCitizen'].nunique() == 2 and set(df['SeniorCitizen'].unique()) == {0, 1}:
    print("✔ 'SeniorCitizen' is numeric but categorical (0/1)")

# 5. Check 'TotalCharges' data type and missing values
print("\n=== TotalCharges Type & Missing Check ===")
print(f"Data type: {df['TotalCharges'].dtype}")
missing_count = df['TotalCharges'].isna().sum()
print(f"Missing values: {missing_count}")

# 6. Detect strings with inconsistent case or whitespace
print("\n=== Inconsistent Formatting (Case/Whitespace) ===")
for col in cat_columns:
    cleaned = df[col].astype(str).str.strip().str.lower()
    unique_cleaned = cleaned.unique()
    if len(unique_cleaned) < len(df[col].unique()):
        print(f"{col} has inconsistent formatting (case or whitespace)")


=== Unique Values in Categorical Columns ===
gender: ['Female' 'Male']
Partner: ['Yes' 'No']
Dependents: ['No' 'Yes']
PhoneService: ['No' 'Yes']
MultipleLines: ['No' 'Yes']
InternetService: ['DSL' 'Fiber optic' 'No']
OnlineSecurity: ['No' 'Yes']
OnlineBackup: ['Yes' 'No']
DeviceProtection: ['No' 'Yes']
TechSupport: ['No' 'Yes']
StreamingTV: ['No' 'Yes']
StreamingMovies: ['No' 'Yes']
Contract: ['Month-to-month' 'One year' 'Two year']
PaperlessBilling: ['Yes' 'No']
PaymentMethod: ['Electronic check' 'Mailed check' 'Bank transfer (automatic)'
 'Credit card (automatic)']
Churn: ['No' 'Yes']

=== Columns Likely to Contain Inconsistent 'No <service>' Values ===

=== Checking if 'SeniorCitizen' is Binary Categorical ===
✔ 'SeniorCitizen' is numeric but categorical (0/1)

=== TotalCharges Type & Missing Check ===
Data type: float64
Missing values: 0

=== Inconsistent Formatting (Case/Whitespace) ===


Fix all identified issues

In [7]:

# 1. Replace 'No internet service' and 'No phone service' with 'No'
columns_with_variants = [
    'MultipleLines', 'OnlineSecurity', 'OnlineBackup',
    'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies'
]

for col in columns_with_variants:
    df[col] = df[col].replace({'No internet service': 'No', 'No phone service': 'No'})

# 2. Standardize string values across all object columns (strip whitespace, lowercasing optional)
object_cols = df.select_dtypes(include='object').columns
for col in object_cols:
    df[col] = df[col].astype(str).str.strip()

# 3. Convert 'SeniorCitizen' to category if needed
df['SeniorCitizen'] = df['SeniorCitizen'].astype('category')

# 4. Ensure 'TotalCharges' is numeric and fill missing values with 0 (valid for tenure == 0)
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df['TotalCharges'] = df['TotalCharges'].fillna(0)

# 5. Save the cleaned dataset to a new CSV file
df.to_csv("Telco-Customer-Churn-Cleaned-Final.csv", index=False)

print("✔ Dataset cleaned and saved as 'Telco-Customer-Churn-Cleaned-Final.csv'")


✔ Dataset cleaned and saved as 'Telco-Customer-Churn-Cleaned-Final.csv'


All "No internet service" and "No phone service" values are replaced with "No".

All text columns are stripped of leading and trailing whitespace.

SeniorCitizen is treated as categorical (0/1 flags).

TotalCharges is converted to float and missing values are filled with 0.
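The points above can also be verified programmatically after cleaning. The following is a rough sketch, not part of the original notebook: `check_cleaned` is a hypothetical helper name, and the two small frames are made-up examples used only to show a failing and a passing case.

```python
import pandas as pd

SERVICE_COLS = [
    'MultipleLines', 'OnlineSecurity', 'OnlineBackup',
    'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies'
]

def check_cleaned(frame: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []

    # Service columns should only contain 'Yes' or 'No' after the replacement
    for col in SERVICE_COLS:
        if col in frame.columns:
            leftovers = set(frame[col].dropna().unique()) - {"Yes", "No"}
            if leftovers:
                problems.append(f"{col}: unexpected values {sorted(leftovers)}")

    # Text columns should have no leading or trailing whitespace
    text_cols = [c for c in frame.columns if pd.api.types.is_string_dtype(frame[c])]
    for col in text_cols:
        as_str = frame[col].astype(str)
        if (as_str != as_str.str.strip()).any():
            problems.append(f"{col}: leading/trailing whitespace")

    # TotalCharges should be numeric with no missing values
    if "TotalCharges" in frame.columns:
        if not pd.api.types.is_numeric_dtype(frame["TotalCharges"]):
            problems.append("TotalCharges: not numeric")
        elif frame["TotalCharges"].isna().any():
            problems.append("TotalCharges: missing values")

    return problems

# Hypothetical example frames: one with problems, one clean
bad = pd.DataFrame({
    "MultipleLines": ["No phone service", "Yes"],
    "Contract": [" One year", "Two year"],
    "TotalCharges": [None, 10.0],
})
clean = pd.DataFrame({
    "MultipleLines": ["No", "Yes"],
    "Contract": ["One year", "Two year"],
    "TotalCharges": [0.0, 10.0],
})

bad_problems = check_cleaned(bad)
clean_problems = check_cleaned(clean)
print(bad_problems)
print(clean_problems)
```

Calling `check_cleaned(df)` right before saving would turn the summary into a repeatable check rather than a manual inspection.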

Project stage 1 review remark 3: Data schema is not clearly outlined. In the section below I outline the data schema.

In [10]:
# Column descriptions 
column_descriptions = {
    'customerID': 'Unique customer identifier',
    'gender': 'Gender of the customer',
    'SeniorCitizen': '1 = Senior, 0 = Non-senior',
    'Partner': 'Whether the customer has a partner (Yes/No)',
    'Dependents': 'Whether the customer has dependents (Yes/No)',
    'tenure': 'Number of months the customer has stayed',
    'PhoneService': 'Whether the customer has phone service',
    'MultipleLines': 'Whether the customer has multiple phone lines',
    'InternetService': 'Type of internet service (DSL/Fiber/No)',
    'OnlineSecurity': 'Whether the customer has online security service',
    'OnlineBackup': 'Whether the customer has online backup service',
    'DeviceProtection': 'Whether the customer has device protection',
    'TechSupport': 'Whether the customer has tech support service',
    'StreamingTV': 'Whether the customer has streaming TV service',
    'StreamingMovies': 'Whether the customer has streaming movies service',
    'Contract': 'Contract term (Month-to-month, One year, Two year)',
    'PaperlessBilling': 'Whether the customer uses paperless billing',
    'PaymentMethod': 'Customer\'s payment method',
    'MonthlyCharges': 'Monthly amount charged to the customer',
    'TotalCharges': 'Total amount charged to the customer',
    'Churn': 'Whether the customer has churned (Yes/No)'
}

# Schema table
schema_df = pd.DataFrame({
    "Column": df.columns,
    "Data Type": df.dtypes.values,
    "Non-Null Count": df.notnull().sum().values,
    "Unique Values": [df[col].nunique() for col in df.columns],
    "Description": [column_descriptions.get(col, "") for col in df.columns]
})

# Print the table
print(schema_df)


schema_df.to_csv("data_schema.csv", index=False)


              Column Data Type  Non-Null Count  Unique Values  \
0             gender    object            7043              2   
1      SeniorCitizen  category            7043              2   
2            Partner    object            7043              2   
3         Dependents    object            7043              2   
4             tenure     int64            7043             73   
5       PhoneService    object            7043              2   
6      MultipleLines    object            7043              2   
7    InternetService    object            7043              3   
8     OnlineSecurity    object            7043              2   
9       OnlineBackup    object            7043              2   
10  DeviceProtection    object            7043              2   
11       TechSupport    object            7043              2   
12       StreamingTV    object            7043              2   
13   StreamingMovies    object            7043              2   
14          Contract    o

Project stage 1 review remark 4: The analysis also focuses on only three variables, despite the dataset containing 21 attributes. Was there a specific reason for this selection? Given your research question, a correlation analysis could have provided useful insights into which variables are most relevant to customer churn.

To identify the key variables influencing customer churn, I started with all 21 features in the dataset and removed non-informative columns such as customerID. The remaining features were categorized into numerical and categorical types. I retained the variables that directly reflect customer behavior, subscription details, billing preferences, or demographic attributes. This included numerical features like tenure, MonthlyCharges, and TotalCharges, which are likely indicators of usage or billing pressure. Categorical features such as Contract, TechSupport, InternetService, and PaymentMethod were selected based on domain knowledge and their potential relevance to customer satisfaction and loyalty.
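The correlation analysis suggested in the review remark could complement the domain-knowledge selection described above. The sketch below is only an illustration under stated assumptions: the `toy` frame and its values are hypothetical, Churn is encoded as 1/0 so that the Pearson correlation with a numerical feature is a point-biserial correlation, and categorical features are compared by churn rate per category rather than by a correlation coefficient.

```python
import pandas as pd

# Toy stand-in for the cleaned data (hypothetical values, not the real dataset)
toy = pd.DataFrame({
    "tenure": [1, 2, 5, 30, 45, 60, 3, 72],
    "MonthlyCharges": [90.0, 85.5, 95.0, 55.0, 60.0, 25.0, 80.0, 110.0],
    "Contract": [
        "Month-to-month", "Month-to-month", "Month-to-month", "One year",
        "Two year", "Two year", "Month-to-month", "Two year",
    ],
    "Churn": ["Yes", "Yes", "Yes", "No", "No", "No", "Yes", "No"],
})

# Encode the target as 1 (churned) / 0 (stayed)
churn_flag = toy["Churn"].map({"Yes": 1, "No": 0})

# Numerical features: Pearson correlation with the 0/1 churn flag
numeric_corr = toy[["tenure", "MonthlyCharges"]].corrwith(churn_flag).sort_values()
print("Correlation of numerical features with churn:")
print(numeric_corr)

# Categorical features: churn rate within each category
churn_rate_by_contract = churn_flag.groupby(toy["Contract"]).mean()
print("\nChurn rate by contract type:")
print(churn_rate_by_contract)
```

On the real data, the same two steps could be repeated over `numerical_vars` and `categorical_vars` to rank features by how strongly they are associated with churn.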

In [11]:
print(df.columns.tolist())


['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn']


In [13]:
# Separate features by type
categorical_vars = [
    'gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService',
    'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup',
    'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies',
    'Contract', 'PaperlessBilling', 'PaymentMethod'
]

numerical_vars = ['tenure', 'MonthlyCharges', 'TotalCharges']

# Create a reference table of selected variables
focus_variables = pd.DataFrame({
    "Variable": categorical_vars + numerical_vars,
    "Type": ["Categorical"] * len(categorical_vars) + ["Numerical"] * len(numerical_vars),
    "Reason for Focus": (
        ["Potential service or customer demographic driver of churn"] * len(categorical_vars) +
        ["Financial or duration indicators impacting churn"] * len(numerical_vars)
    )
})

print(focus_variables)


            Variable         Type  \
0             gender  Categorical   
1      SeniorCitizen  Categorical   
2            Partner  Categorical   
3         Dependents  Categorical   
4       PhoneService  Categorical   
5      MultipleLines  Categorical   
6    InternetService  Categorical   
7     OnlineSecurity  Categorical   
8       OnlineBackup  Categorical   
9   DeviceProtection  Categorical   
10       TechSupport  Categorical   
11       StreamingTV  Categorical   
12   StreamingMovies  Categorical   
13          Contract  Categorical   
14  PaperlessBilling  Categorical   
15     PaymentMethod  Categorical   
16            tenure    Numerical   
17    MonthlyCharges    Numerical   
18      TotalCharges    Numerical   

                                     Reason for Focus  
0   Potential service or customer demographic driv...  
1   Potential service or customer demographic driv...  
2   Potential service or customer demographic driv...  
3   Potential service or customer d