# 01 Data Loading and Cleaning

This notebook loads the Telco Customer Churn dataset, inspects its structure,
identifies missing or invalid values, and applies the cleaning steps needed before modelling.
It is the first part of the corpus preparation required by the assignment.


Import Libraries

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# ML (added for consistency across notebooks)
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras import Input

import joblib


Load Dataset

In [2]:
data = pd.read_csv("../data/Telco-Customer-Churn.csv")

print("Dataset Info:")
print(data.info())

print("\nMissing Values:")
print(data.isnull().sum())

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-n
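
A non-null count of 7043 does not guarantee clean data, because `isnull()` does not flag entries that contain only whitespace. The following is a minimal sketch of a blank-string check, assuming a small toy frame (the values and the name `toy` are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy frame standing in for the raw data (illustrative values only).
toy = pd.DataFrame({
    "customerID": ["A", "B", "C"],
    "TotalCharges": ["29.85", " ", "108.15"],  # one whitespace-only entry
})

# isnull() reports no missing values, because " " is a valid string.
print(toy.isnull().sum())

# Count entries that are empty once surrounding whitespace is removed.
blank_counts = pd.Series(
    {col: int((toy[col].str.strip() == "").sum()) for col in toy.columns}
)
print(blank_counts)
```

Running the same count on the string columns of `data` would show whether any column hides blanks that the missing-value report above cannot see.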

Basic Data Cleaning

In [4]:
# Convert TotalCharges to numeric; entries that cannot be parsed (such as blank strings) become NaN
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')

# Fill the resulting missing TotalCharges values with the column median
data['TotalCharges'] = data['TotalCharges'].fillna(data['TotalCharges'].median())

# Strip leading/trailing whitespace from categorical (object) columns.
# .str.strip() leaves any NaN untouched instead of turning it into the string "nan".
for col in data.select_dtypes(include='object').columns:
    data[col] = data[col].str.strip()
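
To make the effect of the coercion step visible, here is a small sketch on a toy Series. The values and the names `raw`, `coerced`, `n_coerced`, and `filled` are assumptions for illustration; it shows how many entries `errors='coerce'` turns into NaN before the median fill hides them:

```python
import pandas as pd

# Toy stand-in for a raw TotalCharges column (illustrative values only).
raw = pd.Series(["29.85", " ", "1889.5", " ", "108.15"])

# Unparseable entries become NaN instead of raising an error.
coerced = pd.to_numeric(raw, errors="coerce")

# Number of values that were present as strings but failed to parse.
n_coerced = int(coerced.isna().sum() - raw.isna().sum())
print("Values coerced to NaN:", n_coerced)

# Median imputation, as in the cleaning cell above.
filled = coerced.fillna(coerced.median())
print(filled)
```

Recording `n_coerced` before filling keeps a trace of how many rows were imputed, which is otherwise lost once the NaNs are replaced.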


Additional Quality Checks

In [5]:
# Check duplicates
print("Duplicate rows:", data.duplicated().sum())

# Check numerical columns for invalid values
numeric_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']
print("\nNegative Value Check:")
print((data[numeric_cols] < 0).sum())

# Confirm no missing values remain
print("\nMissing Values After Cleaning:")
print(data.isnull().sum())


Duplicate rows: 0

Negative Value Check:
tenure            0
MonthlyCharges    0
TotalCharges      0
dtype: int64

Missing Values After Cleaning:
customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64
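
The printed checks above rely on someone reading the output. One possible alternative is to express them as assertions so that a later notebook fails loudly if the data regresses. This is a sketch only; `check_clean` is a hypothetical helper, not part of the original notebook, and the toy frame uses illustrative values:

```python
import pandas as pd

def check_clean(df, numeric_cols):
    """Hypothetical helper: raise AssertionError if a basic quality check fails."""
    assert df.duplicated().sum() == 0, "duplicate rows found"
    assert (df[numeric_cols] < 0).sum().sum() == 0, "negative values found"
    assert df.isnull().sum().sum() == 0, "missing values remain"
    return True

# Toy frame with the same numeric column names as the dataset.
toy = pd.DataFrame({
    "tenure": [1, 34],
    "MonthlyCharges": [29.85, 56.95],
    "TotalCharges": [29.85, 1889.50],
})

ok = check_clean(toy, ["tenure", "MonthlyCharges", "TotalCharges"])
print("All checks passed:", ok)
```

In the notebook itself the call would presumably be `check_clean(data, numeric_cols)`.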


Dataset Summary

In [6]:
print("\nClass Distribution (Churn):")
print(data['Churn'].value_counts())

print("\nStatistical Summary:")
print(data.describe())

# Identify numeric vs categorical
cat_cols = data.select_dtypes(include='object').columns
num_cols = data.select_dtypes(exclude='object').columns

print("\nCategorical Columns:", list(cat_cols))
print("Numeric Columns:", list(num_cols))



Class Distribution (Churn):
Churn
No     5174
Yes    1869
Name: count, dtype: int64

Statistical Summary:
       SeniorCitizen       tenure  MonthlyCharges  TotalCharges
count    7043.000000  7043.000000     7043.000000   7043.000000
mean        0.162147    32.371149       64.761692   2281.916928
std         0.368612    24.559481       30.090047   2265.270398
min         0.000000     0.000000       18.250000     18.800000
25%         0.000000     9.000000       35.500000    402.225000
50%         0.000000    29.000000       70.350000   1397.475000
75%         0.000000    55.000000       89.850000   3786.600000
max         1.000000    72.000000      118.750000   8684.800000

Categorical Columns: ['customerID', 'gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'Churn']
Numeric Columns: ['SeniorCitizen
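
The class counts above are easier to interpret as proportions. The sketch below rebuilds the counts from the printed output (5174 "No", 1869 "Yes") and computes each class's share; on the full DataFrame, `data['Churn'].value_counts(normalize=True)` would give the same shares directly. The names `churn_counts` and `churn_share` are illustrative:

```python
import pandas as pd

# Counts copied from the class distribution output above.
churn_counts = pd.Series({"No": 5174, "Yes": 1869}, name="count")

# Share of each class in the dataset.
churn_share = churn_counts / churn_counts.sum()
print(churn_share.round(3))
```

Roughly 26.5% of customers churned, so the classes are imbalanced, which is worth keeping in mind when splitting the data and choosing evaluation metrics.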