
### Telco Customer Churn Data Ingestion and Loading
#### Importing Libraries and Loading Data
The following code imports the essential libraries and loads the dataset from a local path. The dataset is stored in a pandas DataFrame called `df`.

#### Displaying Data
The following code displays the first few rows of the DataFrame, prints the DataFrame info, and displays the descriptive statistics of the DataFrame.



In [2]:
# Import essential libraries
import pandas as pd
import numpy as np

# Display all columns when viewing DataFrames
pd.set_option('display.max_columns', None)


In [None]:
# Load dataset from local path
file_path = "/home/lenix/Downloads/Telco_Customer_Churn_Dataset  (3).csv"

# Read the CSV into a pandas DataFrame
df = pd.read_csv(file_path)

# Display first few rows
df.head()


Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [4]:
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges
count,7043.0,7043.0,7043.0
mean,0.162147,32.371149,64.761692
std,0.368612,24.559481,30.090047
min,0.0,0.0,18.25
25%,0.0,9.0,35.5
50%,0.0,29.0,70.35
75%,0.0,55.0,89.85
max,1.0,72.0,118.75


In [5]:
# Loop through all columns and display their unique values
for col in df.columns:
    unique_vals = df[col].unique()
    print(f"\n{col} → {unique_vals[:10]}\n{'-'*80}")



customerID → ['7590-VHVEG' '5575-GNVDE' '3668-QPYBK' '7795-CFOCW' '9237-HQITU'
 '9305-CDSKC' '1452-KIOVK' '6713-OKOMC' '7892-POOKP' '6388-TABGU']
--------------------------------------------------------------------------------

gender → ['Female' 'Male']
--------------------------------------------------------------------------------

SeniorCitizen → [0 1]
--------------------------------------------------------------------------------

Partner → ['Yes' 'No']
--------------------------------------------------------------------------------

Dependents → ['No' 'Yes']
--------------------------------------------------------------------------------

tenure → [ 1 34  2 45  8 22 10 28 62 13]
--------------------------------------------------------------------------------

PhoneService → ['No' 'Yes']
--------------------------------------------------------------------------------

MultipleLines → ['No phone service' 'No' 'Yes']
----------------------------------------------------------------

In [6]:
internet_related = [
    'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
    'TechSupport', 'StreamingTV', 'StreamingMovies'
]

phone_related = ['MultipleLines']


In [7]:
# inspect the unique values per column
for col in df.select_dtypes(include='object'):
    print(f"\n{col} → {df[col].unique()}")



customerID → ['7590-VHVEG' '5575-GNVDE' '3668-QPYBK' ... '4801-JZAZL' '8361-LTMKD'
 '3186-AJIEK']

gender → ['Female' 'Male']

Partner → ['Yes' 'No']

Dependents → ['No' 'Yes']

PhoneService → ['No' 'Yes']

MultipleLines → ['No phone service' 'No' 'Yes']

InternetService → ['DSL' 'Fiber optic' 'No']

OnlineSecurity → ['No' 'Yes' 'No internet service']

OnlineBackup → ['Yes' 'No' 'No internet service']

DeviceProtection → ['No' 'Yes' 'No internet service']

TechSupport → ['No' 'Yes' 'No internet service']

StreamingTV → ['No' 'Yes' 'No internet service']

StreamingMovies → ['No' 'Yes' 'No internet service']

Contract → ['Month-to-month' 'One year' 'Two year']

PaperlessBilling → ['Yes' 'No']

PaymentMethod → ['Electronic check' 'Mailed check' 'Bank transfer (automatic)'
 'Credit card (automatic)']

TotalCharges → ['29.85' '1889.5' '108.15' ... '346.45' '306.6' '6844.5']

Churn → ['No' 'Yes']


In [8]:
# Before standardizing, verify that each column that depends on InternetService
# takes the value 'No internet service' only where InternetService is 'No'
cols = ['OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']

for col in cols:
    print(f"\nCross-tab for {col}:")
    print(pd.crosstab(df['InternetService'], df[col]))


Cross-tab for OnlineSecurity:
OnlineSecurity     No  No internet service   Yes
InternetService                                 
DSL              1241                    0  1180
Fiber optic      2257                    0   839
No                  0                 1526     0

Cross-tab for OnlineBackup:
OnlineBackup       No  No internet service   Yes
InternetService                                 
DSL              1335                    0  1086
Fiber optic      1753                    0  1343
No                  0                 1526     0

Cross-tab for DeviceProtection:
DeviceProtection    No  No internet service   Yes
InternetService                                  
DSL               1356                    0  1065
Fiber optic       1739                    0  1357
No                   0                 1526     0

Cross-tab for TechSupport:
TechSupport        No  No internet service   Yes
InternetService                                 
DSL              1243                    

In [9]:
pd.crosstab(df['PhoneService'], df['MultipleLines'])


MultipleLines,No,No phone service,Yes
PhoneService,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
No,0,682,0
Yes,3390,0,2971


In [None]:
# Count rows where OnlineSecurity says 'No internet service' although InternetService is not 'No' (expected: 0)
mask = ((df['OnlineSecurity'] == 'No internet service') & (df['InternetService'] != 'No'))
mask.sum()


np.int64(0)
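
The check above covers only `OnlineSecurity`. A minimal sketch of the same idea applied to several dependent columns at once is shown here on a small made-up frame. The toy rows and the helper name `count_inconsistent` are illustrative assumptions, not part of the dataset or of this notebook.

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "InternetService": ["DSL", "No", "Fiber optic", "No"],
    "OnlineSecurity": ["Yes", "No internet service", "No", "No internet service"],
    "TechSupport": ["No", "No internet service", "Yes", "No internet service"],
    "PhoneService": ["No", "Yes", "Yes", "No"],
    "MultipleLines": ["No phone service", "No", "Yes", "No phone service"],
})

def count_inconsistent(frame, parent, child, placeholder):
    """Count rows where the placeholder in `child` disagrees with `parent`.

    A row is inconsistent if the placeholder appears while the parent service
    is active, or if it is absent while the parent service is 'No'.
    """
    has_placeholder = frame[child] == placeholder
    parent_is_no = frame[parent] == "No"
    return int((has_placeholder != parent_is_no).sum())

# Hypothetical subset of the internet-related columns, kept short for the toy frame
internet_cols = ["OnlineSecurity", "TechSupport"]
report = {
    col: count_inconsistent(toy, "InternetService", col, "No internet service")
    for col in internet_cols
}
report["MultipleLines"] = count_inconsistent(
    toy, "PhoneService", "MultipleLines", "No phone service"
)
print(report)
```

On the real DataFrame, the same helper could be called with `df` and the full `internet_related` list; a non-zero count for any column would mean the placeholder cannot safely be merged into `'No'`.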

In [11]:
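# Note: a notebook cell renders only its last expression, so only the totals
# table appears in the output; wrap the others in print() or display() to see them.

# raw counts
pd.crosstab(df['InternetService'], df['OnlineSecurity'])

# row percentages (scale to percent before rounding to avoid float artifacts)
(pd.crosstab(df['InternetService'], df['OnlineSecurity'], normalize='index') * 100).round(1)

# column percentages
(pd.crosstab(df['InternetService'], df['OnlineSecurity'], normalize='columns') * 100).round(1)

# totals
pd.crosstab(df['InternetService'], df['OnlineSecurity'], margins=True)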


OnlineSecurity,No,No internet service,Yes,All
InternetService,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
DSL,1241,0,1180,2421
Fiber optic,2257,0,839,3096
No,0,1526,0,1526
All,3498,1526,2019,7043


In [12]:
# Standardize categorical responses for internet- and phone-related columns

# Define the related column groups
internet_related = [
    'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
    'TechSupport', 'StreamingTV', 'StreamingMovies'
]

phone_related = ['MultipleLines']

# Replace 'No internet service' with 'No' for internet-related columns
df[internet_related] = df[internet_related].replace('No internet service', 'No')

# Replace 'No phone service' with 'No' for phone-related columns
df[phone_related] = df[phone_related].replace('No phone service', 'No')


In [13]:
# Confirm the standardization worked
for col in internet_related + phone_related:
    print(f"{col} → {df[col].unique()}")


OnlineSecurity → ['No' 'Yes']
OnlineBackup → ['Yes' 'No']
DeviceProtection → ['No' 'Yes']
TechSupport → ['No' 'Yes']
StreamingTV → ['No' 'Yes']
StreamingMovies → ['No' 'Yes']
MultipleLines → ['No' 'Yes']
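
Checking the unique values confirms that the placeholders are gone, but not that the `'Yes'` counts were left untouched. The following is a small sketch of such a before/after comparison, run on a made-up two-column frame (the toy values are assumptions for illustration only).

```python
import pandas as pd

# Toy frame with both placeholder values (made-up data)
toy = pd.DataFrame({
    "OnlineSecurity": ["Yes", "No internet service", "No", "Yes"],
    "MultipleLines": ["No phone service", "No", "Yes", "Yes"],
})

# Count 'Yes' per column before the recode
yes_before = (toy == "Yes").sum()

# Recode both placeholders to 'No', mirroring the replacement step above
cleaned = toy.replace({"No internet service": "No", "No phone service": "No"})

# Count 'Yes' per column after the recode and collect every remaining value
yes_after = (cleaned == "Yes").sum()
remaining = set(cleaned.to_numpy().ravel())

assert yes_before.equals(yes_after)   # 'Yes' counts are unchanged
assert remaining <= {"Yes", "No"}     # only binary answers are left
```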


In [14]:
df.head(25)

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes
5,9305-CDSKC,Female,0,No,No,8,Yes,Yes,Fiber optic,No,No,Yes,No,Yes,Yes,Month-to-month,Yes,Electronic check,99.65,820.5,Yes
6,1452-KIOVK,Male,0,No,Yes,22,Yes,Yes,Fiber optic,No,Yes,No,No,Yes,No,Month-to-month,Yes,Credit card (automatic),89.1,1949.4,No
7,6713-OKOMC,Female,0,No,No,10,No,No,DSL,Yes,No,No,No,No,No,Month-to-month,No,Mailed check,29.75,301.9,No
8,7892-POOKP,Female,0,Yes,No,28,Yes,Yes,Fiber optic,No,No,Yes,Yes,Yes,Yes,Month-to-month,Yes,Electronic check,104.8,3046.05,Yes
9,6388-TABGU,Male,0,No,Yes,62,Yes,No,DSL,Yes,Yes,No,No,No,No,One year,No,Bank transfer (automatic),56.15,3487.95,No


In [15]:
# Sanity check using crosstab after standardization

# Check all internet-related columns
for col in internet_related:
    print(f"\nCross-tab for {col}:")
    print(pd.crosstab(df['InternetService'], df[col]))



Cross-tab for OnlineSecurity:
OnlineSecurity     No   Yes
InternetService            
DSL              1241  1180
Fiber optic      2257   839
No               1526     0

Cross-tab for OnlineBackup:
OnlineBackup       No   Yes
InternetService            
DSL              1335  1086
Fiber optic      1753  1343
No               1526     0

Cross-tab for DeviceProtection:
DeviceProtection    No   Yes
InternetService             
DSL               1356  1065
Fiber optic       1739  1357
No                1526     0

Cross-tab for TechSupport:
TechSupport        No   Yes
InternetService            
DSL              1243  1178
Fiber optic      2230   866
No               1526     0

Cross-tab for StreamingTV:
StreamingTV        No   Yes
InternetService            
DSL              1464   957
Fiber optic      1346  1750
No               1526     0

Cross-tab for StreamingMovies:
StreamingMovies    No   Yes
InternetService            
DSL              1440   981
Fiber optic      1345  1751
No

In [16]:
# Check phone-related column
for col in phone_related:
    print(f"\nCross-tab for {col}:")
    print(pd.crosstab(df['PhoneService'], df[col]))


Cross-tab for MultipleLines:
MultipleLines    No   Yes
PhoneService             
No              682     0
Yes            3390  2971



## Key Takeaways from Data Ingestion and Loading

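*   Data ingestion and loading were performed in this notebook.
*   The CSV file was loaded into a pandas DataFrame, `df`, with 7043 rows and 21 columns.
*   The structure was inspected with `df.info()`, `df.describe()`, and per-column unique values. `TotalCharges` holds its numbers as strings, so it is absent from the `describe()` output.
*   `'No internet service'` and `'No phone service'` were recoded to `'No'`, after cross-tabs confirmed that they occur only where `InternetService` or `PhoneService` is `'No'`.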

### Key Steps to Perform Data Cleansing in Subsequent Notebook

*   Check for any NaN/null values in the dataset.
*   Check for empty-string or space-only values across all object columns.
*   Check for total problematic entries (null + empty string).
*   Handle problematic entries.
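
The steps above can be sketched roughly as follows on a small made-up frame. The toy rows are assumptions, and the handling shown in step 4 (coercing `TotalCharges` to numeric so blanks become `NaN`) is only one possible choice, not necessarily the one taken in the subsequent notebook.

```python
import pandas as pd

# Toy frame with one whitespace-only entry and one missing entry (made-up data)
toy = pd.DataFrame({
    "customerID": ["A1", "B2", "C3", "D4"],
    "tenure": [0, 12, 5, 30],
    "TotalCharges": [" ", "350.5", None, "1200"],
})

# 1. NaN/null values per column
null_counts = toy.isna().sum()

# 2. Empty-string or whitespace-only values in text columns
#    (non-numeric columns are selected so dedicated string dtypes are covered too)
text_cols = toy.select_dtypes(exclude="number").columns
blank_counts = toy[text_cols].apply(lambda s: s.str.strip().eq("").sum())

# 3. Total problematic entries per column (null + blank)
total_problematic = null_counts.add(blank_counts, fill_value=0).astype(int)
print(total_problematic)

# 4. One possible handling (an assumption): coerce to numeric, turning blanks into NaN
total_charges_numeric = pd.to_numeric(toy["TotalCharges"], errors="coerce")
print(total_charges_numeric.isna().sum())
```

Whether the resulting `NaN` rows are then dropped or imputed depends on the analysis and is left open here.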
