# Data Cleaning and Preprocessing
This notebook documents the data cleaning and preprocessing steps performed on the **Telco Customer Churn** dataset.

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

## Load Dataset

In [2]:
# Load dataset
file_path = 'WA_Fn-UseC_-Telco-Customer-Churn.csv'
df = pd.read_csv(file_path)
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


## Inspect Unique Values in Each Column

In [3]:
# Display unique values per column
unique_values = {col: df[col].unique() for col in df.columns}
unique_values_df = pd.DataFrame({col: [unique_values[col]] for col in unique_values}).T
unique_values_df.columns = ['Unique Values']
unique_values_df

Unnamed: 0,Unique Values
customerID,"[7590-VHVEG, 5575-GNVDE, 3668-QPYBK, 7795-CFOC..."
gender,"[Female, Male]"
SeniorCitizen,"[0, 1]"
Partner,"[Yes, No]"
Dependents,"[No, Yes]"
tenure,"[1, 34, 2, 45, 8, 22, 10, 28, 62, 13, 16, 58, ..."
PhoneService,"[No, Yes]"
MultipleLines,"[No phone service, No, Yes]"
InternetService,"[DSL, Fiber optic, No]"
OnlineSecurity,"[No, Yes, No internet service]"


## Handling 'No internet service' Values
Several service columns contain the value 'No internet service', which marks customers without an internet connection. These values are replaced with 'No' so that each of these columns is binary (Yes/No). The distinction is not lost, because `InternetService` already records 'No' for the same customers.

In [4]:
# Identify internet-related columns
internet_related_columns = [
    'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
    'TechSupport', 'StreamingTV', 'StreamingMovies'
]

# Replace 'No internet service' with 'No'
df[internet_related_columns] = df[internet_related_columns].replace('No internet service', 'No')
df[internet_related_columns].head()

Unnamed: 0,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies
0,No,Yes,No,No,No,No
1,Yes,No,Yes,No,No,No
2,Yes,Yes,No,No,No,No
3,Yes,No,Yes,Yes,No,No
4,No,No,No,No,No,No
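
A quick way to confirm that the replacement left only 'Yes' and 'No' is sketched below. It runs on a small hypothetical frame (`toy`) rather than the real dataset, and the names `cleaned`, `unexpected`, and `all_binary` are illustrative only.

```python
import pandas as pd

# Toy frame standing in for two of the internet-related columns (values are illustrative)
toy = pd.DataFrame({
    "OnlineSecurity": ["No", "Yes", "No internet service"],
    "StreamingTV": ["No internet service", "No", "Yes"],
})

# Same replacement as in the cell above
cleaned = toy.replace("No internet service", "No")

# Collect any value outside the expected Yes/No pair, per column
unexpected = {
    col: sorted(set(cleaned[col]) - {"Yes", "No"})
    for col in cleaned.columns
}

# True when no column contains an unexpected value
all_binary = all(len(vals) == 0 for vals in unexpected.values())
print(all_binary)
```

On the real data, the same check could be applied to `df[internet_related_columns]`.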


## Checking for Missing Values
Missing values can affect analysis and modeling. This step identifies any missing values in the dataset.

In [5]:
# Check for missing values
missing_values = df.isnull().sum()
missing_values[missing_values > 0]

Series([], dtype: int64)
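
An empty result here does not rule out blank strings, since `isnull()` only flags true nulls. The sketch below, on a hypothetical toy column, shows one way such entries could be counted. The variable names are assumptions for illustration.

```python
import pandas as pd

# Toy column with two blank entries (a single space and an empty string)
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "1889.5", ""]})

# isnull() reports nothing, because blank strings are non-null values
null_count = int(toy["TotalCharges"].isnull().sum())

# Strip whitespace and compare with the empty string to find blank entries
blank_mask = toy["TotalCharges"].astype(str).str.strip().eq("")
blank_count = int(blank_mask.sum())

print(null_count, blank_count)
```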

## Converting TotalCharges to Numeric
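The `TotalCharges` column should be numeric, but it may contain non-numeric entries (for example, blank strings), which the missing-value check above does not report. Converting with `errors='coerce'` turns any such entry into `NaN`.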

In [6]:
# Convert TotalCharges to numeric, setting errors='coerce' to handle any issues
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df['TotalCharges'].dtype

dtype('float64')
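
To see which entries `errors='coerce'` turned into `NaN`, the raw and converted values can be compared. This is a minimal sketch on a hypothetical series (`raw`), not the notebook's data.

```python
import pandas as pd

# Toy series: two valid numbers and two entries that cannot be parsed
raw = pd.Series(["29.85", " ", "1889.5", "n/a"], name="TotalCharges")
converted = pd.to_numeric(raw, errors="coerce")

# Entries that were present before conversion but are NaN afterwards
coerced_mask = raw.notna() & converted.isna()
coerced_count = int(coerced_mask.sum())

# Show the original strings that failed to convert
print(raw[coerced_mask].tolist())
```

On the real data, this comparison would have to run before `df['TotalCharges']` is overwritten, for example by converting into a separate variable first.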

## Handling Missing Values in TotalCharges
Values that could not be converted are now `NaN`. We fill them with 0, on the assumption that they belong to new customers who have not been charged yet.

In [7]:
# Fill missing TotalCharges values with 0
# (assign the result back; calling fillna(..., inplace=True) on a column selection is deprecated)
df['TotalCharges'] = df['TotalCharges'].fillna(0)
df.isnull().sum()

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64
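
The zero-fill rests on the assumption that missing totals belong to customers with no billing history. One way to test that is to check `tenure` on the affected rows before filling. The sketch below uses a hypothetical toy frame, and `assumption_holds` is an illustrative name.

```python
import pandas as pd

# Toy frame: two rows with a missing total, both with zero tenure
toy = pd.DataFrame({
    "tenure": [0, 34, 0, 2],
    "TotalCharges": [None, 1889.5, None, 108.15],
})

# Rows whose total is missing
missing = toy["TotalCharges"].isna()

# True only if every row with a missing total also has zero tenure
assumption_holds = bool((toy.loc[missing, "tenure"] == 0).all())

# Fill only after the check, assigning the result back to the column
toy["TotalCharges"] = toy["TotalCharges"].fillna(0)
print(assumption_holds)
```

If the check failed on the real data, filling with 0 would understate charges for established customers, and another strategy might be preferable.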

## Converting SeniorCitizen to Categorical
The `SeniorCitizen` column is stored as a numeric column (0/1), but it should be treated as categorical for analysis.

In [8]:
# Convert SeniorCitizen to categorical
df['SeniorCitizen'] = df['SeniorCitizen'].map({0: 'No', 1: 'Yes'})
df['SeniorCitizen'].value_counts()

SeniorCitizen
No     5901
Yes    1142
Name: count, dtype: int64

In [9]:
# Save the processed dataset for modeling
processed_file_path = "Cleaned_Data.csv"
df.to_csv(processed_file_path, index=False)

# Provide the file for download
processed_file_path

'Cleaned_Data.csv'

# Converting Cleaned Dataset For Modeling Use
This section converts the categorical variables to binary indicators and saves the result to a new dataset for modeling purposes.

In [10]:
file_path = 'Cleaned_Data.csv'
df = pd.read_csv(file_path)

In [11]:
# Convert categorical variables into binary (one-hot encoding and label encoding where needed)

# Step 1: Convert SeniorCitizen back to binary
df['SeniorCitizen'] = df['SeniorCitizen'].map({'No': 0, 'Yes': 1})

# Step 2: Drop Unnecessary Columns
df.drop(columns=["customerID"], inplace=True)  # 'customerID' is not useful for predictions

# Step 3: Convert Binary Categorical Columns to 1/0
binary_columns = ["Partner", "Dependents", "PhoneService", "PaperlessBilling", "Churn",
                  "OnlineSecurity", "OnlineBackup", "DeviceProtection", "TechSupport",
                  "StreamingTV", "StreamingMovies", "MultipleLines"]

for col in binary_columns:
    df[col] = df[col].map({"Yes": 1, "No": 0, "No phone service": 0, "No internet service": 0})

# Step 4: One-Hot Encode Categorical Columns
df = pd.get_dummies(df, columns=["gender", "Contract", "PaymentMethod", "InternetService"], drop_first=True)
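
Two details of the encoding above are worth checking. `Series.map` with a dictionary returns `NaN` for any value that is not a key, and `drop_first=True` removes the first category of each encoded column. The sketch below illustrates both on a hypothetical toy frame. The lowercase "yes" is a deliberate typo, and `dtype=int` is an optional choice made here so the dummy columns hold 0/1 rather than booleans.

```python
import pandas as pd

toy = pd.DataFrame({
    "Partner": ["Yes", "No", "yes"],  # "yes" is a deliberate typo to show the failure mode
    "Contract": ["Month-to-month", "One year", "Two year"],
})

# Values missing from the mapping become NaN rather than raising an error
toy["Partner"] = toy["Partner"].map({"Yes": 1, "No": 0})
unmapped = int(toy["Partner"].isna().sum())

# drop_first=True drops the first category ("Month-to-month") as the baseline
encoded = pd.get_dummies(toy, columns=["Contract"], drop_first=True, dtype=int)

print(unmapped)
print(list(encoded.columns))
```

Applied to the real data, a check such as `df[binary_columns].isna().sum()` after Step 3 would reveal any unmapped values.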

In [12]:
# Save the processed dataset for modeling
processed_file_path = "Model_Data.csv"
df.to_csv(processed_file_path, index=False)

# Provide the file for download
processed_file_path

'Model_Data.csv'