# Customer Churn Analysis - Data Preprocessing

This notebook covers the data preprocessing steps for the project: loading the raw data, handling missing values, encoding categorical variables, and scaling the resulting features.





### 1. Import Necessary Libraries and Configuration

This cell imports the required libraries, loads `config.json`, and resolves the relative data paths it contains against the project root.


In [144]:
import json
import pandas as pd
import os
from sklearn.preprocessing import OneHotEncoder

# Load configuration
config_path = os.path.join(os.path.dirname(os.path.abspath('')), '..', 'config.json')
print(f"Config path: {config_path}")
with open(config_path, 'r') as f:
    config = json.load(f)

# Convert relative paths to absolute paths
project_root = os.path.dirname(os.path.dirname(os.path.abspath('')))
raw_data_path = os.path.join(project_root, config['raw_data_path'])
interim_cleaned_data_path = os.path.join(project_root, config['interim_cleaned_data_path'])
preprocessed_data_path = os.path.join(project_root, config['preprocessed_data_path'])

# Print the absolute paths for verification
print(f"Raw data path (absolute): {raw_data_path}")
print(f"Interim cleaned data path (absolute): {interim_cleaned_data_path}")
print(f"Preprocessed data path (absolute): {preprocessed_data_path}")


Config path: d:\Customer-Churn-Analysis\notebooks\..\config.json
Raw data path (absolute): d:\Customer-Churn-Analysis\data/raw/Dataset (ATS)-1.csv
Interim cleaned data path (absolute): d:\Customer-Churn-Analysis\data/interim/cleaned_dataset.csv
Preprocessed data path (absolute): d:\Customer-Churn-Analysis\Data_Preparation/preprocessed_dataset/cleaned_dataset.csv
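
The path handling above can also be wrapped in a small helper that fails early when a config key is missing. This is a minimal sketch, not the project's code: `load_paths` and the throwaway config written to a temporary directory are hypothetical names introduced only for illustration.

```python
import json
import tempfile
from pathlib import Path


def load_paths(config_file, project_root, keys):
    """Read config_file and resolve the listed keys against project_root."""
    with open(config_file, "r", encoding="utf-8") as f:
        config = json.load(f)
    missing = [k for k in keys if k not in config]
    if missing:
        raise KeyError(f"Missing config keys: {missing}")
    root = Path(project_root)
    return {k: root / config[k] for k in keys}


# Demo with a throwaway config file (hypothetical contents)
with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "config.json"
    cfg.write_text(json.dumps({"raw_data_path": "data/raw/example.csv"}))
    paths = load_paths(cfg, tmp, ["raw_data_path"])
    demo_raw_path = paths["raw_data_path"]
    print(demo_raw_path.name)
```

Using `pathlib` here also normalizes the separators, which avoids the mixed `\` and `/` visible in the printed paths.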


### 2. Load Data

This cell loads the raw data from the specified path.


In [145]:
# Load the dataset
df = pd.read_csv(raw_data_path)
print("Initial columns after loading dataset:", df.columns.tolist())
# Display the first few rows of the dataset
df.head()


Initial columns after loading dataset: ['gender', 'SeniorCitizen', 'Dependents', 'tenure', 'PhoneService', 'MultipleLines', 'InternetService', 'Contract', 'MonthlyCharges', 'Churn']


| | gender | SeniorCitizen | Dependents | tenure | PhoneService | MultipleLines | InternetService | Contract | MonthlyCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 0 | No | 1 | No | No | DSL | Month-to-month | 29.85 | No |
| 1 | Male | 0 | No | 34 | Yes | No | DSL | One year | 56.95 | No |
| 2 | Male | 0 | No | 2 | Yes | No | DSL | Month-to-month | 53.85 | Yes |
| 3 | Male | 0 | No | 45 | No | No | DSL | One year | 42.3 | No |
| 4 | Female | 0 | No | 2 | Yes | No | Fiber optic | Month-to-month | 70.7 | Yes |
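
Before choosing a cleaning strategy, it can help to see how many values are actually missing in each column. The following is a hedged sketch: `missing_report` is a hypothetical helper, and the small `sample` frame stands in for the loaded `df`.

```python
import numpy as np
import pandas as pd


def missing_report(df):
    """Return per-column missing counts and percentages, worst first."""
    counts = df.isna().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    return report.sort_values("missing", ascending=False)


# Illustrative data only; not taken from the real dataset
sample = pd.DataFrame({
    "tenure": [1, 34, np.nan, 45],
    "Contract": ["Month-to-month", "One year", "Month-to-month", None],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
})
report = missing_report(sample)
print(report)
```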


### 3. Clean Data

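In this cell, we clean the data by dropping rows with missing values and one-hot encoding the categorical columns with `OneHotEncoder`. Note that the `Churn` target is itself a text column, so it is encoded into `Churn_No` and `Churn_Yes` along with the features.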


In [146]:
def clean_data(df):
    """
    Clean the data by handling missing values and encoding categorical variables.

    Parameters:
    df (pd.DataFrame): The data to clean.

    Returns:
    pd.DataFrame: Cleaned data.
    """
    try:
        # Handle missing values by dropping rows with missing values
        df = df.dropna()
        print("Missing values handled by dropping rows with missing values.")

        # Identify and encode categorical variables
        categorical_columns = df.select_dtypes(include=['object']).columns
        print(f"Categorical columns identified: {categorical_columns}")
        if len(categorical_columns) > 0:
            # Initialize the OneHotEncoder (sparse_output requires scikit-learn >= 1.2)
            encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

            # Fit and transform the categorical columns; reuse df's index so the
            # encoded rows stay aligned after dropna() has removed index labels
            encoded_data = pd.DataFrame(
                encoder.fit_transform(df[categorical_columns]),
                columns=encoder.get_feature_names_out(categorical_columns),
                index=df.index
            )

            # Drop the original categorical columns from the DataFrame
            df = df.drop(columns=categorical_columns)

            # Concatenate the encoded data with the remaining columns
            df = pd.concat([df, encoded_data], axis=1)
            print(f"Categorical columns {list(categorical_columns)} encoded.")
        else:
            print("No categorical columns found to encode.")

        return df
    except Exception as e:
        print(f"An error occurred during data cleaning: {e}")
        # Re-raise so the caller never receives None from a failed run
        raise

# Clean the loaded data
df_cleaned = clean_data(df)
print("Columns after cleaning and encoding data:", df_cleaned.columns.tolist())
# Display the first few rows of the cleaned dataset
df_cleaned.head()

Missing values handled by dropping rows with missing values.
Categorical columns identified: Index(['gender', 'Dependents', 'PhoneService', 'MultipleLines',
       'InternetService', 'Contract', 'Churn'],
      dtype='object')
Categorical columns ['gender', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'Contract', 'Churn'] encoded.
Columns after cleaning and encoding data: ['SeniorCitizen', 'tenure', 'MonthlyCharges', 'gender_Female', 'gender_Male', 'Dependents_No', 'Dependents_Yes', 'PhoneService_No', 'PhoneService_Yes', 'MultipleLines_No', 'MultipleLines_Yes', 'InternetService_DSL', 'InternetService_Fiber optic', 'Contract_Month-to-month', 'Contract_One year', 'Contract_Two year', 'Churn_No', 'Churn_Yes']


| | SeniorCitizen | tenure | MonthlyCharges | gender_Female | gender_Male | Dependents_No | Dependents_Yes | PhoneService_No | PhoneService_Yes | MultipleLines_No | MultipleLines_Yes | InternetService_DSL | InternetService_Fiber optic | Contract_Month-to-month | Contract_One year | Contract_Two year | Churn_No | Churn_Yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 29.85 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0 | 34 | 56.95 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 2 | 0 | 2 | 53.85 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 0 | 45 | 42.3 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | 0 | 2 | 70.7 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
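
One detail worth checking whenever rows are dropped before encoding: `dropna()` leaves gaps in the index, while an encoder returns a bare array with no index at all. The sketch below, on made-up data, shows how `pd.concat` behaves in that situation. `pd.get_dummies(...).to_numpy()` is used here only as a stand-in for the array an encoder would return.

```python
import numpy as np
import pandas as pd

# Illustrative data only
raw = pd.DataFrame({
    "tenure": [1.0, np.nan, 2.0],
    "gender": ["Female", "Male", "Male"],
})
kept = raw.dropna()  # surviving index labels are [0, 2]

# Stand-in for encoder output: a plain array carrying no index information
encoded_array = pd.get_dummies(kept["gender"], dtype=float).to_numpy()
columns = ["gender_Female", "gender_Male"]

# A default RangeIndex [0, 1] does not line up with [0, 2]
misaligned = pd.concat(
    [kept[["tenure"]], pd.DataFrame(encoded_array, columns=columns)],
    axis=1,
)

# Reusing the surviving index keeps every row paired correctly
aligned = pd.concat(
    [kept[["tenure"]],
     pd.DataFrame(encoded_array, columns=columns, index=kept.index)],
    axis=1,
)

print(misaligned.shape, int(misaligned.isna().sum().sum()))
print(aligned.shape, int(aligned.isna().sum().sum()))
```

The misaligned frame gains an extra row and several `NaN` cells, which is why passing `index=` (or calling `reset_index(drop=True)` after `dropna()`) matters when the raw data does contain missing values.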


### 4. Save Cleaned Data

This cell saves the cleaned data to the specified interim and preprocessed paths.


In [147]:
# Save the cleaned dataset
df_cleaned.to_csv(interim_cleaned_data_path, index=False)
df_cleaned.to_csv(preprocessed_data_path, index=False)
print(f"Cleaned data saved to interim at {interim_cleaned_data_path}")
print(f"Cleaned data saved to preprocessed at {preprocessed_data_path}")


Cleaned data saved to interim at d:\Customer-Churn-Analysis\data/interim/cleaned_dataset.csv
Cleaned data saved to preprocessed at d:\Customer-Churn-Analysis\Data_Preparation/preprocessed_dataset/cleaned_dataset.csv
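
`DataFrame.to_csv` does not create missing directories, so the save step fails on a fresh checkout if the output folders do not exist yet. A minimal sketch of a guard follows; `save_csv` is a hypothetical helper, and the temporary directory stands in for the project root.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_csv(df, path):
    """Write df to path as CSV, creating missing parent directories first."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return path


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data" / "interim" / "cleaned_dataset.csv"
    written = save_csv(pd.DataFrame({"tenure": [1, 34]}), target)
    round_trip = pd.read_csv(written)  # read back to confirm the write
    print(round_trip.shape)
```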


## Task 2: Handle Missing Data and Encode Categorical Variables


### Handle Missing Data and Encode Categorical Variables
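In this step, we rerun the cleaning on the raw `df` with a different strategy: instead of dropping incomplete rows, `handle_missing_and_encode` imputes missing numeric values with the column mean, then encodes categorical variables using OneHotEncoder. The result replaces the `df_cleaned` produced in step 3 and leaves the dataset clean and ready for further analysis.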


In [148]:
import os
import sys

# Add the utils directory to the system path
utils_path = os.path.join(os.path.abspath(''), '..', 'utils')
if utils_path not in sys.path:
    sys.path.append(utils_path)

# Now import the handle_missing_and_encode module
from handle_missing_and_encode import handle_missing_and_encode

# Handle missing data and encode categorical variables
df_cleaned = handle_missing_and_encode(df)

# Verify the cleaned data
df_cleaned.head()

Missing data handled by mean imputation.
Categorical columns encoded: ['gender', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'Contract', 'Churn']
Data after handling missing values and encoding:



| | SeniorCitizen | tenure | MonthlyCharges | gender_Female | gender_Male | Dependents_No | Dependents_Yes | PhoneService_No | PhoneService_Yes | MultipleLines_No | MultipleLines_Yes | InternetService_DSL | InternetService_Fiber optic | Contract_Month-to-month | Contract_One year | Contract_Two year | Churn_No | Churn_Yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 29.85 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.0 | 34.0 | 56.95 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 2 | 0.0 | 2.0 | 53.85 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 0.0 | 45.0 | 42.3 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | 0.0 | 2.0 | 70.7 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
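
The source of `handle_missing_and_encode` is not shown in this notebook. As a rough illustration of the same idea (mean imputation followed by one-hot encoding), here is a pandas-only sketch. It swaps in `fillna` and `pd.get_dummies` in place of the scikit-learn classes, and `impute_and_encode` is a hypothetical name, so treat it as an approximation rather than the module's implementation.

```python
import numpy as np
import pandas as pd


def impute_and_encode(df):
    """Fill numeric gaps with column means, then one-hot encode the rest."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].mean())

    # Everything that is not numeric is treated as categorical in this sketch
    categorical_cols = list(out.select_dtypes(exclude="number").columns)
    if not categorical_cols:
        return out
    return pd.get_dummies(out, columns=categorical_cols, dtype=float)


# Illustrative data only
sample = pd.DataFrame({
    "tenure": [1.0, np.nan, 45.0],
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
})
result = impute_and_encode(sample)
print(result)
```

Unlike dropping rows, this keeps every record and preserves the original index, at the cost of pulling imputed values toward the column mean.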


### Save Preprocessed Data
After handling missing data and encoding categorical variables, we save the cleaned dataset to both interim and preprocessed paths, overwriting the files written in step 4. This gives us a checkpoint of the data before any further transformations or analyses.


In [149]:
# Save the cleaned dataset to interim and preprocessed paths
df_cleaned.to_csv(interim_cleaned_data_path, index=False)
df_cleaned.to_csv(preprocessed_data_path, index=False)

print("Cleaned data saved to interim at", interim_cleaned_data_path)
print("Cleaned data saved to preprocessed_dataset at", preprocessed_data_path)


Cleaned data saved to interim at d:\Customer-Churn-Analysis\data/interim/cleaned_dataset.csv
Cleaned data saved to preprocessed_dataset at d:\Customer-Churn-Analysis\Data_Preparation/preprocessed_dataset/cleaned_dataset.csv


## Task 3: Feature Scaling and Normalizing


### Import Libraries and Configuration
In this step, we import the necessary libraries and load the configuration file. We also convert the relative paths from the configuration file to absolute paths for easy access.


In [150]:
import json
import pandas as pd
import sys
import os
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Load configuration
config_path = os.path.join(os.path.dirname(os.path.abspath('')), '..', 'config.json')
print(f"Config path: {config_path}")
with open(config_path, 'r') as f:
    config = json.load(f)

# Convert relative paths to absolute paths
project_root = os.path.dirname(os.path.dirname(os.path.abspath('')))
raw_data_path = os.path.join(project_root, config['raw_data_path'])
interim_cleaned_data_path = os.path.join(project_root, config['interim_cleaned_data_path'])
preprocessed_data_path = os.path.join(project_root, config['preprocessed_data_path'])
standard_scaled_data_path = os.path.join(project_root, 'data_preparation/scaling_techniques/standard_scaled_dataset.csv')
min_max_scaled_data_path = os.path.join(project_root, 'data_preparation/scaling_techniques/min_max_scaled_dataset.csv')

print(f"Raw data path (absolute): {raw_data_path}")
print(f"Interim cleaned data path (absolute): {interim_cleaned_data_path}")
print(f"Preprocessed data path (absolute): {preprocessed_data_path}")
print(f"Standard scaled data path (absolute): {standard_scaled_data_path}")
print(f"Min-Max scaled data path (absolute): {min_max_scaled_data_path}")

# Ensure the utils module can be found
sys.path.append(os.path.join(project_root, 'utils'))

# Import custom modules
from data_loader import load_data
from data_cleaner import clean_data
from handle_missing_and_encode import handle_missing_and_encode
from scaler import apply_standard_scaling, apply_min_max_scaling


Config path: d:\Customer-Churn-Analysis\notebooks\..\config.json
Raw data path (absolute): d:\Customer-Churn-Analysis\data/raw/Dataset (ATS)-1.csv
Interim cleaned data path (absolute): d:\Customer-Churn-Analysis\data/interim/cleaned_dataset.csv
Preprocessed data path (absolute): d:\Customer-Churn-Analysis\Data_Preparation/preprocessed_dataset/cleaned_dataset.csv
Standard scaled data path (absolute): d:\Customer-Churn-Analysis\data_preparation/scaling_techniques/standard_scaled_dataset.csv
Min-Max scaled data path (absolute): d:\Customer-Churn-Analysis\data_preparation/scaling_techniques/min_max_scaled_dataset.csv


### Apply Standard Scaling
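In this step, we apply standard scaling to the numeric columns in the cleaned dataset. Standard scaling subtracts each column's mean and divides by its standard deviation, so every feature ends up with a mean of 0 and a standard deviation of 1. Because every column is numeric after one-hot encoding, the indicator columns and the `Churn_No`/`Churn_Yes` target columns are scaled as well, as the column list in the output shows.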


In [151]:
# Apply standard scaling to numeric columns
df_standard_scaled = apply_standard_scaling(df_cleaned)
print("Standard scaling applied.")


Numeric columns for scaling: Index(['SeniorCitizen', 'tenure', 'MonthlyCharges', 'gender_Female',
       'gender_Male', 'Dependents_No', 'Dependents_Yes', 'PhoneService_No',
       'PhoneService_Yes', 'MultipleLines_No', 'MultipleLines_Yes',
       'InternetService_DSL', 'InternetService_Fiber optic',
       'Contract_Month-to-month', 'Contract_One year', 'Contract_Two year',
       'Churn_No', 'Churn_Yes'],
      dtype='object')
Standard scaling applied.
Standard scaling applied.
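
The transformation behind standard scaling is `z = (x - mean) / std`, applied column by column. The sketch below works it out by hand on the first five `tenure` and `MonthlyCharges` values shown earlier. `standard_scale` is a hypothetical helper, not the project's `apply_standard_scaling`; it uses the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import pandas as pd


def standard_scale(df, columns):
    """Return a copy with the given columns shifted to mean 0 and scaled to std 1."""
    out = df.copy()
    for col in columns:
        mean = out[col].mean()
        std = out[col].std(ddof=0)  # population standard deviation
        # A constant column has std 0; center it instead of dividing by zero
        out[col] = (out[col] - mean) / std if std > 0 else out[col] - mean
    return out


sample = pd.DataFrame({
    "tenure": [1.0, 34.0, 2.0, 45.0, 2.0],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 70.70],
})
scaled = standard_scale(sample, ["tenure", "MonthlyCharges"])
print(scaled.mean().abs().round(6).tolist())
print(scaled.std(ddof=0).round(6).tolist())
```

Passing an explicit column list makes it easy to scale only the continuous features and leave indicator and target columns untouched.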


### Apply Min-Max Scaling
Here, we apply min-max scaling to the numeric columns in the cleaned dataset. Min-max scaling linearly rescales each feature to a fixed range, [0, 1] by default, by mapping the column minimum to 0 and the column maximum to 1.


In [152]:
# Apply min-max scaling to numeric columns
df_min_max_scaled = apply_min_max_scaling(df_cleaned)
print("Min-Max scaling applied.")


Numeric columns for scaling: Index(['SeniorCitizen', 'tenure', 'MonthlyCharges', 'gender_Female',
       'gender_Male', 'Dependents_No', 'Dependents_Yes', 'PhoneService_No',
       'PhoneService_Yes', 'MultipleLines_No', 'MultipleLines_Yes',
       'InternetService_DSL', 'InternetService_Fiber optic',
       'Contract_Month-to-month', 'Contract_One year', 'Contract_Two year',
       'Churn_No', 'Churn_Yes'],
      dtype='object')
Min-Max scaling applied.
Min-Max scaling applied.
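
Min-max scaling computes `(x - min) / (max - min)` for each column. The sketch below applies it to the first five `tenure` values. The bounds 0 and 72 are an assumption: they are not stated anywhere in this notebook, but they are consistent with the scaled values displayed at the end of it (for example, a tenure of 1 appearing as 0.013889). `min_max_scale` is a hypothetical helper, not the project's `apply_min_max_scaling`.

```python
import pandas as pd


def min_max_scale(series, lower=None, upper=None):
    """Map a series onto [0, 1]; bounds default to the observed min and max."""
    lower = series.min() if lower is None else lower
    upper = series.max() if upper is None else upper
    if upper == lower:
        return series * 0.0  # constant column: there is no spread to rescale
    return (series - lower) / (upper - lower)


tenure = pd.Series([1, 34, 2, 45, 2], name="tenure")
# Bounds 0 and 72 are assumed, not taken from the dataset itself
scaled_tenure = min_max_scale(tenure, lower=0, upper=72).round(6)
print(scaled_tenure.tolist())  # [0.013889, 0.472222, 0.027778, 0.625, 0.027778]
```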


### Save Scaled Datasets and Verify
In this final step, we save the scaled datasets to their respective paths and display the first few rows to verify the transformations.


In [153]:
# Save the scaled datasets
df_standard_scaled.to_csv(standard_scaled_data_path, index=False)
df_min_max_scaled.to_csv(min_max_scaled_data_path, index=False)
print(f"Standard scaled data saved at {standard_scaled_data_path}")
print(f"Min-Max scaled data saved at {min_max_scaled_data_path}")

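# Display the first few rows of the standard scaled dataset
# Note: this head() is not rendered, because a notebook cell only displays the
# value of its last expression. Wrap it in display() to show both tables.
print("First few rows of the standard scaled dataset:")
df_standard_scaled.head()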

# Display the first few rows of the min-max scaled dataset
print("First few rows of the min-max scaled dataset:")
df_min_max_scaled.head()


Standard scaled data saved at d:\Customer-Churn-Analysis\data_preparation/scaling_techniques/standard_scaled_dataset.csv
Min-Max scaled data saved at d:\Customer-Churn-Analysis\data_preparation/scaling_techniques/min_max_scaled_dataset.csv
First few rows of the standard scaled dataset:
First few rows of the min-max scaled dataset:


| | SeniorCitizen | tenure | MonthlyCharges | gender_Female | gender_Male | Dependents_No | Dependents_Yes | PhoneService_No | PhoneService_Yes | MultipleLines_No | MultipleLines_Yes | InternetService_DSL | InternetService_Fiber optic | Contract_Month-to-month | Contract_One year | Contract_Two year | Churn_No | Churn_Yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.013889 | 0.115423 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.0 | 0.472222 | 0.385075 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 2 | 0.0 | 0.027778 | 0.354229 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 0.0 | 0.625 | 0.239303 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | 0.0 | 0.027778 | 0.521891 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |


# Summary of Data Preprocessing
## Overview
In this notebook, we performed data preprocessing for the Customer Churn Analysis project. The preprocessing steps included data loading, cleaning, handling missing values, encoding categorical variables, and applying feature scaling and normalization.

### Task 1: Load Data
- Objective: Load the raw dataset from the specified path.
- Process: We used pandas to read the CSV file and displayed the initial columns and first few rows to verify the data loading process.
- Result: Successfully loaded the raw dataset with the following columns: `['gender', 'SeniorCitizen', 'Dependents', 'tenure', 'PhoneService', 'MultipleLines', 'InternetService', 'Contract', 'MonthlyCharges', 'Churn']`.

### Task 2: Handle Missing Data and Encode Categorical Variables
- Objective: Handle any missing data and encode categorical variables for further analysis.
- Process:
  - Handle Missing Data: Used the SimpleImputer to impute mean values for numeric columns.
  - Encode Categorical Variables: Applied OneHotEncoder to convert categorical variables into one-hot encoded format.
- Result: The cleaned dataset was saved, and the categorical columns were successfully encoded as one-hot vectors. The dataset now contains the columns `['SeniorCitizen', 'tenure', 'MonthlyCharges', 'gender_Female', 'gender_Male', 'Dependents_No', 'Dependents_Yes', 'PhoneService_No', 'PhoneService_Yes', 'MultipleLines_No', 'MultipleLines_Yes', 'InternetService_DSL', 'InternetService_Fiber optic', 'Contract_Month-to-month', 'Contract_One year', 'Contract_Two year', 'Churn_No', 'Churn_Yes']`.

### Task 3: Feature Scaling and Normalizing
- Objective: Apply standard scaling and min-max scaling to the numeric columns in the dataset.
- Process:
  - Standard Scaling: Used StandardScaler to transform the features such that they have a mean of 0 and a standard deviation of 1.
  - Min-Max Scaling: Applied MinMaxScaler to scale the features to a range between 0 and 1.
- Result: Both scaled datasets were saved to their respective paths, and the first few rows of the min-max scaled dataset were displayed for verification.

### Results Obtained
- Initial Data Load: Successfully loaded the dataset with essential customer information.
- Cleaned and Encoded Data: Handled missing values and encoded categorical variables into a suitable format for analysis.
- Scaled Data: Applied standard scaling and min-max scaling so that features are on comparable scales for modeling.

### Next Steps
1. Exploratory Data Analysis (EDA): Perform EDA to visualize and understand the relationships within the dataset.
2. Feature Engineering: Create new features that might help improve the model performance.
3. Modeling: Build and evaluate machine learning models to predict customer churn.

With these steps complete, the dataset is cleaned, encoded, and available in both standard scaled and min-max scaled forms for the analysis and modeling stages that follow.
